HyperAI

MLDR Multilingual Long-Document Retrieval Dataset

Date

a month ago

Size

9.3 GB

Publish URL

huggingface.co

MLDR (Multilingual Long-Document Retrieval) is a multilingual long-document retrieval dataset built from Wikipedia, Wudao, and the multilingual mC4 corpus. It is intended to support research and development on cross-language long-text retrieval tasks. It covers 13 typologically diverse languages: Arabic (ar), German (de), English (en), Spanish (es), French (fr), Hindi (hi), Italian (it), Japanese (ja), Korean (ko), Portuguese (pt), Russian (ru), Thai (th), and Chinese (zh).

Features and advantages:

  • Wide multilingual coverage: It includes 13 languages spanning multiple language families (such as Indo-European, Sino-Tibetan, and Afro-Asiatic).
  • Long documents: The average document length is 4,737 words, which matches the long-text processing needs of real-world scenarios.
  • Standardized construction: Queries are generated with GPT-3.5 so that each query is strongly relevant to the content of its document.
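
To make the language coverage concrete, the following is a minimal sketch that groups the 13 language codes listed above by language family. The family labels are an assumption added here for illustration (they follow the usual linguistic classification and are not part of the dataset), and `group_by_family` is a hypothetical helper name.

```python
from collections import defaultdict

# Language codes and names exactly as listed in the dataset description.
LANGUAGES = {
    "ar": "Arabic", "de": "German", "en": "English", "es": "Spanish",
    "fr": "French", "hi": "Hindi", "it": "Italian", "ja": "Japanese",
    "ko": "Korean", "pt": "Portuguese", "ru": "Russian", "th": "Thai",
    "zh": "Chinese",
}

# ASSUMPTION: family labels are not provided by the dataset; they are
# added here only to illustrate the typological spread.
FAMILY = {
    "ar": "Afro-Asiatic",
    "de": "Indo-European", "en": "Indo-European", "es": "Indo-European",
    "fr": "Indo-European", "hi": "Indo-European", "it": "Indo-European",
    "pt": "Indo-European", "ru": "Indo-European",
    "ja": "Japonic",
    "ko": "Koreanic",
    "th": "Kra-Dai",
    "zh": "Sino-Tibetan",
}


def group_by_family(families):
    """Map each family name to the sorted list of language codes in it."""
    grouped = defaultdict(list)
    for code, family in families.items():
        grouped[family].append(code)
    return {family: sorted(codes) for family, codes in grouped.items()}


if __name__ == "__main__":
    for family, codes in sorted(group_by_family(FAMILY).items()):
        names = ", ".join(f"{LANGUAGES[c]} ({c})" for c in codes)
        print(f"{family}: {names}")
```

Under this classification, eight of the 13 languages are Indo-European and the remaining five each belong to a different family.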
MLDR.torrent
  • MLDR/
    • README.md (1.62 KB)
    • README.txt (3.24 KB)
    • data/
      • MLDR.zip (9.3 GB)
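
Because the archive is large, it can be useful to inspect its contents before extracting anything. The following is a minimal sketch using Python's standard `zipfile` module; the local path `MLDR/data/MLDR.zip` is an assumption based on the file tree above, and `list_archive` is a hypothetical helper name. The internal layout of the archive is not documented here, so the sketch only lists whatever it finds.

```python
import zipfile
from pathlib import Path


def list_archive(path_or_file):
    """Return (name, uncompressed size in bytes) for every file in a ZIP."""
    with zipfile.ZipFile(path_or_file) as archive:
        return [
            (info.filename, info.file_size)
            for info in archive.infolist()
            if not info.is_dir()  # skip directory entries
        ]


if __name__ == "__main__":
    # ASSUMPTION: the download was saved under this relative path.
    archive_path = Path("MLDR/data/MLDR.zip")
    if archive_path.exists():
        for name, size in list_archive(archive_path):
            print(f"{size:>15,d}  {name}")
    else:
        print(f"{archive_path} not found; adjust the path to your download.")
```

Listing reads only the archive's central directory, so it stays fast even for a multi-gigabyte file.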