ApolloCorpora Multilingual Medical Dataset
ApolloCorpora is a multilingual medical dataset jointly built by the Shenzhen Big Data Research Institute and a research team at the Chinese University of Hong Kong. It covers six major languages spoken by a combined 6.1 billion people worldwide: English, Chinese, Hindi, Spanish, French, and Arabic.
The data is collected from books, clinical guidelines, encyclopedias, papers, forums, and exams. During processing, the researchers rewrite the raw pre-training corpus as question-answer pairs to strengthen a model's medical capabilities. ApolloCorpora also captures localized features such as symptom diagnosis, drug names, communication terms, and medical practice standards, so that models can adapt to different cultures and medical systems. The dataset provides a foundation for developing and evaluating multilingual medical AI models and supports the global application of medical AI technology.