Meta-rater: A Multi-dimensional Pre-training Data Selection Framework
Meta-rater is a multi-dimensional data selection method for pre-training language models, proposed by Shanghai Artificial Intelligence Laboratory and East China Normal University on June 4, 2025. It integrates four quality dimensions (professionalism, readability, reasoning, and cleanliness) with existing quality indicators by learning optimal weights for combining them. The method is described in the paper "Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models", which won the ACL 2025 Best Theme Paper Award.
Meta-rater uses proxy (surrogate) models to train a regression model that predicts validation-set loss for a given combination of quality-score weights, and then uses that regression model to identify the optimal combination. Experimental results show that Meta-rater can triple the convergence speed of a 1.3-billion-parameter model and improve downstream task performance by 3.23%. The advantage also carries over to a 7.2-billion-parameter model.
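The weight-search idea described above can be illustrated with a small Python sketch. It is only a toy under stated assumptions, not the authors' implementation. The function `simulate_proxy_loss` is a hypothetical stand-in for training a proxy model and measuring its validation loss, `TOY_OPTIMUM` is a made-up value, and a plain least-squares fit on quadratic features replaces whatever regressor the paper actually uses. All names and parameters here are illustrative.

```python
import numpy as np

# Made-up "best" weighting of the four quality dimensions, used only so the
# toy loss has a known minimum. It is NOT a result from the paper.
TOY_OPTIMUM = np.array([0.1, 0.2, 0.4, 0.3])


def simulate_proxy_loss(weights, rng):
    """Hypothetical stand-in for: select data with these weights, train a
    small proxy model on it, and return its validation loss."""
    return 3.0 + np.sum((weights - TOY_OPTIMUM) ** 2) + rng.normal(0.0, 0.001)


def features(weights):
    """Quadratic features (weights and squared weights) for the regressor."""
    w = np.atleast_2d(weights)
    return np.hstack([w, w ** 2])


def fit_and_search(n_proxy=64, n_candidates=20000, n_raters=4, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Sample weight combinations on the simplex (non-negative, sum to 1).
    train_w = rng.dirichlet(np.ones(n_raters), size=n_proxy)
    # 2. Obtain one (simulated) proxy validation loss per combination.
    losses = np.array([simulate_proxy_loss(w, rng) for w in train_w])
    # 3. Fit a regression model mapping weights to validation loss.
    coef, *_ = np.linalg.lstsq(features(train_w), losses, rcond=None)
    # 4. Score many candidate combinations and keep the lowest predicted loss.
    candidates = rng.dirichlet(np.ones(n_raters), size=n_candidates)
    predicted = features(candidates) @ coef
    return candidates[np.argmin(predicted)]


def select_top(scores, weights, keep_fraction):
    """Rank documents by the weighted sum of their quality scores and return
    the indices of the top fraction, best first."""
    combined = scores @ weights
    k = max(1, int(len(combined) * keep_fraction))
    return np.argsort(combined)[::-1][:k]


if __name__ == "__main__":
    best = fit_and_search()
    print("estimated weights:", np.round(best, 3))
```

In this sketch the only expensive step is the second one, which a real pipeline would replace with actual proxy-model training runs. The regression model then lets many candidate weightings be compared without training a model for each.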