HyperAI超神経

Meta's Secret Ablation Experiments Show How Specific Data Boosts AI Model Performance

18 days ago

A high-profile lawsuit has shed light on Meta's secret experiments in artificial intelligence (AI), particularly the training data for its Llama models. Internal documents reveal that Meta researchers used a technique called "ablation," which involves removing or substituting parts of the training data to measure the effect on model performance.

In legal filings from January 2025, Meta disclosed that it had replaced some training data with pirated books from the LibGen database and then retrained the Llama model. In one experiment, Meta added science, technology, and fiction books to the training data; in another, it added only fiction books. Internal documents show significant gains on industry benchmarks: the model's score on the BoolQ benchmark improved by 4.5% after adding the science, technology, and fiction books, and by 6% with fiction books alone. On the SIQA benchmark, the boost reached 5.5%.

Ablation is a common technique at Meta and other AI companies for optimizing models. A Meta engineer mentioned on LinkedIn that the company ran more than 100 ablation experiments during the development of Llama 4 and its earlier versions. Yet Meta has never publicly shared the results of these experiments, even for its open-source models. That secrecy is mirrored across the AI industry, where information about data sources is typically kept confidential.

The secrecy underscores the tech industry's concerns about copyright. Revealing the measured value of specific training datasets could invite compensation demands from copyright holders. Nick Vincent, an assistant professor of computing science at Simon Fraser University, noted, "Disclosing these data value estimates could influence large tech companies' positions in U.S. copyright lawsuits."
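The article describes ablation as removing or substituting a slice of training data, retraining, and re-measuring benchmark scores. A minimal sketch of that loop, using a toy word-count "model" and invented source tags, labels, and examples in place of an LLM and real benchmarks like BoolQ:

```python
from collections import Counter

def train(examples):
    """Toy stand-in for model training: per-label word counts."""
    model = {}
    for label, text in examples:
        model.setdefault(label, Counter()).update(text.lower().split())
    return model

def predict(model, text):
    """Score each label by summed word-frequency overlap; pick the best."""
    words = text.lower().split()
    return max(model, key=lambda lbl: sum(model[lbl][w] for w in words))

def accuracy(model, benchmark):
    """Fraction of held-out (label, text) pairs classified correctly."""
    return sum(predict(model, t) == lbl for lbl, t in benchmark) / len(benchmark)

def ablate(corpus, benchmark):
    """Retrain with each tagged data slice removed; report the accuracy drop."""
    full = accuracy(train([(lbl, txt) for _, lbl, txt in corpus]), benchmark)
    deltas = {}
    for source in sorted({src for src, _, _ in corpus}):
        reduced = [(lbl, txt) for src, lbl, txt in corpus if src != source]
        deltas[source] = full - accuracy(train(reduced), benchmark)
    return full, deltas

# Hypothetical corpus tagged by data source, plus a tiny held-out benchmark.
corpus = [
    ("web",     "science", "the experiment measured the reaction"),
    ("fiction", "story",   "the dragon flew over the castle"),
    ("fiction", "story",   "a knight rode to the castle"),
]
benchmark = [
    ("story",   "the castle and the dragon"),
    ("science", "the reaction experiment"),
]

full_score, deltas = ablate(corpus, benchmark)
print(f"full-corpus accuracy: {full_score:.2f}")
for source, delta in deltas.items():
    print(f"removing {source!r} costs {delta:+.2f} accuracy")
```

The per-source accuracy delta is the "data value estimate" the article refers to; at Meta's scale the same loop would swap a full retraining run and a real benchmark suite into `train` and `accuracy`.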
Meta spokespeople have said the company disagrees with the lawsuit's claims and remains committed to the healthy development of generative AI.

The industry's past practices were more transparent. In 2017, Google detailed its use of about 40,000 sentences from the Wall Street Journal in a groundbreaking study, and OpenAI's GPT-2 paper described how it collected web content via links posted on Reddit. Today, companies generally avoid naming specific data sources: when releasing Llama 4, Meta said only vaguely that the dataset came from publicly available and licensed data and from its own products and services.

Bill Gross, CEO of ProRata, argues that this secrecy is unfair to content creators. He believes creators should be compensated each time their data is used for AI training and whenever a model generates content influenced by that data, a sentiment that is spreading among industry professionals calling for a more equitable system of data usage and compensation.

The experiments highlight how much specific data can improve AI model performance, and they also spark broader discussions about copyright and data rights. Vincent hopes that releasing such experimental results will help create a more transparent and just framework for data use. After all, every AI product is built on the foundation of human-generated content and knowledge; without that data, it would be difficult for models to reach their current levels of performance. Ensuring the sustainability of knowledge creation and sharing, he emphasizes, is crucial.

Meta, one of the world's leading tech companies, has made significant market inroads with its Llama series. Industry experts agree that these confidential experimental results could offer valuable insights to other companies, but they also raise ethical questions about copyright and data rights.
Meta’s secret experiments not only demonstrate innovative techniques in AI model development but also provide a new perspective on addressing copyright issues. The push for greater transparency and fairness in data use is gaining momentum, with the aim of protecting the rights of content creators while advancing the field of AI.
