Model Collapse
Model collapse is a problem in artificial intelligence, particularly in the training of machine learning and deep learning models. It describes the situation in which a model begins to generate data that drifts far from the real data distribution during training; the model's performance then degrades sharply, and its output eventually becomes meaningless.
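The dynamic is easy to reproduce in miniature. The sketch below is illustrative rather than drawn from any cited paper: it treats a one-dimensional Gaussian as the "model", fits it to data, samples fresh data from the fit, and repeats. Because each fit is made from finitely many samples, estimation error compounds across generations and the fitted distribution drifts away from the real one.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Real" distribution: a standard normal. The "model" is just a fitted
# Gaussian, so one training generation is a mean/std estimate.
# Sample size and generation count are arbitrary illustrative choices.
data = rng.normal(loc=0.0, scale=1.0, size=100)

for generation in range(1, 101):
    mu, sigma = data.mean(), data.std()     # "train" on the current data
    data = rng.normal(mu, sigma, size=100)  # replace it with model samples
    if generation % 20 == 0:
        print(f"gen {generation:3d}: mu={mu:+.3f}, sigma={sigma:.3f}")
```

Run for enough generations, the fitted standard deviation shrinks toward zero: the model's samples become ever narrower and less representative of the original data, a miniature analogue of collapse.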
The concept of model collapse received considerable attention in 2024, especially in connection with the training of large language models (LLMs). The paper "Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data", published in the ICML 2024 Workshop on Foundation Models in the Wild, explored the problem through experimental and theoretical analysis and proposed a strategy for avoiding collapse by accumulating data. The paper observes that when a model is trained on its own generated data, its performance gradually deteriorates until the model becomes useless. The researchers verified experimentally that replacing the original real data with each generation's synthetic data does cause the model to collapse. They then showed that collapse can be avoided by accumulating successive generations of synthetic data alongside the original real data, and that these results hold across a range of model sizes, architectures, and hyperparameters.
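The contrast between the two regimes can be sketched on the same toy Gaussian setup (an illustrative stand-in, not the paper's actual models or datasets): "replace" discards the previous data each generation, while "accumulate" keeps the original real data together with every synthetic generation.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200  # samples per generation (illustrative choice)

def final_sigma(strategy: str, generations: int = 100) -> float:
    pool = rng.normal(0.0, 1.0, size=N)  # original real data
    for _ in range(generations):
        mu, sigma = pool.mean(), pool.std()        # fit the toy "model"
        synthetic = rng.normal(mu, sigma, size=N)  # sample a new generation
        if strategy == "replace":
            pool = synthetic                       # keep only the newest data
        else:                                      # "accumulate"
            pool = np.concatenate([pool, synthetic])  # real + all generations
    return float(pool.std())

print("replace:    sigma =", final_sigma("replace"))     # drifts toward 0
print("accumulate: sigma =", final_sigma("accumulate"))  # stays near 1
```

In the replace regime the fitted standard deviation decays across generations, while the accumulated pool stays anchored by the real data it retains, mirroring the paper's qualitative finding in a much simpler setting.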