Glitch Token

Glitch tokens are anomalous tokens in a large language model's vocabulary that, rather than helping the model run smoothly, cause it to produce erroneous or inconsistent output. A research team from Huazhong University of Science and Technology, Nanyang Technological University, and other universities recently published the study "Glitch Tokens in Large Language Models", showing that such faulty tokens exist in large models and lead to errors or inconsistencies in model output. The team's detection method offers meaningful insight into reducing tokenizer-related errors in large models. In their research, they found that glitch tokens cluster in the embedding space, which inspired them to identify glitch tokens with clustering algorithms, as sketched below.
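The following is a minimal sketch of the clustering idea, not the paper's actual implementation: since glitch tokens tend to cluster in embedding space, candidates can be surfaced by clustering a model's token embeddings and inspecting unusually small or tight clusters. The model name, cluster count, and inspection heuristic here are illustrative assumptions.

```python
# Hypothetical sketch: cluster token embeddings to surface glitch-token
# candidates. Not the paper's method; model and parameters are assumptions.
import numpy as np
from sklearn.cluster import KMeans
from transformers import AutoModel, AutoTokenizer

model_name = "gpt2"  # assumption: any LM with an accessible embedding matrix
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Token embedding matrix: one row per vocabulary entry.
embeddings = model.get_input_embeddings().weight.detach().numpy()

# Cluster the vocabulary; glitch-token candidates tend to land together.
n_clusters = 50  # illustrative choice
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(embeddings)

# Inspect the smallest cluster as a crude candidate pool.
sizes = np.bincount(kmeans.labels_, minlength=n_clusters)
smallest = int(np.argmin(sizes))
candidate_ids = np.where(kmeans.labels_ == smallest)[0][:20]
print([tokenizer.decode([int(i)]) for i in candidate_ids])
```

Candidates surfaced this way are only suspects; they still need to be verified behaviorally, since ordinary rare tokens can also fall into small clusters.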

Glitch tokens may be generated for the following reasons:

  1. Data issues: Errors, noise, or inconsistencies in the training data may cause the model to learn incorrect information.
  2. Model architecture issues: Deficiencies or limitations in the model architecture may lead to the generation of glitch tokens.
  3. Overfitting: The model overfits the training data and performs poorly on new data.
  4. Problems with the training process: For example, an inappropriate learning rate or number of training epochs.
  5. Data augmentation problems: Inappropriate data augmentation methods may introduce errors.
  6. Hardware failures or errors: A hardware problem may occur during computation.
  7. Algorithm errors: Errors in the model's implementation.
  8. Model size issues: A model that is too large or too small may affect performance.
  9. Data distribution skew: The distribution of real-world data differs from that of the training data.
  10. Insufficient training data: May lead to inadequate model learning.
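Once a candidate has been surfaced (for example, by the clustering sketch above), a common verification step in the glitch-token literature is a repetition test: ask the model to repeat the token's text, since glitch tokens are often not echoed back faithfully. The sketch below assumes this repetition-test idea in general; the prompt wording, model, and helper function are illustrative, not the paper's exact harness.

```python
# Hypothetical repetition probe: a token that the model cannot repeat
# faithfully is a glitch-token suspect. Prompt and model are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative; glitch tokens are model-specific
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def repeats_faithfully(token_text: str) -> bool:
    """Return True if the model echoes token_text in a repeat-after-me prompt."""
    prompt = f'Please repeat the string "{token_text}" back to me: "'
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(
        **inputs,
        max_new_tokens=10,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Decode only the newly generated portion of the output.
    completion = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:])
    return token_text in completion

print(repeats_faithfully("hello"))  # an ordinary token should pass this test
```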