Token Preference Optimization
Token Preference Optimization (TPO) is a method proposed by Alibaba Group and Mohamed bin Zayed University of Artificial Intelligence in January 2025 to reduce hallucination in large vision-language models (LVLMs). It was introduced in the paper "Token Preference Optimization with Self-Calibrated Visual-Anchored Rewards for Hallucination Mitigation".
TPO introduces a self-calibrated visual-anchored reward mechanism to achieve token-level distribution correction without fine-grained manual annotation, steering the model to attend more closely to visual information and thereby reducing hallucination. It automatically identifies "visual-anchored tokens" that are highly correlated with the input visual embeddings and adaptively assigns rewards according to each token's dependence on visual information. Compared with traditional sentence-level rewards, TPO adjusts the generated content at a finer granularity, which mitigates hallucination more effectively.
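To make the idea concrete, the following PyTorch-style sketch illustrates one way token-level, visually weighted preference optimization could look. It is not the authors' implementation: the function names are hypothetical, and using a per-token KL divergence between predictions with and without the image as the visual-dependence score is an illustrative assumption, as is the DPO-style form of the preference loss.

```python
import torch
import torch.nn.functional as F

def visual_dependence_scores(logits_with_image, logits_without_image):
    """Illustrative per-token score of how much each token depends on the image.

    Assumption: tokens whose next-token distribution shifts strongly when the
    visual input is removed are treated as "visual-anchored" and get larger
    weights. Shapes: (batch, seq_len, vocab) -> (batch, seq_len).
    """
    p = F.softmax(logits_with_image, dim=-1)
    q = F.softmax(logits_without_image, dim=-1)
    # Per-position KL divergence as a proxy for visual dependence.
    kl = (p * (p.clamp_min(1e-9).log() - q.clamp_min(1e-9).log())).sum(-1)
    # Normalize to [0, 1] so the scores act as soft token-level weights.
    return kl / (kl.max() + 1e-9)

def token_weighted_preference_loss(logp_chosen, logp_rejected,
                                   ref_logp_chosen, ref_logp_rejected,
                                   weights_chosen, weights_rejected,
                                   beta=0.1):
    """DPO-style preference loss with per-token visual-anchored weights.

    logp_* are per-token log-probabilities of the chosen/rejected responses
    under the policy, ref_logp_* under a frozen reference model; all have
    shape (batch, seq_len). This sequence-level reduction is a sketch, not
    the paper's exact objective.
    """
    # Weight each token's log-ratio by its visual dependence before summing,
    # so visually grounded tokens dominate the sequence-level reward.
    r_chosen = (weights_chosen * (logp_chosen - ref_logp_chosen)).sum(-1)
    r_rejected = (weights_rejected * (logp_rejected - ref_logp_rejected)).sum(-1)
    return -F.logsigmoid(beta * (r_chosen - r_rejected)).mean()
```

Under these assumptions, the weighting concentrates the training signal on tokens whose predictions actually change when the image is removed, which is the intuition behind distributing rewards by visual dependence rather than scoring whole sentences uniformly.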