
Token Merging: Your ViT But Faster

Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, Judy Hoffman
Abstract

We introduce Token Merging (ToMe), a simple method to increase the throughput of existing ViT models without needing to train. ToMe gradually combines similar tokens in a transformer using a general and light-weight matching algorithm that is as fast as pruning while being more accurate. Off-the-shelf, ToMe can 2x the throughput of state-of-the-art ViT-L @ 512 and ViT-H @ 518 models on images and 2.2x the throughput of ViT-L on video with only a 0.2-0.3% accuracy drop in each case. ToMe can also easily be applied during training, in practice improving training speed up to 2x for MAE fine-tuning on video. Training with ToMe further minimizes the accuracy drop, leading to 2x the throughput of ViT-B on audio for only a 0.4% mAP drop. Qualitatively, we find that ToMe merges object parts into one token, even over multiple frames of video. Overall, ToMe's accuracy and speed are competitive with the state of the art on images, video, and audio.
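The core mechanism the abstract describes is a bipartite matching step that merges the r most similar token pairs in each transformer layer. The sketch below is a minimal PyTorch reconstruction of that idea, not the authors' implementation: it partitions tokens into two alternating sets, scores pairs by cosine similarity, and averages the r best-matched pairs. The plain alternating partition, the unweighted averaging, and the use of raw token features as the similarity signal are simplifying assumptions here (the paper matches on attention keys and weights merges by token size).

```python
import torch
import torch.nn.functional as F

def bipartite_soft_matching(x: torch.Tensor, r: int) -> torch.Tensor:
    """Merge the r most similar token pairs in x of shape (batch, tokens, dim).

    Minimal ToMe-style sketch: split tokens into alternating sets A and B,
    match each A token to its most similar B token by cosine similarity,
    and merge the r strongest matches by averaging.
    """
    a, b = x[:, ::2, :], x[:, 1::2, :]                # alternating partition
    d = x.shape[-1]

    # Cosine similarity between every A token and every B token.
    scores = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).transpose(-1, -2)

    best_score, match = scores.max(dim=-1)            # best B partner per A token
    order = best_score.argsort(dim=-1, descending=True)
    merged, kept = order[:, :r], order[:, r:]         # A tokens to merge / keep

    idx = lambda i: i.unsqueeze(-1).expand(-1, -1, d)
    a_kept = a.gather(1, idx(kept))
    src = a.gather(1, idx(merged))                    # features being merged away
    dst_i = match.gather(1, merged)                   # target B index per merged token

    # Average each merged A token into its matched B token.
    summed = b.clone().scatter_add_(1, idx(dst_i), src)
    counts = torch.ones(b.shape[:2], device=x.device).scatter_add_(
        1, dst_i, torch.ones_like(dst_i, dtype=torch.float))
    merged_b = summed / counts.unsqueeze(-1)

    return torch.cat([a_kept, merged_b], dim=1)       # (batch, tokens - r, dim)

# Example: merge 8 token pairs from a 197-token sequence (ViT-B/16 at 224px).
tokens = torch.randn(2, 197, 768)
out = bipartite_soft_matching(tokens, r=8)
print(out.shape)  # torch.Size([2, 189, 768])
```

In the paper, this merging step runs between the attention and MLP sub-blocks of every layer, so each layer removes r tokens and the savings compound through the network.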
