Improving Local Features with Relevant Spatial Information by Vision Transformer for Crowd Counting

Steven Q.H. Truong, Trung Bui, Chanh D. Tr. Nguyen, Dao Huu Hung, Phan Nguyen, Soan T. M. Duong, Ta Duc Huy, Nguyen H. Tran

Abstract

Vision Transformer (ViT) variants have demonstrated state-of-the-art performance on many computer vision benchmarks, including crowd counting. Although Transformer-based models have shown breakthroughs in crowd counting, existing methods have some limitations. Global embeddings extracted from ViTs do not encapsulate fine-grained local features and, thus, are prone to errors in crowded scenes with diverse human scales and densities. In this paper, we propose LoViTCrowd with the argument that LOcal features with spatial information from relevant regions, obtained via the attention mechanism of ViT, can effectively reduce the crowd counting error. To this end, we divide each image into a cell grid. We consider patches of 3 × 3 cells, in which the main parts of the human body are encapsulated and the surrounding cells provide meaningful cues for crowd estimation. ViT is adapted on each patch to employ the attention mechanism across the 3 × 3 cells to count the number of people in the central cell. The number of people in the image is obtained by summing up the counts of its non-overlapping cells. Extensive experiments on four public datasets of sparse and dense scenes, i.e., Mall, ShanghaiTech Part A, ShanghaiTech Part B, and UCF-QNRF, demonstrate our method's state-of-the-art performance. Compared to TransCrowd, LoViTCrowd reduces the root mean square error (RMSE) and the mean absolute error (MAE) by an average of 14.2% and 9.7%, respectively. The source code is available at https://github.com/nguyen1312/LoViTCrowd
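The counting scheme described in the abstract (a cell grid, a 3 × 3 patch around each cell, and a sum over non-overlapping central cells) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `count_crowd`, the `cell` size, the zero padding at image borders, the cropping of leftover pixels, and the `count_center_cell` callable (a stand-in for the adapted ViT) are all assumptions made for illustration.

```python
import numpy as np


def count_crowd(image, cell, count_center_cell):
    """Sum per-cell counts over a grid of non-overlapping cells.

    image: array of shape (H, W) or (H, W, C).
    cell: side length of one square cell in pixels (hypothetical parameter).
    count_center_cell: callable taking a (3*cell, 3*cell[, C]) patch and
        returning the estimated count for its central cell. In the paper this
        role is played by a ViT; here it is an arbitrary stand-in.
    """
    h, w = image.shape[:2]
    rows, cols = h // cell, w // cell  # assumption: leftover pixels are cropped
    cropped = image[: rows * cell, : cols * cell]

    # Assumption: zero-pad by one cell on every side so that border cells
    # also have a full 3 x 3 neighbourhood.
    pad = [(cell, cell), (cell, cell)] + [(0, 0)] * (image.ndim - 2)
    padded = np.pad(cropped, pad, mode="constant")

    total = 0.0
    for r in range(rows):
        for c in range(cols):
            # Cell (r, c) sits at the centre of this 3 x 3 patch of cells.
            patch = padded[r * cell : (r + 3) * cell, c * cell : (c + 3) * cell]
            total += float(count_center_cell(patch))
    return total
```

With a stand-in counter that simply sums the central cell of a density map, the total equals the sum of the cropped map, which is a convenient sanity check that every cell is counted exactly once.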

