
Scaling Recommender Transformers to a Billion Parameters: Yandex’s ARGUS Breakthrough in AI-Powered Music Recommendations

My name is Kirill Khrylchenko, and I lead the RecSys R&D team at Yandex. For the past five years, our team has focused on advancing transformer-based technologies within recommender systems. Recently, we reached a major milestone: deploying a new generation of transformer recommenders with up to one billion parameters, significantly improving recommendation quality across our services.

Recommender systems are essential in a world where content grows exponentially. They help users discover music, videos, products, and more, while also enabling creators to reach their audiences. At Yandex, we have integrated these models into multiple services and are seeing consistent improvements in relevance and engagement. For ML engineers, this article shares practical insights into building large-scale transformer recommenders. For users, it offers a glimpse into how their recommendations are generated.

How Recommender Systems Work

At its core, the recommendation problem is about predicting which items a user is likely to interact with. This requires machine learning models capable of handling vast amounts of unstructured data: user-item interactions, item metadata, and user behavior sequences. Neural networks, especially transformers, are well suited to this task. Unlike traditional methods that rely on manual feature engineering, transformers can learn complex patterns directly from raw data. In our case, we take user histories (sequences of interactions such as plays, likes, and skips) and encode them into vector representations.

The two-tower architecture is widely used in the retrieval stage. It encodes users and items independently into vectors and computes their similarity via dot product. A key advantage is efficiency: item vectors can be precomputed and stored in an index (e.g., HNSW), allowing fast approximate nearest-neighbor search even over massive catalogs. At scale, however, the challenge becomes how to handle long user histories and large models.

We have long believed that recommender models could be made larger, but the question remained: does size still matter? The scaling hypothesis suggests that performance improves as models grow and training data increases. This has been demonstrated in NLP with large language models, yet recommender models have remained relatively small, often with just a few million parameters. We set out to test whether scaling could deliver similar gains.

We identified four axes for scaling: embedding size, context length, training dataset size, and encoder capacity. Embeddings and datasets are already massive (some systems train on billions of examples), but we found that scaling context length and encoder capacity had been largely overlooked. Inspired by Meta’s HSTU paper, which introduced a long-history encoder with 8,000 events and 176 million parameters, we asked: why can't we build even larger models?

We realized that recommender systems were missing a key ingredient: a fundamental understanding of user behavior, not just prediction of the next action. This led us to develop ARGUS: AutoRegressive Generative User Sequential Modeling.

ARGUS: A New Paradigm

ARGUS frames recommendation as a reinforcement-learning-style problem. Instead of predicting only the next positive interaction, it models the full interaction sequence of (context, item, feedback) triples, including both positive and negative signals: likes, skips, listens, playlist additions.

We introduced two core learning tasks:

1. Next Item Prediction: predict the next item in the sequence, covering both recommended and organic interactions. This teaches the model to imitate past recommender policies.
2. Feedback Prediction: predict user reactions, such as whether they liked, skipped, or listened to a track. This builds a deep understanding of user preferences.

These tasks are trained jointly using a two-headed transformer. The model learns not only what users do, but why they do it.

To handle long histories efficiently, we developed a simplified version of ARGUS. Instead of processing each (context, item, feedback) triple as three separate tokens, we compress it into a single vector. This reduces input length and enables faster training.

We also optimized the training pipeline. Instead of running the transformer once per impression, as in traditional setups, we now process a user's entire history in a single forward pass. This yields a 10- to 100-fold speedup in training.
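To make this setup concrete, here is a minimal PyTorch sketch of what a two-headed, ARGUS-style encoder could look like. It is an illustration under stated assumptions, not Yandex's production code: the vocabulary sizes, the way context and feedback are embedded, and the exact head structure are all hypothetical. Each (context, item, feedback) triple is compressed into a single token, a causal transformer scores every position of the history in one forward pass, and two heads produce next-item and feedback predictions.

```python
import torch
import torch.nn as nn


class ArgusSketch(nn.Module):
    """Illustrative two-headed encoder over (context, item, feedback) events."""

    def __init__(self, num_items=100_000, num_contexts=16, num_feedback=4,
                 d_model=256, n_layers=4, n_heads=8, max_len=8192):
        super().__init__()
        # Each (context, item, feedback) triple becomes ONE token: the three
        # embeddings are summed, so sequence length equals the number of events.
        self.item_emb = nn.Embedding(num_items, d_model)
        self.context_emb = nn.Embedding(num_contexts, d_model)
        self.feedback_emb = nn.Embedding(num_feedback, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)

        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

        # Two heads over the same hidden states:
        #   1) next-item prediction (scored against the item embedding table),
        #   2) feedback prediction (like / skip / listen / ...).
        self.next_item_head = nn.Linear(d_model, d_model)
        self.feedback_head = nn.Linear(d_model, num_feedback)

    def forward(self, items, contexts, feedback):
        # items, contexts, feedback: [batch, seq_len] integer ids of past events.
        batch, seq_len = items.shape
        positions = torch.arange(seq_len, device=items.device)
        x = (self.item_emb(items) + self.context_emb(contexts)
             + self.feedback_emb(feedback) + self.pos_emb(positions))

        # Causal mask: a single forward pass scores every position in the
        # history, instead of re-running the model once per impression.
        causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len).to(items.device)
        h = self.encoder(x, mask=causal_mask)

        next_item_logits = self.next_item_head(h) @ self.item_emb.weight.T
        feedback_logits = self.feedback_head(h)
        return next_item_logits, feedback_logits


if __name__ == "__main__":
    model = ArgusSketch()
    items = torch.randint(0, 100_000, (2, 128))
    contexts = torch.randint(0, 16, (2, 128))
    feedback = torch.randint(0, 4, (2, 128))
    item_logits, fb_logits = model(items, contexts, feedback)
    print(item_logits.shape, fb_logits.shape)  # [2, 128, 100000], [2, 128, 4]
```

In a real retrieval setting, the full dot product against the item embedding table would typically be replaced by a sampled softmax during training and an approximate nearest-neighbor index (such as HNSW) at serving time.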
Results

We tested ARGUS on our music streaming platform using a dataset of over 300 billion listens. We trained four model sizes, ranging from 3.2 million to 1.007 billion parameters. The results were clear: larger models delivered better quality, and the relationship between model size and quality followed a linear scaling law on a log scale, just as in LLMs.

Our transformer-based model also outperformed HSTU despite having fewer parameters, suggesting that the transformer architecture, when properly scaled and trained, is highly effective for recommendations. Ablation studies confirmed that pre-training and longer fine-tuning are crucial: removing pre-training led to a significant drop in quality. We also tested longer context lengths; going from 2,000 to 8,000 events improved recommendations, especially for long-term user behavior.

Implementation and Impact

We deployed ARGUS in several ways: as a ranking feature, as a candidate generator, and in real time on smart devices. The results were impressive. In our main music service, ARGUS delivered a quality boost equivalent to all previous model iterations combined. For the "Unfamiliar" personalization setting, we saw a 12% increase in listening time and a 10% higher likelihood of likes, evidence that ARGUS excels at discovery. On smart speakers, using full user and item vectors rather than only scalar features increased the gains by 1.5 to 2 times. These results confirm that large-scale transformers are not just feasible but transformative for recommender systems.

Looking Ahead

We are just beginning. The future of recommendations lies in neural networks that understand users deeply, rather than merely predicting their next move. ARGUS is a step toward that vision. Thank you for reading.
