HyperAIHyperAI

Command Palette

Search for a command to run...

MARS: تمكين نماذج Autoregressive من توليد Multi-Token

Ziqi Jin Lei Wang Ziwei Luo Aixin Sun

الملخص

تقوم نماذج اللغة ذات التوليد الذاتي (Autoregressive - AR) بتوليد النص عبر توكن (token) واحد في كل مرة، حتى عندما تكون التوكنات المتتالية قابلة للتنبؤ بدرجة عالية بناءً على السياق السابق. في هذا البحث، نقدم MARS (Mask AutoRegreSsion)، وهي طريقة fine-tuning خفيفة الوزن تُعلم نموذج AR الذي تم ضبطه مسبقاً على التعليمات (instruction-tuned) كيفية التنبؤ بعدة tokens في عملية forward pass واحدة. لا يتطلب MARS أي تعديلات في البنية (architecture)، ولا يضيف أي parameters إضافية، وينتج عنه نموذج واحد يمكن استدعاؤه تماماً مثل نموذج AR الأصلي دون أي تدهور في الأداء.وعلى عكس تقنية speculative decoding، التي تحافظ على نموذج مسودة (draft model) منفصل إلى جانب النموذج المستهدف، أو النهج متعدد الرؤوس (multi-head) مثل Medusa، التي ترفق رؤوس تنبؤ (prediction heads) إضافية، يتطلب MARS فقط استمرار التدريب (continued training) على بيانات التعليمات الحالية. عند توليد token واحد لكل forward pass، يحقق MARS نتائج تماثل أو تتجاوز خط الأساس (baseline) لنموذج AR في ستة benchmarks قياسية. وعند السماح له بقبول عدة tokens في الخطوة الواحدة، فإنه يحافظ على دقة بمستوى الـ baseline مع تحقيق throughput أعلى بمقدار 1.5 إلى 1.7 مرة.علاوة على ذلك، قمنا بتطوير استراتيجية KV caching على مستوى الكتلة (block-level) لعمليات batch inference، مما حقق تسريعاً في الوقت الفعلي (wall-clock speedup) يصل إلى 1.71 مرة مقارنة بنموذج AR مع KV cache على نموذج Qwen2.5-7B. وأخيراً، يدعم MARS ضبط السرعة في الوقت الفعلي عبر عتبة الثقة (confidence thresholding)؛ ففي حالات ضغط الطلبات العالي، يمكن لنظام الخدمة زيادة الـ throughput فوراً دون الحاجة إلى تبديل النماذج أو إعادة تشغيلها، مما يوفر أداة تحكم عملية (latency-quality knob) لعمليات النشر (deployment).

One-sentence Summary

By utilizing a lightweight fine-tuning method that requires no architectural modifications or additional parameters, MARS enables instruction-tuned autoregressive models to perform multi-token generation through continued training, achieving up to 1.7x throughput and real-time speed adjustment via confidence thresholding.

Key Contributions

  • The paper introduces MARS (Mask AutoRegreSsion), a lightweight fine-tuning method that enables instruction-tuned autoregressive models to predict multiple tokens per forward pass without adding extra parameters or architectural modifications.
  • This method utilizes [MASK] tokens as explicit placeholders to train models to predict from incomplete context, achieving 1.5 to 1.7x throughput while maintaining baseline accuracy on six standard benchmarks.
  • The work develops a block-level KV caching strategy and a confidence-based thresholding mechanism, which together enable up to 1.71x wall-clock speedup and real-time, on-the-fly throughput adjustments during deployment.

Introduction

Autoregressive language models generate text one token at a time, which results in uniform compute costs even when subsequent tokens are highly predictable. While methods like speculative decoding and multi-head architectures attempt to accelerate this process, they often introduce significant overhead by requiring separate draft models or additional architectural parameters and heads. The authors leverage a lightweight fine-tuning method called MARS (Mask AutoRegreSsion) to enable multi-token generation without any architectural modifications or extra parameters. By closing the design gaps between standard autoregressive models and block-masked prediction, MARS allows a single model to act as a strict superset of the original, offering a practical latency-quality knob that can achieve up to 1.71x throughput speedup during deployment.

Method

The authors leverage a dual-stream training framework to integrate multi-token prediction into an existing autoregressive (AR) model while preserving its core AR functionality. The method, named MARS, begins from a pre-trained AR supervised fine-tuning (SFT) checkpoint, ensuring the model has already learned the target data distribution. This allows the training process to focus exclusively on learning the masked prediction paradigm without disrupting the model's fundamental AR capabilities.

Training and inference framework
Training and inference framework

The training process involves running two copies of a sequence in parallel through the model: a clean stream and a noisy stream. The clean stream consists of the original, unmodified tokens and is used to train the model with standard AR next-token prediction. The noisy stream is constructed by dividing the sequence into blocks of size BBB and replacing all tokens within each block with [MASK] placeholders. The model processes a concatenated input z=[x;x~]\mathbf{z} = [\mathbf{x}; \tilde{\mathbf{x}}]z=[x;x~] of length 2L2L2L, where the first LLL positions are the clean stream and the last LLL are the noisy stream. A structured attention mask ensures that the model's attention pattern remains strictly causal, enforcing the correct visibility for each position. This design inherently closes the gaps related to attention pattern and logits alignment identified in prior block-masked approaches. The attention mask M\mathbf{M}M is defined such that the clean causal case provides standard causal self-attention for the clean stream, the noisy intra-block causal case allows each noisy position to attend causally within its own block, and the cross-stream case enables each noisy block to see clean tokens from all preceding blocks, providing the necessary context for prediction. The training loss on the noisy stream is computed as the cross-entropy over the masked positions.

Inference process
Inference process

To preserve the model's AR competence, the training loss is augmented with a standard AR next-token prediction loss computed from the clean stream logits. This combined loss, L=Lmask+LAR\mathcal{L} = \mathcal{L}_{\text{mask}} + \mathcal{L}_{\text{AR}}L=Lmask+LAR, ensures that the model continuously practices standard AR prediction, counteracting the erosion of the AR signal that occurs as the block size increases. This mechanism decouples the AR signal from the block size, maintaining a high fraction of AR-equivalent signal even for large blocks. During inference, the model operates in a left-to-right sliding window fashion, closing the final gap related to generation order. At each step, BBB [MASK] tokens are appended to the current prefix, and the model runs a single forward pass to obtain logits for all BBB positions. Tokens are accepted consecutively from left to right while the maximum probability of the predicted token meets a confidence threshold τ\tauτ. The accepted tokens are added to the prefix, and new [MASK] tokens are appended to maintain the window size. This process ensures the model degrades gracefully to standard AR decoding when confidence is low, while enabling high-throughput multi-token generation when confidence is high. The confidence threshold τ\tauτ can be adjusted dynamically during serving to control the throughput-quality tradeoff.

Experiment

The experiments evaluate MARS across different model scales to validate that it preserves original autoregressive quality while enabling efficient multi-token generation. Results demonstrate that incorporating SFT loss prevents the signal decay typically seen with larger block sizes, allowing for a smooth and controllable speed-quality tradeoff via confidence thresholding. Furthermore, the implementation of a block-level KV cache ensures that these algorithmic improvements translate into significant wall-clock speedups during batch inference.

The authors evaluate MARS under one-token generation mode to assess its ability to preserve the quality of the original autoregressive (AR) model. Results show that MARS maintains or improves performance across multiple benchmarks compared to the AR SFT baseline, with significant gains observed on reasoning and coding tasks. The inclusion of the SFT loss is critical, as its absence leads to substantial performance degradation with larger block sizes. MARS maintains or improves performance over AR SFT on reasoning and coding tasks with one-token generation Without the SFT loss, larger block sizes cause significant quality degradation across benchmarks The SFT loss stabilizes performance across block sizes, ensuring consistent quality regardless of block size

MARS preserves AR quality
MARS preserves AR quality

The authors compare the training setup for MARS across two model scales, 0.5B and 7B, highlighting differences in hardware, block sizes, and the inclusion of the SFT loss. The evaluation uses greedy decoding with a maximum generation length of 256 tokens, and MARS is configured to generate one token per step at a threshold of 1.0. The 7B model uses NVIDIA H200 hardware and a block size of 4, while the 0.5B model uses 4, 8, and 16 block sizes. The SFT loss is included for the 0.5B model but excluded for the 7B model in the tested configurations. Both models use a learning rate of 5e-6 and are trained for 5 epochs per stage with a maximum sequence length of 512 for the 7B model and 48 for the 0.5B model.

MARS training setup comparison
MARS training setup comparison

The the the table compares MARS performance with and without the SFT loss across different confidence thresholds and block sizes. Results show that MARS with the SFT loss maintains higher accuracy and smoother speed-quality tradeoffs compared to the variant without the SFT loss, especially as block size increases. The presence of the SFT loss stabilizes performance across varying generation speeds. MARS with the SFT loss achieves higher accuracy than without the SFT loss across all thresholds and block sizes. The accuracy drop with multi-token generation is small and predictable when using the SFT loss. Performance degrades significantly without the SFT loss as block size increases, indicating the loss's importance for stability.

MARS speed-quality tradeoff
MARS speed-quality tradeoff

The authors compare MARS with AR SFT and Block Diffusion models, showing that MARS maintains or improves upon the original AR model's performance across multiple benchmarks. The results indicate that MARS achieves higher accuracy than AR SFT and Block Diffusion in one-token generation mode, confirming that the additional masked-prediction training enhances quality without degrading performance. MARS achieves higher accuracy than AR SFT and Block Diffusion on multiple benchmarks in one-token mode. The gains from MARS come from the masked-prediction objective, not additional training compute. Block Diffusion performs poorly, indicating that not all block prediction formulations are compatible with AR pretraining.

MARS preserves AR quality
MARS preserves AR quality

The the the table compares different generation methods across four criteria: token masking, attention pattern, logits alignment, and generation order. MARS uses causal attention and left-to-right generation similar to AR, but with masked token prediction within blocks, while Block Diffusion uses bidirectional attention and confidence-based generation. MARS uses causal attention and left-to-right generation like AR, but with masked token prediction within blocks Block Diffusion uses bidirectional intra-block attention and confidence-based generation MARS maintains right-shifted logits alignment similar to AR, unlike Block Diffusion which varies

Comparison of generation methods
Comparison of generation methods

The authors evaluate MARS across different model scales and generation modes to assess its ability to preserve or enhance the quality of original autoregressive models. Results demonstrate that MARS maintains or improves performance on reasoning and coding tasks compared to standard autoregressive baselines and alternative block prediction methods. The inclusion of the SFT loss is found to be essential for stabilizing performance and preventing quality degradation as block sizes increase.


بناء الذكاء الاصطناعي بالذكاء الاصطناعي

من الفكرة إلى الإطلاق — سرّع تطوير الذكاء الاصطناعي الخاص بك مع المساعدة البرمجية المجانية بالذكاء الاصطناعي، وبيئة جاهزة للاستخدام، وأفضل أسعار لوحدات معالجة الرسومات.

البرمجة التعاونية باستخدام الذكاء الاصطناعي
وحدات GPU جاهزة للعمل
أفضل الأسعار

HyperAI Newsletters

اشترك في آخر تحديثاتنا
سنرسل لك أحدث التحديثات الأسبوعية إلى بريدك الإلكتروني في الساعة التاسعة من صباح كل يوم اثنين
مدعوم بواسطة MailChimp