Command Palette
Search for a command to run...
K-EXAONE 기술 보고서
K-EXAONE 기술 보고서
초록
이 기술 보고서는 LG AI 연구소에서 개발한 대규모 다국어 언어 모델인 K-EXAONE을 소개한다. K-EXAONE은 총 2360억 파라미터를 갖춘 전문가 집합(Mixture-of-Experts) 아키텍처 기반으로 설계되었으며, 추론 시 230억 파라미터만 활성화된다. 모델은 최대 256K 토큰의 컨텍스트 창을 지원하며, 한국어, 영어, 스페인어, 독일어, 일본어, 베트남어 등 총 6개 언어를 커버한다. 우리는 K-EXAONE을 추론 능력, 에이전트 기반 능력, 일반적 능력, 한국어 전용 능력, 다국어 능력 등 다양한 벤치마크 테스트 세트에서 평가하였다. 이러한 평가 결과, K-EXAONE은 크기가 유사한 오픈웨이트 모델과 비슷한 성능을 보였다. K-EXAONE은 더 나은 삶을 위한 AI 발전을 목표로 설계되었으며, 다양한 산업 및 연구 응용 분야에 활용 가능한 강력한 전용 기초 AI 모델로 위치지어진다.
One-sentence Summary
LG AI Research presents K-EXAONE, a 236B-parameter Mixture-of-Experts model with 23B active parameters, supporting a 256K-token context and six languages including Korean, English, German, Japanese, and Vietnamese. It achieves competitive performance across reasoning, agentic, multilingual, and long-context benchmarks by leveraging a hybrid attention mechanism, fine-grained MoE routing with sequence-level load balancing, and a multi-stage training curriculum with FP8 precision. The model integrates a Multi-Token Prediction module for efficient inference and employs a Korea-Augmented Universal Taxonomy to ensure culturally grounded safety, positioning it as a sovereign AI foundation model for industrial and research applications.
Key Contributions
- K-EXAONE addresses South Korea's infrastructure limitations in AI hardware and data centers by leveraging government-supported resources to develop a trillion-parameter foundation model, enabling global competitive performance despite regional constraints.
- The model introduces a hybrid architecture combining reasoning and non-reasoning capabilities with a Mixture-of-Experts (MoE) design and a hybrid attention mechanism, enabling efficient long-context processing and scalable computation.
- K-EXAONE extends multilingual support to include German, Japanese, and Vietnamese, and demonstrates robust performance across benchmarks while adhering to strict AI compliance protocols to mitigate risks related to data privacy, bias, and inappropriate content.
Introduction
The development of large language models (LLMs) is increasingly driven by scaling, with open-weight models closing the performance gap with closed-source counterparts through massive parameter counts. However, countries like South Korea face significant infrastructure constraints—limited access to AI-specialized data centers and high-performance chips—hindering large-scale model development. To overcome this, the Korean government launched a strategic initiative providing critical resources, enabling LG AI Research to develop K-EXAONE, a trillion-parameter foundation model. The authors leverage a hybrid architecture combining reasoning and non-reasoning capabilities, a hybrid attention mechanism for long-context processing, and a Mixture-of-Experts (MoE) design to enable efficient, scalable computation. K-EXAONE extends multilingual support to include German, Japanese, and Vietnamese, enhancing its global applicability. A key contribution is the creation of CODEUTILITYBENCH, a real-world coding benchmark that evaluates practical coding workflows across understanding, implementation, refinement, and maintenance tasks, where K-EXAONE outperforms its predecessor. Additionally, the authors introduce KGC-SAFETY, a culturally grounded safety benchmark based on the Korea-Augmented Universal Taxonomy (K-AUT), addressing the limitations of Western-centric ethical frameworks in evaluating AI safety within Korean sociocultural contexts.
Dataset
- The dataset is composed of in-house and publicly available resources, with a focus on evaluating long-context understanding in Korean.
- Key components include KO-LONGBENCH, an in-house benchmark featuring diverse tasks such as Document QA, Story Understanding, Dialogue History Understanding, In-Context Learning, Structured QA, and RAG.
- KO-LONGBENCH is designed to assess large language models’ performance in practical, long-context scenarios and includes detailed statistics, prompt examples, and task descriptions as documented in the EXAONE 4.0 Technical Report [2].
- The authors use KO-LONGBENCH as part of their evaluation suite, applying both prompt-loose and prompt-strict metrics from IFBENCH and IFEVAL benchmarks to measure model performance.
- No explicit cropping or metadata construction is described; the focus is on leveraging existing task structures and prompt formats for consistent evaluation.
- The dataset is used in model training and evaluation workflows, with mixture ratios and split configurations derived from benchmark-specific guidelines and technical documentation.
Method
The authors leverage a Mixture-of-Experts (MoE) architecture to construct K-EXAONE, a large-scale multilingual language model with 236B total parameters, of which approximately 23B are activated during inference. The overall framework is built upon a transformer-based design, with the model composed of multiple stacked blocks that process input tokens through a sequence of transformations. As shown in the figure below, the architecture is divided into two primary components: the main model block and the Multi-Token Prediction (MTP) block, which is used during training and inference for self-drafting.
The main block, which processes the primary input sequence, begins with an embedding layer that converts input tokens into dense vectors. The first block in this stack is a dense feed-forward network (FFN) with a hidden size of 18,432, serving as a training stability mechanism. This is followed by a sequence of 12 blocks, each containing a sparse MoE layer, a global attention mechanism, and RMSNorm normalization. The MoE layer is the core computational component, where the input is routed to a subset of experts. Specifically, the model employs a fine-grained sparse MoE design with 128 experts, and for each token, the top-8 experts are selected along with a shared expert, resulting in nine concurrently active experts per routing decision. The routing mechanism is managed by a dedicated router that determines the contribution of each expert to the final output. The sparse MoE block is followed by a global attention layer, which computes attention across the entire sequence, and RMSNorm normalization to stabilize training dynamics.
In addition to the main block, the model incorporates a hybrid attention architecture that combines global attention with sliding window attention (SWA) to efficiently handle long-context inputs. The architecture includes both global attention and sliding window attention layers, which are selectively applied across different layers to balance computational cost and modeling capacity. The sliding window size is reduced to 128 tokens to minimize key-value cache usage while preserving the model's ability to capture long-range dependencies. This design enables cost-efficient long-context modeling, supporting a maximum context length of 256K tokens.
The MTP block, which is used during training and inference, is designed to enhance future-token predictive capability and improve decoding throughput. This block is structured similarly to the main block but includes a dense feed-forward network and a linear layer to predict the next token. During training, the MTP block is supervised to predict an additional future token, providing an auxiliary objective that minimizes routing overhead and memory consumption. At inference time, the MTP block enables self-drafting, allowing the model to generate multiple tokens in parallel and achieving approximately a 1.5× improvement in decoding throughput compared to standard autoregressive decoding. The MTP block also includes a shared embedding layer and RMSNorm normalization, ensuring consistency with the main model's architecture.
To further stabilize training and improve long-context extrapolation, the model incorporates architectural features such as QK Norm and SWA-only RoPE. QK Norm applies layer normalization to the query and key vectors prior to attention computation, mitigating attention logit explosion in deep networks and stabilizing training dynamics. RoPE (Rotary Positional Embeddings) are selectively applied only to the SWA layers, preventing interference with global token interactions and improving robustness to long-sequence extrapolation. These features, combined with the MoE architecture and hybrid attention mechanism, enable K-EXAONE to achieve high performance while maintaining resource efficiency.
Experiment
- Evaluated K-EXAONE across 9 benchmark categories: World Knowledge, Math, Coding/Agentic Coding, Agentic Tool Use, Instruction Following, Long Context Understanding, Korean, Multilinguality, and Safety.
- On MMLU-Pro, GPQA-DIAMOND, and HUMANITY'S LAST EXAM, K-EXAONE demonstrates strong academic knowledge and reasoning, outperforming gpt-oss-120b and Qwen3-235B-A22B-Thinking-2507 on most math benchmarks, except HMMT Nov 2025 where it trails Qwen.
- Achieved 29.0 on TERMINAL-BENCH 2.0 and 49.4 on SWE-BENCH VERIFIED, showing strong agentic coding capability; scored 73.2 on tau2-BENCH, indicating effective multi-step tool use.
- On instruction following, scored 67.3 on IFBENCH and 89.7 on IFEVAL in REASONING mode, surpassing most baselines.
- On long-context benchmarks, achieved 53.5 on AA-LCR and 52.3 on OPENAI-MRCR in REASONING mode, and 45.2 on AA-LCR and 65.9 on OPENAI-MRCR in NON-REASONING mode, demonstrating robust long-context processing.
- On Korean benchmarks: 67.3 on KMMLU-Pro, 61.8 on KoBALT, 83.9 on CLiCK, 90.9 on HRM8K, and 86.8 on Ko-LONGBENCH, showing strong multilingual and domain-specific performance.
- Achieved 85.7 on MMMLU and 90.5 on WMT24++ (average translation score), indicating balanced and high-quality multilingual capabilities.
- On safety benchmarks, achieved competitive Safe Rates on WILDJAILBREAK and KGC-SAFETY, with strong performance in universal values and social safety, and improved handling of Korean-specific cultural sensitivity and future risks.
The authors use the bar chart to compare the token efficiency of K-EXAONE and EXAONE 4.0 across different text domains, measuring bytes per token. Results show that K-EXAONE achieves higher token efficiency than EXAONE 4.0 in all categories, with the largest improvement observed in the Korean domain at +29.0%.

The authors use K-EXAONE in non-reasoning mode to evaluate its performance across multiple domains, showing strong results in world knowledge, math, coding, and safety benchmarks. K-EXAONE achieves competitive scores compared to larger models like Qwen3-235B-A22B-Instruct-2507 and DeepSeek-V3.2, particularly excelling in Korean benchmarks and safety evaluations, while maintaining efficient parameter usage with 23B activated parameters.

The authors use the KGC-SAFETY benchmark to evaluate model safety across multiple dimensions, including universal human values, social safety, Korean sensitivity, and future risk. Results show that K-EXAONE achieves the highest total score of 96.1, outperforming all compared models across all categories, particularly excelling in universal human values and social safety.

The authors use the table to compare the multilingual translation performance of K-EXAONE and EXAONE 4.0 across various language pairs. Results show that K-EXAONE achieves higher scores than EXAONE 4.0 in most translation directions, particularly in EN→KO, EN→DE, EN→ES, EN→JA, EN→VI, and KO→EN, indicating improved multilingual translation quality.

The authors use the table to present key training resources for the 236B-A23B model, indicating it was pretrained on 11 trillion tokens and required 1.52 × 10²⁴ FLOPs of computation. This data provides context for the model's scale and computational demands.
