EXAONE 4.5 Technical Report

One month ago

Eunbi Choi Kibong Choi Sehyun Chun Seokhee Hong Junwon Hwang Hyojin Jeon Ahra Jo Hyunjik Jo Yeonsik Jo Joonkee Kim

Abstract

This technical report introduces EXAONE 4.5, the first open-weight vision language model released by LG AI Research. EXAONE 4.5 integrates a dedicated visual encoder into the existing EXAONE 4.0 framework, enabling native multimodal pretraining over both visual and textual modalities. The model is trained on large-scale, carefully curated data, with particular emphasis on document-centric corpora aligned with LG's strategic application domains. This targeted data design yields substantial gains in document understanding and related tasks while also delivering broad improvements in general language capabilities. EXAONE 4.5 extends the context length up to 256K tokens, facilitating long-context reasoning and enterprise-scale use cases. In comparative evaluations, EXAONE 4.5 achieves competitive performance on general benchmarks while outperforming state-of-the-art models of similar scale in document understanding and Korean contextual reasoning. As part of LG's ongoing effort toward practical industrial deployment, EXAONE 4.5 is designed to be continuously extended to additional domains and application scenarios to advance AI for 'A Better Life'.

One-sentence Summary

LG AI Research introduces EXAONE 4.5, an open-weight vision language model that integrates a dedicated visual encoder into the EXAONE 4.0 framework for native multimodal pretraining and achieves superior performance in document understanding and Korean contextual reasoning through curated document-centric corpora, supporting six languages with a 256K token context window.

Key Contributions

- The paper introduces EXAONE 4.5, an open-weight vision language model that integrates a 1.2B-parameter visual encoder into the EXAONE 4.0 framework to enable native multimodal pretraining.
- This work implements architectural innovations including Grouped Query Attention (GQA) in the vision encoder, 2D Rotary Positional Embedding (RoPE), and a Multi-Token Prediction (MTP) module to balance computational efficiency with model capacity.
- The model reaches a stable context length of 256K tokens by integrating long-context training directly into the supervised fine-tuning stage, and it demonstrates state-of-the-art performance in document understanding, mathematical reasoning, and Korean contextual reasoning.

Introduction

As industrial AI shifts toward agentic workflows, there is an increasing need for models that can bridge the gap between advanced linguistic reasoning and visual perception. While previous iterations of the EXAONE series focused on text-based reasoning and mathematical tasks, they lacked the multimodal capabilities required for complex industrial applications like manufacturing quality control or technical blueprint analysis. The authors address this by introducing EXAONE 4.5, LG's first open-weight vision-language model. They integrate a 1.2B parameter vision encoder into the existing 32B EXAONE 4.0 framework to enable native multimodal pretraining. This architecture, combined with a focus on document-centric corpora and an extended 256K token context length, allows the model to achieve state-of-the-art performance in document understanding and Korean contextual reasoning.

Dataset

The authors construct a diverse pre-training and supervised fine-tuning (SFT) dataset designed to enhance multimodal reasoning, document understanding, and cultural nuance. The dataset composition and processing details are as follows:

Dataset Composition and Sources
- Image Caption Data: Primarily Korean-English bilingual pairs. The authors use a synthetic captioning pipeline to transform noisy web captions into semantically rich descriptions, prioritizing entity diversity and visual complexity.
- Interleaved Image-Text Data: A massive corpus of open-source and in-house web content.
- OCR and Documents: A mix of English and Korean resources at character, word, and document levels.
- Grounding and Counting: A combination of high-quality open-source sets and an in-house synthetic pipeline.
- STEM and Reasoning: Domain-specific academic content, including math, engineering, and science diagrams, retrieved via a search-based synthesis pipeline.
- Korean-Specific Data: Specialized corpora including cultural images from the Korea Tourism Organization and digital culture content from IT Donga and Game Donga.
Key Processing and Filtering Details
- Textual Filtering: A lightweight text-based classifier evaluates interleaved data based on educational quality and STEM relevance to upsample high-density information.
- Document Parsing: Charts, tables, and documents are transformed into structured formats like HTML, Markdown, and JSON to improve layout understanding.
- Spatial Grounding: Object locations are represented as normalized bounding boxes with coordinates scaled to a range of [0, 1000].
- Counting Balancing: To prevent bias toward simple categories, the authors use synthetic generation to handle occlusion and explicitly balance the dataset across different count ranges and object types.
- Text-to-Vision Augmentation: For Korean reasoning, text-based problems are converted into high-resolution rendered images to improve academic content parsing.
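The spatial grounding format above can be illustrated with a minimal sketch. It assumes pixel-space boxes in `(x_min, y_min, x_max, y_max)` order and integer rounding; the corner ordering, the clamping behaviour, and the function names `normalize_bbox` and `denormalize_bbox` are illustrative assumptions, not details confirmed by the report.

```python
def normalize_bbox(box, image_width, image_height, scale=1000):
    """Map a pixel-space (x_min, y_min, x_max, y_max) box to integers in [0, scale].

    The corner ordering and rounding are assumptions made for this sketch.
    """
    if image_width <= 0 or image_height <= 0:
        raise ValueError("image dimensions must be positive")

    def to_scale(value, extent):
        # Clamp to the image bounds, then rescale and round to the nearest integer.
        clamped = min(max(value, 0), extent)
        return round(clamped * scale / extent)

    x_min, y_min, x_max, y_max = box
    return (
        to_scale(x_min, image_width),
        to_scale(y_min, image_height),
        to_scale(x_max, image_width),
        to_scale(y_max, image_height),
    )


def denormalize_bbox(box, image_width, image_height, scale=1000):
    """Map a [0, scale] box back to pixel coordinates (floats)."""
    x_min, y_min, x_max, y_max = box
    return (
        x_min * image_width / scale,
        y_min * image_height / scale,
        x_max * image_width / scale,
        y_max * image_height / scale,
    )


# A box on an 800x1000 image becomes resolution-independent coordinates.
print(normalize_bbox((80, 100, 400, 500), 800, 1000))  # (100, 100, 500, 500)
```

Normalizing to a fixed range keeps the coordinate vocabulary identical across image sizes, which is presumably why a resolution-independent scale is used.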
Training Strategy and Usage
- Pre-training Curriculum: The authors employ a progressive curriculum starting with broad filtering for visual diversity, followed by strategic upsampling of specialized datasets to bridge performance gaps.
- SFT Framework: The supervised fine-tuning stage uses a multi-stage curriculum where data is organized by domain. The model is jointly trained on text-only and vision-language data, integrating both non-reasoning and reasoning supervision.
- Multilingual Support: The SFT mixture is designed for multilingual instruction following across English, Korean, Spanish, German, Japanese, and Vietnamese.
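The report does not publish mixture ratios, but the idea of upsampling specialized datasets in the pre-training curriculum can be sketched as follows. The dataset names, token counts, and upsampling factors below are made-up placeholders, and `mixture_probabilities` is a hypothetical helper rather than part of any released code.

```python
def mixture_probabilities(token_counts, upsample_factors):
    """Convert per-dataset token counts and upsampling factors into sampling probabilities.

    Datasets without an explicit factor keep a factor of 1.0 (no upsampling).
    """
    weighted = {
        name: count * upsample_factors.get(name, 1.0)
        for name, count in token_counts.items()
    }
    total = sum(weighted.values())
    if total <= 0:
        raise ValueError("at least one dataset must have a positive weight")
    return {name: weight / total for name, weight in weighted.items()}


# Placeholder numbers only: OCR and STEM data are doubled relative to their raw share.
counts = {"caption": 600, "ocr": 200, "stem": 200}
factors = {"ocr": 2.0, "stem": 2.0}
for name, prob in mixture_probabilities(counts, factors).items():
    print(f"{name}: {prob:.3f}")
```

In this toy mixture the caption share drops from 60% of raw tokens to 3/7 of sampled tokens, which is the kind of rebalancing a "strategic upsampling" stage would produce.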

Method

The authors leverage a scalable and efficient architecture for EXAONE 4.5, designed to handle high-resolution visual inputs while maintaining strong multimodal alignment and computational efficiency. The framework centers on a 1.2-billion-parameter vision encoder trained from scratch, which processes visual data using hybrid attention and Grouped Query Attention (GQA). This design choice enables the model to retain rich visual representations without aggressive token truncation, addressing the challenge of high-resolution image processing. The vision encoder employs 2D Rotary Positional Embedding (2D RoPE) to capture spatial structure, while the language model retains standard 1D RoPE for compatibility with pre-trained textual representations. The language backbone is derived from the EXAONE 4.0 architecture, enhanced with the Multi-Token Prediction (MTP) module to improve decoding throughput.
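To make the 2D RoPE idea concrete, here is a minimal pure-Python sketch of one common formulation: the first half of each head vector is rotated by the patch's row index and the second half by its column index. The report does not specify its exact variant, so the half-and-half split, the base of 10000, and the names `rope_1d` and `rope_2d` are assumptions for illustration.

```python
import math


def rope_1d(vec, position, base=10000.0):
    """Standard 1D RoPE: rotate consecutive pairs by position-dependent angles."""
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # frequency decreases with pair index
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out


def rope_2d(vec, row, col, base=10000.0):
    """Assumed 2D variant: first half encodes the row, second half the column."""
    if len(vec) % 4 != 0:
        raise ValueError("vector length must be divisible by 4")
    half = len(vec) // 2
    return rope_1d(vec[:half], row, base) + rope_1d(vec[half:], col, base)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Attention scores depend only on the relative (row, col) offset between patches.
q = [0.5, -1.0, 2.0, 0.25, 1.5, -0.75, 0.1, 0.9]
k = [1.0, 0.3, -0.6, 0.8, -1.2, 0.4, 0.7, -0.2]
score_a = dot(rope_2d(q, 3, 5), rope_2d(k, 1, 2))    # offset (2, 3)
score_b = dot(rope_2d(q, 10, 8), rope_2d(k, 8, 5))   # same offset (2, 3)
print(abs(score_a - score_b) < 1e-9)
```

The check at the end shows the property that makes this encoding useful for images: shifting both patches by the same amount leaves the query-key score unchanged.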

Training configuration across pretraining stages

The pre-training pipeline is structured in two sequential stages to progressively establish cross-modal alignment and expand representational coverage. In Stage 1, foundational modality alignment is achieved through end-to-end joint training of the vision encoder, merger, and LLM. This stage integrates a diverse mix of image-text pairs, interleaved documents, document understanding data, and OCR-centric samples, while preserving language modeling capability through the inclusion of text-only datasets. Stage 2 focuses on perceptual and knowledge refinement by emphasizing high-density, structured information. The data curriculum shifts toward grounding, document, and OCR-centric data, supplemented with knowledge, mathematics, and STEM datasets to provide the model with domain-specific exposure.

As shown in the figure below, the architecture integrates a native-resolution vision encoder that processes images at 800×1000 resolution, generating 612 visual tokens. These tokens are merged with textual inputs through an MLP projector and fed into the language decoder, which supports a maximum context length of 256K tokens. The language decoder leverages hybrid attention and GQA to optimize computational efficiency and hardware utilization. The model supports a maximum resolution of 800×1000, calibrated to match real-world inputs to balance performance and resource efficiency. The tokenizer, inherited from K-EXAONE, enhances multilingual and Korean language processing, ensuring robust text representations across diverse linguistic contexts.

The framework enables stable context extension by integrating long-context training directly into the supervised fine-tuning phase, leveraging a 128K-capable base LLM to stabilize optimization. Context Parallelism is employed to manage the computational demands of 256K sequences.

Additionally, joint multimodal reinforcement learning is applied, using task-specific reward functions and GRPO with zero-variance filtering to enhance reasoning across text and vision modalities.
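The GRPO step with zero-variance filtering can be sketched as below. This assumes the usual group-relative advantage (reward minus group mean, divided by group standard deviation) and interprets zero-variance filtering as dropping prompts whose sampled responses all receive the same reward. The report does not give the exact formula, so `grpo_advantages` and its `eps` parameter are illustrative assumptions.

```python
from statistics import mean, pstdev


def grpo_advantages(reward_groups, eps=1e-6):
    """Compute group-normalized advantages, one list per prompt.

    Each entry of `reward_groups` holds the rewards of all responses sampled
    for one prompt. Groups whose rewards are all identical are returned as
    None: every advantage would be zero, so they add no learning signal.
    """
    results = []
    for rewards in reward_groups:
        sigma = pstdev(rewards)
        if sigma < eps:
            results.append(None)  # zero-variance group is filtered out
            continue
        mu = mean(rewards)
        results.append([(r - mu) / (sigma + eps) for r in rewards])
    return results


# Three prompts with four sampled responses each (rewards are placeholders).
groups = [
    [1.0, 1.0, 1.0, 1.0],  # all correct: filtered
    [1.0, 0.0, 0.0, 1.0],  # mixed: kept
    [0.0, 0.0, 0.0, 0.0],  # all wrong: filtered
]
for advantages in grpo_advantages(groups):
    print(advantages)
```

Filtering such groups avoids spending optimizer steps on batches that contribute no gradient, which matters when some prompts are far too easy or too hard for the current policy.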

Experiment

EXAONE 4.5 was evaluated across a comprehensive set of vision and language benchmarks to validate its multimodal reasoning, document understanding, and linguistic capabilities. The model demonstrates robust and well-balanced performance, showing particular strength in mathematical reasoning, coding, and complex instruction following. Overall, the results indicate that EXAONE 4.5 is highly competitive against both large-scale and specialized baseline models across a wide range of diverse evaluation scenarios.

The authors compare EXAONE 4.5 33B with several baseline models across multiple vision benchmarks. EXAONE 4.5 is competitive in every category and is strongest on STEM/Puzzle and Document Understanding tasks: it outperforms larger models on several STEM and puzzle benchmarks and surpasses some larger models in document understanding. Results on the general and Korean benchmark categories remain consistently strong.

The table compares the two pre-training stages in terms of trained modules, data volume, sequence length, and computational requirements. Stage 2 uses fewer image and text tokens than Stage 1 and requires significantly less compute, while both stages use the same sequence length of 8K.

The authors compare EXAONE 4.5 with several baseline models across various language benchmarks. EXAONE 4.5 achieves strong performance in reasoning and coding, with the best score on LIVECODEBENCH V6 among all compared models. It also outperforms other models on agentic tool-use benchmarks, including τ²-BENCH, and performs competitively on instruction following and long-context understanding.

In summary, EXAONE 4.5 was evaluated against several baseline models across vision and language benchmarks to assess its reasoning, coding, and multimodal capabilities. The model is strongest in STEM, puzzle-solving, and document understanding, and it also outperforms larger models in coding and agentic tool-use tasks. The pre-training configuration further shows that the second stage uses significantly less data and compute than the first.

