2 months ago

Xinping Zhao Xinshuo Hu Jiaxin Xu Danyu Tang Xin Zhang Mengjia Zhou Yan Zhong Yao Zhou Zifei Shan Meishan Zhang

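Abstract

Memory embeddings are crucial for memory-augmented systems such as OpenClaw, yet current text embedding benchmarks focus on traditional passage retrieval and therefore do not adequately test a model's ability to handle long-horizon memory retrieval tasks involving fragmented, context-dependent, and temporally distant information. To address this, we introduce the Long-horizon Memory Embedding Benchmark (LMEB), a comprehensive framework that evaluates how well embedding models handle complex, long-horizon memory retrieval tasks. LMEB spans 22 datasets and 193 zero-shot retrieval tasks across four memory types (episodic, dialogue, semantic, and procedural), with both AI-generated and human-annotated data. These memory types differ in level of abstraction and temporal dependency, capturing distinct aspects of memory retrieval that reflect the diverse challenges of the real world. We evaluate 15 widely used embedding models, ranging from hundreds of millions to 10 billion parameters. The results show that (1) LMEB offers a reasonable level of difficulty; (2) larger models do not always perform better; and (3) LMEB and MTEB exhibit orthogonality. This suggests that the field has yet to converge on a universal model that excels across all memory retrieval tasks, and that performance in traditional passage retrieval may not generalize to long-horizon memory retrieval. In summary, by providing a standardized and reproducible evaluation framework, LMEB fills a crucial gap in memory embedding evaluation and drives further advances in text embedding for long-term, context-dependent memory retrieval. LMEB is available at https://github.com/KaLM-Embedding/LMEB.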

One-sentence Summary

Researchers from Harbin Institute of Technology and Shenzhen Loop Area Institute introduce LMEB, a comprehensive benchmark evaluating memory embeddings across diverse long-horizon retrieval tasks. Unlike traditional benchmarks, LMEB reveals that larger models do not always excel, highlighting a critical gap in current text embedding capabilities for complex, context-dependent memory scenarios.

Key Contributions

Current text embedding benchmarks fail to assess long-horizon memory retrieval involving fragmented, context-dependent, and temporally distant information, creating a gap in evaluating memory-augmented systems.
The authors introduce the Long-horizon Memory Embedding Benchmark (LMEB), a comprehensive framework spanning 22 datasets and 193 zero-shot tasks across episodic, dialogue, semantic, and procedural memory types.
Evaluation of 15 embedding models reveals that larger models do not consistently outperform smaller ones and that LMEB performance is orthogonal to traditional benchmarks like MTEB, indicating a lack of universal models for this domain.

Introduction

Text embedding models are critical for enabling efficient similarity search and powering downstream applications like retrieval and classification, yet current evaluation standards fall short in assessing their ability to handle long-horizon memory tasks. Existing benchmarks primarily focus on sentence-level similarity or standard retrieval across domains but rarely test scenarios requiring the synthesis of fragmented, context-dependent, and temporally distant evidence. To address this gap, the authors introduce LMEB, a specialized benchmark designed to evaluate how well embedding models support complex memory-centric retrieval that traditional metrics overlook.

Dataset

LMEB Dataset Overview

The authors introduce the Long-horizon Memory Embedding Benchmark (LMEB), a comprehensive framework designed to evaluate embedding models on complex, long-term memory retrieval tasks that traditional benchmarks like MTEB fail to address.

Dataset Composition and Sources
- The benchmark consolidates 22 English datasets into a unified schema, covering 193 zero-shot retrieval tasks.
- Data sources include a mix of AI-generated content and human-annotated material derived from crowdsourcing, academic papers, novels, and real-world logs.
- The datasets are categorized into four distinct memory types: Episodic, Dialogue, Semantic, and Procedural.

Key Details by Subset
- Episodic Memory: Focuses on recalling past events with temporal and spatial cues using datasets like EPBench (synthetic events) and KnowMeBench (autobiographical narratives).
- Dialogue Memory: Targets multi-turn context retention and user preferences using long-form conversations from LoCoMo, LongMemEval, REALTALK, and TMD.
- Semantic Memory: Assesses retrieval of stable, general knowledge from scientific papers (QASPER, SciFact), novels (NovelQA), and reports (ESG-Reports).
- Procedural Memory: Evaluates the retrieval of skills and action sequences using API documentation (Gorilla, ToolBench) and task trajectories (ReMe, DeepPlanning).

Usage in Model Evaluation
- The authors utilize the entire collection for zero-shot evaluation, meaning models are tested on their pre-trained capabilities without task-specific fine-tuning.
- The benchmark serves as a diagnostic tool to measure performance across varying levels of abstraction and temporal dependency.
- Results indicate that LMEB and MTEB are orthogonal, suggesting that high performance in traditional passage retrieval does not guarantee success in long-horizon memory tasks.

Processing and Construction Details
- Unified Schema: All resources are converted into a standard Information Retrieval format containing queries.jsonl, corpus.jsonl, qrels.csv, and an optional candidates.jsonl.
- Temporal Anchoring: For queries with relative time expressions, the authors append explicit time anchors (e.g., "Current time: 11:17 AM on Sunday") to disambiguate references.
- Metadata Encoding: Timestamps and hierarchical structures (such as session or turn levels in dialogues) are preserved in the title or text fields to support time-sensitive and scope-specific queries.
- Text Segmentation: Long documents in semantic tasks, such as novels and research papers, are segmented into passages or sentences using tools like semchunk with a chunk size of 256 tokens.
- Candidate Constraints: An optional candidates file restricts retrieval to specific memory scopes, such as a single conversation history, rather than the full corpus.
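
The construction steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the file names come from the summary, but the field names (`id`, `title`, `text`, `query_id`, `corpus_id`, `score`, `candidate_ids`), the sample records, and the helper functions are hypothetical and may not match the released LMEB files. The sketch shows how a time anchor could be appended to a query and how an optional candidates file could narrow retrieval to one memory scope.

```python
import csv
import io
import json

# Toy stand-ins for the four files. All field names and records are assumptions
# made for illustration; they are not taken from the LMEB release.
QUERIES_JSONL = '{"id": "q1", "text": "What did we eat yesterday?"}\n'
CORPUS_JSONL = (
    '{"id": "d1", "title": "session 1, turn 3", "text": "We had ramen on Saturday."}\n'
    '{"id": "d2", "title": "session 1, turn 4", "text": "The weather was cold."}\n'
    '{"id": "d3", "title": "session 9, turn 1", "text": "A different conversation."}\n'
)
QRELS_CSV = "query_id,corpus_id,score\nq1,d1,1\n"
CANDIDATES_JSONL = '{"query_id": "q1", "candidate_ids": ["d1", "d2"]}\n'


def read_jsonl(text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def load_qrels(text):
    """Build {query_id: {corpus_id: relevance}} from CSV text with a header row."""
    qrels = {}
    for row in csv.DictReader(io.StringIO(text)):
        qrels.setdefault(row["query_id"], {})[row["corpus_id"]] = int(row["score"])
    return qrels


def append_time_anchor(query_text, current_time):
    """Append an explicit time anchor so relative expressions can be resolved.

    The exact separator is an assumption; the summary only gives the
    'Current time: ...' example.
    """
    return f"{query_text} Current time: {current_time}"


def candidate_pool(query_id, corpus, candidates):
    """Return the documents a query may retrieve from.

    Without a candidates entry, the full corpus is searched.
    """
    allowed = candidates.get(query_id)
    if allowed is None:
        return list(corpus)
    allowed = set(allowed)
    return [doc for doc in corpus if doc["id"] in allowed]


queries = read_jsonl(QUERIES_JSONL)
corpus = read_jsonl(CORPUS_JSONL)
qrels = load_qrels(QRELS_CSV)
candidates = {
    row["query_id"]: row["candidate_ids"] for row in read_jsonl(CANDIDATES_JSONL)
}

anchored = append_time_anchor(queries[0]["text"], "11:17 AM on Sunday")
pool = candidate_pool("q1", corpus, candidates)
```

Here `pool` contains only the two documents from the first session, so a document from an unrelated conversation (`d3`) cannot be returned for `q1` even if it is lexically similar.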

Method

The authors introduce the Long-horizon Memory Embedding Benchmark (LMEB) to systematically evaluate memory capabilities across diverse scenarios. The framework categorizes tasks into four distinct memory types: Episodic, Dialogue, Semantic, and Procedural. Each type involves a query retrieving relevant information from a specific memory store.

To structure these evaluations, the authors define a taxonomy based on abstraction and temporal dependency. In the authors' taxonomy figure, Episodic and Dialogue memories exhibit strong temporal dependencies, while Semantic and Procedural memories vary in abstraction level.

To ensure embeddings follow instructions during downstream tasks, the authors prepend specific task instructions to the queries. The instructed query is formulated as follows:

$q_{\text{inst}} = \text{Instruct: \{task instruction\}} \; \texttt{\textbackslash n} \; \text{Query: } q$

where $q$ denotes the original query and $q_{\text{inst}}$ is the instructed query. This mechanism allows the model to adapt its retrieval and processing based on the specific requirements of the task.
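
As a minimal sketch of the template above, the helper below builds an instructed query string. The function name and the example instruction are hypothetical, and the choice to place the newline with no surrounding spaces is an assumption, since the formula does not pin down the exact whitespace. Passing no instruction returns the raw query, which is one way to run the with-instruction and without-instruction comparison.

```python
from typing import Optional


def build_instructed_query(query: str, task_instruction: Optional[str] = None) -> str:
    """Format a query as 'Instruct: {task instruction}\\nQuery: {query}'.

    When task_instruction is None or empty, the original query is returned
    unchanged so the same code path can serve instruction-free evaluation.
    """
    if not task_instruction:
        return query
    return f"Instruct: {task_instruction}\nQuery: {query}"


# Hypothetical instruction text, written for this example only.
example = build_instructed_query(
    "Where did the user go last weekend?",
    "Given a question, retrieve the dialogue turns that answer it",
)
```

Only the query side is wrapped here; whether documents also receive a prefix depends on the embedding model and is not specified in the summary.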

Experiment

The LMEB benchmark evaluation validates a unified pipeline for testing embedding models across episodic, dialogue, semantic, and procedural memory tasks, demonstrating that the benchmark offers a balanced difficulty level that effectively challenges current models.
Experiments comparing models of varying scales reveal that larger parameter counts do not guarantee superior performance, as smaller models often achieve comparable or better results depending on architecture and training data.
Analysis of task instructions shows that their impact on retrieval performance is model-dependent, with some models benefiting from instructions while others perform better without them or remain unaffected.
Correlation studies confirm that LMEB evaluates capabilities orthogonal to traditional benchmarks like MTEB, particularly showing that strong performance in standard passage retrieval does not generalize well to complex episodic or dialogue memory scenarios.
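
To make the orthogonality claim concrete, the sketch below computes a rank correlation between two lists of per-model scores. The summary does not say which correlation coefficient the authors use, so Spearman's rank correlation is an assumption chosen here for illustration, and the numbers in the usage lines are toy values, not results from the paper. A coefficient near zero would mean that a model's rank on one benchmark says little about its rank on the other.

```python
def average_ranks(values):
    """Assign 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy scores for four hypothetical models; not taken from the paper.
toy_benchmark_a = [1.0, 2.0, 3.0, 4.0]
toy_benchmark_b = [2.0, 1.0, 4.0, 3.0]
rho = spearman(toy_benchmark_a, toy_benchmark_b)
```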

LMEB: Long-horizon Memory Embedding Benchmark
