
SitEmb-v1.5: Improved Context-Aware Dense Retrieval for Semantic Association and Long Story Comprehension

Junjie Wu, Jiangnan Li, Yuqing Li, Lemao Liu, Liyan Xu, Jiwei Li, Dit-Yan Yeung, Jie Zhou, Mo Yu
Abstract

Retrieval-augmented generation (RAG) over long documents typically involves splitting the text into smaller chunks, which serve as the basic units for retrieval. However, due to dependencies across the original document, contextual information is often essential for accurately interpreting each chunk. To address this, prior work has explored encoding longer context windows to produce embeddings for longer chunks. Despite these efforts, gains in retrieval and downstream tasks remain limited. This is because (1) longer chunks strain the capacity of embedding models due to the increased amount of information they must encode, and (2) many real-world applications still require returning localized evidence due to constraints on model or human bandwidth.

We propose an alternative approach to this challenge: representing short chunks conditioned on a broader context window to enhance retrieval performance, i.e., situating a chunk's meaning within its context. We further show that existing embedding models are not well equipped to encode such situated context effectively, and thus introduce a new training paradigm and develop the situated embedding models (SitEmb). To evaluate our method, we curate a book-plot retrieval dataset specifically designed to assess situated retrieval capabilities. On this benchmark, our SitEmb-v1 model, based on BGE-M3 with only 1B parameters, substantially outperforms state-of-the-art embedding models, including several with up to 7-8B parameters. Our 8B SitEmb-v1.5 model further improves performance by over 10% and shows strong results across different languages and several downstream applications.
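To make the core idea concrete, below is a minimal sketch of one plausible way to produce a context-conditioned ("situated") chunk embedding: encode the chunk together with its surrounding window, then pool only over the chunk's own tokens, so the vector stays local to the chunk while self-attention lets the context inform it. This is an illustrative assumption, not the paper's released implementation; the choice of the BAAI/bge-m3 checkpoint, the mean pooling, and the helper name situated_embedding are all hypothetical.

```python
# Illustrative sketch only (assumed pooling strategy, not the authors' method):
# embed a short chunk conditioned on its surrounding context window.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
model = AutoModel.from_pretrained("BAAI/bge-m3")
model.eval()

def situated_embedding(context_before: str, chunk: str, context_after: str) -> torch.Tensor:
    """Embed `chunk` while conditioning on the text around it."""
    # Tokenize segments separately so the chunk's token span can be located.
    ids_before = tokenizer(context_before, add_special_tokens=False)["input_ids"]
    ids_chunk = tokenizer(chunk, add_special_tokens=False)["input_ids"]
    ids_after = tokenizer(context_after, add_special_tokens=False)["input_ids"]

    cls_id, sep_id = tokenizer.cls_token_id, tokenizer.sep_token_id
    input_ids = [cls_id] + ids_before + ids_chunk + ids_after + [sep_id]
    start = 1 + len(ids_before)   # index of the chunk's first token
    end = start + len(ids_chunk)  # one past the chunk's last token

    batch = torch.tensor([input_ids])
    with torch.no_grad():
        hidden = model(input_ids=batch).last_hidden_state[0]  # (seq_len, dim)

    # Mean-pool over the chunk tokens only: the representation indexes the
    # short chunk, but its values are contextualized by the full window.
    emb = hidden[start:end].mean(dim=0)
    return torch.nn.functional.normalize(emb, dim=0)
```

Under this sketch, retrieval still returns the short chunk as localized evidence; only the embedding used for matching is informed by the broader window, which is the distinction the abstract draws against simply embedding longer chunks.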