Dor Arviv, Yehonatan Elisha, Oren Barkan, Noam Koenigstein

Abstract
This work presents a method for extracting "monosemantic neurons" — latent dimensions that align with coherent, interpretable concepts — from the user and item embeddings of recommender systems. Our approach uses a Sparse Autoencoder (SAE) to uncover semantic structure within pretrained representations. Unlike prior work on language models, monosemanticity in recommender systems must preserve the interaction between separate user and item embeddings. To achieve this, we introduce a prediction-aware training objective that backpropagates through the frozen recommender and aligns the learned latent structure with the model's user-item affinity predictions. The resulting neurons capture attributes such as genre, popularity, and temporal trends, and support post hoc control operations such as targeted filtering and content promotion without modifying the base model. The method generalizes across diverse recommendation models and datasets, providing a practical tool for interpretable and controllable personalization. Code and evaluation resources are available at https://github.com/DeltaLabTLV/Monosemanticity4Rec.
Summarization
Researchers from Tel Aviv University and The Open University, Israel, introduce a method employing Sparse Autoencoders with a novel prediction-aware training objective to extract interpretable monosemantic neurons from recommender system embeddings, enabling precise post hoc control operations such as targeted filtering and content promotion without modifying the base model.
Introduction
Modern recommender systems rely on latent embeddings to generate personalized suggestions at scale, but these representations often lack semantic meaning, making the models opaque and difficult to audit for fairness or reliability. While Sparse Autoencoders (SAEs) have successfully extracted interpretable features from Large Language Models, existing methods fail to capture the distinct user-item interaction logic fundamental to recommendation architectures. The authors address this by introducing a novel SAE framework specifically designed to extract "monosemantic neurons" from recommender embeddings, revealing interpretable concepts like genre and popularity within the latent space.
Key innovations in this approach include:
- Prediction-aware reconstruction loss: Unlike standard geometric reconstruction, this mechanism backpropagates gradients through the frozen recommender to ensure the extracted features preserve actual recommendation behavior and affinity patterns.
- KL-divergence regularization: The framework replaces the Top-K sparsity objective common in LLM research with a KL-divergence penalty, which improves training stability and prevents dead neurons during training.
- Intervention capabilities: The extracted neurons enable precise, post-hoc control over model output, allowing developers to suppress specific content types or boost target items without retraining the base model.
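
To make the last point concrete, here is a minimal sketch of such an intervention, assuming the SAE exposes encode/decode methods and the frozen recommender exposes a score function over user and item embeddings (these names are illustrative assumptions, not the paper's API): encode an embedding, rescale one concept neuron, decode, and re-score.

```python
import torch

def intervene_and_rescore(recommender, sae, user_emb, item_embs,
                          neuron_idx, scale=0.0):
    """Suppress (scale=0) or boost (scale>1) a single concept neuron in the
    item embeddings, then re-score with the unmodified, frozen recommender."""
    with torch.no_grad():
        z = sae.encode(item_embs)                     # sparse concept activations
        z[:, neuron_idx] = z[:, neuron_idx] * scale   # targeted edit of one neuron
        item_embs_edited = sae.decode(z)              # map back to embedding space
        return recommender.score(user_emb, item_embs_edited)
```

Setting scale=0 on a neuron that tracks, say, a horror genre filters that content out of the ranking, while scale>1 promotes it; the base recommender's weights are never touched.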
Method
The authors leverage a sparse autoencoder (SAE) framework designed to extract monosemantic concepts from user and item embeddings within a two-tower recommender architecture. The overall system operates by first encoding user and item inputs into embeddings through independent encoders, followed by a scoring function that predicts user-item affinity. The SAE is applied post hoc to these embeddings, encoding them into a sparse latent representation and reconstructing the original embeddings. The framework incorporates a Matryoshka SAE structure, which trains multiple nested autoencoders with increasing dictionary sizes, enabling a hierarchical representation where early latent dimensions capture general features and later ones specialize in finer-grained concepts.
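
The sketch below illustrates one plausible way to realize this structure in PyTorch; the dictionary sizes, the shared encoder, and the prefix-masked decoding are assumptions for illustration rather than the authors' exact architecture.

```python
import torch
import torch.nn as nn

class MatryoshkaSAE(nn.Module):
    """Sparse autoencoder applied post hoc to frozen recommender embeddings.

    Nested dictionary sizes (Matryoshka): each prefix of the latent code must
    reconstruct the embedding on its own, so early neurons capture general
    concepts and later ones specialize in finer-grained ones.
    """
    def __init__(self, embed_dim=64, dict_sizes=(256, 512, 1024)):
        super().__init__()
        self.dict_sizes = dict_sizes
        self.encoder = nn.Linear(embed_dim, dict_sizes[-1])
        self.decoder = nn.Linear(dict_sizes[-1], embed_dim)

    def encode(self, e):
        # ReLU keeps latent activations non-negative and sparsity-friendly.
        return torch.relu(self.encoder(e))

    def decode(self, z, k=None):
        # Decode from the first k latent dimensions (a Matryoshka prefix).
        if k is not None:
            mask = torch.zeros_like(z)
            mask[:, :k] = 1.0
            z = z * mask
        return self.decoder(z)

    def forward(self, e):
        z = self.encode(e)
        # One reconstruction per nested dictionary size.
        recons = [self.decode(z, k) for k in self.dict_sizes]
        return z, recons
```

The same SAE (or a pair of them) can be applied to both the user and the item tower; the frozen scoring function is used only in the training loss described next.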

The SAE is trained with a total loss composed of reconstruction and sparsity objectives. The reconstruction loss includes two components: an embedding-level loss that ensures geometric fidelity between the original and reconstructed embeddings, and a novel prediction-level loss tailored for recommender systems. The prediction-level loss measures the mean squared difference between the original affinity prediction and the prediction computed using the reconstructed embeddings, with the scoring function kept frozen during training. This term encourages the SAE to preserve interaction semantics and ranking consistency, which are critical for recommendation quality. The final reconstruction loss is a weighted sum of the embedding-level and prediction-level losses. The sparsity loss combines ℓ1 regularization and a KL divergence penalty on the activation rates of the latent neurons, promoting compact and disentangled representations. The training procedure involves sampling user-item pairs, computing the total loss, and backpropagating gradients through the frozen recommender to align the latent representation with the recommender’s behavioral outputs.
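
A hedged sketch of one training step under these definitions follows. It assumes the frozen recommender exposes user_encoder, item_encoder, and a differentiable score function, that the SAE returns a sparse code and a single reconstruction (the Matryoshka prefixes are omitted for brevity), and that the loss weights and target activation rate rho are hyperparameters; none of these names come from the paper.

```python
import torch
import torch.nn.functional as F

def sae_training_step(sae, recommender, users, items,
                      lambda_pred=1.0, lambda_l1=1e-3,
                      lambda_kl=1e-2, rho=0.05):
    """One step: embedding-level + prediction-level reconstruction, plus sparsity."""
    with torch.no_grad():                       # the recommender stays frozen
        e_u = recommender.user_encoder(users)
        e_i = recommender.item_encoder(items)

    z_u, e_u_hat = sae(e_u)                     # sparse codes + reconstructions
    z_i, e_i_hat = sae(e_i)

    # Embedding-level loss: geometric fidelity of the reconstructions.
    loss_embed = F.mse_loss(e_u_hat, e_u) + F.mse_loss(e_i_hat, e_i)

    # Prediction-level loss: the frozen scorer is applied to original and
    # reconstructed embeddings; gradients flow through it back into the SAE,
    # aligning the latent space with the recommender's affinity predictions.
    y = recommender.score(e_u, e_i).detach()
    y_hat = recommender.score(e_u_hat, e_i_hat)
    loss_pred = F.mse_loss(y_hat, y)

    # Sparsity: l1 on activations plus a KL penalty pushing each neuron's
    # activation rate toward rho. tanh(z) serves as a differentiable proxy
    # for the 0/1 "did this neuron fire" indicator (an illustrative choice).
    z = torch.cat([z_u, z_i], dim=0)
    loss_l1 = z.abs().mean()
    rate = torch.tanh(z).mean(dim=0).clamp(1e-6, 1 - 1e-6)
    loss_kl = (rho * torch.log(rho / rate)
               + (1 - rho) * torch.log((1 - rho) / (1 - rate))).mean()

    return loss_embed + lambda_pred * loss_pred + lambda_l1 * loss_l1 + lambda_kl * loss_kl
```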
Experiment
- Experiments evaluated Matrix Factorization and Neural Collaborative Filtering models on MovieLens 1M and Last.FM datasets to assess the interpretability of Sparse Autoencoders.
- Qualitative analysis confirmed that monosemantic neurons emerge naturally without supervision, effectively encoding concepts such as specific genres, stylistic eras, and item popularity.
- Quantitative assessments using a semantic purity metric demonstrated high precision (a sketch of this computation follows the list); notably, Matrix Factorization neurons for Comedy and Horror achieved 100% purity across all top-K thresholds, with near-perfect alignment for music genres like Country and Metal.
- Ablation studies revealed that increasing the prediction-level loss weight improves recommendation fidelity, measured by Rank-Biased Overlap and Kendall's Tau, though optimal monosemanticity requires balancing this weight against bottleneck sparsity.
- Intervention experiments validated the ability to modify model behavior post hoc, such as successfully promoting specific artists to users with unrelated preferences by adjusting latent neuron activations.
- Hierarchical analysis using Matryoshka SAEs showed that early neurons capture broad mainstream preferences while later neurons specialize in niche micro-genres, a pattern particularly evident in the Last.FM dataset.
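
As a concrete reading of the purity numbers referenced above, the sketch below computes purity@K as the fraction of a neuron's top-K most strongly activating items that carry the concept label; this is a plausible interpretation of the metric, not necessarily the paper's exact definition.

```python
import numpy as np

def purity_at_k(activations, has_concept, k=10):
    """Semantic purity of one neuron at cutoff K.

    activations: (num_items,) array, the neuron's activation per item
    has_concept: (num_items,) boolean array, True if the item carries the
                 concept label (e.g. the 'Comedy' genre)
    """
    top_k = np.argsort(-activations)[:k]        # K most strongly activating items
    return float(np.mean(has_concept[top_k]))   # fraction that match the concept

# A neuron whose ten strongest activations are all comedies scores 1.0,
# i.e. 100% purity at K=10.
```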
Results show that monosemantic neurons extracted from recommender models achieve high semantic purity for genre concepts on MovieLens, with many neurons reaching 100% purity at K=10 for both MF and NCF. The results also reveal a "popularity neuron" that consistently activates for high-ranking items, indicating a latent dimension capturing mainstream appeal that appears in both datasets.

On Last.FM, the extracted neurons likewise achieve high semantic purity for music genres such as Electronic, Metal, and Folk, with purity values often reaching 1.00 at K=10. The popularity neuron again activates consistently for high-ranking items, indicating a strong bias toward widely consumed content.
