Feature-Level Insights into Artificial Text Detection with Sparse Autoencoders

Kristian Kuznetsov, Laida Kushnareva, Polina Druzhinina, Anton Razzhigaev, Anastasia Voznyuk, Irina Piontkovskaya, Evgeny Burnaev, Serguei Barannikov

Release Date: 4/24/2025

Abstract

Artificial Text Detection (ATD) is becoming increasingly important with the rise of advanced Large Language Models (LLMs). Despite numerous efforts, no single algorithm performs consistently well across different types of unseen text or guarantees effective generalization to new LLMs. Interpretability plays a crucial role in achieving this goal. In this study, we enhance ATD interpretability by using Sparse Autoencoders (SAE) to extract features from the Gemma-2-2b residual stream. We identify both interpretable and efficient features, analyzing their semantics and relevance through domain- and model-specific statistics, a steering approach, and manual or LLM-based interpretation. Our methods offer valuable insights into how texts from various models differ from human-written content. We show that modern LLMs have a distinct writing style, especially in information-dense domains, even though they can produce human-like outputs with personalized prompts.
