Feature-Level Insights into Artificial Text Detection with Sparse Autoencoders

Artificial Text Detection (ATD) is becoming increasingly important with the rise of advanced Large Language Models (LLMs). Despite numerous efforts, no single algorithm performs consistently well across different types of unseen text or guarantees effective generalization to new LLMs. Interpretability plays a crucial role in achieving this goal. In this study, we enhance ATD interpretability by using Sparse Autoencoders (SAE) to extract features from the Gemma-2-2b residual stream. We identify both interpretable and efficient features, analyzing their semantics and relevance through domain- and model-specific statistics, a steering approach, and manual or LLM-based interpretation. Our methods offer valuable insights into how texts from various models differ from human-written content. We show that modern LLMs have a distinct writing style, especially in information-dense domains, even though they can produce human-like outputs with personalized prompts.