GPT-oss Weights Reveal Training on Adult Website Phrases and Glitch Tokens
OpenAI recently released gpt-oss, an open-weights model family, giving researchers unprecedented access to its internal parameters. The model card states that gpt-oss was trained on a text-only dataset of trillions of tokens focused on STEM, coding, and general knowledge. Analysis of the model's embedding matrix, however, reveals deeper insights into that training data, some of them surprising and concerning.

The core method of this investigation is to examine the L2 norm of the embedding vector for each token in the o200k tokenizer, which OpenAI has used since GPT-4o. The distribution of these norms shows a clear pattern: most tokens cluster in a typical range, while a small number of non-ASCII tokens exhibit unusually high norms. These high-norm tokens are not random noise; they are often tied to specific, real-world content.

Among the most striking findings are Chinese-language tokens linked to adult websites, gambling platforms, and other spam-heavy domains. Examples include "毛片免费观看" (watch porn videos for free), "天天好彩票" ("Everyday Good Lottery", a gambling site), "久久综合网" (literally "everlasting portal network", the name of an adult content aggregator), and "一本道高清无码" (high-definition uncensored videos; 一本道 is a Japanese adult-video brand). These tokens not only have high embedding norms but are also correctly identified and translated by gpt-oss when prompted, indicating that the phrases were present in the training data despite being highly sensitive and potentially violating content policies. The fact that the model neither refuses to answer nor censors these inputs further implies that such content was not filtered out during preprocessing.

A similar pattern emerges with other non-English tokens, including "铁血网" (Tiexue.net, a Chinese nationalist forum), "凤凰大参考" ("Phoenix Grand Reference", a commentary column on Phoenix New Media), and various Thai and Indian city names. These suggest that the training data includes content from niche forums, political websites, and regional web communities, some of which are unlikely to be part of mainstream knowledge sources.

Further analysis reveals a correlation between the number of GitHub search results for a given token and the likelihood that gpt-oss recognizes it. This suggests that public repositories, especially those used for content moderation or spam detection, may have contributed to the training corpus. While this does not prove GitHub was the source, it raises questions about how training data is collected and filtered.

The presence of these high-norm tokens also challenges assumptions about model robustness. Typically, tokens that are rare or absent in the training data are suppressed by weight decay, since their embedding rows receive few gradient updates, and they therefore end up with low norms. That these tokens remain highly active in the embedding space indicates they were not just present but possibly overrepresented in the training corpus.

This research demonstrates that open-weights models, while valuable for transparency and research, also expose vulnerabilities. Glitch tokens, meaning unusual or adversarial inputs, can be used to infer details about training data that companies keep confidential. The ability to perform membership inference, that is, determining whether specific content was part of a model's training set, with high confidence on a production model is a significant privacy and security concern.

The findings suggest that OpenAI's training data likely included material from adult websites, gambling platforms, and fringe online communities. While the company may argue that such content was incidental, the persistence of these tokens in the model's embeddings implies a lack of effective filtering.
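These analyses are straightforward to reproduce from the public artifacts. As a first step, the following sketch ranks every vocabulary entry by the L2 norm of its embedding row. It assumes the publicly released Hugging Face checkpoint openai/gpt-oss-20b, and the choice of how many outliers to print is arbitrary:

```python
# Sketch: rank o200k tokens by embedding L2 norm in gpt-oss.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"  # assumption: smallest public gpt-oss checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

# Input embedding matrix: one row per entry in the ~200k-token vocabulary.
emb = model.get_input_embeddings().weight.detach().float()
norms = emb.norm(dim=1)  # per-token L2 norm

# The bulk of the distribution sits in a narrow band; print the outliers.
top = torch.topk(norms, k=50)
for norm, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{idx:>7d}  {norm:7.2f}  {tok.decode([idx])!r}")
```

Only the embedding table is needed, so loading the full model is wasteful but is the simplest route through the transformers API.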
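To check that the spam phrases really occupy dedicated vocabulary entries rather than being split into individual characters, one can tokenize them with the open-source tiktoken library. This is a minimal check, not the original author's script:

```python
# Sketch: inspect how o200k tokenizes the suspect phrases.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # the tokenizer used since GPT-4o
for phrase in ["毛片免费观看", "天天好彩票", "久久综合网", "一本道高清无码"]:
    ids = enc.encode(phrase)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{phrase!r} -> {len(ids)} token(s): {ids} {pieces}")
```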
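The recognition test is equally easy to reproduce. The sketch below, again assuming the gpt-oss-20b checkpoint and a recent transformers version that accepts chat-style input, asks the model to translate one of the high-norm tokens; whether it answers fluently or fails to recognize the phrase is the signal of interest:

```python
# Sketch: probe whether the model recognizes a high-norm token.
from transformers import pipeline

pipe = pipeline("text-generation", model="openai/gpt-oss-20b")
messages = [{
    "role": "user",
    "content": "Translate the phrase 毛片免费观看 into English.",
}]
out = pipe(messages, max_new_tokens=64)
# With chat-style input, the pipeline returns the conversation with the
# model's reply appended as the last message.
print(out[0]["generated_text"][-1]["content"])
```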
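Finally, the weight-decay argument made above admits a one-line formalization. A minimal sketch, assuming decoupled (AdamW-style) weight decay with learning rate η and decay coefficient λ: a token that never occurs in training contributes no data gradient to its embedding row, so each optimizer step only shrinks it.

```latex
% Update for an embedding row e_t whose token never appears in a batch:
% the data gradient is zero, so only the decay term acts.
e_{t+1} = (1 - \eta\lambda)\, e_t
\quad\Longrightarrow\quad
\lVert e_t \rVert_2 = (1 - \eta\lambda)^t \, \lVert e_0 \rVert_2 \longrightarrow 0 .
```

By contraposition, a token whose embedding norm stays large must have received data gradients, which is exactly why high norms serve as evidence of presence in the training corpus.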
In conclusion, the release of gpt-oss's open weights has revealed that frontier models are trained on far more diverse, and sometimes problematic, data than publicly acknowledged. This underscores the need for greater transparency in data sourcing and more robust mechanisms to prevent harmful content from being encoded into AI systems.
