Testing a Markov Model on 24 Years of Blog Content Yields Absurd and Amusing Gibberish
I fed 24 years of my blog posts to a Markov model and got some delightfully absurd results. The experiment began with a small program called Mark V. Shaney Junior, a minimal Markov text generator inspired by the legendary 1980s program Mark V. Shaney. I often write small exploratory programs like this just for fun, testing ideas around Markov chains on different kinds of data. This one, now shared on GitHub and Codeberg, uses trigrams (sequences of three consecutive words) to build a model that predicts the next word from the previous two.

I trained it on all 24 years of my blog content: around 200,000 words across more than 200 posts. Comments, which total about 40,000 words, were excluded, so the model learned only from my own writing.

The results were a mix of surreal logic, technical jargon, and unintentional humor. One output began: "While a query replace operation is approved by the user. The above variable defines the build job. It can be incredibly useful while working on assembly language and machine code..." Another: "Enjoy asking 'what happens if' and then type M-x zap-up-to-char RET b. The buffer for this specific video, the actual fare for 8.3 km and 11 are all written from scratch..." These snippets blend phrases from real posts, such as Emacs keybindings, Lisp programming, and discussions about integral domains, into bizarre new contexts. The model picked up recurring phrases like "Lisp source file" and "self-esteem" from different posts and mashed them together in strange ways.

The algorithm is straightforward: it builds a map in which each pair of words (the key) points to a list of the words that follow that pair in the original text. To generate text, it picks a random starting pair and then randomly selects the next word from that pair's list of possible followers, repeating this process to build a sentence. By default, the model uses order 2 (two-word keys), but increasing the order improves coherence.
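The map-building and generation steps described above can be sketched in a few lines of Python. This is a minimal illustration of the general order-n technique, not the actual Mark V. Shaney Junior code; the function and variable names are mine.

```python
import random

def build_model(words, order=2):
    """Map each tuple of `order` consecutive words to the list of
    words that follow that tuple somewhere in the corpus."""
    model = {}
    for i in range(len(words) - order):
        key = tuple(words[i:i + order])
        model.setdefault(key, []).append(words[i + order])
    return model

def generate(model, length=20, seed=None):
    """Pick a random starting key, then repeatedly append a random
    follower of the most recent key until `length` words are emitted
    or the current key has no recorded followers."""
    rng = random.Random(seed)
    key = rng.choice(list(model))
    out = list(key)
    order = len(key)
    for _ in range(length - order):
        followers = model.get(tuple(out[-order:]))
        if not followers:  # dead end: this key only occurs at the end of the corpus
            break
        out.append(rng.choice(followers))
    return " ".join(out)
```

Raising the `order` argument makes each key longer, so the list of followers per key shrinks, which is exactly why higher orders reproduce the source text more faithfully.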
At order 4, the output becomes more structured: "It is also possible to search for channels by channel names. For example, on Libera Chat, to search for all channels with 'python' in its name, enter the IRC command: /msg alis list python." Pushing the order to 5, however, causes the model to repeat large chunks of the original text verbatim, losing all creativity and becoming dry and mechanical.

Finally, I tested generating text from a prompt: "Finally we divide this number by a feed aggregator for Emacs-related blogs..." The result was a chaotic yet oddly plausible-sounding fragment that captured the rhythm and tone of my writing, if not the meaning. It felt like a parody of my own voice, as if I had written it while sleep-deprived and half in code.

This experiment reminded me how much of our writing style is embedded in word patterns, not just content. The model didn't understand meaning, but it learned the cadence, the syntax, and the quirks of my prose. It's a fun reminder that language modeling isn't about intelligence; it's about patterns. And sometimes, the most coherent nonsense is the most revealing.
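One natural way to seed generation with a prompt, sketched below, is to use the last n words of the prompt as the initial key and continue from there. This is my own guess at the mechanism, not necessarily how Mark V. Shaney Junior implements it, and `generate_from_prompt` is a hypothetical helper.

```python
import random

def generate_from_prompt(model, prompt, length=20, seed=None):
    """Continue `prompt` using an order-n Markov model: treat the last
    n words of the prompt as the current key and repeatedly append a
    random follower. `model` maps word tuples to follower lists."""
    rng = random.Random(seed)
    order = len(next(iter(model)))  # infer n from the first key
    out = prompt.split()
    for _ in range(length):
        followers = model.get(tuple(out[-order:]))
        if not followers:  # prompt tail (or chain) never seen in the corpus
            break
        out.append(rng.choice(followers))
    return " ".join(out)
```

If the prompt's final words never occur in the training text, the chain stops immediately, which is one reason prompt continuations can veer off into whatever corpus fragment happens to share a word pair with the prompt.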
