Tech Developer Tricks ImageSiftBot with Fake JPEGs to Thwart Abusive Web Crawlers
On March 25, 2025, a developer announced a new feature for Spigot, a small web application that generates fake hierarchies of web pages using a Markov chain. Spigot serves over a million pages per day, aimed primarily at aggressive web crawlers. The developer had recently noticed a new heavy hitter, "ImageSiftBot," which was aggressively requesting images from Spigot even though the site served none.

Wanting both to keep ImageSiftBot busy and to raise the cost of operating an abusive crawler, the developer set out to generate fake JPEG images with minimal CPU usage, since on-the-fly image generation is normally resource-intensive. The method leverages the structure of JPEG files, which consist of segments identified by markers and, for most segments, explicit lengths. By scanning existing JPEG files, discarding the comment segments, keeping the structured parts verbatim, and recording only the lengths of the entropy-coded pixel data sections, the developer produced realistic templates. Filling a template's pixel data sections with random bytes then yields a plausible JPEG at almost no CPU cost (a minimal sketch of both steps appears below).

Purely random pixel data did cause occasional decoding errors, because the entropy-coded sections are supposed to contain valid Huffman-coded data. In practice, though, most JPEG viewers, including those used by the crawlers, accepted and displayed the images. The developer concluded that even faulty JPEGs are useful for inconveniencing crawlers: they still have to download and attempt to decode each image, which raises their operating costs.

Testing the method in Python on a web server produced impressive results: roughly 900 garbage JPEGs per second, each around 1280x960 pixels and 200-300 KB in size. That is far more than the server's internet connection can deliver, so image generation is never the bottleneck.

The fake JPEG generator was then integrated into Spigot, and about 60% of the generated pages now include a garbage JPEG. The random number generator used for image creation is seeded with a value derived from the URL, so reloading the same URL always produces the same image. ImageSiftBot, along with other crawlers such as Meta's bot, AmazonBot, and GPTBot, has shown increased activity since the integration; on the first day alone, ImageSiftBot grabbed around 15,000 garbage images, and it is expected to ramp up further as it discovers more links.

The developer released the Python class for generating these fake JPEGs on GitHub on March 26, 2025. Following additional research on Huffman coding, the developer also applied a bitwise AND with 0x6D to each random byte, which reduced the proportion of images containing invalid Huffman codes from over 90% to less than 4%. The mask helps because it guarantees that no byte is 0xFF (which a decoder would treat as the start of a marker) and caps runs of consecutive 1-bits, which are what typically form invalid codes. This tweak costs almost no extra CPU while still raising the crawlers' costs; the developer noted that generating perfectly valid Huffman streams would require far more CPU resources for little additional benefit.
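To make the template step concrete, here is a minimal sketch in Python of how such a scanner might work. It is an independent illustration rather than the class the developer published: the function name make_template, the list-of-parts representation, and the simplified marker handling (restart markers are simply swallowed into the noise span) are all assumptions made for this example.

```python
import struct

# JPEG marker codes used below (the byte that follows 0xFF).
SOI, EOI, SOS, COM = 0xD8, 0xD9, 0xDA, 0xFE

def make_template(path):
    """Scan a real JPEG into a template: a list whose items are either
    literal header bytes (kept verbatim) or an int giving the length of
    an entropy-coded data span to be replaced with noise later."""
    data = open(path, "rb").read()
    assert data[0] == 0xFF and data[1] == SOI, "not a JPEG"
    parts, i = [data[:2]], 2          # keep the SOI marker
    while i < len(data) - 1:
        assert data[i] == 0xFF, "lost sync with the marker stream"
        marker = data[i + 1]
        if marker == EOI:
            parts.append(data[i:i + 2])
            break
        # Most segments carry a 2-byte big-endian length that counts
        # the length field itself but not the marker.
        seglen = struct.unpack(">H", data[i + 2:i + 4])[0]
        if marker != COM:             # discard comment segments
            parts.append(data[i:i + 2 + seglen])
        i += 2 + seglen
        if marker == SOS:
            # Entropy-coded scan data follows the SOS header; it ends at
            # the next real marker, i.e. 0xFF not followed by a stuffed
            # 0x00 or a restart marker (0xD0-0xD7).
            start = i
            while not (data[i] == 0xFF and data[i + 1] != 0x00
                       and not 0xD0 <= data[i + 1] <= 0xD7):
                i += 1
            parts.append(i - start)   # record only the span's length
    return parts
```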
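And here is a matching sketch of the generation side, again illustrative rather than the published code. It seeds Python's random module from a hash of the URL (the announcement says only that the seed is derived from the URL, not how) and applies the 0x6D mask described above; fake_jpeg and the seed-file name are hypothetical.

```python
import hashlib
import random

def fake_jpeg(template, url):
    """Assemble a garbage JPEG from a template built by make_template.
    The RNG is seeded from the URL, so the same URL always yields the
    same image, mirroring Spigot's behavior."""
    seed = int.from_bytes(hashlib.sha256(url.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    out = bytearray()
    for part in template:
        if isinstance(part, int):
            # Fill the entropy-coded span with noise. ANDing each byte
            # with 0x6D guarantees no stray 0xFF (a marker prefix) and
            # caps runs of consecutive 1-bits; per the developer, this
            # cut invalid Huffman codes from over 90% to under 4%.
            out += bytes(rng.getrandbits(8) & 0x6D for _ in range(part))
        else:
            out += part               # structured headers, copied verbatim
    return bytes(out)

# Example: the same URL always maps to the same garbage image.
template = make_template("seed.jpg")
open("fake.jpg", "wb").write(fake_jpeg(template, "https://example.com/a/b"))
```

Because only the entropy-coded spans are regenerated, the per-image cost is essentially that of producing a few hundred kilobytes of masked random bytes, which is consistent with the roughly 900 images per second the developer measured.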
Industry insiders see the move as a creative and effective way to push back against abusive web crawling, particularly as demand grows for quality training data in AI and machine learning projects. Spigot, though a small application, has become a notable tool in the ongoing effort to protect server resources and degrade the value of scraped data. The approach highlights the developer's ingenuity in using minimal resources to thwart unwanted traffic, and it may set a precedent for similar strategies across the tech community.