
Reddit Sues Startups Over Alleged Unauthorized AI Data Scraping

Reddit has filed a new lawsuit against four companies (Perplexity AI, SerpApi, Oxylabs, and AWMProxy), alleging that they violated the platform’s terms of service by using automated bots to scrape publicly available content from Reddit and sell it for use in training artificial intelligence models. The case, filed in New York, is the latest front in a growing legal battle between online platforms and the data-harvesting firms that feed AI systems with vast amounts of web content.

Reddit claims the defendants used a workaround: instead of accessing Reddit’s site directly, they scraped Google search results that included snippets of Reddit posts, bypassing Reddit’s login wall and scraping protections. The lawsuit argues that this method still constitutes unauthorized data extraction, even though the data appears on third-party pages. Reddit asserts that the companies’ actions violated its terms of service, which explicitly prohibit automated data collection. The platform also claims that Perplexity AI, after being told to stop and allegedly promising to do so, went on to access Reddit 100,000 times. The suit seeks damages and a permanent injunction to block future scraping.

Perplexity AI, known for its AI-powered search engine, is the most high-profile defendant. The company has previously drawn criticism for its aggressive data collection practices. The other three companies, SerpApi (Texas), Oxylabs (Lithuania), and AWMProxy (Russia), are alleged to have provided the technical infrastructure for scraping. Oxylabs said in a statement to the New York Times that “no company should claim ownership of public data that does not belong to them.” This argument reflects a broader legal debate: whether publicly visible data, even when the source site keeps it behind a paywall or login, can be legally harvested and repurposed for commercial AI training.

However, Reddit’s path to legal victory is far from certain.
The case is being heard in New York, but the defendants are based in multiple countries, raising jurisdictional and enforcement challenges. Additionally, past rulings have cast doubt on the ability of platforms to control how their data is used once it is publicly available. In 2023, a similar lawsuit by Elon Musk’s X (formerly Twitter) was dismissed, with a federal judge warning that overly broad claims of data control could create “information monopolies” that harm public access and innovation.

The outcome of Reddit’s case could set a significant precedent for how courts view data scraping in the age of AI. If Reddit succeeds, the ruling could limit the ability of AI companies to gather training data through indirect means. If it fails, the ruling may reinforce the idea that once content is publicly accessible, even via search engines, it is fair game for AI developers.

This legal clash underscores the tension between digital platforms that invest in user-generated content and AI firms that rely on massive datasets to train their models. Platforms like Reddit argue they deserve protection and compensation, while AI companies maintain that their use of publicly available data is transformative and beneficial to society.

The case also highlights the growing sophistication of data harvesting techniques. Rather than scraping a site directly, companies now exploit the visibility of its content in search results, which makes enforcement harder. As AI continues to expand, such legal battles are likely to intensify, with courts facing the difficult task of balancing innovation, user rights, and intellectual property in a rapidly evolving digital landscape.
