Reddit sues Perplexity over alleged large-scale data scraping for AI training

Reddit has filed a sweeping lawsuit against Perplexity and three web scraping firms — Oxylabs UAB, AWMProxy, and SerpApi — accusing them of running what it calls an “industrial-scale” operation to extract Reddit’s user-generated content for use in artificial intelligence training. The complaint, filed in the Southern District of New York, claims the companies collaborated to bypass Reddit’s anti-scraping systems and harvest massive amounts of material from the platform without permission.

According to Reddit, the defendants accessed its data through indirect channels designed to evade detection, including scraping content from Google search results pages instead of Reddit’s own servers. The company alleges that this method was deliberately used to get around rate limits and technical barriers it put in place to prevent automated data collection.

In the filing, Reddit portrays Perplexity’s approach as both reckless and parasitic, arguing that the AI startup has refused to negotiate a legitimate data licensing agreement, unlike OpenAI and Google, which have signed licensing deals with Reddit. The suit goes so far as to compare Perplexity’s tactics to those of a “North Korean hacker,” a characterization that underscores Reddit’s frustration over what it sees as a pattern of exploitation by AI firms unwilling to pay for the data that powers their products.

The lawsuit targets the foundation of Perplexity’s “answer engine,” which Reddit dismisses as standard retrieval-augmented generation (RAG) technology rather than an innovation. In such a system, content retrieved from the web is passed to a large language model, in this case a third-party one, which uses it to produce conversational answers. Reddit claims Perplexity’s business model relies heavily on unlicensed use of its platform’s discussions, and that Perplexity’s multibillion-dollar valuation has not translated into fair compensation for the data it consumes.
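For readers unfamiliar with the term, the general shape of retrieval-augmented generation can be sketched in a few lines of Python. This is a toy illustration, not Perplexity's actual system: the function names (`retrieve`, `build_prompt`, `answer`), the word-overlap scoring, and the `generate` callback standing in for a language model are all assumptions made for the example.

```python
from typing import Callable, List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of words, dropping basic punctuation."""
    words = (w.strip(".,!?").lower() for w in text.split())
    return {w for w in words if w}


def retrieve(query: str, documents: List[str], top_k: int = 2) -> List[str]:
    """Rank documents by how many words they share with the query (toy scoring)."""
    query_words = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(query: str, passages: List[str]) -> str:
    """Place the retrieved passages in front of the question for the model to use."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


def answer(query: str, documents: List[str], generate: Callable[[str], str]) -> str:
    """Retrieve, augment the prompt, then hand it to a language model (here a callback)."""
    passages = retrieve(query, documents)
    return generate(build_prompt(query, passages))
```

In a production system the retrieval step would typically use a search index or embeddings rather than word overlap, and `generate` would call a hosted model. The point of the sketch is only that the model's answer is built from whatever text the retrieval step supplies.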

Reddit says it first confronted Perplexity in May 2024 with a cease-and-desist letter. The AI company allegedly agreed to comply with Reddit’s robots.txt file, a standard way for a website to tell automated crawlers which parts of the site they may access, but Reddit asserts that activity from Perplexity-linked crawlers has increased “forty-fold” since then.
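As a rough illustration of how a well-behaved crawler consults robots.txt before fetching a page, the sketch below uses Python's standard-library `urllib.robotparser`. The rules, the bot names, and the `example.com` URLs are hypothetical and are not taken from Reddit's actual file or from the complaint.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; not Reddit's real robots.txt.
EXAMPLE_ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = EXAMPLE_ROBOTS_TXT) -> bool:
    """Return True if the given rules allow this user agent to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # ExampleBot is blocked from the whole site by its own rule group.
    print(may_fetch("ExampleBot", "https://example.com/r/some-thread"))
    # Other bots fall under the wildcard group, which only blocks /private/.
    print(may_fetch("OtherBot", "https://example.com/r/some-thread"))
    print(may_fetch("OtherBot", "https://example.com/private/page"))
```

Compliance is voluntary: robots.txt states a site's wishes, but nothing in the protocol stops a crawler that ignores the file or identifies itself under a different user agent, which is why sites pair it with rate limits and other technical barriers.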

The case also points to a sting operation Reddit set up to confirm its suspicions. Reddit created a hidden “test post” visible only to Google’s indexing system and not to public users. Within hours, the content from that post appeared in Perplexity’s results, suggesting that Perplexity was obtaining the material indirectly, by way of Google’s search results, rather than from Reddit itself.
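The idea behind such a test can be sketched as a "canary" check: publish a unique marker through a single channel, then look for it in a third party's output. The code below is a minimal illustration of that general technique and does not reflect how Reddit actually built its test; the function names and the token format are assumptions.

```python
import secrets


def make_canary_token(prefix: str = "canary") -> str:
    """Create a random marker that is very unlikely to appear anywhere by chance."""
    return f"{prefix}-{secrets.token_hex(8)}"


def build_test_post(token: str) -> str:
    """Embed the marker in the text of a post exposed through only one channel."""
    return f"Test post containing marker {token}."


def token_leaked(token: str, observed_text: str) -> bool:
    """Check whether the marker shows up in text obtained from a third party."""
    return token.lower() in observed_text.lower()
```

If the marker later appears in another service's answers, the content must have travelled through the one channel that could see the post, since no one else had access to it.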

The complaint demands that Perplexity and its scraping partners be permanently barred from accessing Reddit’s data, and it seeks financial restitution, including the return of any profits made through what Reddit calls the “unauthorized use” of its content.

Perplexity now joins a growing list of AI companies facing legal scrutiny for data sourcing practices. In June, Reddit filed a similar lawsuit against Anthropic, alleging that the Claude developer violated its terms of service while publicly promoting ethical AI development. Cloudflare has also previously accused Perplexity of ignoring robots.txt files and using covert crawlers to circumvent firewall protections.

This latest action underscores a broader industry clash over who controls the data that fuels artificial intelligence — and whether platforms built on user communities, like Reddit, can reclaim authority over how their content is monetized in the AI era.