Cloudflare has stated in a blog post that AI search company Perplexity may be actively crawling websites without respecting the applicable guidelines and restrictions for bots.
According to the network service provider, Perplexity uses techniques to circumvent detection and gain access to content that is normally protected from automated traffic.
The suspicions center on so-called stealth crawling: Perplexity initially crawls under its own name but switches to other methods as soon as that traffic is blocked. Cloudflare found that the crawlers changed their identity by posing as regular browsers, such as Chrome on macOS. They also rotated IP addresses and used different autonomous system numbers (ASNs) to circumvent firewall rules.
To verify these findings, Cloudflare set up a test environment with new domains whose restrictions specifically targeted Perplexity bots. According to the company, the crawlers were initially recognizable as coming from Perplexity, but once blocked they switched to generic user agents commonly associated with human users. In addition, the IP addresses used fell outside Perplexity’s known ranges, and the ASNs varied.
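The two signals mentioned here, the declared user agent and the known IP ranges, can be illustrated with a minimal Python sketch. This is not Cloudflare's detection method, which the article does not describe in detail. The token `ExampleBot`, the range `192.0.2.0/24` (a reserved documentation range), and the function `classify_request` are hypothetical and serve only as an example.

```python
import ipaddress

# Hypothetical values for illustration only: a made-up crawler token and a
# reserved documentation range standing in for a crawler's published IPs.
DECLARED_TOKEN = "ExampleBot"
PUBLISHED_RANGES = [ipaddress.ip_network("192.0.2.0/24")]


def classify_request(user_agent: str, remote_ip: str) -> str:
    """Classify a request using only its user agent and source IP."""
    ip = ipaddress.ip_address(remote_ip)
    declares = DECLARED_TOKEN.lower() in user_agent.lower()
    in_range = any(ip in network for network in PUBLISHED_RANGES)

    if declares and in_range:
        return "declared"  # identifies itself and comes from a listed range
    if declares:
        return "claims-token-from-unlisted-ip"  # token present, IP not listed
    if in_range:
        return "undeclared-from-listed-ip"  # listed IP, but no crawler token
    return "unattributed"  # looks like any other visitor on these two signals
```

A request with a Chrome-like user agent from an unlisted address ends up as `unattributed` in this sketch, which is why blocking rules based on the declared name or known ranges alone would presumably miss traffic of the kind Cloudflare describes.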
According to Cloudflare, this behavior deviates from generally accepted standards on the internet, such as the robots.txt protocol. This protocol allows website owners to specify which parts of a site are accessible to automated systems and which are not. Cloudflare emphasizes in the blog post that transparency about the purpose and identity of crawlers is essential, especially in light of the increasing use of AI in information processing.
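How a well-behaved crawler might consult such rules can be sketched with Python's standard `urllib.robotparser` module. The robots.txt content, the `example.com` URLs, and the helper `is_allowed` below are assumptions made for illustration; the token `PerplexityBot` is used here as an assumed name for the declared crawler and does not come from the Cloudflare post as summarized in this article.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: block one named crawler entirely and keep
# /private/ off limits for every other automated client.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules permit this user agent to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed(ROBOTS_TXT, "PerplexityBot", "https://example.com/article"))
    print(is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/article"))
```

The rules are matched against the name a client reports for itself. A client presenting a browser-like user agent falls under the `*` group in this sketch, so the protocol only works when crawlers identify themselves honestly and choose to comply.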
Perplexity offers an AI search engine that provides users with summaries and answers in natural language, based on web content. Crawling plays an important role in this, as the underlying models depend on access to up-to-date online information.
Millions of requests per day
According to Cloudflare, the scale of the detected activity is significant: millions of requests per day, spread across tens of thousands of domains. The company states that this pattern is not incidental and says it has taken measures in response. It has removed Perplexity from its list of verified bots and activated additional network rules to block this type of traffic.
The Verge quotes Perplexity spokesperson Jesse Dwyer, who says that Cloudflare’s report is mainly a publicity stunt and contains many misunderstandings about how the company works. Perplexity denies any intentional deception or technical tricks such as changing user agents or IP infrastructure. Although it questions the conclusions of the investigation, it does not address Cloudflare’s specific technical findings, such as the modified user agents or the varying autonomous systems.
Perplexity has previously faced criticism regarding its collection of web content, with questions raised about how the company handles robots.txt restrictions, among other things. At the time, management indicated that certain scraping activities may have originated from test bots outside their own infrastructure.
Clear guidelines necessary
In its blog post, Cloudflare reiterates its call for the industry to agree on clear guidelines for AI crawling. According to the company, such systems must be recognizable, adhere to website preferences, and only collect information ethically and transparently.