Cloudflare Blocking Web Bots from Scraping AI Training Data

By Paula Parisi
July 9, 2024

Cloudflare has a new tool that can block AI from scraping a website’s content for model training. The no-code feature is available even to customers on the free tier. “Declare your ‘AIndependence’” by blocking AI bots, scrapers and crawlers with a single click, the San Francisco-based company urged last week, simultaneously releasing a chart of frequent crawlers by “request volume” on websites using Cloudflare. The ByteDance-owned Bytespider was number one, presumably gathering training data for its large language models “including those that support its ChatGPT rival, Doubao,” Cloudflare says. Amazonbot, ClaudeBot and GPTBot rounded out the top four.

What Cloudflare is calling the “easy button” feature will automatically update “as we see new fingerprints of offending bots we identify as widely scraping the web for model training,” the company explains in its blog post.

Drawing on traffic already surveyed across its network, which ZDNet says proxies “about 20 percent of the web,” Cloudflare offers charts, graphs and insights based on activity within the past year.

Amazonbot is “reportedly used to index content for Alexa’s question-answering,” while “ClaudeBot, used to train the Claude chatbot, has recently increased in request volume,” Cloudflare explains.

Bytespider leads in number of requests and “the extent of its Internet property crawling” but also in the “frequency with which it is blocked,” followed closely by OpenAI’s GPTBot, second in both crawling and being blocked.

While some AI companies make an effort to identify their web scraping bots, not all are transparent. “OpenAI, Google and several other market players enable website operators to opt out of scraping,” SiliconANGLE writes, explaining that Cloudflare’s new blocking tool is a defense against surreptitious scraping.

Cloudflare says its software can identify even those bots that strive to avoid detection.

“When bad actors attempt to crawl websites at scale, they generally use tools and frameworks that we are able to fingerprint,” the Cloudflare engineers write, noting that the company “sees over 57 million requests per second on average.” Analyzing that activity, the company assesses the trustworthiness, or lack thereof, of each fingerprint.

“The problem of AI bots has come into sharp relief as the generative AI boom fuels the demand for model training data,” writes TechCrunch, noting that “blocking isn’t a surefire protection” against bad actors.

Citing a highly critical article on Perplexity published in Wired last month, SiliconANGLE accuses the AI search engine of “impersonating legitimate visitors.”

Meanwhile, TechCrunch writes that “OpenAI and Anthropic are said to have at times ignored” the standardized Robots Exclusion Protocol, which websites implement through a robots.txt file and which is the first line of defense against unwanted scraping.

Reuters wrote in June that content licensing startup TollBit found AI agents commonly ignore the “no crawl” directive in robots.txt.
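
For readers unfamiliar with how robots.txt works, here is a minimal sketch of the check a well-behaved crawler is expected to run before fetching a page, using the robots.txt parser in Python's standard library. The rules and the example.com URL are illustrative assumptions, not any real site's policy; the bot names are the ones cited in Cloudflare's chart. Nothing in the protocol enforces the result, which is why a crawler can simply skip this step.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that turns away three AI crawlers
# and leaves the site open to every other user agent.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/articles/latest"  # placeholder URL
    for bot in ("GPTBot", "ClaudeBot", "Bytespider", "SomeBrowser"):
        verdict = "allowed" if is_allowed(bot, page) else "disallowed"
        print(f"{bot}: {verdict}")
```

The check is purely advisory: it tells a crawler what the site owner has asked for, and compliance is voluntary. Network-level blocking of the kind Cloudflare describes is aimed at bots that never consult these rules.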
