Web Crawler Archives

Creative Commons Introduces New Licensing Platform for AI

By Paula Parisi
June 27, 2025

Creative Commons, the non-profit that pioneered sharing content through permissive licensing, is launching CC Signals, a framework to signal permissions for content use by machines in the age of artificial intelligence. “They are both a technical and legal tool and a social proposition: a call for a new pact between those who share data and those who use it to train AI models,” says Creative Commons CEO Anna Tumadóttir, noting the signals are “based on a set of limited but meaningful options shaped in the public interest.” The framework is designed to bridge the openness of the Internet with AI’s insatiable demand for training data, according to Creative Commons.

Cloudflare Blocking Web Bots from Scraping AI Training Data

By Paula Parisi
July 9, 2024

Cloudflare has a new tool that can block AI from scraping a website’s content for model training. The no-code feature is available even to customers on the free tier. “Declare your ‘AIndependence’” by blocking AI bots, scrapers and crawlers with a single click, the San Francisco-based company urged last week, simultaneously releasing a chart of frequent crawlers by “request volume” on websites using Cloudflare. The ByteDance-owned Bytespider was number one, presumably gathering training data for its large language models “including those that support its ChatGPT rival, Doubao,” Cloudflare says. Amazonbot, ClaudeBot and GPTBot rounded out the top four.

The New York Times Looks to Protect IP Content in Era of AI

By Paula Parisi
August 18, 2023

Newsrooms stand to benefit greatly from AI language models, but at this early stage they have begun setting boundaries. The aim is to keep third parties from co-opting their data to build artificial intelligence, so that publishers survive long enough to create models of their own or license their proprietary IP. As industries await regulations from the federal government, The New York Times has proactively updated its terms of service to prohibit data-scraping of its content for machine learning. The move follows a Google policy refresh that expressly states it uses search data to train AI.