Unchecked AI bots scraping content for their training could spell the end of the open web if enterprises follow one analyst’s advice to put their intellectual property behind a paywall.

Content distribution network Cloudflare is making it simpler for customers who have had enough of badly behaved bots to block them from their websites.

It has long been possible to prevent well-behaved bots from crawling your corporate website by adding a “robots.txt” file listing who’s welcome and who isn’t, and content distribution networks such as Cloudflare offer visual interfaces to simplify the creation of such files (a sample appears below). But faced with the arrival of a new generation of badly behaved AI bots, which scrape content to feed their large language models (LLMs), Cloudflare has introduced an even quicker way to block all such bots with one click.

“The popularity of generative AI has made the demand for content used to train models or run inference on skyrocket, and although some AI companies clearly identify their web scraping bots, not all AI companies are being transparent,” Cloudflare staff wrote in a blog post.

According to the post’s authors, “Google reportedly paid $60 million a year to license Reddit’s user generated content, Scarlett Johansson alleged OpenAI used her voice for their new personal assistant without her consent, and most recently, Perplexity has been accused of impersonating legitimate visitors in order to scrape content from websites. The value of original content in bulk has never been higher.”

Last year, Cloudflare introduced a way for any of its customers, on any plan, to block specific categories of bots, including certain AI crawlers. These bots, said Cloudflare, respect the directives in sites’ robots.txt files, do not use unlicensed content to train their models, and do not gather data to feed retrieval-augmented generation (RAG) applications.

To tell such bots apart, Cloudflare identifies them by their “user-agent string,” a kind of calling card presented by browsers, bots, and other tools requesting data from a web server.
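To make the mechanism concrete, here is a minimal sketch of user-agent-based blocking, written with Python’s standard library. It illustrates the principle only and is not Cloudflare’s implementation; the blocklist entries are the crawler names reported in this article, and the port and responses are invented for the example.

```python
# Minimal sketch of user-agent-based bot blocking (illustrative only;
# not Cloudflare's implementation).
from http.server import BaseHTTPRequestHandler, HTTPServer

# Substrings of the user-agent strings of AI crawlers named in this article.
BLOCKED_AGENTS = ("GPTBot", "ClaudeBot", "Bytespider", "Amazonbot")

class BotBlockingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        agent = self.headers.get("User-Agent", "")
        if any(bot in agent for bot in BLOCKED_AGENTS):
            # A bot that lies about its identity sails straight past this
            # check, which is exactly the weakness discussed below.
            self.send_response(403)
            self.end_headers()
            self.wfile.write(b"Access denied.")
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"Hello, human visitor.")

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), BotBlockingHandler).serve_forever()
```

The check is trivially defeated by any client that presents a browser-like user-agent string, which is why the discussion below turns to bots that misrepresent themselves.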
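As for the robots.txt approach mentioned above, a file that asks AI crawlers to keep out while leaving the rest of a site open might look like the sample below. The user-agent tokens are those of crawlers named in this article; the file is only a request, and only well-behaved bots honor it.

```
# Ask known AI crawlers not to fetch anything from this site
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Bytespider
Disallow: /

# Allow everyone else to crawl the whole site
User-agent: *
Disallow:
```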
“Even though these AI bots follow the rules, Cloudflare customers overwhelmingly opt to block them. We hear clearly that customers do not want AI bots visiting their websites, and especially those that do so dishonestly,” the post said.

The top four AI web crawlers visiting sites protected by Cloudflare were Bytespider, Amazonbot, ClaudeBot, and GPTBot, it said. Bytespider, the most frequent visitor, is operated by ByteDance, the Chinese company that owns TikTok. It visited 40.4% of protected websites and is reportedly used to gather training data for ByteDance’s LLMs, including those that support its ChatGPT rival, Doubao. Amazonbot is reportedly used to index content to help Amazon’s Alexa chatbot answer questions, while ClaudeBot gathers data for Anthropic’s AI assistant, Claude.

Blocking bad bots

Blocking bots based on their user-agent string will only work if such bots tell the truth about their identity, and there are signs that not all do, or not all the time. In such cases other measures are necessary, and enterprises’ main recourse against unwanted web scraping is normally reactive: legal action, according to Thomas Randall, director of AI market research at Info-Tech Research Group.

“While some software applications exist for web scraping prevention (such as DataDome and Cloudflare), these can only go so far: if an AI bot is rarely scraping a site, the bot may still go undetected,” he said via email.

To justify legal action against the operators of bad bots, enterprises will need to do more than claim that the bot didn’t leave when asked. The best course of action, Randall said, is for “enterprises to hide intellectual property or other important information behind a membership paywall. Any scraping done behind the paywall is liable for legal action, reinforced with a clear restrictive copyright license on the site. The organization must, therefore, be prepared to legally follow through. Any scraping done on the public site is accepted as part of the organization’s risk tolerance.”

Randall noted that organizations with the resources to go further could also consider rate-limiting connections to their site; temporarily and automatically blocking suspicious IP addresses; limiting the explanation of why access was blocked to a message such as “For help, contact support via helpdesk@company.com” in order to force a human interaction; and double-checking how much of their websites’ content is also exposed through their mobile sites and apps. A sketch of the rate-limiting approach appears below.

“Ultimately, scraping cannot be stopped, but hindered at best,” he said.
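For organizations that want to experiment with the rate-limiting measure Randall mentions, a minimal sketch of a fixed-window limiter with temporary IP blocking might look like the following. The thresholds are invented for illustration; a production deployment would more likely lean on a CDN, reverse proxy, or web server module.

```python
# Sketch of per-IP rate limiting with temporary automatic blocking.
import time
from collections import defaultdict

WINDOW_SECONDS = 60    # measurement window (assumed value)
MAX_REQUESTS = 100     # requests allowed per window per IP (assumed value)
BLOCK_SECONDS = 600    # how long a noisy IP stays blocked (assumed value)

hits = defaultdict(list)   # ip -> timestamps of recent requests
blocked_until = {}         # ip -> time at which the block expires

def allow_request(ip: str) -> bool:
    """Return True if the request should be served, False if refused."""
    now = time.time()
    if blocked_until.get(ip, 0.0) > now:
        return False
    # Keep only timestamps inside the current window, then record this hit.
    hits[ip] = [t for t in hits[ip] if now - t < WINDOW_SECONDS]
    hits[ip].append(now)
    if len(hits[ip]) > MAX_REQUESTS:
        blocked_until[ip] = now + BLOCK_SECONDS
        return False
    return True
```

In keeping with Randall’s advice, a refusal would carry only a terse message such as the helpdesk contact above, rather than an explanation of which threshold was tripped.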