Should You Let AI Train on Your Content? An Agency Guide

Training and search are two different doors. The expensive mistake is slamming the search door shut while trying to close the training one. Here is how to advise clients on both.

John Cravey with Elevi · Founder · 10 min read

Every client will eventually ask you some version of the same question: is AI stealing my content, and should I block it? It is a good question with a genuinely nuanced answer, and the agencies that handle it well turn a moment of client anxiety into a documented policy decision they get paid to make. The key is to separate two things clients almost always conflate: being trained on, and being found. They are different crawlers, different decisions, and different consequences.

The plain-English version

OpenAI runs a crawler called GPTBot whose job is to collect content to train its models. It is separate from OAI-SearchBot, the crawler that surfaces sites in ChatGPT search. Blocking GPTBot tells OpenAI your client's content should not be used in training (OpenAI's crawler docs). Crucially, blocking GPTBot does not remove the client from ChatGPT search, because search runs on the other crawler. So the client can say no to training and still say yes to being found, which is what most of them actually want once you explain the difference.

This is the single most important thing to get right for a client, because the failure mode is expensive and silent. A client says "block AI from my site," a well-meaning developer adds a blanket rule that catches every AI crawler including OAI-SearchBot, and now the client is invisible in ChatGPT search, which is a growing discovery channel, in exchange for a training block they could have gotten on its own. You lost a marketing channel to satisfy a privacy instinct. Separating the two doors is the whole job.

Source: OpenAI, Overview of OpenAI crawlers. Blocking the training door does not close the search door.
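The two doors can be checked mechanically before anyone ships a rule. What follows is a minimal sketch, assuming Python's standard-library urllib.robotparser is a reasonable stand-in for how a compliant crawler reads robots.txt (OpenAI's own matching may differ in edge cases). The door_status helper and both sample rule sets are hypothetical, written only to illustrate the failure mode described above.

```python
from urllib.robotparser import RobotFileParser

TRAINING_BOT = "GPTBot"        # OpenAI's training crawler
SEARCH_BOT = "OAI-SearchBot"   # OpenAI's ChatGPT search crawler


def door_status(robots_txt: str, path: str = "/") -> dict:
    """Report whether each OpenAI crawler may fetch `path` under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        "training_open": parser.can_fetch(TRAINING_BOT, path),
        "search_open": parser.can_fetch(SEARCH_BOT, path),
    }


# Hypothetical targeted rule: closes the training door only.
targeted = """\
User-agent: GPTBot
Disallow: /
"""

# Hypothetical blanket "block AI" rule that also names the search crawler.
blanket = """\
User-agent: GPTBot
User-agent: OAI-SearchBot
Disallow: /
"""

print(door_status(targeted))  # {'training_open': False, 'search_open': True}
print(door_status(blanket))   # {'training_open': False, 'search_open': False}
```

The second result is the expensive, silent one: the client asked for a training block and lost search along with it.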

What blocking training actually buys, and what it does not

Be precise with clients about what a GPTBot block does and does not do, because the marketing around "AI is stealing your content" oversells it. Going forward, a block tells OpenAI not to use the client's content in future training. It does not claw back anything a model already learned in an earlier training run, and it does not stop other companies' crawlers, each of which has its own control. It also does not affect whether the client appears in ChatGPT search, which many clients are surprised and relieved to hear. So a GPTBot block is a forward-looking, OpenAI-specific, training-only decision. Framed honestly, it is neither the shield nor the sacrifice clients imagine.

There is also a quieter consideration on the other side of the ledger. Content that a model has learned from can shape what the model knows about a business, its category, and its expertise. For a firm building authority, being part of the corpus is not obviously bad, and for many it is mildly good. This is why the default should be a deliberate decision, not a reflex to block.

What "training on your content" actually means

Half of a client's anxiety here comes from a mental picture that is wrong. They imagine a model storing a perfect copy of their page in a database and handing it out on request. That is not what training does. Training adjusts the statistical patterns inside a model based on enormous volumes of text, so no single page is stored or retrievable verbatim. Your content becomes one of billions of influences on how the model writes and what it associates with your industry, not a file the model can be asked to reproduce. For a page of public marketing copy, this is closer to your writing being one drop in an ocean than to your work being photocopied and sold. Explaining that to a client deflates most of the fear before you even get to the decision.

That does not make the concern meaningless. For content that is genuinely distinctive, a paid course, original research, a signature framework, the influence on the model can be more legible, and the client's instinct to protect it is reasonable. The point is to calibrate the fear to the content. Generic service-page copy being trained on is a non-event. A proprietary methodology being trained on is a real decision. Your job is to help the client tell those apart instead of treating every page as if it were the crown jewels.

The strategic case for staying in the corpus

There is an upside most clients never consider, and a good agency raises it. When a model has been trained on content about a business and its category, it is better able to talk about that business, understand what it does, and represent its expertise when someone asks. Being part of the corpus is a soft form of distribution: it shapes what the AI knows before anyone runs a live search. A firm that has published thoughtful, specific content for years and allowed it to be learned from is, in a real sense, teaching the models that will field its buyers' questions. Blocking training forfeits that quietly.

For a client whose whole strategy is authority, thought leadership, published expertise, a considered point of view, the case for allowing training on that public content is strong. The content was written to influence how people, and now models, understand the field. Pulling it out of training is like publishing a book and then refusing to let anyone remember they read it. So the strategic default for most authority-driven clients is to allow training on the public content and reserve blocking for the genuinely proprietary slice, which is exactly the calibrated position the rest of this guide argues for.

The default recommendation, and when to deviate

For most clients, especially small and local businesses whose content is public marketing, allowing training is low-risk and occasionally helpful, so the sensible default is to leave GPTBot allowed and keep OAI-SearchBot allowed. Deviate, and block GPTBot, when a client has content that is genuinely an asset to protect.

The stronger the reason near the top, the more a training block is warranted. A vague unease is not, on its own, a reason to give up being part of the corpus.

The list at the top is where a block clearly earns its place: content the client sells, content they license from someone else, and the original methods or research that are their edge. Lower down, a general unease about "AI using my stuff" is a conversation to have, not an automatic block, because blocking costs the client presence in the corpus for a benefit that is mostly emotional. Your value as the agency is walking the client through that trade honestly rather than reflexively blocking to look protective.
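If you want the rule of thumb written down so every account manager applies it the same way, it can be sketched in a few lines. This is an illustration only: the tag names (sold, licensed, proprietary, general_unease, public_marketing) and the recommend_training_policy function are hypothetical labels invented here, not a standard vocabulary.

```python
# Hypothetical content tags that justify a GPTBot block:
# content the client sells, content licensed from someone else,
# and original methods or research that are the client's edge.
PROTECT_REASONS = {"sold", "licensed", "proprietary"}


def recommend_training_policy(content_tags: set) -> str:
    """Return "protected" if any tag marks an asset worth protecting, else "default".

    Tags outside PROTECT_REASONS (for example "general_unease") never
    trigger a block on their own; they are a conversation, not a rule.
    """
    return "protected" if content_tags & PROTECT_REASONS else "default"


print(recommend_training_policy({"public_marketing", "general_unease"}))  # default
print(recommend_training_policy({"public_marketing", "sold"}))            # protected
```

The output is a recommendation to bring to the client, not a decision made for them.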

The technical version: the policy you apply per client

Turn the decision into a two-group robots.txt policy, one group per crawler, that you apply consistently and document. The default keeps both doors open; the protected variant blocks training while preserving search.

User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Allow: /

Default policy: findable and trainable. Right for most small-business clients whose content is public marketing.

User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /

Protected policy: findable but NOT trainable. Right for clients with paywalled, licensed, or proprietary content.

Notice what both policies share: OAI-SearchBot stays allowed in each. That is deliberate and non-negotiable in almost every case, because it protects the search visibility we cover in the OAI-SearchBot agency playbook. Whichever training policy a client picks, keeping the search door open is the constant. Verify a GPTBot block against OpenAI's published ranges at openai.com/gptbot.json so you are acting on the real crawler, not a spoofed user agent, and remember the block is forward-looking only.
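Checking a request's IP against the published ranges can be sketched as follows. Two assumptions are baked in and worth confirming against the live file: that openai.com/gptbot.json is a JSON object with a "prefixes" list whose entries carry an "ipv4Prefix" (or "ipv6Prefix") CIDR string, and that you have already downloaded and parsed it. The SAMPLE_RANGES data uses reserved documentation addresses, not real OpenAI ranges, and is_published_gptbot_ip is a hypothetical helper.

```python
import ipaddress
import json

# Hypothetical sample shaped like the published file (schema is an assumption).
# These are documentation-only networks; load the real data from
# openai.com/gptbot.json in practice.
SAMPLE_RANGES = json.loads("""
{"prefixes": [{"ipv4Prefix": "192.0.2.0/24"},
              {"ipv4Prefix": "198.51.100.16/28"}]}
""")


def is_published_gptbot_ip(ip: str, ranges_doc: dict) -> bool:
    """True if `ip` falls inside any CIDR prefix listed in `ranges_doc`."""
    addr = ipaddress.ip_address(ip)
    for entry in ranges_doc.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        # Membership is False when address and network versions differ.
        if cidr and addr in ipaddress.ip_network(cidr):
            return True
    return False


print(is_published_gptbot_ip("192.0.2.55", SAMPLE_RANGES))   # True
print(is_published_gptbot_ip("203.0.113.9", SAMPLE_RANGES))  # False
```

A request that claims to be GPTBot in its user agent but fails this check is best treated as an impostor rather than as evidence about OpenAI's behavior.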

Making it a service, not a one-off answer

The train-or-block decision fits neatly into the same access audit you already run for search visibility. Add an "AI training policy" line to the audit: for each client property, record the decision, the reason, who signed off, and the date. That record is worth more than it looks. When a client's counsel or a nervous stakeholder asks "what is our position on AI training," the agency that can answer in one sentence with a dated decision looks like the trusted advisor it wants to be.

Run this identically on every client so the decision is consistent, documented, and defensible.
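The audit line itself can be as simple as one structured record per client property. Here is a minimal sketch of the four fields named above plus the property they apply to; the TrainingPolicyRecord class, its field names, and the sample values are all hypothetical and should be adapted to whatever your audit template already uses.

```python
from dataclasses import dataclass
from datetime import date

VALID_DECISIONS = {"default", "protected"}  # the two policies in this guide


@dataclass(frozen=True)
class TrainingPolicyRecord:
    property_url: str    # which client property the decision covers
    decision: str        # "default" (trainable) or "protected" (GPTBot blocked)
    reason: str          # why the client chose it
    signed_off_by: str   # who made the call on the client side
    decided_on: date     # when it was made

    def __post_init__(self):
        if self.decision not in VALID_DECISIONS:
            raise ValueError(f"unknown decision: {self.decision!r}")

    def one_sentence(self) -> str:
        """The answer to 'what is our position on AI training?'"""
        return (f"{self.property_url}: {self.decision} policy, decided "
                f"{self.decided_on.isoformat()} by {self.signed_off_by} "
                f"({self.reason}).")


# Hypothetical example record.
record = TrainingPolicyRecord(
    property_url="https://client.example",
    decision="protected",
    reason="sells a paid course",
    signed_off_by="client owner",
    decided_on=date(2025, 1, 15),
)
print(record.one_sentence())
```

Making the record immutable is a deliberate choice in this sketch: a changed decision should be a new dated record, not an edit to the old one.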

Handling the anxious client conversation

The train-or-block question usually arrives as anxiety, not as a neutral request, because the client just read a scary headline. So the conversation matters as much as the technical answer. The move that works is to slow down, separate the two doors, and ask what they are actually worried about, because the honest answer is usually one of two things: they do not want their work used without permission, or they are worried about being left behind by AI. Those pull in opposite directions, and naming which one you are solving for turns a vague fear into a decision.

If the worry is about their work being used, walk them through what training actually does, that it is statistical influence rather than a stored copy, and then ask the sharper question: is there specific content here you sell or would not want summarized? If yes, you block that. If no, you explain that blocking generic marketing copy protects nothing and costs them a little presence in the systems their buyers now ask. If the worry is about being left behind, the answer flips: staying in the corpus and, more importantly, staying findable in search is how you are present in AI, and blocking everything is the opposite of what they want.

Either way, the client leaves with a decision they understand and you leave with a documented policy. That is a fundamentally different outcome from silently flipping a setting and hoping nobody asks. It positions the agency as the advisor who understood the nuance, which is worth far more than the fifteen minutes the conversation takes. And it inoculates you against the worst version of this, where a client later discovers they were removed from ChatGPT search and blames you for a decision they never actually made.

The mistakes that make an agency look careless here

  • Conflating training and search. "We blocked AI for you" is a firing offense if it quietly removed the client from ChatGPT search. Always separate the two doors, out loud, in writing.
  • Blocking by default to look protective. A reflexive block costs the client presence in the corpus for a mostly emotional benefit. Recommend deliberately, per content type.
  • Overselling the block. Telling a client a GPTBot block claws back what a model already learned is wrong. It is forward-looking only. Set expectations honestly.
  • Skipping sign-off. A training decision is the client's to make. Document who decided and when, so nobody relitigates it in six months without context.
  • Trusting the user agent. Verify GPTBot traffic against the published ranges before you rate-limit or celebrate a block, since the string is trivially spoofed.

Keep the policy from drifting

A training decision made cleanly today can quietly become wrong later, so build a light recheck into the same cadence you already use for search access. The two things that break a client's training policy are a site change and a content change. A replatform or a new security tool can silently alter which crawlers reach the site, undoing a block or an allow without anyone noticing. And when a client launches a new paid product, a members area, or an original research report, a new protect-pile asset just appeared that the old policy never accounted for. Fold a one-line training-policy check into your regular client review: is the documented decision still being honored, and has any new content type shipped that changes it? That habit is what turns a one-time answer into a standard the client can actually rely on, and it is the same discipline that makes the search-access monitoring worth paying for.
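The "is the documented decision still being honored" half of that check can be automated against the live robots.txt. A minimal sketch follows, assuming the two policy names used in this guide and using Python's urllib.robotparser as an approximation of crawler behavior. The policy_drift function and EXPECTED table are hypothetical. Note that this only reads robots.txt; it cannot see a firewall or security tool blocking crawlers at the network level.

```python
from urllib.robotparser import RobotFileParser

# What each documented policy promises, per crawler (True = allowed).
EXPECTED = {
    "default": {"GPTBot": True, "OAI-SearchBot": True},
    "protected": {"GPTBot": False, "OAI-SearchBot": True},
}


def policy_drift(documented: str, live_robots_txt: str, path: str = "/") -> list:
    """Return a list of mismatches between the documented policy and the live file."""
    parser = RobotFileParser()
    parser.parse(live_robots_txt.splitlines())
    problems = []
    for bot, should_allow in EXPECTED[documented].items():
        allowed = parser.can_fetch(bot, path)
        if allowed != should_allow:
            want = "allow" if should_allow else "block"
            got = "allow" if allowed else "block"
            problems.append(f"{bot}: documented {want}, live file says {got}")
    return problems


# Hypothetical post-replatform file that shipped with a staging-style lockout.
after_replatform = """\
User-agent: *
Disallow: /
"""

for problem in policy_drift("default", after_replatform):
    print(problem)
```

An empty list means the file still matches the record; anything else is a line item for the next client review.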

What changes by client size inside your book

  • Micro and solo clients (1 to 9): the answer is almost always "leave training on, keep search on, move on." Low stakes, high volume, easy to standardize. The owner-facing version is the micro-business training guide.
  • Small and mid clients (10 to 249): more likely to have a genuine asset (a paid resource, original research) worth a targeted block. Map it to the growing-business training guide.
  • Mid-market clients (250+): the decision becomes a governance and legal matter across many properties, covered in the mid-market training governance guide.

The training decision is about protecting assets. The search decision is about staying reachable. Keep them separate, document the first, and default the second to "on," and you have handled the whole GPTBot question for a client cleanly. Being trained on or not, they still need to be the firm ChatGPT names, which is the authority work in the answer engine optimization cornerstone.

Want us to set the AI training policy across your client book with you, or under your brand? Run discovery or see what we ship.

Written by
John Cravey
Founder

Founder of Frontend Horizon. Writes most of the long-form work on the FH blog.
