How do you control each OpenAI crawler separately in robots.txt?

Each crawler has its own user-agent name, so you write a separate group for each: OAI-SearchBot for search, GPTBot for training, OAI-AdsBot for ad landing pages, and ChatGPT-User for live user fetches. A named group for a specific crawler takes precedence over the wildcard User-agent: * group, so you can set each crawler's access independently.

Does a User-agent: * Disallow rule block the OpenAI crawlers?

It can, which is a common accidental-block trap. A wildcard disallow applies to any crawler without its own named group. If you want OAI-SearchBot allowed while a wildcard disallow exists, add an explicit named OAI-SearchBot group with Allow, because the named group overrides the wildcard.

Does robots.txt control ChatGPT-User?

Not reliably. Per OpenAI's current documentation, ChatGPT-User does not comply with robots.txt for user-initiated actions, because it fetches on behalf of a live user. So robots.txt governs OAI-SearchBot, GPTBot, and OAI-AdsBot, but live user fetches are managed at the edge and through page readiness, not the file.

How long do robots.txt changes take to apply?

OpenAI processes robots.txt changes in roughly 24 hours. Deploy a change and verify the next day rather than same-day. Also confirm the change is live at the origin and not overridden by a CDN or WAF, which can block a crawler even when robots.txt allows it.

How to Control the OpenAI Crawlers With robots.txt: An Agency Playbook

Almost everything an agency needs to control about how OpenAI treats a client's site runs through one small text file at the root of the domain: robots.txt. It decides whether the client is findable in ChatGPT search, whether their content is used for training, and whether the ad crawler can validate landing pages. It is also one of the easiest files to get subtly wrong, in ways that never show up in a ranking report but quietly remove a client from a growing channel. The agencies that treat it as a standardized deliverable, not a per-client afterthought, avoid an entire class of silent failures.

#The plain-English version

robots.txt is a plain text file that lives at a domain's root, like example.com/robots.txt, and tells crawlers what they may access. OpenAI runs several crawlers, each with its own name, and each is controlled by its own group of rules in that file (OpenAI's crawler docs). That independence is the whole point: you can allow the search crawler while blocking the training crawler, or any other combination, by writing a separate group for each. The mistake most sites make is treating AI as a single on-off switch, when it is really four separate switches in one file, plus two important caveats about what the file can and cannot do.

For an agency, the value is that this is completely standardizable. The right robots.txt policy for a client is not a creative decision, it is a house standard with a small number of variants, applied consistently and verified. Once you have the standard, deploying it to a new client is minutes of work, and monitoring it is a line in your regular report. The failure mode you are preventing is the one where a developer, a plugin, or a migration writes a rule that catches OAI-SearchBot and quietly makes the client invisible in ChatGPT, with nothing on the site looking broken.

Source: OpenAI, Overview of OpenAI crawlers. Three crawlers obey robots.txt; the live-fetch agent does not.

#The precedence rule that trips everyone up

The single most important technical fact for getting robots.txt right is how precedence works, because it is the source of most accidental blocks. A robots.txt file can have a wildcard group, written as User-agent: *, that applies to any crawler without its own named group. It can also have named groups for specific crawlers. The rule is that a named group wins over the wildcard for that crawler. So if a site has User-agent: * Disallow: / (block everything) and nothing else, that blanket rule catches OAI-SearchBot, and the client is invisible in ChatGPT search. The fix is not to remove the wildcard, which may exist for good reasons, but to add an explicit named group for OAI-SearchBot that allows it, because the named group overrides the wildcard.
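The precedence rule is easy to confirm on a laptop. The sketch below uses Python's standard-library `urllib.robotparser` as a stand-in for a well-behaved crawler; the file contents, the `allowed` helper, and the example URL are illustrative assumptions, and OpenAI's own parser may differ in edge cases.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical client file: a blanket wildcard block and nothing else.
WILDCARD_ONLY = """\
User-agent: *
Disallow: /
"""

# The same file with an explicit named group added for the search crawler.
WITH_NAMED_GROUP = WILDCARD_ONLY + """
User-agent: OAI-SearchBot
Allow: /
"""

URL = "https://www.example.com/services"

print(allowed(WILDCARD_ONLY, "OAI-SearchBot", URL))     # False: caught by the wildcard
print(allowed(WITH_NAMED_GROUP, "OAI-SearchBot", URL))  # True: named group wins
print(allowed(WITH_NAMED_GROUP, "GPTBot", URL))         # False: still under the wildcard
```

The wildcard stays in place for every crawler that has no group of its own; only the crawler you named is lifted out of it.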

This is exactly the trap that produces the invisible-client emergency. A security plugin or a cautious developer adds a broad disallow, or a staging block ships to production, and because there is no named OAI-SearchBot group to override it, the search crawler is caught in the blanket rule. Nobody notices, because rankings are unaffected and the site looks fine. The client simply stops appearing in ChatGPT answers. An agency that understands precedence catches this in an audit and fixes it with a few explicit lines, turning a silent failure into a routine hygiene item.

#The two caveats that robots.txt cannot solve

A good agency also knows the limits of the file, because promising a client something robots.txt cannot deliver is how you lose trust. There are two important caveats. First, ChatGPT-User, the agent that fetches a page live when a user asks, does not comply with robots.txt for user-initiated actions, per OpenAI's current documentation. So you cannot reliably block live fetches with the file, and you generally would not want to, since those are interested users, which we cover in the live-fetch agency guide. Second, robots.txt is only honored if the request reaches the origin. A CDN or WAF can block or challenge a crawler before robots.txt is ever consulted, so a perfect file behind a hostile edge is still a blocked crawler.

Those two caveats reshape how you advise clients. robots.txt is the right and sufficient control for search visibility and training, the OAI-SearchBot and GPTBot decisions. It is not the control for live fetches, and it is not the whole story wherever a CDN or WAF sits in front of the site. So the house standard is: set the file correctly for search and training, verify it is live at the origin, and separately confirm the edge is not overriding it. That layered check is what separates an agency that actually delivered the outcome from one that just edited a file and hoped.
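A rough way to run the edge half of that layered check is to fetch the same URL twice, once with a browser-like User-Agent and once with a crawler-like one, and compare the status codes. This is only a sketch under loud assumptions: the helper names are hypothetical, the short `OAI-SearchBot` string in the usage comment is a placeholder rather than OpenAI's full user-agent, and a CDN or WAF that keys on IP ranges will not be fooled by a spoofed header from your own machine, so treat a clean result as a hint rather than proof.

```python
import urllib.error
import urllib.request


def fetch_status(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Return the HTTP status a request with this User-Agent header receives."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # urlopen raises on 4xx/5xx; the code is still the answer


def edge_verdict(browser_status: int, crawler_status: int) -> str:
    """Compare a browser-like fetch with a crawler-like fetch of the same URL."""
    if crawler_status == 200:
        return "reachable"
    if browser_status == 200:
        return "crawler treated differently: check CDN/WAF rules"
    return "unreachable for everyone: check origin"


# Usage (performs real network requests, so it is left commented out):
# url = "https://www.clientdomain.com/robots.txt"
# print(edge_verdict(fetch_status(url, "Mozilla/5.0"),
#                    fetch_status(url, "OAI-SearchBot")))
```

The CDN or WAF logs remain the authoritative record of what the edge did to a real crawler request; this check just tells you whether it is worth opening them.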

#The technical version: the house standard

Reduce the whole thing to two policies that cover almost every client, plus the ad line if they run ChatGPT ads. The default policy keeps a client findable and trainable, right for most small businesses whose content is public marketing. The protected policy keeps them findable but blocks training, right for clients with paid, licensed, or proprietary content.

# Search: allowed. This is what keeps the client in ChatGPT search.
User-agent: OAI-SearchBot
Allow: /

# Training: allowed. Fine for public marketing content.
User-agent: GPTBot
Allow: /

Sitemap: https://www.clientdomain.com/sitemap.xml

The default house standard: findable and trainable. Named groups so a restrictive wildcard cannot silently catch the search crawler.

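# Search: allowed. This is what keeps the client in ChatGPT search.
User-agent: OAI-SearchBot
Allow: /

# Training: blocked. Keeps paid, licensed, or proprietary content out of training.
User-agent: GPTBot
Disallow: /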

Sitemap: https://www.clientdomain.com/sitemap.xml

The protected variant: findable but not trainable. Swap in for clients with content worth keeping out of training.

Notice that OAI-SearchBot is explicitly allowed in both, which is the non-negotiable that protects the search visibility covered in the OAI-SearchBot agency playbook. The GPTBot line is where the client's training decision lives, which we take apart in the GPTBot agency guide. Being explicit rather than relying on defaults is deliberate: an explicit named group is what makes the policy survive a future wildcard disallow that a plugin or migration might introduce. The Sitemap line helps every crawler discover the client's pages, which matters more the larger the site.
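Because the standard has so few variants, it can be generated rather than hand-edited per client. The sketch below is one way to do that; the `house_standard` function and its parameters are hypothetical, and the `OAI-AdsBot` group is assumed to be a plain allow for clients who run ChatGPT ads.

```python
from urllib.robotparser import RobotFileParser


def house_standard(sitemap_url: str, trainable: bool = True, ads: bool = False) -> str:
    """Render the house-standard robots.txt for one client property.

    trainable=True gives the default policy, False gives the protected one.
    ads=True appends a group for the ad crawler (assumed to be a plain allow).
    """
    groups = [
        ("OAI-SearchBot", "Allow: /"),  # search stays allowed in every variant
        ("GPTBot", "Allow: /" if trainable else "Disallow: /"),
    ]
    if ads:
        groups.append(("OAI-AdsBot", "Allow: /"))

    blocks = [f"User-agent: {agent}\n{rule}\n" for agent, rule in groups]
    blocks.append(f"Sitemap: {sitemap_url}\n")
    return "\n".join(blocks)  # blank line between groups


protected = house_standard(
    "https://www.clientdomain.com/sitemap.xml", trainable=False, ads=True
)
print(protected)

# Round-trip check with the standard-library parser.
parser = RobotFileParser()
parser.parse(protected.splitlines())
print(parser.can_fetch("OAI-SearchBot", "https://www.clientdomain.com/"))  # True
print(parser.can_fetch("GPTBot", "https://www.clientdomain.com/"))         # False
```

Generating the file also removes the copy-paste risk: the sitemap URL is a required argument, so it cannot silently carry over from another client.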

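Writing the file is only the first step of the layered check. Verifying that it is live at the origin and confirming that the edge is not overriding it are the two steps most agencies skip, and they are where the silent failures hide.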

#A worked audit: finding the silent block

Here is what the audit looks like in practice, because the theory only matters if it catches real problems. You take on a new client whose ChatGPT presence is weaker than their traffic suggests it should be. You pull their live robots.txt from production and find a User-agent: * group with Disallow: /wp-admin/ and a few other paths, which is normal. Further down, though, there is a second User-agent: * group with Disallow: /, added by a security plugin during a past incident and never removed. There is no named OAI-SearchBot group anywhere. That means the search crawler falls under the wildcard, which now disallows everything, so the client is invisible in ChatGPT search and has been for months, with rankings and human traffic unaffected the whole time.

The fix is two lines: an explicit User-agent: OAI-SearchBot group with Allow: /, which overrides the wildcard for that crawler. You deploy it, wait a day for OpenAI to process the change, verify the file is live at the origin, and confirm the CDN is not also blocking. A week later the client starts appearing in ChatGPT answers again. From the outside it looks like magic; from the inside it was understanding precedence and knowing where to look. That is the difference an agency that understands this file makes, and it is exactly the kind of invisible-problem-solved story that earns trust and renewals.
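The silent-block condition in this audit can be checked mechanically. The sketch below uses a small hand-rolled group parser rather than `urllib.robotparser`, because Python's standard-library parser appears to keep only the first wildcard group and would miss the one the plugin added further down. The function names and file contents are hypothetical, and the check only looks for a blanket `Disallow: /`, not for narrower path rules.

```python
def parse_groups(robots_txt: str) -> dict[str, list[tuple[str, str]]]:
    """Collect (directive, path) rules per lowercased user-agent token.

    Groups that name the same agent are merged, so a second `User-agent: *`
    block further down the file is not lost.
    """
    groups: dict[str, list[tuple[str, str]]] = {}
    current: list[str] = []  # agents named by the group being read
    in_rules = False         # True once the current group has seen a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a user-agent line after rules starts a new group
                current, in_rules = [], False
            current.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            in_rules = True
            for agent in current:
                groups[agent].append((field, value))
    return groups


def silently_blocked(robots_txt: str, agent: str = "OAI-SearchBot") -> bool:
    """True when `agent` has no named group and the wildcard has `Disallow: /`."""
    groups = parse_groups(robots_txt)
    if agent.lower() in groups:
        return False
    return ("disallow", "/") in groups.get("*", [])


LIVE_FILE = """\
User-agent: *
Disallow: /wp-admin/

# Added by a security plugin during an incident, never removed.
User-agent: *
Disallow: /
"""

FIXED_FILE = LIVE_FILE + """
User-agent: OAI-SearchBot
Allow: /
"""

print(silently_blocked(LIVE_FILE))   # True
print(silently_blocked(FIXED_FILE))  # False
```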

#Packaging robots.txt as a standing deliverable

Because the right robots.txt is a standard rather than a creative choice, it packages cleanly. The onboarding version is a fixed audit: pull the live file for every client property, check for the silent-block conditions, confirm the named allow for the search crawler, apply the correct house-standard variant, and verify end to end. The ongoing version is monitoring: robots.txt files drift when sites get replatformed, when plugins update, and when security tools are added, so a client who was correct in the spring can be blocked by the autumn with no ranking change to warn anyone. Selling only the one-time audit leaves the recurring value on the table; the drift is where the retainer earns its keep.
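Drift monitoring does not need much machinery. One possible approach, sketched below with hypothetical function names, is to store a fingerprint of the rule-bearing lines at sign-off and compare it against the live file on each reporting cadence; fetching the live file and storing the baseline are left out.

```python
import hashlib


def fingerprint(robots_txt: str) -> str:
    """Hash the rule-bearing lines so comment and whitespace edits do not alert."""
    lines = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if line:
            lines.append(line)
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


def has_drifted(baseline_fingerprint: str, live_robots_txt: str) -> bool:
    """Compare today's live file against the fingerprint stored at sign-off."""
    return fingerprint(live_robots_txt) != baseline_fingerprint


SPRING = "User-agent: OAI-SearchBot\nAllow: /\n"
COSMETIC = "# reviewed in May\n" + SPRING + "\n"
AUTUMN = SPRING + "\nUser-agent: *\nDisallow: /\n"  # a tool appended a block

baseline = fingerprint(SPRING)
print(has_drifted(baseline, COSMETIC))  # False
print(has_drifted(baseline, AUTUMN))    # True
```

A drift flag is a prompt to re-run the audit, not a verdict: in the autumn file above the named group still protects the search crawler, but someone should know the file changed.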

Framed for the client, this is insurance on a discovery channel their own tools can silently switch off. That is a fundamentally more valuable positioning than editing a file, and it justifies a real fee rather than an hour of billable time. The report is simple to produce once you are pulling and checking the files on a cadence: for each property, is the standard in place and verified, did anything change, and is any action needed. That single recurring line item quietly protects the client from an entire class of invisible failure, and it differentiates your agency from the many that have never looked at this file at all.

#Why explicit always beats default

A principle worth adopting as house policy: always set the crawlers you care about explicitly, never rely on the default of not being blocked. It is technically true that a crawler with no matching disallow is allowed by default, so in a clean file, OAI-SearchBot would be fine without a named group. But client sites are not clean files, and they do not stay clean. Plugins add rules, developers ship broad blocks, migrations reset everything. An explicit named allow for OAI-SearchBot is a durable safeguard: it survives a future wildcard disallow because a named group overrides the wildcard, so the search crawler stays allowed even after someone adds a block that would otherwise have caught it.

This is why the house standard writes the search and training crawlers out explicitly even when the current file would technically work without them. You are not just setting today's behavior; you are hardening the client against tomorrow's accidental change. The cost is a few extra lines. The benefit is that the most important outcome, staying findable, is protected against the most common way it silently breaks. Making explicitness the default across your whole book means one fewer way for a client to quietly go invisible between your audits.

#The mistakes that make an agency look careless

Relying on a wildcard when you meant to allow search. Without a named OAI-SearchBot group, a broad disallow catches it. Always set the crawlers you care about explicitly.
Promising a robots.txt block of live fetches. ChatGPT-User does not obey the file for user actions. Do not claim you blocked it there.
Trusting the file when a CDN sits in front. Verify at the edge too, or you report a fix that is not live.
Declaring success same-day. The roughly 24-hour processing window means next-day verification, not a same-call done.
Copy-pasting one client's file to another without updating the Sitemap line and domain. A stale sitemap URL points crawlers at the wrong place.
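The last mistake in that list is the easiest to catch automatically. The sketch below, with a hypothetical `stale_sitemaps` helper and made-up domains, flags any Sitemap line whose host does not belong to the client the file was deployed for.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def stale_sitemaps(robots_txt: str, client_domain: str) -> list[str]:
    """Return Sitemap URLs whose host does not belong to `client_domain`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    domain = client_domain.lower()
    stale = []
    for url in parser.site_maps() or []:  # site_maps() is None when no line exists
        host = (urlparse(url).hostname or "").lower()
        if host != domain and not host.endswith("." + domain):
            stale.append(url)
    return stale


# A file copied from one client and deployed, unedited, on another.
COPIED = """\
User-agent: OAI-SearchBot
Allow: /

Sitemap: https://www.clientdomain.com/sitemap.xml
"""

print(stale_sitemaps(COPIED, "clientdomain.com"))     # []
print(stale_sitemaps(COPIED, "otherclient.example"))  # ['https://www.clientdomain.com/sitemap.xml']
```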

It is worth naming why this small file rewards an agency's attention out of proportion to its size. robots.txt sits at the intersection of the two highest-value AI-visibility decisions, findability and training, and it fails silently, which is the most dangerous combination there is. A silent failure with high stakes is exactly the kind of problem clients cannot catch themselves and will value you for catching. Rankings dashboards do not show it, human traffic does not reveal it, and the client has no reason to look. That is precisely why an agency that audits, standardizes, and monitors this file is delivering something the client genuinely cannot get any other way, and why it belongs in your standard offering rather than as an occasional favor. The effort is small and repeatable; the downside it prevents, a client quietly absent from a growing channel, is large.

#What changes by client size inside your book

Micro and solo clients (1 to 9): one domain, one policy, deploy and monitor. The owner-facing version is the micro-business robots.txt guide.
Small and mid clients (10 to 249): often path-scoped rules for protected sections, covered in the growing-business robots.txt guide.
Mid-market clients (250+): many properties and an edge you do not control, which is a governance problem covered in the mid-market robots.txt governance guide.

robots.txt is the control panel for search and training; the edge and page readiness handle live fetches. Get the file right, verify it end to end, and standardize it, and you have removed an entire class of silent client failures. Controlling access is step one; being the client the model actually names is the authority work in the answer engine optimization cornerstone.
