If you have seen a headline about AI companies scraping the web to train their models and wondered whether you should do something about it, this is the calm version of the answer. There is a specific crawler involved, you have a simple choice, and for most small businesses the choice is easier and lower-stakes than the panic implies. The one thing you must not do is accidentally hide yourself from customers while trying to protect yourself from robots.
The short version
OpenAI uses a crawler called GPTBot to collect web pages for training its models. It is a different crawler from OAI-SearchBot, the one that decides whether you show up in ChatGPT search. That means you have two separate switches, not one. You can let AI learn from your site or not, and, completely separately, you can be findable in ChatGPT search or not. The mistake to avoid is flipping both switches off when you only meant to flip one (OpenAI's crawler docs spell out the difference).
For most small businesses, here is the honest recommendation: let AI learn from your normal marketing pages, and make sure you stay findable. Your public website exists to be seen. Content that helps a model understand that you are a plumber in a specific city, with specific services, is not a secret you are protecting, it is the whole point of having a site. The businesses that need to block training are the ones with something genuinely private to protect, which most micro businesses do not have on their public site.
What "training" actually means for your little website
The word training sounds ominous, so it helps to know what actually happens. When AI learns from web pages, it is not saving a copy of your site in a file somewhere that it can hand out on request. It is adjusting patterns across billions of pages, so your content becomes one tiny influence on how the model writes and what it knows about your kind of business. A model is very unlikely to reproduce an ordinary homepage on request. Your words are a drop in an enormous bucket, blended in with everyone else's. For a normal small business website, this is much closer to your page being read and generally understood than to it being photocopied and resold.
That matters because most of the fear about AI training is aimed at that photocopy image, which is not what is happening to a plumber's service page or a cafe's menu. Your public website exists precisely so that people, and increasingly machines, can read it and understand what you offer. Being part of what an AI generally knows about local plumbers or neighborhood cafes is not a loss. It is, if anything, a small win, because it makes the tools your customers use a little more likely to understand and mention a business like yours.
Why letting AI learn from you can actually help
Here is the part the scary headlines leave out. When AI models have learned from lots of content about your industry and your area, they get better at talking about businesses like yours. A model that has read a great deal about, say, mobile dog grooming in mid-sized cities is better equipped to give a useful answer when someone asks about it, and to understand where a business like yours fits. By keeping your public marketing readable and open to crawlers, you are quietly contributing to, and benefiting from, the general knowledge these tools have about your line of work.
This is why, for the vast majority of small businesses, the instinct to block everything is counterproductive. You would be pulling your business out of the general understanding these tools are building, for no real gain, since your marketing pages hold no secret worth protecting. The businesses that thrive as more customers use AI tools are the ones that are easy to find and easy to understand, not the ones that hid. Unless you have something specific and valuable to protect, openness is the position that helps you.
When a small business SHOULD block training
There is a short list of situations where blocking makes sense, and it is worth checking honestly whether any apply to you before you bother.
- You sell content. If part of your business is selling courses, guides, templates, or memberships, the paid material is an asset. You may not want it feeding a model that could summarize it for free.
- You publish original research or methods. If your site includes a genuinely proprietary approach, a signature framework, or data you gathered, you might prefer to keep it out of training.
- You host client or private information. Anything sensitive should not be publicly readable at all, but if you have gated material, keep it gated and consider blocking training on it too.
- You have a specific reason you can name. Not a vague unease, but an actual asset or obligation. If you cannot name what you are protecting, blocking protects nothing and gains you nothing.
If none of those apply, and for a typical local service business none of them do, you can leave training on and stop thinking about it. Your marketing pages being part of what an AI knows about your industry is neutral to slightly helpful, not a threat.
How to block training if you decide to
If one of the situations above applies, blocking training means adding two lines to the same robots.txt file we talked about for search visibility. The important part is that you block only the learning crawler (GPTBot) and keep the finding crawler (OAI-SearchBot) allowed.
User-agent: OAI-SearchBot
Allow: /
User-agent: GPTBot
Disallow: /
That is the whole change. The first block keeps you findable, which is the job we covered in the micro-business guide to showing up in ChatGPT search. The second block tells GPTBot not to use your content for training. Add it, wait about a day for OpenAI to process the change, and you are done. Two things to remember: this only affects future training, not what models already learned, and it only affects OpenAI, since other AI companies have their own separate crawlers and controls.
If you do decide to block, a small sanity check a day later is worth the minute it takes. Re-open your robots.txt in a browser and confirm the GPTBot line is still there, since some website builders regenerate that file and can quietly drop a manual edit. And confirm your search crawler line is still present too, so you did not protect yourself into invisibility. Two lines, one glance, done.
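If you, or whoever looks after your site, would rather not eyeball the file, the same sanity check can be sketched in a few lines of Python. This is only an illustration, not an official OpenAI tool: it uses Python's standard `urllib.robotparser` module, the function name `check_switches` and the `example.com` address are made up for the example, and the robots.txt text is simply pasted in as a string.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rules from this section, pasted in as plain text.
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /
"""


def check_switches(robots_txt, page_url="https://example.com/"):
    """Report what each of the two crawlers may do under these rules.

    page_url is a placeholder; use a real page from your own site.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        # The "finding" switch: can the search crawler read the page?
        "findable": parser.can_fetch("OAI-SearchBot", page_url),
        # The "learning" switch: can the training crawler read the page?
        "training_allowed": parser.can_fetch("GPTBot", page_url),
    }


result = check_switches(ROBOTS_TXT)
print(result)
```

With the rules above, this prints `{'findable': True, 'training_allowed': False}`, which is the outcome you want after blocking. If a website builder has quietly dropped the GPTBot lines, `training_allowed` flips back to `True`. If `findable` ever shows `False`, fix that first. To check a live file instead of pasted text, `RobotFileParser` also has `set_url()` and `read()` methods.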
What blocking does not do
It is worth being clear so you do not expect too much. Blocking GPTBot does not remove anything a model already learned in the past, it does not affect other AI companies, and it does not make you invisible to ChatGPT search as long as you keep the finding switch on. It is a forward-looking, OpenAI-specific choice about training, nothing more. If your real worry is something private being public, the fix is not a training block, it is making sure the private thing is not on a public page in the first place.
And if your worry is the opposite, that you want to be as visible as possible to AI, then leaving training on is the move, and the more important work is making your pages clear and answer-shaped so a model can actually use them, which is the same content work that helps you show up.
What about all the other AI companies?
A fair question: this whole article is about OpenAI's GPTBot, but there are other AI companies too. That is true, and it is worth understanding so you do not think one change solves everything or, worse, that the problem is hopeless. Each major AI company runs its own crawler with its own name and its own control, so blocking OpenAI's GPTBot does nothing to the others, and allowing it does nothing to the others either. If you truly want to keep your content out of AI training broadly, that is a longer list of separate settings, not one switch.
For most small businesses, the practical takeaway is calming rather than alarming. If you have nothing you specifically need to protect, you do not need to chase every AI crawler across the internet. Leave them be, keep your site readable, and focus on being findable. If you do have something to protect, know that it is a per-company effort, and that the highest-value single step, if you only do one thing, is making sure you are not accidentally blocking the crawlers that make you findable to customers. Protection is optional and piecemeal. Findability is the thing that earns money, and it is one clear setting.
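If you do take on the per-company effort, the robots.txt pattern is the same one repeated: one pair of lines per crawler. Here is a rough sketch in Python that writes those pairs out for you. The function name `build_training_block` is invented for this example. The crawler names other than GPTBot and OAI-SearchBot (Google-Extended, ClaudeBot, CCBot) are examples that are not from this article, and names like these change over time, so treat the list as an assumption and check each company's own documentation before relying on it.

```python
def build_training_block(tokens, keep_allowed=("OAI-SearchBot",)):
    """Build robots.txt text: allow the finding crawlers, block the rest.

    tokens       -- user-agent names of training crawlers to block
    keep_allowed -- crawlers that must stay allowed so you remain findable
    """
    groups = []
    for token in keep_allowed:
        groups.append(f"User-agent: {token}\nAllow: /")
    for token in tokens:
        groups.append(f"User-agent: {token}\nDisallow: /")
    # A blank line between groups keeps the file easy to read.
    return "\n\n".join(groups) + "\n"


# Example list only (an assumption): verify every name against the
# crawler documentation of the company that runs it.
TRAINING_TOKENS = ["GPTBot", "Google-Extended", "ClaudeBot", "CCBot"]

robots_txt = build_training_block(TRAINING_TOKENS)
print(robots_txt)
```

The sketch lists the finding crawler first on purpose. However long the block list grows, the allow rule that keeps you visible to customers stays at the top where you can see it.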
A real example: the neighborhood bakery
Picture a small bakery with a simple website: a homepage, a menu, hours, location, and a short story about the owner. Should they block AI training? Walk it through. The menu, the hours, the location, the story: none of it is secret. All of it exists to be seen. If an AI model learns from it, the model becomes marginally better at knowing there is a bakery in that neighborhood that does, say, gluten-free sourdough. When a resident later asks an AI tool where to find gluten-free bread nearby, the business that was readable and understood is the one that can be surfaced. Blocking training would have removed the bakery from that general understanding in exchange for protecting a menu that was never a secret.
Now change one detail. Suppose the bakery also sells a paid online sourdough course. That course, the actual lessons, is the one thing worth protecting, because people pay for it. So the sensible move is to leave the public site open and readable, and block training only on the course pages. That is the whole decision in miniature: open by default, protect the specific thing you sell. Almost every micro business lands in exactly this shape once they stop and look at what they actually have.
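As a rough sketch of what "block training only on the course pages" could look like, here is the bakery's robots.txt with a check written in Python. Everything specific here is hypothetical: the `/course/` path, the `bakery.example` address, and the helper name `gptbot_may_read` are stand-ins for whatever your own site uses. It assumes all the paid lessons sit under one folder.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical layout: every paid lesson lives under /course/.
# Swap in the path your own paid section actually uses.
BAKERY_ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /course/
"""


def gptbot_may_read(robots_txt, url):
    """Return True if these rules let GPTBot fetch the given page."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("GPTBot", url)


menu_open = gptbot_may_read(BAKERY_ROBOTS_TXT, "https://bakery.example/menu")
course_open = gptbot_may_read(
    BAKERY_ROBOTS_TXT, "https://bakery.example/course/lesson-1"
)
print(menu_open, course_open)
```

This prints `True False`: the menu stays open to training and the course lessons do not. Keep in mind that robots.txt is a request that well-behaved crawlers honor, not a lock. Paid lessons should still sit behind your login or paywall.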
Do this once, then stop worrying about it
Unlike keeping yourself findable, which is worth a quick recheck now and then, the training decision is close to set-and-forget for a micro business. You make the call once, you either add the block line or you do not, and you get back to running the business. There is no ongoing maintenance, no monthly report, no dashboard to watch. The only reason to revisit it is if you start selling content you did not sell before, like launching a paid course or a members area, at which point you spend two minutes adding a block on that one new section.
So if the whole topic has been sitting in the back of your mind as one more thing you are probably doing wrong, you can put it down. For an ordinary small business, the correct action is small and final: confirm you are findable, decide whether you sell anything worth protecting, add one line if you do, and move on. The anxiety the headlines produce is out of all proportion to the actual decision, which for most micro businesses is genuinely easy once you separate the two switches and look at what you really have on your site.
The bottom line for a micro business
For nearly every small local business, the answer is: leave training on, keep finding on, and spend your limited time making your site clear and useful instead of worrying about crawlers. Block training only if you can name the paid or proprietary thing you are protecting. Either way, never let a well-meaning setting hide you from the customers who are now asking ChatGPT for a recommendation instead of typing into a search box. The finding switch is the one that pays your bills, so guard it.
If you run an agency or help other small businesses, the way to package this decision is in the agency version of this guide. And if you have grown past owner-run and have a real content asset to think about, step up to the growing-business training guide. Want a second set of eyes on your setup? Run a free discovery and we will check both switches for you.