Sitemaps and robots.txt in Next.js: Telling Crawlers and AI Bots What Actually Matters

Every crawler that visits your site — Googlebot, Bingbot, and now a growing crowd of AI bots — arrives with a budget: how many pages it’ll fetch before it moves on. Two small files decide how that budget gets spent. Your sitemap says “here is everything worth indexing, and when it last changed.” Your robots.txt says “go here, don’t bother going there.” Next.js generates both from code with a file convention, so they’re never stale and never hand-maintained. This is how to use them well, at any size.

#sitemap.ts: your index, generated from your content

Drop a `sitemap.ts` in your `app` directory that default-exports a function returning an array of URL entries, and Next serves it at `/sitemap.xml`. Because it’s code, you build the list from the same data that builds your pages — so a new blog post or location appears in the sitemap the moment it exists, with no separate step to forget.

// app/sitemap.ts
import type { MetadataRoute } from "next";
import { getAllPosts } from "@/lib/content";

export default async function sitemap(): Promise<MetadataRoute.Sitemap> {
  const posts = await getAllPosts();
  const postUrls = posts.map((p) => ({
    url: `https://acme.co/blog/${p.slug}`,
    lastModified: p.updatedAt ?? p.publishedAt,
    changeFrequency: "monthly" as const,
    priority: 0.7,
  }));

  return [
    { url: "https://acme.co", lastModified: new Date(), priority: 1 },
    { url: "https://acme.co/services", priority: 0.9 },
    ...postUrls,
  ];
}

`lastModified` is the field people skip and shouldn’t. It tells crawlers what changed since their last visit, so they re-fetch updated pages and skip the ones that didn’t move — spending your crawl budget where it matters. Feed it a real modified date from your content, not `new Date()` on every page, or you’re telling Google everything changed every day, which it will stop believing.
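One way to keep that field honest is to emit it only when the content carries a real date. The sketch below is a minimal illustration, not part of Next.js: `SitemapEntry` is a local stand-in for the entry type, and `Post` and `toSitemapEntry` are hypothetical names.

```typescript
// Local stand-in for a Next.js sitemap entry, so the sketch runs on its own.
type SitemapEntry = { url: string; lastModified?: Date };

// Hypothetical content shape: either date may be missing.
type Post = { slug: string; updatedAt?: string; publishedAt?: string };

// Build an entry that carries lastModified only when a real, parseable date exists.
function toSitemapEntry(post: Post, baseUrl: string): SitemapEntry {
  const url = `${baseUrl}/blog/${post.slug}`;
  const raw = post.updatedAt ?? post.publishedAt;
  const parsed = raw ? new Date(raw) : undefined;
  // An unparseable string yields an invalid Date; treat it as unknown
  // rather than falling back to "now".
  if (!parsed || isNaN(parsed.getTime())) return { url };
  return { url, lastModified: parsed };
}

const posts: Post[] = [
  { slug: "launch", updatedAt: "2024-05-01", publishedAt: "2024-01-10" },
  { slug: "draft-notes" }, // no dates at all
];
const entries = posts.map((post) => toSitemapEntry(post, "https://acme.co"));
```

Leaving the field off for undated pages is a design choice: a missing date says nothing, while a fabricated one says something false.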

#robots.ts: the front door policy

A sibling `robots.ts` file generates `/robots.txt`. Return rules for which user-agents may crawl which paths, and point crawlers at your sitemap so they can find it. The most common real use is fencing off routes that shouldn’t be indexed — admin, cart, search-result pages, staging — while leaving the money pages wide open.

// app/robots.ts
import type { MetadataRoute } from "next";

export default function robots(): MetadataRoute.Robots {
  return {
    rules: [
      { userAgent: "*", allow: "/", disallow: ["/admin", "/cart", "/api"] },
    ],
    sitemap: "https://acme.co/sitemap.xml",
  };
}

#The noindex vs disallow distinction that trips everyone

These do different jobs and people conflate them constantly. `disallow` in robots.txt says “don’t crawl this.” `noindex` (set via the Metadata API’s `robots` field on the page) says “you may crawl this, but don’t put it in the index.” If you want a page gone from search, you need `noindex`, and the crawler has to be allowed to reach the page to see it. Block it in robots.txt instead and Google may keep a bare listing it can’t update. Use `disallow` for crawl-budget hygiene on low-value paths; use `noindex` to actually remove something.

// app/internal-tool/page.tsx: keep it out of the index
import type { Metadata } from "next";

export const metadata: Metadata = {
  robots: { index: false, follow: true },
};
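The two settings can cancel each other out: a page that is both disallowed and marked `noindex` never gets fetched, so the `noindex` is never read. A small check like the following can catch that. It is a sketch under simplifying assumptions: plain prefix matching only (no `*` or `$` wildcards, no `Allow` precedence), and both input lists are hypothetical.

```typescript
// Hypothetical inputs: the prefixes disallowed in robots.ts and the routes marked noindex.
const disallowedPrefixes = ["/admin", "/cart", "/api"];
const noindexPaths = ["/internal-tool", "/admin/reports"];

// Return the noindex routes a compliant crawler can never reach.
// robots.txt rules match by path prefix, so a simple prefix test is used here.
function findHiddenNoindex(prefixes: string[], paths: string[]): string[] {
  return paths.filter((path) =>
    prefixes.some((prefix) => path.indexOf(prefix) === 0),
  );
}

const conflicts = findHiddenNoindex(disallowedPrefixes, noindexPaths);
// `/admin/reports` is flagged: it sits under a disallowed prefix, so its noindex is invisible.
```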

#AI bots: the new entry in your robots.txt

There’s a decision here that didn’t exist a few years ago: whether to allow the AI crawlers (the ones that fetch pages to train models or answer questions) the same access as search bots. For most businesses whose goal is to be found and cited, the answer is yes — you want to be in the answer. But if you publish proprietary content you don’t want ingested, you can name specific AI user-agents in robots.txt and disallow them. It’s a real strategy choice, not a default, and it’s worth making deliberately rather than by omission. The companion move — actively inviting AI systems with an `llms.txt` — is its own topic, covered in AGENTS.md and llms.txt.
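If you do decide to fence off AI crawlers, the rule set might look like the sketch below. The user-agent tokens are an illustrative, non-exhaustive list that changes over time, so check each vendor’s documentation before relying on it. `RobotsRule` is a local stand-in for the rule shape in `app/robots.ts`, and `buildRules` is a hypothetical helper.

```typescript
// Local stand-in for the rule shape returned from app/robots.ts.
type RobotsRule = {
  userAgent: string | string[];
  allow?: string | string[];
  disallow?: string | string[];
};

// Illustrative only: tokens some AI vendors publish for their crawlers or training controls.
const AI_USER_AGENTS = ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended", "PerplexityBot"];

// Make the AI-bot decision an explicit flag instead of an omission.
function buildRules(blockAiCrawlers: boolean): RobotsRule[] {
  const rules: RobotsRule[] = [
    { userAgent: "*", allow: "/", disallow: ["/admin", "/cart", "/api"] },
  ];
  if (blockAiCrawlers) {
    // A group naming a specific agent takes precedence over "*" for that agent.
    rules.push({ userAgent: AI_USER_AGENTS, disallow: "/" });
  }
  return rules;
}

const openPolicy = buildRules(false);
const closedPolicy = buildRules(true);
```

Keep in mind that robots.txt is advisory: it only affects bots that choose to honor it.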

#generateSitemaps: when one file isn’t enough

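A single sitemap tops out at 50,000 URLs (or 50 MB uncompressed), and even below that, a giant flat sitemap is harder for both you and Google to reason about. Next.js supports splitting with `generateSitemaps`: you return a list of sitemap IDs, and Next serves each one as its own file at a path like `/products/sitemap/0.xml`. As far as the documented behavior goes, Next does not generate a sitemap index for those files, so list them in `robots.ts` or serve an index yourself. The natural split is by section (products, blog, locations) so each part re-crawls on its own rhythm.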

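// app/products/sitemap.ts
import type { MetadataRoute } from "next";
// Hypothetical data helpers; swap in your own catalog queries.
import { countProductPages, getProducts } from "@/lib/products";

export async function generateSitemaps() {
  const pages = await countProductPages(); // e.g. 4 chunks of 50k
  return Array.from({ length: pages }, (_, id) => ({ id }));
}

// Newer Next.js releases may pass `id` asynchronously; check the docs for your version.
export default async function sitemap({ id }: { id: number }): Promise<MetadataRoute.Sitemap> {
  const products = await getProducts({ page: id, size: 50000 });
  return products.map((p) => ({
    url: `https://acme.co/products/${p.slug}`,
    lastModified: p.updatedAt,
  }));
}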

Google’s own sitemap documentation spells out the limits and the index format if you need to go deeper. Most businesses never approach 50,000 URLs, but if you run a real catalog or a large multi-location footprint, build the split before you cross the ceiling, not after an oversized file starts getting rejected.
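If you need a sitemap index over the chunked files, it is a small XML document you can build yourself. The following is a sketch under stated assumptions: `buildSitemapIndex` and `chunkCount` are hypothetical helpers, and the chunk URLs assume the `/products/sitemap/<id>.xml` pattern. In a Next.js app the string could be returned from a route handler with an XML content type.

```typescript
// Escape the characters that are special in XML text content.
function escapeXml(value: string): string {
  return value.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Build a sitemap index document in the sitemaps.org format.
function buildSitemapIndex(sitemapUrls: string[]): string {
  const items = sitemapUrls
    .map((loc) => `  <sitemap><loc>${escapeXml(loc)}</loc></sitemap>`)
    .join("\n");
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    items,
    "</sitemapindex>",
  ].join("\n");
}

// Number of chunk files needed at 50,000 URLs per file.
function chunkCount(totalUrls: number, perFile = 50000): number {
  return Math.ceil(totalUrls / perFile);
}

// Example: a 180,000-product catalog needs 4 chunk files.
const chunkUrls: string[] = [];
for (let id = 0; id < chunkCount(180000); id++) {
  chunkUrls.push(`https://acme.co/products/sitemap/${id}.xml`);
}
const indexXml = buildSitemapIndex(chunkUrls);
```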

#How to confirm crawlers are actually using it

Shipping a sitemap isn’t the same as it working. Submit it once in Google Search Console under the Sitemaps report, which shows whether Google could read the file and how many URLs it discovered. To see how many of those were indexed, open the Page indexing report and filter it to that sitemap. The gap between “discovered” and “indexed” is the interesting number: if you submitted 500 URLs and Google indexed 300, the other 200 are either not crawled yet or being judged too thin, duplicate, or low-value to bother with. Either way, it’s a signal worth acting on. The robots.txt documentation from Google is the companion reference for understanding how your allow/disallow rules interact with what actually gets crawled. Check the Crawl Stats report periodically to see where Googlebot is spending its budget; if it’s hammering faceted filter URLs or an internal search you meant to disallow, your robots.txt has a gap.

#The duplicate-URL trap that wastes crawl budget

The most common way a growing site quietly burns crawl budget is URL parameters that create infinite near-duplicate pages: `?sort=price`, `?color=blue`, `?page=2`, session IDs, tracking params. Each is a distinct URL a crawler may try to fetch and index, and together they can balloon a 200-page site into tens of thousands of crawlable variants that dilute your ranking signals across duplicates. The fix is layered: set a canonical URL on the base page (covered in the metadata playbook) so variants point back to one authority, keep those variants out of your sitemap so you’re not actively advertising them, and disallow the truly worthless parameter paths in robots.txt. Get this right and Googlebot spends its visits on your real pages instead of a combinatorial explosion of filters.
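The canonical half of that fix comes down to mapping every variant back to one URL. A minimal sketch, assuming a hypothetical list of parameters that never change what the page is (yours will differ):

```typescript
// Hypothetical list: parameters that only re-sort, filter, or track.
const NON_CANONICAL_PARAMS = [
  "sort",
  "color",
  "sessionid",
  "utm_source",
  "utm_medium",
  "utm_campaign",
];

// Strip those parameters (and any fragment) so all variants share one canonical URL.
function canonicalUrl(rawUrl: string): string {
  const url = new URL(rawUrl);
  for (const name of NON_CANONICAL_PARAMS) {
    url.searchParams.delete(name);
  }
  url.hash = "";
  return url.toString();
}

const fromFilter = canonicalUrl("https://acme.co/products?sort=price&color=blue&utm_source=news");
const keepsRealParams = canonicalUrl("https://acme.co/products?category=shoes&sort=price");
```

Parameters that do change the content, like `category` here, survive. In a Next.js page the result could feed the `alternates.canonical` metadata field, and the same function doubles as a filter for what goes into the sitemap.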

#What Google actually uses (and what it ignores)

A word on the sitemap fields, because there’s folklore here. The `priority` and `changeFrequency` values in a sitemap are widely misunderstood: Google has said publicly that it largely ignores them as ranking or crawl-scheduling signals. Setting `priority: 1.0` on every page doesn’t make Google crawl them more, and inflating `changeFrequency` to “always” on static pages just teaches Google your signals aren’t trustworthy. The field that genuinely matters is `lastModified`, and only when it’s honest — a real edit date Google can use to decide whether to re-fetch. So keep your sitemap lean and truthful: accurate `lastModified`, no priority theater, and only URLs you actually want indexed. A sitemap is a set of recommendations, not commands, and its credibility is the currency.

Beyond the standard sitemap, there are specialized formats worth knowing about even if most sites don’t need them. Image and video sitemaps help media-heavy sites get their media discovered and eligible for image and video search. News sitemaps matter only if you’re a news publisher in Google News. For the vast majority of businesses, a single clean sitemap of your real pages is the whole job, but if visual search or video is a real channel for you, the specialized sitemaps are how you tell Google what to index there. When in doubt, start with the basic sitemap done well; it covers the ninety percent case, and you can add the specialized ones the day you have a reason.

The discipline that matters at every size is the same: your sitemap is the list of pages you’re proud of, and if a URL doesn’t belong on that list, that’s a signal it probably shouldn’t be indexable at all. Used that way, the sitemap stops being a passive dump of every route and becomes an editorial decision, a curated map of what you want found, which is exactly how a crawler with a limited budget should be steered.

#What this means for your business

The files are trivial to generate. The judgment is in what you include, exclude, and prioritize.

#For agencies

Standardize both files in your starter so every client site ships a content-driven sitemap and a sane robots policy from day one. The recurring bug you’re preventing is the staging robots.txt that ships to production with `Disallow: /` still in it. That is the single fastest way to make a client’s entire site drop out of Google, and a mistake that has kept real sites out of search results for weeks before anyone noticed traffic had cratered. Generate robots from an environment check so production is always open and staging is always closed, and add a smoke test that fails the deploy if production returns a site-wide disallow. That one guard rail is worth more than any report.
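Both guard rails fit in a few lines. This is a sketch, not a drop-in: `buildRobots` and `hasSiteWideDisallow` are hypothetical helpers, the types are local stand-ins for the Next.js ones, and how you detect the environment depends on your host. `NODE_ENV` alone usually won’t do, because staging builds are production builds too.

```typescript
type RobotsRule = { userAgent: string; allow?: string; disallow?: string | string[] };
type RobotsConfig = { rules: RobotsRule[]; sitemap?: string };

// Assumption: the deploy environment arrives as a string such as "production" or "preview".
function buildRobots(environment: string): RobotsConfig {
  if (environment !== "production") {
    // Anything that is not production stays closed to crawlers.
    return { rules: [{ userAgent: "*", disallow: "/" }] };
  }
  return {
    rules: [{ userAgent: "*", allow: "/", disallow: ["/admin", "/cart", "/api"] }],
    sitemap: "https://acme.co/sitemap.xml",
  };
}

// Smoke check: does the served robots.txt disallow the whole site for all agents?
function hasSiteWideDisallow(robotsTxt: string): boolean {
  let appliesToAll = false;
  let lastWasAgent = false;
  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.split("#")[0].trim(); // drop comments
    const separator = line.indexOf(":");
    if (separator === -1) continue;
    const field = line.slice(0, separator).trim().toLowerCase();
    const value = line.slice(separator + 1).trim();
    if (field === "user-agent") {
      // Consecutive User-agent lines belong to the same group.
      appliesToAll = (lastWasAgent && appliesToAll) || value === "*";
      lastWasAgent = true;
      continue;
    }
    lastWasAgent = false;
    if (field === "disallow" && appliesToAll && value === "/") return true;
  }
  return false;
}
```

In a deploy pipeline, the check would run against the text fetched from the production `/robots.txt` and fail the build when it returns `true`.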

#For micro businesses (1–5 people)

Your whole site is probably a handful of pages, so the sitemap is short and the robots file is simple: allow everything, point at the sitemap, done. The one thing to verify is that your sitemap actually lists your real pages and that you submit it once in Google Search Console. That submission tells Google directly that your pages exist, and it usually gets a new site crawled sooner than waiting for links to be discovered. If you’ve ever wondered why a new page isn’t showing up in search, a missing or unsubmitted sitemap is the first thing to check. It’s a five-minute task with an outsized payoff for a small site.

#For small businesses (SMEs)

You have enough pages that crawl budget starts to matter — especially if you generate location or service-combination pages. Use `lastModified` honestly so Google re-crawls what changed, and disallow the low-value routes (internal search results, faceted filter URLs, print views) that otherwise eat budget and create duplicate-content noise. Keep your sitemap driven by your content source so it never drifts from reality. If you’ve got thin auto-generated pages, the sitemap is also a good forcing function: if a page isn’t good enough to list in your sitemap, it probably shouldn’t be indexable at all.

#For mid-size companies

At thousands to millions of URLs, sitemaps become an operational tool. Split them by section with a sitemap index, use `lastModified` to drive efficient re-crawling of a catalog that changes constantly, and monitor crawl stats in Search Console to see where Googlebot is spending time it shouldn’t. Your robots.txt becomes a real policy document (which sections, which parameters, which bots), and it needs ownership and change control, because a careless edit at your scale is a site-wide incident. This is also where the AI-bot access decision has real strategic weight: which content do you want feeding AI answers, and which is proprietary enough to fence off?

#How we run this at Frontend Horizon

Every FH site generates its sitemap from the same content modules that build the pages, sets `lastModified` from real edit dates, and ships an environment-aware robots.txt that is open in production and closed on staging — with a check so it can’t ship the wrong way. If your site isn’t getting crawled the way you expect, or you’re not sure what your robots.txt is actually blocking, run a free discovery and we’ll audit it. Next up: what happens to all these carefully-indexed URLs when you redesign or rebrand — and how to move them without losing your rankings — in Redirects in Next.js.