Anthropic API Prompt Caching: The Pattern That Saves Thousands on Content Generation

Prompt caching cut our per-page content-gen cost by about 86%. Here’s how and where it works.

John Cravey with Elevi · Founder · 4 min read

Anthropic’s prompt caching, introduced in 2024 and refined in the 2026 releases, is the API feature that most teams aren’t using and should be. For workloads where the same long context (a brand-voice doc, a knowledge base, a product catalog) is sent on every request, caching can cut cost and latency by as much as 80-90%. Here’s how it works, where it fits, and the exact wiring we use across Frontend Horizon (FH) content pipelines.

What prompt caching actually does

Normally, every API call to Claude is billed per token for the entire prompt (tool definitions, system prompt, and messages), so long context is expensive. Caching lets you mark a portion of the prompt as cacheable. The first request processes and caches that portion; subsequent requests within the cache window reuse the cached state at a fraction of the cost (~10% of the normal input price for cached tokens) and with much lower latency.

The cache lives for 5 minutes by default, refreshing each time the cached content is hit. There’s also a 1-hour cache TTL option for less-frequent but more-persistent workloads.

When caching wins

  • Content generation pipelines where every call ships the same brand-voice doc, content guidelines, or example library as context.
  • RAG (retrieval-augmented generation) where the same large document set is queried repeatedly within a session.
  • Chat applications where the conversation history grows but a stable system prompt + tool definitions stays constant.
  • Code review pipelines where the codebase context is consistent across requests.

When caching doesn’t help

  • Single-shot calls where each prompt is unique.
  • Workloads with bursty, infrequent calls where the cache expires between requests.
  • Tiny prompts where the absolute savings don’t justify the marginal complexity.

The wiring: cache_control on the right block

Mark a content block in the prompt with `cache_control: { type: "ephemeral" }` to make it cacheable. Everything up to and including that block forms the cached prefix and must be identical from request to request; everything after it stays variable per request.

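import Anthropic from "@anthropic-ai/sdk";

// The client reads ANTHROPIC_API_KEY from the environment by default.
const client = new Anthropic();

const response = await client.messages.create({
  model: "claude-opus-4-7",
  max_tokens: 2048,
  system: [
    {
      type: "text",
      text: BRAND_VOICE_GUIDE, // long stable doc, identical on every call
      // Cache breakpoint: the prompt up to and including this block is cached.
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [
    {
      role: "user",
      // Variable per request; sits after the breakpoint, so it is never cached.
      content: `Write a blog intro for: ${TITLE}`,
    },
  ],
});

// Non-zero on a cache hit; on the first call expect cache_creation_input_tokens instead.
console.log(response.usage.cache_read_input_tokens);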

The FH content pipeline example

We use Claude for first-draft generation of location pages and service pages. Every call ships the same 4,000-word brand voice doc, the same 8,000 words of example FH content, and a small per-page variable section describing the city + project type. The stable part is cached; the variable part is small. Result: per-page generation cost dropped from $0.42 to $0.06 — an 86% reduction. Cumulative savings over the last 90 days across the client book: $4,200.

Cache TTL: 5 minutes vs 1 hour

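Default is 5 minutes, refreshed on each hit. Good for active sessions. The 1-hour option (`cache_control: { type: "ephemeral", ttl: "1h" }`) is for workloads that need longer persistence: overnight batch processing, report jobs, anything where the gap between requests exceeds 5 minutes but stays under an hour, and the cached content is stable enough. The longer TTL is priced higher on the write (about 2x the normal input rate at the time of writing, versus about 1.25x for the 5-minute cache), so check the current pricing page before defaulting to it.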
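A minimal sketch of that decision as code. The helper name `cacheControlForGap` and the idea of choosing by expected gap are our own illustration, not part of the SDK; only the two `cache_control` shapes come from the text above.

```typescript
// Shape of the cache_control field as used in this article.
type CacheControl = { type: "ephemeral"; ttl?: "1h" };

const FIVE_MINUTES_S = 5 * 60;
const ONE_HOUR_S = 60 * 60;

// Hypothetical helper: pick a TTL from the expected gap (in seconds)
// between consecutive requests that share the same cached prefix.
function cacheControlForGap(expectedGapSeconds: number): CacheControl | undefined {
  if (expectedGapSeconds < FIVE_MINUTES_S) {
    return { type: "ephemeral" }; // default 5-minute window is enough
  }
  if (expectedGapSeconds < ONE_HOUR_S) {
    return { type: "ephemeral", ttl: "1h" }; // needs the longer window
  }
  // Gap is longer than either TTL: the entry would expire before reuse,
  // so paying the cache-write surcharge buys nothing.
  return undefined;
}

console.log(cacheControlForGap(30));
console.log(cacheControlForGap(20 * 60));
console.log(cacheControlForGap(2 * 60 * 60));
```

Gaps that land exactly on a boundary are treated as expired here, which is the conservative choice.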

Cost math

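Cached tokens cost ~10% of the normal input token rate, but writing to the 5-minute cache costs ~25% more than a normal input token. One write plus a single read comes to about 1.35x the price of one uncached pass over the prefix, against 2x for two uncached passes, so caching pays off from the first cache hit, which covers basically every batch workload. A prefix that is written and never read again costs more than not caching at all.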
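The arithmetic can be sketched as follows. The multipliers are the approximate figures quoted above, and the result is expressed in units of "one uncached pass over the cached prefix", so it ignores the variable part of the prompt and all output tokens. The function name `prefixCost` is ours.

```typescript
// Approximate price multipliers relative to a normal input token.
const CACHE_WRITE_MULTIPLIER = 1.25; // first request writes the cache
const CACHE_READ_MULTIPLIER = 0.1; // later requests read it

interface PrefixCost {
  cached: number; // total prefix cost with caching
  uncached: number; // total prefix cost without caching
  savings: number; // fraction saved; negative means caching cost more
}

// Cost of sending the same prefix `requests` times, assuming every
// request after the first lands inside the cache window.
function prefixCost(requests: number): PrefixCost {
  if (requests < 1) return { cached: 0, uncached: 0, savings: 0 };
  const cached = CACHE_WRITE_MULTIPLIER + (requests - 1) * CACHE_READ_MULTIPLIER;
  const uncached = requests;
  return { cached, uncached, savings: 1 - cached / uncached };
}

for (const n of [1, 2, 50]) {
  const { cached, uncached, savings } = prefixCost(n);
  console.log(
    `${n} requests: ${cached.toFixed(2)} cached vs ${uncached.toFixed(2)} uncached ` +
      `(${(savings * 100).toFixed(1)}% saved on the prefix)`
  );
}
```

A single request comes out more expensive with caching, two requests already come out cheaper, and the saving on the prefix approaches 90% as the number of reads grows.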

Measuring cache hit rate

Every API response includes `usage.cache_read_input_tokens` and `usage.cache_creation_input_tokens`. The hit rate is cache reads divided by reads plus creations, summed over the job. A healthy cache hit rate is above 90% for batch workloads. Below that, your cache is probably expiring between requests or the prefix is changing: shorten the gap, use the 1-hour TTL, or batch more aggressively.
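A small sketch of that measurement. The two field names are the ones from the API response; the `CacheUsage` type and the `cacheHitRate` helper are our own, and the token-weighted definition of "hit rate" is one reasonable choice rather than an official metric.

```typescript
// Subset of the `usage` object we care about. The cache fields are
// treated as optional so missing or null values do not break the sum.
interface CacheUsage {
  cache_read_input_tokens?: number | null;
  cache_creation_input_tokens?: number | null;
}

// Token-weighted hit rate across a job: reads / (reads + creations).
// Returns 0 when no cacheable tokens were seen at all.
function cacheHitRate(usages: CacheUsage[]): number {
  let read = 0;
  let created = 0;
  for (const u of usages) {
    read += u.cache_read_input_tokens ?? 0;
    created += u.cache_creation_input_tokens ?? 0;
  }
  const total = read + created;
  return total === 0 ? 0 : read / total;
}

// Example: one cache write followed by 49 reads of a 16,000-token prefix
// (the token count is illustrative).
const job: CacheUsage[] = [
  { cache_creation_input_tokens: 16000, cache_read_input_tokens: 0 },
  ...Array.from({ length: 49 }, () => ({
    cache_creation_input_tokens: 0,
    cache_read_input_tokens: 16000,
  })),
];
console.log(`hit rate: ${(cacheHitRate(job) * 100).toFixed(1)}%`);
```

In a real pipeline you would collect `response.usage` from each call and pass the array to the helper at the end of the run.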

Common mistakes

  • Caching the wrong block. The cache key is everything before and including the cached block. If your prompt structure changes between requests (e.g., reordering messages), the cache invalidates.
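  • Caching too small a block. Below the minimum cacheable length (~1024 tokens on most models, higher on some), the block is not cached at all: the request still succeeds, but at the full uncached price.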
  • Forgetting to refresh the cache before TTL expires. For batch workloads, structure your job so consecutive calls hit within the cache window.
  • Caching tool definitions but not the system prompt (or vice versa). Cache the largest stable portion.
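The TTL mistake in particular is easy to check before a job runs. The sketch below is a simplified model, not something the API provides: it assumes each call either hits or rewrites the cache, so the window restarts on every call, and it treats a gap equal to the TTL as expired. The name `countCacheWrites` is hypothetical.

```typescript
// Given planned call times (ms, ascending) and a TTL (ms), estimate how
// many calls will pay for a cache write instead of a cheap cache read.
function countCacheWrites(callTimesMs: number[], ttlMs: number): number {
  if (callTimesMs.length === 0) return 0;
  let writes = 1; // the first call always writes the cache
  for (let i = 1; i < callTimesMs.length; i++) {
    const gap = callTimesMs[i] - callTimesMs[i - 1];
    // Every call refreshes (hit) or recreates (miss) the entry, so only
    // the gap since the previous call matters.
    if (gap >= ttlMs) writes++;
  }
  return writes;
}

const MINUTE = 60_000;
// Three calls a minute apart, then a fourth after an 8-minute pause.
const schedule = [0, 1 * MINUTE, 2 * MINUTE, 10 * MINUTE];
console.log(countCacheWrites(schedule, 5 * MINUTE)); // 5-minute TTL
console.log(countCacheWrites(schedule, 60 * MINUTE)); // 1-hour TTL
```

With the 5-minute window the pause forces a second write; with the 1-hour window only the first call writes.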

Pairing with batch processing

For overnight runs (generate 50 location pages), structure the job so calls fire within minutes of each other. Use the 1-hour TTL. The first call writes to cache; the next 49 read from it. Total wall time: 8-15 minutes for 50 pages. Total cost: ~$3 instead of ~$21.
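One way to structure such a job, sketched under assumptions: `generate` stands in for whatever function wraps your `client.messages.create` call, and `runBatch` is our own name. The pattern relies on the documented behavior that a cache entry only becomes available once the first response has started, so requests fired in parallel with the very first one will not hit the cache.

```typescript
// Run the first item alone so it writes the cache, then fan out the rest
// so they can read from it. `generate` is injected to keep this sketch
// independent of any particular client.
async function runBatch<T, R>(
  items: T[],
  generate: (item: T) => Promise<R>
): Promise<R[]> {
  if (items.length === 0) return [];
  // Warm-up call: wait for it to finish before sending anything else.
  const firstResult = await generate(items[0] as T);
  // Remaining calls run concurrently; wrap in a lambda so map's extra
  // index argument is not passed through to `generate`.
  const restResults = await Promise.all(
    items.slice(1).map((item) => generate(item))
  );
  return [firstResult, ...restResults];
}

// Usage with a stand-in generator (a real one would call the API).
const pages = ["austin-kitchen", "dallas-bath", "houston-deck"];
runBatch(pages, async (slug) => `draft for ${slug}`).then((drafts) => {
  console.log(drafts);
});
```

For 50 pages you would probably also cap the fan-out concurrency to stay inside your rate limits; that is left out here to keep the warm-up step visible.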

Caching across users (privacy considerations)

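Caches are scoped to your Anthropic account (organization or workspace level; check the current docs), not to your end users. If you’re running multi-tenant generation, every tenant whose request starts with the identical prefix reads the same cache entry. A hit requires an exact prefix match, so a request can only reuse content it already contains. For most B2B use cases this is fine. For sensitive workloads where one tenant’s data shouldn’t sit in a shared cache entry, structure your prompts so the tenant-specific data lives outside the cached block.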
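A sketch of that layout, using locally defined types that mirror the request fields shown earlier. `buildTenantRequest` and its parameter names are hypothetical; the point is only that the shared guide is the sole content before the cache breakpoint.

```typescript
interface TextBlock {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
}

interface RequestParts {
  system: TextBlock[];
  messages: { role: "user"; content: string }[];
}

// Shared, non-sensitive material goes in the cached system block.
// Tenant-specific material goes in the user message, after the
// breakpoint, so it is never part of the cached prefix.
function buildTenantRequest(
  sharedGuide: string,
  tenantBrief: string,
  task: string
): RequestParts {
  return {
    system: [
      { type: "text", text: sharedGuide, cache_control: { type: "ephemeral" } },
    ],
    messages: [{ role: "user", content: `${tenantBrief}\n\n${task}` }],
  };
}

const guide = "House style: short sentences, active voice.";
const a = buildTenantRequest(guide, "Tenant A: roofing, Austin", "Write an intro.");
const b = buildTenantRequest(guide, "Tenant B: plumbing, Dallas", "Write an intro.");

// Identical cached prefix, different uncached suffix.
console.log(JSON.stringify(a.system) === JSON.stringify(b.system));
console.log(a.messages[0].content !== b.messages[0].content);
```

Spread the returned object into your `client.messages.create` call alongside `model` and `max_tokens`.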

When NOT to cache for cost reasons alone

Caching adds slight prompt structure complexity. For workloads under $50/month in API spend, the engineering time to set up caching may exceed the savings. Cache when the workload is scaled enough that 80% cost reduction is meaningful.

How this lands across FH client work

Every FH AI-assisted content pipeline uses prompt caching: location page generation, blog draft generation, image alt-text generation, lead-scoring enrichment. The combined cost reduction across the client book is meaningful — we’re running 6x the volume of AI-assisted content this year compared to last for roughly the same monthly API spend. If you’re running Claude in production without caching, book a consultation — the wiring is a half-day engagement with immediate measurable savings.

Written by
John Cravey
Founder

Founder of Frontend Horizon. Writes most of the long-form work on the FH blog.
