Cost-Effective AI for Agencies: Right-Sizing Models Across Client Workloads

If you run AI across a book of clients, your model choice is a margin decision, not a technical one. Anthropic ships three Claude models at very different price points, and the reflex is to reach for the biggest one and stop thinking about it. That reflex is expensive. On a single account it hides in the noise. Across a whole client book it is the difference between a productized AI service that prints margin and one that leaks it. This is the practitioner's version of the model-selection call: a simple tiering rubric, the math on how model choice changes your unit economics, and where a platform partner takes the routing off your plate so your team does not have to think about it on every job.

#The three tiers, in plain terms

#Draft, judge, or bulk

Anthropic's lineup is three models: Opus, Sonnet, and Haiku. Opus is the most capable and the most expensive. Sonnet is the daily driver. Haiku is fast and cheap. They sit several times apart in cost per token, so the choice is not cosmetic. The full per-workload breakdown lives in the model-selection guide, and the current numbers are on Anthropic's pricing page. For agency work you do not need to memorize the price sheet. You need one mental model that any account lead can apply without a meeting.

Sort every recurring AI job into one of three buckets: draft, judge, or bulk. Draft work produces the thing a human then refines: a blog first draft, an outline, a proposal section. Judge work makes a call that has to be right: a code review on a complex change, a quality check on client-facing copy, a decision about what to escalate. Bulk work is high-volume and low-stakes per item: classification, alt-text, tagging, lead scoring, short summaries. Each bucket maps to a tier, and once your team learns the three buckets, the model choice stops being a debate.

Draft the voice-critical, human-refined work on the top tier when the voice matters, and on the middle tier when a template plus a human edit carries it. This is where your brand-voice and long-form work lives.
Judge the decisions that have to be exactly right on the top tier. Multi-step reasoning, complex code review, and anything where a wrong call costs a client relationship. You are paying for correctness here, and it is worth it.
Bulk the high-volume, low-stakes work on the cheapest tier. Classification, alt-text, tagging, short Q&A over a small context. Run it a thousand times a day and the per-item cost is what decides whether the service is profitable.

#Why defaulting to the biggest model is a margin leak

Most teams new to Claude reach for Opus because it is the best one. For a lot of jobs that is paying several times the cost for output a smaller model produces just as well. On one client that overspend is invisible. Multiply it across every workload on every account and it is a real line on your P&L, sitting inside a service you sell at a fixed price. The client pays the same either way. The difference lands entirely on your margin.

The correction is cheap to run and you only have to do it once per workload. Take twenty representative tasks from a workload. Run each through the top tier and the middle tier. Score the outputs blind, without knowing which model produced which. For most tasks the middle tier wins or ties, and you keep the cost savings. For the handful where the top tier genuinely wins, you keep the top tier on exactly those and nowhere else. That is the whole method. It replaces a gut feeling that Opus is safer with evidence about where it actually earns its price.

The mistake in the other direction is real too. Reaching for the cheapest tier on work that needs judgment produces a subtly wrong answer that a client notices, and the trust cost of that dwarfs the token savings. The cheapest tier is the wrong call when the task needs context beyond a modest window, when brand voice matters and generic will not pass, when tool use requires careful reasoning about which tool to call, and anytime a slightly-off client-facing response damages trust. Right-sizing cuts both ways: do not overpay, and do not underspend on the jobs that carry your reputation.

#The decision the account lead runs on the fly

You do not want a router meeting for every new workload. You want a short checklist anyone on the team can run in ten seconds when a new AI job shows up on an account. Here it is, in order.

Does the output go straight to a client or end user? If yes, do not drop below the middle tier. Client-facing is a floor, not a preference.
Does the task need multi-step reasoning across several inputs, or does a wrong answer cost the relationship? If yes, top tier. This is the judge bucket.
Is the volume high, a thousand or more a day, and the output simple and structured? If yes, cheapest tier. This is the bulk bucket.
Is the task voice-sensitive, where generic output would embarrass the client? Middle or top tier depending on the stakes, never the cheapest.
Everything else defaults to the middle tier. When in doubt, the daily-driver model is the right answer more often than either extreme.

Write those five lines into your delivery playbook and the model choice stops being tribal knowledge held by one senior person. That is what lets you hire against the service and scale it across accounts instead of bottlenecking every workload on the one engineer who knows the price sheet.

#How model choice changes the unit economics

#The per-unit cost is the whole game

This is the part that separates an agency that treats AI as a cost center from one that treats it as a product with a margin. When you productize an AI service, you sell it at a price the client agrees to once. Your cost to deliver each unit is what determines whether that price is profit or loss. Model choice is one of the two biggest levers on that per-unit cost. The other is caching, and they compound.

Run the arithmetic on a real workload. Say you generate templated location and service pages at volume. On the top tier, each page costs some amount to produce. Move the draft work to the middle tier where a template plus a human edit carries the quality, and the per-page cost falls by a multiple. Now layer prompt caching on top, so the stable brand-voice context is not re-billed on every call, and it falls again. The source pipeline saw generation cost drop from about 42 cents to about 6 cents a page once caching was in place. Model right-sizing and caching together are what turn a thin AI line item into a service you can price with room to spare.

The pricing move that follows is straightforward. You are no longer pricing your AI services against top-tier-for-everything cost. You are pricing against right-sized cost plus caching plus your human time. That gap is margin you either keep or reinvest into quality that a competitor undercutting you on price cannot match. Either way, the client does not see the difference in their invoice. The difference lives entirely on your side of the ledger, which is exactly where you want a cost lever to live.

#Mixing tiers inside one workflow

The most efficient pattern is not picking one tier per workload. It is chaining tiers inside a single job so the cheap tier does the sorting and the expensive tier only runs when it is actually needed. The classic shape: the cheapest tier reads the incoming request and decides whether it is simple or complex; simple requests it answers itself, and complex ones it escalates to a stronger model. You pay top-tier rates only on the fraction of requests that earn them.

async function triage(userMessage: string): Promise<"simple" | "complex"> {
  const res = await claude.messages.create({
    model: HAIKU_MODEL, // cheapest tier: fast, high-volume classification
    max_tokens: 20,
    system: "Classify the message as 'simple' (a one-shot answer fits) or 'complex' (needs reasoning).",
    messages: [{ role: "user", content: userMessage }],
  });
  const text = res.content[0].type === "text" ? res.content[0].text : "";
  return text.toLowerCase().includes("complex") ? "complex" : "simple";
}

async function answer(userMessage: string) {
  const level = await triage(userMessage);
  // Only the complex fraction pays top-tier rates.
  const model = level === "complex" ? OPUS_MODEL : SONNET_MODEL;
  return await claude.messages.create({ model, max_tokens: 1024, messages: [{ role: "user", content: userMessage }] });
}

Triage on the cheap tier, escalate only when it earns it. Model ids are illustrative; check the docs for current ids.

One more lever sits underneath all of this and costs you nothing but a little patience. Work that does not need a real-time answer, nightly content generation, weekly digests, periodic re-indexing, can run through the asynchronous batch path at a large discount to the standard per-token rate. For an agency running scheduled production across many accounts overnight, that discount stacks on top of your tier choice. The setup details and the current discount are in Anthropic's documentation. If a job can wait until morning, it should not be billed at real-time rates.

#Measuring it so the savings are real, not a story

A tiering rubric you cannot measure is a rubric you will not keep. Tag every model call with a workload identifier. Roll the calls up monthly by tier, multiply by that tier's rate, and compare the total to the counterfactual: what the same volume would have cost on the top tier for everything. That counterfactual is your savings number, and it is the one to put in front of an account lead who is tempted to reach for the biggest model out of caution.

Do this per workload, not just in aggregate, because the aggregate hides the offenders. A single mis-tiered high-volume job can quietly consume more budget than a dozen correctly-tiered ones. When you can see cost by workload, the fix is obvious and takes an afternoon. When you can only see one monthly bill, the leak is invisible and it compounds. The source pipeline this rewrite is built on tracks exactly this and reports it monthly, and the discipline is what keeps the tiering honest over time.

#Where agencies get right-sizing wrong

Standardizing on the top tier for safety. Consistency is worth something, but paying several times over for jobs the middle tier handles is not the way to buy it. Right-size per workload and standardize the rubric, not the model.
Right-sizing once and never revisiting. New workloads land on accounts constantly. A tier choice made for last quarter's work is not automatically correct for this quarter's. Re-run the twenty-task check when the work changes.
Chasing the cheapest tier on judgment work to shave pennies. A subtly wrong client-facing answer costs more than the tokens ever saved. The cheapest tier belongs on bulk, not on anything a client reads unedited.
Ignoring the batch path. Real-time rates on work that could have waited until morning is pure waste, and overnight production across a client book is where it hides.
Treating model choice and caching as separate initiatives. They are the same margin conversation. Price against the combined cost or you are leaving the biggest lever on the table.

#Where a platform partner handles the routing

You can build all of this yourself. The tiering rubric, the twenty-task evaluations, the per-workload cost tagging, the triage-and-escalate chains, the batch scheduling. It is a real amount of engineering, and for an agency whose value is client strategy rather than inference plumbing, it is engineering that does not touch what clients pay you for. That is where a platform layer earns its place. The agency owns the client relationship and the judgment about what quality each workload needs. The platform handles the routing, the cost accounting, and the batch scheduling underneath, so your team applies the rubric without maintaining the machinery.

That split is deliberate. The part that does not templatize, knowing which client work needs top-tier judgment and which is safe to run cheap, stays with you, because it is the part clients actually pay for. The repeatable part underneath, the routing and the measurement, is exactly what a platform should absorb. If you would rather own the whole stack, the rubric above and the model-selection guide are the full playbook. Either way, right-sizing is the move that protects your margin. See how the platform fits across the full solution set.

#The same play, retold for other operators

The three-tier rubric scales down as cleanly as it scales across a book of clients. The stakes and the volume change, but draft, judge, and bulk are the same three buckets whether you run one workload or two hundred. We retold this cost discipline for three other readers: micro businesses getting the most from AI without overpaying, SMEs matching the model to the job to control spend, and mid-market teams writing a model-selection policy that holds across a whole team. The mechanics are identical. Only the scale of the savings changes.

Right-sizing the model to the job is the least glamorous and most reliable margin lever in a productized AI service. It is not a technical nicety. It is the difference between an AI line that funds itself and one that quietly bleeds. Sort the work into draft, judge, and bulk. Run the top tier only where it earns its price. Layer caching and the batch path on top. Measure by workload so the savings are real. Do that across your book and the same client volume costs a fraction of what it does on the default-to-the-biggest-model approach, at equivalent or better quality.

Want to run right-sized AI across your client book without building the routing and cost accounting yourself? Run the estimator and we will show you the tiering rubric, the per-workload cost picture, and where the platform takes the routing off your team's plate. Or talk to us about a partner engagement.