RAG for SMB Sites: When Retrieval-Augmented Generation Actually Solves a Real Problem

Retrieval-Augmented Generation (RAG) is the pattern where you embed your own content into a vector database, retrieve relevant chunks at query time, and feed them to an LLM as context. Every AI consultant in 2026 is pitching RAG as the answer to whatever question your SMB has. Sometimes it is. Most of the time, it’s overkill. Here’s the framework for deciding.

#What RAG actually does

Three steps. (1) Chunk your content (docs, pages, knowledge base) into 200-800 word pieces. (2) Run each chunk through an embedding model to get a vector, and store the vector alongside the chunk in a database. (3) At query time, embed the user’s question, run a vector similarity search to find the most relevant chunks, paste them into the LLM prompt as context, and ask the LLM to answer using only the retrieved context.

Result: the LLM answers questions using your content instead of its training data. Reduces hallucinations, keeps the answer current, lets you cite sources.
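
To make the similarity search in step 3 concrete, here is a minimal sketch of the score a vector search typically ranks by. The `cosineSimilarity` helper and the toy three-dimensional vectors are illustrative assumptions; real embeddings have hundreds or thousands of dimensions, and the database computes the score for you.

```typescript
// Cosine similarity between two equal-length vectors: dot(a, b) / (|a| * |b|).
// Hypothetical helper for illustration; a vector database does this internally.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy example: rank two "chunks" against a "query" vector.
const queryVector = [1, 0, 1];
const toyChunks = [
  { text: "office hours", vector: [0, 1, 0] },  // orthogonal to the query
  { text: "refund policy", vector: [1, 0, 1] }, // same direction as the query
];
const rankedChunks = toyChunks
  .map((c) => ({ text: c.text, score: cosineSimilarity(queryVector, c.vector) }))
  .sort((x, y) => y.score - x.score); // highest similarity first
```

The chunk pointing in the same direction as the query sorts first; the orthogonal one scores zero.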

#When RAG is the right tool

Customer support: answering questions about your product using your own docs.
Internal knowledge search: a chat interface over your company wiki.
Content discovery: ‘find me the FH blog post that covers X.’
Regulatory or legal-adjacent Q&A where you need to cite the source for every answer.

#When RAG isn’t

When the content fits in the LLM’s context window. With Claude’s 200k+ token window, most SMB knowledge bases fit entirely in context — just send the whole thing. RAG adds operational complexity for no gain.
When the user is looking for something other than information retrieval. Booking, transactional, creative output — RAG doesn’t help.
When your knowledge base is small (under 50,000 words). The retrieval step adds latency without saving meaningful context tokens.
When the queries are well-served by traditional search (keyword match, BM25). RAG isn’t magic; it’s a fancy search.
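
The size checks in this list are simple arithmetic. A rough sketch, assuming about 1.3 tokens per English word (a rule of thumb, not a real tokenizer) and some headroom reserved for the prompt and the answer; `estimateContextFit` and both constants are hypothetical names and values chosen for illustration.

```typescript
// Rough check: does the whole knowledge base fit in the model's context window?
// TOKENS_PER_WORD is a rule-of-thumb assumption for English prose, not a tokenizer.
const TOKENS_PER_WORD = 1.3;

interface FitEstimate {
  estimatedTokens: number;
  fitsInContext: boolean;
}

function estimateContextFit(
  wordCount: number,
  contextWindowTokens = 200000, // the window size mentioned above
  reservedTokens = 8000,        // assumed headroom for system prompt, question, and answer
): FitEstimate {
  const estimatedTokens = Math.ceil(wordCount * TOKENS_PER_WORD);
  return {
    estimatedTokens,
    fitsInContext: estimatedTokens <= contextWindowTokens - reservedTokens,
  };
}

const smallKb = estimateContextFit(50000);  // the 50,000-word threshold from the list above
const largeKb = estimateContextFit(400000); // a docs site that has outgrown long context
```

If `fitsInContext` comes back true with room to spare, long context is usually the simpler path; measure with the provider’s real token counter before committing either way.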

#The Postgres + pgvector stack

Supabase makes pgvector available on every project; you enable it with one create extension statement. You get a vector type, distance operators (cosine, L2, inner product), and ANN indexes (HNSW or IVFFlat). Storage is regular Postgres rows. RLS works the same as on any other table. Total infrastructure setup: ten minutes.

create extension if not exists vector;

create table doc_chunks (
  id uuid primary key default gen_random_uuid(),
  site_id text references sites(id) not null,
  source_path text not null,
  chunk_text text not null,
  embedding vector(1024), -- must match the embedding model's output size (1024 is the voyage-3-large default)
  created_at timestamptz default now()
);

create index on doc_chunks using hnsw (embedding vector_cosine_ops);
create index on doc_chunks(site_id);

alter table doc_chunks enable row level security;
create policy "select own site" on doc_chunks
  for select using (site_id = (auth.jwt() ->> 'site_id'));

#Embedding generation

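Use Voyage AI or OpenAI’s text-embedding-3 family. In our experience, Voyage’s voyage-3-large is the highest-quality general-purpose option in 2026; OpenAI’s text-embedding-3-large is cheaper and a touch less accurate. We use Voyage for high-stakes use cases, OpenAI for everything else. Whichever you pick, the vector(1024) column in the schema has to match the model’s output size: voyage-3-large returns 1024 dimensions by default, while text-embedding-3-large returns 3072 unless you request fewer through its dimensions parameter.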

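import { VoyageAIClient } from "voyageai";
import { createClient } from "@supabase/supabase-js";

const voyage = new VoyageAIClient({ apiKey: process.env.VOYAGE_API_KEY! });
// Assumed env var names; indexing runs server-side with a key that is allowed to insert rows.
const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_SERVICE_ROLE_KEY!);

// Embed one chunk for storage. "document" is the input type for indexed content.
async function embed(text: string): Promise<number[]> {
  const res = await voyage.embed({
    input: [text],
    model: "voyage-3-large",
    inputType: "document",
  });
  const embedding = res.data?.[0]?.embedding;
  if (!embedding) throw new Error("Voyage returned no embedding");
  return embedding;
}

// Embed a chunk and store it next to its text and source path.
async function indexChunk(siteId: string, sourcePath: string, chunk: string) {
  const embedding = await embed(chunk);
  const { error } = await supabase.from("doc_chunks").insert({
    site_id: siteId,
    source_path: sourcePath,
    chunk_text: chunk,
    embedding,
  });
  if (error) throw new Error(`Failed to index ${sourcePath}: ${error.message}`);
}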

#Querying

Embed the user’s query (with `inputType: "query"` for asymmetric models). Run a vector similarity search. Take the top 5-10 chunks. Send to Claude as context.

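import Anthropic from "@anthropic-ai/sdk";
const claude = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Queries use inputType "query"; the stored chunks were embedded as "document".
async function embedQuery(question: string): Promise<number[]> {
  const res = await voyage.embed({
    input: [question],
    model: "voyage-3-large",
    inputType: "query",
  });
  const embedding = res.data?.[0]?.embedding;
  if (!embedding) throw new Error("Voyage returned no embedding");
  return embedding;
}

async function answerQuestion(siteId: string, question: string): Promise<string> {
  const queryEmbedding = await embedQuery(question);
  const { data: chunks, error } = await supabase.rpc("match_chunks", {
    site_id_filter: siteId,
    query_embedding: queryEmbedding,
    match_count: 8,
  });
  if (error) throw new Error(`match_chunks failed: ${error.message}`);

  // Prefix each chunk with its source path so the model can cite it.
  const context = (chunks ?? [])
    .map((c: { source_path: string; chunk_text: string }) => `[${c.source_path}] ${c.chunk_text}`)
    .join("\n\n");

  const response = await claude.messages.create({
    model: "claude-opus-4-7",
    max_tokens: 1024,
    system: "Answer using only the provided context. Cite sources.",
    messages: [{
      role: "user",
      content: `Context:\n${context}\n\nQuestion: ${question}`,
    }],
  });

  const first = response.content[0];
  return first?.type === "text" ? first.text : "";
}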

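The match_chunks function that the RPC call relies on:

create or replace function match_chunks(
  site_id_filter text,
  query_embedding vector(1024),
  match_count int default 8
) returns table (
  id uuid,
  source_path text,
  chunk_text text,
  similarity float
) language sql stable as $$
  -- <=> is pgvector's cosine distance, so 1 - distance is cosine similarity.
  -- Columns are qualified with the table alias to avoid clashing with the output column names.
  select
    c.id,
    c.source_path,
    c.chunk_text,
    1 - (c.embedding <=> query_embedding) as similarity
  from doc_chunks c
  where c.site_id = site_id_filter
  order by c.embedding <=> query_embedding
  limit match_count;
$$;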

#Chunking strategy

Three common approaches. Fixed-size (every 500 tokens) works fine for prose. Semantic (split on paragraph breaks, then merge to size) produces better-quality chunks for mixed content. Heading-based (use the page’s heading structure to define chunks) works well for structured docs like the FH blog. We default to heading-based for blog content and semantic for free-form pages.
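
A minimal sketch of the heading-based approach, with a paragraph-level fallback for oversized sections. It assumes headings are lines starting with `#` and measures size in words rather than tokens; `chunkByHeading`, `countWords`, and the sample text are hypothetical and not taken from any production pipeline.

```typescript
interface Chunk {
  heading: string;
  text: string;
}

const countWords = (s: string): number => s.split(/\s+/).filter(Boolean).length;

// Split a document on heading lines (assumed to start with "#"), then break any
// oversized section on blank-line paragraph boundaries. A single paragraph longer
// than maxWords is kept whole rather than cut mid-sentence.
function chunkByHeading(doc: string, maxWords = 500): Chunk[] {
  const sections: Chunk[] = [];
  let current: Chunk = { heading: "", text: "" };

  for (const line of doc.split("\n")) {
    if (line.startsWith("#")) {
      if (current.text.trim()) sections.push(current);
      current = { heading: line.replace(/^#+\s*/, ""), text: "" };
    } else {
      current.text += line + "\n";
    }
  }
  if (current.text.trim()) sections.push(current);

  const chunks: Chunk[] = [];
  for (const section of sections) {
    const paragraphs = section.text.split(/\n\s*\n/).map((p) => p.trim()).filter(Boolean);
    let buffer: string[] = [];
    let words = 0;
    for (const para of paragraphs) {
      const n = countWords(para);
      // Flush the buffer before it would exceed the size budget.
      if (buffer.length > 0 && words + n > maxWords) {
        chunks.push({ heading: section.heading, text: buffer.join("\n\n") });
        buffer = [];
        words = 0;
      }
      buffer.push(para);
      words += n;
    }
    if (buffer.length > 0) {
      chunks.push({ heading: section.heading, text: buffer.join("\n\n") });
    }
  }
  return chunks;
}

const sampleDoc =
  "#Pricing\nPlans start at ten dollars.\n\nAnnual billing is discounted.\n#Support\nEmail us any time.";
const sampleChunks = chunkByHeading(sampleDoc, 5); // tiny budget to force a split
```

Keeping the heading on every chunk gives the retriever and the LLM some context about where a passage came from, which is most of the appeal of this strategy over fixed-size splitting.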

#Hybrid search: BM25 + vector

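Pure vector search misses exact-match queries (a user typing the exact title of a doc). Hybrid search combines BM25-style keyword scoring with vector similarity and merges the two rankings. Reciprocal Rank Fusion (RRF) is the simplest merging algorithm. Supabase doesn’t ship hybrid search as a built-in feature; you implement it yourself, either as two queries merged in app code or inside a single SQL function. Note that Postgres’s built-in full-text ranking (ts_rank) is not true BM25, though it handles the exact-match case well enough.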
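
A minimal sketch of RRF as the app-code merge step. Each document scores the sum of 1 / (k + rank) across the result lists it appears in, with ranks starting at 1; k = 60 is the constant commonly used in practice. The function name and the chunk ids are hypothetical.

```typescript
// Reciprocal Rank Fusion: score(d) = sum over result lists of 1 / (k + rank(d)).
// Inputs are lists of chunk ids, each already sorted best-first by its own retriever.
function reciprocalRankFusion(rankedLists: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1]) // highest fused score first
    .map(([id]) => id);
}

// Hypothetical chunk ids from the two retrieval passes.
const keywordHits = ["pricing", "refunds", "shipping"];
const vectorHits = ["refunds", "returns", "pricing"];
const fused = reciprocalRankFusion([keywordHits, vectorHits]);
```

Because RRF only looks at ranks, it sidesteps the problem that keyword scores and cosine similarities live on different scales. Chunks found by both retrievers rise to the top of the fused list.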

#Reranking

After retrieving the top-N chunks, run them through a reranker (Voyage’s rerank-2.5 or Cohere’s rerank-3) for a quality boost. The reranker scores each chunk against the query in a more compute-intensive but accurate way. Adds ~200ms latency, dramatically improves answer quality on hard queries.
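
The selection step after reranking, sketched without committing to one vendor’s SDK. It assumes you have already sent the query and candidate texts to a reranker and received one relevance score per candidate; `keepTopReranked`, the file names, and the scores are made up for illustration.

```typescript
interface RetrievedChunk {
  sourcePath: string;
  text: string;
}

// Given one relevance score per retrieved chunk (as a reranker would return),
// keep only the highest-scoring chunks for the prompt.
function keepTopReranked(
  chunks: RetrievedChunk[],
  scores: number[],
  keep = 8,
): RetrievedChunk[] {
  if (scores.length !== chunks.length) {
    throw new Error("Need exactly one score per chunk");
  }
  return chunks
    .map((chunk, i) => ({ chunk, score: scores[i] }))
    .sort((a, b) => b.score - a.score) // most relevant first
    .slice(0, keep)
    .map((entry) => entry.chunk);
}

// Hypothetical case: vector search ranked shipping.md first, the reranker disagrees.
const retrieved: RetrievedChunk[] = [
  { sourcePath: "shipping.md", text: "We ship worldwide." },
  { sourcePath: "refunds.md", text: "Refunds take five business days." },
  { sourcePath: "about.md", text: "Founded in 2019." },
];
const rerankScores = [0.41, 0.93, 0.05];
const topTwo = keepTopReranked(retrieved, rerankScores, 2);
```

A common pattern is to retrieve wide (say 30 candidates) and let the reranker cut that down to the handful that go into the prompt.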

#Evaluation

RAG quality is hard to measure without ground truth. Build a small (50-100 question) evaluation set with known-correct answers and run it every time you change embedding, chunking, or retrieval. Track answer accuracy, source-citation accuracy, and hallucination rate. Without this, you’re flying blind on whether changes help or hurt.
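
A minimal sketch of the scoring half of such a harness. It covers answer accuracy and source-citation accuracy only; hallucination rate generally needs a human or LLM grader and is left out. Substring matching is a deliberately crude assumption, and every type, function name, and sample value here is hypothetical.

```typescript
interface EvalCase {
  question: string;
  expectedAnswer: string; // a phrase the correct answer must contain
  expectedSource: string; // the source path that should be cited
}

interface EvalResult {
  answer: string;
  citedSources: string[];
}

interface EvalReport {
  answerAccuracy: number;
  citationAccuracy: number;
}

// Score one run of the pipeline against the ground-truth set.
// results[i] must be the pipeline's output for cases[i].
function scoreEvalRun(cases: EvalCase[], results: EvalResult[]): EvalReport {
  if (cases.length === 0 || cases.length !== results.length) {
    throw new Error("Need one result per eval case");
  }
  let correctAnswers = 0;
  let correctCitations = 0;
  cases.forEach((c, i) => {
    const r = results[i];
    if (r.answer.toLowerCase().includes(c.expectedAnswer.toLowerCase())) correctAnswers++;
    if (r.citedSources.includes(c.expectedSource)) correctCitations++;
  });
  return {
    answerAccuracy: correctAnswers / cases.length,
    citationAccuracy: correctCitations / cases.length,
  };
}

const evalCases: EvalCase[] = [
  { question: "How long do refunds take?", expectedAnswer: "five business days", expectedSource: "refunds.md" },
  { question: "Do you ship abroad?", expectedAnswer: "worldwide", expectedSource: "shipping.md" },
];
const evalResults: EvalResult[] = [
  { answer: "Refunds take five business days.", citedSources: ["refunds.md"] },
  { answer: "Shipping is domestic only.", citedSources: ["shipping.md"] },
];
const evalReport = scoreEvalRun(evalCases, evalResults);
```

Store the report alongside the configuration that produced it (embedding model, chunk size, retrieval settings) so that a regression can be traced to a specific change.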

#Cost at SMB scale

Indexing 10,000 chunks via Voyage: ~$2-5 one-time. Storage in Supabase: negligible at this scale. Per query: ~$0.001-0.005 (embed + retrieve + LLM call). The LLM call dominates that figure, so it moves with the model you choose and how many chunks you send. For a customer-support bot fielding 500 questions/day (about 15,000 a month), monthly cost is ~$15-75. Well within SMB-tier budgets.
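
The monthly figure is plain multiplication, sketched here so you can plug in your own traffic and per-query cost. The function name and the 30-day month are assumptions.

```typescript
// Monthly cost = queries per day * days per month * cost per query (USD).
function monthlyQueryCost(
  queriesPerDay: number,
  costPerQueryUsd: number,
  daysPerMonth = 30, // assumed month length
): number {
  return queriesPerDay * daysPerMonth * costPerQueryUsd;
}

// 500 questions/day at the low and high ends of the per-query range quoted above.
const lowEstimate = monthlyQueryCost(500, 0.001);
const highEstimate = monthlyQueryCost(500, 0.005);
```

Rerun the numbers whenever you change the model or the number of chunks sent per query, since both feed directly into the per-query cost.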

#How this lands across FH client work

We’ve built RAG into exactly two FH client products. One is a B2B documentation chat (the docs are too large to fit in context). The other is a multi-tenant knowledge-base search where each tenant has their own document set with RLS isolation. Everywhere else we’ve looked at RAG, the long-context approach has worked better with less complexity. If you’re evaluating RAG for your stack, book a consultation — half the time the answer is ‘just use long context with prompt caching.’