Which pgvector index should we use, HNSW or IVFFlat?

HNSW is the right default for read-heavy production search: higher recall and faster queries, at the cost of a slower build and more memory. IVFFlat builds faster and uses less memory but needs tuning and rebuilding as data grows. Most production search should use HNSW.

How do we keep a RAG assistant from leaking documents a user cannot see?

Enforce permissions at retrieval, in the database, with Row Level Security, and mark the match function security invoker so it runs as the caller. Then vector search can only ever return rows the user is cleared for, and the LLM can only be handed passages they could have read themselves.

Do we need a dedicated vector database at our scale?

Usually not. Keeping vectors in the Postgres you already run puts one datastore in scope for security review instead of two, reuses your existing audit trail, and removes a sync pipeline between systems. The consolidation is typically worth more than a specialized store's marginal performance edge.

AI Search for Mid-Size Companies: Production Vector Search, RAG Permissions, and AEO at Scale

Q: How do we change embedding models without downtime?

Use a shadow column: add a second embedding column, backfill it in the background, verify recall against your evaluation set, then cut reads over atomically. Version the model per row and never query across two vector spaces.

You are not asking whether to do AI search. You have real content volume, an engineering team, and a security review that will ask hard questions before anything ships. The interesting problems at your scale are not "how do embeddings work." They are index choice at millions of vectors, keeping RAG retrieval inside each user's permission boundary, latency under load, cost at volume, and proving all of it to a compliance function. Supabase's AI and Vectors stack handles this in production. Here is the version at your altitude.

The foundation is the same as everyone else's: hybrid search over content stored in Postgres with pgvector. If you have not read the build at the small-team level, the SME version lays out the table, the embedding pipeline, and the hybrid ranking function. This post assumes that and goes to the parts that only matter at scale.

#Vector index choice: HNSW vs IVFFlat

At a few thousand vectors, the index does not matter. At a few million, it decides your latency and your recall. pgvector gives you two: IVFFlat and HNSW. The tradeoff is real and you should make it deliberately.

HNSW (hierarchical navigable small world): higher recall and faster queries, slower to build, more memory. This is the right default for read-heavy search where query latency and result quality matter more than index build time. Most production search should use it.
IVFFlat: faster to build, lower memory, but recall depends on tuning the number of lists and probes, and it needs rebuilding as data grows. Reasonable when you are memory-constrained or your data is relatively static.

-- HNSW: tune m (graph connectivity) and ef_construction (build quality).
create index on documents
  using hnsw (embedding vector_cosine_ops)
  with (m = 16, ef_construction = 64);

-- Query-time recall/latency knob (per session):
set hnsw.ef_search = 40;

-- IVFFlat alternative: lists ~ rows / 1000, tune probes at query time.
create index on documents
  using ivfflat (embedding vector_cosine_ops)
  with (lists = 200);
set ivfflat.probes = 10;

#RAG that respects permissions

This is the requirement that separates a demo from a system your security team will sign off on. In a mid-size company, not every user may see every document. If your RAG assistant retrieves across the whole corpus and then composes an answer, it will happily leak a passage from a document the asking user was never allowed to read. The generated answer becomes an exfiltration path around your access controls.

The fix is to enforce permissions at retrieval, in the database, with Row Level Security, so the vector search itself can only ever return rows the caller is allowed to see. Supabase documents this retrieval pattern, and it is the same RLS posture we run multi-tenant across the FH client book in the RLS post.

alter table documents enable row level security;

-- Retrieval can only see rows the caller's role is cleared for.
create policy "read within clearance" on documents
  for select using (
    visibility = 'public'
    or owner_team = (auth.jwt() ->> 'team')
    or (auth.jwt() ->> 'role') = 'admin'
  );

-- The match function runs as the caller, so RLS filters BEFORE ranking.
create or replace function match_documents(
  query_embedding vector(1536), match_count int default 8
)
returns setof documents
language sql stable
security invoker            -- critical: run as caller, not definer
as $$
  select * from documents
  order by embedding <=> query_embedding
  limit match_count;
$$;

#Going to production: the operational checklist

The retrieval quality is table stakes at your scale. What gets you through a launch is the operational posture around it.

Connection pooling. Serverless and edge functions open connections fast. Use the Supabase pooler (port 6543) for short-lived query paths and reserve direct connections for migrations and the embedding job. The Supabase performance post covers the pooler choice in depth.
Statement timeouts. Bound the damage of a slow vector scan with a per-role `statement_timeout` so one expensive query cannot starve the pool.
Embedding versioning. When you change embedding models, the old and new vectors are not comparable. Store the model name per row and re-embed in a controlled backfill; never mix embedding spaces in one index.
Observability. Log retrieval latency, recall proxies, and the RAG answer's grounding. A retrieval regression is invisible until users complain unless you watch it. The Supabase logs post has the query set we run.
Cost tracking. At volume, hosted embeddings and LLM generation are the line items that grow. Meter them per feature so you can see which surface is expensive.

Rank your hardening work by blast radius. A permission leak through RAG outranks a latency blip.

#Latency budgets under load

At your traffic, retrieval latency is a budget, not an afterthought. Set a target (for most search UIs, keep vector retrieval under 50 milliseconds at the 95th percentile) and defend it. The levers are concrete: keep the vector working set in memory so the index is not paging from disk, size HNSW `ef_search` to the lowest value that holds your recall target, and cache embeddings for repeated queries so you are not re-embedding the same question thousands of times a day.

Measure the whole path, not just the database. User-visible latency is embed-the-query plus retrieve plus, in the RAG tier, generate. The embedding call and the LLM call are often larger than the vector scan itself. Cache query embeddings for common questions, and reserve the generation step for queries that actually need a composed answer rather than a ranked list. Most queries do not need one.

#Re-embedding without downtime

You will change embedding models at least once, and the old and new vectors are not comparable. The wrong way is to re-embed in place and watch search quality thrash while half the corpus sits in one vector space and half in another. The right way is a shadow column: add a second embedding column, backfill it in the background, verify recall against your evaluation set, then cut reads over in one atomic switch.

Version the model name per row so you always know which space a vector lives in. Keep the old column until the new one is proven, then drop it. A dimensional change (a new model with a different vector width) needs its own column and its own index, built alongside the live one, never on top of it. Done this way, a model upgrade is a background job and a config flip, not an outage.

#The working set and when to scale compute

pgvector performance lives and dies on whether the index fits in memory. When your vector working set outgrows the buffer cache, queries start paging from disk and latency falls off a cliff. Watch the cache hit ratio the way you would for any Postgres workload: above 99 percent is healthy, and a sustained drop is the signal to size up compute, not to blame the index.

Sizing up on Supabase is a tier change, not a migration, so scaling is a dial rather than a re-architecture. For the deeper Postgres tuning (pooling, statement timeouts, vacuum, and the cache-hit math) the same principles as any high-volume Postgres apply, and they are laid out in the Supabase performance post.

#Cost at volume

The math changes shape at your scale, but not as much as vendors want you to think. Storage and query compute stay on Supabase's tiers; you size the instance to keep the vector working set in memory and the cache hit ratio high. Embeddings are a one-time cost per content change, not per query, so a large but slow-changing corpus is cheap to keep indexed. The variable cost that actually scales with traffic is LLM generation in the RAG tier, which is why you meter it per surface and why not every query should invoke a model.

#Tuning the hybrid balance

Reciprocal rank fusion treats both rankings equally by default, but your content may not deserve an equal split. A codebase or a product catalog leans on exact identifiers, so the keyword side should carry more weight. A knowledge base of prose leans on meaning, so the semantic side should. Expose the fusion weight as a parameter and tune it against your evaluation set rather than guessing.

The `rrf_k` constant is the other knob: a larger value flattens the influence of the very top ranks, a smaller value sharpens it. Start at the documented default, change one variable at a time, and let the hit-rate number on your evaluation set decide. Tuning hybrid search is not folklore at your scale. It is a measured search over two or three parameters, pinned by a test set, and rerun whenever your content mix shifts.

Keep the tuning in version control alongside the function. When the weight or `rrf_k` changes, that is a search-quality change, and it deserves the same review as a code change, because a bad value degrades every query at once.

#Build versus buy, decided

At your scale a vendor will pitch a dedicated vector database as the serious choice. Run the real comparison before you sign. A separate vector store is a second production datastore to secure, sync, patch, monitor, and staff, and it introduces a consistency problem: your content lives in one system and its vectors in another, and the two drift.

Keeping vectors in the Postgres you already run means one datastore in scope for your security review, not two.
Retrieval runs through the same RLS policies as every other query, so your existing audit trail already covers the AI feature.
There is no sync pipeline between your primary database and a separate index to break at 2 a.m.
You scale compute with a tier change instead of operating a second distributed system.
The marginal performance edge a specialized store claims rarely outweighs removing an entire system from your architecture diagram.

#Rolling it out across teams

A mid-size rollout fails on organizational seams more often than technical ones. The content that feeds retrieval is owned by different teams, the permission model is owned by security, and the surfaces that consume search are owned by product. Name an owner for each before you build, not after. The retrieval function is one team's; the content quality is another's; the permission policy is security's to review and sign.

Ship it staged. Prove retrieval quality on one team's corpus with an evaluation set and a permission test, get security's sign-off on the RLS model against that one corpus, then widen to the next. A big-bang launch across every team's content at once gives you no clean place to debug when recall or a permission edge case goes wrong. The measured, one-corpus-at-a-time rollout is slower to start and far faster to finish.

#AEO at scale, and measuring it

At your size, answer-engine visibility is a measurable channel, not a hope. The same retrieval you built for on-site search is a lens on how the engines read your content: run your published pages through your own embeddings and hybrid ranking, and you can see which sections are retrievable and which are dead weight. That becomes a content-quality metric your team can move.

Instrument the downstream too. Track referral and citation traffic from AI engines, watch the queries where you appear in AI Overviews, and pair it with Search Console. The measurement approach is in the Search Console AI Overviews post, and the strategic frame is the AEO playbook. The point is that at your scale you can close the loop: structure content, measure retrievability, measure citations, iterate.

#Governance and the compliance review

Data residency and isolation: RLS-enforced retrieval plus a documented tenant or clearance model answers the "can user A see user B's data" question with a policy a reviewer can read, not a promise.
Auditability: every retrieval runs as the caller through the same policies as direct access, so your existing database audit trail covers the AI feature too. There is no separate access path to reason about.
Model and prompt governance: version the embedding model and the generation prompt, log which version answered, and keep the RAG answer grounded with citations so any answer can be traced to its source passages.
Blast radius: keeping vectors in Postgres means one datastore in scope for the review, not two. Fewer systems, fewer policies, a smaller audit.

If you want this designed against your permission model and your compliance requirements, or reviewed before it goes to a security board, book a consultation. This is the platform posture we run as Elevi and across every site in the FH book. The lighter builds are small and growing teams and micro businesses; the delivery model for serving many clients at once is AI search for agencies.