# How to Optimize pgvector Queries at Scale
Your vectors are already in PostgreSQL. If you'll allow me, I shall ensure they perform as though they never considered leaving.
## The performance problem with vector search at scale
pgvector works remarkably well at small to medium scale. Store a few hundred thousand embeddings, run a nearest-neighbor query, get results in a few milliseconds. It feels effortless because PostgreSQL is doing the hard work underneath, and pgvector slots in cleanly. This is one of PostgreSQL's quiet virtues — the ability to absorb an entirely new category of workload without requiring a separate system. A good household does not hire a specialist for every task.
Then you cross a million vectors. Then five million. The queries that ran in 2ms now take 200ms. Filtered searches — combining vector similarity with WHERE clauses — degrade faster than unfiltered ones. Index builds that took minutes now take hours. Memory consumption creeps past what your instance can comfortably hold.
None of this is surprising. Approximate nearest neighbor search is a fundamentally memory-intensive operation, and the parameters that control it have different optimal values at 100,000 vectors than at 10,000,000. The defaults that work beautifully at small scale become bottlenecks at large scale. pgvector provides the knobs to address this. The challenge is knowing which knobs to turn, and in which direction.
Allow me to walk you through the practical tuning work: choosing and configuring index types, handling filtered search, reducing memory footprint with half-precision vectors and quantization, monitoring query performance, and — because honesty is a professional obligation — recognizing when you have genuinely outgrown what pgvector can offer.
## What does a slow vector query look like?
Before tuning anything, one must establish a baseline. Run EXPLAIN (ANALYZE, BUFFERS) on your vector query and attend to what PostgreSQL tells you.
```sql
-- A typical vector similarity query: find 10 nearest neighbors
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, content, embedding <=> $1 AS distance
FROM documents
ORDER BY embedding <=> $1
LIMIT 10;
```

Without an index, this is what happens on a table with 1 million vectors:
```text
Sort  (cost=284102.45..286602.45 rows=1000000 width=552)
      (actual time=4218.33..4218.35 rows=10 loops=1)
  Sort Key: (embedding <=> $1)
  Sort Method: top-N heapsort  Memory: 29kB
  ->  Seq Scan on documents  (cost=0.00..259102.00 rows=1000000 width=552)
        (actual time=0.04..3842.19 rows=1000000 loops=1)
        Buffers: shared hit=42891 read=189312
Planning Time: 0.18 ms
Execution Time: 4218.52 ms
```

PostgreSQL computes the cosine distance against every single row — one million distance calculations — sorts the results, and returns the top 10. The Buffers: shared read=189312 tells you it read roughly 1.5GB of data from disk. This is exact nearest neighbor search: perfect recall, entirely unacceptable latency. I trust we can agree that 4.2 seconds for ten results is not a standard we shall tolerate.
With a properly configured HNSW index, the same query:
```text
Index Scan using idx_documents_embedding_hnsw on documents
  (cost=228.42..292.71 rows=10 width=552)
  (actual time=1.82..1.94 rows=10 loops=1)
  Order By: (embedding <=> $1)
  Buffers: shared hit=1847
Planning Time: 0.12 ms
Execution Time: 2.08 ms
```

The Index Scan traverses the HNSW graph, visiting roughly 1,847 pages instead of 232,000. Execution time drops from 4.2 seconds to 2 milliseconds — a 2,000x improvement. The tradeoff: recall is approximate rather than exact. With default settings, you get roughly 92% recall. With tuning, 99%+. For the vast majority of applications, that tradeoff is not merely acceptable — it is the correct operating point. Perfection at the cost of four seconds is a luxury no production system can afford.
## HNSW vs IVFFlat: which index type should you use?
pgvector provides two approximate nearest neighbor index types. They use fundamentally different data structures and have different performance characteristics. I shall give you my recommendation plainly: use HNSW. The longer answer explains when IVFFlat still earns its place at the table.
### HNSW (Hierarchical Navigable Small World)
HNSW builds a multi-layered graph where each node connects to its approximate nearest neighbors. Queries traverse the graph from a random entry point, navigating toward the query vector through progressively more accurate layers. The result is logarithmic search time that scales well with dataset size.
```sql
-- HNSW: the recommended default for most workloads
CREATE INDEX idx_documents_embedding_hnsw ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 128);

-- Build time scales with dataset size and parameter values.
-- For 1M vectors at 1536 dimensions: expect 10-30 minutes.
```

The advantages are considerable: HNSW can be created on an empty table (no training step), handles concurrent inserts without rebuilding, and achieves higher recall than IVFFlat at the same query speed. In benchmarks by AWS, at 99.8% recall HNSW achieved 40.5 QPS versus IVFFlat's 2.6 QPS — a 15x throughput advantage. These are not marginal differences. They are the sort of numbers that settle a debate.
### IVFFlat (Inverted File with Flat Compression)
IVFFlat divides vectors into clusters using k-means, then at query time searches only the nearest clusters. It requires data in the table before building because it needs to compute cluster centroids.
```sql
-- IVFFlat: faster builds, lower memory, lower recall
CREATE INDEX idx_documents_embedding_ivfflat ON documents
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 1000);

-- Requires data in the table first (k-means clustering step).
-- lists: number of clusters. Start with rows / 1000 for up to 1M rows,
-- or sqrt(rows) for larger datasets.
```

IVFFlat builds 10-30x faster than HNSW and uses 2-3x less memory. On a dataset that rarely changes and where build time is a genuine constraint, IVFFlat has something to offer. But its recall degrades as data drifts from the original cluster centroids — inserts and updates gradually reduce quality until you rebuild the index. An index that requires regular rebuilding to maintain its dignity is, if I may say so, an index with a condition.
### Head-to-head comparison
| Parameter | HNSW | IVFFlat |
|---|---|---|
| Build time (1M vectors, 1536d) | 10-30 minutes | 1-3 minutes |
| Index size (1M vectors, 1536d) | ~700 MB | ~250 MB |
| Query latency (99% recall) | 1-5 ms | 5-20 ms |
| Recall at same latency | Higher | Lower |
| Requires data to build | No | Yes (k-means training) |
| Handles inserts well | Yes | Degrades (needs rebuild) |
| Memory during build | High (graph in RAM) | Lower |
For most production workloads — RAG pipelines, semantic search, recommendation engines — HNSW is the correct default. I would not recommend IVFFlat unless build time or memory is genuinely the binding constraint on a large, mostly-static dataset. If you are uncertain, you are not in that situation.
## How do you tune HNSW parameters for your workload?
HNSW has three parameters that control the recall-speed-memory tradeoff. Understanding what each one does — and what the practical ranges are — is the heart of pgvector performance tuning. If you will indulge me, I shall take each in turn.
```sql
-- m: connections per node in the graph (default 16)
-- Higher m = better recall, larger index, slower builds
-- Range: 12-48 for most workloads. 64+ for very high recall requirements.

-- ef_construction: build-time candidate list size (default 64)
-- Higher ef_construction = better graph quality, slower builds
-- Must be >= 2 * m. Range: 128-256 is a solid starting point.

-- Speed up builds by fitting the graph in maintenance_work_mem:
SET maintenance_work_mem = '2GB';

-- Parallel index builds (pgvector 0.6.0+):
SET max_parallel_maintenance_workers = 4;
```

### m (connections per node)
Each node in the HNSW graph maintains up to m bidirectional connections to its neighbors. Higher m means more connections, which means the search algorithm has more paths to explore — improving recall but increasing index size and build time proportionally.
- Default: 16
- Practical range: 12-48 for most workloads
- Memory impact: each connection stores a pointer and distance value. Doubling m roughly doubles the non-vector portion of the index.
- When to increase: if recall is below target even with high ef_search. This is the most impactful parameter for index quality according to the original HNSW paper.
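Because m is fixed at build time, raising it means building a new index. A sketch of a zero-downtime swap, assuming the index name from the earlier example (the new index name here is illustrative):

```sql
-- Build a replacement index with higher m without blocking writes.
-- CONCURRENTLY takes longer but holds no exclusive lock on the table.
CREATE INDEX CONCURRENTLY idx_documents_embedding_hnsw_m32 ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 32, ef_construction = 128);

-- Once the new index is valid, retire the old one:
DROP INDEX CONCURRENTLY idx_documents_embedding_hnsw;
```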
### ef_construction (build-time candidate list)
During index construction, ef_construction controls how many candidates are evaluated when placing each vector in the graph. Higher values produce a better-connected graph at the cost of slower builds.
- Default: 64
- Constraint: must be ≥ 2 × m
- Practical range: 128-256 is a strong starting point. Beyond 256, diminishing returns set in quickly while build time continues to increase linearly.
- When to increase: if recall plateaus despite tuning ef_search. A poorly constructed graph cannot be saved by a wider search beam.
### ef_search (query-time beam width)
The only parameter you can change without rebuilding the index. It controls how many candidates the search algorithm evaluates at query time. This is your primary knob for the recall-latency tradeoff.
```sql
-- ef_search: query-time beam width (default 40)
-- Higher = better recall, slower queries. Lower = faster, lower recall.

-- Low recall, fast queries (prototyping, non-critical search):
SET hnsw.ef_search = 20;

-- Balanced (good default for production):
SET hnsw.ef_search = 100;

-- High recall (when accuracy matters more than latency):
SET hnsw.ef_search = 400;
```

| ef_search | Recall | Latency (10 results, 1M vectors) | Use case |
|---|---|---|---|
| 20 | ~85% | 0.5-1 ms | Prototyping, low-stakes search |
| 40 | ~92% | 1-2 ms | Default. Good enough for most apps |
| 100 | ~97% | 2-4 ms | Production RAG, semantic search |
| 200 | ~99% | 4-8 ms | High-accuracy requirements |
| 400 | ~99.5% | 8-15 ms | Near-exact recall needed |
These numbers are approximate and vary with dataset size, dimensionality, and hardware. Measure on your own data — I cannot overstate this. The recall-latency curve is not linear; the first 90% of recall comes cheaply, and the last 5% is where the bill arrives.
## How do you tune IVFFlat parameters?
If you are using IVFFlat — and I trust you have good reason — the tuning parameters are lists (build-time) and probes (query-time).
```sql
-- probes: number of clusters to search at query time (default 1)
-- Higher = better recall, slower queries.

-- Default is far too low for production:
SET ivfflat.probes = 1;    -- ~40% recall. Not useful.
SET ivfflat.probes = 10;   -- ~80% recall. Reasonable starting point.
SET ivfflat.probes = 50;   -- ~95%+ recall. Good for production.
SET ivfflat.probes = 100;  -- ~99% recall. Slower, but thorough.
```

I must be direct: the default probes = 1 is almost never sufficient for production. It searches a single cluster out of potentially thousands, yielding recall around 40%. This is not a tuning parameter left at a reasonable default — it is a default that produces unreasonable results. You must increase it. A starting point of sqrt(lists) is sound — for 1,000 lists, set probes to at least 32.
Unlike HNSW, IVFFlat recall degrades as data drifts. If you are inserting or updating vectors after the index was built, the cluster centroids become stale and recall drops. The remedy is periodic REINDEX, which re-runs k-means clustering on the current data. Plan for this in your maintenance window.
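On PostgreSQL 12 and later, that rebuild need not mean downtime; a sketch, using the index name from the example above:

```sql
-- Re-run the index build on current data without blocking reads or writes:
REINDEX INDEX CONCURRENTLY idx_documents_embedding_ivfflat;

-- If the row count has grown substantially, drop and recreate instead,
-- so the lists parameter can be raised to match the new table size.
```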
## The filtering problem: WHERE clauses and vector similarity
Pure vector similarity search — "find the 10 nearest neighbors in the entire table" — is the easy case. The hard case, and the one that arrives uninvited in every production application, is filtered search: "find the 10 nearest neighbors where the category is 'technical' and the tenant is 42." This is where the real work begins.
```sql
-- The innocent query that becomes a problem at scale:
SELECT id, content, embedding <=> $1 AS distance
FROM documents
WHERE tenant_id = 42
  AND category = 'technical'
ORDER BY embedding <=> $1
LIMIT 10;
```

This is where pgvector performance tuning demands your full attention. The PostgreSQL query planner faces a choice: use the vector index (fast for similarity, ignores the filter) or use a B-tree index on tenant_id (fast for the filter, ignores similarity ordering). It cannot use both simultaneously in a single index scan. The planner must pick one strategy, and it does not always choose wisely.
### What goes wrong with naive filtering
Before pgvector 0.8.0, the planner often chose the scalar filter index and then computed distances via a sequential sort over the filtered rows:
```text
-- What often happens: the planner ignores the vector index
Sort
  Sort Key: (embedding <=> $1)
  Sort Method: top-N heapsort  Memory: 29kB
  ->  Bitmap Heap Scan on documents  (cost=124.50..58431.21 rows=5200 width=552)
        (actual time=1.82..842.53 rows=5187 loops=1)
        Recheck Cond: (tenant_id = 42)
        Filter: (category = 'technical')
        Rows Removed by Filter: 3214
        ->  Bitmap Index Scan on idx_documents_tenant_id
              (cost=0.00..123.20 rows=8401 width=0)
              (actual time=1.21..1.21 rows=8401 loops=1)
```

For selective filters that match a few hundred rows, this works fine — sorting a few hundred distances is fast. But when the filter matches tens of thousands of rows, the sort dominates. And when the filter matches a large fraction of the table, you are back to a near-sequential scan with extra overhead.
### Iterative index scans (pgvector 0.8.0+)
pgvector 0.8.0 introduced iterative index scans, which fundamentally improve filtered search. The algorithm traverses the HNSW (or IVFFlat) index, applies the filter to each candidate, and continues searching deeper until enough filtered results are found.
```sql
-- pgvector 0.8.0+: iterative index scans for filtered queries
-- Searches deeper into the HNSW graph until enough filtered results are found
SET hnsw.iterative_scan = strict_order;

-- Configure the maximum tuples to scan before giving up
SET hnsw.max_scan_tuples = 20000;

-- Now the query can use the HNSW index AND apply the WHERE filter:
SELECT id, content, embedding <=> $1 AS distance
FROM documents
WHERE tenant_id = 42
  AND category = 'technical'
ORDER BY embedding <=> $1
LIMIT 10;
```

Benchmarks show up to 100x improvement in result completeness and up to 5.7x improvement in query performance compared to version 0.7.4. If you are running pgvector for filtered workloads and have not yet upgraded to 0.8.0+, I would consider this a matter requiring immediate attention. It is the single highest-impact change available to you.
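Checking what you are running takes one query; the upgrade, assuming the new pgvector package is already installed on the host, takes one more:

```sql
-- What pgvector version is installed in this database?
SELECT extversion FROM pg_extension WHERE extname = 'vector';

-- After upgrading the pgvector package on the server:
ALTER EXTENSION vector UPDATE TO '0.8.0';
```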
### Partial indexes for categorical filters
When your queries consistently filter on the same column — tenant ID in a multi-tenant application, category in a content system — partial indexes are the practical solution. They put only the relevant vectors in each index, and the search space shrinks accordingly.
```sql
-- When most queries filter by the same column, partial indexes
-- put only the relevant vectors in each index:

-- One index per tenant (for multi-tenant applications):
CREATE INDEX idx_docs_emb_tenant_42 ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 128)
WHERE tenant_id = 42;

-- One index per category (for categorical filtering):
CREATE INDEX idx_docs_emb_technical ON documents
USING hnsw (embedding vector_cosine_ops)
WHERE category = 'technical';
```

The tradeoff is maintenance overhead: each partial index must be created and maintained separately. For 5-10 distinct values, this is manageable. For hundreds, consider table partitioning instead.
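One caution worth a sketch: the planner can use a partial index only when the query's WHERE clause implies the index predicate, so the filter value must match what the index was built with:

```sql
-- This query can use idx_docs_emb_tenant_42: its WHERE clause
-- implies the index predicate (tenant_id = 42).
SELECT id, content, embedding <=> $1 AS distance
FROM documents
WHERE tenant_id = 42
ORDER BY embedding <=> $1
LIMIT 10;

-- A query for tenant_id = 43 cannot use this index; it needs its own
-- partial index, or the planner falls back to another plan.
```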
### Table partitioning for many filter values
When the number of filter values is large, declarative partitioning gives each partition its own HNSW index automatically.
```sql
-- For many filter values, table partitioning gives each partition its own index
CREATE TABLE documents (
    id bigserial,
    tenant_id integer NOT NULL,
    content text,
    embedding vector(1536)
) PARTITION BY LIST (tenant_id);

CREATE TABLE documents_tenant_1 PARTITION OF documents FOR VALUES IN (1);
CREATE TABLE documents_tenant_2 PARTITION OF documents FOR VALUES IN (2);
-- ...

-- Each partition gets its own HNSW index automatically:
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);
```

PostgreSQL prunes irrelevant partitions at query time, so a filtered query only scans the partition (and index) for the matching tenant. This scales to hundreds or thousands of filter values without manual index management.
## Half-precision vectors: halfvec for memory savings
pgvector's halfvec type stores each dimension in 2 bytes instead of 4, cutting storage and index size in half. For most embedding models, the recall impact is below 1%. I appreciate economy of this sort — half the storage, negligible cost. A well-run household does not allocate twice the resources a task requires.
```sql
-- Step 1: Add a halfvec column
ALTER TABLE documents ADD COLUMN embedding_half halfvec(1536);

-- Step 2: Cast existing vectors to half precision
UPDATE documents SET embedding_half = embedding::halfvec(1536);

-- Step 3: Create an HNSW index on the halfvec column
CREATE INDEX idx_docs_embedding_half ON documents
USING hnsw (embedding_half halfvec_cosine_ops)
WITH (m = 16, ef_construction = 128);

-- Step 4: Query using the halfvec column
SELECT id, content, embedding_half <=> $1::halfvec(1536) AS distance
FROM documents
ORDER BY embedding_half <=> $1::halfvec(1536)
LIMIT 10;
```

| Metric | vector (32-bit) | halfvec (16-bit) | Savings |
|---|---|---|---|
| Storage per vector (1536d) | 6,144 bytes | 3,072 bytes | 50% |
| Index size (1M vectors) | ~700 MB | ~350 MB | 50% |
| Index build time | Baseline | ~2x faster | ~50% |
| Query latency | Baseline | Comparable or faster | Cache effects |
| Recall (cosine similarity) | Baseline | >99% of full precision | Negligible loss |
The practical recommendation from Neon is direct, and I concur: use halfvec as your default unless you have measured a meaningful recall difference on your specific data. The 50% storage reduction compounds across table size, index size, backup size, and replication bandwidth. Savings of this kind are not optional optimizations — they are good stewardship.
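Measuring that recall difference on your own data is a single overlap query per sample vector; a sketch, assuming both columns from the migration steps above:

```sql
-- Top-10 overlap between the full-precision and half-precision columns
-- for one query vector. Repeat over 50-100 sample vectors and average.
WITH full_prec AS (
  SELECT id FROM documents
  ORDER BY embedding <=> $1 LIMIT 10
),
half_prec AS (
  SELECT id FROM documents
  ORDER BY embedding_half <=> $1::halfvec(1536) LIMIT 10
)
SELECT count(*) / 10.0 AS overlap_at_10
FROM half_prec
WHERE id IN (SELECT id FROM full_prec);
```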
## Quantization and dimension reduction strategies
Beyond half-precision vectors, two further strategies merit your consideration for reducing the memory and compute cost of vector search.
### Dimension reduction
Modern embedding models based on Matryoshka Representation Learning (MRL) produce embeddings where the first N dimensions carry the most information. Truncating from 1536 to 512 dimensions reduces storage by 3x with 2-5% recall loss for most tasks.
```sql
-- If your embedding model supports it, use a lower-dimensional output.
-- OpenAI text-embedding-3-small: 1536d (default) or 512d (shortened)
-- Cohere embed-v3: 1024d (default) or 256d (shortened)

-- Matryoshka Representation Learning (MRL) models produce embeddings
-- where the first N dimensions carry the most information.
-- Truncating from 1536d to 512d loses ~2-5% recall but reduces
-- storage and index size by 3x.

-- Store the truncated embedding directly:
CREATE TABLE documents_compact (
    id bigserial PRIMARY KEY,
    content text,
    embedding vector(512)  -- or halfvec(512) for another 2x reduction
);
```

The compounding effect is where this becomes genuinely satisfying. Combining dimension reduction (1536 to 512) with halfvec (32-bit to 16-bit) reduces per-vector storage from 6,144 bytes to 1,024 bytes — a 6x reduction. Index sizes shrink proportionally, meaning indexes that could not fit in memory at full precision may now reside there quite comfortably.
### Binary quantization
For extreme compression, binary quantization represents each dimension as a single bit. This is a 32x reduction from full-precision vectors. The recall loss is substantial for direct search, but binary vectors excel as a first-pass filter: use Hamming distance to find the top 100-1,000 candidates, then re-rank those candidates using full-precision vectors.
```sql
-- Binary quantization: 1 bit per dimension
-- Extreme compression (1536d: 6,144 bytes -> 192 bytes = 32x reduction)
-- Use for candidate generation, then re-rank with full vectors

-- Store a bit vector alongside the full vector:
ALTER TABLE documents ADD COLUMN embedding_bit bit(1536);

-- Generate the bit vector (positive dimensions -> 1, others -> 0).
-- pgvector 0.7.0+ provides binary_quantize() for exactly this:
UPDATE documents SET embedding_bit = binary_quantize(embedding);

-- Index the bit vector with Hamming distance:
CREATE INDEX ON documents USING hnsw (embedding_bit bit_hamming_ops);
```

This two-stage approach — coarse search with binary vectors, fine re-ranking with full vectors — is a standard technique in information retrieval. It lets you keep full recall while dramatically reducing the memory footprint of the primary index.
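The two-stage query itself, as a sketch (the candidate count of 200 is a tunable assumption; larger values recover more recall at higher re-ranking cost):

```sql
-- Stage 1: coarse candidate generation via Hamming distance on bit vectors
-- Stage 2: re-rank the candidates with full-precision cosine distance
WITH candidates AS (
  SELECT id
  FROM documents
  ORDER BY embedding_bit <~> binary_quantize($1)  -- Hamming distance (pgvector 0.7.0+)
  LIMIT 200
)
SELECT d.id, d.content, d.embedding <=> $1 AS distance
FROM documents d
JOIN candidates c USING (id)
ORDER BY d.embedding <=> $1
LIMIT 10;
```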
## How do you monitor vector query performance?
EXPLAIN (ANALYZE, BUFFERS) is the starting point for individual queries. For a view of the broader workload, pg_stat_statements reveals which vector queries consume the most total time. One does not manage what one does not measure.
```sql
-- Always use EXPLAIN (ANALYZE, BUFFERS) for vector queries
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, content, embedding <=> $1 AS distance
FROM documents
ORDER BY embedding <=> $1
LIMIT 10;

-- Things to check:
-- 1. Is the HNSW/IVFFlat index being used? (Index Scan vs Seq Scan)
-- 2. Buffers: shared hit vs read ratio (is the index in memory?)
-- 3. Actual rows vs expected rows (planner accuracy)
-- 4. Total execution time (baseline for tuning)
```

The key indicators in a vector query plan:
```text
-- A healthy vector query plan:
Index Scan using idx_documents_embedding_hnsw on documents
  (cost=228.42..292.71 rows=10 width=552)
  (actual time=1.82..1.94 rows=10 loops=1)
  Order By: (embedding <=> '[0.1, 0.2, ...]'::vector)
  Buffers: shared hit=1847
Planning Time: 0.12 ms
Execution Time: 2.08 ms

-- Red flags to watch for:
-- Seq Scan: index not used (missing ORDER BY ... LIMIT, wrong operator class)
-- Buffers: shared read >> hit: index not in memory
-- actual time >> 10ms for 10 results: ef_search too high or index needs tuning
```

- Index Scan vs Seq Scan: if you see a sequential scan on a large table, the index is not being used. Verify you have ORDER BY ... LIMIT and the correct operator class.
- Buffers: shared hit vs read: a high read count means the index is not in memory. Increase shared_buffers or reduce the index size (halfvec, dimension reduction).
- Actual time: for 10 results from 1M vectors, expect 1-5ms with HNSW. Over 10ms suggests ef_search is too high, the index does not fit in memory, or the query is falling back to a non-vector index.
### Workload-level monitoring
```sql
-- Track vector query performance with pg_stat_statements
SELECT
    substring(query, 1, 80) AS query_preview,
    calls,
    round(mean_exec_time::numeric, 2) AS avg_ms,
    round(total_exec_time::numeric / 1000, 1) AS total_sec,
    rows
FROM pg_stat_statements
WHERE query LIKE '%<=>%' OR query LIKE '%<->%' OR query LIKE '%<#>%'
ORDER BY total_exec_time DESC
LIMIT 10;
```

Sort by total_exec_time rather than mean_exec_time. A 5ms vector query that runs 100,000 times per hour costs more aggregate time than a 500ms query that runs twice. pg_stat_statements gives you the full picture.
## Connection pooling considerations for vector workloads
Vector queries consume more memory per connection than typical OLTP queries — a detail that deserves respect, not dismissal. Each connection holds the query vector in session memory, and HNSW traversal allocates working memory proportional to ef_search. At 1536 dimensions with a 32-bit vector, the query vector alone is 6KB. Small in isolation, but it compounds across hundreds of concurrent connections alongside the traversal state.
```sql
-- Vector queries are memory-intensive:
-- - Each connection holds the query vector in session memory
-- - HNSW traversal allocates working memory proportional to ef_search
-- - 1536-dimension vectors = ~6KB per query vector alone

-- Size your pool conservatively:
-- Typical formula: pool_size = (CPU cores * 2) + effective_spindle_count
-- For vector workloads, lean toward the lower end.

-- PgBouncer transaction mode works well with pgvector:
-- No prepared statements are needed for vector queries,
-- so the usual PgBouncer caveats around prepared statements
-- do not apply here.
```

The practical guidance: size your connection pool conservatively and let the pooler (PgBouncer, pgcat, or your application's built-in pool) multiplex connections. A pool of 20-30 connections serving 200 application threads is far more efficient for vector workloads than 200 direct connections, because each vector query completes quickly (1-5ms) and releases the connection back to the pool.
PgBouncer in transaction mode works well with pgvector. Vector queries do not depend on session-level prepared statements or connection-level state, so the usual PgBouncer caveats about prepared statement tracking do not apply. If you need to set hnsw.ef_search per query, use SET LOCAL within a transaction or set_config('hnsw.ef_search', '100', true).
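In transaction mode, that per-query setting looks like this: SET LOCAL scopes the change to the transaction, which PgBouncer maps to a single server connection.

```sql
BEGIN;
-- Applies only within this transaction; safe under PgBouncer transaction mode
SET LOCAL hnsw.ef_search = 200;
SELECT id, content, embedding <=> $1 AS distance
FROM documents
ORDER BY embedding <=> $1
LIMIT 10;
COMMIT;
```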
## How do you benchmark recall vs latency?
Tuning pgvector without measuring recall is guessing, and I am not in the business of guessing. The only way to know whether your index configuration delivers acceptable quality is to measure it directly.
```sql
-- Measure recall: compare approximate results against exact results

-- Step 1: Get the exact top-10 (no index, sequential scan)
-- (Plain SET, not SET LOCAL: SET LOCAL has no effect outside a transaction.)
SET enable_indexscan = off;
SET enable_bitmapscan = off;
SELECT id, embedding <=> $1 AS distance
FROM documents
ORDER BY embedding <=> $1
LIMIT 10;

-- Step 2: Get the approximate top-10 (with index)
RESET enable_indexscan;
RESET enable_bitmapscan;
SELECT id, embedding <=> $1 AS distance
FROM documents
ORDER BY embedding <=> $1
LIMIT 10;

-- Step 3: Count how many IDs appear in both result sets.
-- recall = (overlap count) / 10
-- Repeat with 50-100 random query vectors for a reliable estimate.
```

The procedure:
- Select 50-100 random vectors from your dataset as query vectors.
- For each query vector, get the exact top-K results (disable index scans, brute-force search).
- For each query vector, get the approximate top-K results (with your index).
- Count the overlap between the two result sets. Recall = overlap / K.
- Average across all query vectors for a reliable estimate.
Run this benchmark after every parameter change. The relationship between ef_search and recall is not linear — small changes near the top of the curve can have outsized effects. Record both recall and latency at each setting to map your tradeoff curve. The numbers will tell you what no amount of intuition can.
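The procedure can be automated in a single DO block; a sketch, assuming the documents table and HNSW index from earlier examples, sampling 100 stored vectors as queries:

```sql
-- Average recall@10 over 100 sampled query vectors.
DO $$
DECLARE
  q record;
  hits int;
  total numeric := 0;
  n int := 0;
BEGIN
  CREATE TEMP TABLE exact_top (id bigint) ON COMMIT DROP;
  FOR q IN SELECT embedding FROM documents ORDER BY random() LIMIT 100
  LOOP
    -- Exact top-10: disable index scans for this statement's plan
    PERFORM set_config('enable_indexscan', 'off', true);
    TRUNCATE exact_top;
    INSERT INTO exact_top
      SELECT d.id FROM documents d
      ORDER BY d.embedding <=> q.embedding LIMIT 10;

    -- Approximate top-10 with the HNSW index, then count the overlap
    PERFORM set_config('enable_indexscan', 'on', true);
    SELECT count(*) INTO hits
    FROM (SELECT d.id FROM documents d
          ORDER BY d.embedding <=> q.embedding LIMIT 10) approx
    WHERE approx.id IN (SELECT id FROM exact_top);

    total := total + hits / 10.0;
    n := n + 1;
  END LOOP;
  RAISE NOTICE 'average recall@10: %', round(total / n, 3);
END $$;
```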
A recall target of 95%+ is appropriate for most production applications. For search and recommendation systems where the top result matters disproportionately, target 99%+. For exploratory or low-stakes search, 85-90% may be acceptable if the latency savings justify it. Know your threshold before you begin tuning — it is far easier to navigate when you know where you are going.
## When pgvector is not enough
I should be forthcoming here, because glossing over limits would be a disservice to you. pgvector is a PostgreSQL extension running on a single node. It inherits PostgreSQL's strengths — transactional consistency, SQL filtering, operational simplicity — and its limitations. At a certain scale, those limitations become binding constraints.
```sql
-- Signs pgvector may not be enough for your workload:
--
-- 1. Index no longer fits in memory
--    Check: SELECT pg_size_pretty(pg_relation_size('idx_documents_embedding_hnsw'));
--    If the index is larger than shared_buffers, queries will hit disk.
--
-- 2. Insert throughput degrades
--    HNSW inserts slow down as the graph grows. If you are ingesting
--    >10,000 vectors/second sustained, index maintenance becomes a bottleneck.
--
-- 3. You need sub-millisecond p99 latency at >1M QPS
--    pgvector is single-node PostgreSQL. At extreme throughput,
--    you need horizontal sharding or a distributed vector engine.
--
-- 4. You need real-time index updates with zero recall degradation
--    HNSW insert quality degrades slightly over time compared to
--    a full rebuild. Periodic REINDEX (or REINDEX CONCURRENTLY on
--    PostgreSQL 12+) helps, but adds operational overhead.
--
-- 5. Your vectors exceed 16,000 dimensions
--    pgvector's hard limit. Rare in practice, but a wall if you hit it.
```

That said — and I say this with confidence earned from the benchmarks above — most applications never reach these thresholds. If you have fewer than 10 million vectors, a single well-tuned PostgreSQL instance with pgvector will serve you admirably. The operational simplicity of keeping vectors in the same database as your relational data — one backup strategy, one replication topology, one set of credentials — is a genuine advantage that dedicated vector databases cannot replicate. A household that runs well with one capable system has no business hiring a second.
The decision to move to a dedicated vector database (Pinecone, Weaviate, Qdrant, Milvus) should be driven by measured requirements, not anticipated ones. "We might need to handle a billion vectors someday" is not a reason to add a second database to your architecture today. Solve the problem you have, not the one you imagine.
## How Gold Lapel optimizes vector query patterns
Gold Lapel sits between your application and PostgreSQL as a proxy. It observes every query before it reaches the database — including your vector similarity searches.
Vector workloads have a property that makes them particularly amenable to proxy-level optimization: the same or similar query vectors appear repeatedly. In a RAG pipeline, users ask semantically similar questions. In a recommendation engine, popular items are queried against the same embedding space. In semantic search, query patterns cluster around common topics. Repetition, once identified, is an opportunity.
Gold Lapel identifies these patterns. When the same embedding is searched repeatedly and the underlying data has not changed, the results can be served from the proxy without reaching the database. When filter patterns emerge across your traffic — the same WHERE clause appearing across many vector queries — Gold Lapel surfaces this as a signal that a partial index would improve performance.
The database still handles the vector search. Gold Lapel reduces how often it needs to. Your application code and pgvector configuration remain untouched — the optimization happens at the traffic layer, not the data layer.