How to Detect Missing Indexes Automatically with pg_qualstats
Your database has been keeping a detailed account of every predicate it evaluates — which columns, how often, how many rows. It simply hasn't been sharing. Allow me to remedy that.
What pg_qualstats does and why it matters
Every query that arrives at your PostgreSQL database contains predicates — the conditions in WHERE clauses, JOIN conditions, and HAVING clauses that filter rows. These predicates reveal exactly which columns your workload depends on. If a column appears in thousands of predicates per hour and has no index, that is a missing index. The evidence is right there in the query stream. PostgreSQL simply does not collect it by default — an oversight I have always found rather puzzling, given that the information is passing through its hands regardless.
pg_qualstats fills that gap. It hooks into the query executor and records metadata about each predicate it encounters: the table, the column, the operator, how many times the predicate executed, and how many rows it filtered. The result is a set of views that map your actual workload to specific column access patterns.
Where pg_stat_statements tells you which queries are slow, pg_qualstats tells you which columns need indexes. Different questions entirely. Used together, they form a complete picture: identify the expensive queries, then identify the missing indexes that make them expensive. One diagnoses the symptom. The other finds the cause.
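As a sketch of that combined workflow — assuming pg_stat_statements is installed alongside pg_qualstats, and using the PostgreSQL 13+ column name mean_exec_time (older releases call it mean_time) — the two can be joined on queryid:

```sql
-- Sketch: the slowest statements that contain tracked predicates,
-- with the table and column each predicate touches.
-- Assumes pg_stat_statements is also loaded.
SELECT
    s.mean_exec_time,
    qs.lrelid::regclass AS table_name,
    a.attname           AS column_name,
    s.query
FROM pg_stat_statements s
JOIN pg_qualstats qs ON qs.queryid = s.queryid
JOIN pg_attribute a
     ON a.attrelid = qs.lrelid
    AND a.attnum   = qs.lattnum
ORDER BY s.mean_exec_time DESC
LIMIT 10;
```

Column names in the raw pg_qualstats view (lrelid, lattnum) vary little across versions, but verify against your installed release before relying on this in a script.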
The extension also tracks correlated columns — columns that frequently appear together in the same query's predicates. This is directly useful for composite index design, where knowing which columns are queried together determines the optimal column order.
How to install and configure pg_qualstats
pg_qualstats is a third-party extension maintained by the PoWA team. Because it hooks into the query executor, it must be loaded at server start via shared_preload_libraries. This requires a PostgreSQL restart — I'm afraid there is no way around it.
-- 1. Add to postgresql.conf (requires restart)
shared_preload_libraries = 'pg_qualstats'
-- 2. Restart PostgreSQL, then create the extension
CREATE EXTENSION pg_qualstats;
-- 3. Verify it is active
SELECT * FROM pg_qualstats LIMIT 1;

After the restart, the extension begins sampling queries immediately. The default sample rate is 1 / max_connections, which means on a server with 100 connections, roughly 1 in 100 queries will have its predicates recorded. This provides statistically representative data with negligible overhead.
For a thorough audit — particularly on staging or during a dedicated analysis window — you can increase the sample rate to capture every query:
-- Check current sample rate
SHOW pg_qualstats.sample_rate;
-- For a low-traffic staging server, capture everything
ALTER SYSTEM SET pg_qualstats.sample_rate = 1.0;
SELECT pg_reload_conf();
-- For a high-throughput production server, sample 10%
ALTER SYSTEM SET pg_qualstats.sample_rate = 0.1;
SELECT pg_reload_conf();

The full set of configuration parameters controls what gets tracked and how much memory is allocated:
-- Key parameters in postgresql.conf
-- pg_qualstats.enabled = true -- master on/off switch
-- pg_qualstats.track_constants = true -- record literal values per predicate
-- pg_qualstats.max = 1000 -- max tracked predicates (shared memory)
-- pg_qualstats.sample_rate = -1 -- fraction of queries to sample
-- -- -1 = automatic: 1/max_connections
-- pg_qualstats.track_pg_catalog = false -- skip system catalog predicates
-- pg_qualstats.resolve_oids = false -- resolve table/column names at query time
-- -- easier analysis, more memory (624 vs 176 bytes)
Two parameters deserve special attention. pg_qualstats.max sets the upper limit on tracked predicates. The default of 1000 is sufficient for many workloads, but if your application generates a wide variety of query patterns, increase it. When the limit is reached, new predicates are silently dropped — no eviction, no warning, no apology. Second, pg_qualstats.resolve_oids trades memory for convenience: enabling it stores human-readable table and column names instead of raw OIDs, but increases per-entry memory from 176 bytes to 624 bytes. A reasonable trade in most circumstances. Reading OIDs in the small hours is not how anyone should spend their evening.
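Since both parameters size shared-memory structures, they are read at server start — a reload is not enough. A sketch of raising them (treat the exact restart requirement as an assumption to verify against your installed release):

```sql
-- Both settings affect shared memory allocation, so plan for a restart
ALTER SYSTEM SET pg_qualstats.max = 10000;
ALTER SYSTEM SET pg_qualstats.resolve_oids = true;
-- then: restart PostgreSQL for the changes to take effect
```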
Finding missing indexes with pg_qualstats views
After pg_qualstats has been collecting data for a representative period — a full business day is a reasonable minimum — you can query its views to find missing indexes. There are three views, each at a different level of analysis. If you'll permit me, we shall start at the broadest and work inward.
pg_qualstats_pretty: the raw predicate report
The pg_qualstats_pretty view joins the raw statistics with catalog information to produce human-readable output. This is the starting point for any analysis: which predicates execute most often, and how many rows do they filter?
-- Most frequently executed predicates, with human-readable names
SELECT
left_schema,
left_table,
left_column,
operator,
execution_count,
nbfiltered
FROM pg_qualstats_pretty
ORDER BY execution_count DESC
LIMIT 20;
-- Sample output:
-- left_schema | left_table | left_column | operator | execution_count | nbfiltered
-- ------------+------------+-------------+----------+-----------------+-----------
-- public | orders | customer_id | = | 184720 | 0
-- public | orders | status | = | 92104 | 8741632
-- public | events | created_at | > | 47218 | 31204800
-- public | users | email | = | 38412 | 0

The nbfiltered column is where the story is. A predicate with a high execution_count and a high nbfiltered is doing substantial work — filtering millions of rows across thousands of executions. If that column has no index, every execution requires a sequential scan or a less efficient access path. The combination of high frequency and high filter count is the clearest signal that an index is needed.
Conversely, a predicate with nbfiltered = 0 is already using an index effectively. The predicate executes, but no rows are filtered out by a sequential scan — the index is doing its job.
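To surface the strongest candidates in one pass, the two signals can be combined — a sketch against the pg_qualstats_pretty columns shown above:

```sql
-- Predicates doing the most cumulative filtering work, and therefore
-- most likely to repay an index
SELECT
    left_schema, left_table, left_column, operator,
    execution_count,
    nbfiltered
FROM pg_qualstats_pretty
WHERE nbfiltered > 0
ORDER BY nbfiltered DESC, execution_count DESC
LIMIT 20;
```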
pg_qualstats_indexes: the columns that have been neglected
The pg_qualstats_indexes view narrows the analysis further. It cross-references the collected predicates against existing indexes and reports only those columns where no matching index exists.
-- Columns used in predicates that lack a matching index
SELECT
relid::regclass AS table_name,
attnames,
possible_types,
execution_count
FROM pg_qualstats_indexes
ORDER BY execution_count DESC;
-- Sample output:
-- table_name | attnames | possible_types | execution_count
-- -----------+----------------+----------------+----------------
-- orders | {status} | {btree} | 92104
-- events | {created_at} | {btree} | 47218
-- sessions | {user_id,ip} | {btree} | 28417

This view does the detective work for you — and does it rather well, I should add. Instead of manually checking each high-frequency predicate against pg_indexes, it reports only the gaps. The possible_types column suggests which index type would be appropriate — typically btree for equality and range predicates, though it may suggest hash for pure equality patterns.
The attnames column can contain multiple columns in an array, indicating that the extension detected these columns being queried together. This is a direct recommendation for a composite index rather than separate single-column indexes.
Identifying composite index candidates
Beyond the built-in views, you can query the raw statistics to find columns that are frequently queried together within the same query. These co-occurrences are prime candidates for composite indexes.
-- Find columns frequently queried together (composite index candidates)
SELECT
left_schema || '.' || left_table AS table_name,
array_agg(DISTINCT left_column) AS columns_queried_together,
count(*) AS co_occurrence_count
FROM pg_qualstats_pretty
WHERE left_table IS NOT NULL
GROUP BY left_schema, left_table, queryid
HAVING count(DISTINCT left_column) > 1
ORDER BY co_occurrence_count DESC
LIMIT 10;

If orders.status and orders.created_at appear together in thousands of queries, a composite index on (status, created_at) will likely outperform two separate single-column indexes. The equality-first rule applies: place the equality-filtered column before the range-filtered column.
The built-in index advisor: opinions, earned
Starting with version 2.0, pg_qualstats includes a function called pg_qualstats_index_advisor() that goes beyond listing missing indexes. It performs a global optimization: given all tracked predicates, it computes a minimal set of indexes that would cover the highest-impact predicates. Not speculation — arithmetic.
-- Global index recommendations (pg_qualstats 2.0+)
SELECT pg_qualstats_index_advisor(
min_filter := 1000, -- minimum rows filtered
min_selectivity := 30 -- minimum selectivity percentage
);
-- Returns JSON with two keys:
-- {
-- "indexes": [
-- "CREATE INDEX ON public.orders USING btree (status)",
-- "CREATE INDEX ON public.events USING btree (created_at)",
-- "CREATE INDEX ON public.sessions USING btree (user_id, ip)"
-- ],
-- "unoptimised": [
-- "orders.metadata ~~* ?"
-- ]
-- }

The function accepts two parameters that control which predicates are considered. min_filter (default 1000) sets the minimum number of rows a predicate must have filtered to be included. min_selectivity (default 30) sets the minimum selectivity percentage. Lower these thresholds to include less impactful predicates; raise them to focus on the most significant gaps.
The JSON output contains two keys. The indexes array holds ready-to-execute CREATE INDEX statements. The unoptimised array lists predicates that cannot be addressed with a standard index — typically pattern matching operators like ~~* (ILIKE) that would require a trigram index or a different approach entirely.
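For the pattern-matching entries, a trigram index is the usual remedy. A sketch for the orders.metadata ~~* predicate shown above, assuming the pg_trgm extension is available (the index name is illustrative):

```sql
-- ILIKE and LIKE '%...%' predicates cannot use a btree index;
-- a trigram GIN index can serve them
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX CONCURRENTLY idx_orders_metadata_trgm
    ON orders USING gin (metadata gin_trgm_ops);
-- afterwards, predicates such as metadata ILIKE '%search%' become indexable
```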
To work with the results programmatically, parse the JSON into individual rows:
-- Parse the JSON into individual CREATE INDEX statements
SELECT v::text AS suggested_index
FROM json_array_elements(
pg_qualstats_index_advisor(min_filter := 500)->'indexes'
) v
ORDER BY v::text;
-- And see which predicates cannot be optimized:
SELECT v::text AS unoptimised_predicate
FROM json_array_elements(
pg_qualstats_index_advisor(min_filter := 500)->'unoptimised'
) v;

The advantage of a global advisor over per-query index analysis is consolidation. Per-query tools might suggest six overlapping indexes on the same table — one for each query variant, each blissfully unaware of the others. The global advisor recognizes that a single composite index can serve multiple predicates and suggests accordingly. Fewer indexes mean less write overhead, less storage, and less maintenance. A well-run household does not employ three butlers where one suffices.
Testing recommendations with HypoPG before committing
The index advisor produces CREATE INDEX statements. The temptation is to run them all and call it an afternoon. I would counsel patience. Each index consumes disk space, adds write overhead to every INSERT, UPDATE, and DELETE, and may not actually be used by the planner for your specific queries. You want to verify that the planner would use a proposed index before paying the cost of creating it.
HypoPG solves this. It creates hypothetical indexes — virtual indexes that exist only in the planner's metadata. They consume no disk, no I/O, and no maintenance overhead. The planner considers them when generating plans, so you can see whether a proposed index would actually be chosen.
-- Install HypoPG (no shared_preload_libraries needed)
CREATE EXTENSION hypopg;

Unlike pg_qualstats, HypoPG does not require shared_preload_libraries. A simple CREATE EXTENSION is sufficient, and hypothetical indexes are session-local — they are only visible to the session that created them and vanish when it disconnects.
The workflow combines pg_qualstats recommendations with HypoPG validation:
-- Step 1: Get the suggested index from pg_qualstats
-- pg_qualstats_index_advisor says:
-- "CREATE INDEX ON public.orders USING btree (status)"
-- Step 2: Create a hypothetical version instead
SELECT * FROM hypopg_create_index(
'CREATE INDEX ON public.orders USING btree (status)'
);
-- indexrelid | indexname
-- ------------+-----------------------------
-- 205301 | <205301>btree_orders_status
-- Step 3: Run EXPLAIN (not ANALYZE) to see if the planner would use it
EXPLAIN
SELECT * FROM orders WHERE status = 'pending';
-- Index Scan using <205301>btree_orders_status on orders
-- (cost=0.04..8.06 rows=1 width=64)
-- Index Cond: (status = 'pending')
-- Step 4: Compare with the current plan (no hypothetical indexes)
SELECT hypopg_drop_index(205301);
EXPLAIN
SELECT * FROM orders WHERE status = 'pending';
-- Seq Scan on orders (cost=0.00..18334.00 rows=50000 width=64)
-- Filter: (status = 'pending')

One important detail: hypothetical indexes only work with plain EXPLAIN, not EXPLAIN ANALYZE. Since the index does not physically exist, it cannot be used during actual query execution. EXPLAIN alone shows whether the planner would use the index, which is the information you need.
If the planner chooses the hypothetical index and the estimated cost drops substantially, the real index is worth creating. If the planner ignores it — perhaps because the table is small enough for a sequential scan, or a better index already exists — you have saved yourself the cost of an unnecessary index. Better to discover this in a planning exercise than after the index has been occupying disk for six months.
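HypoPG also lets you inspect what it is currently holding. A sketch — the function names below follow recent HypoPG releases and may differ slightly in older versions; the OID is the illustrative one from the example above:

```sql
-- List the hypothetical indexes defined in this session
SELECT * FROM hypopg_list_indexes();
-- Estimate how large a hypothetical index would be if built for real
SELECT pg_size_pretty(hypopg_relation_size(205301));
-- Remove all hypothetical indexes at once
SELECT hypopg_reset();
```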
Automating the workflow
The full pipeline — collect predicate statistics, generate recommendations, test with hypothetical indexes, create the real indexes — can be scripted into a repeatable process. And it should be. The value of this approach compounds with repetition.
-- Full pipeline: pg_qualstats -> index_advisor -> HypoPG -> EXPLAIN
-- 1. Get recommendations
SELECT v::text AS idx
FROM json_array_elements(
pg_qualstats_index_advisor(min_filter := 500)->'indexes'
) v;
-- 2. For each recommendation, create a hypothetical index
SELECT * FROM hypopg_create_index(
'CREATE INDEX ON public.orders USING btree (status)'
);
-- 3. Test your critical queries with EXPLAIN
EXPLAIN SELECT count(*) FROM orders WHERE status = 'pending';
-- 4. If the plan improves, create the real index
-- Use CONCURRENTLY to avoid locking the table
CREATE INDEX CONCURRENTLY idx_orders_status
ON orders USING btree (status);
-- 5. Clean up hypothetical indexes and reset stats
SELECT hypopg_reset();
SELECT pg_qualstats_reset();

For ongoing monitoring rather than one-time analysis, establish a periodic review cadence:
-- Schedule a periodic review of predicate statistics
-- Run this weekly or after significant workload changes
-- 1. Check current statistics volume
SELECT count(*) AS tracked_predicates FROM pg_qualstats;
-- 2. Review top unindexed predicates
SELECT
qs.relid::regclass AS table_name,
qs.attnames,
qs.execution_count,
pg_size_pretty(pg_relation_size(qs.relid)) AS table_size
FROM pg_qualstats_indexes qs
ORDER BY qs.execution_count DESC
LIMIT 10;
-- 3. After creating indexes, reset and start a new collection window
SELECT pg_qualstats_reset();

A reasonable cadence is weekly or after significant application changes — new features, schema migrations, or traffic pattern shifts. Each review cycle follows the same steps: check what pg_qualstats has collected, evaluate the recommendations, test with HypoPG, create the indexes that pass validation, then reset the statistics for a clean collection window.
When creating indexes on production tables, always use CREATE INDEX CONCURRENTLY. A standard CREATE INDEX acquires a SHARE lock on the table, blocking all writes until the index is built. On a table with millions of rows, this can mean minutes of write downtime. CONCURRENTLY builds the index without blocking concurrent DML, at the cost of taking longer and requiring two passes over the table.
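One caveat worth scripting around: if a CONCURRENTLY build fails or is cancelled, it leaves an INVALID index behind that adds write overhead without serving any reads. Checking for leftovers is straightforward:

```sql
-- A failed CREATE INDEX CONCURRENTLY leaves an invalid index behind;
-- find any such leftovers
SELECT indexrelid::regclass AS index_name
FROM pg_index
WHERE NOT indisvalid;
-- drop each leftover with DROP INDEX CONCURRENTLY, then retry the build
```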
After creating each index, verify that the planner actually uses it and measure the before-and-after performance difference:
-- Before creating the index, record baseline query performance
-- Pick a representative query that uses the missing-index column
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE status = 'pending' AND created_at > '2026-01-01';
-- Create the index
CREATE INDEX CONCURRENTLY idx_orders_status_created
ON orders (status, created_at);
-- After creation, run the same query again
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE status = 'pending' AND created_at > '2026-01-01';
-- Compare: Seq Scan -> Index Scan, Buffers read count, execution time

Beyond the immediate EXPLAIN comparison, track index usage over time through pg_stat_user_indexes:
-- After creating an index, verify it is being used
-- Wait at least an hour for representative traffic, then check:
SELECT
schemaname,
relname AS table_name,
indexrelname AS index_name,
idx_scan,
idx_tup_read,
idx_tup_fetch,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE relname = 'orders'
ORDER BY idx_scan DESC;
-- An index with idx_scan = 0 after a full traffic cycle is unused.
-- Either the planner found a better path or the predicate pattern changed.

An index that shows idx_scan = 0 after a full cycle of representative traffic is not earning its keep. It consumes disk space and adds write overhead without serving any reads. Drop it, or investigate why the planner is not choosing it — the table statistics may need updating, or the index may not match the predicate patterns as closely as expected. Staff who do not serve may be excused.
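When you do decide to excuse an unused index, the drop itself need not block anyone — the index name below is the illustrative one from earlier:

```sql
-- A plain DROP INDEX takes an exclusive lock on the table;
-- CONCURRENTLY avoids blocking concurrent reads and writes
DROP INDEX CONCURRENTLY idx_orders_status;
```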
For teams that prefer a visual interface, the PoWA project integrates pg_qualstats with historical storage and a web dashboard. It adds trend analysis — you can see how predicate patterns change over time — and integrates HypoPG directly into its suggestion workflow. PoWA is not required, but it reduces the manual effort for teams that run this process regularly.
Interpreting the results: what to act on and what to leave alone
Not every missing index should be created. Allow me to be direct about this. The index advisor and the pg_qualstats_indexes view report every unindexed predicate above the threshold, but some of those predicates are perfectly fine without an index. Here is how to prioritize.
Create the index when:
- High execution count and high filter count. A predicate that runs 50,000 times per day and filters millions of rows each time is a clear candidate. The cumulative I/O savings from an index are substantial.
- The table is large. An index on a 100-row lookup table is pointless — a sequential scan is faster. An index on a 10-million-row table where the predicate selects 0.1% of rows is essential.
- HypoPG confirms the planner would use it. If the planner ignores the hypothetical index, the real index would also be ignored. Do not create it.
- The write overhead is acceptable. Every index slows down INSERTs, UPDATEs, and DELETEs on the indexed columns. For read-heavy tables, the trade-off is almost always favorable. For write-heavy tables with infrequent reads, the cost may exceed the benefit.
Skip the index when:
- The table is small. PostgreSQL's planner is correct to prefer sequential scans on small tables. The overhead of an index traversal exceeds the cost of reading a few hundred rows directly.
- The predicate has low selectivity. An index on a boolean column with 50/50 distribution is rarely useful. The planner will often choose a sequential scan because the index would return too many rows to be efficient.
- The predicate appears in the "unoptimised" list. Pattern matching operators (LIKE '%search%', ILIKE) and certain function calls cannot be served by a standard btree index. These require specialized solutions: trigram indexes for pattern matching, expression indexes for function-based predicates, or full-text search for text queries.
- An existing composite index could be extended. Instead of creating a new single-column index, check whether an existing composite index on the same table could be extended to cover the new predicate. Fewer indexes are better than more indexes, all else equal.
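For the low-selectivity case in particular, a partial index is sometimes the middle path: if queries only ever ask for the rare value, index just those rows. A sketch, with illustrative names:

```sql
-- A 50/50 boolean or status column is a poor full-index candidate,
-- but indexing only the rows the workload filters for — here, a rare
-- status value combined with a range column — can work very well
CREATE INDEX CONCURRENTLY idx_orders_pending_created
    ON orders (created_at)
    WHERE status = 'pending';
```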
Common pitfalls — and I have seen each of these in the field
A few recurring mistakes when working with pg_qualstats that are worth knowing about before you begin. I mention them not to alarm but to spare you the discovery.
Analyzing too short a time window
If you install pg_qualstats, run it for an hour, and act on its recommendations, you have sampled only a fraction of your workload. Batch jobs that run nightly, weekly reports, and end-of-month aggregations will not appear in your data. Collect for at least one full business cycle — a week is reasonable — before acting on the results. Otherwise, you optimize for daytime OLTP traffic and miss the overnight analytics queries that hammer the same tables differently.
Ignoring the sample rate
The default sample rate of 1 / max_connections is designed for production safety, not analytical completeness. On a server with 200 connections, only 0.5% of queries are sampled. For a one-time audit, set pg_qualstats.sample_rate = 1.0 to capture every predicate. On high-throughput production systems, run the full-capture audit on a staging replica with production-representative traffic instead.
Creating every suggested index
The index advisor outputs a list. The temptation is to run every CREATE INDEX statement on the list. I understand the impulse. Resist it. Each index adds write overhead (typically 5-15% per index for write-heavy tables), consumes disk space, and increases backup and replication transfer sizes. Evaluate each suggestion against your write workload and confirm with HypoPG that the planner would actually use it. An index is a commitment, not a suggestion.
Forgetting to reset after schema changes
After creating new indexes, the predicate statistics from before the index existed are still in pg_qualstats. A predicate that was unindexed yesterday may be fully indexed today, but the old statistics still show it as a missing index. Always call pg_qualstats_reset() after making index changes, then collect fresh data to see the current state.
Where Gold Lapel fits in this picture
I should be forthcoming. The approach outlined above works. pg_qualstats surfaces exactly the right data: which columns your workload filters on, how often, and whether indexes exist to serve those predicates. That insight — that predicate-level statistics are the key to index optimization — is the foundation of any serious indexing strategy. Everything in this guide is something I would recommend to any team, regardless of what else they use.
What follows from that insight is where the manual effort accumulates. You review the recommendations, test them with HypoPG, create indexes concurrently to avoid locking production, monitor whether they actually get used, and revisit the analysis periodically as your workload evolves. Each step is straightforward individually. In aggregate, it is an ongoing operational commitment — one that, in my experience, teams attend to diligently for the first few months and then gradually less so.
Gold Lapel performs this entire cycle continuously at the proxy level. It observes query predicates as they pass through, evaluates index candidates against the full workload — reads and writes together — creates indexes concurrently in the background, and monitors their impact over time. Indexes that stop serving their purpose are flagged for removal. The same analysis that pg_qualstats surfaces once per review cycle, Gold Lapel acts on continuously. I would be a poor butler indeed if I neglected to mention that trained staff are available for the work.
For teams running self-managed PostgreSQL who want to understand their predicate landscape before committing to anything, pg_qualstats is an excellent starting point. It provides the raw evidence that makes index decisions defensible rather than speculative. And that is worth a great deal on its own.