PostgreSQL Missing Index Detection: Automatic Recommendations
Your database has been cataloging its own performance characteristics since the moment you started it. The data was always there, waiting for someone to ask the right questions.
The signals your database is already sending
I should tell you something encouraging: your database already knows which indexes it wishes it had. PostgreSQL tracks every table access in detail — every sequential scan, every index scan, every row fetched. These statistics accumulate quietly in system catalogs, available to anyone who takes a moment to query them. Most teams have not had reason to look — until now.
The starting point is pg_stat_user_tables. This view exposes per-table statistics that reveal how PostgreSQL is accessing your data. Two columns matter most for missing index detection:
- seq_scan — the number of sequential scans initiated against the table
- idx_scan — the number of index scans initiated against the table
A table with a high seq_scan count and a low or zero idx_scan count is a candidate for investigation. If PostgreSQL is repeatedly scanning the entire table to satisfy queries, a missing index may be the cause.
If you'll permit me, a quick scan of your database:
```sql
SELECT
    schemaname,
    relname,
    seq_scan,
    idx_scan,
    seq_tup_read,
    n_live_tup
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_tup_read DESC
LIMIT 20;
```

This query surfaces the tables reading the most rows via sequential scans. But — and this is worth noting before we proceed — a high seq_scan count alone is not proof of a missing index. Several perfectly legitimate situations produce high sequential scan counts:
- Small tables. A table with a few hundred rows fits in one or two pages. A sequential scan is faster than an index lookup because there is no index traversal overhead.
- Full-table analytics. Aggregate queries (COUNT(*), SUM(), reporting dashboards) are expected to scan every row. No index helps a query that needs all the data. For precomputing these results, see materialized views.
- Bulk operations. ETL pipelines and batch jobs that process entire tables will produce sequential scans by design.
The threshold heuristic: investigate when a table has a large n_live_tup count (thousands of rows or more), a high seq_tup_read value, and a seq_scan to idx_scan ratio that suggests most accesses are unindexed. Tables with fewer than roughly 1,000 rows can generally be left in peace — a sequential scan on a small table is often the fastest path available.
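If it helps, the heuristic can be expressed as a small function. This is a sketch in Python over rows already fetched from pg_stat_user_tables; the thresholds (1,000 rows, a 1:1 scan ratio) are illustrative assumptions, not PostgreSQL defaults:

```python
# A sketch of the threshold heuristic above, applied to rows already
# fetched from pg_stat_user_tables. The thresholds are illustrative
# assumptions, not PostgreSQL defaults.

def is_candidate(n_live_tup, seq_scan, idx_scan, seq_tup_read,
                 min_rows=1000, min_ratio=1.0):
    """True if the table's statistics suggest a missing index."""
    if n_live_tup < min_rows or seq_scan == 0:
        return False  # small or unscanned tables: leave them in peace
    ratio = seq_scan / idx_scan if idx_scan else float("inf")
    # Most accesses unindexed, and a meaningful volume of rows read
    return ratio >= min_ratio and seq_tup_read >= n_live_tup

stats = [
    # (relname, n_live_tup, seq_scan, idx_scan, seq_tup_read)
    ("orders",      1_500_000, 48_201,  12_840, 7_230_150_000),
    ("sessions",       40_000, 22_890, 285_000,   915_600_000),
    ("tiny_lookup",       300,  9_000,       0,     2_700_000),
]
for name, rows, seq, idx, read in stats:
    print(name, is_candidate(rows, seq, idx, read))
# orders → True, sessions → False, tiny_lookup → False
```

The same filter appears again as a WHERE clause in the diagnostic query below; having it as a function is mainly useful when post-processing statistics exported from many databases.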
Step 1 — Identify candidates with pg_stat_user_tables
Good news first: pg_stat_user_tables is a built-in view — no extensions required, no configuration, no restart. It has been collecting data since your database started. The relevant columns (documented in the PostgreSQL statistics collector documentation):
| Column | Description |
|---|---|
| seq_scan | Number of sequential scans initiated on the table |
| seq_tup_read | Total rows read by sequential scans |
| idx_scan | Number of index scans initiated on the table |
| idx_tup_fetch | Total rows fetched by index scans |
| n_live_tup | Estimated number of live rows in the table |
The most useful diagnostic query ranks tables by "wasted sequential reads" — the volume of rows read sequentially on tables large enough that an index should be serving those reads:
```sql
SELECT
    relname AS table_name,
    seq_scan,
    seq_tup_read,
    idx_scan,
    n_live_tup,
    ROUND(
        CASE WHEN idx_scan > 0
             THEN seq_scan::numeric / idx_scan
             ELSE seq_scan
        END, 2
    ) AS seq_to_idx_ratio,
    pg_size_pretty(pg_relation_size(schemaname || '.' || relname)) AS table_size
FROM pg_stat_user_tables
WHERE n_live_tup > 5000
  AND seq_scan > 100
ORDER BY seq_tup_read DESC
LIMIT 15;
```

Example output:
```
 table_name | seq_scan | seq_tup_read | idx_scan | n_live_tup | seq_to_idx_ratio | table_size
------------+----------+--------------+----------+------------+------------------+-----------
 orders     |    48201 |   7230150000 |    12840 |    1500000 |             3.75 | 98 MB
 line_items |    31045 |   4967200000 |      980 |     620000 |            31.68 | 210 MB
 sessions   |    22890 |    915600000 |   285000 |      40000 |             0.08 | 18 MB
 audit_log  |     8450 |   2535000000 |        0 |     300000 |          8450.00 | 340 MB
```

Allow me to walk through this:
- orders has a moderate ratio (3.75) but a very high seq_tup_read. Worth investigating — some queries on this large table are not finding their way to an index.
- line_items has a high ratio (31.68) on a large table. This one warrants prompt attention.
- sessions has a low ratio (0.08) — most accesses are indexed. The sequential scans are likely legitimate (cleanup jobs, analytics). No cause for concern.
- audit_log has zero index scans and 300,000 rows. I'm afraid this table is working rather harder than it needs to. Either it has no indexes at all, or every query against it is doing a full scan.
Resetting statistics
One detail worth noting: statistics accumulate since the last reset (or since the database was created). If your application's workload has changed significantly — a new version deployed, a feature removed, traffic patterns shifted — the accumulated statistics may reflect outdated access patterns rather than current ones. For a broader diagnostic workflow, see Your PostgreSQL Query Is Slow.
```sql
SELECT pg_stat_reset();

-- Or reset a single table:
SELECT pg_stat_reset_single_table_counters('orders'::regclass);
```

Reset statistics after major workload changes, then let the new patterns accumulate for at least 24–48 hours (or one full traffic cycle) before drawing conclusions.
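If you would rather not discard history, an alternative to resetting is to snapshot the counters twice and work with the deltas. A sketch, with made-up values standing in for two reads of pg_stat_user_tables:

```python
# Delta-based alternative to pg_stat_reset(): record (seq_scan, idx_scan)
# per table at two points in time and compare. The snapshot contents below
# are made up for illustration; in practice each dict would be populated
# from a query against pg_stat_user_tables.

def scan_deltas(before, after):
    """Per-table (seq_scan delta, idx_scan delta) between two snapshots."""
    return {
        table: (after[table][0] - before[table][0],
                after[table][1] - before[table][1])
        for table in after if table in before
    }

before = {"orders": (48_000, 12_800), "sessions": (22_000, 280_000)}
after  = {"orders": (48_201, 12_840), "sessions": (22_890, 285_000)}

print(scan_deltas(before, after))
# → {'orders': (201, 40), 'sessions': (890, 5000)}
```

The deltas reflect only the interval between snapshots, so they answer "what is the workload doing now?" without throwing away the accumulated totals.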
Limitations
pg_stat_user_tables tells you which tables are being scanned sequentially, but it does not tell you which columns or predicates are causing those scans. It identifies the rooms that need attention — but not the specific furnishings that are out of place. For that, you need pg_qualstats.
Step 2 — Discover missing predicates with pg_qualstats
Now we arrive at the more revealing instrument. pg_qualstats is a PostgreSQL extension (available on GitHub) that records WHERE clause predicates and JOIN conditions at runtime. Where pg_stat_user_tables tells you which tables need indexes, pg_qualstats tells you which columns those indexes should cover.
Installation and configuration
pg_qualstats must be loaded as a shared preload library, which requires a server restart:
```
# postgresql.conf
shared_preload_libraries = 'pg_qualstats'

# Optional: configure sampling rate (default is 1/max_connections)
pg_qualstats.sample_rate = 0.1       # Sample 10% of queries
pg_qualstats.track_constants = on    # Track actual filter values
```

After restarting PostgreSQL, create the extension:

```sql
CREATE EXTENSION IF NOT EXISTS pg_qualstats;
```

Querying pg_qualstats
The core query joins pg_qualstats_all with system catalogs to produce human-readable output showing which predicates are used most often and whether they are backed by an index:
```sql
SELECT
    qs.relid::regclass AS table_name,
    a.attname AS column_name,
    qs.opno::regoper AS operator,
    qs.occurences AS occurrences,
    qs.execution_count,
    qs.nbfiltered AS rows_filtered,
    CASE
        WHEN EXISTS (
            SELECT 1 FROM pg_index pi
            JOIN pg_attribute pa ON pa.attrelid = pi.indrelid
                                AND pa.attnum = ANY(pi.indkey)
            WHERE pi.indrelid = qs.relid
              AND pa.attnum = qs.attnum
        ) THEN 'indexed'
        ELSE 'NOT INDEXED'
    END AS index_status
FROM pg_qualstats_all qs
JOIN pg_attribute a ON a.attrelid = qs.relid AND a.attnum = qs.attnum
WHERE qs.relid IS NOT NULL
ORDER BY qs.occurences DESC
LIMIT 20;
```

Example output:
```
 table_name | column_name | operator | occurrences | execution_count | rows_filtered | index_status
------------+-------------+----------+-------------+-----------------+---------------+--------------
 orders     | customer_id | =        |       48200 |           48200 |       7230000 | NOT INDEXED
 orders     | status      | =        |       31000 |           31000 |        620000 | indexed
 line_items | order_id    | =        |       29800 |           29800 |       4768000 | NOT INDEXED
 audit_log  | created_at  | >=       |        8400 |            8400 |       2520000 | NOT INDEXED
 audit_log  | entity_type | =        |        8400 |            8400 |       2520000 | NOT INDEXED
```

Now the picture becomes considerably clearer:
- orders.customer_id is filtered on 48,200 times and has no index. There is your primary culprit from the pg_stat_user_tables analysis — confirmed.
- orders.status is already indexed. No action needed here.
- line_items.order_id is filtered on 29,800 times without an index. This explains the high sequential scan count on line_items.
- audit_log.created_at and audit_log.entity_type both appear 8,400 times — these are very likely used together in the same queries, which suggests a composite index candidate.
Interpreting multi-column candidates
When multiple columns from the same table appear with the same occurences count, they are likely filtered together in the same queries. pg_qualstats tracks predicates individually, but correlated occurrence counts reveal composite index opportunities.
In the example above, audit_log.created_at and audit_log.entity_type both have exactly 8,400 occurrences. This strongly suggests queries like:
```sql
SELECT * FROM audit_log
WHERE entity_type = 'order'
  AND created_at >= '2026-01-01';
```

A composite index on (entity_type, created_at) would serve this pattern better than two separate single-column indexes. Column order matters: the equality column (entity_type) should come first, followed by the range column (created_at). For more on this decision, see the composite indexes guide.
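The correlated-count heuristic can be sketched mechanically: group unindexed predicates by table and occurrence count, then order equality columns before range columns. The grouping-by-exact-count rule is my assumption here, not something pg_qualstats computes for you:

```python
from collections import defaultdict

# Each row: (table, column, operator, occurences), as read from the
# pg_qualstats output above. Grouping columns that share an exact
# occurrence count is a heuristic for "these predicates appear together
# in the same queries".
rows = [
    ("audit_log", "created_at",  ">=", 8400),
    ("audit_log", "entity_type", "=",  8400),
    ("orders",    "customer_id", "=",  48200),
]

def composite_candidates(rows):
    groups = defaultdict(list)
    for table, col, op, occ in rows:
        groups[(table, occ)].append((col, op))
    candidates = []
    for (table, _), cols in groups.items():
        if len(cols) < 2:
            continue  # a single column: plain single-column index instead
        # Equality predicates first, range predicates second (B-tree rule)
        ordered = sorted(cols, key=lambda c: c[1] != "=")
        candidates.append((table, [c for c, _ in ordered]))
    return candidates

print(composite_candidates(rows))
# → [('audit_log', ['entity_type', 'created_at'])]
```

In real data the counts rarely match exactly (sampling introduces jitter), so a production version would group counts within a small tolerance rather than requiring equality.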
Performance overhead
A reasonable question: what does this cost? pg_qualstats adds overhead to query execution because it inspects predicate nodes in every sampled query's plan. The overhead depends on the sampling rate:
- Default sampling rate (1/max_connections, typically 0.5–1%): overhead is negligible, usually immeasurable.
- 10% sampling rate: overhead is typically 1–3% of total query execution time. Acceptable for most production workloads.
- 100% sampling rate (sample_rate = 1.0): overhead can reach 5–8% on query-intensive workloads. Use this only for short diagnostic sessions, not permanently in production.
For production use, a 10% sampling rate provides enough data to identify patterns within a few hours of normal traffic.
Combining pg_stat_user_tables + pg_qualstats
The complete workflow uses both views together: pg_stat_user_tables identifies which tables are suffering from sequential scans, and pg_qualstats identifies which predicates on those tables are unindexed.
A combined query that produces a candidate index list:
```sql
SELECT
    t.relname AS table_name,
    a.attname AS column_name,
    qs.opno::regoper AS operator,
    qs.occurences AS predicate_uses,
    t.seq_scan AS table_seq_scans,
    t.seq_tup_read AS rows_scanned_sequentially,
    t.n_live_tup AS table_rows
FROM pg_qualstats_all qs
JOIN pg_stat_user_tables t ON t.relid = qs.relid
JOIN pg_attribute a ON a.attrelid = qs.relid AND a.attnum = qs.attnum
WHERE NOT EXISTS (
    SELECT 1 FROM pg_index pi
    JOIN pg_attribute pa ON pa.attrelid = pi.indrelid
                        AND pa.attnum = ANY(pi.indkey)
    WHERE pi.indrelid = qs.relid
      AND pa.attnum = qs.attnum
)
  AND t.n_live_tup > 5000
  AND qs.occurences > 100
ORDER BY qs.occurences * t.seq_tup_read DESC
LIMIT 10;
```

This surfaces the highest-impact missing index candidates: columns that are filtered frequently (occurences) on tables where sequential scans are reading the most rows (seq_tup_read).
For a deeper dive on pg_qualstats configuration and advanced usage, see the pg_qualstats how-to guide.
Step 3 — Validate with HypoPG before you commit
Before we proceed to creating anything, a word of caution. Indexes are not free. Each one consumes disk space, slows down INSERT, UPDATE, and DELETE operations (because the index must be maintained), and adds to vacuum workload. Creating an index based on a hunch — without validating that the query planner will actually use it — risks adding cost without benefit. One does not furnish a room without first measuring the space.
HypoPG (available on GitHub) solves this neatly by creating hypothetical indexes that exist only in the planner's metadata. The planner considers them when generating query plans, but no physical index is created. You can test whether an index would help before committing to the disk and write overhead.
Installation
```sql
CREATE EXTENSION IF NOT EXISTS hypopg;
```

HypoPG does not require shared_preload_libraries — it can be installed and used immediately without a server restart.
Basic workflow
Create a hypothetical index, then run EXPLAIN on the query you want to optimize:
```sql
-- Create a hypothetical index on orders(customer_id)
SELECT * FROM hypopg_create_index('CREATE INDEX ON orders (customer_id)');
```

```
 indexrelid | indexname
------------+----------------------------------
     205462 | <205462>btree_orders_customer_id
```

Now run EXPLAIN on the problem query:

```sql
EXPLAIN SELECT * FROM orders WHERE customer_id = 8472;
```

```
                          QUERY PLAN
--------------------------------------------------------------
 Index Scan using <205462>btree_orders_customer_id on orders
   (cost=0.05..8.07 rows=1 width=120)
   Index Cond: (customer_id = 8472)
```

Compare this to the plan without the hypothetical index (drop it first):
```sql
SELECT * FROM hypopg_drop_index(205462);
EXPLAIN SELECT * FROM orders WHERE customer_id = 8472;
```

```
                          QUERY PLAN
--------------------------------------------------------------
 Seq Scan on orders  (cost=0.00..28734.00 rows=18 width=120)
   Filter: (customer_id = 8472)
```

The cost dropped from 28,734 to 8.07 — a 3,500x reduction. The planner switched from a sequential scan to an index scan. Quite the improvement. This hypothetical index is worth creating.
Testing multi-column and partial indexes
HypoPG accepts the same syntax as CREATE INDEX, including composite and partial indexes:
```sql
-- Hypothetical composite index
SELECT * FROM hypopg_create_index(
    'CREATE INDEX ON audit_log (entity_type, created_at)'
);

-- Hypothetical partial index
SELECT * FROM hypopg_create_index(
    'CREATE INDEX ON orders (created_at) WHERE status = ''pending'''
);
```

Test each with EXPLAIN to see if the planner uses them. For more HypoPG examples and advanced usage, see the HypoPG how-to guide.
Cleanup
Remove all hypothetical indexes when you are done testing:
```sql
SELECT hypopg_reset();
```

Limitations
I should be forthcoming about what HypoPG cannot tell you. Hypothetical indexes inform the planner's cost model but do not account for several real-world factors:
- I/O patterns. A real index changes the ratio of sequential to random I/O. HypoPG cannot simulate the actual I/O behavior.
- Cache effects. Frequently accessed indexes stay in the buffer cache, improving performance beyond what cost estimates predict. HypoPG cannot model cache warming.
- Write amplification. Every INSERT, UPDATE, and DELETE must also update the index. HypoPG does not account for this cost because it only evaluates read queries.
- Index build time. Creating an index on a 100 million row table may take minutes and lock the table (unless you use CREATE INDEX CONCURRENTLY). HypoPG does not simulate the build process.
HypoPG answers the question "will the planner use this index?" — not "will this index improve overall system performance?" The former is a necessary condition for the latter, but not sufficient.
The complete workflow — from signal to index
Allow me to bring this together with a complete example. We will walk through the entire pipeline on a concrete case: an e-commerce application with an orders table that has become slow as the dataset has grown.
The schema
```sql
CREATE TABLE orders (
    id           BIGSERIAL PRIMARY KEY,
    customer_id  BIGINT NOT NULL,
    status       VARCHAR(20) NOT NULL DEFAULT 'pending',
    total_amount NUMERIC(12,2) NOT NULL,
    created_at   TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at   TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Existing indexes: only the primary key
-- Table has 1.5 million rows
```

Step 1: Identify the table
```sql
SELECT relname, seq_scan, seq_tup_read, idx_scan, n_live_tup
FROM pg_stat_user_tables
WHERE relname = 'orders';
```

```
 relname | seq_scan | seq_tup_read | idx_scan | n_live_tup
---------+----------+--------------+----------+------------
 orders  |    48201 |   7230150000 |    12840 |    1500000
```

48,201 sequential scans on a 1.5 million row table, reading 7.2 billion rows total. This table has been working extraordinarily hard on your behalf. It deserves our attention.
Step 2: Find the predicates
```sql
SELECT
    a.attname AS column_name,
    qs.opno::regoper AS operator,
    qs.occurences,
    qs.execution_count
FROM pg_qualstats_all qs
JOIN pg_attribute a ON a.attrelid = qs.relid AND a.attnum = qs.attnum
WHERE qs.relid = 'orders'::regclass
ORDER BY qs.occurences DESC;
```

```
 column_name | operator | occurences | execution_count
-------------+----------+------------+-----------------
 customer_id | =        |      38400 |           38400
 status      | =        |       9200 |            9200
 created_at  | >=       |       4800 |            4800
 created_at  | <=       |       4800 |            4800
```

customer_id equality lookups account for the majority of sequential scans. created_at range queries are also significant.
Step 3: Validate with HypoPG
Test a B-tree index on customer_id:
```sql
SELECT * FROM hypopg_create_index('CREATE INDEX ON orders (customer_id)');
```

Run EXPLAIN on the representative query. Note: use EXPLAIN without ANALYZE here — HypoPG indexes exist only in the planner's metadata and are not visible to the executor:
```sql
EXPLAIN SELECT * FROM orders WHERE customer_id = 8472;
```

With hypothetical index:

```
                          QUERY PLAN
--------------------------------------------------------------
 Index Scan using <205462>btree_orders_customer_id on orders
   (cost=0.05..12.42 rows=15 width=120)
   Index Cond: (customer_id = 8472)
```

Without hypothetical index (after dropping it):
```
                          QUERY PLAN
--------------------------------------------------------------
 Seq Scan on orders  (cost=0.00..38921.00 rows=15 width=120)
   Filter: (customer_id = 8472)
```

The estimated cost dropped from 38,921 to 12.42 — a 3,100x reduction. The planner switched from a sequential scan to an index scan. The evidence is clear.
To confirm the real-world impact, create the actual index and then run EXPLAIN ANALYZE:
```sql
CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders (customer_id);
ANALYZE orders;

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 8472;
```

```
 Index Scan using idx_orders_customer_id on orders
   (cost=0.43..12.42 rows=15 width=120)
   (actual time=0.04..0.09 rows=12 loops=1)
   Index Cond: (customer_id = 8472)
   Buffers: shared hit=4
 Planning Time: 0.22 ms
 Execution Time: 0.14 ms
```

Execution time: 0.14 ms, reading 4 buffer pages. Compare this to the sequential scan that was reading 30,000+ pages. One index. A 3,100x improvement. This index is very much worth keeping.
For more on interpreting EXPLAIN output, see the EXPLAIN ANALYZE guide.
Step 4: Consider additional indexes
With the customer_id index confirmed and created, review the remaining pg_qualstats output for additional opportunities.
When to use a composite index instead
If queries frequently filter on multiple columns together, a composite index may be more effective than separate single-column indexes. From the pg_qualstats data, if status and created_at consistently appear in the same queries:
```sql
-- Common query pattern
SELECT * FROM orders
WHERE status = 'pending'
  AND created_at >= '2026-01-01'
  AND created_at < '2026-02-01';
```

A composite index serves this pattern efficiently:
```sql
CREATE INDEX CONCURRENTLY idx_orders_status_created
    ON orders (status, created_at);
```

The equality column (status) comes first, the range column (created_at) second. This ordering allows the index to narrow to the matching status value and then perform a range scan on the date. See the composite indexes guide for the full rationale on column ordering.
When to use a partial index instead
If the vast majority of queries against orders filter for a specific status value, a partial index covers only the relevant rows and is smaller and faster:
```sql
CREATE INDEX CONCURRENTLY idx_orders_pending_created
    ON orders (created_at)
    WHERE status = 'pending';
```

This index is smaller than a full index on (status, created_at) because it excludes all rows that are not pending. If 95% of your queries are looking for pending orders, this partial index is a better choice. If queries target multiple status values, a full composite index is more appropriate.
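The decision can be framed as a back-of-envelope rule. A sketch; the 90% threshold, table, and index names are illustrative assumptions, not PostgreSQL guidelines:

```python
# Back-of-envelope chooser between a partial and a composite index, based
# on the fraction of queries that target one hot status value. The table,
# index names, and 90% threshold are illustrative assumptions.

def choose_index(hot_fraction, hot_value="pending", threshold=0.9):
    if hot_fraction >= threshold:
        # Nearly all queries want one value: a partial index is smaller
        return (f"CREATE INDEX CONCURRENTLY idx_orders_{hot_value}_created "
                f"ON orders (created_at) WHERE status = '{hot_value}'")
    # Queries spread across statuses: a composite index covers them all
    return ("CREATE INDEX CONCURRENTLY idx_orders_status_created "
            "ON orders (status, created_at)")

print(choose_index(0.95))  # partial index wins
print(choose_index(0.40))  # composite index wins
```

With track_constants enabled, pg_qualstats records the actual filter values, which is how you would estimate the hot fraction in the first place.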
For more on choosing the right index type, see the index types guide.
What Gold Lapel automates
The workflow above — identify candidate tables, discover unindexed predicates, validate with hypothetical indexes, create the index — is effective but manual. It requires installing extensions, running diagnostic queries, interpreting results, and repeating the process whenever workloads change. It is excellent work, and I would encourage you to do it at least once for the understanding it builds.
Gold Lapel runs this same pipeline continuously as a PostgreSQL proxy. Because it sits between the application and the database, it observes every query in real time — not a statistical sample, but the actual workload. This gives it two advantages over the manual approach:
- Workload-aware recommendations. The proxy sees which queries run most often, which are slowest, and how query patterns change over time. It weighs index recommendations by actual query frequency and impact, not just predicate occurrence counts.
- Continuous monitoring. New features, traffic changes, and schema migrations constantly shift which indexes are needed. The proxy detects these shifts automatically rather than waiting for someone to run diagnostic queries.
The manual approach taught in this article is the right starting point. It works, it is free, and it builds the intuition needed to evaluate any tool's recommendations — including ours. For teams who want the same pipeline running hands-free in production, Gold Lapel is available.
Common mistakes and edge cases
I should mention several patterns I encounter frequently. None of these are causes for self-reproach — they are the natural result of working with incomplete information, which is precisely what this article aims to correct.
Creating indexes on low-cardinality columns
A column like is_active (boolean) or status (with three possible values) has low cardinality. An index on a low-cardinality column is rarely useful for equality lookups because each value matches a large fraction of the table. The planner will often choose a sequential scan even when the index exists, because the cost of random I/O to fetch 30% of the table via index exceeds the cost of a sequential scan. The PostgreSQL documentation on examining index usage explains this trade-off in detail.
Exceptions exist: if one value is rare (1% of rows are status = 'failed'), a partial index on that rare value is effective. A full B-tree index on the entire column is not.
Ignoring write amplification on hot tables
This is the counterbalance to everything above, and I would be doing you a disservice to gloss over it. Every index on a table adds overhead to every INSERT, UPDATE (if the indexed column changes), and DELETE. A table receiving 10,000 writes per second with five indexes pays the maintenance cost five times per write — a phenomenon known as write amplification. Adding a sixth index for a query that runs once per minute may not be a net improvement.
Before creating an index, consider the table's write volume and evaluate whether the read improvement justifies the write cost. For the full analysis of index types and their write costs, including B-tree, GIN, and GiST maintenance characteristics, see the index types guide.
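A rough way to weigh this trade-off is to compare time saved on reads against time spent maintaining the index on writes. All figures in this sketch are illustrative assumptions; measure your own workload before deciding:

```python
# Rough net-benefit estimate for adding one more index to a hot table.
# All figures are illustrative assumptions, not measured costs.

def net_benefit_per_sec(reads_per_sec, read_saving_ms,
                        writes_per_sec, write_cost_ms_per_index):
    """Positive → the index likely pays for itself; negative → it may not."""
    saved = reads_per_sec * read_saving_ms
    paid = writes_per_sec * write_cost_ms_per_index
    return saved - paid

# A query running once per minute saving 200 ms, on a table doing
# 10,000 writes/sec where each extra index costs ~0.05 ms per write:
print(net_benefit_per_sec(1 / 60, 200.0, 10_000, 0.05))  # negative → skip
# The same index on a read-heavy table (50 reads/sec, 100 writes/sec):
print(net_benefit_per_sec(50, 200.0, 100, 0.05))         # positive → create
```

The point is not the exact numbers but the shape of the comparison: index value scales with read frequency and savings, while index cost scales with write volume.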
Over-indexing
The temptation, having found this workflow, is to index everything. I would gently counsel restraint. More indexes are not always better. Each index:
- Consumes disk space (often 10–30% of the table size per index)
- Slows down INSERTs, UPDATEs, and DELETEs
- Increases VACUUM workload (dead index tuples must be cleaned up alongside dead heap tuples)
- Adds to backup size and replication lag
A table with 15 indexes is worth a closer look. Review whether older indexes are still earning their keep by querying pg_stat_user_indexes:
```sql
SELECT
    indexrelname,
    idx_scan,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE relname = 'orders'
ORDER BY idx_scan ASC;
```

Indexes with zero idx_scan since the last statistics reset are candidates for removal.
Sampling rate too low
A subtlety worth attending to: if pg_qualstats.sample_rate is set too low, infrequent but expensive queries may not be captured. A query that runs once per hour but takes 30 seconds each time is a significant performance problem, but at a 1% sampling rate, pg_qualstats may never observe it.
For thorough coverage, run a diagnostic session at a higher sampling rate (25–50%) for 24 hours, then reduce the rate for ongoing monitoring. Alternatively, use pg_stat_statements to identify slow queries by total execution time and manually check whether their predicates are indexed.
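The arithmetic behind that warning is worth seeing. Modeling each execution as an independent sampling event (a simplification of how pg_qualstats samples), the chance of observing a query at least once is 1 − (1 − rate)^executions:

```python
# Probability that a sampled-statistics extension observes a query at
# least once, given a sampling rate and the number of executions in the
# observation window. Independence per execution is a simplification.

def capture_probability(sample_rate, executions):
    return 1 - (1 - sample_rate) ** executions

# A once-per-hour query over a 24-hour window at a 1% sampling rate:
print(round(capture_probability(0.01, 24), 3))  # ≈ 0.214 — easy to miss
# The same query at a 25% diagnostic sampling rate:
print(round(capture_probability(0.25, 24), 3))  # ≈ 0.999 — near certain
```

This is why the diagnostic-session-at-high-rate approach works: even rare queries accumulate enough sampling opportunities within a day.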
Forgetting to ANALYZE after creating the index
A small step that makes all the difference: after creating a new index, PostgreSQL needs updated table statistics to accurately estimate the index's selectivity. Without running ANALYZE, the planner may not choose the new index because its cost estimates are based on stale data.
```sql
ANALYZE orders;
```

Autovacuum will eventually run ANALYZE, but the delay can be minutes to hours depending on your autovacuum configuration. Running it manually ensures the new index is used immediately.
The tools described here — pg_stat_user_tables, pg_qualstats, HypoPG — are not exotic. They are the built-in diagnostic instruments of a database that has been quietly cataloging its own performance characteristics since the moment you started it. The data was always there, waiting for someone to ask the right questions. Now you know how to ask them.