EXPLAIN ANALYZE: A Guided Tour of Your PostgreSQL Query Plan
The query plan tells you everything. If you'll allow me, I shall translate.
Good evening. You have a slow query.
This is not a guess. If you have arrived at an article about EXPLAIN ANALYZE, something is taking longer than it should, and you would like to know why. An entirely reasonable question, and one that PostgreSQL is prepared to answer — in considerable detail, once you know how to ask.
The tool is EXPLAIN ANALYZE. It runs your query and reports what happened: which tables were scanned, which indexes were used, how many rows were processed, how long each step took, and where the time went. It is, in effect, a detailed receipt for every millisecond your query consumed.
The difficulty is not running EXPLAIN ANALYZE. The difficulty is reading the output. It arrives as an indented tree of nodes with names like "Hash Join" and "Bitmap Heap Scan," accompanied by numbers that appear significant but resist casual interpretation. Many developers glance at it, note the total execution time at the bottom, and return to guessing.
I have watched this happen more times than I care to count. A developer stares at twenty lines of planner output, extracts the single number at the bottom — 487ms — and proceeds to add a random index in the hope that it helps. It does not help. Or worse, it helps this query and silently harms three others. The plan was right there, explaining exactly what went wrong, and it was ignored.
Allow me to offer a more productive approach.
A brief word before we begin: EXPLAIN versus EXPLAIN ANALYZE
These are not the same thing, and the distinction matters more than most guides acknowledge.
EXPLAIN without ANALYZE shows you what PostgreSQL plans to do. It estimates costs, row counts, and widths — but it does not execute the query. The numbers are predictions, not measurements. They are useful for understanding what the planner is thinking, but they cannot tell you whether the planner was right.
EXPLAIN ANALYZE executes the query and reports what actually happened. You get both the estimates and the reality, side by side. This is where the diagnostic power lives — in the gap between what the planner predicted and what the engine encountered.
I should be direct: for diagnosing slow queries, EXPLAIN alone is insufficient. Use EXPLAIN ANALYZE. The estimates are educated guesses. The actuals are evidence. I deal in evidence.
There is one exception, and it is important. EXPLAIN ANALYZE executes the query. If your query is an UPDATE that sets 50,000 rows to a new status, or a DELETE that purges a year of data, EXPLAIN ANALYZE will perform that operation. Protect yourself:
-- EXPLAIN ANALYZE executes the query. For mutations, protect yourself:
BEGIN;
EXPLAIN ANALYZE
UPDATE orders SET status = 'shipped'
WHERE status = 'pending' AND created_at < '2025-01-01';
-- Inspect the plan...
ROLLBACK;
-- No data was changed.

The plan is produced. The data is untouched. This is the only responsible way to analyze data-modifying statements.
How do you run EXPLAIN ANALYZE?
Prefix your query with EXPLAIN ANALYZE. For the most useful output, include BUFFERS as well — it reveals disk I/O, which is often the real bottleneck that execution time alone cannot explain.
EXPLAIN ANALYZE
SELECT o.id, o.total, c.name
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.status = 'pending'
  AND o.created_at > '2025-01-01';

This produces a query plan — a tree of operations that PostgreSQL executed to get your result. Here is the complete output for that query:
Hash Join (cost=8.20..431.72 rows=312 width=52) (actual time=0.15..4.82 rows=287 loops=1)
Hash Cond: (o.customer_id = c.id)
-> Seq Scan on orders o (cost=0.00..419.00 rows=312 width=24) (actual time=0.03..4.21 rows=287 loops=1)
Filter: ((status = 'pending') AND (created_at > '2025-01-01'))
Rows Removed by Filter: 9713
-> Hash (cost=5.80..5.80 rows=192 width=36) (actual time=0.08..0.08 rows=192 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 18kB
-> Seq Scan on customers c (cost=0.00..5.80 rows=192 width=36) (actual time=0.01..0.04 rows=192 loops=1)
Planning Time: 0.18 ms
Execution Time: 4.94 ms

That is the entire plan. Every number means something. Let me walk through it.
The options beyond ANALYZE and BUFFERS
PostgreSQL offers several output options that serve different purposes. Most day-to-day diagnostics need only ANALYZE and BUFFERS, but the others earn their keep in specific situations:
-- The full diagnostic: actual timing, buffers, WAL, and settings
EXPLAIN (ANALYZE, BUFFERS, WAL, SETTINGS, VERBOSE)
SELECT o.id, o.total, c.name
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.status = 'pending';

- BUFFERS — shows shared buffer hits and disk reads. I consider this non-optional. Without it, you are guessing at the I/O profile.
- WAL — shows WAL (write-ahead log) generation for write queries. Useful when investigating replication lag or checkpoint pressure.
- SETTINGS — includes any non-default planner settings. Helpful when you suspect someone has set enable_indexscan = off in the session and forgotten to mention it.
- VERBOSE — shows the full column list per node and the schema-qualified table names. Adds clutter, but essential for plans involving views or inheritance.
You may also choose the output format:
-- Plain text (default):
EXPLAIN ANALYZE SELECT ...;
-- JSON — machine-readable, includes all fields:
EXPLAIN (ANALYZE, FORMAT JSON) SELECT ...;
-- YAML — slightly more readable than JSON:
EXPLAIN (ANALYZE, FORMAT YAML) SELECT ...;
-- XML — for tooling that expects it:
EXPLAIN (ANALYZE, FORMAT XML) SELECT ...;

JSON format is particularly useful when you intend to feed the plan into a visualization tool like explain.depesz.com or explain.dalibo.com. For reading with your own eyes, the default text format remains the most efficient.
How to read a query plan, node by node
A query plan reads from the inside out and bottom up. The most deeply indented nodes execute first. Their results flow upward to their parent nodes. The top node is the final operation — its timing is your total query time.
Think of it as a household instruction list, read in reverse. The scullery maid washes the vegetables before the cook can prepare the soup. The plan shows the scullery maid's work indented beneath the cook's. The cook's work is indented beneath the host's request. You read the indentation to understand the order of operations.
The anatomy of a single node
Every node carries the same structure. Take the outer node from our example:
Hash Join (cost=8.20..431.72 rows=312 width=52) (actual time=0.15..4.82 rows=287 loops=1) This contains two sets of numbers — and the gap between them is where problems hide:
- cost=8.20..431.72 — the planner's estimate. The first number (8.20) is the startup cost — the work required before the first row can be returned. The second number (431.72) is the total cost. These are in arbitrary units based on seq_page_cost, not milliseconds. They are useful for comparing two plans for the same query, not for absolute measurement.
- rows=312 — the planner estimated 312 rows would be returned by this node.
- width=52 — the average width of each row in bytes. A width of 52 means each row is about 52 bytes. This matters when large rows amplify memory and I/O costs.
- actual time=0.15..4.82 — what actually happened: 0.15ms to return the first row, 4.82ms for all rows. These are milliseconds.
- rows=287 (in the actual section) — 287 rows were actually returned. Close to the estimate of 312 — the planner was reasonably accurate here.
- loops=1 — this node executed once. In a nested loop, inner nodes may execute many times — and this is crucial — the actual time shown is per loop. Multiply actual time by loops to get the true total time consumed by that node.
The critical comparison: estimated rows versus actual rows. When these diverge wildly — the planner expected 50, the engine processed 247,000 — the chosen plan is almost certainly wrong for the actual data. The planner was handed a map that did not match the territory.
A note on the loops multiplication
This is the single most common mistake I see when people read query plans. A node that shows actual time=0.01..0.05 rows=1 loops=50000 looks innocuous — 0.05ms, nothing to worry about. But it executed 50,000 times. The total time contributed by this node is 0.05 × 50,000 = 2,500ms. Two and a half seconds hidden behind a number that looks like a rounding error.
When you see loops greater than 1, always multiply. The plan is giving you per-iteration numbers, and the iterations are the cost.
What do the different node types mean?
PostgreSQL's plan nodes fall into three categories: scans (how it reads tables), joins (how it combines tables), and operations (sorting, grouping, limiting). Here is a reference for the ones you will encounter most frequently.
| Node type | What it does | When to worry |
|---|---|---|
| Seq Scan | Reads every row in the table | Slow on large tables. Fine for small ones. |
| Index Scan | Looks up rows via an index | Usually good. Check if it's the right index. |
| Index Only Scan | Answers from the index alone | Best case — no table access needed. |
| Bitmap Heap Scan | Index narrows candidates, then fetches from table | Good for medium selectivity (1-20% of rows). |
| Nested Loop | For each outer row, scan inner table | Fast for small outer sets. Disastrous for large ones. |
| Hash Join | Build hash table from one side, probe with the other | Good for medium-to-large joins. Watch memory usage. |
| Merge Join | Walk two sorted inputs in parallel | Excellent when both sides are pre-sorted (indexed). |
| Sort | Sort rows in memory or on disk | "Disk" in Sort Method means work_mem is too small. |
| HashAggregate | GROUP BY using a hash table | Good for low-cardinality grouping. |
| Materialize | Cache sub-plan results for re-use | Usually harmless. Expensive if the sub-plan is huge. |
| Gather | Collect results from parallel workers | Check Workers Launched matches Workers Planned. |
| CTE Scan | Scan a materialized CTE result | If unexpected, the CTE may be an optimization fence. |
This is not exhaustive — PostgreSQL has dozens of node types — but these twelve account for the vast majority of plans you will encounter in practice. The remaining node types tend to appear in specific circumstances: Append for partitioned tables and UNION ALL, SubPlan for correlated subqueries, WindowAgg for window functions. You will recognize them when you need them.
How does a sequential scan differ from an index scan?
This is the single most important distinction in query plan reading. A sequential scan reads every row in the table. An index scan looks up specific rows via an index. The performance difference on a large table is not incremental — it is orders of magnitude.
Sequential scan
Seq Scan on orders (cost=0.00..18334.00 rows=1000000 width=64)
(actual time=0.01..312.45 rows=1000000 loops=1)
Filter: (status = 'pending')
Rows Removed by Filter: 9000000
Buffers: shared hit=1204 read=72130

Note Rows Removed by Filter: 9000000. PostgreSQL read ten million rows and kept one million. Nine million rows were read, evaluated, and discarded. That is 90% waste — and every discarded row still consumed disk I/O and CPU time.
The buffer count tells the story in physical terms: 72,130 pages read from disk plus 1,204 from cache, totaling 73,334 pages. At 8kB per page, that is roughly 573MB of data read for this single query. If this query runs frequently, you are reading 573MB from disk every time someone checks a dashboard.
Index scan
Index Scan using idx_orders_status on orders
(cost=0.43..1245.32 rows=1000000 width=64)
(actual time=0.02..28.91 rows=1000000 loops=1)
Index Cond: (status = 'pending')
Buffers: shared hit=8924

The index scan goes directly to the matching rows. No wasted reads. The buffer count drops from 73,334 pages to 8,924. Same query, same result, 11x faster. The reduction in I/O is even more dramatic than the timing suggests — fewer disk reads mean less pressure on the buffer cache, less contention with other queries, and less wear on your storage subsystem.
Index Only Scan — the best case
-- Query: SELECT count(*) FROM orders WHERE status = 'pending';
-- With a covering index: CREATE INDEX idx_orders_status ON orders (status);
Index Only Scan using idx_orders_status on orders
(cost=0.43..412.28 rows=10000 width=0)
(actual time=0.018..1.42 rows=10247 loops=1)
Index Cond: (status = 'pending')
Heap Fetches: 0
Buffers: shared hit=42

An Index Only Scan answers the query entirely from the index, without visiting the table at all. Notice Heap Fetches: 0 — PostgreSQL never touched the heap. This is the ideal outcome, and it is possible when the index contains all the columns the query needs.
The critical metric here is Heap Fetches. If this number is high, PostgreSQL is going back to the table to check row visibility — which happens when the visibility map is out of date. Running VACUUM on the table updates the visibility map and reduces heap fetches, sometimes dramatically.
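A hedged sketch of that remedy, reusing the orders table from the examples above: vacuum to refresh the visibility map, then re-run the plan and watch Heap Fetches.

```sql
-- Refresh the visibility map (and statistics while we are at it):
VACUUM (ANALYZE) orders;

-- Re-check the plan:
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM orders WHERE status = 'pending';
-- Heap Fetches should drop toward 0 once the visibility map is current.
```

The exact drop depends on how much of the table has changed since the last vacuum; on a heavily updated table, the improvement can be substantial.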
Bitmap scan
Bitmap Heap Scan on orders (cost=124.50..5843.21 rows=5200 width=64)
(actual time=1.82..14.53 rows=5187 loops=1)
Recheck Cond: (created_at > '2025-01-01')
Heap Blocks: exact=4012
-> Bitmap Index Scan on idx_orders_created_at
(cost=0.00..123.20 rows=5200 width=0)
(actual time=1.21..1.21 rows=5187 loops=1)
Index Cond: (created_at > '2025-01-01')

A bitmap scan is PostgreSQL's compromise for medium selectivity. The index identifies which heap pages contain matching rows, sorts them to minimize random I/O, then fetches them in order. You will see this when the query returns too many rows for a plain index scan but few enough that a sequential scan would be wasteful.
Bitmap scans have a particularly useful trick — they can combine multiple indexes:
Bitmap Heap Scan on orders
(actual time=3.42..18.91 rows=1247 loops=1)
Recheck Cond: ((status = 'pending') AND (created_at > '2025-01-01'))
-> BitmapAnd
-> Bitmap Index Scan on idx_orders_status
(actual time=1.12..1.12 rows=50000 loops=1)
-> Bitmap Index Scan on idx_orders_created_at
(actual time=0.94..0.94 rows=25000 loops=1)

The planner used two separate indexes, computed the intersection of their results via BitmapAnd, and then fetched only the pages that matched both conditions. This is why PostgreSQL does not always need a composite index — sometimes two single-column indexes and a bitmap combine operation achieve the same result. Not always faster than the composite, but it avoids maintaining an additional index.
An honest counterpoint about sequential scans
I should note that not every sequential scan is a problem. On a small table — a few hundred or a few thousand rows — a sequential scan is often faster than an index scan. The overhead of traversing the index B-tree and then performing random I/O to the heap can exceed the cost of simply reading the entire table sequentially. PostgreSQL's planner knows this, which is why it chooses sequential scans for small tables even when indexes are available.
The rule of thumb: a sequential scan on a table with fewer than a few thousand rows is fine. A sequential scan on a table with millions of rows, returning a small fraction of them, is where you should reach for an index. The Rows Removed by Filter metric is your signal. If the planner removed 97% of the rows it read, those rows were unnecessary work. For a thorough treatment of why the planner ignores your index and how to fix it, see Fix PostgreSQL Sequential Scans.
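When the signal says an index is warranted, the fix can be sketched briefly. This assumes the orders example from above; the index name is illustrative.

```sql
-- CONCURRENTLY builds the index without blocking writes.
-- Note: it cannot run inside a transaction block.
CREATE INDEX CONCURRENTLY idx_orders_status ON orders (status);

-- Then compare plans:
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE status = 'pending';
-- Expect the Seq Scan to be replaced by an Index Scan or Bitmap Heap Scan,
-- and Rows Removed by Filter to disappear.
```

Whether the planner actually uses the new index depends on selectivity; if 'pending' matches most of the table, a sequential scan may still be the cheaper plan.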
How do the join strategies compare?
When your query joins multiple tables, PostgreSQL chooses a join strategy. The choice depends on table sizes, available indexes, and the planner's row estimates. Understanding which strategy was chosen — and whether it was the right choice — is essential for diagnosing join performance.
Nested Loop
Nested Loop (cost=0.43..62.18 rows=8 width=100)
(actual time=0.03..0.12 rows=7 loops=1)
-> Index Scan using idx_orders_customer_id on orders
(actual time=0.02..0.04 rows=7 loops=1)
Index Cond: (customer_id = 42)
-> Index Scan using customers_pkey on customers
(actual time=0.01..0.01 rows=1 loops=7)
Index Cond: (id = o.customer_id)

For each row in the outer table (7 rows from orders), PostgreSQL performs an index lookup in the inner table (customers). With 7 outer rows, that is 7 index lookups — quite efficient. With 70,000 outer rows, it would be 70,000 lookups. The loops=7 on the inner scan tells you exactly how many times it executed.
The nested loop is the best strategy when the outer side produces few rows and the inner side has an efficient index. It is the worst strategy when the outer side produces many rows — which is exactly what happens when the planner's row estimate is wrong:
Nested Loop (cost=0.43..4218.50 rows=100 width=100)
(actual time=0.03..14821.33 rows=247000 loops=1)
-> Seq Scan on line_items
(actual time=0.02..82.41 rows=247000 loops=1)
Filter: (order_id > 50000)
-> Index Scan using products_pkey on products
(actual time=0.05..0.05 rows=1 loops=247000)
Index Cond: (id = line_items.product_id)

The planner estimated 100 outer rows and chose a nested loop. The actual count was 247,000. Each of those 247,000 rows triggered an index lookup on the products table. The query took 14.8 seconds. Had the planner known the true row count, it would have chosen a hash join, which would have completed in under a second. Stale statistics are the usual cause — run ANALYZE on the table and try again.
Hash Join
Hash Join (cost=8.20..431.72 rows=312 width=52)
(actual time=0.15..4.82 rows=287 loops=1)
Hash Cond: (o.customer_id = c.id)
-> Seq Scan on orders o
(actual time=0.03..4.21 rows=287 loops=1)
-> Hash (cost=5.80..5.80 rows=192 width=36)
(actual time=0.08..0.08 rows=192 loops=1)
Buckets: 1024 Memory Usage: 18kB
-> Seq Scan on customers c
(actual time=0.01..0.04 rows=192 loops=1)

PostgreSQL builds a hash table from the smaller side (customers, 192 rows), then probes it once for each row in the larger side. The Memory Usage: 18kB tells you the hash table fit comfortably in memory. This is the workhorse join strategy — efficient for most medium-to-large joins where the smaller side fits in memory.
The metric to watch is Batches. If you see Batches: 1, the entire hash table fit in memory. If you see a higher number, the hash table spilled to disk:
Hash Join (cost=284102..892451 rows=5000000 width=96)
(actual time=1842.22..8921.45 rows=4872103 loops=1)
Hash Cond: (o.product_id = p.id)
-> Seq Scan on orders o ...
-> Hash (cost=184102..184102 rows=5000000 width=48)
(actual time=1841.92..1841.92 rows=5000000 loops=1)
Buckets: 262144 Batches: 32 Memory Usage: 8193kB
-> Seq Scan on products p ...

Batches: 32 means the hash table was split into 32 batches because it exceeded work_mem. Each batch involves writing to and reading from temporary files on disk. The solution is either increasing work_mem for the session or reducing the size of the hashed side — perhaps by pushing a filter condition further down the plan.
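The session-level adjustment can be sketched as follows; the 256MB figure is an assumption, chosen to hold this particular hash table in a single batch.

```sql
-- Raise work_mem only for this session, run the query, then restore:
SET work_mem = '256MB';  -- assumption: large enough for Batches: 1

EXPLAIN (ANALYZE, BUFFERS)
SELECT ...;              -- the join that showed Batches: 32

RESET work_mem;
```

Confirm the fix by checking that the Hash node now reports Batches: 1 and that no temporary files were written.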
Merge Join
Merge Join (cost=0.85..48291.42 rows=500000 width=80)
(actual time=0.04..312.18 rows=487293 loops=1)
Merge Cond: (o.customer_id = c.id)
-> Index Scan using idx_orders_customer_id on orders o
(actual time=0.02..198.41 rows=487293 loops=1)
-> Index Scan using customers_pkey on customers c
(actual time=0.01..0.42 rows=200 loops=1)

Merge joins walk two sorted inputs in parallel. They are excellent when both sides are already sorted — often because they come from index scans on the join columns. You will see merge joins less frequently than hash joins, but when they appear, they are typically very efficient because they avoid building a hash table entirely and stream through both inputs in a single pass.
If you see a merge join preceded by a Sort node, PostgreSQL is sorting one or both inputs before merging. This may be worthwhile or wasteful depending on the data volume — if the sort is spilling to disk, a hash join might have been cheaper.
What does the BUFFERS output tell you?
Adding BUFFERS to your EXPLAIN reveals disk I/O — and disk I/O is where time actually goes. I consider this the most important optional output for performance diagnostics, and I recommend you include it always.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 42;
-- Output:
Index Scan using idx_orders_customer_id on orders
(cost=0.43..12.65 rows=8 width=64)
(actual time=0.021..0.035 rows=7 loops=1)
Index Cond: (customer_id = 42)
Buffers: shared hit=4 read=1
Planning Time: 0.09 ms
Execution Time: 0.052 ms

The key numbers:

- shared hit=4 — 4 pages were found in PostgreSQL's shared buffer cache (RAM). Fast — a memory read.
- read=1 — 1 page was read from disk (or the OS page cache). Slow — potentially 0.1-10ms depending on storage, and orders of magnitude slower than a buffer hit.
The ratio of hits to reads tells you how well your working set fits in memory. If read dominates hit, either your shared_buffers is too small or the query is touching far more data than it needs to — which circles back to missing indexes and excessive sequential scans.
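The same hit/read ratio can be inspected per table from PostgreSQL's cumulative statistics views; a sketch, noting that these counters accumulate since the last statistics reset rather than per query:

```sql
-- Tables doing the most disk reads, with their cache hit percentage:
SELECT relname,
       heap_blks_hit,
       heap_blks_read,
       round(heap_blks_hit::numeric
             / greatest(heap_blks_hit + heap_blks_read, 1) * 100, 1) AS hit_pct
FROM pg_statio_user_tables
ORDER BY heap_blks_read DESC
LIMIT 10;
```

A table with a low hit percentage and heavy read traffic is a good candidate for the index and memory investigations described above.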
When I/O timing reveals the truth
If track_io_timing is enabled (it is off by default), you get an even more revealing number:
Seq Scan on events (cost=0.00..482103.00 rows=25000000 width=96)
(actual time=0.04..3842.19 rows=25000000 loops=1)
Buffers: shared hit=42891 read=289312
I/O Timings: read=1847.32
Planning Time: 0.08 ms
Execution Time: 5291.44 ms

I/O Timings: read=1847.32 tells you that of the 5,291ms total execution time, 1,847ms — fully 35% — was spent waiting for disk reads. The rest was CPU time for processing 25 million rows. This distinction matters because the remedies are different: if the query is I/O-bound, you need better indexes or more memory. If it is CPU-bound, you may need to reduce the row count or simplify the computation.
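If the I/O Timings line is absent from your output, the parameter is likely off. A sketch of enabling it for the current session:

```sql
-- track_io_timing is a privileged setting; this requires superuser
-- (or suitably granted) rights in most installations.
SET track_io_timing = on;

EXPLAIN (ANALYZE, BUFFERS) SELECT ...;
-- The plan output now includes I/O Timings alongside the buffer counts.
```

The setting adds a small timing overhead per I/O call, which is why it is off by default; for diagnostic sessions the overhead is well worth it.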
Cold cache versus warm cache
This is a subtlety that can mislead you if you are not aware of it. Running the same query twice often produces dramatically different results:
-- First run: cold cache — pages read from disk
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 42;
-- Buffers: shared hit=0 read=5
-- Second run: warm cache — pages in shared_buffers
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 42;
-- Buffers: shared hit=5 read=0
-- The execution time may differ by 10-100x between runs.
-- Always note whether you are measuring cold or warm performance.

The first execution reads pages from disk. The second finds them already in shared_buffers. The difference can be 10x to 100x. When benchmarking, be explicit about which scenario you are measuring. In production, your hot queries typically run against a warm cache. But the first query of the day, or a query against a rarely-accessed table, will hit cold pages.
Neither measurement is wrong. They answer different questions. Warm cache performance tells you the steady-state cost. Cold cache performance tells you the worst case — and the worst case is what your users experience at 3 AM after a restart.
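If you want to measure warm-cache behavior deliberately rather than by accident of run order, the pg_prewarm extension (shipped in contrib) loads a table into shared_buffers on demand. A sketch:

```sql
-- Load the orders table into the buffer cache before benchmarking:
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('orders');  -- returns the number of pages loaded

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 42;
-- Expect Buffers to show hits rather than reads.
```

This makes warm-cache measurements repeatable, which matters when you are comparing two candidate indexes or plans.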
What does it mean when the Sort spills to disk?
Sort (cost=1892.45..1917.45 rows=10000 width=64)
(actual time=28.41..31.22 rows=10000 loops=1)
Sort Key: created_at DESC
Sort Method: external merge Disk: 3120kB
-> Seq Scan on orders ...

Sort Method: external merge Disk: 3120kB means the data to be sorted exceeded work_mem and spilled to temporary files on disk. This is dramatically slower than an in-memory sort. For comparison:
Sort (cost=892.45..917.45 rows=10000 width=64)
(actual time=4.21..5.18 rows=10000 loops=1)
Sort Key: created_at DESC
Sort Method: quicksort Memory: 1142kB
-> Index Scan on orders ...

The in-memory sort uses quicksort and completes in 5ms. The disk-based sort uses external merge and takes 31ms — six times slower, and the gap widens as data volume increases.
Three remedies, in order of preference:
- Add an index on the sort column — if the query both filters and sorts on the same columns, a composite index can eliminate the sort entirely. PostgreSQL returns rows in index order, and the Sort node disappears from the plan. This is the best outcome because it eliminates work rather than making work faster.
- Use LIMIT with an indexed sort — if you only need the top N results and have an index on the sort column, PostgreSQL can stop scanning after N rows. No sort node at all.
- Increase work_mem — but carefully. This is per-operation, per-connection. A single complex query with three sort operations uses three times work_mem. Setting it to 256MB with 100 concurrent connections means a theoretical ceiling of 76.8GB for sorts alone (3 sorts × 256MB × 100 connections). Set it at the session level for specific queries rather than globally, unless you have done the arithmetic.
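The first remedy can be made concrete. A sketch, assuming a query that filters on status and sorts on created_at; the index name is illustrative:

```sql
-- Composite index: equality filter column first, sort column second.
CREATE INDEX idx_orders_status_created_at
    ON orders (status, created_at DESC);

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders
WHERE status = 'pending'
ORDER BY created_at DESC
LIMIT 50;
-- Expect an Index Scan returning rows already in order,
-- with no Sort node anywhere in the plan.
```

The column order matters: the equality condition must come before the sort column for the index to deliver rows in the requested order.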
When estimates go wrong
The planner is only as good as its statistics. When the row estimate is wildly off, the entire plan can be wrong — not subtly wrong, but catastrophically wrong. A plan chosen for 50 rows that encounters 247,000 is not merely suboptimal. It is working against itself.
-- Estimated: 50 rows. Actual: 247,000 rows.
-- PostgreSQL chose a nested loop expecting a handful of matches.
-- It got a quarter million. The plan is catastrophically wrong.
Nested Loop (cost=0.43..312.50 rows=50 width=64)
(actual time=0.02..8412.33 rows=247000 loops=1)

The planner expected 50 rows and chose a nested loop — sensible for 50 rows, catastrophic for 247,000. The nested loop performed 247,000 index lookups instead of the 50 it planned for. The query took 8.4 seconds instead of the milliseconds it should have.
The fix is straightforward:
-- Update table statistics so the planner can make better estimates
ANALYZE orders;
-- Then re-run your EXPLAIN ANALYZE and compare

The ANALYZE command updates the statistics that the planner uses for its estimates. After a bulk data load, a schema change, or any significant shift in data distribution, running ANALYZE on the affected tables is essential. PostgreSQL's autovacuum runs ANALYZE periodically, but it may not keep pace with rapid changes.
When ANALYZE is not enough: correlated columns
There is a subtler form of estimation error that single-column statistics cannot fix. When two columns in a WHERE clause are correlated, the planner assumes independence and underestimates the combined selectivity:
-- Column A has 10 distinct values. Column B has 10 distinct values.
-- The planner assumes independence: 1/10 * 1/10 = 1% of rows.
-- But A and B are perfectly correlated — the actual selectivity is 10%.
-- The planner underestimates by 10x.
EXPLAIN ANALYZE
SELECT * FROM events
WHERE category = 'purchase' AND subcategory = 'completed';
-- Estimated: 1,000 rows. Actual: 10,000 rows.

PostgreSQL 10 introduced extended statistics to address this:
-- PostgreSQL 10+ can learn about column correlations:
CREATE STATISTICS stts_events_cat_subcat (dependencies)
ON category, subcategory FROM events;
ANALYZE events;
-- Now the planner knows these columns are correlated.

Extended statistics are not created automatically. You must identify the correlated columns and create the statistics object yourself. This is a genuine limitation — the planner cannot discover correlations it has not been told about. In my experience, this accounts for a meaningful fraction of the "everything looks right but the estimates are still wrong" cases.
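On PostgreSQL 12 and later you can verify that the statistics object exists and has been populated. A sketch, reusing the statistics name from the example above:

```sql
-- Inspect the extended statistics after ANALYZE has run:
SELECT statistics_name, attnames, dependencies
FROM pg_stats_ext
WHERE statistics_name = 'stts_events_cat_subcat';
```

If the dependencies column is empty, ANALYZE has not run since the object was created, or the sampled data showed no functional dependency between the columns.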
Checking whether autovacuum is keeping statistics fresh
If you suspect stale statistics, you can verify directly:
-- Check if autovacuum is keeping up:
SELECT relname,
n_live_tup,
n_dead_tup,
round(n_dead_tup::numeric / greatest(n_live_tup, 1) * 100, 1) AS dead_pct,
last_analyze,
last_autoanalyze
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC;

Look at last_autoanalyze. If it was days ago on a table that receives frequent writes, autovacuum may not be running ANALYZE often enough. The default threshold is 10% of the table's rows — on a table with 10 million rows, that means 1 million rows must change before autoanalyze triggers. For tables with rapid churn, consider lowering the threshold with ALTER TABLE ... SET (autovacuum_analyze_scale_factor = 0.02).
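As a concrete sketch of that adjustment, assuming the orders table is the high-churn offender:

```sql
-- Trigger autoanalyze after ~2% of rows change, instead of the default 10%:
ALTER TABLE orders SET (autovacuum_analyze_scale_factor = 0.02);

-- Or force a one-off statistics refresh immediately:
ANALYZE orders;
```

The per-table setting overrides the global default only for this table, which is usually preferable to lowering the threshold cluster-wide.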
"The query plan is not a suggestion — it is the database telling you exactly what it did and why. Learning to read it is the single highest-leverage skill in PostgreSQL performance work."
— from You Don't Need Redis, Chapter 18: The PostgreSQL Performance Decision Framework
Subqueries, CTEs, and optimization fences
Correlated subqueries and CTEs appear frequently in query plans, and their behavior has changed significantly across PostgreSQL versions. Understanding how they are executed — and when they become performance problems — is worth a separate discussion.
Correlated subqueries
-- Correlated subquery — executes inner query per outer row:
EXPLAIN ANALYZE
SELECT c.name,
(SELECT COUNT(*) FROM orders o WHERE o.customer_id = c.id) AS order_count
FROM customers c;
-- SubPlan runs 200 times (once per customer):
Seq Scan on customers c (actual time=0.02..48.91 rows=200 loops=1)
SubPlan 1
-> Aggregate (actual time=0.23..0.23 rows=1 loops=200)
-> Index Scan using idx_orders_customer_id on orders o
(actual time=0.01..0.19 rows=487 loops=200)
-- Rewritten as a join — single pass:
EXPLAIN ANALYZE
SELECT c.name, COUNT(o.id) AS order_count
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id
GROUP BY c.name;
-- No SubPlan. No loops=200. One hash join.

The correlated subquery executed its inner plan 200 times — once per customer. The loops=200 is the signal. The join version processes the same data in a single pass. For 200 customers, the difference is modest. For 200,000 customers, the subquery version becomes untenable.
Not all subqueries are correlated, and not all correlated subqueries are slow. PostgreSQL can sometimes decorrelate a subquery automatically, converting it to a join internally. When it cannot, the SubPlan node with a high loop count is your warning. If you see it, consider rewriting the query as an explicit join.
CTEs: materialized or inlined?
-- PostgreSQL 12+: CTEs can be inlined (not optimization fences)
-- But if the CTE is referenced multiple times, it materializes.
EXPLAIN ANALYZE
WITH recent_orders AS (
SELECT * FROM orders WHERE created_at > '2025-01-01'
)
SELECT * FROM recent_orders WHERE status = 'pending';
-- In PostgreSQL 12+, this inlines the CTE — equivalent to a subquery.
-- In PostgreSQL 11 and earlier, the CTE is always materialized.

Before PostgreSQL 12, CTEs were always materialized — they acted as optimization fences, preventing the planner from pushing predicates into or through them. This meant a CTE that selected all orders, followed by a filter for pending orders, would materialize all orders first and filter second. The planner could not push the filter into the CTE.
PostgreSQL 12 changed this. Simple CTEs (referenced once, not recursive, not side-effecting) are now inlined by default. The planner can push predicates through them, and the CTE disappears from the plan entirely. This was a significant improvement.
If you want materialization — perhaps because the CTE is referenced multiple times and you want to avoid recomputing it — you can force it:
-- Force materialization when you WANT a fence:
WITH recent_orders AS MATERIALIZED (
SELECT * FROM orders WHERE created_at > '2025-01-01'
)
SELECT * FROM recent_orders WHERE status = 'pending';
-- The CTE result is computed once and stored in memory.
-- Useful when the CTE is referenced multiple times.

When reading a query plan, the presence of a CTE Scan node tells you the CTE was materialized. If you did not expect this, check whether the CTE is referenced multiple times or whether you are running a PostgreSQL version older than 12.
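The inverse keyword also exists on PostgreSQL 12 and later: NOT MATERIALIZED forces inlining even in cases where the planner would otherwise materialize.

```sql
-- Force inlining — the CTE becomes a fence-free subquery:
WITH recent_orders AS NOT MATERIALIZED (
    SELECT * FROM orders WHERE created_at > '2025-01-01'
)
SELECT * FROM recent_orders WHERE status = 'pending';
-- The planner may now push the status filter into the scan on orders,
-- and no CTE Scan node appears in the plan.
```

Use it when a multiply-referenced CTE is cheap to recompute and the predicates you want pushed down matter more than avoiding the recomputation.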
Parallel query plans
On PostgreSQL 9.6 and later, the planner can parallelize certain operations. When it does, the plan includes a Gather or Gather Merge node:
Gather (cost=1000.00..289412.45 rows=5000000 width=64)
(actual time=0.42..1842.91 rows=5000000 loops=1)
Workers Planned: 4
Workers Launched: 4
-> Parallel Seq Scan on events
(cost=0.00..238412.45 rows=1250000 width=64)
(actual time=0.03..412.18 rows=1000000 loops=5)
Filter: (created_at > '2024-01-01')
Rows Removed by Filter: 4000000
Two numbers to check:
- Workers Planned: 4 — the planner intended to use 4 parallel workers.
- Workers Launched: 4 — 4 workers were actually available and launched.
If Workers Launched is less than Workers Planned, the system had insufficient parallel worker capacity. Check max_parallel_workers and max_parallel_workers_per_gather. If workers were not launched, the parallel portion of the plan ran single-threaded, and the Gather node added overhead without benefit.
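Both settings can be inspected directly, and the per-gather limit can be adjusted for the current session while you experiment:

```sql
-- Inspect the relevant limits:
SHOW max_parallel_workers;
SHOW max_parallel_workers_per_gather;

-- Session-scoped adjustment; reverts on RESET or disconnect:
SET max_parallel_workers_per_gather = 4;
```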
Also note that in parallel plans, the rows count on a parallel node is the per-process average, and loops counts the participating processes: the leader works alongside the workers, which is why the example shows loops=5 with 4 workers launched. Total rows processed is rows × loops, here 1,000,000 × 5 = 5,000,000. The actual time is wall-clock time per process; the parallel portion takes roughly that long in total, not that time multiplied by the worker count, which is the whole point of parallelism.
An honest note on parallel query
Parallel query is not always beneficial. The overhead of spawning workers, distributing data, and gathering results means that small queries run slower in parallel. The planner generally makes correct decisions about when to parallelize, but if you see a parallel plan on a query that should be fast, consider whether the planner is overestimating the work required. The threshold for parallel execution is controlled by parallel_tuple_cost and parallel_setup_cost, but adjusting these is rarely necessary.
A diagnostic procedure, if I may
When faced with a slow query, the following procedure has served me well. It is not clever. It is methodical. In my experience, methodical outperforms clever when you are staring at a production issue at an inconvenient hour.
1. Run EXPLAIN (ANALYZE, BUFFERS) on the query. This is your starting point — never skip it. If the query is a write operation, wrap it in BEGIN/ROLLBACK.
2. Find the most expensive node. Look for the widest gap between the actual time of a child node and its parent. That gap is where time accumulates. Remember to multiply actual time by loops when loops is greater than 1.
3. Check for sequential scans on large tables. If you see Seq Scan with a high Rows Removed by Filter, an index is almost certainly missing. The ratio of rows removed to rows returned tells you how much waste the query produces.
4. Compare estimated vs actual rows. A 10x or greater divergence means stale statistics or correlated columns. Run ANALYZE on the table and try again. If the estimates are still wrong after ANALYZE, consider extended statistics.
5. Check the join strategy. A nested loop with a high loop count on the inner side is suspect. PostgreSQL may have underestimated the outer row count. A hash join with Batches greater than 1 is spilling to disk.
6. Look at BUFFERS. If read vastly exceeds hit, the query is I/O-bound. Either the working set does not fit in cache, or the query is scanning too much data. If I/O Timings is available, check what fraction of execution time is spent waiting for disk.
7. Check for Sort spilling to disk. Look for Sort Method: external merge Disk: and consider whether an index could eliminate the sort or whether work_mem needs adjustment for this session.
8. Look for SubPlan with high loops. A correlated subquery executing thousands of times is a candidate for rewriting as a join.
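The BEGIN/ROLLBACK wrapping in step 1 deserves a concrete sketch, because EXPLAIN ANALYZE genuinely executes the statement. The table and predicate here are illustrative:

```sql
-- The DELETE runs for real inside the transaction, so the timings
-- are honest — but the ROLLBACK discards its effects.
BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
DELETE FROM orders WHERE status = 'cancelled';
ROLLBACK;
```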
-- The sequential scan on a 10M row table, filtering to 287 rows:
-- Rows Removed by Filter: 9,999,713. That is 99.997% waste.
-- The fix:
CREATE INDEX idx_orders_status_created
ON orders (status, created_at);
-- Re-run EXPLAIN ANALYZE. The Seq Scan becomes an Index Scan.
-- 312ms becomes 0.4ms. The planner is not the problem — the
-- missing index was.
In the majority of cases, steps 3 and 4 identify the problem. A missing index or stale statistics account for perhaps 80% of the slow queries I encounter. For a systematic approach to finding and validating missing indexes, see the missing index detection guide. The remaining 20% involve more nuanced issues — poor join ordering, correlated subqueries, lock contention, table bloat — but the first 80% is where the largest gains live.
I should be forthcoming: this procedure is designed for single-query diagnostics. If you have dozens of slow queries, the more productive approach is to start with pg_stat_statements to identify which queries consume the most total time across all executions, then apply EXPLAIN ANALYZE to each one in order of impact. A query that runs in 5ms but executes 100,000 times per hour costs more than a query that runs in 2 seconds but executes twice.
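A starting query against pg_stat_statements might look like the following. The extension must be listed in shared_preload_libraries; note also that the timing columns are named total_exec_time and mean_exec_time from PostgreSQL 13 onward (total_time and mean_time before that):

```sql
-- Top queries by total time consumed across all executions:
SELECT substring(query, 1, 60) AS query_start,
       calls,
       round(total_exec_time::numeric, 1) AS total_ms,
       round(mean_exec_time::numeric, 2)  AS mean_ms
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```

Work down this list from the top with EXPLAIN ANALYZE, not from whichever query happened to feel slow.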
Common misreadings and pitfalls
I have reviewed enough query plans to have developed opinions about the mistakes I see repeatedly. Allow me to save you from a few of them.
Mistaking cost for time
The cost numbers are not milliseconds. They are not seconds. They are abstract units derived from PostgreSQL's cost model. A cost of 1000 does not mean 1000ms — it means "1000 units of estimated work." Two plans for the same query can be compared by cost. Two plans for different queries cannot. I have seen engineers add indexes because a cost number "looked too high" on a query that completed in 2ms. The cost was irrelevant. The actual time was fine.
Ignoring the startup cost
The first number in cost=8.20..431.72 is the startup cost — the work required before the first row is produced. For most nodes, this is low. For a Sort node, it equals the total cost because all rows must be sorted before any can be returned. For a HashAggregate, it includes building the hash table. When a query is feeding rows to a client one at a time (streaming results), a high startup cost means the first row takes a long time to arrive, even if subsequent rows are fast.
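The distinction is easiest to see on the same query with and without a sort. A sketch, with table and column names assumed; the enable_* settings are session-only planner toggles, suitable for demonstration rather than production:

```sql
-- With an index on created_at: Index Scan, startup cost near zero.
-- The first row can be returned almost immediately.
EXPLAIN SELECT * FROM orders ORDER BY created_at LIMIT 10;

-- Discourage the index path to force a Sort node instead:
SET enable_indexscan = off;
SET enable_indexonlyscan = off;
EXPLAIN SELECT * FROM orders ORDER BY created_at LIMIT 10;
-- The Sort node's startup cost now approaches its total cost:
-- every row must be read and sorted before the first is returned.
RESET enable_indexscan;
RESET enable_indexonlyscan;
```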
Reading only the top node
The top node's actual time is the total query time — but it tells you nothing about where the time was spent. The diagnostic information lives in the child nodes. A query with actual time=0.02..487.45 at the top and a child Seq Scan with actual time=0.01..486.22 tells you that 486 of the 487 milliseconds were consumed by the sequential scan. The top node contributed almost nothing. Fix the scan, and the entire query improves.
Optimizing the wrong query
This is perhaps the most common pitfall, and it does not involve reading the plan at all. A developer finds a slow query, optimizes it from 500ms to 5ms, and celebrates. But the query only runs twice per hour — a total savings of 990ms per hour. Meanwhile, a 10ms query that runs 50,000 times per hour consumes 500 seconds of database time, and nobody looks at it because "it's only 10ms."
Total impact is frequency multiplied by duration. Always check pg_stat_statements before deciding which queries to optimize. The query that deserves your attention is rarely the one that feels slow — it is the one that costs the most in aggregate.
When the query plan is fine but the query is still slow
Occasionally, the plan looks reasonable — appropriate indexes, accurate estimates, no sequential scans — and the query is still slow. The plan does not lie, but it does not tell the whole truth either. A few possibilities worth investigating:
- Lock contention. The query is waiting for a lock held by another transaction. EXPLAIN ANALYZE does not show wait time — the timer starts when the lock is acquired. Check pg_stat_activity and pg_locks for concurrent transactions holding conflicting locks. This is particularly common with UPDATE and DELETE operations on frequently modified rows.
- Network overhead. The query returns a very large result set, and the time is spent transferring data rather than computing it. Look at the width in the plan — a width of 2048 with 500,000 rows is nearly a gigabyte of data. EXPLAIN ANALYZE reports execution time, not transfer time. The application may perceive a much longer delay.
- Repeated execution. The query itself is fast, but it runs thousands of times — the N+1 pattern. No amount of index optimization helps if you are sending 2,000 queries where 1 would do. EXPLAIN ANALYZE examines a single execution. The problem is the multiplication.
- Autovacuum falling behind. Table bloat increases the number of pages to scan. A table with 50% dead tuples is doing twice the I/O it should be. Check pg_stat_user_tables for the ratio of n_dead_tup to n_live_tup. If dead tuples are accumulating, autovacuum may be running but not finishing — perhaps because of long-running transactions holding back the dead tuple horizon.
- Resource contention on the server. The database shares the machine with other processes, or multiple expensive queries run concurrently. CPU saturation, memory pressure, or I/O bandwidth limits can make any query slow regardless of its plan. This is not a planner problem — it is a capacity problem, and the solution is operational rather than architectural.
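Two of these checks translate directly into catalog queries. A sketch, using the pg_blocking_pids() helper available since PostgreSQL 9.6:

```sql
-- Who is blocked, and by which backend PIDs:
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       state,
       substring(query, 1, 60) AS query_start
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;

-- Dead-tuple ratio per table, worst offenders first:
SELECT relname,
       n_live_tup,
       n_dead_tup,
       round(100.0 * n_dead_tup / nullif(n_live_tup + n_dead_tup, 0), 1)
           AS dead_pct
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```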
Gold Lapel detects several of these patterns at the proxy level. It observes the actual query traffic — not just individual plans, but the patterns across hundreds of queries. An N+1 that EXPLAIN cannot see, a repeated aggregation that could be materialized, a filter that would benefit from an index it does not have. The query plan tells you what happened inside one query; traffic-level observation tells you what is happening across all of them.
The limits of EXPLAIN ANALYZE
A waiter who overstates his case is no waiter at all, and I should be honest about what EXPLAIN ANALYZE cannot do.
It cannot show you network latency between your application and the database. It cannot reveal that your ORM is sending 400 queries for a single page render. It cannot detect that another transaction holds a lock that your query is waiting for. It cannot account for OS page cache behavior, storage controller queuing, or CPU scheduler contention. It shows you the execution of one query in isolation, and that is both its strength and its limitation.
For single-query diagnostics, it is the best tool available. Nothing else gives you this level of detail about PostgreSQL's internal decision-making. But database performance is not solely a per-query concern. It is also a pattern concern — how queries interact, how they accumulate, how they contend for resources. EXPLAIN ANALYZE is the microscope. You also need the telescope.
This is not a criticism of the tool. It is a framing of its role. Use EXPLAIN ANALYZE to understand individual queries. Use pg_stat_statements to understand aggregate behavior. Use pg_stat_activity to understand concurrency. Use pg_locks to understand contention. Each tool answers a different question, and the diagnostic skill is knowing which question to ask.
The plan has been presented. Allow me to take my leave.
EXPLAIN ANALYZE is not a mysterious incantation. It is a report — thorough, precise, and unfailingly honest about what your database did and why. The planner made a plan. The engine executed it. The report tells you whether the plan was wise or whether the planner was working from stale information.
Read the nodes from the inside out. Compare estimated rows to actual rows. Look for sequential scans with high waste. Check whether sorts spilled to disk. Multiply actual time by loops. These five habits, applied consistently, will resolve the majority of slow queries you encounter.
The database was not slow. It was doing exactly what it was asked to do, with the information it had available. Your task is to give it better information — accurate statistics, appropriate indexes, well-structured queries — and it will respond accordingly. It always does.
PostgreSQL is, in my experience, remarkably willing to perform well. It simply requires that someone pay attention to its plans. I am gratified that you have chosen to do so.
Common EXPLAIN anti-patterns
Here are five patterns in EXPLAIN output that should give you pause. Each one tells a specific story about what the planner chose — and, more importantly, why it may have chosen poorly.
1. Nested Loop with high outer row count
-> Nested Loop (cost=0.43..847293.12 rows=50000 actual time=0.031..4832.109 rows=50000 loops=1)
-> Seq Scan on orders (rows=50000)
-> Index Scan on customers (rows=1 loops=50000)
Nested Loop executes the inner side once per outer row. With 50,000 outer rows, that inner index scan runs 50,000 times. Fifty thousand individual lookups. If the outer row count is large and the inner side does meaningful work, a Hash Join would very likely serve you better. The planner chose Nested Loop because it estimated fewer outer rows than actually materialized — check whether ANALYZE has been run recently. Stale statistics are the most common cause of this particular misjudgment.
2. Sort with "external merge Disk"
-> Sort (cost=185432.12..186432.12 rows=400000)
Sort Key: created_at
Sort Method: external merge Disk: 52432kB
"External merge Disk" means the sort exceeded work_mem and spilled to disk. I'm afraid this is not a minor inconvenience — it is dramatically slower than an in-memory quicksort. You have three remedies, in order of preference: increase work_mem for this specific query (SET work_mem = '256MB' before the query), add an index that provides the sort order natively, or reduce the data being sorted with a more selective WHERE clause.
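The first remedy takes about three lines, and the plan itself confirms whether it worked. The query here is illustrative:

```sql
-- Session-scoped: affects only this connection, until RESET.
SET work_mem = '256MB';
EXPLAIN ANALYZE SELECT * FROM orders ORDER BY created_at;
-- The Sort node should now report "Sort Method: quicksort Memory: ..."
-- rather than "external merge Disk: ...".
RESET work_mem;
```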
3. Seq Scan on a large table with a selective WHERE
-> Seq Scan on events (cost=0.00..284732.00 rows=142 actual time=12.431..1847.293 rows=138)
Filter: (user_id = 42 AND event_type = 'purchase')
Rows Removed by Filter: 12000000
The planner scanned 12 million rows to find 138. Twelve million to find one hundred and thirty-eight. This almost certainly calls for a composite index on (user_id, event_type). The planner chose a sequential scan either because no suitable index exists, or because the table statistics are stale. Run ANALYZE events and check again — if the sequential scan persists after fresh statistics, the index is your answer.
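On a table of this size in production, prefer the CONCURRENTLY form of the fix, which builds the index without holding a write-blocking lock for the duration. The index name is illustrative:

```sql
-- CONCURRENTLY cannot run inside a transaction block.
CREATE INDEX CONCURRENTLY idx_events_user_id_event_type
    ON events (user_id, event_type);
-- Refresh statistics so the planner sees the new option:
ANALYZE events;
```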
4. HashAggregate with Batches > 1
-> HashAggregate (cost=42187.00..52187.00 rows=500000)
Group Key: category_id
Batches: 5 Memory Usage: 8241kB Disk Usage: 31024kB
When Batches exceeds 1, the hash table has exceeded work_mem and spilled to disk in multiple passes. This is PostgreSQL 13+ behavior — earlier versions would switch to GroupAggregate instead, which is a different kind of compromise. Increase work_mem for the session, or look for opportunities to reduce the number of groups being aggregated. The former is the quicker remedy; the latter is often the more durable one.
5. Materialize with high loop count
-> Materialize (cost=0.00..1247.00 rows=50000 actual time=0.001..12.432 rows=50000 loops=2500)
Materialize stores intermediate results for reuse across loops. With 2,500 loops each reading 50,000 rows, this node is processing 125 million row-reads. That number deserves a moment of quiet reflection. Look at what is driving the loops — usually a Nested Loop join above it. If the materialized data is large and the loop count is high, restructuring the query to avoid the repeated scan is worth your time. A CTE or a rewritten join can often eliminate the pattern entirely.
Sharing EXPLAIN plans
When asking for help with a slow query — and there is no shame in asking; even the most experienced among us consult colleagues — these tools format and share your EXPLAIN output admirably:
- explain.depesz.com — highlights the slowest nodes and shows time percentages
- explain.dalibo.com — visual tree representation with flame graph
Both accept the output of EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) and produce shareable URLs. Excellent tools, both of them.
Terms referenced in this article
If I may — the query plan nodes catalogued above each have their own story. I have written a reference guide to every PostgreSQL query plan node that goes deeper on the scan types, join strategies, and aggregate nodes than this guided tour permitted. It is the room behind the room, if you will.