Reading PostgreSQL EXPLAIN Output: A Visual Guide
The sketch arrived as a deeply indented tree. The innermost brushstrokes were applied first. Data flows upward from the leaves. I am told this is how one reads it.
The 60-second orientation
Ah. You ran EXPLAIN ANALYZE and received a wall of text. I had rather hoped we might avoid this, but since we are here, allow me to walk you through it one field at a time.
The essential mental model: a query plan is a tree. The innermost (most-indented) nodes execute first. Data flows upward from leaf nodes to the root. The root node produces the final result set.
Every node has:
- A type — what it does (Seq Scan, Index Scan, Hash Join, Sort, etc.)
- A cost estimate — what the planner predicted
- Actual execution numbers — what really happened (only with ANALYZE)
You read a plan from the inside out, bottom to top. The most indented lines are the first operations; the least indented line is the final operation that produces the query result.
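That reading order can even be recovered mechanically from the indentation. A rough Python sketch (a heuristic, not a real plan parser — sibling ordering is approximate, and the plan text here is invented):

```python
plan = """\
Limit
  ->  Sort
        ->  Hash Join
              ->  Seq Scan on orders
              ->  Hash
                    ->  Seq Scan on users
"""

def indent(line):
    # Leading spaces before the node; deeper indentation = inner node.
    return len(line) - len(line.lstrip())

nodes = [line for line in plan.splitlines() if line.strip()]
# Sorting by indentation, deepest first, gives a rough inside-out
# reading order: leaves before the joins, the root (Limit) last.
reading_order = [line.strip() for line in sorted(nodes, key=indent, reverse=True)]
```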
This guide covers 6 real query plans, progressively more complex. Each plan builds on concepts from the previous one. By Plan #6, every field in the output will be familiar.
This guide teaches you to read the output itself — what each field means and how the pieces fit together. For optimization strategies based on what you find, see the comprehensive EXPLAIN ANALYZE guide.
Plan #1 — the sequential scan
We begin with the simplest case. A query on a table with 100,000 rows:
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'alice@example.com';
Output:
Seq Scan on users (cost=0.00..1520.00 rows=1 width=128) (actual time=0.015..12.340 rows=1 loops=1)
Filter: (email = 'alice@example.com'::text)
Rows Removed by Filter: 99999
Planning Time: 0.085 ms
Execution Time: 12.456 ms
Field-by-field breakdown:
Seq Scan on users — the node type and the table it operates on. A Sequential Scan reads every row in the table from the first page to the last.
cost=0.00..1520.00 — the planner's estimated cost. The first number (0.00) is the startup cost — how much work before the first row can be returned. The second number (1520.00) is the total cost — the estimated cost to complete the entire operation. These are in planner cost units, not milliseconds.
rows=1 (in the cost parentheses) — the planner's estimate of how many rows this node will return.
width=128 — the planner's estimate of the average row width in bytes.
actual time=0.015..12.340 — the real execution time in milliseconds. The first number is the time to the first row returned. The second number is the time to complete the operation.
rows=1 (in the actual parentheses) — the actual number of rows returned. In this case, the planner's estimate matched reality.
loops=1 — how many times this node executed. For a standalone query, this is always 1. In joins, inner nodes may execute many times.
Filter: (email = 'alice@example.com'::text) — the condition applied during the scan. The database read every row and checked this condition on each one.
Rows Removed by Filter: 99999 — the rows that were read but did not match the filter. The database read 100,000 rows to find 1.
Planning Time: 0.085 ms — time the planner spent choosing this execution plan.
Execution Time: 12.456 ms — total wall-clock time to execute the query. This is the number that matters for application performance.
The concern here is plain: scanning 100,000 rows to find 1. This table needs an index on the email column.
What the cost numbers actually mean
Costs are not milliseconds. They are arbitrary units based on the planner's cost model — sequential page reads, random page reads, CPU processing per row, and CPU processing per operator.
cost=0.00..1520.00 means:
- Startup cost 0.00 — no work needed before the first row can be returned. Seq Scans can return the first row immediately.
- Total cost 1520.00 — the estimated total cost to read the entire table and apply the filter.
Startup cost matters for queries with LIMIT. A node with a high startup cost (like a Sort) must finish a substantial amount of work before returning the first row. A node with a low startup cost (like a Seq Scan) can return rows immediately.
You compare costs between alternative plans — the planner does this internally to choose the cheapest one. Cost and actual time are correlated, but the relationship is not linear.
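With the default cost parameters, the arithmetic behind 1520.00 can be reproduced. A hedged sketch: the parameter values are PostgreSQL's documented defaults, but the page count of 270 is back-derived so the numbers match this hypothetical table.

```python
# PostgreSQL's default planner cost parameters (server defaults).
seq_page_cost = 1.0          # one sequential page read
cpu_tuple_cost = 0.01        # processing one row
cpu_operator_cost = 0.0025   # evaluating one operator (the WHERE filter)

pages = 270       # assumed page count, back-derived for this example
rows = 100_000    # table row count from Plan #1

total_cost = (seq_page_cost * pages        # read every page
              + cpu_tuple_cost * rows      # examine every row
              + cpu_operator_cost * rows)  # apply the filter to every row
# Reproduces the Seq Scan's estimate of 1520.00 cost units.
```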
Plan #2 — the index scan
Now the same query, after we attend to the missing index:
CREATE INDEX idx_users_email ON users(email);
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'alice@example.com';
Output:
Index Scan using idx_users_email on users (cost=0.42..8.44 rows=1 width=128) (actual time=0.025..0.027 rows=1 loops=1)
Index Cond: (email = 'alice@example.com'::text)
Planning Time: 0.092 ms
Execution Time: 0.045 ms
The improvement is substantial:
Index Scan using idx_users_email on users — the planner chose to use the B-tree index instead of scanning the entire table.
cost=0.42..8.44 — dramatically lower than the Seq Scan's 1520. The startup cost (0.42) is the cost of descending the B-tree to find the first matching entry.
actual time=0.025..0.027 — 0.027ms versus 12.340ms. Approximately 450x faster.
Index Cond: (email = 'alice@example.com'::text) — the condition that drove the index lookup. This is different from Filter. Index Cond means the index was used to locate the matching rows directly. No unnecessary rows were read.
The critical distinction: Index Cond vs Filter. Index Cond means the index handled this condition — only matching rows were fetched. Filter means rows were fetched first and then checked — non-matching rows were read and discarded.
When you see both on the same node:
Index Scan using idx_orders_customer_id on orders (cost=0.42..520.00 rows=5 width=96) (actual time=0.030..1.250 rows=5 loops=1)
Index Cond: (customer_id = 42)
Filter: (status = 'shipped')
Rows Removed by Filter: 85
This means the index found 90 rows matching customer_id = 42, but 85 of them were rejected by the additional status = 'shipped' filter. Consider a composite index on (customer_id, status) to eliminate the filter entirely.
Plan #3 — the bitmap scan
Now a query that returns multiple rows, where the index covers only part of the condition:
EXPLAIN ANALYZE SELECT * FROM orders
WHERE customer_id = 42 AND status = 'shipped';
Output (assuming an index only on customer_id):
Bitmap Heap Scan on orders (cost=4.65..320.15 rows=5 width=96) (actual time=0.150..1.800 rows=5 loops=1)
Recheck Cond: (customer_id = 42)
Filter: (status = 'shipped')
Rows Removed by Filter: 85
Heap Blocks: exact=15
-> Bitmap Index Scan on idx_orders_customer_id (cost=0.00..4.65 rows=90 width=0) (actual time=0.100..0.100 rows=90 loops=1)
Index Cond: (customer_id = 42)
Planning Time: 0.110 ms
Execution Time: 1.920 ms
This is a two-phase operation — an inner node and an outer node.
Phase 1: Bitmap Index Scan (the inner, more-indented node). The index scan finds all row locations matching customer_id = 42 and stores them as a bitmap of heap pages. Note rows=90 — 90 rows matched in the index. The width=0 indicates this node returns page locations, not actual row data.
Phase 2: Bitmap Heap Scan (the outer node). The heap scan reads the actual table pages identified by the bitmap, in physical order. It then applies additional conditions:
- Recheck Cond — the index condition to re-verify on fetched rows (only actually needed when the bitmap is lossy, but shown regardless).
- Filter — the second condition is not in the index, so it is applied as a post-fetch filter.
- Rows Removed by Filter: 85 — 85 of the 90 rows were rejected.
- Heap Blocks: exact=15 — 15 table blocks were read.
The optimization opportunity is clear: a composite index on (customer_id, status) would eliminate the filter and reduce the rows fetched from 90 to 5.
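The two phases can be mimicked in a few lines of Python. This is an illustrative simulation, not executor code; the row ids and the 8-rows-per-page figure are invented:

```python
# Hypothetical row ids returned by the index in index order; assume
# each heap page holds 8 rows (both numbers are invented).
index_matches = [803, 12, 514, 17, 520, 88, 91, 810]

# Phase 1 (Bitmap Index Scan): convert row ids into the set of heap
# pages that contain at least one match.
pages = {rid // 8 for rid in index_matches}

# Phase 2 (Bitmap Heap Scan): visit those pages in physical order, so
# each page is read exactly once and the I/O pattern is mostly sequential.
visit_order = sorted(pages)
```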
Lossy vs exact bitmap
Exact bitmap: Heap Blocks: exact=15. The bitmap stored the exact row locations within each page.
Lossy bitmap: When the bitmap exceeds work_mem, PostgreSQL drops from row-level precision to page-level precision for some blocks:
Bitmap Heap Scan on orders (cost=500.00..15000.00 rows=50000 width=96) (actual time=25.000..350.000 rows=48500 loops=1)
Recheck Cond: (customer_id = ANY ('{1,2,3,4,5}'::integer[]))
Rows Removed by Index Recheck: 15000
Heap Blocks: exact=1000 lossy=5000
5,000 blocks used lossy precision, causing 15,000 extra rows to be fetched and rechecked. Raising work_mem allows the bitmap to use exact precision for all blocks.
Plan #4 — the nested loop join
If you'll follow me to a slightly more complex arrangement — a two-table query with LIMIT:
EXPLAIN ANALYZE SELECT o.id, u.name
FROM orders o
JOIN users u ON u.id = o.user_id
WHERE o.created_at > '2026-01-01'
LIMIT 10;
Output:
Limit (cost=0.85..15.30 rows=10 width=36) (actual time=0.050..0.120 rows=10 loops=1)
-> Nested Loop (cost=0.85..14500.00 rows=10000 width=36) (actual time=0.048..0.115 rows=10 loops=1)
-> Index Scan using idx_orders_created_at on orders o (cost=0.42..7250.00 rows=10000 width=16) (actual time=0.030..0.055 rows=10 loops=1)
Index Cond: (created_at > '2026-01-01'::date)
-> Index Scan using users_pkey on users u (cost=0.42..0.44 rows=1 width=24) (actual time=0.005..0.005 rows=1 loops=10)
Index Cond: (id = o.user_id)
Planning Time: 0.250 ms
Execution Time: 0.145 ms
Reading top to bottom:
Limit — the outermost node. It stops execution after 10 rows.
Nested Loop — the join strategy. A Nested Loop works like a for-loop: for each row from the outer side, look up the matching row(s) on the inner side.
Index Scan using idx_orders_created_at on orders o — the outer side (the driving table). Because of the LIMIT, it only produced 10 rows before the Limit node stopped it.
Index Scan using users_pkey on users u — the inner side (the lookup table). Notice loops=10 — this node executed 10 times, once for each order row.
How to read actual time with loops > 1. The reported actual time is the per-loop average. actual time=0.005..0.005 rows=1 loops=10 means each individual lookup took about 0.005ms, and the total across all 10 lookups was approximately 0.05ms.
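The for-loop picture, and the loops arithmetic, map directly onto a toy sketch (invented data, not PostgreSQL internals — a dict lookup stands in for the primary-key index):

```python
orders = [(101, 2), (102, 1), (103, 2)]   # (order_id, user_id): the outer side
users = {1: "alice", 2: "bob"}            # dict lookup stands in for users_pkey

result, loops = [], 0
for oid, uid in orders:        # one iteration per outer row
    loops += 1                 # each iteration is one inner index lookup
    if uid in users:
        result.append((oid, users[uid]))

# EXPLAIN reports the inner node's actual time as a per-loop average,
# so the total is the reported time multiplied by loops:
per_loop_ms = 0.005
total_ms = per_loop_ms * loops  # about 0.015 ms across 3 lookups
```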
When Nested Loop is a concern: when loops is very high:
-> Index Scan using users_pkey on users u (cost=0.42..0.44 rows=1 width=24) (actual time=0.400..0.410 rows=1 loops=100000)
Index Cond: (id = o.user_id)
loops=100000 means 100,000 individual index lookups. Even at 0.4ms each, that is 40 seconds total. A Hash Join would process the same data in a single pass.
Plan #5 — the hash join
Now an analytical query — two tables joined with aggregation:
EXPLAIN ANALYZE SELECT u.name, count(*)
FROM orders o
JOIN users u ON u.id = o.user_id
GROUP BY u.name;
Output:
HashAggregate (cost=3520.00..3570.00 rows=5000 width=20) (actual time=85.000..87.500 rows=4850 loops=1)
Group Key: u.name
Batches: 1 Memory Usage: 625kB
-> Hash Join (cost=230.00..3020.00 rows=100000 width=12) (actual time=5.000..55.000 rows=100000 loops=1)
Hash Cond: (o.user_id = u.id)
-> Seq Scan on orders o (cost=0.00..1640.00 rows=100000 width=4) (actual time=0.010..12.000 rows=100000 loops=1)
-> Hash (cost=155.00..155.00 rows=5000 width=24) (actual time=4.500..4.500 rows=5000 loops=1)
Buckets: 8192 Batches: 1 Memory Usage: 290kB
-> Seq Scan on users u (cost=0.00..155.00 rows=5000 width=24) (actual time=0.010..2.000 rows=5000 loops=1)
Planning Time: 0.300 ms
Execution Time: 88.500 ms
Reading from the inside out:
Seq Scan on users u — the innermost node. Reads all 5,000 users.
Hash — builds an in-memory hash table from the users rows. Key details:
- Buckets: 8192 — the hash table has 8,192 buckets
- Batches: 1 — the entire hash table fit in memory. This is what you want.
- Memory Usage: 290kB — the hash table uses 290KB of memory
Batches: 1 vs Batches > 1: If Batches is greater than 1, the hash table exceeded work_mem and spilled to disk. This significantly slows the join.
Hash Join — the join strategy. The Hash Join builds a hash table from the smaller table (users), then probes it with each row from the larger table (orders).
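Build-then-probe is simple to sketch. A toy Python illustration (invented rows, not the actual executor):

```python
users = [(1, "alice"), (2, "bob"), (3, "cara")]   # smaller input: build side
orders = [(201, 3), (202, 1), (203, 9)]           # larger input: probe side

# Build phase: hash the smaller input on its join key (u.id).
hash_table = {uid: name for uid, name in users}

# Probe phase: a single pass over the larger input (o.user_id),
# one O(1) lookup per row instead of one index descent per row.
joined = [(oid, hash_table[uid]) for oid, uid in orders if uid in hash_table]
```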
HashAggregate — performs the GROUP BY aggregation:
- Group Key: u.name — the grouping column
- Batches: 1 — the aggregation fit in memory
- Memory Usage: 625kB — memory consumed by the aggregation hash table
When you see HashAggregate with disk spill:
HashAggregate (cost=5000.00..5500.00 rows=500000 width=20) (actual time=350.000..500.000 rows=485000 loops=1)
Group Key: u.name
Planned Partitions: 4 Peak Memory Usage: 4145kB Disk Usage: 15360kB
The aggregation has 500,000 distinct groups — too many for work_mem. Consider raising work_mem or adding an index that enables GroupAggregate with sorted access.
Plan #6 — the complex plan (Sort, CTE, Subquery)
And now, if I may, a real-world query that brings several elements together — a CTE, join, sort, and limit:
EXPLAIN ANALYZE
WITH recent_orders AS (
SELECT * FROM orders WHERE created_at > now() - interval '30 days'
)
SELECT u.name, ro.total
FROM recent_orders ro
JOIN users u ON u.id = ro.user_id
ORDER BY ro.total DESC
LIMIT 20;
Output:
Limit (cost=2890.00..2890.05 rows=20 width=28) (actual time=35.000..35.010 rows=20 loops=1)
-> Sort (cost=2890.00..2940.00 rows=20000 width=28) (actual time=34.990..35.000 rows=20 loops=1)
Sort Key: ro.total DESC
Sort Method: top-N heapsort Memory: 27kB
-> Hash Join (cost=230.00..2640.00 rows=20000 width=28) (actual time=5.500..28.000 rows=19500 loops=1)
Hash Cond: (ro.user_id = u.id)
-> CTE Scan on recent_orders ro (cost=0.00..400.00 rows=20000 width=12) (actual time=0.015..8.000 rows=19500 loops=1)
-> Hash (cost=155.00..155.00 rows=5000 width=24) (actual time=5.000..5.000 rows=5000 loops=1)
Buckets: 8192 Batches: 1 Memory Usage: 290kB
-> Seq Scan on users u (cost=0.00..155.00 rows=5000 width=24) (actual time=0.010..2.500 rows=5000 loops=1)
Planning Time: 0.400 ms
Execution Time: 35.200 ms
Reading from the inside out:
Seq Scan on users u — reads all users.
Hash — builds a hash table from users.
CTE Scan on recent_orders ro — reads from the materialized CTE. In PostgreSQL 12+, CTEs can be inlined when they are referenced only once and have no side effects.
Hash Join — joins the recent orders with users via the hash table.
Sort — sorts the joined results by ro.total DESC. Key details:
- Sort Key: ro.total DESC — the column and direction
- Sort Method: top-N heapsort — an optimized sort for LIMIT queries. Instead of sorting all 19,500 rows, it maintains a heap of the top 20.
- Memory: 27kB — the heap uses a trivial amount of memory
Limit — returns only the top 20 rows.
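The top-N idea is easy to see in miniature. A hedged Python sketch with synthetic data, where heapq.nlargest plays the role of the executor's bounded heap:

```python
import heapq
import random

random.seed(1)
totals = [random.randint(0, 1_000_000) for _ in range(19_500)]  # fake order totals

# Full sort: order all 19,500 values, then take 20 — O(n log n).
top_via_sort = sorted(totals, reverse=True)[:20]

# Top-N heapsort: keep only a 20-element heap of the best seen so far
# while streaming through the input — O(n log 20), tiny memory.
top_via_heap = heapq.nlargest(20, totals)
```

Both produce the same 20 rows; the heap version never materializes the full sorted list, which is why the plan reports only 27kB of sort memory.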
When you see external merge Disk:
Sort (cost=250000.00..252500.00 rows=1000000 width=64) (actual time=4500.000..5200.000 rows=1000000 loops=1)
Sort Key: created_at DESC
Sort Method: external merge Disk: 153600kB
The sort spilled 150MB to disk because the data exceeded work_mem. Three possible fixes:
- Raise work_mem to accommodate the sort in memory
- Add an index on created_at DESC to provide pre-sorted access, eliminating the Sort node entirely
- Restructure the query to reduce the data volume before the sort
The cheat sheet — fields at a glance
For your reference — every field you will encounter in EXPLAIN ANALYZE output, gathered in one place:
Cost and estimates
| Field | Meaning |
|---|---|
| cost=startup..total | Planner's estimated cost in arbitrary units. Compare between plans, not as absolute time. |
| rows=N (estimated) | Planner's prediction of rows returned by this node |
| width=N | Estimated average row width in bytes |
Actual execution (ANALYZE only)
| Field | Meaning |
|---|---|
| actual time=startup..total | Real execution time in milliseconds |
| rows=N (actual) | Real row count returned |
| loops=N | Number of times this node executed. Multiply actual time x loops for total time. |
Scan and filter fields
| Field | Meaning |
|---|---|
| Index Cond: | Condition that drove the index lookup — rows were found via the index |
| Filter: | Condition applied after fetching rows — rows were read then checked |
| Rows Removed by Filter: | Rows fetched but rejected by the Filter condition |
| Recheck Cond: | Condition rechecked after a bitmap or lossy index scan |
Sort fields
| Field | Meaning |
|---|---|
| Sort Method: quicksort | Sorted in memory — normal sort |
| Sort Method: top-N heapsort | Optimized in-memory sort for LIMIT queries |
| Sort Method: external merge Disk: NkB | Sort spilled to disk — work_mem exceeded |
Hash and join fields
| Field | Meaning |
|---|---|
| Hash Cond: | The condition used for hash join matching |
| Batches: 1 | Hash table fit in memory (good) |
| Batches: N (N > 1) | Hash table spilled to disk (raise work_mem) |
| Memory Usage: | Memory consumed by hash table or sort |
Summary fields
| Field | Meaning |
|---|---|
| Planning Time: | Time spent choosing the plan (not executing) |
| Execution Time: | Total wall-clock time to execute the query |
Buffer fields (with BUFFERS option)
| Field | Meaning |
|---|---|
| Buffers: shared hit=N | Pages found in PostgreSQL's buffer cache (fast) |
| Buffers: shared read=N | Pages read from the operating system (possibly disk) |
| Buffers: shared dirtied=N | Pages modified in the buffer cache |
| Buffers: shared written=N | Pages written to disk during this query |
The three things to check first
When you want a quick assessment — "is this plan sound?" — check these three things in order.
1. Estimated vs actual rows
Compare rows=N in the cost estimate with rows=N in the actual execution for each node. If they differ by 10x or more, the planner is making decisions based on wrong information.
-> Index Scan using idx_orders_status on orders (cost=0.42..8.44 rows=10 width=96) (actual time=0.030..450.000 rows=50000 loops=1)
The planner estimated 10 rows; the actual was 50,000. Table statistics are stale — run ANALYZE orders; to update.
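This check is easy to automate. A hedged sketch that pulls both rows=N values out of a node line with a regular expression (a real tool would parse EXPLAIN (FORMAT JSON) instead of scraping text):

```python
import re

LINE = ("Index Scan using idx_orders_status on orders "
        "(cost=0.42..8.44 rows=10 width=96) "
        "(actual time=0.030..450.000 rows=50000 loops=1)")

def estimate_mismatch(line, threshold=10.0):
    # The first rows=N on a node line is the planner's estimate; the
    # second (inside the "actual" parentheses) is the real count.
    counts = [int(n) for n in re.findall(r"rows=(\d+)", line)]
    if len(counts) < 2 or 0 in counts:
        return None
    est, actual = counts[:2]
    ratio = max(est, actual) / min(est, actual)
    return ratio if ratio >= threshold else None

mismatch = estimate_mismatch(LINE)   # 50000 / 10: a 5000x mismatch
```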
2. Seq Scan on large tables with high filter removal
Seq Scan on users (cost=0.00..1520.00 rows=1 width=128) (actual time=0.015..12.340 rows=1 loops=1)
Filter: (email = 'alice@example.com'::text)
Rows Removed by Filter: 99999
If Rows Removed by Filter is 90% or more of the rows scanned, an index on the filter column would eliminate the wasted I/O. Note: a Seq Scan is not always bad. On small tables, it is often faster than an Index Scan. See fixing sequential scans for detailed strategies.
3. Sort or Hash with disk usage
Any Sort Method: external merge Disk or Batches > 1 on a Hash means the operation exceeded work_mem and spilled to disk. Options:
- Raise work_mem for this query: SET work_mem = '256MB';
- Add an index to avoid the sort entirely
- Restructure the query to reduce the data volume before the sort or hash
These three checks — row estimate accuracy, high filter removal on Seq Scans, and disk spills — catch the vast majority of EXPLAIN-visible performance problems.
Where to go next
You can now read every field in EXPLAIN ANALYZE output. The next steps, should you wish to continue:
- Optimization strategies: For what to do about the patterns you find, see the comprehensive EXPLAIN ANALYZE guide.
- Row estimation: For a deeper understanding of how the planner predicts row counts, see the row estimation glossary.
- Slow query diagnosis: If you arrived here because a specific query is slow, the slow query diagnostic walks through a complete diagnostic workflow.
- Performance tuning: For the full picture, see the performance tuning guide.
Gold Lapel reads query plans the way this article taught you to — continuously and automatically. It watches for the patterns described here (row estimate mismatches, unnecessary Seq Scans, disk spills) across every query your application runs, and attends to them without manual EXPLAIN sessions.