Row estimation
The planner's prediction of how many rows a query step will return — and, if you'll permit me, the single most common reason it picks a bad plan.
Every time PostgreSQL plans a query, it estimates how many rows each step will produce. These estimates drive every downstream decision — which join strategy to use, whether to scan an index or the whole table, how much memory to allocate for sorts. The estimates come from table statistics collected by ANALYZE. When the estimates are close to reality, the planner produces excellent plans. When they are orders of magnitude off — estimated 50 rows, actual 247,000 — the planner makes choices that can turn a 50-millisecond query into a 15-second one. I find this personally alarming, and I suspect you will too.
What row estimation is
At every node in a query plan, the planner needs to answer one question: how many rows will this step produce? For a sequential scan with a WHERE status = 'pending' filter, the planner looks up the frequency of 'pending' in the column's statistics and multiplies it by the table's row count. For a join, it estimates how many rows the join condition will match. For an aggregation, it estimates the number of groups.
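For a single equality filter, that arithmetic is simple enough to reproduce by hand. A sketch using hypothetical numbers (the 1,000,000 row count and 0.048 frequency are illustrative, not taken from any example below):

```sql
-- The planner's estimate for WHERE status = 'pending' is roughly:
--   estimated rows = reltuples * selectivity
-- where the selectivity comes from the most-common-values list in pg_stats.
-- With reltuples = 1,000,000 and a 'pending' frequency of 0.048:
--   1,000,000 * 0.048 = 48,000 estimated rows

-- reltuples holds the planner's current row-count estimate for the table
SELECT reltuples::bigint AS estimated_table_rows
FROM pg_class
WHERE relname = 'orders';
```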
These estimates are stored in pg_statistic (exposed through the pg_stats view) and include the most common values and their frequencies, histogram buckets for range queries, the number of distinct values, and the fraction of nulls. The ANALYZE command collects this data by sampling rows from each table — 30,000 rows by default (300 times the default statistics target of 100).
The planner does not guess. It calculates. But the calculation is only as good as the statistics it has been given — and stale statistics are, I'm afraid, rather common in the wild.
-- See what the planner knows about a column's data distribution
SELECT
attname,
n_distinct,
most_common_vals,
most_common_freqs,
null_frac
FROM pg_stats
WHERE tablename = 'orders'
AND attname = 'status';
-- If most_common_vals is NULL or outdated, the planner is guessing.

Why estimates matter
Row estimates are the foundation of every plan decision. The planner uses them to compare the cost of alternative strategies, and even modest errors can flip the choice between a plan that finishes in milliseconds and one that overstays its welcome by several seconds.
- Join strategy — A nested loop join is efficient when the outer side has few rows (each row triggers one index lookup on the inner side). When the planner estimates 10 rows but the actual is 50,000, the inner side executes 50,000 times instead of 10. A hash join — which builds a hash table once and probes it for each outer row — would have been orders of magnitude faster. The planner chose correctly for the data it believed it had. The data it believed it had was fiction.
- Scan method — The planner chooses between a sequential scan and an index scan based on how many rows it expects to return. If it estimates a query will return 80% of a table's rows, a sequential scan is cheaper than thousands of random index lookups. If it estimates 0.01% but the real answer is 5%, it may choose an index scan when a bitmap scan would be more efficient, or skip the index entirely.
- Memory allocation — Sort and hash operations are sized based on estimated row counts. Underestimates cause spills to disk when the actual data volume exceeds work_mem.
- Join order — In multi-table joins, the planner arranges tables to minimize intermediate result sizes. Wrong row estimates at one table can cascade through the join order, producing millions of intermediate rows that a correct estimate would have avoided.
-- A bad estimate at one node cascades through the entire plan
EXPLAIN ANALYZE
SELECT o.*, p.name, c.email
FROM orders o
JOIN products p ON p.id = o.product_id
JOIN customers c ON c.id = o.customer_id
WHERE o.region = 'eu-west' AND o.created_at > now() - interval '1 day';
-- Nested Loop (cost=1.27..45.89 rows=3 width=192)
-- (actual time=0.045..14523.678 rows=8400 loops=1)
-- -> Nested Loop (cost=0.85..32.45 rows=3 width=160)
-- (actual time=0.034..9876.543 rows=8400 loops=1)
-- -> Index Scan using orders_created_at_idx on orders o
-- (cost=0.43..12.67 rows=3 width=128)
-- (actual time=0.021..123.456 rows=8400 loops=1)
-- Filter: (region = 'eu-west')
-- -> Index Scan using products_pkey on products p
-- (cost=0.42..6.59 rows=1 width=32)
-- (actual time=0.002..0.002 rows=1 loops=8400)
-- -> Index Scan using customers_pkey on customers c
-- (cost=0.42..4.48 rows=1 width=32)
-- (actual time=0.001..0.001 rows=1 loops=8400)
-- The planner estimated 3 rows from orders. It got 8,400.
-- It chose nested loops for both joins — fine for 3 rows,
-- catastrophic for 8,400 (the inner sides each execute 8,400 times).

How to spot bad estimates
EXPLAIN ANALYZE is the diagnostic tool — allow me to recommend it unreservedly. It runs the query and shows both the planner's estimated row count and the actual row count at every node in the plan. The comparison between these two numbers tells you whether the planner had accurate information or was operating on outdated intelligence.
-- Compare estimated rows vs actual rows
EXPLAIN ANALYZE SELECT * FROM orders WHERE status = 'pending';
-- Output:
-- Seq Scan on orders (cost=0.00..25432.00 rows=1 width=128)
-- (actual time=0.015..203.456 rows=47823 loops=1)
-- Filter: (status = 'pending')
-- Rows Removed by Filter: 952177
-- Planning Time: 0.062 ms
-- Execution Time: 211.789 ms
-- The planner estimated 1 row. The query returned 47,823.
-- That 47,000x error caused the planner to skip an index scan
-- that would have finished in under 50 ms.

What to look for:
- Orders-of-magnitude differences — An estimate of 1 with an actual of 47,823 is the planner flying entirely blind. Anything beyond 10x warrants your attention.
- Consistent underestimates at the same node — If a filter or join consistently returns far more rows than estimated across multiple query executions, the statistics for that column are stale or insufficient.
- Cascading errors in joins — A bad estimate at an early node produces wrong cost calculations for every node above it. Look for the first node where estimates diverge — that is where the trouble begins, and where your fix should start.
- Seq Scan on a large table with a selective filter — If the planner chose a sequential scan and the actual rows are a tiny fraction of the table, the estimate was too high (the planner thought the filter was not selective) or statistics are missing entirely.
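Running EXPLAIN ANALYZE by hand does not scale to a whole workload. One way to catch misestimates continuously is the auto_explain module, which logs the full plan — with actual row counts — for any query slower than a threshold. A sketch; the 250 ms threshold is an arbitrary choice:

```sql
-- Log plans (with actual rows) for queries slower than 250 ms.
-- Session-level here; set these in postgresql.conf (with
-- shared_preload_libraries = 'auto_explain') to cover all sessions.
LOAD 'auto_explain';
SET auto_explain.log_min_duration = '250ms';
SET auto_explain.log_analyze = on;   -- include actual row counts
SET auto_explain.log_timing = off;   -- skip per-row timing to reduce overhead
```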
Common causes of bad estimates
Stale statistics
The most frequent cause — and, mercifully, the simplest to resolve. A table receives a bulk load, a mass update changes the distribution of a status column, or a new category appears in a previously stable dataset. Autovacuum triggers ANALYZE when roughly 10% of rows have changed, but that threshold can be too slow for rapidly changing tables. Until ANALYZE runs, the planner works with the old distribution and produces estimates that bear no resemblance to the data actually in the table.
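To check whether a table's statistics are stale, compare the number of rows modified since the last ANALYZE against the table size, and see when statistics were last refreshed. A sketch using pg_stat_user_tables; the 1% scale factor is an illustrative choice:

```sql
-- Rows modified since the last ANALYZE, and when statistics were last updated
SELECT
    relname,
    n_live_tup,
    n_mod_since_analyze,
    last_analyze,
    last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'orders';

-- For rapidly changing tables, trigger ANALYZE at ~1% churn instead of ~10%
ALTER TABLE orders SET (autovacuum_analyze_scale_factor = 0.01);
```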
Correlated columns
The planner assumes that conditions on different columns are statistically independent. If you filter on WHERE country = 'US' AND state = 'California', the planner multiplies each selectivity separately: 40% (country) times 12% (state) equals 4.8%. But California only exists within the US, so the actual selectivity is closer to 15%. The planner's arithmetic is impeccable; its assumption is wrong. This independence assumption produces underestimates whenever columns are correlated — which is common in real-world data.
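The arithmetic behind that misestimate, worked through with the percentages above (the 1,000,000-row table size is illustrative):

```sql
-- Independence assumption vs reality for country = 'US' AND state = 'California'
SELECT
    1000000 * 0.40 * 0.12 AS planner_estimate,     -- 48,000 rows
    1000000 * 0.15        AS actual_rows,          -- 150,000 rows
    0.15 / (0.40 * 0.12)  AS underestimate_factor; -- ~3.1x too low
```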
Complex expressions
The planner has detailed statistics for simple column comparisons (status = 'pending') but falls back to default estimates for expressions it cannot analyze. WHERE extract(year from created_at) = 2026 gets a generic 0.5% selectivity estimate because the planner does not maintain statistics on computed expressions. Rewriting as a range condition (WHERE created_at >= '2026-01-01' AND created_at < '2027-01-01') lets the planner use the histogram.
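When the expression cannot be rewritten — say, the year extraction is generated by an ORM — PostgreSQL 14 and later can collect statistics on the expression itself. A sketch; the statistics object name is mine:

```sql
-- PostgreSQL 14+: collect statistics on an expression the planner
-- would otherwise estimate with a hard-coded default selectivity
CREATE STATISTICS orders_created_year_stats
    ON (extract(year FROM created_at)) FROM orders;
ANALYZE orders;
-- WHERE extract(year from created_at) = 2026 now gets a real estimate
```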
Joins multiplying errors
Row estimation errors compound across joins. If the planner underestimates table A by 5x and table B by 3x, the join estimate can be off by 15x or more. Multi-table joins are particularly vulnerable — a moderate error at the first table propagates through every subsequent join, and by the time the planner estimates the final result, the number can be off by several orders of magnitude. It is, if I may say so, the infrastructural equivalent of a small navigational error at departure producing a landing several hundred miles from the intended airport.
Fixing bad estimates
Run ANALYZE
The first and most effective fix — and I appreciate when the remedy is proportionate to the problem. If statistics are stale, updating them is usually all it takes. One command, no downtime, immediate improvement. For tables that change rapidly, lower autovacuum_analyze_scale_factor so autovacuum runs ANALYZE more frequently.
-- Update statistics for the table
ANALYZE orders;
-- Verify the plan changes
EXPLAIN ANALYZE SELECT * FROM orders WHERE status = 'pending';
-- Output after ANALYZE:
-- Bitmap Heap Scan on orders (cost=534.12..18923.45 rows=47500 width=128)
-- (actual time=12.345..45.678 rows=47823 loops=1)
-- Recheck Cond: (status = 'pending')
-- Heap Blocks: exact=8234
-- -> Bitmap Index Scan on orders_status_idx (cost=0.00..522.25 rows=47500 width=0)
-- (actual time=8.901..8.901 rows=47823 loops=1)
-- Index Cond: (status = 'pending')
-- Planning Time: 0.098 ms
-- Execution Time: 51.234 ms
-- 4x faster. Same query, same data. Better statistics.

Increase the statistics target
The default statistics target of 100 means PostgreSQL keeps up to 100 most-common values and 100 histogram buckets per column. For columns with many distinct values or heavily skewed distributions, this resolution may be too coarse. Increasing it to 500 or 1000 gives the planner finer-grained data to work with, at the cost of slightly slower ANALYZE and slightly more memory for the statistics catalog.
-- Increase the statistics target for a column with many distinct values
ALTER TABLE orders ALTER COLUMN product_id SET STATISTICS 1000;
ANALYZE orders;
-- Default target is 100 histogram buckets.
-- 1000 gives the planner a finer-grained picture of the distribution.
-- Useful for high-cardinality columns or skewed data.

Create extended statistics for correlated columns
Since PostgreSQL 10, CREATE STATISTICS lets you tell the planner about dependencies between columns. This is a precision tool — it directly addresses the independence assumption that causes underestimates on correlated filters, and I am rather fond of it.
-- The planner assumes independent columns and multiplies selectivities.
-- country = 'US' returns 40% of rows. state = 'California' returns 12%.
-- Planner estimates: 40% * 12% = 4.8% of rows.
-- Reality: 15% of rows, because California is entirely within the US.
-- Fix: create extended statistics (PostgreSQL 10+)
CREATE STATISTICS orders_country_state (dependencies)
ON country, state FROM orders;
ANALYZE orders;
-- The planner now knows that state depends on country
-- and adjusts its row estimate accordingly.

Extended statistics support three types: dependencies (functional dependencies between columns), ndistinct (distinct value counts for column combinations), and mcv (most common value combinations, PostgreSQL 12+). For most correlated-column problems, dependencies is sufficient.
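When dependencies alone is not enough — a GROUP BY over two columns is misestimated, or particular value combinations are heavily skewed — the other two types can be added to the same statistics object. A sketch; the object name and column pair are illustrative:

```sql
-- ndistinct improves GROUP BY region, status estimates;
-- mcv (PostgreSQL 12+) captures skewed value combinations
CREATE STATISTICS orders_region_status_stats (ndistinct, mcv)
    ON region, status FROM orders;
ANALYZE orders;
```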
Rewrite expressions the planner can analyze
Replace function calls and computed expressions in WHERE clauses with range conditions on the original column. WHERE date_trunc('month', created_at) = '2026-03-01' gets a default estimate. WHERE created_at >= '2026-03-01' AND created_at < '2026-04-01' uses the histogram. The results are identical; the estimates are not. The planner can only work with what it can see.
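The two forms compared directly; a sketch against the same orders table used throughout:

```sql
-- Opaque to the planner: computed expression, default selectivity estimate
EXPLAIN SELECT * FROM orders
WHERE date_trunc('month', created_at) = '2026-03-01';

-- Transparent to the planner: range condition, histogram-based estimate
EXPLAIN SELECT * FROM orders
WHERE created_at >= '2026-03-01' AND created_at < '2026-04-01';
```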
How Gold Lapel relates
Gold Lapel sits between your application and PostgreSQL as a query proxy. I mention this because row estimation is precisely the kind of problem that benefits from continuous observation rather than periodic investigation. Gold Lapel watches every query's execution plan over time, including the row estimates and the actual row counts. When a query pattern consistently shows large estimation errors — the kind that cause the planner to choose nested loops where it should choose hash joins — it flags the pattern and applies complementary optimizations.
These optimizations work with the planner, not against it. Gold Lapel may create a materialized view that pre-computes a problematic join, recommend an index that changes the set of plans available, or restructure a query so the planner receives better selectivity information. The goal is to change the inputs to the planner so that its row estimates — and therefore its plan choices — improve without you having to run EXPLAIN ANALYZE on every query yourself. Though I would never discourage you from doing so.