Table statistics

The intelligence your query planner relies on for every decision it makes. When the intelligence is wrong, the decisions follow suit.

Concept · March 21, 2026 · 9 min read

Table statistics are summaries of data distribution — how many distinct values a column has, which values are most common, how nulls are distributed, and how physical row order correlates with logical order. PostgreSQL stores these in the pg_statistic catalog (readable via the pg_stats view), and the ANALYZE command collects them. The query planner consults these statistics every time it plans a query to estimate how many rows each operation will return. When statistics are fresh and accurate, the planner makes sound decisions. When they are stale — after bulk loads, large deletes, or data distribution shifts — the planner is not wrong, precisely. It was given wrong data. The plans that follow are the natural consequence of outdated intelligence, and I have seen remarkably capable databases brought low by nothing more than a neglected ANALYZE.

What table statistics are

PostgreSQL does not look at actual data when planning a query — scanning a million-row table just to decide how to scan it would rather defeat the purpose. Instead, the planner relies on pre-computed summaries stored in the pg_statistic system catalog. These summaries describe the shape of each column's data: how many distinct values it has, which values appear most often, where the values fall on a number line, and how many are null.

The pg_stats view provides a human-readable interface to the same data. This is what you query when you want to see what the planner knows about a column.

SQL
-- View statistics the planner uses for a table
SELECT
  attname,
  n_distinct,
  null_frac,
  avg_width,
  most_common_vals,
  most_common_freqs,
  histogram_bounds,
  correlation
FROM pg_stats
WHERE tablename = 'orders'
  AND attname IN ('status', 'customer_id', 'created_at');

The planner also uses pg_class for table-level information — the estimated number of rows (reltuples) and the number of disk pages (relpages). These are updated by both ANALYZE and VACUUM.

SQL
-- Row count and page estimates from pg_class
SELECT
  relname,
  reltuples::bigint AS estimated_rows,
  relpages AS data_pages
FROM pg_class
WHERE relname = 'orders';

Together, these numbers form the planner's entire understanding of your data. Every cost estimate, every row count prediction, every decision about whether to use an index or scan the table — all of it traces back to these statistics. The planner is making the best decisions it can with the information it has. Your responsibility is to ensure that information is current.

What statistics contain

For each column in each table, ANALYZE computes the following:

  • n_distinct — the estimated number of distinct values. A positive number is an absolute count. A negative number (e.g., -0.5) means the distinct count is that fraction of the total row count — useful for columns where the number of distinct values scales with table size, like primary keys (which report -1)
  • most_common_vals / most_common_freqs — parallel arrays listing the most frequent values and their frequencies. For a status column where 92% of rows are "active" and 5% are "pending," these arrays tell the planner exactly what selectivity to expect for WHERE status = 'pending'
  • histogram_bounds — equally-populated histogram buckets covering values not already captured in most_common_vals. Used for range predicates like WHERE created_at > '2026-01-01'. With 100 buckets (the default), each bucket represents roughly 1% of the non-MCV values
  • null_frac — the fraction of rows where the column is null. A WHERE email IS NOT NULL filter on a column with null_frac = 0.3 tells the planner to expect 70% of rows to pass
  • correlation — how closely the physical row order on disk matches the sorted order of the column's values. Ranges from -1 to 1. High correlation (close to 1 or -1) means an index range scan will read pages sequentially, which is fast. Low correlation (close to 0) means an index scan will hop between random pages, which costs more on spinning disks
  • avg_width — the average size of the column's values in bytes. Used to estimate how much memory a sort or hash operation will need

The planner combines these per-column statistics with the table-level row count from pg_class to estimate how many rows will survive each filter, how large intermediate results will be, and which join strategy will be cheapest.
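To make the arithmetic concrete, here is a small sketch (in Python, with made-up frequencies rather than a real pg_statistic entry) of how an MCV lookup turns into a row estimate. The planner's actual selectivity functions handle many more cases:

```python
# Sketch: turning MCV statistics into a row estimate for WHERE status = X.
# All numbers are illustrative, not read from a real pg_statistic row.
reltuples = 1_000_000                     # table row count from pg_class
most_common_vals  = ["active", "pending", "cancelled"]
most_common_freqs = [0.92, 0.05, 0.02]    # fraction of rows per MCV entry
null_frac = 0.0
n_distinct = 10                           # assumed distinct status values

def estimate_equals(value):
    """Rough row estimate for an equality predicate, MCV-style."""
    if value in most_common_vals:
        sel = most_common_freqs[most_common_vals.index(value)]
    else:
        # Not in the MCV list: spread the leftover frequency evenly
        # over the remaining distinct values (a simplification the
        # planner also makes, in more careful form).
        leftover = 1.0 - sum(most_common_freqs) - null_frac
        sel = leftover / (n_distinct - len(most_common_vals))
    return round(reltuples * sel)

print(estimate_equals("pending"))   # 50000
print(estimate_equals("refunded"))  # small, evenly-spread estimate
```

This is why a value missing from the MCV list still gets a sensible estimate: the leftover frequency is divided among the remaining distinct values.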

ANALYZE

ANALYZE is the command that collects statistics. It samples rows from the table, computes the distribution summaries described above, and stores them in pg_statistic. Without it, the planner has nothing to work with — or worse, it works with outdated numbers from whenever ANALYZE last ran. The remedy, you will notice, is proportionate to the problem: one command, no downtime, immediate effect.

SQL
-- Analyze a single table
ANALYZE orders;

-- Analyze specific columns only (faster for wide tables)
ANALYZE orders (status, created_at);

-- Analyze every table in the database
ANALYZE;

-- Check when ANALYZE last ran and how stale statistics are
SELECT
  relname,
  last_analyze,
  last_autoanalyze,
  n_mod_since_analyze
FROM pg_stat_user_tables
WHERE schemaname = 'public'
ORDER BY n_mod_since_analyze DESC
LIMIT 10;

Autovacuum-triggered ANALYZE

You do not need to run ANALYZE manually in most cases. The autovacuum daemon tracks how many rows have been inserted, updated, or deleted since the last ANALYZE, and triggers a new one when the count exceeds a threshold.

SQL / postgresql.conf
-- Autovacuum triggers ANALYZE when this many rows have changed:
--   autovacuum_analyze_threshold + (autovacuum_analyze_scale_factor * reltuples)
-- Defaults: threshold = 50, scale_factor = 0.1 (10%)
-- For a table with 1,000,000 rows: after ~100,050 modifications

-- Lower the threshold for a high-churn table
ALTER TABLE events SET (autovacuum_analyze_scale_factor = 0.02);
-- Now triggers after ~20,050 modifications instead of ~100,050

For most tables, the default 10% threshold is reasonable. For large tables that receive steady writes — millions of rows per day — 10% can mean statistics go stale between ANALYZE runs. A table with ten million rows would need a million modifications before autovacuum notices. Lowering autovacuum_analyze_scale_factor to 0.02 or even 0.01 for those tables keeps statistics fresher without meaningful overhead. I consider this one of the more cost-effective adjustments available to you.
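The trigger arithmetic from the comment above is simple enough to check directly. A quick sketch, with parameter defaults mirroring the stock GUC values:

```python
# Modifications needed before autovacuum schedules an ANALYZE:
#   autovacuum_analyze_threshold + autovacuum_analyze_scale_factor * reltuples
def analyze_trigger(reltuples, threshold=50, scale_factor=0.1):
    """Rows that must change before autovacuum re-analyzes the table."""
    return round(threshold + scale_factor * reltuples)

print(analyze_trigger(1_000_000))                     # 100050
print(analyze_trigger(10_000_000))                    # 1000050
print(analyze_trigger(1_000_000, scale_factor=0.02))  # 20050
```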

When to run ANALYZE manually

After any operation that changes the data distribution significantly — bulk loads, large deletes, data migrations, or ALTER TABLE operations that add or change columns. Autovacuum will catch up eventually, but "eventually" can mean minutes of bad query plans on a busy system. If you have just loaded two million rows and your application is about to start querying them, I would not recommend leaving the planner to discover this on its own.

SQL
-- Spot stale statistics: compare planner estimates to actual rows
EXPLAIN ANALYZE
SELECT * FROM orders WHERE status = 'pending';

-- If you see something like:
--   Seq Scan on orders  (cost=0.00..25432.00 rows=50 width=96)
--                       (actual time=0.015..189.456 rows=47000 loops=1)
-- The planner expected 50 rows but got 47,000.
-- Fix: ANALYZE orders;

Statistics target

The statistics target controls how much detail ANALYZE captures for each column. It determines both the number of most-common-values entries and the number of histogram buckets. The default is 100, meaning up to 100 MCV entries and 100 histogram buckets per column.

SQL
-- Check the current statistics target for a column
SELECT
  attname,
  attstattarget
FROM pg_attribute
WHERE attrelid = 'orders'::regclass
  AND attname = 'status';
-- -1 means "use default_statistics_target" (default: 100)

-- Increase to 500 for a column with skewed or high-cardinality data
ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 500;
ANALYZE orders (customer_id);

-- Check the global default
SHOW default_statistics_target;  -- 100

For most columns, 100 is sufficient. The planner does not need a fine-grained histogram for a boolean column with two values or a status column with five. But for columns with highly skewed distributions or many distinct values — tenant ID columns in multi-tenant databases, user ID columns in activity tables, or any column where the most common values account for a large fraction of the table — increasing the target to 500 or 1000 gives the planner significantly better estimates.

The cost of a higher target is modest: ANALYZE samples more rows and stores a larger statistics array in pg_statistic. Planning time increases slightly because the planner searches a larger MCV list and interpolates across more histogram buckets. In practice, neither cost is noticeable. The benefit — better row estimates and better plans — is usually substantial. It is one of those rare tuning adjustments where you give up almost nothing and gain quite a lot.
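One way to see what the target buys you: each histogram bucket covers roughly 1/target of the non-MCV, non-null rows, so the smallest range the planner can resolve shrinks as the target grows. A sketch of that relationship:

```python
# Approximate rows per histogram bucket for a given statistics target.
# mcv_frac and null_frac are the fractions already accounted for by
# the MCV list and nulls; the histogram only covers what remains.
def bucket_resolution(reltuples, target, mcv_frac=0.0, null_frac=0.0):
    non_mcv_rows = reltuples * (1.0 - mcv_frac - null_frac)
    return non_mcv_rows / target

print(bucket_resolution(10_000_000, 100))    # 100000.0 rows per bucket
print(bucket_resolution(10_000_000, 1000))   # 10000.0 rows per bucket
```

At the default target, a range predicate on a ten-million-row table is estimated in strides of roughly a hundred thousand rows; at 1000, in strides of ten thousand.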

Extended statistics

Per-column statistics assume that conditions on different columns are independent. If a query filters on WHERE country = 'US' AND state = 'California', the planner multiplies the selectivity of each predicate separately. It does not know that knowing country = 'US' makes state = 'California' far more likely than state = 'Bavaria'. This independence assumption can lead to severe underestimates — the planner expects a handful of rows and receives thousands.
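A back-of-the-envelope illustration of the independence assumption, with made-up selectivities:

```python
# Made-up selectivities: 30% of rows are country = 'US', and 5% of
# rows are state = 'California' (all of which are also 'US' rows).
reltuples = 1_000_000
sel_country_us = 0.30
sel_state_ca   = 0.05

# Independence assumption: multiply the per-column selectivities.
independent_estimate = reltuples * sel_country_us * sel_state_ca
print(int(independent_estimate))   # 15000

# In reality every 'California' row is a 'US' row, so the true joint
# selectivity is just P(state = 'California'):
actual_rows = reltuples * sel_state_ca
print(int(actual_rows))            # 50000
```

The planner would plan for fifteen thousand rows and receive fifty thousand, and every operation downstream of that estimate inherits the error.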

Extended statistics, introduced in PostgreSQL 10, address this with admirable precision. You explicitly tell PostgreSQL which columns are correlated, and it tracks their joint distribution.

SQL
-- Create extended statistics for correlated columns (PostgreSQL 10+)
CREATE STATISTICS orders_country_state (dependencies)
  ON country, state FROM orders;

-- PostgreSQL 12+ also supports MCV (most common value) lists
CREATE STATISTICS orders_country_state_mcv (dependencies, mcv)
  ON country, state FROM orders;

-- After creating, run ANALYZE to populate them
ANALYZE orders;

-- View extended statistics
SELECT
  stxname,
  stxkeys,
  stxkind
FROM pg_statistic_ext
WHERE stxrelid = 'orders'::regclass;

Types of extended statistics

  • Functional dependencies (PostgreSQL 10+) — captures the degree to which one column determines another. Tells the planner that filtering on country already constrains state, so it should not multiply their selectivities independently
  • MCV lists (PostgreSQL 12+) — tracks the most common combinations of values across columns. More precise than functional dependencies for skewed multi-column distributions
  • N-distinct (PostgreSQL 10+) — tracks the number of distinct value combinations across a column group. Useful for GROUP BY estimates on multiple columns
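For the n-distinct case, the failure mode is easy to see with made-up counts: without multi-column statistics, the planner assumes distinct counts multiply, which can badly overestimate the number of GROUP BY groups:

```python
# Illustrative counts for GROUP BY country, state.
n_distinct_country = 50
n_distinct_state   = 600

# Without ndistinct statistics: assume the counts multiply
# (the planner also caps this at the table's row count).
independent_groups = n_distinct_country * n_distinct_state
print(independent_groups)   # 30000

# With ndistinct statistics: each state pairs with exactly one
# country in this data, so the true group count is just the states.
actual_groups = n_distinct_state
print(actual_groups)        # 600
```

An overestimated group count can push the planner toward a sort-based aggregate or an oversized hash table when a modest one would do.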

Extended statistics are not collected automatically — you must create them explicitly for the column combinations that matter, and then run ANALYZE to populate them. The planner uses them transparently, requiring no changes to your queries. You provide the insight; PostgreSQL does the rest. A well-mannered arrangement.

How Gold Lapel relates

Gold Lapel's optimization engine depends on the planner making good decisions — and the planner depends on good statistics. When the intelligence is current, the planner chooses well, queries are fast, and Gold Lapel's role is straightforward: monitor, confirm, stay out of the way. When statistics are stale, the planner is working from an outdated picture of the household, and no amount of proxy-level optimization can correct a fundamentally wrong execution strategy inside PostgreSQL.

This is why Gold Lapel monitors the gap between estimated and actual row counts across your workload. When it detects persistent estimation errors — the hallmark of stale or insufficient statistics — it surfaces them so you can take the appropriate action: run ANALYZE, increase a statistics target, or create extended statistics for correlated columns. Fresh statistics are the foundation that every other optimization is built on, including our own. We would be poor stewards of your database if we did not say so plainly.

Frequently asked questions