Table statistics

The intelligence your query planner relies on for every decision it makes. When the intelligence is wrong, the decisions follow suit.

Concept · March 21, 2026 · 9 min read

Table statistics are summaries of data distribution — how many distinct values a column has, which values are most common, how nulls are distributed, and how physical row order correlates with logical order. PostgreSQL stores these in the pg_statistic catalog (readable via the pg_stats view), and the ANALYZE command collects them. The query planner consults these statistics every time it plans a query to estimate how many rows each operation will return. When statistics are fresh and accurate, the planner makes sound decisions. When they are stale — after bulk loads, large deletes, or data distribution shifts — the planner is not wrong, precisely. It was given wrong data. The plans that follow are the natural consequence of outdated intelligence, and I have seen remarkably capable databases brought low by nothing more than a neglected ANALYZE.

What table statistics are

PostgreSQL does not look at actual data when planning a query — scanning a million-row table just to decide how to scan it would rather defeat the purpose. Instead, the planner relies on pre-computed summaries stored in the pg_statistic system catalog. These summaries describe the shape of each column's data: how many distinct values it has, which values appear most often, where the values fall on a number line, and how many are null.

The pg_stats view provides a human-readable interface to the same data. This is what you query when you want to see what the planner knows about a column.

SQL
-- View statistics the planner uses for a table
SELECT
  attname,
  n_distinct,
  null_frac,
  avg_width,
  most_common_vals,
  most_common_freqs,
  histogram_bounds,
  correlation
FROM pg_stats
WHERE tablename = 'orders'
  AND attname IN ('status', 'customer_id', 'created_at');

The planner also uses pg_class for table-level information — the estimated number of rows (reltuples) and the number of disk pages (relpages). These are updated by both ANALYZE and VACUUM.

SQL
-- Row count and page estimates from pg_class
SELECT
  relname,
  reltuples::bigint AS estimated_rows,
  relpages AS data_pages
FROM pg_class
WHERE relname = 'orders';

Together, these numbers form the planner's entire understanding of your data. Every cost estimate, every row count prediction, every decision about whether to use an index or scan the table — all of it traces back to these statistics. The planner is making the best decisions it can with the information it has. Your responsibility is to ensure that information is current.

What statistics contain

For each column in each table, ANALYZE computes the following:

  • n_distinct — the estimated number of distinct values. A positive number is an absolute count. A negative number (e.g., -0.5) means the distinct count is that fraction of the total row count — useful for columns where the number of distinct values scales with table size, like primary keys (which report -1)
  • most_common_vals / most_common_freqs — parallel arrays listing the most frequent values and their frequencies. For a status column where 92% of rows are "active" and 5% are "pending," these arrays tell the planner exactly what selectivity to expect for WHERE status = 'pending'
  • histogram_bounds — equally-populated histogram buckets covering values not already captured in most_common_vals. Used for range predicates like WHERE created_at > '2026-01-01'. With 100 buckets (the default), each bucket represents roughly 1% of the non-MCV values
  • null_frac — the fraction of rows where the column is null. A WHERE email IS NOT NULL filter on a column with null_frac = 0.3 tells the planner to expect 70% of rows to pass
  • correlation — how closely the physical row order on disk matches the sorted order of the column's values. Ranges from -1 to 1. High correlation (close to 1 or -1) means an index range scan will read pages sequentially, which is fast. Low correlation (close to 0) means an index scan will hop between random pages, which costs more on spinning disks
  • avg_width — the average size of the column's values in bytes. Used to estimate how much memory a sort or hash operation will need

The planner combines these per-column statistics with the table-level row count from pg_class to estimate how many rows will survive each filter, how large intermediate results will be, and which join strategy will be cheapest.
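To make the arithmetic concrete, here is a small sketch (in Python, with made-up frequencies rather than a real pg_statistic entry) of how an MCV lookup turns into a row estimate. The planner's actual selectivity functions handle many more cases:

```python
# Sketch: turning MCV statistics into a row estimate for WHERE status = X.
# All numbers are illustrative, not read from a real pg_statistic row.
reltuples = 1_000_000                     # table row count from pg_class
most_common_vals  = ["active", "pending", "cancelled"]
most_common_freqs = [0.92, 0.05, 0.02]    # fraction of rows per MCV entry
null_frac = 0.0
n_distinct = 10                           # assumed distinct status values

def estimate_equals(value):
    """Rough row estimate for an equality predicate, MCV-style."""
    if value in most_common_vals:
        sel = most_common_freqs[most_common_vals.index(value)]
    else:
        # Not in the MCV list: spread the leftover frequency evenly
        # over the remaining distinct values (a simplification the
        # planner also makes, in more careful form).
        leftover = 1.0 - sum(most_common_freqs) - null_frac
        sel = leftover / (n_distinct - len(most_common_vals))
    return round(reltuples * sel)

print(estimate_equals("pending"))   # 50000
print(estimate_equals("refunded"))  # small, evenly-spread estimate
```

This is why a value missing from the MCV list still gets a sensible estimate: the leftover frequency is divided among the remaining distinct values.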

ANALYZE

ANALYZE is the command that collects statistics. It samples rows from the table, computes the distribution summaries described above, and stores them in pg_statistic. Without it, the planner has nothing to work with — or worse, it works with outdated numbers from whenever ANALYZE last ran. The remedy, you will notice, is proportionate to the problem: one command, no downtime, immediate effect.

SQL
-- Analyze a single table
ANALYZE orders;

-- Analyze specific columns only (faster for wide tables)
ANALYZE orders (status, created_at);

-- Analyze every table in the database
ANALYZE;

-- Check when ANALYZE last ran and how stale statistics are
SELECT
  relname,
  last_analyze,
  last_autoanalyze,
  n_mod_since_analyze
FROM pg_stat_user_tables
WHERE schemaname = 'public'
ORDER BY n_mod_since_analyze DESC
LIMIT 10;

Autovacuum-triggered ANALYZE

You do not need to run ANALYZE manually in most cases. The autovacuum daemon tracks how many rows have been inserted, updated, or deleted since the last ANALYZE, and triggers a new one when the count exceeds a threshold.

SQL / postgresql.conf
-- Autovacuum triggers ANALYZE when this many rows have changed:
--   autovacuum_analyze_threshold + (autovacuum_analyze_scale_factor * reltuples)
-- Defaults: threshold = 50, scale_factor = 0.1 (10%)
-- For a table with 1,000,000 rows: after ~100,050 modifications

-- Lower the threshold for a high-churn table
ALTER TABLE events SET (autovacuum_analyze_scale_factor = 0.02);
-- Now triggers after ~20,050 modifications instead of ~100,050

For most tables, the default 10% threshold is reasonable. For large tables that receive steady writes — millions of rows per day — 10% can mean statistics go stale between ANALYZE runs. A table with ten million rows would need a million modifications before autovacuum notices. Lowering autovacuum_analyze_scale_factor to 0.02 or even 0.01 for those tables keeps statistics fresher without meaningful overhead. I consider this one of the more cost-effective adjustments available to you.
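The trigger arithmetic from the comment above is simple enough to check directly. A quick sketch, with parameter defaults mirroring the stock GUC values:

```python
# Modifications needed before autovacuum schedules an ANALYZE:
#   autovacuum_analyze_threshold + autovacuum_analyze_scale_factor * reltuples
def analyze_trigger(reltuples, threshold=50, scale_factor=0.1):
    """Rows that must change before autovacuum re-analyzes the table."""
    return round(threshold + scale_factor * reltuples)

print(analyze_trigger(1_000_000))                     # 100050
print(analyze_trigger(10_000_000))                    # 1000050
print(analyze_trigger(1_000_000, scale_factor=0.02))  # 20050
```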

When to run ANALYZE manually

After any operation that changes the data distribution significantly — bulk loads, large deletes, data migrations, or ALTER TABLE operations that add or change columns. Autovacuum will catch up eventually, but "eventually" can mean minutes of bad query plans on a busy system. If you have just loaded two million rows and your application is about to start querying them, I would not recommend leaving the planner to discover this on its own.

SQL
-- Spot stale statistics: compare planner estimates to actual rows
EXPLAIN ANALYZE
SELECT * FROM orders WHERE status = 'pending';

-- If you see something like:
--   Seq Scan on orders  (cost=0.00..25432.00 rows=50 width=96)
--                       (actual time=0.015..189.456 rows=47000 loops=1)
-- The planner expected 50 rows but got 47,000.
-- Fix: ANALYZE orders;

Statistics target

The statistics target controls how much detail ANALYZE captures for each column. It determines both the number of most-common-values entries and the number of histogram buckets. The default is 100, meaning up to 100 MCV entries and 100 histogram buckets per column.

SQL
-- Check the current statistics target for a column
SELECT
  attname,
  attstattarget
FROM pg_attribute
WHERE attrelid = 'orders'::regclass
  AND attname = 'status';
-- -1 means "use default_statistics_target" (default: 100)

-- Increase to 500 for a column with skewed or high-cardinality data
ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 500;
ANALYZE orders (customer_id);

-- Check the global default
SHOW default_statistics_target;  -- 100

For most columns, 100 is sufficient. The planner does not need a fine-grained histogram for a boolean column with two values or a status column with five. But for columns with highly skewed distributions or many distinct values — tenant ID columns in multi-tenant databases, user ID columns in activity tables, or any column where the most common values account for a large fraction of the table — increasing the target to 500 or 1000 gives the planner significantly better estimates.

The cost of a higher target is modest: ANALYZE samples more rows and stores a larger statistics array in pg_statistic. Planning time increases slightly because the planner searches a larger MCV list and interpolates across more histogram buckets. In practice, neither cost is noticeable. The benefit — better row estimates and better plans — is usually substantial. It is one of those rare tuning adjustments where you give up almost nothing and gain quite a lot.
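One way to see what the target buys you: each histogram bucket covers roughly 1/target of the non-MCV, non-null rows, so the smallest range the planner can resolve shrinks as the target grows. A sketch of that relationship:

```python
# Approximate rows per histogram bucket for a given statistics target.
# mcv_frac and null_frac are the fractions already accounted for by
# the MCV list and nulls; the histogram only covers what remains.
def bucket_resolution(reltuples, target, mcv_frac=0.0, null_frac=0.0):
    non_mcv_rows = reltuples * (1.0 - mcv_frac - null_frac)
    return non_mcv_rows / target

print(bucket_resolution(10_000_000, 100))    # 100000.0 rows per bucket
print(bucket_resolution(10_000_000, 1000))   # 10000.0 rows per bucket
```

At the default target, a range predicate on a ten-million-row table is estimated in strides of roughly a hundred thousand rows; at 1000, in strides of ten thousand.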

Extended statistics

Per-column statistics assume that conditions on different columns are independent. If a query filters on WHERE country = 'US' AND state = 'California', the planner multiplies the selectivity of each predicate separately. It does not know that knowing country = 'US' makes state = 'California' far more likely than state = 'Bavaria'. This independence assumption can lead to severe underestimates — the planner expects a handful of rows and receives thousands.
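A back-of-the-envelope illustration of the independence assumption, with made-up selectivities:

```python
# Made-up selectivities: 30% of rows are country = 'US', and 5% of
# rows are state = 'California' (all of which are also 'US' rows).
reltuples = 1_000_000
sel_country_us = 0.30
sel_state_ca   = 0.05

# Independence assumption: multiply the per-column selectivities.
independent_estimate = reltuples * sel_country_us * sel_state_ca
print(int(independent_estimate))   # 15000

# In reality every 'California' row is a 'US' row, so the true joint
# selectivity is just P(state = 'California'):
actual_rows = reltuples * sel_state_ca
print(int(actual_rows))            # 50000
```

The planner would plan for fifteen thousand rows and receive fifty thousand, and every operation downstream of that estimate inherits the error.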

Extended statistics, introduced in PostgreSQL 10, address this with admirable precision. You explicitly tell PostgreSQL which columns are correlated, and it tracks their joint distribution.

SQL
-- Create extended statistics for correlated columns (PostgreSQL 10+)
CREATE STATISTICS orders_country_state (dependencies)
  ON country, state FROM orders;

-- PostgreSQL 12+ also supports MCV (most common value) lists
CREATE STATISTICS orders_country_state_mcv (dependencies, mcv)
  ON country, state FROM orders;

-- After creating, run ANALYZE to populate them
ANALYZE orders;

-- View extended statistics
SELECT
  stxname,
  stxkeys,
  stxkind
FROM pg_statistic_ext
WHERE stxrelid = 'orders'::regclass;

Types of extended statistics

  • Functional dependencies (PostgreSQL 10+) — captures the degree to which one column determines another. Tells the planner that filtering on country already constrains state, so it should not multiply their selectivities independently
  • MCV lists (PostgreSQL 12+) — tracks the most common combinations of values across columns. More precise than functional dependencies for skewed multi-column distributions
  • N-distinct (PostgreSQL 10+) — tracks the number of distinct value combinations across a column group. Useful for GROUP BY estimates on multiple columns
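For the n-distinct case, the failure mode is easy to see with made-up counts: without multi-column statistics, the planner assumes distinct counts multiply, which can badly overestimate the number of GROUP BY groups:

```python
# Illustrative counts for GROUP BY country, state.
n_distinct_country = 50
n_distinct_state   = 600

# Without ndistinct statistics: assume the counts multiply
# (the planner also caps this at the table's row count).
independent_groups = n_distinct_country * n_distinct_state
print(independent_groups)   # 30000

# With ndistinct statistics: each state pairs with exactly one
# country in this data, so the true group count is just the states.
actual_groups = n_distinct_state
print(actual_groups)        # 600
```

An overestimated group count can push the planner toward a sort-based aggregate or an oversized hash table when a modest one would do.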

Extended statistics are not collected automatically — you must create them explicitly for the column combinations that matter, and then run ANALYZE to populate them. The planner uses them transparently, requiring no changes to your queries. You provide the insight; PostgreSQL does the rest. A well-mannered arrangement.

How Gold Lapel relates

Gold Lapel's optimization engine depends on the planner making good decisions — and the planner depends on good statistics. When the intelligence is current, the planner chooses well, queries are fast, and Gold Lapel's role is straightforward: monitor, confirm, stay out of the way. When statistics are stale, the planner is working from an outdated picture of the household, and no amount of proxy-level optimization can correct a fundamentally wrong execution strategy inside PostgreSQL.

This is why Gold Lapel monitors the gap between estimated and actual row counts across your workload. When it detects persistent estimation errors — the hallmark of stale or insufficient statistics — it surfaces them so you can take the appropriate action: run ANALYZE, increase a statistics target, or create extended statistics for correlated columns. Fresh statistics are the foundation that every other optimization is built on, including our own. We would be poor stewards of your database if we did not say so plainly.

Frequently asked questions