
HypoPG + pg_qualstats: Automatic Index Recommendations for PostgreSQL

Evidence, hypothesis, test, confirmation. Proper methodology for indexing decisions.

March 27, 2026 · 22 min read
We asked the illustrator which colours the piece needed. He said he would prefer it if the piece could analyse its own predicate history and recommend its own colours. We are not entirely sure he is wrong.

Why your database already knows which indexes it needs

I should tell you something encouraging: your database has been keeping notes. PostgreSQL tracks every predicate it evaluates during query execution — every WHERE clause, every JOIN condition, every filter — recorded and available for analysis. Most teams never query this data, which means the database is accumulating a detailed map of its own performance gaps and no one has taken a moment to read it.

The manual approach to indexing is essentially guesswork. A developer notices a slow query, adds an index on the columns in the WHERE clause, and hopes for the best. This leads to two predictable outcomes: missing indexes on the columns that actually matter, and unnecessary indexes on columns that do not. Both are costly — the first degrades read performance, the second degrades write performance.

Allow me to walk you through a systematic pipeline that replaces guesswork with data:

  1. Collect — pg_qualstats records which predicates your workload evaluates most frequently
  2. Identify — Query the collected data to find high-frequency predicates with no supporting index
  3. Test — HypoPG creates hypothetical indexes that exist only in the planner's cost model, with zero storage cost
  4. Validate — Run EXPLAIN against the hypothetical index to confirm the planner would use it
  5. Create — Build the real index with CREATE INDEX CONCURRENTLY and verify the improvement

Each step builds on the previous one. The result is an index recommendation workflow grounded in actual workload data rather than intuition. For deeper coverage of each extension individually, see the pg_qualstats how-to and HypoPG how-to. This article connects them into a single end-to-end process — from evidence to action.

Prerequisites — installing the extensions

A moment of preparation before we begin. Both pg_qualstats and HypoPG require configuration before use. pg_qualstats must be loaded into shared memory at server startup, while HypoPG can be installed without a restart.

shared_preload_libraries configuration

pg_qualstats depends on pg_stat_statements, so both must be listed in shared_preload_libraries in postgresql.conf:

postgresql.conf
# postgresql.conf
shared_preload_libraries = 'pg_stat_statements, pg_qualstats'

This change requires a PostgreSQL restart. After restarting, create the extensions:

SQL
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
CREATE EXTENSION IF NOT EXISTS pg_qualstats;
CREATE EXTENSION IF NOT EXISTS hypopg;

HypoPG does not require shared_preload_libraries — it loads on demand when the extension is created.
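
To confirm all three extensions are installed and check their versions, query the standard pg_extension catalog:

SQL — verify installed extensions
SELECT extname, extversion
FROM pg_extension
WHERE extname IN ('pg_stat_statements', 'pg_qualstats', 'hypopg');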

Version compatibility

Both extensions support PostgreSQL 13 and later. pg_qualstats 2.1.0+ is recommended for PostgreSQL 16 compatibility, and HypoPG 1.4.0+ supports the latest planner features including INCLUDE columns.

Managed provider availability

pg_qualstats requires shared_preload_libraries access, which limits availability on some managed services. HypoPG is more widely available but still not universal.

Provider                pg_qualstats    HypoPG
Self-managed            Yes             Yes
Supabase                Yes             Yes
Amazon RDS              No              Yes
Google Cloud SQL        No              No
Azure Flexible Server   No              Yes
Neon                    No              No

For providers that do not support pg_qualstats, the pg_stat_user_tables approach to missing index detection offers an alternative starting point — it requires no extensions at all, though it provides less granular data.
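
As a rough sketch of that extension-free alternative, the following query flags tables with heavy sequential-scan activity. It identifies the table but not the specific columns, which is precisely the granularity gap pg_qualstats fills. The thresholds here are illustrative — tune them to your workload:

SQL — extension-free starting point via pg_stat_user_tables
SELECT relname,
       seq_scan,                      -- sequential scans since last stats reset
       seq_tup_read,                  -- rows read by those sequential scans
       COALESCE(idx_scan, 0) AS idx_scan  -- index scans, for comparison
FROM pg_stat_user_tables
WHERE seq_scan > COALESCE(idx_scan, 0)
  AND seq_tup_read > 100000          -- arbitrary threshold; adjust to taste
ORDER BY seq_tup_read DESC;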

Step 1 — Collecting predicate statistics with pg_qualstats

pg_qualstats hooks into the query executor and records every WHERE clause and JOIN condition evaluated at runtime. Where pg_stat_user_tables tells you which tables are being scanned sequentially, pg_qualstats goes further — it tells you which specific columns and operators are involved.

Configuration

Two settings control the extension's behavior:

postgresql.conf
# postgresql.conf

# Whether to distinguish between WHERE id = 1 and WHERE id = 2
# "on" tracks actual constant values, "off" groups by column only
pg_qualstats.track_constants = on

# Fraction of queries to sample (1.0 = all, 0.1 = 10%)
# Lower values reduce overhead on busy systems
pg_qualstats.sample_rate = 1.0

On production systems with high query throughput (50,000+ queries/second), reducing sample_rate to 0.1 or 0.2 keeps overhead negligible while still capturing statistically representative data. On development and staging databases, always use 1.0 to capture everything.
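
If the parameter is reloadable in your installation (it does not require a restart in recent pg_qualstats releases — verify the context column in pg_settings before relying on this), you can lower the sample rate without downtime:

SQL — adjusting sample_rate without a restart (verify reloadability first)
ALTER SYSTEM SET pg_qualstats.sample_rate = 0.1;
SELECT pg_reload_conf();

-- Confirm the new value and whether a restart would have been needed:
SELECT name, setting, context
FROM pg_settings
WHERE name = 'pg_qualstats.sample_rate';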

Let it run

Patience, if you will. pg_qualstats needs time to accumulate representative data. The extension must observe a complete workload cycle — peak hours, off-peak hours, batch jobs, periodic reports — before its data is trustworthy. I would recommend a minimum of one week of data collection before acting on its output.

Viewing collected predicates

Once the data has had time to settle, query the most frequently evaluated predicates:

SQL
SELECT
  qs.relid::regclass AS table_name,
  a.attname AS column_name,
  qs.opno::regoper AS operator,
  qs.occurences AS occurrences,
  qs.execution_count,
  qs.mean_err_estimate_ratio,
  qs.queryid
FROM pg_qualstats_all qs
JOIN pg_attribute a
  ON a.attrelid = qs.relid AND a.attnum = qs.attnum
WHERE qs.relid IS NOT NULL
ORDER BY qs.occurences DESC
LIMIT 20;

-- table_name | column_name | operator | occurrences | execution_count | mean_err_estimate_ratio | queryid
-- -----------+-------------+----------+-------------+-----------------+-------------------------+---------
-- orders     | customer_id | =        |       52340 |           52340 |                    12.4 | 8827361
-- orders     | status      | =        |       48100 |           48100 |                     1.1 | 9912045
-- line_items | order_id    | =        |       41200 |           41200 |                     8.7 | 7734521
-- orders     | created_at  | >=       |       22800 |           22800 |                     3.2 | 6621893
-- users      | email       | =        |       18400 |           18400 |                     1.0 | 5519274

Key columns in the output:

  • occurrences — how many times this predicate was evaluated. Higher values indicate more frequently used filters.
  • execution_count — the number of distinct query executions that included this predicate. Often matches occurrences for simple queries; differs when a predicate appears multiple times within a single execution.
  • mean_err_estimate_ratio — how far off the planner's row estimates were, on average. A value of 1.0 means perfect estimates. A value of 12.4 means the planner's estimates were off by a factor of 12 — a strong signal that statistics or indexes are missing.
  • queryid — links to pg_stat_statements for correlating predicates with specific query texts.
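
The queryid makes the correlation with pg_stat_statements a one-line join. Assuming pg_stat_statements 1.8+ column naming (mean_exec_time rather than the older mean_time), the query text behind a predicate from the sample output above looks like this:

SQL — resolving a queryid to its query text
SELECT ss.query,
       ss.calls,
       ss.mean_exec_time
FROM pg_stat_statements ss
WHERE ss.queryid = 8827361;  -- queryid from the pg_qualstats output above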

Performance overhead

pg_qualstats typically adds 1-3% overhead with default settings. The overhead comes from recording predicate information during query execution. The sample_rate parameter provides direct control — reducing it to 0.1 (10% of queries) on busy systems keeps the overhead well under 1%.

Finding unindexed predicates

Now we arrive at the most valuable query in this entire article. It combines pg_qualstats data with the system catalogs to identify high-frequency predicates that lack index support:

SQL — unindexed predicates ranked by frequency
SELECT
  qs.relid::regclass AS table_name,
  a.attname AS column_name,
  qs.opno::regoper AS operator,
  qs.occurences AS occurrences,
  qs.execution_count,
  qs.mean_err_estimate_ratio AS err_ratio,
  pg_size_pretty(pg_relation_size(qs.relid)) AS table_size
FROM pg_qualstats_all qs
JOIN pg_attribute a
  ON a.attrelid = qs.relid AND a.attnum = qs.attnum
WHERE qs.relid IS NOT NULL
  AND NOT EXISTS (
    SELECT 1 FROM pg_index pi
    JOIN pg_attribute pa
      ON pa.attrelid = pi.indrelid AND pa.attnum = ANY(pi.indkey)
    WHERE pi.indrelid = qs.relid
      AND pa.attnum = qs.attnum
  )
ORDER BY qs.occurences DESC
LIMIT 15;

-- table_name | column_name | operator | occurrences | execution_count | err_ratio | table_size
-- -----------+-------------+----------+-------------+-----------------+-----------+-----------
-- orders     | customer_id | =        |       52340 |           52340 |      12.4 | 98 MB
-- line_items | order_id    | =        |       41200 |           41200 |       8.7 | 210 MB
-- orders     | created_at  | >=       |       22800 |           22800 |       3.2 | 98 MB
-- audit_log  | entity_type | =        |       14600 |           14600 |       5.1 | 340 MB
-- audit_log  | created_at  | >=       |       14600 |           14600 |       4.8 | 340 MB

Every row in this result is a predicate evaluated thousands of times with no supporting index. The err_ratio column reinforces the finding — high values indicate the planner is working with poor estimates, which typically improve after index creation.

When pg_qualstats shows multiple predicates on the same table with similar occurrence counts (like audit_log.entity_type and audit_log.created_at above), those predicates are likely used together in the same queries. This is a candidate for a composite index rather than two separate single-column indexes.

Using pg_qualstats_indexes() for automatic suggestions

pg_qualstats includes a built-in function that analyzes collected data and generates CREATE INDEX statements:

SQL
SELECT * FROM pg_qualstats_indexes();

--        relid        |          attnames           |        possible_types        | execution_count
-- --------------------+-----------------------------+------------------------------+-----------------
-- public.orders       | {customer_id}               | {btree}                      |           52340
-- public.line_items   | {order_id}                  | {btree}                      |           41200
-- public.orders       | {created_at}                | {btree}                      |           22800
-- public.audit_log    | {entity_type,created_at}    | {btree}                      |           14600

The function groups predicates by table, identifies the most impactful column combinations, and suggests the appropriate index type. The audit_log suggestion shows that pg_qualstats recognized the two columns as co-occurring predicates and recommended a composite index. Rather observant of it.

These suggestions are starting points, not final answers. They do not account for write amplification costs, existing indexes that partially cover the same predicates, or partial index opportunities. I should be forthcoming about that — trusting these suggestions without testing would be premature. The next step, hypothetical testing, validates whether the planner would actually use each suggested index.

Step 2 — Hypothetical testing with HypoPG

Creating a real index carries costs: disk space for the index data, write amplification on every INSERT, UPDATE, and DELETE, and (for large tables) significant time for the initial build. Creating an index only to discover the planner does not use it wastes all of those resources.

HypoPG offers a rather elegant solution. It creates indexes that exist only in the planner's cost model — no physical storage, no write overhead, no impact on running queries. The planner evaluates them exactly like real indexes when computing query plans, but they disappear when the session ends.

Creating a hypothetical index

SQL
SELECT * FROM hypopg_create_index('CREATE INDEX ON orders(customer_id)');

-- indexrelid |           indexname
-- -----------+-------------------------------
--     205841 | <205841>btree_orders_customer_id

The hypothetical index receives an OID and a name. The planner now "sees" this index when evaluating queries in the current session. No data is written to disk.

Testing a single index candidate

The workflow: create the hypothetical index, then run EXPLAIN (not EXPLAIN ANALYZE) on the query that was producing sequential scans:

SQL — testing a hypothetical index
-- Create the hypothetical index
SELECT * FROM hypopg_create_index('CREATE INDEX ON orders(customer_id)');

-- Test if the planner would use it
EXPLAIN SELECT * FROM orders WHERE customer_id = 12345;

-- Output with the hypothetical index:
--                                          QUERY PLAN
-- --------------------------------------------------------------------------------------------
-- Index Scan using <205841>btree_orders_customer_id on orders  (cost=0.04..8.06 rows=1 width=48)
--   Index Cond: (customer_id = 12345)

-- Compare to the plan without the index:
--                             QUERY PLAN
-- -------------------------------------------------------------------
-- Seq Scan on orders  (cost=0.00..15234.00 rows=1 width=48)
--   Filter: (customer_id = 12345)

The cost dropped from 15,234 to 8.06 — a reduction of over 99%. The planner switched from a sequential scan to an index scan using the hypothetical index. That is rather strong evidence that the real index would be worth creating.

Why EXPLAIN, not EXPLAIN ANALYZE: Hypothetical indexes have no physical data. EXPLAIN ANALYZE executes the query and reads actual pages — but a hypothetical index has no pages to read. Use plain EXPLAIN to see the planner's cost estimates and plan choice. For a deeper guide on reading execution plans, see the EXPLAIN ANALYZE guide.

When the planner ignores the hypothetical index: The planner may choose not to use a hypothetical index for several reasons:

  • The table is small enough that a sequential scan is cheaper than an index lookup
  • The column has very low selectivity (few distinct values relative to row count)
  • An existing real index already covers the predicate adequately
  • The query returns a large fraction of the table, making a sequential scan more efficient
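
The selectivity hypothesis in particular is easy to check directly in pg_stats before drawing conclusions — n_distinct and the most-common-value frequencies explain most "the planner ignored my index" cases:

SQL — checking column selectivity in pg_stats
SELECT attname,
       n_distinct,                -- negative values are a fraction of row count
       null_frac,
       most_common_freqs[1] AS top_value_frequency
FROM pg_stats
WHERE tablename = 'orders'
  AND attname IN ('customer_id', 'status');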

Testing composite and partial indexes

pg_qualstats identified audit_log.entity_type and audit_log.created_at as co-occurring predicates. Test a composite index:

SQL — composite index with entity_type first
-- Test composite index with entity_type first
SELECT * FROM hypopg_create_index(
  'CREATE INDEX ON audit_log(entity_type, created_at)'
);

EXPLAIN SELECT * FROM audit_log
WHERE entity_type = 'order' AND created_at >= '2026-01-01';

--                                                QUERY PLAN
-- ---------------------------------------------------------------------------------------------------------
-- Index Scan using <205842>btree_audit_log_entity_type_created_at on audit_log  (cost=0.04..125.30 rows=840 width=64)
--   Index Cond: ((entity_type = 'order') AND (created_at >= '2026-01-01'))

Now test the reverse column order:

SQL — reverse column order
SELECT * FROM hypopg_create_index(
  'CREATE INDEX ON audit_log(created_at, entity_type)'
);

EXPLAIN SELECT * FROM audit_log
WHERE entity_type = 'order' AND created_at >= '2026-01-01';

--                                                QUERY PLAN
-- ---------------------------------------------------------------------------------------------------------
-- Index Scan using <205843>btree_audit_log_created_at_entity_type on audit_log  (cost=0.04..312.80 rows=840 width=64)
--   Index Cond: ((created_at >= '2026-01-01') AND (entity_type = 'order'))

The (entity_type, created_at) order produces a lower cost (125.30 vs 312.80). Column order matters in a composite index. The equality predicate (entity_type = 'order') narrows the scan range more effectively when it appears first in a B-tree, with the range predicate (created_at >= ...) refining within that narrowed range.

Partial indexes can be tested when a filter applies to a known subset:

SQL — partial index
SELECT * FROM hypopg_create_index(
  'CREATE INDEX ON orders(created_at) WHERE status = ''pending'''
);

Expression indexes work the same way:

SQL — expression index
SELECT * FROM hypopg_create_index(
  'CREATE INDEX ON users(lower(email))'
);

When multiple hypothetical indexes exist simultaneously, the planner considers all of them. This allows testing scenarios where several new indexes interact — the planner may choose one over another depending on the query.
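
A short sketch of that interaction — create two competing hypothetical indexes and let EXPLAIN reveal which one the planner prefers for a given query (the column choices here are illustrative):

SQL — competing hypothetical indexes
SELECT hypopg_reset();  -- start from a clean slate

SELECT * FROM hypopg_create_index('CREATE INDEX ON orders(status)');
SELECT * FROM hypopg_create_index('CREATE INDEX ON orders(status, created_at)');

-- The plan shows whichever hypothetical index costs less for this query:
EXPLAIN SELECT * FROM orders
WHERE status = 'shipped' AND created_at >= '2026-03-01';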

For more on choosing between partial and full indexes, see the partial indexes guide. For a broader overview of index types, see PostgreSQL index types.

Cleaning up hypothetical indexes

SQL — listing and cleaning up
SELECT * FROM hypopg_list_indexes();

-- indexrelid |                  indexname                   | nspname |  relname  | amname
-- -----------+---------------------------------------------+---------+-----------+--------
--     205841 | <205841>btree_orders_customer_id            | public  | orders    | btree
--     205842 | <205842>btree_audit_log_entity_type_created | public  | audit_log | btree
--     205843 | <205843>btree_audit_log_created_at_entity   | public  | audit_log | btree

-- Drop all hypothetical indexes:
SELECT hypopg_drop_index(indexrelid) FROM hypopg_list_indexes();

-- Or clear everything at once:
SELECT hypopg_reset();

Hypothetical indexes are session-scoped — they do not survive a disconnect. No risk of leftover hypothetical indexes affecting other sessions or persisting after a restart. They clean up after themselves, which is more than one can say for most temporary solutions.
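
HypoPG can also estimate how much disk a hypothetical index would consume via hypopg_relation_size(), which is useful for weighing storage cost before committing to the real build:

SQL — estimated size of hypothetical indexes
SELECT indexname,
       pg_size_pretty(hypopg_relation_size(indexrelid)) AS estimated_size
FROM hypopg_list_indexes();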

Step 3 — Validate and create the real index

The hypothetical index confirmed that the planner would use it and the estimated cost dropped significantly. Now we create the real thing.

Creating the index

Always use CONCURRENTLY in production to avoid holding an exclusive lock on the table during the build:

SQL
CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders(customer_id);

CREATE INDEX CONCURRENTLY builds the index without blocking concurrent INSERT, UPDATE, or DELETE operations. It takes longer than a standard CREATE INDEX because it requires two passes over the table, but it avoids the table lock that would otherwise block all writes.
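
One caveat: if a concurrent build fails partway (deadlock, unique violation, cancellation), it leaves behind an INVALID index that still incurs write overhead without serving reads. Check for and remove such leftovers:

SQL — finding invalid indexes from failed concurrent builds
SELECT indexrelid::regclass AS index_name
FROM pg_index
WHERE NOT indisvalid;

-- Drop the leftover and retry the build:
-- DROP INDEX CONCURRENTLY idx_orders_customer_id;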

Updating statistics

After creating the index, run ANALYZE so the planner has accurate statistics for the new index:

SQL
ANALYZE orders;

Without this step, the planner may not favor the new index immediately — its cost estimates depend on up-to-date column statistics, and stale statistics can make the old plan look cheaper than it actually is.

Before/after comparison

Run EXPLAIN ANALYZE on the original problem query with the real index in place:

Before — sequential scan, 341ms
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = 12345;

-- Before (no index):
--                                                     QUERY PLAN
-- -----------------------------------------------------------------------------------------------------------------
-- Seq Scan on orders  (cost=0.00..15234.00 rows=1 width=48) (actual time=215.402..341.887 rows=8 loops=1)
--   Filter: (customer_id = 12345)
--   Rows Removed by Filter: 149992
--   Buffers: shared hit=2048 read=10532
-- Planning Time: 0.082 ms
-- Execution Time: 341.923 ms
After — index scan, 0.07ms
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = 12345;

-- After (with index):
--                                                            QUERY PLAN
-- ---------------------------------------------------------------------------------------------------------------------------------
-- Index Scan using idx_orders_customer_id on orders  (cost=0.42..8.44 rows=8 width=48) (actual time=0.031..0.048 rows=8 loops=1)
--   Index Cond: (customer_id = 12345)
--   Buffers: shared hit=4
-- Planning Time: 0.095 ms
-- Execution Time: 0.072 ms

If you'll permit a brief moment of satisfaction:

  • Execution time: 341ms to 0.07ms — a roughly 4,700x improvement
  • Buffer reads: 12,580 pages (shared hit + read) down to 4 pages
  • Scan type: Sequential scan eliminated, replaced by index scan
  • Rows scanned: from a full pass over 150,000 rows down to a direct index lookup returning 8 rows

The complete workflow — end-to-end example

If you'll follow me, I should like to walk through the full pipeline using a realistic e-commerce schema. The orders table has 150,000 rows:

SQL — table schema
CREATE TABLE orders (
  id          BIGSERIAL PRIMARY KEY,
  customer_id BIGINT NOT NULL,
  status      TEXT NOT NULL DEFAULT 'pending',
  total       NUMERIC(10,2) NOT NULL,
  created_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

No indexes exist besides the primary key on id.

Step 1: pg_qualstats identifies the gap

After one week of data collection:

SQL — identifying unindexed predicates
SELECT
  qs.relid::regclass AS table_name,
  a.attname AS column_name,
  qs.opno::regoper AS operator,
  qs.occurences AS occurrences,
  CASE
    WHEN EXISTS (
      SELECT 1 FROM pg_index pi
      JOIN pg_attribute pa ON pa.attrelid = pi.indrelid AND pa.attnum = ANY(pi.indkey)
      WHERE pi.indrelid = qs.relid AND pa.attnum = qs.attnum
    ) THEN 'indexed'
    ELSE 'NOT INDEXED'
  END AS index_status
FROM pg_qualstats_all qs
JOIN pg_attribute a ON a.attrelid = qs.relid AND a.attnum = qs.attnum
WHERE qs.relid = 'orders'::regclass
ORDER BY qs.occurences DESC;

-- table_name | column_name | operator | occurrences | index_status
-- -----------+-------------+----------+-------------+--------------
-- orders     | customer_id | =        |       52340 | NOT INDEXED
-- orders     | status      | =        |       48100 | NOT INDEXED
-- orders     | created_at  | >=       |       22800 | NOT INDEXED
-- orders     | id          | =        |       18900 | indexed

customer_id is filtered 52,340 times with no index.

Step 2: pg_qualstats_indexes() suggests indexes

SQL — automatic suggestions
SELECT * FROM pg_qualstats_indexes() WHERE relid = 'orders'::regclass;

--   relid    |    attnames     | possible_types | execution_count
-- -----------+-----------------+----------------+-----------------
-- orders     | {customer_id}   | {btree}        |           52340
-- orders     | {status}        | {btree}        |           48100
-- orders     | {created_at}    | {btree}        |           22800

Step 3: HypoPG tests the top candidate

SQL — hypothetical test
SELECT * FROM hypopg_create_index('CREATE INDEX ON orders(customer_id)');

EXPLAIN SELECT * FROM orders WHERE customer_id = 12345;

--                                          QUERY PLAN
-- --------------------------------------------------------------------------------------------
-- Index Scan using <205841>btree_orders_customer_id on orders  (cost=0.04..8.06 rows=8 width=48)
--   Index Cond: (customer_id = 12345)

Cost dropped from 15,234 to 8.06. The planner would use this index.

Step 4: Test a composite index for the common query pattern

The application frequently filters by customer and status and sorts by recency. Test a composite index covering all three columns:

SQL — composite index test
SELECT hypopg_reset();  -- clear previous hypothetical indexes

SELECT * FROM hypopg_create_index(
  'CREATE INDEX ON orders(customer_id, status, created_at DESC)'
);

EXPLAIN SELECT * FROM orders
WHERE customer_id = 12345 AND status = 'pending'
ORDER BY created_at DESC LIMIT 10;

--                                                         QUERY PLAN
-- --------------------------------------------------------------------------------------------------------------------------
-- Limit  (cost=0.04..4.12 rows=10 width=48)
--   ->  Index Scan Backward using <205844>btree_orders_customer_id_status_created_at on orders  (cost=0.04..12.30 rows=30 width=48)
--         Index Cond: ((customer_id = 12345) AND (status = 'pending'))

The composite index satisfies the WHERE clause and the ORDER BY in a single scan — no sort step required. Compare to the single-column index:

SQL — single-column comparison
SELECT hypopg_reset();

SELECT * FROM hypopg_create_index('CREATE INDEX ON orders(customer_id)');

EXPLAIN SELECT * FROM orders
WHERE customer_id = 12345 AND status = 'pending'
ORDER BY created_at DESC LIMIT 10;

--                                                QUERY PLAN
-- ---------------------------------------------------------------------------------------------------------
-- Limit  (cost=8.08..8.10 rows=10 width=48)
--   ->  Sort  (cost=8.08..8.15 rows=30 width=48)
--         Sort Key: created_at DESC
--         ->  Index Scan using <205845>btree_orders_customer_id on orders  (cost=0.04..8.06 rows=30 width=48)
--               Index Cond: (customer_id = 12345)
--               Filter: (status = 'pending')

The single-column index requires a separate sort and an in-memory filter on status. The composite index eliminates both. For a query running thousands of times per day, that difference adds up considerably.

Step 5: Create the real index and verify

SQL — create and analyze
CREATE INDEX CONCURRENTLY idx_orders_customer_status_created
  ON orders(customer_id, status, created_at DESC);

ANALYZE orders;

Final verification with EXPLAIN ANALYZE:

SQL — final verification, 0.06ms
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders
WHERE customer_id = 12345 AND status = 'pending'
ORDER BY created_at DESC LIMIT 10;

--                                                                      QUERY PLAN
-- -----------------------------------------------------------------------------------------------------------------------------------------------------
-- Limit  (cost=0.42..4.18 rows=10 width=48) (actual time=0.028..0.041 rows=3 loops=1)
--   ->  Index Scan Backward using idx_orders_customer_status_created on orders  (cost=0.42..12.30 rows=30 width=48) (actual time=0.026..0.038 rows=3 loops=1)
--         Index Cond: ((customer_id = 12345) AND (status = 'pending'))
--         Buffers: shared hit=4
-- Planning Time: 0.112 ms
-- Execution Time: 0.062 ms

Index scan, no sort, 0.06ms execution time. The full pipeline — from pg_qualstats data collection through HypoPG testing to validated index creation — is complete. Evidence, hypothesis, test, confirmation. Proper methodology.

Common mistakes to avoid

I should note a few pitfalls I have observed with some frequency.

Creating indexes on low-cardinality columns. A column like status with three distinct values (pending, shipped, delivered) across 150,000 rows has low selectivity. An index on this column alone rarely helps — the planner will often choose a sequential scan because the index would return too many rows. Consider a partial index instead: CREATE INDEX ON orders(created_at) WHERE status = 'pending' targets the specific value that matters.

Ignoring write amplification. Every index on a table adds overhead to INSERT, UPDATE, and DELETE operations. Each row change must update every index that covers the modified columns. A table with 10 indexes and heavy write traffic may spend more time maintaining indexes than serving reads.
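
To measure whether an existing index earns its write cost, pg_stat_user_indexes reports how often each index has actually been scanned since the last stats reset. Indexes with zero scans are pure overhead:

SQL — finding never-used indexes
SELECT schemaname, relname, indexrelname,
       idx_scan,                                       -- 0 means never used
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;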

Over-indexing. If a table has more indexes than columns, the household may have more staff than the duties require. Overlapping indexes (e.g., an index on (customer_id) and another on (customer_id, status)) waste space because the composite index already covers single-column lookups on customer_id.
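
A quick way to spot that overlap pattern is to find indexes whose column list is a leading prefix of another index on the same table. This is a sketch: it compares indkey prefixes textually and deliberately skips partial indexes, whose predicates change the comparison:

SQL — detecting prefix-redundant indexes (a sketch)
SELECT a.indexrelid::regclass AS redundant_index,
       b.indexrelid::regclass AS covering_index
FROM pg_index a
JOIN pg_index b
  ON a.indrelid = b.indrelid
 AND a.indexrelid <> b.indexrelid
 -- a's column list is a leading prefix of b's (indkey::text is space-separated)
 AND (b.indkey::text || ' ') LIKE (a.indkey::text || ' %')
WHERE a.indpred IS NULL
  AND b.indpred IS NULL;

Treat matches as candidates for review, not automatic drops — the "redundant" index may still be preferable for index-only scans or unique constraints.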

Not running ANALYZE after index creation. The planner relies on table statistics to choose between sequential scans and index scans. Without fresh statistics, the planner may not recognize the new index as the optimal path.

Trusting pg_qualstats_indexes() without HypoPG validation. The automatic suggestions are based on predicate frequency, not planner behavior. An index that looks beneficial from predicate data may not be used by the planner due to low selectivity, small table size, or competing indexes. Always test with HypoPG before creating.

Setting pg_qualstats sample_rate too low on development databases. In development and staging environments, set sample_rate = 1.0 to capture every query. The overhead is irrelevant on non-production systems, and incomplete data leads to incomplete recommendations.

Forgetting that pg_qualstats data resets on server restart. pg_qualstats stores its data in shared memory. A PostgreSQL restart clears all accumulated data. If you need to preserve data across restarts, I would recommend exporting it to a table before planned maintenance windows.
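
A minimal snapshot approach before a planned restart — the table name here is illustrative, and the snapshot selects from the pg_qualstats view so each capture is timestamped:

SQL — preserving pg_qualstats data across a restart
-- One-time setup: an empty table matching the view's shape plus a timestamp
CREATE TABLE IF NOT EXISTS qualstats_snapshot AS
SELECT now() AS captured_at, *
FROM pg_qualstats
WITH NO DATA;

-- Before each planned restart, capture the current state:
INSERT INTO qualstats_snapshot
SELECT now(), * FROM pg_qualstats;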

What Gold Lapel automates

The workflow you have just walked through is exactly what Gold Lapel runs continuously as a Postgres proxy. The proxy observes every query passing through it, identifies predicate patterns, evaluates hypothetical indexes, and creates validated indexes — without manual intervention.

What the proxy sees that manual tooling does not:

  • Real-time workload patterns. The proxy does not need a week of pg_qualstats data — it observes query patterns as they arrive and responds to workload shifts within minutes.
  • Workload-aware timing. Index creation is scheduled during low-traffic periods, avoiding the performance impact of concurrent builds during peak hours.
  • Automatic validation. After creating an index, the proxy monitors whether the planner actually uses it. Unused indexes are flagged for removal.

The manual workflow taught in this article is valuable regardless of tooling. Teams that understand the pg_qualstats-to-HypoPG pipeline make better indexing decisions, debug index problems faster, and — should they adopt Gold Lapel — understand precisely what it is doing on their behalf.
