
auto_explain for Production: Safe Setup and Interpretation


March 27, 2026 · 18 min read
The illustration exceeded the configured duration threshold. Its full execution plan has been automatically logged to the file. The artist, it appears, nested three unnecessary loops.

Why auto_explain matters in production

The fundamental problem: you cannot EXPLAIN ANALYZE a query you cannot reproduce. Production workloads exhibit timing-dependent plan changes, parameter-dependent behavior, lock contention, and data distributions that development environments never replicate. A query that runs in 5ms on your local machine can take 30 seconds in production due to table statistics, cache state, concurrent load, or simply different data volumes.

auto_explain solves this by capturing execution plans automatically when queries exceed a configurable duration threshold. No code changes. No manual intervention. When a slow query happens, the plan that PostgreSQL chose is already waiting in your logs.

The common concern is that logging EXPLAIN output slows down the database. I should address that honestly: with the default configuration the overhead is negligible, but some features (log_analyze in particular) instrument every statement and can be expensive. The key is knowing which settings to turn on, which to leave off, and how to lower the threshold gradually as you fix the queries that surface.

This guide covers the production setup, safety considerations, and operational details. For a guide on interpreting auto_explain output to optimize specific query patterns, see the auto_explain optimization guide.

Installation — two approaches

Server-wide via shared_preload_libraries (recommended)

For production, I would recommend loading auto_explain at server startup:

postgresql.conf
# postgresql.conf
shared_preload_libraries = 'auto_explain'

This requires a PostgreSQL restart. After restart, auto_explain is available for all sessions and configurable via postgresql.conf or ALTER SYSTEM.

If you already load other extensions at startup, combine them:

postgresql.conf
shared_preload_libraries = 'pg_stat_statements, auto_explain'

Verify that auto_explain loaded successfully:

SQL
SHOW shared_preload_libraries;
-- Result: pg_stat_statements, auto_explain

Once loaded via shared_preload_libraries, all configuration changes can be applied without a restart — using ALTER SYSTEM followed by SELECT pg_reload_conf();.

Per-session via LOAD (for testing)

For testing auto_explain before committing to a restart, load it into a single session:

SQL
LOAD 'auto_explain';
SET auto_explain.log_min_duration = 1000;  -- 1 second
SET auto_explain.log_analyze = off;

This takes effect immediately, requires no restart, and is scoped to the current session only. Settings must be configured with SET for each session — they are not persisted.

Limitations: per-session loading only captures queries in the session that loaded it. It is not suitable for production monitoring, but it is valuable for testing configuration settings before applying them server-wide.

The critical configuration settings

If you'll permit me, I should like to walk through each setting and its implications.

log_min_duration — the threshold

SQL
-- Set via ALTER SYSTEM (persisted, requires reload)
ALTER SYSTEM SET auto_explain.log_min_duration = 5000;
SELECT pg_reload_conf();

-- Or set per-session
SET auto_explain.log_min_duration = 5000;

This is the single most important setting. It defines the minimum execution time (in milliseconds) for a query to trigger plan logging. Every query that finishes faster than this threshold is completely unaffected by auto_explain.

Recommended starting points:

| Environment | Threshold | Rationale |
| --- | --- | --- |
| Conservative production | 5000 (5s) | Captures only clearly problematic queries |
| Active tuning | 1000 (1s) | Captures anything a user might notice |
| Aggressive | 500 (500ms) | Captures most optimization opportunities |
| Development | 100 (100ms) | Captures everything worth looking at |

The strategy: start at 5000ms. Fix the queries that auto_explain surfaces. Lower the threshold to 1000ms. Fix those. Continue lowering. Each step reveals a new layer of optimization opportunities.

Special values:

  • -1 disables auto_explain entirely (queries still execute normally, nothing is logged)
  • 0 logs every query — I must be direct: never do this in production. The log volume alone can degrade performance.

log_analyze — the overhead decision

SQL
ALTER SYSTEM SET auto_explain.log_analyze = off;

This setting controls whether auto_explain captures estimated plans (off, the default) or actual execution plans (on).

With log_analyze = off, auto_explain runs the equivalent of EXPLAIN — the plan with estimated costs, estimated row counts, and the chosen strategy. No additional execution overhead is added.

With log_analyze = on, auto_explain runs the equivalent of EXPLAIN ANALYZE — the plan includes actual execution times, actual row counts, and actual memory usage. This adds instrumentation overhead because PostgreSQL must call timing functions for each plan node.

The key detail: with log_analyze = on, timing instrumentation applies to every statement, not only the slow ones. PostgreSQL cannot know in advance which queries will cross the threshold, so per-node timing occurs for all executions, and the collected data is simply discarded when a query finishes quickly. The official documentation warns that this can have a strongly negative impact on performance. If you need actual execution data continuously, auto_explain.sample_rate lets you instrument only a fraction of statements per session.

Recommendation: start with log_analyze = off. Estimated plans still reveal sequential scans, poor join strategies, and missing indexes. Enable log_analyze = on when you need actual row counts and timing to diagnose specific queries.

log_buffers — I/O insight

SQL
ALTER SYSTEM SET auto_explain.log_buffers = on;

-- Plan output includes:
-- Buffers: shared hit=1234 read=567 dirtied=12 written=3

This distinguishes between "slow because of computation" and "slow because of disk I/O." A query with high shared read relative to shared hit is bottlenecked on I/O — the data was not in PostgreSQL's buffer cache.

Requires log_analyze = on to take effect. When you enable log_analyze, enable log_buffers as well — the additional overhead beyond log_analyze is minimal, and the I/O insight is valuable.
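As a quick sanity check on a logged plan, the hit/read split can be turned into a cache hit ratio. A minimal sketch in Python, assuming a Buffers line shaped like the one above (the helper name is my own, not part of auto_explain):

```python
import re

# Illustrative helper: compute the buffer-cache hit ratio from a
# "Buffers:" line in auto_explain output. A low ratio means the query
# was bottlenecked on disk I/O rather than computation.
def buffer_hit_ratio(line):
    hit = int(re.search(r"hit=(\d+)", line).group(1))
    read = int(re.search(r"read=(\d+)", line).group(1))
    total = hit + read
    return hit / total if total else 1.0

ratio = buffer_hit_ratio("Buffers: shared hit=1234 read=567 dirtied=12 written=3")
# 1234 of 1801 blocks came from cache: roughly 0.69
```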

log_timing — per-node timing

SQL
ALTER SYSTEM SET auto_explain.log_timing = on;

When log_analyze is on, log_timing controls whether per-node execution times are reported. It defaults to on when log_analyze is enabled.

The overhead concern: with timing enabled, the clock is read at each plan node, potentially for every row the node processes. For complex plans pushing large row counts, this can add measurable overhead.

Recommendation: leave it on unless you have very complex queries with 100+ plan nodes and you measure a noticeable overhead.
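To get a feel for why node count and row count both matter, here is a back-of-envelope sketch in Python. The 25 ns per clock reading is an illustrative assumption, not a measured figure; real costs vary by platform and clock source.

```python
# Back-of-envelope estimate of per-node timing overhead.
# Assumption (illustrative): two clock readings per row at each
# instrumented plan node, at roughly 25 ns per reading.
def timing_overhead_ms(rows_per_node, ns_per_read=25):
    """rows_per_node: rows processed by each instrumented plan node."""
    reads = sum(2 * rows for rows in rows_per_node)
    return reads * ns_per_read / 1_000_000

# A two-node plan pushing a million rows through each node:
# 4 million clock readings, on the order of 100 ms of pure timing cost.
overhead = timing_overhead_ms([1_000_000, 1_000_000])
```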

log_nested_statements — functions and procedures

SQL
ALTER SYSTEM SET auto_explain.log_nested_statements = off;

The default (off) logs only top-level queries. When set to on, auto_explain also captures plans for queries executed inside PL/pgSQL functions, procedures, and DO blocks.

This is essential when slow queries hide inside stored procedures. A CALL process_orders() might finish in 50ms at the top level, but internally execute a query that takes 10 seconds. Without log_nested_statements, that inner query is invisible to auto_explain.

A word of caution: functions that execute many queries can generate substantial log volume when this setting is enabled.

log_format — choosing the output format

SQL
ALTER SYSTEM SET auto_explain.log_format = 'text';

-- text format output:
-- Seq Scan on users  (cost=0.00..1520.00 rows=1 width=128) (actual time=0.015..12.340 rows=1 loops=1)
--   Filter: (email = 'alice@example.com'::text)
--   Rows Removed by Filter: 99999

Options: text (default), json, xml, yaml.

Recommendation: use json if you have log processing infrastructure. Use text if you read PostgreSQL logs manually in a terminal. The format affects readability and parseability, not what data is captured.
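With json format, each logged plan is a structured document that can be loaded directly. A minimal sketch, assuming a plan body shaped like auto_explain's JSON output; the query and cost values below are invented for illustration:

```python
import json

# Parse one auto_explain JSON plan body (the part after "plan:" in the
# log entry). "Query Text" and "Plan" are the top-level keys; the
# values here are invented.
entry = json.loads("""
{
  "Query Text": "SELECT * FROM users WHERE email = $1",
  "Plan": {
    "Node Type": "Seq Scan",
    "Relation Name": "users",
    "Total Cost": 1520.0,
    "Plan Rows": 1
  }
}
""")
top_node = entry["Plan"]["Node Type"]          # the root plan node
relation = entry["Plan"].get("Relation Name")  # table it scans, if any
```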

log_level — controlling where output goes

SQL
ALTER SYSTEM SET auto_explain.log_level = 'LOG';

The default LOG level sends auto_explain output to the PostgreSQL log file. For most deployments, keep this at LOG.

A safe production configuration

Conservative — starting point

For a first deployment, I would suggest starting with minimal overhead:

postgresql.conf
# postgresql.conf
shared_preload_libraries = 'auto_explain'

auto_explain.log_min_duration = 5000    # 5 seconds
auto_explain.log_analyze = off
auto_explain.log_buffers = off
auto_explain.log_timing = on
auto_explain.log_nested_statements = off
auto_explain.log_format = text
auto_explain.log_level = LOG

This configuration captures estimated plans for queries exceeding 5 seconds. The overhead is negligible: with log_analyze off, nothing extra is measured during execution, and the estimated plan is printed only for queries that cross the threshold.

Moderate — active tuning

After fixing the worst offenders:

postgresql.conf
auto_explain.log_min_duration = 1000    # 1 second
auto_explain.log_analyze = off
auto_explain.log_buffers = off
auto_explain.log_format = json

JSON format enables automated analysis. The 1-second threshold captures queries that users would notice.

Thorough — full diagnostic

When you need actual execution data for specific investigation:

postgresql.conf
auto_explain.log_min_duration = 1000    # 1 second
auto_explain.log_analyze = on
auto_explain.log_buffers = on
auto_explain.log_timing = on
auto_explain.log_format = json

This adds instrumentation overhead to every statement, not only those exceeding 1 second, because timing must be collected before auto_explain knows whether a query will cross the threshold. Treat this as a temporary diagnostic configuration: enable it, capture the plans you need, then return to log_analyze = off (or limit the cost with auto_explain.sample_rate).

Changing settings without restart

After the initial restart to add auto_explain to shared_preload_libraries, all settings can be changed without a restart:

SQL
-- Change the threshold
ALTER SYSTEM SET auto_explain.log_min_duration = 1000;
SELECT pg_reload_conf();

-- Enable analyze for a deeper look
ALTER SYSTEM SET auto_explain.log_analyze = on;
ALTER SYSTEM SET auto_explain.log_buffers = on;
SELECT pg_reload_conf();

Per-role and per-database overrides

Fine-tune thresholds for specific databases or roles:

SQL
-- Lower threshold for the main application database
ALTER DATABASE myapp SET auto_explain.log_min_duration = 500;

-- Disable auto_explain for a reporting role (expected slow queries)
ALTER ROLE reporting SET auto_explain.log_min_duration = -1;

-- Aggressive threshold for a specific debugging session
SET auto_explain.log_min_duration = 100;

Reading auto_explain output

Finding the plans in your logs

auto_explain writes to the PostgreSQL log file, typically located in $PGDATA/log/ or the directory specified by log_directory. Plans appear as log entries with the query duration and the execution plan:

PostgreSQL log output
-- In PostgreSQL log:
-- 2026-03-26 14:23:45.678 UTC [12345] LOG:  duration: 5234.123 ms  plan:
--   Query Text: SELECT * FROM orders WHERE customer_id = $1 AND created_at > $2
--   Seq Scan on orders  (cost=0.00..45123.00 rows=500 width=96)
--     Filter: ((customer_id = 42) AND (created_at > '2026-01-01'::date))
--     Rows Removed by Filter: 1999500

For text-format logs, search with:

Shell
grep "duration:.*plan:" postgresql-*.log
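Beyond grep, a small parser can pull the durations out for ranking. A sketch in Python, assuming text-format log lines shaped like the example above (the regex and function name are my own):

```python
import re

# Extract plan durations from text-format log lines such as:
#   ... LOG:  duration: 5234.123 ms  plan:
DURATION_RE = re.compile(r"duration: ([\d.]+) ms\s+plan:")

def plan_durations(log_lines):
    """Yield the duration in milliseconds of every logged plan."""
    for line in log_lines:
        m = DURATION_RE.search(line)
        if m:
            yield float(m.group(1))

line = "2026-03-26 14:23:45.678 UTC [12345] LOG:  duration: 5234.123 ms  plan:"
durations = list(plan_durations([line, "unrelated log line"]))
```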

What to look for

Six patterns appear most frequently in auto_explain output. I have seen each of them more times than I care to count:

Seq Scan on large tables — the most common finding. A sequential scan on a table with millions of rows, with a Filter that removes most of them, indicates a missing index.

Nested Loop with high actual loops — a Nested Loop join where the inner side executes thousands of times. This suggests an N+1 pattern or a missing index on the join column.

Sort with external merge Disk — the sort operation spilled to disk because the data exceeded work_mem. Either raise work_mem for this query or add an index that provides pre-sorted access.

Hash Join with Batches > 1 — the hash table spilled to disk. Raise work_mem to allow the hash table to fit in memory.

Estimated vs actual row mismatch — when estimated rows and actual rows differ by 10x or more, the planner is making decisions based on stale statistics. Run ANALYZE on the affected table.

Index Scan with Filter removing most rows — the planner used an index, but the index did not cover all the filter conditions. Consider a composite index that covers the additional conditions.
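The first and third of these checks lend themselves to automation against JSON-format plans. A sketch that walks a plan tree, assuming the standard EXPLAIN JSON keys ("Node Type", "Plans", "Rows Removed by Filter", "Sort Method"); the thresholds are arbitrary illustrations, not recommendations:

```python
# Walk a JSON plan tree and flag two of the patterns above: large
# filtered Seq Scans and sorts that spilled to disk.
def find_issues(node, issues=None):
    if issues is None:
        issues = []
    if (node.get("Node Type") == "Seq Scan"
            and node.get("Rows Removed by Filter", 0) > 100_000):
        issues.append(f"Seq Scan on {node.get('Relation Name')}: "
                      f"filter removed {node['Rows Removed by Filter']} rows")
    if (node.get("Node Type") == "Sort"
            and node.get("Sort Method", "").startswith("external")):
        issues.append("Sort spilled to disk; consider raising work_mem")
    for child in node.get("Plans", []):  # recurse into child nodes
        find_issues(child, issues)
    return issues

plan = {
    "Node Type": "Sort",
    "Sort Method": "external merge",
    "Plans": [{
        "Node Type": "Seq Scan",
        "Relation Name": "users",
        "Rows Removed by Filter": 999999,
    }],
}
issues = find_issues(plan)
```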

Common patterns and what they mean

Pattern: missing index

-- Pattern: Seq Scan, rows=1, Rows Removed by Filter: 999999
-- Seq Scan on users  (cost=0.00..22456.00 rows=1 width=128) (actual time=0.021..145.678 rows=1 loops=1)
--   Filter: (email = 'alice@example.com'::text)
--   Rows Removed by Filter: 999999
-- Diagnosis: missing index on the email column.

Pattern: nested loop overhead

-- Pattern: Nested Loop, actual loops=1000
-- Nested Loop  (cost=0.57..8500.00 rows=1000 width=200) (actual time=0.050..450.000 rows=1000 loops=1)
--   -> Seq Scan on order_items  (cost=0.00..150.00 rows=1000 width=100) (actual time=0.010..1.500 rows=1000 loops=1)
--   -> Index Scan using orders_pkey on orders  (cost=0.57..8.34 rows=1 width=100) (actual time=0.400..0.410 rows=1 loops=1000)
-- Diagnosis: 1,000 index lookups on the inner side.

Pattern: sort spilling to disk

-- Pattern: Sort Method: external merge Disk: 150MB
-- Sort  (cost=250000.00..252500.00 rows=1000000 width=64) (actual time=4500.000..5200.000 rows=1000000 loops=1)
--   Sort Key: created_at DESC
--   Sort Method: external merge  Disk: 153600kB
-- Diagnosis: raise work_mem or add a pre-sorted index.

Pattern: lossy bitmap

-- Pattern: Bitmap Heap Scan, lossy=true
-- Bitmap Heap Scan on orders  (cost=500.00..15000.00 rows=50000 width=96) (actual time=25.000..350.000 rows=48500 loops=1)
--   Recheck Cond: (customer_id = ANY ('{1,2,3,4,5}'::integer[]))
--   Rows Removed by Index Recheck: 15000
--   Heap Blocks: exact=1000 lossy=5000
-- Diagnosis: bitmap exceeded work_mem, dropped to lossy precision.

Filtering noise

Once you lower the threshold below 5 seconds, auto_explain captures more plans — including queries you may not care about. Allow me to suggest several approaches for managing the noise:

Focus on top offenders. Sort logged plans by duration. Fix the slowest queries first. Each fix reduces log volume and clears the way for the next round.

Per-role filtering. Disable auto_explain for roles whose queries are expected to be slow:

SQL — per-role filtering
-- Reporting queries are expected to take minutes; do not log plans
ALTER ROLE reporting SET auto_explain.log_min_duration = -1;

-- The application role gets a lower threshold
ALTER ROLE app SET auto_explain.log_min_duration = 500;

Per-database filtering. Different databases can have different thresholds:

SQL — per-database filtering
ALTER DATABASE analytics SET auto_explain.log_min_duration = 30000;  -- 30s
ALTER DATABASE production SET auto_explain.log_min_duration = 1000;  -- 1s

Log rotation and retention. Configure log rotation to keep auto_explain output for 7-14 days:

postgresql.conf
# postgresql.conf
log_rotation_age = 1d
log_rotation_size = 100MB

Structured analysis. With JSON-format logs, pipe into a dashboard that groups by query pattern and shows frequency and duration distribution. This turns raw log entries into actionable rankings.
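The grouping step can be sketched in a few lines. Assuming (query_text, duration_ms) pairs have already been parsed from the JSON logs upstream (the function name is my own):

```python
from collections import defaultdict
from statistics import mean

# Rank query patterns by total logged time; count and mean duration
# come along for context.
def rank_patterns(entries):
    groups = defaultdict(list)
    for query, ms in entries:
        groups[query].append(ms)
    ranked = [
        {"query": q, "count": len(ds), "mean_ms": mean(ds), "total_ms": sum(ds)}
        for q, ds in groups.items()
    ]
    return sorted(ranked, key=lambda r: r["total_ms"], reverse=True)

top = rank_patterns([
    ("SELECT * FROM orders WHERE customer_id = $1", 5200.0),
    ("SELECT * FROM orders WHERE customer_id = $1", 4800.0),
    ("SELECT count(*) FROM events", 1500.0),
])
```

Note that this groups on exact query text; in practice you would normalize parameters first (as pg_stat_statements does) so that variants of the same pattern collapse into one row.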

auto_explain vs pg_stat_statements — complementary, not competing

pg_stat_statements and auto_explain serve different purposes, and I would recommend enabling both.

pg_stat_statements tells you which queries are slow. It aggregates statistics — total execution time, mean time, call count, rows returned — across all executions of each query pattern. It answers: "Which queries consume the most database time?"

auto_explain tells you why those queries are slow. It captures the execution plan that PostgreSQL chose for a specific slow execution. It answers: "What plan did the planner choose, and where did the time go?"

The workflow:

  1. Query pg_stat_statements to identify the top queries by total time or mean time.
  2. Check auto_explain logs for the plans of those queries.
  3. The plan reveals whether the problem is a missing index, a bad join strategy, stale statistics, or a sort spilling to disk.
  4. Fix the plan, and verify with pg_stat_statements that the metrics improve.

For pg_stat_monitor users: pg_stat_monitor captures actual execution plans natively, reducing but not eliminating the need for auto_explain. See the pg_stat_monitor guide for details.

For a comprehensive overview of the complete monitoring stack, including pg_stat_statements, auto_explain, and other diagnostic extensions, see the monitoring stack guide.

What Gold Lapel captures that auto_explain cannot

auto_explain is a threshold-based tool — it captures plans when queries exceed a duration. This means it misses queries that are consistently moderate (800ms every time, with a 1-second threshold) and cannot track how plans change over time as data grows. I should be forthcoming about those limits.

Gold Lapel operates as a proxy between your application and PostgreSQL, observing every query plan as part of its optimization pipeline. It captures plans for every query pattern it tracks, regardless of execution time, and monitors how plans evolve as data distributions change. When a query that used an Index Scan last week silently switches to a Seq Scan because table statistics shifted, Gold Lapel detects the regression.

auto_explain writes to logs for human review. Gold Lapel acts on the plans — identifying optimization opportunities and applying them. The manual approach this article teaches is the foundation. Understanding it makes you better equipped to interpret what any automated tool does on your behalf.

Frequently asked questions