WAL (Write-Ahead Log)

Every change is written to the log before it reaches the data files. This one guarantee is what makes crash recovery, replication, and point-in-time restore possible.

Concept · March 21, 2026 · 9 min read

The write-ahead log (WAL) is a sequential, append-only log of every change PostgreSQL makes to data. Before any modification is applied to the actual tables and indexes on disk, the change is first recorded in the WAL. If PostgreSQL crashes, it replays the WAL from the last checkpoint to restore every committed transaction. This write-before-apply guarantee is the foundation of PostgreSQL's durability, and the same log stream powers streaming replication, logical replication, and point-in-time recovery.

What the write-ahead log is

PostgreSQL does not write changes directly to data files when a transaction commits. Instead, it writes a WAL record — a compact description of the change — to a sequential log on disk. The actual data files (heap, indexes) are updated later, during a checkpoint or by the background writer. This separation is the core insight: sequential writes are fast, and a single durable log is easier to guarantee than scattered updates across many files.

The WAL is append-only. New records are always added at the end, never inserted into the middle or overwritten. Each record has a unique Log Sequence Number (LSN) — a monotonically increasing byte offset that identifies its exact position in the WAL stream. LSNs are how PostgreSQL tracks what has been written, flushed, replicated, and recovered.

SQL
-- Current WAL position and recent write activity
SELECT
  pg_current_wal_lsn() AS current_lsn,
  pg_walfile_name(pg_current_wal_lsn()) AS current_wal_file,
  pg_wal_lsn_diff(
    pg_current_wal_lsn(),
    '0/0'
  ) / (1024 * 1024 * 1024) AS total_wal_gb;

On disk, the WAL lives in the pg_wal directory as a series of 16 MB segment files. PostgreSQL writes to the current segment, advances to a new one when it fills up, and recycles old segments once they are no longer needed for recovery or replication.
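Those segments can be inspected directly from SQL with pg_ls_waldir(), which is restricted to superusers and members of pg_monitor by default:

SQL
-- List the most recently written WAL segment files in pg_wal
SELECT name, size, modification
FROM pg_ls_waldir()
ORDER BY modification DESC
LIMIT 5;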

Why WAL matters

WAL serves four distinct purposes, all built on the same write-before-apply guarantee:

  • Durability — a transaction is not reported as committed until its WAL records are flushed to stable storage. Even if the server loses power one millisecond after the commit acknowledgment, the change is on disk in the WAL.
  • Crash recovery — on startup after a crash, PostgreSQL identifies the last completed checkpoint and replays all WAL records written after it. This brings the database to a consistent state that includes every committed transaction and excludes every uncommitted one. No manual intervention required.
  • Point-in-time recovery (PITR) — by archiving WAL segments to external storage, you can restore a base backup and replay WAL up to any specific moment. This is how you recover from accidental DROP TABLE or data corruption — rewind to one second before the mistake.
  • Replication — the WAL stream is the data source for both physical and logical replication. Replicas receive WAL records from the primary and apply them, maintaining a near-real-time copy of the database.
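The durability guarantee above is tunable. synchronous_commit controls whether the commit waits for the WAL flush before acknowledging the client — relaxing it risks losing the most recent transactions after a crash, but never corrupts data:

SQL
-- Check and (for this session only) relax the commit-time WAL flush
SHOW synchronous_commit;        -- 'on' by default

-- Trades a small window of possible transaction loss for lower
-- commit latency; crash recovery still produces a consistent state
SET synchronous_commit = off;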

How WAL works

The lifecycle of a write in PostgreSQL follows a specific path through WAL and checkpoints:

  1. A transaction modifies data (INSERT, UPDATE, DELETE). PostgreSQL modifies the page in the shared buffer cache (in memory) and generates a WAL record describing the change.
  2. The WAL record is written to the WAL buffer (also in memory).
  3. At commit time, the WAL buffer is flushed to disk (fsync or equivalent). Once the flush completes, the transaction is durable and the commit is acknowledged to the client.
  4. The modified data page remains in the buffer cache as a "dirty page." It has not been written to the data file yet.
  5. Periodically, the checkpointer process flushes all dirty pages to disk. This is a checkpoint. After a checkpoint completes, the WAL records before it are no longer needed for crash recovery — recovery can start from the checkpoint instead of replaying the entire history.
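The WAL cost of an individual statement can be observed directly by diffing insert positions. This sketch uses psql's \gset variable capture, and the accounts table is a stand-in for any table of yours:

SQL
-- Capture the WAL insert position, run a change, and diff (psql only)
SELECT pg_current_wal_insert_lsn() AS before_lsn \gset

UPDATE accounts SET balance = balance + 1 WHERE id = 42;

-- Bytes of WAL generated since the captured position
SELECT pg_wal_lsn_diff(pg_current_wal_insert_lsn(), :'before_lsn') AS wal_bytes;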

SQL — checkpoint statistics
-- Check checkpoint frequency and timing
-- (on PostgreSQL 17+ the checkpoint counters moved to pg_stat_checkpointer)
SELECT
  checkpoints_timed,
  checkpoints_req,
  checkpoint_write_time / 1000 AS write_seconds,
  checkpoint_sync_time / 1000 AS sync_seconds,
  buffers_checkpoint,
  buffers_clean,
  buffers_backend
FROM pg_stat_bgwriter;

-- checkpoints_timed:  triggered by checkpoint_timeout (normal)
-- checkpoints_req:    triggered by max_wal_size (too frequent = WAL pressure)

The WAL writer process also flushes WAL buffers to disk between commits, ensuring that even busy systems do not accumulate a large backlog of unwritten WAL data in memory. Checkpoints are the more expensive operation — they flush potentially thousands of dirty data pages — and their frequency is a trade-off between recovery time (fewer checkpoints mean more WAL to replay after a crash) and I/O load (more checkpoints mean more disk writes during normal operation).
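One way to see this trade-off concretely is to measure how far the current WAL position has advanced past the last checkpoint's redo point — roughly the amount of WAL that crash recovery would have to replay right now:

SQL
-- Distance from the last checkpoint's redo point to the current
-- WAL position: an estimate of crash-recovery replay work
SELECT pg_size_pretty(
  pg_wal_lsn_diff(pg_current_wal_lsn(), redo_lsn)
) AS wal_since_last_checkpoint
FROM pg_control_checkpoint();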

WAL and replication

The WAL stream is the foundation of PostgreSQL replication. Both major replication modes — streaming and logical — consume the same WAL, but process it differently.

Streaming replication (physical) sends raw WAL bytes from the primary to one or more replicas. Each replica replays the WAL identically, producing a byte-for-byte copy of the primary's data files. This is the simplest and lowest-overhead replication method. The replica can serve read queries while replay continues, and the lag between primary and replica is typically sub-second.

SQL — replication monitoring
-- Check connected replicas and their WAL positions
SELECT
  client_addr,
  state,
  sent_lsn,
  write_lsn,
  flush_lsn,
  replay_lsn,
  pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;

Logical replication decodes the WAL into structured change events — row-level inserts, updates, and deletes — using a logical decoding plugin. This allows selective replication (specific tables, not the entire database), cross-version replication, and feeding changes to external systems. Extensions like wal2json and pglogical build on this mechanism.
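A minimal sketch of logical decoding, using the built-in test_decoding plugin (wal2json would emit JSON instead). It requires wal_level = logical, and the slot name here is arbitrary:

SQL
-- Create a slot, peek at decoded changes, then clean up
SELECT pg_create_logical_replication_slot('demo_slot', 'test_decoding');

-- ...run some INSERT/UPDATE/DELETE statements, then:
SELECT lsn, xid, data
FROM pg_logical_slot_peek_changes('demo_slot', NULL, NULL);

-- Drop the slot when done so it stops retaining WAL
SELECT pg_drop_replication_slot('demo_slot');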

WAL archiving copies completed WAL segment files to a separate storage location as they are filled. This serves two purposes: it provides the WAL history needed for point-in-time recovery, and it acts as a fallback for replicas that fall too far behind streaming replication. Tools like pgBackRest, Barman, and WAL-G manage the archiving and restoration process.
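Archiving health is visible in pg_stat_archiver — a growing failed_count or a stale last_archived_time means segments are piling up in pg_wal:

SQL
-- Archiver status: successes, failures, and the most recent of each
SELECT
  archived_count,
  last_archived_wal,
  last_archived_time,
  failed_count,
  last_failed_wal,
  last_failed_time
FROM pg_stat_archiver;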

Key configuration

WAL behavior is controlled by a handful of settings that balance durability, performance, and storage.

postgresql.conf
# Key WAL-related settings (shown with typical production values)

# Controls what information is written to WAL
wal_level = replica

# Maximum WAL size before a checkpoint is triggered
max_wal_size = 4GB

# Minimum WAL retained (even after checkpoint)
min_wal_size = 1GB

# Time between automatic checkpoints
checkpoint_timeout = 10min

# Spread checkpoint writes over this fraction of the interval
checkpoint_completion_target = 0.9

# Enable WAL archiving for PITR and backups
archive_mode = on
archive_command = 'cp %p /archive/%f'

wal_level determines how much information is written to WAL. The default replica is sufficient for physical replication and archiving. Set it to logical if you need logical replication or change data capture. The minimal level writes less WAL (certain bulk operations can skip logging) but disables replication and archiving entirely.

SQL
-- Check the current WAL level
SHOW wal_level;

-- Possible values:
-- minimal    — crash recovery only, no replication
-- replica    — supports physical replication and WAL archiving (default)
-- logical    — supports logical replication and change data capture

max_wal_size is the soft limit on total WAL size before a checkpoint is triggered. If WAL accumulates faster than checkpoints can flush dirty pages, PostgreSQL forces an early checkpoint. Frequent forced checkpoints (visible as high checkpoints_req in pg_stat_bgwriter) indicate that max_wal_size should be raised.
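max_wal_size can be raised without a restart; the 8GB value here is illustrative:

SQL
-- Persist the new value and signal all backends to reload
ALTER SYSTEM SET max_wal_size = '8GB';
SELECT pg_reload_conf();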

checkpoint_timeout is the maximum time between automatic checkpoints. The default 5 minutes is conservative. Production systems with high write throughput often increase this to 10-15 minutes to reduce checkpoint frequency, accepting that crash recovery will take longer as a trade-off.

archive_mode and archive_command enable WAL archiving. Once enabled, PostgreSQL will not recycle a WAL segment until the archive command has successfully copied it. This is essential for PITR but requires monitoring — a broken archive command will cause WAL to accumulate in pg_wal indefinitely.
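A broken archive command is not the only thing that pins WAL — an inactive replication slot has the same effect. This query shows how much WAL each slot is currently holding back:

SQL
-- WAL retained per replication slot; watch inactive slots with
-- a stale restart_lsn, since they block segment recycling
SELECT
  slot_name,
  active,
  pg_size_pretty(
    pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
  ) AS retained_wal
FROM pg_replication_slots;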

How Gold Lapel relates

Gold Lapel operates at the query level — it sits between your application and PostgreSQL, analyzing and optimizing queries as they pass through. WAL operates at the storage level, below the query interface. They work at different layers of the stack and do not interact directly.

That said, Gold Lapel's optimizations can indirectly reduce WAL volume. When Gold Lapel serves a query result from a materialized view instead of re-executing an expensive join, the avoided execution work means less buffer churn and fewer incidentally dirtied pages, which in turn means less WAL. And when Gold Lapel recommends an index that replaces sequential scans, the more targeted access pattern can reduce the checkpoint pressure that heavy buffer turnover creates.

WAL-related extensions like wal2json operate at a different layer entirely — they decode the WAL for change data capture and replication purposes. Gold Lapel does not produce or consume WAL directly. It reads query statistics, not the change log.

Frequently asked questions