Zero-Downtime Alembic Migrations on PostgreSQL: The Household Does Not Tolerate Outages
Your migration took 50 milliseconds. The outage lasted 12 seconds. Allow me to explain the discrepancy.
Good evening. Your deploy appears to have taken the database offline.
Not through malice, and not through negligence exactly. Through the ordinary act of running alembic upgrade head against a production PostgreSQL database without considering what happens between the moment a lock is requested and the moment it is granted.
Most Alembic tutorials show you how to generate a migration, write op.add_column(), and run upgrade head. Very few mention that this polite one-liner can queue every query in your application behind it. The DDL statement itself finishes in milliseconds. The outage comes from the queue that forms while it waits for its lock.
I have attended to this particular problem more times than I care to enumerate. A team runs a schema migration during a deploy. The migration takes 50 milliseconds. The application is unresponsive for 12 seconds. The Slack channel fills with messages. The post-mortem identifies "a database issue" without specifying which issue, and the team resolves to "be more careful next time." Being more careful is not a strategy. Understanding the lock queue is.
There is no Alembic-specific, code-complete guide for zero-downtime PostgreSQL migrations. The Django ecosystem has django-pg-zero-downtime-migrations. Rails has strong_migrations. The framework-agnostic guides explain the theory but leave you to translate it into Alembic's env.py yourself. This guide fills that gap: every configuration change, every migration pattern, every code example — ready to paste into your project and adapt.
Why does a 50ms DDL statement cause a 12-second outage?
The PostgreSQL lock queue is a FIFO queue with a strict ordering rule: once a lock request is pending, all subsequent lock requests that conflict with it also queue behind it — even if they would be compatible with the currently granted locks.
This is worth reading twice, because it is the single fact that explains nearly every migration-induced outage.
-- Your migration wants to add a column.
-- It needs AccessExclusiveLock on the table.
-- Three SELECT queries are currently running against that table.
-- Each holds AccessShareLock.
-- The timeline:
-- T+0s SELECT queries running (AccessShareLock held)
-- T+0s ALTER TABLE submitted — waits for AccessExclusiveLock
-- T+1s New SELECT arrives — queued behind the pending ALTER TABLE
-- T+2s New INSERT arrives — also queued
-- T+3s More SELECTs — all queued
-- T+12s Original SELECTs finish
-- T+12s ALTER TABLE acquires lock, runs for 50ms
-- T+12s Queued queries finally execute
-- 12 seconds of total blockage.
-- The ALTER TABLE itself took 50 milliseconds.
-- The damage was the queue, not the DDL.

This is the cascade. The ALTER TABLE needs AccessExclusiveLock, which conflicts with the AccessShareLock held by running SELECTs. The ALTER waits. But now every new query — including plain SELECTs that would normally be fine — queues behind the pending AccessExclusiveLock. The pending lock poisons the queue.
The longer the ALTER waits, the longer the queue grows. If you have a long-running analytics query holding AccessShareLock for 45 seconds, that is 45 seconds of total database blockage for that table. Not because the ALTER needs 45 seconds — it needs 50ms — but because it has to wait 45 seconds to even start.
And the queue does not grow linearly. A web application receiving 200 requests per second, each touching the locked table, accumulates 9,000 blocked queries during a 45-second wait. When the lock finally releases, those 9,000 queries execute in a thundering herd that can spike CPU to 100% and cause a secondary outage. I have seen this happen. The migration succeeds, the lock releases, and then the application falls over again from the backlog. It is — if you will permit me — the infrastructural equivalent of a traffic jam causing a second traffic jam.
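That arithmetic can be made concrete in a few lines. A minimal sketch using the figures above (200 requests per second, a 45-second wait); the 2,000-queries-per-second drain capacity is an assumption of mine, not a number from the article:

```python
# Back-of-the-envelope model of lock-queue buildup during a blocked migration.
# Figures match the example above: 200 requests/second, 45-second wait.

def queued_queries(requests_per_second: float, lock_wait_seconds: float) -> int:
    """Queries that pile up behind a pending AccessExclusiveLock."""
    return int(requests_per_second * lock_wait_seconds)

def drain_time_seconds(queued: int, burst_capacity_qps: float) -> float:
    """Rough time to work through the backlog once the lock releases."""
    return queued / burst_capacity_qps

backlog = queued_queries(200, 45)
print(backlog)                            # 9000 queries waiting
print(drain_time_seconds(backlog, 2000))  # 4.5 (seconds of thundering herd,
                                          # assuming 2000 q/s burst capacity)
```

The point of the model is the shape, not the precision: the backlog grows linearly with lock wait time, and the drain adds a second spike on top of the first.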
For those who wish to go further, I have prepared a dedicated guide to PostgreSQL lock contention that covers the lock mechanics in depth.
Observing the lock queue in real time
When a migration is blocked and you need to understand what is happening, this diagnostic query shows the lock queue:
-- See the lock queue in real time during a blocked migration:
SELECT
blocked.pid AS blocked_pid,
blocked.query AS blocked_query,
blocking.pid AS blocking_pid,
blocking.query AS blocking_query,
now() - blocked.query_start AS wait_duration
FROM pg_stat_activity blocked
JOIN pg_locks bl ON bl.pid = blocked.pid
JOIN pg_locks kl ON kl.relation = bl.relation
AND kl.pid != bl.pid
AND kl.granted = true
JOIN pg_stat_activity blocking ON blocking.pid = kl.pid
WHERE NOT bl.granted
ORDER BY wait_duration DESC;
-- This shows you exactly who is blocking whom.
-- Every second this goes unanswered, the queue grows.

If your migration is blocked, this query tells you precisely which session is holding the conflicting lock and what query it ran. Armed with that information, you can make an informed decision: wait for the blocking query to finish, or terminate it with pg_terminate_backend(pid). Neither option is pleasant, but both are preferable to staring at a frozen application wondering what happened.
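On PostgreSQL 9.6 and later, the built-in pg_blocking_pids() function gives a simpler view of the same wait graph. A sketch; the helper name wait_graph_sql is mine, and the returned SQL is meant to be run through psql or any client:

```python
# Sketch: the simpler PG 9.6+ diagnostic built on pg_blocking_pids(),
# which asks the lock manager directly who is blocking whom.

def wait_graph_sql() -> str:
    return """
    SELECT pid,
           pg_blocking_pids(pid) AS blocked_by,
           now() - query_start   AS wait_duration,
           query
    FROM pg_stat_activity
    WHERE cardinality(pg_blocking_pids(pid)) > 0
    ORDER BY wait_duration DESC;
    """

print(wait_graph_sql())
```

pg_blocking_pids() resolves the blocking relationship for you, which avoids the false positives a manual pg_locks self-join can produce.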
Configuring Alembic's env.py for safety
Two changes to env.py eliminate the most common migration failures. Neither requires restructuring your project. Both should be applied before your next production deployment.
Problem: all migrations run in one transaction
Alembic's default env.py wraps all pending migrations in a single transaction. If you have three migration files pending, the locks from migration #1 are held until migration #3 finishes.
# env.py — the default Alembic configuration.
# This wraps ALL migrations in a single transaction.
def run_migrations_online():
connectable = engine_from_config(config.get_section("alembic"))
with connectable.connect() as connection:
context.configure(
connection=connection,
target_metadata=target_metadata,
)
with context.begin_transaction():
context.run_migrations()
# Problem: if you have 5 migration files pending,
# all 5 run inside one transaction.
# The AccessExclusiveLock from migration #1
# is held until migration #5 finishes.
# Every migration's lock duration = total batch duration.

This default is sensible for correctness — if migration #3 fails, migrations #1 and #2 are rolled back, and the database returns to its original state. But in production, the cost of holding locks across multiple migrations far exceeds the benefit of atomic rollback. If migration #3 fails, you want to know about it and fix it. You do not want migrations #1 and #2 retroactively blocking traffic for the duration of #3's failure.
The fix is transaction_per_migration=True, a long-standing option of context.configure(). Each migration file gets its own transaction. Locks are acquired and released per migration, not per batch.
# env.py — configured for zero-downtime migrations.
def run_migrations_online():
connectable = engine_from_config(config.get_section("alembic"))
with connectable.connect() as connection:
context.configure(
connection=connection,
target_metadata=target_metadata,
transaction_per_migration=True, # Each migration gets its own transaction
)
        with context.begin_transaction():
            # With transaction_per_migration=True this outer block is a
            # proxy; each migration still gets its own real transaction.
            context.run_migrations()
# Now migration #1 acquires and releases its lock
# before migration #2 begins.
# Lock duration = single migration duration, not batch duration.

One transaction per migration. One lock scope per migration. If migration #1 adds a column and migration #2 creates an index, the column addition's AccessExclusiveLock is released before the index build begins. The exposure window shrinks from "total batch duration" to "single migration duration."
Problem: no lock_timeout means infinite waiting
Without lock_timeout, a migration that cannot acquire its lock will wait indefinitely. While waiting, it poisons the lock queue. Every subsequent query on that table blocks. The application goes down, and the migration has not even started.
# env.py — with lock_timeout to fail fast.
from alembic import context
from sqlalchemy import engine_from_config, text
def run_migrations_online():
connectable = engine_from_config(config.get_section("alembic"))
with connectable.connect() as connection:
# Set lock_timeout BEFORE running migrations.
# If any DDL cannot acquire its lock within 4 seconds, it fails
# instead of blocking the entire application.
connection.execute(text("SET lock_timeout = '4s'"))
# Also set a statement_timeout as a safety net.
connection.execute(text("SET statement_timeout = '30s'"))
context.configure(
connection=connection,
target_metadata=target_metadata,
transaction_per_migration=True,
)
        with context.begin_transaction():
            context.run_migrations()
# If the lock cannot be acquired in 4 seconds:
# sqlalchemy.exc.OperationalError:
# (psycopg2.errors.LockNotAvailable)
# canceling statement due to lock timeout
#
# The migration fails. The queue drains. You retry later.
# This is vastly preferable to a 45-second outage.

Four seconds is a reasonable starting point. Long enough that a migration will succeed during normal traffic. Short enough that a failed lock acquisition does not take the application down. Adjust based on your deployment cadence and traffic patterns.
The statement_timeout is a separate safety net for a different failure mode: a DDL statement that acquires its lock but then takes too long to execute. This can happen with a CREATE INDEX on a large table (which is why you should use CONCURRENTLY, as discussed below). Set it generously — 30 seconds is reasonable for most DDL, though you may need more for large backfill operations.
When a migration fails due to lock timeout, it is not a crisis. It is the system working as designed. The queue drains immediately. Retry during a quieter moment, or terminate the long-running query that was blocking lock acquisition.
The full production configuration
For reference, here is what a complete production-ready env.py looks like with all the safety measures combined:
# env.py — full production-ready configuration.
# Combines all safety measures in one file.
from alembic import context
from sqlalchemy import engine_from_config, pool, text
import logging
log = logging.getLogger("alembic.runtime.migration")
def run_migrations_online():
connectable = engine_from_config(
config.get_section("alembic"),
prefix="sqlalchemy.",
poolclass=pool.NullPool, # Don't pool migration connections
)
with connectable.connect() as connection:
# Safety net: fail fast rather than block the application.
connection.execute(text("SET lock_timeout = '4s'"))
connection.execute(text("SET statement_timeout = '30s'"))
# Disable idle_in_transaction_session_timeout for the migration
# session — long backfill migrations may legitimately take time.
connection.execute(
text("SET idle_in_transaction_session_timeout = '0'")
)
context.configure(
connection=connection,
target_metadata=target_metadata,
transaction_per_migration=True,
compare_type=True,
)
        with context.begin_transaction():
            context.run_migrations()
        log.info("Migrations complete.")

Two details worth noting. First, NullPool prevents the migration process from holding open connections in a pool — the migration opens one connection, uses it, and closes it. This avoids leaving behind idle connections that could interfere with subsequent migrations or application traffic. Second, idle_in_transaction_session_timeout is explicitly disabled for the migration session, because backfill migrations may legitimately run for minutes. The session-level setting overrides the server default without affecting other connections.
Which migration operations are safe?
Not all DDL is equal. Some operations hold heavy locks for milliseconds. Others hold them for minutes. The difference between a smooth deploy and an incident is knowing which is which.
| Operation | Safe? | Lock type | Duration | Notes |
|---|---|---|---|---|
| Add nullable column | Yes | AccessExclusiveLock (brief) | < 10ms | Catalog-only on all supported versions. No table rewrite. |
| Add NOT NULL column with DEFAULT (PG 11+) | Yes | AccessExclusiveLock (brief) | < 10ms | Catalog-only on PG 11+. PG knows no NULLs exist in a freshly added column with a non-null default. |
| Add column with volatile DEFAULT | No | AccessExclusiveLock | Full table rewrite | Volatile defaults (e.g., now()) rewrite the table even on PG 11+. |
| CREATE INDEX | No | ShareLock | 30-120s (50M rows) | Blocks all writes. Use CONCURRENTLY. |
| CREATE INDEX CONCURRENTLY | Yes | ShareUpdateExclusiveLock | 60-240s (50M rows) | Does not block reads or writes. Cannot run in transaction. |
| DROP COLUMN | Yes | AccessExclusiveLock (brief) | < 10ms | Only updates catalog. Data stays on disk until rows are rewritten (e.g., VACUUM FULL). |
| RENAME COLUMN | Yes | AccessExclusiveLock (brief) | < 10ms | Catalog-only. But will break running queries referencing old name. |
| ALTER COLUMN TYPE | No | AccessExclusiveLock | Full table rewrite | Rewrites every row. Use expand-contract: add new column, backfill, swap. |
| ADD CHECK constraint | No | AccessExclusiveLock | Full table scan | Use NOT VALID + VALIDATE CONSTRAINT pattern. |
| ADD FOREIGN KEY | No | ShareRowExclusiveLock (both tables) | Full table scan | Use NOT VALID + VALIDATE. Validation only needs ShareUpdateExclusiveLock. |
The pattern is consistent: anything that only modifies the system catalog (adding a nullable column, dropping a column, renaming) is fast. Anything that touches actual row data (type changes, constraint validation, non-concurrent indexes) is slow and dangerous.
I should note that "brief" in the lock duration column still means AccessExclusiveLock — the heaviest lock PostgreSQL has. Even brief AccessExclusiveLock operations block all concurrent access to the table for their duration. The distinction is that "brief" means sub-10ms, which is short enough that the lock queue has no time to build up. A 5ms lock at 200 requests per second queues exactly one request. A 45-second lock queues 9,000.
A word on DROP COLUMN: it is deceptively safe. PostgreSQL does not physically remove the column data — it marks the column as dropped in the system catalog, and the data remains on disk until the next VACUUM FULL or table rewrite. The lock is held only for the catalog update, which takes milliseconds. However, if any view, function, trigger, or rule references the column, the DROP will fail. Check dependencies before dropping.
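That dependency check can be scripted, at least for views. A sketch using information_schema.view_column_usage; the helper name is mine, and triggers, functions, and rules would need a separate pass over pg_depend:

```python
# Sketch: list views that reference a column before you DROP it.
# information_schema.view_column_usage covers views only; triggers and
# functions require consulting pg_depend / pg_proc separately.

def views_referencing_sql(table: str, column: str) -> str:
    # Parameters are inlined for readability; use bind parameters in real code.
    return f"""
    SELECT view_schema, view_name
    FROM information_schema.view_column_usage
    WHERE table_name = '{table}'
      AND column_name = '{column}';
    """

print(views_referencing_sql("orders", "priority"))
```

If the query returns rows, drop or recreate those views first; otherwise the DROP COLUMN will fail mid-migration.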
How do you safely add a NOT NULL column?
This used to be the single most common source of migration-induced outages. On PostgreSQL 10 and below, writing nullable=False, server_default='0' triggered a full table rewrite. On modern PostgreSQL (11+), this specific operation is actually safe.
# Adding a NOT NULL column with a default in one step.
# On PostgreSQL < 11, this rewrites the entire table — dangerous on large tables.
# On PostgreSQL 11+, this is actually safe and fast: adding a new column with
# DEFAULT x NOT NULL is a catalog-only operation. PostgreSQL knows the column
# can't contain NULLs because it was just created with a non-null default.
def upgrade():
op.add_column('orders',
sa.Column('priority', sa.Integer(), nullable=False, server_default='0')
)
# On PG 11+: instant, sub-millisecond lock. No table scan needed.
# On PG < 11: full table rewrite — use the expand-contract pattern below instead.

PostgreSQL 11 introduced fast ALTER TABLE ... ADD COLUMN ... DEFAULT — it stores the default in the system catalog without rewriting the table. And because the column was just created with a non-null default, PostgreSQL knows no NULLs can exist, so the NOT NULL constraint does not trigger a table scan either. The entire operation is catalog-only and completes in milliseconds, regardless of table size.
The expand-contract pattern below is still valuable for PostgreSQL 10 and below, for adding NOT NULL constraints to existing columns (where NULLs may already be present), and for volatile defaults like now() which do rewrite the table even on PG 11+.
# SAFE: Expand-contract pattern. Three migrations, three deploys.
# Migration 1: Add the column as nullable (instant, sub-millisecond lock).
def upgrade():
op.add_column('orders',
sa.Column('priority', sa.Integer(), nullable=True, server_default='0')
)
# Deploy. Let the application write to the new column.
# Backfill existing rows in batches:
# Migration 2: Backfill in batches (no DDL lock at all).
from sqlalchemy import text

def upgrade():
    conn = op.get_bind()
while True:
result = conn.execute(text("""
UPDATE orders SET priority = 0
WHERE priority IS NULL
AND id IN (
SELECT id FROM orders
WHERE priority IS NULL
LIMIT 10000
)
"""))
if result.rowcount == 0:
break
# Migration 3: Add the NOT NULL constraint using NOT VALID + VALIDATE.
def upgrade():
# NOT VALID: adds the constraint without scanning existing rows.
# Only enforces on new writes. Instant. Minimal lock.
op.execute("""
ALTER TABLE orders
ADD CONSTRAINT orders_priority_not_null
CHECK (priority IS NOT NULL) NOT VALID
""")
# VALIDATE CONSTRAINT: scans existing rows but only needs
# ShareUpdateExclusiveLock — does NOT block reads or writes.
op.execute("""
ALTER TABLE orders
VALIDATE CONSTRAINT orders_priority_not_null
    """)

Three deploys instead of one. Three migrations instead of one. This approach remains the right call when you are adding a NOT NULL constraint to a column that already exists and may contain NULLs.
A production-grade backfill
The simple backfill shown above works, but for large tables in high-traffic environments, a more careful approach pays dividends:
# A production-grade backfill with progress logging and throttling.
import time
from sqlalchemy import text
def upgrade():
conn = op.get_bind()
batch_size = 10000
total_updated = 0
pause_between_batches = 0.1 # seconds — reduces replication lag
while True:
result = conn.execute(text("""
WITH batch AS (
SELECT id FROM orders
WHERE priority IS NULL
LIMIT :batch_size
FOR UPDATE SKIP LOCKED
)
UPDATE orders
SET priority = 0
FROM batch
WHERE orders.id = batch.id
"""), {"batch_size": batch_size})
if result.rowcount == 0:
break
total_updated += result.rowcount
conn.execute(text("COMMIT"))
if total_updated % 100000 == 0:
print(f"Backfilled {total_updated:,} rows...")
time.sleep(pause_between_batches)
    print(f"Backfill complete: {total_updated:,} rows updated.")

Three improvements over the simple version. First, FOR UPDATE SKIP LOCKED prevents the backfill from blocking concurrent writes — if a row is currently locked by an application query, the backfill skips it and picks it up on the next batch. Second, explicit COMMIT after each batch ensures the transaction does not accumulate locks across batches. Third, time.sleep(0.1) between batches gives replication a moment to catch up, preventing the backfill from overwhelming replicas with WAL traffic.
A note on batch size: 10,000 rows per batch keeps each UPDATE transaction short, limiting lock hold times and WAL generation. On a table with 50 million rows, this completes in roughly 5,000 iterations — a few minutes. Each batch is its own transaction, so reads and writes continue normally between batches. If you find each batch is taking more than a second, reduce the size. If each batch completes in under 100ms, you can safely increase it.
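That tuning rule lends itself to automation. A sketch of an adaptive batch sizer following the thresholds above; the class name and the halving/doubling policy are my own illustration:

```python
# Sketch: adapt backfill batch size to observed batch duration, following
# the rule of thumb above: shrink if a batch takes more than 1 second,
# grow if it finishes in under 100 milliseconds.

class BatchSizer:
    def __init__(self, initial: int = 10_000,
                 minimum: int = 1_000, maximum: int = 100_000):
        self.size = initial
        self.minimum = minimum
        self.maximum = maximum

    def record(self, batch_seconds: float) -> int:
        if batch_seconds > 1.0:
            self.size = max(self.minimum, self.size // 2)   # too slow: halve
        elif batch_seconds < 0.1:
            self.size = min(self.maximum, self.size * 2)    # fast: double
        return self.size

sizer = BatchSizer()
print(sizer.record(1.8))   # 5000  (slow batch, size halved)
print(sizer.record(0.05))  # 10000 (fast batch, size doubled back)
```

Feed record() the wall-clock duration of each UPDATE and use the returned value as the next LIMIT; the floor and ceiling keep a pathological batch from collapsing to nothing or ballooning.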
Downgrade considerations for expand-contract
The expand-contract pattern requires thoughtful downgrade handling. The downgrade is not simply the reverse of the upgrade:
# IMPORTANT: Expand-contract migrations need careful downgrade handling.
# The downgrade path is not simply the reverse of the upgrade.
# Migration 3 downgrade: remove the constraint only.
def downgrade():
op.execute("""
ALTER TABLE orders
DROP CONSTRAINT IF EXISTS orders_priority_not_null
""")
# Migration 2 downgrade: the backfill is not reversible in a meaningful way.
# The data is already written. Just let it be.
def downgrade():
pass # Backfilled data remains. This is intentional.
# Migration 1 downgrade: drop the column entirely.
def downgrade():
op.drop_column('orders', 'priority')
# The key insight: your downgrade path must leave the database
# in a state the previous application version can work with.
# If the old code doesn't know about the 'priority' column,
# dropping it on downgrade is correct — the column was nullable,
# so the old code never depended on it.

The principle: at every deployment boundary, the application code and the database schema must be mutually compatible. The old code does not know about the new column, so it must be nullable. The new code reads from the new column but falls back gracefully if it is null. This compatibility window is the cost of zero-downtime migrations, and it is a cost worth paying.
How do you safely change a column type?
ALTER COLUMN TYPE is the other major source of migration-induced downtime. When you change a column's type, PostgreSQL rewrites every row in the table to convert the existing data. On a 20-million-row table, this can take 30 seconds to three minutes — all under AccessExclusiveLock.
# BAD: Changing a column type directly.
# PostgreSQL rewrites every row in the table.
def upgrade():
op.alter_column('events', 'payload',
type_=sa.Text(),
existing_type=sa.String(255)
)
# On a table with 20M rows: full table rewrite.
# AccessExclusiveLock held for the entire duration.
# 30-180 seconds of total blockage, depending on table width.

The safe approach uses the same expand-contract pattern: add a new column with the desired type, backfill the data in batches, swap the column names, and drop the old column.
# SAFE: Expand-contract for type changes.
# Add new column, backfill, swap.
# Migration 1: Add the new column with the desired type.
def upgrade():
op.add_column('events',
sa.Column('payload_new', sa.Text(), nullable=True)
)
# Migration 2: Backfill in batches.
def upgrade():
conn = op.get_bind()
while True:
result = conn.execute(text("""
WITH batch AS (
SELECT id FROM events
WHERE payload_new IS NULL
AND payload IS NOT NULL
LIMIT 10000
)
UPDATE events
SET payload_new = payload::text
FROM batch
WHERE events.id = batch.id
"""))
if result.rowcount == 0:
break
# Migration 3: Swap columns.
# (Deploy application code to read from payload_new first.)
def upgrade():
op.alter_column('events', 'payload', new_column_name='payload_old')
op.alter_column('events', 'payload_new', new_column_name='payload')
# Migration 4: Drop the old column (after application is confirmed stable).
def upgrade():
    op.drop_column('events', 'payload_old')

Four migrations and four deploys for what would otherwise be a single-line ALTER COLUMN TYPE. The overhead is real. I would not pretend otherwise. But the alternative — three minutes of total table lockout — is the kind of overhead that results in phone calls at 2am. I know which one I prefer.
There is one exception: some type conversions do not require a table rewrite. Changing varchar(50) to varchar(255) is a catalog-only operation on PostgreSQL 9.2+ — it simply updates the column's type metadata without touching the data. Similarly, casting from varchar to text is free, because they share the same internal representation. The PostgreSQL team maintains a thorough reference on ALTER TABLE behavior that documents which conversions require a rewrite and which do not. When in doubt, rehearse the ALTER on a staging database with realistic data volumes and time it; EXPLAIN does not work on DDL.
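One reliable rehearsal technique: a table rewrite assigns the relation a new relfilenode, so comparing pg_class.relfilenode before and after the ALTER on staging tells you definitively whether a rewrite occurred. A sketch; the helper names and the example relfilenode values are illustrative:

```python
# Sketch: detect a table rewrite on staging by comparing relfilenode values.
# A rewrite gives the relation a new relfilenode; a catalog-only change
# leaves it untouched. Table name 'events' matches the example above.

RELFILENODE_SQL = "SELECT relfilenode FROM pg_class WHERE relname = 'events';"

def was_rewritten(relfilenode_before: int, relfilenode_after: int) -> bool:
    return relfilenode_before != relfilenode_after

# Run RELFILENODE_SQL, run the candidate ALTER, run it again, then compare:
print(was_rewritten(16407, 16407))  # False, catalog-only change
print(was_rewritten(16407, 16952))  # True, the ALTER rewrote the table
```

If the relfilenode is unchanged, the ALTER was catalog-only and safe to run directly in production.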
How do you create indexes without blocking writes?
CREATE INDEX acquires ShareLock, which conflicts with RowExclusiveLock. Every INSERT, UPDATE, and DELETE blocks until the index build finishes. On a 50-million-row table, that can be two minutes of blocked writes.
# BAD: CREATE INDEX inside a migration transaction.
# Acquires ShareLock — blocks all writes for the entire build duration.
def upgrade():
op.create_index('idx_orders_status', 'orders', ['status'])
# On a 50M row table: 30-120 seconds of blocked writes.

CREATE INDEX CONCURRENTLY is the answer, but it requires special handling in Alembic because it cannot run inside a transaction block.
# SAFE: CREATE INDEX CONCURRENTLY — but it requires special handling.
# CONCURRENTLY cannot run inside a transaction block.
from alembic import op
def upgrade():
    # autocommit_block() (Alembic 1.2+) executes its contents outside
    # a transaction, which CONCURRENTLY requires.
    with op.get_context().autocommit_block():
        op.create_index(
            'idx_orders_status', 'orders', ['status'],
            postgresql_concurrently=True
        )
# Or more explicitly:
def upgrade():
op.execute("COMMIT")
op.execute("""
CREATE INDEX CONCURRENTLY IF NOT EXISTS
idx_orders_status ON orders (status)
""")
# Takes 2x longer than regular CREATE INDEX,
# but writes continue uninterrupted throughout.
# Always worth the trade-off in production.

The IF NOT EXISTS clause is important for idempotency. If your migration fails halfway through (the CONCURRENTLY build can fail if there are constraint violations or the process is interrupted), you need to be able to re-run it. Without IF NOT EXISTS, the retry fails with "relation already exists." One caveat: a failed CONCURRENTLY build leaves an invalid index behind, and IF NOT EXISTS considers it to exist — drop the invalid index before retrying, as described in the next section.
Handling failed concurrent index builds
A failed CREATE INDEX CONCURRENTLY leaves an invalid index behind. PostgreSQL marks it as invalid in the catalog but does not remove it. This invalid index is worse than no index at all: it consumes disk space, gets updated on every write, and is never used by the query planner.
-- A failed CREATE INDEX CONCURRENTLY leaves an invalid index behind.
-- PostgreSQL marks it as invalid but does not remove it.
-- Check for invalid indexes:
SELECT indexrelid::regclass AS index_name,
indrelid::regclass AS table_name,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_index
WHERE NOT indisvalid;
-- If your failed index appears here, drop it before retrying:
DROP INDEX CONCURRENTLY IF EXISTS idx_orders_status;
-- Then retry the CREATE INDEX CONCURRENTLY.
-- The invalid index is not used by the planner, but it still
-- consumes disk space and slows down writes — it gets updated
-- on every INSERT/UPDATE even though no query will ever use it.

After cleaning up the invalid index, retry the CREATE INDEX CONCURRENTLY. Common reasons for failure include unique constraint violations (for unique indexes), running out of disk space during the build, or the build process being killed by the OS or a statement_timeout. For unique indexes, verify your data has no duplicates before retrying.
One further note on CONCURRENTLY: it takes approximately twice as long as a regular CREATE INDEX because it performs two passes over the table. The first pass builds the index, and the second pass validates any rows that were modified during the first pass. This is the cost of non-blocking operation, and it is always worth it in production. A 4-minute index build that does not block writes is categorically better than a 2-minute build that does.
The NOT VALID pattern for constraints and foreign keys
Adding a CHECK constraint or FOREIGN KEY normally scans the entire table to verify existing rows satisfy the constraint. That scan holds AccessExclusiveLock. The NOT VALID modifier skips the scan: the constraint is added to the catalog immediately (enforced on new writes only), and validation happens separately with a weaker lock.
# SAFE: Adding a foreign key without blocking writes.
def upgrade():
# Step 1: Add the constraint NOT VALID.
    # ShareRowExclusiveLock on both tables, held only for the
    # catalog update (sub-millisecond).
op.execute("""
ALTER TABLE order_items
ADD CONSTRAINT fk_order_items_order_id
FOREIGN KEY (order_id) REFERENCES orders(id)
NOT VALID
""")
# Step 2: Validate. Scans the table but only holds
# ShareUpdateExclusiveLock — reads and writes continue.
op.execute("""
ALTER TABLE order_items
VALIDATE CONSTRAINT fk_order_items_order_id
    """)

The VALIDATE CONSTRAINT step only acquires ShareUpdateExclusiveLock, which does not conflict with reads (AccessShareLock) or writes (RowExclusiveLock). Normal application traffic continues uninterrupted while PostgreSQL scans the table to verify existing rows.
This pattern works for CHECK constraints, FOREIGN KEY constraints, and NOT NULL constraints (modeled as CHECK constraints, as shown in the expand-contract example above).
A subtlety worth noting: between the NOT VALID addition and the VALIDATE CONSTRAINT step, the constraint is only enforced on new writes. If an existing row violates the constraint, the VALIDATE step will fail with an error identifying the offending row. This is actually desirable — you discover the data quality issue before it can cause an application error. Fix the offending rows, then re-run VALIDATE CONSTRAINT.
The silent killer: idle in transaction
If you will permit me a brief digression on the single most common reason migrations fail to acquire locks in a timely manner. It is not long-running queries. It is not heavy write traffic. It is sessions that opened a transaction, performed a query, and then stopped doing anything at all — while still holding their locks.
-- The silent killer: idle in transaction.
-- This session opened a transaction, ran a SELECT, then... stopped.
-- It still holds locks. It blocks your migration.
-- And it will hold those locks until the connection times out or is killed.
-- Find them:
SELECT pid, usename, application_name,
now() - xact_start AS transaction_age,
now() - state_change AS idle_duration,
query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
AND xact_start < now() - interval '1 minute'
ORDER BY xact_start;
-- Kill them (carefully — verify the PID first):
SELECT pg_terminate_backend(12345);
-- Better: prevent them from accumulating in the first place.
-- In postgresql.conf:
-- idle_in_transaction_session_timeout = '5min'
--
-- Any session idle in a transaction for more than 5 minutes
-- is terminated automatically. Your migration pipeline will thank you.

An idle in transaction session is a connection that called BEGIN, executed one or more statements, and then neither committed nor rolled back. The transaction remains open. The locks remain held. And because the session is idle, it generates no CPU load, no disk I/O, no log entries — nothing to draw your attention. It simply sits there, invisible, holding a lock that blocks your migration.
The most common cause is application code that opens a database transaction and then performs a non-database operation — an HTTP call, a file write, a message queue publish — that takes longer than expected or fails without closing the transaction. Connection pool misconfigurations are another frequent source: a connection is returned to the pool with an open transaction, and the pool does not reset it.
Setting idle_in_transaction_session_timeout in postgresql.conf is the structural fix. Five minutes is aggressive but effective. Any legitimate transaction that runs for five minutes should be redesigned. The sessions you lose to this timeout are invariably the ones that were going to cause problems anyway.
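If postgresql.conf is out of reach, as it often is on managed PostgreSQL services, the same setting can be scoped to a role with ALTER ROLE ... SET. A sketch; the role name app_user is a placeholder, and the helper builds the statement for execution through any client:

```python
# Sketch: scope the timeout to the application's role when you cannot
# edit postgresql.conf. The role name 'app_user' is a placeholder.

def idle_timeout_stmt(role: str = "app_user", timeout: str = "5min") -> str:
    # ALTER ROLE ... SET applies the setting to every new session for that
    # role, overriding the server default without touching postgresql.conf.
    return (f"ALTER ROLE {role} "
            f"SET idle_in_transaction_session_timeout = '{timeout}'")

print(idle_timeout_stmt())
```

Scoping the timeout to the application role also leaves administrative and backup sessions untouched, which is usually what you want.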
A pre-flight check and retry strategy
Even with lock_timeout configured, it is better to avoid lock contention than to recover from it. Run a pre-flight check before deploying migrations.
-- Run this BEFORE deploying a migration.
-- If any of these are active, your migration may queue behind them.
-- Long-running queries on the target table:
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active'
AND pid <> pg_backend_pid()  -- exclude this diagnostic query itself
AND query ILIKE '%orders%'
AND query_start < now() - interval '5 seconds';
-- Idle-in-transaction sessions (the silent killers):
SELECT pid, now() - xact_start AS duration, query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
AND xact_start < now() - interval '30 seconds';
-- If results appear: wait, or terminate with pg_terminate_backend(pid).
-- Never deploy a DDL migration into a crowded lock queue.

If the pre-flight check reveals long-running queries or idle-in-transaction sessions on your target table, you have three options: wait for them to finish naturally, terminate them with pg_terminate_backend(pid), or schedule the migration for a lower-traffic window. The first option is safest. The second is fastest. The third is most courteous to your users. Choose based on the severity of the situation.
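The three options can be encoded directly in a deploy script. A sketch of the decision logic; the Blocker structure, the 30-second threshold, and the action names are my own illustration:

```python
# Sketch: turn pre-flight query results into a go/no-go decision.
# `blockers` would come from the two pg_stat_activity queries above.
from dataclasses import dataclass

@dataclass
class Blocker:
    pid: int
    state: str            # 'active' or 'idle in transaction'
    age_seconds: float

def preflight_decision(blockers: list[Blocker],
                       max_wait_seconds: float = 30.0) -> str:
    if not blockers:
        return "deploy"
    if all(b.age_seconds < max_wait_seconds for b in blockers):
        return "wait"       # young blockers will likely clear on their own
    return "escalate"       # long-lived blocker: terminate or reschedule

print(preflight_decision([]))                                  # deploy
print(preflight_decision([Blocker(101, "active", 8.0)]))       # wait
print(preflight_decision(
    [Blocker(102, "idle in transaction", 300.0)]))             # escalate
```

"escalate" is where a human decides between pg_terminate_backend(pid) and a later deployment window; the script should never terminate sessions unattended.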
And when lock_timeout does its job and your migration fails, retry automatically.
# A migration runner with retry logic for lock_timeout failures.
# Use this in CI/CD or deployment scripts.
import subprocess
import time
import sys

MAX_RETRIES = 5
RETRY_DELAY = 10  # seconds

for attempt in range(1, MAX_RETRIES + 1):
    result = subprocess.run(
        ["alembic", "upgrade", "head"],
        capture_output=True, text=True,
    )
    if result.returncode == 0:
        print(f"Migration succeeded on attempt {attempt}")
        sys.exit(0)
    if "lock timeout" in result.stderr.lower():
        print(f"Attempt {attempt}/{MAX_RETRIES}: lock timeout. "
              f"Retrying in {RETRY_DELAY}s...")
        time.sleep(RETRY_DELAY)
        continue
    # Non-lock error — don't retry
    print(f"Migration failed: {result.stderr}")
    sys.exit(1)

print(f"Migration failed after {MAX_RETRIES} attempts")
sys.exit(1)

The retry script distinguishes between lock timeout failures (transient, worth retrying) and other failures (bugs in the migration, not worth retrying). Five retries with 10-second delays gives the system nearly a minute to find a clean window. In practice, most retries succeed on the second or third attempt.
I should note that the retry delay matters. Too short (1 second) and you retry into the same traffic pattern that caused the first failure. Too long (60 seconds) and your deployment pipeline idles unnecessarily. Ten seconds allows most in-flight transactions to complete and most lock queues to drain. It is not a magic number — it is a reasonable default. Adjust it based on your typical transaction duration.
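If a fixed delay keeps colliding with a periodic job — a cron task that fires every ten seconds will defeat a ten-second retry indefinitely — exponential backoff with jitter spreads the attempts out. A small sketch; the helper is my own, not part of Alembic:

```python
import random

def retry_delay(attempt: int, base: float = 10.0, cap: float = 60.0) -> float:
    """Delay in seconds before retry `attempt` (1-based): exponential growth
    from `base`, capped at `cap`, with full jitter so concurrent deploys do
    not retry in lockstep."""
    return random.uniform(0, min(cap, base * 2 ** (attempt - 1)))
```

Substituting `time.sleep(retry_delay(attempt))` for the fixed `time.sleep(RETRY_DELAY)` in the runner above is the only change required.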
The honest counterpoint: when this is too much ceremony
I should be forthcoming about the limits of this argument, because pretending they do not exist would be a disservice to you and an embarrassment to me.
Not every database needs zero-downtime migrations. If your application serves internal users during business hours, a 30-second maintenance window at 3am is a perfectly acceptable deployment strategy. The expand-contract pattern, the retry logic, the pre-flight checks — these are engineering investments that pay for themselves at scale, but they have real costs: more migration files to manage, more deployment steps to coordinate, more complexity in your CI/CD pipeline.
The threshold, in my observation, is roughly this: if your application has fewer than 100 concurrent database connections and can tolerate 30 seconds of degraded performance during deploys, the simple approach — alembic upgrade head with no ceremony — may serve you well. Set lock_timeout and transaction_per_migration regardless, because those cost nothing. But the expand-contract patterns, the concurrent indexes, the NOT VALID constraints — introduce those when you need them, not before.
The danger is the opposite extreme: a five-person team building three-phase migrations for a table with 10,000 rows. That is not engineering discipline. That is ritual without purpose. The Waiter's role is to ensure the household runs well, not to polish silver that no one will see.
That said — if you are reading this guide, it is likely because you have already experienced the pain of a migration-induced outage. In which case, the ceremony has already justified itself. Carry on.
"The Python ecosystem offers extraordinary tools for working with PostgreSQL — asyncpg for raw performance, SQLAlchemy for expressive querying, Alembic for schema evolution. The discipline is learning which tool serves which moment."
— from You Don't Need Redis, Chapter 7: Python: From psycopg & asyncpg to Production
How Gold Lapel helps during migrations
Everything above is defense: configuring timeouts, splitting migrations, building indexes concurrently. Gold Lapel adds a layer that makes migrations less disruptive by design.
GL sits between your application and PostgreSQL as a transparent proxy. During a migration that locks a table, GL can route read queries to existing materialized views while the base table is briefly locked. When a migration adds an index, GL may have already created it — GL's auto-indexing engine monitors query patterns and creates indexes CONCURRENTLY before you even write the migration.
GL also provides connection pooling that reduces the total number of backend connections, which means fewer locks held concurrently and shorter lock queues when DDL does need to acquire a heavy lock. Fewer connections also mean fewer potential idle-in-transaction sessions — the silent blockers described above.
None of this replaces proper migration hygiene. But it provides a safety net that makes the difference between "migration retry succeeded on attempt 2" and "migration retry succeeded on attempt 5 after paging the on-call engineer."
I should also note that GL is not the only tool that helps here. django-pg-zero-downtime-migrations enforces safe patterns at the framework level for Django. Squawk is a linter for SQL migrations that catches unsafe patterns before they reach production — it works with any framework, including Alembic. strong_migrations does the same for Rails. The tools exist. The question is whether they are part of your pipeline.
A checklist, if you will permit me
For those who prefer their guidance in a form suitable for taping to a monitor:
- Set transaction_per_migration=True in env.py. This is non-negotiable. There is no reason to hold locks across migration boundaries.
- Set lock_timeout before running migrations. 4 seconds is a reasonable starting point. Fail fast rather than block the application.
- Set statement_timeout as a safety net. 30 seconds for DDL. Increase for backfill operations.
- Use CONCURRENTLY for all index creation in production. No exceptions. The 2x build time is a small price for uninterrupted writes.
- Use NOT VALID + VALIDATE for all constraint additions. CHECK constraints, foreign keys, and NOT NULL constraints modeled as CHECK constraints.
- Use expand-contract for NOT NULL columns and type changes. Three migrations is not excessive — it is careful.
- Run pre-flight checks before deploying migrations. Identify long-running queries and idle-in-transaction sessions that could block lock acquisition.
- Automate retries for lock timeout failures. Five retries with 10-second delays covers most transient contention.
- Set idle_in_transaction_session_timeout in postgresql.conf. Five minutes. This prevents the most common source of phantom lock contention.
- Test migrations against realistic data volumes in staging. A migration that takes 5ms on a table with 1,000 rows may take 45 seconds on a table with 50 million rows.
If you implement only the first three items — transaction_per_migration, lock_timeout, and statement_timeout — you will have eliminated the majority of migration-induced outages. The remaining items are for the households that demand true zero-downtime operations under all conditions. Which, if you have read this far, is likely yours.
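For reference, those first three items amount to only a few lines of env.py. A sketch of the online-mode function, assuming a synchronous engine and SQLAlchemy 2.0-style connections; your target_metadata and the rest of your env.py stay as they are:

```python
from alembic import context
from sqlalchemy import engine_from_config, pool

def run_migrations_online() -> None:
    connectable = engine_from_config(
        context.config.get_section(context.config.config_ini_section),
        prefix="sqlalchemy.",
        poolclass=pool.NullPool,
    )
    with connectable.connect() as connection:
        # Fail fast on lock contention, and cap any runaway DDL statement.
        connection.exec_driver_sql("SET lock_timeout = '4s'")
        connection.exec_driver_sql("SET statement_timeout = '30s'")
        connection.commit()  # a session-level SET survives only if committed
        context.configure(
            connection=connection,
            target_metadata=None,  # your MetaData object here
            transaction_per_migration=True,  # commit after each revision
        )
        with context.begin_transaction():
            context.run_migrations()
```

Raise statement_timeout for revisions that backfill data; a 30-second ceiling is sized for DDL, not for rewriting fifty million rows.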
If you'll permit me a brief detour — the Django team faces the same locking hazards with a different set of escape hatches. I have written a parallel guide to zero-downtime Django migrations that covers the expand-contract pattern and CONCURRENTLY workflows from the Django side of the household.