
Zero-Downtime Django Migrations on Million-Row PostgreSQL Tables

The migration itself took 17 seconds. The outage lasted 31. I regret to report the remaining 14 seconds were spent standing in a queue.

The Waiter of Gold Lapel · Published Mar 5, 2026 · Updated Mar 20, 2026 · 32 min read
The east wing is inaccessible. The table is having a column added.

Good evening. Your migration has taken the database hostage.

Not intentionally. You wrote a perfectly ordinary Django model field. You ran makemigrations. You ran migrate. The ORM generated an ALTER TABLE statement that was syntactically correct, semantically reasonable, and operationally disastrous on a table with 5 million rows.

I have seen this particular incident unfold with remarkable consistency. A developer adds an index to a model on a Tuesday afternoon. The migration passes in staging, where the table has 12,000 rows and the entire operation completes in 80 milliseconds. On Friday, the same migration runs against production, where the table has 14 million rows. The operation acquires ShareLock, begins building the index, and the application stops responding to writes. The on-call engineer receives a page. The Slack channel fills with "is the site down?" messages. The migration finishes 47 seconds later. The post-mortem identifies the root cause as "an index was added."

This is not a Django bug. Django generates correct DDL. The problem is that PostgreSQL's lock system does not distinguish between a DDL operation that needs 50 milliseconds and one that needs 50 seconds. Both request the same lock. Both poison the lock queue the moment they start waiting. And both hold that lock for the entire duration of any table scan or index build the operation requires.

The Django documentation mentions AddIndexConcurrently in a brief note. The django-pg-zero-downtime-migrations library exists and is good. But there is no single resource that ties together every dangerous pattern Django generates, what makes each one dangerous, the safe alternative for each, what Django 5.0+'s db_default does and does not fix, and the tooling that exists to catch mistakes before they reach production.

This is that resource. Every pattern. Every code example. Every version-specific caveat. If you are running Django against PostgreSQL with tables that have crossed the million-row threshold, you are in the right room. Allow me to attend to the matter.

How PostgreSQL locks turn a fast DDL into a long outage

Before we examine the individual patterns, it is worth understanding the mechanism that makes all of them dangerous. The issue is not the DDL itself. The issue is the PostgreSQL lock queue.

Every query in PostgreSQL acquires a lock. SELECT acquires AccessShareLock. INSERT, UPDATE, and DELETE acquire RowExclusiveLock. These locks coexist peacefully — thousands of them can be held simultaneously without any conflict whatsoever. That is the normal state of a well-run household.

Then a migration arrives. ALTER TABLE requests AccessExclusiveLock, the strongest lock PostgreSQL offers. It conflicts with every other lock type. Including the humble AccessShareLock held by a simple SELECT.

If three queries are currently running against the table, the ALTER waits. That part is expected. Here is the part that catches people:

Once the AccessExclusiveLock request enters the queue, every subsequent conflicting request — including plain SELECTs — queues behind it, even if those requests would be compatible with the currently granted locks.

-- Your Django migration adds a NOT NULL column.
-- It needs AccessExclusiveLock on the table.
-- Three SELECT queries are running against that table.

-- T+0.0s   SELECT queries running (AccessShareLock held)
-- T+0.0s   ALTER TABLE submitted — waits for AccessExclusiveLock
-- T+0.1s   New SELECT arrives — queued behind pending ALTER TABLE
-- T+0.3s   New INSERT arrives — also queued
-- T+0.8s   More SELECTs — all queued
-- T+14s    Original SELECTs finish
-- T+14s    ALTER TABLE acquires lock, updates catalog
-- T+14.01s Catalog update complete. Lock released. Queued queries execute.

-- The ALTER TABLE needed < 50ms of lock time.
-- The outage lasted ~14 seconds.
-- All of those seconds were just waiting in line.

This is the cascade. The ALTER TABLE needs AccessExclusiveLock, which conflicts with AccessShareLock held by running SELECT queries. The ALTER waits. Now every new query — including plain SELECTs that would normally be fine — queues behind the pending AccessExclusiveLock.

The pending lock poisons the queue.

I cannot stress this sufficiently: the severity of the outage is not determined by how long the ALTER TABLE takes to execute. It is determined by how long the ALTER TABLE has to wait before it can execute. A migration that needs 50 milliseconds of actual work but waits 45 seconds for a long-running analytics query will cause 45 seconds of total blockage. The fast DDL is irrelevant. The wait is the outage.

If an analytics query or a forgotten idle in transaction session holds AccessShareLock for 45 seconds, that is 45 seconds of total blockage. Not because the ALTER needs 45 seconds. Because it has to wait 45 seconds to start. For the full mechanics, see our guide to PostgreSQL lock contention.
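The conflict rules driving this cascade can be written down as a small lookup table. A toy sketch covering only the five lock modes this article discusses, simplified from PostgreSQL's full eight-mode conflict matrix (the function name is mine):

```python
# Toy conflict table for the five lock modes discussed in this article.
# Simplified from PostgreSQL's full 8x8 lock conflict matrix.
CONFLICTS = {
    "AccessShareLock": {"AccessExclusiveLock"},
    "RowExclusiveLock": {"ShareLock", "AccessExclusiveLock"},
    "ShareUpdateExclusiveLock": {"ShareLock", "ShareUpdateExclusiveLock",
                                 "AccessExclusiveLock"},
    "ShareLock": {"RowExclusiveLock", "ShareUpdateExclusiveLock",
                  "AccessExclusiveLock"},
    "AccessExclusiveLock": {"AccessShareLock", "RowExclusiveLock",
                            "ShareUpdateExclusiveLock", "ShareLock",
                            "AccessExclusiveLock"},
}

def conflicts(requested: str, held: str) -> bool:
    """True if a request for `requested` must wait behind `held`."""
    return held in CONFLICTS[requested]
```

Note what the table encodes: AccessExclusiveLock (ALTER TABLE) conflicts with everything, including plain SELECTs, while ShareUpdateExclusiveLock (VALIDATE CONSTRAINT, CREATE INDEX CONCURRENTLY) conflicts with neither reads nor writes. That asymmetry is the entire basis for the safe patterns below.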

The dangerous patterns Django generates

Django's migration framework generates DDL that is syntactically correct at any table size, but operationally safe only on small ones. On million-row tables, several common model changes produce migrations that will lock your database for seconds or minutes. Allow me to present the offenders.

Pattern 1: Adding a NOT NULL field with a default

The single most common source of Django migration outages. I have attended to more incidents caused by this pattern than by all other migration issues combined.

Adding a NOT NULL column with a default
# models.py — the migration Django generates.

class Order(models.Model):
    # Adding this field to a table with 5 million rows:
    priority = models.IntegerField(default=0)

    # Django generates:
    #   ALTER TABLE orders_order ADD COLUMN priority integer DEFAULT 0 NOT NULL;
    #
    # On PostgreSQL 11+, this is actually safe and fast. Adding a new column
    # with DEFAULT x NOT NULL is a catalog-only operation — PostgreSQL knows
    # the column can't contain NULLs because it was just created with a
    # non-null default. Sub-millisecond lock, no table scan.
    #
    # On PostgreSQL 10 and below, this rewrites the entire table.
    # For PG < 11, use the expand-contract pattern shown below.

On PostgreSQL 11 and later, this entire operation is safe and fast. Adding a column with a non-volatile DEFAULT does not rewrite the table — PostgreSQL stores the default value in the system catalog and returns it for existing rows without physically writing anything. And because the column was just created with a non-null default, PostgreSQL knows no NULLs can exist, so the NOT NULL constraint does not trigger a table scan. The entire operation is catalog-only and completes in milliseconds regardless of table size.

On PostgreSQL 10 and below, however, this rewrites the entire table. If you are still on PG 10 or earlier, or if you need to add a NOT NULL constraint to an existing column that may already contain NULLs, the expand-contract pattern below remains essential.

The expand-contract pattern is also necessary for volatile defaults like Now(), which trigger a table rewrite even on PG 11+ because PostgreSQL must evaluate the function for each existing row.

Pattern 2: Adding a regular index

Dangerous: AddIndex on a large table
# Django's AddIndex generates:
#   CREATE INDEX idx_orders_status ON orders_order (status);
#
# ShareLock — blocks all writes for the entire build.
# 50 million rows: 30-120 seconds of blocked writes.

from django.db import migrations, models

class Migration(migrations.Migration):
    operations = [
        migrations.AddIndex(
            model_name="order",
            index=models.Index(fields=["status"], name="idx_orders_status"),
        ),
    ]

Django's AddIndex generates CREATE INDEX, which acquires ShareLock. Every INSERT, UPDATE, and DELETE blocks until the index build finishes. Reads continue uninterrupted, which makes this less severe than the NOT NULL pattern — but on a write-heavy table, 30 to 120 seconds of blocked writes is still an incident.

Django has shipped AddIndexConcurrently in django.contrib.postgres.operations since Django 3.0. It has been available for over six years. There is no reason to use AddIndex on a production table with significant data. And yet I continue to see it in migration files, because makemigrations generates AddIndex by default when you add db_index=True or class Meta: indexes to your model. The safe option exists. The framework does not choose it for you.

Pattern 3: Adding a unique constraint

Dangerous: unique=True on a large table
# This migration generates:
#   ALTER TABLE orders_order ADD CONSTRAINT ... UNIQUE (tracking_number);
#
# PostgreSQL creates a unique index to back the constraint.
# This index build holds ShareLock — blocking all writes.
# On a large table: 30-120 seconds of blocked INSERTs and UPDATEs.

class Order(models.Model):
    tracking_number = models.CharField(max_length=64, unique=True)

Django backs unique constraints with a unique index. Same ShareLock, same write-blocking behavior as a regular index creation. But worse: you cannot just use AddIndexConcurrently, because the constraint also needs to be registered in the catalog. The safe path requires two steps — build the index concurrently, then attach it as a constraint — and Django provides no built-in operation for this. You must write raw SQL.

Pattern 4: Changing a column type

Dangerous: ALTER COLUMN TYPE on a large table
# Dangerous: ALTER COLUMN TYPE rewrites the entire table.
# Every. Single. Row.

# models.py — changing a CharField to a TextField
class Order(models.Model):
    notes = models.TextField()  # was CharField(max_length=255)

# Django generates:
#   ALTER TABLE orders_order ALTER COLUMN notes TYPE text;
#
# PostgreSQL must rewrite every row to update the column metadata.
# AccessExclusiveLock held for the ENTIRE rewrite.
# 5 million rows: 30-120 seconds.
# 50 million rows: 5-20 minutes.
# During which time: no reads, no writes, nothing.

This is the one that catches teams most often, because the model change looks so innocent. You changed CharField(max_length=255) to TextField(). Django generates ALTER COLUMN TYPE. PostgreSQL rewrites every row in the table. On a 50-million-row table, that can take 5 to 20 minutes of AccessExclusiveLock. No reads. No writes. Nothing.

I should note: not all type changes require a rewrite. Changing varchar(100) to varchar(200) is catalog-only on PostgreSQL 9.2+. Changing varchar to text is also catalog-only. But Django does not know this — it generates ALTER COLUMN TYPE regardless, and certain conversions (integer to bigint, for instance) genuinely do rewrite. If you are uncertain whether a particular type change requires a rewrite, test it on a copy of your production data. Do not discover the answer in production.

Safe vs. unsafe: the complete reference table

Not all DDL is equal. Some operations hold heavy locks for milliseconds. Others hold them for minutes. The difference between a smooth deploy and an incident is knowing which is which before the migration reaches production.

| Operation | Django API | Safe? | Lock type | Duration | Notes |
|---|---|---|---|---|---|
| Add nullable column | AddField(null=True) | Yes | AccessExclusiveLock (brief) | < 10ms | Fast on PG 11+. No table rewrite. |
| Add NOT NULL column + DEFAULT (PG 11+) | AddField(default=...) | Yes | AccessExclusiveLock (brief) | < 10ms | Catalog-only on PG 11+. PG knows no NULLs exist in a freshly added column with a non-null default. |
| Add NOT NULL (db_default, Django 5.0+) | AddField(db_default=...) | Yes | AccessExclusiveLock (brief) | < 10ms | Same safe behavior as above on PG 11+. Dangerous only on PG < 11 or with volatile defaults. |
| CREATE INDEX | AddIndex | No | ShareLock | 30-120s (50M rows) | Blocks all writes. Use AddIndexConcurrently. |
| CREATE INDEX CONCURRENTLY | AddIndexConcurrently | Yes | ShareUpdateExclusiveLock | 60-240s (50M rows) | Does not block reads or writes. Requires atomic = False. |
| Add UNIQUE constraint | AddConstraint(UniqueConstraint) | No | ShareLock | Index build time | Builds a unique index under ShareLock. Use CONCURRENTLY + USING INDEX. |
| Add FOREIGN KEY | AddField(ForeignKey) | No | AccessExclusiveLock + ShareRowExclusiveLock | Full table scan | Use NOT VALID + VALIDATE. Validate only needs ShareUpdateExclusiveLock. |
| DROP COLUMN | RemoveField | Yes | AccessExclusiveLock (brief) | < 10ms | Catalog-only. Data stays on disk until VACUUM. |
| ALTER COLUMN TYPE | AlterField | No | AccessExclusiveLock | Full table rewrite | Rewrites every row. Use expand-contract: add new column, backfill, swap. |
| RENAME COLUMN | RenameField | Yes | AccessExclusiveLock (brief) | < 10ms | Catalog-only. Will break running queries using old name. |

The pattern is consistent: anything that only modifies the system catalog (adding a column with a default, dropping a column, renaming) is fast and safe. Anything that scans or rewrites row data (adding NOT NULL to an existing column, index builds, type changes) is slow and dangerous. Commit this distinction to memory. It is the single most important mental model for safe migrations.
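That distinction can be enforced mechanically before code review ever sees the migration. A toy pre-review screen, assuming you walk a migration's operations list; the `UNSAFE_OPERATIONS` set is illustrative and deliberately incomplete, not a real linter:

```python
# Toy pre-review screen: flag Django operation classes that the table
# above marks as unsafe on large tables. Illustrative, not exhaustive.
UNSAFE_OPERATIONS = {"AddIndex", "AlterField", "AddConstraint"}

def flag_unsafe(operations):
    """Return the class names of operations that deserve a closer look."""
    return [
        type(op).__name__
        for op in operations
        if type(op).__name__ in UNSAFE_OPERATIONS
    ]
```

Real tooling for this exists (django-pg-zero-downtime-migrations, covered below, can raise on unsafe operations), but even a crude check like this in CI catches the Tuesday-afternoon AddIndex before it meets the Friday-afternoon production table.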

The expand-contract pattern: safely adding NOT NULL columns

Three migrations. Three deploys. Zero downtime. This replaces the single dangerous AddField with three safe operations. The principle is simple: never combine a data-dependent operation with a schema-level lock. Expand the schema first, populate the data, then contract by adding the constraint.

Step 1: Add the column as nullable

Migration 1: Add nullable column (instant)
# Migration 1: Add the column as nullable. Sub-millisecond lock.

from django.db import migrations, models

class Migration(migrations.Migration):
    dependencies = [("orders", "0042_previous")]

    operations = [
        migrations.AddField(
            model_name="order",
            name="priority",
            field=models.IntegerField(null=True, default=0),
        ),
    ]

# PostgreSQL 11+ handles DEFAULT without a table rewrite.
# The nullable column is added instantly.
# AccessExclusiveLock held for < 10ms.

Deploy this. Update your application code to write to the new column. The column exists, accepts writes, and has a server-side default. No existing rows are touched. AccessExclusiveLock is held for the time it takes to update the system catalog — typically under 10 milliseconds, regardless of table size.

A note on the deployment sequence: your application code must start writing to the new column before the backfill migration runs. Otherwise, rows inserted between the backfill and the constraint enforcement will have null values, and the constraint validation will fail. Deploy the code change first. Then deploy the backfill.

Step 2: Backfill existing rows

Migration 2: Batch backfill (no DDL lock)
# Migration 2: Backfill existing rows in batches. No DDL lock at all.

from django.db import migrations

def backfill_priority(apps, schema_editor):
    Order = apps.get_model("orders", "Order")
    batch_size = 10_000
    while True:
        # Grab a batch of rows that still need backfilling.
        batch_ids = list(
            Order.objects.filter(priority__isnull=True)
            .values_list("id", flat=True)[:batch_size]
        )
        if not batch_ids:
            break
        Order.objects.filter(id__in=batch_ids).update(priority=0)

class Migration(migrations.Migration):
    # Required: without atomic = False, RunPython runs inside a single
    # transaction and the batches below do not commit independently.
    atomic = False

    dependencies = [("orders", "0043_add_priority_nullable")]

    operations = [
        migrations.RunPython(
            backfill_priority,
            reverse_code=migrations.RunPython.noop,
        ),
    ]

# Each batch is its own transaction.
# Reads and writes continue normally between batches.
# 5 million rows / 10,000 per batch = 500 iterations.
# Completes in 1-3 minutes without blocking anything.

Batch size matters. 10,000 rows per batch keeps each UPDATE transaction short, limiting lock hold times and WAL generation. On 5 million rows, this completes in 500 iterations — typically 1 to 3 minutes. Each batch is its own transaction, so reads and writes continue normally between batches.

I should offer a word of caution on batch sizing. Smaller batches (1,000 rows) generate less WAL per transaction and hold row-level locks for shorter durations, but they increase the total number of round trips. Larger batches (50,000 rows) are faster in total wall time but generate more WAL and can cause replication lag on replicas. For most tables, 5,000 to 10,000 rows per batch is the appropriate range. If you have streaming replicas and are concerned about lag, monitor pg_stat_replication during the backfill and reduce batch size if replay_lag exceeds your tolerance.
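When `id` is a dense integer primary key, an alternative to fetching id lists is to walk fixed id ranges, which avoids building large `IN (...)` lists entirely. A sketch of the range generator, written as pure Python so it can be exercised without a database (the helper name is mine):

```python
def id_ranges(min_id: int, max_id: int, batch_size: int):
    """Yield inclusive (lo, hi) id ranges covering [min_id, max_id]."""
    lo = min_id
    while lo <= max_id:
        hi = min(lo + batch_size - 1, max_id)
        yield lo, hi
        lo = hi + 1

# Each range becomes one short UPDATE transaction, e.g.:
#   Order.objects.filter(id__gte=lo, id__lte=hi,
#                        priority__isnull=True).update(priority=0)
```

Ranges with no null rows cost one cheap indexed query, and the planner never has to deal with a ten-thousand-element id list.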

One additional consideration: the filter(priority__isnull=True) query benefits enormously from a partial index on the column. Without an index, each batch requires a sequential scan to find null rows. On a 50-million-row table with only 10,000 null rows remaining in the final batch, that sequential scan is wasteful. Consider adding a partial index before the backfill begins:

CREATE INDEX CONCURRENTLY idx_orders_priority_null ON orders_order (id) WHERE priority IS NULL;

Drop it after the backfill completes. The index exists solely to make the backfill efficient.

Step 3: Enforce NOT NULL with NOT VALID + VALIDATE

Migration 3: Enforce NOT NULL constraint (safe)
# Migration 3: Enforce NOT NULL using the NOT VALID + VALIDATE pattern.

from django.db import migrations

class Migration(migrations.Migration):
    dependencies = [("orders", "0044_backfill_priority")]

    operations = [
        migrations.RunSQL(
            # Step 1: Add a CHECK constraint NOT VALID.
            # Only enforces on new writes. No table scan.
            # AccessExclusiveLock held for < 10ms (catalog update only).
            sql="""
                ALTER TABLE orders_order
                ADD CONSTRAINT orders_order_priority_not_null
                CHECK (priority IS NOT NULL) NOT VALID;
            """,
            reverse_sql="""
                ALTER TABLE orders_order
                DROP CONSTRAINT orders_order_priority_not_null;
            """,
        ),
        migrations.RunSQL(
            # Step 2: Validate the constraint.
            # Scans existing rows but only holds ShareUpdateExclusiveLock.
            # Reads and writes continue uninterrupted.
            sql="""
                ALTER TABLE orders_order
                VALIDATE CONSTRAINT orders_order_priority_not_null;
            """,
            reverse_sql=migrations.RunSQL.noop,
        ),
    ]

# After validation, PostgreSQL knows every row satisfies the constraint.
# The column is effectively NOT NULL, enforced on both new and existing data.

The NOT VALID modifier adds the constraint to the catalog instantly, enforcing it on new writes only. PostgreSQL does not scan existing rows — it trusts you that the data is already clean (which it is, because you just backfilled it). AccessExclusiveLock is held for the catalog update only: under 10 milliseconds.

The VALIDATE CONSTRAINT step then scans existing rows to confirm the constraint holds. But it only acquires ShareUpdateExclusiveLock, which does not conflict with reads (AccessShareLock) or writes (RowExclusiveLock). Normal application traffic continues uninterrupted while PostgreSQL verifies every row.

Three deploys instead of one. Zero downtime instead of 8 to 40 seconds of blockage. The deployment overhead is trivial compared to the cost of an incident. I would rather deploy three migrations at leisure than one migration under pressure.

Indexes and unique constraints: the CONCURRENTLY patterns

Creating indexes without blocking writes

As covered under Pattern 2, Django ships AddIndexConcurrently in django.contrib.postgres.operations, available since Django 3.0. Here is the safe migration in full.

Safe: AddIndexConcurrently
# Safe: CREATE INDEX CONCURRENTLY. Requires atomic = False.

from django.db import migrations, models
from django.contrib.postgres.operations import AddIndexConcurrently

class Migration(migrations.Migration):
    dependencies = [("orders", "0046_tracking_constraint")]

    # Required: CONCURRENTLY cannot run inside a transaction block.
    atomic = False

    operations = [
        AddIndexConcurrently(
            model_name="order",
            index=models.Index(fields=["status"], name="idx_orders_status"),
        ),
    ]

# Takes 2x longer than regular CREATE INDEX.
# But writes continue uninterrupted throughout.
# Always worth the trade-off on production tables.

The atomic = False declaration is mandatory. CREATE INDEX CONCURRENTLY cannot run inside a transaction block — it needs to perform multiple passes over the table, and each pass must see newly committed data. Django will raise an error if you forget this declaration, which is a welcome guardrail.

One caveat deserves emphasis: a failed CREATE INDEX CONCURRENTLY leaves an invalid index behind. The index exists in pg_indexes, consumes space, and is flagged as invalid. PostgreSQL will not use it for queries, but it will slow down writes (because PostgreSQL still attempts to maintain it). If a concurrent index build fails — due to a uniqueness violation, a deadlock, or a cancellation — you must drop the invalid index before retrying:

DROP INDEX CONCURRENTLY IF EXISTS idx_orders_status;

Check for invalid indexes with: SELECT indexrelid::regclass, indisvalid FROM pg_index WHERE NOT indisvalid;
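If you automate retries, build the drop-then-retry dance into the runner itself. A generic sketch with an injected `execute` callable so it can be tested without a database; the function name and retry policy are mine, not part of any library:

```python
def create_index_with_retry(execute, create_sql, drop_sql, attempts=3):
    """Attempt CREATE INDEX CONCURRENTLY up to `attempts` times.

    A failed concurrent build leaves an invalid index behind, so on
    each failure the invalid index is dropped before the next attempt.
    `execute` runs a single SQL statement and raises on failure.
    """
    for _ in range(attempts):
        try:
            execute(create_sql)
            return True
        except Exception:
            # Clean up the invalid index left by the failed build.
            execute(drop_sql)
    return False
```

In production you would narrow the `except` to the errors worth retrying (deadlocks, lock timeouts) and let uniqueness violations fail loudly, since retrying those will never succeed.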

Adding unique constraints safely

Django's UniqueConstraint does not have a concurrent option. You need raw SQL. The technique is to build the unique index concurrently, then attach it as a constraint.

Safe: CONCURRENTLY + USING INDEX for unique constraints
# Safe: Create the index CONCURRENTLY, then add the constraint using it.

from django.db import migrations

class Migration(migrations.Migration):
    dependencies = [("orders", "0045_enforce_priority")]

    # CONCURRENTLY cannot run inside a transaction.
    atomic = False

    operations = [
        migrations.RunSQL(
            # Step 1: Build the unique index without blocking writes.
            # ShareUpdateExclusiveLock — reads and writes continue.
            sql="""
                CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS
                idx_orders_order_tracking_number
                ON orders_order (tracking_number);
            """,
            reverse_sql="""
                DROP INDEX CONCURRENTLY IF EXISTS idx_orders_order_tracking_number;
            """,
        ),
        migrations.RunSQL(
            # Step 2: Attach the index as a constraint.
            # Instant — just a catalog update.
            sql="""
                ALTER TABLE orders_order
                ADD CONSTRAINT orders_order_tracking_number_uniq
                UNIQUE USING INDEX idx_orders_order_tracking_number;
            """,
            reverse_sql="""
                ALTER TABLE orders_order
                DROP CONSTRAINT orders_order_tracking_number_uniq;
            """,
        ),
    ]

Build the unique index concurrently. Then attach it as a constraint with USING INDEX. The attachment is instant — just a catalog update. The index build is the only slow part, and it does not block writes.

The USING INDEX syntax is available since PostgreSQL 9.1. It is one of the most valuable but least-known features for safe migrations. The index must already exist and must be unique. PostgreSQL simply adopts it as the backing index for the constraint. No rebuild. No table scan. A catalog update that completes in milliseconds.

Foreign keys: NOT VALID + VALIDATE

Django's ForeignKey field generates both a column addition and a foreign key constraint in the same migration. On large tables, split them apart.

Safe: Foreign key with NOT VALID + VALIDATE
# Safe: Adding a foreign key without blocking reads or writes.

from django.db import migrations

class Migration(migrations.Migration):
    dependencies = [("orders", "0047_add_status_index")]

    operations = [
        migrations.RunSQL(
            # NOT VALID: adds the constraint without scanning existing rows.
            # AccessExclusiveLock held briefly for catalog update only.
            sql="""
                ALTER TABLE orders_orderitem
                ADD CONSTRAINT fk_orderitem_order_id
                FOREIGN KEY (order_id) REFERENCES orders_order(id)
                NOT VALID;
            """,
            reverse_sql="""
                ALTER TABLE orders_orderitem
                DROP CONSTRAINT fk_orderitem_order_id;
            """,
        ),
        migrations.RunSQL(
            # VALIDATE: scans existing rows under ShareUpdateExclusiveLock.
            # Reads and writes continue normally.
            sql="""
                ALTER TABLE orders_orderitem
                VALIDATE CONSTRAINT fk_orderitem_order_id;
            """,
            reverse_sql=migrations.RunSQL.noop,
        ),
    ]

The VALIDATE CONSTRAINT step scans the table under ShareUpdateExclusiveLock, which does not conflict with reads (AccessShareLock) or writes (RowExclusiveLock). Your application continues serving traffic while PostgreSQL verifies referential integrity.

One detail that catches teams: when the foreign key references a heavily-written parent table, the NOT VALID step acquires a brief ShareRowExclusiveLock on the referenced table as well. On most tables this is sub-millisecond. But if the referenced table is itself the target of another concurrent migration, you may see lock contention. Sequence your migrations accordingly — one table at a time.

Safely changing column types: the expand-contract alternative

As noted earlier, ALTER COLUMN TYPE can rewrite the entire table. The safe alternative avoids the ALTER entirely: add a new column with the desired type, backfill it, then swap.

Safe: Expand-contract for column type changes
# Safe: Expand-contract for column type changes.
# Add new column, backfill, swap at the application layer.

from django.db import migrations, models

# Migration 1: Add the new column alongside the old one.
class Migration(migrations.Migration):
    dependencies = [("orders", "0048_previous")]

    operations = [
        migrations.AddField(
            model_name="order",
            name="notes_new",
            field=models.TextField(null=True),
        ),
    ]

# Deploy. Update application code to write to BOTH columns:
#   order.notes = value
#   order.notes_new = value
#
# Migration 2: Backfill notes_new from notes (batched, same as before).
# Migration 3: Update application code to read from notes_new.
# Migration 4: Drop the old column.
#
#   migrations.RemoveField(model_name="order", name="notes")
#   migrations.RenameField(model_name="order", old_name="notes_new", new_name="notes")
#
# Four deploys. Zero table rewrites. Zero lock-related downtime.
# The column type was never altered — it was replaced.

This is more work than a single AlterField. Four migrations instead of one. Application code that writes to both columns during the transition. But the alternative is 5 to 20 minutes of AccessExclusiveLock on a large table, which is not an alternative at all — it is an incident.

For the specific case of varchar to text or increasing varchar length, PostgreSQL can handle these without a rewrite. You can use RunSQL with the ALTER directly. But for anything involving a change in the underlying storage type (integer to bigint, for instance), the expand-contract pattern is the only safe option.
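During the transition window, the dual write is easy to forget at individual call sites, so it is worth centralizing in the model's save(). A minimal sketch on a plain class so it runs standalone; in a real project this would be your Django model, and save() would call super().save():

```python
class Order:
    """Stand-in for a Django model during the notes -> notes_new swap."""

    def __init__(self, notes=""):
        self.notes = notes
        self.notes_new = None

    def save(self):
        # Transition-period dual write: keep the new column in sync
        # until every reader has moved to notes_new and the old
        # column is dropped.
        self.notes_new = self.notes
        # ... in Django, super().save() would then persist both columns
```

Once migration 4 drops the old column, this override is deleted along with it. The override exists only for the weeks the two columns coexist.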

Connection pool exhaustion: the secondary cascade

There is a failure mode that receives insufficient attention, and it is often more severe than the lock queue itself. When queries queue behind a pending AccessExclusiveLock, they hold their database connections while waiting. Those connections are borrowed from your application's connection pool — whether that is Django's built-in pool, pgBouncer, or another pooler. As more queries queue, more connections are consumed. Eventually, the pool runs dry.

How lock queuing exhausts your connection pool
-- The cascade failure: lock queue exhausts your Django connection pool.

-- Setup: CONN_MAX_AGE = 600, pool of 20 connections via pgbouncer.
-- Traffic: 300 req/s, each holding a connection for ~15ms.

-- T+0.0s   Migration starts. AccessExclusiveLock requested on orders_order.
-- T+0.0s   Pending lock begins queuing every query that touches orders_order.

-- T+0.5s   12 connections now stuck in the lock queue.
-- T+1.0s   18 connections blocked. 2 remaining for all other requests.
-- T+1.3s   Pool exhausted. Django raises OperationalError across ALL views.

-- Now you have two queues:
-- 1. PostgreSQL lock queue (queries waiting for the ALTER TABLE)
-- 2. Django connection pool queue (requests waiting for a free connection)
--
-- Error logs:
--   django.db.utils.OperationalError:
--   connection pool exhausted, timeout expired

-- The migration has not even started.
-- The application is down because every connection is stuck
-- waiting for a lock that is waiting for an analytics query
-- that someone left running in pgAdmin 40 minutes ago.

Now your application is failing on every route, not just the ones that touch the locked table. A request to /api/users/me that has nothing to do with the orders table cannot get a database connection because all connections are stuck waiting for a lock on orders_order. The migration has not even started executing. The application is down because every connection is waiting for a lock that is waiting for an analytics query that someone started 40 minutes ago.

More connections in the pool is not the solution. More connections means more PostgreSQL backends, which means higher memory usage, more context switching, and degraded performance for everyone. The correct architecture is to manage connection lifecycle at the proxy level, where blocked connections do not consume your application's pool slots.

"The Python ecosystem offers extraordinary tools for working with PostgreSQL. The discipline is learning which tool serves which moment — and ensuring that the tools do not work against each other during the moments that matter most."

— from You Don't Need Redis, Chapter 7: Python: From psycopg & asyncpg to Production

The silent killer: idle in transaction

If the lock queue is the gun, the idle-in-transaction session is the finger on the trigger. I have investigated more migration outages caused by abandoned transactions than by any other single factor.

The scenario is always the same. A developer opens a psql session, runs BEGIN; then SELECT * FROM orders_order LIMIT 5;, and goes to lunch. That session is now holding AccessShareLock on the orders table. It will hold that lock until the transaction is committed, rolled back, or the session is terminated. There is no timeout by default. The lock will persist for hours.

When a migration arrives and requests AccessExclusiveLock, it waits behind this idle session. And everything else queues behind the migration. One abandoned psql session. One migration. Total application downtime.

Finding and preventing idle-in-transaction sessions
-- The silent killer: idle in transaction sessions.
-- These hold locks without doing any work.

-- Find them:
SELECT pid, now() - xact_start AS idle_duration,
       now() - state_change AS since_last_activity,
       query AS last_query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND xact_start < now() - interval '30 seconds'
ORDER BY xact_start;

-- Prevent them — add to postgresql.conf or per-role:
ALTER ROLE django_app SET idle_in_transaction_session_timeout = '60s';

-- After 60 seconds of inactivity within a transaction,
-- PostgreSQL terminates the session automatically.
-- The most common cause: an unhandled exception in Django
-- that leaves a transaction open, or a developer who ran
-- BEGIN in psql and went to get coffee.

Set idle_in_transaction_session_timeout to 60 seconds on all non-administrative database roles. This single setting prevents the most common cause of migration-related outages in Django applications. It is a one-line configuration change per role. If you take one thing from this article before reading any further, let it be this.

The Django ORM itself can contribute to idle-in-transaction problems. If a view raises an exception after opening a transaction but before the middleware commits or rolls back, the connection may be returned to the pool in an idle-in-transaction state. Django's ATOMIC_REQUESTS setting mitigates this by wrapping each request in a transaction that is automatically rolled back on exception, but if you manage transactions manually with transaction.atomic() or transaction.set_autocommit(), ensure your error handling is airtight.
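The "airtight" requirement can be illustrated without Django at all. The sketch below uses a stand-in connection class — FakeConnection and airtight are illustrative names, not Django APIs — to show the invariant that transaction.atomic enforces: an escaping exception must always trigger a rollback before control leaves the block.

```python
from contextlib import contextmanager

class FakeConnection:
    """Stand-in for a real DB connection (illustration only)."""
    def __init__(self):
        self.state = "idle"
    def begin(self):
        self.state = "in transaction"
    def commit(self):
        self.state = "idle"
    def rollback(self):
        self.state = "idle"

@contextmanager
def airtight(conn):
    """Guarantee the transaction never outlives the block."""
    conn.begin()
    try:
        yield conn
        conn.commit()
    except BaseException:
        conn.rollback()   # never leave the session idle-in-transaction
        raise

conn = FakeConnection()
try:
    with airtight(conn):
        raise RuntimeError("view blew up mid-transaction")
except RuntimeError:
    pass
print(conn.state)  # "idle" -- rolled back, not stuck in transaction
```

Catching BaseException rather than Exception matters here: a KeyboardInterrupt or a worker shutdown signal must also trigger the rollback, or the session is abandoned mid-transaction exactly like the psql-and-lunch scenario.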

Tooling: django-pg-zero-downtime-migrations and lock_timeout

The library

django-pg-zero-downtime-migrations patches Django's schema editor to use safe DDL patterns automatically. It converts regular index creation to CONCURRENTLY, uses NOT VALID + VALIDATE for constraints, and sets lock_timeout before DDL operations.

django-pg-zero-downtime-migrations setup
# pip install django-pg-zero-downtime-migrations

# settings.py — the library ships as a database backend, not an app:
DATABASES = {
    "default": {
        "ENGINE": "django_zero_downtime_migrations.backends.postgres",
        # ... the rest of your connection settings
    }
}

# What it does:
# - Patches Django's schema editor to use safe DDL patterns
# - CREATE INDEX becomes CREATE INDEX CONCURRENTLY
# - ADD CONSTRAINT uses NOT VALID + VALIDATE automatically
# - Sets lock_timeout before DDL operations
# - Adds statement_timeout as a safety net

# Configuration:
ZERO_DOWNTIME_MIGRATIONS_LOCK_TIMEOUT = "4s"
ZERO_DOWNTIME_MIGRATIONS_STATEMENT_TIMEOUT = "30s"
ZERO_DOWNTIME_MIGRATIONS_RAISE_FOR_UNSAFE = True  # Fail on unsafe ops

# Limitations:
# - Cannot fix the NOT NULL + DEFAULT pattern automatically.
#   You still need manual expand-contract for that.
# - Does not handle batched backfills.
# - Does not cover all edge cases (ALTER COLUMN TYPE, etc.).
# - Excellent for catching the common mistakes.
#   Not a substitute for understanding the patterns.

The RAISE_FOR_UNSAFE setting is the most valuable part. It makes your CI pipeline fail when a migration generates unsafe DDL, catching mistakes before they reach production. The library handles roughly 70% of dangerous patterns automatically. The remaining 30% — NOT NULL enforcement on large tables, batched backfills, column type changes — still require manual intervention.
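To make the idea concrete, here is a toy version of the kind of check a RAISE_FOR_UNSAFE-style guard performs. The patterns and the unsafe_reasons function are my illustration, not the library's implementation — the real tool inspects Django schema-editor operations, not raw SQL strings.

```python
import re

# Toy lint rules mapping dangerous DDL shapes to a human-readable reason.
UNSAFE_PATTERNS = [
    (r"\bCREATE\s+(UNIQUE\s+)?INDEX\b(?!\s+CONCURRENTLY\b)",
     "index build without CONCURRENTLY"),
    (r"\bADD\s+CONSTRAINT\b(?!.*\bNOT\s+VALID\b)",
     "constraint added without NOT VALID"),
    (r"\bALTER\s+COLUMN\b.*\bTYPE\b",
     "column type change (full table rewrite)"),
    (r"\bSET\s+NOT\s+NULL\b",
     "NOT NULL enforcement (table scan on older PostgreSQL)"),
]

def unsafe_reasons(sql: str) -> list:
    """Return a reason string for every unsafe pattern found in the DDL."""
    flat = " ".join(sql.split()).upper()
    return [why for pattern, why in UNSAFE_PATTERNS if re.search(pattern, flat)]

print(unsafe_reasons("CREATE INDEX idx ON orders_order (status)"))
# ['index build without CONCURRENTLY']
print(unsafe_reasons("CREATE INDEX CONCURRENTLY idx ON orders_order (status)"))
# []
```

Wired into CI, a non-empty return value fails the build, which is precisely the behaviour RAISE_FOR_UNSAFE gives you without writing any of this yourself.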

I should be forthcoming about the library's limitations, because pretending they do not exist would be a disservice. It does not automate the three-step expand-contract pattern for NOT NULL columns. It does not handle batched data migrations. It does not solve ALTER COLUMN TYPE rewrites. And it cannot control what happens to your live traffic during the lock acquisition window. These are not flaws in the library — they are limitations of what a schema editor patch can accomplish. The library is excellent at what it does. What it does is catch the 70% of mistakes that have straightforward automated fixes. The remaining 30% requires understanding the patterns and applying them by hand.
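Batched backfills are one of those gaps. The chunking arithmetic itself is simple; the sketch below shows it in isolation. id_ranges is an illustrative helper and the UPDATE template is an assumption about your schema — the point is that each range runs as its own short transaction, so no lock is held for long.

```python
def id_ranges(max_id: int, batch_size: int):
    """Yield half-open (start, end) primary-key ranges for batched UPDATEs."""
    start = 1
    while start <= max_id:
        yield start, min(start + batch_size, max_id + 1)
        start += batch_size

# Each range becomes one short transaction, e.g. (template, not real schema):
#   UPDATE orders_order SET priority = 0
#   WHERE id >= %s AND id < %s AND priority IS NULL;
print(list(id_ranges(max_id=10, batch_size=4)))  # [(1, 5), (5, 9), (9, 11)]
```

Range scans on the primary key keep each batch cheap; a `WHERE priority IS NULL LIMIT n` loop without key ranges degrades as the backfill progresses, because every iteration rescans rows it already updated.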

lock_timeout: the safety net

Configuring lock_timeout
# settings.py — set lock_timeout to fail fast instead of blocking the world.

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "mydb",
        "OPTIONS": {
            "options": "-c lock_timeout=4000 -c statement_timeout=30000",
        },
    }
}

# Or set it per-migration for finer control:

from django.db import migrations

class Migration(migrations.Migration):
    operations = [
        migrations.RunSQL("SET lock_timeout = '4s';"),
        # ... your DDL operations ...
    ]

# If the lock cannot be acquired in 4 seconds:
#   django.db.utils.OperationalError:
#   canceling statement due to lock timeout
#
# The migration fails. The queue drains. You retry.
# This is vastly preferable to a 45-second outage.

Without lock_timeout, a migration that cannot acquire its lock waits indefinitely. While waiting, it poisons the lock queue. Every subsequent query on that table blocks. The application goes down, and the migration has not even started.

4 seconds is a reasonable starting point. Long enough that a migration succeeds during normal traffic. Short enough that a failed lock acquisition does not take the application down. If your p99 query time on the target table is 2 seconds, set lock_timeout to at least 4 seconds — roughly 2x your slowest normal query. If your queries regularly exceed 5 seconds, you have a query performance problem that should be addressed independently.
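The sizing rule above reduces to one line of arithmetic. lock_timeout_ms is a hypothetical helper, shown only to pin the rule down: twice the slowest normal query, with a 4-second floor.

```python
def lock_timeout_ms(p99_query_seconds: float) -> int:
    """2x the slowest normal query on the table, never below 4 seconds."""
    return int(max(4.0, 2 * p99_query_seconds) * 1000)

print(lock_timeout_ms(2.0))  # 4000 -- the starting point recommended above
print(lock_timeout_ms(3.5))  # 7000 -- slower queries push the timeout up
```

The value goes into the `-c lock_timeout=...` option in milliseconds, which is why the helper converts; a common mistake is passing seconds where PostgreSQL expects a bare number of milliseconds.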

When a migration fails due to lock timeout, that is the system working correctly. The queue drains immediately. The application recovers. You retry during a quieter moment. A failed migration is an inconvenience. A 45-second outage is an incident.

A retry strategy for lock_timeout failures

When lock_timeout causes a migration to fail — and it will, that is its purpose — you need a retry strategy. Manual retries work, but they require someone to watch the deploy. Automated retries with exponential backoff are more civilised.

Management command: migrate with retry
# management/commands/migrate_with_retry.py
# A management command that retries migrations on lock timeout.

from django.core.management import call_command
from django.core.management.base import BaseCommand, CommandError
from django.db.utils import OperationalError
import time

class Command(BaseCommand):
    help = "Run migrations with retry on lock timeout"

    def handle(self, *args, **options):
        max_attempts = 5
        base_delay = 5

        for attempt in range(max_attempts):
            try:
                call_command("migrate", verbosity=1)
                self.stdout.write(
                    f"Migration succeeded on attempt {attempt + 1}"
                )
                return
            except OperationalError as e:
                if "lock timeout" in str(e).lower():
                    if attempt + 1 < max_attempts:
                        delay = base_delay * (2 ** attempt)
                        self.stdout.write(
                            f"Lock timeout on attempt {attempt + 1}. "
                            f"Retrying in {delay}s..."
                        )
                        time.sleep(delay)
                    else:
                        raise CommandError(
                            f"Migration failed after {max_attempts} "
                            f"attempts: {e}"
                        )
                else:
                    raise

# Usage: python manage.py migrate_with_retry
#
# Attempts: 5s, 10s, 20s, 40s, then fails.
# Total window: ~75 seconds to find a clean lock acquisition.
# In practice, most retries succeed on attempt 2 or 3.

Five attempts with exponential backoff gives the system roughly 75 seconds to find a clean window. In practice, most retries succeed on the second or third attempt. The lock queue that blocked attempt 1 drains within seconds once the timeout fires, and attempt 2 catches a gap between long-running queries.

If all 5 attempts fail, something structural is wrong. Either a long-running query is holding AccessShareLock persistently (check for idle-in-transaction sessions), or traffic is so consistently heavy that no clean window exists. In that case, schedule the migration for a low-traffic period — but deploy your application code first, so the migration is the only change happening during the window.

Pre-flight checks

Pre-flight: check for blockers before deploying
-- Run BEFORE deploying a migration against a large table.
-- If any of these return rows, your migration will queue.

-- Long-running queries on the target table:
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active'
  AND pid <> pg_backend_pid()   -- exclude this very query
  AND query ILIKE '%orders_order%'
  AND query_start < now() - interval '5 seconds';

-- Idle-in-transaction sessions (the silent killers):
SELECT pid, now() - xact_start AS duration, query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND xact_start < now() - interval '30 seconds';

-- Current locks on the target table:
SELECT l.mode, l.granted, a.query, a.state,
       now() - a.query_start AS duration
FROM pg_locks l
JOIN pg_class c ON c.oid = l.relation
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE c.relname = 'orders_order';

-- If results appear: wait, or terminate with pg_terminate_backend(pid).
-- Never deploy DDL into a crowded lock queue.

Run these queries before deploying any DDL migration against a large table. If results appear, wait for the long-running queries to finish or terminate idle transactions. Never deploy DDL into a crowded lock queue.
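Wired into a deploy script, the decision rule looks like the sketch below. The row dictionaries mirror the columns of the queries above; preflight_verdict and its field names are illustrative assumptions, not a standard API.

```python
def preflight_verdict(active_rows, idle_rows):
    """Decide whether DDL may deploy, and which sessions may be terminated.

    active_rows / idle_rows model the result sets of the two
    pg_stat_activity pre-flight queries, as lists of dicts.
    """
    # Idle-in-transaction sessions do no useful work; they are the
    # candidates for pg_terminate_backend(pid).
    terminate = [row["pid"] for row in idle_rows]
    # Long-running active queries should be waited out, not killed.
    ok_to_deploy = not active_rows and not idle_rows
    return ok_to_deploy, terminate

ok, kill = preflight_verdict(
    active_rows=[],
    idle_rows=[{"pid": 4242, "last_query": "BEGIN"}],
)
print(ok, kill)  # False [4242] -- terminate 4242, re-check, then deploy
```

The asymmetry is deliberate: terminating an idle transaction loses nothing, while terminating an active query loses someone's work. The script kills only the former and waits for the latter.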

The idle-in-transaction sessions deserve particular attention. They are the most common cause of migration-induced outages, and they are the most easily preventable. A developer's abandoned psql session should not have the power to take down your production application. The idle_in_transaction_session_timeout setting exists to prevent exactly this.

What Django 5.0+ db_default does and does not fix

Django 5.0 introduced db_default, which tells PostgreSQL to set column defaults server-side rather than having Django inject them at the application layer. This is a genuine improvement — one I welcomed when it arrived. But it does not solve the problem that brings most teams to this article.

Django 5.0+ db_default
# Django 5.0+ introduces db_default — server-side defaults
# without requiring a full table scan for NOT NULL enforcement.

from django.db import models
from django.db.models.functions import Now

class Order(models.Model):
    # db_default tells PostgreSQL to set the default, not Django.
    priority = models.IntegerField(db_default=0)

    # Works with expressions too:
    created_at = models.DateTimeField(db_default=Now())

# The migration Django generates:
#   ALTER TABLE orders_order
#   ADD COLUMN priority integer DEFAULT 0 NOT NULL;
#
# On PostgreSQL 11+, this is entirely catalog-only (fast).
# The DEFAULT is stored in the catalog, and PostgreSQL knows
# no NULLs exist in a freshly added column with a non-null default.
#
# db_default produces the same safe, instant DDL on PG 11+.
# The three-step pattern is needed for PG < 11 or adding
# NOT NULL to existing columns that may contain NULLs.

Here is what db_default changes:

  • New tables: No difference. The table has zero rows. Any DDL is instant.
  • Adding a nullable column with a default: Marginal improvement. The DEFAULT was already fast on PG 11+. db_default makes it slightly cleaner in the catalog.
  • Adding a new NOT NULL column with a constant default: Already safe on PG 11+. As the generated DDL above shows, the operation is catalog-only with or without db_default. The scan problem belongs to a different operation — adding NOT NULL to an existing column — and db_default does not change that.
  • Volatile defaults: Significant improvement. Without db_default, Django would set each row's value individually. With db_default, PostgreSQL handles it natively. One nuance: PostgreSQL 11's catalog-only fast path applies only to non-volatile defaults. now() is STABLE, so Now() qualifies, but a truly volatile default such as random() or gen_random_uuid() still forces a full table rewrite.

db_default is a welcome addition to Django's toolkit. It is not a solution for zero-downtime migrations on large tables. The three-step expand-contract pattern remains necessary whenever you enforce NOT NULL on an existing column in a table with meaningful data. I mention this not to diminish the feature — it solves real problems in other contexts — but to prevent the misunderstanding that "Django 5.0 fixed migration locking." It did not. The NOT NULL scan persists.

The honest counterpoints

I have spent considerable effort outlining the dangers of Django's default migration patterns and the safe alternatives. Honesty requires that I also outline the costs. A waiter who overstates his case is no waiter at all.

The expand-contract pattern is genuinely more complex. Three migrations instead of one. Coordinated deploys. Application code that writes to both the old and new column during the transition period. This complexity creates new categories of bugs: a developer forgets to update the write path, or the backfill migration runs before the code is deployed. These are not theoretical risks. They happen, and on small teams without extensive deploy infrastructure, the complexity of the safe pattern can itself become a source of incidents.

Most tables do not need this. On PostgreSQL 11+, adding a new NOT NULL column with a non-volatile default is instant regardless of table size. The expand-contract pattern is necessary when adding NOT NULL to an existing column that may contain NULLs, or when running on PG 10 and below. For small tables (hundreds of thousands of rows), even a full scan completes in under a second. Do not apply the three-migration pattern to every model change — that is ceremony without purpose.

AddIndexConcurrently is slower. A concurrent index build takes roughly twice as long as a regular CREATE INDEX. On a 50-million-row table, that might mean 4 minutes instead of 2 minutes. During those 4 minutes, the index build consumes I/O bandwidth, CPU, and WAL throughput. On an already-saturated database server, the concurrent build can degrade query performance for the entire duration. The writes are not blocked, but they may be slower.

The NOT VALID + VALIDATE pattern has a gap. Between adding the constraint as NOT VALID and completing the VALIDATE step, PostgreSQL enforces the constraint on new writes but has not yet verified existing rows. If your backfill had a bug and some rows still violate the constraint, those rows will not be caught until VALIDATE runs. If VALIDATE fails, you must fix the data and re-validate. This is usually a minor inconvenience, but on very large tables where VALIDATE takes 10 minutes, it can extend the migration window.

django-pg-zero-downtime-migrations adds a dependency to your migration path. If the library has a bug in its schema editor patches — and it has had bugs, because all software does — your migrations may behave differently than you expect. The library is actively maintained and well-tested, but it interposes itself between Django's ORM and your database. That is a position of considerable responsibility. Review the library's changelogs before upgrading, and test migrations against a copy of your production schema.

These are real trade-offs. The safe patterns are still the right choice for large tables. But applying them blindly to every migration is itself a form of cargo cult engineering. Know your table sizes. Know your traffic patterns. Apply the appropriate level of caution.

How Gold Lapel helps during schema migrations

Everything above is defense: splitting migrations, building indexes concurrently, setting timeouts. These techniques reduce the danger. They do not eliminate it. The lock queue still forms. Read traffic still queues during lock acquisition. Connection pools still absorb the pressure. Gold Lapel addresses the gap that application-level tooling cannot reach.

GL integrates with your Django application through a single package: pip install goldlapel-django. Swap your ENGINE in DATABASES settings to goldlapel.django and GL sits between Django and PostgreSQL, speaking the wire protocol natively. No query changes. No middleware to configure.

During a migration that locks a table, GL can route read queries to existing materialized views while the base table is briefly locked. Your users see stale-by-seconds data instead of connection timeouts. When the lock releases, traffic returns to the base table automatically. The migration completes. The reads never queued. The user experience was uninterrupted.

GL also provides connection pooling that addresses the secondary cascade described earlier. When queries queue behind a lock, those connections are managed at the proxy level — not by your application's pool. Your Django process never runs out of connections. Routes that do not touch the locked table continue functioning normally. The lock queue is contained to the table that is being migrated, which is where it belongs.

On the indexing side, GL's auto-indexing engine monitors query patterns and creates indexes CONCURRENTLY before you even write the migration. If GL has already created the index your migration needs, the migration becomes a no-op instead of a 2-minute index build.

None of this replaces proper migration hygiene. The expand-contract pattern, AddIndexConcurrently, NOT VALID + VALIDATE — you still need all of it. But GL provides the safety net that makes the difference between "migration retry succeeded on attempt 2" and "3 AM page to the on-call engineer." The migration patterns keep the DDL safe. GL keeps the traffic safe. Both are necessary. Neither is sufficient alone.

The recommendation

If you have read this far, you have the full picture. Allow me to distill it into the actions that matter most, in order of priority.

Set lock_timeout today. Before anything else. 4 seconds, on the database role used for migrations. This single setting converts unbounded outages into bounded, retryable failures. It is the highest-value, lowest-effort change available to you.

Set idle_in_transaction_session_timeout to 60 seconds on all application database roles. The abandoned psql session is the most common trigger for migration incidents, and it is the easiest to prevent.

Install django-pg-zero-downtime-migrations with RAISE_FOR_UNSAFE = True. It catches 70% of dangerous patterns automatically and makes CI fail when unsafe DDL is generated.

Use AddIndexConcurrently for every index creation on production tables. There is no exception to this rule for tables with more than a few thousand rows.

Use the expand-contract pattern for NOT NULL columns on any table over a million rows. Three migrations, three deploys, zero downtime. The alternative is an incident.

Use NOT VALID + VALIDATE for foreign keys and check constraints on large tables. The two-step pattern is always safe. The one-step pattern is safe only on small tables.

Never deploy DDL against a crowded lock queue. Run the pre-flight checks. Terminate idle-in-transaction sessions. Wait for long-running queries to finish. The migration can wait. Your users cannot.

The patterns in this article are not novel. They are documented in the PostgreSQL manual, discussed on mailing lists, implemented in libraries. What they are not, typically, is assembled into a single reference with Django-specific code examples and honest discussion of the trade-offs. That is what I have attempted to provide. The household runs smoothly when the staff know their duties. The database migrates smoothly when the engineer knows the lock types.

If you'll follow me, there are related matters regarding lock contention that may warrant your attention. But for now, the table is set, the migration is safe, and I believe the immediate matter has been attended to.

Terms referenced in this article

The lock contention that migrations create is a specific instance of a broader PostgreSQL pattern. I have prepared a complete guide to PostgreSQL lock contention — detection queries, lock type hierarchies, advisory lock alternatives, and the prevention checklists that keep migrations from being the only time you think about locking.