Zero-Downtime Rails Migrations on PostgreSQL: Why strong_migrations Alone Isn't Enough
Three gems. Three different strategies. The same blind spot: none of them can move your read traffic while the table is locked.
Good evening. Your migration appears to be blocking the household.
You installed strong_migrations. You read the README. You changed add_index to use algorithm: :concurrently. You felt responsible and well-prepared.
Then you deployed AddNotesToOrders at 2:47 PM on a Wednesday, and your error tracker lit up with 400 timeout errors in 3 seconds. The migration itself completed in 4 milliseconds. The outage lasted 2.1 seconds. And you could not understand how a 4ms operation caused a 2,100ms disruption.
The answer is the PostgreSQL lock queue. And it is the one thing that no Ruby gem — not strong_migrations, not safe-pg-migrations, not pg_ha_migrations — can fix. Because the problem is not the DDL statement. The problem is what happens to every other query while the DDL statement waits for permission to run.
I should be forthright about the scope of what follows. This is not a comparison of three gems and a recommendation to pick one. This is a guide to the PostgreSQL locking mechanics that all three gems operate within — mechanics that determine whether your "zero-downtime migration" actually achieves zero downtime or merely reduces the downtime to a duration you hope nobody notices.
Allow me to walk through the mechanics, compare what each gem actually does, demonstrate the patterns that the gems leave to you, and explain the gap that remains after you have installed all three.
The lock queue: why 4ms of DDL causes 2 seconds of downtime
Every PostgreSQL query acquires a lock. SELECTs acquire AccessShareLock. INSERTs and UPDATEs acquire RowExclusiveLock. These locks coexist peacefully. Thousands of reads and writes run concurrently without conflict.
Then a migration arrives. ALTER TABLE needs AccessExclusiveLock — the strongest lock in PostgreSQL. It conflicts with every other lock type. Including the humble AccessShareLock held by a running SELECT.
If 12 queries are currently running against the table, the ALTER waits. That part is expected. Here is the part that catches people:
Once the AccessExclusiveLock request enters the queue, every subsequent query — including plain SELECTs — queues behind it.
The pending lock poisons the queue. New SELECTs that would normally run instantly are blocked, not by the running queries, but by the waiting ALTER TABLE ahead of them in line. PostgreSQL's lock queue is fair — it processes requests in order. A pending exclusive lock has priority over all subsequent shared locks. This prevents starvation of DDL operations, which is the right engineering trade-off for a general-purpose database. It is also the trade-off that takes down your Rails application.
-- What actually happens during a Rails migration on a busy table.
-- T+0.000s Your migration runs: ALTER TABLE orders ADD COLUMN notes text;
-- PostgreSQL requests AccessExclusiveLock on "orders"
-- But 12 queries currently hold AccessShareLock.
-- ALTER waits.
-- T+0.002s New SELECT * FROM orders WHERE id = 71923
-- Would normally be instant. But a pending AccessExclusiveLock
-- is ahead in the queue. This SELECT waits too.
-- T+0.005s INSERT INTO orders (customer_id, total) VALUES (8, 149.00)
-- Needs RowExclusiveLock. Conflicts with pending AccessExclusive.
-- Queued.
-- T+0.008s Another SELECT. Queued.
-- T+0.012s Another INSERT. Queued.
-- T+0.019s UPDATE orders SET status = 'shipped' WHERE id = 71801
-- Queued.
-- T+0.050s ...
-- T+0.200s ...
-- T+2.100s The original 12 queries finish.
-- T+2.100s ALTER TABLE acquires lock, runs in 4ms, releases.
-- T+2.104s All queued queries execute.
-- 2.1 seconds of complete table blockage.
-- The ALTER TABLE itself took 4 milliseconds.
-- Every request that touched "orders" during those 2.1 seconds
-- returned 2+ seconds slower than normal.
-- At 200 req/s, that is 420 requests affected.

420 requests affected. Because of a 4ms column addition. The migration did nothing wrong. The lock acquisition did nothing wrong. The queue did exactly what it was designed to do. The problem is that nobody was managing the read traffic during those 2.1 seconds.
I should note that the queue length is a function of two variables: the time the lock spends waiting, and the query rate against the table. A table that receives 10 queries per second and a 500ms lock wait produces 5 queued queries — barely noticeable. A table that receives 500 queries per second and a 2-second lock wait produces 1,000 queued queries. Same lock. Same migration. Dramatically different outcomes. The gem cannot know your query rate. The gem cannot know your lock wait time. The gem cannot route those 1,000 reads elsewhere.
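Those two variables make the blast radius easy to estimate before you deploy. A back-of-envelope sketch in plain Ruby (the helper name is mine; the numbers are the scenarios above):

```ruby
# Rough estimate of lock-queue impact: queries that arrive while the
# DDL waits for its lock all queue behind it.
def queued_queries(queries_per_second:, lock_wait_seconds:)
  (queries_per_second * lock_wait_seconds).round
end

queued_queries(queries_per_second: 10,  lock_wait_seconds: 0.5) # => 5, barely noticeable
queued_queries(queries_per_second: 500, lock_wait_seconds: 2.0) # => 1000, an outage
queued_queries(queries_per_second: 200, lock_wait_seconds: 2.1) # => 420, the opening incident
```

The estimate ignores queries that time out or are retried by the client, so treat it as a floor, not a ceiling.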
Visualizing the queue in real time
If you suspect a migration is stuck waiting for a lock, the following query shows you the queue as it forms. I recommend running this in a separate psql session before starting any migration against a table that receives meaningful traffic.
-- Visualizing the lock queue with pg_locks.
-- Run this in a separate psql session during a migration.
SELECT
l.pid,
l.locktype,
l.mode,
l.granted,
a.state,
a.wait_event_type,
now() - a.query_start AS waiting_since,
left(a.query, 80) AS query_preview
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.relation = 'orders'::regclass
ORDER BY l.granted DESC, a.query_start;
-- pid | locktype | mode | granted | state | waiting_since | query_preview
-- ------+----------+---------------------+---------+--------+---------------+--------------
-- 1201 | relation | AccessShareLock | t | active | 00:00:04.2 | SELECT * FROM orders WHERE...
-- 1202 | relation | AccessShareLock | t | active | 00:00:02.8 | SELECT count(*) FROM orders...
-- 1340 | relation | AccessExclusiveLock | f | active | 00:00:01.1 | ALTER TABLE orders ADD...
-- 1203 | relation | AccessShareLock | f | active | 00:00:00.9 | SELECT * FROM orders WHERE...
-- 1204 | relation | RowExclusiveLock | f | active | 00:00:00.7 | INSERT INTO orders...
-- 1205 | relation | AccessShareLock | f | active | 00:00:00.4 | SELECT * FROM orders WHERE...
--
-- PIDs 1201-1202: granted=true. Running normally. Blocking the ALTER.
-- PID 1340: granted=false. The ALTER TABLE. Waiting for 1201/1202.
-- PIDs 1203-1205: granted=false. Innocent bystanders. Queued behind 1340.
--
-- This is the queue. Kill 1201 and 1202, and 1340 acquires instantly.
-- Or wait. But while you wait, 1203-1205 multiply.

The granted column tells the story. Rows with granted = true are currently holding locks — these are the queries your migration is waiting for. Rows with granted = false are the casualties — queries that arrived after your migration and are now stuck behind it. If you see the second group growing, your application is experiencing the queue cascade.
Lock mechanics and detection are covered more thoroughly in the PostgreSQL lock contention guide, if you wish to go further.
What each gem actually does — and does not do
The Rails ecosystem has three serious gems for migration safety. Each takes a different approach to the same problem. All three are good. None of them is complete. And I find that the differences between them are less interesting than the gap they all share.
strong_migrations: the linter
Andrew Kane's strong_migrations is a static analyzer. It examines your migration at generation time and warns you about unsafe operations. It does not modify the migration — it tells you what to change.
# Gemfile
gem 'strong_migrations'
# This migration triggers a strong_migrations warning:
class AddIndexToOrdersStatus < ActiveRecord::Migration[7.1]
def change
add_index :orders, :status
end
end
# strong_migrations says:
#
# === Dangerous operation detected ===
# Adding an index non-concurrently blocks writes.
# Use:
# def change
# add_index :orders, :status, algorithm: :concurrently
# end
#
# Good advice. You add algorithm: :concurrently.
# Crisis averted — for the index.
#
# But the 200 SELECT queries hitting that table every second
# while the migration acquires AccessExclusiveLock?
# strong_migrations has nothing to say about those.

strong_migrations catches the most common mistakes: non-concurrent indexes, adding NOT NULL columns directly, changing column types in place. It is the most widely adopted of the three (roughly 4,200 GitHub stars) and the easiest to add to an existing project. One line in your Gemfile. No configuration required.
What it does not do, at least by default: set lock_timeout, retry failed migrations, or modify how migrations execute at runtime. (Recent versions expose opt-in StrongMigrations.lock_timeout and StrongMigrations.statement_timeout settings, but nothing is set unless you configure it.) It is otherwise purely advisory. If you ignore the warning, the unsafe migration runs. And if your migration is technically "safe" (like adding a nullable column) but encounters a lock queue, strong_migrations has nothing to say.
I want to be fair to strong_migrations here, because its design philosophy is deliberately minimal. Andrew Kane's position is that the gem should teach, not automate. When strong_migrations warns you, it shows you the correct code. You learn the pattern. You understand why algorithm: :concurrently matters. The next time, you write it correctly without the warning. There is real value in that pedagogical approach. It produces teams that understand PostgreSQL locking, not teams that depend on a gem to silently fix their mistakes.
That said, a team that understands PostgreSQL locking and has runtime protection is safer than a team that merely understands it.
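If you want that runtime protection without leaving strong_migrations behind, recent releases expose opt-in timeout settings. A sketch of an initializer, assuming a version that supports these options (check the README for your installed release; the start_after timestamp is a placeholder):

```ruby
# config/initializers/strong_migrations.rb
# Opt-in runtime timeouts -- available in recent strong_migrations
# releases; verify against the version in your Gemfile.lock.
StrongMigrations.lock_timeout = 10.seconds
StrongMigrations.statement_timeout = 1.hour

# Only analyze migrations created after this timestamp (placeholder
# value); earlier migrations are grandfathered in.
StrongMigrations.start_after = 20240101000000
```

With these set, the gem applies the timeouts when migrations run, which closes part of the runtime gap described above while keeping its advisory character.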
safe-pg-migrations: the runtime rewriter
safe-pg-migrations from Doctolib takes the opposite approach. Instead of warning you, it rewrites your migrations at runtime to be safer. add_index becomes add_index ... algorithm: :concurrently automatically. It sets lock_timeout to 5 seconds by default. It retries lock acquisition up to 3 times.
# Gemfile
gem 'safe-pg-migrations'
# safe-pg-migrations automatically:
# - Creates indexes CONCURRENTLY
# - Sets lock_timeout (default: 5s)
# - Sets statement_timeout (default: 1h)
# - Retries lock acquisition on timeout (3 attempts)
class AddNotNullPriorityToOrders < ActiveRecord::Migration[7.1]
def change
add_column :orders, :priority, :integer, default: 0, null: false
end
end
# safe-pg-migrations rewrites this as:
# 1. ADD COLUMN priority integer DEFAULT 0 (nullable, fast)
# 2. ALTER TABLE orders ALTER COLUMN priority SET NOT NULL
#
# Step 2 still scans the entire table to validate.
# On a 40M row orders table, that is 20-60 seconds
# of AccessExclusiveLock.
#
# safe-pg-migrations does NOT use the NOT VALID + VALIDATE
# two-phase pattern. The lock is shorter than raw Rails,
# but it is not zero-downtime on large tables.

This is genuine protection. The lock_timeout prevents indefinite blocking. The automatic retry handles transient contention. The concurrent indexing is correct by default.
Where it falls short: the NOT NULL safety. safe-pg-migrations rewrites add_column ... null: false into two steps (add nullable, then set NOT NULL), but the second step still performs a full table scan under AccessExclusiveLock. It does not use the NOT VALID + VALIDATE CONSTRAINT pattern that would allow the validation scan to run under the weaker ShareUpdateExclusiveLock. On tables with tens of millions of rows, this is the difference between 20ms of blocking and 45 seconds of blocking.
I should also note a subtlety about safe-pg-migrations' retry behavior. When a lock_timeout fires and the migration fails, safe-pg-migrations retries the lock acquisition, not the entire migration. If the migration had already executed some DDL statements before the timeout, those statements may have already committed (if not wrapped in a transaction) or rolled back (if they were). For multi-statement migrations, you need to understand whether each statement is idempotent. ADD COLUMN IF NOT EXISTS is. ADD COLUMN is not.
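Rails can express that idempotency directly. A sketch, assuming a Rails version new enough to support the if_not_exists: option on add_column (check the API docs for your version):

```ruby
class AddNotesToOrdersIdempotently < ActiveRecord::Migration[7.1]
  disable_ddl_transaction!

  def change
    # Safe to re-run after a lock_timeout failure: if the first attempt
    # committed the column before dying, this becomes a no-op instead
    # of raising PG::DuplicateColumn.
    add_column :orders, :notes, :text, if_not_exists: true
  end
end
```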
pg_ha_migrations: the gatekeeper
pg_ha_migrations from Braintree is the most opinionated. It blocks unsafe operations entirely and forces you to use explicitly named safe/unsafe methods.
# Gemfile
gem 'pg_ha_migrations'
# pg_ha_migrations from Braintree.
# Blocks unsafe operations entirely:
#
# PgHaMigrations::UnsafeMigrationError:
# add_column with a default is NOT SAFE.
# Use safe_add_column instead.
class AddPriorityToOrders < ActiveRecord::Migration[7.1]
def up
safe_add_column :orders, :priority, :integer
unsafe_change_column_null :orders, :priority, false, 0
end
end
# safe_add_column: adds nullable column, no default (instant).
# unsafe_change_column_null: sets NOT NULL with a backfill value.
#
# The "unsafe_" prefix is honest — it still acquires
# AccessExclusiveLock during the NOT NULL validation.
# pg_ha_migrations does not automate the NOT VALID pattern.
# It forces you to acknowledge the risk. That is valuable.
# It does not eliminate the risk.

The unsafe_ prefix is genuinely useful. It makes risk visible in code review. Nobody can approve unsafe_change_column_null without at least considering what "unsafe" means. pg_ha_migrations also sets lock_timeout and retries automatically, similar to safe-pg-migrations.
What it does not automate: the expand-contract pattern with NOT VALID constraints. Like safe-pg-migrations, the NOT NULL enforcement still acquires AccessExclusiveLock for the full table scan. And like both other gems, it has no mechanism for managing read traffic during lock acquisition.
Braintree's engineering team built pg_ha_migrations for their payment processing infrastructure, where the consequences of a migration-related outage are measured in dollars per second. The unsafe_ naming convention emerged from real incidents. I respect the honesty of it. Calling something unsafe_change_column_null is a better safety mechanism than any amount of automatic rewriting — it forces the developer to stop and think.
The comparison
| Gem | Approach | lock_timeout | Auto retry | Concurrent indexes | NOT NULL safety | Read routing |
|---|---|---|---|---|---|---|
| strong_migrations | Static analysis | No (manual) | No | Warns; you fix | Warns; you fix | No |
| safe-pg-migrations | Runtime rewrite | Yes (5s default) | Yes (3 attempts) | Automatic | Partial — no NOT VALID | No |
| pg_ha_migrations | Block + rename | Yes (configurable) | Yes | Automatic | Partial — forces acknowledgment | No |
Notice the last column. All three cells say "No." That is the gap.
An honest counterpoint: are the gems enough for most teams?
I would be a poor waiter indeed if I did not address this directly. For a Rails application with a single PostgreSQL server, moderate traffic (under 100 requests per second), and tables under 10 million rows — strong_migrations plus a 4-second lock_timeout is likely sufficient. The lock queue forms for milliseconds, drains quickly, and the latency spike is imperceptible to users.
The gap I am describing becomes painful at scale. Tables with hundreds of millions of rows. Applications serving hundreds or thousands of requests per second. Tables that are the target of analytics queries holding locks for seconds at a time. If your orders table has 5 million rows and you deploy during a quiet period, you may never encounter the queue cascade at all. The gems did their job. I would not suggest otherwise.
But if you have ever seen 400 timeout errors from a migration that "should have been instant" — you are in the territory where the gap matters.
The lock conflict matrix for Rails developers
Not all DDL locks are equal. Understanding which operations block which queries is the difference between a confident deploy and a crossed-fingers deploy.
| Lock needed | DDL operation | Blocked by | Blocks everything? |
|---|---|---|---|
| AccessExclusiveLock | ALTER TABLE, DROP TABLE | Any query — SELECT, INSERT, UPDATE, DELETE | Yes |
| ShareLock | CREATE INDEX | INSERT, UPDATE, DELETE (RowExclusiveLock) | Writes only |
| ShareUpdateExclusiveLock | CREATE INDEX CONCURRENTLY, VALIDATE CONSTRAINT | Other DDL, VACUUM | No — reads and writes continue |
| RowExclusiveLock | INSERT, UPDATE, DELETE | AccessExclusiveLock, ShareLock | No — coexists with other row locks |
| AccessShareLock | SELECT | AccessExclusiveLock only | No — the weakest lock |
AccessExclusiveLock is the nuclear option. Every ALTER TABLE — even adding a nullable column, which completes in under a millisecond — acquires it. The lock hold time varies enormously. Adding a nullable column holds it for microseconds. Adding a NOT NULL constraint holds it for the duration of a full table scan.
The key insight: ShareUpdateExclusiveLock (used by CREATE INDEX CONCURRENTLY and VALIDATE CONSTRAINT) does not block reads or writes. Every unsafe migration operation has a safe equivalent that uses this weaker lock. The gems automate some of these rewrites. None of them automate all of them.
I have included the full five-level matrix above because I find that most Rails developers are aware of AccessExclusiveLock and AccessShareLock but have not internalized the middle three levels. RowExclusiveLock — the lock held by every INSERT, UPDATE, and DELETE — coexists with other RowExclusiveLock acquisitions. This is why thousands of concurrent writes do not block each other. But it conflicts with ShareLock, which is why a regular CREATE INDEX blocks writes. And it coexists with ShareUpdateExclusiveLock, which is why CREATE INDEX CONCURRENTLY does not. These are not academic distinctions. They are the reason your migration works or does not.
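One way to internalize the matrix is to encode it. A sketch in plain Ruby, covering only the five modes from the table above (the full PostgreSQL matrix has eight; the names and structure here are mine):

```ruby
# Simplified lock conflict matrix for the five modes discussed above.
# Derived from the table; conflicts are symmetric.
CONFLICTS = {
  access_exclusive:       %i[access_share row_exclusive share share_update_exclusive access_exclusive],
  share:                  %i[row_exclusive share_update_exclusive access_exclusive],
  share_update_exclusive: %i[share share_update_exclusive access_exclusive],
  row_exclusive:          %i[share access_exclusive],
  access_share:           %i[access_exclusive]
}.freeze

def conflicts?(a, b)
  CONFLICTS.fetch(a).include?(b)
end

conflicts?(:access_share, :access_exclusive)        # => true: SELECT queues behind ALTER TABLE
conflicts?(:row_exclusive, :share)                  # => true: INSERT blocked by plain CREATE INDEX
conflicts?(:row_exclusive, :row_exclusive)          # => false: concurrent writes coexist
conflicts?(:row_exclusive, :share_update_exclusive) # => false: writes continue under CONCURRENTLY
```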
The expand-contract pattern: actual zero-downtime NOT NULL
This is the pattern the gems should automate but do not. Three migrations across two deploys. Zero seconds of table-wide blocking.
# The zero-downtime way. Three migrations across two deploys.
# Migration 1: Add the column as nullable (instant, sub-ms lock)
class AddPriorityToOrders < ActiveRecord::Migration[7.1]
disable_ddl_transaction!
def change
add_column :orders, :priority, :integer, default: 0
# PG 11+ adds the DEFAULT without rewriting the table.
# Existing rows read as 0 without being physically updated.
end
end
# Deploy. Application writes to the new column.
# Migration 2: Backfill in batches (no DDL lock at all)
class BackfillOrderPriority < ActiveRecord::Migration[7.1]
disable_ddl_transaction!
def up
# On PG 11+ with the default above, existing rows already read as 0,
# so this pass is a safety net rather than a true backfill. It matters
# when the column was added without a default, or on PG 10 and earlier.
Order.unscoped.in_batches(of: 10_000) do |batch|
batch.where(priority: nil).update_all(priority: 0)
end
end
end
# Migration 3: Add NOT NULL via CHECK constraint (NOT VALID + VALIDATE)
class AddNotNullConstraintToOrdersPriority < ActiveRecord::Migration[7.1]
disable_ddl_transaction!
def up
# NOT VALID: instant, only enforces on new writes
safety_assured do
execute <<~SQL
ALTER TABLE orders
ADD CONSTRAINT orders_priority_not_null
CHECK (priority IS NOT NULL) NOT VALID
SQL
end
# VALIDATE: scans rows but only holds ShareUpdateExclusiveLock
# Reads and writes continue uninterrupted.
execute <<~SQL
ALTER TABLE orders
VALIDATE CONSTRAINT orders_priority_not_null
SQL
end
def down
execute <<~SQL
ALTER TABLE orders
DROP CONSTRAINT IF EXISTS orders_priority_not_null
SQL
end
end

Three migrations instead of one. Two deploys instead of one. In exchange, zero lock-related disruptions, regardless of table size. A 500-million-row table gets the same deploy experience as a 5,000-row table.
The NOT VALID + VALIDATE CONSTRAINT pattern is documented in the PostgreSQL ALTER TABLE reference, but it is underused in Rails shops because none of the three safety gems generate it automatically. You have to write it by hand. This is worth the effort for any table over a million rows.
A note on PostgreSQL 12 and later: ALTER TABLE ... SET NOT NULL skips the full-table scan when an existing validated constraint already proves the column contains no NULLs. That gives the CHECK constraint approach a clean final step: after VALIDATE CONSTRAINT completes, SET NOT NULL becomes a near-instant catalog change, and the now-redundant CHECK constraint can be dropped, leaving a genuine native NOT NULL. On PostgreSQL 11 and earlier, SET NOT NULL always rescans the table, so the pattern stops at the validated CHECK constraint: it enforces correctness on every write, though the column remains nominally nullable in the catalog.
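On PostgreSQL 12 and later, the finishing migration looks like this. A sketch (the constraint name is carried over from the example above; safety_assured assumes strong_migrations is installed):

```ruby
class FinalizeOrdersPriorityNotNull < ActiveRecord::Migration[7.1]
  def up
    # PG 12+: the validated CHECK constraint proves the column has no
    # NULLs, so SET NOT NULL is a catalog-only change -- a brief
    # AccessExclusiveLock, no table scan.
    safety_assured do
      execute "ALTER TABLE orders ALTER COLUMN priority SET NOT NULL"
      # The CHECK constraint is now redundant; drop it.
      execute "ALTER TABLE orders DROP CONSTRAINT orders_priority_not_null"
    end
  end

  def down
    change_column_null :orders, :priority, true
  end
end
```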
The PostgreSQL mechanics are identical across migration frameworks — only the syntax differs.
CREATE INDEX CONCURRENTLY: getting it right in Rails
Index creation is the single most common migration that teams get wrong. A non-concurrent CREATE INDEX on a 50-million-row table holds ShareLock for 30 to 120 seconds. During that time, every INSERT, UPDATE, and DELETE on the table is blocked. Reads continue — but on a table that receives writes, 120 seconds of write blocking is a full outage.
# CREATE INDEX CONCURRENTLY — the right way in Rails.
#
# Regular CREATE INDEX acquires ShareLock on the table.
# ShareLock blocks all writes (INSERT, UPDATE, DELETE).
# On a 50M row table, index creation takes 30-120 seconds.
# That is 30-120 seconds of no writes.
# WRONG (even though it looks reasonable):
class AddIndexToOrdersCustomerId < ActiveRecord::Migration[7.1]
def change
add_index :orders, :customer_id
# Rails wraps this in a transaction.
# CREATE INDEX inside a transaction holds the ShareLock
# for the entire transaction duration.
end
end
# RIGHT:
class AddIndexToOrdersCustomerId < ActiveRecord::Migration[7.1]
disable_ddl_transaction!
# ^^^ Critical. CREATE INDEX CONCURRENTLY cannot run inside
# a transaction. disable_ddl_transaction! tells Rails not to
# wrap this migration in BEGIN/COMMIT.
def change
add_index :orders, :customer_id, algorithm: :concurrently
# CREATE INDEX CONCURRENTLY acquires ShareUpdateExclusiveLock.
# Reads continue. Writes continue. Nothing blocks.
# Takes longer (2-3x) because it does two passes over the table.
# But zero downtime.
end
end
# GOTCHA: If CONCURRENTLY fails partway through, the index is
# left in an INVALID state. PostgreSQL does not clean it up.
# Check for invalid indexes:
#
# SELECT indexrelid::regclass, indisvalid
# FROM pg_index
# WHERE NOT indisvalid;
#
# Drop the invalid index and try again:
# DROP INDEX CONCURRENTLY idx_orders_customer_id;
#
# safe-pg-migrations handles this automatically.
# strong_migrations warns but you write the fix yourself.

The disable_ddl_transaction! directive is critical and easy to forget. CREATE INDEX CONCURRENTLY cannot run inside a transaction — PostgreSQL will reject it outright. Without disable_ddl_transaction!, Rails wraps every migration in BEGIN/COMMIT, and your concurrent index creation fails with a cryptic error about transactions.
safe-pg-migrations handles this automatically — it detects add_index and rewrites it to use algorithm: :concurrently outside a transaction. strong_migrations warns you but leaves the fix to you. If you are using strong_migrations alone, the disable_ddl_transaction! and algorithm: :concurrently combination must become muscle memory.
The invalid index gotcha is worth dwelling on. If CREATE INDEX CONCURRENTLY fails partway through — due to a uniqueness violation, a statement timeout, or a crash — PostgreSQL leaves behind an invalid index. This index is not used for queries but is updated on every write. It is pure overhead. Query pg_index for indisvalid = false after every concurrent index operation. Drop the invalid index and retry. safe-pg-migrations does this cleanup automatically, which is one of its more valuable behaviors.
Column renames and type changes: the migrations that ruin afternoons
Some migrations are technically instant at the PostgreSQL level but catastrophically unsafe at the application level. Column renames are the worst offender.
# Renaming a column without downtime.
# This is one of the most dangerous Rails migrations — and one of
# the most common. Rails does not handle this gracefully.
# WRONG: This causes immediate downtime.
class RenameOrdersStatusToState < ActiveRecord::Migration[7.1]
def change
rename_column :orders, :status, :state
# PostgreSQL: ALTER TABLE orders RENAME COLUMN status TO state
# Acquires AccessExclusiveLock. Instant. Sub-millisecond.
# But: every query referencing "status" immediately fails.
# SELECT status FROM orders => ERROR: column "status" does not exist
# Your running application servers are still using "status."
# 100% error rate until the new code deploys.
end
end
# RIGHT: Expand-contract across three deploys.
# Deploy 1: Add the new column, dual-write.
class AddStateToOrders < ActiveRecord::Migration[7.1]
def change
add_column :orders, :state, :string
# Nullable, no lock drama. Sub-millisecond.
end
end
# Application code: write to BOTH columns.
# Order.after_save { update_column(:state, status) }
# Deploy 2: Backfill and swap reads.
class BackfillOrderState < ActiveRecord::Migration[7.1]
disable_ddl_transaction!
def up
Order.in_batches(of: 10_000) do |batch|
batch.where(state: nil).update_all("state = status")
end
end
end
# Application code: read from "state", write to both.
# Deploy 3: Drop the old column.
class RemoveStatusFromOrders < ActiveRecord::Migration[7.1]
def change
safety_assured { remove_column :orders, :status }
end
end
# Application code: only references "state" now.

The RENAME COLUMN DDL itself acquires AccessExclusiveLock for microseconds. The lock is not the problem. The problem is that every running application server is still referencing the old column name. The moment the rename completes, every query that uses the old name fails. This is not a gradual degradation — it is an instant 100% error rate that persists until new application code deploys to every server.
The expand-contract pattern for renames requires three deploys, not one. That feels expensive. It is. But the alternative — a 100% error rate for the duration of a rolling deploy — is more expensive. I have seen teams attempt to rename a column and deploy new code simultaneously, racing to minimize the error window. This is not engineering. This is gambling. The house always wins, eventually.
Type changes: the table rewrite you did not expect
# Changing a column type — the most dangerous common migration.
# strong_migrations correctly blocks this. Here is why.
# WRONG: Direct type change.
class ChangeOrdersTotalToBigint < ActiveRecord::Migration[7.1]
def change
change_column :orders, :total, :bigint
# PostgreSQL: ALTER TABLE orders ALTER COLUMN total TYPE bigint
# This REWRITES THE ENTIRE TABLE.
# Every row is read, converted, and written back.
# AccessExclusiveLock held for the ENTIRE rewrite.
# 50M rows? 3-10 minutes of complete table lockout.
end
end
# RIGHT: Add new column, migrate data, swap.
class ChangeOrdersTotalToBigintSafely < ActiveRecord::Migration[7.1]
def up
# Step 1: Add new column (instant)
add_column :orders, :total_new, :bigint
# Step 2: Backfill (batched, no DDL lock)
# Run this in a separate migration or rake task.
end
end
# Application code: dual-write to total and total_new.
# Backfill remaining rows.
# Swap reads to total_new.
# Drop total, rename total_new to total.
#
# Five steps instead of one line.
# Zero downtime instead of 10 minutes of downtime.
# This is the trade-off. Always take it for tables over 1M rows.

Changing a column type triggers a full table rewrite. Every row is read, the value is cast to the new type, and the row is written back. For a 50-million-row table, this takes 3 to 10 minutes. AccessExclusiveLock is held for the entire duration. strong_migrations correctly blocks this operation. The expand-contract alternative — adding a new column, backfilling, swapping — is more work but infinitely less destructive.
I should note that some type changes are safe without a rewrite. Changing varchar(100) to varchar(200) (increasing the length) does not rewrite the table on PostgreSQL 9.2 and later. Changing varchar to text is also instant. But changing integer to bigint, text to integer, or any change that alters the physical storage format requires a rewrite. When in doubt, test in a staging environment and observe whether pg_locks shows a sustained AccessExclusiveLock.
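The "test in staging" advice can be made mechanical: a table rewrite assigns the table a new on-disk file, visible as a changed relfilenode in pg_class. A sketch of a staging-console helper (the method name is mine; run it only against a throwaway database, since it executes the DDL):

```ruby
# Returns true if executing the DDL rewrote the table.
# A rewrite gives the table a fresh relfilenode in pg_class.
def rewrites_table?(table, ddl)
  conn = ActiveRecord::Base.connection
  node = lambda do
    conn.select_value(
      "SELECT relfilenode FROM pg_class WHERE relname = #{conn.quote(table)}"
    )
  end
  before = node.call
  conn.execute(ddl)
  node.call != before
end

# In a staging console:
# rewrites_table?("orders", "ALTER TABLE orders ALTER COLUMN total TYPE bigint")
#   integer -> bigint changes the storage format: expect a rewrite.
# rewrites_table?("orders", "ALTER TABLE orders ALTER COLUMN notes TYPE text")
#   varchar -> text is binary-compatible: expect no rewrite.
```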
Foreign keys: the NOT VALID pattern applies here too
Adding a foreign key constraint is another operation that is safe for small tables and devastating for large ones. The same NOT VALID + VALIDATE CONSTRAINT pattern that rescues NOT NULL constraints works identically for foreign keys.
# Adding a foreign key constraint without downtime.
# Another operation that is deceptively dangerous.
# WRONG: Direct add_foreign_key.
class AddForeignKeyOrdersToCustomers < ActiveRecord::Migration[7.1]
def change
add_foreign_key :orders, :customers
# PostgreSQL: ALTER TABLE orders
# ADD CONSTRAINT fk_rails_... FOREIGN KEY (customer_id)
# REFERENCES customers (id)
#
# This scans every row in "orders" to validate the constraint.
# AccessExclusiveLock held for the entire scan.
# Same problem as NOT NULL — table size determines outage duration.
end
end
# RIGHT: NOT VALID + VALIDATE, same as NOT NULL.
class AddForeignKeyOrdersToCustomersSafely < ActiveRecord::Migration[7.1]
disable_ddl_transaction!
def up
# Phase 1: Add constraint as NOT VALID (instant, sub-ms lock)
safety_assured do
execute <<~SQL
ALTER TABLE orders
ADD CONSTRAINT orders_customer_id_fk
FOREIGN KEY (customer_id) REFERENCES customers (id)
NOT VALID
SQL
end
# Phase 2: Validate (scans table, but only ShareUpdateExclusiveLock)
execute <<~SQL
ALTER TABLE orders
VALIDATE CONSTRAINT orders_customer_id_fk
SQL
end
def down
execute <<~SQL
ALTER TABLE orders
DROP CONSTRAINT IF EXISTS orders_customer_id_fk
SQL
end
end
# NOT VALID means: enforce on new rows immediately,
# but do not scan existing rows yet.
# VALIDATE CONSTRAINT means: now scan existing rows,
# but under ShareUpdateExclusiveLock (reads + writes continue).
#
# Same pattern as NOT NULL. Same two-phase approach.
# The principle is universal: separate enforcement from validation.

The principle is universal: separate enforcement from validation. NOT VALID tells PostgreSQL "enforce this constraint on all new rows, but do not check existing rows yet." This is instant — a catalog-level change, no table scan. VALIDATE CONSTRAINT then scans existing rows to confirm they satisfy the constraint, but under ShareUpdateExclusiveLock, which does not block reads or writes.
None of the three gems automate this two-phase approach for foreign keys. strong_migrations warns you about add_foreign_key and suggests using validate: false followed by validate_foreign_key — which is the Rails-native syntax for the same two-phase pattern. But you must write both steps yourself. safe-pg-migrations and pg_ha_migrations do not provide special handling for foreign keys.
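For completeness, here is the Rails-native spelling that strong_migrations' warning points you toward. It is the same two-phase NOT VALID + VALIDATE pattern, expressed through add_foreign_key options (available in modern Rails; confirm for your version):

```ruby
class AddForeignKeyOrdersToCustomersRailsNative < ActiveRecord::Migration[7.1]
  disable_ddl_transaction!

  def change
    # Phase 1: constraint is created NOT VALID. Instant catalog change;
    # new rows are enforced immediately.
    add_foreign_key :orders, :customers, validate: false

    # Phase 2: scan existing rows under ShareUpdateExclusiveLock.
    # Reads and writes continue throughout.
    validate_foreign_key :orders, :customers
  end
end
```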
lock_timeout: the setting your production database needs today
If you take one thing from this article, let it be this: set lock_timeout on your migration database role.
Without lock_timeout, a migration that cannot acquire its lock waits forever. While it waits, it poisons the lock queue. Every query on that table backs up behind it. Your application goes down, and the migration has not even started executing.
With lock_timeout, the migration fails immediately after the timeout window. The queue drains. Your application stays up. You retry in 30 seconds when traffic is lighter.
# config/initializers/lock_timeout.rb
#
# Set lock_timeout on the database connection.
# If a migration cannot acquire its lock within this window,
# it fails immediately instead of poisoning the queue.
#
# Caveat: this only affects the one pooled connection that executes it.
# For a setting that reliably covers every connection, prefer the
# role-level ALTER ROLE shown below, or the "variables:" key in
# database.yml (supported by the postgresql adapter).
ActiveRecord::Base.connection.execute("SET lock_timeout = '4s'")
# Or set it per-migration for finer control:
class AddNotesToOrders < ActiveRecord::Migration[7.1]
def change
execute "SET lock_timeout = '3s'"
add_column :orders, :notes, :text
end
end
# Or set it at the PostgreSQL role level:
# ALTER ROLE deploy_user SET lock_timeout = '5s';
#
# When lock_timeout fires:
# ActiveRecord::LockWaitTimeout:
# PG::LockNotAvailable: ERROR: canceling statement
# due to lock timeout
#
# The migration fails. The queue drains instantly.
# Your application stays up. Retry in 30 seconds.

safe-pg-migrations and pg_ha_migrations set this automatically. strong_migrations does not — you need to configure it yourself. If you are using strong_migrations alone (and many teams are), lock_timeout is the most important complementary setting you can add.
4 seconds is a reasonable starting point. Short enough to fail before causing user-visible disruption. Long enough to succeed during normal traffic levels. Adjust based on your p99 query duration — the timeout should be at least 2x your slowest normal query on the target table.
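The 2x rule can be expressed as a small helper. This is a sketch only: `suggested_lock_timeout_ms`, its floor, and its ceiling are illustrative names and values, not part of any gem.

```ruby
# Sketch: derive a lock_timeout from the slowest normal query on
# the target table. You supply the p99 duration in milliseconds;
# the helper name and the clamp bounds are illustrative.
def suggested_lock_timeout_ms(p99_ms, floor_ms: 1_000, ceiling_ms: 10_000)
  # At least 2x the slowest normal query, but never shorter than
  # 1 second (fails too eagerly) or longer than 10 seconds (too
  # much queue poisoning before the timeout fires).
  (2 * p99_ms).clamp(floor_ms, ceiling_ms)
end

suggested_lock_timeout_ms(150)    # => 1000, clamped up from 300ms
suggested_lock_timeout_ms(2_000)  # => 4000, the article's default
```

The clamp matters: a timeout derived purely from a fast p99 would fire on any momentary blip, while one derived from a pathological p99 would defeat the purpose of having a timeout at all.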
statement_timeout as a secondary guard
lock_timeout covers the waiting phase — how long the migration waits to acquire its lock. But what about the execution phase? A VALIDATE CONSTRAINT on a 500-million-row table can run for 10 minutes under ShareUpdateExclusiveLock. That does not block reads or writes, but it does block other DDL operations and VACUUM.
# Statement timeout as a safety net.
# lock_timeout covers the waiting phase.
# statement_timeout covers the execution phase.
class SafeMigrationBase < ActiveRecord::Migration[7.1]
  def exec_with_timeouts(sql, lock_to: '4s', stmt_to: '30s')
    execute "SET LOCAL lock_timeout = '#{lock_to}'"
    execute "SET LOCAL statement_timeout = '#{stmt_to}'"
    execute sql
  end
end
# SET LOCAL applies only within the current transaction.
# When the migration transaction commits or rolls back,
# the timeouts revert to the session defaults.
#
# Why SET LOCAL and not SET?
# SET changes the session. If your migration runner reuses
# connections (Puma, Sidekiq), the timeout persists into
# subsequent operations. A 30s statement_timeout on a
# background job connection is a nasty surprise.
#
# SET LOCAL is scoped. It cannot leak.
# Usage:
class AddTrackingNumberToOrders < SafeMigrationBase
  def up
    exec_with_timeouts(
      "ALTER TABLE orders ADD COLUMN tracking_number text",
      lock_to: '3s',
      stmt_to: '10s'
    )
  end
end

The distinction between SET and SET LOCAL is worth memorizing. SET changes the session — and if your migration runner reuses connections (which Puma and Sidekiq both do), that timeout persists into subsequent operations. A 30-second statement_timeout leaking into a background job that runs legitimate long queries is a production incident of its own. SET LOCAL scopes the change to the current transaction. When the transaction ends, the timeout reverts. It cannot leak.
"In infrastructure, boring is the highest compliment available."
— from You Don't Need Redis, Chapter 1: Good Evening. We Have a Problem.
The pre-flight check: what to run before every migration
Even with lock_timeout configured, it is better to avoid contention than to recover from it. Before deploying any migration that touches a high-traffic table, run these queries.
-- Before running any migration, check for blockers.
-- These queries will prevent your DDL from acquiring its lock.
-- Long-running queries on the target table:
SELECT pid,
now() - query_start AS duration,
state,
left(query, 120) AS query_snippet
FROM pg_stat_activity
WHERE state = 'active'
AND query ILIKE '%orders%'
AND query_start < now() - interval '2 seconds'
ORDER BY duration DESC;
-- The real danger — idle transactions holding locks:
SELECT pid,
now() - xact_start AS tx_duration,
state,
left(query, 120) AS last_query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
AND xact_start < now() - interval '30 seconds'
ORDER BY tx_duration DESC;
-- An idle-in-transaction session from 10 minutes ago
-- holds AccessShareLock on every table it touched.
-- Your ALTER TABLE will wait for it. Forever.
-- Every query after your ALTER TABLE will wait too.
--
-- Kill it: SELECT pg_terminate_backend(<pid>);
-- Or prevent it: SET idle_in_transaction_session_timeout = '60s';

The idle-in-transaction sessions deserve special attention. A developer who ran BEGIN; SELECT * FROM orders LIMIT 5; in a psql session and went to lunch is holding AccessShareLock on the orders table. Your migration will wait for that lock to release. While it waits, every query on the orders table queues. The entire application hangs because of an abandoned psql session from three hours ago.
Set idle_in_transaction_session_timeout to 60 seconds on all non-administrative database roles. This single setting prevents the most common cause of migration-related outages in Rails applications.
-- Pre-deploy migration checklist.
-- Run these queries before every migration on a production table.
-- 1. Table size and row count (for estimating lock duration):
SELECT
pg_size_pretty(pg_total_relation_size('orders')) AS total_size,
pg_size_pretty(pg_relation_size('orders')) AS table_size,
reltuples::bigint AS estimated_rows
FROM pg_class WHERE relname = 'orders';
-- 2. Active queries on the target table:
SELECT pid, now() - query_start AS duration, left(query, 100)
FROM pg_stat_activity
WHERE state = 'active'
AND query ILIKE '%orders%'
ORDER BY duration DESC;
-- 3. Idle-in-transaction sessions (the silent killers):
SELECT pid, now() - xact_start AS tx_age, left(query, 100)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
AND xact_start < now() - interval '30s';
-- 4. Existing locks on the target table:
SELECT l.pid, l.mode, l.granted, a.state, left(a.query, 80)
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.relation = 'orders'::regclass;
-- 5. Replication lag (if backfilling):
SELECT
client_addr,
state,
pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replay_lag_bytes,
replay_lag
FROM pg_stat_replication;
-- If any of these return concerning results, wait.
-- Patience is a virtue. Impatience is an outage.

I recommend making this checklist a mandatory step in your deployment runbook. Not as a suggestion. Not as a "nice to have." As a gate. No migration runs on a table with active long-running queries or idle transactions. If the checklist reveals blockers, you wait or you clear them. Patience is not optional in this line of work.
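The gate can be encoded as a pure decision function fed by the checklist queries. This is a sketch: `preflight_ok?` and its keyword arguments are hypothetical names standing in for the counts you would pull from pg_stat_activity, pg_locks, and pg_stat_replication.

```ruby
# Sketch: turn checklist results into a go/no-go decision.
# Thresholds mirror the article's guidance; every name here is
# illustrative, not from any gem.
def preflight_ok?(long_running_queries:, idle_in_tx_sessions:,
                  ungranted_locks:, replay_lag_bytes:)
  long_running_queries.zero? &&
    idle_in_tx_sessions.zero? &&
    ungranted_locks.zero? &&
    replay_lag_bytes < 10_000_000 # 10 MB, the backfill monitor's threshold
end

# A clean primary: safe to migrate.
preflight_ok?(long_running_queries: 0, idle_in_tx_sessions: 0,
              ungranted_locks: 0, replay_lag_bytes: 0)   # => true

# One abandoned psql session: wait, or clear it first.
preflight_ok?(long_running_queries: 0, idle_in_tx_sessions: 1,
              ungranted_locks: 0, replay_lag_bytes: 0)   # => false
```

Wiring this into a deploy script makes the gate mechanical: the migration step simply refuses to run until the function returns true.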
Coordinating migrations across multiple servers
If your Rails application deploys to multiple servers — and most do — you face a coordination problem during backfills. Two deploy processes running the same backfill migration simultaneously waste resources and can cause lock contention on the rows being updated. Advisory locks solve this cleanly.
# Using advisory locks to coordinate migrations across
# multiple deployment targets.
class SafeBackfillWithAdvisoryLock < ActiveRecord::Migration[7.1]
  disable_ddl_transaction!

  LOCK_KEY = 847291 # Arbitrary but unique per migration

  def up
    # pg_try_advisory_lock returns true if acquired, false if held.
    # This prevents two deploy processes from running the same
    # backfill simultaneously (which wastes resources and can
    # cause lock contention on the rows being updated).
    got_lock = ActiveRecord::Base.connection.select_value(
      "SELECT pg_try_advisory_lock(#{LOCK_KEY})"
    )

    unless got_lock
      puts "Another process is running this migration. Skipping."
      return
    end

    begin
      Order.in_batches(of: 10_000) do |batch|
        batch.where(priority: nil).update_all(priority: 0)
        sleep 0.1 # Throttle to reduce replication lag
      end
    ensure
      ActiveRecord::Base.connection.execute(
        "SELECT pg_advisory_unlock(#{LOCK_KEY})"
      )
    end
  end
end
# Why throttle with sleep?
# Each batch of 10,000 UPDATEs generates WAL (write-ahead log).
# If you have streaming replicas, they must replay that WAL.
# Without throttling, you can generate WAL faster than replicas
# can consume it, causing replication lag.
# 100ms between batches keeps lag under 1 second typically.

Advisory locks are PostgreSQL's mechanism for application-level locking. They exist outside the table lock system — they do not block reads, writes, or DDL on any table. They are purely a coordination mechanism between processes that agree to check for them. If two deploy processes both call pg_try_advisory_lock with the same key, only one gets the lock. The other skips the migration gracefully.
The sleep 0.1 between batches is not idle waiting — it is deliberate throttling. Each batch of 10,000 UPDATEs generates write-ahead log (WAL) entries. If you have streaming replicas, they must replay that WAL. Without throttling, you can generate WAL faster than replicas can consume it, causing replication lag that makes read replicas serve stale data. 100 milliseconds between batches keeps replication lag manageable for most configurations.
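Rather than hand-assigning an arbitrary integer per migration, you can derive the advisory lock key from the migration name. This is a sketch of one possible convention, not something the gems provide: PostgreSQL advisory lock keys are 64-bit integers, and a CRC32 of the name fits comfortably and is stable across processes.

```ruby
require "zlib"

# Sketch: derive a stable advisory-lock key from a migration name.
# The same name always yields the same key, so every deploy process
# competes for the same lock without hand-assigned constants.
# The helper name and scheme are illustrative.
def advisory_lock_key(migration_name)
  Zlib.crc32(migration_name) # 32-bit unsigned; fits pg's bigint key
end

key = advisory_lock_key("safe_backfill_orders_priority")
# Then, in the migration: SELECT pg_try_advisory_lock(#{key})
```

The trade-off versus a literal constant is discoverability: with a derived key, anyone can recompute which migration holds a given lock, at the cost of a (vanishingly unlikely) CRC32 collision between two migration names.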
Monitoring replica lag during backfills
If your Rails application uses read replicas — via Makara, the built-in Rails multi-database support, or a proxy — replication lag during backfills is a real concern that the migration gems do not address.
# Monitoring replica lag during batched backfills.
# Essential if your application reads from replicas.
class BackfillWithLagMonitoring < ActiveRecord::Migration[7.1]
  disable_ddl_transaction!

  MAX_LAG_BYTES = 10_000_000 # 10 MB — adjust per your tolerance

  def up
    Order.in_batches(of: 10_000) do |batch|
      batch.where(priority: nil).update_all(priority: 0)

      # Check replica lag after each batch
      lag = ActiveRecord::Base.connection.select_value(<<~SQL).to_i
        SELECT COALESCE(
          MAX(pg_wal_lsn_diff(sent_lsn, replay_lsn)),
          0
        ) FROM pg_stat_replication
      SQL

      if lag > MAX_LAG_BYTES
        puts "Replica lag: #{lag} bytes. Pausing for 5s..."
        sleep 5
      else
        sleep 0.1 # Standard throttle
      end
    end
  end
end
# Why this matters:
# If your Rails app reads from a replica (via Makara, Octopus,
# or the built-in Rails multi-database support), replica lag
# means stale reads. A user updates their order, then
# immediately views it — and sees the old data because the
# replica has not caught up.
#
# Batched backfills are the #1 cause of replication lag spikes
# in Rails applications. Monitor it. Throttle accordingly.

The pattern is straightforward: after each batch, query pg_stat_replication for the lag between sent and replayed WAL positions. If the lag exceeds your threshold (10 MB is a reasonable starting point), pause the backfill until the replicas catch up. This turns a 2-minute backfill into a 5-minute backfill, but it prevents users from seeing stale data on replica reads during the process.
I should be honest about a limitation of this approach: pg_stat_replication is only accessible on the primary. If your migration runs against the primary (which it must, for writes), this works. If you are running the migration through a proxy that does not expose replication statistics, you will need a separate monitoring connection to the primary.
The gap no gem can fill: live read traffic during DDL
Here is the fundamental limitation. All three gems operate at the application layer. They can control what DDL is issued. They can set timeouts. They can retry. They can rewrite unsafe operations into safer equivalents.
They cannot control what happens to the 200 SELECT queries per second that arrive between the moment a lock is requested and the moment it is granted.
# Here is what happens during a "safe" migration
# with any of the three gems installed:
#
# T+0s gem sets lock_timeout = 5s
# T+0s ALTER TABLE orders ADD COLUMN notes text;
# T+0.01s Lock acquired (fast — column add is instant on PG 11+)
# T+0.02s Migration complete. Lock released.
#
# Total lock hold time: ~20ms. The gem did its job.
#
# But during those 20ms:
# - 4 SELECT queries queued behind the AccessExclusiveLock
# - Each waited ~15ms longer than normal
# - Users saw a brief latency spike
#
# On a small table with moderate traffic, this is negligible.
# On a 100M row table with 500 req/s? Those 20ms can queue
# 10 reads. And if the lock takes 2 seconds to acquire
# (because an analytics query is running), 1,000 reads queue.
#
# The gems prevent the WORST outcomes (indefinite blocking).
# They cannot prevent reads from queuing during lock acquisition.
# That requires routing reads elsewhere.

On a moderately busy table (500 req/s), even a 100ms lock acquisition window queues 50 reads. Those reads do not fail — they succeed, eventually — but they all return 100ms slower than normal. That shows up as a latency spike in your monitoring. On a table with long-running analytics queries holding AccessShareLock, the window can stretch to seconds. Then you have a genuine outage.
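The queueing arithmetic is simply read rate multiplied by lock wait time, and it is worth making explicit, because it tells you in advance how bad a given acquisition window will be. A sketch, with an illustrative function name:

```ruby
# Sketch: estimate how many reads queue behind a pending
# AccessExclusiveLock. Pure arithmetic; the name is illustrative.
def queued_reads(reads_per_second:, lock_wait_seconds:)
  (reads_per_second * lock_wait_seconds).round
end

queued_reads(reads_per_second: 500, lock_wait_seconds: 0.02) # => 10
queued_reads(reads_per_second: 500, lock_wait_seconds: 0.1)  # => 50
queued_reads(reads_per_second: 500, lock_wait_seconds: 2.0)  # => 1000
```

The useful insight is the asymmetry: lock hold time is bounded by the DDL (tiny, on PG 11+ for a column add), but lock wait time is bounded by whatever query is already running. The second factor is the one you cannot control from inside a migration.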
This is not a flaw in the gems. It is a limitation of the architecture. The gems live inside your Rails process. They control migration execution. They do not control query routing. To route reads during DDL operations, you need something between Rails and PostgreSQL.
Some teams solve this with application-level read replica routing — directing all SELECTs to a replica during migration windows. This works, but it requires coordination: someone must flip the routing before the migration starts and flip it back after. If the migration fails and retries, the routing must stay active. If the migration takes longer than expected, the replica begins to lag (because it is replaying the migration's WAL). The manual coordination is the fragile part.
Connection pool exhaustion: the secondary cascade
There is a second failure mode that the gems cannot prevent, and it is often more severe than the lock queue itself.
-- The cascade failure nobody warns you about.
-- Setup: 20 database connections in the pool.
-- Traffic: 200 req/s, each holding a connection for ~10ms.
-- T+0s Migration starts. AccessExclusiveLock requested.
-- T+0s Pending lock begins queuing new queries.
-- T+0.5s 10 connections are now blocked, waiting for lock.
-- T+1.0s 18 connections blocked. 2 remaining for new requests.
-- T+1.2s Pool exhausted. New requests wait for a connection.
-- Now you have TWO queues:
-- 1. PostgreSQL lock queue (queries waiting for the lock)
-- 2. Application connection pool queue (requests waiting for a connection)
--
-- Error logs fill with:
-- ActiveRecord::ConnectionTimeoutError:
-- could not obtain a database connection within 5.000 seconds
-- The migration has not even run yet.
-- The application is down because every connection is stuck
-- waiting for a lock that is waiting for an analytics query
-- that started 45 seconds ago.
--
-- Gold Lapel's connection pooling prevents this cascade.
-- Blocked connections are managed at the proxy level.
-- The application pool never exhausts because GL holds
-- the waiting connections, not your app.

This is the cascade that turns a 2-second lock queue into a 30-second outage. The lock queue blocks individual queries. But those queries are holding connections from your application pool. As queries queue up, connections stop being returned to the pool. New requests cannot get a connection. The application throws ConnectionTimeoutError across the board — even for routes that do not touch the locked table.
I want to emphasize that last point. The connection pool is global. A lock on the orders table blocks connections that were checked out to query orders. But those connections come from the same pool used by the /users endpoint, the /products endpoint, the health check. When the pool exhausts, everything goes down. Your login page fails because someone ran a migration on the orders table. That is the secondary cascade.
More connections in the pool is not the answer. More connections means more PostgreSQL backends, which means higher memory usage, more context switching, and slower performance for everyone. The correct architecture is a proxy that manages connection lifecycle independently of your application pool.
How Gold Lapel fills the gap
Gold Lapel sits between your Rails application and PostgreSQL as a transparent proxy. It speaks the PostgreSQL wire protocol. Rails does not know it is there. You add gem "goldlapel-rails" to your Gemfile — it auto-patches ActiveRecord to route through the proxy. No query changes. No configuration files.
During a migration, GL detects the AccessExclusiveLock request and does two things the gems cannot:
1. Read routing. SELECT queries that would queue behind the pending lock are transparently routed to existing materialized views while the base table is locked. No reads queue. No latency spike. The migration acquires its lock, runs its DDL, and releases. Reads route back to the primary table. Total user-visible disruption: zero.
2. Connection pooling. GL manages database connections at the proxy level. When queries queue behind a lock, those connections are held by GL — not by your Rails application's connection pool. Your application pool never exhausts. The cascade failure described above does not happen. Blocked queries wait at the proxy, and your application continues serving requests on routes that do not touch the locked table.
# What happens when Gold Lapel sits between Rails and PostgreSQL
# during a migration:
#
# Before migration:
# Rails app --> Gold Lapel proxy --> PostgreSQL primary
# (reads + writes)
#
# Migration begins. GL detects AccessExclusiveLock on "orders":
# Rails app --> Gold Lapel proxy --> PostgreSQL primary (writes)
# |
# +--------------> Materialized view / replica
# (reads routed here)
#
# The 4 SELECT queries that would have queued behind
# AccessExclusiveLock? They hit the materialized view instead.
# No queue. No latency spike. No user impact.
#
# Migration completes. Lock releases. Reads route back to primary.
#
# Total user-visible disruption: zero.
# Total migration safety gems still needed: yes — GL does not
# replace lock_timeout or expand-contract patterns.
# It handles the read traffic that gems cannot.

GL does not replace your migration safety gems. You should still use strong_migrations or safe-pg-migrations. You should still set lock_timeout. You should still use the expand-contract pattern for NOT NULL constraints on large tables. GL handles the traffic that the gems cannot — the live reads and the connection pool — and that is the gap where most migration incidents actually occur.
I should be honest about what GL does not do during migrations. It does not prevent slow migrations. It does not validate your DDL. It does not tell you that ALTER TABLE orders ALTER COLUMN total TYPE bigint will rewrite the entire table. That is the gem's job. GL handles the consequences of a lock — the queued reads, the connection pressure — not the cause. Use the gems for prevention. Use GL for the traffic that prevention cannot reach.
A retry strategy for lock_timeout failures
When lock_timeout causes a migration to fail (and it will — that is the point), you need a retry strategy. Automatic, with backoff, that distinguishes lock failures from real bugs.
# Rake task: retry migrations with exponential backoff.
# Use when lock_timeout causes expected failures.
namespace :db do
  desc "Run migrations with retry on lock timeout"
  task migrate_with_retry: :environment do
    max_attempts = 5
    base_delay = 5

    max_attempts.times do |attempt|
      begin
        Rake::Task["db:migrate"].invoke
        puts "Migration succeeded on attempt #{attempt + 1}"
        break
      rescue ActiveRecord::LockWaitTimeout => e
        if attempt + 1 < max_attempts
          delay = base_delay * (2 ** attempt)
          puts "Lock timeout on attempt #{attempt + 1}. " \
               "Retrying in #{delay}s..."
          sleep delay
          Rake::Task["db:migrate"].reenable
        else
          raise "Migration failed after #{max_attempts} attempts: " \
                "#{e.message}"
        end
      end
    end
  end
end

Five attempts with exponential backoff (waits of 5s, 10s, 20s, and 40s between attempts) gives the system roughly 75 seconds of total backoff, plus the time the attempts themselves take, to find a clean window. In practice, most retries succeed on the second or third attempt. The lock queue that blocked attempt 1 drains within seconds once the timeout fires, and attempt 2 catches a gap between long-running queries.
If all 5 attempts fail, something structural is wrong. Either a long-running query is holding AccessShareLock permanently (check for idle-in-transaction sessions), or traffic is so consistently heavy that no clean window exists. In that case, schedule the migration for a low-traffic period — but deploy your application code first, so the migration is the only change happening during the window.
A word on the distinction between ActiveRecord::LockWaitTimeout and other migration errors. If the migration fails with a StatementInvalid, TypeError, or any non-lock error, do not retry. Those are bugs in the migration, not transient contention. The retry task above only rescues LockWaitTimeout, which is the correct behavior. If you catch all exceptions, you will retry genuine bugs five times and still fail.
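The backoff schedule itself is worth seeing laid out: five attempts means four waits, and the doubling starts from the base delay. A minimal sketch of the same formula the rake task uses:

```ruby
# Sketch: the waits the retry task sleeps between its five attempts.
base_delay = 5
waits = (0...4).map { |attempt| base_delay * (2**attempt) }

waits      # => [5, 10, 20, 40]
waits.sum  # => 75 seconds of total backoff
```

If 75 seconds is not enough headroom for your traffic pattern, raising base_delay stretches the whole schedule proportionally; adding attempts doubles the final wait each time, which grows much faster.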
The full recommendation
Use strong_migrations as a baseline. It catches the common pitfalls at development time and costs nothing at runtime. If your team occasionally omits algorithm: :concurrently or writes NOT NULL additions without defaults, strong_migrations will flag them before they reach production.
Add safe-pg-migrations if you want runtime protection. The automatic lock_timeout, retry logic, and concurrent indexing are worth the dependency. It catches what strong_migrations warns about and fixes it without developer intervention.
Use the expand-contract pattern manually for NOT NULL constraints, foreign keys, and column renames on any table over a million rows. None of the gems automate this correctly. Write the three-migration sequence by hand. It takes 10 minutes. It prevents 10-minute outages.
Set lock_timeout regardless of which gem you use. 4 seconds. On the database role used for migrations. Today.
Set statement_timeout as a secondary guard. 30 seconds for DDL migrations. Longer for backfills. Use SET LOCAL so it does not leak into subsequent operations.
Set idle_in_transaction_session_timeout to 60 seconds on all application roles. The abandoned psql session is the most common cause of migration incidents, and it is the easiest to prevent.
Run the pre-flight checklist before every migration against a high-traffic table. Check for long-running queries, idle transactions, existing locks, and replication lag. Make it a gate, not a suggestion.
Monitor replica lag during batched backfills. Throttle writes when lag exceeds your threshold. Your read-replica users will thank you.
And for the gap that no gem can fill — live read routing and connection pool management during DDL operations — put a proxy in front of PostgreSQL that can manage traffic while your tables are briefly indisposed. The migrations are safe. The reads still need somewhere to go.
In infrastructure, boring is the highest compliment available. A migration that nobody notices is a migration done correctly. The goal is not to eliminate DDL from your deploys — it is to make DDL invisible to your users. The gems get you partway there. The patterns get you further. The proxy covers the rest.
Permit me one observation before we part. The lock contention at the heart of migration risk is closely related to the table bloat problem — a heavily bloated table takes longer to rewrite and holds locks longer. Understanding one illuminates the other.