Solid Queue on PostgreSQL: Tuning FOR UPDATE SKIP LOCKED (and Why MySQL Gets a Free Win You Don't)
Rails 8 defaults to Solid Queue. Solid Queue was built for MySQL. Your PostgreSQL is doing 33,000x more work than necessary on a single query. Allow me to show you where, and how to fix it.
Good evening. Your job queue is reading ten million rows to find five queue names.
Solid Queue is an excellent piece of engineering. It replaced Redis-backed job processors as the Rails 8 default, and it did so by leaning on a single SQL feature that makes relational databases viable as job queues: FOR UPDATE SKIP LOCKED.
The pitch is compelling. No Redis to operate. No Sidekiq license to manage. No external dependency to monitor, patch, and recover when it fails at 2 AM. Your database — the system you already operate, already back up, already understand — handles job coordination with row-level locking that resolves contention in microseconds. One fewer service in the household.
On MySQL, Solid Queue performs beautifully. It was designed there, tested there, and optimized for MySQL's query planner and its specific capabilities. The Basecamp team built it for their own production workloads, which run on MySQL. And they did a thorough job of it.
PostgreSQL is not MySQL.
There is a query that Solid Queue runs hundreds of times per second — a SELECT DISTINCT(queue_name) against the ready executions table — that MySQL handles with a Loose Index Scan. It reads one index entry per distinct value. Five queues? Five index entries. Ten million rows? Still five index entries.
PostgreSQL does not have Loose Index Scan. It never has. The planner's only options are a full table scan or a full index scan. Five queues, ten million rows? Ten million index entries read, hashed, deduplicated. Every single time the query runs.
This is not a bug. It is not a misconfiguration. It is a fundamental difference between two database engines that Solid Queue's architecture happens to expose at the worst possible moment — under load, when your queue table is deepest.
I have the numbers, the EXPLAIN plans, the tuning parameters, and a recursive CTE workaround that delivers a 33,529x improvement. I also have an honest assessment of when you should consider a PostgreSQL-native alternative instead. If you are running Solid Queue on PostgreSQL — or considering it — this is the guide I wish someone had written before the production incidents started appearing on GitHub.
A brief history of Rails job queues, if you will indulge me
To understand why Solid Queue exists and why its PostgreSQL behaviour matters, it helps to know what came before.
For a decade, the Rails job queue landscape was dominated by Redis-backed systems. Sidekiq, Resque, and Delayed Job (which used the database, but with polling — a strategy that aged poorly). Sidekiq became the de facto standard: fast, reliable, well-maintained. It also required Redis. And Redis required operational attention — memory management, persistence configuration, replication if you needed durability, monitoring for eviction policies that could silently discard jobs.
The question that lingered in every Rails team's infrastructure meetings: why are we running a separate service just to coordinate background work?
The database was already there. It was already durable. It already handled concurrent access with proper isolation. The missing piece was a locking primitive that could efficiently coordinate workers without contention — without the convoy problem where workers queue up waiting for a single locked row.
FOR UPDATE SKIP LOCKED was that primitive. PostgreSQL added it in version 9.5 (2016). MySQL added it in version 8.0 (2018). It changed the economics entirely: a relational database could now coordinate job workers with the same efficiency as Redis, without the additional service to operate.
Solid Queue was the Basecamp team's answer. Built into Rails 8 as the default. One fewer service. One fewer thing to break. The household staff, reduced to its essential members.
I admire the intent. Deeply. The instinct to reduce complexity by using existing infrastructure more intelligently is — if I may say so — precisely the instinct that guides everything Gold Lapel does. The issue is not the philosophy. The issue is that the implementation was optimized for one database engine and then shipped as the default for a framework that supports two.
How FOR UPDATE SKIP LOCKED makes PostgreSQL a job queue
Before the problems, the good news. The core mechanism that makes Solid Queue work — FOR UPDATE SKIP LOCKED — is genuinely excellent on PostgreSQL. It was introduced in PostgreSQL 9.5, and it solves the job queue coordination problem with an elegance that advisory locks and polling never achieved.
-- How Solid Queue claims jobs:
SELECT "solid_queue_ready_executions".*
FROM "solid_queue_ready_executions"
WHERE "solid_queue_ready_executions"."queue_name" = 'default'
ORDER BY "solid_queue_ready_executions"."priority" ASC,
"solid_queue_ready_executions"."id" ASC
LIMIT 1
FOR UPDATE SKIP LOCKED;
-- FOR UPDATE: lock the row so no other worker grabs it.
-- SKIP LOCKED: if someone else already locked it, skip it
-- and grab the next one. No waiting. No contention.
--
-- This is brilliant. It turns PostgreSQL into a job queue
-- without advisory locks, without polling delays,
-- without the coordination overhead of Redis.

Three workers, three jobs, zero contention, zero waiting. Each worker grabs the next unlocked row, locks it, and processes it. No central coordinator. No Redis. No race conditions. The database itself handles the concurrency, and it does so at the row level with microsecond-scale overhead.
This is the part of Solid Queue on PostgreSQL that works flawlessly. Allow me to show you precisely why.
-- How FOR UPDATE SKIP LOCKED behaves under concurrency:
-- Worker 1 executes:
BEGIN;
SELECT * FROM solid_queue_ready_executions
WHERE queue_name = 'default'
ORDER BY priority, id
LIMIT 1
FOR UPDATE SKIP LOCKED;
-- Locks row id=1001, returns it.
-- Worker 2 executes simultaneously:
BEGIN;
SELECT * FROM solid_queue_ready_executions
WHERE queue_name = 'default'
ORDER BY priority, id
LIMIT 1
FOR UPDATE SKIP LOCKED;
-- Row id=1001 is locked. SKIP LOCKED skips it.
-- Locks row id=1002, returns it.
-- No waiting. No retry. No contention.
-- Worker 3 executes simultaneously:
-- Skips 1001 and 1002. Gets 1003.
-- This is why SKIP LOCKED is perfect for job queues.
-- But the LIMIT matters enormously.
-- LIMIT 1 = each worker grabs one job, processes it, comes back.
-- LIMIT 100 = each worker grabs a batch, reducing round trips
-- but holding locks longer.
-- EXPLAIN ANALYZE for the claim query (10M rows, indexed):
--
-- Limit (cost=0.56..1.14 rows=1 width=89)
-- -> LockRows (cost=0.56..5782401.14 rows=9999842 width=89)
-- -> Index Scan using idx_sqre_claim
-- on solid_queue_ready_executions
-- (cost=0.56..5682402.72 rows=9999842 width=89)
-- Index Cond: (queue_name = 'default')
-- Planning Time: 0.134 ms
-- Execution Time: 0.048 ms
--
-- 0.048ms. Even with 10 million pending jobs.
-- The index does the work. SKIP LOCKED does the coordination.

0.048ms to claim a job from a table with ten million rows. The composite index on (queue_name, priority, id) does the filtering and sorting. SKIP LOCKED handles the concurrency. No advisory locks, no external coordination, no polling delays.
Inside the lock mechanism
If you will permit me a brief tour of the internals — understanding how SKIP LOCKED works at the engine level explains why the claim query is fast and why the DISTINCT query is the problem, not the locking itself.
-- What PostgreSQL actually does when it encounters SKIP LOCKED:
--
-- 1. Walk the index in (queue_name, priority, id) order.
-- 2. For each candidate row, attempt to acquire a RowExclusiveLock.
-- 3. If the lock is already held by another transaction:
-- - Normal FOR UPDATE: WAIT until the lock is released.
-- - FOR UPDATE SKIP LOCKED: immediately move to the next row.
-- 4. Return the first row where the lock succeeded.
--
-- The lock check is a fast in-memory operation. PostgreSQL reads
-- the tuple header (xmax and infomask bits) straight from the
-- shared buffer cache: no syscalls, no extra disk reads on the
-- hot path. A held row lock is recorded on the tuple itself;
-- the shared-memory lock table (a hash table keyed by LOCKTAG)
-- is involved only transiently, and SKIP LOCKED never waits on it.
-- Transient lock table entry for a FOR UPDATE tuple lock:
-- LOCKTAG: (relation OID, block number, tuple offset)
-- Lock mode: exclusive tuple lock
-- (plus a RowShareLock on the table itself)
-- Size: ~120 bytes in shared memory
--
-- 10 workers each holding 1 locked row = ~1,200 bytes.
-- 100 workers each holding a batch of 50 = ~600KB.
-- The lock table is not the bottleneck.
-- The bottleneck, when it appears, is the INDEX SCAN itself.
-- But only if the index is missing or the table is severely bloated.
-- With a proper composite index, the scan reads 1-3 pages maximum.

The key insight: the SKIP LOCKED check is an in-memory operation. Deciding whether to skip a row does not touch disk, does not write WAL, and does not acquire heavyweight locks. It inspects lock state cached in RAM and either succeeds or moves on. The entire overhead of coordinating 50 concurrent workers is less than a single random disk read.
This is why I find it frustrating when people dismiss FOR UPDATE SKIP LOCKED as "slow" or "not suitable for high-throughput queues." The locking mechanism is superb. It is everything around it — specifically, the query that decides which queues to check — that creates the problem.
If this were the only query Solid Queue ran, PostgreSQL would be a perfect fit. But it is not the only query.
The DISTINCT query: where MySQL gets a free win
Solid Queue needs to know which queues have pending work. It answers this question with a seemingly innocent query: SELECT DISTINCT queue_name FROM solid_queue_ready_executions.
On MySQL, this query is essentially free. The InnoDB engine supports Loose Index Scan (also called "skip scan"), which reads the B-tree index by jumping from one distinct value to the next without visiting the entries in between. An index on queue_name with five distinct values? Five index lookups. Table size is irrelevant.
PostgreSQL has no equivalent optimization. None. It has been discussed since at least 2008, occasionally prototyped, and never merged. The PostgreSQL planner, faced with SELECT DISTINCT, chooses between two expensive options:
-- Solid Queue also runs this query. Frequently.
SELECT DISTINCT "solid_queue_ready_executions"."queue_name"
FROM "solid_queue_ready_executions";
-- On MySQL, this uses a Loose Index Scan:
-- the engine reads the first entry for each distinct value
-- in the index, then skips ahead to the next distinct value.
-- Cost: proportional to the number of DISTINCT values.
-- 100 million rows, 5 queues? Reads 5 index entries.
-- On PostgreSQL, there is no Loose Index Scan.
-- The planner has two options:
--
-- Option A: Sequential scan the entire table.
-- Seq Scan on solid_queue_ready_executions
-- (cost=0.00..289431.00 rows=5 width=18)
-- Filter: (queue_name IS NOT NULL)
-- Rows Removed by Filter: 0
-- Planning Time: 0.091 ms
-- Execution Time: 3842.110 ms
--
-- Option B: Full index scan with HashAggregate.
-- HashAggregate (cost=301820.42..301820.47 rows=5 width=18)
-- Group Key: queue_name
-- -> Index Only Scan using idx_ready_executions_queue
-- on solid_queue_ready_executions
-- (cost=0.56..276820.33 rows=9999842 width=18)
-- Heap Fetches: 0
-- Planning Time: 0.114 ms
-- Execution Time: 2917.445 ms
--
-- Both read every row. For a table that might hold millions of
-- pending jobs during a backlog, this query alone can consume
-- more CPU than the actual job processing.

2.9 seconds. For a query that returns five rows. On every poll cycle, for every dispatcher, hundreds of times per second.
I should be precise about why this happens, because "PostgreSQL is slower" is not an adequate explanation. PostgreSQL's query planner is, in most respects, more sophisticated than MySQL's. It supports hash joins, merge joins, parallel query execution, and a cost model that considers I/O, CPU, and memory in ways MySQL does not attempt. The planner is not lacking in intelligence.
What it lacks is a specific optimization path. When the planner sees SELECT DISTINCT column FROM table, it knows the column has an index. It knows the index is sorted. It knows the number of distinct values is small (the statistics catalog tells it so). What it cannot do is skip through the index from one distinct value to the next. The planner has no "skip scan" or "loose index scan" operator in its repertoire. It can read the entire index or it can read the entire table. Those are the options.
This is not a hypothetical concern. GitHub issue #508 documents production systems where this single query consumes 40% of database CPU.
-- From GitHub issue #508, a production report:
--
-- "We're seeing approximately 500K query executions
-- in a 15-minute window for the DISTINCT queue_name query.
-- The table has ~2 million rows. Each execution takes 800ms-3s.
-- Total database CPU devoted to this single query: ~40%."
--
-- The math:
-- 500,000 executions / 900 seconds = 555 executions per second
-- 555 * 1.5 seconds average = 833 CPU-seconds per second
--
-- That is not a typo. The DISTINCT query alone requires
-- more CPU time per second than a single core can provide.
-- The only reason the system survives is parallelism across
-- multiple cores — but it is consuming nearly all of them.

The pattern is insidious because it gets worse exactly when you need it to get better. When your application is under load and jobs are backing up, the ready executions table grows. As the table grows, the DISTINCT query gets slower. As the query gets slower, it consumes more CPU. As it consumes more CPU, jobs process slower, which means the table grows further.
A feedback loop. The technical term is "cascading performance degradation." I prefer the household term: a fire in the kitchen that feeds itself by consuming the water meant to extinguish it.
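The loop can be made concrete with a toy model. This is a sketch under loudly invented assumptions (the arrival rate, scan cost, and poll frequency below are illustrative, not measurements): coordination CPU grows with the backlog, which shrinks processing capacity, which grows the backlog.

```ruby
# Toy model of the degradation loop. Every constant is an
# illustrative assumption, not a benchmark.
ARRIVAL_RATE  = 1_000.0   # jobs enqueued per second
BASE_CAPACITY = 1_200.0   # jobs/sec the workers could process if unimpeded
SCAN_COST     = 1.0e-7    # CPU-seconds per row scanned by DISTINCT
POLLS_PER_SEC = 50.0      # DISTINCT executions per second

backlog = 50_000.0        # rows in solid_queue_ready_executions
initial = backlog
10.times do
  # CPU fraction burned by the DISTINCT query grows with the backlog...
  coordination = [POLLS_PER_SEC * SCAN_COST * backlog, 0.95].min
  # ...which shrinks the capacity left for actual job processing...
  throughput = BASE_CAPACITY * (1.0 - coordination)
  # ...which lets the backlog grow, making the next scan slower still.
  backlog += ARRIVAL_RATE - throughput
end
puts "backlog grew from #{initial.round} to #{backlog.round} in 10 ticks"
```

Run the recurrence a few more iterations and the growth accelerates: above a critical backlog size, the system cannot catch up even though its nominal capacity exceeds the arrival rate.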
Detecting the problem before it becomes a crisis
The DISTINCT query problem does not announce itself gradually. In my experience, teams discover it in one of two ways: either a monitoring alert fires because database CPU has crossed 80%, or the application slows to a crawl and someone opens pg_stat_activity and finds dozens of identical DISTINCT queries running simultaneously.
I would rather you find it on a quiet Tuesday with a diagnostic query than at 3 AM with a pager.
-- Monitor Solid Queue query performance in real time.
-- Run this during load to identify which queries are expensive.
-- 1. Active Solid Queue queries right now:
SELECT pid,
now() - query_start AS duration,
state,
left(query, 120) AS query_preview
FROM pg_stat_activity
WHERE query ILIKE '%solid_queue%'
AND state != 'idle'
ORDER BY duration DESC;
-- 2. Historical query stats (requires pg_stat_statements):
SELECT
left(query, 80) AS query_pattern,
calls,
round(total_exec_time::numeric, 1) AS total_ms,
round(mean_exec_time::numeric, 2) AS mean_ms,
round(max_exec_time::numeric, 1) AS max_ms,
rows
FROM pg_stat_statements
WHERE query ILIKE '%solid_queue%'
ORDER BY total_exec_time DESC
LIMIT 10;
-- What to look for:
-- - The DISTINCT query with high total_exec_time and high calls
-- - The claim query (FOR UPDATE SKIP LOCKED) should have mean < 1ms
-- - Any query with max_exec_time > 100ms needs investigation
--
-- If the DISTINCT query has total_exec_time > all others combined,
-- you have the problem described in this article.

The pg_stat_statements view is the most important tool here. If you are not running the pg_stat_statements extension on your PostgreSQL instance, I would gently suggest that this is an oversight worth correcting immediately. It is a standard extension, included in every PostgreSQL distribution, and the overhead is negligible — a few percent of CPU in the worst case, typically less.
What you are looking for in the output:

- The DISTINCT query dominating total_exec_time. If this single query accounts for more than 10% of your total database CPU time, you have the problem. At 40%+, you have the crisis described in issue #508.
- High calls count on the DISTINCT query. Hundreds of thousands of calls per hour is the signature. Each Solid Queue dispatcher fires this query on every poll cycle.
- The claim query (FOR UPDATE SKIP LOCKED) with mean_exec_time under 1ms. This confirms the claim mechanism is healthy. If the claim query is also slow, you likely have a missing index — a separate problem from the DISTINCT issue.
- Growing max_exec_time on any Solid Queue query. This indicates table or index bloat accumulating over time. Autovacuum is falling behind.
Run this diagnostic weekly. Better yet, set up a monitoring dashboard that tracks these metrics continuously. The DISTINCT query problem creeps up as your table grows — it may be invisible at 10,000 rows and catastrophic at 2 million.
The recursive CTE workaround: simulating Loose Index Scan
PostgreSQL may not have Loose Index Scan, but it has recursive CTEs. And a recursive CTE can simulate a skip-scan pattern by hopping through distinct values one at a time, using the index for each hop.
-- The PostgreSQL workaround for Loose Index Scan:
-- a recursive CTE that hops through distinct values.
WITH RECURSIVE queue_names AS (
-- Seed: grab the smallest queue_name
(
SELECT queue_name
FROM solid_queue_ready_executions
ORDER BY queue_name
LIMIT 1
)
UNION ALL
-- Recursive step: find the next queue_name after the current one
(
SELECT sq.queue_name
FROM queue_names q,
LATERAL (
SELECT queue_name
FROM solid_queue_ready_executions
WHERE queue_name > q.queue_name
ORDER BY queue_name
LIMIT 1
) sq
)
)
SELECT queue_name FROM queue_names;
-- EXPLAIN ANALYZE on 10 million rows, 5 distinct queues:
--
-- CTE Scan on queue_names (cost=2.36..2.98 rows=31 width=32)
-- CTE queue_names
-- -> Recursive Union (cost=0.56..2.36 rows=11 width=32)
-- -> Limit (cost=0.56..0.59 rows=1 width=18)
-- -> Index Only Scan using idx_queue_name ...
-- -> Nested Loop (cost=0.56..0.76 rows=1 width=32)
-- -> WorkTable Scan on queue_names q
-- -> Limit (cost=0.56..0.59 rows=1 width=18)
-- -> Index Only Scan using idx_queue_name ...
-- Index Cond: (queue_name > q.queue_name)
-- Planning Time: 0.182 ms
-- Execution Time: 0.087 ms
--
-- From 2917ms to 0.087ms. That is a 33,529x improvement.
-- It reads exactly 6 index entries instead of 10 million.

From 2,917ms to 0.087ms. The recursive CTE reads exactly N+1 index entries, where N is the number of distinct queue names. Five queues? Six index lookups. The table could hold a billion rows and this query would still finish in under a millisecond.
If that improvement seems implausible, allow me to walk through the execution step by step.
-- Why the recursive CTE works:
--
-- Step 1 (Seed):
-- "Find the alphabetically first queue_name."
-- Index scan: reads 1 entry. Returns 'critical'.
--
-- Step 2 (Recursion 1):
-- "Find the first queue_name > 'critical'."
-- Index scan with condition: reads 1 entry. Returns 'default'.
--
-- Step 3 (Recursion 2):
-- "Find the first queue_name > 'default'."
-- Index scan: reads 1 entry. Returns 'exports'.
--
-- Step 4 (Recursion 3):
-- "Find the first queue_name > 'exports'."
-- Index scan: reads 1 entry. Returns 'mailers'.
--
-- Step 5 (Recursion 4):
-- "Find the first queue_name > 'mailers'."
-- Index scan: reads 1 entry. Returns 'reports'.
--
-- Step 6 (Recursion 5):
-- "Find the first queue_name > 'reports'."
-- Index scan: reads 1 entry. Returns nothing. Recursion ends.
--
-- Total index entries read: 6
-- Total rows in table: 10,000,000
-- Rows the DISTINCT version reads: 10,000,000
--
-- The recursive CTE simulates what MySQL does automatically.
-- PostgreSQL just needs explicit instructions.

The recursive CTE simulates exactly what MySQL's Loose Index Scan does automatically. Each recursion step uses an index range scan with queue_name > current_value and LIMIT 1, which reads a single B-tree leaf entry. PostgreSQL's planner handles this perfectly — it is a standard indexed lookup, not an exotic optimization.
The LATERAL join in the recursive step is what makes this work. It allows the subquery to reference the current row from the recursive CTE, enabling the "hop to next distinct value" pattern. Without LATERAL, you would need a correlated subquery, which PostgreSQL's recursive CTE syntax does not support.
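The hop pattern is easy to demonstrate outside SQL. In this Ruby sketch, a sorted array stands in for the B-tree index and bsearch stands in for an index range scan; the queue names and row count are illustrative:

```ruby
# Simulate the recursive CTE's skip-scan over a sorted "index".
# Each hop finds the first value strictly greater than the current one,
# exactly like `WHERE queue_name > $1 ORDER BY queue_name LIMIT 1`.
QUEUES = %w[critical default exports mailers reports].freeze
index  = QUEUES.sort.flat_map { |q| [q] * 200_000 }  # 1M "rows", 5 queues

def next_distinct(index, current)
  # bsearch is O(log n), like descending a single B-tree path.
  index.bsearch { |v| v > current }
end

distinct = []
lookups  = 1                 # the seed lookup
current  = index.first       # seed: the smallest value
while current
  distinct << current
  current = next_distinct(index, current)
  lookups += 1
end

puts distinct.inspect   # ["critical", "default", "exports", "mailers", "reports"]
puts lookups            # 6: N + 1 lookups for N = 5 distinct values
```

A million entries, six lookups. Growing the array changes only the depth of each bsearch, not the number of hops — the same property the recursive CTE exploits in the B-tree.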
Applying the fix: the monkey-patch approach
This is the fix you can apply today. It requires modifying Solid Queue's source — specifically the ReadyExecution.queues class method — or monkey-patching it in an initializer. Neither is ideal. A proper upstream fix would detect the database adapter and use the recursive CTE for PostgreSQL. As of this writing, that PR does not exist.
For production systems under pressure right now, the monkey-patch approach:
# config/initializers/solid_queue_pg_patch.rb
Rails.application.config.after_initialize do
next unless SolidQueue::ReadyExecution.connection
.adapter_name.downcase.include?("postgresql")
SolidQueue::ReadyExecution.class_eval do
def self.queues
connection.select_values(<<~SQL)
WITH RECURSIVE queue_names AS (
(
SELECT queue_name
FROM solid_queue_ready_executions
ORDER BY queue_name LIMIT 1
)
UNION ALL
(
SELECT sq.queue_name
FROM queue_names q,
LATERAL (
SELECT queue_name
FROM solid_queue_ready_executions
WHERE queue_name > q.queue_name
ORDER BY queue_name LIMIT 1
) sq
)
)
SELECT queue_name FROM queue_names
SQL
end
end
end

This initializer checks the database adapter at boot and replaces the queues method only when running on PostgreSQL. MySQL installations are unaffected. The recursive CTE requires an index on queue_name, which Solid Queue's default migration creates.
I should note the trade-offs of monkey-patching a framework component. When Solid Queue is updated, the internal method signature may change. Your initializer may break silently — calling a method that no longer exists or has different semantics. Pin your Solid Queue version and test the patch against each upgrade. This is technical debt, taken on deliberately, because the alternative — a 33,000x performance penalty — is worse.
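One way to contain that debt is to assert at boot that the method you are about to replace still exists, so an upgrade that removes or renames it fails loudly instead of silently. A sketch, using a stand-in class rather than the real SolidQueue::ReadyExecution:

```ruby
# Guard for a monkey-patch: fail loudly if the patched method has
# vanished in an upgrade. FakeReadyExecution is a stand-in; in the
# real initializer the target would be SolidQueue::ReadyExecution
# and the method :queues.
def assert_patchable!(target, method_name)
  return if target.respond_to?(method_name)

  raise "#{target}.#{method_name} no longer exists; review the patch " \
        "before upgrading past the pinned Solid Queue version"
end

class FakeReadyExecution
  def self.queues
    ["default"]
  end
end

assert_patchable!(FakeReadyExecution, :queues)  # passes silently
```

Call the guard at the top of the initializer, before the class_eval. A removed method then produces one clear boot error instead of a queue that quietly stops dispatching.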
An honest counterpoint: when Solid Queue on PostgreSQL is the wrong choice
I have spent considerable time explaining how to make Solid Queue work well on PostgreSQL. It would be a disservice — and an embarrassment — if I did not tell you when you should consider something else entirely.
If your job volume exceeds 10,000 jobs per minute sustained, you are pushing Solid Queue's architecture into territory where the polling model itself becomes a bottleneck, regardless of the DISTINCT query fix. Each poll cycle involves at least one query per queue. With the CTE patch, each query is sub-millisecond, but 50 poll cycles per second across 5 queues is still 250 queries per second devoted to coordination overhead. At very high volumes, a push-based system — LISTEN/NOTIFY, or Redis pub/sub — handles coordination with less database load.
If you need sub-second job pickup latency, Solid Queue's polling model introduces inherent delay equal to the polling interval. A 200ms polling interval means jobs wait 0-200ms before a worker notices them. Redis-backed systems like Sidekiq use blocking pop (BRPOP), which wakes the worker the instant a job is enqueued. For real-time features — chat messages, live notifications, payment webhooks — that difference matters.
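The latency cost of polling is simple to quantify. Assuming a single poller and jobs arriving uniformly at random within the polling interval (both simplifying assumptions), the expected wait is half the interval plus the claim cost:

```ruby
# Expected pickup delay under polling, for a single poller with
# uniformly distributed job arrivals. claim_query_ms is the ~1ms
# FOR UPDATE SKIP LOCKED claim from the plans earlier.
def expected_pickup_ms(polling_interval_ms, claim_query_ms: 1.0)
  polling_interval_ms / 2.0 + claim_query_ms
end

[100, 200, 1000].each do |interval|
  puts format("%4dms interval -> ~%.0fms average wait", interval,
              expected_pickup_ms(interval))
end
# A blocking pop (Sidekiq's BRPOP) has no interval term at all:
# pickup latency is just the wake-up and claim cost.
```

Multiple unsynchronized pollers shrink the average wait further, but the interval term never disappears; only a push-based wakeup removes it.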
If your team already operates Redis reliably, the argument for Solid Queue is weaker. The primary benefit is eliminating Redis as a dependency. If Redis is already in your stack for caching, session storage, or real-time features, adding it as a job backend is marginal operational cost. Sidekiq on Redis is a mature, heavily optimized system with years of production hardening. There is no shame in using the right tool, even when a newer alternative exists.
A waiter who overstates his case is no waiter at all. Solid Queue on PostgreSQL is a good choice for many applications. It is not the best choice for every application. The tuning in this article makes it viable where it otherwise would not be. But "viable with tuning" and "optimal by design" are different things, and you deserve to know the difference.
"Rails 8 made this official: SolidQueue stores background jobs in PostgreSQL and ships as the framework's default. 37signals processes 20 million jobs per day with it."
— from You Don't Need Redis, Chapter 19: The Case for Simplicity
The PostgreSQL-native alternative: GoodJob
If you are starting a new Rails project on PostgreSQL and the choice of job backend is still open, I should introduce you to GoodJob. It was built for PostgreSQL from the ground up, and it avoids the DISTINCT query problem entirely through a fundamentally different architecture.
# Gemfile — the PostgreSQL-native alternative
# Instead of:
gem "solid_queue"
# Consider:
gem "good_job"
# GoodJob was built for PostgreSQL from day one.
# Key differences:
#
# 1. Uses advisory locks instead of FOR UPDATE SKIP LOCKED.
# Advisory locks are lighter-weight — they do not create
# row-level lock entries or interact with MVCC visibility.
#
# 2. Uses LISTEN/NOTIFY instead of polling for the DISTINCT query.
# When a job is enqueued, GoodJob sends a NOTIFY on a channel.
# Workers LISTEN on that channel and wake immediately.
# No polling. No DISTINCT. No full-index scan.
#
# 3. Stores jobs in a single table with state columns,
# not across multiple tables (ready, claimed, blocked, etc.).
# Fewer tables = fewer indexes = less autovacuum pressure.
#
# 4. Supports CONCURRENTLY index operations during migrations.
# Solid Queue's migrations do not use CONCURRENTLY by default.
#
# Trade-off: GoodJob is not the Rails 8 default.
# It requires explicit setup. It has a different admin UI.
# If your team prefers the "default Rails" path, Solid Queue
# with the tuning in this article is perfectly viable. The critical difference is how GoodJob discovers available work. Instead of polling with SELECT DISTINCT, GoodJob uses PostgreSQL's LISTEN/NOTIFY mechanism.
-- GoodJob uses LISTEN/NOTIFY instead of polling.
-- Here is how the two approaches compare:
-- === Solid Queue (polling) ===
-- Every 100-1000ms, each worker:
-- 1. SELECT DISTINCT queue_name ... (the problem query)
-- 2. For each queue with work:
-- SELECT ... FOR UPDATE SKIP LOCKED (the claim query)
--
-- 10 workers at 200ms polling interval:
-- 50 DISTINCT queries per second (potentially 2.9 seconds each)
-- + 50 claim queries per second (< 1ms each, no problem)
-- === GoodJob (LISTEN/NOTIFY) ===
-- At startup, each worker:
-- LISTEN good_job; (once, holds open connection)
--
-- When a job is enqueued:
-- NOTIFY good_job, '{"queue":"default"}'; (from enqueuing process)
--
-- Workers wake on notification:
-- SELECT ... FOR UPDATE SKIP LOCKED (claim query only)
--
-- 10 workers, idle system:
-- 0 queries per second. Zero.
-- Workers sleep until notified.
--
-- The DISTINCT query never runs. It does not need to.
-- The notification tells workers which queues have work.

The numbers are stark. An idle Solid Queue system with 10 workers at 200ms polling runs 50 DISTINCT queries per second, each potentially scanning millions of rows. An idle GoodJob system with 10 workers runs zero queries per second. Workers sleep until notified. The database is silent.
GoodJob also uses advisory locks instead of FOR UPDATE SKIP LOCKED for job claiming. Advisory locks are lighter-weight on PostgreSQL — they do not interact with MVCC visibility checks and do not contribute to dead tuple accumulation. The queue table does not churn rows through insert-claim-delete cycles the way Solid Queue's multi-table architecture does.
When GoodJob is not the answer
I would not be doing my job if I presented GoodJob as universally superior. It has its own trade-offs.
GoodJob is not the Rails 8 default. Choosing it means diverging from the framework's blessed path, which has implications for documentation, community support, and the availability of guides and tutorials. When something goes wrong with Solid Queue, the Rails core team is invested in fixing it. GoodJob is maintained by a single dedicated developer (Ben Sheldon, who does excellent work) and a smaller community.
GoodJob's LISTEN/NOTIFY mechanism requires persistent connections. If you use PgBouncer in transaction pooling mode, LISTEN does not work — the notification is delivered to the server-side connection, which may no longer be associated with the client that issued LISTEN. This is a real operational constraint. Solid Queue's polling model, by contrast, works perfectly through any connection pooler.
GoodJob stores all job states in a single table with columns for state transitions. This simplifies some operations but means the table grows with completed job history. If you retain completed jobs for auditing, the table can become very large. Solid Queue's multi-table architecture (ready, claimed, blocked, completed) keeps the hot-path tables small by design.
The choice is not obvious, and that is fine. What matters is that you make it with full information rather than discovering the trade-offs in production.
Tuning Solid Queue for PostgreSQL: the complete parameter table
If you have chosen Solid Queue — or inherited it — here is the complete tuning guide for PostgreSQL. Every parameter below was tuned for MySQL by default. Several need adjustment on PostgreSQL, either because PostgreSQL handles certain query patterns differently or because PostgreSQL connections are more expensive than MySQL connections.
| Parameter | MySQL default | PostgreSQL recommendation | Reason |
|---|---|---|---|
| polling_interval | 0.1s | 0.2-1.0s | Each poll fires the DISTINCT query. Reduce frequency. |
| batch_size | 500 | 100-500 | Larger batches hold FOR UPDATE locks longer. Balance throughput vs. lock duration. |
| threads per worker | 5 | 3-5 | Each thread = 1 PostgreSQL connection. Size pool accordingly. |
| processes | 2 | 1-2 | More processes = more connections. Fewer processes with more threads is cheaper on PG. |
| concurrency_maintenance_interval | 600s | 300-600s | Stale lock cleanup queries are expensive. Do not reduce below 300s. |
| db pool size | threads + 2 | threads + 3 | Dispatcher + maintenance + headroom. PG connections are heavier than MySQL. |
# config/solid_queue.yml — production configuration for PostgreSQL
production:
dispatchers:
- polling_interval: 1 # seconds between dispatch cycles
batch_size: 500 # jobs per dispatch batch
concurrency_maintenance_interval: 600 # how often to check
# for stale locks
workers:
- queues: ["default", "mailers"]
threads: 5
processes: 2
polling_interval: 0.1 # 100ms — fast pickup
- queues: ["exports", "reports"]
threads: 3
processes: 1
polling_interval: 1 # these jobs are slower, poll less
# Key PostgreSQL-specific tuning:
#
# 1. polling_interval: Lower values = more FOR UPDATE SKIP LOCKED
# queries per second. On MySQL this is nearly free.
# On PostgreSQL, each poll that finds no work still executes
# the DISTINCT queue_name query. Tune conservatively.
#
# 2. threads: Each thread holds a connection during job execution.
# 5 threads * 2 processes = 10 PostgreSQL connections per worker.
# Size your connection pool accordingly.
#
# 3. batch_size: Larger batches mean fewer round trips but longer
# transactions holding FOR UPDATE locks. 500 is a safe default.

The critical adjustment is polling_interval. On MySQL, aggressive polling (100ms) is cheap because the DISTINCT query is essentially free. On PostgreSQL, every poll cycle fires a multi-second full-index scan. Increasing the interval to 200ms-1s reduces the query frequency proportionally. Jobs will be picked up slightly less quickly, but the database will survive to process them.
If you have applied the recursive CTE patch above, you can safely reduce polling_interval back to 0.1s. The patch makes the DISTINCT query as cheap on PostgreSQL as it is on MySQL.
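The interaction between polling_interval and the DISTINCT cost is plain arithmetic. A sketch using the timings from the EXPLAIN plans earlier (2.9s unpatched, 0.087ms patched); the dispatcher count is illustrative:

```ruby
# Coordination load as arithmetic: queries per second times runtime
# per query gives CPU-seconds consumed per wall-clock second.
def distinct_qps(dispatchers:, polling_interval_s:)
  dispatchers / polling_interval_s.to_f
end

def cpu_load(qps, query_runtime_s)
  qps * query_runtime_s
end

unpatched = cpu_load(distinct_qps(dispatchers: 10, polling_interval_s: 0.2), 2.9)
patched   = cpu_load(distinct_qps(dispatchers: 10, polling_interval_s: 0.1), 0.000087)

puts unpatched   # ~145 CPU-seconds/sec: dozens of cores for coordination alone
puts patched     # well under 0.01, even while polling twice as fast
```

The unpatched figure is physically unsatisfiable on most hardware, which is why the slowdown looks like a cliff rather than a slope.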
A note on batch_size: larger batches mean Solid Queue claims multiple jobs in a single FOR UPDATE SKIP LOCKED query with a higher LIMIT. This reduces the number of round trips to the database but extends the duration that rows are locked. On PostgreSQL, this interacts with autovacuum — locked rows cannot be vacuumed until the transaction completes. If your jobs take 30 seconds to process and you claim 500 at a time, those 500 rows are locked for 30 seconds. During a backlog, this can delay vacuum operations and contribute to bloat. A batch size of 100-200 strikes a better balance on PostgreSQL.
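One way to model that interaction, under stated assumptions (a single worker process draining its batch with a fixed number of jobs in flight; the 30-second job time comes from the scenario above):

```ruby
# Rough model of how batch size stretches the time a claimed batch
# takes to drain — and thus how long its rows resist vacuuming.
def batch_drain_seconds(batch_size:, job_seconds:, threads:)
  (batch_size / threads.to_f).ceil * job_seconds
end

[500, 200, 100].each do |batch|
  secs = batch_drain_seconds(batch_size: batch, job_seconds: 30, threads: 5)
  puts "batch #{batch}: last row released after ~#{secs}s"
end
```

The last row of a 500-job batch waits through a hundred 30-second rounds before it is processed; a batch of 100 cuts that window by five while still amortizing the claim round trip.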
Connection pool sizing: PostgreSQL connections are not MySQL connections
A MySQL connection is a thread within the mysqld process. Lightweight. Thousands are routine. The MySQL manual barely discusses connection limits because they rarely matter in practice.
A PostgreSQL connection is a forked operating system process. Each one consumes 5-10MB of RAM, a full process slot, a set of file descriptors, and a share of the kernel's scheduling budget. Hundreds are expensive. Thousands are a crisis. The PostgreSQL documentation devotes entire sections to connection management because getting it wrong is one of the most common causes of production incidents.
Solid Queue spawns connections aggressively. Each worker thread holds a connection. The dispatcher holds a connection. The maintenance task holds a connection. If you run the default configuration — 5 threads, 2 processes — that is 12 PostgreSQL backend processes for job processing alone, before your web application opens a single connection.
# config/database.yml — pool sizing for Solid Queue workers
production:
adapter: postgresql
url: <%= ENV["DATABASE_URL"] %>
pool: <%= ENV.fetch("RAILS_MAX_THREADS", 5).to_i + 2 %>
checkout_timeout: 5
prepared_statements: true
# Why pool = threads + 2?
#
# Solid Queue uses:
# - 1 connection per worker thread (job execution)
# - 1 connection for the dispatcher (polling)
# - 1 connection for the maintenance task (cleanup)
#
# If you run 5 worker threads + dispatcher + maintenance = 7.
# Set pool to at least 7. The +2 above covers the dispatcher
# and maintenance connections.
#
# For PgBouncer users: set prepared_statements: false
# or use transaction-mode-compatible prepared statements.
# See: /grounds/connection-pooling/connection-pooling

Let me walk through the connection arithmetic for two realistic deployment scenarios, because this is where teams consistently underestimate their needs.
# Typical Rails 8 deployment — connection audit
# === Single-server deployment ===
# Puma web server: 3 workers * 5 threads = 15 connections
# Solid Queue workers: 2 processes * 5 threads = 10 connections
# Solid Queue dispatcher: 1 connection
# Solid Queue maintenance: 1 connection
# Rails console / migrations / cron: ~3 connections
# Total: 30 connections
# === Scaled deployment (2 app servers) ===
# Puma (server 1): 3 workers * 5 threads = 15 connections
# Puma (server 2): 3 workers * 5 threads = 15 connections
# Solid Queue workers: 3 processes * 5 threads = 15 connections
# Solid Queue dispatcher: 1 connection
# Solid Queue maintenance: 1 connection
# Monitoring (Datadog, PgHero, etc.): ~3 connections
# Rails console / migrations: ~3 connections
# Total: 53 connections
# === PostgreSQL max_connections default: 100 ===
# At 53 connections, you have 47 remaining.
# Add a staging environment that shares the database: +30.
# Now you have 17 remaining.
# Deploy a third app server during a traffic spike: +15.
# Now you have 2 remaining.
# A developer opens psql to debug something: that is the crisis.

The pattern is clear: connections accumulate silently across services, environments, and operational tools until you are one deploy away from FATAL: too many connections for role. And that error, when it arrives, affects not just Solid Queue but your entire application — web requests, API calls, admin dashboards, everything.
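If you prefer the audit as something you can rerun after every topology change, here is a small Ruby sketch. The figures are the article's scaled scenario, not values queried from a live cluster; substitute your own:

```ruby
# Hypothetical connection audit: sum per-service demand, compare to the limit.
services = {
  "puma (2 servers)"        => 2 * 3 * 5,  # servers * workers * threads
  "solid_queue workers"     => 3 * 5,      # processes * threads
  "solid_queue dispatcher"  => 1,
  "solid_queue maintenance" => 1,
  "monitoring"              => 3,
  "console / migrations"    => 3,
}

max_connections = 100                  # PostgreSQL default
total    = services.values.sum         # 53 for this scenario
headroom = max_connections - total     # 47 remaining

puts "#{total} connections used, #{headroom} remaining"
```

The moment `headroom` dips below the size of one more app server plus one more shared environment, you are operating on borrowed time.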
A connection pooler — PgBouncer, PgCat, or Gold Lapel's built-in pooling — is not optional for production Solid Queue on PostgreSQL. It is infrastructure. The same way you would not deploy a web application without a load balancer, you should not deploy a PostgreSQL-backed application with 30+ connections without a connection multiplexer.
One additional consideration for PgBouncer users: if you run PgBouncer in transaction pooling mode (which is the recommended mode for most applications), you must set prepared_statements: false in your database.yml. PostgreSQL's prepared statements are session-scoped — they are associated with a backend connection, not a transaction. When PgBouncer reassigns transactions to different backend connections, prepared statements from a previous session do not exist, and ActiveRecord will raise errors. This is a well-documented interaction, but it catches teams who add PgBouncer after Solid Queue is already running.
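The corresponding database.yml change is small. A sketch, assuming your DATABASE_URL points at PgBouncer rather than at PostgreSQL directly:

```yaml
# config/database.yml — variant for PgBouncer in transaction pooling mode
production:
  adapter: postgresql
  url: <%= ENV["DATABASE_URL"] %>  # points at PgBouncer, not PostgreSQL
  pool: <%= ENV.fetch("RAILS_MAX_THREADS", 5).to_i + 2 %>
  checkout_timeout: 5
  prepared_statements: false  # session-scoped; breaks under transaction pooling
```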
Indexes and autovacuum: the queue table is not a normal table
Solid Queue's ready executions table behaves like a queue, not a table. Rows are inserted, briefly live, then deleted. The table's live row count may be 10,000 at any moment, but PostgreSQL may have processed 10 million inserts and 10 million deletes over the course of a week.
This creates two problems that normal table tuning does not address. I encounter both regularly in production systems, and they compound each other in ways that are not immediately obvious.
Problem 1: Missing or inadequate indexes
Solid Queue's migration creates basic indexes. They are adequate for MySQL. For PostgreSQL, the claim query — the hot path that runs thousands of times per second — benefits from a composite index that covers the full WHERE + ORDER BY.
-- Indexes Solid Queue needs on PostgreSQL
-- (some are created by the migration, some are not)
-- 1. The queue_name index for the DISTINCT query
-- This exists by default, but it does not make the DISTINCT query cheap:
-- PostgreSQL still walks every entry because it cannot do a Loose Index Scan.
CREATE INDEX IF NOT EXISTS idx_sqre_queue_name
ON solid_queue_ready_executions (queue_name);
-- 2. Composite index for the job claim query
-- Covers WHERE queue_name + ORDER BY priority, id
CREATE INDEX IF NOT EXISTS idx_sqre_claim
ON solid_queue_ready_executions (queue_name, priority ASC, id ASC);
-- 3. Scheduled executions: the dispatch query
CREATE INDEX IF NOT EXISTS idx_sqse_dispatch
ON solid_queue_scheduled_executions (scheduled_at, priority ASC, id ASC);
-- 4. Blocked executions: the unblock check
CREATE INDEX IF NOT EXISTS idx_sqbe_unblock
ON solid_queue_blocked_executions (concurrency_key, priority ASC, id ASC);
-- 5. Claimed executions: stale lock cleanup
CREATE INDEX IF NOT EXISTS idx_sqce_cleanup
ON solid_queue_claimed_executions (created_at);
-- Check if your indexes are being used:
SELECT indexrelname,
idx_scan,
idx_tup_read,
idx_tup_fetch,
pg_size_pretty(pg_relation_size(indexrelid)) AS size
FROM pg_stat_user_indexes
WHERE schemaname = 'public'
AND relname LIKE 'solid_queue_%'
ORDER BY idx_scan DESC;

The composite index (queue_name, priority ASC, id ASC) is the most important. Without it, the claim query uses the basic queue_name index for filtering but then performs a sort on the remaining rows. With the composite index, the sort is embedded in the index itself — the database reads the first matching entry and returns it. No sort. No extra I/O. The difference is typically 0.5-2ms without the composite index versus 0.03-0.05ms with it. Individually small, but multiplied by thousands of claims per second, the aggregate savings are substantial.
Problem 2: Autovacuum cannot keep up
PostgreSQL's MVCC means deleted rows are not immediately reclaimed. They become "dead tuples" that autovacuum cleans up in the background. Default autovacuum settings trigger cleanup when dead tuples exceed 20% of live tuples.
For a queue table with 10,000 live rows and 100,000 inserts per hour, that 20% threshold (2,000 dead tuples) is crossed within seconds of each vacuum cycle. Autovacuum runs, cleans up, and immediately falls behind again. Meanwhile, the dead tuples bloat the table and its indexes, making every query — including the DISTINCT query — slower.
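The trigger condition itself is simple enough to write down. A sketch of the formula from the PostgreSQL documentation, showing why the default threshold lands near 2,000 for this table and where the tuned value lands instead:

```ruby
# Autovacuum fires when:
#   n_dead_tup > autovacuum_vacuum_threshold + scale_factor * reltuples
# Defaults: threshold = 50, scale_factor = 0.2.
def vacuum_trigger(live_rows, threshold: 50, scale_factor: 0.2)
  (threshold + scale_factor * live_rows).to_i
end

vacuum_trigger(10_000)                      # => 2050 dead tuples at defaults
vacuum_trigger(10_000, scale_factor: 0.01)  # => 150 with scale_factor = 0.01
```

For a table churning 100,000 rows per hour, the difference between waking up at 2,050 dead tuples and waking up at 150 is the difference between chronic bloat and a table that stays tight.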
-- Solid Queue tables are high-churn: constant INSERT + DELETE.
-- Dead tuples accumulate fast. Autovacuum must keep up.
-- Check current dead tuple count:
SELECT relname,
n_live_tup,
n_dead_tup,
last_autovacuum,
last_autoanalyze
FROM pg_stat_user_tables
WHERE relname LIKE 'solid_queue_%'
ORDER BY n_dead_tup DESC;
-- If n_dead_tup is > 20% of n_live_tup, autovacuum is falling behind.
-- Tune per-table:
ALTER TABLE solid_queue_ready_executions SET (
autovacuum_vacuum_scale_factor = 0.01, -- vacuum at 1% dead tuples
autovacuum_vacuum_cost_delay = 2, -- less sleeping between pages
autovacuum_analyze_scale_factor = 0.005 -- reanalyze frequently
);
ALTER TABLE solid_queue_claimed_executions SET (
autovacuum_vacuum_scale_factor = 0.01,
autovacuum_vacuum_cost_delay = 2,
autovacuum_analyze_scale_factor = 0.005
);
-- These tables behave like queues — rows are inserted, briefly live,
-- then deleted. Default autovacuum settings (20% threshold) assume
-- tables grow and shrink gradually. Queue tables do not.

The key change: autovacuum_vacuum_scale_factor = 0.01. This triggers vacuum at 1% dead tuples instead of 20%. For a 10,000-row table, that means vacuuming after 100 dead tuples instead of 2,000. More frequent, smaller vacuum operations that prevent bloat from accumulating.
Setting autovacuum_vacuum_cost_delay = 2 shortens the sleep between batches of page scans. On PostgreSQL 12 and later this is already the default (the old default was 20ms), so the setting mainly protects older installations from sleeping ten times longer than necessary. Queue tables are small enough that aggressive vacuuming has negligible impact on other queries.
Diagnosing bloat when it has already accumulated
If you suspect autovacuum has fallen behind — the table feels slower than it should, or the DISTINCT query is taking longer than expected even on a modestly-sized table — here is how to measure the damage.
-- Diagnose table and index bloat on Solid Queue tables.
-- High bloat means autovacuum is losing the race.
-- The estimate below uses only the built-in statistics views.
-- For exact per-page measurements, install pgstattuple:
-- CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT
relname,
pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
pg_size_pretty(pg_relation_size(relid)) AS table_size,
n_live_tup,
n_dead_tup,
CASE WHEN n_live_tup > 0
THEN round(100.0 * n_dead_tup / n_live_tup, 1)
ELSE 0
END AS dead_pct
FROM pg_stat_user_tables
WHERE relname LIKE 'solid_queue_%'
ORDER BY n_dead_tup DESC;
-- Index bloat: check if indexes are growing faster than data
SELECT
indexrelname,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
idx_scan,
idx_tup_read
FROM pg_stat_user_indexes
WHERE relname LIKE 'solid_queue_%'
ORDER BY pg_relation_size(indexrelid) DESC;
-- Warning signs:
-- - Table size is 500MB but n_live_tup is only 10,000.
-- That means 99.99% of the table is dead tuples and free space.
-- Autovacuum is not keeping up. Apply the tuning above.
--
-- - Index size exceeds table size by 5x or more.
-- Indexes on queue tables bloat faster than the tables themselves.
-- Consider REINDEX CONCURRENTLY during a maintenance window.

A table with 10,000 live rows should occupy roughly 1-5MB depending on row width. If pg_total_relation_size reports 200MB, you have bloat. The dead tuples have been vacuumed, but the free space has not been returned to the operating system — PostgreSQL reuses free space within the table file but does not shrink it. Only VACUUM FULL or pg_repack reclaim the space; VACUUM FULL holds an exclusive lock for the entire rewrite, while pg_repack needs one only briefly at the start and end of its run.
For Solid Queue tables, the pragmatic approach: apply the aggressive autovacuum settings, and if bloat is severe, schedule a REINDEX CONCURRENTLY during a maintenance window. The indexes bloat faster than the table data and have a disproportionate impact on query performance.
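To turn "the table feels too big" into a number, a quick estimator. The 120-byte average row width is an assumption for illustration; measure yours with pg_column_size or pgstattuple before trusting the percentage:

```ruby
# Rough bloat estimate: reported size vs. what the live rows should occupy.
# avg_row_bytes: 120 is an assumed width, not a measured one.
def bloat_pct(table_bytes, live_rows, avg_row_bytes: 120)
  expected = live_rows * avg_row_bytes
  (100.0 * (table_bytes - expected) / table_bytes).round(1)
end

bloat_pct(200 * 1024 * 1024, 10_000)  # => 99.4 for a 200MB table, 10k live rows
```

Anything north of 50% on a queue table says autovacuum lost the race; anything north of 90% says it never entered it.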
What Gold Lapel does for Solid Queue on PostgreSQL
Gold Lapel is a self-optimizing PostgreSQL proxy. It sits between your Rails application and PostgreSQL, observes every query, and acts on patterns that indicate missing indexes, cacheable results, or connection waste.
Solid Queue on PostgreSQL produces exactly the patterns Gold Lapel was built to detect.
-- What Gold Lapel does for Solid Queue on PostgreSQL:
-- 1. AUTO-INDEXING
-- GL sees the claim query hitting queue_name + priority + id
-- thousands of times per second. If the composite index
-- is missing or suboptimal, GL creates it.
--
-- No manual CREATE INDEX. No migration.
-- The proxy watches, measures, and acts.
-- 2. QUERY RESULT CACHING
-- The DISTINCT queue_name query returns the same 3-5 values
-- thousands of times per minute. GL caches the result
-- and invalidates on INSERT/DELETE to the ready_executions table.
--
-- 500K executions in 15 minutes? GL serves them from cache.
-- Cost: ~0.01ms instead of ~1500ms. CPU reclaimed: ~40%.
-- 3. CONNECTION POOLING
-- Each Solid Queue worker thread holds a PostgreSQL connection
-- for the duration of job execution. A 30-second job ties up
-- a backend process for 30 seconds.
--
-- GL's built-in connection pooling multiplexes worker connections,
-- releasing the backend connection during job processing
-- and reacquiring it only for database operations.
--
-- 10 worker threads, 30-second average job, 100ms avg DB time:
-- Without GL: 10 PostgreSQL connections held continuously.
-- With GL: ~1-2 PostgreSQL connections under normal load.

The DISTINCT queue_name query is a textbook case for result caching. It runs hundreds of times per second. It returns the same 3-5 values every time. The result only changes when a job is enqueued to a new queue or the last job in a queue completes. Gold Lapel caches the result and invalidates on table mutations. The 500K executions per 15 minutes from issue #508? Served from cache in microseconds.
The claim query's composite index is a textbook case for auto-indexing. Gold Lapel sees the same WHERE queue_name = ? ORDER BY priority, id pattern thousands of times per second, checks whether the existing indexes cover it optimally, and creates or adjusts indexes when they do not.
The connection pooling is a textbook case for multiplexing. Solid Queue workers hold connections for the duration of job execution — often seconds or minutes — but only use the database for milliseconds within that window. Gold Lapel releases the backend connection between database operations and reacquires it transparently, reducing 10 held connections to 1-2 active ones.
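The 10-to-2 reduction falls out of a simple utilization model, not anything exotic. A sketch (an estimate of average demand, not a Gold Lapel API):

```ruby
# A thread needs a backend connection only while it is running queries.
threads     = 10
job_seconds = 30.0   # average job duration
db_seconds  = 0.1    # of which actual database time

avg_busy_backends = threads * (db_seconds / job_seconds)
# ~0.03 backends busy on average; a pool of 1-2 absorbs bursts comfortably.
puts avg_busy_backends
```

The average is tiny; the pool of 1-2 exists to cover the moments when several threads hit the database simultaneously.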
No monkey-patches. No initializer hacks. No manual index creation. No per-table autovacuum tuning. Add gem "goldlapel-rails" to your Gemfile, bundle, and Solid Queue on PostgreSQL performs the way it was always supposed to.
I would be a poor waiter indeed if I insisted you manage the household yourself when trained staff are available. But I would be an equally poor waiter if I did not first teach you how the household works. Everything in this article stands on its own, with or without Gold Lapel. The tuning, the patches, the monitoring queries — apply them today. Gold Lapel simply applies them for you, continuously and automatically, without the initializer you have to remember to test against each Solid Queue upgrade.
The complete checklist
If you have read this far, you have the full picture. Allow me to distill it into an ordered list of actions, from most urgent to most optional.
- Enable pg_stat_statements and check whether the DISTINCT query dominates your database CPU. If it does not, you may not need the other steps yet. If it does, proceed immediately.
- Apply the recursive CTE patch via the initializer. This is the single highest-impact change — 33,529x improvement on the DISTINCT query. Pin your Solid Queue version and test the patch on each upgrade.
- Create the composite index (queue_name, priority ASC, id ASC) if it does not exist. Check with the pg_stat_user_indexes query above.
- Tune autovacuum on solid_queue_ready_executions and solid_queue_claimed_executions. These tables churn rows relentlessly and default autovacuum settings cannot keep pace.
- Audit your connection count. Add up Puma threads + Solid Queue threads + dispatcher + maintenance + headroom. If the total exceeds 50, deploy a connection pooler.
- Adjust polling_interval to 0.2-1.0s if you have not applied the CTE patch, or leave it at 0.1s if you have.
- Consider GoodJob if you are starting a new project and want a PostgreSQL-native solution that avoids these issues architecturally.
Seven steps. Most take less than five minutes each. The collective impact is the difference between a database that spends 40% of its CPU answering a five-row question and one that answers it in microseconds.
Your queue was doing its best with the instructions it was given. It simply needed better instructions — and a household that understands the difference between what works on one engine and what works on another. In infrastructure, as in service, attention to these differences is not pedantry. It is professionalism.
I would be remiss if I did not mention the connection pool arithmetic that accompanies this. The guide on Rails connection pool sizing for Puma and Solid Queue addresses the other side of this coin — how many connections these workers actually open, and why the number is never what you configured.