I Regret to Inform You: Your PostgreSQL Table Has Bloat
50GB on disk. 10GB of actual data. Allow me to explain the discrepancy — and resolve it.
Good evening. There is the matter of your disk usage.
You may have noticed that your PostgreSQL tables are larger than the data they contain would suggest. Perhaps significantly larger. A table with 5 million rows that should occupy 2GB is somehow consuming 12GB. The indexes are similarly inflated. The disk usage trend line points upward with a slope that bears no relation to your data growth.
I have attended to a great many databases exhibiting this condition. The symptoms are always the same: disk usage that defies arithmetic, queries that grow slower without explanation, and buffer cache hit ratios that decline as though the database were slowly forgetting how to do its job. The underlying cause is nearly always the same, too.
This is table bloat, and it is among the most common yet least diagnosed problems in PostgreSQL production databases. The good news: it is measurable, it is fixable, and with proper maintenance, it is preventable. The less good news: if left unattended, it compounds. A table at 30% bloat today will not politely remain at 30% tomorrow.
What causes table bloat — and why PostgreSQL tolerates it
PostgreSQL's MVCC (Multi-Version Concurrency Control) architecture is the root of the matter. The PostgreSQL documentation covers the mechanism in considerable detail, if you would like the authoritative treatment. When you update a row, PostgreSQL does not modify it in place. It creates a new version of the row and marks the old version as dead. When you delete a row, it marks that row as dead. In both cases, the old version remains physically present on disk.
This is not a deficiency. It is a deliberate design choice, and a rather elegant one. MVCC allows readers and writers to operate concurrently without blocking each other — a long-running analytical query can read a consistent snapshot of the data while transactions continue to modify it. The trade-off is that dead row versions accumulate, and someone must clean them up.
That someone is autovacuum. When autovacuum runs, it marks the space occupied by dead tuples as available for reuse. Future inserts and updates can then fill that space. The critical detail: regular vacuum does not shrink the physical file. It reclaims space within the file but does not return it to the operating system. If a table bloated from 2GB to 12GB and you run VACUUM, the file remains 12GB — but new data can fill the reclaimed space without further growth.
Bloat accumulates when any of the following hold true:
- Autovacuum is configured too conservatively. The default `autovacuum_vacuum_scale_factor` of 0.2 means vacuum does not trigger until 20% of the table consists of dead tuples. On a 50-million-row table, that is 10 million dead tuples before anyone lifts a finger.
- Long-running transactions hold back cleanup. Vacuum cannot remove dead tuples that might still be visible to an active transaction. A single `BEGIN` left open in a forgotten psql session — or a long-running reporting query — can prevent vacuum from reclaiming any dead tuples across every table.
- The write rate exceeds vacuum throughput. If your application updates 100,000 rows per second and vacuum processes 50,000 dead tuples per second, bloat grows by 50,000 tuples per second. Vacuum cost limits, worker counts, and I/O throttling all constrain how fast vacuum can work.
- Bulk operations without follow-up. A mass update that touches millions of rows creates millions of dead tuples in one operation. Autovacuum will eventually address this, but "eventually" on a 100-million-row table might mean hours later — by which time the bloat has already affected every query touching that table.
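The first point is simple arithmetic you can check against your own tables: autovacuum queues a table for vacuuming once its dead tuples exceed `autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor × reltuples`. A sketch using the instance-wide settings (per-table reloptions, if you have set any, override these):

```sql
-- Sketch: how close each table is to its autovacuum trigger point.
SELECT s.relname,
       s.n_dead_tup,
       round(current_setting('autovacuum_vacuum_threshold')::numeric
             + current_setting('autovacuum_vacuum_scale_factor')::numeric
               * c.reltuples::numeric) AS dead_tuples_to_trigger,
       round(100.0 * s.n_dead_tup
             / GREATEST(current_setting('autovacuum_vacuum_threshold')::numeric
                        + current_setting('autovacuum_vacuum_scale_factor')::numeric
                          * c.reltuples::numeric, 1), 1) AS pct_of_trigger
FROM pg_stat_user_tables s
JOIN pg_class c ON c.oid = s.relid
ORDER BY pct_of_trigger DESC
LIMIT 20;
```

A table sitting perpetually at 90% of its trigger point is accumulating bloat that autovacuum never quite gets around to.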
Once a table has bloated beyond what internal space reuse can correct, only VACUUM FULL, pg_repack (the good people at reorg maintain the project — their GitHub repository has the installation instructions), or pg_squeeze can reduce the physical file size.
How do you detect table bloat?
Three approaches, each with different trade-offs between speed, accuracy, and requirements.
The quick estimate: pg_stat_user_tables
-- Estimate table bloat using pg_stat_user_tables:
SELECT relname,
pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
n_live_tup,
n_dead_tup,
round(100.0 * n_dead_tup / GREATEST(n_live_tup + n_dead_tup, 1), 1) AS dead_pct,
last_autovacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC;

This uses pg_stat_user_tables, which tracks approximate live and dead tuple counts. It is fast and requires no extensions, but the numbers are estimates — they update when autovacuum or ANALYZE runs, not in real time. On a table that has not been analyzed recently, these counts can be materially wrong.
Think of this as a smoke detector. It tells you there may be a fire. It does not tell you which room or how large the flames.
The statistical estimate: no extensions required
-- A more comprehensive bloat estimator (no extensions needed).
-- Compares expected table size (from pg_class stats) to actual size.
-- Adapted from the classic check_postgres query:
SELECT current_database(), schemaname, tablename,
pg_size_pretty(bs * tblpages) AS real_size,
pg_size_pretty(bs * est_pages) AS expected_size,
pg_size_pretty(bs * (tblpages - est_pages)) AS bloat_size,
CASE WHEN tblpages > 0
THEN round(100.0 * (tblpages - est_pages) / tblpages, 1)
ELSE 0 END AS bloat_pct
FROM (
SELECT schemaname, tablename,
cc.reltuples::bigint AS est_rows,
cc.relpages AS tblpages,
bs,
ceil(cc.reltuples *
(datahdr + nullhdr + 4 + ma -
CASE WHEN datahdr % ma = 0 THEN ma ELSE datahdr % ma END)
/ (bs - 20)) AS est_pages
FROM (
SELECT ma, bs, schemaname, tablename,
(datawidth + (hdr + ma - CASE WHEN hdr % ma = 0 THEN ma ELSE hdr % ma END))::numeric AS datahdr,
(maxfracsum * (nullhdr + ma - CASE WHEN nullhdr % ma = 0 THEN ma ELSE nullhdr % ma END)) AS nullhdr
FROM (
SELECT s.schemaname, s.tablename, 23 AS hdr, 8 AS ma, current_setting('block_size')::int AS bs,
SUM((1 - s.null_frac) * s.avg_width) AS datawidth,
MAX(s.null_frac) AS maxfracsum,
23 + COUNT(*) / 8 AS nullhdr
FROM pg_stats s
WHERE s.schemaname NOT IN ('pg_catalog', 'information_schema')
GROUP BY s.schemaname, s.tablename
) sub1
) sub2
JOIN pg_class cc ON cc.relname = sub2.tablename
JOIN pg_namespace nn ON cc.relnamespace = nn.oid AND nn.nspname = sub2.schemaname
) sub3
WHERE tblpages > est_pages + 10
ORDER BY (tblpages - est_pages) DESC
LIMIT 20;

This query compares the expected table size (calculated from column statistics in pg_stats) against the actual size on disk. The difference is an estimate of bloat. It requires no extensions and works on managed PostgreSQL services where you cannot install pgstattuple.
The accuracy depends on up-to-date statistics. If ANALYZE has not run recently, the estimates will be off. Run ANALYZE on suspect tables first, then use this query.
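The refresh itself is a one-liner (the table name is illustrative):

```sql
-- Refresh planner statistics so the estimator above has current column stats.
ANALYZE VERBOSE orders;
```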
The precise measurement: pgstattuple
-- For precise bloat measurement, use pgstattuple extension:
CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT * FROM pgstattuple('orders');
-- Key fields:
-- table_len: total table size in bytes
-- tuple_len: size of live tuples
-- dead_tuple_len: size of dead tuples
-- free_space: reclaimable space within pages
-- tuple_percent: live data as % of total size
-- If tuple_percent is 20%, the table is 80% waste.

The pgstattuple extension — documented thoroughly in the official PostgreSQL documentation — performs a full table scan and reports exact space usage. The tuple_percent field tells you what fraction of the table is live data. If it reads 20%, the table is 80% bloat — 80% of every sequential scan is reading dead space.
Be aware that pgstattuple reads the entire table, so it is slow on large tables. On a 100GB table, expect it to take several minutes. Run it during off-peak hours, and run it on specific tables you suspect — not on every table in the database.
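When a full scan is too costly, the same extension offers a compromise: `pgstattuple_approx` (PostgreSQL 9.5+) consults the visibility map to skip pages known to contain only visible tuples and scans the rest. A sketch (table name illustrative):

```sql
-- Approximate bloat measurement: far cheaper than a full pgstattuple scan,
-- at the cost of some precision on the skipped pages.
SELECT pg_size_pretty(table_len) AS table_size,
       scanned_percent,
       approx_tuple_percent,
       dead_tuple_percent,
       approx_free_percent
FROM pgstattuple_approx('orders');
```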
I should note: pgstattuple is available on most managed PostgreSQL services (Amazon RDS, Google Cloud SQL, Azure Database for PostgreSQL). If you are on a service that does not support it, the statistical estimate above is your next best option.
How does bloat affect query performance?
Bloat degrades performance through several mechanisms, and the effects compound. Allow me to be specific.
Sequential scans read more pages. A sequential scan reads every page in the table, dead tuples included. At 50% bloat, every sequential scan reads twice the data it needs to. At 80% bloat, five times. On a table that should occupy 2GB, 80% bloat means scanning 10GB for every SELECT COUNT(*).
Buffer cache efficiency drops. PostgreSQL's shared buffer pool holds recently accessed pages in memory. When half the pages contain dead tuples, half your cache is storing waste. The effective cache size — the portion that actually accelerates queries — is halved. Queries that previously ran entirely from memory begin hitting disk.
I/O amplifies. More pages to read means more disk I/O. On tables that fit in memory at their natural size but not at their bloated size, the difference is dramatic — the distinction between a 2ms cached read and a 5-15ms disk read, multiplied by thousands of pages.
Index scans degrade. Index bloat is discussed in its own section below, but even with healthy indexes, table bloat matters for index scans. An index lookup returns a tuple identifier (TID) pointing to a heap page. If the heap is bloated, more pages must be fetched, and the visibility map is less useful for index-only scans.
| Bloat level | Seq scan impact | Index impact | Disk impact | Action |
|---|---|---|---|---|
| 0-5% | Negligible | None | Normal | Healthy — no action |
| 5-20% | Up to 25% slower | Slightly larger | Moderate growth | Tune autovacuum |
| 20-50% | 2-3x slower | Noticeably larger | Significant waste | Urgent vacuum or repack |
| 50-80% | 4-5x slower | Doubled or tripled | Majority is waste | REINDEX + pg_repack |
| 80%+ | 10x+ slower | Severely degraded | Disk pressure | Emergency remediation |
At 50% bloat, every sequential scan reads twice the data it needs to. Buffer cache is half-filled with dead data, reducing the effective cache hit ratio. Index lookups traverse larger B-trees with more dead entries to skip. The cumulative effect is a database that performs as if it has half the hardware it actually does.
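One way to watch the cache mechanism in motion is the per-table hit ratio in `pg_statio_user_tables`. A sketch:

```sql
-- Per-table buffer cache hit ratio. A steadily falling ratio on a
-- hot table is consistent with bloat crowding useful pages out of cache.
SELECT relname,
       heap_blks_read,
       heap_blks_hit,
       round(100.0 * heap_blks_hit
             / GREATEST(heap_blks_hit + heap_blks_read, 1), 1) AS hit_pct
FROM pg_statio_user_tables
ORDER BY heap_blks_read DESC
LIMIT 10;
```

Track this over weeks, not minutes; the counters are cumulative since the last statistics reset.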
I have seen production databases where a 3x hardware upgrade — more RAM, faster storage, more CPU — was proposed to address performance degradation that was entirely caused by 60% table bloat. The pg_repack that resolved it took 40 minutes and cost nothing.
Index bloat: the quieter problem
-- Index bloat is separate from table bloat and equally damaging:
SELECT indexrelname,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
idx_scan AS times_used,
idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
WHERE schemaname = 'public'
ORDER BY pg_relation_size(indexrelid) DESC
LIMIT 10;
-- An index 5x larger than expected with normal scan counts
-- is likely bloated. REINDEX or pg_repack will fix it.

Index bloat is separate from table bloat and can be even more impactful. A bloated B-tree index has more levels, more pages to traverse, and more dead entries that must be skipped during scans. The performance effect is felt on every indexed lookup — not just sequential scans.
Consider a B-tree index on orders.created_at. At its healthy size, the tree is 3 levels deep and fits entirely in memory. At 4x bloat, the tree is 4 levels deep and spills to disk. Every index lookup now requires one additional page fetch, and some of those fetches go to disk instead of cache. Multiply by the thousands of index lookups per second in a typical OLTP workload, and the aggregate cost is substantial.
Measuring index bloat precisely
-- pgstattuple also measures index bloat precisely:
SELECT * FROM pgstatindex('orders_pkey');
-- Key fields:
-- tree_level: B-tree depth (more levels = slower lookups)
-- internal_pages: pages used for internal nodes
-- leaf_pages: pages used for leaf nodes
-- empty_pages: pages with no entries
-- deleted_pages: pages marked deleted but not yet recycled
-- avg_leaf_density: percentage of leaf pages in use
-- A healthy B-tree index has avg_leaf_density above 70%.
-- Below 50% means significant bloat.

The pgstatindex function (also from the pgstattuple extension) reports exact B-tree statistics. The avg_leaf_density field is the key metric — it tells you what percentage of leaf pages are actually in use. Below 50% means the index is wasting more space than it uses.
The deleted_pages count reveals pages that B-tree operations have marked for recycling but that have not yet been reclaimed. A high count here suggests the index has undergone significant churn.
Fixing index bloat
-- REINDEX CONCURRENTLY (PostgreSQL 12+) rebuilds without blocking:
REINDEX INDEX CONCURRENTLY idx_orders_created_at;
-- For all indexes on a table:
REINDEX TABLE CONCURRENTLY orders;
-- What happens under the hood:
-- 1. A new index is built alongside the old one
-- 2. Concurrent reads/writes use the old index during the build
-- 3. Once complete, the old index is swapped out atomically
-- 4. The old index is dropped
-- Requires ~2x the index space temporarily.
-- Cannot run inside a transaction block.

Unlike table bloat, index bloat is not resolved by regular VACUUM. Vacuum removes dead tuple references from indexes, but it does not compact the index structure itself. A B-tree that grew to 4 levels during a bloat episode will remain at 4 levels even after vacuum removes all the dead entries — the pages exist but are mostly empty.
REINDEX CONCURRENTLY (PostgreSQL 12+) rebuilds an index without blocking concurrent reads or writes. For production systems, this is the preferred approach. It requires temporary disk space equal to the index size and cannot run inside a transaction block, but it allows normal operations to continue throughout the rebuild.
For PostgreSQL versions before 12, CREATE INDEX CONCURRENTLY followed by dropping the old index achieves the same result with one additional step.
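A sketch of that pre-12 pattern (index and table names are illustrative; an index backing a constraint instead needs `ALTER TABLE ... ADD CONSTRAINT ... USING INDEX` rather than a plain swap):

```sql
-- Build a fresh copy of the index without blocking writes:
CREATE INDEX CONCURRENTLY idx_orders_created_at_new
    ON orders (created_at);
-- Drop the bloated original, again without blocking:
DROP INDEX CONCURRENTLY idx_orders_created_at;
-- Restore the original name:
ALTER INDEX idx_orders_created_at_new RENAME TO idx_orders_created_at;
```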
The long-running transaction trap
-- Find long-running transactions holding back vacuum:
SELECT pid,
now() - xact_start AS duration,
state,
LEFT(query, 80) AS query_preview,
backend_type
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
AND state != 'idle'
AND now() - xact_start > interval '5 minutes'
ORDER BY xact_start;
-- Also check for idle-in-transaction connections:
SELECT pid,
now() - state_change AS idle_duration,
LEFT(query, 80) AS last_query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
AND now() - state_change > interval '1 minute'
ORDER BY state_change;
-- A single idle-in-transaction session can prevent vacuum
-- from cleaning ANY dead tuples across ALL tables.

I must draw particular attention to this, because it is the single most common cause of catastrophic bloat I encounter in production systems.
PostgreSQL's MVCC requires that dead tuples remain visible to any transaction that started before they were marked dead. Vacuum respects this requirement absolutely — it will not remove a dead tuple if any active transaction might still need to see it. The implication is severe: a single long-running transaction, anywhere in the database, prevents vacuum from cleaning dead tuples in every table.
The usual culprits:
- Forgotten interactive sessions. A developer opens `psql`, runs `BEGIN`, executes a query, gets distracted, and leaves the session open. For the next eight hours, vacuum cannot clean anything.
- Long analytical queries. A reporting query that scans a large table for 30 minutes holds a snapshot for the entire duration. Every dead tuple created during those 30 minutes is preserved.
- ORMs with implicit transactions. Some ORMs and connection pools hold transactions open longer than necessary. A web request that opens a transaction on first database access and closes it on response completion — even if the database work finished 200ms ago and the response takes 2 seconds to render — holds back vacuum for the entire request duration.
- Replication slots without consumers. Logical replication slots hold back the `xmin` horizon. A replication slot whose consumer has died prevents vacuum cleanup indefinitely, without any visible transaction in `pg_stat_activity`.
Monitor pg_stat_activity for connections in the idle in transaction state. Set idle_in_transaction_session_timeout to terminate them automatically. In PostgreSQL 14+, idle_session_timeout catches idle connections before they even begin a transaction.
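A sketch of those settings (the durations are starting points, not prescriptions):

```sql
-- Terminate sessions that sit idle inside an open transaction:
ALTER SYSTEM SET idle_in_transaction_session_timeout = '5min';
-- PostgreSQL 14+: terminate sessions idle outside any transaction:
ALTER SYSTEM SET idle_session_timeout = '1h';
SELECT pg_reload_conf();
```

Set the transaction timeout generously enough that legitimate reporting queries survive; the goal is to catch the forgotten session, not the working one.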
For replication slots, monitor `pg_replication_slots` and alert on any slot where `active = false`, or whose `xmin` is falling ever further behind the current transaction horizon. An unused replication slot is one of the quietest ways to bring a database to its knees.
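A sketch of such a check, listing inactive slots by how much WAL they are forcing the server to retain (run on the primary):

```sql
-- Replication slots with no active consumer, and the WAL they pin.
SELECT slot_name,
       slot_type,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn))
         AS retained_wal
FROM pg_replication_slots
WHERE NOT active
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;
```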
Fixing table bloat: your options
Option 1: VACUUM FULL (with downtime)
-- VACUUM FULL rewrites the entire table, reclaiming all dead space.
-- But it acquires an ACCESS EXCLUSIVE lock — no reads or writes.
VACUUM FULL orders;
-- On a 50GB table, this takes minutes to hours.
-- During that time, every query touching orders will block.
-- In production, this is usually not acceptable.

VACUUM FULL rewrites the entire table into a new, compact file. It is thorough but requires an ACCESS EXCLUSIVE lock — no other session can read or write the table during the operation. For large tables in production, this means downtime.
On a 50GB table with modern NVMe storage, expect 10-30 minutes. On spinning disks, multiply by 3-5x. During this time, every query that touches the table will queue behind the lock. If your application has no retry logic, those queries will fail when they hit connection timeouts.
VACUUM FULL also requires temporary disk space equal to the table size — it creates the new copy before dropping the old one. If you are already at 90% disk usage, you may not have the space to run it.
I would recommend VACUUM FULL only during planned maintenance windows on tables that can tolerate the lock duration. For production systems with uptime requirements, the next two options are preferable.
Option 2: pg_repack (without downtime)
-- pg_repack rebuilds the table online, without exclusive locks.
-- Install the extension first:
CREATE EXTENSION pg_repack;
-- Repack a single table:
$ pg_repack --table orders --no-order mydb
-- Repack a table and its indexes:
$ pg_repack --table orders mydb
-- pg_repack creates a new copy, replays changes via triggers,
-- then swaps the tables atomically. Reads and writes continue
-- throughout. The old bloated copy is dropped.

pg_repack achieves the same result without exclusive locks. It creates a shadow copy of the table, uses triggers to capture concurrent changes, and atomically swaps the tables when the copy is complete. Reads and writes continue uninterrupted throughout the operation.
For production databases, pg_repack is the standard answer. It handles tables and indexes, supports concurrent operation, and is widely used in large PostgreSQL deployments.
The trade-offs are worth knowing:
- Temporary disk space. `pg_repack` needs space for the full copy of the table and its indexes. On a 100GB table with 50GB of indexes, you need 150GB of free disk space.
- Write amplification. The triggers that capture concurrent changes add overhead to every write during the repack. On extremely write-heavy tables (50,000+ writes per second), this can slow write throughput by 10-30%.
- Brief exclusive lock at the end. The final swap requires a very brief exclusive lock — typically milliseconds. But if there are long-running queries or competing locks, this swap can be delayed.
- Extension installation required. Some managed PostgreSQL services do not support `pg_repack`. Check your provider's extension list.
Option 3: pg_squeeze (automatic, logical decoding)
-- pg_squeeze is an alternative to pg_repack.
-- It uses logical decoding instead of triggers — less overhead
-- on high-write tables, but requires wal_level = logical.
CREATE EXTENSION pg_squeeze;
-- Register a table for automatic squeezing when bloat exceeds a threshold:
SELECT squeeze.start_worker();
INSERT INTO squeeze.tables (tabschema, tabname, free_space_extra)
VALUES ('public', 'orders', 25); -- squeeze when >25% free space
-- pg_squeeze checks periodically and repacks automatically.
-- Lower impact on write-heavy workloads than trigger-based pg_repack.

pg_squeeze takes a different approach: instead of triggers, it uses logical decoding to capture concurrent changes. This means less overhead on write-heavy workloads, but it requires wal_level = logical (which increases WAL volume by approximately 10-20%).
The distinguishing feature is automation. You register tables with bloat thresholds, and pg_squeeze monitors and repacks them automatically. For teams that want to address bloat as an ongoing operational concern rather than an emergency response, this is appealing.
I should be forthcoming: pg_squeeze is less widely adopted than pg_repack. If your team has experience with pg_repack and it meets your needs, switching to pg_squeeze for its own sake is not necessary. The best tool for bloat remediation is the one your team understands and has practiced with.
"Before prescribing solutions, gather evidence. PostgreSQL provides built-in diagnostic tools that, together, reveal the nature of nearly any performance problem."
— from You Don't Need Redis, Chapter 18: The PostgreSQL Performance Decision Framework
Prevention: the only sustainable strategy
-- Prevention is far less disruptive than remediation.
-- 1. Tune autovacuum aggressively for write-heavy tables:
ALTER TABLE orders SET (
autovacuum_vacuum_scale_factor = 0.02,
autovacuum_vacuum_threshold = 1000,
autovacuum_analyze_scale_factor = 0.01,
autovacuum_analyze_threshold = 500
);
-- 2. Monitor dead tuple ratios continuously:
-- Alert when any table exceeds 10% dead tuples.
-- 3. For tables with heavy UPDATE patterns, consider HOT updates:
-- If the updated columns are NOT indexed, PostgreSQL can update
-- in-place (Heap Only Tuple), avoiding index bloat entirely.
-- Fewer indexes on frequently-updated columns = less bloat.

Remediation is a fire drill. Prevention is a fire code. I have a strong preference for the latter.
Prevention rests on four pillars:
- Aggressive autovacuum tuning. The single most impactful change. Lower `autovacuum_vacuum_scale_factor` to 0.02 on write-heavy tables, increase `autovacuum_vacuum_cost_limit` to 400-800 on modern storage, and add autovacuum workers if your server has CPU and I/O headroom. The autovacuum tuning guide covers specific settings by workload type, if you wish to go further.
- Continuous monitoring. Track `n_dead_tup` as a time series. Set alerts at 10%. Watch the ratio of dead to live tuples over time — a ratio that climbs steadily means autovacuum is losing the race. Catch problems before they become emergencies.
- Eliminate long-running transactions. Set `idle_in_transaction_session_timeout` (e.g., 5 minutes). Monitor for connections that hold transactions open. Train developers to use explicit, short-lived transactions rather than leaving connections in transaction state.
- Schema design that reduces dead tuple generation. This is the least obvious pillar but can be the most effective. I shall elaborate.
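A sketch of the instance-wide half of the first pillar (the values are starting points, not prescriptions; the per-table settings shown above handle the hot tables):

```sql
-- Let vacuum do more work per cost-accounting cycle on modern storage:
ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 800;
-- More workers, if CPU and I/O headroom permit.
-- Note: this setting requires a server restart to take effect.
ALTER SYSTEM SET autovacuum_max_workers = 6;
SELECT pg_reload_conf();
```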
HOT updates and fillfactor
-- fillfactor reserves space in each page for future updates.
-- Default is 100 (pack pages full). For update-heavy tables:
ALTER TABLE orders SET (fillfactor = 80);
-- Now each 8KB page reserves 20% for in-page updates.
-- This enables HOT updates — the old and new row versions
-- live on the same page, so no index entry updates needed.
-- After changing fillfactor, you need to rewrite the table:
VACUUM FULL orders; -- or pg_repack
-- fillfactor trades disk space for update performance:
-- 80% fillfactor = 25% more disk, but dramatically fewer
-- index-entry updates and less index bloat.

A Heap Only Tuple (HOT) update occurs when PostgreSQL can place the new row version on the same heap page as the old one, and none of the updated columns are indexed. When both conditions are met, the update creates no new index entries — the index still points to the original tuple, which chains to the new version via a HOT chain.
This is significant for bloat because index entry creation is a major source of index bloat. If you update a row 10 times and each update creates index entries in 5 indexes, you have generated 50 dead index entries. With HOT updates, you generate zero.
To enable HOT updates, two things must be true:
- The updated columns must not appear in any index. If you have an index on `updated_at` and every update touches `updated_at`, no update to that table can use HOT. Consider whether that index is worth the bloat cost.
- There must be free space on the same page. The `fillfactor` setting controls this. A fillfactor of 80 reserves 20% of each page for in-page updates.
The trade-off is straightforward: fillfactor 80 uses 25% more disk space, but reduces index bloat dramatically on update-heavy tables. For a table that receives thousands of updates per second, this is almost certainly worth it. For a table that is primarily insert-only, it is wasted space.
You can check your current HOT update rate in pg_stat_user_tables: the n_tup_hot_upd column tracks HOT updates, and n_tup_upd tracks total updates. If the ratio is low on a table with frequent updates, you are generating more index bloat than necessary.
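A sketch of that check:

```sql
-- HOT update ratio per table. A low ratio on an update-heavy table
-- means updates are generating avoidable index entries.
SELECT relname,
       n_tup_upd,
       n_tup_hot_upd,
       round(100.0 * n_tup_hot_upd / GREATEST(n_tup_upd, 1), 1) AS hot_pct
FROM pg_stat_user_tables
WHERE n_tup_upd > 10000
ORDER BY hot_pct ASC;
```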
TOAST bloat: the hidden dimension
-- Large values (>2KB) are stored in TOAST tables.
-- TOAST bloat is often invisible — it doesn't appear in
-- pg_stat_user_tables dead tuple counts.
-- Check TOAST table sizes:
SELECT c.relname AS table_name,
pg_size_pretty(pg_relation_size(c.oid)) AS table_size,
pg_size_pretty(pg_relation_size(t.oid)) AS toast_size,
t.relname AS toast_table
FROM pg_class c
JOIN pg_class t ON t.oid = c.reltoastrelid
WHERE pg_relation_size(t.oid) > 0
ORDER BY pg_relation_size(t.oid) DESC
LIMIT 10;
-- If toast_size >> table_size, TOAST is your bloat problem.
-- VACUUM FULL or pg_repack on the main table will also
-- compact the TOAST table.

Tables with large column values — text, jsonb, bytea — store those values in a separate TOAST table. TOAST tables bloat independently of the main table, and their bloat does not appear in pg_stat_user_tables dead tuple counts.
I have encountered systems where the main table showed minimal bloat — 3% dead tuples, autovacuum running regularly — but the TOAST table was 10x the size of the main table. The culprit: a jsonb column that was updated frequently. Each update created a new TOAST entry and marked the old one as dead. The main table's dead tuple count reflected only the heap row, not the TOAST data.
Check TOAST table sizes explicitly. If a TOAST table dwarfs its parent, you have TOAST bloat. VACUUM FULL or pg_repack on the main table will compact the TOAST table as well.
Partitioning: making bloat manageable at scale
-- Table partitioning can make bloat manageable:
-- Instead of one 500GB table, you have 12 monthly partitions.
-- Drop old partitions entirely (instant space reclamation):
ALTER TABLE events DETACH PARTITION events_2024_01;
DROP TABLE events_2024_01;
-- Repack one partition at a time (smaller, faster, less disk needed):
$ pg_repack --table events_2025_01 mydb
-- Vacuum one partition at a time:
VACUUM (VERBOSE) events_2025_03;
-- Partitioning doesn't prevent bloat, but it makes every
-- remediation tool faster and less disruptive.

For very large tables — hundreds of gigabytes or more — partitioning transforms the bloat problem from one large emergency into many small, manageable maintenance tasks.
Consider a 500GB events table partitioned by month. Instead of repacking 500GB (which requires 500GB of free disk space and hours of runtime), you repack the current month's partition — perhaps 40GB. Each partition has its own autovacuum settings, its own dead tuple count, and its own vacuum cycle. Old partitions can be dropped entirely, reclaiming space instantly without any vacuum overhead.
Partitioning does not prevent bloat. A poorly-tuned partition will bloat just as readily as a monolithic table. But it makes every diagnostic tool faster, every remediation tool less disruptive, and every emergency less severe. It reduces the blast radius.
If you have a table that exceeds 100GB and receives continuous writes, partitioning is worth the schema change. The upfront cost is meaningful — queries must be partition-aware, foreign keys to partitioned tables require care, and some ORMs handle partitions poorly — but the operational benefit over the lifetime of the system is substantial.
When the table is beyond easy remediation
On very large tables — terabytes of data, thousands of writes per second — even pg_repack requires significant temporary disk space and generates substantial I/O that can affect other workloads sharing the same storage. For these cases, a layered approach is necessary:
- Stop the bleeding. Identify and terminate long-running transactions. Tune autovacuum to match the write rate. These changes are immediate and low-risk.
- Triage by impact. Use `pgstattuple` or the statistical estimate to measure bloat on each table. Prioritize tables by query volume — a 60% bloated table that handles 10,000 queries per second matters more than a 90% bloated archive table that sees 3 queries per day.
- Repack incrementally. If the table is partitioned, repack one partition at a time during low-traffic hours. If it is not partitioned, consider whether the situation warrants an online migration to a partitioned schema.
- Plan for the future. Once remediated, set per-table autovacuum settings, add monitoring, and establish runbooks. Bloat remediation should be a scheduled operation, not an emergency response.
Honest counterpoints
I would be remiss to present bloat as a universally urgent problem without some qualifications.
Some bloat is normal and healthy. A table at 5-10% bloat is functioning as designed. The free space left by dead tuples is reused by future inserts and updates. A table that is perfectly compacted after every write would actually perform worse, because every insert would need to extend the file rather than reuse existing space. The goal is not zero bloat — it is stable, manageable bloat where vacuum keeps pace with writes.
Not all tables warrant aggressive intervention. A small lookup table — 10,000 rows, 2MB on disk — can be 80% bloated and nobody will notice. The performance impact of reading 10MB instead of 2MB is negligible for a table that fits entirely in shared buffers either way. Reserve your attention for tables where bloat produces measurable performance degradation.
Modern PostgreSQL is better at this than its reputation suggests. The autovacuum improvements in PostgreSQL 12 (reduced cost delay from 20ms to 2ms) and 13 (parallel vacuum of indexes, improved tracking of insert-only tables) have meaningfully reduced the prevalence of severe bloat. If you are running PostgreSQL 15 or later with even moderately tuned autovacuum settings, catastrophic bloat is less common than the older blog posts would have you believe. It still happens — long-running transactions and misconfigured autovacuum will defeat any version — but the baseline is better.
Managed databases handle some of this for you. Services like Amazon RDS, Google Cloud SQL, and Azure Database for PostgreSQL have their own autovacuum tuning and monitoring. They do not always get it right — default settings on managed services are often just as conservative as stock PostgreSQL — but they do provide monitoring dashboards and alert capabilities that reduce the chance of bloat going unnoticed for weeks.
A summary of tools for the well-prepared household
| Tool | Purpose | Locking | Notes |
|---|---|---|---|
| `VACUUM` | Reclaim space for reuse | None (concurrent) | Does not shrink file, but prevents further growth |
| `VACUUM FULL` | Compact table, shrink file | ACCESS EXCLUSIVE | Requires downtime. Use only in maintenance windows. |
| `pg_repack` | Compact table, shrink file | Brief lock at swap | Production-safe. Needs temp disk equal to table size. |
| `pg_squeeze` | Automatic compaction | Brief lock at swap | Requires `wal_level=logical`. Automatic monitoring. |
| `REINDEX CONCURRENTLY` | Rebuild indexes online | None (concurrent) | PostgreSQL 12+. Needs temp disk equal to index size. |
| `pgstattuple` | Precise bloat measurement | Share lock (read) | Full table scan — slow on large tables |
| Table partitioning | Reduce blast radius | N/A | Schema change. Enables per-partition maintenance. |
The well-maintained table
Bloat is not a mystery. It is not unpredictable. It is the natural consequence of MVCC in a write-heavy system where vacuum has not been given the resources or configuration to keep pace. Every case of severe bloat I have encountered could have been prevented with monitoring, tuning, and attention to long-running transactions.
The database was not slow. It was carrying weight it should have been relieved of long ago.
Gold Lapel monitors the query patterns that contribute to bloat — write-heavy tables with poor vacuum coverage, long-running transactions that block cleanup, and index-heavy tables where updates generate disproportionate dead entries. The proxy can surface these patterns before they manifest as bloat, giving you the information to act proactively rather than reactively.
If you will permit me one final observation: the best time to address bloat is before it arrives. The second best time is now.
While the matter is fresh in your mind, there is a piece I have written on the best PostgreSQL extensions for performance that covers pg_repack, pg_stat_statements, and the other tools that make bloat detection and remediation less of an archaeological expedition.