Chapter 18: Scaling: From Laptop to Cluster
The kettle is on. The scaling question, if I may observe, tends to present itself somewhat earlier than the occasion requires. Someone reads a blog post about a company running thousands of shards, and by Monday morning the architecture diagram has sprouted multiple regions, a coordinator tier, and a frankly alarming number of arrows. The team hasn't measured their single node yet. They don't know whether they're CPU-bound or connection-bound. They have no idea what their working set looks like in memory. But they know — with the conviction that only unmeasured fear can produce — that they are about to outgrow PostgreSQL.
I am here to offer you a different sort of evening. Allow me to walk you through the PostgreSQL scaling progression in the order you should actually climb it: laptop, single server, read replicas, partitioning, Citus, mesh. Each rung handles roughly an order of magnitude more work than most teams believe. My aim is not to sell you the top of the ladder. It is to help you find the rung that costs you the least to stand on — and to keep you there for as long as the workload allows.
A word, before we begin, about the house across the street. MongoDB's sharding story is mature and well-documented, and that maturity is thoroughly earned. The argument of this chapter is narrower: PostgreSQL's scaling story is also mature, and it is rarely told end-to-end for document workloads. Tonight, we shall tell it end-to-end. Call it PostgreSQL document store scaling, and treat it as a single progression rather than five unrelated tools.
The PostgreSQL Scaling Progression
The ladder has five rungs. Each unlocks a class of workload, and each adds operational cost. The PostgreSQL scaling progression looks like this:
| Rung | Unlocks | Adds |
|---|---|---|
| 1. Single node | Tens of thousands of ops/sec for indexed document reads | Nothing you don't already have |
| 2. Read replicas | Read throughput beyond one machine; failover target | Replica lag, routing logic, read-after-write handling |
| 3. Table partitioning | Very large individual tables on one box | Partition management, retention windows |
| 4. Citus | Aggregate throughput beyond one machine | Distribution column choice, shard rebalancing |
| 5. Mesh | Multi-region cache coherence | Discovery, invalidation topology |
The rule is simple, and I should like to state it plainly: climb only as far as your workload demands, and measure before you climb. Every rung above the one you need is a tax you will pay forever.
Single Node PostgreSQL Handles More Than You Think
Before anything else, a matter of proportion — because the scaling conversation, in my experience, is most often a proportion problem before it is a hardware one. A well-tuned PostgreSQL instance on modern commodity hardware is a serious machine, not a starter kit one graduates from. It is a workhorse many deployments never need to leave. For GIN-indexed JSONB containment lookups on a warm working set, a single node of reasonable size comfortably handles high-throughput document reads — well into the territory where most teams stop worrying about the database and start worrying about the application. That is not the ceiling of a distributed cluster. That is rung one.
Three levers matter for document workloads on a single node: GIN indexes on JSONB, materialized views for read-heavy aggregations, and a connection pooler. The first two were the subjects of Chapters 4 and 12, respectively, and I shall not trespass on territory already covered. The third is where most teams first meet a wall, and it deserves naming directly.
A GIN index on JSONB, using the jsonb_path_ops operator class, is the workhorse of document lookups:
```sql
CREATE TABLE documents (
    id         BIGSERIAL PRIMARY KEY,
    data       JSONB NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX documents_doc_gin
    ON documents
    USING GIN (data jsonb_path_ops);
```

jsonb_path_ops, available since PostgreSQL 9.4, produces a smaller and faster index than the default jsonb_ops class for containment queries of the form data @> '{"status": "active"}'. The trade, if I may be candid about it, is that jsonb_path_ops supports a narrower set of operators. For the containment-heavy access patterns typical of document workloads, it is the right trade. For others, it is not. One should know which workload one has.
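Containment is the operation a document-style equality filter compiles down to. A minimal sketch of that translation, with an invented helper (`containment_query` is illustrative, not Gold Lapel's actual query builder):

```python
import json

def containment_query(table: str, filt: dict) -> tuple[str, list]:
    # An equality-only filter maps directly onto the @> containment
    # operator, which a jsonb_path_ops GIN index accelerates.
    sql = f"SELECT id, data FROM {table} WHERE data @> %s::jsonb"
    return sql, [json.dumps(filt)]

sql, params = containment_query("documents", {"status": "active"})
# sql:    SELECT id, data FROM documents WHERE data @> %s::jsonb
# params: ['{"status": "active"}']
```

The filter travels as a single JSONB parameter, so the same prepared statement serves every containment lookup against the table.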
The second lever is PgBouncer. A PostgreSQL backend is a full process with its own memory footprint, and opening thousands of direct connections from application servers is, I'm afraid, how single-node deployments die. PgBouncer in transaction-pooling mode multiplexes many client connections onto a small pool of backend connections:
```ini
# pgbouncer.ini
[databases]
app = host=127.0.0.1 port=5432 dbname=app

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = scram-sha-256
pool_mode = transaction
default_pool_size = 40
max_client_conn = 5000
```

With transaction pooling, a backend connection is handed to a client for the duration of a single transaction and returned to the pool at commit. Forty backends can comfortably serve five thousand idle clients. The arithmetic is gentle; the effect is not.
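The gentleness of that arithmetic is worth making concrete. A back-of-the-envelope sketch, assuming roughly 10 MB of resident memory per backend process (an assumed order of magnitude; the real figure varies with work_mem and shared library mappings):

```python
BACKEND_MB = 10   # assumed resident memory per PostgreSQL backend (illustrative)

clients = 5000    # application connections held open against the pooler
pool = 40         # backend connections PgBouncer actually opens

direct_mb = clients * BACKEND_MB   # one backend per client connection
pooled_mb = pool * BACKEND_MB      # clients multiplexed onto the pool

print(f"direct: {direct_mb} MB, pooled: {pooled_mb} MB")
# direct: 50000 MB, pooled: 400 MB
```

Roughly 50 GB of backend memory versus roughly 400 MB, for the same client population. The exact per-backend figure is an assumption; the two orders of magnitude between the columns are not.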
One historical caveat deserves mention, because pretending otherwise would be a disservice. Named prepared statements did not survive transaction pooling cleanly until PgBouncer 1.21, released in October 2023, added protocol-level support for prepared statements in transaction mode — leveraging PostgreSQL's existing extended query protocol rather than any new server-side feature. On PgBouncer 1.21 and later, transaction pooling works with prepared statements against any modern PostgreSQL version. On older PgBouncer releases, session pooling is the safer default. One works with the versions one has.
Pointing your client at the pooler is one line:
```yaml
# Gold Lapel config
connection_url: "postgres://app@127.0.0.1:6432/app"
```

The application does not know PgBouncer exists. That is rather the point.
Vertical Scaling and Connection Pooling Before You Shard
The cheapest scaling decision, allow me to observe, is a bigger box. RAM for the working set, NVMe for the write-ahead log, and a pooler in front. It is not the most fashionable answer — larger instances rarely feature in architecture talks — but this combination carries most document workloads further than most teams expect, and it carries them without introducing a single distributed-systems problem. On reflection, "solves your scaling problem without adding a distributed-systems problem" is rather a remarkable property for a scaling strategy to have.
Before you decide you have outgrown a single node, I should like you to measure three things.
First, connection count. The canonical "you need a pooler" signal lives in pg_stat_activity:
```sql
SELECT state, count(*)
FROM pg_stat_activity
WHERE backend_type = 'client backend'
GROUP BY state
ORDER BY count(*) DESC;
```

If you see hundreds of connections in the idle state, you are not CPU-bound or disk-bound. You are burning backend memory on connections that are doing absolutely nothing at all. A pooler addresses this in an afternoon. In my experience, connection exhaustion is the most common "PostgreSQL can't scale" failure mode — and it is almost never a scaling problem at all. It is a pooling problem wearing the costume of a scaling problem.
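The decision that query drives can be sketched as a few lines of triage. The counts below are hypothetical, and the 50% idle threshold is a rule of thumb rather than a standard:

```python
def diagnose(state_counts: dict[str, int]) -> str:
    # If most client backends sit idle, the bottleneck is connection
    # management, not hardware: a pooler is the fix, not a bigger box.
    total = sum(state_counts.values())
    idle = state_counts.get("idle", 0)
    if total and idle / total > 0.5:
        return "connection-bound: deploy a pooler"
    return "resource-bound: profile CPU, memory, and disk next"

# Hypothetical output of the GROUP BY state query above
print(diagnose({"idle": 430, "active": 12, "idle in transaction": 8}))
# connection-bound: deploy a pooler
```

The point of writing it down is the shape of the branch: one path costs an afternoon of PgBouncer configuration, the other costs hardware, and conflating the two is how distributed clusters get bought to solve pooling problems.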
Second, memory configuration. On a 64 GB server dedicated to PostgreSQL, reasonable starting values are:
```ini
shared_buffers = 16GB           # ~25% of RAM
effective_cache_size = 48GB     # ~75% of RAM, informs the planner
work_mem = 32MB                 # per-operation, not global
maintenance_work_mem = 2GB      # for VACUUM, CREATE INDEX
```

The 25% guidance for shared_buffers has held across several major versions and remains the reasonable starting point for mixed workloads. effective_cache_size is a hint to the query planner about how much data it can assume is hot in the OS page cache; it allocates nothing, and it influences everything.
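Those two headline values are simple fractions of the machine, and a sketch makes the rule portable to other box sizes (the helper name is invented; the 25%/75% starting points are the ones stated above, to be tuned from with measurement):

```python
def starting_values(ram_gb: int) -> dict[str, str]:
    # Rule-of-thumb starting points: shared_buffers at ~25% of RAM,
    # effective_cache_size at ~75%. Starting points, not final answers.
    return {
        "shared_buffers": f"{ram_gb // 4}GB",
        "effective_cache_size": f"{ram_gb * 3 // 4}GB",
    }

print(starting_values(64))
# {'shared_buffers': '16GB', 'effective_cache_size': '48GB'}
```

Run it for a 256 GB server and you get 64 GB and 192 GB respectively, which is why the guidance travels better as a fraction than as a number.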
Third, disk behavior. If pg_stat_bgwriter shows heavy buffers_backend writes, your backends are flushing dirty pages themselves instead of letting the background writer do the work, and the WAL volume is very likely your bottleneck. (A small note for the future: in PostgreSQL 17 the backend-write counters moved out of pg_stat_bgwriter and into pg_stat_io; the diagnostic is the same, the column simply lives next door.) NVMe for the WAL is cheap. Rearchitecting for distribution is, I must gently note, not.
If connections are pooled, memory is sized, and the disk is fast — and you remain saturated — then you have a real case for the next rung. Most evenings, you do not.
PostgreSQL Read Replicas and Read-After-Write Protection
Rung two. When read throughput is the ceiling, and the workload is read-heavy — which, for most document workloads, it is — streaming replication is the next lever to reach for. PostgreSQL has shipped physical streaming replication since version 9.0, released in 2010. It is as battle-tested as any feature in the system, and then some — fifteen years of production experience will do that to a feature.
A minimal primary configuration:
```ini
# postgresql.conf (primary)
wal_level = replica
max_wal_senders = 10
wal_keep_size = 1GB
```

```ini
# pg_hba.conf (primary)
host  replication  replicator  10.0.0.0/24  scram-sha-256
```

The replica is provisioned with pg_basebackup and started with primary_conninfo pointing at the primary. Once it catches up, it serves reads. That is the entire ceremony.
Gold Lapel's client configuration takes both URLs:
```yaml
primary_url: "postgres://app@primary.db:6432/app"
replica_url: "postgres://app@replica.db:6432/app"
read_after_write_window: 5s
```

Reads are routed to the replica by default. Writes go to the primary. So far, so standard. The subtlety every document-workload developer eventually meets is replica lag — the interval between a successful write on the primary and the LSN arriving at the replica. On a healthy LAN, this is typically sub-second. But "typically" is not "guaranteed," and the application that assumes zero lag is, I regret to inform you, the application that ships read-after-write bugs.
Gold Lapel's read-after-write protection attends to the common case without requiring code changes. After a write on a given session, subsequent reads from that session are pinned to the primary for a configurable window. The mechanism is time-based on the client session, not LSN-based; LSN tracking is more precise, but it requires round-trips to check replica position, which costs more than it saves at typical lag values. A waiter, after all, does not send a runner to confirm the wine has arrived before pouring; he listens for the cork.
In practice:
```python
user = gl.doc_insert("users", {"email": "ada@example.com", "plan": "pro"})

# Reads from the primary — within the read-after-write window,
# Gold Lapel pins this session's reads to the primary.
fetched = gl.doc_find_one("users", {"_id": user["_id"]})
assert fetched["data"]["plan"] == "pro"
```

For critical paths where the primary read should be explicit rather than inferred, the escape hatch is direct:
```python
account = gl.doc_find_one(
    "accounts",
    {"_id": account_id},
    read_preference="primary",
)
```

And for workloads that would prefer the opposite — where stale reads are perfectly acceptable, thank you very much, always hit the replica — the same knob set to "replica" forces the routing even immediately after a write. Most applications never touch these knobs. The default attends to them.
PostgreSQL Table Partitioning for Document Collections
Rung three. Some collections are dominated by a single table that grows without bound: event logs, audit trails, time-series telemetry, append-heavy document streams — the sort of tables that begin the year at 40 GB and conclude it at 400, and no amount of polite encouragement will persuade them to stop. Vertical scaling does not make a 400 GB events table scan faster. What helps is carving that table into partitions the planner can politely skip.
PostgreSQL's declarative partitioning, introduced in version 10 and matured significantly through 11, 12, and 13, is the mechanism. For time-keyed document collections, range partitioning on created_at is the natural fit:
```sql
CREATE TABLE events (
    id         BIGSERIAL,
    data       JSONB NOT NULL,
    created_at TIMESTAMPTZ NOT NULL
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2026_03 PARTITION OF events
    FOR VALUES FROM ('2026-03-01') TO ('2026-04-01');
CREATE TABLE events_2026_04 PARTITION OF events
    FOR VALUES FROM ('2026-04-01') TO ('2026-05-01');

CREATE INDEX events_2026_03_doc_gin
    ON events_2026_03 USING GIN (data jsonb_path_ops);
CREATE INDEX events_2026_04_doc_gin
    ON events_2026_04 USING GIN (data jsonb_path_ops);
```

Creating the monthly partitions by hand grows tiresome quickly, and the last thing a well-run household needs is a standing appointment on the operator's calendar every month. pg_partman automates the rotation:
```sql
CREATE EXTENSION pg_partman;

SELECT partman.create_parent(
    p_parent_table => 'public.events',
    p_control      => 'created_at',
    p_interval     => '1 month',
    p_premake      => 3
);
```

With p_premake => 3, pg_partman keeps three months of future partitions ready at all times. A scheduled job runs partman.run_maintenance() nightly, quietly rotating the window forward while you sleep.
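The bookkeeping that p_premake => 3 maintains can be illustrated by computing the month boundaries ourselves. A sketch of the window arithmetic (not pg_partman's code; the helper name is invented):

```python
from datetime import date

def month_windows(start: date, premake: int) -> list[tuple[str, str]]:
    # Boundaries for the current month plus `premake` future months,
    # mirroring the partitions pg_partman keeps provisioned ahead.
    def bump(d: date) -> date:
        # First day of the following month, handling December rollover.
        return date(d.year + (d.month == 12), d.month % 12 + 1, 1)

    windows = []
    lo = date(start.year, start.month, 1)
    for _ in range(premake + 1):
        hi = bump(lo)
        windows.append((lo.isoformat(), hi.isoformat()))
        lo = hi
    return windows

print(month_windows(date(2026, 3, 15), premake=3))
# [('2026-03-01', '2026-04-01'), ('2026-04-01', '2026-05-01'),
#  ('2026-05-01', '2026-06-01'), ('2026-06-01', '2026-07-01')]
```

Each tuple is a FROM/TO pair for one partition, which is exactly the shape of the hand-written DDL above; the value of the extension is that nobody has to remember to run this in February.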
Partition pruning is what makes the arrangement worthwhile. A query with a predicate on the partition key touches only the relevant partitions:
```sql
EXPLAIN SELECT count(*) FROM events
WHERE created_at >= '2026-04-01' AND created_at < '2026-04-08';
```

```
Aggregate
  ->  Seq Scan on events_2026_04 events
        Filter: (created_at >= '2026-04-01' AND created_at < '2026-04-08')
```

The plan touches events_2026_04 only. One partition. One scan. The other eleven months of data remain undisturbed.
Pruning happens at planning time for constant predicates (since PostgreSQL 10) and at execution time for parameterized predicates and subqueries (since PostgreSQL 11). For parameterized queries from an application — the common case — this means pruning still fires when the planner cannot see the literal values. Which is, if you'll allow me the observation, rather a lot better than nothing.
From the application side, nothing changes. Gold Lapel issues a doc_find against the parent table:
```python
recent = gl.doc_find("events", {
    "created_at": {"$gte": "2026-04-01", "$lt": "2026-04-08"},
})
```

The generated SQL targets events. The planner prunes to events_2026_04. The developer writes no partition-aware code whatsoever.
Retention, in this arrangement, becomes a metadata operation. Instead of DELETE FROM events WHERE created_at < '2025-04-01' — which, on a 400 GB table, is a multi-hour cascade through indexes and WAL that will make your monitoring dashboard blush — you detach and drop:
```sql
ALTER TABLE events DETACH PARTITION events_2025_03 CONCURRENTLY;
DROP TABLE events_2025_03;
```

DETACH PARTITION CONCURRENTLY (PostgreSQL 14+) removes the partition from the parent without blocking concurrent readers or writers on other partitions. For log-like workloads, this turns retention from an operational event — the sort of thing that gets scheduled for Sundays and whispered about — into a cron job.
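The cron job itself is small. A sketch of a retention pass that emits the DETACH and DROP statements for partitions wholly before a cutoff, assuming the events_YYYY_MM naming convention used above (the helper is invented):

```python
from datetime import date

def retention_ddl(partitions: list[str], cutoff: date) -> list[str]:
    # Emit DETACH + DROP for every partition whose month lies entirely
    # before the cutoff month. Partition names follow events_YYYY_MM.
    ddl = []
    for name in sorted(partitions):
        _, year, month = name.rsplit("_", 2)
        if date(int(year), int(month), 1) < date(cutoff.year, cutoff.month, 1):
            ddl.append(f"ALTER TABLE events DETACH PARTITION {name} CONCURRENTLY;")
            ddl.append(f"DROP TABLE {name};")
    return ddl

print(retention_ddl(["events_2025_03", "events_2026_04"], date(2026, 4, 1)))
# ['ALTER TABLE events DETACH PARTITION events_2025_03 CONCURRENTLY;',
#  'DROP TABLE events_2025_03;']
```

Neither statement scans a row of data, which is the entire trick: the 400 GB delete becomes two catalog operations.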
Partitioning vs Sharding
These two words are routinely confused, and the confusion leads to architectural mistakes of a particularly expensive variety. The PostgreSQL partitioning vs sharding distinction is, in truth, a simple one:
| | Partitioning | Sharding (Citus) |
|---|---|---|
| Data location | Single node | Multiple nodes |
| Unlocks | Large tables on one box | Throughput beyond one box |
| Query routing | Automatic (pruning) | Automatic (distribution column) |
| Operational cost | Low | Medium–High |
| Right rung when | One table is huge | Aggregate workload exceeds one node |
Partitioning solves a table-size problem. Sharding solves a machine-size problem. They are not alternatives, and should never be presented as such. They compose, in fact — and we shall see them compose shortly.
Citus PostgreSQL for Horizontal Scaling
Rung four, and the one where most architecture diagrams begin and most workloads do not actually need to reach. When one machine — even a generously sized one, with replicas and partitioned tables — cannot absorb the aggregate workload, we arrive at horizontal scaling. Citus is the PostgreSQL extension that makes this possible without leaving PostgreSQL, which is a rather important qualification.
Citus turns a coordinator node and a set of worker nodes into a single logical PostgreSQL instance. You connect to the coordinator, issue standard SQL, and Citus routes queries to the workers that hold the relevant shards. The application sees PostgreSQL. The cluster behaves like a distributed database. The arrangement is, one might say, discreet.
Setup is three statements on the coordinator — well, five, but who is counting:
```sql
CREATE EXTENSION citus;

SELECT citus_add_node('worker-1.db', 5432);
SELECT citus_add_node('worker-2.db', 5432);
SELECT citus_add_node('worker-3.db', 5432);
SELECT citus_add_node('worker-4.db', 5432);
```

The distribution-column decision, however, is the one that matters — the one decision that will either make your cluster feel effortless or make every query a small adventure. For document workloads, two strategies cover most cases.
Distribute by _id provides uniform load across shards and is the right default for general-purpose collections with no natural tenancy boundary:
```sql
SELECT create_distributed_table('documents', '_id');
```

Distribute by tenant key co-locates all rows for a given tenant on the same shard, which makes per-tenant reads, joins, and transactions local to a single worker:
```sql
SELECT create_distributed_table('documents', 'tenant_id');
```

The tenant strategy is almost always the right choice for multi-tenant SaaS workloads. Cross-tenant analytics become distributed queries, which is entirely acceptable when analytics are rare; per-tenant operations remain single-shard, which is essential when they are the hot path. The arrangement follows the shape of the problem.
One constraint worth naming plainly: the distribution column must be part of any unique constraint or primary key on the table. If the table's primary key is _id, distributing by tenant_id requires redefining the key as (tenant_id, _id). This is usually a nudge toward a better data model rather than a real limitation — rather like discovering that the seating chart makes more sense once the guest list has been considered.
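The co-location guarantee follows from hash distribution itself, and a toy sketch shows why. This is an illustration of the principle, not Citus's actual hash function or shard count:

```python
import zlib

def shard_for(tenant_id: str, shard_count: int = 32) -> int:
    # Hash the distribution column value; every row carrying the same
    # tenant_id hashes to the same shard, so per-tenant reads, joins,
    # and transactions stay on a single worker.
    return zlib.crc32(tenant_id.encode()) % shard_count

rows = [("acme", 1), ("acme", 2), ("acme", 3), ("globex", 7)]
acme_shards = {shard_for(tenant) for tenant, _ in rows if tenant == "acme"}
assert len(acme_shards) == 1   # all of acme's rows co-locate on one shard
```

It also shows why the distribution column must appear in every unique constraint: uniqueness can only be enforced locally on a shard, and only the hashed column decides which shard a row visits.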
Gold Lapel detects Citus at startup by inspecting pg_extension:
```
[gold-lapel] Connected to PostgreSQL 16.2
[gold-lapel] Citus 12.1 detected; index operations will propagate to all shards
[gold-lapel] 4 worker nodes registered
```

Index creation issued through Gold Lapel propagates to every shard via the coordinator:
```python
gl.doc_create_index("documents", keys={"tenant_id": 1, "status": 1})
```

The coordinator expands this into a GIN or B-tree CREATE INDEX against every shard, tracks the propagation, and reports completion only when every worker has finished. The developer, once again, writes no shard-aware code. One line in. One index everywhere. No ceremony.
Citus vs Native PostgreSQL Partitioning
The Citus vs native PostgreSQL partitioning question is a distinction, not a choice. Native partitioning scales one table on one machine; Citus scales the entire workload across machines. They compose cleanly: a Citus distributed table can itself be partitioned, which gives you time-based pruning within each shard while the shards themselves are distributed across workers.
For a telemetry workload with five years of history spread across hundreds of tenants, the correct shape is often a table distributed by tenant_id and partitioned by created_at. Each tenant's data lives on a predictable shard; within that shard, each tenant's historical data lives in monthly partitions that prune on query. Two mechanisms, one table, and — mercifully — neither mechanism leaks into the application code.
Gold Lapel Mesh for Multi-Region Document Serving
Rung five, the top of the ladder. When application servers, caches, and readers span continents, a write in Frankfurt and a read in Oregon must agree on what the document currently says — or, at the very least, agree on when the cached answer from thirty seconds ago has ceased to be acceptable.
Gold Lapel Mesh is a coordination layer, not a replication layer. I should like to be especially clear on this point, because the confusion is easy to make. Mesh does not move document data between regions; PostgreSQL replication does that. Mesh propagates cache-invalidation events between Gold Lapel instances, so that a write in one region invalidates the corresponding cache entries in every other region within a small window.
A minimal two-instance configuration:
```yaml
# Region A
mesh:
  instance_id: "gl-eu-1"
  discovery_endpoint: "https://mesh.internal/discovery"
  peer_port: 7781
```

```yaml
# Region B
mesh:
  instance_id: "gl-us-1"
  discovery_endpoint: "https://mesh.internal/discovery"
  peer_port: 7781
```

Both instances register with a shared discovery endpoint, exchange peer membership, and open persistent connections to each other. When a write on gl-eu-1 invalidates a cache key, the invalidation is published to all registered peers; gl-us-1 drops the corresponding cache entry on receipt.
Picture three regions, if you'll indulge me, each with a Gold Lapel instance fronting a regional PostgreSQL reader, with invalidation messages flowing between the Gold Lapel instances over a mesh topology. A write to the primary — in whichever region happens to hold it — replicates to the regional readers through PostgreSQL's streaming replication. In parallel, the Gold Lapel instance that accepted the write broadcasts invalidation events to its peers. The caches drop their stale entries; the next read in each region goes to PostgreSQL; PostgreSQL returns the replicated value. The guests in every region are served fresh.
```python
# Region A
gl_a.doc_update_one(
    "products",
    {"_id": product_id},
    {"$set": {"price": 1299}},
)

# Region B, ~invalidation-window milliseconds later
fetched = gl_b.doc_find_one("products", {"_id": product_id})
assert fetched["data"]["price"] == 1299
```

The invalidation window is the propagation time of the mesh message plus the replication lag from the primary to the regional reader. On a well-provisioned inter-region network that is typically well under a second, but the exact figure depends on your topology, and I would rather tell you the shape of the answer than name a number that belongs to a different deployment. We deliberately decline the word "instant." Distributed systems that claim instant are, in my experience, lying about either the topology or the failure mode, and a waiter who overstates his case is no waiter at all.
Mesh does not replace PostgreSQL replication. It sits above it and attends to the cache-coherence problem that replication, left to its own devices, does not address.
Honest Boundary: Where This Ladder Ends
Two limits are worth naming clearly, and with equal weight. I should be forthcoming about them, because pretending they do not exist would be a disservice to you and an embarrassment to me.
Write throughput. For pure write-heavy workloads — high-volume single-document inserts where reads are secondary — PostgreSQL through Gold Lapel is currently 2.5–4x slower than MongoDB on equivalent hardware. This is not a small gap, and it is not a gap I intend to gloss over. Write acceleration — parallel COPY paths and pipeline-mode insertion — is on the roadmap. It is not yet shipped. If your workload is dominated by ingest — telemetry firehoses, event-stream capture, high-frequency logging — and reads are a secondary concern, this gap matters, and you should weigh it honestly.
Extreme horizontal sharding. Citus is excellent through the moderate-cluster range, and handles the workloads most sharded PostgreSQL deployments actually run. For clusters north of fifty nodes, with multi-terabyte tables and heavy cross-shard transactional requirements, MongoDB's sharding has more operational miles on it. This is an operational-maturity claim, not a performance claim. Gold Lapel has not run at that scale in production. MongoDB has — in many places, for many years, and the lessons that produces are not available any other way.
Naming both of these does not soften the book's thesis; it sharpens it. MongoDB Atlas's automated sharding, zone-aware deployments, and global clusters with multi-region writes are genuinely good engineering — the product of many team-years spent on genuinely difficult distributed-systems problems. If your workload sits in the corner where either of those two limits binds, MongoDB is the right tool, and I would be a poor waiter indeed to suggest otherwise. The argument throughout this book is that most workloads do not sit in that corner, and most teams reach for distributed infrastructure long before their single node has been measured.
Choosing Your Rung
A practical decision guide, then, for how to scale PostgreSQL for millions of users — in the form of five questions, answered in order:
1. Have you measured your single node? If you have not, that is where we begin. Look at pg_stat_activity, pg_stat_bgwriter (or pg_stat_io on PostgreSQL 17+), and the query-level outputs of pg_stat_statements. Most teams discover their scaling problem is a missing index, an unpooled connection storm, or a runaway N+1 — not a shortage of hardware. The database was not slow. It was being asked poorly.
2. Are you connection-bound or CPU/disk-bound? If connection-bound — many idle backends, pool exhaustion errors — deploy PgBouncer. If resource-bound, size the box up: more RAM, faster NVMe, more cores. Vertical scaling is unfashionable. It is also, rather often, correct.
3. Is read throughput your ceiling? If reads dominate and the primary is saturated, add a read replica. Configure the client for read-after-write protection. Most document workloads find this rung is where they live for a very long time — and there is nothing at all wrong with that.
4. Is one table dominating your storage and slowing your queries? If yes, declaratively partition it. Range-partition on the time key for log-like collections; retain by detaching partitions. Your operations team will send you flowers.
5. Has your aggregate workload truly outgrown one machine? If yes, deploy Citus. Choose the distribution column with care; tenant keys are usually right for multi-tenant SaaS, _id for general-purpose collections. If you also span regions and need coherent caches, add mesh on top.
The questions are in order for a reason. Skipping ahead is how teams end up with a four-node Citus cluster serving a workload that a single pooled node could have carried with room to spare — and then paying the operational tax for years afterward. I have seen it. I would rather you did not.
The ladder is tall. You almost certainly do not need to climb it. Most document workloads live comfortably on rung one, buy a year or two of headroom by adding rung two, and never touch rungs three through five. The point of knowing the ladder is not to climb it. It is to stop worrying that PostgreSQL does not have one.
It does. The rungs are well-lit. And the first one is rather higher than most teams realise.
Chapter 19 turns from how far PostgreSQL can scale to the seven languages your applications are actually written in. Python, Node.js, Ruby, Java, PHP, Go, .NET — the doc_* API is the same shape in each, the SQL underneath is byte-identical, and the integration with each framework is a straightforward matter of installing one more library alongside the one you already use. If you'll follow me, I shall show you how the document store fits into the tools your team already reaches for.