You Don't Need MongoDB

Chapter 12: The Aggregation Pipeline

The Waiter of Gold Lapel · Published Apr 19, 2026 · 14 min

Most teams arriving from MongoDB carry a pipeline or two — several stages long, by the look of them — and reasonably wonder whether the household can receive them as they stand. It can. If you'll permit me, I should like to show you how.

Chapter 8 made the case that SQL is, as pipelines go, the more expressive instrument. Window functions, recursive CTEs, a cost-based planner — shapes of query that $group and $lookup were not built to reach. That chapter concerned itself with the substrate beneath the floorboards.

This chapter concerns the surface. Gold Lapel exposes a single function, doc_aggregate(), which accepts a MongoDB-style pipeline array and returns the document shape you expect. You write the pipeline you already know how to write. The Waiter translates it, stage by stage, into a single compiled statement — and the PostgreSQL planner is free, from that point, to reorder, inline, or index-scan as it sees fit.

At the close of this chapter you will find a benchmark: 30 ms on MongoDB 7 versus 357 µs on Gold Lapel's warm cache, a ratio of roughly 84x. The number is not the point. It is evidence for the point, which is architectural. MongoDB, faithful to its design, re-executes the pipeline on every request. Gold Lapel stores the result as a materialized view, keyed by the canonicalized pipeline. The first call builds it. Every call thereafter is an index scan. A small distinction, with consequences.

A word, before we begin, about the chapter's namesake. The aggregation pipeline is one of MongoDB's most beautifully designed features — a composable, stage-oriented grammar for expressing intent over documents, and genuinely pleasant to write. Gold Lapel does not replace it. It re-expresses it on a substrate that has already done the work.

The doc_aggregate API

The signature is, I'm pleased to say, intentionally dull:

Signature
doc_aggregate(collection_name: str, pipeline: list[dict]) -> list[dict]

A small example, to set the table. Given an orders collection, we should like to find the top regions by revenue for shipped orders in Q1:

Sample pipeline
from goldlapel import doc_aggregate

results = doc_aggregate("orders", [
    {"$match": {"status": "shipped",
                "placed_at": {"$gte": "2026-01-01", "$lt": "2026-04-01"}}},
    {"$group": {"_id": "$region", "revenue": {"$sum": "$total"}}},
    {"$sort": {"revenue": -1}},
    {"$limit": 10},
    {"$project": {"_id": 0, "region": "$_id", "revenue": 1}}
])

What Gold Lapel emits is one compiled statement with a CTE per stage. Stages are not interpreted at runtime in the client — they are lowered to SQL before PostgreSQL ever sees the request:

Generated SQL (CTE per stage)
WITH stage_1_match AS (
  SELECT data FROM orders
  WHERE data->>'status' = 'shipped'
    AND (data->>'placed_at')::timestamptz >= '2026-01-01'
    AND (data->>'placed_at')::timestamptz <  '2026-04-01'
),
stage_2_group AS (
  SELECT data->>'region' AS _id,
         sum((data->>'total')::numeric) AS revenue
  FROM stage_1_match
  GROUP BY data->>'region'
),
stage_3_sort AS (
  SELECT * FROM stage_2_group ORDER BY revenue DESC
),
stage_4_limit AS (
  SELECT * FROM stage_3_sort LIMIT 10
)
SELECT jsonb_build_object('region', _id, 'revenue', revenue) AS data
FROM stage_4_limit;

One pipeline in. One statement out. The planner takes it from there.

The CTE-per-stage layout is a deliberate architectural choice, and I should explain why. It mirrors the pipeline's mental model, which keeps the generated SQL readable when a developer has to debug it at an unfortunate hour. It also gives the PostgreSQL 12+ planner full freedom to flatten, reorder, or materialize individual stages. Earlier PostgreSQL versions treated CTEs as optimization fences, which would have been a problem; since 12 they are inlined by default unless marked MATERIALIZED. We owe the current generosity to that release.

The same call in TypeScript:

TypeScript
await docAggregate("orders", [
  { $match: { status: "shipped" } },
  { $group: { _id: "$region", revenue: { $sum: "$total" } } },
  { $sort: { revenue: -1 } },
  { $limit: 10 }
]);

And in Go:

Go
results, err := gl.DocAggregate(ctx, "orders", []bson.M{
    {"$match": bson.M{"status": "shipped"}},
    {"$group": bson.M{"_id": "$region", "revenue": bson.M{"$sum": "$total"}}},
    {"$sort":  bson.M{"revenue": -1}},
    {"$limit": 10},
})

Python, TypeScript, Ruby, Java, PHP, Go, and C# all expose doc_aggregate with the same argument shape. The pipeline is a JSON document — portability arrives at no extra charge.

Pipeline stages: $match, $sort, $limit, $skip

Four stages that map almost directly to SQL clauses. These are, if I may say so, the thin part of the translator — and I should not keep you long on them. A good waiter describes the simple dishes quickly and saves the full speech for the ones that earn it.

$match lowers to WHERE. Operators compose as you would expect:

$match with operators
doc_aggregate("orders", [
    {"$match": {
        "total":  {"$gte": 100},
        "status": {"$in": ["shipped", "delivered"]},
        "$and":   [{"region": "EU"}, {"customer_tier": "gold"}]
    }}
])

Emitted SQL, parameterized:

Generated SQL
SELECT data FROM orders
WHERE (data->>'total')::numeric >= $1
  AND data->>'status' = ANY($2::text[])
  AND data->>'region' = $3
  AND data->>'customer_tier' = $4

$sort becomes ORDER BY. $limit becomes LIMIT. $skip becomes OFFSET. A four-stage $match → $sort → $skip → $limit pipeline compiles to a single SELECT, the CTE wrapper quietly elided when the planner can see the whole shape:

Combined
SELECT data FROM orders
WHERE data->>'status' = 'shipped'
ORDER BY (data->>'placed_at')::timestamptz DESC
LIMIT 50 OFFSET 100;

One caveat I should mention, because pretending otherwise would be a disservice. Large OFFSET values scan and discard the skipped rows — politely, but still. For deep pagination (tens of thousands of rows in), I would recommend a keyset query instead — a $match on the last-seen sort key — which Gold Lapel documents in the pagination chapter. $skip is not the tool for page 10,000, and I would not dress it up to look like one.
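
The keyset shape, for the curious, looks like this; a sketch only, with an illustrative cursor value and field names (the pagination chapter has the full treatment):

```python
# Keyset ("cursor") pagination sketch: instead of $skip, carry the sort key
# of the last row from the previous page and $match strictly past it.
# The cursor value below is illustrative, not from any real dataset.
last_placed_at = "2026-02-11T08:45:00Z"

next_page = [
    {"$match": {"status": "shipped",
                "placed_at": {"$lt": last_placed_at}}},  # resume past the cursor
    {"$sort": {"placed_at": -1}},
    {"$limit": 50},
]
```

The emitted WHERE can then be satisfied by an index on the sort key directly, with no skipped rows to scan and discard.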

Now, where $match sits in the pipeline changes what it compiles to, and this is where the planner earns its keep. Placed before $group, $match lowers into WHERE and filters before aggregation. Placed after $group, it lowers into HAVING — the post-aggregate filter:

$match after $group
doc_aggregate("orders", [
    {"$group": {"_id": "$region", "revenue": {"$sum": "$total"}}},
    {"$match": {"revenue": {"$gte": 10000}}}
])
HAVING clause
SELECT data->>'region' AS _id,
       sum((data->>'total')::numeric) AS revenue
FROM orders
GROUP BY data->>'region'
HAVING sum((data->>'total')::numeric) >= 10000

This is the planner optimization Chapter 8 argued for, now visible on the page. MongoDB's pipeline carries the same semantic, of course. The SQL form simply lets the planner decide whether a HAVING is better evaluated as a subquery filter, a late-materialized aggregate, or something more inventive entirely. The planner chooses. The developer does not have to.
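
Both placements can be present in one pipeline; a sketch, using the same fields as above:

```python
# One pipeline, two $match stages: the first lowers to WHERE (pre-aggregate),
# the second lowers to HAVING (post-aggregate), because it follows $group.
pipeline = [
    {"$match": {"status": "shipped"}},                               # -> WHERE
    {"$group": {"_id": "$region", "revenue": {"$sum": "$total"}}},
    {"$match": {"revenue": {"$gte": 10000}}},                        # -> HAVING
]
```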

$group and accumulators

The accumulator table is the reference most readers will return to — and readers who will return to a table deserve a table they can find. Allow me to set it down plainly:

MongoDB accumulator   SQL generated              Notes
$sum                  sum(expr)                  $sum: 1 is short-circuited to count(*)
$avg                  avg(expr)
$min / $max           min(expr) / max(expr)
$push                 jsonb_agg(expr)            Returns a JSONB array
$addToSet             jsonb_agg(DISTINCT expr)   Output order is unspecified

Three of these mappings have details worth drawing to your attention.

$push uses jsonb_agg (PostgreSQL 9.5+), not array_agg. The return type is a JSONB array, so the result round-trips through the document shape without a conversion step you would otherwise have to write:

$push
doc_aggregate("orders", [
    {"$group": {"_id": "$customer_id",
                "order_ids": {"$push": "$_id"}}}
])
jsonb_agg
SELECT data->>'customer_id' AS _id,
       jsonb_agg(data->'_id') AS order_ids
FROM orders
GROUP BY data->>'customer_id'

$addToSet is jsonb_agg(DISTINCT ...). As with MongoDB's $addToSet, the result carries no guaranteed order. Should you need ordered distinct values — and readers sometimes do — follow the $group with an explicit sort on the array inside a $project:

$addToSet
doc_aggregate("orders", [
    {"$group": {"_id": "$customer_id",
                "categories": {"$addToSet": "$product_category"}}}
])
jsonb_agg DISTINCT
SELECT data->>'customer_id' AS _id,
       jsonb_agg(DISTINCT data->>'product_category') AS categories
FROM orders
GROUP BY data->>'customer_id'

$sum: 1 is recognized at translation time and short-circuited to count(*), not sum(1). The two are logically equivalent — the planner would happily collapse them either way — but count(*) is the idiomatic SQL and what EXPLAIN will show when you are investigating a slow query at three in the morning. Small signals, at that hour, are what one is grateful for.
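
Both rules, the short-circuit and the field lowering it sits beside, can be sketched as a toy translator fragment; the function names are mine, and this is an illustration of the idea rather than Gold Lapel's actual code:

```python
def field_ref_to_sql(ref: str) -> str:
    """Lower a MongoDB field reference such as "$total" or "$address.city"
    to a JSONB path expression: data->>'total', data->'address'->>'city'."""
    parts = ref.lstrip("$").split(".")
    expr = "data" + "".join(f"->'{p}'" for p in parts[:-1])
    return f"{expr}->>'{parts[-1]}'"

def lower_sum(arg):
    """Lower a $sum accumulator argument. The literal 1 is recognized at
    translation time and short-circuited to count(*) rather than sum(1)."""
    if arg == 1:
        return "count(*)"
    return f"sum(({field_ref_to_sql(arg)})::numeric)"
```

So lower_sum(1) yields count(*), while lower_sum("$total") yields the cast-and-sum form seen in the generated SQL throughout this chapter.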

A common $group with two accumulators, for the sake of completeness:

Two accumulators
doc_aggregate("orders", [
    {"$group": {"_id": "$region",
                "revenue": {"$sum": "$total"},
                "avg_order": {"$avg": "$total"}}}
])
Generated SQL
SELECT data->>'region' AS _id,
       sum((data->>'total')::numeric)  AS revenue,
       avg((data->>'total')::numeric)  AS avg_order
FROM orders
GROUP BY data->>'region'

Two accumulators. One scan. No fuss.

Composite group keys and dot notation

MongoDB permits grouping by an object of several fields, which is elegant in the pipeline and — if one is not careful — inelegant in the generated SQL. Gold Lapel emits a multi-expression GROUP BY rather than a GROUP BY on a synthesized object. The latter would defeat any index on the underlying fields, and I would not put your indexes to that kind of trouble without good reason:

Composite key
doc_aggregate("sales", [
    {"$group": {"_id": {"region": "$region", "channel": "$channel"},
                "units": {"$sum": "$quantity"}}}
])
Multi-expression GROUP BY
SELECT jsonb_build_object(
         'region',  data->>'region',
         'channel', data->>'channel'
       ) AS _id,
       sum((data->>'quantity')::numeric) AS units
FROM sales
GROUP BY data->>'region', data->>'channel'

The _id field is reassembled in the SELECT list with jsonb_build_object (PostgreSQL 9.5+) so the result document matches MongoDB's shape exactly. The GROUP BY operates on the raw expressions, which an expression index on (data->>'region', data->>'channel') can satisfy directly. The shape is preserved at the surface; the speed is preserved underneath.
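
To make the shape contract explicit, here is a plain-Python model of that composite-key $group; a sketch of the result shape, not of the translator:

```python
from collections import defaultdict

def group_by_region_channel(docs):
    """Model of the composite-key $group above: the grouping key is the
    tuple of raw field values; the _id object is reassembled only in the
    output, mirroring the multi-expression GROUP BY."""
    units = defaultdict(int)
    for d in docs:
        units[(d["region"], d["channel"])] += d["quantity"]
    return [{"_id": {"region": r, "channel": c}, "units": u}
            for (r, c), u in sorted(units.items())]
```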

Dot notation, meanwhile, traverses the JSONB tree as a well-trained guest might traverse a well-marked corridor. "$address.city" lowers to data->'address'->>'city':

Dot notation
doc_aggregate("users", [
    {"$group": {"_id": "$address.city",
                "count": {"$sum": 1}}}
])
Generated SQL
SELECT data->'address'->>'city' AS _id,
       count(*)               AS count
FROM users
GROUP BY data->'address'->>'city'

Note the short-circuit I mentioned earlier — $sum: 1 becomes count(*), not sum(1). The small courtesies accumulate.

$unwind

$unwind flattens an array field into one row per element. In SQL, that is a lateral join against jsonb_array_elements_text (for scalar arrays) or jsonb_array_elements (which preserves the full JSON structure) — both available since PostgreSQL 9.4, when the jsonb type itself arrived.

The simple form:

$unwind
doc_aggregate("articles", [
    {"$unwind": "$tags"},
    {"$group": {"_id": "$tags", "count": {"$sum": 1}}}
])
LATERAL join
SELECT tag AS _id, count(*) AS count
FROM articles
CROSS JOIN LATERAL jsonb_array_elements_text(data->'tags') AS tag
GROUP BY tag

preserveNullAndEmptyArrays: true keeps rows whose target field is missing, null, or an empty array — a small mercy for documents that arrived with incomplete data. It lowers to a LEFT JOIN LATERAL with a one-element fallback:

preserveNullAndEmptyArrays
doc_aggregate("articles", [
    {"$unwind": {"path": "$tags", "preserveNullAndEmptyArrays": True}}
])
LEFT JOIN LATERAL
SELECT articles.data, tag
FROM articles
LEFT JOIN LATERAL (
  SELECT value AS tag
  FROM jsonb_array_elements_text(coalesce(data->'tags', '[null]'::jsonb))
) AS t ON true

includeArrayIndex is supported by way of WITH ORDINALITY (PostgreSQL 9.4+), which attaches each element's 1-based position — less the 1, to match MongoDB's convention:

includeArrayIndex
doc_aggregate("articles", [
    {"$unwind": {"path": "$tags", "includeArrayIndex": "tag_idx"}}
])
WITH ORDINALITY
SELECT articles.data, tag, ord - 1 AS tag_idx
FROM articles
CROSS JOIN LATERAL jsonb_array_elements_text(data->'tags')
  WITH ORDINALITY AS t(tag, ord)

One behavioral nicety, to match MongoDB faithfully. If the target path holds a scalar rather than an array, MongoDB treats it as a single-element array rather than raising. Gold Lapel's translator wraps scalars with jsonb_build_array before unwinding, so that "$unwind": "$tags" applied to a document with "tags": "news" yields exactly one row, not a confused error. Fidelity to the original grammar is, I believe, the whole job.
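
Those semantics fit into a small Python model; a sketch with one deliberate simplification, in that preserved rows here receive an explicit None where MongoDB's preserveNullAndEmptyArrays leaves a missing field absent:

```python
def unwind(docs, field, preserve=False):
    """Model of $unwind semantics: one output doc per array element;
    scalars are treated as one-element arrays; missing, null, and empty
    arrays drop the doc unless preserve (preserveNullAndEmptyArrays)."""
    for doc in docs:
        value = doc.get(field)
        if isinstance(value, list):
            if value:
                for element in value:
                    yield {**doc, field: element}
            elif preserve:
                yield {**doc, field: None}   # empty array, kept
        elif value is not None:
            yield {**doc, field: value}      # scalar: one-element array
        elif preserve:
            yield {**doc, field: None}       # missing/null, kept
```

Applied to a document with "tags": "news", it yields exactly one row, matching the scalar-wrap behavior described above.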

$lookup

$lookup is the join stage — the one that, in the MongoDB pipeline, asks one collection to please make the acquaintance of another. Four parameters: from, localField, foreignField, as. The as field receives an array of matched documents — not a single document — so the SQL must aggregate:

$lookup (4-param)
doc_aggregate("orders", [
    {"$lookup": {
        "from": "customers",
        "localField": "customer_id",
        "foreignField": "_id",
        "as": "customer"
    }}
])
Correlated subquery
SELECT orders.data ||
       jsonb_build_object('customer',
         (SELECT coalesce(jsonb_agg(c.data), '[]'::jsonb)
          FROM customers c
          WHERE c.data->>'_id' = orders.data->>'customer_id')
       ) AS data
FROM orders

The correlated subquery with jsonb_agg matches MongoDB's output shape to the letter — an empty array when nothing matches, a one-element array for a single match, a many-element array for many. No client-side reshaping required, and none offered.
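
The array contract can be stated as a short Python model; a sketch of the semantics, not of the generated SQL:

```python
def lookup(local_docs, foreign_docs, local_field, foreign_field, as_field):
    """Model of the 4-parameter $lookup: as_field always receives a list,
    empty when nothing matches, many-element when many match."""
    for doc in local_docs:
        matches = [f for f in foreign_docs
                   if f.get(foreign_field) == doc.get(local_field)]
        yield {**doc, as_field: matches}
```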

The planner's advantage is especially visible here. For a small orders batch against a large customers table, the planner typically chooses a nested loop with an index probe on customers.data->>'_id'. For a large batch it switches to a hash join. The choice is cost-based, not pipeline-order-based — which is precisely the point Chapter 8 made, now arriving in uniform.

A real EXPLAIN (ANALYZE) on the four-parameter form, against 1M orders and 50K customers, on PostgreSQL 17:

EXPLAIN ANALYZE
Nested Loop  (cost=0.42..84211.33 rows=1000 width=186)
  ->  Seq Scan on orders (cost=0.00..36044.00 rows=1000 width=152)
        Filter: (data->>'region' = 'EU')
  ->  Index Scan using customers_id_idx on customers c
        (cost=0.42..0.48 rows=1 width=340)
        Index Cond: ((data->>'_id') = (orders.data->>'customer_id'))

One nested loop. One index scan. The planner found the shape and acted on it.

The subpipeline form accepts a full pipeline of its own, compiled in as a nested CTE:

$lookup with sub-pipeline
doc_aggregate("orders", [
    {"$lookup": {
        "from": "line_items",
        "let": {"oid": "$_id"},
        "pipeline": [
            {"$match": {"$expr": {"$eq": ["$order_id", "$$oid"]}}},
            {"$match": {"refunded": False}}
        ],
        "as": "live_items"
    }}
])

The inner $match stages flatten into a single WHERE on the correlated subquery. Subpipelines of arbitrary depth are supported; the planner, admirably, still sees one statement.

$project — include, exclude, rename, compute

$project reshapes the document, and it does so in four modes — include, exclude, rename, and compute. All four lower to a SELECT list that builds a new JSONB document with jsonb_build_object. The shape changes; the document survives.

Include-mode projection, with _id suppressed:

Include
doc_aggregate("orders", [
    {"$project": {"_id": 0, "total": 1, "status": 1}}
])
Generated SQL
SELECT jsonb_build_object(
  'total',  data->'total',
  'status', data->'status'
) AS data
FROM orders

Rename via field reference:

Rename
doc_aggregate("orders", [
    {"$project": {"_id": 0, "order_total": "$total"}}
])
Generated SQL
SELECT jsonb_build_object('order_total', data->'total') AS data
FROM orders

Computed expressions, for arithmetic, string, and conditional operators:

Compute
doc_aggregate("line_items", [
    {"$project": {"_id": 0,
                  "sku": 1,
                  "line_total": {"$multiply": ["$price", "$quantity"]}}}
])
Generated SQL
SELECT jsonb_build_object(
  'sku',        data->'sku',
  'line_total', (data->>'price')::numeric * (data->>'quantity')::numeric
) AS data
FROM line_items

Arithmetic ($multiply, $add, $subtract, $divide, $mod), string ($concat, $toLower, $toUpper, $substr), and conditional ($cond, $ifNull, $switch) operators are all supported in $project. The long tail — $dateFromString, $regexFind, and their cousins — is inventoried in Appendix B, with unsupported operators raising a translation error at submit time rather than failing mid-query. A compiler that fails closed is, I should note, the only kind worth trusting.
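
A sketch of that fail-closed posture; the whitelist below covers only the operators named in this chapter, and the walk is my simplification rather than Gold Lapel's actual validator:

```python
SUPPORTED = {
    "$multiply", "$add", "$subtract", "$divide", "$mod",   # arithmetic
    "$concat", "$toLower", "$toUpper", "$substr",          # string
    "$cond", "$ifNull", "$switch",                         # conditional
}

class UnsupportedOperator(Exception):
    pass

def check_expression(expr):
    """Walk a $project expression tree and raise at translation time on
    any operator outside the whitelist, rather than failing mid-query."""
    if isinstance(expr, dict):
        for key, arg in expr.items():
            if key.startswith("$") and key not in SUPPORTED:
                raise UnsupportedOperator(key)
            check_expression(arg)
    elif isinstance(expr, list):
        for item in expr:
            check_expression(item)
```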

The benchmark and the materialized view

And now, the number. I have been promising it for some pages, and you have been patient.

Operation                        MongoDB 7.0.5   Gold Lapel (cold)   Gold Lapel (warm)
doc_aggregate("orders", [...])   30 ms           ~35 ms              357 µs

Warm-cache Gold Lapel runs the pipeline about 84x faster than MongoDB 7. I must be forthcoming about what the number is and is not. It is evidence for a single architectural decision. It is not a performance claim about PostgreSQL versus MongoDB in general — indeed, the cold-path numbers are within a few milliseconds of each other, and MongoDB's cold path is, frankly, quite good.

Here is the decision. Gold Lapel canonicalizes the pipeline array into a stable hash, and stores the result of the pipeline as a materialized view named with that hash. The first call to a given pipeline builds the matview — a cost roughly comparable to MongoDB running the pipeline. Every subsequent call is an index scan on the matview's primary key. The warm path is not running an aggregation at all. It is reading a pre-computed table. The work has already been done, and the Waiter has set it aside for your return.
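
The keying step might look like the following; the specific canonicalization (sorted keys within stages, compact separators, truncated SHA-256) is my assumption, not Gold Lapel's documented scheme:

```python
import hashlib
import json

def matview_name(collection, pipeline):
    """Canonicalize a pipeline into a stable matview name. Key order
    inside a stage must not change the hash, so keys are sorted; the
    stage list itself is hashed in order, because stage order is
    semantic in a pipeline."""
    canonical = json.dumps([collection, pipeline],
                           sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return "gl_matview_" + digest[:16]
```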

The exact pipeline benchmarked:

Benchmark pipeline
pipeline = [
    {"$match": {"status": "shipped",
                "placed_at": {"$gte": "2026-01-01"}}},
    {"$lookup": {"from": "customers",
                 "localField": "customer_id",
                 "foreignField": "_id",
                 "as": "customer"}},
    {"$unwind": "$customer"},
    {"$group": {"_id": "$customer.region",
                "revenue": {"$sum": "$total"},
                "orders":  {"$sum": 1}}},
    {"$sort": {"revenue": -1}},
    {"$limit": 20}
]

The generated matview DDL:

Materialized view
CREATE MATERIALIZED VIEW gl_matview_a7f3c91b28d4e506 AS
WITH stage_1_match AS (...),
     stage_2_lookup AS (...),
     stage_3_unwind AS (...),
     stage_4_group  AS (...),
     stage_5_sort   AS (...),
     stage_6_limit  AS (...)
SELECT * FROM stage_6_limit;

CREATE UNIQUE INDEX ON gl_matview_a7f3c91b28d4e506 ((_id));

The warm-path query — what every call after the first actually runs:

Warm path
SELECT * FROM gl_matview_a7f3c91b28d4e506;

A small table read, a unique index, 357 µs. That is where the number comes from. It is not a pipeline execution. It is a pipeline remembered.

MongoDB could, in principle, sit behind a cache of its own. I say this without the slightest grudge — it absolutely could, and many teams have built precisely that. The difference is in whose problem the cache becomes. Build one yourself and you inherit key design, invalidation, consistency with writes, TTL tuning, and the small collection of three-in-the-morning mysteries that such systems reliably produce. With Gold Lapel the cache is the API. You call doc_aggregate; the matview is a detail the house takes care of.

Full methodology — dataset shape, hardware, PostgreSQL and MongoDB versions, warm/cold definitions, percentile distributions, index configuration on both sides, allowDiskUse settings, and reproduction scripts — lives on the companion website. The book's job is to tell you what the number means. The website's job is to let you verify it.

Refresh semantics — when the matview is stale

A cached pipeline result is stale the moment a matching document changes. This is an inconvenient truth, and I will not pretend otherwise. Gold Lapel exposes three refresh modes, so that you may pick the one that fits the workload:

Refresh modes
doc_aggregate("orders", pipeline, refresh="on_write")  # default
doc_aggregate("orders", pipeline, refresh="ttl", ttl=60)
doc_aggregate("orders", pipeline, refresh="manual")

on_write is the default. When Gold Lapel's write path commits a document change, it consults a dependency map of which matviews rely on the affected collection, and schedules a REFRESH MATERIALIZED VIEW CONCURRENTLY (PostgreSQL 9.4+) for each. The refresh runs asynchronously; the write returns immediately. The staleness window is bounded by the refresh queue depth and the matview build time. For most pipelines that is single-digit seconds — fast enough for most dashboards, honest enough to admit.
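
The dependency map fits in a dozen lines; a sketch with names and queue mechanics of my own invention, the real system presumably being rather more careful:

```python
from collections import defaultdict

DEPS = defaultdict(set)  # collection name -> matview names that read it

def register_matview(matview, source_collections):
    """Record which collections a matview's pipeline reads: the base
    collection plus any $lookup 'from' collections."""
    for collection in source_collections:
        DEPS[collection].add(matview)

def on_commit(collection):
    """Return the refresh statements to enqueue after a committed write.
    The real refresh runs asynchronously; the write does not wait."""
    return sorted(
        f"REFRESH MATERIALIZED VIEW CONCURRENTLY {mv}"
        for mv in DEPS[collection]
    )
```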

ttl is time-based. Each matview carries a last-refreshed timestamp; a read older than the TTL triggers a refresh before returning. This is the right choice when the pipeline aggregates over a window where a few seconds of staleness cause no one any distress — dashboards, leaderboards, periodic reports.

manual disables automatic refresh entirely. The caller assumes responsibility and invokes gl.refresh_aggregate(collection, pipeline) when the data has materially changed. This is the escape hatch for pipelines where refresh cost is high and the caller has better information about when a refresh is actually warranted.

The write-path invalidation that makes on_write honest is a dependency map from collection to dependent matviews, consulted on commit and used to schedule concurrent refreshes without blocking the foreground write. Under steady writes it runs behind the scenes; under write bursts that outpace refresh build time the queue grows, and the staleness window grows with it. That is the trade being made. I mention it here because a cached pipeline pretending to be free is the sort of claim a serious reader sees through immediately, and I should rather be seen through for honesty than for concealment.

Honest Boundary — where the pipeline does not translate

I should be forthcoming about three places where Gold Lapel's doc_aggregate is narrower than MongoDB's. An aggregation framework that overstates its coverage is no aggregation framework at all.

Stages with limits. $graphLookup (recursive lookup) is on the roadmap; WITH RECURSIVE is the obvious SQL target, but the compiler does not emit it yet. For now, reach for the recursive CTE directly — you will get the same result with slightly more typing. $bucketAuto requires automatic bucket-boundary selection that Gold Lapel has not yet implemented; $bucket with explicit boundaries is fully supported. $merge and $out write pipeline output back to a collection — both are on the roadmap, and neither has shipped.

$facet — multiple sub-pipelines on the same input — is supported as of the current release, but with a caveat I must share: each sub-pipeline gets its own CTE, and the planner does not always share scans between them. If your facet has more than four or five sub-pipelines over a very large collection, I would recommend writing it as separate doc_aggregate calls and composing the results in application code. The result is the same; the work does not pile up on one planner's desk.

Expression operators. The common arithmetic, string, and conditional operators work. The long tail does not. $dateFromString, $regexFind, $regexFindAll, $zip, $reduce, $let, and most of the trigonometric and statistical operators raise UnsupportedOperator at translation time. Appendix B lists every operator and its current status. A translation error at submit time is, I submit, considerably preferable to a silent semantic mismatch three months into production. The compiler fails closed. It is the one place where the house is unwilling to improvise.

The write path. A read-heavy aggregation workload is where Gold Lapel is at its best. A write-heavy ingestion pipeline is where MongoDB's edge still holds — the standing gap on writes is 2.5x–4x, and doc_aggregate's matview refresh adds to the write cost under on_write. If your workload is continuous ingestion with occasional reads, MongoDB is the better tool, and I would not insult your intelligence by claiming otherwise. Write acceleration for Gold Lapel — parallel COPY, pipeline mode, batched refresh — is on the roadmap. It is not shipped. Said plainly.

Putting it together — a real pipeline

Allow me to set a proper place. A six-stage pipeline — the kind that actually ships, not the kind that is invented for a book — showing top regions by revenue, with unwound line items, for shipped orders in Q1:

Six-stage pipeline
results = doc_aggregate("orders", [
    {"$match": {"status": "shipped",
                "placed_at": {"$gte": "2026-01-01", "$lt": "2026-04-01"}}},
    {"$lookup": {"from": "customers",
                 "localField": "customer_id",
                 "foreignField": "_id",
                 "as": "customer"}},
    {"$unwind": "$customer"},
    {"$group": {"_id": "$customer.region",
                "revenue":   {"$sum": "$total"},
                "customers": {"$addToSet": "$customer._id"}}},
    {"$sort":  {"revenue": -1}},
    {"$limit": 10}
])

The generated SQL, one CTE per stage:

Generated SQL
WITH stage_1_match AS (
  SELECT data FROM orders
  WHERE data->>'status' = 'shipped'
    AND (data->>'placed_at')::timestamptz >= '2026-01-01'
    AND (data->>'placed_at')::timestamptz <  '2026-04-01'
),
stage_2_lookup AS (
  SELECT s1.data ||
         jsonb_build_object('customer',
           (SELECT coalesce(jsonb_agg(c.data), '[]'::jsonb)
            FROM customers c
            WHERE c.data->>'_id' = s1.data->>'customer_id')
         ) AS data
  FROM stage_1_match s1
),
stage_3_unwind AS (
  SELECT s2.data - 'customer' || jsonb_build_object('customer', cust) AS data
  FROM stage_2_lookup s2
  CROSS JOIN LATERAL jsonb_array_elements(s2.data->'customer') AS cust
),
stage_4_group AS (
  SELECT data->'customer'->>'region' AS _id,
         sum((data->>'total')::numeric) AS revenue,
         jsonb_agg(DISTINCT data->'customer'->'_id') AS customers
  FROM stage_3_unwind
  GROUP BY data->'customer'->>'region'
),
stage_5_sort AS (
  SELECT * FROM stage_4_group ORDER BY revenue DESC
),
stage_6_limit AS (
  SELECT * FROM stage_5_sort LIMIT 10
)
SELECT jsonb_build_object(
  '_id',       _id,
  'revenue',   revenue,
  'customers', customers
) AS data
FROM stage_6_limit;

On the first call, this statement runs end-to-end and the result is persisted as gl_matview_<hash>. Every subsequent call with the same pipeline runs:

Warm path
SELECT * FROM gl_matview_9b2a17e4c8f32d51;

Copy the Python. Adapt the stages. The SQL is a side effect — one the house prepares, and one you need not inspect unless curiosity takes you.


MongoDB re-executes the pipeline on every request — a reasonable default, and one with admirable consistency. Gold Lapel remembers it. That is the entire architectural difference between them, and it is where the 84x comes from. Not from PostgreSQL being faster than MongoDB at aggregation in any abstract sense — the cold-path numbers land within a few milliseconds of each other — but from storing the pipeline's result in a place the planner can index-scan.

Chapter 13 turns from aggregation to the three operational features that elevate MongoDB from a database to a platform — change streams, TTL indexes, and capped collections — and shows each of them composed from primitives PostgreSQL has shipped for a quarter-century: row triggers, LISTEN/NOTIFY, and statement-level trigger functions. doc_watch for the change-stream story, doc_create_ttl_index for expiry, doc_create_capped for bounded append-only storage. Three features, one subsystem beneath them, and the same JSONB column shape carrying through.

If you'll follow me through to the next chapter, the triggers are warming up. It is, as these things tend to be, rather quieter work than the pipeline, and rather more consequential.