Appendix E: Aggregation Pipeline Reference
Every supported pipeline stage and accumulator. Chapter 12 is the narrative; this is the lookup. The compiler lowers each stage into a CTE in a single compiled statement, and PostgreSQL 12+ inlines the CTEs by default — the planner sees the whole shape and reorders as cost dictates.
Every pipeline result is canonicalized into a stable hash and cached behind a materialized view (gl_matview_<hash>). The first call builds the view; every call afterwards is an index scan. This is where the warm-cache 84x number comes from — not from PostgreSQL being faster than MongoDB at aggregation in any abstract sense, but from storing the result in a place the planner can index-scan.
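The stable hash can be pictured as a canonical JSON serialization of the call, digested — a sketch of the idea only; the actual canonicalization rules and view-naming scheme are Gold Lapel internals:

```python
import hashlib
import json

def pipeline_cache_key(collection, pipeline):
    # Canonical form: stable key order, no insignificant whitespace,
    # so semantically identical pipelines hash to the same matview name.
    canonical = json.dumps({"collection": collection, "pipeline": pipeline},
                           sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
    return f"gl_matview_{digest}"

# Key order inside a stage does not change the key; stage order does.
a = pipeline_cache_key("orders", [{"$match": {"status": "shipped", "total": {"$gte": 100}}}])
b = pipeline_cache_key("orders", [{"$match": {"total": {"$gte": 100}, "status": "shipped"}}])
```

Note that stage *order* is preserved in the serialization, as it must be — reordering stages changes the pipeline's meaning, so it rightly produces a different key.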
$match — WHERE (or HAVING)
Filter documents. Position matters: before $group it lowers to WHERE; after $group it lowers to HAVING.
{"$match": {"status": "shipped", "total": {"$gte": 100}}}

Before $group:

SELECT data FROM orders
WHERE data->>'status' = 'shipped'
  AND (data->>'total')::numeric >= 100

After $group:

... GROUP BY data->>'region'
HAVING sum((data->>'total')::numeric) >= 100

Filter operators inside $match use the full filter language from Appendix D.
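The position rule can be sketched as a small classifier over the stage list — an illustration of the rule, not the compiler's actual lowering code:

```python
def match_clauses(pipeline):
    """Classify each $match stage by its position relative to $group:
    before the first $group it lowers to WHERE, after it to HAVING."""
    clauses, seen_group = [], False
    for stage in pipeline:
        if "$group" in stage:
            seen_group = True
        elif "$match" in stage:
            clauses.append("HAVING" if seen_group else "WHERE")
    return clauses
```

A pipeline that filters rows, groups, then filters groups therefore produces one WHERE and one HAVING — exactly the two lowerings shown above.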
$group — GROUP BY with Aggregates
{"$group": {"_id": "$region",
            "revenue": {"$sum": "$total"},
            "avg_order": {"$avg": "$total"}}}

SELECT data->>'region' AS _id,
       sum((data->>'total')::numeric) AS revenue,
       avg((data->>'total')::numeric) AS avg_order
FROM orders
GROUP BY data->>'region'

Composite group keys. {"_id": {"region": "$region", "channel": "$channel"}} lowers to a multi-expression GROUP BY with jsonb_build_object reassembling the _id shape in the SELECT list. The composite key respects expression indexes on the underlying fields; grouping by a synthesized object would defeat them.
Dot notation. "$address.city" lowers to data->'address'->>'city'.
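The lowering rule is mechanical: every intermediate hop stays jsonb (`->`), and only the final hop extracts text (`->>`). A minimal sketch, separate from the compiler's actual path handling:

```python
def lower_path(field_ref):
    """Lower a MongoDB field reference like "$address.city" to a
    PostgreSQL jsonb access chain ending in a text extraction."""
    parts = field_ref.lstrip("$").split(".")
    expr = "data"
    for part in parts[:-1]:
        expr += f"->'{part}'"           # intermediate hops stay jsonb
    return expr + f"->>'{parts[-1]}'"   # final hop extracts text
```

Keeping intermediate hops as jsonb matters: `->>` on a middle segment would collapse the subdocument to text and make the next hop fail.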
$sort — ORDER BY
{"$sort": {"revenue": -1, "name": 1}}

ORDER BY revenue DESC, name ASC

$limit — LIMIT

{"$limit": 10}

LIMIT 10

$skip — OFFSET

{"$skip": 100}

OFFSET 100

Large OFFSET values scan and discard the skipped rows. For deep pagination (tens of thousands of rows in), use a keyset query — a $match on the last-seen sort key — instead of $skip.
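A keyset page in pipeline form — the field name and cursor value here are illustrative:

```python
# Sort key of the last row on the previous page (application-held cursor).
last_seen = "2026-03-14T09:00:00Z"

keyset_page = [
    {"$match": {"placed_at": {"$lt": last_seen}}},  # replaces {"$skip": N}
    {"$sort": {"placed_at": -1}},
    {"$limit": 20},
]
```

With an expression index on placed_at, every page is an index range scan of at most 20 rows; the $skip version scans and discards everything before the page.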
$project — SELECT with jsonb_build_object
Reshape the document. Four modes: include, exclude, rename, compute.
Include (with _id suppressed):
{"$project": {"_id": 0, "total": 1, "status": 1}}

SELECT jsonb_build_object('total', data->'total', 'status', data->'status') AS data
FROM orders

Rename via field reference:

{"$project": {"_id": 0, "order_total": "$total"}}

SELECT jsonb_build_object('order_total', data->'total') AS data
FROM orders

Computed expressions (arithmetic, string, conditional):

{"$project": {"line_total": {"$multiply": ["$price", "$quantity"]}}}

SELECT jsonb_build_object(
  'line_total', (data->>'price')::numeric * (data->>'quantity')::numeric
) AS data
FROM line_items

Supported in $project:
- Arithmetic: $multiply, $add, $subtract, $divide, $mod
- String: $concat, $toLower, $toUpper, $substr
- Conditional: $cond, $ifNull, $switch
Unsupported operators ($dateFromString, $regexFind, $regexFindAll, $zip, $reduce, $let, trigonometric and statistical operators) raise UnsupportedOperator at translation time. The compiler fails closed.
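The fail-closed check amounts to a recursive walk of the expression tree — a sketch of the principle, not the compiler's actual traversal:

```python
SUPPORTED_OPERATORS = {
    "$multiply", "$add", "$subtract", "$divide", "$mod",
    "$concat", "$toLower", "$toUpper", "$substr",
    "$cond", "$ifNull", "$switch",
}

class UnsupportedOperator(Exception):
    pass

def check_operators(expr):
    """Reject unknown $-operators at translation time rather than
    passing them through silently (fail closed)."""
    if isinstance(expr, dict):
        for key, value in expr.items():
            if key.startswith("$") and key not in SUPPORTED_OPERATORS:
                raise UnsupportedOperator(key)
            check_operators(value)
    elif isinstance(expr, list):
        for item in expr:
            check_operators(item)
```

Field references like "$price" are plain strings, not dict keys, so the walk ignores them; only operator positions are validated.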
$unwind — CROSS JOIN LATERAL
{"$unwind": "$tags"}

CROSS JOIN LATERAL jsonb_array_elements_text(data->'tags') AS tag

preserveNullAndEmptyArrays: true keeps rows whose target field is missing, null, or an empty array — LEFT JOIN LATERAL with a one-element fallback:

LEFT JOIN LATERAL (
  SELECT value FROM jsonb_array_elements_text(coalesce(data->'tags', '[null]'::jsonb))
) AS t ON true

includeArrayIndex: "tag_idx" adds the element position via WITH ORDINALITY (ordinality is 1-based, so the compiler subtracts 1 to match MongoDB's 0-based convention):

CROSS JOIN LATERAL jsonb_array_elements_text(data->'tags')
WITH ORDINALITY AS t(tag, ord)
-- and `ord - 1 AS tag_idx` in the SELECT

If the target path holds a scalar rather than an array, Gold Lapel wraps it with jsonb_build_array before unwinding — matching MongoDB's behaviour of treating scalars as single-element arrays.
Use jsonb_array_elements_text for scalar arrays; jsonb_array_elements for arrays of objects (preserves JSON structure).
$lookup — Correlated Subquery with jsonb_agg
Four-parameter form:
{"$lookup": {
  "from": "customers",
  "localField": "customer_id",
  "foreignField": "_id",
  "as": "customer"
}}

SELECT orders.data ||
       jsonb_build_object('customer',
         (SELECT coalesce(jsonb_agg(c.data), '[]'::jsonb)
          FROM customers c
          WHERE c.data->>'_id' = orders.data->>'customer_id')
       ) AS data
FROM orders

The as field receives an array of matched documents — an empty array when nothing matches, a one-element array for a single match, a many-element array for many. No client-side reshaping required.
The planner is cost-based: small input batches against a large foreign collection get a nested loop with index probe; large batches switch to hash join.
Subpipeline form with let:
{"$lookup": {
  "from": "line_items",
  "let": {"oid": "$_id"},
  "pipeline": [
    {"$match": {"$expr": {"$eq": ["$order_id", "$$oid"]}}},
    {"$match": {"refunded": false}}
  ],
  "as": "live_items"
}}

The inner stages flatten into the correlated subquery's WHERE clause. Subpipelines of arbitrary depth compile to one statement.
The Seven Accumulators
| MongoDB | SQL generated | Notes |
|---|---|---|
| $sum | sum(expr) | $sum: 1 short-circuits to count(*) for the idiomatic plan |
| $avg | avg(expr) | |
| $min | min(expr) | |
| $max | max(expr) | |
| $count | count(*) | Dedicated stage as well as accumulator |
| $push | jsonb_agg(expr) | Returns a JSONB array; round-trips through the document shape natively |
| $addToSet | jsonb_agg(DISTINCT expr) | Output order is unspecified; sort downstream if needed |
$push uses jsonb_agg, not array_agg — the result type is JSONB so the document round-trips cleanly without a conversion step.
$addToSet carries no guaranteed order. For ordered distinct values, follow the $group with an explicit sort on the array inside a $project.
Stages with Limits
| Stage | Status | Notes |
|---|---|---|
| $facet | Supported | Each sub-pipeline gets its own CTE. For more than four or five sub-pipelines over a very large collection, write separate doc_aggregate calls and compose results in application code. |
| $bucket | Supported | Explicit boundaries only. |
| $bucketAuto | Not yet | Automatic boundary selection on the roadmap. |
| $graphLookup | Not yet | WITH RECURSIVE is the SQL target; the compiler does not emit it yet. Use a recursive CTE directly for now. |
| $merge | Not yet | Pipeline output back into a collection. On the roadmap. |
| $out | Not yet | Pipeline output to a new collection. On the roadmap. |
Any unsupported stage raises UnsupportedStage at translation time, never silently — a compiler that fails closed is the only kind worth trusting.
Refresh Modes
doc_aggregate accepts a refresh parameter that controls how the cached matview is kept current.
| Mode | Behavior |
|---|---|
| on_write (default) | Write path consults a dependency map; a REFRESH MATERIALIZED VIEW CONCURRENTLY is scheduled when a write affects the source collection. Refresh is asynchronous; staleness is bounded by queue depth and matview build time (typically single-digit seconds). |
| ttl | Each matview carries a last-refreshed timestamp. A read older than ttl seconds triggers a refresh before returning. Right for dashboards, leaderboards, periodic reports. |
| manual | No automatic refresh. The caller invokes gl.refresh_aggregate(collection, pipeline) when refresh is warranted. The escape hatch for high-cost pipelines whose refresh cadence the application knows best. |
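The read-path side of the three modes reduces to a small predicate — a sketch with illustrative parameter names, not the library's exact API:

```python
import time

def needs_refresh(last_refreshed, mode, ttl_seconds=None, now=None):
    """Decide, at read time, whether a matview refresh must run first.
    last_refreshed/now are epoch seconds; ttl_seconds applies to ttl mode."""
    now = time.time() if now is None else now
    if mode == "manual":
        return False          # caller refreshes explicitly
    if mode == "ttl":
        return (now - last_refreshed) > ttl_seconds
    return False              # on_write: refresh is driven by writes, not reads
```

Only ttl mode ever blocks a read on a refresh; on_write and manual always serve whatever the matview currently holds.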
A cached pipeline result is stale the moment a matching document changes. The trade-off is honest, and named in Chapter 12.
A Complete Example
The benchmarked six-stage pipeline from Chapter 12:
results = doc_aggregate("orders", [
{"$match": {"status": "shipped",
"placed_at": {"$gte": "2026-01-01"}}},
{"$lookup": {"from": "customers",
"localField": "customer_id",
"foreignField": "_id",
"as": "customer"}},
{"$unwind": "$customer"},
{"$group": {"_id": "$customer.region",
"revenue": {"$sum": "$total"},
"orders": {"$sum": 1}}},
{"$sort": {"revenue": -1}},
{"$limit": 20}
])

Cold path: ~35 ms (comparable to MongoDB 7's 30 ms — the cold paths are within a few milliseconds of each other). Warm path: 357 µs. The warm path is not running an aggregation. It is reading a pre-computed table.
Full benchmark methodology — dataset shape, hardware, percentile distributions, index configuration, reproduction scripts — lives on the companion benchmark page. The book tells you what the number means; the website lets you verify it.