You Don't Need MongoDB

Appendix E: Aggregation Pipeline Reference

The Waiter of Gold Lapel · Published Apr 19, 2026 · 6 min

Every supported pipeline stage and accumulator. Chapter 12 is the narrative; this is the lookup. The compiler lowers each stage into a CTE in a single compiled statement, and PostgreSQL 12+ inlines the CTEs by default — the planner sees the whole shape and reorders as cost dictates.

Every pipeline is canonicalized into a stable hash, and its result is cached behind a materialized view (gl_matview_<hash>). The first call builds the view; every call afterwards is an index scan. This is where the warm-cache 84x number comes from — not from PostgreSQL being faster than MongoDB at aggregation in any abstract sense, but from storing the result in a place the planner can index-scan.
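The naming scheme can be sketched in a few lines. The exact canonicalization Gold Lapel uses is an assumption here; the point is that semantically identical pipelines must map to the same view name:

```python
import hashlib
import json

def matview_name(collection, pipeline):
    """Illustrative sketch of pipeline canonicalization (the real
    scheme is an assumption): serialize with sorted keys and no
    whitespace so equivalent pipelines hash to the same
    gl_matview_<hash> name. Stage order is preserved (it is a list)."""
    canonical = json.dumps([collection, pipeline], sort_keys=True,
                           separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode()).hexdigest()[:12]
    return f"gl_matview_{digest}"
```

Note one limitation of the sketch: sorting keys would conflate order-sensitive stages such as $sort, where {"revenue": -1, "name": 1} and {"name": 1, "revenue": -1} mean different things; a real canonicalizer has to treat those stages specially.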

$match — WHERE (or HAVING)

Filter documents. Position matters: before $group it lowers to WHERE; after $group it lowers to HAVING.

$match
{"$match": {"status": "shipped", "total": {"$gte": 100}}}

Before $group:

WHERE
SELECT data FROM orders
WHERE data->>'status' = 'shipped'
  AND (data->>'total')::numeric >= 100

After $group:

HAVING
... GROUP BY data->>'region'
HAVING sum((data->>'total')::numeric) >= 100

Filter operators inside $match use the full filter language from Appendix D.
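The position rule is mechanical enough to sketch. `match_lowering` below is an illustrative helper, not part of the API; a pipeline is modeled as a list of single-key stage dicts:

```python
def match_lowering(pipeline, index):
    """Position rule for $match: with no $group before it, the stage
    lowers to WHERE; after a $group, it lowers to HAVING."""
    grouped = any("$group" in stage for stage in pipeline[:index])
    return "HAVING" if grouped else "WHERE"

pipeline = [
    {"$match": {"status": "shipped"}},                            # -> WHERE
    {"$group": {"_id": "$region", "revenue": {"$sum": "$total"}}},
    {"$match": {"revenue": {"$gte": 100}}},                       # -> HAVING
]
```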

$group — GROUP BY with Aggregates

$group
{"$group": {"_id": "$region",
            "revenue": {"$sum": "$total"},
            "avg_order": {"$avg": "$total"}}}
GROUP BY
SELECT data->>'region' AS _id,
       sum((data->>'total')::numeric) AS revenue,
       avg((data->>'total')::numeric) AS avg_order
FROM orders
GROUP BY data->>'region'

Composite group keys. {"_id": {"region": "$region", "channel": "$channel"}} lowers to a multi-expression GROUP BY with jsonb_build_object reassembling the _id shape in the SELECT list. The composite key respects expression indexes on the underlying fields; grouping by a synthesized object would defeat them.

Dot notation. "$address.city" lowers to data->'address'->>'city'.
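The dot-notation lowering can be sketched as a small helper (`jsonb_path` is illustrative, not part of the library): intermediate keys use ->, the final key uses ->> to extract text:

```python
def jsonb_path(field):
    """Lower a "$a.b.c" field reference to a jsonb extraction over
    the data column: -> for intermediate keys, ->> for the final
    text value."""
    parts = field.lstrip("$").split(".")
    inner = "".join(f"->'{p}'" for p in parts[:-1])
    return f"data{inner}->>'{parts[-1]}'"
```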

$sort — ORDER BY

$sort
{"$sort": {"revenue": -1, "name": 1}}
ORDER BY
ORDER BY revenue DESC, name ASC

$limit — LIMIT

$limit
{"$limit": 10}
LIMIT
LIMIT 10

$skip — OFFSET

$skip
{"$skip": 100}
OFFSET
OFFSET 100

Large OFFSET values scan and discard the skipped rows. For deep pagination (tens of thousands of rows in), use a keyset query — a $match on the last-seen sort key — instead of $skip.
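A keyset page, sketched as a pipeline builder (the field name and page size are illustrative):

```python
def next_page(last_seen_total, page_size=25):
    """Keyset pagination: filter past the last-seen sort key rather
    than discarding rows with $skip. Assumes a descending sort on a
    unique-ish `total`; a real cursor would break ties on _id."""
    return [
        {"$match": {"total": {"$lt": last_seen_total}}},
        {"$sort": {"total": -1}},
        {"$limit": page_size},
    ]
```

The $match lowers to an indexable WHERE, so every page costs the same regardless of depth.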

$project — SELECT with jsonb_build_object

Reshape the document. Four modes: include, exclude, rename, compute.

Include (with _id suppressed):

Include
{"$project": {"_id": 0, "total": 1, "status": 1}}
Generated SQL
SELECT jsonb_build_object('total', data->'total',
                          'status', data->'status') AS data
FROM orders

Rename via field reference:

Rename
{"$project": {"_id": 0, "order_total": "$total"}}
Generated SQL
SELECT jsonb_build_object('order_total', data->'total') AS data FROM orders

Computed expressions (arithmetic, string, conditional):

Compute
{"$project": {"line_total": {"$multiply": ["$price", "$quantity"]}}}
Generated SQL
SELECT jsonb_build_object(
  'line_total', (data->>'price')::numeric * (data->>'quantity')::numeric
) AS data FROM line_items

Supported in $project:

  • Arithmetic: $multiply, $add, $subtract, $divide, $mod
  • String: $concat, $toLower, $toUpper, $substr
  • Conditional: $cond, $ifNull, $switch

Unsupported operators ($dateFromString, $regexFind, $regexFindAll, $zip, $reduce, $let, trigonometric and statistical operators) raise UnsupportedOperator at translation time. The compiler fails closed.
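The fail-closed check amounts to a walk over the expression tree before any SQL is emitted — a minimal sketch, assuming expressions are nested dicts and lists (the helper names are illustrative):

```python
SUPPORTED = {
    "$multiply", "$add", "$subtract", "$divide", "$mod",
    "$concat", "$toLower", "$toUpper", "$substr",
    "$cond", "$ifNull", "$switch",
}

class UnsupportedOperator(ValueError):
    pass

def check_expr(expr):
    """Walk a $project expression tree and raise on the first
    operator outside the supported set, before SQL generation."""
    if isinstance(expr, dict):
        for op, args in expr.items():
            if op.startswith("$") and op not in SUPPORTED:
                raise UnsupportedOperator(op)
            check_expr(args)
    elif isinstance(expr, list):
        for item in expr:
            check_expr(item)
```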

$unwind — CROSS JOIN LATERAL

$unwind
{"$unwind": "$tags"}
LATERAL join
CROSS JOIN LATERAL jsonb_array_elements_text(data->'tags') AS tag

preserveNullAndEmptyArrays: true keeps rows whose target field is missing, null, or an empty array — LEFT JOIN LATERAL with a one-element fallback:

LEFT JOIN LATERAL
LEFT JOIN LATERAL (
  SELECT value FROM jsonb_array_elements_text(coalesce(data->'tags', '[null]'::jsonb))
) AS t ON true

includeArrayIndex: "tag_idx" adds the 1-based position via WITH ORDINALITY (less 1, to match MongoDB's 0-based convention):

WITH ORDINALITY
CROSS JOIN LATERAL jsonb_array_elements_text(data->'tags')
  WITH ORDINALITY AS t(tag, ord)
-- and `ord - 1 AS tag_idx` in the SELECT

If the target path holds a scalar rather than an array, Gold Lapel wraps it with jsonb_build_array before unwinding — matching MongoDB's behaviour of treating scalars as single-element arrays.

Use jsonb_array_elements_text for scalar arrays; jsonb_array_elements for arrays of objects (preserves JSON structure).
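The row-level semantics — arrays fan out, scalars count as one-element arrays, preserveNullAndEmptyArrays keeps otherwise-dropped rows — can be modeled in plain Python (a sketch of the behavior, not the compiler's code path):

```python
def unwind(doc, field, preserve=False):
    """Model of $unwind: arrays produce one row per element, scalars
    behave as one-element arrays, and missing/null/empty fields keep
    a single row only when preserve (preserveNullAndEmptyArrays) is set."""
    value = doc.get(field)
    if isinstance(value, list):
        items = value if value else ([None] if preserve else [])
    elif value is None:
        items = [None] if preserve else []
    else:
        items = [value]
    return [{**doc, field: item} for item in items]
```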

$lookup — Correlated Subquery with jsonb_agg

Four-parameter form:

$lookup
{"$lookup": {
    "from": "customers",
    "localField": "customer_id",
    "foreignField": "_id",
    "as": "customer"
}}
Correlated subquery
SELECT orders.data ||
       jsonb_build_object('customer',
         (SELECT coalesce(jsonb_agg(c.data), '[]'::jsonb)
          FROM customers c
          WHERE c.data->>'_id' = orders.data->>'customer_id')
       ) AS data
FROM orders

The as field receives an array of matched documents — an empty array when nothing matches, a one-element array for a single match, a many-element array for many. No client-side reshaping required.
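Those array semantics are easy to model — a pure-Python sketch of the four-parameter form, not the compiler's code path:

```python
def lookup(local_docs, foreign_docs, local_field, foreign_field, as_field):
    """Model of four-parameter $lookup: every local document gains an
    array under as_field -- empty on no match, one element per
    matching foreign document otherwise."""
    index = {}
    for doc in foreign_docs:
        index.setdefault(doc.get(foreign_field), []).append(doc)
    return [
        {**doc, as_field: index.get(doc.get(local_field), [])}
        for doc in local_docs
    ]
```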

The planner is cost-based: small input batches against a large foreign collection get a nested loop with index probe; large batches switch to hash join.

Subpipeline form with let:

$lookup with sub-pipeline
{"$lookup": {
    "from": "line_items",
    "let": {"oid": "$_id"},
    "pipeline": [
        {"$match": {"$expr": {"$eq": ["$order_id", "$$oid"]}}},
        {"$match": {"refunded": false}}
    ],
    "as": "live_items"
}}

The inner stages flatten into the correlated subquery's WHERE clause. Subpipelines of arbitrary depth compile to one statement.

The Seven Accumulators

  • $sum → sum(expr). $sum: 1 short-circuits to count(*) for the idiomatic plan.
  • $avg → avg(expr).
  • $min → min(expr).
  • $max → max(expr).
  • $count → count(*). Available as a dedicated stage as well as an accumulator.
  • $push → jsonb_agg(expr). Returns a JSONB array; round-trips through the document shape natively.
  • $addToSet → jsonb_agg(DISTINCT expr). Output order is unspecified; sort downstream if needed.

$push uses jsonb_agg, not array_agg — the result type is JSONB so the document round-trips cleanly without a conversion step.

$addToSet carries no guaranteed order. For ordered distinct values, follow the $group with an explicit sort on the array inside a $project.

Stages with Limits

  • $facet — supported. Each sub-pipeline gets its own CTE. For more than four or five sub-pipelines over a very large collection, write separate doc_aggregate calls and compose results in application code.
  • $bucket — supported. Explicit boundaries only.
  • $bucketAuto — not yet. Automatic boundary selection is on the roadmap.
  • $graphLookup — not yet. WITH RECURSIVE is the SQL target; the compiler does not emit it yet. Use a recursive CTE directly for now.
  • $merge — not yet. Writes pipeline output back into a collection. On the roadmap.
  • $out — not yet. Writes pipeline output to a new collection. On the roadmap.

Any unsupported stage raises UnsupportedStage at translation time, never silently — a compiler that fails closed is the only kind worth trusting.

Refresh Modes

doc_aggregate accepts a refresh parameter that controls how the cached matview is kept current.

  • on_write (default) — The write path consults a dependency map; a REFRESH MATERIALIZED VIEW CONCURRENTLY is scheduled when a write affects the source collection. Refresh is asynchronous; staleness is bounded by queue depth and matview build time (typically single-digit seconds).
  • ttl — Each matview carries a last-refreshed timestamp. A read older than ttl seconds triggers a refresh before returning. Right for dashboards, leaderboards, periodic reports.
  • manual — No automatic refresh. The caller invokes gl.refresh_aggregate(collection, pipeline) when refresh is warranted. The escape hatch for high-cost pipelines whose refresh cadence the application knows best.
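The ttl decision itself is one comparison — a sketch, assuming the last-refreshed timestamp is tracked as seconds since the epoch:

```python
import time

def ttl_read_needs_refresh(last_refreshed, ttl_seconds, now=None):
    """ttl mode in one line: a read against a matview whose last
    refresh is older than ttl_seconds triggers a refresh first."""
    now = time.time() if now is None else now
    return (now - last_refreshed) > ttl_seconds
```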

A cached pipeline result is stale the moment a matching document changes. The trade-off is honest, and named in Chapter 12.

A Complete Example

The benchmarked six-stage pipeline from Chapter 12:

Six-stage pipeline
results = doc_aggregate("orders", [
    {"$match": {"status": "shipped",
                "placed_at": {"$gte": "2026-01-01"}}},
    {"$lookup": {"from": "customers",
                 "localField": "customer_id",
                 "foreignField": "_id",
                 "as": "customer"}},
    {"$unwind": "$customer"},
    {"$group": {"_id": "$customer.region",
                "revenue": {"$sum": "$total"},
                "orders":  {"$sum": 1}}},
    {"$sort": {"revenue": -1}},
    {"$limit": 20}
])

Cold path: ~35 ms (comparable to MongoDB 7's 30 ms — the cold paths are within a few milliseconds of each other). Warm path: 357 µs. The warm path is not running an aggregation. It is reading a pre-computed table.

Full benchmark methodology — dataset shape, hardware, percentile distributions, index configuration, reproduction scripts — lives on the companion benchmark page. The book tells you what the number means; the website lets you verify it.