Appendix E: Aggregation Pipeline Reference
Every supported pipeline stage and accumulator. Chapter 12 is the narrative; this is the lookup. The compiler lowers each stage into a CTE in a single compiled statement, and PostgreSQL 12+ inlines the CTEs by default — the planner sees the whole shape and reorders as cost dictates.
Every pipeline result is canonicalized into a stable hash and cached behind a materialized view (gl_matview_<hash>). The first call builds the view; every call afterwards is an index scan. This is where the warm-cache 84x number comes from — not from PostgreSQL being faster than MongoDB at aggregation in any abstract sense, but from storing the result in a place the planner can index-scan.
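The stable hash can be pictured as a canonical JSON serialization of the call, digested — a sketch of the idea only; the actual canonicalization rules and view-naming scheme are Gold Lapel internals:

```python
import hashlib
import json

def pipeline_cache_key(collection, pipeline):
    # Canonical form: stable key order, no insignificant whitespace,
    # so semantically identical pipelines hash to the same matview name.
    canonical = json.dumps({"collection": collection, "pipeline": pipeline},
                           sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
    return f"gl_matview_{digest}"

# Key order inside a stage does not change the key; stage order does.
a = pipeline_cache_key("orders", [{"$match": {"status": "shipped", "total": {"$gte": 100}}}])
b = pipeline_cache_key("orders", [{"$match": {"total": {"$gte": 100}, "status": "shipped"}}])
```

Note that stage *order* is preserved in the serialization, as it must be — reordering stages changes the pipeline's meaning, so it rightly produces a different key.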
$match — WHERE (or HAVING)
Filter documents. Position matters: before $group it lowers to WHERE; after $group it lowers to HAVING.
{"$match": {"status": "shipped", "total": {"$gte": 100}}}

Before $group:

SELECT data FROM orders
WHERE data->>'status' = 'shipped'
  AND (data->>'total')::numeric >= 100

After $group:

... GROUP BY data->>'region'
HAVING sum((data->>'total')::numeric) >= 100

Filter operators inside $match use the full filter language from Appendix D.
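The position rule can be sketched as a small classifier over the stage list — an illustration of the rule, not the compiler's actual lowering code:

```python
def match_clauses(pipeline):
    """Classify each $match stage by its position relative to $group:
    before the first $group it lowers to WHERE, after it to HAVING."""
    clauses, seen_group = [], False
    for stage in pipeline:
        if "$group" in stage:
            seen_group = True
        elif "$match" in stage:
            clauses.append("HAVING" if seen_group else "WHERE")
    return clauses
```

A pipeline that filters rows, groups, then filters groups therefore produces one WHERE and one HAVING — exactly the two lowerings shown above.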
$group — GROUP BY with Aggregates
{"$group": {"_id": "$region",
            "revenue": {"$sum": "$total"},
            "avg_order": {"$avg": "$total"}}}

SELECT data->>'region' AS _id,
       sum((data->>'total')::numeric) AS revenue,
       avg((data->>'total')::numeric) AS avg_order
FROM orders
GROUP BY data->>'region'

Composite group keys. {"_id": {"region": "$region", "channel": "$channel"}} lowers to a multi-expression GROUP BY with jsonb_build_object reassembling the _id shape in the SELECT list. The composite key respects expression indexes on the underlying fields; grouping by a synthesized object would defeat them.
Dot notation. "$address.city" lowers to data->'address'->>'city'.
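The lowering rule is mechanical: every intermediate hop stays jsonb (`->`), and only the final hop extracts text (`->>`). A minimal sketch, separate from the compiler's actual path handling:

```python
def lower_path(field_ref):
    """Lower a MongoDB field reference like "$address.city" to a
    PostgreSQL jsonb access chain ending in a text extraction."""
    parts = field_ref.lstrip("$").split(".")
    expr = "data"
    for part in parts[:-1]:
        expr += f"->'{part}'"           # intermediate hops stay jsonb
    return expr + f"->>'{parts[-1]}'"   # final hop extracts text
```

Keeping intermediate hops as jsonb matters: `->>` on a middle segment would collapse the subdocument to text and make the next hop fail.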
$sort — ORDER BY
{"$sort": {"revenue": -1, "name": 1}}

ORDER BY revenue DESC, name ASC

$limit — LIMIT

{"$limit": 10}

LIMIT 10

$skip — OFFSET

{"$skip": 100}

OFFSET 100

Large OFFSET values scan and discard the skipped rows. For deep pagination (tens of thousands of rows in), use a keyset query — a $match on the last-seen sort key — instead of $skip.
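A keyset page in pipeline form — the field name and cursor value here are illustrative:

```python
# Sort key of the last row on the previous page (application-held cursor).
last_seen = "2026-03-14T09:00:00Z"

keyset_page = [
    {"$match": {"placed_at": {"$lt": last_seen}}},  # replaces {"$skip": N}
    {"$sort": {"placed_at": -1}},
    {"$limit": 20},
]
```

With an expression index on placed_at, every page is an index range scan of at most 20 rows; the $skip version scans and discards everything before the page.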
$project — SELECT with jsonb_build_object
Reshape the document. Four modes: include, exclude, rename, compute.
Include (with _id suppressed):
{"$project": {"_id": 0, "total": 1, "status": 1}}

SELECT jsonb_build_object('total', data->'total', 'status', data->'status') AS data
FROM orders

Rename via field reference:

{"$project": {"_id": 0, "order_total": "$total"}}

SELECT jsonb_build_object('order_total', data->'total') AS data
FROM orders

Computed expressions (arithmetic, string, conditional):

{"$project": {"line_total": {"$multiply": ["$price", "$quantity"]}}}

SELECT jsonb_build_object(
  'line_total', (data->>'price')::numeric * (data->>'quantity')::numeric
) AS data
FROM line_items

Supported in $project:
- Arithmetic: $multiply, $add, $subtract, $divide, $mod
- String: $concat, $toLower, $toUpper, $substr
- Conditional: $cond, $ifNull, $switch
Unsupported operators ($dateFromString, $regexFind, $regexFindAll, $zip, $reduce, $let, trigonometric and statistical operators) raise UnsupportedOperator at translation time. The compiler fails closed.
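The fail-closed check amounts to a recursive walk of the expression tree — a sketch of the principle, not the compiler's actual traversal:

```python
SUPPORTED_OPERATORS = {
    "$multiply", "$add", "$subtract", "$divide", "$mod",
    "$concat", "$toLower", "$toUpper", "$substr",
    "$cond", "$ifNull", "$switch",
}

class UnsupportedOperator(Exception):
    pass

def check_operators(expr):
    """Reject unknown $-operators at translation time rather than
    passing them through silently (fail closed)."""
    if isinstance(expr, dict):
        for key, value in expr.items():
            if key.startswith("$") and key not in SUPPORTED_OPERATORS:
                raise UnsupportedOperator(key)
            check_operators(value)
    elif isinstance(expr, list):
        for item in expr:
            check_operators(item)
```

Field references like "$price" are plain strings, not dict keys, so the walk ignores them; only operator positions are validated.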
$unwind — CROSS JOIN LATERAL
{"$unwind": "$tags"}

CROSS JOIN LATERAL jsonb_array_elements_text(data->'tags') AS tag

preserveNullAndEmptyArrays: true keeps rows whose target field is missing, null, or an empty array — LEFT JOIN LATERAL with a one-element fallback:

LEFT JOIN LATERAL (
  SELECT value FROM jsonb_array_elements_text(coalesce(data->'tags', '[null]'::jsonb))
) AS t ON true

includeArrayIndex: "tag_idx" adds the element position via WITH ORDINALITY (ordinality is 1-based, so the compiler subtracts 1 to match MongoDB's 0-based convention):

CROSS JOIN LATERAL jsonb_array_elements_text(data->'tags')
WITH ORDINALITY AS t(tag, ord)
-- and `ord - 1 AS tag_idx` in the SELECT

If the target path holds a scalar rather than an array, Gold Lapel wraps it with jsonb_build_array before unwinding — matching MongoDB's behaviour of treating scalars as single-element arrays.
Use jsonb_array_elements_text for scalar arrays; jsonb_array_elements for arrays of objects (preserves JSON structure).
$lookup — Correlated Subquery with jsonb_agg
Four-parameter form:
{"$lookup": {
  "from": "customers",
  "localField": "customer_id",
  "foreignField": "_id",
  "as": "customer"
}}

SELECT orders.data ||
       jsonb_build_object('customer',
         (SELECT coalesce(jsonb_agg(c.data), '[]'::jsonb)
          FROM customers c
          WHERE c.data->>'_id' = orders.data->>'customer_id')
       ) AS data
FROM orders

The as field receives an array of matched documents — an empty array when nothing matches, a one-element array for a single match, a many-element array for many. No client-side reshaping required.
The planner is cost-based: small input batches against a large foreign collection get a nested loop with index probe; large batches switch to hash join.
Subpipeline form with let:
{"$lookup": {
  "from": "line_items",
  "let": {"oid": "$_id"},
  "pipeline": [
    {"$match": {"$expr": {"$eq": ["$order_id", "$$oid"]}}},
    {"$match": {"refunded": false}}
  ],
  "as": "live_items"
}}

The inner stages flatten into the correlated subquery's WHERE clause. Subpipelines of arbitrary depth compile to one statement.
The Seven Accumulators
| MongoDB | SQL generated | Notes |
|---|---|---|
| $sum | sum(expr) | $sum: 1 short-circuits to count(*) for the idiomatic plan |
| $avg | avg(expr) | |
| $min | min(expr) | |
| $max | max(expr) | |
| $count | count(*) | Dedicated stage as well as accumulator |
| $push | jsonb_agg(expr) | Returns a JSONB array; round-trips through the document shape natively |
| $addToSet | jsonb_agg(DISTINCT expr) | Output order is unspecified; sort downstream if needed |
$push uses jsonb_agg, not array_agg — the result type is JSONB so the document round-trips cleanly without a conversion step.
$addToSet carries no guaranteed order. For ordered distinct values, follow the $group with an explicit sort on the array inside a $project.
Stages with Limits
| Stage | Status | Notes |
|---|---|---|
| $facet | Supported | Each sub-pipeline gets its own CTE. For more than four or five sub-pipelines over a very large collection, write separate doc_aggregate calls and compose results in application code. |
| $bucket | Supported | Explicit boundaries only. |
| $bucketAuto | Not yet | Automatic boundary selection on the roadmap. |
| $graphLookup | Not yet | WITH RECURSIVE is the SQL target; the compiler does not emit it yet. Use a recursive CTE directly for now. |
| $merge | Not yet | Pipeline output back into a collection. On the roadmap. |
| $out | Not yet | Pipeline output to a new collection. On the roadmap. |
Any unsupported stage raises UnsupportedStage at translation time, never silently — a compiler that fails closed is the only kind worth trusting.
Refresh Modes
doc_aggregate accepts a refresh parameter that controls how the cached matview is kept current.
| Mode | Behavior |
|---|---|
| on_write (default) | Write path consults a dependency map; a REFRESH MATERIALIZED VIEW CONCURRENTLY is scheduled when a write affects the source collection. Refresh is asynchronous; staleness is bounded by queue depth and matview build time (typically single-digit seconds). |
| ttl | Each matview carries a last-refreshed timestamp. A read older than ttl seconds triggers a refresh before returning. Right for dashboards, leaderboards, periodic reports. |
| manual | No automatic refresh. The caller invokes gl.refresh_aggregate(collection, pipeline) when refresh is warranted. The escape hatch for high-cost pipelines whose refresh cadence the application knows best. |
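The read-path side of the three modes reduces to a small predicate — a sketch with illustrative parameter names, not the library's exact API:

```python
import time

def needs_refresh(last_refreshed, mode, ttl_seconds=None, now=None):
    """Decide, at read time, whether a matview refresh must run first.
    last_refreshed/now are epoch seconds; ttl_seconds applies to ttl mode."""
    now = time.time() if now is None else now
    if mode == "manual":
        return False          # caller refreshes explicitly
    if mode == "ttl":
        return (now - last_refreshed) > ttl_seconds
    return False              # on_write: refresh is driven by writes, not reads
```

Only ttl mode ever blocks a read on a refresh; on_write and manual always serve whatever the matview currently holds.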
A cached pipeline result is stale the moment a matching document changes. The trade-off is honest, and named in Chapter 12.
A Complete Example
The benchmarked six-stage pipeline from Chapter 12:
results = doc_aggregate("orders", [
{"$match": {"status": "shipped",
"placed_at": {"$gte": "2026-01-01"}}},
{"$lookup": {"from": "customers",
"localField": "customer_id",
"foreignField": "_id",
"as": "customer"}},
{"$unwind": "$customer"},
{"$group": {"_id": "$customer.region",
"revenue": {"$sum": "$total"},
"orders": {"$sum": 1}}},
{"$sort": {"revenue": -1}},
{"$limit": 20}
])

Cold path: ~35 ms (comparable to MongoDB 7's 30 ms — the cold paths are within a few milliseconds of each other). Warm path: 357 µs. The warm path is not running an aggregation. It is reading a pre-computed table.
Full benchmark methodology — dataset shape, hardware, percentile distributions, index configuration, reproduction scripts — lives on the companion benchmark page. The book tells you what the number means; the website lets you verify it.