Chapter 8: Aggregation: SQL Is the Pipeline
This chapter is, I will confess, the one I have been looking forward to most. The previous four chapters taught the mechanics of storing, querying, indexing, and validating documents. This chapter addresses a question that is less about mechanics and more about philosophy: when you need to aggregate your documents — group them, count them, sum them, rank them, reshape them into answers — which model gives you more?
MongoDB's aggregation pipeline is a sequence of stages executed in order. Each stage receives the output of the previous stage, transforms it, and passes it forward. $match → $group → $sort → $limit. The model is intuitive — it reads like a recipe. Step one, filter the documents. Step two, group by field. Step three, sort the groups. Step four, take the top ten. There is a reason this model is popular: it maps cleanly to how humans think about data transformation, and MongoDB deserves credit for designing an abstraction that developers find natural from the first time they use it.
SQL takes a different approach entirely. You describe the result you want — which rows, which groups, which ordering — and the query planner decides how to get there. The developer expresses what, not how. The planner determines the execution order, chooses the algorithms, and uses statistics about your data to pick the fastest path. A recipe tells you the steps. A declaration tells you the destination and trusts the kitchen to find the best route.
The practical consequence: a pipeline can only express operations that compose linearly. SQL can express operations that compose in any shape the planner can optimize — including shapes the pipeline model cannot represent at all. This chapter shows the parity first, then shows where SQL leaves the pipeline behind.
Stage-by-Stage Translation
Seven translations, covering the eight most common pipeline stages, each presented next to its SQL equivalent — because the best way to learn a new language is to see it next to the one you already speak.
$match → WHERE
db.orders.aggregate([{ $match: { status: "shipped" } }])

SELECT * FROM orders WHERE data @> '{"status": "shipped"}';

Containment for equality. GIN-indexed. The simplest and most common stage — and the translation that should reassure you that the distance between the two languages is smaller than it appears.
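One caveat worth a sketch: @> expresses equality and containment only. A range predicate needs field extraction and a cast — shown here against the same hypothetical orders table, combining both so the GIN index can still narrow the scan:

```sql
-- Containment narrows by status (GIN-indexed);
-- the extracted, cast field handles the range predicate.
SELECT *
FROM orders
WHERE data @> '{"status": "shipped"}'
  AND (data->>'amount')::numeric >= 100;
```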
$group → GROUP BY
db.orders.aggregate([
{ $group: { _id: "$status", count: { $sum: 1 }, total: { $sum: "$amount" } } }
])

SELECT data->>'status' AS status,
       COUNT(*) AS count,
       SUM((data->>'amount')::numeric) AS total
FROM orders
GROUP BY data->>'status';

The accumulators map directly: $sum → SUM(), $avg → AVG(), $min → MIN(), $max → MAX(), $count → COUNT(). MongoDB's $push (collect values into an array) becomes array_agg(). $addToSet (collect distinct values) becomes array_agg(DISTINCT ...). The vocabulary changes. The operations do not.
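The array accumulators deserve a sketch of their own. Assuming each order document carries user_id and status fields, collecting per-user order ids and distinct statuses looks like this:

```sql
SELECT data->>'user_id' AS user_id,
       array_agg(_id) AS order_ids,                      -- $push
       array_agg(DISTINCT data->>'status') AS statuses   -- $addToSet
FROM orders
GROUP BY data->>'user_id';
```

If you want the collected values back as JSON rather than a PostgreSQL array, jsonb_agg plays the same role.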
$sort → ORDER BY
db.orders.aggregate([{ $sort: { amount: -1 } }])

SELECT * FROM orders ORDER BY (data->>'amount')::numeric DESC;

$limit / $skip → LIMIT / OFFSET

db.orders.aggregate([{ $skip: 20 }, { $limit: 10 }])

SELECT * FROM orders LIMIT 10 OFFSET 20;

$project → SELECT

db.orders.aggregate([{ $project: { status: 1, amount: 1, _id: 0 } }])

SELECT data->>'status' AS status, (data->>'amount')::numeric AS amount
FROM orders;

These five translations are straightforward enough that I trust you see the pattern. The syntax differs. The intent is identical. The planner on the PostgreSQL side may choose a different execution order than the pipeline's stage sequence — and that freedom to reorder is, as we will see shortly, one of SQL's advantages.
Now let me show you the two translations that deserve more attention.
$unwind → jsonb_array_elements
This is the stage MongoDB developers ask about most, and it is the one where the PostgreSQL equivalent is most worth understanding in depth.
$unwind takes an array field and produces one document per element. jsonb_array_elements does the same — it expands a JSON array into rows.
db.orders.aggregate([{ $unwind: "$items" }])

SELECT o._id, o.created_at, item
FROM orders o, jsonb_array_elements(o.data->'items') AS item;

An order with three line items becomes three rows, each carrying the full item object. From there, you group, sum, filter, or join — just as you would after $unwind in a MongoDB pipeline.
A complete $unwind + $group pipeline — total revenue by product:
db.orders.aggregate([
{ $unwind: "$items" },
{ $group: { _id: "$items.product", revenue: { $sum: "$items.price" } } },
{ $sort: { revenue: -1 } }
])

SELECT item->>'product' AS product,
       SUM((item->>'price')::numeric) AS revenue
FROM orders, jsonb_array_elements(data->'items') AS item
GROUP BY item->>'product'
ORDER BY revenue DESC;

Three pipeline stages become one SQL statement. The result is identical. The difference is structural: the pipeline describes three steps executed in sequence. The SQL describes the result and lets the planner decide the path. A recipe and a destination, arriving at the same place.
$lookup → JOIN
db.orders.aggregate([{
$lookup: { from: "users", localField: "user_id", foreignField: "_id", as: "user" }
}])

SELECT o.*, u.data AS user_data
FROM orders o
JOIN users u ON u._id = (o.data->>'user_id')::uuid;

MongoDB's $lookup uses indexes on the foreign collection in current versions — this is a genuine improvement over earlier releases, and credit is due. However, the execution model remains a correlated subquery: for each document in the local collection, MongoDB runs a query against the foreign collection. PostgreSQL's JOIN uses the query planner to choose hash joins, merge joins, or nested loops based on table statistics. The architectural difference is optimization strategy: SQL JOINs are planned across both sides; $lookup iterates one document at a time. On small collections, the difference is negligible. On large collections, it is the difference between per-document iteration and a globally planned join.
SQL JOINs also support all join types — inner, left, right, full, cross — and work across any tables in the database. $lookup connects collections within the same database. The scope of what you can join is the scope of what you can ask in a single query. SQL's scope is broader.
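As one sketch of those join types, a LEFT JOIN keeps every order even when the referenced user row is missing, with NULLs marking the gap (assuming the same orders and users tables from the example above):

```sql
SELECT o._id,
       u.data->>'email' AS email   -- NULL when no matching user exists
FROM orders o
LEFT JOIN users u ON u._id = (o.data->>'user_id')::uuid;
```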
A Complete Pipeline, Translated
Allow me to put the full picture together. A realistic business query: top 10 products by revenue in the last 30 days, with unit counts.
db.orders.aggregate([
{ $match: { created_at: { $gte: thirtyDaysAgo } } },
{ $unwind: "$items" },
{ $group: { _id: "$items.product", revenue: { $sum: "$items.price" }, units: { $sum: 1 } } },
{ $sort: { revenue: -1 } },
{ $limit: 10 }
])

SELECT item->>'product' AS product,
SUM((item->>'price')::numeric) AS revenue,
COUNT(*) AS units
FROM orders, jsonb_array_elements(data->'items') AS item
WHERE created_at > now() - interval '30 days'
GROUP BY item->>'product'
ORDER BY revenue DESC
LIMIT 10;

Five pipeline stages. One SQL statement. Same result. The MongoDB pipeline describes the procedure: filter, then expand, then group, then sort, then limit. The SQL describes the outcome: I want the top 10 products by revenue from the last 30 days. The planner decides whether to filter first, expand first, or use an index to skip both. The developer's job is to state what they want. The database's job is to figure out how. A well-designed division of labor, if I may say so.
Where SQL Wins Outright
The translations above demonstrate parity. Every pipeline stage has a SQL equivalent, and the SQL is often more concise. But parity is not the full story. SQL has four categories of capability that MongoDB's pipeline model cannot express — not as a matter of syntax, but as a matter of architecture.
Window Functions
Window functions compute a value across a set of rows related to the current row without collapsing those rows into groups. They answer questions like "what is this row's rank within its category?" and "how does this month's revenue compare to last month's?" — questions that GROUP BY cannot answer because GROUP BY collapses the rows.
Rank products by revenue within each category:
SELECT data->>'product' AS product,
data->>'category' AS category,
(data->>'revenue')::numeric AS revenue,
RANK() OVER (
PARTITION BY data->>'category'
ORDER BY (data->>'revenue')::numeric DESC
) AS category_rank
FROM products;

Month-over-month revenue growth:
SELECT date_trunc('month', created_at) AS month,
SUM((data->>'amount')::numeric) AS revenue,
SUM((data->>'amount')::numeric) - LAG(SUM((data->>'amount')::numeric))
OVER (ORDER BY date_trunc('month', created_at)) AS growth
FROM orders
GROUP BY month
ORDER BY month;

Running total:
SELECT created_at, (data->>'amount')::numeric AS amount,
SUM((data->>'amount')::numeric) OVER (ORDER BY created_at) AS running_total
FROM orders;

MongoDB added $setWindowFields in version 5.0, and it covers the basics — $sum, $avg, $rank, $denseRank, and configurable document- or range-based window bounds. It is a real capability and a welcome addition. The gap that remains is composability. MongoDB's closest analogue to LAG and LEAD is $shift, which is more limited. And window operations exist only as a pipeline stage, not as composable expressions — you cannot embed a window function inside a CASE expression or use one in an ORDER BY clause. SQL window functions can appear anywhere a value expression is valid. The flexibility is not a matter of degree. It is a matter of kind.
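Frame clauses are part of that expressiveness. A sketch of a seven-day moving average of revenue, using ROWS BETWEEN to define the window frame (the daily subquery is hypothetical, derived from the same orders table):

```sql
SELECT day,
       revenue,
       AVG(revenue) OVER (
         ORDER BY day
         ROWS BETWEEN 6 PRECEDING AND CURRENT ROW  -- this row plus the six before it
       ) AS revenue_7day_avg
FROM (
  SELECT date_trunc('day', created_at) AS day,
         SUM((data->>'amount')::numeric) AS revenue
  FROM orders
  GROUP BY 1
) AS daily;
```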
CTEs (Common Table Expressions)
A CTE is a named intermediate result that can be referenced multiple times in the same query. Think of it as a pipeline stage that you can reuse — and reuse is the operative word.
WITH monthly_revenue AS (
SELECT date_trunc('month', created_at) AS month,
SUM((data->>'amount')::numeric) AS revenue
FROM orders
GROUP BY month
),
monthly_growth AS (
SELECT month, revenue,
revenue - LAG(revenue) OVER (ORDER BY month) AS growth,
ROUND(
(revenue - LAG(revenue) OVER (ORDER BY month)) /
NULLIF(LAG(revenue) OVER (ORDER BY month), 0) * 100, 1
) AS growth_pct
FROM monthly_revenue
)
SELECT * FROM monthly_growth WHERE growth_pct > 10;

Two CTEs, each building on the last, with the final SELECT filtering the result. monthly_revenue is defined once and could be referenced by any number of downstream queries. MongoDB's pipeline stages are consumed as they execute — if two branches of your analysis need the same intermediate result, the pipeline computes it twice. A CTE computes it once and names it. Naming things is not just convenient. It is how complex systems stay comprehensible.
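Reuse in the literal sense means referencing the same named result more than once in one query. A sketch, using the monthly_revenue definition from above to find months that beat the overall monthly average — the CTE is computed once and read twice:

```sql
WITH monthly_revenue AS (
  SELECT date_trunc('month', created_at) AS month,
         SUM((data->>'amount')::numeric) AS revenue
  FROM orders
  GROUP BY 1
)
SELECT month, revenue
FROM monthly_revenue
WHERE revenue > (SELECT AVG(revenue) FROM monthly_revenue);  -- second reference
```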
Recursive Queries
WITH RECURSIVE handles hierarchical data — org charts, category trees, threaded comments, bill-of-materials traversal. Any structure where a document refers to another document of the same kind, and the depth is unknown in advance.
WITH RECURSIVE category_tree AS (
-- Base case: root categories
SELECT _id, data->>'name' AS name, 0 AS depth
FROM categories
WHERE data->>'parent_id' IS NULL
UNION ALL
-- Recursive case: children
SELECT c._id, c.data->>'name', ct.depth + 1
FROM categories c
JOIN category_tree ct ON (c.data->>'parent_id')::uuid = ct._id
)
SELECT * FROM category_tree ORDER BY depth, name;

MongoDB offers $graphLookup for graph traversal — it can query across collections and supports configurable depth limits, which is more capable than early versions. SQL's recursive CTEs are more general: they work across any tables, support arbitrary termination conditions, compute values at each recursion level (running totals, path construction, depth counting), and are optimized by the query planner. For hierarchical document data stored in JSONB — categories, organizations, threaded discussions — recursive CTEs are the natural and more powerful fit.
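Path construction — one of the per-level computations mentioned above — is a one-column extension of the category_tree query: carry a text path that grows at each level:

```sql
WITH RECURSIVE category_tree AS (
  SELECT _id, data->>'name' AS name,
         data->>'name' AS path                       -- root: path is just the name
  FROM categories
  WHERE data->>'parent_id' IS NULL
  UNION ALL
  SELECT c._id, c.data->>'name',
         ct.path || ' > ' || (c.data->>'name')       -- child: parent path plus own name
  FROM categories c
  JOIN category_tree ct ON (c.data->>'parent_id')::uuid = ct._id
)
SELECT path FROM category_tree ORDER BY path;
```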
Cross-Table JOINs
JSONB document tables are regular PostgreSQL tables. They JOIN with anything — relational tables, other document collections, views, CTEs, materialized views. This is not an integration to configure. It is a property of being a PostgreSQL table.
SELECT o._id,
u.email,
p.name AS product_name,
(item->>'qty')::numeric AS quantity,
(item->>'price')::numeric AS line_total
FROM orders o
JOIN users u ON u.id = (o.data->>'user_id')::integer
CROSS JOIN LATERAL jsonb_array_elements(o.data->'items') AS item
JOIN products p ON p.sku = item->>'sku'
WHERE o.created_at > now() - interval '7 days';

One query. Document data joined with relational data joined with array-expanded line items. The query planner optimizes the entire thing — choosing join algorithms, using indexes on every table, estimating selectivity across all three. This is what it means for documents to be relational citizens: they participate in the full query planner, not a restricted subset of it. A document that can be joined with anything is a document that can answer any question. A document trapped in a single-collection pipeline is a document that can only answer the questions the pipeline was designed for.
Materialized Views — Cache the Aggregation
I have saved the most consequential concept for last, because it changes the performance conversation from "how fast can we run this aggregation?" to "why are we running this aggregation more than once?"
A materialized view stores the result of a query as a table. You define the query. PostgreSQL runs it and stores the result. Every subsequent read serves the stored result — no re-execution.
CREATE MATERIALIZED VIEW top_products_30d AS
SELECT item->>'product' AS product,
SUM((item->>'price')::numeric) AS revenue,
COUNT(*) AS units
FROM orders, jsonb_array_elements(data->'items') AS item
WHERE created_at > now() - interval '30 days'
GROUP BY item->>'product'
ORDER BY revenue DESC
LIMIT 100;
CREATE UNIQUE INDEX ON top_products_30d (product);

The aggregation runs once. The result is stored in an indexed table. Every subsequent query is a lookup, not a computation. Refresh when the underlying data changes:
REFRESH MATERIALIZED VIEW CONCURRENTLY top_products_30d;

CONCURRENTLY allows reads to continue during the refresh — no downtime, and readers are never blocked. It requires a unique index on the materialized view, which is why the CREATE UNIQUE INDEX above is necessary, not optional.
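Scheduling the refresh can live inside the database too. A sketch using the pg_cron extension — an assumption, since pg_cron must be installed and enabled separately — refreshing every hour:

```sql
-- Assumes CREATE EXTENSION pg_cron; has already been run.
SELECT cron.schedule(
  'refresh-top-products',      -- job name
  '0 * * * *',                 -- every hour, on the hour
  $$REFRESH MATERIALIZED VIEW CONCURRENTLY top_products_30d$$
);
```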
The architectural difference is fundamental and worth stating plainly. MongoDB re-executes the aggregation pipeline on every request. The same computation runs again and again, each time scanning the same data, producing the same result, until the underlying data changes — at which point it scans again, produces a new result, and begins the cycle over. A materialized view runs the aggregation once and serves the result from an indexed table on every subsequent request. The first execution does the work. Every subsequent execution is a lookup. This is not a caching trick. It is a different model: compute once, serve many.
This is the architecture behind the 84x benchmark in Chapter 12. The number is not a query optimization. It is the natural result of comparing "re-execute the aggregation on every request" against "look up the cached result." Chapter 12 shows how Gold Lapel's doc_aggregate manages the materialized view lifecycle — creation, refresh scheduling, and invalidation — automatically. The developer calls doc_aggregate. The proxy decides whether the query needs a materialized view, creates one if it does, and serves the cached result on subsequent calls. The performance difference is not subtle. It is architectural.
Part II is complete. Five chapters have built the PostgreSQL document layer: storage (Ch4), queries (Ch5), indexes (Ch6), validation (Ch7), and aggregation (this chapter). The SQL foundation is set — every capability you need to store, query, and aggregate documents, written in the SQL that PostgreSQL has refined for over three decades.
The aggregation comparison is, I believe, the clearest illustration of the relationship between the two models. MongoDB's pipeline is an excellent tool for linear data transformation — intuitive, composable, and genuinely well-designed. I do not say that as a formality. The pipeline model taught a generation of developers to think about data transformation as a sequence of composable steps, and that mental model has value regardless of which database you use. SQL's declarative model includes everything the pipeline can do, and adds window functions, CTEs, recursive queries, cross-table JOINs, and materialized views. The pipeline is a subset. SQL is the superset. Both are useful. One contains the other.
Part III introduces the Gold Lapel API — the developer experience layer that makes this SQL accessible through MongoDB-style syntax. The SQL came first because it is the foundation. The API comes next because foundation without experience is a library that nobody visits. Chapter 9 begins with the twenty-one methods. Chapter 12 brings the aggregation story full circle — doc_aggregate, materialized views, and the benchmark that shows what happens when you stop re-executing the same computation on every request.
The kitchen has been explained. Now let me show you the menu.