Chapter 16: Schema Design: When JSONB, When Columns, When Both
A collection that was, until recently, living quite contentedly inside MongoDB now finds itself on the threshold of PostgreSQL, uncertain of which room to enter. The tables, as the previous chapter promised, are waiting to be laid properly. Allow me.
Most MongoDB-to-PostgreSQL migrations do not fail at data transfer. They fail here, at the schema. The team exports the collection, copies the document shape into a single data JSONB column, and declares the migration complete. Six months later every hot-path query is extracting JSONB, every ORDER BY falls back to a sequential scan, and every foreign key the team wishes it had is unreachable. The table works. It just works slowly, and each query feels like pushing against a wall that did not exist on MongoDB.
I do not mention this to alarm you. I mention it because the schema deserves thirty quiet minutes of reconsideration before the migration script runs, and thirty minutes now is a great deal cheaper than thirty weeks later. Knowing when to use JSONB in PostgreSQL — and when not to — is the single decision that separates a migration that compounds into a long-term asset from one the team eventually regrets.
Chapter 17 is executable only if this chapter is honest. So let us be honest.
The Three Models for PostgreSQL JSONB Schema Design
There are exactly three schema shapes available when you move a document workload onto PostgreSQL: pure JSONB, pure columns, and the hybrid between them. About 80% of production migrations land in the hybrid middle. The other 20% split between the two extremes, and both extremes have legitimate, narrow justifications.
The wrong move, if I may say so, is to pick an extreme by default because it looks simpler. Simplicity is a virtue one earns through evidence, not one claims by silhouette.
Pure JSONB — the whole document in one column
The minimum viable shape is a primary key, a data JSONB column, and a pair of timestamps.
```sql
CREATE TABLE events (
    id         BIGSERIAL PRIMARY KEY,
    data       JSONB NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
```

Pure JSONB fits workloads where the document is read and written as a whole and rarely dissected: event logs, user preferences, dynamic form submissions, webhook payloads, audit trails. Anything heterogeneous. Anything where you genuinely do not know what tomorrow's document will contain.
The cost is that every filter on a frequently-queried field pays the JSONB extraction overhead. WHERE data->>'status' = 'active' is not free, and without an expression index it is a sequential scan that extracts and casts every row. A perfectly pleasant arrangement, provided you never ask it to hurry.
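If one such field does emerge as hot, an expression index recovers most of the lost ground without changing the schema. A sketch, assuming the events table above and a frequently filtered status key (the index name is mine):

```sql
-- Expression index on the extracted text value.
-- The query must use this exact expression to hit the index.
CREATE INDEX events_status_idx ON events ((data->>'status'));

-- Now an index scan rather than a sequential scan:
SELECT id FROM events WHERE data->>'status' = 'active';
```

The deeper discussion of expression indexes versus generated columns comes later in the chapter; the point here is only that the sequential scan is curable.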
Pure columns — the traditional relational schema
The opposite end promotes every field to a typed column and uses no JSONB at all.
```sql
CREATE TABLE users (
    id         UUID PRIMARY KEY,
    email      TEXT NOT NULL UNIQUE,
    status     TEXT NOT NULL,
    tenant_id  UUID NOT NULL REFERENCES tenants(id),
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
```

Appropriate when the shape is stable, every query is known, and the value of type safety and foreign keys outweighs the cost of schema migrations. The cost is real: every new field requires an ALTER TABLE, and the MongoDB team's muscle memory for "just add a field in the application" breaks.
Teams coming from a document culture often resent this — which is understandable — until the first time a typed column catches a bug that would have shipped silently as a malformed string inside a JSONB payload. One grows fond of the structure rather quickly after that.
The hybrid schema — stable fields as columns, variable fields as JSONB
The practical default. If I had to hand you a schema in the dark, this would be it.
```sql
CREATE TABLE users (
    id         UUID PRIMARY KEY,
    email      TEXT NOT NULL UNIQUE,
    status     TEXT NOT NULL,
    tenant_id  UUID NOT NULL REFERENCES tenants(id),
    metadata   JSONB NOT NULL DEFAULT '{}'::jsonb,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
```

The fields that every query touches — identity, tenancy, status, timestamps — are typed columns with indexes and foreign keys. The fields that vary per user, per feature flag, per experimental rollout live in metadata. The test is simple, and it is the one I would commit to memory before any other: if you would ever want to index it, it should be a column. If you would never query it on its own, it stays in JSONB.
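A typical hybrid query then filters on the typed, indexed columns and reads the variable payload from JSONB per matched row. A sketch, in which the metadata key is hypothetical:

```sql
-- Hot-path predicates on typed columns; cold payload read from JSONB.
SELECT id,
       email,
       metadata->'feature_flags' AS feature_flags  -- hypothetical key
FROM users
WHERE tenant_id = $1
  AND status = 'active';
```

The planner sees ordinary column predicates, so ordinary btree indexes apply; the JSONB extraction happens only on the rows that survive the filter.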
PostgreSQL JSONB vs columns at a glance
| Dimension | Pure JSONB | Pure Columns | Hybrid |
|---|---|---|---|
| Query speed on indexed field | Good with expression index | Best | Best on promoted fields |
| Schema evolution cost | None | High (ALTER TABLE per field) | Low for variable fields |
| Storage overhead | Higher (key names repeated per row) | Lowest | Moderate |
| Index flexibility | GIN + expression indexes | Full btree/hash/GiST/BRIN | Full on columns, GIN on JSONB |
| Type safety | Runtime only | Enforced at DDL time | Enforced on columns |
| Foreign key support | No (cannot reference a JSONB field) | Yes | Yes on columns |
| Migration complexity from MongoDB | Lowest | Highest | Moderate |
Three models, each honest about what it costs. One chooses not by preference but by the shape of the workload one actually has.
Generated Columns as the Middle Path
There is a fourth tool that is technically a refinement of the hybrid model but deserves its own section, because it changes the migration calculus entirely: the generated column. I confess a certain fondness for it.
What a PostgreSQL generated column actually is
A generated column is a column whose value is always computed from other columns in the same row. PostgreSQL has supported GENERATED ALWAYS AS (...) STORED since version 12. In PostgreSQL 18, virtual generated columns were added, computed at read time rather than stored on disk. For the schema-design patterns in this chapter, STORED is what matters — it writes to disk, it can be indexed, and it behaves like any other column from the planner's perspective.
The column is computed on every insert and update. No trigger to maintain. No application code to remember. No dual-write drift. You cannot write to it directly; the source expression is the only truth. A single authoritative field, faithfully mirrored — which is, after all, the whole point.
Promoting a JSONB field to an indexed generated column
The canonical pattern: the document stays whole and the hot field becomes queryable at column speed.
```sql
ALTER TABLE users
    ADD COLUMN email TEXT
    GENERATED ALWAYS AS (data->>'email') STORED;

CREATE INDEX users_email_idx ON users (email);
```

After this, SELECT * FROM users WHERE email = $1 uses the btree index on email. No expression in the WHERE clause, no JSONB extraction at query time, no surprises in EXPLAIN. The application can continue to write whole documents to data; the email column follows automatically, as though it had been there all along.
The trade is storage. Email now lives in both data and the email column, and every update to the JSONB payload recomputes the generated column and updates the index. Nothing is free, but this is a small tariff for a civilised query.
When a generated column beats an expression index
There is a lighter alternative: index the expression directly.
```sql
CREATE INDEX users_email_expr_idx ON users ((data->>'email'));
```

Both work. The difference is ergonomic and structural. An expression index only helps queries that use the exact expression: WHERE data->>'email' = $1 hits the index, WHERE email = $1 cannot — because email does not exist as a column. A generated column is queryable by name, appears in SELECT *, and participates cleanly in views, ORMs, and foreign keys from other tables.
The expression index is the right tool when you are adding an index to an existing large table and do not want to pay the cost of rewriting the table to add a STORED column. The generated column is the right tool when you are designing the schema, the field is stable, and you want the query surface to look like any other relational table.
In short: one is a neat trick performed behind the scenes; the other is a full place-setting. Both have their occasion.
When to Stay Pure JSONB
There is a persistent temptation, once one has learned about generated columns and hybrid schemas, to promote everything. I must gently ask you to resist it. Pure JSONB is the right answer more often than migration teams admit — and admitting it costs nothing, while promoting fifteen columns you do not need costs rather a lot.
The four signals that pure JSONB is right
Pure JSONB fits when all four of these are true:
- The schema evolves frequently and unpredictably.
- Documents within the collection are heterogeneous — different customers, different integrations, different versions all living side by side.
- The application reads and writes whole documents and rarely filters on individual fields.
- The migration timeline is short and you need to ship before you optimize.
```sql
INSERT INTO events (data)
VALUES ('{"kind":"page_view","user_id":"7a1c...","url":"/checkout","ts":"2026-04-18T10:15:22Z"}');
```

Shipping a pure-JSONB table in week one and migrating individual hot fields to generated columns in month three is a respectable, conservative path. Premature column extraction — promoting fifteen fields before production tells you which three actually matter — is not. One does not guess the wine before the meal has been ordered.
When to Extract Columns from JSONB
The inverse discipline matters just as much. A field that stays in JSONB when it should have been a column is a query you will eventually rewrite under pressure — and queries rewritten under pressure are rarely the ones one is most proud of.
The four tests for promoting a field to a column
Promote a field to a real column when any one of these applies:
- The field appears in WHERE on nearly every request to the table.
- The field is joined from another table. Foreign keys cannot target a field inside JSONB; the referential integrity constraint needs a column.
- The field is used in ORDER BY. JSONB values have no sort order the planner can use efficiently without an exact-match expression index.
- The field's type matters for correctness — timestamps, numerics, UUIDs, enums. JSONB stores numbers as numeric, which is correct but not always what you want, and timestamps are stored as strings until you cast them.
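That last test is easy to verify at the prompt. A small illustration of how JSONB types the values it stores:

```sql
-- JSONB numbers are numeric; timestamps are just strings until you cast them.
SELECT jsonb_typeof('{"price": 19.99, "ts": "2026-04-18T10:15:22Z"}'::jsonb -> 'price');  -- number
SELECT jsonb_typeof('{"price": 19.99, "ts": "2026-04-18T10:15:22Z"}'::jsonb -> 'ts');     -- string

-- A range predicate on the timestamp therefore needs an explicit cast:
SELECT id FROM events
WHERE (data->>'ts')::timestamptz >= now() - interval '1 day';
```

The cast works, but the planner cannot use an ordinary index through it, which is precisely where the next subsection picks up.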
Why ORDER BY on JSONB is slow without an expression index
When the planner sees ORDER BY data->>'created_at', it cannot use any index that is not exactly on that expression. A GIN index on data does not help — GIN supports containment and existence, not ordering. The result is a sequential scan followed by a sort.
```sql
-- Without an expression index
EXPLAIN ANALYZE
SELECT id, data->>'created_at' AS created_at
FROM events
ORDER BY data->>'created_at' DESC
LIMIT 20;

-- QUERY PLAN
-- Limit
--   ->  Sort
--         Sort Key: ((data ->> 'created_at'::text)) DESC
--         ->  Seq Scan on events
-- Execution Time: hundreds of milliseconds (illustrative; grows with table size)
```

Hundreds of milliseconds to discover the twenty most recent rows in a table the database already has. I'm afraid that is working rather harder than it needs to.
Adding the expression index, or a generated column plus a btree index, changes this to an ordered index scan with sub-millisecond tail latency on the same table. PostgreSQL requires index expressions to be IMMUTABLE, but the text-to-timestamptz cast is STABLE — it depends on session timezone. The simplest correct form is to index the ISO 8601 text directly. It sorts lexicographically the same way the timestamp does, provided the stored format is fixed (UTC with a Z suffix, for instance):
```sql
CREATE INDEX events_created_at_idx
    ON events ((data->>'created_at') DESC);
```

If you want a real timestamptz column — useful for range predicates and type-correct ORDER BY — promote it to a generated column, which is evaluated at write time and escapes the immutability requirement:
```sql
ALTER TABLE events
    ADD COLUMN created_at_ts TIMESTAMPTZ
    GENERATED ALWAYS AS ((data->>'created_at')::timestamptz) STORED;

CREATE INDEX events_created_at_ts_idx ON events (created_at_ts DESC);
```

One more line of DDL. A different query altogether.
When to Use Generated Columns
Generated columns shine in one specific situation: you are migrating incrementally and you want fast paths on specific fields without restructuring the document. This, in my experience, is where most teams actually live.
The incremental migration pattern
Ship the pure-JSONB schema first. Let it run in production for a few weeks. Watch which queries show up in pg_stat_statements with high total time. Then add generated columns exactly where the slow queries are, not where you guessed they would be.
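The watching itself is one query. A sketch against pg_stat_statements — the extension must be installed, and the timing column is total_exec_time on PostgreSQL 13 and later (total_time on older versions):

```sql
-- The queries consuming the most cumulative time, worst first.
SELECT query, calls, total_exec_time, mean_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```

Any JSONB extraction expression that appears near the top of this list is a candidate for promotion; anything that does not appear is not.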
```sql
ALTER TABLE users
    ADD COLUMN tenant_id UUID
    GENERATED ALWAYS AS ((data->>'tenant_id')::uuid) STORED;

CREATE INDEX users_tenant_id_idx ON users (tenant_id);
```

This pattern is conservative in the right direction. The document model does not change. The application continues writing whole documents. The performance profile shifts one query at a time, driven by evidence rather than prediction.
One caveat, because I would be an unreliable Waiter if I did not mention it: adding a STORED generated column to a populated table does rewrite the table — the column has to be computed for every existing row. On large tables plan the ALTER TABLE during a maintenance window, or use the expression-index alternative and skip the rewrite. The surprise is the expensive part; the knowledge of it is not.
When generated columns are the wrong answer
If the field needs to be written independently of the document — computed server-side, back-filled by a data-quality job, corrected out-of-band when a bug ships bad data — a generated column blocks you. You cannot UPDATE users SET email = ...; the generated column is read-only with respect to write statements.
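The restriction is enforced by the database, not by convention. Roughly what the failure looks like, assuming the email generated column from earlier in the chapter:

```sql
UPDATE users SET email = 'corrected@example.com' WHERE id = $1;
-- ERROR:  column "email" can only be updated to DEFAULT
-- DETAIL: Column "email" is a generated column.
```

The only way to change the column's value is to change the JSONB document it is derived from.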
In that case use a plain column and synchronize it from the application, accepting the dual-write discipline that comes with it. Or, better, restructure: if the field is authoritative enough to be written separately, it probably belongs in its own column and its own source of truth, with the JSONB document recording the rest. A field with two voices needs to decide which one is telling the truth.
MongoDB Collection to PostgreSQL Table Mapping Patterns
A collection is not a table — and here I must say a proper word about MongoDB, because it has been an extraordinarily influential database and the engineers who built it solved real problems that were not well-served elsewhere. Its document model is genuinely well-designed. The patterns below are not corrections; they are translations. The vocabulary differs; the intent carries across.
Most collections become one table, and the mapping patterns that work are a small set.
One-to-one — each collection becomes one table
The majority of collections map cleanly to a single table: one document becomes one row, the _id becomes the primary key, and the document body becomes the data column or a set of columns depending on your schema model. For most teams this covers the bulk of the migration surface — which is, on the whole, rather a relief.
Embedded documents to columns
When a MongoDB document contains a nested object that gets queried frequently — user.address.city is the usual example — the migration keeps the embedded structure in JSONB for full-document reads and promotes the hot nested path to a generated column.
```sql
CREATE TABLE users (
    id   UUID PRIMARY KEY,
    data JSONB NOT NULL,
    city TEXT GENERATED ALWAYS AS (data #>> '{address,city}') STORED
);

CREATE INDEX users_city_idx ON users (city);
```

The #>> operator walks a path and returns text. The generated column gives you WHERE city = 'Lisbon' at column speed without flattening the document. The document keeps its shape; the query gets its speed. Both are served.
Embedded arrays to related tables
When the MongoDB document holds an array of sub-documents that the application queries with $elemMatch, and the query pattern is awkward to express on the array, promote the array to a child table.
```sql
CREATE TABLE orders (
    id          UUID PRIMARY KEY,
    customer_id UUID NOT NULL REFERENCES customers(id),
    placed_at   TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE order_items (
    id         UUID PRIMARY KEY,
    order_id   UUID NOT NULL REFERENCES orders(id) ON DELETE CASCADE,
    sku        TEXT NOT NULL,
    quantity   INTEGER NOT NULL CHECK (quantity > 0),
    unit_price NUMERIC(12,2) NOT NULL
);

CREATE INDEX order_items_order_id_idx ON order_items (order_id);
CREATE INDEX order_items_sku_idx ON order_items (sku);
```

The gain is real: indexes per field, joins to other tables, aggregations that the planner can push down. The loss is that reading a whole order now requires a join or a JSON aggregation in the select list.
For workloads that read orders far more often than they filter line items, the array-in-JSONB form may be the right answer. For workloads that do analytics on items, the child table wins. I would not choose for you — I would only ask which direction the reads flow, and let that answer the question.
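When the whole-document read is needed, the child table can be folded back into a JSONB array in the select list. A sketch of that reassembly, using the tables above:

```sql
-- Rebuild each order as one document from its line items.
SELECT o.id,
       o.placed_at,
       jsonb_agg(
           jsonb_build_object('sku',        i.sku,
                              'quantity',   i.quantity,
                              'unit_price', i.unit_price)
           ORDER BY i.sku
       ) AS items
FROM orders o
JOIN order_items i ON i.order_id = o.id
GROUP BY o.id, o.placed_at;
```

It is more work per read than fetching a stored array, which is exactly the trade the paragraph above describes.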
Polymorphic collections — JSONB-first with a type discriminator
When a collection stores documents of genuinely different shapes — notifications that can be email or push or in-app, audit events across many subsystems, activity feeds — the right shape is a discriminator column plus JSONB, with partial indexes per kind.
```sql
CREATE TABLE notifications (
    id         BIGSERIAL PRIMARY KEY,
    user_id    UUID NOT NULL REFERENCES users(id),
    kind       TEXT NOT NULL,
    data       JSONB NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX notifications_email_idx
    ON notifications (user_id, created_at DESC)
    WHERE kind = 'email';

CREATE INDEX notifications_push_idx
    ON notifications (user_id, created_at DESC)
    WHERE kind = 'push';
```

Partial indexes keep each index small and let the planner pick the right one based on the kind in the query. This pattern scales well past the point where a single GIN index across all payload shapes would be unwieldy. A small, tidy index per kind beats one enormous index trying to serve them all.
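For the planner to choose a partial index, the query's predicate must imply the index's WHERE clause, which in practice means naming the kind explicitly. A sketch:

```sql
-- The kind = 'email' predicate matches notifications_email_idx;
-- the push index is never considered for this query.
SELECT id, data
FROM notifications
WHERE kind = 'email'
  AND user_id = $1
ORDER BY created_at DESC
LIMIT 20;
```

A query that parameterizes the kind (WHERE kind = $2) cannot be matched to a partial index at plan time, so keep the discriminator a literal on the hot paths.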
The GIN Index Decision — jsonb_path_ops vs jsonb_ops
The operator class you choose for a GIN index on JSONB is a schema-design decision, not a tuning knob. Rebuilding a GIN index on a large, write-heavy table is slow and blocking (or CONCURRENTLY and merely long). Pick once, deliberately. A decision made twice is a decision paid for twice.
jsonb_path_ops — smaller, faster, containment and jsonpath
jsonb_path_ops supports the containment operator @> and — since PostgreSQL 12 — the jsonpath operators @? and @@. It does not support the key-existence operators (?, ?|, ?&). On typical payloads the index is meaningfully smaller than the default operator class, and containment queries are usually faster as a result.
```sql
CREATE INDEX events_data_path_idx
    ON events USING GIN (data jsonb_path_ops);
```

Use it when every JSONB query in the application is of the form "does this document contain this sub-document?" — for example, WHERE data @> '{"kind":"signup","source":"email"}' — or when you are using jsonpath predicates.
jsonb_ops — larger, supports key existence
The default operator class supports @>, ?, ?|, and ?&. Any query that asks "does this key exist?" requires it.
```sql
CREATE INDEX events_data_idx
    ON events USING GIN (data);
```

Larger index, broader operator support. On payloads with many unique keys the difference in index size is substantial, and the difference matters more on write-heavy tables where every insert updates the index.
The decision rule
If the application only ever asks "does this document contain this sub-document?" (or uses jsonpath) — use jsonb_path_ops. If the application asks "does this key exist?" — use jsonb_ops. Mixed workloads: create both, or use jsonb_ops and accept the larger index. Choosing wrong is not a crisis, but it costs you index size and scan speed until you rebuild.
Naming Conventions — Collection Names to Table Names
A small point that causes a surprising amount of pain, and I would rather you learn it here than at three in the morning: PostgreSQL folds unquoted identifiers to lowercase. A MongoDB collection named userEvents becomes a PostgreSQL table named userevents unless you quote the identifier every time you reference it — and double-quoted identifiers are a lifetime of inconvenience in SQL, ORMs, and migration tooling.
Translate camelCase to snake_case at migration time.
- users → users
- userEvents → user_events
- OrderLineItems → order_line_items
- webhookDeliveries → webhook_deliveries

Snake_case is the idiomatic PostgreSQL choice. The migration script in Chapter 17 assumes this convention, and so will every team member who opens psql after you. A kindness to one's future colleagues, really.
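The folding behaviour itself takes thirty seconds to see. A short demonstration (the table is throwaway):

```sql
-- Unquoted identifiers fold to lowercase: this creates a table named userevents.
CREATE TABLE userEvents (id BIGSERIAL PRIMARY KEY);

SELECT * FROM userevents;     -- works: the table's actual name is lowercase
SELECT * FROM "userEvents";   -- ERROR: relation "userEvents" does not exist
```

Quoting at CREATE time inverts the problem: the mixed-case name is preserved, and every subsequent reference must quote it forever. Hence the advice above — translate once, at migration time.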
Honest Boundary — When the Hybrid Model Hurts
This chapter has spent most of its words advocating the hybrid default. Allow me now the counterweight — the situations where the hybrid model is the wrong answer. A recommendation without its boundaries is a sales pitch, and I have no interest in selling you anything.
Frequent schema churn on promoted columns
If the fields you promote are themselves unstable — new ones appearing monthly, old ones being removed — you are running ALTER TABLE constantly, and the hybrid model loses its advantage over pure JSONB. Pure JSONB is schema-free at the storage layer; every new field is a write, not a migration.
A small but useful refinement: since PostgreSQL 11, ALTER TABLE ADD COLUMN with a constant or non-volatile default is a metadata-only change — effectively instant regardless of table size. With a volatile default or a GENERATED ALWAYS AS (...) STORED expression, the operation rewrites the table. On a billion-row events table this is the difference between a blink and a maintenance window. Know which one you are doing before you run it in production. I cannot stress this firmly enough.
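The two forms look nearly identical on the page, which is exactly why the distinction deserves a sketch. The column names here are illustrative:

```sql
-- Metadata-only on PostgreSQL 11+: effectively instant at any table size,
-- because the constant default is recorded in the catalog, not written per row.
ALTER TABLE events ADD COLUMN source TEXT NOT NULL DEFAULT 'unknown';

-- Rewrites the entire table: the expression must be computed and stored
-- for every existing row before the statement returns.
ALTER TABLE events ADD COLUMN kind TEXT
    GENERATED ALWAYS AS (data->>'kind') STORED;
```

On a small table both finish before you notice. On the billion-row table, only the first one does.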
The dual-write trap
When application code writes to both the JSONB document and a plain promoted column — because the column is not generated — the two values can drift. Every code path that writes must remember to update both, and the first forgotten path ships silent inconsistency. Silent inconsistency is the worst kind, because nothing alarms until something breaks.
Generated columns avoid this entirely: there is one source of truth, the JSONB document, and the column is derived. If you cannot use a generated column because the field is written independently, accept that a plain promoted column requires explicit discipline, code review, and tests. It is not free.
Write amplification from generated columns and many indexes
Every write updates the JSONB payload, recomputes every generated column derived from it, and updates every index on those generated columns. Each of those is cheap individually. Ten of them are not.
On high-throughput insert workloads this compounds with the honest 2.5–4x write gap established in Chapter 15. If the generated-column count grows without restraint, the gap widens. Treat each generated column as a deliberate performance trade, not a free index. Every index earns its keep or it does not — there is no middle ground in a write-heavy workload.
The decision, in one page
If you remember nothing else from this chapter, remember the decision tree. I would like you to be able to answer these four questions before your next migration meeting.
- Do you read and write the whole document and rarely filter on individual fields? Pure JSONB.
- Is the schema stable, every query predictable, and the value of type safety and foreign keys high? Pure columns.
- Some fields are hot, most are cold, the shape is mostly stable with a long tail of variation? Hybrid: typed columns for the hot fields, JSONB for the rest.
- You want hybrid performance without restructuring the document model — especially mid-migration? Generated columns on top of a JSONB payload.
That is the entire decision surface. When to use JSONB vs columns in PostgreSQL is not a matter of taste; it is a matter of how your application actually reads and writes, and how much that pattern will change.
You now have the decisions. Chapter 17 turns them into a migration script — psql open, COPY warmed, and a rather long evening ahead of us. If you'll follow me, I believe you'll find the conversation eminently practical. Do bring your document with you.