Chapter 6: Indexing: The Art of Not Calling createIndex()
In MongoDB, every query pattern requires its own createIndex() call. Miss one and the query scans the entire collection. The developer must anticipate every question the application will ask and prepare an index for each. It is, if you will permit the observation, a system that requires the developer to do the database's job.
PostgreSQL takes a different view of the arrangement. A single GIN index on a JSONB column covers containment queries on every field, at every depth, without the developer declaring which fields matter in advance. Expression indexes handle range comparisons on specific fields. Partial indexes target the exact subset of documents your application cares about. And Gold Lapel creates the common ones automatically, based on the queries your application actually runs — not the queries you predicted it would run six months ago.
This chapter covers four index types, how to diagnose when an index is not doing its job, and how auto-indexing works. Chapter 5 taught you to ask questions of your documents. This chapter ensures the answers arrive promptly. A good question deserves a fast answer, and a good database should not require you to ask twice.
GIN Indexes
GIN — Generalized Inverted Index — is the primary index type for JSONB and the one you will rely on most. It works differently from the B-tree indexes most developers are familiar with, and understanding the difference matters.
A B-tree index orders values along a single axis — like a phone book sorted by last name. It is excellent for one field, one comparison. A GIN index takes a different approach: it decomposes the JSONB document into individual key-value entries and builds an inverted index over all of them. Every key-value pair in every document in the column is indexed, simultaneously, in a single index. This is what allows one GIN index to serve containment queries on any field at any depth — the index already knows every field in every document.
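To make that concrete — the entries below are a conceptual sketch of the decomposition, not the literal on-disk format:

-- The document {"status": "active", "address": {"state": "CA"}}
-- contributes, conceptually, one inverted-index entry per path:
--   status = "active"
--   address.state = "CA"
-- A containment query probes only the entries it needs, which is
-- why any field at any depth is already covered.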
GIN indexes come in two operator classes, and the choice between them is more consequential than it first appears.
jsonb_path_ops supports containment queries (@>) only. It indexes value paths rather than key entries, producing an index that is typically 50–70% smaller than the alternative. For document workloads where queries match field values — the common case behind doc_find and most application queries — this is the right default:
CREATE INDEX ON users USING GIN (data jsonb_path_ops);
With this single statement, every containment query against the users table is indexed:
-- All of these use the GIN index
WHERE data @> '{"status": "active"}'
WHERE data @> '{"address": {"state": "CA"}}'
WHERE data @> '{"status": "active", "role": "admin"}'
WHERE data @> '{"tags": ["premium"]}'
Any field, any depth, any combination. One index. In MongoDB, each of these query patterns would need its own index — or a compound index carefully designed to cover the combination. In PostgreSQL, the GIN index was waiting before the queries were written.
jsonb_ops is PostgreSQL's default and supports a broader set of operators: @> containment, ? key existence, ?| any-key existence, and ?& all-keys existence. The index is larger because it catalogs both keys and values. Use this when your queries ask whether a key is present, not what its value is:
CREATE INDEX ON users USING GIN (data jsonb_ops);
The decision: if your queries look like data @> '{"field": "value"}', use jsonb_path_ops. If they look like data ? 'field_name', use jsonb_ops. If you need both patterns on the same table, create both — PostgreSQL selects the right index for each query without being asked.
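For illustration, these are the key-existence operators that jsonb_ops supports and jsonb_path_ops does not — the field names are hypothetical:

-- Does the key exist at the top level?
WHERE data ? 'phone'
-- Does any of these keys exist?
WHERE data ?| array['phone', 'mobile']
-- Do all of these keys exist?
WHERE data ?& array['email', 'name']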
On large collections, the size difference between operator classes is not academic. A smaller index scans faster, fits more comfortably in memory, and vacuums more quickly. An index that is half the size is not half as important. It is twice as efficient. Choose the operator class that matches your queries, not the one that covers the most operators.
One practical note: GIN indexes are slower to build and update than B-tree indexes, because every insert decomposes the document and updates multiple index entries. PostgreSQL mitigates this with a pending list that batches updates, but on write-heavy tables, GIN maintenance is a cost. For the vast majority of document workloads — where reads significantly outnumber writes — this trade-off is overwhelmingly favorable. The read performance GIN provides is worth the write overhead it requires. If your workload is write-dominant, Chapter 15's honest discussion of write throughput applies here as well.
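If write overhead on a GIN index becomes a concern, the pending list can be tuned per index. A sketch — the 8 MB limit here is illustrative, not a recommendation:

-- fastupdate batches index updates in a pending list;
-- gin_pending_list_limit (in kB) controls when it is flushed
CREATE INDEX ON users USING GIN (data jsonb_path_ops)
WITH (fastupdate = on, gin_pending_list_limit = 8192);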
Expression Indexes
GIN handles containment and jsonpath admirably. It does not handle range comparisons, sorting, or unique constraints. For those, expression indexes bring B-tree precision to a specific extracted field.
-- Numeric range queries on age
CREATE INDEX ON users (((data->>'age')::numeric));
-- Timestamp range queries
CREATE INDEX ON events (((data->>'created_at')::timestamptz));
-- Equality and pattern matching on email
CREATE INDEX ON users ((data->>'email'));
-- Unique constraint on an extracted field
CREATE UNIQUE INDEX ON users ((data->>'email'));
The double parentheses are required by PostgreSQL's syntax for expression indexes — the extra outer pair tells the parser it is indexing an expression, not a bare column name. This will confuse you exactly once, and then never again. Consider this your once.
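And the queries these indexes serve — note that each query expression matches its index expression exactly (the example values are arbitrary):

-- Uses the numeric expression index on age
SELECT * FROM users WHERE (data->>'age')::numeric > 25;
-- Uses the email index for equality lookups
SELECT * FROM users WHERE data->>'email' = 'ada@example.com';
-- Uses the timestamp index for both the range and the sort
SELECT * FROM events
WHERE (data->>'created_at')::timestamptz > '2025-01-01'
ORDER BY (data->>'created_at')::timestamptz;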
A typical document collection ends up with one GIN index and a small number of expression indexes — one for each field that sees range queries, sorting, or uniqueness constraints. The pattern is predictable: GIN covers the broad foundation, expression indexes handle the specific demands.
-- A well-indexed document collection
CREATE INDEX ON products USING GIN (data jsonb_path_ops); -- all containment queries
CREATE INDEX ON products (((data->>'price')::numeric)); -- price range and sort
CREATE INDEX ON products (((data->>'created_at')::timestamptz)); -- date range
CREATE UNIQUE INDEX ON products ((data->>'sku')); -- uniqueness
Four statements. The collection is now indexed for containment on every field, range queries on price and date, and uniqueness on SKU. Compare this to the MongoDB equivalent — where each query pattern requires its own index, compound indexes must be designed for specific query shapes, and the order of fields in a compound index affects which queries it serves. PostgreSQL's approach requires fewer indexes, covers more query shapes, and demands less anticipation of future access patterns. A system that asks you to predict the future is a system that will eventually be surprised by it.
Partial Indexes
Partial indexes add a WHERE clause to the index itself. Only rows matching the condition enter the index. The rest are invisible to it — and that is precisely the point.
-- Only index active users
CREATE INDEX ON users USING GIN (data jsonb_path_ops)
WHERE data->>'status' = 'active';
-- B-tree on email, only for active users
CREATE INDEX ON users ((data->>'email'))
WHERE data->>'status' = 'active';
-- Only index recent events
CREATE INDEX ON events USING GIN (data jsonb_path_ops)
WHERE (data->>'created_at') > '2025-01-01';
Document collections accumulate. Active records sit alongside archived ones, current data beside historical. A user table with ten million records may have eight million inactive accounts that no query ever touches. A full GIN index covers all ten million. A partial index on active users covers only the two million that matter — one fifth the size, one fifth the memory, and every query that filters by active status runs against the smaller index.
The savings compound in multi-tenant applications. If your document collection serves multiple tenants and most queries are tenant-scoped, a partial index per tenant — or per status per tenant — can reduce each tenant's effective index to a fraction of the whole. The index only contains what the query will ask for. Everything else is left out, deliberately, because an index that covers data you never query is an index that is larger than it needs to be. And a larger index is a slower index.
One behavior to understand: PostgreSQL only uses a partial index when the query's WHERE clause matches or implies the index condition. A query that does not filter by status will not use an index that only covers active records. This is correct — the index genuinely does not contain the rows the query needs. If you create a partial index and wonder why it is not being used, check whether your query includes the matching filter. The index is not broken. It is precise.
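A quick illustration with the partial email index above — the values are arbitrary:

-- Can use the partial index: the query's filter matches the
-- index condition
SELECT * FROM users
WHERE data->>'email' = 'ada@example.com'
  AND data->>'status' = 'active';
-- Cannot use it: without the status filter, rows the index
-- does not contain might match
SELECT * FROM users
WHERE data->>'email' = 'ada@example.com';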
Diagnosing Index Usage
Every developer eventually asks: "I created an index. Why is the query still slow?" The answer begins with one command:
EXPLAIN ANALYZE SELECT * FROM users WHERE data @> '{"status": "active"}';
What you want to see:
Bitmap Heap Scan on users  (cost=12.00..316.00 rows=100 width=256) (actual time=0.5..1.2 rows=95 loops=1)
  Recheck Cond: (data @> '{"status": "active"}'::jsonb)
  ->  Bitmap Index Scan on users_data_idx  (cost=0.00..12.00 rows=100 width=0) (actual time=0.3..0.3 rows=95 loops=1)
        Index Cond: (data @> '{"status": "active"}'::jsonb)
Planning Time: 0.1 ms
Execution Time: 1.3 ms
The Bitmap Index Scan referencing your GIN index means the index is being used. The execution time is in milliseconds. This is what a well-indexed JSONB query looks like.
What you do not want to see:
Seq Scan on users  (cost=0.00..12500.00 rows=100 width=256) (actual time=45.0..120.5 rows=95 loops=1)
  Filter: (data @> '{"status": "active"}'::jsonb)
  Rows Removed by Filter: 99905
Planning Time: 0.1 ms
Execution Time: 120.6 ms
Sequential scan. Every row examined. 120 milliseconds instead of 1.3. The index is not being used. The reasons are almost always one of four things.
Wrong operator class. You created jsonb_path_ops but the query uses ? for key existence. jsonb_path_ops does not support that operator. The fix: create a jsonb_ops index, or rewrite the query as a containment check — data @> '{"email": ...}' instead of data ? 'email'.
Mismatched expression. This is the most common and most frustrating cause. The index is on (data->>'age')::numeric but the query compares data->>'age' > '25' as text. These are different expressions — one is a numeric cast, the other is a text comparison — and the index does not apply. The fix: ensure the query expression matches the index expression exactly, character for character. (data->>'age')::numeric > 25 will use the index. data->>'age' > '25' will not. PostgreSQL is precise about this, which is a quality you want in the system responsible for your data, even when it costs you ten minutes of debugging.
Small table. The planner decides a sequential scan is faster. On tables under a few thousand rows, this is usually correct — the overhead of the index lookup exceeds the cost of scanning the table directly. The index earns its keep when the table grows. This is not a problem to solve. It is the planner being smarter than you might expect.
Partial index condition not matched. The query does not include the filter the partial index requires. Add the filter, or create a non-partial index.
The diagnostic pattern is always the same: compare the exact expression in the index definition to the exact expression in the query. If they do not match, the index is invisible. Precision, in indexes as in most professional matters, is not pedantry. It is the mechanism by which things work.
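EXPLAIN ANALYZE diagnoses one query at a time. To audit a whole database, PostgreSQL's statistics views track how often each index is scanned — a sketch using the standard pg_stat_user_indexes view:

-- Indexes that have never been scanned since statistics were
-- last reset — candidates for investigation or removal
SELECT relname AS table_name, indexrelname AS index_name, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY relname, indexrelname;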
Auto-Indexing
I have spent this chapter teaching you to create indexes by hand. Gold Lapel does it for you. These two facts are not in tension — they are complementary, and I will explain why.
Gold Lapel's proxy sits between your application and PostgreSQL. Every doc_find, doc_aggregate, and filter query passes through it. The proxy observes which JSONB operators appear in each query and creates GIN indexes with the appropriate operator class.
Containment queries produce jsonb_path_ops indexes. Key-existence queries produce jsonb_ops indexes. The proxy tracks what it creates and reports on the dashboard's index panel: the index name, the operator class chosen, the creation timestamp, the hit count, and the last time the index was used. A hit count of zero after a week is a signal that the index is not needed — or that the query pattern changed and the index is serving a workload that no longer exists.
Auto-indexing can be disabled per-collection or globally via goldlapel.toml or the --disable-jsonb-indexes CLI flag. If the proxy creates an index you do not want, disable auto-indexing for that collection and manage it manually. The proxy does not insist. It observes, it suggests (by creating), and it reports. The final authority is always the developer.
Auto-indexing handles the common case — GIN indexes matched to the query patterns the proxy observes. Expression indexes for specific range fields, partial indexes for targeted subsets, and unique constraints are still created by the developer when the situation calls for them. The proxy handles the floor. You raise the ceiling. And because you have read this chapter, you understand what the proxy creates, why it chose that operator class, and how to verify its work with EXPLAIN ANALYZE. That understanding is not made redundant by automation. It is what makes automation trustworthy. Automation you do not understand is automation you cannot diagnose. Automation you do understand is a colleague.
Four index types, each suited to a different need. GIN for the broad foundation — containment and jsonpath across every field, in a single index that anticipates queries before they are written. Expression indexes for the targeted work — range, sort, uniqueness on individual fields, with B-tree precision. Partial indexes for the disciplined choice — index only what you query, and let everything else stay out of the way. Auto-indexing for the common case — the proxy watches, the proxy creates, and the developer who understands what was created sleeps better for it.
Your documents are stored, queried, and indexed. The next question is one that MongoDB's reputation both invites and obscures: how much structure should those documents have? Chapter 7 walks the spectrum from fully schemaless to fully validated, and makes the case that the answer is rarely one extreme or the other.
A schema is not a constraint on your data. It is a promise your data makes to the code that reads it. Chapter 7 is about deciding how much your data should promise.