Chapter 5: Fuzzy Matching (pg_trgm)
Allow me a brief observation about human beings and keyboards.
Users make typos. "Postgre" when they mean "PostgreSQL." "Alic" when they mean "Alice." "Jonh" when they mean "John." The fingers move faster than the spelling center of the brain, and this is not a deficiency in the user. It is a fact about the human hand. A search system that demands perfection from the hand is a search system that has confused its priorities. The user came to find something, not to prove they can spell it.
A search that requires exact spelling is not a search. It is a spelling test. And I do not believe your users signed up for a spelling test.
Elasticsearch handles typo-tolerant search with its fuzzy query, which uses Levenshtein distance — a perfectly respectable measure of how many character edits separate two strings. PostgreSQL handles it with pg_trgm, which takes a different approach — trigram similarity — and the approach has practical advantages that I think you will appreciate.
The LIKE Problem
Before pg_trgm, PostgreSQL's only tool for approximate string matching was LIKE and ILIKE.
```sql
SELECT * FROM products WHERE name ILIKE '%postgre%';
```

This finds rows where "postgre" appears as a substring, case-insensitive. It does its job. But it has two problems, and I should be forthcoming about both.
Problem 1: Performance. Without a trigram index, ILIKE '%term%' is always a sequential scan. PostgreSQL must examine every row in the table. On a table with a million rows, this is slow. On ten million rows, it is the kind of slow that generates support tickets. Standard btree indexes cannot help with infix patterns — they only accelerate prefix matches like LIKE 'term%'.
Problem 2: No ranking. LIKE is binary — it matches or it doesn't. There is no similarity score, no way to say "this result is a closer match than that one." A user searching for "Postgre" gets every row containing that substring, with no indication of which result best matches their intent. "PostgreSQL" and "PostgreSQL Conference 2019 Workshop Notes" are treated as equally relevant. They are not equally relevant. The search system should know this.
pg_trgm solves both problems with one extension. It makes LIKE/ILIKE queries fast through GIN trigram indexes, and it adds similarity scoring for ranked fuzzy matches. I find this kind of efficiency — one tool, two benefits — deeply satisfying.
What Trigrams Are
A trigram is a sequence of three consecutive characters. pg_trgm splits every string into its constituent trigrams, including padding spaces at the boundaries:
```sql
SELECT show_trgm('PostgreSQL');
```

Result:

```
{"  p"," po",esq,gre,ost,pos,"ql ",res,sql,stg,tgr}
```

"PostgreSQL" produces 11 trigrams (pg_trgm lowercases the input and pads it with two spaces in front and one behind). "Postgre" produces 8. Many of those trigrams overlap — both strings contain "  p", " po", "pos", "ost", "stg", "tgr", and "gre". That overlap is what the similarity() function measures.
similarity() computes the Jaccard coefficient — the size of the intersection divided by the size of the union of two strings' trigram sets. It returns a value between 0.0 and 1.0:
```sql
SELECT similarity('PostgreSQL', 'Postgre');    -- 0.583333
SELECT similarity('PostgreSQL', 'MongoDB');    -- 0.0
SELECT similarity('PostgreSQL', 'PostgreSQL'); -- 1.0
```

"Postgre" shares enough trigrams with "PostgreSQL" to score 0.58 (seven shared trigrams out of twelve distinct) — a strong match. "MongoDB" shares none — 0.0. The function captures the intuitive sense of "how similar do these strings look?" and expresses it as a number. Intuition, quantified. I appreciate that in a function.
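The arithmetic is simple enough to reproduce outside the database. The sketch below is a toy Python approximation of pg_trgm's single-word behavior (lowercase, two leading pad spaces, one trailing) — an illustration of the Jaccard calculation, not the extension's actual C implementation, and it ignores pg_trgm's word-splitting for multi-word strings:

```python
def trigrams(text):
    """Approximate pg_trgm's trigram extraction for a single word:
    lowercase, pad with two leading spaces and one trailing space,
    then collect every 3-character window as a set of unique trigrams."""
    padded = "  " + text.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def similarity(a, b):
    """Jaccard coefficient of the two trigram sets: |A & B| / |A | B|."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

print(len(trigrams("PostgreSQL")))                   # 11
print(len(trigrams("Postgre")))                      # 8
print(round(similarity("PostgreSQL", "Postgre"), 6)) # 0.583333
print(similarity("PostgreSQL", "MongoDB"))           # 0.0
```

Seven trigrams shared, twelve distinct overall: 7 / 12 ≈ 0.583333, matching the SQL above.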
Trigrams vs. Levenshtein
Most developers assume fuzzy matching means Levenshtein distance — the count of character insertions, deletions, and substitutions needed to transform one string into another. PostgreSQL provides Levenshtein via the fuzzystrmatch extension (Chapter 6). It is a well-established algorithm, and I would not speak ill of it. But for general fuzzy search, trigram similarity has practical advantages worth understanding.
Length sensitivity. Levenshtein distance is an absolute count. A 1-character typo produces a distance of 1 regardless of string length. But a distance of 1 means very different things for a 5-character word (20% error rate) and a 50-character product name (2% error rate). Trigram similarity normalizes for length automatically — the ratio of shared trigrams to total trigrams accounts for string length without additional calculation. The algorithm handles what the developer would otherwise need to handle manually.
Transposition handling. "teh" vs. "the" — a transposition, where two adjacent characters swap. This is among the most common types of typos. Levenshtein counts it as two substitutions (distance 2), which overstates the error. A transposition disturbs only the few trigrams that span the swapped pair; the rest of the string's trigrams still match, so the similarity score stays comparatively high. (In a word as short as "teh," nearly every trigram touches the swap; the advantage is clearest in longer strings.) The algorithm's assessment is closer to the human assessment, which is what we want from a search system.
Practical guidance. Use trigrams (pg_trgm) for general fuzzy search — product names, article titles, user names, any variable-length text where you want ranked similarity results. Use Levenshtein (fuzzystrmatch, Chapter 6) when you need exact edit-distance calculations — "find all names within 2 edits of this input." The tools complement each other. Both live in PostgreSQL. Neither requires a separate service.
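To make the comparison concrete, here is a small experiment in plain Python — a toy trigram similarity alongside a textbook Levenshtein implementation (neither is pg_trgm's or fuzzystrmatch's actual code). A single transposition in a ten-character string still costs 2 edits under Levenshtein, but leaves most of the trigrams intact:

```python
def trigrams(text):
    # pg_trgm-style padding for a single word: two leading spaces, one trailing
    padded = "  " + text.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def trigram_similarity(a, b):
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

def levenshtein(a, b):
    # Classic dynamic-programming edit distance (insert/delete/substitute)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

# "postgersql" is "postgresql" with the adjacent "r"/"e" swapped.
print(levenshtein("postgresql", "postgersql"))                   # 2 — same cost as two real errors
print(round(trigram_similarity("postgresql", "postgersql"), 2))  # 0.47 — most trigrams survive
```

The edit distance cannot tell a transposition from two independent mistakes; the trigram score can.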
The similarity() Function
Here is fuzzy search in PostgreSQL. It is, I am pleased to report, straightforward:
```sql
SELECT name, similarity(name, 'Postgre') AS score
FROM products
WHERE similarity(name, 'Postgre') > 0.3
ORDER BY score DESC
LIMIT 10;
```

The threshold (0.3 in this example) controls the minimum similarity required for a result to appear:
- 0.2 — permissive. Catches multi-character typos but may return noise. Use when recall matters more than precision.
- 0.3 — the default. Catches most single-character typos. A sensible starting point for most applications.
- 0.4+ — strict. High-confidence matches only. Use for name lookups where precision matters.
I would recommend starting at 0.3 and adjusting based on your data. Every dataset has its own characteristics, and the right threshold is the one that serves your users well — not the one that looks elegant in documentation.
Gold Lapel exposes this directly: goldlapel.search_fuzzy(conn, "products", "name", "Postgre", threshold=0.3).
The % operator. pg_trgm provides a shorthand for threshold-based filtering:
```sql
SELECT name
FROM products
WHERE name % 'Postgre'
ORDER BY similarity(name, 'Postgre') DESC;
```

The % operator returns true when similarity() exceeds the current threshold, which defaults to 0.3 and can be adjusted with SET pg_trgm.similarity_threshold = 0.4;.
word_similarity() — for short queries against long strings. similarity() compares the full query against the full target. For short queries matched against longer strings, this can undercount the match — the query "chair" against "ergonomic office chair with lumbar support" has a low similarity() score because the query is so much shorter than the target.
word_similarity() addresses this by comparing the query against the best-matching substring of the target:
```sql
SELECT word_similarity('chair', 'ergonomic office chair with lumbar support'); -- higher
SELECT similarity('chair', 'ergonomic office chair with lumbar support');      -- lower
```

Use word_similarity() when matching short search queries against long product names, article titles, or descriptions. Use similarity() when comparing strings of roughly equal length. The right function depends on the shape of your data, and both are available.
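The effect is easy to demonstrate with the toy trigram functions from earlier in the chapter. The rough analog below scores the query against each word of the target and keeps the best match — pg_trgm's real word_similarity() scans continuous extents of the target's ordered trigram list, so this is only an illustration of why the word-level score runs higher, not the actual algorithm:

```python
def trigrams(text):
    # Toy pg_trgm-style trigrams: lowercase, "  " prefix and " " suffix padding
    padded = "  " + text.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def similarity(a, b):
    # Jaccard coefficient of the two trigram sets
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

def word_level_similarity(query, target):
    # Best score of the query against any single word of the target —
    # a loose stand-in for word_similarity(), for illustration only
    return max(similarity(query, word) for word in target.split())

target = "ergonomic office chair with lumbar support"
print(similarity("chair", target))             # low — the query is swamped by the long target
print(word_level_similarity("chair", target))  # 1.0 — "chair" matches one word exactly
```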
GIN Trigram Indexes
Without an index, similarity() requires a sequential scan — PostgreSQL computes trigram overlap for every row in the table. With a GIN trigram index, PostgreSQL pre-filters candidates by trigram overlap and only computes exact similarity for promising matches.
```sql
CREATE INDEX idx_trgm_name ON products USING GIN(name gin_trgm_ops);
```

This index accelerates similarity() queries, % operator queries, and — if you will permit me a moment of genuine enthusiasm — all LIKE and ILIKE queries on that column.
Without pg_trgm, WHERE name ILIKE '%search%' is always a sequential scan. PostgreSQL cannot use a standard btree index for infix substring patterns. With the GIN trigram index, it can. This means installing pg_trgm and creating one index makes every existing LIKE/ILIKE query on that column faster, without changing a single line of application code. The queries you wrote last year, the queries your colleague wrote before they left the team, the queries your ORM generates behind your back — all faster. Automatically.
I mention this because it is one of the most useful side effects of any PostgreSQL extension, and it is the sort of discovery that makes a person feel that PostgreSQL has been looking out for them all along.
Gold Lapel's proxy auto-creates GIN trigram indexes when it detects LIKE/ILIKE patterns in queries.
GiST alternative. pg_trgm also supports GiST indexes (USING GiST(name gist_trgm_ops)). GiST indexes are slower for lookups but faster to build and smaller on disk. For most search use cases, GIN is the right choice. Appendix D covers the full decision matrix for those who enjoy a thorough comparison.
Building "Did You Mean?"
The classic search UX pattern: the user types a query, gets no results or poor results, and sees a suggestion — "Did you mean: PostgreSQL?"
This is a feature that users expect in every modern search experience, and it is surprisingly rare in PostgreSQL tutorials. I intend to correct that omission. The implementation is straightforward.
Step 1: Maintain a table of known-good terms.
```sql
CREATE TABLE known_terms (
    term TEXT PRIMARY KEY
);

-- Populate from your searchable data
INSERT INTO known_terms (term)
SELECT DISTINCT name FROM products
UNION
SELECT DISTINCT name FROM categories
UNION
SELECT DISTINCT name FROM tags;

CREATE INDEX idx_known_terms_trgm ON known_terms USING GIN(term gin_trgm_ops);
```

Step 2: When a search returns zero or few results, suggest an alternative.
```sql
SELECT term, similarity(term, 'postgre') AS score
FROM known_terms
WHERE similarity(term, 'postgre') > 0.3
ORDER BY score DESC
LIMIT 1;
```

Result:

```
    term    |  score
------------+----------
 PostgreSQL | 0.583333
(1 row)
```

"Did you mean: PostgreSQL?" One query. One suggestion. The user's typo is handled with grace rather than a blank results page.
Step 3: Combine with the search. In application code: run the primary search first. If it returns fewer than N results, run the suggestion query against known_terms and display the suggestion to the user.
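The fallback logic is a few lines of application code. The sketch below is one hypothetical shape for it — run_search and suggest_term are stand-ins for whatever query layer you use (psycopg, an ORM, or Gold Lapel's wrapper), and SUGGESTION_SQL, shown with psycopg-style parameters, is the Step 2 query from above:

```python
# The Step 2 suggestion query; %(q)s is psycopg's named-parameter style.
SUGGESTION_SQL = """
    SELECT term, similarity(term, %(q)s) AS score
    FROM known_terms
    WHERE similarity(term, %(q)s) > 0.3
    ORDER BY score DESC
    LIMIT 1
"""

def search_with_suggestion(query, run_search, suggest_term, min_results=3):
    """Run the primary search; if it comes back sparse, attach a
    'Did you mean?' suggestion from known_terms.

    run_search(query)   -> list of result rows (the primary search)
    suggest_term(query) -> closest known term or None (e.g. SUGGESTION_SQL)
    """
    results = run_search(query)
    suggestion = None
    if len(results) < min_results:
        candidate = suggest_term(query)
        # Don't suggest the user's own input back to them
        if candidate and candidate.lower() != query.lower():
            suggestion = candidate
    return {"results": results, "did_you_mean": suggestion}

# Wired up with stub callables for illustration:
out = search_with_suggestion(
    "postgre",
    run_search=lambda q: [],              # primary search found nothing
    suggest_term=lambda q: "PostgreSQL",  # known_terms lookup
)
print(out)  # {'results': [], 'did_you_mean': 'PostgreSQL'}
```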
The known_terms table can itself be a materialized view — refreshed periodically from the most common values in your searchable columns. This keeps the suggestions current as your data changes, with the same refresh pattern from Chapter 3.
This pattern also combines naturally with the autocomplete functionality in Chapter 9. pg_trgm powers both: similarity for "Did you mean?" suggestions after a poor search, and ILIKE prefix matching for typeahead suggestions while the user is still typing. One extension, serving the user at two different moments in the search experience. I find the elegance of that worth noting.
FTS vs. Trigram Search: When to Use Which
Full-text search (Chapter 4) and trigram search serve different purposes. They are complementary tools, not competing ones.
| Aspect | Full-Text Search (tsvector) | Trigram Search (pg_trgm) |
|---|---|---|
| Matches | Stemmed words (lexemes) | Character sequences |
| "running" → "run" | Yes (stemming) | Partial (shared trigrams) |
| "Postgre" → "PostgreSQL" | No (different stems) | Yes (high trigram overlap) |
| Ranking | ts_rank (term frequency) | similarity() (0.0-1.0) |
| Index type | GIN tsvector | GIN trigram |
| Best for | Known words, language-aware search | Typos, partial matches, LIKE acceleration |
| Extension | None (built-in) | pg_trgm (contrib) |
Full-text search finds documents containing the right words. Trigram search finds results when the user cannot quite manage to type the right words. The distinction is important: full-text search is the primary mechanism, and trigram search is the forgiveness layer — the gracious fallback that catches typos and partial inputs without making the user feel that the mistake was theirs to apologize for.
Chapter 13 (Hybrid Search) shows how to combine both in a single query with Reciprocal Rank Fusion.
The Extension
```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;
```

pg_trgm ships with PostgreSQL as a contrib extension. It is available on every managed provider. No separate installation or download is needed. Gold Lapel's wrapper creates it lazily on first search_fuzzy() or suggest() call.
pg_trgm is also used by Chapter 9 (Autocomplete). Same extension, different application. Install it once, and it serves fuzzy matching, autocomplete, and LIKE/ILIKE acceleration — three capabilities from one CREATE EXTENSION. Efficient, which is a quality I value in infrastructure and in extensions.
Honest Boundary
Trigram similarity is character-level, not semantic. "PostgreSQL" and "Postgres" score high — many shared trigrams. But "car" and "automobile" score zero. They share no character sequences despite meaning the same thing. When the user types the wrong word entirely — not a misspelling of the right word, but a different word for the same concept — trigrams cannot help. That is the territory of Chapters 7 and 8, where pgvector bridges the gap between words and meaning.
For very short strings (1-3 characters), trigram similarity is less useful. A 2-character input yields only three trigrams, every one of them dominated by padding spaces, which provides very little signal for comparison. Trigram search works best on strings of moderate length — names, titles, product names, short descriptions.
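The padding arithmetic makes the short-string problem easy to see. Using the same toy trigram functions as earlier in the chapter (an approximation of pg_trgm, not the real thing), a single wrong character erases most of a short string's overlap, while the same error in a longer string barely dents the score:

```python
def trigrams(text):
    # Toy pg_trgm-style trigrams: lowercase, "  " prefix and " " suffix padding
    padded = "  " + text.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def similarity(a, b):
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

print(sorted(trigrams("ab")))                        # ['  a', ' ab', 'ab '] — 3 trigrams, all padded
print(similarity("ab", "ax"))                        # 0.2  — one wrong char, the score collapses
print(round(similarity("abcdefgh", "abcdefgx"), 2))  # 0.64 — same error, longer string
```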
For long documents, full-text search (Chapter 4) is the better primary tool, with trigram search as the fallback for queries that contain typos. The two approaches work together. I would not recommend one without at least considering the other.
Trigrams handle the imprecision of the fingers — the typo, the truncation, the misspelling. They bridge the gap between what the user meant to type and what actually arrived in the search box. A search system that handles this gracefully is a search system that respects its users, and I consider that a minimum standard, not a luxury.
But there is another kind of imprecision, and it has nothing to do with the fingers.
When a user searches for "Smith" and means "Smyth," when they type "Stefan" expecting to find "Stephen," the fingers performed their task correctly. The spelling is exactly what the user intended. The problem is that the language itself offers more than one way to spell the same sound — and the user chose a different spelling than the one in your database.
This is a phonetic problem. And it has been solved, if you can believe it, since 1918. Chapter 6 will make the introduction.