You Don't Need Elasticsearch

Chapter 4: Full-Text Search (tsvector)

The Waiter of Gold Lapel · Published Apr 12, 2026 · 17 min

If you will permit me a direct observation: full-text search is not a PostgreSQL extension. It is PostgreSQL.

It has shipped with core PostgreSQL since version 8.3, released in 2008. No CREATE EXTENSION needed. No third-party dependency. No installation step beyond having PostgreSQL itself. If you have PostgreSQL — and you do, because you are reading this book — you have full-text search. Tokenization, stemming, stop word removal, relevance ranking, highlighted snippets, and GIN indexes for performance. All built in. All production-grade. All quietly available for nearly two decades, doing precisely the job that most teams add a separate service to perform.

I find it worth pausing on that fact. Nearly two decades. The capability has been in your database this entire time. It simply never introduced itself.

The Elasticsearch equivalent is the match query — the core of what most people mean when they say "search," and the reason most teams reach for Elasticsearch in the first place. This chapter will demonstrate, with complete and runnable SQL, that PostgreSQL handles that workload natively.

I should mention: this chapter may be the most important in the book. Not because the other pillars are less capable — they are each formidable in their own right — but because full-text search is the foundation upon which everything else rests. If this chapter does not persuade, the others need not try.

I intend for it to persuade.

Why Not LIKE?

A reasonable question before we begin. PostgreSQL has always supported LIKE and ILIKE for string matching. Why isn't that sufficient?

LIKE finds substrings. WHERE title LIKE '%database%' finds rows where "database" appears somewhere in the title. It does its job and it does it honestly. But it does not stem — searching for '%databases%' misses a row containing only "database," even though any reader would call them the same word. It does not remove stop words — searching for "the database" matches only that literal string. It does not rank — there is no concept of relevance, no way to put the best result first. And without a trigram index (Chapter 5), LIKE '%term%' always forces a sequential scan, which means PostgreSQL must examine every row in the table. On a table with a million rows, that patience is expensive.

Full-text search is a fundamentally different operation. It tokenizes text into meaningful units, applies language-aware stemming, removes noise words, and ranks results by relevance. LIKE finds strings. Full-text search finds content. One is looking for a needle in a haystack. The other understands what needles are and where they tend to be found.
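The distinction is easy to demonstrate with two standalone expressions, runnable in any psql session with no tables required:

LIKE vs. full-text matching
-- LIKE only sees substrings: the plural misses the singular
SELECT 'searching a database' LIKE '%databases%';       -- false

-- Full-text search stems both sides: "database" and "databases"
-- reduce to the same lexeme, 'databas'
SELECT to_tsvector('english', 'searching a database')
       @@ plainto_tsquery('english', 'databases');      -- true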

The progression from LIKE to full-text search to fuzzy matching (Chapter 5) to semantic search (Chapters 7-8) is a progression from surface-level string operations to increasingly sophisticated understanding of what the user actually means. This chapter covers the first major step in that progression — and it is a substantial step.

How tsvector Works

A tsvector is PostgreSQL's representation of a searchable document — a sorted list of lexemes with position information. Allow me to demonstrate.

tsvector example
SELECT to_tsvector('english', 'The quick brown foxes jumped over the lazy dogs');

Result:

'brown':3 'dog':9 'fox':4 'jump':5 'lazi':8 'quick':2

What happened here deserves a moment's attention, because this is where the intelligence lives.

Tokenization. The text was split into individual words. Straightforward enough.

Stop word removal. "The," "over," and "the" were dropped entirely. These are stop words — common words that carry no search value in English. You would not want a search for "the database" to match every document containing the word "the." Every language configuration maintains its own stop word list, and the English list is sensibly comprehensive.

Stemming. "foxes" became "fox." "jumped" became "jump." "lazy" became "lazi." "dogs" became "dog." The stemmer reduces each word to its root form so that searching for "jumping" will match documents containing "jumped," "jumps," or "jump." This is language-aware — English stemming rules differ from French, which differ from German. The stemmer knows which language it is serving.

Position tracking. Each lexeme records its position in the original text. "brown" was the third word, "fox" the fourth. Positions enable phrase matching and proximity scoring — features we will use shortly.
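Positions are what make adjacency checks possible. A small illustration using the same sentence and phraseto_tsquery(), which is covered in detail below:

Position-based phrase matching
-- 'quick' (position 2) and 'brown' (position 3) are adjacent: match
SELECT to_tsvector('english', 'The quick brown foxes jumped')
       @@ phraseto_tsquery('english', 'quick brown');   -- true

-- 'quick' (position 2) and 'fox' (position 4) are not adjacent: no match
SELECT to_tsvector('english', 'The quick brown foxes jumped')
       @@ phraseto_tsquery('english', 'quick foxes');   -- false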

The result — 'brown':3 'dog':9 'fox':4 'jump':5 'lazi':8 'quick':2 — is the search index for that text, stored as a column value. This is the same concept as an Elasticsearch inverted index. The data structure is identical. It simply lives inside your database rather than in a separate service, which is rather the point of this book.

How tsquery Works

A tsquery is PostgreSQL's representation of a search query — a set of lexemes combined with boolean operators. PostgreSQL provides four functions to parse search input, and choosing the right one matters more than most tutorials suggest. I will give each the attention it deserves.

plainto_tsquery()

plainto_tsquery
SELECT plainto_tsquery('english', 'quick brown fox');
-- Result: 'quick' & 'brown' & 'fox'

Stems all words and ANDs them together. Never throws a parse error regardless of what the user types. You could pass it the complete works of Shakespeare and it would stem every word without complaint.

This is what Gold Lapel's search() wrapper uses. It is the safe default for programmatic use — untrusted user input will never cause an exception. If you are building a search endpoint and are not certain which function to use, use this one. You will not regret it.

When to use: Application code, backend search endpoints, any context where the input comes from a user and must not cause an error. Which is to say: most contexts.
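Its robustness is worth seeing directly. Operators and stray punctuation are treated as word separators, never as syntax:

plainto_tsquery with hostile input
-- Boolean operators and unbalanced parentheses are simply ignored
SELECT plainto_tsquery('english', 'fox & !( | quick');
-- Result: 'fox' & 'quick'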

websearch_to_tsquery()

websearch_to_tsquery
SELECT websearch_to_tsquery('english', '"brown fox" -lazy OR quick');
-- Result: 'brown' <-> 'fox' & !'lazi' | 'quick'

Google-like syntax. Quoted phrases become adjacency operators (<->). A leading - becomes NOT. The OR keyword is supported. And — this is the part that deserves emphasis — it never throws parse errors. Malformed syntax is treated as plain text rather than producing an exception.

I confess this function does not receive the attention it merits. It supports the query patterns users already know from years of using Google — quoted phrases for exact matching, minus signs for exclusions, OR for alternatives — without any risk of syntax errors crashing your search endpoint. Available since PostgreSQL 11, and underused since PostgreSQL 11.

A richer example, to illustrate why this function is worth your consideration:

Advanced websearch example
-- Find articles about PostgreSQL performance, exclude replication topics,
-- and optionally match query tuning or optimization
SELECT websearch_to_tsquery('english',
    'postgresql performance -replication OR "query tuning" OR optimization');

This produces a tsquery that a user could type into a search box naturally. No special syntax documentation needed. No error handling required. The user types what they mean, and PostgreSQL understands the intent. That is, after all, what good service looks like.

When to use: User-facing search boxes. Any context where users might type quoted phrases or - exclusions. If your search box faces the public, this is the function I would recommend.

to_tsquery()

to_tsquery
SELECT to_tsquery('english', 'quick & !lazy & fox:*');
-- Result: 'quick' & !'lazi' & 'fox':*

Full boolean syntax: & (AND), | (OR), ! (NOT), <-> (followed by), :* (prefix match). The most powerful parser — and the only one that throws parse errors on malformed input. With power comes the responsibility of validating your input, and I trust you to handle that responsibility appropriately.

Note the prefix operator: fox:* matches any lexeme starting with "fox" — "foxes," "foxhound," "foxglove." This is useful for autocomplete-style matching within full-text search. Chapter 9 covers dedicated autocomplete with pg_trgm, which is more flexible for that purpose, but the prefix operator is a handy tool for simpler cases.

When to use: Advanced search interfaces where input is validated. Internal tools, admin panels, power-user search forms with syntax documentation. Not for untrusted input — a malformed query will produce an error, and your users deserve better than an error.
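For contrast, the failure mode looks like this, and it is the reason this parser belongs behind validation:

to_tsquery parse error
-- A trailing operator is malformed boolean syntax
SELECT to_tsquery('english', 'quick &');
-- ERROR:  syntax error in tsquery: "quick &"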

phraseto_tsquery()

phraseto_tsquery
SELECT phraseto_tsquery('english', 'database performance tuning');
-- Result: 'databas' <-> 'perform' <-> 'tune'

All words must appear adjacent and in order. Exact phrase matching. Never throws parse errors.

When to use: When the user expects phrase-level precision. "Database performance tuning" as a three-word phrase in that exact order, not three separate words scattered across a document. The difference between finding an article about database performance tuning and finding an article that happens to mention databases, performance, and tuning in three unrelated paragraphs.

Which Function Should You Use?

Function               | Parse Errors? | User Syntax      | Best For
-----------------------+---------------+------------------+--------------------------
plainto_tsquery()      | Never         | None             | Application code, APIs
websearch_to_tsquery() | Never         | Quotes, -, OR    | User-facing search boxes
to_tsquery()           | Yes           | &, |, !, <->, :* | Power users, admin tools
phraseto_tsquery()     | Never         | None             | Exact phrase matching

For most applications, plainto_tsquery() is the safe default. If your search box faces users directly, websearch_to_tsquery() is the upgrade — same safety, considerably more expressiveness. I keep this table within reach, and I suspect you may find it useful to do the same.

The @@ Match Operator

The @@ operator is the match. It returns true if a tsvector matches a tsquery:

Full-text search query
SELECT id, title, ts_rank(search_vector, q) AS score
FROM articles, plainto_tsquery('english', 'database performance') AS q
WHERE search_vector @@ q
ORDER BY score DESC
LIMIT 20;

This is the PostgreSQL equivalent of Elasticsearch's match query. Tokenize the query, match against the index, return ranked results. The syntax is different. The outcome is the same.

Without a GIN index, the @@ operator requires a sequential scan — PostgreSQL checks every row. With a GIN index, it becomes an index scan. The difference, which the next section will address, is not subtle.

GIN Indexes

GIN stands for Generalized Inverted Index. It is the same data structure as Elasticsearch's inverted index — a mapping from each lexeme to the list of rows that contain it. The data structure most people associate with Elasticsearch has, in other words, been available inside PostgreSQL all along.

GIN index
CREATE INDEX idx_search ON articles USING GIN(search_vector);

Without the GIN index: sequential scan. PostgreSQL examines every row, reads the stored tsvector, and checks the match. On a table with a million rows, this takes seconds. On ten million rows, it takes the kind of time that makes users wonder whether the application has frozen.

With the GIN index: index scan. PostgreSQL looks up the query lexemes in the index, retrieves the list of matching row IDs, and fetches only those rows. On a table with a million rows, this takes milliseconds. On ten million rows, it still takes milliseconds. The GIN index is the difference between a search that tests the user's patience and a search that respects it.
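You can verify which path a query takes with EXPLAIN. Assuming the idx_search index above exists and the table holds enough rows for the planner to prefer it, the plan typically shows a bitmap scan:

Verifying index usage
EXPLAIN
SELECT id FROM articles
WHERE search_vector @@ plainto_tsquery('english', 'database performance');
-- Look for "Bitmap Index Scan on idx_search" in the output. A "Seq Scan on
-- articles" here means the index is missing or the planner declined to use it.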

Gold Lapel's proxy auto-creates this index when it detects to_tsvector(...) @@ ... patterns in queries. The developer never writes CREATE INDEX. The proxy observes the query pattern and creates the right index automatically. One less thing to manage. One less thing to forget.

Practical note: Create the GIN index on a stored tsvector column — either a generated column on the base table or a column on the materialized view from Chapter 3. Indexing the stored column avoids recomputing tsvector at query time. This is the production pattern, and I recommend it without reservation.

Ranking: ts_rank() and ts_rank_cd()

Finding matches is half the problem. Ranking them is the other half — and arguably the half the user notices. A search that returns the right results in the wrong order is, from the user's perspective, a search that returned the wrong results.

PostgreSQL provides two relevance ranking functions:

ts_rank() scores by term frequency — how often the query terms appear in the document. Length normalization is optional, controlled by a parameter described below. This is the right starting point for most use cases, and a sensible default it is.

ts_rank_cd() scores by cover density — how close together are the matching terms. It rewards proximity. A document where "database" and "performance" appear side by side should rank higher than one where "database" is in the opening paragraph and "performance" appears three thousand words later. Proximity implies relevance. ts_rank_cd() respects that implication.

Both return a float; higher values mean more relevant. Sort with ORDER BY score DESC.

Ranking functions
SELECT title,
    ts_rank(search_vector, q) AS rank,
    ts_rank_cd(search_vector, q) AS rank_cd
FROM articles, plainto_tsquery('english', 'database performance') AS q
WHERE search_vector @@ q
ORDER BY rank DESC
LIMIT 10;

Normalization. Both functions accept an optional integer parameter that controls normalization:

  • 0 — default, no normalization
  • 1 — divide by 1 + log(document length)
  • 2 — divide by document length
  • 8 — divide by number of unique words

These can be combined with |: ts_rank(search_vector, q, 2|8) divides by both document length and unique word count. This prevents long documents from always outranking short ones simply by having more words — a form of bias that, left uncorrected, means your longest articles always appear first regardless of relevance. That is not the behavior of a well-managed search system.
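To see what normalization changes, run several settings side by side. On a real table the raw score tends to favor long documents, while the normalized columns level the field:

Comparing normalization settings
SELECT title,
    ts_rank(search_vector, q)      AS raw,          -- no normalization
    ts_rank(search_vector, q, 2)   AS by_length,    -- divided by document length
    ts_rank(search_vector, q, 2|8) AS by_both       -- length and unique word count
FROM articles, plainto_tsquery('english', 'database performance') AS q
WHERE search_vector @@ q
ORDER BY by_length DESC;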

Practical guidance: Start with ts_rank(search_vector, q, 2) — term frequency normalized by document length. Switch to ts_rank_cd() if proximity matters for your content. Most applications will find this default serves them well, and I see no reason to complicate what works.

A note on honesty. PostgreSQL's ts_rank is simpler than Elasticsearch's BM25, which combines term frequency, inverse document frequency across the entire corpus, and field length normalization in a single formula. For most application search workloads — product search, article search, user lookup — ts_rank produces results that serve the user well. For applications where search result quality is the primary differentiator, Chapter 13 shows how hybrid search with Reciprocal Rank Fusion combines lexical and semantic scores for relevance that exceeds what either approach achieves alone. I mention this because a good recommendation includes its boundaries, and I would rather you trust my recommendations than be impressed by them.

Highlighting: ts_headline()

ts_headline() generates a text snippet with matched terms highlighted — the excerpt that appears below each search result. It is the detail that makes search results feel polished rather than utilitarian.

ts_headline example
SELECT title,
    ts_headline('english', body, q,
        'StartSel=<mark>, StopSel=</mark>, MaxWords=35, MinWords=15') AS snippet
FROM articles, plainto_tsquery('english', 'database performance') AS q
WHERE search_vector @@ q
ORDER BY ts_rank(search_vector, q) DESC
LIMIT 10;

Result for a matching row:

...optimizing <mark>database</mark> <mark>performance</mark> requires understanding
how the query planner selects execution strategies...

Configurable options:

  • StartSel / StopSel — the tags that wrap matched terms (default: <b> / </b>)
  • MaxWords / MinWords — the snippet window size
  • MaxFragments — number of fragments to return (0, the default, disables fragment mode and returns a single excerpt)
  • FragmentDelimiter — separator between fragments (default: ...)
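For long documents, fragment mode often reads better than a single window. A sketch of the options in use:

Fragmented headlines
-- Up to three separate excerpts, joined by the delimiter
SELECT ts_headline('english', body,
    plainto_tsquery('english', 'database performance'),
    'StartSel=<mark>, StopSel=</mark>, MaxFragments=3, FragmentDelimiter=" … "')
FROM articles;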

This is the _highlight field in Gold Lapel's return format when highlight=True is set in the search() call. The user sees a search result with the matching terms visually emphasized. It is a small thing. It matters more than you might expect.

Multi-Column Search and Weighted Fields

Most search needs span more than one column. A user searching for "ergonomic chair" should match products where "ergonomic" appears in the title and "chair" appears in the description — not only products where both words happen to appear in the same field.

Simple concatenation combines columns into a single tsvector:

Simple concatenation
to_tsvector('english', title || ' ' || body)

This works, but treats all columns equally. A match in the title carries the same weight as a match in the body. For many applications, that is not quite right — a product named "Ergonomic Chair" is more relevant than a product whose title is "Office Supplies" but whose description mentions chairs once in the fourteenth paragraph.

Weighted search corrects this by assigning different importance to different columns:

Weighted search
setweight(to_tsvector('english', title), 'A') ||
setweight(to_tsvector('english', body), 'B')

PostgreSQL supports four weight categories: A (highest), B, C, D (lowest/default). When ts_rank() computes relevance, it scores lexemes from weight-A fields higher than those from weight-B fields.

To see the difference in practice, consider two documents matching the query "ergonomic chair":

  • Document 1: Title is "Ergonomic Chair" (both terms in weight A). Body mentions neither term.
  • Document 2: Title is "Office Furniture Catalog." Body mentions "ergonomic" once and "chair" once (both terms in weight B).

With weighted search, Document 1 ranks higher — the terms appear in the more important field. Without weights, both documents score roughly the same. The weight system encodes the intuition that a title match is a stronger signal than a body match — an intuition that is, in my experience, almost always correct.
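The numeric multipliers behind A through D can themselves be overridden. ts_rank() accepts an optional leading array in {D, C, B, A} order; the defaults are {0.1, 0.2, 0.4, 1.0}:

Custom weight multipliers
-- Make a title match (weight A) count ten times a body match (weight B),
-- rather than the default 2.5x
SELECT title, ts_rank('{0.1, 0.2, 0.1, 1.0}', search_vector, q) AS score
FROM articles, plainto_tsquery('english', 'database performance') AS q
WHERE search_vector @@ q
ORDER BY score DESC;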

Practical weight assignments:

  • A — titles, headings, product names
  • B — body text, descriptions
  • C — metadata, tags, categories
  • D — supplementary content, comments, reviews

Gold Lapel supports multi-column search via search(conn, "articles", ["title", "body"], "query") — the wrapper handles the concatenation automatically. The developer specifies which columns to search. Gold Lapel handles the rest.

Language Configurations

PostgreSQL ships with 30+ built-in text search configurations — one per supported language. Each configuration controls the stemming rules, stop word lists, and dictionary chain for that language. The stemmer that knows "running" should match "run" in English also knows that "courant" should match "cour" in French. Languages have their own rules. PostgreSQL respects them.

Language configurations
-- English: "running" stems to "run," stop words removed
SELECT to_tsvector('english', 'The running fox');      -- 'fox':3 'run':2

-- French: "courant" stems to "cour"
SELECT to_tsvector('french', 'Le renard courant');     -- 'cour':3 'renard':2

-- Simple: no stemming, no stop words — everything preserved
SELECT to_tsvector('simple', 'The running fox');       -- 'fox':3 'running':2 'the':1

The 'simple' configuration performs no linguistic processing — no stemming, no stop word removal. Useful for identifiers, product SKUs, codes, or any content where exact lexeme matching is desired. Not every column contains natural language, and the search configuration should reflect that.

Multiple languages per table are supported. A product name in English and a description in French can each use the appropriate configuration. Custom configurations — adding stop words, changing dictionaries, building domain-specific stemming rules — are covered in Chapter 12.
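A sketch of that bilingual case, assuming hypothetical name_en and description_fr columns:

Mixed-language tsvector
-- Each column is processed with the configuration matching its language
setweight(to_tsvector('english', coalesce(name_en, '')), 'A') ||
setweight(to_tsvector('french', coalesce(description_fr, '')), 'B')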

Generated Columns: The Production Pattern

For single-table search, the production pattern is a generated column:

Generated column pattern
ALTER TABLE articles
ADD COLUMN search_vector tsvector
GENERATED ALWAYS AS (
    setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
    setweight(to_tsvector('english', coalesce(body, '')), 'B')
) STORED;

CREATE INDEX idx_articles_search ON articles USING GIN(search_vector);

The tsvector is precomputed and stored on disk. The GIN index is built on the stored column. Queries match directly against the stored column without calling to_tsvector() at query time.

The generated column updates automatically when the row changes. No refresh needed. No staleness. No maintenance. For single-table search, this is full-text search that takes care of itself — which is, I should note, the kind of infrastructure I most admire.

For multi-table search, Chapter 3's materialized view applies this same pattern to a denormalized view — same generated tsvector, same GIN index, but spanning multiple tables with a refresh schedule.

Putting It All Together

If you prefer to see the complete picture — and I respect that preference — here is a runnable example from table creation to search results. Every statement can be pasted directly into psql.

Complete example
-- Create a table
CREATE TABLE articles (
    id SERIAL PRIMARY KEY,
    title TEXT NOT NULL,
    body TEXT NOT NULL
);

-- Add the search column with weights
ALTER TABLE articles
ADD COLUMN search_vector tsvector
GENERATED ALWAYS AS (
    setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
    setweight(to_tsvector('english', coalesce(body, '')), 'B')
) STORED;

-- Create the index
CREATE INDEX idx_articles_search ON articles USING GIN(search_vector);

-- Insert some data
INSERT INTO articles (title, body) VALUES
('PostgreSQL Performance Tuning', 'This guide covers query optimization, indexing strategies, and configuration tuning for PostgreSQL databases.'),
('Database Migration Strategies', 'When migrating databases, performance testing should be conducted early. PostgreSQL handles migrations well.'),
('Introduction to SQL', 'SQL is the standard language for relational databases. PostgreSQL implements the SQL standard comprehensively.');

-- Search
SELECT title, ts_rank(search_vector, q, 2) AS score
FROM articles, plainto_tsquery('english', 'postgresql performance') AS q
WHERE search_vector @@ q
ORDER BY score DESC;

Result:

              title               |    score
----------------------------------+-----------
 PostgreSQL Performance Tuning    | 0.0991032
 Database Migration Strategies    | 0.0282843
(2 rows)

"PostgreSQL Performance Tuning" ranks first — both search terms appear in the title (weight A). "Database Migration Strategies" ranks second — "performance" and "PostgreSQL" both appear, but in the body (weight B). "Introduction to SQL" does not appear — it contains "PostgreSQL" but not "performance."

Three SQL statements to set up. One to search. No external service. No sync pipeline. No cluster to manage. The search is simply part of the database, which is where it belonged all along.

PostgreSQL Full-Text Search vs. Elasticsearch

Aspect             | PostgreSQL tsvector                             | Elasticsearch match
-------------------+-------------------------------------------------+---------------------------------
Tokenization       | to_tsvector() with language configs             | Analyzer pipeline
Query parsing      | 4 functions (plain, websearch, boolean, phrase) | Query DSL match/multi_match
Relevance scoring  | ts_rank(), ts_rank_cd()                         | BM25
Highlighting       | ts_headline()                                   | Highlight API
Index type         | GIN (inverted index)                            | Inverted index
Multi-field search | Concatenation or weighted setweight()           | multi_match with field boosting
Data consistency   | ACID — immediate                                | Eventually consistent
Infrastructure     | Same database                                   | Separate cluster + sync pipeline
Access control     | Row-level security (RLS)                        | Separate authorization layer

What PostgreSQL provides that Elasticsearch cannot: ACID consistency — search results are transactionally consistent with the database, immediately, without a sync pipeline. Row-level security — per-user search results without a separate permission layer. JOINs — enrich search results with related data in the same query.
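The JOIN point deserves one concrete query. Assuming a hypothetical authors table and an author_id column on articles, search and enrichment happen in a single round trip:

Search with a JOIN
SELECT a.title, au.name,
    ts_rank(a.search_vector, q) AS score
FROM articles a
JOIN authors au ON au.id = a.author_id,
    plainto_tsquery('english', 'database performance') AS q
WHERE a.search_vector @@ q
ORDER BY score DESC
LIMIT 20;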

What Elasticsearch provides: BM25 is a more sophisticated relevance algorithm than ts_rank, accounting for inverse document frequency across the corpus. Distributed search across shards enables horizontal scaling — Chapter 15 addresses this with Citus for PostgreSQL. These are genuine capabilities, and I acknowledge them as such.

For the vast majority of application search workloads, the comparison favors PostgreSQL: same core capabilities, stronger consistency guarantees, fewer moving parts, and no second service to maintain at three in the morning.

Honest Boundary

I would not be doing my job properly if I did not tell you where this technique ends and the others begin.

tsvector is lexeme-based. It matches words — or more precisely, stems of words. It does not understand meaning. "Comfortable chair" and "ergonomic seating" share no stems, and tsvector treats them as completely unrelated. The user who searches for one and expects to find the other will be disappointed by tsvector alone. That gap — the gap between words and meaning — is what pgvector addresses in Chapters 7 and 8.

ts_rank is good. It is not perfect. For most applications, it produces results that serve the user well. Where search result quality is the primary differentiator — where the difference between result number two and result number five affects revenue — Chapter 13's hybrid search, combining lexical and semantic scores with Reciprocal Rank Fusion, closes the gap.

Full-text search requires the user to type the right words, or at least their stems. When they don't — when they misspell, when they use a synonym, when they use a name that sounds right but is spelled differently — the other pillars fill the gap. Chapter 5 handles the typo. Chapter 6 handles the mispronunciation. Chapters 7-8 handle the synonym. Together, they form a search system where the user does not need to guess the exact right word. Full-text search is the foundation of that system. It is not the whole building.


Full-text search handles the majority of what teams add Elasticsearch to do — tokenization, stemming, relevance ranking, highlighting, multi-column search, language-aware processing. It has been doing this since 2008, built into every PostgreSQL installation, waiting patiently to be used. I have always found patience an admirable quality in infrastructure.

But it matches words, not intention. And users, being human, do not always type the right words. Sometimes the fingers slip. Sometimes the spelling varies. Sometimes "PostgreSQL" arrives as "Postgre" and your beautifully indexed tsvector returns nothing at all, because it was taught to be precise, not forgiving.

Chapter 5 addresses this. If you will follow me, I should like to introduce you to the tool that forgives the fingers without judging the typist. I find that a civilized approach to the problem.