You Don't Need Elasticsearch

Chapter 7: What Embeddings Are

The Waiter of Gold Lapel · Published Apr 12, 2026 · 12 min

I should like to show you something that I find genuinely remarkable.

Chapters 4 through 6 covered three kinds of text matching. Full-text search matches words — or more precisely, their stems. Fuzzy matching matches characters — forgiving typos by comparing how strings look. Phonetic matching matches sounds — finding names that are pronounced alike regardless of how they happen to be spelled. Each of these tools is valuable. Each solves a real problem. And I have recommended each of them with confidence.

But they all share a limitation, and it is time to name it plainly: they operate on the surface of language — the characters and the sounds — not the meaning underneath.

When a user searches for "comfortable office chair" and expects to find "ergonomic desk seating," no amount of stemming will bridge the gap — the words share no roots. No amount of trigram matching will bridge it — the strings share no character sequences. No amount of phonetic coding will bridge it — the words do not sound alike. The user knows exactly what they mean. The search system does not, because it has been comparing surfaces when the user was expressing meaning.

This is the gap between words and meaning. Bridging it requires a fundamentally different representation of text — not characters, not sounds, but meaning itself, encoded as numbers. I realize that sentence may sound improbable. I assure you it is not.

This chapter explains that representation. It involves no SQL. It is the concept that makes Chapter 8 possible — and, if I may say, the concept that makes the entire search architecture we have been building considerably more powerful than the sum of its parts.

What Is an Embedding?

What are text embeddings, in simple terms? An embedding is a list of numbers — a vector — that represents the meaning of a piece of text.

"Comfortable office chair" might become [0.23, -0.41, 0.87, 0.12, ...] — a list of, say, 1536 numbers. Each number captures some aspect of the text's meaning, though no individual number is interpretable on its own. The meaning lives in the pattern of numbers taken together — in the position they define in a high-dimensional space.

The key property: texts with similar meanings produce vectors that are close together in that space. "Comfortable office chair" and "ergonomic desk seating" produce nearby vectors. "Comfortable office chair" and "tropical fish feeding schedule" produce vectors that are far apart. The proximity is not coincidental. It is the entire point.
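Closeness can be made concrete with a few lines of arithmetic. The sketch below uses tiny four-dimensional vectors invented for illustration (real models produce hundreds or thousands of dimensions); cosine similarity scores vectors pointing the same way close to 1.0, and unrelated vectors much lower.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional vectors, invented for illustration.
office_chair = [0.23, -0.41, 0.87, 0.12]
desk_seating = [0.25, -0.38, 0.82, 0.15]   # similar meaning -> nearby vector
fish_feeding = [-0.70, 0.52, -0.10, 0.63]  # unrelated meaning -> distant vector

print(cosine_similarity(office_chair, desk_seating))  # close to 1.0
print(cosine_similarity(office_chair, fish_feeding))  # much lower
```

The numbers here are hand-picked, but the shape of the computation is exactly what a vector database performs millions of times per query.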

Allow me an analogy, because I find it makes the concept feel less abstract and more like something a person can hold in their mind.

Think of a library where books are shelved not by author or title, but by what they are about. Books about ergonomics are near books about office comfort, regardless of what words appear on their covers. A book titled "Ergonomic Desk Seating" and a book titled "Comfortable Office Chairs" sit side by side — because they are about the same thing, even though their titles share not a single word. Embeddings are the coordinate system for that library. They assign each piece of text a location based on its meaning, and texts with similar meanings get nearby locations.

Now imagine searching that library. You walk in and say: "I need something about making my home office more comfortable." The librarian does not search an alphabetical index. She does not look for the word "comfortable" in a catalog. She walks you to the section of the library where comfort-related books live — and there, in that section, you find books about ergonomic chairs, standing desks, monitor arm placement, and lumbar support pillows. None of these books contain the word "comfortable" in their titles. But they are all in the right section, because the library is organized by meaning, not by words.

This is what vector search does. The user's query becomes a vector — a position in the library. The search finds the nearest books. The words do not need to match. The meaning does.

I find this concept quietly beautiful. A system that understands what the user means, even when the user cannot find the exact right words to say it. That is, I would suggest, what search ought to do.

The vector is not a description of the text. It is a position — a point in a space where proximity equals similarity. You do not read a vector to understand the text. You compare two vectors to understand how two texts relate to each other.
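That comparison is the whole search algorithm, at least in brute-force form. A minimal sketch of the library walk-through, using a toy shelf of invented three-dimensional vectors: rank every book by its distance to the query vector and take the closest.

```python
import math

def cosine_distance(a, b):
    """1 minus cosine similarity: 0 = identical direction, 2 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    mag = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return 1 - dot / mag

# A toy "library": titles paired with invented 3-dimensional vectors.
shelf = {
    "Ergonomic Desk Seating":  [0.9, 0.1, 0.0],
    "Lumbar Support at Work":  [0.8, 0.3, 0.1],
    "Tropical Fish Care":      [0.0, 0.2, 0.9],
}

query = [0.85, 0.2, 0.05]  # stands in for "comfortable office chair"

# Nearest-neighbor search: sort the shelf by distance to the query.
ranked = sorted(shelf, key=lambda title: cosine_distance(query, shelf[title]))
print(ranked[0])
```

A real system replaces the sorted() scan with an index (HNSW, in Chapter 8), but the result is the same: the nearest neighbors in meaning-space, regardless of shared words.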

How Embeddings Are Created

A neural network — called an "embedding model" — reads text and produces a vector. The model was trained on billions of text examples — web pages, books, articles, conversations — learning which concepts tend to appear in similar contexts. Through that training, it learned that "comfortable" and "ergonomic" appear in similar contexts, that "chair" and "seating" appear in similar contexts, and therefore that "comfortable office chair" and "ergonomic desk seating" should be located near each other in the vector space.

You do not train the model. You use it. Send text to the model, get back a vector. Like using a dictionary — you don't write the definitions, you look them up. The model is either a commercial API (send an HTTP request, receive a vector) or a downloadable open-source model (run it on your own infrastructure, receive a vector). Either way, text goes in, numbers come out. The complexity is inside the model. The interface is simple.
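That interface is simple enough to sketch. The stub below is purely hypothetical: a hash trick that mimics the text-in, vector-out shape of an embedding model while capturing no meaning whatsoever. A real implementation would be an HTTP call to a commercial API or a forward pass through a local open-source model.

```python
import hashlib

EMBEDDING_DIMS = 8  # real models use hundreds or thousands of dimensions

def embed(text: str) -> list[float]:
    """Stand-in for a real embedding model: text in, fixed-length vector out.

    This deterministic hash trick only mimics the *interface*; it encodes
    nothing about meaning. Swap in a real model and nothing downstream
    changes -- the caller still receives a fixed-length list of floats.
    """
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 256 for b in digest[:EMBEDDING_DIMS]]

vector = embed("comfortable office chair")
print(len(vector))  # always EMBEDDING_DIMS, regardless of input length
```

The point of the stub is the contract, not the arithmetic: every caller in your system depends only on "string in, fixed-length vector out," which is why the model behind that contract can be replaced later.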

The canonical demonstration of why this works is vector arithmetic. Take the vector for "king." Subtract the vector for "man." Add the vector for "woman." The resulting vector is closest to "queen."

This works because the model learned, from patterns in billions of text examples, that gender is a direction in the vector space and royalty is another direction. These directions were not programmed by the model's creators. They emerged from the data. The model discovered — without being told — that the relationship between "king" and "queen" is the same as the relationship between "man" and "woman": a consistent directional shift in the vector space. The meaning was there in the data. The model found it.

The level of understanding needed for this book: enough to make "king - man + woman ≈ queen" click. Enough to understand that embeddings capture meaning as position. Not enough to derive gradient descent or explain transformer architectures. This is a practitioner's book, not a textbook, and I have no intention of making you sit through a lecture when a demonstration will do.
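To make the arithmetic click, here is a toy demonstration with invented two-dimensional vectors, laid out so that one axis reads as "royalty" and the other as "gender." Real models learn such directions from data; they are never this tidy, but the mechanics are identical.

```python
import math

# Invented 2-D vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
words = {
    "king":  [0.9,  0.7],
    "queen": [0.9, -0.7],
    "man":   [0.1,  0.7],
    "woman": [0.1, -0.7],
}

def nearest(target, vocab):
    """Return the vocabulary word whose vector is closest to target."""
    return min(vocab, key=lambda w: math.dist(target, vocab[w]))

# king - man + woman: remove maleness, add femaleness, keep royalty.
result = [k - m + w for k, m, w in zip(words["king"], words["man"], words["woman"])]
print(nearest(result, words))  # -> queen
```

In these hand-built vectors the result lands exactly on "queen"; in a real embedding space it lands near it, which is why the canonical claim is phrased with ≈ rather than =.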

Dimensions

Each vector has a fixed number of dimensions, determined by the model that produced it. A 1536-dimensional model means each piece of text becomes a list of 1536 numbers. A 768-dimensional model means 768 numbers. The dimension count is a property of the model, not the text.

How do embedding dimensions affect performance? More dimensions capture more information — more fine-grained distinctions in meaning. But more dimensions also mean more storage per row, more computation per distance calculation, and slower index builds. The trade-off is quality versus cost, and the returns diminish as dimensions increase.

Common dimension counts and what they mean in practice:

  • 384 dimensions — lightweight models like sentence-transformers/all-MiniLM. Fast, small (~1.5 KB per vector), good enough for many applications. The right choice when speed and storage matter more than the last few percent of quality.
  • 768 dimensions — mid-range models like nomic-embed-text and Qwen3-Embedding. Good balance of quality and efficiency (~3 KB per vector). Strong quality for most application search workloads. I find this range serves the majority of cases well.
  • 1024 dimensions — Cohere Embed v4. Strong quality at moderate storage cost (~4 KB per vector).
  • 1536 dimensions — OpenAI text-embedding-3-small. High quality, widely used, well-documented (~6 KB per vector). The model most tutorials reference.
  • 3072 dimensions — OpenAI text-embedding-3-large. Maximum quality from OpenAI, maximum storage cost (~12 KB per vector).

For most applications, 768-1536 dimensions provide excellent quality. Go higher only if quality testing on your specific data shows measurable improvement. In my experience, the improvement from 1536 to 3072 is rarely worth the doubled storage cost — but your data may differ, and testing is always more reliable than advice.

Storage implications at scale: a 1536-dimensional vector is approximately 6 KB per row. On a table with one million rows, that is roughly 6 GB of vector data, plus the HNSW index. On ten million rows, 60 GB. These are manageable numbers, but they are worth planning for — especially when choosing between 768-dimensional vectors (3 GB per million rows) and 3072-dimensional vectors (12 GB per million rows). The dimension choice is an architectural decision with storage consequences, and I would prefer you make it deliberately rather than discover it during a capacity review.
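The storage figures above are straightforward to reproduce. A back-of-envelope sketch, assuming 4 bytes per vector component (how pgvector stores them) and ignoring per-row and index overhead:

```python
BYTES_PER_FLOAT = 4  # pgvector stores each component as a 4-byte float

def vector_storage(dims: int, rows: int) -> str:
    """Back-of-envelope raw vector storage, excluding index overhead."""
    total_bytes = dims * BYTES_PER_FLOAT * rows
    return f"{total_bytes / 1e9:.1f} GB"

for dims in (384, 768, 1536, 3072):
    print(dims, vector_storage(dims, 1_000_000))
```

Run against one million rows, this reproduces the chapter's figures: roughly 6 GB at 1536 dimensions, half that at 768, double at 3072.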

Distance: How Close Are Two Vectors?

Once text is represented as vectors, the question "how similar are these two texts?" becomes "how close are these two points?" Measuring closeness requires a distance metric, and three are commonly used.

Cosine distance measures the angle between two vectors, ignoring their magnitude (length). If two vectors point in the same direction, they are similar, regardless of how long the arrows are. This is the most common metric for text embeddings and the one that Gold Lapel's similar() uses by default. Chapter 8 will demonstrate it with pgvector's <=> operator.

Euclidean distance (L2) measures the straight-line distance between two points in vector space. Unlike cosine, it is sensitive to magnitude — two vectors pointing in the same direction but with different lengths will have nonzero Euclidean distance. Better for some non-text use cases like image features or geographic coordinates.

Inner product is related to cosine but does not normalize by magnitude. Used in some recommendation systems and retrieval-augmented generation (RAG) pipelines.

The intuitive explanation for cosine — because I believe intuition serves you better than formulas: imagine two arrows starting from the same point. Cosine similarity compares the direction the arrows point, regardless of how long they are. Two long arrows pointing the same way are just as similar as two short arrows pointing the same way. For text embeddings, the "direction" encodes the meaning, and the "length" is a byproduct of the model's processing. Comparing directions is what matters. The length is, if you will forgive the expression, beside the point.

For text search, use cosine distance. This is the default and the correct choice for the vast majority of text embedding use cases. When in doubt, cosine. You will not regret it.
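The difference between the three metrics is easiest to see on a pair of vectors that point the same way but differ in length. A small sketch, with toy two-dimensional vectors:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine_distance(a, b):
    return 1 - dot(a, b) / (norm(a) * norm(b))

def euclidean_distance(a, b):
    return math.dist(a, b)

def inner_product(a, b):
    return dot(a, b)

a = [1.0, 2.0]
b = [2.0, 4.0]  # same direction as a, twice the length

print(cosine_distance(a, b))     # ~0.0 -- direction is identical
print(euclidean_distance(a, b))  # > 0  -- magnitude differs
print(inner_product(a, b))       # grows with magnitude
```

Cosine sees these two vectors as essentially identical; Euclidean distance and inner product both register the difference in length. For text embeddings, where length is a processing byproduct, that is exactly why cosine is the right default.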

The Embedding Model Landscape

The embedding model space moves fast — faster, perhaps, than any other area of the search stack. What follows reflects the landscape as of early 2026. Appendix C has the full comparison table with dimensions, context lengths, costs, licensing, and benchmark scores. This section provides practical guidance for choosing a model, with the caveat that practical guidance in a fast-moving field carries an expiration date. I will do my best. The appendix will be updated at publication time.

The Current Leaders

Gemini Embedding 2 (Google) leads the MTEB retrieval benchmarks with a score of 67.71. Strong all-around quality. Commercial API.

Voyage 4 (Voyage AI) uses a Mixture of Experts (MoE) architecture for efficient inference at high quality. Commercial API.

Cohere Embed v4 produces 1024-dimensional vectors with strong quality at moderate storage cost. Commercial API with a generous free tier.

Qwen3-Embedding-8B (Alibaba) produces 768-dimensional vectors under an Apache 2.0 license — fully open-source and self-hostable. The strongest open-source embedding model as of early 2026.

OpenAI's Position

OpenAI's text-embedding-3-small (1536 dimensions) and text-embedding-3-large (3072 dimensions) remain widely used and well-documented. Most tutorials still default to them. However, these models have not been updated since January 2024, and on MTEB retrieval benchmarks, text-embedding-3-large now ranks approximately 7th-9th.

They remain good models. They are well-integrated, well-documented, and produce results that serve most applications well. They are no longer the best-performing models on current benchmarks — and that is not a criticism. The field has advanced. Other teams have published excellent work. OpenAI's models continue to be a reasonable choice, particularly for teams already using them. They are simply no longer the default recommendation for new projects.

Open-Source Models

Qwen3-Embedding-8B — 768 dimensions, Apache 2.0. The strongest open-source option. Runs locally or on your own infrastructure. No API costs. No data leaving your network.

nomic-embed-text — 768 dimensions, Apache 2.0. High quality for its size, with a long context window (8192 tokens) that handles longer documents well.

sentence-transformers (all-MiniLM-L6-v2) — 384 dimensions. Lightweight, fast, free. Lower quality but sufficient for simpler use cases or prototyping.

The open-source advantage: no API calls, no per-token cost, no data leaving your infrastructure. The trade-off: you host and run the model yourself, which requires compute resources and operational responsibility. For teams where data privacy is a requirement rather than a preference, open-source models are not merely an option — they are the option.

Choosing a Model

The practical decision tree:

  • Starting a new project, quality is the priority → Gemini Embedding 2 or Cohere Embed v4 for current best quality.
  • Data must stay on-premises, no external API calls → Qwen3-Embedding-8B (best open-source) or nomic-embed-text.
  • Already using OpenAI, results are satisfactory → stay. No urgent need to switch. A working system deserves respect.
  • Budget-constrained, simple use case → sentence-transformers/all-MiniLM (384 dimensions, free, fast).

You can always re-embed later. Switching models means regenerating all vectors and rebuilding indexes — a computational cost, not an architectural one. Your database schema doesn't change. Your queries don't change. Your pgvector indexes get rebuilt with the new vectors. The migration cost is hours of compute, not weeks of engineering. I mention this because the fear of being locked into a model choice prevents some teams from starting at all, and that fear is unfounded.
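The shape of that migration can be sketched in a few lines. Every name below is invented for illustration (FakeModel, reembed_all, the row dictionaries); the real version would batch UPDATE statements against the database, but the loop structure is the whole job.

```python
def reembed_all(rows, new_model, batch_size=100):
    """Regenerate every stored vector with a new embedding model."""
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        vectors = [new_model.embed(r["text"]) for r in batch]
        for row, vec in zip(batch, vectors):
            row["embedding"] = vec  # in practice: UPDATE ... SET embedding = ...
    # Afterward, rebuild the vector index. Schema and queries are untouched.
    return rows

class FakeModel:
    """Stands in for a real embedding model in this sketch."""
    def embed(self, text):
        return [float(len(text)), 0.0]  # meaningless, but fixed-dimension

rows = [{"text": "comfortable office chair", "embedding": None},
        {"text": "tropical fish care", "embedding": None}]
reembed_all(rows, FakeModel())
print(all(r["embedding"] is not None for r in rows))  # True
```

Notice what the loop does not touch: the table schema, the queries, the application code. The cost is the compute inside the loop, which is precisely why the migration is measured in hours rather than weeks.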

What Embeddings Are Not

A few misconceptions are common enough to address directly, because they lead to architectural decisions that are difficult to undo.

"Embeddings are just fancy keyword matching." They are not. Keyword matching — whether full-text search, trigrams, or phonetic codes — operates on the surface features of text: which characters are present, which stems they reduce to, how they sound when spoken. Embeddings operate on the relationships between concepts, learned from billions of examples of how those concepts are used together. "Comfortable office chair" and "ergonomic desk seating" share no surface features at all — no common stems, no shared trigrams, no phonetic overlap. They are only similar in an embedding space, because the model learned that these phrases appear in similar contexts and describe similar things. This is a fundamentally different kind of matching, not an incremental improvement on the old kind. The distinction matters, and I would prefer you understand it clearly before writing any SQL.

"More dimensions are always better." They are not always worth the cost. Going from 384 to 768 dimensions produces a meaningful quality improvement for most datasets. Going from 768 to 1536 produces a smaller improvement. Going from 1536 to 3072 produces a smaller improvement still. The returns diminish while the costs — storage, index build time, query latency — scale linearly. A 3072-dimensional embedding is not twice as good as a 1536-dimensional one. It is marginally better at distinguishing fine-grained differences, at twice the storage cost. For most application search workloads, 768-1536 dimensions hit the practical sweet spot. The wise approach, as with most things in infrastructure, is to test on your own data before committing to the most expensive option.

"You need to train your own model." You almost certainly do not. Modern embedding models are general-purpose. They have been trained on enough text to produce high-quality embeddings for most domains without fine-tuning. Product descriptions, article text, user queries, support tickets — off-the-shelf models handle all of these well. Fine-tuning is an option for domains with highly specialized vocabulary — medical, legal, scientific — but it is an optimization, not a prerequisite. Start with a pre-trained model. Fine-tune only if quality testing on your specific data shows that the pre-trained model is not sufficient. Most teams never reach that threshold, and that is perfectly acceptable.

"Switching models later means rebuilding everything." It means regenerating vectors and rebuilding indexes. It does not mean changing your schema, rewriting your queries, or modifying your application code. The pgvector column type stays the same (though the dimension count may change). The <=> operator stays the same. The HNSW index gets rebuilt with the new vectors. The migration cost is computational — hours of embedding generation and index building — not architectural. You are not locked into your first model choice. I mention this because the freedom to change your mind later is, in my experience, one of the most undervalued properties of a well-designed system.

The Key Insight for This Book

An embedding model turns the "matching meaning" problem into a "finding nearby points" problem. Text becomes a vector. Similarity becomes distance. And "finding nearby points" is a well-solved problem in databases — it is a nearest-neighbor query with an index.

PostgreSQL + pgvector turns this into: store the vectors in a column, build an HNSW index, query with ORDER BY distance LIMIT k. The concept from this chapter becomes the implementation in Chapter 8.

Consider what you have built across the last four chapters:

  • Chapter 4 — lexical search. Matching words (or their stems). tsvector and GIN indexes.
  • Chapter 5 — fuzzy search. Forgiving typos. pg_trgm and trigram GIN indexes.
  • Chapter 6 — phonetic search. Matching sounds. fuzzystrmatch and B-tree expression indexes.
  • Chapters 7-8 — semantic search. Matching meaning. pgvector and HNSW indexes.

Together, these cover every dimension of imprecision between what the user types and what they intend: the right words, the wrong spelling, the wrong pronunciation, and the wrong words entirely. The materialized view from Chapter 3 is the surface where the tsvector column and the vector column coexist — lexical and semantic search indexed on the same view, refreshed atomically, queried in the same SQL statement. Chapter 13 will show how to combine their scores into a single ranked result using Reciprocal Rank Fusion.

The architecture is coming together. I trust you can feel it.


The concept is in place. Text becomes vectors. Vectors encode meaning. Similar meanings produce nearby vectors. Finding nearby vectors is a database query — and PostgreSQL, with pgvector, is rather good at database queries.

What remains is to see it work. Chapter 8 will show exactly how — storing vectors in a column, computing distances with operators, building HNSW indexes for speed, and returning semantically similar results in milliseconds. The concept from this chapter becomes code you can run. The library we have been imagining becomes a table you can query.

If you have followed me this far — through words, through characters, through sounds, and now to meaning itself — I believe you will find the next chapter deeply satisfying. If you will permit me, I should like to show you the SQL.