Building a RAG Pipeline with pgvector and Python
The artwork is semantically similar to what was requested. A cosine similarity of 0.87 — close enough to retrieve, not close enough to frame.
Why PostgreSQL for RAG
Good evening. I see you are building a RAG pipeline, and you are considering a dedicated vector database. Before you add another service to the household, allow me to present an alternative.
Retrieval-Augmented Generation pipelines need a vector store to find relevant documents and feed them to a language model as context. The standard approach involves a separate vector database — Pinecone, Weaviate, Qdrant, or similar — but most teams already run PostgreSQL.
pgvector turns PostgreSQL into a vector database. It adds a vector column type, distance operators, and approximate nearest-neighbor indexes (HNSW and IVFFlat). Your documents, metadata, and embeddings live in the same database as the rest of your application data.
I should be forthcoming about the trade-off: pgvector is not as fast as purpose-built vector databases at billion-scale vector search. But for datasets up to a few million vectors — which covers the majority of production RAG applications — pgvector is competitive on latency and dramatically simpler to operate. No separate service to deploy, monitor, back up, or pay for. JOINs, transactions, WHERE clauses, and standard PostgreSQL tooling all work normally. For a deeper comparison, see the pgvector vs Pinecone analysis and the broader vector database comparison.
This guide builds a complete pipeline: document ingestion, chunking, embedding generation, vector storage in pgvector, similarity search, and LLM integration. The implementation uses raw Python with psycopg3 — no LangChain or LlamaIndex required. Framework integration is covered at the end for teams that want it.
Prerequisites and setup
Installing pgvector
pgvector requires PostgreSQL 13 or later. HNSW indexes were added in pgvector 0.5.0; version 0.7 or later is recommended for its performance improvements.
Enable the extension in your database:
CREATE EXTENSION IF NOT EXISTS vector;
Python environment
Install the required packages:
pip install "psycopg[binary]" pgvector openai
For local embeddings without API calls:
pip install sentence-transformers torch
Set up the connection with pgvector type registration:
import psycopg
from pgvector.psycopg import register_vector
conn = psycopg.connect("postgresql://localhost/mydb")
register_vector(conn)
If you are migrating from psycopg2, see the psycopg2 to psycopg3 migration guide for the key differences.
Designing the vector table
If you'll permit me, the schema design deserves care. What you store and how you index it determines everything about retrieval quality and performance.
Table schema for RAG documents
CREATE TABLE documents (
id bigserial PRIMARY KEY,
content text NOT NULL,
metadata jsonb DEFAULT '{}',
embedding vector(1536),
created_at timestamptz DEFAULT now()
);
Column-by-column:
- id bigserial — auto-incrementing primary key. Use bigserial rather than serial — vector tables tend to grow large.
- content text — the original text chunk. This is what gets returned to the LLM as context.
- metadata jsonb — JSONB for structured metadata: source URL, document title, chunk index, author. This column is what makes pgvector powerful for production RAG — you can filter vectors with standard SQL WHERE clauses. See the JSONB guide for advanced usage patterns.
- embedding vector(1536) — the vector column. The dimension must match your embedding model (OpenAI text-embedding-3-small: 1536, all-MiniLM-L6-v2: 384).
- created_at timestamptz — useful for incremental updates and time-bounded search.
Create a GIN index on the metadata column for efficient filtered queries:
CREATE INDEX idx_documents_metadata ON documents USING gin (metadata);
Chunking strategy
LLMs have context windows, and embedding models have token limits. A 50-page document must be split into chunks, each of which becomes one row in the documents table.
Chunk size guidelines: 200–500 tokens per chunk works well for most retrieval tasks. Start with 400 tokens and adjust based on retrieval quality.
Overlap: Include 50–100 tokens of overlap between adjacent chunks. This prevents information at chunk boundaries from being lost.
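To make the overlap concrete, here is a minimal character-level sketch (an illustrative helper, not part of any library): each chunk repeats the tail of the previous one, so text straddling a boundary still appears intact in at least one chunk.

```python
def split_with_overlap(text, size, overlap):
    # Each chunk starts (size - overlap) characters after the previous one,
    # so adjacent chunks share `overlap` characters at the boundary.
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

chunks = split_with_overlap("abcdefghij", size=4, overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

The same sliding-window logic applies whether the units are characters or tokens, as in the implementations below.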
Implementation with tiktoken (for OpenAI models):
import tiktoken
def chunk_text(text, max_tokens=400, overlap_tokens=50):
encoder = tiktoken.encoding_for_model("text-embedding-3-small")
tokens = encoder.encode(text)
chunks = []
start = 0
while start < len(tokens):
end = start + max_tokens
chunk_tokens = tokens[start:end]
chunk_text = encoder.decode(chunk_tokens)
chunks.append(chunk_text)
start += max_tokens - overlap_tokens
    return chunks
Simple character-based splitting (model-agnostic):
def chunk_text_simple(text, max_chars=1600, overlap_chars=200):
chunks = []
start = 0
while start < len(text):
end = start + max_chars
chunk = text[start:end]
chunks.append(chunk)
start += max_chars - overlap_chars
    return chunks
Each chunk becomes one row. Store the chunk index and source in metadata:
metadata = {
"source": "docs/user-guide.md",
"chunk_index": 3,
"total_chunks": 12,
"title": "User Guide"
}
Generating embeddings
OpenAI embeddings
text-embedding-3-small (1536 dimensions) is the recommended default for most RAG applications.
from openai import OpenAI
client = OpenAI() # uses OPENAI_API_KEY env var
def get_embedding(text, model="text-embedding-3-small"):
response = client.embeddings.create(
input=text,
model=model
)
    return response.data[0].embedding
Batching: The embeddings API accepts up to 2,048 inputs per request:
def get_embeddings_batch(texts, model="text-embedding-3-small"):
response = client.embeddings.create(
input=texts,
model=model
)
    return [item.embedding for item in response.data]
Rate limiting: Implement exponential backoff for production pipelines:
import time
def get_embeddings_with_retry(texts, model="text-embedding-3-small", max_retries=5):
for attempt in range(max_retries):
try:
return get_embeddings_batch(texts, model)
except Exception as e:
if attempt == max_retries - 1:
raise
wait = 2 ** attempt
            time.sleep(wait)
Reduced dimensions: text-embedding-3-small supports a dimensions parameter to reduce output from 1536 to 512 or 256:
response = client.embeddings.create(
input=text,
model="text-embedding-3-small",
dimensions=512 # reduced from default 1536
)
Local embeddings with sentence-transformers
For teams that need to keep data local, avoid API costs, or run without network access:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2") # 384 dimensions
def get_embedding_local(text):
return model.encode(text).tolist()
def get_embeddings_batch_local(texts):
    return model.encode(texts, batch_size=64, show_progress_bar=True).tolist()
Trade-offs: No API calls, no per-token cost, no rate limits, and data stays on your machine. Embedding large datasets in batches is slow on CPU — a GPU accelerates this substantially.
Storing embeddings in pgvector
Insert individual documents:
def insert_document(conn, content, metadata, embedding):
conn.execute(
"""
INSERT INTO documents (content, metadata, embedding)
VALUES (%s, %s, %s::vector)
""",
(content, psycopg.types.json.Json(metadata), embedding)
    )
Bulk insert for large datasets:
def insert_documents_bulk(conn, documents):
"""
documents: list of (content, metadata, embedding) tuples
"""
with conn.cursor() as cur:
with cur.copy(
"COPY documents (content, metadata, embedding) FROM STDIN WITH (FORMAT text)"
) as copy:
            import json  # serialize metadata for the text-format COPY stream
            for content, metadata, embedding in documents:
                # pgvector's text input format: "[x1,x2,...]"
                vec_str = "[" + ",".join(str(x) for x in embedding) + "]"
                meta_str = json.dumps(metadata)
                copy.write_row((content, meta_str, vec_str))
    conn.commit()
Upsert pattern for updating embeddings when source documents change:
def upsert_document(conn, source_id, content, metadata, embedding):
conn.execute(
"""
INSERT INTO documents (id, content, metadata, embedding)
VALUES (%s, %s, %s, %s::vector)
ON CONFLICT (id) DO UPDATE SET
content = EXCLUDED.content,
metadata = EXCLUDED.metadata,
embedding = EXCLUDED.embedding
""",
(source_id, content, psycopg.types.json.Json(metadata), embedding)
)
    conn.commit()
Similarity search — querying vectors
The similarity query
This is the heart of the arrangement: embed the user's question with the same model used for documents, then find the most similar chunks.
SELECT id, content, metadata,
1 - (embedding <=> %s::vector) AS similarity
FROM documents
ORDER BY embedding <=> %s::vector
LIMIT 5;
Distance operators:
- <=> — cosine distance (lower is more similar). Use for text embeddings — this is the standard.
- <-> — L2 (Euclidean) distance. Use when magnitude matters.
- <#> — negative inner product. Use with normalized vectors.
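The `1 - (embedding <=> ...)` expression in the queries above converts cosine distance into a similarity score. Here is the same quantity in plain Python (a sketch for intuition, not how pgvector computes it internally):

```python
import math

def cosine_distance(a, b):
    # Cosine distance = 1 - cosine similarity; this matches what <=> returns.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)

print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # 0.0 (identical direction)
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0 (orthogonal)
```

Note that magnitude drops out entirely: [3, 4] and [6, 8] have distance 0.0, which is why <=> is the right choice for text embeddings.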
Python implementation:
def search_similar(conn, query_text, limit=5):
query_embedding = get_embedding(query_text)
results = conn.execute(
"""
SELECT id, content, metadata,
1 - (embedding <=> %s::vector) AS similarity
FROM documents
ORDER BY embedding <=> %s::vector
LIMIT %s
""",
(query_embedding, query_embedding, limit)
).fetchall()
    return results
Filtered similarity search
This is where pgvector earns its keep. Combining vector search with metadata filters is its strongest advantage over dedicated vector databases:
SELECT id, content, metadata,
1 - (embedding <=> %s::vector) AS similarity
FROM documents
WHERE metadata->>'source' = 'product-docs'
AND created_at > '2026-01-01'
ORDER BY embedding <=> %s::vector
LIMIT 5;
Use cases: multi-tenant RAG, time-bounded search, category-specific retrieval, access control.
def search_filtered(conn, query_text, source=None, after_date=None, limit=5):
query_embedding = get_embedding(query_text)
conditions = []
params = [query_embedding, query_embedding]
if source:
conditions.append("metadata->>'source' = %s")
params.append(source)
if after_date:
conditions.append("created_at > %s")
params.append(after_date)
where_clause = ""
if conditions:
where_clause = "WHERE " + " AND ".join(conditions)
params.append(limit)
results = conn.execute(
f"""
SELECT id, content, metadata,
1 - (embedding <=> %s::vector) AS similarity
FROM documents
{where_clause}
ORDER BY embedding <=> %s::vector
LIMIT %s
""",
params
).fetchall()
    return results
Creating vector indexes
Without an index, pgvector performs exact nearest-neighbor search — a sequential scan of all vectors. Perfectly accurate, but for large datasets, the performance leaves something to be desired.
HNSW index (recommended for production):
CREATE INDEX idx_documents_embedding ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
HNSW builds a multi-layer graph for fast approximate nearest-neighbor search. m = 16 is a good default for connections per node. ef_construction = 64 controls build-time thoroughness. Tune query-time accuracy with SET hnsw.ef_search (the default is 40; higher values improve recall at the cost of latency).
IVFFlat index (alternative for constrained environments):
CREATE INDEX idx_documents_embedding_ivf ON documents
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
Which index to use: HNSW for production workloads where recall quality matters. IVFFlat when build time or memory is constrained; as a starting point, pgvector's guidance is lists ≈ rows / 1000 for tables up to about a million rows. For a deeper dive, see the pgvector performance tuning guide and the pgvector query optimization guide.
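A practical way to choose between the two (and to tune ef_search or lists afterwards) is to measure recall on a sample of queries: run each query once with the index and once with an exact scan, then compare the returned ids. The helper below is a plain-Python sketch; the id lists are illustrative.

```python
def recall_at_k(approx_ids, exact_ids, k=10):
    # Fraction of the true top-k (from an exact sequential scan) that the
    # approximate index also returned in its top-k.
    approx = set(approx_ids[:k])
    exact = exact_ids[:k]
    return sum(1 for doc_id in exact if doc_id in approx) / len(exact)

# Illustrative: the index missed one of the true top-5 neighbors.
print(recall_at_k([1, 2, 3, 4, 6], [1, 2, 3, 4, 5], k=5))  # 0.8
```

If measured recall is below your target, raise ef_search (HNSW) or probes (IVFFlat) and re-measure.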
Integrating with an LLM — the complete RAG pipeline
The RAG flow
The complete pipeline has five steps:
- User asks a question
- Embed the question using the same model that embedded the documents
- Query pgvector for the top-K most similar chunks
- Construct a prompt with the question and retrieved chunks as context
- Send the prompt to an LLM and return the response
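The five steps above can be sketched as a single function with the embedding, retrieval, and generation stages injected as callables. Everything here is illustrative scaffolding: the lambdas stand in for the real OpenAI and pgvector calls.

```python
def answer_question(question, embed, search, generate, top_k=5):
    # Steps 1-2: embed the question with the same model used for documents.
    query_vector = embed(question)
    # Step 3: retrieve the top-K most similar chunks.
    chunks = search(query_vector, top_k)
    # Step 4: build a prompt with the retrieved chunks as context.
    context = "\n\n---\n\n".join(chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    # Step 5: generate the answer.
    return generate(prompt)

# Stub stages to show the data flow:
result = answer_question(
    "What is pgvector?",
    embed=lambda question: [0.0, 0.0, 0.0],
    search=lambda vector, k: ["pgvector adds vector search to PostgreSQL."],
    generate=lambda prompt: prompt.splitlines()[-1],
)
print(result)  # Question: What is pgvector?
```

Keeping the stages separable like this makes each one easy to swap or test in isolation.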
Prompt construction
def build_prompt(question, context_chunks):
context_text = "\n\n---\n\n".join(
f"Source: {chunk['metadata'].get('source', 'unknown')}\n{chunk['content']}"
for chunk in context_chunks
)
system_message = (
"You are a helpful assistant. Answer the user's question based on the "
"provided context. If the context does not contain enough information to "
"answer the question, say so. Cite the source when possible."
)
user_message = f"""Context:
{context_text}
Question: {question}"""
    return system_message, user_message
Standalone implementation (no framework)
Allow me to present the complete RAG pipeline in under 80 lines, using psycopg3, pgvector, and the OpenAI client. No LangChain, no LlamaIndex — just the primitives:
import psycopg
from pgvector.psycopg import register_vector
from openai import OpenAI
# --- Configuration ---
DB_URL = "postgresql://localhost/mydb"
EMBEDDING_MODEL = "text-embedding-3-small"
CHAT_MODEL = "gpt-4o"
TOP_K = 5
# --- Clients ---
db = psycopg.connect(DB_URL)
register_vector(db)
openai_client = OpenAI()
# --- Embedding ---
def embed(text):
response = openai_client.embeddings.create(input=text, model=EMBEDDING_MODEL)
return response.data[0].embedding
# --- Ingestion ---
def ingest(content, metadata):
embedding = embed(content)
db.execute(
"INSERT INTO documents (content, metadata, embedding) VALUES (%s, %s, %s::vector)",
(content, psycopg.types.json.Json(metadata), embedding)
)
db.commit()
# --- Retrieval ---
def retrieve(question, limit=TOP_K):
question_embedding = embed(question)
rows = db.execute(
"""
SELECT content, metadata, 1 - (embedding <=> %s::vector) AS similarity
FROM documents
ORDER BY embedding <=> %s::vector
LIMIT %s
""",
(question_embedding, question_embedding, limit)
).fetchall()
return [{"content": r[0], "metadata": r[1], "similarity": r[2]} for r in rows]
# --- Generation ---
def ask(question):
chunks = retrieve(question)
context = "\n\n---\n\n".join(
f"[Source: {c['metadata'].get('source', 'unknown')}]\n{c['content']}"
for c in chunks
)
response = openai_client.chat.completions.create(
model=CHAT_MODEL,
messages=[
{
"role": "system",
"content": (
"Answer based on the provided context. "
"If the context is insufficient, say so. Cite sources."
)
},
{
"role": "user",
"content": f"Context:\n{context}\n\nQuestion: {question}"
}
]
)
return response.choices[0].message.content
# --- Usage ---
if __name__ == "__main__":
# Ingest some documents
ingest("PostgreSQL supports JSONB for semi-structured data...", {"source": "docs/jsonb.md"})
ingest("pgvector adds vector similarity search to PostgreSQL...", {"source": "docs/pgvector.md"})
# Ask a question
answer = ask("How does PostgreSQL handle vector search?")
    print(answer)
This is intentionally framework-free. Understanding the primitives — embed, store, retrieve, generate — makes it clear what the frameworks abstract and helps you debug when things go wrong.
Integration with LangChain
For teams using LangChain, pgvector has a first-class integration:
from langchain_community.vectorstores import PGVector
from langchain_openai import OpenAIEmbeddings
CONNECTION_STRING = "postgresql://localhost/mydb"
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = PGVector(
collection_name="documents",
connection_string=CONNECTION_STRING,
embedding_function=embeddings,
)
# Add documents
vectorstore.add_texts(
texts=["PostgreSQL supports JSONB...", "pgvector adds vector search..."],
metadatas=[{"source": "docs/jsonb.md"}, {"source": "docs/pgvector.md"}]
)
# Search
results = vectorstore.similarity_search_with_score("vector search", k=5)
# Use as a retriever in a chain
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
LangChain helps when you need complex chains (multi-step reasoning, tool use, agents) or multiple retrievers. For straightforward question-answering RAG, the standalone implementation is simpler and easier to debug.
Integration with LlamaIndex
from llama_index.vector_stores.postgres import PGVectorStore
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
vector_store = PGVectorStore.from_params(
database="mydb",
host="localhost",
password="postgres",
port=5432,
table_name="documents",
embed_dim=1536,
)
# One-shot ingestion from a directory of files
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents, vector_store=vector_store)
# Query
query_engine = index.as_query_engine()
response = query_engine.query("How does PostgreSQL handle vector search?")
print(response)
LlamaIndex handles document loading from diverse sources (PDF, HTML, APIs), chunking, embedding, and retrieval in an integrated pipeline. It is most valuable for document-heavy applications that ingest from many source formats.
Performance and scale
Benchmarks — pgvector vs dedicated vector databases
I should present honest numbers, because the comparison deserves candor:
| Metric | pgvector (HNSW) | Pinecone | Qdrant |
|---|---|---|---|
| 1M vectors, top-10 query latency | 3-8 ms | 2-5 ms | 2-6 ms |
| 5M vectors, top-10 query latency | 8-20 ms | 3-8 ms | 4-10 ms |
| 10M vectors, top-10 query latency | 20-50 ms | 5-12 ms | 8-18 ms |
| Recall @10 (ef_search=100) | 95-98% | 99%+ | 98-99% |
| Filtered search (metadata) | Full SQL WHERE | Metadata filters | Payload filters |
| Operational overhead | Your existing PostgreSQL | Managed SaaS | Self-hosted service |
pgvector is competitive through a few million vectors. Beyond 5–10 million, purpose-built engines maintain better latency and memory efficiency. The value proposition for pgvector is operational simplicity. For detailed comparisons, see pgvector vs Pinecone and the broader vector database comparison.
Optimization tips
Reduce dimensions: Use the dimensions parameter with text-embedding-3-small to reduce from 1536 to 512 or 256. Smaller vectors mean faster distance calculations, smaller indexes, and less storage.
Limit the search space: Use WHERE clauses to narrow the search before vector comparison.
Connection pooling: For concurrent RAG requests, use a connection pool. See the connection pooling guide or FastAPI connection pooling guide.
VACUUM regularly: Vector indexes benefit from regular VACUUM, especially after bulk inserts or updates.
Production considerations
Embedding model versioning: Vectors from different embedding models are incompatible. If you change models, you must re-embed all existing documents. Store the model name in metadata.
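One lightweight way to track this is to stamp each row's metadata with the model that produced its vector, then filter on it at query time with metadata->>'embedding_model'. The embedding_model key is a convention assumed here, not a pgvector feature:

```python
EMBEDDING_MODEL = "text-embedding-3-small"

def make_metadata(source, chunk_index, model=EMBEDDING_MODEL):
    # Stamping the model name lets a WHERE metadata->>'embedding_model' = %s
    # filter exclude stale rows partway through a re-embedding migration.
    return {
        "source": source,
        "chunk_index": chunk_index,
        "embedding_model": model,
    }

meta = make_metadata("docs/user-guide.md", 3)
print(meta["embedding_model"])  # text-embedding-3-small
```

The GIN index on metadata created earlier makes this filter cheap.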
Incremental updates: Use the upsert pattern (ON CONFLICT DO UPDATE) for documents that change. A background worker that monitors source documents and re-embeds keeps the vector store current.
Monitoring: Track query latency, index size, and recall quality. Maintain a set of test queries with known good answers and periodically verify that the correct documents appear in the top-K results.
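Such a check can be a plain function over a table of (question, expected source) pairs, with the retriever injected so the harness stays independent of the database. A sketch, with a stubbed retriever standing in for the real pgvector query:

```python
def check_retrieval(test_cases, retrieve, k=5):
    # test_cases: list of (question, expected_source) pairs.
    # Returns the fraction of questions whose expected source document
    # appears among the top-k retrieved chunks.
    hits = 0
    for question, expected_source in test_cases:
        chunks = retrieve(question, k)
        if any(c["metadata"].get("source") == expected_source for c in chunks):
            hits += 1
    return hits / len(test_cases)

# Stubbed retriever for illustration:
fake_retrieve = lambda question, k: [{"metadata": {"source": "docs/pgvector.md"}}]
score = check_retrieval(
    [("How does vector search work?", "docs/pgvector.md")],
    fake_retrieve,
)
print(score)  # 1.0
```

Run this periodically (e.g. after bulk ingests or index rebuilds) and alert when the hit rate drops.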
Backup and recovery: Vectors are regular data in PostgreSQL. Standard pg_dump and pg_restore work. No separate backup strategy is needed.
Cost comparison: pgvector runs on your existing PostgreSQL instance. For 1 million vectors at 1536 dimensions, plan for roughly 6 GB of storage and 12–24 GB of index. Compare this to Pinecone's pricing (typically $70–300/month for moderate workloads).
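The storage figure follows from pgvector's on-disk format: a vector value stores 4 bytes per float32 dimension plus 8 bytes of per-value header, so 1 million 1536-dimension vectors need just under 6 GB before index overhead:

```python
def vector_storage_gb(n_vectors, dims):
    # Each pgvector value: 4 bytes per float32 dimension + 8 bytes of header.
    return n_vectors * (4 * dims + 8) / 1024**3

print(round(vector_storage_gb(1_000_000, 1536), 2))  # 5.73
```

Halving dimensions (the dimensions=512 option above) roughly halves both this figure and the index size.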
What Gold Lapel adds to vector workloads
Gold Lapel's proxy sees pgvector similarity queries alongside your application's regular OLTP queries. Vector search queries have a distinctive pattern — ORDER BY distance operator, LIMIT K — that the proxy recognizes and optimizes for.
Applications running RAG pipelines also run traditional queries: user authentication, session management, metadata lookups, application state. Gold Lapel optimizes both workloads simultaneously.
The pipeline described in this guide works independently. Gold Lapel attends to its performance as your vector table grows and your query mix becomes more complex.