Building a RAG Pipeline with pgvector and Python
The artwork is semantically similar to what was requested. A cosine similarity of 0.87 — close enough to retrieve, not close enough to frame.
Why PostgreSQL for RAG
Good evening. I see you are building a RAG pipeline, and you are considering a dedicated vector database. Before you add another service to the household, allow me to present an alternative.
Retrieval-Augmented Generation pipelines need a vector store to find relevant documents and feed them to a language model as context. The standard approach involves a separate vector database — Pinecone, Weaviate, Qdrant, or similar — but most teams already run PostgreSQL.
pgvector turns PostgreSQL into a vector database. It adds a vector column type, distance operators, and approximate nearest-neighbor indexes (HNSW and IVFFlat). Your documents, metadata, and embeddings live in the same database as the rest of your application data.
I should be forthcoming about the trade-off: pgvector is not as fast as purpose-built vector databases at billion-scale vector search. But for datasets up to a few million vectors — which covers the majority of production RAG applications — pgvector is competitive on latency and dramatically simpler to operate. No separate service to deploy, monitor, back up, or pay for. JOINs, transactions, WHERE clauses, and standard PostgreSQL tooling all work normally. For a deeper comparison, see the pgvector vs Pinecone analysis and the broader vector database comparison.
This guide builds a complete pipeline: document ingestion, chunking, embedding generation, vector storage in pgvector, similarity search, and LLM integration. The implementation uses raw Python with psycopg3 — no LangChain or LlamaIndex required. Framework integration is covered at the end for teams that want it.
Prerequisites and setup
Installing pgvector
pgvector requires PostgreSQL 13 or later. HNSW indexes were added in pgvector 0.5.0; version 0.7 or later is recommended for its performance improvements.
Enable the extension in your database:
CREATE EXTENSION IF NOT EXISTS vector;
Python environment
Install the required packages:
pip install "psycopg[binary]" pgvector openai
For local embeddings without API calls:
pip install sentence-transformers torch
Set up the connection with pgvector type registration:
import psycopg
from pgvector.psycopg import register_vector
conn = psycopg.connect("postgresql://localhost/mydb")
register_vector(conn)
If you are migrating from psycopg2, see the psycopg2 to psycopg3 migration guide for the key differences.
Designing the vector table
If you'll permit me, the schema design deserves care. What you store and how you index it determines everything about retrieval quality and performance.
Table schema for RAG documents
CREATE TABLE documents (
id bigserial PRIMARY KEY,
content text NOT NULL,
metadata jsonb DEFAULT '{}',
embedding vector(1536),
created_at timestamptz DEFAULT now()
);
Column-by-column:
- id bigserial — auto-incrementing primary key. Use bigserial rather than serial — vector tables tend to grow large.
- content text — the original text chunk. This is what gets returned to the LLM as context.
- metadata jsonb — JSONB for structured metadata: source URL, document title, chunk index, author. This column is what makes pgvector powerful for production RAG — you can filter vectors with standard SQL WHERE clauses. See the JSONB guide for advanced usage patterns.
- embedding vector(1536) — the vector column. The dimension must match your embedding model (OpenAI text-embedding-3-small: 1536, all-MiniLM-L6-v2: 384).
- created_at timestamptz — useful for incremental updates and time-bounded search.
Create a GIN index on the metadata column for efficient filtered queries:
CREATE INDEX idx_documents_metadata ON documents USING gin (metadata);
Chunking strategy
LLMs have context windows, and embedding models have token limits. A 50-page document must be split into chunks, each of which becomes one row in the documents table.
Chunk size guidelines: 200–500 tokens per chunk works well for most retrieval tasks. Start with 400 tokens and adjust based on retrieval quality.
Overlap: Include 50–100 tokens of overlap between adjacent chunks. This prevents information at chunk boundaries from being lost.
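To make the overlap concrete, here is a minimal character-level sketch (an illustrative helper, not part of any library): each chunk repeats the tail of the previous one, so text straddling a boundary still appears intact in at least one chunk.

```python
def split_with_overlap(text, size, overlap):
    # Each chunk starts (size - overlap) characters after the previous one,
    # so adjacent chunks share `overlap` characters at the boundary.
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

chunks = split_with_overlap("abcdefghij", size=4, overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

The same sliding-window logic applies whether the units are characters or tokens, as in the implementations below.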
Implementation with tiktoken (for OpenAI models):
import tiktoken
def chunk_text(text, max_tokens=400, overlap_tokens=50):
encoder = tiktoken.encoding_for_model("text-embedding-3-small")
tokens = encoder.encode(text)
chunks = []
start = 0
while start < len(tokens):
end = start + max_tokens
chunk_tokens = tokens[start:end]
chunk_text = encoder.decode(chunk_tokens)
chunks.append(chunk_text)
start += max_tokens - overlap_tokens
    return chunks
Simple character-based splitting (model-agnostic):
def chunk_text_simple(text, max_chars=1600, overlap_chars=200):
chunks = []
start = 0
while start < len(text):
end = start + max_chars
chunk = text[start:end]
chunks.append(chunk)
start += max_chars - overlap_chars
    return chunks
Each chunk becomes one row. Store the chunk index and source in metadata:
metadata = {
"source": "docs/user-guide.md",
"chunk_index": 3,
"total_chunks": 12,
"title": "User Guide"
}
Generating embeddings
OpenAI embeddings
text-embedding-3-small (1536 dimensions) is the recommended default for most RAG applications.
from openai import OpenAI
client = OpenAI() # uses OPENAI_API_KEY env var
def get_embedding(text, model="text-embedding-3-small"):
response = client.embeddings.create(
input=text,
model=model
)
    return response.data[0].embedding
Batching: The embeddings API accepts up to 2,048 inputs per request:
def get_embeddings_batch(texts, model="text-embedding-3-small"):
response = client.embeddings.create(
input=texts,
model=model
)
    return [item.embedding for item in response.data]
Rate limiting: Implement exponential backoff for production pipelines:
import time
def get_embeddings_with_retry(texts, model="text-embedding-3-small", max_retries=5):
for attempt in range(max_retries):
try:
return get_embeddings_batch(texts, model)
except Exception as e:
if attempt == max_retries - 1:
raise
wait = 2 ** attempt
            time.sleep(wait)
Reduced dimensions: text-embedding-3-small supports a dimensions parameter to reduce output from 1536 to 512 or 256:
response = client.embeddings.create(
input=text,
model="text-embedding-3-small",
dimensions=512 # reduced from default 1536
)
Local embeddings with sentence-transformers
For teams that need to keep data local, avoid API costs, or run without network access:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2") # 384 dimensions
def get_embedding_local(text):
return model.encode(text).tolist()
def get_embeddings_batch_local(texts):
    return model.encode(texts, batch_size=64, show_progress_bar=True).tolist()
Trade-offs: No API calls, no per-token cost, no rate limits, and data stays on your machine. Embedding large datasets in batches is slow on CPU — a GPU accelerates this substantially.
Storing embeddings in pgvector
Insert individual documents:
def insert_document(conn, content, metadata, embedding):
conn.execute(
"""
INSERT INTO documents (content, metadata, embedding)
VALUES (%s, %s, %s::vector)
""",
(content, psycopg.types.json.Json(metadata), embedding)
    )
Bulk insert for large datasets:
def insert_documents_bulk(conn, documents):
"""
documents: list of (content, metadata, embedding) tuples
"""
with conn.cursor() as cur:
with cur.copy(
"COPY documents (content, metadata, embedding) FROM STDIN WITH (FORMAT text)"
) as copy:
            import json  # serialize metadata for the text-format COPY stream
            for content, metadata, embedding in documents:
                # pgvector's text input format: "[x1,x2,...]"
                vec_str = "[" + ",".join(str(x) for x in embedding) + "]"
                meta_str = json.dumps(metadata)
                copy.write_row((content, meta_str, vec_str))
    conn.commit()
Upsert pattern for updating embeddings when source documents change:
def upsert_document(conn, source_id, content, metadata, embedding):
conn.execute(
"""
INSERT INTO documents (id, content, metadata, embedding)
VALUES (%s, %s, %s, %s::vector)
ON CONFLICT (id) DO UPDATE SET
content = EXCLUDED.content,
metadata = EXCLUDED.metadata,
embedding = EXCLUDED.embedding
""",
(source_id, content, psycopg.types.json.Json(metadata), embedding)
)
    conn.commit()
Similarity search — querying vectors
The similarity query
This is the heart of the arrangement: embed the user's question with the same model used for documents, then find the most similar chunks.
SELECT id, content, metadata,
1 - (embedding <=> %s::vector) AS similarity
FROM documents
ORDER BY embedding <=> %s::vector
LIMIT 5;
Distance operators:
- <=> — cosine distance (lower is more similar). Use for text embeddings — this is the standard.
- <-> — L2 (Euclidean) distance. Use when magnitude matters.
- <#> — negative inner product. Use with normalized vectors.
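The `1 - (embedding <=> ...)` expression in the queries above converts cosine distance into a similarity score. Here is the same quantity in plain Python (a sketch for intuition, not how pgvector computes it internally):

```python
import math

def cosine_distance(a, b):
    # Cosine distance = 1 - cosine similarity; this matches what <=> returns.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)

print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # 0.0 (identical direction)
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0 (orthogonal)
```

Note that magnitude drops out entirely: [3, 4] and [6, 8] have distance 0.0, which is why <=> is the right choice for text embeddings.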
Python implementation:
def search_similar(conn, query_text, limit=5):
query_embedding = get_embedding(query_text)
results = conn.execute(
"""
SELECT id, content, metadata,
1 - (embedding <=> %s::vector) AS similarity
FROM documents
ORDER BY embedding <=> %s::vector
LIMIT %s
""",
(query_embedding, query_embedding, limit)
).fetchall()
    return results
Filtered similarity search
This is where pgvector earns its keep. Combining vector search with metadata filters is its strongest advantage over dedicated vector databases:
SELECT id, content, metadata,
1 - (embedding <=> %s::vector) AS similarity
FROM documents
WHERE metadata->>'source' = 'product-docs'
AND created_at > '2026-01-01'
ORDER BY embedding <=> %s::vector
LIMIT 5;
Use cases: multi-tenant RAG, time-bounded search, category-specific retrieval, access control.
def search_filtered(conn, query_text, source=None, after_date=None, limit=5):
query_embedding = get_embedding(query_text)
conditions = []
params = [query_embedding, query_embedding]
if source:
conditions.append("metadata->>'source' = %s")
params.append(source)
if after_date:
conditions.append("created_at > %s")
params.append(after_date)
where_clause = ""
if conditions:
where_clause = "WHERE " + " AND ".join(conditions)
params.append(limit)
results = conn.execute(
f"""
SELECT id, content, metadata,
1 - (embedding <=> %s::vector) AS similarity
FROM documents
{where_clause}
ORDER BY embedding <=> %s::vector
LIMIT %s
""",
params
).fetchall()
    return results
Creating vector indexes
Without an index, pgvector performs exact nearest-neighbor search — a sequential scan of all vectors. Perfectly accurate, but for large datasets, the performance leaves something to be desired.
HNSW index (recommended for production):
CREATE INDEX idx_documents_embedding ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
HNSW builds a multi-layer graph for fast approximate nearest-neighbor search. m = 16 is a good default for connections per node. ef_construction = 64 controls build-time thoroughness. Tune query-time accuracy with SET hnsw.ef_search (the default is 40; higher values improve recall at the cost of latency).
IVFFlat index (alternative for constrained environments):
CREATE INDEX idx_documents_embedding_ivf ON documents
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
Which index to use: HNSW for production workloads where recall quality matters. IVFFlat when build time or memory is constrained; as a starting point, pgvector's guidance is lists ≈ rows / 1000 for tables up to about a million rows. For a deeper dive, see the pgvector performance tuning guide and the pgvector query optimization guide.
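A practical way to choose between the two (and to tune ef_search or lists afterwards) is to measure recall on a sample of queries: run each query once with the index and once with an exact scan, then compare the returned ids. The helper below is a plain-Python sketch; the id lists are illustrative.

```python
def recall_at_k(approx_ids, exact_ids, k=10):
    # Fraction of the true top-k (from an exact sequential scan) that the
    # approximate index also returned in its top-k.
    approx = set(approx_ids[:k])
    exact = exact_ids[:k]
    return sum(1 for doc_id in exact if doc_id in approx) / len(exact)

# Illustrative: the index missed one of the true top-5 neighbors.
print(recall_at_k([1, 2, 3, 4, 6], [1, 2, 3, 4, 5], k=5))  # 0.8
```

If measured recall is below your target, raise ef_search (HNSW) or probes (IVFFlat) and re-measure.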
Integrating with an LLM — the complete RAG pipeline
The RAG flow
The complete pipeline has five steps:
- User asks a question
- Embed the question using the same model that embedded the documents
- Query pgvector for the top-K most similar chunks
- Construct a prompt with the question and retrieved chunks as context
- Send the prompt to an LLM and return the response
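The five steps above can be sketched as a single function with the embedding, retrieval, and generation stages injected as callables. Everything here is illustrative scaffolding: the lambdas stand in for the real OpenAI and pgvector calls.

```python
def answer_question(question, embed, search, generate, top_k=5):
    # Steps 1-2: embed the question with the same model used for documents.
    query_vector = embed(question)
    # Step 3: retrieve the top-K most similar chunks.
    chunks = search(query_vector, top_k)
    # Step 4: build a prompt with the retrieved chunks as context.
    context = "\n\n---\n\n".join(chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    # Step 5: generate the answer.
    return generate(prompt)

# Stub stages to show the data flow:
result = answer_question(
    "What is pgvector?",
    embed=lambda question: [0.0, 0.0, 0.0],
    search=lambda vector, k: ["pgvector adds vector search to PostgreSQL."],
    generate=lambda prompt: prompt.splitlines()[-1],
)
print(result)  # Question: What is pgvector?
```

Keeping the stages separable like this makes each one easy to swap or test in isolation.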
Prompt construction
def build_prompt(question, context_chunks):
context_text = "\n\n---\n\n".join(
f"Source: {chunk['metadata'].get('source', 'unknown')}\n{chunk['content']}"
for chunk in context_chunks
)
system_message = (
"You are a helpful assistant. Answer the user's question based on the "
"provided context. If the context does not contain enough information to "
"answer the question, say so. Cite the source when possible."
)
user_message = f"""Context:
{context_text}
Question: {question}"""
    return system_message, user_message
Standalone implementation (no framework)
Allow me to present the complete RAG pipeline in under 80 lines, using psycopg3, pgvector, and the OpenAI client. No LangChain, no LlamaIndex — just the primitives:
import psycopg
from pgvector.psycopg import register_vector
from openai import OpenAI
# --- Configuration ---
DB_URL = "postgresql://localhost/mydb"
EMBEDDING_MODEL = "text-embedding-3-small"
CHAT_MODEL = "gpt-4o"
TOP_K = 5
# --- Clients ---
db = psycopg.connect(DB_URL)
register_vector(db)
openai_client = OpenAI()
# --- Embedding ---
def embed(text):
response = openai_client.embeddings.create(input=text, model=EMBEDDING_MODEL)
return response.data[0].embedding
# --- Ingestion ---
def ingest(content, metadata):
embedding = embed(content)
db.execute(
"INSERT INTO documents (content, metadata, embedding) VALUES (%s, %s, %s::vector)",
(content, psycopg.types.json.Json(metadata), embedding)
)
db.commit()
# --- Retrieval ---
def retrieve(question, limit=TOP_K):
question_embedding = embed(question)
rows = db.execute(
"""
SELECT content, metadata, 1 - (embedding <=> %s::vector) AS similarity
FROM documents
ORDER BY embedding <=> %s::vector
LIMIT %s
""",
(question_embedding, question_embedding, limit)
).fetchall()
return [{"content": r[0], "metadata": r[1], "similarity": r[2]} for r in rows]
# --- Generation ---
def ask(question):
chunks = retrieve(question)
context = "\n\n---\n\n".join(
f"[Source: {c['metadata'].get('source', 'unknown')}]\n{c['content']}"
for c in chunks
)
response = openai_client.chat.completions.create(
model=CHAT_MODEL,
messages=[
{
"role": "system",
"content": (
"Answer based on the provided context. "
"If the context is insufficient, say so. Cite sources."
)
},
{
"role": "user",
"content": f"Context:\n{context}\n\nQuestion: {question}"
}
]
)
return response.choices[0].message.content
# --- Usage ---
if __name__ == "__main__":
# Ingest some documents
ingest("PostgreSQL supports JSONB for semi-structured data...", {"source": "docs/jsonb.md"})
ingest("pgvector adds vector similarity search to PostgreSQL...", {"source": "docs/pgvector.md"})
# Ask a question
answer = ask("How does PostgreSQL handle vector search?")
    print(answer)
This is intentionally framework-free. Understanding the primitives — embed, store, retrieve, generate — makes it clear what the frameworks abstract and helps you debug when things go wrong.
Integration with LangChain
For teams using LangChain, pgvector has a first-class integration:
from langchain_community.vectorstores import PGVector
from langchain_openai import OpenAIEmbeddings
CONNECTION_STRING = "postgresql://localhost/mydb"
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = PGVector(
collection_name="documents",
connection_string=CONNECTION_STRING,
embedding_function=embeddings,
)
# Add documents
vectorstore.add_texts(
texts=["PostgreSQL supports JSONB...", "pgvector adds vector search..."],
metadatas=[{"source": "docs/jsonb.md"}, {"source": "docs/pgvector.md"}]
)
# Search
results = vectorstore.similarity_search_with_score("vector search", k=5)
# Use as a retriever in a chain
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
LangChain helps when you need complex chains (multi-step reasoning, tool use, agents) or multiple retrievers. For straightforward question-answering RAG, the standalone implementation is simpler and easier to debug.
Integration with LlamaIndex
from llama_index.vector_stores.postgres import PGVectorStore
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
vector_store = PGVectorStore.from_params(
database="mydb",
host="localhost",
password="postgres",
port=5432,
table_name="documents",
embed_dim=1536,
)
# One-shot ingestion from a directory of files
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents, vector_store=vector_store)
# Query
query_engine = index.as_query_engine()
response = query_engine.query("How does PostgreSQL handle vector search?")
print(response)
LlamaIndex handles document loading from diverse sources (PDF, HTML, APIs), chunking, embedding, and retrieval in an integrated pipeline. It is most valuable for document-heavy applications that ingest from many source formats.
Performance and scale
Benchmarks — pgvector vs dedicated vector databases
I should present honest numbers, because the comparison deserves candor:
| Metric | pgvector (HNSW) | Pinecone | Qdrant |
|---|---|---|---|
| 1M vectors, top-10 query latency | 3-8 ms | 2-5 ms | 2-6 ms |
| 5M vectors, top-10 query latency | 8-20 ms | 3-8 ms | 4-10 ms |
| 10M vectors, top-10 query latency | 20-50 ms | 5-12 ms | 8-18 ms |
| Recall @10 (ef_search=100) | 95-98% | 99%+ | 98-99% |
| Filtered search (metadata) | Full SQL WHERE | Metadata filters | Payload filters |
| Operational overhead | Your existing PostgreSQL | Managed SaaS | Self-hosted service |
pgvector is competitive through a few million vectors. Beyond 5–10 million, purpose-built engines maintain better latency and memory efficiency. The value proposition for pgvector is operational simplicity. For detailed comparisons, see pgvector vs Pinecone and the broader vector database comparison.
Optimization tips
Reduce dimensions: Use the dimensions parameter with text-embedding-3-small to reduce from 1536 to 512 or 256. Smaller vectors mean faster distance calculations, smaller indexes, and less storage.
Limit the search space: Use WHERE clauses to narrow the search before vector comparison.
Connection pooling: For concurrent RAG requests, use a connection pool. See the connection pooling guide or FastAPI connection pooling guide.
VACUUM regularly: Vector indexes benefit from regular VACUUM, especially after bulk inserts or updates.
Production considerations
Embedding model versioning: Vectors from different embedding models are incompatible. If you change models, you must re-embed all existing documents. Store the model name in metadata.
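One lightweight way to track this is to stamp each row's metadata with the model that produced its vector, then filter on it at query time with metadata->>'embedding_model'. The embedding_model key is a convention assumed here, not a pgvector feature:

```python
EMBEDDING_MODEL = "text-embedding-3-small"

def make_metadata(source, chunk_index, model=EMBEDDING_MODEL):
    # Stamping the model name lets a WHERE metadata->>'embedding_model' = %s
    # filter exclude stale rows partway through a re-embedding migration.
    return {
        "source": source,
        "chunk_index": chunk_index,
        "embedding_model": model,
    }

meta = make_metadata("docs/user-guide.md", 3)
print(meta["embedding_model"])  # text-embedding-3-small
```

The GIN index on metadata created earlier makes this filter cheap.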
Incremental updates: Use the upsert pattern (ON CONFLICT DO UPDATE) for documents that change. A background worker that monitors source documents and re-embeds keeps the vector store current.
Monitoring: Track query latency, index size, and recall quality. Maintain a set of test queries with known good answers and periodically verify that the correct documents appear in the top-K results.
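Such a check can be a plain function over a table of (question, expected source) pairs, with the retriever injected so the harness stays independent of the database. A sketch, with a stubbed retriever standing in for the real pgvector query:

```python
def check_retrieval(test_cases, retrieve, k=5):
    # test_cases: list of (question, expected_source) pairs.
    # Returns the fraction of questions whose expected source document
    # appears among the top-k retrieved chunks.
    hits = 0
    for question, expected_source in test_cases:
        chunks = retrieve(question, k)
        if any(c["metadata"].get("source") == expected_source for c in chunks):
            hits += 1
    return hits / len(test_cases)

# Stubbed retriever for illustration:
fake_retrieve = lambda question, k: [{"metadata": {"source": "docs/pgvector.md"}}]
score = check_retrieval(
    [("How does vector search work?", "docs/pgvector.md")],
    fake_retrieve,
)
print(score)  # 1.0
```

Run this periodically (e.g. after bulk ingests or index rebuilds) and alert when the hit rate drops.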
Backup and recovery: Vectors are regular data in PostgreSQL. Standard pg_dump and pg_restore work. No separate backup strategy is needed.
Cost comparison: pgvector runs on your existing PostgreSQL instance. For 1 million vectors at 1536 dimensions, plan for roughly 6 GB of storage and 12–24 GB of index. Compare this to Pinecone's pricing (typically $70–300/month for moderate workloads).
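The storage figure follows from pgvector's on-disk format: a vector value stores 4 bytes per float32 dimension plus 8 bytes of per-value header, so 1 million 1536-dimension vectors need just under 6 GB before index overhead:

```python
def vector_storage_gb(n_vectors, dims):
    # Each pgvector value: 4 bytes per float32 dimension + 8 bytes of header.
    return n_vectors * (4 * dims + 8) / 1024**3

print(round(vector_storage_gb(1_000_000, 1536), 2))  # 5.73
```

Halving dimensions (the dimensions=512 option above) roughly halves both this figure and the index size.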
What Gold Lapel adds to vector workloads
Gold Lapel's proxy sees pgvector similarity queries alongside your application's regular OLTP queries. Vector search queries have a distinctive pattern — ORDER BY distance operator, LIMIT K — that the proxy recognizes and optimizes for.
Applications running RAG pipelines also run traditional queries: user authentication, session management, metadata lookups, application state. Gold Lapel optimizes both workloads simultaneously.
The pipeline described in this guide works independently. Gold Lapel attends to its performance as your vector table grows and your query mix becomes more complex.