Building a RAG Pipeline with pgvector and Python

March 27, 2026 · 22 min read
The artwork is semantically similar to what was requested. A cosine similarity of 0.87 — close enough to retrieve, not close enough to frame. We have asked the artist to move nearer to the query.

Why PostgreSQL for RAG

Good evening. I see you are building a RAG pipeline, and you are considering a dedicated vector database. Before you add another service to the household, allow me to present an alternative.

Retrieval-Augmented Generation pipelines need a vector store to find relevant documents and feed them to a language model as context. The standard approach involves a separate vector database — Pinecone, Weaviate, Qdrant, or similar — but most teams already run PostgreSQL.

pgvector turns PostgreSQL into a vector database. It adds a vector column type, distance operators, and approximate nearest-neighbor indexes (HNSW and IVFFlat). Your documents, metadata, and embeddings live in the same database as the rest of your application data.

I should be forthcoming about the trade-off: pgvector is not as fast as purpose-built vector databases at billion-scale vector search. But for datasets up to a few million vectors — which covers the majority of production RAG applications — pgvector is competitive on latency and dramatically simpler to operate. No separate service to deploy, monitor, back up, or pay for. JOINs, transactions, WHERE clauses, and standard PostgreSQL tooling all work normally. For a deeper comparison, see the pgvector vs Pinecone analysis and the broader vector database comparison.

This guide builds a complete pipeline: document ingestion, chunking, embedding generation, vector storage in pgvector, similarity search, and LLM integration. The implementation uses raw Python with psycopg3 — no LangChain or LlamaIndex required. Framework integration is covered at the end for teams that want it.

Prerequisites and setup

Installing pgvector

pgvector requires PostgreSQL 13 or later. HNSW indexes arrived in pgvector 0.5.0; version 0.7 or later is recommended for the performance improvements added since.

Enable the extension in your database:

Enable pgvector
CREATE EXTENSION IF NOT EXISTS vector;

Python environment

Install the required packages:

Install Python dependencies
pip install "psycopg[binary]" pgvector openai

For local embeddings without API calls:

Install local embedding dependencies
pip install sentence-transformers torch

Set up the connection with pgvector type registration:

Connect with pgvector type registration
import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect("postgresql://localhost/mydb")
register_vector(conn)

If you are migrating from psycopg2, see the psycopg2 to psycopg3 migration guide for the key differences.

Designing the vector table

If you'll permit me, the schema design deserves care. What you store and how you index it determines everything about retrieval quality and performance.

Table schema for RAG documents

Vector table schema
CREATE TABLE documents (
    id bigserial PRIMARY KEY,
    content text NOT NULL,
    metadata jsonb DEFAULT '{}',
    embedding vector(1536),
    created_at timestamptz DEFAULT now()
);

Column-by-column:

  • id bigserial — auto-incrementing primary key. Use bigserial rather than serial — vector tables tend to grow large.
  • content text — the original text chunk. This is what gets returned to the LLM as context.
  • metadata jsonb — JSONB for structured metadata: source URL, document title, chunk index, author. This column is what makes pgvector powerful for production RAG — you can filter vectors with standard SQL WHERE clauses. See the JSONB guide for advanced usage patterns.
  • embedding vector(1536) — the vector column. The dimension must match your embedding model (OpenAI text-embedding-3-small: 1536, all-MiniLM-L6-v2: 384).
  • created_at timestamptz — useful for incremental updates and time-bounded search.

Create a GIN index on the metadata column for efficient filtered queries:

GIN index on metadata
CREATE INDEX idx_documents_metadata ON documents USING gin (metadata);
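
The table as defined above has no index on the embedding column, so every similarity query is a sequential scan. A sketch of the HNSW index mentioned earlier, assuming cosine distance and default build parameters (the index name is illustrative):

```sql
-- Approximate nearest-neighbor index for the cosine distance operator (<=>)
CREATE INDEX idx_documents_embedding ON documents
    USING hnsw (embedding vector_cosine_ops);

-- Optional: raise recall at the cost of latency at query time
SET hnsw.ef_search = 100;
```

When possible, build this index after bulk loading — building over a populated table is considerably faster than maintaining the graph row by row during ingestion.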

Chunking strategy

LLMs have context windows, and embedding models have token limits. A 50-page document must be split into chunks, each of which becomes one row in the documents table.

Chunk size guidelines: 200–500 tokens per chunk works well for most retrieval tasks. Start with 400 tokens and adjust based on retrieval quality.

Overlap: Include 50–100 tokens of overlap between adjacent chunks. This prevents information at chunk boundaries from being lost.

Implementation with tiktoken (for OpenAI models):

Token-based chunking with tiktoken
import tiktoken

def chunk_text(text, max_tokens=400, overlap_tokens=50):
    encoder = tiktoken.encoding_for_model("text-embedding-3-small")
    tokens = encoder.encode(text)
    chunks = []
    start = 0

    while start < len(tokens):
        end = start + max_tokens
        chunk = encoder.decode(tokens[start:end])
        chunks.append(chunk)
        start += max_tokens - overlap_tokens

    return chunks

Simple character-based splitting (model-agnostic):

Character-based chunking
def chunk_text_simple(text, max_chars=1600, overlap_chars=200):
    chunks = []
    start = 0

    while start < len(text):
        end = start + max_chars
        chunk = text[start:end]
        chunks.append(chunk)
        start += max_chars - overlap_chars

    return chunks

Each chunk becomes one row. Store the chunk index and source in metadata:

Chunk metadata example
metadata = {
    "source": "docs/user-guide.md",
    "chunk_index": 3,
    "total_chunks": 12,
    "title": "User Guide"
}
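
Putting chunking and metadata together: a minimal sketch of a hypothetical `prepare_rows` helper (it inlines the character-based splitter above so it stands alone) that yields `(content, metadata)` pairs ready for insertion:

```python
def prepare_rows(text, source, title, max_chars=1600, overlap_chars=200):
    # Split into overlapping character chunks, as in chunk_text_simple above
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += max_chars - overlap_chars

    # Pair each chunk with the metadata it will carry into the documents table
    return [
        (
            chunk,
            {
                "source": source,
                "chunk_index": i,
                "total_chunks": len(chunks),
                "title": title,
            },
        )
        for i, chunk in enumerate(chunks)
    ]
```

Each returned pair maps directly onto the content and metadata columns; the embedding is computed in the next step.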

Generating embeddings

OpenAI embeddings

text-embedding-3-small (1536 dimensions) is the recommended default for most RAG applications.

Single embedding
from openai import OpenAI

client = OpenAI()  # uses OPENAI_API_KEY env var

def get_embedding(text, model="text-embedding-3-small"):
    response = client.embeddings.create(
        input=text,
        model=model
    )
    return response.data[0].embedding

Batching: The embeddings API accepts up to 2,048 inputs per request:

Batch embeddings
def get_embeddings_batch(texts, model="text-embedding-3-small"):
    response = client.embeddings.create(
        input=texts,
        model=model
    )
    return [item.embedding for item in response.data]
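
For corpora larger than one request allows, a small wrapper can split the work into API-sized batches. This `embed_all` helper is an assumption, not part of the OpenAI client; it takes the batch function (e.g. `get_embeddings_batch` above) as a parameter:

```python
def embed_all(texts, embed_batch, batch_size=2048):
    # embed_batch: callable mapping a list of texts to a list of embeddings
    embeddings = []
    for i in range(0, len(texts), batch_size):
        embeddings.extend(embed_batch(texts[i:i + batch_size]))
    return embeddings
```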

Rate limiting: Implement exponential backoff for production pipelines:

Embeddings with retry
import time

def get_embeddings_with_retry(texts, model="text-embedding-3-small", max_retries=5):
    for attempt in range(max_retries):
        try:
            return get_embeddings_batch(texts, model)
        except Exception as e:
            if attempt == max_retries - 1:
                raise
            wait = 2 ** attempt
            time.sleep(wait)

Reduced dimensions: text-embedding-3-small supports a dimensions parameter to reduce output from 1536 to 512 or 256:

Reduced dimensions
response = client.embeddings.create(
    input=text,
    model="text-embedding-3-small",
    dimensions=512  # reduced from default 1536
)

Local embeddings with sentence-transformers

For teams that need to keep data local, avoid API costs, or run without network access:

Local embeddings with sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384 dimensions

def get_embedding_local(text):
    return model.encode(text).tolist()

def get_embeddings_batch_local(texts):
    return model.encode(texts, batch_size=64, show_progress_bar=True).tolist()

Trade-offs: No API calls, no per-token cost, no rate limits, and data stays on your machine. Batch embedding large datasets is slow on CPU — a GPU accelerates this substantially.

Storing embeddings in pgvector

Insert individual documents:

Insert a single document
def insert_document(conn, content, metadata, embedding):
    conn.execute(
        """
        INSERT INTO documents (content, metadata, embedding)
        VALUES (%s, %s, %s::vector)
        """,
        (content, psycopg.types.json.Json(metadata), embedding)
    )
    conn.commit()

Bulk insert for large datasets:

Bulk insert with COPY
import json

def insert_documents_bulk(conn, documents):
    """
    documents: list of (content, metadata, embedding) tuples
    """
    with conn.cursor() as cur:
        with cur.copy(
            "COPY documents (content, metadata, embedding) FROM STDIN WITH (FORMAT text)"
        ) as copy:
            for content, metadata, embedding in documents:
                vec_str = "[" + ",".join(str(x) for x in embedding) + "]"
                meta_str = json.dumps(metadata)
                copy.write_row((content, meta_str, vec_str))
    conn.commit()

Upsert pattern for updating embeddings when source documents change:

Upsert pattern
def upsert_document(conn, source_id, content, metadata, embedding):
    conn.execute(
        """
        INSERT INTO documents (id, content, metadata, embedding)
        VALUES (%s, %s, %s, %s::vector)
        ON CONFLICT (id) DO UPDATE SET
            content = EXCLUDED.content,
            metadata = EXCLUDED.metadata,
            embedding = EXCLUDED.embedding
        """,
        (source_id, content, psycopg.types.json.Json(metadata), embedding)
    )
    conn.commit()

Integrating with an LLM — the complete RAG pipeline

The RAG flow

The complete pipeline has five steps:

  1. User asks a question
  2. Embed the question using the same model that embedded the documents
  3. Query pgvector for the top-K most similar chunks
  4. Construct a prompt with the question and retrieved chunks as context
  5. Send the prompt to an LLM and return the response

Prompt construction

Prompt template
def build_prompt(question, context_chunks):
    context_text = "\n\n---\n\n".join(
        f"Source: {chunk['metadata'].get('source', 'unknown')}\n{chunk['content']}"
        for chunk in context_chunks
    )

    system_message = (
        "You are a helpful assistant. Answer the user's question based on the "
        "provided context. If the context does not contain enough information to "
        "answer the question, say so. Cite the source when possible."
    )

    user_message = f"""Context:
{context_text}

Question: {question}"""

    return system_message, user_message

Standalone implementation (no framework)

Allow me to present the complete RAG pipeline in under 80 lines, using psycopg3, pgvector, and the OpenAI client. No LangChain, no LlamaIndex — just the primitives:

Complete RAG pipeline — 80 lines
import psycopg
from pgvector.psycopg import register_vector
from openai import OpenAI

# --- Configuration ---

DB_URL = "postgresql://localhost/mydb"
EMBEDDING_MODEL = "text-embedding-3-small"
CHAT_MODEL = "gpt-4o"
TOP_K = 5

# --- Clients ---

db = psycopg.connect(DB_URL)
register_vector(db)
openai_client = OpenAI()

# --- Embedding ---

def embed(text):
    response = openai_client.embeddings.create(input=text, model=EMBEDDING_MODEL)
    return response.data[0].embedding

# --- Ingestion ---

def ingest(content, metadata):
    embedding = embed(content)
    db.execute(
        "INSERT INTO documents (content, metadata, embedding) VALUES (%s, %s, %s::vector)",
        (content, psycopg.types.json.Json(metadata), embedding)
    )
    db.commit()

# --- Retrieval ---

def retrieve(question, limit=TOP_K):
    question_embedding = embed(question)
    rows = db.execute(
        """
        SELECT content, metadata, 1 - (embedding <=> %s::vector) AS similarity
        FROM documents
        ORDER BY embedding <=> %s::vector
        LIMIT %s
        """,
        (question_embedding, question_embedding, limit)
    ).fetchall()
    return [{"content": r[0], "metadata": r[1], "similarity": r[2]} for r in rows]

# --- Generation ---

def ask(question):
    chunks = retrieve(question)

    context = "\n\n---\n\n".join(
        f"[Source: {c['metadata'].get('source', 'unknown')}]\n{c['content']}"
        for c in chunks
    )

    response = openai_client.chat.completions.create(
        model=CHAT_MODEL,
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer based on the provided context. "
                    "If the context is insufficient, say so. Cite sources."
                )
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}"
            }
        ]
    )

    return response.choices[0].message.content

# --- Usage ---

if __name__ == "__main__":
    # Ingest some documents
    ingest("PostgreSQL supports JSONB for semi-structured data...", {"source": "docs/jsonb.md"})
    ingest("pgvector adds vector similarity search to PostgreSQL...", {"source": "docs/pgvector.md"})

    # Ask a question
    answer = ask("How does PostgreSQL handle vector search?")
    print(answer)

This is intentionally framework-free. Understanding the primitives — embed, store, retrieve, generate — makes it clear what the frameworks abstract and helps you debug when things go wrong.
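
The retrieve step ranks rows with the <=> cosine-distance operator. For intuition, a pure-Python sketch of what that operator computes (pgvector does the equivalent in C, over the index):

```python
import math

def cosine_distance(a, b):
    # Equivalent to pgvector's <=> operator: 1 minus cosine similarity
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)
```

A distance of 0 means the vectors point the same way; 1 means they are orthogonal — which is why the queries above report `1 - (embedding <=> query)` as similarity.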

Integration with LangChain

For teams using LangChain, pgvector has a first-class integration:

LangChain integration
from langchain_community.vectorstores import PGVector
from langchain_openai import OpenAIEmbeddings

CONNECTION_STRING = "postgresql://localhost/mydb"

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = PGVector(
    collection_name="documents",
    connection_string=CONNECTION_STRING,
    embedding_function=embeddings,
)

# Add documents
vectorstore.add_texts(
    texts=["PostgreSQL supports JSONB...", "pgvector adds vector search..."],
    metadatas=[{"source": "docs/jsonb.md"}, {"source": "docs/pgvector.md"}]
)

# Search
results = vectorstore.similarity_search_with_score("vector search", k=5)

# Use as a retriever in a chain
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

LangChain helps when you need complex chains (multi-step reasoning, tool use, agents) or multiple retrievers. For straightforward question-answering RAG, the standalone implementation is simpler and easier to debug.

Integration with LlamaIndex

LlamaIndex integration
from llama_index.vector_stores.postgres import PGVectorStore
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

vector_store = PGVectorStore.from_params(
    database="mydb",
    host="localhost",
    password="postgres",
    port=5432,
    table_name="documents",
    embed_dim=1536,
)

# One-shot ingestion from a directory of files
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents, vector_store=vector_store)

# Query
query_engine = index.as_query_engine()
response = query_engine.query("How does PostgreSQL handle vector search?")
print(response)

LlamaIndex handles document loading from diverse sources (PDF, HTML, APIs), chunking, embedding, and retrieval in an integrated pipeline. It is most valuable for document-heavy applications that ingest from many source formats.

Performance and scale

Benchmarks — pgvector vs dedicated vector databases

I should present honest numbers, because the comparison deserves candor:

Metric                            | pgvector (HNSW)          | Pinecone         | Qdrant
1M vectors, top-10 query latency  | 3–8 ms                   | 2–5 ms           | 2–6 ms
5M vectors, top-10 query latency  | 8–20 ms                  | 3–8 ms           | 4–10 ms
10M vectors, top-10 query latency | 20–50 ms                 | 5–12 ms          | 8–18 ms
Recall @10 (ef_search=100)        | 95–98%                   | 99%+             | 98–99%
Filtered search (metadata)        | Full SQL WHERE           | Metadata filters | Payload filters
Operational overhead              | Your existing PostgreSQL | Managed SaaS     | Self-hosted service

pgvector is competitive through a few million vectors. Beyond 5–10 million, purpose-built engines maintain better latency and memory efficiency. The value proposition for pgvector is operational simplicity. For detailed comparisons, see pgvector vs Pinecone and the broader vector database comparison.

Optimization tips

Reduce dimensions: Use the dimensions parameter with text-embedding-3-small to reduce from 1536 to 512 or 256. Smaller vectors mean faster distance calculations, smaller indexes, and less storage.

Limit the search space: Use WHERE clauses to narrow the search before vector comparison.
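
A sketch of such a narrowed query, assuming the documents schema from earlier (the query-embedding parameter is a placeholder to be bound by your driver):

```sql
-- Metadata and recency filters shrink the candidate set before vector ranking
SELECT content, metadata
FROM documents
WHERE metadata->>'source' = 'docs/user-guide.md'
  AND created_at > now() - interval '90 days'
ORDER BY embedding <=> %(query_embedding)s::vector
LIMIT 5;
```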

Connection pooling: For concurrent RAG requests, use a connection pool. See the connection pooling guide or FastAPI connection pooling guide.

VACUUM regularly: Vector indexes benefit from regular VACUUM, especially after bulk inserts or updates.

Production considerations

Embedding model versioning: Vectors from different embedding models are incompatible. If you change models, you must re-embed all existing documents. Store the model name in metadata.
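
A minimal sketch of that convention — a hypothetical `make_metadata` helper that stamps every chunk with the model name, so stale vectors can be found and re-embedded later:

```python
EMBEDDING_MODEL = "text-embedding-3-small"

def make_metadata(source, chunk_index, total_chunks):
    # Recording the model name lets a migration query find rows embedded
    # with an old model: WHERE metadata->>'embedding_model' != current model
    return {
        "source": source,
        "chunk_index": chunk_index,
        "total_chunks": total_chunks,
        "embedding_model": EMBEDDING_MODEL,
    }
```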

Incremental updates: Use the upsert pattern (ON CONFLICT DO UPDATE) for documents that change. A background worker that monitors source documents and re-embeds keeps the vector store current.

Monitoring: Track query latency, index size, and recall quality. Maintain a set of test queries with known good answers and periodically verify that the correct documents appear in the top-K results.

Backup and recovery: Vectors are regular data in PostgreSQL. Standard pg_dump and pg_restore work. No separate backup strategy is needed.

Cost comparison: pgvector runs on your existing PostgreSQL instance. For 1 million vectors at 1536 dimensions, plan for roughly 6 GB of storage and 12–24 GB of index. Compare this to Pinecone's pricing (typically $70–300/month for moderate workloads).
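
The storage figure follows from straightforward arithmetic — four-byte floats, before per-row overhead, TOAST, and the index:

```python
# Back-of-envelope raw vector storage: 1M rows x 1536 dims x 4-byte floats
n_vectors = 1_000_000
dims = 1536
bytes_per_float = 4  # pgvector stores single-precision floats

raw_gb = n_vectors * dims * bytes_per_float / 1024**3  # roughly 5.7 GiB
```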

What Gold Lapel adds to vector workloads

Gold Lapel's proxy sees pgvector similarity queries alongside your application's regular OLTP queries. Vector search queries have a distinctive pattern — ORDER BY distance operator, LIMIT K — that the proxy recognizes and optimizes for.

Applications running RAG pipelines also run traditional queries: user authentication, session management, metadata lookups, application state. Gold Lapel optimizes both workloads simultaneously.

The pipeline described in this guide works independently. Gold Lapel attends to its performance as your vector table grows and your query mix becomes more complex.

Frequently asked questions