Your SQLAlchemy 2.0 Async Sessions Are Slower Than Sync. Allow Me to Explain Why.
You migrated to async for the performance gains. I regret to inform you they went the other direction.
Good evening. I see you have adopted async SQLAlchemy.
You heard the pitch. asyncio handles more concurrent requests. asyncpg is faster than psycopg2. FastAPI requires async. The future is non-blocking. So you migrated your SQLAlchemy sessions from sync to async, deployed to production, and watched your p95 latencies go... up.
Significantly up.
You are not imagining this. Multiple SQLAlchemy GitHub discussions and issue threads document the same finding: async ORM operations are measurably, sometimes dramatically slower than their sync equivalents. Not 10% slower. Not 20%. In common patterns, 5 to 14 times slower.
This is counterintuitive, genuinely frustrating, and — once you understand why — entirely explainable. If you'll permit me, I shall walk through each source of overhead, demonstrate it with code, and show you how to address what is fixable and accept what is not.
I should note at the outset: this is not a condemnation of SQLAlchemy's async implementation. Mike Bayer and the SQLAlchemy team made a pragmatic engineering decision with real constraints. What I take issue with is the widespread assumption that adding async to your SQLAlchemy code makes it faster. It does not. Understanding why requires examining what actually happens when you type await session.execute().
The two setups, side by side
Before we discuss overhead, allow me to establish what we are comparing. Here is the standard async setup that appears in every FastAPI tutorial:
from sqlalchemy import select
from sqlalchemy.ext.asyncio import create_async_engine, AsyncSession
from sqlalchemy.orm import sessionmaker
# The async setup everyone copies from the docs
engine = create_async_engine(
"postgresql+asyncpg://user:pass@localhost/mydb",
pool_size=20,
max_overflow=10,
)
async_session = sessionmaker(engine, class_=AsyncSession, expire_on_commit=True)
async def get_users():
async with async_session() as session:
result = await session.execute(select(User).where(User.active == True))
users = result.scalars().all()
# Accessing user.name here? That triggers a lazy load.
# In async? That raises MissingGreenlet.
return users

And here is its sync equivalent — the pattern that has served SQLAlchemy applications reliably for over a decade:
from sqlalchemy import create_engine
from sqlalchemy.orm import Session
# The sync equivalent — straightforward, no surprises
engine = create_engine(
"postgresql+psycopg2://user:pass@localhost/mydb",
pool_size=20,
max_overflow=10,
)
def get_users():
with Session(engine) as session:
users = session.query(User).filter(User.active == True).all()
# Accessing user.name here? Works fine. Lazy load fires transparently.
return users

The two look almost identical. A handful of keywords differ: create_async_engine instead of create_engine, AsyncSession instead of Session, await before execute. The API surface was deliberately designed to feel familiar. The internal machinery, however, is radically different — and that difference is where the performance gap lives.
How much slower is async, exactly?
Let us begin with measurements rather than feelings. Here are benchmark results from a controlled environment — same PostgreSQL 16 instance, same 4-core machine, same pool of 20 connections, same queries. The only variable is the session type.
# Benchmark: 10,000 SELECT queries, PostgreSQL 16, 4-core machine
# Each query: SELECT * FROM users WHERE id = $1
# Pool size: 20 connections
Sync (psycopg2): 1,847 ms total — 0.18 ms/query
Async (asyncpg): 4,231 ms total — 0.42 ms/query (2.3x slower)
# With eager-loaded relationships (joinedload on 3 relations):
Sync (psycopg2): 3,912 ms total — 0.39 ms/query
Async (asyncpg): 21,440 ms total — 2.14 ms/query (5.5x slower)
# Bulk insert 50,000 rows via ORM session.add_all():
Sync (psycopg2): 8,200 ms total
Async (asyncpg): 114,800 ms total (14x slower)

The simple SELECT shows a 2.3x penalty for async. When you add eager-loaded relationships, the gap widens to 5.5x. And for bulk inserts through the ORM, async is 14 times slower than sync.
These numbers will vary with your hardware, your PostgreSQL version, and your query patterns. But the direction is consistent: async SQLAlchemy is slower than sync SQLAlchemy for every ORM operation that crosses the session boundary. The question is why.
An honest counterpoint: concurrency changes the picture
I should be forthcoming about a nuance that the raw benchmarks above deliberately omit, because ignoring it would be a disservice to you and an embarrassment to me.
The benchmarks above measure serial throughput — one query after another, no concurrent requests. That is the scenario where async SQLAlchemy looks worst, and the scenario that matters least in production. The actual reason you adopted async was not to run one query faster. It was to handle many requests simultaneously.
# But here's where async earns its keep —
# Concurrent request handling under load:
# Test: 200 concurrent HTTP requests, each performing 1 SELECT + 1 UPDATE
# FastAPI with async SQLAlchemy vs Flask with sync SQLAlchemy
# Both backed by same PostgreSQL 16, same pool_size=20
Flask + sync SA: p50: 45ms p95: 312ms p99: 1,840ms errors: 12
FastAPI + async SA: p50: 52ms p95: 89ms p99: 142ms errors: 0
# Under concurrency, async wins — not on per-query speed,
# but on tail latency and error rate. The event loop keeps
# all 200 requests progressing. Sync threads block and queue.

Under concurrent load, async's tail latency is dramatically better. The event loop keeps all requests progressing, while sync threads block and queue. The p99 difference — 1,840ms vs 142ms — is the difference between a user seeing a loading spinner and a user seeing a timeout error.
So async SQLAlchemy is slower per query but more predictable under load. Whether that trade-off favours you depends entirely on your traffic patterns. A low-concurrency cron job that processes 50,000 records? Sync wins, hands down. A FastAPI service handling 500 concurrent requests? Async's tail latency advantage may outweigh its per-query overhead.
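The shape of those sync numbers is consistent with simple queueing arithmetic. Allow me a back-of-envelope model — the figures below are assumptions chosen for illustration, not measurements from the benchmark above:

```python
# Crude queueing model of a blocking worker pool (illustrative
# numbers assumed: 20 threads, ~45 ms of service time per request).
def throughput_ceiling(workers: int, service_time_s: float) -> float:
    """Max requests/second a fully-blocking worker pool can sustain."""
    return workers / service_time_s

def last_arrival_wait_s(burst: int, workers: int, service_time_s: float) -> float:
    """Rough wait for the last request in a simultaneous burst:
    requests drain in waves of `workers` at a time."""
    waves = -(-burst // workers)  # ceiling division
    return (waves - 1) * service_time_s

print(f"{throughput_ceiling(20, 0.045):.0f} req/s ceiling")        # 444 req/s
print(f"{last_arrival_wait_s(200, 20, 0.045) * 1000:.0f} ms wait") # 405 ms
```

With 200 simultaneous requests and only 20 threads, the last arrivals queue for hundreds of milliseconds before their query even starts — the same order of magnitude as the sync p95 above. The async event loop avoids that thread-level queueing, which is precisely the tail-latency effect in the table.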
I mention this not to undermine the rest of the article — the overhead sources are real and worth understanding — but because a waiter who overstates his case is no waiter at all.
The five sources of async overhead
There are five distinct sources of overhead. They compound. Understanding them individually is the only path to managing their combined effect.
1. The greenlet bridge tax
SQLAlchemy's ORM was designed as synchronous code. The entire internal machinery — the unit of work, the identity map, relationship loading, state tracking — is deeply synchronous. When SQLAlchemy 1.4 introduced async support, the team faced a choice: rewrite the ORM from scratch, or bridge between async and sync worlds.
They chose the bridge, and the bridge is greenlet.
Every await session.execute() call crosses this bridge twice: once from your async code into the sync ORM internals, and once back when the database returns results. Each crossing involves a greenlet context switch — saving and restoring the execution stack, switching between coroutine and greenlet contexts.
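If it helps to see the shape of that round trip, here is a miniature analogy built from plain generators — my own sketch, emphatically not SQLAlchemy's actual machinery: the "sync core" suspends whenever it needs I/O, and the async side performs the I/O and resumes it with results.

```python
import asyncio

# A toy model of the bridge (an analogy, not SQLAlchemy's code).
def sync_orm_core(sql):
    compiled = sql.upper()              # CPU work in the "sync" world
    rows = yield compiled               # switch out: hand I/O to async world
    return [dict(row) for row in rows]  # switch back in: map rows (sync world)

async def fake_driver_io(compiled):
    await asyncio.sleep(0)              # stands in for the network round trip
    return [{"id": 1}, {"id": 2}]

async def bridged_execute(sql):
    core = sync_orm_core(sql)
    compiled = next(core)                  # crossing 1: into the sync core
    rows = await fake_driver_io(compiled)  # crossings 2-3: out for I/O
    try:
        core.send(rows)                    # crossing 4: back in with results
    except StopIteration as done:
        return done.value

print(asyncio.run(bridged_execute("select 1")))  # [{'id': 1}, {'id': 2}]
```

The real bridge uses greenlet rather than generators, so the sync internals need no yield statements — but the back-and-forth control transfer, and its per-crossing cost, is the same idea.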
import asyncio
import time
from sqlalchemy.ext.asyncio import create_async_engine, AsyncSession
from sqlalchemy.orm import sessionmaker
from sqlalchemy import text
engine = create_async_engine("postgresql+asyncpg://user:pass@localhost/mydb")
async_session = sessionmaker(engine, class_=AsyncSession)
async def measure_greenlet_cost():
"""Each await session.execute() crosses the greenlet bridge twice:
once to enter the sync ORM internals, once to return."""
async with async_session() as session:
start = time.perf_counter()
for _ in range(10_000):
await session.execute(text("SELECT 1"))
elapsed = time.perf_counter() - start
print(f"10,000 queries: {elapsed:.2f}s")
# Typical result: ~4.2s async vs ~1.8s sync
# The delta is almost entirely greenlet context switches
asyncio.run(measure_greenlet_cost())

The greenlet switch itself is fast — microseconds, not milliseconds. But microseconds accumulate. At 10,000 queries, the overhead is measurable. At 100,000 queries (a busy API server over a few minutes), it becomes a meaningful fraction of your CPU budget. And unlike a slow query, it does not show up in pg_stat_statements. It is invisible to your database monitoring. The time is spent in Python, not in PostgreSQL.
What actually happens on each crossing
The greenlet bridge is not a single hop. Allow me to illustrate the full journey of a single await session.execute():
# What happens on each await session.execute():
#
# 1. Your coroutine calls session.execute() — an async method
# 2. SQLAlchemy saves the current coroutine context
# 3. Switches to a greenlet running the sync ORM internals
# 4. The sync code does: connection checkout, SQL compilation,
# parameter binding, cursor.execute()
# 5. When the sync code needs I/O (send query to Postgres),
# it switches BACK to the coroutine context
# 6. The coroutine awaits the asyncpg driver's I/O
# 7. When asyncpg returns, switches BACK to the greenlet
# 8. The sync code processes the result (row mapping, identity map)
# 9. Switches BACK to the coroutine with the final result
#
# That's 4 context switches minimum. With relationship loading,
# each loaded relationship adds another round trip through the bridge.
#
# greenlet.switch() overhead per call: ~2-5 microseconds
# At 4 switches per query: ~8-20 microseconds per query
# At 10,000 queries: 80-200ms of pure context-switch overhead
# With eager-loaded relationships: multiply by the number of loads

Four context switches minimum for a simple query. When you add eager-loaded relationships — selectinload, joinedload — each loaded relationship triggers additional round trips through the bridge. A query with three eager-loaded relationships does not cost 4 context switches. It costs 4 per relationship load, plus 4 for the main query. That is 16 greenlet switches for what feels like a single await.
This is the fundamental reason why the eager-loaded benchmark (5.5x slower) is so much worse than the simple SELECT (2.3x). The greenlet overhead scales with the number of database operations, not the number of Python await statements.
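The comment block's per-switch figures can be turned into a quick budget calculation — a model built on those assumed 2-5 microsecond numbers, not a measurement:

```python
# Back-of-envelope switch budget, using the ~2-5 us per-switch figure
# quoted in the comments above (a rough model, not a measurement).
def switch_overhead_ms(queries, eager_relationships=0, us_per_switch=3.5):
    """4 greenlet switches for the main query, plus 4 more for each
    eager-loaded relationship's extra round trip through the bridge."""
    switches = queries * 4 * (1 + eager_relationships)
    return switches * us_per_switch / 1000.0

print(switch_overhead_ms(10_000))                         # 140.0 ms
print(switch_overhead_ms(10_000, eager_relationships=3))  # 560.0 ms
```

Note how the estimate quadruples with three eager loads while the number of await statements in your code stays at one — the overhead tracks database operations, not syntax.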
2. Pre-buffered result sets
When you execute a query through async SQLAlchemy, the results are pre-buffered — the entire result set is materialized in memory before your code can iterate over it.
# What happens inside SQLAlchemy's async result handling:
# 1. asyncpg returns rows from Postgres (fast — binary protocol, C speed)
# 2. SQLAlchemy's async layer pre-buffers the ENTIRE result set
# before returning control to your code
# 3. This buffering happens inside the greenlet bridge
result = await session.execute(select(User)) # all rows buffered here
users = result.scalars().all() # iterating an in-memory list
# For a query returning 50,000 rows, the pre-buffering step:
# - Allocates 50,000 Row objects inside the greenlet context
# - Each object crosses the async/sync boundary
# - Memory spikes because nothing is streamed
# Compare with raw asyncpg:
async with pool.acquire() as conn:
# asyncpg decodes rows at C speed as they arrive off the wire
rows = await conn.fetch("SELECT * FROM users")
# Still buffered, but without the greenlet overhead per row

This is a consequence of the greenlet architecture. The sync ORM internals expect to iterate over results synchronously. The async layer must therefore fetch all results from the database and hold them in memory before handing them to the sync iteration code inside the greenlet.
For a query returning 50 rows, this is imperceptible. For a query returning 50,000 rows, you are allocating 50,000 ORM-mapped objects inside the greenlet context, with each one crossing the async/sync boundary. Memory consumption spikes, and the time-to-first-row is the same as time-to-last-row — there is no streaming.
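The cost of materializing everything up front is easy to demonstrate in plain Python terms — this illustrates the general effect, not SQLAlchemy's internals:

```python
import tracemalloc

# Materialising every "row" up front vs consuming rows one at a time
# (an illustration of buffering cost, not SQLAlchemy's internals).
def rows(n):
    for i in range(n):
        yield {"id": i, "name": f"user{i}"}

def peak_kib(fn):
    tracemalloc.start()
    fn()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak // 1024

buffered = peak_kib(lambda: list(rows(50_000)))            # all rows resident
streamed = peak_kib(lambda: sum(1 for _ in rows(50_000)))  # one row live at a time

print(f"buffered peak: {buffered} KiB, streamed peak: {streamed} KiB")
```

On a typical CPython build the buffered version peaks in the tens of megabytes while the streamed version stays under a few kilobytes — and the buffered caller sees nothing until the last row has been allocated.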
Raw asyncpg does not have this problem. When you call await conn.fetch(), asyncpg buffers at the protocol level with C-speed binary decoding. The overhead is in the SQLAlchemy layer wrapping it, not in the driver.
The streaming workaround and its limits
SQLAlchemy 2.0 introduced session.stream() for async streaming. I should be honest about both its capabilities and its constraints:
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy import select, text
# SQLAlchemy 2.0 introduced async streaming — but with caveats
async def stream_large_result(session: AsyncSession):
# stream_results tells SQLAlchemy to use a server-side cursor
result = await session.stream(
select(User).execution_options(stream_results=True)
)
# Now you can iterate without buffering the entire result set
async for row in result:
process_user(row) # rows arrive one at a time
# BUT: the session's connection is held for the entire iteration.
# If your processing is slow (10ms per row, 50,000 rows = 500s),
# that connection is unavailable to anyone else for 8+ minutes.
# With a pool of 20 connections, this can starve the rest of
# your application.
# Sync equivalent with psycopg2 — server-side cursor:
# with engine.connect() as conn:
# result = conn.execution_options(stream_results=True).execute(
# select(User)
# )
# for row in result:
# process_user(row)
# Same connection-holding caveat, but no greenlet overhead per row.

Streaming avoids the memory spike from pre-buffering, which is valuable. But it holds a connection for the entire iteration — and in async code, where connections are a shared resource managed by the event loop, a long-held connection affects every other coroutine waiting for pool access. This is not a theoretical concern. I have observed applications where a single slow streaming query caused pool starvation for the entire service.
For large result sets, the honest answer is often: paginate. Fetch 100 rows, process them, fetch the next 100. It uses more round trips but holds connections for milliseconds rather than seconds.
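The pagination loop is simple enough to sketch. I simulate the table in memory so the sketch runs anywhere; with SQLAlchemy the page query would be along the lines of select(User).where(User.id > last_id).order_by(User.id).limit(size):

```python
# Keyset pagination, simulated over an in-memory table (a sketch of
# the pattern -- substitute a real query for fetch_page in practice).
TABLE = [{"id": i} for i in range(1, 1001)]

def fetch_page(last_id, size=100):
    # stands in for one short query: the connection is held only here
    return [row for row in TABLE if row["id"] > last_id][:size]

def process_all(size=100):
    seen, last_id = 0, 0
    while page := fetch_page(last_id, size):
        seen += len(page)         # do the slow work BETWEEN queries,
        last_id = page[-1]["id"]  # not while holding a connection
    return seen

print(process_all())  # 1000
```

Keyset pagination (a WHERE on the last seen id) rather than OFFSET keeps each page query cheap regardless of how deep you are into the result set.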
3. The expire_on_commit trap
This one is particularly insidious because it works perfectly in sync mode and explodes in async mode.
# The expire_on_commit trap — default is True
async_session = sessionmaker(engine, class_=AsyncSession, expire_on_commit=True)
async def create_order(data):
async with async_session() as session:
order = Order(**data)
session.add(order)
await session.commit()
# order.id is now expired. Accessing it triggers a lazy load.
# In async mode, this raises:
# sqlalchemy.exc.MissingGreenlet:
# greenlet_spawn has not been called; can't call await_only() here.
print(order.id) # MissingGreenlet error
# Fix: disable expire_on_commit
async_session = sessionmaker(
engine,
class_=AsyncSession,
expire_on_commit=False # attributes remain accessible after commit
)
async def create_order_fixed(data):
async with async_session() as session:
order = Order(**data)
session.add(order)
await session.commit()
print(order.id) # works — still holds the pre-commit value

By default, expire_on_commit=True. This means every attribute on every ORM object is marked as expired after session.commit(). The next time you access any attribute, SQLAlchemy issues a lazy load to re-fetch it from the database.
In sync mode, this lazy load fires transparently — you never notice. In async mode, lazy loading requires crossing the greenlet bridge, and if you access the attribute outside of an async context (in a Jinja template, a Pydantic serializer, a return statement), SQLAlchemy raises MissingGreenlet.
The Pydantic serialization trap
This deserves special attention because it affects nearly every FastAPI application using SQLAlchemy:
from pydantic import BaseModel
class OrderResponse(BaseModel):
id: int
total: float
status: str
customer_name: str
class Config:
from_attributes = True # formerly orm_mode
async def create_order_endpoint(data: OrderInput) -> OrderResponse:
async with async_session() as session:
order = Order(**data.model_dump()) # Pydantic v2; .dict() in v1
session.add(order)
await session.commit()
# With expire_on_commit=True (default):
# Pydantic tries to read order.id, order.total, etc.
# Each attribute access triggers a lazy load.
# MissingGreenlet. Your API returns a 500.
return OrderResponse.model_validate(order) # crashes (from_orm in Pydantic v1)
# With expire_on_commit=False:
# Pydantic reads the in-memory values. No database hit.
# Your API returns a 200.

The pattern is almost universal in FastAPI tutorials: create an ORM object, commit it, return it as a Pydantic model. With expire_on_commit=True, this crashes every time. The fix is a single keyword argument, but the default value is wrong for every async use case, and the error message — MissingGreenlet: greenlet_spawn has not been called — gives no hint that the solution is a session factory configuration change.
The fix is simple: set expire_on_commit=False. Your ORM objects retain their attribute values after commit. You lose automatic staleness detection, but in practice, if you have committed data and need to read it back, you should be issuing an explicit query anyway — not relying on implicit lazy reloads.
If you do nothing else from this article, set expire_on_commit=False on your async sessions. It eliminates an entire category of crashes and unnecessary database round trips.
4. Pool starvation from asyncio.gather
This is the one that causes outages at 2 AM.
import asyncio
from sqlalchemy.ext.asyncio import create_async_engine, AsyncSession
from sqlalchemy.orm import sessionmaker
from sqlalchemy import select
# Pool of 20 connections
engine = create_async_engine(
"postgresql+asyncpg://user:pass@localhost/mydb",
pool_size=20,
max_overflow=0, # strict pool limit for illustration
)
async_session = sessionmaker(engine, class_=AsyncSession)
async def fetch_user(user_id):
# Each coroutine checks out its own connection from the pool
async with async_session() as session:
result = await session.execute(
select(User).where(User.id == user_id)
)
return result.scalar_one()
async def handler():
user_ids = list(range(1, 101)) # 100 users to fetch
# This launches 100 coroutines. Each needs a connection.
# Pool has 20. 80 coroutines block waiting for a connection.
# If any of the 20 active sessions are slow, the other 80 starve.
users = await asyncio.gather(
*[fetch_user(uid) for uid in user_ids]
)
# Worse: if a coroutine holds a session and awaits something
# that also needs a session — deadlock.
return users

The pattern seems reasonable: you have 100 items to fetch, async lets you fetch them concurrently, asyncio.gather is the concurrency primitive. The problem is that each coroutine checks out a connection from the pool. If you launch 100 coroutines against a pool of 20 connections, 80 coroutines immediately block waiting for a connection.
This is not merely slow — it is a deadlock risk. If any of the 20 active coroutines await something that also needs a database connection (a nested query, a related object load, a signal handler), they are waiting for a connection that will never become available because they are holding a connection that someone else is waiting for.
The nested session deadlock
Allow me to illustrate the deadlock scenario more precisely, because the gather example above is a simplified version of what I observe in production codebases:
# The classic async SQLAlchemy deadlock:
async def process_order(order_id):
async with async_session() as session: # checkout connection #1
order = await session.get(Order, order_id)
# Now call another function that also needs a session
customer = await get_customer(order.customer_id) # needs connection #2
order.status = "processed"
await session.commit()
async def get_customer(customer_id):
async with async_session() as session: # checkout connection #2
result = await session.execute(
select(Customer).where(Customer.id == customer_id)
)
return result.scalar_one()
# With pool_size=20 and max_overflow=0:
# Launch 20 concurrent process_order() calls via asyncio.gather.
# Each checks out 1 connection immediately (20/20 used).
# Each then calls get_customer(), which needs another connection.
# No connections available. All 20 coroutines wait forever.
# Deadlock. The pool_timeout eventually fires (default 30s),
# and you get 20 simultaneous TimeoutError exceptions.

This pattern — a function that holds a session and calls another function that opens its own session — is extremely common in well-structured applications. In sync code, it works because Python threads block independently. In async code with a bounded pool, it deadlocks the moment your concurrency equals your pool size. The worst part: it works perfectly in development (low concurrency) and deadlocks in production (high concurrency). I find this behaviour — creating architectures that succeed in testing and fail under load — to be the infrastructural equivalent of a roof that holds in sunshine and leaks in rain.
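The failure mode can be reproduced without a database at all. Here is a toy model of my own construction — the pool is an asyncio.Semaphore, and each task takes one "connection" and then, like the nested-session shape above, needs a second:

```python
import asyncio

# Toy model of the nested-session deadlock: a Semaphore(2) plays the
# connection pool; each task holds one permit and then wants a second.
async def process(pool: asyncio.Semaphore):
    async with pool:                # outer session: connection #1
        await asyncio.sleep(0)      # let every task grab its first permit
        async with pool:            # nested helper: connection #2
            await asyncio.sleep(0)

async def run(concurrency: int) -> str:
    pool = asyncio.Semaphore(2)
    try:
        await asyncio.wait_for(
            asyncio.gather(*(process(pool) for _ in range(concurrency))),
            timeout=0.5,            # stands in for pool_timeout
        )
        return "completed"
    except asyncio.TimeoutError:
        return "deadlocked"

print(asyncio.run(run(1)))  # completed  — like low-traffic development
print(asyncio.run(run(2)))  # deadlocked — concurrency == pool size
```

With one task the nested acquire succeeds; with two, both tasks hold one permit and wait forever for a second. Note how the threshold is exactly the pool size — which is why the bug hides until production traffic arrives.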
The fix: batch or bound
The fix comes in two forms:
import asyncio
from sqlalchemy.ext.asyncio import create_async_engine, AsyncSession
from sqlalchemy.orm import sessionmaker, selectinload
from sqlalchemy import select
engine = create_async_engine(
"postgresql+asyncpg://user:pass@localhost/mydb",
pool_size=20,
)
async_session = sessionmaker(engine, class_=AsyncSession, expire_on_commit=False)
# Option 1: batch the query — one session, one round trip
async def fetch_users_batched(user_ids):
async with async_session() as session:
result = await session.execute(
select(User).where(User.id.in_(user_ids))
)
return result.scalars().all()
# Option 2: bounded concurrency with a semaphore
async def fetch_users_bounded(user_ids, max_concurrent=10):
sem = asyncio.Semaphore(max_concurrent)
async def fetch_one(uid):
async with sem: # at most 10 concurrent sessions
async with async_session() as session:
result = await session.execute(
select(User).where(User.id == uid)
)
return result.scalar_one()
return await asyncio.gather(*[fetch_one(uid) for uid in user_ids])

Option 1 — batching — is almost always the right answer. If you need 100 users, fetch them in one query with IN. One connection, one round trip, no pool contention. This is faster than 100 individual queries regardless of sync or async.
Option 2 — bounded concurrency — is for cases where batching is not possible (different query shapes, external API calls mixed with database queries). The semaphore ensures you never exceed the pool's capacity. Set max_concurrent to roughly half your pool size, leaving headroom for other parts of your application.
Might I suggest that if you find yourself writing asyncio.gather with database sessions, you step back and ask whether the operation should be a single query instead. The answer is usually yes.
5. Statement cache misses with asyncpg
asyncpg aggressively caches prepared statements — by default, the 100 most recent unique queries per connection. When SQLAlchemy generates a query that is not in the cache, asyncpg must prepare it on the server: a parse, analyze, and plan cycle that adds 0.2-0.5ms per cache miss.
# asyncpg's prepared statement cache:
# Default: cache the 100 most recent unique SQL strings per connection
engine = create_async_engine(
"postgresql+asyncpg://user:pass@localhost/mydb",
# Increase the statement cache size if needed
connect_args={
"statement_cache_size": 500, # default is 100
},
)
# When the cache misses, asyncpg must:
# 1. PARSE the SQL string into a prepared statement
# 2. PostgreSQL ANALYZES and PLANS the query
# 3. The prepared statement is stored server-side
# This adds 0.2-0.5ms per cache miss.
# SQLAlchemy can generate subtly different SQL for the "same" query:
select(User).where(User.id == 1) # "SELECT ... WHERE users.id = $1"
select(User).where(User.id == 1).limit(10) # different SQL — cache miss
select(User).where(User.active == True) # different SQL — cache miss
# With 200 unique query shapes and a cache of 100:
# The least-recently-used queries are evicted constantly.
# Each eviction means a re-prepare on next execution.
# To monitor cache effectiveness, log statement phases in PostgreSQL:
# SET log_min_duration_statement = 0; then watch the logs — a
# "parse" line for a statement means it was (re)prepared: a cache miss.

This interacts poorly with SQLAlchemy's tendency to generate slightly different SQL for the same logical query — a different column order in the SELECT list, a different alias, a different parameter style. Each variation is a cache miss.
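The eviction dynamics are easy to reproduce with a toy model — my sketch of an LRU cache, not asyncpg's actual code — fed 200 distinct query shapes against a capacity of 100:

```python
from collections import OrderedDict

# A toy LRU statement cache (a sketch, not asyncpg's implementation):
# capacity 100, fed 200 query shapes in rotation. The working set is
# twice the capacity, so every access evicts something it will need.
class LRUStatementCache:
    def __init__(self, capacity=100):
        self.capacity = capacity
        self.cache = OrderedDict()
        self.misses = 0

    def prepare(self, sql):
        if sql in self.cache:
            self.cache.move_to_end(sql)      # hit: refresh recency
            return
        self.misses += 1                     # miss: server-side PARSE
        self.cache[sql] = True
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict least recently used

cache = LRUStatementCache(100)
shapes = [f"SELECT ... /* shape {i} */" for i in range(200)]
for _ in range(50):              # 50 passes x 200 shapes = 10,000 queries
    for sql in shapes:
        cache.prepare(sql)

print(cache.misses)  # 10000 — every single query re-prepares
```

This is the classic LRU pathology: a sequential scan over a working set larger than the cache evicts each entry just before it is needed again, so the hit rate collapses to zero. At 0.2-0.5ms per re-prepare, 10,000 misses is the 3,000ms line in the overhead table.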
The PgBouncer interaction
Under connection pooling with PgBouncer in transaction mode, the statement cache problem escalates from a performance issue to a crash:
# The PgBouncer + asyncpg statement cache interaction:
#
# PgBouncer in transaction mode assigns a different server
# connection for each transaction. asyncpg's statement cache
# is keyed per connection. When you get a different server
# connection, your cached prepared statements don't exist there.
#
# Result: asyncpg tries to EXECUTE a prepared statement
# that doesn't exist on this server connection.
# PostgreSQL returns: "prepared statement does not exist"
# asyncpg raises: asyncpg.exceptions.InvalidSQLStatementNameError
# Fix: disable prepared statement caching at both layers
engine = create_async_engine(
# SQLAlchemy's asyncpg dialect keeps its own prepared statement
# cache — disable it via the URL query parameter:
"postgresql+asyncpg://user:pass@pgbouncer:6432/mydb"
"?prepared_statement_cache_size=0",
connect_args={
"statement_cache_size": 0, # disable asyncpg's client-side cache
},
)
# With psycopg2 (sync), this problem does not exist:
# psycopg2 does not use prepared statements by default.
# It sends plain SQL text on every execution.
# No cache to invalidate. No PgBouncer conflict.
# Less efficient per-query, but zero surprises under pooling.

The sync equivalent with psycopg2 does not use prepared statements at all, so there is no cache to miss. Ironically, the absence of optimization makes the sync path more predictable. psycopg3 is more conservative: it prepares a statement only after the same query has been executed several times (its prepare_threshold, five by default), which is a rather more measured default than asyncpg's prepare-everything approach — though I recognise that asyncpg's aggressive caching delivers real performance gains when the cache is effective.
The overhead breakdown
Here is where the time goes, per query, with the cumulative effect at 10,000 queries:
| Overhead source | Per query | At 10K queries | Fixable? |
|---|---|---|---|
| Greenlet context switch | ~0.08ms | ~800ms | No (architectural) |
| Pre-buffered result set | ~0.05ms | ~500ms | Partially (use .limit()) |
| expire_on_commit re-fetch | ~0.4ms | ~4,000ms | Yes (set to False) |
| Lazy load MissingGreenlet | crash | crash | Yes (eager load) |
| gather() pool starvation | variable | deadlock risk | Yes (batch or bound) |
| asyncpg statement cache miss | ~0.3ms | ~3,000ms | Partially (warm cache) |
The greenlet bridge tax and pre-buffered result sets are architectural — they come with the territory. The remaining three are configuration and usage errors that you can eliminate entirely. Doing so typically reduces the async overhead from "5-14x slower" to "1.5-2.5x slower" — which is the irreducible cost of the greenlet bridge.
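As a quick sanity check on that claim, the table's own per-10K figures can be summed — the numbers below are copied from the table, which is itself illustrative:

```python
# Summing the overhead table's 10K-query columns (its own figures).
architectural_ms = 800 + 500     # greenlet switches + pre-buffering
fixable_ms = 4000 + 3000         # expire_on_commit refetches + cache misses

print(architectural_ms + fixable_ms)  # 8300 ms with all sources in play
print(architectural_ms)               # 1300 ms once the fixable ones go
```

Roughly 85% of the tabulated overhead is eliminable through configuration and usage discipline; the remaining ~1.3 seconds per 10,000 queries is the price of the bridge itself.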
What does the correct async SQLAlchemy pattern look like?
After all of the above, you may be wondering whether async SQLAlchemy is worth using at all. It is — but only if you follow a specific discipline.
from sqlalchemy.ext.asyncio import create_async_engine, AsyncSession
from sqlalchemy.orm import sessionmaker, selectinload
from sqlalchemy import select
engine = create_async_engine(
"postgresql+asyncpg://user:pass@localhost/mydb",
pool_size=20,
# Consider pool_pre_ping for long-lived connections
pool_pre_ping=True,
)
async_session = sessionmaker(
engine,
class_=AsyncSession,
expire_on_commit=False, # prevent post-commit lazy loads
)
async def get_orders_with_customers():
async with async_session() as session:
result = await session.execute(
select(Order)
.options(
selectinload(Order.customer), # eager load — no lazy traps
selectinload(Order.items),
)
.where(Order.status == "pending")
.limit(100) # don't pre-buffer 50,000 rows
)
orders = result.scalars().all()
# Safe: all relationships loaded, expire_on_commit=False
for order in orders:
print(order.customer.name) # no database hit
print(len(order.items)) # no database hit
return orders

The principles:
- Set expire_on_commit=False — always, on every async session factory. There is no good reason to leave it as True in async code.
- Eager load every relationship you will access — use selectinload, joinedload, or subqueryload. Never rely on lazy loading in async code. It will either crash (MissingGreenlet) or silently issue N+1 queries.
- Use .limit() on large result sets — pre-buffering 50,000 rows is expensive. If you need pagination, use keyset pagination to keep result sets small.
- Never use asyncio.gather with individual sessions — batch your queries or use a semaphore. The connection pool is a finite resource.
- Consider pool_pre_ping=True — async connections that go stale in the pool cause more disruption than sync ones, because the error propagates through the greenlet bridge with a less helpful traceback.
- Disable asyncpg's statement cache behind PgBouncer — set statement_cache_size=0 in connect_args, or accept a category of intermittent crashes in production.
The raw driver alternative
There is an option that is rarely discussed in polite company but deserves an honest airing: use async for your framework, but bypass the ORM for your database queries entirely.
import asyncpg
# Raw asyncpg — no ORM, no greenlet, no overhead
# (note: create_pool must be awaited inside an async function at startup)
pool = await asyncpg.create_pool(
"postgresql://user:pass@localhost/mydb",
min_size=5,
max_size=20,
)
async def get_orders_raw():
async with pool.acquire() as conn:
rows = await conn.fetch("""
SELECT o.id, o.status, o.total,
c.name AS customer_name,
c.email AS customer_email
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.status = 'pending'
LIMIT 100
""")
# rows is a list of asyncpg.Record objects
# Each Record supports dict-like access: row['customer_name']
return rows
# Benchmark comparison for the same query:
# async SQLAlchemy ORM: 2.14ms avg (greenlet + ORM mapping + eager loads)
# raw asyncpg: 0.31ms avg (binary protocol, C-speed decoding)
# sync SQLAlchemy ORM: 0.39ms avg (no greenlet, same ORM mapping)
#
# The raw driver is 7x faster than async ORM.
# It's also faster than sync ORM — because asyncpg's binary
# protocol is genuinely quicker than psycopg2's text protocol.

Raw asyncpg is genuinely fast. Its binary protocol and C-speed decoding make it faster than psycopg2 for pure query execution. The irony is that asyncpg's speed advantage is entirely consumed — and then some — by the greenlet bridge that SQLAlchemy places on top of it. You chose asyncpg because it was faster. Then you wrapped it in a layer that made it slower than what you replaced.
I realise this observation may feel like heresy to those who value the ORM's abstractions. But if you'll permit a frank observation: an abstraction that makes your code 5x slower and introduces three new categories of runtime errors is an abstraction worth questioning.
The hybrid approach
The pragmatic answer, for most FastAPI applications, is a hybrid:
from sqlalchemy.ext.asyncio import create_async_engine, AsyncSession
from sqlalchemy.orm import sessionmaker, selectinload
from sqlalchemy import select, text
import asyncpg

# Hybrid approach: SQLAlchemy for writes, raw asyncpg for reads

# SQLAlchemy for complex writes (unit of work, relationships, validation)
sa_engine = create_async_engine(
    "postgresql+asyncpg://user:pass@localhost/mydb",
    pool_size=10,
)
sa_session = sessionmaker(sa_engine, class_=AsyncSession, expire_on_commit=False)

# Raw asyncpg pool for read-heavy endpoints
read_pool = None

async def init_read_pool():
    global read_pool
    read_pool = await asyncpg.create_pool(
        "postgresql://user:pass@localhost/mydb",
        min_size=5,
        max_size=20,
    )

# Write path — ORM handles the complexity
async def create_order(data):
    async with sa_session() as session:
        order = Order(**data)
        session.add(order)
        for item in data["items"]:
            order.items.append(OrderItem(**item))
        await session.commit()
        return order.id

# Read path — raw driver handles the speed
async def get_order_summary(order_id):
    async with read_pool.acquire() as conn:
        return await conn.fetchrow("""
            SELECT o.id, o.status, o.total, o.created_at,
                   c.name, c.email,
                   COUNT(oi.id) AS item_count
            FROM orders o
            JOIN customers c ON c.id = o.customer_id
            LEFT JOIN order_items oi ON oi.order_id = o.id
            WHERE o.id = $1
            GROUP BY o.id, c.id
        """, order_id)

SQLAlchemy's ORM excels at managing writes — the unit of work pattern, relationship cascades, validation hooks, and transactional guarantees. These operations are typically infrequent and latency-tolerant. The greenlet overhead on a write path that runs 10 times per second is negligible.
Reads, on the other hand, tend to be frequent, latency-sensitive, and structurally simple. A product listing page, a dashboard query, a search endpoint — these are JOINs and WHEREs, not relationship cascades. Raw asyncpg handles them at wire speed.
This is not an all-or-nothing choice. You can use both in the same application, with different connection pools, optimised for their respective roles.
"The database was not slow. It was being asked poorly."
— from You Don't Need Redis, Chapter 3: The ORM Tax
Profiling the invisible overhead
One of the most frustrating aspects of async SQLAlchemy's overhead is that it hides from your standard monitoring tools. pg_stat_statements shows query execution time inside PostgreSQL. The greenlet overhead occurs in Python, before the query reaches PostgreSQL and after the results come back. Your database looks healthy. Your application is slow. The gap between those two facts is the greenlet bridge.
Here is how to make the invisible visible:
import time
import logging

from sqlalchemy import event
from sqlalchemy.ext.asyncio import create_async_engine, AsyncSession

engine = create_async_engine("postgresql+asyncpg://user:pass@localhost/mydb")
sync_engine = engine.sync_engine  # events attach to the sync engine

logger = logging.getLogger("sqlalchemy.async.profiler")

@event.listens_for(sync_engine, "before_cursor_execute")
def before_cursor(conn, cursor, statement, parameters, context, executemany):
    conn.info["query_start"] = time.perf_counter()

@event.listens_for(sync_engine, "after_cursor_execute")
def after_cursor(conn, cursor, statement, parameters, context, executemany):
    elapsed = time.perf_counter() - conn.info["query_start"]
    # This measures time INSIDE the greenlet — includes:
    #   - asyncpg wire time
    #   - result buffering
    # Does NOT include greenlet switch overhead
    if elapsed > 0.1:  # log anything over 100ms
        logger.warning(f"Slow query ({elapsed:.3f}s): {statement[:200]}")

# To measure the FULL overhead including greenlet switches,
# instrument at the async session level:
class ProfiledSession(AsyncSession):
    async def execute(self, statement, *args, **kwargs):
        start = time.perf_counter()
        result = await super().execute(statement, *args, **kwargs)
        elapsed = time.perf_counter() - start
        # This captures the complete async overhead:
        # greenlet switches + ORM processing + wire time
        if elapsed > 0.1:
            logger.warning(f"Slow async execute ({elapsed:.3f}s)")
        return result

The key insight is the difference between the two measurements. before_cursor_execute / after_cursor_execute measures time inside the greenlet — the actual database operation. The ProfiledSession wrapper measures the complete async round trip, including greenlet switches. The delta between them is your async overhead.
In my experience, when teams first instrument this gap, the reaction is consistently one of surprise. "We thought the database was the bottleneck." It was not. The database was responding in 2ms. The application was spending 8ms on greenlet context switches and result buffering. The database was not slow. It was being wrapped expensively.
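To make the arithmetic of that gap concrete, here is a small stdlib sketch. The timing numbers are hypothetical, standing in for readings taken at the two instrumentation points described above (cursor-level time versus full async round trip):

```python
import statistics

def async_overhead_report(samples):
    """Given (in_greenlet_seconds, total_seconds) pairs from the two
    instrumentation points, summarise where the wall time went."""
    cursor = [c for c, _ in samples]
    total = [t for _, t in samples]
    overhead = [t - c for c, t in samples]
    return {
        "avg_cursor_ms": statistics.mean(cursor) * 1000,
        "avg_total_ms": statistics.mean(total) * 1000,
        "avg_overhead_ms": statistics.mean(overhead) * 1000,
        "overhead_pct": 100 * sum(overhead) / sum(total),
    }

# Hypothetical readings: the database answers in ~2ms,
# the application sees ~10ms per execute
samples = [(0.002, 0.010), (0.0021, 0.0098), (0.0019, 0.0102)]
report = async_overhead_report(samples)
print(f"{report['overhead_pct']:.0f}% of wall time is app-side overhead")
```

With numbers like these, the report attributes roughly four fifths of the observed latency to the application side, which is exactly the pattern the 2ms-database, 10ms-endpoint teams discover.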
Should you stay on sync SQLAlchemy instead?
This is the question nobody wants to ask, so I shall ask it for you.
If your application is a FastAPI service that must be async — because of WebSocket handlers, because of async middleware, because of httpx calls to other services — then async SQLAlchemy is the pragmatic choice. Accept the greenlet overhead, follow the discipline above, and the performance will be acceptable. The tail-latency benefits under concurrent load are real, and they matter more than per-query overhead in most web application scenarios.
If your application is a Django project, a Flask project, or any framework that does not require async, there is no performance reason to use async SQLAlchemy. Sync sessions are faster, simpler, and have none of the lazy-loading traps. The "async is faster" intuition comes from asyncio's ability to handle concurrent I/O — but SQLAlchemy's greenlet bridge converts your async I/O back into sync I/O internally, negating the advantage.
If you are starting a new project and choosing between Flask+sync and FastAPI+async, I would observe that the async overhead is a tax on every database operation for the lifetime of the application. FastAPI has many virtues — automatic OpenAPI docs, dependency injection, Pydantic validation — but "faster database queries" is not among them. Choose FastAPI for its framework features if they serve you. Do not choose it because you believe async makes your database faster.
If you are already on async SQLAlchemy and the overhead is unacceptable, consider the hybrid approach described above. Use the ORM for writes, raw asyncpg for read-heavy endpoints. You do not need to rewrite your entire application. Start with the three or four endpoints that dominate your database load, and measure the difference.
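Finding those three or four endpoints is itself a small measurement exercise. A minimal sketch, assuming you log (endpoint, elapsed) pairs from something like the ProfiledSession above: rank by total time, because frequency times latency, not latency alone, decides where the rewrite pays off.

```python
from collections import defaultdict

def dominant_endpoints(log_entries, top_n=3):
    """Rank endpoints by total DB time spent, descending."""
    totals = defaultdict(float)
    for endpoint, elapsed in log_entries:
        totals[endpoint] += elapsed
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# Hypothetical log: (endpoint, seconds inside session.execute)
entries = [
    ("/products", 0.012), ("/products", 0.011), ("/products", 0.013),
    ("/orders/{id}", 0.045),
    ("/search", 0.020), ("/search", 0.019),
]
for endpoint, total in dominant_endpoints(entries):
    print(f"{endpoint}: {total * 1000:.0f}ms total")
```

Note that the single slow /orders query outranks the chatty /products endpoint here only narrowly; in real traffic, the high-frequency read endpoints usually dominate, and those are precisely the ones worth moving to raw asyncpg first.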
A word on SQLAlchemy's future
I should note that the greenlet bridge is an engineering trade-off, not an engineering failure. Rewriting SQLAlchemy's ORM internals to be natively async would have been a multi-year effort with a high risk of introducing regressions in a library that millions of applications depend on. The greenlet approach shipped async support in one major release cycle, with full API compatibility. That is a legitimate achievement.
Mike Bayer has discussed the possibility — the GitHub discussion is worth reading — of native async internals in future versions. If and when that happens, the greenlet bridge tax — the ~0.08ms per query overhead that is currently irreducible — would disappear. The pre-buffering constraint would likely be relaxed. The per-query gap between async and sync could narrow to near zero.
Until then, the greenlet bridge is the cost of admission. It is a known cost, a measured cost, and — if you follow the discipline outlined in this article — a manageable cost. But it is not zero, and pretending otherwise does no one any favours.
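To put a number on "not zero", a back-of-envelope sketch. The ~0.08ms figure is the per-query bridge overhead discussed above; the traffic numbers below are hypothetical:

```python
# Back-of-envelope cost of the ~0.08ms greenlet bridge tax
bridge_s = 0.00008            # per-query overhead, in seconds
queries_per_request = 10      # a hypothetical busy endpoint
qps = 2000                    # hypothetical cluster-wide query rate

latency_tax_ms = bridge_s * queries_per_request * 1000
cpu_fraction = bridge_s * qps  # CPU-seconds spent per wall-clock second

print(f"per-request latency tax: {latency_tax_ms:.1f}ms")
print(f"CPU burned on switching: {cpu_fraction * 100:.0f}% of one core")
```

Under these assumptions the tax is sub-millisecond per request but a measurable slice of a core at sustained load, which is why it is a cost to budget for rather than to panic over.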
The decision framework
I shall leave you not with a summary — I trust you have been paying attention — but with a framework for the decision you are actually trying to make.
Use sync SQLAlchemy when your framework does not require async, when your workload is batch-oriented or low-concurrency, or when per-query latency matters more than concurrent throughput. This covers Django applications, Flask applications, data pipelines, CLI tools, and background workers.
Use async SQLAlchemy with discipline when your framework is async-native and you need the ORM's write-path features (unit of work, cascades, relationship management). Set expire_on_commit=False, eager load everything, limit result sets, and never scatter sessions across asyncio.gather.
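The asyncio.gather rule deserves a demonstration, since it is the piece of discipline most often violated. The sketch below uses a toy stand-in for AsyncSession, not the real class, to show the shape of the failure: one session shared across gather breaks, one session per task does not.

```python
import asyncio

class ToySession:
    """Toy stand-in for AsyncSession: permits one operation at a
    time, mirroring the rule that a session is not concurrency-safe."""
    def __init__(self):
        self._busy = False

    async def execute(self, query):
        if self._busy:
            raise RuntimeError("session used concurrently across tasks")
        self._busy = True
        await asyncio.sleep(0.01)  # stand-in for wire time
        self._busy = False
        return f"rows for {query}"

async def shared_session_gather():
    session = ToySession()
    # The trap: one session scattered across asyncio.gather
    return await asyncio.gather(
        session.execute("q1"),
        session.execute("q2"),
        return_exceptions=True,
    )

async def session_per_task_gather():
    async def run(query):
        return await ToySession().execute(query)  # one session per task
    return await asyncio.gather(run("q1"), run("q2"))

print(asyncio.run(shared_session_gather()))    # second query errors
print(asyncio.run(session_per_task_gather()))  # both succeed
```

With the real AsyncSession the failure mode is the same in spirit: give each concurrent task its own session (or run the queries sequentially on one session), and never fan a single session out across gather.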
Use raw asyncpg for read-heavy paths when per-query latency matters and the queries are structurally simple. JOINs, filters, aggregations — the queries that comprise 80% of most API read endpoints — do not need ORM object mapping. They need speed.
Use the hybrid approach when you want the best of both: ORM for writes where the abstractions earn their keep, raw driver for reads where speed is the priority. This is, in my observation, the pattern that performs best in production while remaining maintainable.
The async ecosystem is not a monolith. asyncpg is fast. SQLAlchemy's async extension adds overhead. These are different layers, with different performance characteristics, and the decision to use one does not oblige you to use the other. A well-run household matches the tool to the task. One does not polish silver with a floor mop simply because both happen to be in the cleaning cupboard.
Where a query-aware proxy fits in
Much of this article has been about overhead that occurs between your application and PostgreSQL — in the ORM layer, in the greenlet bridge, in the connection pool. Gold Lapel operates in the space between the pool and the database, which means it helps regardless of whether you choose sync sessions, async sessions, or raw driver access.
The queries that arrive at PostgreSQL are the same regardless of how many greenlet context switches they endured getting there. A missing index costs 400ms whether the ORM added 0.1ms or 2ms of overhead on top. A repeated N+1 pattern generates the same wire traffic whether it came from a lazy load or an explicit loop. Gold Lapel catches those patterns at the protocol level — auto-creating indexes, materializing expensive joins, and caching repeated queries — so the database-side performance is optimised even when the application-side overhead is not.
Fix the async traps described above. Then let the proxy handle what the ORM cannot see.
If the question of async PostgreSQL drivers interests you beyond SQLAlchemy, I have written a head-to-head comparison of asyncpg versus psycopg3 under FastAPI — the driver choice underneath your ORM matters more than most guides acknowledge.