
Spring Data JPA's @EntityGraph: Why Multiple Collections Generate a Cartesian Product on PostgreSQL

You asked for 50 departments with their employees and projects. PostgreSQL returned 10,000 rows. Allow me to account for the other 8,500.

The Waiter of Gold Lapel · Published Mar 5, 2026 · Updated Mar 20, 2026 · 28 min read
The illustration was duplicated two hundred times during the LEFT JOIN. We kept one.

Good evening. Your @EntityGraph has produced a Cartesian product, and nobody noticed.

You have a Spring Data JPA repository backed by Hibernate and PostgreSQL. Your entity has two @OneToMany collections. You want both loaded eagerly for a particular query, so you reach for @EntityGraph — the Spring Data mechanism designed precisely for this purpose.

If both collections are typed as List, Hibernate throws a MultipleBagFetchException and refuses to proceed. This feels like an error. It is actually a kindness.

If you follow the top-voted Stack Overflow answer and change both collections to Set, the exception vanishes. The query runs. Your tests pass. Everything appears correct.

It is not correct. Hibernate is now silently fetching a Cartesian product — every employee paired with every project for each department — and de-duplicating the results in memory after the fact. For 50 departments with 20 employees and 10 projects each, that means 10,000 rows transferred from PostgreSQL instead of 1,550. The data is right. The performance is catastrophic.

I have investigated this pattern in a great many Spring Boot applications. It is, without exaggeration, the single most common source of unexplained query slowness in Hibernate-backed services. The entity graph looks correct. The tests pass. The code review raises no flags. And the query that took 4ms in development takes 14 seconds in production with real data volumes.

Allow me to walk you through precisely why this happens, how to detect it in an existing codebase, and the three strategies that eliminate it entirely.

The entities: two collections on one parent

The setup is as ordinary as it gets. A Department with two independent child collections — Employee and Project. Both are @OneToMany with lazy fetching by default. The schema is normalized. The foreign keys are indexed. Every individual piece of this design is textbook-correct.

JPA entities
@Entity
@Table(name = "departments")
public class Department {

    @Id
    @GeneratedValue(strategy = GenerationType.SEQUENCE)
    private Long id;

    @Column(nullable = false)
    private String name;

    @OneToMany(mappedBy = "department", fetch = FetchType.LAZY)
    private List<Employee> employees = new ArrayList<>();

    @OneToMany(mappedBy = "department", fetch = FetchType.LAZY)
    private List<Project> projects = new ArrayList<>();

    // getters, setters
}

@Entity
@Table(name = "employees")
public class Employee {

    @Id
    @GeneratedValue(strategy = GenerationType.SEQUENCE)
    private Long id;

    private String name;
    private String role;

    @ManyToOne(fetch = FetchType.LAZY)
    @JoinColumn(name = "department_id")
    private Department department;
}

@Entity
@Table(name = "projects")
public class Project {

    @Id
    @GeneratedValue(strategy = GenerationType.SEQUENCE)
    private Long id;

    private String title;
    private String status;

    @ManyToOne(fetch = FetchType.LAZY)
    @JoinColumn(name = "department_id")
    private Department department;
}

I should note that the word "independent" is doing significant work in that description. Employees and projects are both children of departments, but they have no relationship to each other. There is no foreign key from employees to projects, no join table connecting them, no business rule that pairs a specific employee with a specific project.

This independence is exactly what makes the Cartesian product so pernicious. When you ask PostgreSQL to JOIN two independent child tables through their shared parent, the database has no choice but to produce every possible combination. There is no join predicate to constrain the result. It is doing precisely what the SQL specifies — which is the problem.
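The effect is easy to reproduce without a database. Here is a toy simulation in plain Java — hypothetical sizes, no Hibernate involved — showing the flattened result set growing as the product of the two collection sizes rather than their sum:

```java
public class CartesianDemo {

    // Simulates the flattened result set of:
    //   departments LEFT JOIN employees LEFT JOIN projects
    // There is no predicate between employees and projects, so every
    // (employee, project) pair within each department survives.
    static long flattenedRows(int departments, int employeesPerDept, int projectsPerDept) {
        long rows = 0;
        for (int d = 0; d < departments; d++) {
            for (int e = 0; e < employeesPerDept; e++) {
                for (int p = 0; p < projectsPerDept; p++) {
                    rows++; // one result row per (dept, employee, project) combination
                }
            }
        }
        return rows;
    }

    // Two clean 1:N joins, one per child table: rows grow as the SUM.
    static long separateQueryRows(int departments, int employeesPerDept, int projectsPerDept) {
        return (long) departments * employeesPerDept
             + (long) departments * projectsPerDept;
    }

    public static void main(String[] args) {
        System.out.println(flattenedRows(50, 20, 10));      // 10000
        System.out.println(separateQueryRows(50, 20, 10));  // 1500
    }
}
```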

The trouble emerges only when you try to fetch both collections in a single query. And @EntityGraph makes that attempt remarkably easy to write.

The @EntityGraph that triggers the Cartesian product

Spring Data JPA's @EntityGraph lets you override lazy fetching for a specific repository method. You specify which association paths to load eagerly, and Hibernate generates the appropriate JOINs. The mechanism is clean, declarative, and — when used with a single collection — genuinely helpful.

Repository with @EntityGraph
public interface DepartmentRepository extends JpaRepository<Department, Long> {

    // Attempt 1: @EntityGraph to eagerly fetch both collections
    @EntityGraph(attributePaths = {"employees", "projects"})
    @Query("SELECT d FROM Department d")
    List<Department> findAllWithEmployeesAndProjects();
}

// What you expect:
//   50 departments, each with ~20 employees and ~10 projects
//   Result: 50 Department objects, fully populated
//
// What actually happens depends on the collection type.

This looks clean. It reads as a straightforward instruction: "fetch all departments, and include their employees and projects." The Spring Data JPA documentation shows exactly this pattern. The annotation takes an array of attribute paths. You list the associations you want. Hibernate obliges.

The documentation does not mention Cartesian products. It does not warn you that listing two collection paths in the same @EntityGraph will generate a SQL query whose result set grows as the product of the collection cardinalities rather than their sum. This omission is, I must confess, the part that troubles me most — because the documentation is where developers go to learn the correct pattern, and the pattern as documented produces the problem.

What happens next depends entirely on whether your collections are typed as List or Set. Both paths lead to trouble, but they arrive there by different routes.

With List: MultipleBagFetchException

If both collections use List<>, Hibernate refuses to execute the query.

The exception
// If both collections use List<> (as shown above):
//
// org.hibernate.loader.MultipleBagFetchException:
//     cannot simultaneously fetch multiple bags: [
//         com.example.Department.employees,
//         com.example.Department.projects
//     ]
//
// Hibernate refuses to execute the query at all.
// This is Hibernate protecting you from the Cartesian product.
//
// A "bag" in Hibernate terminology is an unordered collection
// that allows duplicates — which is what List<> maps to by default.
// Fetching two bags simultaneously produces a Cartesian product
// that Hibernate cannot correctly de-duplicate.
//
// The exception fires at application startup if you use
// FetchType.EAGER, or at query time with @EntityGraph.

Hibernate's term for an unordered List collection is a "bag" — a collection that permits duplicates and has no inherent ordering. When you JOIN FETCH two bags simultaneously, the resulting Cartesian product creates duplicate entries that Hibernate cannot reliably de-duplicate while preserving bag semantics. If Employee #42 appears in the result set 10 times (once for each project in its department), Hibernate cannot distinguish between "10 copies of the same employee from a Cartesian product" and "10 genuinely duplicate employee entries in a bag." The semantics of bag do not permit that distinction.
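The distinction is visible in plain Java collections — this is a sketch of the semantics, not of Hibernate internals. A List happily holds ten copies of the same id; a Set collapses them. The latter is exactly the contract Hibernate needs before de-duplication becomes legal:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class BagSemantics {
    public static void main(String[] args) {
        // Employee #42 arrives ten times in the Cartesian result set,
        // once per project in its department.
        List<Long> bag = new ArrayList<>();
        Set<Long> set = new HashSet<>();
        for (int i = 0; i < 10; i++) {
            bag.add(42L);
            set.add(42L);
        }
        // A bag must preserve duplicates — Hibernate cannot tell
        // Cartesian copies apart from genuinely repeated entries.
        System.out.println(bag.size()); // 10
        // A set promises uniqueness, so collapsing copies is allowed.
        System.out.println(set.size()); // 1
    }
}
```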

The Hibernate User Guide documents this restriction. A MultipleBagFetchException is thrown at query execution time (or at startup if you use FetchType.EAGER) to prevent silently incorrect results.

This exception is annoying. It interrupts your development flow with a stack trace when you expected a query result. I understand the impulse to make it go away as quickly as possible.

But the exception is correct behavior. It is Hibernate declining to produce results it cannot guarantee are accurate. The real danger — the one that costs you production incidents at 2 AM — is what happens when you suppress it.

With Set: the silent Cartesian explosion

The overwhelmingly popular fix — at time of writing, the top answer on every relevant Stack Overflow question — is to change List to Set.

The Set workaround
// The common "fix" from Stack Overflow: change List to Set.
@Entity
@Table(name = "departments")
public class Department {

    @Id
    @GeneratedValue(strategy = GenerationType.SEQUENCE)
    private Long id;

    @Column(nullable = false)
    private String name;

    @OneToMany(mappedBy = "department", fetch = FetchType.LAZY)
    private Set<Employee> employees = new HashSet<>();  // List -> Set

    @OneToMany(mappedBy = "department", fetch = FetchType.LAZY)
    private Set<Project> projects = new HashSet<>();    // List -> Set
}

// MultipleBagFetchException disappears.
// The query runs.
// Everyone marks the Stack Overflow answer as accepted.
//
// Nobody checks what SQL Hibernate actually generates.

The MultipleBagFetchException goes away because Set is not a bag. Sets have uniqueness semantics, so Hibernate allows the simultaneous fetch. The query executes. Tests pass. Reviews approve. The pull request merges.

Nobody checks what SQL Hibernate actually generates.
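Checking is cheap. In a Spring Boot application, the stock logging switches print every statement Hibernate emits — a sketch using the standard Spring and Hibernate property names:

```properties
# application.properties — see the SQL Hibernate actually runs
logging.level.org.hibernate.SQL=DEBUG
# pretty-print the statements
spring.jpa.properties.hibernate.format_sql=true
# collect query and entity-load counters you can inspect in tests
spring.jpa.properties.hibernate.generate_statistics=true
```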

Here is the SQL:

Generated SQL
-- What Hibernate generates for the @EntityGraph query:
SELECT
    d.id          AS d_id,
    d.name        AS d_name,
    e.id          AS e_id,
    e.name        AS e_name,
    e.role        AS e_role,
    e.department_id AS e_dept,
    p.id          AS p_id,
    p.title       AS p_title,
    p.status      AS p_status,
    p.department_id AS p_dept
FROM departments d
LEFT OUTER JOIN employees e ON d.id = e.department_id
LEFT OUTER JOIN projects p ON d.id = p.department_id

-- Two independent LEFT JOINs from the same parent table.
-- No join condition between employees and projects.
-- PostgreSQL does exactly what you asked: it pairs every
-- employee row with every project row for each department.

Two LEFT OUTER JOINs from the same parent table to two independent child tables. There is no join condition between employees and projects — they share only the parent. PostgreSQL does exactly what the SQL specifies: it pairs every employee row with every project row within each department.

Allow me to be precise about what "pairs every employee with every project" means in practice. For a department with 20 employees and 10 projects, the database generates 200 rows — every possible (employee, project) combination. Every employee appears 10 times. Every project appears 20 times. The department itself appears 200 times. None of these combinations represent a real relationship. They are an artifact of the SQL structure, not the data model.

The result is a full Cartesian product between the two collections, scoped per parent. The data is technically correct — Hibernate will de-duplicate it. But the work has already been done. PostgreSQL generated it. The JDBC driver transferred it. The JVM allocated memory for it. And then 85% of it was thrown away.

What EXPLAIN ANALYZE reveals

Numbers settle this. Here is the query plan for a modest dataset: 50 departments, 1,000 employees, 500 projects.

EXPLAIN ANALYZE
-- EXPLAIN ANALYZE on a dataset of:
--   50 departments, 1,000 employees (~20 per dept), 500 projects (~10 per dept)

QUERY PLAN
--------------------------------------------------------------------
 Hash Left Join  (cost=42.75..4218.50 rows=10000 width=184)
                 (actual time=0.412..89.241 rows=10000 loops=1)
   Hash Cond: (d.id = p.department_id)
   ->  Hash Left Join  (cost=22.50..172.50 rows=1000 width=136)
                        (actual time=0.198..4.821 rows=1000 loops=1)
         Hash Cond: (d.id = e.department_id)
         ->  Seq Scan on departments d  (cost=0.00..1.50 rows=50 width=72)
                   (actual time=0.009..0.031 rows=50 loops=1)
         ->  Hash  (cost=15.00..15.00 rows=1000 width=64)
                   (actual time=0.142..0.142 rows=1000 loops=1)
               ->  Seq Scan on employees e  (cost=0.00..15.00 rows=1000 width=64)
                         (actual time=0.004..0.078 rows=1000 loops=1)
   ->  Hash  (cost=8.00..8.00 rows=500 width=48)
             (actual time=0.091..0.091 rows=500 loops=1)
         ->  Seq Scan on projects p  (cost=0.00..8.00 rows=500 width=48)
                   (actual time=0.003..0.042 rows=500 loops=1)
 Planning Time: 0.284 ms
 Execution Time: 91.847 ms

-- 10,000 rows returned. From 50 departments.
-- 50 departments x 20 employees x 10 projects = 10,000 rows.
-- The actual data: 50 + 1,000 + 500 = 1,550 rows across three tables.
-- The result set: 10,000 rows. A 6.5x amplification.

10,000 rows. From three tables containing 1,550 rows total. The row count at each stage:

Join step             Row count   Multiplier   Explanation
Departments (base)    50          -            All departments
+ Employees (1:N)     1,000       20x          ~20 employees per department
+ Projects (1:N)      10,000      10x          ~10 projects per department, crossed with employees

The first LEFT JOIN (departments to employees) is a clean 1:N expansion — 50 rows become 1,000. That is expected and correct. Each department appears once per employee. The result set grows linearly with the child table cardinality.

The second LEFT JOIN (departments to projects) is where the Cartesian product forms. Each of the 1,000 employee rows gets paired with all 10 projects for that department. 1,000 rows become 10,000. The employees and projects are independent — they have no relationship to each other — but the flattened result set treats every combination as a distinct row.

Expected vs. actual row counts
-- What you expected to transfer over the wire:
--   50 departments  = 50 rows
--   1,000 employees = 1,000 rows
--   500 projects    = 500 rows
--   Total: 1,550 rows
--
-- What actually transferred:
--   10,000 rows, each containing ALL columns from ALL three tables.
--
-- Every employee row is duplicated 10 times (once per project).
-- Every project row is duplicated 20 times (once per employee).
-- Every department row is duplicated 200 times.
--
-- At scale:
--   200 departments x 50 employees x 30 projects = 300,000 rows
--   Instead of 200 + 10,000 + 6,000 = 16,200 rows
--
-- The Cartesian product grows as the PRODUCT of collection sizes.
-- The separate-query approach grows as the SUM.

At production scale, the numbers become ruinous. 200 departments with 50 employees and 30 projects each: 300,000 rows instead of 16,200. The query that worked fine in development with 5 departments takes 45 seconds in production. And the growth is multiplicative — doubling the average employees per department doubles the result set, and adding a third collection multiplies it again.

Three collections: where the math becomes alarming

Two collections produce a Cartesian product. Three collections produce a Cartesian product of a Cartesian product. The amplification is not additive. It is exponential in the number of collections.

Three-collection Cartesian
// What happens with THREE collections?
// The Cartesian grows as the product of ALL collection sizes.

@Entity
public class Department {

    @OneToMany(mappedBy = "department")
    private Set<Employee> employees;      // ~20 per dept

    @OneToMany(mappedBy = "department")
    private Set<Project> projects;        // ~10 per dept

    @OneToMany(mappedBy = "department")
    private Set<Budget> budgetEntries;    // ~12 per dept
}

// With @EntityGraph fetching all three:
//   50 x 20 x 10 x 12 = 120,000 rows
//   Actual data: 50 + 1,000 + 500 + 600 = 2,150 rows
//   Amplification factor: 55.8x
//
// Each additional collection MULTIPLIES, not adds.
// Two collections: 10,000 rows (6.5x amplification)
// Three collections: 120,000 rows (55.8x amplification)
// Four collections: you should be questioning your life choices.

I raise this not as an academic exercise but because I have seen it in production. An entity with four @OneToMany collections, all listed in a single @NamedEntityGraph, producing over two million rows for a query that should have returned 3,000. The application was allocated 4GB of heap. The query consumed 1.8GB of it before de-duplication reduced the result to the 3,000 entities that were actually needed.

The rule is simple: each additional collection in the same @EntityGraph multiplies the result set by the average cardinality of that collection. Two collections with 20 and 10 children: 200x per parent. Three collections with 20, 10, and 12 children: 2,400x per parent. The amplification factor grows as the product of all collection cardinalities.

The hidden cost after the query returns

The Cartesian product does not just hurt PostgreSQL. Hibernate pays a second tax after receiving the result set.

Hibernate de-duplication
// After receiving 10,000 rows, Hibernate must de-duplicate them
// into 50 Department objects. This is why you changed List to Set.
//
// With Set<Employee> and Set<Project>, Hibernate uses the entity
// identity (the @Id field) to collapse duplicate rows. The 200
// copies of each employee get reduced to one. The result is correct.
//
// But the damage is already done:
//
// 1. PostgreSQL generated 10,000 rows (CPU, memory, I/O)
// 2. JDBC driver transferred 10,000 rows over the wire (~1.6 MB)
// 3. Hibernate allocated 10,000 row objects before de-duplication
// 4. GC pressure from ~8,450 discarded temporary objects
//
// The entities are correct. The performance is not.
//
// And if you forgot to override equals/hashCode on your entities?
// The Set uses object identity. Hibernate creates 10,000 distinct
// Employee instances. Your department.getEmployees().size() returns
// 200 instead of 20. Silent data duplication, no exception.

This is the insidious part. The entities you receive are correct. Your department.getEmployees().size() returns 20. Your department.getProjects().size() returns 10. Every unit test passes. Every integration test passes. The Cartesian product is invisible at the application level — it manifests only as slower queries, higher memory consumption, and increased GC pressure.

Unless, of course, you forgot to implement equals and hashCode.

equals/hashCode for Set collections
// CRITICAL: If you use Set collections, you MUST implement
// equals/hashCode correctly. The default Object identity
// will NOT de-duplicate Hibernate's Cartesian result.

@Entity
public class Employee {

    @Id
    @GeneratedValue(strategy = GenerationType.SEQUENCE)
    private Long id;

    // ...

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Employee)) return false;
        Employee that = (Employee) o;
        // Use the business key or ID, NOT all fields
        return id != null && id.equals(that.id);
    }

    @Override
    public int hashCode() {
        // Constant hashCode for Hibernate proxies compatibility.
        // Yes, this means all entities land in the same hash bucket.
        // The Set still works — it just degrades to O(n) lookups.
        // For entity collections, this is the accepted pattern.
        // See: Vlad Mihalcea's JPA entity equality guide.
        return getClass().hashCode();
    }
}

Without proper equals/hashCode, the Set uses object identity. Every one of the 10,000 rows produces a distinct Employee instance. Your department ends up with 200 employees instead of 20. No exception. No warning. Just wrong data, silently, in production.

The constant hashCode pattern shown above — returning getClass().hashCode() — is the standard Hibernate recommendation. It sacrifices hash distribution (all entities in one bucket, O(n) lookups) for compatibility with Hibernate proxies and detached entities. For entity collections of typical size, this is an acceptable trade-off. For details, see Vlad Mihalcea's comprehensive guide to JPA entity equality.
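Why id-based equality matters can be demonstrated without Hibernate at all. This is a deliberate simplification — real hydration goes through the persistence context — but the Set semantics are the point: 200 Cartesian rows carrying only 20 distinct ids collapse to 20 entries only if equals and hashCode say they should.

```java
import java.util.HashSet;
import java.util.Set;

public class EntityEquality {

    // Id-based equality, as in the pattern above.
    static final class Employee {
        final Long id;
        Employee(Long id) { this.id = id; }
        @Override public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof Employee)) return false;
            Employee that = (Employee) o;
            return id != null && id.equals(that.id);
        }
        @Override public int hashCode() { return getClass().hashCode(); }
    }

    // No overrides: default Object identity.
    static final class NaiveEmployee {
        final Long id;
        NaiveEmployee(Long id) { this.id = id; }
    }

    // Simulate hydrating `rows` Cartesian rows that carry only
    // `distinctIds` distinct employee ids, each row a fresh instance.
    static int sizeWithIdEquality(int rows, int distinctIds) {
        Set<Employee> set = new HashSet<>();
        for (int row = 0; row < rows; row++)
            set.add(new Employee((long) (row % distinctIds)));
        return set.size();
    }

    static int sizeWithIdentity(int rows, int distinctIds) {
        Set<NaiveEmployee> set = new HashSet<>();
        for (int row = 0; row < rows; row++)
            set.add(new NaiveEmployee((long) (row % distinctIds)));
        return set.size();
    }

    public static void main(String[] args) {
        System.out.println(sizeWithIdEquality(200, 20)); // 20 — duplicates collapse by id
        System.out.println(sizeWithIdentity(200, 20));   // 200 — silent duplication
    }
}
```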

The memory arithmetic, since we are being thorough

Allow me to quantify the memory impact, because "GC pressure" is easy to dismiss as a vague concern. It is not vague. It is arithmetic.

Memory allocation breakdown
// Memory impact of the Cartesian product on a real-world entity.
// Assumptions: average row width 184 bytes (from EXPLAIN output).

// ---- With Cartesian (single @EntityGraph, both collections) ----
//
// JDBC ResultSet allocation:
//   10,000 rows x 184 bytes = ~1.8 MB raw data
//   10,000 ResultSet row objects = ~640 KB object headers
//   Total wire + JDBC: ~2.4 MB
//
// Hibernate entity hydration (before de-dup):
//   10,000 Employee instances x ~96 bytes = ~960 KB
//   10,000 Project instances x ~80 bytes  = ~800 KB
//   10,000 Department refs x ~16 bytes    = ~160 KB
//   Total entity allocation: ~1.9 MB
//
// After de-duplication:
//   1,000 Employee instances retained
//   500 Project instances retained
//   50 Department instances retained
//   ~8,450 temporary objects become GC garbage
//
// Grand total allocated: ~4.3 MB for 1,550 entities
// Retained: ~0.5 MB
// Waste ratio: 88%
//
// ---- With separate queries ----
//
// Query 1: 1,000 rows x 136 bytes = ~136 KB
// Query 2: 500 rows x 96 bytes = ~48 KB
// Total: ~184 KB. No waste. No GC pressure.
//
// At 200 departments x 50 employees x 30 projects:
//   Cartesian: ~130 MB allocated, ~3.5 MB retained
//   Separate: ~5.2 MB allocated, ~5.2 MB retained

The waste ratio of 88% deserves emphasis. For every byte of useful data your application retains, it allocated and discarded seven bytes of Cartesian duplicates. Those discarded objects are not free — they consume allocation bandwidth, fill young generation spaces, and trigger minor GC pauses. In a request-heavy service processing dozens of these queries per second, the cumulative GC overhead is measurable in your p99 latency.

At 200 departments, the Cartesian version allocates 130MB per query execution. If this is a frequently called endpoint — a department listing page, an admin dashboard — you are generating 130MB of garbage per request. That is the kind of allocation pattern that turns a 50ms p50 into a 500ms p99 when GC pauses start compounding.

The separate-query approach allocates 5.2MB and retains all of it. No waste. No GC overhead. The memory profile is flat and predictable.

The correct fix: separate queries for separate collections

The fix is simple once you accept the premise: two independent collections should be fetched in two independent queries. The extra round trip costs 0.5ms. The Cartesian product costs 88ms. The math is not ambiguous.

Three approaches to sequential fetching
// The correct fix: fetch each collection in a separate query.
// Three approaches, same result.

// Approach 1: Two separate @EntityGraph queries
public interface DepartmentRepository extends JpaRepository<Department, Long> {

    @EntityGraph(attributePaths = {"employees"})
    @Query("SELECT d FROM Department d")
    List<Department> findAllWithEmployees();

    @EntityGraph(attributePaths = {"projects"})
    @Query("SELECT d FROM Department d WHERE d IN :departments")
    List<Department> findAllWithProjects(@Param("departments") List<Department> depts);
}

// In your service:
List<Department> departments = departmentRepository.findAllWithEmployees();
departmentRepository.findAllWithProjects(departments);
// Hibernate's persistence context merges the results automatically.

// Approach 2: Hibernate's @Fetch(FetchMode.SUBSELECT)
@Entity
public class Department {

    @OneToMany(mappedBy = "department")
    @Fetch(FetchMode.SUBSELECT)
    private Set<Employee> employees = new HashSet<>();

    @OneToMany(mappedBy = "department")
    @Fetch(FetchMode.SUBSELECT)
    private Set<Project> projects = new HashSet<>();
}
// SUBSELECT fires a second query using IN (SELECT id FROM ...)
// when the collection is first accessed. No Cartesian product.

// Approach 3: @BatchSize for controlled batch fetching
@Entity
public class Department {

    @OneToMany(mappedBy = "department")
    @BatchSize(size = 25)
    private Set<Employee> employees = new HashSet<>();

    @OneToMany(mappedBy = "department")
    @BatchSize(size = 25)
    private Set<Project> projects = new HashSet<>();
}
// Loads collections in batches of 25 parent IDs.
// 50 departments = 2 batches per collection = 4 extra queries.

All three approaches produce the same result: each collection is loaded in its own query, avoiding the Cartesian product. The differences are in granularity and control.

Approach 1 (two @EntityGraph queries) gives you full control. You choose exactly when and which collections load. The persistence context automatically merges results — the second query's projects attach to the same Department instances loaded by the first query. This is my recommendation for most cases. It is explicit, testable, and easy to reason about.

Approach 2 (SUBSELECT) is the lowest-ceremony option. Hibernate fires a subselect query when you first access the collection. The downside: the subselect uses IN (SELECT id FROM departments WHERE ...), re-executing the parent query as a subquery. For simple parent queries, this is fine. For complex parent queries with expensive WHERE clauses, it can be slower than the explicit approach. I should also note that SUBSELECT is a Hibernate-specific annotation — it is not part of the JPA standard. If portability across JPA implementations matters to you (it rarely does in practice, but the caveat is worth stating), this approach binds you to Hibernate.

Approach 3 (BatchSize) loads collections in batches. For 50 departments with a batch size of 25, Hibernate fires 2 batch queries per collection (4 total). This is optimal when you need lazy loading semantics but want to avoid N+1 queries. The batch size is configurable globally via hibernate.default_batch_fetch_size or per-collection via @BatchSize. A batch size of 25-50 is a sensible default for most applications.
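The query-count arithmetic for @BatchSize is worth making explicit. This is the idealized model — one IN(...) query per batch of parent ids; real Hibernate may pad batch sizes — but the ceiling division is the whole trick:

```java
public class BatchMath {

    // Queries Hibernate fires for the collections under @BatchSize:
    // one IN(...) query per batch of parent ids, per collection.
    static int batchQueries(int parents, int batchSize, int collections) {
        int batchesPerCollection = (parents + batchSize - 1) / batchSize; // ceil
        return batchesPerCollection * collections;
    }

    public static void main(String[] args) {
        // 50 departments, batch size 25, two collections
        System.out.println(batchQueries(50, 25, 2)); // 4
    }
}
```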

"The ORM did not fail. It did exactly what was asked. It was simply asked poorly."

— from You Don't Need Redis, Chapter 3: The ORM Tax

What the database sees with separate queries

Comparing the PostgreSQL execution plans side by side makes the case unambiguous.

Separate queries: SQL and timing
-- What PostgreSQL sees with separate queries:

-- Query 1: Departments + Employees (clean 1:N join, no Cartesian)
SELECT d.id, d.name, e.id, e.name, e.role, e.department_id
FROM departments d
LEFT OUTER JOIN employees e ON d.id = e.department_id;
-- 1,000 rows. 2.1 ms.

-- Query 2: Departments + Projects (clean 1:N join, no Cartesian)
SELECT d.id, d.name, p.id, p.title, p.status, p.department_id
FROM departments d
LEFT OUTER JOIN projects p ON d.id = p.department_id;
-- 500 rows. 1.4 ms.

-- Total: 1,500 rows across 2 queries. 3.5 ms.
-- vs. 10,000 rows in 1 query. 91.8 ms.
--
-- 26x faster. 6.7x fewer rows transferred. 2 queries instead of 1.
-- The extra round trip costs ~0.5ms. The Cartesian product costs 88ms.

Two queries, 1,500 total rows, 3.5ms. One query, 10,000 rows, 91.8ms. The separate queries are 26x faster and transfer 6.7x fewer rows over the wire. Each individual query is a clean 1:N join that PostgreSQL handles efficiently with hash joins and sequential scans.

The "one query is always faster than two queries" intuition does not hold when the single query contains a Cartesian product. Network round-trip latency is measured in microseconds on a local network and single-digit milliseconds across availability zones. Cartesian products are measured in orders of magnitude of wasted work.

I should be honest about when the single-query approach does win: when both collections are very small. If every department has 2 employees and 1 project, the Cartesian produces 2 rows per department instead of 3 — a 33% reduction versus the sum. At that scale, the single query is genuinely faster because the Cartesian overhead is negligible and you save a round trip. But collections that small rarely justify eager fetching at all. If you are eagerly loading a collection with 2 entries, you might question whether the eager fetch is worth the annotation.
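The break-even condition can be written down precisely: per parent, the single query transfers e × p rows, the separate queries e + p. A sketch of the comparison in plain Java:

```java
public class BreakEven {

    // True when the single Cartesian query transfers fewer rows
    // per parent than one query per collection would.
    static boolean singleQueryTransfersLess(int employees, int projects) {
        return (long) employees * projects < (long) employees + projects;
    }

    public static void main(String[] args) {
        System.out.println(singleQueryTransfersLess(2, 1));   // true  — 2 rows vs 3
        System.out.println(singleQueryTransfersLess(20, 10)); // false — 200 rows vs 30
    }
}
```

The product beats the sum only when at least one collection has zero or one element — which is why the single-query win is confined to trivially small collections.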

JOIN FETCH in JPQL has the same problem

If you bypass @EntityGraph and write JPQL directly, the Cartesian product appears just the same.

JPQL JOIN FETCH
-- You might think JPQL gives you more control. It does — barely.

-- This JPQL:
SELECT d FROM Department d
  JOIN FETCH d.employees
  JOIN FETCH d.projects

-- Generates exactly the same SQL:
SELECT d.*, e.*, p.*
FROM departments d
LEFT OUTER JOIN employees e ON d.id = e.department_id
LEFT OUTER JOIN projects p ON d.id = p.department_id

-- JOIN FETCH is syntactic sugar over LEFT OUTER JOIN.
-- Two JOIN FETCHes to independent collections produce
-- the same Cartesian product as two @EntityGraph paths.
-- The MultipleBagFetchException still applies with List.
-- Switching to Set still produces the silent Cartesian.

-- The only JPQL escape is to write two separate queries:
SELECT d FROM Department d JOIN FETCH d.employees
SELECT d FROM Department d JOIN FETCH d.projects
-- Which is exactly what the separate-query approach does.

JOIN FETCH is syntactic sugar over LEFT OUTER JOIN. Two JOIN FETCH clauses to independent collections generate the same dual-LEFT-JOIN SQL as @EntityGraph with two attribute paths. The MultipleBagFetchException still applies with List. Changing to Set still produces the silent Cartesian.

There is no JPQL syntax that fetches two independent collections without a Cartesian product in a single query. The only escape within JPQL is writing two separate queries — which is what the separate-query approach does with repository methods.

The Criteria API: a third syntax, same Cartesian

For thoroughness — and because I have seen developers reach for the Criteria API assuming it offers more control — the same problem reproduces identically.

Criteria API Cartesian
// The Criteria API has the same Cartesian problem.
// Every JPA mechanism that builds a single SELECT with
// multiple collection joins produces this result.

CriteriaBuilder cb = entityManager.getCriteriaBuilder();
CriteriaQuery<Department> cq = cb.createQuery(Department.class);
Root<Department> root = cq.from(Department.class);

// Fetching both collections via Criteria API
root.fetch("employees", JoinType.LEFT);
root.fetch("projects", JoinType.LEFT);

List<Department> departments = entityManager
    .createQuery(cq)
    .getResultList();

// Generated SQL: identical dual LEFT JOIN.
// Cartesian product: identical.
// MultipleBagFetchException with List: identical.
//
// @EntityGraph, JPQL JOIN FETCH, and Criteria fetch()
// are three syntaxes for the same underlying Hibernate
// fetch strategy. They all produce the same SQL plan.

@EntityGraph, JPQL JOIN FETCH, and the Criteria API's fetch() method are three syntaxes for the same underlying Hibernate fetch strategy. They all instruct Hibernate to generate LEFT JOINs. Multiple LEFT JOINs to independent collections produce a Cartesian product. The syntax you use to express the instruction does not change the SQL that results from it.

This is worth stating explicitly because the instinct when encountering a problem with one API is to try a different API. In this case, switching from @EntityGraph to JPQL to Criteria is rearranging deck chairs. The fix is structural: fetch one collection per query.

@NamedEntityGraph and subgraphs: deeper graphs, worse Cartesians

The @NamedEntityGraph annotation lets you define reusable fetch graphs at the entity level. It is a convenient way to centralize your eager-loading configuration. It does not, however, change the SQL generation strategy.

@NamedEntityGraph
// @NamedEntityGraph offers the same mechanism at the entity level.
// It does NOT solve the Cartesian problem — it just moves the
// graph definition from the repository to the entity class.

@Entity
@NamedEntityGraph(
    name = "Department.withEmployeesAndProjects",
    attributeNodes = {
        @NamedAttributeNode("employees"),
        @NamedAttributeNode("projects")
    }
)
public class Department {
    // ... same entity, same problem
}

// Repository:
@EntityGraph(value = "Department.withEmployeesAndProjects")
@Query("SELECT d FROM Department d")
List<Department> findAllFull();

// The @NamedEntityGraph controls WHAT to fetch eagerly.
// It does not control HOW Hibernate fetches it.
// The "how" is always: LEFT JOIN all specified paths
// into a single query. Multiple paths = Cartesian product.
//
// The fix is the same: split the graph into separate queries,
// one collection per query.

And then there are subgraphs — the mechanism for eagerly loading nested collections through a parent. This is where the amplification becomes truly alarming.

Subgraph Cartesian
// Subgraphs make it even worse.
// Loading a nested collection through the parent multiplies
// the Cartesian product at each level.

@Entity
@NamedEntityGraph(
    name = "Department.deep",
    attributeNodes = {
        @NamedAttributeNode(value = "employees", subgraph = "emp-tasks"),
        @NamedAttributeNode("projects")
    },
    subgraphs = {
        @NamedSubgraph(name = "emp-tasks",
            attributeNodes = @NamedAttributeNode("tasks"))
    }
)
public class Department { /* ... */ }

// Three levels of LEFT JOIN:
// departments -> employees -> tasks
//            \-> projects
//
// 50 depts x 20 employees x 5 tasks x 10 projects = 50,000 rows
// Actual data across 4 tables: 50 + 1,000 + 5,000 + 500 = 6,550
// Amplification: 7.6x
//
// This is where ORMs stop being helpful and start being dangerous.

Each level of nesting adds another LEFT JOIN, and another multiplication factor. A three-level graph with modest collection sizes — 20 employees, 5 tasks each, 10 projects — produces 1,000 rows per parent entity. At 50 parent entities, that is 50,000 rows from four tables containing 6,550 rows of actual data: a 7.6x amplification.

I am not suggesting you should never use subgraphs. They have legitimate uses when the nested collection is small (a user with 2-3 roles, each role with 5-6 permissions). But the deeper the graph, the more important it becomes to verify the generated SQL and measure the actual row counts. The compounding effect of each additional LEFT JOIN makes the Cartesian growth non-obvious until you check.
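The row arithmetic above, spelled out with the same collection sizes (50 departments, 20 employees each, 5 tasks per employee, 10 projects per department):

```java
public class SubgraphRowMath {
    public static void main(String[] args) {
        long depts = 50, empsPerDept = 20, tasksPerEmp = 5, projsPerDept = 10;

        // One SELECT joining both branches: per department, the
        // (employees x tasks) rows are repeated once per project row.
        long cartesianRows = depts * empsPerDept * tasksPerEmp * projsPerDept;

        // What the four tables actually hold.
        long actualRows = depts                          // 50 departments
                + depts * empsPerDept                    // 1,000 employees
                + depts * empsPerDept * tasksPerEmp      // 5,000 tasks
                + depts * projsPerDept;                  // 500 projects

        System.out.printf("%d rows fetched for %d rows of data (%.1fx)%n",
                cartesianRows, actualRows, (double) cartesianRows / actualRows);
    }
}
```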

Pagination and collection fetches: the worst of both worlds

If you combine pagination with multi-collection @EntityGraph fetching, you arrive at what I can only describe as the infrastructural equivalent of a controlled demolition that was not, in fact, controlled.

Pagination with collection fetch
// Pagination makes Cartesian products WORSE, not better.
// You might think LIMIT would constrain the damage. It doesn't.

// This JPQL, executed with a Pageable (i.e., with
// firstResult/maxResults set):
SELECT d FROM Department d
  JOIN FETCH d.employees
  JOIN FETCH d.projects
  ORDER BY d.name

// Hibernate detects the collection fetch and issues:
// HHH90003004: firstResult/maxResults specified with collection
// fetch; applying in memory!
//
// Translation: Hibernate fetches ALL rows (the full Cartesian),
// loads them into memory, de-duplicates, THEN applies LIMIT/OFFSET
// in Java. Not in SQL. In Java.
//
// For 200 departments with 50 employees and 30 projects:
//   300,000 rows fetched from PostgreSQL
//   300,000 rows materialized in JVM memory
//   Hibernate de-duplicates to 200 Department objects
//   THEN takes the first 20 (your page size)
//
// You asked for page 1 of 20 departments.
// PostgreSQL sent you 300,000 rows.
// Hibernate used ~130 MB of heap to pick 20 of them.

Hibernate cannot apply SQL-level LIMIT/OFFSET to a query that contains collection JOINs, because the row count in the result set does not correspond to the entity count. A department with 20 employees produces 20 rows — applying LIMIT 20 would give you one department, not twenty. So Hibernate falls back to the only strategy available: fetch everything, de-duplicate in memory, then paginate the resulting entity list in Java.
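A toy illustration of that row-versus-entity mismatch, with hypothetical data (two departments, 20 employees each): applying LIMIT 20 to the joined rows trims rows, not entities, and yields a single department.

```java
import java.util.*;

public class LimitOnJoinedRows {
    // One joined result row: (department id, employee name).
    record Row(long deptId, String employee) {}

    public static void main(String[] args) {
        // Two departments, 20 employees each -> 40 joined rows.
        List<Row> joined = new ArrayList<>();
        for (long d = 1; d <= 2; d++)
            for (int e = 0; e < 20; e++)
                joined.add(new Row(d, "emp" + e));

        // "LIMIT 20" applied at the SQL level would cut the joined rows...
        List<Row> limited = joined.stream().limit(20).toList();

        // ...leaving only ONE distinct department, not twenty.
        long distinctDepts = limited.stream().map(Row::deptId).distinct().count();
        System.out.println(distinctDepts + " department(s) in the first 20 rows");
    }
}
```

This is exactly why Hibernate refuses to push the limit into SQL and falls back to in-memory pagination instead.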

This means your paginated query — the one that was supposed to return 20 departments — materializes the entire Cartesian product of all departments, all their employees, and all their projects. In memory. On every page request.

The HHH90003004 warning that Hibernate logs is accurate but easy to miss in a sea of application logs. It states exactly what is happening: "firstResult/maxResults specified with collection fetch; applying in memory." If you see this warning, treat it as a production incident waiting to happen.

The fix is the same: fetch the paginated parent entities first (without collection joins), then load collections in separate queries for just the entities on the current page.
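A minimal pure-Java sketch of that two-phase pattern (the sorted/skip/limit stream stands in for a Pageable query on the parent table alone; the grouping stands in for a second query with WHERE department_id IN (:ids) — all names here are illustrative, not a real repository API):

```java
import java.util.*;
import java.util.stream.Collectors;

public class PageThenFetch {
    record Dept(long id, String name) {}
    record Emp(long deptId, String name) {}

    public static void main(String[] args) {
        List<Dept> allDepts = new ArrayList<>();
        List<Emp> allEmps = new ArrayList<>();
        for (long d = 1; d <= 200; d++) {
            allDepts.add(new Dept(d, String.format("dept%03d", d)));
            for (int e = 0; e < 50; e++) allEmps.add(new Emp(d, "emp" + e));
        }

        // Phase 1: paginate the PARENTS only. No collection join, so
        // LIMIT/OFFSET apply at the SQL level. Page 0, size 20.
        List<Dept> page = allDepts.stream()
                .sorted(Comparator.comparing(Dept::name))
                .skip(0).limit(20)
                .toList();

        // Phase 2: load children for JUST the 20 parents on this page.
        Set<Long> ids = page.stream().map(Dept::id).collect(Collectors.toSet());
        Map<Long, List<Emp>> empsByDept = allEmps.stream()
                .filter(e -> ids.contains(e.deptId()))
                .collect(Collectors.groupingBy(Emp::deptId));

        // 20 parent rows + 1,000 child rows transferred, instead of the
        // full in-memory Cartesian over all 200 departments.
        long childRows = empsByDept.values().stream().mapToLong(List::size).sum();
        System.out.println(page.size() + " departments, " + childRows + " employee rows");
    }
}
```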

The full comparison

Every approach, with its trade-offs, for convenient reference:

Approach                                   Queries              Rows transferred   Time      Correctness
Single @EntityGraph (both collections)     1                    10,000             91.8 ms   Requires Set + equals/hashCode
Two separate @EntityGraph queries          2                    1,500              3.5 ms    Always correct
@Fetch(FetchMode.SUBSELECT)                3                    1,550              4.2 ms    Always correct
@BatchSize(size = 25)                      5                    1,550              5.8 ms    Always correct
Gold Lapel (auto-detection + rewrite)      1 (rewritten to 2)   1,500              ~3.8 ms   Always correct

The single @EntityGraph approach is the worst performer by a wide margin — and the only one that requires careful equals/hashCode implementation to avoid silent data corruption. Every alternative is both faster and safer.
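On the equals/hashCode point: a widely used JPA-safe pattern is to compare by identifier only, treat an unassigned id as never-equal, and keep hashCode constant per class so the hash does not change when the database assigns the id at flush time. A plain-class sketch (no JPA annotations shown; the Employee class here is illustrative):

```java
public class Employee {
    private Long id;          // assigned by the database at flush time
    private String name;

    public Employee(Long id, String name) { this.id = id; this.name = name; }
    public Long getId() { return id; }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Employee other)) return false;
        // A null id means "not persisted yet": equal only by reference.
        return id != null && id.equals(other.getId());
    }

    @Override
    public int hashCode() {
        // Constant per class: the hash must not change when the id is
        // assigned later, or the entity gets lost inside a HashSet.
        return getClass().hashCode();
    }

    public static void main(String[] args) {
        Employee a = new Employee(1L, "Ada");
        Employee b = new Employee(1L, "Ada, same row");
        Employee c = new Employee(null, "transient");
        System.out.println(a.equals(b));                              // true
        System.out.println(c.equals(new Employee(null, "transient"))); // false
    }
}
```

The constant hashCode costs some HashSet bucket distribution, but it is the price of an entity whose identity is stable across the transient-to-persistent transition.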

Detecting Cartesian products in existing applications

The most dangerous Cartesian products are the ones you do not know about. They were added months ago, worked fine on small development datasets, and slowly deteriorate as production data grows. By the time someone notices the slowness, the query has been in production long enough that nobody remembers why it was written that way.

From the PostgreSQL side

Diagnostic queries for pg_stat_statements
-- Detect Cartesian products from @EntityGraph queries in production.
-- The signature: multi-table LEFT JOINs with multiplicative row counts.

SELECT
    left(query, 100) AS query_prefix,
    calls,
    rows / NULLIF(calls, 0) AS avg_rows_per_call,
    round(mean_exec_time::numeric, 2) AS avg_ms,
    round(total_exec_time::numeric, 0) AS total_ms
FROM pg_stat_statements
WHERE query ILIKE '%left%join%left%join%'  -- case-insensitive: Hibernate emits lowercase SQL
  AND rows / NULLIF(calls, 0) > 500
  AND calls > 10
ORDER BY total_exec_time DESC
LIMIT 20;

-- What to look for:
--   avg_rows_per_call >> expected base table count
--   Two or more LEFT JOINs to different tables from the same parent
--   No join condition between the child tables (Cartesian indicator)
--
-- Cross-reference with EXPLAIN ANALYZE:
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT d.*, e.*, p.*
FROM departments d
LEFT JOIN employees e ON d.id = e.department_id
LEFT JOIN projects p ON d.id = p.department_id;

-- The Hash Left Join node will show actual rows far exceeding
-- the sum of both child tables. That is your Cartesian.

The signature is distinctive: a query with multiple LEFT JOINs from the same parent table, returning a row count that is the product rather than the sum of the child table cardinalities. In pg_stat_statements, look for queries where avg_rows_per_call exceeds the base table row count by an order of magnitude.

From the Hibernate side

Hibernate statistics
# Enable Hibernate statistics to detect Cartesian waste.
# application.properties:
spring.jpa.properties.hibernate.generate_statistics=true
logging.level.org.hibernate.stat=DEBUG

# What you'll see in the logs for a Cartesian query:
#
# Session Metrics {
#     1234567 nanoseconds spent acquiring 1 JDBC connections;
#     89241000 nanoseconds spent executing 1 statements;
#     0 nanoseconds spent executing 0 flushes;
#     10000 JDBC rows fetched;      <-- This number
#     1550 entities loaded;          <-- vs. this number
# }
#
# When "JDBC rows fetched" vastly exceeds "entities loaded",
# you are looking at a Cartesian product being de-duplicated.
# The ratio tells you the amplification factor.
#
# 10,000 / 1,550 = 6.45x amplification
#
# WARNING: Do not leave statistics enabled in production.
# The overhead is measurable. Enable, measure, disable.

The gap between "JDBC rows fetched" and "entities loaded" is the Cartesian amplification factor, rendered as two plain numbers in your application logs. If you fetch 10,000 JDBC rows and load 1,550 entities, 8,450 rows were Cartesian duplicates that Hibernate allocated, inspected, and discarded.

On the Hibernate side, enabling hibernate.generate_statistics=true temporarily reveals the entity fetch counts. If Hibernate reports fetching 10,000 Employee entities but your Set contains 1,000 after de-duplication, 90% of the work was wasted on the Cartesian product. The EXPLAIN ANALYZE guide covers how to read the query plan nodes that reveal the multiplication point.

An honest counterpoint: when the Cartesian product is acceptable

I have spent considerable effort explaining why multi-collection @EntityGraph queries are dangerous. In the interest of completeness — and because a waiter who overstates his case is no waiter at all — I should acknowledge the scenarios where the Cartesian product is genuinely tolerable.

Very small collections. If both collections consistently have 2-3 entries, the Cartesian product is 4-9 rows per parent instead of the 4-6 rows a split fetch would transfer. At that scale, the absolute overhead is negligible — a few extra rows, a few extra microseconds. The separate-query approach is still technically superior, but the difference is measured in microseconds rather than milliseconds. If the code is clearer with a single @EntityGraph, the trade-off may be worthwhile.

Internal tooling with bounded datasets. An admin panel that displays 10 departments with their employees and projects — never more, because the company has 10 departments — will never grow into a problem. The Cartesian product is bounded by the domain, not the code. The risk is low.

One-time batch operations. If the query runs once during a nightly batch job and the result fits comfortably in memory, the Cartesian overhead is a few extra seconds in a process that takes minutes. Optimizing it would be correct but not impactful.

The danger is not in using a Cartesian product knowingly. The danger is in using one unknowingly, in a codepath that will eventually face production-scale data. If you have consciously evaluated the collection cardinalities and decided the overhead is acceptable, carry on. If you changed List to Set because Stack Overflow told you to and never checked the generated SQL — that is the scenario this guide exists to address.

Where a proxy detects what the ORM cannot

The fundamental problem with @EntityGraph Cartesian products is visibility. Hibernate does not warn you. The SQL is technically valid. The results are correct (assuming proper equals/hashCode). The only symptom is a query that takes 90ms instead of 4ms — and at small dataset sizes, even that difference is invisible.

# What Gold Lapel sees at the proxy layer:
#
# Incoming query:
#   SELECT d.*, e.*, p.*
#   FROM departments d
#   LEFT JOIN employees e ON d.id = e.department_id
#   LEFT JOIN projects p ON d.id = p.department_id
#
# Gold Lapel's query analyzer detects:
#   - Two LEFT JOINs from the same parent table
#   - No join predicate between employees and projects
#   - Historical row count: 10,000 (from pg_stat_statements)
#   - Expected non-Cartesian row count: ~1,500
#   - Amplification factor: 6.7x
#
# Action: rewrite to sequential fetches.
#
# The application sends one query.
# Gold Lapel executes two:
#   1. departments LEFT JOIN employees  -> 1,000 rows
#   2. departments LEFT JOIN projects   -> 500 rows
# Then merges the results at the proxy layer before
# returning to Hibernate. Same result set shape.
# Same JDBC ResultSet contract. Zero application changes.
#
# The rewrite triggers when:
#   - Amplification factor exceeds 2x (configurable)
#   - Query has been observed at least 5 times
#   - Both child tables produce >10 rows per parent
#
# Net effect: 10,000 rows -> 1,500 rows.
# 91ms -> ~4ms. No query changes. No @EntityGraph refactoring.

Gold Lapel sits between Hibernate and PostgreSQL, analyzing query patterns at the wire protocol level. When it detects a multi-collection LEFT JOIN producing a Cartesian product — identified by the amplification factor between expected and actual row counts — it can rewrite the query into sequential fetches transparently.

The application sends one query. Gold Lapel executes two. The merged result returns to Hibernate in the same format as the original Cartesian result set. The JDBC contract is preserved. The @EntityGraph annotation stays unchanged. The difference is that PostgreSQL produces 1,500 rows instead of 10,000, and the query completes in 4ms instead of 92ms.

This is not a substitute for writing correct queries. The strategies outlined earlier — separate @EntityGraph queries, SUBSELECT, @BatchSize — are the proper architectural fixes, and I recommend implementing them in any codebase you actively maintain. The proxy handles the queries you have not fixed yet, the queries in third-party libraries you cannot modify, and the queries that will be introduced by future developers who have not read this guide.

Fix the @EntityGraph queries you know about. Split multi-collection fetches into separate queries. Implement equals/hashCode on every entity that participates in a Set collection. And for the Cartesian products that will appear in the future — the ones added by developers who have not read this guide, on entities that do not exist yet — a proxy provides the safety net that the ORM does not.

Change the JDBC connection string. Keep your entity graphs.


The adjacent question — how Spring handles pagination when JOIN FETCH is involved — has its own treatment. I have written a guide on Spring Data JPA pagination with JOIN FETCH that explains why Hibernate applies LIMIT in memory and the three ways to prevent it.