
Single Table Inheritance on PostgreSQL: The Performance Pitfalls at Scale, and the Escape Routes

Your users table has 13 columns. Your Guest model uses 7 of them. The other 6 are NULL, nine million times over. Shall we discuss the implications?

The Waiter of Gold Lapel · Published Mar 5, 2026 · Updated Mar 20, 2026 · 28 min read
The illustration inherited several NULL attributes from its parent class. Resolving the bloat.

Good evening. I understand you are using Single Table Inheritance.

A fine pattern in its youth. Rails made it effortless: add a type column, create some subclasses, and ActiveRecord handles the rest. The first time you see polymorphism materialize from a single column — Admin.where(department: 'Engineering') routing to the right class with no configuration — the elegance is genuine. I do not begrudge anyone the initial delight.

At 50,000 rows, STI is invisible. At 500,000, you might notice the table is wider than expected. At 5 million, your DBA is asking pointed questions about NULL ratios. At 12 million, with 8 subtypes and 40 columns, the situation has become architectural. The pattern that bought you speed in week one is now the single largest contributor to your database performance profile, and it is not contributing favourably.

STI does not scale gracefully on PostgreSQL. This is not a matter of opinion — it is a matter of physics. Every row carries the full column set. Every index spans every type. Every buffer page contains a mix of types that will never be queried together. The costs compound quietly, like interest on a debt you forgot you had, until they become the dominant line item in your performance budget.

I should be forthcoming: there are escape routes. Several of them, in fact, each with different trade-offs and different levels of disruption. I have opinions about which are best. But first, it helps to understand precisely what is going wrong, because the symptoms of STI distress are more varied than most teams realize, and the correct escape route depends on which symptom is causing the most damage.

The anatomy of an STI table at scale

Consider a representative example. Three types in one table — modest, by production standards. I have seen STI tables with twelve subtypes. Those cases require stronger remedies than we will discuss here, and possibly a frank conversation about data modelling philosophy.

# The classic Rails STI setup — one table, many types:
class User < ApplicationRecord
end

class Admin < User
end

class Moderator < User
end

class Guest < User
  # No permissions, no profile, no billing.
  # But the table has columns for all of them.
end

The subclass structure looks clean. The Guest model is particularly telling: it exists as a distinct type with distinct behaviour, yet it carries almost no unique data. No permissions, no profile details, no billing attributes; only an invite token and its expiry. In terms of the data it actually stores, it is little more than a User with an email and a name.

Yet it shares a table with models that have substantial type-specific data:

# What the migration looks like:
class CreateUsers < ActiveRecord::Migration[7.1]
  def change
    create_table :users do |t|
      t.string  :type,             null: false  # STI discriminator
      t.string  :email,            null: false
      t.string  :name

      # Admin-only columns
      t.string  :department
      t.integer :access_level
      t.jsonb   :admin_permissions

      # Moderator-only columns
      t.string  :assigned_region
      t.integer :escalation_limit
      t.boolean :can_ban_users

      # Guest-only columns
      t.string  :invite_token
      t.datetime :invite_expires_at

      t.timestamps
    end
  end
end

# For an Admin row, 5 columns are NULL (the Moderator and Guest ones).
# For a Guest row, 6 columns are NULL (the Admin and Moderator ones).
# Multiply by 12 million rows. That is a great deal of NULL.

The migration creates 13 columns (plus the implicit id). An Admin row uses 8 of them. A Moderator row uses 8. A Guest row uses 7: type, email, name, the two invite columns, and the timestamps. The remaining columns sit at NULL, and PostgreSQL still pays for them. Not an outrageous price per row. But 12 million rows is rather a lot of rows.

The fundamental issue is that STI conflates logical separation with physical co-location. Admin, Moderator, and Guest are logically distinct entities — different attributes, different access patterns, different query frequencies, different update cadences. STI forces them into a single physical structure because that was the most convenient thing to do in a migration file. Convenience is a fine reason to make a decision at 10,000 rows. It is a less fine reason at 12 million.

How bad is the NULL bloat, actually?

This is the question I am asked most often, and the answer invariably surprises. Not because the NULLs themselves are expensive — PostgreSQL handles NULL storage quite efficiently — but because the cumulative effects on row width, page density, and buffer utilization are larger than intuition suggests.

-- How much space are those NULLs actually costing you?
-- Check the null fraction per column:
SELECT attname,
       null_frac,
       avg_width,
       n_distinct
FROM pg_stats
WHERE tablename = 'users'
ORDER BY null_frac DESC;

--  attname           | null_frac | avg_width | n_distinct
-- -------------------+-----------+-----------+------------
--  invite_token      |      0.97 |        32 |         -1
--  invite_expires_at |      0.97 |         8 |         -1
--  assigned_region   |      0.92 |        14 |         37
--  escalation_limit  |      0.92 |         4 |         10
--  can_ban_users     |      0.92 |         1 |          2
--  department        |      0.85 |        18 |         24
--  access_level      |      0.85 |         4 |          5
--  admin_permissions |      0.85 |       312 |         -1

-- Columns that are 85-97% NULL across 12 million rows.
-- Each one widens the average row and wastes shared_buffers.

The null_frac column tells the story. When assigned_region is 92% NULL, 11 million of your 12 million rows carry that column for no reason. The avg_width shows the average size of non-null values — 14 bytes for region names, 312 bytes for the admin permissions JSONB. Those columns exist in the table definition for every row, contributing to wider rows and fewer tuples per page.

Allow me to explain the storage mechanics, because they matter:

-- How PostgreSQL stores NULLs internally:
-- Each row has a HeapTupleHeaderData (23 bytes fixed) plus a null bitmap.
-- The null bitmap is 1 bit per column, rounded up to whole bytes
-- (the tuple header as a whole is then padded to MAXALIGN).
-- A table with 13 columns = 2 bytes of null bitmap per row.

-- With 12 million rows, that's 24MB just for null bitmaps.
-- Not the problem. The problem is what the NULLs represent:
-- columns that exist in the row's storage allocation but carry no value.

-- PostgreSQL stores rows as a contiguous sequence of field values.
-- NULL fields don't store data, but they DO affect:
--   1. Row alignment and padding (datums align to type boundaries)
--   2. The tuple's overall width (planner uses avg_width for cost estimation)
--   3. Toast threshold (wider rows toast sooner, fragmenting storage)

-- Measure actual table bloat from NULLs:
SELECT pg_size_pretty(pg_total_relation_size('users'))  AS total_size,
       pg_size_pretty(pg_relation_size('users'))         AS heap_size,
       pg_size_pretty(pg_indexes_size('users'))          AS index_size,
       n_live_tup,
       n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'users';

--  total_size | heap_size | index_size | n_live_tup  | n_dead_tup
-- ------------+-----------+------------+-------------+------------
--  4.2 GB     | 2.8 GB    | 1.4 GB     | 12,000,000  | 340,000

A table with 5 shared columns and 8 type-specific columns (most of which are NULL for any given row) can be 30-50% wider than the same data stored in type-specific tables. On a 12 million row table, that translates to gigabytes of unnecessary bloat in both the heap and every index that covers those columns.

An honest boundary on NULL cost

I should note that NULL storage in PostgreSQL is genuinely cheap on a per-row basis. A NULL value does not store any datum — it is tracked in a bitmap that costs 1 bit per column. The null bitmap itself adds minimal overhead. If your STI table has 4 subtypes and 6 columns total, the NULL waste is negligible. You should not restructure your schema for a table with 200,000 rows and two subtypes. That would be solving a problem you do not have.

The cost becomes material when three factors converge: many type-specific columns (8+), large row counts (millions), and non-null values that are wide (JSONB, text, varchar). The 312-byte admin_permissions JSONB column that is 85% NULL across 12 million rows — that is where the waste accumulates. A boolean column that is 92% NULL costs you almost nothing. The calculation is always specific to your data.
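A quick way to rank columns by what they actually occupy is to multiply each column's non-null fraction by its average width. A back-of-the-envelope sketch in plain Ruby, using the pg_stats figures above (the helper name is my own convenience, not anything from PostgreSQL):

```ruby
# Rough resident-byte estimate per column: NULLs store no datum,
# so only the non-null values occupy heap space.
ROWS = 12_000_000

def resident_megabytes(null_frac:, avg_width:, rows: ROWS)
  ((1 - null_frac) * rows * avg_width / 1024.0 / 1024.0).round
end

resident_megabytes(null_frac: 0.85, avg_width: 312)  # admin_permissions: ~536 MB
resident_megabytes(null_frac: 0.92, avg_width: 1)    # can_ban_users: ~1 MB
```

The JSONB column dominates by more than two orders of magnitude, which is exactly the wide, mostly-NULL profile worth restructuring.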

The more significant cost is not the NULLs per se but the page density reduction. Wider rows mean fewer rows per 8KB page. Fewer rows per page means more pages to scan, more buffer cache consumed, and more I/O for every sequential operation. At 12 million rows, the difference between 6 rows per page and 4 rows per page is 1 million additional pages. That is 8GB of additional heap to store, cache, vacuum, and backup.
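The page arithmetic above is easy to reproduce. A small sketch, assuming uniform packing (real pages also carry headers, line pointers, and fill-factor slack):

```ruby
# Heap pages needed to hold the table at a given tuple density.
PAGE_BYTES = 8192
ROWS = 12_000_000

def heap_pages(rows_per_page, rows: ROWS)
  (rows.to_f / rows_per_page).ceil
end

extra_pages = heap_pages(4) - heap_pages(6)  # 1,000,000 extra pages
extra_bytes = extra_pages * PAGE_BYTES       # ~8.2 GB of additional heap
```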

Why does the query planner get STI wrong?

PostgreSQL's query planner is, in my considered opinion, one of the finest cost-based optimizers available. It makes remarkably good decisions with remarkably limited information. But STI tables present it with a problem that is genuinely difficult: correlated columns with skewed distributions, all sharing a single set of statistics.

-- The query planner struggles with skewed type distributions.
-- You have 12M users: 1.8M Admins, 960K Moderators, 9.2M Guests.

EXPLAIN ANALYZE
SELECT * FROM users
WHERE type = 'Admin' AND department = 'Engineering';

-- What the planner estimates (with outdated statistics):
--   Seq Scan on users
--     Filter: ((type = 'Admin') AND (department = 'Engineering'))
--     Rows Removed by Filter: 11,994,200
--     Planning Time: 0.4 ms
--     Execution Time: 4,218 ms

-- The planner chose a sequential scan because it estimated
-- the type = 'Admin' filter would return 4M rows (33% of 12M,
-- assuming uniform distribution across 3 types).
-- At that estimate, 4 million index probes look costlier than
-- one pass over the heap, so sequential it went.
-- (The actual count is 1.8M, but the planner never sees that.)

-- Meanwhile, the index on (type, department) exists
-- but the planner doesn't trust it for "large" result sets.

The core issue: PostgreSQL collects statistics per column, not per column-per-type. When it sees WHERE type = 'Admin' AND department = 'Engineering', it estimates the selectivity of each predicate independently and multiplies them. If the statistics target is low (the default is 100), the planner may not have enough histogram buckets to accurately model the skew.

But the problem runs deeper than simple selectivity math:

-- The deeper issue: column correlation.
-- The planner assumes type and department are independent.
-- They are not. All Admins have a department. No Guests do.

-- Without extended statistics, PostgreSQL calculates:
--   P(type = 'Admin') = 0.15             (1.8M / 12M)
--   P(department = 'Engineering') = 0.02  (estimated from all rows)
--   P(both) = 0.15 * 0.02 = 0.003        → estimated 36,000 rows

-- But the actual selectivity is:
--   P(department = 'Engineering' | type = 'Admin') = 0.17
--   P(both) = 0.15 * 0.17 = 0.0255       → actual 306,000 rows

-- The planner underestimated by 8.5x.
-- For minority types, the error goes the other direction:
--   Moderator with region 'APAC': estimated 1,200 rows, actual 147.
--   The planner chose a hash join. A nested loop would have been 40x faster.

The result is a planner that consistently overestimates result sizes for minority types and underestimates for majority types. Sequential scans where index scans would be faster. Nested loops where hash joins would be better. Hash joins where nested loops would be better. The wrong plan for the actual data, chosen with full confidence by a planner that has been given misleading statistics.

This is not the planner's fault. The planner is doing excellent work with the information it has. The data model is the problem — it forces the planner to reason about a single table that behaves like three different tables depending on the value of one column.
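The mis-estimation is pure arithmetic, and it is worth seeing in one place. A toy reproduction of the planner's math in Ruby, using this section's row counts (this models the estimate; it is not PostgreSQL code):

```ruby
# Independence assumption vs. correlated reality.
ROWS = 12_000_000
p_admin = 1_800_000.0 / ROWS   # 0.15
p_engineering = 0.02           # selectivity measured over ALL rows

# The planner multiplies the marginals as if they were independent.
estimated = (p_admin * p_engineering * ROWS).round          # 36,000 rows

# But department is correlated with type.
p_engineering_given_admin = 0.17
actual = (p_admin * p_engineering_given_admin * ROWS).round # 306,000 rows

error_factor = actual.fdiv(estimated)                       # 8.5x underestimate
```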

Extended statistics: the partial remedy

-- Force accurate statistics with extended statistics:
CREATE STATISTICS users_type_dept (dependencies, ndistinct)
ON type, department FROM users;

ANALYZE users;

-- Or increase the statistics target for the type column:
ALTER TABLE users ALTER COLUMN type SET STATISTICS 1000;
ANALYZE users;

-- Now the planner knows Admin = 15%, not 33%.

Extended statistics help substantially. They tell the planner about column correlations — that department depends on type, that certain value combinations are far more common than independence would predict. On our benchmark table, adding extended statistics reduced planning errors from 8.5x to under 1.5x.

-- Extended statistics: what they actually track.

-- 'dependencies' tracks functional dependencies:
--   "If type = 'Admin', then department IS NOT NULL with probability 1.0"
-- This lets the planner adjust selectivity when correlated columns appear together.

-- 'ndistinct' tracks the number of distinct value combinations:
--   (type, department) has 24 distinct pairs, not 3 * 24 = 72.
-- This improves GROUP BY estimates.

-- 'mcv' (most common values, PostgreSQL 12+) tracks full value combinations:
CREATE STATISTICS users_type_dept_mcv (dependencies, ndistinct, mcv)
ON type, department FROM users;
ANALYZE users;

-- After ANALYZE, verify the planner uses them:
EXPLAIN ANALYZE
SELECT * FROM users
WHERE type = 'Admin' AND department = 'Engineering';

--   Index Scan using idx_admin_department on users
--     Index Cond: (department = 'Engineering')
--     Filter: (type = 'Admin')
--     Rows Removed by Filter: 0
--     Planning Time: 0.6 ms     -- slightly higher (consulting extended stats)
--     Execution Time: 2.1 ms    -- down from 4,218 ms

-- Check which extended statistics exist:
SELECT stxname, stxkeys, stxkind
FROM pg_statistic_ext
WHERE stxrelid = 'users'::regclass;

But extended statistics are a band-aid on a structural problem. They require explicit creation per column combination. They consume additional planner time to consult. They do not help with join ordering or subquery estimation. And they must be refreshed via ANALYZE, which on a 12 million row STI table is itself a non-trivial operation.

The planner is fighting the data model. Extended statistics give it better weapons for the fight. But the fight should not exist.

Does STI cause lock contention?

Yes, and this is the pitfall that surprises teams the most. It does not appear in EXPLAIN ANALYZE output. It does not show up in slow query logs. It manifests as elevated p99 latencies that resist all query-level optimization, because the problem is not the queries — it is the physical storage.

-- On high-concurrency systems, STI tables become LWLock hotspots.
-- Every query touches the same table, the same indexes, the same
-- buffer pages — regardless of type.

-- Check for buffer contention:
SELECT wait_event_type, wait_event, count(*)
FROM pg_stat_activity
WHERE state = 'active'
  AND wait_event_type = 'LWLock'
GROUP BY wait_event_type, wait_event
ORDER BY count DESC;

-- Common result on an STI table under load:
--  wait_event_type | wait_event         | count
-- -----------------+--------------------+-------
--  LWLock          | BufferContent      |    14
--  LWLock          | BufferMapping      |     8

-- 14 connections waiting for access to the same buffer pages.
-- Admins, Moderators, and Guests share physical pages because
-- they share a physical table. Separate tables would distribute
-- this contention across different buffers.

When all types share a single heap, all types share the same physical pages. An Admin update and a Guest insert may touch the same 8KB buffer page, requiring BufferContent LWLock coordination. With separate tables, these operations would never compete for the same buffer.

The buffer cache problem

The contention extends to your buffer cache. On a healthy system, hot data should fit in shared_buffers and stay there. On an STI table, "hot data" is scattered across pages that are mostly cold for any given query pattern.

-- Visualize the buffer cache distribution for your STI table.
-- This shows how many buffer pages are shared across types.

-- Install pg_buffercache:
CREATE EXTENSION IF NOT EXISTS pg_buffercache;

-- See how much of your STI table sits in shared_buffers:
SELECT c.relname,
       count(*) AS buffers,
       pg_size_pretty(count(*) * 8192) AS cached_size,
       round(100.0 * count(*) /
         (SELECT setting::int FROM pg_settings
          WHERE name = 'shared_buffers'), 2) AS pct_of_cache
FROM pg_buffercache b
JOIN pg_class c ON c.relfilenode = b.relfilenode
WHERE c.relname = 'users'
GROUP BY c.relname;

--  relname | buffers | cached_size | pct_of_cache
-- ---------+---------+-------------+--------------
--  users   |  89,421 | 699 MB      |        17.14

-- 17% of your shared_buffers dedicated to a single table.
-- Admins, Moderators, and Guests are interleaved on these pages.
-- An Admin query that needs 1,000 rows might touch 800 pages
-- because Admin rows are scattered among Guest and Moderator rows.
-- With a dedicated admins table, those 1,000 rows would fit on ~40 pages.

An Admin query that touches 1,000 rows on a mixed-type table might access 800 buffer pages, because Admin rows are interleaved with Guest and Moderator rows at the physical level. The same 1,000 Admin rows on a dedicated table would occupy approximately 40 pages. That is a 20x difference in buffer pressure — and buffer pressure translates directly to cache eviction, which translates directly to disk I/O.
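A toy model makes the 20x figure concrete. The "useful rows per page" values below are chosen to reproduce the example's 800-page and 40-page counts; they are illustrations, not measurements:

```ruby
# Pages touched to read N rows of one type, given how many useful
# rows of that type each 8KB page holds.
def pages_touched(rows_needed, useful_rows_per_page)
  (rows_needed / useful_rows_per_page.to_f).ceil
end

mixed     = pages_touched(1_000, 1.25)  # interleaved STI heap: 800 pages
dedicated = pages_touched(1_000, 25)    # dense admins-only heap: 40 pages
pressure_ratio = mixed / dedicated      # 20x the buffer pressure
```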

On an STI table with 12 million rows and 100 concurrent connections, this combination — LWLock contention on shared pages, cache inefficiency from scattered data, and VACUUM processing all types — can reduce effective throughput by 20-40% compared to type-separated tables. The queries themselves are fast. The infrastructure is slow.

The VACUUM problem nobody warned you about

If I may direct your attention to a consequence of STI that receives insufficient discussion: VACUUM behaviour.

-- STI tables also create VACUUM headaches.
-- VACUUM processes the entire table — all types, all pages.

-- Check vacuum statistics:
SELECT relname,
       n_live_tup,
       n_dead_tup,
       last_vacuum,
       last_autovacuum,
       vacuum_count,
       autovacuum_count
FROM pg_stat_user_tables
WHERE relname = 'users';

-- On an active STI table with mixed write patterns:
--   Guests: high insert rate (signups), low update rate
--   Admins: low insert rate, high update rate (permissions changes)
--   Moderators: moderate insert, moderate update

-- VACUUM must process all 12M rows even if only Admin rows have dead tuples.
-- autovacuum_vacuum_threshold (default 50) + autovacuum_vacuum_scale_factor
-- (default 0.2) means autovacuum triggers at 50 + 0.2 * 12,000,000 = 2,400,050
-- dead tuples. That's a LOT of dead tuples before cleanup starts.

-- Per-type tables would trigger autovacuum much sooner:
-- Admin table (1.8M rows): triggers at 50 + 360,000 = 360,050 dead tuples.
-- Far more responsive to the Admin update pattern.

PostgreSQL's autovacuum triggers based on the total number of dead tuples in a table, using the formula: threshold + scale_factor * n_live_tup. With default settings, a 12 million row table triggers autovacuum at approximately 2.4 million dead tuples. That is a generous amount of dead weight to carry before cleanup begins.

The problem is that different types have different write patterns. Admins might generate 50,000 dead tuples per hour from permission updates. Guests might generate 200,000 per hour from signup flows. Moderators, comparatively few. But autovacuum sees only the aggregate — and the aggregate may not reach the threshold until the table is carrying hundreds of megabytes of dead tuples.

With separate tables, autovacuum would trigger much sooner on each table individually. An admins table with 1.8 million rows triggers at 360,050 dead tuples — 6.6x sooner than the combined threshold. The result is tighter dead tuple management, less bloat accumulation, and more predictable VACUUM duration.

You can tune autovacuum_vacuum_scale_factor per table to compensate, and I would recommend doing so if you are retaining STI. A scale factor of 0.05 instead of 0.2 brings the threshold down to 600,050 dead tuples — still not ideal, but considerably better than the default.
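The trigger arithmetic is worth keeping as a helper while you tune per-table settings. The formula is autovacuum's documented one; the row counts are this article's:

```ruby
# Dead-tuple count at which autovacuum kicks in:
#   autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * n_live_tup
def autovacuum_trigger(live_tuples, threshold: 50, scale_factor: 0.2)
  (threshold + scale_factor * live_tuples).round
end

autovacuum_trigger(12_000_000)                      # combined STI table: 2,400,050
autovacuum_trigger(1_800_000)                       # dedicated admins table: 360,050
autovacuum_trigger(12_000_000, scale_factor: 0.05)  # tuned STI table: 600,050
```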

The escape routes

Five strategies, from least to most invasive. Each eliminates different symptoms. None is universally superior — the right choice depends on how much schema change you can absorb, how many types you have, and whether your bottleneck is storage, planner accuracy, or contention. I have opinions about which are best for various situations. I shall not be shy about sharing them.

Strategy 1: JSONB for type-specific attributes

# Escape route 1: JSONB for type-specific attributes.
# Keep shared columns as real columns, move the rest to JSONB.

class CreateUsersWithJsonb < ActiveRecord::Migration[7.1]
  def change
    create_table :users do |t|
      t.string   :type,       null: false
      t.string   :email,      null: false
      t.string   :name
      t.jsonb    :properties, null: false, default: {}
      t.timestamps
    end

    add_index :users, "(properties->>'department')",
              where: "type = 'Admin'",
              name: 'idx_admin_department'
  end
end

# In the model:
class Admin < User
  store_accessor :properties,
    :department, :access_level, :admin_permissions
end

# The table has 5 columns instead of 13.
# No NULLs from type-specific attributes.
# GIN or expression indexes on the JSONB for queries that need them.

This approach keeps STI's single-table simplicity while eliminating the NULL column problem. Shared columns (email, name) remain as proper columns with proper indexes. Type-specific attributes move into a properties JSONB column, accessed via Rails' store_accessor.

The model layer barely changes:

# Adding validation to the JSONB approach:
class Admin < User
  store_accessor :properties,
    :department, :access_level, :admin_permissions

  validates :department, presence: true
  validates :access_level, numericality: {
    only_integer: true, in: 1..5
  }
end

# Database-level validation with a CHECK constraint:
# ALTER TABLE users ADD CONSTRAINT admin_properties_check
#   CHECK (
#     type != 'Admin' OR (
#       properties->>'department' IS NOT NULL AND
#       (properties->>'access_level')::int BETWEEN 1 AND 5
#     )
#   );

# GIN index for complex JSONB queries:
# CREATE INDEX idx_users_properties_gin ON users
#   USING GIN (properties)
#   WHERE type = 'Admin';

# Expression index for equality lookups:
# CREATE INDEX idx_admin_access ON users
#   ((properties->>'access_level')::int)
#   WHERE type = 'Admin';

The trade-off: JSONB expression indexes ((properties->>'department')) are less flexible than B-tree indexes on dedicated columns. You cannot add foreign keys to JSONB fields. Type-specific validation lives in application code or CHECK constraints, not column-level NOT NULL. And the JSONB column itself has overhead — the binary JSONB representation includes key names alongside values, which for many small attributes can be larger than dedicated columns would be.
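That key-name overhead is easy to ballpark: each row's JSONB carries its own copy of every key string. A floor estimate in plain Ruby (JSONB's binary headers add more on top; the 1.8 million figure is this article's Admin population):

```ruby
# Bytes of key names repeated in every Admin row's properties column.
admin_keys = %w[department access_level admin_permissions]
key_bytes_per_row = admin_keys.sum(&:bytesize)              # 39 bytes per row

admin_rows = 1_800_000
total_mb = key_bytes_per_row * admin_rows / 1024.0 / 1024.0 # ~67 MB of key names
```

Sixty-odd megabytes is real, but small next to the mostly-NULL dedicated columns it replaces; that is the trade described above.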

That said, the overhead is nearly always less than 8 dedicated columns that are 85% NULL. And you gain something significant: the table drops from 13 columns to 5. Row width decreases. Page density increases. Buffer cache efficiency improves. The planner has fewer columns to reason about.

This is the strategy I recommend most often. It requires the least disruption, preserves the STI pattern that your application code expects, and solves the NULL bloat problem directly. If your performance bottleneck is storage and page density rather than contention, this may be sufficient on its own.

Strategy 2: PostgreSQL native table inheritance

# Escape route 2: PostgreSQL table inheritance via active_record-mti.
# Each type gets its own physical table. Queries route automatically.
# gem 'active_record-mti'

class CreateUsersWithInheritance < ActiveRecord::Migration[7.1]
  def change
    create_table :users do |t|
      t.string :email, null: false
      t.string :name
      t.timestamps
    end

    # Admins get their own table, inheriting users columns:
    execute <<-SQL
      CREATE TABLE admins (
        department    varchar,
        access_level  integer,
        admin_permissions jsonb
      ) INHERITS (users);
    SQL

    execute <<-SQL
      CREATE TABLE moderators (
        assigned_region   varchar,
        escalation_limit  integer,
        can_ban_users     boolean
      ) INHERITS (users);
    SQL

    # Each table has only the columns it needs.
    # SELECT * FROM users still queries all child tables.
    # SELECT * FROM admins queries only admins.
  end
end

PostgreSQL's native table inheritance creates separate physical tables that share a parent schema. The active_record-mti gem wires this into Rails so that Admin.where(department: 'Engineering') queries only the admins table, while User.all queries across all child tables.

This eliminates NULL bloat, separates physical storage for reduced contention, and gives each type its own indexes, its own VACUUM schedule, and its own buffer pages. Each type is, physically, an independent table with independent performance characteristics.

But I must be forthcoming about the limitations, because they are consequential:

-- PostgreSQL table inheritance: the limitations nobody mentions first.

-- 1. No cross-table unique constraints:
ALTER TABLE users ADD CONSTRAINT users_email_unique UNIQUE (email);
-- This ONLY enforces uniqueness within the parent table.
-- An admin and a moderator CAN have the same email.
-- You need application-level or trigger-based uniqueness.

-- 2. Foreign keys referencing the parent do not see child rows:
CREATE TABLE orders (
    user_id integer REFERENCES users(id)  -- accepted, but misleading
);
-- The constraint checks only rows physically stored in "users".
-- An order pointing at an Admin's id fails the FK check, because
-- that row lives in "admins". You must reference child tables
-- individually or enforce the relationship with triggers.

-- 3. Indexes are NOT inherited:
CREATE INDEX idx_users_email ON users (email);
-- This index exists ONLY on the parent. Child tables need their own:
CREATE INDEX idx_admins_email ON admins (email);
CREATE INDEX idx_moderators_email ON moderators (email);

-- 4. VACUUM and ANALYZE must be managed per child table:
VACUUM users;       -- vacuums only the parent; children are skipped
ANALYZE users;      -- parent + inheritance-tree stats, but children
ANALYZE admins;     --   still need their own per-table ANALYZE,
ANALYZE moderators; --   which autovacuum handles table by table

No cross-table unique constraints means you must enforce email uniqueness at the application level or through triggers. A foreign key referencing the parent never sees rows stored in the children, so your orders.user_id cannot use a standard foreign key constraint. Indexes are not inherited, so you must create and maintain them on each child table individually.

These are not theoretical concerns. They are operational burdens that compound as your team grows. A junior developer who does not know that CREATE INDEX on the parent does not propagate to children will create an index that covers 0% of the data they intend to query.

This works best when types are truly distinct entities with different access patterns, different index requirements, and minimal cross-type querying. If you rarely run User.all and your types share nothing beyond an email address, native inheritance may be the cleanest solution. If you have 14 foreign keys referencing your users table, it will cause more problems than it solves.

Strategy 3: Class Table Inheritance (CTI)

# Escape route 3: Class Table Inheritance (CTI).
# Shared columns in a base table, type-specific columns in join tables.
# Pure Rails — no gems, no PostgreSQL-specific features.

class CreateUsersCTI < ActiveRecord::Migration[7.1]
  def change
    create_table :users do |t|
      t.string :email, null: false
      t.string :name
      t.string :type, null: false
      t.timestamps
    end

    create_table :admin_profiles do |t|
      t.references :user, null: false, foreign_key: true
      t.string     :department
      t.integer    :access_level
      t.jsonb      :admin_permissions
    end

    create_table :moderator_profiles do |t|
      t.references :user, null: false, foreign_key: true
      t.string     :assigned_region
      t.integer    :escalation_limit
      t.boolean    :can_ban_users
    end
  end
end

# class Admin < User
#   has_one :admin_profile, foreign_key: :user_id
#   delegate :department, :access_level, to: :admin_profile
# end
#
# Trade-off: requires JOINs for type-specific data.
# But each table is compact, well-indexed, and bloat-free.

CTI is the relational purist's answer: a base table for shared attributes, separate tables for type-specific attributes, joined by foreign key. No PostgreSQL-specific features, no gems, no NULLs. Each table is compact and well-indexed. Foreign keys work. Unique constraints work. Your data model is normalized.

The cost is JOINs. But let us examine that cost honestly:

-- CTI performance: measuring the JOIN cost.

-- Simple query on the base table (no JOIN needed):
EXPLAIN ANALYZE
SELECT id, email, name FROM users WHERE email = 'admin@company.com';
--   Index Scan using idx_users_email
--   Execution Time: 0.04 ms

-- Query with type-specific data (JOIN required):
EXPLAIN ANALYZE
SELECT u.id, u.email, ap.department, ap.access_level
FROM users u
JOIN admin_profiles ap ON ap.user_id = u.id
WHERE ap.department = 'Engineering';
--   Nested Loop
--     -> Index Scan using idx_admin_profiles_department on admin_profiles ap
--     -> Index Scan using users_pkey on users u
--   Execution Time: 1.2 ms

-- Compare with STI (single table, no JOIN):
-- EXPLAIN ANALYZE
-- SELECT id, email, department, access_level FROM users
-- WHERE type = 'Admin' AND department = 'Engineering';
--   Execution Time: 2.1 ms (with partial index), 4,218 ms (without)

-- CTI with a JOIN: 1.2ms.
-- STI with a partial index: 2.1ms.
-- The JOIN is FASTER because admin_profiles has 1.8M rows, not 12M.
-- The table is compact. The index is compact. The pages are dense.

The JOIN is 1.2ms against the profile table's 1.8 million rows. The equivalent STI query — even with a partial index — is 2.1ms against 12 million rows. The JOIN is faster because the constituent tables are compact, dense, and well-indexed. This result surprises teams who assume JOINs are inherently expensive. JOINs between small, well-indexed tables are cheap. Scans across large, bloated tables are expensive. CTI trades the theoretical overhead of JOINs for the practical benefit of compact storage.

The delegate pattern in Rails hides the JOIN at the application level. admin.department transparently queries the admin_profiles table. The N+1 risk is real — if you load 100 admins and access department on each, you will generate 100 additional queries unless you includes(:admin_profile). But this is a solved problem in Rails, and one your team already knows how to manage.

This works best when you need the strongest relational integrity, when cross-database portability matters, or when type-specific data is accessed separately from shared data. It is the most "boring" solution, and in infrastructure, boring is the highest compliment available.

Strategy 4: Partial indexes on existing STI

-- The immediate fix: partial indexes filtered by type.
-- Works with existing STI — no schema changes required.

-- Instead of a full index on department (12M rows):
CREATE INDEX idx_users_department ON users (department);
-- Size: ~312MB. 85% of entries are NULL. Wasteful.

-- Partial index on department, only for Admins (1.8M rows):
CREATE INDEX idx_admin_department ON users (department)
  WHERE type = 'Admin';
-- Size: ~46MB. Every entry is meaningful. 85% smaller.

-- Partial index for moderator queries:
CREATE INDEX idx_moderator_region ON users (assigned_region)
  WHERE type = 'Moderator';
-- 960K rows indexed instead of 12M. 92% smaller.

-- The query planner uses these when the WHERE clause matches:
EXPLAIN ANALYZE
SELECT * FROM users
WHERE type = 'Admin' AND department = 'Engineering';
--   Index Scan using idx_admin_department
--   Rows: 312   Actual time: 0.04..1.8 ms
-- Down from 4,218ms with the sequential scan.

This is the escape route that requires no schema changes at all. Partial indexes add a WHERE type = 'Admin' clause to the index definition, indexing only the rows of that type. The result: indexes that are 85-92% smaller, dramatically faster to scan, and cheaper to maintain.

For additional performance, covering partial indexes eliminate heap access entirely:

-- Advanced: covering partial indexes for index-only scans.
-- Include all columns the query needs, avoiding heap access entirely.

-- If your most common Admin query fetches email and department:
CREATE INDEX idx_admin_dept_covering ON users (department)
  INCLUDE (email, name)
  WHERE type = 'Admin';

-- Now the planner can serve this query from the index alone:
EXPLAIN ANALYZE
SELECT email, name, department FROM users
WHERE type = 'Admin' AND department = 'Engineering';
--   Index Only Scan using idx_admin_dept_covering
--   Heap Fetches: 0
--   Execution Time: 0.8 ms

-- Covering indexes are larger than simple partial indexes,
-- but they eliminate heap access entirely for matching queries.
-- On an STI table where heap pages are polluted with other types,
-- avoiding the heap is especially valuable.

-- Composite partial indexes for multi-column filters:
CREATE INDEX idx_admin_dept_level ON users (department, access_level)
  WHERE type = 'Admin';

-- Serves queries like:
-- WHERE type = 'Admin' AND department = 'Engineering' AND access_level >= 3

Partial indexes do not fix NULL bloat. They do not reduce table size. They do not eliminate contention. What they do is make queries on individual types fast — often dramatically so — by giving the planner a perfectly sized, perfectly targeted index for each type-specific query pattern.

This is the strategy you deploy today, in production, with a CREATE INDEX CONCURRENTLY that builds the index without blocking writes and requires no application changes. While you plan the longer-term migration to JSONB or CTI, partial indexes keep the household running.
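In a Rails codebase, the same index can ship as a migration. A sketch, using the index name and predicate from the SQL above; note that algorithm: :concurrently requires disabling the migration's DDL transaction:

```ruby
class AddAdminPartialIndexes < ActiveRecord::Migration[7.1]
  # CREATE INDEX CONCURRENTLY cannot run inside a transaction block.
  disable_ddl_transaction!

  def change
    add_index :users, :department,
              where: "type = 'Admin'",
              name: :idx_admin_department,
              algorithm: :concurrently
  end
end
```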

Strategy 5: Declarative partitioning by type

-- Escape route 5 (for PostgreSQL 10+): declarative partitioning by type.
-- Not table inheritance — true partitioning with full constraint exclusion.

-- Create the partitioned table:
CREATE TABLE users (
    id            bigserial,
    type          varchar NOT NULL,
    email         varchar NOT NULL,
    name          varchar,
    department    varchar,
    access_level  integer,
    admin_permissions jsonb,
    assigned_region   varchar,
    escalation_limit  integer,
    can_ban_users     boolean,
    invite_token      varchar,
    invite_expires_at timestamptz,
    created_at    timestamptz NOT NULL DEFAULT now(),
    updated_at    timestamptz NOT NULL DEFAULT now()
) PARTITION BY LIST (type);

-- Create partitions:
CREATE TABLE users_admin PARTITION OF users FOR VALUES IN ('Admin');
CREATE TABLE users_moderator PARTITION OF users FOR VALUES IN ('Moderator');
CREATE TABLE users_guest PARTITION OF users FOR VALUES IN ('Guest');

-- Each partition is a separate physical table.
-- Queries with WHERE type = 'Admin' hit only users_admin.
-- Indexes are per-partition. VACUUM is per-partition.
-- Foreign keys and unique constraints work (PostgreSQL 12+).

-- The null columns still exist in every partition definition,
-- but rows in users_admin never have Guest columns populated.
-- Combine with JSONB (Strategy 1) to eliminate NULLs entirely.

Declarative partitioning (PostgreSQL 10+, mature from 12+) is the most powerful structural option. Unlike table inheritance, partitioned tables support unique constraints across partitions (PostgreSQL 12+), foreign keys (PostgreSQL 12+), and automatic index propagation. Queries with WHERE type = 'Admin' benefit from partition pruning — the planner knows at planning time that only the users_admin partition needs to be scanned.

The trade-off is migration complexity. Converting an existing table to a partitioned table requires creating the partitioned structure, migrating data, and swapping. It cannot be done in-place. But the result is a structure that gives you per-type physical separation, per-type VACUUM schedules, per-type buffer pools, and full relational integrity — the best of both worlds.

If your PostgreSQL version supports it and you are willing to invest in the migration, declarative partitioning is, in my estimation, the most complete solution for STI at scale. Combine it with JSONB (Strategy 1) to eliminate the NULL columns entirely, and you have a data model that scales to hundreds of millions of rows without the pathologies we have been discussing.
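One operational nicety of LIST partitioning: adding a subtype later is a single DDL statement. A sketch with a hypothetical SupportAgent type (the type name is invented for illustration):

```ruby
class AddSupportAgentPartition < ActiveRecord::Migration[7.1]
  def up
    # New rows with type = 'SupportAgent' route here automatically.
    execute <<~SQL
      CREATE TABLE users_support_agent PARTITION OF users
        FOR VALUES IN ('SupportAgent');
    SQL
  end

  def down
    execute "DROP TABLE users_support_agent;"
  end
end
```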

"PostgreSQL is not a database. It is an ecosystem that most teams use as a database."

— from You Don't Need Redis, Chapter 1: Good Evening. We Have a Problem.

Comparing the escape strategies

| Strategy | Schema change | Query performance | NULL bloat | Rails compatibility | Primary trade-off |
| --- | --- | --- | --- | --- | --- |
| JSONB for type-specific attributes | Moderate | Good (with expression indexes) | Eliminated | Native (store_accessor) | JSONB expression indexes less flexible than B-tree |
| PostgreSQL table inheritance (MTI) | Significant | Excellent (separate tables) | Eliminated | Requires active_record-mti gem | No cross-table foreign keys or unique constraints |
| Class Table Inheritance (CTI) | Significant | Good (JOINs required) | Eliminated | Native (has_one + delegate) | Extra JOINs; more complex queries and associations |
| Partial indexes on existing STI | None | Excellent (targeted indexes) | Still present | Transparent | Doesn't reduce table size; treats symptoms, not cause |
| Declarative partitioning by type | Significant | Excellent (partition pruning) | Reduced (combine with JSONB) | Native (PostgreSQL 12+) | Requires PostgreSQL 12+; partition management overhead |

Most teams end up combining strategies: partial indexes for immediate relief, then a gradual migration to JSONB or CTI for the structural fix. Declarative partitioning is the strongest long-term play but demands the most migration effort. The active_record-mti approach is powerful but niche — it suits teams that are fully committed to PostgreSQL and comfortable with its inheritance semantics and limitations.

If I were attending to this matter myself, I would deploy partial indexes today, begin a JSONB migration for the highest-NULL-cost columns next week, and evaluate declarative partitioning if the table is projected to exceed 50 million rows. But every household is different, and the correct strategy depends on your particular constraints.

The honest counterpoint: when STI is fine

I have spent considerable time enumerating the pathologies of STI at scale. It would be dishonest of me not to enumerate the cases where STI is the correct choice and none of the escape routes are warranted.

STI is fine — genuinely, straightforwardly fine — when:

  • You have fewer than 500,000 rows. At this scale, the NULL bloat is measured in megabytes. Buffer contention is negligible. The planner estimates are good enough. The operational simplicity of a single table outweighs the theoretical benefits of separation.
  • You have 2-3 subtypes with minimal type-specific columns. If Admin adds 2 columns and Moderator adds 1 column, the NULL waste is trivial. The table is not meaningfully wider than a non-STI table would be.
  • Your types share most of their columns. If 80% of the columns are used by all types, the type-specific remainder is not worth separating. STI is doing what STI is designed to do: modelling a genuine hierarchy with shared attributes.
  • Your queries rarely filter by type. If most queries are User.where(email: ...) without a type predicate, the planner estimation issues are irrelevant. You are treating the table as a single entity, and a single table is the right structure for a single entity.
  • You are in the first six months of a product. The schema will change. The types will change. The priorities will change. STI gives you maximum flexibility to evolve the model without migrations. Use it. Revisit the decision when the table reaches a size that warrants it.

A waiter who suggests restructuring a perfectly adequate household is not being helpful — he is being officious. STI earned its place in ActiveRecord because it solves a real problem well. The pitfalls emerge at scale, and scale is not guaranteed.

Migrating away from STI without downtime

If you have decided that STI must go — and the evidence at your scale supports that decision — the migration need not be dramatic. A phased approach lets you move incrementally, with partial indexes maintaining performance throughout the transition.

# Migrating from STI to CTI without downtime.
# Phase 1: Create the new tables alongside the existing one.

class AddAdminProfiles < ActiveRecord::Migration[7.1]
  def change
    create_table :admin_profiles do |t|
      t.references :user, null: false, foreign_key: true, index: true
      t.string     :department
      t.integer    :access_level
      t.jsonb      :admin_permissions
      t.timestamps
    end

    add_index :admin_profiles, :department
  end
end

# Phase 2: Backfill the new table (in batches, during low traffic):
# Admin.find_each(batch_size: 5000) do |admin|
#   AdminProfile.create!(
#     user_id: admin.id,
#     department: admin.department,
#     access_level: admin.access_level,
#     admin_permissions: admin.admin_permissions
#   )
# end

# Phase 3: Dual-write (write to both old columns and new table).
# Phase 4: Switch reads to the new table.
# Phase 5: Drop the old columns.

# Total timeline: 2-4 weeks for a team comfortable with zero-downtime migrations.
# The partial indexes keep performance acceptable during the transition.
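If the row-by-row backfill in Phase 2 proves slow at 1.8 million rows, a set-based variant using insert_all can cut the runtime substantially. A sketch against the column names above, not tested against your schema:

```ruby
# Set-based backfill: one INSERT per 5,000 rows instead of 5,000 INSERTs.
# where.missing skips rows already migrated, so the job is safe to re-run.
# Assumes the has_one :admin_profile association exists on Admin.
Admin.where.missing(:admin_profile).in_batches(of: 5000) do |batch|
  rows = batch.pluck(:id, :department, :access_level, :admin_permissions)
              .map do |id, dept, level, perms|
    { user_id: id, department: dept, access_level: level,
      admin_permissions: perms,
      created_at: Time.current, updated_at: Time.current }
  end
  # insert_all bypasses validations and callbacks; timestamps must be
  # supplied explicitly.
  AdminProfile.insert_all(rows) if rows.any?
end
```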

The key insight is that the old and new structures can coexist. During the dual-write phase, every write updates both the STI columns and the new profile table. Reads can switch over table-by-table, type-by-type. If something goes wrong, the STI columns are still there, still current, still indexed.
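A minimal dual-write hook for Phase 3, assuming the has_one :admin_profile association from Phase 1 (the callback shape is illustrative; a service object that wraps both writes works equally well):

```ruby
# Phase 3: every write to the STI columns also updates the profile row,
# keeping both structures current until reads switch over.
class Admin < User
  after_save :sync_admin_profile

  private

  def sync_admin_profile
    profile = admin_profile || build_admin_profile
    profile.update!(
      department: department,
      access_level: access_level,
      admin_permissions: admin_permissions
    )
  end
end
```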

I have seen this migration pattern executed successfully on tables with 30 million rows and 99.99% uptime requirements. The timeline was four weeks: one week for schema creation and backfill, one week for dual-write deployment and monitoring, one week for read switchover, and one week for column cleanup. Not trivial. But not heroic, either.

The partial indexes serve a critical role during this transition. Without them, the STI table's performance degrades under the additional write load of dual-writing. With them, the STI queries remain fast while the new structure takes shape alongside.

The partial indexes write themselves

Here is what makes partial indexes on STI tables especially interesting: the optimal index set is entirely determined by your query traffic. Which types are queried most? Which type-specific columns appear in WHERE clauses? What is the cardinality of each type? The answers are in the queries your application sends, not in your schema documentation or your engineering team's intuition.
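To make that concrete, here is a toy sketch in plain Ruby (not part of any product, and the numbers are invented) that derives partial-index DDL from observed (type, column) filter frequencies, emitting statements only for patterns above a traffic threshold:

```ruby
# Toy sketch: derive partial-index DDL from observed query predicates.
# Input: counts of (type, column) pairs seen in WHERE clauses.
# Output: CREATE INDEX CONCURRENTLY statements for frequent patterns.
def suggest_partial_indexes(predicate_counts, min_share: 0.05)
  total = predicate_counts.values.sum.to_f
  predicate_counts
    .select { |_, count| count / total >= min_share }
    .sort_by { |_, count| -count }
    .map do |(type, column), _|
      name = "idx_#{type.downcase}_#{column}"
      "CREATE INDEX CONCURRENTLY #{name} ON users (#{column}) " \
      "WHERE type = '#{type}';"
    end
end

seen = {
  ["Admin", "department"]          => 730,
  ["Admin", "access_level"]        => 120,
  ["Moderator", "assigned_region"] => 95,
  ["Guest", "invite_token"]        => 20,  # ~2% of traffic: below threshold
}
# Emits DDL for the three patterns above the 5% traffic threshold.
puts suggest_partial_indexes(seen)
```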

Gold Lapel observes this traffic at the proxy level. It sees that 73% of queries on the users table filter by type = 'Admin', that the department column appears in 40% of those queries, and that the planner is choosing sequential scans because the full index is too large to be attractive. It creates WHERE type = 'Admin' partial indexes on the columns that matter — department, access_level, created_at — and the planner immediately starts using them.

It also detects the bloat patterns that STI accelerates: wide rows with high NULL fractions, indexes inflated by type-irrelevant entries, and the buffer contention that results from cramming 12 million mixed-type rows into shared physical pages. The proxy cannot restructure your schema — that is your migration to plan — but it can make the existing schema perform as well as possible while you plan it.

This is, if you will permit me the observation, rather the point of the entire service. The structural migration is your decision, on your timeline, with your risk tolerance. The immediate performance improvement — the partial indexes, the statistics adjustments, the planner hints — that is what a good proxy attends to while you think.

A final accounting

STI is a convenience that becomes a constraint. The transition is not abrupt — it is a gradual accumulation of costs that are individually small and collectively transformative. NULL bloat that wastes 30% of your heap. Planner estimates that are wrong by factors of 3-8x. Buffer contention that adds milliseconds to every concurrent query. VACUUM schedules that cannot respond to type-specific write patterns.

The escape routes exist. Partial indexes provide immediate relief with zero schema risk. JSONB consolidation eliminates NULL bloat while preserving the single-table model. CTI normalizes the data with full relational integrity. Declarative partitioning offers physical separation with none of the limitations of table inheritance.

The right strategy depends on your data, your traffic, and your appetite for migration. But the diagnosis is consistent: if your STI table has grown past the point where its convenience outweighs its costs, the costs will not diminish on their own. They will, with the quiet persistence that characterizes all infrastructure debt, continue to compound.

I would be a poor waiter indeed if I did not point that out.


Terms referenced in this article

The partial indexes that rescue STI tables are only one tool in a rather well-stocked cabinet. I have written a comprehensive guide to PostgreSQL index types — partial, expression, covering, and more — each of which has a role to play when your table structure is less than ideal.