SQL Query Optimization & Index Strategies

Write efficient database queries using EXPLAIN ANALYZE, index design, and query patterns that scale.

Most backend bottlenecks stem from inefficient database queries. Understanding how to optimize queries can yield 10-100x performance improvements without changing your application code.

The EXPLAIN PLAN

Start by understanding what your database is actually doing.

EXPLAIN ANALYZE
SELECT * FROM users
WHERE age > 30
ORDER BY created_at DESC
LIMIT 10;

Output:

Seq Scan on users  (cost=0.00..1234.50 rows=500)
  Filter: (age > 30)
  -> Sort  (cost=500.00..550.00 rows=500)
        Sort Key: created_at DESC

Key Metrics:

Cost: Estimated units of I/O and CPU.
Rows: Estimated number of rows returned.
Actual vs. Estimated: If vastly different, optimizer is guessing wrong.

Index Fundamentals

Single-Column Index

CREATE INDEX idx_users_age ON users(age);

When to use:

Filtering: WHERE age > 30
Sorting: ORDER BY age
High cardinality (many distinct values)

When NOT to use:

Low cardinality (few distinct values, e.g., gender: M/F)
Columns that are frequently NULL

Composite Index (Multi-Column)

CREATE INDEX idx_users_age_created ON users(age, created_at DESC);

Best For:

-- This query benefits from composite index
SELECT * FROM users
WHERE age > 30
ORDER BY created_at DESC
LIMIT 10;

The index serves double duty: filter by age, then provide sorted results.

Covering Index

Include all columns needed without reading the main table.

CREATE INDEX idx_users_email_name ON users(email) INCLUDE (full_name, created_at);

Query:

SELECT email, full_name, created_at FROM users WHERE email = 'john@example.com';

Advantage: No need to read the main table—everything is in the index!

Query Patterns That Scale

Bad: SELECT *

-- ❌ Bad: Transfers all columns, even unused ones
SELECT * FROM users WHERE age > 30;

Good: Explicit Columns

-- ✅ Good: Transfers only needed columns
SELECT id, name, email FROM users WHERE age > 30;

Bad: N+1 Queries

# ❌ Bad: 1 query for users, then 1000 queries for each user's posts
users = db.query("SELECT * FROM users LIMIT 1000")
for user in users:
    posts = db.query(f"SELECT * FROM posts WHERE user_id = {user.id}")
    # Process posts

Good: JOIN or Batch Load

# ✅ Good: Single query with JOIN
users_with_posts = db.query("""
    SELECT u.*, p.title, p.content
    FROM users u
    LEFT JOIN posts p ON u.id = p.user_id
    LIMIT 1000
""")

Bad: Functions in WHERE Clause

-- ❌ Bad: Function evaluation prevents index usage
SELECT * FROM users WHERE UPPER(email) = 'JOHN@EXAMPLE.COM';

Good: Store Normalized Data

-- ✅ Good: Index can be used
SELECT * FROM users WHERE email = 'john@example.com';

Pagination at Scale

Bad: OFFSET

-- ❌ Inefficient: Scans first 100,000 rows and throws them away
SELECT * FROM users ORDER BY id LIMIT 20 OFFSET 100000;

Complexity: $O(n)$ where $n$ is the offset.

Good: Keyset Pagination

-- ✅ Efficient: Uses index to jump directly
SELECT * FROM users
WHERE id > last_seen_id
ORDER BY id
LIMIT 20;

Complexity: $O(\log n)$ + $O(\text)$

Partitioning for Large Tables

When a table grows to billions of rows, even with indexes, scans become slow. Partition the table.

Date-Based Partitioning

CREATE TABLE events (
    id BIGINT,
    created_at TIMESTAMP,
    event_type TEXT
) PARTITION BY RANGE (DATE(created_at)) (
    PARTITION p_2026_06 VALUES LESS THAN ('2026-07-01'),
    PARTITION p_2026_07 VALUES LESS THAN ('2026-08-01'),
    ...
);

Benefit: Queries on recent events only scan the latest partition.

Connection Pooling

Opening a new database connection is expensive (~100ms).

Without pooling: Each request opens a connection → resource exhaustion. With pooling: Connections are reused → much faster.

# Using pgbouncer or SQLAlchemy
engine = create_engine(
    "postgresql://...",
    pool_size=20,
    max_overflow=10
)

Config:

pool_size: Connections kept warm
max_overflow: Temporary extra connections if needed
pool_recycle: Recycle connections every 3600 seconds (prevents stale connections)