Error Handling & Resilience Patterns

Build fault-tolerant systems using circuit breakers, timeouts, retries, bulkheads, and structured error handling.

In distributed systems, failures are inevitable. Network timeouts, service crashes, database overloads—these are not exceptions but normal occurrences. Resilience patterns help you build systems that gracefully degrade and recover.

The Failure Cascade

Without proper error handling, a single failure can cascade through your entire system:

Service A ---> Service B (SLOW) ---> Service C
     ↓              ↓
 Queues fill up, threads exhaust
     ↓
Service A becomes unresponsive
     ↓
Upstream services also fail

This is a cascading failure: one slow dependency exhausts the threads and queues of every service that waits on it.

Resilience Patterns

1. Timeouts

Set a maximum wait time for external calls. If the response doesn't arrive, fail fast.

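import requests

def get_user(user_id: int) -> User:
    try:
        response = requests.get(
            f"http://user-service/{user_id}",
            timeout=2.0,  # seconds; applies to connecting and to each read
        )
        response.raise_for_status()  # surface 4xx/5xx responses as errors
        return User(**response.json())  # assumes User accepts the JSON fields
    except requests.Timeout:
        return handle_timeout_error()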

Timeouts per Layer:

Network: TCP connection timeout (100-500ms)
Request: HTTP request timeout (2-5 seconds)
Operation: Database query timeout (5-30 seconds)
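
Some calls do not accept a timeout parameter at all. A minimal sketch of enforcing a deadline from the outside, using only the standard library, might look like the following. The helper name call_with_timeout and the pool size are assumptions made for illustration, and the worker thread is not interrupted on timeout: the caller simply stops waiting for it.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared worker pool; the size is a hypothetical value for illustration.
_executor = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn, timeout_s, *args, **kwargs):
    """Run fn in a worker thread and wait at most timeout_s seconds for it."""
    future = _executor.submit(fn, *args, **kwargs)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # cancel() only helps if the call has not started yet; a running
        # call keeps its thread until it returns on its own.
        future.cancel()
        raise TimeoutError(f"call exceeded {timeout_s}s") from None
```

Because the abandoned call still occupies a worker, this approach works best when combined with a native timeout on the underlying socket or driver wherever one is available.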

2. Retries

Transient failures (network blip, temporary overload) often resolve quickly. Retry with exponential backoff.

# `retry` stands for any retry decorator; the parameter names are illustrative.
@retry(max_attempts=3, backoff_factor=2)
def call_external_service():
    return requests.get("http://external-api", timeout=2.0)

# Retry timeline:
# Attempt 1 (fails)
# Wait 1 second
# Attempt 2 (fails)
# Wait 2 seconds
# Attempt 3 (fails)
# Raise error

Exponential Backoff Formula:

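Wait Time = base * 2^(attempt) + jitter, where attempt counts failed attempts starting from 0 (with base = 1 second this gives the 1 s and 2 s waits in the timeline above)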

Jitter: Add randomness to prevent thundering herd (all clients retrying at the same time).
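
The formula can be sketched as a small helper plus a retry loop. This is an illustration under stated assumptions, not a particular library's API: the names backoff_delay and retry_call, the cap parameter, and the uniform jitter range are all hypothetical, and attempts are counted from 0. The loop retries on any exception for brevity; real code would usually retry only on errors known to be transient.

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=30.0, jitter=0.5, rng=random):
    """Delay in seconds after failed attempt number `attempt` (0-based)."""
    delay = min(cap, base * 2 ** attempt)   # exponential growth, capped
    return delay + rng.uniform(0, jitter)   # random jitter spreads clients out


def retry_call(fn, max_attempts=3, sleep=time.sleep, **backoff_kwargs):
    """Call fn until it succeeds or max_attempts attempts have failed."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(backoff_delay(attempt, **backoff_kwargs))
```

Passing sleep as a parameter keeps the loop testable without real waiting, and the cap stops the delay from growing without bound after many attempts.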

When NOT to retry:

400 Bad Request (client error, won't succeed on retry)
401 Unauthorized
404 Not Found
429 Too Many Requests (rate limited: do not retry immediately; wait for the Retry-After interval or back off exponentially)
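
One way to encode these rules is a function that returns how long to wait before retrying, or None when retrying is pointless. The sketch below assumes the server sends the standard HTTP Retry-After header on 429 responses, as either a number of seconds or an HTTP date. The names retry_delay, NON_RETRYABLE, and fallback_s are hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Statuses from the list above that will not succeed on retry.
NON_RETRYABLE = {400, 401, 404}


def retry_delay(status, headers, fallback_s, now=None):
    """Seconds to wait before retrying, or None if the call should not be retried."""
    if status in NON_RETRYABLE:
        return None
    value = headers.get("Retry-After")
    if status == 429 and value:
        if value.isdigit():
            return float(value)  # delay given directly in seconds
        # Otherwise assume an HTTP date and wait until that moment.
        now = now or datetime.now(timezone.utc)
        return max(0.0, (parsedate_to_datetime(value) - now).total_seconds())
    return fallback_s  # e.g. the exponential backoff delay
```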

3. Circuit Breaker

Stop calling a failing service to prevent cascading failures.

States:

Closed: Requests pass through; failures are counted.
Open: Requests fail immediately (don't even try calling the service).
Half-Open: After a cooldown, allow a limited number of trial requests through to test whether the service has recovered.

Closed (Working)
    ↓ (5 failures)
Open (Failing)
    ↓ (Wait 60 sec)
Half-Open (Testing)
    ↓ (Trial requests succeed; a failure reopens the circuit)
Closed (Recovered)

Config:

Failure Threshold: 5 consecutive failures → open
Cooldown: 60 seconds → half-open
Success Threshold: 2 successes in half-open → closed
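
The three states and the configuration above can be sketched as a small class. This is a minimal, single-threaded illustration rather than a production implementation: the class and attribute names are hypothetical, any exception counts as a failure, and a real breaker would need locking and a limit on concurrent trial calls.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown_s=60.0,
                 success_threshold=2, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.success_threshold = success_threshold
        self.clock = clock
        self.state = "closed"
        self.failures = 0    # consecutive failures while closed
        self.successes = 0   # consecutive successes while half-open
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.cooldown_s:
                raise CircuitOpenError("circuit is open")
            self.state = "half_open"  # cooldown elapsed: allow trial calls
            self.successes = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # Any failure while half-open reopens the circuit immediately.
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
            self.failures = 0

    def _on_success(self):
        if self.state == "half_open":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = "closed"
        else:
            self.failures = 0  # a success resets the consecutive-failure count
```

Injecting the clock makes the cooldown testable without waiting 60 seconds.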

4. Bulkhead Pattern

Isolate critical resources to prevent a failure in one area from affecting others.

Thread Pool Isolation:

Request 1 -> [ThreadPool A] -> Service A
Request 2 -> [ThreadPool B] -> Service B
Request 3 -> [ThreadPool C] -> Service C

If Service A is slow, only ThreadPool A's threads are exhausted. ThreadPool B and C remain available for Services B and C.

Isolation Levels:

Thread Pool: Different thread pools per service.
Container: Different processes/containers per service.
Machine: Different machines per service.
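
Thread pool isolation, the lightest of these levels, can be sketched with one bounded executor per downstream dependency. The pool names and sizes below are hypothetical, and the sketch relies on ThreadPoolExecutor's unbounded work queue, so a real system would also cap the number of queued tasks per pool.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# One bounded pool per dependency; names and sizes are illustrative only.
pools = {
    "service_a": ThreadPoolExecutor(max_workers=2, thread_name_prefix="svc-a"),
    "service_b": ThreadPoolExecutor(max_workers=2, thread_name_prefix="svc-b"),
}


def submit(service, fn, *args):
    """Run fn on the pool reserved for `service` and return its Future."""
    return pools[service].submit(fn, *args)
```

If every worker in the service_a pool is stuck on a slow call, tasks submitted for service_b still run, because they never compete for the same threads.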

5. Rate Limiting & Load Shedding

Reject excess requests gracefully to protect your service.

Strategies:

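Token Bucket: Tokens refill at a fixed rate up to a maximum capacity; each request consumes a token, so short bursts are allowed while the average rate stays bounded.
Leaky Bucket: Process requests at a fixed rate, queue excess.
Sliding Window: Count requests over a rolling time window, which avoids the burst that fixed windows allow at each window boundary.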
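
As an illustration of the first strategy, a token bucket fits in a few lines: tokens refill continuously at a fixed rate up to a capacity, and each request spends one. The class below is a single-threaded sketch with hypothetical names; a shared limiter would need a lock or an atomic store.

```python
import time


class TokenBucket:
    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self, cost=1.0):
        """Return True and spend `cost` tokens if the request may proceed."""
        now = self.clock()
        # Refill for the elapsed time, never beyond capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A caller would typically answer with 429 Too Many Requests when allow() returns False.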

Load Shedding: When overloaded, reject low-priority requests to keep the system responsive for critical operations.

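def admit(request, queue_depth, threshold):
    # Shed low-priority work only while the queue is over its threshold.
    if queue_depth > threshold and request.priority == "low":
        return 503  # Service Unavailable
    queue_request(request)  # hypothetical helper: enqueue for processing
    return 202  # Accepted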

Structured Error Handling

Error Classification

Transient: Temporary, will likely succeed on retry (timeout, 503).
Permanent: Won't succeed on retry (400, 404, 401).
Unknown: Can't determine; retry with caution.
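
This classification can be made explicit in code so that retry logic branches on a category instead of on scattered status checks. The sketch below is one possible mapping: the statuses named above (503, 400, 401, 404) come from the text, while the additional codes in each set and the exception types are assumptions that a real service would tune.

```python
from enum import Enum


class ErrorKind(Enum):
    TRANSIENT = "transient"
    PERMANENT = "permanent"
    UNKNOWN = "unknown"


# 503, 400, 401, 404 come from the text; the other codes are assumptions.
TRANSIENT_STATUSES = {408, 502, 503, 504}
PERMANENT_STATUSES = {400, 401, 403, 404, 422}


def classify(error):
    """Classify an HTTP status code (int) or an exception instance."""
    if isinstance(error, int):
        if error in TRANSIENT_STATUSES:
            return ErrorKind.TRANSIENT
        if error in PERMANENT_STATUSES:
            return ErrorKind.PERMANENT
        return ErrorKind.UNKNOWN
    if isinstance(error, (TimeoutError, ConnectionError)):
        return ErrorKind.TRANSIENT  # timeouts and dropped connections
    return ErrorKind.UNKNOWN
```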

Error Response Structure

{
  "error": {
    "code": "SERVICE_UNAVAILABLE",
    "message": "Payment service is temporarily down",
    "retryable": true,
    "retry_after_seconds": 30,
    "trace_id": "abc123"
  }
}
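
A response of this shape can be produced from a small dataclass so that every endpoint emits the same fields. The class name ApiError and the choice to generate a random trace ID when none is supplied are assumptions for illustration; in practice the trace ID would usually come from the request's tracing context.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class ApiError:
    code: str
    message: str
    retryable: bool
    retry_after_seconds: Optional[int] = None
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)

    def to_json(self) -> str:
        # Omit optional fields that were not set.
        body = {k: v for k, v in asdict(self).items() if v is not None}
        return json.dumps({"error": body})


payload = ApiError(
    code="SERVICE_UNAVAILABLE",
    message="Payment service is temporarily down",
    retryable=True,
    retry_after_seconds=30,
    trace_id="abc123",
).to_json()
```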

Monitoring & Alerting

Track:

Error Rate: % of requests that fail.
P99 Latency: 99th percentile response time.
Circuit Breaker State: How many circuits are open?
Retry Rate: How many requests are being retried?

Alert on:

Error rate > 5%
P99 latency > 5 seconds
Circuit breaker open for > 5 minutes
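
The first two alert conditions can be sketched as plain functions over a window of request data. This is an illustration only: the function names are hypothetical, the percentile uses the nearest-rank method, and a real deployment would normally compute these values in its metrics system instead of in application code.

```python
import math


def error_rate(total, failed):
    """Percentage of requests that failed."""
    return 0.0 if total == 0 else 100.0 * failed / total


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def alerts(total, failed, latencies_s, max_error_pct=5.0, max_p99_s=5.0):
    """Return the names of the alert conditions that currently fire."""
    fired = []
    if error_rate(total, failed) > max_error_pct:
        fired.append("error_rate")
    if latencies_s and percentile(latencies_s, 99) > max_p99_s:
        fired.append("p99_latency")
    return fired
```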