Designing Distributed Systems at Scale

June 2024 | 8 min read | Distributed Systems, Architecture, Cloud

Building distributed systems that can handle millions of requests per second while maintaining high availability is one of the most challenging problems in software engineering. Over the past 8 years, working at companies like Oracle Cloud Infrastructure and Walmart Global Tech, I've had the opportunity to design and operate such systems firsthand.

The Core Challenge

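At its heart, every distributed system has to trade off three properties, as captured by the CAP theorem. Because network partitions cannot be ruled out in practice, the real choice during a partition is between consistency and availability. The theorem states that a distributed data store can provide at most two of the following three guarantees simultaneously: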

Consistency: Every read receives the most recent write or an error
Availability: Every request receives a non-error response, though not necessarily the most recent data
Partition Tolerance: The system continues operating despite network partitions

Lessons from OCI Load Balancer

On the Oracle Cloud Infrastructure Load Balancer team, we maintained a system that distributed web requests over fleets of servers spanning multiple fault domains and availability domains. Here are the key design decisions that made it resilient:

Health checks with exponential backoff: Rather than hammering unhealthy backends with probes, we backed off exponentially between health checks to reduce noise and give backends time to recover.
Circuit breakers: We implemented circuit breakers at the routing layer to prevent cascading failures when downstream services degrade.
Sticky sessions with fallback: We used consistent hashing for sticky sessions, with a graceful fallback when the assigned backend was unavailable.
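
To make the first two ideas concrete, here is a minimal sketch of a circuit breaker whose cooldown backs off exponentially when recovery probes keep failing. It is an illustration under my own assumptions, not the OCI implementation: the class name, the thresholds, and the injectable clock are all hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Hypothetical breaker: closed -> open -> half-open -> closed.

    The cooldown doubles each time a half-open probe fails
    (exponential backoff), capped at max_cooldown.
    """

    def __init__(
        self,
        failure_threshold: int = 3,
        base_cooldown: float = 1.0,
        max_cooldown: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.base_cooldown = base_cooldown
        self.max_cooldown = max_cooldown
        self.clock = clock
        self.failures = 0  # consecutive failures while closed
        self.cooldown = base_cooldown
        self.opened_at: Optional[float] = None  # None means closed

    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.cooldown:
            return "half-open"  # cooldown elapsed: allow one probe
        return "open"

    def call(self, fn: Callable[[], T]) -> T:
        state = self.state()
        if state == "open":
            # Fail fast instead of piling load onto a struggling backend.
            raise CircuitOpenError("backend is cooling down")
        try:
            result = fn()
        except Exception:
            self._on_failure(state)
            raise
        self._on_success()
        return result

    def _on_failure(self, state: str) -> None:
        if state == "half-open":
            # The probe failed: reopen and wait twice as long next time.
            self.cooldown = min(self.cooldown * 2, self.max_cooldown)
            self.opened_at = self.clock()
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()

    def _on_success(self) -> None:
        # Any success closes the circuit and resets the backoff.
        self.failures = 0
        self.cooldown = self.base_cooldown
        self.opened_at = None
```

A router would typically keep one breaker per backend and wrap each forwarded request in `breaker.call(...)`, treating `CircuitOpenError` as a signal to pick another backend. A production version would also need locking for concurrent callers and some jitter on the cooldown, both of which this sketch leaves out.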

Data Consistency Patterns

On Walmart's Order Management System, we handled orders for the Canada market. Eventual consistency was acceptable for some read operations, but order state transitions required strong consistency. We used a combination of:

Event sourcing to maintain a reliable audit trail of all state changes
Optimistic locking at the database level to prevent concurrent modification conflicts
Idempotency keys on all write operations to safely retry failed requests
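
As a rough illustration of how these three patterns can work together, the sketch below keeps an order as an append-only list of events, rejects writes made against a stale version, and replays the stored result when a retry reuses an idempotency key. Everything in it is an assumption made for the example: the in-memory `OrderStream` class, the state names, and the transition table are hypothetical and not taken from the Walmart system.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical transition table; a real order lifecycle is richer.
TRANSITIONS = {
    "CREATED": {"PAID", "CANCELLED"},
    "PAID": {"SHIPPED", "CANCELLED"},
    "SHIPPED": {"DELIVERED"},
}


class VersionConflict(Exception):
    """The caller's view of the order is stale."""


class InvalidTransition(Exception):
    """The requested state change is not allowed."""


@dataclass
class OrderStream:
    # Event sourcing: the event list is the source of truth.
    events: List[str] = field(default_factory=lambda: ["CREATED"])
    # Idempotency key -> version produced by that write.
    seen_keys: Dict[str, int] = field(default_factory=dict)

    @property
    def version(self) -> int:
        return len(self.events)

    @property
    def state(self) -> str:
        # Current state is derived from the latest event.
        return self.events[-1]

    def transition(self, new_state: str, expected_version: int,
                   idempotency_key: str) -> int:
        # A retry of an already applied write returns the original result.
        if idempotency_key in self.seen_keys:
            return self.seen_keys[idempotency_key]
        # Optimistic locking: refuse the write if someone else got there first.
        if expected_version != self.version:
            raise VersionConflict(
                f"expected v{expected_version}, found v{self.version}")
        if new_state not in TRANSITIONS.get(self.state, set()):
            raise InvalidTransition(f"{self.state} -> {new_state}")
        self.events.append(new_state)
        self.seen_keys[idempotency_key] = self.version
        return self.version
```

In a relational database, the version check is commonly expressed as a conditional update (an `UPDATE` with the expected version in its `WHERE` clause), with the event append and the idempotency record written in the same transaction. The in-memory version here only shows the control flow.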

Observability is Non-Negotiable

You cannot operate a distributed system without comprehensive observability. At OCI, I owned the creation of Grafana dashboards, alarms, and runbooks. The three pillars of observability — metrics, logs, and traces — are essential. I'd add a fourth: runbooks. When an alarm fires at 2am, you want step-by-step guidance available immediately.

Conclusion

Designing distributed systems is as much an art as it is a science. There's no one-size-fits-all solution. Understanding the specific requirements of your system — the SLAs, the consistency requirements, the scale — is the foundation of any good distributed system design. Start simple, measure everything, and evolve the architecture as you learn from production.

Have thoughts or questions? Reach out at tiwarisudhir059@gmail.com or connect on LinkedIn.