Sudhir Kumar Tiwari

Lessons from Building OCI Load Balancer

October 2023  |  12 min read  |  OCI Load Balancer, Cloud Infrastructure, CDN

From February 2022 to December 2024, I worked as a Senior Member of Technical Staff on Oracle Cloud Infrastructure's Load Balancer team. The OCI Load Balancer enables customers to distribute web requests across server fleets and route traffic across fault domains, availability domains, and regions — providing high availability and fault tolerance for mission-critical applications. Here are the most important lessons I took away from that experience.

Lesson 1: Operations is a First-Class Concern

Alarms, Dashboards, and Runbooks From Day One

One of my primary areas of ownership was operations: creating alarms, building Grafana dashboards, and writing runbooks. The temptation is to treat these as afterthoughts, things you add after the feature ships. That's backwards.

The right approach: before any feature goes to production, define its alarms, dashboards, and runbooks.

A runbook written while you understand the system deeply is infinitely better than one reconstructed from memory at 3am during an incident.
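One way to make "from day one" enforceable is a readiness check that blocks a launch until the operational artifacts exist. The sketch below is only an illustration of that idea, not how the OCI team did it: `FeatureOpsSpec`, `missing_ops_artifacts`, and `ready_for_production` are hypothetical names, and the fields are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureOpsSpec:
    """Operational artifacts for one feature (hypothetical structure)."""
    name: str
    alarms: list[str] = field(default_factory=list)      # alarm names
    dashboards: list[str] = field(default_factory=list)  # dashboard names
    runbook_url: str = ""                                 # link to the runbook


def missing_ops_artifacts(spec: FeatureOpsSpec) -> list[str]:
    """Return which operational artifacts are still missing for a feature."""
    missing = []
    if not spec.alarms:
        missing.append("alarms")
    if not spec.dashboards:
        missing.append("dashboards")
    if not spec.runbook_url:
        missing.append("runbook")
    return missing


def ready_for_production(spec: FeatureOpsSpec) -> bool:
    """A feature is ready only when no operational artifact is missing."""
    return not missing_ops_artifacts(spec)
```

Run as part of a release checklist or CI step, a check like this turns "we'll add the runbook later" into a visible, blocking gap.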

Lesson 2: Control Plane vs Data Plane — Different SLAs, Different Designs

The Load Balancer has two distinct planes: the control plane and the data plane.

Key insight: The data plane must function correctly even when the control plane is down. Design them to be independent. Cache config aggressively in the data plane, and design the control plane to push updates rather than having the data plane poll for them.
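A minimal sketch of that independence, assuming a versioned config and simple round-robin routing. `RoutingConfig`, `DataPlane`, and `on_config_push` are hypothetical names for illustration, not OCI APIs. The data plane keeps its last known-good config and serves from it, whether or not the control plane is reachable.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RoutingConfig:
    """Immutable snapshot of routing state (hypothetical shape)."""
    version: int
    backends: tuple[str, ...]


class DataPlane:
    """Routes requests from a cached config; never calls the control plane."""

    def __init__(self) -> None:
        self._config: Optional[RoutingConfig] = None
        self._next = 0  # round-robin counter

    def on_config_push(self, config: RoutingConfig) -> bool:
        """Entry point the control plane calls to push an update.

        Rejects empty or stale configs so a bad push cannot replace
        the last known-good state.
        """
        if not config.backends:
            return False
        if self._config is not None and config.version <= self._config.version:
            return False
        self._config = config
        return True

    def route(self) -> str:
        """Pick the next backend using only the cached config."""
        if self._config is None:
            raise RuntimeError("no config has ever been received")
        backends = self._config.backends
        backend = backends[self._next % len(backends)]
        self._next += 1
        return backend
```

Because `route()` reads only local state, a control plane outage stops new updates but not traffic. A real system would also persist the cached config so it survives a data plane restart.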

Lesson 3: CDN Control Plane — API Design Matters Enormously

I designed, developed, and documented multiple APIs for the CDN control plane. The biggest lesson: API decisions are nearly irreversible. A poorly designed API that ships to production will haunt you for years. Things I wish I'd paid more attention to earlier:

Lesson 4: Cross-Fault Domain Routing is Harder Than It Looks

Routing traffic across fault domains and availability domains for high availability introduces a class of problems that don't exist in single-datacenter systems:

Lesson 5: Test in Production (Carefully)

No staging environment perfectly mirrors production at cloud infrastructure scale. Customer traffic patterns, load profiles, and edge cases that appear in production simply don't exist in staging. Some approaches that work:

Final Thoughts

Working on cloud infrastructure at Oracle was humbling and exhilarating in equal measure. The scale of the systems, the diversity of customer use cases, and the zero tolerance for downtime forced me to raise my engineering bar significantly. The most important thing I learned: reliability is never an accident. It's the result of deliberate design decisions, comprehensive observability, rigorous operational practices, and a culture where every engineer takes ownership of the systems they build.


Have thoughts or questions? Reach out at tiwarisudhir059@gmail.com or connect on LinkedIn.