Lessons from Building OCI Load Balancer

October 2023 | 12 min read | OCI Load Balancer Cloud Infrastructure CDN

From February 2022 to December 2024, I worked as a Senior Member of Technical Staff on Oracle Cloud Infrastructure's Load Balancer team. The OCI Load Balancer enables customers to distribute web requests across server fleets and route traffic across fault domains, availability domains, and regions — providing high availability and fault tolerance for mission-critical applications. Here are the most important lessons I took away from that experience.

Lesson 1: Operations is a First-Class Concern

Alarms, Dashboards, and Runbooks From Day One

One of my primary ownership areas was operations: creating alarms, building Grafana dashboards, and writing runbooks. The temptation is to treat these as afterthoughts, things you add after the feature ships. That's backwards.

The right approach is to answer three questions before any feature goes to production:

What metrics indicate this feature is healthy?
What alarms should fire when it's not?
What are the on-call steps when an alarm fires?

A runbook written while you understand the system deeply is far more useful than one reconstructed from memory at 3 a.m. during an incident.
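One way to make those three questions enforceable is to encode them as a pre-production check. The sketch below is a minimal illustration in Python; `Alarm`, `FeatureReadiness`, and the metric names are hypothetical and are not OCI tooling.

```python
from dataclasses import dataclass, field


@dataclass
class Alarm:
    name: str
    metric: str
    threshold: float
    runbook_steps: list[str] = field(default_factory=list)


@dataclass
class FeatureReadiness:
    """Hypothetical record of the operational artifacts a feature ships with."""
    feature: str
    health_metrics: list[str]
    alarms: list[Alarm]

    def gaps(self) -> list[str]:
        """Return the reasons this feature is not ready for production."""
        problems = []
        if not self.health_metrics:
            problems.append("no health metrics defined")
        if not self.alarms:
            problems.append("no alarms defined")
        for alarm in self.alarms:
            if alarm.metric not in self.health_metrics:
                problems.append(
                    f"alarm '{alarm.name}' watches undeclared metric '{alarm.metric}'"
                )
            if not alarm.runbook_steps:
                problems.append(f"alarm '{alarm.name}' has no runbook steps")
        return problems


readiness = FeatureReadiness(
    feature="backend-health-checks",
    health_metrics=["healthy_backend_ratio"],
    alarms=[Alarm("backends-down", "healthy_backend_ratio", threshold=0.5)],
)
print(readiness.gaps())  # the alarm above has no runbook steps yet
```

A check like this could run in a release pipeline so that a missing runbook blocks the launch instead of surfacing during an incident.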

Lesson 2: Control Plane vs Data Plane — Different SLAs, Different Designs

The Load Balancer has two distinct planes:

Control plane: Handles API calls to create/update/delete load balancers. Can tolerate slightly higher latency. Correctness and durability are paramount.
Data plane: Handles actual traffic routing in real time. Must have extremely low latency and extremely high availability. Cannot afford to wait for control plane state updates.

Key insight: The data plane must function correctly even when the control plane is down. Design them to be independent. Cache config aggressively in the data plane so it can keep serving from the last known-good state, and have the control plane push updates rather than having the data plane poll for them.
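To make the independence concrete, here is a minimal Python sketch of a data-plane cache that accepts pushed config and keeps routing from its last accepted copy. The class name, the integer version scheme, and the modulo routing are assumptions for illustration, not how the OCI Load Balancer is implemented.

```python
from typing import Optional


class DataPlaneConfigCache:
    """Holds the last config the control plane pushed (hypothetical sketch)."""

    def __init__(self) -> None:
        self._version = 0
        self._backends: tuple[str, ...] = ()

    def apply_push(self, version: int, backends: list[str]) -> bool:
        """Accept a pushed config only if it is newer than the cached one.

        Rejecting stale or duplicate versions keeps out-of-order pushes
        from rolling the data plane back to an older config.
        """
        if version <= self._version:
            return False
        self._version = version
        self._backends = tuple(backends)
        return True

    def route(self, request_id: int) -> Optional[str]:
        """Pick a backend using only local state; never calls the control plane."""
        if not self._backends:
            return None
        return self._backends[request_id % len(self._backends)]


cache = DataPlaneConfigCache()
cache.apply_push(1, ["10.0.1.5:8080", "10.0.2.5:8080"])
# If the control plane goes down here, route() keeps working from the cache.
print(cache.route(7))
```

Because `route` reads only local state, a control plane outage delays config changes but does not interrupt traffic.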

Lesson 3: CDN Control Plane — API Design Matters Enormously

I designed, developed, and documented multiple APIs for the CDN control plane. The biggest lesson: API decisions are nearly irreversible. A poorly designed API that ships to production will haunt you for years. Things I wish I'd paid more attention to earlier:

Idempotency: All mutating operations must be safely retryable with the same client token
Async by default for long operations: Return a work request ID immediately and let clients poll its status; don't make them hold a connection open until the operation completes
Versioning from day one: Never ship a v1 API without planning for how v2 can exist alongside it
Error messages that help: "Bad request" is useless. "IP address '10.0.0.x' is not a valid CIDR block; use CIDR notation like '10.0.0.0/24'" is actionable.
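Three of these points (idempotency, async work requests, and actionable errors) fit in one small Python sketch. `ControlPlane`, `create_allow_rule`, and the status values are hypothetical names chosen for illustration; this is not the OCI API. CIDR validation uses the standard-library `ipaddress` module.

```python
import ipaddress
import uuid


class ValidationError(ValueError):
    """Raised with a message that tells the caller how to fix the request."""


class ControlPlane:
    """In-memory stand-in for a control plane (hypothetical sketch)."""

    def __init__(self) -> None:
        self._by_token: dict[str, str] = {}        # client token -> work request ID
        self._work_requests: dict[str, dict] = {}  # work request ID -> state

    def create_allow_rule(self, client_token: str, cidr: str) -> str:
        # Idempotency: a retry with the same token returns the original
        # work request instead of creating a second rule.
        if client_token in self._by_token:
            return self._by_token[client_token]
        try:
            network = ipaddress.ip_network(cidr)
        except ValueError:
            raise ValidationError(
                f"IP address '{cidr}' is not a valid CIDR block; "
                "use CIDR notation like '10.0.0.0/24'"
            ) from None
        # Async by default: record the work and return its ID immediately.
        work_request_id = str(uuid.uuid4())
        self._work_requests[work_request_id] = {
            "status": "ACCEPTED",
            "cidr": str(network),
        }
        self._by_token[client_token] = work_request_id
        return work_request_id

    def get_work_request(self, work_request_id: str) -> dict:
        """Clients poll this instead of blocking on the create call."""
        return dict(self._work_requests[work_request_id])


cp = ControlPlane()
first = cp.create_allow_rule("token-123", "10.0.0.0/24")
retry = cp.create_allow_rule("token-123", "10.0.0.0/24")
print(first == retry)  # the retry maps to the same work request
```

A real service would also have to decide what happens when a token is reused with different parameters, and how long tokens are remembered; the sketch leaves both out.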

Lesson 4: Cross-Fault Domain Routing is Harder Than It Looks

Routing traffic across fault domains and availability domains for high availability introduces a class of problems that don't exist in single-datacenter systems:

Network latency between regions and availability domains adds to request latency, so you need to measure it and set SLAs accordingly
Health checks must account for transient network partitions to avoid flapping backends
Session affinity (sticky sessions) across fault domains requires careful design — consistent hashing works well, but fallback behavior needs explicit definition
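DNS-based routing (for active-active multi-region) is limited by TTLs: resolvers cache records until the TTL expires, so clients can keep hitting a failed region until then, which bounds failover speed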
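Two of the points above, flap-resistant health checks and sticky routing with an explicit fallback, can be sketched in Python. The thresholds, virtual node count, and class names are assumptions picked for illustration, not values or designs taken from the OCI Load Balancer.

```python
import bisect
import hashlib
from typing import Callable, Optional


def _hash(key: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() varies per process)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HealthTracker:
    """Flips state only after N consecutive contrary results, to avoid flapping."""

    def __init__(self, unhealthy_after: int = 3, healthy_after: int = 2) -> None:
        self.unhealthy_after = unhealthy_after
        self.healthy_after = healthy_after
        self.healthy = True
        self._streak = 0  # consecutive results that disagree with current state

    def record(self, check_passed: bool) -> bool:
        if check_passed == self.healthy:
            self._streak = 0  # a single blip is forgotten
        else:
            self._streak += 1
            limit = self.unhealthy_after if self.healthy else self.healthy_after
            if self._streak >= limit:
                self.healthy = not self.healthy
                self._streak = 0
        return self.healthy


class StickyRouter:
    """Consistent-hash ring with a defined fallback: next healthy node clockwise."""

    def __init__(self, backends: list[str], vnodes: int = 50) -> None:
        self._ring = sorted(
            (_hash(f"{b}#{i}"), b) for b in backends for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def pick(self, session_id: str, is_healthy: Callable[[str], bool]) -> Optional[str]:
        if not self._ring:
            return None
        start = bisect.bisect(self._points, _hash(session_id)) % len(self._ring)
        for offset in range(len(self._ring)):
            backend = self._ring[(start + offset) % len(self._ring)][1]
            if is_healthy(backend):
                return backend
        return None  # explicit outcome when nothing is healthy


router = StickyRouter(["fd1-backend", "fd2-backend", "fd3-backend"])
home = router.pick("session-42", lambda b: True)
fallback = router.pick("session-42", lambda b: b != home)
print(home, fallback)
```

The same session always lands on the same backend while it is healthy; when it is not, the fallback is deterministic as well, so every load balancer node makes the same choice without coordination.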

Lesson 5: Test in Production (Carefully)

No staging environment perfectly mirrors production at cloud infrastructure scale. Customer traffic patterns, load profiles, and edge cases that appear in production simply don't exist in staging. Some approaches that work:

Canary deployments: Route 1% of traffic to new versions; monitor for error rate increases
Feature flags: Dark-launch features, enable for internal tenants first, then gradually roll out
Chaos engineering: Intentionally inject failures to test resilience mechanisms
Load testing in dedicated tenancies: Use production-equivalent infrastructure with synthetic load
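The canary and feature-flag approaches both reduce to a stable percentage bucket. Below is a minimal Python sketch; the function names, the salt strings, and the bucket granularity of 0.01% are assumptions for illustration, not a description of any Oracle rollout system.

```python
import hashlib


def bucket(key: str, salt: str) -> int:
    """Map a key to a stable bucket in [0, 10000), i.e. 0.01% granularity."""
    digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 10_000


def use_canary(request_key: str, canary_percent: float) -> bool:
    """Send roughly canary_percent of keys to the new version, deterministically."""
    return bucket(request_key, "canary") < canary_percent * 100


def feature_enabled(
    tenant_id: str,
    flag_name: str,
    internal_tenants: set[str],
    rollout_percent: float,
) -> bool:
    """Internal tenants always get the feature; others join as the rollout widens."""
    if tenant_id in internal_tenants:
        return True
    # Salting with the flag name gives each flag an independent tenant ordering.
    return bucket(tenant_id, flag_name) < rollout_percent * 100


internal = {"internal-tenant-1"}
print(feature_enabled("internal-tenant-1", "new-routing", internal, 0.0))
print(feature_enabled("customer-tenant-9", "new-routing", internal, 0.0))
```

Because the bucket is derived from a hash rather than a random draw, a key that is in the canary at 1% stays in it at 10%, which keeps a customer's experience consistent as the rollout widens.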

Final Thoughts

Working on cloud infrastructure at Oracle was humbling and exhilarating in equal measure. The scale of the systems, the diversity of customer use cases, and the zero tolerance for downtime forced me to raise my engineering bar significantly. The most important thing I learned: reliability is never an accident. It's the result of deliberate design decisions, comprehensive observability, rigorous operational practices, and a culture where every engineer takes ownership of the systems they build.

Have thoughts or questions? Reach out at tiwarisudhir059@gmail.com or connect on LinkedIn.