● Open to opportunities
Focused on Kubernetes, scalable systems, and building reliable infrastructure. Currently working toward CNCF's Kubestronaut certification.
Problem
Under load, applications became unresponsive; no HPA was configured initially.
Even after scaling out, traffic didn't distribute as expected, and new pods (~1.5-2 min startup on Fargate) came online too late to absorb the load.
What I did
Introduced HPA and redesigned load testing to cover scaling behavior.
Identified that CPU/memory signals lag behind real traffic patterns.
Outcome
HPA improved stability markedly (~80%), but sudden spikes still broke the system.
Realization: pre-scaling and event-driven scaling were required.
Learning
Autoscaling is reactive, not instantaneous.
For bursty or async workloads, CPU/memory are weak signals of demand.
Scaling decisions must align with how the workload behaves, not just what metrics are available.
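The reactive-scaling gap described above can be illustrated with a minimal HPA manifest; the names, replica counts, and CPU target here are hypothetical, not the actual production values:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa                # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api                  # hypothetical Deployment
  minReplicas: 3               # warm capacity matters when pods take ~1.5-2 min to start on Fargate
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # a lagging signal: utilization rises after traffic has already spiked
```

Even a well-tuned target like this only reacts after the metric crosses its threshold, which is why pre-scaling (a higher minReplicas) ends up being part of the fix.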
Problem
Dozens of microservices each had their own CI pipeline. Any change meant updating every repo, and missed updates were discovered late.
What I did
Initially tried scripting updates across repos (didn't scale).
Moved to shared pipeline libraries and reduced flows into ~3–4 standard patterns.
Outcome
Pipeline changes became centralized. New services only needed a minimal import - no pipeline setup/testing overhead.
Learning
Automating repeated tasks is not the same as eliminating duplication.
Where duplication exists, abstract first, then automate - not the other way around.
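The "minimal import" pattern can be sketched as a shared-template include; this assumes a GitLab-style CI setup, and the project path, ref, and template names are hypothetical:

```yaml
# .gitlab-ci.yml in a service repo: the whole pipeline is one import
include:
  - project: platform/pipeline-library   # hypothetical shared repo
    ref: v2                              # pinned so library changes roll out deliberately
    file: templates/service-default.yml  # one of the ~3-4 standard patterns

variables:
  SERVICE_NAME: payments-api             # hypothetical service-specific override
```

A pipeline change now happens once in the library; services pick it up by bumping `ref`, or automatically if they track a branch.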
Problem
Ephemeral CI runners (Docker-in-Docker) needed frequent updates under strict security constraints.
Switching to Alpine reduced vulnerabilities but introduced compatibility issues (musl libc, unofficial Node.js support).
What I did
Built an internal CLI that generates and updates runner images with the required dependencies (Node, Helm, etc.) selected via arguments.
Outcome
Removed manual image maintenance and made updates repeatable under security constraints.
Learning
When constraints stack (security, compatibility, tooling), pipelines become brittle.
Encapsulating complexity in a dedicated tool is often more stable than pushing it into CI logic.
Problem
AI workloads (file uploads, async processing) didn't scale well with HPA.
CPU/memory signals didn't reflect actual pressure.
What I did
Analyzed scaling gaps during load tests and explored alternatives beyond HPA/VPA.
Outcome
Identified that event-driven scaling (e.g., KEDA) fits the workload better, even though not implemented at the time.
Learning
Choosing the wrong scaling signal is worse than no scaling.
The real problem wasn't tuning HPA - it was assuming the problem fit HPA at all.
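Event-driven scaling of the kind described can be sketched with a KEDA ScaledObject that scales on queue depth instead of CPU/memory; the queue URL, names, and threshold here are hypothetical, since this wasn't implemented at the time:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: upload-worker-scaler       # hypothetical name
spec:
  scaleTargetRef:
    name: upload-worker            # hypothetical Deployment processing async uploads
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
    - type: aws-sqs-queue          # scale on backlog, a direct signal of demand
      metadata:
        queueURL: https://sqs.eu-west-1.amazonaws.com/000000000000/uploads
        queueLength: "5"           # target messages per replica
        awsRegion: eu-west-1
```

Queue length tracks demand directly, so replicas are added when work arrives rather than after CPU utilization catches up.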
Problem
Manual onboarding across tools took 2–3 days per request.
What I did
Automated onboarding using tool APIs into a self-service flow.
Outcome
Reduced setup time to under 5 minutes.
Learning
Most developer friction isn't technical - it's process latency.
Understanding why containers change how we think about processes and systems.
Read →
Using dig to break down DNS resolution step by step and understand what's actually happening.
Read →
Building a mental model of Git instead of memorizing commands.
Read →
Open to discussions around DevOps, Kubernetes, and platform engineering.