The simplest explanation for a phenomenon is usually the best one

Everything broke at the worst possible moment. We moved fast, hit every layer, ruled out every theory. Still nothing. Until we circled back to something so basic, we almost missed it.

The simplest explanation for a phenomenon is usually the best one
Occam's Razor

I still remember the chaos like it was yesterday. It was a Saturday evening in April, smack in the middle of the IPL. We had just launched a major update to our cricket live score platform - it was climbing the charts, DAUs were spiking, and retention looked solid.

Then, it happened.

During the most-watched match of the week - CSK vs MI - our live scores stopped updating. Just froze. First for a few users. Then everyone. Crashlytics logs were redlining. Our alerts lit up like a Christmas tree.

We spun up a war room within minutes. DevOps joined. Backend, frontend, infra, product, even our designer, who had nothing to do with it - all in one Zoom, typing furiously and tracking every signal we could.

We touched every possible angle - performance, reliability, infra limits, even a potential security incident. We were on top of it.

Then, one by one, we started knocking down possible causes. Every step took 5-15 minutes:

  • Checked that score data wasn't being pushed too aggressively.
  • Added extra logs directly on the live system to track Redis calls (see the sketch after this list).
  • Tuned pod limits.
  • Reduced extra client-side polling frequency (thankfully, I had made it configurable from Firebase a few months back).
  • Hardened rate limiting to ease pressure.
  • Increased Redis memory.
  • Upgraded our Mongo cluster from M40 to M60.
  • Re-ran security checks to rule out bot attacks or malicious flooding.
  • Rolled back a mobile build as a precaution, even though it wasn't live yet.
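
About those extra logs: in practice they were just a thin timing-and-error wrapper around our Redis client. Here's a minimal sketch of the idea in Python with redis-py - the names and structure are illustrative, not our production code:

    import logging
    import time

    import redis  # redis-py

    logger = logging.getLogger("live_scores.redis")

    class LoggedRedis:
        """Hypothetical wrapper: times every call and logs failures, so slow
        or rejected Redis commands show up directly in the live logs."""

        def __init__(self, client: redis.Redis):
            self._client = client

        def set_score(self, key: str, payload: str, ttl_seconds: int = 30) -> bool:
            start = time.monotonic()
            try:
                return bool(self._client.set(key, payload, ex=ttl_seconds))
            except redis.RedisError as exc:
                logger.error("redis SET %s failed: %s", key, exc)
                return False
            finally:
                elapsed_ms = (time.monotonic() - start) * 1000
                logger.info("redis SET %s took %.1f ms", key, elapsed_ms)

With every live-score write going through a wrapper like this, a latency spike or a burst of failed SETs is one grep away.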

Infra metrics looked healthy. CPU usage was fine. Disk IO was fine. Redis was up, but... just not serving keys like it should.

At one point, when nothing seemed to bring down the load, I stopped everyone.

We needed to think from scratch. I asked the team to walk the entire flow again - from the CDN, to the API gateway, to each microservice, to MongoDB, and finally Redis.

That's when it hit us. Redis was the bottleneck. Thousands of keys were sitting there, and none were being cleared. It wasn't a traffic spike or a backend fan-out. It was Redis silently refusing new writes.

Someone on the call casually said:

"Hey, what's our eviction policy on Redis?"

...Silence.

We hadn't configured it.

Redis had filled up. No keys were being evicted. So every write that would have grown memory got rejected with an OOM error - an error nothing in our alerting ever surfaced. The score keys we were writing? Dropped. No page. No warning. Just... gone.

One missing config flag.

Not a memory leak. Not Mongo. Not horizontal scaling. Just maxmemory-policy set to noeviction.

We flipped it to allkeys-lru. Scores started flowing again in 10 seconds. War room disbanded. We took a breath.
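
For anyone who hits the same wall, both the check and the fix are one-liners. A minimal sketch with redis-py - the hostname is made up, and on a managed Redis you'd flip the setting in the provider's console instead:

    import redis

    r = redis.Redis(host="redis.internal", port=6379)  # hypothetical host

    # Diagnosis: memory is maxed out and the policy refuses to evict anything,
    # so writes that would grow the dataset get rejected with an OOM error.
    mem = r.info("memory")
    print(mem["maxmemory_policy"], mem["used_memory"], mem["maxmemory"])
    # -> noeviction, with used_memory sitting right at maxmemory

    # The fix: evict least-recently-used keys instead of refusing new writes.
    r.config_set("maxmemory-policy", "allkeys-lru")
    print(r.config_get("maxmemory-policy"))
    # -> {'maxmemory-policy': 'allkeys-lru'}

allkeys-lru made sense for us because everything in that Redis was a cache we could rebuild. If you keep data there that you can't afford to lose, volatile-lru or simply more memory is the safer choice. And CONFIG SET only changes the running instance - persist the setting in redis.conf or your provider's settings, or it resets on the next restart.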


I think about this a lot now. Because honestly, this wasn't just a tech issue. It was a decision-making issue.

When things break, it's tempting to jump straight to the exotic theories. But in the end, it usually comes down to something simple.

This is where Occam's Razor kicks in - the idea that the simplest explanation is usually the correct one.

It's a principle I now try to use everywhere:

  • When debugging a flaky feature - maybe the API isn't slow, maybe the timestamp is off by a few seconds.
  • When a teammate isn't responding - maybe they're not ignoring you, maybe they just didn't see the message.
  • When users aren't converting - maybe it's not your UX flow, maybe the CTA just says the wrong thing.

It doesn't mean the simplest answer is always right. But it should be the first one you check and rule out - before you spin up a new environment, blame Kubernetes, or open up a 12-tab Grafana dashboard.

Sometimes, the problem is just a missing config.


Follow-up note:
We now do a "5-minute Occam check" before any war room:
What's the dumbest, simplest possible reason this could be happening?
It's saved us more times than I want to admit.