
From Melting Servers to Calmo: War Stories and a New Hope

Pankaj Kaushal

Mar 7, 2025

Introduction

I’ve been on the front lines of hundreds of production incidents over my career. From websites going dark to data centers literally catching fire, I’ve felt the 3 AM adrenaline surge of scrambling to fix the unthinkable. In this article, I want to share a few of my most unforgettable “war stories” – real incidents at some of the most visited sites in the world: Yahoo!, Booking.com, and Flipkart, plus a fire and a flooded data center – and the lessons they taught me about debugging under extreme conditions.

These stories illustrate the common challenges we face when systems fail: elusive race conditions, lack of visibility into complex distributed systems, and the intense pressure of debugging in a crisis. Finally, I’ll explain why I believe the future of incident response will be very different, thanks to AI. In particular, I’ll introduce Calmo, an AI-assisted root cause analysis tool. With AI’s help, we could drastically improve how we investigate and resolve outages, turning multi-hour firefights into swift and surgical fixes.

Yahoo News: Scaling in the Face of Melting Servers

Melting servers: data center hardware can quickly become overwhelmed under unexpected traffic spikes, requiring urgent scaling measures. (Image: Wikimedia Commons)

I still remember the night Yahoo News almost broke the internet. It was June 2009, when Michael Jackson’s death sent shockwaves through the web. Traffic to Yahoo News exploded beyond anything we’d ever seen – one story got 800,000 clicks in 10 minutes, making it the most-clicked news article in our history (Michael Jackson's Death: An Inside Look At How Google, Yahoo, & Bing Handled An Extraordinary Day In Search). Our web servers began to melt under the load. CPU temperatures spiked, response times lagged, and we were dangerously close to a total meltdown. As the on-call engineer, I was frantically adding servers to the pool and tweaking caching rules on the fly. We had to scale up within minutes or face a very public outage. It felt like repairing an airplane in mid-flight.

We discovered that some of our caching mechanisms had a race condition under extreme load – cache entries were expiring too fast, causing thundering herds of requests to hit the backend at once. The bug was subtle and only manifested at insane traffic levels. By horizontally scaling our front-ends and deploying a quick patch to the cache logic, we managed to keep Yahoo News online.
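
To make that failure mode concrete, here is a minimal sketch of the classic mitigation – not our actual Yahoo code, but the same idea we patched in: let only one request per key refresh an expired cache entry while everyone else is served slightly stale data, so an expiry never unleashes a thundering herd.

```python
import threading
import time

# Toy in-process cache guard (illustrative, not Yahoo's code): only one caller
# per key refreshes an expired entry; everyone else gets the stale value until
# the refresh lands, so the backend never sees a thundering herd.
_cache = {}          # key -> (value, expires_at)
_locks = {}          # key -> threading.Lock
_locks_guard = threading.Lock()

def get_with_stampede_protection(key, ttl, fetch):
    now = time.time()
    entry = _cache.get(key)
    if entry and entry[1] > now:
        return entry[0]                          # fresh hit

    with _locks_guard:
        lock = _locks.setdefault(key, threading.Lock())

    if lock.acquire(blocking=False):             # we won the right to refresh
        try:
            value = fetch()                      # the expensive backend call
            _cache[key] = (value, now + ttl)
            return value
        finally:
            lock.release()

    # Someone else is already refreshing: serve stale data if we have it,
    # otherwise wait briefly and retry instead of stampeding the backend.
    if entry:
        return entry[0]
    time.sleep(0.05)
    return get_with_stampede_protection(key, ttl, fetch)
```

The fix we actually shipped lived in the caching tier rather than application code, but the principle was the same: cache expiry must not translate directly into backend load.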

What really made a difference, however, was the work we had put in the previous summer on graceful degradation. By designing our system to intelligently shed non-essential subsystems under heavy load, we ensured that the core functionality of Yahoo News remained accessible even as peripheral services were temporarily scaled back. This strategic foresight allowed us to maintain a reliable user experience, even when the infrastructure was pushed to its limits. That night was a trial by fire: it taught me that even “stable” systems can crumble in the face of unprecedented events, and that race conditions lurking in code will find the worst possible time to bite.
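
For readers who haven’t built graceful degradation before, here is a rough sketch of the tiered load-shedding idea. The feature names and thresholds are made up for illustration; the real system was driven by our serving metrics, but the shape is the same.

```python
# Tiered load shedding (illustrative): non-essential features are switched off
# in stages as load climbs, keeping the core article-serving path alive.
DEGRADATION_TIERS = [
    # (load level 0.0-1.0, features disabled at or above it)
    (0.70, {"related_stories", "comments"}),
    (0.85, {"personalization", "image_variants"}),
    (0.95, {"everything_but_core_article"}),
]

def features_to_shed(current_load: float) -> set:
    """Return the set of features to disable at the given load level."""
    shed = set()
    for threshold, features in DEGRADATION_TIERS:
        if current_load >= threshold:
            shed |= features
    return shed

# At 88% load the page still renders, just without comments, related stories,
# personalization, or extra image variants.
print(features_to_shed(0.88))
```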

Flipkart: Keeping the Site Alive During a Data Center Fire

Fire suppression cylinders (argon/CO₂ mix) in a server room. Even with such systems in place, a serious fire can knock out an entire data center. (Image: Wikimedia Commons)

At Flipkart.com, I experienced a different kind of nightmare: a fire in one of our data centers. On Diwali – India’s Christmas and Black Friday rolled into one, and one of our busiest shopping days – an electrical short triggered a blaze in the generator room. The fire suppression systems kicked in, but not before taking chunks of infrastructure offline. (In 2024, a similar incident at Reliance Jio caused a nationwide network outage (Fire at data centre causes India-wide outage for Reliance Jio users, source says | Reuters), underscoring how devastating a data center fire can be.)

My team’s job was to keep Flipkart.com running while half a data center was incapacitated. We immediately failed over services to new VMs, but problems cascaded. Some services didn’t come up cleanly due to stale configuration – ironically, we lacked visibility into the gap between the config actually running in production and the Terraform/Puppet definitions in version control. Config drift bit us at the worst possible time. Meanwhile, alarms were blaring both in the NOC and literally on the data center floor. It was controlled chaos. We were essentially flying half-blind, since the fire had also knocked out some monitoring nodes. I was SSH-ing into machines by IP, trying to assess which services survived. By rerouting traffic at the load balancers and bringing up backup instances from cold storage, we managed to keep the core website functionality alive. We had moments to decide which features to sacrifice – for instance, we temporarily disabled recommendations and some non-critical APIs to reduce load. This incident hammered home the importance of redundancy, observability, and infrastructure as code.
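
If we had even a crude drift check, that failover would have been far less painful. Here is the kind of thing I mean – a toy sketch that compares a config file actually running on a host against the copy in version control. The host name, paths, and use of ssh are placeholders for illustration, not a description of Flipkart’s real tooling.

```python
import hashlib
import subprocess

# Toy config-drift check: compare the file running on a host against the copy
# in version control. Host name, paths, and ssh usage are assumptions for the
# sketch, not Flipkart's actual tooling.
def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def running_config(host: str, path: str) -> bytes:
    return subprocess.check_output(["ssh", host, "cat", path])

def declared_config(repo_path: str) -> bytes:
    with open(repo_path, "rb") as f:
        return f.read()

def has_drift(host: str, live_path: str, repo_path: str) -> bool:
    drifted = sha256(running_config(host, live_path)) != sha256(declared_config(repo_path))
    if drifted:
        print(f"DRIFT on {host}: {live_path} no longer matches {repo_path}")
    return drifted

if __name__ == "__main__":
    has_drift("app-42.prod.example", "/etc/myservice/app.conf",
              "puppet/modules/myservice/files/app.conf")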

Without real-time insight into which services were down, we were operating on gut instinct and tribal knowledge. It was also a lesson in calm under pressure: despite literal fire, we had to methodically work through a recovery checklist. In the end, Flipkart stayed up for customers, though most never knew how close we came to a total outage.

The Flooded Data Center: A Complete Shutdown and Restart

Disasters aren’t always fiery; sometimes they arrive as water. In one particularly dramatic incident, a data center I was responsible for started flooding after a nearby river overflowed its banks. Water was seeping under the raised floor, threatening the power distribution units. We had no choice but to shut down the entire facility to prevent electrocution and equipment destruction.

This was a controlled shutdown, but a nerve-wracking one: powering off hundreds of servers gracefully in a hurry is not easy. (We knew from industry events like Hurricane Sandy that flooding can cripple data centers by taking out power systems (Massive Flooding Damages Several NYC Data Centers).) Once the water was cleared and repairs made, we faced the herculean task of bringing everything back up. This wasn’t simply hitting a power switch. Each service had dependencies that had to come up in the correct order. Our databases had to start and restore from logs before application servers could connect. Caches had to warm up. The network gear had to reboot and re-learn routes. In a distributed system with many interconnected components, a full restart is the ultimate test of your architecture.

We encountered multiple hiccups: one storage array didn’t power on due to a tripped breaker, and a cluster management service got wedged, requiring a manual reset. It took us nearly a full day to get every system verified and the data center back to normal operation. The flood incident revealed how complex and fragile distributed systems can be when they have to be rebuilt from scratch. It also highlighted the need for runbooks and automation. Humans are prone to error when juggling dozens of moving parts under stress. We realized we needed better bootstrapping scripts and system maps.

Still, that day ended in success: we recovered without data loss. But I never again underestimated the complexity hidden in what we call “a reboot.” As an old saying goes, rebooting 500 servers isn’t 500 times harder than rebooting one server – it’s 5,000 times harder, due to all the interdependencies.
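
The bring-up ordering problem, at least, is one computers solve better than stressed humans. Here is a minimal sketch of what I mean: encode the dependency graph once, and derive the cold-start order from it instead of from memory. The service names are invented for illustration.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Illustrative dependency map: "service": {services it needs running first}.
# The names are invented; the point is that a cold start should follow a
# topological order of the real graph, not a runbook recalled at 3 AM.
DEPENDENCIES = {
    "network":  set(),
    "storage":  {"network"},
    "database": {"storage", "network"},
    "cache":    {"network"},
    "app":      {"database", "cache"},
    "frontend": {"app"},
}

def bringup_order(deps: dict) -> list:
    """Return a start order where every service comes after its dependencies."""
    return list(TopologicalSorter(deps).static_order())

print(bringup_order(DEPENDENCIES))
# e.g. ['network', 'storage', 'cache', 'database', 'app', 'frontend']
```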

Booking.com: The 40-Hour, 15,000-Server Debugging Marathon

Perhaps my hardest battle was at Booking.com, when a routine infrastructure change turned into a cascading failure. We rolled out an update to our emergency out-of-band access system – it was supposed to be a minor change to a service startup script. Instead, a lurking bug caused it to randomly restart about 15,000 servers across our fleet. One moment everything was fine; the next, a huge chunk of our production servers started “killing” themselves without warning.

Imagine the chaos: users were getting errors on the website, internal services were flapping as their hosts went down, and our metrics went wild. We had a full-on outage in progress. This kicked off a 40-hour debugging marathon that I will never forget. We had every engineer available on deck, rotating in and out as fatigue set in. The tricky part was figuring out why this change caused such a fiasco.

Critical state was wiped from many machines. Worse, the bug’s effects were nondeterministic – not all servers were affected, and the pattern of failure seemed random. We dug through logs across dozens of services. Booking.com’s infrastructure is highly distributed (by necessity, running a global travel site), which made this bug hide like a needle in a haystack. Logs were scattered, and some of our usual deployment traces didn’t capture this scenario. It took us hours just to correlate which exact 15,000 servers had been restarted, which services were still intact, and which weren’t working. With over 80 types of subsystems, just confirming that everything is stable is a task unto itself.
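
For a flavor of the scripts we were writing on the fly that night, here is a rough sketch of the simplest one: ask every host for its uptime and flag the ones that rebooted inside the incident window. The host inventory and ssh access are placeholders, not our real fleet tooling.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Rough sketch of an ad-hoc "who rebooted?" sweep: read /proc/uptime on every
# host and flag the ones that came back up inside the incident window.
# The host inventory and ssh access are placeholders for illustration.
REBOOT_WINDOW_SECONDS = 6 * 3600

def uptime_seconds(host):
    try:
        out = subprocess.check_output(["ssh", host, "cat", "/proc/uptime"], timeout=10)
        return float(out.split()[0])
    except Exception:
        return None  # unreachable hosts need human attention anyway

def recently_rebooted(hosts):
    with ThreadPoolExecutor(max_workers=64) as pool:
        uptimes = list(pool.map(uptime_seconds, hosts))
    return [h for h, up in zip(hosts, uptimes)
            if up is not None and up < REBOOT_WINDOW_SECONDS]

if __name__ == "__main__":
    fleet = [f"web-{i:05d}.prod.example" for i in range(1, 101)]  # placeholder inventory
    print(recently_rebooted(fleet))
```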

Once the issue was tracked down, we rolled the change back within minutes, but the bulk of the time was spent combing through the hundreds of changes made across all of our repositories in the previous 24 hours.

Fixing the bug was simple once found, but by then we also had to restore data and reassure teams that their systems were intact. 40 hours later, bleary-eyed and ecstatic, we resolved the incident. This war story encapsulated every possible debugging challenge: an “impossible” bug that only appears under certain timing, a lack of initial visibility (we had to write scripts on the fly to gather data from various sources), the complexity of a distributed architecture, and immense pressure from the business (every minute of downtime was costly). It was a baptism by fire for our on-call processes. We emerged with a conviction: we needed far better tooling to investigate issues like this faster, because pure human effort had nearly reached its limit.

Debugging Under Fire: Common Challenges

Each of these incidents was unique, but they share common themes. When production goes up in flames (sometimes literally), engineers face a gauntlet of challenges that make debugging incredibly hard:

  • Concurrency and Race Conditions: Some of the worst bugs only appear under specific timing or load conditions. As one of my colleagues quipped, “if you have a seemingly impossible bug that you cannot consistently reproduce, it’s almost always a race condition”. 

  • Lack of Visibility (Observability): In a crisis, not knowing what’s happening is half the battle. Debugging distributed systems is hard because observability is limited at a global scale. Traditional debugging gives you a local view (a single server’s logs or a stack trace), but in a system spread across hundreds of nodes, that’s like blindfolding one eye. In the Flipkart fire, we lost some monitoring and were essentially flying blind. It becomes difficult to piece together the chain of events without a global timeline of the system. Modern practices like distributed tracing and centralized logging are meant to help, but if they aren’t comprehensive, engineers end up with only puzzle fragments.


  • Distributed Systems Complexity: By design, distributed systems have many interacting components, which introduces a combinatorial explosion of things that can go wrong. It’s well-understood that distributed systems are much harder to debug than centralized ones. There are more failure modes: network partitions, partial outages, inconsistent state across services, etc. As systems grow, emergent behaviors appear that weren’t explicitly programmed – and those can lead to very puzzling bugs. The flood scenario showed how dependency ordering can complicate a recovery. At Booking, microservice architecture meant the root cause was buried in chatter between services. In such systems, a small glitch in one component can ripple outward in unexpected ways, obscuring the original source.


  • High-Pressure Environments: Perhaps the biggest factor is human: the pressure of fixing things fast. When an outage is in progress, every minute counts. It’s not a calm debugging session in an IDE; it’s an adrenaline-fueled race against the clock. It’s often 3 AM on a Friday and the on-call engineer is exhausted, forced to rely on personal know-how to find the issue (Learn what incident response automation is and how it works). Critical information (like who to call, or where certain logs are) might not be documented or may be outdated. Fatigue and stress set in, increasing the chance of mistakes or tunnel vision. I’ve pulled all-nighters watching sunrise from the office window, still chasing a bug. This environment is brutal. Under these conditions, even the best engineers can miss obvious clues. Pressure can narrow your thinking at exactly the time you need to think broadly.

Given these challenges, it’s clear that debugging complex outages is as much an art as a science. We develop playbooks, we practice drills, we build monitoring systems – all to mitigate these difficulties. But no matter how experienced you are, there’s always that incident that will humble you. After years of fighting these fires, I found myself asking: Can we do better? Does it always have to be this painful? This is where my excitement for new approaches comes in. Specifically, I believe that advances in AI and automation are poised to fundamentally change how we tackle production incidents.

Calmo: AI-Assisted Root Cause Analysis

Imagine if, during those war stories, I had a trusty AI assistant by my side – a kind of Sherlock Holmes for systems, tirelessly sifting through data while I focused on decisions. This is the promise of Calmo.

Meta’s engineering team revealed they had built an AI system to help with incident investigations, combining smart heuristics with a large language model to pinpoint root causes. The results were eye-opening: their system achieved 42% accuracy in identifying the root cause at the start of an investigation, significantly reducing the time engineers spent searching (Meta: AI-Assisted Root Cause Analysis System for Incident Response - ZenML LLMOps Database). In other words, in nearly half of the incidents, the AI’s top suggestions contained the actual culprit, right when the incident was declared. That kind of head start is a game-changer. It means potentially saving hours of trial and error. Meta’s approach works by automatically narrowing down thousands of code changes to a few likely suspects (using signals like which systems are failing, recent deployments, and dependency graphs) and then using an LLM to rank the most relevant ones (Leveraging AI for efficient incident response - Engineering at Meta). Essentially, it’s an AI-powered detective that scans the usual “clues” an on-call engineer would gather – except it does it in seconds and without fatigue.
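
To make that concrete, here is a toy sketch of the heuristic half of such a pipeline as I read the public write-up: score recent changes by simple signals (did they land in a failing system or one of its dependencies, how close to the incident did they deploy) and hand only the top few to the LLM. The fields and weights below are invented for illustration, not Meta’s actual signals.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Toy version of "narrow thousands of changes down to a few suspects".
# Fields and weights are invented; the real signals are much richer.
@dataclass
class Change:
    id: str
    service: str
    deployed_at: datetime
    touches_config: bool

def suspect_score(change, failing_services, dependencies, incident_start):
    score = 0.0
    if change.service in failing_services:
        score += 3.0   # change landed directly in a failing system
    if any(change.service in dependencies.get(s, set()) for s in failing_services):
        score += 2.0   # change landed in a dependency of a failing system
    age = incident_start - change.deployed_at
    if timedelta(0) <= age <= timedelta(hours=2):
        score += 2.0   # deployed shortly before the incident
    if change.touches_config:
        score += 1.0
    return score

def shortlist(changes, failing, deps, start, top_n=5):
    ranked = sorted(changes, key=lambda c: suspect_score(c, failing, deps, start),
                    reverse=True)
    return ranked[:top_n]   # only these few go to the LLM (or the human)

# Usage with fabricated data:
now = datetime(2025, 3, 7, 10, 0)
changes = [
    Change("D101", "payments", datetime(2025, 3, 7, 9, 42), False),
    Change("D102", "search", datetime(2025, 3, 6, 14, 0), True),
]
print([c.id for c in shortlist(changes, {"checkout"}, {"checkout": {"payments"}}, now)])
# -> ['D101', 'D102'], with the recent payments change scored highest
```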

Calmo is envisioned in a similar vein, but it extends beyond code changes to the entire debugging workflow. The idea is to leverage AI (including machine learning on historical incident data and LLMs that ingest logs and metrics) to improve investigation efficiency at every step:

How Calmo Could Transform Debugging

  • Instant Analysis of System Anomalies: The moment an incident arises, Calmo would consume the firehose of data coming from the system: logs, error traces, metrics, recent deployment changes, configuration tweaks, etc. It can cross-correlate these in a way no human realistically can under time pressure. For example, Calmo might recognize that right before a service crashed, a specific configuration value was pushed network-wide – something an engineer might only discover after digging through chat or wiki updates. AI excels at pattern matching, so it could flag “these 500 error messages across 20 services all share a common thread starting at time X” (see the sketch after this list). This breadth of analysis addresses the lack of visibility by providing an automated global view.


  • Ranking Likely Root Causes: Instead of a human trying to formulate hypotheses blindly, Calmo can generate a ranked list of potential root causes. Calmo weighs the evidence: a spike in database errors might point to a DB issue, but correlating it with a just-deployed microservice suggests an upstream cause. The output would read something like: “80% confidence that the checkout service failure is due to the recent payment service deployment at 09:42 UTC.” It can list a few such hypotheses, each backed by data as evidence. This guides engineers where to focus first. In effect, it triages the incident cause, similar to how medical diagnostics prioritize possible illnesses. Industry tools are already exploring this: for instance, products like Zebrium use ML on logs to automatically surface root cause events (Unleashing AI in SRE: A New Dawn for Incident Management - DevOps.com), rather than making humans search manually.


  • Automated Investigative Actions: Calmo doesn’t just sit and observe – it takes initiative on routine investigative steps. Think of it as having a junior engineer who runs around checking things for you. For example, it can automatically fetch relevant logs from all services involved in a user request that failed, and group them by timeline. It can run system checks: if high CPU is detected on a server, it can grab a thread dump or CPU profiler output and include it in the report. If a database error is suspected, it can query the DB for lock wait statistics or replication lag. Essentially, it can execute parts of the runbook on its own. This automation saves precious minutes. When it’s 3 AM, having routine diagnostics done for you is huge – you can spend your brainpower interpreting results rather than gathering them. In practice, some SRE teams write scripts or use chatbots to do this; Calmo just makes it more intelligent and context-aware.


  • Learnings from Past Incidents: One of the most powerful aspects of AI is learning from history. Calmo is trained on past incidents: all those war stories and their resolutions become fodder for the AI. Calmo uses its knowledge base: “If error X and symptom Y happen together, it was cause Z (with 95% probability).” Meta’s team fine-tuned their LLM on historical investigation data to teach it how to recognize patterns and even read internal code and wiki docs (Leveraging AI for efficient incident response - Engineering at Meta). Calmo similarly ingests post-mortems and incident timelines for every one of its deployments. This means that if a familiar problem reoccurs, the AI spots it immediately. For example, “This error pattern matches an issue seen 2 months ago in which a race condition in the cache layer caused a cascade.” Even if the on-call engineer has never seen that old incident, Calmo has the institutional memory to bring it up. This kind of knowledge retention can dramatically reduce time to resolution, especially in organizations with high staff turnover or distributed teams.


  • Reduced Cognitive Load and Stress: Perhaps the most humane benefit: Calmo can act as a tireless sidekick during high-pressure incidents. It doesn’t get tired or panic. By handling the grunt work of searching logs and monitoring dashboards, it reduces the cognitive load on the human responders. In practical terms, an engineer using Calmo would have a concise briefing of “what we know so far” within minutes of an outage, rather than staring at 10 different screens trying to piece it together. This goes a long way in reducing stress. It’s easier to stay calm and think clearly when you’re not also trying to be a human parser for gigabytes of logs in real-time. By streamlining the workflow (maybe even automatically creating an incident Slack/Teams channel and posting updates), Calmo lets engineers focus on decision-making and creative problem-solving – the things humans are best at – rather than data crunching. It’s like moving from manually flying a plane to having an autopilot handle the stability while you chart the course.
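
As a flavor of the first bullet above, here is a tiny sketch of the “common thread starting at time X” idea: bucket error events from every service by minute and flag the first minute in which an unusual number of services begin failing together. The event shape and threshold are assumptions for illustration, not Calmo’s internals.

```python
from collections import defaultdict
from datetime import datetime

# Bucket error events by minute and return the first minute where at least
# min_services distinct services start erroring together.
def first_common_anomaly(events, min_services=5):
    services_per_minute = defaultdict(set)
    for e in events:
        minute = e["timestamp"].strftime("%Y-%m-%d %H:%M")
        services_per_minute[minute].add(e["service"])
    for minute in sorted(services_per_minute):
        if len(services_per_minute[minute]) >= min_services:
            return minute      # the moment the failure became fleet-wide
    return None

# Usage with fabricated events:
events = [
    {"service": f"svc-{i}", "timestamp": datetime(2025, 3, 7, 9, 42), "message": "HTTP 500"}
    for i in range(20)
]
print(first_common_anomaly(events))  # -> "2025-03-07 09:42"
```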


To see the potential impact, consider how each of my war stories might have played out with Calmo in the loop. In the Yahoo News traffic surge, Calmo could have instantly identified the spike and perhaps recalled similar past events (like other celebrity news spikes) to suggest scaling actions. It might have flagged the cache invalidation code as a suspect by correlating error rates with a recent code push. In the Flipkart fire, Calmo would have quickly mapped out which services were down and which were up in the surviving data center – a task that took us a lot of manual effort. During the Booking.com marathon, I daydream about how Calmo might have pointed us to the script within minutes, by noticing the common thread in those 15,000 servers’ reboots. We could have ended that incident in minutes instead of days. 

This isn’t to say AI can solve everything – debugging often requires intuition and creative thinking that a machine might not replicate – but even if it shortlists the right answer 40% of the time, that’s an enormous win. It turns the problem of finding a needle in a haystack into finding a needle in a small pile of straw.

Importantly, AI-assisted debugging needs to be implemented carefully. We must avoid false confidence in the AI’s suggestions. While Calmo can significantly cut down investigation time, it can also suggest wrong causes and potentially mislead engineers if used blindly. Calmo, therefore, is designed to augment human operators, not replace them. It would present its reasoning and allow engineers to confirm or dismiss leads. Think of it as an extremely knowledgeable assistant, but the incident commander is still human. With proper feedback loops (engineers marking suggestions as useful or not), the system can improve over time and build trust.

Conclusion

After a career spent firefighting in data centers and war rooms, I’m genuinely excited about what the future holds. The advent of AI in our monitoring and debugging toolchain feels like the cavalry coming over the hill. We are on the cusp of a transformation in how we handle production incidents. Instead of paging an exhausted human to sift through metrics and logs in the dark of night, we’ll have AI-driven systems like Calmo shining a spotlight on the likely culprit within minutes. 

The impact on our industry could be profound. Imagine vastly lower downtime, faster recovery, and perhaps most importantly, saner on-call schedules. Future engineers might hear our old war stories with disbelief: “You manually looked through logs for 40 hours? Why didn’t you just ask the AI for the root cause?”

Calmo represents a vision of incident response that is proactive, data-driven, and intelligent. It’s about learning from every outage so that the next one is easier to resolve. It’s about giving engineers superpowers: the ability to cut through complexity with algorithmic precision. Will firefighting ever be completely stress-free? Probably not; complex systems will always find novel ways to fail. But with AI as our ally, we can tame the chaos. We can move from reactive scrambling to confident, accelerated problem-solving. The war stories of tomorrow might be less about grueling marathons and more about how quickly and gracefully we handled incidents with our AI copilots. As someone who has lived through the evolution from bare-metal servers to cloud and now to AIOps, I firmly believe that AI-assisted debugging tools like Calmo will become standard issue in the SRE toolbox. And I won’t miss those all-nighters one bit.

In the end, the goal is simple: fewer outages, faster fixes, and a good night’s sleep for on-call engineers. After all the fires I’ve fought, that sounds like a revolution worth striving for. With Calmo lighting the way, the future of debugging looks a lot calmer indeed.

AI Root Cause Analysis

Schedule a call with the team
