Why we are building Calmo
Pankaj Kaushal
Feb 14, 2025
Modern software is inherently complex: microservices, containers, serverless functions, each one capable of generating an overwhelming amount of data. Maintaining reliability can become a juggling act that involves multiple monitoring systems, on-call schedules, and repeated incident triage.
We are building Calmo to help with these challenges without adding another hurdle to the workflow. The intent is for it to serve as a capable teammate rather than just another tool, in line with the idea of an AI SRE: applying artificial intelligence to Site Reliability Engineering.
The Complexity of Production Environments
In many organizations, about 30% of an engineer’s time is tied up in production tasks: managing legacy code, babysitting multiple dashboards, and coping with notifications that may or may not be critical. Some alerts are minor, but production also hosts significant issues that need prompt attention. Hours often slip away just figuring out which problems to tackle first.
Calmo is an AI Site Reliability Engineer (SRE) that behaves like a colleague, relying on existing infrastructure tools to manage the day-to-day turmoil. Instead of presenting yet another SaaS interface, it works as SRE automation: pulling metrics from services like AWS or GCP, analyzing logs, and referencing code repositories to keep production stable. Because Calmo handles automated incident response tasks, the main human role is making key decisions or deploying final fixes.
When something malfunctions, Calmo gathers system metrics, traces, and historical data to understand what went wrong.
Root Cause Analysis and Intelligent Investigation
Many failures can be traced back to a single root cause: a flawed code push, a database schema change, or a resource bottleneck. Calmo correlates different signals (metrics, recent deployments, incident histories) to identify that root cause quickly.
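To make the idea of correlating signals concrete, here is a minimal sketch in Python. It is not Calmo's implementation; it is a toy illustration under stated assumptions. The `ChangeEvent` type, the `rank_candidates` function, the one-hour lookback window, and the scoring weights are all hypothetical. The sketch ranks recent production changes by how closely they precede the start of an anomaly, with a bonus when the change touched the affected service.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeEvent:
    """A hypothetical record of a change made to production."""
    kind: str       # e.g. "deploy" or "schema_change"
    service: str    # service the change touched
    at: datetime    # when the change went live


def rank_candidates(events, anomaly_start, affected_service,
                    lookback=timedelta(hours=1)):
    """Rank changes that happened shortly before an anomaly began.

    The score (an illustrative assumption) is higher for changes closer
    in time to the anomaly, plus 0.5 if the change touched the affected
    service. Returns (score, event) pairs, best candidate first.
    """
    scored = []
    for event in events:
        gap = anomaly_start - event.at
        # Skip changes made after the anomaly began or outside the window.
        if gap < timedelta(0) or gap > lookback:
            continue
        recency = 1.0 - gap / lookback  # timedelta / timedelta is a float
        bonus = 0.5 if event.service == affected_service else 0.0
        scored.append((recency + bonus, event))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


# Example data, invented for illustration only.
anomaly = datetime(2025, 2, 14, 12, 0)
events = [
    ChangeEvent("deploy", "checkout", anomaly - timedelta(minutes=6)),
    ChangeEvent("schema_change", "orders", anomaly - timedelta(minutes=30)),
    ChangeEvent("deploy", "search", anomaly - timedelta(hours=3)),
    ChangeEvent("deploy", "checkout", anomaly + timedelta(minutes=5)),
]
ranked = rank_candidates(events, anomaly, "checkout")
for score, event in ranked:
    print(f"{score:.2f}  {event.kind}  {event.service}")
```

In this example the checkout deploy six minutes before the anomaly ranks first, the schema change second, and the two events outside the window are dropped. A real system would weigh many more signals, such as metrics, traces, and incident history, than this timestamp heuristic does.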
It also connects to metrics dashboards, tracing systems, and code repositories, along with Slack, Teams, or other collaboration channels. Access to logs, runbooks, and playbooks can be coordinated in one thread, minimizing the search for scattered documentation. Calmo consumes these resources to deliver a human-like output for each investigation, ultimately resulting in a workflow that’s faster and more transparent.
A Teammate, Not Another Tool
Calmo isn’t focused on burying teams in numbers or charts. Instead, it highlights actionable insights, much like discussing an outage with a fellow engineer. Understanding why a microservice failed (perhaps a memory leak or a faulty merge) becomes simpler because Calmo delivers the root cause analysis in under a minute, removing the need to dig through multiple tools or logs.
Unlike typical platforms that require logging into a separate portal, Calmo remains within the environment already in use, whether that’s Slack, Teams, or some other incident response solution. Engineers can trust the system to handle operational details, practicing AI-assisted reliability engineering rather than manually checking every dashboard.
We Want Software Engineers Building, Not Firefighting
One clear mission drives us: free engineers from the constant grind of production firefighting. More building, less debugging. This AI agent approach significantly shrinks the time spent parsing logs or connecting scattered data points.
The mindset is simple: let a specialized AI agent handle operational tasks so humans can build software.
See Calmo in Action
Schedule a demo to observe how Calmo addresses real incidents, pinpoints the cause, and offers a clear fix, without requiring constant monitoring of multiple dashboards.
Please book a 30-minute demo at this link.