
How we leverage Knowledge Graphs for AI-driven RCA

Pankaj Kaushal

Feb 21, 2025


At midnight, a routine database update causes a minor delay in processing transactions. This delay leads to a growing queue in the payment service, which goes unnoticed.

By 6 AM, the queue is large enough to cause intermittent timeouts in the authentication service, affecting customer logins.

At 9 AM, during peak shopping hours, the backlog overwhelms the payment gateway, resulting in widespread payment failures.

Traditional monitoring tools see three isolated issues, but Calmo’s Temporal Knowledge Graph connects the dots, linking the 9 AM payment failures back to the midnight database update.

Calmo autonomously detects, traces and resolves incidents, reducing downtime and improving system resilience. Temporal knowledge graphs let Calmo track when and how incidents unfold and spot patterns that traditional monitoring tools miss.

Foundations of Calmo, The AI SRE

What are Knowledge Graphs?

Knowledge graphs store information as entities and the relationships between them, which makes connections explicit and directly queryable in a way that traditional databases do not. This representation is particularly useful in Site Reliability Engineering, because graphs are a natural fit for complex systems and their dependencies. A graph can capture both high-level and low-level relationships between infrastructure components, providing a holistic view of system context and health.
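To make the entity-and-relationship idea concrete, here is a minimal sketch of a graph stored as (subject, relation, object) triples. It is an illustration only, not Calmo’s implementation: the `KnowledgeGraph` class, the relation names and the service names (borrowed loosely from the opening scenario) are all assumptions.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Tiny in-memory graph of (subject, relation, object) triples (hypothetical)."""

    def __init__(self):
        # subject -> set of (relation, object) pairs
        self._out = defaultdict(set)

    def add(self, subject, relation, obj):
        self._out[subject].add((relation, obj))

    def neighbors(self, subject, relation=None):
        """Objects linked from `subject`, optionally filtered by relation."""
        pairs = self._out.get(subject, ())
        return sorted(o for r, o in pairs if relation is None or r == relation)

    def reachable(self, start, relation):
        """Every entity reachable from `start` by following `relation` edges."""
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            for nxt in self.neighbors(node, relation):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen


# Assumed topology, for illustration only.
kg = KnowledgeGraph()
kg.add("payment-gateway", "depends_on", "payment-service")
kg.add("payment-service", "depends_on", "transactions-db")
kg.add("auth-service", "depends_on", "payment-service")
kg.add("payment-service", "runs_on", "cluster-a")

# Direct and indirect dependencies of the gateway.
transitive_deps = kg.reachable("payment-gateway", "depends_on")
```

Because dependencies are explicit edges, a question such as "what does the payment gateway ultimately rely on?" becomes a simple traversal rather than a chain of joins.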

Temporal Data and its Impact

All production systems are dynamic. Relationships between systems and services change over time through deployments, code changes and data flow, so in production environments temporal data is crucial.

Temporal data is information associated with a specific point in time or time interval, which makes it possible to analyse changes over time. In a knowledge graph, temporal data lets Calmo represent how entities and their relationships evolve. By using these ever-evolving temporal relationships, Calmo can build a more complete picture of system behavior and spot trends, patterns and anomalies that would otherwise go unnoticed. This temporal awareness is key to proactive site reliability engineering: it allows teams to intervene in time and improve system resilience.
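One common way to attach time to a graph is to give each relationship a validity interval and then ask what the graph looked like at a given moment. The sketch below shows that idea under stated assumptions: the `TemporalEdge` type, the `snapshot` helper, the entity names and the timestamps are hypothetical and chosen only to illustrate the technique.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class TemporalEdge:
    """A relationship that holds during [valid_from, valid_to)."""
    source: str
    relation: str
    target: str
    valid_from: datetime
    valid_to: Optional[datetime] = None  # None means "still valid"

    def active_at(self, when: datetime) -> bool:
        started = self.valid_from <= when
        not_ended = self.valid_to is None or when < self.valid_to
        return started and not_ended


def snapshot(edges, when):
    """Return the (source, relation, target) triples that held at `when`."""
    return {(e.source, e.relation, e.target) for e in edges if e.active_at(when)}


# Illustrative data: a database dependency that changes at midnight.
cutover = datetime(2025, 1, 2, 0, 0)
edges = [
    TemporalEdge("payment-service", "depends_on", "transactions-db-v1",
                 datetime(2025, 1, 1), cutover),
    TemporalEdge("payment-service", "depends_on", "transactions-db-v2", cutover),
]

before = snapshot(edges, datetime(2025, 1, 1, 23, 0))
after = snapshot(edges, datetime(2025, 1, 2, 6, 0))
```

Comparing `before` and `after` shows exactly which relationship changed around the time an incident began, which is the kind of question a static graph cannot answer.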



How Calmo Builds and Uses Knowledge Graphs

Calmo’s knowledge graph has three interconnected layers that improve incident detection and response.

Event Subgraph: Captures raw system events, logs, metrics and anomalies, so that no data is lost.

Service Relationship Subgraph: Extracts meaningful connections between services, mapping dependencies and tracking interactions over time.

System-Wide Insight Subgraph: Groups related entities into clusters, providing a high-level view of service performance and failure patterns.

This layered approach organizes raw events into structured insights, which makes system behavior easier to analyze and understand. By dynamically updating and linking information, Calmo maintains a continuously evolving understanding of system health. We believe this sets a new industry benchmark for incident analysis and forecasting.
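The three layers described above can be sketched roughly as follows. This is a simplified illustration, not Calmo’s actual data model: the event records, the dependency list and the choice of connected components as the clustering method are all assumptions made for the example.

```python
from collections import defaultdict

# Layer 1 (event subgraph): raw events, kept as-is. Hypothetical records.
events = [
    {"id": "e1", "service": "transactions-db", "kind": "slow_query"},
    {"id": "e2", "service": "payment-service", "kind": "queue_growth"},
    {"id": "e3", "service": "auth-service", "kind": "timeout"},
    {"id": "e4", "service": "search-service", "kind": "cache_miss"},
]

# Layer 2 (service relationship subgraph): assumed (caller, callee) pairs.
depends_on = [
    ("payment-service", "transactions-db"),
    ("auth-service", "payment-service"),
]


def cluster_services(services, edges):
    """Layer 3: group services into connected components.

    Edges are treated as undirected here; a real system would likely use
    a richer clustering method.
    """
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)

    clusters, seen = [], set()
    for service in sorted(services):
        if service in seen:
            continue
        component, stack = set(), [service]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adj[node] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters


services = {e["service"] for e in events}
clusters = cluster_services(services, depends_on)

# Roll raw events up to the cluster they belong to.
events_per_cluster = [
    [e["id"] for e in events if e["service"] in cluster] for cluster in clusters
]
```

In this toy data, the database, payment and authentication events end up in one cluster while the unrelated search event stays separate, which mirrors how grouping by relationship can separate one unfolding incident from background noise.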

Calmo’s Graph RAG (retrieval-augmented generation over a graph) links temporal data with a knowledge graph to connect services, logs and metrics.

Automated Log Retrieval: When an anomaly occurs, Calmo uses the temporal knowledge graph to build a timeline of related issues. This reduces the time spent manually searching logs and speeds up root cause identification.

Contextual Root Cause Analysis: The system links errors to service interactions and dependencies, so that each root cause analysis takes the surrounding context into account.

Real-Time Correlation: By combining temporal awareness with graph-based intelligence, Calmo automatically traces multi-step outages and identifies root causes without human intervention.
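As a rough illustration of how dependency edges and timestamps can combine to trace a multi-step outage, the sketch below walks upstream from a symptomatic service, keeps only anomalies that started no later than the symptom, and orders them into a timeline. It is a simplified heuristic written for this article’s opening scenario, not Calmo’s algorithm: the `trace_root_cause` function, the dependency map, the dates and the 00:30 timestamp are hypothetical.

```python
from datetime import datetime

# Assumed dependency map: service -> services it depends on.
DEPENDS_ON = {
    "payment-gateway": ["payment-service"],
    "auth-service": ["payment-service"],
    "payment-service": ["transactions-db"],
    "transactions-db": [],
}

# Illustrative anomalies: (service, first seen, description).
ANOMALIES = [
    ("transactions-db", datetime(2025, 1, 1, 0, 0), "slow processing after update"),
    ("payment-service", datetime(2025, 1, 1, 0, 30), "queue growing"),
    ("auth-service", datetime(2025, 1, 1, 6, 0), "intermittent timeouts"),
    ("payment-gateway", datetime(2025, 1, 1, 9, 0), "payment failures"),
]


def trace_root_cause(symptom_service, symptom_time, depends_on, anomalies):
    """Return a time-ordered list of (time, service, description) tuples.

    Follows dependencies upstream from the symptom, but only through
    services whose anomaly began at or before the symptom time.
    """
    first_seen = {svc: (ts, desc) for svc, ts, desc in anomalies}
    timeline, stack, visited = [], [symptom_service], set()
    while stack:
        svc = stack.pop()
        if svc in visited:
            continue
        visited.add(svc)
        if svc not in first_seen:
            continue  # healthy service: stop following this path
        ts, desc = first_seen[svc]
        if ts <= symptom_time:
            timeline.append((ts, svc, desc))
            stack.extend(depends_on.get(svc, []))
    timeline.sort()
    return timeline


timeline = trace_root_cause(
    "payment-gateway", datetime(2025, 1, 1, 9, 0), DEPENDS_ON, ANOMALIES
)
# The earliest anomaly on the upstream path is the root cause candidate.
root_cause_candidate = timeline[0][1]
```

Starting from the 9 AM payment failures, the walk reaches the payment service queue and then the database update, while the authentication timeouts are left out because they are a sibling symptom rather than an upstream cause of the gateway failure.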

Why Temporal Knowledge Graphs are a Game Changer

Temporal knowledge graphs help Calmo track how an incident evolves over time, from minor anomalies to major outages. They allow Calmo to identify hidden patterns and correlations across system events, logs and metrics. This time-aware approach goes beyond traditional monitoring and enables fully autonomous incident detection and response.

The Future of AI SRE with Calmo

We envision a future where AI-driven Site Reliability Engineering achieves full autonomy by continuously learning from temporal patterns and system behaviors, reducing human intervention and improving overall system resilience.

Request a free trial to see Calmo resolve real incidents, pinpoint the cause and provide a fix, without having to monitor multiple dashboards constantly.

AI Root Cause Analysis

Schedule a call with the team
