How we leverage Knowledge Graphs for AI-driven RCA

Calmo Team

At midnight, a routine database update causes a minor delay in processing transactions. This delay leads to a growing queue in the payment service, which goes unnoticed. By 6 AM, the queue is large enough to cause intermittent timeouts in the authentication service, affecting customer logins.

Foundations of Calmo, The AI SRE

What are Knowledge Graphs?

Knowledge graphs store information as entities and the relationships between them, which gives knowledge an explicit structure that is harder to express in the tables of a traditional database. This representation is particularly useful in Site Reliability Engineering (SRE), because graphs are a natural fit for complex systems and their dependencies. Capturing both high-level and low-level relationships between infrastructure components gives a holistic view of system context and health. It also helps identify potential knowledge hazards and supports data integrity.
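To make the idea of entities and relationships concrete, here is a minimal Python sketch of an in-memory graph. It is an illustration only, not Calmo's implementation: the class, the relation names (depends_on, runs_on), and the service names are all hypothetical, and the dependency of the authentication service on the payment service is an assumption drawn from the opening scenario.

```python
from collections import defaultdict


class KnowledgeGraph:
    """A tiny in-memory graph of entities and typed relationships (illustrative)."""

    def __init__(self):
        # subject -> list of (relation, object) pairs
        self.edges = defaultdict(list)

    def add(self, subject, relation, obj):
        self.edges[subject].append((relation, obj))

    def neighbors(self, subject, relation):
        # .get avoids creating empty entries for unknown subjects
        return [o for r, o in self.edges.get(subject, []) if r == relation]

    def transitive(self, start, relation):
        """Every entity reachable from start by following edges of one relation."""
        seen, stack = [], [start]
        while stack:
            node = stack.pop()
            for nxt in self.neighbors(node, relation):
                if nxt not in seen:
                    seen.append(nxt)
                    stack.append(nxt)
        return seen


# Hypothetical services and relations, loosely based on the opening scenario.
kg = KnowledgeGraph()
kg.add("auth-service", "depends_on", "payment-service")
kg.add("payment-service", "depends_on", "transactions-db")
kg.add("payment-service", "runs_on", "k8s-cluster-a")

deps = kg.transitive("auth-service", "depends_on")
print(deps)  # ['payment-service', 'transactions-db']
```

Following depends_on edges from the authentication service reaches the database two hops away, which is the kind of indirect dependency that is easy to miss without a graph.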

Temporal Data and its Impact

Production systems are dynamic. Relationships between systems and services change over time through deployments, code changes, and shifting data flows. That makes temporal data crucial: information tied to a specific point in time or time interval. Temporal data makes it possible to analyze changes over time and is essential for monitoring distributed systems effectively.

In a knowledge graph, temporal data lets Calmo represent how entities and their relationships evolve. Using these time-aware relationships, Calmo can build a more complete picture of system behavior and spot trends, patterns, and anomalies that would otherwise go unnoticed. This temporal awareness is key to proactive site reliability engineering: it allows timely interventions, improves system resilience, and helps prevent cascading failures.
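One common way to attach time to a relationship is to give every edge a validity interval and query the graph "as of" a given instant. The sketch below assumes that model; the TemporalEdge type, its field names, the dates, and the service names are hypothetical and are not taken from Calmo's product.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class TemporalEdge:
    source: str
    relation: str
    target: str
    valid_from: datetime
    valid_to: Optional[datetime] = None  # None means the edge is still valid

    def active_at(self, when: datetime) -> bool:
        # Half-open interval: valid_from <= when < valid_to
        return self.valid_from <= when and (
            self.valid_to is None or when < self.valid_to
        )


def snapshot(edges, when):
    """Return the edges that were valid at the given instant."""
    return [e for e in edges if e.active_at(when)]


# Hypothetical history: the payment service moves to a new database on 1 March.
edges = [
    TemporalEdge("payment-service", "depends_on", "transactions-db-v1",
                 datetime(2024, 1, 1), datetime(2024, 3, 1)),
    TemporalEdge("payment-service", "depends_on", "transactions-db-v2",
                 datetime(2024, 3, 1)),
]

before = snapshot(edges, datetime(2024, 2, 15))
after = snapshot(edges, datetime(2024, 3, 15))
print([e.target for e in before])  # ['transactions-db-v1']
print([e.target for e in after])   # ['transactions-db-v2']
```

Keeping closed intervals instead of overwriting edges means a question like "what did this service depend on when the incident started?" can still be answered after the topology has changed.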

How Calmo Builds and Uses Knowledge Graphs

Calmo's knowledge graph has three interconnected layers that improve incident detection and response.

  • Event Subgraph: Captures raw system events, logs, metrics, and anomalies so that no data is lost.
  • Service Relationship Subgraph: Extracts meaningful connections between services, maps dependencies, and tracks interactions over time.
  • System-Wide Insight Subgraph: Groups related entities into clusters and provides a high-level view of service performance and failure patterns.

This layered approach organizes raw events into structured insights and makes system behavior easier to analyze and understand. By dynamically updating and linking information, Calmo maintains a continuously evolving understanding of system health. Together, these layers are meant to set a new industry benchmark for analysis and forecasting.
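The three layers can be pictured as successive transformations of the same data. The sketch below is a simplified illustration under stated assumptions, not Calmo's pipeline: the event fields (ts in minutes after midnight, service, kind, peer), the rule that an event naming a peer implies a relationship, and the use of connected components as "clusters" are all hypothetical choices.

```python
from collections import Counter, defaultdict

# Layer 1: raw events, stored unchanged (hypothetical fields and values).
events = [
    {"ts": 0, "service": "transactions-db", "kind": "deploy"},
    {"ts": 5, "service": "payment-service", "kind": "latency_anomaly",
     "peer": "transactions-db"},
    {"ts": 9, "service": "payment-service", "kind": "queue_growth",
     "peer": "transactions-db"},
    {"ts": 360, "service": "auth-service", "kind": "timeout",
     "peer": "payment-service"},
]


def extract_relationships(events):
    """Layer 2: service-to-service links, each with the times it was observed."""
    rels = defaultdict(list)
    for e in events:
        if "peer" in e:
            rels[(e["service"], e["peer"])].append(e["ts"])
    return dict(rels)


def summarize(events, relationships):
    """Layer 3: group connected services into clusters and count event kinds."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in relationships:
        parent[find(a)] = find(b)

    clusters = defaultdict(Counter)
    for e in events:
        clusters[find(e["service"])][e["kind"]] += 1
    return {root: dict(counts) for root, counts in clusters.items()}


relationships = extract_relationships(events)
summary = summarize(events, relationships)
print(relationships)
print(summary)
```

Because the first layer keeps every raw event, the two derived layers can be rebuilt at any time with different extraction or clustering rules.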

Calmo's Graph RAG links temporal data with a knowledge graph to connect services, logs and metrics.

  • Automated Log Retrieval: When an anomaly occurs, Calmo uses the temporal knowledge graph to build a timeline of related issues. This reduces time spent manually searching logs and speeds up root cause identification.
  • Contextual Root Cause Analysis: The system links errors to service interactions and dependencies, so each analysis comes with the context in which the error occurred.
  • Real-Time Correlation: By combining temporal awareness with graph-based intelligence, Calmo automatically traces multi-step outages and identifies root causes without human intervention.
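The timeline-building step can be sketched with the scenario from the introduction. The example below is a rough illustration, not Calmo's algorithm: the dependency map, the event tuples, the dates, the 12-hour lookback window, and the heuristic of treating the earliest related event as the root cause candidate are all assumptions made for the example.

```python
from datetime import datetime, timedelta

# Hypothetical dependency map: service -> services it depends on.
depends_on = {
    "auth-service": ["payment-service"],
    "payment-service": ["transactions-db"],
    "transactions-db": [],
}

# Hypothetical events as (timestamp, service, message).
events = [
    (datetime(2024, 5, 1, 0, 0), "transactions-db", "routine update applied"),
    (datetime(2024, 5, 1, 0, 10), "payment-service", "transaction latency above baseline"),
    (datetime(2024, 5, 1, 3, 0), "payment-service", "queue depth growing"),
    (datetime(2024, 5, 1, 4, 0), "search-service", "cache miss spike"),
    (datetime(2024, 5, 1, 6, 0), "auth-service", "intermittent login timeouts"),
]


def upstream(service):
    """The service plus everything it transitively depends on."""
    seen, stack = {service}, [service]
    while stack:
        for dep in depends_on.get(stack.pop(), []):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


def build_timeline(alert_service, alert_time, lookback=timedelta(hours=12)):
    """Events on the alerting service or its upstream dependencies, oldest first."""
    scope = upstream(alert_service)
    related = [
        e for e in events
        if e[1] in scope and alert_time - lookback <= e[0] <= alert_time
    ]
    return sorted(related)


timeline = build_timeline("auth-service", datetime(2024, 5, 1, 6, 0))
candidate = timeline[0]  # naive heuristic: the earliest related event

for ts, service, message in timeline:
    print(ts.strftime("%H:%M"), service, message)
print("candidate root cause:", candidate[1], "-", candidate[2])
```

The unrelated search-service event is filtered out because that service is not upstream of the authentication service, while the midnight database update surfaces as the first entry on the timeline.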

Why Temporal Knowledge Graphs are a Game Changer

Temporal knowledge graphs help Calmo track how an incident evolves over time, from minor anomalies to major outages. They allow Calmo to identify hidden patterns and correlations across system events, logs, and metrics. This time-aware approach goes beyond traditional monitoring and enables fully autonomous incident detection and response.
