Autonomous AI SRE that triages and fixes production incidents using multi-agent orchestration.
It uses hierarchical multi-strategy RAG for context-rich analysis, multi-agent domain orchestration across Kubernetes and AWS, and time-series anomaly correlation.

DevOps Automation | YC W26

Last Updated: March 19, 2026

IncidentFox builds an open-source AI SRE agent that autonomously triages, coordinates, and fixes production incidents using multi-agent orchestration, hierarchical RAG retrieval, and anomaly detection, with support for 24+ LLM providers.
IncidentFox has publicly launched its open-source core platform, the RAPTOR hierarchical retrieval system for context-rich incident analysis, a Slack-native debugging integration, and support for 24+ LLM providers with bring-your-own (BYO) API keys. It has also announced a multi-strategy RAG pipeline (RAPTOR + Knowledge Graph + HyDE + BM25 + Neural Reranking) that achieves 74% Recall@10, along with specialist multi-agent orchestration for Kubernetes, AWS, and other infrastructure domains. All of this is aimed at replacing manual on-call workflows with autonomous AI-driven incident resolution.
Behind the scenes, their GitHub activity and open-source contributions signal rapid iteration on automated runbook execution and deeper integrations with observability platforms (Datadog, Grafana, PagerDuty). Community discussions on Hacker News and Product Hunt indicate demand for proactive incident prevention and predictive anomaly detection, which aligns with their investment in Prophet-based time-series analysis and multi-layer alert correlation. The lean two-person team and YC backing suggest an imminent fundraise to scale engineering, likely targeting Series A in late 2025 or early 2026. There are also strong indicators of a plugin/extension marketplace strategy that builds on their open-source community, and of expansion toward multi-channel collaboration beyond Slack (Teams, Discord, custom webhooks).
Uses RAPTOR hierarchical retrieval combined with multi-strategy RAG to provide AI agents with deep, multi-resolution context from logs, docs, and past incidents for accurate root cause analysis.
It's like giving your on-call engineer a photographic memory of every past incident, runbook, log file, and Slack thread so they instantly know what's wrong and how to fix it.
IncidentFox's RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval) system builds hierarchical tree structures from organizational knowledge by recursively clustering documents, generating LLM-powered summaries at each level, and enabling multi-resolution search across the entire tree. This is combined with a multi-strategy RAG pipeline that fuses RAPTOR with Knowledge Graph retrieval, HyDE (Hypothetical Document Embeddings), BM25 lexical search, and Neural Reranking to achieve 74% Recall@10 on internal benchmarks. During an incident, the system retrieves not just keyword-matched snippets but semantically rich, hierarchically organized context—connecting a current database timeout to a similar incident from six months ago, the relevant runbook, the infrastructure change that preceded it, and the Slack thread where the fix was discussed. This dramatically reduces the cognitive load on responders and enables the AI agent to propose accurate root causes and remediation steps with full evidentiary context.
It's like having a librarian who not only knows every book in the library but has already read them all, written CliffsNotes at five different detail levels, and can hand you exactly the right page before you finish asking your question.
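To make the fusion step of the multi-strategy RAG pipeline more concrete, here is a minimal sketch in Python. It assumes the per-strategy result lists are merged with reciprocal rank fusion (RRF), which is a common choice but is not confirmed as IncidentFox's method. The document IDs, the `rankings` dictionary, and both function names are hypothetical. The `recall_at_k` helper shows how a figure such as Recall@10 is typically computed.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked doc-id lists: each doc scores sum(1 / (k + rank)).

    ASSUMPTION: RRF is an illustrative fusion rule, not IncidentFox's
    documented method. k=60 is the constant commonly used with RRF.
    """
    scores = defaultdict(float)
    for ranked_ids in rankings.values():
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


# Hypothetical per-strategy results for a "database timeout" query.
rankings = {
    "raptor": ["inc-412", "runbook-db", "slack-881"],
    "bm25": ["runbook-db", "deploy-77", "inc-412"],
    "hyde": ["inc-412", "deploy-77"],
}

fused = reciprocal_rank_fusion(rankings)
print(fused)
print(recall_at_k(fused, {"inc-412", "slack-881", "postmortem-9"}, k=3))
```

A document that several strategies rank highly (here the past incident `inc-412`) rises to the top even if no single strategy is decisive, which is the practical reason to combine lexical, semantic, and hierarchical retrieval. A neural reranker, if used, would typically run on the fused list afterward.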
Deploys specialist AI agents for distinct infrastructure domains (Kubernetes, AWS, databases, networking) that collaborate autonomously to investigate and resolve complex cross-system production incidents.
It's like having a team of specialist doctors—one for the heart, one for the lungs, one for the brain—who huddle together instantly when a patient arrives in the ER and agree on a diagnosis in seconds.
IncidentFox's multi-agent orchestration system assigns specialist AI agents to distinct infrastructure domains—one agent deeply understands Kubernetes pod scheduling and resource limits, another specializes in AWS service dependencies and IAM configurations, another focuses on database query performance and replication lag, and so on. When an incident fires, a coordinator agent analyzes the initial alert signal, determines which domains are likely involved, and dispatches the relevant specialist agents in parallel. Each specialist agent autonomously investigates its domain—querying metrics, sampling logs, checking recent deployments, and correlating anomalies—then reports findings back to the coordinator. The coordinator synthesizes cross-domain findings, identifies the root cause (e.g., a Kubernetes HPA scaling event that overwhelmed a downstream RDS connection pool), and proposes or executes a remediation action. The agents continuously learn from every incident, updating their domain knowledge from Slack conversations, code changes, and documentation updates, making them progressively more accurate over time. This architecture mirrors how elite SRE teams operate—with domain experts collaborating—but at machine speed and 24/7 availability.
It's like assembling the Avengers every time your website goes down, except each superhero specializes in a different part of your infrastructure and they never need coffee breaks.
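The coordinator-and-specialists flow described above can be sketched roughly as follows. This is an illustrative outline only: the agent functions, the keyword routing table, the `Finding` type, and the confidence values are all hypothetical stand-ins, and real specialists would call an LLM and query live infrastructure rather than return canned results.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Finding:
    domain: str
    summary: str
    confidence: float  # 0.0-1.0, self-reported by the specialist


# Hypothetical specialists. Each would normally query metrics, sample
# logs, and check recent deployments for its own domain.
async def kubernetes_agent(alert: dict) -> Finding:
    await asyncio.sleep(0)  # stands in for real I/O
    return Finding("kubernetes", "HPA scaled checkout from 4 to 40 pods", 0.6)


async def database_agent(alert: dict) -> Finding:
    await asyncio.sleep(0)
    return Finding("database", "RDS connection pool saturated", 0.9)


async def aws_agent(alert: dict) -> Finding:
    await asyncio.sleep(0)
    return Finding("aws", "No IAM or service-limit changes found", 0.2)


# ASSUMPTION: keyword routing is a placeholder for the coordinator's
# actual alert analysis, which the source does not describe in detail.
SPECIALISTS = {
    "pod": kubernetes_agent,
    "timeout": database_agent,
    "iam": aws_agent,
}


def select_specialists(alert: dict) -> list:
    text = alert["message"].lower()
    return [agent for keyword, agent in SPECIALISTS.items() if keyword in text]


async def coordinate(alert: dict) -> list:
    agents = select_specialists(alert)
    # Dispatch the chosen specialists in parallel and wait for all reports.
    findings = await asyncio.gather(*(agent(alert) for agent in agents))
    # Synthesis is reduced here to ranking findings by confidence.
    return sorted(findings, key=lambda f: f.confidence, reverse=True)


alert = {"message": "Checkout pod errors: database timeout after scale-up"}
findings = asyncio.run(coordinate(alert))
for finding in findings:
    print(f"{finding.domain}: {finding.summary} ({finding.confidence:.1f})")
```

Running the specialists concurrently means the investigation takes roughly as long as the slowest domain rather than the sum of all of them. In a fuller version the coordinator would reason across findings (for example, linking the scaling event to the saturated pool) instead of simply sorting them.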
Applies Meta's Prophet time-series algorithm with multi-layer alert correlation and continuous learning to detect infrastructure anomalies early, reduce alert noise, and surface only actionable, correlated incident signals.
It's like a smoke detector that not only smells smoke but also checks the stove, the wiring, and the neighbor's barbecue before deciding whether to wake you up.
IncidentFox integrates Meta's Prophet algorithm for time-series forecasting and anomaly detection across infrastructure metrics—CPU utilization, memory pressure, request latency, error rates, queue depths, and more. Prophet's ability to handle seasonality, trend changes, and holiday effects makes it particularly effective for production systems with cyclical traffic patterns. But raw anomaly detection alone generates excessive noise, so IncidentFox layers a multi-layer alert correlation engine on top: when Prophet flags an anomaly in request latency, the system automatically checks correlated signals—has error rate spiked? Are database connections saturating? Did a deployment just roll out? Is a dependent service degraded? By correlating anomalies across metrics, logs, traces, and deployment events simultaneously, the system suppresses false positives and surfaces only high-confidence, actionable incident signals. The continuous learning component means the system adapts to each organization's unique baseline patterns, seasonal traffic shifts, and infrastructure topology over time, becoming increasingly precise. This transforms the traditional alert-fatigue-ridden on-call experience into a focused, signal-rich incident detection pipeline.
It's like having a weather forecaster who doesn't just tell you it might rain, but cross-references the barometer, satellite imagery, and your neighbor's arthritic knee before confidently telling you to grab an umbrella.
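The correlation idea can be illustrated with a small, dependency-free sketch. It assumes each metric sample already carries a forecast interval; in a Prophet-based setup these bounds would typically come from the `yhat_lower` and `yhat_upper` columns of the forecast. The `Point` type, the function names, the 300-second window, and the two-signal threshold are hypothetical choices for illustration, not IncidentFox's actual rules.

```python
from dataclasses import dataclass


@dataclass
class Point:
    metric: str
    ts: int        # Unix timestamp in seconds
    value: float   # observed value
    lower: float   # forecast lower bound (e.g. Prophet's yhat_lower)
    upper: float   # forecast upper bound (e.g. Prophet's yhat_upper)


def is_anomalous(p: Point) -> bool:
    """A sample is anomalous when it falls outside its forecast interval."""
    return p.value < p.lower or p.value > p.upper


def corroborating_signals(primary, points, deploy_times, window_s=300):
    """Collect other anomalous metrics and recent deploys near the primary."""
    signals = {
        p.metric
        for p in points
        if p.metric != primary.metric
        and is_anomalous(p)
        and abs(p.ts - primary.ts) <= window_s
    }
    # A deployment shortly before the anomaly counts as supporting evidence.
    if any(0 <= primary.ts - t <= window_s for t in deploy_times):
        signals.add("recent_deploy")
    return sorted(signals)


def should_page(primary, points, deploy_times, min_signals=2):
    """ASSUMPTION: page only when enough independent signals agree."""
    if not is_anomalous(primary):
        return False
    return len(corroborating_signals(primary, points, deploy_times)) >= min_signals


points = [
    Point("error_rate", 1060, 0.12, 0.0, 0.02),      # anomalous
    Point("cpu_utilization", 1030, 0.55, 0.2, 0.8),  # within bounds
]
deploy_times = [900]

spike = Point("request_latency_ms", 1000, 950.0, 100.0, 400.0)
lone_spike = Point("request_latency_ms", 5000, 950.0, 100.0, 400.0)

print(corroborating_signals(spike, points, deploy_times))
print(should_page(spike, points, deploy_times))
print(should_page(lone_spike, points, deploy_times))
```

The first latency spike coincides with an error-rate anomaly and a recent deployment, so it is surfaced; the second spike has the same magnitude but no supporting evidence, so it is suppressed. That difference is the noise reduction the correlation layer is meant to provide.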
IncidentFox combines open-source, community-driven extensibility with a proprietary multi-strategy RAG system (RAPTOR + Knowledge Graphs) and multi-agent orchestration, enabling context-rich autonomous incident resolution that learns continuously from every Slack thread, codebase change, and documentation update. The result is a self-improving SRE brain that traditional alerting tools cannot replicate.