Sentinel.AI Overview
Real-time health across all agents, workflows, and reliability systems
Success Rate
-
Agent runs (24h)
-
Open Incidents
-
Unresolved
-
Active Workflows
-
Running pipelines
-
Circuit Breakers
-
Open / Total
-
P95 Latency
-
ms
Total Tokens
-
All agents
Total Cost
-
USD (estimated)
DLQ Size
-
Failed tasks pending retry
Agent Breakdown
| Agent | Runs | Success | Avg Latency | Avg Cost |
|---|---|---|---|---|
Incident Types (Open)
Workflow Failure Replay
Run-level execution traces, step failures, and one-click replay from any failure point
Select a run to view details
Agent Traces
Every agent run with Gantt timeline, checkpoints, and replay
Incidents
Agent loops, cascading failures, silent errors, latency spikes
Blast Radius Containment
If an agent fails, which downstream agents, users, and workflows are affected?
Select Agent to Analyze
Impact Summary
Select an agent and compute
Dependency Graph & Blast Radius
Compute blast radius to see the dependency graph
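One way to answer the blast radius question is a breadth-first walk over the dependency graph, starting at the failed agent. The sketch below is illustrative only: the agent names are taken from the SLO table, but the edges in `DOWNSTREAM` and the `blast_radius` helper are assumptions, not Sentinel.AI's actual graph or API.

```python
from collections import deque

# Hypothetical dependency map: each agent -> agents that consume its output.
DOWNSTREAM = {
    "orchestrator": ["support-agent", "code-assistant", "data-analyst"],
    "data-analyst": ["support-agent"],
    "code-assistant": [],
    "support-agent": [],
}

def blast_radius(graph, failed):
    """Return every agent reachable downstream of `failed`, sorted by name."""
    seen = set()
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            # Skip agents already visited and guard against cycles back to the source.
            if child not in seen and child != failed:
                seen.add(child)
                queue.append(child)
    return sorted(seen)

affected = blast_radius(DOWNSTREAM, "orchestrator")
```

Affected users and workflows could be derived the same way by attaching them to graph nodes and collecting them during the walk.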
Reliability Guarantees
Circuit breakers, error budgets, dead letter queue, and retry policies
Circuit Breakers
Error Budgets (24h window)
Dead Letter Queue: Failed Tasks Awaiting Retry
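To show how a circuit breaker and a dead letter queue can work together, here is a minimal sketch. It assumes a breaker that opens after a fixed number of consecutive failures and allows a trial call after a cool-down; the class, the `run_task` helper, and the default thresholds are hypothetical and not taken from Sentinel.AI.

```python
import time

class CircuitBreaker:
    """Opens after `max_failures` consecutive failures; allows a trial call after `reset_after` seconds."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def allow(self):
        if self.opened_at is None:
            return True
        # Half-open: permit a trial call once the cool-down has elapsed.
        return self.clock() - self.opened_at >= self.reset_after

    def record(self, ok):
        if ok:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()

def run_task(task, handler, breaker, dlq):
    """Run `handler(task)`; park the task in `dlq` if the breaker is open or the call fails."""
    if not breaker.allow():
        dlq.append(task)
        return None
    try:
        result = handler(task)
    except Exception:
        breaker.record(False)
        dlq.append(task)
        return None
    breaker.record(True)
    return result
```

Tasks parked in `dlq` stay there until a retry policy picks them up; the retry schedule itself is left out of this sketch.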
Rollback & Replay
Every agent step is checkpointed. Replay from any point with modified inputs.
Select Trace to Replay
Replay Result
Select a trace and click Replay on any checkpoint
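Checkpointed replay can be sketched as follows: the state is saved before each step, and a replay restores one saved state, applies any modified inputs, and re-runs the remaining steps. The function names, the checkpoint shape, and the two-step pipeline are assumptions for illustration, not Sentinel.AI's implementation.

```python
import copy

def run_with_checkpoints(steps, state):
    """Run each (name, fn) step, saving a copy of the state before the step executes."""
    checkpoints = []
    for index, (name, fn) in enumerate(steps):
        checkpoints.append({"index": index, "step": name, "state": copy.deepcopy(state)})
        state = fn(state)
    return state, checkpoints

def replay(steps, checkpoint, overrides=None):
    """Re-run from `checkpoint` onward, optionally overriding parts of the saved state."""
    state = copy.deepcopy(checkpoint["state"])
    state.update(overrides or {})
    for _name, fn in steps[checkpoint["index"]:]:
        state = fn(state)
    return state

# Hypothetical two-step pipeline: split text into tokens, then count them.
steps = [
    ("parse", lambda s: {**s, "tokens": s["text"].split()}),
    ("count", lambda s: {**s, "count": len(s["tokens"])}),
]
final, checkpoints = run_with_checkpoints(steps, {"text": "a b c"})

# Replay from the checkpoint taken before "count", with modified tokens.
replayed = replay(steps, checkpoints[1], {"tokens": ["x"]})
```

Deep-copying the state keeps a replay from mutating the original trace, at the cost of extra memory per checkpoint.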
Service Level Objectives
Agent reliability targets with error budgets and burn rate alerts
Per-Agent SLO Targets
| Agent | Success Rate Target | P95 Latency Target | Tool Failure Target |
|---|---|---|---|
| support-agent | 99.5% | 3,000ms | <0.5% |
| code-assistant | 98.0% | 10,000ms | <2% |
| data-analyst | 97.0% | 15,000ms | <3% |
| orchestrator | 99.9% | 30,000ms | <0.1% |
| default | 99.0% | 5,000ms | <1% |
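The success rate targets above translate directly into error budgets and burn rates. The sketch below uses the common definitions (budget = 1 minus the target, burn rate = observed error rate divided by the allowed error rate); the function names are hypothetical, and the alert thresholds Sentinel.AI applies are not stated here, so none are assumed.

```python
# Success rate targets copied from the per-agent SLO table.
SLO_SUCCESS_TARGETS = {
    "support-agent": 0.995,
    "code-assistant": 0.980,
    "data-analyst": 0.970,
    "orchestrator": 0.999,
    "default": 0.990,
}

def allowed_error_rate(agent):
    """Fraction of runs that may fail without breaching the target."""
    target = SLO_SUCCESS_TARGETS.get(agent, SLO_SUCCESS_TARGETS["default"])
    return 1.0 - target

def error_budget(agent, total_runs):
    """Number of failed runs the target tolerates over `total_runs`."""
    return allowed_error_rate(agent) * total_runs

def burn_rate(agent, failed_runs, total_runs):
    """Observed error rate over allowed error rate; above 1.0 means the budget is being overspent."""
    if total_runs == 0:
        return 0.0
    return (failed_runs / total_runs) / allowed_error_rate(agent)

# Example: 2,000 support-agent runs in the window, 20 of them failed.
budget = error_budget("support-agent", 2000)
rate = burn_rate("support-agent", 20, 2000)
```

With a 99.5% target, 2,000 runs tolerate about 10 failures, so 20 failures correspond to a burn rate of about 2: the budget is being spent twice as fast as the target allows.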