Sentinel.AI Overview
Real-time health across all agents, workflows, and reliability systems
Success Rate: agent runs (24h)
Open Incidents: unresolved
Active Workflows: running pipelines
Circuit Breakers: open / total
P95 Latency: milliseconds
Total Tokens: all agents
Total Cost: USD (estimated)
DLQ Size: failed tasks pending retry
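The Success Rate and P95 Latency cards summarize raw run data. The dashboard does not say which percentile method it uses, so this minimal sketch assumes the nearest-rank definition and a plain list of per-run latencies in milliseconds; the function names are hypothetical.

```python
def success_rate(successes: int, total_runs: int) -> float:
    """Fraction of agent runs that succeeded; 0.0 when there were no runs."""
    return successes / total_runs if total_runs else 0.0


def p95_latency_ms(latencies_ms: list[float]) -> float:
    """Nearest-rank 95th percentile: the smallest sample such that at least
    95% of all samples are less than or equal to it."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_ms)
    # Integer ceiling of 0.95 * n avoids floating-point rounding in the rank.
    rank = (95 * len(ordered) + 99) // 100  # 1-based rank
    return ordered[rank - 1]


if __name__ == "__main__":
    latencies = [120, 340, 95, 2800, 410, 600, 150, 980, 230, 310]
    print(p95_latency_ms(latencies))  # 2800: with 10 samples the rank is 10, the maximum
    print(success_rate(996, 1000))    # 0.996
```

With small samples the nearest-rank P95 is simply the slowest run; interpolating methods (for example NumPy's default `"linear"` method in `numpy.percentile`) would report a lower value for the same data.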
Agent Breakdown
Agent | Runs | Success | Avg Latency | Avg Cost
Incident Types (Open)
Workflow Failure Replay
Run-level execution traces, step failures, and one-click replay from any failure point
Select a run to view details
Agent Traces
Every agent run with Gantt timeline, checkpoints, and replay
Incidents
Agent loops, cascading failures, silent errors, latency spikes
Blast Radius Containment
If an agent fails, which downstream agents, users, and workflows are affected?
Select Agent to Analyze
Impact Summary
Select an agent and compute its blast radius
Dependency Graph & Blast Radius
Compute blast radius to see the dependency graph
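Computing a blast radius amounts to a reachability query on the dependency graph: start at the failed agent and follow "is depended on by" edges. This is a minimal sketch, assuming the graph is available as a hypothetical adjacency map from each node to its direct dependents; the edges in the example are illustrative, not taken from the product. Breadth-first search also yields each node's hop distance, which can help rank how directly it is affected.

```python
from collections import deque


def blast_radius(dependents: dict[str, list[str]], failed: str) -> dict[str, int]:
    """Breadth-first search from `failed` along dependent edges.
    Returns each affected node with its hop distance; cycles are handled
    by never revisiting a node."""
    distance = {failed: 0}
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for child in dependents.get(node, []):
            if child not in distance:
                distance[child] = distance[node] + 1
                queue.append(child)
    del distance[failed]  # the failed agent itself is not part of its blast radius
    return distance


if __name__ == "__main__":
    # Hypothetical edges: "A": ["B"] means B depends on A.
    dependents = {
        "data-analyst": ["code-assistant", "workflow:weekly-report"],
        "code-assistant": ["orchestrator"],
        "orchestrator": ["user:alice", "data-analyst"],  # cycle back to the start
    }
    print(blast_radius(dependents, "data-analyst"))
    # {'code-assistant': 1, 'workflow:weekly-report': 1, 'orchestrator': 2, 'user:alice': 3}
```

Agents, workflows, and users can share one graph by prefixing node names, as the hypothetical `workflow:` and `user:` nodes do here.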
Reliability Guarantees
Circuit breakers, error budgets, dead letter queue, and retry policies
Circuit Breakers
Error Budgets (24h window)
Dead Letter Queue: Failed Tasks Awaiting Retry
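The classic circuit-breaker pattern has three states: closed (calls pass), open (calls are rejected), and half-open (calls are let through to probe recovery). This sketch assumes consecutive-failure counting and a fixed reset timeout; the `CircuitBreaker` class, its parameters, and the `run_with_retries` helper that parks exhausted tasks in a dead letter queue are all hypothetical, and backoff delays between retries are omitted for brevity.

```python
import time
from typing import Any, Callable, Optional


class CircuitBreaker:
    """Closed: calls pass. Open: calls are rejected until `reset_timeout_s`
    has elapsed. Half-open: calls pass, and the next recorded outcome either
    closes the breaker (success) or re-opens it (failure)."""

    def __init__(self, failure_threshold: int = 5, reset_timeout_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock  # injectable so tests can control time
        self.state = "closed"
        self.consecutive_failures = 0
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout_s:
                return False
            self.state = "half_open"  # timeout elapsed: probe the dependency
        return True

    def record_success(self) -> None:
        self.state = "closed"
        self.consecutive_failures = 0

    def record_failure(self) -> None:
        self.consecutive_failures += 1
        if self.state == "half_open" or self.consecutive_failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()


def run_with_retries(task: Callable[[Any], Any], payload: Any, max_attempts: int,
                     dead_letter_queue: list[dict[str, Any]]) -> Optional[Any]:
    """Try `task` up to `max_attempts` times; on exhaustion, park the payload
    in the dead letter queue for a later retry and return None."""
    last_error: Optional[BaseException] = None
    for _ in range(max_attempts):
        try:
            return task(payload)
        except Exception as exc:  # a real retry policy would back off between attempts
            last_error = exc
    dead_letter_queue.append(
        {"payload": payload, "error": repr(last_error), "attempts": max_attempts}
    )
    return None
```

"Open / Total" on the overview card would then count breakers whose `state` is `"open"`; tasks in the dead letter queue keep their original payload so they can be resubmitted unchanged.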
Rollback & Replay
Every agent step is checkpointed. Replay from any point with modified inputs.
Select Trace to Replay
Replay Result
โฎ๏ธ
Select a trace and click Replay on any checkpoint
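Checkpointed replay means saving each step's input so a run can resume at any step, optionally with edited inputs, instead of starting over. A minimal sketch, assuming each step is a pure function from a state dictionary to a new state dictionary; the step functions and helper names are hypothetical.

```python
import copy
from typing import Any, Callable, Optional

State = dict[str, Any]
Step = Callable[[State], State]


def run_with_checkpoints(steps: list[Step], state: State) -> list[State]:
    """Run steps in order; checkpoints[i] is the state *before* steps[i],
    and the last entry is the final output."""
    checkpoints: list[State] = []
    for step in steps:
        checkpoints.append(copy.deepcopy(state))
        state = step(state)
    checkpoints.append(copy.deepcopy(state))
    return checkpoints


def replay_from(steps: list[Step], checkpoints: list[State], index: int,
                overrides: Optional[State] = None) -> State:
    """Resume at steps[index] from its saved input, optionally patching that input.
    The stored checkpoint is copied, so replays never mutate history."""
    state = copy.deepcopy(checkpoints[index])
    state.update(overrides or {})
    for step in steps[index:]:
        state = step(state)
    return state


if __name__ == "__main__":
    steps: list[Step] = [
        lambda s: {**s, "query": s["question"].strip().lower()},   # hypothetical step 1
        lambda s: {**s, "answer": f"results for '{s['query']}'"},  # hypothetical step 2
    ]
    cps = run_with_checkpoints(steps, {"question": "  Refund Policy "})
    print(cps[-1]["answer"])  # results for 'refund policy'
    # Replay only step 2, with a modified input:
    print(replay_from(steps, cps, 1, {"query": "refund window"})["answer"])  # results for 'refund window'
```

Deep copies keep later steps from mutating earlier checkpoints; steps with external side effects (tool calls, writes) would need idempotency or mocking before they are safe to replay.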
Service Level Objectives
Agent reliability targets with error budgets and burn rate alerts
Per-Agent SLO Targets
Agent | Success Rate Target | P95 Latency Target | Tool Failure Target
support-agent | 99.5% | 3,000 ms | <0.5%
code-assistant | 98.0% | 10,000 ms | <2%
data-analyst | 97.0% | 15,000 ms | <3%
orchestrator | 99.9% | 30,000 ms | <0.1%
default | 99.0% | 5,000 ms | <1%
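A success-rate target implies an error budget of one minus the target: support-agent at 99.5% may fail 0.5% of runs. Burn rate is the observed failure rate divided by that budget, so 1.0 spends the budget exactly over the window and 2.0 spends it twice as fast. This minimal sketch uses the success-rate targets from the table above; the function names are hypothetical, and the burn-rate threshold a deployment would alert on is not stated here.

```python
# Success-rate targets from the Per-Agent SLO table; unknown agents fall back to "default".
SUCCESS_TARGETS = {
    "support-agent": 0.995,
    "code-assistant": 0.98,
    "data-analyst": 0.97,
    "orchestrator": 0.999,
    "default": 0.99,
}


def error_budget(agent: str) -> float:
    """Allowed failure fraction: 1 - success-rate target."""
    target = SUCCESS_TARGETS.get(agent, SUCCESS_TARGETS["default"])
    return 1.0 - target


def burn_rate(agent: str, runs: int, failures: int) -> float:
    """Observed failure rate divided by the error budget for the window."""
    if runs == 0:
        return 0.0
    return (failures / runs) / error_budget(agent)


def budget_remaining(agent: str, runs: int, failures: int) -> float:
    """Fraction of the window's error budget still unspent (negative once overspent)."""
    allowed_failures = error_budget(agent) * runs
    if allowed_failures == 0:
        return 1.0
    return 1.0 - failures / allowed_failures


if __name__ == "__main__":
    # Illustrative 24h window: 1,000 support-agent runs with 10 failures.
    print(round(burn_rate("support-agent", 1000, 10), 3))         # 2.0
    print(round(budget_remaining("support-agent", 1000, 10), 3))  # -1.0
```

Because targets are stored as floats, budgets such as `1.0 - 0.995` carry tiny rounding error, so comparisons against an alert threshold should not rely on exact equality.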