Research Project · Case Study

Exploring reliability for multi-agent AI

A research project investigating how to validate agent handoffs, replay from failure checkpoints, and prevent race conditions in concurrent agent pipelines — built as a working proof of concept.

[Dashboard preview — agentsentinelai.com/dashboard: 3 active runs · 94% success rate · 1 open incident · 284 ms avg latency. Recent pipelines: my_pipeline (success, 2 min ago), lead_qualifier (blocked, 12 min ago), content_drafter (success, 28 min ago), data_enricher (running, just now).]

Research Findings

What existing tools can't do.

Monitoring detects. Alerting notifies. Tracing records. Circuit breakers isolate. None of it validates what one agent hands to the next, recovers from a specific failure point, or prevents concurrent agents from overwriting shared state. This project explores what a purpose-built reliability layer would look like.

🔗
Observability

Tracing

Every agent step recorded — status, latency, inputs, outputs. See the full pipeline in one view. Equivalent to structured logging, but built for agent handoffs.

Logging does this too, but unstructured and without the pipeline view.

📋
Enforcement · Can't do with logging

Contract + Replay

Define the schema each agent expects. Sentinel validates every handoff — bad output is blocked before the next agent runs. An incident is created, a checkpoint saved. Fix the output, replay from there.

Logging records the failure after the fact. Sentinel prevents it and gives you a recovery path.

🔒
Coordination · Can't do with logging

Shared State

Safe concurrent writes for parallel agents. If two agents write to the same key simultaneously, Sentinel retries and merges — no silent overwrites, no data loss.

Logging can't prevent a race condition, only record that it happened.
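A compare-and-set loop with merge-on-conflict is one way such writes can work. This standalone sketch (the `SharedState` class and its merge rule are illustrative, not Sentinel's actual internals) shows the idea: each write re-reads the current value, merges its updates in, and retries if another agent got there first.

```python
import threading

class SharedState:
    """Illustrative versioned key-value store: writes retry and merge on conflict."""
    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}      # key -> dict value, so merges are well-defined
        self._versions = {}  # key -> monotonically increasing version

    def read(self, key):
        with self._lock:
            return self._data.get(key, {}), self._versions.get(key, 0)

    def _try_write(self, key, value, expected_version):
        with self._lock:
            if self._versions.get(key, 0) != expected_version:
                return False  # another agent wrote first; caller must retry
            self._data[key] = value
            self._versions[key] = expected_version + 1
            return True

    def write_merged(self, key, updates):
        """Retry until our updates land, merged with whatever is already there."""
        while True:
            current, version = self.read(key)
            merged = {**current, **updates}
            if self._try_write(key, merged, version):
                return merged

state = SharedState()

def agent(updates):
    state.write_merged("lead", updates)

# Two agents write to the same key concurrently; neither write is lost.
t1 = threading.Thread(target=agent, args=({"email": "a@example.com"},))
t2 = threading.Thread(target=agent, args=({"score": 87},))
t1.start(); t2.start(); t1.join(); t2.join()
print(state.read("lead")[0])  # both fields survive: email and score are present
```

The version check is what turns a silent last-writer-wins overwrite into a detected conflict that can be merged instead.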

How it works

Under the hood

Sentinel instruments at the Python function level — context managers, class-level monkey-patching, and inline validation. No daemons, no sidecars, no config files.

01

sentinel.workflow() opens a run

Tracing

Assigns a run_id, POSTs to the ingestion API. The context manager auto-closes the run as success, blocked, or failed when the block exits — no try/finally needed.
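The auto-close behaviour can be sketched with `contextlib` (the `Run` class, statuses, and exception type here are hypothetical stand-ins, not Sentinel's real implementation):

```python
import uuid
from contextlib import contextmanager

class ContractViolationError(Exception):
    """Hypothetical: raised when a handoff fails validation."""

class Run:
    def __init__(self, name):
        self.run_id = str(uuid.uuid4())
        self.name = name
        self.status = "running"

@contextmanager
def workflow(name):
    run = Run(name)  # the real system would also POST to the ingestion API here
    try:
        yield run
        run.status = "success"        # block exited cleanly
    except ContractViolationError:
        run.status = "blocked"        # a handoff was rejected
        raise
    except Exception:
        run.status = "failed"         # any other error
        raise

with workflow("travel-planner") as run:
    pass
print(run.status)  # success
```

Because the context manager owns the exit path, the caller never writes try/finally: the run's final status falls out of how the block terminated.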

02

run.step() wraps each agent

Tracing

Records step name, type (llm_call / tool_call), input, output, and wall-clock latency. Steps nest under the run automatically — no manual wiring.

03

patch_openai_async() hooks at the class level

Auto-instrument

Patches AsyncCompletions.create on the OpenAI SDK class itself — so every AsyncOpenAI client created anywhere, including inside third-party libraries, becomes a traced step under your active run.
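Class-level patching can be sketched as follows, using a stand-in `Completions` class rather than the real OpenAI SDK so the snippet is self-contained; the tracing store is a plain list standing in for "steps under the active run":

```python
import functools
import time

class Completions:
    """Stand-in for the SDK class whose method gets patched."""
    def create(self, prompt):
        return f"response to: {prompt}"

TRACE = []  # stand-in for steps recorded under the active run

def patch_completions():
    original = Completions.create

    @functools.wraps(original)
    def traced(self, prompt):
        start = time.perf_counter()
        result = original(self, prompt)
        TRACE.append({"step": "llm_call", "input": prompt,
                      "latency_ms": (time.perf_counter() - start) * 1000})
        return result

    Completions.create = traced  # patch the class, not an instance

patch_completions()
client = Completions()   # instances created after patching are traced too,
client.create("plan a trip")  # including ones built inside other libraries
print(TRACE[0]["step"])  # llm_call
```

Patching the class attribute is the key move: every instance resolves `create` through the class, so third-party code that constructs its own client still goes through the traced wrapper.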

04

handoff() validates at the boundary

Contracts

Before the next agent runs, Sentinel checks the payload against the registered contract. Wrong type, missing field, or out-of-range value raises ContractViolationError — the downstream agent never executes.
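One way such boundary validation could look, with the contract registry and `handoff` function as illustrative sketches rather than Sentinel's actual API surface:

```python
class ContractViolationError(Exception):
    pass

CONTRACTS = {}  # downstream agent name -> {field: (expected type, optional check)}

def register_contract(agent, schema):
    CONTRACTS[agent] = schema

def handoff(agent, payload):
    """Validate a payload against the downstream agent's contract before it runs."""
    for field, (ftype, check) in CONTRACTS[agent].items():
        if field not in payload:
            raise ContractViolationError(f"missing field: {field}")
        if not isinstance(payload[field], ftype):
            raise ContractViolationError(f"{field}: expected {ftype.__name__}")
        if check is not None and not check(payload[field]):
            raise ContractViolationError(f"{field}: out of range")
    return payload

register_contract("researcher", {
    "query": (str, None),
    "max_results": (int, lambda n: 1 <= n <= 100),
})

handoff("researcher", {"query": "flights to Lisbon", "max_results": 10})  # passes
try:
    handoff("researcher", {"query": "flights", "max_results": 500})       # blocked
except ContractViolationError as e:
    print(e)  # max_results: out of range
```

The point of raising before the call, rather than logging after it, is that the downstream agent never consumes the malformed payload at all.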

05

Checkpoint saved — replay when ready

Replay

A checkpoint is saved at every handoff. Fix the payload, call replay() — Sentinel fast-forwards past the checkpoint, reusing outputs from all prior steps.
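The fast-forward behaviour can be modelled as a step cache keyed by step name; this is a hypothetical model of the mechanism, not the real code:

```python
CALLS = []  # which steps actually executed, for illustration

def checkpointed_step(name, fn, checkpoint):
    """Run fn only if the checkpoint holds no output for this step."""
    if name in checkpoint:
        return checkpoint[name]   # fast-forward: reuse the saved output
    result = fn()
    CALLS.append(name)            # record only work that completed
    checkpoint[name] = result     # checkpoint saved at the handoff boundary
    return result

def run_pipeline(checkpoint, researcher_fn):
    plan = checkpointed_step("planner", lambda: "3-day itinerary", checkpoint)
    return checkpointed_step("researcher", lambda: researcher_fn(plan), checkpoint)

checkpoint = {}
try:
    run_pipeline(checkpoint, lambda plan: 1 / 0)   # researcher step fails
except ZeroDivisionError:
    pass

# Fix the failing step, then replay: planner's output comes from the checkpoint.
result = run_pipeline(checkpoint, lambda plan: f"hotels for {plan}")
print(result)  # hotels for 3-day itinerary
print(CALLS)   # ['planner', 'researcher']; planner executed only once
```

Because every completed step's output lives in the checkpoint, a replay only re-executes work downstream of the fix.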

import sentinel

sentinel.init(api_key="sk_live_...")

with sentinel.workflow("travel-planner") as run:
    with run.step("planner", step_type="llm_call") as step:
        step.set_input({"query": query})
        result = planner_agent(query)
        step.set_output(result)

    with run.step("researcher", step_type="tool_call") as step:
        step.set_input(result)
        data = researcher_agent(result)
        step.set_output({"findings": data})

Live Prototype

Explore the working prototype

The concepts in this research are implemented as a working proof of concept — every step traced, every failure caught, full replay capability. Open to explore.

Integrations

Works with your stack

Sentinel wraps your existing framework — no rewrites, no lock-in. If it makes LLM calls or coordinates agents, it works.

OpenAI · LLM
Anthropic · LLM
Gemini · LLM
Llama · LLM
LangChain · Framework
LlamaIndex · Framework
AutoGen · Framework
CrewAI · Framework
LangGraph · Framework
Any HTTP API · Custom

Don't see your stack? Sentinel instruments at the Python function level — if you can call it, Sentinel can watch it.
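Function-level instrumentation for an arbitrary callable can be sketched as a plain decorator; this is a generic illustration, not Sentinel's API, and the `STEPS` list stands in for the trace backend:

```python
import functools
import time

STEPS = []  # stand-in for the trace backend

def watch(step_type="tool_call"):
    """Wrap any callable so each invocation is recorded as a traced step."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                result = fn(*args, **kwargs)
                status = "success"
                return result
            except Exception:
                status = "failed"
                raise
            finally:
                STEPS.append({"name": fn.__name__, "type": step_type,
                              "status": status,
                              "latency_ms": (time.perf_counter() - start) * 1000})
        return wrapper
    return decorator

@watch()
def fetch_weather(city):
    return {"city": city, "temp_c": 21}

fetch_weather("Lisbon")
print(STEPS[0]["name"])  # fetch_weather
```

Because the wrapper only needs a callable, the same pattern covers framework internals, custom tools, and plain HTTP helpers alike.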

Collaborate

Interested in this research?

Reach out to discuss the ideas, explore a collaboration, or ask questions about the prototype.