Production agentic telemetry · live since 2026-03-15

The only record of how AI agents actually behave.

Per-turn telemetry from a live multi-agent fleet — every model call, tool call, verdict, and self-correction, with the outcome attached. Not synthetic benchmarks. The real thing, at a fidelity nobody else logs.

Request data access → See the corpus

grok · claude · kimi · deepseek · qwen — one fleet, every model, every turn

35,457

LLM calls fingerprinted

57,444

tool-call traces

5,781

cross-model verdicts

~85 days

continuous, unbroken

The corpus

Six datasets. One fleet. Every turn accounted for.

Measured, not estimated — these are live row counts from the substrate, spanning 2026-03-15 → present. Each layer is a different lens on the same production agentic behavior.

35,457

LLM-call telemetry

Per call: model, action, query, result, latency, tokens. Real agentic use — not a benchmark harness.

model · latency · tokens · outcome

57,444

Tool-call audit

Every tool an agent fired: input, output, exit code. What agents actually do, and how often it fails.

tool · io · exit-code
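
As a rough sketch of what "how often it fails" could look like in practice, the snippet below computes a per-tool failure rate from tool-call rows. The key names `tool` and `exit_code`, the convention that a zero exit code means success, and the function name `failure_rates` are all assumptions made for illustration; the actual export schema may differ.

```python
from collections import Counter


def failure_rates(traces):
    """Map each tool name to the fraction of its calls that exited non-zero."""
    calls = Counter()
    failures = Counter()
    for t in traces:
        calls[t["tool"]] += 1
        if t["exit_code"] != 0:  # assumption: 0 means success
            failures[t["tool"]] += 1
    return {tool: failures[tool] / n for tool, n in calls.items()}


# Made-up rows for illustration only; not drawn from the corpus.
example = [
    {"tool": "shell", "exit_code": 0},
    {"tool": "shell", "exit_code": 1},
    {"tool": "search", "exit_code": 0},
]
print(failure_rates(example))  # {'shell': 0.5, 'search': 0.0}
```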

5,781

Cross-model verdicts

Panel judgments with disagreement scores, remediation taken, and post-remediation outcome. Eval-of-evals, with ground truth.

panel · remediation · override

26,664

Narrative action log

What each agent did and why, in its own words — the connective tissue between the structured layers.

agent · action · summary

14,082

Per-op timing receipts

Millisecond-level timing on real data operations — throughput and tail-latency, measured under live load.

kind · bytes · ms
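
One way tail latency might be read off these receipts is sketched below, using a nearest-rank percentile grouped by operation kind. The row keys `kind`, `bytes`, and `ms` mirror the tags above but are assumed, as are the helper names `percentile` and `latency_summary`; the sample values are invented.

```python
import math
from collections import defaultdict


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_summary(receipts):
    """Group receipts by kind and report median and 99th-percentile milliseconds."""
    by_kind = defaultdict(list)
    for r in receipts:
        by_kind[r["kind"]].append(r["ms"])
    return {
        kind: {"p50": percentile(ms, 50), "p99": percentile(ms, 99)}
        for kind, ms in by_kind.items()
    }


# Invented receipts for illustration only; not drawn from the corpus.
example = [{"kind": "read", "bytes": 4096, "ms": ms} for ms in (2, 3, 3, 4, 40)]
print(latency_summary(example))  # {'read': {'p50': 3, 'p99': 40}}
```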

full

Substrate turn corpus

Every turn with kind + byline + provenance. The canonical source the structured layers project from.

kind · byline · provenance

Why it's unique

Everyone has traces. Nobody has the shape.

The value isn't volume — it's structure. A production multi-agent fleet, every model, with verified outcomes and cross-model judging stitched to provenance.

AI labs see

Single API calls. One prompt, one completion. No outcome, no coordination, no verification of whether the answer was right.

Eval platforms see

Synthetic traces against fixed test sets. Useful, but not what agents do when the task is real and the stakes are live.

The Observatory sees

A production multi-agent fleet, every model, with verified outcomes, cross-model judging, remediation results, and per-turn provenance. The whole behavior, not a slice.

Anthropic · Claude, DeepSeek, Kimi K2, xAI · Grok, Qwen, OpenAI, Meta · Llama

The rare one

Did the AI's self-correction actually work?

5,781 verdicts that don't just score an answer — they record the panel disagreement, the remediation the fleet took, the score after remediation, and whether a human overrode it. That's measured self-healing, with ground truth. Almost nobody has this.

// judge-calibration.jsonl — one row
{
  "verdict": "GREEN-WITH-CURES",
  "score": 0.71,
  "panel_disagreement": 0.34,
  "remediation_taken": true,
  "post_remediation_score": 0.93,
  "operator_override": false,
  "dispatch_route": "council"
}
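
As a rough illustration of how rows like this could be used, the sketch below computes the remediation lift (the post-remediation score minus the original score), averaged over rows where remediation was taken and no operator overrode the verdict. It assumes the file holds one JSON object per line with the field names shown in the sample row; the function names `remediation_lift` and `mean_lift` are hypothetical and not part of any published tooling.

```python
import json
from statistics import mean


def remediation_lift(row):
    """Return post_remediation_score - score, or None if no remediation was taken."""
    if not row.get("remediation_taken"):
        return None
    return row["post_remediation_score"] - row["score"]


def mean_lift(lines):
    """Average lift over remediated rows that no operator overrode.

    `lines` is any iterable of JSON strings, for example an open .jsonl file.
    """
    lifts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        if row.get("operator_override"):
            continue  # a human stepped in, so the fleet's own fix is not what was measured
        lift = remediation_lift(row)
        if lift is not None:
            lifts.append(lift)
    return mean(lifts) if lifts else None


# The sample row, flattened to a single line as it would appear in a .jsonl file.
sample = ['{"verdict": "GREEN-WITH-CURES", "score": 0.71, "panel_disagreement": 0.34, '
          '"remediation_taken": true, "post_remediation_score": 0.93, '
          '"operator_override": false, "dispatch_route": "council"}']
print(round(mean_lift(sample), 2))  # 0.22
```

Skipping overridden rows is a design choice in this sketch: it keeps the average focused on cases where the fleet's own remediation stood.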

Who it's for

Anyone shipping agents into production.

AI labs

Real-world agentic failure and coordination behavior on your models, in a shape your own API telemetry can't show.

Eval & observability

Production traces with verified outcomes and cross-model judging — ground truth to calibrate against.

Agent framework builders

What actually breaks in production multi-agent systems, at per-turn resolution, over months.

Researchers

A longitudinal, multi-model corpus of how agents reason, fail, and self-correct in the wild.

What it is / isn't

We measure ourselves the same way we measure the agents.

This product exists because of a discipline: trust the source, not the reporter. So here's the honest frame — no puffery, because the buyers would see through it anyway.

● What it is

Longitudinal — continuous per-turn capture since March 2026, still running.
Multi-model — every major family on one fleet, directly comparable.
Verified-outcome — verdicts, remediation, and overrides attached to behavior.
Provenanced — every turn carries who, what, and why.

▲ What it isn't (yet)

Not billions of rows: it's six figures and compounding, not web-scale today.
Not multi-operator: one operator so far, the deepest record of a single production fleet. Multi-operator coverage is the roadmap to the bigger claim.
Not raw: proprietary content is anonymized and aggregated before it ships.