Production agentic telemetry · live since 2026-03-15

A per-turn record of how AI agents actually behave in production.

Per-turn telemetry from a live multi-agent fleet — every model call, tool call, verdict, and self-correction, with the outcome attached. Not synthetic benchmarks. The real thing, at a fidelity nobody else logs.

grok · claude · kimi · deepseek · qwen — one fleet, every model, every turn

35,457 · LLM calls fingerprinted
57,444 · tool-call traces
5,781 · cross-model verdicts
~85 days · continuous, unbroken

The corpus

Six datasets. One fleet. Every turn accounted for.

Measured, not estimated — these are live row counts from the substrate, spanning 2026-03-15 → present. Each layer is a different lens on the same production agentic behavior.

35,457
LLM-call telemetry
Per call: model, action, query, result, latency, tokens. Real agentic use — not a benchmark harness.
model · latency · tokens · outcome
57,444
Tool-call audit
Every tool an agent fired: input, output, exit code. What agents actually do, and how often it fails.
tool · io · exit-code
5,781
Cross-model verdicts
Panel judgments with disagreement scores, remediation taken, and post-remediation outcome. Eval-of-evals, with ground truth.
panel · remediation · override
26,664
Narrative action log
What each agent did and why, in its own words — the connective tissue between the structured layers.
agent · action · summary
14,082
Per-op timing receipts
Millisecond-level timing on real data operations — throughput and tail-latency, measured under live load.
kind · bytes · ms
full
Substrate turn corpus
Every turn with kind + byline + provenance. The canonical source the structured layers project from.
kind · byline · provenance
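
As a rough sketch of the questions the tool-call audit and timing receipts could answer, the snippet below computes a per-tool failure rate and a nearest-rank tail latency. The key names (`tool`, `exit_code`, `kind`, `bytes`, `ms`) are assumptions inferred from the field labels on the cards above, the sample rows are invented for illustration, and treating any non-zero exit code as a failure is also an assumption; the shipped schema may differ.

```python
import math
from collections import defaultdict

# Invented sample rows. Key names are assumed from the card labels,
# not taken from a published schema.
TOOL_ROWS = [
    {"tool": "shell", "exit_code": 0},
    {"tool": "shell", "exit_code": 1},
    {"tool": "shell", "exit_code": 0},
    {"tool": "http_get", "exit_code": 0},
]
TIMING_ROWS = [
    {"kind": "read", "bytes": 4096, "ms": 2.0},
    {"kind": "read", "bytes": 4096, "ms": 3.0},
    {"kind": "read", "bytes": 8192, "ms": 5.0},
    {"kind": "read", "bytes": 8192, "ms": 40.0},
    {"kind": "write", "bytes": 1024, "ms": 7.0},
]


def failure_rate_by_tool(rows):
    """Fraction of calls per tool with a non-zero exit code (assumed to mean failure)."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for row in rows:
        totals[row["tool"]] += 1
        if row["exit_code"] != 0:
            failures[row["tool"]] += 1
    return {tool: failures[tool] / totals[tool] for tool in totals}


def percentile_ms(rows, kind, q):
    """Nearest-rank q-th percentile of `ms` for one operation kind."""
    values = sorted(row["ms"] for row in rows if row["kind"] == kind)
    if not values:
        raise ValueError(f"no rows of kind {kind!r}")
    rank = max(1, math.ceil(q / 100 * len(values)))
    return values[rank - 1]


rates = failure_rate_by_tool(TOOL_ROWS)
p95_read = percentile_ms(TIMING_ROWS, "read", 95)
```

Nearest-rank is used here because it always returns an observed value, which suits small slices; an interpolating percentile would be an equally reasonable choice on larger samples.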

Why it's unique

Everyone has traces. Nobody has the shape.

The value isn't volume — it's structure. A production multi-agent fleet, every model, with verified outcomes and cross-model judging stitched to provenance.

AI labs see

Single API calls. One prompt, one completion. No outcome, no coordination, no verification of whether the answer was right.

Eval platforms see

Synthetic traces against fixed test sets. Useful, but not what agents do when the task is real and the stakes are live.

The Observatory sees

Production multi-agent fleets — every model — with verified outcomes, cross-model judging, remediation results, and per-turn provenance. The whole behavior, not a slice.

Anthropic · Claude, DeepSeek, Kimi K2, xAI · Grok, Qwen, OpenAI, Meta · Llama

The rare one

Did the AI's self-correction actually work?

5,781 verdicts that don't just score an answer — they record the panel disagreement, the remediation the fleet took, the score after remediation, and whether a human overrode it. That's measured self-healing, with ground truth. Almost nobody has this.

// judge-calibration.jsonl: one row (pretty-printed here)
{
  "verdict": "GREEN-WITH-CURES",
  "score": 0.71,
  "panel_disagreement": 0.34,
  "remediation_taken": true,
  "post_remediation_score": 0.93,
  "operator_override": false,
  "dispatch_route": "council"
}
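
A minimal sketch of reading a row like the sample above: the snippet parses it and computes how much the score moved after remediation. The field names come from the sample row; the helper name `remediation_lift` and the idea of summarizing a verdict as a single score difference are illustrative assumptions, not part of any shipped tooling.

```python
import json
from typing import Optional

# The sample row from the judge-calibration example, as a single JSONL line.
ROW = (
    '{"verdict": "GREEN-WITH-CURES", "score": 0.71, '
    '"panel_disagreement": 0.34, "remediation_taken": true, '
    '"post_remediation_score": 0.93, "operator_override": false, '
    '"dispatch_route": "council"}'
)


def remediation_lift(row: dict) -> Optional[float]:
    """Post-remediation score minus initial score, or None if nothing was remediated."""
    if not row.get("remediation_taken"):
        return None
    return round(row["post_remediation_score"] - row["score"], 4)


row = json.loads(ROW)
lift = remediation_lift(row)
print(f"{row['verdict']}: lift={lift}, override={row['operator_override']}")
```

Aggregated over many rows, the same difference could be grouped by `dispatch_route` or bucketed by `panel_disagreement` to see where remediation helps most.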

Who it's for

Anyone shipping agents into production.

AI labs
Real-world agentic failure + coordination behavior on your models, in a shape your own API telemetry can't show.
Eval & observability
Production traces with verified outcomes and cross-model judging — ground truth to calibrate against.
Agent framework builders
What actually breaks in production multi-agent systems, at per-turn resolution, over months.
Researchers
A longitudinal, multi-model corpus of how agents reason, fail, and self-correct in the wild.

What it is / isn't

We measure ourselves the same way we measure the agents.

This product exists because of a discipline: trust the source, not the reporter. So here's the honest frame — no puffery, because the buyers would see through it anyway.

● What it is

  • Longitudinal — continuous per-turn capture since March 2026, still running.
  • Multi-model — every major family on one fleet, directly comparable.
  • Verified-outcome — verdicts, remediation, and overrides attached to behavior.
  • Provenanced — every turn carries who, what, and why.

▲ What it isn't (yet)

  • Not billions of rows — it's six figures and compounding, not web-scale today.
  • Not multi-operator: this is one operator's production fleet, recorded in depth. Multi-operator coverage is on the roadmap and is what the bigger claim depends on.
  • Not raw — proprietary content is anonymized + aggregated before it ships.

Early access

See what production agents really do.

Request a sample slice, a data sheet, or a research partnership. We'll send the schema, a redacted sample, and what it can answer that your own telemetry can't.