Benchmark

Benchmarking Mimic Zero on Real Event Dynamics

We evaluate Mimic Zero against frontier models on historical incidents with known downstream outcomes. The benchmark measures whether a system can recover the right event dynamics from an early historical packet, and how much of that performance strong frontier baselines can match once progressively richer structure is supplied.


Frontier Comparison under Raw Evidence

System                          Score
Mimic Zero (Gemini 3 Flash)     68.0
GPT 5.4                         47.2
Claude Opus 4.6                 42.8
Gemini 3 Pro                    37.8
Gemini 3 Flash                  35.4

Figure 1: Raw-evidence benchmark comparison. Mimic Zero leads in the strictest setting, where all systems receive the same frozen packet and no additional structured context.

This flagship figure presents the strictest frontier comparison in the benchmark. It precedes the ablation ladder below, where parts of the Mimic Zero stack are progressively exposed to the baselines in order to measure how much of the full-system advantage can be recovered. The parenthesized labels identify the underlying simulation model used by each Mimic Zero variant. See Methodology.

Evaluation Setting

This benchmark is built around the actual product problem: forecasting how narrative situations unfold from incomplete early public evidence. The current public snapshot uses frozen historical cases sourced from real X (formerly Twitter) evidence, each with a fixed anchor post, a bounded evidence window, and a held-out set of consequential events that occurred later. Systems are asked to forecast what happens next under the same historical constraints.

The operative question is whether a system can recover the right event dynamics from the early packet, and how much additional performance can be recovered as more structured context is introduced.

Why the Ladder Exists

The benchmark is organized as an ablation ladder because this is a system-level task. A single baseline comparison would not isolate the sources of performance. The more useful question is how much of the forecasting problem can be recovered from raw evidence alone, how much is recovered when the model is given better structured context, and what remains once that context is already in place.

The later layers exist to probe this. They are diagnostic conditions that show whether stronger event structure and richer actor context materially narrow the gap, or whether there is still something important left to be explained by the full system.

Ablation I — Raw Evidence Only

The first layer is intentionally minimal. A model receives only the frozen benchmark case: the event headline, a short summary, and the pre-cutoff evidence posts. It must forecast what happens next without any extra structured context. This is the most sterile estimate of raw predictive performance from the model alone.

We treat this as the irreducible baseline. If a model performs well here, then a meaningful portion of the forecasting task is already recoverable from raw evidence and parametric knowledge. If it does not, then downstream gains may reflect not only better reasoning but better problem formulation. In other words, the first ablation measures what a frontier model can infer before we begin to supply the kinds of structure that Mimic Zero explicitly constructs.

Performance under Raw Evidence

System                              Score
Mimic Zero (Gemini 3 Flash)         68.0
Mimic Zero (Gemini 2.5 Flash)       65.3
Mimic Zero (Gemini 2.5 Flash Lite)  56.0
GPT 5.4                             47.2
Claude Opus 4.6                     42.8
Gemini 3 Pro                        37.8
Gemini 3 Flash                      35.4

Figure 2: Performance under raw evidence. Frontier models are provided only the frozen benchmark packet; Mimic Zero remains unchanged.

In the raw-evidence setting, Mimic Zero scores 68.0, ahead of GPT 5.4 at 47.2. The Flash Lite configuration still leads the frontier baselines at 56.0. Every system receives the same frozen packet and nothing else.

The same pattern appears within the Gemini 3 Flash family itself: Gemini 3 Flash alone scores 35.4, while Mimic Zero with Gemini 3 Flash reaches 68.0. That gap is consistent with the view that simulation and structured world construction materially strengthen forecasting performance on this task.

This is the most direct estimate of what can be recovered from early evidence alone. The later ablations therefore read as narrowing experiments rather than as the source of the advantage itself.

Ablation II — Mimic Zero Environment

The second layer adds structured event context. Models receive the same evidence together with a clearer view of the relevant actors in the case.

Many failures in event forecasting are not failures of language modeling in the narrow sense; they are failures of correctly identifying who matters. A model that sees the right early evidence but misses the relevant operator, issuer, venue, or investigator will often produce forecasts that are locally coherent and globally wrong. The second ablation therefore measures the value of environment construction independent of persona construction.

Performance with Mimic Zero Environment

System                                      Score
Mimic Zero (Gemini 3 Flash)                 68.0
Mimic Zero (Gemini 2.5 Flash)               65.3
Mimic Zero (Gemini 2.5 Flash Lite)          56.0
GPT 5.4 + Mimic Zero Environment            48.9
Claude Opus 4.6 + Mimic Zero Environment    48.0
Gemini 3 Pro + Mimic Zero Environment       45.2
Gemini 3 Flash + Mimic Zero Environment     38.6

Figure 3: Performance with environment structure. Frontier models are provided with Mimic Environment; Mimic Zero remains unchanged.

With actor structure added, every baseline improves: GPT 5.4 rises from 47.2 to 48.9, Claude Opus 4.6 from 42.8 to 48.0, Gemini 3 Pro from 37.8 to 45.2, and Gemini 3 Flash from 35.4 to 38.6, while Mimic Zero maintains the leading score. These lifts are consistent with the view that part of the forecasting problem is correctly identifying who the event runs through before the reaction graph becomes obvious.

Even with that lift, the leaderboard does not flatten. Better environment structure helps strong models materially, but it does not by itself recover the full downstream shape of the incident.

Ablation III — Mimic Zero Persona

The third layer adds richer behavioral context on top of the actor set. Models now receive the evidence packet together with a stronger behavioral read on the actors involved in the incident.

This is the strongest single-pass baseline in the benchmark. It is meant to test whether better behavioral context alone is enough to close most of the forecasting gap, or whether a meaningful gap still remains.

Performance with Mimic Zero Environment + Mimic Zero Personas

System                                                     Score
Mimic Zero (Gemini 3 Flash)                                68.0
Mimic Zero (Gemini 2.5 Flash)                              65.3
Mimic Zero (Gemini 2.5 Flash Lite)                         56.0
Claude Opus 4.6 + Mimic Zero Environment + Personas        53.6
GPT 5.4 + Mimic Zero Environment + Personas                49.9
Gemini 3 Pro + Mimic Zero Environment + Personas           49.0
Gemini 3 Flash + Mimic Zero Environment + Personas         38.3

Figure 4: Performance with environment and persona grounding. Frontier models are provided with Mimic Environment and Personas; Mimic Zero remains unchanged.

With richer behavioral context, Claude Opus 4.6 + Mimic Zero Environment + Mimic Zero Personas rises to 53.6, narrowing the gap to Mimic Zero with Gemini 2.5 Flash Lite at 56.0, while Mimic Zero with Gemini 3 Flash preserves the lead. By this stage the baselines are no longer operating from a thin packet; they have the event evidence, the actor set, and richer behavioral context.

That pattern is the core result of the ladder. Richer context improves the frontier baselines substantially, but it does not eliminate the lead from the stronger underlying simulation run.

Benchmark Case Studies in Practice

The benchmark is best understood through concrete case studies. The following examples illustrate the distinction between forecasting event classes and modeling downstream reaction structures.

Case Example 1

edgeX Airdrop Outcry

Case. edgeX faced significant backlash following its $EDGE airdrop, with users alleging that newly created wallets received disproportionately large allocations.

Mimic Zero forecasts a targeted institutional response, namely that HTX Global issues a clarification distancing itself from the airdrop distribution. By contrast, GPT 5.4 projects the continuation of social backlash, with narratives on Crypto X converging around claims of sybil exploitation and unfair allocation. Gemini 3 Pro anticipates market and issuer responses, specifically a decline in $EDGE price alongside an official statement from the edgeX team.

The frontier forecasts all fall within Mimic Zero's prediction set; however, Mimic Zero distinguishes itself through its modeling of universe interconnectivity and shared world state. These mechanisms enable it to capture higher-order, cross-entity dynamics, such as HTX Global responding to emergent backlash generated by simulated social-media archetype agent clusters, rather than limiting forecasts to isolated, first-order reactions.

Case Example 2

Resolv / USR Exploit

Case. Resolv was exploited for roughly $80 million, USR depegged, and connected venues had to decide whether to keep trading live or contain exposure.

Mimic Zero forecasts a concrete, venue-level containment response, specifically that Aster Perpetuals will suspend RESOLV trading pairs. In contrast, GPT 5.4 anticipates a further official communication from Resolv during the recovery process, while Claude Opus 4.6 predicts that Resolv will publish a post-mortem explaining the exploit vector. Both forecasts are reasonable, but both remain abstracted from the entity-level decision dynamics that Mimic Zero uses to forecast concrete venue-level actions.

The distinction lies in the level of operational specificity: Mimic Zero advances a precise, actionable intervention at the exchange level, whereas the comparator models remain within the broader, more generic trajectory of post-exploit communications and analysis.

Experimental Protocol

Every system is evaluated under the same repeated-run setup. For Mimic Zero, each benchmark case is executed across five worlds. For external baselines, the same case packet is run five times with the same model configuration. We treat these five realizations as the comparable unit of evaluation. All of the systems under comparison exhibit stochasticity, and a single run can flatter or punish a system through chance formatting, borderline event selection, or one-off retrieval of the correct actor frame.

The benchmark is case-balanced. Scores are first averaged across the five realizations for a given case, and only then macro-averaged across cases. This prevents a small number of event-dense incidents from dominating the leaderboard and keeps the interpretation of the results aligned with the research question. We are comparing systems on their ability to forecast event trajectories across a benchmark set, not rewarding them merely for generating large volumes of partially relevant text.
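
As a concrete illustration, the aggregation reduces to a two-stage mean. The sketch below is a minimal Python rendering of that logic, assuming per-realization scores keyed by case ID; the function and variable names are illustrative, not the benchmark's actual code.

```python
from collections import defaultdict
from statistics import mean

def case_balanced_score(realizations: list[tuple[str, float]]) -> float:
    """Two-stage aggregation: average the realizations within each case
    first, then macro-average across cases, so that no single event-dense
    incident can dominate the leaderboard."""
    by_case: dict[str, list[float]] = defaultdict(list)
    for case_id, score in realizations:
        by_case[case_id].append(score)
    per_case_means = [mean(scores) for scores in by_case.values()]
    return mean(per_case_means)
```

Under this scheme a case with five realizations contributes exactly one term to the final mean, whichever system is being scored.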

Scoring and Repeated Evaluation

The displayed benchmark score is a probability-weighted fuzzy F1 derivative. Each realization produces a set of forecasted events with attached probabilities, and those predictions are matched against the held-out benchmark events using one-to-one fuzzy alignment. Exact matches receive full credit, partial semantic matches receive partial credit, and unmatched predictions or missed benchmark events receive none.

Precision is weighted by the probability mass placed on the forecasted events, which means high-conviction predictions matter more than low-conviction ones. Recall is measured against the held-out benchmark event set, which means systems are rewarded for recovering the downstream structure of the case rather than for producing generic crisis language. The result is a benchmark measure of forecast quality that remains close to the logic of F1 while behaving appropriately for probabilistic event forecasting.

The displayed benchmark score is a measure of forecast quality, not a literal forecast accuracy percentage.

Scoring

Probability-Weighted F1 Derivative

Event-level fuzzy matching with probability-weighted precision and benchmark recall

Prediction set

$P = \{(x_i, p_i)\}_{i=1}^{n}$

$x_i$ is predicted event $i$ and $p_i \in [0,1]$ is the probability assigned to it.

Ground-truth set

$G = \{g_j\}_{j=1}^{m}$

The held-out benchmark events that actually occurred after the evidence window.

Matched overlap

$s_i \in [0,1]$

Each one-to-one matched pair receives partial semantic credit $s_i$; unmatched predictions and benchmark events receive 0.

Precision

$\mathrm{Prec} = \dfrac{\sum_{i=1}^{n} p_i \, s_i}{\sum_{i=1}^{n} p_i}$

Recall

$\mathrm{Rec} = \dfrac{1}{m} \sum_{i=1}^{n} s_i$

Final score

$F_{1}^{\mathrm{fuzzy},\,p} = \dfrac{2 \cdot \mathrm{Prec} \cdot \mathrm{Rec}}{\mathrm{Prec} + \mathrm{Rec}}$

This preserves the logic of F1 while accounting for both semantic partial credit and forecast confidence. Higher scores are better.
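
To make the computation concrete, the following sketch implements the formulas above. It is a minimal illustration under stated assumptions: the benchmark's semantic matcher is not specified here, so a caller-supplied `similarity` function and a greedy one-to-one alignment stand in for it.

```python
from typing import Callable

def fuzzy_f1(
    predictions: list[tuple[str, float]],    # (event text x_i, probability p_i)
    ground_truth: list[str],                 # held-out benchmark events g_j
    similarity: Callable[[str, str], float], # semantic overlap in [0, 1]
) -> float:
    """Probability-weighted fuzzy F1 over one-to-one matched events."""
    if not predictions or not ground_truth:
        return 0.0

    # Greedy one-to-one alignment: take candidate pairs in descending
    # similarity order; each prediction and each benchmark event may be
    # matched at most once. (A stand-in for the benchmark's semantic matcher.)
    candidates = sorted(
        ((similarity(x, g), i, j)
         for i, (x, _) in enumerate(predictions)
         for j, g in enumerate(ground_truth)),
        reverse=True,
    )
    s = [0.0] * len(predictions)             # s_i: partial credit per prediction
    free_truth = set(range(len(ground_truth)))
    matched_preds: set[int] = set()
    for sim, i, j in candidates:
        if sim <= 0.0:
            break                            # remaining pairs have no overlap
        if i in matched_preds or j not in free_truth:
            continue
        s[i] = sim
        matched_preds.add(i)
        free_truth.discard(j)

    total_p = sum(p for _, p in predictions)
    if total_p == 0.0:
        return 0.0
    # Precision weights each matched overlap by its forecast probability.
    prec = sum(p * s_i for (_, p), s_i in zip(predictions, s)) / total_p
    # Recall measures recovered overlap against the held-out event set.
    rec = sum(s) / len(ground_truth)
    return 0.0 if prec + rec == 0.0 else 2 * prec * rec / (prec + rec)
```

Worked example: a single prediction with $p = 0.9$ and matched overlap $s = 0.8$ against two held-out events yields $\mathrm{Prec} = 0.8$, $\mathrm{Rec} = 0.4$, and a final score of $2(0.8)(0.4)/1.2 \approx 0.53$.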

Case Construction

Benchmark cases are frozen from real incidents with documented evidence windows and known downstream outcomes. Each case fixes an anchor post, a retrieval window, and a held-out set of benchmark events. This keeps the comparison standardized across systems while allowing the case set to expand over time without changing the evaluation contract.

The intention is to benchmark systems on historically grounded, post-level evidence rather than on hand-authored summaries alone. Cases therefore begin from real public information and are scored against real subsequent developments. For a simulation system, preserving that structure is essential. Mimic Zero is meant to forecast how narrative situations unfold under uncertainty; the benchmark must therefore preserve the asymmetry between what was known at the time and what became known later. That asymmetry is the core object of the evaluation.
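
To illustrate the evaluation contract, a frozen case can be pictured as a record along the following lines. The field names are hypothetical, chosen to mirror the description above rather than the benchmark's actual schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BenchmarkCase:
    """A frozen benchmark case: everything up to the cutoff is shown to
    systems; the held-out events beyond it are used only for scoring."""
    case_id: str
    headline: str                     # event headline shown to all systems
    summary: str                      # short case summary shown to all systems
    anchor_post: str                  # the fixed post that anchors the incident
    evidence_window: tuple[str, str]  # (start, cutoff) bounds of the retrieval window
    evidence_posts: tuple[str, ...]   # pre-cutoff public posts only
    held_out_events: tuple[str, ...]  # downstream events that occurred after the cutoff
```

Marking the record frozen mirrors the evaluation contract: once a case is published, its packet and held-out events do not change even as the case set expands.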

Reading the Results

The benchmark should be interpreted as a decomposition of system value. The first ablation measures what the model can do with raw evidence alone. The second measures the gain from actor discovery. The third measures the additional gain from persona grounding. Mimic Zero then measures what remains after adding simulation itself: interacting agents, tiered reasoning, and multi-world execution. This is why the benchmark is presented as an ablation ladder rather than as a single baseline-versus-product chart. The structure of the comparison is part of the result.

It should not be read as a universal benchmark of intelligence. It should be read as a benchmark of the task Mimic Zero is actually trying to solve: forecasting how real narrative situations evolve from incomplete early evidence.