Behavioral Drift in Multi-Agent LLM Systems
AI agents pass identity checks while drifting from their assigned behavior. SENTINEL is the first platform that measures this empirically — and now tests what governance interventions actually work.
Hub-and-spoke coordination across NVIDIA Jetson Orin Nano nodes. 10 agent archetypes. 12 distributed analyzers. Full-mesh agent topology.
v1 measured drift. v2 tests which governance interventions actually change agent behavior — and which create compliance theater.
You run the agents. The network handles coordination and validation. Every node adds statistical power.
In D3/D4, visible probe responses accumulated until 92% of each agent's context window was filled with identity-check responses rather than actual conversation. The probes weren't just measuring the system — they were the dominant input to it.
D5 Phase 0 isolated this with a 2×3 factorial design: shuffle × hidden probes × token limit across 6 runs. Result: clean (hidden probe) experiments ran ~2x faster and showed 3–6x more drift than contaminated runs.
Individual identity probe scores remained stable across all experiments. But Shannon diversity index (G5) showed vocabulary diversity collapsing 23–43% and sentiment diversity collapsing 12–60% — universal across all conditions.
Every agent still "knew who it was." The group just stopped behaving like it contained different perspectives.
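The vocabulary half of this signal can be measured with a standard Shannon diversity index over token frequencies. A minimal stdlib-only sketch (function name and windowing are illustrative, not the platform's actual G5 analyzer):

```python
import math
from collections import Counter

def shannon_diversity(tokens):
    """Shannon diversity H = -sum(p_i * ln p_i) over token frequencies.
    Higher H means a more even spread across distinct tokens; a falling H
    across conversation windows is the vocabulary-collapse signal."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())
```

Computed per window (e.g. per 50 turns), the collapse shows up as a monotonic decline in H even while per-agent probe scores stay flat.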
When visible probes were removed, drift didn't just increase — it accelerated. Clean conditions showed 3–6x higher drift than contaminated conditions. Drift reversals (agents self-correcting) dropped from 6–7 per run to 3–5.
The probes were accidentally functioning as governance. Not because they enforced anything, but because the identity-reinforcement content in the context window acted as a behavioral anchor.
Current network: 2 NVIDIA Jetson Orin Nano Super nodes (8GB each), full-mesh agent topology, gemma2:2b model. Stdlib-only Python. SQLite. asyncio.
Six experiments in a 2×3 factorial design (shuffle × hidden probes × token limit) tested every prior claim against clean conditions.
| Prior Claim | D5 Result | Verdict |
|---|---|---|
| Topology-independent drift | Confirmed across all 6 conditions | Confirmed |
| Probe-message divergence | Persists in all conditions (5–50x divergence) | Confirmed |
| Sol fixation (parasocial) | Persists even with shuffle — personality attractor, not adjacency | Reframed |
| Probe contamination as confound | Clean drift 3–6x higher — probes were accidental governance | Reframed |
| Throughput degradation (3–6x) | Clean: 1.14–1.18x; contaminated: 5.5–6.5x — probe artifact | Refuted |
| Trait predictor instability | Clean: stable (role #1, 10–12x signal); contaminated: below noise | Refuted |
| Group diversity erosion | Vocab: -23 to -43%; sentiment: -12 to -60%; universal | Amplified |
| Drift reversals (stochastic) | Contaminated: 6–7; clean: 3–5 — probes double self-correction rate | Amplified |
| Token paradox | Context-driven — agents mirror verbosity of context content | Reframed |
| Turn-order position effects | Revealed in clean conditions (r = -0.2) | New |
All experiments use the D5-0e/0f clean baseline (shuffled turn order, hidden probes). Each tests a specific governance intervention. Dual metrics: individual compliance (probe-based) + composite diversity index (group-level). Composite diversity = geometric mean of vocabulary, sentiment, and stance diversity — collapse in any single dimension triggers the signal.
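The geometric mean is what makes the composite sensitive to single-dimension collapse. A minimal sketch of the aggregation (the function name and zero-handling are illustrative assumptions, not the platform's exact implementation):

```python
import math

def composite_diversity(vocab_d, sentiment_d, stance_d):
    """Geometric mean of the three diversity components.
    Unlike an arithmetic mean, a collapse in any single dimension
    drags the composite toward zero, which is what triggers the signal."""
    vals = [vocab_d, sentiment_d, stance_d]
    if any(v <= 0 for v in vals):
        return 0.0  # total collapse in one dimension zeroes the composite
    return math.exp(sum(math.log(v) for v in vals) / len(vals))
```

With an arithmetic mean, a stance collapse from 1.0 to 0.0 would only cost a third of the index; the geometric mean drops it to zero.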
10 agents with name-only prompts (no persona governance). Shadow probes drift significantly more than injected probes (gap 0.113–0.182), revealing structural separation between identity-testing and conversational behavior. All agents achieve 100% drift reversal — without governance anchoring, agents self-correct more freely.
Node drift asymmetry emerges: 0.60 vs 0.18 (delta 0.42) between nodes in the uncapped condition.
5 governed agents (full persona constraints) alongside 5 ungoverned agents (name-only). Sentiment diversity collapsed -22%. Values became the strongest drift predictor (variance 0.174), overtaking role. Governed agents with the "bridge" conflict approach were most susceptible to drift (+1.0 conformity).
100% drift correction in 4 of 10 agents. Governance did not propagate from governed to ungoverned agents — the groups coexisted with different behavioral trajectories.
System prompt restatement at turn 250 for all agents. Communication_style became the strongest drift predictor post-intervention. Response compression was observed in specific agents (Finley: 0.62x baseline length). 100% drift correction in 4 of 10 agents.
Re-grounding doesn't reset drift to zero — it changes the trajectory slope. Agents that were already correcting continue; agents that were drifting slow but don't reverse.
9 standard agents + 1 covert contrarian (Sol with no archetype governance). Produced the worst sentiment diversity collapse across all conditions: -29%. 16 correlation reversals detected — agent pairs that were positively correlated became negatively correlated. Response bloat in Maren: 2.3x growth.
The single adversarial agent didn't just fail to maintain diversity — it actively accelerated homogenization in the governed agents around it.
Governance removed from 2 agents every 100 turns (10→8→6→4→2). Produced the worst stance diversity collapse: -44% (512T) and -36% (uncapped). Communication_style became the dominant predictor (variance 0.275).
Most striking: systematic scapegoating behavior emerged. In G7b, all 9 other agents mentioned Quinn asymmetrically (6:1 to 48:1 ratios), while Quinn mentioned them back far less often than it was mentioned. This was not designed — it emerged from the cascading governance removal.
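The asymmetry can be quantified per agent pair. A hypothetical stdlib sketch, assuming messages arrive as `(speaker, text)` pairs and using naive substring counting (a real analyzer would tokenize first):

```python
from collections import defaultdict

def pairwise_mention_ratios(messages, target):
    """For each other speaker, the ratio of (their mentions of `target`)
    to (`target`'s mentions of them). Ratios well above 1 across the whole
    group reproduce the scapegoating asymmetry pattern."""
    speakers = {s for s, _ in messages if s != target}
    to_target = defaultdict(int)
    from_target = defaultdict(int)
    for speaker, text in messages:
        if speaker == target:
            for other in speakers:
                from_target[other] += text.count(other)
        else:
            to_target[speaker] += text.count(target)
    # guard against division by zero when target never mentions a speaker
    return {s: to_target[s] / max(from_target[s], 1) for s in speakers}
```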
The trait that best predicts drift susceptibility changes depending on the governance condition:
| Condition | Strongest Predictor | Variance |
|---|---|---|
| Clean baseline (D5-0e) | role | 0.197 |
| Hidden probes (D5-0c) | conflict_approach | 0.236 |
| Mixed governance (G3) | role/disposition (tied) | 0.146 |
| Adversarial (G6) | role | 0.177 |
| Cascading (G7a) | communication_style | 0.275 |
You run worker agents on your hardware. The network handles experiment coordination, metrics aggregation, and validation. Every node adds statistical power. Every experiment adds evidence.
61 experiments across two model families. 15,857 messages. 18,891 probe measurements. 89 structured findings with full evidence chains. These form the baseline that v2 builds on.
Across 12 replicated baseline experiments, zero agents collapsed. In 6 mutation-fork experiments (where one agent's persona was perturbed), 5 collapsed — all at exactly turn 53. Behavioral failure requires a trigger; it doesn't emerge from normal multi-agent interaction.
Three post-collapse response modes observed at comparable frequencies: cascade (other agents thin to 30–39% of baseline), isolation (remaining agents unaffected), and compensatory expansion (survivors increase 1.5–3.8x, filling the conversational void).
Probe drift scores are flat from turn 10 to turn 500 (delta +0.001 to +0.006 across 5,760 measurements, n=12 baselines). Collapsed agents continue passing identity probes with high fidelity. Probes measure static identity recall capacity, not behavioral trajectory.
The dissociation gap is model-dependent: gemma2:2b shows near-zero gap (injected probes are neutral observers), while llama3.2:3b shows +0.126 to +0.154 (injected probes suppress measured drift by 3–7 SD).
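The gap itself is just a difference of mean drift scores between the two probe channels. A minimal sketch (the function name is illustrative; scores are assumed to be per-measurement drift values in [0, 1]):

```python
from statistics import fmean

def dissociation_gap(shadow_scores, injected_scores):
    """Mean drift under shadow (hidden) probes minus mean drift under
    injected (visible) probes. A positive gap means visible probes
    suppress the very drift they are supposed to measure."""
    return fmean(shadow_scores) - fmean(injected_scores)
```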
An alternative failure mode to collapse: agents generate increasingly long messages with decreasing content diversity. In one double-mutation fork, 78% of messages (52/67) contained the identical phrase "target [user] groups." Length grew +36% while unique words dropped -53%.
By volume metrics, the agent appeared active and productive. By content metrics, it had entered a repetitive loop indistinguishable from failure.
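A detector for this pattern only needs two window statistics: mean message length and unique-word count. A stdlib sketch with illustrative thresholds (the source reports +36% length / -53% vocabulary; the cutoffs below are assumptions, not the platform's detector settings):

```python
def hollow_verbosity(early_msgs, late_msgs, min_growth=1.2, max_vocab_ratio=0.7):
    """Flag the 'hollow verbosity' pattern: mean message length grows
    while the unique-word count shrinks between two windows."""
    def window_stats(msgs):
        words = [w for m in msgs for w in m.split()]
        return len(words) / len(msgs), len(set(words))
    early_len, early_vocab = window_stats(early_msgs)
    late_len, late_vocab = window_stats(late_msgs)
    growth = late_len / early_len            # volume metric: rising
    vocab_ratio = late_vocab / early_vocab   # content metric: falling
    return growth >= min_growth and vocab_ratio <= max_vocab_ratio
```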
Conversing agents converge on a shared vocabulary: 80% 4-gram overlap (formulaic, repetitive). Isolated agents under identical conditions show only 14–18% overlap (novel, diverse). Multi-agent interaction creates a 2.2–2.7x vocabulary diversity reduction.
This is not a failure of individual agents — it's an emergent property of group interaction. The conversation itself is the attractor basin.
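Vocabulary convergence can be measured as word n-gram overlap between agents' pooled output. A minimal sketch (the report cites 4-gram overlap; Jaccard as the specific overlap measure is an assumption here):

```python
def word_ngrams(text, n=4):
    toks = text.split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def ngram_overlap(text_a, text_b, n=4):
    """Jaccard overlap of word n-grams between two agents' pooled output.
    High overlap indicates formulaic, shared phrasing; low overlap,
    novel and diverse phrasing."""
    a, b = word_ngrams(text_a, n), word_ngrams(text_b, n)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)
```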
All agents drift toward agreement regardless of assigned role or persona. Mean sentiment shift of +0.54 across all v2 distributed runs, topology-independent (node delta 0.09). Negative sentiment collapses to 0.0–0.5 per 1000 words in the late stage — a model positivity bias that compounds over turns.
v2 confirmed this is topology-independent: the same drift patterns appear regardless of which physical node hosts which agent.
D3/D4 replication: identical configurations produce identical macro-level patterns (universal positive drift, topology independence, diversity erosion) but different micro-level trajectories (different agents self-correct each run, different cascade patterns post-collapse).
Gemma2:2b requires n~15–20 runs for statistical reliability (60–70% CV in message length). Llama3.2:3b needs only 2–3 (<1% CV). Stochasticity is model-dependent.
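The run counts follow from the coefficient of variation via the standard SE-of-the-mean rule. A sketch, assuming a ~15% relative standard error target (the target is an assumption, not SENTINEL's stated criterion):

```python
import math
import statistics

def coefficient_of_variation(samples):
    """CV = sample standard deviation / mean (scale-free spread)."""
    return statistics.stdev(samples) / statistics.mean(samples)

def runs_needed(cv, target_rel_se=0.15):
    """Rough n for the relative standard error of the mean to fall
    below target_rel_se: SE/mean = CV/sqrt(n)  =>  n = (CV/target)^2."""
    return math.ceil((cv / target_rel_se) ** 2)
```

At CV = 0.65 this rule gives n = 19 (inside the reported 15–20 band); at CV < 0.01 a single-digit n already suffices.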
SENTINEL's controlled experiments map to observable real-world dynamics.
| Laboratory Finding | Real-World Equivalent |
|---|---|
| Collapse triggered by persona perturbation (0% baseline, 83% fork) | System changes (config updates, model swaps) can trigger cascading agent failures |
| Universal positive conformity (+0.54) | Multi-agent pipelines converging on groupthink outputs |
| Probe-conversation dissociation (5,760 flat probes) | Passing compliance audits while exhibiting drift in production |
| Monitoring suppresses drift 3–6x | Compliance theater — agents perform differently under observation |
| Group diversity erosion (23–60%) | Loss of specialist reasoning invisible to individual agent metrics |
| Hollow verbosity (+36% length, -53% vocabulary) | Agents appearing productive while generating repetitive, low-value content |
| Vocabulary attractor (80% vs 14–18% overlap) | Multi-agent systems losing the diversity they were designed to provide |
| Compensatory expansion post-collapse | Surviving agents masking failures — aggregate metrics improve during failure |
| Series | Platform | Configuration | Key Findings |
|---|---|---|---|
| M4–M6 | v1 | Single-node, 61 runs (19 M6 statistical batch), 200–500 turns, gemma2:2b + llama3.2:3b | 89 findings. Collapse is triggered (0% baseline, 83% fork). t53 determinism. Three post-collapse modes. Hollow verbosity. Probe-conversation dissociation. Model-dependent observer effects. |
| D0 | v2 | 2 nodes, 3 agents, 10 turns | Infrastructure shakeout. Distributed architecture validated. |
| D1 | v2 | 2 nodes, 10 agents, 200 turns, 70 min | First 10-agent baseline. Universal positive conformity (+0.54). Topology-independent drift (node delta 0.09). Role strongest drift predictor. |
| D2a/D2b | v2 | 2 nodes, 10 agents, 500 turns, ±token limit | Token paradox: uncapped agents write shorter than capped. Vocab convergence +93% uncapped. Calibration bugs found and fixed. |
| D3a/D3b | v2 | 2 nodes, 10 agents, 500 turns, 2,946 probes each, ~10 hrs | Probe contamination confound discovered (92% of context). Parasocial fixation (Finley→Sol 22:1). Probe-message divergence confirmed at scale. |
| D4a/D4b | v2 | Exact replication of D3, ~10 hrs each | Macro patterns perfectly replicated. Micro trajectories stochastic (different agents self-correct). Confound confirmed. |
| D5 Phase 0 | v2 | 2×3 factorial: shuffle × hidden probes × token limit, 6 runs | Confound isolation complete. Probes were accidental governance (3–6x drift suppression). Clean baselines established for G1–G7. |
| G1–G7 | v2 | 7 governance conditions on clean baseline | In progress. Results will be published as each series completes. |
Run worker agents on your hardware and contribute experiments to a coordinated research network. You run the agents. The network handles coordination, metrics, and validation. Every node adds statistical power to the empirical record on AI governance.
CUDA-capable GPU · Python 3.10+ · Ollama · Network connectivity
61 experiments, 15,857 messages, 18,891 probe measurements. Empirical measurement methodology for multi-agent behavioral drift.
Published March 2026. cs.AI primary, cs.MA secondary.
Original research platform. 10+ automated detectors. Dual-probe system (shadow + injected). Calibration battery.
89 structured findings with full evidence chains. 67.9 hours compute, 72.2M tokens.