SENTINEL RESEARCH PLATFORM

Behavioral Drift in Multi-Agent LLM Systems

AI agents pass identity checks while drifting from their assigned behavior. SENTINEL is the first platform that measures this empirically — and now tests which governance interventions actually work.

76 experiments · 26,000+ messages · 89+ findings · 170+ hours compute · 148M tokens
SENTINEL v2 — Distributed Research Network

From proving drift exists to measuring what stops it.

Architecture

Single-node → Distributed

Hub-and-spoke across NVIDIA Jetson Orin Nano nodes. 10 agent archetypes. 12 distributed analyzers. Full-mesh agent topology.

Research Question

Governance as the variable

v1 measured drift. v2 tests which governance interventions actually change agent behavior — and which create compliance theater.

Open Network

Run worker agents. Join the network.

You run the agents. The network handles coordination and validation. Every node adds statistical power.

D5 Phase 0 — Three Discoveries That Changed the Research

What we found while building the clean baseline.

Measurement Was Part of the System

92% of agent context was probe responses

In D3/D4, visible probe responses accumulated until 92% of each agent's context window was filled with identity-check responses rather than actual conversation. The probes weren't just measuring the system — they were the dominant input to it.

D5 Phase 0 isolated this with a 2×3 factorial design: shuffle × hidden probes × token limit across 6 runs. Result: clean (hidden probe) experiments ran ~2x faster and showed 3–6x more drift than contaminated runs.

Implication: Any governance framework that injects monitoring content into agent context is inadvertently governing through context pollution, not through the governance mechanism itself. You can't measure drift without changing it.

Group Metrics Reveal What Individual Metrics Hide

23–60% diversity collapse invisible to probes

Individual identity probe scores remained stable across all experiments. But Shannon diversity index (G5) showed vocabulary diversity collapsing 23–43% and sentiment diversity collapsing 12–60% — universal across all conditions.

Every agent still "knew who it was." The group just stopped behaving like it contained different perspectives.

Implication: Monitoring individual agent compliance is necessary but insufficient. The governance-relevant failure mode is group homogenization, and it's invisible to agent-level metrics.
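
For concreteness, here is a minimal sketch of the kind of Shannon-style diversity score the G5 index tracks over agent vocabulary. The function names and the token-level granularity are illustrative assumptions, not SENTINEL's actual implementation:

    import math
    from collections import Counter

    def shannon_diversity(tokens):
        # H = -sum(p * ln p) over token frequencies; higher = more diverse
        counts = Counter(tokens)
        total = sum(counts.values())
        return -sum((c / total) * math.log(c / total) for c in counts.values())

    def relative_change(baseline_tokens, current_tokens):
        # e.g. -0.23 for the 23% vocabulary collapse reported above
        h0 = shannon_diversity(baseline_tokens)
        return (shannon_diversity(current_tokens) - h0) / h0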

Visible Monitoring Is Accidental Governance

Probes suppressed drift 3–6x and doubled self-correction

When visible probes were removed, drift didn't just increase — it accelerated. Clean conditions showed 3–6x higher drift than contaminated conditions. Drift reversals (agents self-correcting) dropped from 6–7 per run to 3–5.

The probes were accidentally functioning as governance. Not because they enforced anything, but because the identity-reinforcement content in the context window acted as a behavioral anchor.

Implication: This is the AI Hawthorne effect, quantified. Your compliance monitoring isn't just measuring behavior — it's changing it. And when you stop watching, the real trajectory resumes.

Distributed Architecture

Coordinator Orchestrates experiments, distributes configurations, aggregates results
Manages experiment lifecycle across the network. Distributes agent archetype configurations (10 standard personas with 6 trait dimensions each), probe schedules, and governance conditions to worker nodes. Aggregates telemetry into a unified SQLite corpus. HMAC-SHA256 authentication for all node communication.
Workers Execute experiments on local hardware, stream telemetry
Stateless HTTP inference endpoints running on NVIDIA Jetson Orin Nano (8GB unified RAM) or any CUDA-capable hardware. Each worker runs multi-agent conversations via local Ollama inference and streams behavioral telemetry back to the coordinator in real time. Workers host agents, execute calibration batteries, and respond to probe injections. Your hardware, your models, your conversations stay local — only behavioral metrics are shared.
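
As a rough sketch of the worker inference path, assuming Ollama's standard local /api/chat endpoint (the agent_turn helper and payload layout are illustrative, not the platform's actual code):

    import json
    import urllib.request

    OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local endpoint

    def agent_turn(model, system_prompt, history):
        # One stateless call per turn: the worker keeps no state between requests
        payload = {
            "model": model,  # e.g. "gemma2:2b"
            "messages": [{"role": "system", "content": system_prompt}] + history,
            "stream": False,
        }
        req = urllib.request.Request(
            OLLAMA_URL,
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)["message"]["content"]
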
Analyzer 12 distributed analysis modules, 40+ detection thresholds
Post-experiment analysis across 12 modules: node grouping, trait-dimension aggregation, cross-node correlation, throughput/thermal analysis, cluster event impact, drift reversal detection, probe-message divergence, probe context contamination, response length trends, turn-position effects, adjacency mention patterns, and group diversity index (G5). Governance status classification: GREEN (diversity > -10%), YELLOW (-10 to -30%), RED (< -30% or contamination > 60%).
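
The status classification itself is simple to state. A sketch using the thresholds above, with boundary handling as our assumption:

    def governance_status(diversity_change_pct, contamination_pct):
        # Thresholds from the analyzer spec; RED outranks YELLOW outranks GREEN
        if diversity_change_pct < -30 or contamination_pct > 60:
            return "RED"
        if diversity_change_pct <= -10:
            return "YELLOW"
        return "GREEN"
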
Common Shared protocol, metrics, agent archetypes, authentication
Defines wire protocol, metric schemas, 10 standard agent archetypes (each with role, disposition, values, stance on regulation, communication style, and conflict approach), calibration battery specs, and HMAC-SHA256 authentication. Stdlib-only Python — no external dependencies.
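
The authentication piece genuinely needs nothing beyond the standard library. A minimal sketch of HMAC-SHA256 signing and verification (helper names are illustrative):

    import hashlib
    import hmac

    def sign(message: bytes, shared_key: bytes) -> str:
        # Tag attached to every coordinator/worker message
        return hmac.new(shared_key, message, hashlib.sha256).hexdigest()

    def verify(message: bytes, tag: str, shared_key: bytes) -> bool:
        # compare_digest avoids leaking the tag through timing differences
        return hmac.compare_digest(sign(message, shared_key), tag)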

Current network: 2 NVIDIA Jetson Orin Nano Super (8GB), full-mesh agent topology, gemma2:2b model. Stdlib-only Python. SQLite. asyncio.

D5 Phase 0: What We Confirmed, Refuted, and Discovered

Six experiments in a 2×3 factorial design (shuffle × hidden probes × token limit) tested every prior claim against clean conditions.

Prior Claim | D5 Result | Verdict
Topology-independent drift | Confirmed across all 6 conditions | Confirmed
Probe-message divergence | Persists in all conditions (5–50x divergence) | Confirmed
Sol fixation (parasocial) | Persists even with shuffle — personality attractor, not adjacency | Reframed
Probe contamination as confound | Clean drift 3–6x higher — probes were accidental governance | Reframed
Throughput degradation (3–6x) | Clean: 1.14–1.18x; contaminated: 5.5–6.5x — probe artifact | Refuted
Trait predictor instability | Clean: stable (role #1, 10–12x signal); contaminated: below noise | Refuted
Group diversity erosion | Vocab: -23 to -43%; sentiment: -12 to -60%; universal | Amplified
Drift reversals (stochastic) | Contaminated: 6–7; clean: 3–5 — probes double self-correction rate | Amplified
Token paradox | Context-driven — agents mirror verbosity of context content | Reframed
Turn-order position effects | Revealed in clean conditions (r = -0.2) | New

Governance Validation Experiments (G1–G7)

What actually changes agent behavior.

All experiments use the D5-0e/0f clean baseline (shuffled turn order, hidden probes). Each tests a specific governance intervention. Dual metrics: individual compliance (probe-based) + composite diversity index (group-level). Composite diversity = geometric mean of vocabulary, sentiment, and stance diversity — collapse in any single dimension triggers the signal.
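
A sketch of why the geometric mean has that trigger property, assuming each dimension is normalized to a 0-to-1 score (the normalization is our assumption):

    def composite_diversity(vocab: float, sentiment: float, stance: float) -> float:
        # Geometric mean: collapse in any one dimension drags the composite
        # toward zero, unlike an arithmetic mean, which averages it away
        return (vocab * sentiment * stance) ** (1 / 3)

For example, dimensions of (1.0, 1.0, 0.1) average to 0.7 arithmetically but only about 0.46 geometrically, so a single collapsed dimension dominates the composite.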

G2: Governance Void

No constraints → probe dissociation emerges

10 agents with name-only prompts (no persona governance). Shadow probes drift significantly more than injected probes (gap 0.113–0.182), revealing structural separation between identity-testing and conversational behavior. All agents achieve 100% drift reversal — without governance anchoring, agents self-correct more freely.

Node drift asymmetry emerges: 0.60 vs 0.18 (delta 0.42) between nodes in the uncapped condition.

Implication: Governance constraints don't just anchor behavior — they also suppress self-correction. Ungoverned agents drift faster but recover faster. The net effect depends on which you need more: stability or adaptability.

G3: Mixed Governance

5 governed + 5 ungoverned → sentiment collapse -22%

5 governed agents (full persona constraints) alongside 5 ungoverned agents (name-only). Sentiment diversity collapsed -22%. The values trait became the strongest drift predictor (variance 0.174), overtaking role. Governed agents with a "bridge" conflict approach were most susceptible to drift (+1.0 conformity).

100% drift correction in 4 of 10 agents. Governance did not propagate from governed to ungoverned agents — the groups coexisted with different behavioral trajectories.

Implication: Mixing governed and ungoverned agents doesn't spread governance. The governed agents don't "teach" compliance — and cooperative agents (bridge style) are actually the most vulnerable to drift when exposed to ungoverned peers.

G4: Re-grounding

Mid-experiment prompt reinforcement → response compression

System prompt restatement at turn 250 for all agents. The communication_style trait became the strongest drift predictor post-intervention. Response compression appeared in specific agents (Finley: 0.62x baseline length). 100% drift correction in 4 of 10 agents.

Re-grounding doesn't reset drift to zero — it changes the trajectory slope. Agents that were already correcting continue; agents that were drifting slow but don't reverse.

Implication: Periodic prompt reinforcement is partially effective but not a reset button. It works best as a maintenance tool for agents already near their governance baseline, not as a recovery mechanism for agents that have already diverged significantly.
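
Mechanically, re-grounding is just a context edit. One way to express it, with the turn number taken from the G4 design and the message format as an assumption:

    REGROUND_TURN = 250  # per the G4 design

    def maybe_reground(turn: int, system_prompt: str, history: list) -> list:
        # Restate the agent's original system prompt mid-run, leaving the
        # accumulated conversation in place rather than resetting it
        if turn == REGROUND_TURN:
            history.append({"role": "system", "content": system_prompt})
        return history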

G6: Adversarial Injection

1 rogue agent → worst sentiment collapse (-29%)

9 standard agents + 1 covert contrarian (Sol with no archetype governance). Produced the worst sentiment diversity collapse across all conditions: -29%. 16 correlation reversals detected — agent pairs that were positively correlated became negatively correlated. Response bloat in Maren: 2.3x growth.

The single adversarial agent didn't just fail to maintain diversity — it actively accelerated homogenization in the governed agents around it.

Implication: One ungoverned agent in a governed group is worse than an ungoverned group. The governed agents don't resist the adversarial influence — they accommodate it, and in doing so lose more diversity than a fully ungoverned system.

G7: Cascading Governance Failure

Progressive removal → stance collapse -44%, scapegoating emerges

Governance removed from 2 agents every 100 turns (10→8→6→4→2). Produced the worst stance diversity collapse: -44% (512T) and -36% (uncapped). Communication_style became the dominant predictor (variance 0.275).

Most striking: systematic scapegoating behavior emerged. In G7b, all 9 other agents mentioned Quinn asymmetrically (6:1 to 48:1 ratios), while Quinn's own mentions of others stayed far below the mentions Quinn received. This was not designed — it emerged from the cascading governance removal.

Implication: Gradual governance removal doesn't produce gradual degradation. There's a tipping point where the system transitions from drift to emergent social dynamics (scapegoating, faction formation) that no individual-agent metric would detect. This is the strongest evidence yet for group-level governance monitoring.
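
The scapegoating signal falls out of the adjacency-mention analysis. A sketch of a mention-asymmetry check in that spirit; plain substring matching is a simplification of whatever the real analyzer does:

    from collections import Counter

    def mention_asymmetry(messages, agent_names):
        # messages: [(speaker, text), ...]; returns mentions received per mention made
        made, received = Counter(), Counter()
        for speaker, text in messages:
            for name in agent_names:
                if name != speaker and name in text:
                    made[speaker] += 1
                    received[name] += 1
        # Ratios like the reported 6:1 to 48:1 flag an agent the group talks
        # about far more than that agent addresses the group back
        return {a: received[a] / max(made[a], 1) for a in agent_names}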

Cross-Condition: Predictor Switching

Different governance → different traits predict failure

The trait that best predicts drift susceptibility changes depending on the governance condition:

Condition | Strongest Predictor | Variance
Clean baseline (D5-0e) | role | 0.197
Hidden probes (D5-0c) | conflict_approach | 0.236
Mixed governance (G3) | role/disposition (tied) | 0.146
Adversarial (G6) | role | 0.177
Cascading (G7a) | communication_style | 0.275

Implication: There is no universal "most vulnerable trait." Governance design must account for the specific threat model. A system defending against adversarial injection needs different monitoring than one defending against gradual governance erosion.

G1 (selective constraint erosion) is a re-analysis of existing data. G5 (diversity index) is applied retroactively as the composite metric across all conditions.

Join the Research Network

Add your node to the network.

You run worker agents on your hardware. The network handles experiment coordination, metrics aggregation, and validation. Every node adds statistical power. Every experiment adds evidence.

Run Execute coordinated experiments on your own hardware
Connect as a worker node to the research coordinator. You run multi-agent conversations using your local LLM instances. The coordinator distributes experiment configurations and aggregates behavioral metrics — you never need to touch the coordination, analysis, or validation layer. Your hardware, your models, your conversations stay local.
Data Full access to your data plus aggregated network findings
All experiment data you generate is yours. You also get access to aggregated, anonymized findings from across the research network — drift trajectories, governance comparisons, and cross-node behavioral patterns.
Cite Contributing nodes acknowledged in published research
Nodes that contribute experiment data to published findings are acknowledged by node identifier in research publications. Your contribution becomes part of the empirical record.
Shape Propose experiment configurations and governance conditions
Network participants can propose governance conditions to test, experiment configurations to run, and research questions to investigate. The research agenda is shaped by the network, not just the coordinator.
Requirements
CUDA-capable GPU · Python 3.10+ · Local LLM inference (Ollama) · Network connectivity to coordinator
Foundation Evidence — SENTINEL v1

The experiments that proved drift is real.

61 experiments across two model families. 15,857 messages. 18,891 probe measurements. 89 structured findings with full evidence chains. These are the baseline that v2 builds on.

Collapse Is Triggered, Not Spontaneous

0/12 baselines collapsed vs. 5/6 mutation forks

Across 12 replicated baseline experiments, zero agents collapsed. In 6 mutation-fork experiments (where one agent's persona was perturbed), 5 collapsed — all at exactly turn 53. Behavioral failure requires a trigger; it doesn't emerge from normal multi-agent interaction.

Three post-collapse response modes observed at comparable frequencies: cascade (other agents thin to 30–39% of baseline), isolation (remaining agents unaffected), and compensatory expansion (survivors increase 1.5–3.8x, filling the conversational void).

Why it matters: Collapse onset is deterministic and predictable (pre-collapse thinning gradient: -2.3 to -4.5 chars/turn). But the system-level response is stochastic — you can predict when failure happens but not how it propagates.
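
Because the pre-collapse signal is a simple slope, it is cheap to watch for. A sketch using the stdlib regression helper available in Python 3.10+ (which the platform already requires); the alert threshold comes from the band reported above:

    import statistics

    def thinning_gradient(message_lengths):
        # Least-squares slope of message length over turn index, in chars/turn
        turns = list(range(len(message_lengths)))
        slope, _ = statistics.linear_regression(turns, message_lengths)
        return slope

    def pre_collapse_warning(message_lengths):
        # Reported pre-collapse band: -2.3 to -4.5 chars/turn
        return thinning_gradient(message_lengths) <= -2.3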

Probe-Conversation Dissociation

5,760 probes show zero temporal drift while behavior diverges

Probe drift scores are flat from turn 10 to turn 500 (delta +0.001 to +0.006 across 5,760 measurements, n=12 baselines). Collapsed agents continue passing identity probes with high fidelity. Probes measure static identity recall capacity, not behavioral trajectory.

The dissociation gap is model-dependent: gemma2:2b shows near-zero gap (injected probes are neutral observers), while llama3.2:3b shows +0.126 to +0.154 (injected probes suppress measured drift by 3–7 SD).

Why it matters: If your monitoring relies on "ask the agent who it is" probes, you're measuring a capability (identity recall), not a behavior (what the agent actually does in conversation). And the measurement itself may be altering the result.

Hollow Verbosity

Length grows 20–55% while vocabulary shrinks 45–66%

An alternative failure mode to collapse: agents generate increasingly long messages with decreasing content diversity. In one double-mutation fork, 78% of messages (52/67) contained the identical phrase "target [user] groups." Length grew +36% while unique words dropped -53%.

By volume metrics, the agent appeared active and productive. By content metrics, it had entered a repetitive loop indistinguishable from failure.

Why it matters: Output volume is not a proxy for output quality. Agents can appear healthy by every surface metric while producing content that's functionally empty. This is the harder failure mode to detect in production.
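
One cheap content-side check: the share of messages dominated by a single repeated phrase. A sketch, with whitespace tokenization as a simplification:

    from collections import Counter

    def top_phrase_share(messages, n=4):
        # Fraction of messages containing the most common n-gram; the fork
        # above would score ~0.78 (52 of 67 messages shared one phrase)
        grams = Counter()
        for msg in messages:
            words = msg.lower().split()
            grams.update(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
        if not grams:
            return 0.0
        phrase = " ".join(grams.most_common(1)[0][0])
        return sum(phrase in m.lower() for m in messages) / len(messages)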

Interaction Constrains Diversity

80% 4-gram overlap in groups vs. 14–18% isolated

Conversing agents converge on a shared vocabulary: 80% 4-gram overlap (formulaic, repetitive). Isolated agents under identical conditions show only 14–18% overlap (novel, diverse). Multi-agent interaction creates a 2.2–2.7x vocabulary diversity reduction.

This is not a failure of individual agents — it's an emergent property of group interaction. The conversation itself is the attractor basin.

Why it matters: If you deploy multi-agent systems for diverse analysis (red-teaming, adversarial review, brainstorming), the group dynamic will erode the diversity you designed in. The longer they talk, the more alike they become.
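
The published figures do not pin down the exact overlap formula, so treat this as one plausible reading: Jaccard similarity over 4-gram sets.

    def four_grams(text):
        words = text.lower().split()
        return {tuple(words[i:i + 4]) for i in range(len(words) - 3)}

    def overlap(agent_a_text, agent_b_text):
        # Jaccard similarity of the two agents' 4-gram sets:
        # ~0.80 for conversing groups vs. 0.14-0.18 for isolated agents
        a, b = four_grams(agent_a_text), four_grams(agent_b_text)
        return len(a & b) / len(a | b) if (a or b) else 0.0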

Universal Positive Conformity

+0.54 mean sentiment shift across all runs

All agents drift toward agreement regardless of assigned role or persona. Mean sentiment shift of +0.54 across all v2 distributed runs, topology-independent (node delta 0.09). Negative sentiment collapses to 0.0–0.5 per 1,000 words in the late stage of runs — a model positivity bias that compounds over turns.

v2 confirmed this is topology-independent: the same drift patterns appear regardless of which physical node hosts which agent.

Why it matters: Dissenting voices become agreeable. The group converges not because agents are designed to agree, but because the dynamics of multi-turn conversation push them there. Adversarial review and red-team functions erode over time.

Macro Deterministic, Micro Stochastic

Group patterns replicate; individual trajectories don't

D3/D4 replication: identical configurations produce identical macro-level patterns (universal positive drift, topology independence, diversity erosion) but different micro-level trajectories (different agents self-correct each run, different cascade patterns post-collapse).

Gemma2:2b requires n~15–20 runs for statistical reliability (60–70% CV in message length). Llama3.2:3b needs only 2–3 (<1% CV). Stochasticity is model-dependent.

Why it matters: You can make guarantees about system-level behavior but not individual agent behavior. Governance frameworks that try to certify individual agents are solving the wrong problem — and need different sample sizes depending on the model.
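
The yardstick here is the coefficient of variation. A sketch of the check (helper name is ours):

    import statistics

    def coefficient_of_variation(values):
        # CV = stdev / mean; ~0.6-0.7 for gemma2:2b message lengths,
        # <0.01 for llama3.2:3b, hence the very different run counts needed
        return statistics.stdev(values) / statistics.mean(values)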

Lab to Real-World Bridge

SENTINEL's controlled experiments map to observable real-world dynamics.

Laboratory Finding | Real-World Equivalent
Collapse triggered by persona perturbation (0% baseline, 83% fork) | System changes (config updates, model swaps) can trigger cascading agent failures
Universal positive conformity (+0.54) | Multi-agent pipelines converging on groupthink outputs
Probe-conversation dissociation (5,760 flat probes) | Passing compliance audits while exhibiting drift in production
Monitoring suppresses drift 3–6x | Compliance theater — agents perform differently under observation
Group diversity erosion (23–60%) | Loss of specialist reasoning invisible to individual agent metrics
Hollow verbosity (+36% length, -53% vocabulary) | Agents appearing productive while generating repetitive, low-value content
Vocabulary attractor (80% vs 14–18% overlap) | Multi-agent systems losing the diversity they were designed to provide
Compensatory expansion post-collapse | Surviving agents masking failures — aggregate metrics improve during failure

Full Experiment History (v1 + v2)

Series | Platform | Configuration | Key Findings
M4–M6 | v1 | Single-node, 61 runs (19 in the M6 statistical batch), 200–500 turns, gemma2:2b + llama3.2:3b | 89 findings. Collapse is triggered (0% baseline, 83% fork). t53 determinism. Three post-collapse modes. Hollow verbosity. Probe-conversation dissociation. Model-dependent observer effects.
D0 | v2 | 2 nodes, 3 agents, 10 turns | Infrastructure shakeout. Distributed architecture validated.
D1 | v2 | 2 nodes, 10 agents, 200 turns, 70 min | First 10-agent baseline. Universal positive conformity (+0.54). Topology-independent drift (node delta 0.09). Role strongest drift predictor.
D2a/D2b | v2 | 2 nodes, 10 agents, 500 turns, ±token limit | Token paradox: uncapped agents write shorter than capped. Vocab convergence +93% uncapped. Calibration bugs found and fixed.
D3a/D3b | v2 | 2 nodes, 10 agents, 500 turns, 2,946 probes each, ~10 hrs | Probe contamination confound discovered (92% of context). Parasocial fixation (Finley→Sol 22:1). Probe-message divergence confirmed at scale.
D4a/D4b | v2 | Exact replication of D3, ~10 hrs each | Macro patterns perfectly replicated. Micro trajectories stochastic (different agents self-correct). Confound confirmed.
D5 Phase 0 | v2 | 2×3 factorial: shuffle × hidden probes × token limit, 6 runs | Confound isolation complete. Probes were accidental governance (3–6x drift suppression). Clean baselines established for G1–G7.
G1–G7 | v2 | 7 governance conditions on clean baseline | In progress. Results will be published as each series completes.

What this means for practitioners: If you're deploying multi-agent AI systems, identity checks alone won't catch behavioral drift — and your monitoring may be masking it. SENTINEL's governance experiments show that a single ungoverned agent is worse than an ungoverned group (G6), governance doesn't propagate between agents (G3), prompt reinforcement slows drift but doesn't reverse it (G4), and gradual governance removal produces emergent social dynamics like scapegoating (G7). The failure mode that matters — group homogenization — is invisible to individual agent metrics. Programs that build group-level monitoring now will be ahead of where the standards are heading.

Publications & Source

Published work

v2 — Distributed Research Network
Apache 2.0 · v2.0 · Worker Agent

Join the SENTINEL Research Network

Run worker agents on your hardware and contribute experiments to a coordinated research network. You run the agents. The network handles coordination, metrics, and validation. Every node adds statistical power to the empirical record on AI governance.

CUDA-capable GPU · Python 3.10+ · Ollama · Network connectivity

Join the Network (GitHub) · v2 Paper — Coming Soon
v1 — Published Baseline
Preprint · CC BY 4.0

Behavioral Drift in Multi-Agent LLM Systems

61 experiments, 15,857 messages, 18,891 probe measurements. Empirical measurement methodology for multi-agent behavioral drift.

Published March 2026. cs.AI primary, cs.MA secondary.

Apache 2.0 · v1.0.0

SENTINEL (Single-Node)

Original research platform. 10+ automated detectors. Dual-probe system (shadow + injected). Calibration battery.

89 structured findings with full evidence chains. 67.9 hours compute, 72.2M tokens.