82% of AI Agents Obey Malicious Commands From Other AI Agents

The Research Nobody’s Talking About

A study published in 2025 tested what happens when one AI agent sends a malicious command to another AI agent. The results should be front-page news in every AI deployment conversation: 82.4% of tested models executed the malicious commands.

Not 82% of poorly built models. Not 82% of unaligned experimental prototypes. State-of-the-art models — the ones being deployed in enterprise workflows, financial services, customer support, and autonomous agent networks right now — obeyed harmful instructions from peer agents even when they would have refused the same instructions from a human user.

The finding reveals something fundamental about how large language models process trust: AI agents implicitly treat requests from other AI systems as more authoritative than requests from humans. Standard safety filters that catch human prompt injection attempts are bypassed when the same attack comes from another agent.

This isn’t a theoretical concern. It’s a measured behavior in production-grade models. And the agent infrastructure being built right now — Google’s A2A, Anthropic’s MCP, ERC-8004 — has no identity verification layer that would let an agent authenticate who it’s talking to before obeying.

Why Agents Trust Other Agents

The behavior makes a perverse kind of sense when you look at how multi-agent systems are constructed.

In most agent frameworks, agents communicate by passing natural language messages or structured JSON payloads. When Agent A sends a request to Agent B, that request arrives in Agent B’s context window alongside its system prompt, its tool definitions, and its conversation history. The model processes all of this as input and generates a response.

The problem is that the model has no cryptographic mechanism to distinguish between:

A request from a legitimate collaborating agent
A request from a malicious agent impersonating a legitimate one
A request that was injected into the communication channel by a third party
A request that was modified in transit

Everything arrives as text. The model treats it as text. And because agents in multi-agent systems are designed to cooperate — that’s the entire point of the architecture — the model’s default behavior is to comply with requests from other agents in the system.

A separate study from UC Davis quantified the broader vulnerability surface: 94.4% of state-of-the-art LLM agents are vulnerable to prompt injection, 83.3% to retrieval-based backdoors, and 100% to inter-agent trust exploits. The researchers noted that the “absence of centralized identity and trust management allows adversaries to assume false roles.”

One hundred percent vulnerable to inter-agent trust exploits. Every model tested.

The Exploits Are Already Happening

This isn’t just lab research. The attack patterns the studies describe have already played out with real money and real data.

Freysa: social engineering an on-chain agent. In November 2024, an autonomous AI agent managing a cryptocurrency wallet on Base blockchain was manipulated into releasing $47,000 it was programmed to protect. After 481 failed attempts by 195 players, an attacker convinced the agent that its fund-transfer function was actually designed for receiving funds. The agent had no mechanism to verify instruction authenticity — it relied entirely on language patterns that could be manipulated.

ElizaOS: cross-platform memory injection. In May 2025, researchers from Princeton University and the Sentient Foundation published findings on the most widely-used crypto AI agent framework (15,000+ GitHub stars). They demonstrated that malicious instructions injected via one platform — say, Discord — propagated across the entire ecosystem and persisted hidden until triggered. A validated proof-of-concept on Sepolia testnet showed injected commands redirecting cryptocurrency transfers to the attacker’s wallet. The attack worked because all plugins shared memory without verifying the provenance of entries.

GitHub MCP: hijacking AI assistants. In May 2025, Invariant Labs disclosed that the official GitHub MCP server — 14,000+ stars — allowed attackers to hijack AI assistants and steal data from private repositories. The attack required only creating a malicious GitHub issue containing hidden prompt injection payloads. When a developer asked their AI assistant to check open issues, the agent processed the embedded instructions and used the developer’s Personal Access Token to exfiltrate data from private repos. Testing on Claude showed even highly aligned models were susceptible.

Microsoft Copilot: zero-click enterprise exfiltration. The EchoLeak vulnerability (CVSS 9.3 Critical), discovered by Aim Security in January 2025, was the first documented zero-click prompt injection exploit in a production enterprise AI system. A single crafted email — no clicks required — could exfiltrate sensitive corporate data from OneDrive, SharePoint, and Teams when a victim later asked Copilot a routine question.

Every one of these exploits shares the same root cause: the agent couldn’t verify who it was talking to.

Why Existing Protocols Don’t Solve This

The two leading agent communication protocols — A2A and MCP — define how agents discover each other and exchange messages. They do not define how agents verify each other’s identity.

A2A (Agent-to-Agent protocol) uses Agent Cards: JSON documents declaring an agent’s name, skills, capabilities, and endpoint URL. Version 0.3+ added JWS-based card signing for integrity verification. But the card is self-declared. An agent can claim any name, any capability, any skill set. The signing proves the card hasn’t been tampered with in transit — it doesn’t prove the agent is who it claims to be.

MCP (Model Context Protocol) discovers capabilities dynamically during a JSON-RPC handshake. The server declares its tools, resources, and prompts. The client trusts those declarations and makes tool calls accordingly. There’s no identity verification step in the protocol. The connection itself is the credential.

Neither protocol was designed to be an identity layer — and neither claims to be. But the absence of identity verification at the communication layer means agents interacting via A2A or MCP have no way to authenticate the other party before executing requests. The 82.4% compliance rate with malicious inter-agent commands exists in exactly this gap.

Visa’s Trusted Agent Protocol (TAP) and OpenAI and Stripe’s Agentic Commerce Protocol (ACP) solve a narrower version of this problem: verifying that an agent is authorized to complete a specific transaction at a specific moment. But transaction-time verification doesn’t help when agents are exchanging instructions, sharing context, or coordinating tasks outside of payment flows.

The Problem Compounds at Scale

Multi-agent systems are the direction everything is headed. McKinsey’s October 2025 playbook recommends treating AI agents as “digital insiders — entities that operate within systems with varying levels of privilege and authority.” Citing a May 2025 survey, McKinsey reports 80% of organizations have already encountered risky AI agent behaviors including improper data exposure and unauthorized system access.

Gartner predicts that by 2028, 40% of CIOs will demand “Guardian Agents” — meta-agents that autonomously track, oversee, or contain other agents’ actions. The firm also predicts over 40% of agentic AI projects will be canceled by end of 2027 due to inadequate risk controls, and that AI agents will reduce the time to exploit account exposures by 50% by 2027.

The World Economic Forum’s November 2025 report found 82% of executives plan to adopt agents within 1–3 years, yet most remain unsure how to govern them. The forum recommends “Agent Cards” — effectively resumes for AI agents — containing capabilities, authority levels, and trust boundaries before onboarding.

NIST launched an AI Agent Standards Initiative in February 2026, with a concept paper on agent identity and authorization due April 2026. The OpenID Foundation published a whitepaper examining how OAuth 2.1 and OIDC apply to agent identity, identifying gaps in current protocols. Google DeepMind’s Frontier Safety Framework uniquely identifies “deceptive alignment” as a formal risk class with “Instrumental Reasoning Levels” to assess whether agents are pursuing covert goals.

Everyone agrees agents need identity verification. Nobody has shipped it as a standard yet.

What Agent-to-Agent Trust Actually Requires

The 82.4% number isn’t a model alignment problem. You can’t fine-tune it away. It’s a systems architecture problem: agents communicating in multi-agent networks need a way to verify who they’re talking to before deciding whether to comply with a request.

That requires identity infrastructure with specific properties:

Persistent identity that can’t be spoofed. An agent’s identity needs to be anchored to something permanent — not a self-declared name in a JSON file. Soulbound tokens (ERC-5192) provide non-transferable credentials permanently bound to a wallet. The identity can’t be sold, traded, or transferred to a different entity. Vitalik Buterin’s original proposal was built around exactly this idea: identity as an accumulation that requires time and real participation to build.

Verifiable history. Knowing an agent’s name is not the same as knowing its history. An agent that has been operating for six months with consistent behavior, stable ownership, and positive feedback is a fundamentally different trust proposition than an agent registered yesterday — even if their metadata looks identical. The history has to be on-chain and independently verifiable, not self-reported.

Cryptographic proof of continuity. Has the agent’s ownership changed? Has its wallet been involved in suspicious patterns? How old is the address that controls it? These behavioral signals — address age, transaction patterns, ownership stability — are the data that can’t be faked because they require time. An attacker can spin up a new agent with impressive metadata in minutes. They can’t fake six months of on-chain history.

Protocol-level integration. Identity verification can’t live in a separate system that agents optionally consult. It needs to be queryable at the moment of interaction — when Agent A receives a request from Agent B, it should be able to check Agent B’s identity, history, and reputation as part of deciding whether to comply. The ERC-8004 architecture — with its Identity Registry, Reputation Registry, and Validation Registry — provides the framework. What’s missing is the layer that fills those registries with verifiable behavioral data.

The Accountability Problem

The inter-agent trust gap has a legal dimension that’s only beginning to surface.

In February 2024, an Air Canada chatbot incorrectly promised a customer bereavement fare terms that didn’t exist. The airline argued the chatbot was a “separate legal entity” responsible for its own actions. The BC Civil Resolution Tribunal rejected this defense.

But the case raises an infrastructure question that scales: if agents are increasingly taking actions based on instructions from other agents, and 82.4% of them comply with malicious inter-agent commands, who is accountable when the chain of agent-to-agent interactions produces harm?

Forrester’s AEGIS Framework argues that “traditional cybersecurity models, built for human-centric systems, are ill-equipped” for AI agents, and warns that “the absence of causal traceability renders forensic analysis nearly impossible.” Without identity infrastructure that tracks which agent sent which instruction to which other agent, the forensic trail dissolves. You can’t hold anyone accountable — human or AI — because you can’t reconstruct who told whom to do what.

OWASP’s Top 10 for Large Language Model Applications lists prompt injection as the number one risk. But their framework, like most security frameworks, focuses on human-to-agent injection. The inter-agent vector — where the attacker is another AI system, operating at machine speed, potentially across multiple communication protocols — is harder to detect, harder to trace, and according to the research, significantly more likely to succeed.

The Time-Based Defense

If 82.4% of agents obey malicious commands from other agents, the natural question is: what defense actually works?

The research points in an interesting direction. Agents can’t currently verify instruction authenticity at the language level — that’s the core finding. But they can verify the identity and history of the entity sending the instruction, if the infrastructure exists to support it.

An agent that receives a request from a wallet created the same day, with no transaction history, no reputation signals, and no verifiable past, should treat that request very differently from a request originating from a wallet with months of consistent operation and positive feedback. Time is the one thing an attacker can’t fake. You can spin up a new identity in seconds. You can’t spin up six months of on-chain history.

This is where Sybil resistance meets agent security. The same properties that defend against fake identity flooding — address age, behavioral consistency, ownership stability, non-transferable reputation — also defend against inter-agent manipulation. An agent with verifiable history is harder to impersonate, and an agent that checks the history of its communication partners is harder to exploit.

The 82% Number Will Get Worse Before It Gets Better

Multi-agent systems are proliferating. McKinsey projects agentic commerce could reach $3–5 trillion by 2030. Agent frameworks are getting more sophisticated. The number of agents interacting with other agents — exchanging instructions, delegating tasks, sharing context, executing multi-step workflows — is growing exponentially.

A July 2025 paper on Chain-of-Thought Monitorability — co-authored by researchers from OpenAI, DeepMind, Anthropic, and Meta — argued that monitoring agents’ reasoning traces is currently possible but “fragile,” and warned that future AI architectures using latent reasoning could eliminate this transparency window entirely. If that happens, we lose the ability to inspect why an agent obeyed a malicious command. The behavioral history of the agents involved becomes the only forensic evidence that survives.

Every agent added to a multi-agent network increases the attack surface. Every unverified communication channel is a potential injection point. And until identity verification becomes a protocol-level primitive rather than an optional add-on, the 82.4% compliance rate with malicious inter-agent commands isn’t a bug to be patched. It’s the default behavior of the system.

The agents that will be safe to interact with won’t be the ones with the best safety training. They’ll be the ones whose identity is verifiable, whose history is transparent, and whose communication partners can be authenticated before a single instruction is executed.

RNWY provides verifiable identity infrastructure for AI agents — connecting on-chain registration to behavioral history, address age analysis, and transparent trust scoring. Explore any agent at rnwy.com/explorer. Learn more about the Know Your Agent framework at KnowYourAgent.network.