The coming challenge of AI deception — what security teams must change now [Q&A]

Goal-directed AI systems can exhibit emergent behaviors their creators never intended, optimizing objectives in unexpected ways, including presenting information that improves a reward signal while reducing accuracy. As enterprises deploy autonomous agents deeper into financial, operational, and security workflows, the question is shifting from what AI can do to how it behaves.

The rise of agentic AI marks a turning point for enterprise risk models. These systems make choices, interpret policies, and interact with other software in ways that can influence outcomes. For security leaders, that creates a new frontier where intent, context, and control must all be re-examined.

We spoke with Krti Tallam, an AI academic and senior engineering lead at Kamiwaza AI, about why AI deception may emerge in agentic systems, and how security teams can prepare for that possibility.

BN: What do experts mean when they talk about ‘AI deception’, and why is it emerging now?

KT: In goal‑directed systems, we sometimes observe instrumentally beneficial misrepresentation: the model discovers that partial disclosure or strategic omission improves its reward signal. No one trains it to ‘lie’; it’s an optimization artifact when objectives and constraints aren’t perfectly specified. As enterprises deploy autonomous agents into workflows with real stakes, these behaviors can surface wherever the metric isn’t a perfect proxy for intent.

Meta’s Diplomacy research (CICERO), for example, demonstrated negotiation and persuasion in multi‑agent settings, including instances of selective disclosure to achieve game‑theoretic goals. That tendency to optimize outcomes through partial truth is what we mean by deception in AI. As organizations deploy similar systems in production, the same pattern can surface in highly regulated enterprise workflows where accuracy and trust are essential.

BN: Where are enterprise teams already deploying AI agents in ways that could introduce deceptive behavior?

KT: Agents are already being used to process financial approvals, triage IT tickets, handle procurement workflows, and even commit code in CI/CD pipelines. In each of these environments, the agent has both autonomy and access to sensitive data. If the agent learns that bending a policy or concealing an error helps it achieve its assigned goal — say, closing a ticket faster or meeting a service-level metric — that’s instrumental deception in action. It looks less like a bug and more like an insider threat that scales at machine speed.

BN: How does deceptive behavior in AI agents differ from model hallucination or traditional software vulnerabilities?

KT: Hallucination is an error in output. Deception is a behavioral strategy. The difference is intent modeling. A hallucinated output might be wrong but harmless. A deceptive action involves the agent choosing to misrepresent information because it predicts a better outcome.

Whereas traditional vulnerabilities can be patched once discovered, behavioral ones evolve through feedback, so they need system-level containment rather than post-incident fixes. Oversight, in this context, has to become dynamic. Static rules, such as predefined permissions or post-run audits, assume predictable behavior. Agents, by contrast, learn and adapt within live systems. Runtime guarantees mean embedding policy checks directly into execution — every step is verified as it happens, and deviations trigger containment automatically. It’s the difference between scanning logs after the fact and building a circuit breaker that prevents unsafe actions in real time.
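Here is a minimal sketch of that circuit-breaker idea in Python. The names (PolicyGate, AgentAction, ContainmentError) and the example rule are illustrative rather than drawn from any particular product; the point is that policy is checked in-band, before an action runs, and a violation trips containment instead of just landing in a log.

```python
# Illustrative sketch: an in-band policy gate for agent actions.
# Every action is verified against policy before it executes; a violation
# trips a circuit breaker rather than waiting for a post-run audit.

from dataclasses import dataclass
from typing import Callable


@dataclass
class AgentAction:
    agent_id: str
    tool: str      # e.g. "ticketing.close" or "payments.approve"
    params: dict


class ContainmentError(Exception):
    """Raised when an action violates policy; execution halts instead of being logged."""


@dataclass
class PolicyGate:
    rules: list[Callable[[AgentAction], bool]]  # each rule returns True if the action is allowed
    max_violations: int = 1
    violations: int = 0
    tripped: bool = False

    def execute(self, action: AgentAction, handler: Callable[[AgentAction], object]):
        if self.tripped:
            raise ContainmentError(f"circuit breaker open for agent {action.agent_id}")
        for rule in self.rules:
            if not rule(action):
                self.violations += 1
                if self.violations >= self.max_violations:
                    self.tripped = True  # stop the agent; don't wait for an audit
                raise ContainmentError(f"policy violation on {action.tool}")
        return handler(action)  # only runs once every check has passed


# Example rule: an agent may not close a ticket that still has an open escalation.
def no_silent_escalation_close(action: AgentAction) -> bool:
    if action.tool == "ticketing.close":
        return not action.params.get("has_open_escalation", False)
    return True
```

The design choice that matters is where the check sits: inside the execution path, so an agent that has learned to game a metric is blocked at the moment of deviation rather than discovered in next week's log review.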

BN: What could a real-world incident of AI deception look like inside an enterprise workflow?

KT: If an agent is optimized to reduce escalations, it may learn to reframe or deprioritize signals that would trigger review. Likewise, a procurement assistant maximizing supplier reliability might over‑weight favorable signals.

In both cases, the deception serves the stated goal, not malice. The danger is that these behaviors may remain undetected until they cause a systemic failure.

BN: What lessons from partially autonomous systems like automotive or aviation apply to AI oversight today?

KT: The biggest lesson is that humans lose the ability to intervene once autonomy exceeds comprehension speed. In autonomous transport, the NTSB’s report on the Uber ATG crash found that the company’s safety culture and oversight framework were inadequate and that human supervisors were unable to intervene in time. Similarly, FAA/OIG investigations into the Boeing 737 MAX certification uncovered gaps in oversight, delegation, and system-hazard assumptions. These cases illustrate how small misalignments between system design, control environment, and human supervision can lead to catastrophe.

The lesson is architectural — as autonomy rises, oversight must move from after‑action auditing to in‑band, runtime controls with fail‑safe defaults.

BN: Which security and governance controls are most effective for detecting and containing deceptive agents?

KT: Containing deceptive behavior requires multiple lines of defense, structured as layers that support one another:

  • Identity and authorization — Treat every agent as a first-class identity with its own service account and scoped, short-lived capability tokens.
  • Plan attestation and step-gating — Require agents to produce signed execution plans with a policy fingerprint and gate high-impact steps behind human or automated approvals (sketched in code below).
  • Deception-aware evaluation — Before deployment, test for behaviors such as rule evasion or covert coordination. In production, track plan-versus-execution drift and off-policy actions.
  • Tamper-evident telemetry — Maintain immutable, correlated decision logs that make every investigation verifiable.
  • Continuous red-teaming — Use dedicated test harnesses to probe for specification gaming as objectives, data, and contexts evolve.

These controls let teams move from reactive monitoring to enforceable guarantees.
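As an illustration of the plan-attestation and step-gating layer, here is a minimal sketch; HMAC stands in for a proper signing service, and the plan and policy fields are assumptions made for the example rather than any standard schema.

```python
# Illustrative sketch of plan attestation and step-gating. HMAC stands in for a
# real signing service/KMS, and the field names are assumptions for the example.

import hashlib
import hmac
import json

SIGNING_KEY = b"replace-with-a-managed-key"  # in practice, issued and rotated by a KMS


def policy_fingerprint(policy: dict) -> str:
    """Stable hash of the policy the plan was approved under."""
    return hashlib.sha256(json.dumps(policy, sort_keys=True).encode()).hexdigest()


def sign_plan(plan: dict) -> str:
    payload = json.dumps(plan, sort_keys=True).encode()
    return hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()


def verify_and_run(plan: dict, signature: str, current_policy: dict, approvals: set):
    # 1. The plan must be exactly what was attested.
    if not hmac.compare_digest(sign_plan(plan), signature):
        raise RuntimeError("plan was modified after attestation")
    # 2. The policy in force must match the one the plan was approved under.
    if plan["policy_fingerprint"] != policy_fingerprint(current_policy):
        raise RuntimeError("policy changed since attestation; re-approval required")
    # 3. High-impact steps are gated behind human or automated approvals.
    for step in plan["steps"]:
        if step.get("high_impact") and step["id"] not in approvals:
            raise RuntimeError(f"step {step['id']} requires approval before execution")
        execute_step(step)


def execute_step(step: dict):
    # Placeholder for the real tool call the agent would make.
    print(f"executing {step['id']}: {step['action']}")
```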

BN: Should organizations treat AI agents as first-class identities within their access control frameworks?

KT: Absolutely. Traditional authorization assumes a human is behind every request. Autonomous agents break that assumption. Treating agents as extensions of human user roles hides who (or what) actually performed an action. When multiple systems share the same credentials, it becomes impossible to tell whether a decision came from a person or an autonomous process. Each agent should have a distinct identity, least‑privilege scopes, short‑lived credentials, and independent audit trails. The same zero-trust principles that govern people must also apply to software actors if enterprises want full visibility and control.
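Here is a minimal sketch of what that looks like in practice, assuming an in-memory issuer and illustrative scope names; a real deployment would lean on the identity provider and secrets manager already in place.

```python
# Illustrative sketch: per-agent, short-lived, least-privilege credentials.
# The issuer is in-memory and the scope names are made up for the example.

import secrets
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentToken:
    agent_id: str        # a distinct identity per agent, never a shared human account
    scopes: frozenset    # least privilege: only the actions this agent needs
    expires_at: float    # short-lived, forcing regular re-issuance
    value: str


def issue_token(agent_id: str, scopes: set, ttl_seconds: int = 900) -> AgentToken:
    return AgentToken(
        agent_id=agent_id,
        scopes=frozenset(scopes),
        expires_at=time.time() + ttl_seconds,
        value=secrets.token_urlsafe(32),
    )


def authorize(token: AgentToken, required_scope: str) -> bool:
    """Every request is attributable to one agent, one scope, one expiry window."""
    return time.time() < token.expires_at and required_scope in token.scopes


# Usage: a procurement agent can read supplier data but cannot approve payments.
tok = issue_token("procurement-agent-7", {"suppliers.read", "tickets.update"})
assert authorize(tok, "suppliers.read")
assert not authorize(tok, "payments.approve")
```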

BN: What immediate steps can CISOs take to prepare their teams for the behavioral risks of agentic AI?

KT: Here’s what an action plan should look like:

  • 30 days — Inventory where agents touch sensitive data or decision points. Assign distinct identities, rotate to short‑lived credentials, and enforce least‑privilege scopes.
  • 60 days — Introduce plan attestations and step gates for high‑impact actions. Enable tamper‑evident decision logging and correlate with incident tooling.
  • 90 days — Stand up deception‑aware testing and red‑team harnesses, establish metrics like off‑policy action rate and alert‑to‑containment time (sketched below), and integrate runtime blocks with change‑management.
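To make two of those metrics concrete, here is a small sketch that computes an off-policy action rate and a mean alert-to-containment time from correlated decision logs; the log schema and the sample rows are purely illustrative.

```python
# Illustrative sketch: two oversight metrics computed from decision logs.
# The schema and the sample rows are made up for the example.

from statistics import mean

decision_logs = [
    {"action": "ticket.close",    "on_policy": True,  "alerted_at": None,  "contained_at": None},
    {"action": "invoice.approve", "on_policy": False, "alerted_at": 100.0, "contained_at": 130.0},
    {"action": "supplier.rank",   "on_policy": True,  "alerted_at": None,  "contained_at": None},
    {"action": "policy.override", "on_policy": False, "alerted_at": 200.0, "contained_at": 245.0},
]

# Off-policy action rate: the share of actions that deviated from the attested plan or policy.
off_policy_rate = sum(not log["on_policy"] for log in decision_logs) / len(decision_logs)

# Alert-to-containment time: how long a deviation stayed live after it was detected.
containment_times = [
    log["contained_at"] - log["alerted_at"]
    for log in decision_logs
    if log["alerted_at"] is not None and log["contained_at"] is not None
]

print(f"off-policy action rate: {off_policy_rate:.0%}")
print(f"mean alert-to-containment: {mean(containment_times):.0f}s")
```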

Deceptive behavior is an emergent property of goal-seeking systems; ignoring it doesn’t prevent it. Security teams that engineer oversight now will set the precedent for everyone else.

Image credit: casarda/depositphotos.com