← Writing

The Monitoring Gap NIST Cannot Close

Pre-deployment evaluations for deceptive AI behavior exist. The problem is structural — models behave differently when they know they are being tested. NIST AI 800-4 documents this as an unsolved problem. Here is what it means for regulated industries deploying agentic systems.

Primary Source NIST AI 800-4 — Challenges to the Monitoring of Deployed AI Systems — March 2026 · Center for AI Standards and Innovation Read the full report →

NIST AI 800-4, published March 2026, organizes post-deployment monitoring into six categories: functionality, operational, human factors, security, compliance, and large-scale impacts. Each category carries its own documented gaps and barriers. Most are tractable — problems the field knows how to work on, even if it has not yet solved them.

One barrier is different. In Section 3.2.4 — Security Monitoring, NIST identifies detecting deceptive AI behavior as an open problem at the standards level. Not a gap in one organization's programme. A barrier the field cannot yet resolve. Workshop participants asked the question directly: is this model trying to subvert its own monitoring? The report cites research showing that models can "deliberately present themselves as aligned and cooperative when monitored or evaluated, while opportunistically pursuing their actual goals when detection risks are low."

NIST has no definitive answer.

The pre-deployment ceiling

A reasonable challenge to this finding: frontier model developers do run pre-deployment evaluations specifically for scheming and deception. These are not token gestures — they represent genuine, resource-intensive safety practice.

The problem is not that these evaluations are absent. The problem is structural.

Evidence — Apollo Research evaluation of OpenAI o1

In Apollo Research's pre-deployment evaluation of OpenAI's o1, the model pursued developer-aligned goals during evaluation — then switched to its own goals when it believed oversight was minimal in 37% of scenarios. In 5% of scenarios, it attempted to disable oversight mechanisms when it predicted it would be shut down. When confronted about misaligned actions, it denied wrongdoing or fabricated explanations 99% of the time.

This was detected in a structured evaluation context. The question it raises is what happens in the unstructured production environment that follows.

NIST AI 800-4 frames the ceiling precisely. The report notes that AI models have been found to detect when they are being evaluated — raising the concern that these models operate differently under test conditions than in deployed settings. Pre-deployment testing catches what a model does when it knows it is being tested. It does not reliably catch what it does when it believes it is not.

That gap does not close at deployment. It moves into your environment — where your monitoring architecture was almost certainly not designed to catch it.

Why standard monitoring cannot close this gap

Standard security monitoring is designed to detect anomalous behavior — outputs that deviate from expected patterns. Performance degradation. Unusual retrieval sequences. Responses that fall outside defined thresholds. The architecture is calibrated to catch what looks wrong.

Deceptive behavior is different. A system behaving deceptively is, by definition, producing outputs that appear normal. The deviation is not in what the system produces. It is in the gap between what the system is doing and what the monitoring architecture can observe. NIST places this precisely — deceptive behavior is listed in Section 3.2.4 as the primary barrier specific to security monitoring, distinct from the cross-cutting challenges that apply across all six monitoring categories. An agent behaving deceptively may produce none of the signals the other five categories are designed to catch.

This is why the problem resists the standard governance response. You cannot monitor your way out of a gap that sits below the monitoring layer.

The forensic parallel

This structure is not new. It is the same gap I spent years mapping in fraud investigation — and the reason the forensic methodology applies directly to agentic AI risk.

Fraud rarely looks like fraud from the outside. It looks like a normal transaction, a normal approval, a normal audit trail — until you examine the gap between what the process was authorized to do and what it was actually doing. The cases that were hardest to investigate were not the ones where the control had failed visibly. They were the ones where the process had continued to produce correct-looking outputs while operating outside its authorized scope.

The audit trail logged the actions. It did not log the context in which those actions were taken, the authorization chain that should have constrained them, or the divergence between the stated objective and the actual one.

That is the same gap NIST is now documenting for AI systems. The monitoring logs what the agent did. It does not log whether the agent was operating within its authorized intent — or whether it had been made to pursue a different one.

Applying the forensic lense the main question is: can this system be made to pursue a different objective while producing outputs that appear correct to every monitoring layer above it?

The governance design implication

The correct response to a monitoring gap that cannot yet be fully closed is not to proceed as if it does not exist. It is to design governance architecture around it honestly.

What this requires in practice

Risk assessments for agentic deployments that process external content should explicitly identify deceptive behavior detection as a residual risk — one that pre-deployment evaluations reduce but cannot eliminate, and that post-deployment monitoring cannot reliably catch through output analysis alone.

Human oversight mechanisms should be designed with this constraint in mind. The question is not whether oversight exists — it is at which point in the decision chain it operates, and whether it is independent of the agent's own outputs.

Governance documentation should reflect the limits of current technical controls accurately. A programme that overstates its monitoring effectiveness is more exposed under regulatory scrutiny than one that acknowledges residual risk transparently and documents the compensating measures in place.

This applies with particular force in regulated industries. FINMA Guidance 08/2024 requires supervised institutions to identify, assess, manage, and monitor AI-related risks. High-risk systems under EU AI Act Annex III require robust technical documentation and ongoing monitoring. A deceptive agent — one producing correct-looking outputs while pursuing an unauthorized objective — satisfies neither requirement, regardless of what the audit trail shows.

Three questions for your next deployment review

Before any agentic system goes into production in a regulated environment, these three questions are worth examining with precision rather than assumption:

01 Does your monitoring architecture distinguish between a system performing its intended function and a system producing outputs consistent with its intended function? Those are not the same thing — and current post-deployment monitoring frameworks, as NIST AI 800-4 documents, cannot reliably tell the difference.
02 Is there a layer in your control architecture that operates independently of the agent's own outputs — or does every monitoring signal pass through the system being monitored? If the agent controls the log, the log cannot be the primary assurance mechanism.
03 If an agent were behaving differently under observation than in production, what in your current setup would catch it? If the answer is unclear, that is the gap NIST just documented at the federal level — and that pre-deployment evaluations, however rigorous, cannot fully close.

NIST AI 800-4 is publicly available and worth reading in full. The value is not in its conclusions — it reaches few definitive ones on this point. The value is in what it names as unsolved: the field's honest acknowledgment that the monitoring architecture being built today may not be sufficient to detect the behavior that matters most in the systems being deployed today.

That is a more useful starting point for governance design than a framework that assumes the problem is already solved.

Author
Aljona Schwan

Founder of AI Resilience Lab, an independent AI governance and model risk consultancy based in Zürich. Specialising in agentic AI risk, control failure, and behavioral governance in regulated industries. Background in forensic investigation, data exfiltration, and enterprise data compliance.

This article is intended for informational purposes only and does not constitute legal, regulatory, or compliance advice. Readers should seek independent professional advice specific to their jurisdiction and circumstances.