Research question

The agent-framework market treats orchestration as the central abstraction. A framework gives the model a planner, memory, tool adapters, callbacks, traces, and retries. Those components matter, but the production question is different: what authority does the agent have when it is about to perform a consequential action?

This paper separates the framework layer from the execution boundary. The framework helps an agent decide what it wants to do. The boundary decides whether the action is admissible, records why, and produces evidence that can be verified after the fact. The argument is that regulated deployments should invest first in the boundary, then treat frameworks as replaceable callers.

Method

We review three classes of source material:

  1. Agent research that established tool use and reasoning-action loops, including ReAct and Toolformer.
  2. Evaluation and security work showing that agents are difficult to benchmark, expensive to compare, and vulnerable when external content is treated as instruction.
  3. Security and governance standards that already assume per-request authorization, provenance, and risk management.

The paper is not a benchmark of individual frameworks. It is a systems-design analysis: which responsibilities must stay outside the framework if an enterprise wants replayable, regulator-facing control over tool use?

Finding 1: frameworks optimize capability, not authority

ReAct made the reasoning-action loop legible: a model can interleave thought-like planning with calls into an environment. Toolformer showed that models can learn when and how to call APIs. Those papers are capability papers. They make the agent more useful by widening the action surface.

That same widening creates the authority problem. Once a model can operate a CRM, mailbox, browser, payment rail, or code repository, the hard question is no longer whether the agent can call the tool. The hard question is whether the caller, target, context, policy version, data classification, and approval state make this specific tool call permissible right now.

General-purpose frameworks usually represent that decision as middleware, callbacks, tool metadata, human-in-the-loop prompts, or logging hooks. Those mechanisms are useful extension points. They are not a sufficient authority plane unless they are deterministic, fail closed, policy-versioned, and independently replayable.

Finding 2: benchmark success does not imply operational fitness

Kapoor et al. argue that agent evaluations overemphasize accuracy and underweight cost, reproducibility, holdout quality, and real-world usefulness. That critique matters for framework selection because many frameworks are optimized for benchmark-visible behavior: more calls, more retries, larger traces, and more elaborate planning.

Enterprise operations care about a different set of metrics:

  • the percentage of consequential actions with a verifiable pre-dispatch verdict;
  • false-allow rate under policy drift, tool failure, or prompt-injection pressure;
  • escalation latency for actions requiring human approval;
  • replay success against the original policy bundle without a live vendor service;
  • evidence completeness when the model, framework, or connector is replaced.
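Several of these metrics fall out directly from boundary decision records. The sketch below assumes a hypothetical record schema (the field names are illustrative, not from any standard) and computes the first two metrics: pre-dispatch verdict coverage and false-allow rate against a later audit judgment.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DecisionRecord:
    """One boundary decision for a proposed tool call (hypothetical schema)."""
    action_id: str
    consequential: bool          # would the action cause an external effect?
    verdict: str                 # "allow" | "deny" | "escalate" | "" if missing
    audit_allowed: bool          # later audit judgment: should it have run?

def verdict_coverage(records):
    """Share of consequential actions that received a pre-dispatch verdict."""
    consequential = [r for r in records if r.consequential]
    if not consequential:
        return 1.0  # vacuously covered
    return sum(1 for r in consequential if r.verdict) / len(consequential)

def false_allow_rate(records):
    """Share of 'allow' verdicts that audit later judged impermissible."""
    allowed = [r for r in records if r.verdict == "allow"]
    if not allowed:
        return 0.0
    return sum(1 for r in allowed if not r.audit_allowed) / len(allowed)
```

Because the metrics read only the decision records, they stay comparable when the model, framework, or connector behind the records changes.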

An agent framework can improve task accuracy while worsening several of those operational metrics. A boundary architecture makes the metrics separable: framework teams can improve capability while security and compliance teams control authority.

Finding 3: prompt injection turns tool routing into a security boundary

Indirect prompt injection breaks the assumption that instructions arrive only from the user. Greshake et al. show how retrieved or browsed content can manipulate an LLM-integrated application, including whether other APIs are called. In agent systems, that means the framework’s tool router is not just orchestration code. It is part of the security boundary.

Treating the framework itself as the boundary creates a brittle trust base. The same component that plans the action, interprets untrusted content, selects tools, and formats arguments also decides whether the action should leave the system. A stronger pattern is to put a narrow, deterministic policy enforcement point after planning and before dispatch. The planner may be probabilistic. The authority check should not be.
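The placement can be sketched in a few lines. Everything here is illustrative (the names `ProposedAction`, `evaluate`, and `run_step` are not from any specific framework): the planner emits a proposed action, and a deterministic gate decides admissibility before anything is dispatched.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ProposedAction:
    tool: str
    caller: str
    args: dict = field(default_factory=dict)

# Toy policy: which (caller, tool) pairs are admissible. In practice this
# would be a versioned policy bundle, not an in-process constant.
ALLOWED = {("billing-agent", "crm.read"), ("billing-agent", "invoice.create")}

def evaluate(action: ProposedAction) -> str:
    """Deterministic allow/deny: no model call, no untrusted content."""
    return "allow" if (action.caller, action.tool) in ALLOWED else "deny"

def run_step(planner_output: ProposedAction, dispatch) -> str:
    """The planner may be probabilistic; this gate is not."""
    if evaluate(planner_output) != "allow":
        return f"blocked:{planner_output.tool}"  # fail closed: nothing dispatched
    return dispatch(planner_output)
```

Note that `evaluate` never sees retrieved or browsed content, so an injected instruction can at most change what the planner proposes, not what the gate admits.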

Boundary pattern

A production execution boundary should have five properties:

  1. Pre-dispatch evaluation. The boundary evaluates the proposed action before side effects occur.
  2. Explicit policy versioning. The decision binds to a specific policy snapshot, not a mutable label.
  3. Fail-closed behavior. On missing policy, stale policy, malformed input, connector drift, or verifier failure, the boundary denies or escalates by default.
  4. Portable receipts. The decision record can be verified without the original framework runtime.
  5. Caller neutrality. Any framework, model, or workflow engine can call the boundary through the same contract.
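The first four properties can be sketched together. This is a minimal illustration, not a reference design: it uses a canonical-JSON hash to pin the policy snapshot and an HMAC over the decision record as the receipt signature. A real deployment would likely use asymmetric signatures and a KMS-held key, and caller neutrality (property 5) would come from exposing `decide` behind a single service contract.

```python
import hashlib
import hmac
import json
from typing import Optional

BOUNDARY_KEY = b"demo-key"  # illustration only; hold real keys in a KMS

def policy_hash(policy: dict) -> str:
    """Property 2: bind the decision to an immutable policy snapshot."""
    return hashlib.sha256(json.dumps(policy, sort_keys=True).encode()).hexdigest()

def decide(action: dict, policy: Optional[dict]) -> dict:
    """Property 1: evaluate before any side effect. Property 3: fail closed."""
    if not policy or "allowed_tools" not in policy:
        verdict, p_hash = "deny", None  # missing/malformed policy denies by default
    else:
        verdict = "allow" if action.get("tool") in policy["allowed_tools"] else "deny"
        p_hash = policy_hash(policy)
    receipt = {"action": action, "verdict": verdict, "policy_hash": p_hash}
    body = json.dumps(receipt, sort_keys=True).encode()
    # Property 4: the receipt verifies with the key alone, no framework runtime.
    receipt["sig"] = hmac.new(BOUNDARY_KEY, body, hashlib.sha256).hexdigest()
    return receipt

def verify(receipt: dict) -> bool:
    body = json.dumps({k: v for k, v in receipt.items() if k != "sig"},
                      sort_keys=True).encode()
    expected = hmac.new(BOUNDARY_KEY, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, receipt["sig"])
```

The design choice worth noting is that the signature covers the policy hash, so a receipt cannot later be reattributed to a different policy version.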

This is analogous to zero-trust architecture in NIST SP 800-207: access decisions move from implicit network position to per-resource, per-request evaluation. Agent execution needs the same move. The relevant resource is not only data; it is the ability to cause an external effect.

Design implications

The framework should own task decomposition, memory ergonomics, model routing, intermediate observations, and developer experience. The boundary should own admissibility, evidence, approval state, and replay. Keeping those responsibilities separate has practical consequences:

  • framework traces become diagnostic artifacts, not the source of authority;
  • tool schemas become inputs to policy, not informal hints;
  • high-risk tools can require stronger approval without rewriting the agent;
  • a framework migration does not invalidate old receipts;
  • security review can focus on a smaller deterministic component.
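Receipt survival across framework migrations rests on replay being framework-free. As a sketch, assume a hypothetical receipt schema carrying the action, the verdict, and a hash of the policy snapshot: an auditor can then re-derive the verdict offline from the archived policy bundle alone.

```python
import hashlib
import json

def replay(receipt: dict, archived_policy: dict) -> bool:
    """Re-derive the verdict from the archived policy bundle.
    No framework runtime, model, or live vendor service is involved."""
    canonical = json.dumps(archived_policy, sort_keys=True).encode()
    if hashlib.sha256(canonical).hexdigest() != receipt["policy_hash"]:
        return False  # wrong snapshot: replay against it proves nothing
    rederived = ("allow"
                 if receipt["action"]["tool"] in archived_policy["allowed_tools"]
                 else "deny")
    return rederived == receipt["verdict"]
```

If the evaluation logic itself is versioned alongside the policy bundle, the same replay works years later, regardless of which framework originally proposed the action.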

This does not make frameworks unimportant. It makes them less privileged. A good framework is a caller into the authority layer, not the authority layer itself.

Limitations

The analysis does not claim that all agent frameworks are insecure. Some provide strong guardrail hooks, sandboxing integrations, human approval flows, or audit logs. The claim is narrower: unless those controls are externalized into a deterministic, replayable boundary, the enterprise still depends on framework-specific behavior for an authority decision.

The paper also does not evaluate performance overhead. Boundary latency must be low enough to sit on the hot path. That is an implementation constraint, not an argument against the architecture.

Conclusion

The agent is not the durable abstraction. The durable abstraction is the action boundary. Frameworks will change as models, tools, and developer workflows change. The authority question remains stable: before an AI system touches the world, what policy allowed it, what evidence proves it, and what happens when the answer cannot be computed?

Production agent systems should design around that question first.

References

  1. Yao et al. — ReAct: Synergizing Reasoning and Acting in Language Models
  2. Schick et al. — Toolformer: Language Models Can Teach Themselves to Use Tools
  3. Kapoor et al. — AI Agents That Matter
  4. Greshake et al. — Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection
  5. OWASP Top 10 for Agentic Applications
  6. NIST AI RMF Generative AI Profile, NIST AI 600-1
  7. NIST SP 800-207 — Zero Trust Architecture
  8. HELM OSS repository