concept · weight 10

AI Agents

LLM-driven systems that pursue a goal by interleaving reasoning, tool calls, and observations inside a loop — and that decide for themselves which step to take next.

AI Agents

Definition

An AI agent is an LLM placed inside a loop with tools, memory, and an explicit goal — the model decides which tool to call next, observes the result, updates its plan, and repeats until the goal is satisfied or a termination condition fires. Anthropic's "Building effective agents" draws a sharp line between workflows (LLMs orchestrated through code-defined paths) and agents (LLMs that dynamically direct their own processes and tool usage); a system is only "agentic" when control over the next step lives with the model rather than the developer. The minimal recipe is unchanged across vendors: one chat model + a tool schema + a runner that feeds results back as tool_result messages + a stop condition.

Why it matters

Agents are the unit of composition for any task that can't be solved by a single prompt — multi-step research, codebase refactors, ticket triage, data extraction across messy sources, browser or computer automation, anything that needs to react to intermediate observations. They turn a stateless completion API into a goal-driven worker, which is why almost every "AI feature" shipped since 2024 (coding assistants, customer-support copilots, browser agents like Operator, deep-research products) is some flavour of agent loop under the hood. Picking the right abstraction matters: Anthropic's research finding — repeated across LangChain, OpenAI, and crewAI post-mortems — is that simpler, composable patterns beat heavyweight frameworks for most production use cases, and that complexity should only be added when measurable evaluation says it pays off.

How it works

An agent loop is a small state machine that the model drives.

  1. System prompt + goal. The developer seeds the conversation with a role, constraints, and the user request. Tool schemas (JSON Schema for OpenAI/Anthropic, function signatures for SDKs) are passed alongside the messages so the model knows what's callable.
  2. Plan / act. The model emits either a final answer or a tool_use block — name + arguments. The classic ReAct pattern (Yao et al., 2022) interleaves a visible "Thought:" before each action so the trace is auditable; Toolformer (Schick et al., 2023) showed models can learn when to call tools without explicit scaffolding.
  3. Execute. The harness runs the tool — code interpreter, shell, HTTP call, vector search, sub-agent, MCP server — and returns a tool_result content block. MCP (Model Context Protocol) is the emerging interop standard: any MCP server is a drop-in tool surface for any MCP-aware agent.
  4. Observe / update. The result is appended to the message list; the model re-reads the trajectory, updates its plan, and emits the next action. Long traces get compacted, summarised, or offloaded to memory — a key-value store, a vector index, or a scratchpad file.
  5. Terminate. The loop ends when the model returns stop_reason: end_turn, when a max_turns budget is hit, when a guardrail trips, or when a human-in-the-loop step rejects an action.

Patterns layer on top: single-agent + tools (the default), router (one agent chooses among specialists), multi-agent debate / review (AutoGen's signature pattern), role-based crews (crewAI's planner → researcher → writer chain), graph-based stateful workflows (LangGraph's directed graph with checkpoints and time-travel), and sub-agents (Claude Code's Task tool, Codex's /agent) for context-window isolation and parallelism.

Frameworks land on different points of the trade-off curve:

  • Claude Agent SDK — safety-first, MCP-native, ships computer use; locked to Claude models.
  • OpenAI Agents SDK — clean handoff model, built-in tracing and guardrails; locked to OpenAI models.
  • LangGraph — fully model-agnostic, stateful graphs with checkpointing and time-travel debugging via LangSmith.
  • AutoGen / crewAI / LlamaIndex / Haystack — opinionated higher-level surfaces for multi-agent, role-based, document-centric, or pipeline-DAG patterns respectively.

Evaluation has matured alongside the runtimes. SWE-bench Verified (500 real GitHub issues) is the de-facto coding-agent benchmark — Claude Sonnet 4.5 leads at ~77 % as of 2026, up from 4 % three years earlier. Adversarial variants like SWE-ABS show the headline numbers drop ~15 points under strengthened test suites, so always pair a public benchmark with task-specific evals before trusting an agent in production.

Common pitfalls

  1. No termination condition. Multi-agent runs without max_turns or an explicit termination_condition will loop until token budgets explode. Always cap the loop and alarm on runaway cost.
  2. Reaching for a framework before the prompt works. A single well-scoped tool-use call often beats a 6-agent crew. Start with the model's native tool-use API; promote to a framework only when evals justify it.
  3. Vague tool descriptions. Tools are selected by the model from their description field. "Get weather" is worse than "Get current weather for a city; call this whenever the user asks about temperature, rain, or forecasts." Write descriptions from the model's perspective.
  4. Overlapping agent roles. Two agents with near-identical role/goal produce contradictory output. Each agent in a crew needs a clearly differentiated responsibility, or collapse them into one.
  5. Context-window poisoning. Long tool traces, retries, and verbose errors crowd out the task. Spawn a sub-agent (Claude Code Task, Codex sub-agent, LangGraph sub-graph) for sub-problems, and keep the parent's context lean.
  6. Skipping evaluation. Headline benchmark scores don't predict your workload. Build a small, task-specific eval set early and re-run it on every prompt or tool change — agents regress silently.
  7. Conflating agents with workflows. If every step is pre-determined, you don't have an agent — you have a chain. That's often better (cheaper, more predictable). Only adopt agency where dynamic decision-making genuinely helps.
  8. Not tracking per-session cost. An agent loop's cost is invisible until the monthly invoice. Tools like ccusage read the local JSONL transcripts of 15 agent CLIs (Claude Code, Codex, OpenCode, Amp, Droid, …) and roll up daily / per-session / per-5-hour-billing-block spend — wire one into a tmux pane or a statusline before your first long-running task.

Where to go next

Sibling concepts, tool-specific cheat sheets, and external references for going deeper.

Sources

References consulted while writing this concept page. Links open in a new tab.

  • Anthropic — Building Effective Agents — Source of the canonical workflows-vs-agents distinction and the "start simple, add complexity only when measured" principle used in the Definition and Why-it-matters sections.
  • Yao et al. — ReAct: Synergizing Reasoning and Acting in Language Models (arXiv 2210.03629) — Foundational paper for the interleaved Thought/Action/Observation loop described in How-it-works.
  • Schick et al. — Toolformer: Language Models Can Teach Themselves to Use Tools — Reference for self-supervised tool invocation, contrasted with explicit ReAct scaffolding.
  • SWE-bench Verified (Agentic Coding) Leaderboard — Source of the 2026 leaderboard standings (Claude Sonnet 4.5 ~77 %) cited in the evaluation discussion.
  • Morph — SWE-Bench Explained: Verified, Pro, and the 2026 Leaderboard — Background on the benchmark family, the 4 % → 80 % climb, and the SWE-ABS adversarial variant.
  • QubitTool — 2026 AI Agent Framework Showdown — Comparative source for Claude Agent SDK / OpenAI Agents SDK / LangGraph trade-offs (orchestration model, state, model-lock-in).
  • LangChain Docs — Comparison with Claude Agent SDK — Authoritative side-by-side on graph-based vs tool-chain agent architectures.
  • Model Context Protocol (MCP) — Spec home for the interop layer referenced as the emerging cross-vendor tool surface.