The case for function calling over chain-of-thought agents
Most teams reaching for autonomous agents should be using structured function calling instead. It's more predictable, cheaper, and easier to debug. Here's when each approach wins.
There’s a pattern I keep seeing: a team decides they need an “AI agent” — something that can reason through a multi-step problem, use tools, and arrive at an answer autonomously. They reach for a chain-of-thought agent framework (LangChain agents, CrewAI, AutoGen), build something impressive in a demo, and then spend months debugging unpredictable behavior in production.
Most of the time, they should have used structured function calling instead.
The distinction
Function calling: The model receives a user query, decides which function(s) to call with which arguments, returns structured output, and your application code handles the orchestration. The model makes decisions; your code controls the flow.
Chain-of-thought agents: The model receives a goal, reasons about what to do, executes tools, observes results, reasons again, and continues until it decides it’s done. The model controls the flow.
The difference is who’s driving. With function calling, your code is in control and the model is an advisor. With agents, the model is in control and your code is a tool provider.
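The "who's driving" distinction is easiest to see in code. Here is a minimal sketch of the function calling pattern, where `fake_model` stands in for a real LLM call (a real client would return structured JSON in roughly this shape); everything else — dispatch, validation, error handling — is ordinary application code:

```python
# Function calling sketch: the model picks a function and arguments;
# application code owns dispatch, validation, and error handling.
# `fake_model` is a deterministic stand-in for a real LLM call.

def get_weather(city: str) -> str:
    return f"Sunny in {city}"

def get_time(city: str) -> str:
    return f"12:00 in {city}"

TOOLS = {"get_weather": get_weather, "get_time": get_time}

def fake_model(query: str) -> dict:
    # A real client would return structured output like:
    # {"name": "get_weather", "arguments": {"city": "Oslo"}}
    name = "get_weather" if "weather" in query else "get_time"
    return {"name": name, "arguments": {"city": "Oslo"}}

def handle(query: str) -> str:
    call = fake_model(query)          # the model decides *what* to call
    fn = TOOLS.get(call["name"])      # your code decides *how* it runs
    if fn is None:
        return "Sorry, I can't help with that."
    try:
        return fn(**call["arguments"])
    except TypeError:
        return "The model produced invalid arguments."

print(handle("what's the weather?"))  # → Sunny in Oslo
```

Note that every failure mode — unknown function, bad arguments — has an explicit branch your code controls.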
Why function calling wins for most production use cases
Predictability
With function calling, the control flow is deterministic even when the model's choices aren't. Given the same input, the model might choose different functions, but your code always executes them the same way, always checks for errors the same way, and always formats the response the same way.
Agents have non-deterministic control flow. The model might take 2 steps or 20. It might call tools in a sensible order or a bizarre one. It might get stuck in a loop. Every run is an adventure.
For production systems where you need to guarantee SLAs, handle errors gracefully, and explain to users what happened — predictability matters enormously.
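For contrast, here is a sketch of the agent loop, where the model controls the flow and a hard step cap is the main guardrail against runaway runs. The `scripted_model` below is an illustrative stand-in for real LLM reasoning:

```python
# Agent loop sketch: the model decides the next action at every step;
# your code only executes tools and enforces a step limit.
MAX_STEPS = 10

def run_agent(goal, model_step, tools):
    observations = []
    for _ in range(MAX_STEPS):
        action = model_step(goal, observations)   # model decides next move
        if action[0] == "done":
            return action[1]
        _, name, args = action
        observations.append(tools[name](**args))  # execute, observe, repeat
    return "Step limit reached without an answer."

# A scripted "model" that searches once, then answers.
def scripted_model(goal, observations):
    if not observations:
        return ("call", "search", {"query": goal})
    return ("done", f"Answer based on: {observations[-1]}")

tools = {"search": lambda query: f"results for {query!r}"}
print(run_agent("compare vendors", scripted_model, tools))
```

With a real model in place of `scripted_model`, the number of iterations, the tool order, and the stopping point are all up to the model — which is exactly the unpredictability described above.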
Cost
A function call interaction is typically 1-3 LLM calls. An agent interaction is 5-15+ calls (sometimes many more). At scale, this 5x cost multiplier adds up fast.
One client was spending $8K/month on an agent-based workflow that could have been implemented as a three-step function calling pipeline for $1.5K/month. Same output quality, at roughly a fifth of the cost.
Debuggability
When a function calling pipeline produces wrong output, you can trace it: what function was called, what arguments were passed, what the function returned. The bug is always in one of three places: the function selection, the arguments, or the function implementation.
When an agent produces wrong output, the reasoning chain might be 12 steps long, with the error introduced at step 4 but not manifesting until step 11. Debugging requires reading the model’s internal monologue and understanding why it made each decision. This is possible but significantly harder.
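That three-place property is worth making concrete. A thin tracing wrapper — one structured record per call — is often all a function calling pipeline needs to localize a bug; the names below are illustrative:

```python
# Per-call tracing sketch for a function calling pipeline: record the
# function name, arguments, and result for every call, so any bug can
# be pinned to selection, arguments, or implementation.
trace = []

def traced_call(fn, name, args):
    result = fn(**args)
    trace.append({"function": name, "args": args, "result": result})
    return result

# Hypothetical tool for illustration.
def lookup_order(order_id):
    return {"order_id": order_id, "status": "shipped"}

traced_call(lookup_order, "lookup_order", {"order_id": 42})
print(trace[0]["function"], trace[0]["result"]["status"])  # → lookup_order shipped
```

Reading this trace is inspecting data; reading an agent's reasoning chain is interpreting prose.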
Latency
Function calling: 1-3 sequential LLM calls, typically 2-6 seconds total.
Agents: 5-15 sequential calls, typically 15-60 seconds total. Users notice.
When agents actually make sense
Agents aren’t always wrong. They win in specific scenarios:
Exploratory tasks: When you genuinely don’t know the steps in advance. Research tasks, complex data analysis, debugging — where the right approach depends on intermediate results.
Low-frequency, high-value tasks: When the task runs rarely enough that cost and latency don’t matter, but the value of a good result is high. A weekly competitive analysis, not a per-request API response.
Human-in-the-loop workflows: When a human is watching the agent work and can intervene if it goes off track. Interactive coding assistants are a good example — the human provides course correction.
The hybrid approach
What I recommend for most production systems: function calling for the hot path, agents for offline tasks.
User request → Function calling (fast, predictable, cheap)
↓
[If complex/ambiguous]
↓
Background agent task (slow, flexible, expensive)
↓
Async notification when complete
The user gets an immediate response from the function calling layer. If the query requires deeper reasoning, a background agent handles it and the user gets notified when it’s done. Best of both worlds.
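The routing layer in this diagram can be very small. Here is a sketch under stated assumptions — the complexity heuristic and helper functions are illustrative placeholders, not a prescription:

```python
# Hybrid routing sketch: simple queries get an immediate function
# calling answer; complex ones are queued for a background agent.
# `looks_simple` and `answer_with_functions` are illustrative stubs.
import queue

background_jobs = queue.Queue()

def looks_simple(query):
    return len(query.split()) < 8  # crude placeholder heuristic

def answer_with_functions(query):
    return f"Quick answer to: {query}"

def handle_request(query):
    if looks_simple(query):
        return answer_with_functions(query)  # hot path: fast and cheap
    background_jobs.put(query)               # a background agent drains this
    return "Working on it - you'll be notified when it's done."

print(handle_request("order status?"))
```

In a real system the queue would be a durable job store and the heuristic might itself be a cheap classifier call, but the shape is the same: the expensive, slow path is opt-in and asynchronous.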
The practical test
Before reaching for an agent framework, ask:
- Can I enumerate the possible actions? If yes → function calling.
- Is the control flow known in advance? If yes → function calling.
- Does it need to run in under 10 seconds? If yes → function calling.
- Is the task exploratory with unknown steps? If yes → maybe an agent.
- Is a human watching and able to intervene? If yes → agent is fine.
Most production use cases hit “yes” on questions 1-3. Most teams building agents are solving a function calling problem with a more complex tool than they need.
Keep it simple. Your on-call self at 2am will thank you.