
Most of your AI problems don't need an LLM

Teams default to LLMs for everything now. But for classification, ranking, and prediction tasks with labeled data, classical ML is often faster, cheaper, and more reliable.


Somewhere over the last couple of years, using an LLM shifted from a clever workaround to the default starting point. Ask a team how they’re approaching a new AI problem, and the answer is almost always the same: “We’re just sending it to an LLM.” It’s not that the models don’t work. It’s that we’ve stopped asking whether they’re the right tool for the job.

I worked with a team recently that wanted to route about forty thousand support tickets into seven categories. Their prototype used an LLM with a straightforward prompt. It hit roughly 89% accuracy, cost a few hundred dollars a month in API calls, and took a second or two per ticket. It worked well enough that nobody questioned it—until they wanted to run it in real time as tickets came in.

We swapped it out for a linear SVM with TF-IDF features. Training took an afternoon. Accuracy jumped to 94%. Inference dropped to under ten milliseconds. The whole model fit in a two-megabyte file and ran on a single CPU. It wasn’t a breakthrough. It was just a better match for the shape of the problem.
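For a sense of how small that setup really is, here is a minimal sketch of a TF-IDF + linear SVM classifier in scikit-learn. The ticket texts and category names are invented for illustration; the real system trained on the team's forty thousand labeled tickets.

```python
# Sketch of the classical setup: TF-IDF features feeding a linear SVM.
# Toy data stands in for the real labeled ticket corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

tickets = [
    "invoice shows a late fee I was not expecting",
    "charged twice for the same subscription",
    "app crashes when I open the settings page",
    "error 500 when uploading a file",
    "how do I reset my password",
    "cannot log in after changing my email",
]
labels = ["billing", "billing", "bug", "bug", "account", "account"]

# One pipeline object: vectorizer and classifier train and predict together.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(tickets, labels)

print(model.predict(["unexpected late fee on my invoice"]))  # → ['billing']
```

The entire "architecture" is two components in a pipeline, and inference is a single in-process function call, which is where the sub-ten-millisecond latency comes from.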

I keep seeing this same pattern. Teams reach for an LLM because it’s familiar, because it sounds impressive, or because nobody on the team has worked with classical machine learning before. The industry narrative quietly shifted from “LLMs are incredible for certain tasks” to “LLMs are the answer,” and that’s just not how engineering works.

It makes sense when you think about what’s actually happening under the hood. An LLM is a general-purpose reasoning engine. When you ask it to classify text, it’s parsing language at a massive scale, mapping relationships across billions of parameters, and estimating the most likely label. That’s wildly overkill when you have labeled training data and a fixed set of categories. A classical model doesn’t need to “understand” language. It just learns that tickets containing “invoice” and “late fee” statistically map to billing. Simpler models trained on domain-specific data have outperformed general models on narrow tasks for decades. LLMs didn’t change that; they just made it easier to overlook.

Then there’s the quiet cost of complexity.

A traditional model in production is almost boring. You train it, serialize it, load it at startup, and call predict(). It doesn’t hallucinate. It doesn’t have rate limits or suddenly change behavior after an upstream update. It doesn’t need prompt versioning, output parsing, or evaluation harnesses. It’s a function: numbers in, numbers out.
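That whole lifecycle fits in a few lines. A sketch, assuming scikit-learn and joblib, with a toy model and an illustrative file name:

```python
# The boring production loop: train once, serialize, load at startup, predict.
# The model, data, and file name are all illustrative.
import joblib
from sklearn.linear_model import LogisticRegression

# Offline: train and serialize.
X_train = [[0.0], [1.0], [2.0], [3.0]]
y_train = [0, 0, 1, 1]
clf = LogisticRegression().fit(X_train, y_train)
joblib.dump(clf, "model.joblib")

# At service startup: load once, then serve predictions as plain function calls.
model = joblib.load("model.joblib")
print(model.predict([[2.5]]))
```

There is no prompt to version and no external dependency at inference time: the serialized file is the entire deployment artifact.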

An LLM pipeline, by contrast, brings a whole ecosystem of moving parts. You need prompt management. You need fallback logic for when the API is slow or down. You need cost monitoring, data privacy guardrails, and regression testing every time the provider tweaks the model. All of that overhead is completely justified when you’re solving a problem that genuinely requires language understanding or generation. It’s just friction when a random forest would handle it cleanly.
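To make that overhead concrete, here is just one of those moving parts in isolation: a call wrapper with retries and a deterministic fallback. Everything here is a hypothetical stand-in; `classify_with_llm` represents a real API call, and the simulated outage shows the path you have to engineer for.

```python
# One moving part of an LLM pipeline: retry with backoff, then fall back.
# `classify_with_llm` is a stand-in for a real provider API call.
import time


class LLMUnavailable(Exception):
    pass


def classify_with_llm(text: str) -> str:
    raise LLMUnavailable("provider timeout")  # simulate an outage


def classify_with_fallback(text: str, retries: int = 2, backoff: float = 0.01) -> str:
    for attempt in range(retries):
        try:
            return classify_with_llm(text)
        except LLMUnavailable:
            time.sleep(backoff * (2 ** attempt))  # exponential backoff
    return "uncategorized"  # deterministic fallback keeps the queue moving


print(classify_with_fallback("invoice shows a late fee"))  # → uncategorized
```

And this is before prompt versioning, cost tracking, or regression tests; each of those is another wrapper like this one.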

It’s easy to see why teams default to LLMs anyway. They dramatically lowered the barrier to entry. If you’ve never trained a model, calling an HTTP endpoint that takes text and returns text feels natural. Classical machine learning requires thinking about feature engineering, train/test splits, cross-validation, and metric selection. That’s a different kind of work, and it’s perfectly fine that not every team starts with it. But if you’re embedding AI into a product long-term, that toolkit is worth learning. The learning curve isn’t as steep as it seems, and the payoff is immediate: faster, cheaper, more predictable systems.

The way I’ve started thinking about it is less about picking a model and more about matching the tool to the problem.

If you have labeled data and a clear target—classification, regression, ranking, anomaly detection—start simple. Try a rules-based baseline. Then logistic regression. Then a random forest or gradient-boosted trees. Measure accuracy at each step. You’ll often be surprised how far you get before things need to get complicated.
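That escalation ladder can be sketched in one script. The data and the keyword rule below are invented for illustration; the point is the shape of the loop—each rung is measured against the same held-out set before you reach for anything heavier.

```python
# The escalation ladder on toy data: rules baseline, then logistic
# regression, then a random forest, measuring accuracy at each rung.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

train_texts = [
    "invoice shows a late fee",
    "refund for a duplicate charge on my invoice",
    "late fee applied twice this cycle",
    "billing cycle charged a day early",
    "app crashes on startup",
    "cannot reset my password",
    "dark mode toggle does nothing",
    "file upload fails with an error",
]
train_labels = ["billing"] * 4 + ["other"] * 4

test_texts = [
    "unexpected fee on my latest invoice",  # billing
    "double charged this month",            # billing, but no keyword hit
    "login button does nothing",            # other
    "export to csv throws an error",        # other
]
test_labels = ["billing", "billing", "other", "other"]

# Rung 1: a rules-based baseline.
def rules(text: str) -> str:
    return "billing" if ("invoice" in text or "fee" in text) else "other"

rules_acc = accuracy_score(test_labels, [rules(t) for t in test_texts])
print("rules ", rules_acc)  # misses "double charged this month"

# Rungs 2 and 3: classical models over TF-IDF features.
accs = {}
for name, clf in [("logreg", LogisticRegression()),
                  ("forest", RandomForestClassifier(random_state=0))]:
    pipe = make_pipeline(TfidfVectorizer(), clf).fit(train_texts, train_labels)
    accs[name] = accuracy_score(test_labels, pipe.predict(test_texts))
    print(name, accs[name])
```

Each rung is a few lines; if the rules baseline is already good enough, you stop there.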

If you need nuance but don’t need open-ended generation, look at smaller, fine-tuned models. A BERT-class model can capture language patterns without the latency, cost, and unpredictability of a full LLM pipeline.

Only when you actually need reasoning, drafting, summarization, or conversational flexibility should you reach for GPT, Claude, or Gemini. And even then, it’s rarely the whole system. The best architectures I’ve seen use LLMs as a specialized component. The heavy lifting—scoring, routing, filtering, ranking—runs on lightweight, deterministic models. The LLM handles the parts that actually require it: interpreting ambiguity, generating text, or bridging gaps where data is too messy for traditional approaches.
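That hybrid shape is easy to sketch: a cheap classifier handles the confident cases, and only low-confidence inputs escalate. The training texts, the threshold, and `ask_llm` are all illustrative—`ask_llm` stands in for a real API call.

```python
# Sketch of the hybrid architecture: a lightweight scorer routes the
# confident cases; ambiguous ones escalate to an LLM (stubbed here).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "invoice shows a late fee", "duplicate charge on my invoice",
    "refund the late fee please", "app crashes on startup",
    "file upload fails with an error", "settings page will not load",
]
labels = ["billing", "billing", "billing", "tech", "tech", "tech"]
scorer = make_pipeline(TfidfVectorizer(), LogisticRegression(C=10)).fit(texts, labels)


def ask_llm(text: str) -> str:
    """Hypothetical stand-in for a real LLM API call."""
    return "needs-llm-review"


def route(text: str, threshold: float = 0.6) -> str:
    probs = scorer.predict_proba([text])[0]
    if probs.max() >= threshold:
        return scorer.classes_[probs.argmax()]  # cheap, deterministic path
    return ask_llm(text)  # escalate only the ambiguous cases


print(route("late fee on my invoice"))
print(route("asdf qwerty"))  # no signal, so this one escalates to the LLM
```

Most traffic never touches the LLM, which is where the cost and latency wins come from; the expensive model only sees the tickets the cheap one can't place.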

None of this is a knock on large language models. They’re genuinely remarkable. But treating them as a universal hammer has made a lot of problems harder than they need to be. Good AI engineering isn’t about using the newest tool. It’s about using the simplest tool that actually solves the problem.