// the problem
The Problem

When LLMs are the wrong tool

There's a pattern I see constantly: a team reaches for GPT to solve a problem that's really a classification task, or a ranking problem, or a time-series forecast. The LLM sort of works, but it's slow, expensive, non-deterministic, and impossible to explain to stakeholders. Meanwhile, a gradient-boosted model trained on their own data would have been faster to build, cheaper to run, and easier to trust.

The trick is knowing which problems deserve which tools. Some genuinely need language understanding. Most need a model that's been trained on your data to recognize patterns that are specific to your business. I help teams make that distinction and then build the right thing.

// what these look like
What These Look Like

The spectrum

Custom models range from focused classifiers that answer one question well, to production prediction systems with real-time inference, monitoring, and continuous retraining. The right level of complexity depends on your data, your accuracy requirements, and how the predictions get used.

Focused models (straightforward)

A single model trained on your data to answer a specific question — will this customer churn, is this transaction fraudulent, which category does this belong to. Clean problem definition, solid feature engineering, one model doing one thing well. Often the highest ROI starting point.

Production prediction systems (moderate)

Models deployed behind APIs with real-time or batch inference, integrated into your application or decision-making workflow. Includes feature pipelines that compute inputs from your live data, model versioning, A/B testing infrastructure, and monitoring for data drift and performance degradation.

Multi-model orchestration (advanced)

Multiple models working together — a classifier that routes to specialized predictors, an ensemble that combines signals from different model types, or a pipeline where one model's output feeds another's input. Includes automated retraining triggers, champion/challenger deployment, and the infrastructure to manage multiple models in production without losing your mind.
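To make the routing pattern concrete, here is a minimal Python sketch of a classifier that routes each record to a specialized predictor. Everything in it is an assumption for illustration: the segment names, the field names (`seats`, `open_tickets`, `days_since_login`), and the hand-written scoring rules, which stand in for trained models.

```python
from typing import Callable, Dict


def route(record: dict) -> str:
    """Toy router. In a real system a trained classifier would pick the segment."""
    return "enterprise" if record.get("seats", 0) >= 100 else "self_serve"


def predict_enterprise(record: dict) -> float:
    # Hypothetical stand-in for a model specialized on enterprise accounts.
    return min(1.0, 0.1 + 0.02 * record.get("open_tickets", 0))


def predict_self_serve(record: dict) -> float:
    # Hypothetical stand-in for a model specialized on self-serve accounts.
    return min(1.0, 0.3 + 0.05 * record.get("days_since_login", 0) / 7)


# Registry of specialized predictors, keyed by the router's output.
PREDICTORS: Dict[str, Callable[[dict], float]] = {
    "enterprise": predict_enterprise,
    "self_serve": predict_self_serve,
}


def score(record: dict) -> dict:
    """Route the record, then score it with the matching predictor."""
    segment = route(record)
    return {"segment": segment, "score": PREDICTORS[segment](record)}
```

Keeping the predictors in a registry means a specialized model can be swapped or added without touching the routing code, which is one way to keep several models manageable in production.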

// examples of what I've built
Examples

What this looks like

Customer churn prediction
SaaS company wanted to identify at-risk accounts before they cancelled. Built a gradient-boosted model trained on behavioral signals — login frequency, feature usage patterns, support ticket velocity, billing changes — that scored accounts weekly. The customer success team used the scores to prioritize outreach, catching at-risk accounts weeks earlier than they were spotting them manually.
Task priority scoring engine
Operations team managing hundreds of daily tasks across a large team, with no systematic way to decide what to work on first. Built a scoring model that weighs urgency, expected value, deadline proximity, and dependency chains to produce a ranked priority queue — effectively turning a gut-feel process into a data-driven one. Phase one was rule-based; phase two replaced it with a trained ranking model as enough labeled data accumulated.
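A rule-based first phase like the one described above can be sketched as a weighted sum over the four signals. This is an illustrative sketch, not the engine that was delivered: the `Task` fields, the weights, the 14-day deadline horizon, and the cap of five blocked tasks are all assumed values.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Task:
    name: str
    urgency: float         # assumed to be normalized to 0..1 upstream
    expected_value: float  # assumed to be normalized to 0..1 upstream
    due: date
    blocks: int            # number of downstream tasks waiting on this one


# Hypothetical weights; in practice these would be tuned with the operations team.
WEIGHTS = {"urgency": 0.35, "value": 0.30, "deadline": 0.25, "deps": 0.10}


def deadline_proximity(due: date, today: date, horizon_days: int = 14) -> float:
    """1.0 when due or overdue, falling linearly to 0.0 at the horizon."""
    days_left = (due - today).days
    if days_left <= 0:
        return 1.0
    return max(0.0, 1.0 - days_left / horizon_days)


def priority(task: Task, today: date) -> float:
    deps = min(task.blocks, 5) / 5  # cap so one hub task cannot dominate
    return (
        WEIGHTS["urgency"] * task.urgency
        + WEIGHTS["value"] * task.expected_value
        + WEIGHTS["deadline"] * deadline_proximity(task.due, today)
        + WEIGHTS["deps"] * deps
    )


def ranked_queue(tasks: List[Task], today: date) -> List[Task]:
    """Highest priority first."""
    return sorted(tasks, key=lambda t: priority(t, today), reverse=True)
```

Scores from a rule like this, once reviewed by the team, can also serve as the labeled data a trained ranking model needs in a later phase.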
Anomaly detection for transaction monitoring
Fintech company needed to flag unusual transactions without drowning the review team in false positives. Built an isolation forest model augmented with domain-specific features that learned normal patterns per account type, flagging genuine outliers while keeping the false positive rate under 5% — down from 30%+ with their previous rule-based system.
Demand forecasting pipeline
E-commerce company with seasonal inventory challenges. Built a time-series forecasting system that combines historical sales data, promotional calendars, and external signals (weather, holidays, competitor pricing) to predict demand at the SKU level. Integrated directly into their inventory management system to trigger reorder alerts.
Lead scoring model
B2B company with a sales team spending equal time on every inbound lead regardless of quality. Trained a classifier on historical conversion data (firmographic attributes, engagement signals, source channel, behavioral patterns) to score new leads on likelihood to convert. The sales team focused on the top two tiers, and close rate improved meaningfully within the first quarter.

// under the hood
How It Works

How these are built

Building a model is the easy part. Building a model that works reliably on real data, integrates into your systems, and stays accurate over time — that's the actual job. Here's what I focus on at each layer.

Problem framing & data audit
Before any model is trained, I make sure we're solving the right problem. What exactly are you predicting? What data is available? Is there enough signal? What does the model's output feed into — a dashboard, an API, a human decision? Getting this wrong means building a technically impressive model that nobody uses.
Feature engineering
The part that matters most and gets the least attention. I spend time understanding your domain to design features that capture real business signals — not just throwing raw columns at a model and hoping. Good features often improve performance more than model selection does.
Model selection & training
I evaluate multiple approaches — logistic regression, gradient boosting (XGBoost, LightGBM), random forests, SVMs, neural networks — and pick based on your data characteristics, interpretability requirements, and latency constraints. No default to the fanciest architecture; the right answer is often the simplest model that meets your accuracy bar.
Evaluation & validation
Accuracy alone is usually the wrong metric. I set up evaluation frameworks that measure what actually matters for your use case: precision vs. recall tradeoffs, calibration quality, performance across subgroups, and behavior at decision boundaries. I use cross-validation, holdout sets, and temporal splits to make sure the model generalizes and isn't just memorizing your training data.
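Two of these ideas fit in a few lines of plain Python: precision and recall at a chosen decision threshold, and a temporal split that keeps future rows out of training. This is a sketch under assumptions; the function names and the `ts` field are hypothetical, and a real project would more likely use a library's metrics and splitters.

```python
from typing import List, Sequence, Tuple


def precision_recall(
    y_true: Sequence[int], scores: Sequence[float], threshold: float
) -> Tuple[float, float]:
    """Precision and recall when every score >= threshold is flagged positive."""
    tp = fp = fn = 0
    for y, s in zip(y_true, scores):
        flagged = s >= threshold
        if flagged and y:
            tp += 1
        elif flagged and not y:
            fp += 1
        elif not flagged and y:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def temporal_split(
    rows: List[dict], cutoff: str, time_key: str = "ts"
) -> Tuple[List[dict], List[dict]]:
    """Train on rows before the cutoff, test on rows at or after it.

    Assumes timestamps are ISO-8601 strings, which sort chronologically.
    """
    train = [r for r in rows if r[time_key] < cutoff]
    test = [r for r in rows if r[time_key] >= cutoff]
    return train, test
```

A random split lets the model see rows from after the period it is asked to predict, which tends to inflate measured performance. The temporal split avoids that by construction.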
Deployment & monitoring
Models go behind APIs (real-time) or into batch pipelines (scheduled), depending on your latency needs. I build in data drift detection, performance monitoring, and alerting so you know when the model's accuracy starts degrading, before your team notices the predictions feel off. I add retraining pipelines when the data warrants it.
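One common way to quantify drift in a single numeric input is the Population Stability Index (PSI), which compares the binned distribution of live values against a training-time baseline. The sketch below is one possible check, not necessarily the one used on a given project; the bin count and the `eps` floor are assumed values.

```python
import math
from typing import List, Sequence


def psi(
    expected: Sequence[float],
    actual: Sequence[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """Population Stability Index of `actual` against the `expected` baseline.

    Bins are equal-width over the baseline's range; out-of-range live values
    are clipped into the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant baseline

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor at eps so empty bins do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A frequently quoted rule of thumb treats PSI below roughly 0.1 as stable and above roughly 0.25 as a significant shift. Those cutoffs are conventions, so alert thresholds are better set from how the metric behaves on your own data.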

// how we work together
Working Together

The process

01. Scoping call
30 minutes, free. You describe the prediction problem you're trying to solve and what data you have. I'll tell you honestly whether a custom model is the right approach, what accuracy is realistically achievable, and whether simpler alternatives (rules, heuristics, an existing API) would get you 80% of the way there.
02. Data exploration & feasibility
I explore your data — volume, quality, feature availability, label distribution. This is where projects succeed or fail. I'll give you a clear-eyed assessment of whether there's enough signal to build a useful model, and if not, what data you'd need to collect. No point building on a weak foundation.
03. Build, evaluate, iterate
Feature engineering, model training, evaluation — in short cycles with checkpoints. You see performance metrics early and we decide together whether the model is good enough for your use case or needs more work. I don't hand you a model without showing you exactly how it performs and where it struggles.
04. Deploy & hand off
Model goes into production with an inference API or batch pipeline, monitoring dashboard, and documentation. I include a runbook covering retraining procedures, drift detection thresholds, and what to do when performance degrades — so your team can maintain it independently.

// pricing
Pricing

What this costs

It depends on the problem complexity, your data readiness, and how the model needs to be deployed. A focused classifier with clean training data is a smaller engagement than a multi-model system with real-time inference, monitoring, and automated retraining.

The scoping call is free and there's no obligation. I'll assess your data situation, tell you what's realistically achievable, and give you an honest recommendation on whether building a custom model is worth the investment for your use case.

// common questions
Common Questions

Frequently asked

Q: How much data do I need?
It depends on the problem. Some classification tasks work well with a few thousand labeled examples; others need tens of thousands. More important than volume is quality — clean labels, representative samples, and enough signal in your features. I'll assess this during the data exploration phase and be upfront if you don't have enough to build something reliable.
Q: What if we don't have labeled data?
There are options. Sometimes you can derive labels from historical outcomes (e.g., "did this customer churn within 90 days" is a label you already have, you just haven't structured it that way). Other times, we start with a rule-based system that generates initial labels, then train a model as enough reviewed data accumulates. I can help design the labeling strategy.
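As an illustration of deriving labels from historical outcomes, the 90-day churn example can be written as a small function over dates you likely already store. The function and parameter names are hypothetical, and it assumes you know each account's cancellation date, if any.

```python
from datetime import date, timedelta
from typing import Optional


def churn_label(
    snapshot: date,
    cancelled_on: Optional[date],
    as_of: date,
    window_days: int = 90,
) -> Optional[int]:
    """Label an account as seen on `snapshot`.

    Returns 1 if it cancelled within the window, 0 if the full window passed
    without a cancellation, and None when the row should not be used.
    """
    window_end = snapshot + timedelta(days=window_days)
    if cancelled_on is not None and cancelled_on <= snapshot:
        return None  # already cancelled at snapshot time, nothing to predict
    if cancelled_on is not None and cancelled_on <= window_end:
        return 1
    if as_of < window_end:
        return None  # outcome window not fully observed yet
    return 0
```

The two `None` cases matter: labeling accounts whose window has not finished as "did not churn" would quietly bias the training set toward negatives.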
Q: How do we know if the model is accurate enough?
We define success criteria upfront — before any training happens. What precision and recall tradeoff is acceptable? What's the cost of a false positive vs. a false negative? I evaluate models against these criteria on held-out data and show you exactly where the model performs well and where it struggles. You make the call on whether it's good enough.
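Once the cost of each error type is agreed, choosing a decision threshold becomes a small search over held-out data. The sketch below assumes a fixed cost per false positive and per false negative, and a hypothetical grid of candidate thresholds in steps of 0.05.

```python
from typing import Optional, Sequence


def total_cost(
    y_true: Sequence[int],
    scores: Sequence[float],
    threshold: float,
    cost_fp: float,
    cost_fn: float,
) -> float:
    """Total error cost on held-out data when scores >= threshold are flagged."""
    cost = 0.0
    for y, s in zip(y_true, scores):
        flagged = s >= threshold
        if flagged and not y:
            cost += cost_fp  # flagged a negative
        elif not flagged and y:
            cost += cost_fn  # missed a positive
    return cost


def best_threshold(
    y_true: Sequence[int],
    scores: Sequence[float],
    cost_fp: float,
    cost_fn: float,
    candidates: Optional[Sequence[float]] = None,
) -> float:
    """Candidate threshold with the lowest total cost (lowest wins on ties)."""
    if candidates is None:
        candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
    return min(
        candidates,
        key=lambda t: total_cost(y_true, scores, t, cost_fp, cost_fn),
    )
```

When a miss costs far more than a false alarm, the search settles on a lower threshold, and the reverse when false alarms are the expensive error. This is the precision and recall tradeoff expressed in business terms.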
Q: What happens when the model gets stale?
All models degrade over time as the underlying data distribution shifts. I build in monitoring that detects drift — changes in input distributions, drops in prediction confidence, degradation in measured performance — and alerting that tells you when it's time to retrain. For models where this is critical, I set up automated retraining pipelines.
Q: Can stakeholders understand how the model makes decisions?
Interpretability is a design choice, not an afterthought. If your use case requires explainability (regulated industries, executive buy-in, user-facing decisions), I choose model architectures that support it and build in feature importance analysis, SHAP explanations, and plain-language summaries of what's driving each prediction.
Q: Should we just use an LLM for this instead?
Maybe — it depends on the problem. LLMs are great when the task requires language understanding, when you don't have labeled training data, or when the problem is too ambiguous to define a clear target variable. But for well-defined prediction tasks with structured data, a custom model is almost always faster, cheaper, more accurate, and more explainable. I'll tell you which applies to your situation.

Want to work together?
30 minutes. No pitch. Just a conversation.
Book a consultation