The Problem

The gap between demo and production

Getting an LLM to do something impressive in a notebook takes an afternoon. Getting it to do that same thing reliably, at scale, within your latency budget, at a cost you can afford, with error handling that doesn't break your UX — that takes real engineering. Most teams underestimate this gap by months.

The challenges aren't the ones you see in tutorials. They're the ones you discover at 2am: the model returns malformed JSON and your frontend crashes. Latency spikes to 8 seconds and users abandon the feature. Costs hit $40k/month because nobody optimized the prompts. The model hallucinates a policy your company doesn't have and a customer screenshots it.

I've shipped LLM features used by hundreds of thousands of users in regulated environments. I know where the landmines are, and I build around them from day one.

What These Look Like

The spectrum

LLM features range from straightforward API integrations to complex multi-step workflows with model orchestration. The right approach depends on your product, your users, and how much you're willing to invest in getting it right.

Single-call features (straightforward)

One LLM call, one output. Summarize this document, draft a response, classify this input, extract these fields. Simple to reason about, fast to ship, and often the right starting point. The engineering isn't in the prompt — it's in the error handling, output validation, caching, and UX around it.

Orchestrated workflows (moderate)

Multi-step features where the LLM is one part of a larger pipeline — retrieve context, call the model, validate the output, maybe call it again with corrections, then format and deliver. Includes chained prompts, conditional logic, parallel calls, and graceful degradation when any step fails. This is where most real product features live.
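As a rough illustration of that retrieve, call, validate, retry, degrade loop, here is a minimal sketch. The names `run_pipeline`, `retrieve`, and `call_model` are hypothetical stand-ins, and the single `answer` field is an assumed output contract, not a prescribed one.

```python
import json
from typing import Callable


def run_pipeline(
    question: str,
    retrieve: Callable[[str], str],
    call_model: Callable[[str], str],
    max_attempts: int = 2,
) -> dict:
    """Retrieve context, call the model, validate, retry with a correction, then degrade."""
    context = retrieve(question)
    prompt = (
        f"Context:\n{context}\n\nQuestion: {question}\n"
        "Reply as JSON with an 'answer' key."
    )
    for _ in range(max_attempts):
        raw = call_model(prompt)
        try:
            data = json.loads(raw)
            if isinstance(data, dict) and isinstance(data.get("answer"), str):
                return {"ok": True, "answer": data["answer"]}
            problem = "missing string field 'answer'"
        except json.JSONDecodeError as exc:
            problem = f"invalid JSON ({exc.msg})"
        # Feed the validation failure back so the next attempt can correct it.
        prompt += (
            f"\n\nYour previous reply was rejected: {problem}. "
            "Reply with valid JSON only."
        )
    # Graceful degradation: a result the UI can render instead of an exception.
    return {"ok": False, "answer": None}
```

The caller checks `ok` and shows a fallback state when it is false, so a bad model response never reaches the frontend as unparsed text.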

Multi-model systems (advanced)

Different models for different tasks — a fast, cheap model for classification and routing, a capable model for generation, a specialized model for extraction. Smart routing based on query complexity, cost constraints, or latency requirements. Model fallbacks so the feature degrades gracefully instead of breaking when a provider has an outage.

Examples

This looks like

In-app document summarization
SaaS product where users upload long documents and need quick summaries before deciding to read further. Built a summarization feature that handles documents of 100+ pages, produces structured summaries (key points, action items, decisions), streams the output for perceived speed, and caches results so repeated views are instant.
AI-assisted note drafting
Users spending 20–30 minutes writing notes after each session. Built a feature that generates structured draft notes from session data, pre-filling the relevant fields while leaving the user in full control to edit before saving. Framed as a "suggested draft" — the AI does the first pass, the human does the final pass. Reduced average note completion time significantly.
Intelligent form pre-fill
Complex intake forms where 60–70% of the information already exists somewhere in the system. Built an LLM feature that reads existing records, extracts relevant fields, and pre-populates the form — with confidence indicators so users know which fields to double-check. Turned a 15-minute data entry task into a 3-minute review task.
Conversational search interface
Enterprise product with powerful but complex filtering — users who weren't power users couldn't find what they needed. Added a natural language search layer that translates queries like "show me all overdue items assigned to the east coast team" into the underlying filter syntax, with a preview of the interpreted query so users can verify before executing.
Content generation with brand controls
Marketing team generating product descriptions, email copy, and social posts across dozens of product lines. Built a generation feature with brand voice guardrails, tone controls, length constraints, and a review workflow. The system references product specs and style guides to stay on-brand — not generic AI copy, but copy that sounds like the brand.
How It Works

How these are built

The prompt is 10% of the work. The other 90% is everything that makes the feature production-ready — model selection, cost control, latency optimization, error handling, output validation, and the UX decisions that determine whether users actually trust and adopt the feature.

Prompt engineering & output design
Structured outputs with schema validation so the model returns data your frontend can actually use. Prompt design that handles edge cases, minimizes hallucination, and produces consistent results across the range of inputs your users will throw at it. Not a single prompt — usually a prompt library with variants for different contexts.
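To make "schema validation" concrete, here is a minimal standard-library sketch. The `Summary` shape and the `parse_summary` helper are assumptions for illustration; a real feature would define its own fields, and many teams use a validation library instead of hand-written checks.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Summary:
    key_points: list[str]
    action_items: list[str]


def parse_summary(raw: str) -> Summary:
    """Parse model output and raise ValueError unless it matches the expected shape."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc.msg}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    fields = {}
    for name in ("key_points", "action_items"):
        value = data.get(name)
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            raise ValueError(f"'{name}' must be a list of strings")
        fields[name] = value
    return Summary(**fields)
```

Anything that fails the check is rejected at the boundary, so the frontend only ever receives a `Summary` with the fields it expects.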
Model selection & routing
Not every request needs GPT-4. I evaluate models against your specific use case — accuracy, speed, cost — and build routing logic that sends simple requests to fast/cheap models and complex requests to capable ones. This is often where the biggest cost savings come from.
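A routing rule can start out very simple. The sketch below is a heuristic under stated assumptions: the model names, the keyword list, and the word-count threshold are all hypothetical placeholders that would be tuned against evals for the actual use case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str
    reason: str


# Hypothetical model names and thresholds; tune both against your own evals.
FAST_MODEL = "small-fast-model"
CAPABLE_MODEL = "large-capable-model"
REASONING_HINTS = ("why", "compare", "explain", "analyze", "step by step")


def choose_model(request: str, max_simple_words: int = 40) -> Route:
    """Send short, simple requests to the cheap model and everything else upmarket."""
    text = request.lower()
    if any(hint in text for hint in REASONING_HINTS):
        return Route(CAPABLE_MODEL, "reasoning keyword")
    if len(text.split()) > max_simple_words:
        return Route(CAPABLE_MODEL, "long input")
    return Route(FAST_MODEL, "short, simple request")
```

Returning the reason alongside the model makes routing decisions easy to log and audit when costs or quality shift.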
Latency & cost optimization
Response streaming so users see output immediately. Semantic caching so identical or similar requests don't hit the model twice. Prompt optimization to minimize token usage without sacrificing quality. Parallel execution where the workflow allows it. I've seen teams cut LLM costs by 50–70% with these techniques.
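The caching idea can be sketched as follows. This is a toy: word-overlap (Jaccard) similarity stands in for the embedding similarity a real semantic cache would use, and the `SimilarityCache` class and its threshold are illustrative assumptions.

```python
import re
from typing import Optional


class SimilarityCache:
    """Toy cache that returns a stored response when a new prompt is similar enough.

    Word-overlap (Jaccard) similarity is a stand-in for embedding similarity.
    """

    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold
        self._entries = []  # list of (token set, cached response) pairs

    @staticmethod
    def _tokens(text: str) -> frozenset:
        return frozenset(re.findall(r"[a-z0-9]+", text.lower()))

    def get(self, prompt: str) -> Optional[str]:
        query = self._tokens(prompt)
        for tokens, response in self._entries:
            union = query | tokens
            if union and len(query & tokens) / len(union) >= self.threshold:
                return response
        return None

    def put(self, prompt: str, response: str) -> None:
        self._entries.append((self._tokens(prompt), response))
```

A caller checks `get` before hitting the model and calls `put` after a successful response. The threshold is the main knob: too low and users get answers to a different question, too high and the cache rarely hits.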
Error handling & fallbacks
The model will time out. The provider will have an outage. The output will occasionally be malformed. I build for all of these — retry logic, model fallbacks, output validation with graceful degradation, and UX that communicates clearly when the AI couldn't complete the task rather than showing a broken page.
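Retry logic and model fallbacks can be combined in one small wrapper. In this sketch, `call_with_fallbacks` and the provider callables are hypothetical; real code would catch the specific timeout and server errors raised by the SDK in use rather than a bare `Exception`.

```python
import time
from typing import Callable, Optional, Sequence


def call_with_fallbacks(
    prompt: str,
    providers: Sequence[Callable[[str], str]],
    retries_per_provider: int = 2,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Try each provider in order, retrying with exponential backoff.

    Returns None when every provider fails, so the caller can show a clear
    "couldn't complete this" state instead of a broken page.
    """
    for provider in providers:
        for attempt in range(retries_per_provider):
            try:
                return provider(prompt)
            except Exception:  # assumption: narrow this to the SDK's error types
                # Back off before retrying the same provider; skip the wait
                # after its final attempt and move on to the next provider.
                if attempt < retries_per_provider - 1:
                    sleep(base_delay_s * 2 ** attempt)
    return None
```

Passing `sleep` as a parameter keeps the function testable without real delays.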
UX & trust patterns
How the feature is presented matters as much as how it works technically. Confidence indicators, edit-before-submit flows, "AI suggested" labels, regenerate buttons, human override controls. The goal is a feature users trust enough to actually use daily — not a novelty they try once and ignore.
Working Together

The process

01
Scoping call
30 minutes, free. You show me your product, describe the feature you want to build, and walk me through the user workflow it's meant to improve. I'll tell you what's realistic, what the technical approach should be, and where teams typically get burned on similar features.
02
Technical spec & prototype
I write a spec covering the feature design, model selection, prompt architecture, integration points, and expected cost/latency profile. For complex features, I build a working prototype first — enough to validate the approach with real data before committing to full production engineering.
03
Build & integrate
I build the feature in your codebase — not in a separate repo that needs porting later. Short cycles with working builds along the way. Your team reviews code as it's written, so there's no handoff shock. The feature gets tested against real user scenarios throughout, not just at the end.
04
Ship, monitor, iterate
Feature goes live with monitoring, cost tracking, and eval infrastructure in place. I stick around for the first week or two of real usage to tune prompts, adjust model routing, and address edge cases that only surface with real users. Then handoff with documentation.
Pricing

What this costs

It depends on the feature complexity, your codebase, and how much infrastructure (caching, routing, monitoring) needs to be built around it. A single-call feature with clean integration points is a smaller engagement than a multi-model orchestrated workflow with streaming UI and eval harness.

The scoping call is free. I'll assess the feature you want to build, give you a realistic estimate of timeline and complexity, and tell you if there's a simpler approach that would get your users 80% of the value in half the time.

Common Questions

Frequently asked

Q: How much will the LLM API cost to run this feature?
I model this during the spec phase — estimated tokens per request, expected request volume, model pricing. Then I build with cost optimization from the start: smaller models where possible, semantic caching, prompt optimization, smart routing. I've seen teams go from "this will cost us $50k/month" to "$8k/month" with the same quality level. You'll have visibility into per-request costs from day one.
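The cost model itself is simple arithmetic. A sketch, assuming per-million-token pricing and a flat cache hit rate; `monthly_cost_usd` and every number passed to it are illustrative, not real provider prices.

```python
def monthly_cost_usd(
    requests_per_month: int,
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
    cache_hit_rate: float = 0.0,
) -> float:
    """Estimate monthly spend from tokens per request, request volume, and pricing."""
    # Requests served from cache never reach the model, so they are not billed.
    billable_requests = requests_per_month * (1.0 - cache_hit_rate)
    per_request_usd = (
        input_tokens * input_price_per_mtok + output_tokens * output_price_per_mtok
    ) / 1_000_000
    return billable_requests * per_request_usd
```

With made-up inputs of 1,000,000 requests a month, 2,000 input and 500 output tokens per request, and prices of $3 and $15 per million tokens, the estimate is $13,500 a month. A 50% cache hit rate halves that to $6,750.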
Q: What about latency? Our users won't wait 10 seconds.
Response streaming is the first line of defense — users see output appearing immediately, even if the full response takes a few seconds. Beyond that: model routing (smaller/faster models for simpler tasks), caching, parallel execution, and prompt optimization all contribute. For most features, perceived latency under 2 seconds is achievable.
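Parallel execution is the easiest of these to show in a few lines. A minimal sketch using `asyncio`, where `run_parallel` and the async `call_model` callable are hypothetical stand-ins for independent model calls within one feature:

```python
import asyncio
from typing import Awaitable, Callable, Dict, Optional


async def run_parallel(
    prompts: Dict[str, str],
    call_model: Callable[[str], Awaitable[str]],
    timeout_s: float = 5.0,
) -> Dict[str, Optional[str]]:
    """Run independent model calls concurrently, each with its own timeout."""

    async def one(prompt: str) -> Optional[str]:
        try:
            return await asyncio.wait_for(call_model(prompt), timeout=timeout_s)
        except asyncio.TimeoutError:
            # A slow call yields None instead of delaying or failing the others.
            return None

    results = await asyncio.gather(*(one(p) for p in prompts.values()))
    return dict(zip(prompts.keys(), results))
```

Total wall-clock time is roughly that of the slowest call, capped by the timeout, rather than the sum of all calls.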
Q: What happens when OpenAI / Anthropic has an outage?
Your feature shouldn't go down because a provider does. I build with fallback models — if the primary model is unavailable, the feature degrades to a secondary model or a cached response rather than showing an error. The UX handles this gracefully so users aren't affected. For mission-critical features, I can set up self-hosted inference as the fallback layer.
Q: How do we prevent hallucinations?
You can't eliminate them entirely, but you can reduce them to near-zero for most use cases. Grounding the model in your actual data (RAG), constraining outputs to structured schemas, validation against known facts, confidence scoring, and UX that makes it clear the output is a suggestion the user should verify. The architecture choice depends on how high-stakes the output is.
Q: Will we be locked into a specific model provider?
No. I build a model abstraction layer so you can swap providers without rewriting the feature. This is important both for negotiating pricing and for taking advantage of new models as they're released. Your prompts, routing logic, and evaluation infrastructure work across providers.
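One common shape for such an abstraction layer is a small interface that feature code depends on, with one adapter per provider. In this sketch the `ChatModel` protocol, the `EchoModel` adapter, and `summarize` are all hypothetical; a real adapter would wrap a vendor SDK call inside `complete`.

```python
from dataclasses import dataclass
from typing import Protocol


class ChatModel(Protocol):
    """The only surface that feature code is allowed to depend on."""

    def complete(self, system: str, user: str) -> str: ...


@dataclass
class EchoModel:
    """Stand-in provider adapter; a real one would wrap a vendor SDK call."""

    name: str

    def complete(self, system: str, user: str) -> str:
        return f"[{self.name}] {user}"


def summarize(model: ChatModel, text: str) -> str:
    """Feature code: unaware of which provider sits behind the interface."""
    return model.complete("Summarize the text in one sentence.", text)
```

Swapping providers then means passing a different adapter to `summarize`, with no change to the feature code.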
Q: Can my team maintain and iterate on this after handoff?
That's the goal. The feature lives in your codebase, written in your team's stack, with documented prompts and clear architecture. I'm not building a black box — your engineers should be able to adjust prompts, add new model routes, and extend the feature without needing me. Ongoing support is available if you want it for tuning and optimization.
Want to work together?
30 minutes. No pitch. Just a conversation.
Book a consultation