RAG Systems & Semantic Search

The Problem

Why most RAG pipelines fail

The pitch is simple: connect an LLM to your documents and let people ask questions. The reality is that most teams try this, get excited by the first demo, and then discover the system is confidently wrong 20–30% of the time. Nobody trusts it, nobody uses it, and the project quietly dies.

The problem is almost never the LLM — it's the retrieval. Bad chunking, wrong embedding model, no re-ranking, no evaluation, no visibility into what's going wrong. A RAG system is an information retrieval system first and a language model second. Most teams build it backwards.

What These Look Like

The spectrum

RAG systems range from straightforward Q&A over a document set to complex multi-source retrieval with structured and unstructured data. The right complexity depends on your data, your accuracy requirements, and how much is at stake when the system gets it wrong.

◆ Document Q&A (straightforward)

A knowledge base, a set of PDFs, or a Confluence space — users ask questions in natural language and get accurate answers with source citations. The right starting point for most teams. Fast to build, easy to evaluate, and immediately useful.

◆ Multi-source retrieval (moderate)

The answer isn't in one place — it's spread across docs, databases, APIs, and internal tools. The system needs to figure out where to look, pull from multiple sources, reconcile conflicting information, and synthesize a coherent response. Requires routing logic, hybrid search (semantic + keyword), and more sophisticated evaluation.

◆ Agentic RAG (advanced)

The system doesn't just retrieve — it reasons about what it needs, executes multi-step research plans, and uses tools to gather information before generating a response. Query decomposition, iterative retrieval, self-evaluation loops, and structured output generation. For high-stakes use cases where accuracy can't be approximate.

Examples

This looks like

Internal knowledge assistant

RAG system over a company's Confluence, Google Drive, and SOPs. Employees ask questions in natural language, get answers with direct source links, and the system logs unanswered questions so the docs team knows where the gaps are. Replaced a Slack channel where people waited hours for answers.

Compliance document search

Regulatory teams needed to search thousands of pages of compliance documentation — policies, procedures, audit findings — and get precise answers with citations. Built a hybrid search system (semantic + BM25) with re-ranking that consistently outperformed keyword search on their internal benchmarks.

Support ticket deflection

A customer support team was receiving hundreds of tickets whose answers were already in the help docs. I built a RAG layer that retrieves the relevant help center articles, drafts a response, and presents it with a confidence score. Agents review and send, which reduced average response time significantly while maintaining accuracy.

Research synthesis tool

Agentic RAG system for a team that needed to synthesize findings across large document collections. The system decomposes complex queries, retrieves across multiple corpora, identifies contradictions, and produces structured summaries with citations to source material — work that previously took analysts hours per query.

Product catalog semantic search

E-commerce company with thousands of SKUs and a keyword search that couldn't handle natural language queries ("show me something like X but cheaper"). Built a semantic search layer over their product catalog with attribute-aware embeddings, so "lightweight laptop for travel under $1000" actually returns relevant results.

How It Works

How these are built

There's no one-size-fits-all RAG architecture. Every decision — chunking strategy, embedding model, retrieval method, re-ranking — depends on your data and your accuracy requirements. Here's what I'm thinking about at each layer.

Ingestion & chunking

How you break documents into pieces determines everything downstream. I evaluate chunk sizes, overlap strategies, and document-aware splitting (respecting headers, sections, tables) rather than naive character-count chunking. Different document types often need different strategies.
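To make "document-aware splitting" concrete, here is a minimal sketch that assumes Markdown-style headings and a simple character budget. The function name `split_by_headers` and the `max_chars` and `overlap` parameters are illustrative choices, not part of any particular library, and a real pipeline would usually need extra handling for tables and other formats.

```python
import re


def split_by_headers(text: str, max_chars: int = 500, overlap: int = 50) -> list[dict]:
    """Split Markdown-like text on headings, then window long sections with overlap."""
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")

    chunks: list[dict] = []
    heading = ""            # heading of the section currently being collected
    body: list[str] = []    # lines of the section currently being collected

    def flush() -> None:
        """Emit the collected section as one or more overlapping chunks."""
        section = "\n".join(body).strip()
        if not section:
            return
        step = max_chars - overlap
        start = 0
        while start < len(section):
            chunks.append({"heading": heading, "text": section[start:start + max_chars]})
            if start + max_chars >= len(section):
                break
            start += step

    for line in text.splitlines():
        if re.match(r"^#{1,6}\s", line):
            flush()                              # close the previous section
            heading = line.lstrip("#").strip()   # start a new one
            body = []
        else:
            body.append(line)
    flush()
    return chunks
```

Each chunk keeps the heading it came from, so that heading can be stored as metadata or prepended to the chunk text before embedding. Whether that helps is something to measure on your own documents rather than assume.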

Embedding & indexing

Model selection matters more than most teams realize: a general-purpose embedding model often performs worse than a smaller one tuned for your domain. I benchmark embedding models against your actual queries, set up vector storage (pgvector, Qdrant, or managed options depending on scale), and configure indexing for your latency requirements.
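As a rough illustration of what benchmarking an embedding model against real queries can look like, the sketch below scores any embedding function by hit rate at k over a small labeled set. Everything here is an assumption for illustration: `hit_rate_at_k` is a hypothetical helper, and `toy_embed` is a word-count stand-in for a real model so the example runs without dependencies.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hit_rate_at_k(
    embed: Callable[[str], Vector],
    docs: dict[str, str],
    labeled_queries: list[tuple[str, str]],
    k: int = 3,
) -> float:
    """Fraction of queries whose expected document id appears in the top k results."""
    doc_vecs = {doc_id: embed(text) for doc_id, text in docs.items()}
    hits = 0
    for query, expected_id in labeled_queries:
        query_vec = embed(query)
        ranked = sorted(
            doc_vecs,
            key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]),
            reverse=True,
        )
        if expected_id in ranked[:k]:
            hits += 1
    return hits / len(labeled_queries)


# Hypothetical stand-in for a real embedding model: counts of a few known terms.
VOCAB = ("refund", "invoice", "password", "reset")


def toy_embed(text: str) -> list[float]:
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]
```

In practice you would pass each candidate model's encode function as `embed` and compare the resulting numbers on the same labeled queries, ideally queries taken from real user logs.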

Retrieval & re-ranking

Semantic search alone isn't enough. I build hybrid retrieval (combining vector similarity with keyword/BM25 search), implement cross-encoder re-ranking to surface the most relevant chunks, and add metadata filtering so the system can narrow by date, source, category, or any structured attribute in your data.
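One common way to combine a vector ranking with a keyword/BM25 ranking is reciprocal rank fusion (RRF). It is one option among several and not necessarily what a given project ends up using. The sketch below assumes each retriever has already returned an ordered list of document ids; the function name and the constant `k=60` (a conventional default for RRF) are illustrative.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document ids into one fused ranking.

    Each document scores 1 / (k + rank) for every list it appears in;
    documents ranked well by more than one retriever rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["a", "b", "c"]    # ids ordered by vector similarity (made-up data)
keyword_hits = ["b", "d", "a"]   # ids ordered by BM25 score (made-up data)
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

RRF uses only rank positions, so it avoids having to normalize similarity scores and BM25 scores onto a shared scale. The fused list is then a natural input for a cross-encoder re-ranker.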

Generation & grounding

The LLM layer — prompt design that minimizes hallucination, citation generation so users can verify answers, confidence scoring, and guardrails that make the system say "I don't know" when retrieval comes up empty rather than inventing an answer.
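A minimal sketch of the "say I don't know" guardrail, assuming each retrieved chunk arrives with a relevance score where higher is better. The `Chunk` type, the `min_score` threshold, and the prompt wording are all hypothetical; the right threshold depends on the retriever and has to be tuned against evaluation data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    source: str   # where the text came from, used for citations
    text: str
    score: float  # retrieval or re-ranker score, higher is better


FALLBACK = "I don't know. I couldn't find this in the indexed documents."


def build_grounded_prompt(
    question: str, chunks: list[Chunk], min_score: float = 0.5
) -> Optional[str]:
    """Return a citation-ready prompt, or None when no chunk is relevant enough."""
    usable = [c for c in chunks if c.score >= min_score]
    if not usable:
        return None  # caller should return FALLBACK instead of calling the LLM
    context = "\n".join(
        f"[{i}] ({c.source}) {c.text}" for i, c in enumerate(usable, start=1)
    )
    return (
        "Answer using only the numbered sources below and cite them like [1]. "
        "If the sources do not contain the answer, say you don't know.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
```

When the function returns `None`, the application can answer with `FALLBACK` without calling the model at all, so an empty retrieval never reaches the generation step.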

Evaluation & iteration

The part most teams skip entirely. I build eval harnesses that measure retrieval quality (are the right chunks being found?), generation quality (is the answer accurate given those chunks?), and end-to-end accuracy. This is how you know the system is working and how you catch regressions when you change anything.
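For the retrieval side of an eval harness, two standard metrics are recall at k and mean reciprocal rank (MRR). The sketch below assumes a `retrieve` callable that maps a query to an ordered list of chunk ids, plus a hand-labeled set of relevant ids per query; the function names and data shapes are illustrative.

```python
from typing import Callable


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the relevant ids that appear in the top k retrieved ids."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant id, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate_retrieval(
    cases: list[tuple[str, set[str]]],
    retrieve: Callable[[str], list[str]],
    k: int = 5,
) -> dict[str, float]:
    """Average recall@k and MRR over labeled (query, relevant ids) cases."""
    recalls, rranks = [], []
    for query, relevant in cases:
        retrieved = retrieve(query)
        recalls.append(recall_at_k(retrieved, relevant, k))
        rranks.append(reciprocal_rank(retrieved, relevant))
    n = len(cases)
    return {"recall_at_k": sum(recalls) / n, "mrr": sum(rranks) / n}
```

Running this after every change to chunking, embeddings, or re-ranking shows whether retrieval itself improved or regressed, separately from anything the LLM does with the retrieved chunks.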

Working Together

The process

Scoping call

30 minutes, free. You walk me through your data, your current search pain points, and what "good" looks like for your team. I'll tell you honestly whether RAG is the right approach or if something simpler would work better.

Data audit & technical spec

I review your document corpus — formats, volume, structure, quality. Then I write a technical spec covering the architecture, embedding strategy, retrieval approach, and a realistic timeline. We agree on scope and success criteria before any code is written.

Build, evaluate, iterate

I build in short cycles — ingestion pipeline first, then retrieval, then generation — with evaluation at each stage. You see working demos early and give feedback on real queries from your team. The eval harness is built alongside the system, not after.

Deploy & hand off

The system goes into production with monitoring, documentation, and a handoff to your team. I include a playbook for common maintenance tasks — adding new documents, re-indexing, troubleshooting retrieval quality — so your team can manage it independently.

Pricing

What this costs

Every RAG system is different — a simple Q&A layer over a knowledge base is a smaller engagement than a multi-source agentic retrieval system with hybrid search and custom evaluation. I scope and price each project individually based on data complexity, integration requirements, and accuracy needs.

The scoping call is free and there's no obligation. I'll give you an honest assessment of what your system needs, what the right architecture is, and whether it's worth building at all.

Common Questions

Frequently asked

Q: We already tried RAG and it wasn't accurate enough. Can you fix it?

Usually, yes. Most failed RAG implementations have fixable problems — bad chunking, wrong embedding model, no re-ranking, or insufficient evaluation to even know what's failing. I start by auditing the existing system, identifying where retrieval is breaking down, and fixing the weakest link first. Sometimes it's a few targeted changes, sometimes it's a deeper rebuild.

Q: What kind of documents can this handle?

PDFs, Word docs, HTML, Markdown, Confluence pages, Google Docs, plain text — anything with extractable text content. Tables and structured data require more careful handling (and often a different retrieval strategy), but it's a solved problem. Scanned documents need OCR as a preprocessing step, which I can include in the pipeline.

Q: How do we know if the answers are accurate?

Every system I build includes an evaluation harness — a set of test queries with known correct answers that the system is measured against automatically. This gives you a concrete accuracy number, and it catches regressions when anything in the pipeline changes. I also build in citation generation so users can verify individual answers by clicking through to the source.
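As a simplified picture of the end-to-end check, the sketch below scores answers against known correct ones and compares the result to a stored baseline. The scoring rule is a deliberately crude assumption (the expected answer must appear in the generated answer after normalization); real harnesses often use stricter or model-graded scoring. `answer_fn`, `check_regression`, and the tolerance value are hypothetical.

```python
import re
from typing import Callable


def normalize(text: str) -> str:
    """Lowercase and collapse anything that is not a letter or digit to single spaces."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())


def answer_accuracy(
    cases: list[tuple[str, str]], answer_fn: Callable[[str], str]
) -> float:
    """Fraction of test queries whose expected answer appears in the system's answer."""
    correct = sum(
        normalize(expected) in normalize(answer_fn(question))
        for question, expected in cases
    )
    return correct / len(cases)


def check_regression(current: float, baseline: float, tolerance: float = 0.02) -> bool:
    """True if accuracy has not dropped more than `tolerance` below the baseline."""
    return current >= baseline - tolerance
```

Wired into CI, a failing `check_regression` blocks a pipeline change until someone has looked at which test queries started failing.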

Q: What about data privacy?

Your documents stay in your infrastructure. I can build on self-hosted embedding models and LLMs if you need to keep everything on-prem, use zero-retention API agreements with cloud providers, or set up data anonymization pipelines for sensitive content. I've built in regulated environments and can sign NDAs and BAAs.

Q: How long does a typical build take?

A focused document Q&A system: 2–4 weeks. Multi-source retrieval with hybrid search and re-ranking: 4–8 weeks. Agentic RAG with complex evaluation: 6–12 weeks. These are working-in-production timelines, not just a demo. I'll give you a realistic estimate in the proposal.

Q: Can our team maintain this after handoff?

That's the goal. Everything is built with standard tools (Python, PostgreSQL with pgvector, well-documented APIs), version controlled, and documented. I include a maintenance playbook covering how to add documents, re-index, and troubleshoot common issues. I'm also available for ongoing support if you want it.

Want to work together?

30 minutes. No pitch. Just a conversation.

Book a consultation