Self-hosting LLMs for HIPAA workloads: what actually works
I spent three months evaluating self-hosted LLM options for a healthcare client generating 6,000 clinical notes per day. Here's what I learned about Qwen, MLX, RunPod, and why the "just use the API" crowd is often wrong.
Most teams building AI in healthcare start with the same question: can we just use the OpenAI API? The answer is usually “it depends,” and the thing it depends on is whether your compliance team will sign off on sending PHI to a third-party endpoint.
For one of my clients — a care management platform generating roughly 6,000 clinical notes per day — the answer was no. So we went down the self-hosting path, and I spent three months evaluating what actually works.
The constraints
This wasn’t an academic exercise. We needed:
- Throughput: 6,000+ note generations per day, with peaks during business hours
- Latency: Under 8 seconds for a complete clinical note (roughly 400-600 tokens)
- Compliance: Full HIPAA compliance — no PHI leaves our infrastructure
- Cost: Competitive with API pricing at our volume
- Quality: Output quality comparable to GPT-4-class models for clinical documentation
What we evaluated
Qwen 2.5 72B on RunPod
This ended up being our winner. Qwen 2.5 at the 72B parameter size hits a sweet spot for clinical text generation — it understands medical terminology well, follows structured output formats reliably, and the 72B size is large enough to handle nuanced clinical reasoning without the infrastructure overhead of a 405B model.
RunPod gave us the GPU availability we needed (A100 80GB instances) with the flexibility to scale up during peak hours and scale down overnight. We’re running 4 instances during peak and 2 overnight.
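The peak/overnight split above can be expressed as a simple time-based scaling policy. This is a minimal sketch; the hour boundaries are illustrative assumptions, not the client's actual schedule, and in practice the target count feeds whatever autoscaling API you use.

```python
from datetime import time

# Illustrative time-based scaling policy: 4 instances during business
# hours, 2 overnight. The hour boundaries are assumptions.
PEAK_START = time(7, 0)   # assumed start of business-hours peak
PEAK_END = time(19, 0)    # assumed end of peak

def target_instances(now: time) -> int:
    """Return how many inference instances should be running right now."""
    return 4 if PEAK_START <= now < PEAK_END else 2
```

A real deployment would also want hysteresis so instances don't flap at the boundary, but the core policy is this small.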
MLX on Mac Studio clusters
We tested this as a cost-optimization play. Apple’s MLX framework on M2 Ultra Mac Studios can run 70B models at reasonable speeds. The per-unit cost is lower than GPU cloud instances over a 2-year horizon.
The problem: operational complexity. Managing a cluster of Mac Studios is not something your average DevOps team is set up for. Monitoring, failover, updates — it all requires custom tooling that doesn’t exist yet. We shelved this for now but I think it becomes viable in 12-18 months as the tooling matures.
vLLM on bare metal
Fast, efficient, well-documented. If you have access to bare metal GPU servers and a team that can manage them, this is the gold standard for self-hosted inference. We didn’t go this route because my client didn’t want to manage hardware, but for organizations with existing GPU infrastructure, this is where I’d start.
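Whichever host you pick, vLLM (and most self-hosted stacks) exposes an OpenAI-compatible HTTP API, which is what makes migration between providers tractable. Here is a sketch of building the request body you would POST to `/v1/chat/completions` on your own server; the system prompt and sampling parameters are illustrative, not our production values.

```python
import json

def build_note_request(transcript: str, max_tokens: int = 600) -> str:
    """Build an OpenAI-compatible chat-completions body for a self-hosted
    Qwen endpoint (e.g. one started with `vllm serve Qwen/Qwen2.5-72B-Instruct`)."""
    payload = {
        "model": "Qwen/Qwen2.5-72B-Instruct",
        "messages": [
            {"role": "system",
             "content": "You are a clinical documentation assistant."},  # illustrative
            {"role": "user", "content": transcript},
        ],
        "max_tokens": max_tokens,   # complete notes run roughly 400-600 tokens
        "temperature": 0.2,         # low temperature for consistent structure (assumed)
    }
    return json.dumps(payload)
```

Because the wire format matches the OpenAI API, the same client code can point at OpenAI during validation and at your own endpoint after migration.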
The architecture that worked
[Clinical App] → [Load Balancer] → [RunPod Serverless]
                                         ↓
                                  [Qwen 2.5 72B]
                                         ↓
                            [Structured Output Parser]
                                         ↓
                            [Quality Score + Audit Log]
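The "Structured Output Parser" stage can be as simple as splitting the generated note into its sections. A minimal sketch, assuming a SOAP-style template; the actual section names depend on your note templates.

```python
import re

# Split a generated note into sections keyed by heading.
# The SOAP headings here are illustrative, not the client's actual template.
SECTION_RE = re.compile(r"^(Subjective|Objective|Assessment|Plan):", re.M)

def parse_note(text: str) -> dict[str, str]:
    """Return {section_name: section_body} for each heading found."""
    sections = {}
    matches = list(SECTION_RE.finditer(text))
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[m.group(1)] = text[m.end():end].strip()
    return sections
```

Keeping this stage separate from the model call means a malformed generation fails loudly at parse time instead of reaching the clinician.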
The key insight: treat the LLM as a microservice with the same observability you’d give any production service. We log every request, every response, every latency measurement, and every quality score. When something degrades, we know within minutes.
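In practice, "the same observability you'd give any production service" is a thin wrapper around the model call. This sketch assumes hypothetical `generate` and `score` callables standing in for the inference client and quality scorer, and emits one structured log line per request.

```python
import json
import logging
import time
import uuid

log = logging.getLogger("llm_audit")

def audited_generate(prompt: str, generate, score) -> str:
    """Wrap a model call with a request id, latency measurement, and a
    structured audit record. `generate` and `score` are placeholders for
    the real inference client and quality scorer."""
    request_id = str(uuid.uuid4())
    start = time.perf_counter()
    output = generate(prompt)
    latency_ms = (time.perf_counter() - start) * 1000
    log.info(json.dumps({
        "request_id": request_id,
        "latency_ms": round(latency_ms, 1),
        "prompt_words": len(prompt.split()),  # crude size proxy, not a tokenizer
        "quality_score": score(output),
    }))
    return output
```

Shipping these records to your existing log pipeline gives you the "know within minutes" property without any LLM-specific tooling.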
Cost comparison
At 6,000 notes/day (~3M tokens/day input, ~1.5M tokens/day output):
- OpenAI GPT-4o: ~$45/day ($1,350/month)
- Self-hosted Qwen on RunPod: ~$28/day ($840/month)
- Savings: ~38% — and it scales better as volume grows
The breakeven point was around 2,000 notes/day. Below that, the API is probably cheaper when you factor in engineering time.
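The breakeven logic comes from the fact that API cost scales linearly with volume while self-hosting has a fixed floor (the minimum instances you keep running). The $45 and $28 figures below are the post's numbers at 6,000 notes/day; the $10/day floor is an illustrative assumption chosen to show why breakeven lands near 2,000 notes/day, not a measured value.

```python
# API cost is purely per-note; self-hosted cost is a fixed floor plus a
# smaller per-note increment. All constants derived from the figures in
# the post except SELF_HOSTED_FLOOR, which is an assumption.
API_COST_PER_NOTE = 45 / 6000                     # ~$0.0075/note
SELF_HOSTED_FLOOR = 10.0                          # assumed $/day for minimum instances
SELF_HOSTED_PER_NOTE = (28 - SELF_HOSTED_FLOOR) / 6000

def daily_cost(notes: int) -> tuple[float, float]:
    """Return (api_cost, self_hosted_cost) in dollars per day."""
    api = notes * API_COST_PER_NOTE
    self_hosted = SELF_HOSTED_FLOOR + notes * SELF_HOSTED_PER_NOTE
    return api, self_hosted
```

Under these assumptions the two lines cross a little above 2,000 notes/day; below that, the fixed floor dominates and the API wins even before counting engineering time.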
What I’d do differently
- Start with the API, migrate later. We could have launched 6 weeks earlier by starting with OpenAI behind a BAA, then migrating to self-hosted once we validated the product. The compliance team was more flexible than we initially assumed.
- Invest in evaluation harnesses early. We built our quality scoring system in week 8. Should have been week 1. Without automated quality measurement, you're flying blind when you swap models or change prompts.
- Don't underestimate prompt migration. Prompts optimized for GPT-4 don't transfer 1:1 to Qwen. Budget 2-3 weeks for prompt engineering when switching models.
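The week-1 evaluation harness doesn't need to be sophisticated to be useful. A sketch of the kind of cheap structural scoring that catches regressions when you swap models or prompts; the required sections and length band are illustrative assumptions, not our production rules.

```python
# Cheap structural quality score: fraction of checks passed.
# Section names and the length band are illustrative assumptions.
REQUIRED_SECTIONS = ("Subjective:", "Objective:", "Assessment:", "Plan:")

def quality_score(note: str) -> float:
    """Return a 0-1 score from simple structural checks on a generated note."""
    checks = [section in note for section in REQUIRED_SECTIONS]
    word_count = len(note.split())
    # ~400-600 tokens is roughly 300-450 English words (rough heuristic)
    checks.append(300 <= word_count <= 450)
    return sum(checks) / len(checks)
```

Run this over a fixed set of held-out inputs before and after every model or prompt change; a drop in the aggregate score is your regression signal, long before clinicians notice.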
The bottom line
Self-hosting LLMs for HIPAA workloads is absolutely viable in 2026, but it’s not a weekend project. The infrastructure is mature enough, the models are good enough, and the cost math works at scale. The hard part is everything around the model: monitoring, evaluation, failover, and the operational discipline to run it like a real production service.
If you’re generating more than 2,000 AI completions per day in a regulated environment, it’s worth running the numbers on self-hosting. Below that, just get a BAA from your API provider and ship.