Prompt engineering for clinical notes: first-person + self-audit

We tested four prompt architectures for generating clinical documentation. The winner was first-person voice with a self-audit JSON block — here's the structure and why it works.


Generating clinical notes with LLMs sounds straightforward until you actually try it. The note needs to be accurate, complete, written in the right voice, and structured in a way that’s useful for the clinician reviewing it. Most prompt architectures fail on at least one of these.

We tested four approaches over three months, generating roughly 50,000 notes across them. Here’s what we found.

The four architectures

1. Template filling

The simplest approach: give the LLM a template with blanks and ask it to fill them in based on the encounter data.

Fill in the following clinical note template:
SUBJECTIVE: [patient's reported symptoms]
OBJECTIVE: [clinical observations]
ASSESSMENT: [clinical assessment]
PLAN: [treatment plan]

Result: Accurate but robotic. Clinicians complained the notes “didn’t sound like them.” Compliance teams were fine with it. Patients never saw it. We moved on.

2. Free-form generation with examples

Provide 3-5 example notes and let the model generate in a similar style.

Result: Better voice, worse accuracy. The model would sometimes confabulate details that weren’t in the encounter data — inventing symptoms or exam findings that sounded plausible but weren’t documented. Unacceptable for clinical use.

3. Third-person structured generation

Generate notes in third person (“The patient reports…”) with explicit section headers and a structured data extraction step before generation.

Result: Good accuracy, good structure, but clinicians found it impersonal. In care management settings where the clinician has an ongoing relationship with the patient, third-person voice creates distance.

4. First-person voice with self-audit (the winner)

Generate notes in first person (“I discussed with the patient…”) and append a structured self-audit that verifies every claim in the note against the source data.

{
  "audit": {
    "claims_verified": 12,
    "claims_unverifiable": 0,
    "missing_from_note": ["medication reconciliation"],
    "confidence": 0.94
  }
}

Result: Best of all worlds. First-person voice matched how clinicians actually write. The self-audit caught confabulation before it reached the clinician. And the missing_from_note field surfaced completeness issues that even human-written notes often have.

Why first-person + self-audit works

The self-audit block forces the model to do something it’s normally bad at: checking its own work. By requiring a structured JSON output that explicitly maps claims to source data, we create a mechanism for catching errors that would otherwise slip through.

The key insight: the audit isn’t optional decoration. We parse it, score it, and flag any note where claims_unverifiable > 0 for human review. This turns the LLM from a black box into a system with built-in quality controls.
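A minimal sketch of that gating step, assuming the audit arrives as the last JSON object appended to the model output. The helper names and the confidence floor are illustrative, not our production pipeline:

```python
import json

def extract_audit(model_output: str) -> dict:
    """Pull the trailing JSON audit block out of the model output.

    Assumes the audit is the last top-level {...} object in the text.
    """
    start = model_output.rfind('{\n  "audit"')
    if start == -1:
        start = model_output.rfind('{"audit"')
    if start == -1:
        raise ValueError("no audit block found in model output")
    return json.loads(model_output[start:])

def needs_human_review(audit: dict, confidence_floor: float = 0.9) -> bool:
    """Gate described above: any unverifiable claim triggers review.

    The confidence_floor value is an assumed illustration, not a
    published threshold.
    """
    a = audit["audit"]
    return a["claims_unverifiable"] > 0 or a["confidence"] < confidence_floor
```

Parsing rather than merely generating the audit is what makes it a quality control: a note only skips human review when every claim maps back to source data.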

The prompt structure

You are documenting a clinical encounter. Write the note in first person
as the treating clinician. Use the encounter data below as your ONLY source
of information. Do not infer, assume, or add any clinical details not
explicitly present in the data.

After the note, output a JSON audit block that:
1. Counts every factual claim in the note
2. Maps each claim to the source data field that supports it
3. Flags any claims that cannot be directly verified
4. Lists any encounter data fields NOT reflected in the note

[encounter data here]

The prompt is deliberately restrictive. “Do not infer, assume, or add” is doing heavy lifting — it prevents the model from filling gaps with medical knowledge that might be generically correct but isn’t documented for this specific patient.
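One way to assemble that prompt, sketched in Python. The serializer and field names are hypothetical; the point is that each encounter field lands on its own labeled line, so the audit step has named source fields to map claims against:

```python
# Mirrors the prompt structure above; {encounter_data} is the only placeholder.
PROMPT_TEMPLATE = """You are documenting a clinical encounter. Write the note in first person
as the treating clinician. Use the encounter data below as your ONLY source
of information. Do not infer, assume, or add any clinical details not
explicitly present in the data.

After the note, output a JSON audit block that:
1. Counts every factual claim in the note
2. Maps each claim to the source data field that supports it
3. Flags any claims that cannot be directly verified
4. Lists any encounter data fields NOT reflected in the note

{encounter_data}"""

def build_prompt(encounter: dict) -> str:
    # One "field: value" line per encounter field, so the audit can
    # reference fields by name when verifying claims.
    lines = [f"{field}: {value}" for field, value in encounter.items()]
    return PROMPT_TEMPLATE.format(encounter_data="\n".join(lines))
```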

Metrics after 6 months

  • Clinician acceptance rate: 87% (notes used as-is or with minor edits)
  • Confabulation rate: 0.3% (down from 4.2% with free-form generation)
  • Completeness score: 94% (measured against encounter data fields)
  • Time saved per note: ~4 minutes (clinician self-reported)

At 6,000 notes per day, that’s 400 hours of clinician time saved daily. That’s the number that makes the business case.
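The back-of-envelope math behind that number:

```python
notes_per_day = 6_000
minutes_saved_per_note = 4  # clinician self-reported, from the metrics above

# 6,000 notes x 4 minutes = 24,000 minutes = 400 hours per day
hours_saved_daily = notes_per_day * minutes_saved_per_note / 60
```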