Evaluating LLMs in Medicine: What We Got Wrong


Large language models can pass medical licensing-style exams, yet still fail in ways that matter at the bedside. In this note we walk through three failure modes we keep seeing, and the evaluation protocol our lab is developing to catch them before deployment.

The benchmark illusion

Aggregate accuracy hides the cases that matter. A model that is 92% accurate overall can be systematically wrong on the 8% of presentations that are rare, ambiguous, or high-stakes.

Three failure modes recur:
Distribution shift between benchmark vignettes and real charts.
Shortcut learning on surface features rather than clinical reasoning.
Overconfidence on exactly the questions where deferral is safest.
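
To make the aggregate-versus-subgroup point concrete, here is a minimal sketch of stratified scoring. The stratum labels ("common", "rare") and the records are synthetic and chosen only to reproduce the 92% figure used above; they are not results from any model or benchmark.

```python
from collections import defaultdict


def accuracy_by_stratum(records):
    """Return (overall accuracy, per-stratum accuracy).

    Each record is a (stratum, is_correct) pair. The stratum labels are
    hypothetical and stand in for whatever case taxonomy an evaluation uses.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for stratum, is_correct in records:
        totals[stratum] += 1
        hits[stratum] += int(is_correct)
    overall = sum(hits.values()) / sum(totals.values())
    per_stratum = {s: hits[s] / totals[s] for s in totals}
    return overall, per_stratum


# Synthetic illustration: 92 common cases all answered correctly,
# 8 rare cases all answered incorrectly.
records = [("common", True)] * 92 + [("rare", False)] * 8
overall, per_stratum = accuracy_by_stratum(records)
print(overall)      # 0.92
print(per_stratum)  # {'common': 1.0, 'rare': 0.0}
```

The headline number is identical whether the 8% of errors are spread evenly or concentrated in one stratum, which is why reporting per-stratum accuracy alongside the aggregate is useful.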

What we measure instead

We score models on calibrated uncertainty, abstention quality, and robustness to clinically irrelevant edits, not just top-line accuracy.

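# Placeholder snippet: `evaluate`, `model`, and the suite name are
# illustrative and do not refer to a released API.
score = evaluate(
    model,
    suite="clinical-stress",
    metrics=[
        "selective_accuracy",  # accuracy on the cases the model chooses to answer
        "calibration_error",   # gap between stated confidence and observed accuracy
        "robustness_delta",    # accuracy change under clinically irrelevant edits
    ],
)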
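
The post does not define the three metrics named in the snippet, so the following is only a sketch under common definitions, which are assumptions here: selective accuracy as accuracy on the cases the model answers (abstaining below a confidence threshold), calibration error as binned expected calibration error, and robustness delta as the accuracy drop between original and edited versions of the same cases. The function signatures and the toy inputs are hypothetical.

```python
def selective_accuracy(confidences, correct, threshold):
    """Return (accuracy on answered cases, coverage).

    Assumes the model abstains whenever its confidence is below `threshold`.
    """
    answered = [ok for conf, ok in zip(confidences, correct) if conf >= threshold]
    coverage = len(answered) / len(correct)
    accuracy = sum(answered) / len(answered) if answered else float("nan")
    return accuracy, coverage


def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: size-weighted mean of |accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        acc = sum(ok for _, ok in members) / len(members)
        ece += (len(members) / n) * abs(acc - mean_conf)
    return ece


def robustness_delta(correct_original, correct_edited):
    """Accuracy on the original cases minus accuracy on their edited versions."""
    acc_original = sum(correct_original) / len(correct_original)
    acc_edited = sum(correct_edited) / len(correct_edited)
    return acc_original - acc_edited


# Toy inputs, invented for illustration only.
confidences = [0.95, 0.90, 0.85, 0.60, 0.55]
correct = [True, True, False, True, False]
correct_after_edit = [True, False, False, True, False]

sel_acc, coverage = selective_accuracy(confidences, correct, threshold=0.8)
ece = expected_calibration_error(confidences, correct)
delta = robustness_delta(correct, correct_after_edit)
print(sel_acc, coverage, ece, delta)
```

Reporting coverage next to selective accuracy matters: a model can raise its accuracy on answered cases simply by abstaining more often, so neither number is interpretable alone.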

Where this goes next

We’re releasing the evaluation suite and a leaderboard later this year. If you work on clinical AI and want to collaborate, get in touch.
