The Evaluation Stack for Health AI

1 minute read

Published: March 15, 2026

For a long time, medical AI evaluation was dominated by knowledge benchmarks: medical licensing exams, multiple-choice question answering, and short clinical vignettes [1]. These benchmarks were useful. They helped establish that language models had absorbed enough medical knowledge to be taken seriously at all.

But health AI is no longer just about answering exam-style questions or generating plausible-sounding advice. The frontier is shifting toward systems that hold real-world conversations with patients, ask follow-up questions, retrieve context, explain medical information, and increasingly take action inside clinical workflows. Once that happens, the old defaults for evaluation start to break down.

A model can do very well on a licensing-style benchmark and still be weak at triage. It can look strong on a static vignette and still fail once a real user is involved. It can sound clinically sophisticated and still make mistakes when it has to gather missing information, manage uncertainty, or execute a workflow correctly.

Frontier models now perform extremely well on many medical knowledge benchmarks. But as health AI moves toward broader real-world deployment, we are learning a more humbling lesson: saturating a knowledge benchmark is not the same as practicing medicine safely.

That is why health AI needs an evaluation stack.



From Stochastic Parrots to Software as a Doctor

8 minute read

Published: December 07, 2025

LLMs have been characterized as stochastic parrots: probabilistic systems that merely predict the next word and remix text without understanding. But the frontier is shifting. Today, the question is no longer whether LLMs can imitate clinical expertise, but how we transform them into regulated medical devices that can interview patients, form preliminary diagnoses, triage safely, and even prescribe.

Amazon Machine Learning Conference (AMLC 2025)

4 minute read

Published: November 06, 2025

I’ve spent the last few days in Seattle at Amazon’s internal Machine Learning Conference (AMLC). If last year was defined by the frontier of GenAI capabilities, this year the focus shifted decisively toward agents, reliability, and real-world deployment. The conversation has moved from “Can we do X?” to “How do we evaluate, govern, and safely operationalize X at scale?” It felt like a distinctly Amazonian event: pragmatic, execution-oriented, and full of hallway discussions about shipping real systems and delivering customer impact.

Daniel Lopez-Martinez
