AI Isn't Ready for Your Patients
ChatGPT can pass exams, but it can’t manage uncertainty
A new randomized study published this week in Nature Medicine asked a simple question: Are large language models actually helpful when real people use them to make medical decisions?
The answer was sobering.
And I have to say up front that this is one of the most provocative studies around patients and AI that I’ve read in a while. You should check it out. I’ve had so many conversations with patients about what they are finding on ChatGPT, and I’ve been trying to understand where it falls short. This study helped me see it more clearly.
What’s interesting is that it lays bare the things I actually do in an exam room. Technology teaches us things about ourselves and how we work. LLMs and the challenge of public problem solving are a nice example of this.
Participants were given detailed medical scenarios and asked to use commercially available chatbots to decide what to do next. Call an ambulance? Stay home? Schedule an appointment? They were also asked to identify the likely diagnosis. The results were compared to a control group who used whatever they would normally use at home, which meant Google.
Bottom line: The chatbots did no better than Google. And the authors concluded that none of the models tested were ready for deployment in direct patient care.
Participants chose the correct course of action less than half the time. They identified the correct diagnosis only about a third of the time.
This clashes with the prevailing public narrative. These systems pass licensing exams and outperform poor slobs like me on carefully constructed diagnostic scenarios. By every benchmark, they look remarkable.
And they are remarkable in many ways. But those of us who do this for a living know that good medicine is nothing like answering board questions.
So what happened? Here are three key thoughts/takeaways.
⸻
1. The failure is translation, not intelligence
This is critical: When researchers fed the full medical vignette directly into the model, accuracy jumped to 94 percent. The system could reason through the case when all relevant information was present. But when participants interacted naturally, leaving out key details or describing symptoms imprecisely, performance tanked.
No truer words have been written than these from Nassim Taleb on Twitter in 2023:
An expert is someone who knows what not to be wrong about.
And in our line of work knowing what not to be wrong about is everything.
That’s why doctors spend years learning which details matter. We are trained to detect the salient features of a patient’s story: location, timing, severity, modifiers, aggravating and alleviating factors, associated symptoms, and so on. We know that “worst headache of my life” is different from “terrible headache.” We know that sudden onset carries weight. We know that neck stiffness and photophobia can be ominous in the right context. And when discussing febrile infants, the word “irritable” gets taken very seriously by smart pediatricians.
Patients don’t know this. And why would they?
⸻
2. Small words change big outcomes
This study exposes a structural reality of generative AI: large language models are highly sensitive to how the input is framed.
Half of the observed errors in this study were attributed to users omitting relevant details. If a system is designed for public use (which ChatGPT is not, in its current iteration, IMHO), it can’t assume perfect prompting. It has to compensate for lousy storytelling.
Because if a model is dependent on the perfect input, it is not doing anything even remotely related to what a doctor does. For example, I create mini hypotheses during my history and I circle back and probe repeatedly.
Another finding deserves attention. Even when given complete information, the models struggled to consistently distinguish between urgent and non-urgent scenarios. Central to triage is the ability to calibrate an appropriate response. Not just to recognize a condition, but to determine how fast to move and how worried to be.
⸻
3. Knowledge is not judgment
It is important to be clear about what this study doesn’t show. It doesn’t show that large language models are useless in medicine. It doesn’t negate their utility for clinicians or even patients. It doesn’t refute the possibility that future iterations will improve.
What it shows is that intelligence alone is insufficient for what matters most.
What we are watching is the collision between answer engines and messy human input. The models are good at churning outputs when the inputs are structured and complete. But as we see in this study, patients do not give us structured and complete. They jump around, tell stories, and bring their own biases to the exam room. Or in this case, the prompt field.
The bridge between bad questions and helpful answers is judgment.
And judgment involves more than retrieving the correct differential. It involves calibrating risk and escalating when needed.
The real challenge for AI health leaders is building systems that understand how decisions are made and when uncertainty should trigger escalation. That requires deep engagement with people like us who do this work daily.
⸻
For now, patients using chatbots for medical advice are navigating a powerful but clearly unstable interface. When I cut my teeth during Web 2.0, I argued that physicians have an obligation to help patients understand the limitations of search. It seems what’s old is new again.
Until AI systems can operate responsibly and consistently within the space of human ambiguity, patients need to understand the difference between intelligence and wisdom.
You can read the full study here. It’s open access, the graphics are excellent, and the design is easy to follow. As I like to say, don’t trust me. Read it yourself.



Wow, this is very important. No AI can replace that careful listening.
This is a really important (and frankly overdue) reality check.
What I appreciated most is your distinction between intelligence and judgment. In clinic, the “work” isn’t naming the diagnosis from a complete vignette; it’s building the vignette by extracting the discriminating details from messy narratives, iteratively probing uncertainty, and then doing the hardest part of all: risk calibration (what can wait vs what cannot).
Your point that performance jumps when the full vignette is fed directly into the model is telling. It’s not that the model “can’t think”; it’s that it can’t reliably do what clinicians do all day: compensate for missingness, ambiguity, and misframing, and then escalate appropriately when uncertainty is dangerous.
Two implications feel especially high-yield:
1. If we’re going to deploy AI in patient-facing contexts, the interface has to be built for imperfect storytelling—active questioning, symptom timelines, red-flag extraction, and explicit “stop rules” that default to care escalation when the cost of being wrong is high.
2. We should stop reassuring ourselves with exam benchmarks. Passing tests is pattern recognition under controlled inputs; medicine is decision-making under incomplete information with asymmetric risk.
AI can be helpful, but until it consistently handles the uncertainty space, it’s not a substitute for clinical triage and judgment!