There are many new AI tools available for healthcare professionals, from transcription and imaging to diagnostics. Many tout accuracy rates above 90%, but most are tested only in isolation.
Those tools become less reliable when used together, an analysis by Korean AI scientist Kwansub Yun suggests.
Yun and health consultant Claire Hast ran an example scenario in which a patient had a physical transcribed by AI, received a mammogram using AI-assisted imaging and got a diagnosis with help from an AI tool.
While each tool individually had a reported accuracy rating of more than 85%, the system as a whole had a reliability score of just 74%.
Yun used a systems-level analysis to estimate the overall workflow reliability of the three tools used together.
- Drawing on publicly available accuracy data for an imaging tool (90%), a documentation tool (85%) and a diagnostics tool (97%), Yun arrived at a reliability score of 74%; the sketch after this list shows the multiplication.
- “The formula is a standard reliability engineering heuristic — the same structural logic used to estimate system reliability in aerospace and defense,” says Yun.
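The arithmetic behind the 74% figure is the standard series-reliability product: a workflow succeeds only if every stage succeeds, so the stage probabilities multiply. A minimal sketch of that heuristic using the reported figures above (the function name and structure are illustrative, not Yun's actual code):

```python
def series_reliability(stage_accuracies):
    """Series-reliability heuristic: a workflow succeeds only if
    every stage succeeds, so stage probabilities multiply."""
    result = 1.0
    for accuracy in stage_accuracies:
        result *= accuracy
    return result

# Reported standalone accuracies from the scenario:
# imaging (90%), documentation (85%), diagnostics (97%)
stages = [0.90, 0.85, 0.97]
print(f"Workflow reliability: {series_reliability(stages):.1%}")
# -> Workflow reliability: 74.2%
```

Note the built-in assumption: stage errors are treated as independent. Correlated failures, such as a transcription error that steers the diagnostic tool, would change the figure.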
If erroneous data from one AI tool is fed into another, the secondary tool has no way to flag the unreliable inputs, says Yun.
- “The result looks authoritative, but the chain that produced it was never measured end to end.”
That’s particularly troubling given that the standard regulatory procedure for evaluating these tools tests each model’s performance in isolation, Hast and Yun say.
- “What no one is currently required to measure is the reliability of the full workflow that the model sits inside,” Yun says.
Human doctors are also typically evaluated as individuals, not as part of a broader system; there’s no data on how much reliability slips as patients move between providers.
- “If you chain together the probabilities of accuracy for any human making many sequential decisions, you realize how likely you are to get errors,” says Mark Sendak, CEO of AI infrastructure and evaluation startup Vega Health. (A short illustration of that compounding follows these quotes.)
- “My fear is that we’re going to hold AI to a standard of perfection that is clearly not the standard that we hold the existing medical system to,” says UC San Francisco Department of Medicine chair Robert Wachter.
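Sendak’s chained-probabilities point follows the same multiplication. A minimal sketch, assuming a hypothetical clinician who is 95% accurate per decision and makes ten sequential decisions on one patient’s pathway (both figures are invented for illustration, not from the article):

```python
# Hypothetical: 95% accuracy per decision, 10 sequential decisions.
per_decision_accuracy = 0.95
decisions = 10
chain_reliability = per_decision_accuracy ** decisions
print(f"Chance every decision is correct: {chain_reliability:.1%}")
# -> Chance every decision is correct: 59.9%
```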
More attention should be paid to the overall performance of what Wachter calls “the human-AI dyad.”
- For example, AI tools could be designed to more clearly signal to humans in the loop where their clinical reasoning is needed.
- In such a scenario, AI findings made with 100% confidence could be colored green, while those made with less confidence could be colored yellow or orange; a simple sketch of such a banding rule follows this list.
- Such a setup would better enable regulators and evaluators of such tools to look at “that dyad and its actual outcomes, rather than just assuming the human-in-the-loop adds safety,” Wachter says.
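One way to make the color-coding idea concrete is a confidence-banding rule in the tool’s interface. A minimal sketch; the thresholds, labels and function name here are assumptions for illustration, not from the article or any regulatory guidance:

```python
def confidence_color(confidence: float) -> str:
    """Map a model confidence score to a display color, flagging
    where human clinical reasoning is most needed.
    Thresholds are illustrative assumptions."""
    if confidence >= 0.99:
        return "green"   # near-certain finding
    elif confidence >= 0.90:
        return "yellow"  # review recommended
    else:
        return "orange"  # human judgment required

for score in (1.00, 0.94, 0.80):
    print(f"{score:.2f} -> {confidence_color(score)}")
# 1.00 -> green
# 0.94 -> yellow
# 0.80 -> orange
```

A design like this would also give evaluators a logged confidence trail for each finding, the kind of dyad-level outcome data Wachter describes.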
When it comes to AI in health care, “we have no data or oversight on the orchestra of it all,” says Hast.
