AI scribes show promise but risks remain, new study finds
Dr Thomas Draper, Research Fellow

A new study led by the Centre for Digital Excellence (CoDE) at the University of the West of England has compared seven commercial AI scribes used in primary care, uncovering both potential gains and pitfalls.

Published in BMJ Digital Health & AI, the research found that while AI scribes can quickly generate medical notes and reduce paperwork, none of the systems were free from errors. Crucially, the study introduced a new “Impact Score” system that highlights not just how often errors occur, but how severe their consequences could be for patient care.


The study

Researchers tested the scribes on a series of simulated GP consultations covering a range of conditions, including sleep apnoea, abdominal pain, rash, and memory loss. Each consultation was recorded with actors, and the resulting AI-generated summaries were assessed against gold-standard transcripts.

Errors were classified as omissions (details left out), factual inaccuracies (misreported information), or hallucinations (content added that was never discussed). A separate panel of medical doctors then rated each error for its potential clinical severity, and these ratings fed into the new Impact Score. This scoring approach revealed that even relatively rare errors could carry disproportionately high risks, such as delaying a diagnosis or prompting unsafe treatment.


What was found

Omissions were by far the most common, accounting for more than four out of five errors. These included missing information about family history, medications, or when symptoms began. While omissions were frequent, hallucinations and inaccuracies often carried greater potential for harm.

Examples included one system incorrectly labelling a non-smoker as a pipe smoker, and another inventing a diagnosis that had never been mentioned in the consultation. Errors of this kind were less common but were judged far more serious when they did occur.

The novel Impact Score made these risks clear. When errors are simply counted, omissions dominate; once severity is factored in, the occasional hallucination or factual inaccuracy looms much larger. This approach, the authors argue, offers a more clinically meaningful way to judge AI scribes than raw error counts alone.
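
To illustrate the distinction, the sketch below contrasts a raw error count with a severity-weighted total of the kind the Impact Score is designed to capture. The error list, severity weights, and simple summation are illustrative assumptions for this example, not the scoring formula used in the study.

```python
# Illustrative sketch only: contrasts a raw error count with a
# severity-weighted total. The weights and aggregation here are
# assumptions for demonstration, not the paper's Impact Score formula.

from dataclasses import dataclass

@dataclass
class ScribeError:
    kind: str      # "omission", "inaccuracy", or "hallucination"
    severity: int  # clinician-rated potential harm, e.g. 1 (minor) to 5 (severe)

# Hypothetical errors from one AI-generated consultation note
errors = [
    ScribeError("omission", 1),       # missed date of symptom onset
    ScribeError("omission", 2),       # missed family history detail
    ScribeError("omission", 1),       # missed over-the-counter medication
    ScribeError("omission", 2),       # missed psychosocial context
    ScribeError("hallucination", 5),  # invented a diagnosis never discussed
]

raw_count = len(errors)
weighted_impact = sum(e.severity for e in errors)

print(f"Raw error count: {raw_count}")                 # dominated by omissions
print(f"Severity-weighted impact: {weighted_impact}")  # hallucination contributes most

# The single severe hallucination accounts for 5 of the 11 impact points,
# even though omissions make up four of the five errors.
```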

Figure: Cumulative clinical impact from summarisation errors across CAIS products, for all scenarios. Clear distinctions are visible between products.

Why it matters

The findings show that while AI scribes can improve efficiency, they cannot yet be relied on without oversight.

“AI scribes have the potential to transform clinical practice, but our study shows that not all systems perform equally well,” said Dr Thomas Draper, lead author and Research Fellow at CoDE. “Our new Impact Score demonstrates that a small number of severe errors may be more important than a larger number of minor ones. Clinicians must remain vigilant and always review AI-generated notes before trusting them in patient records.”


Looking ahead

The study found that some systems handled straightforward consultations well but struggled with more complex or ambiguous ones, such as rashes or sleep apnoea. It also showed that differences in accuracy and quality between products can be substantial, underlining the importance of careful product selection by healthcare providers.

The researchers suggest that regulators and purchasers could adopt the Impact Score as part of structured evaluations, while clinicians may benefit from focusing their checks on areas most likely to be missed, such as psychosocial history or medications.


The full paper, Clinical AI Scribes in Primary Care: Accuracy, Error Severity, and Implications for Clinical Practice, is freely available now in BMJ Digital Health & AI: http://dx.doi.org/10.1136/bmjdhai-2025-000092