In addition to evaluating the
safety of software as a medical device (SaMD), the agency needs to devote more
resources to evaluating its efficacy and quality.
John Halamka, M.D., president, Mayo
Clinic Platform, and Paul Cerrato, senior research analyst and communications
specialist, Mayo Clinic Platform, wrote this article.
The FDA’s approach to software as a
medical device (SaMD) has been evolving. Consider a few examples.
In 2018, IDx-DR, a software system used
to improve screening for retinopathy, a common complication of diabetes that
affects the eye, became the first AI-based medical device to receive US
Food and Drug Administration clearance to “detect greater than a mild level
of … diabetic retinopathy in adults who have diabetes.” To arrive at that
decision, the agency not only reviewed data to establish the device’s safety but also
took into account prospective studies, an essential form of evidence that
clinicians look for when trying to decide if a device or product is worth
using. The software was the first medical device approved by the FDA that does
not require the services of a specialist to interpret the results, making it a
useful tool for health care providers who may not normally be involved in eye
care. The FDA clearance emphasized that IDx-DR is a screening tool, not
a diagnostic tool, stating that patients with positive results should be
referred to an eye care professional. The algorithm built into the IDx-DR
system is intended to be used with the Topcon NW400 retinal camera and a cloud
server that contains the software.
Similarly, the FDA looked at a
randomized prospective trial before approving a machine learning-based
algorithm that can help endoscopists improve their ability to detect smaller,
easily missed colonic polyps. Its recent clearance of GI Genius by Medtronic
was based on a clinical trial published in Gastroenterology, in which investigators in Italy evaluated data from 685 patients,
comparing a group that underwent colonoscopy with the help of the computer-aided detection (CADe) system to a control group. Repici et al.
found that the adenoma detection rate was significantly higher in the CADe
group, as was the detection rate for polyps 5 mm or smaller, which led to the
conclusion: “Including CADe in colonoscopy examinations increases detection of
adenomas without affecting safety.”
Their findings raise several
questions: is it reasonable to assume that a study of 600+ Italians would apply
to a U.S. population, which has different demographic characteristics? More
importantly, were the 685 patients representative of the general public, including
adequate numbers of persons of color and those in lower socioeconomic groups?
While the Gastroenterology study did report an adequate number of female patients,
it makes no mention of these other marginalized groups.
An independent 2021 analysis
of FDA approvals has likewise raised concerns about the effectiveness
and equity of several recently approved AI algorithms. Eric Wu from Stanford
University and his colleagues examined the FDA’s clearance of 130 devices and
found the vast majority were approved based on retrospective studies (126 of
130). And when they separated all 130 devices into low- and high-risk subgroups
using FDA guidelines, they found none of the 54 high-risk devices had been evaluated
by prospective trials. Other shortcomings documented in Wu’s analysis included
the following:
- Of the 130 approved products, 93 did not report multi-site evaluation.
- Fifty-nine of the approved AI devices included no mention of the sample size of the test population.
- Only 17 of the approved devices discussed a demographic subgroup.
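To put those counts in proportion, the short Python sketch below converts each figure from Wu's analysis into a share of the 130 devices reviewed. The counts come from the text above; the variable and function names are ours and purely illustrative.

```python
# Counts reported in the analysis discussed above; names are illustrative.
TOTAL_DEVICES = 130

counts = {
    "approved based on retrospective studies": 126,
    "did not report multi-site evaluation": 93,
    "did not mention test sample size": 59,
    "discussed a demographic subgroup": 17,
}


def share(count: int, total: int = TOTAL_DEVICES) -> float:
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


shares = {label: share(n) for label, n in counts.items()}

for label, pct in shares.items():
    print(f"{pct:5.1f}%  {label}")
```

Seen this way, roughly 97% of the devices relied on retrospective evidence, while only about 13% discussed any demographic subgroup.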
We would certainly like to see the FDA take a more
thorough approach to AI-based algorithm clearance, but in the absence of that, several
leading academic medical centers, including Mayo Clinic, are contemplating a
more holistic and comprehensive approach to algorithmic evaluation. It would include
establishing a standard labeling schema to document the characteristics,
behavior, efficacy, and equity of AI systems, revealing the properties
stakeholders need to assess those systems and to build the trust required
for safe adoption. The schema will also support assessment of the portability
of systems to disparate datasets. The labeling schema will serve as an
organizational framework that specifies the elements of the label. Label
content will be specified in sections that will likely include:
- model details such as name, developer, date of release, and version,
- the intended use of the system,
- performance measures,
- accuracy metrics, and
- training data and evaluation data characteristics
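To make the idea concrete, here is a minimal sketch of how the label sections listed above might be represented in machine-readable form. It is not the schema under discussion, which has not been finalized. Every class name, field name, and example value below is hypothetical, and the dataset fields (sample size, number of sites, demographic subgroups) are our assumption, chosen to mirror the reporting gaps noted earlier.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import date
from typing import Dict, List


@dataclass
class ModelDetails:
    name: str
    developer: str
    release_date: date
    version: str


@dataclass
class DatasetCharacteristics:
    # Hypothetical fields; the real schema may describe datasets differently.
    description: str
    sample_size: int
    sites: int
    demographic_subgroups: List[str] = field(default_factory=list)


@dataclass
class AlgorithmLabel:
    """One label, with a field per section named in the list above."""
    model_details: ModelDetails
    intended_use: str
    performance_measures: Dict[str, float]
    accuracy_metrics: Dict[str, float]
    training_data: DatasetCharacteristics
    evaluation_data: DatasetCharacteristics

    def to_json(self) -> str:
        # default=str renders the release date as an ISO-style string.
        return json.dumps(asdict(self), default=str, indent=2)


# Placeholder values for illustration only; they describe no real product.
label = AlgorithmLabel(
    model_details=ModelDetails(
        name="ExampleScreen",
        developer="Example Developer",
        release_date=date(2021, 1, 1),
        version="0.1.0",
    ),
    intended_use="Illustrative screening aid; not a diagnostic tool.",
    performance_measures={"placeholder_measure": 0.0},
    accuracy_metrics={"placeholder_metric": 0.0},
    training_data=DatasetCharacteristics("placeholder training set", 0, 0),
    evaluation_data=DatasetCharacteristics(
        "placeholder evaluation set", 0, 0, ["placeholder subgroup"]
    ),
)

print(label.to_json())
```

A structured label of this kind could be compared across products or checked automatically for missing sections, which free-text documentation does not easily allow.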
While it makes no sense to
sacrifice the good in pursuit of the perfect, the current regulatory framework
for evaluating SaMD is far from perfect. Combining a more robust FDA approval
process with the expertise of the world’s leading medical centers will offer
our patients the best of both worlds.