Concerns over variability in radiologists' clinical interpretation of medical images accompany most trials that involve imaging biomarkers. Such variability in clinical trial assessment outcomes stems from the inherent variability of interpreting an image, from differing interpretations of the assessment criteria, and from how measurable versus non-measurable and equivocal versus unequivocal disease is defined.
Several studies at academic centers have evaluated this variability in image interpretation among experienced radiologists who reviewed images in a standardized manner, and all of them concluded that this variability is not, by itself, a reliable predictor of read quality.
Having acknowledged the innate variability in image interpretation, and that this variability does not simply equate to reader error, a deep dive into its causes is crucial to identifying an acceptable benchmark. Consistently monitoring the variability in reads and analyzing the trends provides an opportunity for data-driven monitoring and early intervention when needed.
The FDA-published guidance document, “Clinical Trial Imaging Endpoint Process Standards Guidance for Industry,” further advocates monitoring reader performance. Building on this, several metrics have been defined to monitor reader performance; these are not just measures of past performance but also robust predictors of trends and bias.
“Reviewer Disagreement Index offers advantages of identifying the most discordant reviewer, which may otherwise be missed during traditional reviewer performance monitoring.”
– Manish Sharma, MD, Vice President, Medical Imaging, Calyx
Despite the well-established assessment criteria in many therapeutic areas (RECIST 1.1 for example), there is always going to be inherent variability in the blinded independent central review (BICR) process due to the differing backgrounds, training, and humanity of the reviewers. Despite this inherent variability, the industry has failed to produce meaningful methods for tracking and proactively improving reviewer performance.
Reviewer Adjudication Rate (AR), for example, has been used as a metric to track reviewer performance, but does not accurately identify reviewer performance issues. Adjudication in BICR is triggered when two reviewers’ assessments do not agree. For example, if the first reviewer performs their assessment perfectly and the second reviewer fails to assess correctly, an adjudication will be triggered. AR as a performance indicator would falsely identify the first reviewer as potentially having a performance issue.
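The failure mode described above can be made concrete with a minimal sketch. The function and data below are illustrative assumptions, not Calyx's actual implementation: each case is a pair of reviewer responses, and a case is adjudicated whenever the two responses differ, so AR is a property of the pair and cannot distinguish the accurate reviewer from the inaccurate one.

```python
def adjudication_rate(pairs):
    """Fraction of cases sent to adjudication.

    `pairs` is a list of (reviewer_1_response, reviewer_2_response)
    tuples; a case is adjudicated whenever the two responses differ.
    """
    adjudicated = sum(1 for r1, r2 in pairs if r1 != r2)
    return adjudicated / len(pairs)

# Hypothetical reads: reviewer 1 matches an assumed ground truth on
# every case, while reviewer 2 misreads two of the five cases.
truth = ["PR", "SD", "PD", "CR", "SD"]
reviewer_1 = list(truth)
reviewer_2 = ["PR", "PD", "PD", "CR", "PR"]
pairs = list(zip(reviewer_1, reviewer_2))

# AR is identical for both reviewers (0.4), even though only
# reviewer 2 deviated from the ground truth.
print(adjudication_rate(pairs))  # 0.4
```

Because both reviewers share the same AR, the metric would flag the flawless first reviewer just as strongly as the second.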
Introducing the Reviewer Disagreement Index (RDI)
To improve on the industry’s standard metrics for assessing and monitoring central reviewers, a retrospective review of 20 oncology studies (7,136 subjects and 32,536 timepoints) using RECIST 1.0 or RECIST 1.1 rules was conducted. This review validated two novel Key Performance Indicators (KPIs) that strengthen performance monitoring:
Adjudication Agreement Rate (AAR)
- a relative performance indicator for a given reviewer as compared to the other reviewers in a given study, with a higher adjudicator agreement rate suggesting better reader performance
Reviewer Disagreement Index (RDI)
- counts the subjects for which the adjudicator disagreed with the reviewer, and expresses that disagreement relative to the total number of cases the reviewer read
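The two KPIs can be sketched directly from the definitions above. This is a minimal interpretation under stated assumptions (the record layout, field names, and toy data are hypothetical): AAR looks only at adjudicated cases and asks how often the adjudicator sided with a given reviewer, while RDI normalizes the adjudicator's disagreements by the reviewer's total caseload.

```python
def aar(adjudicated_cases, reviewer):
    """Adjudication Agreement Rate: among adjudicated cases, the share
    in which the adjudicator sided with this reviewer (higher is better)."""
    agreements = sum(
        1 for case in adjudicated_cases
        if case["adjudicator"] == case[reviewer]
    )
    return agreements / len(adjudicated_cases)

def rdi(adjudicated_cases, reviewer, total_cases_read):
    """Reviewer Disagreement Index: adjudicator disagreements with this
    reviewer relative to ALL cases read, not only adjudicated ones
    (lower is better)."""
    disagreements = sum(
        1 for case in adjudicated_cases
        if case["adjudicator"] != case[reviewer]
    )
    return disagreements / total_cases_read

# Toy study: 10 cases read per reviewer, 4 of which went to adjudication.
adjudicated = [
    {"r1": "PD", "r2": "SD", "adjudicator": "PD"},
    {"r1": "PR", "r2": "SD", "adjudicator": "PR"},
    {"r1": "SD", "r2": "PD", "adjudicator": "SD"},
    {"r1": "CR", "r2": "PR", "adjudicator": "PR"},
]
total_read = 10

print(aar(adjudicated, "r1"), rdi(adjudicated, "r1", total_read))  # 0.75 0.1
print(aar(adjudicated, "r2"), rdi(adjudicated, "r2", total_read))  # 0.25 0.3
```

In this toy data the adjudicator sided with reviewer 1 on three of four adjudicated cases, so reviewer 2 stands out with both a lower AAR and a higher RDI.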
RDI proves to be a more reliable quality indicator than AR or AAR because it can additionally identify the discordant reader. RDI therefore offers the advantage of pinpointing the most discordant reviewer, who may be missed by analysis of AR and AAR alone in reviewer performance monitoring.
Adding automated analysis of all or selected discordant assessment pairs for each reviewer in a study further improves the ability to monitor reader interpretation performance at a detailed level. Once a low AAR has been flagged, the imaging core lab is automatically prompted to evaluate the signal further. Discrepancy grids/assessment-pair grids improve the capability to monitor BICR reviewers’ performance in specific trials, and can be used to explore a reviewer with a low overall AAR or a high RDI.
When deployed into a BICR workflow, these metrics enable a dynamic, responsive risk model that is more accurate than historical monitoring tools and automatically indicates the precise intervention needed to rectify reviewer performance issues. Critically, this reduces the number of reviewer performance issues by constantly monitoring the proximity to risk thresholds.
By monitoring the performance trajectory, Calyx uses these metrics to intervene proactively before an issue occurs, improving not only the validity of the BICR results but also patient safety and outcomes. The biggest advantage of the system is that, once set up, the regression model can use supervised learning algorithms to improve its predictive power, reducing repetitive programming needs and surfacing actionable outliers to medical monitors.
The multi-dimensional, real-time monitoring metrics add dimension to Calyx’s quality assessment of reviewer performance, ultimately improving the quality and integrity of data that pharmaceutical companies rely on to determine the safety and efficacy of new medical treatments.
- Guidance for Industry: Developing Medical Imaging Drug and Biologic Products, Part 3: Design, Analysis, and Interpretation of Clinical Studies. US Department of Health and Human Services, Food and Drug Administration, Center for Drug Evaluation and Research, Center for Biologics Evaluation and Research; 2004.
- Clinical Trial Imaging Endpoints Process Standards Guidance for Industry (Draft). US Department of Health and Human Services, Food and Drug Administration, Center for Drug Evaluation and Research, Center for Biologics Evaluation and Research; March 2015, Revision 1.