The Toolbox for Rating Diagnostic Tests: A Guide to Classification Metrics
DOI: https://doi.org/10.20883/medical.e1474
Keywords: statistical analysis, binary classification, prediction model, diagnostic test, ROC curve, model evaluation
Abstract
Evaluating a classifier's performance is critical for its successful application. This paper explores various metrics used for binary classification tasks, highlighting their strengths and limitations.
Simple threshold metrics, such as Accuracy and Sensitivity, are easy to compute for binary predictions at a single cutoff point. However, they depend entirely on the chosen threshold and can be misleading when the data are imbalanced.
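As an illustration, the threshold metrics mentioned above can be computed directly from the four cells of a 2×2 confusion matrix. This is a minimal sketch with hypothetical counts chosen only for the example:

```python
# Cells of a 2x2 confusion matrix (hypothetical example counts):
# TP = true positives, FN = false negatives,
# FP = false positives, TN = true negatives.
TP, FN, FP, TN = 80, 20, 30, 870

# Accuracy: overall fraction of correct classifications.
accuracy = (TP + TN) / (TP + FN + FP + TN)

# Sensitivity (true positive rate, recall): fraction of actual
# positives that the classifier detects.
sensitivity = TP / (TP + FN)

# Specificity (true negative rate): fraction of actual negatives
# correctly classified.
specificity = TN / (TN + FP)

print(accuracy, sensitivity, round(specificity, 3))  # 0.95 0.8 0.967
```

Note that all three values change as soon as the decision threshold moves, since the threshold determines how the counts fall into the four cells.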
For more robust evaluation, ranking metrics such as the Receiver Operating Characteristic (ROC) and Precision-Recall (PR) curves provide a threshold-agnostic view, enabling comparison of performance across all possible cutoff points. Additionally, probabilistic metrics such as the Brier score and log loss assess how well the model's predicted class probabilities match the observed outcomes.
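The two probabilistic metrics named above operate on predicted probabilities rather than hard class labels. A minimal sketch, using a small hypothetical set of predicted probabilities and true labels:

```python
import math

# Hypothetical predicted probabilities of the positive class,
# and the corresponding true labels (1 = positive, 0 = negative).
p = [0.9, 0.7, 0.4, 0.2]
y = [1, 1, 0, 0]

# Brier score: mean squared difference between predicted probability
# and outcome. 0 is perfect; lower is better.
brier = sum((pi - yi) ** 2 for pi, yi in zip(p, y)) / len(y)

# Log loss (cross-entropy): penalises confident but wrong
# probabilities far more heavily than the Brier score does.
log_loss = -sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                for pi, yi in zip(p, y)) / len(y)

print(round(brier, 4), round(log_loss, 4))  # 0.075 0.299
```

Because both scores reward well-calibrated probabilities, they capture an aspect of model quality that threshold metrics ignore entirely.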
The choice of metric depends on the specific classification problem and the characteristics of the data. When dealing with imbalanced data or complex decision-making processes, using multiple metrics is recommended to gain a comprehensive understanding of the model's performance.
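The pitfall of relying on a single metric under class imbalance can be shown with a deliberately degenerate example: a classifier that always predicts the majority (negative) class. The prevalence figures here are hypothetical:

```python
# Hypothetical imbalanced dataset: 10 positives among 1000 cases.
n_pos, n_neg = 10, 990

# A trivial "always negative" classifier: no positives are ever flagged.
TP, FN, FP, TN = 0, n_pos, 0, n_neg

accuracy = (TP + TN) / (n_pos + n_neg)  # 0.99 - looks excellent
sensitivity = TP / (TP + FN)            # 0.0  - every case is missed

print(accuracy, sensitivity)  # 0.99 0.0
```

Accuracy alone would rate this useless classifier at 99%, whereas sensitivity immediately exposes that it detects no cases at all; reporting both metrics avoids the illusion.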
This paper emphasises the importance of understanding metric limitations and of selecting appropriate metrics for a specific classification task. By doing so, researchers and practitioners can ensure a more accurate and informative evaluation of their models, ultimately leading to the development of reliable tools for various applications.
License
Copyright (c) 2025 The copyright to the submitted manuscript is held by the Author, who grants the Journal of Medical Science (JMS) a nonexclusive licence to use, reproduce, and distribute the work, including for commercial purposes.

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

