The Toolbox for Rating Diagnostic Tests: A Guide to Classification Metrics
DOI:
https://doi.org/10.20883/medical.e1474
Keywords:
statistical analysis, binary classification, prediction model, diagnostic test, ROC curve, model evaluation
Abstract
Evaluating a classifier's performance is critical for its successful application. This paper explores various metrics used for binary classification tasks, highlighting their strengths and limitations.
Simple threshold metrics, such as Accuracy and Sensitivity, are easy to compute once predictions have been dichotomised at a single cutoff point. However, they describe performance at that one threshold only, and they can be misleading when the data are imbalanced.
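As a minimal sketch of the point above, the following Python snippet computes Accuracy, Sensitivity and Specificity at one cutoff. The function name `threshold_metrics`, the sample data and the cutoff values are illustrative assumptions, not taken from the paper.

```python
def threshold_metrics(y_true, scores, cutoff=0.5):
    """Confusion-matrix metrics at one cutoff (score >= cutoff means 'positive')."""
    tp = fp = tn = fn = 0
    for truth, score in zip(y_true, scores):
        predicted = 1 if score >= cutoff else 0
        if predicted == 1 and truth == 1:
            tp += 1
        elif predicted == 1 and truth == 0:
            fp += 1
        elif predicted == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        # Guard against empty classes so the sketch never divides by zero.
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# Hypothetical imbalanced sample: 5 diseased subjects among 100.
y_true = [1] * 5 + [0] * 95
scores = [0.4] * 5 + [0.1] * 95

at_high_cutoff = threshold_metrics(y_true, scores, cutoff=0.5)
at_low_cutoff = threshold_metrics(y_true, scores, cutoff=0.3)
print(at_high_cutoff)
print(at_low_cutoff)
```

With these made-up scores, the 0.5 cutoff labels every subject negative: Accuracy is still 0.95 although Sensitivity is 0. Lowering the cutoff to 0.3 changes both values, which illustrates the dependence on a single threshold.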
For more robust evaluation, ranking metrics such as Receiver Operating Characteristic (ROC) and Precision-Recall (PR) curves provide a threshold-agnostic approach, enabling comparison across different cutoff points. Additionally, probabilistic metrics like Brier Score and Log Loss assess the model's ability to predict class probabilities.
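The difference between ranking and probabilistic metrics can be sketched in plain Python as follows. The area under the ROC curve is computed here through its rank interpretation (the share of positive-negative pairs in which the positive case scores higher, with ties counted as one half), and the Brier Score and Log Loss use their standard definitions. The function names and the four-subject sample are hypothetical and serve illustration only.

```python
import math


def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier_score(y_true, probs):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, probs)) / len(y_true)


def log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-likelihood; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for t, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)


# Hypothetical predicted probabilities for four subjects.
y_true = [1, 1, 0, 0]
probs = [0.9, 0.4, 0.6, 0.1]
halved = [p / 2 for p in probs]  # same ranking, different probability values

auc_original = roc_auc(y_true, probs)
auc_halved = roc_auc(y_true, halved)
brier_original = brier_score(y_true, probs)
brier_halved = brier_score(y_true, halved)
loss_original = log_loss(y_true, probs)

print(auc_original, auc_halved)
print(brier_original, brier_halved)
print(loss_original)
```

Halving every probability leaves the ordering of subjects, and therefore the AUC, unchanged, whereas the Brier Score changes because it evaluates the probability values themselves.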
The choice of metric depends on the specific classification problem and the characteristics of the data. When dealing with imbalanced data or complex decision-making processes, using multiple metrics is recommended to gain a comprehensive understanding of the model's performance.
This paper emphasises the importance of understanding metric limitations and of selecting appropriate metrics for a specific classification task. By doing so, researchers and practitioners can ensure a more accurate and informative evaluation of their models, ultimately leading to the development of reliable tools for various applications.
License
Copyright (c) 2025 The copyright to the submitted manuscript is held by the Author, who grants the Journal of Medical Science (JMS) a nonexclusive licence to use, reproduce, and distribute the work, including for commercial purposes.

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

