A comprehensive review of classifier probability calibration metrics

Tracking #: 937-1917

Authors:

Richard Lane (ORCID: https://orcid.org/0000-0003-3741-0348)


Responsible editor: 

Francesca D. Faraci

Submission Type: 

Survey Paper

Abstract: 

Probabilities or confidence values produced by artificial intelligence (AI) and machine learning (ML) models often do not reflect their true accuracy, with some models being under- or overconfident in their predictions. For example, if a model is 80% sure of an outcome, is it correct 80% of the time? Probability calibration metrics measure the discrepancy between confidence and accuracy, providing an independent assessment of model calibration performance that complements traditional accuracy metrics. Understanding calibration is important when the outputs of multiple systems are combined, to avoid overconfident subsystems dominating the output. Such awareness also underpins assurance in safety- or business-critical contexts and builds user trust in models. This paper provides a comprehensive review of probability calibration metrics for classifier models, organizing them according to multiple groupings to highlight their relationships. We identify 94 metrics and group them into four main families: point-based, bin-based, kernel- or curve-based, and cumulative. For each metric, we catalogue properties of interest and provide equations in a unified notation, facilitating implementation and comparison by future researchers. Finally, we provide recommendations for which metrics should be used in different situations.
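The bin-based family named in the abstract includes the widely used Expected Calibration Error (ECE), which partitions predictions by confidence and compares each bin's empirical accuracy against its mean confidence. As a point of reference only, here is a minimal NumPy sketch of one common ECE formulation; the equal-width binning, function name, and toy data below are illustrative assumptions, not taken from the manuscript's unified notation:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    # Illustrative equal-width-bin ECE; this is one common variant,
    # not necessarily the formulation used in the paper.
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Assign each prediction to an equal-width bin over [0, 1];
    # a confidence of exactly 1.0 is clipped into the last bin.
    bin_ids = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            acc = correct[mask].mean()       # empirical accuracy in bin b
            conf = confidences[mask].mean()  # mean confidence in bin b
            # Weight the |accuracy - confidence| gap by the bin's share of samples.
            ece += (mask.sum() / n) * abs(acc - conf)
    return ece

# Toy check: a model that is systematically overconfident.
conf = np.array([0.95, 0.9, 0.9, 0.8, 0.8, 0.8, 0.7, 0.7])
hit  = np.array([1,    1,   0,   1,   0,   1,   0,   1  ])
print(f"ECE = {expected_calibration_error(conf, hit):.3f}")
```

A perfectly calibrated model would yield an ECE of zero; the overconfident toy model above scores roughly 0.19, reflecting the gap the abstract describes between stated confidence and achieved accuracy.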

Tags: 

  • Reviewed

Data repository URLs: 

none

Date of Submission: 

Thursday, November 27, 2025

Date of Decision: 

Friday, December 19, 2025


Decision: 

Accept

1 Comment

Meta-review by editor

Although the length of the main text has been reduced, the overall size of the manuscript—including appendices—has increased from 61 to 67 pages. Several of the appendices may not be essential for the purposes of this review. Survey papers should not exceed 16,000 words; however, the current manuscript still contains 20,893 words (excluding references, acknowledgements, declarations, and appendices, and without applying the guideline that counts each figure as 300 words). While we acknowledge that achieving such a comprehensive coverage is challenging, the manuscript appears to be excessively long. Please reduce it as much as possible, but it's fine if it remains above the 16,000 word limit.

Francesca D. Faraci (https://orcid.org/0000-0002-8720-1256)