A comprehensive review of classifier probability calibration metrics

Tracking #: 923-1903

Authors:

Richard Lane (https://orcid.org/0000-0003-3741-0348)


Responsible editor: 

Francesca D. Faraci

Submission Type: 

Survey Paper

Abstract: 

Probabilities or confidence values produced by artificial intelligence (AI) and machine learning (ML) models often do not reflect their true accuracy, with some models being under- or over-confident in their predictions. For example, if a model is 80% sure of an outcome, is it correct 80% of the time? Probability calibration metrics measure the discrepancy between confidence and accuracy, providing an independent assessment of model calibration performance that complements traditional accuracy metrics. Understanding calibration is important when the outputs of multiple systems are combined, for assurance in safety- or business-critical contexts, and for building user trust in models. This paper provides a comprehensive review of probability calibration metrics for classifier and object detection models, organising them according to several categorisations to highlight their relationships. We identify 82 major metrics, which can be grouped into four classifier families (point-based, bin-based, kernel- or curve-based, and cumulative) and an object detection family. For each metric, we provide equations where available, facilitating implementation and comparison by future researchers.
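
To make the abstract's notion concrete, below is a minimal sketch (not taken from the paper) of one widely used member of the bin-based family, the expected calibration error (ECE): predictions are grouped into equal-width confidence bins, and the metric is the sample-weighted average gap between each bin's empirical accuracy and its mean confidence. The function name, the choice of ten equal-width bins, and the synthetic test data are illustrative assumptions.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted average of |accuracy - mean confidence| per bin.

    Illustrative sketch; naming and binning choices are assumptions, not from the paper.
    confidences : array of predicted-class probabilities in [0, 1]
    correct     : boolean array, True where the prediction was right
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for i in range(n_bins):
        lo, hi = edges[i], edges[i + 1]
        # half-open bins (lo, hi]; the first bin also includes 0 exactly
        mask = (confidences > lo) & (confidences <= hi)
        if i == 0:
            mask |= confidences == 0.0
        if not mask.any():
            continue
        acc = correct[mask].mean()        # empirical accuracy in this bin
        conf = confidences[mask].mean()   # mean predicted confidence in this bin
        ece += (mask.sum() / n) * abs(acc - conf)
    return ece

# The abstract's example: a model that reports 80% confidence and is
# correct 80% of the time is well calibrated, so its ECE is near zero.
rng = np.random.default_rng(0)
conf = np.full(5000, 0.8)
hits = rng.random(5000) < 0.8
print(expected_calibration_error(conf, hits))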


Tags: 

  • Reviewed

Data repository URLs: 

N/A

Date of Submission: 

Friday, July 11, 2025

Date of Decision: 

Tuesday, November 11, 2025



Decision: 

Undecided

Solicited Reviews:



meta-review by editor

The reviewers have expressed significant concerns regarding the paper’s relevance and originality, particularly in light of recent comprehensive works in the same area (Silva Filho et al., 2023; Tao et al., 2024). The manuscript offers minimal incremental value; much of its content reiterates existing literature. It is also overly long, with excessive detail on numerous metrics but limited new insight, experimental comparison, or clear recommendations. In addition, some sections lack proper references, and several of the metrics included are not strictly calibration metrics. Overall, the contribution appears modest given the existing comprehensive coverage.

If you are able to substantially improve the paper’s level and focus, we would be glad to consider it for publication.

Francesca D. Faraci (https://orcid.org/0000-0002-8720-1256)