Review Details
Reviewer has chosen to be Anonymous
Overall Impression: Weak
Suggested Decision: Reject
Technical Quality of the paper: Weak
Presentation: Average
Reviewer's confidence: Medium
Significance: Low significance
Background: Comprehensive
Novelty: Limited novelty
Data availability: All used and produced data (if any) are FAIR and openly available in established data repositories
Length of the manuscript: This manuscript is too long for what it presents and should therefore be considerably shortened (below the general length limit)
Summary of paper in a few sentences:
This survey presents a unique and exhaustive description of existing metrics that have been used for calibration (not all of which were originally designed for calibration, or can measure calibration in isolation). It also provides useful background on how some of the metrics relate to each other, and summarises how previous work unifies and evaluates calibration.
Reasons to accept:
- It covers almost all the existing probability calibration metrics (and more).
- The first 8 pages and the last 5, which provide a summary of the field, are useful (but should be extended and become the main contribution of the paper).
- The table in the appendix is useful.
- Having all the equations presented in a unified manner in one document may be useful (though perhaps in an appendix, rather than as the main contribution of the survey).
- It collects quite a few insights relating different metrics (though it is sometimes unclear whether these are a contribution of this paper or come from the references).
Reasons to reject:
- The most interesting parts, which describe the field and/or provide new insights, are very short (the first 8 pages and the last 5).
- The paper is too long, and many parts contain information that is not new and does not add value. There are about 32 pages describing 82 metrics; the descriptions try to be brief, but taken together they are far too long.
- There are no new insights into how the metrics connect to each other in a general form (apart from mentioning, for each metric, whether it is related to another).
- Some claims and insights have no reference, and it is not clear whether they are new insights from the author.
- Not all of the presented metrics are actually classifier probability calibration metrics (e.g., some of them are losses that could potentially be used as metrics).
- The paper does not include new comparisons (e.g., experiments).
- There are no clear final recommendations on which metrics to use and in which cases.
Nanopublication comments:
Further comments:
- The work contains many insights for which it is not clear whether they come from previous references or are conclusions of the author of this survey.
- Page 1: What is the source of the sentence attributed to Flach [16] from an online tutorial at ECML-PKDD? Is it a transcription of the online tutorial? It needs a clearer citation.
- Page 1: “under operational conditions” → “under some/certain operational conditions”.
- Page 2: Regarding “Where available”, it would be good to explain why some equations are not available. Is it because the original papers did not include them? Are the equations difficult to obtain? Or is it for lack of space in this survey that they are not derived and included?
- Page 4: The first sentence, “An ideally calibrated classifier outputs confidence scores or predicted probabilities equal to its accuracy”, is too ambiguous, as accuracy is in general defined for a whole dataset partition (e.g., training, validation or test accuracy). In this part of the text it refers to the conditional accuracy given the model’s score (see the note after these comments).
- Fig 2: The Y-axis is labelled “Accuracy”, which is usually called “actual positive rate”, “fraction of positives”, “empirical probability”, or “observed relative frequency” instead. I understand that the accuracy this figure (and the text) refers to is the conditional accuracy given the specific scores, but I think the other terms are clearer.
- Fig 2: It shows red lines that are not explained in the text or the caption until page 10, when the Brier score is discussed. I suggest adding a note to the caption explaining them.
- Page 6: The grouping diagram is mentioned without providing context first, and it would be good to have a figure depicting it.
- Section 2.3 is titled “Multi-class issues”, but it discusses aspects other than issues; I would suggest renaming it to “Multi-class aspects”.
- Page 6: The description of multiclass calibration says “to be correct simultaneously”, which is too general. I would replace “to be correct” with a more concrete definition.
- Page 6: One paragraph starts discussing the decomposition into a set of binary sub-problems, which can be represented in the form of matrices. It would be good to introduce the reasoning for this, explain why the matrix representation is useful, and say how the matrices are used afterwards.
- Section 2.4 mentions that “using a finite dataset” may lead to a nonzero metric value. It is not obvious why that would be the case (a small illustration of the presumed reason is included after these comments).
- Section 2.6: When discussing the Naive Bayes assumptions, it may be good to state the specific assumptions (e.g., conditional independence of the features given the class).
- Page 9: When mentioning a reliable classifier that outputs 80% or 20% and has better, non-zero resolution: shouldn’t this require the performance to be kept equal to that of the previous model?
- Section 3.3: When describing the Brier score, I think the text is missing the fact that it is a proper scoring rule (a reference formula is included after these comments).
- Page 11 (this is a personal preference): When talking about a model with a confidence of zero, I find it misleading, as the model may be highly confident in the true class (confidence of one), while a confidence of zero for the non-predicted class simply indicates an output probability prediction of zero. This preference of mine extends to the rest of the paper, in which I would reserve “confidence” for the highest output probability (or the prediction). I understand that the author’s view may differ, but it may be worth thinking about.
- Page 11: It is mentioned that focal loss focuses on hard-to-classify examples. I understand this refers to examples that are close to the decision boundary, or that are far from both probability extremes. But a hard-to-classify example could be anywhere in the output-score region (e.g., an example can be hard to classify and still lie in a region of high output scores); see the formula note after these comments.
- Page 11: In the section stating that focal loss “improves calibration”, it is important to clarify in that same sentence that it is not strictly proper. This is mentioned several sentences later, but the first sentence is misleading, as it is counterintuitive that a loss that is not strictly proper improves calibration relative to a strictly proper loss (NLL).
- Page 13: MAE is “always” greater than… I would say “most of the time” or “greater than or equal to”, given that “always” is not true.
- Page 13: It is mentioned that reference [15] does not recommend MAE, one reason being that it does not “take into account the cost of different wrong decisions”. This is clearly true for most calibration metrics described in the paper. Maybe include this claim for all the metrics that do not consider costs, or mention in this section that most of the other metrics have the same problem.
- Page 13: The text mentions a “jack-knife” estimate; I am not sure I have heard this term before. It may be too informal, or it may simply be that I am not familiar with it.
- Page 15: Point metrics for isotonic regression “are not widely used for measuring classifier calibration”. I think this is an example of the paper trying to be exhaustive, which makes it too long. It is good to have the list somewhere (e.g., in an appendix), but realistically people looking into calibration metrics may want to focus on the well-established and/or useful metrics instead.
- Page 16: It is mentioned that RPS has a “hidden preference”; it would be good to clarify in what sense it is hidden, or to remove the word “hidden”.
- Page 20: It contains a couple of examples of metrics that are mentioned only for the sake of exhaustiveness, and it is not clear why anybody would want to use them (e.g., ECE-LB and ECE-SWEEP). I am not sure there is value in mentioning every single metric.
- Page 20: “the dimensions with highest variance”: does this refer to the output score of the class with the highest variance? The word “dimension” is too general.
- Page 26: The paper uses O(n) notation for the computation time of some scores, but this section only states that the “computation time for 50,000 data points is only one minute”. This information alone does not indicate the time complexity.
- Page 27: A sentence about WCR reinforces the point that the paper does not need to be exhaustive: it says that, because the metric is not widely used, it is recommended not to use it. The same could be said for most of the other 82 metrics, but it is only mentioned here.
- Section 5.6: This and other metric descriptions start without much detail: they give the name and immediately provide the equation. If a metric is worth including at all, a description should be provided before showing the equation (other examples: Sections 5.9 and 7.4).
- Page 39: This page contains examples of claims whose source is unclear. Are the following claims from previous papers, general knowledge, or part of the contribution of this paper? “A disadvantage of LAECE is…” and “The advantage of LAACE over LAECE…”.
- Page 42: If I am not mistaken, the method of using Brier curves to construct hybrid classifiers exists in previous literature. If that is the case, references should be added here; at the moment there are none.
- Page 45: “does not exclude [the] use of a metric”.
- Appendix table: Include descriptions for some of the headers (e.g., HT, UO, proper U).
- Appendix table, entry GSB: “Very crude” sounds too informal. The entry explains that the value can be zero by reasoning about the “reliability diagram”; justifying this via the diagram sounds strange to me.
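Notes referenced in the comments above:
- On the page 4 comment: the conditional form I have in mind (written in generic notation, not the paper’s own) is
    $P(Y = 1 \mid s(X) = s) = s$ (score calibration), and
    $P(\hat{Y} = Y \mid \hat{c}(X) = c) = c$ (confidence calibration),
  i.e., accuracy conditioned on the reported score or confidence rather than dataset-level accuracy.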
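- On the Section 2.4 comment: I assume the intended reason is sampling noise, since per-bin empirical frequencies fluctuate around the true probabilities in any finite sample. A minimal sketch illustrating this (hypothetical code, not from the paper; the sample size, bin count, and equal-width binning are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(0)
    n, n_bins = 10_000, 15

    # A perfectly calibrated predictor: each label is drawn with
    # exactly the predicted probability.
    p = rng.uniform(size=n)   # predicted probabilities
    y = rng.binomial(1, p)    # outcomes sampled from those probabilities

    # Equal-width binned ECE: bin-size-weighted |mean confidence - empirical frequency|.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(p, edges) - 1, 0, n_bins - 1)
    ece = sum(
        (idx == b).mean() * abs(p[idx == b].mean() - y[idx == b].mean())
        for b in range(n_bins)
        if (idx == b).any()
    )
    print(f"Binned ECE of a perfectly calibrated predictor: {ece:.4f}")
    # Prints a small but nonzero value, even though the predictor is perfectly calibrated.

  With more data or fewer bins the value shrinks, which is presumably what Section 2.4 should spell out.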
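- On the Section 3.3 comment: for reference, the standard binary Brier score (textbook definition, not quoted from the paper) is
    $\mathrm{BS} = \frac{1}{N}\sum_{i=1}^{N} (\hat{p}_i - y_i)^2$,
  which is a strictly proper scoring rule: its expected value is minimised only when the predicted probabilities equal the true class probabilities.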
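- On the page 11 focal loss comment: the standard form (Lin et al.’s definition, not the paper’s notation) is
    $\mathrm{FL}(p_t) = -(1 - p_t)^{\gamma}\log p_t$,
  where $p_t$ is the predicted probability of the true class; the modulating factor $(1 - p_t)^{\gamma}$ down-weights examples with high $p_t$, so the focus is determined by the predicted probability of the true class rather than by distance to a decision boundary.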
meta-review by editor
Submitted by Tobias Kuhn on
The reviewers have expressed significant concerns regarding the paper’s relevance and originality, particularly in light of recent comprehensive works in the same area (Silva Filho et al., 2023; Tao et al., 2024). The manuscript offers minimal incremental value; much of its content reiterates existing literature. It is also overly long, with excessive detail on numerous metrics but limited new insights, experimental comparisons, or clear recommendations. In addition, some sections lack proper references, and several of the included metrics are not strictly calibration metrics. Overall, the contribution appears modest given the existing comprehensive coverage.
If you are able to substantially improve the paper’s level and focus, we would be glad to consider it for publication.
Francesca D. Faraci (https://orcid.org/0000-0002-8720-1256)