A comprehensive review of classifier probability calibration metrics

Tracking #: 937-1917

Authors:

Richard Lane (ORCID: https://orcid.org/0000-0003-3741-0348)


Responsible editor: 

Francesca D. Faraci

Submission Type: 

Survey Paper

Abstract: 

Probabilities or confidence values produced by artificial intelligence (AI) and machine learning (ML) models often do not reflect their true accuracy, with some models being under- or overconfident in their predictions. For example, if a model is 80% sure of an outcome, is it correct 80% of the time? Probability calibration metrics measure the discrepancy between confidence and accuracy, providing an independent assessment of model calibration performance that complements traditional accuracy metrics. Understanding calibration is important when the outputs of multiple systems are combined, to avoid overconfident subsystems dominating the output. Such awareness also underpins assurance in safety- or business-critical contexts and builds user trust in models. This paper provides a comprehensive review of probability calibration metrics for classifier models, organizing them according to multiple groupings to highlight their relationships. We identify 94 metrics and group them into four main families: point-based, bin-based, kernel- or curve-based, and cumulative. For each metric, we catalogue properties of interest and provide equations in a unified notation, facilitating implementation and comparison by future researchers. Finally, we provide recommendations for which metrics should be used in different situations.
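The bin-based family named in the abstract includes the widely used Expected Calibration Error (ECE), which partitions predictions by confidence and compares each bin's empirical accuracy against its mean confidence. As a point of reference only, here is a minimal NumPy sketch of one common ECE formulation; the equal-width binning, function name, and toy data below are illustrative assumptions, not taken from the manuscript's unified notation:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    # Illustrative equal-width-bin ECE; this is one common variant,
    # not necessarily the formulation used in the paper.
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Assign each prediction to an equal-width bin over [0, 1];
    # a confidence of exactly 1.0 is clipped into the last bin.
    bin_ids = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            acc = correct[mask].mean()       # empirical accuracy in bin b
            conf = confidences[mask].mean()  # mean confidence in bin b
            # Weight the |accuracy - confidence| gap by the bin's share of samples.
            ece += (mask.sum() / n) * abs(acc - conf)
    return ece

# Toy check: a model that is systematically overconfident.
conf = np.array([0.95, 0.9, 0.9, 0.8, 0.8, 0.8, 0.7, 0.7])
hit  = np.array([1,    1,   0,   1,   0,   1,   0,   1  ])
print(f"ECE = {expected_calibration_error(conf, hit):.3f}")
```

A perfectly calibrated model would yield an ECE of zero; the overconfident toy model above scores roughly 0.19, reflecting the gap the abstract describes between stated confidence and achieved accuracy.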

Tags: 

  • Reviewed

Data repository URLs: 

none

Date of Submission: 

Thursday, November 27, 2025

Date of Decision: 

Friday, December 19, 2025


Decision: 

Accept

1 Comment

Meta-review by editor

Although the length of the main text has been reduced, the overall size of the manuscript—including appendices—has increased from 61 to 67 pages. Several of the appendices may not be essential for the purposes of this review. Survey papers should not exceed 16,000 words; however, the current manuscript still contains 20,893 words (excluding references, acknowledgements, declarations, and appendices, and without applying the guideline that counts each figure as 300 words). While we acknowledge that achieving such a comprehensive coverage is challenging, the manuscript appears to be excessively long. Please reduce it as much as possible, but it's fine if it remains above the 16,000 word limit.

Francesca D. Faraci (https://orcid.org/0000-0002-8720-1256)