Review Details
Reviewer has chosen to be Anonymous
Overall Impression: Weak
Suggested Decision: Reject
Technical Quality of the paper: Weak
Presentation: Average
Reviewer's confidence: Medium
Significance: Low significance
Background: Comprehensive
Novelty: Limited novelty
Data availability: All used and produced data (if any) are FAIR and openly available in established data repositories
Length of the manuscript: This manuscript is too long for what it presents and should therefore be considerably shortened (below the general length limit)
Summary of paper in a few sentences:
This survey presents a unique and exhaustive description of existing metrics that have been used for calibration (not all of which were originally designed for calibration, or can measure calibration in isolation). It also provides useful background on how some of the metrics relate to each other, and summarises how previous work unifies and evaluates calibration.
Reasons to accept:
- It covers almost all the existing probability calibration metrics (and more).
- The first 8 pages and the last 5, which provide a summary of the field, are useful (but should be extended and become the main contribution of the paper).
- The table in the appendix is useful.
- Having all the equations presented in a unified manner in one document may be useful (though perhaps in an appendix, rather than as the main contribution of the survey).
- It collects quite a few insights relating different metrics (though it is sometimes unclear whether these are a contribution of this paper or come from the references).
Reasons to reject:
- The most interesting parts, which describe the field and/or provide new insights, are very short (the first 8 pages and the last 5).
- The paper is too long, and many parts contain information that is not new and does not add value. There are about 32 pages describing 82 metrics; the descriptions try to be brief, but taken together they are far too long.
- There are no new insights into how the metrics connect to each other in a general form (apart from mentioning, for each metric, whether it is related to another).
- Some claims and insights have no reference, and it is not clear whether they are new insights from the author.
- Not all of the presented metrics are actually classifier probability calibration metrics (e.g., some of them are losses that could potentially be used as metrics).
- The paper does not include new comparisons (e.g., experiments).
- There are no clear final recommendations on which metrics to use and in which cases.
Nanopublication comments:
Further comments:
- The work contains many insights for which it is not clear whether they come from previous references or are conclusions of the author of this survey.
- Page 1: What is the source of the sentence attributed to Flach [16] from an online tutorial at ECML-PKDD? Is it a transcription of the online tutorial? It needs a clearer citation.
- Page 1: “under operational conditions” → “under some/certain operational conditions”.
- Page 2: Regarding “Where available”, it would be good to explain why some equations are not available. Is it because the original papers did not include them? Are the equations difficult to obtain? Or is it for lack of space in this survey that they are not derived and included?
- Page 4: The first sentence, “An ideally calibrated classifier outputs confidence scores or predicted probabilities equal to its accuracy”, is too ambiguous, as accuracy is in general defined for a whole dataset partition (e.g., training, validation or test accuracy). In this part of the text it refers to the conditional accuracy given the model’s score (see the note after these comments).
- Fig 2: The Y-axis is labelled “Accuracy”, which is usually called “actual positive rate”, “fraction of positives”, “empirical probability”, or “observed relative frequency” instead. I understand that the accuracy this figure (and the text) refers to is the conditional accuracy given the specific scores, but I think the other terms are clearer.
- Fig 2: It shows red lines that are not explained in the text or the caption until page 10, when the Brier score is discussed. I suggest adding a note to the caption explaining them.
- Page 6: The grouping diagram is mentioned without providing context first, and it would be good to have a figure depicting it.
- Section 2.3 is titled “Multi-class issues”, but it discusses aspects other than issues; I would suggest renaming it to “Multi-class aspects”.
- Page 6: The description of multiclass calibration says “to be correct simultaneously”, which is too general. I would replace “to be correct” with a more concrete definition.
- Page 6: One paragraph starts discussing the decomposition into a set of binary sub-problems, which can be represented in the form of matrices. It would be good to introduce the reasoning for this, explain why the matrix representation is useful, and say how the matrices are used afterwards.
- Section 2.4 mentions that “using a finite dataset” may lead to a nonzero metric value. It is not obvious why that would be the case (a small illustration of the presumed reason is included after these comments).
- Section 2.6: When discussing the Naive Bayes assumptions, it may be good to state the specific assumptions (e.g., conditional independence of the features given the class).
- Page 9: When mentioning a reliable classifier that outputs 80% or 20% and has better, non-zero resolution: shouldn’t this require the performance to be kept equal to that of the previous model?
- Section 3.3: When describing the Brier score, I think the text is missing the fact that it is a proper scoring rule (a reference formula is included after these comments).
- Page 11 (this is a personal preference): When talking about a model with a confidence of zero, I find it misleading, as the model may be highly confident in the true class (confidence of one), while a confidence of zero for the non-predicted class simply indicates an output probability prediction of zero. This preference of mine extends to the rest of the paper, in which I would reserve “confidence” for the highest output probability (or the prediction). I understand that the author’s view may differ, but it may be worth thinking about.
- Page 11: It is mentioned that focal loss focuses on hard-to-classify examples. I understand this refers to examples that are close to the decision boundary, or that are far from both probability extremes. But a hard-to-classify example could be anywhere in the output-score region (e.g., an example can be hard to classify and still lie in a region of high output scores); see the formula note after these comments.
- Page 11: In the section stating that focal loss “improves calibration”, it is important to clarify in that same sentence that it is not strictly proper. This is mentioned several sentences later, but the first sentence is misleading, as it is counterintuitive that a loss that is not strictly proper improves calibration relative to a strictly proper loss (NLL).
- Page 13: MAE is “always” greater than… I would say “most of the time” or “greater than or equal to”, given that “always” is not true.
- Page 13: It is mentioned that reference [15] does not recommend MAE, one reason being that it does not “take into account the cost of different wrong decisions”. This is clearly true for most calibration metrics described in the paper. Maybe include this claim for all the metrics that do not consider costs, or mention in this section that most of the other metrics have the same problem.
- Page 13: The text mentions a “jack-knife” estimate; I am not sure I have heard this term before. It may be too informal, or it may simply be that I am not familiar with it.
- Page 15: Point metrics for isotonic regression “are not widely used for measuring classifier calibration”. I think this is an example of the paper trying to be exhaustive, which makes it too long. It is good to have the list somewhere (e.g., in an appendix), but realistically people looking into calibration metrics may want to focus on the well-established and/or useful metrics instead.
- Page 16: It is mentioned that RPS has a “hidden preference”; it would be good to clarify in what sense it is hidden, or to remove the word “hidden”.
- Page 20: It contains a couple of examples of metrics that are mentioned only for the sake of exhaustiveness, and it is not clear why anybody would want to use them (e.g., ECE-LB and ECE-SWEEP). I am not sure there is value in mentioning every single metric.
- Page 20: “the dimensions with highest variance”: does this refer to the output score of the class with the highest variance? The word “dimension” is too general.
- Page 26: The paper uses O(n) notation for the computation time of some scores, but this section only states that the “computation time for 50,000 data points is only one minute”. This information alone does not indicate the time complexity.
- Page 27: A sentence about WCR reinforces the point that the paper does not need to be exhaustive: it says that, because the metric is not widely used, it is recommended not to use it. The same could be said for most of the other 82 metrics, but it is only mentioned here.
- Section 5.6: This and other metric descriptions start without much detail: they give the name and immediately provide the equation. If a metric is worth including at all, a description should be provided before showing the equation (other examples: Sections 5.9 and 7.4).
- Page 39: This page contains examples of claims whose source is unclear. Are the following claims from previous papers, general knowledge, or part of the contribution of this paper? “A disadvantage of LAECE is…” and “The advantage of LAACE over LAECE…”.
- Page 42: If I am not mistaken, the method of using Brier curves to construct hybrid classifiers exists in previous literature. If that is the case, references should be added here; at the moment there are none.
- Page 45: “does not exclude [the] use of a metric”.
- Appendix table: Include descriptions for some of the headers (e.g., HT, UO, proper U).
- Appendix table, entry GSB: “Very crude” sounds too informal. The entry explains that the value can be zero by reasoning about the “reliability diagram”; justifying this via the diagram sounds strange to me.
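Notes referenced in the comments above:
- On the page 4 comment: the conditional form I have in mind (written in generic notation, not the paper’s own) is
    $P(Y = 1 \mid s(X) = s) = s$ (score calibration), and
    $P(\hat{Y} = Y \mid \hat{c}(X) = c) = c$ (confidence calibration),
  i.e., accuracy conditioned on the reported score or confidence rather than dataset-level accuracy.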
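- On the Section 2.4 comment: I assume the intended reason is sampling noise, since per-bin empirical frequencies fluctuate around the true probabilities in any finite sample. A minimal sketch illustrating this (hypothetical code, not from the paper; the sample size, bin count, and equal-width binning are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(0)
    n, n_bins = 10_000, 15

    # A perfectly calibrated predictor: each label is drawn with
    # exactly the predicted probability.
    p = rng.uniform(size=n)   # predicted probabilities
    y = rng.binomial(1, p)    # outcomes sampled from those probabilities

    # Equal-width binned ECE: bin-size-weighted |mean confidence - empirical frequency|.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(p, edges) - 1, 0, n_bins - 1)
    ece = sum(
        (idx == b).mean() * abs(p[idx == b].mean() - y[idx == b].mean())
        for b in range(n_bins)
        if (idx == b).any()
    )
    print(f"Binned ECE of a perfectly calibrated predictor: {ece:.4f}")
    # Prints a small but nonzero value, even though the predictor is perfectly calibrated.

  With more data or fewer bins the value shrinks, which is presumably what Section 2.4 should spell out.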
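- On the Section 3.3 comment: for reference, the standard binary Brier score (textbook definition, not quoted from the paper) is
    $\mathrm{BS} = \frac{1}{N}\sum_{i=1}^{N} (\hat{p}_i - y_i)^2$,
  which is a strictly proper scoring rule: its expected value is minimised only when the predicted probabilities equal the true class probabilities.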
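- On the page 11 focal loss comment: the standard form (Lin et al.’s definition, not the paper’s notation) is
    $\mathrm{FL}(p_t) = -(1 - p_t)^{\gamma}\log p_t$,
  where $p_t$ is the predicted probability of the true class; the modulating factor $(1 - p_t)^{\gamma}$ down-weights examples with high $p_t$, so the focus is determined by the predicted probability of the true class rather than by distance to a decision boundary.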
meta-review by editor
Submitted by Tobias Kuhn on
The reviewers have expressed significant concerns regarding the paper’s relevance and originality, particularly in light of recent comprehensive works in the same area (Silva Filho et al., 2023; Tao et al., 2024). The manuscript offers minimal incremental value; much of its content reiterates existing literature. It is also overly long, with excessive detail on numerous metrics but limited new insights, experimental comparisons, or clear recommendations. In addition, some sections lack proper references, and several of the included metrics are not strictly calibration metrics. Overall, the contribution appears modest given the existing comprehensive coverage.
If you are able to substantially improve the paper’s level and focus, we would be glad to consider it for publication.
Francesca D. Faraci (https://orcid.org/0000-0002-8720-1256)