Review Details
Reviewer has chosen to be Anonymous
Overall Impression: Good
Suggested Decision: Accept
Technical Quality of the paper: Good
Presentation: Good
Reviewer's confidence: Medium
Significance: Low significance
Background: Comprehensive
Novelty: Limited novelty
Data availability: All used and produced data (if any) are FAIR and openly available in established data repositories
Length of the manuscript: The length of this manuscript is about right
Summary of paper in a few sentences:
The paper introduces the Unstable Population Indicator (UPI) as a measure for quantifying data drift, a critical concern for models that encounter different samples in production than during training. UPI is defined as a flexible and robust implementation of Jeffrey's divergence, a symmetric, discretized version of the Kullback-Leibler divergence, designed to handle various data types, including continuous, discrete, ordinal, and nominal. It addresses the problem of bins with zero counts by adding a small quantity to each bin. The authors highlight the importance of measuring data drift in both target and feature variables, emphasizing that the choice of a cut-off value for distinguishing stable from unstable populations should be case-dependent. Numerical experiments demonstrate UPI's effectiveness in controlled scenarios, and the paper provides a Python package for practical use, offering improved flexibility and performance over existing measures.
Reasons to accept:
The paper provides a solution to the "zero count bin" problem arising in the PSI calculation. Although there may be other ways to resolve this issue, the paper offers an alternative to PSI that adds a small, data-dependent value to each bin. This analytical correction is applicable for sufficiently large population sizes.
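The zero-count issue and the style of fix described above can be illustrated with a short sketch (this is not the authors' package; the function names and toy frequencies are my own, and the correction term is a simplified reading of the paper's approach):

```python
import numpy as np

def psi(f0, f1):
    """Classic PSI: breaks down when any bin frequency is zero."""
    f0, f1 = np.asarray(f0, float), np.asarray(f1, float)
    with np.errstate(divide="ignore"):
        return float(np.sum((f1 - f0) * np.log(f1 / f0)))

def upi(f0, f1, n_tot):
    """UPI-style fix (sketch): add 1/n_tot inside the log to avoid zero bins."""
    f0, f1 = np.asarray(f0, float), np.asarray(f1, float)
    eps = 1.0 / n_tot  # small, data-dependent quantity per bin
    return float(np.sum((f1 - f0) * np.log((f1 + eps) / (f0 + eps))))

# A bin that is empty in the base sample makes PSI infinite/undefined,
# while the corrected version stays finite:
f0 = [0.5, 0.5, 0.0]
f1 = [0.4, 0.4, 0.2]
print(psi(f0, f1))               # inf
print(upi(f0, f1, n_tot=1000))   # finite
```

The key design point is that the added quantity shrinks with the population size, rather than being an arbitrary fixed constant.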
Reasons to reject:
None
Nanopublication comments:
Further comments:
Here are some notes for this paper. The following items should be reviewed and addressed in order to clarify its message:
1. "Note how the definition of PSI does not take into account any order in the bins, nor distances between the bins, which makes the measure equally suitable for categorical/nominal data, but interestingly this is rarely done. Besides, many posts and papers suggest the same, uninformed cut-off values for the PSI as a distinction between stable and shifted or drifted data sets. What counts as an important shift in your data should be strongly use case dependent and investigated on a per-feature basis."
The text above is not clear to me. What is the significance of bin order for the calculation of PSI, or for the resulting PSI values and their interpretation?
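For what it is worth, the invariance the authors seem to allude to can be shown in a few lines: PSI is a plain sum over bins, so permuting the bins leaves its value unchanged, i.e., neither order nor inter-bin distance enters the measure (toy frequencies assumed):

```python
import numpy as np

def psi(f0, f1):
    """PSI as a per-bin sum; bin order does not affect the result."""
    f0, f1 = np.asarray(f0, float), np.asarray(f1, float)
    return float(np.sum((f1 - f0) * np.log(f1 / f0)))

f0 = np.array([0.1, 0.2, 0.3, 0.4])
f1 = np.array([0.4, 0.3, 0.2, 0.1])

perm = [2, 0, 3, 1]  # reorder the bins arbitrarily
print(np.isclose(psi(f0, f1), psi(f0[perm], f1[perm])))  # True
```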
2. What exactly are the "uninformed" cut-off values? Do you mean 0.1, 0.25, etc.? Why are they uninformed? [11] suggests that PSI has a distribution, and that one can use its {95th, 99th, 99.9th, etc.} percentiles as cut-offs instead of the fixed values 0.1, 0.25, etc. That may be mentioned here.
3. Section 2.3 includes a discussion of PSI and its relation to $\chi^2$ tests; however, it is not clear why the relationship holds. It may be illuminating to state that PSI has an asymptotic $\chi^2$ distribution with $B-1$ degrees of freedom, where $B$ is the number of bins used in the calculation of PSI.
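If the asymptotic result is stated as suggested, a size-$\alpha$ cut-off follows directly from the $\chi^2$ quantile. The sketch below assumes the common scaling of the null distribution by $(1/n_0 + 1/n_1)$, with illustrative bin count and sample sizes of my own choosing:

```python
from scipy.stats import chi2

B = 10                # number of bins (illustrative)
n0, n1 = 5000, 5000   # base and target sample sizes (illustrative)
alpha = 0.05

# Under no drift, PSI is approximately (1/n0 + 1/n1) * chi^2 with B-1 df,
# so a size-alpha cut-off is the scaled chi^2 quantile:
cutoff = (1.0 / n0 + 1.0 / n1) * chi2.ppf(1 - alpha, df=B - 1)
print(cutoff)
```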
4. $$\mathrm{UPI} = \sum_{\text{bins},\,i} (f_{1,i} - f_{0,i}) \cdot \ln\left(\frac{f_{1,i} + \frac{1}{n_{tot}}}{f_{0,i} + \frac{1}{n_{tot}}}\right)$$
The formulation of UPI introduces the addition of a small fraction based on the total count of both datasets, i.e., $n_{tot}$ = the number of observations in the base and target datasets combined.
The impact of this addition for small populations is not discussed, and no warning is provided against using UPI with small population sizes.
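The small-population concern can be made concrete: for the same frequency vectors, the $1/n_{tot}$ term dominates (near-)empty bins, so the UPI value changes markedly with $n_{tot}$. A sketch of the formula above with toy numbers of my own choosing:

```python
import numpy as np

def upi(f0, f1, n_tot):
    """UPI per the formula above: 1/n_tot added inside the logarithm."""
    f0, f1 = np.asarray(f0, float), np.asarray(f1, float)
    eps = 1.0 / n_tot
    return float(np.sum((f1 - f0) * np.log((f1 + eps) / (f0 + eps))))

f0 = [0.6, 0.4, 0.0]
f1 = [0.5, 0.3, 0.2]

# Identical frequencies, very different UPI values depending on n_tot:
for n_tot in (20, 200, 2000, 20000):
    print(n_tot, round(upi(f0, f1, n_tot), 3))
```

For small $n_{tot}$ the correction term is large relative to the bin frequencies themselves, which is why a warning about small populations seems warranted.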
5. It is not clear how the high-dimensional UPI is connected with the rest of the paper.
6. On page 4, the footnote does not list the sources explicitly.
2 Comments
meta-review by editor
Submitted by Tobias Kuhn on
The authors of the paper extend the Population Stability Index (PSI) to a more flexible Unstable Population Indicator (UPI), which solves the problem of zero bin counts for categorical data in PSI by adding a fraction of the population to each bin. The authors have also released a Python package for UPI, making it easily accessible to other researchers in the field. They present a comprehensive discussion of the statistical properties of UPI and how it compares to PSI and the well-known Kullback-Leibler divergence. While the modification presented in the publication is simple, the paper is clearly written and comprehensively evaluated; the correction cleverly preserves Jeffrey's divergence and is a smarter approach than adding a constant to each bin count (which can bias bins unexpectedly). The addition of the Python package makes UPI an easily usable metric.
Gargi Datta (https://orcid.org/0000-0002-1314-7824)
Nanopublication
Submitted by Tobias Kuhn on
Please also take into account the reactions on the nanopublication, which you can find here: http://ds.kpxl.org/RAagramW3zuY74wddL8L7yWtJHXOMuCscs3HUKNb8YL50
And get in touch with us if you need help with that.
Regards,
Tobias