Measuring Data Drift with the Unstable Population Indicator

Tracking #: 779-1759


Responsible editor: 

Gargi Datta

Submission Type: 

Resource Paper

Abstract: 

Measuring data drift is essential in machine learning applications, where training is necessarily done on samples strictly different from those scored later. The Kullback-Leibler divergence is a common measure of shifted probability distributions, and discretized versions have been devised to deal with binned or categorical data. We present the Unstable Population Indicator, a robust, flexible, and numerically stable discretized implementation of Jeffrey's divergence, along with a Python package that can handle continuous, discrete, ordinal, and nominal data in a variety of popular data types. We show its numerical and statistical properties in controlled experiments. We advise against employing a common cut-off to distinguish stable from unstable populations; the cut-off should instead depend on the use case.

Manuscript: 

Tags: 

  • Reviewed

Data repository URLs: 

Date of Submission: 

Monday, October 16, 2023

Date of Decision: 

Friday, December 29, 2023


Nanopublication URLs:
http://ds.kpxl.org/RAagramW3zuY74wddL8L7yWtJHXOMuCscs3HUKNb8YL50

Decision: 

Accept

Solicited Reviews:


meta-review by editor

The authors of the paper extend the Population Stability Index (PSI) to a more flexible Unstable Population Indicator (UPI), which solves PSI's problem of zero-sized bins for categorical data by adding a fraction of the population to each bin. The authors have also released a Python package for UPI, making it easily accessible to other researchers in the field. They present a comprehensive discussion of the statistical properties of UPI and how it compares to PSI and the well-known Kullback-Leibler divergence. While the modification presented in the publication is simple, it is clearly described and comprehensively evaluated; it cleverly preserves Jeffrey's divergence and is a smarter approach than adding a constant to the bin size (which can bias bins unexpectedly). The addition of the Python package makes it an easily usable metric.
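The core idea described above — a symmetrized, discretized KL divergence made numerically stable by adding a fraction of the population to each bin — can be sketched as follows. This is an illustrative reconstruction, not the API of the authors' released package; the function name `upi` and the smoothing parameter `epsilon` are assumptions for the sketch.

```python
import numpy as np

def upi(expected, actual, epsilon=1e-4):
    """Sketch of a UPI-style drift measure: Jeffrey's divergence
    (symmetrized KL divergence) between two binned distributions,
    with a small fraction added to every bin so empty bins do not
    produce infinities. Not the official package implementation."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Normalize counts to probability distributions per bin
    p = expected / expected.sum()
    q = actual / actual.sum()
    # Add a fraction to each bin and renormalize (keeps sums at 1)
    p = (p + epsilon) / (1 + epsilon * len(p))
    q = (q + epsilon) / (1 + epsilon * len(q))
    # Jeffrey's divergence: KL(p||q) + KL(q||p) = sum((p-q) * log(p/q))
    return float(np.sum((p - q) * np.log(p / q)))

# Identical binned populations yield zero drift
print(upi([10, 20, 30], [10, 20, 30]))  # 0.0
# A bin that is empty in one sample now gives a finite value
print(upi([10, 20, 0], [10, 15, 5]))
```

Note that the symmetrized form stays well defined here because both smoothed distributions are strictly positive in every bin, which is the property the review highlights over naively adding a constant count to bins of different sizes.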

Gargi Datta (https://orcid.org/0000-0002-1314-7824)