Data Shift in Machine Learning: What It Is and How to Detect It

The problem

Let's suppose that John works as a Data Scientist for a bank. His manager tasks him with creating a model for predicting Probability of Default for home loans. John has some intuition about what input variables he needs and what the output variable is. He reaches out to the Data Engineer, Nicky, and asks her for the data. Sure enough, he gets the required data, performs his Exploratory Data Analysis and then selects a proper classification algorithm to apply.

To cut a long story short, John has now developed a fancy Machine Learning model to predict Probability of Default for any given client. Maybe he used a really complicated algorithm like a Deep Neural Network, or maybe he went with a simpler one like Logistic Regression because the Regulations department is not very eager to adopt black-box methods. He used all the feature selection methods he knew to reduce the feature space and the model variance. He cross-validated his model within the training data and also tested its performance on a hold-out test set. He was able to achieve a good model performance of 80% accuracy on the binary prediction. Now, accuracy is probably the worst metric to check – in reality such an institution would care more about Recall – but let's stick with accuracy for the sake of simplicity. He presents these findings with a fancy bar chart to his manager, the manager is really happy and presents the results to higher management, and the project gets the green light for deployment.

So far so good. But all this happened in November 2019. The model was deployed at that time and worked well for a couple of months. But then it started deteriorating – little by little, month by month. By March 2020 the model's accuracy had dropped to 60%. John could not deliver what he had promised. So what happened? Why did the model perform well at first but deteriorate heavily in the long run?


Data shift

As its name suggests, a data shift occurs when there is a change in the data distribution. When building a Machine Learning model, one tries to unearth the (possibly non-linear) relations between the input and the target variable. After creating a model on this data, one might then feed it new data of the same distribution and expect to get similar results. But in real-world cases this is rarely true. Changing consumer habits, technology breakthroughs, political, socioeconomic and other unpredictable factors can dramatically change either a) the input dataset, b) the target variable, or c) the underlying patterns and relations between the input and output data. Each of these situations has a distinct name in Data Science, but they all lead to the same thing: model performance degradation.

Data shift, data drift, concept shift, changing environments and data fractures are all similar terms describing the same phenomenon: a different distribution of data between the train and test sets.

So, in John's case, what really happened is that in the few months after he deployed his model, a very unpredictable thing happened: a global pandemic caused by a deadly new virus forced his country's government to impose a lockdown, temporarily shutting down enterprises and heavily reducing economic activity. These major changes affected the behavior of the bank's clients in repaying their loans: either they could not do so because their revenue streams shrank, or they did not want to because the government granted a 3-month grace period on loan repayments. So what type of data shift did John face? Type (a)? Type (b)? Maybe (c)?

Let's give a more formal definition of each type of data shift and then return to John's problem. We will also see what he could do to proactively check whether a data shift is present in the data, and what to do once a data shift is identified.

Formal definitions

1. Covariate shift

Covariate shift is the change of distributions in one or more of the independent variables (input features).

Definition 1. Covariate shift is the situation where Ptrain(Y|X) = Ptest(Y|X) but Ptrain(X) ≠ Ptest(X).


Covariate shift illustration. Source: [1]

Covariate shift may happen due to a changing environment that affects the input variables but not the target variable. In our Probability of Default example with Data Scientist John, this could mean that, due to the pandemic, many businesses closed, their revenues decreased, their headcount shrank, etc.; however, they kept paying their loans because they were afraid that the bank might seize their houses (different distributions for the X variables but the same distribution of Y).
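
To make this concrete, here is a minimal simulated sketch (the feature, the numbers and the logistic relation are purely illustrative): the input distribution moves between the two periods, while the conditional relation between X and Y stays exactly the same.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical single feature: a client's monthly revenue (in thousands)
x_train = rng.normal(loc=50, scale=10, size=5000)   # revenues before the shock
x_test = rng.normal(loc=35, scale=10, size=5000)    # revenues after the shock

# The conditional relation P(Y|X) is identical in both periods:
# lower revenue means a higher probability of default
def p_default(x):
    return 1 / (1 + np.exp(0.1 * (x - 40)))

y_train = rng.binomial(1, p_default(x_train))
y_test = rng.binomial(1, p_default(x_test))

print(x_train.mean(), x_test.mean())   # ~50 vs ~35: P(X) has changed
print(y_train.mean(), y_test.mean())   # default rate rises only because X moved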

Let's proceed with the other two cases.

2. Prior probability shift

Prior probability shift can be thought of as the exact opposite of covariate shift: it is the case where the input feature distributions remain the same but the distribution of the target variable changes.

Definition 2. Prior probability shift is the situation where Ptrain(X|Y) = Ptest(X|Y) but Ptrain(Y) ≠ Ptest(Y).


Prior probability shift: histogram of target variable Y

A prior probability shift can occur in cases where, although the input variables remain the same, our target variable changes. In our Probability of Default example, John may be facing the following situation: there could be some companies that were not really affected by the lockdown and have not suffered any revenue losses (e.g. pharmacies), but they deliberately chose not to repay their loan installments in order to save some money in anticipation of worse days, or because they expect that the government may subsidize the loans of all companies (same X distribution but different Y).

3. Concept drift

A concept drift happens when the relations between the input and output variables change. So we are no longer focusing only on the X variables or only on the Y variable, but on the relations between them.

Definition 3. Concept drift is the situation where Ptrain(Y|X) ≠ Ptest(Y|X).


Concept drift in time series problems: airplane passengers. Source: [2]

A concept drift may happen in situations where the data is truly temporal and thus depends heavily on time. For example, we might have built a machine learning model to predict the daily number of flights at some airport. Due to an economic boom and other variables that have not been accounted for in the model (latent variables), our target variable keeps changing over time (the time series presents a trend). Another example is selection bias. This can happen when the selected train sample does not cover the full data distribution (a common caveat in questionnaires and statistical surveys).
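
For contrast with the covariate shift sketch above, here is a minimal simulated sketch of concept drift (again with purely illustrative numbers): the input distribution stays put, but the relation mapping X to Y changes between the two periods.

import numpy as np

rng = np.random.default_rng(1)

# Same hypothetical feature with the SAME distribution in both periods
x_train = rng.normal(loc=50, scale=10, size=5000)
x_test = rng.normal(loc=50, scale=10, size=5000)

# The conditional relation itself changes between the periods:
# after the drift, the same revenue level maps to a much higher default risk
p_before = 1 / (1 + np.exp(0.1 * (x_train - 40)))
p_after = 1 / (1 + np.exp(0.1 * (x_test - 55)))

y_train = rng.binomial(1, p_before)
y_test = rng.binomial(1, p_after)

print(x_train.mean(), x_test.mean())   # nearly identical: P(X) unchanged
print(y_train.mean(), y_test.mean())   # very different: P(Y|X) has drifted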

Solutions

1. Covariate shift

To detect covariate shifts we are going to use a simple but clever programmatic trick. We are actually going to deploy a machine learning solution for this goal. But this time, instead of trying to predict the target variable (whatever that is), we will build a classifier that will try to distinguish between the train and test sets. Seems confusing? It's really not – follow along and it will all make sense.

Approach description

Let's build on John's work. John initially trained a model on some data. We will call this data…the train set. He then deployed the model, and every month he infers on a new dataset which shall be called…the test set. In order to check whether a given test set is vastly different from the train set, we are going to create a new dummy variable called 'is_train'. This variable will contain all ones (1) in the train set and all zeroes (0) in the test set. We will then use every independent variable, in turn and on its own, to try to predict the new target variable 'is_train'. If an input variable is able to predict the new dummy variable, i.e. to separate the train from the test set, this means that this variable presents covariate shift between the train and test sets and must be taken care of. In bullet form, we are going to follow these steps in the next code chunk:

  • Create new variable with ones in train set and zeroes in test set.
  • Merge the two sets and shuffle randomly.
  • Split the merged set for evaluation (the code below uses 2-fold cross-validation for this)
  • For each single input variable:
    • Fit a simple classifier (e.g. Random Forests)
    • Predict ‘is_train’ variable
    • Calculate AUC
    • If AUC exceeds some threshold (e.g. 80%), feature displays data shift

A similar solution is to fit the model using ALL the input variables at once and then use a model interpretability technique (such as SHAP) to examine the impact of each variable on the prediction, thereby understanding which ones present covariate shift.
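
Just for illustration, a rough sketch of that all-features variant could look like the following. It assumes the shap package is installed and that combi and y are built exactly as in the code chunk further below; it is a sketch, not a definitive recipe.

import shap
from sklearn.ensemble import RandomForestClassifier

# combi: combined train+test features, y: the 'is_train' dummy (see code below)
clf = RandomForestClassifier(n_estimators=100, max_depth=5)
clf.fit(combi, y)

explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(combi)

# Features with the largest mean |SHAP| value are the ones the classifier
# leans on to tell train from test, i.e. the likeliest covariate-shift culprits
shap.summary_plot(shap_values, combi)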

Code

The following chunk of code does exactly what was described above: it creates a new variable to characterize the train and test sets and then tries to distinguish between them, using each variable on its own, iteratively.


import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

# `train` and `test` are the original train and inference dataframes

# Create a new y label to detect covariate shift
train['is_train'] = 1
test['is_train'] = 0

# Extract random samples of equal size from the train and test sets
training = train.sample(7000, random_state=12)
testing = test.sample(7000, random_state=11)

## combine the random samples
combi = pd.concat([training, testing])
y = combi['is_train']
combi.drop('is_train', axis=1, inplace=True)

## modelling
model = RandomForestClassifier(n_estimators=50, max_depth=5, min_samples_leaf=5)
drop_list = []
score_list = []
for temp, i in enumerate(combi.columns[:50]):   # check the first 50 features as an illustration
    # Try to predict 'is_train' from this single feature
    score = cross_val_score(model, pd.DataFrame(combi[i]), y, cv=2, scoring='roc_auc')
    if np.mean(score) > 0.8:
        drop_list.append(i)
        score_list.append(np.mean(score))
    print('checking feature no', temp, 'out of', len(combi.columns))
    print(i, np.mean(score))

Running this code will output something like…

checking feature no 0 out of 2238
coupon_id 0.8002228979591837
checking feature no 1 out of 2238
customer_id 0.5554701428571429
checking feature no 2 out of 2238
age_range 0.532001693877551
checking feature no 3 out of 2238
marital_status 0.521186693877551

…depending on your specific features. We can then target the features with high predictive power (e.g. AUC > 80%) to be taken care of, either by removing them entirely or by re-weighting; a sketch of the re-weighting option follows below.
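
One way to implement the re-weighting option is importance weighting: use a domain classifier fitted on all features (as in the SHAP sketch earlier) to estimate how "test-like" each training row is, and weight rows accordingly when re-fitting the actual model. The sketch below assumes combi, y and train as defined above; the final_model line is only indicative.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Domain classifier: predicts whether a row comes from the train set (1) or test set (0)
domain_clf = RandomForestClassifier(n_estimators=100, max_depth=5)
domain_clf.fit(combi, y)

# Probability that each original training row "looks like" a train-set row
p_train = domain_clf.predict_proba(train[combi.columns])[:, 1]
p_train = np.clip(p_train, 0.01, 0.99)   # avoid extreme ratios

# Importance weight: rows that resemble the test set get a larger weight
sample_weights = (1 - p_train) / p_train

# These weights can then be passed to the downstream model, e.g.:
# final_model.fit(X_train, y_train, sample_weight=sample_weights)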

2. Prior probability shift

Approach description

Detecting prior probability shift is really straightforward. The first way to detect it is simply to plot a histogram of the target variable Y's frequencies in the train and test sets. This gives a quick visual insight into how different our dependent variable looks. As a second step, to confirm our visual findings, we can apply a statistical test of mean difference (t-test, ANOVA, etc.). Let's see how these apply below.

Code

You can run the following script to produce two normal distributions. This is only to simulate some target variable Y in the train and test sets.

import numpy as np
import matplotlib.pyplot as plt

# Simulated target variable in the train set
mean, std = 0.5, 0.15
f1 = np.random.normal(mean, std, 1000)
count, bins, ignored = plt.hist(f1, 30, density=True)
plt.plot(bins, 1/(std * np.sqrt(2 * np.pi)) *
         np.exp(-(bins - mean)**2 / (2 * std**2)),
         linewidth=2, color='r')

# Simulated target variable in the test set (shifted mean)
mean, std = 1.0, 0.15
f2 = np.random.normal(mean, std, 1000)
count, bins, ignored = plt.hist(f2, 30, density=True)
plt.plot(bins, 1/(std * np.sqrt(2 * np.pi)) *
         np.exp(-(bins - mean)**2 / (2 * std**2)),
         linewidth=2, color='r')

plt.show()

Running this script will produce the following chart:

Target variable distribution change. Blue: train Y. Orange: test Y

We can observe with the naked eye that the distributions of the two variables are different (blue is the train-set target variable, orange is the test-set target variable). Now let's apply a proper statistical test to confirm (or reject) what we see. For this goal we are going to use Python's handy scipy package. Quite simply, we execute:

from scipy.stats import ttest_ind

# f1, f2: the two simulated target samples from the script above
ttest_ind(f1, f2)

The statistical test has a null hypothesis of equal means. If we get a p-value below a certain threshold (e.g. 0.05), we can reject the null hypothesis of equal means for the two samples at the 95% confidence level. For more details please check [3]. Sure enough, after running these two lines of code you should get a really small number, which indicates that our target variable has shifted from its original distribution. Of course, in this particular example it is supposed to be that way, because I deliberately constructed the two variables to be distinct. But real-world examples present similar behavior and can be uncovered in exactly the same fashion.
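
One extra check, not strictly needed for this simulated example: the t-test only compares means, so a target that changes shape while keeping its mean would slip through. A distribution-sensitive test such as the two-sample Kolmogorov–Smirnov test (also available in scipy) covers that case.

from scipy.stats import ks_2samp

# f1, f2: the two target samples as above
stat, p_value = ks_2samp(f1, f2)
if p_value < 0.05:
    print('Target distribution has likely shifted, KS p-value:', p_value)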

3. Concept drift

Approach description

As already discussed, a concept drift generally arises when the relations between the X and Y variables change. This might be the effect of seasonality patterns, especially in temporal data. For example, someone may be trying to fit a Machine Learning model on sales data across a period of time. Sales might display seasonality at multiple levels, such as daily, weekly or monthly. If the model is not fitted taking all seasonality effects into account, it might not generalize well in the future. To solve this problem we need either to de-trend the time series and work on its stationary part, or to use a more sophisticated technique like time-series cross-validation. Diagrammatically, the latter looks like this:


Time-series split for temporal data illustration

I include a simple script below from sklearn to illustrate this.

Code

from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.linear_model import LogisticRegression

# X_train, y_train: your temporally ordered features and target
time_split = TimeSeriesSplit(n_splits=10)
logit = LogisticRegression(C=1, random_state=17, solver='liblinear')
cv_scores = cross_val_score(logit, X_train, y_train, cv=time_split,
                            scoring='roc_auc', n_jobs=1)

For more information on time series cross validation please take a look at [6] and [7].
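
As for the de-trending option mentioned above, a minimal sketch could look like the following, assuming a pandas Series called sales indexed by date (the name is illustrative). First-order differencing removes a trend; seasonal differencing removes a repeating seasonal pattern.

# sales: a pandas Series of, say, monthly sales indexed by date
detrended = sales.diff().dropna()              # remove the trend
deseasonalised = detrended.diff(12).dropna()   # remove yearly seasonality (lag 12 for monthly data)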

A couple of last notes

Data shifts are very common in real-world problems and occur most often in structured, tabular-data Machine Learning problems. In computer vision and image recognition problems, it is generally more difficult to develop a well-performing algorithm in the first place, mainly because of not-so-transparent Deep Neural Network architectures that are hard to train and to interpret. But once built, image classifiers are generally more stable over time. Just imagine how many centuries it would take for natural selection to change the looks of cats and dogs – dogs will probably look like dogs and cats like cats for a very long time. On the other hand, this is not true for structured, tabular corporate data. Such data and their underlying patterns are affected by many temporal, exogenous and latent variables and are prone to change dramatically over time.

Machine Learning models often need re-training to remain effective and accurate. But how often is enough? Every day? Week? Month? Quarter, maybe? It is hard to tell ex-ante. It all depends on the problem and the data at hand. A good indicator, however, is model degradation, which can trigger a model re-train, together with data shift detection, which can raise a red flag for further action such as feature exclusion, model re-training, etc.
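
As an illustration of that idea (names and thresholds are purely indicative), a simple monthly check could compare the live accuracy of the deployed model against the accuracy measured at deployment time and raise a flag once the drop exceeds a tolerance:

from sklearn.metrics import accuracy_score

BASELINE_ACCURACY = 0.80   # accuracy on the hold-out set at deployment time
TOLERANCE = 0.05           # acceptable absolute drop before acting

def needs_retraining(y_true, y_pred):
    """Return True if this month's accuracy has degraded beyond the tolerance."""
    current = accuracy_score(y_true, y_pred)
    return (BASELINE_ACCURACY - current) > TOLERANCE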

Wrapping it up

In this post we described the very real but often overlooked problem of data shifts. We gave formal definitions of the most common types of data shift and provided code solutions to the problem at hand.

To avoid model degradation issues that arise for such reasons, make data shift detection a core part of your deployment phase, so that you can prevent problems or take timely action once a data shift occurs. This often overlooked step in the Machine Learning deployment lifecycle may literally save your job and reputation. Enjoy!

P.S. The 'John' story described above is entirely fictitious and bears no relation to real-life persons or situations.

References

[1] https://towardsdatascience.com/understanding-dataset-shift-f2a5a262a766

[2] https://facebook.github.io/prophet/docs/multiplicative_seasonality.html

[3] https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html

[4] https://mitpress.mit.edu/books/dataset-shift-machine-learning

[5] https://www.analyticsvidhya.com/blog/2017/07/covariate-shift-the-hidden-problem-of-real-world-data-science/

[6] https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html

[7] https://www.kaggle.com/kashnitsky/correct-time-aware-cross-validation-scheme
