During my experience as a data scientist, one of the most common problems I have faced during the process of data cleaning/exploratory analysis has been the handling of missing tabular values. In the ideal case, all attributes of all objects in the data table have well-defined values. However, in real data sets, it is not unusual for an attribute to contain missing data.

When dealing with prediction tasks in supervised learning, I quickly came to the realization that a lot of machine learning algorithms available in Python cannot handle missing data naturally, i.e. the omitted instances have to somehow be filled with a placeholder (most likely a number) for them to run smoothly.

Perhaps you as a reader may have come up with a simple solution already:

Why not just ignore those instances in the pre-processing stage and be done with it?

After all, it is not our fault that the data is missing and moreover, we should not make any assumptions about the nature of missing data since their true value is unknowable in principle. While this may certainly be tempting (if not advisable) in some situations, I will attempt to make the case that imputing (rather than ignoring) missing values can be a better practice that in the end leads to more reliable and unbiased results for our machine learning models.


1. Types of missing data

There may be various reasons why the data is missing. Depending on those reasons, missing data can be classified into three main types:

1) Missing completely at random (MCAR) - Imagine that you print out the data table on a sheet of paper with no missing values and then someone accidentally spills a cup of coffee on it. In this case it can be concluded that the unknown values of an attribute follow the same distribution as known ones. This is the best case for missing values [1].

2) Missing at random (MAR) - In this case the probability that a value of attribute X is missing depends on other attributes but is independent of the true value of X. For example, if an outdoor air temperature sensor runs out of batteries and the staff forget to change them because it was raining, temperature values are more likely to be missing when it is raining, so their absence depends on the rain attribute. If we computed the average temperature based only on the present values, we would probably overestimate it, since the temperature tends to be lower when it is raining than when it is not.


3) Missing not at random (MNAR) - This usually occurs when the lack of data depends directly on its value, for example when a temperature sensor fails if temperatures drop below 0°C. Another example is when people with a certain level of income choose not to disclose that information to a census taker. In this case it is more difficult to replace the missing values with a reasonable estimate.

It is important to identify these types of missing data, since it can help us make certain assumptions about their distribution and therefore improve our chances of making good estimations.
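To make the difference between MCAR and MAR more tangible, here is a minimal simulation sketch of the temperature example. All numbers (rain frequency, temperatures, missingness rates) are made up purely for illustration. It compares the mean of the observed values with the true mean under both mechanisms.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical weather data: it rains 30% of the time and rainy readings are colder.
rain = rng.random(n) < 0.3
temperature = np.where(rain, 10.0, 20.0) + rng.normal(0.0, 2.0, n)
true_mean = temperature.mean()

# MCAR: every reading has the same 20% chance of being lost.
mcar_missing = rng.random(n) < 0.2
mcar_mean = temperature[~mcar_missing].mean()

# MAR: readings are lost far more often when it rains (60% vs. 5%),
# but the chance does not depend on the temperature value itself.
mar_missing = rng.random(n) < np.where(rain, 0.6, 0.05)
mar_mean = temperature[~mar_missing].mean()

print(f"true mean: {true_mean:.2f}")
print(f"mean of observed values (MCAR): {mcar_mean:.2f}")
print(f"mean of observed values (MAR):  {mar_mean:.2f}")
```

Under MCAR the observed mean should stay close to the true one, whereas under MAR the rainy (colder) readings are under-represented, so the observed mean drifts upwards.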

2. Ways to handle missing data

First of all, we need to identify which attributes exactly contain missing values, as well as get an idea of their frequency, as shown in the table below:

Quantifying missing values

The attributes were sorted in descending order based on the number of instances with unknown values.
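A summary like this can be produced with a few lines of pandas. The sketch below uses a small made-up table; the column names and the `missing_summary` variable are only illustrative.

```python
import numpy as np
import pandas as pd

# Small made-up table with a few gaps.
df = pd.DataFrame({
    "col1": [2.0, 9.0, 19.0],
    "col2": [5.0, np.nan, 17.0],
    "col3": [3.0, 9.0, np.nan],
    "col4": [6.0, 0.0, 9.0],
    "col5": [4.0, 7.0, np.nan],
})

# Count and share of missing values per attribute.
missing_summary = pd.DataFrame({
    "n_missing": df.isnull().sum(),
    "pct_missing": df.isnull().mean() * 100,
})

# Keep only attributes that have gaps, most affected first.
missing_summary = (
    missing_summary[missing_summary["n_missing"] > 0]
    .sort_values("n_missing", ascending=False)
)
print(missing_summary)
```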

2.1 Deleting missing data

In my opinion, if the missing value percentage is above a certain threshold (say, 60%), it does not make much sense to try and impute them because it would likely influence our predictions due to the biased estimations. Deletion of the rows or columns with unknown values would be better suited. For illustrative purposes, suppose the data set looks like this (missing instances are denoted with the NaN notation):

id    col1     col2     col3     col4     col5
0      2.0       5.0       3.0       6.0       4.0
1      9.0       NaN       9.0       0.0       7.0
2      19.0     17.0     NaN       9.0       NaN

The Python pandas library allows us to drop the missing values based on the rows that contain them (i.e. drop rows that have at least one NaN value):

import pandas as pd

df = pd.read_csv('data.csv')
df.dropna(axis=0)

The output is as follows:
id    col1     col2      col3     col4     col5
0      2.0       5.0       3.0       6.0       4.0

Similarly, we can drop columns that have at least one NaN in any row:
df.dropna(axis=1)

The above code produces:
id    col1     col4    
0      2.0       6.0      
1      9.0       0.0
2      19.0     9.0

However, I think that in most scenarios it is better to keep data than to discard it. One obvious reason is that removing rows or columns that contain unknown values can mean losing a lot of valuable information, especially if we don't have much data to begin with.
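A middle ground between dropping everything and dropping nothing is to remove only the columns whose share of missing values exceeds a threshold, such as the 60% mentioned earlier. The following is a minimal sketch on made-up data; the threshold value and column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0, np.nan],  # 80% missing
    "some_missing": [1.0, np.nan, 3.0, 4.0, 5.0],             # 20% missing
    "complete": [1.0, 2.0, 3.0, 4.0, 5.0],
})

threshold = 0.6  # illustrative cut-off, not a universal rule

# Share of missing values per column, then keep columns at or below the cut-off.
missing_share = df.isnull().mean()
df_reduced = df.loc[:, missing_share <= threshold]

print(df_reduced.columns.tolist())
```

The remaining gaps (here in `some_missing`) can then be imputed rather than discarded.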

2.2 Simple imputation of missing data

We could use simple interpolation techniques to estimate unknown data. One of the most common interpolation techniques is mean imputation [2]. Here, we simply replace the missing values in each column with the mean value of the corresponding feature column.

The scikit-learn library offers us a convenient way to achieve this by instantiating the SimpleImputer class and then calling its fit_transform() method:

from sklearn.impute import SimpleImputer
import numpy as np

sim = SimpleImputer(missing_values=np.nan, strategy='mean')
imputed_data = sim.fit_transform(df.values)

After running the code, we get the imputed dataset:

Simple mean imputation technique

Other imputation strategies are available with this class, for example "median", or "most_frequent" for categorical data, which replaces the missing data with the most common category.
This simplistic approach does have its drawbacks, however. For example, by using the mean as an imputation strategy we do not:
1) Account for the variability of the missing values, since these values are replaced by a constant.
2) Take into account the potential dependency of the missing data on the other attributes present in the data set.
That's why I decided to focus my attention on a few more sophisticated approaches.
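The first drawback can be checked numerically. The sketch below, on synthetic data with made-up parameters, removes 30% of the values of a single feature at random, fills them with the mean and compares the standard deviation before and after.

```python
import numpy as np
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(1)
x = rng.normal(loc=50.0, scale=10.0, size=(1000, 1))

# Remove roughly 30% of the values completely at random.
x_missing = x.copy()
x_missing[rng.random(1000) < 0.3, 0] = np.nan

x_imputed = SimpleImputer(strategy="mean").fit_transform(x_missing)

std_observed = np.nanstd(x_missing)   # spread of the values we actually have
std_imputed = x_imputed.std()         # spread after filling gaps with a constant

print(f"std of observed values:    {std_observed:.2f}")
print(f"std after mean imputation: {std_imputed:.2f}")
```

Because every gap receives the same constant, the imputed column is less spread out than the observed data, which can distort anything downstream that relies on the feature's variance.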

2.3 Imputation of missing data using machine learning

A more advanced method of imputation is to model an attribute containing unknown values as a target variable which is dependent on the other variables present in the data set and then apply traditional regression or machine learning algorithms to predict its missing instances. A rough mathematical representation could be formulated as follows:
                                                                  y=f(X)

where y represents the attribute for which we want to predict the missing values and X is the set of predictor variables, i.e. the other variables. This relationship is most clearly visible in the case of simple linear regression where we have:
                                                                y=c+b*X

After we build our simple model we can then use it to predict the unknown values of y for which the corresponding X values will be available. The exact same principle applies to ML-algorithms as well, albeit the relationship between target and predictor cannot be represented so neatly.
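As a rough sketch of the linear case, the snippet below generates synthetic data from y = c + b*X with made-up coefficients (c = 3, b = 2), hides 20% of the y values, fits scikit-learn's LinearRegression on the rows where y is known and uses it to fill the gaps.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.uniform(0.0, 10.0, 200)
y = 3.0 + 2.0 * X + rng.normal(0.0, 0.5, 200)  # hypothetical c = 3, b = 2
df = pd.DataFrame({"X": X, "y": y})

# Pretend that 20% of the y values are missing.
df.loc[df.sample(frac=0.2, random_state=0).index, "y"] = np.nan

# Fit on the rows where y is known, predict the rows where it is not.
known = df["y"].notnull()
model = LinearRegression().fit(df.loc[known, ["X"]], df.loc[known, "y"])
df.loc[~known, "y"] = model.predict(df.loc[~known, ["X"]])

print(f"c = {model.intercept_:.2f}, b = {model.coef_[0]:.2f}")
```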
Relying on linear regression (or logistic regression for categorical data) to fill the gaps of course has its drawbacks as well. Most importantly, this approach assumes that the relationship between the predictors and the target variable (or the log odds of the target in logistic regression) is linear, even though this may not be the case at all.
For this reason, I have chosen to perform imputation using ML algorithms which are able to also capture non-linear relationships. The modus operandi can be summarized in the following pseudocode:

for each attribute containing missing values do:

  1. Substitute missing values in the other variables with temporary placeholder values derived solely from the non-missing values using a simple imputation technique

  2. Drop all rows where the values are missing for the current variable in the loop

  3. Train an ML model on the remaining data set to predict the current variable

  4. Predict the missing values of the current variable with the trained model (when the current variable will be subsequently used as an independent predictor in the models for other variables, both the observed and predicted values in this step will be used).

end

Firstly, as you probably noticed, I have performed a simple form of imputation (median) already in the first step. This is necessary because there may be multiple features with missing data present, and in order for them to be used as predictors for other features, their gaps need to be temporarily filled somehow.
Secondly, the prediction of missing data is done in a "progressive" manner in the sense that variables which were imputed in the previous iteration are used as predictors along with those imputed values. So at each iteration except the first, we are relying on the predictive power of our model to fill the remaining gaps.
Thirdly, given that the data set provided in this case contained a mix of data types, I have employed ML regressors (for continuous attributes) as well as classifiers (for categorical attributes) to cover all possible scenarios.

In the subsequent sections I have listed all the ML models used in this study, along with small snippets of code that demonstrate their implementation in Python.

2.3.1 Imputation of missing data using Random Forests

Quick data preprocessing tips

Before training a model on the data, it is necessary to perform a few preprocessing steps first:

  • Scale the numeric attributes (apart from our target) to make the algorithm find a better solution quicker.
    This can be achieved using scikit-learn's StandardScaler() class:
    from sklearn.preprocessing import StandardScaler
    X = df.values
    standard_scaler = StandardScaler()
    x_scaled = standard_scaler.fit_transform(X)

  • Encode the categorical data so that each category of an attribute is represented in a binary 1 (present) - 0 (not present) fashion. This is done because most models cannot handle non-numerical features naturally.
    We can do this by using the pandas get_dummies() function:
    import pandas as pd
    encoded_country = pd.get_dummies(df['Country'])
    # join() returns a new DataFrame, so the result has to be assigned back
    df = df.join(encoded_country)
    del df['Country']

The first ML model used was scikit-learn's RandomForestRegressor. Random forests are a collection of individual decision trees trained on bootstrap samples of the data (bagging) that make predictions by averaging the predictions of every single estimator. They tend to be resistant to overfitting because the errors of the individual trees tend to cancel each other out. If you want to learn more, refer to [3].

Below is a small snippet that translates the above pseudocode into actual Python code:

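import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# num_features: the numeric columns of df that contain missing values
for numeric_feature in num_features:
    # Step 1: temporarily fill the gaps in all columns with the median...
    sim = SimpleImputer(missing_values=np.nan, strategy='median')
    df_temp = pd.DataFrame(sim.fit_transform(df), columns=df.columns, index=df.index)
    # ...but keep the original (still incomplete) values of the current target
    df_temp[numeric_feature] = df[numeric_feature]

    # Step 2: separate the rows where the target is known from those where it is missing
    is_missing = df_temp[numeric_feature].isnull()
    df_train = df_temp[~is_missing]
    y = df_train[numeric_feature].values
    df_train = df_train.drop(columns=numeric_feature)
    df_test = df_temp[is_missing].drop(columns=numeric_feature)

    # Scale the predictors; the scaler is fitted on the training rows only
    # and then re-used (not re-fitted) on the rows to be imputed
    standard_scaler = StandardScaler()
    x_scaled = standard_scaler.fit_transform(df_train.values)
    test_scaled = standard_scaler.transform(df_test.values)

    # Steps 3 and 4: train the model and predict the missing values
    rf_regressor = RandomForestRegressor()
    rf_regressor = rf_regressor.fit(x_scaled, y)
    pred_values = rf_regressor.predict(test_scaled)
    df.loc[is_missing, numeric_feature] = pred_values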

Categorical feature imputation is done in a similar way. In this case we are dealing with a classification task, and should use the RandomForestClassifier class.
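A minimal sketch of the categorical case could look as follows. The data is synthetic, and the names `segment`, `x1` and `x2` are hypothetical; the predictors are assumed to be complete here to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
n = 300
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})

# Hypothetical categorical attribute that depends on x1.
df["segment"] = pd.Series(np.where(df["x1"] > 0, "A", "B"), dtype=object)

# Pretend that 20% of the categories are missing.
df.loc[df.sample(frac=0.2, random_state=0).index, "segment"] = np.nan

missing = df["segment"].isnull()
predictors = ["x1", "x2"]

# Train on the rows with a known category, predict the rest.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(df.loc[~missing, predictors], df.loc[~missing, "segment"])
df.loc[missing, "segment"] = clf.predict(df.loc[missing, predictors])

print(df["segment"].value_counts())
```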

Important note on using categorical features as predictors:
In my opinion, it is correct to perform temporary imputation of categorical features before encoding them.

Consider the example below, where the Country feature has already been encoded before beginning the imputation procedure:
id    Austria    Italy     Germany
0            0             1              0
1            1             0              0
2            0             0              1
3            NaN         NaN          NaN

If we apply a simple imputation using the most frequent value of each column, for example, we would get the following result on the last row:
id    Austria    Italy    Germany
3            0             0              0

This is a logical mistake in the representation, since each row should contain exactly one column that takes 1 as its value to denote the presence of a particular country. We can avoid this mistake by imputing before encoding, since we are then guaranteed to fill the missing values with an actual country value.
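The "impute first, encode second" order can be sketched like this. The example data is made up (Italy appears twice so that the most frequent value is unambiguous).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

countries = pd.Series(["Italy", "Austria", "Germany", "Italy", np.nan], dtype=object)
df = pd.DataFrame({"Country": countries})

# 1) Fill the gap on the raw categorical column first...
imputer = SimpleImputer(strategy="most_frequent")
df["Country"] = imputer.fit_transform(df[["Country"]]).ravel()

# 2) ...and only then one-hot encode it.
encoded = pd.get_dummies(df["Country"], dtype=int)
print(encoded)
```

Each row of `encoded` now contains exactly one 1, including the row that was originally missing.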

2.3.2 Imputation of missing data using XGBoost

XGBoost is an optimized implementation of the gradient boosting algorithm. Like random forests, it is a tree-based estimator, but the trees are built sequentially, each one correcting the errors of its predecessors, rather than independently of each other. For more information, check out the official documentation.
The XGB model can actually handle missing values on its own, so it is not necessary to perform temporary simple imputation on predictor variables, i.e. we could skip the first step in the pseudocode.
Training and prediction of missing values is done in a similar fashion to the random forest approach:

import xgboost as xgb
.
.
.
    xgbr = xgb.XGBRegressor()
    xgbr = xgbr.fit(x_scaled, y)
    pred_values = xgbr.predict(test_scaled)
.
.
.

2.3.3 Imputation of missing data using Keras Deep Neural Networks

Neural networks follow a fundamentally different approach during training compared to tree based estimators. In my work, I have used the neural network implementation offered by the Keras library. Below I wrote an example demonstrating its application in Python:

import tensorflow as tf
from keras.models import Sequential
from keras.layers import Dense
.
.
.
    model = Sequential()
    model.add(Dense(30, input_dim=input_layer_size, activation='relu'))
    model.add(Dense(30, activation='relu'))
    # identity activation in the output layer for regression
    model.add(Dense(1))
    # in case of multi-class classification, use one output unit per class
    # (n_classes is a placeholder for the number of categories):
    # model.add(Dense(n_classes, activation='softmax'))
    model.compile(loss='mean_squared_error')
    # in case of multi-class classification (with one-hot encoded targets):
    # model.compile(loss='categorical_crossentropy')
    model.fit(x_scaled, y)
    pred_values = model.predict(test_scaled)[:, 0]
.
.
.

2.3.4 Imputation of missing data using Datawig

Datawig is another deep learning model I employed. It is designed specifically for missing value imputation and uses deep neural networks built on MXNet to make predictions. It can work with missing data during training and it automatically handles categorical data with its CategoricalEncoder class, so we don't need to pre-encode them. A possible implementation is as follows:

import datawig
.
.
.
    imputer = datawig.SimpleImputer(
        input_columns=list(df_test.columns),
        output_column=numeric_feature,  # the column to impute
        output_path='imputer_model'  # stores model data and metrics
    )
    # Fit the imputer model on the train data:
    imputer.fit(train_df = scaled_df_train)

    # Alternatively, we could use the fit_hpo() method to find
    # the best hyperparameters:
    # imputer.fit_hpo(train_df = scaled_df_train)

    # Impute missing values, return original dataframe with predictions
    pred_vals = imputer.predict(scaled_df_test).iloc[:, -1:].values[:, 0]
.
.
.

Datawig is optimized for pandas DataFrames, meaning that it takes dataframe objects directly as input for training and prediction, so we do not need to transform them into numpy arrays.
Moreover, we should not drop the target variable column from the training set and input it as a separate argument as we did previously when fitting a model. Datawig handles this automatically.

2.3.5 Imputation of missing data using IterativeImputer

The scikit-learn package also offers a more sophisticated approach to data imputation with the IterativeImputer() class. So where does this approach differ from the ones we saw before? The name gives us a hint.

Iterative means that each feature is imputed multiple times. Each iteration is called a cycle. The reason behind running multiple cycles is to achieve some sort of 'convergence', although looking at the scikit-learn documentation, it was not clear to me what exactly this means. However, you can think of convergence in terms of stabilization of the predicted values:

from sklearn.experimental import enable_iterative_imputer # noqa
from sklearn.impute import IterativeImputer

iim=IterativeImputer(estimator=xgb.XGBRegressor(),
        initial_strategy='median',
        max_iter=10,
        missing_values=np.nan,
        skip_complete=True)

# impute all the numeric columns containing missing values
# with just one line of code:
imputed_df = pd.DataFrame(iim.fit_transform(df))

imputed_df.columns = df.columns

From the code above, we can see that each feature is imputed up to 10 times (max_iter=10; the imputer may stop earlier if its stopping criterion is met) and in the end we get the imputed values of the last cycle.

Notice how I used an XGBoost regressor as model input. This shows that IterativeImputer also accepts some ML models that are not native to the scikit-learn library.
Despite being easy to implement, it takes a very large amount of time to calculate compared to the other approaches. In addition, I would advise using this class with care since it is still in its experimental stages.
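To see how many cycles were actually run, one can inspect the fitted imputer's `n_iter_` attribute. The sketch below uses synthetic, strongly correlated columns and the default estimator instead of XGBoost to keep it light; the `tol` value controls the early-stopping criterion and is chosen arbitrarily here.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(4)
a = rng.normal(size=500)

# Three synthetic, strongly related columns.
X = np.column_stack([
    a,
    2.0 * a + rng.normal(scale=0.1, size=500),
    -a + rng.normal(scale=0.1, size=500),
])

# Knock out roughly 10% of all entries at random.
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.1] = np.nan

imputer = IterativeImputer(max_iter=10, tol=1e-3, random_state=0)
X_imputed = imputer.fit_transform(X_missing)

# n_iter_ is smaller than max_iter if the stopping criterion was reached early.
print("cycles actually run:", imputer.n_iter_)
```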

3. Comparing performances of ML algorithms

Up until this point, we have seen how various techniques can be employed to impute missing data as well as the actual process of imputation. However, I have not explained how we can compare the quality of the predictions provided by these approaches.

This is not immediately obvious because we do not possess the true values of the missing data to compare the predictions against. Instead, what I decided to do is keep a holdout or validation set from the training data, and then use it for model performance evaluation.

So, we are pretending that some data is missing and inferring the actual accuracy of the imputed values based on the accuracy of the imputations on these fake missing values. The snippet below enables us to do this:

from sklearn.metrics import r2_score
import random
.
.
.
    train_copy = df_train.copy()
    random.seed(23)
    # hide 20% of the known values of the current feature
    n_hidden = int(train_copy.shape[0] * 0.2)
    i = sorted(random.sample(range(train_copy.shape[0]), n_hidden))
    target_pos = train_copy.columns.get_loc(numeric_feature)
    train_copy.iloc[i, target_pos] = np.nan

    # true values of the artificially hidden entries
    y_fake_test = df_train.iloc[i][numeric_feature].values
    is_hidden = train_copy[numeric_feature].isnull()
    train_y = train_copy.loc[~is_hidden, numeric_feature].values
    new_train_df = train_copy[~is_hidden].drop(columns=numeric_feature)
    fake_test_df = train_copy[is_hidden].drop(columns=numeric_feature)

    rf_regressor = rf_regressor.fit(new_train_df.values, train_y)
    train_pred = rf_regressor.predict(new_train_df.values)
    test_pred = rf_regressor.predict(fake_test_df.values)

    print("R2 train:{} | R2 test:{}".format(r2_score(train_y, train_pred), r2_score(y_fake_test, test_pred)))

The prediction quality, or goodness of fit, is measured by the coefficient of determination, which is expressed as:

                                                                R^2 = 1 - RSS / TSS

where RSS is the sum of squared residuals, and TSS represents the total sum of squares. Below I have plotted a visual comparison of the model performances for several attributes. Visualization was done utilizing the seaborn library.
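As a quick numerical check of the metric, assuming the usual definition R^2 = 1 - RSS / TSS, the sketch below computes it by hand for a handful of made-up values and compares the result with scikit-learn's r2_score.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true and predicted values, purely for illustration.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

rss = np.sum((y_true - y_pred) ** 2)         # sum of squared residuals
tss = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - rss / tss

print(r2_manual, r2_score(y_true, y_pred))
```

A value of 1 means perfect predictions, while a value of 0 means the model does no better than always predicting the mean of the true values.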

Comparing model prediction accuracy on various attributes

We can see that the random forest model consistently ranks among the best.
To get another hint at the consistency of RF, I have plotted the actual values against the predicted values in the test set for the VAR_1 variable:

Plotting actual values against predicted ones for Var_1

Ideally, the line in any graph should be a straight, diagonal one. The model which comes closest to this is random forest, which was ultimately my choice for imputation.

4. Potential future steps

Another interesting technique for imputation which could be employed in the future is the Multiple Imputation by Chained Equations (MICE) method. This takes iterative imputation up a notch. The core idea behind it is to create multiple copies of the original data set (usually 5 to 10 are enough) and perform iterative imputation on each copy. The results obtained from each data set are then pooled together in accordance with some metric which we can define.

Ultimately, the goal is to somehow account for the variability of the missing data and study the effects of different permutations on the prediction results. The scheme below illustrates this:

Main-steps-used-in-multiple-imputation

In Python, MICE is offered by a few libraries like impyute or statsmodels. However, they are limited to linear regression estimators.
Another way to mimic the MICE approach would be to run scikit-learn's IterativeImputer many times on the same dataset using different random seeds each time.
Yet another take on the imputation problem is to apply a technique called maximum likelihood estimation, which derives missing values from a user-defined distribution function, the parameters of which are chosen in a way that maximizes the likelihood of the imputed values actually occurring.
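The idea of running IterativeImputer several times with different seeds can be sketched as follows. This is only a rough approximation of MICE on synthetic data: `sample_posterior=True` makes the default estimator draw each imputed value from its predictive distribution, and averaging the imputed data sets is just one simple, assumed pooling choice.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(5)
a = rng.normal(size=300)
X = np.column_stack([a, a + rng.normal(scale=0.5, size=300)])

# Knock out roughly 15% of all entries at random.
mask = rng.random(X.shape) < 0.15
X_missing = np.where(mask, np.nan, X)

# One imputed copy of the data set per random seed.
imputations = np.stack([
    IterativeImputer(sample_posterior=True, max_iter=10, random_state=seed)
    .fit_transform(X_missing)
    for seed in range(5)
])

pooled = imputations.mean(axis=0)  # pooled point estimate per cell
spread = imputations.std(axis=0)   # variability between the imputed copies

print("average spread of imputed cells:", spread[mask].mean())
```

The spread is zero for observed cells and positive for imputed ones, which gives a rough handle on how uncertain each imputed value is.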

5. Conclusion

We got a glimpse of what the potential approaches for handling missing values are, from the simplest techniques like deletion to more complex ones like iterative imputation.

In general, there is no single best way to solve imputation problems, and solutions vary according to the nature of the problem, the size of the data set etc. However, I hope to have convinced you that an ML based approach has inherent value because it offers us a 'universal' way out. While missing data may be truly unknowable, we can at least try to come up with an educated guess based on the hidden relationships with the already existing attributes, captured and exposed to us by the power of machine learning.


References

[1] Berthold M.R., and others, Data understanding, in: Guide to Intelligent Data Analysis, Springer, London, pp. 37-40, 42-44.

[2] Raschka S., Data preprocessing, in: Python Machine Learning, Packt, Birmingham, pp. 82-83, 90-91.

[3] Tan P., and others, Data preprocessing, Classification, Ensemble methods, in: Introduction to Data Mining, Addison Wesley, Boston, pp. 187-188, 289-292.