In my last post, I presented some popular definitions of data science that are out there. In one such description, I referred to the fact that data science is not just a science but an art. You may be curious: ‘How is data science an art?’ From my perspective, art is a more or less unstructured process, without a universal modus operandi, centred on individual expressiveness and creativity. One aspect of data science where artistry and creativity come into play is none other than feature engineering.

This post, therefore, is about the art and craft of feature engineering, a necessary and fundamental process within data cleaning and preparation. It is an essential step, especially when you need to prepare data for machine learning. Feature engineering is no doubt a broad topic; my objective here is to highlight some salient and practical aspects using Feature-Engine, a Python package that is fast gaining momentum over the popular scikit-learn package.

For sufficient coverage, I will break this topic into a series in which different aspects of feature engineering are highlighted, showing how the Feature-engine package makes such processes neat, clean and simple compared to the conventional scikit-learn approach.

Handling and Imputing Missing Data

You will agree with me that this is one area of the data cleaning process that gives many machine learning newbies sleepless nights! By way of definition, imputation is the act of replacing missing data with statistical estimates of the missing values. The primary goal of any imputation technique is to produce a complete dataset that can be used to train machine learning models. The form of imputation employed usually depends on the following (a toy illustration follows the list below):

  • Whether the data is missing at random
  • The number of missing values
  • The type of machine learning model to be used
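
To make the idea concrete, here is a minimal toy sketch of imputation on a small pandas Series (illustrative only, not our actual dataset):

# a toy illustration of imputation: replace NaN with a statistical estimate
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])
s_imputed = s.fillna(s.median())   # the median of the observed values is 3.0
print(s_imputed.tolist())          # [1.0, 3.0, 3.0, 3.0, 5.0]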

Common steps and forms of handling missing data

In the course of this post, we shall demonstrate and discuss the following methods:

  • Removing observations with missing data
  • Mean or median imputation
  • Mode or frequent category imputation

Install and import the necessary packages and load the dataset

To perform and demonstrate the above processes, we will need the following Python packages:

  • Numpy
  • Pandas
  • Scikit-learn
  • Feature-Engine

You can get most of the above packages in one fell swoop by installing the free Anaconda Python distribution here; Feature-Engine itself can be installed with pip install feature-engine.

Also, to know more about Feature-engine, visit here.

Meanwhile, we shall use the crx.data file (the Credit Approval dataset from the UCI Machine Learning Repository) for illustration.

# import necessary packages
import random
import pandas as pd
import numpy as np

Modify the dataset to suit our purpose

# load the data
data = pd.read_csv('crx.data', header=None)
# create a list with variable names
avr_name = ['A' + str(s) for s in range(1,17)]
# assign the variable names to the dataframe columns
data.columns = avr_name
#replace the question marks(?) in the dataset with numpy NaN values:
data = data.replace('?',np.nan)
#recast the numerical variables as float data types
data['A2'] = data['A2'].astype('float')
data['A14'] = data['A14'].astype('float')
# recode the target variable as binary
data['A16'] = data['A16'].map({'+':1, '-':0})
#add some missing values at random positions
random.seed(42)
values = list(set([random.randint(0, len(data) - 1) for _ in range(100)]))
for var in ['A3', 'A8', 'A9', 'A10']:
    data.loc[values, var] = np.nan

# save your prepared data
data.to_csv('credit_app.csv', index=False)

Removing observations with missing data

This method is also called complete case analysis. It involves discarding the observations where values in any of the variables are missing. We can apply this method to both categorical and numerical variables. Although this process is fast and straightforward, it can lead to throwing away a significant amount of data, especially when values are missing across many variables.

#calculate the percentage of missing values per variable and display in descending order
data.isna().mean().sort_values(ascending=False)
#remove the observations with missing data in any of the variables
data_clean = data.dropna()
# check to verify through data shape
data_clean.shape, data.shape
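
As a quick illustration of that cost, here is a small check (using the data and data_clean objects above) of the fraction of observations discarded:

# fraction of observations discarded by complete case analysis
dropped = 1 - len(data_clean) / len(data)
print(f"complete case analysis discards {dropped:.1%} of the observations")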

Mean and Median imputation

This method is probably the most popular and somewhat logical one. It essentially involves replacing missing values with the variable’s mean or median, and it applies to numerical variables. Under this method, we will show how to replace missing values with the mean or median using scikit-learn and Feature-engine. To prevent data leakage, the best approach is to calculate the mean or median from the train set only, but apply the result to both the train and test sets.

#import necessary packages
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from feature_engine.missing_data_imputers import MeanMedianImputer
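
A quick note: the import above reflects the Feature-engine API at the time of writing. In Feature-engine 1.0 and later, the imputers were moved to the feature_engine.imputation module, so on a recent install the equivalent import would be:

# equivalent import on Feature-engine >= 1.0
from feature_engine.imputation import MeanMedianImputer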

We will continue with our data as above and split it into train and test sets:

X_train, X_test, y_train, y_test = train_test_split(
    data.drop('A16', axis=1), data['A16'], test_size=0.3, random_state=42)

# check the percentage of missing values in the training set
X_train.isna().mean()

# replace the missing values with the median in five numerical variables using pure pandas
for var in ['A2', 'A3', 'A8', 'A11', 'A15']:
    median_val = X_train[var].median()
    X_train[var] = X_train[var].fillna(median_val)
    X_test[var] = X_test[var].fillna(median_val)

Using scikit-learn’s SimpleImputer to replace missing values with the median (in practice you would pick one of these approaches; we demonstrate each here for comparison):

numeric_vars = ['A2', 'A3', 'A8', 'A11', 'A15']
imputer = SimpleImputer(strategy='median')

# fit the SimpleImputer to the train set so it learns the median values
imputer.fit(X_train[numeric_vars])

# replace missing values with the learned medians
# (note: transform() returns a NumPy array, so we assign back into the dataframe columns)
X_train[numeric_vars] = imputer.transform(X_train[numeric_vars])
X_test[numeric_vars] = imputer.transform(X_test[numeric_vars])

Now let’s see the simplicity of the feature_engine package carrying out the same process:

median_imputer = MeanMedianImputer(imputation_method='median',
                                   variables=['A2', 'A3', 'A8', 'A11', 'A15'])

# fit the median imputer so it learns the median of each variable
median_imputer.fit(X_train)

# replace the missing values with the learned medians
X_train = median_imputer.transform(X_train)
X_test = median_imputer.transform(X_test)

One way feature_engine shines is that, by default, it returns a pandas dataframe after imputation, whereas scikit-learn’s transformers return a NumPy array.
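
You can verify this with a quick check:

# feature_engine transformers keep the pandas DataFrame structure
print(type(X_train))   # <class 'pandas.core.frame.DataFrame'>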

Mode or frequent category imputation

This method involves replacing missing values with the mode, that is, the most frequent category of each variable. It is most commonly used for categorical variables.

#import necessary packages
from feature_engine.missing_data_imputers import FrequentCategoryImputer
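
(If you are on Feature-engine 1.0 or later, FrequentCategoryImputer was renamed; the equivalent there is CategoricalImputer with imputation_method='frequent':)

# equivalent on Feature-engine >= 1.0
from feature_engine.imputation import CategoricalImputer
mode_imputer = CategoricalImputer(imputation_method='frequent',
                                  variables=['A4', 'A5', 'A6', 'A7'])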

As usual, we shall continue with our data from above. Don’t forget to split the dataset into train and test parts. We will go straight to showing how this procedure is implemented with feature_engine; the steps mirror those used above with pandas and scikit-learn.

mode_imputer = FrequentCategoryImputer(variables=['A4','A5','A6','A7'])

# fit the imputation transformer to the train set to learn the most frequent categories

mode_imputer.fit(X_train)

# inspect the learned frequent categories
mode_imputer.imputer_dict_

# replace the missing values with the most frequent categories
X_train = mode_imputer.transform(X_train)
X_test = mode_imputer.transform(X_test)
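
As a quick sanity check, confirm that no missing values remain in the imputed categorical variables:

# verify that the categorical variables no longer contain missing values
X_train[['A4', 'A5', 'A6', 'A7']].isna().sum()
X_test[['A4', 'A5', 'A6', 'A7']].isna().sum()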

I hope you are as astonished as I was when I first discovered the feature_engine package. The good thing about this Python package is what I refer to as its ‘specialized attention’ to feature engineering for machine learning. I hope you learned something today. Bye!