Data Preprocessing for Machine Learning Part I

Data Preprocessing

Data preprocessing is the process of cleaning and organizing data so that our machine learning model is not disturbed or biased while learning from it. Real-world data is often messy and unmanaged, which can cause our machine learning models to fail, so it must be preprocessed first. There are various methods to preprocess data, and we will be looking at them in detail.

The various data preprocessing techniques are:

  • StandardScaler
  • MinMaxScaler
  • RobustScaler
  • Normalization
  • Binarization
  • Encoding Categorical (Ordinal & Nominal) Features
  • Imputation
  • Polynomial Features
  • Custom Transformer

StandardScaler

StandardScaler is a data preprocessing technique that standardizes the data so that each feature has zero mean and unit variance. Our data may contain various inconsistencies in scale. Consider real estate data, where the size of a house is a small number but the price may be in the millions; this mismatch in scale can slow down or mislead our model, which is where StandardScaler comes into play. If the data is not normally distributed, this is not the best scaler to use.

StandardScaler assumes our data is normally distributed (continuous numerical data) within each feature and scales it so that the distribution is centred around 0 with a standard deviation of 1. It subtracts the mean of each feature from the original data and divides by the standard deviation: z = (x − mean) / standard deviation.
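
To make the formula concrete, here is a minimal sketch (the array values are made up for illustration) showing that applying (x − mean) / standard deviation by hand gives the same result as StandardScaler:

# A minimal sketch: manual standardization vs. StandardScaler
# (the array values here are hypothetical, for illustration only)
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0]])

# Manual: subtract the mean, divide by the standard deviation
manual = (x - x.mean()) / x.std()

# sklearn: same formula applied per feature (column)
sklearn_scaled = StandardScaler().fit_transform(x)

print(np.allclose(manual, sklearn_scaled))  # True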

Let’s look at an example. We will import the StandardScaler class from the sklearn.preprocessing module and generate some normally distributed data. The data before preprocessing looks like:

# Imports
import numpy as np
import pandas as pd

# Generating three normally distributed features with different means and spreads
df = pd.DataFrame({
    'x1' : np.random.normal(0, 2, 10000),
    'x2' : np.random.normal(5, 3, 10000),
    'x3' : np.random.normal(-5, 5, 10000)
})

# Plotting the density of each feature
df.plot.kde()

Here, from the figure we can see that the data spans a range of roughly -20 to 20, and the densities of the different features are quite different. To standardize this data, let us import the library and run the following code.

from sklearn.preprocessing import StandardScaler

# Fit the scaler to the data and transform it in one step
model = StandardScaler()
data_tf = model.fit_transform(df)

# fit_transform returns a NumPy array, so rebuild the DataFrame
df = pd.DataFrame(data_tf, columns=['x1','x2','x3'])
df.plot.kde()
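
As a quick sanity check (a small sketch, assuming the df above), each column should now have a mean of roughly 0 and a standard deviation of roughly 1:

# Sanity check: means should be ~0 and standard deviations ~1 after scaling
print(df.mean().round(2))
print(df.std().round(2))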

You can find the full code here.

MinMaxScaler

MinMaxScaler scales the data so that every value falls between 0 and 1. We can think of it simply as compressing the data: large values are converted to fractions between 0 and 1. We can use this scaler when StandardScaler is not suitable. It simply subtracts the minimum of each column and divides by the difference between the maximum and the minimum: x' = (x − min) / (max − min). MinMaxScaler shifts the data into the range 0 to 1, whereas StandardScaler standardizes it to zero mean and unit variance.
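
As with StandardScaler, a minimal sketch (with made-up values) shows that the hand-computed formula matches what MinMaxScaler produces:

# A minimal sketch: manual min-max scaling vs. MinMaxScaler
# (the array values here are hypothetical, for illustration only)
import numpy as np
from sklearn.preprocessing import MinMaxScaler

x = np.array([[10.0], [20.0], [30.0], [50.0]])

# Manual: (x - min) / (max - min)
manual = (x - x.min()) / (x.max() - x.min())

# sklearn: same formula applied per feature (column)
sklearn_scaled = MinMaxScaler().fit_transform(x)

print(np.allclose(manual, sklearn_scaled))  # True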

Let’s look at an example. We will import the MinMaxScaler class from the sklearn.preprocessing module and generate some normally distributed data. The data before preprocessing looks like:

# Generating data (numpy as np and pandas as pd imported as above)
df = pd.DataFrame({
    'x1' : np.random.normal(0, 2, 10000),
    'x2' : np.random.normal(2, 5, 10000),
    'x3' : np.random.normal(-3, 2, 10000)
})

# Plotting the density of each feature
df.plot.kde()

The features have very different spreads and ranges, which may affect our machine learning model during feature extraction. So we will apply MinMaxScaler and see what the data looks like after scaling.

from sklearn.preprocessing import MinMaxScaler

# Fit the scaler and transform the data in one step
model = MinMaxScaler()
data_tf = model.fit_transform(df)

# Rebuild the DataFrame from the returned NumPy array
df = pd.DataFrame(data_tf, columns=['x1','x2','x3'])
df.plot.kde()
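
Again as a quick sanity check (a small sketch, assuming the df above), every column should now have a minimum of 0 and a maximum of 1:

# Sanity check: each column should now span exactly [0, 1]
print(df.min())  # all 0.0
print(df.max())  # all 1.0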

We can now see that our data is fully scaled into the range 0 to 1. This will make our machine learning model faster and more efficient. You are now ready for the next step. Find the full code here.

RobustScaler

RobustScaler is designed for data with outliers. An outlier is an unexpected, sudden deviation in the data. Many machine learning workflows simply remove such points, which is not always a good idea. Instead, we can scale the data with RobustScaler, which subtracts the median from each value and divides by the interquartile range, i.e. the difference between the 3rd quartile and the 1st quartile: x' = (x − median) / (Q3 − Q1).
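
Here is a minimal sketch (with a made-up array containing one obvious outlier) confirming that the median/IQR formula matches RobustScaler's default behaviour:

# A minimal sketch: manual robust scaling vs. RobustScaler
# (the array values, including the outlier 100, are hypothetical)
import numpy as np
from sklearn.preprocessing import RobustScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Manual: (x - median) / (Q3 - Q1)
q1, median, q3 = np.percentile(x, [25, 50, 75])
manual = (x - median) / (q3 - q1)

# sklearn: same formula with default settings
sklearn_scaled = RobustScaler().fit_transform(x)

print(np.allclose(manual, sklearn_scaled))  # True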

Let’s look at an example. We will import the RobustScaler class from the sklearn.preprocessing module, generate some normally distributed data, and concatenate it with a small cluster of outliers. The data before preprocessing looks like:

# Generating data: a large normal cluster concatenated with a small outlier cluster
df = pd.DataFrame({
    'x1' : np.concatenate([np.random.normal(20, 1, 5000), np.random.normal(1, 1, 30)]),
    'x2' : np.concatenate([np.random.normal(30, 1, 5000), np.random.normal(50, 1, 30)]),
})

# Plotting the density of each feature
df.plot.kde()

Here, a small group of outliers sits away from the main clusters, but those points may still carry meaning, so they should be scaled rather than removed. We will scale the data and look at the output of the modified data.

from sklearn.preprocessing import RobustScaler

# Fit the scaler and transform the data in one step
model = RobustScaler()
df_trans = model.fit_transform(df)

# Rebuild the DataFrame from the returned NumPy array
df = pd.DataFrame(df_trans, columns=['x1','x2'])
df.plot.kde()
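
As a quick sanity check (a small sketch, assuming the df above), the median of each column should now be roughly 0, since RobustScaler centres on the median:

# Sanity check: medians should be ~0 after robust scaling
print(df.median().round(2))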

We can see that the distributions are now brought together on a small scale without discarding the outliers, so we can now feed the data to our machine learning model. You can get the full code here.

We will be looking at more data preprocessing techniques in the next blog.
