Normalization techniques – Statistics and Python

Content

This notebook shows how to deal with numeric data measured on different scales by using normalization techniques.

Currently, there are three normalization techniques that are most commonly used in data preprocessing:

  1. L1 normalization – Least absolute deviations | Least absolute errors
  2. L2 normalization – Least squares
  3. MinMax normalization

Different normalization techniques serve different purposes and suit different datasets. A particular dataset may need only one technique, while another may call for a combination of two or more.

Furthermore, normalization can be applied to a dataset by row, by column, or both. If we want to scale all values within one feature (one column), we apply normalization by column. If we want to bring all features of each observation onto the same scale, we apply normalization by row. In some cases, we can also apply normalization by row and then by column, or vice versa.
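To make the difference concrete, here is a minimal sketch (not part of the original notebook) using scikit-learn's normalize function, whose axis argument selects the direction: axis=1 normalizes each row (sample) and axis=0 normalizes each column (feature).

import numpy as np
from sklearn.preprocessing import normalize

X = np.array([[20.0, 10.0, 0.7],
              [30.0,  8.0, 0.4]])

# axis=1 (the default): scale each row so its absolute values sum to 1 (L1 norm)
by_row = normalize(X, norm='l1', axis=1)
print(by_row.sum(axis=1))   # -> [1. 1.]

# axis=0: scale each column so its absolute values sum to 1 (L1 norm)
by_col = normalize(X, norm='l1', axis=0)
print(by_col.sum(axis=0))   # -> [1. 1. 1.]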

When do we need to normalize our data?

  1. We want to explore relationships between features.
  2. We want to use regression or multivariate analysis in our further work, because these types of analysis focus on exploring the relationships between features.
  3. We are not much concerned with the means within or between features.

When should we not normalize our data?

  1. In experimental research, where we compare the mean of one treatment with the mean of another.
  2. We are dealing with categorical data (more than 2 categories per feature).

Libraries

For reading data: Pandas

For scratch implementation: Numpy

For existing implementation: Scikit learn

In [1]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import normalize, MinMaxScaler

Data preparation

In this section, data is prepared for different normalization techniques.

Data summary

  • Age – symmetric data (measured in the range of 1-100)
  • Number of songs listened per day – symmetric data (measured in the range of 5-15)
  • Satisfaction score – symmetric data (measured in the range of 0-1)
In [2]:
data = pd.DataFrame(
    [[20, 10, 0.7], [30, 8, 0.4], [25, 11, 0.4], [10, 5, 0.8], [40, 7, 0.6]],
    columns=['Age', 'Songs Listened', 'Satisfaction']
)

display(data)
   Age  Songs Listened  Satisfaction
0   20              10           0.7
1   30               8           0.4
2   25              11           0.4
3   10               5           0.8
4   40               7           0.6

Normalization techniques

L1 normalization

As mentioned above, L1 normalization is known as Least absolute deviations. It takes the sum of the absolute values of all elements in a particular row or column, then divides each element by that sum.

Formula (by row, where x_ij is the j-th value in row i and n is the number of columns):

x'_ij = x_ij / (|x_i1| + |x_i2| + ... + |x_in|)

For example, applying it by row to the first row of the data above, [20, 10, 0.7]:

|20| + |10| + |0.7| = 30.7
20/30.7 ≈ 0.6515,  10/30.7 ≈ 0.3257,  0.7/30.7 ≈ 0.0228

Questions:

  1. When do we need to use L1 normalization?
    • We want to have a robust normalization for the model. (Robust normalization: normalization that performs well for data drawn from a wide range of probability distributions.)
    • We are not much concerned about outliers. (Outliers: unusual data points in the distribution.)
    • Stability is not required. (Stability: the fitted line changes only slightly when the data is adjusted.)
    • We want to use sparse models, which generate vectors that are mostly zeros.
  2. What are the advantages of using L1 normalization?
    • It can be used for feature selection (we can drop all features whose coefficient is 0).
    • It optimizes the median.
  3. What are the disadvantages of using L1 normalization?
    • It does not perform well in models that require a small number of columns, because it creates a sparse matrix.
    • It is not a good choice if the dataset contains many outliers, especially when we want to detect those outliers.
In [3]:
def l1_from_scratch(data, axis=1):
    ''' This is the implementation of L1 normalization from scratch
    
    Parameters
    ----------
    data: The DataFrame to be normalized
    axis: A way to normalize (0 = by columns, 1 = by rows)
    
    Return
    ------
    L1-normalization result
    
    Notes
    -----
    1. We assume that the data contains only symmetric features (not categorical features)
    2. We assume that the data is clean (no missing values)
    '''
    # Initialize
    cols = data.columns
    
    # Sum of absolute values along the chosen axis
    total = np.sum(np.abs(data), axis=axis).values
    
    # Division
    if axis == 0:
        # If we do normalization by columns, work on the transposed values
        data = data.T
    
    res = np.divide(data.values, total[:, None])
    
    # Transpose back if we did normalization by columns
    if axis == 0:
        res = res.T
        
    # Make DataFrame
    res = pd.DataFrame(res, columns=cols)
        
    return res
In [4]:
print("INPUT:")
display(data)

print("OUTPUT (from scratch):")
display(l1_from_scratch(data, axis=1))

print("OUTPUT (sklearn):")
display(
    pd.DataFrame(
        normalize(data, norm='l1'), columns=data.columns
    )
)
INPUT:
   Age  Songs Listened  Satisfaction
0   20              10           0.7
1   30               8           0.4
2   25              11           0.4
3   10               5           0.8
4   40               7           0.6
OUTPUT (from scratch):
        Age  Songs Listened  Satisfaction
0  0.651466        0.325733      0.022801
1  0.781250        0.208333      0.010417
2  0.686813        0.302198      0.010989
3  0.632911        0.316456      0.050633
4  0.840336        0.147059      0.012605
OUTPUT (sklearn):
        Age  Songs Listened  Satisfaction
0  0.651466        0.325733      0.022801
1  0.781250        0.208333      0.010417
2  0.686813        0.302198      0.010989
3  0.632911        0.316456      0.050633
4  0.840336        0.147059      0.012605
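As a quick sanity check (an extra snippet, not in the original notebook), every row of the L1-normalized output should sum to 1, and the same helper can normalize by column by passing axis=0, which matches sklearn's axis argument:

res = l1_from_scratch(data, axis=1)
print(res.sum(axis=1))   # each row sums to 1.0

# Column-wise L1 normalization: each column now sums to 1
display(l1_from_scratch(data, axis=0))
display(pd.DataFrame(normalize(data, norm='l1', axis=0), columns=data.columns))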

L2 Normalization

As mentioned above, L2 normalization is known as Least squares. It takes the sum of the squares of all elements in a particular row or column, then divides each element by the square root of that sum.

Formula (by row, using the same notation as above):

x'_ij = x_ij / sqrt(x_i1^2 + x_i2^2 + ... + x_in^2)

For example, applying it by row to the first row of the data above, [20, 10, 0.7]:

sqrt(20^2 + 10^2 + 0.7^2) = sqrt(500.49) ≈ 22.3717
20/22.3717 ≈ 0.8940,  10/22.3717 ≈ 0.4470,  0.7/22.3717 ≈ 0.0313

Questions:

  1. When do we need to use L2 normalization?
    • Robustness for the model is not required.
    • We do care about outliers.
    • Stability is required.
    • We want to use non-sparse models, which generate vectors that are mostly non-zero; this also supports computational efficiency.
  2. What are the advantages of using L2 normalization?
    • It supports computational efficiency, so it is also easy to use gradient-based learning methods.
    • It keeps the overall error small.
    • It improves the prediction performance. This is because we consider almost all features.
    • It is a better choice if we want to detect outliers in the datasets.
  3. What are the disadvantages of using L2 normalization?
    • It is not used for performing feature selection.
In [5]:
def l2_from_scratch(data, axis=1):
    ''' This is the implementation of L2 normalization from scratch
    
    Parameters
    ----------
    data: The DataFrame to be normalized
    axis: A way to normalize (0 = by columns, 1 = by rows)
    
    Return
    ------
    L2-normalization result
    
    Notes
    -----
    1. We assume that the data contains only symmetric features (not categorical features)
    2. We assume that the data is clean (no missing values)
    '''
    # Initialize
    cols = data.columns
    
    # Sum of squares along the chosen axis
    total = np.sum(np.square(data), axis=axis)
    
    # Square root of the sum of squares
    sq_rt = np.sqrt(total).values
    
    # Division
    if axis == 0:
        # If we do normalization by columns, work on the transposed values
        data = data.T
    
    res = np.divide(data.values, sq_rt[:, None])
    
    # Transpose back if we did normalization by columns
    if axis == 0:
        res = res.T
        
    # Make DataFrame
    res = pd.DataFrame(res, columns=cols)
        
    return res
In [6]:
print("INPUT:")
display(data)

print("OUTPUT (from scratch):")
display(l2_from_scratch(data, axis=1))

print("OUTPUT (sklearn):")
display(
    pd.DataFrame(
        normalize(data, norm='l2'), columns=data.columns
    )
)
INPUT:
   Age  Songs Listened  Satisfaction
0   20              10           0.7
1   30               8           0.4
2   25              11           0.4
3   10               5           0.8
4   40               7           0.6
OUTPUT (from scratch):
        Age  Songs Listened  Satisfaction
0  0.893989        0.446995      0.031290
1  0.966155        0.257641      0.012882
2  0.915217        0.402695      0.014643
3  0.892146        0.446073      0.071372
4  0.984923        0.172362      0.014774
OUTPUT (sklearn):
        Age  Songs Listened  Satisfaction
0  0.893989        0.446995      0.031290
1  0.966155        0.257641      0.012882
2  0.915217        0.402695      0.014643
3  0.892146        0.446073      0.071372
4  0.984923        0.172362      0.014774
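Similarly, for L2 normalization (again, an extra check that is not in the original notebook) the squared values of each normalized row should sum to 1:

res = l2_from_scratch(data, axis=1)
print(np.square(res).sum(axis=1))   # each row of squared values sums to 1.0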

Min-Max Normalization

Min-Max normalization is the way we map the entire range of values of a particular row or column to the range 0 to 1. The minimum value is mapped to 0, the maximum value is mapped to 1 and every other value is mapped to a decimal between 0 and 1.

Formula:

y = (X_i - X_min) / (X_max - X_min)

X_min = The minimum value of a particular row or column
X_max = The maximum value of a particular row or column
X_i   = The value of a particular element
y     = The value after normalizing

For example, applying it by column to the Age column (X_min = 10, X_max = 40):

(20 - 10)/(40 - 10) ≈ 0.3333,  (30 - 10)/(40 - 10) ≈ 0.6667,  (25 - 10)/(40 - 10) = 0.5,  (10 - 10)/(40 - 10) = 0,  (40 - 10)/(40 - 10) = 1

Questions:

  1. When do we need to use Min-Max normalization?
    • We want gradient descent to converge much faster. (Logistic Regression, SVMs, perceptrons, neural networks, etc.)
    • When our further analysis uses K-NN for classification problems or K-means for clustering problems.
    • When we want to find directions that maximize the variance (LDA, PCA, Kernel-PCA).
  2. What are the advantages of using Min-Max normalization?
    • It works even when the distribution of the feature is not normal.
    • It keeps the feature within a bounded interval (e.g. pixel intensities fit within the 0-255 range).
  3. What are the disadvantages of using Min-Max normalization?
    • It is not a good choice if the dataset contains outliers and we want to detect them. One workaround is to transform the data into 2-dimensional space for visualization; however, this might not be an efficient solution if we try to use the value of the y-axis.
In [7]:
def mm_from_scratch(data, axis=0):
    ''' This is the implementation of Min-Max normalization from scratch
    
    Parameters
    ----------
    data: The DataFrame to be normalized
    axis: A way to normalize (0 = by columns, 1 = by rows)
    
    Return
    ------
    Min-Max normalization result
    
    Notes
    -----
    1. We assume that the data contains only symmetric features (not categorical features)
    2. We assume that the data is clean (no missing values)
    '''
    # Initialize
    cols = data.columns
    
    # Work on a float copy so the original DataFrame is not modified;
    # transpose first when normalizing by rows
    if axis == 0:
        vals = data.values.astype(float)
    else:
        vals = data.T.values.astype(float)
        
    # Rescale each column of vals to the [0, 1] range
    for ind in range(vals.shape[1]):
        min_val = vals[:, ind].min()
        max_val = vals[:, ind].max()
        
        vals[:, ind] = (vals[:, ind] - min_val) / (max_val - min_val)
        
    # Transpose back if we did normalization by rows
    if axis == 1:
        vals = vals.T
    
    # Make DataFrame
    res = pd.DataFrame(vals, columns=cols)
        
    return res
In [8]:
print("INPUT:")
display(data)

print("OUTPUT (from scratch):")
display(mm_from_scratch(data, axis=0))

scaler = MinMaxScaler()

print("OUTPUT (sklearn):")
display(
    pd.DataFrame(
        scaler.fit_transform(data), columns=data.columns
    )
)
INPUT:
   Age  Songs Listened  Satisfaction
0   20              10           0.7
1   30               8           0.4
2   25              11           0.4
3   10               5           0.8
4   40               7           0.6
OUTPUT (from scratch):
        Age  Songs Listened  Satisfaction
0  0.333333        0.833333          0.75
1  0.666667        0.500000          0.00
2  0.500000        1.000000          0.00
3  0.000000        0.000000          1.00
4  1.000000        0.333333          0.50
OUTPUT (sklearn):
/Users/peterdu/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/data.py:334: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by MinMaxScaler.
  return self.partial_fit(X, y)
        Age  Songs Listened  Satisfaction
0  0.333333        0.833333          0.75
1  0.666667        0.500000          0.00
2  0.500000        1.000000          0.00
3  0.000000        0.000000          1.00
4  1.000000        0.333333          0.50
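As a final check (an additional snippet, not in the original notebook), every column of the Min-Max output should now have minimum 0 and maximum 1; sklearn's MinMaxScaler can also map features to a different interval through its feature_range parameter:

scaled = mm_from_scratch(data, axis=0)
print(scaled.min(axis=0))   # each column minimum is 0.0
print(scaled.max(axis=0))   # each column maximum is 1.0

# Scale to the range [0, 10] instead of [0, 1]
scaler_10 = MinMaxScaler(feature_range=(0, 10))
display(pd.DataFrame(scaler_10.fit_transform(data), columns=data.columns))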

Source code: Github (peterdu98)
