Distance measures – Statistics and Python


Nowadays, we can vectorize almost anything: numbers, words, sentences, and so on. Why do we vectorize? We want to make things countable and measurable so that we can apply sophisticated statistical algorithms to them, which is why vectorization is one of the most important steps in feature engineering. We do not discuss those concepts further in this notebook. Instead, this notebook shows how to calculate the distance between two vectors/observations using different distance measures.

There are many distance measures for different types of variables (numeric, categorical, mixed, etc.). This notebook focuses on the following five popular distance measures.

  1. Euclidean
  2. Manhattan
  3. Mahalanobis
  4. Jaccard
  5. Cosine

Why do we need to study distance measures in statistics?

  1. Different distance measures serve different purposes and can produce different results for the same model. It is therefore important to choose the right distance measure for the statistical model at hand, especially in clustering.
  2. As mentioned above, almost everything is vectorized nowadays, so measuring the distance between vectors is how we explore relationships between variables.
  3. Beyond exploring relationships between variables, distances can also be used to observe the behaviour of a particular point/unit.

Libraries

For reading data: Pandas

For scratch implementation: Numpy, Scikit-learn

For existing implementation: SciPy

For visualization: Matplotlib

In [1]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import normalize
from scipy.spatial import distance

import matplotlib.pyplot as plt

Data preparation

In this section, data is prepared for the different distance measures. We have three types of data: numeric data, categorical data and text data.

Data summary:

  1. Numerical data: Age is the age of a person, Songs Listened is the number of songs that person listens to daily, and Satisfaction represents how satisfied that person is.
  2. Categorical data: Gender (0 = Female, 1 = Male), Accounting (0 = NO, 1 = YES), Computer Science (0 = NO, 1 = YES).
  3. Text data: Each row represents a sentence with a different meaning.
In [2]:
num_data = pd.DataFrame(
    [[20, 10, 0.7], [30, 8, 0.4], [25, 11, 0.4], [10, 5, 0.8], [40, 7, 0.6]],
    columns=['Age', 'Songs Listened', 'Satisfaction'], index=['A', 'B', 'C', 'D', 'E']
)

cate_data = pd.DataFrame(
    [[0, 1, 0], [1, 0, 1], [1, 0, 0], [0, 1, 1], [1, 1, 1], [0, 0, 0]],
    columns=['Gender', 'Accounting', 'Computer Science'], index=['A', 'B', 'C', 'D', 'E', 'F']
)

text_data = pd.DataFrame(
    [['The sky is blue'], ['Today is the hottest day'], ['Today is the coolest day'], ['The sun is bright']],
    columns=['text']
)

print("Numerical data:")
display(num_data)
print("Categorical data:")
display(cate_data)
print("Text data:")
display(text_data)
Numerical data:
   Age  Songs Listened  Satisfaction
A   20              10           0.7
B   30               8           0.4
C   25              11           0.4
D   10               5           0.8
E   40               7           0.6
Categorical data:
   Gender  Accounting  Computer Science
A       0           1                 0
B       1           0                 1
C       1           0                 0
D       0           1                 1
E       1           1                 1
F       0           0                 0
Text data:
                        text
0            The sky is blue
1   Today is the hottest day
2   Today is the coolest day
3          The sun is bright

Distance measures

Euclidean

As mentioned, Euclidean distance is one of the most popular distance measures. It calculates the distance between two points as the square root of the sum of squared differences between them over one or more dimensions. We will cover two variants: the ordinary Euclidean distance (with the square root) and the squared Euclidean distance.

Formula:

  $d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$ (Euclidean), and $d^2(X, Y) = \sum_{i=1}^{n} (x_i - y_i)^2$ (squared Euclidean)

For example:

INPUT

  Age  Satisfaction
   30           0.4
   40           0.6

OUTPUT

  Method             L2-Norm  Result
  Euclidean          No       10.00
  Euclidean          Row      0.001
  Euclidean          Column   0.34
  Squared Euclidean  No       100.04
  Squared Euclidean  Row      2.77e-06
  Squared Euclidean  Column   0.12
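
The numbers in the table above can be reproduced directly with scikit-learn and SciPy. The sketch below is illustrative only and assumes the two (Age, Satisfaction) observations from the example; it compares no normalization, row (sample) normalization and column (feature) normalization.

import numpy as np
from scipy.spatial import distance
from sklearn.preprocessing import normalize

# The two observations from the example above: (Age, Satisfaction)
pts = np.array([[30, 0.4],
                [40, 0.6]])

# No normalization
print(distance.euclidean(pts[0], pts[1]))      # ~10.00
print(distance.sqeuclidean(pts[0], pts[1]))    # ~100.04

# Row (sample) normalization: each observation is scaled to unit length,
# so both points collapse close to (1, 0) and the distance becomes tiny
rows = normalize(pts, norm='l2', axis=1)
print(distance.euclidean(rows[0], rows[1]))

# Column (feature) normalization: each variable is scaled to unit length,
# which only rescales the axes without changing the shape of the point cloud
cols = normalize(pts, norm='l2', axis=0)
print(distance.euclidean(cols[0], cols[1]))    # ~0.34
print(distance.sqeuclidean(cols[0], cols[1]))  # ~0.12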

Questions:

  1. When do we need to use Euclidean?
    • We have continuous numerical variables.
    • We want to reflect absolute distances.
    • We do not need to remove redundancy or account for correlation between variables (Euclidean treats every dimension independently).
  2. What do we need to do before applying Euclidean?
    • Normally, we need to scale (normalize) the variables before applying Euclidean, because we cannot meaningfully combine variables measured on very different scales, such as Age and Satisfaction.
    • In some cases, normalization is unnecessary because the variables are already on the same scale; it is therefore worth inspecting the variables (e.g. with a scatter plot) before deciding.
  3. What are the differences between Euclidean and Squared Euclidean?
    • Squared Euclidean omits the square root, so it is cheaper to compute while preserving the ordering of distances; some clustering problems are therefore faster with Squared Euclidean than with Euclidean.
  4. Should we normalize data by samples (rows) or features (columns)?
    • Since we want variables to have the same scale, we should normalize data by features.
    • Another reason is that sample (row) normalization can change the relative positions (shape) of the points, which affects the distances significantly, whereas feature (column) normalization only rescales each axis without changing the shape of the point cloud (compare the row- and column-normalized results in the sketch above).



In [3]:
def euclidean_from_scratch(X, Y, norm='l2', squared=False):
    ''' This is the implementation of Euclidean distance from scratch
    
    Parameters
    ----------
    X: 1-D array of variable X with n-dimensions
    Y: 1-D array of variable Y with n-dimensions
    norm: Normalization technique (by default, 'l2')
    squared: Squared Euclidean result (by default, False)
    
    Return
    ------
    A (squared) Euclidean distance between two observations.
    
    Notes
    -----
    1. We assume that X and Y contain only continuous numerical elements.
    2. We assume that there is no missing value in X and Y.
    3. We assume that a number of elements in X and Y are the same.
    '''
    # Initialize
    res = 0.0
    
    # Normalization
    if norm is not None:
        temp = normalize([X, Y], norm=norm, axis=0)
        X = temp[0]
        Y = temp[1]
    
    # Subtraction
    res = np.subtract(X, Y)
    
    # Square each unit
    res = np.square(res)
    
    # Summation
    res = np.sum(res)
    
    # Root square
    if not squared:
        res = np.sqrt(res)
        
    return res
In [4]:
_data  = num_data.loc[['B', 'C']]
_norm_data = normalize(_data, axis=0)
_res   = euclidean_from_scratch(_data.values[0], _data.values[1])
_other = distance.euclidean(_norm_data[0], _norm_data[1])

print("INPUT:")
display(_data)

print("OUTPUT (scratch implementation): {}".format(round(_res, 3)))
print("OUTPUT (scipy distance): {}".format(round(_other, 3)))
INPUT:
   Age  Songs Listened  Satisfaction
B   30               8           0.4
C   25              11           0.4
OUTPUT (scratch implementation): 0.255
OUTPUT (scipy distance): 0.255

Manhattan

Manhattan distance calculates the distance between two points as the sum of the absolute differences between their coordinates over one or more dimensions.

Formula:

  $d(X, Y) = \sum_{i=1}^{n} |x_i - y_i|$

For example:

INPUT

  Age  Satisfaction
   30           0.4
   40           0.6

OUTPUT

  Method     L2-Norm  Result
  Manhattan  No       10.2
  Manhattan  Row      0.002
  Manhattan  Column   0.48
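
As a quick check, the table above can be reproduced with SciPy's cityblock (Manhattan) distance. This sketch is illustrative only and assumes the same two (Age, Satisfaction) observations and the same column-wise L2 normalization used elsewhere in the notebook.

import numpy as np
from scipy.spatial import distance
from sklearn.preprocessing import normalize

pts = np.array([[30, 0.4],
                [40, 0.6]])

# No normalization: |30 - 40| + |0.4 - 0.6| = 10.2
print(distance.cityblock(pts[0], pts[1]))

# Column (feature) normalization, then Manhattan distance (~0.48)
cols = normalize(pts, norm='l2', axis=0)
print(distance.cityblock(cols[0], cols[1]))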

Questions:

  1. When do we need to use Manhattan?
    • We cannot measure the distance between two points along a straight line as with Euclidean; instead, we measure it along paths between the two points (as in a city grid).
    • We use other statistical techniques based on the L1-norm, i.e. the sum of absolute differences between observations (points).
    • We want to measure differences between the frequency distributions of vectors, which is common in text mining.
  2. Euclidean distance or Manhattan distance?
    • In some classification algorithms, such as k-nearest neighbours (k-NN), both Euclidean distance and Manhattan distance can be used.
    • For high-dimensional data, Manhattan distance might work better than Euclidean distance.
In [5]:
def manhattan_from_scratch(X, Y, norm='l2'):
    ''' This is the implementation of Manhattan distance from scratch
        
    Parameters
    ----------
    X: 1-D array of variable X with n-dimensions
    Y: 1-D array of variable Y with n-dimensions
    norm: Normalization technique (by default, 'l2')
    
    Return
    ------
    A Manhattan distance between two observations
    
    Notes
    -----
    1. We assume that X and Y contain only continuous numerical elements.
    2. We assume that there is no missing value in X and Y
    3. We assume that a number of elements in X and Y are the same.
    '''
    # Initialize
    res = 0.0
    
    # Normalization
    if norm is not None:
        temp = normalize([X, Y], norm=norm, axis=0)
        X = temp[0]
        Y = temp[1]
        
    # Subtraction
    res = np.subtract(X, Y)
    
    # Absolute values
    res = np.abs(res)
    
    # Summation
    res = np.sum(res)
    
    return res
In [6]:
_data  = num_data.loc[['B', 'C']]
_norm_data = normalize(_data, axis=0)
_res   = manhattan_from_scratch(_data.values[0], _data.values[1])
_other = distance.cityblock(_norm_data[0], _norm_data[1])

print("INPUT:")
display(num_data.loc[['B', 'C']])

print("OUTPUT (scratch implementation): {}".format(round(_res, 3)))
print("OUTPUT (scipy distance): {}".format(round(_other, 3)))
INPUT:
   Age  Songs Listened  Satisfaction
B   30               8           0.4
C   25              11           0.4
OUTPUT (scratch implementation): 0.349
OUTPUT (scipy distance): 0.349

Mahalanobis distance

Mahalanobis distance measures the distance between two points while taking the covariance of the data into account. In its classical form, it tells us how many standard deviations away a point is from the mean of a distribution; between two points, it generalizes Euclidean distance by weighting each direction with the inverse covariance matrix. Mahalanobis distance is known as an effective multivariate distance metric.

Formula:

  $d(X, Y) = \sqrt{(X - Y)^{T} S^{-1} (X - Y)}$, where $S$ is the covariance matrix of the data

For example, the calculation between observations B and C is shown in the code cells below.

Questions:

  1. When do we need to use Mahalanobis distance?
    • We want to calculate the distance between observations of continuous numerical variables.
    • We want to take the correlation between variables into account before calculating distances between observations.
    • We want to detect anomalies in high-dimensional datasets.
    • We want to apply classification to highly imbalanced datasets.
    • We want to calculate the distance between two observations without depending on the scale of the variables.
  2. Euclidean distance or Mahalanobis distance?
    • Euclidean distance is a purely distance-based method: it only looks at the gap between two points without considering other factors, so it can run into trouble with high-dimensional data and differently scaled variables. In contrast, Mahalanobis distance takes the covariance between variables into account before calculating the distance, which avoids those issues (see the sketch after this list).
    • Since Euclidean distance ignores the covariance between variables, it gives the same result whether or not variable X is correlated with variable Y. Mahalanobis distance, on the other hand, gives different results depending on the correlation between two or more variables.
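
A minimal sketch of the second point, using hypothetical synthetic data: the Euclidean distance between the same pair of points is identical whether or not the variables are correlated, while the Mahalanobis distance changes with the covariance of the data.

import numpy as np
from scipy.spatial import distance

rng = np.random.default_rng(0)
x = rng.normal(size=200)

# Hypothetical data: one set with strongly correlated columns, one without
corr_data   = np.column_stack([x, x + 0.1 * rng.normal(size=200)])
uncorr_data = np.column_stack([x, rng.normal(size=200)])

p = np.array([0.0, 0.0])
q = np.array([1.0, 1.0])

# Euclidean ignores the covariance structure, so it is the same in both cases
print(distance.euclidean(p, q))

# Mahalanobis uses the inverse covariance of the data, so the same pair of
# points appears "closer" when their difference lies along the correlated direction
for data in (corr_data, uncorr_data):
    inv_cov = np.linalg.inv(np.cov(data, rowvar=False))
    print(distance.mahalanobis(p, q, inv_cov))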
In [7]:
def mahalanobis_from_scratch(X, Y, cov):
    ''' This is the implementation of Mahalanobis distance from scratch
    
    Parameters
    ----------
    X: 1-D array of variable X with n-dimensions
    Y: 1-D array of variable Y with n-dimensions
    cov: A covariance matrix of variables.
    
    Return
    ------
    A Mahalanobis distance between two observations
    
    Notes
    -----
    1. We assume that X and Y contain only continuous numerical elements.
    2. We assume that there is no missing value in X and Y.
    3. We assume that a number of elements in X and Y are the same.
    4. We assume that the covariance matrix is n x n, matching the
        dimensionality of X and Y.
    '''
    # Initialize
    res = 0.0
    X = np.array(X)
    Y = np.array(Y)
    
    # Subtraction
    sub = np.subtract(X, Y)
    
    # Inverse covariance matrix
    inv_matrix = np.linalg.inv(cov)
    
    # Calculation
    res = np.sqrt(np.dot(np.dot(sub, inv_matrix), sub.T))
    
    return res
In [8]:
_data  = num_data.loc[['B', 'C']]
_cov   = np.cov(num_data, rowvar=False)
_inv   = np.linalg.inv(_cov)
_res   = mahalanobis_from_scratch(_data.values[0], _data.values[1], _cov)
_other = distance.mahalanobis(_data.values[0], _data.values[1], _inv)

print("INPUT:")
display(_data)

print("OUTPUT (scratch implementation): {}".format(round(_res, 3)))
print("OUTPUT (scipy distance): {}".format(round(_other, 3)))
INPUT:
   Age  Songs Listened  Satisfaction
B   30               8           0.4
C   25              11           0.4
OUTPUT (scratch implementation): 1.553
OUTPUT (scipy distance): 1.553

Jaccard index

In this section, we move to another type of distance measure which is for categorical variables. Before going ahead with the Jaccard index, we need to understand two important concepts.

  1. Similarity measures how similar two observations are to each other. The higher the similarity value, the more alike the two observations are.
  2. Distance (dissimilarity) measures how different two observations are from each other. The lower the distance value, the more alike the two observations are.

The Jaccard index is one of the most popular distance measures for binary variables (variables with values of either zero or one). Its similarity coefficient is defined as the size of the intersection divided by the size of the union of the sample sets. Conversely, the Jaccard distance is obtained by subtracting the similarity coefficient from 1.

Formula:

  $J(X, Y) = \dfrac{|X \cap Y|}{|X \cup Y|}$ (similarity), and $d_J(X, Y) = 1 - J(X, Y)$ (distance)

For example:

INPUT:

  Gender  Accounting  Computer Science
       1           0                 1
       0           1                 1

OUTPUT:

  Method      Result
  Similarity  0.33
  Distance    0.67

Questions:

  1. When do we need to use the Jaccard index?
    • We want to explore the similarity between two observations described by categorical variables.
    • We want to evaluate the accuracy of object detection (as intersection over union, IoU) with convolutional neural networks (CNNs).
  2. How can we deal with special situations?
    • For variables with more than two categories, we can encode the categories as strings or lists of binary numbers. For example, if a question has four different options, we can encode them as [0,0], [0,1], [1,0] and [1,1] (a sketch after this list shows a related one-hot encoding approach).
    • For a dataset that contains missing values, we can fill or replace them with a value that does not otherwise occur in the dataset.
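
As a small illustration of the multi-category case above (using one-hot rather than the two-bit encoding mentioned, so it is a related but not identical approach), a categorical variable can be expanded into binary columns and the Jaccard index applied to the result. The variable name and values here are hypothetical.

import pandas as pd
from scipy.spatial import distance

# Hypothetical multi-category variable with more than two categories
answers = pd.Series(['math', 'physics', 'math', 'chemistry'], name='favourite_subject')

# One-hot encode into 0/1 columns, one per category
encoded = pd.get_dummies(answers).astype(int)

# Jaccard distance and similarity between the first two respondents
d = distance.jaccard(encoded.values[0], encoded.values[1])
print("Distance:   ", round(d, 3))
print("Similarity: ", round(1 - d, 3))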
In [9]:
def jaccard_from_scratch(X, Y, method='dist'):
    ''' This is the implementation of Jaccard index (distance|similarity) from scratch
    
    Parameters
    ----------
    X: 1-D array of variable X with n-dimensions
    Y: 1-D array of variable Y with n-dimensions
    method: A method to calculate the coefficient (dist|sim, by default: dist)
    
    Return
    ------
    A Jaccard (distance|similarity) between two observations
    
    Notes
    -----
    1. We assume that there is no missing value in X and Y
    2. We assume that X and Y contain only categorical elements
    '''
    # Initialize
    res = 0.0
    
    # Intersection
    inter = np.bitwise_and(X, Y)
    
    # Union
    uni = np.bitwise_or(X, Y)
    
    # Count non-zero values
    inter = len(np.nonzero(inter)[0])
    uni = len(np.nonzero(uni)[0])
    
    # Similarity coefficient
    res = inter/uni
    
    if method == 'dist':
        res = 1 - res
    
    return res
In [10]:
_data  = cate_data.loc[['B', 'D']]
_res   = jaccard_from_scratch(_data.values[0], _data.values[1])
_other = distance.jaccard(_data.values[0], _data.values[1])

print("INPUT:")
display(_data)

print("OUTPUT (scratch implementation): {}".format(round(_res, 3)))
print("OUTPUT (scipy distance): {}".format(round(_other, 3)))
INPUT:
   Gender  Accounting  Computer Science
B       1           0                 1
D       0           1                 1
OUTPUT (scratch implementation): 0.667
OUTPUT (scipy distance): 0.667

Cosine similarity

In this section, we move to another type of distance measure, one for text data (or vectors in general). As with the Jaccard index, we need to understand two important concepts before going ahead.

  1. Similarity measures how similar two observations are to each other. The higher the similarity value, the more alike the two observations are.
  2. Distance measures how different two observations are from each other. The lower the distance value, the more alike the two observations are.

Cosine similarity measures the similarity of two non-zero vectors as the cosine of the angle between them. The similarity coefficient ranges from -1 to 1 inclusive.

Formula:

  $\text{similarity}(X, Y) = \cos(\theta) = \dfrac{X \cdot Y}{\lVert X \rVert \, \lVert Y \rVert}$, and $\text{distance}(X, Y) = 1 - \text{similarity}(X, Y)$

For example:

INPUT:

     text
  A  Today is the hottest day
  B  Today is the coolest day

PROCESS (term counts):

     Today  is  the  hottest  coolest  day
  A      1   1    1        1        0    1
  B      1   1    1        0        1    1

OUTPUT:

  Method      Result
  Similarity  0.8
  Distance    0.2
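
The counted vectors in the PROCESS table are built by hand further below; a minimal sketch, assuming scikit-learn's CountVectorizer (method names as in recent scikit-learn versions), shows how the same term counts and the cosine distance can be produced automatically.

from sklearn.feature_extraction.text import CountVectorizer
from scipy.spatial import distance

sentences = ['Today is the hottest day', 'Today is the coolest day']

# Bag-of-words term counts (tokens are lowercased by default)
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(sentences).toarray()
print(vectorizer.get_feature_names_out())
print(counts)

# Cosine distance = 1 - cosine similarity (~0.2 for these two sentences)
print(distance.cosine(counts[0], counts[1]))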

Questions:

  1. When do we need to use Cosine similarity?
    • We want to see the similarity between vectors.
    • We are doing text analysis in which we need to convert words to vectors.
  2. How do we deal with negative values when using Cosine similarity?
    • If we are only interested in the similarity coefficient, we can clip it so that its minimum value is 0 and its maximum value is 1.
    • If we are interested in measuring the distance, we can use the absolute value of the similarity coefficient and report the direction separately based on its sign.
In [11]:
def cosine_sim_from_scratch(X, Y, method='dist'):
    ''' This is the implementation of Cosine (similarity|distance) from scratch.
    
    Parameters
    ----------
    X: 1-D array of variable X with n-dimensions
    Y: 1-D array of variable Y with n-dimensions
    method: A method to calculate the coefficient (dist|sim, by default: dist)
    
    Return
    ------
    A Cosine (similarity|distance) of two vectors.
    
    Notes
    -----
    1. We assume that X and Y contain only counted vectors.
    2. We assume that the length of X and Y are the same.
    '''
    # Initialize
    res = 0.0
    
    # Numerator: dot product of the two vectors
    a = np.dot(X, Y)
    
    # Denominator: product of the vector norms
    ux = np.sqrt(np.sum(np.square(X)))
    uy = np.sqrt(np.sum(np.square(Y)))
    
    # Similarity coeff
    res = a/(ux * uy)
    
    if method == 'dist':
        return 1 - res
    
    return res
In [12]:
_data  = text_data.loc[[1, 2]]
_counted_data = pd.DataFrame(
    [[1, 1, 1, 1, 0, 1], [1, 1, 1, 0, 1, 1]], columns=['Today', 'is', 'the', 'hottest', 'coolest', 'day']
)

_res   = cosine_sim_from_scratch(_counted_data.values[0], _counted_data.values[1])
_other = distance.cosine(_counted_data.values[0], _counted_data.values[1])

print("INPUT (text):")
display(_data)
print("INPUT (counted vectors):")
display(_counted_data)

print("OUTPUT (scratch implementation): {}".format(round(_res, 3)))
print("OUTPUT (scipy distance): {}".format(round(_other, 3)))
INPUT (text):
     text
  1  Today is the hottest day
  2  Today is the coolest day
INPUT (counted vectors):
     Today  is  the  hottest  coolest  day
  0      1   1    1        1        0    1
  1      1   1    1        0        1    1
OUTPUT (scratch implementation): 0.2
OUTPUT (scipy distance): 0.2

References

The knowledge covered in this notebook comes from the following sources.

  1. Book:
    • Glenn J. Myatt and Wayne P. Johnson, Making Sense of Data II – A Practical Guide to Data Visualization, Advanced Data Mining Methods, and Applications.

I have uploaded the notebook to GitHub: distance_measures.ipynb
