# Content

Nowadays, we can vectorize almost anything: numbers, words, sentences, etc. Why do we vectorize everything? Because vectors make things countable and measurable, which lets us apply a wide range of statistical algorithms to them. That is why vectorization is one of the most important steps in feature engineering. We will not discuss those concepts further in this notebook; instead, this notebook shows how to calculate `the distance between two vectors/observations` using different distance measures.

There are many distance measures for different types of variables, such as numeric, categorical and mixed. This notebook focuses on the following five popular distance measures.

1. Euclidean
2. Manhattan
3. Mahalanobis
4. Jaccard
5. Cosine

Why do we need to study distance measures in statistics?

1. Different distance measures serve different purposes and can produce different results for the same model. It is therefore important to choose the right distance measure for a statistical model, especially in clustering.
2. As mentioned above, everything can be vectorized, so measuring the distance between vectors is how we explore relationships between variables.
3. Beyond exploring relationships between variables, distances can also be used to observe the behaviour of a particular point/unit.

# Libraries

For scratch implementations: NumPy, scikit-learn

For existing implementation: SciPy

For visualization: Matplotlib

In :
```
import pandas as pd
import numpy as np

from sklearn.preprocessing import normalize
from scipy.spatial import distance

import matplotlib.pyplot as plt
```

# Data preparation

In this section, data is prepared for different distance measures. We have 3 different types of data, including numeric data, categorical data and text data.

Data summary:

1. Numerical data: Age is the age of a person, Songs Listened is the number of songs that person listens to daily, and Satisfaction is how satisfied that person is.
2. Categorical data: Gender (0 = Female, 1 = Male), Accounting (0 = NO, 1 = YES), Computer Science (0 = NO, 1 = YES).
3. Text data: Each row represents a sentence with a different meaning.
In :
```
num_data = pd.DataFrame(
    [[20, 10, 0.7], [30, 8, 0.4], [25, 11, 0.4], [10, 5, 0.8], [40, 7, 0.6]],
    columns=['Age', 'Songs Listened', 'Satisfaction'], index=['A', 'B', 'C', 'D', 'E']
)

cate_data = pd.DataFrame(
    [[0, 1, 0], [1, 0, 1], [1, 0, 0], [0, 1, 1], [1, 1, 1], [0, 0, 0]],
    columns=['Gender', 'Accounting', 'Computer Science'], index=['A', 'B', 'C', 'D', 'E', 'F']
)

text_data = pd.DataFrame(
    [['The sky is blue'], ['Today is the hottest day'], ['Today is the coolest day'], ['The sun is bright']],
    columns=['text']
)

print("Numerical data:")
display(num_data)
print("Categorical data:")
display(cate_data)
print("Text data:")
display(text_data)
```
`Numerical data:`
Age Songs Listened Satisfaction
A 20 10 0.7
B 30 8 0.4
C 25 11 0.4
D 10 5 0.8
E 40 7 0.6
`Categorical data:`
Gender Accounting Computer Science
A 0 1 0
B 1 0 1
C 1 0 0
D 0 1 1
E 1 1 1
F 0 0 0
`Text data:`
text
0 The sky is blue
1 Today is the hottest day
2 Today is the coolest day
3 The sun is bright

# Distance measures

## Euclidean

As mentioned, Euclidean distance is one of the most popular distance measures. It calculates the distance between two points as the square root of the sum of squared differences between them in one or more dimensions. We will cover two variants: the normal Euclidean distance (with the square root) and the squared Euclidean distance.

Formula:

$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$

For example:

INPUT:

```
Age  Satisfaction
30   0.4
40   0.6
```

OUTPUT:

```
Method             L2-Norm  Result
Euclidean          NO       10.00
Euclidean          Row      0.001
Euclidean          Column   0.34
Squared Euclidean  NO       100.04
Squared Euclidean  Row      2.77e-06
Squared Euclidean  Column   0.12
```
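As a quick sanity check, the unnormalized figures in the table above can be reproduced with a few lines of NumPy (a sketch, using the two example rows):

```python
import numpy as np

x = np.array([30, 0.4])  # Age, Satisfaction of the first person
y = np.array([40, 0.6])  # Age, Satisfaction of the second person

# Plain Euclidean: sqrt((30-40)^2 + (0.4-0.6)^2) ~= 10.00
euclidean = np.sqrt(np.sum(np.square(x - y)))

# Squared Euclidean: (30-40)^2 + (0.4-0.6)^2 = 100.04
squared = np.sum(np.square(x - y))

print(round(euclidean, 2), round(squared, 2))
```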

Questions:

1. When do we need to use Euclidean distance?
• We have continuous numerical variables.
• We want to reflect absolute distances.
• We do not need to account for redundancy (correlation) between variables.
2. What do we need to do before applying Euclidean distance?
• Normally, we need to scale (normalize) the variables first, because we cannot meaningfully measure the distance between two variables on different scales, such as Age and Satisfaction.
• In some cases, the variables are already on the same scale and no normalization is needed. It is therefore also helpful to make a scatter plot of the variables first.
3. What are the differences between Euclidean and squared Euclidean?
• In some clustering problems, squared Euclidean is faster to compute than Euclidean because it skips the square root.
4. Should we normalize data by samples (rows) or features (columns)?
• Since we want variables to be on the same scale, we should normalize by features.
• Another reason is that sample normalization can change the shape (relative positions) of the points, which affects our measures significantly, while feature normalization only changes the scale of each feature without affecting the shape of the points.
In :
```
def euclidean_from_scratch(X, Y, norm='l2', squared=False):
    ''' This is the implementation of Euclidean distance from scratch.

    Parameters
    ----------
    X: 1-D array of variable X with n dimensions
    Y: 1-D array of variable Y with n dimensions
    norm: Normalization technique (by default, 'l2')
    squared: Return the squared Euclidean distance (by default, False)

    Return
    ------
    A (squared) Euclidean distance between two observations.

    Notes
    -----
    1. We assume that X and Y contain only continuous numerical elements.
    2. We assume that there is no missing value in X and Y.
    3. We assume that the number of elements in X and Y is the same.
    '''
    # Normalization (feature-wise, across the two observations)
    if norm is not None:
        X, Y = normalize([X, Y], norm=norm, axis=0)

    # Subtraction
    res = np.subtract(X, Y)

    # Square each element
    res = np.square(res)

    # Summation
    res = np.sum(res)

    # Square root
    if not squared:
        res = np.sqrt(res)

    return res
```
In :
```
_data  = num_data.loc[['B', 'C']]
_norm_data = normalize(_data, axis=0)
_res   = euclidean_from_scratch(_data.values[0], _data.values[1])
_other = distance.euclidean(_norm_data[0], _norm_data[1])

print("INPUT:")
display(_data)

print("OUTPUT (scratch implementation): {}".format(round(_res, 3)))
print("OUTPUT (scipy distance): {}".format(round(_other, 3)))
```
`INPUT:`
Age Songs Listened Satisfaction
B 30 8 0.4
C 25 11 0.4
```
OUTPUT (scratch implementation): 0.255
OUTPUT (scipy distance): 0.255
```
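The preference for feature normalization (question 4 above) can be illustrated numerically. A minimal sketch with made-up Age/Satisfaction points: column normalization only rescales each axis, keeping the relative positions of the points, while row normalization projects every point onto the unit circle and so changes their shape:

```python
import numpy as np
from sklearn.preprocessing import normalize

points = np.array([[20.0, 0.7],
                   [30.0, 0.4],
                   [40.0, 0.6]])  # made-up Age, Satisfaction values

# Feature (column) normalization: each column is rescaled independently,
# so the ordering and relative spread along each axis are preserved.
by_feature = normalize(points, norm='l2', axis=0)

# Sample (row) normalization: every point is mapped onto the unit
# circle, so points with similar directions collapse together.
by_sample = normalize(points, norm='l2', axis=1)

print(by_feature)
print(by_sample)
```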

## Manhattan

Manhattan distance is used to calculate the distance between two points by the sum of absolute distances between them in one or more dimensions.

Formula:

$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$

For example:

INPUT:

```
Age  Satisfaction
30   0.4
40   0.6
```

OUTPUT:

```
Method     L2-Norm  Result
Manhattan  NO       10.2
Manhattan  Row      0.002
Manhattan  Column   0.48
```

Questions:

1. When do we need to use Manhattan distance?
• We cannot measure the distance between two points directly, as with Euclidean; instead, we measure it along paths between the two points.
• We use other statistical techniques, like the L1 norm, that require the sum of absolute differences between observations (points).
• We want to measure differences in frequency distributions between vectors, which commonly applies to text mining.
2. Euclidean distance or Manhattan distance?
• In some classification algorithms, like k-nearest neighbors (k-NN), we can use either Euclidean or Manhattan distance.
• For high-dimensional data, Manhattan distance might work better than Euclidean distance.
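The high-dimensional point can be illustrated with a small, hypothetical experiment: the relative contrast between the nearest and farthest of 500 random points shrinks as dimensionality grows, which is one commonly cited reason to prefer the L1 norm in high dimensions (a sketch, not a proof):

```python
import numpy as np

rng = np.random.default_rng(0)

# Relative contrast (farthest minus nearest, over nearest) of Euclidean
# distances from the origin to 500 random points: as dimensionality
# grows, distances concentrate and the contrast shrinks.
contrast = {}
for dim in (2, 100, 10000):
    pts = rng.random((500, dim))
    d = np.linalg.norm(pts, axis=1)
    contrast[dim] = (d.max() - d.min()) / d.min()
    print(dim, round(contrast[dim], 3))
```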
In :
```
def manhattan_from_scratch(X, Y, norm='l2'):
    ''' This is the implementation of Manhattan distance from scratch.

    Parameters
    ----------
    X: 1-D array of variable X with n dimensions
    Y: 1-D array of variable Y with n dimensions
    norm: Normalization technique (by default, 'l2')

    Return
    ------
    A Manhattan distance between two observations.

    Notes
    -----
    1. We assume that X and Y contain only continuous numerical elements.
    2. We assume that there is no missing value in X and Y.
    3. We assume that the number of elements in X and Y is the same.
    '''
    # Normalization (feature-wise, across the two observations)
    if norm is not None:
        X, Y = normalize([X, Y], norm=norm, axis=0)

    # Subtraction
    res = np.subtract(X, Y)

    # Absolute values
    res = np.abs(res)

    # Summation
    res = np.sum(res)

    return res
```
In :
```
_data  = num_data.loc[['B', 'C']]
_norm_data = normalize(_data, axis=0)
_res   = manhattan_from_scratch(_data.values[0], _data.values[1])
_other = distance.cityblock(_norm_data[0], _norm_data[1])

print("INPUT:")
display(_data)

print("OUTPUT (scratch implementation): {}".format(round(_res, 3)))
print("OUTPUT (scipy distance): {}".format(round(_other, 3)))
```
`INPUT:`
Age Songs Listened Satisfaction
B 30 8 0.4
C 25 11 0.4
```
OUTPUT (scratch implementation): 0.349
OUTPUT (scipy distance): 0.349
```

## Mahalanobis distance

Mahalanobis distance measures the distance between two points while taking the covariance of the data into account. Intuitively, if we measure the distance between a point A and a distribution B, it tells us how many standard deviations away A is from the mean of B. Mahalanobis distance is known as an effective multivariate distance metric.

Formula:

$d(x, y) = \sqrt{(x - y)^T S^{-1} (x - y)}$

where $S$ is the covariance matrix of the variables.

Questions:

1. When do we need to use Mahalanobis distance?
• We want to calculate the distance between two continuous numerical observations.
• We want to take the correlation between variables into account before calculating distances between observations.
• We want to detect anomalies in high-dimensional datasets.
• We want to apply classification to highly imbalanced datasets.
• We want to calculate the distance between two observations independently of the scale of the variables.
2. Euclidean distance or Mahalanobis distance?
• Euclidean distance is a purely distance-based method: it only considers the distance between two points, without other factors. As a result, it can run into issues with high-dimensional data and differently scaled variables. In contrast, Mahalanobis distance takes the covariance between variables into account before calculating distances, which avoids these issues.
• Since Euclidean distance does not consider covariance between variables, it produces the same result whether or not variable X is correlated with variable Y. Mahalanobis distance, meanwhile, produces different results depending on the correlation between two or more variables.
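This difference can be seen in a small sketch with made-up covariance matrices: the Euclidean distance between two fixed points never changes, while the Mahalanobis distance shrinks once the variables are correlated in the direction of the difference:

```python
import numpy as np
from scipy.spatial import distance

# Two made-up covariance structures over the same pair of points:
# uncorrelated features vs strongly correlated features.
uncorrelated = np.array([[1.0, 0.0], [0.0, 1.0]])
correlated   = np.array([[1.0, 0.8], [0.8, 1.0]])

u = np.array([0.0, 0.0])
v = np.array([1.0, 1.0])

# Euclidean ignores the covariance, so it is the same in both cases.
print(distance.euclidean(u, v))

# Mahalanobis changes with the covariance: moving "with" the
# correlation is less surprising than moving against it.
print(distance.mahalanobis(u, v, np.linalg.inv(uncorrelated)))
print(distance.mahalanobis(u, v, np.linalg.inv(correlated)))
```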
In :
```
def mahalanobis_from_scratch(X, Y, cov):
    ''' This is the implementation of Mahalanobis distance from scratch.

    Parameters
    ----------
    X: 1-D array of variable X with n dimensions
    Y: 1-D array of variable Y with n dimensions
    cov: A covariance matrix of the variables

    Return
    ------
    A Mahalanobis distance between two observations.

    Notes
    -----
    1. We assume that X and Y contain only continuous numerical elements.
    2. We assume that there is no missing value in X and Y.
    3. We assume that the number of elements in X and Y is the same.
    4. We assume that the covariance matrix has n dimensions,
       the same as X and Y.
    '''
    X = np.array(X)
    Y = np.array(Y)

    # Subtraction
    sub = np.subtract(X, Y)

    # Inverse covariance matrix
    inv_matrix = np.linalg.inv(cov)

    # Calculation
    res = np.sqrt(np.dot(np.dot(sub, inv_matrix), sub.T))

    return res
```
In :
```
_data  = num_data.loc[['B', 'C']]
_cov   = np.cov(num_data, rowvar=False)
_inv   = np.linalg.inv(_cov)
_res   = mahalanobis_from_scratch(_data.values[0], _data.values[1], _cov)
_other = distance.mahalanobis(_data.values[0], _data.values[1], _inv)

print("INPUT:")
display(_data)

print("OUTPUT (scratch implementation): {}".format(round(_res, 3)))
print("OUTPUT (scipy distance): {}".format(round(_other, 3)))
```
`INPUT:`
Age Songs Listened Satisfaction
B 30 8 0.4
C 25 11 0.4
```
OUTPUT (scratch implementation): 1.553
OUTPUT (scipy distance): 1.553
```

## Jaccard index

In this section, we move to another type of distance measure, one for categorical variables. Before going ahead with the Jaccard index, we need to understand two important concepts.

1. Similarity measures how similar two observations are to each other. The higher the similarity value, the more alike two observations are.
2. Distance (dissimilarity) measures how different two observations are from each other. The lower the value, the more alike two observations are.

The Jaccard index is one of the most popular binary distance measures (for variables whose values are either zero or one). Its similarity coefficient is defined as the size of the intersection divided by the size of the union of the sample sets. The Jaccard distance is obtained by subtracting the similarity coefficient from 1.

Formula:

$J(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}, \qquad d_J(X, Y) = 1 - J(X, Y)$

For example:

INPUT:

```
Gender  Accounting  Computer Science
1       0           1
0       1           1
```

OUTPUT:

```
Method      Result
Similarity  0.33
Distance    0.67
```

Questions:

1. When do we need to use the Jaccard index?
• We want to explore the similarity between two observations containing categorical variables.
• We want to conceptualize the accuracy of object detection using convolutional neural networks (CNNs).
2. How can we deal with special situations?
• For variables with more than two categories, we can encode the categories into a string or list of binary numbers. For example, if a question has 4 different options, we can encode them as [0,0], [0,1], [1,0] and [1,1].
• For a dataset that contains missing values, we can fill or replace those missing values with a number that does not occur elsewhere in the dataset.
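The encoding suggested in the first bullet can be sketched directly (the option labels and answers here are hypothetical):

```python
# Encode a hypothetical 4-option variable into two binary columns, as
# suggested above, so the Jaccard index can work on the result.
encoding = {'A': [0, 0], 'B': [0, 1], 'C': [1, 0], 'D': [1, 1]}

answers = ['B', 'D', 'A']

# Flatten each answer's two-bit code into one binary vector.
encoded = [bit for a in answers for bit in encoding[a]]
print(encoded)  # [0, 1, 1, 1, 0, 0]
```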
In :
```
def jaccard_from_scratch(X, Y, method='dist'):
    ''' This is the implementation of the Jaccard index (distance|similarity) from scratch.

    Parameters
    ----------
    X: 1-D array of variable X with n dimensions
    Y: 1-D array of variable Y with n dimensions
    method: A method to calculate the coefficient ('dist'|'sim', by default: 'dist')

    Return
    ------
    A Jaccard (distance|similarity) between two observations.

    Notes
    -----
    1. We assume that there is no missing value in X and Y.
    2. We assume that X and Y contain only binary categorical elements.
    '''
    # Intersection
    inter = np.bitwise_and(X, Y)

    # Union
    uni = np.bitwise_or(X, Y)

    # Count non-zero values
    inter = np.count_nonzero(inter)
    uni = np.count_nonzero(uni)

    # Similarity coefficient
    res = inter / uni

    if method == 'dist':
        res = 1 - res

    return res
```
In :
```
_data  = cate_data.loc[['B', 'D']]
_res   = jaccard_from_scratch(_data.values[0], _data.values[1])
_other = distance.jaccard(_data.values[0], _data.values[1])

print("INPUT:")
display(_data)

print("OUTPUT (scratch implementation): {}".format(round(_res, 3)))
print("OUTPUT (scipy distance): {}".format(round(_other, 3)))
```
`INPUT:`
Gender Accounting Computer Science
B 1 0 1
D 0 1 1
```
OUTPUT (scratch implementation): 0.667
OUTPUT (scipy distance): 0.667
```

## Cosine similarity

In this section, we move to another type of distance measure, one for text data (or vectors). As with the Jaccard index, we need to understand two important concepts before going ahead.

1. Similarity measures how similar two observations are to each other. The higher the similarity value, the more alike two observations are.
2. Distance measures how different two observations are from each other. The lower the value, the more alike two observations are.

Cosine similarity measures the similarity of two non-zero vectors by the cosine of the angle between them. The similarity coefficient ranges from -1 to 1 inclusive.

Formula:

$\cos(X, Y) = \frac{X \cdot Y}{\|X\| \, \|Y\|}$

For example:

INPUT:

```
   text
A  Today is the hottest day
B  Today is the coolest day
```

PROCESS:

```
   Today  is  the  hottest  coolest  day
A  1      1   1    1        0        1
B  1      1   1    0        1        1
```

OUTPUT:

```
Method      Result
Similarity  0.8
Distance    0.2
```

Questions:

1. When do we need to use Cosine similarity?
• We want to see the similarity between vectors.
• We are doing text analysis in which we need to convert words to vectors.
2. How do we deal with negative values when using cosine similarity?
• If we are interested in the similarity coefficient, we may need to set a threshold with a minimum value of 0 and a maximum value of 1.
• If we are interested in measuring the distance, we may need to use the absolute value of the similarity coefficient and then report its direction separately, based on the sign.
In :
```
def cosine_sim_from_scratch(X, Y, method='dist'):
    ''' This is the implementation of Cosine (similarity|distance) from scratch.

    Parameters
    ----------
    X: 1-D array of variable X with n dimensions
    Y: 1-D array of variable Y with n dimensions
    method: A method to calculate the coefficient ('dist'|'sim', by default: 'dist')

    Return
    ------
    A Cosine (similarity|distance) of two vectors.

    Notes
    -----
    1. We assume that X and Y contain only count vectors.
    2. We assume that the lengths of X and Y are the same.
    '''
    # Numerator: dot product of the two vectors
    a = np.dot(X, Y)

    # Denominator: product of the two vector norms
    ux = np.sqrt(np.sum(np.square(X)))
    uy = np.sqrt(np.sum(np.square(Y)))

    # Similarity coefficient
    res = a / (ux * uy)

    if method == 'dist':
        return 1 - res

    return res
```
In :
```
_data  = text_data.loc[[1, 2]]
_counted_data = pd.DataFrame(
    [[1, 1, 1, 1, 0, 1], [1, 1, 1, 0, 1, 1]],
    columns=['Today', 'is', 'the', 'hottest', 'coolest', 'day']
)

_res   = cosine_sim_from_scratch(_counted_data.values[0], _counted_data.values[1])
_other = distance.cosine(_counted_data.values[0], _counted_data.values[1])

print("INPUT (text):")
display(_data)
print("INPUT (counted vectors):")
display(_counted_data)

print("OUTPUT (scratch implementation): {}".format(round(_res, 3)))
print("OUTPUT (scipy distance): {}".format(round(_other, 3)))
```
`INPUT (text):`
text
1 Today is the hottest day
2 Today is the coolest day
`INPUT (counted vectors):`
Today is the hottest coolest day
0 1 1 1 1 0 1
1 1 1 1 0 1 1
```
OUTPUT (scratch implementation): 0.2
OUTPUT (scipy distance): 0.2
```

The knowledge covered in this notebook comes from the following sources.

1. Book:
• Making Sense of Data II: A Practical Guide to Data Visualization, Advanced Data Mining Methods, and Applications. Glenn J. Myatt and Wayne P. Johnson
2.  Online:

I have uploaded the notebook on Github: distance_measures.ipynb