# Content

Nowadays, we can vectorize almost anything: numbers, words, sentences, and so on. Why do we vectorize? We want to make things countable and measurable so that we can apply sophisticated statistical algorithms to them. That is why vectorization is one of the most important steps in feature engineering. However, this notebook does not discuss those concepts further. Instead, it shows how we calculate `the distance between two vectors/observations` using different distance measures.

There are many distance measures for different types of variables, such as numeric, categorical, and mixed. In this notebook we focus on the following five popular measures.

- Euclidean
- Manhattan
- Mahalanobis
- Jaccard
- Cosine

**Why do we need to study distance measures in statistics?**

- Different distance measures serve different purposes and produce different results for the same model. As a result, it is extremely important to choose the right distance measure for a given statistical model, especially in clustering.
- As mentioned above, everything can now be vectorized, so measuring the distance between vectors is how we explore relationships between variables.
- Beyond exploring relationships between variables, distances can also be used to observe the behaviour of a particular point/unit.

# Libraries

For reading data: **Pandas**

For scratch implementation: **Numpy, Scikit-learn**

For existing implementation: **SciPy**

For visualization: **Matplotlib**

```
import pandas as pd
import numpy as np
from sklearn.preprocessing import normalize
from scipy.spatial import distance
import matplotlib.pyplot as plt
```

# Data preparation

In this section, data is prepared for different distance measures. We have 3 different types of data, including numeric data, categorical data and text data.

**Data summary:**

- **Numerical data:** Age is the age of a person, Songs Listened is the number of songs that person listens to daily, and Satisfaction is how satisfied that person is.
- **Categorical data:** Gender (0 = Female, 1 = Male), Accounting (0 = NO, 1 = YES), Computer Science (0 = NO, 1 = YES).
- **Text data:** each row is a sentence with a different meaning.

```
num_data = pd.DataFrame(
    [[20, 10, 0.7], [30, 8, 0.4], [25, 11, 0.4], [10, 5, 0.8], [40, 7, 0.6]],
    columns=['Age', 'Songs Listened', 'Satisfaction'], index=['A', 'B', 'C', 'D', 'E']
)
cate_data = pd.DataFrame(
    [[0, 1, 0], [1, 0, 1], [1, 0, 0], [0, 1, 1], [1, 1, 1], [0, 0, 0]],
    columns=['Gender', 'Accounting', 'Computer Science'], index=['A', 'B', 'C', 'D', 'E', 'F']
)
text_data = pd.DataFrame(
    [['The sky is blue'], ['Today is the hottest day'], ['Today is the coolest day'], ['The sun is bright']],
    columns=['text']
)
print("Numerical data:")
display(num_data)
print("Categorical data:")
display(cate_data)
print("Text data:")
display(text_data)
```

# Distance measures

## Euclidean

As mentioned, Euclidean distance is one of the most popular distance measures. It calculates the distance between two points as the square root of the sum of squared differences between them in one or more dimensions. We will cover two variants: the normal Euclidean distance (with the square root) and the squared Euclidean distance.

**Formula:** $d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$ (the squared Euclidean distance simply omits the square root).

**For example:**

```
INPUT
```

Age | Satisfaction |
---|---|
30 | 0.4 |
40 | 0.6 |

```
OUTPUT
```

Method | L2-Norm | Result |
---|---|---|
Euclidean | NO | 10.00 |
Euclidean | Row | 0.001 |
Euclidean | Column | 0.34 |
Squared Euclidean | NO | 100.04 |
Squared Euclidean | Row | 2.77e-06 |
Squared Euclidean | Column | 0.12 |
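
As a quick check on the first row of the table, the un-normalized Euclidean distance follows directly from the input values:

$d = \sqrt{(30 - 40)^2 + (0.4 - 0.6)^2} = \sqrt{100.04} \approx 10.00,$

and the corresponding squared Euclidean distance is simply $100.04$.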

**Questions:**

**When do we need to use Euclidean?**

- We have continuous numerical variables.
- We want to reflect absolute distances.
- We do not need to remove redundancy that reflects correlation between variables.

**What do we need to do before applying Euclidean?**

- Normally, we need to scale (normalize) the variables before applying Euclidean distance, because we cannot meaningfully measure the distance between variables on different scales, such as Age and Satisfaction.
- In some cases, we do **not** need to normalize the variables because they are already on the same scale. It is therefore also useful to make a scatter plot of the variables first.

**What are the differences between Euclidean and Squared Euclidean?**

- In some clustering problems, Squared Euclidean is faster than Euclidean.

**Should we normalize data by samples (rows) or features (columns)?**

- Since we want the variables to share the same scale, we should normalize data by features.
- Another reason is that sample normalization can change the shape of the point cloud, which affects our measures significantly, whereas feature normalization only changes the scale of each feature without affecting the shape of the points. This is illustrated by the short sketch below.
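
The following is a minimal sketch (not part of the original notebook) that contrasts feature-wise and sample-wise L2 normalization on `num_data`; it assumes the imports and data defined in the cells above.

```
# Minimal sketch: feature (column) vs sample (row) L2 normalization on num_data.
# Assumes pandas, matplotlib, sklearn's normalize and num_data from the cells above.
col_norm = normalize(num_data, norm='l2', axis=0)  # scale each feature to unit length
row_norm = normalize(num_data, norm='l2', axis=1)  # scale each sample to unit length

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, (title, data) in zip(axes, [('Original', num_data.values),
                                    ('Feature (column) norm', col_norm),
                                    ('Sample (row) norm', row_norm)]):
    ax.scatter(data[:, 0], data[:, 2])  # plot Age against Satisfaction
    ax.set_title(title)
    ax.set_xlabel('Age')
    ax.set_ylabel('Satisfaction')
plt.tight_layout()
plt.show()
```

This echoes the point made above: column normalization only rescales each axis, while row normalization projects every observation onto the unit sphere and can change the points' relative positions.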

```
def euclidean_from_scratch(X, Y, norm='l2', squared=False):
    ''' This is the implementation of Euclidean distance from scratch
    Parameters
    ----------
    X: 1-D array of variable X with n dimensions
    Y: 1-D array of variable Y with n dimensions
    norm: Normalization technique (by default, 'l2')
    squared: Return the squared Euclidean distance (by default, False)
    Return
    ------
    The (squared) Euclidean distance between two observations.
    Notes
    -----
    1. We assume that X and Y contain only continuous numerical elements.
    2. We assume that there is no missing value in X and Y.
    3. We assume that the number of elements in X and Y is the same.
    '''
    # Initialize
    res = 0.0
    # Normalization (feature-wise, across the two observations)
    if norm is not None:
        temp = normalize([X, Y], norm=norm, axis=0)
        X = temp[0]
        Y = temp[1]
    # Subtraction
    res = np.subtract(X, Y)
    # Square each element
    res = np.square(res)
    # Summation
    res = np.sum(res)
    # Square root
    if not squared:
        res = np.sqrt(res)
    return res
```

```
_data = num_data.loc[['B', 'C']]
_norm_data = normalize(_data, axis=0)
_res = euclidean_from_scratch(_data.values[0], _data.values[1])
_other = distance.euclidean(_norm_data[0], _norm_data[1])
print("INPUT:")
display(_data)
print("OUTPUT (scratch implementation): {}".format(round(_res, 3)))
print("OUTPUT (scipy distance): {}".format(round(_other, 3)))
```

## Manhattan

Manhattan distance calculates the distance between two points as the sum of the absolute differences between their coordinates in one or more dimensions.

**Formula:** $d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n} |x_i - y_i|$

**For example:**

```
INPUT
```

Age | Satisfaction |
---|---|
30 | 0.4 |
40 | 0.6 |

```
OUTPUT
```

Method | L2-Norm | Result |
---|---|---|
Manhattan | NO | 10.2 |
Manhattan | Row | 0.002 |
Manhattan | Column | 0.48 |
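
As a quick check on the first row, the un-normalized Manhattan distance can be worked out directly from the input:

$d = |30 - 40| + |0.4 - 0.6| = 10 + 0.2 = 10.2.$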

**Questions:**

**When do we need to use Manhattan?**

- We cannot measure the distance between two points directly, as Euclidean does; instead, we measure it along paths between the two points.
- We are using other statistical techniques, such as the L1 norm, that rely on the sum of absolute differences between observations (points).
- We want to measure differences in frequency distributions between vectors; this typically applies to text mining.

**Euclidean distance or Manhattan distance?**

- In some classification algorithms, such as k-nearest neighbours (k-NN), we can use either Euclidean or Manhattan distance.
- For high-dimensional data, Manhattan distance may work better than Euclidean distance.

```
def manhattan_from_scratch(X, Y, norm='l2'):
    ''' This is the implementation of Manhattan distance from scratch
    Parameters
    ----------
    X: 1-D array of variable X with n dimensions
    Y: 1-D array of variable Y with n dimensions
    norm: Normalization technique (by default, 'l2')
    Return
    ------
    The Manhattan distance between two observations
    Notes
    -----
    1. We assume that X and Y contain only continuous numerical elements.
    2. We assume that there is no missing value in X and Y.
    3. We assume that the number of elements in X and Y is the same.
    '''
    # Initialize
    res = 0.0
    # Normalization (feature-wise, across the two observations)
    if norm is not None:
        temp = normalize([X, Y], norm=norm, axis=0)
        X = temp[0]
        Y = temp[1]
    # Subtraction
    res = np.subtract(X, Y)
    # Absolute values
    res = np.abs(res)
    # Summation
    res = np.sum(res)
    return res
```

```
_data = num_data.loc[['B', 'C']]
_norm_data = normalize(_data, axis=0)
_res = manhattan_from_scratch(_data.values[0], _data.values[1])
_other = distance.cityblock(_norm_data[0], _norm_data[1])
print("INPUT:")
display(num_data.loc[['B', 'C']])
print("OUTPUT (scratch implementation): {}".format(round(_res, 3)))
print("OUTPUT (scipy distance): {}".format(round(_other, 3)))
```

## Mahalanobis distance

Mahalanobis distance measures the distance between two points while taking the covariance of the data into account. For example, the Mahalanobis distance of a point A from a distribution expresses how many standard deviations A lies away from the mean of that distribution, along the correlated directions of the data. Furthermore, Mahalanobis distance is known as an effective multivariate distance metric.

**Formula:** $d(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x} - \mathbf{y})^{T} S^{-1} (\mathbf{x} - \mathbf{y})}$, where $S$ is the covariance matrix of the data.

**For example:** see the code cells at the end of this section, which compute the Mahalanobis distance between observations B and C of the numerical data.

**Questions:**

**When do we need to use Mahalanobis distance?**

- We want to calculate the distance between two observations with continuous numerical elements.
- We want to take the correlation between variables into account before calculating distances between observations.
- We want to detect anomalies in a high-dimensional dataset.
- We want to apply classification to highly imbalanced datasets.
- We want to calculate the distance between two observations **without** depending on the scale of the variables.

**Euclidean distance or Mahalanobis distance?**

- Euclidean distance is a purely distance-based method that only looks at the gap between two points without considering other factors. As a result, it can run into issues with high-dimensional data and with variables on different scales. **In contrast, Mahalanobis distance considers the covariance between variables before calculating the distance, so it does not suffer from these issues to the same extent.**
- Since Euclidean distance does not consider the covariance between variables, it produces the same result whether variable X is correlated with variable Y or not. **Meanwhile, Mahalanobis distance produces different results depending on the correlation between two or more variables (see the short sketch below).**
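
To make the last point concrete, here is a minimal sketch (not from the original notebook) that computes both distances on two synthetic datasets, one with independent features and one with strongly correlated features. The Euclidean distance between the two test points is identical in both cases, while the Mahalanobis distance changes.

```
# Minimal sketch: Euclidean ignores correlation, Mahalanobis does not.
# Uses synthetic data only; assumes numpy and scipy.spatial.distance imported above.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
uncorrelated = np.column_stack([x, rng.normal(size=500)])                # independent features
correlated = np.column_stack([x, x + rng.normal(scale=0.1, size=500)])   # strongly correlated features

a, b = np.array([0.0, 1.0]), np.array([1.0, 0.0])
for name, data in [('uncorrelated', uncorrelated), ('correlated', correlated)]:
    inv_cov = np.linalg.inv(np.cov(data, rowvar=False))
    print(name,
          '| Euclidean:', round(distance.euclidean(a, b), 3),
          '| Mahalanobis:', round(distance.mahalanobis(a, b, inv_cov), 3))
```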

```
def mahalanobis_from_scratch(X, Y, cov):
    ''' This is the implementation of Mahalanobis distance from scratch
    Parameters
    ----------
    X: 1-D array of variable X with n dimensions
    Y: 1-D array of variable Y with n dimensions
    cov: A covariance matrix of the variables.
    Return
    ------
    The Mahalanobis distance between two observations
    Notes
    -----
    1. We assume that X and Y contain only continuous numerical elements.
    2. We assume that there is no missing value in X and Y.
    3. We assume that the number of elements in X and Y is the same.
    4. We assume that the covariance matrix has n dimensions,
       matching those of X and Y.
    '''
    # Initialize
    res = 0.0
    X = np.array(X)
    Y = np.array(Y)
    # Subtraction
    sub = np.subtract(X, Y)
    # Inverse covariance matrix
    inv_matrix = np.linalg.inv(cov)
    # Calculation: sqrt((x - y)^T S^-1 (x - y))
    res = np.sqrt(np.dot(np.dot(sub, inv_matrix), sub.T))
    return res
```

```
_data = num_data.loc[['B', 'C']]
_cov = np.cov(num_data, rowvar=False)
_inv = np.linalg.inv(_cov)
_res = mahalanobis_from_scratch(_data.values[0], _data.values[1], _cov)
_other = distance.mahalanobis(_data.values[0], _data.values[1], _inv)
print("INPUT:")
display(_data)
print("OUTPUT (scratch implementation): {}".format(round(_res, 3)))
print("OUTPUT (scipy distance): {}".format(round(_other, 3)))
```

## Jaccard index

In this section, we move to another type of distance measure which is for categorical variables. Before going ahead with the Jaccard index, we need to understand two important concepts.

**Similarity** measures how similar two observations are to each other: the higher the similarity value, the more alike the two observations are. **Distance (dissimilarity)** measures how different two observations are from each other: the lower the distance value, the more alike the two observations are.

**Jaccard index** is one of the most popular binary distance measures (for variables whose values are either zero or one). Its similarity coefficient is defined as the size of the intersection divided by the size of the union of the sample sets. Conversely, the Jaccard distance is obtained by subtracting the similarity coefficient from 1.

**Formula:** $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$, and the Jaccard distance is $d_J(A, B) = 1 - J(A, B)$.

**For example:**

```
INPUT:
```

Gender | Accounting | Computer Science |
---|---|---|
1 | 0 | 1 |
0 | 1 | 1 |

```
OUTPUT:
```

Method | Result |
---|---|
Similarity | 0.33 |
Distance | 0.67 |
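
Reading off the input rows: only Computer Science equals 1 in both rows (intersection of size 1), while three attributes equal 1 in at least one row (union of size 3), so

$J = \frac{1}{3} \approx 0.33, \qquad d_J = 1 - \frac{1}{3} \approx 0.67.$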

**Questions:**

**When do we need to use the Jaccard index?**

- We want to explore the similarity between two observations that contain categorical variables.
- We want to quantify the accuracy of object detection using convolutional neural networks (CNNs).

**How can we deal with special situations?**

- **For variables that have more than 2 categories**, we can encode the categories as a string or list of binary digits. For example, if a question has 4 different options, we can encode them as [0,0], [0,1], [1,0] and [1,1] (see the short sketch at the end of this section).
- **For a dataset that contains missing values**, we can fill or replace the missing values with a number that does not otherwise occur in the dataset.

```
def jaccard_from_scratch(X, Y, method='dist'):
    ''' This is the implementation of the Jaccard index (distance|similarity) from scratch
    Parameters
    ----------
    X: 1-D array of variable X with n dimensions
    Y: 1-D array of variable Y with n dimensions
    method: A method to calculate the coefficient (dist|sim, by default: dist)
    Return
    ------
    The Jaccard (distance|similarity) between two observations
    Notes
    -----
    1. We assume that there is no missing value in X and Y.
    2. We assume that X and Y contain only binary categorical elements.
    '''
    # Initialize
    res = 0.0
    # Intersection
    inter = np.bitwise_and(X, Y)
    # Union
    uni = np.bitwise_or(X, Y)
    # Count non-zero values
    inter = len(np.nonzero(inter)[0])
    uni = len(np.nonzero(uni)[0])
    # Similarity coefficient
    res = inter / uni
    if method == 'dist':
        res = 1 - res
    return res
```

```
_data = cate_data.loc[['B', 'D']]
_res = jaccard_from_scratch(_data.values[0], _data.values[1])
_other = distance.jaccard(_data.values[0], _data.values[1])
print("INPUT:")
display(_data)
print("OUTPUT (scratch implementation): {}".format(round(_res, 3)))
print("OUTPUT (scipy distance): {}".format(round(_other, 3)))
```
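
To illustrate the encoding trick mentioned in the questions above, here is a minimal, hypothetical sketch: the option names and answer lists are invented, and it reuses `jaccard_from_scratch` and the imports defined above.

```
# Hypothetical sketch: encode a 4-option categorical answer as two binary digits,
# then compare two respondents with the Jaccard similarity defined above.
option_to_bits = {'opt1': [0, 0], 'opt2': [0, 1], 'opt3': [1, 0], 'opt4': [1, 1]}

answers_p = ['opt2', 'opt4']  # respondent P's answers to two questions
answers_q = ['opt2', 'opt3']  # respondent Q's answers to the same questions

# Flatten each answer list into one binary vector per respondent.
vec_p = np.array([bit for a in answers_p for bit in option_to_bits[a]])
vec_q = np.array([bit for a in answers_q for bit in option_to_bits[a]])

print("Jaccard similarity:", round(jaccard_from_scratch(vec_p, vec_q, method='sim'), 3))
```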

## Cosine similarity

In this section, we move to another type of distance measure, intended for text data (or vectors). As with the Jaccard index, we need to understand two important concepts before going ahead.

**Similarity** measures how similar two observations are to each other: the higher the similarity value, the more alike the two observations are. **Distance** measures how different two observations are from each other: the lower the distance value, the more alike the two observations are.

**Cosine similarity** measures the similarity of two non-zero vectors by the cosine of the angle between them. The similarity coefficient ranges from -1 to 1 inclusive.

**Formula:** $\text{similarity}(\mathbf{x}, \mathbf{y}) = \cos\theta = \frac{\mathbf{x} \cdot \mathbf{y}}{\lVert \mathbf{x} \rVert \, \lVert \mathbf{y} \rVert}$, and the cosine distance is $1 - \text{similarity}$.

**For example:**

```
INPUT:
```

. | text |
---|---|
A | Today is the hottest day |
B | Today is the coolest day |

```
PROCESS:
```

. | Today | is | the | hottest | coolest | day |
---|---|---|---|---|---|---|
A | 1 | 1 | 1 | 1 | 0 | 1 |
B | 1 | 1 | 1 | 0 | 1 | 1 |

```
OUTPUT:
```

Method | Result |
---|---|
Similarity | 0.8 |
Distance | 0.2 |
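
From the PROCESS table, the dot product of the two count vectors is $4$ and each vector has length $\sqrt{5}$, so

$\text{similarity} = \frac{4}{\sqrt{5}\,\sqrt{5}} = 0.8, \qquad \text{distance} = 1 - 0.8 = 0.2.$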

**Questions:**

**When do we need to use Cosine similarity?**

- We want to see how similar two vectors are.
- We are doing text analysis, in which we need to convert words into vectors.

**How do we deal with negative values when using Cosine similarity?**

- If we are interested in the similarity coefficient itself, we can threshold it so that its minimum value is 0 and its maximum value is 1.
- If we are interested in measuring distance, we can use the absolute value of the similarity coefficient and report its direction separately, based on the sign.

```
def cosine_sim_from_scratch(X, Y, method='dist'):
    ''' This is the implementation of Cosine (similarity|distance) from scratch.
    Parameters
    ----------
    X: 1-D array of variable X with n dimensions
    Y: 1-D array of variable Y with n dimensions
    method: A method to calculate the coefficient (dist|sim, by default: dist)
    Return
    ------
    The Cosine (similarity|distance) of two vectors.
    Notes
    -----
    1. We assume that X and Y are count vectors.
    2. We assume that the lengths of X and Y are the same.
    '''
    # Initialize
    res = 0.0
    # Numerator: dot product of the two vectors
    a = np.dot(X, Y)
    # Denominator: product of the vector magnitudes
    ux = np.sqrt(np.sum(np.square(X)))
    uy = np.sqrt(np.sum(np.square(Y)))
    # Similarity coefficient
    res = a / (ux * uy)
    if method == 'dist':
        return 1 - res
    return res
```

```
_data = text_data.loc[[1, 2]]
_counted_data = pd.DataFrame(
[[1, 1, 1, 1, 0, 1], [1, 1, 1, 0, 1, 1]], columns=['Today', 'is', 'the', 'hottest', 'coolest', 'day']
)
_res = cosine_sim_from_scratch(_counted_data.values[0], _counted_data.values[1])
_other = distance.cosine(_counted_data.values[0], _counted_data.values[1])
print("INPUT (text):")
display(_data)
print("INPUT (counted vectors):")
display(_counted_data)
print("OUTPUT (scratch implementation): {}".format(round(_res, 3)))
print("OUTPUT (scipy distance): {}".format(round(_other, 3)))
```
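
As a side note, the counted vectors above could also be built automatically rather than by hand. The following minimal sketch uses scikit-learn's `CountVectorizer` (which is not used in the original notebook) and assumes `text_data` and the imports defined above.

```
# Minimal sketch: build the count vectors for sentences 1 and 2 automatically.
# Assumes text_data, pandas and scipy.spatial.distance are defined/imported above.
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(text_data.loc[[1, 2], 'text']).toarray()

# get_feature_names_out() exists in recent scikit-learn; older releases use get_feature_names().
print(pd.DataFrame(counts, columns=vectorizer.get_feature_names_out(), index=[1, 2]))
print("Cosine distance:", round(distance.cosine(counts[0], counts[1]), 3))
```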

# References

The knowledge covered in this notebook comes from the following sources.

- Book:
  - *Making Sense of Data II – A Practical Guide to Data Visualization, Advanced Data Mining Methods, and Applications* – **Glenn J. Myatt and Wayne P. Johnson**

- Online:
- Wikipedia – Euclidean distance
- Quora – What is the difference between Manhattan and Euclidean distance measures?
- DeepAI – Jaccard Index
- Wikipedia – Jaccard index
- Wikipedia – Mahalanobis distance
- Machine Learning Plus – Mahalanobis distance (Understanding the math with examples [python])
- Wikipedia – Cosine similarity

I have uploaded the notebook to GitHub: distance_measures.ipynb