Notes on statistical distance measures:
- Euclidean distance: The straight-line distance between two points. Computed as the square root of the sum of the squared differences between corresponding coordinates.
- Cosine distance: One minus the cosine similarity, where cosine similarity is the dot product of two vectors divided by the product of their lengths.
- Manhattan distance: The distance between two points measured along axes at right angles. Computed as the sum of the absolute differences between corresponding coordinates.
- Hamming distance: The number of positions at which two equal-length strings differ. For binary strings, this is the number of differing bits.
- Levenshtein distance: The smallest number of single-character insertions, deletions, and substitutions required to change one string into another. Analogous edit distances are defined for trees.
- Jaro-Winkler: A measure of similarity between two strings. The Jaro measure is a weighted sum of the proportion of matched characters in each string and the proportion of matched characters that are not transposed. Winkler increased this measure for strings with matching initial characters, then rescaled it by a piecewise function whose intervals and weights depend on the type of string (first name, last name, street, etc.).
- Chebyshev distance: The maximum absolute difference between corresponding coordinates. Also called 'chessboard distance' because it equals the number of moves a king needs to travel between two squares.
- Mahalanobis distance: Differs from Euclidean distance in that it takes into account the covariance among the variables when calculating distances, which makes it scale-invariant.
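- Minkowski distance: A generalization of the integer-order L(n) distances to real-valued exponents: Manhattan = L1, Euclidean = L2. As the exponent grows, the distance approaches the Chebyshev distance. In high-dimensional data the choice of exponent strongly affects how well near and far points are separated, and lower exponents are often reported to preserve more contrast than very high ones.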
- Tanimoto distance: Closely related to the Jaccard measures. The Jaccard coefficient measures similarity between sample sets and is defined as the size of the intersection divided by the size of the union of the sample sets. The Jaccard distance measures dissimilarity between sample sets and is the complement of the Jaccard coefficient (1 - Jaccard coefficient). The Tanimoto distance is defined only for values of non-zero similarity.
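
Several of the measures above are short enough to sketch directly. The Python below is a minimal illustration using only the standard library, not a reference implementation. The function names are chosen here for illustration, vectors are assumed to be equal-length sequences of numbers, and cosine distance assumes neither vector is all zeros. Jaro-Winkler and Mahalanobis are left out because they need more setup (string-type weights and a covariance matrix, respectively).

```python
import math


def minkowski(a, b, p):
    """L_p distance: p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a, b):
    """Largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """One minus cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def hamming(s, t):
    """Number of positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs equal-length inputs")
    return sum(c1 != c2 for c1, c2 in zip(s, t))


def levenshtein(s, t):
    """Minimum single-character edits, computed row by row."""
    prev = list(range(len(t) + 1))  # distances from "" to each prefix of t
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def jaccard_distance(a, b):
    """One minus |intersection| / |union|; two empty sets count as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)
```

For the points (0, 0) and (3, 4), these give a Euclidean distance of 5.0, a Manhattan distance of 7, and a Chebyshev distance of 4. levenshtein("kitten", "sitting") returns 3, and jaccard_distance({1, 2, 3}, {2, 3, 4}) returns 0.5.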