Given a string, s, and a set of strings, S, find a string, k that closely resembles s. For this problem, some possible answers would be Jaro Distance, Cosine Similarity, TF-IDF, Levenshtein distance, fuzzy logic, etc., which produce a score between two strings to reliably approximate their atomic similarity.
In this blog post, I’d like to explore one of my favorites, Sorensen–Dice coefficient, to achieve even better similarity comparisons for some cases like FQDNs. Incidentally, this method is used widely in the industry - including New Relic!
Sorensen–Dice Coefficient
Sorensen–Dice Coefficient is “a statistic used to gauge the similarity of two samples.” (Wikipedia) A sample can be a sequence of types such as a string or an integer. It can also be described mathematically as below, where the cardinality of intersection (of X and Y) is multiplied by a constant 2 (2 because the elements common in both samples were reduced through intersection). The outcome in the numerator portion is then divided by the individual cardinality of X and Y.
I wrote a quick implementation below for reference:
|
|
Tests
I ran some test cases of SDC (using my implementation above) against a builtin string match module called difflib.SequenceMatcher and a 3rd party module named python-string-similarity.
|
|
Algorithm | x | y | Score |
---|---|---|---|
SDC | prod.us-east01.verizon.com | prod.us-east02.verizon.com | 0.846 |
Jaro-Winkler | prod.us-east01.verizon.com | prod.us-east02.verizon.com | 0.012 |
Levenshtein | prod.us-east01.verizon.com | prod.us-east02.verizon.com | 1.0 |
Metric Longest Common Subsequence | prod.us-east01.verizon.com | prod.us-east02.verizon.com | 0.038 |
N-Gram (2) | prod.us-east01.verizon.com | prod.us-east02.verizon.com | 0.038 |
N-Gram (4) | prod.us-east01.verizon.com | prod.us-east02.verizon.com | 0.038 |
SequenceMatcher | prod.us-east01.verizon.com | prod.us-east02.verizon.com | 0.0 |
SDC | apple | apple | 1.0 |
Jaro-Winkler | apple | apple | 0.0 |
Levenshtein | apple | apple | 0.0 |
Metric Longest Common Subsequence | apple | apple | 0.0 |
N-Gram (2) | apple | apple | 0.0 |
N-Gram (4) | apple | apple | 0.0 |
SequenceMatcher | apple | apple | 0.0 |
SDC | apple | apples | 0.36 |
Jaro-Winkler | apple | apples | 0.02 |
Levenshtein | apple | apples | 1.0 |
Metric Longest Common Subsequence | apple | apples | 0.16 |
N-Gram (2) | apple | apples | 0.16 |
N-Gram (4) | apple | apples | 0.16 |
SequenceMatcher | apple | apples | 0.0 |
SDC | apple | orange | 0.18 |
Jaro-Winkler | apple | orange | 0.42 |
Levenshtein | apple | orange | 5.0 |
Metric Longest Common Subsequence | apple | orange | 0.66 |
N-Gram (2) | apple | orange | 0.91 |
N-Gram (4) | apple | orange | 0.95 |
SequenceMatcher | apple | orange | 0.0 |
As you can see, using the Sorensen–Dice coefficient yields a much better accuracy especially in a case where the sample strings are composed of complex symbols. Now this doesn’t mean that using other algorithms are fundamentally weaker - as even without using SDC, you can still work around with limited impact by, for example, pre-processing the sample strings (e.g. split by symbols and comparing each element separately).