Given a string s and a set of strings S, find the string k in S that most closely resembles s. Classic approaches to this problem include Jaro distance, cosine similarity, TF-IDF, Levenshtein distance, fuzzy matching, etc., each of which produces a score between two strings that approximates their similarity.

In this blog post, I’d like to explore one of my favorites, the Sørensen–Dice coefficient, which achieves even better similarity comparisons in some cases, such as FQDNs (fully qualified domain names). Incidentally, this method is widely used in the industry, including at New Relic!

Sørensen–Dice Coefficient

The Sørensen–Dice coefficient is “a statistic used to gauge the similarity of two samples” (Wikipedia). A sample can be any sequence, such as a string or a list of integers. Mathematically, for two samples X and Y:

    SDC = 2 × |X ∩ Y| / (|X| + |Y|)

The cardinality of the intersection of X and Y is multiplied by 2 (because elements common to both samples are counted only once by the intersection), and the result is divided by the sum of the individual cardinalities of X and Y.
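To make the formula concrete, here is a minimal sketch of the textbook variant that samples character bigrams (set semantics, so repeated bigrams count once; `dice_bigrams` is just an illustrative name):

```python
def dice_bigrams(x: str, y: str) -> float:
    """Classic Sørensen–Dice coefficient over character bigrams."""
    bx = {x[i:i + 2] for i in range(len(x) - 1)}  # bigram set of x
    by = {y[i:i + 2] for i in range(len(y) - 1)}  # bigram set of y
    if not bx and not by:
        return 1.0  # define two empty/single-char samples as identical
    return 2 * len(bx & by) / (len(bx) + len(by))

# "night" -> {ni, ig, gh, ht}, "nacht" -> {na, ac, ch, ht};
# one shared bigram: 2 * 1 / (4 + 4) = 0.25
print(dice_bigrams("night", "nacht"))
```

The implementation below generalizes this idea from fixed bigrams to configurable chunk and shift sizes.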

I wrote a quick implementation below for reference:

def _string_shift_chunks(
        string: str,
        chunk_size: int = 4,
        shift_size: int = 1,
    ) -> list[str]:
    """Return sub-samples of a string.

    Args:
        string: sample string to sub-sample
        chunk_size: size of each sub-sample (chunks near the end are shorter)
        shift_size: step between the starts of consecutive sub-samples
    """
    return [
        string[i:i+chunk_size]
        for i in range(0, len(string), shift_size)
    ]

def sdc_coefficient(
        x: str,
        y: str,
        chunk_size: int = 4,
        shift_size: int = 1,
    ) -> float:
    """Sørensen–Dice coefficient implementation.

    Args:
        x: sample string
        y: sample string
        chunk_size: size of sub-sampling of `x` and `y`
        shift_size: size of sub-sampling shift of `x` and `y`

    Returns:
        float: similarity score between [0, 1]
    """
    _x = _string_shift_chunks(
        string=x,
        chunk_size=chunk_size,
        shift_size=shift_size,
    )
    _y = _string_shift_chunks(
        string=y,
        chunk_size=chunk_size,
        shift_size=shift_size,
    )
    return (
        2 * len(set(_x) & set(_y)) /
        (len(_x) + len(_y))
    )
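As a quick sanity check of the implementation above (the definitions are repeated here so the snippet runs on its own), with the default `chunk_size=4` and `shift_size=1`:

```python
def _string_shift_chunks(string: str, chunk_size: int = 4, shift_size: int = 1) -> list[str]:
    # Slide a window of `chunk_size` across the string, stepping by `shift_size`;
    # chunks near the end of the string are shorter than `chunk_size`.
    return [string[i:i + chunk_size] for i in range(0, len(string), shift_size)]

def sdc_coefficient(x: str, y: str, chunk_size: int = 4, shift_size: int = 1) -> float:
    _x = _string_shift_chunks(x, chunk_size, shift_size)
    _y = _string_shift_chunks(y, chunk_size, shift_size)
    return 2 * len(set(_x) & set(_y)) / (len(_x) + len(_y))

# "apple"  -> ["appl", "pple", "ple", "le", "e"]        (5 chunks)
# "apples" -> ["appl", "pple", "ples", "les", "es", "s"] (6 chunks)
# shared: {"appl", "pple"} -> 2 * 2 / (5 + 6) ≈ 0.364
print(sdc_coefficient("apple", "apples"))
# identical strings share all 5 chunks -> 2 * 5 / (5 + 5) = 1.0
print(sdc_coefficient("apple", "apple"))
```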

Tests

I ran some test cases of SDC (using my implementation above) against the built-in difflib module’s SequenceMatcher class and a third-party package named python-string-similarity (imported as strsimpy).

from difflib import SequenceMatcher
from strsimpy.jaro_winkler import JaroWinkler
from strsimpy.levenshtein import Levenshtein
from strsimpy.metric_lcs import MetricLCS
from strsimpy.ngram import NGram

test_cases = [
    ("prod.us-east01.verizon.com", "prod.us-east02.verizon.com"),
    ("apple", "apple"),
    ("apple", "apples"),
    ("apple", "orange"),
]


if __name__ == "__main__":
    for (x, y) in test_cases:
        print("---")
        print(sdc_coefficient(x, y))  # implementation from above
        print(JaroWinkler().distance(x, y))
        print(Levenshtein().distance(x, y))
        print(MetricLCS().distance(x, y))
        print(NGram(2).distance(x, y))
        print(NGram(4).distance(x, y))
        # note: SequenceMatcher's first positional argument is `isjunk`;
        # SequenceMatcher(x, y) would silently compare y against an empty string
        print(SequenceMatcher(None, x, y).ratio())
| Algorithm | x | y | Score |
| --- | --- | --- | --- |
| SDC | prod.us-east01.verizon.com | prod.us-east02.verizon.com | 0.846 |
| Jaro-Winkler | prod.us-east01.verizon.com | prod.us-east02.verizon.com | 0.012 |
| Levenshtein | prod.us-east01.verizon.com | prod.us-east02.verizon.com | 1.0 |
| Metric Longest Common Subsequence | prod.us-east01.verizon.com | prod.us-east02.verizon.com | 0.038 |
| N-Gram (2) | prod.us-east01.verizon.com | prod.us-east02.verizon.com | 0.038 |
| N-Gram (4) | prod.us-east01.verizon.com | prod.us-east02.verizon.com | 0.038 |
| SequenceMatcher | prod.us-east01.verizon.com | prod.us-east02.verizon.com | 0.962 |
| SDC | apple | apple | 1.0 |
| Jaro-Winkler | apple | apple | 0.0 |
| Levenshtein | apple | apple | 0.0 |
| Metric Longest Common Subsequence | apple | apple | 0.0 |
| N-Gram (2) | apple | apple | 0.0 |
| N-Gram (4) | apple | apple | 0.0 |
| SequenceMatcher | apple | apple | 1.0 |
| SDC | apple | apples | 0.36 |
| Jaro-Winkler | apple | apples | 0.02 |
| Levenshtein | apple | apples | 1.0 |
| Metric Longest Common Subsequence | apple | apples | 0.16 |
| N-Gram (2) | apple | apples | 0.16 |
| N-Gram (4) | apple | apples | 0.16 |
| SequenceMatcher | apple | apples | 0.909 |
| SDC | apple | orange | 0.18 |
| Jaro-Winkler | apple | orange | 0.42 |
| Levenshtein | apple | orange | 5.0 |
| Metric Longest Common Subsequence | apple | orange | 0.66 |
| N-Gram (2) | apple | orange | 0.91 |
| N-Gram (4) | apple | orange | 0.95 |
| SequenceMatcher | apple | orange | 0.364 |

As you can see, the Sørensen–Dice coefficient yields much better accuracy, especially where the sample strings are composed of many shared symbols, as in the FQDN case. This doesn’t mean the other algorithms are fundamentally weaker: even without SDC, you can work around their limitations by, for example, pre-processing the sample strings (e.g. splitting on symbols and comparing each element separately).
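As a rough sketch of that pre-processing idea (the separator pattern and the per-token use of difflib here are my own assumptions, not a prescribed recipe), one could split on separator symbols and average the per-token similarity:

```python
import re
from difflib import SequenceMatcher
from itertools import zip_longest

def tokenized_similarity(x: str, y: str) -> float:
    """Average per-token similarity after splitting on common separators."""
    tx = [t for t in re.split(r"[.\-_]", x) if t]  # e.g. "a.b-c" -> ["a", "b", "c"]
    ty = [t for t in re.split(r"[.\-_]", y) if t]
    if not tx or not ty:
        return 0.0
    # Compare tokens position by position; a token missing on either
    # side (zip_longest fills with None) scores 0 for that position.
    scores = [
        SequenceMatcher(None, a, b).ratio() if a and b else 0.0
        for a, b in zip_longest(tx, ty)
    ]
    return sum(scores) / len(scores)

# four identical tokens plus east01/east02 at 10/12 -> (4 + 5/6) / 5 ≈ 0.967
print(tokenized_similarity("prod.us-east01.verizon.com",
                           "prod.us-east02.verizon.com"))
```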