Design an algorithm that would sufficiently group all anagrams.

As an example, given a list of words {"a", "ab", "ba", "apple", "paple"}, the answer should be a nested list of {{"a"}, {"ab", "ba"}, {"apple", "paple"}} in any unsorted, insignificant order.

Equality of two sorted strings

An anagram, by definition, is a string formed by the rearrangement of another verbatim composition. This roughly infers that:

  • The atomic composition of a string must be identical to its anagram.

  • The length of a string must be equal to its' anagram’s length.

  • The longest substring must appear in its' anagram.

Hence we can say that a sorted string must be identical to its sorted anagram. For example, to determine whether “dog” is an anagram of “god”, we can sort both strings (“dgo” and “dgo”, respectively), assert the two with a simple string comparison, and match both strings. Performance-wise, this approach would be better than a stateful comparison between two strings since the comparison is only as fast as the pre-processing sorting algorithm itself.

Merkle tree of sorted strings

Although Solution #1 would yield accurate answers, it suffers from performance defects when dealing with lengthy strings. We can calculate hash or block ciphers of sorted strings.. but for more interesting projection, we can also rely on the Merkle tree of n-length substrings to transform long inputs to more manageable strings.

Let’s begin by defining a substring block b of length n.

Next, group s into smaller substrings of equal length n.

And finally, recursively hash the leaves (substrings) with a hash function H until the root node is calculated (“Merkle tree root”). For example, using the “dog” and “god” example from Solution #1, we’ll consider each character to be the leaves and construct two Merkle trees. After two strings are sorted and Merkle trees constructed, we can see the identity by comparing the Merkle root nodes. For the purpose of this example, I have relied on the MD5 digestion algorithm as my hash function.

Awesome! We can now accurately identify anagrams with better performance since we can always parallelize Merkle tree constructions.

Solution #3: positional weight of characters

Even with Merkle trees, we can still take it a step further.

Returning to the Merkle tree idea and comparing the roots to detect anagrams, we have achieved the concept of translating a string into a different entity. However, can a string be further expressed in something different other than just another string? A string is a series of characters; and a character can be represented in fixed bits; and bits can be numeric. Meaning, we can make another deduction.

A string is a series of characters; a character can be expressed as a number. Therefore a string is also a series of numbers, of which can be arithmetically summed to create a “fingerprint”.

This means anagrams must have equal sums. Or, as an “automata”:

But as you may have already noticed, this design suffers from one deficiency: two different words can still be summed to the same number. For example, “dog” and “foe” would yield the same sum:

This is due to the fact that a sum does not discriminate against the added elements (like how 1 + 1 = 2 just as 2 + 0 = 2). As we are now aware of the “sum collision” issue, we can fix it by introducing a negative feedback loop to counter the weight of each individual composition. Any counter function is fine, but I have decided to go with a simple square function.

Now “dog” and “god” are correctly identified, while “foe” is not.

We have now completed three different ways to detect anagrams :)