Distance Tutorial: String Distance

String Distance

String distance is very important for to be used in many areas of applications. In computational biology, some animal species are considered to come from the same family due to the matching of DNA. Since DNA sequence is basically a long sequence of 4 letters of {A,T,G,C}, matching DNA become matter of computing string distance. Checking word for wrong spelling and auto correction may also use string distance. Another application of string distance is to check plagiarism between two document. In etymology, string distance is applied to find the geographical origin, history and the root of words. For instance the word �Dato � datto - datu - datuk� are related to each other. Another example, name English name Christopher are related to Cristovao, Christoph, Christophe, Cristobal, Cristoforo, Kristoffer, Krystof.

There are several ways to measure the distance between two strings.
The simplest one is to use hamming distance to find the number of mismatch between two strings. However, the two strings must have the same length.

Most well-known string distance is Edit Distance or often called Levenshtein Distance or Levenstein distance (depending on the spelling) The algorithm to compute Edit distance is basically using dynamic programming (DP) to find the minimum number of 3 operations: Deletion , Insertion , and Substitution such that one string will become another string. Operation Substitution is equal to Deletion + Insertion.

Example :
String-1=GCACG
String-2=GTCGA
Distance between String-1 and String 2 is 3. (Delete C from string-1, substitute A with T and then insert A at the end, to make String-1 to become String-2)

Example :
String-1=GCGAT
String-2=GCAT
Distance between String-1 and String 2 is 1. (Delete second G from string 1 to make it into String-2)

Because string distance is very important, there are many versions of implementation in various programming languages (see Wikipedia ).

I propose the normalized similarity from String Distance to be computed as simple as

String Similarity = 1 - d/m

Where,
d = Edit Distance ( = Levenshtein distance)
m = length of the two strings = (Len(String_1) + Len(String_2))
The value of normalized string similarity is between 0 and 1.

Text Similarity Calculator

To implement string similarity formula above, I made simple Text Similarity Calculator below. You can copy and paste nearly similar two documents and test their normalized string similarity. The normalization can be done simply by division by the length of the strings. Note that I use similarity instead of distance, thus if the similarity of the two documents are close to 1, the probability of plagiarism is very high. To use this implementation, you need to set a threshold of similarity (e.g. >90% means very similar, >70% mean similar, <40% means not so similar and so on).

Note that for large text the computation of Edit Distance will take quite a long time because the alogorithm is O(m^2).

< Content | Previous | Next >

References:

furkanavcu, Levenshtein Edit Distance Algorithm
G. Cormode and S. Muthukrishnan (2007) The string edit distance matching problem with moves
P. Norvig, How to Write a Spelling Corrector
Sten Hjelmqvist (2012) Fast, memory efficient Levenshtein algorithm