String metric
In mathematics and computer science, a string metric (also known as a string similarity metric or string distance function) is a metric that measures the distance ("inverse similarity") between two text strings, for use in approximate string matching or comparison and in fuzzy string searching. A necessary requirement for a string metric (in contrast, for example, to string matching) is that it satisfies the triangle inequality. For example, the strings "Sam" and "Samuel" can be considered to be close.[1] A string metric returns a number that gives an algorithm-specific indication of distance.
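The triangle inequality is one of the standard axioms that any metric satisfies. For reference, writing d(x, y) for the distance between strings x and y (a notation assumed here for illustration), the axioms can be stated as:

```latex
\begin{align*}
d(x, y) &\ge 0 && \text{(non-negativity)} \\
d(x, y) &= 0 \iff x = y && \text{(identity of indiscernibles)} \\
d(x, y) &= d(y, x) && \text{(symmetry)} \\
d(x, z) &\le d(x, y) + d(y, z) && \text{(triangle inequality)}
\end{align*}
```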
The most widely known string metric is a rudimentary one called the Levenshtein distance (also known as edit distance).[2] It operates on two input strings and returns the minimum number of single-character insertions, deletions and substitutions needed to transform one input string into the other. Beyond simple string metrics such as Levenshtein distance, the field has expanded to include phonetic, token, grammatical and character-based methods of statistical comparison.
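As an illustration of how the Levenshtein distance can be computed, the following is a minimal Python sketch of the usual dynamic-programming approach, keeping only two rows of the table at a time. The function name levenshtein and its signature are chosen here for illustration and are not taken from any particular library.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # previous[j] is the distance between the prefix of a processed so far
    # (initially the empty prefix) and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        # Turning the first i characters of a into "" takes i deletions.
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,          # delete ca
                current[j - 1] + 1,       # insert cb
                previous[j - 1] + cost,   # substitute ca with cb, or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("Sam", "Samuel"))      # 3
```

This sketch runs in time proportional to the product of the two string lengths and uses memory proportional to the length of the second string.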
String metrics are used heavily in information integration and are currently used in areas including fraud detection, fingerprint analysis, plagiarism detection, ontology merging, DNA analysis, RNA analysis, image analysis, evidence-based machine learning, database data deduplication, data mining, incremental search, data integration, and semantic knowledge integration.
List of string metrics
 Levenshtein distance, or its generalization edit distance
 Damerau–Levenshtein distance
 Sørensen–Dice coefficient
 Block distance or L1 distance or City block distance
 Hamming distance
 Jaro–Winkler distance
 Simple matching coefficient (SMC)
 Jaccard similarity or Jaccard coefficient or Tanimoto coefficient
 Tversky index
 Overlap coefficient
 Variational distance
 Hellinger distance or Bhattacharyya distance
 Information radius (Jensen–Shannon divergence)
 Skew divergence
 Confusion probability
 Tau metric, an approximation of the Kullback–Leibler divergence
 Fellegi and Sunter metric (SFS)
 Maximal matches
 Grammar-based distance
 TF-IDF distance metric[3]
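Several entries in the list above, such as the Jaccard and Sørensen–Dice coefficients, compare sets of tokens rather than counting edits. The following Python sketch shows one common way to apply them to strings, assuming each string is tokenized into its set of character bigrams. That tokenization, the helper names, and the convention of returning 1.0 for two strings without bigrams are choices made here for illustration. Both functions return a similarity between 0 and 1, not a distance.

```python
def bigrams(s: str) -> set:
    """Set of adjacent character pairs in s (an assumed tokenization)."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def jaccard(a: str, b: str) -> float:
    """Jaccard coefficient: shared bigrams divided by all distinct bigrams."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0  # convention chosen here for strings with no bigrams
    return len(x & y) / len(x | y)


def dice(a: str, b: str) -> float:
    """Sørensen–Dice coefficient: twice the shared bigrams over the total count."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0  # same convention as above
    return 2 * len(x & y) / (len(x) + len(y))


# "night" -> {ni, ig, gh, ht}; "nacht" -> {na, ac, ch, ht}; only "ht" is shared.
print(jaccard("night", "nacht"))  # 1 shared bigram out of 7 distinct ones
print(dice("night", "nacht"))     # 0.25
```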
Selected string measures examples
Name  Example
Hamming distance  "karolin" and "kathrin" have a distance of 3.
Levenshtein distance and Damerau–Levenshtein distance  "kitten" and "sitting" have a distance of 3.
Jaro–Winkler distance  JaroWinklerDist("MARTHA","MARHTA") =
Most frequent k characters  MostFreqKeySimilarity('research', 'seeking', 2) = 2
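The Hamming distance example in the table can be checked with a few lines of Python. This is a minimal sketch in which the function name hamming is chosen for illustration. It counts the positions at which two equal-length strings differ and rejects strings of different lengths, for which the Hamming distance is not defined.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    # Compare the strings position by position and count the mismatches.
    return sum(ca != cb for ca, cb in zip(a, b))


# k-a-r-o-l-i-n versus k-a-t-h-r-i-n differ at the third, fourth and fifth positions.
print(hamming("karolin", "kathrin"))  # 3
```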
References
 Lu, Jiaheng; et al. (2013). "String similarity measures and joins with synonyms". Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data: 373–384. doi:10.1145/2463676.2465313. ISBN 9781450320375.
 Navarro, Gonzalo (2001). "A guided tour to approximate string matching". ACM Computing Surveys. 33 (1): 31–88. doi:10.1145/375360.375365.
 Cohen, William; Ravikumar, Pradeep; Fienberg, Stephen (2003-08-01). "A Comparison of String Distance Metrics for Name-Matching Tasks": 73–78.
External links
 A fairly complete overview: https://web.archive.org/web/20070304092115/http://www.dcs.shef.ac.uk/~sam/stringmetrics.html#qgram
 Carnegie Mellon University open source library
 StringMetric project, a Scala library of string metrics and phonetic algorithms
 Natural project, a JavaScript natural language processing library that includes implementations of popular string metrics