String metric

"String distance" redirects here. For the distance between strings and the fingerboard in musical instruments, see Action (music).

In mathematics and computer science, a string metric (also known as a string similarity metric or string distance function) is a metric that measures the distance ("inverse similarity") between two text strings, for use in approximate string matching, string comparison, and fuzzy string searching. A string metric must satisfy the triangle inequality, which distinguishes it from string matching in general. For example, the strings "Sam" and "Samuel" can be considered close. A string metric returns a number whose interpretation as a distance is specific to the algorithm.
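The triangle inequality requirement can be spot-checked on a sample of strings. The sketch below is illustrative only: the helper names `triangle_violations` and `squared_hamming` are made up for this example, and Hamming distance on equal-length strings stands in for a generic string metric. A check like this can expose a violation but cannot prove that a function is a metric.

```python
from itertools import permutations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def squared_hamming(a: str, b: str) -> int:
    """Hypothetical counter-example: squaring breaks the triangle inequality."""
    return hamming(a, b) ** 2


def triangle_violations(dist, strings):
    """Return all triples (x, y, z) with dist(x, z) > dist(x, y) + dist(y, z)."""
    return [
        (x, y, z)
        for x, y, z in permutations(strings, 3)
        if dist(x, z) > dist(x, y) + dist(y, z)
    ]


names = ["karolin", "kathrin", "kerstin", "marolin"]
print(triangle_violations(hamming, names))  # []
print(triangle_violations(squared_hamming, ["aa", "ab", "bb"]))
# [('aa', 'ab', 'bb'), ('bb', 'ab', 'aa')]
```

The squared variant fails because going from "aa" to "bb" directly costs 4, while going through "ab" costs 1 + 1 = 2.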

The most widely known string metric is a rudimentary one called the Levenshtein distance (also known as edit distance). It operates on two input strings and returns the minimum number of single-character insertions, deletions, and substitutions needed to transform one string into the other. Simple string metrics such as Levenshtein distance have since been extended to include phonetic, token, grammatical and character-based methods of statistical comparison.
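As a minimal sketch, Levenshtein distance can be computed with the usual dynamic-programming scheme (the Wagner–Fischer algorithm), counting single-character insertions, deletions, and substitutions. The function name `levenshtein` is chosen for this example and is not taken from any particular library.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # prev[j] is the distance between the prefix of a processed so far
    # (initially empty) and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning the first i characters of a into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb, or keep a match
            ))
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("Sam", "Samuel"))      # 3
```

Keeping only two rows of the table keeps memory proportional to the length of the second string; the running time is proportional to the product of the two lengths.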

String metrics are used heavily in information integration, and in areas including fraud detection, fingerprint analysis, plagiarism detection, ontology merging, DNA analysis, RNA analysis, image analysis, evidence-based machine learning, database data deduplication, data mining, Web interfaces (e.g. Ajax-style suggestions as you type), data integration, and semantic knowledge integration.

List of string metrics

Sørensen–Dice coefficient
Block distance or L1 distance or City block distance
Jaro–Winkler distance
Simple matching coefficient (SMC)
Jaccard similarity or Jaccard coefficient or Tanimoto coefficient
Most frequent k characters
Tversky index
Overlap coefficient
Variational distance
Hellinger distance or Bhattacharyya distance
Information radius (Jensen–Shannon divergence)
Skew divergence
Confusion probability
Tau metric, an approximation of the Kullback–Leibler divergence
Fellegi and Sunter metric (SFS)
Maximal matches
TF-IDF distance metric^[1]
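Several of the measures listed above are set-based and can be applied to strings once each string is broken into tokens. The sketch below assumes character bigrams as the tokens, which is one common choice among several, and computes the Jaccard, Sørensen–Dice, and overlap coefficients. The function names are illustrative. All three return a similarity in [0, 1] rather than a distance.

```python
def bigrams(s: str) -> set:
    """Set of adjacent character pairs in s (assumed tokenization)."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def jaccard(a: str, b: str) -> float:
    """|X ∩ Y| / |X ∪ Y| over bigram sets."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(x & y) / len(x | y)


def dice(a: str, b: str) -> float:
    """2 |X ∩ Y| / (|X| + |Y|) over bigram sets."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0
    return 2 * len(x & y) / (len(x) + len(y))


def overlap(a: str, b: str) -> float:
    """|X ∩ Y| / min(|X|, |Y|) over bigram sets."""
    x, y = bigrams(a), bigrams(b)
    if not x or not y:
        return 0.0  # convention chosen here: undefined case mapped to 0
    return len(x & y) / min(len(x), len(y))


# "night" -> {ni, ig, gh, ht}, "nacht" -> {na, ac, ch, ht}; one shared bigram.
print(dice("night", "nacht"))     # 0.25
print(overlap("night", "nacht"))  # 0.25
print(jaccard("night", "nacht"))  # 1/7, about 0.143
```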

Selected string measures examples

Name	Example
Hamming distance	The distance between "karolin" and "kathrin" is 3.
Levenshtein distance and Damerau–Levenshtein distance	"kitten" and "sitting" have a distance of 3: kitten → sitten (substitution of "s" for "k"), sitten → sittin (substitution of "i" for "e"), sittin → sitting (insertion of "g" at the end).
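Jaro–Winkler distance	For "MARTHA" and "MARHTA", the Jaro similarity is $d_j = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m-t}{m}\right) = \frac{1}{3}\left(\frac{6}{6} + \frac{6}{6} + \frac{6-\frac{2}{2}}{6}\right) = 0.944$, where $m$ is the number of matching characters and $t$ is half the number of matched characters that appear in a different order (here "T" and "H" are swapped, so $t = 2/2 = 1$). With the usual Winkler prefix adjustment (scaling factor 0.1, common prefix "MAR" of length 3), the Jaro–Winkler similarity is $0.944 + 3 \cdot 0.1 \cdot (1 - 0.944) = 0.961$.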
Most frequent k characters	MostFreqKeySimilarity('research', 'seeking', 2) = 2
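
The Jaro–Winkler row above can be reproduced with a short sketch. It assumes the commonly used definitions: a match window of max(|s1|, |s2|) // 2 - 1, a prefix scaling factor of 0.1, and a prefix length capped at 4. The function names are illustrative and not taken from any particular library. Both functions return a similarity; the corresponding distance is 1 minus the similarity.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    # Characters match if equal and no further apart than this window.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used1 = [False] * len(s1)
    used2 = [False] * len(s2)
    m = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # t is half the number of matched characters that are out of order.
    matched1 = [c for c, u in zip(s1, used1) if u]
    matched2 = [c for c, u in zip(s2, used2) if u]
    t = sum(a != b for a, b in zip(matched1, matched2)) / 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for a shared prefix (assumed defaults p=0.1, cap 4)."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return sim + prefix * p * (1 - sim)


print(round(jaro("MARTHA", "MARHTA"), 3))          # 0.944
print(round(jaro_winkler("MARTHA", "MARHTA"), 3))  # 0.961
```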

References

1. Cohen, William; Ravikumar, Pradeep; Fienberg, Stephen (2003-08-01). "A Comparison of String Distance Metrics for Name-Matching Tasks": 73–78.

External links

A fairly complete overview (archived index at the Wayback Machine): http://web.archive.org/web/20070304092115/http://www.dcs.shef.ac.uk:80/~sam/stringmetrics.html#qgram
Carnegie Mellon University open source library
StringMetric project, a Scala library of string metrics and phonetic algorithms
Natural project, a JavaScript natural language processing library which includes implementations of popular string metrics

This article is issued from Wikipedia - version of the 11/14/2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.