Skip site navigation (1)Skip section navigation (2)

FreeBSD Manual Pages

  
 
  

home | help
Approx(3)	      User Contributed Perl Documentation	     Approx(3)

NAME
       String::Approx -	Perl extension for approximate matching	(fuzzy
       matching)

SYNOPSIS
	 use String::Approx 'amatch';

	 print if amatch("foobar");

	 my @matches = amatch("xyzzy", @inputs);

	 my @catches = amatch("plugh", ['2'], @inputs);

DESCRIPTION
       String::Approx lets you match and substitute strings approximately.
       With this you can emulate errors: typing	errorrs, speling errors,
       closely related vocabularies (colour color), genetic mutations (GAG
       ACT), abbreviations (McScot, MacScot).

       NOTE: String::Approx suits the task of string matching, not string
       comparison, and it works	for strings, not for text.

       If you want to compare strings for similarity, you probably just	want
       the Levenshtein edit distance (explained	below),	the Text::Levenshtein
       and Text::LevenshteinXS modules in CPAN.	 See also Text::WagnerFischer
       and Text::PhraseDistance.  (There are functions for this	in
       String::Approx, e.g. adist(), but their results sometimes differ	from
       the bare	Levenshtein et al.)

       If you want to compare things like text or source code, consisting of
       words or	tokens and phrases and sentences, or expressions and
       statements, you should probably use some	other tool than
       String::Approx, like for	example	the standard UNIX diff(1) tool,	or the
       Algorithm::Diff module from CPAN.

       The measure of approximateness is the Levenshtein edit distance.	 It is
       the total number	of "edits": insertions,

	       word world

       deletions,

	       monkey money

       and substitutions

	       sun fun

       required	to transform a string to another string.  For example, to
       transform "lead"	into "gold", you need three edits:

	       lead gead goad gold

       The edit	distance of "lead" and "gold" is therefore three, or 75%.

       String::Approx uses the Levenshtein edit	distance as its	measure, but
       String::Approx is not well-suited for comparing strings of different
       length, in other	words, if you want a "fuzzy eq", see above.
       String::Approx is more like regular expressions or index(), it finds
       substrings that are close matches.>

MATCH
	       use String::Approx 'amatch';

	       $matched	    = amatch("pattern")
	       $matched	    = amatch("pattern",	[ modifiers ])

	       $any_matched = amatch("pattern",	@inputs)
	       $any_matched = amatch("pattern",	[ modifiers ], @inputs)

	       @match	    = amatch("pattern")
	       @match	    = amatch("pattern",	[ modifiers ])

	       @matches	    = amatch("pattern",	@inputs)
	       @matches	    = amatch("pattern",	[ modifiers ], @inputs)

       Match pattern approximately.  In	list context return the	matched
       @inputs.	 If no inputs are given, match against the $_.	In scalar
       context return true if any of the inputs	match, false if	none match.

       Notice that the pattern is a string.  Not a regular expression.	None
       of the regular expression notations (^, ., *, and so on)	work.  They
       are characters just like	the others.  Note-on-note: some	limited	form
       of "regular expressionism" is planned in	future:	for example character
       classes ([abc]) and any-chars (.).  But that feature will be turned on
       by a special modifier (just a guess: "r"), so there should be no
       backward	compatibility problem.

       Notice also that	matching is not	symmetric.  The	inputs are matched
       against the pattern, not	the other way round.  In other words: the
       pattern can be a	substring, a submatch, of an input element.  An	input
       element is always a superstring of the pattern.

   MODIFIERS
       With the	modifiers you can control the amount of	approximateness	and
       certain other control variables.	 The modifiers are one or more
       strings,	for example "i", within	a string optionally separated by
       whitespace.  The	modifiers are inside an	anonymous array: the [ ] in
       the syntax are not notational, they really do mean [ ], for example [
       "i", "2"	].  ["2	i"] would be identical.

       The implicit default approximateness is 10%, rounded up.	 In other
       words: every tenth character in the pattern may be an error, an edit.
       You can explicitly set the maximum approximateness by supplying a
       modifier	like

	       number
	       number%

       Examples: "3", "15%".

       Note that "0%" is not rounded up, it is equal to	0.

       Using a similar syntax you can separately control the maximum number of
       insertions, deletions, and substitutions	by prefixing the numbers with
       I, D, or	S, like	this:

	       Inumber
	       Inumber%
	       Dnumber
	       Dnumber%
	       Snumber
	       Snumber%

       Examples: "I2", "D20%", "S0".

       You can ignore case ("A"	becames	equal to "a" and vice versa) by	adding
       the "i" modifier.

       For example

	       [ "i 25%", "S0" ]

       means ignore case, allow	every fourth character to be "an edit",	but
       allow no	substitutions.	(See NOTES about disallowing substitutions or
       insertions.)

       NOTE: setting "I0 D0 S0"	is not equivalent to using index().  If	you
       want to use index(), use	index().

SUBSTITUTE
	       use String::Approx 'asubstitute';

	       @substituted = asubstitute("pattern", "replacement")
	       @substituted = asubstitute("pattern", "replacement", @inputs)
	       @substituted = asubstitute("pattern", "replacement", [ modifiers	])
	       @substituted = asubstitute("pattern", "replacement",
					  [ modifiers ], @inputs)

       Substitute approximate pattern with replacement and return as a list
       <copies>	of @inputs, the	substitutions having been made on the elements
       that did	match the pattern.  If no inputs are given, substitute in the
       $_.  The	replacement can	contain	magic strings $&, $`, $' that stand
       for the matched string, the string before it, and the string after it,
       respectively.  All the other arguments are as in	"amatch()", plus one
       additional modifier, "g"	which means substitute globally	(all the
       matches in an element and not just the first one, as is the default).

       See "BAD	NEWS" about the	unfortunate stinginess of "asubstitute()".

INDEX
	       use String::Approx 'aindex';

	       $index	= aindex("pattern")
	       @indices	= aindex("pattern", @inputs)
	       $index	= aindex("pattern", [ modifiers	])
	       @indices	= aindex("pattern", [ modifiers	], @inputs)

       Like "amatch()" but returns the index/indices at	which the pattern
       matches approximately.  In list context and if @inputs are used,
       returns a list of indices, one index for	each input element.  If
       there's no approximate match, "-1" is returned as the index.

       NOTE: if	there is character repetition (e.g. "aa") either in the
       pattern or in the text, the returned index might	start "too early".
       This is consistent with the goal	of the module of matching "as early as
       possible", just like regular expressions	(that there might be a "less
       approximate" match starting later is of somewhat	irrelevant).

       There's also backwards-scanning "arindex()".

SLICE
	       use String::Approx 'aslice';

	       ($index,	$size)	 = aslice("pattern")
	       ([$i0, $s0], ...) = aslice("pattern", @inputs)
	       ($index,	$size)	 = aslice("pattern", [ modifiers ])
	       ([$i0, $s0], ...) = aslice("pattern", [ modifiers ], @inputs)

       Like "aindex()" but returns also	the size (length) of the match.	 If
       the match fails,	returns	an empty list (when matching against $_) or an
       empty anonymous list corresponding to the particular input.

       NOTE: size of the match will very probably be something you did not
       expect (such as longer than the pattern,	or a negative number).	This
       may or may not be fixed in future releases. Also	the beginning of the
       match may vary from the expected	as with	aindex(), see above.

       If the modifier

	       "minimal_distance"

       is used,	the minimal possible edit distance is returned as the third
       element:

	       ($index,	$size, $distance) = aslice("pattern", [	modifiers ])
	       ([$i0, $s0, $d0], ...)	  = aslice("pattern", [	modifiers ], @inputs)

DISTANCE
	       use String::Approx 'adist';

	       $dist = adist("pattern",	$input);
	       @dist = adist("pattern",	@input);

       Return the edit distance	or distances between the pattern and the input
       or inputs.  Zero	edit distance means exact match.  (Remember that the
       match can 'float' in the	inputs,	the match is a substring match.)  If
       the pattern is longer than the input or inputs, the returned distance
       or distances is or are negative.

	       use String::Approx 'adistr';

	       $dist = adistr("pattern", $input);
	       @dist = adistr("pattern", @inputs);

       Return the relative edit	distance or distances between the pattern and
       the input or inputs.  Zero relative edit	distance means exact match,
       one means completely different.	(Remember that the match can 'float'
       in the inputs, the match	is a substring match.)	If the pattern is
       longer than the input or	inputs,	the returned distance or distances is
       or are negative.

       You can use adist() or adistr() to sort the inputs according to their
       approximateness:

	       my %d;
	       @d{@inputs} = map { abs } adistr("pattern", @inputs);
	       my @d = sort { $d{$a} <=> $d{$b}	} @inputs;

       Now @d contains the inputs, the most like "pattern" first.

CONTROLLING THE	CACHE
       "String::Approx"	maintains a LU (least-used) cache that holds the
       'matching engines' for each instance of a pattern+modifiers.  The cache
       is intended to help the case where you match a small set	of patterns
       against a large set of string.  However,	the more engines you cache the
       more you	eat memory.  If	you have a lot of different patterns or	if you
       have a lot of memory to burn, you may want to control the cache
       yourself.  For example, allowing	a larger cache consumes	more memory
       but probably runs a little bit faster since the cache fills (and	needs
       flushing) less often.

       The cache has two parameters: max and purge.  The first one is the
       maximum size of the cache and the second	one is the cache flushing
       ratio: when the number of cache entries exceeds max, max	times purge
       cache entries are flushed.  The default values are 1000 and 0.75,
       respectively, which means that when the 1001st entry would be cached,
       750 least used entries will be removed from the cache.  To access the
       parameters you can use the calls

	       $now_max	= String::Approx::cache_max();
	       String::Approx::cache_max($new_max);

	       $now_purge = String::Approx::cache_purge();
	       String::Approx::cache_purge($new_purge);

	       $limit =	String::Approx::cache_n_purge();

       To be honest, there are actually	two caches: the	first one is used far
       the patterns with no modifiers, the second one for the patterns with
       pattern modifiers.  Using the standard parameters you will therefore
       actually	cache up to 2000 entries.  The above calls control both	caches
       for the same price.

       To disable caching completely use

	       String::Approx::cache_disable();

       Note that this doesn't flush any	possibly existing cache	entries, to do
       that use

	       String::Approx::cache_flush_all();

NOTES
       Because matching	is by substrings, not by whole strings,	insertions and
       substitutions produce often very	similar	results: "abcde" matches
       "axbcde"	either by insertion or substitution of "x".

       The maximum edit	distance is also the maximum number of edits.  That
       is, the "I2" in

	       amatch("abcd", ["I2"])

       is useless because the maximum edit distance is (implicitly) 1.	You
       may have	meant to say

	       amatch("abcd", ["2D1S1"])

       or something like that.

       If you want to simulate transposes

	       feet fete

       you need	to allow at least edit distance	of two because in terms	of our
       edit primitives a transpose is first one	deletion and then one
       insertion.

   TEXT	POSITION
       The starting and	ending positions of matching, substituting, indexing,
       or slicing can be changed from the beginning and	end of the input(s) to
       some other positions by using either or both of the modifiers

	       "initial_position=24"
	       "final_position=42"

       or the both the modifiers

	       "initial_position=24"
	       "position_range=10"

       By setting the "position_range" to be zero you can limit	(anchor) the
       operation to happen only	once (if a match is possible) at the position.

VERSION
       Major release 3.

CHANGES	FROM VERSION 2
   GOOD	NEWS
       The version 3 is	2-3 times faster than version 2
       No pattern length limitation
	   The algorithm is independent	on the pattern length: its time
	   complexity is O(kn),	where k	is the number of edits and n the
	   length of the text (input).	The preprocessing of the pattern will
	   of course take some O(m) (m being the pattern length) time, but
	   "amatch()" and "asubstitute()" cache	the result of this
	   preprocessing so that it is done only once per pattern.

   BAD NEWS
       You do need a C compiler	to install the module
	   Perl's regular expressions are no more used;	instead	a faster and
	   more	scalable algorithm written in C	is used.

       "asubstitute()" is now always stingy
	   The string matched and substituted is now always stingy, as short
	   as possible.	 It used to be as long as possible.  This is an
	   unfortunate change stemming from switching the matching algorithm.
	   Example: with edit distance of two and substituting for "word" from
	   "cork" and "wool" previously	did match "cork" and "wool".  Now it
	   does	match "or" and "wo".  As little	as possible, or, in other
	   words, with as much approximateness,	as many	edits, as possible.
	   Because there is no need to match the "c" of	"cork",	it is not
	   matched.

       no more "aregex()" because regular expressions are no more used
       no more "compat1" for String::Approx version 1 compatibility

ACKNOWLEDGEMENTS
       The following people have provided valuable test	cases, documentation
       clarifications, and other feedback:

       Jared August, Arthur Bergman, Anirvan Chatterjee, Steve A. Chervitz,
       Aldo Calpini, David Curiel, Teun	van den	Dool, Alberto Fontaneda, Rob
       Fugina, Dmitrij Frishman, Lars Gregersen, Kevin Greiner,	B. Elijah
       Griffin,	Mike Hanafey, Mitch Helle, Ricky Houghton, 'idallen', Helmut
       Jarausch, Damian	Keefe, Ben Kennedy, Craig Kelley, Franz	Kirsch,	Dag
       Kristian, Mark Land, J. D. Laub,	John P.	Linderman, Tim Maher, Juha
       Muilu, Sergey Novoselov,	Andy Oram, Ji Y	Park, Eric Promislow, Nikolaus
       Rath, Stefan Ram, Slaven	Rezic, Dag Kristian Rognlien, Stewart Russell,
       Slaven Rezic, Chris Rosin, Pasha	Sadri, Ilya Sandler, Bob J.A.
       Schijvenaars, Ross Smith, Frank Tobin, Greg Ward, Rich Williams,	Rick
       Wise.

       The matching algorithm was developed by Udi Manber, Sun Wu, and Burra
       Gopal in	the Department of Computer Science, University of Arizona.

AUTHOR
       Jarkko Hietaniemi <jhi@iki.fi>

COPYRIGHT AND LICENSE
       Copyright 2001-2013 by Jarkko Hietaniemi

       This library is free software; you can redistribute it and/or modify
       under either the	terms of the Artistic License 2.0, or the GNU Library
       General Public License, Version 2.  See the files Artistic and LGPL for
       more details.

       Furthermore: no warranties or obligations of any	kind are given,	and
       the separate file COPYRIGHT must	be included intact in all copies and
       derived materials.

perl v5.32.1			  2017-04-16			     Approx(3)

NAME | SYNOPSIS | DESCRIPTION | MATCH | SUBSTITUTE | INDEX | SLICE | DISTANCE | CONTROLLING THE CACHE | NOTES | VERSION | CHANGES FROM VERSION 2 | ACKNOWLEDGEMENTS | AUTHOR | COPYRIGHT AND LICENSE

Want to link to this manual page? Use this URL:
<https://www.freebsd.org/cgi/man.cgi?query=String::Approx&sektion=3&manpath=FreeBSD+13.0-RELEASE+and+Ports>

home | help