FreeBSD Manual Pages


Text::Similarity::Overlaps(3)  User Contributed Perl Documentation  Text::Similarity::Overlaps(3)

       Text::Similarity::Overlaps - Score the Overlaps Found Between Two
       Strings Based on	Literal	Text Matching

		 # you can measure the similarity between two input strings :
		 # if you don't	normalize the score, you get the number	of matching words
		 # if you normalize, you get a score between 0 and 1 that is scaled based
		 # on the length of the	strings

		 use Text::Similarity::Overlaps;

		 # my %options = ('normalize' => 1, 'verbose' => 1);
		 my %options = ('normalize' => 0, 'verbose' => 0);
		 my $mod = Text::Similarity::Overlaps->new (\%options);
		 defined $mod or die "Construction of Text::Similarity::Overlaps failed";

		 my $string1 = 'this is	a test for getSimilarityStrings';
		 my $string2 = 'we can test getSimilarityStrings this day';

		 my $score = $mod->getSimilarityStrings	($string1, $string2);
		 print "There are $score overlapping words between string1 and string2\n";

		 # you may want	to measure the similarity of a document
		 # sentence by sentence	- the below example shows you
		 # how - suppose you have two text files file1.txt and
		 # file2.txt - each having the same number of sentences.
		 # convert those files into multiple files, where each
		 # sentence from each file is in a separate file.

		 # if file1.txt and file2.txt each have three sentences,
		 # filex.txt will become sentx1.txt sentx2.txt sentx3.txt

		 # this	just calls getSimilarity( ) for	each pair of sentences

		 use Text::Similarity::Overlaps;
		 my %options = ('normalize' => 1, 'verbose' =>1,
					       'stoplist' => 'stoplist.txt');
		 my $mod = Text::Similarity::Overlaps->new (\%options);
		 defined $mod or die "Construction of Text::Similarity::Overlaps failed";

		 my @file1s = qw / sent11.txt sent12.txt sent13.txt /;
		 my @file2s = qw / sent21.txt sent22.txt sent23.txt /;

		 # assumes that	both documents have same number	of sentences

		 for (my $i = 0; $i <= $#file1s; $i++) {
			 my $score = $mod->getSimilarity ($file1s[$i], $file2s[$i]);
			 print "The similarity of $file1s[$i] and $file2s[$i] is : $score\n";
		 }

		 # you can also compare the two complete files directly
		 my $score = $mod->getSimilarity ('file1.txt', 'file2.txt');
		 print "The similarity of the two files is : $score\n";

       This module computes the similarity of two text documents or strings by
       searching for literal word token overlaps. This just means that it
       determines how many word tokens are identical between the two
       strings. Various scores are computed based on the number of shared
       words and the length of the strings.

       At present, similarity is measured between entire files or strings;
       finer granularity is not supported. Files are treated as one long
       input string, so overlaps can be found across sentence and paragraph
       boundaries.

       Files are first converted into strings by getSimilarity(), then
       getSimilarityStrings() does the actual processing. It counts the number
       of overlaps (matching words) and finds the longest common subsequences
       (phrases) between the two strings. However, only the lesk measure uses
       the information about phrasal matches; the other measures ignore it.
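
       The file-to-string step described above can be pictured with the
       minimal sketch below. The helper name file_to_string is hypothetical
       and is not part of Text::Similarity; the module's own conversion may
       differ in detail. The sketch only shows how line and paragraph breaks
       disappear once a file is read as one long string.

```perl
use strict;
use warnings;

# Hypothetical helper (not the module's code): read a whole file and
# flatten it into a single whitespace-normalized string.
sub file_to_string {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    local $/;                  # slurp mode: read the entire file at once
    my $text = <$fh>;
    close $fh;
    $text = '' unless defined $text;
    $text =~ s/\s+/ /g;        # newlines and runs of blanks become one space
    $text =~ s/^ | $//g;       # trim a leading or trailing space
    return $text;
}
```

       Two strings produced this way could then be passed to
       getSimilarityStrings() in the same way as the literal strings in the
       first example.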

       Text::Similarity::Overlaps returns the F-measure, which is a normalized
       value between 0 and 1. Normalization can	be turned off by specifying
       --no-normalize, in which	case the raw_score is returned,	which is
       simply the number of words that overlap between the two strings.

       In addition, Overlaps returns the cosine, E-measure, precision, recall,
       Dice coefficient, and Lesk scores in the	allScores table.

	    precision =	raw_score / length_file_2
	    recall = raw_score / length_file_1
	    F-measure =	2 * precision *	recall / (precision + recall)
	    Dice = 2 * raw_score / (sum	of string lengths)
	    E-measure =	1 - F-measure
	    Cosine = raw_score / sqrt (length_file_1 * length_file_2)
	    Lesk = sum of the squares of the length of phrasal matches
		(normalized by dividing	by the product of the string lengths)
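
       As a rough illustration of the formulas above, the sketch below
       derives precision, recall, F-measure, Dice and E-measure from a raw
       overlap count and the two word counts. The function overlap_scores
       and its hash keys are hypothetical names chosen for this example,
       not the module's API; the cosine and lesk scores are left out.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Text::Similarity: applies the
# formulas listed above to a raw overlap count and two word counts.
sub overlap_scores {
    my ($raw_score, $len1, $len2) = @_;
    return {} unless $len1 > 0 && $len2 > 0;   # avoid division by zero

    my $precision = $raw_score / $len2;
    my $recall    = $raw_score / $len1;
    my $sum       = $precision + $recall;
    my $f_measure = $sum > 0 ? 2 * $precision * $recall / $sum : 0;

    return {
        raw_score => $raw_score,
        precision => $precision,
        recall    => $recall,
        f_measure => $f_measure,
        dice      => 2 * $raw_score / ($len1 + $len2),
        e_measure => 1 - $f_measure,
    };
}

# Three shared words between a 6-word and a 7-word string.
my $scores = overlap_scores(3, 6, 7);
printf "%-9s %.4f\n", $_, $scores->{$_} for sort keys %$scores;
```

       With these inputs the F-measure and the Dice coefficient both come
       out as 6/13, which matches the algebra: 2 * precision * recall /
       (precision + recall) simplifies to 2 * raw_score / (length_file_1 +
       length_file_2).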

       The raw_score is	simply the number of matching words between the	two
       inputs, without respect to their	order. Note that matches are literal
       and must	be exact, so 'cat' and 'cats' do not match. This corresponds
       to the idea of the intersection between the two strings.
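
       One way to picture the raw_score is as a count of shared word
       tokens, as in the sketch below. The helper raw_overlap is
       hypothetical, and it assumes that each token can be matched at most
       once; the module's own matching procedure may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: count the word tokens two strings share,
# ignoring order. Assumes each token is matched at most once.
sub raw_overlap {
    my ($s1, $s2) = @_;
    my %remaining;
    $remaining{$_}++ for split ' ', lc $s1;

    my $count = 0;
    for my $word (split ' ', lc $s2) {
        if ($remaining{$word}) {
            $remaining{$word}--;   # use up one occurrence
            $count++;
        }
    }
    return $count;
}

print raw_overlap('jim bit the dog', 'the dog bit jim'), "\n";   # prints 4
print raw_overlap('the cat sat', 'the cats sat'), "\n";          # prints 2
```

       The second call shows the exact-match rule: 'cat' and 'cats' are
       different tokens, so only 'the' and 'sat' are counted.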

       None of these measures (except lesk) considers the order of the
       matches. In those cases 'jim bit the dog' and 'the dog bit jim' are
       considered exact matches and attain the highest possible matching
       score: a raw_score of 4 if not normalized, and 1 if the score is
       normalized (in which case the F-measure is returned).

       lesk is different in that it looks for phrasal matches and scores them
       more highly. The	lesk measure is	based on the measure of	the same name
       included	in WordNet::Similarity.	There it is used to match the
       overlapping text	found in the gloss entries of the lexical database /
       dictionary WordNet in order to measure semantic relatedness.

       The lesk measure finds all the overlaps and squares the length of each.
       It then sums those squares and, if the score is normalized, divides the
       sum by the product of the lengths of the strings. For example:

	       the dog bit jim
	       jim bit the dog

       The raw_score is 4, since the two strings are made up of identical
       words (just in different orders). The F-measure is equal to 1, as are
       the Cosine and the Dice Coefficient. In fact, the F-measure and the
       Dice Coefficient are always equal, but both are presented since
       some users may be more familiar with one formulation than the other.

       The raw_lesk score is 2^2 + 1 + 1 = 6, because 'the dog' is a phrasal
       match between the strings and thus contributes its length squared to
       the raw_lesk score, while 'bit' and 'jim' each contribute 1. The
       normalized lesk score is 0.375, which is 6 / (4 * 4), or the raw_lesk
       score divided by the product of the lengths of the two strings. Note
       that the normalized lesk score has a maximum value of 1: if each of
       the two strings has n words, then their maximum overlap is n words,
       which receives a raw_lesk score of n^2, which is then divided by the
       product of the string lengths, which is again n^2.
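
       The worked example can be reproduced with the sketch below. It is a
       greedy approximation written for this page, not the module's actual
       implementation: it repeatedly takes the longest shared phrase, adds
       its squared length, and blocks those words from matching again. The
       function name raw_lesk is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sketch of a lesk-style raw score (not the module's code).
sub raw_lesk {
    my ($s1, $s2) = @_;
    my @w1 = split ' ', lc $s1;
    my @w2 = split ' ', lc $s2;
    my $score = 0;

    while (1) {
        # Find the longest run of identical, still unused words.
        my ($best, $at1, $at2) = (0, 0, 0);
        for my $i (0 .. $#w1) {
            for my $j (0 .. $#w2) {
                my $k = 0;
                $k++ while $i + $k <= $#w1
                        && $j + $k <= $#w2
                        && defined($w1[$i + $k])
                        && defined($w2[$j + $k])
                        && $w1[$i + $k] eq $w2[$j + $k];
                ($best, $at1, $at2) = ($k, $i, $j) if $k > $best;
            }
        }
        last unless $best;
        $score += $best ** 2;
        # Mark the matched words as used so they cannot match again.
        @w1[$at1 .. $at1 + $best - 1] = (undef) x $best;
        @w2[$at2 .. $at2 + $best - 1] = (undef) x $best;
    }
    return $score;
}

my $raw = raw_lesk('the dog bit jim', 'jim bit the dog');
printf "raw_lesk = %d, normalized = %.3f\n", $raw, $raw / (4 * 4);
# prints "raw_lesk = 6, normalized = 0.375"
```

       For two identical n-word strings the first pass matches the whole
       string, so the raw score is n^2 and the normalized score is 1.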

       There is	some cleaning of text performed	automatically, which includes
       removal of most punctuation except embedded apostrophes and
       underscores. All	text is	made lower case. This occurs both for file and
       string input.
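
       The effect of this cleaning can be approximated as in the sketch
       below. The helper clean_text and its regular expressions are
       assumptions made for illustration; the exact set of characters the
       module removes may differ, and the sketch handles ASCII text only.

```perl
use strict;
use warnings;

# Hypothetical approximation of the cleaning step (ASCII only).
sub clean_text {
    my ($text) = @_;
    $text = lc $text;
    # Drop apostrophes that are not embedded between word characters.
    $text =~ s/(?<![a-z0-9_])'|'(?![a-z0-9_])/ /g;
    # Replace other punctuation, keeping apostrophes and underscores.
    $text =~ s/[^a-z0-9_'\s]/ /g;
    $text =~ s/\s+/ /g;        # collapse runs of whitespace
    $text =~ s/^ | $//g;       # trim a leading or trailing space
    return $text;
}

print clean_text(q{The dog's owner, Jim, said: "Don't touch my_dog!"}), "\n";
# prints "the dog's owner jim said don't touch my_dog"
```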


	Ted Pedersen, University of Minnesota, Duluth
	tpederse at

	Jason Michelizzi

       Last modified by	: $Id:,v 1.6 2015/10/08 13:22:13 tpederse
       Exp $

       Copyright (C) 2004-2008 by Jason	Michelizzi and Ted Pedersen

       This program is free software; you can redistribute it and/or modify it
       under the terms of the GNU General Public License as published by the
       Free Software Foundation; either	version	2 of the License, or (at your
       option) any later version.

       This program is distributed in the hope that it will be useful, but
       WITHOUT ANY WARRANTY; without even the implied warranty of
       MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
       General Public License for more details.

       You should have received	a copy of the GNU General Public License along
       with this program; if not, write	to the Free Software Foundation, Inc.,
       59 Temple Place,	Suite 330, Boston, MA  02111-1307  USA

perl v5.32.1			  2015-10-08	 Text::Similarity::Overlaps(3)

