Skip site navigation (1)Skip section navigation (2)

FreeBSD Manual Pages

  
 
  

home | help
RANK(1)		      User Contributed Perl Documentation	       RANK(1)

NAME
       rank.pl - Calculate Spearman's Correlation on two ranked	lists output
       by count.pl or statistic.pl

SYNOPSIS
       Program to calculate the	rank correlation coefficient between the
       rankings	generated by two different statistical measures	on the same
       bigram-frequency	(as output by count.pl).

DESCRIPTION
   1. Introduction
       This is a program that is meant to be used to compare two different
       statistical measures of association. Given the same set of n-grams
       ranked in two different ways by two different statistical measures,
       this program computes Spearman's	rank correlation coefficient between
       the two rankings.

       1.2. Typical Way	to Run rank.pl:

       Assume that test.cnt is a list of n-grams with their frequencies	as
       output by program count.pl. Assume that we wish to test the
       dis/similarity of the statistical measures 'dice' and 'x2' with respect
       to the n-grams contained	in test.cnt. To	do so, we must first rank the
       n-grams using these two statistical measures using program
       statistic.pl.

	perl statistic.pl dice test.dice test.cnt
	perl statistic.pl x2 test.x2 test.cnt

       Having obtained two different rankings of the n-grams in	test.cnt in
       files test.dice and test.x2, we can now compute the Spearman's rank
       correlation coefficient using these two rankings	like so:

	perl rank.pl test.dice test.x2.

       This will output	a floating point number	between	-1 and 1. A return of
       '1' indicates a perfect match in	rankings, '-1' a completely reversed
       ranking and '0' a pair of rankings that are completely unrelated	to
       each other. Numbers that	lie between these numbers indicate various
       degrees of relatedness /	un-relatedness / reverse-relatedness.

       1.3. Re-Ranking the Ngrams:

       Recall that program statistic.pl	ranks n-grams in such a	way that the
       fact that an ngram has a	rank 'r' implies that there are	'r-1' distinct
       scores greater than the score of	this ngram. Thus say if	'k' n-grams
       are tied	at a score with	rank 'a', then the next	highest	scoring
       n-grams is given	a rank 'a+1' instead of	'a+k+1'.

       For example, observe the	following file output by statistic.pl:

	11
	of<>text<>1 1.0000 2 2 2
	and<>a<>1 1.0000 1 1 1
	a<>third<>1 1.0000 1 1 1
	text<>second<>1	1.0000 1 1 1
	line<>of<>2 0.8000 2 3 2
	third<>line<>3 0.5000 1	1 3
	line<>and<>3 0.5000 1 3	1
	second<>line<>3	0.5000 1 1 3
	first<>line<>3 0.5000 1	1 3

       Observe that although 4 bigrams have a rank of 1, the next highest
       scoring bigram is not ranked 5, but instead 2.

       Spearman's rank correlation coefficient requires	the more conventional
       kind of ranking.	Thus the above file is first "re-ranked" to the
       following:

	11
	of<>text<>1 1.0000 2 2 2
	and<>a<>1 1.0000 1 1 1
	a<>third<>1 1.0000 1 1 1
	text<>second<>1	1.0000 1 1 1
	line<>of<>5 0.8000 2 3 2
	third<>line<>6 0.5000 1	1 3
	line<>and<>6 0.5000 1 3	1
	second<>line<>6	0.5000 1 1 3
	first<>line<>6 0.5000 1	1 3

       And then	these rankings are used	to compute the correlation
       coefficient.

       1.4. Dealing with Dissimilar Lists of N-grams:

       The two input files to rank.pl may not have the same set	of n-grams. In
       particular, if one or both of the files generated using statistic.pl
       has been	generated using	a frequency, rank or score cut-off, then it is
       likely that the two files will have different sets of n-grams. In such
       a situation, n-grams that do not	occur in both files are	removed, the
       n-grams that remain are re-ranked and then the correlation coefficient
       is computed.

       For example assume the following	two files output by statistic.pl using
       two fictitious statistical measures from	a fictitious file output by
       program count.pl.

       The first file:

	first<>bigram<>1 4.000 1 1
	second<>bigram<>2 3.000	2 2
	extra<>bigram1<>3 2.000	3 3
	third<>bigram<>4 1.000 4 4

       The second file:

	second<>bigram<>1 4.000	2 2
	extra<>bigram2<>2 3.000	4 4
	first<>bigram<>3 2.000 1 1
	third<>bigram<>4 1.000 3 3

       Observe that the	bigrams	extra<>bigram1<> in the	first file and
       extra<>bigram2<>	in the second file are not present in both files.
       After removing these bigrams and	re-ranking the rest, we	get the
       following files:

       The modified first file:

	first<>bigram<>1 4.000 1 1
	second<>bigram<>2 3.000	2 2
	third<>bigram<>3 1.000 4 4

       The modified second file:

	second<>bigram<>1 4.000	2 2
	first<>bigram<>2 2.000 1 1
	third<>bigram<>3 1.000 3 3

       Since each ngram	belongs	to both	files, the correlation coefficient may
       be computed on both files.

       1.5. Example Shell Script rank-script.sh:

       We provide c-shell script rank-script.sh	that takes a bigram count file
       and the names of	two libraries and then computes	the Spearman's rank
       correlation coefficient by making use successively of programs
       statistic.pl and	rank.pl.

       Run this	script like so:	rank-script.sh <lib1> <lib2> <file>

	   where <lib1>	is the first library, say dice
		 <lib2>	is the second library, say x2
		 <file>	is the file of ngrams and their	frequencies produced
			by program count.pl.

       For example, if test.cnt	contains bigrams and their frequencies,	we can
       run it like so to compute the rank correlation coefficient between dice
       and x2:

	   csh rank-script.sh dice x2 test.cnt.

       This runs the following commands	in succession:

	perl statistic.pl dice out1 test.cnt
	perl statistic.pl x2 out2 test.cnt
	perl rank.pl out1 out2

       The intermediate	files out1 and out2 are	later destroyed.

       Note that since no command line options are utilized in the running of
       program statistic.pl here, this script only works for bigrams and
       enforces	no cut-offs. However the script	is simple enough to be
       manually	modified to the	user's requirements.

AUTHORS
	Ted Pedersen, tpederse@umn.edu
	Satanjeev Banerjee, bane0025@d.umn.edu
	Bridget	McInnes, bthomson@umn.edu

       This work has been partially supported by a National Science Foundation
       Faculty Early CAREER Development	award (\#0092784) and by a Grant-in-
       Aid of Research,	Artistry and Scholarship from the Office of the	Vice
       President for Research and the Dean of the Graduate School of the
       University of Minnesota.

COPYRIGHT
       Copyright (C) 2000-2012,	Ted Pedersen and Satanjeev Banerjee and
       Bridget T. McInnes

       This suite of programs is free software;	you can	redistribute it	and/or
       modify it under the terms of the	GNU General Public License as
       published by the	Free Software Foundation; either version 2 of the
       License,	or (at your option) any	later version.

       This program is distributed in the hope that it will be useful, but
       WITHOUT ANY WARRANTY; without even the implied warranty of
       MERCHANTABILITY or FITNESS FOR A	PARTICULAR PURPOSE.  See the GNU
       General Public License for more details.

       You should have received	a copy of the GNU General Public License along
       with this program; if not, write	to the Free Software Foundation, Inc.,
       59 Temple Place - Suite 330, Boston, MA	02111-1307, USA.

       Note: The text of the GNU General Public	License	is provided in the
       file GPL.txt that you should have received with this distribution.

perl v5.24.1			  2015-10-04			       RANK(1)

NAME | SYNOPSIS | DESCRIPTION | AUTHORS | COPYRIGHT

Want to link to this manual page? Use this URL:
<https://www.freebsd.org/cgi/man.cgi?query=rank.pl&sektion=1&manpath=FreeBSD+12.1-RELEASE+and+Ports>

home | help