
FreeBSD Manual Pages


RAWTEXTFREQ(1)	      User Contributed Perl Documentation	RAWTEXTFREQ(1)

NAME -	Compute	Information Content from Raw / Plain Text

SYNOPSIS --outfile OUTFILE [--stopfile=STOPFILE]
		      {--stdin | --infile FILE [--infile FILE ...]}
		       [--wnpath WNPATH] [--resnik] [--smooth=SCHEME]
		       | --help	| --version


OPTIONS
       --outfile OUTFILE
           The name of a file to which output should be written.


       --stopfile=STOPFILE
           A file containing a list of stop-listed words that will not be
           considered in the frequency counts.  A sample file can be
           downloaded from


       --wnpath WNPATH
           Location of the WordNet data files (e.g.,


       --resnik
           Use Resnik (1995) frequency counting.


       --smooth=SCHEME
           Smoothing should be used on the probabilities computed.  SCHEME
           can only be ADD1 at this time.


       --help
           Show a help message.


       --version
           Display version information.


       --stdin
           Read from the standard input the text that is to be used for
           counting the frequency of words.


       --infile FILE
           The name of a raw text file to be used to count word frequencies.
           This can be a filename, a directory name, or a pattern (as
           understood by Perl's glob() function).  If the value is a directory
           name, then all the files in that directory and its subdirectories
           will be used.

           If you are looking for some interesting files to use, check out
           Project Gutenberg.

	   This	option may be given more than once (if more than one file
	   should be used).

DESCRIPTION
       This program reads a corpus of plain text and computes frequency counts
       from that corpus	and then uses those to determine the information
       content of each synset in WordNet. In brief it does this	by first
       assigning counts	to each	synset for which it obtains a frequency	count
       in the corpus, and then those counts are	propagated up the WordNet
       hierarchy. More details on this process can be found in the
       documentation of the lin, res, and jcn measures in WordNet::Similarity
       and in the publication by Patwardhan et al. (2003) cited below.

       The related corpus-specific utility programs all function in exactly
       the same way as this plain text program, except that they include the
       ability to deal with the format of the corpus with which they are used.

       None of these programs requires sense-tagged text; instead they simply
       distribute the counts of each observed word form to all the WordNet
       synsets with which it could be associated. The different forms of a
       word are	found via the validForms and querySense	methods	of

       For example, if the observed word is 'bank', then a count is given to
       the synsets associated with the financial institution, a	river shore,
       the act of turning a plane, etc.

   Distributing	Counts to Synsets
       If the corpus is sense-tagged, then distributing the counts of
       sense-tagged words to synsets is trivial: you increment the count of
       each synset for which you have a sense-tagged instance. However, it is
       very hard to obtain large quantities of sense-tagged text, so in general
       it is not feasible to obtain information content values from large
       sense-tagged corpora.

       As such, this program and the related utilities all try to increment
       the counts of synsets based on the occurrence of raw untagged word
       forms. In this case it is less obvious how to proceed. This program
       supports two methods for distributing the counts of observed word
       forms in untagged text to synsets.

       One is our default method, and we refer to the other as Resnik
       counting. In our	default	counting scheme, each synset receives the
       total count of each word	form associated	with it.

       Suppose the word 'bank' can be associated with six different synsets. In
       our default scheme each of those	synsets	would receive a	count for each
       occurrence of 'bank'. In	Resnik counting, the count would be divided
       between the possible synsets, so	in this	case each synset would get one
       sixth (1/6) of the total	count.
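
       The difference between the two schemes can be sketched in a few lines
       of Python. This is only an illustration and not the program's actual
       code: the function name distribute_counts, the word-to-synset mapping
       and the synset labels are all hypothetical, and a real run would look
       up the candidate synsets in WordNet.

```python
from collections import defaultdict


def distribute_counts(word_counts, word_to_synsets, resnik=False):
    """Assign corpus word counts to candidate synsets.

    Default scheme: every candidate synset receives the word's full count.
    Resnik scheme: the count is divided evenly among the candidates.
    """
    synset_counts = defaultdict(float)
    for word, count in word_counts.items():
        synsets = word_to_synsets.get(word, [])
        if not synsets:
            continue  # word has no synsets in the (hypothetical) mapping
        share = count / len(synsets) if resnik else count
        for synset in synsets:
            synset_counts[synset] += share
    return dict(synset_counts)


# Hypothetical data: 'bank' has six candidate synsets and occurs 12 times.
word_to_synsets = {"bank": [f"bank#n#{i}" for i in range(1, 7)]}
word_counts = {"bank": 12}

default_counts = distribute_counts(word_counts, word_to_synsets)
resnik_counts = distribute_counts(word_counts, word_to_synsets, resnik=True)

print(default_counts["bank#n#1"])  # 12.0 (full count for every synset)
print(resnik_counts["bank#n#1"])   # 2.0  (one sixth of 12)
```

       Note that under the default scheme the synset counts sum to more than
       the number of observed tokens, while under Resnik counting the total
       is preserved.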

   How are These Counts	Used?
       This program maps word forms to synsets.	These synset counts are	then
       propagated up the WordNet hierarchy to arrive at	Information Content
       values for each synset, which are then used by the Lin (lin), Resnik
       (res), and Jiang	& Conrath (jcn)	measures of semantic similarity.
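
       The propagation step can be pictured with a small sketch. The
       hierarchy, synset names and counts below are made up for illustration
       and are not real WordNet data, and propagate_counts is a hypothetical
       helper rather than part of WordNet::Similarity. Each synset's own count
       is added once to the synset itself and to every one of its ancestors.

```python
def propagate_counts(own_counts, hypernyms):
    """Return cumulative counts: own count plus counts of all descendants.

    hypernyms maps a synset to a list of its parent synsets, so a synset
    with several parents contributes to each ancestor exactly once.
    """
    totals = {}
    for synset, count in own_counts.items():
        # Collect the synset itself and all of its ancestors.
        seen = set()
        stack = [synset]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(hypernyms.get(node, []))
        for node in seen:
            totals[node] = totals.get(node, 0) + count
    return totals


# Hypothetical toy hierarchy (child -> parents).
hypernyms = {
    "bank": ["financial_institution"],
    "credit_union": ["financial_institution"],
    "financial_institution": ["institution"],
    "institution": ["entity"],
    "riverbank": ["slope"],
    "slope": ["entity"],
}
own_counts = {"bank": 10, "credit_union": 3, "riverbank": 2, "institution": 1}

totals = propagate_counts(own_counts, hypernyms)
print(totals["financial_institution"])  # 13
print(totals["entity"])                 # 16
```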

       By default these	measures use counts derived from the cntlist file
       provided	by WordNet, which is based on frequency	counts from the	sense-
       tagged SemCor corpus. This consists of approximately 200,000 sense
       tagged tokens taken from	the Brown Corpus and the Red Badge of Courage.

       A file called ic-semcor.dat is created from cntlist during installation
       of WordNet::Similarity, using one of the utility programs. That is the
       only one of the utility programs that uses sense tagged text, and in
       fact it only uses the counts from cntlist, not the actual sense tagged
       text.

       This program simply creates an alternative version of the ic-semcor.dat
       file based on counts obtained from raw untagged text.

   Why Use This	Program?
       The default information content file (ic-semcor.dat) is based on
       SemCor, which includes sense tagged portions of the Brown Corpus	and
       the Red Badge of	Courage. It has	the advantage of being sense tagged,
       but is from a rather limited domain and is somewhat small in size
       (200,000	sense tagged tokens).

       If you are working in a different domain	or have	access to a larger
       quantity	of corpora, you	might find that	this program provides
       information content values that better reflect your underlying domain
       or problem.

   How can these counts	be reliable if they aren't based on sense tagged text?
       Remember that once the counts are given to a synset, those counts are
       propagated upwards, so that each synset receives the counts of its
       children. These are then used in the calculation of the information
       content of each synset, which is simply:

	       information content (synset) = -	log [probability (synset)]
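
       As a rough sketch of this formula, the Python below turns cumulative
       counts into information content values. It assumes that the
       probability of a synset is its cumulative count divided by the
       cumulative count of the root of its hierarchy; that normalization, the
       function name and the counts are assumptions made for illustration.

```python
import math


def information_content(cumulative_counts, root):
    """Compute IC(synset) = -log(p(synset)) from cumulative counts.

    Assumption: p(synset) = cumulative count / cumulative count of root.
    """
    root_count = cumulative_counts[root]
    ic = {}
    for synset, count in cumulative_counts.items():
        if count > 0:
            # log(root/count) equals -log(count/root) and avoids a -0.0 root.
            ic[synset] = math.log(root_count / count)
        else:
            ic[synset] = None  # undefined when the synset was never observed
    return ic


# Hypothetical cumulative counts for a toy hierarchy rooted at 'entity'.
cumulative_counts = {
    "entity": 16,
    "institution": 14,
    "financial_institution": 13,
    "bank": 10,
    "slope": 2,
    "unseen": 0,
}

ic = information_content(cumulative_counts, "entity")
for synset, value in ic.items():
    print(synset, value)
```

       The root has probability 1 and therefore an information content of 0,
       while rarer, more specific synsets receive higher values. A synset
       with a zero count has no defined value here, which is the situation
       that smoothing is meant to avoid.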

       More details on this calculation and how these values are used in the
       res, lin, and jcn measures can be found in the WordNet::Similarity
       module documentation, and in the following publication:

	Using Measures of Semantic Relatedness for Word	Sense Disambiguation
	(Patwardhan, Banerjee and Pedersen) - Appears in the Proceedings of
	the Fourth International Conference on Intelligent Text	Processing and
	Computational Linguistics, pp. 241-257,	February 17-21,	2003, Mexico City.

       We believe that a propagation effect will result	in concentrations or
       clusters	of information content values in the WordNet hierarchy.	For
       example,	if you have a text about banking, while	the different counts
       of "bank" will be dispersed around WordNet, there will also be other
       financial terms that occur with bank that will occur near the financial
       synset in WordNet, and lead to a	concentration of counts	in that	region
       of WordNet. It is best to view this as a	conjecture or hypothesis at
       this time. Evidence for or against would	be most	interesting.

       You can use raw text of any kind in this program. We sometimes use text
       from Project Gutenberg, for example the Complete Works of Shakespeare.

       Report to WordNet::Similarity mailing list :


       WordNet home page :

       WordNet::Similarity home	page :

	Ted Pedersen, University of Minnesota, Duluth
	tpederse at

	Satanjeev Banerjee, Carnegie Mellon University,	Pittsburgh
	banerjee+ at

	Siddharth Patwardhan, University of Utah, Salt Lake City
	sidd at

	Jason Michelizzi

       Copyright (c) 2005-2008,	Ted Pedersen, Satanjeev	Banerjee, Siddharth
       Patwardhan and Jason Michelizzi

       This program is free software; you can redistribute it and/or modify it
       under the terms of the GNU General Public License as published by the
       Free Software Foundation; either	version	2 of the License, or (at your
       option) any later version.  This	program	is distributed in the hope
       that it will be useful, but WITHOUT ANY WARRANTY; without even the
       implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR
       PURPOSE.  See the GNU General Public License for more details.

       You should have received	a copy of the GNU General Public License along
       with this program; if not, write	to

	Free Software Foundation, Inc.
	59 Temple Place	- Suite	330
	Boston,	MA  02111-1307,	USA

perl v5.32.0			  2020-08-09			RAWTEXTFREQ(1)

