Skip site navigation (1)Skip section navigation (2)

FreeBSD Manual Pages


home | help
AI::Categorizer(3)    User Contributed Perl Documentation   AI::Categorizer(3)

       AI::Categorizer - Automatic Text	Categorization

	use AI::Categorizer;
	my $c =	new AI::Categorizer(...parameters...);

	# Run a	complete experiment - training on a corpus, testing on a test
	# set, printing	a summary of results to	STDOUT

	# Or, run the parts of $c->run_experiment separately
	print $c->stats_table;

	# After	training, use the Learner for categorization
	my $l =	$c->learner;
	while (...) {
	  my $d	= ...create a document...
	  my $hypothesis = $l->categorize($d);	# An AI::Categorizer::Hypothesis object
	  print	"Assigned categories: ", join ', ', $hypothesis->categories, "\n";
	  print	"Best category:	", $hypothesis->best_category, "\n";

       "AI::Categorizer" is a framework	for automatic text categorization.  It
       consists	of a collection	of Perl	modules	that implement common
       categorization tasks, and a set of defined relationships	among those
       modules.	 The various details are flexible - for	example, you can
       choose what categorization algorithm to use, what features (words or
       otherwise) of the documents should be used (or how to automatically
       choose these features), what format the documents are in, and so	on.

       The basic process of using this module will typically involve obtaining
       a collection of pre-categorized documents, creating a "knowledge	set"
       representation of those documents, training a categorizer on that
       knowledge set, and saving the trained categorizer for later use.	 There
       are several ways	to carry out this process.  The	top-level
       "AI::Categorizer" module	provides an umbrella class for high-level
       operations, or you may use the interfaces of the	individual classes in
       the framework.

       A simple	sample script that reads a training corpus, trains a
       categorizer, and	tests the categorizer on a test	corpus,	is distributed
       as eg/ .

       Disclaimer: the results of any of the machine learning algorithms are
       far from	infallible (close to fallible?).  Categorization of documents
       is often	a difficult task even for humans well-trained in the
       particular domain of knowledge, and there are many things a human would
       consider	that none of these algorithms consider.	 These are only
       statistical tests - at best they	are neat tricks	or helpful assistants,
       and at worst they are totally unreliable.  If you plan to use this
       module for anything really important, human supervision is essential,
       both of the categorization process and the final	results.

       For the usage details, please see the documentation of each individual

       This section explains the major pieces of the "AI::Categorizer" object
       framework.  We give a conceptual	overview, but don't get	into any of
       the details about interfaces or usage.  See the documentation for the
       individual classes for more details.

       A diagram of the	various	classes	in the framework can be	seen in
       "doc/classes-overview.png", and a more detailed view of the same	thing
       can be seen in "doc/classes.png".

   Knowledge Sets
       A "knowledge set" is defined as a collection of documents, together
       with some information on	the categories each document belongs to.  Note
       that this term is somewhat unique to this project - other sources may
       call it a "training corpus", or "prior knowledge".  A knowledge set
       also contains some information on how documents will be parsed and how
       their features (words) will be extracted	and turned into	meaningful
       representations.	 In this sense,	a knowledge set	represents not only a
       collection of data, but a particular view on that data.

       A knowledge set is encapsulated by the "AI::Categorizer::KnowledgeSet"
       class.  Before you can start playing with categorizers, you will	have
       to start	playing	with knowledge sets, so	that the categorizers have
       some data to train on.  See the documentation for the
       "AI::Categorizer::KnowledgeSet" module for information on its

       Feature selection

       Deciding	which features are the most important is a very	large part of
       the categorization task - you cannot simply consider all	the words in
       all the documents when training,	and all	the words in the document
       being categorized.  There are two main reasons for this - first,	it
       would mean that your training and categorizing processes	would take
       forever and use tons of memory, and second, the significant stuff of
       the documents would get lost in the "noise" of the insignificant	stuff.

       The process of selecting	the most important features in the training
       set is called "feature selection".  It is managed by the
       "AI::Categorizer::KnowledgeSet" class, and you will find	the details of
       feature selection processes in that class's documentation.

       Because documents may be	stored in lots of different formats, a
       "collection" class has been created as an abstraction of	a stored set
       of documents, together with a way to iterate through the	set and	return
       Document	objects.  A knowledge set contains a single collection object.
       A "Categorizer" doing a complete	test run generally contains two
       collections, one	for training and one for testing.  A "Learner" can
       mass-categorize a collection.

       The "AI::Categorizer::Collection" class and its subclasses instantiate
       the idea	of a collection	in this	sense.

       Each document is	represented by an "AI::Categorizer::Document" object,
       or an object of one of its subclasses.  Each document class contains
       methods for turning a bunch of data into	a Feature Vector.  Each
       document	also has a method to report which categories it	belongs	to.

       Each category is	represented by an "AI::Categorizer::Category" object.
       Its main	purpose	is to keep track of which documents belong to it,
       though you can also examine statistical properties of an	entire
       category, such as obtaining a Feature Vector representing an
       amalgamation of all the documents that belong to	it.

   Machine Learning Algorithms
       There are lots of different ways	to make	the inductive leap from	the
       training	documents to unseen documents.	The Machine Learning community
       has studied many	algorithms for this purpose.  To allow flexibility in
       choosing	and configuring	categorization algorithms, each	such algorithm
       is a subclass of	"AI::Categorizer::Learner".  There are currently four
       categorizers included in	the distribution:

	   A pure-perl implementation of a Naive Bayes classifier.  No
	   dependencies	on external modules or other resources.	 Naive Bayes
	   is usually very fast	to train and fast to make categorization
	   decisions, but isn't	always the most	accurate categorizer.

	   An interface	to Corey Spencer's "Algorithm::SVM", which implements
	   a Support Vector Machine classifier.	 SVMs can take a while to
	   train (though in certain conditions there are optimizations to make
	   them	quite fast), but are pretty quick to categorize.  They often
	   have	very good accuracy.

	   An interface	to "AI::DecisionTree", which implements	a Decision
	   Tree	classifier.  Decision Trees generally take longer to train
	   than	Naive Bayes or SVM classifiers,	but they are also quite	fast
	   when	categorizing.  Decision	Trees have the advantage that you can
	   scrutinize the structures of	trained	decision trees to see how
	   decisions are being made.

	   An interface	to version 2 of	the Weka Knowledge Analysis system
	   that	lets you use any of the	machine	learners it defines.  This
	   gives you access to lots and	lots of	machine	learning algorithms in
	   use by machine learning researches.	The main drawback is that Weka
	   tends to be quite slow and use a lot	of memory, and the current
	   interface between Weka and "AI::Categorizer"	is a bit clumsy.

       Other machine learning methods that may be implemented soonish include
       Neural Networks,	k-Nearest-Neighbor, and/or a mixture-of-experts
       combiner	for ensemble learning.	No timetable for their creation	has
       yet been	set.

       Please see the documentation of these individual	modules	for more
       details on their	guts and quirks.  See the "AI::Categorizer::Learner"
       documentation for a description of the general categorizer interface.

       If you wish to create your own classifier, you should inherit from
       "AI::Categorizer::Learner" or "AI::Categorizer::Learner::Boolean",
       which are abstract classes that manage some of the work for you.

   Feature Vectors
       Most categorization algorithms don't deal directly with documents'
       data, they instead deal with a vector representation of a document's
       features.  The features may be any properties of	the document that seem
       helpful for determining its category, but they are usually some version
       of the "most important" words in	the document.  A list of features and
       their weights in	each document is encapsulated by the
       "AI::Categorizer::FeatureVector"	class.	You may	think of this class as
       roughly analogous to a Perl hash, where the keys	are the	names of
       features	and the	values are their weights.

       The result of asking a categorizer to categorize	a previously unseen
       document	is called a hypothesis,	because	it is some kind	of
       "statistical guess" of what categories this document should be assigned
       to.  Since you may be interested	in any of several pieces of
       information about the hypothesis	(for instance, which categories	were
       assigned, which category	was the	single most likely category, the
       scores assigned to each category, etc.),	the hypothesis is returned as
       an object of the	"AI::Categorizer::Hypothesis" class, and you can use
       its object methods to get information about the hypothesis.  See	its
       class documentation for the details.

       The "AI::Categorizer::Experiment" class helps you organize the results
       of categorization experiments.  As you get lots of categorization
       results (Hypotheses) back from the Learner, you can feed	these results
       to the Experiment class,	along with the correct answers.	 When all
       results have been collected, you	can get	a report on accuracy,
       precision, recall, F1, and so on, with both micro-averaging and macro-
       averaging over categories.  We use the "Statistics::Contingency"	module
       from CPAN to manage the calculations. See the docs for
       "AI::Categorizer::Experiment" for more details.

	   Creates a new Categorizer object and	returns	it.  Accepts lots of
	   parameters controlling behavior.  In	addition to the	parameters
	   listed here,	you may	pass any parameter accepted by any class that
	   we create internally	(the KnowledgeSet, Learner, Experiment,	or
	   Collection classes),	or any class that they create.	This is
	   managed by the "Class::Container" module, so	see its	documentation
	   for the details of how this works.

	   The specific	parameters accepted here are:

	       A string	that indicates a place where objects will be saved
	       during several of the methods of	this class.  The default value
	       is the string "save", which means files like
	       "save-01-knowledge_set" will get	created.  The exact names of
	       these files may change in future	releases, since	they're	just
	       used internally to resume where we last left off.

	       If true,	a few status messages will be printed during

	       Specifies the "path" parameter that will	be fed to the
	       KnowledgeSet's "scan_features()"	and "read()" methods during
	       our "scan_features()" and "read_training_set()" methods.

	       Specifies the "path" parameter that will	be used	when creating
	       a Collection during the "evaluate_test_set()" method.

	       A shortcut for setting the "training_set", "test_set", and
	       "category_file" parameters separately.  Sets "training_set" to
	       "$data_root/training", "test_set" to "$data_root/test", and
	       "category_file" (used by	some of	the Collection classes)	to

	   Returns the Learner object associated with this Categorizer.
	   Before "train()", the Learner will of course	not be trained yet.

	   Returns the KnowledgeSet object associated with this	Categorizer.
	   If "read_training_set()" has	not yet	been called, the KnowledgeSet
	   will	not yet	be populated with any training data.

	   Runs	a complete experiment on the training and testing data,
	   reporting the results on "STDOUT".  Internally, this	is just	a
	   shortcut for	calling	the "scan_features()", "read_training_set()",
	   "train()", and "evaluate_test_set()"	methods, then printing the
	   value of the	"stats_table()"	method.

	   Scans the Collection	specified in the "test_set" parameter to
	   determine the set of	features (words) that will be considered when
	   training the	Learner.  Internally, this calls the "scan_features()"
	   method of the KnowledgeSet, then saves a list of the	KnowledgeSet's
	   features for	later use.

	   This	step is	not strictly necessary,	but it can dramatically	reduce
	   memory requirements if you scan for features	before reading the
	   entire corpus into memory.

	   Populates the KnowledgeSet with the data specified in the
	   "test_set" parameter.  Internally, this calls the "read()" method
	   of the KnowledgeSet.	 Returns the KnowledgeSet.  Also saves the
	   KnowledgeSet	object for later use.

	   Calls the Learner's "train()" method, passing it the	KnowledgeSet
	   created during "read_training_set()".  Returns the Learner object.
	   Also	saves the Learner object for later use.

	   Creates a Collection	based on the value of the "test_set"
	   parameter, and calls	the Learner's "categorize_collection()"	method
	   using this Collection.  Returns the resultant Experiment object.
	   Also	saves the Experiment object for	later use in the
	   "stats_table()" method.

	   Returns the value of	the Experiment's (as created by
	   "evaluate_test_set()") "stats_table()" method.  This	is a string
	   that	shows various statistics about the
	   accuracy/precision/recall/F1/etc. of	the assignments	made during

       This module is a	revised	and redesigned version of the previous
       "AI::Categorize"	module by the same author.  Note the added 'r' in the
       new name.  The older module has a different interface, and no attempt
       at backward compatibility has been made - that's	why I changed the

       You can have both "AI::Categorize" and "AI::Categorizer"	installed at
       the same	time on	the same machine, if you want.	They don't know	about
       each other or use conflicting namespaces.

       Ken Williams <>

       Discussion about	this module can	be directed to the perl-AI list	at
       <>.  For	more info about	the list, see

       An excellent introduction to the	academic field of Text Categorization
       is Fabrizio Sebastiani's	"Machine Learning in Automated Text
       Categorization":	ACM Computing Surveys, Vol. 34,	No. 1, March 2002, pp.

       Copyright 2000-2003 Ken Williams.  All rights reserved.

       This distribution is free software; you can redistribute	it and/or
       modify it under the same	terms as Perl itself.  These terms apply to
       every file in the distribution -	if you have questions, please contact
       the author.

perl v5.24.1			  2017-07-02		    AI::Categorizer(3)


Want to link to this manual page? Use this URL:

home | help