AI::Categorizer::FeatureSelector(3)  User Contributed Perl Documentation  AI::Categorizer::FeatureSelector(3)

NAME
       AI::Categorizer::FeatureSelector	- Abstract Feature Selection class

SYNOPSIS
	...

DESCRIPTION
       The KnowledgeSet class provides an interface to a set of documents, a
       set of categories, and a mapping between the two.  Many parameters for
       controlling the processing of documents are managed by the
       KnowledgeSet class.

METHODS
       new()
	   Creates a new KnowledgeSet and returns it.  Accepts the following
	   parameters:

	   load
	       If a "load" parameter is	present, the "load()" method will be
	       invoked immediately.  If	the "load" parameter is	a string, it
	       will be passed as the "path" parameter to "load()".  If the
	       "load" parameter	is a hash reference, it	will represent all the
	       parameters to pass to "load()".

	   categories
	       An optional reference to	an array of Category objects
	       representing the	complete set of	categories in a	KnowledgeSet.
	       If used,	the "documents"	parameter should also be specified.

	   documents
	       An optional reference to	an array of Document objects
	       representing the	complete set of	documents in a KnowledgeSet.
	       If used,	the "categories" parameter should also be specified.

	   features_kept
	       A number	indicating how many features (words) should be
	       considered when training	the Learner or categorizing new
	       documents.  May be specified as a positive integer (e.g.	2000)
	       indicating the absolute number of features to be	kept, or as a
	       decimal between 0 and 1 (e.g. 0.2) indicating the fraction of
	       the total number	of features to be kept,	or as 0	to indicate
	       that no feature selection should	be done	and that the entire
	       set of features should be used.	The default is 0.2.
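
	       The three forms of "features_kept" can be illustrated with a
	       short Python sketch.  This is not the module's Perl code: the
	       function name "resolve_features_kept" is hypothetical, and
	       truncating a fractional result and capping an absolute count
	       at the number of available features are assumptions.

```python
def resolve_features_kept(features_kept, total_features):
    """Return how many features to keep, following the rules described above."""
    if features_kept < 0:
        raise ValueError("features_kept must not be negative")
    if features_kept == 0:
        # 0 disables feature selection: the entire feature set is used.
        return total_features
    if features_kept < 1:
        # A decimal between 0 and 1 is a fraction of the total.
        # Truncation toward zero is an assumption of this sketch.
        return int(total_features * features_kept)
    # Otherwise the value is an absolute count; capping it at the number
    # of available features is also an assumption.
    return min(int(features_kept), total_features)
```

	       For example, with 10 features in total a value of 0.5 keeps 5
	       of them, while a value of 0 keeps all 10.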

	   feature_selection
	       A string	indicating the type of feature selection that should
	       be performed.  Currently	the only option	is also	the default
	       option: "document_frequency".

	   tfidf_weighting
	       Specifies how document word counts should be converted to
	       vector values.  Uses the	three-character	specification strings
	       from Salton & Buckley's paper "Term-weighting approaches	in
	       automatic text retrieval".  The three characters	indicate the
	       three factors that will be multiplied for each feature to find
	       the final vector	value for that feature.	 The default weighting
	       is "xxx".

	       The first character specifies the "term frequency" component,
	       which can take the following values:

	       b   Binary weighting - 1	for terms present in a document, 0 for
		   terms absent.

	       t   Raw term frequency -	equal to the number of times a feature
		   occurs in the document.

	       x   A synonym for 't'.

	       n   Normalized term frequency - 0.5 + 0.5 * t/max(t).  This is
		   the same as the 't' specification, but with term frequency
		   normalized to lie between 0.5 and 1.

	       The second character specifies the "collection frequency"
	       component, which	can take the following values:

	       f   Inverse document frequency -	multiply term "t"'s value by
		   "log(N/n)", where "N" is the	total number of	documents in
		   the collection, and "n" is the number of documents in which
		   term	"t" is found.

	       p   Probabilistic inverse document frequency - multiply term
		   "t"'s value by "log((N-n)/n)" (same variable	meanings as
		   above).

	       x   No change - multiply	by 1.

	       The third character specifies the "normalization" component,
	       which can take the following values:

	       c   Apply cosine normalization - multiply by
		   1/length(document_vector), where the length is the
		   Euclidean norm of the document's vector.

	       x   No change - multiply	by 1.

	       The three components may	alternatively be specified by the
	       "term_weighting", "collection_weighting", and
	       "normalize_weighting" parameters	respectively.

	   verbose
	       If set to a true value, some status/debugging information will
	       be printed to "STDOUT".

       categories()
	   In a	list context returns a list of all Category objects in this
	   KnowledgeSet.  In a scalar context returns the number of such
	   objects.

       documents()
	   In a	list context returns a list of all Document objects in this
	   KnowledgeSet.  In a scalar context returns the number of such
	   objects.

       document()
	   Given a document name, returns the Document object with that	name,
	   or "undef" if no such Document object exists	in this	KnowledgeSet.

       features()
	   Returns a FeatureSet	object which represents	the features of	all
	   the documents in this KnowledgeSet.

       verbose()
	   Returns the "verbose" parameter of this KnowledgeSet, or sets it
	   with	an optional argument.

       scan_stats()
	   Scans all the documents of a Collection and returns a hash
	   reference containing several statistics about the Collection.
	   The individual statistics are not yet documented here.

       scan_features()
	   This	method scans through a Collection object and determines	the
	   "best" features (words) to use when loading the documents and
	   training the	Learner.  This process is known	as "feature
	   selection", and it's	a very important part of categorization.

	   The Collection object should	be specified as	a "collection"
	   parameter, or by giving the arguments to pass to the	Collection's
	   "new()" method.

	   The process of feature selection is governed	by the
	   "feature_selection" and "features_kept" parameters given to the
	   KnowledgeSet's "new()" method.

	   This	method returns the features as a FeatureVector whose values
	   are the "quality" of	each feature, by whatever measure the
	   "feature_selection" parameter specifies.  Normally you won't	need
	   to use the return value, because this FeatureVector will become the
	   "use_features" parameter of any Document objects created by this
	   KnowledgeSet.
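
	   The idea behind "document_frequency" selection can be sketched in
	   Python as follows.  This is an illustration only, not the module's
	   implementation: the function name is hypothetical, documents are
	   assumed to be plain lists of tokens, a higher document frequency
	   is assumed to mean a higher "quality", and ties are broken
	   alphabetically purely to keep the result deterministic.

```python
from collections import Counter


def select_by_document_frequency(documents, keep):
    """Rank features by document frequency and keep the top `keep`.

    documents: iterable of token lists, one list per document.
    Returns a dict mapping each kept feature to its document frequency.
    """
    doc_freq = Counter()
    for tokens in documents:
        # Count each feature at most once per document.
        doc_freq.update(set(tokens))
    # Highest document frequency first; alphabetical order breaks ties
    # (an assumption of this sketch).
    ranked = sorted(doc_freq.items(), key=lambda item: (-item[1], item[0]))
    return dict(ranked[:keep])
```

	   The returned mapping plays the role of the FeatureVector described
	   above: its keys are the selected features and its values are the
	   scores used to rank them.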

       save_features()
	   Given the name of a file, this method writes	the features (as
	   determined by the "scan_features" method) to	the file.

       restore_features()
	   Given the name of a file written by "save_features",	loads the
	   features from that file and passes them as the "use_features"
	   parameter for any Document objects created in the future by this
	   KnowledgeSet.

       read()
	   Iterates through a Collection of documents and adds them to the
	   KnowledgeSet.  The Collection can be	specified using	a "collection"
	   parameter - otherwise, specify the arguments	to pass	to the "new()"
	   method of the Collection class.

       load()
	   This	method can do feature selection	and load a Collection in one
	   step	(though	it currently uses two steps internally).

       add_document()
	   Given a Document object as an argument, this method adds it,
	   along with any categories it belongs to, to the KnowledgeSet.

       make_document()
	   This	method will create a Document object with the given data and
	   then	call "add_document()" to add it	to the KnowledgeSet.  A
	   "categories"	parameter should specify an array reference containing
	   a list of categories	by name.  These	are the	categories that	the
	   document belongs to.	 Any other parameters will be passed to	the
	   Document class's "new()" method.

       finish()
	   This	method will be called prior to training	the Learner.  Its
	   purpose is to perform any operations	(such as feature vector
	   weighting) that may require examination of the entire KnowledgeSet.

       weigh_features()
	   This	method will be called during "finish()"	to adjust the weights
	   of the features according to	the "tfidf_weighting" parameter.
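
	   As a rough illustration of how a three-character
	   "tfidf_weighting" string combines its components, here is a
	   Python sketch for a single document.  It is not the module's Perl
	   code: the function name "weigh" and its arguments are
	   hypothetical, the natural logarithm is assumed for "log", and the
	   vector length is taken to be the Euclidean norm.

```python
import math


def weigh(doc_counts, doc_freq, num_docs, spec="xxx"):
    """Apply a three-character weighting spec to one document.

    doc_counts: {term: raw count of the term in this document}
    doc_freq:   {term: number of documents that contain the term}
    num_docs:   total number of documents in the collection
    """
    tf_code, cf_code, norm_code = spec
    max_count = max(doc_counts.values(), default=0)
    weights = {}
    for term, count in doc_counts.items():
        # First character: term frequency component.
        if tf_code == "b":
            tf = 1.0 if count > 0 else 0.0
        elif tf_code in ("t", "x"):
            tf = float(count)
        elif tf_code == "n":
            tf = 0.5 + 0.5 * count / max_count if max_count else 0.0
        else:
            raise ValueError(f"unknown term frequency code: {tf_code!r}")

        # Second character: collection frequency component.
        n = doc_freq[term]
        if cf_code == "f":
            cf = math.log(num_docs / n)
        elif cf_code == "p":
            # Undefined when the term occurs in every document (n == N).
            cf = math.log((num_docs - n) / n)
        elif cf_code == "x":
            cf = 1.0
        else:
            raise ValueError(f"unknown collection frequency code: {cf_code!r}")

        weights[term] = tf * cf

    # Third character: normalization component.
    if norm_code == "c":
        length = math.sqrt(sum(w * w for w in weights.values()))
        if length > 0:
            weights = {t: w / length for t, w in weights.items()}
    elif norm_code != "x":
        raise ValueError(f"unknown normalization code: {norm_code!r}")
    return weights
```

	   With the default "xxx" the weights are simply the raw term counts;
	   "tfc" gives the familiar cosine-normalized TF-IDF vector.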

       document_frequency()
	   Given a single feature (word) as an argument, this method will
	   return the number of	documents in the KnowledgeSet that contain
	   that	feature.

       partition()
	   Divides the KnowledgeSet into several subsets.  This may be useful
	   for performing cross-validation.  The relative sizes of all but
	   the last subset should be passed as arguments; the size of the
	   final subset is 1 minus the sum of the other sizes.  For example,
	   to split the KnowledgeSet into four KnowledgeSets of equal size,
	   pass the arguments .25, .25, .25.  The partitions will be returned
	   as a list.
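
	   The size arithmetic can be sketched in Python on a plain list.
	   This only illustrates how N size arguments yield N+1 subsets; the
	   function below is hypothetical, and the module itself may assign
	   documents to subsets differently (for instance at random rather
	   than in consecutive runs).

```python
def partition(items, *sizes):
    """Split items into len(sizes) + 1 consecutive parts.

    Each value in sizes is the relative size of one part; the final
    part receives whatever remains (1 minus the sum of the sizes).
    """
    if any(size <= 0 for size in sizes) or sum(sizes) >= 1:
        raise ValueError("sizes must be positive and sum to less than 1")
    total = len(items)
    parts = []
    start = 0
    cumulative = 0.0
    for size in sizes:
        cumulative += size
        # Cut at the cumulative boundary so rounding errors do not pile up.
        end = round(cumulative * total)
        parts.append(items[start:end])
        start = end
    parts.append(items[start:])  # the implicit final part
    return parts
```

	   Passing .25, .25, .25 with a list of eight items returns four
	   parts of two items each.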

AUTHOR
       Ken Williams, ken@mathforum.org

COPYRIGHT
       Copyright 2000-2003 Ken Williams.  All rights reserved.

       This library is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

SEE ALSO
       AI::Categorizer(3)

perl v5.32.0			  2020-08-0	  AI::Categorizer::FeatureSelector(3)
