Statistics::Contingency(3)  User Contributed Perl Documentation  Statistics::Contingency(3)

NAME
       Statistics::Contingency - Calculate precision, recall, F1, accuracy,
       etc.

VERSION
       version 0.09

SYNOPSIS
	use Statistics::Contingency;
	my $s = Statistics::Contingency->new(categories => \@all_categories);

	while (...something...)	{
	  ...
	  $s->add_result($assigned_categories, $correct_categories);
	}

	print "Micro F1: ", $s->micro_F1, "\n";	# Access a single statistic
	print $s->stats_table; # Show several stats in table form

DESCRIPTION
       The "Statistics::Contingency" class helps you calculate several useful
       statistical measures based on 2x2 "contingency tables".	I use these
       measures	to help	judge the results of automatic text categorization
       experiments, but	they are useful	in other situations as well.

       The general usage flow is to tally a whole bunch	of results in the
       "Statistics::Contingency" object, then query that object	to obtain the
       measures	you are	interested in.	When all results have been collected,
       you can get a report on accuracy, precision, recall, F1,	and so on,
       with both macro-averaging and micro-averaging over categories.

   Macro vs. Micro Statistics
       All of the statistics offered by this module can be calculated for each
       category and then averaged, or can be calculated once over all
       decisions pooled together.  The former is called macro-averaging
       (specifically, macro-averaging with respect to category), and the
       latter is called micro-averaging.  The two procedures bias the results
       differently: micro-averaging tends to over-emphasize the performance
       on the largest categories, while macro-averaging over-emphasizes the
       performance on the smallest.  It's often best to look at both of them
       to get a good idea of how your data distributes across categories.
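
       The difference between the two averaging procedures can be sketched in
       plain Perl, without using the module itself.  The category names, the
       counts, and the helper names ("precision", "macro_precision",
       "micro_precision") are illustrative assumptions, as is the choice to
       return 0 when nothing was assigned; the module's own handling of such
       edge cases may differ.

```perl
use strict;
use warnings;

# Hypothetical per-category counts: a = correctly assigned, b = wrongly assigned.
my %tables = (
    big   => { a => 90, b => 10 },   # large category, precision 0.90
    small => { a => 1,  b => 3  },   # small category, precision 0.25
);

# Precision a/(a+b); returns 0 when nothing was assigned
# (a convention of this sketch, not necessarily the module's).
sub precision {
    my ($t) = @_;
    my $den = $t->{a} + $t->{b};
    return $den ? $t->{a} / $den : 0;
}

# Macro-average: compute precision per category, then take the mean.
sub macro_precision {
    my ($tables) = @_;
    my @cats = keys %$tables;
    my $sum  = 0;
    $sum += precision($tables->{$_}) for @cats;
    return $sum / scalar(@cats);
}

# Micro-average: pool the counts over all categories, then compute once.
sub micro_precision {
    my ($tables) = @_;
    my %pooled = (a => 0, b => 0);
    for my $t (values %$tables) {
        $pooled{$_} += $t->{$_} for qw(a b);
    }
    return precision(\%pooled);
}

printf "macro precision: %.3f\n", macro_precision(\%tables);
printf "micro precision: %.3f\n", micro_precision(\%tables);
```

       With these counts the macro-average is pulled down by the small
       category, while the micro-average stays close to the large category's
       score.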

   Statistics available
       All of the statistics are calculated based on a so-called "contingency
       table", which looks like	this:

		     Correct=Y	 Correct=N
		   +-----------+-----------+
	Assigned=Y |	 a     |     b	   |
		   +-----------+-----------+
	Assigned=N |	 c     |     d	   |
		   +-----------+-----------+

       a, b, c,	and d are counts that reflect how the assigned categories
       matched the correct categories.	Depending on whether a macro-statistic
       or a micro-statistic is being calculated, these numbers will be tallied
       per-category or for the entire result set.
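
       As a rough illustration of how the four counts arise, the following
       plain-Perl sketch tallies the table for a single category from a list
       of (assigned, correct) pairs.  The "tally" helper and the sample data
       are hypothetical and are not part of the module.

```perl
use strict;
use warnings;

# Tally one category's contingency table from (assigned, correct) pairs.
# Each pair holds two hash refs whose keys are category names and whose
# values are logically true.
sub tally {
    my ($category, @results) = @_;
    my %t = (a => 0, b => 0, c => 0, d => 0);
    for my $r (@results) {
        my ($assigned, $correct) = @$r;
        my $is_assigned = $assigned->{$category} ? 1 : 0;
        my $is_correct  = $correct->{$category}  ? 1 : 0;
        if    ($is_assigned && $is_correct) { $t{a}++ }   # assigned, correct
        elsif ($is_assigned)                { $t{b}++ }   # assigned, not correct
        elsif ($is_correct)                 { $t{c}++ }   # missed
        else                                { $t{d}++ }   # rightly left out
    }
    return \%t;
}

my @results = (
    [ { sports => 1 },                { sports => 1 }   ],
    [ { sports => 1, politics => 1 }, { politics => 1 } ],
    [ { politics => 1 },              { sports => 1 }   ],
    [ {},                             { politics => 1 } ],
);

my $t = tally('sports', @results);
print join(', ', map { "$_=$t->{$_}" } qw(a b c d)), "\n";
```

       Each of the four sample results lands in a different cell of the table
       for "sports".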

       The following statistics	are available:

       o   accuracy

	   This	measures the portion of	all decisions that were	correct
	   decisions.  It is defined as	"(a+d)/(a+b+c+d)".  It falls in	the
	   range from 0	to 1, with 1 being the best score.

	   Note	that macro-accuracy and	micro-accuracy will always give	the
	   same	number.

       o   error

	   This	measures the portion of	all decisions that were	incorrect
	   decisions.  It is defined as	"(b+c)/(a+b+c+d)".  It falls in	the
	   range from 0	to 1, with 0 being the best score.

	   Note	that macro-error and micro-error will always give the same
	   number.

       o   precision

	   This	measures the portion of	the assigned categories	that were
	   correct.  It	is defined as "a/(a+b)".  It falls in the range	from 0
	   to 1, with 1	being the best score.

       o   recall

	   This	measures the portion of	the correct categories that were
	   assigned.  It is defined as "a/(a+c)".  It falls in the range from
	   0 to	1, with	1 being	the best score.

       o   F1

	   This	measures an even combination of	precision and recall.  It is
	   defined as "2*p*r/(p+r)".  In terms of a, b,	and c, it may be
	   expressed as	"2a/(2a+b+c)".	It falls in the	range from 0 to	1,
	   with	1 being	the best score.
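
       The definitions above translate directly into code.  This is a minimal
       sketch in plain Perl, not the module's implementation; the "ratio" and
       "stats" helpers, the sample counts, and the choice to return 0 for an
       empty denominator are assumptions made for illustration.

```perl
use strict;
use warnings;

# Safe division: returns 0 when the denominator is 0
# (a convention of this sketch, not necessarily the module's).
sub ratio { my ($n, $d) = @_; return $d ? $n / $d : 0 }

# Compute all five statistics from the four table counts a, b, c, d.
sub stats {
    my ($na, $nb, $nc, $nd) = @_;
    my $total = $na + $nb + $nc + $nd;
    my $p = ratio($na, $na + $nb);
    my $r = ratio($na, $na + $nc);
    return {
        accuracy  => ratio($na + $nd, $total),
        error     => ratio($nb + $nc, $total),
        precision => $p,
        recall    => $r,
        F1        => ratio(2 * $p * $r, $p + $r),
    };
}

# Hypothetical counts: a=8, b=2, c=4, d=86.
my $s = stats(8, 2, 4, 86);
printf "%-9s %.3f\n", $_, $s->{$_} for qw(accuracy error precision recall F1);
```

       For these counts the F1 value computed from precision and recall
       agrees with the "2a/(2a+b+c)" form, and accuracy and error sum to 1.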

       The F1 measure is often the only	simple measure that is worth trying to
       maximize	on its own - consider the fact that you	can get	a perfect
       precision score by always assigning zero	categories, or a perfect
       recall score by always assigning	every category.	 A truly smart system
       will assign the correct categories and only the correct categories,
       maximizing precision and	recall at the same time, and therefore
       maximizing the F1 score.

       Sometimes it's worth trying to maximize the accuracy score, but
       accuracy	(and its counterpart error) are	considered fairly crude	scores
       that don't give much information	about the performance of a
       categorizer.
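
       The two degenerate strategies mentioned above can be compared with a
       small plain-Perl sketch.  The data set (1000 decisions, 10 of which
       truly belong to the category) and the "ratio" and "scores" helpers are
       hypothetical, and returning 0 for an empty denominator is a convention
       of the sketch only.

```perl
use strict;
use warnings;

# Safe division: returns 0 for an empty denominator (sketch convention).
sub ratio { my ($n, $d) = @_; return $d ? $n / $d : 0 }

# Accuracy, recall, and F1 from the four table counts a, b, c, d.
sub scores {
    my ($na, $nb, $nc, $nd) = @_;
    return {
        accuracy => ratio($na + $nd, $na + $nb + $nc + $nd),
        recall   => ratio($na, $na + $nc),
        F1       => ratio(2 * $na, 2 * $na + $nb + $nc),
    };
}

# Hypothetical data: 1000 decisions, 10 of which truly belong to the category.
my %strategies = (
    'assign always' => [ 10, 990, 0,  0   ],   # a, b, c, d
    'assign never'  => [ 0,  0,   10, 990 ],
);

for my $name (sort keys %strategies) {
    my $s = scores(@{ $strategies{$name} });
    printf "%-14s accuracy=%.3f recall=%.3f F1=%.3f\n",
        $name, @{$s}{qw(accuracy recall F1)};
}
```

       Always assigning the category yields perfect recall but a very low F1,
       while never assigning it yields high accuracy on this skewed data and
       an F1 of 0.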

METHODS
       The general execution flow when using this class	is to create a
       "Statistics::Contingency" object, add a bunch of	results	to it, and
       then report on the results.

       o   $e =	Statistics::Contingency->new()

	   Returns a new "Statistics::Contingency" object.  Expects a
	   "categories"	parameter specifying the entire	set of categories that
	   may be assigned during this experiment.  Also accepts a "verbose"
	   parameter - if true,	some diagnostic	status information will	be
	   displayed when certain actions are performed.

       o   $e->add_result($assigned_categories,	$correct_categories, $name)

	   Adds	a new result to	the experiment.	 The lists of assigned and
	   correct categories can be given as an array of category names
	   (strings), as a hash	whose keys are the category names and whose
	   values are anything logically true, or as a single string if	there
	   is only one category.

	   If you've already got the lists in hash form, this will be the
	   fastest way to pass them.  Otherwise, the current implementation
	   will	convert	them to	hash form internally in	order to make its
	   calculations	efficient.

	   The $name parameter is an optional name for this result.  It	will
	   only	be used	in error messages or debugging/progress	output.

	   In the current implementation, we only store	the contingency	tables
	   per category, as well as a table for	the entire result set.	This
	   means that you can't	recover	information about any particular
	   single result from the "Statistics::Contingency" object.

       o   $e->set_entries($a, $b, $c, $d)

	   If you don't wish to use the "add_result()" interface, but still
	   want to take advantage of the calculation methods and the various
	   edge cases they handle, you can directly set the four elements of
	   the contingency table with this method.

       o   $e->micro_accuracy

	   Returns the micro-averaged accuracy for the data set.

       o   $e->micro_error

	   Returns the micro-averaged error for	the data set.

       o   $e->micro_precision

	   Returns the micro-averaged precision	for the	data set.

       o   $e->micro_recall

	   Returns the micro-averaged recall for the data set.

       o   $e->micro_F1

	   Returns the micro-averaged F1 for the data set.

       o   $e->macro_accuracy

	   Returns the macro-averaged accuracy for the data set.

       o   $e->macro_error

	   Returns the macro-averaged error for	the data set.

       o   $e->macro_precision

	   Returns the macro-averaged precision	for the	data set.

       o   $e->macro_recall

	   Returns the macro-averaged recall for the data set.

       o   $e->macro_F1

	   Returns the macro-averaged F1 for the data set.

       o   $e->stats_table

	   Returns a string combining several statistics in one	graphic	table.
	   Because accuracy is 1 minus error, only error is reported, as it
	   takes less space to print.  An optional argument specifies the
	   number of significant digits to show in the data; the default is 3
	   significant digits.

       o   $e->category_stats

	   Returns a hash reference whose keys are the names of	each category,
	   and whose values contain the	various	statistical measures
	   (accuracy, error, precision,	recall,	or F1) about each category as
	   a hash reference.  For example, to print a single statistic:

	    print $e->category_stats->{sports}{recall},	"\n";

	   Or to print certain statistics for all categories:

	    my $stats =	$e->category_stats;
	    while (my ($cat, $value) = each %$stats) {
	      print "Category '$cat': \n";
	      print "  Accuracy: $value->{accuracy}\n";
	      print "  Precision: $value->{precision}\n";
	      print "  F1: $value->{F1}\n";
	    }
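
       The conversion to hash form described under add_result() can be
       pictured with the following plain-Perl sketch.  The "to_set" helper is
       hypothetical and is not the module's own code; it assumes that lists
       are passed as array references and hashes as hash references.

```perl
use strict;
use warnings;

# Hypothetical helper: normalize an array ref, a hash ref, or a plain
# string into a hash ref keyed by category name.
sub to_set {
    my ($cats) = @_;
    return $cats                          if ref $cats eq 'HASH';
    return +{ map { ($_ => 1) } @$cats }  if ref $cats eq 'ARRAY';
    return +{ $cats => 1 };
}

# The three accepted input forms all end up as hash refs.
for my $input ([qw(sports politics)], { sports => 1 }, 'sports') {
    my $set = to_set($input);
    print join(',', sort keys %$set), "\n";
}
```

       A hash reference is returned unchanged in this sketch, which mirrors
       the note that passing hashes avoids a conversion step.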

AUTHOR
       Ken Williams <kwilliams@cpan.org>

COPYRIGHT
       Copyright 2002-2008 Ken Williams.  All rights reserved.

       This distribution is free software; you can redistribute	it and/or
       modify it under the same	terms as Perl itself.

perl v5.24.1			  2013-06-09	    Statistics::Contingency(3)
