# FreeBSD Manual Pages

```Statistics::CaseResampUser(Contributed Perl DocumStatistics::CaseResampling(3)

NAME
Statistics::CaseResampling - Efficient resampling and calculation of
medians with confidence intervals

SYNOPSIS
use Statistics::CaseResampling	':all';

my \$sample = [1,3,5,7,1,2,9]; # ... usually MUCH more data ...
my \$confidence	= 0.95;	# ~2*sigma or "90% within confidence limits"
#my \$confidence = 0.37; # ~1*sigma or "~66% within confidence limits"

# calculate the median	of the sample with lower and upper confidence
# limits using	resampling/bootstrapping:
my (\$lower_cl,	\$median, \$upper_cl)
= median_simple_confidence_limits(\$sample, \$confidence);

# There are many auxiliary functions:

my \$resampled = resample(\$sample);
# \$resampled is now a random set of measurements from \$sample,
# including potential duplicates

my \$medians = resample_medians(\$sample, \$n_resamples);
# \$medians is now an array reference containing the medians
# of \$n_resamples resample runs
# This	is vastly more efficient that doing the	same thing with
# repeated resample() calls!
# Analogously:
my \$means = resample_means(\$sample, \$n_resamples);

# you can get the cl's	from a set of separately resampled medians, too:
my (\$lower_cl,	\$median, \$upper_cl)
= simple_confidence_limits_from_samples(\$median, \$medians, \$confidence);

# utility functions:
print median([1..5]), "\n"; # prints 3
print mean([1..5]), "\n"; # prints 3, too, surprise!
print select_kth([1..5], 1), "\n"; # inefficient way to calculate the minimum

DESCRIPTION
The purpose of this (XS)	module is to calculate the median (or in
principle also other statistics)	with confidence	intervals on a sample.
To do that, it uses a technique called bootstrapping. In	a nutshell, it
resamples the sample a lot of times and for each	resample, it
calculates the median.  From the	distribution of	medians, it then
calculates the confidence limits.

In order	to implement the confidence limit calculation, various other
functions had to	be implemented efficiently (both algorithmically
efficient and done in C). These functions may be	useful in their	own
right and are thus exposed to Perl. Most	notably, this exposes a	median
(and general selection) algorithm that works in linear time as opposed
to the trivial implementation that requires "O(n*log(n))".

Random numbers
The resampling involves drawing many random numbers. Therefore, the
module comes with an embedded Mersenne twister (taken from
"Math::Random::MT").

If you want to change the seed of the RNG, do this:

\$Statistics::CaseResampling::Rnd
= Statistics::CaseResampling::RdGen::setup(\$seed);

or

\$Statistics::CaseResampling::Rnd
= Statistics::CaseResampling::RdGen::setup(@seed);

Do not use the embedded random number generator for other purposes.
Use "Math::Random::MT" instead! At this point, you cannot change	the
type of RNG.

EXPORT
None by default.

Can export any of the functions that are	documented below using
standard	"Exporter" semantics, including	the customary ":all" group.

FUNCTIONS
This list of functions is loosely sorted	from basic to comprehensive
because the more	complicated functions are usually (under the hood, in
C) implemented using the	basic building blocks. Unfortunately, that
means you may want to read the documentation backwards. :)

Additionally, there is a	whole set of general purpose, fast (XS)
functions for calculating statistical metrics. They're useful without
the bootstrapping related stuff,	so they're listed in the "OTHER
FUNCTIONS" section below.

All of these functions are written in C for speed.

resample(ARRAYREF)
Returns a reference to an array containing N random elements from the
input array, where N is the length of the original array.

This is different from shuffling	in that	it's random drawing without
removing	the drawn elements from	the sample.

resample_medians(ARRAYREF, NMEDIANS)
Returns a reference to an array containing the medians of "NMEDIANS"
resamples of the	original input sample.

resample_means(ARRAYREF, NMEANS)
Returns a reference to an array containing the means of "NMEANS"
resamples of the	original input sample.

simple_confidence_limits_from_median_samples(STATISTIC, STATISTIC_SAMPLES,
CONFIDENCE)
Calculates the confidence limits	for STATISTIC. Returns the lower
confidence limit, the statistic,	and the	upper confidence limit.
STATISTIC_SAMPLES is the	output of, for example,
"resample_means(\@sample)".  CONFIDENCE indicates the fraction of data
within the confidence limits (assuming a	normal,	symmetric distribution
of the statistic	=> simple confidence limits).

For example, to get the 90% confidence (~2 sigma) interval for the mean
of your sample, you can do the following:

my \$sample = [<numbers>];
my \$means = resample_means(\$sample, \$n_resamples);
my (\$lower_cl,	\$mean, \$upper_cl)
= simple_confidence_limits_from_samples(mean(\$sample), \$means, 0.90);

If you want to apply this logic to other	statistics such	as the first
or third	quartile, you have to manually resample	and lose the advantage
of doing	it in C:

my \$sample = [<numbers>];
my \$quartiles = [];
foreach (1..1000) {
push	@\$quartiles, first_quartile(resample(\$sample));
}
my (\$lower_cl,	\$firstq, \$upper_cl)
= simple_confidence_limits_from_samples(
first_quartile(\$sample),	\$quartiles, 0.90
);

For a reliable calculation of the confidence limits, you	should use at
least 1000 samples if not more. Yes. This whole procedure is expensive.

For medians, however, use the following function:

median_simple_confidence_limits(SAMPLE, CONFIDENCE, [NSAMPLES])
Calculates the confidence limits	for the	"CONFIDENCE" level for the
median of SAMPLE. Returns the lower confidence limit, the median, and
the upper confidence limit.

In order	to calculate the limits, a lot of resampling has to be done.
NSAMPLES	defaults to 1000. Try running this a couple of times on	your
data interactively to see how the limits	still vary a little bit	at
this setting.

OTHER FUNCTIONS
approx_erf(\$x)
Calculates an approximatation of	the error function of x.  Implemented
after

Winitzki, Sergei (6 February 2008).
"A handy approximation	for the	error function and its inverse"	(PDF).
http://homepages.physik.uni-muenchen.de/~Winitzki/erf-approx.pdf

Quoting:	Relative precision better than "1.3e-4".

approx_erf_inv(\$x)
Calculates an approximation of the inverse of the error function	of x.

Algorithm from the same source as "approx_erf".

Quoting:	Relative precision better than "2e-3".

nsigma_to_alpha(\$nsigma)
Calculates the probability that a measurement from a normal
distribution is further away from the mean than \$nsigma standard
deviations.

The confidence level (what you pass as the "CONFIDENCE" parameter to
some functions in this module) is "1 - nsigma_to_alpha(\$nsigma)".

alpha_to_nsigma(\$alpha)
Inverse of "nsigma_to_alpha()".

mean(ARRAYREF)
Calculates the mean of a	sample.

median(ARRAYREF)
Calculates the median (second quartile) of a sample. Works in linear
time thanks to using a selection	instead	of a sort.

Unfortunately, the way this is implemented, the median of an even
number of parameters is,	here, defined as the "n/2-1"th largest number
and not the average of the "n/2-1"th and	the "n/2"th number. This
shouldn't matter	for nontrivial sample sizes!

median_absolute_deviation(ARRAYREF)
Calculates the median absolute deviation	(MAD) in what I	believe	is
O(n).  Take care	to rescale the MAD before using	it in place of a
standard	deviation.

first_quartile(ARRAYREF)
Calculates the first quartile of	the sample.

third_quartile(ARRAYREF)
Calculates the third quartile of	the sample.

select_kth(ARRAYREF,	K)
Selects the kth smallest	element	from the sample.

This is the general function that implements the	median/quartile
calculation. You	get the	median with this definition of K:

my \$k = int(@\$sample/2) + (@\$sample & 1);
my \$median = select_kth(\$sample, \$k);

sample_standard_deviation
Given the sample	mean and an anonymous array of numbers (the sample),
calculates the sample standard deviation.

population_standard_deviation
Same as sample_standard_deviation, but without the correction to	"N".

TODO
One could calculate more	statistics in C	for performance.

Math::Random::MT

On the approximation of the error function:

Winitzki, Sergei (6 February 2008).
"A handy approximation	for the	error function and its inverse"	(PDF).
http://homepages.physik.uni-muenchen.de/~Winitzki/erf-approx.pdf

The ~O(n) median	implementation is based	on C.A.R. Hoare's quickselect
algorithm. See
<http://en.wikipedia.org/wiki/Selection_algorithm#Partition-based_general_selection_algorithm>.
Right now, it does not implement	the Median of Medians algorithm	that
would guarantee linearity.

AUTHOR
Steffen Mueller,	<smueller@cpan.org>

Daniel Dragan, <bulk88@hotmail.com>, who	supplied MSVC compatibility
patches.