Regression(3)	      User Contributed Perl Documentation	 Regression(3)

NAME
       Statistics::Regression - weighted linear regression package (line+plane
       fitting)

SYNOPSIS

	 use Statistics::Regression;

	 # Create regression object
         my $reg = Statistics::Regression->new( "sample regression", [ "const", "someX", "someY" ] );

	 # Add data points
	 $reg->include(	2.0, [ 1.0, 3.0, -1.0 ]	);
	 $reg->include(	1.0, [ 1.0, 5.0, 2.0 ] );
	 $reg->include(	20.0, [	1.0, 31.0, 0.0 ] );
	 $reg->include(	15.0, [	1.0, 11.0, 2.0 ] );


	 my %d;
         $d{const} = 1.0; $d{someX} = 5.0; $d{someY} = 2.0; $d{ignored} = "anything else";
         $reg->include( 3.0, \%d );  # names are picked off the Regression specification

       Please note that	*you* must provide the constant	if you want one.

         # Finally, print the result
         $reg->print();

       This prints the following:

	 Regression 'sample regression'
	 Name			      Theta	     StdErr	T-stat
	 [0='const']		     0.2950	     6.0512	  0.05
	 [1='someX']		     0.6723	     0.3278	  2.05
	 [2='someY']		     1.0688	     2.7954	  0.38

	 R^2= 0.808, N=	4

       The hash input method has the advantage that you can fill the
       observation hashes with all your variables and use the same code to
       run each regression, changing the regression specification in one and
       only one spot (the new() invocation).  You do not need to change the
       inputs in the include() statement.  For example,

         my @obs;  ## a global variable.  observations are hash refs: %oneobs = %{$obs[1]};

         sub run_regression {
           my ( $name, $yname, $xnames ) = @_;
           my $reg = Statistics::Regression->new( $name, $xnames );
           ## each observation hash supplies both the y value and the x values
           foreach my $obshashptr (@obs) { $reg->include( $obshashptr->{$yname}, $obshashptr ); }
           $reg->print();
         }

         run_regression( "bivariate regression",  "someY", [ "const", "someX" ] );
         run_regression( "trivariate regression", "someY", [ "const", "someX", "someZ" ] );

       Of course, you can use the subroutines to do the printing work yourself:

	 my @theta  = $reg->theta();
	 my @se	    = $reg->standarderrors();
	 my $rsq    = $reg->rsq();
	 my $adjrsq = $reg->adjrsq();
	 my $ybar   = $reg->ybar();  ##	the average of the y vector
	 my $sst    = $reg->sst();  ## the sum-squares-total
	 my $sigmasq= $reg->sigmasq();	## the variance	of the residual
	 my $k	    = $reg->k();   ## the number of variables
	 my $n	    = $reg->n();   ## the number of observations
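
       As a rough illustration of doing the printing work by hand, the sketch
       below builds a coefficient table with t-statistics.  It is plain Perl
       and does not load the module: the numbers are copied from the sample
       output above, and in real use @theta and @se would come from
       $reg->theta() and $reg->standarderrors().  The helper name tstats is
       hypothetical.

```perl
use strict;
use warnings;

# Values copied from the sample output; with the module loaded they would be
# obtained from $reg->theta() and $reg->standarderrors().
my @names = ( "const", "someX", "someY" );
my @theta = ( 0.2950, 0.6723, 1.0688 );
my @se    = ( 6.0512, 0.3278, 2.7954 );

# t-statistic for the null hypothesis that a coefficient is zero:
# the estimate divided by its standard error.  (Hypothetical helper.)
sub tstats {
    my ( $theta, $se ) = @_;
    return map { $theta->[$_] / $se->[$_] } 0 .. $#$theta;
}

my @t = tstats( \@theta, \@se );
for my $i ( 0 .. $#names ) {
    printf "%-10s %10.4f %10.4f %8.2f\n", $names[$i], $theta[$i], $se[$i], $t[$i];
}
```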

       In addition, there are some other helper	routines, and a	subroutine
       linearcombination_variance().  If you don't know	what this is, don't
       use it.

       You should have an understanding	of OLS regressions if you want to use
       this package.  You can get this from an introductory college
       econometrics class and/or from most intermediate	college	statistics
       classes.	 If you	do not have this background knowledge, then this
       package will remain a mystery to	you.  There is no support for this
       package--please don't expect any.

DESCRIPTION
       Statistics::Regression is a multivariate linear regression package.
       That is, it estimates the c coefficients for a line-fit of the type

         y = c(0)*x(0) + c(1)*x(1) + c(2)*x(2) + ... + c(k)*x(k)

       given a data set of N observations, each with k independent x variables
       and one y variable.  Naturally, N must be greater than k, and
       preferably considerably greater.  Any reasonable undergraduate
       statistics book will explain what a regression is.  Most of the time,
       the user will provide a constant ('1') as x(0) for each observation in
       order to allow the regression package to fit an intercept.
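
       To make the fitted model concrete, here is a minimal sketch that
       estimates the c coefficients for the four array-style observations from
       the synopsis by solving the normal equations (X'X) c = X'y directly.
       This is the textbook method, not the incremental algorithm the module
       uses, and ols_fit is a hypothetical helper name.  The result should
       agree with the Theta column of the sample output.

```perl
use strict;
use warnings;

# Solve the normal equations (X'X) c = X'y by Gauss-Jordan elimination.
# ols_fit is a hypothetical helper, not part of Statistics::Regression.
sub ols_fit {
    my ( $ys, $xs ) = @_;    # $ys: [ y, ... ]   $xs: [ [ x(0), x(1), ... ], ... ]
    my $k = scalar @{ $xs->[0] };

    # Build the augmented matrix [ X'X | X'y ].
    my @m = map { [ (0) x ( $k + 1 ) ] } 1 .. $k;
    for my $n ( 0 .. $#$ys ) {
        my $x = $xs->[$n];
        for my $i ( 0 .. $k - 1 ) {
            $m[$i][$_] += $x->[$i] * $x->[$_] for 0 .. $k - 1;
            $m[$i][$k] += $x->[$i] * $ys->[$n];
        }
    }

    # Gauss-Jordan elimination with partial pivoting.
    for my $col ( 0 .. $k - 1 ) {
        my ($piv) = sort { abs( $m[$b][$col] ) <=> abs( $m[$a][$col] ) } $col .. $k - 1;
        @m[ $col, $piv ] = @m[ $piv, $col ];
        my $p = $m[$col][$col];
        die "singular X'X\n" if $p == 0;
        $_ /= $p for @{ $m[$col] };
        for my $row ( grep { $_ != $col } 0 .. $k - 1 ) {
            my $f = $m[$row][$col];
            $m[$row][$_] -= $f * $m[$col][$_] for 0 .. $k;
        }
    }
    return map { $_->[$k] } @m;    # last column now holds the coefficients
}

# Each x row starts with the constant 1.0 so that c(0) is the intercept.
my @theta = ols_fit(
    [ 2.0, 1.0, 20.0, 15.0 ],
    [ [ 1.0, 3.0, -1.0 ], [ 1.0, 5.0, 2.0 ], [ 1.0, 31.0, 0.0 ], [ 1.0, 11.0, 2.0 ] ],
);
printf "%.4f %.4f %.4f\n", @theta;
```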

   Original Algorithm (ALGOL-60):
               W. M. Gentleman, University of Waterloo, "Basic
               Procedures For Large, Sparse Or Weighted Linear Least
               Squares Problems (Algorithm AS 75)," Applied Statistics
               (1974) Vol 23; No. 3

       Gentleman's algorithm is the statistical standard.  Observations can
       be added one at a time, each with its own weight, and each insertion
       takes time that is only quadratic in the number of independent
       variables.  The storage requirement is also quadratic in the number
       of independent variables and does not grow with the number of
       observations, so a practically unlimited number of observations can
       be processed.
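
       The one-observation-at-a-time update can be sketched in plain Perl as
       follows.  This is an illustration of the square-root-free Givens update
       from AS 75 under stated assumptions, not the module's actual source:
       the names new_state, include_obs and solve_theta are hypothetical, and
       a full k-by-k array is used for R for readability where the original
       algorithm packs only the upper triangle.  The data are the four
       array-style observations from the synopsis.

```perl
use strict;
use warnings;

# Hypothetical state: D (diagonal), R (unit upper triangular, only j > i used),
# thetabar (rotated right-hand side) and the running sum of squared errors.
sub new_state {
    my ($k) = @_;
    return {
        k    => $k,
        d    => [ (0) x $k ],
        r    => [ map { [ (0) x $k ] } 1 .. $k ],
        tbar => [ (0) x $k ],
        sse  => 0,
    };
}

# Rotate one weighted observation into the factorization.
sub include_obs {
    my ( $s, $y, $xref, $w ) = @_;
    $w = 1 unless defined $w;
    my @x = @$xref;                     # work on a copy; it is modified below
    my $k = $s->{k};
    for my $i ( 0 .. $k - 1 ) {
        return if $w == 0;              # nothing left to rotate in
        my $xi = $x[$i];
        next if $xi == 0;
        my $di   = $s->{d}[$i];
        my $dpi  = $di + $w * $xi * $xi;
        my $cbar = $di / $dpi;
        my $sbar = $w * $xi / $dpi;
        $w *= $cbar;
        $s->{d}[$i] = $dpi;
        for my $j ( $i + 1 .. $k - 1 ) {
            my $xj = $x[$j];
            $x[$j] = $xj - $xi * $s->{r}[$i][$j];
            $s->{r}[$i][$j] = $cbar * $s->{r}[$i][$j] + $sbar * $xj;
        }
        my $yold = $y;
        $y = $yold - $xi * $s->{tbar}[$i];
        $s->{tbar}[$i] = $cbar * $s->{tbar}[$i] + $sbar * $yold;
    }
    $s->{sse} += $w * $y * $y;
    return;
}

# Back-substitution: solve R theta = thetabar.
sub solve_theta {
    my ($s) = @_;
    my $k = $s->{k};
    my @theta = (0) x $k;
    for my $i ( reverse 0 .. $k - 1 ) {
        $theta[$i] = $s->{tbar}[$i];
        $theta[$i] -= $s->{r}[$i][$_] * $theta[$_] for $i + 1 .. $k - 1;
    }
    return @theta;
}

my $s = new_state(3);
include_obs( $s, 2.0,  [ 1.0, 3.0,  -1.0 ] );
include_obs( $s, 1.0,  [ 1.0, 5.0,  2.0 ] );
include_obs( $s, 20.0, [ 1.0, 31.0, 0.0 ] );
include_obs( $s, 15.0, [ 1.0, 11.0, 2.0 ] );
printf "%.4f %.4f %.4f\n", solve_theta($s);
```

       Each call touches only the k-by-k state, which is why neither the time
       per observation nor the storage depends on N.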

   Internal Data Structures
       R=Rbar is an upper-right triangular matrix, kept in normalized form
       with implicit 1's on the diagonal.  D is a diagonal scaling matrix.
       These correspond to "standard Regression usage" as

		       X' X  = R' D R

       A back-substitution routine (in thetacov) makes it possible to invert
       the R matrix (the inverse is upper-right triangular, too!).  Call this
       matrix H, that is, H=R^(-1).

                 (X' X)^(-1) = [(R' D^(1/2)') (D^(1/2) R)]^(-1)
                             = [ R^(-1) D^(-1/2) ] [ R^(-1) D^(-1/2) ]'
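
       The identity above can be checked on a toy case.  The sketch below
       inverts a unit upper-triangular R by back-substitution and forms
       H D^(-1) H', which equals [ R^(-1) D^(-1/2) ] [ R^(-1) D^(-1/2) ]'.
       It is an illustration only and not the module's thetacov routine; the
       helper names invert_unit_upper and xtx_inverse are hypothetical, and
       the 2-by-2 values for R and D are made up for the example.

```perl
use strict;
use warnings;

# Invert a unit upper-triangular matrix R by back-substitution (H = R^-1).
# (Hypothetical helper.)
sub invert_unit_upper {
    my ($r) = @_;
    my $k = scalar @$r;
    my @h = map { [ (0) x $k ] } 1 .. $k;
    for my $i ( reverse 0 .. $k - 1 ) {
        $h[$i][$i] = 1;
        for my $j ( $i + 1 .. $k - 1 ) {
            my $sum = 0;
            $sum += $r->[$i][$_] * $h[$_][$j] for $i + 1 .. $j;
            $h[$i][$j] = -$sum;
        }
    }
    return \@h;
}

# (X'X)^-1 = H D^-1 H', with D given as a list of diagonal entries.
# (Hypothetical helper.)
sub xtx_inverse {
    my ( $r, $d ) = @_;
    my $h = invert_unit_upper($r);
    my $k = scalar @$d;
    my @inv;
    for my $i ( 0 .. $k - 1 ) {
        for my $j ( 0 .. $k - 1 ) {
            my $sum = 0;
            $sum += $h->[$i][$_] * $h->[$j][$_] / $d->[$_] for 0 .. $k - 1;
            $inv[$i][$j] = $sum;
        }
    }
    return \@inv;
}

# Toy example: R = [[1, 2], [0, 1]], D = diag(4, 3), so R' D R = [[4, 8], [8, 19]].
my $inv = xtx_inverse( [ [ 1, 2 ], [ 0, 1 ] ], [ 4, 3 ] );
printf "%.4f %.4f\n", @$_ for @$inv;
```

       Multiplying the diagonal of this matrix by the residual variance and
       taking square roots gives the coefficient standard errors.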

BUGS
       None known.

       Perl Problem
           Unfortunately, perl is unaware of IEEE number representations.
           This makes it a pain to test whether an observation contains any
           missing variables (coded as 'NaN' in this module).
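
       One common way to test for missing values in plain Perl relies on the
       fact that NaN is the only numeric value that is not equal to itself.
       The sketch below is an assumption about how a caller might screen
       observations before passing them on; it is not taken from the module,
       and the helper names is_nan and has_missing are hypothetical.

```perl
use strict;
use warnings;

# NaN compares unequal to everything, including itself.  (Hypothetical helper.)
sub is_nan {
    my ($x) = @_;
    return $x != $x ? 1 : 0;
}

# Number of NaN entries among the y value and the x values.  (Hypothetical helper.)
sub has_missing {
    my ( $y, $xs ) = @_;
    return scalar grep { is_nan($_) } ( $y, @$xs );
}

my $inf = 9**9**9;        # overflows to infinity on IEEE-754 platforms
my $nan = $inf - $inf;    # infinity minus infinity is NaN

print has_missing( 2.0, [ 1.0, 3.0,  -1.0 ] ) ? "skip\n" : "keep\n";
print has_missing( 2.0, [ 1.0, $nan, -1.0 ] ) ? "skip\n" : "keep\n";
```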

HISTORY
       2007/04/04:  Added coefficient standard errors.

       2007/07/01:  Added self-test use (if the module file is invoked
                    directly with perl) at the end.  Cleaned up some print
                    sprintf.  Changed syntax on new() to eliminate passing K.

       2007/07/07:  Allowed passing hash with names to include().

AUTHORS
       Naturally, Gentleman invented this algorithm.  It was adapted by Ivo
       Welch.  Alan Miller (alan@dmsmelb.mel.dms.CSIRO.AU) pointed out nicer
       ways to compute the R^2.  Ivan Tubert-Brohman helped wrap the module
       as a standard CPAN distribution.

LICENSE
       This module is released for free public use under a GPL license.

       (C) Ivo Welch, 2001,2004, 2007.

perl v5.24.1			  2007-07-07			 Regression(3)

