Perl Basics and Biostatistics

Variance and Standard Deviation

In this lesson the student will learn how to:
  1. use the sqrt function
  2. calculate the population variance and standard deviation
  3. calculate the sample variance and standard deviation
  4. declare variables with my
  5. import the strict module

By the end of this lesson the student will be able to:

  Write a script which calculates the standard
  deviation for a group of data points.


Much of the data collected from naturally occurring populations falls into a bell-shaped curve. Most of the values plot close to the mean, and the extreme low and high values occur only rarely. This bell-shaped pattern is typical of measurements such as shoe size, height, weight, and IQ.

Variance

The range of your data is simply the distance between your high and low scores. The scatter is the distribution of the individual data points around the mean. The variance quantifies this scatter as the average of the squared deviations of each value from the mean. The equation looks like this:


  Population variance = v^2 = [ summation of (Yi - mean)^2 ] / N

Thus if we collect the following weights of second graders, we can find their average and then their variance like this:

 Weights of second graders:

  44, 55, 52, 53, 66, 70, 75, 78.

  N = 8

  mean = 493 / 8 = 61.625, rounded here to 61.6

  v^2 = [ (44-61.6)^2 + (55-61.6)^2 + (52-61.6)^2 +
          (53-61.6)^2 + (66-61.6)^2 + (70-61.6)^2 +
          (75-61.6)^2 + (78-61.6)^2 ] / 8

      = ( 309.76 + 43.56 + 92.16 + 73.96 + 19.36 + 70.56 + 179.56 + 268.96 ) / 8

      = 1057.88 / 8

      = 132.235


The square root of the population variance is the population standard deviation (SD):

    Population SD = sqrt ( variance )

                  = sqrt ( 132.235 )

                  = 11.50
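
The hand calculation above can be sketched in Perl as follows. This is a minimal illustration, not the lesson's own script: the subroutine name population_variance is made up for this example, and the use strict line and my keyword are explained later in the lesson.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: population variance is the average of the
# squared deviations of each value from the mean.
sub population_variance {
    my @set = @_;
    my $n   = scalar @set;

    # First pass: find the mean.
    my $total = 0;
    $total += $_ foreach @set;
    my $mean = $total / $n;

    # Second pass: sum the squared deviations from the mean.
    my $sum_sq = 0;
    foreach my $y (@set) {
        $sum_sq += ($y - $mean) ** 2;
    }
    return $sum_sq / $n;
}

my @weights  = (44, 55, 52, 53, 66, 70, 75, 78);
my $variance = population_variance(@weights);
my $sd       = sqrt($variance);    # SD is the square root of the variance

printf "Population variance = %.2f\n", $variance;
printf "Population SD       = %.2f\n", $sd;
```

The script keeps the unrounded mean (61.625), so it reports a variance of 132.23 rather than the 132.235 obtained by hand with the mean rounded to 61.6. The SD still rounds to 11.50.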

The formulas above give the variance and standard deviation for an entire population. If the only population we care about is the second graders we measured, those are the formulas to use. If instead we only measured a sample representing a larger population, we use the sample variance and sample standard deviation, which divide by N - 1 rather than N:

  Sample variance = s^2 = [ summation of (Yi - mean)^2 ] / (N - 1)

  Sample SD = sqrt ( sample variance )

So, using our data from above we get:

  Sample Variance = 1057.88 / 7 = 151.13

  Sample SD = 12.29
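
To make the difference between the two divisors concrete, here is a short Perl sketch that computes both versions from the same sum of squared deviations. The helper name sum_squared_deviations is an assumption made for this example only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the summation of (Yi - mean)^2 for a data set.
sub sum_squared_deviations {
    my @set   = @_;
    my $total = 0;
    $total += $_ foreach @set;
    my $mean = $total / scalar @set;

    my $ss = 0;
    $ss += ($_ - $mean) ** 2 foreach @set;
    return $ss;
}

my @weights = (44, 55, 52, 53, 66, 70, 75, 78);
my $n       = scalar @weights;
my $ss      = sum_squared_deviations(@weights);

my $pop_var    = $ss / $n;          # whole population: divide by N
my $sample_var = $ss / ($n - 1);    # sample of a larger population: divide by N - 1

printf "Population: variance %.3f, SD %.3f\n", $pop_var,    sqrt($pop_var);
printf "Sample:     variance %.3f, SD %.3f\n", $sample_var, sqrt($sample_var);
```

Because the script does not round the mean, the sample variance comes out as 151.125 instead of the hand-calculated 151.13. The sample SD agrees with the value above when rounded to two decimal places.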

We will talk more about the meaning of SD and variance in upcoming lessons. For now you should just make sure that you understand how they are calculated.

SD Script

Summation is a very repetitive process, and doing it by hand is tedious. Even with a calculator, computing the SD can be annoying. A Perl script makes the calculation quick and repeatable for a data set of any size. In this lesson you will not be given a full program. You will only be given a code snippet, and you will have to get the program up and running yourself.

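The following code snippet finds the sample variance and then the sample SD:

# @set contains the data set
# $m is the mean calculated from the set
# $n is the number of data items in the set
my $sum = 0;
foreach my $i (@set){
  my $partial = $i - $m;   # deviation from the mean
  $partial *= $partial;    # squared deviation
  $sum += $partial;        # running sum of squared deviations
}
my $variance = $sum / ($n-1);  # sample variance; divide by $n for a population
my $sd = sqrt($variance);      # sample SD is the square root of the variance
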
The foreach loop simply loads each value contained in @set into $i. Then the repetitive arithmetic is performed in the loop.

Consider the following points:

Importing the Strict Module

The following script example shows you how to load the strict module:

#!/usr/bin/perl
use strict;

my $dog = "FREDDY";
my $cat = "JAYJAY";
my @fish = ("BURT", "BETTY", "BETH", "BOB", "BART" );

print "PETS: $dog, $cat";
foreach my $i (@fish){
  print ", $i";
}
print "\n";
exit;

The directive "use strict;" puts us in what we could call strict mode. This forces us to declare each variable with the keyword my before we use it. Once we have declared a variable, we no longer put my in front of it. If we try to use a variable that was never declared with my, Perl refuses to run the script and prints an error message of the form: Global symbol "$name" requires explicit package name. When you are writing short scripts it is hard to see the value of strict and my variables, but as a script gets long they help you avoid accidentally using the same variable name for what you think of as two distinct variables, and they catch misspelled variable names.
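
The effect of strict mode can be seen without crashing a script by compiling a small piece of code from a string and capturing the error. The sketch below is only an illustration: the subroutine name compile_error is made up for this example, and string eval is a technique that goes beyond what this lesson covers.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: compile and run a snippet held in a string.
# Returns an empty string on success, or Perl's error message on failure.
sub compile_error {
    my ($code) = @_;
    my $ok = eval $code;    # errors are stored in $@ instead of ending the script
    return $ok ? '' : $@;
}

# The same assignment, once without my and once with my.
my $undeclared = 'use strict; $dog = "FREDDY"; 1;';
my $declared   = 'use strict; my $dog = "FREDDY"; 1;';

my $error = compile_error($undeclared);
print "Without my: $error";

my $result = compile_error($declared) eq '' ? "no error" : "error";
print "With my:    $result\n";
```

The first line of output contains the Global symbol message, because $dog was used without being declared. The second snippet compiles cleanly.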

Try removing all the my's and leaving in the use strict line and see what happens when you attempt to execute this script. Then put the my's back in and execute it again.

Assignment

Write a script which takes a data set of any size as input from the user and then reports the mean, the sample variance, and the sample SD. You must use strict and my variables in this assignment.
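
As a starting point for the input step only, here is one possible way to turn a line of text into a data set. It is a sketch under assumptions: the helper name parse_numbers is made up, the values are assumed to be separated by commas or spaces, and a fixed string stands in for the line typed by the user.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split a line on commas and/or whitespace and
# keep only the pieces that look like plain decimal numbers.
sub parse_numbers {
    my ($line) = @_;
    my @pieces = split /[\s,]+/, $line;
    return grep { /^-?\d+(?:\.\d+)?$/ } @pieces;
}

# In a real script the line would come from the user, for example:
#   my $line = <STDIN>;
#   chomp $line;
my $line = "44, 55, 52, 53, 66, 70, 75, 78";

my @set = parse_numbers($line);
my $n   = scalar @set;

print "Read $n values: @set\n";
```

The mean, sample variance, and sample SD are left for you to add, using the snippet from the SD Script section.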