Python Basics and Biostatistics

Variance and Standard Deviation

In this lesson the student will learn how to:
  1. use the sqrt function from the math module
  2. calculate the population variance and standard deviation
  3. calculate the sample variance and standard deviation
By the end of this lesson the student will be able to:

  Write a script which calculates the standard
  deviation for a group of data points.


Many measurements collected from naturally occurring populations fall approximately along a bell-shaped curve. The vast majority of the data will plot close to the mean, and the extreme low and high values will have few occurrences. This bell-shaped pattern is typical of measurements such as shoe size, height, weight, and IQ.

Variance

The range of your data is simply the distance between your highest and lowest values. The scatter describes how the individual data points are distributed around the mean. It is quantified by the variance, which is the average of the squared deviations of each value from the mean. The equation looks like this:


  Population variance = v^2 = [ summation (Yi - mean)^2 ] / N

Thus if we collect the following weights of second graders, we can find their average and then their variance like this:

 Weights of second graders:

  44, 55, 52, 53, 66, 70, 75, 78.

  N = 8

  mean = 493 / 8 = 61.6 (rounded from 61.625)

  v^2 = [ (44-61.6)^2 + (55-61.6)^2 + (52-61.6)^2 +
          (53-61.6)^2 + (66-61.6)^2 + (70-61.6)^2 +
          (75-61.6)^2 + (78-61.6)^2 ] / 8

      = ( 309.76 + 43.56 + 92.16 + 73.96 + 19.36 + 70.56 + 179.56 + 268.96 ) / 8

      = 1057.88 / 8

      = 132.235


The square root of the population variance is the population standard deviation (SD):

    Population SD = sqrt( variance )

                  = sqrt( 132.235 )

                  = 11.50
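
The hand calculation above can be checked with a few lines of Python. This is only a minimal sketch, not the lesson's official script, and the variable names (weights, squared_deviations, and so on) are chosen here for illustration. Because Python keeps the unrounded mean (61.625) rather than 61.6, the variance comes out as 132.234375 instead of 132.235; the SD still rounds to 11.50.

```python
import math

# Weights of the second graders from the worked example
weights = [44, 55, 52, 53, 66, 70, 75, 78]

n = len(weights)
mean = sum(weights) / n  # 493 / 8 = 61.625

# Square each deviation from the mean, then average the squares
squared_deviations = [(w - mean) ** 2 for w in weights]
population_variance = sum(squared_deviations) / n

# The population SD is the square root of the population variance
population_sd = math.sqrt(population_variance)

print("mean:", mean)
print("population variance:", population_variance)
print("population SD:", round(population_sd, 2))
```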

The equations above give the variance and standard deviation for an entire population. If the only population we care about is the second graders we measured, we use these equations. If, however, we measured only a sample that represents a larger population, we use the sample variance and sample standard deviation, which divide by N - 1 instead of N. Here are the equations for the sample variance and sample standard deviation:

  Sample Variance = s^2 = [ summation (Yi - mean)^2 ] / (N - 1)

  Sample SD = sqrt(Sample_variance)

So, using our data from above we get:

  Sample Variance = 1057.88 / 7 = 151.13

  Sample SD = sqrt( 151.13 ) = 12.29
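
The only difference between the population and sample formulas is the divisor. The sketch below illustrates this with a helper called variance_and_sd; that name and its sample flag are invented here for illustration and are not part of the lesson. With the unrounded mean, the sample variance is 151.125, which agrees with the hand result to within rounding.

```python
import math

def variance_and_sd(data, sample=True):
    """Return (variance, SD) for a list of numbers.

    Divides by N - 1 when sample is True and by N otherwise.
    A sample needs at least two data points, or the divisor is zero.
    """
    n = len(data)
    mean = sum(data) / n
    sum_of_squares = sum((y - mean) ** 2 for y in data)
    divisor = n - 1 if sample else n
    variance = sum_of_squares / divisor
    return variance, math.sqrt(variance)

weights = [44, 55, 52, 53, 66, 70, 75, 78]
sample_variance, sample_sd = variance_and_sd(weights, sample=True)
print("sample variance:", sample_variance)
print("sample SD:", round(sample_sd, 2))
```

Calling variance_and_sd(weights, sample=False) gives the population values instead.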

We will talk more about the meaning of SD and variance in upcoming lessons. For now, just make sure that you understand how they are calculated.

SD Script

Performing summation is a very repetitive process, and doing it by hand is tedious. Even with a calculator, the SD calculation can be annoying. A Python script makes calculating the standard deviation quick and painless. In this lesson, you will not be given a full program. You will be given only a code snippet, and you will have to get the program up and running yourself.

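The following code snippet finds the sample variance and the sample SD:

    import math

    # data contains the data set
    # m is the mean calculated from the data set
    # n is the number of data items in the data set
    total = 0
    for i in data:
        partial = i - m       # deviation of this value from the mean
        partial *= partial    # square the deviation
        total += partial      # running sum of the squared deviations

    variance = total / (n - 1)   # sample variance (divide by n for a population)
    sd = math.sqrt(variance)     # sample SD

The for loop simply loads each value contained in data into i. Then the repetitive arithmetic is performed in the loop. Note the order of the last two steps: the sum of squared deviations is divided by n - 1 first, and the square root is taken last.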
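
Once your own script runs, one way to check its output is Python's standard statistics module, which provides pvariance and pstdev for a population and variance and stdev for a sample. The sketch below applies them to the second-grader weights; it is meant as a cross-check, not as a replacement for writing the loop yourself.

```python
import statistics

# Weights of the second graders from the worked example
weights = [44, 55, 52, 53, 66, 70, 75, 78]

# Population formulas (divide by N)
population_variance = statistics.pvariance(weights)
population_sd = statistics.pstdev(weights)

# Sample formulas (divide by N - 1)
sample_variance = statistics.variance(weights)
sample_sd = statistics.stdev(weights)

print("population variance:", population_variance)
print("population SD:", round(population_sd, 2))
print("sample variance:", sample_variance)
print("sample SD:", round(sample_sd, 2))
```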

Assignment

Write a script that takes a data set of any size as input from the user and then reports the mean, the population SD, the sample variance, and the sample SD.