Perl Basics and Biostatistics

The Prediction Interval

In this lesson the student will learn how to:

calculate the prediction interval
store and retrieve values from a hash
look up values stored in a hash

By the end of this lesson the student will be able to:


  Write a Perl script which calculates the prediction interval
  for a group of data points based on mean, SD, and N.

Let's say some governmental agency wants to know the average weight of ten-year-old girls in the United States. Let's assume that you successfully measured a representative sample of the ten-year-old female population, but you only managed to measure 100 girls. How confident can you be that the statistics you derive from this data apply to the overall population of ten-year-old girls in the United States? The prediction interval converts our sample statistics into a range of values within which we can be reasonably sure a given share of the population falls.

The Prediction Interval

Assuming that we are dealing with a Gaussian population, 95% of the population will fall within 1.96 SDs of the population mean. This means that, given four vital pieces of information, we can predict a range of scores in which 95% of the population will fall:

that the distribution of scores is Gaussian
the sample mean
the sample SD
the sample size

The larger the sample size, the more precise our prediction interval will be. With small samples the range has to widen to account for likely discrepancies between the sample mean and SD and those of the overall population. Here are the coefficients we use to calculate the 95% Prediction Interval:

N	K
2	15.56
3	4.97
4	3.56
5	3.04
6	2.78
7	2.62
8	2.51
9	2.43
10	2.37
11	2.33
12	2.29
13	2.26
14	2.24
15	2.22
16	2.20
17	2.18
18	2.17
19	2.16
20	2.14
25	2.10
30	2.08
35	2.06
40	2.05
50	2.03
60	2.02
70	2.01
80	2.00
90	2.00
100	1.99
200	1.98
infinity	1.96

In this table K represents the number of SDs from the mean. Given a sufficiently large sample, we know from this table that 95% of the population will fall within 1.96 SDs of the mean. If our sample has only 20 members, however, we must widen the range: we predict that 95% of the population will fall within 2.14 SDs of the sample mean. (Remember that we are assuming our sample is actually representative of the overall population.) So, all we need to know is the sample size, sample SD, and sample mean to calculate the prediction interval:


   SD = 10.0
   mean = 100
   K = 1.96

   PI(upper) = 100 + 10.0 * 1.96
             = 119.6
   
   PI(lower) = 100 - 10.0 * 1.96
             = 80.4
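
The same arithmetic can be sketched in Perl. This is only an illustration: the subroutine name prediction_interval is a hypothetical helper, not part of the lesson, and K is passed in directly rather than looked up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the lesson): returns the lower and
# upper limits of the prediction interval, mean +/- K * SD.
sub prediction_interval {
    my ($mean, $sd, $k) = @_;
    my $margin = $k * $sd;
    return ($mean - $margin, $mean + $margin);
}

# Values from the worked example above: mean 100, SD 10.0, K 1.96.
my ($lower, $upper) = prediction_interval(100, 10.0, 1.96);

printf "PI(upper) = %.1f\n", $upper;
printf "PI(lower) = %.1f\n", $lower;
```

Returning both limits as a list lets the caller capture them in one assignment. Formatting with printf and %.1f keeps the output at one decimal place, as in the hand calculation.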

If our sample size is small, however, we cannot use K = 1.96. For instance, if our sample size is only 40, then K = 2.05. Here's what this would do to our calculation:


   SD = 10.0
   mean = 100
   K = 2.05

   PI(upper) = 100 + 10.0 * 2.05
             = 120.5
   
   PI(lower) = 100 - 10.0 * 2.05
             = 79.5

The definition of 95% prediction interval: "Each new observation has a 95% chance of being within the interval."
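
In symbols, both worked examples compute the first expression below. The lesson does not say how the K values were derived. The second expression is offered only as an observation: the tabulated values appear consistent with the standard prediction-interval formula for a single new observation from a Gaussian sample. The symbols \bar{x} (sample mean) and t^{*} (the two-sided 95% critical value of Student's t with N - 1 degrees of freedom) are notation introduced here, not taken from the lesson.

```latex
\[
\text{PI} = \bar{x} \pm K \cdot SD
\]
\[
K = t^{*}_{0.975,\,N-1}\,\sqrt{1 + \frac{1}{N}}
\]
```

Under that reading, as N grows the square-root factor approaches 1 and the t critical value approaches 1.96, which matches the last row of the table.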

Perl Script for Calculating the Prediction Interval

To calculate the prediction interval we will use hashes. Here is a simple Perl script illustrating the use of hashes:

#!/usr/bin/perl -w

# Build a hash from a flat list of key, value pairs.
my %squares = ( 1, 1, 3, 9, 5, 25, 7, 49, 9, 81 );

# keys() and values() return their lists in no particular order.
my @keys = keys(%squares);
my @vals = values(%squares);

print "KEYS: @keys\n";
print "VALUES: @vals\n";

# printing a single value:
print "5 squared equals $squares{5}\n";

exit;
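
The lesson goals also mention storing, retrieving, and looking up values. A minimal sketch of those three operations follows, reusing a smaller %squares hash. The variable names $value and @sorted are chosen here for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %squares = ( 1 => 1, 3 => 9, 5 => 25 );

# Store: assigning to a new key adds a key/value pair to the hash.
$squares{7} = 49;

# Retrieve: the key inside braces returns the stored value.
my $value = $squares{5};
print "5 squared equals $value\n";

# Look up: exists() reports whether a key is present, so we can
# avoid using an undefined value for a missing key.
foreach my $n (7, 4) {
    if (exists $squares{$n}) {
        print "$n is in the hash: $squares{$n}\n";
    }
    else {
        print "$n is not in the hash\n";
    }
}

# keys() returns keys in no particular order; sort numerically
# when a predictable order matters.
my @sorted = sort { $a <=> $b } keys %squares;
print "KEYS: @sorted\n";
```

The => operator used here is equivalent to a comma in a hash list. It only makes the key/value pairing easier to read.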

You will have to create a hash for the 95% Prediction Interval K values shown on this page. Once you have done this, the rest of the script to calculate the prediction interval should be very straightforward.
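
As a starting point, here is a sketch of such a hash containing only four rows of the K table. The names %k_for_n and lookup_k are hypothetical, and a complete script would list every row.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Only a few rows of the K table are shown here; a complete script
# would list every row. Keys are sample sizes (N), values are K.
my %k_for_n = (
    10  => 2.37,
    20  => 2.14,
    40  => 2.05,
    100 => 1.99,
);

# Hypothetical helper: returns K for a sample size, or undef when
# that sample size is not a key in the hash.
sub lookup_k {
    my ($n) = @_;
    return exists $k_for_n{$n} ? $k_for_n{$n} : undef;
}

foreach my $n (40, 45) {
    my $k = lookup_k($n);
    if (defined $k) {
        print "N = $n: K = $k\n";
    }
    else {
        print "N = $n: no K value in the table\n";
    }
}
```

The table skips some sample sizes (for example, 21 through 24), so a script has to decide what to do when N is not a key. Reporting the missing value, as this sketch does, is one option. Another is to fall back to the next smaller tabulated N, whose larger K gives a wider, more cautious interval.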

ASSIGNMENT:

Create a Perl script which calculates the prediction interval. Your script will take three inputs: mean, SD, and N. Your script will use N to look up K in your hash. It will then calculate the PI and produce output with the following format:


   SD = 10.0
   mean = 100
   K = 2.05

   PI(upper): 120.5
   
   PI(lower): 79.5