Perl Basics and Biostatistics
The Prediction Interval
In this lesson the student will learn how to:
- calculate the prediction interval
- store and retrieve values from a hash
- look up values stored in a hash
By the end of this lesson the student will be able to:
Write a Perl script which calculates the prediction interval
for a group of data points based on mean, SD, and N.
Let's say some governmental agency wants to
know what the average weight of ten-year-old girls in
the United States is. Let's assume that you successfully measured a
representative sample of the ten-year-old, female population, but you only
managed to actually measure 100 girls. How confident can you be that the
statistics you derive from this data apply to the overall population of
ten-year-old girls in the United States? The prediction interval converts
our raw data into a range of values in which we can be reasonably sure that
the true statistics exist.
The Prediction Interval
Assuming that we are dealing with a Gaussian population, 95% of the
population will fall within 1.96 SDs of the population mean. This means
that, given four vital pieces of information, we can predict a range
of actual scores in which 95% of the population will fall:
- that the distribution of scores is Gaussian
- the population mean
- the SD
- sample size
The larger the sample size, the more precise our prediction interval
will be. With small samples the range has to widen to
account for likely discrepancies between the mean and SD of the sample
and the mean and SD of the overall population.
Here are the coefficients we use to calculate the 95% Prediction Interval:
N | K |
2 | 15.56 |
3 | 4.97 |
4 | 3.56 |
5 | 3.04 |
6 | 2.78 |
7 | 2.62 |
8 | 2.51 |
9 | 2.43 |
10 | 2.37 |
11 | 2.33 |
12 | 2.29 |
13 | 2.26 |
14 | 2.24 |
15 | 2.22 |
16 | 2.20 |
17 | 2.18 |
18 | 2.17 |
19 | 2.16 |
20 | 2.14 |
25 | 2.10 |
30 | 2.08 |
35 | 2.06 |
40 | 2.05 |
50 | 2.03 |
60 | 2.02 |
70 | 2.01 |
80 | 2.00 |
90 | 2.00 |
100 | 1.99 |
200 | 1.98 |
infinity | 1.96 |
In this table K represents the number of SDs from the mean. Given a
sufficiently large sample, we know from this table that 95% of
the population will fall within 1.96 SDs of the mean. If our sample has
only 20 members, however, the range has to widen: 95% of the population
will fall within 2.14 SDs of the mean. (Remember, it is assumed that our
sample is actually representative of the overall population.) So all we
need to know is the sample size, sample SD, and sample mean
to calculate the prediction interval. For a very large sample:
SD = 10.0
mean = 100
K = 1.96
PI(upper) = 100 + 10.0 * 1.96
= 119.6
PI(lower) = 100 - 10.0 * 1.96
= 80.4
If our sample size is small, however, we cannot use
K = 1.96. For instance, if our sample size is only 40, then
K = 2.05. Here's what this would do to our calculation:
SD = 10.0
mean = 100
K = 2.05
PI(upper) = 100 + 10.0 * 2.05
= 120.5
PI(lower) = 100 - 10.0 * 2.05
= 79.5
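The arithmetic above can be sketched in Perl as follows. This is only a minimal illustration: the subroutine name prediction_interval is a hypothetical helper, not part of the lesson, and K is passed in by hand rather than looked up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (name is an assumption for this sketch):
# returns the (lower, upper) bounds mean - SD*K and mean + SD*K.
sub prediction_interval {
    my ($mean, $sd, $k) = @_;
    my $margin = $sd * $k;
    return ($mean - $margin, $mean + $margin);
}

# Same numbers as the N = 40 example: mean 100, SD 10.0, K 2.05.
my ($lower, $upper) = prediction_interval(100, 10.0, 2.05);
printf "PI(upper) = %.1f\n", $upper;
printf "PI(lower) = %.1f\n", $lower;
```

Calling the same subroutine with K = 1.96 gives the bounds from the first calculation, 119.6 and 80.4.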
The definition of 95% prediction interval: "Each new
observation has a 95% chance of being within the interval."
Perl Script for Calculating the Prediction Interval
To calculate the prediction interval we will use
hashes. Here is a simple Perl script illustrating the use of hashes:
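#!/usr/bin/perl -w
use strict;

# A hash maps keys to values; here each odd number maps to its square.
my %squares = ( 1, 1,
                3, 9,
                5, 25,
                7, 49,
                9, 81
              );
# keys() and values() return their lists in matching, but unspecified, order.
my @keys = keys(%squares);
my @vals = values(%squares);
print "KEYS: @keys\n";
print "VALUES: @vals\n";
# Printing a single value:
print "5 squared equals $squares{5}\n";
exit;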
You will have to create a hash for the 95% Prediction Interval K
values shown on this page. Once you have done this, the rest of the
script to calculate the prediction interval should be very straightforward.
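As a starting point, a K lookup might be sketched like this. Only four rows of the table are copied in, the names %k_for and lookup_k are assumptions made for this sketch, and returning undef for a sample size that is not in the hash is one possible design choice among several.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A few rows copied from the table on this page; a complete hash
# would list every N from the table.
my %k_for = (
    10  => 2.37,
    20  => 2.14,
    40  => 2.05,
    100 => 1.99,
);

# Hypothetical helper: returns K for a sample size,
# or undef when that N is not a key in the hash.
sub lookup_k {
    my ($n) = @_;
    return exists $k_for{$n} ? $k_for{$n} : undef;
}

for my $n (40, 45) {
    my $k = lookup_k($n);
    if (defined $k) {
        print "N = $n, K = $k\n";
    } else {
        print "N = $n is not in the table\n";
    }
}
```

Testing with exists before reading the value lets the script tell a missing sample size apart from a stored value, instead of silently using an undefined K.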
ASSIGNMENT:
Create a Perl script which calculates the prediction interval.
Your script will take three inputs: mean, SD, and N.
Your script will use N to look up K in your hash.
It will then calculate the PI and produce output with the following format:
SD = 10.0
mean = 100
K = 2.05
PI(upper): 120.5
PI(lower): 79.5
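One way to produce output in this format is with sprintf. The sketch below assumes the values have already been read in and K has already been looked up. The subroutine name format_report and the chosen number formats (one decimal place for SD and the bounds, two for K) are assumptions inferred from the sample output, not requirements stated by the assignment.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: builds the report text from mean, SD, and K.
sub format_report {
    my ($mean, $sd, $k) = @_;
    my $margin = $sd * $k;
    return sprintf(
        "SD = %.1f\nmean = %g\nK = %.2f\nPI(upper): %.1f\nPI(lower): %.1f\n",
        $sd, $mean, $k, $mean + $margin, $mean - $margin
    );
}

print format_report(100, 10.0, 2.05);
```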