Perl Basics and Biostatistics

Random Sampling

In this lesson the student will learn how to:
  1. explain the statistical reliability of random sampling
  2. redirect script output to a file
  3. push scalar values onto an array
  4. generate random numbers
By the end of this lesson the student will be able to:

  Use a script to perform a simple statistical
  experiment.

Assuming that two populations of 40 subjects are identical for some measurable aspect of their physiology (for instance, systolic blood pressure), what are the chances that a random sample of five subjects from each population will reflect this? How reliable is random sampling?

To help us discuss these questions, let's start small and create two identical groups of 12 subjects:


G1 = ( 88, 90, 79, 82, 99, 105, 91, 86, 74, 101, 80, 96 ).

G2 = ( 88, 90, 79, 82, 99, 105, 91, 86, 74, 101, 80, 96 ).

If we select five values from each group at random we might come up with:

G1 Sample = ( 90, 82, 105, 74, 80 ).
   Sample Mean: 86

G2 Sample = ( 88, 82, 99, 101, 86 ).
   Sample Mean: 91

NOTE: sample means are rounded to the nearest whole number.

The actual average for each group is 89.25, or 89 when rounded. So, you can see that random sampling doesn't guarantee that our samples will be representative of our actual groups. In fact, we could wind up with random samples like this:


G1 Sample = ( 99, 105, 91, 101, 96 )
   Sample Mean: 98

G2 Sample = ( 79, 82, 74, 80, 86 )
   Sample Mean: 80
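
The means quoted above can be checked with a few lines of Perl. This is only a minimal sketch: the helper name mean and the variable names are our own and do not appear in the lesson's script.

```perl
#!/usr/bin/perl -w
use strict;

# mean() is a hypothetical helper: it returns the arithmetic mean of a list.
sub mean {
    my @values = @_;
    my $tally  = 0;
    $tally += $_ foreach @values;
    return $tally / @values;    # @values in numeric context is the item count
}

my @group    = (88, 90, 79, 82, 99, 105, 91, 86, 74, 101, 80, 96);
my @typical1 = (90, 82, 105, 74, 80);
my @typical2 = (88, 82, 99, 101, 86);
my @extreme1 = (99, 105, 91, 101, 96);
my @extreme2 = (79, 82, 74, 80, 86);

print "GROUP MEAN: ",      mean(@group),    "\n";   # 89.25
print "G1 SAMPLE MEAN: ",  mean(@typical1), "\n";   # 86.2
print "G2 SAMPLE MEAN: ",  mean(@typical2), "\n";   # 91.2
print "G1 EXTREME MEAN: ", mean(@extreme1), "\n";   # 98.4
print "G2 EXTREME MEAN: ", mean(@extreme2), "\n";   # 80.2
```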

Although extreme samples like these are unlikely (especially when we are dealing with large populations and sample sizes), they are always possible. So, random sampling may not be very reliable when our sample size is small. As the sample size increases, the sample means cluster more tightly around the actual average, and the reliability of random sampling increases.
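
The effect of sample size can be seen in a small experiment. The sketch below draws many samples of 2, 5, and 10 values from the twelve-value group used above and reports the lowest and highest sample mean found for each size. The helper names sample_mean and spread, the trial count of 1000, and the chosen sample sizes are all assumptions made for illustration. Like the lesson's script, it allows the same subject to be drawn more than once.

```perl
#!/usr/bin/perl -w
use strict;

# The twelve values of G1 from the text.
my @group = (88, 90, 79, 82, 99, 105, 91, 86, 74, 101, 80, 96);

# sample_mean() is a hypothetical helper: it draws $size values at random
# (with replacement) from the population and returns their mean.
sub sample_mean {
    my ($size, @pop) = @_;
    my $tally = 0;
    $tally += $pop[ int(rand(@pop)) ] for 1 .. $size;
    return $tally / $size;
}

# spread() is a hypothetical helper: it repeats the sampling $trials times
# and returns the lowest and highest sample mean that turned up.
sub spread {
    my ($size, $trials, @pop) = @_;
    my ($low, $high);
    for (1 .. $trials) {
        my $m = sample_mean($size, @pop);
        $low  = $m if !defined $low  || $m < $low;
        $high = $m if !defined $high || $m > $high;
    }
    return ($low, $high);
}

foreach my $size (2, 5, 10) {
    my ($low, $high) = spread($size, 1000, @group);
    printf "SAMPLE SIZE %2d: lowest mean %.1f, highest mean %.1f\n",
        $size, $low, $high;
}
```

The gap between the lowest and highest mean should usually narrow as the sample size grows, although the exact figures change from run to run.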

Random Sampling Script

Consider the following Perl script, which illustrates the reliability of random sampling by drawing two samples of five values from a single population of 40:

#!/usr/bin/perl -w
use strict;

# Population of 40 measurements
my @pop = (62, 65, 66, 72, 80, 75, 67, 73, 79, 69,
           70, 64, 64, 65, 67, 81, 61, 82, 68, 59,
           83, 77, 73, 74, 66, 67, 77, 78, 82, 80,
           73, 68, 67, 73, 84, 58, 75, 76, 72, 68);

# Average of the whole population
my $actual_average = 0;
my $num = @pop;
my $tally = 0;
foreach my $n (@pop){
    $tally+=$n;
}
$actual_average = $tally/$num;
print "ACTUAL AVERAGE: $actual_average\n";

# Seed the random number generator with the current time
srand(time);

# First sample of five values
my $s1 = 0;
my @g1 = ();
my $i = 0; # index into @pop
$tally = 0;
foreach my $r (0..4){
    $i = int rand($num);
    $tally += $pop[$i];
    push(@g1,$pop[$i]);
}
$s1 = $tally/5;
print "SAMPLE: @g1, SAMPLE AVERAGE: $s1\n";

# Second sample of five values
my $s2 = 0;
my @g2 = ();
$tally = 0;
foreach my $r (0..4){
    $i = int rand($num);
    $tally += $pop[$i];
    push(@g2,$pop[$i]);
}
$s2 = $tally/5;
print "SAMPLE: @g2, SAMPLE AVERAGE: $s2\n";

exit;

Run this sample script several times. Make sure you pause for a couple of seconds between runs: the time function returns the current time in whole seconds, so two runs started within the same second give srand the same seed and produce identical samples.
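
Because the script picks each index independently, the same subject can turn up twice in one sample (sampling with replacement). If each subject should appear at most once, one possible alternative is to shuffle the population and take the first five values. The sketch below is not part of the lesson's script. It uses shuffle from the core List::Util module, and the helper name draw_without_replacement is an assumption made for illustration.

```perl
#!/usr/bin/perl -w
use strict;
use List::Util qw(shuffle);

# draw_without_replacement() is a hypothetical helper: it shuffles the
# population and returns the first $size values, so no subject repeats.
sub draw_without_replacement {
    my ($size, @pop) = @_;
    my @shuffled = shuffle(@pop);
    return @shuffled[0 .. $size - 1];
}

# The first ten values of the lesson's population, to keep the example short.
my @pop    = (62, 65, 66, 72, 80, 75, 67, 73, 79, 69);
my @sample = draw_without_replacement(5, @pop);
print "SAMPLE: @sample\n";
```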

Notice the use of push to add an item to the end of an array. This is useful whenever you build up a list one value at a time, for example when you are creating tables in a CGI script (something we'll discuss in unit four).
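
As a minimal illustration of push on its own, the sketch below builds a small array step by step. The array name @readings and its values are made up for the example.

```perl
#!/usr/bin/perl -w
use strict;

my @readings = ();              # start with an empty array

push(@readings, 88);            # add one scalar to the end
push(@readings, 90, 79);        # push also accepts several values at once

my $count = @readings;          # an array in scalar context gives its length
print "READINGS: @readings\n";  # prints "READINGS: 88 90 79"
print "COUNT: $count\n";        # prints "COUNT: 3"
```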

Also notice the use of srand(time) and the int rand($num) construct. The srand function seeds the random number generator with whatever value it is given. In this case we seed it with the current time (which is returned by the time function). The actual random number is generated by int rand($num). The rand function is given $num, the size of our data array, and returns a fractional number that is at least 0 and less than $num. The int function then discards the fractional part, leaving a whole number from 0 to $num - 1, which is always a valid index into @pop.
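
The behaviour of srand and int rand can be seen in isolation with a short sketch. The helper name random_indexes and the seed value 42 are assumptions chosen for illustration. The point is that the same seed always gives the same sequence of indexes, and that every index is a valid position in the array.

```perl
#!/usr/bin/perl -w
use strict;

my @pop = (62, 65, 66, 72, 80);   # first five values of the lesson's population
my $num = @pop;

# random_indexes() is a hypothetical helper: it seeds the generator with
# $seed and returns $count indexes, each produced by int(rand($size)).
sub random_indexes {
    my ($seed, $count, $size) = @_;
    srand($seed);
    my @indexes = ();
    push(@indexes, int(rand($size))) for 1 .. $count;
    return @indexes;
}

my @first  = random_indexes(42, 8, $num);
my @second = random_indexes(42, 8, $num);   # same seed, same sequence
print "FIRST:  @first\n";
print "SECOND: @second\n";

# Every index lies between 0 and $num - 1, so $pop[$i] is always defined.
foreach my $i (@first) {
    print "index $i -> value $pop[$i]\n";
}
```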

ASSIGNMENT:

Present the results of this activity on a web page.

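  1. Run the Perl script shown above ten times. Redirect the output to a file so that it can be easily reformatted into an HTML page. To redirect, run the script like this:

       ./perlScript.pl >> output_file.txt

     The >> operator appends to the file, so each run adds its output after that of the previous runs. Once you have ten sets of output, calculate an average for the group one samples (the first SAMPLE line of each run) and for the group two samples (the second SAMPLE line of each run). Present this information on a well-organized and easily readable web page.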
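
The averaging step can also be done by a script rather than by hand. The sketch below is one possible approach, not a required part of the assignment. The helper name group_averages is hypothetical, and it assumes the file contains only the output of the sampling script: each run starts with an ACTUAL AVERAGE line, followed by the SAMPLE line for group one and then the SAMPLE line for group two.

```perl
#!/usr/bin/perl -w
use strict;

# group_averages() is a hypothetical helper. It takes the lines written by
# the sampling script and returns the mean of the sample averages for
# group one and for group two.
sub group_averages {
    my @lines = @_;
    my @tally = (0, 0);   # running totals for group one and group two
    my @count = (0, 0);   # number of sample averages seen per group
    my $group = 0;
    foreach my $line (@lines) {
        if ($line =~ /^ACTUAL AVERAGE:/) {
            $group = 0;   # a new run starts; the next sample is group one
        }
        elsif ($line =~ /SAMPLE AVERAGE:\s*([\d.]+)/) {
            next if $group > 1;   # ignore unexpected extra SAMPLE lines
            $tally[$group] += $1;
            $count[$group]++;
            $group++;
        }
    }
    my $one = $count[0] ? $tally[0] / $count[0] : 0;
    my $two = $count[1] ? $tally[1] / $count[1] : 0;
    return ($one, $two);
}

# Read the file named on the command line, if one is given.
if (@ARGV) {
    open(my $fh, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
    my ($one, $two) = group_averages(<$fh>);
    close($fh);
    print "GROUP ONE AVERAGE: $one\n";
    print "GROUP TWO AVERAGE: $two\n";
}
```

If the sketch were saved under a name of your choosing, it would be run with output_file.txt as its only argument.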