Perl Basics and Biostatistics

Quartiles

In this lesson the student will learn how to:
  1. construct and populate associative arrays
  2. use the .. operator
  3. use the logical and operator in an if statement
  4. set and turn off flags
  5. get a list of keys from a hash
  6. read a box-and-whiskers graph
  7. divide data points into quartiles

By the end of this lesson the student will be able to:

Write a Perl script to divide a group of data points into quartiles.

For this lesson we will use our "data" from the last lesson again:
Height of Nine-Year-Old Girls in Trona, 2003:

 height    number
   3'11"     1
   4' 0"     2
   4' 1"     1
   4' 2"     0
   4' 3"     2
   4' 4"     2
   4' 5"     1
   4' 6"     0
   4' 7"     1
   4' 8"     0
   4' 9"     2
   4'10"     1
   4'11"     0
   5' 0"     0
   5' 1"     1

Converting each height to inches (3'11" is 47 inches) and listing every girl once, in order, makes it easy to identify quartiles:

  0%-25%    47, 48, 48
  26%-50%   49, 51, 51
  51%-75%   52, 52, 53, 55
  76%-100%  57, 57, 58, 61
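As a rough sketch, the ordered list above can be rebuilt from the tally table. The helper name expand_tally is hypothetical, and the "use strict" line and "my" declarations are additions that the lesson's script does not use.

```perl
use strict;
use warnings;

# Tally from the table above: height in inches => number of girls.
# (3'11" is 47 inches, 5'1" is 61 inches.)
my %tally = (
    47 => 1, 48 => 2, 49 => 1, 50 => 0, 51 => 2,
    52 => 2, 53 => 1, 54 => 0, 55 => 1, 56 => 0,
    57 => 2, 58 => 1, 59 => 0, 60 => 0, 61 => 1,
);

# expand_tally (hypothetical helper) repeats each height once per girl
# and returns the heights in ascending numeric order.
sub expand_tally {
    my (%t) = @_;
    my @points;
    foreach my $height (sort { $a <=> $b } keys %t) {
        push @points, ($height) x $t{$height};   # x 0 adds nothing
    }
    return @points;
}

my @points = expand_tally(%tally);
print join(', ', @points), "\n";
# prints: 47, 48, 48, 49, 51, 51, 52, 52, 53, 55, 57, 57, 58, 61
```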

The ties in our data constrained how we could group the numbers, so our quartiles are not quite even. Ties are likely to be less of a problem with large data samples, although this also depends in part on the range of possible scores. In general, since we had 14 subjects, each quartile should contain 14/4 or 3.5 subjects. Since we can't split a subject in two, there should be 3 or 4 subjects per quartile. Of course you can imagine a situation where all fourteen girls were the same height, in which case we couldn't assign quartiles at all. Quartiles are all about the distribution of data points.
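The arithmetic in this paragraph can be sketched with int and the % (modulus) operator. The helper name split_count is hypothetical; it returns the whole-number quartile size and the number of subjects left over.

```perl
use strict;
use warnings;

# split_count (hypothetical helper): base quartile size and leftover count.
sub split_count {
    my ($pop) = @_;
    my $qtr   = int($pop / 4);   # 14 / 4 = 3.5, int drops the .5
    my $extra = $pop % 4;        # remainder, always 0 to 3
    return ($qtr, $extra);
}

my ($qtr, $extra) = split_count(14);
print "14 data points: $qtr per quartile, $extra left over\n";
# prints: 14 data points: 3 per quartile, 2 left over
```

With 3 per quartile and 2 left over, two of the four quartiles hold 4 subjects, which matches the 3, 3, 4, 4 grouping above.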

Box and Whisker Plots

We can graphically represent this information in a number of ways. The most commonly used way is to use a simple bar graph:
[Bar graph: "Height Distribution". Vertical axis: number of individuals, 0 to 5. Horizontal axis: height in inches, in the intervals 47-50, 51-53, 54-57, 58-61.]

In this bar graph we distribute the data according to even intervals of data values, not according to distribution of data points. It is important to keep in mind the difference in perspective between focusing on data point distribution (as does the median) and focusing on data values (as does the mean). There is interesting information to be gained in comparing the median and the mean. For instance, the median and mean may be the same, or one may be higher or lower than the other. These two numbers by themselves tell you a lot about the distribution of your scores.

Another interesting way of graphing the data is to use a "Box and Whisker" plot. In this plot the middle line represents the median. The very top line represents the greatest value. The lowest line represents the lowest value. The top of the box represents the 75th percentile. The bottom of the box represents the 25th percentile. The box-and-whiskers plot is a good way of representing the distribution of data points. It is important to remember that percentile, quartile and median in the context of this lesson all refer to distribution of data points and that the arithmetic mean or average is about data values.

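To make the contrast between data values and data points concrete, here is a minimal sketch that computes the mean, the median, and the lowest and greatest values for our 14 heights. The sub names mean and median are hypothetical helpers, not part of the lesson's script. The 25th and 75th percentiles are left out on purpose, since the result depends on which percentile method is chosen.

```perl
use strict;
use warnings;

my @heights = (47, 48, 48, 49, 51, 51, 52, 52, 53, 55, 57, 57, 58, 61);

# mean (hypothetical helper): sum of the data values divided by their count.
sub mean {
    my @v   = @_;
    my $sum = 0;
    $sum += $_ foreach @v;
    return $sum / scalar(@v);
}

# median (hypothetical helper): middle data point of the sorted list,
# or the average of the two middle points when the count is even.
sub median {
    my @v   = sort { $a <=> $b } @_;
    my $n   = scalar(@v);
    my $mid = int($n / 2);
    return $n % 2 ? $v[$mid] : ($v[$mid - 1] + $v[$mid]) / 2;
}

my @sorted = sort { $a <=> $b } @heights;
printf "lowest:   %d\n",   $sorted[0];          # 47
printf "median:   %.2f\n", median(@heights);    # 52.00
printf "mean:     %.2f\n", mean(@heights);      # 52.79
printf "greatest: %d\n",   $sorted[-1];         # 61
```

Here the mean is slightly higher than the median, because the few tallest values pull the average upward.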
Script to Split Data Into Quartiles:

This script is longer and more complicated than previous scripts. You should get it up and running, experiment with it, and read it carefully (including the comments) before proceeding.

#!/usr/bin/perl

%tally = ();    #hash to hold tally
$pop = 0;       #total number of data points

print "ENTER START: ";    #lowest data point value
$start = <STDIN>;
chomp($start);
print "ENTER END: ";      #highest data point value
$end = <STDIN>;
chomp($end);

#create tally
foreach $n ($start..$end){    #use of .. operator
  print "ENTER AMOUNT for $n: ";
  $input = <STDIN>;
  $pop += $input;       #keep track of total number of data points
  chomp($input);
  $tally{$n}=$input;    #add item to hash
}

@keys = keys %tally;
@keys = sort{$a<=>$b}@keys;

$qtr = $pop / 4;
$qtr = int($qtr);     #drop any digits to right of decimal point
$q_total = $qtr;      #counter (allotment per quarter)
$extra = $pop % 4;    #cannot be over 3
$q_num = 1;           #label for output
$q_flag = 1;          #used to decide when to print output
$q_cnt = 0;           #counter (actual output per quarter)

foreach $i (@keys){
  if($q_flag){        #print only at beginning of new quartile
    print "Quartile $q_num: ($q_cnt, $q_total)\n";    #debugging info in parens
    $q_flag=0;        #turn off print flag
  }
  print "$i:\t$tally{$i}\n";    #print tally item
  $q_cnt += $tally{$i};         #add amount to counter
  if($q_cnt >= $q_total && $q_num<4 ){    #compare counters
    #when actual amount exceeds allotment
    #go to next quartile
    $q_flag = 1;      #turn on print flag
    $q_num++;         #increment quartile number
    #the following three lines adjust counters
    if($extra == 3 && $q_num == 2) { $q_total++; }
    if($extra == 2 && $q_num == 3) { $q_total++; }
    if($extra == 1 && $q_num == 4) { $q_total++; }
    $q_total += $qtr; #adjust counter
  }
}

Hashes

This script introduces associative arrays (also known as hash tables or simply hashes). A hash variable is written with a percent sign (%). You can populate hashes in a lot of ways, but in this lesson we add one element at a time. First we declare our hash like this:

   %myhash = ();

Then we add items one at a time like this:

  $myhash{$key} = $value;

The following will all work as ways of adding values to hashes:

  $myhash{10} = 77;
  $myhash{'ten'} = 66;
  $myhash{'one'} = 'The elephants are attacking!';
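A short sketch of what happens after these assignments, assuming the same %myhash. The "use strict" line and "my" declaration are additions that the lesson's script does not use. It shows that a key can be a number or a string, that assigning to an existing key overwrites its value, and that exists tests for a key.

```perl
use strict;
use warnings;

my %myhash = ();
$myhash{10}    = 77;
$myhash{'ten'} = 66;
$myhash{'one'} = 'The elephants are attacking!';

print "$myhash{10}\n";       # prints: 77
print "$myhash{'one'}\n";    # prints: The elephants are attacking!

$myhash{10} = 78;            # same key, so the old value 77 is replaced
my $count = scalar(keys %myhash);
print "$count keys\n";       # prints: 3 keys

my $status = exists $myhash{'two'} ? 'found' : 'not found';
print "$status\n";           # prints: not found
```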

After we have stored our data in our hash table we need to retrieve the data. We do this by using the keys function like this:

  @list_of_keys = keys(%myhash);
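As a sketch of why the order of the returned keys matters, the example below uses a small hypothetical tally whose keys have different numbers of digits. Plain sort compares keys as strings, while sort { $a <=> $b } compares them as numbers.

```perl
use strict;
use warnings;

# Hypothetical tally, chosen so string order and numeric order differ.
my %tally = (9 => 1, 10 => 2, 100 => 3);

my @by_string = sort keys %tally;                # default: string comparison
my @by_number = sort { $a <=> $b } keys %tally;  # numeric comparison

print "string order:  @by_string\n";   # prints: string order:  10 100 9
print "numeric order: @by_number\n";   # prints: numeric order: 9 10 100
```

For the two-digit heights in this lesson both orders happen to agree, but the numeric form stays correct once a value such as 100 appears.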

In order for these keys to be useful we must sort them. The keys function does not guarantee that the keys come back in the order in which we stored them, so sorting is necessary. For this script the keys must be sorted into numeric order, so we use the sort function followed by the compare construct:

  sort{$a<=>$b}

The .. Operator

You probably noticed the foreach loop that looks like this:

  foreach $n ($start..$end){
    #do stuff
  }

The .. operator is also known as the range operator, and it is a shorthand way of specifying a range of values. For instance,

  foreach $n (1..100){
    #do stuff
  }

will iterate through the foreach loop for values of $n from 1 to 100. The script lets you specify the values of $start and $end, which makes it more flexible than it would be if the numbers used with the range operator were hard-coded.
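A few quick experiments with the range operator, offered as a sketch. The variable names are chosen for illustration only.

```perl
use strict;
use warnings;

my @small = (1 .. 5);
print "@small\n";              # prints: 1 2 3 4 5

# Variables work as endpoints, just as in the script.
my ($start, $end) = (47, 50);
my @heights = ($start .. $end);
print "@heights\n";            # prints: 47 48 49 50

# If the left value is greater than the right, the range is empty,
# so a foreach loop over it would not run at all.
my @empty = (5 .. 1);
print scalar(@empty), "\n";    # prints: 0
```

This is worth remembering when running the script: entering a START value larger than the END value produces an empty tally.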

Logical AND Operator

The logical AND operator (&&) allows you to construct compound conditional statements. In other words, the if statement is true only if BOTH of the conditions are met. We use the && operator four times in this script.

  if($q_cnt >= $q_total && $q_num<4 ){    #compare counters
    #when actual amount exceeds allotment
    #go to next quartile
    $q_flag = 1;      #turn on print flag
    $q_num++;         #increment quartile number
    #the following three lines adjust counters
    if($extra == 3 && $q_num == 2) { $q_total++; }
    if($extra == 2 && $q_num == 3) { $q_total++; }
    if($extra == 1 && $q_num == 4) { $q_total++; }
    $q_total += $qtr; #adjust counter
  }

The first use controls entrance into the section of code which increments the various counters. The other three allow for adjustment of the quartile boundaries, which may be necessary when the total number of data points is not evenly divisible by four.
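To see the compound condition in isolation, here is a minimal sketch that wraps the outer test in a hypothetical sub named should_advance. It returns 1 only when both conditions hold. Perl evaluates the right side of && only if the left side is true.

```perl
use strict;
use warnings;

# should_advance (hypothetical helper): the outer test from the script.
sub should_advance {
    my ($q_cnt, $q_total, $q_num) = @_;
    return ($q_cnt >= $q_total && $q_num < 4) ? 1 : 0;
}

print should_advance(3, 3, 1), "\n";     # prints: 1 (both conditions true)
print should_advance(2, 3, 1), "\n";     # prints: 0 (count below allotment)
print should_advance(14, 13, 4), "\n";   # prints: 0 (already in quartile 4)
```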

Use of Flags

In this script the $q_flag variable is used as a flag. It is just a normal scalar (there is nothing special about it from the point of view of the Perl interpreter), but we use it in the ROLE of a flag. When it is set to 1 we print a quartile heading, and when it is set to 0 we don't.
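The same flag pattern can be sketched on unrelated data. In this hypothetical example (the sub name with_headings and the word list are made up for illustration), a heading is printed once whenever the first letter changes, and the flag is turned off again right after the heading is produced.

```perl
use strict;
use warnings;

# with_headings (hypothetical helper): returns the items as output lines,
# with a heading line before each new first letter.
sub with_headings {
    my @items = @_;
    my @lines;
    my $flag = 0;     # 1 means "print a heading before the next item"
    my $last = '';    # first letter of the previous item
    foreach my $item (@items) {
        my $letter = substr($item, 0, 1);
        $flag = 1 if $letter ne $last;    # new group: turn the flag on
        if ($flag) {
            push @lines, uc($letter) . ':';
            $flag = 0;                    # turn the flag off again
        }
        push @lines, "  $item";
        $last = $letter;
    }
    return @lines;
}

print "$_\n" foreach with_headings('apple', 'avocado', 'banana', 'blueberry');
```

The output is an "A:" heading followed by the two A items, then a "B:" heading followed by the two B items.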

ASSIGNMENT:

You will make one small adjustment to this script. Instead of using quartile headings like this:


  Quartile 1:
  Quartile 2:
  Quartile 3:
  Quartile 4:

yours will look like this:
  
    0%-25%:
   26%-50%:
   51%-75%:
  76%-100%:

Make sure your headings are offset as shown.

The easiest way to make this change is to use a list containing these four headings and access them using the typical array indexing method:

print "$headings[$cnt]:\n";
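As a reminder of how array indexing works, here is a small sketch with an unrelated, hypothetical list. Note that the first element has index 0, not 1, which matters if you index with a counter that starts at 1.

```perl
use strict;
use warnings;

# Hypothetical list, used only to show indexing.
my @seasons = ('winter', 'spring', 'summer', 'fall');

my $cnt = 0;
print "$seasons[$cnt]:\n";    # prints: winter:
$cnt++;
print "$seasons[$cnt]:\n";    # prints: spring:
print "$seasons[-1]:\n";      # prints: fall: (index -1 is the last element)
```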