Perl Basics and Biostatistics
Quartiles
In this lesson the student will learn how to:
- construct and populate associative arrays
- use the .. operator
- use the logical AND operator (&&) in an if statement
- set and turn off flags
- get a list of keys from a hash
- read a box-and-whiskers graph
- divide data points into quartiles
By the end of this lesson the student will be able to:
Write a Perl script to divide a group of data points into quartiles.
For this lesson we will use our "data" from the last lesson again:
Height of Nine-Year-Old Girls in Trona, 2003:
Height    Number of girls
3'11"     1
4' 0"     2
4' 1"     1
4' 2"     0
4' 3"     2
4' 4"     2
4' 5"     1
4' 6"     0
4' 7"     1
4' 8"     0
4' 9"     2
4'10"     1
4'11"     0
5' 0"     0
5' 1"     1
With the heights converted to inches and listed in order, one entry per girl,
it is easy to identify quartiles:
0%-25%      47, 48, 48
26%-50%     49, 51, 51
51%-75%     52, 52, 53, 55
76%-100%    57, 57, 58, 61
The ties in our data constrained how we could group the numbers, so our
quartiles are not quite even. This is likely to be less of a problem with
large data samples, although it also depends in part on the range of possible
scores. In general, since we had 14 subjects, each quartile should contain
14/4 or 3.5 subjects. Since we can't split a subject in two, this means that
there should be 3 or 4 subjects per quartile. Of course, you can imagine a
situation where all fourteen girls were the same height, in which case we
really couldn't assign quartiles. Quartiles are all about the distribution of
data points.
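The allotment arithmetic described above can be checked with a few lines of Perl. This is only a minimal sketch: the helper name quartile_allotment is made up for illustration and does not appear in the lesson's script, although the two calculations inside it are the same ones the script uses.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the lesson's script): returns the
# whole-number allotment per quartile and the number of subjects left
# over after dividing the population by four.
sub quartile_allotment {
    my ($pop) = @_;
    my $qtr   = int($pop / 4);   # drop any digits to right of decimal point
    my $extra = $pop % 4;        # leftover subjects, always 0 to 3
    return ($qtr, $extra);
}

my ($qtr, $extra) = quartile_allotment(14);
print "Each quartile gets at least $qtr subjects; $extra are left over.\n";
```

For our 14 girls this reports an allotment of 3 with 2 left over, which is why two of the quartiles hold 4 subjects.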
Box and Whisker Plots
We can graphically represent this information in a number of ways. The most commonly used way is to use a simple bar graph:
[Bar graph: "Height Distribution". Vertical axis: number of individuals (0 to 5). Horizontal axis: height in inches, grouped as 47-50, 51-53, 54-57, 58-61.]
In this bar graph we distribute the data according to even intervals of
data values, not according to distribution of data points. It is important
to keep in mind the difference in perspective between focusing on data
point distribution (as does the median) and focusing on data values (as
does the mean). There is interesting information to be gained in comparing
the median and the mean. For instance, the median and mean may be the same
or one may be higher or lower than the other. These two numbers by themselves
tell you a lot about the distribution of your scores.
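To make the comparison between mean and median concrete, here is a small sketch that computes both for our fourteen heights in inches. The subroutine names mean and median are our own choices for this illustration; they are not built into Perl.

```perl
use strict;
use warnings;

# The fourteen heights from the table, converted to inches.
my @heights = (47, 48, 48, 49, 51, 51, 52, 52, 53, 55, 57, 57, 58, 61);

# Mean: add up the data values and divide by the number of data points.
sub mean {
    my @v   = @_;
    my $sum = 0;
    $sum += $_ foreach @v;
    return $sum / scalar(@v);
}

# Median: the middle data point of the sorted list, or the average of
# the two middle points when the number of data points is even.
sub median {
    my @v   = sort { $a <=> $b } @_;
    my $n   = scalar @v;
    my $mid = int($n / 2);
    return $n % 2 ? $v[$mid] : ($v[$mid - 1] + $v[$mid]) / 2;
}

printf "mean = %.2f, median = %s\n", mean(@heights), median(@heights);
```

For this data the mean (about 52.79) is a little higher than the median (52), which hints that the taller girls are spread out further from the middle than the shorter ones.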
Another interesting way of graphing the data is to use a "Box and Whisker"
plot. In this plot the middle line represents the median. The very top line
represents the greatest value. The lowest line represents the lowest value.
The top of the box represents the 75th percentile. The bottom of the box
represents the 25th percentile. The box-and-whiskers plot is a good way of
representing the distribution of data points. It is important to remember
that percentile, quartile and median in the context of this lesson all
refer to distribution of data points and that the arithmetic mean or average
is about data values.
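The five numbers that a box-and-whiskers plot draws can be worked out with a short sketch like the one below. Be aware that there are several conventions for computing the 25th and 75th percentiles. This sketch assumes one common convention (the median of the lower half and the median of the upper half of the sorted data), which is our assumption and not something the lesson specifies. The subroutine names are hypothetical, and the sketch assumes at least two data points.

```perl
use strict;
use warnings;

# Median of a list that is ALREADY sorted in numeric order.
sub median_of {
    my @v   = @_;
    my $n   = scalar @v;
    my $mid = int($n / 2);
    return $n % 2 ? $v[$mid] : ($v[$mid - 1] + $v[$mid]) / 2;
}

# Returns (lowest, 25th percentile, median, 75th percentile, highest).
# Assumption: the 25th and 75th percentiles are taken as the medians of
# the lower and upper halves of the data. Other conventions exist.
sub five_numbers {
    my @v     = sort { $a <=> $b } @_;
    my $n     = scalar @v;
    my $half  = int($n / 2);
    my @lower = @v[0 .. $half - 1];          # bottom half
    my @upper = @v[$n - $half .. $n - 1];    # top half
    return ($v[0], median_of(@lower), median_of(@v), median_of(@upper), $v[-1]);
}

my @heights = (47, 48, 48, 49, 51, 51, 52, 52, 53, 55, 57, 57, 58, 61);
my @five    = five_numbers(@heights);
print "min $five[0], 25th $five[1], median $five[2], 75th $five[3], max $five[4]\n";
```

With our data this prints a lowest value of 47, a box running from 49 to 57, a median line at 52, and a highest value of 61.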
Script to Split Data Into Quartiles:
This script is longer and more complicated than previous scripts. You should
get it up and running, experiment with it, and read it carefully (including
the comments) before proceeding.
#!/usr/bin/perl
%tally = ();                   #hash to hold tally
$pop = 0;                      #total number of data points
print "ENTER START: ";         #lowest data point value
$start = <STDIN>;
chomp($start);
print "ENTER END: ";           #highest data point value
$end = <STDIN>;
chomp($end);
#create tally
foreach $n ($start..$end){     #use of .. operator
    print "ENTER AMOUNT for $n: ";
    $input = <STDIN>;
    chomp($input);             #remove trailing newline before using the value
    $pop += $input;            #keep track of total number of data points
    $tally{$n}=$input;         #add item to hash
}
@keys = keys %tally;
@keys = sort{$a<=>$b}@keys;
$qtr = $pop / 4;
$qtr = int($qtr);              #drop any digits to right of decimal point
$q_total = $qtr;               #counter (allotment per quarter)
$extra = $pop % 4;             #cannot be over 3
$q_num = 1;                    #label for output
$q_flag = 1;                   #used to decide when to print output
$q_cnt = 0;                    #counter (actual output per quarter)
foreach $i (@keys){
    if($q_flag){               #print only at beginning of new quartile
        print "Quartile $q_num: ($q_cnt, $q_total)\n"; #debugging info in parens
        $q_flag=0;             #turn off print flag
    }
    print "$i:\t$tally{$i}\n"; #print tally item
    $q_cnt += $tally{$i};      #add amount to counter
    if($q_cnt >= $q_total && $q_num<4 ){ #compare counters
        #when actual amount exceeds allotment
        #go to next quartile
        $q_flag = 1;           #turn on print flag
        $q_num++;              #increment quartile number
        #the following three lines adjust counters
        if($extra == 3 && $q_num == 2) { $q_total++; }
        if($extra == 2 && $q_num == 3) { $q_total++; }
        if($extra == 1 && $q_num == 4) { $q_total++; }
        $q_total += $qtr;      #adjust counter
    }
}
Hashes
This script introduces associative arrays (also known as hash tables or simply
hashes). A hash variable is represented using the percent sign (%). You can
populate hashes in a lot of ways, but in this lesson we add one element at a
time to the hash. So, first we declare our hash like this:
%myhash = ();
Then we add items one at a time like this:
$myhash{$key} = $value;
The following will all work as ways of adding values to hashes:
$myhash{10} = 77;
$myhash{'ten'} = 66;
$myhash{'one'} = 'The elephants are attacking!';
After we have stored our data in our hash table we need to retrieve the
data. We do this by using the keys function like this:
@list_of_keys = keys(%myhash);
In order for these keys to be useful we must sort them. The keys function
makes no guarantee about the order in which it returns the keys (in
particular, it need not be the order in which we stored them), and so sorting
is necessary. For this script the keys must be sorted into numeric order, so
we use the sort function with a comparison block built on the numeric
comparison operator (<=>):
sort{$a<=>$b}
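The difference the comparison block makes is easiest to see with keys of different lengths. The following is a minimal sketch using made-up keys and values (9, 10 and 100 are not from our height data). Without the block, sort compares the keys as text.

```perl
use strict;
use warnings;

# A small hash with made-up numeric keys, added one element at a time.
my %tally = ();
$tally{9}   = 2;
$tally{10}  = 5;
$tally{100} = 1;

my @as_text    = sort keys %tally;                 # default: compares keys as strings
my @as_numbers = sort { $a <=> $b } keys %tally;   # compares keys as numbers

print "string order:  @as_text\n";      # 10 100 9
print "numeric order: @as_numbers\n";   # 9 10 100

# Walk the hash in numeric key order and look up each value.
foreach my $k (@as_numbers) {
    print "$k:\t$tally{$k}\n";
}
```

In string order "10" and "100" come before "9" because the character "1" sorts before "9", which is not what we want for a tally of heights.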
The .. Operator
You probably noticed the foreach loop that looks like this:
foreach $n ($start..$end){
#do stuff
}
The .. operator is also known as the range operator and it is used to
specify a range of values in a shorthand kind of way. For instance,
foreach $n (1..100){
#do stuff
}
will iterate through the foreach loop for values of $n from 1 to 100. The
script allows you to specify the values of $start and $end, which makes it
more flexible than would be the case if the numbers used with the range
operator were hard-coded.
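The range operator can also be used outside of a loop to build a list. Here is a short sketch; the values 47 and 50 are just examples standing in for whatever the user would type at the ENTER START and ENTER END prompts.

```perl
use strict;
use warnings;

# Example values only; the lesson's script reads these from the keyboard.
my ($start, $end) = (47, 50);

my @range = ($start .. $end);   # expands to every whole number from 47 to 50
print "The range holds ", scalar(@range), " values: @range\n";

# If the left side is larger than the right side, the range is empty,
# so a foreach loop over it would not run at all.
my @empty = (5 .. 1);
print "A descending range holds ", scalar(@empty), " values.\n";
```

The second case is worth remembering when you run the script: if you enter a START value that is higher than the END value, the tally loop never asks for any amounts.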
Logical AND Operator
The logical AND operator (&&) allows you to construct compound conditional
statements. In other words, the if statement is true only if BOTH of the
conditions are met. We use the && operator four times in this script.
if($q_cnt >= $q_total && $q_num<4 ){ #compare counters
    #when actual amount exceeds allotment
    #go to next quartile
    $q_flag = 1;           #turn on print flag
    $q_num++;              #increment quartile number
    #the following three lines adjust counters
    if($extra == 3 && $q_num == 2) { $q_total++; }
    if($extra == 2 && $q_num == 3) { $q_total++; }
    if($extra == 1 && $q_num == 4) { $q_total++; }
    $q_total += $qtr;      #adjust counter
}
The first use controls entrance into the section of code which increments
the various counters. The other three allow for adjustment of the quartile
boundaries, which may be necessary when the total number of data points is
not evenly divisible by four.
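To see how the first compound condition behaves, it can help to pull it out into a tiny subroutine and try a few values. The name should_advance is hypothetical (the lesson's script simply writes the test inline), and the numbers passed in are taken from a run with our height data.

```perl
use strict;
use warnings;

# Hypothetical helper that mirrors the script's first compound test:
# move to the next quartile only if the running count has reached the
# allotment AND we are not already in the fourth quartile.
sub should_advance {
    my ($q_cnt, $q_total, $q_num) = @_;
    return ($q_cnt >= $q_total && $q_num < 4) ? 1 : 0;
}

print should_advance(3, 3, 1), "\n";    # 1: both conditions are true
print should_advance(2, 3, 1), "\n";    # 0: count has not reached the allotment
print should_advance(13, 13, 4), "\n";  # 0: already in the last quartile
```

Notice that a single false condition is enough to make the whole test false. Perl also stops evaluating as soon as the left-hand condition is false, so the right-hand side is only checked when it can still matter.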
Use of Flags
In this assignment the $q_flag variable is used as a flag. It is just a
normal scalar (there is nothing special about it from the point of view of
the Perl interpreter), but we use it in the ROLE of a flag. When it is set
to 1 we print a quartile heading and when it is set to zero we don't.
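The same flag pattern can be shown on its own, away from the quartile arithmetic. This is a minimal sketch with made-up data: each item carries a group label, and the flag is set whenever a new group begins so that exactly one heading is printed per group. The variable names are our own and are not taken from the lesson's script.

```perl
use strict;
use warnings;

# Made-up data: each item is a [group label, value] pair.
my @items = ( [ 'A', 47 ], [ 'A', 48 ], [ 'B', 49 ], [ 'B', 51 ] );

my $flag = 1;     # 1 means "print a heading before the next item"
my $last = '';    # group label of the previous item
my @output;       # lines collected for printing

foreach my $item (@items) {
    my ($group, $value) = @$item;
    $flag = 1 if $group ne $last;     # a new group begins: set the flag
    if ($flag) {
        push @output, "Group $group:";
        $flag = 0;                    # turn the flag off until the next group
    }
    push @output, "  $value";
    $last = $group;
}

print "$_\n" foreach @output;
```

As in the quartile script, the flag itself is an ordinary scalar. What makes it a flag is that one part of the loop sets it and another part tests it and turns it off again.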
ASSIGNMENT:
You will make one small adjustment to this script. Instead of using quartile
headings like this:
Quartile 1:
Quartile 2:
Quartile 3:
Quartile 4:
yours will look like this:
0%-25%:
26%-50%:
51%-75%:
76%-100%:
Make sure your headings are offset as shown.
The easiest way to make this change is to use a list containing these four
headings and access them using the typical array indexing method:
print "$headings[$cnt]:\n";