Perl Basics and Biostatistics
Correlation Coefficient
In this lesson the student will learn how to:
- use for-loops
- calculate the correlation coefficient
By the end of this lesson the student will be able to:
- write scripts containing simple for-loops
The equation for calculating r (the correlation coefficient) is
programmed into many cheap calculators, so you don't really need to know how
to calculate it yourself. The equation can be expressed in several ways;
here is one:
\[
r = \frac{\displaystyle\sum_{i=1}^{N} \frac{X_i - \bar{X}}{SD_x} \cdot \frac{Y_i - \bar{Y}}{SD_y}}{N - 1}
\]
As you will recall:
\[
SD = \sqrt{\frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}{N - 1}}
\]
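Since the equation "can be expressed in several ways", here is one other common form, obtained by substituting the SD definition into the equation for r so that the (N - 1) terms cancel. This is a sketch in LaTeX notation; the symbols \bar{X} and \bar{Y} are assumed here to stand for the mean of X and the mean of Y.

```latex
\[
r = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}
         {\sqrt{\sum_{i=1}^{N} (X_i - \bar{X})^2}\;\sqrt{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}}
\]
```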
So calculating r requires several steps:
- Find the mean of your groups of numbers.
X = ( 3, 5, 6, 7, 10, 12 )
mean of X = 7.17
Y = ( 5, 6, 7, 9, 10, 13 )
mean of Y = 8.33
- Calculate the SD for each group.
SDx = sqrt( [ (3-7.17)^2+(5-7.17)^2+(6-7.17)^2+(7-7.17)^2+(10-7.17)^2+(12-7.17)^2 ] / 5 )
SDx = 3.312
SDy = sqrt( [ (5-8.33)^2+(6-8.33)^2+(7-8.33)^2+(9-8.33)^2+(10-8.33)^2+(13-8.33)^2 ] / 5 )
SDy = 2.944
- Calculate r:
r = [ (3-7.17)/3.312*(5-8.33)/2.944 + (5-7.17)/3.312*(6-8.33)/2.944 + (6-7.17)/3.312*(7-8.33)/2.944 + (7-7.17)/3.312*(9-8.33)/2.944 + (10-7.17)/3.312*(10-8.33)/2.944 + (12-7.17)/3.312*(13-8.33)/2.944 ] / 5
r = 0.978
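The steps above can be sketched in Perl using one for-loop per step. This is only an illustration, not the only way to do it: the variable names are invented for this example, and it uses "use strict" and "my" declarations, which the other scripts in this lesson do not.

```perl
#!/usr/bin/perl -w
use strict;

my @x = (3, 5, 6, 7, 10, 12);
my @y = (5, 6, 7, 9, 10, 13);
my $n = scalar(@x);    # number of (X, Y) pairs

# Step 1: sum each group to get the means.
my ($sum_x, $sum_y) = (0, 0);
for (my $i = 0; $i < $n; $i++) {
    $sum_x += $x[$i];
    $sum_y += $y[$i];
}
my $mean_x = $sum_x / $n;
my $mean_y = $sum_y / $n;

# Step 2: sum the squared deviations to get the SDs.
my ($ss_x, $ss_y) = (0, 0);
for (my $i = 0; $i < $n; $i++) {
    $ss_x += ($x[$i] - $mean_x) ** 2;
    $ss_y += ($y[$i] - $mean_y) ** 2;
}
my $sd_x = sqrt($ss_x / ($n - 1));
my $sd_y = sqrt($ss_y / ($n - 1));

# Step 3: sum the products of the standardized deviations, then divide by N - 1.
my $sum_prod = 0;
for (my $i = 0; $i < $n; $i++) {
    $sum_prod += ($x[$i] - $mean_x) / $sd_x * ($y[$i] - $mean_y) / $sd_y;
}
my $r = $sum_prod / ($n - 1);

printf "mean of X = %.2f, mean of Y = %.2f\n", $mean_x, $mean_y;
printf "SDx = %.3f, SDy = %.3f\n", $sd_x, $sd_y;
printf "r = %.3f\n", $r;
```

The script keeps full precision for the means and SDs and rounds only when printing, so its r agrees with the hand calculation to three decimal places.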
Nonparametric Methods
Statistical methods that do not make assumptions about the distribution of
the population are called nonparametric tests. Most nonparametric methods
are based on a simple idea: list the values in order from low to high,
assign each value a rank, and base all further analyses on the ranks. By
analyzing ranks rather than values, you don't need to care about the
distribution of the population.
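As a rough illustration of the ranking step, the sketch below assigns ranks to a small list of numbers. It assumes there are no tied values (ties are usually given the average of the ranks they span, which this sketch does not handle), and the variable names are made up for the example.

```perl
#!/usr/bin/perl -w
use strict;

my @values = (7, 3, 12, 5, 10, 6);    # example data in no particular order

# Sort the positions 0..$#values by the value stored at each position.
my @order = sort { $values[$a] <=> $values[$b] } 0 .. $#values;

# Walk the sorted positions; the lowest value gets rank 1.
my @rank;
for (my $i = 0; $i < @order; $i++) {
    $rank[ $order[$i] ] = $i + 1;
}

# Print each value next to its rank, in the original order.
for (my $i = 0; $i < @values; $i++) {
    print "value $values[$i] has rank $rank[$i]\n";
}
```

Any further analysis would then work with @rank instead of @values.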
PERL: For Loops
You've seen this construct many times:
#!/usr/bin/perl -w
foreach $n (1..10){
    print "$n\n";
}
exit;
You've also seen it used like this:
#!/usr/bin/perl -w
@stuff = ("hairpin", 33, "PIG", "lantern", "33-34", "earth");
foreach $n (@stuff){
    print "$n\n";
}
exit;
So, that's the foreach loop. Now let's look at the for loop.
#!/usr/bin/perl -w
for($i = 0; $i < 10; $i++){
    print "$i\n";
}
exit;
Let's look at each portion of the for-loop construct:
The first part: $i = 0;
This is the initializer. The variable which is being used to keep
track of the count is initialized. Commonly this variable is initialized
to zero, but it can be initialized to any integer value.
The middle part: $i < 10;
This is the test condition (often called the exit condition). It is
checked before each pass through the loop: the loop continues as long as
it evaluates to true, and ends as soon as it evaluates to false.
The end: $i++
This is the increment amount. In this case the value stored in the
counter variable is increased by one on every iteration of the loop.
The counter can be increased by any integer amount on each loop or
it can even be decreased in value!
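To illustrate the last point, here is a small sketch with one loop that counts up in steps of three and another that counts down in steps of two. The start values, limits and array names are arbitrary choices for this example, and it uses "use strict" and "my" declarations.

```perl
#!/usr/bin/perl -w
use strict;

# Count up from 5 to 20 in steps of 3.
my @up;
for (my $i = 5; $i <= 20; $i += 3) {
    push @up, $i;
}
print "@up\n";

# Count down from 10 to 0 in steps of 2.
my @down;
for (my $i = 10; $i >= 0; $i -= 2) {
    push @down, $i;
}
print "@down\n";
```

Note that the test condition has to match the direction of travel: the second loop uses >= because its counter is shrinking.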
Now consider what we can do when we utilize the comma
operator:
#!/usr/bin/perl -w
for($i = 0, $v = 10; $i < 10; $i++, $v--){
    print "$i, $v\n";
}
exit;
Here the loop is under control of the $i variable and the $v variable is
just sort of along for the ride. That is, exiting the loop is contingent
upon the value of $i and $v isn't a factor in deciding when to end the loop.
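Now consider this useful trick using a foreach loop. The loop variable $n is
an alias for the current element of @list, so changing $n changes the array
itself: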
#!/usr/bin/perl -w
@list = (1,2,3,4,5);
print "@list\n";
foreach $n (@list){
    $n = $n * 10;
}
print "@list\n";
exit;
Consider this reverse-sorting program:
#!/usr/bin/perl -w
$str = "norman zoolander always wanted to purchase four umbrellas and seven turnips";
foreach $word (reverse sort split(/ /, $str)){
    print "$word ";
}
print "\n";
exit;
Finally, take a look at this little gem:
#!/usr/bin/perl -w
$str = "AGTABAVDADGADEDGATGADKLLPILIGADS";
$pat = "GAD";
$output = "";
# Use <= so that the last possible start position is also checked.
for($i = 0; $i <= length($str) - length($pat); $i++){
    $seg = substr($str, $i, length($pat)); # string, start pt, length
    if($pat eq $seg){
        $output .= "^";    # mark the position where a match starts
    }
    else{
        $output .= " ";
    }
}
print "SEARCH PATTERN: $pat\n";
print "$str\n";
print "$output\n\n";
exit;
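For comparison, the same search can be sketched with Perl's built-in index function instead of substr. This is a different technique from the script above: the for-loop jumps from match to match rather than testing every position, and it collects the match positions in an array (named @hits just for this example) instead of building a marker line.

```perl
#!/usr/bin/perl -w
use strict;

my $str = "AGTABAVDADGADEDGATGADKLLPILIGADS";
my $pat = "GAD";

# index() returns the position of the next match at or after the given
# start position, or -1 when there are no more matches.
my @hits;
for (my $pos = index($str, $pat); $pos != -1; $pos = index($str, $pat, $pos + 1)) {
    push @hits, $pos;
}

print "SEARCH PATTERN: $pat\n";
print "found at positions (counting from 0): @hits\n";
```

Here the three parts of the for-loop do not involve a counter at all: the initializer finds the first match, the test checks whether a match was found, and the last part looks for the next one.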
ASSIGNMENT:
- Write a for-loop (not foreach loop) which outputs the multiples of five
from five to 100
- Write a for-loop (not foreach loop) which outputs the first 15 powers of
two
- Create a written explanation of how a for-loop could be used to calculate the correlation
coefficient