Perl Basics and Biostatistics

False Positives

In this lesson the student will learn how to:

understand what is meant by the term false positive
work with the logical OR and logical AND operators
use the ** operator to raise a base to a power

By the end of this lesson the student will be able to:


  Write a script demonstrating the use of the logical
  AND and the logical OR operators.

Most tests for the presence of a disease are imperfect. Sometimes a person tests positive for a disease without actually having it. That is, the test falsely indicates the presence of the disease. This is known as a false positive. Tests exist for different types of cancer, AIDS, and various other diseases. All of them are less than perfect. For instance, let's say that a man notices some discomfort around his groin. He might go to the doctor, who might have him take a test for prostate cancer (the prostate gland is located just below the bladder). One such test is the PSA (Prostate-Specific Antigen) test. It is not a perfect test, but if it indicates the presence of prostate cancer, further testing is scheduled to determine whether prostate cancer actually exists and how far it has developed.

The Statistics of False Positives

Suppose that a rare disease infects one out of every 1000 people in a population. Also suppose that there is a good, but not perfect, test for this disease. If a person has the disease, the test comes back positive 99% of the time. But this test also produces some false positives: about 2% of uninfected patients also test positive. Let's say some person named Shelly just took the test and the result came back positive. What are the chances that Shelly has the disease?

There are two possible events for us to consider:


  A: patient has the disease
  B: patient tests positive

We can summarize the information about the test's effectiveness like this:


  P(A) = .001          The probability of A is 1 out of 1000. One 
                       person in 1000 has the disease.

  P(B|A) = .99         The probability of a positive test result for 
                       a person who actually has the disease is 99 
                       out of 100.

  P(B|NOT A) = .02     The probability of a false positive, given no 
                       infection, is .02

  P(A|B) = X           The probability of actually having the disease,
                       given a positive test. This is the answer we are
                       looking for.

To find the answer to our question, we create a 2x2 table which divides the sample space into four mutually exclusive events. It displays every possible combination of disease state and test result.


                A                    NOT A
        |-------------------|----------------------|
        |                   |                      |
    B   |    A and B        |     NOT A and B      |
        |                   |                      |
        |-------------------|----------------------|
        |                   |                      |
 NOT B  |   A and NOT B     |   NOT A and NOT B    |
        |                   |                      |
        |-------------------|----------------------|

Let's find the probabilities of each event in the table:


                A                    NOT A             SUM
        |-------------------|----------------------|------------|
        |                   |                      |            |
    B   |    A and B        |     NOT A and B      |   P(B)     |
        |                   |                      |            |
        |-------------------|----------------------|------------|
        |                   |                      |            |
 NOT B  |   A and NOT B     |   NOT A and NOT B    |  P(NOT B)  |
        |                   |                      |            |
        |-------------------|----------------------|------------|
        |                   |                      |            |
   SUM  |     P(A)          |      P(NOT A)        |     1      |
        |                   |                      |            |
        |-------------------|----------------------|------------|

The probabilities in the margins are found by summing across rows and down columns.

Now compute:


    P(A and B) = P(B|A)P(A) = (.99)(.001) = .00099
    P(NOT A and B) = P(B|NOT A)P(NOT A) = (.02)(.999) = .01998

This allows us to fill in a large portion of our table.


                A                    NOT A             SUM
        |-------------------|----------------------|------------|
        |                   |                      |            |
    B   |     .00099        |        .01998        |   .02097   |
        |                   |                      |            |
        |-------------------|----------------------|------------|
        |                   |                      |            |
 NOT B  |   A and NOT B     |   NOT A and NOT B    |  P(NOT B)  |
        |                   |                      |            |
        |-------------------|----------------------|------------|
        |                   |                      |            |
   SUM  |     .001          |         .999         |     1      |
        |                   |                      |            |
        |-------------------|----------------------|------------|

We can find the remaining probabilities by subtracting in the columns, then adding across the rows.


                A                    NOT A             SUM
        |-------------------|----------------------|------------|
        |                   |                      |            |
    B   |     .00099        |        .01998        |   .02097   |
        |                   |                      |            |
        |-------------------|----------------------|------------|
        |                   |                      |            |
 NOT B  |     .00001        |        .97902        |    .97903  |
        |                   |                      |            |
        |-------------------|----------------------|------------|
        |                   |                      |            |
   SUM  |     .001          |         .999         |     1      |
        |                   |                      |            |
        |-------------------|----------------------|------------|
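
The table-filling steps above can be sketched in Perl. This is only an illustration: the helper name build_table and the hash keys are made up for this example, and the three input values are the ones given in the lesson.

```perl
use strict;
use warnings;

# build_table is a hypothetical helper, not part of the lesson.
# It takes P(A), P(B|A) and P(B|NOT A) and returns every cell
# and margin of the 2x2 table as a hash reference.
sub build_table {
    my ($p_A, $p_B_given_A, $p_B_given_notA) = @_;
    my %t;

    # Bottom margin (column totals).
    $t{A}    = $p_A;
    $t{notA} = 1 - $p_A;

    # Top row: multiplication rule, then add across the row.
    $t{A_and_B}    = $p_B_given_A    * $t{A};
    $t{notA_and_B} = $p_B_given_notA * $t{notA};
    $t{B}          = $t{A_and_B} + $t{notA_and_B};

    # Bottom row: subtract within each column, then add across the row.
    $t{A_and_notB}    = $t{A}    - $t{A_and_B};
    $t{notA_and_notB} = $t{notA} - $t{notA_and_B};
    $t{notB}          = $t{A_and_notB} + $t{notA_and_notB};

    return \%t;
}

my $t = build_table(0.001, 0.99, 0.02);

printf "%-6s %8s %8s %8s\n", '', 'A', 'NOT A', 'SUM';
printf "%-6s %8.5f %8.5f %8.5f\n",
    'B', $t->{A_and_B}, $t->{notA_and_B}, $t->{B};
printf "%-6s %8.5f %8.5f %8.5f\n",
    'NOT B', $t->{A_and_notB}, $t->{notA_and_notB}, $t->{notB};
printf "%-6s %8.5f %8.5f %8.5f\n",
    'SUM', $t->{A}, $t->{notA}, $t->{B} + $t->{notB};
```

The printed cells should agree with the completed table, up to rounding at five decimal places.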

From this we directly derive:


                P(A and B)     .00099
     P(A|B) =  ------------ = --------- = .0472
                   P(B)        .02097
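
The same division can be wrapped in a small Perl subroutine. This is a minimal sketch: the name posterior and its parameter names are assumptions made for this example, not anything defined in the lesson.

```perl
use strict;
use warnings;

# posterior is a hypothetical helper.
# It returns P(A|B) = P(A and B) / P(B), where
#   P(A and B) = P(B|A) * P(A)
#   P(B)       = P(A and B) + P(B|NOT A) * P(NOT A)
sub posterior {
    my ($p_A, $p_B_given_A, $p_B_given_notA) = @_;
    my $p_A_and_B    = $p_B_given_A    * $p_A;
    my $p_notA_and_B = $p_B_given_notA * (1 - $p_A);
    return $p_A_and_B / ($p_A_and_B + $p_notA_and_B);
}

printf "P(A|B) = %.4f\n", posterior(0.001, 0.99, 0.02);   # prints P(A|B) = 0.0472
```

Calling posterior with other values for P(A) shows how strongly the answer depends on how rare the disease is.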

Despite the high accuracy of the test, fewer than 5% of those who test positive actually have the disease! This is called the FALSE POSITIVE PARADOX. The next table shows what happens, on average, in a group of 1000 patients, with the counts rounded to whole people. About 21 people will test positive, and only one of them actually has the disease. The other twenty positive results belong to uninfected people!


            Disease           No Disease
         |--------------|-----------------|-----|
   TESTS |              |                 |     |
 POSITIVE|     1        |       20        |  21 |
         |              |                 |     |
         |--------------|-----------------|-----|
   TESTS |              |                 |     |
 NEGATIVE|     0        |      979        |  979|
         |              |                 |     |
         |--------------|-----------------|-----|
         |     1        |      999        | 1000|
         |--------------|-----------------|-----|
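
The counts in this table are rounded expected values. As a rough sketch, they could be reproduced in Perl like this. The helper name expected_counts and the hash keys are invented for this example.

```perl
use strict;
use warnings;

# expected_counts is a hypothetical helper.
# It returns the average number of patients in each cell of the
# table for a group of $n people, before rounding.
sub expected_counts {
    my ($n, $p_A, $p_B_given_A, $p_B_given_notA) = @_;
    my $sick    = $n * $p_A;      # people who have the disease
    my $healthy = $n - $sick;     # people who do not
    my %c = (
        true_pos  => $sick    * $p_B_given_A,
        false_neg => $sick    * (1 - $p_B_given_A),
        false_pos => $healthy * $p_B_given_notA,
        true_neg  => $healthy * (1 - $p_B_given_notA),
    );
    return \%c;
}

my $c = expected_counts(1000, 0.001, 0.99, 0.02);
for my $cell (qw(true_pos false_pos false_neg true_neg)) {
    # %.0f rounds the expected value to a whole number of people.
    printf "%-10s %7.2f  (about %.0f)\n", $cell, $c->{$cell}, $c->{$cell};
}
```

The unrounded values make it clear that the 0 in the table stands for an expected 0.01 patients, not for an impossibility.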

So we see that our test yields far more false positives than true positives. So, is this test useless? NO! This test would be very useful for screening purposes. Without the test, the chance that an individual carries the disease is 1 out of 1000. A person who tests positive has about a 1 out of 21 chance of having the disease. So a positive on this test indicates that the person should take another, probably more expensive and more accurate, test.

Perl Information

Consider this Perl script:

    #!/usr/bin/perl -w
    use strict;

    print "INPUT BASE: ";
    my $base = <STDIN>;
    chomp($base);    # gets rid of newline character

    print "INPUT EXPONENT: ";
    my $exponent = <STDIN>;
    chomp($exponent);

    my $result = $base ** $exponent;
    print "$base raised to the $exponent power equals $result\n\n";
    exit;

The ** operator raises a base to the exponent value.
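
A few more uses of ** are sketched below, without any keyboard input. The subroutine name power is made up for this example. The comments describe how Perl groups the expressions.

```perl
use strict;
use warnings;

# power is a hypothetical helper that simply wraps the ** operator.
sub power {
    my ($base, $exponent) = @_;
    return $base ** $exponent;
}

print power(2, 10), "\n";     # prints 1024
print power(9, 0.5), "\n";    # prints 3 (an exponent of 0.5 takes a square root)

# ** groups from right to left, so this is 2 ** (3 ** 2), not (2 ** 3) ** 2.
my $right_to_left = 2 ** 3 ** 2;
print "$right_to_left\n";     # prints 512

# ** binds more tightly than a leading minus sign, so this is -(2 ** 2).
my $negated = -2 ** 2;
print "$negated\n";           # prints -4
```

When in doubt about grouping, parentheses make the intended order explicit.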

Logical Operators

Consider the following Perl script:

    #!/usr/bin/perl -w
    use strict;

    # $x, $y, $z are used because Perl reserves $a and $b for sort.
    my $x = 1;    # true
    my $y = 1;
    my $z = 0;    # false

    my $t1 = $x && $y;
    my $t2 = $x && $z;
    print "TEST ONE: $x AND $y = $t1\n";
    print "TEST TWO: $x AND $z = $t2\n";

    my $t3 = $x || $y;
    my $t4 = $x || $z;
    print "TEST THREE: $x OR $y = $t3\n";
    print "TEST FOUR: $x OR $z = $t4\n";
    exit;

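The && operator is the logical AND: its result is true only when both operands are true. The || operator is the logical OR: its result is true when at least one operand is true.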
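
These operators map directly onto the events in the 2x2 table from earlier in the lesson. The sketch below is only an illustration: the subroutine names classify and a_or_b are invented here, and it also uses the ! (logical NOT) operator, which the lesson has not introduced.

```perl
use strict;
use warnings;

# classify is a hypothetical helper. Both arguments are 1 (true) or 0 (false).
# It names the table cell that a patient falls into.
sub classify {
    my ($has_disease, $tests_positive) = @_;
    return 'true positive'  if  $has_disease &&  $tests_positive;   # A and B
    return 'false positive' if !$has_disease &&  $tests_positive;   # NOT A and B
    return 'false negative' if  $has_disease && !$tests_positive;   # A and NOT B
    return 'true negative';                                         # NOT A and NOT B
}

# a_or_b is a hypothetical helper for the event "A or B":
# the patient has the disease, tests positive, or both.
sub a_or_b {
    my ($has_disease, $tests_positive) = @_;
    return ($has_disease || $tests_positive) ? 1 : 0;
}

print classify(0, 1), "\n";   # prints false positive
print a_or_b(0, 1), "\n";     # prints 1
```

Each of the four return values corresponds to exactly one cell of the table, which is why the four events are mutually exclusive.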

ASSIGNMENT:

Write a script which shows the results of all four possible combinations of true and false for the AND and the OR operators. (Remember that 1 means true and 0 means false.)