Perl Basics and Biostatistics

Introduction

In this lesson the student will learn how to:
  1. Write and execute a simple Perl script
  2. Use the print statement
  3. Identify and use the newline character
  4. Locate and write the bang line in a Perl script
  5. Associate common terms with statistics
  6. Explain in general terms how statistics can be useful

By the end of this lesson the student will be able to:

    Write a short Perl script which prints out strings 
    containing common words associated with statistics.

The name Perl is commonly expanded as Practical Extraction and Report Language. It is a widely used language. The languages which are used as much as or more than Perl include C/C++, Java, and Python. (We could also mention Ruby, PHP, JavaScript, and Visual Basic as close contenders.) Perl is an extremely useful language, as you will see in the following lessons.

Follow these steps to get your first Perl script up and running:

  1. Open JOE for a file called test.pl and enter the following:
    #!/usr/bin/perl
    print "HELLO FROM PERL\n";
    print "THIS IS A SIMPLE PERL SCRIPT\n";
    exit;
  2. Save the file and run the following command at the command line:
    chmod +x test.pl
  3. Next you are ready to run or execute the script. Here's how you do it:
    ./test.pl
    The ./ indicates that the file is in the current directory.

The first line of this little script is called the bang line because it begins with #!, a number sign followed by an exclamation mark (sometimes referred to as a bang by people who work with computers). This line specifies the location of the Perl interpreter. The next two lines are print commands, which simply output the characters between the quotation marks. These lines end with semicolons (as do most statements in a Perl script). The script will work fine without the last line (the one that says exit;), but it is good practice to use the exit command nonetheless.

You will notice the use of \n at the end of each string in the print commands. This symbol stands for newline. Try deleting these from the strings in the test program and see what happens.
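
The effect of the newline character can be seen in a short sketch like the one below. It uses only print commands. The two use lines are optional safety checks that are not covered in this lesson, and the exit command is left out here only for brevity.

```perl
#!/usr/bin/perl
use strict;      # optional: catches common mistakes
use warnings;    # optional: reports questionable code

# Without \n, the next print continues on the same line.
print "HELLO FROM PERL";
print "THIS IS A SIMPLE PERL SCRIPT";
print "\n";

# With \n, each string ends its own line.
print "HELLO FROM PERL\n";
print "THIS IS A SIMPLE PERL SCRIPT\n";

# A single string may contain more than one newline.
print "MEAN\nMEDIAN\nRANGE\n";
```

The first two strings should appear run together on one line. Everything after that should appear one item per line.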


IMPORTANT NOTE: The Perl interpreter comes as a standard part of most Linux installations. It is most often located in the /usr/bin directory. It COULD be located elsewhere; /usr/local/bin, for instance, is a frequent alternate location. Perl does not come as standard on the Windows platform; you must install it yourself. The documentation for the Windows platform installation should give you information about the path to the Perl interpreter. Macintosh users running OS X will most likely find that Perl is available.
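
If you are unsure where the interpreter lives on your system, Perl itself can report it. The sketch below relies on two of Perl's built-in special variables, $^X and $], which are not covered in this lesson. On some systems $^X may hold a relative name rather than a full path, so treat the output as a hint rather than a guarantee.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# $^X holds the name used to start the running Perl interpreter
# (often, but not always, its full path).
print "PERL INTERPRETER: ", $^X, "\n";

# $] holds the version number of the running Perl interpreter.
print "PERL VERSION: ", $], "\n";
```

If the bang line in this sketch does not match your system, the script can still be run by passing its file name to the perl command instead of using ./ as shown earlier.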

Bioinformatics and Biostatistics

You've probably heard terms such as average, mean, standard deviation, median, range, probability, odds, interval, ratio, sensitivity, correlation, and prevalence. These are all terms which are part of the vocabulary of statistics. In this unit we will introduce you to statistics and show you a little about how statistics can be applied to biological data (and also how Perl can help you with this endeavor).
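
As a small preview of how Perl can help with statistics, the sketch below computes three of the terms above (mean, median, and range) for a short list of numbers. The data values and the subroutine names mean, median, and value_range are invented for illustration. The script also uses features (variables, arrays, and subroutines) that have not been covered yet, so read it as an illustration rather than something you are expected to write now.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample: seven made-up measurements.
my @data = (72, 65, 80, 68, 75, 90, 61);

# Mean (average): the sum of the values divided by how many there are.
sub mean {
    my @values = @_;
    my $sum = 0;
    $sum += $_ for @values;
    return $sum / scalar(@values);
}

# Median: the middle value once the data are sorted numerically;
# for an even count, the mean of the two middle values.
sub median {
    my @sorted = sort { $a <=> $b } @_;
    my $mid = int(scalar(@sorted) / 2);
    return scalar(@sorted) % 2
        ? $sorted[$mid]
        : ($sorted[$mid - 1] + $sorted[$mid]) / 2;
}

# Range: the largest value minus the smallest.
sub value_range {
    my @sorted = sort { $a <=> $b } @_;
    return $sorted[-1] - $sorted[0];
}

print "MEAN: ", mean(@data), "\n";
print "MEDIAN: ", median(@data), "\n";
print "RANGE: ", value_range(@data), "\n";
```

Run as written, the script should print MEAN: 73, MEDIAN: 72, and RANGE: 29 on separate lines.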

You've probably heard of the field of bioinformatics. Bioinformatics is all about making sense of biological data. Probably the most important tool used in bioinformatics is the computer. The two computer languages used most in the field of bioinformatics are Perl and Java (with C/C++ and Python also being very important). In this unit we will limit our discussion to biostatistics. Biostatistics is all about making sense of the statistics generated through the analysis of biological data. Biostatistics is a sub-discipline of bioinformatics. Computers and computer languages are also very important for performing biostatistical analysis.

One thing is certain about biological data: there is a lot of it. In fact, there is a HUGE amount of data being generated on a daily basis. The sheer fact that one human genome contains 3.1 billion base pairs (something we will discuss at great length in unit two) and that there are billions of humans (all with slightly different genomes) should help you to get some perspective on the enormous amount of information which could be gathered just on the contents of the human genome. Add to that the enormous number of species of living things (which all have genomes of their own) which could be studied, and you get even more data which could potentially be analyzed.

Biology isn't just about the study of genomes, however. Biologists conduct experiments dealing with the effects of potential pharmaceuticals on living organisms, how proteins are constructed, the structure of various microorganisms, biological pathways in the body, the function of various systems within the body, and many other interesting topics. All of these studies yield information: usually lots of information. So, again we see the importance of bioinformatics and its sub-field, biostatistics.

ASSIGNMENT:

You will write a simple Perl script which produces four lines of output. Each line will contain at least four common words associated with the field of statistics.