String Processing
Pattern Matching
In this lesson the student will learn how to:
- Use pattern matching with conditionals
- Use the g option to force exhaustive pattern matching
- Count using the pattern matching construct
By the end of this lesson the student will be able to:
Write a script which counts the number of each type of
nucleotide in a DNA sequence.
Nucleotides and Amino Acids
Proteins are chains of amino acids, and such chains are often referred to as
polypeptides. As you know, most genes code for proteins. Genes are stored as
DNA, transcribed to RNA, and then translated to amino acids. The relationship
between DNA and RNA is one to one: each nucleotide in a DNA chain is
transcribed to exactly one nucleotide in an RNA chain. The relationship
between RNA and amino acids is three to one: it takes three RNA nucleotides
to code for a single amino acid. A set of three RNA nucleotides is called a
codon.
Here are a few examples:
AMINO ACID     | CODON(S)                     |
ARGININE (R)   | CGU, CGC, CGA, CGG, AGA, AGG |
ISOLEUCINE (I) | AUU, AUC, AUA                |
GLYCINE (G)    | GGU, GGC, GGA, GGG           |
METHIONINE (M) | AUG                          |
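The table above can be sketched as a Perl hash that maps each codon to its one-letter amino acid code. This is only an illustration: the hash covers just the four amino acids listed, and the names %amino_for and translate are made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Partial codon table: only the four amino acids from the table above.
my %amino_for = (
    CGU => 'R', CGC => 'R', CGA => 'R', CGG => 'R', AGA => 'R', AGG => 'R',
    AUU => 'I', AUC => 'I', AUA => 'I',
    GGU => 'G', GGC => 'G', GGA => 'G', GGG => 'G',
    AUG => 'M',
);

# translate() is a hypothetical helper. It splits an RNA string into
# three-letter codons and looks each one up, printing '?' for any codon
# that is not in the partial table.
sub translate {
    my ($rna) = @_;
    my @codons = unpack '(A3)*', uc $rna;
    return join '', map { $amino_for{$_} // '?' } @codons;
}

print translate('augcguauaggg'), "\n";   # prints MRIG
```

Because several codons share the same value in the hash, the redundancy in the table costs nothing extra at lookup time.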
If you do a little math, you will discover that there are 64 possible codons,
since any of four nucleotides can be used in each of the three positions in a
codon (4x4x4=64). Since there are only 20 amino acids, most amino acids are
coded for by more than one codon, and so the fact that six codons code for
arginine is perfectly reasonable. You will notice that only one codon codes
for methionine. Three of the 64 codons (uaa, uag, uga) are stop codons, which
do not code for an amino acid, and there is one start codon (aug). The aug
codon also specifies methionine, as you may have noticed.
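To check the arithmetic, here is a small sketch that builds every possible codon with three nested loops and then counts the stop codons. The names @bases, %is_stop and all_codons are chosen just for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @bases   = qw(u c a g);
my %is_stop = map { $_ => 1 } qw(uaa uag uga);

# Build every three-letter combination of the four bases.
sub all_codons {
    my @codons;
    for my $first (@bases) {
        for my $second (@bases) {
            for my $third (@bases) {
                push @codons, "$first$second$third";
            }
        }
    }
    return @codons;
}

my @codons = all_codons();
my @stops  = grep { $is_stop{$_} } @codons;

print "TOTAL CODONS: "  . @codons . "\n";               # 64
print "STOP CODONS: "   . @stops . "\n";                # 3
print "CODING CODONS: " . (@codons - @stops) . "\n";    # 61
```

That leaves 61 codons to share among the 20 amino acids.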
#!/usr/bin/perl
my $seq = "auaucgucgaugaccgauuaucagcuaucuauguaugaucaucacaauaucguag";
if(@cnt = $seq =~ /AUC/){
    print "FOUND " . @cnt . " times.\n";
}
else{
    print "NOT PRESENT\n";
}
This first example prints out "NOT PRESENT" because pattern matching is case
sensitive by default, and the uppercase pattern "AUC" does not occur in the
lowercase $seq. A slight modification and we get a match:
#!/usr/bin/perl
my $seq = "auaucgucgaugaccgauuaucagcuaucuauguaugaucaucacaauaucguag";
if(@cnt = $seq =~ /AUC/i){
    print "FOUND " . @cnt . " times.\n";
}
else{
    print "NOT PRESENT\n";
}
Now we get "FOUND 1 times" as output. The i modifier means "ignore case" and
so the pattern "AUC" will now match any "auc" in the search string. If you
look through $seq carefully, however, you will notice that "auc" occurs more
than once! So, we make the following very slight modification to fix that.
#!/usr/bin/perl
my $seq = "auaucgucgaugaccgauuaucagcuaucuauguaugaucaucacaauaucguag";
if(@cnt = $seq =~ /AUC/ig){
    print "FOUND " . @cnt . " times.\n";
}
else{
    print "NOT PRESENT\n";
}
The g modifier stands for global. With it, the match returns every
non-overlapping occurrence of the pattern instead of stopping at the first
one, so @cnt now holds all six matches and the script reports "FOUND 6 times".
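If you only want the number and have no use for the array, a common Perl idiom assigns the match to an empty list first. The sketch below wraps that idiom in a helper called count_matches, a name invented for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# count_matches() is a hypothetical helper. The empty list "()" forces
# list context on the match, and assigning that list assignment to a
# scalar gives the number of matches found.
sub count_matches {
    my ($seq, $pattern) = @_;
    my $count = () = $seq =~ /$pattern/ig;
    return $count;
}

my $seq = "auaucgucgaugaccgauuaucagcuaucuauguaugaucaucacaauaucguag";
print "FOUND " . count_matches($seq, 'AUC') . " times.\n";   # FOUND 6 times.
```

Storing the matches in an array, as the scripts above do, works just as well. The idiom simply skips the intermediate variable.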
Here's another pattern matching example:
#!/usr/bin/perl
$input="";
while($input ne "q"){
    print "INPUT: ";
    $input = <STDIN>;    # read one line from the keyboard
    chomp $input;        # remove the trailing newline
    if($input =~ /auc/i){
        print "ISOLEUCINE\n";
    }
}
In this example, every time the user enters "auc" or anything containing
"auc", regardless of case, the word "ISOLEUCINE" will be printed. The loop
will continue until the user enters "q".
There are three sets of three nucleotides which all code for isoleucine:
auu, auc, aua. A slight modification to the previous script will catch all
three in one if statement:
#!/usr/bin/perl
$input="";
while($input ne "q"){
    print "INPUT: ";
    $input = <STDIN>;    # read one line from the keyboard
    chomp $input;        # remove the trailing newline
    if($input =~ /auc|auu|aua/i){
        print "ISOLEUCINE\n";
    }
}
The | symbol means "or" and so the if statement is saying something like "If
$input contains auc OR auu OR aua then print ISOLEUCINE."
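Since the three isoleucine codons differ only in their last letter, the same test can also be written with a character class. The sketch below compares the two forms on a few sample inputs. The sub names are made up for this example, and the sample inputs are arbitrary.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Two hypothetical helpers that should always agree with each other.
sub has_isoleucine_alt   { return $_[0] =~ /auc|auu|aua/i ? 1 : 0 }
sub has_isoleucine_class { return $_[0] =~ /au[cua]/i     ? 1 : 0 }

# The first three inputs contain an isoleucine codon; the last two do not.
for my $input (qw(AUC auu gAUAc aug ggg)) {
    printf "%-6s alternation=%d class=%d\n",
        $input, has_isoleucine_alt($input), has_isoleucine_class($input);
}
```

The character class [cua] matches exactly one character that is c, u or a, so /au[cua]/ is a shorter way to write the same three alternatives.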
This last example shows how you can count using pattern matching:
#!/usr/bin/perl
$dna = "The fat rat ate all the almonds.";
@aaa = ( $dna =~ /a/g );
print $dna . "\n";
print "COUNT: " . @aaa . "\n";
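Without the g option you will not get a count, since without the g the first
match ends the search. The g option forces an exhaustive search. The array
@aaa holds one element for each match, and using the array in a scalar
context, as the print statement does, gives the number of elements, which is
the count.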
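To see the difference the g option makes, here is a short sketch that runs the same match both ways. The helper names are invented for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helpers: each returns the size of the list produced by
# the match, once without and once with the g option.
sub size_without_g {
    my ($text) = @_;
    my @found = ( $text =~ /a/ );     # no g: the list holds the single value 1
    return scalar @found;
}

sub size_with_g {
    my ($text) = @_;
    my @found = ( $text =~ /a/g );    # g: the list holds one "a" per match
    return scalar @found;
}

my $text = "The fat rat ate all the almonds.";
print "WITHOUT g: " . size_without_g($text) . "\n";   # WITHOUT g: 1
print "WITH g: "    . size_with_g($text) . "\n";      # WITH g: 5
```

The scalar function makes the conversion from array to count explicit. Joining an array to a string with the . operator does the same conversion implicitly.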
By the way, we can place an m at the beginning of the pattern matching
construct:
$dna =~ m/a/g;
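One reason to write the m is that it lets you choose delimiters other than the slash, which is handy when the pattern itself contains slashes. A brief sketch, with variable names chosen just for this illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $dna = "The fat rat ate all the almonds.";

# The same match written with three different delimiters.
my @slashes = ( $dna =~ m/a/g );
my @braces  = ( $dna =~ m{a}g );
my @hashes  = ( $dna =~ m#a#g );

print "COUNTS: " . @slashes . " " . @braces . " " . @hashes . "\n";   # COUNTS: 5 5 5
```

When the delimiter is a slash, the m is optional, which is why the earlier examples leave it out.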
ASSIGNMENT:
Write a script which counts all the instances of each nucleotide in a
nucleotide sequence entered by the user. Place this in a while loop so that
the user can get counts on as many sequences as desired. Also make sure a
special message is printed whenever an inappropriate nucleotide is
encountered.