String Processing

Pattern Matching

In this lesson the student will learn how to:

Use pattern matching with conditionals
Use the g option to force exhaustive pattern matching
Count using the pattern matching construct

By the end of this lesson the student will be able to:

Write a script which counts the number of each type of nucleotide in a DNA sequence.

Nucleotides and Amino Acids

Proteins are basically just chains of amino acids. Chains of amino acids are often referred to as polypeptides. As you know, most genes code for proteins. Genes are stored as DNA, then transcribed to RNA, and then translated to amino acids. The relationship between DNA and RNA is one to one. That is, one nucleotide in a DNA chain is directly transcribed to another nucleotide in an RNA chain. The relationship between RNA and an amino acid is three to one. That is, it takes three nucleotides of RNA to code for a single amino acid. A set of three RNA nucleotides is called a codon. Here are a few examples:

AMINO ACID        CODON(S)
ARGININE (R)      CGU, CGC, CGA, CGG, AGA, AGG
ISOLEUCINE (I)    AUU, AUC, AUA
GLYCINE (G)       GGU, GGC, GGA, GGG
METHIONINE (M)    AUG
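
The table above can be turned into a lookup in Perl. The following is only a sketch: the hash names and the helper translate_partial are made up for this illustration, and the table covers just the four amino acids listed here, so any other codon is reported as "?".

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Partial codon table: only the four amino acids from the table above.
my %codons_for = (
    R => [qw(CGU CGC CGA CGG AGA AGG)],   # arginine
    I => [qw(AUU AUC AUA)],               # isoleucine
    G => [qw(GGU GGC GGA GGG)],           # glycine
    M => [qw(AUG)],                       # methionine
);

# Invert the table so that each codon points to its one-letter code.
my %amino_acid_for;
for my $aa (keys %codons_for) {
    $amino_acid_for{$_} = $aa for @{ $codons_for{$aa} };
}

# Hypothetical helper: translate an RNA string three nucleotides at a time.
# Codons missing from the partial table are shown as '?'.
sub translate_partial {
    my ($rna) = @_;
    my $upper  = uc $rna;
    my @codons = $upper =~ /(...)/g;      # consecutive groups of three
    return join '', map { exists $amino_acid_for{$_} ? $amino_acid_for{$_} : '?' } @codons;
}

print translate_partial("augcgaauuggc"), "\n";   # prints MRIG
```

Any one or two nucleotides left over at the end of the string are simply ignored by this sketch.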

If you do a little math, you will discover that there are 64 possible codons, since any of four nucleotides can be used in each of the three positions in a codon (4x4x4=64). Since there are only 20 amino acids, several codons can code for the same amino acid, so the fact that six codons code for arginine is perfectly reasonable. You will notice that only one codon codes for methionine. Besides the codons for the 20 amino acids there are three stop codons (uaa, uag, uga) and one start codon (aug). The aug codon also specifies methionine, as you may have noticed.
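
The 4x4x4 arithmetic can be checked with a few nested loops. This is a small sketch, with variable names chosen only for illustration; it builds every possible codon and then uses a pattern match to pick out the stop codons listed in the text.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @nucleotides = qw(a c g u);

# Build every three-letter combination of the four RNA nucleotides.
my @all_codons;
for my $first (@nucleotides) {
    for my $second (@nucleotides) {
        for my $third (@nucleotides) {
            push @all_codons, "$first$second$third";
        }
    }
}

# Keep only the codons that are exactly uaa, uag or uga.
my @stop_codons = grep { /^(?:uaa|uag|uga)$/ } @all_codons;

print "TOTAL CODONS: " . @all_codons . "\n";    # prints TOTAL CODONS: 64
print "STOP CODONS: @stop_codons\n";            # prints STOP CODONS: uaa uag uga
```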

#!/usr/bin/perl
my $seq = "auaucgucgaugaccgauuaucagcuaucuauguaugaucaucacaauaucguag";
if (@cnt = $seq =~ /AUC/) {
    print "FOUND " . @cnt . " times.\n";
}
else {
    print "NOT PRESENT\n";
}

This first example prints "NOT PRESENT" because pattern matching is case-sensitive by default: the uppercase pattern "AUC" never matches the lowercase text in $seq. A slight modification and we get a match:

#!/usr/bin/perl
my $seq = "auaucgucgaugaccgauuaucagcuaucuauguaugaucaucacaauaucguag";
if (@cnt = $seq =~ /AUC/i) {
    print "FOUND " . @cnt . " times.\n";
}
else {
    print "NOT PRESENT\n";
}

Now we get "FOUND 1 times." as output. The i modifier means "ignore case", so the pattern "AUC" now matches any "auc" in the search string. Without the g modifier, however, the search stops at the first match and @cnt receives a single element, so the reported count is always 1. If you look through $seq carefully, you will notice that "auc" occurs more than once! So, we make the following very slight modification to fix that.

#!/usr/bin/perl
my $seq = "auaucgucgaugaccgauuaucagcuaucuauguaugaucaucacaauaucguag";
if (@cnt = $seq =~ /AUC/ig) {
    print "FOUND " . @cnt . " times.\n";
}
else {
    print "NOT PRESENT\n";
}

The g modifier fixes this. The g stands for global: the search continues through the whole string and returns every non-overlapping match, so @cnt now holds all six occurrences of "auc" and the script prints "FOUND 6 times."
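
To see the effect of g side by side, the sketch below runs the same pattern on $seq with and without it. The variable names are chosen only for this illustration. It also shows one common Perl idiom for getting the count directly into a scalar, by assigning the match to an empty list first.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $seq = "auaucgucgaugaccgauuaucagcuaucuauguaugaucaucacaauaucguag";

# Without g: a successful match returns the one-element list (1).
my @without_g = $seq =~ /AUC/i;

# With g: the list has one element for every non-overlapping match.
my @with_g = $seq =~ /AUC/ig;

# Assigning to an empty list and then to a scalar gives the number of matches.
my $count = () = $seq =~ /AUC/ig;

print "WITHOUT g: " . @without_g . "\n";   # prints WITHOUT g: 1
print "WITH g: " . @with_g . "\n";         # prints WITH g: 6
print "COUNT: $count\n";                   # prints COUNT: 6
```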

Here's another pattern matching example:

#!/usr/bin/perl
$input = "";
while ($input ne "q") {
    print "INPUT: ";
    $input = <STDIN>;
    last unless defined $input;    # stop cleanly at end of input
    chomp $input;
    if ($input =~ /auc/i) {
        print "ISOLEUCINE\n";
    }
}

In this example, every time the user enters "auc" or anything containing "auc", regardless of case, the word "ISOLEUCINE" will be printed. The loop will continue until the user enters "q".

There are three sets of three nucleotides which all code for isoleucine: auu, auc, aua. A slight modification to the previous script will catch all three in one if statement:

#!/usr/bin/perl
$input = "";
while ($input ne "q") {
    print "INPUT: ";
    $input = <STDIN>;
    last unless defined $input;    # stop cleanly at end of input
    chomp $input;
    if ($input =~ /auc|auu|aua/i) {
        print "ISOLEUCINE\n";
    }
}

The | symbol means "or", so the if statement is saying something like "If $input contains auc OR auu OR aua, then print ISOLEUCINE."
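
Since the three isoleucine codons differ only in their last letter, the same test can also be written with a character class instead of |. This is an alternative sketch rather than part of the lesson's script, and the helper name mentions_isoleucine is made up for the illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the text contains auc, auu or aua
# (in any case), otherwise 0. [cua] matches exactly one of c, u or a.
sub mentions_isoleucine {
    my ($text) = @_;
    return $text =~ /au[cua]/i ? 1 : 0;
}

for my $input (qw(AUC auu xxAUAxx aug)) {
    my $verdict = mentions_isoleucine($input) ? "ISOLEUCINE" : "no match";
    print "$input: $verdict\n";
}
```

The first three inputs print ISOLEUCINE; "aug" does not, because g is not in the character class.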

This last example shows how you can count using pattern matching:

#!/usr/bin/perl
$dna = "The fat rat ate all the almonds.";
@aaa = ( $dna =~ /a/g );
print $dna . "\n";
print "COUNT: " . @aaa . "\n";

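Without the g option you will not get a count, since without g the first match ends the search. The g option forces an exhaustive search. The matches themselves are stored in the array @aaa; joining @aaa to a string in the print statement evaluates it in scalar context, which yields the number of elements, that is, the count.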
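
The count can also be built up one match at a time. In the sketch below, which uses illustrative variable names, the match is used as a while condition; with g, each pass through the loop finds the next match, and the built-in pos function reports where the search currently stands in the string.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "The fat rat ate all the almonds.";

my $count = 0;
my @positions;
while ($text =~ /a/g) {
    $count++;
    # pos() is the offset just past the match, so subtract the match length (1).
    push @positions, pos($text) - 1;
}

print "COUNT: $count\n";             # prints COUNT: 5
print "POSITIONS: @positions\n";     # prints POSITIONS: 5 9 12 16 24
```

Positions are counted from 0, so the "a" in "fat" is at position 5.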

By the way, we can place an m (for "match") at the beginning of the pattern matching construct. With slash delimiters the m is optional, so the following is equivalent to /a/g:

$dna =~ m/a/g;
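
One reason to write the m explicitly is that it allows delimiters other than slashes. The short sketch below, with variable names chosen for illustration, checks that all three spellings find the same matches.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "The fat rat ate all the almonds.";

my @slashes = $text =~ /a/g;     # m omitted, slash delimiters
my @with_m  = $text =~ m/a/g;    # explicit m, same pattern
my @braces  = $text =~ m{a}g;    # m is required with other delimiters

print "SAME COUNT\n" if @slashes == @with_m && @with_m == @braces;
```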

ASSIGNMENT:

Write a script which counts all the instances of each nucleotide in a nucleotide sequence entered by the user. Place this in a while loop so that the user can get counts on as many sequences as desired. Also make sure a special message is printed whenever an inappropriate nucleotide is encountered.