String Processing
Regular Expressions Revisited
In this lesson the student will learn how to:
- Specify whether matching will be greedy or minimal
- Do calculations within regular expressions
- Use regular expressions to represent codons
By the end of this lesson the student will be able to:
Utilizing regular expressions, write a program which
translates a nucleotide sequence into an amino acid
sequence.
The Genetic Code
The genetic code is not exactly the same in every organism, or even in every
part of a cell. In fact, the genetic code used in human mitochondria is
slightly different from that used in the nucleus. The human nuclear genetic
code looks like this (each line gives an amino acid's one-letter symbol
followed by its codons; a period marks the stop codons):
A GCU GCC GCA GCG
R CGU CGC CGA CGG AGA AGG
N AAU AAC
D GAU GAC
C UGU UGC
Q CAA CAG
E GAA GAG
G GGU GGC GGA GGG
H CAU CAC
I AUU AUC AUA
L UUA UUG CUU CUC CUA CUG
K AAA AAG
M AUG
F UUU UUC
P CCU CCC CCA CCG
S UCU UCC UCA UCG AGU AGC
T ACU ACC ACA ACG
W UGG
Y UAU UAC
V GUU GUC GUA GUG
. UAA UAG UGA
The human mitochondrial genetic code is very similar:
A GCU GCC GCA GCG
R CGU CGC CGA CGG
N AAU AAC
D GAU GAC
C UGU UGC
Q CAA CAG
E GAA GAG
G GGU GGC GGA GGG
H CAU CAC
I AUU AUC
L UUA UUG CUU CUC CUA CUG
K AAA AAG
M AUG AUA
F UUU UUC
P CCU CCC CCA CCG
S UCU UCC UCA UCG AGU AGC
T ACU ACC ACA ACG
W UGG UGA
Y UAU UAC
V GUU GUC GUA GUG
. UAA UAG AGA AGG
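Comparing the two tables line by line shows that they differ in only four
codons. The following is a minimal sketch that records those differences in
a Perl hash, taken directly from the tables above. The hash name
%mito_differs and its layout are made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Codons whose meaning differs between the two tables above.
# Each entry is [nuclear, mitochondrial]; '.' marks a stop codon,
# as in the tables. %mito_differs is a hypothetical name for this sketch.
my %mito_differs = (
    AUA => [ 'I', 'M' ],
    UGA => [ '.', 'W' ],
    AGA => [ 'R', '.' ],
    AGG => [ 'R', '.' ],
);

# Print one line per differing codon, in alphabetical order.
foreach my $codon (sort keys %mito_differs) {
    my ($nuclear, $mito) = @{ $mito_differs{$codon} };
    print "$codon: nuclear $nuclear, mitochondrial $mito\n";
}
```

Every other codon has the same meaning in both tables.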
More Information On Regular Expressions
Entire books are devoted to nothing but regular expressions, so a few
lessons cannot cover everything there is to know about them, or even very
much of it. Here are a few more examples of things you can do with regular
expressions:
Minimal vs. Greedy Matching
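#!/usr/bin/perl
use strict;
use warnings;
my $m = "Mississippi";
# greedy: .* takes as many characters as it can
$m =~ /i(.*)i/;
print $1 . "\n";
# minimal: .*? takes as few characters as it can
$m =~ /i(.*?)i/;
print $1 . "\n";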
When you run this script, you will notice that the question mark following
the .* changes the behavior of the regular expression. Rather than matching
the largest possible sequence beginning and ending with i, it matches the
shortest.
(QUESTION: Did it match the first pair of s's or the second pair?)
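The same difference shows up whenever a delimiter occurs more than once in a
string. As a small sketch on a different, made-up input string (so that the
question above is left for you to answer), compare what each pattern
captures between double quotes:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up input string for this sketch.
my $text = 'name="Mark" friend="Jill"';

# Greedy: .* runs on to the LAST double quote in the string.
my ($greedy) = $text =~ /"(.*)"/;
print "greedy:  $greedy\n";     # greedy:  Mark" friend="Jill

# Minimal: .*? stops at the FIRST double quote that completes the match.
my ($minimal) = $text =~ /"(.*?)"/;
print "minimal: $minimal\n";    # minimal: Mark
```

In both cases the match begins at the leftmost opening quote. Only the
amount of text taken after that point differs.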
Calculations Using Regular Expressions
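#!/usr/bin/perl
use strict;
use warnings;
my $sentence = "Mark had 5 dollars and Jill had 15 dollars.";
print $sentence . "\n";
# /g replaces every number; /e evaluates the replacement as Perl code
$sentence =~ s/(\d+)/($1*5)/ge;
print $sentence . "\n";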
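The calculation happens because of the e modifier, which tells Perl to treat
the replacement as code to evaluate rather than as a string. The g modifier
applies the substitution to every match. A short sketch of the contrast,
using the same sentence (the variable names here are chosen only for this
illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "Mark had 5 dollars and Jill had 15 dollars.";

# Without /e the replacement is an ordinary string, so $1 is
# interpolated but the multiplication is never carried out.
(my $plain = $text) =~ s/(\d+)/($1*5)/g;
print "$plain\n";       # Mark had (5*5) dollars and Jill had (15*5) dollars.

# With /e the replacement is evaluated as Perl code for each match.
(my $evaluated = $text) =~ s/(\d+)/$1 * 5/ge;
print "$evaluated\n";   # Mark had 25 dollars and Jill had 75 dollars.
```

The (my $copy = $text) =~ s/.../.../ idiom copies the string first, so the
original sentence is left unchanged.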
Translating Codons to Amino Acids With Regular Expressions
# Translate one DNA codon into a one-letter amino acid symbol (nuclear code).
sub codon2aa {
    my ($codon) = @_;
    if( $codon =~ /GC./i) { return 'A' } # alanine
    elsif( $codon =~ /TG[TC]/i) { return 'C' } # cysteine
    elsif( $codon =~ /GA[TC]/i) { return 'D' } # aspartic acid
    elsif( $codon =~ /GA[AG]/i) { return 'E' } # glutamic acid
    elsif( $codon =~ /TT[TC]/i) { return 'F' } # phenylalanine
    elsif( $codon =~ /GG./i) { return 'G' } # glycine
    elsif( $codon =~ /CA[TC]/i) { return 'H' } # histidine
    elsif( $codon =~ /AT[TCA]/i) { return 'I' } # isoleucine
    elsif( $codon =~ /AA[AG]/i) { return 'K' } # lysine
    elsif( $codon =~ /TT[AG]|CT./i) { return 'L' } # leucine
    elsif( $codon =~ /ATG/i) { return 'M' } # methionine
    elsif( $codon =~ /AA[TC]/i) { return 'N' } # asparagine
    elsif( $codon =~ /CC./i) { return 'P' } # proline
    elsif( $codon =~ /CA[AG]/i) { return 'Q' } # glutamine
    elsif( $codon =~ /CG.|AG[AG]/i) { return 'R' } # arginine
    elsif( $codon =~ /TC.|AG[TC]/i) { return 'S' } # serine
    elsif( $codon =~ /AC./i) { return 'T' } # threonine
    elsif( $codon =~ /GT./i) { return 'V' } # valine
    elsif( $codon =~ /TGG/i) { return 'W' } # tryptophan
    elsif( $codon =~ /TA[TC]/i) { return 'Y' } # tyrosine
    elsif( $codon =~ /TA[AG]|TGA/i) { return '_' } # stop
    else {
        # report the problem on STDERR and exit with a non-zero status
        print STDERR "Bad codon \"$codon\"!!\n";
        exit 1;
    }
}
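Notice how certain amino acids have multiple codons while others have only
one or very few. Make sure that you understand how these regular expressions
account for all the codons. Note also that these patterns are written for
DNA (T), while the tables above use the RNA alphabet (U), and that a stop
codon is returned as an underscore rather than the period used in the
tables.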
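Before a sequence can be translated it has to be cut into codons, and a
regular expression can do that as well. The following is a rough sketch
only: the subroutine name split_codons is hypothetical, and it assumes that
leftover bases at the end of the sequence should simply be dropped.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split a nucleotide sequence into three-letter codons.
# split_codons is a hypothetical helper written for this sketch.
sub split_codons {
    my ($seq) = @_;
    $seq = uc $seq;         # work in upper case
    $seq =~ s/\s+//g;       # remove spaces and line breaks
    $seq =~ tr/U/T/;        # convert the RNA alphabet to the DNA alphabet
    # In list context, /(...)/g returns every complete triplet in order;
    # one or two leftover bases at the end are not returned.
    my @codons = $seq =~ /(...)/g;
    return @codons;
}

print join(" ", split_codons("augGCUuaa")), "\n";   # ATG GCT TAA
```

Each codon returned this way is three upper-case DNA letters, which is the
form the patterns above expect.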
ASSIGNMENT:
Write a script which allows the user to enter a nucleotide sequence and then
translates that sequence into an amino acid sequence using regular
expressions. USE THE MITOCHONDRIAL GENETIC CODE INSTEAD OF THE NUCLEAR
GENETIC CODE.