String Processing

Regular Expressions Revisited

In this lesson the student will learn how to:
  1. Specify whether matching will be greedy or minimal
  2. Do calculations within regular expressions
  3. Use regular expressions to represent codons
By the end of this lesson the student will be able to:

	Utilizing regular expressions, write a program which
	translates a nucleotide sequence into an amino acid
	sequence.

The Genetic Code

The genetic code is nearly universal, but not quite: a few organisms and organelles use slightly different codes. In fact, the genetic code used in human mitochondria differs slightly from the one used in the nucleus. The human nuclear genetic code, written as RNA codons, looks like this (the one-letter amino acid code is on the left, and "." marks a stop codon):


A     GCU GCC GCA GCG
R     CGU CGC CGA CGG AGA AGG
N     AAU AAC
D     GAU GAC
C     UGU UGC
Q     CAA CAG
E     GAA GAG
G     GGU GGC GGA GGG
H     CAU CAC
I     AUU AUC AUA
L     UUA UUG CUU CUC CUA CUG
K     AAA AAG
M     AUG
F     UUU UUC
P     CCU CCC CCA CCG
S     UCU UCC UCA UCG AGU AGC
T     ACU ACC ACA ACG
W     UGG
Y     UAU UAC
V     GUU GUC GUA GUG
.     UAA UAG UGA

The human mitochondrial genetic code is very similar:

A     GCU GCC GCA GCG
R     CGU CGC CGA CGG
N     AAU AAC
D     GAU GAC
C     UGU UGC
Q     CAA CAG
E     GAA GAG
G     GGU GGC GGA GGG
H     CAU CAC
I     AUU AUC
L     UUA UUG CUU CUC CUA CUG
K     AAA AAG
M     AUG AUA
F     UUU UUC
P     CCU CCC CCA CCG
S     UCU UCC UCA UCG AGU AGC
T     ACU ACC ACA ACG
W     UGG UGA
Y     UAU UAC
V     GUU GUC GUA GUG
.     UAA UAG AGA AGG
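
Comparing the two tables row by row shows that only four codons change meaning. The following is a small sketch that lists them side by side; the hash names are made up for illustration, and the entries are copied from the tables above (with "." for stop).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Only the codons whose meaning differs between the two tables above.
# %nuclear and %mito are illustrative names, not part of the lesson.
my %nuclear = ( AUA => 'I', UGA => '.', AGA => 'R', AGG => 'R' );
my %mito    = ( AUA => 'M', UGA => 'W', AGA => '.', AGG => '.' );

for my $codon (sort keys %nuclear) {
    printf "%s  nuclear: %s  mitochondrial: %s\n",
        $codon, $nuclear{$codon}, $mito{$codon};
}
```

Every other codon has the same meaning in both codes.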

More Information On Regular Expressions

Entire books are devoted to regular expressions, so a few lessons can cover only a small fraction of what there is to know about them. Here are a few more examples of what regular expressions can do:

Minimal vs. Greedy Matching

#!/usr/bin/perl
$m = "Mississippi";

# greedy:
$m =~ /i(.*)i/;
print $1 . "\n";

# minimal:
$m =~ /i(.*?)i/;
print $1 . "\n";

When you run this script, you will notice that the question mark changed the behavior of this regular expression. Rather than matching the largest possible sequence beginning and ending with i, it matched the shortest. (QUESTION: Did it match the first pair of s's or the second pair?)
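
The same distinction matters when scanning sequence data. The following is a minimal sketch rather than part of the original lesson: the DNA fragment and the variable names are invented for illustration. It captures the text between a start codon (ATG) and a stop codon (TAG), first greedily and then minimally, without paying any attention to the reading frame.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical DNA fragment that contains two TAG stop codons.
my $dna = "ATGAAATAGCCCTAG";

# Greedy: .* takes as much as it can, so the match runs to the LAST TAG.
my ($greedy) = $dna =~ /ATG(.*)TAG/;

# Minimal: .*? takes as little as it can, so the match ends at the FIRST TAG.
my ($minimal) = $dna =~ /ATG(.*?)TAG/;

print "greedy:  $greedy\n";     # prints AAATAGCCC
print "minimal: $minimal\n";    # prints AAA
```

Both patterns start matching at the same ATG; only the amount of text consumed before the closing TAG differs.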

Calculations Using Regular Expressions

#!/usr/bin/perl
$fruit = "Mark had 5 dollars and Jill had 15 dollars.";
print $fruit . "\n";

# /g replaces every number; /e evaluates the replacement as Perl code
$fruit =~ s/(\d+)/($1*5)/ge;
print $fruit . "\n";

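
To see what each modifier contributes, it can help to run the same substitution with and without them. The sketch below reuses the sentence from the script above; the variable names are illustrative. Each substitution works on its own copy of the string, so the results can be compared.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "Mark had 5 dollars and Jill had 15 dollars.";

# No /e: the replacement is an ordinary string, so $1 is interpolated
# but the multiplication is never carried out.
(my $literal = $text) =~ s/(\d+)/($1*5)/g;

# /e: the replacement is evaluated as Perl code; /g repeats it for every match.
(my $evaluated = $text) =~ s/(\d+)/$1*5/ge;

# /e without /g: only the first number is replaced.
(my $first_only = $text) =~ s/(\d+)/$1*5/e;

print "$literal\n";      # Mark had (5*5) dollars and Jill had (15*5) dollars.
print "$evaluated\n";    # Mark had 25 dollars and Jill had 75 dollars.
print "$first_only\n";   # Mark had 25 dollars and Jill had 15 dollars.
```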

Translating Codons to Amino Acids With Regular Expressions

# The return statements only work inside a subroutine, so the chain is
# wrapped in one here. The name codon2aa is illustrative.
sub codon2aa {
    my ($codon) = @_;

    if( $codon =~ /GC./i)            { return 'A' } # alanine
    elsif( $codon =~ /TG[TC]/i)      { return 'C' } # cysteine
    elsif( $codon =~ /GA[TC]/i)      { return 'D' } # aspartic acid
    elsif( $codon =~ /GA[AG]/i)      { return 'E' } # glutamic acid
    elsif( $codon =~ /TT[TC]/i)      { return 'F' } # phenylalanine
    elsif( $codon =~ /GG./i)         { return 'G' } # glycine
    elsif( $codon =~ /CA[TC]/i)      { return 'H' } # histidine
    elsif( $codon =~ /AT[TCA]/i)     { return 'I' } # isoleucine
    elsif( $codon =~ /AA[AG]/i)      { return 'K' } # lysine
    elsif( $codon =~ /TT[AG]|CT./i)  { return 'L' } # leucine
    elsif( $codon =~ /ATG/i)         { return 'M' } # methionine
    elsif( $codon =~ /AA[TC]/i)      { return 'N' } # asparagine
    elsif( $codon =~ /CC./i)         { return 'P' } # proline
    elsif( $codon =~ /CA[AG]/i)      { return 'Q' } # glutamine
    elsif( $codon =~ /CG.|AG[AG]/i)  { return 'R' } # arginine
    elsif( $codon =~ /TC.|AG[TC]/i)  { return 'S' } # serine
    elsif( $codon =~ /AC./i)         { return 'T' } # threonine
    elsif( $codon =~ /GT./i)         { return 'V' } # valine
    elsif( $codon =~ /TGG/i)         { return 'W' } # tryptophan
    elsif( $codon =~ /TA[TC]/i)      { return 'Y' } # tyrosine
    elsif( $codon =~ /TA[AG]|TGA/i)  { return '_' } # stop
    else {
        print STDERR "Bad codon \"$codon\"!!\n";
        exit;
    }
}

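Notice that some amino acids have many codons while others have only one or two. Make sure that you understand how these regular expressions account for all 64 codons. Note also that this code expects DNA codons (written with T), whereas the tables above are written as RNA (with U), and that it reports a stop codon as '_' rather than '.'.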
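
One way to check that the patterns account for every codon is to generate all 64 three-letter combinations and count how many land on each amino acid. The following is a sketch under a few assumptions: the patterns are copied from the nuclear-code listing above into a list of pairs, and the names @rules and first_match are invented for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The lesson's patterns as (regex, one-letter code) pairs, in the same order.
my @rules = (
    [ qr/GC./i,       'A' ], [ qr/TG[TC]/i,      'C' ],
    [ qr/GA[TC]/i,    'D' ], [ qr/GA[AG]/i,      'E' ],
    [ qr/TT[TC]/i,    'F' ], [ qr/GG./i,         'G' ],
    [ qr/CA[TC]/i,    'H' ], [ qr/AT[TCA]/i,     'I' ],
    [ qr/AA[AG]/i,    'K' ], [ qr/TT[AG]|CT./i,  'L' ],
    [ qr/ATG/i,       'M' ], [ qr/AA[TC]/i,      'N' ],
    [ qr/CC./i,       'P' ], [ qr/CA[AG]/i,      'Q' ],
    [ qr/CG.|AG[AG]/i,'R' ], [ qr/TC.|AG[TC]/i,  'S' ],
    [ qr/AC./i,       'T' ], [ qr/GT./i,         'V' ],
    [ qr/TGG/i,       'W' ], [ qr/TA[TC]/i,      'Y' ],
    [ qr/TA[AG]|TGA/i,'_' ],
);

# Return the code for the first pattern that matches, or undef if none does.
sub first_match {
    my ($codon) = @_;
    for my $rule (@rules) {
        my ($re, $aa) = @$rule;
        return $aa if $codon =~ $re;
    }
    return;
}

# Build every three-letter DNA codon and tally the results.
my @bases = qw(T C A G);
my (%count, @unmatched);
for my $b1 (@bases) {
    for my $b2 (@bases) {
        for my $b3 (@bases) {
            my $codon = "$b1$b2$b3";
            my $aa    = first_match($codon);
            if (defined $aa) { $count{$aa}++ }
            else             { push @unmatched, $codon }
        }
    }
}

print "unmatched codons: ", scalar(@unmatched), "\n";
print "$_: $count{$_}\n" for sort keys %count;
```

If the patterns are complete, no codon is left unmatched and the counts agree with the number of codons listed for each amino acid in the nuclear table.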

ASSIGNMENT:

Write a script which allows the user to enter a nucleotide sequence and then translates that sequence into an amino acid sequence using regular expressions. USE THE MITOCHONDRIAL GENETIC CODE INSTEAD OF THE NUCLEAR GENETIC CODE.