String Processing
Regular Expressions
In this lesson the student will learn:
- That regular expressions are generalized patterns
- How to use a subset of regular expression symbols
By the end of this lesson the student will be able to:
- Create a script which compares polypeptide sequences based on crude
amino acid similarity.
Amino Acids
Twenty amino acids are coded for in the standard genetic code. Each
amino acid has the following features:
- central carbon atom
- single hydrogen atom
- amino group (NH2)
- carboxyl group (COOH)
- variable side chain
Note that the only difference between the amino acids is the variable side
chain (sometimes referred to as an "R" group).
Here are the twenty amino acids, their three-letter symbols, and their
one-letter symbols:
Amino acid    | One-letter symbol | Three-letter symbol
alanine       | A                 | Ala
arginine      | R                 | Arg
asparagine    | N                 | Asn
aspartic acid | D                 | Asp
cysteine      | C                 | Cys
glutamic acid | E                 | Glu
glutamine     | Q                 | Gln
glycine       | G                 | Gly
histidine     | H                 | His
isoleucine    | I                 | Ile
leucine       | L                 | Leu
lysine        | K                 | Lys
methionine    | M                 | Met
phenylalanine | F                 | Phe
proline       | P                 | Pro
serine        | S                 | Ser
threonine     | T                 | Thr
tryptophan    | W                 | Trp
tyrosine      | Y                 | Tyr
valine        | V                 | Val
These amino acids can be grouped into five different categories like this:
- Nonpolar, hydrophobic: A, G, I, L, M, F, P, V.
- Basic, hydrophilic: R, K, H.
- Acidic, hydrophilic: D, E.
- Polar, hydrophobic: W.
- Polar, hydrophilic: N, C, Q, S, T, Y.
Regular Expressions
Here's a little sample script which will report if the symbol you entered is
a basic, hydrophilic amino acid:
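#!/usr/bin/perl
use strict;
use warnings;

my $input = "";
do {
    # Read one line from the keyboard; treat end of input as "q"
    $input = <STDIN> // "q";
    if ($input =~ /R|K|H/) {          # input contains R, K, or H
        print "basic, hydrophilic\n";
    }
    else {
        print "other\n";
    }
} while ( !($input =~ /q|Q/) );       # stop once the input contains q or Q
print "bye\n\n";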
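Note the use of the pipe (|) symbol. It stands for OR: in the first pattern
match, the regular expression R|K|H matches any input which contains an 'R',
a 'K', or an 'H'. In the same way, q|Q in the loop condition ends the script
as soon as the input contains a 'q' or a 'Q', so entering Q (glutamine) also
ends the script.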
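Because an unanchored pattern matches anywhere in the input, "KA" would also be reported as basic. The following is a minimal sketch of one way to accept exactly one symbol, using the anchors ^ and $ which are introduced later in this lesson. The helper name is_basic is hypothetical and is not part of the lesson's script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# is_basic is an illustrative helper, not part of the lesson's script.
# ^ and $ anchor the match so the whole input must be one of R, K, or H.
# The parentheses make ^ and $ apply to all three alternatives.
sub is_basic {
    my ($symbol) = @_;
    return $symbol =~ /^(R|K|H)$/ ? 1 : 0;
}

for my $symbol ('K', 'A', 'KA') {
    my $loose  = $symbol =~ /R|K|H/ ? 'yes' : 'no';   # matches anywhere in the input
    my $strict = is_basic($symbol)  ? 'yes' : 'no';   # must be exactly one symbol
    print "$symbol: unanchored=$loose anchored=$strict\n";
}
```

Without the parentheses, ^R|K|H$ would mean "R at the start, or K anywhere, or H at the end", which is a common mistake when mixing anchors with the pipe.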
Regular expressions specify general patterns. The next script uses several
symbols which stand for groups of characters. It uses the following
regular expressions:
\d\d       matches two adjacent digits
.oo.       matches any group of four characters with oo in the middle
\d.*\d     matches any group of characters which starts and ends with a digit
^[AGCT].*  matches a line beginning with A, G, C, or T
^[AGCT]*$  matches any line containing only As, Gs, Cs, and/or Ts
#!/usr/bin/perl
use strict;
use warnings;

my $seq = <STDIN> // "";   # read one line ("" at end of input)
chomp $seq;                # remove the trailing newline

if ($seq =~ /\d\d/) {      # the \d stands for a single digit
    print "TWO DIGITS\n";
}
if ($seq =~ /.oo./) {      # the dot (.) stands for any character
    print "double o\n";
}
if ($seq =~ /\d.*\d/) {    # the asterisk means "any number of" (including
                           # none of) the item it follows
    print "TWO DIGITS (not necessarily adjacent)\n";
}
if ($seq =~ /^[AGCT].*/) { # The caret (^) specifies the beginning of a line.
                           # The group of letters in square brackets
                           # specifies a character class.
    print "possible nucleotide sequence\n";
}
if ($seq =~ /^[AGCT]*$/) { # Just like the last one, except that there is
                           # no dot and the dollar sign ($) marks the end
                           # of the line, so the whole line must consist
                           # of nucleotides (any number of them).
    print "nucleotide sequence\n";
}
Run this example for each of the following inputs:
food
55oox
A56
AGCGGGACGACCA
5AGAGCACA
Gthedogisintheyard
ava6eve7ivi
xyz
Is there any input which satisfies all of the if statements? Why or why not?
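To try many inputs without retyping each one, the five patterns can be kept in a list and checked in a loop. This is only a sketch: the names @checks and matching_labels are hypothetical, and the three sample strings are deliberately different from the inputs listed above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Each entry pairs a label with one of the five patterns from the lesson.
# (@checks and matching_labels are illustrative names.)
my @checks = (
    [ 'TWO DIGITS',                            qr/\d\d/      ],
    [ 'double o',                              qr/.oo./      ],
    [ 'TWO DIGITS (not necessarily adjacent)', qr/\d.*\d/    ],
    [ 'possible nucleotide sequence',          qr/^[AGCT].*/ ],
    [ 'nucleotide sequence',                   qr/^[AGCT]*$/ ],
);

# Return the labels of every pattern which matches $line.
sub matching_labels {
    my ($line) = @_;
    return map { $_->[0] } grep { $line =~ $_->[1] } @checks;
}

for my $line ('moon', '7up7', 'GATTACA') {
    my @labels = matching_labels($line);
    print "$line: ", (@labels ? join(', ', @labels) : 'no match'), "\n";
}
```

Replacing the sample strings with the inputs from the exercise shows at a glance which patterns each one satisfies.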
ASSIGNMENT:
Write a script in which the user enters two strings representing
polypeptides (strings of amino acids). Use the two strings entered by the
user to create two corresponding strings translated into the following code:
1. Nonpolar, hydrophobic: A, G, I, L, M, F, P, V.
2. Basic, hydrophilic: R, K, H.
3. Acidic, hydrophilic: D, E.
4. Polar, hydrophobic: W.
5. Polar, hydrophilic: N, C, Q, S, T, Y.
So, for instance, the following sequence will be converted like so:
ORIGINAL: AMDEWNHQSFFHDE
OUTPUT: 11334525511233
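One possible starting point for the translation step is Perl's substitution operator s///g with one character class per category. This is a sketch, not the required solution. It assumes that the digits 1 to 5 stand for the five categories in the order listed (which agrees with the example above), and encode_polypeptide is a hypothetical name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# encode_polypeptide is an illustrative name, not part of the assignment text.
# Assumed code: 1 nonpolar, 2 basic, 3 acidic,
#               4 polar hydrophobic, 5 polar hydrophilic.
sub encode_polypeptide {
    my ($seq) = @_;
    $seq = uc $seq;              # accept lower-case input too
    # The g modifier replaces every occurrence, not just the first one.
    $seq =~ s/[AGILMFPV]/1/g;
    $seq =~ s/[RKH]/2/g;
    $seq =~ s/[DE]/3/g;
    $seq =~ s/W/4/g;
    $seq =~ s/[NCQSTY]/5/g;
    return $seq;
}

print encode_polypeptide('AMDEWNHQSFFHDE'), "\n";   # prints 11334525511233
```

The order of the five substitutions does not matter here, because the digits written by one substitution never match the letter classes of another.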
Once the two sequences are converted into code, compare them and report
the percent identity.
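The lesson does not define percent identity precisely. The sketch below assumes one common reading: compare the two coded strings position by position and divide the number of matching positions by the length of the longer string, so that any unpaired positions count as mismatches. Both this definition and the name percent_identity are assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# percent_identity is an illustrative name. The definition used here
# (matching positions / length of the longer string * 100) is an assumption.
sub percent_identity {
    my ($code_a, $code_b) = @_;
    my ($len_a, $len_b) = (length $code_a, length $code_b);
    my $longer  = $len_a > $len_b ? $len_a : $len_b;
    my $shorter = $len_a < $len_b ? $len_a : $len_b;
    return 0 if $longer == 0;    # avoid dividing by zero on empty input

    # Count the positions at which both strings hold the same digit.
    my $matches = 0;
    for my $i (0 .. $shorter - 1) {
        $matches++ if substr($code_a, $i, 1) eq substr($code_b, $i, 1);
    }
    return 100 * $matches / $longer;
}

printf "%.1f%% identical\n", percent_identity('11334525511233', '11334125511213');
```

Dividing by the shorter length instead would ignore the unpaired positions, so the choice of denominator is worth stating in the script's output.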