String Processing

Regular Expressions

In this lesson the student will:

Learn that regular expressions are generalized patterns
Learn to use a subset of regular expression symbols

By the end of this lesson the student will be able to:


  Create a script which compares polypeptide sequences
  based on crude amino acid similarity.

Amino Acids

There are twenty amino acids which are coded for in the genetic code. Each amino acid has the following features:

central carbon atom
single hydrogen atom
amino group (NH₂)
carboxyl group (COOH)
variable side chain

Note that the only difference between the amino acids is the variable side chain (sometimes referred to as an "R" group).

Here are the twenty amino acids, their three-letter symbols, and their one-letter symbols:

  Amino acid      One-letter symbol   Three-letter symbol
  alanine         A                   Ala
  arginine        R                   Arg
  asparagine      N                   Asn
  aspartic acid   D                   Asp
  cysteine        C                   Cys
  glutamic acid   E                   Glu
  glutamine       Q                   Gln
  glycine         G                   Gly
  histidine       H                   His
  isoleucine      I                   Ile
  leucine         L                   Leu
  lysine          K                   Lys
  methionine      M                   Met
  phenylalanine   F                   Phe
  proline         P                   Pro
  serine          S                   Ser
  threonine       T                   Thr
  tryptophan      W                   Trp
  tyrosine        Y                   Tyr
  valine          V                   Val

These amino acids can be grouped into five different categories like this:

Nonpolar, hydrophobic: A, G, I, L, M, F, P, V.
Basic, hydrophilic: R, K, H.
Acidic, hydrophilic: D, E.
Polar, hydrophobic: W.
Polar, hydrophilic: N, C, Q, S, T, Y.
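
The grouping above can also be written as a small Perl lookup, which may help when checking symbols by hand. This is only a sketch: the subroutine name category_of is made up for illustration, and it relies on character classes (letters in square brackets) and on the ^ and $ anchors, which are explained in the Regular Expressions section of this lesson.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the lesson): return the category name
# for a single one-letter amino acid symbol, or "unknown" for anything else.
sub category_of {
    my ($symbol) = @_;
    return "nonpolar, hydrophobic" if $symbol =~ /^[AGILMFPV]$/;
    return "basic, hydrophilic"    if $symbol =~ /^[RKH]$/;
    return "acidic, hydrophilic"   if $symbol =~ /^[DE]$/;
    return "polar, hydrophobic"    if $symbol =~ /^W$/;
    return "polar, hydrophilic"    if $symbol =~ /^[NCQSTY]$/;
    return "unknown";
}

# Try one symbol from each category, plus one that is not an amino acid symbol.
foreach my $symbol ("A", "K", "D", "W", "S", "X") {
    print "$symbol: ", category_of($symbol), "\n";
}
```

Each pattern is anchored at both ends, so only a single symbol is accepted and longer strings fall through to "unknown".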

Regular Expressions

Here's a little sample script which will report if the symbol you entered is a basic, hydrophilic amino acid:

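#!/usr/bin/perl
use strict;
use warnings;

my $input = "";
do {
    $input = <STDIN>;
    # treat end of input as a request to quit
    $input = "q" unless defined $input;
    # true if the line contains an R, a K, or an H
    if ($input =~ /R|K|H/) {
        print "basic, hydrophilic\n";
    }
    else {
        print "other\n";
    }
} while ( !($input =~ /q|Q/) );   # stop once the line contains q or Q
print "bye\n\n";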

Note the use of the pipe (|) symbol. It stands for OR: in the first pattern match, the regular expression R|K|H matches any input that contains an 'R', a 'K', or an 'H'. (Because the quit test looks for q or Q, entering Q for glutamine also ends the loop.)
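
The difference between matching anywhere in the input and matching a whole line can be seen in a short sketch. The subroutine names below are hypothetical, and the stricter variant is an assumption about what one might want rather than part of the lesson's script; it uses the ^ and $ anchors, which are covered later in this lesson.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same test as the lesson's script: true if the text contains R, K, or H anywhere.
sub contains_basic {
    my ($text) = @_;
    return $text =~ /R|K|H/ ? 1 : 0;
}

# Stricter variant (an assumption, not from the lesson):
# true only if the whole text is exactly one of R, K, or H.
sub is_basic_symbol {
    my ($text) = @_;
    return $text =~ /^(R|K|H)$/ ? 1 : 0;
}

# Compare the two tests on a single symbol, a longer string, and a non-match.
foreach my $text ("K", "AKA", "A") {
    printf "%-4s contains_basic=%d is_basic_symbol=%d\n",
        $text, contains_basic($text), is_basic_symbol($text);
}
```

The parentheses in the second pattern group the alternatives so that the anchors apply to all three letters and not only to the first and last.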

Regular expressions specify general patterns, and several symbols stand for groups of characters. The next script uses the following regular expressions:


  \d\d          matches two adjacent digits
  .oo.          matches any group of four characters with oo in the middle
  \d.*\d        matches any group of characters which starts and ends with a digit
  ^[AGCT].*     matches a line beginning with A, G, C, or T
  ^[AGCT]*$     matches any line containing only As, Gs, Cs, and/or Ts (including an empty line)

#!/usr/bin/perl
use strict;
use warnings;

my $seq = <STDIN>;

if ($seq =~ /\d\d/) {
    # the \d stands for a single digit
    print "TWO DIGITS\n";
}
if ($seq =~ /.oo./) {
    # the dot (.) stands for any character
    print "double o\n";
}
if ($seq =~ /\d.*\d/) {
    # the asterisk means "any number of" the item it follows
    print "TWO DIGITS (not necessarily adjacent)\n";
}
if ($seq =~ /^[AGCT].*/) {
    # The caret (^) specifies the beginning of a line
    # The group of letters in square brackets specifies
    # a character class
    print "possible nucleotide sequence\n";
}
if ($seq =~ /^[AGCT]*$/) {
    # Just like the last one except that there is
    # no dot and the dollar sign appears at the
    # end meaning that a sequence containing any
    # number of nucleotides is specified
    print "nucleotide sequence\n";
}

Run this example for each of the following inputs:

  food
  55oox
  A56
  AGCGGGACGACCA
  5AGAGCACA
  Gthedogisintheyard
  ava6eve7ivi
  xyz

Is there any input that makes all of the if statements match? Why or why not?

ASSIGNMENT:

Write a script in which the user enters two strings representing polypeptides (strings of amino acids). Use the two strings entered by the user to create two corresponding strings translated into the following code:

1. Nonpolar, hydrophobic: A, G, I, L, M, F, P, V.
2. Basic, hydrophilic: R, K, H.
3. Acidic, hydrophilic: D, E.
4. Polar, hydrophobic: W.
5. Polar, hydrophilic: N, C, Q, S, T, Y.

So, for instance, the following sequence will be converted like so:

  ORIGINAL: AMDEWNHQSFFHDE
  OUTPUT:   11334525511233

Once the two sequences are converted into code, compare them and report percent identity.
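
The comparison step could look something like the sketch below. It is not a definitive solution: the subroutine name percent_identity is hypothetical, and the lesson does not say how to handle sequences of different lengths. The sketch assumes the coded strings are compared position by position with no alignment, and that the number of matching positions is divided by the length of the longer string.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: percent identity between two coded strings.
# Assumption: positions are compared left to right with no alignment,
# and the count of matching positions is divided by the longer length.
sub percent_identity {
    my ($code_a, $code_b) = @_;
    my ($len_a, $len_b) = (length($code_a), length($code_b));
    my $longer  = $len_a > $len_b ? $len_a : $len_b;
    my $shorter = $len_a < $len_b ? $len_a : $len_b;
    return 0 if $longer == 0;    # avoid dividing by zero on empty input

    my $matches = 0;
    for my $i (0 .. $shorter - 1) {
        $matches++ if substr($code_a, $i, 1) eq substr($code_b, $i, 1);
    }
    return 100 * $matches / $longer;
}

# Two coded sequences that differ only in their last two positions.
printf "%.1f%%\n", percent_identity("11334525511233", "11334525511211");
```

Dividing by the longer length means that unmatched trailing positions count against the score; dividing by the shorter length would be an equally defensible choice.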
