String Processing

Restriction Enzymes

In this lesson the student will learn how to:
  1. Use parentheses in regular expressions
  2. Create complex regular expressions
By the end of this lesson the student will be able to:

     Write a script that identifies the portion of a long
     sequence located between two patterns

Restriction Enzymes

Restriction enzymes (also called restriction endonucleases or molecular scissors) are enzymes that cut DNA at specific locations called restriction sites. Many restriction enzymes are available, and between them they can cut DNA sequences at many different locations. A typical restriction enzyme recognizes a sequence about six nucleotides long and makes its cut somewhere within that sequence or at one of its ends. EcoRI, one of the most commonly used restriction enzymes, recognizes the sequence GAATTC and cuts between the G and the A.
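
As a rough illustration of the EcoRI example, the sketch below scans a sequence for GAATTC and reports where each cut would fall. The subroutine name ecori_cut_positions and the test sequence are made up for this example, and the code assumes an uppercase sequence read along a single strand.

```perl
use strict;
use warnings;

# Return the 0-based offsets at which EcoRI (G|AATTC) would cut.
# An offset p means the cut falls between bases p-1 and p.
sub ecori_cut_positions {
    my ($seq) = @_;
    my @cuts;
    while ($seq =~ /GAATTC/g) {
        # $-[0] is the offset where the current match starts;
        # the cut is one base later, between the G and the A.
        push @cuts, $-[0] + 1;
    }
    return @cuts;
}

# Made-up sequence, for illustration only.
my $dna  = "TTGAATTCAAGGGAATTCTT";
my @cuts = ecori_cut_positions($dna);
print "EcoRI cut offsets: @cuts\n";
```

The /g modifier makes each pass through the while loop continue from the end of the previous match, so every site in the sequence is visited once.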

Here are some more examples of restriction enzymes:


	PunAI		C|YCGRG
	ZanI		CC|WGG
	BoxI		GACNN|NNGTC
	BbeI		GGCGC|C
	BamHI		G|GATCC
	ApoI		R|AATTY
	CacI		|GATC
	CfoI		GCG|C
	MthZI		C|TAG
	SinI		G|GWCC
	PfeI		G|AWTC
	Eco24I		GRGCY|C
	FauBII		CG|CG
	FseI		GGCCGG|CC
	HalII		CTGCA|G
	HindII		GTY|RAC
	Kzo49I		G|GWCC

The vertical line in each sequence shows the cut site for the enzyme.

You will notice that a few letters which do not stand for a single nucleotide are included in the recognition sequences. Here are their meanings:


                        R = G or A
                        Y = C or T
                        M = A or C
                        K = G or T
                        S = G or C
                        W = A or T
                        B = not A (C or G or T)
                        D = not C (A or G or T)
                        H = not G (A or C or T)
                        V = not T (A or C or G)
                        N = A or C or G or T
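
Each of these codes corresponds naturally to a character class in a regular expression. The sketch below shows one possible way to turn a recognition sequence from the list above into a pattern; the hash %iupac and the subroutine site_to_regex are names invented for this example, and the code assumes uppercase input with "|" as the cut marker.

```perl
use strict;
use warnings;

# Map each nucleotide code to the regular expression that matches it.
my %iupac = (
    A => 'A',      C => 'C',      G => 'G',      T => 'T',
    R => '[GA]',   Y => '[CT]',   M => '[AC]',   K => '[GT]',
    S => '[GC]',   W => '[AT]',
    B => '[CGT]',  D => '[AGT]',  H => '[ACT]',  V => '[ACG]',
    N => '[ACGT]',
);

# Convert a recognition sequence such as "GRGCY|C" into a regex string.
sub site_to_regex {
    my ($site) = @_;
    $site =~ s/\|//g;    # drop the cut marker; it is not part of the DNA
    my $pattern = '';
    for my $code (split //, $site) {
        die "Unknown nucleotide code '$code'\n" unless exists $iupac{$code};
        $pattern .= $iupac{$code};
    }
    return $pattern;
}

my $pattern = site_to_regex("GRGCY|C");
print "Pattern: $pattern\n";

# Made-up sequence, for illustration only.
my $dna = "AAGAGCTCAA";
print "Site found\n" if $dna =~ /$pattern/;
```

Keeping the translation in a hash means a new recognition sequence needs no hand-written pattern, only a call to the subroutine.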

Parentheses in Regular Expressions

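Parentheses inside a regular expression capture the part of the string that they match. For example:

    #!/usr/bin/perl
    $line = "The fish eat the juicy worms for lunch.";
    $line =~ /fish(.*)worm/;    # capture everything between "fish" and "worm"
    $patt = $1;                 # $1 holds the captured text
    print $patt . "\n";

The special variable $1 contains the part of the string matched by the portion of the regular expression within the first set of parentheses. You can use several sets of parentheses, and each additional set fills the next variable ($2, $3, and so on), like this:

    #!/usr/bin/perl
    $the_line = "mississississinkippippippi";
    $the_line =~ /.(iss)+(.[^p]*)(ipp)+./;
    print "1) $1, 2) $2, 3) $3\n";
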
Recall that [^p] matches any single character other than the letter p, and that + means one or more repetitions of the preceding pattern.
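
The capture variables are only set by a successful match; after a failed match they keep whatever an earlier successful match left in them. A cautious script therefore tests the match before using the captured text. The sketch below is one way to do that, using the fish example; the subroutine name between_fish_and_worm is invented for this illustration.

```perl
use strict;
use warnings;

# Return the text between "fish" and "worm", or undef if there is no match.
sub between_fish_and_worm {
    my ($text) = @_;
    # In list context a successful match returns the captured groups;
    # a failed match returns an empty list, leaving $middle undefined.
    my ($middle) = $text =~ /fish(.*)worm/;
    return $middle;
}

for my $line ("The fish eat the juicy worms for lunch.",
              "The birds eat seeds.") {
    my $middle = between_fish_and_worm($line);
    if (defined $middle) {
        print "Captured: '$middle'\n";
    } else {
        print "No match in: $line\n";
    }
}
```

Assigning the match to a list, as in my ($middle) = ..., avoids reading $1 directly, so a stale value from an earlier match cannot slip through.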

How do the results differ for this modified version of the last example?

    #!/usr/bin/perl
    $the_line = "mississississinkippippippi";
    $the_line =~ /.(iss)+(.*)(ipp)+./;    # (.[^p]*) has been replaced by (.*)
    print "1) $1, 2) $2, 3) $3\n";

ASSIGNMENT:

Write a script that identifies the portion of a long sequence located between two restriction sites. Use the recognition sequence of BamHI (GGATCC) as the first pattern and that of EcoRI (GAATTC) as the second.