String Processing
Strings and DNA
In this lesson the student will learn how to:
- Represent a DNA sequence using a string
- Create complementary strings of nucleotides
- Use substr to inspect a single character in a string
- Concatenate strings using the dot operator
By the end of this lesson the student will be able to:
- Write a short Perl script which creates a complementary sequence for an
input DNA sequence.
Perl is often expanded as "Practical Extraction and Report Language," as you
may recall from the first unit of this course. In this unit we will learn
how to use Perl to process strings and analyze biological sequences.
There are three types of biological sequences which we will investigate:
DNA, RNA, and polypeptide. In this lesson we will discuss only DNA
sequences.
Strands of DNA
Your genome is composed of over 3 billion pairs of nucleotides. Your genome
is organized into 23 pairs of chromosomes. Each chromosome consists of two
complementary strands of DNA. Your genes are located on these strands of
DNA. DNA is composed of four nucleotides: adenine (A), cytosine (C), guanine
(G), and thymine (T). Adenine always pairs up with thymine. Cytosine always
pairs with guanine. Two short complementary strand segments might look like
this:
AATGGCCGATAGAGGGATTTACC
TTACCGGCTATCTCCCTAAATGG
You will notice that A's are always paired with T's and that C's are always
paired with G's. This is a fairly simple arrangement.
Although there are many details about chromosomes, DNA, and genomes which
concern the biologist, for now we will concern ourselves only with the
nucleotide sequence.
The following Perl script will help us get started analyzing a strand of
DNA.
#!/usr/bin/perl
$s1 = "GTAACGAGTTAACACTACGACCCGGCTTACAATGATGCCG";
$len = length $s1;
print $s1 . "\n";
print "LENGTH: $len\n";
The length of a strand of DNA is about the most basic thing you might want
to know about it. (Make sure you notice the use of the length function here
to find out the length of the string.)
Another interesting thing to know is the number of each type of nucleotide.
Here's a Perl script for that:
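#!/usr/bin/perl
$s1 = "GTAACGAGTTAACACTACGACCCGGCTTACAATGATGCCG";
$len = length $s1;

# One counter per nucleotide. The names avoid $a, which Perl's sort uses.
$count_a = 0;
$count_c = 0;
$count_g = 0;
$count_t = 0;

# Inspect each position of the string, one character at a time.
for($i=0; $i<$len; $i++){
    $base = substr($s1,$i,1);    # the single character at position $i
    if($base eq "A"){
        $count_a++;
    }
    elsif($base eq "T"){
        $count_t++;
    }
    elsif($base eq "C"){
        $count_c++;
    }
    elsif($base eq "G"){
        $count_g++;
    }
    else{
        # anything other than A, C, G, or T
        print "BAD CHARACTER\n";
    }
}

print $s1 . "\n";
print "A: $count_a, C: $count_c, G: $count_g, T: $count_t\n";
print "LENGTH: $len\n";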
You should understand three things about this script:
- for loop - Make sure you understand how it is that every nucleotide in
the sequence is inspected, one at a time.
- if-elsif-else - Make sure you understand how the counter for the current
nucleotide is chosen. Recall that only one choice from an if-elsif-else
construct can be selected. In this case there are five choices specified by
the construct: one for each of the four nucleotides and one for any other
character.
- substr - The substr function takes three arguments: a string, a starting
position (counted from 0), and a length. In the usage shown in this example,
the length is always one character since we are inspecting each nucleotide
one at a time.
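As a small illustration of the three-argument form of substr described
above, the sketch below pulls single characters and a longer piece out of a
short sequence. The sequence and the variable names ($seq, $first, $third,
$piece) are invented for this example and are not part of the lesson's
scripts.

```perl
#!/usr/bin/perl
# Sketch only: $seq and the other variable names are invented for illustration.
$seq = "GTAACG";

$first = substr($seq, 0, 1);   # positions start at 0, so this is "G"
$third = substr($seq, 2, 1);   # one character starting at position 2: "A"
$piece = substr($seq, 1, 3);   # three characters starting at position 1: "TAA"

print "first: $first\n";
print "third: $third\n";
print "piece: $piece\n";
```

Changing only the third argument changes how many characters come back,
which is why the counting script always passes 1.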
Here's a quick Perl script which illustrates concatenation (adding strings
to strings).
#!/usr/bin/perl
$str = "one";
$str .= " two";
$str .= " three";
print $str ."\n";
The period in front of the equals sign (.=) appends the string on the right
to the variable on the left. This is one way to use the dot operator. The
dot can also be used on its own to join two strings, as shown in the
second-to-last line of the first example.
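Concatenation becomes more useful inside a loop, where a new string can be
built up one character at a time. The following is a minimal sketch, not one
of the lesson's own scripts: it combines substr with the .= operator to copy
a sequence while adding a dash after each nucleotide. The sequence and the
names $seq and $spaced are invented for this example.

```perl
#!/usr/bin/perl
# Sketch only: $seq and $spaced are invented names for illustration.
$seq = "GATTACA";
$len = length $seq;
$spaced = "";                 # start with an empty string

for($i=0; $i<$len; $i++){
    # append the current nucleotide, followed by a dash
    $spaced .= substr($seq,$i,1) . "-";
}

print $spaced . "\n";         # prints G-A-T-T-A-C-A-
```

The same pattern, starting from an empty string and appending inside the
loop, works whenever a second string has to be produced from a first one.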
ASSIGNMENT:
Write a short script in which the user is prompted to enter
a single strand of DNA of any length.
Use a for loop, if-elsif-else constructs, and substr to produce a
second sequence which is complementary to the first sequence. Here is a pair
of complementary strands:
INPUT: ATCGGGCCTA
OUTPUT: TAGCCCGGAT