Write a script that generates two random polynucleotides and compares them, reporting their percent similarity.
Sequence Similarity
We can compare two sequences and report their level of similarity. Since only four nucleotides are used in a strand of DNA, two random sequences will match at about 25% of positions on average (assuming all four nucleotides are equally likely). Although there are different ways of comparing two polynucleotides, we will use a two-state system: at each position, the nucleotides in the two polynucleotides either match (+) or do not (-).
seq1: ATCCGTACGA
comp: +--+-----+
seq2: AGGCTAGACA

In this sample there are ten nucleotides in each sequence and three matches between them (positions 1, 4, and 10), so their similarity score is 30%.
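The position-by-position comparison described above can be sketched in a few lines. This is only a minimal illustration, assuming Python; the function name `compare` and the two short sequences are made up for the example and are not part of the assignment.

```python
def compare(seq1, seq2):
    """Return (comp, percent) for two non-empty sequences of equal length."""
    if not seq1 or len(seq1) != len(seq2):
        raise ValueError("sequences must be non-empty and the same length")
    # Build the comparison line: '+' where the bases match, '-' where they differ.
    comp = "".join("+" if a == b else "-" for a, b in zip(seq1, seq2))
    # Percent similarity = matches / total positions * 100.
    percent = 100 * comp.count("+") / len(seq1)
    return comp, percent


# Hypothetical 8-base sequences chosen only for illustration.
comp, percent = compare("GATTACAT", "GACTATAT")
print("comp:", comp)            # comp: ++-++-++
print("similarity:", percent)   # similarity: 75.0
```

Here six of the eight positions match, which gives 75%.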
Sequence similarity in biology is kind of a big deal. One way to locate unknown genes is by their similarity to known genes. The relatedness of two genes can also be measured in terms of similarity. For instance, a keratin gene from a human and a mouse may be more similar than a keratin gene from a bird and a human. Sequence comparisons can be done between nucleotide sequences or between amino acid sequences. With 20 different amino acids, similarity is more complicated: some groups of amino acids resemble one another in structure and other characteristics more than others, so various scoring matrices have been devised to score the similarity of polypeptides.
The following script provides a quick overview of how to use split and join:
Make sure you spend time experimenting with each of the examples so that you have a strong understanding of split and join.
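As a minimal sketch of the two operations, assuming Python (the course's own scripting language may differ, and the variable names here are illustrative only): split turns a string into a list of pieces, and join glues a list of pieces back into one string.

```python
# split: break a string into a list wherever the separator occurs.
line = "A T C G"
bases = line.split(" ")       # ['A', 'T', 'C', 'G']

# join: glue the list back together with a chosen separator.
sequence = "".join(bases)     # 'ATCG'
dashed = "-".join(bases)      # 'A-T-C-G'

# To split a string into single characters in Python, use list() rather
# than split, because split does not accept an empty separator.
chars = list(sequence)        # ['A', 'T', 'C', 'G']

print(bases, sequence, dashed, chars)
```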
ASSIGNMENT:
Write a script that generates two random polynucleotides 20 base pairs in length. Next, compare these nucleotide sequences and provide output that reports their similarity both numerically and graphically:
seq1: ATTATACGAGCTTAACTAGC
comp: +----------------++-
seq2: ACAGATACGATAGCGAGAGG
similarity: 15%

Notice that "+" stands for a match and that "-" stands for a mismatch.
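One possible way to generate the random sequences is sketched below, assuming Python and its standard `random` module. The names `BASES`, `random_sequence`, and `match_fraction` are hypothetical helpers for this illustration. The sketch also averages the similarity over many random pairs, which should land near the 25% expected by chance when all four nucleotides are equally likely.

```python
import random

BASES = "ATCG"  # the four DNA nucleotides, assumed equally likely


def random_sequence(length, rng=random):
    """Build a random polynucleotide by picking one base per position."""
    return "".join(rng.choice(BASES) for _ in range(length))


def match_fraction(seq1, seq2):
    """Fraction of positions at which the two sequences carry the same base."""
    return sum(a == b for a, b in zip(seq1, seq2)) / len(seq1)


rng = random.Random(0)  # fixed seed so that reruns give the same sequences
trials = 10000
total = 0.0
for _ in range(trials):
    total += match_fraction(random_sequence(20, rng), random_sequence(20, rng))
average = total / trials

print(f"average similarity over {trials} random pairs: {100 * average:.1f}%")
```

Any single pair of 20-base sequences can land well above or below 25%, as the 15% sample output shows; only the long-run average settles near that value.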