String Processing

Formatted Output

In this lesson the student will learn how to:

Use the printf function
Use field specifiers with the printf function

By the end of this lesson the student will be able to:


	Write a script that produces formatted output.

Affinity and Nucleotide Sequences

The affinity between two nucleotide sequences can be measured by comparing the sequences for complementarity base by base. For instance, take a look at the following two sequences:


	aagctgacga
	--|||--|-|
	agcgaaagat

As you can see, when the two nucleotide sequences are compared one nucleotide at a time, there are five pairs of complementary nucleotides out of ten positions. Expressed as a fraction, the two sequences have an affinity of 0.50, or 50%. The greater the affinity between two sequences, the more likely they are to anneal when they come in contact with each other. Recall that three hydrogen bonds form between G and C nucleotides and two hydrogen bonds form between A and T nucleotides.

(We are ignoring the fact that DNA strands have 3' and 5' ends. In reality, two strands anneal in opposite directions, so that one runs in the 3' to 5' direction and the other in the 5' to 3' direction.)
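The base-by-base comparison described above can be sketched in Perl as follows. This is only one possible approach: the subroutine name count_complementary and the %complement lookup table are made up for this illustration, and the code assumes both sequences are lowercase and of equal length.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count the positions at which the base in $seq1 pairs with the base in $seq2.
# Assumption: both sequences are lowercase and have the same length.
sub count_complementary {
    my ($seq1, $seq2) = @_;

    # Hypothetical lookup table: each base mapped to its complement.
    my %complement = (a => 't', t => 'a', g => 'c', c => 'g');

    my $count = 0;
    for (my $i = 0; $i < length($seq1); $i++) {
        my $base1   = substr($seq1, $i, 1);
        my $base2   = substr($seq2, $i, 1);
        my $partner = $complement{$base1};
        $count++ if defined $partner && $partner eq $base2;
    }
    return $count;
}

my $first  = "aagctgacga";
my $second = "agcgaaagat";
my $pairs  = count_complementary($first, $second);

print "Complementary pairs: $pairs\n";                # prints 5
print "Affinity: ", $pairs / length($first), "\n";    # prints 0.5
```

Dividing the count by the sequence length gives the affinity as a fraction, which matches the 50% worked out by hand above.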

Formatted Output

The following example shows one way to print formatted output. Notice that the function is called printf (instead of just print) and that the values of the variables are substituted for the placeholders (%s) in the format string.

#!/usr/bin/perl

# Build all 64 codons from the four bases.
@codons = ();
foreach $a ("A","T","C","G"){
	foreach $b ("A","T","C","G"){
		foreach $c ("A","T","C","G"){
			push(@codons, "$a$b$c");
		}
	}
}

# Print the codons four to a row, separated by tabs.
for($i=0; $i<64; $i+=4){
	printf("%s\t\t%s\t\t%s\t\t%s\n", $codons[$i],$codons[$i+1],$codons[$i+2],$codons[$i+3]);
}

Here are some more field specifiers which can be used with printf:

Specifier	Description
%c	Single character
%d	Integer in decimal (base-10) format
%e	Floating-point number in scientific notation
%f	Floating-point number in "normal" (fixed-point) notation
%g	Floating-point number in compact format
%o	Integer in octal (base-8) format
%s	Character string
%u	Unsigned integer
%x	Integer in hexadecimal (base-16) format

#!/usr/bin/perl

# Print the numbers 0 through 99 in decimal, octal, and hexadecimal.
for($i = 0; $i < 100; $i++){
	printf("DECIMAL: %d, OCTAL: %o, HEXADECIMAL: %x\n", $i, $i, $i);
}
exit;
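The remaining specifiers in the table can be tried out the same way. The short sketch below uses arbitrary sample values chosen only for illustration; a doubled percent sign (%%) in a format string prints a literal percent sign.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $number = 1234.5;    # arbitrary sample value

printf("%%c: %c\n", 65);         # character with code 65, "A" in ASCII
printf("%%s: %s\n", "ATG");      # character string
printf("%%u: %u\n", 64);         # unsigned integer
printf("%%f: %f\n", $number);    # fixed-point, six decimal places by default
printf("%%e: %e\n", $number);    # scientific notation
printf("%%g: %g\n", $number);    # compact: fixed-point or scientific, depending on magnitude
```

For this value, %f prints 1234.500000 and %g prints 1234.5, which shows why %g is described as the compact format.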

The number of decimal places printed for a floating-point value can be controlled like this:

#!/usr/bin/perl

$f = 43.455678;

# The number after the "." sets how many decimal places are printed.
printf("VALUE WITH PRINTF: %5.1f\n", $f);
printf("VALUE WITH PRINTF: %5.2f\n", $f);
printf("VALUE WITH PRINTF: %5.3f\n", $f);
printf("VALUE WITH PRINTF: %5.4f\n", $f);
printf("VALUE WITH PRINTF: %5.5f\n", $f);
printf("VALUE WITH PRINTF: %5.6f\n", $f);
printf("VALUE WITH PRINTF: %5.7f\n", $f);
exit;

Does printf appear to round correctly? What happens when printf is asked to print more decimal places than the value actually contains?
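The number before the decimal point in a specifier such as %5.1f is a minimum field width: the total number of characters the value should occupy, padded with spaces if it is shorter. The sketch below is an illustration only. The square brackets are added here just to make the padding visible, and the "-" and "0" flags are standard printf flags that the script above does not use.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $f = 43.455678;

# Brackets mark where each field starts and ends.
printf("[%8.2f]\n",  $f);    # prints [   43.46]  right-justified in 8 columns
printf("[%-8.2f]\n", $f);    # prints [43.46   ]  "-" left-justifies the value
printf("[%08.2f]\n", $f);    # prints [00043.46]  "0" pads with zeros instead of spaces
printf("[%5.3f]\n",  $f);    # prints [43.456]    needs 6 columns, so nothing is cut off
```

The last line shows that the width is only a minimum: a value that needs more room than the width allows is printed in full.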

ASSIGNMENT:

Your job is to create an affinity report in the following format:



	  gcgttagatc  ccctttagag  atgctgatgc  gagcactata  cgtgtcatat  gagatttgcc  
gcgttagatc     0           5           1           1           4           2      
ccctttagag     5           0           2           5           1           5      
atgctgatgc     1           2           0           5           2           3      
gagcactata     1           5           5           0           7           1      
cgtgtcatat     4           1           2           7           0           2      
gagatttgcc     2           5           3           1           2           0

Each number is the count of complementary nucleotide pairs between the sequence at the left of its row and the sequence at the top of its column. Notice that the sequences listed vertically on the left side of the table are the same as those listed horizontally along the top.
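As a starting point for the layout only, the sketch below lines up a header row and two data rows with fixed-width fields. It is not a full solution: the counts are copied from the sample report rather than computed, the 12-character column width is an arbitrary choice that will not reproduce the sample spacing exactly, and the subroutine name format_row is made up for this illustration. It uses sprintf, which builds the formatted string in the same way printf does but returns it instead of printing it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build one line of the report: a label followed by any number of cells,
# each left-justified in a 12-character column (width chosen arbitrarily).
sub format_row {
    my ($label, @cells) = @_;
    my $line = sprintf("%-12s", $label);
    $line .= sprintf("%-12s", $_) foreach @cells;
    return "$line\n";
}

# First two sequences and their counts, copied from the sample report.
my @sequences = ("gcgttagatc", "ccctttagag");
my @counts    = ([0, 5], [5, 0]);

# Header row: an empty label column, then the sequences.
print format_row("", @sequences);

# Data rows: the sequence as the label, then its counts.
for (my $row = 0; $row < @sequences; $row++) {
    print format_row($sequences[$row], @{ $counts[$row] });
}
```

Because every field has the same width, the counts stay lined up under their column headings no matter how many digits each count has.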