Lawrence Technological University
College of Arts and Science
Departments of Mathematics, Computer Sciences and Biology

Handouts

A Ruby Language Companion for Exploring Bioinformatics

by John M. Miller M.D. and Julie Zwiesler-Vollick Ph.D.

Contents

Preface
Chapter 1 Exploring Bioinformatics
Chapter 2 The Central Dogma: Exploring Genetic Disease
Chapter 3 Gene Alignments: Investigating Antibiotic Resistance
Chapter 4 Protein Alignments: Clues to Protein Function
Chapter 5 Sequence Assembly: The Human Genome
Chapter 6 Gene Prediction: Finding Genes in the Human Genome
Chapter 7 RNA and Protein Structure: Structure Prediction
Chapter 8 Molecular Phylogenetics: Measuring Evolution
Chapter 9 Microarrays and Gene Expression
Appendix A Installing Perl and Running a Perl Program
Appendix B Introduction to Programming and Perl syntax
Appendix C Regular Expressions
Appendix D Ruby on Rails
Appendix E Emacs
Appendix F R
Appendix G Git
Appendix H NCBI eUtilities
Appendix I Image Magick
Appendix J Unit Testing
Appendix K Abbreviating Variables

Preface

   This little manual is meant as a vade mecum for students while they are Exploring Bioinformatics. This is a supplement to Caroline St. Clair's and Jonathan Visick's Exploring Bioinformatics: A Project-Based Approach, published by Jones and Bartlett. The scripting languages are easy to learn, computer tools that biologists have used to great advantage in the past decade. Larry Wall's Perl lead the way and became the default language of computational biology. Guido von Rossum's Python, Rasmus Lerdof's PHP and Yukihiro Matsumoto's Ruby also had the same general design goals: Scripting should make the routine easy and the difficult possible for programmers and scientists. Also, the code should "do about what you would expect" it to do as the script is read over either by a fellow programmer or a colleague in biology.

   Computer scientists and biologists alike are still waiting for the perfect computer language. Each of these established four has its strengths and weaknesses. Jules J. Berman Ph.D. M.D. is a series editor for Jones and Bartlett and has written both Perl Programming for Medicine and Biology and Ruby Programming for Medicine and Biology. In the latter he makes a good case for including Ruby in the toolkit of a computational biologist. Our hope is to allow some students to learn enough of two of these very good scripting languages during the course of one semester to recognize which is best for their project of the moment.

   The format of this manual will be simply to take the Perl code like the examples in each chapter and put similar Ruby next to it for comparison. There will be some extra appendices:

  1. Routine tasks, like coercing sequence data provided in FASTA format into a single long string of just 4 nucleotide symbols with no spaces or numbers or line breaks, often fall to the newest member of the computational biology team. Regular expressions are a small computer language that comes built into both Perl and Ruby, and makes such tasks much easier. Appendix C will expand a bit on starting to use regular expressions.
  2. Some day the student may hear something like, "I really like your program; but, why can't I use it on the Web like the other tools we use?" Perl was known as the "Swiss Army Chainsaw of the Web" before it was the language of bioinformatics. Currently however, this is one of Ruby's strongest features. Appendix D will introduce a programming framework which is justifiably the most popular new way answer this request: Ruby on Rails.
  3. The choice of a programming editor sometimes seems to be an almost religious argument. Appendix E is for Emacs, which is a fully customizable editor with full spell-checker support and the ability to use regular expressions while editing.
  4. Spreadsheets are not a bad tool for pre-processing biological data. However, seldom are they good enough. Appendix F is for R, which is the GNU version of the S programming language.
  5. Once you have divided the task of annotating your favorite genome among your team members, it is time to think about how you will assemble their separate results into a whole to post on Gen Bank. Relational database software is not quite what you want. General purpose version control software is for any kind of data, not just programming code. Appendix G is for Git, which is another good piece of open source software.
  6. When your team needs to check their results against the latest version of a sequence in the NCBI databases, they can do a little Web surfing on the NCBI site. However, you could help by automating this process with the NCBI eUtilities described in Appendix H.
  7. Edward Tufte has said, "Good graphics with Excel is an oxymoron." But, what you can't do with Excel, you usually can do with R. Sometimes however, a little finer control is needed. Appendix I is for Image Magick, which is a full-featured image creation and manipulation tool-kit.
  8. Collaborative development of software tools with your team usually means that no one team member understands the details of every part of a project. Unfortunately dynamically typed programming languages, like Perl and Ruby, don't make this problem easier. Unit testing, where the expert in the particular part or "unit" writes the test questions and answers for everybody's use, is introduced in Appendix J.

   "There is more than one way to do it," is an old Perl programmer's aphorism. Perl has so many options, shortcuts and idioms (as does Ruby for that matter) that considering them all at once, can be bewildering even for an experienced programmer. St. Clair and Visick have set forth an elegantly simple path for learning that we will try to follow.

TOC

Chapter 1 Exploring Bioinformatics

   Perl and Ruby are both "block" languages where statements are grouped together. In Perl the block of statements that are associated with the for-loop are between the curly braces. Ruby can also use this kind of notation but more often groups statements between "do" and "end" as a block.

   Perl and Ruby are both descendants of the C programming language. Ruby also has Smalltalk as an ancestor and so many of the statements read like noun.verb, e.g. gets.chomp (which means the-string-typed-in.chomp-the-enter-key-off-the-end-of-that-string.)

   Ruby does not have a for-loop in the C style like we see here in Perl. (0...dnaseq.length).each do |i| means: Take each integer, i, from 0 up to but not including the length of the dnaseq string.

   Ruby allows the programmer to use the more familiar subscript notation rather than the substr() function.

   As you practice writing programs for yourself, always keep in mind the person who will come after you in your job. Comments are a big help in making your code readable.

# Perl single line comments are
# everything from a # to the end
# of that line.
=begin comment
     Paragraph sized comments can be 
  written like this in Perl.  =cut
  means to stop the "comment" block
  and cut back to regular Perl
  code statements.
=end comment
=cut
# Regular Perl statements resume here.
# Ruby single line comments are
# everything from a # to the end
# of that line.
=begin
     Paragraph sized comments can be 
  written like this in Ruby.
  ...
  ...
  ...
=end
# =cut is not used in Ruby.
# Regular Ruby statements resume here.

   The blank # comments below are only to keep the number of lines the same and make comparison easier. Perl uses different comparison operators for strings of letters and numbers, "eq" and "==" respectively to test for equality. Ruby uses "==" to test both strings of letters and numbers for equality.

# scalarops.pl
$s1 = $s2 = '0' . '0';
$s2++;
$s3 = $s4 = 'z';
$s4++;
$s5 = '9';
$s6 = '10';
$s7 = 3 * 3;
$s8 = '3' x 3;
print "After $s1 comes $s2.\n";
print "After $s3 comes $s4.\n";
print "$s5 is lexically greater than $s6.\n"
  if $s5 gt $s6;
print "$s6 is numerically greater than $s5.\n"
  if $s6 > $s5;
print "3 times 3 is $s7.\n";
print "'3' times 3 is $s8.\n";
=begin comment
Output:
After 00 comes 01.
After z comes aa.
9 is lexically greater than 10.
10 is numerically greater than 9.
3 times 3 is 9.
'3' times 3 is 333.
=end coment
=cut
# scalarops.rb
# There are fewer Ruby lines because of the more
# flexible string interpolation that allows any
# expression.
#
#
#
#
#
puts "After #{'0' + '0'} comes #{('0' + '0').succ}."
puts "After #{'z'} comes #{'z'.succ}."
puts "#{9} is lexically greater than #{10}." \
  if 9.to_s > 10.to_s
puts "#{10} is numerically greater than #{9}." \
  if 10 > 9
puts "3 times 3 is #{3 * 3}."
puts "'3' times 3 is #{'3' * 3}."
=begin
Output:
After 00 comes 01.
After z comes aa.
9 is lexically greater than 10.
10 is numerically greater than 9.
3 times 3 is 9.
'3' times 3 is 333.
=end
#

   Of the two languages, only Perl has the C style for-loop. Ruby has a similar construct that reads: "Use each of the integers in a range of integers."

# ch1p12.pl
print "Enter single DNA strand: ";
$dnaseq = <STDIN>;
chomp $dnaseq;
print "\nOpposite strand: ";
for ($i = 0;$i < length($dnaseq);$i++) {
    $nucleo = substr($dnaseq,$i,1);
    if ($nucleo eq 'A') { print "T"; }
    elsif ($nucleo eq 'C') { print "G"; }
    elsif ($nucleo eq 'G') { print "C"; }
    else { print "A"; }
#
}
# ch1p12.rb
print "Enter single DNA strand: "
dnaseq = gets.chomp
#
print "\nOpposite strand: "
(0...dnaseq.length).each do |i|
  nucleo = dnaseq[i,1]
  if nucleo == 'A' then print "T"
  elsif nucleo == 'C' then print "G"
  elsif nucleo == 'G' then print "C"
  else print "A"
  end
end

   Getting Windows Vista and the Emacs Shell-Mode to play nicely together seems to be a little tricky. Sarah Savaya found that using a Windows Command-Line window instead of the Emacs sub-shell was one workaround. Another workaround might be to force auto-flushing of standard-output.

# ch1p12v.pl
# A Windows Vista version to work
# within the Emacs shell by not buffering
# STDOUT.
$| = 1;
print "Enter single DNA strand: ";
$dnaseq = <STDIN>;
chomp $dnaseq;
print "\nOpposite strand: ";
for ($i = 0;$i < length($dnaseq);$i++) {
    $nucleo = substr($dnaseq,$i,1);
    if ($nucleo eq 'A') { print "T"; }
    elsif ($nucleo eq 'C') { print "G"; }
    elsif ($nucleo eq 'G') { print "C"; }
    else { print "A"; }
#
}
# ch1p12v.rb
# A Windows Vista version to work
# within the Emacs shell by not buffering
# STDOUT.
$stdout.sync = true;
print "Enter single DNA strand: "
dnaseq = gets.chomp
#
print "\nOpposite strand: "
(0...dnaseq.length).each do |i|
  nucleo = dnaseq[i,1]
  if nucleo == 'A' then print "T"
  elsif nucleo == 'C' then print "G"
  elsif nucleo == 'G' then print "C"
  else print "A"
  end
end

   Both Perl and Ruby have for-each-thing-in-a-list loops. If we jump a little ahead to Chapter 4, and begin by splitting our sequence string into a list of individual nucleotide characters, then Perl and Ruby look even more similar.

# ch1p12b.pl
print "Enter single DNA strand: ";
$dnaseq = <STDIN>;
chomp $dnaseq;
@nucleotides = split "",$dnaseq;
print "\nOpposite strand: ";
foreach $nucleo (@nucleotides) {
# for $nucleo (@nucleotides) { is the same thing.
    if ($nucleo eq 'A') { print "T"; }
    elsif ($nucleo eq 'C') { print "G"; }
    elsif ($nucleo eq 'G') { print "C"; }
    else { print "A"; }
#
}
# ch1p12b.rb
print "Enter single DNA strand: "
nucleotides = gets.chomp.split ""
#
#
print "\nOpposite strand: "
nucleotides.each do |nucleo|
# for nucleo in nucleotides do is the same thing.
  if nucleo == 'A' then print "T"
  elsif nucleo == 'C' then print "G"
  elsif nucleo == 'G' then print "C"
  else print "A"
  end
end

TOC

Chapter 2 The Central Dogma: Exploring Genetic Disease

   In the first example, notice the difference in the ways Perl and Ruby interpolate variable values into strings. The "$" in $dnaseq tells Perl that dnaseq is a variable name and not some ordinary characters. Ruby variables lack these prefixes, so #{dnaseq} is used. Also notice that in Ruby, puts "string" is just shorthand for print "string\n".

   The Ruby convention is that when you see something like noun.verb, the noun is not changed; and, when you see noun.verb! the "!" warns that the noun is changed.

# Perl:
# Two steps on two lines.
$rnaseq = $dnaseq; # Make a copy first,
$rnaseq =~ s/T/U/g; # Make the substitution in the copy.
# Two steps on one line, but a little obscure.
($rnaseq = $dnaseq) =~ s/T/U/g; # Same in idiomatic Perl.
# Ruby:
# These two lines are equivalent to
rnaseq = dnaseq
rnaseq.gsub!(/T/,'U')
# this one line.
rnaseq = dnaseq.gsub(/T/,'U')

# ch2ex1.pl
# Convert a non-template strand of DNA to its RNA equivalent.
print "Enter non-template DNA strand in capital letters: ";
$dnaseq = <STDIN>;
chomp $dnaseq;
$rnaseq = $dnaseq;
$rnaseq =~ s/T/U/g;
print "DNA: $dnaseq \n";
print "RNA: $rnaseq \n";
# ch2ex1.rb
# Convert a non-template strand of DNA to its RNA equivalent.
print "Enter non-template DNA strand in capital letters: "
dnaseq = gets.chomp
#
rnaseq = dnaseq.gsub(/T/,'U')
#
puts "DNA: #{dnaseq}"
puts "RNA: #{rnaseq}"

   For the next example, in the interest of both conserving space and easier visual comparison with tabular representations of the genetic code (Like Table 2.1), we will introduce initialization of hashes with lists of key => value pairs.

# ch2ex2.pl
# Translation Program
# Build hashtable for genetic code
%codes = (
  'GCA'=>'Ala','GCC'=>'Ala','GCG'=>'Ala','GCU'=>'Ala',
  'GGA'=>'Gly','GGC'=>'Gly','GGG'=>'Gly','GGU'=>'Gly',
  'CCA'=>'Pro','CCC'=>'Pro','CCG'=>'Pro','CCU'=>'Pro',
  'CGA'=>'Arg','CGC'=>'Arg','CGG'=>'Arg','CGU'=>'Arg',
  'AGA'=>'Arg','AGG'=>'Arg',
  'CAC'=>'His','CAU'=>'His',
  'AGC'=>'Ser','AGU'=>'Ser','UCA'=>'Ser','UCC'=>'Ser',
  'UCG'=>'Ser','UCU'=>'Ser',
  'AAU'=>'Asn','AAC'=>'Asn',
  'AUA'=>'Ile','AUC'=>'Ile','AUU'=>'Ile',
  'ACA'=>'Thr','ACC'=>'Thr','ACG'=>'Thr','ACU'=>'Thr',
  'GAC'=>'Asp','GAU'=>'Asp',
  'CUA'=>'Leu','CUC'=>'Leu','CUG'=>'Leu','CUU'=>'Leu',
  'UUA'=>'Leu','UUG'=>'Leu',
  'UGG'=>'Trp',
  'UGC'=>'Cys','UGU'=>'Cys',
  'GAA'=>'Glu','GAG'=>'Glu',
  'AAA'=>'Lys','AAG'=>'Lys',
  'AUG'=>'Met',
  'UAC'=>'Tyr','UAU'=>'Tyr',
  'GUA'=>'Val','GUC'=>'Val','GUG'=>'Val','GUU'=>'Val',
  'CAA'=>'Gln','CAG'=>'Gln',
  'UUC'=>'Phe','UUU'=>'Phe',
  'UAA'=>'Stop','UAG'=>'Stop','UGA'=>'Stop'
);
# Get non-template DNA sequence from user.
print "Enter non-template DNA strand in capital letters: ";
$dnaseq = <STDIN>;
chomp $dnaseq;
# Convert DNA to RNA.
$rnaseq = $dnaseq;
$rnaseq =~ s/T/U/g;
print "DNA: $dnaseq \n";
print "RNA: $rnaseq \n";
# Get number of codons.
$codonctr = length($rnaseq) / 3; # Floating point division!
print "Protein: ";
for ($i = 0;$i < $codonctr;$i++) {
  # Call to hashtable:
    $aminoacid = $codes{substr($rnaseq,$i * 3, 3)};
    print "$aminoacid";
}
print "\n";
# ch2ex2.rb
# Translation Program
# Build hashtable for genetic code.
codes = {
  'GCA'=>'Ala','GCC'=>'Ala','GCG'=>'Ala','GCU'=>'Ala',
  'GGA'=>'Gly','GGC'=>'Gly','GGG'=>'Gly','GGU'=>'Gly',
  'CCA'=>'Pro','CCC'=>'Pro','CCG'=>'Pro','CCU'=>'Pro',
  'CGA'=>'Arg','CGC'=>'Arg','CGG'=>'Arg','CGU'=>'Arg',
  'AGA'=>'Arg','AGG'=>'Arg',
  'CAC'=>'His','CAU'=>'His',
  'AGC'=>'Ser','AGU'=>'Ser','UCA'=>'Ser','UCC'=>'Ser',
  'UCG'=>'Ser','UCU'=>'Ser',
  'AAU'=>'Asn','AAC'=>'Asn',
  'AUA'=>'Ile','AUC'=>'Ile','AUU'=>'Ile',
  'ACA'=>'Thr','ACC'=>'Thr','ACG'=>'Thr','ACU'=>'Thr',
  'GAC'=>'Asp','GAU'=>'Asp',
  'CUA'=>'Leu','CUC'=>'Leu','CUG'=>'Leu','CUU'=>'Leu',
  'UUA'=>'Leu','UUG'=>'Leu',
  'UGG'=>'Trp',
  'UGC'=>'Cys','UGU'=>'Cys',
  'GAA'=>'Glu','GAG'=>'Glu',
  'AAA'=>'Lys','AAG'=>'Lys',
  'AUG'=>'Met',
  'UAC'=>'Tyr','UAU'=>'Tyr',
  'GUA'=>'Val','GUC'=>'Val','GUG'=>'Val','GUU'=>'Val',
  'CAA'=>'Gln','CAG'=>'Gln',
  'UUC'=>'Phe','UUU'=>'Phe',
  'UAA'=>'Stop','UAG'=>'Stop','UGA'=>'Stop'
}
# Get non-template DNA sequence from user.
print "Enter non-template DNA strand in capital letters: "
dnaseq = gets.chomp
#
# Convert DNA to RNA.
rnaseq = dnaseq.gsub(/T/,'U')
#
puts "DNA: #{dnaseq}"
puts "RNA: #{rnaseq}"
# Get number of codons.
codonctr = rnaseq.length / 3 # Integer division.
print "Protein: ";
(0...codonctr).each do |i|
  # Call to hashtable:
  # Note #{expression} interpolation:
  print "#{codes[rnaseq[i * 3,3]]}"
end
puts

   Old C, Fortran or Assembly Language programmers expect that the first element of an array is at subscript 0. That is the first element is found offset 0 elements from the start. Math minded folks who write algorithms usually expect that the first element is at subscript 1. Translating subscripts from math notation to programming notation is a source of error and confusion. It might help to think in terms of iteration, or going through a list of things and pointing at one thing after another in order to do something with each thing in the list individually. All the scripting languages implement iteration fairly painlessly.

   Let us retry Example 2.2 while thinking in terms of lists and iterations.

  1. Change the DNA to RNA while still in string format as before.
  2. Search through the RNA string and make a list of codons. In regular expressions [ACGU]{3} means a string of 3 letters from the set {A, C, G, U} -- a codon. In Perl, save each codon that matches our pattern in a flat list and then continue where we left off and search some more until we run out of RNA. The "g" option for a match means a "progressive match" instead of "global" as it did for a substitution. The Ruby scan function does the same thing -- makes a list of all the matches for a given pattern found in a string.
  3. Next, take the list of codons and make a list of the corresponding amino acids. A iterator that does this is called a map.
  4. Finally, rejoin the list of amino acids into a protein string.

# ch2ex2b.pl
# Translation Program
# Build hashtable for genetic code
%codes = (
  'GCA'=>'Ala','GCC'=>'Ala','GCG'=>'Ala','GCU'=>'Ala',
  'GGA'=>'Gly','GGC'=>'Gly','GGG'=>'Gly','GGU'=>'Gly',
  'CCA'=>'Pro','CCC'=>'Pro','CCG'=>'Pro','CCU'=>'Pro',
  'CGA'=>'Arg','CGC'=>'Arg','CGG'=>'Arg','CGU'=>'Arg',
  'AGA'=>'Arg','AGG'=>'Arg',
  'CAC'=>'His','CAU'=>'His',
  'AGC'=>'Ser','AGU'=>'Ser','UCA'=>'Ser','UCC'=>'Ser',
  'UCG'=>'Ser','UCU'=>'Ser',
  'AAU'=>'Asn','AAC'=>'Asn',
  'AUA'=>'Ile','AUC'=>'Ile','AUU'=>'Ile',
  'ACA'=>'Thr','ACC'=>'Thr','ACG'=>'Thr','ACU'=>'Thr',
  'GAC'=>'Asp','GAU'=>'Asp',
  'CUA'=>'Leu','CUC'=>'Leu','CUG'=>'Leu','CUU'=>'Leu',
  'UUA'=>'Leu','UUG'=>'Leu',
  'UGG'=>'Trp',
  'UGC'=>'Cys','UGU'=>'Cys',
  'GAA'=>'Glu','GAG'=>'Glu',
  'AAA'=>'Lys','AAG'=>'Lys',
  'AUG'=>'Met',
  'UAC'=>'Tyr','UAU'=>'Tyr',
  'GUA'=>'Val','GUC'=>'Val','GUG'=>'Val','GUU'=>'Val',
  'CAA'=>'Gln','CAG'=>'Gln',
  'UUC'=>'Phe','UUU'=>'Phe',
  'UAA'=>'Stop','UAG'=>'Stop','UGA'=>'Stop'
);
# Get non-template DNA sequence from user.
print "Enter non-template DNA strand in capital letters: ";
$dnaseq = <STDIN>;
chomp $dnaseq;
# Transcribe DNA to RNA.
$rnaseq = $dnaseq;
$rnaseq =~ s/T/U/g;
print "DNA: $dnaseq \n";
print "RNA: $rnaseq \n";
# Split the RNA into a list of codons.
@codons = $rnaseq =~ /([ACGU]{3})/g;
# Translate RNA codons to amino acids.
@aminoacids = map {$codes{$_}}@codons; 
print "Protein: " . join("",@aminoacids) . "\n";
# ch2ex2b.rb
# Translation Program
# Build hashtable for genetic code.
codes = {
  'GCA'=>'Ala','GCC'=>'Ala','GCG'=>'Ala','GCU'=>'Ala',
  'GGA'=>'Gly','GGC'=>'Gly','GGG'=>'Gly','GGU'=>'Gly',
  'CCA'=>'Pro','CCC'=>'Pro','CCG'=>'Pro','CCU'=>'Pro',
  'CGA'=>'Arg','CGC'=>'Arg','CGG'=>'Arg','CGU'=>'Arg',
  'AGA'=>'Arg','AGG'=>'Arg',
  'CAC'=>'His','CAU'=>'His',
  'AGC'=>'Ser','AGU'=>'Ser','UCA'=>'Ser','UCC'=>'Ser',
  'UCG'=>'Ser','UCU'=>'Ser',
  'AAU'=>'Asn','AAC'=>'Asn',
  'AUA'=>'Ile','AUC'=>'Ile','AUU'=>'Ile',
  'ACA'=>'Thr','ACC'=>'Thr','ACG'=>'Thr','ACU'=>'Thr',
  'GAC'=>'Asp','GAU'=>'Asp',
  'CUA'=>'Leu','CUC'=>'Leu','CUG'=>'Leu','CUU'=>'Leu',
  'UUA'=>'Leu','UUG'=>'Leu',
  'UGG'=>'Trp',
  'UGC'=>'Cys','UGU'=>'Cys',
  'GAA'=>'Glu','GAG'=>'Glu',
  'AAA'=>'Lys','AAG'=>'Lys',
  'AUG'=>'Met',
  'UAC'=>'Tyr','UAU'=>'Tyr',
  'GUA'=>'Val','GUC'=>'Val','GUG'=>'Val','GUU'=>'Val',
  'CAA'=>'Gln','CAG'=>'Gln',
  'UUC'=>'Phe','UUU'=>'Phe',
  'UAA'=>'Stop','UAG'=>'Stop','UGA'=>'Stop'
}
# Get non-template DNA sequence from user.
print "Enter non-template DNA strand in capital letters: "
dnaseq = gets.chomp
#
# Transcribe DNA to RNA.
rnaseq = dnaseq.gsub(/T/,'U')
#
puts "DNA: #{dnaseq}"
puts "RNA: #{rnaseq}"
# Split the RNA into a list of codons.
codons = rnaseq.scan(/[ACGU]{3}/)
# Translate RNA codons to amino acids
# and print the protein.
puts "Protein: #{codons.map{|c| codes[c]}.join("")}"

   Example 2.3 uses text files. Let us look at the Perl: Need to Know, Section 2.5.3. As a variation we will use an input file with more than one line.

ch2ntk-files.txt
line 2
line 3
A binary view of the input, which was typed on OS X..
00000000: 6368 326e 746b 2d66 696c 6573 2e74 7874  ch2ntk-files.txt
00000010: 0a6c 696e 6520 320a 6c69 6e65 2033 0a    .line 2.line 3.

# ch2ntk-files.pl
# Understanding text files and newlines.
open INFILE, 'ch2ntk-files.txt' or die;
open OUTFILE, '>ch2ntk-files-out.txt' or die;
while ($data = <INFILE>) {
  chomp $data;
  print OUTFILE "Input String: $data"; # Notice no comma.
}
close OUTFILE; # Why not also close the infile?
Which produces:
Input String: ch2ntk-files.txtInput String: line 2Input String: line 3


What changes need to be made here to make our program work with a file that is more than one line?
# ch2ntk-files.rb
# Understanding text files and newlines.
infile = open 'ch2ntk-files.txt'
outfile = open 'ch2ntk-files-out-rb.txt',"w"
infile.readlines.each do |data|
  data.chomp! # ! to alter data.
  outfile.puts "Input String: #{data}"
end
outfile.close
Which produces:
Input String: ch2ntk-files.txt
Input String: line 2
Input String: line 3
We could also use outfile.write("... \n").

   Begin Example 2.3 with review of the FASTA format as used at NCBI. With modern computer memories being bigger than most text files, we could practice reading whole files at once rather than line-by-line. So we can have several test files and not change our code, we will use ARGV[0], which is the first argument on the command line. (This may confuse C and C++ programmers who would use ARGV[0] and expect the name of the program and use ARGV[1] to get the first argument on the command line. Our four sets of test files:

  1. >ch2ex3a_1.txt
    AGTCGT
    
    >ch2ex3a_2.txt
    AGTCGT
    
  2. >ch2ex3b_1.txt
    GGTCGT
    
    >ch2ex3b_2.txt
    AGTCGT
    
  3. >ch2ex3c_1.txt
    AGG
    
    >ch2ex3c_2.txt
    GAA
    
  4. >ch2ex3d_1.txt
    F
    FS
    RGDE
    
    >ch2ex3d_2.txt
    F
    SS
    RGWE
    
So the command line
perl ch2ex3.pl ch2ex3a
would compare the first 2 files using the Perl version of Chapter 2, Example 3.

# ch2ex3.pl
# Get the strings from the file.
$f1 = "$ARGV[0]_1.txt";
if (!open INFILE_1, $f1) {
  print "Error opening input file $f1\n";
  exit;
}
$fasta_header = <INFILE_1>; # Ignore FASTA header.
@lines = <INFILE_1>; # Save the rest of the file.
$string1 = join "",@lines; # Make that into 1 string.
$string1 =~ s/\s//g; # Remove spaces and line breaks.
$f2 = "$ARGV[0]_2.txt";
if (!open INFILE_2, $f2) {
  print "Error opening input file $f2\n";
  exit;
}
$fasta_header = <INFILE_2>; # Ignore FASTA header.
@lines = <INFILE_2>; # Save the rest of the file.
$string2 = join "",@lines; # Make that into 1 string.
$string2 =~ s/\s//g; # Remove spaces and line breaks.
print "Comparing:\n$string1\n$string2\n"; 
# Check for mismatches.
$mismatches = 0; # Initialize the flag.
for ($i = 0;$i < length $string1;$i++) {
  if (substr($string1,$i,1) ne substr($string2,$i,1)) {
    $mismatches++; # Event occurred, increment the counter.
    print "Mismatch found at location: ", $i + 1, "\n";
  }
}
# Print test results.
if (!$mismatches)  { # 0 mismatches.
  print "Strings are identical.\n";
} elsif ($mismatches == 1) {
  print "There was 1 mismatch.\n";
} else {
  print "There were $mismatches mismatches.\n";
}
# ch2ex3.rb
# Get the strings from the file.
f1 = "#{ARGV[0]}_1.txt"
if !(infile1 = open f1)
  puts "Error opening input file #{f1}"
  exit
end
fasta_header = infile1.readline # Ignore FASTA header.
string1 = infile1.readlines.join("").gsub(/\s/,"")
#
#
f2 = "#{ARGV[0]}_2.txt"
if !(infile2 = open f2)
  puts "Error opening input file #{f2}"
  exit
end
fasta_header = infile2.readline # Ignore FASTA header.
string2 = infile2.readlines.join("").gsub(/\s/,"")
#
#
puts "Comparing:\n#{string1}\n#{string2}" 
# Check for mismatches.
mismatches = 0 # Initialize the flag.
(0...string1.length).each do |i|
  if string1[i,1] != string2[i,1] # Event occurred,
    mismatches += 1 # so increment the counter.
    puts "Mismatch found at location: #{i + 1}"
  end
end
# Print test results.
if mismatches == 0
  puts "Strings are identical."
elsif mismatches == 1
  puts "There was 1 mismatch"
else
  puts "There were #{mismatches} mismatches."
end

   These exercises check some sequence against what is assumed to be the latest version of a reference sequence. Appendix H has an example of how to update your team's local file copy of the reference sequence to the latest version with help from the NCBI eUtilities.

TOC

Chapter 3 Gene Alignments: Investigating Antibiotic Resistance

   In the example of programming Algorithm 3.3, we will look at some additional concepts: stacks, here documents, matrices, built-in help, and the difference between using "if" as a flow control statement versus as a statement modifier.

Run the programs below first with the test pairs from the text and then try some of your own sequence pairs.

# ch3ga.pl
# Optimal global alignments.
# Scoring constants:
$match = 1;
$mismatch = 0;
$gap = -1;
$debug = 0;
# See if help is needed.
if (grep /^-h$/,@ARGV or
    scalar(@ARGV) - scalar(grep /^-[hd]$/,@ARGV) < 2) {
  print <<HELP_MESSAGE;
Usage: ch3ga.pl [-options] sequence_1 sequence_2
Options:
  -d  Print out the scoring matrix for debugging.
  -h  Print out this help message.
HELP_MESSAGE
  exit 1;
}
# Get the input sequences and
# decide if the scoring matrix should be printed.
$i = 0;
for $argument (@ARGV) {
  if ($argument eq '-d') {
    $debug = 1;
  } elsif ($i == 0) {
    @h = (" ",split("",$argument));
    $i = 1;
  } else {
    @v = (" ",split("",$argument));
  }
}
# Figure the dimensions of the scoring matrix.
$n = scalar(@v) - 1;
$m = scalar(@h) - 1;
# Prepare the alignment scoring matrix 
# and the backward links matrix during the
# same pass from the upper, left to lower,
# right corners of the dynamic programming
# table.
# Prepare the dynamic programming table.
$s[0][0] = 0;    # Top, left corner
$b[0][0] = "";
for $i (1..$n) { # Left column
  $s[$i][0] = $i * $gap;
  $b[$i][0] = 'v';
}
for $j (1..$m) { # Top row
  $s[0][$j] = $j * $gap;
  $b[0][$j] = 'h';
 }
for $i (1..$n) {
  for $j (1..$m) {
    # Determine the maximum of scores.
    # Check vertically and horizontally first.
    $up = $s[$i - 1][$j] + $gap;
    $left = $s[$i][$j - 1] + $gap;
    if ($up > $left) {
      $s[$i][$j] = $up;
    } else {
      $s[$i][$j] = $left;
    }
    $delta = $v[$i] eq $h[$j] ? $match : $mismatch;
    $diag = $s[$i - 1][$j - 1] + $delta;
    # Using the statement modifier form of if.:
    $s[$i][$j] = $diag if $diag > $s[$i][$j];
    # Collect and save backtracking pointers,
    # for scores equal to the maximum.
    $b[$i][$j] = "";
    $b[$i][$j] .= 'v' if $s[$i][$j] == $up;
    $b[$i][$j] .= 'h' if $s[$i][$j] == $left;
    $b[$i][$j] .= 'd' if $s[$i][$j] == $diag;
  }
} 
# Display the dynamic programming table if the
# debug option is requested.
# Also practice with string multiplication and
# old-fashioned drawing with characters.
if ($debug) {
  print "Dynamic programming table:\n";
  print ' ' x 4 . '|';
  for $l (@h) { print "   $l  |"; }
  print "\n";
  $line0 = "----+" . ('-' x 6 . '+') x ($m + 1);
  print "$line0\n";
  for $i (0..$n) {
    $line1 = "    |";
    $line2 = " $v[$i]  |";
    for $j (0..$m) {
      $up = $b[$i][$j] =~ /v/ ? "^" : " ";
      $left = $b[$i][$j] =~ /h/ ? "<" : " ";
      $diagonal = $b[$i][$j] =~ /d/ ? "\\" : " ";
      $line1 .= "$diagonal   $up |";
      $line2 .= sprintf "%s %3d |",$left,$s[$i][$j];
    }
    print "$line1\n";
    print "$line2\n";
    print "$line0\n";
  }
}
# Find the optimal alignment paths.
@todo_list = ($n,$m,"",""); # Entry (i,j,string1,string2)
@done_list = ();
while (@todo_list) {
  ($str2,$str1,$j,$i) = (pop @todo_list,pop @todo_list,
                         pop @todo_list,pop @todo_list);
  if (length $b[$i][$j]) { # If some back-link exists.
    push @todo_list,($i - 1, $j - 1,
                     $h[$j] . $str1, $v[$i] . $str2)
      if $b[$i][$j] =~ /d/;
    push @todo_list,($i - 1, $j,
                     '-' . $str1, $v[$i] . $str2)
      if $b[$i][$j] =~ /v/;
    push @todo_list,($i, $j - 1,
                     $h[$j] . $str1, '-' . $str2)
      if $b[$i][$j] =~ /h/;
  } else {
    push @done_list,($str1,$str2);
  }
}
# Print each of the alignments in @done_list.
$i = 1;
# Change from last-in-first-out to first-in-first-out.
@done_list = reverse @done_list;
while (@done_list) {
  ($str1,$str2) = (pop @done_list,pop @done_list);
  print "Alignment $i\n$str1\n$str2\n";
  $i++;
}
# ch3ga.rb
# Optimal global alignments.
# Scoring constants:
match = 1
mismatch = 0
gap = -1
debug = nil
# See if help is needed.
if ARGV.grep(/^-h$/).size > 0 or
   ARGV.size - ARGV.grep(/^-[hd]$/).size < 2
  print <<HELP_MESSAGE
Usage: ch3ga.rb [-options] sequence_1 sequence_2
Options:
  -d  Print out the scoring matrix for debugging.
  -h  Print out this help message.
HELP_MESSAGE
  exit 1
end
# Get the input sequences and
# decide if the scoring matrix should be printed.
i,h,v = 0,[],[]
ARGV.each do |argument|
  if argument == '-d'
    debug = 1
  elsif i == 0
    h = (" " + argument).split("")
    i = 1
  else
    v = (" " + argument).split("")
  end
end
# Figure the dimensions of the scoring matrix.
n = v.size - 1
m = h.size - 1
# Prepare the alignment scoring matrix 
# and the backward links matrix during the
# same pass from the upper, left to lower,
# right corners of the dynamic programming
# table.
# Prepare the dynamic programming table.
# First build s and b as arrays of rows.
# While adding each row to the array or
# list of rows, fill in the left-most
# column.
s = (0..n).inject([]) {|s,e| s << [e * gap]}
b = (0..n).inject([]) {|b,e| b << [e == 0?"":'v']}
(1..m).each do |j| # Top row
  s[0][j] = j * gap
  b[0][j] = 'h'
end
(1..n).each do |i|
  (1..m).each do |j|
    # Determine the maximum of scores.
    # Check vertically and horizontally first.
    up = s[i - 1][j] + gap
    left = s[i][j - 1] + gap
    if up > left
      s[i][j] = up
     else
      s[i][j] = left
    end
    delta = v[i] == h[j] ? match : mismatch
    diag = s[i - 1][j - 1] + delta
    # Using the statement modifier form of if.
    s[i][j] = diag if (diag > s[i][j])
    # Collect and save backtracking pointers,
    # for scores equal to the maximum.
    b[i][j] = ""
    b[i][j] += 'v' if s[i][j] == up
    b[i][j] += 'h' if s[i][j] == left
    b[i][j] += 'd' if s[i][j] == diag
  end
end
# Display the dynamic programming table if the
# debug option is requested.
# Also practice with string multiplication and
# old-fashioned drawing with characters.
if debug
  puts "Dynamic programming table:"
  print ' ' * 4 + '|'
  h.each { |l| print "   #{l}  |"}
  puts
  line0 = "----+" + ('-' * 6 + '+') * (m + 1)
  puts line0
  (0..n).each do |i|
    line1 = "    |"
    line2 = " #{v[i]}  |"
    (0..m).each do |j|
      up = b[i][j] =~ /v/ ? "^" : " "
      left = b[i][j] =~ /h/ ? "<" : " "
      diagonal = b[i][j] =~ /d/ ? "\\" : " "
      line1 += "#{diagonal}   #{up} |"
      line2 += sprintf "%s %3d |",left,s[i][j]
    end
    puts line1
    puts line2
    puts line0
  end
end
# Find the optimal alignment paths.
todo_list = [n,m,"",""] # Entry (i,j,string1,string2)
done_list = []
while !todo_list.empty?
  str2,str1,j,i = todo_list.pop,todo_list.pop,
                  todo_list.pop,todo_list.pop
  if !b[i][j].empty? # If some back-link exists.
    todo_list.push(i - 1, j - 1,
                   h[j] + str1, v[i] + str2) \
      if b[i][j] =~ /d/
    todo_list.push(i - 1, j,
                   '-' + str1, v[i] + str2) \
      if b[i][j] =~ /v/
    todo_list.push(i, j - 1,
                     h[j] + str1, '-' + str2) \
      if b[i][j] =~ /h/
  else
    done_list.push(str1,str2)
  end
end
# Print each of the alignments in done_list.
i = 1
# Change from last-in-first-out to first-in-first-out.
done_list.reverse!
while !done_list.empty?
  str1,str2 = done_list.pop,done_list.pop
  puts "Alignment #{i}\n#{str1}\n#{str2}"
  i += 1
end

TOC

Chapter 4 Protein Alignments: Clues to Protein Function

   In Programming Exercise 4.2 we store the substitution matrix in a text file and need to read it into a 2-dimensional data structure. Here are some examples to review those structures in Perl and Ruby, and provide more practice at iterating over such data structures. In both Perl and Ruby the default regular expression pattern for split() is \s+. So the default behavior is to split a sentence on whitespace, that is, into an array of words.

# ch4getmatrix.pl
$matrix_file = 'pam250.txt';
die "Lost $matrix_file" if (!open MATRIX, $matrix_file);
@lines = <MATRIX>; # Read the entire file.
# The top row is set of amino acid labels.
@aa = split /\s+/,shift @lines;
# Build a hash to use as an index into the
# substitution matrix.
for $i (0..19) {
  $aa_idx{$aa[$i]} = $i;
}
# Populate the substitution matrix.
for (@lines) { # Using the default argument, $_.
  push @submatrix, [ split ]; # \s+ is the default pattern.
}
# Summarize.
print "There were " . scalar(@aa) . " amino acids and " .
      scalar(@submatrix) . " matrix rows:\n";
# Echo the substitution matrix.
print join(' ',map {sprintf "%3s",$_} @aa), "\n";
for $row_ref (@submatrix) {
  print join(' ',map {sprintf "%3d",$_} @$row_ref), "\n";
}
# ch4getmatrix.rb
matrix_file = 'pam250.txt'
#
lines = File::readlines(matrix_file) # Read the entire file.
# The top row is set of amino acid labels.
aa = lines[0].split
# Build a hash to use as an index into the
# substitution matrix.
aa_idx = {}
aa.each_with_index {|e,i| aa_idx[e] = i}
#
# Populate the substitution matrix.
submatrix = lines[1..-1].inject([]) {|m,l| m << l.split}
#
#
# Summarize.
puts "There were #{aa.size} amino acids " +
     "and #{submatrix.size} matrix rows:"
# Echo the substitution matrix.
puts aa.map {|a| sprintf "%3s",a}.join(' ')
submatrix.each do |row|
  puts row.map {|p| sprintf "%3d",p}.join(' ')
end

   Jumping ahead to Chapter 8, we see that a 2-dimensional hash is as easy to use as a 2-dimensional array. Which sort of data structure you use depends somewhat on what your program does most often. Arrays have order and hashes do not. If you primarily will be displaying the rows and columns in their existing order, the approach above with an array of arrays and a separate hash index, would be suitable. If you will primarily be accessing the data structure randomly, a hash of hashes would use half the indexing operations.

# ch4getmatrix2.pl
$matrix_file = 'pam250.txt';
die "Lost $matrix_file" if (!open MATRIX, $matrix_file);
@lines = <MATRIX>; # Read the entire file.
# The top row is set of amino acid labels.
@aa = split /\s+/,shift @lines;
# Populate the substitution matrix.
%submatrix = ();
for $a_row (@aa) {
  @row = split /\s+/,shift @lines;
  for $a_col (@aa) {
    $submatrix{$a_row}{$a_col} = shift @row;
  }
}
#
# Summarize.
print "There were " . scalar(@aa) . " amino acids and " .
      scalar(keys %submatrix) . " matrix rows:\n";
# Echo the substitution matrix.
print join(' ',map {sprintf "%3s",$_} @aa), "\n";
for $i (0..19) {
  #
  @s = map {sprintf "%3d",$submatrix{$aa[$i]}{$aa[$_]}} (0..19);
  print join(' ',@s), "\n";
}
# ch4getmatrix2.rb
matrix_file = 'pam250.txt'
#
lines = File::readlines(matrix_file) # Read the entire file.
# The top row is set of amino acid labels.
aa = lines[0].split
# Populate the substitution matrix.
submatrix = {}
(0..19).each do |i|
  row = lines[i + 1].split
  submatrix[aa[i]] = (0..19).inject({}) do |h,j|
    h[aa[j]] = row[j]
    h # Need the whole hash and not the single value.
  end
end
# Summarize.
puts "There were #{aa.size} amino acids " +
     "and #{submatrix.size} matrix rows:"
# Echo the substitution matrix.
puts aa.map {|a| sprintf "%3s",a}.join(' ')
aa.each do |a_row|
  puts aa.map { |a_col| 
   sprintf "%3d",submatrix[a_row][a_col]}.join(' ')
end
#

   Consider some code for Problem 4.4 on page 119. Ruby's module for matrices is built in.

# ch4matrices.pl
# A program to print submatrix * submatrix * submatrix.
#
@submatrix = ([2,-1],[3,0]);
# This is left as an exercise for the student.
#
# ch4matrices.rb
# A program to print submatrix * submatrix * submatrix.
require 'matrix'
submatrix = Matrix[[2,-1],[3,0]]
puts submatrix**3
# ==> Matrix[[-4, -1], [3, -6]]

   For more on matrix multiplication, see Appendix F on R.

   In Perl and Ruby the operators |, || and or are slightly different. Consider this upper left corner of the half PAM250 matrix from Figure 4.7.

$ irb
>> delta = {}
=> {}
>> delta['a'] = {'a' => 2}
=> {"a"=>2}
>> delta['r'] = {'a' => -2,'r' => 6}
=> {"a"=>-2, "r"=>6}
>> delta['n'] = {'a' => 0,'r' => 0,'n' => 2}
=> {"a"=>0, "n"=>2, "r"=>0}
>> delta['d'] = {'a' => 0,'r' => -1,'n' => 2,'d' => 4}
=> {"a"=>0, "n"=>2, "d"=>4, "r"=>-1}
>> delta['a']['d']
=> nil
>> delta['d']['a']
=> 0
>> delta['a']['d'] | delta['d']['a']
=> true
>> delta['a']['d'] || delta['d']['a']
=> 0
>> delta['a']['d'] or delta['d']['a']
=> 0

   In Chapter 3 and Chapter 4 we have used Perl and Ruby to initialize matrices. In Chapter 3 we introduced Perl references and passing a matrix by reference as an argument to a subroutine. It is common for Perl to "do about what you would expect" and that includes having a variable "spring into existence" when it appears to the left of an assignment operator like "=". This allows the straightforward initialization of @matrix in this excerpt from ch3skills.pl below.

# Chapter 3 - Solution to Skills 6
# Recursive Implementation of Needleman and Wunsch Algorithm
...
# declare a two dimensional array and initialize to spaces
# array must contain one extra row and one extra column
@matrix = ();
for($i = 0; $i < $len1; $i++){
   for($j = 0; $j < $len2; $j++){
      $matrix[$i][$j] = ' ';
   }
}
...
# notice passing arrays and scalar variables requires the arrays to be
# passed by reference
findpaths($path, $len1-1, $len2-1, \@matrix, $seq1, $seq2);
...
sub findpaths{
   my($path, $pathrow, $pathcol, $matrix, $dnaseq1, $dnaseq2) = @_;

   # NOTE: ...
   #        $matrix is now referred to as @$matrix and elements $$matrix[x]
...
      if (($$matrix[$pathrow][$pathcol-1] - 1) == $$matrix[$pathrow][$pathcol]){
Some Interactive Ruby to examine why the same approach does not work in Ruby:
> matrix = [[1]]
 => [[1]] 
> matrix[0][2] = 3 # We can add columns to an existing row in any order.
 => 3 
> matrix
 => [[1, nil, 3]] 
> matrix[0][1] = 2
 => 2 
> matrix
 => [[1, 2, 3]] 
> matrix[1] = matrix[0].reverse # We can add a new row.
 => [3, 2, 1] 
> matrix
 => [[1, 2, 3], [3, 2, 1]] 
> matrix[2][0] = 4 # But, we cannot add a column to a non-existent row.
NoMethodError: undefined method `[]=' for nil:NilClass
> matrix[2] = [] # So, add an empty row first.
 => [] 
> matrix[2][0] = 4 Now, OK.
 => 4 
> matrix
 => [[1, 2, 3], [3, 2, 1], [4]] 
> matrix[2] += [5,6] # Make the matrix rectangular.
 => [4, 5, 6] 
> matrix
 => [[1, 2, 3], [3, 2, 1], [4, 5, 6]] 
Although, initialization of an array is simpler in Perl; earlier we saw that referring to a single row as a whole is simpler in Ruby. A little more on de-referencing references in Perl:
# ch4perlref.pl
# Notice the use of the block delimiters { and } to keep the @ de-reference
# from binding too tightly to the variable identifier.
@matrix = ([1,2],[3,4]);
$matrixref = \@matrix;
print "1st row of matrix: ";
    for ($i=0,$j=0;$j < 2;$j++) {print $matrix[$i][$j] . ' ';}
print "\n";
print '$matrix[0]: ' . $matrix[0] . "\n";
print '@{$matrix[0]}: ' . join(',',@{$matrix[0]}) . "\n";
print '$$matrixref[0]: ' . $$matrixref[0] . "\n";
print '@{$$matrixref[0]}: ' . join(',',@{$$matrixref[0]}) . "\n";
print '@{$matrixref->[0]}: ' . join(',',@{$matrixref->[0]}) . "\n";
print "Add a 3rd row:\n";
push @$matrixref,[5,6];
print '@{$matrix[2]}: ' . join(',',@{$matrix[2]}) . "\n";
=begin comment
Output:
bash-3.2$ perl ch4perlref.pl 
1st row of matrix: 1 2 
$matrix[0]: ARRAY(0x100804ed0)
@{$matrix[0]}: 1,2
$$matrixref[0]: ARRAY(0x100804ed0)
@{$$matrixref[0]}: 1,2
@{$matrixref->[0]}: 1,2
Add a 3rd row:
@{$matrix[2]}: 5,6
=end comment
=cut
Deciding whether the audience of your program code will be more likely to be comfortable or confused by references, can help you decide whether to use Perl or Ruby for a project.

TOC

Chapter 5 Sequence Assembly: The Human Genome

   To test your code that implements an algorithm or heuristic that uses random numbers can be tricky. It may be useful to have a way to have a reproduceable series of random numbers. Both Perl and Ruby allow this because they use pseudo-random number generation. Calling rand() automatically first calls srand() if srand() was not already called. If srand() is first called with a constant, the random number series can be reproduced.

# ch5random.pl
# rand in Perl returns a floating point number.
@series = ();
srand 1234;
for (1..10) {
  push @series, int(rand 100);
}
print join(' ',@series) . "\n";
=begin comment
Output:
bash-3.2$ perl ch5random.pl 
74 21 33 32 99 28 1 50 58 27
bash-3.2$ perl ch5random.pl 
74 21 33 32 99 28 1 50 58 27
bash-3.2$ perl ch5random.pl 
74 21 33 32 99 28 1 50 58 27
bash-3.2$ perl ch5random.pl 
74 21 33 32 99 28 1 50 58 27
=end comment
=cut
# ch5random.rb
# rand(i >= 1) in Ruby returns an integer.
series = []
srand 1234
10.times { series << rand(100) }
#
#
puts series.join ' '
=begin
Output:
bash-3.2$ ruby ch5random.rb
47 83 38 53 76 24 15 49 23 26
bash-3.2$ ruby ch5random.rb
47 83 38 53 76 24 15 49 23 26
bash-3.2$ ruby ch5random.rb
47 83 38 53 76 24 15 49 23 26
bash-3.2$ ruby ch5random.rb
47 83 38 53 76 24 15 49 23 26
=end
#

TOC

Chapter 6 Gene Prediction: Finding Genes in the Human Genome

   Here are some examples of subroutines in Perl and Ruby. Notice:

# ch6subroutine.pl
#    An exercise in subroutine parameters and
# return values that uses sorting on
# hash values rather than keys.
#    For a little more granularity we will
# will use these measurements from Table 4.1.
#
# Subroutine to fill in the hydrophobicity hash.
sub get_hydro_hash {
  my $table = <<WHITE;
Ala  1.8
Arg -4.5
Asn -3.5
Asp -3.5
Cys  2.5
Gln -3.5
Glu -3.5
Gly -0.4
His -3.2
Ile  4.5
Leu  3.8
Lys -3.9
Met  1.9
Phe  2.8
Pro -1.6
Ser -0.8
Thr -0.7
Trp -0.9
Tyr -1.3
Val  4.2
WHITE
  # Parse the table into a hash, or actually
  # returns a list of key-value pairs.
  $table =~ /^([A-Z][a-z]{2})\s+(-?\d\.\d)/gm;
#
}
# Call the subroutine in list context.
%hh = &get_hydro_hash;
print scalar(keys %hh), "\n";
for $key (sort {$hh{$a} <=> $hh{$b}} keys %hh) {
  print "$key, $hh{$key}\n";
}
# ch6subroutine.rb
#    An exercise in subroutine parameters and
# return values that uses sorting on
# hash values rather than keys.
#    For a little more granularity we will
# will use the measurements from Table 4.1.
#
# Subroutine to fill in the hydrophobicity hash.
def get_hydro_hash
  table = <<WHITE
Ala  1.8
Arg -4.5
Asn -3.5
Asp -3.5
Cys  2.5
Gln -3.5
Glu -3.5
Gly -0.4
His -3.2
Ile  4.5
Leu  3.8
Lys -3.9
Met  1.9
Phe  2.8
Pro -1.6
Ser -0.8
Thr -0.7
Trp -0.9
Tyr -1.3
Val  4.2
WHITE
  # Parses the table into an array of arrays and then
  # changes the values to floats and returns a hash.
  Hash[table.scan(/^([A-Z][a-z]{2})\s+(-?\d\.\d)/
                 ).map {|e| [e[0],e[1].to_f]}]
end
# Call the subroutine.
hh = get_hydro_hash
puts hh.keys.size
hh.keys.sort {|a,b| hh[a] <=> hh[b]}.each do |key|
  puts "#{key}, #{hh[key]}"
end

   In Perl before Version 6, with no formal parameters, the my function is important for keeping local variable names from conflicting with the same names in other subroutines. This is another instance where understanding the precedence of operations is important. It is often convenient to write function calls and expressions that look like function calls without bothering with the parentheses -- but not always!

# ch6my.pl
# Using my.
# Remember the my binds more tightly than the comma.
sub s1 {
# This is probably not what you want, since $two and $three
# are still global.
  my $one,$two,$three;
  print "\$one is not yet defined.\n" unless defined $one;
  print "\$two is not yet defined.\n" unless defined $two;
  print "\$three is not yet defined.\n" unless defined $three;
}

sub s2 {
# This is probably what you want:
  my ($one,$two,$three);
  print "\$one is not yet defined.\n" unless defined $one;
  print "\$two is not yet defined.\n" unless defined $two;
  print "\$three is not yet defined.\n" unless defined $three;
}

$one = 1;
$two = 2;
$three = 3;
print "Without parentheses:\n";
&s1;
print "With parentheses:\n";
&s2;
=begin comment
Output:
$ perl ch6my.pl
Without parentheses:
$one is not yet defined.
With parentheses:
$one is not yet defined.
$two is not yet defined.
$three is not yet defined.
=end comment
=cut

   Chapter 6 mentions regular expressions. Here is some Ruby that does the same thing as the Perl for the Skills section of 6.5.2. Notice that Regular expressions can simplify some pattern matching problems. This Ruby example uses iterators and blocks instead of subroutines.

   This example also introduces using objects and their methods. Perl allows you to either use or not use object-oriented programming. Ruby uses objects all the time, so take advantage of them. The ShineDalgarno object does a sort of brute-force match against each substring of the motif to find the longest common substring ~ Ο(n2). The object has just 2 methods or subroutines:

initialize()
Constructs @keys, a list of unique substrings, saved with the longest substrings first. The @keys variable is not globally known like a Ruby $global_variable. The "@" means that it is known only to the instance. In this example, "instance" is the object of class ShineDalgarno that is constructed using the e. coli motif. hash is only around in the initialize method long enough to weed out the duplicates from all the possible substrings.
score()
Returns the length of the longest common substring between the motif and the region of nucleotides.
Try this code with both singleorf.txt and pACYC184.txt.
# ch6orfregex.rb
class ShineDalgarno
  def initialize(motif)
    hash = (0...motif.size).inject({}) do |h,j|
      (0..j).each {|i| h[motif[i,motif.size - j]] = 1}
      h # Return value needs to be the entire hash.
    end
    @keys = hash.keys.sort {|a,b| b.size <=> a.size} # Longest 1st.
  end
  def score(region)
    lcs = @keys.find {|key| region =~ /#{key}/}
    lcs ? lcs.size : 0
  end
end
# Check that there is a sequence on the command line.
unless ARGV.size == 1
  puts "Usage:\nch6orfregex.rb fasta-sequence-file"
end
# Read in just the sequence.
seq = open(ARGV[0]).readlines[1..-1].join.gsub(/[^ATGC]/,'')
puts "Sequence is #{seq.size} nucleotides long."
clength = 60 * 3 # Minimum ORF of 60 codons.
seqc = seq.reverse.tr('ATCG','TAGC')
ecoli_k12 = 'AAGGAGG'
sd = ShineDalgarno.new(ecoli_k12)
[['+',seq.tr('T','U'),seq],['-',seqc.tr('T','U'),seqc]].each do |s|
  puts "Searching #{s[0] == '+' ? 'non-' : ""}template strand."  
  # As a preliminary, are there any full matches for a Shine-Dalgarno site?
  puts "Maximum length of Shine-Dalgarno substring is: #{sd.score(s[1])}."
  (0..2).each do |i|
    puts "Searching frame #{s[0]}#{i + 1}."
    offset = i
    while offset < s[1].size - clength - 3
      start_match = /^.{#{offset}}(?:...)*?(AUG)/.match(s[1])
      if start_match
        beginning = start_match.begin(1)
        stop_match = /^.{#{beginning + 3}}
                      ((?:...)*?)(UAA|UAG|UGA)/x.match(s[1])
        if stop_match
          ending = stop_match.end(2) - 1
          if stop_match[1].size >= clength
            if s[0] == '+' # Just change offset to a position.
              first_pos = beginning + 1
              last_pos = ending + 1
            else # Report positions relative to the original strand.
              first_pos = s[1].size - ending
              last_pos = s[1].size - beginning
            end
            puts "ORF in frame #{s[0]}#{i + 1} " +
                 "from position #{first_pos} " +
                 "to position #{last_pos}: " +
                 "#{(ending - beginning + 1 - 3) / 3} codons."
            # Tentatively resume looking for next start.
            offset = beginning + 3
            # Now check for e. coli specific Shine-Dalgarno sequence.
            if beginning > 16
              puts "SD site: " +
                   "#{s[1][beginning - 16,13]}...#{s[1][beginning,3]}."
              sd_score = sd.score(s[1][beginning - 16,13])
              if sd_score >= 3
                puts "This ORF is preceded by a Shine-Dalgarno sequence " +
                     "with score: #{sd_score}."
                offset = ending + 1 # Resume looking for non-overlapping ORF.
              end
            end
            # Now check for e. coli sigma 70 promoter sequence.
            if beginning >= 85 and
                 s[2][0..beginning - 50] =~ /TTGACA.{15,19}TATAAT/
              puts "This ORF is preceded by a sigma 70 promoter sequence."
              offset = ending + 1 # Resume looking for non-overlapping ORF.
            end
          else # ORF too short, resume looking for next start codon.
            offset = beginning + 3
          end
        else # No more stop codons to match.
          offset = s[1].size
        end
      else # No more start codons to match.
        offset = s[1].size
      end
    end
  end
end

   ch6orfregex.rb needs a little work to handle a circular piece of DNA like pACYC184.

   Subprograms or subroutines are tools to modularize your code so that it is more readable and more maintainable. For a related discussion see Appendix K, Abbreviating Variables.

TOC

Chapter 7 RNA and Protein Structure: Structure Prediction

   Here let us modify our global alignment examples from Chapter 3 to use a recursive function to follow the optimal paths back from the lower right corner to the top left corner.

# ch7recursive_ga.pl
# Optimal global alignments using a
# recursive function.
sub trace_back { # (@path_so_far)
  my ($i,$j,$str1,$str2) = @_;
  if (length $b[$i][$j]) { # If some back-link exists.
    trace_back($i, $j - 1,$h[$j] . $str1, '-' . $str2)
      if $b[$i][$j] =~ /h/;
    trace_back($i - 1, $j,'-' . $str1, $v[$i] . $str2)
      if $b[$i][$j] =~ /v/;
    trace_back($i - 1, $j - 1,$h[$j] . $str1, $v[$i] . $str2)
      if $b[$i][$j] =~ /d/;
  } else {
    push @done_list,[$str1,$str2];
  }
}
# Scoring constants:
$match = 1;
$mismatch = 0;
$gap = -1;
$debug = 0;
# See if help is needed.
if (grep /^-h$/,@ARGV or
    scalar(@ARGV) - scalar(grep /^-[hd]$/,@ARGV) < 2) {
  print <<HELP_MESSAGE;
Usage: ch7recursive_ga.pl [-options] sequence_1 sequence_2
Options:
  -d  Print out the scoring matrix for debugging.
  -h  Print out this help message.
HELP_MESSAGE
  exit 1;
}
# Get the input sequences and
# decide if the scoring matrix should be printed.
$i = 0;
for $argument (@ARGV) {
  if ($argument eq '-d') {
    $debug = 1;
  } elsif ($i == 0) {
    @h = (" ",split("",$argument));
    $i = 1;
  } else {
    @v = (" ",split("",$argument));
  }
}
# Figure the dimensions of the scoring matrix.
$n = scalar(@v) - 1;
$m = scalar(@h) - 1;
# Prepare the alignment scoring matrix 
# and the backward links matrix during the
# same pass from the upper, left to lower,
# right corners of the dynamic programming
# table.
# Prepare the dynamic programming table.
$s[0][0] = 0;    # Top, left corner
$b[0][0] = "";
for $i (1..$n) { # Left column
  $s[$i][0] = $i * $gap;
  $b[$i][0] = 'v';
}
for $j (1..$m) { # Top row
  $s[0][$j] = $j * $gap;
  $b[0][$j] = 'h';
 }
for $i (1..$n) {
  for $j (1..$m) {
    # Determine the maximum of scores.
    # Check vertically and horizontally first.
    $up = $s[$i - 1][$j] + $gap;
    $left = $s[$i][$j - 1] + $gap;
    if ($up > $left) {
      $s[$i][$j] = $up;
    } else {
      $s[$i][$j] = $left;
    }
    $delta = $v[$i] eq $h[$j] ? $match : $mismatch;
    $diag = $s[$i - 1][$j - 1] + $delta;
    # Using the statement modifier form of if.:
    $s[$i][$j] = $diag if $diag > $s[$i][$j];
    # Collect and save backtracking pointers,
    # for scores equal to the maximum.
    $b[$i][$j] = "";
    $b[$i][$j] .= 'v' if $s[$i][$j] == $up;
    $b[$i][$j] .= 'h' if $s[$i][$j] == $left;
    $b[$i][$j] .= 'd' if $s[$i][$j] == $diag;
  }
} 
# Display the dynamic programming table if the
# debug option is requested.
# Also practice with string multiplication and
# old-fashioned drawing with characters.
if ($debug) {
  print "Dynamic programming table:\n";
  print ' ' x 4 . '|';
  for $l (@h) { print "   $l  |"; }
  print "\n";
  $line0 = "----+" . ('-' x 6 . '+') x ($m + 1);
  print "$line0\n";
  for $i (0..$n) {
    $line1 = "    |";
    $line2 = " $v[$i]  |";
    for $j (0..$m) {
      $up = $b[$i][$j] =~ /v/ ? "^" : " ";
      $left = $b[$i][$j] =~ /h/ ? "<" : " ";
      $diagonal = $b[$i][$j] =~ /d/ ? "\\" : " ";
      $line1 .= "$diagonal   $up |";
      $line2 .= sprintf "%s %3d |",$left,$s[$i][$j];
    }
    print "$line1\n";
    print "$line2\n";
    print "$line0\n";
  }
}
# Find the optimal alignment paths.
@lr_end = ($n,$m,"",""); # Each path (i,j,string1,string2)
@done_list = ();
trace_back(@lr_end);
# Print each of the completed alignment paths.
$i = 1;
for $path_ref (@done_list) {
  ($str1,$str2) = @$path_ref;
  print "Alignment $i\n$str1\n$str2\n";
  $i++;
}
# ch7recursive_ga.rb
# Optimal global alignments using a
# recursive function.
def trace_back(path_so_far)
  i,j,str1,str2 = path_so_far
  if !$b[i][j].empty? # If some back-link exists.
    trace_back([i,j - 1,$h[j] + str1,'-' + str2]) \
      if $b[i][j] =~ /h/
    trace_back([i - 1,j,'-' + str1,$v[i] + str2]) \
      if $b[i][j] =~ /v/
    trace_back([i - 1,j - 1,$h[j] + str1,$v[i] + str2]) \
      if $b[i][j] =~ /d/
  else
    $done_list.push([str1,str2])
  end
end
# Scoring constants:
match = 1
mismatch = 0
gap = -1
debug = nil
# See if help is needed.
if ARGV.grep(/^-h$/).size > 0 or
   ARGV.size - ARGV.grep(/^-[hd]$/).size < 2
  print <<HELP_MESSAGE
Usage: ch7recursive_ga.rb [-options] sequence_1 sequence_2
Options:
  -d  Print out the scoring matrix for debugging.
  -h  Print out this help message.
HELP_MESSAGE
  exit 1
end
# Get the input sequences and
# decide if the scoring matrix should be printed.
i,$h,$v = 0,[],[]
ARGV.each do |argument|
  if argument == '-d'
    debug = 1
  elsif i == 0
    $h = (" " + argument).split("")
    i = 1
  else
    $v = (" " + argument).split("")
  end
end
# Figure the dimensions of the scoring matrix.
n = $v.size - 1
m = $h.size - 1
# Prepare the alignment scoring matrix 
# and the backward links matrix during the
# same pass from the upper, left to lower,
# right corners of the dynamic programming
# table.
# Prepare the dynamic programming table.
# First build s and b as arrays of rows.
# While adding each row to the array or
# list of rows, fill in the left-most
# column.
$s = (0..n).inject([]) {|s,e| s << [e * gap]}
$b = (0..n).inject([]) {|b,e| b << [e == 0?"":'v']}
(1..m).each do |j| # Top row
  $s[0][j] = j * gap
  $b[0][j] = 'h'
end
(1..n).each do |i|
  (1..m).each do |j|
    # Determine the maximum of scores.
    # Check vertically and horizontally first.
    up = $s[i - 1][j] + gap
    left = $s[i][j - 1] + gap
    if up > left
      $s[i][j] = up
     else
      $s[i][j] = left
    end
    delta = $v[i] == $h[j] ? match : mismatch
    diag = $s[i - 1][j - 1] + delta
    # Using the statement modifier form of if.
    $s[i][j] = diag if (diag > $s[i][j])
    # Collect and save backtracking pointers,
    # for scores equal to the maximum.
    $b[i][j] = ""
    $b[i][j] += 'v' if $s[i][j] == up
    $b[i][j] += 'h' if $s[i][j] == left
    $b[i][j] += 'd' if $s[i][j] == diag
  end
end
# Display the dynamic programming table if the
# debug option is requested.
# Also practice with string multiplication and
# old-fashioned drawing with characters.
if debug
  puts "Dynamic programming table:"
  print ' ' * 4 + '|'
  $h.each { |l| print "   #{l}  |"}
  puts
  line0 = "----+" + ('-' * 6 + '+') * (m + 1)
  puts line0
  (0..n).each do |i|
    line1 = "    |"
    line2 = " #{$v[i]}  |"
    (0..m).each do |j|
      up = $b[i][j] =~ /v/ ? "^" : " "
      left = $b[i][j] =~ /h/ ? "<" : " "
      diagonal = $b[i][j] =~ /d/ ? "\\" : " "
      line1 += "#{diagonal}   #{up} |"
      line2 += sprintf "%s %3d |",left,$s[i][j]
    end
    puts line1
    puts line2
    puts line0
  end
end
# Find the optimal alignment paths.
lr_end = [n,m,"",""] # Each path (i,j,string1,string2)
$done_list = []
trace_back(lr_end)
# Print each of the completed alignment paths.
i = 1
$done_list.each do |path|
  str1,str2 = path
  puts "Alignment #{i}\n#{str1}\n#{str2}"
  i += 1
end

   An irb (Interactive Ruby) session to fetch our Chapter 7 practice protein using the NCBI eUtilities.

$ irb
# 1st a module that allows reading a Web page just like a file.
>> require 'open-uri'
>> base_url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/'
=> "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
>> fasta = open(base_url +
               'efetch.fcgi?db=protein&id=NP_005408&rettype=fasta').read
=> ">gi|4885609|ref|NP_005408.1| proto-oncogene tyrosine-protein kinase SRC [Homo sapiens]\n
MGSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQTPSKPASADGHRGPSAAFAPAAAEPKLFGGFNSS\n
DTVTSPQRAGPLAGGVTTFVALYDYESRTETDLSFKKGERLQIVNNTEGDWWLAHSLSTGQTGYIPSNYV\n
APSDSIQAEEWYFGKITRRESERLLLNAENPRGTFLVRESETTKGAYCLSVSDFDNAKGLNVKHYKIRKL\n
DSGGFYITSRTQFNSLQQLVAYYSKHADGLCHRLTTVCPTSKPQTQGLAKDAWEIPRESLRLEVKLGQGC\n
FGEVWMGTWNGTTRVAIKTLKPGTMSPEAFLQEAQVMKKLRHEKLVQLYAVVSEEPIYIVTEYMSKGSLL\n
DFLKGETGKYLRLPQLVDMAAQIASGMAYVERMNYVHRDLRAANILVGENLVCKVADFGLARLIEDNEYT\n
ARQGAKFPIKWTAPEAALYGRFTIKSDVWSFGILLTELTTKGRVPYPGMVNREVLDQVERGYRMPCPPEC\n
PESLHDLMCQCWRKEPEERPTFEYLQAFLEDYFTSTEPQYQPGENL\n\n"
>> open('src.fasta',"w").write(fasta)
=> 632 # Bytes written to the file.

    The parts of the Chou-Fasman algorithm that find and extend either the nucleus of an α helix region or the nucleus of a β strand are quite similar and may be implemented as a single subroutine. When writing a subroutine with several or many parameters it becomes hard to remember the order of the parameters in the subroutine call. Some programming languages solve this by allowing named parameters. Neither Perl nor Ruby have named parameters, but both allow passing a hash as an argument to create a similar effect. There is an obscure syntax rule in Ruby which allows a hash, that is the last argument in the parameter list of a subroutine, to be written without its {} curly-braces. Thus such a subroutine call looks the same whether written in Perl or Ruby. Ruby has an additional idiom that we will mention here because it is ubiquitous in the popular Ruby on Rails and RMagick frameworks. When a 'string' is a name that will be constant, like a hash key, a :symbol is used instead of the string. Notice the two forms are not exactly interchangeable.

$ irb
>> h = {'one' => 1,:one => 'i','four' => 4, :four => 'iv'}
=> {:four=>"iv", "one"=>1, "four"=>4, :one=>"i"}
>> h['one']
=> 1
>> h['one'.to_sym]
=> "i"
>> h[:one.to_s]
=> 1
>> h[:one]
=> "i"
The following Ruby program is similar to the Perl used in the textbook for the programming project. It extends a region by 4 amino acids if each scores above 100, or optionally extends a region if the average score of 4 amino acids is above 100. Command line options like this are nice for the programmer's use during development, but may confuse non-programmer users later. The program also optionally takes secondary structure predictions gleaned from the Cn3D crystal structure viewer and from PSIPRED. The Cn3D crystal results were manually translated from color codes to a text file.
# Cn3D predictions for PDB 1FMK.
# Alignment offset: 84 Pairs: [start-position secondary-structure]
# (Position 1 of PDB 1FMK aligns with Position 85 of GenBank NP_005408.)
84 2 E 14 - 19 E 31 - 36 E 44 - 46 E 54 - 55 E 59 - 68 E 72 H 83 - 89 E 97 -
102 E 128 - 129 E 134 - 142 H 152 - 161 E 165 - 185 E 207 - 208 E 218 -
223 H 234 - 243 E 251 - 252 E 259 - 262 E 266 - 279 H 300 - 310 E 316 -
317 E 323 - 329 x 343 - 360 H 377 - 387 H 397 - 408 H 419 - 428 H 438 - 
The PSIPRED results came from the email sent from the server.
# Predictions for SRC from PSIPRED email.
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
CCCCCCCCCCCCCCCCCCCCCCCCCCEEEEEECCCCCCCCCCCCCCCCCEEEECCCCCCH
HHHHHHCCCCCEEECCCCCCCCCCCCCCCCCCCCCCCHHHHHHHHCCCCCCCCCEEEEEC
CCCCCEEEEEEEECCCCCCCCEEEEEEEECCCCCEEECCCCCCCCHHHHHHHHHHCCCCC
CCCCCCCCCCCCCCCCCCCCCCCCCCHHHEEEEEEECCCCCEEEEECCCCCCEEEEEEEC
CCCCCCHHHHHHHHHHHHHCCCCCEEEEEEEECCCCCEEEEECCCCCCHHHHHHHCCCCC
CCHHHHHHHHHHHHHHHHHHHHCCCCCCCHHHHHEEECCCCEEEECCCCCCEECCCCCEE
ECCCCCCCEECCCHHHHHHCCCCHHHHHHHHHHHHHHHHCCCCCCCCCCCHHHHHHHHHC
CCCCCCCCCCCHHHHHHHHHHHHCCCCCCCCHHHHHHHHHHHHHCCCCCCCCCCCC
# ch7cf.rb
# A program to predict protein structure using the Chou-Fasman algorithm.
# A subroutine for printing the output.
def print_results(seq,regions,cf_turns,xtal_regions,psipred_regions)
  0.step(seq.size - 1,50) do |i|
    puts "#{sprintf "%04d",i + 1} #{'1234567890' * 5}"
    puts "     #{seq[i,50].join}"
    puts "Ruby #{regions[i,50].join}"
    puts "Xtal #{xtal_regions[i,50].join}" if xtal_regions
    puts "PSIP #{psipred_regions[i,50].join}" if psipred_regions
    puts "Turn #{cf_turns[i,50].join}"
  end
end
def print_verbose(seq,regions,score)
  abr = {'A' => 'Ala','R' => 'Arg','N' => 'Asn','D' => 'Asp','C' => 'Cys',
         'Q' => 'Gln','E' => 'Glu','G' => 'Gly','H' => 'His','I' => 'Ile',
         'L' => 'Leu','K' => 'Lys','M' => 'Met','F' => 'Phe','P' => 'Pro',
         'S' => 'Ser','T' => 'Thr','W' => 'Trp','Y' => 'Tyr','V' => 'Val'}
  0.step(seq.size - 1,20) do |i|
    puts "#{sprintf "%04d",i + 1} " +
         "#{(i+1..i+20).map{|j| sprintf "%3d", j}.join}"
    puts "     #{seq[i,20].map{|a| abr[a]}.join}"
    puts "     #{seq[i,20].map{|a| sprintf "%3d",score[a]}.join}"
    puts "      #{regions[i,20].join '  '}"
  end
end
def annotate_regions(seq,params)
  # parameter keys:
  pkeys = [:window_size,:threshold,:min_nucleus,:min_region,
           :score_hash,:other_score_hash,:tag,
           :debug,:extend_by_avg]
  # Give the parameters some local aliases for readability.
  ws,tscore,min_nucleus,min_region,sh,osh,tag,debug,extend_by_avg =
    pkeys.map{|k| params[k]}
  sz = seq.size
  # At the beginning all positions are unknown.
  regions = ['-'] * sz
  i = 0  # Offset of beginning of the window.
  while i <= seq.size - ws
    # Step a: Find a candidate window. 
    if seq[i,ws].select{|aa| sh[aa] > tscore}.size >= min_nucleus
      print "Nucleus of #{tag} at position #{i + 1}, " if debug
      # Step b: Extend left from the shortest nucleus,  and ...
      left = i - 1 + ws - min_nucleus
      until left < 0
        ex_size = left >= 3 ? 4 : left + 1
        lend = left - ex_size + 1
        if ex_size == 4 and not extend_by_avg
          break if seq[lend,ex_size].select{|aa| sh[aa] < 100}.size == 4
        else
          break if seq[lend,ex_size].inject(0) \
                   {|sum,aa| sum + sh[aa]} / ex_size < 100
        end
        left -= 1
      end
      left += 1
      # then extend right, again from the shortest nucleus.
      right = i + min_nucleus
      until right > sz - 1
        ex_size = sz - right > 4 ? 4 : sz - right
        if ex_size == 4 and not extend_by_avg
          break if seq[right,ex_size].select{|aa| sh[aa] < 100}.size == 4
        else
          break if seq[right,ex_size].inject(0) \
                   {|sum,aa| sum + sh[aa]} / ex_size < 100
        end
        right += 1
      end
      right -= 1
      # Step c: Check results.
      puts "extended: #{left + 1} to #{right + 1}" if debug
      region_size = right - left + 1
      suma = seq[left..right].inject(0) {|sum,aa| sum + sh[aa]}
      sumb = seq[left..right].inject(0) {|sum,aa| sum + osh[aa]}
      if suma / region_size > tscore and
         region_size > min_region and
         suma > sumb
        # Record results.
        regions[left..right] = ['X'] * region_size
      end
      i = right
    end
    i += 1
  end
  print_verbose(seq,regions,sh) if debug
  regions
end
# Main program begins.
if ARGV.empty? or ARGV.grep(/^-h/).size == 1
  puts <<HELP
Usage: ruby ch7cf.rb -[h d a] fasta-file [crystal-file PSIPRED-file] 
Options:
  -h Display this help screen and exit.
  -d Display the intermediate results for the alpha helix and beta strands.
  -a Use averaging when extending an alpha helix or beta strand region.
HELP
  exit 0
end
# Check for debugging mode and extending-by-average mode.
debug = false
debug = true if ARGV.grep(/^-d/).size == 1
extend_by_avg = false
extend_by_avg = true if ARGV.grep(/^-a/).size == 1
# Initialize the parameter hashes.
Pa = {};Pb = {};Pt = {};F0 = {};F1 = {};F2 = {};F3 = {}
P_hashes = [Pa,Pb,Pt]
F_hashes = [F0,F1,F2,F3]
# Read in the Chou-Fasman parameters.
lines = File::readlines('choufasman.txt')
lines.each do |l|
  columns = l.split /,/
  aa = columns.shift
  P_hashes.each {|h| h[aa] = columns.shift.to_i}
  F_hashes.each {|h| h[aa] = columns.shift.to_f}
end
# Read the sequence into an array of characters.
infiles = ARGV.grep(/^(?!-[ad])/) 
seq = File::readlines(infiles[0])[1..-1].join.gsub(/[^A-Z]/,"").split("")
puts "Sequence length: #{seq.size}"
alpha_regions = annotate_regions(seq,
                                 :window_size => 6,
                                 :threshold => 103,
                                 :min_nucleus => 4,
                                 :min_region => 6,
                                 :score_hash => Pa,
                                 :other_score_hash => Pb,
                                 :tag => 'alpha helix',
                                 :debug => debug,
                                 :extend_by_avg => extend_by_avg)
beta_regions = annotate_regions(seq,
                                :window_size => 5,
                                :threshold => 105,
                                :min_nucleus => 3,
                                :min_region => 3,
                                :score_hash => Pb,
                                :other_score_hash => Pa,
                                :tag => 'beta strand',
                                :debug => debug,
                                :extend_by_avg => extend_by_avg)
# Identifying the Beta-turns
cf_turns = ['-'] * seq.size
(0..seq.size - 4).each do |i|
  pt = F0[seq[i]] * F1[seq[i + 1]] * F2[seq[i + 2]] * F3[seq[i + 3]]
  pa = seq[i,4].inject(0) {|sum,aa| sum + Pa[aa]}
  pb = seq[i,4].inject(0) {|sum,aa| sum + Pb[aa]}
  pturn = seq[i,4].inject(0) {|sum,aa| sum + Pt[aa]}
  if pt > 0.000075 and pturn / 4.0 > 100 and pturn > pa and pturn > pb
    cf_turns[i,4] = ['T'] * 4
  end
end
# Merge overlaps.
regions = []
i = 0
while i < seq.size
  # Find contiguous sub-regions. 
  a = alpha_regions[i]
  b = beta_regions[i]
  left = i
  i += 1 while alpha_regions[i] == a and beta_regions[i] == b
  # Decide on proper label.
  if a == '-' and b == '-'
    regions += ['-'] * (i - left)
  elsif a == 'X' and b == '-'
    regions += ['H'] * (i - left)
  elsif a == '-' and b == 'X'
    regions += ['E'] * (i - left)
  elsif a == 'X' and b == 'X' and
                     seq[left..i - 1].inject(0){|sum,aa| sum + Pa[aa]} >
                     seq[left..i - 1].inject(0){|sum,aa| sum + Pb[aa]}
    regions += ['H'] * (i - left)
  elsif  a == 'X' and b == 'X' and
                     seq[left..i - 1].inject(0){|sum,aa| sum + Pb[aa]} >
                     seq[left..i - 1].inject(0){|sum,aa| sum + Pa[aa]}
    regions += ['H'] * (i - left)
  else
    regions += ['-'] * (i - left)
  end
end 
# If available add a line to the report with the crystal secondary structures.
# This file is manual, shorthand notes taken from the Cn3D viewer.
if infiles.size >= 2
  xtal_regions = []
  data = File::readlines(infiles[1]).grep(/^[^#]/).join.split(/\s/)
  # Expand the shorthand for the secondary structure of the crystal.
  offset = data.shift.to_i
  str = '-'
  left = 1
  while !data.empty?
    right = data.shift.to_i + offset
    xtal_regions += [str] * (right - left)
    left = right
    str = data.shift
  end
  right = seq.size + 1
  xtal_regions += [str] * (right - left)
else
  xtal_regions = nil
end
# Also if available add a line with PSIPRED's predictions.
# This file was just cut and pasted from the PSIPRED email.
if infiles.size == 3
  data = File::readlines(infiles[2]).grep(/^[^#]/).join.gsub(/[^CHE]/,"")
  data.gsub!(/C/,'-')
  psipred_regions = data.split("")
else
  psipred_regions = nil
end
print_results(seq,regions,cf_turns,xtal_regions,psipred_regions)

TOC

Chapter 8 Molecular Phylogenetics: Measuring Evolution

   For examples of nested hashes, see the sections on earlier chapters. On the surface, the syntax to delete the hash entry for a given key is the same for Perl and Ruby. Just under the surface, there are some important differences.

# ch8delete.pl
# A hash:
%pet = ('one' => 'dog',
        'two' => 'cat',
        'three' => 'rabbit',
        'four' => 'turtle');
print "The keys are: " . join(' ',keys %pet) . "\n";
for $key (keys %pet) { print "Pet $key is $pet{$key}\n"; }
# Deleting a single key-value pair:
delete $pet{'three'};
print "Deleting pet three.\n";
for $key (keys %pet) { print "Pet $key is $pet{$key}\n"; }
# A slice of a hash is an array of values.
print "Pets " . join(' and ',('one','two')) . " are " .
    join(' and ',@pet{'one','two'}) . "\n";
# Deleting 2 key-value pairs:
delete @pet{'one','two'};
print "Deleting pets one and two.\n";
for $key (keys %pet) { print "Pet $key is $pet{$key}\n"; }
# Deleting by value rather than key.
for $key (keys %pet) {
  delete $pet{$key} if $pet{$key} eq 'turtle'; }
print "Deleting pet with the value \"turtle\".\n";
print scalar(keys %pet) . " entries remain in \%pet.\n";
=begin comment
Output:
$ perl ch8delete.pl
The keys are: three one two four
Pet three is rabbit
Pet one is dog
Pet two is cat
Pet four is turtle
Deleting pet three.
Pet one is dog
Pet two is cat
Pet four is turtle
Pets one and two are dog and cat
Deleting pets one and two.
Pet four is turtle

Deleting pet with the value "turtle".
0 entries remain in %pet.
=end comment
=cut
# ch8delete.rb
# A hash:
pet = {'one' => 'dog',
        'two' => 'cat',
        'three' => 'rabbit',
        'four' => 'turtle'}
puts "The keys are: #{pet.keys.join(' ')}"
pet.keys.each {|key| puts "Pet #{key} is #{pet[key]}"}
# Deleting a single key-value pair:
pet.delete('three')
puts "Deleting pet three."
pet.keys.each {|key| puts "Pet #{key} is #{pet[key]}"}
# An array is just another key.
pet[['one','two']] = 'chimera'
puts "Pet #{['one','two']} is #{pet[['one','two']]}"
# Deleting 2 key-value pairs:
pet.reject! {|key,value| ['one','two'].member?(key)}
puts "Deleting pets one and two."
pet.keys.each {|key| puts "Pet #{key} is #{pet[key]}"}
# Deleting by value rather than key.
pet.reject! {|key,value| value == 'turtle'}
#
puts "Deleting pet with the value \"turtle\"."
pet.keys.each {|key| puts "Pet #{key} is #{pet[key]}"}
=begin
Output:
$ ruby ch8delete.rb
The keys are: three two one four
Pet three is rabbit
Pet two is cat
Pet one is dog
Pet four is turtle
Deleting pet three.
Pet two is cat
Pet one is dog
Pet four is turtle
Pet onetwo is chimera
Deleting pets one and two.
Pet four is turtle
Pet onetwo is chimera
Deleting pet with the value "turtle".
Pet onetwo is chimera
=end
#

TOC

Chapter 9 Microarrays and Gene Expression

   From Section 9.4.2, Programming the Solution: Preprocessing, "You already have the Perl programming skills you will need, so only the conceptual steps are provided; it will be up to you to actually write the code." In Chapter 9 some preprocessing was done with a spreadsheet. Both Perl and Ruby have library modules to work directly with Excel spreadsheet data cells. However the next step for an interested bioinformatics student might be to learn a little bit about R. R is a GNU, open-source S with all the power of Excel to work with tabular data and more power than Excel to render statistical models using that data. Both Perl and Ruby have library modules to work directly with R.

   Seldom will you use a single tool as a computer person on a computational biology team. GSE9311_series_matrix.txt is straightforward in that it uses the "#" for all comment lines and has only one table. GSE9311_family.soft is messy in that it has several different comment characters, several sub-tables and the ID order of the rows is not the same as in GSE9311_series_matrix.txt. The following Ruby was designed, written and run in Emacs and generates a script to be run in R.

# parse_call.rb
# A quick Ruby script to find the start and length of the 8 "call" tables
# within the file GSE9311_family.soft and then produce the
# R commands to read them into data frames.
call_file = "GSE9311_family.soft"
r_script = "#{/^(.*)_/.match(call_file)[1]}_calls.r"
open(r_script,"w") do |script|
  script.puts "print(\"Breaking #{call_file} into individual experiments.\")"
  lines = File::readlines(call_file)
  id = ""
  title = ""
  experiments = {}
  lines.each_with_index do |l,i|
    m1 = /^\^SAMPLE\s*=\s*(GSM\d{6})/.match(l)
    id = m1[1] if m1
    m2 = /^!Sample_title\s*=\s*(\S.*?\d)/.match(l)
    title = m2[1] if m2
    if l =~ /^ID_REF/
      experiments[id] = title
      script.puts "print(\"Experiment #{id}: #{title}, " +
                  "begins on line #{i+1}\")"
      script.puts "#{id} <- read.table(\"#{call_file}\",skip=#{i}," +
                  "nrows=22810,header=T,row.names=1,comment.char=\"\")"
    end
  end
  script.puts "print(\"#{call_file} has #{experiments.size} experiments.\")"
  experiments.keys.sort.each do |k|
    script.puts "print(\"Experiment ID #{k} (#{experiments[k]})\")"
  end
end

   You could use this script in R in this way:

> setwd("your-working-directory-for-this-project")
> system("ruby parse_call.rb")
> source("GSE9311_calls.r")

   After running this script from R, we could

  1. Print out the data frame df as a text file and continue in Perl or Ruby to find fold-changes more than 2 standard deviations away from the mean, or
  2. Continue in R to find fold-changes more than 2 standard deviations away from the mean with a script something like this:
    # regulated.r
    # We want to make 1 data frame from the 8 experiments:
    # 2 root-control, 2 root-selenate, 2 shoot-control, 2 shoot-selenate.
    # Each pair of repetitions is combined as follows: 
    # If both repetitions are rated P, use the average value,
    # if 1 repetition is rated P, use that value,
    # otherwise use 0.0 as the value.
    # The order of the rows and the row IDs is sorted to be in
    # the same as the order of the IDs in GSE9311_series_matrix.txt,
    # which is alphabetical.
    # To facilitate this process define a function:
    merged_reps <- function(v1,c1,v2,c2,odr)
                     ifelse(c1 == "P" & c2 == "P",(v1 + v2)/2.0,
                     ifelse(c1 == "P",v1,
                     ifelse(c2 == "P",v2, NA)))[odr]
    # and find the order of any one of the experiment repetitions:
    id_o <- order(row.names(GSM237280))
    rc = merged_reps(GSM237280[,1],GSM237280[,2],GSM237281[,1],GSM237281[,2],id_o)
    rs = merged_reps(GSM237282[,1],GSM237282[,2],GSM237283[,1],GSM237283[,2],id_o)
    sc = merged_reps(GSM237292[,1],GSM237292[,2],GSM237293[,1],GSM237293[,2],id_o)
    ss = merged_reps(GSM237294[,1],GSM237294[,2],GSM237295[,1],GSM237295[,2],id_o)
    # Now compute the log fold changes.
    change.root <- log2(rs/rc)
    change.shoot <- log2(ss/sc)
    mean.change.root <- mean(change.root,na.rm=TRUE)
    cat("mean.change.root:",mean.change.root,"\n")
    mean.change.shoot <- mean(change.shoot,na.rm=TRUE)
    cat("mean.change.shoot:",mean.change.shoot,"\n")
    sd.change.root <- sd(change.root,na.rm=TRUE)
    cat("sd.change.root:",sd.change.root,"\n")
    sd.change.shoot <- sd(change.shoot,na.rm=TRUE)
    cat("sd.change.shoot:",sd.change.shoot,"\n")
    # Finally, select those with a log change of greater than 2 standard deviations.
    up.change.root <- ifelse(change.root >
                             mean.change.root + 2 * sd.change.root,TRUE,FALSE)
    down.change.root <- ifelse(change.root <
                               mean.change.root - 2 * sd.change.root,TRUE,FALSE)
    up.change.shoot <- ifelse(change.shoot > 
                              mean.change.shoot + 2 * sd.change.shoot,TRUE,FALSE)
    down.change.shoot <- ifelse(change.shoot <
                                mean.change.shoot - 2 * sd.change.shoot,TRUE,FALSE)
    df = data.frame(cbind(rc,rs,sc,ss,change.root,change.shoot,
                          up.change.root,down.change.root,
                          up.change.shoot,down.change.shoot),
                    row.names = row.names(GSM237280)[id_o])
    print(df[1:10,])
    
    cat("Number in roots regulated up:",
        length(change.root[up.change.root == TRUE]),"\n")
    cat("Number in roots regulated down:",
        length(change.root[down.change.root == TRUE]),"\n")
    cat("Number in shoots regulated up:",
        length(change.root[up.change.shoot == TRUE]),"\n")
    cat("Number in shoots regulated down:",
        length(change.root[down.change.shoot == TRUE]),"\n")
    cat("Number with both regulated up:",
        length(change.root[up.change.root == TRUE &
                           up.change.shoot == TRUE]),"\n")
    cat("Number with both regulated down:",
        length(change.root[down.change.root == TRUE &
                           down.change.shoot == TRUE]),"\n")
    
    or
  3. Use R to graphically find a few interesting fold-changes by making a plot and clicking on a couple interesting looking points above the 45° line. R labels the mouse clicks and saves a list of the row numbers of the clicks. right-click or Esc is usually used to terminate the identify() function.
    > plot(rc,rs)
    > up.regulated <- identify(rc,rs,labels=row.names(df))
    > df[up.regulated,]
                      rc       rs       sc       ss
    258415_at   10484.30 14884.20 14127.05 10411.65
    259875_s_at  4993.35  9777.85  2684.80  2843.70
    > dev.copy(png) # Copy the plot to the format you see now.
    quartz_off_screen 
                    3 
    > dev.off(3) # Write out the copy in PNG format.
    quartz 
         2 
    
    root control vs. root selenate

TOC

Appendix A Installing Perl and Running a Perl Program

   For Perl on Windows, Active State is an important resource. Ruby has support to get you started on almost any operating system at ruby-lang.org. You will find that interactive Ruby, the irb command prompt, is a helpful learning tool. Students of C and its descendants always begin:

$ irb
irb(main):001:0> print "Hello, World!\n"
Hello, World!

TOC

Appendix B Introduction to Programming and Perl syntax

   After reading Appendix B in St. Clair and Visick, visit "To Ruby from Perl" on the Ruby site.

TOC

Appendix C Regular Expressions

   A good place to begin is the Perl Regular Expression Tutorial, perlretut that you already have on the computer where you installed Perl. Alternatively, you can find perlretut on the Web. The differences between Perl and Ruby are too subtle to cover here.

   Here is a short example of using the matching and substitution operators to search for a pattern in a string of nucleotides.

# nucleo-regex.pl
# Searching with a pattern that contains
# non-specific nucleotide codes.
$str = 'AACGTTCCGGA';
$pattern = 'CNGGW';
$pattern =~ s/N/[ATGC]/g;
$pattern =~ s/W/[AT]/g;
$str =~ /$pattern/;
print "Found $& in $str.\n" if $&;
# nucleo-regex.rb
# Searching with a pattern that contains
# non-specific nucleotide codes.
str = 'AACGTTCCGGA'
pattern = 'CNGGW'
pattern.gsub!(/N/,'[ATGC]')
pattern.gsub!(/W/,'[AT]')
m = /#{pattern}/.match(str)
puts "Found #{m[0]} in #{str}." if m

TOC

Appendix D Ruby on Rails

   First try some of the demos on the Ruby on Rails site. A computational biology team is a very good example of a group where all the users can participate in the construction of their own software tools. There are many frameworks that allow such participatory development. Very few of these have their default parameters correctly set in order to make this an efficient process. Rails is among the best of these few frameworks.

   The examples prepared for this companion handout are not really interactive command line programs, but rather Unix style tool-chain modules with the input taken not from the stdin or keyboard stream of characters, but from the command line arguments. There still are many situations where a custom tool-chain is an efficient solution. An interactive solution is also often a good choice if the user interface is one which is already familiar to your team members -- like a Web browser. Rails will help you build such a solution. (For the moment we will set aside Rail's ability to manage a database to save your data between runs.) Let us rewrite ch1p12.rb as a Web application. The Rails Model-View-Controller framework has a-place-for-everything and by default expects everything-in-its-place. Learning the default pigeon holes is perhaps the steepest part of the learning curve on the way from Ruby to Ruby on Rails. Accepting the guidance of the Rails' defaults allows the framework to do a lot of tedious work for you. Very soon you and your team have a working, rough draft of your solution that you can develop together. Then having climbed the learning curve and understanding where each part of your code belongs, you know exactly where to make corrections and improvements.

   A little digression: To get the most out of the Rails framework, you should understand the object-oriented programming strengths of Ruby. To get started with Rails you do not need to know very much about objects and their methods in Ruby -- you can just accept that the trick that makes Ruby on Rails programming look as easy as typing configuration files is performed with Ruby objects. One thing to understand however, is the @ before a symbol like @dna2. In Perl @variable announces that the variable is an array. In Ruby @variable announces that the variable is an instance variable. In the following example we have a class of objects called DNA Controllers which are a kind of standard Rails class of objects called Application Controllers. Each time a request comes in from a Web browser a new, specific instance of of dna type controllers is created to handle the HTTP request. That instance of dna controller has two variables @dna1 and @dna2 which are only known to itself and its view, opposite.html.erb.

   A few steps to get the Web application up and running. As you can see, the Rails framework is doing a lot of work behind the scenes.

  1. Build the basic structure of directories and default configuration files for our application that we will call opposite.
    $ rails opposite
          create  
          create  app/controllers
          ...
          create  app/views/layouts
          ...
          create  script/generate
          ...
    
  2. Move into the top level directory of our new application.
    $ cd opposite
    
  3. We will call our view "opposite" and the controller, that arranges and presents the view, "dna". First, some more configuration.
    $ ruby script/generate controller dna opposite
          exists  app/controllers/
          exists  app/helpers/
          create  app/views/dna
          ...
          create  app/controllers/dna_controller.rb
          create  app/views/dna/opposite.html.erb
    
  4. Create app/views/layouts/dna.html.erb. This HTML wrapper does not have to be separate from app/views/dna/opposite.html.erb below. This division would be handy if we later wanted to have another view in the dna controller, say a view of the mRNA called "transcript".
    <html>
      <head>
        <title>Chapter 1, the opposite DNA strand</title>
      </head>
      <body>
        <h1>Finding the Opposite DNA Strand</h1>
        <h3>The Example from Chapter 1, page 12</h3>
        <%= yield :layout %>
      </body>
    </html>
    
  5. The shell of app/views/dna/opposite.html.erb was provided by the generate script. Here it is filled in:
    <% form_tag(:action => :opposite) do %>
      <%= text_field_tag(:dna1,params[:dna1],:size => 60) %>
      <%= submit_tag "Replicate" %>
    <% end %>
    <% if !@dna2.blank? %>
    <pre>
      Strand 1: <%= @dna1 %>
      Strand 2: <%= @dna2 %>
    </pre>
    <% end %>
    
  6. app/controllers/dna_controller.rb was also provided as a shell with the "opposite" action function def empty. We will fill it in with code like that from ch1p12.rb.
    class DnaController < ApplicationController
      def opposite
        @dna2 = ""
        if request.post?
          @dna1 = params[:dna1].strip.upcase
          (0...@dna1.length).each do |i|
            nucleo = @dna1[i,1]
            if nucleo == 'A' then @dna2 += "T"
            elsif nucleo == 'C' then @dna2 += "G"
            elsif nucleo == 'G' then @dna2 += "C"
            elsif nucleo == 'T' then @dna2 += "A"
            else @dna2 += "N"
            end
          end
        end
      end
    
    end
    
  7. Finally, fire up a Web server so we can use our application with a Web browser.
    $ ruby script/server
    => Booting Mongrel
    => Rails 2.3.4 application starting on http://0.0.0.0:3000
    
A Web browser view of the application:
Initial screen of the Opposite Web application.
Test screen of the Opposite Web application.

TOC

Appendix E Emacs

   Emacs is an exceptionally good text editor from the Free Software Foundation at www.gnu.org/software/emacs. Emacs runs on most computer platforms including Unix, Linux, OS X and Windows (including Vista.) Emacs is good for many things, including writing and running programs in Perl and Ruby, editing text files of enormous sequences, and preparing publication ready scientific articles with LaTeΧ. Emacs spell-checking helps you write readable, maintainable code. Emacs key-board short cuts are the same as those in the BASH shell, so your fingers only have to learn one set of key-board commands when working on the Linux or OS X command lines.

   You need an editor that handles huge text files, uses regular expressions, has a spell checker, can also be used as a hex-editor, can run a terminal window as a sub-process, completely understands the different text line-terminators (\n\r, \n and \r), and is fully programmable. NotePad has none of these features. Emacs, which has all of these features and more, is a good place to start. As with Perl and Ruby, if you are running Linux or OS X you already have Emacs.

   Emacs is open source software and so you can always build your own copy using the source code. Windows based systems are a little more of a problem. First, unlike the flavors or Unix and OS X, Windows usually does not come with Emacs already installed. Second, compiling from source requires that all the dependent libraries are compiled in the same way or the object code will not link into a working whole program.

   One way to obtain and use Emacs on Windows is to install Cygwin, which will incidentally gives you a better command line interface on which to practice the textbook's examples.

   Another way to obtain and use Emacs on Windows is to manually install a binary specifically for Windows. At the moment that would be Version 23.1. Try downloading and installing the following in order

  1. Emacs 23.1 full binary was extracted to C:\Program Files
  2. Aspell installer was installed in C\Program files\aspell
  3. Aspell English dictionary was also installed in C\Program files\aspell
Finally, tell Emacs which spell-checker to use. Because the concept of home directory varies among versions of Windows, open the .emacs file by C-x C-f ~/.emacs and paste in the following line
(setq-default ispell-program-name "../../aspell/bin/aspell")
and save it with C-x C-s. After quitting Emacs with C-x C-c the next time you start, the spell checker should work.

TOC

Appendix F R

    The R site.

   It is as important to understand the vector in R as it is to understand the list in Lisp and Scheme. The following R script is to remind us that a single number is just a vector of length 1. Also, in expressions with multiple vectors, the shorter vectors are recycled (possibly fractionally) until all the vectors are the same length. Indexing vectors, like [y < 4], can be vectors of logical values. Positions in index vectors begin with 1 and not 0, more like a spreadsheet than a descendant of C. Multiply dimensioned arrays are column major and the first index varies fastest, like in Fortran and not in the descendants of C. As in Perl and Ruby, indexed vectors can appear to the left of the assignment operator.

> x <- 1
> y <- c(5:1)
> z = c(5:26)
> x + y + z
 [1] 11 11 11 11 11 16 16 16 16 16 21 21 21 21 21 26 26 26 26 26 31 31
> y < 4
[1] FALSE FALSE  TRUE  TRUE  TRUE
> x + y[y < 4] + z
 [1]  9  9  9 12 12 12 15 15 15 18 18 18 21 21 21 24 24 24 27 27 27 30
> y[y < 4][1]
[1] 3
> w <- c(1:9)
> w
[1] 1 2 3 4 5 6 7 8 9
> dim(w) <- c(3,3)
> w
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
> i <- array(c(1:3,3:1),dim=c(3,2))
> i
     [,1] [,2]
[1,]    1    3
[2,]    2    2
[3,]    3    1
> w[i]
[1] 7 5 3
> w[i] <- 0
> w
     [,1] [,2] [,3]
[1,]    1    4    0
[2,]    2    0    8
[3,]    0    6    9
Matrices and multidimensional arrays can be made by associating dimensions with a vector.
> # With the example from Section 4.6.3:
> v1 = c(2,3,-1,0)
> v2 = c(1,-3,-2,4)
> # Remember R and Fortran are column major.
> m1 = array(v1,dim=c(2,2))
> m2 = array(v2,dim=c(2,2))
> m1
     [,1] [,2]
[1,]    2   -1
[2,]    3    0
> m2
     [,1] [,2]
[1,]    1   -2
[2,]   -3    4
> # For the case of 2 dimensions we can use matrix() instead of array().
> matrix(c(2,3,-1,0),nrow=2,ncol=2)
     [,1] [,2]
[1,]    2   -1
[2,]    3    0
> # Array multiplication and matrix multiplication are not the same!
> v1[2] * v2[2]
[1] -9
> v1 * v2
[1]  2 -9  2  0
> m1 * m2
     [,1] [,2]
[1,]    2    2
[2,]   -9    0
> array(v1 * v2,dim=c(2,2))
     [,1] [,2]
[1,]    2    2
[2,]   -9    0
> # There is a "special" operator for matrix multiplication.
> m1 %*% m2
     [,1] [,2]
[1,]    5   -8
[2,]    3   -6
PAM250 is PAM1250. What about exponentiation in R?
> # For numbers, exponentiation does what you would expect.
> 2 ** 2
[1] 4
> m = matrix(c(2,3,-1,0),nrow=2,ncol=2)
> m
     [,1] [,2]
[1,]    2   -1
[2,]    3    0
# Since * is a vector operator,
> m * m
     [,1] [,2]
[1,]    4    1
[2,]    9    0
> m ** 3
     [,1] [,2]
[1,]    8   -1
[2,]   27    0
> # and %*% is matrix multiplication,
> m %*% m
     [,1] [,2]
[1,]    1   -2
[2,]    6   -3
> # maybe %**% is what you want?
> m %**% 2
Error: could not find function "%**%"
# But we can make one...
> "%**%" <- function(m,pwr) {
+ mp <- m
+ for (i in 1:(pwr-1)) {mp <- mp %*% m}
+ mp}
> m %**% 2
     [,1] [,2]
[1,]    1   -2
[2,]    6   -3
> m %**% 3
     [,1] [,2]
[1,]   -4   -1
[2,]    3   -6

TOC

Appendix G Git

   The Git site. Git was invented by Linus Torvalds to serve the widely distributed Linux Kernel team. There are a lot of similarities between the way the Linux kernel is updated and the way a genome is annotated. There is no single person who is the best expert in everything. Merging the results of all the distributed experts needs to be able to be done often and easily.

TOC

Appendix H NCBI eUtilities

   Begin by getting and viewing the slides from the last NCBI PowerScripting course. Then try Building Customized Data Pipelines Using the Entrez Programming Utilities (eUtils).

   Today asking for the human HBB gene as NM_000518 will return version NM_000518.4. Here are Perl and Ruby versions of pulling the latest version from NCBI.

#! /usr/bin/perl
# getseq_nucleotide.pl
use LWP::Simple;
if (!@ARGV) {
  print "Usage: getseq_nucleotide.pl accession-number\n";
  exit 1;
}
$url = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/" .
       "efetch.fcgi?" .
       "db=nucleotide" .
       "\&id=$ARGV[0]" .
       "\&rettype=fasta";
@seq = split /^/,get($url);
if ($seq[0] =~ /$ARGV[0]/) {
  open FASTA,">$ARGV[0].fasta" or die "Can't open file: $!";
  print FASTA @seq;
} else {
  print "Problem fetching $ARGV[0].\n";
}
print @seq;
#! /usr/bin/ruby
# getseq_nucleotide.rb
require 'open-uri'
if ARGV.empty?
  puts "Usage: getseq_nucleotide.rb accession-number"
  exit 1
end
url = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/" +
      "efetch.fcgi?" +
      "db=nucleotide" +
      "&id=#{ARGV[0]}" +
      "&rettype=fasta"
seq = open(url).readlines
if seq[0] =~ /#{ARGV[0]}/
  fasta = open("#{ARGV[0]}.fasta","w")
  fasta.write(seq)
else
  puts "Problem fetching #{ARGV[0]}."
end
puts seq

TOC

Appendix I Image Magick

   The Image Magick site. After installing Image Magick, get a few of the libraries so you can use it with your favorite programming language. For Ruby use RMagick and for Perl use PerlMagick.

TOC

Appendix J Unit Testing

   Perl has a larger assortment of modules than Ruby for doing unit testing. In either case it is better to use the modules that actually come with the standard distribution of the language like Test::Harness, Test::Builder and Test::More with Perl and Test::Unit with Ruby. It is easy to forget to add on add-ons. One very nice feature of the Rails framework is that, beginning right from the rails new-team-app command, you start writing unit tests along with your code. Again, it is easy to forget to do what you put off until later.

   As an example, consider the opposite Rails application version of ch1p12.rb, introduced in Appendix D, Ruby on Rails. One of the improvements suggested in Chapter 1 of St. Clair and Visick was checking for characters in the nucleotide string other than A, C, G and T. Once we write the code to do the checking, we would type some test strings to test that our new code actually worked. Suppose we typed a half dozen strings and were satisfied that those six cases covered the important error possibilities and that our code worked well in each case. If we discard those strings we have to retype them -- if we remember them correctly. If we save the strings as unit tests then we or another programmer can easily recheck that our code still works in the event of code changes that may or may not be interrelated.

   In the Rails, Model-View-Controller framework the unit tests are on the data model. So the first step will be to change dna1 and dna2 from being simply strings to being instances of a class "Dna". Objects of class Dna have a property which is its sequence string.

  1. A new Model file is added: app/models/dna.rb
    class Dna
      attr_accessor :sequence
      Alphabet = %w{A C G T}
      
      def initialize
        @sequence = ""
      end
    
      def valid? # Some error checking for the model.
        @sequence =~  /^[#{Alphabet}]+$/i
      end
    end
    
  2. The View file is updated: app/views/dna/opposite.html.erb.
    <% form_tag(:action => :opposite) do %>
      <%= text_field_tag(:dna1,params[:dna1],:size => 60) %>
      <%= submit_tag "Replicate" %>
    <% end %>
    <% if !@dna1.sequence.blank? %>
    <pre>
      Strand 1: <%= @dna1.sequence %>
      Strand 2: <%= @dna2.sequence %>
    </pre>
      <% if !@dna1.valid? %>
        <p>Note:
           The opposite of an unknown nucleotide symbol is a<b>N</b>y.
        </p>
      <% end %>
    <% end %>
    
  3. The Controller file is updated: app/controllers/dna_controller.rb.
    class DnaController < ApplicationController
      def opposite
        @dna1 = Dna.new
        @dna2 = Dna.new
        if request.post?
          @dna1.sequence = params[:dna1].strip.upcase
          (0...@dna1.sequence.length).each do |i|
            nucleo = @dna1.sequence[i,1]
            if nucleo == 'A' then @dna2.sequence += "T"
            elsif nucleo == 'C' then @dna2.sequence += "G"
            elsif nucleo == 'G' then @dna2.sequence += "C"
            elsif nucleo == 'T' then @dna2.sequence += "A"
            else @dna2.sequence += "N"
            end
          end
        end
      end
    end
    
  4. Now some automated tests in a new file: test/unit/dna_test.rb
    require 'test_helper'
    
    class DnaTest < ActiveSupport::TestCase
    
      def setup
        @dna = Dna.new
      end
    
      test "good DNA" do
        @dna.sequence = Dna::Alphabet.join
        assert @dna.valid?, "#{@dna.sequence} contains unknown nucleotide symbols"
      end
    
      test "bad DNA" do
        @dna.sequence = (Dna::Alphabet | [1,2,3]).join
        assert !@dna.valid?, "DNA accepts nucleotide symbols: 1 2 3"
      end
    end
    
  5. Now run the tests.
    $ ruby -I test test/unit/dna_test.rb 
    Loaded suite test/unit/dna_test
    Started
    ..
    Finished in 0.024772 seconds.
    2 tests, 2 assertions, 0 failures, 0 errors
    
  6. Start the Web server and see how the updated application looks.
    $ ruby script/server 
    => Booting Mongrel
    => Rails 2.3.4 application starting on http://0.0.0.0:3000
    
    Testing a nucleotide string with an unknown character.

TOC

Appendix K Abbreviating Variables

   Shorter variable names are easier to type and like other abbreviations are easier to read when repeated frequently. If abbreviations are defined away from the current page or screen, they probably should not be used. Variable names used just in a small block of code often should be abbreviated. Here are examples of abbreviation while saying:

for a-list-of-just-one-variable
  v.manipulation-1 
  v.manipulation-2 
  v.manipulation-3 
  ...

# with.pl
# Perl like Pascal's using or Visual Basic's with.
$a_string = " Four score\n\tand seven  years...  ";
# Idiomatic Perl to normalize whitespaces.
for ($a_string) {
  print $_ . "\n"; # Before.
  s/\s+$//;        # Remove trailing spaces.
  s/^\s+//;        # Remove leading spaces.
  s/\s+/ /g;       # Collapse internal spaces.
  print $_ . "\n"; # After.
}
=begin comment
Output:
$ perl with.pl 
 Four score
	and seven  years...  
Four score and seven years...
=end comment
# with.rb
# Ruby like Pascal's using or Visual Basic's with.
a_string = " Four score\n\tand seven  years...  "
# Ruby equivalent to a_string.strip!.gsub!(/\s+/," ")
[a_string].each do |s|
  print s + "\n"     # Before.
  s.sub!(/\s+$/,"")  # Remove trailing spaces.
  s.sub!(/^\s+/,"")  # Remove leading spaces.
  s.gsub!(/\s+/," ") # Collapse internal spaces.
  print s + "\n"     # After.
end
=begin comment
Output:
$ ruby with.rb 
 Four score
	and seven  years...  
Four score and seven years...
=end comment

TOC

Revised April 22, 2013