Lawrence Technological University
College of Arts and Science
Departments of Mathematics, Computer Sciences and Biology

Handouts

General Data Munging and Computational Biology

by John M. Miller M.D.

  • A little more on why we are here, or what computer science can offer computational biology besides elegant algorithms
  • It should go without saying that to develop with a team some things need to be integrated right from the beginning and not added as afterthoughts:
    1. Comments and documentation
    2. Unit testing
    3. Version control
  • Applying Petroski's "the purpose of design is to obviate failure" to software engineering, there are several things that can help:
    1. Start with pseudo code with occasional diagrams like FSAs
    2. Use built in data structures and data types
    3. Use iterators and code blocks
    4. Use dynamic typing when interfacing with real biology databases
    5. Use open source to demonstrate reproducible research
    6. Use data structures that are understood by natural scientists, in particular, arrays that are multidimensional but not necessarily homogeneous or completely rectangular
    7. Use a null value that is not numeric
    If a language does not make it easy to do these things don't use it. Two code examples:
    1. A counting problem from Tony Berard mentioned earlier.
    2. What will this Java program print? (Inspired by Bloch and Gafter's Java Puzzlers.)
      public class PrintLoop {
        public static final int END = Integer.MAX_VALUE;
        public static final int START = END - 100;
        public static void main(String[] args) {
          int count = 0;
          for (int i = START;i <=END;i++)
            count++;
          System.out.println(count);
        }
      }
      
  • Things your programmer's editor should do:
    1. Spell check
    2. Search and replace with regular expressions
    3. Handle really large files
    4. Have a binary mode that can help juggle files that have been created with different line endings in different operating systems
    5. Have a shell where test output can be merged with code
    6. Be extensible with a scripting language
  • Thoughts on software that meets the requirements above:
  • Basic string processing amenities for the programmer:
  • Idiosyncrasies of the languages we are considering can be understood by observing where they do or do not follow the patterns of Perl.
  • Many programs accept user input from the command line arguments or the keyboard. However, if for example, you wished to take the DNA from a file named on the command line and the strand sense and direction from the keyboard, you need to be careful. When mixing your input sources, be sure that you are reading from standard input when you want keyboard input. Remember that Perl, Python and Ruby inherit behavior from the Unix shell where being able to read from a file that is piped in or named on the command line is considered to be a routine feature used in building tool chains.
  • Being a little less like Perl (than Ruby is) may be an advantage for Python.
  • Much of bioinformatics involves large text files. Considerations for dealing with end-of-line characters:
    1. We are processing text files which are divided into lines.
    2. We may be working on computers with mixed operating systems. Unix lines end "\n", Windows lines end "\r\n", and some embedded systems lines end "\r".
    3. Perl's chomp running on Unix will just remove the "\n".
    4. Ruby's chomp is a little more portable.
    5. Python does not have a chomp function.
    6. rstrip is a reasonable choice for both Ruby and Python.
    7. sub and gsub are helpful when the input sequence may have been manually copied from NCBI and contain line numbers and or internal spaces.
  • In languages like Python and Ruby and Perl there is usually more than one way to do it. Use the method that seems most clear and still has the fewest lines and fewest chances for mistakes.
    $ irb
    > lines = ["line 1","line 2","line 3"]
     => ["line 1", "line 2", "line 3"] 
    > for i in (0...lines.size)
    >   puts lines[i]
    > end
    line 1
    line 2
    line 3
    > for l in lines
    >   puts l
    > end
    line 1
    line 2
    line 3
     => ["line 1", "line 2", "line 3"] 
    > lines.each do |l|
    >   puts l
    > end
    line 1
    line 2
    line 3
     => ["line 1", "line 2", "line 3"] 
    > lines.each {|l|
    >   puts l
    > }
    line 1
    line 2
    line 3
     => ["line 1", "line 2", "line 3"] 
    > allinone = ""
    > for i in (0...lines.size)
    >   allinone += lines[i] 
    >   allinone += ", " unless (i == lines.size-1)
    > end
    > allinone
     => "line 1, line 2, line 3" 
    > lines.join(", ")
     => "line 1, line 2, line 3"
    > a = Array(1..10)
     => [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] 
    > sum = 0 
     => 0 
    > for i in (0...a.size)
    >   sum += a[i]
    > end
    > sum
     => 55 
    > sum = 0
     => 0 
    > for i in a
    >   sum += i
    > end
    > sum
     => 55 
    > a.inject{|sum,i| sum + i}
     => 55 
    > a.reduce(:+)
     => 55 
    
  • R is useful for data munging alone or in combination with other languages.
  • Dynamic interpreted languages can be used in both the intial survey of a data set and in developing a program to transform the data into a format that is ready for detailed analysis.
  • An example of developing a program with a programmer's editor and an open source scripting language in interactive mode:
    1. Open Emacs and use Ctrl-x Ctrl-f to begin our program with some comments and pseudo code:
      #! /usr/bin/env ruby
      # fetch-fasta.rb
      # A program to find a file name on the command line that looks like
      #   AccNumber.fasta and if the file does not exist then fetch the 
      #   required nucleotide sequence from NCBI (using the eUtils) and
      #   save it in the indicated file.
      
    2. Using irb
      >> filename = 'CR541913.fasta'
      filename = 'CR541913.fasta'
      => "CR541913.fasta"
      >> accno = filename.split(".")[0]
      accno = filename.split(".")[0]
      => "CR541913"
      >> # Note the same with regular expressions.
      # Note the same with regular expressions.
      ?> accno = filename.split(/\./)[0]
      accno = filename.split(/\./)[0]
      => "CR541913"
      >> # Not like this.
      # Not like this.
      ?> accno_wrong = filename.split(/./)[0]
      accno_wrong = filename.split(/./)[0]
      => nil
      >> accno_wrong = filename.split(/./)
      accno_wrong = filename.split(/./)
      => []
      >> require 'open-uri'
      ?> url = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/" +
      ?>    "efetch.fcgi?" +
      ?>    "db=nucleotide" +
      ?>    "&id=#{accno}" +
      ?>    "&rettype=fasta"
      => "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=CR541913&rettype=fasta"
      >> seq = open(url).readlines
      => [">gi|49456780|emb|CR541913.1| Homo sapiens full open reading ...",
          "ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAG\n",
          "TTGGTGGTGAGGCCCTGGGCAGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGG\n",
          ...,
          "GCCCTGGCCCACAAGTATCAC\n", "\n"]
      >> if seq[0] =~ /#{accno}/
      if seq[0] =~ /#{accno}/
      >> File.open(filename,"w").write(seq)
      File.open(filename,"w").write(seq)
      >> else
      else
      ?> puts "Problem fetching #{accno}"
      puts "Problem fetching #{accno}"
      >> end
      end
      => 605
      
    3. Add the above to fetch-fasta.rb and manually remove the unwanted lines or run some eLisp like this:
      (defun strip-prompt-shell ()
        "Strip the prompts from the command lines in the region and remove the
      lines that are not commands -- use with Ruby 1.8 in Emacs shell."
        (interactive)
        (if (> (point) (mark)) (exchange-point-and-mark))
        ;; Also strip out the echo of the command from shell.
        (beginning-of-line)
        (while (< (point) (mark))
          (cond ((looking-at "\\s-*\\??[+>]\\{1,2\\}\\s-")
                   (kill-forward-chars (length (match-string 0)))
                   (forward-line))
                  ((looking-at "\\s-*#")
                   (forward-line))
                  (t (kill-line)))))
      
    4. Then changing the program to get the filename from the command line argument, and adding some minimal error checking, you would have something like this:
      #! /usr/bin/env ruby
      # fetch-fasta.rb
      # A program to find a file name on the command line that looks like
      #   AccNumber.fasta and if the file does not exist then fetch the 
      #   required nucleotide sequence from NCBI (using the eUtils) and
      #   save it in the indicated file.
      require 'open-uri'
      if ARGV.empty? or ARGV[0] =~ /^-h/i
        puts <<HELP
        Usage fetch-fasta.rb accession-number.fasta
      HELP
        exit
      end
      filename = ARGV[0]
      accno = filename.split(".")[0]
      if ! accno or accno !~ /^[a-zA-Z]{2}\d{4,7}$/
        puts "#{filename} does not seem to contain an accession number."
        exit
      end
      if File.exists?(filename)
        puts "#{filename} already exists."
        exit
      end
      url = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/" + "efetch.fcgi?" +
            "db=nucleotide" + "&id=#{accno}" + "&rettype=fasta"
      seq = open(url).readlines
      if seq[0] =~ /#{accno}/
        File.open(filename,"w").write(seq)
      else
        puts "Problem fetching #{accno}!"
      end
      
    5. Initial setup and testing:
      $ ls -l fetch-fasta.rb
      -rw-r--r--  1 jay  staff  878 Jan 31 20:51 fetch-fasta.rb
      $ chmod +x fetch-fasta.rb
      $ ls -l fetch-fasta.rb
      -rwxr-xr-x  1 jay  staff  878 Jan 31 20:51 fetch-fasta.rb
      $ ./fetch-fasta.rb 
        Usage fetch-fasta.rb accession-number.fasta
      $ ./fetch-fasta.rb 1234.five
      1234.fivedoes not seem to contain an accession number.
      $ ./fetch-fasta.rb CR541913.fasta
      $ ./fetch-fasta.rb CR541913.fasta
      CR541913.fasta already exists.
      
  • Effective and efficient data munging without regular expressions is too difficult for serious consideration here.Regular expressions in Perl, Python, Ruby and Lisp.

    Revised September 4, 2014