Lawrence Technological University
College of Arts and Science
Department of Mathematics and Computer Sciences

Handouts

A handout that is under construction!

Perl, Ruby and Tabular Data:
Examples of using scripting languages to automate repetitive tasks

by John M. Miller, M.D.

   A discussion of the importance of tabular data to an investigator's research and publications begins in The Visual Display of Quantitative Information and continues through all of Edward Tufte's books. Tufte presents the table as a fundamental building block of graphics, where good graphics is not artistic decoration but transparently clear exposition of the investigator's reasoning, and bad graphics is simply bad reasoning. However, the techniques in Tufte's volumes are not the issue here. Larry Wall's Perl and Yukihiro Matsumoto's Ruby share the goal of making the routine easy and the difficult possible. Wall feels his programming tool Perl should "do about what you would expect." Similarly, Matsumoto's Ruby language design "follows the rule of least surprise." Here we will consider some practical examples of gathering, storing, retrieving and presenting data. Perl and Ruby differ in places, and these examples should help you decide which of these tools makes your task at hand easiest.

   Relational database tables are another information management tool. If you have not used this kind of table and you generally reach for Excel when you want to "put this stuff in a database," try Digression I.

   Some common tasks related to tables.

  1. Accessing the piece of data at row i, column j. Points important to understand at the outset:
  2. "Scraping" the data you need off of an HTML table on the Web. When you use the View Source function of your Web browser you will see that an HTML table looks something like this:
    <table>
      <tr><td>row-1-column-1-entry<td>row-1-column-2-entry... </tr>
      <tr><td>row-2-column-1-entry<td>row-2-column-2-entry ... </tr>
      ...
    </table>
    
    Whenever you encounter "something like," the pattern matching language called "regular expressions" is likely to be helpful. Using regular expressions to scrape each row of a table off the Web is done slightly differently in Perl and Ruby. In both languages the pattern matches are "greedy" by default and match the longest string possible. This is a problem when pairing the beginning and end of the same row in an HTML table. A "?" is used to make the ".*" non-greedy.
    . matches any single character (by default, any character except a line break).
    .* matches zero or more of any character.
    <tr>.*</tr> matches from the beginning of the first row to the end of the last row.
    <tr>.*?</tr> matches just one row.
    To allow for a single table row that is written over more than one HTML source line, both Perl and Ruby let the ".", which already matches any other character, also match the line break character. In Perl you modify the pattern with an "s" and in Ruby you modify the pattern with an "m".
    In Perl:
    Progressively match each row.
      Match each of the columns in the matched row.
    
    In Ruby:
    Scan for rows containing groups of columns.
    
  3. Retrieving data from relational database tables. Listed from the hardest to the easiest. Use the easiest method available. Speed is largely irrelevant here.
  4. Database-backed Web sites to hold the working tables for your research.
  5. Non-relational database access with tools like NCBI's Entrez Programming Utilities.
  6. Publication of tables with LaTeΧ, PostScript and .pdf files.
  7. Improving tables with sparklines.
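
   The scraping steps in item 2 and the row i, column j lookup in item 1 can be sketched in Ruby. This is a minimal sketch, not the handout's original code: the sample page, the constant names SAMPLE_PAGE, ROW and CELL, and the method scrape_table are all hypothetical, and the cell pattern assumes the unclosed <td> style of the sample table in item 2.

```ruby
# Hypothetical sample page, written in the same style as the HTML in item 2.
# The second row is deliberately split over two source lines.
SAMPLE_PAGE = <<~HTML
  <table>
    <tr><td>alpha<td>1</tr>
    <tr><td>beta
        <td>2</tr>
  </table>
HTML

# Non-greedy row pattern; the "m" modifier lets "." match line breaks.
ROW = /<tr>.*?<\/tr>/m
# Assumption: a cell runs from <td> up to the next <td> or the closing </tr>.
CELL = /<td>(.*?)(?=<td>|<\/tr>)/m

# Returns the table as an array of rows, each an array of cell strings.
def scrape_table(html)
  html.scan(ROW).map do |row|
    row.scan(CELL).flatten.map(&:strip)
  end
end

table = scrape_table(SAMPLE_PAGE)
i, j = 1, 0          # zero-based: second row, first column
puts table[i][j]     # prints "beta"

# The greedy pattern swallows both rows in a single match.
puts SAMPLE_PAGE.scan(/<tr>.*<\/tr>/m).length   # prints 1
```

   Storing the result as an array of arrays makes the entry at row i, column j a plain table[i][j] lookup. Pages that close each cell with </td> would need a different CELL pattern.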

Digression I

   A digression on Edward Tufte's graphic tables versus Edgar Codd's relational database tables.
Table Type | Graphic | Relational
Purpose | Show the relationship between neighboring entries. | Allow the accurate selection of a subset of the tables' entries.
Order of the rows and columns | Important | Irrelevant
Rows form a true set | No, can have duplicate rows. | Yes, each row is unique.
Entries may have multiple values | Yes. | No, the entry at row i, column j is one single value.
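
   The "true set" and "order" distinctions above can be illustrated with Ruby's Array and Set classes. This is only an analogy under stated assumptions: the rows are made-up values, the constant names GRAPHIC_ROWS and RELATION are hypothetical, and a Set is a rough stand-in for a relational table, not a database.

```ruby
require "set"

# Hypothetical rows, each a [label, value] pair; one row is repeated.
GRAPHIC_ROWS = [["a", 1], ["a", 1], ["b", 2]].freeze

# An Array behaves like a graphic table: duplicates and order are kept.
# A Set behaves like a relational table: each row is unique.
RELATION = Set.new(GRAPHIC_ROWS)

puts GRAPHIC_ROWS.length   # prints 3
puts RELATION.length       # prints 2

# Row order is irrelevant to a set, but not to an array.
puts RELATION == Set.new(GRAPHIC_ROWS.reverse)   # prints true
puts GRAPHIC_ROWS == GRAPHIC_ROWS.reverse        # prints false

# Selecting a subset of rows depends only on their values.
selected = RELATION.select { |_label, value| value > 1 }.to_a
puts selected.inspect      # prints [["b", 2]]
```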

Revised June 24, 2009