A discussion of the importance of tabular data to an investigator's research and publications begins in The Visual Display of Quantitative Information and continues through all of Edward Tufte's books. Tufte presents the table as a fundamental building block of graphics, where good graphics is not artistic decoration but a transparently clear exposition of the investigator's reasoning, and bad graphics is simply bad reasoning. However, the techniques in Tufte's volumes are not the issue here.
Larry Wall's Perl and Yukihiro Matsumoto's Ruby share the goal of making the routine easy and the difficult possible. Wall feels his programming tool Perl should "do about what you would expect." Similarly, Matsumoto's Ruby language design "follows the rule of least surprise." Here we will consider some practical examples of gathering, storing, retrieving and presenting data. Perl and Ruby handle these tasks somewhat differently, and the examples should help you decide which of the two tools makes your task at hand easiest.
Relational database tables are another information management tool. If you have not used this kind of table and you generally reach for Excel when you want to "put this stuff in a database," try Digression I.
Some common tasks related to tables.
for-each-row
    for-each-column-in-each-row
        do-something
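The row-by-row, column-by-column pattern above can be sketched in Ruby as two nested iterations. This is only an illustration: the table is a made-up array of arrays, the name `visited` is hypothetical, and "do something" is reduced to recording each entry with its position.

```ruby
# A small in-memory table: an array of rows, each row an array of entries.
# The contents are placeholder data for illustration only.
table = [
  ["row-1-column-1-entry", "row-1-column-2-entry"],
  ["row-2-column-1-entry", "row-2-column-2-entry"]
]

visited = []
table.each_with_index do |row, i|        # for each row
  row.each_with_index do |entry, j|      # for each column in each row
    visited << [i + 1, j + 1, entry]     # do something: record the position and entry
  end
end

visited.each { |i, j, entry| puts "row #{i}, column #{j}: #{entry}" }
```

Any per-entry work (printing, summing, writing to a file) would take the place of the line that appends to `visited`.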
An HTML table on the Web looks something like:
<table>
<tr><td>row-1-column-1-entry<td>row-1-column-2-entry ... </tr>
<tr><td>row-2-column-1-entry<td>row-2-column-2-entry ... </tr>
...
</table>
Whenever you encounter "something like," the pattern-matching language called "regular expressions" is likely to be helpful. Perl and Ruby take slightly different approaches to using regular expressions to scrape each row of a table off the Web. In both languages, pattern matches are "greedy" by default: they match the longest string possible. This is a problem when pairing the beginning and end of the same row in an HTML table, because a greedy ".*" runs past the first "</tr>" all the way to the last one. Appending a "?" makes the ".*" non-greedy.
. matches any character (except a newline, unless the /s modifier in Perl or the /m modifier in Ruby is used).
.* matches 0 → ∞ of any character.
<tr>.*</tr> matches from the beginning of the first row to the end of the last row.
<tr>.*?</tr> matches just 1 row.
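The difference between the greedy and non-greedy patterns can be checked with a short Ruby sketch. The one-line HTML string and the variable names `greedy` and `lazy` are made up for illustration.

```ruby
# A tiny two-row table kept on one line, so "." never has to cross a newline.
html = "<table><tr><td>a<td>b</tr><tr><td>c<td>d</tr></table>"

# Greedy: ".*" grabs as much as it can, so the match ends at the LAST </tr>.
greedy = html[/<tr>.*<\/tr>/]

# Non-greedy: ".*?" grabs as little as it can, so the match ends at the FIRST </tr>.
lazy = html[/<tr>.*?<\/tr>/]

puts greedy  # <tr><td>a<td>b</tr><tr><td>c<td>d</tr>
puts lazy    # <tr><td>a<td>b</tr>
```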
In Perl: Progressively match each row, then match each of the columns in the matched row.
In Ruby: Scan for rows containing groups of columns.
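One way the Ruby approach might look is sketched below with `String#scan`. This is a minimal sketch and not necessarily the author's original code: the sample HTML and the names `rows` and `row_body` are assumptions, and it assumes simple cells that contain no nested tags.

```ruby
# Sample page fragment (placeholder data); the closing </td> tags are omitted,
# as in the table sketch earlier in the text.
html = <<~HTML
  <table>
  <tr><td>row-1-column-1-entry<td>row-1-column-2-entry</tr>
  <tr><td>row-2-column-1-entry<td>row-2-column-2-entry</tr>
  </table>
HTML

# Scan for every row. The /m modifier lets "." match newlines too,
# in case a row is spread over several lines.
rows = html.scan(/<tr>(.*?)<\/tr>/m).flatten.map do |row_body|
  # Within one row, each entry runs from a <td> up to the next tag.
  row_body.scan(/<td>([^<]*)/).flatten.map(&:strip)
end

rows.each { |row| puts row.join(" | ") }
```

For real-world pages with attributes, nested tags, or malformed markup, a dedicated HTML parser is usually a safer choice than regular expressions.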
A digression on Edward Tufte's graphic tables versus Edgar Codd's relational database tables.
| ||Tufte's graphic tables||Codd's relational database tables|
|Purpose:||Show the relationship between neighboring entries.||Allow the accurate selection of a subset of the table's entries.|
|Order of the rows and columns:||Important||Irrelevant|
|Rows form a true set:||No, can have duplicate rows.||Yes, each row is unique.|
|Entries may have multiple values:||Yes.||No, the entry at row i, column j is one single value.|
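The "rows form a true set" distinction can be illustrated loosely in Ruby, using an Array to stand in for a graphic table and a Set to stand in for a relational one. This is an analogy under stated assumptions, not a database: the rows and the names `graphic_rows`, `relation` and `ruby_rows` are made up for illustration.

```ruby
require "set"

# Graphic-style table: row order is meaningful and a repeated row is allowed.
graphic_rows = [
  ["Perl", "Larry Wall"],
  ["Ruby", "Yukihiro Matsumoto"],
  ["Perl", "Larry Wall"]            # duplicate row, kept as-is
]

# Relational-style table: rows form a set, so the duplicate collapses
# and the order of rows carries no meaning.
relation = Set.new(graphic_rows)

puts graphic_rows.length  # 3
puts relation.length      # 2

# Selection is by value, not by position in the table.
ruby_rows = relation.select { |language, _designer| language == "Ruby" }.to_a
```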
Revised June 24, 2009