What is a table anyway? Comment

11:36 am on November 18, 2009 ,

In a previous post, I did a quick analysis of the numeric content of webpages. To no one’s great surprise, web pages containing sensor data often contained a greater amount of numeric data. As nice as it would be to leave it at that, it turns out that analysis is a bit too simple for our purposes. First, a webpage could contain sensor data but be surrounded by large amounts of text (ie: many weather-related pages). Even if that weren’t the case, it’s possible that a webpage may contain data from multiple data sources separated by text.

So instead of asking: “does this webpage contain sensor data?”, it makes more sense to ask “where is the sensor data located on this page?” In order to address this question,  we need to make some assumptions. First, it makes sense to assume that sensor data in a webpage is organized and presented in a tabular fashion. However, not all things organized in a tabular fashion will be sensor data since tables and lists can be used for formatting and decorative purposes. Given this, the main challenge is to identify all the tabular elements, and classify those into sensor / non-sensor. So the problem seems straightforward enough: identify all the tables and classify them. But wait…exactly how do we identify a table? Once we identified it, how do we determine its structure so that we can feed it into our classifier?

Obviously a <table> tag denotes tabular structure. However, this doesn’t cover all the tables we’re interested in. First, it’s possible that sensor data is being uploaded as an XML file without the standard <table> tags. Even in the case that we find an HTML file, there’s a good chance that the sensor data will be embedded in <div> tags (which are differentiated using arbitrary user-defined strings). Even when sensor data is embedded in <table> tags, it’s not clear what constitutes columnar data. In some situations the <td> tag may represent a single column, while in others the column structure might be denoted by spaces or line breaks.

In order to handle all these cases, we start by assuming that all tabular data will exhibit a nested tag structure. For example, a row of sensor data may be embedded between one or more “row” tags, which in turn is usually embedded in some sort of “table” tag. The trick will be figuring out the tabular structure using the nested tag structure while ignoring any potential “decorative” tags (such as <em>).

Instead of figuring out all the possible meanings of these tags (which probably wouldn’t work across different sites), we can try to generate several possible table “interpretations” (by alternately treating tags as decorative or structural) and choose an interpretation that minimizes some value. For example, we may interpret a set of tags as:

Interpretation 1 (2 rows, 2 columns) Interpretation 2 (2 rows, 1 column) Interpretation 3 (2 rows, 1 and 2 columns)
“Humidity”  “Temperature”
“30″            “50C”
“Humidity Temperature”
“30 50C”
“Humidity Temperature”
“30″     “50C”

So how to choose between these possible interpretations? Notice that some interpretations will have a more regular column structure (where rows have the same number of columns) and a more regular spacing structure (cells have the same number of spaces).  By measuring and combining these two features, we can assign a regularity index to each table. The lower this index, the more structured the table is.

More formally, the regularity index can be defined by the function:

RI = W_r * S_r + W_c * S_c

where W are weights and S is the column and spacing structure:

S_[Number of Columns per Row | Number of Spaces per Cell]  =  Standard Deviation / Average

After assigning RI to each table, we can then choose the tables with the smallest values. Finally, in the case that multiple tables have the same minimum value, we choose the one that maximizes the number of columns and minimizes the number of rows (to favor compactness).

Using this simple method (and choosing weights that favor a regular column structure), we can extrapolate the tabular structure of many different tag elements. For example the following simple table:

<table>

<tr> <td> Wind Speed </td> <td> Temperature </td> </tr>

<tr> <td> 50 mph </td> <td> 27 C </td> </tr>

</table>

is interpreted as:

row: “Wind Speed”       “Temperature”

row: “50 mph”              ” 27 C”

Another table with the same interpretation but with a different column syntax:

<table>

<header> Wind Speed: Temperature </header>

<data> 50 mph : 27 C </data>

</table>

There are a few technical details regarding corner cases (ie: single column tables) that I didn’t cover and there are still some small issues with respect to missing data values. Overall, though, this technique is relatively robust to decorative tags (users can surround both rows and data elements with these tags) and it seems to produce reasonable results. With this, we can begin comparing all the tables in a document and start performing the actual classification into sensor and non-sensor. However, I haven’t done this yet, so until then please leave thoughts and comments.

Leave a Reply

Have something to say? Jump right in!     Formatting

(required)
(required)

Close

Formatting Your Comment

The following XHTML tags are available for use:

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

URLs that start with http:// are automatically converted to hyperlinks.