You are currently browsing the Sensorpedia blog archives for November, 2009.

What is a table anyway?

Wednesday, November 18th, 2009

In a previous post, I did a quick analysis of the numeric content of webpages. To no one’s great surprise, webpages containing sensor data often contained a greater amount of numeric data. As nice as it would be to leave it at that, that analysis turns out to be a bit too simple for our purposes. First, a webpage could contain sensor data surrounded by large amounts of text (e.g., many weather-related pages). Second, a single webpage may contain data from multiple data sources separated by text.

So instead of asking “does this webpage contain sensor data?”, it makes more sense to ask “where is the sensor data located on this page?” In order to address this question, we need to make some assumptions. First, it makes sense to assume that sensor data in a webpage is organized and presented in a tabular fashion. However, not all things organized in a tabular fashion will be sensor data, since tables and lists can be used for formatting and decorative purposes. Given this, the main challenge is to identify all the tabular elements and classify them as sensor or non-sensor. So the problem seems straightforward enough: identify all the tables and classify them. But wait…exactly how do we identify a table? Once we have identified it, how do we determine its structure so that we can feed it into our classifier?

Obviously a <table> tag denotes tabular structure. However, this doesn’t cover all the tables we’re interested in. First, it’s possible that sensor data is being uploaded as an XML file without the standard <table> tags. Even in the case that we find an HTML file, there’s a good chance that the sensor data will be embedded in <div> tags (which are differentiated using arbitrary user-defined strings). Even when sensor data is embedded in <table> tags, it’s not clear what constitutes columnar data. In some situations the <td> tag may represent a single column, while in others the column structure might be denoted by spaces or line breaks.

In order to handle all these cases, we start by assuming that all tabular data will exhibit a nested tag structure. For example, a row of sensor data may be embedded between one or more “row” tags, which in turn is usually embedded in some sort of “table” tag. The trick will be figuring out the tabular structure using the nested tag structure while ignoring any potential “decorative” tags (such as <em>).

Instead of figuring out all the possible meanings of these tags (which probably wouldn’t work across different sites), we can try to generate several possible table “interpretations” (by alternately treating tags as decorative or structural) and choose an interpretation that minimizes some value. For example, we may interpret a set of tags as:

Interpretation 1 (2 rows, 2 columns):
“Humidity”  “Temperature”
“30”        “50C”

Interpretation 2 (2 rows, 1 column):
“Humidity Temperature”
“30 50C”

Interpretation 3 (2 rows, 1 and 2 columns):
“Humidity Temperature”
“30”  “50C”

So how do we choose between these possible interpretations? Notice that some interpretations will have a more regular column structure (rows have the same number of columns) and a more regular spacing structure (cells have the same number of spaces). By measuring and combining these two features, we can assign a regularity index to each table. The lower this index, the more structured the table is.

More formally, the regularity index can be defined by the function:

RI = W_r * S_r + W_c * S_c

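where the W terms are weights and each S term measures how much one structural feature (the number of columns per row, or the number of spaces per cell) varies across the table: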

S_[Number of Columns per Row | Number of Spaces per Cell]  =  Standard Deviation / Average

After assigning RI to each table, we can then choose the tables with the smallest values. Finally, in the case that multiple tables have the same minimum value, we choose the one that maximizes the number of columns and minimizes the number of rows (to favor compactness).
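The selection rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual implementation: the function names, the weight values (0.7 and 0.3), and the use of the population standard deviation are all choices made here, and an all-zero feature (for example, no spaces in any cell) is treated as perfectly regular.

```python
from statistics import mean, pstdev


def variation(values):
    """Standard deviation divided by the average (0.0 when the average is 0)."""
    avg = mean(values)
    return pstdev(values) / avg if avg else 0.0


def regularity_index(rows, w_cols=0.7, w_spaces=0.3):
    """Weighted sum of the column and spacing variation of one interpretation.

    `rows` is a list of rows, each row being a list of cell strings.
    The weights are illustrative assumptions that favor column regularity.
    """
    cols_per_row = [len(row) for row in rows]
    spaces_per_cell = [cell.count(" ") for row in rows for cell in row]
    return w_cols * variation(cols_per_row) + w_spaces * variation(spaces_per_cell)


def best_interpretation(interpretations):
    """Lowest index wins; ties favor more columns, then fewer rows."""
    def key(rows):
        return (regularity_index(rows), -max(len(r) for r in rows), len(rows))
    return min(interpretations, key=key)


# The three interpretations of the Humidity/Temperature example.
interp1 = [["Humidity", "Temperature"], ["30", "50C"]]
interp2 = [["Humidity Temperature"], ["30 50C"]]
interp3 = [["Humidity Temperature"], ["30", "50C"]]

chosen = best_interpretation([interp3, interp2, interp1])
```

With these assumptions, interpretations 1 and 2 are both perfectly regular, so the tie-break on column count selects interpretation 1, while interpretation 3 is penalized for its uneven rows and cells.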

Using this simple method (and choosing weights that favor a regular column structure), we can infer the tabular structure of many different tag elements. For example, the following simple table:

<table>

<tr> <td> Wind Speed </td> <td> Temperature </td> </tr>

<tr> <td> 50 mph </td> <td> 27 C </td> </tr>

</table>

is interpreted as:

row: “Wind Speed”       “Temperature”

row: “50 mph”              “27 C”

Here is another table that yields the same interpretation but uses a different column syntax:

<table>

<header> Wind Speed: Temperature </header>

<data> 50 mph : 27 C </data>

</table>
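To make the two examples concrete, here is a rough Python sketch of how rows and cells might be pulled out of nested tags while treating some tags as decorative. Everything in it is an assumption made for illustration: the class and function names are hypothetical, a "row" is taken to be any direct child of the outermost tag, and the ":" separator is supplied by hand rather than discovered automatically.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collects cell text per row; a row is any direct child of the outer tag."""

    def __init__(self, decorative=()):
        super().__init__()
        self.decorative = set(decorative)  # tags ignored when finding structure
        self.depth = 0                     # nesting depth of structural tags
        self.buffer = []                   # text seen since the last boundary
        self.rows = []                     # list of rows, each a list of cells

    def _flush(self):
        # Turn the buffered text into one cell, normalizing whitespace.
        text = " ".join(" ".join(self.buffer).split())
        self.buffer = []
        if text and self.rows:
            self.rows[-1].append(text)

    def handle_starttag(self, tag, attrs):
        if tag in self.decorative:
            return
        self._flush()
        self.depth += 1
        if self.depth == 2:                # a new row begins
            self.rows.append([])

    def handle_endtag(self, tag):
        if tag in self.decorative:
            return
        self._flush()
        self.depth -= 1

    def handle_data(self, data):
        if self.depth >= 2:                # only keep text inside a row
            self.buffer.append(data)


def extract_rows(markup, decorative=()):
    parser = RowCollector(decorative)
    parser.feed(markup)
    parser.close()
    return [row for row in parser.rows if row]


def split_cells(rows, separator):
    """Alternative column syntax: split every cell on a separator string."""
    return [[part.strip() for cell in row for part in cell.split(separator)]
            for row in rows]


table_a = """<table>
<tr> <td> Wind Speed </td> <td> Temperature </td> </tr>
<tr> <td> 50 mph </td> <td> 27 C </td> </tr>
</table>"""

table_b = """<table>
<header> Wind Speed: Temperature </header>
<data> 50 mph : 27 C </data>
</table>"""

rows_a = extract_rows(table_a)
rows_b = split_cells(extract_rows(table_b), ":")
```

Under these assumptions both tables reduce to the same two rows of two cells. Passing `decorative={"td"}` to `extract_rows` produces the competing one-column interpretation of the first table, which is the kind of alternative the regularity index is meant to rank.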

There are a few technical details regarding corner cases (e.g., single-column tables) that I didn’t cover, and there are still some small issues with respect to missing data values. Overall, though, this technique is relatively robust to decorative tags (users can surround both rows and data elements with these tags) and it seems to produce reasonable results. With this, we can begin comparing all the tables in a document and start performing the actual classification into sensor and non-sensor. However, I haven’t done this yet, so until then please leave your thoughts and comments.

Virtual Sensor Nodes

Friday, November 13th, 2009

In a previous blog post, I discussed the idea of a virtual scripting layer that operated over the Sensorpedia data sources. Each data source, instead of just exposing links to data, will also expose a data-oriented programming interface. Users will be able to write and upload scripts that can transform raw data and communicate with external applications. By associating scripts with tags, users will be able to easily share and incorporate functionality into a data source by applying the appropriate tag to the data source. So, for example, if a user has associated an HTML parsing script with the tag HTML, other users can benefit by applying the same tag to their data source.

So what will these scripts actually look like? Although I still haven’t quite decided on the final syntax, I’d like users to be able to write these scripts in multiple languages. That way users comfortable with Java can program in Java while trendier users can write these scripts in Ruby. One way we can accomplish this is by exploiting the many languages that are available for the Java runtime (Java, JRuby, Jython, Clojure, etc.). The only thing we’ll impose is some minor syntax additions, shared across all languages, that define the data-events the script can handle.

[Figure: Script Management]

For now, the scripts look roughly like this:

script ScriptName

dep “DependantData”

event “DependantData”
// Language-specific code

end

Each script defines a set of data-event methods that are invoked whenever data with the appropriate name is published (a publication may be associated with some data value). Data can be published from several sources, including the network (so that scripts can react to external tools), timer mechanisms (for periodic sampling), or other methods. Once invoked, methods can, in turn, publish other data elements and thus trigger additional methods. Since the scripts work independently and are loosely coupled, the user will be able to load new scripts without affecting or even knowing about the internal operation of other scripts.
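The publish-and-react behavior described here can be illustrated with a small, self-contained Python sketch. It is not the Sensorpedia scripting syntax or runtime: the `VirtualNode` class, the `on` decorator, and the event names `RawTemperature` and `Temperature` are all hypothetical, chosen only to show how one published data element can trigger a method that publishes another.

```python
from collections import defaultdict, deque


class VirtualNode:
    """Toy event dispatcher: handlers are invoked when named data is published."""

    def __init__(self):
        self.handlers = defaultdict(list)  # data name -> list of handlers
        self.queue = deque()               # pending (name, value) publications

    def on(self, name):
        """Decorator that registers a handler for the data-event `name`."""
        def register(func):
            self.handlers[name].append(func)
            return func
        return register

    def publish(self, name, value=None):
        """Queue a publication; a value is optional."""
        self.queue.append((name, value))

    def run(self):
        """Deliver queued publications until none remain."""
        while self.queue:
            name, value = self.queue.popleft()
            for handler in self.handlers.get(name, ()):
                handler(self, value)


node = VirtualNode()
log = []


@node.on("RawTemperature")
def to_celsius(node, fahrenheit):
    # Reacts to raw data and publishes a derived data element.
    node.publish("Temperature", round((fahrenheit - 32) * 5 / 9, 1))


@node.on("Temperature")
def record(node, celsius):
    # A second, independent handler triggered by the first one's output.
    log.append(celsius)


node.publish("RawTemperature", 212)  # e.g. a value arriving from the network
node.run()
```

The two handlers never call each other directly; they are connected only through the names of the data they publish and consume, which is what allows a new script to be loaded without knowledge of the others.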

Although there’s still a long way to go with respect to both the specification and implementation, enough has been implemented that I’ve been able to rewrite the Tables demo using this new framework. Instead of hard-wiring Tables to interact with the DASMet tower webpages, Tables interacts with a set of scripts I wrote that are associated with the Tables and DASMet tags. These scripts are then grouped together as a virtual sensor node. Tables, in turn, communicates with these virtual sensor nodes over the internet using a simple text-based protocol (see Figure). Unlike the old demo, Tables can now interact with any data source (not just DASMet) that applies the Tables tag and includes a script that exposes sensor data.

Obviously this short description is not enough for users to begin writing their own scripts, but hopefully it gives you an idea of where I’d like to go with this. There’s still a lot of work to do, including better session support and better integration with the Sensorpedia database. Also, the syntax is still evolving, and for now users can only write scripts in Java. Once I integrate support for other languages, I’ll post another blog entry explaining the more technical details of how these scripts are written, loaded, and executed.

Evaluating authentication options for Sensorpedia

Thursday, November 5th, 2009

As we move Sensorpedia from a limited beta (with a sneak peek) to an open beta, we have an important decision to make. We are currently evaluating multiple authentication options for Sensorpedia. Some of the options we are currently evaluating include OpenID, Google Friend Connect, Facebook Connect, and Twitter.

Each of these options has pros and cons. We need to consider both technical issues (ease of implementation, robustness, networking/firewall limitations) and functionality (features, user base, flexibility). Our original plan for Sensorpedia was simply to use OpenID to offload the authentication process by supporting a number of (all?) OpenID providers. That is valuable in and of itself, but we’d still have to develop and maintain all social networking functionality ourselves. Doing so certainly offers a lot of flexibility, but it’s hard to ignore the benefits of leveraging the existing social networking capabilities of APIs from companies like Google, Facebook, and Twitter.

Each of these companies’ offerings has some real strengths. Google Friend Connect is easy to use and has a wide variety of widgets available. It also doesn’t tie you to a specific social network. Facebook Connect is attractive because of its size (currently 300+ million users!) and because you immediately get the benefit of users’ existing social graphs for sharing and managing groups for access control requirements. Twitter is also attractive because of its popularity, its asymmetrical connections, and its use for less personal communication (which probably fits Sensorpedia’s user base better). The situation is further complicated now that Google and Facebook both support OpenID and Facebook even allows you to log in with your Gmail credentials.

So which path should we take? We need a solution that we can implement quickly and that also gives our users the greatest set of features for sharing information, managing their social graph, and supporting data mashups. We also want to keep in mind the desire to open source the Sensorpedia software and make it available to run within an enterprise and on secure networks. (Think Wikipedia / Intellipedia.) Would the social networking functionality provided by Google, Facebook, or Twitter have to be reimplemented anyway for such scenarios? These are all issues that we are discussing internally on the Sensorpedia team. We’d love to get your input as well. Please share your thoughts in the comments or get in touch with us on Twitter, Facebook, or LinkedIn.

As we move forward in this area, we’ll begin consolidating our discussion and documentation on the Sensorpedia Developers Wiki.

We value your input and look forward to hearing from you!