Posts Tagged ‘research’

Parsing NOAA sources

Monday, January 4th, 2010

For those that prefer Ruby over Java, I have good news: Sensorpedia scripts can now be written using JRuby. It uses the same API as the Java interface (both a positive and negative), so now you’ll be able to parse data sources using the convenience of Ruby regular expressions. In order to test this, I wrote a simple parser for the NOAA data sources (which consist of simple text files). Since a sensor can have multiple scripts associated with it, I also integrated Tables functionality by associating the NOAA sensors with the “Tables” tag (the Tables scripts were already written for the DASMet sources). Here’s a video demonstrating this functionality.

(Direct link: http://blip.tv/file/get/Jhorey-NOAADemo650.ogv)

In the video, the user first creates a pivot table to examine the “Air Temperature”. Not all the nodes have air temperature data, but some of those that do are in pretty cold climes (temperatures are in F). The user then writes a local function that filters nodes that register less than 20 F. Afterwards, the user creates a new pivot table to view which nodes were filtered. Then, the user constructs a collective function to take the average temperature and locations of all the filtered nodes. Finally, the user requests the final averaged data.

There’s still quite a bit of work to do with respect to the scripting layer. The most important piece is probably the key-value store I’m using to store and access all the data from the scripts. Right now it’s pretty fragile, but ultimately I want something that resembles map-reduce (both in terms of functionality and scalability).

What is a table anyway?

Wednesday, November 18th, 2009

In a previous post, I did a quick analysis of the numeric content of webpages. To no one’s great surprise, web pages containing sensor data often contained a greater amount of numeric data. As nice as it would be to leave it at that, it turns out that analysis is a bit too simple for our purposes. First, a webpage could contain sensor data but be surrounded by large amounts of text (e.g., many weather-related pages). Even if that weren’t the case, it’s possible that a webpage may contain data from multiple data sources separated by text.

So instead of asking “does this webpage contain sensor data?”, it makes more sense to ask “where is the sensor data located on this page?” In order to address this question, we need to make some assumptions. First, it makes sense to assume that sensor data in a webpage is organized and presented in a tabular fashion. However, not all things organized in a tabular fashion will be sensor data, since tables and lists can be used for formatting and decorative purposes. Given this, the main challenge is to identify all the tabular elements and classify them as sensor or non-sensor. So the problem seems straightforward enough: identify all the tables and classify them. But wait… exactly how do we identify a table? Once we’ve identified it, how do we determine its structure so that we can feed it into our classifier?

Obviously a <table> tag denotes tabular structure. However, this doesn’t cover all the tables we’re interested in. First, it’s possible that sensor data is being uploaded as an XML file without the standard <table> tags. Even in the case that we find an HTML file, there’s a good chance that the sensor data will be embedded in <div> tags (which are differentiated using arbitrary user-defined strings). Even when sensor data is embedded in <table> tags, it’s not clear what constitutes columnar data. In some situations the <td> tag may represent a single column, while in others the column structure might be denoted by spaces or line breaks.

In order to handle all these cases, we start by assuming that all tabular data will exhibit a nested tag structure. For example, a row of sensor data may be embedded between one or more “row” tags, which in turn is usually embedded in some sort of “table” tag. The trick will be figuring out the tabular structure using the nested tag structure while ignoring any potential “decorative” tags (such as <em>).

Instead of figuring out all the possible meanings of these tags (which probably wouldn’t work across different sites), we can try to generate several possible table “interpretations” (by alternately treating tags as decorative or structural) and choose an interpretation that minimizes some value. For example, we may interpret a set of tags as:

Interpretation 1 (2 rows, 2 columns):
“Humidity”  “Temperature”
“30”  “50C”

Interpretation 2 (2 rows, 1 column):
“Humidity Temperature”
“30 50C”

Interpretation 3 (2 rows, 1 and 2 columns):
“Humidity Temperature”
“30”  “50C”

So how do we choose between these possible interpretations? Notice that some interpretations will have a more regular column structure (where rows have the same number of columns) and a more regular spacing structure (cells have the same number of spaces). By measuring and combining these two features, we can assign a regularity index to each table. The lower this index, the more structured the table is.

More formally, the regularity index can be defined by the function:

RI = W_r * S_r + W_c * S_c

where each W is a weight and each S measures the regularity of either the column structure or the spacing structure:

S_[Number of Columns per Row | Number of Spaces per Cell]  =  Standard Deviation / Average

After assigning RI to each table, we can then choose the tables with the smallest values. Finally, in the case that multiple tables have the same minimum value, we choose the one that maximizes the number of columns and minimizes the number of rows (to favor compactness).
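As a rough illustration (a Python sketch, not the code actually used here), the regularity index and the tie-break rule could be computed as follows. The mapping of W_r/S_r to "columns per row" and W_c/S_c to "spaces per cell", the default weights, the use of the population standard deviation, and treating a zero average as perfectly regular are all assumptions made for the example.

```python
from statistics import mean, pstdev

def variation(values):
    """Standard deviation divided by average; 0.0 when the average is 0."""
    avg = mean(values)
    return pstdev(values) / avg if avg else 0.0

def regularity_index(table, w_r=0.7, w_c=0.3):
    """table is a list of rows; each row is a list of cell strings.
    The default weights are illustrative and favor column regularity."""
    columns_per_row = [len(row) for row in table]
    spaces_per_cell = [cell.count(" ") for row in table for cell in row]
    return w_r * variation(columns_per_row) + w_c * variation(spaces_per_cell)

def best_interpretation(tables):
    """Lowest RI wins; ties go to more columns, then to fewer rows."""
    def key(table):
        widest = max(len(row) for row in table)
        return (regularity_index(table), -widest, len(table))
    return min(tables, key=key)

# The three interpretations of the Humidity/Temperature example.
interp1 = [["Humidity", "Temperature"], ["30", "50C"]]
interp2 = [["Humidity Temperature"], ["30 50C"]]
interp3 = [["Humidity Temperature"], ["30", "50C"]]

chosen = best_interpretation([interp1, interp2, interp3])
```

Under these assumptions, interpretations 1 and 2 are both perfectly regular, so the tie-break picks interpretation 1 because it has more columns; interpretation 3 scores worse on both features.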

Using this simple method (and choosing weights that favor a regular column structure), we can infer the tabular structure of many different tag elements. For example, the following simple table:

<table>

<tr> <td> Wind Speed </td> <td> Temperature </td> </tr>

<tr> <td> 50 mph </td> <td> 27 C </td> </tr>

</table>

is interpreted as:

row: “Wind Speed”       “Temperature”

row: “50 mph”              “27 C”

Another table with the same interpretation but with a different column syntax:

<table>

<header> Wind Speed: Temperature </header>

<data> 50 mph : 27 C </data>

</table>

There are a few technical details regarding corner cases (e.g., single-column tables) that I didn’t cover, and there are still some small issues with respect to missing data values. Overall, though, this technique is relatively robust to decorative tags (users can surround both rows and data elements with these tags) and it seems to produce reasonable results. With this, we can begin comparing all the tables in a document and start performing the actual classification into sensor and non-sensor. However, I haven’t done this yet, so until then please leave thoughts and comments.

Virtual Sensor Nodes

Friday, November 13th, 2009

In a previous blog post, I discussed the idea of a virtual scripting layer that operated over the Sensorpedia data sources. Each data source, instead of just exposing links to data, will also expose a data-oriented programming interface. Users will be able to write and upload scripts that can transform raw data and communicate with external applications. By associating scripts with tags, users will be able to easily share and incorporate functionality into a data source by applying the appropriate tag to the data source. So, for example, if a user has associated an HTML parsing script with the tag HTML, other users can benefit by applying the same tag to their data source.
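To make the tag idea concrete, here is a minimal Python sketch of how scripts might be looked up by tag. The `ScriptRegistry` class, its method names, and the sample script are hypothetical and are not part of Sensorpedia.

```python
from collections import defaultdict

class ScriptRegistry:
    """Hypothetical lookup table from tag names to shared scripts."""

    def __init__(self):
        self._scripts_by_tag = defaultdict(list)

    def register(self, tag, script):
        # A user associates a script with a tag once.
        self._scripts_by_tag[tag].append(script)

    def scripts_for(self, source_tags):
        # A data source picks up every script attached to any of its tags.
        return [script
                for tag in source_tags
                for script in self._scripts_by_tag.get(tag, [])]

def parse_html(raw):
    """Stand-in for an HTML parsing script."""
    return raw.strip()

registry = ScriptRegistry()
registry.register("HTML", parse_html)

# Another user's data source gains the parser just by carrying the tag.
shared = registry.scripts_for(["HTML", "weather"])
```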

[Figure: Script Management]

So what will these scripts actually look like? Although I still haven’t quite decided on the final syntax, I’d like users to write these scripts in multiple languages. That way users comfortable with Java can program in Java while trendier users can write these scripts in Ruby. One way we can accomplish this is by exploiting the many languages that are available for the Java runtime (Java, JRuby, Jython, Clojure, etc.). The only thing we’ll impose is some minor syntax additions, shared across all languages, that define the data-events the script can handle. For now, the scripts kind of look like this:

script ScriptName

dep “DependantData”

event “DependantData”
// Language-specific code

end

Each script defines a set of data-event methods that are invoked whenever data with the appropriate name is published (a publication may be associated with some data value). Data can be published via several sources, including from the network (so that scripts can react to external tools), timer mechanisms (for periodic sampling), or from other methods. Once invoked, methods can, in turn, publish other data elements and thus trigger additional methods. Since the scripts work independently and are loosely coupled, the user will be able to load new scripts without affecting or even knowing about the internal operation of other scripts.
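The publish-and-trigger behavior described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the framework itself: the `VirtualNode` class, the `on`/`publish` method names, and the convention that a handler returns a list of (name, value) pairs to publish next are all assumptions.

```python
from collections import defaultdict, deque

class VirtualNode:
    """Hypothetical container that routes published data to script methods."""

    def __init__(self):
        self._handlers = defaultdict(list)
        self.log = []  # every (name, value) pair that was published

    def on(self, name, handler):
        # Register a data-event method for data published under `name`.
        self._handlers[name].append(handler)

    def publish(self, name, value=None):
        pending = deque([(name, value)])
        while pending:
            event, payload = pending.popleft()
            self.log.append((event, payload))
            for handler in list(self._handlers.get(event, [])):
                # A handler may return more (name, value) pairs to publish.
                pending.extend(handler(payload) or [])

readings = []

def parse_page(text):
    # Turns raw text such as "temp=27.0" into a Temperature publication.
    _, raw = text.split("=")
    return [("Temperature", float(raw))]

def record(temperature):
    readings.append(temperature)

node = VirtualNode()
node.on("RawPage", parse_page)
node.on("Temperature", record)
node.publish("RawPage", "temp=27.0")
```

Neither function knows about the other; they are coupled only through the data names, which is what lets new scripts be loaded without touching existing ones.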

Although there’s still a long way to go with respect to both the specification and implementation, enough has been implemented that I’ve been able to rewrite the Tables demo using this new framework. Instead of hard-wiring Tables to interact with the DASMet tower webpages, Tables interacts with a set of scripts I wrote associated with the Tables and DASMet tags. These scripts are then grouped together as a virtual sensor node. Tables, in turn, communicates with these virtual sensor nodes over the internet using a simple text-based protocol (see Figure). Unlike the old demo, Tables can now interact with any data source (besides DASMet) that applies the Tables tag and includes a script that exposes sensor data.

Obviously this short description is not enough for users to begin writing their own scripts, but hopefully it gives you an idea of where I’d like to go with this. There’s still a lot of work to do, including better session support and better integration with the Sensorpedia database. Also, the syntax is still evolving, and for now users can only write scripts in Java. Once I integrate some other language support, I’ll post another blog entry explaining the more technical details of how these things are written, loaded, and executed.

Programming tools for Sensorpedia

Sunday, October 11th, 2009

Besides automatically discovering new sensor data, I’m also interested in what to do after we get the data into Sensorpedia. Right now users interact with Sensorpedia via the web mapping tool. Users type in a search term that is matched against the title, textual description, or user-supplied tags, and the tool generates a list of geo-indexed placemarkers. This is pretty useful (and cool to view), but there are limitations.

Unfortunately, the data in Sensorpedia is mostly formatted to be consumed by human viewers. Click on one of the placemarkers and you may find an HTML table, an XML file, or a PNG file of a graph. This can make interacting with the actual data a bit difficult. However, if we are able to parse the data, then we may be able to do more interesting and dynamic things with it.

What sort of tool might we use? In previous (and ongoing) research, I developed Tables, a spreadsheet-inspired programming tool for sensor networks. Tables supports flexible data querying, local computation that executes on individual sensor nodes, and collective functions that can aggregate data from multiple nodes. I thought that this may be a useful tool for Sensorpedia, so I began porting a basic version.

I began by writing scripts that parse the pages provided by the DASMet weather towers at ORNL (search for “dasmet”). Next, I wrote a Java interface for a virtual “sensor node” that stores this data and implements an interpreter for the Tables spreadsheet language. Finally, I linked up the Tables interface to this virtual sensor network.

The first video shows different ways of querying the sensor network using a tool called a “pivot table”. The user is instructing the virtual sensor network to display the URLs associated with each DASMet tower along with the unique ID of each virtual sensor node. Next, the user is constructing a more useful query asking for the Temperature values associated with each tower. The video progresses through several more queries that include additional metadata visually organized in different ways. The final query shows the user correlating two different kinds of sensor data.

In the second video, the user constructs a query to view the Temperature data again. Afterwards, the user writes a function that records whether the sensor node has any Temperature values greater than ‘60’. This function is executed on each sensor node and is automatically executed whenever the node gets new Temperature values. Now the user can construct a query to view which sensor nodes exceeded the threshold value. Although our example does this immediately (and so we don’t have any surprises), we could leave the sensor network running longer to accumulate these values.

Afterwards, the user types in another function to average the Temperature values. Unlike the previous function, this function is typed into the “t = 1” sheet. This makes it so that the function collects data from multiple sensor nodes (instead of executing over a single node). Whenever a sensor exceeds the Temperature threshold, it will contribute data to the average.
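For readers who prefer code to video, the local and collective functions can be approximated in plain Python. This is not Tables syntax; the function names, the tower names, the sample readings, and the choice to have each flagged node contribute its latest reading are assumptions made for illustration.

```python
from statistics import mean

def exceeds_threshold(readings, threshold=60.0):
    """Local function: runs on one node's Temperature values."""
    return any(value > threshold for value in readings)

def collective_average(nodes, threshold=60.0):
    """Collective function: averages over every node that was flagged.
    Each flagged node contributes its most recent reading (an assumption)."""
    contributions = [readings[-1]
                     for readings in nodes.values()
                     if exceeds_threshold(readings, threshold)]
    return mean(contributions) if contributions else None

# Made-up readings for three hypothetical towers.
nodes = {
    "tower_a": [55.0, 62.0],
    "tower_b": [48.0, 51.0],
    "tower_c": [61.0, 66.0],
}

flagged = sorted(name for name, r in nodes.items() if exceeds_threshold(r))
average = collective_average(nodes)
```

Here only tower_a and tower_c exceed the threshold, so only their readings feed the average.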

So we see how, using a combination of pivot tables, local functions, and collective functions, the user can write some interesting and dynamic code that runs over both real and virtual sensor networks. The key, of course, is actually creating the virtual sensor network. Currently Tables only works with the DASMet towers. In the future, I hope to find a scalable way to get Tables to work over more of the Sensorpedia data. This will be difficult since different data sources present data in different ways, but we may be able to extend the virtual sensor node concept so that users can submit “scripts” for each sensor source that convert the native data format to a more structured format.

Well, enough for this post. As always, feel free to leave suggestions, comments, and results…

Discovering new sensor data

Friday, October 9th, 2009

While David has been working hard on Sensorpedia’s infrastructure, I’ve been thinking about different ways to automate the process of identifying, tagging, and extracting sensor data from the internet. This would be handy for two reasons: 1) we wouldn’t have to spend valuable human time performing a relatively mundane task and 2) having a sensor crawler would ensure that we would discover new sensors as they come online. Overall this is a pretty ambitious task, but to get started I’ve been asking: what is sensor data anyway? For the purpose of this small experiment, I decided sensor data is any numeric data that contains some textual elements that describe that data. This is probably too simple of a definition, but it will do for now. Using this simple definition, it should be relatively straightforward to examine the number of numeric characters in a document to determine if a page has “sensor” data.

[Figure 1: known_sensor_data_static_threshold] [Figure 2: random_data_static_threshold]

In the first figure I took the list of known sensor sources from the Sensorpedia database. The sources were filtered to only include ‘text/html’ and ‘text/plain’ to avoid images, video, etc. For each data source I downloaded the page and graphed the ratio of numeric characters that appear in the main body (excluding any HTML tags and punctuation). For example, if the page contained exactly two characters (an ‘a’ and a ‘5’), then the ratio would be 0.5.
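A minimal Python sketch of this ratio is shown below. It is a reconstruction from the description rather than the original script: the regular expression used to strip tags, the decision to ignore whitespace, and the use of ASCII `string.punctuation` (which misses typographic quotes and similar characters) are assumptions.

```python
import re
import string

TAG = re.compile(r"<[^>]+>")  # crude tag stripper, fine for a sketch

def numeric_ratio(page):
    """Fraction of body characters that are digits, after dropping
    tags, whitespace, and ASCII punctuation."""
    text = TAG.sub(" ", page)
    chars = [c for c in text
             if not c.isspace() and c not in string.punctuation]
    if not chars:
        return 0.0
    return sum(c.isdigit() for c in chars) / len(chars)

two_chars = numeric_ratio("a5")
with_markup = numeric_ratio("<p>a, 5!</p>")
```

Both calls give 0.5, matching the two-character example above, since the markup and punctuation in the second input are discarded before counting.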

It’s pretty evident that most of the data sources contained between 30 and 50 percent numeric characters. The only exceptions to this were the first few sources and the very last source. As for the first few sources, I found that they were PHP files that contained images of sensor graphs instead of alphanumeric content (apparently mislabeled in the Sensorpedia database). The last source supposedly contained 100% numeric data (after punctuation removal). This is a little weird since most users would have no way of understanding this data, but presumably somebody is publishing this data for their own benefit. After removing these two extreme groups, we get an average of about 37%.

As for the second figure, I did the exact same thing except I substituted the known sensor sources with 2695 random webpages (I wrote a small Ruby crawler to do this for me). It’s pretty striking how different the figures are. There appear to be two distinct groups of pages. The great majority of the webpages contained less than 1% numeric data. There’s also a smaller group that contains about 20% numeric data. Oddly enough, many of the ones with 20% numeric data seemed to be pointing to some Japanese website discussing weather data. I can’t read Japanese, so I’m not quite sure what it’s all about. Finally, there’s at least one page with nearly 50% numeric data. Upon closer inspection, that extreme page ended up being a UPS page that contained lots of actual data (see screenshot).

[Screenshot: random_with_lots_of_numeric]

Once I graphed this data I wanted to know if a simple threshold test would work to differentiate the two types of webpages. The threshold I used was the average numeric ratio of the known sensor data minus one standard deviation. This excludes the random webpages, but also excludes several of the legitimate sensor sources. Using two standard deviations (the lower brown line) still excluded most of the random pages while including all the known sensor data. For a first pass, this test seems to work pretty well!
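The threshold test itself is easy to sketch in Python. The ratios below are invented purely to show the mechanics and are not the measured values from the figures; the function names and the use of the sample standard deviation are also assumptions.

```python
from statistics import mean, stdev

def sensor_threshold(known_ratios, deviations=2.0):
    """Average numeric ratio of known sensor pages minus k deviations."""
    return mean(known_ratios) - deviations * stdev(known_ratios)

def looks_like_sensor_page(ratio, threshold):
    return ratio >= threshold

# Invented ratios standing in for the known sensor sources.
known = [0.30, 0.35, 0.40, 0.45]

strict = sensor_threshold(known, deviations=1.0)
loose = sensor_threshold(known, deviations=2.0)

# A mostly-text page with about 1% numeric characters fails either test.
text_page_passes = looks_like_sensor_page(0.01, loose)
```

With these made-up inputs the one-deviation threshold rejects the lowest known source, while the two-deviation threshold accepts all of them, which mirrors the trade-off described above.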

There’s still a lot of work to do (e.g., differentiating sensor data from any old table of data) and I haven’t even thought about graphs, images, and video… Until then, please send me suggestions (or better yet, results)!