Posts Tagged ‘sensors’

What is a table anyway?

Wednesday, November 18th, 2009

In a previous post, I did a quick analysis of the numeric content of webpages. To no one’s great surprise, web pages containing sensor data often contained a greater amount of numeric data. As nice as it would be to leave it at that, it turns out that analysis is a bit too simple for our purposes. First, a webpage could contain sensor data but be surrounded by large amounts of text (e.g., many weather-related pages). Even if that weren’t the case, it’s possible that a webpage may contain data from multiple data sources separated by text.

So instead of asking “does this webpage contain sensor data?”, it makes more sense to ask “where is the sensor data located on this page?” In order to address this question, we need to make some assumptions. First, it makes sense to assume that sensor data in a webpage is organized and presented in a tabular fashion. However, not all things organized in a tabular fashion will be sensor data, since tables and lists can be used for formatting and decorative purposes. Given this, the main challenge is to identify all the tabular elements and classify them as sensor or non-sensor. So the problem seems straightforward enough: identify all the tables and classify them. But wait… exactly how do we identify a table? Once we have identified it, how do we determine its structure so that we can feed it into our classifier?

Obviously a <table> tag denotes tabular structure. However, this doesn’t cover all the tables we’re interested in. First, it’s possible that sensor data is being uploaded as an XML file without the standard <table> tags. Even in the case that we find an HTML file, there’s a good chance that the sensor data will be embedded in <div> tags (which are differentiated using arbitrary user-defined strings). Even when sensor data is embedded in <table> tags, it’s not clear what constitutes columnar data. In some situations the <td> tag may represent a single column, while in others the column structure might be denoted by spaces or line breaks.

In order to handle all these cases, we start by assuming that all tabular data will exhibit a nested tag structure. For example, a row of sensor data may be embedded between one or more “row” tags, which in turn is usually embedded in some sort of “table” tag. The trick will be figuring out the tabular structure using the nested tag structure while ignoring any potential “decorative” tags (such as <em>).

Instead of figuring out all the possible meanings of these tags (which probably wouldn’t work across different sites), we can try to generate several possible table “interpretations” (by alternately treating tags as decorative or structural) and choose an interpretation that minimizes some value. For example, we may interpret a set of tags as:

Interpretation 1 (2 rows, 2 columns):
“Humidity”  “Temperature”
“30”        “50C”

Interpretation 2 (2 rows, 1 column):
“Humidity Temperature”
“30 50C”

Interpretation 3 (2 rows, 1 and 2 columns):
“Humidity Temperature”
“30”  “50C”

So how do we choose between these possible interpretations? Notice that some interpretations will have a more regular column structure (where rows have the same number of columns) and a more regular spacing structure (cells have the same number of spaces). By measuring and combining these two features, we can assign a regularity index to each table. The lower this index, the more structured the table is.

More formally, the regularity index can be defined by the function:

RI = W_r * S_r + W_c * S_c

where each W is a weight and each S measures the variation in one structural feature (the number of columns per row, or the number of spaces per cell):

S_[Number of Columns per Row | Number of Spaces per Cell]  =  Standard Deviation / Average

After assigning RI to each table, we can then choose the tables with the smallest values. Finally, in the case that multiple tables have the same minimum value, we choose the one that maximizes the number of columns and minimizes the number of rows (to favor compactness).
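A minimal Python sketch of this scoring step follows. The weights (0.75 for columns, 0.25 for spaces), the use of the population standard deviation, and all function names are assumptions made for illustration; the post does not specify them.

```python
from statistics import mean, pstdev

def variation(values):
    """Standard deviation divided by average; 0.0 when the average is 0."""
    avg = mean(values)
    return pstdev(values) / avg if avg else 0.0

def regularity_index(table, w_cols=0.75, w_spaces=0.25):
    """Score a table given as a list of rows, each row a list of cell strings.

    The default weights are assumed values that favor a regular column structure.
    """
    cols_per_row = [len(row) for row in table]
    spaces_per_cell = [cell.count(" ") for row in table for cell in row]
    return w_cols * variation(cols_per_row) + w_spaces * variation(spaces_per_cell)

def best_interpretation(candidates):
    """Lowest index wins; ties go to more columns, then to fewer rows."""
    return min(
        candidates,
        key=lambda t: (regularity_index(t), -max(len(r) for r in t), len(t)),
    )

# The three interpretations from the example above.
interp1 = [["Humidity", "Temperature"], ["30", "50C"]]
interp2 = [["Humidity Temperature"], ["30 50C"]]
interp3 = [["Humidity Temperature"], ["30", "50C"]]

chosen = best_interpretation([interp3, interp2, interp1])
```

With these assumed weights, interpretations 1 and 2 are both perfectly regular and tie on the index, so the tie-break on column count selects interpretation 1, while interpretation 3 scores worse on both features.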

Using this simple method (and choosing weights that favor a regular column structure), we can infer the tabular structure of many different tag elements. For example, the following simple table:

<table>

<tr> <td> Wind Speed </td> <td> Temperature </td> </tr>

<tr> <td> 50 mph </td> <td> 27 C </td> </tr>

</table>

is interpreted as:

row: “Wind Speed”   “Temperature”

row: “50 mph”       “27 C”

Another table with the same interpretation but with a different column syntax:

<table>

<header> Wind Speed: Temperature </header>

<data> 50 mph : 27 C </data>

</table>
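One way the column syntax in a table like this might be recovered is to try several candidate separators and keep the split with the most regular column count. The sketch below is only an illustration in Python: the separator list, the simplified scoring (column regularity only, with ties going to more columns), and the function names are assumptions, not the method actually used here.

```python
import re
from statistics import mean, pstdev

# Row text as it would appear after stripping the <header> and <data> tags.
ROWS = ["Wind Speed: Temperature", "50 mph : 27 C"]

# Hypothetical candidate column separators (None means "do not split").
SEPARATORS = [None, r"\s*:\s*", r"\s+"]

def split_rows(rows, pattern):
    """Split every row into cells using one candidate separator."""
    if pattern is None:
        return [[row.strip()] for row in rows]
    return [[c for c in re.split(pattern, row.strip()) if c] for row in rows]

def column_variation(table):
    """Standard deviation of columns per row divided by the average."""
    counts = [len(row) for row in table]
    return pstdev(counts) / mean(counts)

def pick_split(rows):
    """Prefer the most regular column count; break ties with more columns."""
    candidates = [split_rows(rows, p) for p in SEPARATORS]
    return min(
        candidates,
        key=lambda t: (column_variation(t), -max(len(r) for r in t)),
    )

table = pick_split(ROWS)
```

Splitting on whitespace gives rows of 3 and 5 cells, which is irregular, while splitting on the colon gives two rows of two cells each, matching the interpretation of the first table.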

There are a few technical details regarding corner cases (e.g., single-column tables) that I didn’t cover, and there are still some small issues with respect to missing data values. Overall, though, this technique is relatively robust to decorative tags (users can surround both rows and data elements with these tags) and it seems to produce reasonable results. With this, we can begin comparing all the tables in a document and start performing the actual classification into sensor and non-sensor. However, I haven’t done this yet, so until then please leave thoughts and comments.

Discovering new sensor data

Friday, October 9th, 2009

While David has been working hard on Sensorpedia’s infrastructure, I’ve been thinking about different ways to automate the process of identifying, tagging, and extracting sensor data from the internet. This would be handy for two reasons: 1) we wouldn’t have to spend valuable human time performing a relatively mundane task, and 2) having a sensor crawler would ensure that we discover new sensors as they come online. Overall this is a pretty ambitious task, but to get started I’ve been asking: what is sensor data anyway? For the purpose of this small experiment, I decided sensor data is any numeric data that contains some textual elements that describe that data. This is probably too simple a definition, but it will do for now. Using this simple definition, it should be relatively straightforward to examine the number of numeric characters in a document to determine if a page has “sensor” data.

[Figure 1: known_sensor_data_static_threshold]

[Figure 2: random_data_static_threshold]

In the first figure I took the list of known sensor sources from the Sensorpedia database. The sources were filtered to only include ‘text/html’ and ‘text/plain’ to avoid images, video, etc. For each data source I downloaded the page and graphed the ratio of numeric characters that appear in the main body (excluding any HTML tags and punctuation). For example, if the page contained exactly two characters (an ‘a’ and a ‘5’), then the ratio would be 0.5.
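The ratio described above can be sketched in a few lines of Python. This is an approximation under stated assumptions: tags are stripped with a crude regular expression rather than a real HTML parser, and whitespace is excluded from the character count, which the post does not say either way. The function name is made up for illustration.

```python
import re
import string

def numeric_ratio(html):
    """Fraction of the remaining characters that are digits."""
    # Crude tag stripping; a real crawler would use an HTML parser.
    text = re.sub(r"<[^>]*>", "", html)
    # Assumption: whitespace is ignored along with punctuation.
    chars = [ch for ch in text if not ch.isspace() and ch not in string.punctuation]
    if not chars:
        return 0.0
    return sum(ch.isdigit() for ch in chars) / len(chars)

example = numeric_ratio("<p>a5</p>")
```

For the two-character page from the example (an ‘a’ and a ‘5’), this gives 0.5.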

It’s pretty evident that most of the data sources contained between 30 and 50 percent numeric characters. The only exceptions to this were the first few sources and the very last source. As for the first few sources, I found that they were PHP files that contained images of sensor graphs instead of alphanumeric content (apparently mislabeled in the Sensorpedia database). The last source supposedly contained 100% numeric data (after punctuation removal). This is a little weird since most users would have no way of understanding this data, but presumably somebody is publishing this data for their own benefit. After removing these two extreme groups, we get an average of about 37%.

As for the second figure, I did the exact same thing except I substituted the known sensor sources with 2695 random webpages (I wrote a small Ruby crawler to do this for me). It’s pretty striking how different the figures are. There appear to be two distinct groups of pages. The great majority of the webpages contained less than 1% numeric data. There’s also a smaller group that contains about 20% numeric data. Oddly enough, many of the ones with 20% numeric data seemed to be pointing to some Japanese website discussing weather data. I can’t read Japanese, so I’m not quite sure what it’s all about. Finally, there’s at least one page with nearly 50% numeric data. Upon closer inspection, that extreme page ended up being a UPS page that contained lots of actual data (see screenshot).

[Screenshot: random_with_lots_of_numeric]

Once I graphed this data I wanted to know if a simple threshold test would work to differentiate the two types of webpages. The threshold I used was the average numeric ratio of the known sensor data minus one standard deviation. This excluded the random webpages, but also excluded several of the legitimate sensor sources. Using two standard deviations (the lower brown line) still excluded most of the random pages, but also included all the known sensor data. For a first pass, this test seems to work pretty well!
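The threshold test can be sketched as follows in Python. The ratios in `known` are made-up illustrative values, not the measurements behind the figures, and the use of the sample standard deviation and the function names are assumptions.

```python
from statistics import mean, stdev

def threshold(known_ratios, k=2.0):
    """Average numeric ratio of known sensor pages minus k standard deviations."""
    return mean(known_ratios) - k * stdev(known_ratios)

def looks_like_sensor_page(ratio, known_ratios, k=2.0):
    """True when a page's numeric ratio clears the threshold."""
    return ratio >= threshold(known_ratios, k)

# Illustrative ratios only (assumed, not taken from the Sensorpedia data).
known = [0.30, 0.35, 0.40, 0.45, 0.50]

low_end_strict = looks_like_sensor_page(0.30, known, k=1.0)
low_end_loose = looks_like_sensor_page(0.30, known, k=2.0)
random_page = looks_like_sensor_page(0.01, known, k=2.0)
```

With these sample values, a one-deviation threshold rejects a legitimate page at the low end of the range, while a two-deviation threshold accepts it and still rejects a page with 1% numeric content.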

There’s still a lot of work to do (e.g., differentiating sensor data from any old table of data) and I haven’t even thought about graphs, images, and video… Until then, please send me suggestions (or better yet, results)!

Sensorpedia Milestone: Over 3,000 sensors now registered!

Thursday, July 30th, 2009

I’m happy to announce that we’ve hit a special milestone for Sensorpedia. We now have over 3,000 individual sensors and data sources interfaced with the framework! (3,298 as of Thursday morning July 30th to be exact.) A big thanks to our beta testers and summer interns for all their contributions.

Sensorpedia now contains a wide variety of sensor information. The following tag cloud gives you an idea of what types of data you can find. Much of the data is weather related, coming from International Civil Aviation Organization (ICAO) and National Oceanic and Atmospheric Administration (NOAA) weather stations and buoys. We also have a number of traffic cameras from around the Southeast, including a concentration of cameras in Charleston, SC (thus its prominence in the tag cloud). Other important sensor types monitor seismic activity, energy efficiency, and water levels. Then there are the more obscure sensors like those used for monitoring bacteria levels at beaches. The greatest concentration of sensors is still in the United States, but several sensor systems are also included from Europe, Asia, and other parts of the world!

[Image: Sensorpedia Tag Cloud]

See our Sensorpedia Flickr Group for screenshots of what 3,000 sensors look like in the Sensorpedia web application.

What sensor data will we be adding next? Maybe it’s yours! If you have sensor data that you’d like to incorporate into the Sensorpedia beta, please contact us.

How To: Setup a Sensorpedia Sensing Station (guide 1)

Friday, January 9th, 2009

What is Sensorpedia without sensors? This article is the first in a series that details how to talk to a few different types of sensors, calibrate their data, and interface them with Sensorpedia.

These are the sensors we will be connecting (this guide will focus on reading the LM34 analog temperature sensor):

[Image: sensing_station_sensors_annotated]

One of the preferred data formats for Sensorpedia is GeoRSS. Simply put, a GeoRSS feed contains entries of data readings tagged with a specific time and spatial location. We can serve RSS or GeoRSS feeds to Sensorpedia, but first we need to have data to display in the feed. Since we can’t plug little sensor PCBs straight into the back of a server, we will need a bit of infrastructure to calibrate and relay the data. An ARM microcontroller board designed for the DIY/tinkering crowd called the Make Controller (henceforth referred to as “MC”) will serve as a middleman and communicate with a host server computer and the various sensors. It can poll analog and digital sensors, control TTL, communicate via USB and Ethernet, and even control servos. The MC is a powerful little device.
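To make the idea of a GeoRSS entry concrete, here is a small Python sketch that builds one Atom entry carrying a GeoRSS-Simple point (latitude, then longitude). The reading, coordinates, timestamp, and function name are illustrative assumptions, and this is not necessarily the exact layout Sensorpedia expects.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
GEORSS = "http://www.georss.org/georss"

def georss_entry(title, reading, lat, lon, updated):
    """Return one Atom <entry> with a georss:point as an XML string."""
    ET.register_namespace("atom", ATOM)
    ET.register_namespace("georss", GEORSS)
    entry = ET.Element(f"{{{ATOM}}}entry")
    ET.SubElement(entry, f"{{{ATOM}}}title").text = title
    ET.SubElement(entry, f"{{{ATOM}}}updated").text = updated
    ET.SubElement(entry, f"{{{ATOM}}}summary").text = reading
    # GeoRSS-Simple encodes a point as "latitude longitude".
    ET.SubElement(entry, f"{{{GEORSS}}}point").text = f"{lat} {lon}"
    return ET.tostring(entry, encoding="unicode")

# Hypothetical temperature reading with made-up coordinates and time.
xml_text = georss_entry(
    "LM34 temperature", "72.5 F", 35.93, -84.31, "2009-01-09T12:00:00Z"
)
```

A complete feed would wrap entries like this in a feed element and give each entry a unique id; those parts are left out to keep the sketch short.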
