You may have noticed a new look here at Sensorpedia.com. We’ve migrated to new servers and decided to experiment with a new look in the process. We’re still working on updating our WordPress blog template to match, so for now we’re using the simple Pangea theme. We’ve also taken down the “Sneak Peek” application that was available on our old server. Stay tuned for an updated Sensorpedia API and beta application in the near future!
Sensorpedia intern Ashley is featured in this Oak Ridge Institute for Science and Education (ORISE) student intern video.
“Are you ready to join us at Oak Ridge National Laboratory as part of the next generation of scientists?”
You’ve seen the introduction video; now it’s time to get down to business. This episode is for anyone who wonders what Sensorpedia is, how it is innovative, and how it fits into The Internet of Things. Jason Frank films himself and fellow team members David Resseguie, Tim Garvin, and Ashley Dailey in this behind-the-scenes look at Sensorpedia.
It’s hard to believe 10 weeks have already passed, but the summer internships at ORNL wrapped up today. A big thank you is in order for all the hard work this summer from Ashley Dailey, Jason Frank, and Tim Garvin. All three played a critical part in making this a very successful and productive summer for Sensorpedia. I realize that not much has changed on the Sensorpedia web application during the initial beta (alpha?) release. But we haven’t been sitting still either. We’ve actually been using the Sensorpedia software (primarily the backend services) on a number of other projects here at ORNL for our government customers. Ashley, Jason, and Tim have played a big role in incorporating many of the lessons learned from these other projects back into Sensorpedia proper. Stay tuned for several blog posts from the team that I’ve got queued up to explain some of the latest changes. And you won’t want to miss the latest video that’s in the works either…
You’ve been waiting for it… now view the Teaser, Opening Scene, and Opening Credits in this first installment. Jason Frank films himself and fellow team members David Resseguie, Tim Garvin, and Ashley Dailey in this riveting (or use your own cliché word) behind-the-scenes look at Sensorpedia. Enjoy.
In many ways, it was just another day of class. But as I sat there awaiting the final exam of my first graduate computer science course, Advanced Algorithm Analysis and Design, I wondered if my studies had prepared me to succeed at this level. Most of the students surrounding me were PhD students, and their experience in this subject was greater than mine. For a moment my mind flashed back to a time three years earlier. It was the day I told my wife that I wanted to go back to school, starting over in a new subject, to get a Master’s degree in computer science. Then I plotted a course to make it happen.
Unlike the degree that I completed nine years earlier, there were no liberal arts or elective classes in my plan for this new degree; it was all math and computer science. I wasn’t sure how well I would fare since the last math class I had taken was when I was a junior in high school. But I moved forward with my plan, and one by one I received an “A” in all of the undergraduate math and computer science courses that led to me sitting in that classroom surrounded by fellow graduate students. Would this class end in the same manner?
Two weeks later, I was walking across the campus of the Oak Ridge National Laboratory. I was starting my second summer internship with Sensorpedia, so I was heading over to meet up with my mentor and Sensorpedia’s lead developer, David Resseguie. After catching up on what we’d been up to the past several months, David told me that one of my first tasks of the summer was to write a blog post. He said to write about something I had learned this past school year. For those of you who don’t know David, that might sound a little silly… to just write about something that I learned. For those of you who do know David, you know that learning is very important to him. He knows that good ideas can come from a wide variety of sources, and a great way to facilitate that is to continue to learn new things.
For those of you still hanging on to my opening story about my Advanced Algorithms class, I can happily report that my 4.0 GPA in this subject remains intact. So the question was now before me: Of all the things I learned this past school year, what did I feel was worthy of writing about for this blog? A quick look back revealed several possibilities. Should I write about the inner workings of the proof of Rice’s theorem from Theory of Computation class (which gives us the fact that only trivial properties of programs are algorithmically decidable)? Or should I write about the mangled NP-completeness proof for the Hamiltonian Cycle problem (where we transform an instance of Vertex Cover into an instance of Hamiltonian Cycle)? Perhaps the topic should be more programming related. I could write about the nights spent debugging my bipartite network flow program. Or just as easily I could tell of my experiences implementing Unix pipes. Which topic truly deserves the honor?
To answer this question, I will ask one of my own. Why do we ask our kids to clean their rooms? Is it to rid ourselves of the feeling of chaos? Sure. But isn’t another main reason that, as adults, we understand the importance of having the things we need accessible when we need them? We know that the investment of time to clean one’s room pays off. Now, how does this relate to my past semesters of math and computer science studies? As I reflected over these months of study, it dawned on me that the things that I truly learned were things that I had already learned before: the power of preparation, hard work, and in particular, organization. Of all the things that I’ve studied in this journey toward my new degree, it is these simple, timeless qualities that I have again found to be priceless. It’s true that I might not have the raw intelligence of some of my classmates. But I have found that I can make up the difference with these other qualities. I found that if I spent the time necessary to organize complex issues, I could perform at a high level. My investment in organization paid off.
Others also know the power of organization. When Bryan Gorman and David Resseguie started Sensorpedia, they knew the potential of organizing the world’s sensor data. Sensor data was somewhat available, but not organized. One type of sensor data was often incompatible with other similar types, and there was one proprietary format after another. It was hard to access the data you needed when you needed it. Sensorpedia seeks to fix these problems. It seeks to organize complex issues and data sets so that our users can perform high-level tasks. How ironic that my lesson learned has been a core goal of Sensorpedia all along.
I’m happy to announce that an article about Sensorpedia has been published by Sensors Magazine. I’d like to thank my co-author Scott Fairgrieve of Northrop Grumman for his input and development of a translation tool to simplify the registration of standards-based sensor systems that use the OGC Sensor Observation Service (SOS) Interface Standard. Watch for a guest blog post by Scott describing his effort in the near future.
Here is a link to the Sensors Magazine Article:
“Taking a page from social networking sites that offer users the ability to share and manipulate data in novel ways, Sensorpedia allows users to find, share, and use sensor data online.”
I’m thankful for the questions and requests for more information I’ve received since the article was published earlier this week. If you haven’t already done so, please check out the Sensorpedia sneak peek. I know the blog and Twitter feed have been fairly quiet over the last few months. That doesn’t mean there’s been nothing going on with Sensorpedia. In fact, we’ve been very busy. As part of the private beta effort, we’ve been collecting feedback on how to improve the Sensorpedia API and web application. In addition to the beta testing, we’ve also been applying the Sensorpedia concepts and software to other related domains as part of our ongoing work here at ORNL. We’ll be writing more about these individual efforts in the near future. But the key point for now is that we’ve been incorporating the valuable feedback and are preparing to migrate Sensorpedia to an updated API. It’s still Atom-based, but is much more powerful and flexible. The updated API will fully support the Atom Publishing Protocol (AtomPub) and additional querying capabilities. I’m very excited about where we are and the changes we’ve been making. I’ve already used an alpha version of the new services internally on several projects with lots of success.
Because of the particular interest I’ve received regarding technical details on the framework and the new API (thanks @freaklabs, @rafik, @SiliconFarmer and others!), I will plan to post a draft version of the updated API documentation before even completing the development work to incorporate it into the beta web application. When I do, I’ll post links on the blog and @Sensorpedia Twitter account. I’d love to hear your thoughts on the changes and the direction we’re taking with Sensorpedia in general.
Just before leaving for the Christmas holidays, I got some good news! Our application for an Apple iPhone development license was approved. The process took much longer than I had hoped, primarily because a national lab is not a typical “business” and the usual application process didn’t exactly fit. But I’m happy that we’re all set now. I’ve got the SDK installed and will begin actual development soon.
I’m excited to build on the outstanding work that Chris did this past summer in building a prototype Sensorpedia app. Chris focused primarily on submitting observations from the phone. Be sure to click through his presentation on “enabling citizen sensors” and read the related blog post. I’m working now to add a mobile viewer component to allow you to search for and view nearby sensor data.
Hopefully we’ll get the details worked out soon and you’ll find us in the app store! Of course, you’ll be the first to know by following us here on our blog and on Twitter.
For those that prefer Ruby over Java, I have good news: Sensorpedia scripts can now be written using JRuby. It uses the same API as the Java interface (both a positive and negative), so now you’ll be able to parse data sources using the convenience of Ruby regular expressions. In order to test this, I wrote a simple parser for the NOAA data sources (which consist of simple text files). Since a sensor can have multiple scripts associated with it, I also integrated Tables functionality by associating the NOAA sensors with the “Tables” tag (the Tables scripts were already written for the DASMet sources). Here’s a video demonstrating this functionality.
(Direct link: http://blip.tv/file/get/Jhorey-NOAADemo650.ogv)
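To give a flavor of what such a regex-based parser looks like, here is a minimal standalone sketch in plain Ruby. The whitespace-delimited line format and station IDs below are invented for illustration; the actual NOAA files have their own layouts.

```ruby
# Minimal sketch of a regex-based parser for a whitespace-delimited text
# report. The "STATION TEMP_F WIND_MPH" line format and the station names
# are invented for illustration, not an actual NOAA file layout.
LINE = /^(?<station>[A-Z0-9]+)\s+(?<temp>-?\d+(?:\.\d+)?)\s+(?<wind>\d+(?:\.\d+)?)\s*$/

def parse_report(text)
  text.each_line.map { |line| LINE.match(line) }.compact.map do |m|
    { station: m[:station], temp_f: m[:temp].to_f, wind_mph: m[:wind].to_f }
  end
end

sample = <<~TXT
  # station  temp_f  wind_mph
  KTYS  27.0  10
  KOQT  31.5  5
TXT

readings = parse_report(sample)
```

Lines that don’t match the pattern (like the comment header) are simply dropped, which is a convenient property when scraping semi-structured text files.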
In the video, the user first creates a pivot table to examine the “Air Temperature”. Not all the nodes have air temperature data, but some of those that do are in pretty cold climes (temperatures are in F). The user then writes a local function that filters nodes that register less than 20 F. Afterwards, the user creates a new pivot table to view which nodes were filtered. Then, the user constructs a collective function to take the average temperature and locations of all the filtered nodes. Finally, the user requests the final averaged data.
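Outside the scripting layer, the filter-then-aggregate steps shown in the video reduce to something like the following standalone Ruby sketch. The node records and field names here are invented; they do not reflect the actual Sensorpedia scripting API or NOAA data model.

```ruby
# Standalone sketch of the filter-then-aggregate workflow from the demo.
# The node records and field names are invented for illustration.
nodes = [
  { id: "buoy-1", lat: 58.0, lon: -152.1, air_temp_f: 12.4 },
  { id: "buoy-2", lat: 44.6, lon: -124.5, air_temp_f: 48.9 },
  { id: "buoy-3", lat: 60.2, lon: -146.7, air_temp_f: 18.1 },
]

# "Local" step: keep only nodes registering less than 20 F.
cold = nodes.select { |n| n[:air_temp_f] < 20.0 }

# "Collective" step: average temperature and location over the survivors.
def averages(ns)
  {
    temp_f: ns.sum { |n| n[:air_temp_f] } / ns.size,
    lat:    ns.sum { |n| n[:lat] } / ns.size,
    lon:    ns.sum { |n| n[:lon] } / ns.size,
  }
end

avg = averages(cold)
```

The split into a per-node "local" function and a cross-node "collective" function mirrors the two kinds of script steps used in the demo.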
There’s still quite a bit of work to do on the scripting layer. The most important piece is probably the key-value store I’m using to store and access all the data from the scripts. Right now it’s pretty fragile, but ultimately I want something that resembles map-reduce (both in terms of functionality and scalability).
In a previous post, I did a quick analysis of the numeric content of webpages. To no one’s great surprise, web pages containing sensor data often contained a greater amount of numeric data. As nice as it would be to leave it at that, it turns out that analysis is a bit too simple for our purposes. First, a webpage could contain sensor data that is surrounded by large amounts of text (e.g., many weather-related pages). Even if that weren’t the case, it’s possible that a webpage may contain data from multiple data sources separated by text.
So instead of asking “does this webpage contain sensor data?”, it makes more sense to ask “where is the sensor data located on this page?” In order to address this question, we need to make some assumptions. First, it makes sense to assume that sensor data in a webpage is organized and presented in a tabular fashion. However, not all things organized in a tabular fashion will be sensor data, since tables and lists can be used for formatting and decorative purposes. Given this, the main challenge is to identify all the tabular elements and classify them as sensor or non-sensor. So the problem seems straightforward enough: identify all the tables and classify them. But wait… exactly how do we identify a table? Once we’ve identified it, how do we determine its structure so that we can feed it into our classifier?
Obviously a <table> tag denotes tabular structure. However, this doesn’t cover all the tables we’re interested in. First, it’s possible that sensor data is being uploaded as an XML file without the standard <table> tags. Even in the case that we find an HTML file, there’s a good chance that the sensor data will be embedded in <div> tags (which are differentiated using arbitrary user-defined strings). Even when sensor data is embedded in <table> tags, it’s not clear what constitutes columnar data. In some situations the <td> tag may represent a single column, while in others the column structure might be denoted by spaces or line breaks.
In order to handle all these cases, we start by assuming that all tabular data will exhibit a nested tag structure. For example, a row of sensor data may be embedded between one or more “row” tags, which in turn is usually embedded in some sort of “table” tag. The trick will be figuring out the tabular structure using the nested tag structure while ignoring any potential “decorative” tags (such as <em>).
Instead of figuring out all the possible meanings of these tags (which probably wouldn’t work across different sites), we can try to generate several possible table “interpretations” (by alternately treating tags as decorative or structural) and choose an interpretation that minimizes some value. For example, we may interpret a set of tags as:
Interpretation 1: 2 rows, 2 columns
Interpretation 2: 2 rows, 1 column
Interpretation 3: 2 rows, 1 and 2 columns
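As a rough sketch of how a single interpretation can be built, the following plain-Ruby function takes the tag roles as inputs: a caller-chosen pair of tags act as row and cell delimiters, and everything else is stripped as decorative. Varying which tags are treated as structural yields the different interpretations. (The tag roles are assumptions supplied by the caller here, not detected automatically.)

```ruby
# Sketch: build one table "interpretation" by declaring which tags act as
# row and cell delimiters; every other tag is treated as decorative and
# stripped. The tag roles are caller-supplied assumptions, not detected.
def interpret(markup, row_tag:, cell_tag:)
  rows = markup.scan(%r{<#{row_tag}[^>]*>(.*?)</#{row_tag}>}m).map(&:first)
  rows.map do |row|
    cells = row.scan(%r{<#{cell_tag}[^>]*>(.*?)</#{cell_tag}>}m).map(&:first)
    cells = [row] if cells.empty? # no cell tag found: the whole row is one cell
    cells.map { |c| c.gsub(/<[^>]+>/, " ").squeeze(" ").strip }
  end
end

html = "<tr><td><em>Wind Speed</em></td><td>Temperature</td></tr>" \
       "<tr><td>50 mph</td><td>27 C</td></tr>"

# Treating <td> as structural gives a 2x2 reading; treating it as
# decorative (no cell tag matches) gives a 2x1 reading.
two_col = interpret(html, row_tag: "tr", cell_tag: "td")
one_col = interpret(html, row_tag: "tr", cell_tag: "x")
```

Note that the `<em>` tag is ignored in both readings, which is exactly the "decorative" behavior we want.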
So how to choose between these possible interpretations? Notice that some interpretations will have a more regular column structure (where rows have the same number of columns) and a more regular spacing structure (cells have the same number of spaces). By measuring and combining these two features, we can assign a regularity index to each table. The lower this index, the more structured the table is.
More formally, the regularity index can be defined by the function:

RI = W_r * S_r + W_c * S_c

where W_r and W_c are weights and each S term is a coefficient of variation (standard deviation divided by average) of a structural feature:

S_r = StdDev(number of columns per row) / Avg(number of columns per row)
S_c = StdDev(number of spaces per cell) / Avg(number of spaces per cell)
After assigning RI to each table, we can then choose the tables with the smallest values. Finally, in the case that multiple tables have the same minimum value, we choose the one that maximizes the number of columns and minimizes the number of rows (to favor compactness).
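The index computation itself is small. Here is a minimal Ruby sketch; the weight values are arbitrary examples chosen to favor regular column structure, and a real implementation would need the corner-case handling mentioned later.

```ruby
# Sketch of the regularity index RI = W_r * S_r + W_c * S_c, where each S
# term is (standard deviation / average) of a structural feature. The
# weights are arbitrary example values favoring regular column structure.
def cv(xs)
  mean = xs.sum.to_f / xs.size
  return 0.0 if mean.zero?
  variance = xs.sum { |x| (x - mean)**2 } / xs.size
  Math.sqrt(variance) / mean
end

def regularity_index(table, w_r: 0.7, w_c: 0.3)
  s_r = cv(table.map(&:size))                            # columns per row
  s_c = cv(table.flatten.map { |cell| cell.count(" ") }) # spaces per cell
  w_r * s_r + w_c * s_c
end

# A regular 2x2 table scores lower (more table-like) than a ragged one.
regular = [["Wind Speed", "Temperature"], ["50 mph", "27 C"]]
ragged  = [["Wind Speed", "Temperature"], ["50 mph 27 C"]]
```

A perfectly regular table scores 0, and the score grows as rows disagree on column count or cells disagree on spacing, so picking the minimum-RI interpretation favors the most table-like reading.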
Using this simple method (and choosing weights that favor a regular column structure), we can extract the tabular structure of many different tag arrangements. For example, the following simple table:
<tr> <td> Wind Speed </td> <td> Temperature </td> </tr>
<tr> <td> 50 mph </td> <td> 27 C </td> </tr>
is interpreted as:
row: “Wind Speed” | “Temperature”
row: “50 mph” | “27 C”
Another table with the same interpretation but with a different column syntax:
<header> Wind Speed: Temperature </header>
<data> 50 mph : 27 C </data>
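As a self-contained illustration of why both markups can land on the same interpretation, the sketch below treats each top-level tag pair as a row, strips any inner tags as decorative, and splits cell text on either a colon or a run of two-or-more spaces. The delimiter choices are assumptions picked for this demo, not the full interpretation-search method.

```ruby
# Self-contained sketch: each top-level tag pair is a row, inner tags are
# stripped as decorative, and cells split on a colon or 2+ spaces. These
# delimiter choices are assumptions for this demo, not the full method.
def rows_from(markup)
  markup.scan(%r{<(\w+)[^>]*>(.*?)</\1>}m).map do |_tag, body|
    body.gsub(/<[^>]+>/, " ").strip.split(/\s*:\s*|\s{2,}/)
  end
end

html = <<~HTML
  <tr> <td> Wind Speed </td> <td> Temperature </td> </tr>
  <tr> <td> 50 mph </td> <td> 27 C </td> </tr>
HTML

custom = <<~TXT
  <header> Wind Speed: Temperature </header>
  <data> 50 mph : 27 C </data>
TXT
```

Both inputs produce the same two-row, two-column structure, even though one uses `<td>` tags and the other uses colons as its column syntax.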
There are a few technical details regarding corner cases (e.g., single-column tables) that I didn’t cover, and there are still some small issues with respect to missing data values. Overall, though, this technique is relatively robust to decorative tags (users can surround both rows and data elements with these tags), and it seems to produce reasonable results. With this, we can begin comparing all the tables in a document and start performing the actual classification into sensor and non-sensor. However, I haven’t done this yet, so until then, please leave your thoughts and comments.