Less than two years old, microformats are changing the way we pull semantic information out of the web. Instead of being standards designed by committee, microformats are conventions based on observations about how people actually use the web. Essentially, they are a clever use of both the community process and the class attribute. Did I mention that it's really clever?
Personally, I had never heard of them until I was invited to the microformats birthday party last year. That was about the time that (at Supernova) Yahoo! Local announced they were going to be adding hCard (for people), hCalendar (for events), and hReview (for opinions) to their portal. What did that mean? It was a great day for data-mining engineers and page scrapers. It also meant that I went out and added hCard markup to my own website.
But how many search engines have picked this up? It's not a trivial task; being a community effort, the project has a lot of force but is also likely to change in subtle ways over time. That's where the Alexa Web Search Platform comes in. In a couple of weeks' time, one of our new developers was able to learn how to use the Platform and create a system that extracts hCards from web pages, indexes the data, and uses the same tools that power Alexa Web Search to serve up content as a web service. That's hard to beat! And he was nice enough to write up the process as a tutorial; if you've got an idea for making it better, be our guest!
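To make the "class attribute" idea concrete, here is a minimal sketch of hCard extraction using only Python's standard library. It is not the code from the tutorial and does not use the Alexa Web Search Platform; the names `HCardParser`, `extract_hcards`, `FIELDS`, and `SAMPLE` are made up for illustration, the sample person is fictional, and only a handful of hCard properties (`fn`, `title`, `org`, `locality`) are handled. It also assumes reasonably well-formed HTML, which real crawled pages often are not.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect depth tracking.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}

# A small, illustrative subset of hCard property class names.
FIELDS = {"fn", "title", "org", "locality"}


class HCardParser(HTMLParser):
    """Collects one dict per element whose class list contains 'vcard'."""

    def __init__(self):
        super().__init__()
        self.cards = []
        self._card = None    # the card currently being filled, if any
        self._depth = 0      # element depth inside the current vcard
        self._capture = []   # stack of (field, depth, text_parts)

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._card is None:
            if "vcard" not in classes:
                return
            self._card = {}
            self._depth = 1
        else:
            self._depth += 1
        # Start capturing text for every known property on this element.
        for name in classes:
            if name in FIELDS:
                self._capture.append((name, self._depth, []))

    def handle_data(self, data):
        # Text belongs to every property element that is currently open.
        for _, _, parts in self._capture:
            parts.append(data)

    def handle_endtag(self, tag):
        if self._card is None or tag in VOID:
            return
        # Close any properties opened by the element that is ending.
        while self._capture and self._capture[-1][1] == self._depth:
            name, _, parts = self._capture.pop()
            text = " ".join("".join(parts).split())  # collapse whitespace
            self._card.setdefault(name, text)        # keep the first value
        self._depth -= 1
        if self._depth == 0:
            self.cards.append(self._card)
            self._card = None


def extract_hcards(html):
    """Return a list of dicts, one per hCard found in the HTML string."""
    parser = HCardParser()
    parser.feed(html)
    parser.close()
    return parser.cards


# Fictional sample markup, for illustration only.
SAMPLE = """
<div class="vcard">
  <span class="fn">Jane Example</span>,
  <span class="title">manager</span> at <span class="org">Example Corp</span>
  <span class="adr"><span class="locality">San Francisco</span></span>
</div>
"""

if __name__ == "__main__":
    print(extract_hcards(SAMPLE))
```

The point of the sketch is that the parser never needs to understand the page layout: it only looks for agreed-upon class names, which is what makes the format friendly to scrapers. A real extractor would also have to cope with broken markup, nested cards, and the rest of the hCard vocabulary.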
The hCard search engine is just a sample application and as such doesn't contain all hCards in the Alexa crawl. But it should give you a good indication of the power of microformats and hCards. To give you an idea, here are searches where name is Tantek, title is manager, and city is San Francisco.