Tuesday, November 28, 2006

I have a new hobby...

...taking actual traffic history graphs (in green) and overlaying them with Alexa traffic history graphs for the same site (the blue line).

First up: Techcrunch

I'm getting the green graphs from sitemeter, a log analyzer program that leaves the stats viewable by the public. For example, here is the stats page for techcrunch.com.

Techcrunch is ranked 605 in Alexa, so you'd guess that Alexa's data is pretty good. After all, the more traffic a site gets the better Alexa's data will be. As expected, Alexa's traffic history graph for techcrunch correlates well with their actual traffic.

Next up, a site ranked at 38,822, Cox and Forkum.

This one looks pretty good too.

Of course, when I tried to do shorter timeframes the graphs didn't look so great. So I extended the range to 1 year and voila! It matched up nicely.

I haven't had a chance to do this for sites with less traffic yet. I expect the correlation will drop off the further out I go. Of course many other disclaimers apply. But I thought these two were an interesting start.
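For anyone who wants to put a number on "correlates well" instead of eyeballing the overlay, here is a minimal sketch of a Pearson correlation between two traffic series. The two lists are made-up placeholder values, not real Sitemeter or Alexa data, and the names `pearson`, `sitemeter_visits`, and `alexa_reach` are my own for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(var_x * var_y)


# Hypothetical numbers for illustration only: one value per time bucket
sitemeter_visits = [100, 120, 150, 140, 180, 210]
alexa_reach = [1.0, 1.3, 1.4, 1.5, 1.9, 2.0]

r = pearson(sitemeter_visits, alexa_reach)
print(f"correlation: {r:.3f}")
```

A value near 1 means the two lines rise and fall together. Averaging each series into coarser buckets (weeks instead of days, say) before comparing is one way to mimic what stretching the graph to a full year does.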

Comments | Permalink

Wednesday, November 22, 2006

Microformats on a Macro Scale

Less than two years old, microformats are changing the way we pull semantic information out of the web. Instead of being based on standards by committee, microformats are conventions based on observations about how people actually use the web. Essentially, it's a clever use of both the community process and the class attribute. Did I mention that it's really clever?

Personally, I had never heard of them until I was invited to the microformats birthday party last year. That was about the time that (at Supernova) Yahoo! Local announced they were going to be adding hCard (for people), hCalendar (for events), and hReview (for opinions) to their portal. What did that mean? It was a great day for data-mining engineers and page scrapers. It also meant that I went out and added the hCard tags to my own website.

But how many search engines have picked this up? It's not a trivial task; being a community effort, the project has a lot of force but is also likely to change in subtle ways over time. That's where the Alexa Web Search Platform comes in. In a couple of weeks' time, one of our new developers was able to learn how to use the Platform and create a system that extracts hCards from web pages, indexes the data, and uses the same tools that power Alexa Web Search to serve up content as a web service. That's hard to beat! And he was nice enough to write up the process as a tutorial; if you've got an idea for making it better, be our guest!
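To give a feel for what "extracting hCards" involves, here is a rough sketch using only Python's standard library. This is not the tutorial's code and has nothing to do with the Alexa Web Search Platform API; it just shows the class-attribute idea. It assumes well-formed markup, looks for the hCard root class `vcard`, and pulls out only four properties (`fn`, `title`, `org`, `locality`). The names `HCardParser` and `extract_hcards` are made up for this example.

```python
from html.parser import HTMLParser

# The subset of hCard properties this sketch collects (an arbitrary choice)
PROPS = ("fn", "title", "org", "locality")
# Elements that never get a closing tag, so they must not be pushed on the stack
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class HCardParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.cards = []    # finished cards, one dict per hCard
        self._stack = []   # one (is_vcard_root, property_or_None) per open element
        self._card = None  # the card currently being filled, if any

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        classes = set((dict(attrs).get("class") or "").split())
        is_root = self._card is None and "vcard" in classes
        if is_root:
            self._card = {}
        prop = None
        if self._card is not None:
            prop = next((p for p in PROPS if p in classes), None)
            if prop:
                self._card.setdefault(prop, "")
        self._stack.append((is_root, prop))

    def handle_endtag(self, tag):
        if tag in VOID or not self._stack:
            return
        is_root, _ = self._stack.pop()
        if is_root:
            # Collapse runs of whitespace before storing the finished card
            self.cards.append({k: " ".join(v.split()) for k, v in self._card.items()})
            self._card = None

    def handle_data(self, data):
        if self._card is None:
            return
        # Text belongs to the innermost enclosing property element
        for _, prop in reversed(self._stack):
            if prop:
                self._card[prop] += data
                break


def extract_hcards(html):
    parser = HCardParser()
    parser.feed(html)
    parser.close()
    return parser.cards


sample = ('<div class="vcard"><span class="fn">Jane Doe</span>, '
          '<span class="title">Manager</span> at <span class="org">Example Co</span>, '
          '<span class="adr"><span class="locality">San Francisco</span></span></div>')
print(extract_hcards(sample))
```

A real extractor has to cope with messy HTML and the rest of the hCard vocabulary, which is part of why this is not a trivial task.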

The hCard search engine is just a sample application and as such doesn't contain all hCards in the Alexa crawl. But it should give you a good indication of the power of microformats and hCards. To give you an idea, here are searches where name is Tantek, title is manager, and city is San Francisco.

Comments | Permalink

Thursday, November 09, 2006

Traffic on the Long... Long... Tail...

I know I've done this before, but we always seem to get people asking why sites in the long tail are able to jump around tens of thousands, or even hundreds of thousands of places, in the rankings. So here's a fresh shot at explaining the traffic in the long tail.
Take this graph, for example:



 (Note:  the original larger image is no longer available. 2/12/10)



This graph attempts to plot the top 200,000 most popular sites on the Web and show what percent of Web surfers can be expected to visit a site on any given day. I realize that it looks like a blank graph with no data points, but you'll have to look closer. In fact, you might want to break out your magnifying glass and inspect the lower left of the graph.

The most popular site on the Web, Yahoo.com, gets a whopping 28% of all Web visitors, but it is hard to see because it is crammed all the way over there on the left of the graph. Site number 1,000, imagehigh.com, gets an impressive 0.11% of all Web surfers visiting their site on a daily basis and you can catch just a glimpse of it in the lower left of the graph. Yes, that tiny little hooked line hanging out in the corner of the graph represents the only visible data points on the graph.

Moving down the list, site number 100,000, mum.edu, gets 0.00120% of Web surfers visiting their site and can't be seen on the graph because, like 99% of all the sites on this graph, it is vanishingly close to the axis at 0.
Of course, it isn't fair to compare other sites to Yahoo. So I have this other graph. It starts at site number 1,000, imagehigh.com, and continues out to 200,000:






Now it is a bit easier to see those sites that are close to the axis. But you'll notice that the trend hasn't changed. The sites between 100,000 and 200,000 have virtually the same number of visitors going to their site each day. In other words, to move from a rank of 200,000 all the way up to a rank of 100,000, a whopping 100,000 places up the rankings, a site only needs to improve its traffic a slight amount.
That is the long tail in action. That's how it works. You get hundreds of thousands of sites all with similar traffic. If your traffic improves a slight amount you get to move up thousands and thousands of places. Add to that the fact that Alexa, even with millions of users in the panel, is tracking just a sample of Internet users, and not the whole bunch, and you can get some artificial bumps as well.
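Here is a back-of-the-envelope sketch of that arithmetic. It uses the two figures quoted above (site 1,000 at 0.11% and site 100,000 at 0.00120%) and ASSUMES that traffic between and just beyond those ranks follows a simple power law. That curve is a modeling shortcut for illustration, not how the rankings are actually computed, and the names `ALPHA` and `estimated_share` are made up here.

```python
from math import log

# Two data points quoted in the post: (rank, percent of Web users per day)
RANK_A, SHARE_A = 1_000, 0.11
RANK_B, SHARE_B = 100_000, 0.00120

# ASSUMPTION: share(rank) is proportional to rank ** -ALPHA near these ranks.
# Solve for the exponent that passes through both quoted points.
ALPHA = log(SHARE_A / SHARE_B) / log(RANK_B / RANK_A)


def estimated_share(rank):
    """Estimated percent of Web users per day at a given rank (model only)."""
    return SHARE_B * (RANK_B / rank) ** ALPHA


# Absolute gap, in percentage points, between rank 100,000 and rank 200,000
gap = estimated_share(100_000) - estimated_share(200_000)

print(f"fitted exponent: {ALPHA:.3f}")
print(f"estimated share at rank 200,000: {estimated_share(200_000):.5f}%")
print(f"gap between ranks 100,000 and 200,000: {gap:.5f} percentage points")
```

Under that assumption, the 100,000 places between the two ranks are separated by less than a thousandth of a percentage point of Web users, which is the kind of difference that sampling noise alone can produce.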

So, what's the lesson here? The long tail is very long. Very... Long... It keeps going out past 200,000, past a million and keeps going past 18 million sites, the vast majority of which receive no measurable traffic at all. That's the nature of the long tail and traffic distribution on the Web and that's why sites can jump up hundreds of thousands of places in the rankings with little actual change in traffic.
Comments | Permalink

Friday, November 03, 2006

YouTube Goes Flat

I was just playing with some of the new ajaxy buttons and sliders on our traffic graphs and ran across something that raised an eyebrow. Check out this YouTube graph...

It started going flat on October 9th. Anybody care to guess what happened on that date?

Comments | Permalink