Update: Some of the numbers got scrambled on my way to the spreadsheet. Updated the graph and the numbers below.
I'm on a jag here. We are still digging around the Web Search Platform and pulling up some stats. It can be addictive.
I managed to pull this graph together without spending any money. I simply used the "Create a Collection" feature of the platform to find out what was in the crawl.
The question was this: What response codes does our crawler get when it tries to crawl the Web?
When the crawler attempts to crawl a document, it is like knocking on a door... is anybody home? Did you move? Did I knock on the wrong door? Did the house disappear?
My methodology was dead-simple. I just used the Create a Collection form on the Web Search Platform to construct queries asking how many documents existed in the Alexa Crawl with various HTTP response codes during the April to May time period. Within a few seconds I had the answer.
In the pie chart above, green represents response code 200 - OK, meaning that it was a successful transaction.
Items in red are in the 5XX class and represent server errors. These could be DNS problems, servers that are down, or other problems. Less than 0.5 percent of all docs were server errors.
Another 6.5 percent were the items in blue, the 4XX class, which are referred to as client errors. This class includes everybody's favorite, the 404 error - page not found.
Last, but not least, there are the 3XX codes in yellow. Strictly speaking these are redirects rather than errors, indicating that the document has moved. About 8 percent of all docs returned a code in this class.
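If you want to run the same kind of breakdown on your own crawl logs, here is a minimal Python sketch of the bucketing. It assumes you already have a plain list of HTTP status codes. The function names and the sample list are made up for illustration; this is not the Web Search Platform's API, and the sample is not the Alexa Crawl data.

```python
from collections import Counter

# Labels for the standard HTTP status classes discussed above.
CLASS_LABELS = {
    2: "2XX success",
    3: "3XX redirection",
    4: "4XX client error",
    5: "5XX server error",
}


def status_class(code):
    """Return the class label for an HTTP status code, or 'other'."""
    return CLASS_LABELS.get(code // 100, "other")


def class_shares(codes):
    """Return the percentage of responses that fall in each status class."""
    counts = Counter(status_class(c) for c in codes)
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


# Hypothetical sample, invented for this example only.
sample = [200, 200, 200, 200, 200, 200, 301, 302, 404, 500]
for label, pct in sorted(class_shares(sample).items()):
    print(f"{label}: {pct:.1f}%")
```

Bucketing by the first digit keeps the chart down to a handful of slices. If you care about individual codes such as 404, count the raw codes instead of the classes.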
We have a few more ideas about what to do next, like looking at the distribution of domains across the Web, and we are looking for more. Suggestions are welcome.