Webware.com has an article on the technology behind the search superpower Google. The article covers a presentation Jeff Dean gave at the 2008 Google I/O conference. Google is somewhat famous for shunning conventional wisdom by building its empire on throw-away hardware. Quoting the article:
In each cluster’s first year, it’s typical that 1,000 individual machine failures will occur; thousands of hard drive failures will occur; one power distribution unit will fail, bringing down 500 to 1,000 machines for about 6 hours; 20 racks will fail, each time causing 40 to 80 machines to vanish from the network; 5 racks will “go wonky,” with half their network packets missing in action; and the cluster will have to be rewired once, affecting 5 percent of the machines at any given moment over a 2-day span, Dean said. And there’s about a 50 percent chance that the cluster will overheat, taking down most of the servers in less than 5 minutes and taking 1 to 2 days to recover
For almost any other company, that would be a very unhappy story. Yet for smart companies like Google, getting service reliability without reliable hardware is part of the advantage. All of these outages are handled skillfully by Google's software. It's the software that redirects queries around these outages, distributing the work across hundreds of servers at once. The little Network Engineer in me is screaming "That's what I'm talking about!" That isn't just fully redundant, that's approaching biological complexity. The brain's software can do some amazing things despite the hardware being unreliable. Does that imply that the Google search service has reached phone-company levels of availability? A hint might be in their next big engineering hurdle: seamlessly switching jobs from datacenter to datacenter. In other words, one big namespace.
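The idea of software routing around dead hardware can be sketched in a few lines. This is purely a toy illustration, not Google's actual mechanism: the server names are made up, and the health check is a stand-in for whatever real failure detection (heartbeats, timeouts) a cluster would use.

```python
# Hypothetical replica list; in a real cluster these would be server addresses.
REPLICAS = ["server-a", "server-b", "server-c", "server-d"]

def route_query(query, replicas, is_healthy):
    """Return the answer from the first healthy replica.

    is_healthy is a callable standing in for the cluster's failure
    detection. If a machine is down, the query simply flows to the
    next replica -- the caller never sees the hardware failure.
    """
    for server in replicas:
        if is_healthy(server):
            return f"{server} answered: {query}"
    raise RuntimeError("all replicas down")

# Simulate server-a failing: the query transparently lands on server-b.
down = {"server-a"}
print(route_query("web serving", REPLICAS, lambda s: s not in down))
```

The point is that reliability lives in this routing layer, not in any individual box, which is why Google can keep serving through the failure rates Dean describes.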
Technorati Tags: google, google io, web serving