Google - How It Works
According to Urs Hoelzle at EclipseCon 2005, Google's scaling strategy was integral to its search dominance and its later emergence as a major player in this era of net-based software.
They use cheap, commodity hardware rather than high-end servers. That lowers the cost per CPU, which leaves room for greater redundancy, but it also increases the potential maintenance burden.
They run Linux.
To win, they planned to fail. That seems to break the basic rules for success, but it's actually smart. When you have hundreds or thousands of servers, hardware failure is expected at any given moment, so responding to those failures efficiently is far from trivial.
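To see why failure becomes routine at scale, here is a rough back-of-the-envelope sketch. The 3% annual failure rate is a made-up figure for illustration, not a number from the talk, and the function name is hypothetical; the calculation also assumes failures are independent and spread evenly across the year.

```python
def expected_failures_per_day(num_servers: int, annual_failure_rate: float) -> float:
    """Expected server failures per day, assuming independent failures
    spread evenly over a 365-day year."""
    return num_servers * annual_failure_rate / 365.0


if __name__ == "__main__":
    ASSUMED_RATE = 0.03  # hypothetical: 3% of servers fail per year
    for fleet_size in (100, 1_000, 10_000):
        per_day = expected_failures_per_day(fleet_size, ASSUMED_RATE)
        print(f"{fleet_size:>6} servers: about {per_day:.3f} failures per day")
```

Whatever the real rate is, the expected number of failures grows linearly with the fleet, so at some size a failed machine stops being an incident and becomes a daily event the software has to absorb.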
Urs took us through a few unbelievable, humorous slides showing their hardware progression from the late 90s on.
Redundancy is a core Google value. Not losing data is central to Google's business, so it makes sense, but it requires reliable infrastructure building blocks. To achieve this, Google built several useful abstractions:
Google File System (GFS)
- The GFS Master manages metadata, which is itself replicated
- 64 MB file 'chunks' are managed by Chunkservers and replicated 3X for fault tolerance
- GFS clients access the GFS Master and the Chunkservers directly
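The chunk arithmetic above can be sketched in a few lines. This is only an illustration under stated assumptions: the 64 MB chunk size and 3X replication come from the talk, but the function names and the round-robin placement policy are hypothetical and are not how GFS actually chooses Chunkservers.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks, as described in the talk
REPLICAS = 3                   # each chunk is stored three times


def chunk_index(byte_offset: int) -> int:
    """Which chunk of a file holds the given byte offset."""
    return byte_offset // CHUNK_SIZE


def chunks_for_file(file_size: int) -> int:
    """Number of chunks needed to store a file (ceiling division)."""
    return (file_size + CHUNK_SIZE - 1) // CHUNK_SIZE


def place_replicas(chunk_id: int, servers: list[str]) -> list[str]:
    """Pick REPLICAS distinct servers for a chunk.

    Hypothetical round-robin policy, for illustration only.
    """
    if len(servers) < REPLICAS:
        raise ValueError("need at least as many servers as replicas")
    start = chunk_id % len(servers)
    return [servers[(start + i) % len(servers)] for i in range(REPLICAS)]


if __name__ == "__main__":
    servers = ["cs-a", "cs-b", "cs-c", "cs-d", "cs-e"]  # made-up names
    one_gb = 1024 ** 3
    print(chunks_for_file(one_gb), "chunks for a 1 GB file")
    print("offset 200 MB lives in chunk", chunk_index(200 * 1024 * 1024))
    print("chunk 4 replicas:", place_replicas(4, servers))
```

In this toy model the master would only need to remember the mapping from chunk to servers, while the bytes themselves stay on the Chunkservers.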
Basic Computing Cluster
- Needed massive parallelization and distribution that is easy to use
- MapReduce solves the problem. MapReduce = map + reduce.
Map: take an input k/v pair and produce a set of intermediate k/v pairs
Reduce: emit the final, condensed k/v pairs; these are the sorted, merged search results
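To make the map and reduce steps concrete, here is a minimal single-process sketch that counts words. It shows only the programming model; nothing here is distributed, and the names map_fn, reduce_fn and map_reduce are my own, not Google's API.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: take one input k/v pair (document name, text) and
    emit intermediate k/v pairs (word, 1)."""
    for word in value.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    """Reduce: condense all intermediate values for one key into a final pair."""
    return key, sum(values)


def map_reduce(inputs, map_fn, reduce_fn):
    """Run map over every input, group intermediate values by key,
    then reduce each group in sorted key order."""
    groups = defaultdict(list)
    for key, value in inputs:
        for inter_key, inter_value in map_fn(key, value):
            groups[inter_key].append(inter_value)
    return [reduce_fn(k, groups[k]) for k in sorted(groups)]


if __name__ == "__main__":
    docs = [("doc1", "the quick fox"), ("doc2", "the lazy dog")]
    for word, count in map_reduce(docs, map_fn, reduce_fn):
        print(word, count)
```

Because each map call sees only one input pair and each reduce call sees only one key, the work can in principle be split across many machines without changing these two functions.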
MapReduce is so redundant that, in one unplanned test, they lost 90% of their reduce 'worker' servers and all of the reduce tasks still completed. Now that's fault tolerance!
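The idea behind surviving that kind of loss is that a task assigned to a dead worker simply gets handed to another one. The sketch below is a toy scheduler of my own, not Google's: the function name, the round-robin assignment, and the fixed set of dead workers are all assumptions for illustration.

```python
from collections import deque


def run_tasks(tasks, workers, dead_workers):
    """Assign tasks to workers round-robin; requeue any task that lands
    on a dead worker until a live worker picks it up."""
    if all(w in dead_workers for w in workers):
        raise RuntimeError("no live workers left")
    pending = deque(tasks)
    completed = {}
    attempt = 0
    while pending:
        task = pending.popleft()
        worker = workers[attempt % len(workers)]
        attempt += 1
        if worker in dead_workers:
            pending.append(task)  # reschedule the task for a later attempt
            continue
        completed[task] = worker
    return completed


if __name__ == "__main__":
    workers = [f"w{i}" for i in range(10)]  # made-up worker names
    dead = set(workers[:9])                 # 90% of the workers are gone
    tasks = [f"reduce-{i}" for i in range(5)]
    print(run_tasks(tasks, workers, dead))
```

With nine of ten workers down, every task still finishes; it just takes more attempts, and the lone survivor does all the work.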
Regarding query frequency, he showed several successive graphs where the frequency of
"eclipse" searches, before Eclipse.org, peaked every three years
"world series" searches peaked every year
"watermelon" peaked during the summer
Funny, but more importantly, Google uses these patterns to learn from the data. This learning process is broken into two basic steps: establishing relationships between searched data, then clustering the related documents for relevant search results.
It was an interesting talk. Their scaling approach isn't mind-bending, but it's so effective. What's most fascinating to me is that they had the audacity and foresight to tackle the problem at the beginning. For more information on How Google Works, go here.