Really Big Things
Posted by Ken Farmer, Thursday August 28 2008 @ 04:31PM EDT
Linux Magazine: How does one manage really big clusters? Perhaps nature can give us a clue.
As clusters continue to grow in size and complexity, I often think about the programming and management issues users and administrators must face. There are currently proposals floating around that require five-digit node counts. That is a lot of cores, heat, power, cables, floor space, and coffee. If I were setting up a 25,000-node cluster, I would suddenly acquire a renewed interest in statistics, and not just for the nodes but for the networking as well. As you may know, infrastructure parts are rated in terms of MTBF (Mean Time Between Failures). This statistic is often misunderstood. If a part has an MTBF of 10,000 hours, that does not mean the part will last 10,000 hours. It is a figure used to estimate the failure rate of a collection of similar parts.
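To make the MTBF point concrete, here is a minimal sketch of the arithmetic. It assumes that parts fail independently at a constant rate (the exponential model), which is a simplification rather than anything stated in the article. The function names are hypothetical, and pairing the 25,000-node count with the 10,000-hour MTBF is only an illustration built from the two figures mentioned above.

```python
import math


def fleet_failure_rate(num_parts: int, mtbf_hours: float) -> float:
    """Expected failures per hour across num_parts identical parts.

    Assumes independent parts with a constant failure rate of 1 / MTBF each.
    """
    return num_parts / mtbf_hours


def mean_hours_between_fleet_failures(num_parts: int, mtbf_hours: float) -> float:
    """Average time between failures anywhere in the collection."""
    return mtbf_hours / num_parts


def survival_probability(hours: float, mtbf_hours: float) -> float:
    """Chance that a single part is still working after `hours`.

    Uses the exponential model: P(survive t) = exp(-t / MTBF).
    """
    return math.exp(-hours / mtbf_hours)


if __name__ == "__main__":
    nodes = 25_000   # node count from the article's hypothetical cluster
    mtbf = 10_000.0  # per-part MTBF in hours, from the article's example

    rate = fleet_failure_rate(nodes, mtbf)
    gap_minutes = mean_hours_between_fleet_failures(nodes, mtbf) * 60
    alive = survival_probability(mtbf, mtbf)

    print(f"Expected failures per hour: {rate:.2f}")
    print(f"Average minutes between failures: {gap_minutes:.0f}")
    print(f"Chance one part survives a full MTBF: {alive:.1%}")
```

Under these assumptions, a collection of 25,000 such parts would see roughly 2.5 failures per hour, or about one every 24 minutes, and a single part would have only about a 37% chance of reaching its rated 10,000 hours. Real hardware often departs from the constant-rate model (early-life and wear-out failures), so treat the numbers as a rough guide.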