Sunday, September 20, 2009

Thoughts on Beautiful Architecture ch. 8 - Guardian: A Fault-Tolerant OS Environment

In this chapter, Greg Lehey describes the architecture of Guardian, the OS for Tandem's "NonStop" series of computers, and attempts to answer the question of why such a revolutionary architecture had so little impact.

The first thing that impressed me was how complete its hardware redundancy was for its time. I have worked on a couple of high-availability systems with hardware redundancy myself in this decade. In those systems, by contrast, each board carries a single set of components, and the boards are connected by high-speed links. Redundancy comes from adding a second, identical board, so the user can decide whether to pay the additional cost. With that approach, duplicating the hardware in full actually makes sense, because it avoids the cost of a second board design. Tandem's hardware architecture, on the other hand, incorporates the redundancy on the same board, which may be less flexible. It is interesting that they did not take into account the differences in mean time between failures (MTBF) among the components. For low-risk components such as the IPB, one might skip the standby to cut cost, especially since, from my reading of the chapter, the additional cost seems to have been one of Tandem's disadvantages against its competitors.
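To make the MTBF point a bit more concrete, here is a rough back-of-the-envelope sketch in Python. It uses the textbook steady-state formula (availability = MTBF / (MTBF + MTTR)) and assumes the two copies of a component fail independently. The function names and any numbers fed into them are my own assumptions for illustration; none of them come from the chapter or from Tandem.

```python
HOURS_PER_YEAR = 8760


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of a single component."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def duplicated_availability(single: float) -> float:
    """Availability of a primary/standby pair.

    Assumes independent failures and an ideal failover: the pair is
    down only when both copies are down at the same time.
    """
    return 1.0 - (1.0 - single) ** 2


def downtime_hours_per_year(avail: float) -> float:
    """Expected downtime per year for a given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR


def downtime_saved_by_standby(mtbf_hours: float, mttr_hours: float) -> float:
    """Yearly downtime hours that adding a standby would remove."""
    single = availability(mtbf_hours, mttr_hours)
    pair = duplicated_availability(single)
    return downtime_hours_per_year(single) - downtime_hours_per_year(pair)
```

Under these assumptions, a component that is 99% available on its own becomes 99.99% available as a pair. The higher a component's MTBF already is, the fewer hours of downtime a standby can remove, which is the case for not duplicating every part uniformly.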

Since many devices are directly accessible by only a couple of CPUs, the Tandem machine relies heavily on message exchanges between CPUs. The Tandem engineers later cleverly extended the message-passing mechanism beyond one box to support EXPAND and FOX. I am curious about the time difference between local I/O access, in-box I/O access, and remote I/O access; I suspect it is significant. If so, the OS may need to allocate resources with locality in mind. I am not sure whether Guardian does this.
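Here is a small Python sketch of what I mean by allocating with locality in mind. It is purely hypothetical: the `Cpu` and `Device` types, the three tiers, and the relative costs are all made up for illustration, and I am not claiming Guardian works this way.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cpu:
    system: str   # which box the CPU lives in
    number: int   # CPU number within that box


@dataclass(frozen=True)
class Device:
    name: str
    system: str
    attached_cpus: frozenset  # CPU numbers with a direct path to the device


# Hypothetical relative costs; real numbers would have to be measured.
COST = {"local": 1, "in_box": 10, "remote": 100}


def access_tier(cpu: Cpu, device: Device) -> str:
    """Classify how a CPU would reach a device."""
    if cpu.system != device.system:
        return "remote"   # request must travel to another box
    if cpu.number in device.attached_cpus:
        return "local"    # CPU has a direct path to the device
    return "in_box"       # message to an attached CPU in the same box


def pick_cpu(candidates, device: Device) -> Cpu:
    """Choose the candidate CPU with the cheapest path to the device."""
    return min(candidates, key=lambda c: COST[access_tier(c, device)])
```

For example, with a disk attached to CPUs 0 and 1 of system "A", `pick_cpu` would prefer CPU 1 of system "A" over CPU 3 of the same system, and either of those over any CPU in another system.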

The author saves his explanation of the Tandem machine's lack of impact for the end. Since this was way before my time, I really don't know how open companies in the seventies were about documenting their architectures. I can't help but speculate: since this was a proprietary machine, perhaps disclosing the architecture was discouraged? Perhaps that is why it did not make a bigger splash?
