[ale] Linux Cluster Server Room
Bjorn Dittmer-Roche
bjorn at sccs.swarthmore.edu
Tue Apr 20 08:16:54 EDT 2004
On Tue, 20 Apr 2004, Jeffrey B. Layton wrote:
> Well, my response is - it depends. How long is long? How important
> is it to you? Can you checkpoint or modify the code to checkpoint?
> Unfortunately, there are questions you have to answer. However, let
> me give you some things I think about.
>
> We run CFD codes (Computational Fluid Dynamics) to explore
> fluid flow over and in aircraft. The runs can last up to about 48
> hours. Our codes checkpoint themselves, so if we lose the nodes
> (or a node since we're running MPI codes), we just back up to the
> last checkpoint. Not a big deal. However, if we didn't checkpoint,
> I would think about it a bit. 48 hours is a long time. If the cluster
> dies at 47:59 I would be very upset. However, if we're running
> on a cluster with 256 nodes with UPS and if getting rid of UPS
> means I can get 60 more nodes, then perhaps I could just run my
> job on more nodes and get done faster (reducing the window
> of vulnerability if you will).
Jeff touches on an important point here: what happens when you lose one
node? Think about the hardware's MTBF, how often you will lose a single
node, and what the consequences are. If your computations run for a week
without checkpoints and you have a lot of nodes, you will have to worry
about hardware failure as well as power. So good coding practice involves
checkpoints.
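To put rough numbers on the MTBF point, here is a minimal sketch in Python. It assumes nodes fail independently at a constant rate (a simple exponential model), and the 50,000-hour per-node MTBF is a made-up figure for illustration, not a measurement of any real hardware. The function names are mine.

```python
import math

def p_any_node_fails(nodes, run_hours, node_mtbf_hours):
    """Probability that at least one of `nodes` fails during the run.

    Assumes independent nodes with a constant failure rate
    (exponential model); real hardware may not behave this way.
    """
    expected_failures = nodes * run_hours / node_mtbf_hours
    return 1.0 - math.exp(-expected_failures)

def cluster_mtbf_hours(nodes, node_mtbf_hours):
    """Mean time between node failures anywhere in the cluster."""
    return node_mtbf_hours / nodes

# Illustrative numbers only: 256 nodes, a one-week run,
# and an assumed per-node MTBF of 50,000 hours.
p = p_any_node_fails(256, 7 * 24, 50_000)
print(f"cluster MTBF: {cluster_mtbf_hours(256, 50_000):.0f} h")
print(f"chance of losing a node during the run: {p:.0%}")
```

With those assumed figures, the odds of losing at least one node during a week-long run come out somewhat better than even, which is why an uncheckpointed MPI job of that length is a gamble regardless of the power situation.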
At the risk of getting flamed: have you considered alternative
multiprocessor machines from Sun, SGI and the like? These systems have
great reliability and let you do things like put 60 GB of RAM in one machine.
> You also need to think about how long the UPSes will last. If you
> need to run 48 hours and the UPS kicks in at about 24 hours, will
> the UPS last 24 hours? If not, you will lose the job anyway (with
> no checkpointing) unless you get some really big UPSes. So in this
> case, UPS won't help much. However, it would help if you were
> only a few minutes away from completing a computation and
> just needed to finish (if it's a long run, the odds are this scenario
> won't happen often). If you could just touch a file and have your
> code recognize this so it could quickly checkpoint, then a UPS
> might be worth it (some of our codes do this).
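The "touch a file" trick Jeff describes might look something like the sketch below. This is not his code; it is a toy loop in Python, and the names `run`, `save_checkpoint`, the flag file, and the pickle-based state are all assumptions made for illustration. The loop checks for the flag file on every step, writes a checkpoint when it sees it (or on a regular interval), and stops cleanly.

```python
import os
import pickle

def save_checkpoint(state, ckpt_path):
    """Write state to a temp file, then rename it into place.

    The rename keeps a power cut in mid-write from clobbering the
    previous good checkpoint.
    """
    tmp_path = ckpt_path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, ckpt_path)

def run(total_steps, state, flag_path, ckpt_path, every=1000):
    """Advance a toy computation; return True if it ran to completion.

    Checkpoints every `every` steps, and immediately if someone has
    touched `flag_path` (for example, a UPS shutdown script).
    """
    for step in range(state["step"], total_steps):
        state["value"] += step        # stand-in for the real work
        state["step"] = step + 1
        requested = os.path.exists(flag_path)
        if requested or state["step"] % every == 0:
            save_checkpoint(state, ckpt_path)
        if requested:
            os.remove(flag_path)      # acknowledge the request
            return False              # stopped early, state is saved
    return True
```

To resume, load the pickled state and pass it back to `run`. In an MPI code every rank would have to agree to checkpoint at the same step, which this single-process sketch does not attempt.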
Most power problems where I used to work were very brief. I don't know
what things are like here in Georgia, or whether or not you have
backup generators, but a UPS that gives you 30 seconds will get you
through a lot of tough spots and will save you from losing your
computations because of a ten-second power outage. If you want to ride
out major blackouts, a small UPS and a generator will be more
cost-effective than a large UPS, but again, what's the point when your node
MTBF is on the same order as the frequency of power outages?
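One way to weigh that last question is to ask what share of all interruptions a UPS could actually prevent. The Python sketch below is only back-of-the-envelope arithmetic: the outage count and the per-node MTBF are invented inputs, the function name is mine, and it treats every power outage as something the UPS would have covered.

```python
HOURS_PER_YEAR = 365.25 * 24

def power_share_of_interruptions(nodes, node_mtbf_hours, outages_per_year):
    """Fraction of yearly interruptions caused by power, not hardware.

    Assumes a constant per-node failure rate and that any single node
    failure or power outage interrupts the job.
    """
    node_failures_per_year = nodes * HOURS_PER_YEAR / node_mtbf_hours
    total = node_failures_per_year + outages_per_year
    return outages_per_year / total

# Illustrative inputs only: 256 nodes, an assumed 50,000 h per-node
# MTBF, and an assumed 6 power outages per year.
share = power_share_of_interruptions(256, 50_000, 6)
print(f"share of interruptions a UPS could prevent: {share:.0%}")
```

If that share is small, the money is probably better spent on checkpointing or more nodes; if your site loses power far more often than it loses nodes, the answer flips.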
bjorn