[ale] Linux Cluster Server Room
Jeffrey B. Layton
laytonjb at charter.net
Tue Apr 20 07:31:28 EDT 2004
Well, my response is - it depends. How long is long? How important
is it to you? Can you checkpoint, or modify the code to checkpoint?
Unfortunately, these are questions only you can answer. However, let
me give you some of the things I think about.
We run CFD codes (Computational Fluid Dynamics) to explore
fluid flow over and in aircraft. The runs can last up to about 48
hours. Our codes checkpoint themselves, so if we lose the nodes
(or a node since we're running MPI codes), we just back up to the
last checkpoint. Not a big deal. However, if we didn't checkpoint,
I would think about it a bit. 48 hours is a long time. If the cluster
dies at 47:59 I would be very upset. However, if we're running
on a cluster with 256 nodes with UPS coverage, and getting rid of the
UPS means I can get 60 more nodes, then perhaps I could just run my
job on more nodes and get done faster (reducing the window
of vulnerability, if you will).
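To put rough numbers on that window of vulnerability, here is a
back-of-the-envelope sketch (the one-outage-a-month rate is a made-up
example, not a measured figure - plug in your own site's history):

    #include <math.h>
    #include <stdio.h>

    /* If outages arrive randomly at a steady average rate, the chance
     * that a run of T hours gets hit is 1 - exp(-rate * T). */
    int main(void)
    {
        double rate = 1.0 / 720.0;   /* assume one outage per ~720 hours */
        double hours[] = { 48.0, 36.0, 24.0 };
        int i;

        for (i = 0; i < 3; i++) {
            double p = 1.0 - exp(-rate * hours[i]);
            printf("%2.0f-hour run: ~%.1f%% chance of being hit\n",
                   hours[i], 100.0 * p);
        }
        return 0;
    }

With those numbers a 48-hour run has about a 6.4% chance of being hit,
while a 24-hour run is down around 3.3% - shortening the run really
does shrink the exposure.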
You also need to think about how long the UPSes will last. If you
need to run 48 hours and the UPS kicks in at about hour 24, will
the UPS last another 24 hours? If not, you will lose the job anyway
(with no checkpointing) unless you get some really big UPSes. So in
this case, a UPS won't help much. However, it would help if you were
only a few minutes away from completing a computation and
just needed to finish (if it's a long run, the odds are this scenario
won't happen often). If you could just touch a file and have your
code recognize this so it could quickly checkpoint, then a UPS
might be worth it (some of our codes do this).
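For anyone who hasn't seen that touch-a-file trick, here is a minimal
sketch. The sentinel name (CHECKPOINT_NOW) and the checkpoint routine
are stand-ins for illustration; a real solver would write its own
restart files:

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/stat.h>

    /* Stand-in for the solver's real checkpoint writer. */
    static void write_checkpoint(int step)
    {
        printf("checkpoint written at step %d\n", step);
    }

    int main(void)
    {
        struct stat sb;
        int step;

        for (step = 0; step < 1000000; step++) {
            /* ... one solver iteration would go here ... */

            /* Cheap test each iteration: does the sentinel file exist? */
            if (stat("CHECKPOINT_NOW", &sb) == 0) {
                write_checkpoint(step);
                unlink("CHECKPOINT_NOW");   /* clear it for next time */
            }
        }
        return 0;
    }

The nice part is that a UPS monitor's on-battery script can simply
touch that file, so the code checkpoints itself before the batteries
run down.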
Unfortunately, there is no easy answer. You need to figure out
the answers yourself :)
Good Luck!
Jeff
P.S. Dow - notice my address change. You can talk to me off
line if you want.
> I understand your philosophy here but have a question. What if the
> calculations are long and costly to restart? Shouldn't I look at the
> value of spent computation that might have to be done over if I lose
> power? The code I am most concerned about running on the cluster may
> or may not be checkpointable. I think it might be, but I know my
> users and they won't want power to be an issue with predicting when
> their jobs will finish. ;-)
>
> Are Best UPSes better performers than Tripp Lite or APC? I have
> experience with Tripp Lite, APC, and Liebert so far and have never
> used Best. I like the toughness and quality of the enclosures on the
> APC and Liebert units. I like the build quality of all three. I like
> the performance and cost of APC and Tripp Lite. Tripp Lite's cases or
> enclosures on the low end aren't as nice as APC's, but when you get to
> the high-end UPSes they have nice rack enclosures. Performance-wise, I
> haven't been able to tell a difference between the two. Heat production
> leans toward APC producing less overall.
>
> What do you mean by getting the wrong power factor conversion? Do you
> mean getting 120v at 60Hz vs 220v at 60Hz on the output outlets?
>
> I appreciate all this advice!
> Dow
>
>
>
> Jeffrey B. Layton wrote:
>
>> I'll give you my 2 cents about clusters and UPSes if you wish.
>>
>> A good cluster configuration will treat each compute node as
>> an appliance. You don't really care about it too much and it
>> doesn't hold any data of any importance. What you care about
>> is the master node and/or the machines where the data is stored. These
>> machines can have their own UPS, or a single UPS can cover
>> all of them (there may be more than one). Then take the cost
>> savings (if you can) and put them into more nodes, or a better
>> interconnect (if needed), or a large file system, or a better
>> backup system, or .... well, you get the picture.
>>
>> Putting a UPS on only the important parts of the
>> cluster will save you money, time, and headaches. However,
>> if you put a cluster in a server room you can have all power
>> covered by a single huge UPS and probably a diesel backup
>> generator as well. This goes back to the purpose of a server
>> room - to support independent servers, not clusters. While this
>> is nice and good, it is somewhat wasteful. If you could have
>> a combination of UPS/Diesel backed power and just regular
>> conditioned power, that would be more economical. However,
>> the budget for clusters (computing) and the budget for facilities
>> are never really seen as related by management. Even though
>> they come out of the same overall pot within the company (or
>> university), management has a tendency to compartmentalize
>> things for easy managing (and the definite lack of brain power
>> on the part of most managers). Try arguing that you really
>> don't need the giant UPS/Diesel combo and you will get IT
>> managers screaming all sorts of things about you. Sigh.
>>
>> Of course, these comments depend on your cluster configuration.
>> If you are running a global filesystem across all of the nodes,
>> so that each node has part of the filesystem, then you might
>> want to think about a good UPS for all of the nodes (try
>> restoring a 20 TB global filesystem from backup after a
>> power outage).
>>
>> Good Luck!
>>
>> Jeff
>>
>>> What type of UPS system are you using? Do most install a large UPS
>>> system for the entire server room? If so, how much will this cost?
>>>
>>> Thanks,
>>> Chris
>>>
>>> -----Original Message-----
>>> From: Dow Hurst [mailto:dhurst at kennesaw.edu]
>>> Sent: Monday, April 12, 2004 11:20 AM
>>> To: ale
>>> Subject: Re: [ale] Linux Cluster Server Room
>>>
>>>
>>> Thanks Jonathan! That is exactly the kind of ballpark I needed! I
>>> don't need the vendors right now as we are still kicking around ideas.
>>> If anyone would throw some specs or ideas out there, I'd appreciate it.
>>> Here is a quick question: is planning for double your expected load a
>>> good rule? I would think that would be a good idea. How about backup
>>> cooling if the main unit dies? The fire safe is one I had not thought of.
>>> Dow
>>>
>>>
>>> Jonathan Glass (IBB) wrote:
>>>
>>>
>>>> How big are the Opteron nodes? Are they 1U, 2U, or 4U? How big are
>>>> the power supplies? What is the maximum draw you expect? Convert that
>>>> number to figure out how much heat dissipation you'll need to handle
>>>> (a rough conversion example follows below).
>>>>
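>>>> (A quick sanity check on that conversion, using a made-up draw
>>>> figure: 1 W is about 3.412 BTU/hr, and one "ton" of cooling is
>>>> 12,000 BTU/hr.)
>>>>
>>>>     #include <stdio.h>
>>>>
>>>>     /* Watts-to-cooling conversion; the 5 kW rack draw below is an
>>>>      * illustrative number, not a measurement. */
>>>>     int main(void)
>>>>     {
>>>>         double watts = 5000.0;           /* example total rack draw */
>>>>         double btu_hr = watts * 3.412;   /* 1 W ~= 3.412 BTU/hr */
>>>>         double tons = btu_hr / 12000.0;  /* 1 ton = 12,000 BTU/hr */
>>>>
>>>>         printf("%.0f W -> %.0f BTU/hr -> %.1f tons of A/C\n",
>>>>                watts, btu_hr, tons);
>>>>         return 0;
>>>>     }
>>>>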
>>>> I have a 3-ton A/C unit in my 14-15' x 14-15' server room, and the
>>>> 24-33 node cluster I just spec'd out from IBM (1U, Dual Opterons) was
>>>> rated at a max heat dissipation (is this the right word?) of 18,000 BTU.
>>>> According to my A/C guy, the 3-ton unit can handle a max of 36,000 BTU,
>>>> so I'm well inside my limits. Getting the 3-ton unit installed in the
>>>> drop-down ceiling, including installing new chilled water lines, was
>>>> around $20K.
>>>>
>>>> I do have sprinkler fire protection, but that room is set to release
>>>> its water supply independently of the other rooms. Also, supposedly,
>>>> the fire sprinkler heads (whatever they're called) withstand
>>>> considerably more heat than normal ones. So, the reasoning goes, if
>>>> it gets hot enough for those to go off, I have bigger problems than
>>>> just water. Thus, I have a fire safe nearby (in the same bldg...yeah,
>>>> yeah, I know; off-site storage!) that holds my tapes, and will shortly
>>>> hold a hardware inventory and admin password list for all my servers.
>>>>
>>>> If you want my list of vendors, send me an email off-list, or call my
>>>> office, and I'll see if I can track down the DPOs for you.
>>>>
>>>> Thanks
>>>>
>>>> Jonathan Glass
>>>>
>>>> On Fri, 2004-04-09 at 17:35, Dow Hurst wrote:
>>>>
>>>>
>>>>
>>>>> If I needed to take an existing space of 400 square feet w/8'
>>>>> ceiling (20'x20'x8') and add A/C and fire protection for a server
>>>>> room, what kind of cost would be incurred? Sounds like an algebra
>>>>> problem from high school, doesn't it? Let's say a full 84" rack of
>>>>> 4-CPU Opteron nodes and supporting hardware were in the room. Does
>>>>> anyone have any ballpark figures they could throw out there? Any
>>>>> links I could be pointed to?
>>>>> Thanks a bunch,
>>>>> Dow
>>>>>
>>>>>
>>>>> PS. I'd like some other type of fire protection than sprinkler
>>>>> heads. ;-)
>>>>>
>>>>