[ale] Server room issues

Jeff Hubbs hbbs at comcast.net
Thu Apr 22 09:42:34 EDT 2004


Also - a "crash cart" with LCD monitor, keyboard, and mouse affixed to
it!

On Thu, 2004-04-22 at 09:33, Dow Hurst wrote:
> Someone added that having enough room in the server room for a moveable 
> cart on big rubber wheels is very convenient.  You can wheel it to the rack 
> you're working on and use it as a bench as well as a cart.  I like the cover 
> over the main kill switch.
> Dow
> 
> 
> Jim Popovitch wrote:
> > Excellent write-up Dow.  I only have one point to add.... ALWAYS put a
> > plastic cover over the big red manual power kill switch, and for
> > sanity's sake: locate the switch at maximum arm-length away from exit
> > doors so as to make it highly inconvenient for contractors to hang their
> > jacket on. :)
> > 
> > -Jim P.
> > 
> > On Thu, 2004-04-22 at 02:37, Dow Hurst wrote:
> > 
> >>Here is a not so condensed listing of some advice from the list and others. 
> >>Read this as a rundown of a particular installation with some thoughts 
> >>injected.  I'd appreciate comments or advice if you want to add anything. 
> >>Names were blacked out to protect someone:
> >>
> >>Starting Philosophy: servers should never shut down and should run 24/7; jobs
> >>should be checkpointable for restart in case of failure; people are more
> >>important than servers; we are not a 99.99999% uptime facility.
> >>
> >>1.  A/C issues
> >>
> >>   3 separate systems, each with half the capacity needed to cool the server
> >>room, so one can fail and be repaired without downtime on the servers.  There
> >>is no redundant piping of coolant, so that is a point of failure.  Two window
> >>units (if windows are available) are installed as extra cooling capacity to
> >>handle either extra heat load or, for a short time, a main unit failure.  The
> >>window units make use of the window space, which would have been a leak point
> >>for the A/C anyway.
> >>
> >>Think of A/C and power supply as a fixed capacity that is set when the room is
> >>built.  No more A/C or power will be added later, due to the expense.  So plan
> >>for the heat load and power load of the next 5-10 years.
> >>
> >>Put in a raised floor as an A/C plenum to deliver air to the underside of the
> >>servers.  It is extremely valuable and, in XXXXX's opinion, the most important
> >>server room feature.  Perforated tiles allow control and tuning of the cool
> >>air flow in the room.  This is important since the airflow is never correct
> >>after the room is filled with equipment; it always needs redirection after the
> >>servers are installed.  The floor needs to be of high quality design, with
> >>steel capable of supporting a 2000lb rack of equipment with ease.  XXXXX
> >>indicated that the floor should not deflect more than 1/32" under
> >>4000lb/square inch of applied force.  A steel loading dock ramp should be put
> >>in, not wood, since heavy equipment will be moved up to the raised floor level.
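> >>
> >>As a quick sanity check on why the floor rating matters, here is a rough
> >>point-load sketch (Python).  The caster contact area is an assumed
> >>placeholder, not a figure from XXXXX; the rack weight is the 2000lb estimate
> >>used above:
> >>
> >>   rack_weight_lb = 2000.0      # full 47U rack, per the estimate above
> >>   casters = 4
> >>   contact_area_sq_in = 0.5     # assumed contact patch per caster (placeholder)
> >>
> >>   load_per_caster = rack_weight_lb / casters
> >>   psi = load_per_caster / contact_area_sq_in
> >>   print("%.0f lb per caster -> roughly %.0f lb/sq in on the tile"
> >>         % (load_per_caster, psi))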
> >>
> >>
> >>An ale.org member, Jonathon Glass, said this:
> >>"You should really look at the IBM e325 series (Opterons) for cooling.  I have
> >>4 of them (demo units) in a cluster.  I felt the cases while running a demo,
> >>and they were cool, if not cold, to the touch.  IBM has spent a lot of money
> >>on making these machines rack-ready, and cool running, and it has paid off."
> >>
> >>Also he said:
> >>"How big are the Opteron nodes?  Are they 1,2,4U?  How big are the power
> >>supplies?  What is the maximum draw you expect?  Convert that number to figure
> >>out how much heat dissipation you'll need to handle.
> >>
> >>I have a 3-ton A/C unit in my 14/15 x 14/15 server room, and the 24-33 node
> >>cluster I just spec'd out from IBM (1U, Dual Opterons) was rated at a max heat
> >>dissipation (is this the right word?) of 18,000 BTU.  According to my A/C guy,
> >>the 3-ton unit can handle a max of 36,000 BTU, so I'm well inside my limits.
> >>Getting the 3-ton unit installed in the drop-down ceiling, including
> >>installing new chilled water lines, was around $25K."
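> >>
> >>A back-of-the-envelope version of that conversion (Python).  The node count
> >>and per-node draw below are placeholder assumptions picked to land near the
> >>quoted 18,000 BTU figure, not measured numbers:
> >>
> >>   # 1 W of electrical load ~= 3.412 BTU/hr of heat to remove;
> >>   # 1 "ton" of A/C capacity = 12,000 BTU/hr.
> >>   WATTS_TO_BTU_HR = 3.412
> >>   BTU_PER_TON = 12000.0
> >>
> >>   nodes = 33              # assumed cluster size (placeholder)
> >>   watts_per_node = 160.0  # assumed max draw per 1U node (placeholder)
> >>
> >>   btu_hr = nodes * watts_per_node * WATTS_TO_BTU_HR
> >>   print("Heat load: %.0f BTU/hr (%.1f tons)"
> >>         % (btu_hr, btu_hr / BTU_PER_TON))
> >>   print("Headroom under a 3-ton unit: %.0f BTU/hr"
> >>         % (3 * BTU_PER_TON - btu_hr))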
> >>
> >>Another ale.org member, Chris Ricker, had this to say:
> >>"Just to give you another price point to compare, we just spec'ed out getting
> >>an additional 30 tons A/C (360,000 BTUs), and it's coming in at ~$100,000.
> >>That's just for adding two more 15-ton units, as most of the other
> >>infrastructure needed for that's already there...."
> >>
> >>
> >>2.  Power issues
> >>
> >>   Power cabling is run under the raised floor to receptacles in the floor.
> >>All circuits, except a couple of 120V 20A outlets, are 240V 30A single-phase,
> >>with the possibility of three-phase if needed.  Large twist-lock receptacles
> >>were used so unplugging a power cord has to be a deliberate action.  Some IBM
> >>servers need two 60A 240V 3-phase circuits since they were designed to replace
> >>older IBM servers that used that type of circuitry.
> >>
> >>Grounding of servers is thru the raised floor steel structure, which in turn
> >>is grounded thru the building ground for safety.
> >>
> >>No data cables should go inside the raised floor if possible, only power
> >>cabling.  A high ceiling that allows overhead data cables is ideal since
> >>working in the cold air under the floor to install or fix data cables is an
> >>unpleasant experience.  However, XXXXX says don't sacrifice the raised floor
> >>for overhead cable runs.
> >>
> >>UPSes were used only on the file servers and disk arrays.  The main compute
> >>servers were supported by space-saving power conditioners that provide a pure
> >>sine wave and suppress voltage changes.  Space is at a premium in XXXXXXX, and
> >>the city power is excellent, so this solution saved the battery space of large
> >>kVA-capacity UPSes and, in the long term, the cost of battery replacements.
> >>We may not have that luxury with the southern storms and above-ground power
> >>grid.  Power strips with digital readouts showing the amperage currently drawn
> >>on the circuit were put in.  These readouts let you know your amperage per
> >>circuit in real time and tune the load per circuit, since most server power
> >>supplies draw less current than their rating.  Plus, the strips turn on each
> >>outlet in a timed sequence during power-up, so the server loads are staged and
> >>don't hit the circuits all at once.  Dial-in modem control or terminal server
> >>control is available on the power strips if wanted.
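> >>
> >>As a rough illustration of that per-circuit budgeting (Python).  The server
> >>wattages are placeholder assumptions, and the 80% figure is the common
> >>practice of not loading a breaker past 80% of its rating:
> >>
> >>   circuit_volts = 240.0
> >>   circuit_amps = 30.0
> >>   derating = 0.8
> >>   budget_watts = circuit_volts * circuit_amps * derating   # 5760 W usable
> >>
> >>   nameplate_watts = 500.0   # assumed power supply rating (placeholder)
> >>   measured_watts = 300.0    # assumed draw on the strip's readout (placeholder)
> >>
> >>   print("Servers per circuit by nameplate: %d"
> >>         % int(budget_watts // nameplate_watts))
> >>   print("Servers per circuit by measured draw: %d"
> >>         % int(budget_watts // measured_watts))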
> >>
> >>Jonathon Glass had this to say:
> >>"Just for the cluster, I have a 6kVA BestPower UPS.  It'll run all 16 nodes
> >>for about 15 minutes."
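> >>
> >>A crude runtime estimate for sizing a UPS like that (Python).  The battery
> >>energy, inverter efficiency, and per-node draw are placeholder assumptions,
> >>not BestPower specs:
> >>
> >>   battery_wh = 1100.0      # assumed usable battery energy (placeholder)
> >>   inverter_eff = 0.9       # assumed inverter efficiency (placeholder)
> >>   nodes = 16
> >>   watts_per_node = 250.0   # assumed draw per node (placeholder)
> >>
> >>   load_w = nodes * watts_per_node
> >>   runtime_min = battery_wh * inverter_eff / load_w * 60
> >>   print("~%.0f minutes of runtime at a %.0f W load" % (runtime_min, load_w))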
> >>
> >>Jeffrey Layton chimed in with this thought:
> >>"We run CFD codes (Computational Fluid Dynamics) to explore fluid flow over
> >>and in aircraft.  The runs can last up to about 48 hours.  Our codes
> >>checkpoint themselves, so if we lose the nodes (or a node since we're running
> >>MPI codes), we just back up to the last checkpoint.  Not a big deal.  However,
> >>if we didn't checkpoint, I would think about it a bit.  48 hours is a long
> >>time.  If the cluster dies at 47:59 I would be very upset.  However, if we're
> >>running on a cluster with 256 nodes with UPS and if getting rid of UPS means I
> >>can get 60 more nodes, then perhaps I could just run my job on the extra nodes
> >>and get done faster (reducing the window of vulnerability if you will).
> >>
> >>You also need to think about how long the UPS' will last.  If you need to run
> >>48 hours and the UPS kicks in about 24 hours, will the UPS last 24 hours?  If
> >>not, you will lose the job anyway (with no checkpointing) unless you get some
> >>really big UPS'.  So in this case, UPS won't help much.  However, it would
> >>help if you were only a few minutes away from completing a computation and
> >>just needed to finish (if it's a long run, the odds are this scenario won't
> >>happen often).  If you could just touch a file and have your code recognize
> >>this so it could quickly checkpoint, then a UPS might be worth it (some of
> >>our codes do this).  We've got generators that kick in about 10 seconds after
> >>power failure.  And the best thing is that they get tested every month (I can
> >>tell you stories about installations that never tested their diesel).
> >>
> >>However, like I mentioned below, the ultimate answer really depends.  If I can
> >>tolerate losing my running apps, then I can take the money I would dump
> >>into UPS and diesel and buy more nodes.  If your codes can't or don't
> >>checkpoint, then you might consider UPS and diesel.  If you have a global file
> >>system on the nodes (like many people are doing today) that you need up or you
> >>need to at least gracefully shutdown, then consider a UPS and/or diesel.
> >>
> >>I guess my ultimate question is, "Is UPS and diesel necessary for all or part
> >>of the cluster?"  There is no one correct answer.  The answer depends upon the
> >>situation.  However, don't be boxed into a corner that says you have to have
> >>UPS and diesel."
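> >>
> >>A minimal sketch of that touch-a-file checkpoint trigger (Python).  The file
> >>names, the fake solver step, and the checkpoint format are all made up for
> >>illustration; a real MPI code would checkpoint its own solver state:
> >>
> >>   import os, pickle
> >>
> >>   TRIGGER = "checkpoint.now"   # hypothetical file an admin touches on power loss
> >>   STATE = "job.ckpt"           # hypothetical checkpoint file
> >>
> >>   def compute_one_step(step):
> >>       return step * 1.0e-6     # stand-in for one iteration of the real solver
> >>
> >>   # Resume from the last checkpoint if one exists, else start fresh.
> >>   if os.path.exists(STATE):
> >>       with open(STATE, "rb") as f:
> >>           step, total = pickle.load(f)
> >>   else:
> >>       step, total = 0, 0.0
> >>
> >>   while step < 1000000:
> >>       total += compute_one_step(step)
> >>       step += 1
> >>       if os.path.exists(TRIGGER):          # someone touched the trigger file
> >>           with open(STATE, "wb") as f:
> >>               pickle.dump((step, total), f)
> >>           os.remove(TRIGGER)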
> >>
> >>
> >>3.  Fire Detection versus Suppression
> >>
> >>   XXXXX felt suppression systems might endanger the life of someone in the
> >>server room when actuated, so a fire detection system was installed instead.
> >>This detection system is wired into the A/C and power so it can turn them off
> >>if fire is present.  The logic is that the A/C or an overheating circuit would
> >>be the most likely place for a fire to start, so cutting power to the A/C and
> >>servers would be the most likely way to stop the problem.
> >>
> >>Also, insurance for major equipment items is written in as a rider on the
> >>building's or institution's insurance policy.
> >>
> >>Fire suppression systems have come a long way, and I am getting info on them.
> >>A mixture of gases that suppresses fire but allows people to breathe is
> >>available and considered the norm for server rooms in businesses.  A
> >>particulate-based system that leaves no residue is also available.  I've
> >>started the process to get info, but exact room dimensions are required to
> >>quote accurately.  Probably $15-25K is about right for a suppression system.
> >>
> >>Building codes are strict in XXXXXXX, so whatever the building codes require
> >>is what they had to live with.  The sprinkler heads can be switched to
> >>high-temperature heads, and the piping can be isolated from other areas to
> >>prevent disasters, but we may not be able to eliminate sprinklers from the
> >>area.  XXXXX explained that an extension cable is illegal in their server
> >>rooms and in most of XXXXXXX due to building codes.  In the South, codes will
> >>likely be more relaxed.
> >>
> >>Jonathon Glass said:
> >>"I do have sprinkler fire protection, but that room is set to release its
> >>water supply independent of the other rooms. Also, supposedly, the fire
> >>sprinkler heads (whatever they're called) withstand considerably more heat
> >>than normal ones.  So, the reasoning goes, if it gets hot enough for those to
> >>go off, I have bigger problems than just water.  Thus, I have a fire safe
> >>nearby (in the same bldg...yeah, yeah, I know; off-site storage!) that holds
> >>my tapes, and will shortly hold a hardware inventory and admin password list
> >>on all my servers."
> >>
> >>
> >>4.  Room Location Caveats
> >>
> >>   Don't be far from the loading dock, to ease the movement of equipment into
> >>the server room.  Elevators, stairs, tight turns, and doorways are obstacles
> >>to moving equipment, and flooding and corrosive gases are hazards we don't
> >>want near the room.  XXXXX has one room at the level of the xxxx river, so
> >>flooding is a concern.  He mentioned that a large drain in the center of the
> >>floor is always nice to have.  Large servers may not fit thru doorways or in
> >>elevators.  IBM ships a large server in two pieces at an extra charge of
> >>thousands of dollars because some installations can't fit the server thru a
> >>door or up a stairway.  XXXXX and XXXXX ran into this on their P960 IBM
> >>server.  They got the installation for free, but it was a pain to deal with.
> >>
> >>A continuous-hinge steel door is good for sealing in cool air and discouraging
> >>theft, but is hard to remove for equipment installation.  A key lock works
> >>when there is no power, battery or otherwise. ;-)  A keycard electronic lock
> >>is good for multiple employees entering the room, since you can track entry
> >>times and card codes, but the lock should be able to be opened in case of loss
> >>of power.  (Batteries are used a lot for this, I think.  I'll ask XXX how the
> >>keycard locks work at KSU since I ought to know that anyway.)
> >>
> >>
> >>5.  Server Cabinets
> >>
> >>   Most standard clusters come in standard racks, except for blades.  Some
> >>special large servers, like the IBM P960, come in special racks that are a
> >>required purchase.  The SGI Altix can be put in a standard rack; APC, for
> >>example, makes nice standard racks that can be purchased.  We would then
> >>install the Altix parts into the standard rack and go from there.  Skip the
> >>front doors on the racks unless you share the server room and feel you must
> >>lock up the servers.  Keep the back doors, since cabling and sensitive
> >>connectors need protection.  Standard racks come in half-rack, full-rack, and
> >>extra-tall sizes.  The extra-tall sizes won't fit in elevators and may need
> >>some special help to get into the server room.  A 47U rack full of servers
> >>will easily weigh 2000lb, so expect to place the racks carefully before
> >>filling them up!
> >>
> >>Sliding rails are preferred for compute nodes but not for disk arrays.  Disk
> >>arrays usually have disks removable from the front for hot-swap replacement,
> >>so sliding the array out doesn't help.  Definitely high on XXXXX's list of
> >>must-haves is the LCD and keyboard mounted on a sliding rail.  You use this to
> >>access the servers via a serial connection, so it is immune to network
> >>problems.  A terminal server for serial access or a KVM switch is recommended.
> >>They have a $20K Raritan KVM switch, while others rave about a Cyclades
> >>terminal server.  I'm familiar with both products thru articles and
> >>advertisements, but have only used the cheap 4-port or 2-port KVM switches
> >>myself.  A KVM or terminal server setup might handle up to 128 or 256 console
> >>connections that are hardwired via serial cables or a special backplane
> >>connection.  The IBM blades have a special backplane that has the network
> >>wiring and console wiring embedded in it.
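> >>
> >>For the terminal server route, here is a minimal sketch of scripted console
> >>access (Python).  The host name and the port-numbering convention are
> >>assumptions about a typical Cyclades-style setup, not details from XXXXX's
> >>install:
> >>
> >>   import telnetlib
> >>
> >>   TS_HOST = "ts1.example.org"   # hypothetical terminal server
> >>   BASE_PORT = 7000              # assumed mapping: serial line N -> TCP 7000+N
> >>
> >>   def read_console_banner(serial_line, timeout=5):
> >>       tn = telnetlib.Telnet(TS_HOST, BASE_PORT + serial_line, timeout)
> >>       tn.write(b"\r\n")                       # nudge the console
> >>       banner = tn.read_until(b"login:", timeout)
> >>       tn.close()
> >>       return banner
> >>
> >>   # Grab the login banner from the box on serial line 12.
> >>   print(read_console_banner(12))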
> >>
> >>You need the steel loading ramp when your delivery guys take a running start
> >>to get a 1-ton server up onto the raised floor!
> >>
> >>XXXXX has two 30A 240V circuits per cabinet installed in the floor.  I imagine
> >>these are in the floor just past the rear of the cabinet's footprint.
> >>
> >>Eleven inches of overhead clearance are needed for cable raceways if overhead
> >>cable runs are put in place.
> >>
> >>At XXXXX they have one server room for proprietary servers and one room for
> >>standard rack-based servers.
> >>
> >>
> >>6.  Networking
> >>
> >>Not discussed yet.
> > 
> > 
> > 


