[ale] HADTOO: Automatic-node-generating Hadoop cluster on Gentoo Linux
Jeff Hubbs
jhubbslist at att.net
Mon Sep 10 11:22:46 EDT 2018
On 9/9/18 9:35 AM, Jim Kinney wrote:
> Ha!!
>
> My ultimate plan with the APS project was to cobble all the "waste
> machines" together into a system-wide distributed storage cluster then
> run a single-system-image cluster tool like MOSIX or Kerrighed. That
> would make the entire school district one giant supercomputer that
> K12 students use :-)
I haven't had occasion to use this yet, but Hadoop has a rack-awareness
feature: you can divide your nodes into "racks" so that Hadoop knows to
expect network bottlenecks between groups of nodes and places storage
and processing accordingly. In the APS environment, one would group
each school into a "rack" (I have no idea whether Hadoop supports
"racks" of "racks").
>
> Excellent work on the hadoop cluster!
Thanks! It's been interesting and I've covered a few things along the
way I'd long wanted to be able to do (like PXE-booting), so, bonus.
>
> I have my cluster nodes set to always PXE-boot, and the PXE default
> is to fall back to the local boot drive. That way I can drive a new
> install/rebuild by twiddling a file on the DHCP server and rebooting
> a node. Eventually, the nodes will use no local storage for the OS;
> all of it will be reserved for /tmp (RAID 0 across all drives for
> speed), with an NFS-mounted root and remote logging. Basically a
> homegrown PaaS, controlled by the needs defined at job submission.
I thought about running the working instance out of RAM and keeping
everything else NFS-based, but I decided against it. For one thing,
Hadoop activity alone hammers the LAN, and if a lot of additional
traffic (the Hadoop binary distribution is about 830MiB) has to
concentrate on, say, the edge node where the worker node instance
actually lives while there's a lot of HDFS traffic shooting from node
to node, that's not cool.
>
> My current jobs are compiled MATLAB and custom Python. Hadoop is
> coming back. I can't run Hadoop on the same nodes as the others since
> it assumes full control of the system, and there isn't enough demand
> yet for a dedicated Hadoop stack. So a reboot into an NFS-root Hadoop
> cluster, with a temporary "node offline" status in Torque, seems
> feasible: use pre-run scripts in Torque to reboot a node into Hadoop
> mode and post-run scripts to reboot it back to normal cluster mode.
Sounds reasonable.
>
> On September 7, 2018 11:54:04 PM EDT, Jeff Hubbs via Ale <ale at ale.org>
> wrote:
>
> On 9/7/18 4:52 PM, dev null zero two wrote:
>> that is pretty incredible (never thought to use Gentoo for this
>> purpose).
> I use it for everything. :)
>>
>> have you thought about using orchestration tools for this
>> (Kubernetes etc.)?
> I have been immersed in enough IRC/forum/StackExchange traffic to
> know that people do this, but so far I haven't seen a need, beyond
> simply crafting a single readily replicable Linux instance, that
> justifies the added complexity. I can make changes to that instance
> in chroot on the edge node and then reboot the workers and any other
> daemon-running machines to pick them up; to facilitate this, I set
> the machines to boot to disk first and PXE second, and I have a
> script that "breaks" the first drive on each worker node by
> overwriting the GPT partition table and forcing a reboot. I can also
> still "fan" changes via ssh across all nodes, serially or
> simultaneously, and I still have the NFS mechanism available to me
> (right now, it handles only the Hadoop workers file).
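To illustrate, the "break and reinstall" fan-out is nothing fancier
than something like this (untested sketch; the hostnames, and the
assumption that the first drive is /dev/sda, are made up for the
example):

    #!/bin/bash
    # Sketch only: wipe each worker's GPT so the disk-first/PXE-second
    # boot order falls through to PXE, then force a reboot.
    for node in worker01 worker02 worker03; do    # hypothetical hostnames
        ssh root@"$node" 'sgdisk --zap-all /dev/sda; reboot' &
    done
    wait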
>
> By the way, I've done this thing where the node setup script opens
> the optical media tray, which then automatically closes after a
> few minutes when the machine reboots. The on-disk instance is set
> to open and immediately close the tray at boot, so if all the
> machines have optical drives with trays that can open and close on
> command there will be quite an entertaining racket when a whole
> cluster starts up.
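For reference, that trick is just eject(1); roughly, and assuming the
drive shows up as /dev/sr0:

    eject /dev/sr0                        # node setup script: pop the tray open
    eject /dev/sr0 && eject -t /dev/sr0   # on-disk instance at boot: open, then pull it right back in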
>
> Hey, Jim/Aaron - just think; I could have turned Sutton Middle
> School into a 500-node Hadoop cluster! There's a 24-seat lab at
> work that'd be good for about 3.3TiB of HDFS.
>>
>> On Fri, Sep 7, 2018 at 4:46 PM Jeff Hubbs via Ale <ale at ale.org
>> <mailto:ale at ale.org>> wrote:
>>
>> For the past few months, I've been operating an Apache Hadoop
>> cluster at Emory University's Goizueta Business School. That
>> cluster is Gentoo-Linux-based and consists of a dual-homed
>> "edge node" and three 16-GiB-RAM 16-thread two-disk "worker"
>> nodes. The edge node provides NAT for the active cluster
>> nodes and holds a complete mirror of the Gentoo package
>> repository that is updated nightly. There is also an
>> auxiliary edge node (a one-piece Dell Vostro 320) with xorg
>> and xfce that I mostly use to display exported instances of
>> xosview from all of the other nodes so that I can keep an eye
>> on the cluster's operation. Each of the worker nodes carries
>> a standalone Gentoo Linux instance that was flown in via
>> rsync from another node while booted to a liveCD-style
>> distribution (SystemRescueCD, which happens to be Gentoo-based).
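For anyone curious about the "flown in via rsync" part, it is roughly
the following, run from the target machine while booted to
SystemRescueCD (sketch only; the donor hostname, device names, and
mountpoints are assumptions):

    # Untested sketch of cloning a Gentoo instance onto a new worker.
    mkdir -p /mnt/gentoo
    mount /dev/sda3 /mnt/gentoo            # root partition (assumed layout)
    mkdir -p /mnt/gentoo/boot
    mount /dev/sda2 /mnt/gentoo/boot       # /boot
    rsync -aHAX --numeric-ids \
        --exclude='/proc/*' --exclude='/sys/*' --exclude='/dev/*' \
        --exclude='/run/*'  --exclude='/tmp/*' \
        root@donor-node:/  /mnt/gentoo/

plus the usual fstab and bootloader fix-ups afterward.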
>>
>> I have since set up the main edge node to form a "shadow
>> cluster" in addition to the one I've been operating. Via iPXE
>> and dnsmasq on the edge node, any x86_64 system that is
>> connected to the internal cluster network and allowed to
>> PXE-boot will download a stripped-down Gentoo instance via
>> HTTP (served up by nginx), boot to this instance in RAM, and
>> execute a bash script that finds, partitions, and formats all
>> of that system's disks, downloads and writes to those disks a
>> complete Gentoo Linux instance, installs and configures the
>> GRUB bootloader, sets a hostname based on the system's first
>> NIC's MAC address, and reboots the system into that
>> freshly-written instance.
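The dnsmasq side of that chain is small; from memory it boils down to
something like the following (sketch; the interface name, addresses,
and filenames are assumptions, not my actual config):

    # Untested sketch of the dnsmasq PXE/iPXE chainload configuration.
    cat >> /etc/dnsmasq.conf <<'EOF'
    interface=enp2s0
    dhcp-range=192.168.10.100,192.168.10.200,12h
    enable-tftp
    tftp-root=/srv/tftp
    # Plain PXE firmware chainloads iPXE over TFTP...
    dhcp-boot=undionly.kpxe
    # ...and iPXE (which sets DHCP option 175) fetches its script over HTTP.
    dhcp-match=set:ipxe,175
    dhcp-boot=tag:ipxe,http://192.168.10.1/boot.ipxe
    EOF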
>>
>> At present, there is only one read/write NFS export on the
>> edge node and it holds a flat file that Hadoop uses as a list
>> of available worker nodes. The list is populated by the
>> aforementioned node setup script after the hostname is generated.
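The setup script's contribution to that file is essentially a one-liner
(the NFS mount point here is an assumption for the example):

    # Register the freshly built node in the shared workers list.
    echo "$(hostname)" >> /mnt/cluster-nfs/hadoop/workers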
>>
>> Both the PXE-booted Gentoo Linux instance and the on-disk
>> instance are managed within a chroot on the edge node in a
>> manner not unlike how Gentoo Linux is conventionally
>> installed on a system. Once set up as desired, these
>> instances are compressed into separate squashfs files and
>> placed in the nginx doc root. In the case of the PXE-booted
>> instance, there is an intermediate step where much of the
>> instance is stripped away just to reduce the size of the
>> squashfs file, which is currently 431MiB. The full cluster
>> node distribution file is 1.6GiB but I sometimes exclude the
>> kernel source tree and local package meta-repository to bring
>> it down to 1.1GiB. The on-disk footprint of the complete
>> worker node instance is 5.9GiB.
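The compression step is just mksquashfs pointed at the managed chroot
and aimed at the nginx doc root; roughly (paths and exclusions here are
assumptions):

    # Untested sketch: squash the chroot-managed instance and publish it.
    mksquashfs /var/chroots/worker /srv/www/htdocs/worker.squashfs \
        -comp xz -noappend \
        -e usr/src/linux    # optionally drop the kernel source tree to shrink it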
>>
>> The node setup script takes the first drive it finds and
>> GPT-partitions it six ways: 1) a 2MiB "spacer" for the
>> bootloader; 2) 256MiB for /boot; 3) 32GiB for root; 4) 2xRAM
>> for swap (this is WAY overkill; it's set by ratio in the
>> script and a ratio of one or less would suffice); 5) 64GiB
>> for /tmp/hadoop-yarn (more about this later); 6) whatever is
>> left for /hdfs1. Any remaining disks identified are
>> single-partitioned as /hdfs2, /hdfs3, etc. All partitions are
>> formatted btrfs with the exception of /boot, which is vfat
>> for UEFI compatibility (a route I went down because I have
>> one old laptop I found that was UEFI-only and I expect that
>> will become more the case than less over time). A
>> quasi-boolean in the script optionally enables compression at
>> mount time for /tmp/hadoop-yarn.
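In sgdisk/mkfs terms, that layout works out to roughly the following
(sketch only; the real script computes the swap size from RAM and loops
over whatever disks it finds rather than assuming /dev/sda):

    # Untested sketch of the six-way layout on the first disk.
    sgdisk --zap-all /dev/sda
    sgdisk -n 1:0:+2M   -t 1:ef02 /dev/sda    # bootloader "spacer" (BIOS boot)
    sgdisk -n 2:0:+256M -t 2:ef00 /dev/sda    # /boot (vfat, UEFI-friendly)
    sgdisk -n 3:0:+32G  -t 3:8300 /dev/sda    # root
    sgdisk -n 4:0:+32G  -t 4:8200 /dev/sda    # swap (2xRAM on a 16GiB node)
    sgdisk -n 5:0:+64G  -t 5:8300 /dev/sda    # /tmp/hadoop-yarn
    sgdisk -n 6:0:0     -t 6:8300 /dev/sda    # /hdfs1 gets the rest
    mkfs.vfat -F 32 /dev/sda2
    mkfs.btrfs -f /dev/sda3; mkfs.btrfs -f /dev/sda5; mkfs.btrfs -f /dev/sda6
    mkswap /dev/sda4
    # Optional transparent compression for the YARN scratch space, e.g.:
    # mount -o compress=zstd /dev/sda5 /tmp/hadoop-yarn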
>>
>> One of Gentoo Linux's strengths is the ability to compile
>> software specifically for the CPU but the node instance is
>> set up with the gcc option -mtune=generic. Another
>> quasi-boolean setting in the node setup script will change
>> that to -march=native, but that change only takes effect when
>> packages are built or rebuilt locally (as opposed to in chroot on
>> the edge node, where everything must be built generic). I can
>> couple this with another option to
>> optionally rebuild all the system's binaries native but
>> that's an operation that would take a fair bit of time
>> (that's over 500 packages and only some of them would affect
>> cluster operation). Similarly, in the interest of
>> run-what-ya-brung flexibility, I'm using Gentoo's genkernel
>> utility to generate a kernel and initrd befitting a
>> liveCD-style instance that will boot on basically any x86-64
>> along with whatever NICs and disk controllers it finds.
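Concretely, the "go native" toggle is just a tweak to
/etc/portage/make.conf followed, if you want everything native, by a
full rebuild; roughly (the sed is a simplification of what the script
really does):

    # Untested sketch of switching a node from generic to native code.
    sed -i 's/-mtune=generic/-march=native/' /etc/portage/make.conf
    # Optional full native rebuild (500+ packages; takes a while):
    emerge --emptytree @world
    # Separately, the boot-anywhere kernel and initrd come from genkernel:
    genkernel all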
>>
>> I am using the Hadoop binary distribution (currently 3.1.1)
>> as distributed directly by Apache (no HortonWorks; no
>> Cloudera). Each cluster node has its own Hadoop distribution
>> and each node's Hadoop distribution has configuration
>> features both in common and specific to that node, modified
>> in place by the node setup script. In the latter case, the
>> amount of available RAM, the number of available CPU threads,
>> and the list of available HDFS partitions on a system are
>> flown into the proper local config files. Hadoop services run
>> in a Java VM; I am currently using the IcedTea 3.8.0 source
>> distribution supplied within Gentoo's packaging system. I
>> have also run it under the IcedTea binary distribution and
>> the Oracle JVM with equal success.
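As an illustration of the "flown into the proper local config files"
part, the per-node pieces amount to something along these lines
(sketch; the placeholder tokens and the use of sed are simplifications
of what the script does, and HADOOP_HOME is assumed to point at the
unpacked distribution):

    # Untested sketch: derive node-specific values and stamp them into
    # placeholder tokens in the Hadoop config templates.
    MEM_MB=$(( $(awk '/MemTotal/ {print $2}' /proc/meminfo) / 1024 ))
    VCORES=$(nproc)
    HDFS_DIRS=$(ls -d /hdfs[0-9]* | paste -sd, -)
    CONF=$HADOOP_HOME/etc/hadoop
    sed -i "s|@MEM_MB@|${MEM_MB}|"       "$CONF/yarn-site.xml"  # yarn.nodemanager.resource.memory-mb
    sed -i "s|@VCORES@|${VCORES}|"       "$CONF/yarn-site.xml"  # yarn.nodemanager.resource.cpu-vcores
    sed -i "s|@HDFS_DIRS@|${HDFS_DIRS}|" "$CONF/hdfs-site.xml"  # dfs.datanode.data.dir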
>>
>> Hadoop is made up of three primary constructs. HDFS
>> (Hadoop Distributed File System) consists of a NameNode
>> daemon that runs on a single machine and controls the
>> filesystem namespace and user access to it; DataNode daemons
>> run on each worker node and coordinate between the NameNode
>> daemon and the local machine's on-disk filesystem. You access
>> the filesystem with command-line-like options to the hdfs
>> binary like -put, -get, -ls, -mkdir, etc. but in the on-disk
>> filesystem underneath /hdfs1.../hdfsN, the files you write
>> are cut up into "blocks" (default size: 128MiB) and those
>> blocks are replicated (default: three times) among all the
>> worker nodes. My initial cluster with standalone workers
>> reported 7.2TiB of HDFS available spread across six physical
>> spindles. As you can imagine, it's possible to accumulate
>> tens of TiB of HDFS across only a handful of nodes but doing
>> so isn't necessarily helpful.
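In day-to-day terms that looks like this (the paths are just examples):

    hdfs dfs -mkdir -p /user/jeff            # make a directory in HDFS
    hdfs dfs -put big-input.csv /user/jeff/  # split into 128MiB blocks, replicated 3x
    hdfs dfs -ls /user/jeff
    hdfs dfsadmin -report | head             # shows total/remaining HDFS capacity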
>>
>> YARN (Yet Another Resource Negotiator) is the construct that
>> manages the execution of work among the nodes. Part of the
>> whole point behind Hadoop is to /move the processing to where
>> the data is/, and it's YARN that coordinates all that. It
>> consists of a ResourceManager daemon that communicates with
>> all the worker nodes and NodeManager daemons that run on each
>> of the worker nodes. You can run the ResourceManager daemon
>> and HDFS' NameNode daemon on the same machines that act as
>> worker nodes but past a point you won't want to and past
>> /that/ point you'd want to run each of NameNode and
>> ResourceManager on two separate machines. In that regime,
>> you'd have two machines dedicated to those roles (their names
>> would be taken out of the centrally-located workers file) and
>> the rest would run both the DataNode and NodeManager daemons,
>> forming the HDFS storage subsystem and the YARN execution
>> subsystem.
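A quick way to see that arrangement from the command line:

    yarn node -list -all      # NodeManagers known to the ResourceManager
    yarn application -list    # applications currently running on them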
>>
>> There is another construct, MapReduce, whose architecture I
>> don't fully understand yet; it comes into play as a later
>> phase in Hadoop computations and there is a JobHistoryServer
>> daemon associated with it.
>>
>> Another place where the bridge is out with respect to my
>> understanding of Hadoop is coding for it - but I'll get there
>> eventually. There are other apps like Apache's Spark and Hive
>> that use HDFS and/or YARN that I have better mental insight
>> into, and I have successfully gotten Python/Spark demo
>> programs to run on YARN in my cluster.
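For the record, those demo runs were along the lines of the following
(the example path is the one that ships in the Spark binary
distribution, so treat the specifics as illustrative):

    # Submit Spark's bundled Pi estimator to YARN.
    spark-submit --master yarn --deploy-mode client \
        $SPARK_HOME/examples/src/main/python/pi.py 100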
>>
>> One thing I have learned is that Hadoop clusters do not
>> "genericize" well. When I first tried running the
>> Hadoop-supplied teragen/terasort example (goal: make a file
>> of 10^10 100-character lines and sort it), it failed for want
>> of space available in /tmp/hadoop-yarn but it ran perfectly
>> when the file was cut down to 1/100th that size. For my
>> PXE-boot-based cluster, I gave my worker nodes a separate
>> partition for /tmp/hadoop-yarn and gave it optional
>> transparent compression. There are a lot of parameters (things
>> like the minimum size and size increment of memory containers,
>> plus JVM settings) that I haven't messed with, but to optimize
>> the cluster for a given job, one would expect to.
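For reference, the example runs like this; the jar path is the one in
the Apache binary distribution, and the row count shown is the
1/100th-size run that actually succeeded for me:

    EXAMPLES=$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar
    hadoop jar $EXAMPLES teragen 100000000 /teragen-out    # 10^8 rows of 100 bytes
    hadoop jar $EXAMPLES terasort /teragen-out /terasort-out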
>>
>> What I have right now - basically, a single Gentoo Linux
>> instance for installation on a dual-homed edge node - is able
>> to generate a working Hadoop cluster with an arbitrary number
>> of nodes, limited primarily by space, cooling, and electric
>> power (the Dell Optiplex desktops I'm using right now max out
>> at about an amp, so you have to be prepared to supply at
>> least N amps for N nodes). They can be purpose-built
>> rack-mount servers, a lab environment full of thin clients,
>> or wire shelf units full of discarded desktops and laptops.
>>
>> - Jeff
>>
>>
>>
>> --
>> Sent from my mobile. Please excuse the brevity, spelling, and
>> punctuation.
>
>
>
> --
> Sent from my Android device with K-9 Mail. All tyopes are thumb
> related and reflect authenticity.