[ale] HADTOO: Automatic-node-generating Hadoop cluster on Gentoo Linux
Jim Kinney
jim.kinney at gmail.com
Sun Sep 9 09:35:21 EDT 2018
Ha!!
My ultimate plan with the APS project was to cobble all the "waste machines" together into a system-wide distributed storage cluster, then run a single-system-image cluster tool like MOSIX or Kerrighed. That would make the entire school district one giant single-instance supercomputer that K-12 students use :-)
Excellent work on the Hadoop cluster!
I have my cluster nodes set to always PXE-boot, with the PXE default set to fall back to the local boot drive. That way I can drive a new install/rebuild by twiddling a file on the DHCP server and rebooting a node. Eventually, the nodes will use no local storage for the OS; all of it will be reserved for /tmp (RAID 0 across all drives for speed), with an NFS-mounted root and remote logging. Basically a homegrown PaaS setup, controlled by the needs defined at job submission.
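
A rough sketch of that mechanism (paths, filenames, and the MAC address
are made up for illustration; this assumes PXELINUX on the DHCP/TFTP
host):

    # /srv/tftp/pxelinux.cfg/default -- every node gets this by default
    DEFAULT local
    PROMPT 0
    TIMEOUT 30
    LABEL local
      LOCALBOOT 0

    # /srv/tftp/pxelinux.cfg/01-aa-bb-cc-dd-ee-ff -- per-MAC override;
    # drop it in place and reboot the node to rebuild it, delete it and
    # the node goes back to booting from its own disk
    DEFAULT rebuild
    LABEL rebuild
      KERNEL installer/vmlinuz
      APPEND initrd=installer/initrd.img
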
My current jobs are compiled MATLAB and custom Python. Hadoop is coming back, but it can't run on the same nodes as the other workloads since it assumes full control of the system, and there isn't enough demand yet for a dedicated Hadoop stack. So a reboot into an NFS-root Hadoop cluster, with a temporary "node offline" status in Torque, seems feasible: use pre-run scripts in Torque to reboot a node into Hadoop mode and post-run scripts to reboot it back into normal cluster mode.
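
An untested sketch of what the Torque hooks would drive (hostnames and
the /etc/ethers lookup are assumptions; the per-MAC PXE configs are the
ones sketched above):

    #!/bin/bash
    # flip-node.sh <node> <hadoop|normal> -- run on the DHCP/TFTP host
    node=$1; mode=$2
    mac=$(awk -v n="$node" '$2 == n {print $1}' /etc/ethers | tr ':' '-')
    ln -sf "${mode}.cfg" "/srv/tftp/pxelinux.cfg/01-${mac}"
    if [ "$mode" = hadoop ]; then
        pbsnodes -o "$node"    # hold it out of normal Torque scheduling
    else
        pbsnodes -c "$node"    # put it back in the pool
    fi
    ssh "$node" reboot

The Torque pre-run/post-run scripts would just call this with "hadoop"
or "normal" for the nodes the job landed on.
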
On September 7, 2018 11:54:04 PM EDT, Jeff Hubbs via Ale <ale at ale.org> wrote:
>On 9/7/18 4:52 PM, dev null zero two wrote:
>> that is pretty incredible (never thought to use Gentoo for this purpose).
>I use it for everything. :)
>>
>> have you thought about using orchestration tools for this (Kubernetes etc.)?
>I have been immersed in enough IRC/forum/StackExchange traffic that I
>know that people do this, but thus far I haven't seen a need past the
>simple crafting of a single readily replicable Linux instance that
>justifies the added complexity. In addition to being able to make
>changes to that instance in chroot on the edge node and reboot the
>workers and any other daemon-running machines (to facilitate this, I
>set the machines to boot to disk first and PXE second and then I have
>a script that "breaks" the first drive on each worker node by
>overwriting the GPT partition table and forcing a reboot), I can still
>"fan" changes via ssh across all nodes serially or simultaneously and
>I still have the NFS mechanism available to me (right now, it handles
>only the Hadoop workers file).
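>
>The "break the disk" step amounts to something like this (device name
>assumed; sgdisk's --zap-all clobbers both the primary and backup GPT):
>
>    sgdisk --zap-all /dev/sda   # wipe GPT and protective MBR
>    reboot                      # firmware falls through to PXE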
>
>By the way, I've done this thing where the node setup script opens the
>optical media tray, which then automatically closes after a few
>minutes when the machine reboots. The on-disk instance is set to open
>and immediately close the tray at boot, so if all the machines have
>optical drives with trays that can open and close on command there
>will be quite an entertaining racket when a whole cluster starts up.
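>
>(For the record, the tray business is nothing fancier than this,
>assuming the usual /dev/sr0 device:
>
>    eject /dev/sr0      # pop the tray open
>    eject -t /dev/sr0   # pull it back in
>
>The setup script does the first; the on-disk instance does both at
>boot from something like an /etc/local.d start script.)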
>
>Hey, Jim/Aaron - just think; I could have turned Sutton Middle School
>into a 500-node Hadoop cluster! There's a 24-seat lab at work that'd
>be good for about 3.3TiB of HDFS.
>>
>> On Fri, Sep 7, 2018 at 4:46 PM Jeff Hubbs via Ale <ale at ale.org> wrote:
>>
>> For the past few months, I've been operating an Apache Hadoop
>> cluster at Emory University's Goizueta Business School. That
>> cluster is Gentoo-Linux-based and consists of a dual-homed "edge
>> node" and three 16-GiB-RAM 16-thread two-disk "worker" nodes. The
>> edge node provides NAT for the active cluster nodes and holds a
>> complete mirror of the Gentoo package repository that is updated
>> nightly. There is also an auxiliary edge node (a one-piece Dell
>> Vostro 320) with xorg and xfce that I mostly use to display
>> exported instances of xosview from all of the other nodes so that
>> I can keep an eye on the cluster's operation. Each of the worker
>> nodes carries a standalone Gentoo Linux instance that was flown in
>> via rsync from another node while booted to a liveCD-style
>> distribution (SystemRescueCD, which happens to be Gentoo-based).
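>>
>> (The "fly in" itself is essentially one rsync from the donor node
>> into the freshly mounted disks - roughly, with hostnames and
>> mountpoints invented here:
>>
>>     mount /dev/sda3 /mnt/gentoo
>>     mkdir -p /mnt/gentoo/boot && mount /dev/sda2 /mnt/gentoo/boot
>>     # -x keeps rsync on the donor's root fs (skips /proc, /hdfs*, etc.)
>>     rsync -aHAXx root@donor-node:/ /mnt/gentoo/
>>
>> followed by fixing up fstab, the hostname, and the bootloader.)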
>>
>> I have since set up the main edge node to form a "shadow cluster"
>> in addition to the one I've been operating. Via iPXE and dnsmasq
>> on the edge node, any x86_64 system that is connected to the
>> internal cluster network and allowed to PXE-boot will download a
>> stripped-down Gentoo instance via HTTP (served up by nginx), boot
>> to this instance in RAM, and execute a bash script that finds,
>> partitions, and formats all of that system's disks, downloads and
>> writes to those disks a complete Gentoo Linux instance, installs
>> and configures the GRUB bootloader, sets a hostname based on the
>> system's first NIC's MAC address, and reboots the system into that
>> freshly-written instance.
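>>
>> The dnsmasq side of that is compact; a trimmed sketch (addresses,
>> interface, and filenames are placeholders):
>>
>>     # /etc/dnsmasq.conf (excerpt) on the edge node
>>     interface=eth1
>>     dhcp-range=10.42.0.100,10.42.0.250,12h
>>     enable-tftp
>>     tftp-root=/srv/tftp
>>     # plain PXE ROMs get chainloaded into iPXE...
>>     dhcp-match=set:ipxe,175
>>     dhcp-boot=tag:!ipxe,undionly.kpxe
>>     # ...and iPXE then pulls its script, kernel, and squashfs over HTTP
>>     dhcp-boot=tag:ipxe,http://10.42.0.1/boot.ipxe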
>>
>> At present, there is only one read/write NFS export on the edge
>> node and it holds a flat file that Hadoop uses as a list of
>> available worker nodes. The list is populated by the
>> aforementioned node setup script after the hostname is generated.
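>>
>> (Populating it is nothing more than an append over NFS - something
>> like this, with the mountpoint invented here:
>>
>>     mount edge:/export/hadoop-etc /mnt/hadoop-etc
>>     echo "$(hostname)" >> /mnt/hadoop-etc/workers
>>
>> so freshly built nodes add themselves to the workers list.)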
>>
>> Both the PXE-booted Gentoo Linux instance and the on-disk instance
>> are managed within a chroot on the edge node in a manner not
>> unlike how Gentoo Linux is conventionally installed on a system.
>> Once set up as desired, these instances are compressed into
>> separate squashfs files and placed in the nginx doc root. In the
>> case of the PXE-booted instance, there is an intermediate step
>> where much of the instance is stripped away just to reduce the
>> size of the squashfs file, which is currently 431MiB. The full
>> cluster node distribution file is 1.6GiB but I sometimes exclude
>> the kernel source tree and local package meta-repository to bring
>> it down to 1.1GiB. The on-disk footprint of the complete worker
>> node instance is 5.9GiB.
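>>
>> Building each image is essentially one mksquashfs call; roughly
>> (paths and exclusions are illustrative):
>>
>>     mksquashfs /srv/chroots/worker \
>>         /var/www/localhost/htdocs/worker.squashfs \
>>         -comp xz -noappend \
>>         -e usr/src -e usr/portage   # the slimmed-down variant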
>>
>> The node setup script takes the first drive it finds and
>> GPT-partitions it six ways: 1) a 2MiB "spacer" for the bootloader;
>> 2) 256MiB for /boot; 3) 32GiB for root; 4) 2xRAM for swap (this is
>> WAY overkill; it's set by ratio in the script and a ratio of one
>> or less would suffice); 5) 64GiB for /tmp/hadoop-yarn (more about
>> this later); 6) whatever is left for /hdfs1. Any remaining disks
>> identified are single-partitioned as /hdfs2, /hdfs3, etc. All
>> partitions are formatted btrfs with the exception of /boot, which
>> is vfat for UEFI compatibility (a route I went down because I have
>> one old laptop I found that was UEFI-only and I expect that will
>> become more the case than less over time). A quasi-boolean in the
>> script optionally enables compression at mount time for
>> /tmp/hadoop-yarn.
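>>
>> In sgdisk terms the layout boils down to something like this (device
>> name and the swap arithmetic are illustrative; the real script loops
>> over every disk it finds):
>>
>>     swap_mib=$(( $(awk '/MemTotal/ {print $2}' /proc/meminfo) * 2 / 1024 ))
>>     sgdisk --zap-all /dev/sda
>>     sgdisk -n 1:0:+2M    -t 1:ef02 /dev/sda   # bootloader spacer
>>     sgdisk -n 2:0:+256M  -t 2:ef00 /dev/sda   # /boot (vfat ESP)
>>     sgdisk -n 3:0:+32G   -t 3:8300 /dev/sda   # root (btrfs)
>>     sgdisk -n 4:0:+${swap_mib}M -t 4:8200 /dev/sda   # swap (2xRAM)
>>     sgdisk -n 5:0:+64G   -t 5:8300 /dev/sda   # /tmp/hadoop-yarn (btrfs)
>>     sgdisk -n 6:0:0      -t 6:8300 /dev/sda   # /hdfs1 (rest of disk)
>>     mkfs.vfat -F 32 /dev/sda2 && mkswap /dev/sda4
>>     for p in 3 5 6; do mkfs.btrfs -f /dev/sda$p; done
>>     # the compression quasi-boolean ends up as an fstab mount option,
>>     # e.g. compress=zstd on /tmp/hadoop-yarn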
>>
>> One of Gentoo Linux's strengths is the ability to compile software
>> specifically for the CPU, but the node instance is set up with the
>> gcc option -mtune=generic. Another quasi-boolean setting in the
>> node setup script will change that to -march=native, but that
>> change will only take effect when packages are built or rebuilt
>> locally (as opposed to in chroot on the edge node, where
>> everything must be built generic). I can couple this feature with
>> another one to optionally rebuild all the system's binaries
>> natively, but that's an operation that would take a fair bit of
>> time (that's over 500 packages, and only some of them would affect
>> cluster operation). Similarly, in the interest of
>> run-what-ya-brung flexibility, I'm using Gentoo's genkernel
>> utility to generate a kernel and initrd befitting a liveCD-style
>> instance that will boot on basically any x86-64 machine along with
>> whatever NICs and disk controllers it finds.
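>>
>> In make.conf terms, flipping that quasi-boolean amounts to roughly
>> this on the node (illustrative; the full rebuild is the slow part):
>>
>>     # shipped default, as built in the chroot on the edge node:
>>     #   CFLAGS="-O2 -pipe -mtune=generic"
>>     #   CXXFLAGS="${CFLAGS}"
>>     sed -i 's/-mtune=generic/-march=native/' /etc/portage/make.conf
>>     emerge --emptytree @world   # optional: rebuild 500+ packages natively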
>>
>> I am using the Hadoop binary distribution (currently 3.1.1) as
>> distributed directly by Apache (no HortonWorks; no Cloudera). Each
>> cluster node has its own Hadoop distribution and each node's
>> Hadoop distribution has configuration features both in common and
>> specific to that node, modified in place by the node setup script.
>> In the latter case, the amount of available RAM, the number of
>> available CPU threads, and the list of available HDFS partitions
>> on a system are flown into the proper local config files. Hadoop
>> services run in a Java VM; I am currently using the IcedTea 3.8.0
>> source distribution supplied within Gentoo's packaging system. I
>> have also run it under the IcedTea binary distribution and the
>> Oracle JVM with equal success.
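>>
>> The "flying in" is plain shell against templated config files -
>> something like this (the @PLACEHOLDER@ names are invented; the YARN
>> property names they feed are the real ones):
>>
>>     mem_mb=$(( $(awk '/MemTotal/ {print $2}' /proc/meminfo) / 1024 ))
>>     vcores=$(nproc)
>>     sed -i -e "s/@YARN_MEM@/$(( mem_mb - 2048 ))/" \
>>            -e "s/@YARN_VCORES@/${vcores}/" \
>>         "${HADOOP_HOME}/etc/hadoop/yarn-site.xml"
>>     # ...filling in yarn.nodemanager.resource.memory-mb and
>>     # yarn.nodemanager.resource.cpu-vcores; the HDFS partition list
>>     # lands in dfs.datanode.data.dir in hdfs-site.xml the same way.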
>>
>> Hadoop has three primary constructs that make it up. HDFS (Hadoop
>> Distributed File System) consists of a NameNode daemon that runs
>> on a single machine and controls the filesystem namespace and
>> user access to it; DataNode daemons run on each worker node and
>> coordinate between the NameNode daemon and the local machine's
>> on-disk filesystem. You access the filesystem with
>> command-line-like options to the hdfs binary like -put, -get,
>> -ls, -mkdir, etc. but in the on-disk filesystem underneath
>> /hdfs1.../hdfsN, the files you write are cut up into "blocks"
>> (default size: 128MiB) and those blocks are replicated (default:
>> three times) among all the worker nodes. My initial cluster with
>> standalone workers reported 7.2TiB of HDFS available spread
>> across six physical spindles. As you can imagine, it's possible
>> to accumulate tens of TiB of HDFS across only a handful of nodes
>> but doing so isn't necessarily helpful.
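>>
>> Day to day that looks like, for example:
>>
>>     hdfs dfs -mkdir -p /user/jeff
>>     hdfs dfs -put big-dataset.csv /user/jeff/
>>     hdfs dfs -ls /user/jeff
>>     hdfs dfsadmin -report   # per-DataNode capacity, blocks, etc.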
>>
>> YARN (Yet Another Resource Negotiator) is the construct that
>> manages the execution of work among the nodes. Part of the whole
>> point behind Hadoop is to /move the processing to where the data
>> is/ and it's YARN that coordinates all that. It consists of a
>> ResourceManager daemon that communicates with all the worker
>> nodes and NodeManager daemons that run on each of the worker
>> nodes. You can run the ResourceManager daemon and HDFS' NameNode
>> daemon on the same machines that act as worker nodes but past a
>> point you won't want to and past /that/ point you'd want to run
>> each of NameNode and ResourceManager on two separate machines. In
>> that regime, you'd have two machines dedicated to those roles
>> (their names would be taken out of the centrally-located workers
>> file) and the rest would run both the DataNode and NodeManager
>> daemons, forming the HDFS storage subsystem and the YARN
>> execution subsystem.
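>>
>> (For reference, bringing HDFS and YARN up is just the stock scripts
>> shipped in sbin/, run from wherever the NameNode and ResourceManager
>> live:
>>
>>     "${HADOOP_HOME}/sbin/start-dfs.sh"    # NameNode + DataNodes
>>     "${HADOOP_HOME}/sbin/start-yarn.sh"   # ResourceManager + NodeManagers
>>     yarn node -list                       # sanity check: workers RUNNING
>>
>> and it's the workers file that decides which machines get the
>> DataNode and NodeManager daemons.)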
>>
>> There is another construct, MapReduce, whose architecture I don't
>> fully understand yet; it comes into play as a later phase in
>> Hadoop computations and there is a JobHistoryServer daemon
>> associated with it.
>>
>> Another place where the bridge is out with respect to my
>> understanding of Hadoop is coding for it - but I'll get there
>> eventually. There are other apps like Apache's Spark and Hive
>> that use HDFS and/or YARN that I have better mental insight into,
>> and I have successfully gotten Python/Spark demo programs to run
>> on YARN in my cluster.
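>>
>> (A Python/Spark demo on YARN is a one-liner once HADOOP_CONF_DIR
>> points at the cluster's config directory - e.g. the stock pi
>> estimator that ships with Spark:
>>
>>     spark-submit --master yarn --deploy-mode cluster \
>>         "${SPARK_HOME}/examples/src/main/python/pi.py" 100
>>
>> with YARN doing the container scheduling underneath.)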
>>
>> One thing I have learned is that Hadoop clusters do not
>> "genericize" well. When I first tried running the Hadoop-supplied
>> teragen/terasort example (goal: make a file of 10^10
>> 100-character lines and sort it), it failed for want of space
>> available in /tmp/hadoop-yarn but it ran perfectly when the file
>> was cut down to 1/100th that size. For my PXE-boot-based cluster,
>> I gave my worker nodes a separate partition for /tmp/hadoop-yarn
>> and gave it optional transparent compression. There are a lot of
>> parameters for controlling things like minimum size and minimum
>> size increment of memory containers and JVM parameters that I
>> haven't messed with but to optimize the cluster for a given job,
>> one would expect to.
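>>
>> For reference, that example is driven like so (10^10 rows of 100
>> characters is roughly 1TB; the run that succeeded knocked two
>> zeros off the row count):
>>
>>     EX=${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar
>>     yarn jar "$EX" teragen 10000000000 /teraIn
>>     yarn jar "$EX" terasort /teraIn /teraOut
>>     yarn jar "$EX" teravalidate /teraOut /teraReport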
>>
>> What I have right now - basically, a single Gentoo Linux instance
>> for installation on a dual-homed edge node - is able to generate
>> a working Hadoop cluster with an arbitrary number of nodes,
>> limited primarily by space, cooling, and electric power (the Dell
>> Optiplex desktops I'm using right now max out at about an amp, so
>> you have to be prepared to supply at least N amps for N nodes).
>> They can be purpose-built rack-mount servers, a lab environment
>> full of thin clients, or wire shelf units full of discarded
>> desktops and laptops.
>>
>> - Jeff
>>
>>
--
Sent from my Android device with K-9 Mail. All tyopes are thumb related and reflect authenticity.