[ale] HADTOO: Automatic-node-generating Hadoop cluster on Gentoo Linux
Jeff Hubbs
jhubbslist at att.net
Mon Sep 10 11:22:46 EDT 2018
On 9/9/18 9:35 AM, Jim Kinney wrote:
> Ha!!
>
> My ultimate plan with the APS project was to cobble all the "waste
> machines" together into a system-wide distributed storage cluster then
> run a single-system-image cluster tool like MOSIX or Kerrighed. That
> would make the entire school district one giant supercomputer that
> K12 students use :-)
I haven't had occasion to use this yet, but Hadoop has a rack-awareness
feature: you can divide your nodes into "racks" so that Hadoop knows to
expect network bottlenecks between groups of nodes and places storage
and processing accordingly. In the APS environment, one would group
each school into a "rack" (I have no idea whether Hadoop supports
"racks" of "racks").
>
> Excellent work on the hadoop cluster!
Thanks! It's been interesting and I've covered a few things along the
way I'd long wanted to be able to do (like PXE-booting), so, bonus.
>
> I have my cluster nodes set to always PXE-boot, and the PXE default
> is to fall back to the local boot drive. That way I can drive a new
> install/rebuild by twiddling a file on the DHCP server and rebooting
> a node. Eventually, the nodes will use no local storage for the OS;
> all of it will be reserved for /tmp (RAID 0 across all drives for
> speed), with an NFS-mounted root and remote logging. Basically a
> homegrown PaaS, controlled by the needs defined at job submission.
I thought about running the working instance out of RAM and keeping
everything else NFS-based, but I decided against it. For one thing,
Hadoop activity alone hammers the LAN, and if a lot of additional
traffic (the Hadoop binary distribution is about 830MiB) has to
concentrate on, say, the edge node where the worker node instance
actually lives while there's a lot of HDFS traffic shooting from node
to node, that's not cool.
>
> My current jobs are compiled MATLAB and custom Python. Hadoop is
> coming back. I can't run Hadoop on the same nodes as the others since
> it assumes full control of the system, and there isn't enough demand
> yet for a dedicated Hadoop stack. So a reboot into an NFS-root Hadoop
> cluster, with a temporary "node offline" status in Torque, seems
> feasible: use pre-run scripts in Torque to reboot a node into Hadoop
> mode and post-run scripts to reboot it back to normal cluster mode.
Sounds reasonable.
>
> On September 7, 2018 11:54:04 PM EDT, Jeff Hubbs via Ale <ale at ale.org>
> wrote:
>
> On 9/7/18 4:52 PM, dev null zero two wrote:
>> that is pretty incredible (never thought to use Gentoo for this
>> purpose).
> I use it for everything. :)
>>
>> have you thought about using orchestration tools for this
>> (Kubernetes etc.)?
> I have been immersed in enough IRC/forum/StackExchange traffic to
> know that people do this, but so far I haven't seen a need, beyond
> simply crafting a single readily replicable Linux instance, that
> justifies the added complexity. I can make changes to that instance
> in chroot on the edge node and then reboot the workers and any other
> daemon-running machines to pick them up; to facilitate this, I set
> the machines to boot to disk first and PXE second, and I have a
> script that "breaks" the first drive on each worker node by
> overwriting the GPT partition table and forcing a reboot. I can also
> still "fan" changes via ssh across all nodes, serially or
> simultaneously, and I still have the NFS mechanism available to me
> (right now, it handles only the Hadoop workers file).
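To illustrate, the "break and reinstall" fan-out is nothing fancier
than something like this (untested sketch; the hostnames, and the
assumption that the first drive is /dev/sda, are made up for the
example):

    #!/bin/bash
    # Sketch only: wipe each worker's GPT so the disk-first/PXE-second
    # boot order falls through to PXE, then force a reboot.
    for node in worker01 worker02 worker03; do    # hypothetical hostnames
        ssh root@"$node" 'sgdisk --zap-all /dev/sda; reboot' &
    done
    wait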
>
> By the way, I've done this thing where the node setup script opens
> the optical media tray, which then automatically closes after a
> few minutes when the machine reboots. The on-disk instance is set
> to open and immediately close the tray at boot, so if all the
> machines have optical drives with trays that can open and close on
> command there will be quite an entertaining racket when a whole
> cluster starts up.
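For reference, that trick is just eject(1); roughly, and assuming the
drive shows up as /dev/sr0:

    eject /dev/sr0                        # node setup script: pop the tray open
    eject /dev/sr0 && eject -t /dev/sr0   # on-disk instance at boot: open, then pull it right back in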
>
> Hey, Jim/Aaron - just think; I could have turned Sutton Middle
> School into a 500-node Hadoop cluster! There's a 24-seat lab at
> work that'd be good for about 3.3TiB of HDFS.
>>
>> On Fri, Sep 7, 2018 at 4:46 PM Jeff Hubbs via Ale <ale at ale.org
>> <mailto:ale at ale.org>> wrote:
>>
>> For the past few months, I've been operating an Apache Hadoop
>> cluster at Emory University's Goizueta Business School. That
>> cluster is Gentoo-Linux-based and consists of a dual-homed
>> "edge node" and three 16-GiB-RAM 16-thread two-disk "worker"
>> nodes. The edge node provides NAT for the active cluster
>> nodes and holds a complete mirror of the Gentoo package
>> repository that is updated nightly. There is also an
>> auxiliary edge node (a one-piece Dell Vostro 320) with xorg
>> and xfce that I mostly use to display exported instances of
>> xosview from all of the other nodes so that I can keep an eye
>> on the cluster's operation. Each of the worker nodes carries
>> a standalone Gentoo Linux instance that was flown in via
>> rsync from another node while booted to a liveCD-style
>> distribution (SystemRescueCD, which happens to be Gentoo-based).
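For anyone curious about the "flown in via rsync" part, it is roughly
the following, run from the target machine while booted to
SystemRescueCD (sketch only; the donor hostname, device names, and
mountpoints are assumptions):

    # Untested sketch of cloning a Gentoo instance onto a new worker.
    mkdir -p /mnt/gentoo
    mount /dev/sda3 /mnt/gentoo            # root partition (assumed layout)
    mkdir -p /mnt/gentoo/boot
    mount /dev/sda2 /mnt/gentoo/boot       # /boot
    rsync -aHAX --numeric-ids \
        --exclude='/proc/*' --exclude='/sys/*' --exclude='/dev/*' \
        --exclude='/run/*'  --exclude='/tmp/*' \
        root@donor-node:/  /mnt/gentoo/

plus the usual fstab and bootloader fix-ups afterward.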
>>
>> I have since set up the main edge node to form a "shadow
>> cluster" in addition to the one I've been operating. Via iPXE
>> and dnsmasq on the edge node, any x86_64 system that is
>> connected to the internal cluster network and allowed to
>> PXE-boot will download a stripped-down Gentoo instance via
>> HTTP (served up by nginx), boot to this instance in RAM, and
>> execute a bash script that finds, partitions, and formats all
>> of that system's disks, downloads and writes to those disks a
>> complete Gentoo Linux instance, installs and configures the
>> GRUB bootloader, sets a hostname based on the system's first
>> NIC's MAC address, and reboots the system into that
>> freshly-written instance.
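The dnsmasq side of that chain is small; from memory it boils down to
something like the following (sketch; the interface name, addresses,
and filenames are assumptions, not my actual config):

    # Untested sketch of the dnsmasq PXE/iPXE chainload configuration.
    cat >> /etc/dnsmasq.conf <<'EOF'
    interface=enp2s0
    dhcp-range=192.168.10.100,192.168.10.200,12h
    enable-tftp
    tftp-root=/srv/tftp
    # Plain PXE firmware chainloads iPXE over TFTP...
    dhcp-boot=undionly.kpxe
    # ...and iPXE (which sets DHCP option 175) fetches its script over HTTP.
    dhcp-match=set:ipxe,175
    dhcp-boot=tag:ipxe,http://192.168.10.1/boot.ipxe
    EOF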
>>
>> At present, there is only one read/write NFS export on the
>> edge node and it holds a flat file that Hadoop uses as a list
>> of available worker nodes. The list is populated by the
>> aforementioned node setup script after the hostname is generated.
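The setup script's contribution to that file is essentially a one-liner
(the NFS mount point here is an assumption for the example):

    # Register the freshly built node in the shared workers list.
    echo "$(hostname)" >> /mnt/cluster-nfs/hadoop/workers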
>>
>> Both the PXE-booted Gentoo Linux instance and the on-disk
>> instance are managed within a chroot on the edge node in a
>> manner not unlike how Gentoo Linux is conventionally
>> installed on a system. Once set up as desired, these
>> instances are compressed into separate squashfs files and
>> placed in the nginx doc root. In the case of the PXE-booted
>> instance, there is an intermediate step where much of the
>> instance is stripped away just to reduce the size of the
>> squashfs file, which is currently 431MiB. The full cluster
>> node distribution file is 1.6GiB but I sometimes exclude the
>> kernel source tree and local package meta-repository to bring
>> it down to 1.1GiB. The on-disk footprint of the complete
>> worker node instance is 5.9GiB.
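The compression step is just mksquashfs pointed at the managed chroot
and aimed at the nginx doc root; roughly (paths and exclusions here are
assumptions):

    # Untested sketch: squash the chroot-managed instance and publish it.
    mksquashfs /var/chroots/worker /srv/www/htdocs/worker.squashfs \
        -comp xz -noappend \
        -e usr/src/linux    # optionally drop the kernel source tree to shrink it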
>>
>> The node setup script takes the first drive it finds and
>> GPT-partitions it six ways: 1) a 2MiB "spacer" for the
>> bootloader; 2) 256MiB for /boot; 3) 32GiB for root; 4) 2xRAM
>> for swap (this is WAY overkill; it's set by ratio in the
>> script and a ratio of one or less would suffice); 5) 64GiB
>> for /tmp/hadoop-yarn (more about this later); 6) whatever is
>> left for /hdfs1. Any remaining disks identified are
>> single-partitioned as /hdfs2, /hdfs3, etc. All partitions are
>> formatted btrfs with the exception of /boot, which is vfat
>> for UEFI compatibility (a route I went down because I have
>> one old laptop I found that was UEFI-only and I expect that
>> will become more the case than less over time). A
>> quasi-boolean in the script optionally enables compression at
>> mount time for /tmp/hadoop-yarn.
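In sgdisk/mkfs terms, that layout works out to roughly the following
(sketch only; the real script computes the swap size from RAM and loops
over whatever disks it finds rather than assuming /dev/sda):

    # Untested sketch of the six-way layout on the first disk.
    sgdisk --zap-all /dev/sda
    sgdisk -n 1:0:+2M   -t 1:ef02 /dev/sda    # bootloader "spacer" (BIOS boot)
    sgdisk -n 2:0:+256M -t 2:ef00 /dev/sda    # /boot (vfat, UEFI-friendly)
    sgdisk -n 3:0:+32G  -t 3:8300 /dev/sda    # root
    sgdisk -n 4:0:+32G  -t 4:8200 /dev/sda    # swap (2xRAM on a 16GiB node)
    sgdisk -n 5:0:+64G  -t 5:8300 /dev/sda    # /tmp/hadoop-yarn
    sgdisk -n 6:0:0     -t 6:8300 /dev/sda    # /hdfs1 gets the rest
    mkfs.vfat -F 32 /dev/sda2
    mkfs.btrfs -f /dev/sda3; mkfs.btrfs -f /dev/sda5; mkfs.btrfs -f /dev/sda6
    mkswap /dev/sda4
    # Optional transparent compression for the YARN scratch space, e.g.:
    # mount -o compress=zstd /dev/sda5 /tmp/hadoop-yarn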
>>
>> One of Gentoo Linux's strengths is the ability to compile
>> software specifically for the CPU but the node instance is
>> set up with the gcc option -mtune=generic. Another
>> quasi-boolean setting in the node setup script will change
>> that to -march=native, but that change only takes effect when
>> packages are built or rebuilt locally (as opposed to in chroot on
>> the edge node, where everything must be built generic). I can
>> couple this with another option to
>> optionally rebuild all the system's binaries native but
>> that's an operation that would take a fair bit of time
>> (that's over 500 packages and only some of them would affect
>> cluster operation). Similarly, in the interest of
>> run-what-ya-brung flexibility, I'm using Gentoo's genkernel
>> utility to generate a kernel and initrd befitting a
>> liveCD-style instance that will boot on basically any x86-64
>> along with whatever NICs and disk controllers it finds.
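Concretely, the "go native" toggle is just a tweak to
/etc/portage/make.conf followed, if you want everything native, by a
full rebuild; roughly (the sed is a simplification of what the script
really does):

    # Untested sketch of switching a node from generic to native code.
    sed -i 's/-mtune=generic/-march=native/' /etc/portage/make.conf
    # Optional full native rebuild (500+ packages; takes a while):
    emerge --emptytree @world
    # Separately, the boot-anywhere kernel and initrd come from genkernel:
    genkernel all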
>>
>> I am using the Hadoop binary distribution (currently 3.1.1)
>> as distributed directly by Apache (no HortonWorks; no
>> Cloudera). Each cluster node has its own Hadoop distribution
>> and each node's Hadoop distribution has configuration
>> features both in common and specific to that node, modified
>> in place by the node setup script. In the latter case, the
>> amount of available RAM, the number of available CPU threads,
>> and the list of available HDFS partitions on a system are
>> flown into the proper local config files. Hadoop services run
>> in a Java VM; I am currently using the IcedTea 3.8.0 source
>> distribution supplied within Gentoo's packaging system. I
>> have also run it under the IcedTea binary distribution and
>> the Oracle JVM with equal success.
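As an illustration of the "flown into the proper local config files"
part, the per-node pieces amount to something along these lines
(sketch; the placeholder tokens and the use of sed are simplifications
of what the script does, and HADOOP_HOME is assumed to point at the
unpacked distribution):

    # Untested sketch: derive node-specific values and stamp them into
    # placeholder tokens in the Hadoop config templates.
    MEM_MB=$(( $(awk '/MemTotal/ {print $2}' /proc/meminfo) / 1024 ))
    VCORES=$(nproc)
    HDFS_DIRS=$(ls -d /hdfs[0-9]* | paste -sd, -)
    CONF=$HADOOP_HOME/etc/hadoop
    sed -i "s|@MEM_MB@|${MEM_MB}|"       "$CONF/yarn-site.xml"  # yarn.nodemanager.resource.memory-mb
    sed -i "s|@VCORES@|${VCORES}|"       "$CONF/yarn-site.xml"  # yarn.nodemanager.resource.cpu-vcores
    sed -i "s|@HDFS_DIRS@|${HDFS_DIRS}|" "$CONF/hdfs-site.xml"  # dfs.datanode.data.dir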
>>
>> Hadoop is made up of three primary constructs. HDFS
>> (Hadoop Distributed File System) consists of a NameNode
>> daemon that runs on a single machine and controls the
>> filesystem namespace and user access to it; DataNode daemons
>> run on each worker node and coordinate between the NameNode
>> daemon and the local machine's on-disk filesystem. You access
>> the filesystem with command-line-like options to the hdfs
>> binary like -put, -get, -ls, -mkdir, etc. but in the on-disk
>> filesystem underneath /hdfs1.../hdfsN, the files you write
>> are cut up into "blocks" (default size: 128MiB) and those
>> blocks are replicated (default: three times) among all the
>> worker nodes. My initial cluster with standalone workers
>> reported 7.2TiB of HDFS available spread across six physical
>> spindles. As you can imagine, it's possible to accumulate
>> tens of TiB of HDFS across only a handful of nodes but doing
>> so isn't necessarily helpful.
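In day-to-day terms that looks like this (the paths are just examples):

    hdfs dfs -mkdir -p /user/jeff            # make a directory in HDFS
    hdfs dfs -put big-input.csv /user/jeff/  # split into 128MiB blocks, replicated 3x
    hdfs dfs -ls /user/jeff
    hdfs dfsadmin -report | head             # shows total/remaining HDFS capacity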
>>
>> YARN (Yet Another Resource Negotiator) is the construct that
>> manages the execution of work among the nodes. Part of the
>> whole point behind Hadoop is to /move the processing to where
>> the data is/, and it's YARN that coordinates all that. It
>> consists of a ResourceManager daemon that communicates with
>> all the worker nodes and NodeManager daemons that run on each
>> of the worker nodes. You can run the ResourceManager daemon
>> and HDFS' NameNode daemon on the same machines that act as
>> worker nodes but past a point you won't want to and past
>> /that/ point you'd want to run each of NameNode and
>> ResourceManager on two separate machines. In that regime,
>> you'd have two machines dedicated to those roles (their names
>> would be taken out of the centrally-located workers file) and
>> the rest would run both the DataNode and NodeManager daemons,
>> forming the HDFS storage subsystem and the YARN execution
>> subsystem.
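A quick way to see that arrangement from the command line:

    yarn node -list -all      # NodeManagers known to the ResourceManager
    yarn application -list    # applications currently running on them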
>>
>> There is another construct, MapReduce, whose architecture I
>> don't fully understand yet; it comes into play as a later
>> phase in Hadoop computations and there is a JobHistoryServer
>> daemon associated with it.
>>
>> Another place where the bridge is out with respect to my
>> understanding of Hadoop is coding for it - but I'll get there
>> eventually. There are other apps like Apache's Spark and Hive
>> that use HDFS and/or YARN that I have better mental insight
>> into, and I have successfully gotten Python/Spark demo
>> programs to run on YARN in my cluster.
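For the record, those demo runs were along the lines of the following
(the example path is the one that ships in the Spark binary
distribution, so treat the specifics as illustrative):

    # Submit Spark's bundled Pi estimator to YARN.
    spark-submit --master yarn --deploy-mode client \
        $SPARK_HOME/examples/src/main/python/pi.py 100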
>>
>> One thing I have learned is that Hadoop clusters do not
>> "genericize" well. When I first tried running the
>> Hadoop-supplied teragen/terasort example (goal: make a file
>> of 10^10 100-character lines and sort it), it failed for want
>> of space available in /tmp/hadoop-yarn but it ran perfectly
>> when the file was cut down to 1/100th that size. For my
>> PXE-boot-based cluster, I gave my worker nodes a separate
>> partition for /tmp/hadoop-yarn and gave it optional
>> transparent compression. There are a lot of parameters (things
>> like the minimum size and size increment of memory containers,
>> plus JVM settings) that I haven't messed with, but to optimize
>> the cluster for a given job, one would expect to.
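For reference, the example runs like this; the jar path is the one in
the Apache binary distribution, and the row count shown is the
1/100th-size run that actually succeeded for me:

    EXAMPLES=$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar
    hadoop jar $EXAMPLES teragen 100000000 /teragen-out    # 10^8 rows of 100 bytes
    hadoop jar $EXAMPLES terasort /teragen-out /terasort-out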
>>
>> What I have right now - basically, a single Gentoo Linux
>> instance for installation on a dual-homed edge node - is able
>> to generate a working Hadoop cluster with an arbitrary number
>> of nodes, limited primarily by space, cooling, and electric
>> power (the Dell Optiplex desktops I'm using right now max out
>> at about an amp, so you have to be prepared to supply at
>> least N amps for N nodes). They can be purpose-built
>> rack-mount servers, a lab environment full of thin clients,
>> or wire shelf units full of discarded desktops and laptops.
>>
>> - Jeff
>>
>>
>>
>> --
>> Sent from my mobile. Please excuse the brevity, spelling, and
>> punctuation.
>
>
>
> --
> Sent from my Android device with K-9 Mail. All tyopes are thumb
> related and reflect authenticity.