Coyote: all IB, all the time (Booting as a Linux HPC application) - - PowerPoint PPT Presentation
Ron Minnich, Sandia National Labs
Acknowledgments
Andrew White, Bob Tomlinson, Daryl Grunau, Kevin Tegtmeier, Ollie Lo, Latchesar Ionkov, Josh Aune, and many others at LANL and Linux NetworX (RIP)
Overview
- HPC systems have HPC networks
- Which HPC applications use
- And admin applications don't
– Usually add an extra Ethernet network
– Or 2 or 3 …
- And the admin networks are either:
– Wildly overcommitted
– Expensive
- But they are guaranteed to reduce reliability
Why don't admins use HPC networks?
- Well, they can … if the vendors let them
- At Los Alamos, from 2000-2006, we built HPC machines that did just that
- Which gave us admin networks that:
– Allowed us high-quality monitoring
– High-performance boot
– Lower expense
– Higher reliability
The reliability point bears mentioning
- Probability 101
- Vendors require HPC net, and admin net
– Claim is this is “more reliable”
- Uh, no: both are needed to operate
– Which decreases reliability for users
- Vendor practices make HPC systems less reliable
- So why do they do it?
- Because the BIOS can't work any other way ...
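The "Probability 101" point can be made concrete with a few lines. A minimal sketch; the 99% per-network availability figures are illustrative assumptions, not measurements from these machines:

```python
# If a system needs BOTH its HPC net and its admin net to operate,
# its availability is the product of the two -- lower than either alone.
# The 0.99 figures are illustrative assumptions, not measured values.

hpc_net = 0.99    # assumed availability of the HPC (IB) network
admin_net = 0.99  # assumed availability of the extra admin Ethernet

both_required = hpc_net * admin_net   # vendor design: need both nets up
ib_only = hpc_net                     # one-network design: one dependency

print(f"both nets required: {both_required:.4f}")  # 0.9801
print(f"IB only:            {ib_only:.4f}")        # 0.9900
```

However good the second network is, requiring it multiplies in another failure probability, so "two networks" is strictly less reliable than one.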
Why the BIOS wants that admin net
- Usually Ethernet
- Which is what IPMI understands
- In fact, just about all the (closed) vendor software runs only on that Ethernet
- We get machines with a 40 Gbit/s HPC net
- And 1 Gbit/s Ethernet
– Which, per port, can cost more than the IB
So, the other thing we did at LANL
- Embed a Linux kernel on the mainboard
- Exploit Linux for everything related to boot
- Allowed us to build Pink, a Top 10 machine, in 2002, for < ½ the cost of a similar machine
- Remove 1024-port Enet, remove disks, save a lotta money, make it more reliable
- Not a bad deal
- But relied on replacing BIOS with Linux
Linux as BIOS was a Big Deal in 1999
- It's not a big deal now in many places
- Taken for granted in embedded world (cars, network switches, etc.)
- But it's still a Big Deal in the PC world
– In other words, PCs are falling behind
- PCs are now as closed as the workstations they replaced in 1994: ecosystem is closing
- PC vendors should beware: closed ecosystems die off rapidly (see: workstation vendors)
Example: Booting as an HPC application
- I discovered in 2007 that some of our IB software is, ah, not quite as mature as I thought
- “IB-only boot? Solved problem”
- Well, maybe
PXE on IB experiences: 2007
- For SC 07 we set up a cluster to use the PXE-in-firmware on Mellanox cards
- Not surprised, not shocked: required wget this, patch that; things did not quite work
– And people kept telling me to “just boot over enet”
- IB has come far, but not far enough
- I still talk to people who want an “IB only” solution -- and we did this in 2005 at LANL
Vendor boot-over-IB solutions
- Add an extra Ethernet
– Yuck!
- Use the IB cards in “I'm just an Ethernet device” mode
– Yuck!
- You've got an HPC network and want to emulate a low-speed network?
- Maybe that's nice on small systems ...
Overview
- What Coyote is
- The challenge: IB only boot, compute, operate
- How it all fit together
- Challenges and fixes
Coyote in 2005/6
[Diagram: six clusters -- C1–C5 (each with 258 dual-processor compute nodes + 12 dual-processor I/O nodes) and DotX (36 dual-processor compute nodes + 4 dual-processor I/O nodes) -- each with master nodes, interconnected via InfiniBand 4x]
- Linux Networx system:
– 5 Scalable Unit (SU) clusters of 272 nodes + 1 cluster (DotX) of 42 nodes
– Dual-2.6GHz AMD Opteron CPUs (single core)
– 4GB memory / CPU
- 272 node SUs:
– 258 compute nodes + 1 compute-master
– 12 I/O nodes + 1 I/O-master
- 42 node DotX:
– 36 compute nodes + 1 compute-master
– 4 I/O nodes + 1 I/O-master
- Not pictured: 4 compile & 10 serial job nodes
- System Software
– 2.6.14-based Linux
– Fedora Core 3
– Clustermatic V4 (BProcV4)
– OpenMPI
– LSF scheduler
– PathScale compilers (also gcc, pgi)
– Mellanox AuCD 2.0
– OpenSM/Gen2
- Possible to connect 2 SUs together for a larger 1032-CPU partition
[Diagram: two 24-port IB switches, 8 links each]
- System Monitoring
– Hardware monitoring network (not shown), accessed via a third network interface (eth2) on the master nodes, provides console and power management via conman and powerman
– Environment monitoring via Supermon
Coyote boot software (beoboot)
- This software can support any cluster system
- i.e., on top of it you can build Rocks, Oscar, OneSIS, etc.
– This software is not bproc- or Clustermatic-specific
- It is (in my experience) the fastest, most reliable, most scalable boot system
- Because it uses Linux to perform the boot, not PXE or similar systems
The Challenge: IB only compute, boot, operate
- Early goal was to build Coyote with one, not two, networks
- Experience on Pink and Blue Steel with Ether
– Pink: Ethernet not needed, greatly reduced cost
– Pink: Motherboard issues with Ethernet on I/O nodes delayed delivery
– Blue Steel: Ethernet was needed, greatly increased headaches
Digression: A note on failure models
- It is odd to this day to see that the concept of points-of-failure is misunderstood
- People do understand a single point of failure
- People don't always understand that multiple points of failure is not the same as no single point of failure
- This confusion leads to strange design decisions
Example: boot management
- Here is a boot system for a 1024-node cluster
- “But it's a Single Point Of Failure”
- So people frequently do this:
[Diagram: boot system]
Example: boot management: hierarchy of tftp servers
- What happens if one node goes out?
- The answer determines whether this is MPOF (multiple points of failure)
- In most cases, it is: you lose some nodes
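Counting what one failure costs makes the MPOF point concrete. A toy model; the fan-out of 32 is an assumption for illustration, not Coyote's actual layout:

```python
# Toy model of a static 2-level boot hierarchy for a 1024-node cluster:
# one root tftp server feeds 32 mid-level servers, each serving 32 leaves.
# The fan-out of 32 is an illustrative assumption.

FANOUT = 32
leaves = FANOUT * FANOUT          # 1024 compute nodes total

def nodes_lost(failed_mid_servers: int) -> int:
    """Leaves stranded if some mid-level tftp servers die."""
    return failed_mid_servers * FANOUT

# Losing any ONE of the 32 mid-level servers strands 32 nodes:
print(nodes_lost(1))   # 32
```

So the hierarchy has 32 separate points of failure, each costing 32 nodes: multiple points of failure, not "no single point of failure".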
Coyote software components
Firmware (i.e. in BIOS/CF)
- coreboot
- Linux kernel with:
– IB Gold stack, IPoIB
– beoboot
– kexec
- These components were sufficient to provide a high-performance, scalable, ad-hoc boot infrastructure for Coyote
Note: Kernel was in Compact Flash
- In many cases we can put coreboot + Linux in BIOS flash
– (see: http://tinyurl.com/2umm66) Linux + X11 BIOS!
- Once we add Myrinet or IB drivers, standard FLASH parts are too small (only 1 MB)
- Long-term goal: Linux back in BIOS FLASH
– Else have to fall back to Ether + PXE!
- Newer boards will have 4 MByte and up parts
Coyote master node
- This node controls the cluster
- It is contacted by the individual compute/IO nodes for boot management
- Provides a Single Point Of Failure model with ad-hoc tree boot system (more on that later)
- Fastest way to boot; far faster than PXE
Coyote boot process
coreboot → Configure platform → Load kernel → Load initrd → Config IB → ifconfig ib0 up → beoboot
- coreboot has two files: kernel + initrd
- Initrd contains drivers
- At this point, modprobe+ifconfig worked fine (thanks, vendors!)
- Thanks to Hal for DHCP that worked
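The node-side boot flow above can be sketched as an ordered command list. This is a hypothetical reconstruction: the `ib_ipoib` module name, image paths, DHCP client, and master address are assumptions based on the slide, not Coyote's actual scripts:

```python
# Hedged sketch of the node-side boot sequence the slide describes:
# coreboot loads kernel+initrd; the initrd brings up IPoIB, DHCPs,
# fetches the next-stage image from the master, and kexecs into it.
# All module names, paths, and addresses below are illustrative.

def boot_commands(master="10.0.0.1"):
    """Return the ordered shell commands a node would run from its initrd."""
    return [
        "modprobe ib_ipoib",                       # load the IPoIB driver
        "ifconfig ib0 up",                         # bring up the IB interface
        "udhcpc -i ib0",                           # DHCP over IB (thanks, Hal)
        f"wget http://{master}/vmlinuz",           # fetch next-stage kernel
        f"wget http://{master}/initrd.img",        # ...and its initrd
        "kexec -l vmlinuz --initrd=initrd.img",    # stage the new kernel
        "kexec -e",                                # jump into it -- no reboot
    ]

for cmd in boot_commands():
    print(cmd)
```

The key design point is the last two lines: kexec replaces the running kernel directly, so no firmware, PXE, or reset cycle is involved after power-on.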
Why not just use PXE at this point?
- PXE can boot a node, but:
– Requires network card firmware to make the card act like an Ethernet
– Does not exploit all HPC network features
- In practice, we have booted 1024-node clusters with Linux in the time it takes PXE to not configure one network interface!
– Much less NOT configuring two or three ...
PXE inefficiencies lead to construction of unreliable boot setup
- Our old friend, MPOF, we meet again
Linux: the right way to boot
- Use the strengths of the HPC network and Linux
- We'd been doing this at LANL since 2000, and understand it well
- The idea is simple: conscript the booting nodes to help boot other nodes
- That's the beoboot component
Booting fast and reliably
[Diagram: ad-hoc tree boot -- C1 asks master B “Boot me!”; when C2 asks, B replies “You're drafted, ask C1”, and C1 serves C2's boot]
Ad-hoc tree boot
- In practice, this is incredibly fast
- Image distribution: 20 Mbytes × 1024 nodes, << 10 seconds
– 2 Gbytes/second minimum
- Most boot time: Linux serial output
- Extraordinarily reliable
– Tested, fast Linux drivers
- Exploit Linux concurrency
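The ad-hoc tree's speed can be sketched with a toy simulation. Assumption for illustration: each booted node (and the master) serves exactly one new node per "round", where a round is one image transfer:

```python
# Toy simulation of ad-hoc tree boot: every booted node is drafted
# to serve one new node per round, so the set of servers doubles
# each round. "One transfer per round per server" is a simplifying
# assumption, not a measurement of beoboot.

def rounds_to_boot(n_nodes: int) -> int:
    servers, booted, rounds = 1, 0, 0   # master starts as the only server
    while booted < n_nodes:
        booted += servers               # each server boots one new node
        servers += servers              # ...which is immediately drafted
        rounds += 1
    return rounds

print(rounds_to_boot(1024))             # 11 transfer-rounds for 1024 nodes

# The slide's bandwidth arithmetic, checked:
image_mb, nodes, seconds = 20, 1024, 10
print(image_mb * nodes / seconds)       # 2048.0 MB/s, i.e. ~2 GB/s minimum
```

A static server would need 1024 sequential transfers; the drafted tree needs only on the order of log2(1024) rounds, which is why distribution finishes in well under 10 seconds.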
Conclusions
- HPC systems are best built with Linux “boot firmware”
- Ad-hoc trees use the HPC network for booting, eliminating slow, failure-prone static trees
- Single point of failure, not many
- Have been working on IB since 2005
- We are re-releasing the scalable boot software: