SLIDE 1
Bare Metal: Library Abstractions for Modern Hardware
Cyprien Noel

Plan
1. Modern hardware?
2. New challenges & opportunities
3. Three use cases: current solutions, leveraging hardware, simple abstraction
SLIDE 2
SLIDE 3
Myself
- High performance trading systems
○ Lock-free algos, distributed systems
- H2O
○ Distributed CPU machine learning, async SGD
- Flickr
○ Scaling deep learning on GPU
⎼ Multi-GPU Caffe
○ RDMA, multicast, distributed Hogwild
⎼ CaffeOnSpark
- UC Berkeley
○ NCCL Caffe, GPU cluster tooling
○ Bare Metal
SLIDE 4
Modern Hardware?
SLIDE 5
Device-to-device networks
SLIDE 6
SLIDE 7
Number crunching ➔ GPU
FS, block io, virt mem ➔ Pmem
Network stack ➔ RDMA
RAID, replication ➔ Erasure codes
Device mem ➔ Coherent fabrics
And more: video, crypto, etc.
Moving from ms software to µs hardware
SLIDE 8
OS abstractions replaced by
More powerful, but also more complex and non-interoperable
- CUDA
- OFED
- Libpmem
- DPDK
- SPDK
- Libfabric
- UCX
- VMA
- More every week...
SLIDE 9
Summary So Far
- Big changes coming!
○ At least for high-performance applications
- CPU should orchestrate
○ Not in critical path
○ Device-to-device networks
- Retrofitting existing architectures difficult
○ CPU-centric abstractions
○ ms software on µs hardware (e.g. 100s of instructions per packet)
○ OK in some cases, e.g. VMA (kernel-bypass sockets), but much lower acceleration; most features inexpressible
SLIDE 10
What do we do?
- Start from scratch?
○ E.g. Google Fuchsia - no fs, block io, network, etc.
○ Very interesting, but future work
- Use already accelerated frameworks?
○ E.g. PyTorch, BeeGFS
○ Not general purpose, no interop, not device-to-device
- Work incrementally from use cases
○ Look for simplest hardware solution
○ Hopefully useful abstractions will emerge
SLIDE 11
Use cases
- Build datasets
○ Add, update elements
○ Apply functions to sets, map-reduce
○ Data versioning
- Training & inference
○ Compute graphs, pipelines
○ Deployment
○ Model versioning
SLIDE 12
Datasets
- Typical solution
○ Protobuf messages
○ KV store
○ Dist. file system
- Limitations
○ Serialization granularity
○ Copies: kv log, kernel1, replication, kernel2, fs
○ Remote CPU involved, stragglers
○ Cannot place data in device
SLIDE 13
Datasets
- Simplest hardware implementation
○ Write protobuf in an arena, like FlatBuffers
○ Pick an offset on disks, e.g. a namespace
○ Call ibv_exp_ec_encode_async
- Comments
○ Management, coordination, crash resiliency
○ Thin wrapper over HW: line-rate perf.
- User abstraction?
○ Simple, familiar
○ Efficient, device-friendly
(Diagram: EC shards ×12)
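As a rough sketch of the erasure-coding idea behind calls like ibv_exp_ec_encode_async: k data shards plus parity let a lost shard be rebuilt without keeping full replicas. Real offloads use Reed-Solomon codes that survive multiple losses; this toy XOR version survives one, but shows where the ~1.5x storage factor comes from (k=2 data shards + 1 parity). All names here are illustrative, not the Bare Metal API.

```python
def encode(data: bytes, k: int) -> list:
    """Split data into k shards and append one XOR parity shard."""
    shard_len = -(-len(data) // k)                   # ceil(len / k)
    padded = data.ljust(shard_len * k, b"\0")
    shards = [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]
    parity = bytearray(shard_len)
    for s in shards:
        for i, b in enumerate(s):
            parity[i] ^= b
    return shards + [bytes(parity)]

def recover(shards: list, lost: int) -> bytes:
    """Rebuild the shard at index `lost` by XORing all the others."""
    shard_len = len(shards[(lost + 1) % len(shards)])
    out = bytearray(shard_len)
    for i, s in enumerate(shards):
        if i != lost:
            for j, b in enumerate(s):
                out[j] ^= b
    return bytes(out)

data = b"hello, erasure world!"
k = 2
shards = encode(data, k)     # 3 shards to store 2 shards of data: 1.5x
shards[0] = None             # simulate losing a disk
shards[0] = recover(shards, 0)
assert b"".join(shards[:k]).rstrip(b"\0") == data
```

The hardware offload does this per write at line rate; the CPU only picks offsets and posts the request.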
SLIDE 14
mmap
- Extension to classic mmap
○ Distributed
○ Typed - Protobuf, other formats planned
- Protobuf is amazing
○ Forward and backward compatible
○ Lattice
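For contrast, the classic single-node mmap that this extends, in plain Python: a file's bytes mapped into the address space, with stores going straight through the mapping and flush() (msync) persisting them. Bare Metal's mmap adds the distribution and protobuf typing on top; none of that appears in this baseline sketch.

```python
import mmap, os, tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"\x00" * 4096)          # size the backing file first

with mmap.mmap(fd, 4096) as m:
    m[0:5] = b"hello"                 # write through the mapping
    m.flush()                         # msync: push changes to the file

with open(path, "rb") as f:
    assert f.read(5) == b"hello"

os.close(fd)
os.unlink(path)
```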
SLIDE 15
mmap
- C++
const Test& test = mmap<Test>("/test");
int i = test.field();
- Python
test = Test()
bm.mmap("/test", test)
i = test.field()
SLIDE 16
mmap, recap
- Simple abstraction for data storage
- Fully accelerated, “mechanically friendly”
○ Thin wrapper over HW, device-to-device, zero copy
○ ~1.5x replication factor
○ Network automatically balanced
○ Solves straggler problem
○ No memory pinning or TLB thrashing, NUMA aware
SLIDE 17
Use cases
- Compute
○ Map-reduce, compute graphs, pipelines
- Typical setup
○ Spark, DL frameworks
○ Distribution using Akka, gRPC, MPI
○ Kubernetes or SLURM scheduling
- Limitations
○ No interop
○ Placement difficult
○ Inefficient resource allocation
SLIDE 18
Compute
- Simplest hardware implementation
○ Define a task, e.g. image resize, CUDA kernel, PyTorch graph
○ Place tasks in a queue
○ Work stealing - RDMA atomics
○ Device-to-device chaining - GPU Direct Async
- User abstraction?
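A local sketch of the work-stealing step: each worker pops tasks from its own deque and, when empty, steals the oldest task from a peer. In the design above the steal would be an RDMA atomic (e.g. compare-and-swap) on a remote queue without involving the remote CPU; here a plain lock stands in for that atomicity, and all names are made up for illustration.

```python
import threading
from collections import deque

class Worker:
    def __init__(self):
        self.queue = deque()
        self.lock = threading.Lock()
        self.done = []

    def push(self, task):
        with self.lock:
            self.queue.append(task)

    def pop_or_steal(self, peers):
        with self.lock:
            if self.queue:
                return self.queue.pop()          # own tasks: LIFO
        for peer in peers:
            with peer.lock:
                if peer.queue:
                    return peer.queue.popleft()  # steal oldest from victim
        return None

    def run(self, peers):
        while (task := self.pop_or_steal(peers)) is not None:
            self.done.append(task())

workers = [Worker() for _ in range(2)]
for i in range(8):
    workers[0].push(lambda i=i: i * i)   # all work lands on worker 0

threads = [threading.Thread(target=w.run,
                            args=([p for p in workers if p is not w],))
           for w in workers]
for t in threads: t.start()
for t in threads: t.join()

# Worker 1 had no tasks of its own; anything it completed was stolen.
assert sorted(workers[0].done + workers[1].done) == sorted(i * i for i in range(8))
```

The point of stealing over an explicit schedule: idle devices pull work themselves, so there is no central placement decision to get wrong.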
SLIDE 19
task
- Python
@bm.task
def compute(x, y):
    return x * y

# Runs locally
compute(1, 2)

# Might be rebalanced on cluster
data = bm.list()
bm.mmap("/data", data)
compute(data, 2)
SLIDE 20
task, recap
- Simple abstraction for CPU and device kernels
- Work stealing instead of explicit schedule
○ No GPU hoarding
○ Better work balancing
○ Dynamic placement, HA
- Device-to-device chaining
○ Data placed directly in device memory
○ Efficient pipelines, even for very short tasks
○ E.g. model parallelism, low-latency inference
SLIDE 21
Use cases
- Versioning
○ Track datasets and models
○ Deploy / rollback models
- Typical setup
○ Copy before update
○ Symlinks as versions to data
○ Staging / production environments split
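The typical setup above can be sketched in a few lines: each version is a full copy on disk and a symlink marks the current one, so deploy and rollback just repoint the symlink. Simple, but every update copies the whole dataset, which is the limitation the branch abstraction later avoids. Function names here are illustrative.

```python
import os, tempfile

root = tempfile.mkdtemp()

def set_current(version: str) -> None:
    """Atomically repoint the 'current' symlink at a version."""
    tmp = os.path.join(root, "current.tmp")
    os.symlink(os.path.join(root, version), tmp)
    os.replace(tmp, os.path.join(root, "current"))  # atomic rename on POSIX

def deploy(version: str, contents: bytes) -> None:
    with open(os.path.join(root, version), "wb") as f:  # copy-before-update
        f.write(contents)
    set_current(version)

deploy("v1", b"weights v1")
deploy("v2", b"weights v2")
with open(os.path.join(root, "current"), "rb") as f:
    assert f.read() == b"weights v2"

set_current("v1")                    # rollback: repoint at the old copy
with open(os.path.join(root, "current"), "rb") as f:
    assert f.read() == b"weights v1"
```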
SLIDE 22
Versioning
- Simplest hardware implementation
○ Keep multiple write-ahead logs
○ mmap updates
○ task queues
- User abstraction?
SLIDE 23
branch
- Like a git branch
○ But any size data
○ Simplifies collaboration, experimentation
○ Generalized staging / production split
- Simplifies HA
○ File system fsync, msync (very hard to get right! Rajimwale et al., DSN ’11)
○ Replaces transactions, e.g. queues, persistent memory
○ Allows duplicate work merge
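A hedged sketch of the branch semantics (class and method names made up): reads fall through to the parent, writes stay in the branch's private overlay until merged; like a git branch, but cheap for any size of data because nothing is copied on branch creation.

```python
class Store:
    def __init__(self, data=None, parent=None):
        self.local = dict(data or {})
        self.parent = parent

    def branch(self):
        return Store(parent=self)        # cheap: no data is copied

    def __getitem__(self, key):
        if key in self.local:
            return self.local[key]       # branch-local write wins
        if self.parent is not None:
            return self.parent[key]      # fall through to parent
        raise KeyError(key)

    def __setitem__(self, key, value):
        self.local[key] = value          # parent never touched

    def merge(self):
        self.parent.local.update(self.local)   # fold overlay into parent

main = Store({"field": 7})
b = main.branch()
b["field"] = 12             # only visible in the branch, as on the next slide
assert main["field"] == 7
assert b["field"] == 12
b.merge()                   # deploy: branch contents become visible in main
assert main["field"] == 12
```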
SLIDE 24
branch
- C++
Test* test = mutable_mmap<Test>("/test");
branch b;
// Only visible in current branch
test->set_field(12);
- Similar in Python
SLIDE 25
Summary
- mmap, task, and branch simplify hardware acceleration
- They help build pipelines, manage cluster resources, etc.
- Early micro-benchmarks suggest very high performance
SLIDE 26