

SLIDE 1

Bare Metal Library

Abstractions for modern hardware
Cyprien Noel

SLIDE 2

Plan

1. Modern Hardware?
2. New challenges & opportunities
3. Three use cases
   ○ Current solutions
   ○ Leveraging hardware
   ○ Simple abstraction

SLIDE 3
Myself

  • High performance trading systems
    ○ Lock-free algos, distributed systems
  • H2O
    ○ Distributed CPU machine learning, async SGD
  • Flickr
    ○ Scaling deep learning on GPU
      ⎼ Multi GPU Caffe
    ○ RDMA, multicast, distributed Hogwild
      ⎼ CaffeOnSpark
  • UC Berkeley
    ○ NCCL Caffe, GPU cluster tooling
    ○ Bare Metal

SLIDE 4

Modern Hardware?

SLIDE 5

Device-to-device networks

SLIDE 6
SLIDE 7

Moving from ms software to µs hardware

  • Number crunching ➔ GPU
  • FS, block IO, virtual memory ➔ Pmem
  • Network stack ➔ RDMA
  • RAID, replication ➔ Erasure codes
  • Device memory ➔ Coherent fabrics
  • And more: video, crypto, etc.

SLIDE 8

OS abstractions replaced by

  • CUDA
  • OFED
  • Libpmem
  • DPDK
  • SPDK
  • Libfabric
  • UCX
  • VMA
  • More every week...

More powerful, but also more complex and non-interoperable
SLIDE 9

Summary So Far

  • Big changes coming!
    ○ At least for high-performance applications
  • CPU should orchestrate
    ○ Not in the critical path
    ○ Device-to-device networks
  • Retrofitting existing architectures is difficult
    ○ CPU-centric abstractions
    ○ ms software on µs hardware (e.g. 100s of instructions per packet)
    ○ OK in some cases, e.g. VMA (kernel-bypass sockets), but much lower acceleration; most features inexpressible

SLIDE 10

What do we do?

  • Start from scratch?
    ○ E.g. Google Fuchsia: no fs, block IO, network, etc.
    ○ Very interesting, but future work
  • Use already accelerated frameworks?
    ○ E.g. PyTorch, BeeGFS
    ○ Not general purpose, no interop, not device-to-device
  • Work incrementally from use cases
    ○ Look for the simplest hardware solution
    ○ Hopefully useful abstractions will emerge

SLIDE 11

Use cases

  • Build datasets
    ○ Add, update elements
    ○ Apply functions to sets, map-reduce
    ○ Data versioning
  • Training & inference
    ○ Compute graphs, pipelines
    ○ Deployment
    ○ Model versioning

SLIDE 12

Datasets

  • Typical solution
    ○ Protobuf messages
    ○ KV store
    ○ Distributed file system
  • Limitations
    ○ Serialization granularity
    ○ Copies: kv log, kernel1, replication, kernel2, fs
    ○ Remote CPU involved, stragglers
    ○ Cannot place data in device

SLIDE 13

Datasets

  • Simplest hardware implementation
    ○ Write protobuf in an arena, like FlatBuffers
    ○ Pick an offset on disks, e.g. a namespace
    ○ Call ibv_exp_ec_encode_async (sketched below)
  • Comments
    ○ Management, coordination, crash resiliency
    ○ Thin wrapper over HW: line-rate performance
  • User abstraction?
    ○ Simple, familiar
    ○ Efficient, device friendly

[Diagram: data written as erasure-coded (EC) shards across devices]
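A minimal sketch of this write path in Python. On real hardware the encode would be offloaded to the NIC via the verbs call named above; here a single XOR parity shard (k data shards + 1 parity) stands in for a general Reed-Solomon code, and ordinary pre-created files stand in for disks. All names are illustrative.

    def ec_write(buf: bytes, disks: list, offset: int) -> None:
        """Stripe buf into k data shards plus one XOR parity shard;
        shard i goes at the same offset on disk i (k = len(disks) - 1)."""
        k = len(disks) - 1
        shard_len = (len(buf) + k - 1) // k
        padded = buf.ljust(k * shard_len, b"\0")
        shards = [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]
        parity = bytearray(shard_len)
        for shard in shards:
            for i, b in enumerate(shard):
                parity[i] ^= b              # XOR parity: any one lost shard
        shards.append(bytes(parity))        # can be rebuilt from the others
        for path, shard in zip(disks, shards):
            with open(path, "r+b") as f:    # disks: pre-created files
                f.seek(offset)
                f.write(shard)

Because the caller picks the offset, no remote CPU has to sequence a log, which is presumably what keeps stragglers out of the critical path.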

SLIDE 14
mmap

  • Extension to classic mmap
    ○ Distributed
    ○ Typed: Protobuf, other formats planned
  • Protobuf is amazing
    ○ Forward and backward compatible
    ○ Lattice

SLIDE 15

mmap

  • C++

    const Test& test = mmap<Test>("/test");
    int i = test.field();

  • Python

    test = Test()
    bm.mmap("/test", test)
    i = test.field()
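A minimal sketch of what the Python call above might do on a single machine, ignoring the distributed, zero-copy parts: map the file and parse it into a generated protobuf message (Test is hypothetical, as is the path).

    import mmap

    def mmap_into(path, msg):
        """Map the file at path and parse it into protobuf message msg.
        Illustrative only: the real library is zero-copy and distributed."""
        with open(path, "rb") as f:
            buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        msg.ParseFromString(bytes(buf))  # this copies; arena layouts avoid it
        return msg

    # test = mmap_into("/test", Test())  # Test: a generated protobuf class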

SLIDE 16
mmap, recap

  • Simple abstraction for data storage
  • Fully accelerated, "mechanically friendly"
    ○ Thin wrapper over HW, device-to-device, zero copy
    ○ ~1.5x replication factor (see note below)
    ○ Network automatically balanced
    ○ Solves straggler problem
    ○ No memory pinning or TLB thrashing, NUMA aware
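The ~1.5x figure is consistent with a standard Reed-Solomon configuration such as 8 data + 4 parity shards: each byte is stored as (8 + 4) / 8 = 1.5 bytes, versus 3x for plain 3-way replication. (The exact k+m split is an assumption; the deck does not state it.)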

SLIDE 17

Use cases

  • Compute
    ○ Map-reduce, compute graphs, pipelines
  • Typical setup
    ○ Spark, DL frameworks
    ○ Distribution using Akka, gRPC, MPI
    ○ Kubernetes or SLURM scheduling
  • Limitations
    ○ No interop
    ○ Placement difficult
    ○ Inefficient resource allocation

SLIDE 18

Compute

  • Simplest hardware implementation
    ○ Define a task, e.g. image resize, CUDA kernel, PyTorch graph
    ○ Place tasks in a queue
    ○ Work stealing: RDMA atomics
    ○ Device-to-device chaining: GPU Direct Async
  • User abstraction?
SLIDE 19

task

  • Python

    @bm.task
    def compute(x, y):
        return x * y

    # Runs locally
    compute(1, 2)

    # Might be rebalanced on cluster
    data = bm.list()
    bm.mmap("/data", data)
    compute(data, 2)
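A minimal sketch of the decorator pattern this implies, with an in-process queue standing in for the RDMA work-stealing queue (bm internals are not public; all names here are assumptions):

    import queue
    import threading

    _tasks = queue.Queue()   # stand-in for the shared RDMA task queue

    def task(fn):
        """Decorator: the callable still runs locally; .submit() enqueues
        it so an idle worker can steal it."""
        def submit(*args):
            _tasks.put((fn, args))
        fn.submit = submit
        return fn

    @task
    def compute(x, y):
        return x * y

    def worker():
        # In the real design, stealing is an RDMA atomic (e.g. a
        # compare-and-swap on a remote queue head), not a local get().
        while True:
            fn, args = _tasks.get()
            fn(*args)

    threading.Thread(target=worker, daemon=True).start()
    compute(1, 2)           # runs locally
    compute.submit(3, 4)    # picked up by the worker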

SLIDE 20

task, recap

  • Simple abstraction for CPU and device kernels
  • Work stealing instead of explicit schedule
    ○ No GPU hoarding
    ○ Better work balancing
    ○ Dynamic placement, HA
  • Device-to-device chaining
    ○ Data placed directly in device memory
    ○ Efficient pipelines, even very short tasks
    ○ E.g. model parallelism, low-latency inference

SLIDE 21

Use cases

  • Versioning
    ○ Track datasets and models
    ○ Deploy / rollback models
  • Typical setup
    ○ Copy before update
    ○ Symlinks as versions to data
    ○ Staging / production environments split

SLIDE 22

Versioning

  • Simplest hardware implementation
    ○ Keep multiple write-ahead logs (sketched below)
    ○ mmap updates
    ○ task queues
  • User abstraction?
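A minimal sketch of the "multiple write-ahead logs" idea, with in-memory lists standing in for persistent logs: each branch is an append-only log of updates, and reading a branch replays its log over the base data. Names are illustrative.

    base = {"field": 0}
    logs = {"main": [], "experiment": []}    # one write-ahead log per branch

    def set_value(branch, key, value):
        logs[branch].append((key, value))    # append-only, ordered

    def read(branch):
        view = dict(base)
        for key, value in logs[branch]:      # replay the branch's log
            view[key] = value
        return view

    set_value("experiment", "field", 12)
    assert read("main")["field"] == 0        # isolated from other branches
    assert read("experiment")["field"] == 12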
SLIDE 23

branch

  • Like a git branch
    ○ But any size data
    ○ Simplifies collaboration, experimentation
    ○ Generalized staging / production split
  • Simplifies HA
    ○ File system fsync, msync (very hard! Rajimwale et al., DSN '11)
    ○ Replaces transactions, e.g. queues, persistent memory
    ○ Allows duplicate work merge

SLIDE 24

branch

  • C++

    Test* test = mutable_mmap<Test>("/test");
    branch b;
    // Only visible in current branch
    test->set_field(12);

  • Similar in Python
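The deck does not show the Python form; by analogy with the C++ above and the earlier bm examples, it might look like this (bm.branch as a context manager is an assumption):

    test = Test()
    bm.mmap("/test", test)
    with bm.branch():        # hypothetical API
        test.field = 12      # only visible in the current branch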
SLIDE 25
Summary

  • mmap, task, and branch simplify hardware acceleration
  • Helps build pipelines, manage cluster resources, etc.
  • Early micro-benchmarks suggest very high performance

SLIDE 26

Thank You!

Will be open sourced (BSD). Contact me if interested: cyprien.noel@berkeley.edu. Thanks to our sponsor.