SLIDE 1
Bare Metal: Library Abstractions for Modern Hardware
Cyprien Noel

Plan
1. Modern hardware?
2. New challenges & opportunities
3. Three use cases: current solutions, leveraging hardware, simple abstraction
SLIDE 2
SLIDE 3
Myself
- High performance trading systems
○ Lock-free algos, distributed systems
- H2O
○ Distributed CPU machine learning, async SGD
- Flickr
○ Scaling deep learning on GPU
⎼ Multi-GPU Caffe
○ RDMA, multicast, distributed Hogwild
⎼ CaffeOnSpark
- UC Berkeley
○ NCCL Caffe, GPU cluster tooling
○ Bare Metal
SLIDE 4
Modern Hardware?
SLIDE 5
Device-to-device networks
SLIDE 6
SLIDE 7
Number crunching ➔ GPU
FS, block io, virt mem ➔ Pmem
Network stack ➔ RDMA
RAID, replication ➔ Erasure codes
Device mem ➔ Coherent fabrics
And more: video, crypto, etc.
Moving from ms software to µs hardware
SLIDE 8
OS abstractions replaced by
More powerful, but also more complex and non-interoperable
- CUDA
- OFED
- Libpmem
- DPDK
- SPDK
- Libfabric
- UCX
- VMA
- More every week...
SLIDE 9
Summary So Far
- Big changes coming!
○ At least for high-performance applications
- CPU should orchestrate
○ Not in critical path
○ Device-to-device networks
- Retrofitting existing architectures difficult
○ CPU-centric abstractions
○ ms software on µs hardware (e.g. 100s of instructions per packet)
○ OK in some cases, e.g. VMA (kernel-bypass sockets), but much lower acceleration; most features inexpressible
SLIDE 10
What do we do?
- Start from scratch?
○ E.g. Google Fuchsia - no fs, block io, network, etc.
○ Very interesting, but future work
- Use already accelerated frameworks?
○ E.g. PyTorch, BeeGFS
○ Not general purpose, no interop, not device-to-device
- Work incrementally from use cases
○ Look for simplest hardware solution
○ Hopefully useful abstractions will emerge
SLIDE 11
Use cases
- Build datasets
○ Add, update elements
○ Apply functions to sets, map-reduce
○ Data versioning
- Training & inference
○ Compute graphs, pipelines
○ Deployment
○ Model versioning
SLIDE 12
Datasets
- Typical solution
○ Protobuf messages
○ KV store
○ Dist. file system
- Limitations
○ Serialization granularity
○ Copies: kv log, kernel1, replication, kernel2, fs
○ Remote CPU involved, stragglers
○ Cannot place data in device
SLIDE 13
Datasets
- Simplest hardware implementation
○ Write protobuf in an arena, like FlatBuffers
○ Pick an offset on disks, e.g. a namespace
○ Call ibv_exp_ec_encode_async
- Comments
○ Management, coordination, crash resiliency
○ Thin wrapper over HW: line-rate perf.
- User abstraction?
○ Simple, familiar
○ Efficient, device-friendly
(Diagram: EC shards ×12)
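As a rough sketch of the erasure-coding idea behind calls like ibv_exp_ec_encode_async: k data shards plus parity let a lost shard be rebuilt without keeping full replicas. Real offloads use Reed-Solomon codes that survive multiple losses; this toy XOR version survives one, but shows where the ~1.5x storage factor comes from (k=2 data shards + 1 parity). All names here are illustrative, not the Bare Metal API.

```python
def encode(data: bytes, k: int) -> list:
    """Split data into k shards and append one XOR parity shard."""
    shard_len = -(-len(data) // k)                   # ceil(len / k)
    padded = data.ljust(shard_len * k, b"\0")
    shards = [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]
    parity = bytearray(shard_len)
    for s in shards:
        for i, b in enumerate(s):
            parity[i] ^= b
    return shards + [bytes(parity)]

def recover(shards: list, lost: int) -> bytes:
    """Rebuild the shard at index `lost` by XORing all the others."""
    shard_len = len(shards[(lost + 1) % len(shards)])
    out = bytearray(shard_len)
    for i, s in enumerate(shards):
        if i != lost:
            for j, b in enumerate(s):
                out[j] ^= b
    return bytes(out)

data = b"hello, erasure world!"
k = 2
shards = encode(data, k)     # 3 shards to store 2 shards of data: 1.5x
shards[0] = None             # simulate losing a disk
shards[0] = recover(shards, 0)
assert b"".join(shards[:k]).rstrip(b"\0") == data
```

The hardware offload does this per write at line rate; the CPU only picks offsets and posts the request.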
SLIDE 14
mmap
- Extension to classic mmap
○ Distributed
○ Typed - Protobuf, other formats planned
- Protobuf is amazing
○ Forward and backward compatible
○ Lattice
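For contrast, the classic single-node mmap that this extends, in plain Python: a file's bytes mapped into the address space, with stores going straight through the mapping and flush() (msync) persisting them. Bare Metal's mmap adds the distribution and protobuf typing on top; none of that appears in this baseline sketch.

```python
import mmap, os, tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"\x00" * 4096)          # size the backing file first

with mmap.mmap(fd, 4096) as m:
    m[0:5] = b"hello"                 # write through the mapping
    m.flush()                         # msync: push changes to the file

with open(path, "rb") as f:
    assert f.read(5) == b"hello"

os.close(fd)
os.unlink(path)
```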
SLIDE 15
mmap
- C++
const Test& test = mmap<Test>("/test");
int i = test.field();
- Python
test = Test()
bm.mmap("/test", test)
i = test.field()
SLIDE 16
mmap, recap
- Simple abstraction for data storage
- Fully accelerated, “mechanically friendly”
○ Thin wrapper over HW, device-to-device, zero copy
○ ~1.5x replication factor
○ Network automatically balanced
○ Solves straggler problem
○ No memory pinning or TLB thrashing, NUMA aware
SLIDE 17
Use cases
- Compute
○ Map-reduce, compute graphs, pipelines
- Typical setup
○ Spark, DL frameworks
○ Distribution using Akka, gRPC, MPI
○ Kubernetes or SLURM scheduling
- Limitations
○ No interop
○ Placement difficult
○ Inefficient resource allocation
SLIDE 18
Compute
- Simplest hardware implementation
○ Define a task, e.g. image resize, CUDA kernel, PyTorch graph
○ Place tasks in a queue
○ Work stealing - RDMA atomics
○ Device-to-device chaining - GPU Direct Async
- User abstraction?
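A local sketch of the work-stealing step: each worker pops tasks from its own deque and, when empty, steals the oldest task from a peer. In the design above the steal would be an RDMA atomic (e.g. compare-and-swap) on a remote queue without involving the remote CPU; here a plain lock stands in for that atomicity, and all names are made up for illustration.

```python
import threading
from collections import deque

class Worker:
    def __init__(self):
        self.queue = deque()
        self.lock = threading.Lock()
        self.done = []

    def push(self, task):
        with self.lock:
            self.queue.append(task)

    def pop_or_steal(self, peers):
        with self.lock:
            if self.queue:
                return self.queue.pop()          # own tasks: LIFO
        for peer in peers:
            with peer.lock:
                if peer.queue:
                    return peer.queue.popleft()  # steal oldest from victim
        return None

    def run(self, peers):
        while (task := self.pop_or_steal(peers)) is not None:
            self.done.append(task())

workers = [Worker() for _ in range(2)]
for i in range(8):
    workers[0].push(lambda i=i: i * i)   # all work lands on worker 0

threads = [threading.Thread(target=w.run,
                            args=([p for p in workers if p is not w],))
           for w in workers]
for t in threads: t.start()
for t in threads: t.join()

# Worker 1 had no tasks of its own; anything it completed was stolen.
assert sorted(workers[0].done + workers[1].done) == sorted(i * i for i in range(8))
```

The point of stealing over an explicit schedule: idle devices pull work themselves, so there is no central placement decision to get wrong.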
SLIDE 19
task
- Python
@bm.task
def compute(x, y):
    return x * y

# Runs locally
compute(1, 2)

# Might be rebalanced on cluster
data = bm.list()
bm.mmap("/data", data)
compute(data, 2)
SLIDE 20
task, recap
- Simple abstraction for CPU and device kernels
- Work stealing instead of explicit schedule
○ No GPU hoarding
○ Better work balancing
○ Dynamic placement, HA
- Device-to-device chaining
○ Data placed directly in device memory
○ Efficient pipelines, even for very short tasks
○ E.g. model parallelism, low-latency inference
SLIDE 21
Use cases
- Versioning
○ Track datasets and models
○ Deploy / rollback models
- Typical setup
○ Copy before update
○ Symlinks as versions to data
○ Staging / production environments split
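The typical setup above can be sketched in a few lines: each version is a full copy on disk and a symlink marks the current one, so deploy and rollback just repoint the symlink. Simple, but every update copies the whole dataset, which is the limitation the branch abstraction later avoids. Function names here are illustrative.

```python
import os, tempfile

root = tempfile.mkdtemp()

def set_current(version: str) -> None:
    """Atomically repoint the 'current' symlink at a version."""
    tmp = os.path.join(root, "current.tmp")
    os.symlink(os.path.join(root, version), tmp)
    os.replace(tmp, os.path.join(root, "current"))  # atomic rename on POSIX

def deploy(version: str, contents: bytes) -> None:
    with open(os.path.join(root, version), "wb") as f:  # copy-before-update
        f.write(contents)
    set_current(version)

deploy("v1", b"weights v1")
deploy("v2", b"weights v2")
with open(os.path.join(root, "current"), "rb") as f:
    assert f.read() == b"weights v2"

set_current("v1")                    # rollback: repoint at the old copy
with open(os.path.join(root, "current"), "rb") as f:
    assert f.read() == b"weights v1"
```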
SLIDE 22
Versioning
- Simplest hardware implementation
○ Keep multiple write-ahead logs
○ mmap updates
○ task queues
- User abstraction?
SLIDE 23
branch
- Like a git branch
○ But any size data
○ Simplifies collaboration, experimentation
○ Generalized staging / production split
- Simplifies HA
○ File system fsync, msync (very hard to get right! Rajimwale et al., DSN ’11)
○ Replaces transactions, e.g. queues, persistent memory
○ Allows duplicate work merge
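A hedged sketch of the branch semantics (class and method names made up): reads fall through to the parent, writes stay in the branch's private overlay until merged; like a git branch, but cheap for any size of data because nothing is copied on branch creation.

```python
class Store:
    def __init__(self, data=None, parent=None):
        self.local = dict(data or {})
        self.parent = parent

    def branch(self):
        return Store(parent=self)        # cheap: no data is copied

    def __getitem__(self, key):
        if key in self.local:
            return self.local[key]       # branch-local write wins
        if self.parent is not None:
            return self.parent[key]      # fall through to parent
        raise KeyError(key)

    def __setitem__(self, key, value):
        self.local[key] = value          # parent never touched

    def merge(self):
        self.parent.local.update(self.local)   # fold overlay into parent

main = Store({"field": 7})
b = main.branch()
b["field"] = 12             # only visible in the branch, as on the next slide
assert main["field"] == 7
assert b["field"] == 12
b.merge()                   # deploy: branch contents become visible in main
assert main["field"] == 12
```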
SLIDE 24
branch
- C++
Test* test = mutable_mmap<Test>("/test");
branch b;
// Only visible in current branch
test->set_field(12);
- Similar in Python
SLIDE 25
Summary
- mmap, task, and branch simplify hardware acceleration
- They help build pipelines, manage cluster resources, etc.
- Early micro-benchmarks suggest very high performance
SLIDE 26