Operating System Services for High Throughput Processors
Mark Silberstein, EE, Technion (Feb 2014)
Traditional Systems Software Stack
[Diagram: applications run on the OS, which runs on the CPU]
Modern Systems Software Stack
[Diagram: accelerated applications run on the OS and CPU, alongside manycore processors, FPGAs, DSPs, and GPUs]
GPUs make a difference...
- Top 10 fastest supercomputers use GPUs
- Physics, vision, chemistry, graph algorithms, bioinformatics, finance, linear algebra, HCI, meteorology
GPUs make a difference, but only in HPC!
GPUs cover physics, vision, chemistry, graph algorithms, bioinformatics, finance, linear algebra, HCI, meteorology. But what about web servers? Network services? Antivirus and file search?
Software-hardware gap is widening
[Diagram: accelerated applications target manycore processors, FPGAs, hybrid CPU-GPUs, DSPs, and GPUs, but the OS manages only the CPU]
Inadequate abstractions and management mechanisms.
Fundamentals in question
accelerators ≡ co-processors, or accelerators ≡ peer-processors?
Software stack for accelerated applications
[Diagram: accelerated applications sit on top of accelerator abstractions and mechanisms in the OS, which spans the CPU and the accelerators (manycore processors, FPGAs, DSPs, GPUs)]
Software stack for accelerator applications
- Accelerator applications (centralized and distributed)
- Accelerator OS support (inter-processor I/O, file system, network APIs)
- Accelerator I/O services (network, files)
- Hardware support for OS
This talk
- Accelerator OS support for GPUs: inter-processor I/O, file system, and network APIs [ASPLOS13, TOCS14]
- Accelerator I/O services (network, files), backed by storage and the network
- Accelerator applications, centralized and distributed
- GPU 101
- GPUfs: File I/O support for GPUs
- Future work
Hybrid GPU-CPU 101: Architecture
[Diagram: a CPU and a GPU, each with its own memory]
Co-processor model
[Animation: the CPU copies input data to GPU memory, launches a computation (a GPU kernel), and copies the results back to CPU memory]
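In CUDA this pattern looks roughly as follows. A minimal sketch, assuming an illustrative scale kernel and buffer sizes (not from the talk):

#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;                      // the offloaded computation
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h = (float *)calloc(n, sizeof(float));    // CPU memory
    float *d;
    cudaMalloc(&d, bytes);                           // GPU memory

    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice); // copy to GPU
    scale<<<(n + 255) / 256, 256>>>(d, n);           // kernel invocation
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost); // copy back; blocks until
                                                     // the kernel terminates
    cudaFree(d);
    free(h);
    return 0;
}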
Building systems with GPUs is hard. Why?
GPU kernels are isolated
[Diagram: the parallel algorithm runs on the GPU; data transfers, kernel invocation, and memory management are all driven by the CPU]
Example: accelerating photo collage
A CPU application:

while (Unhappy()) {
    Read_next_image_file();
    Decide_placement();
    Remove_outliers();
}
Offloading computations to GPU
The compute-intensive loop body moves to the GPU. Each iteration now involves a data transfer to the GPU, a kernel start, and a kernel termination.
Overheads
[Timeline: copy to GPU, invoke, kernel runs on the GPU, copy to CPU. The overheads: invocation latency, synchronization, transfer overhead]
Working around overheads
- Data reuse management
- Asynchronous invocation
- Double buffering (see the sketch below)
- Buffer size optimization
- GPU-CPU low-level tricks
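For example, double buffering with asynchronous copies might look like this. A sketch, assuming an illustrative process_chunk kernel and chunk size:

#include <cuda_runtime.h>

#define CHUNK (1 << 20)                              // elements per chunk (illustrative)

__global__ void process_chunk(float *buf, int n);    // user-supplied kernel

void run(float *host_data, int nchunks) {
    // host_data must be pinned (cudaMallocHost) for the copies
    // to be truly asynchronous.
    float *dev[2];
    cudaStream_t s[2];
    for (int b = 0; b < 2; b++) {
        cudaMalloc(&dev[b], CHUNK * sizeof(float));
        cudaStreamCreate(&s[b]);
    }
    for (int i = 0; i < nchunks; i++) {
        int b = i & 1;                               // alternate the two buffers
        cudaMemcpyAsync(dev[b], host_data + (size_t)i * CHUNK,
                        CHUNK * sizeof(float),
                        cudaMemcpyHostToDevice, s[b]);
        // The copy for the next chunk (other stream) overlaps this kernel.
        process_chunk<<<CHUNK / 256, 256, 0, s[b]>>>(dev[b], CHUNK);
    }
    cudaDeviceSynchronize();                         // drain both streams
    for (int b = 0; b < 2; b++) {
        cudaStreamDestroy(s[b]);
        cudaFree(dev[b]);
    }
}

Every one of these tricks is boilerplate that the application programmer must write and tune by hand.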
Management overhead
Why do we need to deal with these low-level system details (data reuse management, asynchronous invocation, double buffering, buffer size optimization, GPU-CPU low-level tricks) at all?
The reason is...
GPUs are peer-processors. They need OS services for I/O.
GPUfs: application view
[Diagram: CPUs, GPU1, GPU2, and GPU3 all access the host file system through GPUfs: open("shared_file"), write(), mmap()]
- System-wide shared namespace
- Persistent storage
- POSIX (CPU)-like API
Accelerating the collage app with GPUfs
- open/read files directly from the GPU; no CPU management code
- The GPUfs buffer cache overlaps computations and transfers (read-ahead)
- Data reuse and random data access are served from the cache
Understanding the hardware
GPU hardware characteristics
- Parallelism
- Heterogeneous memory
- Low serial performance
GPU hardware parallelism
1. Multi-core: the GPU contains multiple multiprocessors (MPs), each with many cores, all sharing GPU memory.
2. SIMD: each multiprocessor executes its threads as wide SIMD vectors, in lockstep.
3. Parallelism for latency hiding: each multiprocessor keeps many thread contexts (T1, T2, T3, ...) resident. When a thread issues a memory read (R 0x01, R 0x04, R 0x08, ...), the hardware switches to another ready context instead of stalling.
Putting it all together: 3 levels of hardware parallelism
[Diagram: a GPU with multiple MPs; each MP runs SIMD vectors across its cores and holds thread contexts 1..k for latency hiding, all over shared GPU memory]
Software-hardware mapping
[Diagram: software threads 1..n are mapped onto MPs, SIMD vector lanes, and resident thread contexts]
(1) 10,000s of concurrent threads!
NVIDIA K20X GPU: 64 thread contexts x 14 MPs x 32 SIMD lanes = 28,672 concurrent threads.
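A toy CUDA kernel makes the mapping concrete. A sketch with illustrative launch dimensions: thread blocks land on MPs, each block splits into 32-wide warps (SIMD vectors), and launching many blocks gives the hardware spare contexts for latency hiding.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void whoami() {
    // blockIdx.x: which thread block; blocks are scheduled onto MPs.
    // threadIdx.x: position inside the block; every 32 consecutive
    // threads form one warp that executes as a SIMD vector.
    if (threadIdx.x % 32 == 0)
        printf("block %d, warp %d\n", (int)blockIdx.x, (int)(threadIdx.x / 32));
}

int main() {
    // 448 blocks x 64 threads = 28,672 threads, matching a fully
    // occupied K20X; the surplus contexts are what hides memory latency.
    whoami<<<448, 64>>>();
    cudaDeviceSynchronize();
    return 0;
}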
(2) Each thread is slow
A GPU thread is roughly 100x slower than a CPU thread.
(3) Heterogeneous memory
CPU memory: 10-32 GB/s. CPU-GPU interconnect: 12 GB/s. GPU memory: 250 GB/s, about 20x the interconnect bandwidth.
GPUfs: file system layer for GPUs
Joint work with Bryan Ford, Idit Keidar, Emmett Witchel [ASPLOS13, TOCS14]
GPUfs: principled redesign of the whole file system stack
- Modified FS API semantics for massive parallelism
- Relaxed, distributed-FS-style consistency for non-uniform memory
- GPU-specific implementation of synchronization primitives, read-optimized data structures, memory allocation, ...
GPU program using GPUfs

__shared__ float buffer[1024];               // per-threadblock staging buffer
int fd = gopen(filename, O_GRDWR);           // open is cached on the GPU
gread(fd, offset, 1024*4, buffer);           // read at an explicit offset
buffer[myId] = compute(buffer[myId]);        // parallel compute
gwrite(fd, offset, 1024*4, buffer);          // write back at an explicit offset
gclose(fd);                                  // asynchronous close

The API supports GPU programming idioms:
- Parallel API calls: hundreds of threads perform the same call in lockstep
- open is cached on the GPU
- read/write take explicit offsets, for parallel access and low contention
- close is asynchronous
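For context, such a snippet would sit inside a kernel along these lines. A sketch, not code from the talk: the grid_crunch name and the per-block offset computation are illustrative, assuming the gopen/gread/gwrite/gclose device API above.

__device__ float compute(float v);                 // user-supplied function

__global__ void grid_crunch(const char *filename) {
    __shared__ float buffer[1024];
    int myId = threadIdx.x;                        // 1024 threads per block
    size_t offset = (size_t)blockIdx.x * 1024 * 4; // each block reads its own 4KB slice

    int fd = gopen(filename, O_GRDWR);             // all threads call together
    gread(fd, offset, 1024 * 4, buffer);
    buffer[myId] = compute(buffer[myId]);          // parallel compute
    __syncthreads();                               // buffer fully updated
    gwrite(fd, offset, 1024 * 4, buffer);
    gclose(fd);
}

Explicit per-block offsets keep thread blocks from contending on shared file state.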
High-level design
[Diagram: on the GPU, the application uses the GPUfs file API via a GPU file I/O library, backed by a GPUfs distributed buffer cache (page cache) in GPU memory. On the CPU, unchanged applications use the OS file API; GPUfs hooks in the OS connect everything to the host file system and the disk]
The design must handle massive parallelism and non-uniform memory.
Buffer cache semantics
Local or distributed file system data consistency?
Weak data consistency model
- close(sync)-to-open semantics, as in AFS: after the GPU issues write(1) and fsync(), a CPU open() followed by read(1) sees the data; a later GPU write(2) that has not been synced is not visible to the CPU (see the sketch below).
Reason: minimize inter-processor synchronization.
Implications:
- Overlapping writes
- Cache page false sharing
- Consistency protocol
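A sketch of the resulting behavior, assuming a GPU-side gfsync() call (the slide shows only fsync(); the exact name is an assumption), plus an illustrative SZ transfer size:

#define SZ 4096                        /* illustrative transfer size */

/* GPU side */
__global__ void producer(float *buf1, float *buf2) {
    int fd = gopen("shared_file", O_GRDWR);
    gwrite(fd, 0, SZ, buf1);           // write(1)
    gfsync(fd);                        // write(1) becomes visible at the next CPU open
    gwrite(fd, SZ, SZ, buf2);          // write(2): not synced
    gclose(fd);
}

/* CPU side, after the GPU's gfsync() */
char buf[SZ];
int fd = open("shared_file", O_RDONLY);
read(fd, buf, SZ);                     // sees write(1), but not write(2)

As in AFS, the CPU observes the file as of the last sync preceding its open.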
Implementation bits
GPUfs prototype
[Diagram: the GPU kernel program calls the GPUfs API, backed by a buffer cache, file state, and a consistency module in GPU memory; a CPU-GPU RPC channel connects it to an RPC daemon in CPU OS user space, which talks to the CPU OS kernel]
On-demand data transfer
[Animation: gread() in the GPU kernel posts a request on an RPC queue in write-shared CPU memory; the CPU RPC daemon serves it with pread() into a staging area of the buffer cache, transfers the data to GPU memory with cudaMemcpy(), and sends an ack]
The CPU acts as a file server.
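A sketch of such an RPC path. The slot layout, field names, and busy-wait protocol here are assumptions for illustration, not GPUfs code:

#include <unistd.h>
#include <cuda_runtime.h>

// One request slot in write-shared CPU memory, allocated with
// cudaHostAlloc(..., cudaHostAllocMapped) so both sides can see it.
struct rpc_req {
    volatile int ready;           // 0 = free, 1 = posted, 2 = acked
    int          fd;
    size_t       offset, size;
};

__device__ void post_read(volatile rpc_req *q, int fd, size_t off, size_t sz) {
    if (threadIdx.x == 0) {       // one thread posts on behalf of the block
        q->fd = fd; q->offset = off; q->size = sz;
        __threadfence_system();   // make the fields visible to the CPU
        q->ready = 1;
        while (q->ready != 2) { } // spin until the daemon acks
    }
    __syncthreads();
}

// CPU-side daemon, running while the GPU kernel executes; staging must be
// pinned memory for the async copy.
void rpc_daemon(volatile rpc_req *q, char *staging, char *gpu_buf,
                cudaStream_t copy_stream) {
    for (;;) {
        while (q->ready != 1) { }                     // wait for a request
        pread(q->fd, staging, q->size, q->offset);    // file -> staging area
        cudaMemcpyAsync(gpu_buf, staging, q->size,    // staging -> GPU memory
                        cudaMemcpyHostToDevice, copy_stream);
        cudaStreamSynchronize(copy_stream);
        q->ready = 2;                                 // ack
    }
}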
More implementation challenges (in the paper):
- Paging
- Dynamic data structures and memory allocators
- Lock-less, read-optimized radix tree
- Inter-processor consistency
GPUfs impact on GPU programs
- Memory overhead
- Very little CPU involvement
- Pay-as-you-go design
Evaluation
All benchmarks are written as self-contained GPU kernels, with no CPU part.
Real applications
- Approximate image matching: 4 GPUs are 6-9x faster than 8 CPU cores
- String matching in the Linux kernel tree (33,000 files): 1 GPU is 6-7x faster than 8 CPU cores
- GPUfs overhead: 7%
Summary: GPUfs
GPUfs is the first system to provide file I/O for GPUs.
Open issues
- Buffer cache: consistency, CPU page cache interaction, page faults, mmap
- Direct access to storage devices
- Optimizing file naming mechanisms
- Applications: image format readers, git-grep
- Other accelerators: FPGAs, Xeon Phi, DSPs
- GPU networking
Summary
- System performance will rely on accelerators
- Programmable accelerators are peer-processors (not co-processors)
- They need I/O abstractions and OS services
- GPUnet and GPUfs are a first step in this direction