SLIDE 1

Operating System Services for High Throughput Processors

Mark Silberstein EE, Technion

SLIDE 2

Traditional Systems Software Stack

[Stack diagram: Applications → OS → CPU]

SLIDE 3

Modern Systems Software Stack

[Stack diagram: Accelerated applications → OS → CPU, with manycore processors, FPGAs, DSPs, and GPUs attached alongside, outside the OS-managed stack]

SLIDE 4

GPUs make a difference...

  • Top 10 fastest supercomputers use GPUs

Physics, Vision, Chemistry, Graph Algorithms, Bioinformatics, Finance, Linear Algebra, HCI, Meteorology

SLIDE 5

GPUs make a difference, but only in HPC!

Physics, Vision, Chemistry, Graph Algorithms, Bioinformatics, Finance, Linear Algebra, HCI, Meteorology. But Web servers? Network services? Antivirus and file search?

SLIDE 6

Software-hardware gap is widening

[Stack diagram: accelerated applications sit above an OS that manages only the CPU; GPUs, FPGAs, DSPs, hybrid CPU-GPUs, and manycore processors are left outside]

Inadequate abstractions and management mechanisms

SLIDE 7

Fundamentals in question

accelerators ≡ co-processors, or accelerators ≡ peer-processors?

SLIDE 8

Software stack for accelerated applications

[Stack diagram: accelerated applications on top; the OS, extended with accelerator abstractions and mechanisms, manages the CPU together with GPUs, FPGAs, DSPs, and manycore processors]

SLIDE 9

Software stack for accelerator applications

[Stack diagram, bottom to top: CPU and accelerators (GPUs, FPGAs, DSPs, manycore processors) with hardware support for the OS; the OS with accelerator abstractions and mechanisms; accelerator OS support (interprocessor I/O, file system, network APIs); accelerator I/O services (network, files); accelerator applications (centralized and distributed)]

SLIDE 10

This talk

[Same stack diagram, highlighting the layers covered in this talk: accelerator OS support (interprocessor I/O, file system, network APIs) and accelerator I/O services (network, files) for GPUs, connecting applications to storage and the network (ASPLOS13, TOCS14)]

SLIDE 11

  • GPU 101
  • GPUfs: File I/O support for GPUs
  • Future work
SLIDE 12

Hybrid GPU-CPU 101: Architecture

[Diagram: a CPU with its own memory and a GPU with its own memory, connected by a link]

SLIDE 13

Co-processor model

[Diagram: the computation and its data start in CPU memory]

SLIDE 14

Co-processor model

[Diagram: the input data is copied from CPU memory to GPU memory]

SLIDE 15

Co-processor model

[Diagram: the computation runs on the GPU as a GPU kernel]

SLIDE 16

Co-processor model

[Diagram: the results are copied back from GPU memory to CPU memory]
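To make this copy-compute-copy pattern concrete, here is a minimal CUDA sketch of the co-processor model; the kernel and sizes are illustrative, not from the talk:

    #include <cuda_runtime.h>

    // Illustrative kernel: double every element.
    __global__ void scale(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    void run(const float *host_in, float *host_out, int n) {
        float *dev;
        cudaMalloc(&dev, n * sizeof(float));
        // 1. Copy input from CPU memory to GPU memory.
        cudaMemcpy(dev, host_in, n * sizeof(float), cudaMemcpyHostToDevice);
        // 2. Invoke the GPU kernel.
        scale<<<(n + 255) / 256, 256>>>(dev, n);
        // 3. Copy results back (synchronizes with the kernel on this stream).
        cudaMemcpy(host_out, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(dev);
    }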

SLIDE 17

Building systems with GPUs is hard. Why?

SLIDE 18

GPU kernels are isolated

[Diagram: the parallel algorithm runs on the GPU, but data transfers, kernel invocation, and memory management must be driven from the CPU]

SLIDE 19

Example: accelerating photo collage

[Diagram: the application runs on CPUs]

    While (Unhappy()) {
        Read_next_image_file()
        Decide_placement()
        Remove_outliers()
    }

SLIDE 20

Offloading computations to GPU

[Diagram: the application runs on CPUs]

    While (Unhappy()) {
        Read_next_image_file()
        Decide_placement()
        Remove_outliers()
    }

Move the computation to the GPU.

SLIDE 21

Offloading computations to GPU

[Timeline: the CPU transfers data to the GPU, starts the kernel, and waits for kernel termination before copying results back]

SLIDE 22

Overheads

[Timeline: copy to GPU → invoke kernel → copy to CPU, incurring invocation latency, synchronization cost, and transfer overhead]
SLIDE 23

Working around overheads

[Same timeline: copy to GPU → invoke → copy to CPU]

  • Data reuse management
  • Asynchronous invocation
  • Double buffering
  • Buffer size optimization
  • GPU-CPU low-level tricks

A sketch of two of these tricks follows below.
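As an illustration of two of these workarounds, asynchronous invocation and double buffering, here is a hedged CUDA sketch; the process() kernel and the chunking are assumptions, not from the talk. The copy of one chunk on one stream overlaps the kernel for the previous chunk on the other:

    #include <cuda_runtime.h>

    __global__ void process(float *chunk, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) chunk[i] *= chunk[i];  // stand-in computation
    }

    // 'host' should be pinned (cudaMallocHost) for truly asynchronous copies.
    void process_chunks(float *host, int nchunks, int chunk_elems) {
        float *dev[2];
        cudaStream_t s[2];
        for (int b = 0; b < 2; b++) {
            cudaMalloc(&dev[b], chunk_elems * sizeof(float));
            cudaStreamCreate(&s[b]);
        }
        for (int i = 0; i < nchunks; i++) {
            int b = i % 2;  // alternate buffers and streams
            cudaMemcpyAsync(dev[b], host + (size_t)i * chunk_elems,
                            chunk_elems * sizeof(float),
                            cudaMemcpyHostToDevice, s[b]);
            process<<<(chunk_elems + 255) / 256, 256, 0, s[b]>>>(dev[b], chunk_elems);
        }
        cudaDeviceSynchronize();
        for (int b = 0; b < 2; b++) {
            cudaFree(dev[b]);
            cudaStreamDestroy(s[b]);
        }
    }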

SLIDE 24

Management overhead

Why do we need to deal with low-level system details?

  • Data reuse management
  • Asynchronous invocation
  • Double buffering
  • Buffer size optimization
  • GPU-CPU low-level tricks

SLIDE 25

The reason is...

GPUs are peer-processors. They need I/O and OS services.

SLIDE 26

GPUfs: application view

[Diagram: CPUs and GPU1, GPU2, GPU3 all open("shared_file"), read(), write(), and mmap() through GPUfs into the host file system]

SLIDE 27

GPUfs: application view

[Same diagram: CPUs and GPUs share files through GPUfs]

  • System-wide shared namespace
  • Persistent storage
  • POSIX (CPU)-like API

SLIDE 28

Accelerating collage app with GPUfs

[Diagram: the application open()s and read()s image files directly from GPU code through GPUfs]

No CPU management code; see the sketch below.
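As a hedged sketch (not code from the talk) of what the collage loop might look like once it moves entirely into GPU code with GPUfs; gopen/gread/gclose are the GPUfs calls shown later, while the application helpers are hypothetical:

    // Hypothetical application helpers, GPU-side:
    __device__ bool unhappy();
    __device__ const char *next_image_name();
    __device__ void decide_placement(float *img);
    __device__ void remove_outliers(float *img);

    // The whole loop lives in GPU code; no CPU-side transfer or
    // invocation management is needed.
    __device__ void collage_loop(float *img, size_t img_bytes) {
        while (unhappy()) {
            int fd = gopen(next_image_name(), O_GRDWR);
            gread(fd, 0, img_bytes, img);   // read the file from GPU code
            decide_placement(img);
            remove_outliers(img);
            gclose(fd);
        }
    }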

SLIDE 29

Accelerating collage app with GPUfs

[Diagram: the GPUfs buffer cache on the GPU overlaps computations with transfers and enables read-ahead]

SLIDE 30

Accelerating collage app with GPUfs

[Diagram: the GPUfs buffer cache also serves data reuse and random data access]

SLIDE 31

Understanding the hardware

SLIDE 32

GPU hardware characteristics

  • Parallelism
  • Heterogeneous memory
  • Low serial performance

SLIDE 33

GPU hardware parallelism

  • 1. Multi-core

[Diagram: a GPU with GPU memory and multiple multiprocessors (MPs), each built from many cores]

SLIDE 34

GPU hardware parallelism

  • 2. SIMD

[Diagram: each MP executes SIMD vectors; its cores run in lockstep]

SLIDE 35

GPU hardware parallelism

  • 3. Parallelism for latency hiding

[Diagram: each MP holds execution state for many threads (T1, T2, T3) and switches among them to hide memory latency]

SLIDE 36

GPU Hardware

  • 3. Parallelism for latency hiding

[Animation step: thread T1 issues memory request R 0x01; the MP switches to another thread instead of stalling]

SLIDE 37

GPU Hardware

  • 3. Parallelism for latency hiding

[Animation step: a second thread issues memory request R 0x04 while R 0x01 is still outstanding]

SLIDE 38

GPU Hardware

  • 3. Parallelism for latency hiding

[Animation step: a third thread issues R 0x08; three memory requests are now in flight]

SLIDE 39

GPU Hardware

  • 3. Parallelism for latency hiding

[Animation step: the outstanding requests complete and the threads resume; multithreading has hidden the memory latency]

SLIDE 40

Putting it all together: 3 levels of hardware parallelism

[Diagram: GPU memory feeding multiple MPs; each MP has SIMD cores and k thread contexts with per-thread state]

SLIDE 41

Software-Hardware mapping

[Diagram: software threads (Thread 1 … Thread n) are mapped onto the MPs, SIMD lanes, and thread contexts of the GPU]

SLIDE 42

(1) 10,000s of concurrent threads!

[Diagram: 14 MPs × 32-wide SIMD × 64 thread contexts per MP]

NVIDIA K20x GPU: 64 × 14 × 32 = 28,672 concurrent threads

SLIDE 43

(2) Each thread is slow

[Same hardware diagram]

~100x slower than a CPU thread

SLIDE 44

(3) Heterogeneous memory

[Diagram: CPU memory at 10-32 GB/s, CPU-GPU link at 12 GB/s, GPU memory at 250 GB/s (roughly a 20x bandwidth gap between GPU memory and the interconnect)]

SLIDE 45

GPUfs: file system layer for GPUs

Joint work with Bryan Ford, Idit Keidar, Emmett Witchel [ASPLOS13, TOCS14]

SLIDE 46

GPUfs: principled redesign of the whole file system stack

  • Modified FS API semantics for massive parallelism
  • Relaxed distributed FS consistency for non-uniform memory
  • GPU-specific implementation of synchronization primitives, read-optimized data structures, memory allocation, …

SLIDE 47

GPU program using GPUfs

    __shared__ float buffer[1024];
    int fd = gopen(filename, O_GRDWR);
    gread(fd, offset, 1024*4, buffer);
    buffer[myId] = compute(buffer[myId]); // parallel compute
    gwrite(fd, offset, 1024*4, buffer);
    gclose(fd);

SLIDE 48

GPU program using GPUfs

(Same code as on slide 47.)

Supporting GPU programming idioms

SLIDE 49

GPU program using GPUfs

(Same code as on slide 47.)

Parallel API calls: hundreds of threads perform the same call in lockstep
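To illustrate what lockstep means here, a sketch of how such a call could be structured; this is an illustration, not the actual GPUfs internals, and slow_open_rpc() is hypothetical. Every thread in the block enters the call, one thread does the expensive work, and the result is broadcast through shared memory:

    __device__ int slow_open_rpc(const char *name, int flags);  // hypothetical

    // All threads of a block call this together, in lockstep.
    __device__ int gopen_blockwide(const char *name, int flags) {
        __shared__ int fd;
        if (threadIdx.x == 0)
            fd = slow_open_rpc(name, flags);  // one thread does the slow work
        __syncthreads();                      // every thread now sees the same fd
        return fd;
    }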

SLIDE 50

GPU program using GPUfs

(Same code as on slide 47.)

  • open() is cached on the GPU
SLIDE 51

GPU program using GPUfs

(Same code as on slide 47.)

read/write: explicit offsets for parallel access and low contention

SLIDE 52

GPU program using GPUfs

(Same code as on slide 47.)

Asynchronous close

SLIDE 53

High-level design

[Architecture diagram: on the GPU, applications use the GPUfs file API via the GPUfs GPU file I/O library, backed by a page cache in GPU memory; on the CPU, unchanged applications use the OS file system interface; GPUfs hooks in the OS tie both sides into a distributed buffer cache over the host file system and disk]

SLIDE 54

High-level design

[Same architecture diagram, annotated with the two key constraints: massive parallelism and non-uniform memory]

SLIDE 55

Buffer cache semantics

Local or distributed file system data consistency?

SLIDE 56

Weak data consistency model

  • close(sync)-to-open semantics (as in AFS)

[Timeline: the CPU issues write(1); the GPU then open()s the file and read()s version 1; a later GPU write(2) is not visible to the CPU until fsync()]

Reason: minimize inter-processor synchronization.

Implications:
  • Overlapping writes
  • Cache page false sharing
  • Consistency protocol

A short illustration of these session semantics follows below.
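A short, hedged illustration of the session semantics using the GPUfs API from the earlier slides; gfsync() is assumed here as the GPU-side counterpart of the fsync() shown in the timeline:

    // GPU code: writes become visible to CPU readers only after a sync.
    int fd = gopen("shared_file", O_GRDWR);
    gwrite(fd, 0, size, buffer);  // write(2): not yet visible to the CPU
    gfsync(fd);                   // after this, a fresh CPU open() sees the data
    gclose(fd);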
SLIDE 57

Implementation bits

SLIDE 58

GPUfs prototype

[Diagram: a GPU kernel program calls the GPUfs API; the buffer cache and file state live on the GPU; a CPU-GPU RPC channel connects to an RPC daemon in CPU user space, with a GPUfs consistency module in the CPU OS kernel]

SLIDE 59

GPUfs prototype

[Same diagram as slide 58]

SLIDE 60

On-demand data transfer

[Diagram: the GPU kernel calls gread(); the request is placed on an RPC queue in write-shared CPU memory and picked up by the CPU RPC daemon]

SLIDE 61

On-demand data transfer

[Diagram: the CPU RPC daemon services gread() with pread() into a staging area of the buffer cache, cudaMemcpy()s the data to GPU memory, and acknowledges the GPU kernel through the RPC queue]

SLIDE 62

On-demand data transfer

[Same diagram as slide 61]

The CPU acts as a file server; a sketch of the idea follows below.
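A hedged sketch of the RPC mechanism: a request slot in zero-copy, host-mapped ("write-shared") memory that the GPU fills and the CPU daemon polls. All names here are illustrative, not the actual GPUfs implementation:

    #include <cstddef>

    // Lives in host-mapped memory (cudaHostAllocMapped), visible to CPU and GPU.
    struct RpcSlot {
        volatile int ready;     // 0 = free, 1 = request pending
        int op;                 // e.g. a hypothetical OP_READ
        size_t offset, size;
    };

    // GPU side: publish a request and spin until the CPU daemon serves it.
    __device__ void grequest(RpcSlot *slot, int op, size_t off, size_t sz) {
        slot->op = op; slot->offset = off; slot->size = sz;
        __threadfence_system();  // make the request fields visible to the CPU
        slot->ready = 1;
        while (slot->ready) ;    // CPU daemon clears 'ready' after pread()+copy
    }

On the CPU side, the RPC daemon polls the slot, performs pread() into the staging area, cudaMemcpy()s the data into the GPU buffer cache, and clears 'ready'.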

SLIDE 63

More implementation challenges

  • Paging
  • Dynamic data structures and memory allocators
  • Lock-less read-optimized radix tree
  • Inter-processor consistency

(Details in the paper.)

SLIDE 64

GPUfs impact on GPU programs

  • Memory overhead
  • Very little CPU involvement
  • Pay-as-you-go design

SLIDE 65

Evaluation

All benchmarks are written as self-contained GPU kernels – no CPU-side code.

SLIDE 66

Real applications

  • Approximate image matching: 4 GPUs are 6-9x faster than 8 CPU cores
  • String matching in the Linux kernel tree (33,000 files): 1 GPU is 6-7x faster than 8 CPU cores; GPUfs overhead is 7%
SLIDE 67

Summary - GPUfs

GPUfs is the first system to provide file I/O for GPUs.

SLIDE 68

Open issues

  • Buffer cache: consistency, CPU page cache interaction, page faults, mmap
  • Direct access to storage devices
  • Optimizing file naming mechanisms
  • Applications: image format readers, git-grep
  • Other accelerators – FPGAs, Xeon Phi, DSPs
  • GPU networking
SLIDE 69

Summary

  • System performance will rely on accelerators
  • Programmable accelerators are peer-processors (not co-processors)
  • They need I/O abstractions and OS services
  • GPUnet, GPUfs – a first step in this direction
SLIDE 70

Set GPUs free!

Interested in a project? Talk to me. Course 046274 (GPU-accelerated systems): Spring 2014, Mon 16:30.

mark@ee.technion.ac.il