SLIDE 1

GPUfs: Integrating a file system with GPUs

Mark Silberstein (UT Austin/Technion), Bryan Ford (Yale), Idit Keidar (Technion), Emmett Witchel (UT Austin)

SLIDE 2

Traditional System Architecture

Diagram: Applications → OS → CPU.

SLIDE 3

Modern System Architecture

Diagram: accelerated applications now run across manycore processors, FPGAs, hybrid CPU-GPUs, and GPUs, while the OS still runs only on the CPU.

SLIDE 4

Software-hardware gap is widening

Diagram: the same picture: the OS covers only the CPU while the accelerators (manycore processors, FPGAs, hybrid CPU-GPUs, GPUs) multiply.

SLIDE 5

Software-hardware gap is widening

Diagram: the same picture; accelerators are managed through ad-hoc abstractions and management mechanisms.

SLIDE 6

On-accelerator OS support closes the programmability gap

Diagram: on-accelerator OS support, coordinating with the host OS, lets native accelerator applications run on manycore processors, FPGAs, hybrid CPU-GPUs, and GPUs.

SLIDE 7

GPUfs: File I/O support for GPUs

  • Motivation
  • Goals
  • Understanding the hardware
  • Design
  • Implementation
  • Evaluation
SLIDE 8

Building systems with GPUs is hard. Why?

SLIDE 9

Goal of GPU programming frameworks

Diagram: the framework mediates data transfers, GPU invocation, and memory management between the CPU and the parallel algorithm on the GPU.

SLIDE 10

Headache for GPU programmers

Diagram: the parallel algorithm runs on the GPU, but the data transfer, invocation, and memory management code stays on the CPU.

Half of the CUDA SDK 4.1 samples: at least 9 CPU LOC per 1 GPU LOC

SLIDE 11

GPU kernels are isolated

Diagram: the parallel algorithm on the GPU is walled off from the CPU-side data transfer, invocation, and memory management code.

SLIDE 12

Example: accelerating photo collage

http://www.codeproject.com/Articles/36347/Face-Collage

    while (Unhappy()) {
        Read_next_image_file();
        Decide_placement();
        Remove_outliers();
    }

SLIDE 13

CPU Implementation

Diagram: the application runs across several CPUs.

    while (Unhappy()) {
        Read_next_image_file();
        Decide_placement();
        Remove_outliers();
    }

SLIDE 14

Offloading computations to GPU

Diagram: the application on the CPUs, with the compute loop marked "Move to GPU":

    while (Unhappy()) {
        Read_next_image_file();
        Decide_placement();
        Remove_outliers();
    }

SLIDE 15

Offloading computations to GPU

Diagram: CPU-GPU timeline of kernel start, data transfer, and kernel termination.

Co-processor programming model

SLIDE 16

Kernel start/stop overheads

Diagram: CPU-GPU timeline: copy to GPU, invoke, copy to CPU. Overheads: invocation latency, synchronization, cache flush.

SLIDE 17

Hiding the overheads

Diagram: the same timeline, with transfers overlapped: manual data reuse management, asynchronous invocation, double buffering (sketched below).
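To make "double buffering" concrete, here is a minimal CUDA sketch of the pattern this slide alludes to; the process_chunk kernel, chunk size, and buffer handling are hypothetical, not code from the talk:

    // Double buffering with two CUDA streams: while stream b copies and
    // processes chunk i, the other stream works on chunk i+1.
    // NOTE: host should be pinned memory (cudaHostAlloc) for real overlap.
    #include <cuda_runtime.h>

    __global__ void process_chunk(float *data, size_t n) {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;               // placeholder computation
    }

    void run(const float *host, size_t total, size_t chunk) {
        float *dev[2];
        cudaStream_t s[2];
        for (int b = 0; b < 2; b++) {
            cudaMalloc(&dev[b], chunk * sizeof(float));
            cudaStreamCreate(&s[b]);
        }
        for (size_t off = 0, b = 0; off < total; off += chunk, b ^= 1) {
            size_t n = (total - off < chunk) ? total - off : chunk;
            cudaMemcpyAsync(dev[b], host + off, n * sizeof(float),
                            cudaMemcpyHostToDevice, s[b]);
            process_chunk<<<(unsigned)((n + 255) / 256), 256, 0, s[b]>>>(dev[b], n);
        }
        cudaDeviceSynchronize();
        for (int b = 0; b < 2; b++) {
            cudaFree(dev[b]);
            cudaStreamDestroy(s[b]);
        }
    }

Even this simplified version shows how much host-side machinery the pattern demands, which is exactly the slide's point.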

SLIDE 18

Implementation complexity

Diagram: the same overlapped timeline (manual data reuse management, asynchronous invocation, double buffering).

Management overhead

SLIDE 19

Implementation complexity

Diagram: the same overlapped timeline (manual data reuse management, asynchronous invocation, double buffering).

Why do we need to deal with low-level system details?

Management overhead

SLIDE 20

The reason is...

GPUs are peer processors. They need OS services such as file I/O.

SLIDE 21

GPUfs: application view

Diagram: CPUs and GPU1-GPU3 all access the host file system through GPUfs. One GPU calls open("shared_file") and mmap(); another calls open("shared_file") and write().

SLIDE 22

GPUfs: application view

Diagram: as on the previous slide, CPUs and GPUs share files through GPUfs over the host file system.

System-wide shared namespace
Persistent storage
POSIX (CPU)-like API
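A minimal sketch of what this view means in code, using the gopen/gread names and argument order from the deck's later slides; the file name, sizes, and device-side details are illustrative assumptions:

    // CPU side: plain POSIX against the shared namespace.
    //   int fd = open("shared_file", O_RDWR);
    //   write(fd, buf, len);  close(fd);

    // GPU side (hypothetical kernel): same file, same namespace.
    __global__ void consumer(float *out) {
        int fd = gopen("shared_file", O_GRDWR);
        float chunk[8];
        size_t offset = (size_t)blockIdx.x * sizeof(chunk);
        gread(fd, sizeof(chunk), chunk, offset);   // (fd, size, buf, offset), as on the sqrt slide
        out[blockIdx.x] = chunk[0];                // placeholder use of the data
        gclose(fd);
    }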

SLIDE 23

Accelerating collage app with GPUfs

Diagram: the GPU opens and reads image files directly through GPUfs.

No CPU management code

SLIDE 24

Accelerating collage app with GPUfs

Diagram: GPUfs buffer caches on CPU and GPU overlap computations with transfers and provide read-ahead.

SLIDE 25

Accelerating collage app with GPUfs

Diagram: the GPUfs buffer cache also captures data reuse and serves random data access.

SLIDE 26

Challenge

GPU ≠ CPU

SLIDE 27

Massive parallelism

NVIDIA Fermi*: 23,000 active threads. AMD HD5870*: 31,000 active threads.

*From M. Houston/A. Lefohn/K. Fatahalian, "A trip through the architecture of modern GPUs"

Parallelism is essential for performance in deeply multi-threaded wide-vector hardware

SLIDE 28

Heterogeneous memory

Diagram: CPU memory (10-32 GB/s) and GPU memory (288-360 GB/s) linked by a 6-16 GB/s interconnect, roughly a 20x bandwidth gap.

GPUs inherently impose high bandwidth demands on memory

SLIDE 29

How to build an FS layer on this hardware?
SLIDE 30

GPUfs: principled redesign of the whole file system stack

  • Relaxed FS API semantics for parallelism
  • Relaxed FS consistency for heterogeneous memory
  • GPU-specific implementation of synchronization primitives, lock-free data structures, memory allocation, … (see the sketch below)
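To give the last bullet some substance, a minimal sketch of a GPU-side spinlock built on CUDA atomics; this illustrates the kind of primitive GPUfs must provide, not GPUfs's actual implementation:

    // One thread per block takes the lock for its whole SIMD group,
    // avoiding divergent spinning inside a warp.
    __device__ int lock_word = 0;

    __device__ void block_lock() {
        if (threadIdx.x == 0)
            while (atomicCAS(&lock_word, 0, 1) != 0) { /* spin */ }
        __syncthreads();              // group enters only once the lock is held
    }

    __device__ void block_unlock() {
        __syncthreads();              // group leaves the critical section together
        if (threadIdx.x == 0) {
            __threadfence();          // publish writes made under the lock
            atomicExch(&lock_word, 0);
        }
    }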

SLIDE 31

GPUfs high-level design

Diagram: on the CPU side, unchanged applications use the OS file API over the host file system, with GPUfs hooks at the OS file-system interface and the disk below; on the GPU side, applications use the GPUfs file API through the GPUfs GPU file I/O library. A GPUfs distributed buffer cache spans CPU memory and a page cache in GPU memory. Design drivers: massive parallelism, heterogeneous memory.

SLIDE 32

GPUfs high-level design

Diagram: same as the previous slide.

SLIDE 33

Buffer cache semantics

Local or Distributed file system data consistency?

SLIDE 34

GPUfs buffer cache: weak data consistency model

  • close(sync)-to-open semantics (as in AFS)

Diagram: GPU1 issues write(1) and fsync(); GPU2 then issues open() and read(1) and sees the write. A later write(2) is not visible to the CPU until synced.

Remote-to-local memory performance ratios are similar to a distributed system's (remote >> local).
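A sketch of how close(sync)-to-open consistency plays out across two GPUs, using the g* calls from the next slide; gfsync's exact signature is an assumption:

    // GPU1 (hypothetical producer): the write becomes visible to other
    // processors only after gfsync() (or gclose()), as in AFS.
    __global__ void producer(void) {
        int fd = gopen("shared_file", O_GRDWR);
        float v = 1.0f;
        gwrite(fd, sizeof(v), &v, 0);
        gfsync(fd);                    // publish point
        gclose(fd);
    }

    // GPU2 (hypothetical consumer): a gopen() issued after GPU1's gfsync()
    // must observe the write; a gopen() issued before it may see stale data.
    __global__ void consumer(float *out) {
        int fd = gopen("shared_file", O_GRDWR);
        gread(fd, sizeof(float), out, 0);
        gclose(fd);
    }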

SLIDE 35

On-GPU File I/O API

    CPU call        GPU call
    open/close      gopen/gclose
    read/write      gread/gwrite
    mmap/munmap     gmmap/gmunmap
    fsync/msync     gfsync/gmsync
    ftrunc          gftrunc

(In the paper)

Changes in the semantics are crucial

SLIDE 36

Implementation bits

  • Paging support
  • Dynamic data structures and memory allocators
  • Lock-free radix tree
  • Inter-processor communications (IPC)
  • Hybrid H/W-S/W barriers
  • Consistency module in the OS kernel

(In the paper)

~1.5K GPU LOC, ~600 CPU LOC

SLIDE 37

Evaluation

All benchmarks are written as a GPU kernel: no CPU-side development

SLIDE 38

Matrix-vector product (inputs/outputs in files)

Chart: throughput (MB/s) vs. input matrix size (280 MB to 11,200 MB) for CUDA pipelined, CUDA optimized, and GPU file I/O versions. Vector: 1x128K elements; page size: 2 MB; GPU: Tesla C2075.
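For flavor, a hedged sketch of what the GPU file I/O version's kernel might look like; the file name, tiling, and 256-thread block size are assumptions, not the paper's benchmark code:

    // Each block computes one output element y[r] = dot(A[r,:], x).
    // Launch with 256 threads per block; the matrix streams in via gread.
    #define N    (128 * 1024)         // vector length (1x128K, per the slide)
    #define TILE 1024                 // floats staged per cooperative gread

    __global__ void matvec(const float *x, float *y, int rows) {
        __shared__ float tile[TILE];
        __shared__ float partial[256];
        int r = blockIdx.x;
        if (r >= rows) return;
        int fd = gopen("matrix_file", O_GRDWR);
        float acc = 0.0f;
        for (size_t base = 0; base < N; base += TILE) {
            // One cooperative gread per block, per the GPUfs semantics slides.
            gread(fd, TILE * sizeof(float), tile,
                  ((size_t)r * N + base) * sizeof(float));
            __syncthreads();
            for (int i = threadIdx.x; i < TILE; i += blockDim.x)
                acc += tile[i] * x[base + i];
            __syncthreads();
        }
        partial[threadIdx.x] = acc;
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // tree reduction
            if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0) y[r] = partial[0];
        gclose(fd);
    }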

SLIDE 39

Word frequency count in text

  • Count frequency of modern English words in the works of Shakespeare and in the Linux kernel source tree

Challenges: dynamic working set; small files; lots of file I/O (33,000 files, 1-5 KB each); unpredictable output size

English dictionary: 58,000 words

SLIDE 40

Results

    Workload                              8 CPUs   GPU-vanilla   GPU-GPUfs
    Linux source (33,000 files, 524MB)    6h       50m (7.2X)    53m (6.8X)
    Shakespeare (1 file, 6MB)             292s     40s (7.3X)    40s (7.3X)

SLIDE 41

Results

    Workload                              8 CPUs   GPU-vanilla   GPU-GPUfs
    Linux source (33,000 files, 524MB)    6h       50m (7.2X)    53m (6.8X)
    Shakespeare (1 file, 6MB)             292s     40s (7.3X)    40s (7.3X)

Unbounded input/output size support, at an 8% overhead over the hand-coded GPU-vanilla version.

SLIDE 42

GPUfs


Code is available for download at: https://sites.google.com/site/silbersteinmark/Home/gpufs http://goo.gl/ofJ6J

GPUfs is the first system to provide native access to host OS services from GPU programs

SLIDE 43

Our life would have been easier with

  • PCI atomics
  • Preemptive background daemons
  • GPU-CPU signaling support
  • In-GPU exceptions
  • GPU virtual memory API (host-based or device)
  • Compiler optimizations for register-heavy libraries (seems to be addressed in CUDA 5.0)
SLIDE 44

Sequential access to a file: 3 versions

CUDA pipelined transfer: the CPU reads a chunk, transfers it to the GPU, and repeats.

CUDA whole-file transfer: the CPU reads the entire file, then transfers it to the GPU.

GPU file I/O: the GPU gmmap()s the file directly.

SLIDE 45

Sequential read: throughput vs. page size

Chart: throughput (MB/s, 500 to 4000) against page size (16K to 2M) for GPU file I/O, CUDA whole-file, and CUDA pipelined versions.

SLIDE 46

Sequential read: throughput vs. page size

Chart: same as the previous slide.

Benefit: decouples performance constraints from application logic.

SLIDE 47

Diagram: yesterday, accelerators as co-processors; tomorrow, accelerators as peers with on-accelerator OS support.

SLIDE 48

Diagram: yesterday, accelerators as co-processors; tomorrow, accelerators as peers. What about software?

SLIDE 49

Set GPUs free!

SLIDE 50

Parallel square root on GPU

    gpu_thread(thread_id i) {
        float buffer;
        int fd = gopen(filename, O_GRDWR);
        size_t offset = sizeof(float) * i;
        gread(fd, sizeof(float), &buffer, offset);
        buffer = sqrt(buffer);
        gwrite(fd, sizeof(float), &buffer, offset);
        gclose(fd);
    }

The same code runs in all of the thousands of GPU threads.
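For context, a hedged sketch of how this per-thread routine might be wrapped and launched in CUDA; the wrapper and launch geometry are hypothetical:

    // Hypothetical CUDA wrapper: the global thread id plays the role of i,
    // and gpu_thread is assumed to be a __device__ function.
    __global__ void sqrt_file_kernel(void) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        gpu_thread(i);
    }

    // Host side: one thread per float in the file (assumes n_floats % 256 == 0).
    //   sqrt_file_kernel<<<n_floats / 256, 256>>>();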

SLIDE 51

GPUfs impact on GPU programs

  • Memory overhead
  • Register pressure
  • Very little CPU coding
  • Makes exitless GPU kernels possible

Pay-as-you-go design

SLIDE 52

Preserve CPU semantics?

GPU threads are different from CPU threads

Diagram: two SIMD vectors, each a group of threads.

What does it mean to open/read/write/close/mmap a file in thousands of threads?

SLIDE 53

Preserve CPU semantics?

GPU threads are different from CPU threads

Diagram: two SIMD vectors, each a group of threads.

A GPU kernel is a single data-parallel application.

What does it mean to open/read/write/close/mmap a file in thousands of threads?

SLIDE 54

GPUfs semantics (see more discussion in the paper)

    int fd = gopen("filename", O_GRDWR);

One file descriptor per file: open()/close() state is cached on the GPU.
One call per SIMD vector: bulk-synchronous, cooperative execution.

Diagram: two SIMD vectors, each a group of threads.
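A hedged sketch of what these semantics look like from inside a kernel; how the library coordinates the group's single logical call is internal to GPUfs, and the staging scheme here is illustrative:

    // Every thread in the SIMD group reaches the call together; the library
    // executes it once on the group's behalf and returns the same fd to all.
    __global__ void cooperative_read(float *out, size_t n) {
        __shared__ float buf[256];
        int fd = gopen("filename", O_GRDWR);     // one cached descriptor per file

        // One cooperative gread per block into shared staging memory.
        gread(fd, sizeof(buf), buf, (size_t)blockIdx.x * sizeof(buf));
        __syncthreads();

        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < n) out[i] = buf[threadIdx.x];

        gclose(fd);                              // likewise one logical close
    }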

SLIDE 55

GPU hardware characteristics

Parallelism Heterogeneous memory

SLIDE 56

API semantics

    int fd = gopen("filename", O_GRDWR);

SLIDE 57

API semantics

    int fd = gopen("filename", O_GRDWR);

CPU

    int fd = gopen("filename", O_GRDWR);

SLIDE 58

This code runs in 100,000 GPU threads

    int fd = gopen("filename", O_GRDWR);

CPU ≠ GPU

    int fd = gopen("filename", O_GRDWR);