

SLIDE 1

Mark Silberstein - UT Austin

GPUfs: Integrating a file system with GPUs

Mark Silberstein (UT Austin/Technion) Bryan Ford (Yale), Idit Keidar (Technion) Emmett Witchel (UT Austin)

ASPLOS 2013

SLIDE 2

Traditional System Architecture

Applications

OS CPU

SLIDE 3

Modern System Architecture

Manycore processors FPGA Hybrid CPU-GPU GPUs

CPU

Accelerated applications

OS

SLIDE 4

Software-hardware gap is widening

Manycore processors FPGA Hybrid CPU-GPU GPUs

CPU

Accelerated applications

OS

SLIDE 5

Software-hardware gap is widening

Manycore processors FPGA Hybrid CPU-GPU GPUs

CPU

Accelerated applications

OS

Ad-hoc abstractions and management mechanisms

SLIDE 6

On-accelerator OS support closes the programmability gap

Manycore processors FPGA Hybrid CPU-GPU GPUs

CPU

Accelerated applications

OS

On-accelerator OS support

Native accelerator applications

Coordination

SLIDE 7

  • GPUfs: File I/O support for GPUs
  • Motivation
  • Goals
  • Understanding the hardware
  • Design
  • Implementation
  • Evaluation
SLIDE 8

Building systems with GPUs is hard. Why?

SLIDE 9

Data transfers GPU invocation Memory management

Goal of GPU programming frameworks

GPU

Parallel Algorithm

CPU

SLIDE 10

Headache for GPU programmers

Parallel Algorithm

GPU

Data transfers Invocation Memory management

CPU

Half of the CUDA SDK 4.1 samples: at least 9 CPU LOC per 1 GPU LOC

SLIDE 11

GPU kernels are isolated

Parallel Algorithm

GPU

Data transfers Invocation Memory management

CPU

SLIDE 12

Example: accelerating photo collage

http://www.codeproject.com/Articles/36347/Face-Collage

While(Unhappy()){
  Read_next_image_file()
  Decide_placement()
  Remove_outliers()
}

SLIDE 13

CPU Implementation

CPU CPU CPU Application

While(Unhappy()){
  Read_next_image_file()
  Decide_placement()
  Remove_outliers()
}

SLIDE 14

Offloading computations to GPU

CPU CPU CPU Application

While(Unhappy()){
  Read_next_image_file()
  Decide_placement()
  Remove_outliers()
}

Move to GPU

SLIDE 15

Offloading computations to GPU

GPU CPU Kernel start Data transfer Kernel termination

Co-processor programming model
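In CUDA terms, the co-processor model typically follows a copy in, invoke, copy out pattern. A minimal sketch (the kernel and buffer names are illustrative, not from the talk):

```cuda
// Co-processor pattern: copy to GPU, invoke kernel, copy back to CPU.
__global__ void process(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;   // stand-in for real work
}

void run_on_gpu(const float *host_in, float *host_out, int n) {
    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    // copy to GPU
    cudaMemcpy(d_in, host_in, n * sizeof(float), cudaMemcpyHostToDevice);
    // invoke
    process<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    // copy to CPU (implicitly synchronizes with the kernel)
    cudaMemcpy(host_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
}
```

Each of these steps pays the overheads the next slide enumerates.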

SLIDE 16

Kernel start/stop overheads

[Diagram: CPU-GPU timeline: copy to GPU, invoke, copy to CPU]

Invocation latency, synchronization, cache flush

SLIDE 17

Hiding the overheads

[Diagram: CPU-GPU timeline: copies to GPU overlapped with kernel execution via double buffering]

Manual data reuse management, asynchronous invocation, double buffering
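A minimal sketch of the asynchronous-invocation/double-buffering pattern, using two CUDA streams so the copy of chunk i+1 overlaps the kernel on chunk i (`process_chunk` and the sizes are illustrative; `host_in` must be pinned via `cudaMallocHost` for real overlap):

```cuda
__global__ void process_chunk(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;   // stand-in for real work
}

void run_pipelined(float *host_in, int nchunks, int chunk) {
    float *d_buf[2];
    cudaStream_t stream[2];
    for (int s = 0; s < 2; s++) {
        cudaMalloc(&d_buf[s], chunk * sizeof(float));
        cudaStreamCreate(&stream[s]);
    }
    for (int i = 0; i < nchunks; i++) {
        int s = i % 2;  // alternate between the two buffers/streams
        cudaMemcpyAsync(d_buf[s], host_in + (size_t)i * chunk,
                        chunk * sizeof(float), cudaMemcpyHostToDevice, stream[s]);
        // asynchronous invocation: returns immediately, runs after the copy
        process_chunk<<<(chunk + 255) / 256, 256, 0, stream[s]>>>(d_buf[s], chunk);
    }
    cudaDeviceSynchronize();
    for (int s = 0; s < 2; s++) {
        cudaFree(d_buf[s]);
        cudaStreamDestroy(stream[s]);
    }
}
```

This is exactly the kind of low-level management code the next slides call out as overhead.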

SLIDE 18

Implementation complexity

[Diagram: CPU-GPU timeline: copies to GPU overlapped with kernel execution via double buffering]

Manual data reuse management, asynchronous invocation, double buffering

Management overhead

SLIDE 19

Implementation complexity

[Diagram: CPU-GPU timeline: copies to GPU overlapped with kernel execution via double buffering]

Manual data reuse management, asynchronous invocation, double buffering

Why do we need to deal with low-level system details?

Management overhead

SLIDE 20

The reason is....

GPUs are peer processors. They need OS services, such as I/O.

SLIDE 21

GPUfs: application view

CPUs GPU1 GPU2 GPU3

open(“shared_file”), mmap()

open(“shared_file”), write()

Host File System GPUfs

SLIDE 22

GPUfs: application view

CPUs GPU1 GPU2 GPU3

open(“shared_file”), mmap()

open(“shared_file”), write()

Host File System GPUfs

  • System-wide shared namespace
  • Persistent storage
  • POSIX (CPU)-like API
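With this POSIX-like API, file access moves into the GPU kernel itself. A minimal sketch, using the gopen/gread/gclose names from the API slide later in the deck; the exact signatures and the flag name here are assumptions, not the published interface:

```cuda
// Sketch of GPU-side file I/O with GPUfs. Signatures and O_GRDONLY are assumed.
__global__ void sum_file(const char *path, float *result) {
    __shared__ float buf[1024];
    int fd = gopen(path, O_GRDONLY);          // the whole threadblock opens the file
    size_t offset = blockIdx.x * sizeof(buf);
    gread(fd, offset, sizeof(buf), buf);      // each block reads its own chunk
    // ... each thread then processes its slice of buf in parallel ...
    gclose(fd);
}
```

No host-side transfer or buffer-management code is needed; GPUfs moves data through its buffer cache on demand.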

SLIDE 23

Accelerating collage app with GPUfs

CPU CPU CPU GPUfs GPUfs

  • open/read from GPU

GPU

No CPU management code

SLIDE 24

CPU CPU CPU GPUfs buffer cache GPU

Overlapping computations and transfers; read-ahead

Accelerating collage app with GPUfs

SLIDE 25

CPU CPU CPU GPUfs GPU

Data reuse

Accelerating collage app with GPUfs

Random data access

SLIDE 26

Challenge

GPU ≠ CPU

SLIDE 27

Massive parallelism

NVIDIA Fermi* AMD HD5870*

From M. Houston/A. Lefohn/K. Fatahalian – A trip through the architecture of modern GPUs*

23,000 active threads 31,000 active threads

Parallelism is essential for performance in deeply multi-threaded wide-vector hardware

SLIDE 28

Heterogeneous memory

CPU memory: 10-32 GB/s; CPU-GPU transfer: 6-16 GB/s; GPU memory: 288-360 GB/s

~20x gap

GPUs inherently impose high bandwidth demands on memory

SLIDE 29

How to build an FS layer on this hardware?
SLIDE 30

GPUfs: principled redesign of the whole file system stack

  • Relaxed FS API semantics for parallelism
  • Relaxed FS consistency for heterogeneous memory
  • GPU-specific implementation of synchronization primitives, lock-free data structures, memory allocation, ….

SLIDE 31

GPUfs high-level design

[Diagram: GPU application using the GPUfs File API → GPUfs GPU file I/O library → GPU memory (page cache) → GPUfs distributed buffer cache → OS file system interface → host file system → disk; unchanged CPU applications use the OS File API through GPUfs hooks]

Massive parallelism. Heterogeneous memory.


SLIDE 33

Buffer cache semantics

Local or Distributed file system data consistency?

SLIDE 34

GPUfs buffer cache: weak data consistency model

  • close(sync)-to-open semantics (as in AFS)

[Diagram: GPU1 performs write(1), fsync(), write(2); GPU2 performs open(), read(1); write(2) is not yet visible to the CPU]

Remote-to-local memory performance ratio is similar to that of a distributed system
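Under close(sync)-to-open semantics, a producer/consumer pair across two GPUs might be sketched as follows. The gopen/gwrite/gfsync/gread names come from the API slide in this deck; the signatures and flag names are assumptions:

```cuda
// Close(sync)-to-open consistency sketch: data written on GPU1 is guaranteed
// visible only to opens that happen after gfsync/gclose on the writer.
__global__ void producer(const char *path, const char *data, size_t len) {
    int fd = gopen(path, O_GWRONLY);   // flag name assumed
    gwrite(fd, 0, len, data);
    gfsync(fd);                        // publish: visible to subsequent opens
    gclose(fd);
}

__global__ void consumer(const char *path, char *out, size_t len) {
    int fd = gopen(path, O_GRDONLY);   // must open after the producer's sync
    gread(fd, 0, len, out);
    gclose(fd);
}
```

A reader that opened the file before the producer's gfsync() may still see stale data, exactly as under AFS-style semantics.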

SLIDE 35

On-GPU File I/O API

  CPU API          GPU API
  open/close       gopen/gclose
  read/write       gread/gwrite
  mmap/munmap      gmmap/gmunmap
  fsync/msync      gfsync/gmsync
  ftrunc           gftrunc

(In the paper)

Changes in the semantics are crucial

SLIDE 36

Implementation bits

  • Paging support
  • Dynamic data structures and memory allocators
  • Lock-free radix tree
  • Inter-processor communications (IPC)
  • Hybrid H/W-S/W barriers
  • Consistency module in the OS kernel

(In the paper)

~1.5K GPU LOC, ~600 CPU LOC

SLIDE 37

Evaluation

All benchmarks are written as a GPU kernel: no CPU-side development

SLIDE 38

Matrix-vector product (Inputs/Outputs in files)

Vector 1x128K elements, Page size = 2MB, GPU=TESLA C2075

[Chart: throughput (MB/s, 500-3500) vs. input matrix size (MB, 280-11200) for CUDA pipelined, CUDA optimized, and GPU file I/O]

SLIDE 39

Word frequency count in text

  • Count the frequency of modern English words in the works of Shakespeare and in the Linux kernel source tree

Challenges:

  • Dynamic working set
  • Small files: lots of file I/O (33,000 files, 1-5KB each)
  • Unpredictable output size

English dictionary: 58,000 words

SLIDE 40

Results

                                      8 CPUs    GPU-vanilla    GPU-GPUfs
Linux source (33,000 files, 524MB)    6h        50m (7.2X)     53m (6.8X)
Shakespeare (1 file, 6MB)             292s      40s (7.3X)     40s (7.3X)

SLIDE 41

Results

                                      8 CPUs    GPU-vanilla    GPU-GPUfs
Linux source (33,000 files, 524MB)    6h        50m (7.2X)     53m (6.8X)
Shakespeare (1 file, 6MB)             292s      40s (7.3X)     40s (7.3X)

Unbounded input/output size support

8% overhead

SLIDE 42

GPUfs


Code is available for download at: https://sites.google.com/site/silbersteinmark/Home/gpufs http://goo.gl/ofJ6J

GPUfs is the first system to provide native access to host OS services from GPU programs