

SLIDE 1

Mark Silberstein - UT Austin

GPUfs: Integrating a file system with GPUs

Mark Silberstein (UT Austin/Technion) Bryan Ford (Yale), Idit Keidar (Technion) Emmett Witchel (UT Austin)

ASPLOS 2013

SLIDE 2

Traditional System Architecture

Applications

OS CPU

SLIDE 3

Modern System Architecture

Manycore processors FPGA Hybrid CPU-GPU GPUs

CPU

Accelerated applications

OS

SLIDE 4

Software-hardware gap is widening

Manycore processors FPGA Hybrid CPU-GPU GPUs

CPU

Accelerated applications

OS

SLIDE 5

Software-hardware gap is widening

Manycore processors FPGA Hybrid CPU-GPU GPUs

CPU

Accelerated applications

OS

Ad-hoc abstractions and management mechanisms

SLIDE 6

On-accelerator OS support closes the programmability gap

Manycore processors FPGA Hybrid CPU-GPU GPUs

CPU

Accelerated applications

OS

On-accelerator OS support

Native accelerator applications

Coordination

SLIDE 7

  • GPUfs: File I/O support for GPUs
  • Motivation
  • Goals
  • Understanding the hardware
  • Design
  • Implementation
  • Evaluation
SLIDE 8

Building systems with GPUs is hard. Why?

SLIDE 9

Data transfers GPU invocation Memory management

Goal of GPU programming frameworks

GPU

Parallel Algorithm

CPU

SLIDE 10

Headache for GPU programmers

Parallel Algorithm

GPU

Data transfers Invocation Memory management

CPU

Half of the CUDA SDK 4.1 samples: at least 9 CPU LOC per 1 GPU LOC

SLIDE 11

GPU kernels are isolated

Parallel Algorithm

GPU

Data transfers Invocation Memory management

CPU

SLIDE 12

Example: accelerating photo collage

http://www.codeproject.com/Articles/36347/Face-Collage

While(Unhappy()){
  Read_next_image_file()
  Decide_placement()
  Remove_outliers()
}

SLIDE 13

CPU Implementation

CPU CPU CPU Application

While(Unhappy()){
  Read_next_image_file()
  Decide_placement()
  Remove_outliers()
}

SLIDE 14

Offloading computations to GPU

CPU CPU CPU Application

While(Unhappy()){
  Read_next_image_file()
  Decide_placement()
  Remove_outliers()
}

Move to GPU

SLIDE 15

Offloading computations to GPU

GPU CPU Kernel start Data transfer Kernel termination

Co-processor programming model
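In CUDA terms, the co-processor model typically follows a copy in, invoke, copy out pattern. A minimal sketch (the kernel and buffer names are illustrative, not from the talk):

```cuda
// Co-processor pattern: copy to GPU, invoke kernel, copy back to CPU.
__global__ void process(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;   // stand-in for real work
}

void run_on_gpu(const float *host_in, float *host_out, int n) {
    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    // copy to GPU
    cudaMemcpy(d_in, host_in, n * sizeof(float), cudaMemcpyHostToDevice);
    // invoke
    process<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    // copy to CPU (implicitly synchronizes with the kernel)
    cudaMemcpy(host_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
}
```

Each of these steps pays the overheads the next slide enumerates.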

SLIDE 16

Kernel start/stop overheads

[Diagram: CPU-GPU timeline: copy to GPU, invoke, copy to CPU]

Invocation latency, synchronization, cache flush

SLIDE 17

Hiding the overheads

[Diagram: CPU-GPU timeline: copies to GPU overlapped with kernel execution via double buffering]

Manual data reuse management, asynchronous invocation, double buffering
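A minimal sketch of the asynchronous-invocation/double-buffering pattern, using two CUDA streams so the copy of chunk i+1 overlaps the kernel on chunk i (`process_chunk` and the sizes are illustrative; `host_in` must be pinned via `cudaMallocHost` for real overlap):

```cuda
__global__ void process_chunk(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;   // stand-in for real work
}

void run_pipelined(float *host_in, int nchunks, int chunk) {
    float *d_buf[2];
    cudaStream_t stream[2];
    for (int s = 0; s < 2; s++) {
        cudaMalloc(&d_buf[s], chunk * sizeof(float));
        cudaStreamCreate(&stream[s]);
    }
    for (int i = 0; i < nchunks; i++) {
        int s = i % 2;  // alternate between the two buffers/streams
        cudaMemcpyAsync(d_buf[s], host_in + (size_t)i * chunk,
                        chunk * sizeof(float), cudaMemcpyHostToDevice, stream[s]);
        // asynchronous invocation: returns immediately, runs after the copy
        process_chunk<<<(chunk + 255) / 256, 256, 0, stream[s]>>>(d_buf[s], chunk);
    }
    cudaDeviceSynchronize();
    for (int s = 0; s < 2; s++) {
        cudaFree(d_buf[s]);
        cudaStreamDestroy(stream[s]);
    }
}
```

This is exactly the kind of low-level management code the next slides call out as overhead.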

SLIDE 18

Implementation complexity

[Diagram: CPU-GPU timeline: copies to GPU overlapped with kernel execution via double buffering]

Manual data reuse management, asynchronous invocation, double buffering

Management overhead

SLIDE 19

Implementation complexity

[Diagram: CPU-GPU timeline: copies to GPU overlapped with kernel execution via double buffering]

Manual data reuse management, asynchronous invocation, double buffering

Why do we need to deal with low-level system details?

Management overhead

SLIDE 20

The reason is....

GPUs are peer processors. They need OS services, such as I/O.

SLIDE 21

GPUfs: application view

CPUs GPU1 GPU2 GPU3

open(“shared_file”), mmap()

open(“shared_file”), write()

Host File System GPUfs

SLIDE 22

GPUfs: application view

CPUs GPU1 GPU2 GPU3

open(“shared_file”), mmap()

open(“shared_file”), write()

Host File System GPUfs

  • System-wide shared namespace
  • Persistent storage
  • POSIX (CPU)-like API
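With this POSIX-like API, file access moves into the GPU kernel itself. A minimal sketch, using the gopen/gread/gclose names from the API slide later in the deck; the exact signatures and the flag name here are assumptions, not the published interface:

```cuda
// Sketch of GPU-side file I/O with GPUfs. Signatures and O_GRDONLY are assumed.
__global__ void sum_file(const char *path, float *result) {
    __shared__ float buf[1024];
    int fd = gopen(path, O_GRDONLY);          // the whole threadblock opens the file
    size_t offset = blockIdx.x * sizeof(buf);
    gread(fd, offset, sizeof(buf), buf);      // each block reads its own chunk
    // ... each thread then processes its slice of buf in parallel ...
    gclose(fd);
}
```

No host-side transfer or buffer-management code is needed; GPUfs moves data through its buffer cache on demand.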

SLIDE 23

Accelerating collage app with GPUfs

CPU CPU CPU GPUfs GPUfs

  • open/read from GPU

GPU

No CPU management code

SLIDE 24

CPU CPU CPU GPUfs buffer cache GPU

Overlapping computations and transfers; read-ahead

Accelerating collage app with GPUfs

SLIDE 25

CPU CPU CPU GPUfs GPU

Data reuse

Accelerating collage app with GPUfs

Random data access

SLIDE 26

Challenge

GPU ≠ CPU

SLIDE 27

Massive parallelism

NVIDIA Fermi* AMD HD5870*

From M. Houston/A. Lefohn/K. Fatahalian – A trip through the architecture of modern GPUs*

23,000 active threads 31,000 active threads

Parallelism is essential for performance in deeply multi-threaded wide-vector hardware

SLIDE 28

Heterogeneous memory

CPU memory: 10-32 GB/s; CPU-GPU transfer: 6-16 GB/s; GPU memory: 288-360 GB/s

~20x gap

GPUs inherently impose high bandwidth demands on memory

SLIDE 29

How to build an FS layer on this hardware?
SLIDE 30

GPUfs: principled redesign of the whole file system stack

  • Relaxed FS API semantics for parallelism
  • Relaxed FS consistency for heterogeneous memory
  • GPU-specific implementation of synchronization primitives, lock-free data structures, memory allocation, ….

SLIDE 31

GPUfs high-level design

[Diagram: GPU application using the GPUfs File API → GPUfs GPU file I/O library → GPU memory (page cache) → GPUfs distributed buffer cache → OS file system interface → host file system → disk; unchanged CPU applications use the OS File API through GPUfs hooks]

Massive parallelism. Heterogeneous memory.


SLIDE 33

Buffer cache semantics

Local or Distributed file system data consistency?

SLIDE 34

GPUfs buffer cache: weak data consistency model

  • close(sync)-to-open semantics (as in AFS)

[Diagram: GPU1 performs write(1), fsync(), write(2); GPU2 performs open(), read(1); write(2) is not yet visible to the CPU]

Remote-to-local memory performance ratio is similar to that of a distributed system
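Under close(sync)-to-open semantics, a producer/consumer pair across two GPUs might be sketched as follows. The gopen/gwrite/gfsync/gread names come from the API slide in this deck; the signatures and flag names are assumptions:

```cuda
// Close(sync)-to-open consistency sketch: data written on GPU1 is guaranteed
// visible only to opens that happen after gfsync/gclose on the writer.
__global__ void producer(const char *path, const char *data, size_t len) {
    int fd = gopen(path, O_GWRONLY);   // flag name assumed
    gwrite(fd, 0, len, data);
    gfsync(fd);                        // publish: visible to subsequent opens
    gclose(fd);
}

__global__ void consumer(const char *path, char *out, size_t len) {
    int fd = gopen(path, O_GRDONLY);   // must open after the producer's sync
    gread(fd, 0, len, out);
    gclose(fd);
}
```

A reader that opened the file before the producer's gfsync() may still see stale data, exactly as under AFS-style semantics.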

SLIDE 35

On-GPU File I/O API

  CPU API          GPU API
  open/close       gopen/gclose
  read/write       gread/gwrite
  mmap/munmap      gmmap/gmunmap
  fsync/msync      gfsync/gmsync
  ftrunc           gftrunc

(In the paper)

Changes in the semantics are crucial

SLIDE 36

Implementation bits

  • Paging support
  • Dynamic data structures and memory allocators
  • Lock-free radix tree
  • Inter-processor communications (IPC)
  • Hybrid H/W-S/W barriers
  • Consistency module in the OS kernel

(In the paper)

~1.5K GPU LOC, ~600 CPU LOC

SLIDE 37

Evaluation

All benchmarks are written as a GPU kernel: no CPU-side development

SLIDE 38

Matrix-vector product (Inputs/Outputs in files)

Vector 1x128K elements, Page size = 2MB, GPU=TESLA C2075

[Chart: throughput (MB/s, 500-3500) vs. input matrix size (MB, 280-11200) for CUDA pipelined, CUDA optimized, and GPU file I/O]

SLIDE 39

Word frequency count in text

  • Count the frequency of modern English words in the works of Shakespeare and in the Linux kernel source tree

Challenges:

  • Dynamic working set
  • Small files: lots of file I/O (33,000 files, 1-5KB each)
  • Unpredictable output size

English dictionary: 58,000 words

SLIDE 40

Results

                                      8 CPUs    GPU-vanilla    GPU-GPUfs
Linux source (33,000 files, 524MB)    6h        50m (7.2X)     53m (6.8X)
Shakespeare (1 file, 6MB)             292s      40s (7.3X)     40s (7.3X)

SLIDE 41

Results

                                      8 CPUs    GPU-vanilla    GPU-GPUfs
Linux source (33,000 files, 524MB)    6h        50m (7.2X)     53m (6.8X)
Shakespeare (1 file, 6MB)             292s      40s (7.3X)     40s (7.3X)

Unbounded input/output size support

8% overhead

SLIDE 42

GPUfs


Code is available for download at: https://sites.google.com/site/silbersteinmark/Home/gpufs http://goo.gl/ofJ6J

GPUfs is the first system to provide native access to host OS services from GPU programs