Mark Silberstein - UT Austin 1
GPUfs: Integrating a file system with GPUs
Mark Silberstein (UT Austin/Technion) Bryan Ford (Yale), Idit Keidar (Technion) Emmett Witchel (UT Austin)
Traditional System Architecture
Applications
OS
CPU
Manycore processors, FPGAs, hybrid CPU-GPUs, GPUs
Accelerated applications
OS
CPU
Native accelerator applications
Coordination
Data transfers, GPU invocation, memory management
GPU
CPU
Parallel Algorithm
GPU
CPU
Half of the CUDA SDK 4.1 samples: at least 9 CPU LOC per 1 GPU LOC
http://www.codeproject.com/Articles/36347/Face-Collage
While (Unhappy()) {
    Read_next_image_file()
    Decide_placement()
    Remove_outliers()
}
Application running on CPUs
Move to GPU
CPU-GPU interaction: kernel start, data transfer, kernel termination
CPU to GPU: copy to GPU, invoke, copy to CPU
Overheads: invocation latency, synchronization, cache flush
CPU to GPU: copy to GPU, invoke, copy to CPU
Optimizations: manual data reuse management, asynchronous invocation, double buffering
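The double-buffering idea can be sketched in plain C: overlap loading the next chunk with processing the current one, which is what CPU-side CUDA code must orchestrate by hand with streams and asynchronous copies. This is a host-only sketch; the in-memory chunk source, sizes, and function names are all illustrative.

```c
/* Double-buffering sketch: overlap loading chunk i+1 with processing
   chunk i, as CPU-side CUDA code does manually with streams and
   cudaMemcpyAsync. An in-memory array stands in for the input file. */
#include <pthread.h>
#include <string.h>

#define CHUNK 4
#define NCHUNKS 3

static int source[NCHUNKS][CHUNK] = {
    {1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 12}
};

struct pipeline {
    int buf[2][CHUNK];   /* the two buffers being ping-ponged */
    int next;            /* chunk index the loader should fetch */
};

static void *loader(void *arg) {
    struct pipeline *p = arg;
    /* "transfer": load chunk p->next into the buffer not in use */
    memcpy(p->buf[p->next % 2], source[p->next], sizeof p->buf[0]);
    return NULL;
}

long process_all(void) {
    struct pipeline p = { .next = 0 };
    long sum = 0;
    memcpy(p.buf[0], source[0], sizeof p.buf[0]);  /* prime buffer 0 */
    for (int i = 0; i < NCHUNKS; i++) {
        pthread_t t;
        int loading = (i + 1 < NCHUNKS);
        if (loading) {                   /* start fetching chunk i+1 ... */
            p.next = i + 1;
            pthread_create(&t, NULL, loader, &p);
        }
        for (int j = 0; j < CHUNK; j++)  /* ... while processing chunk i */
            sum += p.buf[i % 2][j];
        if (loading)
            pthread_join(t, NULL);
    }
    return sum;
}
```

The loader always writes the buffer the consumer is not reading, so no lock is needed beyond the join; this is the management code GPUfs aims to eliminate.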
Management overhead
CPUs, GPU1, GPU2, GPU3: open("shared_file"), mmap(), write()
Host File System + GPUfs
System-wide shared namespace
Persistent storage
POSIX (CPU)-like API
GPUfs on both the CPUs and the GPU
No CPU management code
GPUfs buffer cache
Overlapping computations and transfers; read-ahead
Data reuse
Random data access
NVIDIA Fermi*: 23,000 active threads. AMD HD5870*: 31,000 active threads.
*From M. Houston/A. Lefohn/K. Fatahalian, A trip through the architecture of modern GPUs
CPU to its memory: 10-32 GB/s; CPU to GPU: 6-16 GB/s; GPU to its memory: 288-360 GB/s (~20x the CPU-GPU link)
GPUfs architecture:
GPU application using the GPUfs file API; unchanged CPU applications using the OS file API
GPUfs GPU file I/O library on the GPU; GPUfs hooks at the OS file system interface
GPUfs distributed buffer cache spanning GPU memory (page cache) and CPU memory
Host file system and disk underneath
Challenges: massive parallelism, heterogeneous memory
GPU1: write(1); GPU2: read(1), then write(2), fsync()
write(2) is not visible to the CPU until fsync()
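The visibility rule on this slide can be modeled in a few lines of C. This is a toy simulation of the semantics only, with invented names, not GPUfs code: each client buffers writes locally, and they reach the shared host file, and therefore other clients, only on fsync.

```c
/* Toy model of weak write visibility (names invented, not the GPUfs
   API): each client caches writes locally; they reach the shared host
   file, and thus other clients, only on fsync. */
#include <string.h>

static char shared_file[16] = "old";   /* the host file's contents */

struct client {
    char cache[16];   /* this client's locally cached copy */
    int dirty;        /* has an unflushed write */
};

void client_write(struct client *c, const char *data) {
    strcpy(c->cache, data);   /* stays in the local cache for now */
    c->dirty = 1;
}

void client_fsync(struct client *c) {
    if (c->dirty) {           /* flush: now visible system-wide */
        strcpy(shared_file, c->cache);
        c->dirty = 0;
    }
}

void client_read(const struct client *c, char *out) {
    /* a clean client reads through to the shared file; a dirty one
       sees its own unflushed data */
    strcpy(out, c->dirty ? c->cache : shared_file);
}
```

A CPU client reading after a GPU client's write() but before its fsync() still sees the old contents, matching the diagram above.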
In the paper
[Figure: throughput (MB/s) vs. input matrix size (MB) for CUDA pipelined, CUDA optimized, and GPU file I/O. Vector of 1x128K elements, page size = 2 MB, GPU = Tesla C2075]
Dynamic working set; small files; lots of file I/O (33,000 files, 1-5 KB each); unpredictable output size
English dictionary: 58,000 words
Workload                              8 CPUs   GPU-vanilla   GPU-GPUfs
Linux source (33,000 files, 524 MB)   6h       50m (7.2X)    53m (6.8X)
Shakespeare (1 file, 6 MB)            292s     40s (7.3X)    40s (7.3X)
Unbounded input/output size support
Code is available for download at: https://sites.google.com/site/silbersteinmark/Home/gpufs http://goo.gl/ofJ6J
GPU file I/O: gmmap() on the GPU
CUDA pipelined transfer: read chunk, transfer to GPU, repeated per chunk
CUDA whole file transfer: read the file, then transfer to GPU
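The two CUDA baselines can be sketched as ordinary host-side C file I/O: whole-file transfer, modeled here by mapping the entire file at once, versus pipelined chunks, modeled by a pread loop. Both just checksum the bytes; the demo-file helper and path are illustrative.

```c
/* Host-only sketch of the two baseline transfer patterns:
   whole-file (mmap everything at once) vs. pipelined chunks
   (a pread loop). Both checksum the file's bytes. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

void write_demo_file(const char *path, const unsigned char *data, size_t n) {
    FILE *f = fopen(path, "wb");
    fwrite(data, 1, n, f);
    fclose(f);
}

/* whole-file: map everything, then walk it once */
long sum_whole(const char *path) {
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);
    unsigned char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    long s = 0;
    for (off_t i = 0; i < st.st_size; i++)
        s += p[i];
    munmap(p, st.st_size);
    close(fd);
    return s;
}

/* pipelined: fetch and consume one chunk at a time */
long sum_chunked(const char *path, size_t chunk) {
    int fd = open(path, O_RDONLY);
    unsigned char buf[4096];
    long s = 0;
    ssize_t n;
    off_t off = 0;
    if (chunk > sizeof buf)
        chunk = sizeof buf;
    while ((n = pread(fd, buf, chunk, off)) > 0) {
        for (ssize_t i = 0; i < n; i++)
            s += buf[i];
        off += n;
    }
    close(fd);
    return s;
}
```

The chunked loop is what the CUDA pipelined baseline overlaps with GPU transfers; the chunk size plays the same role as the page size swept in the figure below.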
[Figure: throughput (MB/s) vs. page size (16 KB - 2 MB) for GPU file I/O, CUDA whole file, and CUDA pipelined transfer]
Yesterday Tomorrow
gpu_thread(thread_id i) {
    float buffer;
    int fd = gopen(filename, O_GRDWR);
    gread(fd, sizeof(float), &buffer, offset);
    buffer = sqrt(buffer);
    gwrite(fd, sizeof(float), &buffer, offset);
    gclose(fd);
}
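Because the GPUfs calls mirror POSIX, a CPU analogue of the snippet above is nearly a transliteration: gopen/gread/gwrite/gclose become open/pread/pwrite/close. This is a sketch, not GPUfs code; the file path in the usage is illustrative, and a small Newton iteration stands in for sqrtf() so it needs no libm.

```c
/* CPU analogue of the GPU snippet: read one float at an offset,
   replace it with its square root, write it back. */
#include <fcntl.h>
#include <unistd.h>

static float approx_sqrtf(float x) {
    float g = x > 1.0f ? x : 1.0f;
    for (int i = 0; i < 30; i++)      /* Newton's method on g*g = x */
        g = 0.5f * (g + x / g);
    return g;
}

void sqrt_in_place(const char *filename, off_t offset) {
    float buffer;
    int fd = open(filename, O_RDWR);            /* cf. gopen()  */
    pread(fd, &buffer, sizeof(float), offset);  /* cf. gread()  */
    buffer = approx_sqrtf(buffer);
    pwrite(fd, &buffer, sizeof(float), offset); /* cf. gwrite() */
    close(fd);                                  /* cf. gclose() */
}
```

The structural identity between the two versions is the point of the slide: the GPU code needs no CPU-side staging logic.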
GPU threads are different from CPU threads
[Diagram: threads grouped into SIMD vectors, four threads per vector]
GPU kernel is a single data-parallel application
int fd = gopen("filename", O_GRDWR);
[Diagram: every thread in each SIMD vector reaches the gopen() call together]
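One way to picture this: the call is issued by every thread of the vector, but the underlying work happens once and the result is shared by all lanes. A toy C model of that behavior, with all names invented:

```c
/* Toy model: all lanes of a SIMD vector reach gopen() together; the
   open is performed once and the descriptor is broadcast to every lane. */
#define LANES 4

static int next_fd = 3;   /* stand-in for a descriptor allocator */

void vector_gopen(const char *name, int fd_out[LANES]) {
    (void)name;               /* a real call would look up the file */
    int fd = next_fd++;       /* one lane does the actual open */
    for (int lane = 0; lane < LANES; lane++)
        fd_out[lane] = fd;    /* every lane gets the same descriptor */
}
```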