Operating System Services for High Throughput Processors
Mark Silberstein, EE, Technion (Feb 2014)
Traditional Systems Software Stack
[Diagram: applications run on the OS, which runs on the CPU]
Modern Systems Software Stack
[Diagram: accelerated applications run on the OS and CPU, alongside manycore processors, FPGAs, DSPs, and GPUs]
GPUs make a difference...
- Top 10 fastest supercomputers use GPUs
- Physics, vision, chemistry, graph algorithms, bioinformatics, finance, linear algebra, HCI, meteorology
GPUs make a difference, but only in HPC!
GPUs cover physics, vision, chemistry, graph algorithms, bioinformatics, finance, linear algebra, HCI, meteorology. But what about web servers? Network services? Antivirus and file search?
Software-hardware gap is widening
[Diagram: accelerated applications target manycore processors, FPGAs, hybrid CPU-GPUs, DSPs, and GPUs, but the OS manages only the CPU]
Inadequate abstractions and management mechanisms.
Fundamentals in question
accelerators ≡ co-processors, or accelerators ≡ peer-processors?
Software stack for accelerated applications
[Diagram: accelerated applications sit on top of accelerator abstractions and mechanisms in the OS, which spans the CPU and the accelerators (manycore processors, FPGAs, DSPs, GPUs)]
Software stack for accelerator applications
- Accelerator applications (centralized and distributed)
- Accelerator OS support (inter-processor I/O, file system, network APIs)
- Accelerator I/O services (network, files)
- Hardware support for OS
This talk
- Accelerator OS support for GPUs: inter-processor I/O, file system, and network APIs [ASPLOS13, TOCS14]
- Accelerator I/O services (network, files), backed by storage and the network
- Accelerator applications, centralized and distributed
- GPU 101
- GPUfs: File I/O support for GPUs
- Future work
Hybrid GPU-CPU 101: Architecture
[Diagram: a CPU and a GPU, each with its own memory]
Co-processor model
[Animation: the CPU copies input data to GPU memory, launches a computation (a GPU kernel), and copies the results back to CPU memory]
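In CUDA this pattern looks roughly as follows. A minimal sketch, assuming an illustrative scale kernel and buffer sizes (not from the talk):

#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;                      // the offloaded computation
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h = (float *)calloc(n, sizeof(float));    // CPU memory
    float *d;
    cudaMalloc(&d, bytes);                           // GPU memory

    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice); // copy to GPU
    scale<<<(n + 255) / 256, 256>>>(d, n);           // kernel invocation
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost); // copy back; blocks until
                                                     // the kernel terminates
    cudaFree(d);
    free(h);
    return 0;
}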
Building systems with GPUs is hard. Why?
GPU kernels are isolated
[Diagram: the parallel algorithm runs on the GPU; data transfers, kernel invocation, and memory management are all driven by the CPU]
Example: accelerating photo collage
A CPU application:

while (Unhappy()) {
    Read_next_image_file();
    Decide_placement();
    Remove_outliers();
}
Offloading computations to GPU
The compute-intensive loop body moves to the GPU. Each iteration now involves a data transfer to the GPU, a kernel start, and a kernel termination.
Overheads
[Timeline: copy to GPU, invoke, kernel runs on the GPU, copy to CPU. The overheads: invocation latency, synchronization, transfer overhead]
Working around overheads
- Data reuse management
- Asynchronous invocation
- Double buffering (see the sketch below)
- Buffer size optimization
- GPU-CPU low-level tricks
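For example, double buffering with asynchronous copies might look like this. A sketch, assuming an illustrative process_chunk kernel and chunk size:

#include <cuda_runtime.h>

#define CHUNK (1 << 20)                              // elements per chunk (illustrative)

__global__ void process_chunk(float *buf, int n);    // user-supplied kernel

void run(float *host_data, int nchunks) {
    // host_data must be pinned (cudaMallocHost) for the copies
    // to be truly asynchronous.
    float *dev[2];
    cudaStream_t s[2];
    for (int b = 0; b < 2; b++) {
        cudaMalloc(&dev[b], CHUNK * sizeof(float));
        cudaStreamCreate(&s[b]);
    }
    for (int i = 0; i < nchunks; i++) {
        int b = i & 1;                               // alternate the two buffers
        cudaMemcpyAsync(dev[b], host_data + (size_t)i * CHUNK,
                        CHUNK * sizeof(float),
                        cudaMemcpyHostToDevice, s[b]);
        // The copy for the next chunk (other stream) overlaps this kernel.
        process_chunk<<<CHUNK / 256, 256, 0, s[b]>>>(dev[b], CHUNK);
    }
    cudaDeviceSynchronize();                         // drain both streams
    for (int b = 0; b < 2; b++) {
        cudaStreamDestroy(s[b]);
        cudaFree(dev[b]);
    }
}

Every one of these tricks is boilerplate that the application programmer must write and tune by hand.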
Management overhead
Why do we need to deal with these low-level system details (data reuse management, asynchronous invocation, double buffering, buffer size optimization, GPU-CPU low-level tricks) at all?
The reason is...
GPUs are peer-processors. They need OS services for I/O.
GPUfs: application view
[Diagram: CPUs, GPU1, GPU2, and GPU3 all access the host file system through GPUfs: open("shared_file"), write(), mmap()]
- System-wide shared namespace
- Persistent storage
- POSIX (CPU)-like API
Accelerating the collage app with GPUfs
- open/read files directly from the GPU; no CPU management code
- The GPUfs buffer cache overlaps computations and transfers (read-ahead)
- Data reuse and random data access are served from the cache
Understanding the hardware
GPU hardware characteristics
- Parallelism
- Heterogeneous memory
- Low serial performance
GPU hardware parallelism
1. Multi-core: the GPU contains multiple multiprocessors (MPs), each with many cores, all sharing GPU memory.
2. SIMD: each multiprocessor executes its threads as wide SIMD vectors, in lockstep.
3. Parallelism for latency hiding: each multiprocessor keeps many thread contexts (T1, T2, T3, ...) resident. When a thread issues a memory read (R 0x01, R 0x04, R 0x08, ...), the hardware switches to another ready context instead of stalling.
Putting it all together: 3 levels of hardware parallelism
[Diagram: a GPU with multiple MPs; each MP runs SIMD vectors across its cores and holds thread contexts 1..k for latency hiding, all over shared GPU memory]
Software-hardware mapping
[Diagram: software threads 1..n are mapped onto MPs, SIMD vector lanes, and resident thread contexts]
(1) 10,000s of concurrent threads!
NVIDIA K20X GPU: 64 thread contexts x 14 MPs x 32 SIMD lanes = 28,672 concurrent threads.
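A toy CUDA kernel makes the mapping concrete. A sketch with illustrative launch dimensions: thread blocks land on MPs, each block splits into 32-wide warps (SIMD vectors), and launching many blocks gives the hardware spare contexts for latency hiding.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void whoami() {
    // blockIdx.x: which thread block; blocks are scheduled onto MPs.
    // threadIdx.x: position inside the block; every 32 consecutive
    // threads form one warp that executes as a SIMD vector.
    if (threadIdx.x % 32 == 0)
        printf("block %d, warp %d\n", (int)blockIdx.x, (int)(threadIdx.x / 32));
}

int main() {
    // 448 blocks x 64 threads = 28,672 threads, matching a fully
    // occupied K20X; the surplus contexts are what hides memory latency.
    whoami<<<448, 64>>>();
    cudaDeviceSynchronize();
    return 0;
}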
(2) Each thread is slow
A GPU thread is roughly 100x slower than a CPU thread.
(3) Heterogeneous memory
CPU memory: 10-32 GB/s. CPU-GPU interconnect: 12 GB/s. GPU memory: 250 GB/s, about 20x the interconnect bandwidth.
GPUfs: file system layer for GPUs
Joint work with Bryan Ford, Idit Keidar, Emmett Witchel [ASPLOS13, TOCS14]
GPUfs: principled redesign of the whole file system stack
- Modified FS API semantics for massive parallelism
- Relaxed, distributed-FS-style consistency for non-uniform memory
- GPU-specific implementation of synchronization primitives, read-optimized data structures, memory allocation, ...
GPU program using GPUfs

__shared__ float buffer[1024];               // per-threadblock staging buffer
int fd = gopen(filename, O_GRDWR);           // open is cached on the GPU
gread(fd, offset, 1024*4, buffer);           // read at an explicit offset
buffer[myId] = compute(buffer[myId]);        // parallel compute
gwrite(fd, offset, 1024*4, buffer);          // write back at an explicit offset
gclose(fd);                                  // asynchronous close

The API supports GPU programming idioms:
- Parallel API calls: hundreds of threads perform the same call in lockstep
- open is cached on the GPU
- read/write take explicit offsets, for parallel access and low contention
- close is asynchronous
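For context, such a snippet would sit inside a kernel along these lines. A sketch, not code from the talk: the grid_crunch name and the per-block offset computation are illustrative, assuming the gopen/gread/gwrite/gclose device API above.

__device__ float compute(float v);                 // user-supplied function

__global__ void grid_crunch(const char *filename) {
    __shared__ float buffer[1024];
    int myId = threadIdx.x;                        // 1024 threads per block
    size_t offset = (size_t)blockIdx.x * 1024 * 4; // each block reads its own 4KB slice

    int fd = gopen(filename, O_GRDWR);             // all threads call together
    gread(fd, offset, 1024 * 4, buffer);
    buffer[myId] = compute(buffer[myId]);          // parallel compute
    __syncthreads();                               // buffer fully updated
    gwrite(fd, offset, 1024 * 4, buffer);
    gclose(fd);
}

Explicit per-block offsets keep thread blocks from contending on shared file state.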
High-level design
[Diagram: on the GPU, the application uses the GPUfs file API via a GPU file I/O library, backed by a GPUfs distributed buffer cache (page cache) in GPU memory. On the CPU, unchanged applications use the OS file API; GPUfs hooks in the OS connect everything to the host file system and the disk]
The design must handle massive parallelism and non-uniform memory.
Buffer cache semantics
Local or distributed file system data consistency?
Weak data consistency model
- close(sync)-to-open semantics, as in AFS: after the GPU issues write(1) and fsync(), a CPU open() followed by read(1) sees the data; a later GPU write(2) that has not been synced is not visible to the CPU (see the sketch below).
Reason: minimize inter-processor synchronization.
Implications:
- Overlapping writes
- Cache page false sharing
- Consistency protocol
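A sketch of the resulting behavior, assuming a GPU-side gfsync() call (the slide shows only fsync(); the exact name is an assumption), plus an illustrative SZ transfer size:

#define SZ 4096                        /* illustrative transfer size */

/* GPU side */
__global__ void producer(float *buf1, float *buf2) {
    int fd = gopen("shared_file", O_GRDWR);
    gwrite(fd, 0, SZ, buf1);           // write(1)
    gfsync(fd);                        // write(1) becomes visible at the next CPU open
    gwrite(fd, SZ, SZ, buf2);          // write(2): not synced
    gclose(fd);
}

/* CPU side, after the GPU's gfsync() */
char buf[SZ];
int fd = open("shared_file", O_RDONLY);
read(fd, buf, SZ);                     // sees write(1), but not write(2)

As in AFS, the CPU observes the file as of the last sync preceding its open.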
Implementation bits
GPUfs prototype
[Diagram: the GPU kernel program calls the GPUfs API, backed by a buffer cache, file state, and a consistency module in GPU memory; a CPU-GPU RPC channel connects it to an RPC daemon in CPU OS user space, which talks to the CPU OS kernel]
On-demand data transfer
[Animation: gread() in the GPU kernel posts a request on an RPC queue in write-shared CPU memory; the CPU RPC daemon serves it with pread() into a staging area of the buffer cache, transfers the data to GPU memory with cudaMemcpy(), and sends an ack]
The CPU acts as a file server.
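A sketch of such an RPC path. The slot layout, field names, and busy-wait protocol here are assumptions for illustration, not GPUfs code:

#include <unistd.h>
#include <cuda_runtime.h>

// One request slot in write-shared CPU memory, allocated with
// cudaHostAlloc(..., cudaHostAllocMapped) so both sides can see it.
struct rpc_req {
    volatile int ready;           // 0 = free, 1 = posted, 2 = acked
    int          fd;
    size_t       offset, size;
};

__device__ void post_read(volatile rpc_req *q, int fd, size_t off, size_t sz) {
    if (threadIdx.x == 0) {       // one thread posts on behalf of the block
        q->fd = fd; q->offset = off; q->size = sz;
        __threadfence_system();   // make the fields visible to the CPU
        q->ready = 1;
        while (q->ready != 2) { } // spin until the daemon acks
    }
    __syncthreads();
}

// CPU-side daemon, running while the GPU kernel executes; staging must be
// pinned memory for the async copy.
void rpc_daemon(volatile rpc_req *q, char *staging, char *gpu_buf,
                cudaStream_t copy_stream) {
    for (;;) {
        while (q->ready != 1) { }                     // wait for a request
        pread(q->fd, staging, q->size, q->offset);    // file -> staging area
        cudaMemcpyAsync(gpu_buf, staging, q->size,    // staging -> GPU memory
                        cudaMemcpyHostToDevice, copy_stream);
        cudaStreamSynchronize(copy_stream);
        q->ready = 2;                                 // ack
    }
}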
More implementation challenges (in the paper):
- Paging
- Dynamic data structures and memory allocators
- Lock-less, read-optimized radix tree
- Inter-processor consistency
GPUfs impact on GPU programs
- Memory overhead
- Very little CPU involvement
- Pay-as-you-go design
Evaluation
All benchmarks are written as self-contained GPU kernels, with no CPU part.
Real applications
- Approximate image matching: 4 GPUs are 6-9x faster than 8 CPU cores
- String matching in the Linux kernel tree (33,000 files): 1 GPU is 6-7x faster than 8 CPU cores
- GPUfs overhead: 7%
Summary: GPUfs
GPUfs is the first system to provide file I/O for GPUs.
Open issues
- Buffer cache: consistency, CPU page cache interaction, page faults, mmap
- Direct access to storage devices
- Optimizing file naming mechanisms
- Applications: image format readers, git-grep
- Other accelerators: FPGAs, Xeon Phi, DSPs
- GPU networking
Summary
- System performance will rely on accelerators
- Programmable accelerators are peer-processors (not co-processors)
- They need I/O abstractions and OS services
- GPUnet and GPUfs are a first step in this direction