Operating System Services for High Throughput Processors



  1. Operating System Services for High Throughput Processors. Mark Silberstein, EE, Technion

  2. Traditional Systems Software Stack: applications, OS, CPU.

  3. Modern Systems Software Stack: accelerated applications, OS, and heterogeneous hardware: CPU, manycore processors, GPUs, FPGAs, DSPs.

  4. GPUs make a difference... ● Top 10 fastest supercomputers use GPUs. Domains: HCI, meteorology, vision, physics, bioinformatics, graph algorithms, chemistry, linear algebra, finance.

  5. GPUs make a difference, but only in HPC! HPC domains: HCI, meteorology, vision, physics, bioinformatics, graph algorithms, chemistry, linear algebra, finance. Web servers, network services, antivirus and file search: ???

  6. The software-hardware gap is widening: accelerated applications run on top of inadequate OS abstractions and management mechanisms, while the hardware below diversifies into CPUs, manycore processors, GPUs, FPGAs, DSPs, and hybrid CPU-GPU processors.

  7. Fundamentals in question: accelerators ≡ co-processors, or accelerators ≡ peer-processors?

  8. Software stack for accelerated applications: accelerated applications, OS with accelerator abstractions and mechanisms, and heterogeneous hardware (CPU, manycore processors, GPUs, FPGAs, DSPs).

  9. Software stack for accelerator applications: accelerator applications (centralized and distributed), accelerator I/O services (network, files), accelerator OS support (interprocessor I/O, OS file system, network APIs), accelerator abstractions and mechanisms, hardware support for OS, and heterogeneous hardware (CPU, manycore processors, GPUs, FPGAs, DSPs).

  10. This talk: within that stack, accelerator abstractions and mechanisms [ASPLOS13, TOCS14] and hardware support for OS, with GPUs as the accelerator and storage and network as the I/O services.

  11. ● GPU 101 ● GPUfs: File I/O support for GPUs ● Future work

  12. Hybrid GPU-CPU 101: the architecture pairs a CPU and a GPU, each with its own memory.

  13. Co-processor model: the computation starts on the CPU; the CPU and GPU each have their own memory.

  14. Co-processor model: the computation is offloaded from the CPU to the GPU.

  15. Co-processor model: the offloaded computation runs on the GPU as a GPU kernel.

  16. Co-processor model: after the kernel completes, the computation returns to the CPU.

  17. Building systems with GPUs is hard. Why?

  18. GPU kernels are isolated: around the parallel algorithm on the GPU, the CPU must handle data transfers, kernel invocation, and memory management.

  19. Example: accelerating a photo collage application running on CPUs: While(Unhappy()){ Read_next_image_file() Decide_placement() Remove_outliers() }

  20. Offloading computations to the GPU: the computation inside the loop is moved from the CPU to the GPU.

  21. Offloading computations to the GPU: for each iteration, the CPU transfers data to the GPU, starts a kernel, and waits for kernel termination.
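
  To make the offload sequence of slides 19-21 concrete, here is a minimal CUDA sketch of the co-processor pattern: allocate GPU memory, copy the input over, start a kernel, wait for its termination, and copy the result back. The kernel body and buffer sizes are illustrative placeholders, not code from the talk.

      // Minimal co-processor offload sketch (illustrative; not from the talk).
      #include <cuda_runtime.h>

      __global__ void process(const float *in, float *out, int n) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) out[i] = in[i] * 2.0f;              // placeholder computation
      }

      void offload(const float *host_in, float *host_out, int n) {
          float *dev_in, *dev_out;
          size_t bytes = n * sizeof(float);
          cudaMalloc(&dev_in, bytes);                    // memory management
          cudaMalloc(&dev_out, bytes);
          cudaMemcpy(dev_in, host_in, bytes, cudaMemcpyHostToDevice);   // data transfer
          process<<<(n + 255) / 256, 256>>>(dev_in, dev_out, n);        // kernel start
          cudaDeviceSynchronize();                       // wait for kernel termination
          cudaMemcpy(host_out, dev_out, bytes, cudaMemcpyDeviceToHost); // copy results back
          cudaFree(dev_in);
          cudaFree(dev_out);
      }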

  22. Overheads: invocation latency, data transfer overhead, and synchronization: the CPU copies to the GPU, invokes the kernel, and copies the results back.

  23. Working around the overheads: asynchronous invocation, data reuse management, double buffering, buffer size optimization: low-level GPU-CPU tricks.
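
  The "low-level tricks" listed on slide 23 usually take a form like the following CUDA sketch: two streams and two device buffers, so the copy of chunk i+1 overlaps the kernel that processes chunk i. The chunk size, the kernel body, and the assumption that host_data is pinned are illustrative choices, not details from the talk.

      // Double-buffering sketch: overlap host-to-GPU copies with kernel execution.
      #include <cuda_runtime.h>

      #define CHUNK (1 << 20)   // elements per chunk; an arbitrary illustrative size

      __global__ void process_chunk(float *buf, int n) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) buf[i] += 1.0f;                     // placeholder computation
      }

      // host_data must be pinned (cudaMallocHost) for the async copies to overlap.
      void process_stream(const float *host_data, int nchunks) {
          cudaStream_t stream[2];
          float *dev_buf[2];
          for (int i = 0; i < 2; ++i) {
              cudaStreamCreate(&stream[i]);
              cudaMalloc(&dev_buf[i], CHUNK * sizeof(float));
          }
          for (int c = 0; c < nchunks; ++c) {
              int s = c & 1;   // alternate buffers/streams: double buffering
              cudaMemcpyAsync(dev_buf[s], host_data + (size_t)c * CHUNK,
                              CHUNK * sizeof(float), cudaMemcpyHostToDevice, stream[s]);
              process_chunk<<<CHUNK / 256, 256, 0, stream[s]>>>(dev_buf[s], CHUNK);
          }
          cudaDeviceSynchronize();
          for (int i = 0; i < 2; ++i) {
              cudaStreamDestroy(stream[i]);
              cudaFree(dev_buf[i]);
          }
      }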

  24. Management overhead: asynchronous invocation, data reuse management, double buffering, buffer size optimization, low-level GPU-CPU tricks. Why do we need to deal with low-level system details?

  25. The reason is.... GPUs are peer-processors: they need OS services for I/O.

  26. GPUfs: application view. CPUs and GPU1, GPU2, GPU3 all call open("shared_file"), mmap(), write() through GPUfs on top of the host file system.

  27. GPUfs: application view. A system-wide shared namespace with a POSIX (CPU)-like API and persistent storage: CPUs and GPUs open("shared_file"), mmap(), and write() through GPUfs on top of the host file system.
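
  A minimal sketch of what this application view could look like in GPU code. The gopen/gwrite/gclose names follow GPUfs's convention of prefixing POSIX-like calls with "g", but the exact signatures and flag names here are assumptions for illustration, not the library's documented API.

      // Hypothetical GPU-side file access in the spirit of GPUfs (signatures assumed).
      __global__ void publish_results(const float *data, size_t bytes) {
          // The same "shared_file" name is visible to the CPUs and to every GPU:
          // a system-wide shared namespace backed by the host file system.
          int fd = gopen("shared_file", O_GRDWR);        // flag name assumed
          gwrite(fd, /*offset=*/0, bytes, (void *)data); // persist GPU-resident data
          gclose(fd);
      }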

  28. Accelerating the collage app with GPUfs: the GPU code opens and reads image files directly from the GPU, so no CPU management code is needed.
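
  Sketched below is how the collage loop from slide 19 might look once file I/O is callable from the GPU: the kernel reads the next image file itself, so the CPU-side transfer and invocation code disappears. gopen/gread/gclose are the same assumed GPUfs-style names as above; Unhappy, Decide_placement and Remove_outliers stand in for the application's own device code, and next_image_path, image_buf and IMAGE_BYTES are placeholders.

      // Collage loop running on the GPU, reading images through GPU-side file calls.
      __global__ void collage_kernel(const char *next_image_path, char *image_buf) {
          while (Unhappy()) {                                // application device function
              int fd = gopen(next_image_path, O_GRDONLY);
              gread(fd, /*offset=*/0, IMAGE_BYTES, image_buf);   // read image on the GPU
              gclose(fd);
              Decide_placement();                            // application device code
              Remove_outliers();
          }
      }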

  29. Accelerating the collage app with GPUfs: the GPUfs buffer cache provides read-ahead, overlapping computations and transfers.

  30. Accelerating the collage app with GPUfs: the buffer cache also enables data reuse and random data access.

  31. Understanding the hardware

  32. GPU hardware characteristics: parallelism, low serial performance, heterogeneous memory.

  33. GPU hardware parallelism, level 1: multi-core: a GPU contains multiple cores (multiprocessors, MPs) sharing the GPU memory.

  34. GPU hardware parallelism, level 2: SIMD: each multiprocessor executes wide SIMD vectors.

  35. GPU hardware parallelism, level 3: parallelism for latency hiding: each multiprocessor keeps the execution state of many threads (T1, T2, T3, ...) resident.

  36. Latency hiding: T1 issues a memory read (R 0x01) and the multiprocessor switches to another resident thread.

  37. Latency hiding: T2 issues its own read (R 0x04) while T1's is still in flight.

  38. Latency hiding: T3 issues a third read (R 0x08); memory requests from many threads overlap.

  39. Latency hiding (continued): the outstanding reads from T1, T2, and T3 overlap, keeping the multiprocessor busy.

  40. Putting it all together: three levels of hardware parallelism: multiple multiprocessors (cores), SIMD vectors within each multiprocessor, and many resident thread contexts (Thread Ctx 1..k, each with its execution state) per multiprocessor.

  41. Software-hardware mapping: software threads (Thread 1 ... Thread n) are mapped onto the hardware thread contexts and SIMD lanes of a multiprocessor, and groups of threads are distributed across the multiprocessors.

  42. (1) Tens of thousands of concurrent threads! An NVIDIA K20X GPU holds 64 x 14 x 32 = 28,672 concurrently resident threads (64 thread contexts per multiprocessor, 14 multiprocessors, 32-wide SIMD vectors).
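
  The CUDA sketch below (illustrative, not from the talk) queries those limits at run time and launches a grid-stride kernel sized so that roughly every hardware thread context has work.

      #include <cuda_runtime.h>
      #include <cstdio>

      // Each resident thread walks the array with a grid-stride loop.
      __global__ void scale(float *data, size_t n) {
          for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
               i += (size_t)gridDim.x * blockDim.x)
              data[i] *= 2.0f;
      }

      int main() {
          cudaDeviceProp prop;
          cudaGetDeviceProperties(&prop, 0);
          // K20X: 14 MPs x (64 warp contexts x 32 lanes) = 28,672 resident threads.
          int resident = prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;
          printf("%d MPs x %d threads/MP = %d concurrent threads\n",
                 prop.multiProcessorCount, prop.maxThreadsPerMultiProcessor, resident);

          size_t n = 1 << 26;
          float *data;
          cudaMalloc(&data, n * sizeof(float));
          scale<<<resident / 256, 256>>>(data, n);   // enough blocks to fill the GPU
          cudaDeviceSynchronize();
          cudaFree(data);
          return 0;
      }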

  43. (2) Each thread is slow: a single GPU thread is roughly 100x slower than a CPU thread.

  44. (3) Heterogeneous memory: CPU memory at 10-32 GB/s, GPU memory at 250 GB/s, and a 12 GB/s link between them (roughly a 20x gap).
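
  One way to observe this asymmetry is to time a large host-to-device copy against an on-board device-to-device copy; on a K20-era system the first is limited by the ~12 GB/s PCIe link and the second by the ~250 GB/s GPU memory. The sketch below is a rough measurement, with the 256 MB size an arbitrary choice.

      #include <cuda_runtime.h>
      #include <cstdio>

      // Rough bandwidth comparison: PCIe (host->device) vs. GPU memory (device->device).
      int main() {
          const size_t bytes = 256UL << 20;              // 256 MB
          float *host, *dev_a, *dev_b;
          cudaMallocHost(&host, bytes);                  // pinned memory for full PCIe rate
          cudaMalloc(&dev_a, bytes);
          cudaMalloc(&dev_b, bytes);

          cudaEvent_t start, stop;
          cudaEventCreate(&start);
          cudaEventCreate(&stop);
          float ms;

          cudaEventRecord(start);
          cudaMemcpy(dev_a, host, bytes, cudaMemcpyHostToDevice);     // crosses PCIe
          cudaEventRecord(stop);
          cudaEventSynchronize(stop);
          cudaEventElapsedTime(&ms, start, stop);
          printf("host->device:   %.1f GB/s\n", bytes / ms / 1e6);

          cudaEventRecord(start);
          cudaMemcpy(dev_b, dev_a, bytes, cudaMemcpyDeviceToDevice);  // stays on the GPU
          cudaEventRecord(stop);
          cudaEventSynchronize(stop);
          cudaEventElapsedTime(&ms, start, stop);
          // Copy throughput counts both the read and the write in GPU memory.
          printf("device->device: %.1f GB/s\n", bytes / ms / 1e6);

          cudaFree(dev_a); cudaFree(dev_b); cudaFreeHost(host);
          return 0;
      }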

  45. GPUfs: a file system layer for GPUs. Joint work with Bryan Ford, Idit Keidar, Emmett Witchel [ASPLOS13, TOCS14].

  46. GPUfs: principled redesign of the whole file system stack ● Modified FS API semantics for massive parallelism ● Relaxed distributed FS consistency for non-uniform memory ● GPU-specific implementation of synchronization primitives, read-optimized data structures, memory allocation, ...
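
  One concrete example of "modified FS API semantics for massive parallelism" is making file system calls at a granularity coarser than a single thread. In the sketch below (same assumed gopen/gread names as earlier, not the verbatim GPUfs API), every thread of a threadblock reaches the call together, conceptually one request is issued per block, and the block's threads then consume the returned data in parallel.

      // Block-granularity file call pattern (GPUfs-style names assumed).
      __global__ void scan_file(const char *path, char *buf, size_t chunk) {
          // All threads of the block arrive here together; the file system layer
          // services one logical request per threadblock rather than per thread.
          int fd = gopen(path, O_GRDONLY);
          size_t got = gread(fd, blockIdx.x * chunk, chunk, buf + blockIdx.x * chunk);

          // The block's threads then process the returned bytes in parallel.
          for (size_t i = threadIdx.x; i < got; i += blockDim.x) {
              // ... per-byte work on buf[blockIdx.x * chunk + i] ...
          }
          gclose(fd);
      }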
