Understanding GPU performance: How to get peak FLOPS (GPU version)


SLIDE 1

Understanding GPU performance

How to get peak FLOPS (GPU version)

Kenjiro Taura

SLIDE 2

Contents

1 Data Access Performance

SLIDE 4

Data access performance

Data access performance is important on the GPU, too.

SLIDE 5

Memory organization

Pascal (P100)

level           line size   capacity           associativity
L1              32B         24KB/SM            ?
L2              32B         4MB/device         ?
Global Memory   -           12/16GB            N/A
Shared Memory   -           64KB (∗)           N/A

Volta (V100)

level           line size   capacity           associativity
L1              32B         32-128 KB/SM (∗)   ?
L2              32B         6MB/device         ?
Global Memory   -           16GB               N/A
Shared Memory   -           ≤96KB (∗)          N/A

(∗) 128KB is split between L1 and Shared Memory (configurable). Source: https://arxiv.org/abs/1804.06826
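The capacities can be queried at runtime (line sizes and associativities are not exposed by the API and have to be measured, which is what the cited paper does). A minimal sketch using the CUDA runtime call cudaGetDeviceProperties:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
      cudaDeviceProp prop;
      cudaGetDeviceProperties(&prop, 0);          // device 0
      printf("device           : %s\n", prop.name);
      printf("global memory    : %zu MB\n", prop.totalGlobalMem >> 20);
      printf("L2 cache         : %d KB\n", prop.l2CacheSize >> 10);
      printf("shared memory/SM : %zu KB\n", prop.sharedMemPerMultiprocessor >> 10);
      printf("number of SMs    : %d\n", prop.multiProcessorCount);
      return 0;
    }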

SLIDE 6

Global vs. Shared Memory

Global memory and the L1/L2 caches form the ordinary memory hierarchy.

cudaMalloc returns (a pointer to) global memory; accesses to global memory are transparently cached in the L1/L2 caches.

Shared memory is an explicitly managed scratch memory:

its latency is shorter than L1's (especially on Pascal)
you explicitly move data between global and shared memory
data are shared only within a thread block
the programming interface is covered shortly (a sketch follows below)
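A minimal sketch of that interface (the kernel name, TILE, and the block-level sum are illustrative, not from the slides): each thread block explicitly copies a tile of global memory into __shared__ scratch, synchronizes, and then works out of the fast scratch.

    #include <cuda_runtime.h>

    #define TILE 256   // threads per block = elements staged per block

    __global__ void stage_and_sum(const float *g, float *out, int n) {
      __shared__ float s[TILE];                 // explicitly managed scratch
      int i = blockIdx.x * TILE + threadIdx.x;
      s[threadIdx.x] = (i < n) ? g[i] : 0.0f;   // global -> shared, explicit
      __syncthreads();                          // s[] now visible to the block
      if (threadIdx.x == 0) {                   // data shared within this block only
        float v = 0.0f;
        for (int j = 0; j < TILE; j++) v += s[j];
        out[blockIdx.x] = v;                    // one partial sum per block
      }
    }

It would be launched as stage_and_sum<<<(n + TILE - 1) / TILE, TILE>>>(g, out, n), with g obtained from cudaMalloc (global memory) while s[] lives in the SM's shared memory.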

SLIDE 7

Latency measurement

The same pointer-chasing experiment as we did on the CPU:

    for (N times) {
      p = p->next;
    }

[Figure: N elements, each the size of a cache line, linked through next pointers in a random order]
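A GPU version of the loop might look like the following sketch, assuming a 32B node to match the line size above and timing with the device-side clock64() cycle counter (names are illustrative):

    #include <cuda_runtime.h>

    // one list element, padded to a 32B line
    struct node { node *next; char pad[32 - sizeof(node*)]; };

    __global__ void chase(node *p, long n, long long *cycles, node **sink) {
      long long t0 = clock64();
      for (long i = 0; i < n; i++) p = p->next;  // each load depends on the previous one
      long long t1 = clock64();
      *cycles = t1 - t0;   // total cycles; divide by n for latency/load
      *sink = p;           // keep the chain from being optimized away
    }

A single thread (chase<<<1, 1>>>) is enough, since latency, not bandwidth, is being measured.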

SLIDE 8

Data size vs. latency

Even an L1 cache hit takes 30 (Volta) to 100 (Pascal) cycles.

[Plot: latency per load in a random list traversal; x-axis: size of the region (bytes), y-axis: latency/load (GPU cycles); one curve each for Pascal and Volta]
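For completeness, a sketch of the host side that produces one point of such a curve, continuing the chase kernel above (managed memory keeps it short; a real harness would sweep the region size and warm up the cache first):

    #include <cstdio>
    #include <vector>
    #include <random>
    #include <algorithm>
    #include <cuda_runtime.h>

    void measure(long n) {                        // n nodes = 32n bytes
      node *a;
      cudaMallocManaged(&a, n * sizeof(node));
      std::vector<long> idx(n);
      for (long i = 0; i < n; i++) idx[i] = i;
      std::shuffle(idx.begin(), idx.end(), std::mt19937(42)); // random order
      for (long i = 0; i + 1 < n; i++) a[idx[i]].next = &a[idx[i + 1]];
      a[idx[n - 1]].next = &a[idx[0]];            // close the cycle
      long long *cycles; node **sink;
      cudaMallocManaged(&cycles, sizeof(*cycles));
      cudaMallocManaged(&sink, sizeof(*sink));
      chase<<<1, 1>>>(&a[idx[0]], 10 * n, cycles, sink);
      cudaDeviceSynchronize();
      printf("%ld bytes : %.1f cycles/load\n",
             n * (long)sizeof(node), *cycles / (10.0 * n));
    }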

SLIDE 9

Shared memory
