A Fully GPU-Based Out-Of-Core Approach to Handle Large Volume Data

GPU Technology Conference, Munich, Germany – Oct 2018


slide-1
SLIDE 1

A Fully GPU-Based Out-Of-Core Approach to Handle Large Volume Data

Nicolas Courilleau1,2, Jonathan Sarton1, Florent Duguet1,3, Yannick Remion1 and Laurent Lucas1

1 – Université de Reims Champagne-Ardenne, France
2 – Neoxia, France
3 – Altimesh, France

GPU Technology Conference
Munich, Germany
09-11 Oct 2018

slide-2
SLIDE 2

Outline

  • Background and motivation
  • Previous works
  • Out-of-core model presentation
  • Model in action: application to visualization
  • Evaluation and results
  • Conclusion and outlook

slide-3
SLIDE 3

GTC 2018 Munich, E8246, Room 3 – N. Courilleau et al., 2018-10-11

Context

  • Targets the HPC part of the 3DNeuroSecure project
  • Interactive processing and visualization (virtual microscopy, DVR) of very large biomedical datasets
  • Accelerating drug discovery for Alzheimer's disease

[Diagram: local, offshore and teleworking access to x TB of 3D data]

slide-4
SLIDE 4


Problem statement

  • Designing out-of-core algorithms
  • Voxel representation → data volume >> CPU and GPU memory

Domain/Application     Data size
Mesh voxelization      ≈ 100 GB
Histology              ≈ 100 GB to several TB
Electron microscopy    ... and beyond
(All regular 3D grids)

4352³ voxels (RGBA – 32 bits) ≈ 330 GB

slide-5
SLIDE 5

Previous works

slide-6
SLIDE 6


Previous works

2009 – [Crassin et al.] ACM SIGGRAPH i3D
2011 – [Klaus Engel] IEEE Symposium on Large Data Analysis & Visualization
2012 – [Hadwiger et al.] IEEE Transactions on Visualization & Computer Graphics
2013 – [Fogal et al.] IEEE Symposium on Large Data Analysis & Visualization

slide-7
SLIDE 7


Previous works… at a glance

  • Address translation taxonomy [Beyer et al. 2015]
slide-8
SLIDE 8


Previous works… at a glance

  • Bricking: Page table look-up
  • Octree multi-resolution: tree traversal
  • Multi-resolution page table

[Beyer et al. 2015]

slide-9
SLIDE 9


Previous works… at a glance


slide-10
SLIDE 10


Previous works… at a glance


slide-11
SLIDE 11


And NVIDIA – Pascal / Volta unified memory

  • GPU memory oversubscription (unified memory)
  • Limited by host memory / OS specs
  • Volume decomposition still needed
  • Volta, using:
  • NVIDIA Tesla V100
  • IBM POWER9
  • NVLink 2 (+ OS ATS)
  • Unix "mmap"
  • Linux kernel 4.16 (at least)
  • Limitations:
  • ATS over NVLink 2 = POWER9 only
  • NVLink 2 = Tesla V100 only
  • No page-fault control
  • No texture memory

[Everything You Need to Know About Unified Memory, Nikolay Sakharnykh, GTC 2018] Summit – DOE/SC/Oak Ridge National Laboratory

slide-12
SLIDE 12


Our contributions

  • GPU-based out-of-core data management
  • Multiresolution multilevel page table hierarchy
  • Managed entirely on the GPU
  • Any kind of application (regular 3D grids of voxels)
  • Interactive visualization
  • On-demand data processing
  • Both at the same time
  • Reduced CPU-GPU communication
  • Complete pipeline, from storage to GPU

In addition,

  • Multi-OS support, since the Kepler architecture
slide-13
SLIDE 13

Out-of-core model presentation

slide-14
SLIDE 14


Data representation and storage

  • (1) Multiresolution – levels of detail
  • (2) Bricking – level subdivision
  • Enables the out-of-core approach
  • (1) + (2) = bricked multiresolution 3D pyramid
  • Bonus: data compression (LZ4 – lossless, with real-time decompression)

Level 2 / Level 1 / Level 0
slide-15
SLIDE 15


Multiresolution multilevel page table hierarchy

Brick cache

slide-16
SLIDE 16


Multiresolution multilevel page table hierarchy

Multiresolution page table Brick cache

slide-17
SLIDE 17


Multiresolution multilevel page table hierarchy

Page table cache Multiresolution page directory Brick cache

slide-18
SLIDE 18


Page table cache Multiresolution page directory Brick cache

Multiresolution multilevel page table hierarchy

  • Entry = 3D coordinates of the block in the next cache level
  • + Flag: Mapped / Unmapped / Empty
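The slide does not give the bit layout of an entry; one plausible encoding packs three 10-bit block coordinates and a 2-bit flag into a 32-bit word (the field widths here are assumptions, not the talk's actual format):

```python
# Flag values stored alongside the 3D block coordinates (assumed encoding).
UNMAPPED, MAPPED, EMPTY = 0, 1, 2

def pack_entry(x, y, z, flag):
    """Pack 10-bit x/y/z block coordinates and a 2-bit flag into 32 bits."""
    assert 0 <= x < 1024 and 0 <= y < 1024 and 0 <= z < 1024
    return (flag << 30) | (z << 20) | (y << 10) | x

def unpack_entry(e):
    """Recover (x, y, z, flag) from a packed 32-bit entry."""
    return e & 1023, (e >> 10) & 1023, (e >> 20) & 1023, e >> 30
```

A compact fixed-width entry like this is what makes it cheap to keep whole page-table blocks resident in GPU texture memory.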
slide-19
SLIDE 19


Virtual addressing

PT1 MRPD Brick cache

  • Virtual volume navigation – address = [𝑚, 𝑞]
  • 𝑚 = level of detail
  • 𝑞 = 3D normalized position, (𝑥, 𝑦, 𝑧) ∈ [0, 1)³
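The translation of a virtual address [m, q] through the MRPD → PT1 → brick-cache chain can be sketched as follows, with dicts standing in for the GPU-resident tables (the layout, key shapes and names are hypothetical):

```python
def translate(m, q, mrpd, pt1, pt_block=8, level_bricks=(64, 64, 64)):
    """Resolve a virtual address [m, q] (LOD m, normalized q in [0,1)^3)
    through a two-level hierarchy: MRPD -> page-table cache (PT1) -> brick
    cache. Returns brick-cache coordinates, or None on a cache miss."""
    nx, ny, nz = level_bricks                      # bricks per axis at LOD m
    bx, by, bz = int(q[0] * nx), int(q[1] * ny), int(q[2] * nz)
    # The MRPD is indexed by PT-block coordinates (groups of pt_block^3 bricks).
    d = mrpd.get((m, bx // pt_block, by // pt_block, bz // pt_block))
    if d is None or d["flag"] != "mapped":
        return None                                # miss (or empty region)
    # The MRPD entry names a PT block in PT1; index it with the local coords.
    p = pt1.get((d["block"], bx % pt_block, by % pt_block, bz % pt_block))
    if p is None or p["flag"] != "mapped":
        return None                                # miss at the PT level
    return p["brick"]                              # coords in the brick cache
```

A `None` return corresponds to the cache-miss case on the next slide: the miss is recorded in the request buffer so the brick gets fetched for a later frame.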
slide-20
SLIDE 20


Cache miss

  • Virtual volume navigation – address = [𝑚, 𝑞]
  • 𝑚 = level of detail
  • 𝑞 = 3D normalized position, (𝑥, 𝑦, 𝑧) ∈ [0, 1)³

PT1 MRPD Brick cache

Cache miss

slide-21
SLIDE 21


Pipeline

[Diagram: end-user application and application interface on the localhost; mass storage → CPU → GPU. On the GPU: multi-level multi-resolution page table hierarchy + data cache (L2 / L1 / L0), LRU updates, hierarchy update, request handling. On the CPU: a cache manager with a RAM brick cache and an asynchronous request-handler thread, linked to the GPU by CUDA zero-copy communication (requested bricks / brick-position request list).]

1 – Voxel cache request

slide-22
SLIDE 22

Pipeline

2 – Hierarchy look-up

slide-23
SLIDE 23

Pipeline

2.1 – Request list creation

slide-24
SLIDE 24

Pipeline

2.2 – Request list asynchronous handling

slide-25
SLIDE 25

Pipeline

2.3 – CPU cache look-up (simple cache)

slide-26
SLIDE 26

Pipeline

2.4 – If not in the CPU cache: load from mass storage

slide-27
SLIDE 27

Pipeline

2.5 – Load bricks into a CUDA zero-copy buffer

slide-28
SLIDE 28

Pipeline

3 & 2.6 – LRU updates, then address bricks in the GPU cache hierarchy
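The CPU side of steps 2.2 to 2.5 can be sketched as an asynchronous handler thread. This is a Python stand-in under stated assumptions: the queue plays the role of the brick-position request list, a dict the RAM brick cache, and a plain list the CUDA zero-copy staging buffer.

```python
import queue
import threading

def request_handler(request_q, cpu_cache, load_from_storage, zero_copy_buf):
    """Asynchronous request handler (sketch of pipeline steps 2.2-2.5):
    for each requested brick, look it up in the RAM brick cache, load it
    from mass storage on a miss, then stage it for upload to the GPU."""
    while True:
        brick_id = request_q.get()              # 2.2 - handle the request list
        if brick_id is None:                    # sentinel: shut down
            break
        data = cpu_cache.get(brick_id)          # 2.3 - CPU cache look-up
        if data is None:
            data = load_from_storage(brick_id)  # 2.4 - load from mass storage
            cpu_cache[brick_id] = data
        zero_copy_buf.append((brick_id, data))  # 2.5 - stage for the GPU

# Hypothetical usage with an in-memory stand-in for mass storage:
cpu_cache, staged = {}, []
rq = queue.Queue()
t = threading.Thread(target=request_handler,
                     args=(rq, cpu_cache, lambda b: f"brick{b}", staged))
t.start()
for b in [(0, 0, 0), (1, 0, 0), (0, 0, 0)]:     # 1 / 2.1 - GPU request list
    rq.put(b)
rq.put(None)
t.join()
```

Running the handler on its own thread is what lets rendering continue at interactive rates while missing bricks stream in over the following frames.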

slide-29
SLIDE 29


LRU updates on GPU

[Diagram: usage buffer → stream compactions with a used/unused mask → old LRU → updated LRU]

  • Size = number of bricks in the cache
  • Marked with a timestamp
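The mask-based LRU update can be illustrated sequentially. On the GPU this is done with two parallel stream compactions over the usage buffer; the list version below only shows the semantics, not the parallel implementation:

```python
def update_lru(old_lru, used_mask):
    """Update an LRU list with two compaction passes: entries untouched this
    frame keep their order at the front (eviction candidates), entries marked
    used move to the back (most recently used)."""
    untouched = [b for b in old_lru if not used_mask.get(b, False)]
    used = [b for b in old_lru if used_mask.get(b, False)]
    return untouched + used
```

Because each pass is a stream compaction, the whole update is a standard data-parallel primitive and never needs a round-trip to the CPU.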
slide-30
SLIDE 30


Brick request management on GPU

  • Size = number of bricks in the volume (all LODs)
  • Marked with a timestamp

Request buffer → stream compaction → requested list (LOD0 / LOD1 / LOD2)

slide-31
SLIDE 31


CPU / GPU transfer

CPU to GPU – the data (bricks) [CUDA zero-copy]
GPU to CPU – requested brick IDs

slide-32
SLIDE 32

Model in action: application to visualization

slide-33
SLIDE 33


General purpose applications

  • Processing
  • Convolution, classification, etc.
  • Visualization
  • Virtual microscopy
  • Direct volume rendering
slide-34
SLIDE 34


Multi-resolution volume Ray-Casting

  • Cast primary rays to integrate the volume rendering equation: accumulate color and opacity samples according to the transfer function
  • Multiresolution
  • LOD selection for each sample during ray-marching
  • Adaptive sampling according to the multiresolution representation

L(r) = L(r_in) e^{-∫_{r_in}^{r} τ(r') dr'} + ∫_{r_in}^{r} q(r̃) e^{-∫_{r̃}^{r} τ(r') dr'} dr̃
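Per-sample LOD selection is usually driven by the sample footprint along the ray. The heuristic below (match the desired sample spacing to the voxel size of the level) is a common choice and an assumption here, not necessarily the talk's exact criterion:

```python
import math

def select_lod(distance, voxel_size_l0, pixel_angle, max_lod):
    """Pick the LOD whose voxel size best matches the sample spacing at
    `distance` along the ray. `pixel_angle` is the angular footprint of one
    pixel; each LOD doubles the voxel size of the previous one."""
    target = distance * pixel_angle             # desired sample spacing
    lod = math.log2(max(target / voxel_size_l0, 1.0))
    return min(int(lod), max_lod)               # clamp to the coarsest level
```

Adaptive sampling then follows for free: the ray-marching step size is scaled to the voxel size of the selected level, so distant regions are both fetched and sampled coarsely.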

slide-35
SLIDE 35


Ray-guided approach

  • Modern approach for out-of-core GPU-based volume ray-casting on large data
  • Intuitive visibility selection: no additional culling computation
  • Intuitive out-of-core integration: only visible bricks are loaded into the GPU cache
slide-36
SLIDE 36


Space skipping

  • MRPD entry flagged empty or unmapped → all entries of the corresponding PT block (cache PT1) are flagged empty or unmapped → none of the corresponding bricks are present in the brick cache
  • MRPD entry flagged mapped, but one entry of the PT block flagged empty or unmapped → that brick is not present in the brick cache
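The skipping logic reduces to one flag test per brick along the ray: a brick flagged empty or unmapped is jumped over in a single step instead of being sampled finely. An illustrative sketch (not the CUDA kernel, which does this during ray-marching in a single pass):

```python
def march_with_skipping(bricks_along_ray, flags, brick_len):
    """Advance along a ray one brick at a time; only bricks whose page-table
    flag is 'mapped' are sampled finely, the rest are skipped whole."""
    t, sampled = 0.0, []
    for b in bricks_along_ray:
        if flags.get(b, "unmapped") == "mapped":
            sampled.append(b)   # fine sampling inside this brick
        # 'empty' or 'unmapped': skip the entire brick in one step
        t += brick_len
    return t, sampled
```

Because the flags live in the page-table hierarchy that the renderer traverses anyway, space skipping comes at no extra lookup cost.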

slide-37
SLIDE 37


Multi-GPU rendering strategy

  • Sort-last distributed approach
  • Compute scalability + memory scalability
  • Compositing: reconstruct the final rendering from each GPU's local contribution

Sort-first – subdivide the image domain; sort-last – subdivide the data domain

[Beyer et al. 2015]

slide-38
SLIDE 38


Volume distribution

  • Virtual volume distribution: restrict the virtual addressability range of each GPU
  • Implicit multiresolution volume distribution
slide-39
SLIDE 39


Distributed out-of-core volume rendering

GPU composition

  • Peer-to-peer GPU communication with UVA if possible
  • Intermediate CPU transfers otherwise
  • Front-to-back over operator

C′_j = C′_{j−1} + (1 − α′_{j−1}) C_j α_j
α′_j = α′_{j−1} + (1 − α′_{j−1}) α_j

with C: RGB color and α: opacity
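The operator above, applied per pixel to the GPUs' contributions in front-to-back order, can be written as a minimal sketch (non-premultiplied colors assumed):

```python
def composite_front_to_back(fragments):
    """Front-to-back 'over' operator:
      C'_j = C'_{j-1} + (1 - a'_{j-1}) * C_j * a_j
      a'_j = a'_{j-1} + (1 - a'_{j-1}) * a_j
    `fragments` is a front-to-back list of (rgb, alpha) per-GPU contributions."""
    acc_c, acc_a = (0.0, 0.0, 0.0), 0.0
    for (r, g, b), a in fragments:
        w = (1.0 - acc_a) * a                   # remaining transparency
        acc_c = (acc_c[0] + w * r, acc_c[1] + w * g, acc_c[2] + w * b)
        acc_a = acc_a + (1.0 - acc_a) * a
    return acc_c, acc_a
```

The front-to-back order matters: it lets the compositor (like the ray-caster itself) terminate early once accumulated opacity saturates.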

slide-40
SLIDE 40


Out-of-core distribution & communications

  • One instance of the virtual addressing structure and its cache manager per GPU
  • A single virtualization configuration (number of levels, brick size and PT block sizes)
  • A single CPU cache

Multi-threading and communication strategy

  • One OpenMP thread per CUDA context (per GPU)
  • Pinned to one core of the associated CPU
slide-41
SLIDE 41


Remote rendering

  • Interactive display streaming from a remote rendering server
slide-42
SLIDE 42

Evaluation and results

slide-43
SLIDE 43


Datasets

  • Histological mouse brain: 64000 × 50000 × 114, RGBA ≈ 1.5 TB
  • Light-sheet microscope, primate hippocampus: 2160 × 2560 × 1072, 16 bits ≈ 12 GB

slide-44
SLIDE 44


Memory occupancy

  • "Light sheet" microscopy: 2160 × 2560 × 1072 ≈ 12 GB
  • Brick size 64³ → ≈ 27,000 bricks (7 LODs)
  • One virtualization level (data cache = 7000 bricks) → ≈ 1.2 MB needed on the GPU
  • Mouse brain: 64000 × 50000 × 114 ≈ 1.5 TB
  • Brick size 64³ → ≈ 3.13 million bricks (10 LODs)
  • One virtualization level → ≈ 63 MB needed on the GPU
  • Two virtualization levels → ≈ 13 MB needed on the GPU
  • Note
  • Three virtualization levels and PT blocks of 64³ → one MRPD entry addresses ≈ 68 billion bricks
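The brick counts can be reproduced from the dataset dimensions, assuming ceil-halving per level and stopping once a single brick covers the whole level (the pyramid construction actually used may differ in details such as border handling):

```python
from math import ceil

def pyramid_bricks(dims, brick=64):
    """Bricks per LOD for a bricked multiresolution pyramid: count bricks at
    the current dimensions, then ceil-halve until one brick covers the level."""
    counts = []
    while True:
        counts.append(ceil(dims[0] / brick) * ceil(dims[1] / brick)
                      * ceil(dims[2] / brick))
        if max(dims) <= brick:
            break
        dims = tuple((d + 1) // 2 for d in dims)
    return counts

counts = pyramid_bricks((2160, 2560, 1072))
# 7 LODs, ~27,000 bricks in total: consistent with the light-sheet figures above
```

With one page-table entry per brick, multiplying the total by the entry size gives the GPU footprint of a single virtualization level; each extra level divides the resident table size by roughly the PT block volume.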

slide-45
SLIDE 45


Rendering server

  • NVIDIA Quadro VCA
  • 8 Quadro P6000 GPUs
  • 2 Intel Xeon CPUs
  • One hybrid multi-GPU, multi-CPU node
slide-46
SLIDE 46


Frame rate performance

slide-47
SLIDE 47


Frame rate performance

slide-48
SLIDE 48


Frame rate performance

Details of the different steps of the rendering

slide-49
SLIDE 49


View loading performance

Worst-case scenario: empty GPU and CPU caches

slide-50
SLIDE 50

Conclusion and outlook

slide-51
SLIDE 51


Conclusion

  • Complete visualization-driven pipeline
  • Out-of-core data management: multiresolution multilevel page table hierarchy
  • Entirely managed on the GPU
  • Reduced GPU-CPU communication
  • Any kind of application (regular 3D grids)
  • Adapted to HPC environments
  • Not OS-dependent
  • Small GPU memory footprint
  • Application: multi-GPU ray-guided multiresolution ray-casting
  • 1 – Good rendering frequency
  • 2 – Efficient on-demand data loading time
  • 1 + 2 – Even for very large volumes of data (> TB)
slide-52
SLIDE 52


Questions ?

Nicolas Courilleau – nicolas.courilleau@neoxia.com
Jonathan Sarton – jonathan.sarton@univ-reims.fr
Florent Duguet – florent.duguet@altimesh.com
Yannick Remion – yannick.remion@univ-reims.fr
Laurent Lucas – laurent.lucas@univ-reims.fr

A Fully GPU-Based Out-Of-Core Approach to Handle Large Volume Data