A Fully GPU-Based Out-Of-Core Approach to Handle Large Volume Data

GPU Technology Conference, Munich, Germany – Oct 2018


slide-1
SLIDE 1

A Fully GPU-Based Out-Of-Core Approach to Handle Large Volume Data

Nicolas Courilleau1,2, Jonathan Sarton1, Florent Duguet1,3, Yannick Remion1 and Laurent Lucas1

1 – Université de Reims Champagne-Ardenne, France
2 – Neoxia, France
3 – Altimesh, France

GPU Technology Conference
Munich, Germany
09-11 Oct 2018

slide-2
SLIDE 2

Outline

  • Background and motivation
  • Previous works
  • Out-of-core model presentation
  • Model in action: application to visualization
  • Evaluation and results
  • Conclusion and outlook

slide-3
SLIDE 3

GTC 2018 Munich, E8246, Room 3 – N. Courilleau et al., 2018-10-11

Context

  • Targets the HPC part of the 3DNeuroSecure project
  • Interactive processing and visualization (virtual microscopy, DVR) of very large biomedical datasets
  • Accelerating drug discovery for Alzheimer's disease

[Diagram: local, offshore and teleworking access to x TB of 3D data]

slide-4
SLIDE 4


Problem statement

  • Designing out-of-core algorithms
  • Voxel representation → data volume >> CPU and GPU memory

Domain/Application     Data size
Mesh voxelization      ≈ 100 GB
Histology              ≈ 100 GB to several TB
Electron microscopy    ... and beyond
(All regular 3D grids)

4352³ voxels (RGBA – 32 bits) ≈ 330 GB

slide-5
SLIDE 5

Previous works

slide-6
SLIDE 6


Previous works

2009 – [Crassin et al.] ACM SIGGRAPH i3D
2011 – [Klaus Engel] IEEE Symposium on Large Data Analysis & Visualization
2012 – [Hadwiger et al.] IEEE Transactions on Visualization & Computer Graphics
2013 – [Fogal et al.] IEEE Symposium on Large Data Analysis & Visualization

slide-7
SLIDE 7


Previous works… at a glance

  • Address translation taxonomy [Beyer et al. 2015]
slide-8
SLIDE 8


Previous works… at a glance

  • Bricking: Page table look-up
  • Octree multi-resolution: tree traversal
  • Multi-resolution page table

[Beyer et al. 2015]

slide-9
SLIDE 9


Previous works… at a glance


slide-10
SLIDE 10


Previous works… at a glance


slide-11
SLIDE 11


And NVIDIA – Pascal / Volta unified memory

  • GPU memory oversubscription (unified memory)
  • Limited by host memory / OS specs
  • Volume decomposition still needed
  • Volta, using:
  • NVIDIA Tesla V100
  • IBM POWER9
  • NVLink 2 (+ OS ATS)
  • Unix "mmap"
  • Linux kernel 4.16 (at least)
  • Limitations:
  • ATS over NVLink 2 = POWER9 only
  • NVLink 2 = Tesla V100 only
  • No page-fault control
  • No texture memory

[Everything You Need to Know About Unified Memory, Nikolay Sakharnykh, GTC 2018] Summit – DOE/SC/Oak Ridge National Laboratory

slide-12
SLIDE 12


Our contributions

  • GPU-based out-of-core data management
  • Multiresolution multilevel page table hierarchy
  • Managed entirely on the GPU
  • Any kind of application (regular 3D grids of voxels)
  • Interactive visualization
  • On-demand data processing
  • Both at the same time
  • Reduced CPU-GPU communication
  • Complete pipeline, from storage to GPU

In addition,

  • Multi-OS support, since the Kepler architecture
slide-13
SLIDE 13

Out-of-core model presentation

slide-14
SLIDE 14


Data representation and storage

  • (1) Multiresolution – levels of detail
  • (2) Bricking – level subdivision
  • Enables the out-of-core approach
  • (1) + (2) = bricked multiresolution 3D pyramid
  • Bonus: data compression (LZ4 – lossless, with real-time decompression)

Level 2 / Level 1 / Level 0
slide-15
SLIDE 15


Multiresolution multilevel page table hierarchy

Brick cache

slide-16
SLIDE 16


Multiresolution multilevel page table hierarchy

Multiresolution page table Brick cache

slide-17
SLIDE 17


Multiresolution multilevel page table hierarchy

Page table cache Multiresolution page directory Brick cache

slide-18
SLIDE 18


Page table cache Multiresolution page directory Brick cache

Multiresolution multilevel page table hierarchy

  • Entry = 3D coordinates of the block in the next cache level
  • + Flag: Mapped / Unmapped / Empty
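The slide does not give the bit layout of an entry; one plausible encoding packs three 10-bit block coordinates and a 2-bit flag into a 32-bit word (the field widths here are assumptions, not the talk's actual format):

```python
# Flag values stored alongside the 3D block coordinates (assumed encoding).
UNMAPPED, MAPPED, EMPTY = 0, 1, 2

def pack_entry(x, y, z, flag):
    """Pack 10-bit x/y/z block coordinates and a 2-bit flag into 32 bits."""
    assert 0 <= x < 1024 and 0 <= y < 1024 and 0 <= z < 1024
    return (flag << 30) | (z << 20) | (y << 10) | x

def unpack_entry(e):
    """Recover (x, y, z, flag) from a packed 32-bit entry."""
    return e & 1023, (e >> 10) & 1023, (e >> 20) & 1023, e >> 30
```

A compact fixed-width entry like this is what makes it cheap to keep whole page-table blocks resident in GPU texture memory.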
slide-19
SLIDE 19


Virtual addressing

PT1 MRPD Brick cache

  • Virtual volume navigation – address = [𝑚, 𝑞]
  • 𝑚 = level of detail
  • 𝑞 = 3D normalized position, (𝑥, 𝑦, 𝑧) ∈ [0, 1)³
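The translation of a virtual address [m, q] through the MRPD → PT1 → brick-cache chain can be sketched as follows, with dicts standing in for the GPU-resident tables (the layout, key shapes and names are hypothetical):

```python
def translate(m, q, mrpd, pt1, pt_block=8, level_bricks=(64, 64, 64)):
    """Resolve a virtual address [m, q] (LOD m, normalized q in [0,1)^3)
    through a two-level hierarchy: MRPD -> page-table cache (PT1) -> brick
    cache. Returns brick-cache coordinates, or None on a cache miss."""
    nx, ny, nz = level_bricks                      # bricks per axis at LOD m
    bx, by, bz = int(q[0] * nx), int(q[1] * ny), int(q[2] * nz)
    # The MRPD is indexed by PT-block coordinates (groups of pt_block^3 bricks).
    d = mrpd.get((m, bx // pt_block, by // pt_block, bz // pt_block))
    if d is None or d["flag"] != "mapped":
        return None                                # miss (or empty region)
    # The MRPD entry names a PT block in PT1; index it with the local coords.
    p = pt1.get((d["block"], bx % pt_block, by % pt_block, bz % pt_block))
    if p is None or p["flag"] != "mapped":
        return None                                # miss at the PT level
    return p["brick"]                              # coords in the brick cache
```

A `None` return corresponds to the cache-miss case on the next slide: the miss is recorded in the request buffer so the brick gets fetched for a later frame.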
slide-20
SLIDE 20


Cache miss

  • Virtual volume navigation – address = [𝑚, 𝑞]
  • 𝑚 = level of detail
  • 𝑞 = 3D normalized position, (𝑥, 𝑦, 𝑧) ∈ [0, 1)³

PT1 MRPD Brick cache

Cache miss

slide-21
SLIDE 21


Pipeline

[Diagram: end-user application and application interface on the localhost; mass storage → CPU → GPU. On the GPU: multi-level multi-resolution page table hierarchy + data cache (L2 / L1 / L0), LRU updates, hierarchy update, request handling. On the CPU: a cache manager with a RAM brick cache and an asynchronous request-handler thread, linked to the GPU by CUDA zero-copy communication (requested bricks / brick-position request list).]

1 – Voxel cache request

slide-22
SLIDE 22

Pipeline

2 – Hierarchy look-up

slide-23
SLIDE 23

Pipeline

2.1 – Request list creation

slide-24
SLIDE 24

Pipeline

2.2 – Request list asynchronous handling

slide-25
SLIDE 25

Pipeline

2.3 – CPU cache look-up (simple cache)

slide-26
SLIDE 26

Pipeline

2.4 – If not in the CPU cache: load from mass storage

slide-27
SLIDE 27

Pipeline

2.5 – Load bricks into a CUDA zero-copy buffer

slide-28
SLIDE 28

Pipeline

3 & 2.6 – LRU updates, then address bricks in the GPU cache hierarchy
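The CPU side of steps 2.2 to 2.5 can be sketched as an asynchronous handler thread. This is a Python stand-in under stated assumptions: the queue plays the role of the brick-position request list, a dict the RAM brick cache, and a plain list the CUDA zero-copy staging buffer.

```python
import queue
import threading

def request_handler(request_q, cpu_cache, load_from_storage, zero_copy_buf):
    """Asynchronous request handler (sketch of pipeline steps 2.2-2.5):
    for each requested brick, look it up in the RAM brick cache, load it
    from mass storage on a miss, then stage it for upload to the GPU."""
    while True:
        brick_id = request_q.get()              # 2.2 - handle the request list
        if brick_id is None:                    # sentinel: shut down
            break
        data = cpu_cache.get(brick_id)          # 2.3 - CPU cache look-up
        if data is None:
            data = load_from_storage(brick_id)  # 2.4 - load from mass storage
            cpu_cache[brick_id] = data
        zero_copy_buf.append((brick_id, data))  # 2.5 - stage for the GPU

# Hypothetical usage with an in-memory stand-in for mass storage:
cpu_cache, staged = {}, []
rq = queue.Queue()
t = threading.Thread(target=request_handler,
                     args=(rq, cpu_cache, lambda b: f"brick{b}", staged))
t.start()
for b in [(0, 0, 0), (1, 0, 0), (0, 0, 0)]:     # 1 / 2.1 - GPU request list
    rq.put(b)
rq.put(None)
t.join()
```

Running the handler on its own thread is what lets rendering continue at interactive rates while missing bricks stream in over the following frames.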

slide-29
SLIDE 29


LRU updates on GPU

[Diagram: usage buffer → stream compactions with a used/unused mask → old LRU → updated LRU]

  • Size = number of bricks in the cache
  • Marked with a timestamp
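The mask-based LRU update can be illustrated sequentially. On the GPU this is done with two parallel stream compactions over the usage buffer; the list version below only shows the semantics, not the parallel implementation:

```python
def update_lru(old_lru, used_mask):
    """Update an LRU list with two compaction passes: entries untouched this
    frame keep their order at the front (eviction candidates), entries marked
    used move to the back (most recently used)."""
    untouched = [b for b in old_lru if not used_mask.get(b, False)]
    used = [b for b in old_lru if used_mask.get(b, False)]
    return untouched + used
```

Because each pass is a stream compaction, the whole update is a standard data-parallel primitive and never needs a round-trip to the CPU.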
slide-30
SLIDE 30


Brick request management on GPU

  • Size = number of bricks in the volume (all LODs)
  • Marked with a timestamp

Request buffer → stream compaction → requested list (LOD0 / LOD1 / LOD2)

slide-31
SLIDE 31


CPU / GPU transfer

CPU to GPU – the data (bricks) [CUDA zero-copy]
GPU to CPU – requested brick IDs

slide-32
SLIDE 32

Model in action: application to visualization

slide-33
SLIDE 33


General purpose applications

  • Processing
  • Convolution, classification, etc.
  • Visualization
  • Virtual microscopy
  • Direct volume rendering
slide-34
SLIDE 34


Multi-resolution volume Ray-Casting

  • Cast primary rays to integrate the volume rendering equation: accumulate color and opacity samples according to the transfer function
  • Multiresolution
  • LOD selection for each sample during ray-marching
  • Adaptive sampling according to the multiresolution representation

L(r) = L(r_in) e^{-∫_{r_in}^{r} τ(r') dr'} + ∫_{r_in}^{r} q(r̃) e^{-∫_{r̃}^{r} τ(r') dr'} dr̃
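Per-sample LOD selection is usually driven by the sample footprint along the ray. The heuristic below (match the desired sample spacing to the voxel size of the level) is a common choice and an assumption here, not necessarily the talk's exact criterion:

```python
import math

def select_lod(distance, voxel_size_l0, pixel_angle, max_lod):
    """Pick the LOD whose voxel size best matches the sample spacing at
    `distance` along the ray. `pixel_angle` is the angular footprint of one
    pixel; each LOD doubles the voxel size of the previous one."""
    target = distance * pixel_angle             # desired sample spacing
    lod = math.log2(max(target / voxel_size_l0, 1.0))
    return min(int(lod), max_lod)               # clamp to the coarsest level
```

Adaptive sampling then follows for free: the ray-marching step size is scaled to the voxel size of the selected level, so distant regions are both fetched and sampled coarsely.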

slide-35
SLIDE 35


Ray-guided approach

  • Modern approach for out-of-core GPU-based volume ray-casting on large data
  • Intuitive visibility selection: no additional culling computation
  • Intuitive out-of-core integration: only visible bricks are loaded into the GPU cache
slide-36
SLIDE 36


Space skipping

  • MRPD entry flagged empty or unmapped → all entries of the corresponding PT block (cache PT1) are flagged empty or unmapped → none of the corresponding bricks are present in the brick cache
  • MRPD entry flagged mapped, but one entry of the PT block flagged empty or unmapped → that brick is not present in the brick cache
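The skipping logic reduces to one flag test per brick along the ray: a brick flagged empty or unmapped is jumped over in a single step instead of being sampled finely. An illustrative sketch (not the CUDA kernel, which does this during ray-marching in a single pass):

```python
def march_with_skipping(bricks_along_ray, flags, brick_len):
    """Advance along a ray one brick at a time; only bricks whose page-table
    flag is 'mapped' are sampled finely, the rest are skipped whole."""
    t, sampled = 0.0, []
    for b in bricks_along_ray:
        if flags.get(b, "unmapped") == "mapped":
            sampled.append(b)   # fine sampling inside this brick
        # 'empty' or 'unmapped': skip the entire brick in one step
        t += brick_len
    return t, sampled
```

Because the flags live in the page-table hierarchy that the renderer traverses anyway, space skipping comes at no extra lookup cost.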

slide-37
SLIDE 37


Multi-GPU rendering strategy

  • Sort-last distributed approach
  • Compute scalability + memory scalability
  • Compositing: reconstruct the final rendering from each GPU's local contribution

Sort-first – subdivide the image domain; sort-last – subdivide the data domain

[Beyer et al. 2015]

slide-38
SLIDE 38


Volume distribution

  • Virtual volume distribution: restrict the virtual addressability range of each GPU
  • Implicit multiresolution volume distribution
slide-39
SLIDE 39


Distributed out-of-core volume rendering

GPU composition

  • Peer-to-peer GPU communication with UVA if possible
  • Intermediate CPU transfers otherwise
  • Front-to-back over operator

C′_j = C′_{j−1} + (1 − α′_{j−1}) C_j α_j
α′_j = α′_{j−1} + (1 − α′_{j−1}) α_j

with C: RGB color and α: opacity
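The operator above, applied per pixel to the GPUs' contributions in front-to-back order, can be written as a minimal sketch (non-premultiplied colors assumed):

```python
def composite_front_to_back(fragments):
    """Front-to-back 'over' operator:
      C'_j = C'_{j-1} + (1 - a'_{j-1}) * C_j * a_j
      a'_j = a'_{j-1} + (1 - a'_{j-1}) * a_j
    `fragments` is a front-to-back list of (rgb, alpha) per-GPU contributions."""
    acc_c, acc_a = (0.0, 0.0, 0.0), 0.0
    for (r, g, b), a in fragments:
        w = (1.0 - acc_a) * a                   # remaining transparency
        acc_c = (acc_c[0] + w * r, acc_c[1] + w * g, acc_c[2] + w * b)
        acc_a = acc_a + (1.0 - acc_a) * a
    return acc_c, acc_a
```

The front-to-back order matters: it lets the compositor (like the ray-caster itself) terminate early once accumulated opacity saturates.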

slide-40
SLIDE 40


Out-of-core distribution & communications

  • One instance of the virtual addressing structure and its cache manager per GPU
  • A single virtualization configuration (number of levels, brick size and PT block sizes)
  • A single CPU cache

Multi-threading and communication strategy

  • One OpenMP thread per CUDA context (per GPU)
  • Pinned to one core of the associated CPU
slide-41
SLIDE 41


Remote rendering

  • Interactive display streaming from a remote rendering server
slide-42
SLIDE 42

Evaluation and results

slide-43
SLIDE 43


Datasets

  • Histological mouse brain: 64000 × 50000 × 114, RGBA ≈ 1.5 TB
  • Light-sheet microscope, primate hippocampus: 2160 × 2560 × 1072, 16 bits ≈ 12 GB

slide-44
SLIDE 44


Memory occupancy

  • "Light sheet" microscopy: 2160 × 2560 × 1072 ≈ 12 GB
  • Brick size 64³ → ≈ 27,000 bricks (7 LODs)
  • One virtualization level (data cache = 7000 bricks) → ≈ 1.2 MB needed on the GPU
  • Mouse brain: 64000 × 50000 × 114 ≈ 1.5 TB
  • Brick size 64³ → ≈ 3.13 million bricks (10 LODs)
  • One virtualization level → ≈ 63 MB needed on the GPU
  • Two virtualization levels → ≈ 13 MB needed on the GPU
  • Note
  • Three virtualization levels and PT blocks of 64³ → one MRPD entry addresses ≈ 68 billion bricks
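The brick counts can be reproduced from the dataset dimensions, assuming ceil-halving per level and stopping once a single brick covers the whole level (the pyramid construction actually used may differ in details such as border handling):

```python
from math import ceil

def pyramid_bricks(dims, brick=64):
    """Bricks per LOD for a bricked multiresolution pyramid: count bricks at
    the current dimensions, then ceil-halve until one brick covers the level."""
    counts = []
    while True:
        counts.append(ceil(dims[0] / brick) * ceil(dims[1] / brick)
                      * ceil(dims[2] / brick))
        if max(dims) <= brick:
            break
        dims = tuple((d + 1) // 2 for d in dims)
    return counts

counts = pyramid_bricks((2160, 2560, 1072))
# 7 LODs, ~27,000 bricks in total: consistent with the light-sheet figures above
```

With one page-table entry per brick, multiplying the total by the entry size gives the GPU footprint of a single virtualization level; each extra level divides the resident table size by roughly the PT block volume.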

slide-45
SLIDE 45


Rendering server

  • NVIDIA Quadro VCA
  • 8 Quadro P6000 GPUs
  • 2 Intel Xeon CPUs
  • One hybrid multi-GPU, multi-CPU node
slide-46
SLIDE 46


Frame rate performance

slide-47
SLIDE 47


Frame rate performance

slide-48
SLIDE 48


Frame rate performance

Details of the different steps of the rendering

slide-49
SLIDE 49


View loading performance

Worst-case scenario: empty GPU and CPU caches

slide-50
SLIDE 50

Conclusion and outlook

slide-51
SLIDE 51


Conclusion

  • Complete visualization-driven pipeline
  • Out-of-core data management: multiresolution multilevel page table hierarchy
  • Entirely managed on the GPU
  • Reduced GPU-CPU communication
  • Any kind of application (regular 3D grids)
  • Adapted to HPC environments
  • Not OS-dependent
  • Small GPU memory footprint
  • Application: multi-GPU ray-guided multiresolution ray-casting
  • 1 – Good rendering frequency
  • 2 – Efficient on-demand data loading time
  • 1 + 2 – Even for very large volumes of data (> TB)
slide-52
SLIDE 52


Questions ?

Nicolas Courilleau – nicolas.courilleau@neoxia.com
Jonathan Sarton – jonathan.sarton@univ-reims.fr
Florent Duguet – florent.duguet@altimesh.com
Yannick Remion – yannick.remion@univ-reims.fr
Laurent Lucas – laurent.lucas@univ-reims.fr

A Fully GPU-Based Out-Of-Core Approach to Handle Large Volume Data