Load Balanced Parallel GPU Out-of-Core for Continuous LOD Model Visualization

SLIDE 1

Ultrascale Visualization Workshop 2012

Department of Computer Science, Virginia Tech, Blacksburg, Virginia, USA

Load Balanced Parallel GPU Out-of-Core for Continuous LOD Model Visualization

Chao Peng, Peng Mi and Yong Cao

SLIDE 2

Motivation

How to efficiently render a large 3D model that contains a lot of objects and triangles?

The Boeing 777 model:

– Triangles: 332 million
– Vertices: 223 million
– Objects: 719 thousand

Rendering difficulties:

– The objects have dramatically different shapes and are topologically disconnected.
– The data size is far beyond the GPU rendering capabilities.
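To give the data-size claim a rough scale, the sketch below estimates the raw geometry size of the Boeing 777 model from the counts on this slide. The storage layout is an assumption, not something the slides state: 3 float32 coordinates per vertex and 3 uint32 vertex indices per triangle, with no normals, colors, or LOD data. The names `BYTES_PER_VERTEX`, `BYTES_PER_TRIANGLE`, and `model_bytes` are hypothetical.

```python
# Assumed layout (not from the slides): positions only, indexed triangles.
BYTES_PER_VERTEX = 3 * 4    # 3 float32 coordinates
BYTES_PER_TRIANGLE = 3 * 4  # 3 uint32 vertex indices


def model_bytes(num_vertices: int, num_triangles: int) -> int:
    """Return the raw size in bytes of the vertex positions plus the index buffer."""
    return num_vertices * BYTES_PER_VERTEX + num_triangles * BYTES_PER_TRIANGLE


# Counts taken from the slide: 223 million vertices, 332 million triangles.
total = model_bytes(223_000_000, 332_000_000)
print(f"{total / 1e9:.2f} GB")  # 6.66 GB
```

Even under this minimal layout the geometry is more than twice the 3 GB of memory on each GTX 580 listed on the Implementation slide, which is consistent with the need for out-of-core streaming.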

SLIDE 3

The Previous Approach

Our GPU-based approach in EuroGraphics’12.

– Parallel Continuous LOD: triangle-level mesh simplification.
– GPU Out-of-Core: CPU-GPU data streaming.

SLIDE 4

A Multi-GPU and Multi-Display System

[Diagram labels: the input triangle data set; two pairs of CPU Core and GPU Device.]

SLIDE 5

The approach on a single GPU

LOD Selection → GPU Out-of-Core → Triangle Reformation → Rendering (VBO)

SLIDE 6

LOD Selection → GPU Out-of-Core → Triangle Reformation → Rendering (VBO)

[Diagram: objects O1 to O7; labels: Existing Data, Coherence Evaluation, CPU, GPU, Defragmentation.]
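The slide only names the steps, so the following is a minimal sketch of one plausible reading of "Coherence Evaluation": compare the objects already resident on the GPU (the "Existing Data") with the objects the current frame requires, and stream only what is missing. The function name, the object IDs, and the three-way split are assumptions for illustration; defragmentation of GPU memory is not modeled.

```python
def coherence_evaluation(resident: set, required: set) -> tuple:
    """Split object IDs into (reuse, upload, evict) for the next frame.

    reuse:  already on the GPU and still needed, so no transfer is required
    upload: needed but not on the GPU, so it must be streamed from the CPU
    evict:  on the GPU but no longer needed, so its space can be reclaimed
    """
    reuse = resident & required
    upload = required - resident
    evict = resident - required
    return reuse, upload, evict


# Hypothetical frame: O1, O2, O3, O5 are resident; the new view needs O2, O3, O4, O6.
reuse, upload, evict = coherence_evaluation(
    {"O1", "O2", "O3", "O5"}, {"O2", "O3", "O4", "O6"}
)
print(sorted(reuse), sorted(upload), sorted(evict))
```

Under this reading, only the `upload` set crosses the CPU-GPU bus each frame, which is why high frame-to-frame coherence would keep the streaming cost low.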

SLIDE 7

Performance Bottleneck

[Diagram: objects O1 to O7; labels: Coherence Evaluation, CPU, GPU, Defragmentation.]

– GPU Out-Of-Core: 45%
– Triangle Reformation: 20%
– OpenGL VBO Rendering: 28%

SLIDE 8

Contributions

[Diagram: two parallel pipelines, each LOD Selection → GPU Out-of-Core → Triangle Reformation → Rendering (VBO) → Final Display, with Load Balancing and Inter-GPU Communication as the added components.]

SLIDE 9

Load Balancing

[Diagram: the view frustum from the viewpoint, partitioned into regions 1 to 5.]

LOD Selection Result: n1, n2, n3, n4, n5 for regions 1 to 5.

SLIDE 10

Load Balancing

[Diagram: the view frustum from the viewpoint, partitioned into regions 1 to 5.]

LOD Selection Result: n1, n2, n3, n4, n5 for regions 1 to 5.

GPU1: n1+n2
GPU2: n3+n4+n5

(n1+n2) / (n3+n4+n5) ∉ [1-t, 1+t]

SLIDE 11

Load Balancing

[Diagram: the view frustum from the viewpoint, partitioned into regions 1 to 5.]

LOD Selection Result: n1, n2, n3, n4, n5 for regions 1 to 5.

GPU1: n1+n2
GPU2: n3+n4+n5

(n1+n2) / (n3+n4+n5) ∉ [1-t, 1+t]

SLIDE 12

Load Balancing

[Diagram: the view frustum from the viewpoint, partitioned into regions 1 to 5.]

LOD Selection Result: n1, n2, n3, n4, n5 for regions 1 to 5.

GPU1: n1+n2+n3+n4
GPU2: n5

(n1+n2+n3+n4) / n5 ∉ [1-t, 1+t]

SLIDE 13

Load Balancing

[Diagram: the view frustum from the viewpoint, partitioned into regions 1 to 5.]

LOD Selection Result: n1, n2, n3, n4, n5 for regions 1 to 5.

GPU1: n1+n2+n3+n4
GPU2: n5

(n1+n2+n3+n4) / n5 ∉ [1-t, 1+t]

SLIDE 14

Load Balancing

[Diagram: the view frustum from the viewpoint, partitioned into regions 1 to 5.]

LOD Selection Result: n1, n2, n3, n4, n5 for regions 1 to 5.

GPU1: n1+n2+n3
GPU2: n4+n5

(n1+n2+n3) / (n4+n5) ∈ [1-t, 1+t]
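The Load Balancing slides step through candidate splits of the five regions until the ratio of the two GPUs' workloads falls inside [1-t, 1+t]. The sketch below illustrates that acceptance test under stated assumptions: each n_i is taken to be a per-region workload such as a triangle count, the regions are assigned as one contiguous prefix to GPU1 and the rest to GPU2, and the split is found by binary search over prefix sums. The search strategy is an assumption and may visit splits in a different order than the slides do; `find_split` and the sample counts are hypothetical.

```python
from itertools import accumulate


def find_split(counts, t):
    """Return k such that counts[:k] go to GPU1 and counts[k:] go to GPU2.

    A split is accepted when sum(counts[:k]) / sum(counts[k:]) lies in
    [1 - t, 1 + t]. Returns None when no split point satisfies the tolerance.
    Assumes at least two regions and non-negative counts.
    """
    prefix = list(accumulate(counts))
    total = prefix[-1]
    lo, hi = 1, len(counts) - 1  # both GPUs must receive at least one region
    while lo <= hi:
        k = (lo + hi) // 2
        left = prefix[k - 1]
        right = total - left
        # Multiply instead of dividing so that right == 0 needs no special case.
        if left < (1 - t) * right:
            lo = k + 1  # GPU1 has too little work: move the split outward
        elif left > (1 + t) * right:
            hi = k - 1  # GPU1 has too much work: move the split inward
        else:
            return k
    return None


# Hypothetical per-region workloads n1..n5 and tolerance t.
print(find_split([30, 25, 20, 40, 35], 0.2))  # 3
```

Binary search is valid here because the GPU1 sum can only grow, and the GPU2 sum can only shrink, as the split point moves outward. With the sample counts, the first candidate gives 55 against 95 (rejected) and the next gives 75 against 75 (accepted), so GPU1 receives regions 1 to 3.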

SLIDE 15

Inter-GPU Communication

[Diagram labels: Rendered image on GPU1, Rendered image on GPU2; Displayed image on GPU1, Displayed image on GPU2.]

CUDA Inter-Process Communication (CUDA 4.1 IPC) for transferring the image buffer.

SLIDE 16

Implementation

Two GPUs:

– NVIDIA GTX 580.
– 512 cores, 3 GB GDDR5.
– 192.4 GB/s memory bandwidth.

CPU Main Memory:

– 16 GB RAM.

Rendering performance:

– An average of 20 fps on the Linux system with MPI and CUDA 4.2.

SLIDE 17

SLIDE 18

Performance Evaluation

Comparison

– Dual-GPU with load balancing (our approach).
– Dual-GPU without load balancing.
– Single GPU.

SLIDE 19

Performance Evaluation

SLIDE 20

Performance Evaluation

| Approach | FPS | Diff. Triangle Num. | Visible Triangle Num. | Load Balancing | GPU Out-Of-Core | Triangle Reformation | GL Rendering |
| Single-GPU | 14.94 | - | 12.29 M | - | 29.62 ms | 3.62 ms | 30.24 ms |
| Dual-GPU (NB) | 17.84 | 7.94 M | 12.29 M | - | 24.54 ms | 2.85 ms | 25.31 ms |
| Dual-GPU (B) | 20.40 | 0.37 M | 12.29 M | 5.38 ms | 18.56 ms | 1.97 ms | 19.13 ms |
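The figures in the table can be cross-checked with a little arithmetic. The sketch below sums the per-stage times listed for each approach and computes the speedup of the reported frame rates over the single-GPU baseline. It assumes the millisecond values are per-frame stage times and that the two dual-GPU rows correspond to the unbalanced (NB) and balanced (B) configurations; the dictionary and function names are hypothetical.

```python
# Per-frame stage times in milliseconds, copied from the table.
stages_ms = {
    "Single-GPU": [29.62, 3.62, 30.24],
    "Dual-GPU (NB)": [24.54, 2.85, 25.31],
    "Dual-GPU (B)": [5.38, 18.56, 1.97, 19.13],  # includes the load balancing step
}
# Frame rates reported in the table.
reported_fps = {"Single-GPU": 14.94, "Dual-GPU (NB)": 17.84, "Dual-GPU (B)": 20.40}


def summarize(name):
    """Return the summed stage time and the FPS speedup over the single-GPU run."""
    total_ms = sum(stages_ms[name])
    speedup = reported_fps[name] / reported_fps["Single-GPU"]
    return {"total_ms": round(total_ms, 2), "speedup": round(speedup, 2)}


for name in stages_ms:
    print(name, summarize(name))
```

The listed stages sum to 63.48 ms, 52.70 ms, and 45.04 ms respectively, and the reported frame rates correspond to speedups of about 1.19x without load balancing and 1.37x with it. The summed stage times imply slightly higher frame rates than the reported ones, presumably because the table does not list every step of a frame.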

SLIDE 21

Performance Evaluation

SLIDE 22

Conclusion

A rendering system with two GPUs:

– A workload balancer based on a view-frustum partitioning method.
– Inter-GPU communication for image re-arrangement.

Future work:

– Scalability beyond two GPUs.

SLIDE 23

Acknowledgment

SLIDE 24

Thank you.