Porting Maxwell to the GPU: Top Challenges. Juan Cañada, Head of Visualization, Next Limit Technologies


slide-1
SLIDE 1

Porting Maxwell to the GPU

Top Challenges

Juan Cañada Head of Visualization Next Limit Technologies

slide-2
SLIDE 2
  • Maxwell overview
  • Why porting to the GPU was challenging
  • Performance considerations
  • Using the CPU to improve the GPU engine
  • Summary

Agenda

slide-4
SLIDE 4

Maxwell Overview

Visualization Fluids Physics

slide-5
SLIDE 5
  • First physically based renderer on the market (2004)
  • Ground-truth reference renderer
  • Predictive rendering tool
  • Light analysis tool

Maxwell Overview MAXWELL

slide-6
SLIDE 6
  • Animation & VFX
  • Architecture
  • Industrial Design
  • Science
  • Others

Maxwell in use

slide-12
SLIDE 12
  • Maxwell Render overview
  • Why porting to the GPU was challenging
  • Performance considerations
  • Using the CPU to improve the GPU engine
  • Summary

Agenda

slide-13
SLIDE 13
  • Keep pixel accuracy
  • Use GPU for predictive rendering
  • Improve performance
  • Spectral, unbiased, accurate PBR
  • Support CPU & GPU resuming & merging

Challenges

slide-14
SLIDE 14

Predictive Rendering

slide-15
SLIDE 15

Correct, then fast ☺

Fast, then correct ☹

slide-16
SLIDE 16
slide-17
SLIDE 17
  • Maxwell overview
  • Why porting to the GPU was challenging
  • Performance considerations
  • Using the CPU to improve the GPU engine
  • Summary

Agenda

slide-18
SLIDE 18

Maxwell GPU Architecture

[Diagram: modules of the GPU engine]

  • Geometry Voxelization
  • Thread Mapping
  • Ray Generation
  • Ray Sorting
  • Ray Tracing
  • Direct Light
  • Visibility Test
  • Materials Evaluation
  • TM?

slide-20
SLIDE 20

GPU Maxwell

  • Voxelization
  • Same voxelization system as the CPU renderer
  • Currently performed on the CPU, just once
  • BVH
  • Binary tree (each node has 2 children)
  • Coherent traversal
  • All threads fetch the same amount of data per node
  • Increased coherence in performance
  • Trade-off: trees become bigger

slide-23
SLIDE 23

GPU Maxwell

  • Thread Mapping
  • Module that manages THREAD / PIXEL mapping
  • Sampling Level (SL)
  • Low SL: Morton curve
  • Medium SL: balances SPP
  • High SL: uses variance

[Figure: Morton curve]
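
The Morton (Z-order) curve named on this slide can be sketched as bit interleaving of the pixel coordinates. This is a minimal illustration, assuming 16-bit coordinates and a direct thread-index-to-pixel mapping; the function names are hypothetical and not taken from Maxwell.

```python
def part1by1(n: int) -> int:
    """Spread the low 16 bits of n so that a zero bit separates each of them."""
    n &= 0x0000FFFF
    n = (n | (n << 8)) & 0x00FF00FF
    n = (n | (n << 4)) & 0x0F0F0F0F
    n = (n | (n << 2)) & 0x33333333
    n = (n | (n << 1)) & 0x55555555
    return n


def compact1by1(n: int) -> int:
    """Inverse of part1by1: gather every second bit back into the low 16 bits."""
    n &= 0x55555555
    n = (n | (n >> 1)) & 0x33333333
    n = (n | (n >> 2)) & 0x0F0F0F0F
    n = (n | (n >> 4)) & 0x00FF00FF
    n = (n | (n >> 8)) & 0x0000FFFF
    return n


def morton_encode(x: int, y: int) -> int:
    """Interleave x (even bits) and y (odd bits) into one Morton code."""
    return part1by1(x) | (part1by1(y) << 1)


def morton_decode(code: int) -> tuple:
    """Recover the (x, y) pixel from a Morton code."""
    return compact1by1(code), compact1by1(code >> 1)


def thread_to_pixel(thread_id: int) -> tuple:
    """Assumed mapping: consecutive thread ids walk the image in Z-order."""
    return morton_decode(thread_id)


# The first four threads cover one 2x2 pixel tile.
first_tile = [thread_to_pixel(t) for t in range(4)]
```

Because consecutive codes stay inside small square tiles, neighboring threads tend to work on neighboring pixels, which is the property a thread/pixel mapping would want from this curve.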

slide-26
SLIDE 26

GPU Maxwell

  • Ray Generation Module
  • Primary Rays (PR)
  • Rays shot from camera
  • High degree of coherence
  • Two neighboring rays are likely to hit nearby, similar objects
  • Secondary Rays (SR)
  • Rays shot from surfaces
  • No coherence
  • Two neighboring rays might hit different objects
slide-27
SLIDE 27

GPU Maxwell

  • Ray Generation Module
  • Thread blocks with just PR
  • High degree of coherence
  • Best performance situation
  • Thread blocks with just SR
  • All will take much more time than PR
  • The worst SR will drive the performance
  • Thread blocks with PR and SR
  • SR will hurt PR performance
slide-29
SLIDE 29

GPU Maxwell

  • Ray Generation Module
  • How do we handle it?
  • GPU Ray sorting by Ray Type

Unsorted: PR0 PR1 SR0 PR2 SR1 PR3 SR2 PR4

Sorted by ray type: PR0 PR1 PR2 PR3 PR4 SR0 SR1 SR2

slide-30
SLIDE 30

GPU Maxwell

  • Ray Generation Module
  • How do we handle it?
  • GPU Ray sorting by Ray Type
  • Sorting is really fast
  • Simple, yet powerful
  • Do it just after 2nd bounce
  • Not needed for PR
  • Performance boost is scene dependent
slide-31
SLIDE 31

GPU Maxwell

  • Ray Generation Module
  • How do we handle it?
  • GPU Ray sorting by Ray Type
  • Considerations
  • Not useful for medium to small-res images
  • Use an indirection buffer
  • Cleaner code
  • Avoids moving global data
  • Much better performance
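
Sorting by ray type through an indirection buffer can be sketched as follows. This is a toy CPU-side model, not the Maxwell implementation: the per-ray costs, the block size of 4, and the rule "a block is as slow as its slowest ray" are assumptions made only to illustrate why grouping rays by type helps.

```python
PRIMARY, SECONDARY = 0, 1


def sort_indices_by_ray_type(ray_types):
    """Stable sort of ray indices by type. The ray data itself is never moved;
    only this index list (the indirection buffer) is reordered."""
    return sorted(range(len(ray_types)), key=lambda i: ray_types[i])


def total_block_cost(costs, indirection, block_size):
    """Toy cost model: each block takes as long as its slowest ray."""
    total = 0
    for start in range(0, len(indirection), block_size):
        block = indirection[start:start + block_size]
        total += max(costs[i] for i in block)
    return total


# The ray stream shown on the slide.
ray_names = ["PR0", "PR1", "SR0", "PR2", "SR1", "PR3", "SR2", "PR4"]
ray_types = [PRIMARY if name.startswith("PR") else SECONDARY for name in ray_names]
ray_costs = [1 if t == PRIMARY else 4 for t in ray_types]  # assumed relative costs

identity = list(range(len(ray_names)))             # unsorted access order
indirection = sort_indices_by_ray_type(ray_types)  # sorted access order

sorted_names = [ray_names[i] for i in indirection]
cost_unsorted = total_block_cost(ray_costs, identity, block_size=4)
cost_sorted = total_block_cost(ray_costs, indirection, block_size=4)
```

Threads read their ray through `indirection[thread_id]`, so the ray records stay where they are in global memory. In this toy model the unsorted stream costs 8 units and the sorted one 5, because one block now contains only primary rays.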
slide-34
SLIDE 34

GPU Maxwell

  • Ray Tracing Module
  • GPU-architecture-dependent kernels
  • Fermi, Kepler, Maxwell
  • Exploit the strengths of each architecture
slide-37
SLIDE 37

GPU Maxwell

Direct Light Module

1. Sample scene emitters at each path node
  • Two strategies
  • Sample 1 random emitter per sample
  • Sample all emitters per sample

2. Visibility test
  • Trace shadow rays
  • Incoherent rays: ray sorting does not help

3. Many other optimizations
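
The two emitter-sampling strategies can be compared with a small sketch. It assumes each emitter's contribution at a path node is already known (visibility included) and that the random emitter is picked uniformly; both the contribution values and the function names are hypothetical.

```python
import random


def direct_all_emitters(contributions):
    """Strategy 'all emitters per sample': sum every emitter's contribution."""
    return sum(contributions)


def direct_one_random_emitter(contributions, rng):
    """Strategy '1 random emitter per sample': pick one emitter uniformly and
    divide by its selection probability 1/n so the estimate stays unbiased."""
    n = len(contributions)
    i = rng.randrange(n)
    return contributions[i] * n


def average_one_emitter(contributions, num_samples, seed=0):
    """Average many single-emitter estimates; converges to the full sum."""
    rng = random.Random(seed)
    total = sum(direct_one_random_emitter(contributions, rng)
                for _ in range(num_samples))
    return total / num_samples


# Hypothetical unoccluded contributions of three emitters at one path node.
contributions = [0.5, 2.0, 1.5]
exact = direct_all_emitters(contributions)
estimate = average_one_emitter(contributions, num_samples=20000)
```

Sampling one emitter costs a single shadow ray per node but adds variance; sampling all emitters costs one shadow ray per emitter and removes that variance. Both converge to the same value.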

slide-40
SLIDE 40

GPU Maxwell

  • Materials Evaluation Module
  • Maxwell materials are complex
  • Many layers and many BSDFs per layer → very generic
slide-41
SLIDE 41

GPU Maxwell

Materials Evaluation Module

  • Big kernels are harmful
  • Samples evaluating different materials
  • Access different data
  • Execute different code
slide-42
SLIDE 42

GPU Maxwell

  • Materials Evaluation Module
  • Materials Group Queue System (MGQS)

1. Every material is assigned a Material Group ID
2. Queue system for Material Groups (MG)
3. Every queue has specific kernels
  • Avoid big kernels
4. Samples are queued to the corresponding MG queue
5. All samples evaluating the same MG are executed together
  • Increased coherence in execution time
  • Increased coherence in data access
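
The five MGQS steps can be sketched on the CPU as a dictionary of queues, one per material group. Everything named here (the material names, the two groups, the two "kernels") is hypothetical and only illustrates the queueing idea, not the actual Maxwell kernels.

```python
from collections import defaultdict


def shade_diffuse(sample):
    """Stand-in for a small kernel specialized for one material group."""
    return ("diffuse", sample["id"])


def shade_glossy(sample):
    """Stand-in for a second specialized kernel."""
    return ("glossy", sample["id"])


# Step 1: every material is assigned a Material Group ID (assumed table).
MATERIAL_GROUP = {"plaster": 0, "paint": 0, "chrome": 1}

# Step 3: every queue has its own specific kernel.
KERNELS = {0: shade_diffuse, 1: shade_glossy}


def enqueue(samples):
    """Steps 2 and 4: push each sample into the queue of its material group."""
    queues = defaultdict(list)
    for sample in samples:
        queues[MATERIAL_GROUP[sample["material"]]].append(sample)
    return queues


def run_queues(queues):
    """Step 5: all samples of one group run together through the same kernel."""
    results = []
    for group_id in sorted(queues):
        kernel = KERNELS[group_id]
        results.extend(kernel(sample) for sample in queues[group_id])
    return results


samples = [
    {"id": 0, "material": "chrome"},
    {"id": 1, "material": "plaster"},
    {"id": 2, "material": "paint"},
    {"id": 3, "material": "chrome"},
]
queues = enqueue(samples)
results = run_queues(queues)
```

On a GPU each queue would be consumed by its own kernel launch, so every thread in a launch executes the same code path and reads the same kind of material data.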

slide-52
SLIDE 52
  • Maxwell overview
  • Why porting to the GPU was challenging
  • Performance considerations
  • Using the CPU to improve the GPU engine
  • Summary

Agenda

slide-53
SLIDE 53

Using the CPU to improve the GPU engine

Why use our CPU engine as ground truth?

  • 12 years old → stable and robust
  • Used many times for validation purposes
slide-54
SLIDE 54

CPU vs GPU Case Studies

Guggenheim scene Teapot scene

slide-55
SLIDE 55

Guggenheim Scene

slide-57
SLIDE 57
  • Slight differences in intensity
  • Noise in some areas
  • Subtle changes in glossy surfaces

Guggenheim Scene ISSUES

slide-58
SLIDE 58
  • Simplifying & Isolating (surprise :P)
  • Automated numerical comparisons
  • Raytracing text output
  • Ray viewer

Guggenheim Scene STRATEGY

slide-59
SLIDE 59

CPU

Guggenheim Scene

slide-60
SLIDE 60

Different Intensity + Noise Problems

GPU

Guggenheim Scene

slide-61
SLIDE 61

GPU GPU CPU CPU

Guggenheim Scene – Intensity & Noise

slide-62
SLIDE 62

Guggenheim Scene FINDINGS

  • Emitter intensity
  • The hidden property of emitters was not working properly
  • Non-visible emitters were causing occlusions
  • Loss of energy
  • Noise
  • QMC had some problems in higher dimensions
slide-63
SLIDE 63

CPU

Guggenheim Scene FIXED

slide-64
SLIDE 64

GPU

FIXED Guggenheim Scene

slide-65
SLIDE 65

Guggenheim Scene

slide-66
SLIDE 66

Guggenheim Scene – Differences in glossies

slide-69
SLIDE 69

CPU GPU

Guggenheim Scene – Differences in glossies

slide-70
SLIDE 70

CPU GPU

  • Simplify the material → Lambert

Guggenheim Scene – Differences in glossies

slide-71
SLIDE 71

Guggenheim Scene – Differences in glossies

CPU GPU

slide-73
SLIDE 73
  • It turned out it was not related to materials
  • Both glossy and Lambert materials have the same problem
  • Difficult to isolate
  • Possible problems
  • QMC numbers bug?
  • Russian Roulette bug?
  • Ray / triangle intersection issues with indirect bounces?
  • Energy accumulation problem?
  • Precision issues?

Guggenheim Scene – Differences in glossies

slide-74
SLIDE 74

Russian Roulette was OK

(Mean path length for both engines was the same)
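
A mean-path-length check of this kind can be sketched as follows. The sketch assumes a constant continuation probability, which is a simplification (renderers usually derive it from path throughput), and the names and the 2% tolerance are hypothetical.

```python
import random


def path_length(continue_prob, rng, max_length=10_000):
    """Number of segments in one path terminated by Russian roulette."""
    length = 1
    while length < max_length and rng.random() < continue_prob:
        length += 1
    return length


def mean_path_length(continue_prob, num_paths, seed):
    """Average path length over many paths, as one engine would report it."""
    rng = random.Random(seed)
    total = sum(path_length(continue_prob, rng) for _ in range(num_paths))
    return total / num_paths


def engines_agree(mean_a, mean_b, rel_tol=0.02):
    """True when two engines report the same mean length within rel_tol."""
    return abs(mean_a - mean_b) <= rel_tol * max(mean_a, mean_b)


# Two independent runs standing in for the CPU and GPU engines.
cpu_mean = mean_path_length(0.75, num_paths=200_000, seed=1)
gpu_mean = mean_path_length(0.75, num_paths=200_000, seed=2)

# A run with a wrong continuation probability, standing in for a roulette bug.
buggy_mean = mean_path_length(0.70, num_paths=200_000, seed=3)
```

With continuation probability p the expected length is 1 / (1 - p), so 4 segments for p = 0.75. Matching means between engines rule out the roulette as the source of a difference, which is the conclusion stated on this slide.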

Guggenheim Scene – Differences in glossies

slide-76
SLIDE 76

CPU – QMC Distributions GPU – QMC Distributions

Guggenheim Scene – Differences in glossies

Automated tests detected differences!
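
An automated comparison of QMC distributions could look like the sketch below. It is an assumption-laden illustration: the slides do not say which sequence Maxwell uses, so a Halton sequence stands in for it, the faulty generator is invented to have something to detect, and the histogram test and its tolerance are hypothetical.

```python
PRIMES = [2, 3, 5, 7, 11, 13]


def radical_inverse(index, base):
    """Van der Corput radical inverse of index in the given base, in [0, 1)."""
    result = 0.0
    fraction = 1.0 / base
    while index > 0:
        result += (index % base) * fraction
        index //= base
        fraction /= base
    return result


def halton(index, dim):
    """Reference generator: Halton point 'index', coordinate 'dim'."""
    return radical_inverse(index, PRIMES[dim])


def faulty_halton(index, dim):
    """Invented bug: coordinates in dimensions 4 and above are skewed."""
    value = halton(index, dim)
    return value * value if dim >= 4 else value


def histogram(values, bins):
    """Count how many values fall in each of 'bins' equal cells of [0, 1)."""
    counts = [0] * bins
    for v in values:
        counts[min(int(v * bins), bins - 1)] += 1
    return counts


def mismatched_dimensions(gen_a, gen_b, dims, n=1000, bins=10, tol=0.02):
    """Dimensions whose per-bin sample counts differ by more than tol * n."""
    bad = []
    for d in range(dims):
        hist_a = histogram([gen_a(i, d) for i in range(1, n + 1)], bins)
        hist_b = histogram([gen_b(i, d) for i in range(1, n + 1)], bins)
        if max(abs(a - b) for a, b in zip(hist_a, hist_b)) > tol * n:
            bad.append(d)
    return bad


same = mismatched_dimensions(halton, halton, dims=6)
different = mismatched_dimensions(halton, faulty_halton, dims=6)
```

Checking each dimension separately matters because a defect confined to the higher dimensions only affects deeper bounces and is hard to spot by eye in a rendered image.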

slide-77
SLIDE 77

CPU GPU

SOLVED

Guggenheim Scene – Differences in glossies

slide-78
SLIDE 78

Guggenheim Scene

CPU == GPU ☺

slide-79
SLIDE 79
slide-80
SLIDE 80

Teapot Scene

CPU GPU

slide-83
SLIDE 83
  • Subtle differences in bump/normal mapping
  • Differences in materials with many layers/BSDFs
  • Small changes in intensity

Teapot Scene ISSUES

slide-84
SLIDE 84

Use cases where CPU Maxwell helped… A LOT!!!

CPU GPU

Test 1: Lambert materials + constant sky → OK

slide-85
SLIDE 85

Test 2: Added textures + normal maps → WRONG

CPU GPU

Teapot Scene

slide-86
SLIDE 86

CPU GPU

Test 3: Added multilayered materials → WRONG

Teapot Scene

slide-87
SLIDE 87
  • Automated CPU vs GPU numerical comparisons were key
  • Rays reaching IBL were not accumulating energy properly
  • Multilayered weights were not properly computed
  • Bug introduced when porting CPU optimized code
  • Precision issues creating TBN bases (Affected bump/normal mapping)

Teapot Scene FINDINGS
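
An automated CPU vs GPU numerical comparison can be as simple as the sketch below. It assumes both renders are available as flat lists of linear pixel values of equal length; the metrics and the tolerance are hypothetical choices, not the ones used by Next Limit.

```python
import math


def rmse(image_a, image_b):
    """Root-mean-square error between two images of equal size."""
    if len(image_a) != len(image_b):
        raise ValueError("images must have the same number of pixels")
    squared = sum((a - b) ** 2 for a, b in zip(image_a, image_b))
    return math.sqrt(squared / len(image_a))


def mean_intensity_ratio(reference, candidate):
    """Ratio of average intensities; values far from 1 point to energy errors."""
    return (sum(candidate) / len(candidate)) / (sum(reference) / len(reference))


def images_match(reference, candidate, rmse_tol=1e-3):
    """True when the candidate render is numerically close to the reference."""
    return rmse(reference, candidate) <= rmse_tol


# Tiny stand-in images: the CPU render is the ground truth.
cpu_render = [0.2, 0.5, 0.8, 1.0]
gpu_render_ok = list(cpu_render)
gpu_render_bright = [p * 1.05 for p in cpu_render]  # 5% too much energy

ratio = mean_intensity_ratio(cpu_render, gpu_render_bright)
```

A global intensity ratio is a cheap first test for energy accumulation bugs such as the IBL issue listed above, while the per-pixel error localizes problems like incorrect bump or normal mapping.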

slide-88
SLIDE 88
slide-89
SLIDE 89

Next Steps

Unbiased, GPU-friendly SSS

slide-90
SLIDE 90
  • Maxwell Render overview
  • Why porting to the GPU was challenging
  • Performance considerations
  • Using the CPU to improve the GPU engine
  • Summary

Agenda

slide-91
SLIDE 91

Main sources of bugs:

  • CPU optimized code not easy to port
  • Refactoring to make code GPU friendly
  • Precision issues with some math operators

Summary

slide-92
SLIDE 92
  • 90% of the complexity of Maxwell already ported
  • Very happy with the results: Speed boost: 5x-15x
  • CUDA made it possible
  • Validating using a ground truth renderer
  • Was painful
  • 100% worth it in the long run (quality first, speed second)

Summary

slide-93
SLIDE 93

Thanks!

Juan Cañada Head of Visualization Next Limit Technologies