SLIDE 1 Porting Maxwell to the GPU
Top Challenges
Juan Cañada, Head of Visualization, Next Limit Technologies
SLIDE 2
- Maxwell overview
- Why porting to the GPU was challenging
- Performance considerations
- Using the CPU to improve the GPU engine
- Summary
Agenda
SLIDE 4 Maxwell Overview
Visualization Fluids Physics
SLIDE 5
- First physically based renderer on the market (2004)
- Ground-truth reference renderer
- Predictive rendering tool
- Light analysis tool
Maxwell Overview MAXWELL
SLIDE 6
- Animation & VFX
- Architecture
- Industrial Design
- Science
- Others
Maxwell in use
SLIDE 11
SLIDE 12
- Maxwell Render overview
- Why porting to the GPU was challenging
- Performance considerations
- Using the CPU to improve the GPU engine
- Summary
Agenda
SLIDE 13
- Keep pixel accuracy
- Use GPU for predictive rendering
- Improve performance
- Spectral, unbiased, accurate PBR
- Support CPU & GPU resuming & merging
- …
Challenges
SLIDE 14
Predictive Rendering
SLIDE 15
Correct Fast ☺ Fast Correct
SLIDE 16
SLIDE 17
- Maxwell overview
- Why porting to the GPU was challenging
- Performance considerations
- Using the CPU to improve the GPU engine
- Summary
Agenda
SLIDE 18 Maxwell GPU Architecture
GPU pipeline stages:
- Geometry Voxelization
- Thread Mapping
- Ray Generation
- Ray Sorting
- Ray Tracing
- Direct Light
- Visibility Test
- Materials Evaluation
- TM?
SLIDE 20 GPU Maxwell
- Voxelization
- Same voxelization system as the CPU renderer
- Currently performed on the CPU, just once
- BVH
- Binary tree (each node has 2 children)
- Coherent traversal
(+) All threads fetch the same amount of data per node
(+) Increased coherence in performance
(-) Trees become bigger
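The fixed-size binary node idea can be sketched on the CPU as follows. This is only an illustration, not Maxwell's actual data layout: `Aabb`, `BvhNode`, `Ray`, `hitsBox` and `traverse` are hypothetical names, and storing both child boxes in the parent is an assumption about how "same amount of data per node" might be achieved.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

struct Aabb { float lo[3]; float hi[3]; };

// Hypothetical layout: a node stores the bounds of BOTH children, so every
// traversal step reads the same number of bytes regardless of the ray.
struct BvhNode {
    Aabb childBounds[2];
    std::int32_t child[2];  // >= 0: inner node index, < 0: leaf id stored as ~id
};
static_assert(sizeof(BvhNode) == 56, "every node has the same fixed size");

struct Ray { float org[3]; float invDir[3]; float tMax; };

// Slab test; this sketch assumes all direction components are non-zero.
bool hitsBox(const Ray& r, const Aabb& b) {
    float t0 = 0.0f, t1 = r.tMax;
    for (int a = 0; a < 3; ++a) {
        float tNear = (b.lo[a] - r.org[a]) * r.invDir[a];
        float tFar  = (b.hi[a] - r.org[a]) * r.invDir[a];
        if (tNear > tFar) std::swap(tNear, tFar);
        t0 = std::max(t0, tNear);
        t1 = std::min(t1, tFar);
    }
    return t0 <= t1;
}

// Returns the ids of all leaves whose bounds the ray overlaps.
std::vector<int> traverse(const std::vector<BvhNode>& nodes, const Ray& ray) {
    std::vector<int> leaves;
    if (nodes.empty()) return leaves;
    std::int32_t stack[64];
    int top = 0;
    stack[top++] = 0;  // start at the root
    while (top > 0) {
        const BvhNode& n = nodes[stack[--top]];
        for (int c = 0; c < 2; ++c) {
            if (!hitsBox(ray, n.childBounds[c])) continue;
            if (n.child[c] < 0) {
                leaves.push_back(~n.child[c]);
            } else {
                assert(top < 64);
                stack[top++] = n.child[c];
            }
        }
    }
    return leaves;
}
```

Every loop iteration touches exactly one 56-byte node, which is the property the slide credits for coherent traversal.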
SLIDE 22 GPU Maxwell
Pipeline diagram, next stage: Thread Mapping
SLIDE 23 GPU Maxwell
- Thread Mapping
- Module that manages THREAD / PIXEL mapping
- Sampling Level (SL)
- Low: Morton curve
Balances SPP
Uses Variance
Morton Curve
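A Morton (Z-order) thread-to-pixel mapping can be sketched as below. The function names are hypothetical and the 16-bit coordinate limit is an assumption of this sketch; the slides only state that a Morton curve is used for the thread/pixel mapping.

```cpp
#include <cassert>
#include <cstdint>

// Spread the lower 16 bits of v so a zero bit sits between each pair of bits.
std::uint32_t part1By1(std::uint32_t v) {
    v &= 0x0000ffffu;
    v = (v | (v << 8)) & 0x00ff00ffu;
    v = (v | (v << 4)) & 0x0f0f0f0fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

// Inverse of part1By1: gather every second bit back into the low 16 bits.
std::uint32_t compact1By1(std::uint32_t v) {
    v &= 0x55555555u;
    v = (v | (v >> 1)) & 0x33333333u;
    v = (v | (v >> 2)) & 0x0f0f0f0fu;
    v = (v | (v >> 4)) & 0x00ff00ffu;
    v = (v | (v >> 8)) & 0x0000ffffu;
    return v;
}

struct Pixel { std::uint32_t x, y; };

// Pixel -> position along the Morton curve (x in even bits, y in odd bits).
std::uint32_t mortonEncode(Pixel p) {
    assert(p.x < 65536u && p.y < 65536u);
    return part1By1(p.x) | (part1By1(p.y) << 1);
}

// Thread index -> pixel: consecutive threads land on nearby pixels.
Pixel threadToPixel(std::uint32_t threadId) {
    return Pixel{compact1By1(threadId), compact1By1(threadId >> 1)};
}
```

With this mapping, threads 0 to 3 cover one 2x2 pixel block, threads 0 to 15 one 4x4 block, and so on, so the rays of one thread block start out spatially close.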
SLIDE 25 GPU Maxwell
Pipeline diagram, next stage: Ray Generation / Ray Sorting
SLIDE 26 GPU Maxwell
- Ray Generation Module
- Primary Rays (PR)
- Rays shot from camera
- High degree of coherence
- Two neighboring rays will likely hit nearby, similar objects
- Secondary Rays (SR)
- Rays shot from surfaces
- No coherence
- Two neighboring rays might hit different objects
SLIDE 27 GPU Maxwell
- Ray Generation Module
- Thread blocks with just PR
- High degree of coherence
- Best performance situation
- Thread blocks with just SR
- All will take much more time than PR
- The worst SR will drive the performance
- Thread blocks with PR and SR
- SR will hurt PR performance
SLIDE 28 GPU Maxwell
- Ray Generation Module
- How do we handle it?
- GPU Ray sorting by Ray Type
PR0 PR1 SR0 PR2 SR1 PR3 SR2 PR4
SLIDE 29 GPU Maxwell
- Ray Generation Module
- How do we handle it?
- GPU Ray sorting by Ray Type
PR0 PR1 SR0 PR2 SR1 PR3 SR2 PR4 PR0 PR1 SR0 PR2 SR1 PR3 SR2 PR4
SLIDE 30 GPU Maxwell
- Ray Generation Module
- How do we handle it?
- GPU Ray sorting by Ray Type
- Sorting is really fast
- Simple, yet powerful
- Do it just after 2nd bounce
- Not needed for PR
- Performance boost is scene-dependent
SLIDE 31 GPU Maxwell
- Ray Generation Module
- How do we handle it?
- GPU Ray sorting by Ray Type
- Considerations
- Not useful for small- to medium-resolution images
- Use an indirection buffer
- Cleaner code
- Avoids moving global data
- Much better performance
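Sorting by ray type through an indirection buffer can be sketched like this. It is a CPU stand-in under stated assumptions: `RayType` and `sortByType` are hypothetical names, and a GPU version would use a parallel sort or partition primitive rather than `std::stable_partition`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>
#include <numeric>
#include <vector>

enum class RayType : std::uint8_t { Primary, Secondary };

// Returns the indirection buffer: ray indices reordered so that all primary
// rays come first. The ray payloads themselves are never moved.
std::vector<std::uint32_t> sortByType(const std::vector<RayType>& types) {
    assert(types.size() <= std::numeric_limits<std::uint32_t>::max());
    std::vector<std::uint32_t> order(types.size());
    std::iota(order.begin(), order.end(), 0u);
    std::stable_partition(order.begin(), order.end(), [&](std::uint32_t i) {
        return types[i] == RayType::Primary;
    });
    return order;
}
```

Applied to the sequence shown on the slides (PR0 PR1 SR0 PR2 SR1 PR3 SR2 PR4), the buffer becomes 0, 1, 3, 5, 7, 2, 4, 6: thread k then processes ray `order[k]`, so each thread block sees mostly one ray type while the global ray data stays in place.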
SLIDE 33 GPU Maxwell
Pipeline diagram, next stage: Ray Tracing
SLIDE 34 GPU Maxwell
- Ray Tracing Module
- GPU-architecture-dependent kernels
- Fermi, Kepler, Maxwell
- Use each architecture's strengths
SLIDE 36 GPU Maxwell Render
Pipeline diagram, next stage: Direct Light / Visibility Test
SLIDE 37 GPU Maxwell
Direct Light Module
1. Sample scene emitters at each path node
- Two strategies
- Sample 1 random emitter per sample
- Sample all emitters per sample
2. Visibility test
- Trace shadow rays
- Incoherent rays: ray sorting does not help
3. Many other optimizations
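The two emitter-sampling strategies can be compared with a small sketch. The per-emitter `contribution` values are hypothetical placeholders for the already-evaluated, visibility-tested radiance of each emitter at one path node, and uniform emitter selection is an assumption; the slides do not say how the random emitter is chosen.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Strategy "all emitters": sum every emitter's contribution at the path node.
double sampleAllEmitters(const std::vector<double>& contribution) {
    double sum = 0.0;
    for (double c : contribution) sum += c;
    return sum;
}

// Strategy "one random emitter": pick emitter k uniformly with u in [0, 1)
// and divide by the selection probability 1/n so the estimate stays unbiased.
double sampleOneEmitter(const std::vector<double>& contribution, double u) {
    assert(u >= 0.0 && u < 1.0);
    const std::size_t n = contribution.size();
    if (n == 0) return 0.0;
    std::size_t k = static_cast<std::size_t>(u * static_cast<double>(n));
    if (k >= n) k = n - 1;
    return contribution[k] * static_cast<double>(n);
}
```

Averaged over many values of `u`, the one-emitter estimate converges to the all-emitters sum; it traces one shadow ray per node instead of n, at the cost of extra variance.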
SLIDE 39 GPU Maxwell
Pipeline diagram, next stage: Materials Evaluation
SLIDE 40 GPU Maxwell
- Materials Evaluation Module
- Maxwell materials are complex
- Many layers and many BSDFs per layer: very generic
SLIDE 41 GPU Maxwell
Materials Evaluation Module
- Big kernels are harmful
- Samples evaluating different materials
- Access different data
- Execute different code
SLIDE 42 GPU Maxwell
- Materials Evaluation Module
- Materials Group Queue System (MGQS)
1. Every material is assigned a Material Group ID
2. Queue system for Material Groups (MG)
3. Every queue has specific kernels
4. Samples are queued to the corresponding MG Queue
5. All samples evaluating the same MG are executed together
- Increased coherence in execution time
- Increased coherence in data access
SLIDE 45 GPU Maxwell
- Materials Evaluation Module
- Materials Group Queue System (MGQS)
1. Every material is assigned a Material Group ID
2. Queue system for Material Groups (MG)
3. Every queue has specific kernels (Avoid big kernels)
4. Samples are queued to the corresponding MG Queue
5. All samples evaluating the same MG are executed together
- Increased coherence in execution time
- Increased coherence in data access
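The five MGQS steps can be sketched on the CPU as follows. All names (`Sample`, `GroupKernel`, `evalGroupA`, `evalGroupB`, `buildQueues`, `runQueues`) are hypothetical, and the two stand-in kernels just return constants; the real kernels and queue storage are not described in the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Step 1: every sample carries the Material Group ID of the material it hit.
struct Sample { std::uint32_t id; std::uint32_t materialGroup; };

// Step 3: one small, specialised kernel per group (stand-ins for illustration).
using GroupKernel = float (*)(std::uint32_t sampleId);
float evalGroupA(std::uint32_t) { return 1.0f; }
float evalGroupB(std::uint32_t) { return 2.0f; }

// Steps 2 and 4: one queue per group; each sample id goes to its group's queue.
std::vector<std::vector<std::uint32_t>> buildQueues(
        const std::vector<Sample>& samples, std::size_t groupCount) {
    std::vector<std::vector<std::uint32_t>> queues(groupCount);
    for (const Sample& s : samples) {
        assert(s.materialGroup < groupCount);
        queues[s.materialGroup].push_back(s.id);
    }
    return queues;
}

// Step 5: run each group's kernel over its whole queue in one batch.
std::vector<float> runQueues(
        const std::vector<std::vector<std::uint32_t>>& queues,
        const std::vector<GroupKernel>& kernels, std::size_t sampleCount) {
    assert(queues.size() == kernels.size());
    std::vector<float> result(sampleCount, 0.0f);
    for (std::size_t g = 0; g < queues.size(); ++g) {
        for (std::uint32_t id : queues[g]) {
            assert(id < sampleCount);
            result[id] = kernels[g](id);
        }
    }
    return result;
}
```

On a GPU each queue would map to its own kernel launch, so threads running side by side execute the same code and read the same material data.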
SLIDE 49 Thread Mapping Ray Generation
Ray Sorting
Materials Evaluation Geometry Voxelization
GPU Maxwell
Ray Tracing Direct Light
Visibility Test GPU
TM?
SLIDE 51
SLIDE 52
- Maxwell overview
- Why porting to the GPU was challenging
- Performance considerations
- Using the CPU to improve the GPU engine
- Summary
Agenda
SLIDE 53 Using the CPU to improve the GPU engine
Why use our CPU engine as ground truth?
- 12 years old: stable & robust
- Used many times for validation purposes
SLIDE 54 CPU vs GPU Case Studies
Guggenheim scene Teapot scene
SLIDE 55
Guggenheim Scene
SLIDE 56
Guggenheim Scene
SLIDE 57
- Slight differences in intensity
- Noise in some areas
- Subtle changes in glossy surfaces
Guggenheim Scene ISSUES
SLIDE 58
- Simplifying & Isolating (surprise :P)
- Automated numerical comparisons
- Raytracing text output
- Ray viewer
Guggenheim Scene STRATEGY
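An automated numerical comparison between a CPU and a GPU render can be as simple as the sketch below. The metrics (RMSE, maximum absolute difference, ratio of total intensity) and the names `ImageDiff`, `compareImages` and `matches` are assumptions chosen for illustration; the slides do not say which measures were actually used.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct ImageDiff {
    double rmse;            // root-mean-square error over all values
    double maxAbs;          // largest single difference
    double intensityRatio;  // sum(gpu) / sum(cpu); 1.0 means equal total energy
};

// Compares two renders stored as flat arrays of linear float values.
ImageDiff compareImages(const std::vector<float>& cpu,
                        const std::vector<float>& gpu) {
    assert(cpu.size() == gpu.size() && !cpu.empty());
    double sumSq = 0.0, maxAbs = 0.0, sumCpu = 0.0, sumGpu = 0.0;
    for (std::size_t i = 0; i < cpu.size(); ++i) {
        const double d = static_cast<double>(gpu[i]) - static_cast<double>(cpu[i]);
        sumSq += d * d;
        maxAbs = std::max(maxAbs, std::fabs(d));
        sumCpu += cpu[i];
        sumGpu += gpu[i];
    }
    const double n = static_cast<double>(cpu.size());
    return ImageDiff{std::sqrt(sumSq / n), maxAbs,
                     sumCpu > 0.0 ? sumGpu / sumCpu : 1.0};
}

// Both renders are Monte Carlo estimates, so a tolerance is needed.
bool matches(const ImageDiff& d, double tolerance) {
    return d.rmse <= tolerance && std::fabs(d.intensityRatio - 1.0) <= tolerance;
}
```

An intensity ratio away from 1.0 points at an energy problem, while a ratio near 1.0 with a high RMSE points at noise or localized differences.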
SLIDE 59 CPU
Guggenheim Scene
SLIDE 60 Different Intensity + Noise Problems
GPU
Guggenheim Scene
SLIDE 61 GPU GPU CPU CPU
Guggenheim Scene – Intensity & Noise
SLIDE 62 Guggenheim Scene FINDINGS
- Emitter intensity
- Hidden property of emitters was not working properly
- Non-visible emitters were causing occlusions
- Loss of energy
- Noise
- QMC had some problems for higher dimensions
SLIDE 63 CPU
Guggenheim Scene FIXED
SLIDE 64 GPU
FIXED Guggenheim Scene
SLIDE 65
Guggenheim Scene
SLIDE 66
Guggenheim Scene – Differences in glossies
SLIDE 67
Guggenheim Scene – Differences in glossies
SLIDE 68
Guggenheim Scene – Differences in glossies
SLIDE 69 CPU GPU
Guggenheim Scene – Differences in glossies
SLIDE 70 CPU GPU
- Simplify the material to Lambert
Guggenheim Scene – Differences in glossies
SLIDE 71 Guggenheim Scene – Differences in glossies
CPU GPU
SLIDE 72 Guggenheim Scene – Differences in glossies
CPU GPU
SLIDE 73
- It turned out it was not related to materials
- Both glossy and Lambert have the same problem
- Difficult to isolate
- Possible problems
- QMC numbers bug?
- Russian Roulette bug?
- Ray / triangle intersection issues with indirect bounces?
- Energy accumulation problem?
- Precision issues?
- …
Guggenheim Scene – Differences in glossies
SLIDE 74 Russian Roulette was OK
(Mean path length for both engines was the same)
Guggenheim Scene – Differences in glossies
SLIDE 75 CPU – QMC Distributions GPU – QMC Distributions
Guggenheim Scene – Differences in glossies
SLIDE 76 CPU – QMC Distributions GPU – QMC Distributions
Guggenheim Scene – Differences in glossies
Automated tests detected differences!
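One way to compare two sample distributions automatically is to histogram them per dimension and look at the largest bin difference. This is a sketch under assumptions: the names `histogram` and `maxBinDifference`, the equal-width bins and the max-difference measure are illustrative choices, not the test the talk describes.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Fraction of samples per equal-width bin; samples are expected in [0, 1).
std::vector<double> histogram(const std::vector<double>& samples,
                              std::size_t bins) {
    assert(bins > 0);
    std::vector<double> h(bins, 0.0);
    if (samples.empty()) return h;
    for (double s : samples) {
        const double clamped = std::min(std::max(s, 0.0), 1.0);
        const std::size_t b =
            static_cast<std::size_t>(clamped * static_cast<double>(bins));
        h[std::min(b, bins - 1)] += 1.0;
    }
    for (double& v : h) v /= static_cast<double>(samples.size());
    return h;
}

// Largest per-bin difference between two sample sets of one dimension.
double maxBinDifference(const std::vector<double>& cpuSamples,
                        const std::vector<double>& gpuSamples,
                        std::size_t bins) {
    const std::vector<double> a = histogram(cpuSamples, bins);
    const std::vector<double> b = histogram(gpuSamples, bins);
    double worst = 0.0;
    for (std::size_t i = 0; i < bins; ++i)
        worst = std::max(worst, std::fabs(a[i] - b[i]));
    return worst;
}
```

Running such a check once per sampling dimension would flag a generator that behaves well in the first dimensions but drifts in higher ones.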
SLIDE 77 CPU GPU
SOLVED
Guggenheim Scene – Differences in glossies
SLIDE 78 Guggenheim Scene
CPU == GPU ☺
SLIDE 79
SLIDE 80 Teapot Scene
CPU GPU
SLIDE 81
Teapot Scene
SLIDE 82
Teapot Scene
SLIDE 83
- Subtle differences in bump/normal mapping
- Differences in materials with many layers/BSDFs
- Small changes in intensity
Teapot Scene ISSUES
SLIDE 84 Use cases where CPU Maxwell helped… A LOT!!!
CPU GPU
Test 1 : Lambert materials + Constant Sky OK
SLIDE 85 Test 2 : Added textures + Normal maps WRONG
CPU GPU
Teapot Scene
SLIDE 86 CPU GPU
Test 3 : Added multilayered materials WRONG
Teapot Scene
SLIDE 87
- Automated CPU vs GPU numerical comparisons were key
- Rays reaching IBL were not accumulating energy properly
- Multilayered weights were not properly computed
- Bug introduced when porting CPU-optimized code
- Precision issues creating TBN bases (affected bump/normal mapping)
Teapot Scene FINDINGS
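For the TBN precision finding, one common defensive construction is Gram-Schmidt orthonormalization with a fallback axis. This is a generic sketch, not the fix applied in Maxwell: `Vec3`, `Tbn`, `makeTbn` and the `1e-6f` threshold are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
Vec3 sub(Vec3 a, Vec3 b) { return Vec3{a.x - b.x, a.y - b.y, a.z - b.z}; }
Vec3 scale(Vec3 a, float s) { return Vec3{a.x * s, a.y * s, a.z * s}; }
Vec3 cross(Vec3 a, Vec3 b) {
    return Vec3{a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z,
                a.x * b.y - a.y * b.x};
}
Vec3 normalize(Vec3 a) { return scale(a, 1.0f / std::sqrt(dot(a, a))); }

struct Tbn { Vec3 t, b, n; };

// Builds an orthonormal tangent/bitangent/normal basis from a normal and a
// tangent hint (for example a UV-derived tangent).
Tbn makeTbn(Vec3 normal, Vec3 tangentHint) {
    assert(dot(normal, normal) > 0.0f);
    const Vec3 n = normalize(normal);
    // Remove the component of the hint that lies along the normal.
    Vec3 t = sub(tangentHint, scale(n, dot(n, tangentHint)));
    // If the hint is (nearly) parallel to the normal, what is left is mostly
    // rounding error, so fall back to an axis that cannot be parallel to n.
    if (dot(t, t) <= 1e-6f * dot(tangentHint, tangentHint)) {
        const Vec3 axis = std::fabs(n.x) < 0.5f ? Vec3{1.0f, 0.0f, 0.0f}
                                                : Vec3{0.0f, 1.0f, 0.0f};
        t = sub(axis, scale(n, dot(n, axis)));
    }
    t = normalize(t);
    return Tbn{t, cross(n, t), n};
}
```

Without the fallback, normalizing a near-zero projected tangent amplifies rounding error, which is the kind of issue that tends to surface only in float GPU code.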
SLIDE 88
SLIDE 89
Next Steps: Unbiased, GPU-friendly SSS
SLIDE 90
- Maxwell Render overview
- Why porting to the GPU was challenging
- Performance considerations
- Using the CPU to improve the GPU engine
- Summary
Agenda
SLIDE 91 Main sources of bugs:
- CPU-optimized code not easy to port
- Refactoring to make code GPU-friendly
- Precision issues with some math operators
Summary
SLIDE 92
- 90% of the complexity of Maxwell already ported
- Very happy with the results: Speed boost: 5x-15x
- CUDA made it possible
- Validating using a ground-truth renderer
- Was painful
- 100% worth it in the long run (quality first, speed second)
Summary
SLIDE 93 Thanks!
Juan Cañada, Head of Visualization, Next Limit Technologies