Efficient Stream Reduction on the GPU



slide-1
SLIDE 1

Efficient Stream Reduction on the GPU

David Roger, Ulf Assarsson, Nicolas Holzschuch

Grenoble University, Chalmers University of Technology, Cornell University

slide-2
SLIDE 2

2

Stream Reduction

Removing unwanted elements from a stream

Input stream Reduced stream

slide-3
SLIDE 3

3

Applications

  • Tree traversal:

– Ray tracing
– Collision detection

  • Often the bottleneck
slide-4
SLIDE 4

4

Sequential Algorithm

  • Algorithm:

    i = 0
    for j = 0 to n-1 do
        if x[j] is valid then
            x[i] = x[j]
            i = i + 1

  • Easy: one single loop
  • Linear complexity
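
A minimal Python sketch of the sequential algorithm above. The function name `compact_in_place` and the use of `None` to mark unwanted elements are assumptions made for illustration only.

```python
def compact_in_place(x, is_valid):
    """Sequential stream reduction: move the valid elements to the front of x."""
    i = 0
    for j in range(len(x)):
        if is_valid(x[j]):
            x[i] = x[j]  # i <= j always holds, so nothing unread is overwritten
            i += 1
    return i  # number of valid elements; x[:i] is the reduced stream


stream = [3, None, 7, None, None, 1, 4, None]
count = compact_in_place(stream, lambda e: e is not None)
reduced = stream[:count]
```

The write index `i` depends on every earlier iteration, which is why this single loop does not map directly onto parallel hardware.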
slide-5
SLIDE 5

5

On GPU

  • Parallelism
  • We assume no scatter

– We will speak about scatter later

slide-6
SLIDE 6

6

Talk Structure

  • Previous Works
  • Algorithm Overview
  • Details and Implementation
  • Results
  • Future Works & Conclusion
slide-7
SLIDE 7

7

Previous works: Horn's Method

Prefix sum scan: computes the displacements

Dichotomic search: performs the displacements

[Figure: input stream, its prefix sum (1 1 1 1 2 3 3 4 4 5 5 5 6 6), and the reduced stream]
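
The two steps can be sketched on the CPU as follows. This is an illustration under stated assumptions, not the shader code from the talk: `horn_reduce` is a hypothetical name, the prefix sum is assumed to count invalid elements up to and including each position, and the standard-library `bisect_left` stands in for the dichotomic search.

```python
from bisect import bisect_left
from itertools import accumulate


def horn_reduce(x, is_valid):
    """Scan + gather: every output position searches for its own source element."""
    n = len(x)
    if n == 0:
        return []
    # Prefix sum scan: sums[j] = number of invalid elements in x[0..j].
    sums = list(accumulate(0 if is_valid(e) else 1 for e in x))
    # key[j] = j - sums[j] never decreases and steps up by one exactly at
    # valid elements, so the leftmost j with key[j] == i is the source of output i.
    key = [j - sums[j] for j in range(n)]
    out_len = n - sums[-1]
    # Gather: one independent search per output element (no scatter needed).
    return [x[bisect_left(key, i)] for i in range(out_len)]


reduced = horn_reduce([3, None, 7, None, None, 1, 4, None], lambda e: e is not None)
```

Each output element is computed independently of the others, which is what makes the gather formulation usable when scatter is unavailable.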

slide-8
SLIDE 8

8

Previous works

  • Prefix sum scan

– Hillis and Steele, Horn: O(n log n)
– Blelloch, Sengupta et al., Harris et al.: O(n)
– Sengupta et al. Hybrid: O(n)

  • Dichotomic search: O(n log n)
  • Overall complexity: O(n log n)
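
To make the difference between the two scan families concrete, here is a sequential Python sketch of a Hillis and Steele style scan and a Blelloch style scan, each counting its additions. The function names and the addition counters are illustrative assumptions; the Blelloch sketch assumes a non-empty input whose length is a power of two.

```python
def hillis_steele_scan(a):
    """Inclusive scan in about log2(n) passes; O(n log n) additions overall."""
    out = list(a)
    adds = 0
    offset = 1
    while offset < len(out):
        prev = out           # buffer read during this pass
        out = list(prev)     # buffer written during this pass
        for k in range(offset, len(prev)):
            out[k] = prev[k] + prev[k - offset]
            adds += 1
        offset *= 2
    return out, adds


def blelloch_scan(a):
    """Exclusive scan with an up-sweep and a down-sweep; O(n) additions."""
    out = list(a)
    n = len(out)
    adds = 0
    d = 1
    while d < n:                              # up-sweep: build partial sums
        for k in range(2 * d - 1, n, 2 * d):
            out[k] += out[k - d]
            adds += 1
        d *= 2
    out[n - 1] = 0
    d = n // 2
    while d >= 1:                             # down-sweep: distribute prefixes
        for k in range(2 * d - 1, n, 2 * d):
            left = out[k - d]
            out[k - d] = out[k]
            out[k] += left
            adds += 1
        d //= 2
    return out, adds


flags = [1, 0, 1, 1, 0, 0, 1, 0]
inclusive, hs_adds = hillis_steele_scan(flags)
exclusive, bl_adds = blelloch_scan(flags)
```

Even with an O(n) scan, the dichotomic search still costs O(n log n) in total, which is why the overall complexity of this family stays at O(n log n).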
slide-9
SLIDE 9

9

Other approaches

  • Geometry shader + stream output

– NV_transform_feedback
– Input stream: vertices in a VBO
– Geometry shader discards NULL elements
– Output stream: vertices in a VBO

  • No fragments, no fragment shader
  • Bitonic sort

– Slow

  • Sum scan + Scatter with vertex engine
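
For comparison, the "sum scan + scatter" approach can be sketched in a few lines when scatter is available. This is a CPU illustration only; `scatter_reduce` is a hypothetical name, and the scan here counts valid elements so that it yields each element's output address directly.

```python
from itertools import accumulate


def scatter_reduce(x, is_valid):
    """Scan the validity flags, then scatter each valid element to its address."""
    flags = [1 if is_valid(e) else 0 for e in x]
    # Exclusive prefix sum of the flags; the last entry is the output length.
    address = list(accumulate(flags, initial=0))
    out = [None] * address[-1]
    for j, e in enumerate(x):
        if flags[j]:
            out[address[j]] = e  # scatter: the write position is data dependent
    return out


reduced = scatter_reduce([3, None, 7, None, None, 1, 4, None], lambda e: e is not None)
```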
slide-10
SLIDE 10

10

Talk Structure

  • Previous Works
  • Algorithm Overview
  • Details and Implementation
  • Results
  • Future Works & Conclusion
slide-12
SLIDE 12

12

Our approach

Input stream, split in blocks → Reduction of the blocks → Concatenation → Reduced stream

slide-13
SLIDE 13

13

Reduction of the blocks

  • In parallel
  • Using previous works

– Prefix sum scan
– Dichotomic search

  • Complexity

– s: size of a block
– One block: O(s log s)
– n/s blocks: O(n log s)

slide-14
SLIDE 14

14

Concatenation of the blocks

  • Prefix sum scan

– Computes displacements of the blocks in parallel

  • Line drawing

– Segment extremities moved by scattering (vertex engine)
– Other elements linearly interpolated (rasterization)

  • Complexity: O(n)
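
The following Python sketch simulates the two concatenation steps on the CPU. It is an illustration under assumptions, not the GPU implementation: the names `reduce_block`, `concatenate_blocks` and `stream_reduce` are hypothetical, the per-block reduction is replaced by a simple filter, and the inner loop stands in for what the rasterizer does when it interpolates a source address along a line segment.

```python
from itertools import accumulate


def reduce_block(block, is_valid):
    # Stand-in for the per-block scan + search; any per-block reduction works here.
    return [e for e in block if is_valid(e)]


def concatenate_blocks(reduced_blocks):
    counts = [len(b) for b in reduced_blocks]
    # Prefix sum scan over the block sizes: displacement of each block.
    starts = list(accumulate(counts, initial=0))
    out = [None] * starts[-1]
    for b, block in enumerate(reduced_blocks):
        if not block:
            continue
        # "Vertex engine": only the two segment extremities are scattered.
        first_dst, last_dst = starts[b], starts[b] + counts[b] - 1
        first_src, last_src = 0, counts[b] - 1
        # "Rasterization": each covered position gets an interpolated source address.
        for dst in range(first_dst, last_dst + 1):
            span = last_dst - first_dst
            t = 0.0 if span == 0 else (dst - first_dst) / span
            src = round(first_src + t * (last_src - first_src))
            out[dst] = block[src]
    return out


def stream_reduce(x, is_valid, s):
    blocks = [x[k:k + s] for k in range(0, len(x), s)]
    return concatenate_blocks([reduce_block(b, is_valid) for b in blocks])


reduced = stream_reduce(list(range(20)), lambda e: e % 2 == 0, 4)
```

Only two positions per block are computed by scattering; every other element is placed by interpolation, so the expensive scatter path is used for 2n/s elements rather than n.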
slide-15
SLIDE 15

15

Concatenation of the blocks

Reduced stream Reduced blocks

slide-16
SLIDE 16

16

Concatenation of the blocks

Reduced stream Reduced blocks

Move the extremities with the vertex shader

slide-17
SLIDE 17

17

Concatenation of the blocks

Reduced stream Reduced blocks

Move the extremities with the vertex shader Rasterization

slide-18
SLIDE 18

18

Algorithmic complexity

  • All previous works: O(n log n)
  • Our algorithm: O(n log s)

– s is the size of the blocks
– s is a constant!
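
Spelled out from the per-step costs given on the earlier slides, the bound can be written as follows. The names T_blocks, T_concat and T are introduced here for illustration only.

```latex
\begin{align*}
T_{\text{blocks}}(n) &= \frac{n}{s} \cdot O(s \log s) = O(n \log s) \\
T_{\text{concat}}(n) &= O(n) \\
T(n) &= T_{\text{blocks}}(n) + T_{\text{concat}}(n) = O(n \log s) = O(n)
  \quad \text{for a fixed block size } s
\end{align*}
```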

slide-19
SLIDE 19

19

Overview

Input stream, split in blocks → Reduction of the blocks (prefix sum scan + dichotomic search) → Concatenation (prefix sum scan + line drawing) → Reduced stream

slide-20
SLIDE 20

20

Why is it efficient?

The key is block concatenation:

– Dichotomic search is avoided
– Vertex engine: provides scatter, but is less efficient

  • Use it for a few elements (segment extremities)
  • Interpolate the other elements
slide-21
SLIDE 21

21

Talk Structure

  • Previous Works
  • Algorithm Overview
  • Details and Implementation
  • Results
  • Future Works & Conclusion
slide-23
SLIDE 23

23

Overview

Input stream, split in blocks → Reduction of the blocks (prefix sum scan + dichotomic search) → Concatenation (prefix sum scan + line drawing) → Reduced stream

slide-25
SLIDE 25

25

Dichotomic search details

Gather: at output position i, search for j in the input such that i = j - sum[j]

Search bounds: i + sum[i] ≤ j ≤ i + sum[15]

Example: i = 5

– Search bounds: 6 ≤ j ≤ 11
– Search result: j = 8, where sum[j] = 3

[Figure: input block (positions 0 to 15), prefix sum (1 1 1 1 2 3 4 4 5 5 5 6 6), and reduced block; input position j = 8 is gathered to output position i = 5]

slide-26
SLIDE 26

26

Dichotomic search pseudo-code

Search j0 such that i = j0 - sum[j0]:

    lowBound = i + sum[i]
    upBound = i + sum[n-1]
    if (upBound > n-1) discard
    j = (lowBound + upBound) / 2
    found = j - sum[j] - i
    while (found ≠ 0) {
        if (found < 0) lowBound = j
        else upBound = j
        j = (lowBound + upBound) / 2
        found = j - sum[j] - i
    }
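
A runnable Python variation of this search, using the same bounds. Two things here are assumptions rather than part of the slide: the loop is written as a leftmost (lower-bound) search so that it terminates with integer division and always lands on a valid element, and the validity flags are chosen so that the prefix sum matches the example values sum[5] = 1, sum[8] = 3 and sum[15] = 6. The names `invalid_prefix_sum` and `find_source` are hypothetical.

```python
from itertools import accumulate


def invalid_prefix_sum(valid):
    """sums[j] = number of invalid elements among positions 0..j."""
    return list(accumulate(0 if v else 1 for v in valid))


def find_source(i, sums):
    """Return the input position gathered to output position i, or None."""
    n = len(sums)
    low = i + sums[i]
    high = i + sums[n - 1]
    if high > n - 1:
        return None  # "discard": position i lies beyond the reduced block
    while low < high:
        j = (low + high) // 2
        found = j - sums[j] - i
        if found < 0:
            low = j + 1   # the source is strictly to the right of j
        else:
            high = j      # the source is at j or to its left
    return low


valid = [True, True, False, True, True, True, False, False,
         True, False, True, False, True, True, False, True]
sums = invalid_prefix_sum(valid)
source = find_source(5, sums)
```

With these flags the search for i = 5 starts from the bounds 6 ≤ j ≤ 11 and returns j = 8, as in the example.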

slide-27
SLIDE 27

27

Dichotomic search improvement

Search j0 such that i = j0 - sum[j0]:

    lowBound = i + sum[i]
    upBound = i + sum[n-1]
    if (upBound > n-1) discard
    j = (lowBound + upBound) / 2
    found = j - sum[j] - i
    while (found ≠ 0) {
        if (found < 0) lowBound = j - found
        else upBound = j - found
        j = (lowBound + upBound) / 2
        found = j - sum[j] - i
    }

This works because j - sum[j] is contracting: it increases by at most 1 when j increases by 1, so a miss of |found| means the target is at least |found| positions away from j.
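
The effect of the tighter bounds can be checked with a small Python sketch that runs the search with and without the improvement and counts probes. As before, this is a hedged CPU illustration: the function `search`, its `tighten` flag, the probe counter and the leftmost-search formulation are assumptions, not the shader code.

```python
from itertools import accumulate


def search(i, sums, tighten):
    """Leftmost j with j - sums[j] == i; returns (j, number of probes)."""
    n = len(sums)
    low, high = i + sums[i], i + sums[n - 1]
    if high > n - 1:
        return None, 0
    probes = 0
    while low < high:
        j = (low + high) // 2
        found = j - sums[j] - i
        probes += 1
        if found < 0:
            # The key grows by at most 1 per step, so the target is at
            # least -found positions to the right of j.
            low = j - found if tighten else j + 1
        else:
            # Symmetrically, the target is at least found positions to the left.
            high = j - found if tighten else j
    return low, probes


valid = [True, True, False, True, True, True, False, False,
         True, False, True, False, True, True, False, True]
sums = list(accumulate(0 if v else 1 for v in valid))
plain = [search(i, sums, tighten=False) for i in range(len(valid))]
tight = [search(i, sums, tighten=True) for i in range(len(valid))]
```

Both variants return the same positions; comparing the probe counts in `plain` and `tight` shows how much work the tightened bounds save on a given block.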

slide-28
SLIDE 28

28

Overview

Input stream, split in blocks → Reduction of the blocks (prefix sum scan + dichotomic search) → Concatenation (prefix sum scan + line drawing) → Reduced stream

slide-30
SLIDE 30

30

Line wrapping

  • We use 2D textures, so line segments must wrap at row ends

– Split all segments in two

  • Or

– Use geometry engine to split only when necessary

Concatenation
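
A minimal sketch of the wrapping step, assuming the stream is stored row by row in a texture of a given width. The function `wrap_segment` and its tuple layout are hypothetical. When a segment is no longer than one row, it yields at most two pieces, which matches the "split in two" strategy.

```python
def wrap_segment(start, length, width):
    """Split the 1D span [start, start + length) into per-row line segments.

    Returns a list of (row, x_first, x_last) tuples for a row-major 2D layout.
    """
    pieces = []
    pos, remaining = start, length
    while remaining > 0:
        row, x = divmod(pos, width)
        run = min(remaining, width - x)  # stop at the end of the current row
        pieces.append((row, x, x + run - 1))
        pos += run
        remaining -= run
    return pieces


crossing = wrap_segment(6, 5, 8)   # crosses the end of row 0
inside = wrap_segment(2, 3, 8)     # fits inside row 0
```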

slide-40
SLIDE 40

40

Talk Structure

  • Previous Works
  • Algorithm Overview
  • Details and Implementation
  • Results
  • Future Works & Conclusion
slide-42
SLIDE 42

42

Behavior: linear complexity

slide-43
SLIDE 43

43

Behavior: block size

slide-44
SLIDE 44

44

Behavior: fill ratio

slide-45
SLIDE 45

45

Comparison with previous works

slide-46
SLIDE 46

46

Talk Structure

  • Previous Works
  • Algorithm Overview
  • Details and Implementation
  • Results
  • Future Works & Conclusion
slide-48
SLIDE 48

48

Scatter ? (future work)

  • Scatter available in CUDA
  • Possible improvements
slide-49
SLIDE 49

49

Scatter ? (future work)

Input stream, split in blocks → Reduced stream

Reduction of the blocks:

  • without scatter: sum scan + search, O(n log s)
  • with scatter: sequential algorithm (loop over the block), O(n)

Concatenation:

  • Simpler
  • No wrapping
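
One way the scatter-based variant might look is sketched below. This is speculative, since the slides describe it only as future work: `reduce_with_scatter` is a hypothetical name, and the two-pass structure (count per block, then write each block at its final offset) is an assumption about what "simpler" concatenation could mean.

```python
from itertools import accumulate


def reduce_with_scatter(x, is_valid, s):
    """Per-block sequential loops that write straight to the final positions."""
    blocks = [x[k:k + s] for k in range(0, len(x), s)]
    # Pass 1: each block counts its valid elements independently.
    counts = [sum(1 for e in b if is_valid(e)) for b in blocks]
    # Prefix sum over the counts: where each block starts in the output.
    starts = list(accumulate(counts, initial=0))
    out = [None] * starts[-1]
    # Pass 2: each block runs the sequential loop and scatters its elements.
    for b, block in enumerate(blocks):
        i = starts[b]
        for e in block:
            if is_valid(e):
                out[i] = e
                i += 1
    return out


reduced = reduce_with_scatter(list(range(10)), lambda e: e % 3 == 0, 4)
```

Each element is visited a constant number of times, giving O(n) work, and no line segments are drawn, so no wrapping is needed.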
slide-50
SLIDE 50

50

Scatter ? (future work)

  • Overall complexity: O(n)
  • ... but other techniques in O(n)

– Sum scan (Harris et al. or Sengupta et al.) + scatter

  • Future work: tests with CUDA

– Expected speed up ≥ 2.5

slide-51
SLIDE 51

51

Conclusion

  • Orthogonal to previous works:

– We don't compete with them, we use them!

  • Better asymptotic complexity

– O(n) vs. O(n log n)

  • Significant speed up
  • Does not require scatter
slide-52
SLIDE 52

52

Thank you