AUTOMATIC VECTORIZATION OF TREE TRAVERSALS


Youngjoon Jo, Michael Goldfarb and Milind Kulkarni. PACT, Edinburgh, U.K., September 11th, 2013.


slide-1
SLIDE 1

AUTOMATIC VECTORIZATION OF TREE TRAVERSALS

Youngjoon Jo, Michael Goldfarb and Milind Kulkarni PACT, Edinburgh, U.K. September 11th, 2013

slide-2
SLIDE 2

Commodity processors and SIMD

  • Commodity processors support SIMD (Single Instruction, Multiple Data) instructions
  • MMX (1996), SSE (1999), SSE2 (2001), SSE3 (2004), SSE4 (2006), AVX (2011), AVX2 (2013)
  • SIMD width is getting wider
  • AVX is 256-bit
  • Upcoming AVX-512 to be 512-bit (2015)
  • Using SIMD is an excellent way to improve performance

Youngjoon Jo

slide-3
SLIDE 3

SIMD works great for regular loops

for (int i = 0; i < 4; i++) {
    c[i] = a[i] + b[i];
}

slide-4
SLIDE 4

SIMD works great for regular loops

for (int i = 0; i < 4; i++) {
    c[i] = a[i] + b[i];
}

__m128 vec_a = _mm_load_ps(a);
__m128 vec_b = _mm_load_ps(b);
__m128 vec_c = _mm_add_ps(vec_a, vec_b);
_mm_store_ps(c, vec_c);
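The four intrinsics above can be wrapped into a small, compilable sketch. Note that `_mm_load_ps` and `_mm_store_ps` require 16-byte aligned pointers, so this version uses the unaligned variants `_mm_loadu_ps` and `_mm_storeu_ps`. The helper name `add_arrays`, the macro `SKETCH_HAVE_SSE` and the scalar fallback are assumptions for illustration, not part of the slides.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE__) || defined(_M_X64) || defined(_M_AMD64)
#include <xmmintrin.h>
#define SKETCH_HAVE_SSE 1
#endif

// Hypothetical helper: c[i] = a[i] + b[i] for i in [0, n).
// Processes four floats per iteration when SSE is available.
void add_arrays(const float *a, const float *b, float *c, std::size_t n) {
    std::size_t i = 0;
#ifdef SKETCH_HAVE_SSE
    for (; i + 4 <= n; i += 4) {
        __m128 vec_a = _mm_loadu_ps(a + i);  // unaligned load of 4 floats
        __m128 vec_b = _mm_loadu_ps(b + i);
        __m128 vec_c = _mm_add_ps(vec_a, vec_b);
        _mm_storeu_ps(c + i, vec_c);         // unaligned store of 4 floats
    }
#endif
    for (; i < n; i++) {                     // scalar tail, or the whole loop without SSE
        c[i] = a[i] + b[i];
    }
}
```

Calling `add_arrays(a, b, c, 4)` corresponds to the four-element loop on the slide; longer arrays are handled four lanes at a time with a scalar remainder.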

slide-5
SLIDE 5

But not so well on irregular codes

void main() {
    Ray *rays[N] = // rays to trace
    Node *root = // root of tree
    for (int i = 0; i < N; i++) {
        recurse(rays[i], root);
    }
}

void recurse(Ray *r, Node *n) {
    if (truncate(r, n)) return;
    if (n->isLeaf()) {
        update(r, n);
    } else {
        recurse(r, n->left);
        recurse(r, n->right);
    }
}

slide-7
SLIDE 7

Automatic vectorization desired

Automatic vectorization techniques for irregular codes highly desired

slide-8
SLIDE 8

Automatic vectorization desired

Automatic vectorization techniques for tree traversals highly desired

slide-9
SLIDE 9

Tree codes are important

slide-12
SLIDE 12

Tree vectorization challenges

  • Non-trivial to find vectorizable computation
  • Difficult to keep vectorizable computation together

slide-13
SLIDE 13

Previous tree vectorization work

  • Non-trivial to find vectorizable computation
  • Manually transform code to packetize traversals
  • Process multiple traversals in packet simultaneously
  • Wald et al. [Computer Graphics Forum 2001]
  • Difficult to keep vectorizable computation together

slide-14
SLIDE 14

Previous tree vectorization work

  • Non-trivial to find vectorizable computation
  • Manually transform code to packetize traversals
  • Process multiple points in packet simultaneously
  • Wald et al. [Computer Graphics Forum 2001]
  • Kim et al. [SIGMOD 2010]
  • Difficult to keep vectorizable computation together

In situations like physical simulation, collision detection or raytracing in scenes, where rays bounce into multiple directions (spherical or bumpmapped surfaces), coherent ray packets break down very quickly to single rays or do not exist at all. In the above mentioned tasks, packet-oriented SIMD computation is much less useful.
Havel and Herout [IEEE Transactions on Visualization and Computer Graphics 2010]

slide-15
SLIDE 15

Previous tree vectorization work

  • Non-trivial to find vectorizable computation
  • Manually transform code to packetize traversals
  • Process multiple points in packet simultaneously
  • Wald et al. [Computer Graphics Forum 2001]
  • Difficult to keep vectorizable computation together
  • Look to alternative sources of vectorization
  • Pixar [RT 2006]
  • Dammertz et al. [EGSR 2008]
  • Kim et al. [SIGMOD 2010]
  • Chhugani et al. [SC 2012]

slide-16
SLIDE 16

Our previous tree locality work

  • Point blocking: Jo and Kulkarni [OOPSLA 2011]
  • Traversal splicing: Jo and Kulkarni [OOPSLA 2012]

slide-17
SLIDE 17

Our solution

  • Non-trivial to find vectorizable computation
  • Manually transform code to packetize traversals
  • Automatically packetize traversals with point blocking and a novel layout transformation
  • Difficult to keep vectorizable computation together
  • Look to alternative sources of vectorization
  • Exploit dynamic sorting of traversal splicing to dramatically enhance utilization

slide-18
SLIDE 18

Contributions

  • Show how tree traversal codes can be systematically transformed to
  • Expose SIMD opportunities
  • Enhance utilization
  • Propose a novel layout transformation for efficient vectorization of tree codes
  • Present a framework for automatically restructuring traversals and data layouts to enable vectorization

slide-19
SLIDE 19

Contributions

Spoiler alert! SIMTree can deliver speedups of up to 6.59, and 2.78 on average

slide-20
SLIDE 20

Outline

  • Example & Abstract Model
  • Point Blocking to Enable SIMD
  • Traversal Splicing to Enhance Utilization
  • Automatic Transformation
  • Evaluation and Conclusion

slide-21
SLIDE 21

Tree traversals

void main() {
    Ray *rays[N] = // rays to trace
    Node *root = // root of tree
    for (int i = 0; i < N; i++) {
        recurse(rays[i], root);
    }
}

void recurse(Ray *r, Node *n) {
    if (truncate(r, n)) return;
    if (n->isLeaf()) {
        update(r, n);
    } else {
        recurse(r, n->left);
        recurse(r, n->right);
    }
}

slide-22
SLIDE 22

Tree traversals

void main() {
    Point *points[N] = // entities to traverse tree
    Node *root = // root of tree
    for (int i = 0; i < N; i++) {
        recurse(points[i], root);
    }
}

void recurse(Point *p, Node *n) {
    if (truncate(p, n)) return;
    if (n->isLeaf()) {
        update(p, n);
    } else {
        recurse(p, n->left);
        recurse(p, n->right);
    }
}
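The generic traversal above leaves `Point`, `Node`, `truncate` and `update` abstract. A minimal compilable sketch, assuming a hypothetical one-dimensional range query (each point has a position and a radius, each node covers an interval, and `update` simply counts leaf visits), might look like this. The fields, the truncation test and `traverseAll` are illustrative assumptions, not the benchmarks from the talk.

```cpp
#include <cassert>
#include <vector>

// Hypothetical concrete types; the slides keep these abstract.
struct Point { float x; float radius; int hits; };
struct Node {
    float lo, hi;        // interval covered by this subtree
    Node *left, *right;
    Node(float l, float h, Node *lt = nullptr, Node *rt = nullptr)
        : lo(l), hi(h), left(lt), right(rt) {}
    bool isLeaf() const { return left == nullptr && right == nullptr; }
};

// Prune the subtree when the point's search interval misses the node's interval.
bool truncate(const Point *p, const Node *n) {
    return p->x + p->radius < n->lo || p->x - p->radius > n->hi;
}

// Count how many leaves the point reaches.
void update(Point *p, const Node *) { p->hits++; }

void recurse(Point *p, Node *n) {
    if (truncate(p, n)) return;
    if (n->isLeaf()) {
        update(p, n);
    } else {
        recurse(p, n->left);
        recurse(p, n->right);
    }
}

// One independent traversal per point, as in main() on the slide.
void traverseAll(std::vector<Point> &points, Node *root) {
    for (Point &p : points) recurse(&p, root);
}
```

Each point takes its own path through the tree, which is exactly why neighbouring loop iterations do not line up for SIMD without further transformation.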

slide-27
SLIDE 27

An abstract model

void main() {
    foreach(Point p : points) {
        foreach(Node n : p.oracleNodes()) {
            update(p, n);
        }
    }
}

slide-28
SLIDE 28

Iteration space of traversal

[Figure: iteration space of the traversal, with points on one axis and nodes on the other]

slide-29
SLIDE 29

Iteration space of traversal

[Figure: iteration space. Points A to H on one axis; tree nodes 1 to 15 on the other, in traversal order 1 2 4 8 9 5 10 11 3 6 12 13 7 14 15]

slide-31
SLIDE 31

How to vectorize?

[Figure: iteration space of points A to H over tree nodes 1 to 15]

slide-32
SLIDE 32

Outline

  • Example & Abstract Model
  • Point Blocking to Enable SIMD
  • Traversal Splicing to Enhance Utilization
  • Automatic Transformation
  • Evaluation and Conclusion

slide-33
SLIDE 33

Point blocking [OOPSLA 2011]

[Figure: iteration space of points A to H over tree nodes 1 to 15, with points grouped into blocks]

slide-35
SLIDE 35

Point blocked code

void recurse(Point *p, Node *n) {
    if (truncate(p, n)) return;
    if (n->isLeaf()) {
        update(p, n);
    } else {
        recurse(p, n->left);
        recurse(p, n->right);
    }
}

slide-36
SLIDE 36

Point blocked code

void recurse(Block *block, Node *n) {
    // Function body (not yet adapted to the block parameter)
    if (truncate(p, n)) return;
    if (n->isLeaf()) {
        update(p, n);
    } else {
        recurse(p, n->left);
        recurse(p, n->right);
    }
}

slide-38
SLIDE 38

Point blocked code

void recurse(Block *block, Node *n) {
    // Loop over points in block
    for (int i = 0; i < block->size; i++) {
        Point *p = block->p[i];
        // Function body
        if (truncate(p, n)) continue;
        if (n->isLeaf()) {
            update(p, n);
        } else {
            recurse(p, n->left);
            recurse(p, n->right);
        }
    }
}

slide-40
SLIDE 40

Point blocked code

void recurse(Block *block, Node *n) {
    Block *nextBlock = // next level block
    // Loop over points in block
    for (int i = 0; i < block->size; i++) {
        Point *p = block->p[i];
        // Function body
        if (truncate(p, n)) continue;
        if (n->isLeaf()) {
            update(p, n);
        } else {
            nextBlock->add(p);
        }
    }
}

slide-41
SLIDE 41

Point blocked code

void recurse(Block *block, Node *n) {
    Block *nextBlock = // next level block
    // Loop over points in block
    for (int i = 0; i < block->size; i++) {
        Point *p = block->p[i];
        // Function body
        if (truncate(p, n)) continue;
        if (n->isLeaf()) {
            update(p, n);
        } else {
            nextBlock->add(p);
        }
    }
    // Next block recurses children
    if (nextBlock->size > 0) {
        recurse(nextBlock, n->left);
        recurse(nextBlock, n->right);
    }
}
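The point-blocked recursion can be made concrete with a small sketch. It assumes a block is simply a `std::vector` of point pointers and reuses a hypothetical one-dimensional range query for `truncate` and `update`; these types are illustrative assumptions, not the data structures generated by the authors' tool.

```cpp
#include <cassert>
#include <vector>

// Hypothetical concrete types for illustration only.
struct Point { float x; float radius; int hits; };
struct Node {
    float lo, hi;
    Node *left, *right;
    Node(float l, float h, Node *lt = nullptr, Node *rt = nullptr)
        : lo(l), hi(h), left(lt), right(rt) {}
    bool isLeaf() const { return left == nullptr && right == nullptr; }
};
using Block = std::vector<Point *>;  // assumption: a block is a list of point pointers

bool truncate(const Point *p, const Node *n) {
    return p->x + p->radius < n->lo || p->x - p->radius > n->hi;
}
void update(Point *p, const Node *) { p->hits++; }

void recurse(const Block &block, Node *n) {
    Block nextBlock;                        // points that continue below n
    for (Point *p : block) {                // loop over points in block
        if (truncate(p, n)) continue;       // p leaves the block at this node
        if (n->isLeaf()) {
            update(p, n);
        } else {
            nextBlock.push_back(p);         // survivors are compacted into the next block
        }
    }
    if (!nextBlock.empty()) {               // next block recurses into the children
        recurse(nextBlock, n->left);
        recurse(nextBlock, n->right);
    }
}
```

The inner loop now visits one node with many points, which is the regular loop shape that SIMD needs; at a leaf `nextBlock` stays empty, so the recursion never follows a null child.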

slide-42
SLIDE 42

Point blocking [OOPSLA 2011]

[Figure: iteration space of points A to H over tree nodes 1 to 15, traversed block by block]

slide-46
SLIDE 46

Analogous to packet SIMD

[Figure: iteration space of points A to H over tree nodes 1 to 15, with points grouped into packets]

Breaks down when points diverge

slide-47
SLIDE 47

Packet SIMD has poor utilization

[Figure: iteration space of points A to H over tree nodes 1 to 15; legend: Full SIMD, Partial SIMD]

slide-49
SLIDE 49

SIMD utilization

[Figure: iteration space of points A to H over tree nodes 1 to 15; legend: Full SIMD, Partial SIMD]

SIMD utilization = Work in full SIMD / Total work

slide-50
SLIDE 50

SIMD utilization

[Figure: iteration space of points A to H over tree nodes 1 to 15; legend: Full SIMD, Partial SIMD]

Work in full SIMD / Total work = Circles in blue / Total circles = 24 / 74 = 0.32
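The ratio above can be computed mechanically once the number of active points at each visited node is known. The sketch below assumes that, at a node with `k` active points and SIMD width `w`, the work done in full SIMD is the part of `k` covered by completely filled vectors; that counting rule and the name `simdUtilization` are assumptions for illustration, since the slide defines the ratio only through the figure.

```cpp
#include <cassert>
#include <vector>

// activePerNode[i] = number of points in the block that do work at the i-th visited node.
// Returns (work in full SIMD) / (total work), or 0 when there is no work.
double simdUtilization(const std::vector<int> &activePerNode, int simdWidth) {
    assert(simdWidth > 0);
    long fullWork = 0;
    long totalWork = 0;
    for (int active : activePerNode) {
        fullWork += (active / simdWidth) * simdWidth;  // points covered by full vectors
        totalWork += active;                           // every active point is one unit of work
    }
    return totalWork == 0 ? 0.0 : static_cast<double>(fullWork) / totalWork;
}
```

For example, with a hypothetical block that has 4, 2 and 2 active points at three nodes and a SIMD width of 4, the utilization is 4 / 8 = 0.5. On the slide's example the same ratio is 24 / 74, or about 0.32.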

slide-51
SLIDE 51

Use larger block size

[Figure: iteration space of points A to H over tree nodes 1 to 15; legend: Full SIMD, Partial SIMD]

Use block size larger than SIMD width and compact points!

slide-52
SLIDE 52

Use larger block size

[Figure: iteration space of points A to H over tree nodes 1 to 15, with a larger block]

slide-54
SLIDE 54

Better utilization with larger block size

[Figure: iteration space of points A to H over tree nodes 1 to 15; legend: Full SIMD, Partial SIMD]

Circles in blue / Total circles = 64 / 74 = 0.86

slide-55
SLIDE 55

SIMD utilization – Block size

[Figure: SIMD utilization (0.2 to 1) versus block size (4 to 400000) for Barnes-Hut, Point Correlation, Nearest Neighbor, Vantage Point and Photon Mapping]

slide-56
SLIDE 56

Ideal utilization

[Figure: SIMD utilization (0.2 to 1) versus block size (4 to 400000) for Barnes-Hut, Point Correlation, Nearest Neighbor, Vantage Point and Photon Mapping, with the ideal utilization marked]

Block size equal to total points yields ideal SIMD utilization

slide-57
SLIDE 57

Use max block! Problem solved?

[Figure: SIMD utilization (0.2 to 1) versus block size (4 to 400000) for Barnes-Hut, Point Correlation, Nearest Neighbor, Vantage Point and Photon Mapping]

slide-59
SLIDE 59

Large block has poor locality

[Figure: SIMD utilization (0.2 to 1) versus block size (4 to 400000) for Barnes-Hut, Point Correlation, Nearest Neighbor, Vantage Point and Photon Mapping]

Need schedule with good utilization and good locality

slide-60
SLIDE 60

Outline

  • Example & Abstract Model
  • Point Blocking to Enable SIMD
  • Traversal Splicing to Enhance Utilization
  • Automatic Transformation
  • Evaluation and Conclusion

slide-61
SLIDE 61

Traversal splicing [OOPSLA 2012]

[Figure: iteration space of points A to H over tree nodes 1 to 15]

slide-68
SLIDE 68

Traversal splicing [OOPSLA 2012]

[Figure: iteration space of points A to H over tree nodes 1 to 15, with splice nodes marked]

  1. Designate splice nodes
  2. Traverse up to splice node
  3. Resume at next node
  4. Repeat 2-3 until finished
slide-69
SLIDE 69

Can change order of points

[Figure: iteration space of points A to H over tree nodes 1 to 15, with splice nodes marked]

  1. Designate splice nodes
  2. Traverse up to splice node
  3. Resume at next node
  4. Repeat 2-3 until finished

We can change the order of paused points, but how?

slide-70
SLIDE 70

Dynamic sorting

[Figure: iteration space of points A to H over tree nodes 1 to 15, with splice nodes marked]

  1. Designate splice nodes
  2. Traverse up to splice node
  3. Resume at next node
  4. Repeat 2-3 until finished

Insight: points which reach the same nodes are likely to have similar traversals in the future
Dynamic sorting on traversal history

slide-76
SLIDE 76

Dynamic sorting

[Figure: iteration space of points A to H over tree nodes 1 to 15, with splice nodes marked]

Point orders shown: A B C D E F G H (initial), then A C E F H B D G, then E B D G A C F H

  1. Designate splice nodes
  2. Traverse up to splice node
  3. Reorder points at splice node
  4. Resume at next node
  5. Repeat 2-4 until finished
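The reordering in step 3 can be sketched as a stable sort keyed on traversal history. The slides do not spell out the sort key, so this sketch assumes a hypothetical per-point bitmask in which bit i records whether the point reached splice node i; `PausedPoint` and `reorderAtSpliceNode` are illustrative names, not part of SIMTree.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct PausedPoint {
    char id;                // label such as 'A'..'H'
    std::uint32_t history;  // hypothetical: bit i set if the point reached splice node i
};

// Record which points reached splice node `spliceIndex`, then group points
// with the same history together while keeping their relative order.
void reorderAtSpliceNode(std::vector<PausedPoint> &points,
                         const std::vector<bool> &reached,
                         unsigned spliceIndex) {
    assert(points.size() == reached.size());
    assert(spliceIndex < 32);
    for (std::size_t i = 0; i < points.size(); i++) {
        if (reached[i]) points[i].history |= (1u << spliceIndex);
    }
    std::stable_sort(points.begin(), points.end(),
                     [](const PausedPoint &a, const PausedPoint &b) {
                         return a.history > b.history;  // points that reached the node go first
                     });
}
```

As a usage example, if points A, C, E, F and H are assumed to reach the first splice node and B, D and G are not, sorting the initial order A B C D E F G H yields A C E F H B D G. The reached pattern here is chosen for illustration so that the result matches the order shown on the slide.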
slide-78
SLIDE 78

Dynamic sorting enhances utilization

[Figure: iteration space after dynamic sorting, with point orders A C E F H B D G and E B D G A C F H; legend: Full SIMD, Partial SIMD]

Circles in blue / Total circles = 48 / 74 = 0.65

slide-80
SLIDE 80

SIMD utilization – splice depth

[Figure: SIMD utilization (0.2 to 1) versus block size (4 to 400000) for Nearest Neighbor, one curve per splice depth: N/A, 2, 4, 6, 8, 10. Annotated points: block size 512 with splice depth 10; block size 524288]

slide-82
SLIDE 82

SIMD utilization

[Figure: SIMD utilization (0.1 to 1) for Baseline, A Priori Sort, Dynamic Sort and Ideal]

Dynamic sorting can automatically extract almost the maximum amount of SIMD utilization
slide-83
SLIDE 83

Outline

  • Example & Abstract Model
  • Point Blocking to Enable SIMD
  • Traversal Splicing to Enhance Utilization
  • Automatic Transformation
  • Evaluation and Conclusion

slide-84
SLIDE 84

Automatic transformation

  • Point blocking: Jo and Kulkarni [OOPSLA 2011]
  • Traversal splicing: Jo and Kulkarni [OOPSLA 2012]

slide-85
SLIDE 85

Automatic transformation

  • Our key addition for SIMD: layout transformation from AoS (array of structures) to SoA (structure of arrays)
  • + Allows vector load/stores
  • + Packed data has better spatial locality
  • - More overhead in moving data

AoS (array of structures): x1 y1 z1 x2 y2 z2 x3 y3 z3 x4 y4 z4
SoA (structure of arrays): x1 x2 x3 x4 y1 y2 y3 y4 z1 z2 z3 z4

slide-86
SLIDE 86

AoS to SoA layout

  • Whole program AoS to SoA layout transformation difficult to automate with aliasing
  • Limit scope to traversal code only
  • Copy in to SoA before traversal
  • Copy out to AoS after traversal
  • Inter-procedural, flow-insensitive analysis
  • Determine which point fields should be SoA
  • Conservatively ensure correctness
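The copy-in and copy-out steps can be sketched as follows. This is a minimal illustration assuming a point with three float fields `x`, `y`, `z`; the names `PointBlock`, `copyIn`, `copyOut` and `translateX` are hypothetical and do not come from SIMTree.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Point { float x, y, z; };   // AoS element: fields of one point are adjacent

struct PointBlock {                // SoA: one contiguous array per field
    std::vector<float> x, y, z;
    std::size_t size() const { return x.size(); }
};

// Copy in to SoA before the traversal.
PointBlock copyIn(const std::vector<Point> &points) {
    PointBlock b;
    b.x.reserve(points.size());
    b.y.reserve(points.size());
    b.z.reserve(points.size());
    for (const Point &p : points) {
        b.x.push_back(p.x);
        b.y.push_back(p.y);
        b.z.push_back(p.z);
    }
    return b;
}

// Copy out to AoS after the traversal.
void copyOut(const PointBlock &b, std::vector<Point> &points) {
    assert(b.size() == points.size());
    for (std::size_t i = 0; i < points.size(); i++) {
        points[i].x = b.x[i];
        points[i].y = b.y[i];
        points[i].z = b.z[i];
    }
}

// Example work on the SoA layout: consecutive x values are contiguous in
// memory, so a compiler or intrinsics can use vector loads and stores here.
void translateX(PointBlock &b, float dx) {
    for (std::size_t i = 0; i < b.size(); i++) b.x[i] += dx;
}
```

The two copies are the data-movement overhead listed on the previous slide; limiting the transformation to the traversal keeps the rest of the program on its original AoS layout.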

slide-90
SLIDE 90

AoS to SoA layout

struct Point { float f1, f2, f3; };
struct Node { Node *left, *right; Point *point; };

void recurse(Point *p, Node *n) {
    if (truncate(p, n)) return;
    if (n->isLeaf()) {
        update(p, n);
    } else {
        recurse(p, n->left);
        recurse(p, n->right);
    }
}

bool truncate(Point *p, Node *n) {
    return p->f1 == n->point->f1;
}

void update(Point *p, Node *n) {
    p->f2 += n->point->f3;
}

slide-96
SLIDE 96

Ensuring correctness

struct Point { float f1, f2, f3; };
struct Node { Node *left, *right; Point *point; };

bool truncate(Point *p, Node *n) {
    return p->f1 == n->point->f1;
}

void update(Point *p, Node *n) {
    p->f2 += n->point->f3;
}

        Point-access      Non-point-access
        Read    Write     Read    Write
  f1    ✓                 ✓
  f2    ✓       ✓
  f3                      ✓

slide-98
SLIDE 98

Transforming SoA fields

struct Point { float f1, f2, f3; };
struct Node { Node *left, *right; Point *point; };

bool truncate(Block *block, int bi, Node *n) {
    return block->f1[bi] == n->point->f1;
}

void update(Block *block, int bi, Node *n) {
    block->f2[bi] += n->point->f3;
}

        Point-access      Non-point-access
        Read    Write     Read    Write
  f1    ✓                 ✓
  f2    ✓       ✓
  f3                      ✓

slide-99
SLIDE 99

Correctness violation example

struct Point { float f1, f2, f3; };
struct Node { Node *left, *right; Point *point; };

bool truncate(Block *block, int bi, Node *n) {
    return block->f1[bi] == n->point->f1;
}

void update(Block *block, int bi, Node *n) {
    block->f2[bi] += n->point->f2;
}

        Point-access      Non-point-access
        Read    Write     Read    Write
  f1    ✓                 ✓
  f2    ✓       ✓         ✓
  f3

slide-100
SLIDE 100

Ensuring correctness

struct Point { float f1, f2, f3; };
struct Node { Node *left, *right; Point *point; };

bool truncate(Point *p, Node *n) {
    return p->f1 == n->point->f1;
}

void update(Point *p, Node *n) {
    p->f2 += n->point->f3;
    p->f3 = 1;
}

Access table (columns: Point-access Read, Point-access Write, Non-point-access Read, Non-point-access Write): f1 ✓ ✓; f2 ✓ ✓ ✓; f3 (no marks)

Sound analysis conservatively proves SoA transformation correct. Suffices to transform all of our benchmarks.
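The access table can be turned into a simple per-field check. The slides do not state the exact rule, so the sketch below assumes one plausible conservative criterion: a field is moved to SoA only if it is accessed through the traversing point, and no write through one kind of access can be observed through the other kind. `FieldAccess` and `safeForSoA` are hypothetical names, not SIMTree's API.

```cpp
#include <cassert>

// Access summary for one Point field, mirroring the four table columns.
struct FieldAccess {
    bool pointRead;      // read through the traversing point, e.g. p->f1
    bool pointWrite;     // written through the traversing point, e.g. p->f2 += ...
    bool nonPointRead;   // read through another path, e.g. n->point->f3
    bool nonPointWrite;  // written through another path
};

// Assumed conservative rule: the SoA copy and the original AoS data may diverge
// during the traversal, so a field written on one side must not be used on the other.
bool safeForSoA(const FieldAccess &f) {
    bool pointAny = f.pointRead || f.pointWrite;
    bool nonPointAny = f.nonPointRead || f.nonPointWrite;
    bool conflict = (f.pointWrite && nonPointAny) || (f.nonPointWrite && pointAny);
    return pointAny && !conflict;
}
```

Under this assumed rule, f1 (read on both sides) and f2 (read and written only through the point) from the original example qualify, f3 (never accessed through the point) is left alone, and the violation example, where f2 is written through the point and also read through `n->point`, is rejected.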

slide-101
SLIDE 101

SIMTree

  • Implementation of analysis and transformation in a source-to-source C++ compiler
  • Based on ROSE compiler infrastructure
  • Transforms code to apply point blocking, traversal splicing, and SoA layout
  • Does not perform the vectorization itself
  • https://engineering.purdue.edu/plcl/simtree/

slide-102
SLIDE 102

Outline

  • Example & Abstract Model
  • Point Blocking to Enable SIMD
  • Traversal Splicing to Enhance Utilization
  • Automatic Transformation
  • Evaluation and Conclusion

slide-103
SLIDE 103

Evaluation

  • Five benchmarks
  • Barnes-Hut, Point Correlation, Nearest Neighbor, Vantage Point, Photon Mapping
  • Real and random inputs form 17 benchmark/input pairs
  • Two machines
  • Intel Xeon E5-4650
  • AMD Opteron 6282
  • Automatic transformation with SIMTree
  • Manual vectorization of transformed code with 4-way SIMD intrinsics for best performance
  • Auto vectorization of transformed code with icc gets 84% of best performance

slide-105
SLIDE 105

Speedup on Xeon

[Figure: speedup (1 to 7) for Packet SIMD [CGF 2001], Block [OOPSLA 2011], Block+SIMD, Block+Splice [OOPSLA 2012] and Block+SIMD+Splice]

Geometric means:

            PacketSIMD   Block   Block+SIMD   Block+Splice   Block+SIMD+Splice
  Xeon      0.81         1.13    1.19         1.92           2.69

slide-106
SLIDE 106

Dynamic sorting makes SIMD profitable

[Figure: speedup (1 to 7) for Packet SIMD [CGF 2001], Block [OOPSLA 2011], Block+SIMD, Block+Splice [OOPSLA 2012] and Block+SIMD+Splice]

Geometric means:

            PacketSIMD   Block   Block+SIMD   Block+Splice   Block+SIMD+Splice
  Xeon      0.81         1.13    1.19         1.92           2.69
  Opteron   0.83         1.15    1.27         1.78           2.86

slide-108
SLIDE 108

Instruction counts: Opteron

[Figure: instruction ratio (0.0 to 2.5) for Block, BlockSoA, BlockSoA+SIMD, Block+Splice, BlockSoA+Splice and BlockSoA+SIMD+Splice]

  Configuration            Instruction ratio
  Block                    1.27
  BlockSoA                 1.62
  BlockSoA+SIMD            1.20
  Block+Splice             1.08
  BlockSoA+Splice          1.38
  BlockSoA+SIMD+Splice     0.64

slide-109
SLIDE 109

Cycles per instruction: Opteron

[Figure: cycles per instruction (0.5 to 5) for Base, Block, BlockSoA, Block+Splice and BlockSoA+Splice]

  Configuration       Cycles per instruction
  Base                2.24
  Block               1.56
  BlockSoA            1.35
  Block+Splice        1.17
  BlockSoA+Splice     1.04

slide-111
SLIDE 111

Conclusion

  • Show how tree traversal codes can be systematically transformed to
  • Expose SIMD opportunities
  • Enhance utilization
  • Propose a novel layout transformation for efficient vectorization of tree codes
  • Present a framework for automatically restructuring traversals and data layouts to enable vectorization

SIMTree is open source! https://engineering.purdue.edu/plcl/simtree/
