AUTOMATIC VECTORIZATION OF TREE TRAVERSALS
Youngjoon Jo, Michael Goldfarb and Milind Kulkarni
PACT, Edinburgh, U.K.
September 11th, 2013
Commodity processors and SIMD
- Commodity processors support SIMD (Single Instruction Multiple Data) instructions
  - MMX (1996), SSE (1999), SSE2 (2001), SSE3 (2004), SSE4 (2006), AVX (2011), AVX2 (2013)
- SIMD width getting wider
  - AVX is 256-bit
  - Upcoming AVX-512 to be 512-bit (2015)
- Using SIMD is an excellent way to improve performance
Youngjoon Jo
2
SIMD works great for regular loops
    // scalar loop
    for (int i = 0; i < 4; i++) {
        c[i] = a[i] + b[i];
    }
    // the same four additions with SSE intrinsics
    __m128 vec_a = _mm_load_ps(a);
    __m128 vec_b = _mm_load_ps(b);
    __m128 vec_c = _mm_add_ps(vec_a, vec_b);
    _mm_store_ps(c, vec_c);
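A minimal, self-contained sketch of the intrinsics version follows. The helper name `add4` and the `SKETCH_HAVE_SSE` macro are illustrative and not from the talk. The sketch uses unaligned loads and stores so callers need not guarantee the 16-byte alignment that `_mm_load_ps` and `_mm_store_ps` require, and it assumes a scalar fallback is acceptable on targets without SSE.

```cpp
#include <cassert>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>  // SSE: __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps
#define SKETCH_HAVE_SSE 1
#endif

// Adds four floats element-wise: c[i] = a[i] + b[i] for i in 0..3.
// Hypothetical helper for illustration only.
void add4(const float *a, const float *b, float *c) {
#ifdef SKETCH_HAVE_SSE
    __m128 vec_a = _mm_loadu_ps(a);            // load 4 floats, no alignment requirement
    __m128 vec_b = _mm_loadu_ps(b);
    __m128 vec_c = _mm_add_ps(vec_a, vec_b);   // one instruction, four additions
    _mm_storeu_ps(c, vec_c);
#else
    for (int i = 0; i < 4; i++) {              // scalar fallback for non-SSE targets
        c[i] = a[i] + b[i];
    }
#endif
}
```

With `a = {1, 2, 3, 4}` and `b = {10, 20, 30, 40}`, a call to `add4(a, b, c)` fills `c` with the four element-wise sums.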
But not so well on irregular codes
    int main() {
        Ray *rays[N] = /* rays to trace */;
        Node *root = /* root of tree */;
        for (int i = 0; i < N; i++) {
            recurse(rays[i], root);
        }
    }
    void recurse(Ray *r, Node *n) {
        if (truncate(r, n)) return;
        if (n->isLeaf()) {
            update(r, n);
        } else {
            recurse(r, n->left);
            recurse(r, n->right);
        }
    }
Automatic vectorization desired
Automatic vectorization techniques for irregular codes highly desired
Automatic vectorization techniques for tree traversals highly desired
Tree codes are important
Tree vectorization challenges
- Nontrivial to find vectorizable computation
- Difficult to keep vectorizable computation together
Previous tree vectorization work
- Nontrivial to find vectorizable computation
  - Manually transform code to packetize traversals
  - Process multiple points in packet simultaneously
  - Wald et al. [Computer Graphics Forum 2001]
- Difficult to keep vectorizable computation together
  - Look to alternative sources of vectorization
  - Pixar [RT 2006], Dammertz et al. [EGSR 2008], Kim et al. [SIGMOD 2010], Chhugani et al. [SC 2012]
"In situations like physical simulation, collision detection or raytracing in scenes, where rays bounce into multiple directions (spherical or bumpmapped surfaces), coherent ray packets break down very quickly to single rays or do not exist at all. In the above mentioned tasks, packet oriented SIMD computations is much less useful." Havel and Herout [IEEE Transactions on Visualization and Computer Graphics 2010]
Our previous tree locality work
- Point blocking: Jo and Kulkarni [OOPSLA 2011]
- Traversal splicing: Jo and Kulkarni [OOPSLA 2012]
Our solution
- Nontrivial to find vectorizable computation
  - Previous work: manually transform code to packetize traversals
  - Ours: automatically packetize traversals with point blocking and a novel layout transformation
- Difficult to keep vectorizable computation together
  - Previous work: look to alternative sources of vectorization
  - Ours: exploit dynamic sorting of traversal splicing to dramatically enhance utilization
Contributions
- Show how tree traversal codes can be systematically transformed to
  - Expose SIMD opportunities
  - Enhance utilization
- Propose a novel layout transformation for efficient vectorization of tree codes
- Present a framework for automatically restructuring traversals and data layouts to enable vectorization
Spoiler alert! SIMTree can deliver speedups of up to 6.59, and 2.78 on average
Outline
- Example & Abstract Model
- Point Blocking to Enable SIMD
- Traversal Splicing to Enhance Utilization
- Automatic Transformation
- Evaluation and Conclusion
Tree traversals
    int main() {
        Point *points[N] = /* entities to traverse tree */;
        Node *root = /* root of tree */;
        for (int i = 0; i < N; i++) {
            recurse(points[i], root);
        }
    }
    void recurse(Point *p, Node *n) {
        if (truncate(p, n)) return;
        if (n->isLeaf()) {
            update(p, n);
        } else {
            recurse(p, n->left);
            recurse(p, n->right);
        }
    }
An abstract model
    void main() {
        foreach (Point p : points) {
            foreach (Node n : p.oracleNodes()) {
                update(p, n);
            }
        }
    }
Iteration space of traversal
[Figure: the loop nest of the abstract model drawn as a two-dimensional iteration space, with points A to H on one axis and tree nodes in traversal order (1 2 4 8 9 5 10 11 3 6 12 13 7 14 15) on the other, next to the 15-node tree]
How to vectorize?
Point blocking [OOPSLA 2011]
Point blocked code
    void recurse(Block *block, Node *n) {
        Block *nextBlock = /* next level block */;
        // Loop over points in block
        for (int i = 0; i < block->size; i++) {
            Point *p = block->p[i];
            // Function body (return becomes continue)
            if (truncate(p, n)) continue;
            if (n->isLeaf()) {
                update(p, n);
            } else {
                nextBlock->add(p);
            }
        }
        // Next block recurses children
        if (nextBlock->size > 0) {
            recurse(nextBlock, n->left);
            recurse(nextBlock, n->right);
        }
    }
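The point-blocked recursion can be tried out with a small runnable sketch. Everything concrete here is an assumption made for illustration: the tree is a balanced binary tree over integer ranges, a point is a query range, `truncate` prunes subtrees that do not overlap the query, and `update` counts leaves reached. The names `build` and `recurseBlocked` are hypothetical. Only the shape of the two recursions follows the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

struct Node {
    int lo, hi;                        // subtree covers the half-open range [lo, hi)
    std::unique_ptr<Node> left, right;
    bool isLeaf() const { return !left; }
};

struct Point {
    int lo, hi;                        // query range [lo, hi)
    int hits;                          // number of leaves this point reached
};

// Builds a balanced tree whose leaves are the unit ranges [k, k+1).
std::unique_ptr<Node> build(int lo, int hi) {
    std::unique_ptr<Node> n(new Node());
    n->lo = lo;
    n->hi = hi;
    if (hi - lo > 1) {
        int mid = lo + (hi - lo) / 2;
        n->left = build(lo, mid);
        n->right = build(mid, hi);
    }
    return n;
}

// Prune when the query does not overlap the subtree.
bool truncate(const Point &p, const Node &n) { return p.hi <= n.lo || n.hi <= p.lo; }
void update(Point &p, const Node &) { p.hits++; }

// Original per-point recursion.
void recurse(Point &p, const Node &n) {
    if (truncate(p, n)) return;
    if (n.isLeaf()) {
        update(p, n);
    } else {
        recurse(p, *n.left);
        recurse(p, *n.right);
    }
}

using Block = std::vector<Point *>;

// Point-blocked recursion: visit each node once per block of points.
void recurseBlocked(const Block &block, const Node &n) {
    Block nextBlock;                   // points that continue below this node
    for (Point *p : block) {           // loop over points in block
        if (truncate(*p, n)) continue;
        if (n.isLeaf()) {
            update(*p, n);
        } else {
            nextBlock.push_back(p);
        }
    }
    if (!nextBlock.empty()) {          // next block recurses children
        recurseBlocked(nextBlock, *n.left);
        recurseBlocked(nextBlock, *n.right);
    }
}
```

Running both versions over the same points should leave every point with the same `hits` value. Only the order in which (point, node) pairs are visited changes, and that reordering is what groups the work for several points at one node.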
Analogous to packet SIMD
Breaks down when points diverge
Packet SIMD has poor utilization
[Figure: iteration space with each step marked as full SIMD or partial SIMD]
SIMD utilization
SIMD utilization = Work in full SIMD / Total work = Circles in blue / Total circles = 24 / 74 = 0.32
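The ratio above can be sketched in code. The sketch assumes that the active points of a block are compacted at each node, so that they fill as many complete vectors as possible and only the remainder runs as partial SIMD. The function name `simdUtilization` and its inputs are hypothetical. The 24/74 figure comes from the slide's diagram and is not reproduced here.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// activePerNode[k] is the number of points of one block that are still active
// at the k-th visited node; width is the SIMD width in lanes.
// Returns (work done in full vectors) / (total work).
double simdUtilization(const std::vector<int> &activePerNode, int width) {
    long full = 0;
    long total = 0;
    for (int active : activePerNode) {
        total += active;
        full += (active / width) * width;  // lanes covered by complete vectors
    }
    return total == 0 ? 0.0 : static_cast<double>(full) / static_cast<double>(total);
}
```

For example, with a SIMD width of 4 and active counts 4, 4, 2, 1 the full-vector work is 8 out of 11 steps. Under this model a larger block leaves proportionally fewer leftover lanes per node, which is the effect the following slides argue for.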
Use larger block size
Use block size larger than SIMD width and compact points!
Better utilization with larger block size
Circles in blue / Total circles = 64 / 74 = 0.86
SIMD utilization – Block size
[Chart: SIMD utilization (axis 0.2 to 1) versus block size (ticks at 4, 40, 400, 4000, 40000, 400000) for Barnes-Hut, Point Correlation, Nearest Neighbor, Vantage Point and Photon Mapping]
Ideal utilization
Block size equal to total points yields ideal SIMD utilization
Use max block! Problem solved?
Large block has poor locality
Need schedule with good utilization and good locality
Traversal splicing [OOPSLA 2012]
- 1. Designate splice nodes
- 2. Traverse up to splice node
- 3. Resume at next node
- 4. Repeat 2-3 until finished
Can change order of points
We can change the order of paused points, but how?
Dynamic sorting
Insight: points which reach same nodes are likely to have similar traversals in future
Dynamic sorting on traversal history
- 1. Designate splice nodes
- 2. Traverse up to splice node
- 3. Reorder points at splice node
- 4. Resume at next node
- 5. Repeat 2-4 until finished
Point order in the example: A B C D E F G H initially, A C E F H B D G after the first reordering, E B D G A C F H after the second
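The reordering step can be sketched as a stable partition of the paused points on one bit of traversal history. This is only an illustration: the `reachedSplice` flag, the struct and the function names are hypothetical, and the actual sort key used by traversal splicing may be richer than a single bit. The flags in the usage example are chosen so that the result matches the first reordering shown on the slide.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

struct PausedPoint {
    char name;            // label of the point, A to H in the slides
    bool reachedSplice;   // hypothetical history bit recorded during the traversal
};

// Moves points that share the history bit next to each other while keeping
// their relative order, so later blocks contain points likely to behave alike.
void reorderAtSpliceNode(std::vector<PausedPoint> &points) {
    std::stable_partition(points.begin(), points.end(),
                          [](const PausedPoint &p) { return p.reachedSplice; });
}

// Concatenates the point labels in their current order.
std::string order(const std::vector<PausedPoint> &points) {
    std::string s;
    for (const PausedPoint &p : points) s.push_back(p.name);
    return s;
}
```

Starting from A B C D E F G H with the bit set for A, C, E, F and H, `reorderAtSpliceNode` yields A C E F H B D G.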
Dynamic sorting enhances utilization
Circles in blue / Total circles = 48 / 74 = 0.65
SIMD utilization – splice depth
[Chart: SIMD utilization (axis 0.2 to 1) versus block size (ticks at 4, 40, 400, 4000, 40000, 400000) for Nearest Neighbor, one series per splice depth: N/A, 2, 4, 6, 8, 10. Annotated points: block size 512 with splice depth 10, and block size 524288]
SIMD utilization
[Chart: utilization (axis 0.1 to 1) for Baseline, A Priori Sort, Dynamic Sort and Ideal]
Dynamic sorting can automatically extract almost the maximum amount of SIMD utilization
Automatic transformation
- Point blocking: Jo and Kulkarni [OOPSLA 2011]
- Traversal splicing: Jo and Kulkarni [OOPSLA 2012]
- Our key addition for SIMD: layout transformation from AoS (array of structures) to SoA (structure of arrays)
  - + Allows vector load/stores
  - + Packed data has better spatial locality
  - - More overhead in moving data
AoS (array of structures): x1 y1 z1 x2 y2 z2 x3 y3 z3 x4 y4 z4
SoA (structure of arrays): x1 x2 x3 x4 y1 y2 y3 y4 z1 z2 z3 z4
AoS to SoA layout
- Whole program AoS to SoA layout transformation difficult to automate with aliasing
- Limit scope to traversal code only
  - Copy in to SoA before traversal
  - Copy out to AoS after traversal
- Inter-procedural, flow-insensitive analysis
  - Determine which point fields should be SoA
  - Conservatively ensure correctness
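The copy-in and copy-out scheme can be sketched as follows. The `Point` fields x, y and z, the `Block` struct and the helper names are assumptions chosen for illustration and are not SIMTree's generated code. The sketch also converts every field, whereas the analysis described above selects which fields become SoA.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// AoS: one struct per point.
struct Point {
    float x, y, z;
};

// SoA: one contiguous array per field.
struct Block {
    std::vector<float> x, y, z;
};

// Copy in to SoA before the traversal.
Block copyIn(const std::vector<Point> &points) {
    Block b;
    for (const Point &p : points) {
        b.x.push_back(p.x);
        b.y.push_back(p.y);
        b.z.push_back(p.z);
    }
    return b;
}

// Copy out to AoS after the traversal.
void copyOut(const Block &b, std::vector<Point> &points) {
    for (std::size_t i = 0; i < points.size(); ++i) {
        points[i].x = b.x[i];
        points[i].y = b.y[i];
        points[i].z = b.z[i];
    }
}

// Stand-in for traversal work: the x values are contiguous, so this loop
// maps directly onto vector loads and stores.
void translateX(Block &b, float delta) {
    for (std::size_t i = 0; i < b.x.size(); ++i) {
        b.x[i] += delta;
    }
}
```

The two copies are the data-movement overhead listed as a cost on the earlier slide. Restricting the transformation to the traversal keeps the rest of the program on the original AoS layout.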
    struct Point { float f1, f2, f3; };
    struct Node { Node *left, *right; Point *point; };
    void recurse(Point *p, Node *n) {
        if (truncate(p, n)) return;
        if (n->isLeaf()) {
            update(p, n);
        } else {
            recurse(p, n->left);
            recurse(p, n->right);
        }
    }
    bool truncate(Point *p, Node *n) {
        return p->f1 == n->point->f1;
    }
    void update(Point *p, Node *n) {
        p->f2 += n->point->f3;
    }
Ensuring correctness
Field   Point-access Read   Point-access Write   Non-point-access Read   Non-point-access Write
f1      ✓                   -                    ✓                       -
f2      ✓                   ✓                    -                       -
f3      -                   -                    ✓                       -
Transforming SoA fields
    bool truncate(Block *block, int bi, Node *n) {
        return block->f1[bi] == n->point->f1;
    }
    void update(Block *block, int bi, Node *n) {
        block->f2[bi] += n->point->f3;
    }
Correctness violation example
    bool truncate(Block *block, int bi, Node *n) {
        return block->f1[bi] == n->point->f1;
    }
    void update(Block *block, int bi, Node *n) {
        block->f2[bi] += n->point->f2;
    }
Field   Point-access Read   Point-access Write   Non-point-access Read   Non-point-access Write
f1      ✓                   -                    ✓                       -
f2      ✓                   ✓                    ✓                       -
f3      -                   -                    -                       -
f2 is written through the block but read through n->point, so the SoA copy and the original AoS field can go out of sync.
Ensuring correctness
    bool truncate(Point *p, Node *n) {
        return p->f1 == n->point->f1;
    }
    void update(Point *p, Node *n) {
        p->f2 += n->point->f3;
        p->f3 = 1;
    }
Sound analysis conservatively proves SoA transformation correct. Suffices to transform all of our benchmarks.
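One way to read the access tables is as a per-field conflict check, sketched below. The rule encoded here is an assumption inferred from the two examples in the slides: a field is treated as unsafe when one side writes it and the other side accesses it at all. It is not confirmed to be the exact analysis SIMTree implements. The struct and function names are hypothetical.

```cpp
#include <cassert>

// One row of the access table for a single Point field.
struct FieldAccess {
    bool pointRead;      // read through the traversing point (p->f)
    bool pointWrite;     // written through the traversing point
    bool nonPointRead;   // read through another path (n->point->f)
    bool nonPointWrite;  // written through another path
};

// Assumed rule: a write on one side combined with any access on the other
// side means the SoA copy and the AoS original could disagree.
bool safeForSoA(const FieldAccess &f) {
    bool pointAny = f.pointRead || f.pointWrite;
    bool nonPointAny = f.nonPointRead || f.nonPointWrite;
    bool conflict = (f.pointWrite && nonPointAny) || (f.nonPointWrite && pointAny);
    return !conflict;
}
```

Under this rule the original example passes for f1, f2 and f3, while the violation example fails on f2 (written through the point, read through `n->point`). The variant with `p->f3 = 1` fails on f3 for the same reason.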
SIMTree
- Implementation of analysis and transformation in a source-to-source C++ compiler
  - Based on ROSE compiler infrastructure
- Transforms code to apply point blocking, traversal splicing, and SoA layout
  - Does not perform the vectorization itself
- https://engineering.purdue.edu/plcl/simtree/
Evaluation
- Five benchmarks
  - Barnes-Hut, Point Correlation, Nearest Neighbor, Vantage Point, Photon Mapping
  - Real and random inputs form 17 benchmark/input pairs
- Two machines
  - Intel Xeon E5-4650
  - AMD Opteron 6282
- Automatic transformation with SIMTree
  - Manual vectorization of transformed code with 4-way SIMD intrinsics for best performance
  - Auto vectorization of transformed code with icc gets 84% of best performance
Speedup on Xeon
[Chart: speedup (axis 1 to 7) for Packet SIMD [CGF 2001], Block [OOPSLA 2011], Block+SIMD, Block+Splice [OOPSLA 2012] and Block+SIMD+Splice]
Dynamic sorting makes SIMD profitable
Geometric means:
          PacketSIMD   Block   Block+SIMD   Block+Splice   Block+SIMD+Splice
Xeon      0.81         1.13    1.19         1.92           2.69
Opteron   0.83         1.15    1.27         1.78           2.86
Instruction counts: Opteron
[Chart: instruction ratio (axis 0.0 to 2.5) per configuration]
Configuration           Instruction ratio
Block                   1.27
BlockSoA                1.62
BlockSoA+SIMD           1.20
Block+Splice            1.08
BlockSoA+Splice         1.38
BlockSoA+SIMD+Splice    0.64
Cycles per instruction: Opteron
[Chart: cycles per instruction (axis 0.5 to 5) per configuration]
Configuration      Cycles per instruction
Base               2.24
Block              1.56
BlockSoA           1.35
Block+Splice       1.17
BlockSoA+Splice    1.04
Conclusion
- Show how tree traversal codes can be systematically transformed to
  - Expose SIMD opportunities
  - Enhance utilization
- Propose a novel layout transformation for efficient vectorization of tree codes
- Present a framework for automatically restructuring traversals and data layouts to enable vectorization
SIMTree is open sou