Toward a Core Design to Distribute an Execution on a Manycore Processor



SLIDE 1

PaCT’2015, Petrozavodsk, August 31 - September 4, 2015

Toward a Core Design to Distribute an Execution on a Manycore Processor.

Bernard Goossens, David Parello, Katarzyna Porada, Djallal Rahmoune
Université de Perpignan Via Domitia, DALI-LIRMM

SLIDE 2

Summary.

1. Parallelization of a C Code.
2. Automatic Hardware Parallelization.
3. Determinism.
4. Conclusion.

SLIDE 3

Parallelization of a C Code.

SLIDE 4

Example : a sum reduction.

long sum(long t[], unsigned int n) {
    if (n == 1) return t[0];
    if (n == 2) return t[0] + t[1];
    return sum(t, n/2) + sum(&(t[n/2]), n - n/2);
}

This code looks sequential. Let us parallelize it.
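Aside (not on the slides): a minimal driver sketch for checking the sequential reduction above; the test array and its values are arbitrary choices.

#include <stdio.h>

long sum(long t[], unsigned int n);   /* the reduction above */

int main(void) {
    long t[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    printf("%ld\n", sum(t, 8));        /* prints 36 */
    return 0;
}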

SLIDE 5

What we do today : e.g. using pthreads.

typedef struct { unsigned long *p; unsigned long i; } ST;

void *sum(void *st) {
    ST str1, str2;
    unsigned long s, s1, s2;
    pthread_t tid1, tid2;
    if (((ST *)st)->i > 2) {
        str1.p = ((ST *)st)->p;
        str1.i = ((ST *)st)->i / 2;
        pthread_create(&tid1, NULL, sum, (void *)&str1);
        str2.p = ((ST *)st)->p + ((ST *)st)->i / 2;
        str2.i = ((ST *)st)->i - ((ST *)st)->i / 2;
        pthread_create(&tid2, NULL, sum, (void *)&str2);
    }
    else if (((ST *)st)->i == 1) { s1 = ((ST *)st)->p[0]; s2 = 0; }
    else { s1 = ((ST *)st)->p[0]; s2 = ((ST *)st)->p[1]; }
    s = s1 + s2;
    pthread_exit((void *)s);
}

SLIDE 6

What we do today : e.g. using pthreads.

typedef struct { unsigned long *p; unsigned long i; } ST;

void *sum(void *st) {
    ST str1, str2;
    unsigned long s, s1, s2;
    pthread_t tid1, tid2;
    if (((ST *)st)->i > 2) {
        str1.p = ((ST *)st)->p;
        str1.i = ((ST *)st)->i / 2;
        pthread_create(&tid1, NULL, sum, (void *)&str1);
        str2.p = ((ST *)st)->p + ((ST *)st)->i / 2;
        str2.i = ((ST *)st)->i - ((ST *)st)->i / 2;
        pthread_create(&tid2, NULL, sum, (void *)&str2);
    }
    else if (((ST *)st)->i == 1) { s1 = ((ST *)st)->p[0]; s2 = 0; }
    else { s1 = ((ST *)st)->p[0]; s2 = ((ST *)st)->p[1]; }
    s = s1 + s2;
    pthread_exit((void *)s);
}

The code is multithreaded.

SLIDE 7

What we do today : e.g. using pthreads.

typedef struct { unsigned long *p; unsigned long i; } ST;

void *sum(void *st) {
    ST str1, str2;
    unsigned long s, s1, s2;
    pthread_t tid1, tid2;
    if (((ST *)st)->i > 2) {
        str1.p = ((ST *)st)->p;
        str1.i = ((ST *)st)->i / 2;
        pthread_create(&tid1, NULL, sum, (void *)&str1);
        str2.p = ((ST *)st)->p + ((ST *)st)->i / 2;
        str2.i = ((ST *)st)->i - ((ST *)st)->i / 2;
        pthread_create(&tid2, NULL, sum, (void *)&str2);
    }
    else if (((ST *)st)->i == 1) { s1 = ((ST *)st)->p[0]; s2 = 0; }
    else { s1 = ((ST *)st)->p[0]; s2 = ((ST *)st)->p[1]; }
    s = s1 + s2;
    pthread_exit((void *)s);
}

The code is multithreaded. Thread executions are non-deterministically ordered.

SLIDE 8

What we do today : e.g. using pthreads.

typedef struct { unsigned long *p; unsigned long i; } ST;

void *sum(void *st) {
    ST str1, str2;
    unsigned long s, s1, s2;
    pthread_t tid1, tid2;
    if (((ST *)st)->i > 2) {
        str1.p = ((ST *)st)->p;
        str1.i = ((ST *)st)->i / 2;
        pthread_create(&tid1, NULL, sum, (void *)&str1);
        str2.p = ((ST *)st)->p + ((ST *)st)->i / 2;
        str2.i = ((ST *)st)->i - ((ST *)st)->i / 2;
        pthread_create(&tid2, NULL, sum, (void *)&str2);
    }
    else if (((ST *)st)->i == 1) { s1 = ((ST *)st)->p[0]; s2 = 0; }
    else { s1 = ((ST *)st)->p[0]; s2 = ((ST *)st)->p[1]; }
    s = s1 + s2;
    pthread_exit((void *)s);
}

The code is multithreaded. Thread executions are non-deterministically ordered. Too little synchronization => the result is not deterministic.
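Aside (not on the slides): a minimal sketch of what too little synchronization means in practice. Two threads update a shared counter without any synchronization, so the printed value varies from run to run; the counter example is chosen here for brevity and is not the paper's code.

#include <pthread.h>
#include <stdio.h>

static long counter = 0;

static void *work(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        counter++;                /* unsynchronized read-modify-write: a data race */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, work, NULL);
    pthread_create(&t2, NULL, work, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("%ld\n", counter);     /* rarely 2000000; the result is not deterministic */
    return 0;
}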

SLIDE 9

Synchronized threads.

typedef struct { unsigned long *p; unsigned long i; } ST;

void *sum(void *st) {
    ST str1, str2;
    unsigned long s, s1, s2;
    pthread_t tid1, tid2;
    if (((ST *)st)->i > 2) {
        str1.p = ((ST *)st)->p;
        str1.i = ((ST *)st)->i / 2;
        pthread_create(&tid1, NULL, sum, (void *)&str1);
        pthread_join(tid1, (void *)&s1);
        str2.p = ((ST *)st)->p + ((ST *)st)->i / 2;
        str2.i = ((ST *)st)->i - ((ST *)st)->i / 2;
        pthread_create(&tid2, NULL, sum, (void *)&str2);
        pthread_join(tid2, (void *)&s2);
    }
    else if (((ST *)st)->i == 1) { s1 = ((ST *)st)->p[0]; s2 = 0; }
    else { s1 = ((ST *)st)->p[0]; s2 = ((ST *)st)->p[1]; }
    s = s1 + s2;
    pthread_exit((void *)s);
}

SLIDE 10

Synchronized threads.

typedef struct { unsigned long *p; unsigned long i; } ST;

void *sum(void *st) {
    ST str1, str2;
    unsigned long s, s1, s2;
    pthread_t tid1, tid2;
    if (((ST *)st)->i > 2) {
        str1.p = ((ST *)st)->p;
        str1.i = ((ST *)st)->i / 2;
        pthread_create(&tid1, NULL, sum, (void *)&str1);
        pthread_join(tid1, (void *)&s1);
        str2.p = ((ST *)st)->p + ((ST *)st)->i / 2;
        str2.i = ((ST *)st)->i - ((ST *)st)->i / 2;
        pthread_create(&tid2, NULL, sum, (void *)&str2);
        pthread_join(tid2, (void *)&s2);
    }
    else if (((ST *)st)->i == 1) { s1 = ((ST *)st)->p[0]; s2 = 0; }
    else { s1 = ((ST *)st)->p[0]; s2 = ((ST *)st)->p[1]; }
    s = s1 + s2;
    pthread_exit((void *)s);
}

Among all the possible run orderings, the synchronization keeps only the good ones (i.e., those that compute the same result as a sequential execution).

SLIDE 11

Synchronized threads.

typedef struct { unsigned long *p; unsigned long i; } ST;

void *sum(void *st) {
    ST str1, str2;
    unsigned long s, s1, s2;
    pthread_t tid1, tid2;
    if (((ST *)st)->i > 2) {
        str1.p = ((ST *)st)->p;
        str1.i = ((ST *)st)->i / 2;
        pthread_create(&tid1, NULL, sum, (void *)&str1);
        pthread_join(tid1, (void *)&s1);
        str2.p = ((ST *)st)->p + ((ST *)st)->i / 2;
        str2.i = ((ST *)st)->i - ((ST *)st)->i / 2;
        pthread_create(&tid2, NULL, sum, (void *)&str2);
        pthread_join(tid2, (void *)&s2);
    }
    else if (((ST *)st)->i == 1) { s1 = ((ST *)st)->p[0]; s2 = 0; }
    else { s1 = ((ST *)st)->p[0]; s2 = ((ST *)st)->p[1]; }
    s = s1 + s2;
    pthread_exit((void *)s);
}

Among all the possible run orderings, the synchronization keeps only the good ones (i.e., those that compute the same result as a sequential execution). Too much synchronization => not parallel enough.

SLIDE 12

Correctly synchronized threads.

typedef struct { unsigned long *p; unsigned long i; } ST;

void *sum(void *st) {
    ST str1, str2;
    unsigned long s, s1, s2;
    pthread_t tid1, tid2;
    if (((ST *)st)->i > 2) {
        str1.p = ((ST *)st)->p;
        str1.i = ((ST *)st)->i / 2;
        pthread_create(&tid1, NULL, sum, (void *)&str1);
        str2.p = ((ST *)st)->p + ((ST *)st)->i / 2;
        str2.i = ((ST *)st)->i - ((ST *)st)->i / 2;
        pthread_create(&tid2, NULL, sum, (void *)&str2);
        pthread_join(tid1, (void *)&s1);
        pthread_join(tid2, (void *)&s2);
    }
    else if (((ST *)st)->i == 1) { s1 = ((ST *)st)->p[0]; s2 = 0; }
    else { s1 = ((ST *)st)->p[0]; s2 = ((ST *)st)->p[1]; }
    s = s1 + s2;
    pthread_exit((void *)s);
}
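Aside (not on the slides): a minimal driver sketch for the threaded sum above, written as a separate file linked with the worker (ST is redeclared here). The root call runs in its own thread so that its result can be collected through pthread_exit/pthread_join; the array values and variable names are assumptions.

#include <pthread.h>
#include <stdio.h>

typedef struct { unsigned long *p; unsigned long i; } ST;
void *sum(void *st);                     /* the worker defined above */

int main(void) {
    unsigned long t[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    ST root = { t, 8 };
    pthread_t tid;
    void *ret;
    pthread_create(&tid, NULL, sum, (void *)&root);
    pthread_join(tid, &ret);             /* the sum comes back as the exit value */
    printf("%lu\n", (unsigned long)ret); /* prints 36 */
    return 0;
}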

SLIDE 13

What we propose to do.

long sum(long t[], unsigned int n) {
    if (n == 1) return t[0];
    if (n == 2) return t[0] + t[1];
    return sum(t, n/2) + sum(&(t[n/2]), n - n/2);
}

SLIDE 14

What we propose to do.

long sum(long t[], unsigned int n) {
    if (n == 1) return t[0];
    if (n == 2) return t[0] + t[1];
    return sum(t, n/2) + sum(&(t[n/2]), n - n/2);
}

This code is usually understood as sequential.

SLIDE 15

What we propose to do.

long sum(long t[], unsigned int n) {
    if (n == 1) return t[0];
    if (n == 2) return t[0] + t[1];
    return sum(t, n/2) + sum(&(t[n/2]), n - n/2);
}

This code is usually understood as sequential. The order of instructions and expressions matches the order of execution.

SLIDE 16

What we propose to do.

long sum(long t[], unsigned int n) {
    if (n == 1) return t[0];
    if (n == 2) return t[0] + t[1];
    return sum(t, n/2) + sum(&(t[n/2]), n - n/2);
}

This code is usually understood as sequential. The order of instructions and expressions matches the order of execution. Left half sum is computed before right half sum.

SLIDE 17

What we propose to do : nothing.

long sum(long t[], unsigned int n) {
    if (n == 1) return t[0];
    if (n == 2) return t[0] + t[1];
    return sum(t, n/2) + sum(&(t[n/2]), n - n/2);
}

SLIDE 18

What we propose to do : nothing.

long sum(long t[], unsigned int n) {
    if (n == 1) return t[0];
    if (n == 2) return t[0] + t[1];
    return sum(t, n/2) + sum(&(t[n/2]), n - n/2);
}

This code can be understood as parallel.

SLIDE 19

What we propose to do : nothing.

long sum(long t[], unsigned int n) {
    if (n == 1) return t[0];
    if (n == 2) return t[0] + t[1];
    return sum(t, n/2) + sum(&(t[n/2]), n - n/2);
}

This code can be understood as parallel. Just change the compiler and the processor hardware.

SLIDE 20

What we propose to do : nothing.

long sum(long t[], unsigned int n) {
    if (n == 1) return t[0];
    if (n == 2) return t[0] + t[1];
    return sum(t, n/2) + sum(&(t[n/2]), n - n/2);
}

This code can be understood as parallel. Just change the compiler and the processor hardware: rename and run in parallel, with the hardware synchronizing each reader with its writer.

SLIDE 21

The compiled sum reduction (actual compiler).

sum:    cmpq $2, %rsi                 ; if (n>2) goto .L1
        ja .L1
        movq (%rdi), %rax             ; rax = t[0]
        jb .L2                        ; if (n<2) goto .L2
        addq 8(%rdi), %rax            ; rax = t[0] + t[1]
.L2:    ret                           ; return (rax)
.L1:    pushq %rsi                    ; save n
        pushq %rdi                    ; save t
        pushq %rbp                    ; save rbp
        subq $8, %rsp                 ; alloc(temp)
        movq %rsi, %rbp               ; rbp = n
        shrq %rsi                     ; n = n/2
        call sum                      ; rax = sum(t, n/2)
        movq %rax, 0(%rsp)            ; temp = t[0] + ... + t[n/2-1]
        leaq (%rdi, %rsi, 8), %rdi    ; t = t + n/2*8 = &(t[n/2])
        subq %rsi, %rbp               ; rbp = n - n/2
        movq %rbp, %rsi               ; n = n - n/2
        call sum                      ; rax = sum(&(t[n/2]), n-n/2)
        addq 0(%rsp), %rax            ; rax = t[0] + ... + t[n/2-1]
                                      ;     + t[n/2] + ... + t[n-1]
        addq $8, %rsp                 ; free(temp)
        popq %rbp                     ; restore rbp
        popq %rdi                     ; restore t
        popq %rsi                     ; restore n
        ret                           ; return (rax)

SLIDE 22

The compiled sum reduction (parallelizing compiler).

sum:    cmpq $2, %rsi                 ; if (n>2) goto .L1
        ja .L1
        movq (%rdi), %rax             ; rax = t[0]
        jb .L2                        ; if (n<2) goto .L2
        addq 8(%rdi), %rax            ; rax = t[0] + t[1]
.L2:    endfork                       ; return (rax)
.L1:                                  ; at fork, rsp, rbp, rdi, rsi and rbx are copied
        subq $8, %rsp                 ; alloc(temp)
        movq %rsi, %rbp               ; rbp = n
        shrq %rsi                     ; n = n/2
        fork sum                      ; rax = sum(t, n/2)
        movq %rax, 0(%rsp)            ; temp = t[0] + ... + t[n/2-1]
        leaq (%rdi, %rsi, 8), %rdi    ; t = t + n/2*8 = &(t[n/2])
        subq %rsi, %rbp               ; rbp = n - n/2
        movq %rbp, %rsi               ; n = n - n/2
        fork sum                      ; rax = sum(&(t[n/2]), n-n/2)
        addq 0(%rsp), %rax            ; rax = t[0] + ... + t[n/2-1]
                                      ;     + t[n/2] + ... + t[n-1]
        addq $8, %rsp                 ; free(temp)
        endfork                       ; return (rax)

SLIDE 23

The parallel run steps.

Fetch the trace as fast as possible, relying on the fork machine instruction.

SLIDE 24

The parallel run steps.

Fetch the trace as fast as possible, relying on the fork machine instruction. Rename destinations and match readers with writers from the trace total order.

SLIDE 25

The parallel run steps.

Fetch the trace as fast as possible, relying on the fork machine instruction. Rename destinations and match readers with writers from the trace total order. Run in the partial order of the reader-to-writer dependencies.

SLIDE 26

The parallel run steps.

Fetch the trace as fast as possible, relying on the fork machine instruction. Rename destinations and match readers with writers from the trace total order. Run in the partial order of the reader-to-writer dependencies. Discard intermediate storage resources when overwritten or freed (dynamic regions : stack, heap).

SLIDE 27

The parallel run steps.

Fetch the trace as fast as possible, relying on the fork machine instruction.
Rename destinations and match readers with writers from the trace total order.
Run in the partial order of the reader-to-writer dependencies.
Discard intermediate storage resources when overwritten or freed (dynamic regions : stack, heap).
Dump final results to physical memory (single writer => trivial coherency).
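Aside (not from the slides): a rough sketch that only counts sections, assuming the run starts with one section and each executed fork opens one more (two forks per recursive step of sum, none for the n <= 2 leaves). For n = 8 this gives seven sections, which matches the sections c0 to c6 on the diagram slides later in the deck.

#include <stdio.h>

/* Number of fork instructions executed by sum(t, n), per the parallelized code. */
static unsigned long forks(unsigned int n) {
    if (n <= 2) return 0;                        /* leaf: no fork */
    return 2 + forks(n / 2) + forks(n - n / 2);  /* one fork per recursive call */
}

int main(void) {
    for (unsigned int n = 2; n <= 1024; n *= 2)
        printf("n = %4u -> %4lu sections\n", n, 1 + forks(n));
    return 0;
}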

SLIDE 28

Automatic Hardware Parallelization.

SLIDE 29

Fetching = Reading IP-addressed code and computing the control flow.

Fork :

SLIDE 30

Fetching = Reading IP-addressed code and computing the control flow.

Fork : 2 sections : current IP continues to callee + new IP resumes after callee return.

SLIDE 31

Fetching = Reading IP-addressed code and computing the control flow.

Fork : 2 sections : current IP continues to callee + new IP resumes after callee return. Registers rsp, rbp, rdi, rsi, rbx are copied from current section to new section.

SLIDE 32

Fetching = Reading IP-addressed code and computing the control flow.

Fork : 2 sections : current IP continues to callee + new IP resumes after callee return. Registers rsp, rbp, rdi, rsi, rbx are copied from current section to new section. 2 copies of the stack : callee stack + resume stack.

SLIDE 33

Fetching = Reading IP-addressed code and computing the control flow.

Fork : 2 sections : current IP continues to callee + new IP resumes after callee return.
Registers rsp, rbp, rdi, rsi, rbx are copied from current section to new section.
2 copies of the stack : callee stack + resume stack.
End of section = endfork.
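Aside (not from the paper): a schematic data-structure sketch of what a fork copies into the new section, based only on the register list and the two-IP description on this slide; the type and field names are invented for illustration.

#include <stdint.h>

typedef struct section {
    uint64_t ip;                        /* where this section fetches next */
    uint64_t rsp, rbp, rdi, rsi, rbx;   /* registers copied at fork */
    struct section *predecessor;        /* previous section in trace order */
} section_t;

/* At "fork callee": the current section keeps fetching into the callee,
   and a new section is created that resumes just after the fork. */
static section_t make_resume_section(section_t *cur, uint64_t resume_ip) {
    section_t s = *cur;     /* copy rsp, rbp, rdi, rsi, rbx */
    s.ip = resume_ip;       /* new IP: the instruction following the fork */
    s.predecessor = cur;    /* link used later for reader-to-writer imports */
    return s;
}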

SLIDE 34

Synchronizing reader with writer.

SLIDE 35

Synchronizing reader with writer.

Reader and writer in the same section : Tomasulo's algorithm (out-of-order execution).

SLIDE 36

Synchronizing reader with writer.

Reader and writer in the same section : Tomasulo's algorithm (out-of-order execution). Reader and writer of a register in different sections : the reader sends an import request to its predecessor section when known; the writer sends the value to the reader when computed.

SLIDE 37

Synchronizing reader with writer.

Reader and writer in the same section : Tomasulo's algorithm (out-of-order execution). Reader and writer of a register in different sections : the reader sends an import request to its predecessor section when known; the writer sends the value to the reader when computed. Reader and writer of a memory location : the reader imports the value from the writer of the same address, following the predecessor links.

SLIDE 38

Synchronizing reader with writer.

Reader and writer in the same section : Tomasulo's algorithm (out-of-order execution).
Reader and writer of a register in different sections : the reader sends an import request to its predecessor section when known; the writer sends the value to the reader when computed.
Reader and writer of a memory location : the reader imports the value from the writer of the same address, following the predecessor links.
Reader and writer of an rsp-based stack location : the reader imports the value from the writer of the same address, following the level predecessor links.
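Aside (not the paper's hardware): a software sketch of the lookup a memory reader performs, walking the predecessor links from its own section until it finds the closest writer of the same address. The per-section store lists, their size, and all names are invented for illustration.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_STORES 64

typedef struct section {
    struct section *predecessor;      /* previous section in trace order */
    uint64_t store_addr[MAX_STORES];  /* addresses written in this section, in order */
    uint64_t store_val[MAX_STORES];
    int nstores;
} section_t;

/* Return the value of the closest earlier store to 'addr'; *found == false
   means no section wrote it and the value must come from memory. */
static uint64_t import_value(const section_t *sec, uint64_t addr, bool *found) {
    for (; sec != NULL; sec = sec->predecessor)        /* walk predecessor links */
        for (int i = sec->nstores - 1; i >= 0; i--)    /* latest store first */
            if (sec->store_addr[i] == addr) {
                *found = true;
                return sec->store_val[i];
            }
    *found = false;
    return 0;
}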

SLIDE 39

Retiring = Exporting results.

A computed value is exported to (if not overwritten, freed or consumed by a successor) :

SLIDE 40

Retiring = Exporting results.

A computed value is exported to (if not overwritten, freed or consumed by a successor) : The next section (register, stack, heap).

SLIDE 41

Retiring = Exporting results.

A computed value is exported to (if not overwritten, freed or consumed by a successor) : The next section (register, stack, heap). The next section at the same level (rsp-based stack).

SLIDE 42

Retiring = Exporting results.

A computed value is exported to (if not overwritten, freed or consumed by a successor) :
The next section (register, stack, heap).
The next section at the same level (rsp-based stack).
The previous section (static memory).

SLIDE 43

rsi = 8, rdi = t, rsp = SP, sum(8,t)

[Diagram: the parallelized sum code divided into numbered instruction blocks 1-6; the slides that follow step through the sections (c0, c1, ...) created while running sum on an 8-element array.]

SLIDE 44

One section, block 4 fetched+decoded, block 1 executed.


SLIDE 45

Second section, rbp=n,rsi=n/2,rdi=t copied, blk1 retired.


SLIDE 46

Third section, blk5 renames rax (when predecessor known).


SLIDE 47

Blk6 renames rax (?) and 0(rsp) (level predecessor).


SLIDE 48

B5,c3 renames rax, predecessor sends rax when computed.


SLIDE 49

B4,c3 retires : nothing to export, ooo retirement.


SLIDE 50

B3,c0 computes rax, value sent to renamer b5,c1.


SLIDE 51

B5+b3,c1 send rax and 0(rsp) to renamer b6,c2.


SLIDE 52

B6,c2 sends half sum rax to b5,c3.


SLIDE 53

B5,c3 sends stack saved half sum 0(rsp) to b6,c6.


SLIDE 54

B6,c6 computes final sum into rax.


SLIDE 55

B6,c6 sends rax to renamer (main).


SLIDE 56

Determinism.

SLIDE 57

Hardware parallelization ensures determinism.

SLIDE 58

Hardware parallelization ensures determinism.

The parallel fetch builds a total order of the trace, which associates each reader with its closest writer.

SLIDE 59

Hardware parallelization ensures determinism.

The parallel fetch builds a total order of the trace, which associates each reader with its closest writer. The renaming builds a deterministic partial order (each reader related to its writer and synchronized with it).

SLIDE 60

Hardware parallelization ensures determinism.

The parallel fetch builds a total order of the trace, which associates each reader with its closest writer. The renaming builds a deterministic partial order (each reader related to its writer and synchronized with it). The reader/writer synchronization and the partial-order run ensure determinism.

SLIDE 61

Hardware parallelization ensures determinism.

The parallel fetch builds a total order of the trace, which associates each reader with its closest writer. The renaming builds a deterministic partial order (each reader related to its writer and synchronized with it). The reader/writer synchronization and the partial-order run ensure determinism. This contrasts with pthread non-determinism and complex software synchronization.

SLIDE 62

Conclusion.

SLIDE 63

Conclusion : what message is being sent out ?

Total order of execution : sequential + deterministic.

SLIDE 64

Conclusion : what message is being sent out ?

Total order of execution : sequential + deterministic. Partial order + OS scheduling : parallel + non-deterministic.

SLIDE 65

Conclusion : what message is being sent out ?

Total order of execution : sequential + deterministic. Partial order + OS scheduling : parallel + non-deterministic. Total order of trace + dataflow order of execution : parallel + deterministic.

SLIDE 66

Conclusion : what message is being sent out ?

Total order of execution : sequential + deterministic.
Partial order + OS scheduling : parallel + non-deterministic.
Total order of trace + dataflow order of execution : parallel + deterministic.
http://perso.numericable.fr/bernard.goossens/i_want_a_fork.html
