TSO-Atomicity: TSO Enforcement for Aggressive Program Optimization



slide-1
SLIDE 1

TSO-Atomicity: TSO Enforcement for Aggressive Program Optimization

Cheng Wang, Youfeng Wu, Jaewoong Chung Programming Systems Lab / MPR Intel Labs

Programming Systems Lab

slide-2
SLIDE 2

Motivation

Two different usage models in architecture support for atomic region execution

– HW primitive to enforce exclusive memory access

– Support transactional memory, lock elision, etc

– HW primitive to enforce Sequential Consistency (SC)

– Support SC-preserving aggressive program optimization

Popular architectures like X86 / SPARC implement Total Store Ordering (TSO) instead of SC

– Weaker consistency in TSO leads to more efficient HW implementation

TSO-Atomicity: new HW primitive to enforce TSO

– Weaker consistency than atomicity leads to more efficient HW implementation than atomicity

– Support TSO-preserving aggressive program optimization


slide-3
SLIDE 3

Relationship between Different Consistency Models

[Figure: diagram placing Atomicity and TSO-Atomicity against SC and TSO. Vertical label: "Enforce Memory Consistency for Optimization". Horizontal label: "Weaker Consistency with More Efficient Implementation". Four entries are still marked "?".]


slide-4
SLIDE 4

SC Vs. TSO

SC Execution Trace: ld1, st2_start, st2_end, st3_start, (st3 cache miss), st3_end, ld4, st5_start, st5_end, ld6, ld7

TSO Execution Trace: ld1, st2_start, st2_end, st3_start, ld4, st5_start, ld6, st3_end, st5_end, ld7 (annotations: st3 cache miss, store forward)

– st_start: write store data into store buffer

– st_end: write store data from store buffer to cache/memory

– Later load may be reordered with earlier store

– Data forwarded from store buffer to later load
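The st_start / st_end split and store-to-load forwarding described on this slide can be illustrated with a small model. The following is only a sketch under stated assumptions: the `Core` class, its method names, and the dictionary used as shared memory are hypothetical and are not taken from the presentation.

```python
from collections import deque


class Core:
    """One core with a FIFO store buffer in front of shared memory (hypothetical model)."""

    def __init__(self, memory):
        self.memory = memory          # shared dict: address -> value
        self.store_buffer = deque()   # pending (address, value) pairs, oldest first

    def st_start(self, addr, value):
        # st_start: write store data into the store buffer only.
        self.store_buffer.append((addr, value))

    def st_end(self):
        # st_end: write the oldest buffered store to cache/memory.
        addr, value = self.store_buffer.popleft()
        self.memory[addr] = value

    def ld(self, addr):
        # Forward from the youngest buffered store to the same address, if any;
        # otherwise read shared memory.
        for buffered_addr, value in reversed(self.store_buffer):
            if buffered_addr == addr:
                return value
        return self.memory[addr]


memory = {"A": 0, "B": 0}
core0, core1 = Core(memory), Core(memory)

core0.st_start("A", 1)
r_forwarded = core0.ld("A")   # served from core0's own store buffer
r_remote = core1.ld("A")      # core1 cannot see the buffered store yet
core0.st_end()
r_after_drain = core1.ld("A") # visible once the store reaches memory
```

Between `st_start` and `st_end` the storing core already observes its own store, while other cores do not. That gap is what lets a later load be reordered with an earlier store under TSO.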

slide-5
SLIDE 5

Relationship between Different Consistency Models

[Figure: diagram placing Atomicity and TSO-Atomicity against SC and TSO. Vertical label: "Enforce Memory Consistency for Optimization". Horizontal label: "Weaker Consistency with More Efficient Implementation". Four entries are still marked "?".]


slide-6
SLIDE 6

Execution with Atomicity

[Figure: region timeline. Events: first inst execution, first load execution, first store execution, commit all load/store atomically. Intervals: snoop load, snoop store.]


slide-7
SLIDE 7

Inefficiency of Atomicity

Execution trace (region1 followed by region2): ld1, st2_start, st2_end, st3_start, (st3 cache miss), st3_end, ld4, st5_start, st5_end, ld6, ld7

Across the region boundary, atomicity suffers similar inefficiency to SC


slide-8
SLIDE 8

Atomicity Vs. TSO-Atomicity

[Figure: two region timelines side by side.
Execution with Atomicity. Events: first inst execution, first load execution, first store execution, commit all load/store atomically. Intervals: snoop load, snoop store.
Execution with TSO-Atomicity. Events: first inst execution, first load execution, first store execution, commit all load atomically, commit all store atomically. Intervals: snoop load, snoop store.]


slide-9
SLIDE 9

Efficient Implementation of TSO-Atomicity

TSO-Atomicity Execution Trace (region1 followed by region2): ld1, st2_start, st2_end, st3_start, ld4, st5_start, ld6, st3_end, st5_end, ld7 (annotations: st3 cache miss, store forward)

– Load snoop in region2 uses the speculative bits released by load commit in region1

– No snooping for store data in store buffer until written to cache

– Across the region boundary, TSO-atomicity achieves the same efficiency as TSO


slide-10
SLIDE 10

Prevent Unnecessary Abort

Init: A = 0, B = 0

thread1        thread2
r1 = ld B      r2 = ld A
st A 1         st B 1

Results: r1 = 0, r2 = 0

By committing loads earlier, TSO-atomicity reduces unnecessary aborts due to memory conflicts.

Because it enforces TSO instead of exclusive memory access, TSO-atomicity is not constrained to produce only serializable schedules.
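One way to picture why an earlier load commit avoids the abort is to compare how long a region's loads stay exposed to remote stores. The sketch below is illustrative only: the function `load_conflict`, the integer timestamps, and the choice to look at thread2's region are assumptions made for this example, not details given in the slides.

```python
def load_conflict(read_set, snoop_window, remote_stores):
    """Return True if a remote store hits the read set while loads are still snooped.

    read_set: addresses loaded by the region
    snoop_window: (start, end) times during which loads are snooped
    remote_stores: iterable of (time, address) for stores made visible by other threads
    """
    start, end = snoop_window
    return any(addr in read_set and start <= t <= end for t, addr in remote_stores)


# thread2's region: r2 = ld A; st B 1 (hypothetical timestamps)
read_set = {"A"}
FIRST_LOAD, LOAD_COMMIT, REGION_COMMIT = 1, 2, 4
remote_stores = [(3, "A")]  # thread1's "st A 1" reaches memory at time 3

# Atomicity: loads are snooped until the whole region commits.
abort_with_atomicity = load_conflict(
    read_set, (FIRST_LOAD, REGION_COMMIT), remote_stores)

# TSO-atomicity: loads are snooped only until load commit.
abort_with_tso_atomicity = load_conflict(
    read_set, (FIRST_LOAD, LOAD_COMMIT), remote_stores)
```

With these assumed times, the remote store to A falls inside the atomicity window but after the TSO-atomicity load commit, so only the atomic region would see a conflict.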


slide-11
SLIDE 11

Semantics of Memory Consistency Models

Init: A = 0, B = 0

thread1        thread2
st A 1         st B 1
r1 = ld B      r2 = ld A

Results (r1, r2)

TSO            (0, 0), (0, 1), (1, 0), (1, 1)
SC             (0, 1), (1, 0), (1, 1)
Atomicity      (0, 1), (1, 0)
TSO-Atomicity  (0, 0), (0, 1), (1, 0)
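The four result sets can be reproduced by brute-force enumeration. This is a sketch under simplifying assumptions, not the authors' formal model: each thread is treated as a sequence of atomic steps, each thread's code forms a single region under atomicity and TSO-atomicity, TSO is approximated by letting each thread's store become visible either before or after its load, and all helper names are hypothetical.

```python
from itertools import product

ST_A = ("st", "A", 1)
ST_B = ("st", "B", 1)
LD_B = ("ld", "r1", "B")   # thread1: r1 = ld B
LD_A = ("ld", "r2", "A")   # thread2: r2 = ld A


def interleavings(a, b):
    """Yield every merge of step lists a and b that keeps each list's own order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def run(schedule):
    """Execute a schedule; each step is a tuple of ops that take effect atomically."""
    mem, regs = {"A": 0, "B": 0}, {}
    for step in schedule:
        for op in step:
            if op[0] == "st":
                mem[op[1]] = op[2]
            else:
                regs[op[1]] = mem[op[2]]
    return regs["r1"], regs["r2"]


def outcomes(t1_variants, t2_variants):
    """Collect (r1, r2) over all allowed step orders and all interleavings."""
    found = set()
    for t1, t2 in product(t1_variants, t2_variants):
        for schedule in interleavings(t1, t2):
            found.add(run(schedule))
    return sorted(found)


MODELS = {
    # SC: store, then load, in program order.
    "SC": ([[(ST_A,), (LD_B,)]], [[(ST_B,), (LD_A,)]]),
    # TSO: the store may become visible before or after the later load.
    "TSO": ([[(ST_A,), (LD_B,)], [(LD_B,), (ST_A,)]],
            [[(ST_B,), (LD_A,)], [(LD_A,), (ST_B,)]]),
    # Atomicity: the whole region is one atomic step.
    "Atomicity": ([[(ST_A, LD_B)]], [[(ST_B, LD_A)]]),
    # TSO-atomicity: load commit first, then store commit.
    "TSO-Atomicity": ([[(LD_B,), (ST_A,)]], [[(LD_A,), (ST_B,)]]),
}

results = {name: outcomes(*variants) for name, variants in MODELS.items()}
for name, pairs in results.items():
    print(name, pairs)
```

Under these assumptions the enumeration agrees with the table on the slide. In particular, (1, 1) is excluded under TSO-atomicity because each region's load commit precedes its own store commit.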


slide-12
SLIDE 12

Relationship between Different Consistency Models

[Figure: diagram placing Atomicity and TSO-Atomicity against SC and TSO. Vertical label: "Enforce Memory Consistency for Optimization". Horizontal label: "Weaker Consistency with More Efficient Implementation". Three entries are still marked "?".]


slide-13
SLIDE 13

Atomicity Enforces SC for Optimization

[Figure: Physical Execution vs. Logical Execution of the computation in a region]

  • Logically, all computation executes atomically at region commit time
  • The semantics of a region with atomicity are independent of optimizations within the region
  • Enforces SC for aggressive program optimization


slide-14
SLIDE 14

Relationship between Different Consistency Models

[Figure: diagram placing Atomicity and TSO-Atomicity against SC and TSO. Vertical label: "Enforce Memory Consistency for Optimization". Horizontal label: "Weaker Consistency with More Efficient Implementation". One entry is still marked "?".]


slide-15
SLIDE 15

TSO-Atomicity Enforces TSO for Optimization

[Figure: Physical Execution vs. Logical Execution of a region, divided into Upward-Exposed Loads, Other computation, and Downward-Exposed Stores]

  • Logically, all upward-exposed loads execute atomically at load commit
  • Non-upward-exposed loads get forwarded data
  • Logically, all downward-exposed stores execute atomically at store commit
  • Non-downward-exposed stores are overwritten
  • The semantics of a region with TSO-atomicity are also independent of optimizations within the region
  • Enforces TSO for aggressive program optimization
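The split into upward-exposed loads and downward-exposed stores can be sketched for a straight-line region as follows. This is a minimal illustration assuming the usual compiler meaning of the terms (a load is upward-exposed if no earlier store in the region writes its address; a store is downward-exposed if no later store in the region overwrites it). The function `classify` and the tuple encoding of the region are hypothetical.

```python
def classify(region):
    """Classify the memory operations of a straight-line region.

    region: list of ("ld", addr) or ("st", addr) in program order.
    Returns four lists of indices into region.
    """
    stored = set()        # addresses written so far in the region
    last_store = {}       # address -> index of the latest store to it
    upward_exposed_loads, forwarded_loads = [], []

    for i, (kind, addr) in enumerate(region):
        if kind == "ld":
            if addr in stored:
                forwarded_loads.append(i)       # gets data from an earlier store
            else:
                upward_exposed_loads.append(i)  # reads a value from outside the region
        else:
            stored.add(addr)
            last_store[addr] = i

    downward_exposed_stores = sorted(last_store.values())
    overwritten_stores = [
        i for i, (kind, addr) in enumerate(region)
        if kind == "st" and last_store[addr] != i
    ]
    return (upward_exposed_loads, forwarded_loads,
            downward_exposed_stores, overwritten_stores)


region = [("ld", "A"), ("st", "A"), ("ld", "A"), ("st", "B"), ("st", "A")]
up_loads, fwd_loads, down_stores, dead_stores = classify(region)
```

In this example only the first load needs a value from outside the region, and only the final stores to A and B need to become visible to other threads. That is why the region's externally observable behavior reduces to one group of loads and one group of stores.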


slide-16
SLIDE 16

Conclusions and Future Work

TSO-atomicity is a new HW primitive to enforce TSO for aggressive program optimization

– Weaker than atomicity for efficient HW implementation

– Prevents unnecessary aborts due to conflicts

Experimental results show that TSO-atomicity efficiently supports dynamic binary optimization

– 20% performance improvement through dynamic binary optimization with TSO-atomicity support

Interesting future work is to study new HW primitives that enforce other relaxed memory consistency models for program optimization

– Weaker atomicity than TSO-atomicity to enforce weaker memory consistency than TSO
