Understanding POWER multiprocessors



  1. Understanding POWER multiprocessors
     Susmit Sarkar (1), Peter Sewell (1), Jade Alglave (2, 3), Luc Maranget (3), Derek Williams (4)
     (1) University of Cambridge, (2) Oxford University, (3) INRIA, (4) IBM
     June 2011

  2. Programming shared-memory multiprocessors
     No Sequential Consistency (SC), and not since 1972. But what do we get?
     "Relaxed Memory", differing on different architectures:
     - x86, SPARC: relatively strong, better understood
     - POWER/ARM: weaker, widely used, not widely understood
     - High-level languages: different again; models informed by POWER/ARM features

  3. Relaxed memory behaviour: Message Passing

     Thread 0        Thread 1
     x = 1           while (y == 0) {} ;
     y = 1           r = x    ( read 0? )

     Execution diagram:
       Thread 0: a: W[x]=1 -po-> b: W[y]=1
       Thread 1: c: R[y]=1 -po-> d: R[x]=0
       rf: b -> c; rf: initial state -> d
     Test MP: Allowed

     Forbidden on SC, or x86-TSO.
     Allowed on POWER (~1e6 in 2e9 on a POWER7).
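To see why the MP outcome is forbidden under SC, one can enumerate every interleaving of the two threads against a single shared memory. The following is a minimal sketch, assuming the spin loop is replaced by a single read of y (as in the litmus-test diagram); the names `interleavings`, `run_sc`, `r_y`, and `r_x` are illustrative and do not come from the slides.

```python
from itertools import combinations

# Each thread is a list of operations in program order.
# ("W", var, value) writes; ("R", var, reg) reads var into a register.
THREAD0 = [("W", "x", 1), ("W", "y", 1)]
THREAD1 = [("R", "y", "r_y"), ("R", "x", "r_x")]

def interleavings(t0, t1):
    """Yield every merge of t0 and t1 that keeps each thread's program order."""
    n = len(t0) + len(t1)
    for slots in combinations(range(n), len(t0)):
        it0, it1 = iter(t0), iter(t1)
        yield [next(it0) if i in slots else next(it1) for i in range(n)]

def run_sc(schedule):
    """Execute one interleaving against a single shared memory (the SC view)."""
    mem = {"x": 0, "y": 0}
    regs = {}
    for kind, var, arg in schedule:
        if kind == "W":
            mem[var] = arg
        else:
            regs[arg] = mem[var]
    return regs["r_y"], regs["r_x"]

sc_outcomes = {run_sc(s) for s in interleavings(THREAD0, THREAD1)}

# The relaxed outcome (r_y, r_x) == (1, 0) never appears under SC.
print(sorted(sc_outcomes))
print((1, 0) in sc_outcomes)
```

The enumeration covers only SC; it says nothing about POWER, where the slide reports the outcome as observable.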

  4. What is going on?
     Visible microarchitectural effects:
     - Out-of-order and speculative execution
     - Buffering of stores and loads
     - Topology of interconnection

  5. Enforcing order where needed

     Thread 0        Thread 1
     x = 1           while (y == 0) {} ;
     sync()          r = *y    ( read 0? )
     y = &x

     sync: writes in order
     - on the same thread; and
     - when propagating to other threads
     Dependency: reads in order
     - later read not issued until resolved

     Execution diagram:
       Thread 0: a: W[x]=1 -sync-> b: W[y]=&x
       Thread 1: c: R[y]=&x -addr-> d: R[x]=0
       rf: b -> c; rf: initial state -> d
     Test MP+sync+addr: Forbidden
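The effect of the barrier and the dependency can be illustrated with a deliberately crude toy: let each thread perform its accesses in any order unless an ordering edge (standing in for sync on the writer, or the address dependency on the reader) pins them, then run the result against one shared memory. This is a sketch under strong assumptions, not the POWER model: it ignores propagation between threads entirely and uses the value 1 in place of &x. All function names are hypothetical.

```python
from itertools import combinations, permutations

def thread_orders(ops, ordered_pairs):
    """All orders a thread may perform ops in; (i, j) forces op i before op j."""
    for perm in permutations(range(len(ops))):
        if all(perm.index(i) < perm.index(j) for i, j in ordered_pairs):
            yield [ops[k] for k in perm]

def interleavings(t0, t1):
    """Yield every merge of t0 and t1 that keeps each list's own order."""
    n = len(t0) + len(t1)
    for slots in combinations(range(n), len(t0)):
        it0, it1 = iter(t0), iter(t1)
        yield [next(it0) if i in slots else next(it1) for i in range(n)]

def run(schedule):
    """Execute a schedule against one shared memory; return (r_y, r_x)."""
    mem, regs = {"x": 0, "y": 0}, {}
    for kind, var, arg in schedule:
        if kind == "W":
            mem[var] = arg
        else:
            regs[arg] = mem[var]
    return regs["r_y"], regs["r_x"]

def mp_outcomes(writer_ordered, reader_ordered):
    """Outcomes of MP; the flags stand in for sync (writer) and addr (reader)."""
    writer = [("W", "x", 1), ("W", "y", 1)]
    reader = [("R", "y", "r_y"), ("R", "x", "r_x")]
    w_pairs = [(0, 1)] if writer_ordered else []
    r_pairs = [(0, 1)] if reader_ordered else []
    outcomes = set()
    for w in thread_orders(writer, w_pairs):
        for r in thread_orders(reader, r_pairs):
            for schedule in interleavings(w, r):
                outcomes.add(run(schedule))
    return outcomes

for flags in [(False, False), (True, False), (False, True), (True, True)]:
    print(flags, (1, 0) in mp_outcomes(*flags))
```

In this toy, the stale read survives if either side is left unordered and disappears only when both the writer and the reader are constrained, which matches the slide's pairing of sync with an address dependency.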

  6. POWER model in general: ...
     How do we find out?
     Architecture manuals: ambiguous prose
       "all that horrible horribly incomprehensible and confusing [...] text that no-one can parse or reason with, not even the people who wrote it" (Anonymous Processor Architect, 2011) [1]
     Concrete implementation:
     - Proprietary
     - Extremely complex, and too low-level
     - Changes across generations
     [1] Not Derek Williams!

  7. Our work
     Rigorous Architecture:
     - Do lots of tests (borrow, handwrite, autogenerate) on Power G5, 6, and 7
     - Discuss with designers/architects
     - Develop an abstract operational model
       - Matches observed behaviour (intentionally looser in some aspects)
       - Simple enough to understand
     Only considering application and common OS code, with no unaligned/mixed-size accesses (no self-modifying code, device memory, or page table changes).

  8. The model structure
     Overall structure:
       [Diagram: Thread ... Thread connected to the Storage Subsystem; message labels: Write request, Write announce, Barrier request, Barrier ack]
     - Some aspects are thread-only, some storage-only, some both
     - Threads and Storage Subsystem: abstract state machines
     - Speculative execution in Threads; topology-independent Storage Subsystem
     - Formally: transitions, guarded by preconditions, change state, and synchronize with each other

  9. Cumulativity: Programming on many threads

     Thread 0     Thread 1               Thread 2
     x = 1        while (x == 0) {} ;    while (y == 0) {} ;
                  sync()                 r = *y    ( read 0? )
                  y = &x

     Execution diagram:
       Thread 0: a: W[x]=1
       Thread 1: b: R[x]=1 -sync-> c: W[y]=&x
       Thread 2: d: R[y]=&x -addr-> e: R[x]=0
       rf: a -> b; rf: c -> d; rf: initial state -> e
     Test WRC+sync+addr: Forbidden

     The sync is cumulative: it keeps (a) and (c) in order for all threads.
     Flipping the dependency and barrier does not recover SC.
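Cumulativity can be pictured with per-thread views of memory: a write becomes visible to a thread only once it has been propagated there, and a barrier drags along every write its own thread had already seen. The sketch below is a toy under stated assumptions: the class `ToyStorage`, its methods, and the tuple encoding of events are all hypothetical, the value 1 stands in for &x, and Thread 2's two reads are assumed to happen in order (the role of the address dependency).

```python
class ToyStorage:
    """Per-thread views: a thread reads only what was propagated to it."""

    def __init__(self, tids):
        self.view = {t: [] for t in tids}  # events propagated to each thread

    def write(self, tid, var, val):
        ev = ("W", tid, var, val)
        self.view[tid].append(ev)  # a thread always sees its own write
        return ev

    def sync(self, tid):
        # Group A: every write already propagated to the barrier's thread.
        group_a = tuple(e for e in self.view[tid] if e[0] == "W")
        ev = ("SYNC", tid, group_a)
        self.view[tid].append(ev)
        return ev

    def can_propagate(self, ev, to):
        if ev in self.view[to]:
            return False
        if ev[0] == "SYNC":
            # A barrier needs its whole Group A at the destination first.
            return all(w in self.view[to] for w in ev[2])
        # A write needs every barrier that preceded it on its own thread.
        origin = self.view[ev[1]]
        before = origin[: origin.index(ev)]
        return all(b in self.view[to] for b in before if b[0] == "SYNC")

    def propagate(self, ev, to):
        if not self.can_propagate(ev, to):
            raise RuntimeError("propagation not allowed yet")
        self.view[to].append(ev)

    def read(self, tid, var):
        for ev in reversed(self.view[tid]):
            if ev[0] == "W" and ev[2] == var:
                return ev[3]
        return 0  # initial value

def wrc(with_sync):
    """Deliver y to Thread 2 as early as allowed; return Thread 2's (y, x)."""
    s = ToyStorage([0, 1, 2])
    a = s.write(0, "x", 1)
    s.propagate(a, 1)              # Thread 1 sees x = 1 and leaves its loop
    barrier = s.sync(1) if with_sync else None
    c = s.write(1, "y", 1)
    if not s.can_propagate(c, 2):  # c is stuck behind the barrier
        s.propagate(a, 2)          # Group A must reach Thread 2 first
        s.propagate(barrier, 2)
    s.propagate(c, 2)
    return s.read(2, "y"), s.read(2, "x")

print(wrc(with_sync=False))
print(wrc(with_sync=True))
```

Without the barrier, y can overtake x on its way to Thread 2, which is the WRC outcome; with it, x is forced to arrive first.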

  10. Model Excerpt: Propagate write to another thread
      The storage subsystem can propagate a write w (by thread tid) that it has seen to another thread tid′, if:
      - the write has not yet been propagated to tid′;
      - w is coherence-after any write to the same address that has already been propagated to tid′; and
      - all barriers that were propagated to tid before w (in s.events_propagated_to(tid)) have already been propagated to tid′.
      Action: append w to s.events_propagated_to(tid′).
      Explanation: This rule advances the thread tid′ view of the coherence order to w, which is needed before tid′ can read from w, and is also needed before any barrier that has w in its "Group A" can be propagated to tid′.
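The rule above reads naturally as a guarded transition: a predicate over the storage state plus an action. A minimal sketch follows, assuming a representation that is not the authors' LEM definition: events are frozen dataclasses carrying a hypothetical `name` label for uniqueness, `events_propagated_to` is a dict of lists, and coherence is a set of (earlier, later) pairs.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Write:
    name: str   # label for uniqueness, e.g. "a" (an assumption of this sketch)
    tid: int
    addr: str
    value: int

@dataclass(frozen=True)
class Barrier:
    name: str
    tid: int

@dataclass
class StorageState:
    events_propagated_to: dict                   # tid -> events, in order
    coherence: set = field(default_factory=set)  # (w1, w2): w1 before w2

def can_propagate_write(s, w, dest):
    """Precondition of the propagate-write rule, one check per condition."""
    own = s.events_propagated_to[w.tid]
    seen = s.events_propagated_to[dest]
    # The storage subsystem has seen w, and w is not yet at dest.
    if w not in own or w in seen:
        return False
    # w is coherence-after every same-address write already at dest.
    for e in seen:
        if isinstance(e, Write) and e.addr == w.addr and (e, w) not in s.coherence:
            return False
    # Every barrier propagated to w's thread before w is already at dest.
    earlier = own[: own.index(w)]
    return all(b in seen for b in earlier if isinstance(b, Barrier))

def propagate_write(s, w, dest):
    """Action: append w to the events propagated to dest."""
    if not can_propagate_write(s, w, dest):
        raise ValueError("precondition of the propagate-write rule does not hold")
    s.events_propagated_to[dest].append(w)

# Usage: a write held back by an earlier barrier on its own thread.
sync = Barrier("sync", 1)
c = Write("c", 1, "y", 1)
state = StorageState({1: [sync, c], 2: []})
print(can_propagate_write(state, c, 2))   # barrier not yet at thread 2
state.events_propagated_to[2].append(sync)
print(can_propagate_write(state, c, 2))   # barrier has arrived
```

Keeping the guard separate from the action mirrors the slide's split into precondition and "Action", and makes each of the three conditions individually testable.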

  11. Overall Model Size
      Explanation in ~3 pages of prose
      - Microarchitectural intuitions
      - No extraneous concrete details
      ~2500 lines of machine-processed math
      - In LEM [ITP'11], a simple new semantic metalanguage
      - Can extract executable code, and theorem-prover code

  12. Validating the model
      - Extract executable code from definition, exhaustively enumerate possible behaviours of tests
      - Run many iterations of tests on real hardware (Power G5, 6, 7)
      Excerpt of results:

      | Test          | Model  | POWER 6             | POWER 7             |
      |---------------|--------|---------------------|---------------------|
      | WRC+sync+addr | Forbid | ok, 0 / 16G         | ok, 0 / 110G        |
      | WRC+data+sync | Allow  | ok, 150k / 12G      | ok, 56k / 94G       |
      | PPOCA         | Allow  | unseen, 0 / 39G     | ok, 62k / 141G      |
      | PPOAA         | Forbid | ok, 0 / 39G         | ok, 0 / 157G        |
      | LB            | Allow  | unseen, 0 / 31G     | unseen, 0 / 176G    |

      Agreed with key IBM Power designers/architects
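The "ok" and "unseen" labels in the table appear to follow from comparing the model's verdict with the observed count. A small sketch of that reading, which is an interpretation rather than something the slides state: the function name `classify` and the label "violation" (for a forbidden outcome that hardware nevertheless exhibits) are hypothetical, and the counts are the table's POWER 7 numerators with "k" read as thousands.

```python
def classify(model_verdict, observed_count):
    """Compare a model verdict ("Allow" or "Forbid") with a hardware count."""
    if model_verdict not in ("Allow", "Forbid"):
        raise ValueError("model_verdict must be 'Allow' or 'Forbid'")
    if model_verdict == "Forbid":
        # Seeing a forbidden outcome would mean the model is too strict.
        return "ok" if observed_count == 0 else "violation"
    # An allowed outcome need not show up: the model may be looser than hardware.
    return "ok" if observed_count > 0 else "unseen"

# POWER 7 column of the excerpt: test -> (model verdict, observed count).
POWER7 = {
    "WRC+sync+addr": ("Forbid", 0),
    "WRC+data+sync": ("Allow", 56_000),
    "PPOCA": ("Allow", 62_000),
    "PPOAA": ("Forbid", 0),
    "LB": ("Allow", 0),
}

for test, (verdict, count) in POWER7.items():
    print(test, classify(verdict, count))
```

Under this reading, "unseen" is compatible with the model, consistent with the earlier remark that it is intentionally looser than observed behaviour in some aspects.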


  15. Summing up
      A mathematically precise, empirically validated, operational model of POWER
      - Microarchitectural intuitions, but abstract: no implementation details
      Rigorous Architecture
      - Can reason about low-level code above it (static analysis tools)
      - Can build on for software verification (e.g. compiler verification)
      - Can use as specification to test implementations
      - ...
      Lots to be done!
