CS 6354: SMT sum2 += array[i]; thread_one_func( int offset) { for ( - PowerPoint PPT Presentation

CS 6354: SMT sum2 += array[i]; thread_one_func( int offset) { for ( int i = 0; i < N / 2; ++i) sum1 += array[i]; } thread_two_func() { for ( int i = N / 2; i < N; ++i) } 2 compute_sum() { thread_one = thread_create(thread_one_func); thread_two = thread_create(thread_two_func); wait_for_thread(thread_one); wait_for_thread(thread_two); sum = sum1 + sum2; } Multithreading appears to execute at same time as other threads 28 September 2016 Hennessy and Patterson, Computer Architecture: A Quanitative Approach , 1 To read more… This day’s papers: Tullsen et al, “Exploiting choice: instruction fetch and issue on an implementable simultaneous multithreading processor” Alverson et al, “The Tera Computer System” Supplementary Reading: Section 3.12 may or may not share memory Kongetira et al, “Niagara: A 32-Way Multithreaded Sparc Processor” Shin and Lipasti, Modern Processor Design , Section 11.4.4 1 Defjnition: Thread stream of program execution own registers 3 own program counter (current instruction pointer)

OS context switches 1-cycle data cache in-order choose thread dynamically round-robin between threads sum1 += array[offset + 0]; many register fjles schedule when ready compiler-specifjed delays reorder bufger in-order completion imprecise exceptions 70-cycle data memory Tera 6 Tera: Is it usable? minimum of nine threads to get full throughput 5% serial 10% serial 25% serial 50% serial 0% serial Degree of Parallelism (1=serial) Speedup (1=serial) Amdahl’s Law out-of-order many register name maps Exploiting Choice if (0 < N / 2) goto done2; if (0 < N / 2) goto done1; ... sum1 += array[offset + 1940]; if (1940 < N / 2) goto done1; copy registers to memory OS runs load registers from memory timer interrupt/exception two approaches return from interrupt/exception sum2 += array[offset + N/2 + 0]; 7 ... externally visible: if (1849 + N / 2 < N) goto done2; 4 5 threads may or may not be in seperate programs maybe shared between threads: memory (address of page table) (program-visible) registers program counter (current instruction) sum2 += array[offset + N/2 + 1849]; threads state AKA context × 256 CPUs = 2304 threads 20 15 10 5 10 20 30 40 50 60

Tera: the commercial version 1994: MIPS R8000 register map physical register values branch prediction info program counters reorder bufger queued instructions internal: (program-visible) registers program counter (current instruction) externally visible: thread state AKA context 10 thread state — running superscalar 9 1993: Pentium Tera/Cray MTA (1997) — described in paper (took a complaint 7 years!) Cray MTA-2 (2002) Cray XMT (2009) — combines with conventional processors for I/O not advertised anymore 8 Why doesn’t Tera paper compare to 1992: DEC Alpha 21064 superscalar/out-of-order? 1960s: IBM, Control Data Corp. machines 1988: Motorola MC88100 1989: Intel i960CA Tera paper 1990: AMD 29000 11

modern SMT systems caches reorder bufger??? register maps return address stack (branch prediction) program counters duplicated resources 14 physical registers load/store queue functional units (adders, multipliers, etc.) instruction queues shared resources most Intel desktop/laptop chips — 2 threads/core 13 duplicate thread state no context switch running two threads 12 POWER5 (2004) — 2 threads/core IBM POWER8 (2013) — 8 threads/core SPARC T1 (2005) — 4 threads/core Oracle SPARC T5 (2013) — 8 threads/core 2nd gen. Pentium 4 (“NetBurst”) (2002) 15

thread ids added to resources two cache-bound threads? Figure from Fedorova et al, “Chip multithreading systems need a new operating system scheduler”, 2004 core multi SMT one intuition 18 two branch-heavy threads? how many cache accesses per cycle? how many integer ALUs? branch target bufger — phantom branches two intensive integer threads? how many fmoating point adders? two fmoating point intensive threads? what workloads benefjt? 17 actual throughput: approx. 4.5 maximum throughput: 8 instructions/cycle 8-issue processor?? 16 19

variable gains instead of better branch prediction … cycle 2: 4 from thread 1, 4 from thread 2 cycle 1: 4 from thread 1, 4 from thread 2 multiple threads at a time (2.4) cycle 4: 8 form thread 2 cycle 3: 8 from thread 1 cycle 2: 8 from thread 2 cycle 1: 8 from thread 1 baseline (1.8) round-robin variants 22 instead of faster ALUs hide long-latency instructions Figure from Funston, et al, “An SMT-Selection Metric to Improve Multithreaded Applications’ Performance”, IPDPS 2012 Tera: no data cache removed complexity? 21 Tera: fetch logic = issue logic fetch/branch logic Tera: in-order completion Tera: imprecise arithmetic exceptions more complex interrupt logic Exploiting Choice: useful for single thread huge number of registers — slower regfjle added complexity? 20 23 just have more parallelism!

round-robin performance 24 priority-based fetch fetch more for faster/more starved threads less unresolved branches less cache misses less pending instructions 25 priority-based fetch 26 Tera: thread creation CREATE instruction no OS intervention OS can later move each thread between processors 27

Exploiting Choice: thread creation not specifjed really helps dense matrix math unit gives 2 fmoating point operations/cycle/functional cheat? FMA: optimization or benchmark 30 fetch and add read/write when ready write read complex commands to memory: no caches — single copy of all data Tera: Synchronization 29 16x16x16 version of: Tera: hypertorus 28 “logical processor/core” same as multiple processors Intel mechanism: each thread looks like processor 31 Image: Wikimedia Commons user おむこさん志望 Fused Multiply-Add A = B × C + D single instruction/functional unit use

Next week: multiple processors C.mmp — one of the earliest multiprocessor what changes from the uniprocessor were required? how well are threads isolated from each other? how does one program these machines? design of the networks challenges in making multiprocessor machine things to think about when reading 34 you can skim/skip these parts multiprocessor lots of software issues that don’t really concern C.mmp distractions 33 steps 1-3: cycle time steps 1-2: access time 2. triggers signal if old direction was ‘1’ 1. set magnetization direction to ‘0’ read: tiny metal rings, magnetized to store a bit not something you are expected to know: Some weird terminology in C.mmp 32 T3E — supercomputer from the 90s 35 C.mmp deals with core memory (1950s-1970s) 3. rewrite value to old direction how does one coordinate between threads?

CS 6354: SMT sum2 += array[i]; thread_one_func( int offset) { for ( - PowerPoint PPT Presentation

CS 6354: SMT sum2 += array[i]; thread_one_func( int offset) { for ( int i = 0; i < N / 2; ++i) sum1 += array[i]; } thread_two_func() { for ( int i = N / 2; i < N; ++i) } 2 compute_sum() { thread_one = thread_create(thread_one_func);

SMT WORLDWIDE SMT America, Europe and Asia staff has over 20 years experience in the SMT field

POLYMETALLIC PRODUCER AGM PRESENTATION June 30, 2020 TSX: SMT | NYSE AMERICAN: SMTS | BVL: SMT

SMT Solvers: A Disruptive Technology John Rushby Computer Science Laboratory SRI International

Using SMT solvers for binary analysis and exploitation A primer on SMT, SMT solvers, Z3 & angr

POLYMETALLIC PRODUCER CORPORATE PRESENTATION July 2020 TSX: SMT | NYSE AMERICAN: SMTS | BVL:

SMT in Asia Content Teknek and the SMT industry The market Why cleaning is needed

POLYMETALLIC PRODUCER CORPORATE PRESENTATION February 2020 TSX: SMT | NYSE AMERICAN: SMTS |

DIVERSIFIED PRODUCER CORPORATE PRESENTATION August 2020 TSX: SMT | NYSE AMERICAN: SMTS | BVL:

DIVERSIFED PRODUCER CORPORATE PRESENTATION August 2020 TSX: SMT | NYSE AMERICAN: SMTS | BVL:

SMT-LIB for HOL Daniel Kroening Philipp Rmmer Georg Weissenbacher Oxford University Computing

Motivation SMT Theories of Interest History of SMT Eager approach Lazy approach Optimizations

Introduction to SAT and SMT Solvers Interfacing Yosys and SMT Solvers for BMC and more using

Introduction to SMT Albert Oliveras Technical University of Catalonia 8th International

SMT Unsat Core Minimization OFER GUTHMANN, OFER STRICHMAN, ANNA TRO STANETSKI FMCAD2016 1 SMT

Applications of SMT to Test Generation Patrice Godefroid Microsoft Research SAT/SMT Summer

SMT-LIB Status Report SMT19 Clark Barrett, Pascal Fontaine, Cesare Tinelli 1 / 11 Plan

CPE 335 Computer Organization Computer Organization Basic MIPS Architecture Part II Dr.

V. Register Machine Yuxi Fu BASICS, Shanghai Jiao Tong University Register Machines are more

Lecture 16: Basic CPU Design Todays topics: Single-cycle CPU Multi-cycle CPU

Course Overview Who am I? Jonathan Walpole Professor at PSU since 2004, OGI 1989 2004

ProtoDUNE-SP 35T HV Testing Sarah Lockwitz, FNAL DUNE LBNC Meeting, October 26, 2017 Introduction

High Voltage Testing at the 35t Long Baseline Neutrino Committee (LBNC): ProtoDUNE-SP Sarah

Sam Asci, NEFMC Staff Scallop APMay 22 nd , 2019 Scallop CommitteeMay 23 rd , 2019 1

Building a Fast, Virtualized Building a Fast, Virtualized Data Plane with Data Plane with

CS 6354: SMT sum2 += array[i]; thread_one_func( int offset) { for ( - PowerPoint PPT Presentation

CS 6354: SMT sum2 += array[i]; thread_one_func( int offset) { for ( int i = 0; i < N / 2; ++i) sum1 += array[i]; } thread_two_func() { for ( int i = N / 2; i < N; ++i) } 2 compute_sum() { thread_one = thread_create(thread_one_func);

SMT WORLDWIDE SMT America, Europe and Asia staff has over 20 years experience in the SMT field

POLYMETALLIC PRODUCER AGM PRESENTATION June 30, 2020 TSX: SMT | NYSE AMERICAN: SMTS | BVL: SMT

SMT Solvers: A Disruptive Technology John Rushby Computer Science Laboratory SRI International

Using SMT solvers for binary analysis and exploitation A primer on SMT, SMT solvers, Z3 &amp; angr

POLYMETALLIC PRODUCER CORPORATE PRESENTATION July 2020 TSX: SMT | NYSE AMERICAN: SMTS | BVL:

SMT in Asia Content Teknek and the SMT industry The market Why cleaning is needed

POLYMETALLIC PRODUCER CORPORATE PRESENTATION February 2020 TSX: SMT | NYSE AMERICAN: SMTS |

DIVERSIFIED PRODUCER CORPORATE PRESENTATION August 2020 TSX: SMT | NYSE AMERICAN: SMTS | BVL:

DIVERSIFED PRODUCER CORPORATE PRESENTATION August 2020 TSX: SMT | NYSE AMERICAN: SMTS | BVL:

SMT-LIB for HOL Daniel Kroening Philipp Rmmer Georg Weissenbacher Oxford University Computing

Motivation SMT Theories of Interest History of SMT Eager approach Lazy approach Optimizations

Introduction to SAT and SMT Solvers Interfacing Yosys and SMT Solvers for BMC and more using

Introduction to SMT Albert Oliveras Technical University of Catalonia 8th International

SMT Unsat Core Minimization OFER GUTHMANN, OFER STRICHMAN, ANNA TRO STANETSKI FMCAD2016 1 SMT

Applications of SMT to Test Generation Patrice Godefroid Microsoft Research SAT/SMT Summer

SMT-LIB Status Report SMT19 Clark Barrett, Pascal Fontaine, Cesare Tinelli 1 / 11 Plan

CPE 335 Computer Organization Computer Organization Basic MIPS Architecture Part II Dr.

V. Register Machine Yuxi Fu BASICS, Shanghai Jiao Tong University Register Machines are more

Lecture 16: Basic CPU Design Todays topics: Single-cycle CPU Multi-cycle CPU

Course Overview Who am I? Jonathan Walpole Professor at PSU since 2004, OGI 1989 2004

ProtoDUNE-SP 35T HV Testing Sarah Lockwitz, FNAL DUNE LBNC Meeting, October 26, 2017 Introduction

High Voltage Testing at the 35t Long Baseline Neutrino Committee (LBNC): ProtoDUNE-SP Sarah

Sam Asci, NEFMC Staff Scallop APMay 22 nd , 2019 Scallop CommitteeMay 23 rd , 2019 1

Building a Fast, Virtualized Building a Fast, Virtualized Data Plane with Data Plane with

Using SMT solvers for binary analysis and exploitation A primer on SMT, SMT solvers, Z3 & angr