Hardware Acceleration of Transactional Memory on Commodity Systems

Jared Casper, Tayo Oguntebi, Sungpack Hong, Nathan Bronson, Christos Kozyrakis, Kunle Olukotun
Pervasive Parallelism Laboratory, Stanford University

TM Design Alternatives
- Software (STM)
  - “Barriers” on each shared load and store update data structures
- Hardware (HTM)
  - Tap hardware data paths to learn of loads and stores for conflict detection
  - Buffer speculative state or maintain an undo log in hardware, usually at the L1 level
- Hybrid
  - Best-effort HTM falls back to STM
  - Generally target small transactions
- Hardware accelerated
  - Software runtime is always used, but accelerated
  - Existing proposals still tap the hardware data path

TMACC: TM Acceleration on Commodity Cores
- Challenges facing adoption of TM
  - Software TM requires 4-8 cores just to break even
  - Hardware TM is expensive and risky
    - Sun’s Rock provides limited HTM for small transactions
    - Support for large transactions requires changes to the core
    - Optimal semantics for HTM are still under debate
  - Hybrid schemes look attractive, but still modify the core
  - No systems available to attract software developers
- Accelerate STM without changing the processor
  - Leverage much of the work on STMs
  - Much less risky and expensive
  - Use the existing memory system for communication

TMACC: TM Acceleration on Commodity Cores
Which TM operations can be moved off-core: (+) marks a good candidate, (-) a poor one.
- (+) Conflict detection
  - Can happen after the fact
  - Can nearly eliminate expensive read barriers
- (-) Checkpointing
  - Needs access to core internals
- (-) Version management
  - Latency-critical operations
  - The common case, when the load is not in the store buffer, must take less than ~10 cycles
- (-) Commit
  - Could be done off-chip, but would require removing everything from the processor’s cache

Protocol Overview
- Reads
  - Send address to HW
  - Check for value in write buffer
- Writes
  - Add to the write buffer
  - Same as STM
- Commit
  - Send HW each address in write set
  - Ask permission to commit
  - Apply write buffer
- Violation notification
  - Must be fast to check for violation in software

[Figure: message timeline among Thread1, Thread2, and the TMACC HW. Messages shown: Read A, Read B, To write B, OK to commit?, Yes, You’re violated]
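The read, write, commit, and violation steps on this slide can be sketched in software as follows. This is a minimal illustration and not the TMACC runtime: `FakeAccelerator`, `Transaction`, and their method names are hypothetical, the hardware is modeled with exact sets instead of Bloom filters, and message latency and reordering are ignored. The scenario at the end is one plausible reading of the slide's figure, in which Thread1 reads A and B, Thread2 writes B and commits, and Thread1 is then violated.

```python
class FakeAccelerator:
    """Hypothetical stand-in for the off-core hardware (exact sets, no latency)."""

    def __init__(self):
        self.read_sets = {}    # tid -> addresses read by that transaction
        self.violated = set()  # tids whose transactions must abort

    def begin(self, tid):
        self.read_sets[tid] = set()
        self.violated.discard(tid)

    def note_read(self, tid, addr):
        # "Send address to HW" on every transactional read.
        self.read_sets[tid].add(addr)

    def note_write(self, tid, addr):
        # At commit, each write-set address is checked against other read sets.
        for other, reads in self.read_sets.items():
            if other != tid and addr in reads:
                self.violated.add(other)

    def ok_to_commit(self, tid):
        return tid not in self.violated


class Transaction:
    def __init__(self, tid, memory, hw):
        self.tid, self.memory, self.hw = tid, memory, hw
        self.write_buffer = {}  # speculative writes stay in software, as in an STM
        hw.begin(tid)

    def read(self, addr):
        self.hw.note_read(self.tid, addr)
        # Check the write buffer first so a transaction sees its own writes.
        if addr in self.write_buffer:
            return self.write_buffer[addr]
        return self.memory[addr]

    def write(self, addr, value):
        self.write_buffer[addr] = value

    def commit(self):
        for addr in self.write_buffer:
            self.hw.note_write(self.tid, addr)
        if not self.hw.ok_to_commit(self.tid):
            return False  # violated: caller should retry the transaction
        self.memory.update(self.write_buffer)
        return True


memory = {"A": 1, "B": 2}
hw = FakeAccelerator()
t1 = Transaction(1, memory, hw)
t2 = Transaction(2, memory, hw)
t1.read("A")
t1.read("B")
t2.write("B", 20)
t2_committed = t2.commit()  # flags t1, which has read B
t1_committed = t1.commit()  # refused: t1 was violated
```

Because the write-set addresses are sent before permission is requested, a transaction that later fails to commit may still flag others; the sketch accepts that as a conservative simplification.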

Problem of Being Off-Core
- Variable latency to reach the HW
  - Network latency
  - Amount of time in the store buffer
- How can we determine correct ordering?

[Figure: message timeline among Thread1, Thread2, and the TMACC HW. Messages shown: To write A, OK to commit?, Read A, Yes, OK to commit?]

Global and Local Epochs
[Figure: commands A, B, and C placed on a timeline of Epoch N-1, Epoch N, and Epoch N+1, shown once for global epochs and once for local epochs]
- Global epochs
  - Each command embeds an epoch number (a global variable)
  - Finer grain, but requires global state
  - Know A < B, C but nothing about B and C
- Local epochs
  - Each thread declares the start of a new epoch
  - Cheaper, but coarser grain (non-overlapping epochs)
  - Know C < B, but nothing about A and B or A and C
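The ordering rules on this slide can be illustrated with a small sketch. It assumes a simplified model that is not taken from the paper: a global epoch is a plain integer attached to each command, and a local epoch is the interval (in hardware arrival time) between a thread's consecutive epoch declarations. The names `order_global`, `order_local`, and `LocalEpoch` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocalEpoch:
    """Assumed model: interval between two epoch declarations of one thread."""
    start: int  # arrival time of this epoch's declaration at the hardware
    end: int    # arrival time of the same thread's next declaration


def order_global(epoch_a, epoch_b):
    """Order two commands by the global epoch number each one embeds."""
    if epoch_a < epoch_b:
        return "before"
    if epoch_a > epoch_b:
        return "after"
    return "unknown"  # same epoch: relative order cannot be determined


def order_local(a, b):
    """Order two commands by local epochs; only non-overlapping epochs are ordered."""
    if a.end <= b.start:
        return "before"
    if b.end <= a.start:
        return "after"
    return "unknown"  # overlapping epochs: relative order cannot be determined


# Global epochs: A was sent in epoch N-1, B and C in epoch N (here N = 8).
a_vs_b = order_global(7, 8)
b_vs_c = order_global(8, 8)

# Local epochs: C's epoch ends before B's begins; A's epoch overlaps both.
epoch_a = LocalEpoch(0, 10)
epoch_c = LocalEpoch(1, 4)
epoch_b = LocalEpoch(5, 9)
c_vs_b = order_local(epoch_c, epoch_b)
a_vs_c = order_local(epoch_a, epoch_c)
```

In both schemes an "unknown" result has to be treated conservatively, which is where the false positives discussed on the following slides come from.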

Two TMACC Schemes
- We proposed two TM schemes
  - TMACC-GE uses global epochs
  - TMACC-LE uses local epochs
- Trade-offs

  TMACC-GE                                  TMACC-LE
  (+) More accurate conflict detection      (+) No global data in software
      -> fewer false positives                  -> less SW overhead
  (-) Global epoch management               (-) Less information for ordering
      -> more SW overhead                       -> more false positives

- Details in the paper

TMACC Hardware
- A set of generic Bloom filters + control logic
  - Bloom filter: a condensed way to store ‘set’ information
  - Read-set: addresses that a thread has read
  - Write-set: addresses that other threads have written
- Conflict detection
  - Compare read-address against write-set
  - Compare write-address against read-set
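The Bloom-filter comparison described on this slide can be sketched in software. This is an illustration under stated assumptions and not the FPGA design: the filter size, the number of hash functions, the SHA-256 based hashing, and the names `BloomFilter` and `ConflictDetector` are all hypothetical. Each thread gets a read-set filter (addresses it has read) and a write-set filter (addresses other threads have written).

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter over integer addresses (illustrative parameters)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit vector

    def _positions(self, addr):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{addr}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, addr):
        for pos in self._positions(addr):
            self.bits |= 1 << pos

    def might_contain(self, addr):
        # False means definitely absent; True may be a false positive.
        return all((self.bits >> pos) & 1 for pos in self._positions(addr))

    def clear(self):
        self.bits = 0


class ConflictDetector:
    def __init__(self, num_threads):
        self.read_sets = [BloomFilter() for _ in range(num_threads)]
        self.write_sets = [BloomFilter() for _ in range(num_threads)]
        self.violated = [False] * num_threads

    def begin(self, tid):
        self.read_sets[tid].clear()
        self.write_sets[tid].clear()
        self.violated[tid] = False

    def on_read(self, tid, addr):
        # Compare read-address against write-set (addresses others have written).
        if self.write_sets[tid].might_contain(addr):
            self.violated[tid] = True
        self.read_sets[tid].add(addr)

    def on_commit_write(self, tid, addr):
        # Compare write-address against every other thread's read-set.
        for other in range(len(self.read_sets)):
            if other == tid:
                continue
            if self.read_sets[other].might_contain(addr):
                self.violated[other] = True
            self.write_sets[other].add(addr)


detector = ConflictDetector(num_threads=2)
detector.begin(0)
detector.begin(1)
detector.on_read(0, 0x100)          # thread 0 reads address 0x100
detector.on_commit_write(1, 0x100)  # thread 1 commits a write to 0x100
```

A Bloom filter never reports an inserted address as absent, so real conflicts are not missed; the price is occasional false positives, which show up as unnecessary violations.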

Procyon System
- First implementation of FARM single-node configuration
  - From A&D Technology, Inc.
- CPU unit (x2)
  - AMD Opteron Socket F (Barcelona)
  - DDR2 DIMMs x 2
- FPGA unit (x1)
  - Altera Stratix II, SRAM, DDR
- Each unit is a board
- All units connected via cHT backplane
  - Coherent HyperTransport (ver 2)
  - We implemented cHT compatibility for the FPGA unit (next slide)

Base FARM Components
[Figure: block diagram of the Procyon system. Recoverable labels: Altera Stratix II FPGA (132k logic gates); MMR IF, Cache IF, and Data Stream IF toward TMACC; configurable coherent cache; data transfer engine; cHTCore; HyperTransport (PHY, LINK); AMD Barcelona with 2MB L3 shared cache; link annotations 32 Gbps, 6.4 Gbps, ~60ns, ~380ns]
* cHTCore is from University of Heidelberg
- FPGA unit = communication logic + user application
- Three interfaces for user application
  - Coherent cache interface
  - Data stream interface
  - Memory mapped register interface

FARM: A Prototyping Environment for Tightly-Coupled, Heterogeneous Architectures. Tayo Oguntebi et al. FCCM 2010.

Communication
- Sending addresses
  - FARM’s streaming interface
  - Address range marked as “write-combining” causes non-temporal store
  - As close to “fire-and-forget” as is available
  - 630MB/s
- Commit request
  - Read from memory mapped register
  - Approx. 700ns, 1000s of cycles!
- Violation notification
  - FPGA writes to cacheable address
  - Common case of no violation is fast, just like a cache hit for the processor

[Figure: message timeline among Thread1, Thread2, and the TMACC HW. Messages shown: Read A, Read B, To write B, OK to commit?, Yes, You’re violated]

Implementation Result
- Full prototype of both TMACC schemes on FARM
- HW resource usage

               Common       TMACC-GE     TMACC-LE
  4Kb BRAM     144 (24%)    256 (42%)    296 (49%)
  Registers    16K (15%)    24K (22%)    24K (22%)
  LUTs         20K          30K          35K

  FPGA         Altera Stratix II EPS130 (-3)
  Max freq.    100 MHz

Microbenchmark Analysis
- Two random array accesses
  - Partitioned (non-conflicting)
  - Fully-shared (possible conflicts)
- Free from pathologies and 2nd-order effects
- Decouple effects of parameters
  - Size of working set (A1)
  - Number of reads/writes (R, W)
  - Degree of conflicts (C, A2)

Parameters: A1, A2, R, W, C

    TM_BEGIN
    for I = 1 to (R + W) {
        p = R / (R + W)    /* probability that an access is a read */
        /* Non-conflicting access: each thread stays in its own partition */
        a1 = rand(0, A1 / N) + tid * A1 / N;
        if (rand_f(0, 1) < p)
            TM_READ( Array1[a1] )
        else
            TM_WRITE( Array1[a1] )
        /* Conflicting access: all threads share Array2 */
        if (C) {
            a2 = rand(0, A2);
            if (rand_f(0, 1) < p)
                TM_READ( Array2[a2] )
            else
                TM_WRITE( Array2[a2] )
        }
    }
    TM_END

EigenBench: A Simple Exploration Tool for Orthogonal TM Characteristics. Sungpack Hong et al. IISWC 2010

Microbenchmark Results
[Figure: two plots, “Working set size” and “Transaction size”; one annotation reads “~10%”]
- The knee is overflowing the cache
- All violations are false positives
- Constant spread out of speedup
- Sharp decrease in performance for small transactions
- TMACC-LE begins to suffer from false positives

Microbenchmark Results
[Figure: two plots, “Write set size” and “Number of threads”; annotations read “~22%” and “+76%”]
- Medium sized transactions scale well
- TMACC-GE suffers from lock migration as the number of writes goes up
- Small transactions are not accelerated
- TL2 suffers across chip boundary

STAMP Benchmark Results
[Figure: two plots, “Genome” and “Vacation”; annotations read “+85%” and “+50%”]
- Transactions with few conflicts, a lot of reads, and few writes
- Bread and butter of transactional memory apps
- Barrier overhead is the primary cause of slowdown in TL2

STAMP Benchmark Results
[Figure: two plots, “K-means low” and “K-means high”; one annotation reads “-8%”]
- Few reads per transaction
- Violations are the dominating factor
- Not much room for acceleration
- Still not many reads to accelerate
- Large number of writes
- Hurts TMACC-GE

Prototype vs. Simulation
- Simulated processor greatly exaggerated the penalty from extra instructions
  - Modern processors are much more tolerant of extra instructions in the read barriers
- Simulated interconnect did not model variable latency and command reordering
  - No need for epochs, etc.
- Real hardware doesn’t have “fire-and-forget” stores
  - We didn’t model the write-combining buffer
- Smaller data sets looked very different
  - Bandwidth consumption, TLB pressure, etc.

Summary: TMACC
- A hardware accelerated TM scheme
  - Offloads conflict detection to external HW
  - Accelerates TM without core modifications
  - Requires careful thinking about handling latency and ordering of commands
- Prototyped on FARM
  - Prototyping gave far more insight than simulation
- Very effective for medium-to-large sized transactions
  - Small transaction performance gets better with ASIC or on-chip implementation
  - Possible future combination with best-effort HTM
