  1. Memory Accesses in Out-of-Order Execution
     Instructor: Nima Honarmand
     Spring 2015 :: CSE 502 – Computer Architecture

  2. Big Picture
     [Block diagram of the OoO pipeline: FETCH (I-cache, branch predictor, instruction flow), DECODE (instruction buffer), EXECUTE (integer, floating-point, media, and memory units; register data flow), COMMIT (reorder buffer (ROB); memory data flow through the store queue and D-cache)]

  3. OoO and Memory Instructions
     • Memory instructions benefit from out-of-order execution just like other instructions
     • It is especially important to execute loads as soon as their addresses are known
       – Loads sit at the top of dependence chains
     • To enable precise state recovery, stores are sent to the D$ only after retirement
       – This is also sufficient to prevent wrong-branch-path stores
     • Loads can be issued out of order w.r.t. other loads and stores if there is no dependence

  4. OoO and Memory Instructions
     • Memory instructions have the same 3 types of dependences as register-based instructions
       – RAW (true), WAR and WAW (false)
     • However, memory-based dependences are dynamic
       – They depend on program state and can change as the program executes
       – Unlike register-based dependences
     Example (cycle numbers in parentheses):
       Load  R3 = 0[R6]     issues (1), cache miss (1), miss serviced (4)
       Add   R7 = R3 + R9   issues (5)
       Store R4 → 0[R7]     issues (6)
       Sub   R1 = R1 – R2   issues (1)
       Load  R8 = 0[R1]     issues (3), cache hit (3): the younger load executed before the older store!
     • If [R1] != [R7], the load and store are independent → correct execution
     • If [R1] == [R7], the load and store are dependent → incorrect execution
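     A minimal C++ sketch (my own illustration, not from the slides) of why memory dependences are dynamic: whether the store and the load below collide depends entirely on pointer values computed at run time, so the dependence cannot be resolved from the program text alone.

         #include <cstdio>

         int buf[2] = {10, 20};

         // The store through p and the load through q form a RAW dependence
         // only when p and q happen to hold the same address at run time.
         int store_then_load(int* p, const int* q) {
             *p = 42;      // store
             return *q;    // load
         }

         int main() {
             std::printf("%d\n", store_then_load(&buf[0], &buf[1])); // no alias: prints 20
             std::printf("%d\n", store_then_load(&buf[0], &buf[0])); // alias:    prints 42
         }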

  5. Basic Concepts
     • Memory aliasing: two memory references involve the same memory location (their addresses collide)
     • Memory disambiguation: determining whether two memory references will alias or not
       – i.e., whether there is a dependence between them
       – Requires computing the effective addresses of both references
     • We say a memory op is performed when it is done in the D$
       – Loads perform in the Execute (X) stage
       – Stores perform in the Retire (R) stage

  6. Scheme 1: In-Order Loads/Stores
     • Perform all loads and stores in order with respect to each other
       – They can still execute out of order with respect to other types of instructions
     • Pessimistic: assumes a dependence between every pair of memory ops

  7. Load/Store Queue (LSQ)
     • Operates as a circular FIFO
     • Load and store instructions are kept in program order
       – Allocate an entry on dispatch
       – De-allocate on retirement
     • For each instruction, an entry contains:
       – "Type": instruction type (S or L)
       – "Addr": memory address, generated in dataflow order and copied into the LSQ
       – "Val": data for stores, also generated in dataflow order and copied into the LSQ
     • You can think of the LSQ as the RS for memory ops
       – i.e., each entry also contains tags and other RS stuff
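     Below is a minimal C++ sketch of an LSQ modeled as a circular FIFO; the field names and sizes are my own assumptions for illustration, not the exact design on the slide.

         #include <array>
         #include <cstddef>
         #include <cstdint>
         #include <optional>

         // One LSQ entry: "Type" (load or store), with "Addr" and "Val" filled
         // in dataflow order, i.e. possibly long after the entry was allocated.
         struct LSQEntry {
             bool is_store = false;
             std::optional<uint64_t> addr;   // unknown until address generation
             std::optional<uint64_t> val;    // store data, unknown until produced
         };

         template <std::size_t N>
         struct LSQ {
             std::array<LSQEntry, N> slots{};
             std::size_t head = 0, tail = 0, count = 0;

             // Dispatch: allocate an entry at the tail, in program order.
             int allocate(bool is_store) {
                 if (count == N) return -1;          // LSQ full: dispatch stalls
                 slots[tail] = LSQEntry{is_store};
                 int idx = static_cast<int>(tail);
                 tail = (tail + 1) % N;
                 ++count;
                 return idx;
             }

             // Retirement: free the entry at the head.
             void free_head() {
                 head = (head + 1) % N;
                 --count;
             }
         };

         int main() {
             LSQ<16> lsq;
             int ld = lsq.allocate(false);           // dispatch a load
             lsq.slots[ld].addr = 0x1000;            // address arrives later, in dataflow order
             lsq.free_head();                        // retire
         }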

  8. Scheme 1: In-Order Loads/Stores
     • Only the instruction at the LSQ head can perform, and only if it is ready
       – A load can perform whenever it is ready
       – A store can perform only if it is also at the ROB head and ready
     • Stores are held for all previous instructions
       – Because they perform in the R stage
     • Loads are only held for older stores
     • Easy to implement, but kills most of the OoO benefit → significant performance hit

  9. Scheme 1 "Pipeline"
     • Stores
       – Dispatch (D): allocate entry at the LSQ tail
       – Execute (X): calculate and write address and data into the corresponding LSQ slot
       – Retire (R): write address/data from the LSQ head to the D$, free the LSQ head
     • Loads
       – Dispatch (D): allocate entry at the LSQ tail
       – Addr Gen (G): calculate and write address into the corresponding LSQ slot
       – Execute (X): send the load to the D$ if it is at the head of the LSQ
       – Retire (R): free the LSQ head

 10. Scheme 2: Load Bypassing
     • Loads can be allowed to bypass older stores (if there is no aliasing)
       – Requires checking the addresses of older stores
       – Those addresses must already be known in order to check
     • To implement, use a separate load queue (LQ) and store queue (SQ)
       – Think of them as separate RSes for loads and stores
     • Need to know the relative order of instructions in the queues
       – "Age": a new field added to both queues
       – Age represents the position of the load/store in the program
       – A simple counter incremented during in-order dispatch (for now)

 11. Scheme 2: Load Bypassing
     • Loads: for the oldest ready load in the LQ, check the "Addr" of older stores in the SQ
       – If any older store has an uncomputed or matching "Addr", the load cannot issue (it waits)
       – The SQ check happens in parallel with accessing the D$/TLB
       – Requires associative memory (CAM) search of the SQ
     • Stores: can always execute when at the ROB head
     [Figure: the load's address and age are compared against every SQ entry (address, value, and age fields, with head/tail pointers); a match or an unknown older address makes the load wait]
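     A sketch of the Scheme 2 check, using assumed SQ entry fields (age plus an optional address): the load may issue only if no older store has an unknown or matching address. Hardware does this as a parallel CAM search; a sequential scan expresses the same predicate.

         #include <cstdint>
         #include <optional>
         #include <vector>

         struct SQEntry {
             uint64_t age;                        // program-order position from dispatch
             std::optional<uint64_t> addr;        // empty until the store's address is computed
         };

         // Scheme 2 (load bypassing): the load may bypass older stores only when
         // every older store has a known, non-matching address.
         bool load_may_issue(const std::vector<SQEntry>& sq,
                             uint64_t load_age, uint64_t load_addr) {
             for (const SQEntry& st : sq) {
                 if (st.age >= load_age) continue;          // younger store: irrelevant
                 if (!st.addr || *st.addr == load_addr)     // unknown or matching address
                     return false;                          // load must wait
             }
             return true;
         }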

 12. Scheme 3: Load Forwarding + Bypassing
     • Loads: can be satisfied directly from a store in the store queue on an address match
       – Provided the store data is already available
       – Avoids waiting until the store is sent to the cache
     • Stores: can always execute when at the ROB head
     [Figure: the load's address is compared against the SQ entries; on a match, the matching store's value is forwarded as the load's data out]
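     And a sketch of the Scheme 3 extension: on an address match with an older store whose data is ready, the load takes its value straight from the SQ instead of waiting for the store to reach the cache. The entry fields and return convention are my own; if nothing forwards, the load still goes to the D$ (or waits, per the Scheme 2 check).

         #include <cstdint>
         #include <optional>
         #include <vector>

         struct SQEntry {
             uint64_t age;
             std::optional<uint64_t> addr;        // empty until computed
             std::optional<uint64_t> val;         // empty until the store data is produced
         };

         // Scheme 3 (forwarding + bypassing): return the value from the youngest
         // older matching store with ready data, if any; otherwise nullopt.
         std::optional<uint64_t> try_forward(const std::vector<SQEntry>& sq,
                                             uint64_t load_age, uint64_t load_addr) {
             const SQEntry* best = nullptr;
             for (const SQEntry& st : sq) {
                 if (st.age >= load_age || !st.addr || *st.addr != load_addr) continue;
                 if (!best || st.age > best->age) best = &st;   // keep the youngest older match
             }
             if (best && best->val) return best->val;           // forward store data to the load
             return std::nullopt;
         }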

 13. Schemes 2 & 3 "Pipeline"
     • Stores
       – Dispatch (D): allocate entry at the SQ tail and record age
       – Execute (X): calculate and write address and data into the corresponding SQ slot
       – Retire (R): write address/data from the SQ head to the D$, free the SQ head
     • Loads
       – Dispatch (D): allocate entry at the LQ tail and record age
       – Addr Gen (G): calculate and write address into the corresponding LQ slot
       – Execute (X): send the load to the D$ when the D$ is available and check the SQ for aliasing stores
       – Retire (R): free the LQ head

 14. Scheme 4: Loads Execute When Ready
     • Drawback of the previous schemes:
       – Loads must wait for all older stores to compute their "Addr" (i.e., to "execute")
     • Alternative: let loads go ahead even if there are older stores with uncomputed "Addr"
       – The most aggressive scheme
       – Greatest potential IPC: loads never stall
     • A form of speculation: speculate that the uncomputed stores are to other addresses
       – Relies on the fact that aliases are rare
       – Potential for incorrect execution
       – Need to be able to "undo" bad loads
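     A sketch of what changes in the load's issue check under Scheme 4: older stores with unknown addresses are simply ignored (speculated past), while known matches still forward or stall the load. The entry fields and the decision encoding are assumptions carried over from the earlier sketches; violations caused by wrong speculation are caught later, when the stores execute.

         #include <cstdint>
         #include <optional>
         #include <vector>

         struct SQEntry {
             uint64_t age;
             std::optional<uint64_t> addr;
             std::optional<uint64_t> val;
         };

         enum class LoadAction { IssueToCache, Forward, Wait };

         struct LoadDecision {
             LoadAction action;
             std::optional<uint64_t> data;        // valid when action == Forward
         };

         // Scheme 4: speculate past older stores with uncomputed addresses.
         // For brevity this takes the first known match; real hardware selects
         // the youngest older matching store.
         LoadDecision issue_load(const std::vector<SQEntry>& sq,
                                 uint64_t load_age, uint64_t load_addr) {
             for (const SQEntry& st : sq) {
                 if (st.age >= load_age) continue;            // younger store: irrelevant
                 if (!st.addr) continue;                      // unknown address: speculate past it
                 if (*st.addr == load_addr)
                     return st.val ? LoadDecision{LoadAction::Forward, st.val}
                                   : LoadDecision{LoadAction::Wait, std::nullopt};
             }
             return {LoadAction::IssueToCache, std::nullopt}; // possibly mis-speculated
         }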

 15. Detecting Ordering Violations
     • Case 1: an older store executes before a younger load
       – No problem; the hardware from Scheme 3 takes care of this
     • Case 2: an older store executes after a younger load
       – The store scans all younger loads in the LQ
       – An address match means an ordering violation → flush
       – Requires associative search of the LQ
     [Figure: the store's address and age are compared against every LQ entry (address and age fields, with head/tail pointers); a match raises the flush signal]
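     A sketch of the Case 2 scan, with assumed LQ entry fields: when a store executes, any younger load that has already performed to the same address is an ordering violation and must trigger recovery.

         #include <cstdint>
         #include <optional>
         #include <vector>

         struct LQEntry {
             uint64_t age;                         // program-order position
             std::optional<uint64_t> addr;         // known once the load generated its address
             bool performed = false;               // the load has already read the D$ or SQ
         };

         // Returns the ages of loads that must be flushed: any younger,
         // already-performed load whose address matches the executing store.
         std::vector<uint64_t> find_violations(const std::vector<LQEntry>& lq,
                                               uint64_t store_age, uint64_t store_addr) {
             std::vector<uint64_t> victims;
             for (const LQEntry& ld : lq) {
                 if (ld.age <= store_age) continue;                 // not younger: not a victim
                 if (ld.performed && ld.addr && *ld.addr == store_addr)
                     victims.push_back(ld.age);                     // ordering violation
             }
             return victims;
         }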

 16. Scheme 4 "Pipeline"
     • Stores
       – Dispatch (D): allocate entry at the SQ tail and record age
       – Execute (X): calculate and write address and data into the corresponding SQ slot
       – Retire (R): write address/data from the SQ head to the D$, free the SQ head; check the LQ for potential aliases and initiate "recovery" if necessary
     • Loads
       – Dispatch (D): allocate entry at the LQ tail and record age
       – Addr Gen (G): calculate and write address into the corresponding LQ slot
       – Execute (X): send the load to the D$ when the D$ is available and check the SQ for aliasing stores
       – Retire (R): free the LQ head

 17. Dealing with Misspeculations
     • The misspeculated load is not the only thing that is wrong
       – The load has propagated a wrong value to all of its dependents
       – These must somehow be re-executed as well
     • Easiest: flush all instructions after (and including?) the misspeculated load, and just refetch
       – The load then uses the forwarded value
       – The correct value is propagated when the instructions re-execute
     • Flushing the pipeline has very high overhead

 18. Lowering Flush Overhead (1)
     • Selective re-execution: re-execute only the dependent instructions
       – The ideal case w.r.t. maintaining high IPC
       – No need to re-fetch/re-dispatch/re-rename/re-execute everything else
     • Very complicated
       – Need to hunt down only the data-dependent instructions
       – Some bad instructions have already executed (now in the ROB)
       – Some bad instructions have not executed yet (still in the RS)
     • Pentium 4 does something like this (called "replay")

 19. Lowering Flush Overhead (2)
     • Observation: the loads/stores that cause violations are "stable"
       – Dependences are mostly program-based, and the program doesn't change
     • Alias prediction: predict which load/store pairs are likely to alias
       – Use a hybrid scheme
       – Predict which loads, or load/store pairs, will cause violations
       – Use Scheme 3 for those, Scheme 4 for the rest
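     A minimal sketch of the hybrid idea, assuming a PC-indexed table with saturating counters (my own simplification, not the predictor described on the slide or a real store-set predictor): loads whose PC has recently caused violations take the conservative Scheme 3 path, everything else issues aggressively under Scheme 4.

         #include <cstdint>
         #include <unordered_map>

         // Tiny alias predictor: one confidence counter per load PC. Strengthened
         // when the load causes an ordering violation, weakened when speculation
         // turns out to be correct.
         class AliasPredictor {
             std::unordered_map<uint64_t, uint8_t> table_;   // load PC -> counter
         public:
             bool predict_alias(uint64_t load_pc) const {
                 auto it = table_.find(load_pc);
                 return it != table_.end() && it->second >= 2;   // use Scheme 3 for this load
             }
             void on_violation(uint64_t load_pc) {               // load was flushed
                 uint8_t& c = table_[load_pc];
                 if (c < 3) ++c;
             }
             void on_correct(uint64_t load_pc) {                 // speculation was fine
                 auto it = table_.find(load_pc);
                 if (it != table_.end() && it->second > 0) --it->second;
             }
         };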
