SLIDE 1

Day 3

Advanced Vector Architectures

Session A: Vector Instruction Execution Pipelines
Break
Session B: Vector Flag Processing & Vector Register Files
Lunch
Session C: Virtual Processor Caches
Break
Session D: Vector IRAM

Vector Instruction Execution Pipelines

Main issues:

  • Hiding/tolerating memory latency
  • Handling exceptions
  • Avoiding complexity

slide-2
SLIDE 2

Tolerating Memory Latency with Short Chimes

With short chimes, a vector machine can use the same latency-tolerance techniques as scalar processors (and also multithreading with parallel threads):

  • Hardware or software prefetch: request the data earlier
  • Static scheduling: move the load earlier in the instruction stream
  • Dynamic scheduling / out-of-order execution: execute the add later [Espasa, PhD '97]
  • Decoupled pipeline

[Figure: an instruction stream in which a Load is separated from its dependent Add by the memory latency, annotated with each technique above.]

Vectors allow simple control logic to buffer 1000s of outstanding operations

Tolerating Memory Latency with Vectors

[Figure: traditional in-order vector issue pipeline versus decoupled vector pipeline (Espasa, PhD '97); full out-of-order issue is also possible (Espasa, PhD '97). In the in-order pipeline, VLD v1 chained into VMUL v2,v1,r1 leaves the issue stage blocked for the whole memory latency. In the decoupled pipeline, an instruction queue and a load data queue keep the issue stage free.]

SLIDE 3

Memory Latency and Short Vectors

Vector instruction sequence: VLD v1; VMUL v2,v1,r1; VLD v3; VMUL v4,v3,r2

[Figure: instruction execution in time. Cray-style: VLD v1 address generation, VLD v1 data return after the memory latency, then VMUL v2,v1,r1; only then VLD v3 address, VLD v3 data, and VMUL v4,v3,r2, so the memory system and the multiplier each sit idle for long stretches. Decoupled pipeline: the VMULs and the returning VLD data are enqueued, so the address generator, data bus, and multiplier stay busy and the second load's address generation overlaps the first load's data return.]
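To make the contrast concrete, here is a minimal timing sketch in Python for the four-instruction sequence above; all latencies are assumed for illustration, not taken from the slides:

    # Minimal timing sketch with assumed latencies: contrast Cray-style
    # in-order issue with a decoupled pipeline on the sequence
    #   VLD v1; VMUL v2,v1,r1; VLD v3; VMUL v4,v3,r2
    MEM_LAT = 20   # assumed memory latency, in clocks
    VLEN    = 8    # assumed clocks to stream one vector through a unit

    def cray_style():
        # Each load's addresses, data return, and chained multiply serialize,
        # and the next load cannot start until the multiply has issued.
        t = 0
        for _ in range(2):       # two load+multiply pairs
            t += VLEN            # generate load addresses
            t += MEM_LAT         # wait for data
            t += VLEN            # chained multiply drains
        return t

    def decoupled():
        # Address generation for the second load overlaps the first load's
        # data return; multiplies drain from the load data queue.
        last_data = VLEN + MEM_LAT + VLEN   # second load finishes returning
        return last_data + VLEN             # final chained multiply drains

    print("Cray-style:", cray_style(), "clocks")  # 2*(8+20+8) = 72
    print("Decoupled: ", decoupled(), "clocks")   # 8+20+8+8 = 44

Under these assumptions the decoupled version saves one full memory latency simply by keeping the address generator busy while data is in flight.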

Decoupled Pipeline Issues

Latencies:

  • Decoupling hides memory latency in most cases but exposes it in others, notably scalar unit reads of vector unit state (scatter/gather indices, load/store masks).

Exceptions:

  • IEEE floating-point
  • Page faults for virtual memory

[Figure: scalar pipe (F D X M W) feeding instruction queues, a vector load pipe, and a vector arithmetic pipe that together span the memory latency.]

SLIDE 4

Vector IEEE Floating-Point Model

Vector FP arithmetic instructions never cause machine traps

  • (Except in special debugging modes)
  • IEEE default results handled without user-visible traps (unlike Alpha)
  • Largest expense is hardware subnormal handling

Vector FP exceptions signaled by writes to vector flag registers

  • Reserve 5 vector flag registers to receive exception information:

Invalid, DivideByZero, Overflow, Underflow, Inexact

User trap handlers: inline conditional code or trap barrier

  • Use normal vector conditional execution to handle vector FP exceptions
  • Explicit trap barrier instruction checks flags and takes precise trap

Full IEEE support at full speed in deep vector pipeline
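As a concrete illustration of the model (semantics assumed here, not quoted from any ISA), a vector divide can write IEEE default results plus a DivideByZero flag vector, and a user handler can then patch the flagged elements with ordinary masked operations instead of taking a trap:

    # Sketch of trap-free vector FP: the operation returns IEEE default
    # results and signals exceptions through a per-element flag vector.
    def vdiv_with_flags(va, vb):
        results, div_by_zero = [], []
        for x, y in zip(va, vb):
            if y == 0.0:
                # IEEE default result for x/0 (x != 0): signed infinity.
                results.append(float("inf") if x >= 0 else float("-inf"))
                div_by_zero.append(1)   # set exception flag for this element
            else:
                results.append(x / y)
                div_by_zero.append(0)
        return results, div_by_zero

    q, flags = vdiv_with_flags([1.0, -2.0, 3.0], [2.0, 0.0, 4.0])
    # Inline conditional code: fix up flagged elements under the flag mask,
    # rather than trapping. (A trap barrier would instead check the flags
    # and take a precise trap here.)
    q = [0.0 if f else v for v, f in zip(q, flags)]
    print(q, flags)   # [0.5, 0.0, 0.75] [0, 1, 0]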

Short-Running Vector Instructions Simplify Virtual Memory

  • Address translate/check (C) of whole vector takes only 4-8 clocks
  • Overlap checks with memory latency - no added latency for VM
  • Buffer following instructions until address check complete
  • For in-order machine, short vectors limit size of state to save/restore
  • For out-of-order machine, short vectors limit reorder buffer size

[Figure: scalar pipe (F D X M W) dispatching vector instructions into a pre-address-check instruction queue, an address translate/check stage (C) that can raise a page fault within a few clock cycles, a committed instruction queue, and a load data queue covering the many clock cycles of memory latency.]

SLIDE 5

Instruction Queue Design

[Figure: a single circular buffer of vector memory instruction records (PC, instruction, vlen, scalar operands) between dispatch and issue, with PAIQ head/tail, ACIQ head, and CIQ head pointers delimiting the pre-address-check, address-check, and committed regions.]
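A minimal sketch (data layout assumed) of the figure's single circular buffer, where the three queues are just regions between pointers, so moving an instruction from one queue to the next only advances a pointer:

    # Sketch: one ring buffer holds vector memory instructions; PAIQ, ACIQ,
    # and CIQ are regions delimited by head pointers.
    class VectorMemQueues:
        def __init__(self, size=16):
            self.buf = [None] * size
            self.size = size
            self.ciq_head  = 0   # oldest committed instruction
            self.aciq_head = 0   # oldest instruction awaiting address check
            self.paiq_head = 0   # oldest instruction awaiting address generation
            self.paiq_tail = 0   # next free slot for dispatch

        def dispatch(self, inst):                     # enter PAIQ
            assert (self.paiq_tail + 1) % self.size != self.ciq_head, "full"
            self.buf[self.paiq_tail] = inst
            self.paiq_tail = (self.paiq_tail + 1) % self.size

        def addresses_generated(self):                # PAIQ -> ACIQ
            self.paiq_head = (self.paiq_head + 1) % self.size

        def address_check_passed(self):               # ACIQ -> CIQ
            self.aciq_head = (self.aciq_head + 1) % self.size

        def retire(self):                             # leave CIQ
            self.buf[self.ciq_head] = None
            self.ciq_head = (self.ciq_head + 1) % self.size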

Delayed Pipeline

Replace the queues with a fixed-length instruction delay pipeline: simpler than the decoupled pipeline, with no data buffers. Works best for fixed-latency memory with few collisions.

[Figure: scalar pipe (F D X M W) followed by an instruction delay, with a vector load pipe and a vector store pipe spanning the memory latency and a vector arithmetic pipe alongside.]

Short bypass latencies

SLIDE 6

Out-of-Order Vector Execution

Simpler than scalar out-of-order execution because of the reduced instruction bandwidth. Vector register renaming solves the exception problem, but renaming has problems of its own:

  • Elements beyond vector length (change ISA to mark them undefined)
  • Masked elements (change ISA to leave them undefined; requires merges)
  • Scalar insert into vector register (make it slow so programmers avoid it)

But OOO may not be a big win given more vector registers, a better vector compiler, and a decoupled pipeline (vector loops should be mostly statically schedulable). OOO without vector register renaming may give a small boost (put OOO after address commit).

Day 3, Session B: Vector Flag Processing Model & Vector Register Files

SLIDE 7

Flags are more than Masks

Flags are used for:

  • Conditional Execution (Mask Registers)
  • Reporting Status (Popcount and Count Leading/Trailing Zeros)
  • Exception Reporting (IEEE754 FP)
  • Speculative Execution

Flag Priority Instructions

Goal: avoid the latency of a scalar read-flags then write-new-length sequence.

Approach: generate a mask vector with the correct length. Each instruction reads a flag register and writes a flag register; three forms:

  • Flag-before-first (fbf)
  • Flag-including-first (fif)
  • Flag-only-first (fof)

Also an operation that compresses a flag register: compress-flags (cpf).

[Figure: example source flags and the resulting fbf, fif, fof, and cpf flag vectors.]
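A minimal sketch of the three priority forms (semantics inferred from the names; the boundary behavior when no flag is set is my assumption), operating on a Python list of 0/1 flags:

    # Relative to the first set bit in the source flags:
    #   fbf sets every position strictly before it,
    #   fif sets every position up to and including it,
    #   fof sets only that position itself.
    # If no bit is set, fbf and fif are assumed to set all positions.
    def first_set(flags):
        return next((i for i, f in enumerate(flags) if f), len(flags))

    def fbf(flags):
        k = first_set(flags)
        return [int(i < k) for i in range(len(flags))]

    def fif(flags):
        k = first_set(flags)
        return [int(i <= k) for i in range(len(flags))]

    def fof(flags):
        k = first_set(flags)
        return [int(i == k) for i in range(len(flags))]

    src = [0, 0, 1, 0, 1, 0]
    print(fbf(src))  # [1, 1, 0, 0, 0, 0]
    print(fif(src))  # [1, 1, 1, 0, 0, 0]
    print(fof(src))  # [0, 0, 1, 0, 0, 0]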

SLIDE 8

Vector Register File Design

Construct a high-bandwidth VRF from multiple banks of less heavily multiported memory. Design decisions:

  • form of bank partitioning
  • number of banks versus ports/bank

Bank Partitioning Alternatives

[Figure: three layouts of four vector registers (V0-V3), eight elements each, across four storage banks. Register partitioned: each bank holds all eight elements of one register. Element partitioned: each bank holds the same element indices (modulo the number of banks) of every register. Register and element partitioned: each pair of banks holds two registers, with even and odd elements interleaved across the pair.]
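To pin down the three schemes, a small sketch (toy configuration assumed: 4 registers of 8 elements across 4 banks) of the mapping from register r and element e to a bank:

    # Toy mappings for the three partitioning schemes shown above.
    NREGS, NELEMS, NBANKS = 4, 8, 4

    def register_partitioned(r, e):
        # Each bank holds every element of one register.
        return r % NBANKS

    def element_partitioned(r, e):
        # Each bank holds the same element indices of every register.
        return e % NBANKS

    def register_and_element_partitioned(r, e):
        # A pair of banks holds two registers, with elements interleaved
        # even/odd across the pair.
        return (r // 2) * 2 + (e % 2)

    for fn in (register_partitioned, element_partitioned,
               register_and_element_partitioned):
        print(fn.__name__)
        for r in range(NREGS):
            print(" ", [fn(r, e) for e in range(NELEMS)])

The mapping determines which accesses can proceed in parallel: register partitioning keeps accesses to different registers conflict-free, while element partitioning lets consecutive elements of one register stream from different banks.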

SLIDE 9

Vector Register File: 1 Lane Example

[Figure: four element banks (0-3), each with write word selects and read X/read Y word selects, shared by VAFU0, VAFU1, and VMFU through per-unit write selects and read enables.]

Multiported Storage Cells

1R+1W, 2R+1W, 3R+2W, and 5R+3W cells (all designs double-pumped)

SLIDE 10

Vector Regfile: Design Comparison

All designs provide 256 64-bit elements and 5R+3W ports.

  Cell:   5R+3W   3R+2W   2R+1W   2R+1W   1R+1W
  Width:  1       1       1       2       2

Day 3, Session C: Virtual Processor (VP) Caches

  • Highly parallel primary caches for vector units
  • Reduce bandwidth demands on main memory
  • Convert strided and scatter/gather operations to unit-stride

Two forms:

  • Rake Cache (Spatial VP Cache)
  • Histogram Cache (Temporal VP Cache)

SLIDE 11

Virtual Processor Paradigm

[Figure: a scalar unit with integer registers r0-r7 and float registers f0-f7 next to a vector unit with vector data registers v0-v7 of MAXVL elements and a vector length register VLR. Vector arithmetic instructions such as VADD v3,v1,v2 operate on elements [0]..[VLR-1]; vector load/store instructions such as VLD v1,r1,r2 take a base (r1) and a stride (r2) into memory. Each element position acts as a virtual processor.]
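A tiny executable sketch (semantics assumed) of the two instruction forms in the figure, with each element position acting as a virtual processor:

    MAXVL = 8

    def vadd(v1, v2, vlr):
        # VADD v3,v1,v2: virtual processor i (i < VLR) adds one element pair.
        assert vlr <= MAXVL
        return [v1[i] + v2[i] for i in range(vlr)]

    def vld(mem, base, stride, vlr):
        # VLD v1,r1,r2: virtual processor i loads mem[base + i*stride].
        assert vlr <= MAXVL
        return [mem[base + i * stride] for i in range(vlr)]

    mem = list(range(100))
    v1 = vld(mem, base=4, stride=3, vlr=5)   # [4, 7, 10, 13, 16]
    v3 = vadd(v1, v1, vlr=5)                 # [8, 14, 20, 26, 32]
    print(v1, v3)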

Many Useful Vector Algorithms use Virtual Processor Paradigm

Developed by Blelloch et al., CMU SCANDAL group:

  • Sorting
  • Sparse Matrix-Vector Multiply
  • Connected Components
  • Linear Recurrences
  • List Ranking

But these algorithms make frequent scatter/gather and non-unit-stride accesses, and address bandwidth is expensive:

  • Address Crossbars
  • TLB ports
  • Cache Transactions
  • DRAM Page Breaks
SLIDE 12

Matrix-Vector Multiply

Strided vector accesses overall, but each virtual processor accesses a unit-stride stream: C = A x B.

[Figure: matrix-vector multiply with row-major matrix storage; VP0-VP7 each walk one row of A as a unit-stride stream.]

Rake Cache

KEY IDEA: associate one (or more) cache lines with each virtual processor.

Advantages over a shared cache:

  • Access local to lane, lower energy and compact layout
  • High-bandwidth without multiport or interleaved memories
  • No inter-VP conflicts, power-of-2 stride OK!

[Figure: vector data registers v0-v7, elements [0]..[MAXVL-1], with a separate cache line per VP.]
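An illustrative back-of-the-envelope sketch (sizes assumed) of why per-VP lines cut address traffic: each VP streams unit-stride through its own line, so a new address is needed only once per line, consistent with the up-to-4x figure on the next slide:

    # With an L-word rake cache line, each VP issues one address per L
    # unit-stride elements instead of one per element.
    LINE_WORDS = 4          # assumed rake cache line size, in words

    def rake_address_traffic(n_vps, words_per_vp, line_words):
        per_vp = -(-words_per_vp // line_words)   # ceil: one miss per line
        return n_vps * per_vp

    n_vps, words_per_vp = 8, 64
    no_cache = n_vps * words_per_vp                               # 512 addresses
    rake = rake_address_traffic(n_vps, words_per_vp, LINE_WORDS)  # 128 addresses
    print(no_cache, "->", rake)   # 4x fewer addresses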

SLIDE 13

Rake Cache for Matrix-Vector Multiply

With a 4-word cache line, the rake cache can reduce address bandwidth by up to 4x.

[Figure: C = A x B with a four-word rake cache line per VP; VP0-VP7 each hold one line of their row of A.]

Other Forms of Rake

  • 1D Strided Rake
  • Indexed Rake (parallel structure access)

[Figure: VP0-VP7 access patterns for the strided and indexed rakes.]

SLIDE 14

Rake Cache Design

Explicitly Selected and Indexed

  • Strided and indexed instructions specify use of rake cache (and which line if more than one)

Non-coherent

  • weak vector consistency model, flush at vector memory barrier instructions

Virtually Tagged

  • reduces TLB accesses
  • weak vector consistency model, no problem with synonyms

Per Byte Dirty Bits

  • Avoids the false sharing problem: only write back modified bytes

[Figure: a single rake cache line, virtually tagged with its VPN and holding the translated PPN, a valid bit per line, and a dirty bit per data byte.]
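A minimal sketch (field layout assumed from the bullets above) of one rake cache line with per-byte dirty bits, writing back only modified bytes:

    # One rake cache line: virtually tagged, valid bit per line, dirty bit
    # per byte so only modified bytes are written back (no false sharing).
    class RakeCacheLine:
        def __init__(self, line_bytes=32):
            self.valid = False
            self.vpn = None                      # virtual page number (tag)
            self.ppn = None                      # translated physical page
            self.data = bytearray(line_bytes)
            self.dirty = [False] * line_bytes

        def store_byte(self, offset, value):
            self.data[offset] = value
            self.dirty[offset] = True

        def write_back(self, write_byte):
            # write_byte(ppn, offset, value): only dirty bytes go to memory.
            for i, d in enumerate(self.dirty):
                if d:
                    write_byte(self.ppn, i, self.data[i])
            self.dirty = [False] * len(self.dirty)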

Rake Cache Implementation

[Figure: rake cache implementation. An address generator combines base, stride, and rake index; the virtual page number is compared against the rake line's VTag (page hit?) and the cache index against an index FIFO (line hit?). Hits move 256-bit data between the cache and the vector register file over the data bus; misses use the stored PPN to send the physical address over the physical address bus, writing back dirty data as needed.]

SLIDE 15

VP Cache Performance

Garbage collection in SPECint95 li:

  • Rake cache removes 37% of address traffic

OpenGL rasterization (Aaron Brown, UCB):

  • Rake cache removes 68% of address traffic

Radix sorting:

  • Rake cache + histogram cache remove 78% of address traffic and 57% of data traffic

IDEA encryption:

  • Rake cache removes 74% of address traffic

All assuming a 1KB rake cache and a 32KB histogram cache.

Day 3, Session D: Vector IRAM (UC Berkeley)

  • Profs. Patterson, Yelick, Kubiatowicz

http://iram.cs.berkeley.edu/

SLIDE 16

Key Observation

DRAM capacity increases 4x per generation (every 3 years), while DRAM cost/bit decreases 2x per generation => fewer DRAMs per system at constant cost: constant dollars buy only 2x more bits per generation while each chip holds 4x more, so the chip count halves each generation.

IRAM Will Engulf Low-End Market

[Figure: system memory capacity (8MB up to 2GB) versus DRAM/IRAM generation (64Mb up to 16Gb), comparing a constant-system-price curve against the largest practical IRAM; IRAM-PDA, VIRAM-1, and IRAM-PC mark where single-chip IRAM overtakes the low-end system.]

SLIDE 17

IRAM

Put processor and DRAM main memory on same die

  • Memory latency 5x-10x improvement
  • Memory bandwidth 50-100x improvement
  • Lower power
  • Smaller board size

What Type of Processor for IRAM?

  • Must be small to leave room for DRAM
  • Must convert 100s GB/s memory bandwidth into application speedups
  • Desire low power

=> Vector!

Vector IRAM: Pocket Supercomputer

  • 16-32MB on-chip DRAM
  • Scalar unit + I/D caches
  • Vector unit
  • 12.8 GB/s memory bandwidth
  • 1.6 GFLOPS (64-bit), 3.2 GFLOPS (32-bit), 6.4 GOPS (16-bit)
  • Single die, 256Mb merged DRAM/logic technology
  • Serial I/O lines: 8 x 1Gb/s
  • 200 MHz, ~2W

SLIDE 18

VIRAM-1 Block Diagram

[Figure: VIRAM-1 block diagram: a scalar unit with instruction cache, scalar data cache, and write buffer; vector and flag registers feeding two vector flag functional units (VFFUs), two vector arithmetic units (VAFU0, VAFU1), and two vector memory units (VMFU0, VMFU1); all connected to on-chip DRAM and I/O.]

Vector Memory Subsystem

[Figure: CPU with I$ and D$ plus I/O connected through crossbars (64-bit scalar and 256-bit vector paths) to two wings of DRAM banks, each bank split into sub-banks.]

Total memory: 32MB. 2 wings x 8 banks x 8 sub-banks = 128 sub-banks. Each sub-bank is 1K rows by 2K bits. Each column access returns 256 bits. Total random-access memory latency is 12-15 clocks at 200MHz.
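A quick check of the capacity arithmetic:

    # 2 wings x 8 banks x 8 sub-banks, each 1K rows x 2K bits.
    subbanks = 2 * 8 * 8                    # = 128 sub-banks
    total_bits = subbanks * 1024 * 2048     # rows x bits per row
    print(total_bits // (8 * 2**20), "MB")  # -> 32 MB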

SLIDE 19

VMFU Questions

How many and what type of load/store units? How many addresses per cycle per unit?

Vector Load and Store Components

Load:  generate address(es) -> translate address(es) -> send physical address to DRAM -> access memory -> receive data from DRAM -> write vector register file

Store: generate address(es) -> translate address(es) -> invalidate scalar data cache -> read vector register file -> send physical address to DRAM -> send data to DRAM -> access memory

SLIDE 20

Two Wings Make Two VMFUs Cheap

We already have two data and address crossbars. At the cost of a second address generator, a TLB port, a two-way address multiplexer, and an extra vector register write port, we can add a second, load-only VMFU. VIRAM-1 has one load/store unit plus one load-only unit. Unit-stride operations alternate between wings and so synchronize after the first collision.

[Figure: VMFU0 and VMFU1 alternating accesses between Wing A and Wing B over time.]

Supporting Fast Non-Unit Stride with Many Addresses per Cycle Is Expensive

Need:

  • more address generators
  • more TLB ports
  • more scalar data cache ports (stores only)
  • more address crossbar
  • conflict detection logic
  • more data crossbar control
  • maybe more buffering to smooth out conflicts

=> Not obvious that more addresses per cycle is the best use of silicon.

SLIDE 21

IRAM Memory Subsystem

[Figure: CPU, I$, D$, and I/O sharing the crossbars with VMFU0 and VMFU1; each VMFU has A and B ports into Wing A and Wing B through per-wing reference stages (Ref.A, Ref.B).]

Drive DRAM with Fixed Pipeline

[Figure: fixed DRAM pipeline timing. Each access steps through precharge (PCH), row access (RAS), and column access (CAS) before returning load data. Addresses A0-A3 to an open row pipeline their CAS starts back to back; a row miss leaves the bank busy through a new PCH/RAS sequence, creating a potential structural hazard for the following access B0.]
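A minimal sketch (stage lengths assumed) of the fixed-pipeline scheduling in the figure: row hits start a CAS every clock, while a row miss must wait out the bank's precharge and row access, which is the structural hazard shown:

    # Fixed DRAM pipeline: assumed stage lengths, in clocks.
    PCH, RAS, CAS = 2, 3, 3

    def cas_start_times(accesses):
        # accesses: list of (bank, row) in program order.
        open_row, busy_until = {}, {}
        t, times = 0, []
        for bank, row in accesses:
            if open_row.get(bank) != row:            # row miss
                t = max(t, busy_until.get(bank, 0))  # stall while bank busy
                t += PCH + RAS                       # precharge + open new row
                open_row[bank] = row
            times.append(t)                          # CAS start for this access
            busy_until[bank] = t + CAS
            t += 1                                   # one CAS start per clock
        return times

    # A0-A3 hit one open row and pipeline back to back; B0 misses the row.
    print(cas_start_times([(0, 5), (0, 5), (0, 5), (0, 5), (0, 9)]))
    # -> [5, 6, 7, 8, 16]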
