StVEC: A Vector Instruction Extension for High Performance Stencil Computation

Renji Thomas, Louis-Noël Pouchet, Naser Sedaghati, Radu Teodorescu, P. Sadayappan
Department of Computer Science and Engineering, The Ohio State University
HPC Research Lab: barista.cse.ohio-state.edu
Computer Architecture Lab: arch.cse.ohio-state.edu
October 13th, 2011
Naser Sedaghati (CSE @ Ohio State), PACT’11

Outline
1. Introduction
2. Vectorization of Stencils
3. Enhancing Vector ISA with StVEC
4. Generating Code for StVEC
5. Evaluation
6. Summary

Introduction: Stencil Computation
- Stencil computation
  - Repeat over TIME
  - Sweep over a spatial grid
  - Compute each point from the values of its neighboring points
  - Same grid or multiple grids
- Numerous application domains
  - Finite difference methods for solving PDEs
  - Image processing (e.g. MRI image pipeline)
  - Computational electromagnetics, CFD, numerical relativity, etc.

Introduction: Stencil Computation: An Example
2-D 5-point Jacobi

    for (t = 0; t < TMAX; t++)
      for (i = 1; i < N - 1; i++)
        for (j = 1; j < M - 1; j++)
          B[i][j] = A[i-1][j] + A[i][j-1] + A[i][j] + A[i][j+1] + A[i+1][j];
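The loop nest above can be turned into a small self-contained C sketch. The slide does not say how the output grid feeds the next time step, so the buffer swap between sweeps, the fixed boundary values, the grid size, and the function names `jacobi_sweep` and `jacobi` are all assumptions made for illustration.

```c
#include <assert.h>
#include <string.h>

#define N 6   /* rows (illustrative size, not from the slides) */
#define M 6   /* columns (illustrative size, not from the slides) */

/* One sweep: each interior point of dst is the sum of the
 * 5-point neighborhood of the same point in src. */
static void jacobi_sweep(double src[N][M], double dst[N][M]) {
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < M - 1; j++)
            dst[i][j] = src[i-1][j] + src[i][j-1] + src[i][j]
                      + src[i][j+1] + src[i+1][j];
}

/* Run tmax sweeps. Assumption: the two grids swap roles after every
 * sweep and boundary points keep their initial values. The final
 * result is always left in a. */
static void jacobi(double a[N][M], double b[N][M], int tmax) {
    assert(tmax >= 0);
    memcpy(b, a, sizeof(double) * N * M);   /* give b the same boundary */
    double (*src)[M] = a;
    double (*dst)[M] = b;
    for (int t = 0; t < tmax; t++) {
        jacobi_sweep(src, dst);
        double (*tmp)[M] = src;             /* swap input and output grids */
        src = dst;
        dst = tmp;
    }
    if (src != a)                           /* latest result lives in src */
        memcpy(a, src, sizeof(double) * N * M);
}
```

Calling `jacobi(a, b, TMAX)` on a grid `a` with a scratch grid `b` of the same shape runs the time loop; every sweep reads only from one grid and writes only to the other, which is the independence that makes the inner loop a vectorization candidate.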

Introduction: Short-Vector SIMD
- Identical computation on small chunks of data
  - Independent operations
  - Vector size (width) of 2 to 64
  - Packing operations to form a vector (shuffle, extract, etc.)
- SIMD performance
  - Multiple SIMD units per CPU
  - Maximum speedup equals the vector width
- Ubiquitous features on modern processors
  - x86: SSE, AVX
  - Power: VMX/VSX
  - ARM: NEON
  - Cell SPU

Introduction: Vectorization: An Example
Vector width = 4, N divisible by 4

    for (t = 0; t < T; t++)
      for (i = 4; i < N; i++)
        A[i] = B[i] * B[i];

1: ASM (MIPS-like)

    for (t = 0; t < T; t++)
      for (i = 4; i < N; i++) {
        LD  R1, &B[i]
        MUL R2, R1, R1
        ST  R2, &A[i]
      }

2: 4-way unroll + re-schedule

    for (t = 0; t < T; t++)
      for (i = 4; i < N; i += 4) {
        LD  R1, &B[i]
        LD  R2, &B[i+1]
        LD  R3, &B[i+2]
        LD  R4, &B[i+3]
        MUL R5, R1, R1
        MUL R6, R2, R2
        MUL R7, R3, R3
        MUL R8, R4, R4
        ST  R5, &A[i]
        ST  R6, &A[i+1]
        ST  R7, &A[i+2]
        ST  R8, &A[i+3]
      }

3: Vectorize

    for (t = 0; t < T; t++)
      for (i = 4; i < N; i += 4) {
        VLD  VR1, &B[i]
        VMUL VR2, VR1, VR1
        VST  VR2, &A[i]
      }

Observation: Aligned memory referencing (i.e. B[i]) helps vectorization!
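The three steps above can be mimicked in portable C. This is only a sketch: the `vec4` struct and the helpers `vld`, `vmul`, and `vst` are hypothetical stand-ins for a 4-wide vector register and the VLD/VMUL/VST instructions on the slide, not a real intrinsic API, and the time loop is omitted because the result does not change between iterations.

```c
#include <assert.h>
#include <stddef.h>

#define VW 4                      /* vector width used on the slide */

/* Hypothetical model of one 4-lane vector register. */
typedef struct { float lane[VW]; } vec4;

/* Model of VLD: load VW consecutive floats starting at p. */
static vec4 vld(const float *p) {
    vec4 v;
    for (int k = 0; k < VW; k++) v.lane[k] = p[k];
    return v;
}

/* Model of VMUL: lane-wise product of two registers. */
static vec4 vmul(vec4 a, vec4 b) {
    vec4 r;
    for (int k = 0; k < VW; k++) r.lane[k] = a.lane[k] * b.lane[k];
    return r;
}

/* Model of VST: store VW lanes to consecutive floats starting at p. */
static void vst(float *p, vec4 v) {
    for (int k = 0; k < VW; k++) p[k] = v.lane[k];
}

/* Step 1: the scalar loop, one element per iteration. */
static void square_scalar(float *A, const float *B, size_t n) {
    for (size_t i = 4; i < n; i++)
        A[i] = B[i] * B[i];
}

/* Step 3: the vectorized loop, VW elements per iteration.
 * Every access starts at a multiple of VW, so no lane shuffling
 * or unaligned access is needed. */
static void square_vector(float *A, const float *B, size_t n) {
    assert(n % VW == 0);          /* slide assumes N divisible by 4 */
    for (size_t i = 4; i < n; i += VW) {
        vec4 b = vld(&B[i]);
        vst(&A[i], vmul(b, b));
    }
}
```

Both functions write the same values to `A[4..n-1]`; the vector version simply issues one load, one multiply, and one store per group of four elements.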

Vectorization of Stencils

Vectorization of Stencils: Vectorizing Stencil Computation

    for (t = 0; t < T; t++)
      for (i = 4; i < N; i++)
        A[i] += B[i-1] * B[i];

- Solution 1: load + shuffle
- Solution 2: unaligned load
- Our solution: StVEC (no shuffle, no unaligned load)

[Figure: B[ ] in XMM registers; SSE assembly (N=1024)]
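The first two solutions can be sketched in portable C for a single time step. The `vec4` struct and the helpers `vld`, `vst`, and `shift_in` are hypothetical models of vector registers and instructions rather than real SSE intrinsics, and in this emulation an "unaligned load" is just a load from an address that is not a multiple of the vector width, so the cost difference on real hardware is not visible here.

```c
#include <assert.h>
#include <stddef.h>

#define VW 4

/* Hypothetical model of one 4-lane vector register. */
typedef struct { float lane[VW]; } vec4;

static vec4 vld(const float *p) {
    vec4 v;
    for (int k = 0; k < VW; k++) v.lane[k] = p[k];
    return v;
}

static void vst(float *p, vec4 v) {
    for (int k = 0; k < VW; k++) p[k] = v.lane[k];
}

/* Model of a shuffle: last lane of prev followed by the first
 * three lanes of cur, i.e. the elements B[i-1..i+2]. */
static vec4 shift_in(vec4 prev, vec4 cur) {
    vec4 r;
    r.lane[0] = prev.lane[VW - 1];
    for (int k = 1; k < VW; k++) r.lane[k] = cur.lane[k - 1];
    return r;
}

/* Lane-wise a += x * y. */
static vec4 vmadd(vec4 a, vec4 x, vec4 y) {
    for (int k = 0; k < VW; k++) a.lane[k] += x.lane[k] * y.lane[k];
    return a;
}

/* Reference scalar loop from the slide (one time step). */
static void stencil_scalar(float *A, const float *B, size_t n) {
    for (size_t i = 4; i < n; i++)
        A[i] += B[i-1] * B[i];
}

/* Solution 1: two aligned loads plus a shuffle to form B[i-1..i+2]. */
static void stencil_shuffle(float *A, const float *B, size_t n) {
    assert(n % VW == 0);
    for (size_t i = 4; i < n; i += VW) {
        vec4 prev = vld(&B[i - VW]);
        vec4 cur  = vld(&B[i]);
        vst(&A[i], vmadd(vld(&A[i]), shift_in(prev, cur), cur));
    }
}

/* Solution 2: load B[i-1..i+2] directly from a non-aligned address. */
static void stencil_unaligned(float *A, const float *B, size_t n) {
    assert(n % VW == 0);
    for (size_t i = 4; i < n; i += VW) {
        vec4 left = vld(&B[i - 1]);   /* address not a multiple of VW */
        vec4 cur  = vld(&B[i]);
        vst(&A[i], vmadd(vld(&A[i]), left, cur));
    }
}
```

All three functions produce the same `A`; they differ only in how the shifted operand `B[i-1..i+2]` is obtained, which is the overhead StVEC aims to remove.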

Enhancing Vector ISA with StVEC

Enhancing Vector ISA with StVEC: Execution Model
Building Unaligned Vector Operands
- Idea: build an unaligned operand during register read
- Only one unaligned operand suffices for stencils
- Build the unaligned operand (i.e. VOPRx) from two source registers: base and extension

16x128-bit vector register file; base = VR1, extension = VR14

    source offset    VOPRx
    0                X1, 0:4 (aligned)    VR1
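A rough C model of building the operand from a base and an extension register is given below. The slide only shows the offset-0 case, where the operand equals the base register, so the behavior for nonzero offsets is an assumption here: lane k of the operand is taken from position offset + k of the base register followed by the extension register. The names `vreg` and `build_operand` are hypothetical.

```c
#include <assert.h>

#define VW 4   /* four 32-bit lanes in a 128-bit register */

/* Hypothetical model of one 128-bit vector register. */
typedef struct { float lane[VW]; } vreg;

/* Build the operand VOPRx from two source registers.
 * Assumed semantics (not confirmed by the slides): treat base and ext
 * as one 2*VW-lane sequence and select VW lanes starting at offset.
 * offset == 0 returns base unchanged, matching the aligned row shown. */
static vreg build_operand(vreg base, vreg ext, int offset) {
    assert(offset >= 0 && offset <= VW);
    vreg out;
    for (int k = 0; k < VW; k++) {
        int src = offset + k;
        out.lane[k] = (src < VW) ? base.lane[src] : ext.lane[src - VW];
    }
    return out;
}
```

Under this assumption, a stencil operand that straddles two aligned registers can be selected at register-read time, with no separate shuffle instruction and no unaligned memory access.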
