StVEC: A Vector Instruction Extension for High Performance Stencil Computation

SLIDE 1

StVEC: A Vector Instruction Extension for High Performance Stencil Computation

Naser Sedaghati, Renji Thomas, Louis-Noël Pouchet, Radu Teodorescu, P. Sadayappan

Department of Computer Science and Engineering, The Ohio State University
HPC Research Lab: barista.cse.ohio-state.edu
Computer Architecture Lab: arch.cse.ohio-state.edu

October 13th 2011, Naser Sedaghati (CSE @ Ohio State), StVEC: A Vector Instruction Extension, PACT’11

SLIDE 2

Outline

1 Introduction
2 Vectorization of Stencils
3 Enhancing Vector ISA with StVEC
4 Generating Code for StVEC
5 Evaluation
6 Summary

SLIDE 3

Introduction

Stencil Computation
  • Repeat over time
  • Sweep over a spatial grid
  • Compute a point from the values of its neighbor points
  • Same grid or multiple grids

Numerous application domains
  • Finite difference methods for solving PDEs
  • Image processing (e.g. MRI image pipeline)
  • Computational electromagnetics, CFD, numerical relativity, etc.

SLIDE 4

Introduction

Stencil Computation: An Example

2-D 5-point Jacobi

for (t = 0; t < TMAX; t++)
  for (i = 1; i < N - 1; i++)
    for (j = 1; j < M - 1; j++)
      B[i][j] = A[i-1][j] + A[i][j-1] + A[i][j] + A[i][j+1] + A[i+1][j];
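
The loop nest above can be turned into a small runnable kernel. The sketch below is an illustration under assumptions that are not on the slide: the grid size (`N`, `M`), the `float` element type, the `row_t` typedef, and the pointer swap between the two grids after every time step (the slide only shows one sweep from `A` into `B`). As on the slide, the five points are summed without a scaling coefficient.

```c
#include <assert.h>

enum { N = 6, M = 8 };      /* hypothetical grid size, for illustration only */
typedef float row_t[M];     /* one grid row */

/* One sweep: each interior point of dst becomes the sum of the matching
 * point of src and its four neighbors. Border points are left untouched. */
static void jacobi2d_sweep(row_t *dst, row_t *src)
{
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < M - 1; j++)
            dst[i][j] = src[i - 1][j] + src[i][j - 1] + src[i][j]
                      + src[i][j + 1] + src[i + 1][j];
}

/* Repeat over time, swapping the roles of the two grids after each sweep
 * (an assumption made here). Both grids must start with the same border
 * values. Returns the grid that holds the final result. */
static row_t *jacobi2d(row_t *a, row_t *b, int tmax)
{
    assert(tmax >= 0);
    for (int t = 0; t < tmax; t++) {
        jacobi2d_sweep(b, a);
        row_t *tmp = a;
        a = b;
        b = tmp;
    }
    return a;
}
```

Calling `jacobi2d(a, b, 1)` on two grids filled with ones leaves every interior point of `b` equal to 5 and returns `b`.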

SLIDE 8

Introduction

Short-Vector SIMD

Identical computation on small chunks of data
  • Independent operations
  • Vector size (width) of 2 to 64
  • Packing operations to form a vector (shuffle, extract, etc.)

SIMD performance
  • Multiple SIMD units per CPU
  • Maximum speedup equals the vector width

Ubiquitous features on modern processors
  • x86 – SSE, AVX
  • Power – VMX/VSX
  • ARM – NEON
  • Cell SPU

SLIDE 13

Introduction

Vectorization: An Example

Vector width = 4, N divisible by 4

for (t = 0; t < T; t++)
  for (i = 4; i < N; i++)
    A[i] = B[i] * B[i];

1: ASM (MIPS-like)

for (t = 0; t < T; t++)
  for (i = 4; i < N; i++) {
    LD  R1, &B[i]
    MUL R2, R1, R1
    ST  R2, &A[i]
  }

2: 4-way unroll + re-schedule

for (t = 0; t < T; t++)
  for (i = 4; i < N; i += 4) {
    LD  R1, &B[i]
    LD  R2, &B[i+1]
    LD  R3, &B[i+2]
    LD  R4, &B[i+3]
    MUL R5, R1, R1
    MUL R6, R2, R2
    MUL R7, R3, R3
    MUL R8, R4, R4
    ST  R5, &A[i]
    ST  R6, &A[i+1]
    ST  R7, &A[i+2]
    ST  R8, &A[i+3]
  }

3: Vectorize

for (t = 0; t < T; t++)
  for (i = 4; i < N; i += 4) {
    VLD  VR1, &B[i]
    VMUL VR2, VR1, VR1
    VST  VR2, &A[i]
  }

Observation: aligned memory referencing (i.e. B[i]) helps vectorization!
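
Step 3 can be mimicked in portable C by modeling a vector register as four `float` lanes. This is only a sketch: the `vec4` type and the `vld`, `vmul` and `vst` helpers are hypothetical stand-ins for the slide's `VLD`, `VMUL` and `VST`, not a real ISA or library, and "aligned" is modeled as "index is a multiple of the vector width". On x86, the corresponding SSE intrinsics would be `_mm_load_ps`, `_mm_mul_ps` and `_mm_store_ps`.

```c
#include <assert.h>

#define W 4  /* vector width in elements, as on the slide */

typedef struct { float lane[W]; } vec4;  /* hypothetical model of one vector register */

/* VLD: load W consecutive elements starting at an aligned index. */
static vec4 vld(const float *base, int i)
{
    assert(i % W == 0);
    vec4 v;
    for (int k = 0; k < W; k++)
        v.lane[k] = base[i + k];
    return v;
}

/* VMUL: lane-wise multiply. */
static vec4 vmul(vec4 x, vec4 y)
{
    vec4 r;
    for (int k = 0; k < W; k++)
        r.lane[k] = x.lane[k] * y.lane[k];
    return r;
}

/* VST: store W consecutive elements starting at an aligned index. */
static void vst(float *base, int i, vec4 v)
{
    assert(i % W == 0);
    for (int k = 0; k < W; k++)
        base[i + k] = v.lane[k];
}

/* A[i] = B[i] * B[i] for i in [4, n), W elements per iteration. */
static void square_vectorized(float *A, const float *B, int n)
{
    assert(n % W == 0);
    for (int i = 4; i < n; i += W) {
        vec4 vr1 = vld(B, i);
        vec4 vr2 = vmul(vr1, vr1);
        vst(A, i, vr2);
    }
}
```

Because every access starts at a multiple of `W`, one load, one multiply and one store replace the twelve scalar instructions of step 2.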

SLIDE 14

Vectorization of Stencils

SLIDE 18

Vectorization of Stencils

Vectorizing Stencil Computation

for (t = 0; t < T; t++)
  for (i = 4; i < N; i++)
    A[i] += B[i-1] * B[i];

Solution 1: load + shuffle
Solution 2: unaligned load
Our solution: StVEC (no shuffle, no unaligned load)

[Figure: B[ ] in XMM registers; SSE assembly (N=1024)]
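
To make the difficulty concrete: the four elements `B[i-1..i+2]` needed by one vector iteration straddle two aligned vectors. The sketch below models Solution 1 (load + shuffle) in portable C. The `vec4` type and the helpers `vld_aligned` and `shift_in_one` are hypothetical stand-ins for aligned SSE loads and shuffle instructions; this is not the assembly shown on the slide.

```c
#include <assert.h>

#define W 4  /* vector width in elements */

typedef struct { float lane[W]; } vec4;  /* hypothetical model of one XMM register */

/* Aligned load: the start index must be a multiple of W. */
static vec4 vld_aligned(const float *b, int i)
{
    assert(i % W == 0);
    vec4 v;
    for (int k = 0; k < W; k++)
        v.lane[k] = b[i + k];
    return v;
}

/* Shuffle step: last lane of prev followed by the first three lanes of cur,
 * i.e. {B[i-1], B[i], B[i+1], B[i+2]} when prev = B[i-4..i-1] and
 * cur = B[i..i+3]. */
static vec4 shift_in_one(vec4 prev, vec4 cur)
{
    vec4 r;
    r.lane[0] = prev.lane[W - 1];
    for (int k = 1; k < W; k++)
        r.lane[k] = cur.lane[k - 1];
    return r;
}

/* A[i] += B[i-1] * B[i] for i in [4, n), using only aligned loads of B. */
static void stencil_load_shuffle(float *A, const float *B, int n)
{
    assert(n % W == 0 && n >= W);
    vec4 prev = vld_aligned(B, 0);
    for (int i = 4; i < n; i += W) {
        vec4 cur = vld_aligned(B, i);
        vec4 left = shift_in_one(prev, cur);   /* the extra shuffle work */
        for (int k = 0; k < W; k++)
            A[i + k] += left.lane[k] * cur.lane[k];
        prev = cur;                            /* each aligned vector is loaded once */
    }
}
```

The call to `shift_in_one` is the per-iteration overhead that StVEC aims to remove by forming the shifted operand while the registers are read.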

SLIDE 19

Enhancing Vector ISA with StVEC

SLIDE 26

Enhancing Vector ISA with StVEC: Execution Model

Building Unaligned Vector Operands

Idea: build an unaligned operand during register read
  • Only one unaligned operand suffices for stencils
  • Build the unaligned operand (i.e. VOPRx) with two source regs: base and extension

16x128-bit vector register file; base = VR1, extension = VR14

source       offset   VOPRx
VR1          0        X1,0:4 (aligned)
VR1, VR14    1        X1,1:3 X14,0:1
VR1, VR14    2        X1,2:2 X14,0:2
VR1, VR14    3        X1,3:1 X14,0:3
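
The table can be read as a lane-selection rule: lane k of the operand comes from the base register while offset + k is below the vector width, and from the extension register afterwards. A minimal C model of that rule, assuming a width of 4 and using the hypothetical names `vec4` and `build_voprx` (the slides do not define a software interface):

```c
#include <assert.h>

#define W 4  /* vector width in elements */

typedef struct { float lane[W]; } vec4;  /* hypothetical model of one vector register */

/* Model of VOPRx: base{offset : W - offset} followed by ext{0 : offset}. */
static vec4 build_voprx(vec4 base, vec4 ext, int offset)
{
    assert(offset >= 0 && offset < W);
    vec4 r;
    for (int k = 0; k < W; k++) {
        int src = offset + k;   /* position in the 2W-element pair base|ext */
        r.lane[k] = (src < W) ? base.lane[src] : ext.lane[src - W];
    }
    return r;
}
```

With offset 0 the result is the base register itself, which matches the aligned row of the table.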

SLIDE 29

Enhancing Vector ISA with StVEC: Instruction Format

StVEC Instructions

StVEC operands
  • Target: register-register vector instructions
  • src1 and dst: unchanged
  • src2: expanded to offset, base and extension

SSE translation to StVEC (vector width: W = 4)

SSE
  mulps VRx, VRy
  VRy = VRx ∗ VRy

StVEC
  stmulps offset, VRx, VRz, VRy
  VRy = VRx{offset : W − offset}VRz{0 : offset} ∗ VRy

SLIDE 30

Enhancing Vector ISA with StVEC: Modified Vector Register File

Modified Vector Register File (StVRF)

Modifications to the baseline VRF (BVRF)
  • separate register address for each bank
  • vector register adjustment (VRA) logic w/ offset

$T_{StVRF} \approx T_{BVRF} + T_{VRA}$

SLIDE 31

Enhancing Vector ISA with StVEC: Modified Vector Register File

Modified Vector Register File (StVRF)

Modifications to the baseline VRF (BVRF)
  • separate register address for each bank
  • vector register adjustment (VRA) logic w/ offset

Example: stmulps $1, %xmm1, %xmm2, %xmm3
  • vector width = 4
  • offset = 1
  • base = xmm1
  • extension = xmm2
  • src1 = dst = xmm3
  • src2 = xmm1{1:3}xmm2{0:1}
  • OP = vector multiply
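
The example can be checked with a small software model of the instruction. This is a sketch under assumptions: `vec4` and `stmulps_model` are hypothetical names, registers are passed by value, and the operand order (offset, base, extension, src1/dst) simply follows the slide.

```c
#include <assert.h>

#define W 4  /* vector width in elements */

typedef struct { float lane[W]; } vec4;  /* hypothetical model of one XMM register */

/* Model of: stmulps offset, base, ext, dst
 * Returns (base{offset : W - offset} ext{0 : offset}) * dst, lane by lane. */
static vec4 stmulps_model(int offset, vec4 base, vec4 ext, vec4 dst)
{
    assert(offset >= 0 && offset < W);
    vec4 r;
    for (int k = 0; k < W; k++) {
        int src = offset + k;
        float src2 = (src < W) ? base.lane[src] : ext.lane[src - W];
        r.lane[k] = src2 * dst.lane[k];
    }
    return r;
}
```

For instance, with `xmm1 = {1, 2, 3, 4}`, `xmm2 = {5, 6, 7, 8}` and `xmm3 = {10, 10, 10, 10}`, the second source operand for offset 1 is `{2, 3, 4, 5}`, so the model returns `{20, 30, 40, 50}`.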

SLIDE 35

Generating Code for StVEC

SLIDE 36

Generating Code for StVEC

The Code Generation Procedure

Input: abstract syntax tree (AST) of a vectorizable innermost loop
  1 generate basic intrinsics
  2 perform StVEC code generation
Output: vectorized loop with StVEC intrinsics

SLIDE 39

Generating Code for StVEC

StVEC Code Generation

Input: basic intrinsics loop

The proposed algorithm
  1 replace every unaligned reference by two aligned loads
  2 find offset values and promote to StVEC insts when possible

Some additional optimizations
  • dead-code elimination
  • 3-stage software-pipelining

Properties
  • can be emulated w/ existing vector ISA
  • unaligned loads can be eliminated
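
One plausible reading of steps 1 and 2, for a reference `B[i + d]` where `i` is a multiple of the vector width: split it into the aligned vector that contains the first element (base), the next aligned vector (extension), and the distance between the two starts (offset). The sketch below is an illustration of that arithmetic only, not the authors' algorithm; `stvec_ref` and `split_reference` are hypothetical names.

```c
#include <assert.h>

#define W 4  /* vector width in elements */

typedef struct {
    int base_shift;  /* aligned base vector starts at index i + base_shift */
    int ext_shift;   /* aligned extension vector starts at index i + ext_shift */
    int offset;      /* StVEC offset, in [0, W) */
} stvec_ref;

/* Split the reference B[i + d] (i a multiple of W) into two aligned
 * vectors plus an offset. The double modulo keeps the remainder
 * non-negative for negative d. */
static stvec_ref split_reference(int d)
{
    int offset = ((d % W) + W) % W;
    stvec_ref r;
    r.base_shift = d - offset;           /* largest multiple of W that is <= d */
    r.ext_shift  = r.base_shift + W;
    r.offset     = offset;
    assert(r.base_shift % W == 0);
    return r;
}
```

For `B[i-1]` this gives base `B[i-4..i-1]`, extension `B[i..i+3]` and offset 3. When the offset is 0 the reference is already aligned and the extension is not needed.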

SLIDE 40

Evaluation

Evaluation Methodology

SLIDE 45

Evaluation: Code Implementation

Emulating StVEC Instructions on Real Vector ISA

$T_{StVRF} \approx T_{BVRF} + T_{VRA}$

st-opt (optimistic): each StVEC inst is modeled with the delay of 1 SIMD inst
  • $T_{StVRF} \le T_{cycle}$

st-pes (pessimistic): each StVEC inst is modeled with the delay of 2 dependent SIMD insts
  • $T_{StVRF} \le 2 \cdot T_{cycle}$

SLIDE 47

Evaluation: Experimental Setup and Stencil Benchmarks

Setup

Running on x86 architectures
  • Intel Core i7 Nehalem, Intel Sandy Bridge, Intel Core2 Quad
  • AMD Phenom (K10)

Stencil Benchmarks
  • 1D: Jacobi (2-, 3-, 5- and 7-point)
  • 2D: Jacobi (5- and 9-point), POP, FDTD 2D, Rician Denoise 2D
  • 3D: Jacobi (27-point), Heattut 3D

L1-resident problem size
  • assume tiling was performed beforehand if necessary

Compilers
  • ICC 12 (w/ -fast) and GCC 4.4.4 (w/ -O3)

SLIDE 48

Evaluation: Experimental Setup and Stencil Benchmarks

Experimental Results

SLIDE 51

Evaluation: Experimental Setup and Stencil Benchmarks

Average (Geometric Mean) Speedup with StVEC

Single-precision (average across all 12 benchmarks)

[Figure: average speedup (y-axis 0.5 to 3) of autovec, st-pes and st-opt, with icc and gcc, on i7-sb, i7-n, core2 and phenom]

Normalized to baseline (intrinsics + unrolling + SWP)
  • Single-precision: 7% to 2.26x for st-pes and 20% to 2.47x for st-opt
  • Double-precision: 30% to 65% for st-opt
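
For reference, the geometric mean named in the slide title has the standard definition below. The symbols are notation introduced here, not taken from the slides: $S_b$ is the speedup of benchmark $b$ over the baseline and $n$ is the number of benchmarks, with $n = 12$ for the single-precision averages.

```latex
\bar{S} = \left( \prod_{b=1}^{n} S_b \right)^{1/n}, \qquad n = 12
```

Unlike the arithmetic mean, this average is not dominated by a single benchmark with a very large speedup.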

SLIDE 52

Evaluation: Experimental Setup and Stencil Benchmarks

Speedup with ICC Across Machines

Single-precision

[Figure: relative speedup (y-axis 0.5 to 4.5) of st-pes and st-opt on i7-sb, i7-n, core2 and phenom for each benchmark: j1d2p, j1d3p, j1d5p, j1d7p, j2d5p, j2d9p, j3d27p, pop1, pop2, fdtd, heattut, rician]

Normalized to baseline (intrinsics + unrolling + SWP)

SLIDE 57

Evaluation: Experimental Setup and Stencil Benchmarks

StVEC Hardware Overhead

StVRF access time in 45nm CMOS
  • $T_{BVRF}$: SRAM model in CACTI
  • $T_{VRA}$: circuit synthesis/layout by Synopsys Design Compiler

#Regs.          #Banks   Reg. size   BVRF      StVRF
128 (st-opt)    4        128-bit     0.24 ns   0.30 ns
256 (st-pes)    8        256-bit     0.37 ns   0.50 ns

$T_{StVRF} \approx T_{BVRF} + T_{VRA}$

st-opt: StVEC inst models delay of 1 SIMD inst
  • $T_{StVRF} \le T_{cycle}$
  • Ex1: F = 3 GHz, $T_{cycle}$ = 0.33 ns > $T_{StVRF}$ (0.30 ns) ⇒ No overhead!

st-pes: StVEC inst models delay of 2 dependent SIMD insts
  • $T_{StVRF} \le 2 \cdot T_{cycle}$
  • Ex2: F = 3 GHz, $T_{cycle}$ = 0.33 ns < $T_{StVRF}$ (0.50 ns) ⇒ Extra cycle overhead!
  • Ex3: F = 2 GHz, $T_{cycle}$ = 0.50 ns ≥ $T_{StVRF}$ (0.50 ns) ⇒ No overhead!
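
The three examples compare the register-file access time with the clock period, $T_{cycle} = 1/F$. The helper below reproduces that comparison in integer picoseconds to avoid floating-point rounding. It is a sketch: the function names and the rule "extra cycles = ceil(access time / period) - 1" are assumptions made for illustration, and only the access times and frequencies come from the slide.

```c
#include <assert.h>

/* Clock period in picoseconds for a frequency given in MHz (rounded down). */
static int cycle_ps(int freq_mhz)
{
    assert(freq_mhz > 0);
    return 1000000 / freq_mhz;
}

/* Extra cycles needed when a register-file access of t_access_ps does not
 * fit in one clock period (assumed rule: ceil(t / T_cycle) - 1). */
static int extra_cycles(int t_access_ps, int freq_mhz)
{
    int period = cycle_ps(freq_mhz);
    assert(t_access_ps > 0 && period > 0);
    return (t_access_ps + period - 1) / period - 1;
}
```

Under these assumptions, `extra_cycles(300, 3000)` is 0 (Ex1), `extra_cycles(500, 3000)` is 1 (Ex2) and `extra_cycles(500, 2000)` is 0 (Ex3).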

SLIDE 59

Summary

Conclusion

Take-home Message

Vectorization of stencils is expensive
  • Previous solutions: unaligned loads or shuffle instructions!

Our solution: StVEC – new vector instruction extension
  • Fast execution of stencils
  • Small hardware changes
  • Eliminating unaligned loads

Performance evaluation with existing x86 vector ISAs
  • Optimistic (1 StVEC inst ≈ 1 SIMD inst): 20% to 2.47x
  • Pessimistic (1 StVEC inst ≈ 2 dependent SIMD insts): 7% to 2.26x

Best fit for 128-bit wide vector computations
  • May require additional pipeline stage(s) for wider vectors

SLIDE 60

Questions?