Unit 11: Data-Level Parallelism: Vectors & GPUs

CIS 501: Computer Architecture
Unit 11: Data-Level Parallelism: Vectors & GPUs
Slides developed by Milo Martin & Amir Roth at the University of Pennsylvania, with sources that included University of Wisconsin slides by Mark Hill, Guri Sohi, Jim Smith, and David Wood
CIS 501: Comp. Arch. | Prof. Milo Martin | Vectors & GPUs

How to Compute This Fast?
• Performing the same operations on many data items
  • Example: SAXPY

    for (I = 0; I < 1024; I++) {
      Z[I] = A*X[I] + Y[I];
    }

    L1: ldf [X+r1]->f1   // I is in r1
        mulf f0,f1->f2   // A is in f0
        ldf [Y+r1]->f3
        addf f2,f3->f4
        stf f4->[Z+r1]
        addi r1,4->r1
        blti r1,4096,L1

• Instruction-level parallelism (ILP) - fine grained
  • Loop unrolling with static scheduling, or dynamic scheduling
  • Wide-issue superscalar (non-)scaling limits benefits
• Thread-level parallelism (TLP) - coarse grained
  • Multicore
• Can we do some “medium grained” parallelism?

Data-Level Parallelism
• Data-level parallelism (DLP)
  • Single operation repeated on multiple data elements
  • SIMD (Single-Instruction, Multiple-Data)
  • Less general than ILP: parallel insns are all same operation
• Exploit with vectors
• Old idea: Cray-1 supercomputer from late 1970s
  • Eight 64-entry x 64-bit floating point “vector registers”
    • 4096 bits (0.5KB) in each register! 4KB for vector register file
  • Special vector instructions to perform vector operations
    • Load vector, store vector (wide memory operation)
    • Vector+Vector or Vector+Scalar
      • addition, subtraction, multiply, etc.
    • In Cray-1, each instruction specifies 64 operations!
      • ALUs were expensive, so one operation per cycle (not parallel)

Example Vector ISA Extensions (SIMD)
• Extend ISA with floating point (FP) vector storage …
  • Vector register: fixed-size array of 32- or 64-bit FP elements
  • Vector length: For example: 4, 8, 16, 64, …
• … and example operations for vector length of 4
  • Load vector: ldf.v [X+r1]->v1
      ldf [X+r1+0]->v1_0
      ldf [X+r1+1]->v1_1
      ldf [X+r1+2]->v1_2
      ldf [X+r1+3]->v1_3
  • Add two vectors: addf.vv v1,v2->v3
      addf v1_i,v2_i->v3_i (where i is 0,1,2,3)
  • Add vector to scalar: addf.vs v1,f2->v3
      addf v1_i,f2->v3_i (where i is 0,1,2,3)
• Today’s vectors: short (128 or 256 bits), but fully parallel
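
The SAXPY loop and the vector-length-4 operations above can be sketched in plain C. This is a minimal illustration, not compiler output: the function names `saxpy_scalar` and `saxpy_vlen4` and the `VLEN` constant are assumptions made for this example, and the inner loops only spell out the per-element work that a single vector instruction would perform at once.

```c
#include <assert.h>
#include <stddef.h>

#define VLEN 4  /* assumed vector length, matching the 4-element examples */

/* Scalar SAXPY: one multiply and one add per loop iteration. */
void saxpy_scalar(size_t n, float a, const float *x, const float *y, float *z)
{
    for (size_t i = 0; i < n; i++)
        z[i] = a * x[i] + y[i];
}

/* Strip-mined SAXPY: each outer iteration stands in for one group of
 * vector instructions. Every inner loop is the element-wise expansion
 * of a single vector operation on VLEN elements. */
void saxpy_vlen4(size_t n, float a, const float *x, const float *y, float *z)
{
    assert(n % VLEN == 0);  /* sketch only: no cleanup loop for leftovers */
    for (size_t i = 0; i < n; i += VLEN) {
        float v1[VLEN], v2[VLEN], v3[VLEN], v4[VLEN];
        for (int k = 0; k < VLEN; k++) v1[k] = x[i + k];       /* vector load          */
        for (int k = 0; k < VLEN; k++) v2[k] = v1[k] * a;      /* vector times scalar  */
        for (int k = 0; k < VLEN; k++) v3[k] = y[i + k];       /* vector load          */
        for (int k = 0; k < VLEN; k++) v4[k] = v2[k] + v3[k];  /* vector plus vector   */
        for (int k = 0; k < VLEN; k++) z[i + k] = v4[k];       /* vector store         */
    }
}
```

Both functions compute the same result; the second only makes the grouping into 4-element vector operations explicit, so the outer loop runs n/4 times instead of n.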

Example Use of Vectors: 4-wide

    Scalar                  4-wide vector
    ldf [X+r1]->f1          ldf.v [X+r1]->v1
    mulf f0,f1->f2          mulf.vs v1,f0->v2
    ldf [Y+r1]->f3          ldf.v [Y+r1]->v3
    addf f2,f3->f4          addf.vv v2,v3->v4
    stf f4->[Z+r1]          stf.v v4->[Z+r1]
    addi r1,4->r1           addi r1,16->r1
    blti r1,4096,L1         blti r1,4096,L1
    7x1024 instructions     7x256 instructions (4x fewer instructions)

• Operations
  • Load vector: ldf.v [X+r1]->v1
  • Multiply vector to scalar: mulf.vs v1,f2->v3
  • Add two vectors: addf.vv v1,v2->v3
  • Store vector: stf.v v1->[X+r1]
• Performance?
  • Best case: 4x speedup
  • But, vector instructions don’t always have single-cycle throughput
  • Execution width (implementation) vs vector width (ISA)

Vector Datapath & Implementation
• Vector insn. are just like normal insn… only “wider”
  • Single instruction fetch (no extra N^2 checks)
  • Wide register read & write (not multiple ports)
  • Wide execute: replicate floating point unit (same as superscalar)
  • Wide bypass (avoid N^2 bypass problem)
  • Wide cache read & write (single cache tag check)
• Execution width (implementation) vs vector width (ISA)
  • Example: Pentium 4 and “Core 1” execute vector ops at half width
  • “Core 2” executes them at full width
• Because they are just instructions…
  • …superscalar execution of vector instructions
  • Multiple n-wide vector instructions per cycle

Intel’s SSE2/SSE3/SSE4/AVX…
• Intel SSE2 (Streaming SIMD Extensions 2) - 2001
  • 16 128-bit floating point registers (xmm0-xmm15)
  • Each can be treated as 2x64b FP or 4x32b FP (“packed FP”)
    • Or 2x64b or 4x32b or 8x16b or 16x8b ints (“packed integer”)
    • Or 1x64b or 1x32b FP (just normal scalar floating point)
  • Original SSE: only 8 registers, no packed integer support
• Other vector extensions
  • AMD 3DNow!: 64b (2x32b)
  • PowerPC AltiVEC/VMX: 128b (2x64b or 4x32b)
• Looking forward for x86
  • Intel’s “Sandy Bridge” brings 256-bit vectors to x86
  • Intel’s “Xeon Phi” multicore will bring 512-bit vectors to x86

Other Vector Instructions
• These target specific domains: e.g., image processing, crypto
  • Vector reduction (sum all elements of a vector)
  • Geometry processing: 4x4 translation/rotation matrices
  • Saturating (non-overflowing) subword add/sub: image processing
  • Byte asymmetric operations: blending and composition in graphics
  • Byte shuffle/permute: crypto
  • Population (bit) count: crypto
  • Max/min/argmax/argmin: video codec
  • Absolute differences: video codec
  • Multiply-accumulate: digital-signal processing
  • Special instructions for AES encryption
• More advanced (but in Intel’s Xeon Phi)
  • Scatter/gather loads: indirect store (or load) from a vector of pointers
  • Vector mask: predication (conditional execution) of specific elements
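
The “packed FP” view of a 128-bit register (4x32b) can be sketched with the GCC/Clang `vector_size` attribute. This is a hedged sketch rather than the course's own code: the type name `v4sf` and the function `saxpy_v4sf` are assumptions for this example, and whether the compiler maps the arithmetic to SSE instructions depends on the target and flags.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Four packed 32-bit floats in 16 bytes (GCC/Clang vector extension). */
typedef float v4sf __attribute__ ((vector_size (16)));

/* SAXPY over packed 4-float vectors: z = a*x + y, four elements at a time. */
void saxpy_v4sf(size_t n, float a, const float *x, const float *y, float *z)
{
    assert(n % 4 == 0);              /* sketch only: no scalar cleanup loop */
    const v4sf va = {a, a, a, a};    /* scalar broadcast into all four lanes */
    for (size_t i = 0; i < n; i += 4) {
        v4sf vx, vy, vz;
        memcpy(&vx, x + i, sizeof vx);   /* load that is safe for unaligned data */
        memcpy(&vy, y + i, sizeof vy);
        vz = va * vx + vy;               /* element-wise multiply and add */
        memcpy(z + i, &vz, sizeof vz);   /* store four results */
    }
}
```

The `memcpy` calls avoid assuming that the input arrays are 16-byte aligned; compilers typically turn them into plain vector loads and stores.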

Using Vectors in Your Code
• Write in assembly
  • Ugh
• Use “intrinsic” functions and data types
  • For example: _mm_mul_ps() and “__m128” datatype
• Use vector data types
  • typedef double v2df __attribute__ ((vector_size (16)));
• Use a library someone else wrote
  • Let them do the hard work
  • Matrix and linear algebra packages
• Let the compiler do it (automatic vectorization, with feedback)
  • GCC’s “-ftree-vectorize” option, -ftree-vectorizer-verbose=n
  • Limited impact for C/C++ code (old, hard problem)

Recap: Vectors for Exploiting DLP
• Vectors are an efficient way of capturing parallelism
  • Data-level parallelism
  • Avoid the N^2 problems of superscalar
  • Avoid the difficult fetch problem of superscalar
  • Area efficient, power efficient
• The catch?
  • Need code that is “vector-izable”
  • Need to modify program (unlike dynamic-scheduled superscalar)
  • Requires some help from the programmer
• Looking forward: Intel “Xeon Phi” (aka Larrabee) vectors
  • More flexible (vector “masks”, scatter, gather) and wider
  • Should be easier to exploit, more bang for the buck

Graphics Processing Units (GPU)
• Killer app for parallelism: graphics (3D games)
  [Image: Tesla S870]

GPUs and SIMD/Vector Data Parallelism
• How do GPUs have such high peak FLOPS & FLOPS/Joule?
  • Exploit massive data parallelism; focus on total throughput
  • Remove hardware structures that accelerate single threads
  • Specialized for graphics: e.g., data-types & dedicated texture units
• “SIMT” execution model
  • Single instruction multiple threads
  • Similar to both “vectors” and “SIMD”
  • A key difference: better support for conditional control flow
• Program it with CUDA or OpenCL
  • Extensions to C
  • Perform a “shader task” (a snippet of scalar computation) over many elements
  • Internally, GPU uses scatter/gather and vector mask operations
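
The idea of running a scalar “shader task” over many elements, with vector masks handling conditional control flow, can be emulated in plain C. This is a rough sketch under stated assumptions, not CUDA or OpenCL and not a model of any particular GPU: the group width `WARP`, the kernel `kernel_one`, and the functions `run_scalar` and `run_simt` are all hypothetical names chosen for this example.

```c
#include <assert.h>
#include <stddef.h>

#define WARP 4  /* hypothetical group width; chosen small for readability */

/* The "shader task": scalar code written for a single element.
 * Negative inputs produce zero; all other inputs are scaled by a. */
static float kernel_one(float a, float x)
{
    if (x < 0.0f)
        return 0.0f;
    return a * x;
}

/* Reference: apply the scalar kernel to every element in turn. */
void run_scalar(size_t n, float a, const float *x, float *out)
{
    for (size_t i = 0; i < n; i++)
        out[i] = kernel_one(a, x[i]);
}

/* SIMT-style emulation: WARP lanes share one instruction stream.
 * Both sides of the branch are issued for the whole group, and a
 * per-lane mask decides which lanes write a result on each side. */
void run_simt(size_t n, float a, const float *x, float *out)
{
    assert(n % WARP == 0);  /* sketch only: no partial final group */
    for (size_t base = 0; base < n; base += WARP) {
        int mask[WARP];
        for (int l = 0; l < WARP; l++)      /* evaluate the condition per lane */
            mask[l] = x[base + l] < 0.0f;
        for (int l = 0; l < WARP; l++)      /* "then" side, masked lanes only */
            if (mask[l]) out[base + l] = 0.0f;
        for (int l = 0; l < WARP; l++)      /* "else" side, inverted mask */
            if (!mask[l]) out[base + l] = a * x[base + l];
    }
}
```

The programmer writes only something like `kernel_one`; in this emulation the masking in `run_simt` plays the role the slide assigns to the hardware's vector mask operations.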

Slides 13-20: Slide by Kayvon Fatahalian - http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf

Data Parallelism Summary
• Data Level Parallelism
  • “medium-grained” parallelism between ILP and TLP
  • Still one flow of execution (unlike TLP)
  • Compiler/programmer must explicitly express it (unlike ILP)
• Hardware support: new “wide” instructions (SIMD)
  • Wide registers, perform multiple operations in parallel
• Trends
  • Wider: 64-bit (MMX, 1996), 128-bit (SSE2, 2000), 256-bit (AVX, 2011), 512-bit (Xeon Phi, 2012?)
  • More advanced and specialized instructions
• GPUs
  • Embrace data parallelism via “SIMT” execution model
  • Becoming more programmable all the time
• Today’s chips exploit parallelism at all levels: ILP, DLP, TLP
