DATA LEVEL PARALLELISM Mahdi Nazm Bojnordi Assistant Professor - PowerPoint PPT Presentation

May 13, 2023


DATA LEVEL PARALLELISM
Mahdi Nazm Bojnordi
Assistant Professor
School of Computing
University of Utah
CS/ECE 6810: Computer Architecture

Overview
- ILP: instruction level parallelism
  - Out of order execution (all in hardware)
  - IPC hardly achieves more than 2
- Other forms of parallelism
  - DLP: data level parallelism
    - Vector processors, SIMD, and GPUs
  - TLP: thread level parallelism
    - Multiprocessors and hardware multithreading
  - RLP: request level parallelism
    - Datacenters

Data Level Parallelism
- Arises from executing the same code on a large number of objects
  - Common in scientific computing
- DLP architectures
  - Vector processors, e.g., Cray machines
  - SIMD extensions, e.g., Intel MMX
  - Graphics processing units, e.g., NVIDIA
- Improve throughput rather than latency
  - Not good for non-parallel workloads

Vector Processing
- Scalar vs. vector processor

    for (i = 0; i < 1000; ++i) {
        B[i] = A[i] + x;
    }

    Scalar: add  r3, r2, r1
    Vector: vadd v3, v2, v1

[Figure: x is added to every element of A to produce the corresponding element of B]

Vector Processor
- A scalar processor, e.g., MIPS
  - Scalar register file
  - Scalar functional units
- Vector register file
  - 2D register array
  - Each vector register is itself an array of registers
  - The number of elements per register determines the max vector length
- Vector functional units
  - Single opcode activates multiple units
  - Integer, floating-point, and load/store units

Basic Vector Processor Architecture

Parallel vs. Pipeline Units

Vector Instruction Set Architecture
- Single instruction defines multiple operations
  - Lower instruction fetch/decode/issue cost
- Operations are executed in parallel
  - Naturally no dependency among data elements
  - Simple hardware
- Predictable memory access pattern
  - Improve performance via prefetching
  - Simple memory scheduling policy
  - Multi-banking may be used for improving bandwidth

Vector Operation Length
- Fixed in hardware
  - Common in narrow SIMD
  - Not efficient for wide SIMD
- Variable length
  - Determined by a vector length register (VLR)
  - MVL is the maximum VL
  - How to process vectors longer than MVL?

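One common answer to the question of vectors longer than MVL is strip mining: the loop is split into chunks of at most MVL elements and the VLR is set per chunk. The C sketch below only emulates that idea in software. MVL = 64, `vadd_vs`, and `strip_mined_add` are hypothetical names and values chosen for illustration, not part of the slides or of any real ISA.

```c
#include <assert.h>
#include <stddef.h>

#define MVL 64  /* assumed maximum vector length; real hardware defines its own */

/* Emulates one vector instruction: b[0..vl) = a[0..vl) + x, with vl <= MVL. */
static void vadd_vs(float *b, const float *a, float x, size_t vl)
{
    assert(vl <= MVL);  /* the VLR can never exceed MVL */
    for (size_t lane = 0; lane < vl; ++lane)
        b[lane] = a[lane] + x;
}

/* Strip-mined B[i] = A[i] + x for any n.
   Returns the number of emulated vector instructions, i.e. ceil(n / MVL). */
size_t strip_mined_add(float *b, const float *a, float x, size_t n)
{
    size_t issued = 0;
    for (size_t i = 0; i < n; i += MVL) {
        /* Value that would be written to the VLR for this strip. */
        size_t vl = (n - i < MVL) ? (n - i) : MVL;
        vadd_vs(b + i, a + i, x, vl);
        ++issued;
    }
    return issued;
}
```

For the 1000-element loop used earlier in the deck and the assumed MVL of 64, this issues 16 emulated vector instructions: 15 full strips and one final strip of 40 elements.
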
Conditional Execution
- Question: how to handle branches?
- Solution: by predication
  - Use masks, flag vectors with single-bit elements
  - Determine the flag values based on vector compare
  - Use flag registers as control mask for the next vector operations

    for (i = 0; i < 1000; ++i) {
        if (A[i] != B[i])
            A[i] -= B[i];
    }

    vld         V1, Ra           # V1 = A
    vld         V2, Rb           # V2 = B
    vcmp.neq.vv M0, V1, V2       # M0[i] = (A[i] != B[i])
    vsub.vv     V1, V1, V2, M0   # V1[i] = A[i] - B[i] where M0[i] is set
    vst         V1, Ra           # unmasked lanes still hold the original A[i]

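The predication steps above can be sketched in plain C by modelling the flag register as an array of booleans. This is only a software emulation under stated assumptions: the vector length of 8 and the helper names `vcmp_neq`, `vsub_masked`, and `predicated_sub` are hypothetical, and the per-lane `if` stands in for what would be a write-enable signal in hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define VL 8  /* assumed vector length for this sketch */

/* Models vcmp.neq.vv: mask[i] is set where a[i] != b[i]. */
static void vcmp_neq(bool mask[VL], const int a[VL], const int b[VL])
{
    for (size_t i = 0; i < VL; ++i)
        mask[i] = (a[i] != b[i]);
}

/* Models a masked vsub.vv: a[i] -= b[i] only in lanes where mask[i] is set.
   Lanes with a cleared mask bit keep their old value. */
static void vsub_masked(int a[VL], const int b[VL], const bool mask[VL])
{
    for (size_t i = 0; i < VL; ++i)
        if (mask[i])
            a[i] -= b[i];
}

/* Predicated form of: if (A[i] != B[i]) A[i] -= B[i];
   Returns the number of active (unmasked) lanes. */
size_t predicated_sub(int a[VL], const int b[VL])
{
    bool mask[VL];
    size_t active = 0;

    assert(a != NULL && b != NULL);
    vcmp_neq(mask, a, b);
    vsub_masked(a, b, mask);
    for (size_t i = 0; i < VL; ++i)
        active += mask[i];
    return active;
}
```

There is no branch on the data at the vector level: the compare and the subtract are each issued once, and the mask alone decides which lanes are updated.
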
Branches in Scalar Processors

    for (i = 0; i < 8; ++i) {
        if (inp[i] > 0) {
            y = inp[i] * inp[i];
            y = y + 2 * inp[i];
            out[i] = y + 3;
        } else {
            y = 4 * inp[i];
            out[i] = y + 1;
        }
    }

[Figure: inp feeds a single ALU, which produces out[0], out[1], …]

Branches in Vector Processors

    if (inp[i] > 0) {
        y = inp[i] * inp[i];
        y = y + 2 * inp[i];
        out[i] = y + 3;
    } else {
        y = 4 * inp[i];
        out[i] = y + 1;
    }

[Figure: inp, ALU, and out, with a row of T flags shown beside the if path and another beside the else path]
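
One way to model the if/else above on a vector machine is to evaluate both paths across all lanes and let a mask choose which result each lane keeps. The sketch below assumes that compute-both-then-select model; hardware may instead disable lanes per path, which gives the same results. `LANES` and `vector_branch` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define LANES 8  /* matches the 8-iteration loop in the scalar version */

/* Evaluates both branch paths for every lane and selects under a mask.
   Returns the number of lanes that took the "if" path. */
size_t vector_branch(int out[LANES], const int inp[LANES])
{
    bool taken[LANES];
    int then_val[LANES], else_val[LANES];
    size_t n_taken = 0;

    assert(out != NULL && inp != NULL);

    for (size_t i = 0; i < LANES; ++i)    /* vector compare: inp > 0 */
        taken[i] = inp[i] > 0;

    for (size_t i = 0; i < LANES; ++i)    /* "if" path, computed in all lanes */
        then_val[i] = inp[i] * inp[i] + 2 * inp[i] + 3;

    for (size_t i = 0; i < LANES; ++i)    /* "else" path, computed in all lanes */
        else_val[i] = 4 * inp[i] + 1;

    for (size_t i = 0; i < LANES; ++i) {  /* select the result under the mask */
        out[i] = taken[i] ? then_val[i] : else_val[i];
        n_taken += taken[i];
    }
    return n_taken;
}
```

In this model the time spent is the sum of both paths regardless of how many lanes take each one, which is the usual cost of branching in a data-parallel machine.
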
