Processor Performance and Parallelism, Y. K. Malaiya (PowerPoint presentation)

SLIDE 1

Processor Performance and Parallelism

  • Y. K. Malaiya
SLIDE 2


Processor Execution time

The time taken by a program to execute is the product of:

  • Number of machine instructions executed
  • Number of clock cycles per instruction (CPI)
  • Clock period duration

Example: 10,000 instructions, CPI=2, clock period = 250 ps

CPU Time = Instruction Count × CPI × Clock period
         = 10,000 × 2 × 250 ps = 5 × 10⁶ ps = 5 × 10⁻⁶ sec
(each instruction takes 2 × 250 ps = 0.5 ns)
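The arithmetic in this example can be sketched in a few lines of Python, using the quantities given on the slide:

```python
# CPU Time = Instruction Count x CPI x Clock period
instructions = 10_000
cpi = 2
clock_period_ps = 250  # picoseconds

cpu_time_ps = instructions * cpi * clock_period_ps
print(cpu_time_ps)          # 5000000 ps
print(cpu_time_ps / 1e12)   # about 5e-06 seconds (5 microseconds)
```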

SLIDE 3

Processor Execution time

Instruction Count for a program

  • Determined by program, ISA and compiler

Average Cycles per instruction (CPI)

nDetermined by CPU hardware nIf different instructions have different CPI

Average CPI affected by instruction mix

Clock cycle time (inverse of frequency)

  • Logic levels
  • Technology


CPU Time = Instruction Count × CPI × Clock Cycle Time
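The point that the average CPI depends on the instruction mix can be made concrete with a short sketch; the mix fractions and per-class CPIs below are invented for illustration, not taken from any real machine:

```python
# Average CPI as a weighted sum over a hypothetical instruction mix.
# Each entry maps an instruction class to (fraction of executed
# instructions, CPI of that class).
mix = {
    "alu":    (0.50, 1),
    "load":   (0.20, 5),
    "store":  (0.10, 3),
    "branch": (0.20, 2),
}

avg_cpi = sum(frac * cpi for frac, cpi in mix.values())
print(avg_cpi)  # 0.5*1 + 0.2*5 + 0.1*3 + 0.2*2, i.e. about 2.2
```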

SLIDE 4

Reducing clock cycle time

This worked well for decades: smaller transistor dimensions implied smaller delays and hence a shorter clock cycle time. That is no longer the case.


SLIDE 5

CPI (cycles per instruction)

What is the LC-3's CPI? Instructions take 5-9 cycles (p. 568), assuming memory access time is one clock period.

  • LC-3 CPI may be about 6* (ideal)

No cache, memory access time = 100 cycles?

  • LC-3 CPI would be very high

Cache reduces access time to 2 cycles.

  • LC-3 CPI higher than 6, but still reasonable

* Load/store instructions are about 20-30%
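The effect of memory access time on CPI can be sketched as follows. The numbers are assumptions for illustration: a base CPI of 6, a 25% load/store fraction (the 20-30% figure above), and only data accesses counted, ignoring the extra fetch traffic:

```python
# Effective CPI when data memory accesses cost extra cycles.
# Illustrative model, not an LC-3 measurement: each load/store pays
# (mem_access_cycles - 1) cycles beyond the 1-cycle ideal.
base_cpi = 6          # ideal CPI with 1-cycle memory
mem_fraction = 0.25   # ~20-30% of instructions are loads/stores

def effective_cpi(mem_access_cycles):
    return base_cpi + mem_fraction * (mem_access_cycles - 1)

print(effective_cpi(100))  # 30.75 -- no cache, CPI balloons
print(effective_cpi(2))    # 6.25  -- a 2-cycle cache keeps CPI near 6
```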

SLIDE 6

Parallelism to save time

Do things in parallel to save time. Approaches:

  • Instruction level parallelism

  Ø Pipelining: divide instruction flow into stages; let instructions flow through the pipeline.
  Ø Multiple issue: fetch multiple instructions at the same time.

  • Concurrent processes or threads (task-level parallelism)

  Ø For true concurrency, need extra hardware:
    – Multiple processors (cores), or
    – Support for multiple threads


Demo: Threads in Mac

SLIDE 7

Pipelining Analogy

Pipelined laundry: overlapping execution

  • Parallelism improves performance


  • Four loads, sequential: time = 4 × 2 = 8 hours
  • Pipelined: time in example = 7 × 0.5 = 3.5 hours
  • Non-stop (steady state): 4 × 0.5 = 2 hours
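The laundry timing generalizes: for n tasks through a k-stage pipeline with stage time t, the first task takes k stages and each later task finishes one stage time after the previous. A small sketch:

```python
# Pipelined vs. sequential completion time for n tasks,
# k stages, stage time t (hours in the laundry analogy).
def sequential_time(n, k, t):
    return n * k * t

def pipelined_time(n, k, t):
    # first task takes k stage times; each of the remaining
    # n-1 tasks finishes one stage time later
    return (k + (n - 1)) * t

n, k, t = 4, 4, 0.5   # 4 laundry loads, 4 stages of 0.5 h each
print(sequential_time(n, k, t))  # 8.0 hours
print(pipelined_time(n, k, t))   # 3.5 hours
```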

SLIDE 8

Pipeline Processor Performance

[Timing diagram: single-cycle (Tc = 800 ps) vs. pipelined (Tc = 200 ps)]

SLIDE 9

Pipelining: Issues

Cannot predict which branch will be taken.

  • Actually, you may be able to make a good guess.
  • Some performance penalty for bad guesses.

Instructions may depend on results of previous instructions.

  • There may be a way to get around that problem in some cases.


SLIDE 10

Instruction level parallelism (ILP):

Pipelining is one example. Multiple issue: have multiple copies of resources

  • Multiple instructions start at the same time
  • Need careful scheduling

  Ø Compiler-assisted scheduling
  Ø Hardware-assisted (“superscalar”): “dynamic scheduling”

    – Ex: AMD Opteron X4
    – CPI can be less than 1!


SLIDE 11

Task Parallelism

Program is divided into tasks that can be run in parallel.

Concurrent Processes

  • Can run truly in parallel if there are multiple processors, e.g. multi-core processors

Concurrent Threads

  • Multiple threads can run on multiple processors, or
  • Single processor with multithreading support (Simultaneous Multithreading)

Process vs thread

  • All information resources for a process are private to the process.
  • Multiple threads within a process have private registers & stack, but share the address space.
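The shared-address-space point can be demonstrated with Python's threading module: every thread in the process sees and mutates the same objects, with no message passing needed.

```python
import threading

# Threads in one process share the address space: all workers
# append to the same list object.
shared = []

def worker(name):
    shared.append(name)   # the same 'shared' object in every thread

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(shared))  # [0, 1, 2, 3]
```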


SLIDE 12

Task Parallelism

Program is divided into tasks that can be run in parallel

Example: A program needs subtasks A, B, C, D; B and C can be run in parallel. They take 200, 500, 500 and 300 nanoseconds respectively.

Without parallelism: total time needed = 200 + 500 + 500 + 300 = 1500 ns.
With task-level parallelism: 200 + 500 (B and C in parallel) + 300 = 1000 ns.


[Diagram: A, B, C, D run sequentially vs. A, then B and C in parallel, then D]
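The schedule above amounts to replacing B + C by max(B, C) on the critical path, which a short sketch makes explicit:

```python
# Task-level parallelism: B and C overlap, so the total time is
# A + max(B, C) + D instead of A + B + C + D.
times = {"A": 200, "B": 500, "C": 500, "D": 300}  # nanoseconds

serial = sum(times.values())
parallel = times["A"] + max(times["B"], times["C"]) + times["D"]
print(serial)    # 1500 ns
print(parallel)  # 1000 ns
```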

SLIDE 13

Task Parallelism


SLIDE 14

Flynn’s taxonomy

Michael J. Flynn, 1966


                             Data Streams
                             Single                     Multiple
Instruction     Single       SISD: Intel Pentium 4      SIMD: SSE instructions in x86
Streams         Multiple     MISD: no examples today    MIMD: e.g. multicore Intel Xeon e5345

  • Instruction-level parallelism is still SISD
  • SSE (Streaming SIMD Extensions): vector operations
  • Intel Xeon e5345: 4 cores
  • Does not model instruction-level/task-level parallelism

SLIDE 15

Multi what?

Multitasking: tasks share a processor
Multithreading: threads share a processor
Multiprocessors: using multiple processors

  • For example, multi-core processors (multiple processors on the same chip)
  • Scheduling of tasks/subtasks needed


SLIDE 16

Multi-core processors

Power consumption has become a limiting factor.

Key advantage: lower power consumption for the same performance

  • Ex: 20% lower clock frequency: 87% performance, 51% power

A processor can switch to a lower frequency to reduce power. N cores can run N or more threads.
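The 51% power figure is consistent with the rough rule that dynamic power scales as capacitance × voltage² × frequency, and that supply voltage can often be lowered along with frequency, so power goes roughly as the cube of frequency. (The 87% performance figure depends on workload assumptions not shown on this slide.) A sketch of the cubic model:

```python
# Rough dynamic-power model: P ~ C * V^2 * f, with V scaling
# down in proportion to f, giving P ~ f^3.
def relative_power(freq_scale):
    return freq_scale ** 3

print(relative_power(0.8))  # about 0.512, i.e. ~51% power at 80% frequency
```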


SLIDE 17

Multi-core processors

Cores may be identical or specialized. Higher-level caches are shared; lower-level cache coherency is required. Cores may use superscalar or simultaneous multithreading architectures.


SLIDE 18

LC-3 states


Instruction            Cycles
ADD, AND, NOT, JMP     5
TRAP                   8
LD, LDR, ST, STR       7
LDI, STI               9
BR                     5, 6
JSR                    6
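The cycle counts above can be combined with an instruction mix to estimate the LC-3's average CPI. The mix fractions below are invented for illustration (only the cycle counts come from the table; BR is taken at the midpoint of 5 and 6):

```python
# Average LC-3 CPI from the cycle table, weighted by an assumed
# instruction mix. The fractions are illustrative, not measured.
cycles = {"ADD/AND/NOT/JMP": 5, "TRAP": 8, "LD/LDR/ST/STR": 7,
          "LDI/STI": 9, "BR": 5.5, "JSR": 6}
mix    = {"ADD/AND/NOT/JMP": 0.45, "TRAP": 0.01, "LD/LDR/ST/STR": 0.25,
          "LDI/STI": 0.04, "BR": 0.20, "JSR": 0.05}

avg_cpi = sum(mix[k] * cycles[k] for k in cycles)
print(round(avg_cpi, 2))  # about 5.8, close to the "about 6" ideal CPI on slide 5
```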