Datapath component (4) Prof. Usagi

Recap: Memory “hierarchy” in modern processor architectures
• Processor core registers: 32 or 64 words, < 1 ns (fastest)
• L1 $, L2 $, L3 $ (SRAM): KBs ~ MBs, a few ns
• DRAM: GBs, tens of ns
• Storage: TBs (larger)

Program-erase cycles: SLC vs. MLC vs. TLC vs. QLC

Recap: Flash memory characteristics
• Regarding the following flash memory characteristics, please identify how many of the following statements are correct:
① Flash memory cells can only be programmed a limited number of times
② The read latency of flash memory cells can differ greatly from the programming latency
③ The latency of programming different flash memory pages can be different
④ A programmed cell cannot be reprogrammed unless its charge level is refilled to the top level
A. 0  B. 1  C. 2  D. 3  E. 4

If the programmer doesn’t know flash “features”
• Software designers should be aware of the characteristics of the underlying hardware components

Recap: Clock signal
[Figure: clock waveform on a time axis from 0 ns to 90 ns]
• Clock: pulsing signal for enabling latches; ticks like a clock
• The clock's period must be longer than the longest delay from the state register's output to the state register's input, known as the critical path
• Synchronous circuit: sequential circuit with a clock
• Clock period: time between pulse starts
  • Above signal: period = 20 ns
• Clock cycle: one such time interval
  • Above signal shows 3.5 clock cycles
• Clock duty cycle: fraction of the period during which the clock is high
  • 50% in this case
• Clock frequency: 1/period
  • Above: freq = 1 / 20 ns = 50 MHz
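The period, frequency, and duty-cycle relations can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and the 10 ns high time is derived from the slide's 50% duty cycle and 20 ns period rather than stated on the slide.

```python
def clock_frequency_hz(period_ns: float) -> float:
    """Clock frequency in Hz for a period given in nanoseconds (freq = 1 / period)."""
    return 1.0 / (period_ns * 1e-9)


def duty_cycle(high_time_ns: float, period_ns: float) -> float:
    """Fraction of each period during which the clock is high."""
    return high_time_ns / period_ns


if __name__ == "__main__":
    period_ns = 20  # period of the clock signal in the recap
    print(f"{clock_frequency_hz(period_ns) / 1e6:.0f} MHz")
    # Assumed high time: 50% of the 20 ns period.
    print(f"{duty_cycle(10, period_ns):.0%} duty cycle")
```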

Recap: Serial Adders
[Figure: a full adder with inputs a_i, b_i, c_i and outputs s_i, c_{i+1}; the carry is fed back under control of Clk]

Excitation Table of Serial Adder
[Figure: full adder with inputs a_i, b_i and output s_i; the carry is held in a D flip-flop (D, Q)]

a_i  b_i  c_i | c_{i+1}  s_i
 0    0    0  |    0      0
 0    0    1  |    0      1
 0    1    0  |    0      1
 0    1    1  |    1      0
 1    0    0  |    0      1
 1    0    1  |    1      0
 1    1    0  |    1      0
 1    1    1  |    1      1
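A small Python model of the serial adder reproduces this table and shows how the carry stored in the flip-flop is reused on the next cycle. It is a sketch under stated assumptions: the function names are hypothetical, bits are fed least-significant first, and the flip-flop is assumed to reset to 0.

```python
from itertools import product


def full_adder(a: int, b: int, c: int) -> tuple[int, int]:
    """One-bit full adder: returns (s_i, c_{i+1}) for inputs a_i, b_i, c_i."""
    s = a ^ b ^ c
    c_next = (a & b) | (a & c) | (b & c)
    return s, c_next


def excitation_table() -> list[tuple[int, int, int, int, int]]:
    """Rows of (a_i, b_i, c_i, c_{i+1}, s_i) in the same order as the table."""
    rows = []
    for a, b, c in product((0, 1), repeat=3):
        s, c_next = full_adder(a, b, c)
        rows.append((a, b, c, c_next, s))
    return rows


def serial_add(a: int, b: int, width: int = 4) -> tuple[int, int]:
    """Add two width-bit numbers one bit per clock cycle, LSB first."""
    carry = 0  # state of the D flip-flop, assumed to reset to 0
    result = 0
    for i in range(width):  # one loop iteration models one clock cycle
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        result |= s << i
    return result, carry  # final carry is the carry-out


if __name__ == "__main__":
    for row in excitation_table():
        print(*row)
    print(serial_add(5, 3))
```

The loop makes the cost visible: a `width`-bit add takes `width` clock cycles, because only one full adder exists and it is reused every cycle.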

Critical path of the circuit?
[Figure: serial adder circuit (inputs a_i, b_i, output s_i, D flip-flop) with four candidate paths marked A, B, C, and D]
• Assume each gate delay is 1 ns and the delay in a register is 2 ns. Which of the following paths determines the “cycle time” of the circuit?
A. A
B. B
C. C
D. D

Cycle time of the circuit?
[Figure: serial adder circuit with inputs a_i, b_i, output s_i, and a D flip-flop]
• Assume each gate delay is 1 ns and the delay in a register is 2 ns. What’s the cycle time of the circuit?
A. 2 ns
B. 3 ns
C. 4 ns
D. 5 ns
E. 6 ns

Recap: Frequency
• Consider the following adders. Assume each gate delay is 1 ns and the delay in a register is 2 ns. Please rank their maximum operating frequencies.
① 32-bit CLA made with 8 4-bit CLA adders: 1 / 17 ns = 58.8 MHz
② 32-bit CRA made with 32 full adders: 1 / 64 ns = 15.6 MHz
③ 32-bit serial adders made with 4-bit CLA adders: 1 / 5 ns = 200 MHz
④ 32-bit serial adders made with 1-bit full adders: 1 / 4 ns = 250 MHz
A. (1) > (2) > (3) > (4)
B. (2) > (1) > (4) > (3)
C. (2) > (1) > (3) > (4)
D. (4) > (3) > (2) > (1)
E. (4) > (3) > (1) > (2)
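The four frequencies follow directly from frequency = 1 / cycle time. The sketch below recomputes them from the cycle times on the slide and sorts the adders from fastest to slowest clock; the dictionary and function names are illustrative only.

```python
# Cycle time in ns for each adder, taken from the slide.
CYCLE_TIME_NS = {
    "(1) 32-bit CLA, 8 x 4-bit CLA": 17,
    "(2) 32-bit CRA, 32 full adders": 64,
    "(3) serial adder, 4-bit CLA": 5,
    "(4) serial adder, 1-bit full adder": 4,
}


def max_frequency_mhz(cycle_time_ns: float) -> float:
    """Maximum clock frequency in MHz: 1 / period, with 1 / ns = 1000 MHz."""
    return 1e3 / cycle_time_ns


def rank_by_frequency(cycle_times: dict[str, float]) -> list[str]:
    """Adder names ordered from highest to lowest maximum frequency."""
    return sorted(cycle_times, key=lambda name: max_frequency_mhz(cycle_times[name]), reverse=True)


if __name__ == "__main__":
    for name in rank_by_frequency(CYCLE_TIME_NS):
        print(f"{name}: {max_frequency_mhz(CYCLE_TIME_NS[name]):.1f} MHz")
```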

Recap: Area/Delay of adders
• Consider the following adders. Please rank their area and their delay.
① 32-bit CLA made with 8 4-bit CLA adders
   Each CLA: 2-gate delay; 8*2+1 ~ 17
② 32-bit CRA made with 32 full adders
   Each carry: 2-gate delay; 64
③ 32-bit serial adders made with 4-bit CLA adders
   Each CLA: (3-gate delay + 2 ns register delay) * 8 cycles; 5*8+1 = 41
④ 32-bit serial adders made with 1-bit full adders
   Each full adder: (2-gate delay + 2 ns register delay) * 32 cycles; 4*32 = 128
A. Area: (1) > (2) > (3) > (4)  Delay: (1) < (2) < (3) < (4)
B. Area: (1) > (3) > (2) > (4)  Delay: (1) < (3) < (2) < (4)
C. Area: (1) > (3) > (4) > (2)  Delay: (1) < (3) < (4) < (2)
D. Area: (1) > (2) > (3) > (4)  Delay: (1) < (3) < (2) < (4)
E. Area: (1) > (3) > (2) > (4)  Delay: (1) < (3) < (4) < (2)

Frequency != End-to-end latency
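To make the point concrete, the sketch below puts the cycle times and the end-to-end delays from the two recap slides side by side and sorts the four adders both ways. The numbers are copied from the slides; the variable names are illustrative.

```python
# Adder ids: 1 = 32-bit CLA, 2 = 32-bit CRA,
# 3 = serial adder with 4-bit CLA, 4 = serial adder with 1-bit full adder.
cycle_time_ns = {1: 17, 2: 64, 3: 5, 4: 4}    # clock period per adder
latency_ns = {1: 17, 2: 64, 3: 41, 4: 128}    # time to finish one 32-bit add

# Shortest cycle time means highest clock frequency.
by_frequency = sorted(cycle_time_ns, key=cycle_time_ns.get)
# Shortest end-to-end delay means one add finishes soonest.
by_latency = sorted(latency_ns, key=latency_ns.get)

if __name__ == "__main__":
    print("fastest clock first:   ", by_frequency)
    print("shortest latency first:", by_latency)
```

Adder (4) has the fastest clock but the longest delay for a single add, because it needs 32 cycles to produce one result.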

Outline
• Pipelining
• Multipliers

Pipelining
• Different parts of the hardware work on different requests/commands simultaneously
• A clock signal controls and synchronizes the beginning and the end of each part/stage of the work
• A pipeline register sits between different parts of the hardware to keep the intermediate results necessary for the upcoming work
• A register is basically an array of flip-flops!

Pipelining a 4-bit serial adder
[Figure: four serial adder stages in sequence: Serial Adder #1, Serial Adder #2, Serial Adder #3, Serial Adder #4]

Pipelining a 4-bit serial adder
[Figure: pipeline timing diagram with time t on the horizontal axis, one column per cycle. The operations add a, b; add c, d; add e, f; add g, h; add i, j; add k, l; add m, n; add o, p; add q, r; add s, t; add u, v each pass through the 1st, 2nd, 3rd, and 4th stages, and each add starts one cycle after the previous one.]
• After this point, we are completing an add operation each cycle!
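The timing diagram can be regenerated with a short simulation. This sketch assumes an ideal pipeline: one new add enters every cycle, every stage takes exactly one cycle, and nothing stalls. The function names and the `S1`..`S4` stage labels are illustrative.

```python
def completion_cycles(num_ops: int, num_stages: int) -> list[int]:
    """Cycle (1-based) in which each operation leaves the last stage."""
    return [op + num_stages for op in range(num_ops)]


def total_cycles(num_ops: int, num_stages: int) -> int:
    """Cycles needed to finish all operations: fill the pipe, then one per cycle."""
    return num_stages + num_ops - 1


def render(op_names: list[str], num_stages: int) -> str:
    """Text version of the diagram: one row per add, one 4-character column per cycle."""
    lines = []
    for i, name in enumerate(op_names):
        cells = ["    "] * i + [f"S{s + 1:<3}" for s in range(num_stages)]
        lines.append(f"{name:<10}" + "".join(cells).rstrip())
    return "\n".join(lines)


if __name__ == "__main__":
    ops = ["add a, b", "add c, d", "add e, f", "add g, h", "add i, j"]
    print(render(ops, num_stages=4))
    print("completion cycles:", completion_cycles(len(ops), 4))
```

The first result appears after 4 cycles; from then on, one add completes per cycle.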

What if we have millions of adds to do?
• Consider the following adders. Assume each gate delay is 1 ns and the delay in a register is 2 ns, and we are processing 10 million add operations. Please rank their total time to finish these 10 million adds.
① 32-bit CLA made with 8 4-bit CLA adders
② 32-bit CRA made with 32 full adders
③ 8-stage, pipelined 32-bit serial adders made with 4-bit CLA adders
④ 32-stage, pipelined 32-bit serial adders made with 1-bit full adders
A. (1) < (2) < (3) < (4)
B. (2) < (1) < (4) < (3)
C. (3) < (4) < (2) < (1)
D. (4) < (3) < (2) < (1)
E. (4) < (3) < (1) < (2)
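One way to work this out is sketched below. It rests on assumptions not spelled out on the slide: the non-pipelined adders start a new add only after the previous one finishes, the pipelined adders accept one new add every cycle with no stalls, and the cycle times (17, 64, 5, and 4 ns) are the ones from the frequency recap.

```python
N_ADDS = 10_000_000


def unpipelined_total_ns(latency_ns: int, n_ops: int) -> int:
    """Each add occupies the whole adder for its full latency."""
    return latency_ns * n_ops


def pipelined_total_ns(cycle_ns: int, n_stages: int, n_ops: int) -> int:
    """Fill the pipeline once, then complete one add per cycle."""
    return cycle_ns * (n_stages + n_ops - 1)


totals_ns = {
    1: unpipelined_total_ns(17, N_ADDS),   # 32-bit CLA
    2: unpipelined_total_ns(64, N_ADDS),   # 32-bit CRA
    3: pipelined_total_ns(5, 8, N_ADDS),   # 8-stage pipeline, 5 ns cycle
    4: pipelined_total_ns(4, 32, N_ADDS),  # 32-stage pipeline, 4 ns cycle
}
ranking = sorted(totals_ns, key=totals_ns.get)  # shortest total time first

if __name__ == "__main__":
    for adder in ranking:
        print(f"({adder}): {totals_ns[adder] / 1e6:.3f} ms")
```

Under these assumptions the pipeline fill time (a few dozen cycles) is negligible next to 10 million operations, so total time is governed by cycle time rather than by the latency of a single add.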

Latency/Delay vs. Bandwidth/Throughput
• Latency: the amount of time to finish an operation
  • access time
  • response time
• Throughput: the amount of work that can be done within a given period of time
  • bandwidth (MB/sec, GB/sec, Mbps, Gbps)
  • IOPS
  • MFLOPS

Latency/Delay vs. Throughput

Toyota Prius
• 100 miles (161 km) from UCSD
• 75 MPH on highway!
• Max load: 374 kg = 2,770 hard drives (2 TB per drive)
• Bandwidth: 290 GB/sec
• Latency: 3.5 hours
• Response time: you see nothing in the first 3.5 hours

100 Gb Network
• 100 miles (161 km) from UCSD
• Lightspeed! 3*10^8 m/sec
• Max load: 4 lanes operating at 25 GHz
• Bandwidth: 100 Gb/s or 12.5 GB/sec
• Latency: 2 petabytes over 167772 seconds = 1.94 days
• Response time: you can start watching the movie as soon as you get a frame!
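The network-side latency figure can be reproduced as follows. This is a sketch that assumes binary prefixes for the data size (1 PB = 2^20 GB), which is the interpretation that matches the 167772-second figure; the function name is illustrative.

```python
def transfer_seconds(data_gb: float, bandwidth_gb_per_s: float) -> float:
    """Time to move data_gb gigabytes over a link with the given bandwidth."""
    return data_gb / bandwidth_gb_per_s


DATA_GB = 2 * 2**20            # 2 PB, assuming 1 PB = 2**20 GB
BANDWIDTH_GB_PER_S = 100 / 8   # 100 Gb/s link = 12.5 GB/s

seconds = transfer_seconds(DATA_GB, BANDWIDTH_GB_PER_S)
days = seconds / 86_400        # 86,400 seconds per day

if __name__ == "__main__":
    print(f"{seconds:.0f} s = {days:.2f} days")
```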

Area/Cost
• Consider the following adders. Please rank the number of transistors needed to implement each of them.
① 32-bit CLA made with 8 4-bit CLA adders
② 32-bit CRA made with 32 full adders
③ 8-stage, pipelined 32-bit serial adders made with 4-bit CLA adders
④ 32-stage, pipelined 32-bit serial adders made with 1-bit full adders
A. (1) > (2) > (3) > (4)
B. (2) > (1) > (4) > (3)
C. (3) > (4) > (2) > (1)
D. (4) > (3) > (2) > (1)
E. (4) > (3) > (1) > (2)

Recap: CLA’s size
• How many transistors do we need to implement a 4-bit CLA logic?
A. 38
B. 64
C. 88
D. 116
E. 128

S_i = A_i XOR B_i XOR C_i
G_i = A_i B_i
P_i = A_i XOR B_i

C_1 = G_0 + P_0 C_0
    transistors: 4 + 4 = 8
C_2 = G_1 + P_1 C_1
    = G_1 + P_1 (G_0 + P_0 C_0)
    = G_1 + P_1 G_0 + P_1 P_0 C_0
    transistors: 4 + 6 + 6 = 16
C_3 = G_2 + P_2 C_2
    = G_2 + P_2 G_1 + P_2 P_1 G_0 + P_2 P_1 P_0 C_0
    transistors: 4 + 6 + 8 + 8 = 26
C_4 = G_3 + P_3 C_3
    = G_3 + P_3 G_2 + P_3 P_2 G_1 + P_3 P_2 P_1 G_0 + P_3 P_2 P_1 P_0 C_0
    transistors: 4 + 6 + 8 + 10 + 10 = 38
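The expanded carry equations and the per-carry transistor counts can both be checked mechanically. The sketch below makes one assumption that is inferred from the numbers rather than stated on the slide: an n-input gate costs 2n transistors. It also counts only the carry logic C_1 to C_4, as the sums above do. All function names are illustrative.

```python
from itertools import product


def lookahead_carries(a: list[int], b: list[int], c0: int) -> list[int]:
    """C_1..C_4 from the fully expanded carry-lookahead equations."""
    g = [x & y for x, y in zip(a, b)]  # G_i = A_i B_i
    p = [x ^ y for x, y in zip(a, b)]  # P_i = A_i XOR B_i
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))
    return [c1, c2, c3, c4]


def ripple_carries(a: list[int], b: list[int], c0: int) -> list[int]:
    """C_1..C_4 from the recursive form C_{i+1} = G_i + P_i C_i."""
    carries, c = [], c0
    for x, y in zip(a, b):
        c = (x & y) | ((x ^ y) & c)
        carries.append(c)
    return carries


def carry_transistors(k: int) -> int:
    """Transistors for C_k, assuming an n-input gate costs 2n transistors."""
    and_gates = sum(2 * n for n in range(2, k + 2))  # AND terms with 2..k+1 inputs
    or_gate = 2 * (k + 1)                            # one OR over k+1 terms
    return and_gates + or_gate


if __name__ == "__main__":
    # The expanded equations must agree with the recursive ones on all 2^9 inputs.
    for bits in product((0, 1), repeat=9):
        a, b, c0 = list(bits[0:4]), list(bits[4:8]), bits[8]
        assert lookahead_carries(a, b, c0) == ripple_carries(a, b, c0)
    counts = [carry_transistors(k) for k in range(1, 5)]
    print(counts, "total:", sum(counts))
```

The four per-carry counts sum to 8 + 16 + 26 + 38 = 88.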
