CS356 Unit 12: Processor Hardware Organization & Pipelining



12.1

CS356 Unit 12

Processor Hardware Organization Pipelining


12.2

BASIC HW

From combinational to sequential logic


12.3

Logic Circuits

  • Combinational Logic

– Performs a specific function (a mapping of 2^n input combinations to desired output combinations)
– No internal state or feedback

  • Given a set of inputs, we will always get the same output after some time (the propagation delay)

  • Sequential Logic

– Registers: fundamental building blocks

  • Remembers a set of bits for later use
  • Acts like a variable from software
  • Controlled by a "clock" signal

Outputs depend only on current inputs

Inputs -> Combinational Logic -> Outputs (usually operations like +, -, *, /, &, |, <<)

Outputs depend on current inputs and previous inputs (previous inputs summarized by current state)

Sequential Logic: current inputs and a register holding "state" feed the combinational logic; the state values feed back and provide "memory".
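The feedback structure above can be sketched in Python (a minimal model, not from the slides; the class and variable names are hypothetical):

```python
# Minimal sketch of sequential logic: a clocked register whose output
# feeds a combinational adder, whose result feeds back to the register.

class Register:
    """Edge-triggered register: the output Q changes only on clock()."""
    def __init__(self, init=0):
        self.q = init        # current stored value (output)
        self.d = init        # value waiting at the D input

    def clock(self):
        self.q = self.d      # rising edge: capture D at Q

reg = Register()
for current_input in [1, 0, 1]:       # inputs over three cycles
    reg.d = reg.q + current_input     # combinational path (adder)
    reg.clock()                       # rising clock edge latches the sum

print(reg.q)  # state after three cycles -> 2
```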


12.4

Combinational Logic Gates

  • Circuits called gates perform logic operations to

produce desired outputs from some digital inputs

Example gate circuit: OR, AND, and NOT gates computing outputs from digital inputs.


12.5

Propagation Delay

  • All digital logic circuits have propagation delay

– Time for output to change when inputs are changed

Example: 4 gate delays for an input change to propagate to the outputs.


12.6

Combinational Logic Functions

  • Map input combinations of n-bits

to desired m-bit output

  • Can describe function with a truth table and

then find its circuit implementation

Figure: a logic circuit with inputs IN0-IN2 and outputs OUT0-OUT1, defined by a truth table.
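A combinational function like this can be modeled as a pure table lookup. The table values below are hypothetical (the slide's entries are not recoverable), but the sketch shows the key property: the same inputs always give the same outputs.

```python
# Hypothetical 3-input, 2-output combinational function as a truth table:
# every input combination maps to one fixed output (no state, no feedback).

TRUTH_TABLE = {
    # (IN2, IN1, IN0): (OUT1, OUT0)
    (0, 0, 0): (0, 0),
    (0, 0, 1): (0, 1),
    (0, 1, 0): (0, 1),
    (0, 1, 1): (1, 0),
    (1, 0, 0): (0, 1),
    (1, 0, 1): (1, 0),
    (1, 1, 0): (1, 0),
    (1, 1, 1): (1, 1),
}

def logic_circuit(in2, in1, in0):
    # Combinational: output depends only on the current inputs.
    return TRUTH_TABLE[(in2, in1, in0)]

print(logic_circuit(0, 1, 1))  # -> (1, 0)
```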


12.7

ALUs

  • Perform a selected operation on two input numbers

– FS[5:0] selects the desired operation

ALU block: 32-bit inputs X[31:0] and Y[31:0] (A, B), carry-in C0, 32-bit result RES[31:0], flags OF and ZF; the 6-bit function select FS[5:0] chooses the operation.

Func. Code -> Op.:
00_0000: A SHL B
00_0010: A SHR B
00_0011: A SAR B
01_1000: A * B
01_1001: A * B (uns.)
01_1010: A / B
01_1011: A / B (uns.)
10_0000: A + B
10_0010: A - B
10_0100: A AND B
10_0101: A OR B
10_0110: A XOR B
10_0111: A NOR B
10_1010: A < B
… (remaining codes omitted)

Example: FS = 10_0010 (A - B) with A = 0x80000000, B = 0x7fffffff gives RES = 0x00000001 (CF = 1 in the slide's figure).
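A few rows of the function-code table can be sketched as a Python ALU model. This is an illustration, not the course's implementation: only four codes are modeled, results wrap to 32 bits, and flag handling is simplified to ZF.

```python
# Sketch of the slide's ALU: FS[5:0] selects the operation applied to
# 32-bit inputs A and B. Only a few function codes are modeled here.

MASK32 = 0xFFFFFFFF

def alu(a, b, fs):
    if fs == 0b100000:        # A + B
        res = a + b
    elif fs == 0b100010:      # A - B
        res = a - b
    elif fs == 0b100100:      # A AND B
        res = a & b
    elif fs == 0b100101:      # A OR B
        res = a | b
    else:
        raise ValueError("function code not modeled in this sketch")
    res &= MASK32             # truncate result to 32 bits
    zf = int(res == 0)        # zero flag
    return res, zf

# The slide's example: FS = 10_0010 (A - B)
print(alu(0x80000000, 0x7FFFFFFF, 0b100010))  # -> (1, 0)
```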


12.8

Sequential Devices (Registers)

  • Registers capture the D input value when a control input

(clock signal) transitions from 0 to 1 (clock edge) and store that value at the Q output until the next clock edge

  • A register is similar to a variable in software: at the clock

edge, it stores a value for later use.

  • We can choose to only clock the register at "certain"

times when we want the register to capture a new value (e.g., when it is the destination of an instruction)

  • Key Idea

Registers store data while we operate on those values

Block diagram of a register: a clock pulse (positive edge) causes q(t) to sample and hold the current d(t) value until another clock pulse. Example: %rax feeds the ALU, and the sum is written back into %rax (add %rbx,%rax then add %rcx,%rax).


12.9

Clock Signal

  • Alternating high/low voltage

pulse train

  • Controls the ordering and timing of operations performed in the processor

  • 1 cycle is usually measured from

rising edge to rising edge

  • Clock frequency = # of cycles per second (e.g., 2.8 GHz = 2.8 × 10^9 cycles per second)

Figure: the clock signal, an alternating 0 (0 V) / 1 (5 V) waveform, drives the processor; 1 cycle spans rising edge to rising edge, and 2.8 GHz = 2.8 × 10^9 cycles per second ≈ 0.357 ns/cycle, sequencing Op. 1, Op. 2, Op. 3.
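The frequency-to-period arithmetic can be checked directly:

```python
# Converting clock frequency to cycle time, as in the slide's example:
# a period in nanoseconds is the reciprocal of the frequency in GHz.

def cycle_time_ns(freq_ghz):
    """Clock period in ns for a clock of freq_ghz GHz."""
    return 1.0 / freq_ghz

print(round(cycle_time_ns(2.8), 3))  # 2.8 GHz -> 0.357 ns/cycle
```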

12.10

FROM X86 TO RISC

Basic HW organization for a simplified instruction set


12.11

From CISC to RISC

  • Complex Instruction Set Computers (CISC) often have instructions that vary widely in how much work they perform and how much time they take to execute

– Fewer instructions are needed for a task

  • Reduced Instruction Set Computers (RISC)

favor instructions that take roughly the same time to execute and follow a common sequence of steps

– More instructions needed, each faster

// CISC instruction
movq 0x40(%rdi, %rsi, 4), %rax

// RISC equivalent with 1 memory or ALU
// operation per instruction
mov %rsi, %rbx     # use %rbx as a temp.
shl $2, %rbx       # %rsi * 4
add %rdi, %rbx     # %rdi + (%rsi*4)
add $0x40, %rbx    # 0x40 + %rdi + (%rsi*4)
mov (%rbx), %rax   # %rax = *%rbx

CISC vs. RISC Equivalents

John Hennessy and David Patterson, ACM Turing Award Lecture, 2017


12.12

A RISC Subset of x86

  • Split mov instructions that access memory

into separate instructions:

– ld = Load/Read from memory
– st = Store/Write to memory

  • Limit ld & st instructions to use at most

indirect w/ displacement

– No ld 0x04(%rdi, %rsi, 4), %rax

  • Too much work

– At most ld 0x40(%rdi), %rax or st %rax, 0x40(%rdi)

  • Limit arithmetic & logic instructions to only operate on registers

– No add (%rsp), %rax since this implicitly accesses (dereferences) memory
– Only add %reg1, %reg2

// CISC instruction
add %rax, (%rsp)

// Equivalent RISC sequence with ld / st
ld 0(%rsp), %rbx
add %rax, %rbx
st %rbx, 0(%rsp)

// 3 x86 memory read instructions
mov (%rdi), %rax            // 1
mov 0x40(%rdi), %rax        // 2
mov 0x40(%rdi,%rsi), %rax   // 3

// Equivalent load sequences
ld 0x0(%rdi), %rax          // 1
ld 0x40(%rdi), %rax         // 2
mov %rsi, %rbx              // 3a
add %rdi, %rbx              // 3b
ld 0x40(%rbx), %rax         // 3c

// 3 x86 memory write instructions
mov %rax, (%rdi)            // 1
mov %rax, 0x40(%rdi)        // 2
mov %rax, 0x40(%rdi,%rsi)   // 3

// Equivalent store sequences
st %rax, 0x0(%rdi)          // 1
st %rax, 0x40(%rdi)         // 2
mov %rsi, %rbx              // 3a
add %rdi, %rbx              // 3b
st %rax, 0x40(%rbx)         // 3c
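The load sequence 3a-3c above can be checked in Python: the RISC register arithmetic computes the same effective address as the x86 mode 0x40(%rdi,%rsi). This is a sketch with arbitrary example register values.

```python
# Checking that the RISC sequence (3a-3c) computes the same effective
# address as the addressing mode 0x40(%rdi,%rsi) = %rdi + %rsi + 0x40.

def cisc_addr(rdi, rsi):
    return rdi + rsi + 0x40

def risc_addr(rdi, rsi):
    rbx = rsi             # mov %rsi, %rbx
    rbx += rdi            # add %rdi, %rbx
    return rbx + 0x40     # ld 0x40(%rbx), ... addresses %rbx + 0x40

print(cisc_addr(0x1000, 8) == risc_addr(0x1000, 8))  # -> True
```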


12.13

Developing a Processor Organization

Hardware components used by each instruction type:

Instruction types: ALU-type (add %rax,%rbx), LD (ld 8(%rax),%rbx), ST (st %rbx, 8(%rax)), JE (je label/displacement)

Components: PC; I-Cache / I-MEM (Addr., Data); D-Cache / D-MEM (Addr., Data); Registers (%rax, %rbx, etc., aka RegFile); ALU (Res., Zero); Cond. Codes

ALU-type: 1. PC (addr. of instruc.) -> 2. I-Cache (fetch instruc.) -> 3. Registers (get %rax, %rbx) -> 4. ALU (sum %rax + %rbx) -> 5. Registers (save result to %rbx)

LD: 1. PC (addr. of instruc.) -> 2. I-Cache (fetch instruc.) -> 3. Registers (get %rax) -> 4. ALU (sum %rax + 8) -> 5. D-Cache (read data) -> 6. Registers (save data to %rbx)

ST: 1. PC (addr. of instruc.) -> 2. I-Cache (fetch instruc.) -> 3. Registers (get %rax / %rbx) -> 4. ALU (sum %rax + 8) -> 5. D-Cache (write %rbx data)

JE: 1. PC (addr. of instruc.) -> 2. I-Cache (fetch instruc.) -> 3. ALU (if cond = TRUE, PC = PC + disp.)

12.14

Processor Block Diagram

Stages: Fetch (PC, I-Cache) -> Decode (Decoder, Registers aka RegFile) -> Exec. (ALU, flags ZF/OF/CF/SF) -> Mem (D-Cache: Addr/Data) -> WB

Values passed between stages: Instruction (machine code) -> Operands -> ALU output (addr. or result) -> Data to write to dest. register

Control signals (e.g., ALU operation, read/write D-Cache, etc.) accompany each instruction.

Clock cycle time = sum of delay through the worst-case pathway = 50 ns (10 ns per stage)


12.15

Processor Execution (add)

The 5-stage datapath (Fetch, Decode, Exec., Mem, WB) executes add %rax,%rdx [machine code: 48 01 c2]:

Fetch: PC and I-Cache supply the instruction.
Decode: the decoder and RegFile read operands %rax and %rdx.
Exec.: the ALU computes %rax + %rdx and updates flags ZF/OF/CF/SF.
Mem: no memory access.
WB: the result is written back: %rdx = %rax + %rdx.


12.16

Processor Execution (ld)

The 5-stage datapath executes ld 0x40(%rbx),%rax [machine code: 48 8b 43 40]:

Fetch: PC and I-Cache supply the instruction.
Decode: the RegFile reads %rbx; the displacement is 0x40.
Exec.: the ALU computes the address %rbx + 0x40.
Mem: the D-Cache reads the data at that address.
WB: the data is written back: %rax = data.


12.17

Processor Execution (st)

The 5-stage datapath executes st %rax,0x40(%rbx) [machine code: 48 89 43 40]:

Fetch: PC and I-Cache supply the instruction.
Decode: the RegFile reads %rbx (base) and %rax (data); the displacement is 0x40.
Exec.: the ALU computes the address %rbx + 0x40.
Mem: the %rax data is written to the D-Cache at that address.
WB: nothing to write back.


12.18

Processor Execution (je)

The 5-stage datapath executes je L1 (disp. = 0x08) [machine code: 74 08]:

Fetch: PC and I-Cache supply the instruction.
Decode: the displacement 0x08 is extracted.
Exec.: the ALU computes PC + 0x08 and the condition codes are checked (here ZF = 1).
Mem/WB: since the condition is true, the PC is updated to PC + 0x08.


12.19

PIPELINING


12.20

Example

for(i=0; i < 100; i++)
    C[i] = (A[i] + B[i]) / 4;

10 ns per input set = 1000 ns total

Figure: a counter (Cntr) generates i; arrays A and B in memory supply A[i] and B[i], and C[i] is written back.


12.21

Pipelining Example

for(i=0; i < 100; i++)
    C[i] = (A[i] + B[i]) / 4;

Stage 1 computes the add; Stage 2 computes the divide:

Clock 0: Stage 1: A[0] + B[0]
Clock 1: Stage 1: A[1] + B[1]; Stage 2: (A[0] + B[0]) / 4
Clock 2: Stage 1: A[2] + B[2]; Stage 2: (A[1] + B[1]) / 4

Pipelining refers to inserting registers to split combinational logic into smaller stages that can be overlapped in time (i.e., creating an assembly line)
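The two-stage schedule above can be simulated in Python (a sketch using a 3-element example; integer division stands in for the divide-by-4 stage, and the tuple between the stages plays the role of the pipeline register):

```python
# Two-stage pipeline: stage 1 adds, stage 2 divides by 4, and the stages
# overlap so that one result completes per clock once the pipe is full.

A = [4, 8, 12]
B = [4, 0, 4]
C = [None] * len(A)

stage1_out = None                     # "register" between the two stages
for clock in range(len(A) + 1):
    # Stage 2 works on the value latched at the end of the last cycle.
    if stage1_out is not None:
        i, s = stage1_out
        C[i] = s // 4
    # Stage 1 works on the next input pair (if any remain).
    stage1_out = (clock, A[clock] + B[clock]) if clock < len(A) else None

print(C)  # -> [2, 2, 4]
```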


12.22

Need for Registers

  • Provides separation between combinational functions

– Without registers, fast signals could "catch up" to data values in the next operation stage


12.23

Processors & Pipelines

  • Overlaps execution of multiple instructions
  • Natural breakdown into stages

– Fetch, Decode, Execute, Memory, Write-Back

  • Fetch an instruction, while decoding another, while

executing another

Pipelining (instruction view):

CLK 1: Inst 1: Fetch
CLK 2: Inst 1: Decode; Inst 2: Fetch
CLK 3: Inst 1: Execute; Inst 2: Decode; Inst 3: Fetch
CLK 4: Inst 2: Execute; Inst 3: Decode; Inst 4: Fetch

Pipelining (stage view):

Clk 1: Fetch: Inst. 1
Clk 2: Fetch: Inst. 2; Decode: Inst. 1
Clk 3: Fetch: Inst. 3; Decode: Inst. 2; Exec.: Inst. 1
Clk 4: Fetch: Inst. 4; Decode: Inst. 3; Exec.: Inst. 2
Clk 5: Fetch: Inst. 5; Decode: Inst. 4; Exec.: Inst. 3


12.24

Balancing Pipeline Stages

  • Clock period must equal the LONGEST

delay from register to register

  • Fig. 1: If total logic delay is 20ns => 50MHz

– Throughput: 1 instruc. / 20 ns

  • Fig. 2: Unbalanced stage delays limit the

clock speed to the slowest stage (worst case)

– Throughput: 1 instruc. / 10 ns => 100MHz

  • Fig. 3: Better to split into more, balanced

stages

– Throughput: 1 instruc. / 5 ns => 200MHz

  • Fig. 4: Are more stages better?

– Ideally: 2x stages => 2x throughput
– Throughput: 1 instruc. / 2.5 ns => 400MHz
– Each register adds extra delay, so at some point deeper pipelines don't pay off

Fig. 1: one 20 ns block of processor logic (Fetch + Decode + Execute) between registers.
Fig. 2: Fetch (5 ns), Decode (5 ns), Exec (10 ns) stages between registers.
Fig. 3: Fetch, Decode, Exec. 1, Exec. 2 stages (5 ns each).
Fig. 4: F1, F2, D1, D2, E1a, E1b, E2a, E2b stages (2.5 ns each).
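The figures' timing arithmetic can be sketched directly: the clock is set by the slowest register-to-register stage, and throughput is one instruction per clock.

```python
# The slide's rule: the clock period equals the LONGEST stage delay,
# so unbalanced stages waste the headroom of the faster ones.

def throughput_mhz(stage_delays_ns):
    period = max(stage_delays_ns)    # clock limited by the worst stage
    return 1000.0 / period           # 1000 ns/us -> MHz

print(throughput_mhz([20]))          # Fig. 1: one 20 ns block -> 50.0
print(throughput_mhz([5, 5, 10]))    # Fig. 2: slowest stage wins -> 100.0
print(throughput_mhz([5, 5, 5, 5]))  # Fig. 3: balanced 4-stage -> 200.0
```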

12.25

Balancing Pipeline Stages

Main Points:

  • Latency of any single

instruction is unaffected

  • Throughput, and thus overall program performance, can be dramatically improved

– Ideally a K-stage pipeline will lead to a throughput increase by a factor of K
– In reality, splitting stages adds some delay and thus hits a point of diminishing returns

Fig. 1: Non-pipelined: one 20 ns block (Fetch + Decode + Execute). Latency = 20 ns, Throughput = 1x.
Fig. 2: 4-stage pipeline: Fetch, Decode, Exec. 1, Exec. 2 (5 ns each). Latency = 20 ns, Throughput = 4x.
Fig. 3: 8-stage pipeline: F1, F2, D1, D2, E1a, E1b, E2a, E2b (2.5 ns each). Latency = 20 ns, Throughput = 8x.


12.26

Throughput and Latency

n | clock (ps) | tput (GIPS)
1 | 320 | 3.125
2 | 170 | 5.882
3 | 120 | 8.333
4 | 95 | 10.526
5 | 80 | 12.500
6 | 70 | 14.286

clock = 300/n + 20,  tput = 1/clock,  delay = n*clock
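The table follows directly from the slide's model: 300 ps of total logic split over n stages, plus 20 ps of register overhead per stage.

```python
# The slide's pipeline-depth model: deeper pipelines shrink the logic
# per stage, but each stage pays a fixed 20 ps register overhead.

def clock_ps(n):
    return 300 / n + 20            # per-stage delay incl. register

def tput_gips(n):
    return 1000 / clock_ps(n)      # instructions per ns = GIPS

for n in (1, 2, 6):
    print(n, clock_ps(n), round(tput_gips(n), 3))
# -> 1 320.0 3.125
# -> 2 170.0 5.882
# -> 6 70.0 14.286
```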


12.27

5-Stage Pipeline

Pipeline registers between stages carry: Instruction (machine code) -> Operands & control signals -> ALU output (addr. or result) -> Result of instruction

Stages: Fetch (PC, I-Cache) -> Decode (Decoder, Reg. File) -> Exec. (ALU, flags ZF/OF/CF/SF) -> Mem (D-Cache) -> WB


12.28

Pipelining

  • Let's see how a sequence of instructions can

be executed

Instruction sequence:

ld 0x40(%rbx),%rax
add %rcx,%rdx
je L1


12.29

Sample Sequence - 1

Fetch (LD): fetch the LD instruction; the Decode, Exec., Mem, and WB stages are still empty.


12.30

Sample Sequence - 2

Fetch (ADD): fetch the ADD instruction.
Decode (LD): decode ld 0x40(%rbx), %rax and fetch its operands.


12.31

Sample Sequence - 3

Fetch (JE): fetch the JE instruction.
Decode (ADD): decode add %rcx, %rdx and fetch its operands.
Exec. (LD): add the displacement 0x40 to %rbx (0x40 / %rbx / READ passes to the next stage).


12.32

Sample Sequence - 4

Fetch (i+1): fetch instruction i+1.
Decode (JE): decode the JE and fetch its operand (displacement).
Exec. (ADD): add %rcx + %rdx.
Mem (LD): read the word from memory at %rbx + 0x40.


12.33

Sample Sequence - 5

Fetch (i+2): fetch next instruction i+2.
Decode (i+1): decode instruction i+1 and fetch its operands.
Exec. (JE): check if the condition is true and add the displacement to the PC.
Mem (ADD): just pass the sum (%rcx + %rdx) to the next stage.
WB (LD): write the value from memory to %rax.


12.34

Sample Sequence - 6

Fetch (i+3): fetch next instruction i+3.
Decode (i+2): decode instruction i+2 and fetch its operands.
Exec. (i+1): use the ALU.
Mem (JE): update the PC (the branch outcome is known only here).
WB (ADD): write the sum to %rdx.


12.35

Sample Sequence - 7

The JE was taken, so the instructions fetched after it must be removed:

Fetch (target): fetch from the branch target.
Decode (i+3): delete i+3.
Exec. (i+2): delete i+2.
Mem (i+1): delete i+1.
WB (JE): do nothing (no input for JE).


12.36

HAZARDS

Problems from overlapping instruction execution…


12.37

Hazards

Hazards prevent parallel or overlapped execution!

  • Control Hazards

– Problem: We don't know what instruction to fetch next, but we need to fetch something
– Examples: Jumps (branches) and procedure calls

  • Data Hazards / Data Dependencies

– Problem: A later instruction needs data produced by a previous instruction
– Examples:

  • sub %rdx,%rax
  • add %rax,%rcx

  • Structural Hazards

– Problem: Due to limited resources, the HW doesn't support overlapping a certain sequence of instructions
– Examples: See next slides


12.38

Structural Hazards

  • Example structural hazard: A single cache rather

than separate instruction & data caches

– A structural hazard occurs any time an instruction needs to perform a data access (i.e., ld or st), since we always want to fetch a new instruction each clock cycle

Figure: with one cache, the LD in the Mem stage and the fetch of i+3 both need the cache in the same cycle: Hazard!


12.39

Data Hazard - 1

sub %rdx,%rax
add %rax,%rcx

Fetch (i+1): fetch i+1.
Decode (ADD): decode and get register operands. (Do we get the desired %rax value?)
Exec. (SUB): perform %rax - %rdx.


12.40

Data Hazard - 2

sub %rdx,%rax
add %rax,%rcx

Fetch (i+2): fetch i+2.
Decode (i+1): decode i+1.
Exec. (ADD): perform %rax + %rcx using the wrong value! The new value of %rax has not been written back yet.
Mem (SUB): the SUB's new value for %rax is still traveling down the pipeline.


12.41

Stalling

  • Solution 1: Halt/Stall the ADD instruction in the DECODE stage and insert nops into the pipeline until the new value of the needed register is present, at the cost of lower performance

sub %rdx,%rax
add %rax,%rcx

Fetch (i+1): stalled behind the ADD.
Decode (ADD): the ADD waits here until the new value of %rax is available.
Exec. (nop) / Mem (nop): bubbles inserted into the pipeline.
WB (SUB): the SUB finally writes the new value of %rax.


12.42

Forwarding

  • Solution 2: Create new hardware paths to hand-off (forward)

the data from the producing instruction in the pipeline to the consuming instruction

sub %rdx,%rax
add %rax,%rcx

Fetch (i+2): fetch i+2.
Decode (i+1): decode i+1.
Exec. (ADD): the new value of %rax is forwarded from the SUB (now in its Mem stage) directly to the ALU input, so the ADD uses the correct value.


12.43

Solving Data Hazards

  • Key Point: Data dependencies (i.e., instructions

needing values produced by earlier ones) limit performance

  • Forwarding solves many of the data hazards (data

dependencies) that exist

– It allows instructions to continue to flow through the pipeline without the need to stall and waste time
– The cost is additional hardware and added complexity

  • Even forwarding cannot solve all the issues

– A hazard still exists when a ld reads a value needed by the very next instruction (the load-use hazard)
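The stall rules above can be sketched as a small decision function (an illustration of the policy, not real pipeline hardware; the operand names are examples): an ALU result can be forwarded to the very next instruction, but a load's data arrives one stage later, so a load followed by a dependent instruction still costs one bubble.

```python
# Sketch: how many stall cycles a consumer needs behind its producer,
# assuming full forwarding in a 5-stage pipeline like the slides'.

def stalls_needed(producer_op, producer_dst, consumer_srcs):
    if producer_dst not in consumer_srcs:
        return 0          # no dependency, no stall
    if producer_op == "ld":
        return 1          # load-use hazard: one bubble even with forwarding
    return 0              # ALU result forwarded in time

print(stalls_needed("sub", "%rax", ["%rax", "%rcx"]))  # forwarded -> 0
print(stalls_needed("ld", "%rax", ["%rax", "%rcx"]))   # must stall -> 1
```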


12.44

LD + Dependent Instruction Hazard

  • Even forwarding cannot prevent the need to stall when a ld instruction

produces a value needed by the instruction behind it

– Would require performing 2 cycles' worth of work in only a single cycle

ld 8(%rdx),%rax
add %rax,%rcx

Fetch (i+2): fetch i+2.
Decode (i+1): decode i+1.
Exec. (ADD): needs the new %rax now, but only the old %rax value is available.
Mem (LD): the memory read (address 8 + %rdx) is happening only this cycle, too late to forward.


12.45

LD + Dependent Instruction Hazard

  • We would need to introduce 1 stall cycle (nop) into the

pipeline to get the timing correct

  • Keep this in mind as we move through the next slides

ld 8(%rdx),%rax
add %rax,%rcx

Fetch (i+2): fetch i+2.
Decode (i+1): decode i+1.
Exec. (ADD): after the one-cycle stall, the new value of %rax (now at the LD's WB stage) can be forwarded to the ADD in time.
Mem (nop): the inserted bubble.
WB (LD): the LD writes the new value of %rax.


12.46

Control Hazards

  • Branches/Jumps require us to know

– Where we want to jump to (aka the branch/jump target location): really just the new value of the PC
– Whether we should branch or not (checking the jump condition)

  • Problem: We often don't know those values until

deep in the pipeline and thus we are not sure what instructions should be fetched in the interim

– Requires us to flush unwanted instructions and waste time


12.47

Control Hazard - 1

Fetch (i+3): fetch next instruction i+3.
Decode (i+2): decode instruction i+2 and fetch its operands.
Exec. (i+1): use the ALU.
Mem (JE): update the PC (the branch outcome is known only here).
WB (ADD): write the sum to %rdx.

12.48

Control Hazard - 2

The JE was taken, so the sequentially fetched instructions are wrong:

Fetch (target): fetch from the branch target.
Decode (i+3): delete i+3.
Exec. (i+2): delete i+2.
Mem (i+1): delete i+1.
WB (JE): do nothing.

Need to "flush" wrongly fetched instructions


12.49

A FIRST LOOK: CODE REORDERING

Enlisting the help of the compiler


12.50

Two Sides of the Coin

  • If the hardware has some problems it

just can't solve, can software (i.e., the compiler) help?

– Yes!!

  • Compilers can re-order instructions to

take best advantage of the processor (pipeline) organization

  • Identify the dependencies that will

incur stalls and slow performance

– A load followed by a dependent add
– Jump instructions

void sum(int *data, int n, int x)
{
    for(int i=0; i < n; i++) {
        data[i] += x;
    }
}

sum: mov $0x0,%ecx
L1:  cmp %esi,%ecx
     jge L2
     ld 0(%rdi), %eax
     add %edx, %eax
     st %eax, 0(%rdi)
     add $4, %rdi
     add $1, %ecx
     j L1
L2:  retq

C code and its assembly translation


12.51

How Can the Compiler Help

  • Compilers are written with general parsing and

semantic representation front ends but architecture-specific backends that generate code optimized for a particular processor

  • Q: How could the compiler help improve pipelined performance while still maintaining the external behavior that the high-level code indicates?

  • A: By finding independent instructions and

reordering the code

– Could we have moved any other instruction into that slot? No!

Original code (incurring 1 stall cycle):

sum: mov $0x0,%ecx
L1:  cmp %esi,%ecx
     jge L2
     ld 0(%rdi), %eax
     stall/nop
     add %edx, %eax
     st %eax, 0(%rdi)
     add $4, %rdi
     add $1, %ecx
     j L1
L2:  retq

Updated code (w/ compiler reordering, filling the stall slot):

sum: mov $0x0,%ecx
L1:  cmp %esi,%ecx
     jge L2
     ld 0(%rdi), %eax
     add $1, %ecx
     add %edx, %eax
     st %eax, 0(%rdi)
     add $4, %rdi
     j L1
L2:  retq


12.52

Taken or Not Taken: Branch Behavior

  • When a conditional jump/branch is

– True, we say it is Taken – False, we say it is Not Taken

  • Currently our pipeline will fetch sequentially

and then potentially flush if the branch is taken

– Effectively, our pipeline "predicts" that each branch is Not Taken

  • The j L1 instruction is always taken and

thus will incur wasted clock cycles each time it is executed

  • Most of the time the jge L2 will be not

taken and perform well

sum: mov $0x0,%ecx
L1:  cmp %esi,%ecx
     jge L2            # usually Not Taken (NT)
     ld 0(%rdi), %eax
     add $1, %ecx
     add %edx, %eax
     st %eax, 0(%rdi)
     add $4, %rdi
     j L1              # always Taken (T)
L2:  retq


12.53

Branch Delay Slots

  • Problem: After a jump/branch we fetch instructions

that we are not sure should be executed

  • Idea: Find an instruction(s) that should ALWAYS be

executed (independent of whether branch is taken or not), move those instructions to directly after the branch, and have HW just let them be executed (not flushed) no matter what the branch outcome is

  • Branch delay slots = # of instructions that the HW will

always execute (not flush) after a jump/branch instruction


12.54

Branch Delay Slot Example

Assume a single-instruction delay slot. Move an ALWAYS executed instruction down into the delay slot and let it execute no matter what:

"Before" code:

ld 0(%rdi), %rcx
add %rbx, %rax      # always executed, independent of the branch
cmp %rcx, %rdx
je NEXT
NOT TAKEN CODE
…
NEXT: TAKEN CODE

"After" code:

ld 0(%rdi), %rcx
cmp %rcx, %rdx
je NEXT
add %rbx, %rax      # delay slot instruc.: executes whether taken or not
NOT TAKEN CODE
…
NEXT: TAKEN CODE

Flowchart perspective: the delay slot instruction sits on both the Taken (T) and Not Taken (NT) paths.


12.55

Implementing Branch Delay Slots

  • HW will define the number of

branch delay slots (usually a small number…1 or 2)

  • Compiler will be responsible for

arranging instructions to fill the delay slots

– Must find instructions that the branch does NOT DEPEND on
– If no instructions can be rearranged, the compiler can always insert a 'nop' and just waste those cycles

ld 0(%rdi), %rcx
add %rbx, %rax
cmp %rcx, %rdx
je NEXT
delay slot instruc.

Cannot move the ld into the delay slot, because the je depends (through the cmp) on the %rcx value the ld generates. If no independent instruction can be found, the compiler inserts a 'nop' into the slot.


12.56

A Look Ahead: Branch Prediction

  • Currently our pipeline assumes Not Taken and

fetches down the sequential path after a jump/branch

  • Could we build a pipeline that could predict taken?

– Not yet! Location to jump to (branch target) not known until later stages

  • But suppose we could overcome those problems: would we even know how to predict the outcome of a jump/branch before actually looking at the condition codes deeper in the pipeline?

  • We could allow a static prediction per instruction

(give a hint with the branch that indicates T or NT)

  • We could allow dynamic prediction per instruction

(use its runtime history)

Loops: the loop-back branch (T: loop, NT: done) has a high probability of being Taken, so prediction can be static.

If statements: an if..else branch (T: if code, NT: else code) may exhibit data-dependent behavior, so prediction may need to be dynamic.
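Dynamic prediction from runtime history can be sketched with the classic 2-bit saturating counter (an illustrative scheme, not necessarily the one this course builds; the loop outcomes below are a made-up example):

```python
# 2-bit saturating counter: states 0..3, predict Taken in the upper half.
# Two misses in a row are needed to flip a strongly-held prediction.

def predict(counter):
    return counter >= 2              # 2, 3 -> predict Taken

def update(counter, taken):
    if taken:
        return min(counter + 1, 3)   # saturate at strongly Taken
    return max(counter - 1, 0)       # saturate at strongly Not Taken

counter, hits = 3, 0                 # loop branch starts strongly Taken
outcomes = [True] * 9 + [False]      # 10-iteration loop: taken 9x, then exit
for taken in outcomes:
    hits += predict(counter) == taken
    counter = update(counter, taken)

print(hits, "of", len(outcomes), "predicted correctly")  # -> 9 of 10
```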


12.57

Summary 1

  • Pipelining is an effective and important technique to

improve the throughput of a processor

  • Overlapping execution creates hazards which lead to

stalls or wasted cycles

– Data, Control, Structural
– More hardware can be investigated to attempt to mitigate the stalls (e.g., forwarding)

  • The compiler can help reorder code to avoid stalls

and perform useful work (e.g. delay slots)