CS356 Unit 12a Processor Hardware Organization BASIC HW Pipelining

12a.1 CS356 Unit 12a: Processor Hardware Organization, Pipelining

12a.2 BASIC HW

12a.3 Logic Circuits
• Combinational logic
  – Performs a specific function (mapping of ___ input combinations to desired output combinations)
  – No internal state or feedback
    • Given a set of inputs, we will always get the same output after some time (propagation) __________
• Sequential logic (storage devices)
  – Registers are the fundamental building blocks
    • Remember a set of bits for later use
    • Act like a variable from software
    • Controlled by a "clock" signal
[Figure: Combinational Logic. Inputs → Logic (usually operations like +, -, *, /, &, |, <<) → Outputs. Outputs depend only on current inputs.]
[Figure: Sequential Logic. Current inputs and fed-back register values → Combinational Logic → Out. A register holding "state" feeds values back and provides "memory". Outputs depend on current inputs and previous inputs (previous inputs summarized via state).]

12a.4 Combinational Logic Gates
• Main Idea: Circuits called "gates" perform logic operations to produce desired outputs from some digital inputs
[Figure: example circuit built from AND, OR and NOT gates, with inputs N, C, P, R, V, B, T]
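The contrast between the two kinds of logic on slide 12a.3 can be sketched in a few lines of Python. This is only an illustrative model, not anything from the course materials: the names `and_gate`, `or_gate`, `not_gate` and `Register` are hypothetical, and the register is assumed to capture its input on a rising (0 to 1) clock edge.

```python
def and_gate(a: int, b: int) -> int:
    """Combinational: output depends only on the current inputs."""
    return a & b


def or_gate(a: int, b: int) -> int:
    return a | b


def not_gate(a: int) -> int:
    return a ^ 1


class Register:
    """Sequential: holds state between clock edges (assumed rising-edge triggered)."""

    def __init__(self, initial: int = 0) -> None:
        self.q = initial        # stored value, visible at the Q output
        self._prev_clk = 0      # last clock level seen, used to detect an edge

    def tick(self, clk: int, d: int) -> int:
        # Capture D only when the clock goes from 0 to 1; otherwise keep Q.
        if self._prev_clk == 0 and clk == 1:
            self.q = d
        self._prev_clk = clk
        return self.q
```

Calling `and_gate(1, 0)` twice always gives the same answer, while two identical calls to `Register.tick` can return different values depending on what was stored earlier, which is the "state" the slide refers to.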

12a.5 Propagation Delay
• Main Idea: All digital logic circuits have propagation delay
  – Time it takes for output to change when inputs are changed
[Figure: gate circuit annotated "4 gate delays for input to propagate to outputs"]

12a.6 Combinational Logic Functions
• Map input combinations of n-bits to desired m-bit output
• Can describe function with a truth table and then find its circuit implementation
[Figure: truth table with columns IN0, IN1, IN2, OUT0, OUT1, next to a logic circuit block with inputs In0, In1, In2 and outputs]

12a.7 ALU's
• Perform a selected operation on two input numbers.
  – FS[5:0] select the desired operation

    Func. Code   Op.             Func. Code   Op.
    00_0000      A SHL B         10_0000      A+B
    00_0010      A SHR B         10_0010      A-B
    00_0011      A SAR B         …            …
    …            …               10_0100      A AND B
    01_1000      A * B           10_0101      A OR B
    01_1001      A * B (uns.)    10_0110      A XOR B
    01_1010      A / B           10_0111      A NOR B
    01_1011      A / B (uns.)    …            …
    …            …               10_1010      A < B

[Figure: ALU with 32-bit inputs X[31:0] (A) and Y[31:0] (B), 6-bit function select FS[5:0] (Func), 32-bit output RES[31:0] and flag outputs OF, ZF, CF. Example: X = 0x80000000, Y = 0x00000001, FS = 100010; RES = __, OF = __, ZF = __, CF = __]

12a.8 Sequential Devices (Registers)
• Registers _______ the D input value when a control input (aka the clock signal) transitions from _____ (clock edge) and stores that value at the Q output until the next clock edge
• A register is like a ___________ in software. It stores a value for later use.
• We can choose to only clock the register at "________" times when we want the register to capture a new value (i.e. when it is the __________ of an instruction)
• Key Idea: Registers __________ data while we operate on those values
[Figure: Block Diagram of a Register. %rax feeds an ALU computing a sum for add %rbx,%rax and add %rcx,%rax, and the sum is written back to %rax. Timing note: the clock pulse (positive edge) causes q(t) to sample and hold the current d(t) value until another clock pulse.]
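As a rough illustration of how FS[5:0] selects an ALU operation, the sketch below models a handful of the function codes from the table on slide 12a.7. The function name `alu` is hypothetical, only a subset of codes is covered, and of the flag outputs only ZF is modelled; OF and CF are left out to keep the sketch short.

```python
MASK32 = 0xFFFFFFFF  # results are truncated to 32 bits, matching RES[31:0]


def alu(fs: int, a: int, b: int) -> tuple[int, int]:
    """Return (RES, ZF) for a subset of the function codes in the table."""
    a &= MASK32
    b &= MASK32
    if fs == 0b100000:      # 10_0000: A+B
        raw = a + b
    elif fs == 0b100010:    # 10_0010: A-B
        raw = a - b
    elif fs == 0b100100:    # 10_0100: A AND B
        raw = a & b
    elif fs == 0b100101:    # 10_0101: A OR B
        raw = a | b
    elif fs == 0b100110:    # 10_0110: A XOR B
        raw = a ^ b
    elif fs == 0b100111:    # 10_0111: A NOR B
        raw = ~(a | b)
    else:
        raise ValueError(f"function code {fs:06b} not modelled in this sketch")
    res = raw & MASK32
    zf = int(res == 0)      # zero flag: set when the 32-bit result is all zeros
    return res, zf
```

For instance, `alu(0b100010, 0x80000000, 0x00000001)` applies the A-B code to the operand values shown in the slide's diagram.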

12a.9 Clock Signal
• Alternating high/low voltage pulse train
• Controls the ordering and timing of operations performed in the processor
• 1 cycle is usually measured from rising edge to rising edge
• Clock frequency = # of cycles per second (e.g. 2.8 GHz = 2.8 * 10^9 cycles per second)
[Figure: clock signal driving the processor, alternating between 1 (5V) and 0 (0V), with Op. 1, Op. 2, Op. 3 each taking 1 cycle. 2.8 GHz = 2.8 * 10^9 cycles per second = 0.357 ns/cycle]

12a.10 FROM X86 TO RISC
Basic HW organization for a simplified instruction set

12a.11 From CISC to RISC
• Complex Instruction Set Computers often have instructions that ______ widely in how much _________ they perform and how much _________ they take to execute
  – Benefit is _________ instructions are needed to accomplish a task
• Reduced Instruction Set Computers favor instructions that take roughly the _______ time to execute and follow a common _____________ of steps
  – It often requires _______ instructions to describe the overall task (larger code size)
    • See example below
• RISC makes the ___________ design easier, so let's tweak our x86 instructions to be more RISC-like

CISC vs. RISC Equivalents

    // CISC instruction
    movq 0x40(%rdi, %rsi, 4), %rax

    // RISC Equiv. w/ 1 mem. or ALU op. per instruction
    mov %rsi, %rbx      # use %rbx as a temp.
    shl $2, %rbx        # %rsi * 4
    add %rdi, %rbx      # %rdi + (%rsi*4)
    add $0x40, %rbx     # 0x40 + %rdi + (%rsi*4)
    mov (%rbx), %rax    # %rax = *%rbx

12a.12 A RISC Subset of x86
• Split mov instructions that access memory into separate instructions:
  – ld = _____________ from memory
  – st = _____________ to memory
• Limit ld & st instructions to use at most ____________________________
  – No ld 0x04(%rdi, %rsi, 4), %rax
    • Too much work
  – At most ld ____________, %rax or st %rax, _____________
• Limit arithmetic & logic instructions to only operate on registers
  – No add (%rsp), %rax since this implicitly accesses (dereferences) memory
  – Only add ______________

    // 3 x86 memory read instructions
    mov (%rdi), %rax            // 1
    mov 0x40(%rdi), %rax        // 2
    mov 0x40(%rdi,%rsi), %rax   // 3

    // Equiv. load sequences
    ld 0x0(%rdi), %rax          // 1
    ld 0x40(%rdi), %rax         // 2
    mov %rsi, %rbx              // 3a
    add %rdi, %rbx              // 3b
    ld 0x40(%rbx), %rax         // 3c

    // 3 x86 memory write instructions
    mov %rax, (%rdi)            // 1
    mov %rax, 0x40(%rdi)        // 2
    mov %rax, 0x40(%rdi,%rsi)   // 3

    // Equiv. store sequences
    st %rax, 0x0(%rdi)          // 1
    st %rax, 0x40(%rdi)         // 2
    mov %rsi, %rbx              // 3a
    add %rdi, %rbx              // 3b
    st %rax, 0x40(%rbx)         // 3c

    // CISC instruction
    add %rax, (%rsp)

    // Equiv. RISC sequence (w/ ld and st)
    ld 0(%rsp), %rbx
    add %rax, %rbx
    st %rbx, 0(%rsp)
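To check that the one-operation-per-instruction sequence reaches the same memory address as the single CISC instruction `movq 0x40(%rdi, %rsi, 4), %rax`, the following sketch mirrors each step in Python. The function names are hypothetical, and registers are modelled as plain integers without 64-bit wraparound.

```python
def cisc_effective_address(disp: int, base: int, index: int, scale: int) -> int:
    """Address formed by disp(base, index, scale) in one instruction."""
    return disp + base + index * scale


def risc_sequence_address(rdi: int, rsi: int) -> int:
    """Same address built one ALU operation at a time, using rbx as a temp."""
    rbx = rsi            # mov %rsi, %rbx
    rbx = rbx << 2       # shl $2, %rbx      -> %rsi * 4
    rbx = rbx + rdi      # add %rdi, %rbx    -> %rdi + (%rsi*4)
    rbx = rbx + 0x40     # add $0x40, %rbx   -> 0x40 + %rdi + (%rsi*4)
    return rbx           # the final mov (%rbx), %rax would load from here
```

Both functions should agree for any base and index, for example `cisc_effective_address(0x40, 0x1000, 3, 4)` and `risc_sequence_address(0x1000, 3)`. The cost is four extra instructions and a temporary register, which is the code-size trade-off the slide points out.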

12a.13 Processor Block Diagram
[Figure: datapath in five stages, 10 ns each: Fetch, Decode, Exec., Mem, WB.
  Fetch: PC → I-Cache / I-MEM (Addr., Data) → Instruction (Machine Code)
  Decode: Decode → Control Signals (e.g. ALU operation, Read/Write D-Cache, etc.); Registers (%rax, %rbx, etc.) (aka RegFile) → Operands
  Exec.: ALU (Zero, Res.) with Cond. Codes ZF OF CF SF → ALU Output (Addr. or Result)
  Mem: D-Cache / D-MEM (Addr., Data)
  WB: Data to write to dest. register]
Clock Cycle Time = Sum of delay through worst case pathway = 50 ns

12a.14 Developing a Processor Organization
• Identify which hardware components each instruction type would use and in what order: ALU-Type, LD, ST, Jump
[Figure: available components: PC, I-Cache (Addr, Data), Registers (aka RegFile), ALU (Zero), D-Cache (Addr, Data)]

    Step   ALU-Type          LD                 ST                  JE
           add %rax,%rbx     ld 8(%rax),%rbx    st %rbx, 8(%rax)    je label/displacement
    1.
    2.
    3.
    4.
    5.
    6.

12a.15 Processor Execution (add)
add %rax,%rdx [Machine Code: 48 01 c2]
[Figure: datapath (Fetch, Decode, Exec., Mem, WB). The register file supplies %rax and %rdx as operands, the ALU computes %rax+%rdx, and the result is written back: %rdx = %rax+%rdx]

12a.16 Processor Execution (load)
ld 0x40(%rbx),%rax [Machine Code: 48 8b 43 40]
[Figure: datapath (Fetch, Decode, Exec., Mem, WB). The register file supplies %rbx, the ALU adds 0x40 to form addr, the D-Cache returns data at addr, and the data is written back: %rax = data]
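The cycle-time figure on slide 12a.13 follows from adding the five stage delays along the worst-case path. A minimal sketch of that arithmetic, with hypothetical names and the 10 ns per-stage values taken from the diagram:

```python
# Per-stage delays from the block diagram (nanoseconds).
STAGE_DELAYS_NS = {"Fetch": 10, "Decode": 10, "Exec": 10, "Mem": 10, "WB": 10}


def cycle_time_ns(delays: dict[str, int]) -> int:
    """Single-cycle design: the clock period must cover every stage in sequence."""
    return sum(delays.values())


def clock_frequency_hz(period_ns: float) -> float:
    """Frequency is the reciprocal of the period (1 ns = 1e-9 s)."""
    return 1e9 / period_ns
```

With these values `cycle_time_ns(STAGE_DELAYS_NS)` gives the 50 ns quoted on the slide, and `clock_frequency_hz(50)` converts that period into a clock rate in hertz.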

12a.17 Processor Execution (store)
st %rax,0x40(%rbx) [Machine Code: 48 89 43 40]
[Figure: datapath (Fetch, Decode, Exec., Mem, WB). The register file supplies %rbx and %rax, the ALU adds 0x40 to %rbx to form addr, and %rax is sent to the D-Cache as the Data to write at addr]

12a.18 Processor Execution (branch/jump)
je L1 (disp. = 0x08) [Machine Code: 74 08]
[Figure: datapath (Fetch, Decode, Exec., Mem, WB). Condition codes ZF OF CF SF = 1 0 1 0; the ALU adds the displacement 0x08 to PC and the result PC + 0x08 is routed back to the PC]

12a.19 PIPELINING

12a.20 Example
    for(i=0; i < 100; i++)
        C[i] = (A[i] + B[i]) / 4;
[Figure: counter Cntr supplies i; memory holds arrays A, B and C; A[i] and B[i] are read and combined to produce C[i]]
10 ns per input set = _____ ns total
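The loop on the Example slide and its timing question can be mirrored in a short sketch. The names `compute` and `unpipelined_total_ns` are hypothetical, and the timing function assumes the hardware handles one input set at a time, finishing each 10 ns computation before the next begins.

```python
def compute(A: list[int], B: list[int]) -> list[int]:
    """C[i] = (A[i] + B[i]) / 4 for every i.

    Uses floor division, which matches C integer division for the
    non-negative inputs assumed here.
    """
    return [(a + b) // 4 for a, b in zip(A, B)]


def unpipelined_total_ns(n_inputs: int, ns_per_input: int) -> int:
    """Total time when input sets are processed strictly one after another."""
    return n_inputs * ns_per_input
```

For the loop bound of 100 and the 10 ns per input set given on the slide, `unpipelined_total_ns(100, 10)` evaluates the total under that one-at-a-time assumption.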
