Extraction of Efficient Instruction Schedulers from Cycle-true Processor Models

Oliver Wahlen, Manuel Hohenauer, Rainer Leupers, Gerd Ascheid, Gunnar Braun, Xiaoning Nie, Heinrich Meyr
CoWare, Inc.; Infineon Technologies; RWTH Aachen
Institute for Integrated Signal Processing Systems

Motivation: Why ASIPs?
Application Specific Instruction-Set Processors (ASIPs) combine advantages of processors and ASICs:
• Provide system programmability and reconfigurability
• Good tradeoff: performance/power consumption/area
• Can easily be integrated into embedded systems
[Figure: efficiency (MIPS/Watt) versus flexibility, with regions labeled ASICs, ASIPs, domain specific, GPPs]

Solution: LISA Processor Design Platform
LISA: Language for Instruction-Set Architectures
[Figure: LISA 2.0 design flow covering Architecture Exploration, Architecture Implementation, Software Application Design, and Integration and Verification. Tool labels: EDGE TM Processor Designer, RIM TM Software Designer, HUB TM System Integrator, C Compiler (research), Assembler, Linker, Simulator / Debug., Profiler, System on Chip.]
http://www.coware.com

Architecture Exploration Loop
Automatic tool generation:
• Speeds up design cycles
• Eliminates consistency problem
C compiler in the loop:
• Reduction in implementation and verification time
• IP reuse
[Figure: exploration loop. The application (.c) is compiled by the C compiler to .asm, then processed by the assembler & linker and the simulator & profiler; all tools are generated automatically from the LISA processor model. If the design criteria are not met, manual changes are made and the loop repeats; if they are met, the flow continues to the VHDL model.]

Compiler Structure and Generation
[Figure: the compiler translates .c to .asm through a C front-end engine, IR optimizations, and an architecture-specific backend (Instruction Selector, Register Allocator, Scheduler, Emitter). The backend is produced by the generation engines of the CoSy Compiler Development System from a .cgd compiler backend description, which is generated semiautomatically from the LISA processor model.]

Scheduler Generation
[Figure: compiler structure with C front-end engine, IR optimizations, Instruction Selector, Register Allocator, Scheduler, and Emitter, built with the CoSy Compiler Development System from a .cgd compiler backend description. The Scheduler is generated from the LISA processor model and provided as the postpass tool lpacker. The slide cites [EXPRESSION, PEAS-III].]

Scheduler Description
Reservation Tables: Elimination of Structural Hazards [O. Wahlen, M. Hohenauer, R. Leupers, H. Meyr, 2003]
Example:
  0: MUL R1,R2,R3
  1: NOP
  2: MUL R4,R5,R6
Reservation table:
            ALU_op   MUL_op
  cycle 0
  cycle 1
  cycle 2   EX_alu   EX_mul
  cycle 3            EX_mul
  cycle 4
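As a rough illustration of how a scheduler can use such a reservation table, the following Python sketch computes the smallest issue distance at which two operation classes never claim the same resource in the same cycle. The table contents follow the slide as far as they can be read from it (EX_alu in cycle 2 for ALU_op, EX_mul in cycles 2 and 3 for MUL_op). The function names and the dictionary layout are assumptions made for this example and are not part of LISA or of the generated scheduler.

```python
# Reservation table: operation class -> {pipeline cycle: resources in use}.
# Contents are read off the slide; the layout is a hypothetical choice.
RESERVATION = {
    "ALU_op": {2: {"EX_alu"}},
    "MUL_op": {2: {"EX_mul"}, 3: {"EX_mul"}},
}


def conflicts(first, second, distance):
    """True if `second`, issued `distance` cycles after `first`,
    needs a resource in a cycle where `first` still holds it."""
    for cycle, resources in RESERVATION[first].items():
        # In absolute cycle `cycle`, `second` is in its own cycle `cycle - distance`.
        other = RESERVATION[second].get(cycle - distance, set())
        if resources & other:
            return True
    return False


def min_issue_distance(first, second):
    """Smallest issue distance (>= 1) without a structural hazard."""
    distance = 1
    while conflicts(first, second, distance):
        distance += 1
    return distance


if __name__ == "__main__":
    for pair in [("MUL_op", "MUL_op"), ("MUL_op", "ALU_op"), ("ALU_op", "ALU_op")]:
        print(pair, min_issue_distance(*pair))
```

Under these assumptions, two consecutive MUL instructions need an issue distance of 2, which corresponds to the single NOP in the example.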

Scheduler Description
Latency Tables: Elimination of Dataflow Hazards
Example:
  0: MUL R3,R1,R2
  1: NOP
  2: ADD R5,R3,R4
RAW latency table:
            ALU_in   MUL_in
  ALU_out   1        1
  MUL_out   2        2
Further tables on the slide: WAW, WAR.
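The RAW table can be read as a lookup from a producer class and a consumer class to the minimum issue distance. A minimal Python sketch, assuming a single-issue machine and using hypothetical names (`RAW_LATENCY`, `nops_needed`) that do not come from the slides:

```python
# RAW latencies from the slide: (producer output class, consumer input class) -> cycles.
RAW_LATENCY = {
    ("ALU_out", "ALU_in"): 1,
    ("ALU_out", "MUL_in"): 1,
    ("MUL_out", "ALU_in"): 2,
    ("MUL_out", "MUL_in"): 2,
}


def nops_needed(producer, consumer):
    """NOPs required between two directly dependent instructions
    if no independent instruction is available to fill the gap."""
    return max(RAW_LATENCY[(producer, consumer)] - 1, 0)


if __name__ == "__main__":
    # MUL R3,R1,R2 followed by ADD R5,R3,R4 (ADD consumes the MUL result).
    print(nops_needed("MUL_out", "ALU_in"))
```

For the example on the slide, the lookup gives a latency of 2 and therefore one NOP between MUL and ADD.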

LISA Description
  ...
  OPERATION reg_alu_instr IN pipe.ID
  {
    DECLARE {
      GROUP Opcode = { ADD || SUB }; ...
      GROUP Rs1, Rs2, Rd = { gp_reg };
    }
    ...
    CODING { Opcode Rs2 Rs1 Rd 0b0[10] }
    SYNTAX { Opcode ~" " Rd ~" " Rs1 ~" " Rs2 }
    BEHAVIOR {
      PIPELINE_REGISTER(pipe,ID/EX).src1 = GP_Regs[Rs1];
      PIPELINE_REGISTER(pipe,ID/EX).src2 = GP_Regs[Rs2];
      ...
      PIPELINE_REGISTER(pipe,ID/EX).dst = Rd;
    }
    ACTIVATION { Opcode }
  }
[Figure: operation tree with the labels decode, alu control, imm_alu_instr, reg_alu_instr, opcode, Rd, Rs1, Rs2, ADD, SUB]

Scheduler Generation: Operation Schedule
[Figure: activation DAG. main, fetch, and decode lead to register_alu_instr (ADD, SUB) or imm_alu_instr (ADDI, SUBI) and on to alu_wb; the edges are annotated with 0 or 1.]
Operation schedule (ALU):
  Cycle   Resource usage
  0       ---
  1       2 x read of GP-register file
  2       ---
  3       1 x write of GP-register file
[Figure: register file R with read ports 1 to 3 and write port 1]

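Latency Calculation
Latencies between two instructions i and j (R is a processor resource; in the examples, j is the earlier instruction and i the dependent instruction that follows it):
$L_{\mathrm{raw}}(i, j) = \max_R \bigl( \mathrm{last\_write\_cycle}(j, R) - \mathrm{first\_read\_cycle}(i, R) + 1 \bigr)$
Example (cycle of the GPR access, relative to the issue of each instruction):
  Cycle            0    1    2    3
  ADD R1, R2, R3   …    …    …    GPR
  SUB R4, R1, R5   …    GPR  …    …
ADD writes the GPR file in cycle 3 and SUB reads it in cycle 1, so $L_{\mathrm{raw}} = 3 - 1 + 1 = 3$. SUB is therefore issued 3 cycles after ADD:
  Cycle            0    1    2    3    4    5    6
  ADD R1, R2, R3   …    …    …    GPR
  SUB R4, R1, R5                  …    GPR  …    …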

Latency Calculation
$L_{\mathrm{waw}}(i, j) = \max_R \bigl( \mathrm{last\_write\_cycle}(j, R) - \mathrm{first\_write\_cycle}(i, R) + 1 \bigr)$
Example:
  Cycle            0    1    2    3
  ADD R1, R2, R3   …    …    …    GPR
  SUB R1, R4, R5   …    …    …    GPR
Both instructions write the GPR file in cycle 3, so $L_{\mathrm{waw}} = 3 - 3 + 1 = 1$. SUB is issued 1 cycle after ADD:
  Cycle            0    1    2    3    4
  ADD R1, R2, R3   …    …    …    GPR
  SUB R1, R4, R5        …    …    …    GPR

Latency Calculation
$L_{\mathrm{war}}(i, j) = \max_R \bigl( \mathrm{last\_read\_cycle}(j, R) - \mathrm{first\_write\_cycle}(i, R) \bigr)$
Example:
  Cycle            0    1    2    3
  ADD R2, R1, R3   PC   …    …    …
  JMP addr         …    PC   …    …
ADD reads the PC in cycle 0 and JMP writes it in cycle 1, so $L_{\mathrm{war}} = 0 - 1 = -1$. A negative latency indicates a delay slot: JMP may be issued one cycle before ADD.
  Cycle            0    1    2    3    4
  ADD R2, R1, R3        PC   …    …    …
  JMP addr         …    PC   …    …
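The three latency formulas can be evaluated mechanically once the read and write cycles of each resource are known per instruction. The Python sketch below does this for the examples on the slides. The access cycles (GPR read in cycle 1, GPR write in cycle 3, PC read in cycle 0, PC write by JMP in cycle 1) are taken from the examples and combined into one record per instruction class; the data layout and the function names are assumptions for illustration only.

```python
def _latency(earlier_cycles, later_cycles, offset):
    """max over shared resources R of
    last access of the earlier instruction - first access of the later one + offset."""
    shared = earlier_cycles.keys() & later_cycles.keys()
    if not shared:
        return None  # no common resource, hence no dependence of this kind
    return max(max(earlier_cycles[r]) - min(later_cycles[r]) + offset
               for r in shared)


# `earlier` plays the role of j and `later` the role of i in the slide formulas.
def l_raw(earlier, later):
    return _latency(earlier["writes"], later["reads"], 1)


def l_waw(earlier, later):
    return _latency(earlier["writes"], later["writes"], 1)


def l_war(earlier, later):
    return _latency(earlier["reads"], later["writes"], 0)


# Resource access cycles relative to issue (assumed from the slide examples).
ALU = {"reads": {"GPR": [1], "PC": [0]}, "writes": {"GPR": [3]}}
JMP = {"reads": {}, "writes": {"PC": [1]}}

if __name__ == "__main__":
    print("RAW ADD -> SUB:", l_raw(ALU, ALU))
    print("WAW ADD -> SUB:", l_waw(ALU, ALU))
    print("WAR ADD -> JMP:", l_war(ALU, JMP))
```

With these inputs the functions reproduce the values on the slides: 3 for RAW, 1 for WAW, and -1 for WAR.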

List Scheduling Example
Data dependence DAG:
  ADD R1 R2 → JMP addr (PC: -1)
  SUB R3 R4 → JMP addr (PC: -1)
Ready set: ADD R1 R2, SUB R3 R4

List Scheduling Example
Ready set: SUB R3 R4
  Cycle   Step 1
  0       ADD R1 R2
  1
  2
  3

List Scheduling Example
Ready set: JMP addr
  Cycle   Step 1      Step 2
  0       ADD R1 R2   ADD R1 R2
  1                   SUB R3 R4
  2
  3

List Scheduling Example
Ready set: (empty)
  Cycle   Step 1      Step 2      Step 3
  0       ADD R1 R2   ADD R1 R2   ADD R1 R2
  1                   SUB R3 R4   SUB R3 R4
  2                               JMP addr
  3

List Scheduling Example
Ready set: (empty)
The delay slot must be filled:
  Cycle   Step 1      Step 2      Step 3      Step 4
  0       ADD R1 R2   ADD R1 R2   ADD R1 R2   ADD R1 R2
  1                   SUB R3 R4   SUB R3 R4   SUB R3 R4
  2                               JMP addr    JMP addr
  3                                           NOP
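The walk-through above can be reproduced with a very small list scheduler. The following Python sketch assumes a single-issue machine, takes the first ready instruction in program order as the priority rule, and treats negative edge weights as zero because an instruction only becomes ready after its predecessors are placed. The function name, the edge list format, and the `DELAY_SLOTS` table are assumptions for this example, not the scheduler described in the presentation.

```python
def list_schedule(instructions, edges, delay_slots):
    """Forward list scheduling, one instruction per cycle.

    instructions: instruction names in priority order
    edges:        (pred, succ, latency) dependence edges
    delay_slots:  instruction -> number of delay slots that must be filled
    """
    placed = {}      # instruction -> cycle
    schedule = []    # schedule[cycle] = instruction or "NOP"
    remaining = list(instructions)
    while remaining:
        cycle = len(schedule)
        ready = [
            instr for instr in remaining
            if all(pred in placed and placed[pred] + max(lat, 0) <= cycle
                   for (pred, succ, lat) in edges if succ == instr)
        ]
        if ready:
            chosen = ready[0]
            placed[chosen] = cycle
            remaining.remove(chosen)
            schedule.append(chosen)
        else:
            schedule.append("NOP")
    # Delay slots that nothing was scheduled into are padded with NOPs.
    for instr, slots in delay_slots.items():
        while len(schedule) <= placed[instr] + slots:
            schedule.append("NOP")
    return schedule


INSTRUCTIONS = ["ADD R1 R2", "SUB R3 R4", "JMP addr"]
EDGES = [("ADD R1 R2", "JMP addr", -1), ("SUB R3 R4", "JMP addr", -1)]
DELAY_SLOTS = {"JMP addr": 1}  # one delay slot, matching the latency of -1

if __name__ == "__main__":
    for cycle, instr in enumerate(list_schedule(INSTRUCTIONS, EDGES, DELAY_SLOTS)):
        print(cycle, instr)
```

The result is ADD, SUB, JMP, NOP, the same four-cycle schedule as in Step 4: the scheduler cannot go back and move ADD or SUB into the delay slot.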

Backtracking Scheduler
• Negative latencies can be extracted automatically from the LISA model
• They indicate delay slots
• List schedulers cannot exploit negative weights in the dependence DAG, because doing so would require revoking earlier scheduling decisions
Consequence: development of a retargetable Backtracking Scheduler [S. G. Abraham, W. Meleis, I. D. Baev, 2000]
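To make the role of the negative weights concrete, the Python sketch below checks candidate schedules against the dependence edges of the list scheduling example, using the constraint cycle(successor) >= cycle(predecessor) + latency. The helper name `violated_edges` and the schedule dictionaries are hypothetical and exist only for this illustration.

```python
def violated_edges(cycle_of, edges):
    """Return the dependence edges that a schedule does not respect."""
    return [(pred, succ, lat) for (pred, succ, lat) in edges
            if cycle_of[succ] < cycle_of[pred] + lat]


# Edges from the list scheduling example: WAR on the PC with latency -1.
EDGES = [("ADD", "JMP", -1), ("SUB", "JMP", -1)]

in_order = {"ADD": 0, "SUB": 1, "JMP": 2}      # list scheduler result, needs a NOP
slot_filled = {"ADD": 0, "JMP": 1, "SUB": 2}   # SUB placed in the delay slot
too_early = {"JMP": 0, "ADD": 1, "SUB": 2}     # SUB falls behind the delay slot

if __name__ == "__main__":
    for name, sched in [("in_order", in_order),
                        ("slot_filled", slot_filled),
                        ("too_early", too_early)]:
        print(name, violated_edges(sched, EDGES))
```

The schedule with SUB in the delay slot satisfies every edge and needs no NOP, but reaching it means placing JMP before one of its predecessors, which a forward list scheduler never does.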

mixedBT Backtracking Scheduler
Concept: three scheduling modes
1. normal scheduling: if there is no conflict, instructions are scheduled according to their data dependencies
2. displace scheduling: unschedule instructions that have lower priority and are causing a structural hazard
3. force scheduling: if 1 and 2 are not possible, unschedule conflicts and force the scheduling of the candidate
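The order in which the three modes are tried can be sketched as a single decision function. This is a strongly simplified illustration and not the mixedBT algorithm itself: it assumes a single-issue machine in which the only structural hazard is an occupied issue slot, it takes the dependence window of the candidate as given, and it forces the candidate into the earliest cycle of that window, which is an arbitrary choice. All names (`choose_mode`, `slots`, `priority`) are hypothetical.

```python
def choose_mode(candidate, window, slots, priority):
    """Decide how to place `candidate`.

    window:   (earliest, latest) cycles allowed by data dependencies, inclusive
    slots:    cycle -> instruction currently occupying the issue slot
    priority: instruction -> priority (larger means more important)
    Returns (mode, cycle, instructions_to_unschedule).
    """
    earliest, latest = window
    cycles = range(earliest, latest + 1)

    # 1. normal scheduling: a free slot inside the dependence window.
    for cycle in cycles:
        if cycle not in slots:
            return ("normal", cycle, [])

    # 2. displace scheduling: evict a lower-priority instruction.
    for cycle in cycles:
        if priority[slots[cycle]] < priority[candidate]:
            return ("displace", cycle, [slots[cycle]])

    # 3. force scheduling: take the earliest cycle regardless of priority.
    occupant = slots.get(earliest)
    return ("force", earliest, [occupant] if occupant is not None else [])


if __name__ == "__main__":
    slots = {0: "JMP", 1: "ADD"}
    print(choose_mode("SUB", (0, 2), slots, {"JMP": 3, "ADD": 2, "SUB": 1}))
    print(choose_mode("SUB", (0, 1), slots, {"JMP": 3, "ADD": 1, "SUB": 2}))
    print(choose_mode("SUB", (0, 1), slots, {"JMP": 3, "ADD": 2, "SUB": 1}))
```

A complete scheduler would put the unscheduled instructions back on its worklist and would have to bound the number of such steps to guarantee termination; both aspects are left out of this sketch.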
