
ECE 327: Digital Systems Engineering Lecture Slides 2020t1 (Winter)

Mark Aagaard
University of Waterloo, Department of Electrical and Computer Engineering


CONTENTS

4 State Machines
  4.1 Notations
  4.2 Finite State Machines in VHDL
    4.2.1 HDL Coding Styles for State Machines
    4.2.2 State Encodings
    4.2.3 Traditional State-Machine Notation
    4.2.4 Our State-Machine Notation
    4.2.5 Bounce Example
    4.2.6 Registered Assignments
    4.2.7 More Notation
      4.2.7.1 Extension: Transient States
      4.2.7.2 Assignments within States
      4.2.7.3 Conditional Expressions
      4.2.7.4 Default Values
    4.2.8 Semantic and Syntax Rules
    4.2.9 Reset
  4.3 LeBlanc FSM Design Example
    4.3.1 State Machine and VHDL
    4.3.2 State Encodings
  4.4 Parcels
    4.4.1 Bubbles and Throughput
    4.4.2 Parcel Schedule
    4.4.3 Valid Bits
  4.5 LeBlanc with Bubbles
  4.6 Pseudocode
  4.7 Interparcel Variables and Loops
    4.7.1 Introduction to Looping Le Blanc
    4.7.2 Pseudo-Code
    4.7.3 State Machine
    4.7.4 VHDL Code for Loop and Bubbles
  4.8 Memory Arrays and RTL Design
    4.8.1 Memory Operations
    4.8.2 Memory Arrays in VHDL
    4.8.3 Using Memory
      4.8.3.1 Writing from Multiple Vars
      4.8.3.2 Reading from Memory to Multiple Variables
      4.8.3.3 Example: Maximum Value Seen so Far
    4.8.4 Build Larger Memory from Slices
    4.8.5 Memory Arrays in High-Level Models

5 Dataflow Diagrams
  5.1 Dataflow Diagrams
    5.1.1 Dataflow Diagrams Overview
    5.1.2 Dataflow Diagram Execution
    5.1.3 Dataflow Diagrams, Hardware, and Behaviour
    5.1.4 Performance Estimation
    5.1.5 Area Estimation
    5.1.6 Design Analysis
  5.2 Design Example: Hnatyshyn DFD
    5.2.1 Requirements
    5.2.2 Data-Dependency Graph
    5.2.3 Initial Dataflow Diagram
    5.2.4 Area Optimization
    5.2.5 Assign Names to Registered Signals
    5.2.6 Allocation
    5.2.7 State Machine
    5.2.8 VHDL Implementation
  5.3 Design Example: Hnatyshyn with Bubbles
    5.3.1 Adding Support for Bubbles
    5.3.2 Control Table with Valid Bits
    5.3.3 VHDL
  5.4 Inter-Parcel Variables: Hnatyshyn with Internal State
    5.4.1 Requirements and Goals
    5.4.2 Dataflow Diagrams and Waveforms
    5.4.3 Control Tables
    5.4.4 VHDL Implementation
    5.4.5 Summary of Bubbles and Inter-Parcel Variables
  5.5 Design Example: Vanier
    5.5.1 Requirements
    5.5.2 Algorithm
    5.5.3 Initial Dataflow Diagram
    5.5.4 Reschedule to Meet Requirements
    5.5.5 Optimization: Reduce Inputs
    5.5.6 Allocation
    5.5.7 Explicit State Machine
    5.5.8 VHDL #1: Explicit
    5.5.9 VHDL #2
    5.5.10 Notes and Observations
  5.6 Memory Operations in Dataflow Diagrams
  5.7 Data Dependencies
  5.8 Example of DFD and Memory

6 Optimizations
  6.1 Pipelining
    6.1.1 Introduction to Pipelining
    6.1.2 Partially Pipelined
    6.1.3 Terminology
    6.1.4 Overlapping Pipeline Stages
  6.2 Staggering
  6.3 Retiming
  6.4 General Optimizations
    6.4.1 Strength Reduction
      6.4.1.1 Arithmetic Strength Reduction
      6.4.1.2 Boolean Strength Reduction
    6.4.2 Replication and Sharing
      6.4.2.1 Mux-Pushing
      6.4.2.2 Common Subexpression Elimination
      6.4.2.3 Computation Replication
    6.4.3 Arithmetic
  6.5 Customized State Encodings

7 Performance Analysis
  7.1 Introduction
  7.2 Defining Performance
  7.3 Benchmarks
  7.4 Comparing Performance
    7.4.1 General Equations
    7.4.2 Example: Performance of Printers
  7.5 Clock Speed, CPI, Program Length, and Performance
    7.5.1 Mathematics
    7.5.2 Example: CISC vs RISC and CPI
    7.5.3 Effect of Instruction Set on Performance
  7.6 Effect of Time to Market on Relative Performance

8 Timing Analysis
  8.1 Delays and Definitions
    8.1.1 Background Definitions
    8.1.2 Clock-Related Timing Definitions
      8.1.2.1 Clock Latency
      8.1.2.2 Clock Skew
      8.1.2.3 Clock Jitter
    8.1.3 Storage-Related Timing Definitions
      8.1.3.1 Flops and Latches
      8.1.3.2 Timing Parameters
      8.1.3.3 Timing Parameters for a Flop
    8.1.4 Propagation Delays
    8.1.5 Timing Constraints
  8.2 Timing Analysis of Simple Latches
    8.2.1 Review: Active-High Latch Behaviour
    8.2.2 Structure and Behaviour of Multiplexer Latch
    8.2.3 Strategy for Timing Analysis of Storage Devices
    8.2.4 Clock-to-Q Time of a Latch
    8.2.5 From Load Mode to Store Mode
    8.2.6 Setup Time Analysis
    8.2.7 Hold Time of a Multiplexer Latch
    8.2.8 Example of a Bad Latch
    8.2.9 Summary
  8.3 Advanced Timing Analysis of Storage Elements
  8.4 Critical Path
    8.4.1 Introduction to Critical and False Paths
      8.4.1.1 Example of Critical Path in Full Adder
      8.4.1.2 Longest Path and Critical Path
      8.4.1.3 Criteria for Critical Path Algorithms
    8.4.2 Longest Path
      8.4.2.1 Algorithm to Find Longest Path
      8.4.2.2 Longest Path Example
    8.4.3 Monotone Speedup
  8.5 False Paths
  8.6 Analog Timing Model
    8.6.1 Defining Delay
    8.6.2 Modeling Circuits for Timing
      8.6.2.1 Example: Two Buffers with Complex Wiring
      8.6.2.2 Example: Two Buffers with Simple Wiring
    8.6.3 Calculate Delay

    8.6.4 Ex: Two Bufs with Both Caps
  8.7 Elmore Delay Model
    8.7.1 Elmore Delay as an Approximation
    8.7.2 A More Complicated Example
  8.8 Practical Usage of Timing Analysis

9 Power
  9.1 Overview
    9.1.1 Importance of Power and Energy
    9.1.2 Power vs. Energy
    9.1.3 Batteries, Power and Energy
      9.1.3.1 Do Batteries Store Energy or Power?
      9.1.3.2 Battery Life and Efficiency
      9.1.3.3 Battery Life and Power
  9.2 Power Equations
    9.2.1 Switching Power
    9.2.2 Short-Circuited Power
    9.2.3 Leakage Power
    9.2.4 Glossary
    9.2.5 Note on Power Equations
  9.3 Overview of Power Reduction Techniques
  9.4 Voltage, Power, and Delay
  9.5 Data Encoding for Power Reduction
    9.5.1 How Data Encoding Can Reduce Power
    9.5.2 Example Problem: Sixteen Pulser
      9.5.2.1 Problem Statement
      9.5.2.2 Additional Information
      9.5.2.3 Answer
  9.6 Clock Gating
    9.6.1 Introduction to Clock Gating
    9.6.2 Implementing Clock Gating
    9.6.3 Design Process
    9.6.4 Effectiveness of Clock Gating
    9.6.5 Example: Reduced Activity Factor with Clock Gating
    9.6.6 Calculating PctBusy
      9.6.6.1 Valid Bits and Busy
      9.6.6.2 Calculating LenBusy
      9.6.6.3 From LenBusy to PctBusy
    9.6.7 Example: Pipelined Circuit with Clock-Gating
    9.6.8 Clock Gating in ASICs
    9.6.9 Alternatives to Clock Gating
      9.6.9.1 Use Chip Enables
      9.6.9.2 Operand Gating

10 Review
  10.1 Overview of the Term
  10.2 VHDL
    10.2.1 VHDL Topics
    10.2.2 VHDL Example Problems
  10.3 RTL Design Techniques
    10.3.1 Design Topics
    10.3.2 Design Example Problems
  10.4 Performance Analysis and Optimization
    10.4.1 Performance Topics
    10.4.2 Performance Example Problems
  10.5 Timing Analysis
    10.5.1 Timing Topics
    10.5.2 Timing Example Problems
  10.6 Power
    10.6.1 Power Topics
    10.6.2 Power Example Problems
  10.7 Formulas to be Given on Final Exam

Chapter 1
Fundamentals of VHDL

1.1 Introduction to VHDL

1.1.1 Levels of Abstraction

Transistor: Signal values and time are continuous (analog). Each transistor is modeled by a resistor-capacitor network.
Switch: Time is continuous, but voltage may be either continuous or discrete. Linear equations are used.
Gate: Transistors are grouped together into gates. Voltages are discrete values such as 0 and 1.
Register transfer level: Hardware is modeled as assignments to registers and combinational signals. Basic unit of time is one clock cycle.
Transaction level: A transaction is an operation such as transferring data across a bus. Building blocks are processors, controllers, etc. VHDL, SystemC, or SystemVerilog.
Electronic-system level: Looks at an entire electronic system, with both hardware and software.

1.1.2 VHDL Origins and History

VHDL = VHSIC Hardware Description Language
VHSIC = Very High Speed Integrated Circuit

    The VHSIC Hardware Description Language (VHDL) is a formal notation intended for use in all phases of the creation of electronic systems. Because it is both machine readable and human readable, it supports the development, verification, synthesis and testing of hardware designs, the communication of hardware design data, and the maintenance, modification, and procurement of hardware.
    Language Reference Manual (IEEE Design Automation Standards Committee, 1993a)

VHDL is a lot more than synthesis of digital hardware.

1.1.3 Semantics

The original goal of VHDL was to simulate circuits. The semantics of the language define circuit behaviour.

[Figure: simulation of c <= a AND b; produces waveforms for a, b, and c]

But now, VHDL is used in simulation and synthesis. Synthesis is concerned with the structure of the circuit.

Synthesis: converts one type of description (behavioural) into another, lower level, description (usually a netlist).

[Figure: synthesis of c <= a AND b; produces an AND gate with inputs a and b and output c]

Synthesis vs Simulation

For synthesis, we want the code we write to define the structure of the hardware that is generated.

The VHDL semantics define the behaviour of the hardware that is generated, not the structure of the hardware.

[Figure: c <= a AND b; can be synthesized into circuits with different structure; simulating the code and each synthesized circuit gives the same behaviour]

1.1.4 Synthesis of a Simulation-Based Language

This section reserved for your reading pleasure

1.1.5 Solution to Synthesis Sanity

• Pick a high-quality synthesis tool and study its documentation thoroughly
• Learn the idioms of the tool
• Different VHDL code with same behaviour can result in very different circuits
• Be careful if you have to port VHDL code from one tool to another
• KISS: Keep It Simple Stupid
  – VHDL examples will illustrate reliable coding techniques for the synthesis tools from Synopsys, Mentor Graphics, Altera, Xilinx, and most other companies as well.
  – Follow the coding guidelines and examples from lecture
  – As you write VHDL, think about the hardware you expect to get.

Note: If you can't predict the hardware, then the hardware probably won't be very good (small, fast, correct, etc)

1.1.6 Standard Logic 1164

std_logic_1164: IEEE standard for signal values in VHDL.

    'U'  uninitialized
    'X'  strong unknown
    '0'  strong 0
    '1'  strong 1
    'Z'  high impedance
    'W'  weak unknown
    'L'  weak 0
    'H'  weak 1
    '-'  don't care

The most common values are: 'U', 'X', '0', '1'.

If you see 'X' in a simulation, it usually means that there is a mistake in your code.

1.2 Comparison of VHDL to Other Hardware Description Languages

This section reserved for your reading pleasure

1.3 Overview of Syntax

1.3.1 Syntactic Categories

This section reserved for your reading pleasure

1.3.2 Library Units

This section reserved for your reading pleasure
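As a small illustration of the std_logic values listed in section 1.1.6, the testbench below is a sketch (not from the lecture slides) of two common ways that 'U' and 'X' show up in a simulation. The entity name std_logic_demo and the signal names are hypothetical. It assumes the standard resolution rules of std_logic_1164: a signal with no driver and no initial value reads 'U', two drivers fighting with '0' and '1' resolve to 'X', and 'Z' loses to any strong value.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical testbench; names are for illustration only.
entity std_logic_demo is
end entity;

architecture tb of std_logic_demo is
  signal never_driven : std_logic;  -- no initial value and no driver
  signal shared       : std_logic;  -- driven by two processes
begin

  -- First driver: '0' for 10 ns, then releases the wire with 'Z'.
  drv0 : process
  begin
    shared <= '0';
    wait for 10 ns;
    shared <= 'Z';
    wait;
  end process;

  -- Second driver: always '1'.
  drv1 : process
  begin
    shared <= '1';
    wait;
  end process;

  check : process
  begin
    wait for 5 ns;
    assert never_driven = 'U'
      report "expected 'U' on an undriven signal" severity error;
    assert shared = 'X'
      report "expected 'X' when '0' and '1' fight" severity error;
    wait for 10 ns;  -- now at 15 ns, drv0 drives 'Z'
    assert shared = '1'
      report "expected '1' when the other driver is 'Z'" severity error;
    wait;
  end process;

end architecture;
```

Seeing 'X' here comes from a deliberate driver conflict; in ordinary designs it more often points to an accidental second driver or an uninitialized value propagating through logic.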

1.3.3 Entities and Architecture

Each hardware module is described with an Entity/Architecture pair

[Figure: Entity and Architecture]

Entity

    library ieee;
    use ieee.std_logic_1164.all;
    entity and_or is
      port (
        a, b, c : in  std_logic;
        z       : out std_logic
      );
    end entity;

Example of an entity

Architecture

    architecture main of and_or is
      signal x : std_logic;
    begin
      x <= a and b;
      z <= x or (a and c);
    end architecture;

Example of architecture

1.3.4 Concurrent Statements

• An architecture contains concurrent statements
• Concurrent statements execute in parallel
  – Concurrent statements make VHDL fundamentally different from most software languages.
  – Hardware (gates) naturally execute in parallel; VHDL mimics the behaviour of real hardware.
  – At each infinitesimally small moment of time, in parallel, every gate:
    1. samples its inputs
    2. computes the value of its output
    3. drives the output

Concurrent Statements

    architecture main1 of simple is
    begin
      x1 <= a AND b;
      x2 <= NOT x1;
      z  <= NOT x2;
    end main1;

    architecture main2 of simple is
    begin
      z  <= NOT x2;
      x2 <= NOT x1;
      x1 <= a AND b;
    end main2;

[Figure: circuit with inputs a and b, internal signals x1 and x2, and output z]

The order of concurrent statements doesn't matter

Types of Concurrent Statements

conditional assignment: similar to conventional if-then-else
    c <= a+b when sel='1' else a+c when sel='0' else "0000";
selected assignment: similar to conventional case/switch
    with color select d <= "00" when red , "01" when ... ;
component instantiation: use a hardware module/component
    add1 : adder port map( a => f , b => g , s => h , co => i );
for-generate: create multiple pieces of hardware
    bgen : for i in 1 to 7 generate b(i)<=a(7-i); end generate;
if-generate: conditionally create some hardware
    okgen : if optgoal /= fast generate
      result <= ((a and b) or (d and not e)) or g;
    end generate;
    fastgen : if optgoal = fast generate
      result <= '1';
    end generate;
process: description of complex behaviour (section 1.3.6)

1.3.5 Component Declaration and Instantiations

This section reserved for your reading pleasure

1.3.6 Processes

• Processes are used to describe complex and potentially unsynthesizable behaviour
• A process is a concurrent statement (section 1.3.4).
• The body of a process contains sequential statements (section 1.3.8)
• Processes are the most complex and difficult to understand part of VHDL (sections 1.5 and 1.6)

Example Process with Sensitivity List

    process (a, b, c)
    begin
      y <= a and b;
      if (a = '1') then
        z1 <= b and c;
        z2 <= not c;
      else
        z1 <= b or c;
        z2 <= c;
      end if;
    end process;
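The claim that the order of concurrent statements doesn't matter can be checked in simulation. The sketch below completes the abbreviated main1/main2 fragments with an assumed entity declaration (the port list and the internal signal declarations are not shown on the slide, so they are guesses) and adds a hypothetical testbench, simple_tb, that instantiates both architectures side by side and asserts that their outputs agree.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Assumed entity: the slide shows only the architecture bodies.
entity simple is
  port (a, b : in std_logic; z : out std_logic);
end entity;

architecture main1 of simple is
  signal x1, x2 : std_logic;
begin
  x1 <= a and b;
  x2 <= not x1;
  z  <= not x2;
end architecture;

-- Same statements, listed in the opposite order.
architecture main2 of simple is
  signal x1, x2 : std_logic;
begin
  z  <= not x2;
  x2 <= not x1;
  x1 <= a and b;
end architecture;

library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical testbench comparing the two architectures.
entity simple_tb is
end entity;

architecture tb of simple_tb is
  signal a, b   : std_logic := '0';
  signal z1, z2 : std_logic;
begin
  u1 : entity work.simple(main1) port map (a => a, b => b, z => z1);
  u2 : entity work.simple(main2) port map (a => a, b => b, z => z2);

  stim : process
  begin
    a <= '0'; b <= '0'; wait for 10 ns;
    assert z1 = z2 and z1 = '0' report "mismatch for 00" severity error;
    a <= '0'; b <= '1'; wait for 10 ns;
    assert z1 = z2 and z1 = '0' report "mismatch for 01" severity error;
    a <= '1'; b <= '0'; wait for 10 ns;
    assert z1 = z2 and z1 = '0' report "mismatch for 10" severity error;
    a <= '1'; b <= '1'; wait for 10 ns;
    assert z1 = z2 and z1 = '1' report "mismatch for 11" severity error;
    wait;
  end process;
end architecture;
```

Both architectures describe the same netlist (an AND gate followed by two inverters), which is why the textual order has no effect.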

Example Process with Wait Statements

    process
    begin
      wait until rising_edge(clk);
      if (a = '1') then
        z <= '1';
        y <= '0';
      else
        y <= a or b;
      end if;
    end process;

Sensitivity Lists and Wait Statements

• Processes must have either a sensitivity list or at least one wait statement on each execution path through the process.
• Processes cannot have both a sensitivity list and a wait statement.

Sensitivity List

The sensitivity list contains the signals that are read in the process.

A process is executed when a signal in its sensitivity list changes value.

An important coding guideline to ensure consistent synthesis and simulation results is to include all signals that are read in the sensitivity list.

There is one exception to this rule: for a process that implements a flip-flop with an if rising_edge statement, it is acceptable to include only the clock signal in the sensitivity list; other signals may be included, but are not needed.

1.3.7 Generate Statements

• Two categories of generate statements:
  – if-generate: conditionally generate some hardware
  – for-generate: generate multiple copies of some hardware
• Generate statements are executed during elaboration (at compile time)
• The conditions and loop ranges must be static
  – Must be able to be evaluated at elaboration
  – Must not depend upon the value of any signal
• A generate statement must be preceded by a label
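The rules for generate statements in section 1.3.7 can be sketched with a small parameterized module. The entity bit_reverse and its generics width and inverted are hypothetical names chosen for this example. The loop range and both conditions depend only on generics, so they are static and can be evaluated at elaboration, and every generate statement carries a label.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical module: reverses the bit order of a vector and
-- optionally inverts the result.
entity bit_reverse is
  generic (
    width    : positive := 8;      -- assumed parameter
    inverted : boolean  := false   -- assumed parameter
  );
  port (
    a : in  std_logic_vector(width-1 downto 0);
    b : out std_logic_vector(width-1 downto 0)
  );
end entity;

architecture main of bit_reverse is
  signal r : std_logic_vector(width-1 downto 0);
begin

  -- for-generate: one assignment (one wire) per bit.
  -- The range depends only on the generic width, so it is static.
  rev : for i in 0 to width-1 generate
    r(i) <= a(width-1-i);
  end generate;

  -- if-generate: exactly one of the two blocks below is elaborated,
  -- because the conditions depend only on the generic inverted.
  plain : if not inverted generate
    b <= r;
  end generate;

  inv : if inverted generate
    b <= not r;
  end generate;

end architecture;
```

Replacing the generic in either condition with a signal would violate the static requirement, and the code would be rejected at analysis or elaboration.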

1.3.8 Sequential Statements

Used inside processes, functions, and procedures.

    wait               wait until . . . ;
    signal assignment  . . . <= . . . ;
    if-then-else       if . . . then . . . elsif . . . end if;
    case               case . . . is
                         when . . . | . . . => . . . ;
                         when . . . => . . . ;
                       end case;
    loop               loop . . . end loop;
    while loop         while . . . loop . . . end loop;
    for loop           for . . . in . . . loop . . . end loop;
    next               next . . . ;

The most commonly used sequential statements

1.3.9 A Few More Miscellaneous VHDL Features

This section reserved for your reading pleasure

1.4 Concurrent vs Sequential Statements

All concurrent assignments can be translated into sequential statements. But, not all sequential statements can be translated into concurrent statements.

1.4.1 Concurrent Assignment vs Process

The two code fragments below have identical behaviour:

    architecture main of tiny is
    begin
      b <= a;
    end main;

    architecture main of tiny is
    begin
      process (a) begin
        b <= a;
      end process;
    end main;

1.4.2 Conditional Assignment vs If Statements

The two code fragments below have identical behaviour:

Concurrent Statements

    t <= <val1> when <cond>
         else <val2>;

Sequential Statements

    if <cond> then
      t <= <val1>;
    else
      t <= <val2>;
    end if;
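To make the equivalence in section 1.4.2 concrete, here is a sketch of a 2-to-1 multiplexer written both ways. The entity mux2, its port names, and the architecture names are hypothetical; the templates above are filled in with sel = '1' as the condition, d1 as the first value, and d0 as the second.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical 2-to-1 multiplexer used to fill in the templates.
entity mux2 is
  port (
    sel, d0, d1 : in  std_logic;
    t           : out std_logic
  );
end entity;

-- Concurrent form: conditional signal assignment.
architecture concurrent of mux2 is
begin
  t <= d1 when sel = '1' else d0;
end architecture;

-- Sequential form: if statement inside a process.
-- The sensitivity list includes every signal that the process reads.
architecture sequential of mux2 is
begin
  process (sel, d0, d1)
  begin
    if sel = '1' then
      t <= d1;
    else
      t <= d0;
    end if;
  end process;
end architecture;
```

Either architecture can be bound to the entity, and both should simulate identically, including when sel is 'U' or 'X' (both forms fall through to d0).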

1.4.3 Selected Assignment vs Case Statement

The two code fragments below have identical behaviour

Concurrent Statements

    with <expr> select
      t <= <val1> when <choices1>,
           <val2> when <choices2>,
           <val3> when <choices3>;

Sequential Statements

    case <expr> is
      when <choices1> =>
        t <= <val1>;
      when <choices2> =>
        t <= <val2>;
      when <choices3> =>
        t <= <val3>;
    end case;

1.4.4 Coding Style

Code that's easy to write with sequential statements, but difficult with concurrent:

    case <expr> is
      when <choice1> =>
        if <cond> then
          o <= <expr1>;
        else
          o <= <expr2>;
        end if;
      when <choice2> =>
        . . .
    end case;

1.5 Overview of Processes

Processes are the most difficult VHDL construct to understand. This section gives an overview of processes. Section 1.6 gives the details of the semantics of processes.

• Within a process, statements are executed almost sequentially
• Among processes, execution is done in parallel
• Remember: a process is a concurrent statement!

Process Semantics

• VHDL mimics hardware
• Hardware (gates) execute in parallel
• Processes execute in parallel with each other
• All possible orders of executing processes must produce the same simulation results (waveforms)
• If a signal is not assigned a value, then it holds its previous value

All orders of executing concurrent statements must produce the same waveforms
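The coding-style point in section 1.4.4 can be seen by writing one small function both ways. The sketch below uses a hypothetical entity style_demo with made-up ports op, en, a, b, and o. The sequential version nests an if inside a case, mirroring the template above; the concurrent version has to flatten the nesting into a single chain of conditions.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical example; all names are for illustration only.
entity style_demo is
  port (
    op   : in  std_logic_vector(1 downto 0);
    en   : in  std_logic;
    a, b : in  std_logic;
    o    : out std_logic
  );
end entity;

-- Sequential style: the structure of the decision is visible.
architecture sequential of style_demo is
begin
  process (op, en, a, b)
  begin
    case op is
      when "00" =>
        if en = '1' then
          o <= a;
        else
          o <= b;
        end if;
      when "01" =>
        o <= a and b;
      when others =>
        o <= '0';
    end case;
  end process;
end architecture;

-- Concurrent style: the nested decision is flattened, and the test
-- on op has to be repeated for each inner branch.
architecture concurrent of style_demo is
begin
  o <= a       when op = "00" and en = '1' else
       b       when op = "00"              else
       a and b when op = "01"              else
       '0';
end architecture;
```

With only one nested condition the concurrent form is still manageable; the repetition grows quickly as more choices carry their own inner conditions.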

Process Semantics

    architecture
      procA: process
        stmtA1;
        stmtA2;
        stmtA3;
      end process;
      procB: process
        stmtB1;
        stmtB2;
      end process;

Possible execution sequences:

    single threaded, procA before procB:       A1 A2 A3 B1 B2
    single threaded, procB before procA:       B1 B2 A1 A2 A3
    multithreaded, procA and procB in parallel: A1 A2 A3 alongside B1 B2

All execution orders must have same behaviour

1.5.1 Combinational Process vs Clocked Process

Each well-written synthesizable process is either combinational or clocked.

Combinational process:

• Executing the process takes part of one clock cycle
• Target signals are outputs of combinational circuitry
• A combinational process must have a sensitivity list
• A combinational process must not have any wait statements
• A combinational process must not have any uses of rising_edge or falling_edge
• The hardware for a combinational process is just combinational circuitry

Clocked process:

• Executing the process takes one (or more) clock cycles
• Target signals are outputs of flops
• Process contains one or more wait or if rising_edge statements
• Hardware contains combinational circuitry and flip flops

Note: Clocked processes are sometimes called "sequential processes", but this can be easily confused with "sequential statements", so in ECE-327 we'll refer to synthesizable processes as either "combinational" or "clocked".
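The two kinds of process in section 1.5.1 can be placed side by side. In the sketch below, the entity comb_vs_clocked and its port names are hypothetical. Both processes compute the same expression; the combinational one should synthesize to an XOR gate alone, while the clocked one should synthesize to an XOR gate feeding a flip-flop.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical example; names are for illustration only.
entity comb_vs_clocked is
  port (
    clk    : in  std_logic;
    a, b   : in  std_logic;
    y_comb : out std_logic;
    y_reg  : out std_logic
  );
end entity;

architecture main of comb_vs_clocked is
begin

  -- Combinational process: sensitivity list with every signal read,
  -- no wait statement, no rising_edge or falling_edge.
  comb : process (a, b)
  begin
    y_comb <= a xor b;
  end process;

  -- Clocked process: the target signal is the output of a flop.
  -- Only the clock is needed in the sensitivity list.
  clocked : process (clk)
  begin
    if rising_edge(clk) then
      y_reg <= a xor b;
    end if;
  end process;

end architecture;
```

In simulation, y_comb follows a and b within the same time step, while y_reg changes only at a rising edge of clk.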

Combinational or Clocked Process? (1)

    process (a,b,c)
    begin
      p1 <= a;
      if (b = c) then
        p2 <= b;
      else
        p2 <= a;
      end if;
    end process;

Combinational or Clocked Process? (2)

    process
    begin
      wait until rising_edge(clk);
      b <= a;
    end process;

Combinational or Clocked Process? (3)

    process (clk)
    begin
      if rising_edge(clk) then
        b <= a;
      end if;
    end process;

Combinational or Clocked Process? (4)

    process (clk)
    begin
      a <= clk;
    end process;

Combinational or Clocked Process? (5)

    process
    begin
      wait until rising_edge(a);
      c <= b;
    end process;

1.5.2 Latch Inference

The semantics of VHDL require that if a signal is assigned a value on some passes through a process and not on other passes, then on a pass through the process when the signal is not assigned a value, it must maintain its value from the previous pass.

    process (a, b, c)
    begin
      if (a = '1') then
        z1 <= b;
        z2 <= b;
      else
        z1 <= c;
      end if;
    end process;

Example of latch inference

[Figure: waveforms for a, b, c, z1, and z2]

Latch Inference

When a signal's value must be stored, VHDL infers a latch or a flip-flop in the hardware to store the value.

If you want a latch or a flip-flop for the signal, then latch inference is good.

If you want combinational circuitry, then latch inference is bad.

Loop, Latch, Flop

[Figure: three circuits, each with inputs a and b and output z: a combinational loop, a latch (enable input EN), and a flip-flop (D input, Q output)]

Question: Write VHDL code for each of the above circuits
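In the latch-inference example of section 1.5.2, z2 is assigned only when a = '1', so it must hold its value otherwise. One common way to get purely combinational circuitry instead is to give the signal a default value at the top of the process. The sketch below wraps the slide's process in a hypothetical entity no_latch; the default of '0' for z2 is an assumption made for this example, and a real design would choose whatever value the specification calls for.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical wrapper around the process from section 1.5.2.
entity no_latch is
  port (
    a, b, c : in  std_logic;
    z1, z2  : out std_logic
  );
end entity;

architecture main of no_latch is
begin
  process (a, b, c)
  begin
    -- Default value (assumed to be '0' here): z2 is now assigned on
    -- every pass through the process, so nothing has to be stored.
    z2 <= '0';
    if a = '1' then
      z1 <= b;
      z2 <= b;   -- overrides the default on this path
    else
      z1 <= c;
    end if;
  end process;
end architecture;
```

Because the last assignment to a signal in a process pass is the one that takes effect, the default is harmless on the path where z2 is assigned again. Note that this changes the behaviour relative to the original example, where z2 kept its previous value when a was not '1'.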

Review: Introduction to VHDL

1. The goal of ece327 is to help you think ________.
2. Hardware runs ________. Software runs ________.
3. In VHDL, the interface of a circuit is called a(n) ________.
   In VHDL, the body of a circuit is called a(n) ________.
   The body of a circuit contains ________ statements, which execute ________.
   A process contains ________ statements, which execute ________.
4. To simulate hardware: At each ________, every gate in the circuit:
   1. ________
   2. ________
   3. ________

1.6 VHDL Execution: Delta-Cycle Simulation

1.6.1 Simple Simulation

Hardware runs in parallel: At each infinitesimally small moment of time, in parallel, each gate:
1. samples its inputs
2. computes the value of its output
3. drives the output

[Figure: circuit with signals a, b, c, d, and e, and its waveforms at 0ns, 10ns, 12ns, and 15ns]

  16. 1.6.2 Temporal Granularities of Simulation

register-transfer-level
• smallest unit of time is a clock cycle
• combinational logic has zero delay
• flip-flops have a delay of one clock cycle

timing simulation
• smallest unit of time is a nano, pico, or femto second
• combinational logic and wires have delay as computed by timing analysis tools
• flip-flops have setup, hold, and clock-to-Q timing parameters

delta cycles
• units of time are artifacts of VHDL semantics and simulation software
• simulation cycles, delta cycles, and simulation steps are infinitesimally small amounts of time
• VHDL semantics are defined in terms of these concepts

1.6.3 Zero-Delay Simulation

Register-transfer-level and delta-cycle simulation are both examples of zero-delay simulation.

There are two fundamental rules for zero-delay simulation:
1. Events appear to propagate through combinational circuitry instantaneously.
2. All of the gates appear to operate in parallel.

1.6.4 Intuition Behind Delta-Cycle Simulation

1.6.4.1 Introduction to Delta-Cycle Simulation

• To make it appear that events propagate instantaneously through combinational circuitry: VHDL introduces the delta cycle
  – Infinitesimally small artificial unit of time
  – In each delta cycle, in parallel, every gate in the circuit
    1. samples its input signals
    2. computes its result value
    3. drives the result value on its output signal
• To make it appear that gates operate in parallel: VHDL introduces the projected assignment
  – the effect of simulating a gate remains invisible until the beginning of the next delta cycle

1.6.4.2 Intuitive Rules for Delta-Cycle Simulation

1. Simulate a gate if any of its inputs changed. If no input changed, then the current value of the output is correct and the output can stay at the same value.
2. Each gate is simulated at most once per delta cycle.
3. When a gate is executed, the projected (i.e., new) value of the output remains invisible until the beginning of the next delta cycle.
4. Increment time when there is no need for another delta cycle, that is, when no gate had an input change value in the current delta cycle.
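The effect of projected assignment can be observed with a small simulation-only sketch. The entity name swap_demo and the initial values are assumptions made for illustration (initial values on signals are for simulation only). Because both assignments read the current values and write only projected values, the two signals exchange values.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical testbench: not part of the original slides.
entity swap_demo is
end entity;

architecture sim of swap_demo is
  signal x : std_logic := '0';  -- initial values are for simulation only
  signal y : std_logic := '1';
begin
  process begin
    wait for 1 ns;
    x <= y;  -- reads current y ('1'); x itself does not change yet
    y <= x;  -- still reads the current x ('0'), not the projected '1'
    wait for 1 ns;
    -- By now the projected values have become visible.
    assert x = '1' and y = '0'
      report "values were not swapped" severity error;
    wait;
  end process;
end architecture;
```

If projected values became visible immediately, the second assignment would read the new value of x and both signals would end up '1'.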

  17. 1.6.4.3 Example of Delta: Buffers

This section reserved for your reading pleasure

1.6.4.4 Example of Delta and Proj: Buffers

Abbreviated Code

    proc (a)
      b <= a;
    end;
    proc (b)
      c <= b;
    end;

[Figure: hardware with signals a, b, c; waveforms of a, b, c for simple simulation (1ns, 2ns) and for delta-cycle simulation with projected values (1ns); rows labelled S, C, D]

1.6.4.5 Example of Proj Asn: Flip-Flops

This section reserved for your reading pleasure

  18. 1.6.4.6 Example of Delta and Proj: Comb Loop

Truly parallel simulation:
• multiple gates/processes execute at the same time
• therefore no need for projected assignment.

[Figure: combinational-loop circuit with signals a, b, c, d, and waveforms of a, b, c, d; time axis 0ns, 1ns, δ]

  19. Correct Simulation

• Processes execute one at a time.
• Projected assignments become visible at the beginning of the next delta cycle.

[Figure: successive snapshots of the circuit (signals a, b, c, d) annotated with signal values]

  20. Different Execution Orders

The order in which we execute the processes does not affect the behaviour.

[Figure: successive snapshots of the circuit (signals a, b, c, d) annotated with signal values, shown for execution order b, c, d and for execution order b, d, c]

  21. Buggy Simulation

• Processes execute one at a time.
• Projected assignments become visible immediately.

[Figure: successive snapshots of the circuit (signals a, b, c, d) annotated with signal values, shown for execution order b, c, d and for execution order b, d, c]

  22. Analysis

Which values does c see?

[Table to fill in: rows b, d; columns Correct, Buggy]

  23. Re-do with Waveforms

[Figure: circuit with signals a, b, c, d, and delta-cycle waveforms of a, b, c, d with a final-value column, shown for execution order b, c, d and for execution order b, d, c; time axis 0ns, 1ns, δ-cycles]

  24. Buggy Simulation

[Figure: circuit with signals a, b, c, d, and delta-cycle waveforms of a, b, c, d with a final-value column, shown for execution order b, c, d and for execution order b, d, c; time axis 0ns, 1ns, δ-cycles]

  25. 1.6.5 VHDL Delta-Cycle Simulation

The algorithm presented here is a simplification of the actual algorithm in the VHDL Standard. This algorithm does not support:
• delayed assignments; for example: a <= b after 2 ns;
• resolution, which is where multiple processes write to the same signal (usually a mistake, but useful for tri-state busses)

1.6.5.1 Informal Description of Algorithm

• Processes have three modes:
  Resumed: The process has work to do and is waiting its turn to execute.
  Executing: The process is running.
  Suspended: The process is idle and has no work to do.
• A simulation run is initialization followed by a sequence of simulation rounds.
• Initialization:
  – Each process starts off resumed.
  – Each signal starts off with its default value ('U' for std_logic).
• In each simulation round:
  – Increment time.
  – Resume all processes that are waiting for the current time.
  – A simulation round is a sequence of simulation cycles.
• In each simulation cycle:
  – Copy projected value of signals to current value.
  – Resume processes based on sensitivity lists and wait conditions.
  – Execute each resumed process.
  – If no projected assignment changed the value of a signal, then increment time and start next simulation round.

1.6.5.2 Example: VHDL Sim for Buffers

This section reserved for your reading pleasure

1.6.5.3 Definitions and Algorithm

Notes on Simulation Algorithm

• At a wait statement, the process will suspend even if the condition is true in the current simulation cycle. The process will resume the next time that a signal in the condition changes and the condition is true.
• If we execute multiple assignments to the same signal in the same process in the same simulation cycle, only the last assignment actually takes effect; all but the last assignment are ignored.
• In a simulation round, the first simulation cycle is not a delta cycle.
• The mode of a process is determined implicitly by keeping track of the set of processes that are resumed (the resume set) and the process(es) that is (are) executing. All other processes are suspended.
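The note that only the last assignment to a signal takes effect can be checked with a short simulation-only sketch. The entity name last_wins is hypothetical; the behaviour follows from each assignment overwriting the single projected value of the signal.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical testbench: not part of the original slides.
entity last_wins is
end entity;

architecture sim of last_wins is
  signal s : std_logic;  -- starts at 'U', the default for std_logic
begin
  process begin
    s <= '0';  -- projected value becomes '0'
    s <= '1';  -- overwrites the projected value; '0' is never visible
    wait for 1 ns;
    assert s = '1'
      report "expected the last assignment to take effect" severity error;
    wait;
  end process;
end architecture;
```

The same rule is what makes it safe to give a signal a default value at the top of a process and override it in a later branch.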

  26. VHDL Simulation Definitions

Definition simulation step: Executing one sequential assignment or process mode change.

Definition simulation cycle: The operations that occur in one iteration of the simulation algorithm.

Definition delta cycle: A simulation cycle where time did not advance at the beginning of the cycle.

Definition simulation round: A sequence of simulation cycles that all have the same simulation time.

More Formal Description of Algorithm

This section reserved for your reading pleasure

1.6.5.4 Example: Delta-Cycle for Flip-Flops

This section reserved for your reading pleasure

1.6.5.5 Ex: VHDL Sim of Comb Loop

    proc_a : process begin
      a <= '0';
      wait for 1 ns;
      a <= '1';
      wait;
    end process;

    proc_b : process (a) begin
      b <= not( a );
    end process;

    proc_c : process (a,b,d) begin
      c <= not( a ) or b or d;
    end process;

    proc_d : process (a,c) begin
      d <= a and c;
    end process;

[Figure: circuit with signals a, b, c, d]

  27. [Figure: circuit with signals a, b, c, d, and an empty delta-cycle simulation grid starting at time 0ns, with rows for Sim rounds, Sim cycles, proc_a, proc_b, proc_c, proc_d, a, b, c, d]

  28. 1.6.5.6 Rules and Observations for Drawing Delta-Cycle Simulations

The VHDL Language Reference Manual gives only a textual description of the VHDL semantics. The conventions for drawing the waveforms are just our own.

• Each column is a simulation step.
• In a simulation step, either exactly one process changes mode or exactly one signal changes value, except in the first two simulation steps of each simulation cycle, when multiple current values may be updated and multiple processes may resume.
• If a projected assignment assigns the same value as the signal's current projected value, the projected assignment must still be shown, because this assignment will force another simulation cycle in the current simulation round.
• If a signal's visible value is updated with the same value as it currently has, this assignment is not shown, because it will not trigger any sensitivity lists.
• Assignments to signals may be denoted by either the number/letter of the new value or one of the edge symbols:

[Table: edge symbols for each combination of old value (U, 0, 1) and new value (U, 0, 1)]

Some observations about delta-cycle simulation waveforms that can be helpful in checking that a simulation is correct:

• In the first simulation step of the first simulation cycle of a simulation round (i.e., the first simulation step of a simulation round), at least one process will resume. This is in contrast to the first simulation step of all other simulation cycles, where current values of signals are updated with projected values.
• At the end of a simulation cycle all processes are suspended.
• In the last simulation cycle of a simulation round either no signals change value, or any signal that changes value is not in the sensitivity list of any process.

1.6.6 External Inputs and Flip-Flops

Question: Do the signals b1 and b2 have the same behaviour from 10–20 ns?

    architecture mathilde of sauvé is
      signal clk, a1, a2, b1, b2 : std_logic;
    begin
      process begin
        clk <= '0';
        wait for 10 ns;
        clk <= '1';
        wait for 10 ns;
      end process;

      process begin
        wait for 10 ns;
        a1 <= '1';
      end process;

      process begin
        wait until rising_edge(clk);
        a2 <= '1';
      end process;

      process begin
        wait until rising_edge( clk );
        b1 <= a1;
        b2 <= a2;
      end process;
    end architecture;

Review: Delta-Cycle Simulation

A delta-cycle is a ________ at the beginning of which ________.

The two illusions of zero-delay simulation:
1. ________ propagate ________
2. ________ operate ________

VHDL achieves the illusions by:
1. ________
2. ________

  29. 1.7 Register-Transfer-Level Simulation

1.7.1 Overview

• Much simpler than delta cycle
• Columns are real time: clock cycles, nanoseconds, etc.
• Can simulate both synthesizable and unsynthesizable code
• Cannot simulate combinational loops
• Same values as delta-cycle at end of simulation round

Question: In this code, what value should b have at 10 ns: does it read the new value of a or the old value?

    process begin
      a <= '0';
      wait for 10 ns;
      a <= '1';
      ...
    end process;

    process begin
      b <= '0';
      wait for 10 ns;
      b <= a;
      ...
    end process;

1.7.2 Technique for Register-Transfer Level Simulation

1. Pre-processing
   (a) Separate processes into timed, clocked, and combinational
   (b) Decompose each combinational process into separate processes with one target signal per process
   (c) Sort combinational processes into topological order based on dependencies
2. For each moment of real time:
   (a) Run timed processes in any order, reading old values of signals.
   (b) Run clocked processes in any order, reading new values of timed signals and old values of registered signals.
   (c) Run combinational processes in topological order, reading new values of signals.

1.7.3 Examples of RTL Simulation

1.7.3.1 RTL Simulation Example 1

We revisit an earlier example from delta-cycle simulation, but change the code slightly and do register-transfer-level simulation.

    proc1: process (a, b, c) begin
      d <= NOT c;
      c <= a AND b;
    end process;

    proc2: process (b, d) begin
      e <= b AND d;
    end process;

    proc3: process begin
      a <= '1';
      b <= '0';
      wait for 3 ns;
      b <= '1';
      wait for 99 ns;
    end process;

  30. Decompose and sort comb procs

Decomposed:

    proc1d: process (c) begin
      d <= NOT c;
    end process;

    proc1c: process (a, b) begin
      c <= a AND b;
    end process;

    proc2: process (b, d) begin
      e <= b AND d;
    end process;

Sorted:

    proc1c: process (a, b) begin
      c <= a and b;
    end process;

    proc1d: process (c) begin
      d <= not c;
    end process;

    proc2: process (b, d) begin
      e <= b and d;
    end process;

Waveforms

    proc3: process begin
      a <= '1';
      b <= '0';
      wait for 3 ns;
      b <= '1';
      wait for 99 ns;
    end process;

    proc1c: process (a, b) begin
      c <= a and b;
    end process;

    proc1d: process (c) begin
      d <= not c;
    end process;

    proc2: process (b, d) begin
      e <= b and d;
    end process;

[Figure: empty waveform grid for signals a, b, c, d, e, each starting at U; time axis 0ns, 1ns, 2ns, 3ns, 102ns]

Combinational Loops

Why is RTL-simulation unable to support combinational loops?

    process (a, c) begin
      b <= a xor c;
    end process;

    process (b) begin
      c <= not b;
    end process;

[Figure: circuit with signals a, b, c]

Decomposing if-then-else Clauses

This example illustrates how to decompose a combinational process that contains assignments to multiple variables and if-then-else clauses.

Original:

    process (a, b, c) begin
      if a = '1' then
        y <= b;
        z <= c;
      else
        y <= not b;
        z <= not c;
      end if;
    end process;

Decomposed: (left blank on the slide)
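One possible decomposition of the if-then-else process above is sketched here, following the rule of one target signal per process. The entity name decomposed and the process labels are assumptions for illustration; the slide itself leaves the decomposed version blank.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical wrapper entity so that the sketch is self-contained.
entity decomposed is
  port (
    a, b, c : in  std_logic;
    y, z    : out std_logic
  );
end entity;

architecture main of decomposed is
begin
  -- One process per target signal. Each sensitivity list holds only
  -- the signals that the process actually reads.
  proc_y : process (a, b) begin
    if a = '1' then
      y <= b;
    else
      y <= not b;
    end if;
  end process;

  proc_z : process (a, c) begin
    if a = '1' then
      z <= c;
    else
      z <= not c;
    end if;
  end process;
end architecture;
```

The condition is duplicated in each process; this costs nothing in hardware and lets the two processes be sorted independently in topological order.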

  31. Review: RTL Simulation

1. Algorithm for RTL simulation:
   Preprocessing
   (a) Separate processes into two groups: ________ and ________
   (b) ________ the processes so that each process ________
   (c) Sort the processes into ________ order
   Running
   For each moment in time or clock cycle:
   (a) Run the ________ processes in ________ order. Processes read the ________ value of signals.
   (b) Run the ________ processes in ________ order. Processes read the ________ value of signals.
2. What are the defining characteristics of zero-delay simulation?
   (a) ________ operate ________
   (b) ________ propagate ________
3. Comparing delta-cycle and RTL simulation:

[Table to fill in: rows Delta cycle, RTL; columns Illusion #1, Illusion #2]

1.8 Simple RTL Simulation in Software

This is an advanced section. It is not covered in the course and will not be tested.

1.9 Variables in VHDL

This is an advanced section. It is not covered in the course and will not be tested.

1.10 Delta-Cycle Simulation with Delays

This is an advanced section. It is not covered in the course and will not be tested.

1.11 VHDL and Hardware Building Blocks

1.11.1 Basic Building Blocks

This section reserved for your reading pleasure

1.11.2 Deprecated Building Blocks for RTL

This section reserved for your reading pleasure

  32. 1.11.3 Hardware and Code for Flops

1.11.3.1 Flops with Waits and Ifs

This section reserved for your reading pleasure

1.11.3.2 Flops with Synchronous Reset

    process (clk) begin
      if rising_edge(clk) then
        if (reset = '1') then
          q <= '0';
        else
          q <= d;
        end if;
      end if;
    end process;

Flop with Synchronous Reset: Wait-Style

    process begin
      wait until rising_edge(clk);
      if reset = '1' then
        q <= '0';
      else
        q <= d;
      end if;
    end process;

Variation on a Floppy Theme

Question: What is this?

    process (clk, reset) begin
      if reset = '1' then
        q <= '0';
      else
        if rising_edge(clk) then
          q <= d;
        end if;
      end if;
    end process;

  33. Flop with Chip-Enable

    process (clk) begin
      if rising_edge(clk) then
        if ce = '1' then
          q <= d;
        end if;
      end if;
    end process;

Wait-style flop with chip-enable included in course notes

Q: Flop with a Mux on the Input?

[Figure: mux controlled by sel choosing between d0 and d1, feeding the D input of a flop clocked by clk with output q]

Q: Flops with a Mux on the Output?

[Figure: two flops clocked by clk, with inputs d0 and d1 and outputs q0 and q1, feeding a mux controlled by sel with output q]

Behavioural Comparison

[Figure: the mux-on-input circuit and the mux-on-output circuit side by side]

Question: For the two circuits above, does q have the same behaviour in both circuits?

[Figure: empty waveform grids for "Mux on input" and "Mux on output", each with rows clk, sel, d0, d1, q]
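A minimal sketch of the two circuits in VHDL may help when working through the behavioural comparison. The entity name mux_flops, the output names q_in_mux and q_out_mux, and the choice that sel = '1' selects d1 are all assumptions made for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity holding both circuits side by side.
entity mux_flops is
  port (
    clk, sel, d0, d1 : in  std_logic;
    q_in_mux         : out std_logic;  -- mux on the flop input
    q_out_mux        : out std_logic   -- mux on the flop outputs
  );
end entity;

architecture main of mux_flops is
  signal q0, q1 : std_logic;
begin
  -- Mux on input: one flop; sel is sampled at the rising clock edge.
  process (clk) begin
    if rising_edge(clk) then
      if sel = '1' then
        q_in_mux <= d1;
      else
        q_in_mux <= d0;
      end if;
    end if;
  end process;

  -- Mux on output: two flops; sel is applied combinationally
  -- to the registered values.
  process (clk) begin
    if rising_edge(clk) then
      q0 <= d0;
      q1 <= d1;
    end if;
  end process;

  q_out_mux <= q1 when sel = '1' else q0;
end architecture;
```

Simulating both outputs with a sel that changes between clock edges is one way to see where the two behaviours can differ.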

  34. 1.11.3.3 Flop with Chip-Enable and Mux on Input

Hint: Chip Enable

    process (clk) begin
      if rising_edge(clk) then
        if ce = '1' then
          q <= d;
        end if;
      end if;
    end process;

1.11.3.4 Flops with Chip-Enable, Muxes, and Reset

This section reserved for your reading pleasure

1.11.4 Example Coding Styles

This section reserved for your reading pleasure

1.12 Synthesizable vs Non-Synthesizable Code

For us to consider a VHDL program synthesizable, all of the conditions below must be satisfied:
• the program must be theoretically implementable in hardware
• the hardware that is produced must be consistent with the structure of the source code
• the source code must be portable across a wide range of synthesis tools, in that the synthesis tools all produce correct hardware

Synthesis is done by matching VHDL code against templates or patterns. It's important to use idioms that your synthesis tools recognize.

Think like hardware: when you write VHDL, you should know what hardware you expect to be produced by the synthesizer.

1.12.1 Wait For

Wait for length of time (UNSYNTHESIZABLE)

    wait for 10 ns;

Reason: Delays through circuits are dependent upon both the circuit and its operating environment, particularly supply voltage and temperature. For example, imagine trying to build an AND gate that will have exactly a 2ns delay in all environments.

1.12.2 Initial Values

Initial values on signals (UNSYNTHESIZABLE)

    signal bad_signal : std_logic := '0';

Reason: At powerup, the values on signals are random (except for some FPGAs).

  35. 1.12.3 Assignments before Wait Statement

If a synthesizable clocked process has a wait statement, then the process must begin with a wait statement.

Unsynthesizable:

    process begin
      c <= a;
      d <= b;
      wait until rising_edge(clk);
    end process;

Synthesizable:

    process begin
      wait until rising_edge(clk);
      c <= a;
      d <= b;
    end process;

Reason: In simulation, any assignments before the first wait statement will be executed in the first delta-cycle. In the synthesized circuit, the signals will be outputs of flip-flops and will first be assigned values after the first rising-edge.

1.12.4 "if rising_edge" and "wait" in Same Process

An if rising_edge statement and a wait statement in the same process (UNSYNTHESIZABLE)

    process begin
      if rising_edge(clk) then
        q0 <= d0;
      end if;
      wait until rising_edge(clk);
      q0 <= d1;
    end process;

Reason: The idioms for synthesis tools generally expect just a single type of flop-generating statement in each process.

1.12.5 "if rising_edge" with "else" Clause

The if statement has a rising_edge condition and an else clause (UNSYNTHESIZABLE).

    process (clk) begin
      if rising_edge(clk) then
        q0 <= d0;
      else
        q0 <= d1;
      end if;
    end process;

Reason: The idioms for the synthesis tools expect a signal to be either registered or combinational, not both.

1.12.6 While Loop with Dynamic Condition and Combinational Body

A while loop where the condition is dynamic (depends upon a signal value) and the body is combinational is unsynthesizable. The loop below is unsynthesizable:

    process (a,b,c) begin
      while a = '1' loop
        z <= b and c;
      end loop;
    end process;

This loop is designed to be very small, but still illustrates the problem. The loop itself is nonsensical.

Reason: Cannot synthesize reasonable hardware that has the correct behaviour.

  36. For Loop with Combinational Body

A for-loop with a combinational body is synthesizable, because the loop condition can be evaluated statically (at compile/elaboration time). The loop below is synthesizable:

    process ( b, c ) begin
      for i in 0 to 3 loop
        z(i) <= b(i) and c(i);
      end loop;
    end process;

An equivalent while loop would require variables, which are an advanced topic (section 1.9).

While loops with dynamic conditions and clocked bodies are synthesizable, but are an example of an implicit state machine and are an advanced topic.

1.13 Guidelines for Desirable Hardware

Code that is synthesizable, but undesirable (i.e., bad coding practices):
• latches
• combinational loops
• multiple drivers for a signal
• asynchronous resets
• using a data signal as a clock
• using a clock signal as data

To prevent undesirable hardware, some synthesis tools will flag some of these problems as "unsynthesizable".

Know Your Hardware

The most important guideline is: know what you want the synthesis tool to build for you.
• For every signal in your design, know whether it should be a flip-flop or combinational. Check the output of the synthesis tool to see if the flip-flops in your circuit match your expectations, and to check that you do not have any latches in your design.
• If you cannot predict what hardware the synthesis tool will generate, then you probably will be unhappy with the result of synthesis.

1.13.1 Latches

Difference between a flip-flop and a latch:
flip-flop Edge sensitive: output only changes on rising (or falling) edge of clock
latch Level sensitive: output changes whenever clock is high (or low)

A common implementation of a flip-flop is a pair of latches (Master/Slave flop).

Latches are sometimes called "transparent latches", because they are transparent (input directly connected to output) when the clock is high.

The clock to a latch is sometimes called the "enable" line.

There is more information in the course notes on timing analysis for storage devices (section 8.3).

  37. Latch: Combinational if-then without else

    process (a, b) begin
      if (a = '1') then
        c <= b;
      end if;
    end process;

• For a combinational process, every signal that is assigned to must be assigned to in every branch of if-then and case statements.
  reason If a signal is not assigned a value in a path through a combinational process, then that signal will be a latch.
  note For a clocked process, if a signal is not assigned a value in a clock cycle, then the flip-flop for that signal will have a chip-enable pin. Chip-enable pins are fine; they are available on flip-flops in essentially every cell library.

Signals Missing from Sensitivity List

    process (a) begin
      c <= a and b;
    end process;

[Figure: gate with inputs a, b and output c]

• For a combinational process, the sensitivity list should contain all of the signals that are read in the process.
  reason Gives consistent results across different tools. Many synthesis tools will implicitly include all signals that a process reads in its sensitivity list. This differs from the VHDL Standard. A synthesis tool that adheres to the standard will either generate an error or will create hardware with latches or flops clocked by data signals if not all signals that are read from are included in the sensitivity list.
  exception In a clocked process using an if rising_edge, it is acceptable to have only the clock in the sensitivity list.

1.13.2 Combinational Loops

A combinational loop is a cyclic path of dependencies through one or more combinational processes.

    process (a, b, c) begin
      if a = '0' then
        d <= b;
      else
        d <= c;
      end if;
    end process;

    process (d, e) begin
      b <= d and e;
    end process;

[Figure: circuit with signals a, b, c, d, e showing the cyclic path through b and d]

• If you need a signal to be dependent on itself, you must include a register somewhere in the cyclic path.
• Some FPGA synthesis tools consider a combinational loop to be unsynthesizable. We consider it to be synthesizable and bad-hardware, because the hardware is obvious and is obviously bad.

1.13.3 Multiple Drivers

    z <= a and b;
    z <= c;

[Figure: an AND gate with inputs a, b and a wire from c both driving z]

• Each signal should be assigned to in only one process. This is often called the "single assignment rule".
  reason Multiple processes driving the same signal is the same as having multiple gates driving the same wire. This can cause contention, tri-state values, and other bad things.
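The latch rule above (assign every target in every branch, and read only signals that are in the sensitivity list) can be met with a default assignment at the top of the process. The sketch below is one way to do this; the entity name no_latch and the default value '0' are assumptions, and the right default depends on the design.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical wrapper entity so that the sketch is self-contained.
entity no_latch is
  port (
    a, b : in  std_logic;
    c    : out std_logic
  );
end entity;

architecture main of no_latch is
begin
  -- Sensitivity list contains every signal that the process reads.
  process (a, b) begin
    c <= '0';      -- default value (assumed); covers the a /= '1' path
    if a = '1' then
      c <= b;      -- last assignment in the process wins
    end if;
  end process;
end architecture;
```

Because c now receives a value on every path through the process, the synthesized hardware is purely combinational.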

  38. Multiple Drivers Example

The example below shows how a "software style" structure that puts the reset code in one process will cause multiple drivers for the signals y and z.

    process begin
      wait until rising_edge(clk);
      if reset = '1' then
        y <= '0';
        z <= '0';
      end if;
    end process;

    process begin
      wait until rising_edge(clk);
      if reset = '0' then
        if b = '1' then
          y <= c;
        end if;
      end if;
    end process;

    process begin
      wait until rising_edge(clk);
      if reset = '0' then
        if a = '1' then
          z <= b and c;
        else
          z <= d;
        end if;
      end if;
    end process;

1.13.4 Asynchronous Reset

In an asynchronous reset, the test for reset occurs outside of the test for the clock edge.

    process (reset, clk) begin
      if (reset = '1') then
        q <= '0';
      elsif rising_edge(clk) then
        q <= d;
      end if;
    end process;

• All reset signals should be synchronous.
  reason If a reset occurs very close to a clock edge, some parts of the circuit might be reset in one clock cycle and some in the subsequent clock cycle. This can lead the circuit to be out of sync as it goes through the reset sequence, potentially causing erroneous internal state and output values.

1.13.5 Using a Data Signal as a Clock

    process begin
      wait until rising_edge(clk);
      count <= count + 1;
    end process;

    process begin
      wait until rising_edge( count(5) );
      b <= a;
    end process;

[Figure: counter clocked by clk whose bit count(5) drives the clock input of the flop from a to b]

• Data signals should be used only as data.
  reason All data assignments should be synchronized to a clock. This ensures that the timing analysis tool can determine the maximum clock speed accurately. Using a data signal as a clock can lead to unpredictable delays between different assignments, which makes it infeasible to do an accurate timing analysis.

1.13.6 Using a Clock Signal as Data

    process begin
      wait until rising_edge(clk);
      count <= count + 1;
    end process;

    b <= a and clk;

• Clock signals should be used only as clocks.
  reason Clock signals have two defined values in a clock cycle and transition in the middle of the clock cycle. At the register-transfer level, each signal has exactly one value in a clock cycle and signals transition between values only at the boundary between clock cycles.
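As a sketch of how the multiple-drivers example could be restructured, the reset code can be moved into the process that owns each signal, so that y and z each have exactly one driver and the reset stays synchronous. The entity name single_driver and the process labels are assumptions for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical wrapper entity so that the sketch is self-contained.
entity single_driver is
  port (
    clk, reset, a, b, c, d : in  std_logic;
    y, z                   : out std_logic
  );
end entity;

architecture main of single_driver is
begin
  -- y is driven only here: synchronous reset, then chip-enable on b.
  proc_y : process begin
    wait until rising_edge(clk);
    if reset = '1' then
      y <= '0';
    elsif b = '1' then
      y <= c;
    end if;
  end process;

  -- z is driven only here: synchronous reset, then mux on a.
  proc_z : process begin
    wait until rising_edge(clk);
    if reset = '1' then
      z <= '0';
    elsif a = '1' then
      z <= b and c;
    else
      z <= d;
    end if;
  end process;
end architecture;
```

Each signal is now the output of one flip-flop with a single source of data, which matches the single assignment rule.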

  39. 1.14 Bad VHDL Coding

This section lists some coding practices to avoid in VHDL unless you have a very good reason.

1.14.1 Tri-State Buffers and Signals

'Z' as a Signal Value

    process (sel, a0) begin
      b <= a0 when sel = '0'
           else 'Z';
    end process;

    process (sel, a1) begin
      b <= a1 when sel = '1'
           else 'Z';
    end process;

• Use multiplexers, not tri-state buffers.
  reason Multiplexers are more robust than tri-state buffers, because tri-state buffers rely on analog effects such as drive-strength and voltages that are between '0' and '1'. Multiplexers require more area than tri-state buffers, but for the size of most busses, the advantage in a more robust design is worth the cost in extra area.

Inout and Buffer Port Modes

    entity bad is
      port (
        io_bad  : inout std_logic;
        buf_bad : buffer std_logic
      );
    end entity;

• Use in or out, do not use inout or buffer.
  reason inout and buffer signals are tri-state.
  note If you have an output signal that you also want to read from, you might be tempted to declare the mode of the signal to be inout. A better solution is to create a new, internal, signal that you both read from and write to. Then, your output signal can just read from the internal signal.

1.14.2 Variables in Processes

    process
      variable bad : std_logic;
    begin
      wait until rising_edge(clk);
      bad := not a;
      d <= bad and b;
      e <= bad or c;
    end process;

• In a process, use signals; do not use variables.
  reason The intention of the creators of VHDL was for signals to be wires and variables to be just for simulation. Some synthesis tools allow some uses of variables, but when using variables, it is easy to create a design that works in simulation but not in real hardware. (section 1.9)
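A minimal sketch of the multiplexer alternative to the two tri-state processes above is shown here, assuming the same signal names sel, a0, a1, and b. The entity name mux_not_tristate is hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical wrapper entity so that the sketch is self-contained.
entity mux_not_tristate is
  port (
    sel, a0, a1 : in  std_logic;
    b           : out std_logic
  );
end entity;

architecture main of mux_not_tristate is
begin
  -- A single driver for b: a 2-to-1 multiplexer instead of two
  -- tri-state buffers sharing the same wire.
  b <= a0 when sel = '0' else a1;
end architecture;
```

This version also satisfies the single assignment rule, since b is assigned in only one place.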

  40. 1.14.3 Bits and Booleans as Signals

    signal bad1 : bit;
    signal bad2 : boolean;

• Use std_logic signals, do not use bit or Boolean signals.
  reason std_logic is the most commonly used signal type across synthesis tools and simulation tools.

Review: Synthesizable, Good, and Bad VHDL

For each code fragment below, answer whether it is synthesizable. If the code is synthesizable, answer whether it follows good coding practices for synthesizable hardware.

1.
    process (clk) begin
      if rising_edge(clk) then
        q <= a;
      else
        q <= b;
      end if;
    end process;

    Synth? Yes / No    Good? Yes / No

2.
    process (clk) begin
      if rising_edge(clk) then
        q1 <= d1;
      end if;
      if rising_edge(clk) then
        q2 <= d2;
      end if;
    end process;

    Synth? Yes / No    Good? Yes / No

3.
    process (a,b) begin
      if a = '1' then
        q <= b;
      end if;
    end process;

    Synth? Yes / No    Good? Yes / No

4.
    process (a, b) begin
      if a = '1' then
        q <= b;
      else
        q <= not q;
      end if;
    end process;

    Synth? Yes / No    Good? Yes / No

Chapter 2

Additional Features of VHDL

  41. 2.1 Literals

2.1.1 Numeric Literals

    Description   Type     Example 1    Example 2
    Decimal       Integer  17           1023
    Decimal       Real     17.0         1023.1
    Hexadecimal   Integer  16#FF#       16#2F190#
    Hexadecimal   Real     16#FF.F#     16#2F1.90#
    Binary        Integer  2#1101#      2#011101#
    Binary        Real     2#1101.111#  2#0111.01#
    Exponent      Integer  17E+3        2#111#E3
    Exponent      Real     17.1E+3      2#11.1#E3
    Underscore    Integer  123_45_67    16#FF_3A#

2.1.2 Bit-String Literals

    Binary        B"1101010"   B"1101_1010"
    Octal         O"3470100"   O"45_23"
    Hexadecimal   X"FF2300"    X"Ff3dbF_23"

2.2 Arrays and Vectors

2.2.1 Declarations

VHDL arrays have:
• direction (to or downto)
• upper bound
• lower bound

    signal a : std_logic_vector( 3 downto 0 );
    signal b : std_logic_vector( 0 to 3 );
    signal c : std_logic_vector( 1 to 4 );

Constant Arrays

To define a constant array:

    constant a : array( 0 to 3 ) of integer
      := ( 10, 17, -31, 23 );

Note: Array literals are called "aggregates" and are described in section 2.2.2.

    constant b : array( 0 to 3 ) of integer
      := ( 0 => 10, 1 => 17, 2 => -31, 3 => 23 );

    constant c : array( 0 to 3 ) of integer
      := ( 0 => 10, 1 => 17, others => 23 );

  42. 2.2.2 Indexing, Slicing, Concatenation, Aggregates

Operations

    Indexing an array to reference a single element      a(0)
    A slice or "discrete subrange" of an array           a( 3 downto 2)
    Concatenating an element onto an array,              '1' & a
      or concatenating two arrays                        b & a
    Array literals or "aggregates"                       ( '0', '0', '1' )
                                                         ( a(0), b(2), a(3) )
    Aggregate with positional indices                    ( 0=>'0', 2=>'X', 1=>'U' )
    Aggregate with "others" keyword                      ( 0=>'0', 3=>'1', others=>'X' )

Assignments

1. The ranges on both sides of the assignment must be the same.
2. The direction (downto or to) of each slice must match the direction of the signal declaration.
3. The direction of the target and expression may be different.

Assignments (cont'd)

Declarations

    a , b  : std_logic_vector(15 downto 0);
    ax, bx : std_logic_vector(0 to 15);

Legal code

    b (3 downto 0) <= a(15 downto 12);
    bx(0 to 3) <= a(15 downto 12);
    ( b(3) , b(4) ) <= a(13 downto 12);
    ( bx(4), b(4) ) <= a(13 downto 12);

Illegal code

    bx(0 to 3) <= a(12 to 15);
    -- slice dirs must be same as decl, fails for a

    c (3 downto 0) <= (a & b)( 3 downto 0);
    -- may not index an expression

    b(3) & b(2) <= a(12 to 13);
    -- & may not be used on lhs

2.3 Arithmetic

VHDL includes all of the common arithmetic operators and relations.

Use the VHDL arithmetic operators and let the synthesis tool choose the best implementation for you.

2.3.1 Arithmetic Packages

To do arithmetic with signals, use the numeric_std package. numeric_std supersedes earlier arithmetic packages, such as std_logic_arith.

Use only one arithmetic package, otherwise the different definitions will clash and you can get strange error messages.

We will describe arithmetic with the numeric_std package.

  43. 2.3.2 Arithmetic Types

Arithmetic may be done on three types of expressions:

    integers   Numeric values, such as 17
    unsigned   Unsigned vectors, such as signals defined as type unsigned(7 downto 0).
    signed     Signed vectors, such as signals defined as type signed(7 downto 0).

The types signed and unsigned are std_logic vectors on which you can do signed or unsigned arithmetic and all of the operations that are supported by std_logic_vectors.

2.3.3 Overloading of Arithmetic

The arithmetic operators +, -, and * are overloaded on signed vectors, unsigned vectors, and integers.

Declarations

    u1, u2, u3 : unsigned(7 downto 0);
    s1, s2, s3 : signed(7 downto 0);

    Target     Src1/2     Src2/1     Example
    unsigned   unsigned   unsigned   u3 <= u1 + u2;      OK
    unsigned   unsigned   integer    u3 <= u1 + 17;      OK
    signed     signed     signed     s3 <= s1 + s2;      OK
    signed     signed     integer    s3 <= s1 + (-17);   OK
    -          unsigned   signed     u3 <= u1 + s2;      Fail

2.3.4 Widths for Addition and Subtraction

• Sources may have different widths
• The target must be the same width as the widest source

Declarations

    w1, w2, w3 : unsigned(7 downto 0)  -- wide
    n1, n2, n3 : unsigned(3 downto 0)  -- narrow

    Target   Src1/2   Src2/1   Example
    wide     wide     wide     w3 <= w1 + w2;   OK
    wide     wide     narrow   w3 <= w1 + n2;   OK
    wide     wide     int      w3 <= w1 + 17;   OK
    narrow   narrow   narrow   n3 <= n1 + n2;   OK
    narrow   narrow   int      n3 <= n1 + 17;   OK
    narrow   wide     -        n3 <= w1 + n2;   Fail

These failures are caught at elaboration, which happens after typechecking.

Widths for Multiplication

• The sources may be different widths
• The width of the result must be the sum of the widths of the sources

Declarations

    v4a, v4b, v4c : unsigned(3 downto 0);
    v8            : unsigned(7 downto 0);
    v12           : unsigned(11 downto 0);

    Target    Src1/2   Src2/1   Example
    8-bits    4-bits   4-bits   v8  <= v4a * v4b;   OK
    12-bits   4-bits   8-bits   v12 <= v4a * v8;    OK
    4-bits    4-bits   4-bits   v4c <= v4a * v;     Fail
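The width rules for addition and multiplication can be tried out with numeric_std directly. The sketch below assumes a made-up entity width_demo. It also shows one common way to keep the carry of an addition, by widening both sources with resize before adding; that idiom is an addition here and is not taken from the slides.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity, for illustration only.
entity width_demo is
  port (
    w1, w2 : in  unsigned(7 downto 0);    -- wide
    n1     : in  unsigned(3 downto 0);    -- narrow
    sum    : out unsigned(7 downto 0);    -- same width as the widest source
    sum_c  : out unsigned(8 downto 0);    -- one extra bit to keep the carry
    prod   : out unsigned(11 downto 0)    -- 8 + 4 bits
  );
end entity;

architecture main of width_demo is
begin
  -- wide + narrow: the result is 8 bits, any carry out is dropped.
  sum   <= w1 + n1;

  -- Widen both sources to 9 bits first, so the carry lands in bit 8.
  sum_c <= resize(w1, 9) + resize(w2, 9);

  -- The product is as wide as the sum of the source widths.
  prod  <= w1 * n1;
end architecture;
```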

  44. 2.3.5 Overloading of Comparisons

• Comparisons are overloaded on arrays and integers.
• If both operands are arrays, both must be of the same type.

Declarations

    u1, u2 : unsigned(7 downto 0);
    s1, s2 : signed(7 downto 0);

    Src1/2     Src2/1     Example
    unsigned   unsigned   u1 >= u2   OK
    unsigned   integer    u1 >= 17   OK
    signed     signed     s1 >= s2   OK
    signed     integer    s1 >= 17   OK
    unsigned   signed     u1 >= s1   Fail

2.3.6 Widths for Comparisons

• Sources may have different widths

Declarations

    w1, w2 : unsigned(7 downto 0)  -- wide
    n1, n2 : unsigned(3 downto 0)  -- narrow

    Src1/2   Src2/1   Example
    wide     -        w1 >= n1   OK
    narrow   -        n1 >= w2   OK

2.3.7 Type Conversion

If you convert between two types of the same width, then no additional hardware will be generated.

Use: typecast a signal

    unsigned( val : std_logic_vector ) return unsigned;
    signed  ( val : std_logic_vector ) return signed;

Use: assign an integer literal to a signal

    to_unsigned( val : integer; width : natural ) return unsigned;
    to_signed  ( val : integer; width : natural ) return signed;

Use: use a signal as an index into an array

    to_integer( val : signed )   return integer;
    to_integer( val : unsigned ) return integer;

Examples of Conversions

Declarations

    u1, u2, u3    : unsigned(7 downto 0);
    sn1, sn2, sn3 : signed(7 downto 0);
    sw1, sw2, sw3 : signed(8 downto 0);

Examples

    OK    u3  <= to_unsigned( 17, 8 );
    OK    sn3 <= to_signed( 17, 8 );
    OK    sw3 <= signed( "0" & u1 );
    Bad   sn3 <= signed( u1 );
    OK    sw3 <= signed( "0" & u1 ) - signed( "0" & u2 );
    OK    sw3 <= signed( "0" & (u1 + u2) );
    Bad   sw3 <= signed( "0" & (u1 - u2) );

The Bad examples above will typecheck and elaborate without any errors, but they potentially will produce incorrect results.
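The conversion functions of section 2.3.7 can be combined in one place. The following is a minimal sketch with a made-up entity conv_demo and made-up port names. It zero-extends the unsigned sources before converting to signed, which is the pattern marked OK in the examples above.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity, for illustration only.
entity conv_demo is
  port (
    v      : in  std_logic_vector(7 downto 0);
    u1, u2 : in  unsigned(7 downto 0);
    k      : out unsigned(7 downto 0);
    diff   : out signed(8 downto 0);
    vsum   : out std_logic_vector(7 downto 0)
  );
end entity;

architecture main of conv_demo is
begin
  -- Integer literal to an 8-bit unsigned vector.
  k    <= to_unsigned(17, 8);

  -- Zero-extend each source to 9 bits before the typecast, so a large
  -- unsigned value is not misread as a negative number.
  diff <= signed('0' & u1) - signed('0' & u2);

  -- Typecasts between std_logic_vector and unsigned of the same width.
  vsum <= std_logic_vector(unsigned(v) + u1);
end architecture;
```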

  45. Resizing and Sign Extension

The function resize resizes vectors, performing sign extension if necessary, based upon the type of the argument. It is overloaded for different types of arguments.

    resize( v : std_logic_vector; width : natural ) return std_logic_vector;
    resize( u : unsigned;         width : natural ) return unsigned;
    resize( s : signed;           width : natural ) return signed;

Declarations

    un1, un2 : unsigned(3 downto 0);
    uw1, uw2 : unsigned(7 downto 0);
    sn1, sn2 : signed(3 downto 0);
    sw1, sw2 : signed(7 downto 0);

Examples

    uw1 <= resize( un1, 8 );   OK
    un1 <= resize( uw1, 4 );   OK
    sw1 <= resize( sn1, 8 );   OK
    sn1 <= resize( sw1, 4 );   OK
    sw1 <= resize( un1, 8 );   Fail
    uw1 <= resize( sn1, 8 );   Fail

Type Conversion and Array Indices

To use a signal as an index into an array, you must convert the signal into an integer using the function to_integer.

Declarations

    signal u : unsigned(3 downto 0);
    signal v : std_logic_vector(3 downto 0);
    signal a : std_logic_vector(15 downto 0);

Examples

    Ok     a( to_integer(u) )
    Ok     a( to_integer( unsigned(v) ) )
    Fail   v(u)
    Fail   a( unsigned(v) )

2.3.8 Shift and Rotate Operations

Shift and rotate operations are described with three-character acronyms:

    s (shift)   + l/r (left/right) + a/l (arithmetic/logical)   sla, sra, sll, srl
    ro (rotate) + l/r (left/right)                              rol, ror

The shift right arithmetic (sra) operation preserves the sign of the operand, by copying the most significant bit into lower bit positions.

The shift left arithmetic (sla) does the analogous operation, except that the least significant bit is copied.

    a sra 2   -- arithmetic shift of a by 2 bits

2.3.9 Arithmetic Optimizations

    Multiply by a constant power of two   wired shift logical left
    Multiply by a power of two            shift logical left
    Divide by a constant power of two     wired shift logical right
    Divide by a power of two              shift logical right

Question: How would you implement: z <= a * 3 ?
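Two of the points above lend themselves to a short sketch: indexing an array with a converted signal, and one possible answer to the multiply-by-three question using a wired shift plus an adder. The entity index_demo and its ports are made up for illustration, and the shift-and-add form is only one option; a synthesis tool may well find the same or a better structure on its own.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity, for illustration only.
entity index_demo is
  port (
    sel   : in  std_logic_vector(3 downto 0);
    a     : in  std_logic_vector(15 downto 0);
    x     : in  unsigned(7 downto 0);
    bit_o : out std_logic;
    x3    : out unsigned(9 downto 0)
  );
end entity;

architecture main of index_demo is
begin
  -- std_logic_vector -> unsigned -> integer, then use it as the index.
  bit_o <= a(to_integer(unsigned(sel)));

  -- 3*x = 2*x + x. Appending '0' is a wired shift left by one bit,
  -- so only the adder costs hardware. 10 bits hold 3 * 255 = 765.
  x3 <= resize(x & '0', 10) + resize(x, 10);
end architecture;
```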

  46. 2.4 Types

2.4.1 Enumerated Types

VHDL supports enumerated types:

    type color is (red, green, blue);

2.4.2 Defining New Array Types

When defining a new array type, the range may be left unconstrained:

    type color is (red, green, blue);
    type color_vector is array ( natural range <> ) of color;

We may then use the unconstrained array type as the basis for defining a constrained array subtype:

    subtype few_colors  is color_vector( 0 to 3 );
    subtype many_colors is color_vector( 0 to 1023 );

Note the use of subtype above. It is illegal to use type to define a constrained array in terms of an unconstrained array.

We can use type to define a constrained array directly:

    type few_colors is array ( 0 to 3 ) of color;

Chapter 3

Overview of FPGAs

3.1 Generic FPGA Hardware

• This section: generic FPGA with 4 inputs per lookup table.
• Many real FPGAs have more (e.g., 6) inputs per lookup table.
• Principles described here are applicable in general, even as details differ.

  47. CHAPTER 3. OVERVIEW OF FPGAS

3.1.1 Generic FPGA Cell

FPGA “Cell” = “Logic Element” (LE) in Altera
            = “Configurable Logic Block” (CLB) in Xilinx

“LUT” = “lookup table”
      = PLA (programmable logic array)

[Figure: generic FPGA cell, built from a configurable 4:1 lookup table (LUT), a configurable multiplexer, and a flip-flop with D, Q, CE, R, and S pins. Cell signals: carry_in, carry_out, comb_data_in, comb_data_out, flop_data_in, flop_data_out, ctrl_in.]

[Figures: three configurations of the cell: Connect Comb and Flop; Separate Comb and Flop; Flopped and Unflopped Outputs.]

  48. 3.1.2 Lookup Table

A 4:1 lookup table is usually implemented as a memory array with 16 1-bit elements.

[Tables: two example truth tables with a 4-bit address (d c b a) and 1-bit data (z), shown alongside the expressions z = (a AND b) and z = NOT a OR (b AND NOT c) OR (c AND NOT d).]

3.1.3 Interconnect for Generic FPGA

Note: In these pictures, the space between tightly grouped wires sometimes disappears, making a group of wires appear to be a single large wire.

[Figures: Local Connections; Local Connections (Zoom Out).]

General-Purpose Wires and Carry Chains

    General purpose interconnect      configurable, slow
    Carry chains and cascade chains   vertically adjacent cells, fast
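The description of a lookup table in section 3.1.2, a 16 x 1-bit memory addressed by four inputs, can be sketched behaviourally in VHDL. The entity lut4 and the generic init are made up for illustration; they are not a vendor primitive. The default value x"8888" is an assumed configuration that sets the output for every address whose two low bits (b and a) are both 1, that is, z = a AND b.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical behavioural model of a 4-input LUT, for illustration only.
entity lut4 is
  generic (
    -- Bit i of init is the output for address i (address = d c b a).
    -- x"8888" sets bits 3, 7, 11, and 15: the addresses where a = b = '1'.
    init : std_logic_vector(15 downto 0) := x"8888"
  );
  port (
    d, c, b, a : in  std_logic;
    z          : out std_logic
  );
end entity;

architecture main of lut4 is
  signal addr : unsigned(3 downto 0);
begin
  -- The four inputs form the 4-bit address into the 16-entry memory.
  addr <= d & c & b & a;
  z    <= init(to_integer(addr));
end architecture;
```

Changing init changes the function without changing the hardware, which is why any combinational function of up to four inputs fits in one LUT.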

  49. 3.1.4 Blocks of Cells for Generic FPGA

[Figure: two rows of blocks, each block a column of cells, with a path to connect cells in different rows.]

Each cell:
• LUT for any combinational function with up to four inputs and one output
• Carry-in and carry-out signals used only for arithmetic carries
• Flip-flop can be driven by LUT or separate input

Connecting Through Cells

Cells that are not used for computation can be used as “wires” to shorten the length of the path between cells.

3.1.5 Special Circuitry in FPGAs

Memory: Since the mid 1990s, almost all FPGAs have had special circuits for RAM and ROM. These special circuits are possible because many FPGAs are fabricated on the same processes as SRAM chips. So, the FPGAs simply contain small chunks of SRAM.

Microprocessors: In 2001, some high-end FPGAs had one or more hardwired microprocessors on the same chip as programmable hardware. In 2005, the Xilinx Virtex-II Pro had 4 Power PCs and enough programmable hardware to implement the first-generation Intel Pentium microprocessor.

Arithmetic Circuitry: In 2001, FPGAs began to have hardwired circuits for multipliers and adders. Using these resources can significantly improve both the area and performance of a design.

Input / Output: Some FPGAs include special circuits to increase the bandwidth of communication with the outside world.

3.2 Area Estimation for FPGAs

This section describes three methods to estimate the number of FPGA cells required to implement a circuit:

section 3.2.1  Rough estimate based simply upon the number of flip-flops and primary inputs that are in the fanin of each flip-flop or output.
section 3.2.2  A more accurate, and more complex, technique that uses a greedy algorithm to allocate as many gates as possible into the lookup table of each FPGA cell.
section 3.2.3  A technique to estimate the area for arithmetic circuits with registers.

  50. 3.2.1 Area for Circuit with one Target

This section gives a technique to estimate the number of FPGA cells required for a purely combinational circuit with one output.

Question: What is the maximum number of inputs for a function that can be implemented with one LUT?

Question: Number of inputs for two LUTs?

Question: Three LUTs?

Question: Four LUTs?

4:1 Mux in Two FPGA Cells

A 4:1 mux has 6 inputs, so it should fit into two FPGA cells.

[Figure: gate-level 4:1 mux with data inputs d0, d1, d2, d3, select inputs sel(0) and sel(1), and output z.]

But, there is no partitioning of the gates into two groups such that each group has at most 4 inputs and 1 output.

But, with some clever tricks, a 4:1 mux can be implemented in two FPGA cells:

[Figure: restructured 4:1 mux with inputs sel0, sel1, d0, d1, d2, d3, internal signals i, j, k, l, m, n, o, and output z.]

Single Target vs Multiple Targets

For a single target signal, this technique gives a lower bound on the number of LUTs needed.

For multiple target signals, this technique might be an overestimate, because a single LUT can be used in the logic for multiple target cells.

3.2.2 Algorithm to Allocate Gates to Cells

This section presents an algorithm to allocate gates to FPGA cells for circuits with:
• multiple outputs
• combinational gates
• flip-flops

The algorithm mimics what a synthesis tool does in transforming a netlist of generic gates into an FPGA:

    Technology map   Map groups of generic combinational gates into LUTs
    Placement        Assign each LUT and flip-flop to an FPGA cell

In addition to the above, synthesis tools do the step of routing: connecting the signals between FPGA cells.

Because we are working with general-purpose combinational gates, we cannot use the carry-in and carry-out signals with the LUTs.

  51. Overview of Algorithm

For each flip-flop and output: traverse backward through the fanin, gathering as much combinational circuitry as possible into the FPGA cell.

Stopping conditions:
• flip-flop
• more than four inputs. However, a circuit may have more than four signals as input at one point and then, further back in the fanin, collapse back to four or fewer signals.

Number of FPGA Cells (1)

Question: Map the circuit below onto generic FPGA cells.

[Figure: circuit with inputs a, b, c, d, internal signals e, f, g, h, i, and outputs x, y, z, followed by an extra copy of the same circuit.]

Number of FPGA Cells (2)

Question: Map the circuit below onto generic FPGA cells.

[Figure: circuit with inputs a, b, c, d, internal signals e, f, g, h, i, and outputs x, y, z, followed by an extra copy of the same circuit.]

Number of FPGA Cells (3)

In this question, the signal i becomes a new output.

Question: Map the circuit below onto generic FPGA cells. Do not perform any algebraic optimizations. Use NC (no connect) for any unused pins on the cells.

[Figure: circuit with inputs a, b, c, d and output z.]

  52. 3.2.3 Area for Arithmetic Circuits

For arithmetic circuits, we take into account inputs, outputs, carry-in, and carry-out signals.

• 1 lookup table can implement one 1-bit full-adder
• n lookup tables can implement one n-bit full-adder

[Figures: a 1-bit full-adder cell with inputs a, b, ci and outputs sum, co; a 4-bit adder built from a chain of cells with inputs a0..a3, b0..b3, ci and outputs sum0..sum3, co.]

Two-Bit Adder

Question: How many lookup tables are needed for a two-bit adder?

[Figure: two-bit adder with inputs ci, a0, b0, a1, b1 and outputs sum0, sum1, co.]

Adder with a Multiplexer

Question: How many lookup tables for an adder with a 2:1 mux on one input?

[Figure: adder with inputs a, b, c, sel, ci and outputs sum, co, followed by blank cells with pins d0, d1, d2, d3 (NC marks unused pins).]

Arithmetic VHDL Code

Question: How many cells are needed for each of the code fragments below? All signals are 8 bits.

    z <= a + b;

    z <= a + b + c;

    process begin
      wait until rising_edge(clk);
      z <= a + b + c;
    end process;

  53. Arithmetic VHDL Code (Cont’d)

    process begin
      wait until rising_edge(clk);
      a <= i_a; b <= i_b; c <= i_c;
      z <= a + b + c;
    end process;

    a <= i_a; b <= i_b; c <= i_c;
    process begin
      wait until rising_edge(clk);
      z <= a + b + c;
    end process;

    m <= a when sel='0' else b;
    process begin
      wait until rising_edge(clk);
      z <= m + c;
    end process;

Other Arithmetic Operations

    Code           Number of LUTs per bit
    z <= a + 1;
    z <= a = b;
    z <= a = 0;

Area Optimizations

Example: Area Optimization

  54. Chapter 4

State Machines

4.1 Notations

We will use a variety of notations to model our hardware:

    Pseudocode          For algorithms. Used early in the design process for sequential behaviour and high-level optimizations.
    Dataflow diagrams   Model the structure and behaviour of datapath-intensive circuits.
    State machines      A variation on the conventional bubble-and-arrow style state machines.
    VHDL code           For the real implementation.

4.2 Finite State Machines in VHDL

4.2.1 HDL Coding Styles for State Machines

Explicit: VHDL code contains a state signal. At most one wait statement per process.

    Explicit-Current: The state signal represents the current state of the machine and the signal is assigned its next value in a clocked process.

    Explicit-Current+Next: There is a signal for the current state and another signal for the next state. The next-state signal is assigned its value in a combinational process or concurrent statement and is dependent upon the current state and the inputs. The current-state signal is assigned its value in a clocked process and is just a flopped copy of the next-state signal. (“three-process” style)

Implicit: There is no explicit state signal. At least one process has multiple wait statements. Each wait statement corresponds to a single state. (Advanced topic not covered in this course.)

  55. CHAPTER 4. STATE MACHINES

4.2.2 State Encodings

Explicit state machines require a state signal. Before we can define a state signal, we must define values for the names of the states. For example, we might define S0 to be "000" and S1 to be "001". The value for the name of a state is called the “encoding” of the state. In hardware, each value is a bit-vector. There are a variety of common encodings for states: binary, one-hot, Gray, and thermometer.

We can either define the encoding ourselves, or let the synthesis tool choose the encoding for us. If we define the encoding, then the type for the states is a std_logic_vector. To let the synthesis tool choose the encoding, we create an enumerated type for the states, where each state is an element of the type. The synthesis tool then chooses a specific binary value for each state. Usually, the synthesis tool has heuristics to choose either a binary or one-hot encoding.

4.2.3 Traditional State-Machine Notation

This section reserved for your reading pleasure

4.2.4 Our State-Machine Notation

A simple extension to Mealy machines, allow both:
• combinational assignments   z = 0
• registered assignments      z' = 0

[Figure: two four-state machines (s0, s1, s2, s3), one with combinational assignments (z=1 on edge a, z=0 on the other edges) and one with the same assignments registered (z'=1, z'=0). Each has a waveform over clock cycles 0 to 6 with state sequence S0 S1 S3 S0 S2 S3 S0; with registered assignments the value of z appears one clock cycle later.]

4.2.5 Bounce Example

This section reserved for your reading pleasure

[Figure: bounce state machine with states s0, s1, s2. Combinational version: edge s0 to s1 on a with z=1, edge s0 to s2 on !a with z=0, edge s1 to s0 with z=0, edge s2 to s0 with z=1. Registered version: the same edges with z'=1, z'=0, z'=0, z'=1. Waveforms over clock cycles 0 to 4 with state sequence S0 S1 S0 S2 S0.]

Explicit-Current Coding Style
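The two ways of obtaining a state encoding described in section 4.2.2 can be sketched as declarations. The package name state_pkg and the constant names are made up for illustration, and the one-hot values are just one possible hand-chosen encoding.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical package, for illustration only.
package state_pkg is
  -- Option 1: enumerated type. The synthesis tool chooses the encoding.
  type state_ty is (S0, S1, S2);

  -- Option 2: we choose the encoding ourselves, so the state signal is a
  -- std_logic_vector and each state name is a constant. One-hot shown.
  subtype state_slv is std_logic_vector(2 downto 0);
  constant S0_OH : state_slv := "001";
  constant S1_OH : state_slv := "010";
  constant S2_OH : state_slv := "100";
end package;
```

With option 1 a state signal is declared as `signal state : state_ty;`; with option 2 it is declared as `signal state : state_slv;` and compared against the constants.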

  56. Explicit-Current Coding Style (cont’d)

State register (the same for combinational and registered assignments):

    process (clk) begin
      if rising_edge(clk) then
        case state is
          when S0 =>
            if a = '1' then
              state <= S1;
            else
              state <= S2;
            end if;
          when others =>
            state <= S0;
        end case;
      end if;
    end process;

Combinational Assignments (z=1, z=0):

    process (state, a) begin
      if (state = S0 and a = '1')
         or (state = S2)
      then
        z <= '1';
      else
        z <= '0';
      end if;
    end process;

Registered Assignments (z'=1, z'=0):

    process begin
      wait until rising_edge(clk);
      if (state = S0 and a = '1')
         or (state = S2)
      then
        z <= '1';
      else
        z <= '0';
      end if;
    end process;

Additional Coding Options

Combinational Assignments:

    process (clk) begin
      if rising_edge(clk) then
        case state is
          when S0 =>
            if a = '1' then
              state <= S1;
            else
              state <= S2;
            end if;
          when others =>
            state <= S0;
        end case;
      end if;
    end process;

    z <= '1' when (state = S0 and a = '1')
               or state = S2
         else '0';

Registered Assignments:

    process (clk) begin
      if rising_edge(clk) then
        case state is
          when S0 =>
            if a = '1' then
              z <= '1';
              state <= S1;
            else
              z <= '0';
              state <= S2;
            end if;
          when S1 =>
            z <= '0';
            state <= S0;
          when others =>
            z <= '1';
            state <= S0;
        end case;
      end if;
    end process;

Explicit-Current+Next

State register and next-state logic (the same for combinational and registered assignments):

    process (clk) begin
      if rising_edge(clk) then
        st <= next_st;
      end if;
    end process;

    next_st
      <= S1 when st = S0 and a = '1'
      else S2 when st = S0
      else S0;

Combinational Assignments:

    z <= '1' when (st = S0 and a = '1')
               or (st = S2)
         else '0';

Registered Assignments:

    process (clk) begin
      if rising_edge(clk) then
        if (st = S0 and a = '1')
           or (st = S2)
        then
          z <= '1';
        else
          z <= '0';
        end if;
      end if;
    end process;

Implicit

Registered Assignments:

    process begin
      wait until rising_edge(clk);    -- S0
      if a = '1' then
        z <= '1';
        wait until rising_edge(clk);  -- S1
        z <= '0';
      else
        z <= '0';
        wait until rising_edge(clk);  -- S2
        z <= '1';
      end if;
    end process;

Note: Implicit state machines do not support combinational assignments, because an implicit state machine is a clocked process and in a clocked process, all assignments are registered.

Note: Implicit state machines are an advanced topic and are not covered in ECE-327.

  57. 4.2.6 Registered Assignments

    Combinational assignments   Appear to happen instantaneously.
    Registered assignments      Clock-cycle boundary between when inputs are sampled and when target signal is driven.

VHDL and FSMs use different techniques to achieve the same behaviour.

Use a registered assignment based on the state to illustrate.

VHDL:

    process begin
      wait until re(clk);
      if state = S0 then
        z <= 1;
      else
        z <= 0;
      end if;
    end process;

FSM:

    S0:  z' = 1;
    S1:  z' = 0;

[Figure: waveform of clk, state, and z around the clock edge at 50ns, where state changes from S1 to S0.]

    FSM    Assignment is executed before the clock edge.
           Delay driving the output until after clock edge.
    VHDL   Assignment is executed after the clock edge.
           Sample the old (visible) value of registered inputs from before the clock edge.

Registered Assignments in State Machines

[Figure: waveform with clock edges at 10ns, 30ns, 50ns, 70ns, 90ns; state sequence S0 S1 S0 S1 S0; z values 1 0 1 0. A zoomed view at 50ns shows the state assignment (state' = S0 or state' = S1) and the z assignment (z' = 1 or z' = 0).]

Registered Assignments in VHDL

    p_z1 : process begin
      wait until re(clk);
      if state = S0 then
        z <= 1;
      else
        z <= 0;
      end if;
    end process;

    p_z2 : process begin
      if re(clk) then
        if state = S0 then
          z <= 1;
        else
          z <= 0;
        end if;
      end if;
    end process;

[Figure: RTL simulation waveform with clock edges at 10ns, 30ns, 50ns, 70ns, 90ns; state sequence S0 S1 S0 S1 S0; z values 1 0 1 0. Delta-cycle simulation at 50ns: state is updated by proc_state at +1 δ and z is updated by p_z1, p_z2 at +2 δ.]

4.2.7 More Notation

4.2.7.1 Extension: Transient States

[Figure: two equivalent state machines from S0 to S1 (on a, z = 2) and S2 (on !a, z = 3). In the first, y = 1 is written on both edges; in the second, S0 leads to a transient state with y = 1, which then branches on a and !a.]

With transient-state, write y = 1 just once.

  58. Transient States with Assignments

• Syntactically, registered assignments may appear before combinational assignments.
• Semantically, the effect of the registered assignments occurs after the combinational assignments.

[Figure: state machine from S0 with y' = 1 before the branch to S1 (on a, z = 2) and S2 (on !a, z = 3), with a waveform of state (S0 S1 S0 S2), a, y, and z.]

4.2.7.2 Assignments within States

Assignments may appear within states.

1. If all outgoing edges have the same assignment, then the assignment may be moved into the state.

The three state machines below all have the same behaviour.

[Figure: three equivalent machines over states s1, s2, s3 with the assignments x = 1, w = 0, y = 2, y = 5, and z' = 3 placed on edges or inside states.]

Assignments within States (Cont’d)

2. If all incoming edges have the same registered assignment, then the assignment may be transformed into a combinational assignment and moved into the state.

The three state machines below all have the same behaviour.

[Figure: three equivalent machines over states s1, s2, s3 with the assignments w = 0, x = 1, y = 2, y = 5, and z' = 3.]

Assignments within States (Cont’d)

As another example to illustrate moving assignments between edges and states, the three machines below have the same behaviour:

[Figure: three equivalent machines over states s1 to s5 with the assignments x' = 1, x = 1, and z' = 3.]

  59. 4.2.7.3 Conditional Expressions

The FSMs below have the same behaviour:

[Figure: S0 to S1 with two edges (a: z = b; !a: z = c), and S0 to S1 with one edge labelled "if a then z = b else z = c".]

The FSMs below have the same behaviour:

[Figure: S0 to S1 with two edges (a: z = b; !a: no assignment), and S0 to S1 with one edge labelled "if a then z = b".]

4.2.7.4 Default Values

Combinational

[Figures: pairs of state machines over S0, S1, S2 with edges a and !a. Each pair shows a machine "With default values" (default: z=0, with only some edges assigning z=1 or z=2) next to the "Equivalent FSM without default values".]

Default Values (Cont’d)

Intuition: a signal takes its default value in any state-to-state transition where it is not explicitly assigned a value.

More Examples

[Figures: further pairs of "With default values" and "Equivalent FSM without default values" machines, using default: z=0 and the assignments z=1 and z=2.]
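One way to express a combinational default value in VHDL is to assign the default at the top of a combinational process and then override it where needed; within one execution of a process, the last assignment to a signal wins. This is a sketch under assumptions: the entity default_demo, the three-state machine, and the choice that z = 1 only on the S0-to-S1 edge are all made up for illustration and are not one of the machines drawn on the slides.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity, for illustration only.
entity default_demo is
  port (
    clk, reset, a : in  std_logic;
    z             : out std_logic
  );
end entity;

architecture main of default_demo is
  type state_ty is (S0, S1, S2);
  signal state : state_ty;
begin
  -- State register: S0 goes to S1 on a, else to S2; both return to S0.
  process (clk) begin
    if rising_edge(clk) then
      if reset = '1' then
        state <= S0;
      else
        case state is
          when S0 =>
            if a = '1' then state <= S1; else state <= S2; end if;
          when others =>
            state <= S0;
        end case;
      end if;
    end if;
  end process;

  -- Combinational output with "default: z = 0".
  process (state, a) begin
    z <= '0';                          -- default value
    if state = S0 and a = '1' then
      z <= '1';                        -- explicit assignment on one edge
    end if;
  end process;
end architecture;
```

Because z is assigned on every path through the process, no latch is inferred.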

  60. Default Values: Registers

The semantics define that if a registered variable is not assigned a value in a clock cycle, then it holds its previous value.

    Default expression   Behaviour when not assigned a value
    none                 z holds its previous value.
    z' = a               z is assigned a.
    z' = '-'             z is unconstrained.

Default Value: Registered Assignment

[Figure: "With default values" machine (default: z' = 99) over states S0, S1, S2 with the assignments z'=b and z'=a, next to the "Equivalent FSM without default values".]

    cycle   0    1    2    3    4    5    6    7
    state   S0   S1   S2   S0   S1   S2   S0
    a       0    1    2    3    4    5    6    7
    b       10   11   12   13   14   15   16   17
    z

Default Value: Unconstrained Register

[Figure: "With default values" machine (default: z' = '-') over states S0, S1, S2 with the assignment z'=a, next to the "Optimized FSM" and the "Simplified FSM".]

Questions to Answer and Ponder

Question: Why do combinational variables not need a “don’t care” default statement?

Question: Why do register variables need a “don’t care” default statement?

  61. 4.2.8 Semantic and Syntax Rules

Inputs, Combinational, Registered

There are three categories of variables in FSMs. Each category has its own rules for how and when the variables are updated.

    Inputs          Values are updated every clock cycle.
    Combinational   If a variable is not assigned a value in a clock cycle, then its value is unconstrained.
    Registered      If a variable is not assigned a value in a clock cycle, then it holds its previous value.

If there is any ambiguity about whether a signal is an input, then it should be declared as an input.

Multiple Assignments to Same Signal

For a sequence of transitions within the same clock cycle, only the last assignment to each signal is visible.

[Figure: from S0, a transition with y = 1; z' = 3 followed in the same clock cycle by a transition with y = 2; z' = 5 into S1. In the waveform (state S0 then S1), y is 2 and z becomes 5.]

Summary of Semantic Rules

1. Signals take on the value of the last assignment that is executed in a clock cycle.
2. Combinational assignments become visible immediately.
3. Registered assignments become visible in the next clock cycle.
4. If a combinational signal is not assigned to in a given clock cycle, then the value of that signal is unconstrained (in other words, arbitrary, non-deterministic, or don’t-care).

Syntax Rules

Our state machines are designed to match closely with VHDL code and hardware. The state machine notation is equivalent to synthesizable hardware that satisfies our rules for good coding practices, with the addition that we also support non-determinism. Non-determinism is not synthesizable, but is often useful in specifications for state machines.

1. For a given signal, it must be that either all assignments are combinational or all assignments are registered.
   It is illegal to have both combinational and registered assignments to the same signal. The reason is that this will lead to unsynthesizable code, because a signal cannot be both combinational and registered.
2. Within a clock cycle, a combinational signal must not be written to after it has been read. Violating this rule will lead to combinational loops.

  62. Syntax Rules (cont’d)

3. Completeness of transitions: The conditions on the outgoing edges from a state must cover all possibilities. That is, from a given state, it must always be possible to make a transition. This includes a self-looping transition back to the state itself.

Additional guidelines:

1. Within a clock cycle, a combinational signal should be assigned to before it is read.
   Violating this guideline will lead to non-deterministic behaviour, because the value of a combinational signal is unconstrained in a clock cycle until it has been written to.

Deterministic vs Non-Deterministic

    deterministic       Exactly one outgoing transition is enabled (condition is true)
    non-deterministic   Multiple outgoing transitions are enabled; machine randomly chooses which transition to take

• Our state machines may be non-deterministic.
• Non-determinism happens when multiple outgoing transitions are enabled at the same time.
• Non-determinism is sometimes useful in specifications and high-level models.
• Real hardware is deterministic (unless you are building a quantum computer)
• For real hardware, your transitions must be mutually exclusive.

4.2.9 Reset

All circuits should have a reset signal that puts the circuit back into a good initial state. However, not all flip flops within the circuit need to be reset. In a circuit that has a datapath and a state machine, the state machine will probably need to be reset, but the datapath may not need to be reset.

This section reserved for your reading pleasure

Reset with Explicit-Current

Without reset:

    process (clk) begin
      if rising_edge(clk) then
        case state is
          when S0 =>
            if a = '1' then
              z <= '1';
              state <= S1;
            else
              z <= '0';
              state <= S2;
            end if;
          when S1 =>
            z <= '0';
            state <= S0;
          when others =>
            z <= '1';
            state <= S0;
        end case;
      end if;
    end process;

With reset:

    process (clk) begin
      if rising_edge(clk) then
        if reset = '1' then
          state <= S0;
        else
          case state is
            when S0 =>
              if a = '1' then
                z <= '1';
                state <= S1;
              else
                z <= '0';
                state <= S2;
              end if;
            when S1 =>
              z <= '0';
              state <= S0;
            when others =>
              z <= '1';
              state <= S0;
          end case;
        end if;
      end if;
    end process;

  63. Reset with Explicit-Current+Next

Without Reset:

    process (clk) begin
      if rising_edge(clk) then
        st <= next_st;
      end if;
    end process;

    next_st
      <= S1 when st = S0 and a = '1'
      else S2 when st = S0
      else S0;

    z <= '1' when (st = S0 and a = '1')
               or (st = S2)
         else '0';

With Reset:

    process (clk) begin
      if rising_edge(clk) then
        if reset = '1' then
          st <= S0;
        else
          st <= next_st;
        end if;
      end if;
    end process;

    next_st
      <= S1 when st = S0 and a = '1'
      else S2 when st = S0
      else S0;

    z <= '1' when (st = S0 and a = '1')
               or (st = S2)
         else '0';

4.3 LeBlanc FSM Design Example

4.3.1 State Machine and VHDL

[Figure: LeBlanc state machine. S0 goes to S1 on !a and to S2 on a; S1 has z' = b - c; S2 has z' = b + c; S1 and S2 go to S3; S3 goes back to S0.]

The blanks below are left to be filled in:

    type state_ty is (S0, S1, S2, S3);
    signal state : state_ty;

    process begin
      wait until rising_edge(clk);
      if reset = '1' then
        state <= S0;
      else
        case state is
          when S0 =>
            if a = '0' then
              state <= S1;
            else
              state <= S2;
            end if;
          when ______ =>
            state <= S3;
          when S3 =>
            state <= S0;
        end case;
      end if;
    end process;

    process begin
      wait until rising_edge(clk);
      if ______
        z <= b - c;
      ______
        z <= b + c;
      end if;
    end process;

Review: Introduction to State Machines

Do the state-machine fragments below have the same behaviour?

[Figure: two fragments from S0 to S1 with edges !a and a, using the assignments b = 1, b = 0, c' = b, and c = b.]

  64. Datapath + Control

[Figure: control block (Ctrl) holding the state register, with next-state logic and datapath control outputs (dp ctrl), connected to the Datapath.]

• Control circuitry
  – Compute next state (sequencing between states)
  – Drive control inputs to datapath
• From datapath to control:
  – Usually 1-bit signals
  – Outputs of comparators
  – External inputs
  – etc.
• From control to datapath:
  – Multiplexer select lines
  – Chip-enables for registers
  – Operations for multifunction datapath components.

Hardware

[Figure: LeBlanc hardware. The control block (state register with reset, next-state logic, datapath control) drives a datapath with inputs a, b, c and a register (D, Q, CE) producing z.]

4.3.2 State Encodings

With 7 states

    state   Binary   One-Hot
    0       000      1000000
    1       001      0100000
    2       010      0010000
    3       011      0001000
    4       100      0000100
    5       101      0000010
    6       110      0000001

LeBlanc in Binary

[Figure: LeBlanc state machine with default: z' = '-'. S0 goes to S1 on !a and to S2 on a; S1 has z' = b - c; S2 has z' = b + c; both go to S3.]

The blanks below are left to be filled in:

    ______type state_ty is
    signal state : state_ty;

    S0 :
    S1 :
    S2 :
    S3 :

    process begin
      wait until rising_edge(clk);
      if reset = '1' then
        state <= S0
      else

    process begin
      wait until rising_edge(clk);
      if
        z <= b - c;
        z <= b + c;
      end if;
    end process;

  65. LeBlanc in Optimized Binary

State encodings affect the amount of circuitry needed to:
• test conditions that drive the control signals for the datapath.
• choose the next state

Define a custom encoding to simplify the circuitry needed to recognize the condition that the system is in either S1 or S2.

[Figure: LeBlanc state machine with default: z' = '-'. S0 goes to S1 on !a and to S2 on a; S1 has z' = b - c; S2 has z' = b + c; both go to S3.]

The blanks below are left to be filled in:

    signal state :

    S0 :
    S1 :
    S2 :
    S3 :

    process begin
      wait until rising_edge(clk);
      if reset = '1' then
        state <= S0
      else

    process begin
      wait until rising_edge(clk);
      if
        z <= b - c;
        z <= b + c;
      end if;
    end process;

Optimized Binary LeBlanc in Hardware

[Figure: blank hardware diagram with signals reset, state, a, b, c, and z.]

One-Hot LeBlanc

[Figure: LeBlanc state machine with default: z' = '-', as above.]

The blanks below are left to be filled in:

    signal state :

    S0 :
    S1 :
    S2 :
    S3 :

    process begin
      wait until rising_edge(clk);
      if reset = '1' then
        state <=
      else

    process begin
      wait until rising_edge(clk);
      if
        z <= b - c;
        z <= b + c;
      end if;
    end process;

  66. One-Hot LeBlanc in Hardware

[Figure: LeBlanc state machine with default: z' = '-' (S0 to S1 on !a with z' = b - c, S0 to S2 on a with z' = b + c, then S3), next to a blank hardware diagram with signals reset, a, b, c, and z.]

4.4 Parcels

• “Parcel” = basic unit of data in a system
• Examples

    System           Parcel
    Microprocessor   Instruction
    Car factory      Car

• A parcel flows through a system
• A parcel may be composed of multiple components

    Parcel        Components
    Instruction   Opcode, operands, result
    Car           Doors, windows, engine, etc.

4.4.1 Bubbles and Throughput

• Between each pair of parcels is a sequence of zero or more bubbles

    parcel α, bubbles, parcel β, bubbles, parcel γ

  Bubble: invalid or garbage data that must be ignored
• Each system has a requirement for minimum number of bubbles between parcels
• Throughput: number of parcels per clock cycle

Example: parcels α, β, γ, δ, ε with 2 bubbles between each pair.

    throughput = 1 parcel / 3 clock cycles
               = 1/3 parcels per clock cycle

Example: parcel α, 2 bubbles, parcel β, 4 bubbles, parcel γ, 3 bubbles, parcel δ (12 clock cycles from α to δ).

    throughput = 3 parcels / 12 clock cycles
               = 1/4 parcels per clock cycle

Maximum and Actual Throughput

Maximum Throughput: The maximum rate of parcels per cycle (minimum number of bubbles) at which the system will work correctly.

    usually: max throughput = 1/(minimum number of bubbles + 1)

Actual Throughput: The actual rate at which the environment sends parcels to the system.

Actual throughput must be less-than-or-equal-to maximum throughput.

Actual number of bubbles must be greater-than-or-equal-to minimum number of bubbles.
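As a worked check of the two throughput examples and the maximum-throughput rule above, the arithmetic can be written out explicitly. The symbol b_min for the minimum number of bubbles is introduced here only for this sketch.

```latex
\[
\text{max throughput} = \frac{1}{b_{\min} + 1},
\qquad
b_{\min} = 2 \;\Rightarrow\; \text{max throughput} = \frac{1}{3}\ \text{parcels per clock cycle}
\]
\[
\text{actual throughput}
  = \frac{3\ \text{parcels}}{3 + (2 + 4 + 3)\ \text{clock cycles}}
  = \frac{3}{12}
  = \frac{1}{4}\ \text{parcels per clock cycle}
\]
```

If the system in the second example also requires at least 2 bubbles, then 1/4 is below the maximum of 1/3 and every gap (2, 4, and 3 bubbles) meets the minimum, so both conditions above are satisfied.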

  67. Max Tput: Pipelining and Superscalar
      Question: Label each of the arrows and dots below with one of: Unpipelined, Pipelined, Fully-pipelined, or Superscalar.
      [Figure: maximum-throughput axis marked at 0, 1/latency, and 1.]
      As an advanced topic, some systems with both combinational inputs and outputs use an area optimization that reduces the maximum throughput of an unpipelined system to 1/(latency+1).
      Actual Throughput: Constant and Variable
      Two categories of actual throughput:
      Constant Throughput: Always the same number of bubbles between parcels. Often the actual number of bubbles is the minimum number of bubbles. Choose actual throughput = maximum throughput.
        [Example: α β γ δ ε with 2 bubbles between each pair.]
      Variable Throughput: The number of bubbles changes over time. Usually the number of bubbles is unpredictable. The actual number of bubbles must be at least as great as the minimum required.
        [Example: α β γ δ separated by 2, 4, and 3 bubbles.]
      FSMs, Latency, and Tput
      Question: What are the latency and maximum throughput of the FSM below?
      [FSM: S0 goes to S1 on !a (p' = b - c) and to S2 on a (p' = b + c); S3: z' = p + c. Blank waveform over clock cycles 0 to 9 for a, b, c, p, z.]
      Answer: Latency ____  Throughput ____
      4.4.2 Parcel Schedule
      Actual Throughput and Parcel Schedule
      To reduce confusion about the meaning of “throughput”, we will use:
      • “throughput” means “maximum possible throughput”
      • “parcel schedule” means “actual throughput”
      • “as soon as possible (ASAP) parcel schedule” means actual throughput is constant and is the maximum possible
      • “unpredictable number of bubbles” means actual throughput is variable

  68. Parcel Schedule and FSM Patterns
      [Figure: two FSM patterns, “ASAP parcels” and “Unpredictable number of bubbles”. In both, the “trunk” is derived from the computation for one parcel (c = i_a + i_b ... o_z = p + q) and the outer loop is derived from the parcel schedule. With bubbles, S0 branches on bubble versus parcel.]
      State Encodings and Parcel Schedule
      • ASAP parcels: one-hot, binary, or custom
      • Bubbles: valid bits
      4.4.3 Valid Bits
      When the parcel schedule is an unpredictable number of bubbles, we need a mechanism to distinguish between a parcel and a bubble.
      Most common solution: valid bit protocol.
      [Waveform: i_valid, i_data (α β γ), o_valid, o_data (α β γ).]
      One-Hot State Encoding
      Waveform for one-hot: [i_data (α β γ δ), state(0), state(1), state(2), o_data.]
      Hardware implementation of one-hot: [figure, driven by reset.]
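      A minimal sketch of the valid-bit protocol, assuming a hypothetical entity name `valid_chain`, a generic `latency`, and an active-high synchronous `reset` (none of these names come from the slides). One valid bit per clock cycle of the computation travels alongside the data, so `o_valid` is '1' exactly when a parcel, rather than a bubble, reaches the output.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical illustration: entity name and generic are assumptions.
entity valid_chain is
  generic (
    latency : positive := 2     -- clock cycles from input to output
  );
  port (
    clk     : in  std_logic;
    reset   : in  std_logic;
    i_valid : in  std_logic;    -- '1' = parcel, '0' = bubble
    o_valid : out std_logic
  );
end entity;

architecture main of valid_chain is
  -- v(0) is combinational; v(1 to latency) are flip-flops.
  signal v : std_logic_vector(0 to latency);
begin
  v(0) <= i_valid;

  process (clk) begin
    if rising_edge(clk) then
      if reset = '1' then
        -- After reset the system holds only bubbles.
        v(1 to latency) <= (others => '0');
      else
        -- Move every valid bit one stage closer to the output.
        v(1 to latency) <= v(0 to latency - 1);
      end if;
    end if;
  end process;

  o_valid <= v(latency);
end architecture;
```

      The individual bits `v(k)` can double as the state encoding: `v(k) = '1'` means a parcel is currently in clock cycle k of its computation, which is how the later examples use them to drive multiplexer selects and chip enables.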

  69. Valid-Bit State Encoding
      Waveform for valid bits: [i_valid, i_data (α β γ), v(0), v(1), v(2), v(3), o_valid, o_data.]
      Hardware implementation of valid bits: [figure with reset, i_valid, o_valid.]
      4.5 LeBlanc with Bubbles
      LeBlanc with a parcel schedule of unpredictable number of bubbles.
      [FSM: S0 loops on !i_v and proceeds on i_v; then S1 on !a (z' = b - c) or S2 on a (z' = b + c); then S3.]
      VHDL Code (fill in the blanks):
          process begin
            wait until rising_edge(clk);
            if reset = '1' then
              state <= ____
            else
              ____
            end if;
          end process;
          process begin
            wait until rising_edge(clk);
            if ____
              z <= b - c;
            ____
              z <= b + c;
            end if;
          end process;
      4.6 Pseudocode
      We use pseudocode to describe multi-step computation (e.g., algorithms).
      Declarations: We must declare “special” variables.
          Inputs     Value might change in each clock cycle
          Interpcl   (section 4.7) Used to communicate between parcels
          Outputs
      • If-then-else
      • While loop
      • For loop
      • Repeat-until loop
      • Assignments
      • Expressions: arithmetic, logical, arrays, etc.
      Example:
          input: a, b;
          output: z;
          p = a + b;
          for i in 0 to 3 {
            p = p + b;
          }
          z = p;

  70. Pseudocode Semantics
      • Identical to conventional software semantics
      • Executed sequentially: target is updated when the assignment is executed.
      • All assignments are instantaneous: no reg vs comb.
      • Variables hold value until assigned a new value.
      • No notion of time or clock cycles.
      Core vs System
      • Pseudocode describes the core of a computation. It does not show the parcel schedule.
      • But, with a finite sequence of parcels, the pseudocode may show more than 1 parcel.
      • FSM for core does not show i_valid and o_valid.
      • FSM for system (including parcel schedule) does show i_valid and o_valid if needed (i.e., if the parcel schedule is unpredictable number of bubbles).
      4.7 Interparcel Variables and Loops
      4.7.1 Introduction to Looping LeBlanc
      Two new concepts:
      • Inter-parcel variables
      • Outer loop around “core”
      Inter-parcel variables are used to communicate data between parcels.
      Until now, all of our variables have been intra-parcel: used within a single parcel.
          All intra-parcel vars:                   z = a + b + c
          “Total” is an inter-parcel variable:     Total = Total + a + b
      4.7.2 Pseudo-Code
      We add variable declarations to distinguish inputs, outputs, and interparcel variables.
      Below, “T” stands for “total”.
      Simple:
          inputs a, b, c;
          outputs z;
          if a then {
            z = b + c;
          } else {
            z = b - c;
          }
      Inter-parcel var T:
          inputs b, c;
          outputs z;
          interpcl T;
          if a then {
            T = T + b + c;
          } else {
            T = T + b - c;
          }
      Loop and inter-parcel var:
          inputs b, c;
          outputs z;
          interpcl T;
          T = 0;
          for i in 0 to 127 {
            if a then {
              T = T + b + c;
            } else {
              T = T + b - c;
            }
          }
          z = T;
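      In hardware, an inter-parcel variable becomes a register that keeps its value across parcels and is updated only when a parcel is present. The sketch below illustrates this for the slide's `Total = Total + a + b` example. The entity name `running_total`, the 8-bit width, the `i_valid` input, and the synchronous `reset` are assumptions made for the illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical sketch: names and widths are assumptions.
entity running_total is
  port (
    clk, reset : in  std_logic;
    i_valid    : in  std_logic;            -- '1' = parcel, '0' = bubble
    a, b       : in  unsigned(7 downto 0);
    total      : out unsigned(7 downto 0)
  );
end entity;

architecture main of running_total is
  -- Inter-parcel variable: written by one parcel, read by the next.
  signal total_r : unsigned(7 downto 0);
begin
  process (clk) begin
    if rising_edge(clk) then
      if reset = '1' then
        total_r <= (others => '0');
      elsif i_valid = '1' then
        -- Total = Total + a + b (wraps modulo 256 at this width)
        total_r <= total_r + a + b;
      end if;
      -- On a bubble the register simply holds its value.
    end if;
  end process;

  total <= total_r;
end architecture;
```

      An intra-parcel variable, by contrast, needs no enable or reset for correctness: its old value is never read by a later parcel, so garbage written during a bubble is harmless.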

  71. 4.7.3 State Machine
      Design Patterns: [figure, ASAP parcels versus unpredictable number of bubbles.]
      State Machine (drawn once for ASAP parcels and once for unpredictable number of bubbles):
      [FSM: initially Total' = 0, i' = 0. Branch on !a to S1 (Total' = Total + b - c) or on a to S2 (Total' = Total + b + c). S3: i' = i + 1. S4: loop back while i < 128; when i ≥ 128, z = Total.]
      4.7.4 VHDL Code for Loop and Bubbles
          -- valid bits
          v(0) <= i_v;
          process begin
            wait until rising_edge(clk);
            if reset = '1' then
              v(1 to 4) <= (others => '0');
            else
              ____                          -- left blank on the slide
            end if;
          end process;
          -- total
          process begin
            wait until rising_edge(clk);
            if reset = '1' then
              total <= (others => '0');
            elsif v(4) = '1' and i >= 128 then
              total <= (others => '0');
            elsif v(1) = '1' then
              total <= total + b - c;
            elsif v(2) = '1' then
              total <= total + b + c;
            end if;
          end process;
          -- loop counter
          process begin
            wait until rising_edge(clk);
            if reset = '1' then
              i <= (others => '0');
            elsif v(4) = '1' and i >= 128 then
              i <= (others => '0');
            elsif v(3) = '1' then
              i <= i + 1;
            end if;
          end process;
          -- outputs
          o_valid <= '1' when v(4) = '1' and i >= 128 else '0';
          z <= total;

  72. 4.8 Memory Arrays and RTL Design
      4.8.1 Memory Operations
      Read of Memory
      [Hardware: memory M with ports WE, A, DI, DO and clk. FSM and behaviour waveform: with we low and address αa on a, the data M(αa) appears on do.]
      Write to Memory
      [Hardware: same memory. Behaviour waveform: with we high, address αa on a, and data αd on di, the location M(αa) is updated.]
      Dual-Port Memory
      [Hardware: memory M with ports WE, A0, DI0, DO0, A1, DO1 and clk. Behaviour waveform: addresses αa on a0 and βa on a1, data αd on di0; outputs M(αa) on do0 and M(βa) on do1.]
      4.8.2 Memory Arrays in VHDL
          library ieee;
          use ieee.std_logic_1164.all;
          use ieee.numeric_std.all;
          entity mem is
            generic (
              data_width : natural := 8;
              addr_width : natural := 7
            );
            port (
              clk    : in  std_logic;
              wr_en  : in  std_logic;                                 -- write enable
              addr   : in  unsigned(addr_width - 1 downto 0);         -- address
              i_data : in  std_logic_vector(data_width - 1 downto 0); -- input data
              o_data : out std_logic_vector(data_width - 1 downto 0)  -- output data
            );
          end mem;
          architecture main of mem is
            type mem_type is array (2**addr_width - 1 downto 0) of
              std_logic_vector(data_width - 1 downto 0);
            signal mem : mem_type;
          begin
            process (clk)
            begin
              if rising_edge(clk) then
                if wr_en = '1' then
                  mem(to_integer(addr)) <= i_data;
                end if;
                o_data <= mem(to_integer(addr));
              end if;
            end process;
          end main;

  73. 4.8.3 Using Memory
      Pseudocode:
          M[i] = a;
          p = M[i+1];
      [FSM: figure.]
      VHDL (fill in the blanks):
          u_mem : entity work.mem
            port map (
              clk    => clk,
              wr_en  => ____,
              addr   => ____,
              i_data => ____,
              o_data => ____
            );
          mem_wr_en <= '1' when ____ else '0';
          mem_addr  <= i when ____ else i + 1;
      4.8.3.1 Writing from Multiple Vars
      FSM: S1: M'[i] = a. S2: M'[i+1] = b.
      [Hardware: ctrl drives WE and selects the address (i or i + 1) and the DI source (a or b).]
          u_mem : entity work.mem
            port map (
              clk    => clk,
              wr_en  => mem_wr_en,
              addr   => mem_addr,
              i_data => mem_i_data,
              o_data => p
            );
          mem_wr_en  <= '1' when state = S1 or state = S2
                        else '0';
          mem_addr   <= i when state = S1
                        else i + 1;
          mem_i_data <= a when state = S1
                        else b;
      4.8.3.2 Reading from Memory to Multiple Variables
      Pseudocode:
          p = M[i]
          q = a
          p = b
          q = M[i+1]
      FSM: S1: p' = M[i], q' = a. S2: p' = b, q' = M[i+1].
      Question: How should we connect memory to p and q?
      [Hardware: ctrl, memory M with WE, A (i or i + 1), DI, DO, and registers p and q.]
      Multivar Reading (cont'd)
      FSM: S1: mem_o_data' = M[i]. S2: mem_o_data' = M[i+1], p = mem_o_data, q = a. S3: p = b, q = mem_o_data.
          u_mem : entity work.mem
            port map (
              clk    => clk,
              wr_en  => mem_wr_en,
              addr   => mem_addr,
              i_data => mem_i_data,
              o_data => mem_o_data
            );
          mem_wr_en <= '0';
          mem_addr  <= i when state = S1
                       else i + 1;
          p <= mem_o_data when state = S2
               else b;
          q <= a when state = S2
               else mem_o_data;

  74. 270 CHAPTER 4. STATE MACHINES 4.8.3 Using Memory 272 4.8.3.3 Example: Maximum Value Seen so FSM #1 FSM #2 i’=0 i’=0 Far S0 S0 max’ = M[i] Design an FSM that iterates through a memory array, replacing each value with i ≥ 128 i < 128 the maximum value seen so far. i’=i+1 Initial value of M 4 3 2 6 7 3 5 S1 i ≥ 128 Example execution: i < 128 b’ = M[i] Final value of M S2 max < b max ≥ b S2 max’=b M’[i]=max max < b max ≥ b max’=b M’[i]=max Pseudocode #1 Pseudocode #2 4.8.4 Build Larger Memory from Slices i = 0 i = 0 max = M[i] This section reserved for your reading pleasure while i < 128 { i = i + 1 while i < 128 { b = M[i] if max < b { max = b } else { if max < b { M[i] = max max = b } } else { } M[i] = max } } 271 CHAPTER 4. STATE MACHINES 4.8.4 Build Larger Memory from Slices 273

  75. 4.8.5 Memory Arrays in High-Level Models
      This section reserved for your reading pleasure
      Chapter 5 Dataflow Diagrams
      5.1 Dataflow Diagrams
      5.1.1 Dataflow Diagrams Overview
      • Dataflow diagrams are data-dependency graphs where the computation is divided into clock cycles.
      • Purpose:
        – Provide a disciplined approach for designing datapath-centric circuits
        – Guide the design from algorithm, through high-level models, and finally to register transfer level code for the datapath and control circuitry.
        – Estimate area and performance
        – Make tradeoffs between different design options
      • Background
        – Based on techniques from high-level synthesis tools
        – Some similarity between high-level synthesis and software compilation
        – Each dataflow diagram corresponds to a basic block in software compiler terminology.
      Data-Dependency Graphs and Dataflow Diagrams
      Models for z = a + b + c + d + e + f
      [Figure: a data-dependency graph and a dataflow diagram, each taking inputs a b c d e f through a chain of adders to output z.]

  76. [Figure: annotated dataflow diagram for inputs a b c d e f and output z.]
      • Unconnected signal tails are inputs
      • Horizontal lines mark clock cycle boundaries
      • Signals crossing clock boundaries are flip-flops
      • Blocks in clock cycles are datapath components
      • Unconnected signal heads are outputs
      Latency
      Definition, Latency: Number of clock cycles from inputs to outputs.
      • A combinational circuit has latency of zero.
      • A single register has a latency of one.
      • A chain of n registers has a latency of n.
      [Exercise: two adder diagrams. Latency = ____  Latency = ____]
      5.1.2 Dataflow Diagram Execution
      [Figure: dataflow diagram (inputs a b c d e f, intermediate signals x1 to x5, output z, clock cycles 0 to 5) with a blank execution waveform over clock cycles 0 to 6 for clk, a, x1, x2, x3, x4, x5, z.]
      5.1.3 Dataflow Diagrams, Hardware, and Behaviour
      Primary Input
      [Figure: dataflow diagram, hardware, and behaviour waveform (clk, i, x) for a primary input i driving signal x.]

  77. Register Signal
      [Figure: dataflow diagram, hardware, and behaviour waveform (clk, i1, i2, r1, r2) for inputs i1 and i2 captured in registers r1 and r2.]
      Reuse a Component
      [Figure: dataflow diagram, hardware, and behaviour waveform (clk, i1, i2, r1, x, o1) for one adder reused in successive clock cycles.]
      Combinational-Component Output
      [Figure: dataflow diagram, hardware, and behaviour waveform (clk, i1, i2, x) for x = i1 + i2.]
      5.1.4 Performance Estimation
      Performance Equations
          Performance ∝ 1 / TimeExec
          TimeExec = Latency × ClockPeriod
      Performance of Dataflow Diagrams
      • Latency: count horizontal lines in diagram
      • Min clock period (max clock speed) limited by longest path in a clock cycle
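      As a worked illustration of the performance equations, consider two hypothetical designs. The latencies and clock periods below are assumed values chosen only for this example; they do not come from the slides.

```latex
\begin{align*}
\text{TimeExec} &= \text{Latency} \times \text{ClockPeriod} \\[4pt]
\text{Design A (assumed):}\quad \text{TimeExec}_A &= 5 \times 10\,\text{ns} = 50\,\text{ns} \\
\text{Design B (assumed):}\quad \text{TimeExec}_B &= 3 \times 15\,\text{ns} = 45\,\text{ns} \\[4pt]
\frac{\text{Performance}_B}{\text{Performance}_A}
  &= \frac{\text{TimeExec}_A}{\text{TimeExec}_B} = \frac{50}{45} \approx 1.11
\end{align*}
```

      Design B packs more logic into each clock cycle, so its clock is slower, but it needs fewer cycles. Neither latency nor clock period alone decides which design is faster; only their product does.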

  78. 5.1.5 Area Estimation
      • Maximum number of blocks in a clock cycle is total number of that component that are needed
      • Maximum number of signals that cross a cycle boundary is total number of registers that are needed
      • Maximum number of unconnected signal tails in a clock cycle is total number of inputs that are needed
      • Maximum number of unconnected signal heads in a clock cycle is total number of outputs that are needed
      • These estimates are just approximations. They do not take into account:
        – Area and delay of control circuitry
        – Multiplexers on registers and datapath components
        – Relative area and delay of different components
        – Technology-specific features, constraints, and costs
      • These estimates give lower bounds.
      • Other constraints or design goals might force you to use more components. Examples:
        – Decreasing latency ⇒ larger area
        – Constraint on max number of registers ⇒ more datapath components
      Area Estimation (cont'd)
      Implementation-technology factors, such as the relative size of registers, multiplexers, and datapath components, might force you to make tradeoffs that increase the number of datapath components to decrease the overall area of the circuit.
      • With some FPGA chips, a 2:1 multiplexer has the same area as an adder.
      • With some FPGA chips, a 2:1 multiplexer can be combined with an adder into one FPGA cell per bit.
      • In FPGAs, registers are usually “free”, in that the area consumed by a circuit is limited by the amount of combinational logic, not the number of flip-flops.
      5.1.6 Design Analysis
      [Exercise: dataflow diagram with inputs a b c d e f, a chain of adders, and output z. Fill in: num inputs, num outputs, num registers, num adders, min clock period, latency.]
      Design Analysis 2
      [Exercise: a second dataflow diagram over a b c d e f with the same blanks: num inputs, num outputs, num registers, num adders, min clock period, latency.]

  79. Design Analysis 2 (Cont'd)
      [Figure: dataflow diagram (inputs a b c d e f, signals x1 to x5, output z) with a blank waveform over clock cycles 0 to 6.]
      Review: Dataflow Diagrams
      For each of the diagrams below, calculate the latency, minimum clock period, and minimum number of adders required.
      [Exercise table with rows Latency, Clock period, Adders and one column per diagram.]
      Design Analysis 3
      [Exercise: dataflow diagram over a b c d e f with two adders in some clock cycles. Fill in: num inputs, num outputs, num registers, num adders, min clock period, latency.]
      5.2 Design Example: Hnatyshyn DFD
      5.2.1 Requirements
      • Functional requirement:
        – Compute the following formula: z = a + b + c
      • Performance requirements:
        – Max clock period: flop plus (1 add)
        – Max latency: 2
      • Cost requirements
        – Maximum of two adders
        – Unlimited registers
        – Maximum of three inputs and one output
        – Maximum of 5000 student-minutes of design effort
      • Combinational inputs, registered outputs
      • Parcels arrive as-soon-as-possible (ASAP)

  80. 5.2.2 Data-Dependency Graph
      Requirements and algorithm: z = a + b + c
      Create a data-dependency graph for the algorithm.
      [Figure: data-dependency graph with inputs a b c and output z.]
      5.2.3 Initial Dataflow Diagram
      Schedule operations into clock cycles.
      We start with our initial (sub-optimal) design.
      [Figure: dataflow diagram with inputs a b c and output z.]
      Area and performance analysis (fill in): latency, clock period, inputs, outputs, registers, adders.
      • Best-case analysis for a theoretical design
      • No guarantee that we will achieve best-case (optimal) design
      • Design process: systematic method to try to come close to optimal design
      • Start with sub-optimal, but obviously correct, design
      • Series of optimizations to improve area and speed while avoiding bugs
      5.2.4 Area Optimization
      [Figure: dataflow diagram over clock cycles 0 to 2 with inputs a and b in cycle 0 and output z. Fill in: latency, clock period, inputs, outputs, registers, adders.]
      5.2.5 Assign Names to Registered Signals
      Before we can write VHDL code for our dataflow diagram, we must assign a name to each internal registered value.
      Optionally, we may assign names to combinational values.
      [Figure: dataflow diagram over clock cycles 0 to 2 with inputs a, b in cycle 0, input c in cycle 1, and output z.]

  81. Behaviour and Analysis
      [Figure: dataflow diagram (a, b in cycle 0; x1; c in cycle 1; x2; z in cycle 2) with a blank waveform over clock cycles 0 to 5 for a, b, c, x1, x2, z. Fill in: latency, clock period, inputs, outputs, registers, adders.]
      Design Analysis
                      Current   Optimum
          Inputs         3         2
          Registers      2         1
          Adders         2         1
          Outputs        1         1
      Use ASAP Parcel Schedule
      [FSM: default z = x2. S1: x1' = a + b. S2: x2' = x1 + c.]
      Question: When to start parcel β?
      [Waveform over clock cycles 0 to 8: rows a, b, a1; c, x1, a2; x2, z; state, with parcel α filled in.]
      Question: What is the maximum throughput that this system supports?
      5.2.6 Allocation
      Allocation is the area optimization of mapping a large number of objects in the current design to a smaller number of objects.
      • Example: allocate both x_i registers to the same register
      • Similar to register allocation in software
      • This design is so simple that allocation is trivial. For real designs, finding the best allocation is very difficult. There are many different heuristics for how to do allocation.
      • We will allocate inputs, outputs, registers, and datapath components.
      • We will work clock-cycle by clock-cycle.
      • Annotate dataflow diagram and fill in cells in I/O schedule and control table.
      [Blank tables. I/O Schedule: columns clock cycle, i1, i2, o1; rows 0, 1, 2. Control Table: columns clock cycle, a1 (src1, src2), r1 (ce, d), o1; rows 0, 1, const.]
      Allocate Clock Cycle 0: Inputs and Datapath
      [Same dataflow diagram, I/O schedule, and control table, to be filled in for clock cycle 0.]

  82. Allocate Clock Cycle 0: Regs
      [Annotated dataflow diagram, I/O schedule, and control table after allocating registers in clock cycle 0: inputs a and b arrive on i1 and i2, adder a1 computes x1, and x1 is stored in r1.]
      Allocate Clock-Cycle 1: Inputs and Datapath
      [Same tables after allocating clock cycle 1: input c arrives on i2 and adder a1 takes r1 and i2.]
      Allocate Clock-Cycle 1: Regs
      [Same tables after allocating the register in clock cycle 1: x2 is stored in r1.]
      Allocate Output
      With registered outputs, each output port must be connected directly to a register.
      [Same tables after allocating the output: z is driven from r1 on o1 in clock cycle 2.]

  83. Behaviour post Allocation
      [Figure: allocated dataflow diagram with a waveform over clock cycles 0 to 2 for i1, i2, a1, r1, o1 carrying parcel α.]
      Control Table for Explicit State Machine
      Transform control table:
      • Label rows by state
      • Add next-state column
      • Identify “don't-care” values
      [Two versions of the control table, one labeled by clock cycle and one labeled by state, each with columns a1 (src1, src2), r1 (ce, d), o1; the state version adds a next-state column.]
      5.2.7 State Machine
      • Done with datapath design and optimization
      • Now build the control circuitry
      Control-circuit optimizations:
      • Choose state encoding
      • Design state machine
      • Design control circuitry that drives datapath
        – Multiplexer select lines
        – Chip enables
        – Operation selection
      Find Constants
      If all of the cells in a column have the same value, then that column can be reduced to a constant.
                   a1            r1
          state    src1  src2    ce   d     o1    next state
          S0       i1    i2      1    a1          S1
          S1       r1    i2      1    a1          S0
          const                             r1

  84. Control Table, State Machine, Hardware
      Control table for entire system:
                   a1            r1
          state    src1  src2    ce   d     o1    next state
          S0       i1    i2      1    a1          S1
          S1       r1    i2      1    a1          S0
          const          i2      1    a1    r1
      State machine for entire system: [figure, S0 and S1 alternating.]
      Hardware for entire system: [figure: Ctrl block holding state and next state, driving ctrl into the datapath (dp) with i1, i2, r1, o1.]
      5.2.8 VHDL Implementation
          architecture main of hnatyshyn is
            signal r1, a1, a1_src1 : unsigned(7 downto 0);
            type state_ty is (S0, S1);
            signal state : state_ty;
          begin
            ------------------------------------------
            -- control
            process (clk) begin
              if rising_edge(clk) then
                if reset = '1' then
                  state <= S0;
                else
                  case state is
                    when S0 => state <= S1;
                    when S1 => state <= S0;
                  end case;
                end if;
              end if;
            end process;
            a1_src1 <= i1 when state = S0 else r1;
            ------------------------------------------
            -- registers
            process (clk) begin
              if rising_edge(clk) then
                r1 <= a1;
              end if;
            end process;
            ------------------------------------------
            -- datapath
            a1 <= a1_src1 + i2;
            o1 <= r1;
            ------------------------------------------
          end architecture;
      VHDL Implementation #2
      • One-hot encoding for state
      • Define constants for S0, S1
      • Replace state = S0 with state(0) = '1'.
      [Control table repeated, with the state and next-state columns left blank for the one-hot codes.]

  85.     architecture main of hnatyshyn is
            signal r1, a1, a1_src1 : unsigned(7 downto 0);
            subtype state_ty is std_logic_vector(1 downto 0);
            constant S0 : state_ty := "01";
            constant S1 : state_ty := "10";
            signal state : state_ty;
          begin
            ------------------------------------------
            -- control
            process (clk) begin
              if rising_edge(clk) then
                if reset = '1' then
                  state <= S0;
                else
                  -- rotate the one-hot bit left by one position
                  state <= state rol 1;
                end if;
              end if;
            end process;
            a1_src1 <= i1 when state(0) = '1' else r1;
            ------------------------------------------
            -- registers
            process (clk) begin
              if rising_edge(clk) then
                r1 <= a1;
              end if;
            end process;
            ------------------------------------------
            -- datapath
            a1 <= a1_src1 + i2;
            o1 <= r1;
            ------------------------------------------
          end architecture;
      5.3 Design Example: Hnatyshyn with Bubbles
      • Section 5.2: Hnatyshyn with ASAP parcels
      • This section: Hnatyshyn with unpredictable number of bubbles
      • Key feature: valid bits for control circuitry
      5.3.1 Adding Support for Bubbles
      • No change to dataflow diagram (dataflow diagrams are independent of parcel schedule)
      • Add i_valid and o_valid to denote whether input or output is parcel or bubble
      • Add idle state to state machine for when there is not a parcel in the system

  86. Add Valid Bits
      [Waveform over clock cycles 0 to 12: i_valid, i1, i2, a1, r1, o1, o_valid, state, and valid bits v0, v1, v2 for parcels α, β, γ, beside an FSM with states S0, S1, S2 annotated with v1 and v2.]
      Behaviour
      [Waveform over clock cycles 0 to 9: i_valid, i1, i2, a1, r1, o1, o_valid for parcels α, β, γ.]
      Use Valid Bits as Control
      [Figure: Ctrl block with inputs reset and i_valid and output o_valid, driving ctrl into the datapath (dp) with i1, i2, r1, o1.]
      5.3.2 Control Table with Valid Bits
      Initial Table
      • Label the rows of the control table by valid bits, instead of by states.
      • Do not include a row for the last valid bit.
        – We have registered outputs
        – Therefore, no control decisions are made in the last clock cycle
        – Therefore, the last valid bit does not affect the datapath
      [Annotated dataflow diagram (a, b, x1, c, x2, z over clock cycles 0 to 2) with a blank control table: columns a1 (src1, src2), r1 (ce, d), o1.]

  87. Constants
      [Annotated dataflow diagram (a, b, x1, c, x2, z over clock cycles 0 to 2).]
                        a1            r1
          valid bits    src1  src2    ce   d     o1
          v(0)          i1    i2      1    a1
          v(1)          r1    i2      1    a1
          const                                  r1
      5.3.3 VHDL
      The only difference between the VHDL code for Hnatyshyn with bubbles and Hnatyshyn with ASAP parcels is the control circuitry. The datapath is exactly the same for both designs.
          entity hnatyshyn_bubble is
            port (
              clk     : in  std_logic;
              reset   : in  std_logic;   -- used by the control process; missing from the slide's port list
              i_valid : in  std_logic;
              i1, i2  : in  unsigned(7 downto 0);
              o_valid : out std_logic;
              o1      : out unsigned(7 downto 0)
            );
          end entity;
          architecture main of hnatyshyn_bubble is
            signal r1, a1, a1_src1 : unsigned(7 downto 0);
            signal v : std_logic_vector(0 to 2);
          begin
            ------------------------------------------
            -- control
            v(0) <= i_valid;
            process (clk) begin
              if rising_edge(clk) then
                if reset = '1' then
                  v(1 to 2) <= (others => '0');
                else
                  v(1 to 2) <= v(0 to 1);
                end if;
              end if;
            end process;
            a1_src1 <= i1 when v(0) = '1' else r1;
            ------------------------------------------
            -- registers
            process (clk) begin
              if rising_edge(clk) then
                r1 <= a1;
              end if;
            end process;
            ------------------------------------------
            -- datapath
            a1 <= a1_src1 + i2;
            o_valid <= v(2);
            o1 <= r1;
            ------------------------------------------
          end architecture;

  88. 5.4 Inter-Parcel Variables: Hnatyshyn with Internal State
      Inter-parcel variables are used to communicate data between parcels.
          Previous systems:                        z = a + b + c
          “Sum” is an inter-parcel variable:       Sum = Sum + a + b
      Intra-parcel variables: the type of variables and signals that we have used until now
      • Also called “temporary values”
      • Stores intermediate data from clock cycle to clock cycle
      • Each value is read only by the same parcel that wrote the value
      Inter-parcel variables: the new type of variables and signals
      • Also called “programmer-visible”, “internal-state”, or “visible-state” variables
      • Stores data that is used to communicate between parcels
      • Each value is written by one parcel and then read by other parcels
      5.4.1 Requirements and Goals
      • Functional requirements: compute the following formula: Sum = Sum + a + b
      • Performance requirement:
        – Max clock period: flop plus (1 add)
        – Max latency: 3
      • Cost requirements
        – Maximum of two adders
        – Unlimited registers
        – Maximum of three inputs and one output
        – Maximum of 5000 student-minutes of design effort
      • Combinational inputs
      • Registered outputs
      • Parcel schedule is “Unpredictable number of bubbles”
      Bad DFD
      Question: What is wrong with the dataflow diagram below?
      [Figure: dataflow diagram over clock cycles 0 to 2 with inputs a, b on i1, i2, adder a1 in cycles 0 and 1, register r1, and Sum read in cycle 1 and written in cycle 2.]
      5.4.2 Dataflow Diagrams and Waveforms
      [Figure: three copies of the a, b, Sum dataflow diagram over clock cycles 0 to 2 (adder a1, registers r1 and r2), with a waveform over clock cycles 0 to 6 for i1, i2, a1, r1, r2 carrying parcels α, β, γ.]

  89. States and Bubbles
      Question: Label the states on the DFD and execution. Complete the FSM.
      [DFD: inputs a, b on i1, i2, plus Sum; adder a1; registers r1, r2. Execution waveform over clock cycles 0 to 10 for i1, i2, a1, r1, r2, state, and valid bits v0, v1, v2, carrying parcels α, β, γ, δ. FSM with states S0, S1, S2.]
      5.4.3 Control Tables
      Initial Control Table
      [Figure: annotated dataflow diagram with a blank control table.]
      Reset
      [Waveform over clock cycles 0 to 10: reset, i1, i2, a1, r1, r2, state for parcels α, β, γ, δ, ε, shown once without and once with the effect of reset (r1 forced to 0).]
      VHDL Code for Control Circuitry
      The VHDL code for just the control circuitry is below. In section 5.4.4, we show the complete code.
          a1_src1 <= i1 when v(0) = '1'
                     else r1;
          a1_src2 <= i2 when v(0) = '1'
                     else r2;
          r1_ce   <= v(0) or v(1);

  90. 5.4.4 VHDL Implementation
          -- valid bits
          v(0) <= i_valid;
          process begin
            wait until rising_edge(clk);
            if reset = '1' then
              v(1 to 2) <= (others => '0');
            else
              v(1 to 2) <= v(0 to 1);
            end if;
          end process;
          -- a1
          a1_src1 <= i1 when v(0) = '1'
                     else r1;
          a1_src2 <= i2 when v(0) = '1'
                     else r2;
          a1 <= a1_src1 + a1_src2;
          -- r1
          process begin
            wait until rising_edge(clk);
            if reset = '1' then
              r1 <= (others => '0');
            elsif v(0) = '1' or v(1) = '1' then
              r1 <= a1;
            end if;
          end process;
          -- r2
          process begin
            wait until rising_edge(clk);
            r2 <= r1;
          end process;
          -- outputs
          o_valid <= v(2);
          o1 <= r1;
      5.4.5 Summary of Bubbles and Inter-Parcel Variables
      Options for state encoding (table to fill in):
                                              System has inter-pcl vars
                                              No          Yes
          ASAP      State encoding            ____        ____
                    FSM has idle state        ____        ____
                    Ctrl table has idle row   ____        ____
          Bubbles   State encoding            ____        ____
                    FSM has idle state        ____        ____
                    Ctrl table has idle row   ____        ____
      5.5 Design Example: Vanier
      Design Process
       1. Requirements
       2. Algorithm
       3. Data-dependency graph
       4. Schedule
       5. Allocate I/O ports, datapath components, registers
       6. Separate datapath and control
       7. Connect datapath, add muxes
       8. Block-diagram of datapath
       9. Control-table for state machine
      10. Don't-care assignments
      11. VHDL code #1 (core)
      12. Parcel schedule
      13. State encoding
      14. VHDL code #2 (system)

  91. 5.5.1 Requirements
      • Functional requirements: compute the following formula: z = (a × d) + c + (d × b) + b for sixteen-bit unsigned data.
      • Performance requirement:
        – Max clock period: flop plus (2 adds or 1 multiply)
        – Max latency: 4
      • Cost requirements
        – Maximum of two adders
        – Maximum of two multipliers
        – Unlimited registers
        – Maximum of three inputs and one output
        – Maximum of 5000 student-minutes of design effort
      • Combinational inputs
      • Registered outputs
      • ASAP parcel schedule
      5.5.2 Algorithm
      z = (a × d) + c + (d × b) + b
      Create a data-dependency graph for the algorithm.
      [Figure: blank data-dependency graph with inputs a d b c, adders, and output z.]
      5.5.3 Initial Dataflow Diagram
      Schedule operations into clock cycles.
      Requirement for max clock period: max(2 add, mul) + flop.
      [Figure: dataflow diagram with inputs a d b c, adders, and output z.]
      Area and performance analysis (fill in): latency, clock period, inputs, outputs, registers, adders, multipliers.
      5.5.4 Reschedule to Meet Requirements
      Requirement: no more than 3 inputs.
      [Figure: rescheduled dataflow diagrams with inputs a d b c and output z.]

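  92. Fix Clock Period Violation
      [Figure: two dataflow diagrams with inputs d, b, c in the first clock cycle and a in a later one, rearranged so that no clock cycle exceeds the clock-period requirement; output z.]
      Question: Should we move the second addition from clock-cycle 2 up to 1?
      Analysis
      [Figure: dataflow diagram over clock cycles 0 to 3 with inputs d, b in cycle 0 and a, c in cycle 1; output z. Fill in: latency, clock period, inputs, outputs, registers, adders, multipliers.]
      5.5.5 Optimization: Reduce Inputs
      Assume that inputs are much more expensive than other resources.
      [Figure: dataflow diagram with inputs a d b c beside a version with inputs d, b, c first and a later; output z.]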

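  93. 5.5.6 Allocation
      [Figure: dataflow diagram over clock cycles 0 to 3 (inputs d, b in cycle 0; a, c in cycle 1; output z) beside a blank allocation table. Columns: I/O (i1, i2, o1), m1 (sc1, sc2), a1 (sc1, sc2), a2 (sc1, sc2), r1 (ce, d), r2 (ce, d), r3 (ce, d), o1. Rows: clock cycles 0 to 3, const, needs mux, needs ce.]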

  94. Alternative Allocation
      [Figure: annotated dataflow diagram over clock cycles 0 to 3 (d, b on i1, i2 in cycle 0; a, c on i1, i2 in cycle 1; multiplier m1, adder a1, registers r1, r2, r3; output z) beside a partially filled allocation table with the same columns as in 5.5.6. Summary rows: needs mux 5/9, needs ce 0/3.]

  95. 5.5.7 Explicit State Machine
      From Clock Cycles to States
      • ASAP parcel schedule
      • Latency is 3, therefore 3 states (S0, S1, S2)
      • State machine iterates through states, with S2 looping back to S0.
      State Encoding
      Use a one-hot state encoding.
      5.5.8 VHDL #1: Explicit
          architecture main of vanier is
            signal r1, r2, r3,
                   a1, a1_src1, a1_src2, a2,
                   m1, m1_src2
                   : unsigned(15 downto 0);
            type state_ty is (S0, S1, S2);
            signal state : state_ty;
          begin
            ----------------------
            -- control
            process (clk) begin
              if rising_edge(clk) then
                if reset = '1' then
                  state <= S0;
                else
                  case state is
                    when S0 => state <= S1;
                    when S1 => state <= S2;
                    when S2 => state <= S0;
                  end case;
                end if;
              end if;
            end process;
            ----------------------
            -- datapath
            m1_src2 <= i2 when state = S0
                       else r1;
            m1 <= i1(7 downto 0) * m1_src2(7 downto 0);
            a1_src1 <= r3 when state = S1
                       else r2;
            a1_src2 <= i1 when state = S1
                       else a2;
            a1 <= a1_src1 + a1_src2;
            a2 <= r1 + r3;
            ----------------------
            -- registers
            process (clk) begin
              if rising_edge(clk) then
                if state = S0 then
                  r1 <= i1;
                else
                  r1 <= r2;
                end if;
              end if;
            end process;
            process (clk) begin
              if rising_edge(clk) then
                r2 <= m1;
              end if;
            end process;
            process (clk) begin
              if rising_edge(clk) then
                if state = S0 then
                  r3 <= i2;
                else
                  r3 <= a1;
                end if;
              end if;
            end process;
            ----------------------
            o1 <= r3;
          end architecture;

  96. Don't Care: Encoding-Based Instantiations
      For this simple example, the encoding-based instantiations are trivial.
      [Table: allocation table with rows for states S0, S1, S2 and columns I/O, m1, a1, a2, r1, r2, r3, with don't-care entries filled in. Summary: needs mux 5/9, needs ce 0/3.]
      5.5.9 VHDL #2
          architecture main of vanier is
            signal r1, r2, r3,
                   a1, a1_src1, a1_src2, a2,
                   m1, m1_src2
                   : unsigned(15 downto 0);
            -- one-hot state encoding
            subtype state_ty is
              std_logic_vector(2 downto 0);
            constant s0 : state_ty := "001";
            constant s1 : state_ty := "010";
            constant s2 : state_ty := "100";
            signal state : state_ty;
          begin
            ----------------------
            -- control
            process (clk) begin
              if rising_edge(clk) then
                if reset = '1' then
                  state <= S0;
                else
                  -- rotate 1-bit to left
                  state <= state(1 downto 0) & state(2);
                end if;
              end if;
            end process;
            ----------------------
            -- datapath
            m1_src2 <= i2 when state = S0
                       else r1;
            m1 <= i1(7 downto 0) * m1_src2(7 downto 0);
            a1_src1 <= r3 when state = S1
                       else r2;
            -- S1: a1 = r3 + i2 (b + c), matching the allocation table
            a1_src2 <= i2 when state = S1
                       else a2;
            a1 <= a1_src1 + a1_src2;
            a2 <= r1 + r3;
            ----------------------
            -- registers
            process (clk) begin
              if rising_edge(clk) then
                -- state(0) = '1' exactly when the state is S0
                if state(0) = '1' then
                  r1 <= i1;
                else
                  r1 <= r2;
                end if;
              end if;
            end process;
            process (clk) begin
              if rising_edge(clk) then
                r2 <= m1;
              end if;
            end process;
            process (clk) begin
              if rising_edge(clk) then
                if state(0) = '1' then
                  r3 <= i2;
                else
                  r3 <= a1;
                end if;
              end if;
            end process;
            ----------------------
            o1 <= r3;
          end architecture;
      5.5.10 Notes and Observations
      Our functional requirement was written as:
          z = (a × d) + (d × b) + b + c
      If we had been given the functional requirement:
          z = (a × d) + b + (d × b) + c
      we could have used the same design, because the two equations are equivalent.
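The equivalence claimed in the notes can be checked in one step. The worked derivation here is a sketch: z' is a name introduced only for the alternative formulation, and the align* environment assumes the amsmath package.

```latex
\begin{align*}
z' &= (a \times d) + b + (d \times b) + c \\
   &= (a \times d) + (d \times b) + b + c
      && \text{addition is commutative: swap } b \text{ and } (d \times b) \\
   &= z
\end{align*}
```

The same step holds for the 16-bit unsigned signals in the VHDL, because addition modulo 2^16 is also commutative and associative. The two formulations therefore agree even when intermediate sums wrap around.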

  97. Data Dependency Graphs: Clean vs Ugly
      The naive data dependency graph for the second formulation is much messier than the data dependency graph for the original formulation:
          Alternative: (a × d) + b + (d × b) + c
          Original:    (a × d) + (d × b) + b + c
      [Figure: data dependency graphs for the two formulations, each with inputs a, d, b, c and output z.]
      5.6 Memory Operations in Dataflow Diagrams
      Memory Read
      [Figure: hardware (memory M with ports WE, A, DI, DO and clk, driven by an FSM through signals we and a, with output do), the corresponding dataflow diagram, and a behaviour waveform over clk, we, a, and do in which address αa yields M(αa) on do.]
      Memory Write
      [Figure: hardware (memory M with ports WE, A, DI, DO and clk, driven by an FSM through signals we, a, and di), the corresponding dataflow diagram, and a behaviour waveform over clk, we, a, di, M(αa), and do in which data αd is written to location M(αa).]
      [Table: Read and Write, described by Inputs, Output, Operation, and Location.]
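The memory read and write behaviour can be illustrated with a small single-port memory. This is a minimal sketch, not the memory used in the course: the entity name mem, the generics, the registered read output (do appears one clock edge after the address is presented), and the read-before-write behaviour are all assumptions. Only the signal names we, a, di, and do come from the slide.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical single-port memory; names and sizes are illustrative.
entity mem is
  generic (
    addr_width : natural := 4;
    data_width : natural := 16
  );
  port (
    clk : in  std_logic;
    we  : in  std_logic;                                  -- write enable
    a   : in  unsigned(addr_width - 1 downto 0);          -- address
    di  : in  std_logic_vector(data_width - 1 downto 0);  -- data in
    do  : out std_logic_vector(data_width - 1 downto 0)   -- data out
  );
end entity mem;

architecture main of mem is
  type mem_ty is array (0 to 2**addr_width - 1)
    of std_logic_vector(data_width - 1 downto 0);
  signal m : mem_ty;
begin
  process (clk) begin
    if rising_edge(clk) then
      -- Write: M(a) receives di when we is asserted.
      if we = '1' then
        m(to_integer(a)) <= di;
      end if;
      -- Read: do receives the value stored at M(a) before this edge
      -- (assumed read-before-write, registered output).
      do <= m(to_integer(a));
    end if;
  end process;
end architecture main;
```

In a dataflow diagram, a read under these assumptions occupies the clock-cycle boundary between presenting αa and using M(αa), much like a register.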
