SLIDE 1

HEINZ NIXDORF INSTITUTE

University of Paderborn Schaltungstechnik Dr.-Ing. Mario Porrmann

Design Space Exploration for Memory Subsystems of VLIW Architectures

Thorsten Jungeblut¹, Gregor Sievers, Mario Porrmann¹, Ulrich Rückert²

¹ System and Circuit Technology, University of Paderborn

² Cognitive Interaction Technology – Center of Excellence, Bielefeld University

SLIDE 2

HEINZ NIXDORF INSTITUT

Universität Paderborn Schaltungstechnik

  • Prof. Dr.-Ing. Ulrich Rückert
  • Increasing complexity of mobile applications
  • More functionality
  • New algorithms (LTE; LTE Advanced)
  • Multimedia applications (Video, 3-D, …)
  • Inflexible hardware → flexible software implementation
    (Software-Defined Radio, SDR) → powerful CPU necessary
  • High demands on resource efficiency!

Motivation(1)

SLIDE 3

Motivation(2)

  • In embedded processors, the size of on-chip memories is limited
  • External (SDRAM) memory
    – Low cost per bit
    – Slow / high latency
  • Intermediate storage of accesses in the cache
    – Loading of entire cache lines from the external memory
    – Use of temporal and spatial locality
  • The size of the caches is limited by the operating frequency of the processor core
    – Cache hierarchy
    – Level-1 cache is matched to the core frequency → additional levels with higher latency

[Figure: memory hierarchy. Levels: Register, Level-1 cache, Level-2 cache, Main memory, Hard disk]
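
The effect of loading entire cache lines can be illustrated with a minimal sketch of a direct-mapped cache model. The function name, the default sizes (16 kB, 32-byte lines) and the two access patterns are illustrative assumptions, not part of the presented tool flow.

```python
def hit_rate(addresses, cache_bytes=16 * 1024, line_bytes=32):
    """Simulate a direct-mapped cache and return the fraction of hits."""
    num_lines = cache_bytes // line_bytes
    tags = [None] * num_lines          # one stored tag per cache line
    hits = 0
    for addr in addresses:
        block = addr // line_bytes     # memory line that contains this byte
        index = block % num_lines      # direct mapped: exactly one candidate slot
        tag = block // num_lines
        if tags[index] == tag:
            hits += 1
        else:
            tags[index] = tag          # miss: the whole line is (re)loaded
    return hits / len(addresses)

# Sequential 32-bit word accesses: one miss per line, then 7 hits (spatial locality).
sequential = [4 * i for i in range(1024)]
# Stride of one full line: every access touches a new line, nothing is reused.
strided = [32 * i for i in range(1024)]

print(hit_rate(sequential))  # 0.875
print(hit_rate(strided))     # 0.0
```

Running the sequential pattern twice raises the hit rate further, because the 4 kB working set still fits in the cache on the second pass (temporal locality).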

SLIDE 4

Outline

  • Concurrent design flow for DSE
  • VLIW architecture/Cache architecture
  • Prototyping Environment
  • Performance results and resource requirements
  • Conclusion/Outlook

SLIDE 5

[Figure: design flow. Tools: Specification (UPSLA, Vice-UPSLA), Compiler, Assembler, Linker, Software Simulator, Visualization, RTL Description, RTL Simulator, Emulator (Prototype), Synthesis Tool, ASIC Realization, Benchmarks. Artifacts: Source Code, Assembler Code, Object Files, Executables, Profiling Data, RTL Code, Netlist. Evaluation: Functional Verification, Resource Efficiency]

Design Space Exploration Tool Flow

Goal: Highly automated design flow
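
An automated exploration needs to enumerate the candidate configurations before each one is simulated or synthesized. The following is a minimal sketch of that enumeration step only. The dictionary keys are hypothetical names, and the values are limited to the parameters mentioned in this talk (2/4 issue slots, 1-/2-port D-cache, two write-miss modes).

```python
from itertools import product

# Hypothetical parameter names; the values are the ones mentioned on the slides.
design_space = {
    "issue_slots": [2, 4],
    "dcache_ports": [1, 2],
    "write_miss_mode": ["fetch-on-write-miss", "allocate-on-write-miss"],
}

def enumerate_configs(space):
    """Yield every combination of parameter values as a dictionary."""
    keys = list(space)
    for values in product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))

configs = list(enumerate_configs(design_space))
print(len(configs))   # 8
print(configs[0])
```

Each resulting dictionary would then be handed to the simulator, emulator or synthesis step of the flow. That part is tool-specific and not sketched here.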

SLIDE 6

The CoreVA architecture Modular Design

[Figure: CoreVA pipeline. Stages: FE (Instruction Fetch / L1 Instruction Cache), DC (Instruction Decode), RD (Register Read), EX, ME (L1 Data-Cache), WR (Register Write). Functional units shown: four each of ALU, *, / and LD/ST. Further blocks: Register, Condition Register, Bypass, Instruction Memory, Data Memory]

SLIDE 7

Dynamically Reconfigurable Platform

RAPTOR-X64

[Figure: RAPTOR-X64 block diagram. Blocks: PCI-X bus (64-bit data / 32-bit address); PCI bus bridge (master, slave, DMA); local bus (32-bit data / 32-bit address); control and configuration logic (arbiter, MMU, diagnostics, CLK, configuration; SelectMAP, CFG-JTAG); dual-port SRAM; Modules 1, 2, 3, 4 and 6 with links labeled 128, 85 and 75 and CTRL/SMB connections; broadcast bus; USB logic (local-bus master/slave, OTG control, USB 2.0 High-Speed / USB-OTG controller); Xilinx SystemACE CF (CF access, JTAG control, TST-JTAG, CFG-JTAG); system monitor (voltage, temperature, analog inputs); clock synthesis and distribution]

Prototype Implementation of Microelectronic Circuits on FPGAs

  • Up to 200 million transistors emulated
  • Flexible, modular concept: PCI-bus motherboard with up to six modules
  • Partial dynamic reconfiguration at high reconfiguration bandwidth

SLIDE 8

System Environment

  • Multi-master system bus
  • Generic I/D cache interfaces to external memory
  • 4 GB SDRAM
  • Penalty cycles on cache misses:
    – Instr. cache: >73 clock cycles
    – Data cache: >61 clock cycles
  • Internal memories can be accessed from the host system
  • Generic interface for dedicated hardware extensions
  • 9.1 Gbit/s external bandwidth
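
The miss penalties above translate into stall cycles roughly as sketched below. This is a simplified blocking model (every miss stalls the core for the full penalty), so it ignores any overlap a non-blocking data cache may provide. The function name and all workload numbers in the example call are hypothetical. Only the two penalty values come from the slide, and they are lower bounds.

```python
I_MISS_PENALTY = 73  # clock cycles, lower bound from the slide (">73")
D_MISS_PENALTY = 61  # clock cycles, lower bound from the slide (">61")

def stall_fraction(base_cycles, i_fetches, i_hit_rate, d_accesses, d_hit_rate):
    """Share of total execution time spent waiting on cache misses."""
    i_stalls = i_fetches * (1.0 - i_hit_rate) * I_MISS_PENALTY
    d_stalls = d_accesses * (1.0 - d_hit_rate) * D_MISS_PENALTY
    stalls = i_stalls + d_stalls
    return stalls / (base_cycles + stalls)

# Hypothetical workload: 1M cycles without stalls, 99.9% / 99% hit rates.
fraction = stall_fraction(
    base_cycles=1_000_000,
    i_fetches=1_000_000, i_hit_rate=0.999,
    d_accesses=300_000, d_hit_rate=0.99,
)
print(round(fraction, 3))  # 0.204
```

Even with hit rates close to 100%, the long SDRAM latency makes the stall share significant in this model.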

[Figure: system architecture. Blocks: CoreVA CPU, Instr. Cache, Data Cache, MMIO (UART, CRC), Arbiter, Systembus, Systembus Controller, FIFO, SDRAM Controller, SDRAM, Localbus Interface, Host PC. Labels: Xilinx FPGA, RAPTOR2000 System, ASIC]

SLIDE 9

Cache Architecture Overview

  • I-Cache:
    – 32 bit per issue slot → 4-slot configuration: 128-bit interface
    – Direct mapped (low latency/power/area)
    – 16 kB cache size, 64-byte line width (configurable)
  • D-Cache:
    – 1-/2-port configuration possible
    – Direct mapped
    – 16 kB cache size, 32-byte line width (configurable)
    – Write-back policy, non-blocking
    – Two programmable allocation modes: fetch-on-write-miss / allocate-on-write-miss
  • I-/D-Caches can dynamically be configured as scratch-pad memories
    – Higher performance for timing-critical parts of an application (cache misses are avoided)
    – Energy savings because external memory accesses are eliminated
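
For a direct-mapped cache, the sizes above fix how an address is split into tag, index and offset. The sketch below works this out for both caches. It assumes power-of-two sizes and a 32-bit address, which the slide does not state explicitly. The function names are illustrative.

```python
def cache_fields(cache_bytes, line_bytes, addr_bits=32):
    """Return (number of lines, offset bits, index bits, tag bits)."""
    num_lines = cache_bytes // line_bytes
    offset_bits = line_bytes.bit_length() - 1   # log2 for powers of two
    index_bits = num_lines.bit_length() - 1
    tag_bits = addr_bits - index_bits - offset_bits
    return num_lines, offset_bits, index_bits, tag_bits

def split_address(addr, cache_bytes, line_bytes):
    """Split a byte address into (tag, index, offset)."""
    _, offset_bits, index_bits, _ = cache_fields(cache_bytes, line_bytes)
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

print(cache_fields(16 * 1024, 64))  # I-cache: (256, 6, 8, 18)
print(cache_fields(16 * 1024, 32))  # D-cache: (512, 5, 9, 18)
```

Both caches end up with the same tag width, since the halved line size of the D-cache is offset by twice the number of lines.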

SLIDE 10

Cache Architecture Synthesis Results

SLIDE 11

  • Applications: synthetic benchmarks, baseband, cryptography, multimedia, LTE protocol stack
  • A ratio of 50% LD/ST units per functional unit (#FUs) is the best trade-off
  • Concurrent LD/ST ≠ speedup! The speedup depends on scheduling.

Application Evaluation Different Cache Configurations

SLIDE 12

  • High hit rates for all applications
  • Allocate-on-write-miss ↔ Fetch-on-write-miss

Results(1) Hit Rates
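
The slides do not define the two write-miss modes in detail. The sketch below assumes that fetch-on-write-miss loads the line from external memory before writing, while allocate-on-write-miss claims the line without fetching it. Under that assumption it counts external line reads for a direct-mapped cache. Write-back traffic and partially valid lines are ignored, and all names are illustrative.

```python
def external_line_reads(accesses, fetch_on_write_miss,
                        cache_bytes=16 * 1024, line_bytes=32):
    """Count line fetches from external memory for a list of (op, address)."""
    num_lines = cache_bytes // line_bytes
    tags = [None] * num_lines
    reads = 0
    for op, addr in accesses:
        block = addr // line_bytes
        index, tag = block % num_lines, block // num_lines
        if tags[index] == tag:
            continue                   # hit: no external access
        tags[index] = tag              # miss: the line is allocated in both modes
        if op == "r" or fetch_on_write_miss:
            reads += 1                 # the line contents come from SDRAM
    return reads

# Filling a 4 kB output buffer with sequential 32-bit writes (128 lines).
writes = [("w", 4 * i) for i in range(1024)]
print(external_line_reads(writes, fetch_on_write_miss=True))   # 128
print(external_line_reads(writes, fetch_on_write_miss=False))  # 0
```

For write-dominated buffers, the second mode avoids fetching data that is overwritten anyway, which is one plausible reason for the two modes to differ in stall cycles and energy.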

SLIDE 13

Results(2) Portion of Stall Cycles to Execution Time

  • Latencies of SDRAM accesses may vary depending on the order, distribution and frequency of the accesses.
SLIDE 14

Results(3) Performance

SLIDE 15

Results(4) Energy

SLIDE 16

Results(5) Energy-Delay
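
The energy-delay product combines both metrics as energy × runtime, so a configuration only wins if it does not trade too much speed for energy (or vice versa). The following is a minimal sketch of ranking configurations by it. The configuration names and the power/runtime values are placeholders for illustration, not measurements from this talk.

```python
def energy_delay_product(power_w, runtime_s):
    """EDP in joule-seconds: (power * runtime) * runtime."""
    energy_j = power_w * runtime_s
    return energy_j * runtime_s

def best_by_edp(measurements):
    """Return the name of the configuration with the lowest EDP."""
    return min(measurements,
               key=lambda name: energy_delay_product(*measurements[name]))

# Placeholder values (power in W, runtime in s), NOT results from the slides.
measurements = {
    "config_a": (0.10, 2.0e-3),
    "config_b": (0.12, 1.6e-3),
}
print(best_by_edp(measurements))  # config_b
```

In this made-up example the faster configuration wins despite its higher power, because runtime enters the product twice.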

SLIDE 17

The CoreVA VLIW architecture ASIC realization

  • 4-issue VLIW processor, 2× MLA, DIV
  • 1-port I-Cache (16 kByte, 128 bit)
  • 2-port D-Cache (16 kByte, 32 bit)
  • 65 nm ST Microelectronics, Low Power (thick oxide), 1.2 V MixedVT, 1.8 V I/Os (configurable pull-ups)
  • Hardware extensions (incl. ECC)

Frequency: 400 MHz
Area (32 kB SRAM): 2.7 mm²
Power consumption: 0.1 W

1.6 GOP/s in scalar mode
3.2 GOP/s in SIMD mode

[Figure: chip layout, 1.66 mm × 1.66 mm. Labeled regions: Instruction Cache, Data Cache, Comp. Cell, ECC, Execute, Register File]
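
The headline figures above can be cross-checked with a few lines of arithmetic. The sketch assumes that peak throughput means one operation per issue slot per cycle and that SIMD mode doubles the operation count per slot. The factor of two is inferred from the ratio 3.2 / 1.6 and is not stated on the slide. Energy per operation is a derived peak-rate estimate.

```python
FREQ_HZ = 400e6       # clock frequency from the slide
ISSUE_SLOTS = 4       # 4-issue VLIW
SIMD_FACTOR = 2       # assumption, inferred from 3.2 GOP/s vs 1.6 GOP/s
POWER_W = 0.1         # power consumption from the slide
DIE_EDGE_MM = 1.66    # edge length from the layout figure

scalar_gops = FREQ_HZ * ISSUE_SLOTS / 1e9
simd_gops = scalar_gops * SIMD_FACTOR
energy_per_op_pj = POWER_W / (scalar_gops * 1e9) * 1e12
die_area_mm2 = DIE_EDGE_MM ** 2

print(scalar_gops)                 # 1.6
print(simd_gops)                   # 3.2
print(round(energy_per_op_pj, 1))  # 62.5
print(round(die_area_mm2, 2))      # 2.76
```

The computed die area of about 2.76 mm² is close to the 2.7 mm² quoted on the slide.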

SLIDE 18

Conclusion/Outlook

  • Framework for the design-space exploration of processor architectures and memory subsystems
  • Rapid prototyping environment RAPTOR
  • Dynamically configurable cache architecture
  • 2-slot configuration with allocate-on-write-miss shows the best energy trade-off
  • Performance/energy gains of up to 25%
  • Future work:
    – Include associativity
    – Combination of caches and scratch-pad memories to enhance memory bandwidth

SLIDE 19

Questions?

[Figure: recap of four earlier diagrams. Panels: Design space exploration, VLIW architecture, Rapid prototyping, System architecture]

SLIDE 20

Heinz Nixdorf Institute
University of Paderborn
System and Circuit Technology
Dipl.-Ing. Thorsten Jungeblut
Fürstenallee 11
33102 Paderborn
Tel.: 0 52 51/60 63 39
Fax: 0 52 51/60 63 51
Email: tj@hni.upb.de
http://wwwhni.upb.de/sct

Thank you for your attention!