Hardware/Software Instruction Set Configurability for System-on-Chip Processors



SLIDE 1

38th DAC, Las Vegas, June 18-22, 2001

Hardware/Software Instruction Set Configurability for System-on-Chip Processors

Albert Wang, Chris Rowen, Dror Maydan, Earl Killian

SLIDE 2

Landscape of reconfigurable computing

[Figure: design-space chart. Axes: optimality/integration (e.g. mW, $) versus flexibility/modularity (e.g. time-to-market). Points: ASIC, FPGA, General Processor, FPGA + Processor, and Instruction-set Configurable Processor. Two gaps are marked ∆ ~10x.]

SLIDE 3

Computing using temporal connection

[Diagram: processor solution made of registers, datapath, control, and program memory. A Correct/Efficient scorecard rates the processor.]

SLIDE 4

Computing using spatial connection

[Diagram: processor solution (registers, datapath, control, program memory) beside an ASIC solution (FSM, storage). A Correct/Efficient scorecard rates the ASIC.]

SLIDE 5

Configurable Processors: best of both

Processor with Application-specific Instructions

[Diagram: processor solutions (registers, datapath, control, program memory) and ASIC solutions (FSM, storage) combined. A Correct/Efficient scorecard covers processor and ASIC.]

SLIDE 6

Outline

  • Configurable processor solution
      • Xtensa™ processor
      • Architecture
      • Instruction extension automation
      • Software development tools
  • An Example
  • Results
  • Summary

SLIDE 7

Conventional Architecture

[Diagram: source operands from register files RF0, RF1, RF2 and states S0, S1 feed the functional units (each labeled FU0); results are written back under decoder and control.]

  • More registers
  • More FUs
  • Deeper pipeline
  • Bypass/forward

SLIDE 8

Conventional Architecture – cont.

[Diagram: source routing from RF0, RF1, RF2, S0, S1 into functional units FU0 to FU3; result routing; decoder; control.]

  • More FUs

SLIDE 9

Conventional Architecture – cont.

[Diagram: source routing from RF0, RF1, RF2, S0, S1 into functional units FU0 to FU3; result routing; decoder; control.]

  • More FUs
  • More registers

SLIDE 10

Conventional Architecture – cont.

[Diagram: source routing from RF0, RF1, RF2, S0, S1 into functional units FU0 to FU3; result routing; decoder; control.]

  • More registers
  • More FUs
  • Deeper pipeline

SLIDE 11

Conventional Architecture – cont.

[Diagram: source routing from RF0, RF1, RF2, S0, S1 into functional units FU0 to FU3; result routing; decoder; control.]

  • More registers
  • More FUs
  • Deeper pipeline
  • Bypass/forward

SLIDE 12

Conventional Architecture – cont.

Problems with a fixed processor:

  • Wastes silicon
      • There are no universal extensions, or even one set for each application class
  • Not fast enough compared with a hardware implementation
  • Wastes power

The Tensilica solution:

  • Small core processor
  • Allow easy and efficient application-specific instruction extensions

SLIDE 13

Xtensa Architecture – Base

[Diagram: source routing from RF0, RF1, RF2, S0, S1 into the functional units (each labeled FU0); result routing; decoder; control.]

  • Good performance
      • Comparable to any embedded 32-bit RISC
  • Good code density
      • Much better than 32-bit RISC
      • Uses 16b/24b instructions
  • Small
      • 0.7 mm² in 0.18 µm
  • Low power
      • 0.37 mW/MHz
  • Easy extension
      • With the Tensilica Instruction Extension (TIE) language, at the ISA level
  • Efficient extension
      • TIE compiler generates an efficient pipelined implementation
      • TIE compiler extends all software development tools

SLIDE 14

TIE language – opcode

[Diagram: base datapath with source routing, RF0, RF1, RF2, S0, S1, functional units, result routing, decoder, and control.]

  • Opcode

opcode MAC op2=5 CUST0

SLIDE 15

TIE Language – regfile / state

[Diagram: source routing, RF0 and S0 (more register files and states as needed), functional units, result routing, decoder, and control.]

  • Opcode
  • Register file / state

state ACC 40

SLIDE 16

TIE Language – semantics

[Diagram: source routing, RF0 and S0 (more as needed), functional units FU0 and MAC (more as needed), result routing, decoder, and control.]

  • Opcode
  • Register file / state
  • Semantics

semantic sem1 {MAC} {assign ACC = ACC + ars[15:0] * art[15:0];}

SLIDE 17

TIE Language – iclass

[Diagram: source routing, RF0 and S0 (more as needed), functional units FU0 and MAC (more as needed), result routing, decoder, and control.]

  • Opcode
  • Register file / state
  • Semantics
  • Instruction class

iclass c1 {MAC} {in ars, in art} {inout ACC}

SLIDE 18

TIE Language – schedule

[Diagram: source routing, RF0 and S0 (more as needed), functional units FU0 and MAC (more as needed), result routing, decoder, and control.]

  • Opcode
  • Register file / state
  • Semantics
  • Instruction class
  • Schedule

schedule s1 {MAC} {use ars 1; use art 1; use ACC 2; def ACC 2;}
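
One way to read the schedule line is as a statement about when operands are needed and when results appear. The sketch below is a hypothetical timing model, not Tensilica's definition: it assumes an instruction issued in cycle t is in stage k during cycle t + k, that "use X u" reads X at the start of stage u, and that "def X d" makes X available at the end of stage d. The function name `stall_cycles` is invented for illustration.

```c
#include <assert.h>

/* Hypothetical model (an assumption, not taken from the TIE manual):
 * a producer defines a value at the end of stage def_stage and the
 * next instruction, issued one cycle later, reads it at the start of
 * stage use_stage. The consumer must wait until the value exists. */
static int stall_cycles(int def_stage, int use_stage)
{
    int gap = def_stage - use_stage;
    return gap > 0 ? gap : 0;   /* never negative: early results cost nothing */
}
```

Under this model, back-to-back MAC instructions (def ACC 2, use ACC 2) would issue without a stall, while a consumer that reads in stage 1 a value defined in stage 2 would wait one cycle.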

SLIDE 19

A Complete Example – parallel MAC

opcode PMAC op2=0 CUST0
state ACC1 40
state ACC2 40
iclass rr {PMAC} {in ars, in art} {inout ACC1, inout ACC2}
semantic pmac_sem {PMAC} {
    assign ACC1 = ACC1 + ars[15:0] * art[15:0];
    assign ACC2 = ACC2 + ars[31:16] * art[31:16];
}
schedule pmac_schd {PMAC} {
    use ars 1; use art 1;
    use ACC1 2; use ACC2 2;
    def ACC1 2; def ACC2 2;
}
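
To make the PMAC semantics concrete, here is a minimal C model of the two assign statements. It is a sketch under stated assumptions: the 16-bit operand halves are treated as unsigned, and each 40-bit accumulator wraps modulo 2^40. The slide does not specify signedness or overflow behavior, and the names `pmac_state`, `pmac`, and `ACC_MASK` are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Mask for a 40-bit state register, matching "state ACC1 40". */
#define ACC_MASK ((UINT64_C(1) << 40) - 1)

/* Hypothetical software model of the two TIE states. */
typedef struct {
    uint64_t acc1;   /* accumulates low-half products  */
    uint64_t acc2;   /* accumulates high-half products */
} pmac_state;

/* One PMAC: two 16x16 multiply-accumulates on the packed halves of
 * ars and art. Assumes unsigned operands and wraparound at 40 bits. */
static void pmac(pmac_state *s, uint32_t ars, uint32_t art)
{
    uint64_t lo = (uint64_t)(ars & 0xFFFFu) * (art & 0xFFFFu);
    uint64_t hi = (uint64_t)(ars >> 16) * (art >> 16);
    s->acc1 = (s->acc1 + lo) & ACC_MASK;
    s->acc2 = (s->acc2 + hi) & ACC_MASK;
}
```

Calling `pmac` once per pair of packed operands accumulates two dot products at the same time, which is the point of the single custom instruction.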

SLIDE 20

Productivity Gain – language + compiler

[Figure: design flow. Select processor options (ALU, pipe, I/O, timer, MMU, register file, cache) and describe new instructions. Using the Xtensa processor generator, create a tailored, synthesizable HDL µP core and a customized compiler, assembler, linker, debugger, and simulator.]

In minutes!

SLIDE 21

Productivity Gain – Software Tools

[Figure: design flow. Select processor options (ALU, pipe, I/O, timer, MMU, register file, cache) and describe new instructions. Using the Xtensa processor generator, create a tailored, synthesizable HDL µP core and a customized compiler, assembler, linker, debugger, and simulator.]

SLIDE 22

Software Support – Assembler

[Diagram: RF0, FU0, decoder, and control, extended with states ACC1 and ACC2 and two multiply-add units.]

  • Assembler
  • Custom data type
  • Register allocation
  • Code scheduling
  • RTOS
  • Simulator/debugger

    loop    a2, .L1
    l16si   a10, a3, 0
    l16si   a11, a3, 2
    addi.n  a3, a3, 2
    PMAC    a10, a11
.L1:

SLIDE 23

Software Support – custom data type

[Diagram: RF0, FU0, decoder, and control, extended with states ACC1 and ACC2 and two multiply-add units.]

  • Assembler
  • Custom data type
  • Register allocation
  • Code scheduling
  • RTOS
  • Simulator/debugger

C code:

    sat_int x, y, z;
    z = sat_add(x, y);
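
For readers without the toolchain, the behavior such a custom type might have can be imitated in portable C. The sketch below assumes that `sat_int` is a 32-bit signed value and that `sat_add` clamps to the representable range; the slide gives neither the width nor the semantics, so both are illustrative guesses. On the real processor the type and operation would come from the TIE description rather than from C code like this.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the custom type; the real width is an assumption. */
typedef int32_t sat_int;

/* Saturating add: compute in 64 bits, then clamp to the 32-bit range
 * instead of wrapping around on overflow. */
static sat_int sat_add(sat_int x, sat_int y)
{
    int64_t sum = (int64_t)x + (int64_t)y;
    if (sum > INT32_MAX) return INT32_MAX;
    if (sum < INT32_MIN) return INT32_MIN;
    return (sat_int)sum;
}
```

The software version needs a widening add and two compares per operation, which is the kind of sequence a single custom instruction can replace.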

SLIDE 24

Software Support – register allocation

[Diagram: RF0, FU0, decoder, and control, extended with states ACC1 and ACC2 and two multiply-add units.]

  • Assembler
  • Custom data type
  • Register allocation
  • Code scheduling
  • RTOS
  • Simulator/debugger

Spilling around a call:

    sat_add    s3, s1, s2
    sat_store  s3, a1, 0
    call8      foo
    sat_load   s3, a1, 0

SLIDE 25

Software Support – code scheduling

[Diagram: RF0, FU0, decoder, and control, extended with states ACC1 and ACC2 and two multiply-add units.]

  • Assembler
  • Custom data type
  • Register allocation
  • Code scheduling
  • RTOS
  • Simulator/debugger

C source:

    t = sat_mult(x, y);
    z = sat_add(z, t);
    t2 = sat_mult(x2, y2);

Scheduled code:

    sat_mult  s3, s1, s2
    sat_mult  s6, s5, s4
    sat_add   s7, s7, s3

SLIDE 26

Software Support – RTOS

[Diagram: context switch between Task0 and Task1. Each task's custom registers s0, s1, … s15 are moved to and from memory with sat_store and sat_load.]

  • Assembler
  • Custom data type
  • Register allocation
  • Code scheduling
  • RTOS
  • Simulator/debugger

SLIDE 27

Software Support – simulator/debugger

[Diagram: RF0, FU0, decoder, and control, extended with states ACC1 and ACC2 and two multiply-add units.]

    gdb> break …
    gdb> cont
    gdb> step
    gdb> display …

  • Assembler
  • Custom data type
  • Register allocation
  • Code scheduling
  • RTOS
  • Simulator/debugger

SLIDE 28

Outline

  • Configurable processors
      • Architecture
      • Instruction extension
      • Software support
  • An Example
  • Results
  • Summary

SLIDE 29

Data Encryption Standard (DES)

Initial step

    (R, L) = Initial_permutation(Din64)

Iterate 16 times

  Key generation

    (C, D) = PC1(k)
    n = rotate_amount (function of iteration count)
    C = rotate_right(C, n)
    D = rotate_right(D, n)
    K = PC2(D, C)

  Encryption

    R_{i+1} = L_i ⊕ Permutation(S_Box(K ⊕ Expansion(R_i)))
    L_{i+1} = R_i

Final step

    Dout64 = Final_permutation(L, R)
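
The round equations above describe a Feistel network, and a small C sketch can show why the same loop serves for both encryption and decryption. This is an illustration only: `toy_f` is a made-up round function standing in for Permutation(S_Box(K ⊕ Expansion(R))), the round keys are arbitrary, and the initial and final permutations are omitted, so the code is not DES.

```c
#include <assert.h>
#include <stdint.h>

#define ROUNDS 16

/* Toy round function (an assumption for illustration). It is NOT the
 * DES expansion, S-box, and permutation pipeline. */
static uint32_t toy_f(uint32_t r, uint32_t k)
{
    uint32_t x = r ^ k;
    return (x << 5 | x >> 27) + 0x9E3779B9u;
}

/* Runs R_{i+1} = L_i ^ f(R_i, K_i), L_{i+1} = R_i for 16 rounds, then
 * swaps the halves. Decryption is the same loop with the round keys
 * taken in reverse order. */
static void feistel(uint32_t *l, uint32_t *r,
                    const uint32_t keys[ROUNDS], int decrypt)
{
    for (int i = 0; i < ROUNDS; i++) {
        uint32_t k = keys[decrypt ? ROUNDS - 1 - i : i];
        uint32_t next_r = *l ^ toy_f(*r, k);
        *l = *r;
        *r = next_r;
    }
    uint32_t t = *l;   /* final swap of the two halves */
    *l = *r;
    *r = t;
}
```

Because each round only XORs one half with a function of the other, the round function itself never has to be inverted; reversing the key order is enough to undo the encryption.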

SLIDE 30

DES: Software Implementation

    /* Table-driven bit permutation: output bit ob is taken from the
       1-based input bit table[ob] of the 64-bit value (hi:lo). */
    static unsigned permute(unsigned char *table, int n,
                            unsigned hi, unsigned lo)
    {
        int ib, ob;
        unsigned out = 0;
        for (ob = 0; ob < n; ob++) {
            ib = table[ob] - 1;
            if (ib >= 32) {
                if (hi & (1u << (ib - 32))) out |= 1u << ob;
            } else {
                if (lo & (1u << ib)) out |= 1u << ob;
            }
        }
        return out;
    }

SLIDE 31

DES: Software Implementation

[The permute() routine from slide 30 is shown again.]

Too much computation!
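
A small usage sketch shows where the cost comes from. The routine is reproduced with unsigned shift constants, and `reverse8` is a hypothetical 8-entry table (not one of the DES tables) that reverses the low eight bits. Every output bit costs a table load, a compare, a shift, a mask test, and a conditional OR.

```c
#include <assert.h>

/* Table-driven permutation as on the slide: output bit ob comes from
 * the 1-based input bit table[ob] of the 64-bit value (hi:lo). */
static unsigned permute(const unsigned char *table, int n,
                        unsigned hi, unsigned lo)
{
    unsigned out = 0;
    for (int ob = 0; ob < n; ob++) {
        int ib = table[ob] - 1;              /* convert to 0-based index */
        if (ib >= 32) {
            if (hi & (1u << (ib - 32))) out |= 1u << ob;
        } else {
            if (lo & (1u << ib)) out |= 1u << ob;
        }
    }
    return out;
}

/* Hypothetical example table: reverses the order of the low 8 bits. */
static const unsigned char reverse8[8] = {8, 7, 6, 5, 4, 3, 2, 1};
```

In hardware the same permutation is only wiring, which is why the loop above compares so poorly with a dedicated datapath.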

SLIDE 32

DES: Hardware Implementation

[Diagram: DES datapath with Initial Permutation, Expansion Permutation, S Boxes, P Permutation, two XOR (⊕) stages, Final Permutation, Key Generation, and a State Machine.]

SLIDE 33

DES: Hardware Implementation

[Diagram: DES datapath with Initial Permutation, Expansion Permutation, S Boxes, P Permutation, two XOR (⊕) stages, Final Permutation, Key Generation, and a State Machine.]

Complicated control logic!

SLIDE 34

DES: SETDATA instruction

SETDATA ars, art

[Diagram: DES datapath with Initial Permutation, Expansion Permutation, S Boxes, P Permutation, two XOR (⊕) stages, Final Permutation, Key Generation, and a State Machine.]

SLIDE 35

DES: SETKEY instruction

SETKEY ars, art

[Diagram: DES datapath with Initial Permutation, Expansion Permutation, S Boxes, P Permutation, two XOR (⊕) stages, Final Permutation, Key Generation, and a State Machine.]

SLIDE 36

DES: DES instruction

DES immediate

[Diagram: DES datapath with Initial Permutation, Expansion Permutation, S Boxes, P Permutation, two XOR (⊕) stages, Final Permutation, Key Generation, and a State Machine.]

SLIDE 37

DES: GETDATA instruction

GETDATA ars, hilo

[Diagram: DES datapath with Initial Permutation, Expansion Permutation, S Boxes, P Permutation, two XOR (⊕) stages, Final Permutation, Key Generation, and a State Machine.]

SLIDE 38

DES: Putting it together

    SETKEY  ars, art
    SETDATA ars, art
    DES     immediate
    GETDATA ars, hilo

[Diagram: DES datapath with Initial Permutation, Expansion Permutation, S Boxes, P Permutation, two XOR (⊕) stages, Final Permutation, Key Generation, and a State Machine.]

SLIDE 39

DES: Improved Program

Decryption:

    SETKEY(K_hi, K_lo);
    for (;;) {
        … /* read encrypted data */
        SETDATA(D_hi, D_lo);
        DES(DECRYPT1);
        DES(DECRYPT2); DES(DECRYPT2); DES(DECRYPT2);
        DES(DECRYPT2); DES(DECRYPT2); DES(DECRYPT2);
        DES(DECRYPT1);
        DES(DECRYPT2); DES(DECRYPT2); DES(DECRYPT2);
        DES(DECRYPT2); DES(DECRYPT2); DES(DECRYPT2);
        DES(DECRYPT1);
        DES(DECRYPT1);
        E_hi = GETDATA(hi);
        E_lo = GETDATA(lo);
        … /* write data */
    }

Encryption:

    SETKEY(K_hi, K_lo);
    for (;;) {
        … /* read data */
        SETDATA(D_hi, D_lo);
        DES(ENCRYPT1);
        DES(ENCRYPT1);
        DES(ENCRYPT2); DES(ENCRYPT2); DES(ENCRYPT2);
        DES(ENCRYPT2); DES(ENCRYPT2); DES(ENCRYPT2);
        DES(ENCRYPT1);
        DES(ENCRYPT2); DES(ENCRYPT2); DES(ENCRYPT2);
        DES(ENCRYPT2); DES(ENCRYPT2); DES(ENCRYPT2);
        DES(ENCRYPT1);
        E_hi = GETDATA(hi);
        E_lo = GETDATA(lo);
        … /* write encrypted data */
    }
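
The order of the sixteen DES() calls mirrors the standard DES key-schedule rotation amounts (1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1). The sketch below assumes that the immediates ENCRYPT1/ENCRYPT2 and DECRYPT1/DECRYPT2 select a key rotation of one or two positions; the slides do not state this, and the function names are invented for illustration.

```c
#include <assert.h>

enum { DES_ROUNDS = 16 };

/* Standard DES key-schedule rotation amounts for encryption rounds. */
static const int des_shifts[DES_ROUNDS] = {
    1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1
};

/* Rotation amount for step i (0-based). Assumption: decryption walks
 * the encryption sequence in reverse, as the slide's call order does. */
static int rotation_for_step(int i, int decrypt)
{
    return des_shifts[decrypt ? DES_ROUNDS - 1 - i : i];
}

/* Sum of all rotations over one block. */
static int total_rotation(int decrypt)
{
    int sum = 0;
    for (int i = 0; i < DES_ROUNDS; i++)
        sum += rotation_for_step(i, decrypt);
    return sum;
}
```

The rotations sum to 28, the width of each key half, so after one block the key registers are back at their starting position. That is consistent with SETKEY appearing once, outside the loop.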

SLIDE 40

DES: Summary

Add 4 TIE instructions:

  • 80 lines of TIE description
  • No cycle time impact
  • ~1700 additional gates
  • Code size reduced

[Chart: DES Performance. Y axis: Speedup (X), 20 to 80. X axis: Block Size (Bytes): 1024, 64, 8, Mean. Bar labels: 43, 50, 53, 72.]

SLIDE 41

Outline

  • Configurable processors
      • Architecture
      • Instruction extension
      • Software support
  • An Example
  • Results
  • Summary

SLIDE 42

Improvement over general purpose 32b RISC

[Chart: improvement in MIPS or MIPS/Watt, axis marked 1x, 2x, 4x, 6x, 8x, 10x, and 55x.
Benchmarks: JPEG (image compression), Motion Estimation (video conferencing), FIR filter (signal processing), Viterbi Decoding (wireless communication), DES (content encryption).
Added-gate labels: Base + 7500 gates, Base + 6500 gates, Base + 900 gates, Base + 1000 gates, Base + 1700 gates.]

SLIDE 43

What is “EEMBC”?

  • EDN Embedded Microprocessor Benchmark Consortium
  • Pronounced “Embassy”
  • Non-profit consortium, funded by over 40 members
      • Including: ARM, AMD, IBM, Intel, LSI Logic, MIPS, Motorola, National Semi, NEC, TI, Toshiba…Tensilica, and more…
  • Objective: provide independently certified benchmark scores relevant to deeply embedded processor applications
  • Independent laboratory recreates and certifies all benchmark results - no tricks
  • Five different benchmark suites:
      • Each suite comprises a range (five to sixteen) of benchmarks representative of that product category
      • Example: Consumer: image compression, image filtering, color conversion

SLIDE 44

EEMBC Networking Benchmark

[Charts: Netmark Performance (axis 2 to 14) and Netmark Efficiency (Netmark/MHz, axis 0.000 to 0.045).
Processors shown: IDT 32334/100, IDT79RC32364/100, NEC V832-143, AMD ElanSC520/133, Toshiba TMPR3927F-GH189/133, IDT79RC32V334-150, Toshiba TMPR3927F-GHM2000/133, NEC VR5432-167, Xtensa/200, IDT79RC64575IDtc/250, NEC VR5000, IDT79RC64575Algor/250, AMD K6-2/450, AMD K6-2E/400, Xtensa Optimized/200, AMD K6-2E+/500, AMD K6-IIIE+/550.
Colors: Blue: Xtensa; Green: desktop x86s; Maroon: 64b RISCs; Orange: 32b RISCs.]

  • Comparable in Netmark to high-end desktop CPUs
  • 2x in Netmark/MHz
  • 59K total gates at 200 MHz

SLIDE 45

EEMBC Telecom Benchmark

[Charts: Telemark Performance (axis 10 to 90) and Telemark Efficiency (Telemark/MHz, axis 0.000 to 0.450).
Processors shown: AMD ElanSC520/133, IDT 32334-100, Analog Devices 21065L/60, NEC V832-143, IDT79RC32V334-150, Xtensa/200, NEC VR5432-167, IDT79RC64575Algor/250, NEC VR5000, AMD K6-2E/400, TI TMS320C6203/300, AMDK6-2E+/500, AMD K6-III+/550, IBM PowerPC750CX/500, TI TMS320C6203 C opt/300, TI TMS320C6203 Optimized/300, Xtensa Optimized/200.
Colors: Blue: Xtensa; Green: desktop x86s; Maroon: 64b RISCs; Orange: 32b RISCs; Gray: DSPs.]

  • Beats all processors, including hand-optimized TI C6x
  • 180K total gates at 200 MHz

SLIDE 46

EEMBC Consumer Benchmark

[Charts: Consumermark Performance (axis 20 to 200) and Consumermark Efficiency (Consumermark/MHz, axis 0.00 to 1.00).
Processors shown: ST20C2/50, AMD ElanSC520/133, NEC V832/143, National Geode GX1/200, NEC VR5432/167, Xtensa/200, NEC VR5000/250, AMD K6-2E/400, AMDK6-2E+/500, AMD K6-III+/550, Xtensa Optimized/200.
Colors: Blue: Xtensa; Green: desktop x86s; Maroon: 64b RISCs; Orange: 32b RISCs.]

  • 6x in Consumermark and 12x in Consumermark/MHz
  • 127K total gates at 200 MHz

SLIDE 47

Summary

[Figure: design-space chart. Axes: optimality/integration (e.g. mW, $) versus flexibility/modularity (e.g. time-to-market). Points: ASIC, FPGA, Traditional Processor, FPGA + Processor, and Instruction-set Configurable Processor. Two gaps are marked ∆ ~10x.]

SLIDE 48

Summary

[Figure: design-space chart. Axes: optimality/integration (e.g. mW, $) versus flexibility/modularity (e.g. time-to-market). Points: ASIC, FPGA, General Processor, FPGA + Processor, and Instruction-set Configurable Processor. Two gaps are marked ∆ ~10x.]

SLIDE 49

Summary

[Figure: design-space chart. Axes: optimality/integration (e.g. mW, $) versus flexibility/modularity (e.g. time-to-market). Points: ASIC, FPGA, Traditional Processor, FPGA + Processor, and Instruction-set Configurable Processor. Two gaps are marked ∆ ~10x.]

Benefit of SoC integration

  • Higher bandwidth
  • Lower cost
  • Lower power

Benefit of IS configuration

  • A cost-effective computing platform

Benefit of TIE compiler and SW tools

  • Faster time-to-market
  • Lower development cost
  • Lower risk

SLIDE 50

Thank You!