Hardware/Software Instruction Set Configurability for System-on-Chip Processors
2
Landscape of reconfigurable computing
Optimality/ integration (e.g. mW, $) Flexibility/modularity (e.g. time-to-market)
ASIC FPGA
∆ ~10x ∆ ~10x
Instruction-set Configurable Processor
General Processor FPGA + Processor
3
Computing using temporal connection
Registers Datapath Control Processor Solution Memory (Program)
- X
Correct Efficient
- X
Processor
4
Computing using spatial connection
Registers Datapath Control Processor Solution ASIC Solution FSM Storage Memory (Program)
- X
Correct Efficient
- X
ASIC
5
Processor with Application-specific Instructions
Configurable Processors: best of both
Registers Datapath Control Processor Solutions ASIC Solutions FSM Storage Memory (Program)
- Correct
Efficient Processor ASIC
6
Outline
- Configurable processor solution
  - Xtensa™ processor architecture
  - Instruction extension automation
  - Software development tools
- An example
- Results
- Summary
7
Conventional Architecture
Source RF0 RF1 RF2 S1 S0
FU0 FU0 FU0 FU0
Result Decoder Control
- More registers
- More FU’s
- Deeper pipeline
- Bypass/forward
8
Conventional Architecture - cont.
Source routing RF0 RF1 RF2 S1 S0
FU0 FU1 FU2 FU3
Result routing Decoder Control
- More FU’s
9
Conventional Architecture – cont.
Source routing RF0 RF1 RF2 S1 S0
FU0 FU1 FU2 FU3
Result routing Decoder Control
- More FU’s
- More registers
10
Conventional Architecture – cont.
Source routing RF0 RF1 RF2 S1 S0
FU0 FU1 FU2 FU3
Result routing Decoder Control
- More registers
- More FU’s
- Deeper pipeline
11
Conventional Architecture – cont.
Source routing RF0 RF1 RF2 S1 S0
FU0 FU1 FU2 FU3
Result routing Decoder Control
- More registers
- More FU’s
- Deeper pipeline
- Bypass/forward
12
Conventional Architecture – cont.
Problems with a fixed processor:
- Wastes silicon
  - There are no universal extensions, or even one set for each application class
- Not fast enough compared with a hardware implementation
- Wastes power
The Tensilica solution:
- Small core processor
- Easy and efficient application-specific instruction extensions
13
Xtensa Architecture – Base
Source routing RF0 RF1 RF2 S1 S0
FU0 FU0 FU0 FU0
Result routing Decoder Control
- Good performance
  - Comparable to any embedded 32-bit RISC
- Good code density
  - Much better than 32-bit RISC
  - Uses 16-bit/24-bit instructions
- Small
  - 0.7 mm² in 0.18 µm
- Low power
  - 0.37 mW/MHz
- Easy extension
  - With the Tensilica Instruction Extension (TIE) language, at the ISA level
- Efficient extension
  - TIE compiler generates an efficient pipelined implementation
  - TIE compiler extends all software development tools
14
TIE Language – opcode
Source routing RF0 RF1 RF2 S1 S0
FU0 FU0 FU0 FU0
Result routing Decoder Control
- Opcode
opcode MAC op2=5 CUST0
15
TIE Language – regfile / state
Source routing RF0 S0
FU0 FU0 FU0 FU0
Result routing Decoder Control
- Opcode
- Register file / State
… as needed
state ACC 40
16
TIE Language – semantics
Source routing RF0
FU0 MAC
Result routing Decoder Control
- Opcode
- Register file / state
- semantics
S0
… as needed … as needed
semantic sem1 {MAC} {assign ACC = ACC + ars[15:0] * art[15:0];}
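The semantic statement above can be mirrored by a small software reference model, which is handy for checking expected accumulator values before running the generated simulator. The sketch below is only an illustration: the name `mac_model` is hypothetical, it assumes both operands are the low 16 bits of the source registers treated as unsigned values, and it assumes the 40-bit state simply wraps on overflow.

```c
#include <stdint.h>

/* Width of the state declared as "state ACC 40". */
#define ACC_BITS 40
#define ACC_MASK ((UINT64_C(1) << ACC_BITS) - 1)

/* Hypothetical software model of the MAC instruction:
 * ACC = ACC + (low 16 bits of ars) * (low 16 bits of art),
 * truncated to 40 bits. Unsigned operands are an assumption. */
static uint64_t mac_model(uint64_t acc, uint32_t ars, uint32_t art)
{
    uint64_t product = (uint64_t)(ars & 0xFFFFu) * (uint64_t)(art & 0xFFFFu);
    return (acc + product) & ACC_MASK;
}
```

Calling `mac_model` once per sample pair reproduces what a loop of MAC instructions would leave in the accumulator under these assumptions.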
17
TIE Language – iclass
Source routing RF0
FU0 MAC
Result routing Decoder Control
- Opcode
- Register file / state
- semantics
S0
… as needed … as needed
- Instruction class
iclass c1 {MAC} {in ars, in art} {inout ACC}
18
TIE Language - schedule
- schedule
Source routing RF0
FU0 MAC
Result routing Decoder Control
- Opcode
- Register file / state
- semantics
S0
… as needed … as needed
- Instruction class
schedule s1 {MAC}{use ars 1; use art 1; use ACC 2; def ACC 2;}
19
A Complete Example – parallel MAC
opcode PMAC op2=0 CUST0
state ACC1 40
state ACC2 40
iclass rr {PMAC} {in ars, in art} {inout ACC1, inout ACC2}
semantic pmac_sem {PMAC} {
    assign ACC1 = ACC1 + ars[15:0] * art[15:0];
    assign ACC2 = ACC2 + ars[31:16] * art[31:16];
}
schedule pmac_schd {PMAC} {
    use ars 1; use art 1;
    use ACC1 2; use ACC2 2;
    def ACC1 2; def ACC2 2;
}
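As a cross-check on the PMAC description, here is a minimal C reference model of the two accumulators. It is a sketch under stated assumptions: `pmac_state`, `pmac_model`, and `pmac_dot` are hypothetical names, the 16-bit halves are treated as unsigned, and the 40-bit states wrap on overflow.

```c
#include <stddef.h>
#include <stdint.h>

#define ACC40_MASK ((UINT64_C(1) << 40) - 1)

/* Hypothetical model of the two 40-bit states ACC1 and ACC2. */
typedef struct {
    uint64_t acc1;
    uint64_t acc2;
} pmac_state;

/* One PMAC: low halves accumulate into ACC1, high halves into ACC2. */
static void pmac_model(pmac_state *s, uint32_t ars, uint32_t art)
{
    uint64_t lo = (uint64_t)(ars & 0xFFFFu) * (uint64_t)(art & 0xFFFFu);
    uint64_t hi = (uint64_t)(ars >> 16) * (uint64_t)(art >> 16);
    s->acc1 = (s->acc1 + lo) & ACC40_MASK;
    s->acc2 = (s->acc2 + hi) & ACC40_MASK;
}

/* Apply PMAC across two arrays of packed 16-bit pairs. */
static void pmac_dot(pmac_state *s, const uint32_t *a, const uint32_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        pmac_model(s, a[i], b[i]);
}
```

Each call to `pmac_model` does the work of two scalar multiply-accumulates, which is the point of packing two 16-bit values per register.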
20
Productivity Gain – language + compiler
Select processor options
Describe new instructions
Using the Xtensa processor generator, create...
- Tailored, synthesizable HDL uP core
- Customized compiler, assembler, linker, debugger, simulator
[Figure: configurable blocks: ALU, Pipe, I/O, Timer, MMU, Register File, Cache]
In minutes!
21
Productivity Gain – Software Tools
Select processor options
Describe new instructions
Using the Xtensa processor generator, create...
- Tailored, synthesizable HDL uP core
- Customized compiler, assembler, linker, debugger, simulator
[Figure: configurable blocks: ALU, Pipe, I/O, Timer, MMU, Register File, Cache]
22
Software Support – Assembler
RF0
FU0
Decoder Control ACC1 ACC2
∗ + ∗ +
- Assembler
- Custom data type
- Register allocation
- Code Scheduling
- RTOS
- Simulator/debugger
loop a2, .L1
    l16si a10, a3, 0
    l16si a11, a3, 2
    addi.n a3, a3, 2
    PMAC a10, a11
.L1:
23
Software Support – custom data type
RF0
FU0
Decoder Control ACC1 ACC2
∗ + ∗ +
- Assembler
- Custom data type
- Register allocation
- Code Scheduling
- RTOS
- Simulator/debugger
C code:
    sat_int x, y, z;
    z = sat_add(x, y);
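The slide does not define `sat_int` or `sat_add`, so the following is only a plain-C sketch of what a saturating add could compute. It assumes a 32-bit signed representation; `sat_int_model` and `sat_add_model` are hypothetical names, not the TIE-generated type or intrinsic.

```c
#include <stdint.h>

/* Assumption: the custom type behaves like a 32-bit signed integer. */
typedef int32_t sat_int_model;

/* Saturating add: clamp to the representable range instead of wrapping. */
static sat_int_model sat_add_model(sat_int_model x, sat_int_model y)
{
    int64_t sum = (int64_t)x + (int64_t)y;   /* cannot overflow in 64 bits */
    if (sum > INT32_MAX)
        return INT32_MAX;
    if (sum < INT32_MIN)
        return INT32_MIN;
    return (sat_int_model)sum;
}
```

With a custom data type, the compiler can map the C-level call to a single instruction instead of the compare-and-clamp sequence shown here.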
24
Software Support – register allocation
RF0
FU0
Decoder Control ACC1 ACC2
∗ + ∗ +
- Assembler
- Custom data type
- Register allocation
- Code Scheduling
- RTOS
- Simulator/debugger
Spilling around a call:
    sat_add s3, s1, s2
    sat_store s3, a1, 0
    call8 foo
    sat_load s3, a1, 0
25
Software Support – code scheduling
RF0
FU0
Decoder Control ACC1 ACC2
∗ + ∗ +
- Assembler
- Custom data type
- Register allocation
- Code Scheduling
- RTOS
- Simulator/debugger
C code:
    t = sat_mult(x, y);
    z = sat_add(z, t);
    t2 = sat_mult(x2, y2);
Scheduled code:
    sat_mult s3, s1, s2
    sat_mult s6, s5, s4
    sat_add s7, s7, s3
26
Software Support - RTOS
RF0
FU0
Decoder Control ACC1 ACC2
∗ + ∗ +
- Assembler
- Custom data type
- Register allocation
- Code Scheduling
- RTOS
- Simulator/debugger
Task0: s0, s1, … s15
Task1: s0, s1, … s15
Context switch: sat_store to memory, sat_load from memory
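The context-switch picture can be sketched in C as saving the outgoing task's custom registers to memory and loading the incoming task's copy. Everything here is hypothetical and for illustration only: the structure names, the 32-bit register width, and the use of an ordinary struct as a stand-in for the hardware register file.

```c
#include <stdint.h>

#define NUM_SAT_REGS 16   /* s0..s15, as drawn on the slide */

/* Stand-in for the custom register file (real hardware state in practice). */
typedef struct {
    int32_t s[NUM_SAT_REGS];
} sat_regfile;

/* Per-task save area in memory (hypothetical layout). */
typedef struct {
    int32_t saved[NUM_SAT_REGS];
} task_context;

/* Save the outgoing task's registers, then restore the incoming task's. */
static void context_switch(sat_regfile *rf, task_context *from,
                           const task_context *to)
{
    int i;
    for (i = 0; i < NUM_SAT_REGS; i++)
        from->saved[i] = rf->s[i];   /* models one sat_store per register */
    for (i = 0; i < NUM_SAT_REGS; i++)
        rf->s[i] = to->saved[i];     /* models one sat_load per register */
}
```

The RTOS needs this code regenerated whenever the register file changes, which is why the tool flow has to extend it along with the compiler.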
27
Software Support – simulator/debugger
RF0
FU0
Decoder Control ACC1 ACC2
∗ + ∗ +
gdb> break …
gdb> cont
gdb> step
gdb> display …
- Assembler
- Custom data type
- Register allocation
- Code Scheduling
- RTOS
- Simulator/debugger
? ? ?
28
Outline
- Configurable processors
  - Architecture
  - Instruction extension
  - Software support
- An example
- Results
- Summary
29
Data Encryption Standard (DES)
Initial step
    (R, L) = Initial_permutation(Din64)
Iterate 16 times
  Key generation
    (C, D) = PC1(k)
    n = rotate_amount (function of iteration count)
    C = rotate_right(C, n)
    D = rotate_right(D, n)
    K = PC2(D, C)
  Encryption
    R_{i+1} = L_i ⊕ Permutation(S_Box(K ⊕ Expansion(R_i)))
    L_{i+1} = R_i
Final step
    Dout64 = Final_permutation(L, R)
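The 16-round iteration has the shape of a Feistel network, and that shape alone explains why the same datapath serves both encryption and decryption. The sketch below shows only the round structure. It is not DES: `toy_round` is a made-up stand-in for Permutation(S_Box(K ⊕ Expansion(R))), the round keys are supplied by the caller, and the initial and final permutations are omitted.

```c
#include <stdint.h>

#define ROUNDS 16

typedef struct {
    uint32_t l;
    uint32_t r;
} block64;

/* Toy round function (an assumption for illustration, NOT the DES function). */
static uint32_t toy_round(uint32_t r, uint32_t k)
{
    uint32_t x = (r ^ k) * UINT32_C(0x9E3779B1);
    return (x << 7) | (x >> 25);
}

/* Forward rounds: R_{i+1} = L_i ^ f(R_i, K_i), L_{i+1} = R_i. */
static block64 feistel_encrypt(block64 b, const uint32_t keys[ROUNDS])
{
    for (int i = 0; i < ROUNDS; i++) {
        uint32_t next_r = b.l ^ toy_round(b.r, keys[i]);
        b.l = b.r;
        b.r = next_r;
    }
    return b;
}

/* Inverse rounds: same round function, keys applied in reverse order. */
static block64 feistel_decrypt(block64 b, const uint32_t keys[ROUNDS])
{
    for (int i = ROUNDS - 1; i >= 0; i--) {
        uint32_t prev_l = b.r ^ toy_round(b.l, keys[i]);
        b.r = b.l;
        b.l = prev_l;
    }
    return b;
}
```

Because the round function is only ever XORed in, it never needs to be inverted; decryption differs from encryption only in the order of the round keys.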
30
DES: Software Implementation
static unsigned permute(unsigned char *table, int n,
                        unsigned hi, unsigned lo)
{
    int ib, ob;
    unsigned out = 0;
    for (ob = 0; ob < n; ob++) {
        ib = table[ob] - 1;   /* table entries are 1-based bit positions */
        if (ib >= 32) {
            if (hi & (1u << (ib - 32)))
                out |= 1u << ob;
        } else {
            if (lo & (1u << ib))
                out |= 1u << ob;
        }
    }
    return out;
}
Too much computation!
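To make the cost concrete, the routine can be exercised with a small table. The example below repeats `permute` so that it compiles on its own and adds a hypothetical 8-entry table, `reverse8`, that reverses the low byte. It assumes `unsigned` is at least 32 bits wide. None of the tables here are DES tables.

```c
/* Table-driven bit permutation, as in the software implementation:
 * table[ob] is the 1-based input bit that feeds output bit ob.
 * Input bits 1..32 come from lo, bits 33..64 from hi. */
static unsigned permute(const unsigned char *table, int n,
                        unsigned hi, unsigned lo)
{
    unsigned out = 0;
    for (int ob = 0; ob < n; ob++) {
        int ib = table[ob] - 1;
        if (ib >= 32) {
            if (hi & (1u << (ib - 32)))
                out |= 1u << ob;
        } else {
            if (lo & (1u << ib))
                out |= 1u << ob;
        }
    }
    return out;
}

/* Hypothetical table: output bit k takes input bit 7-k (0-based). */
static const unsigned char reverse8[8] = {8, 7, 6, 5, 4, 3, 2, 1};

static unsigned reverse_low_byte(unsigned x)
{
    return permute(reverse8, 8, 0u, x);
}
```

Every output bit costs a table load, a compare, a shift, a test, and a conditional OR, whereas in hardware a fixed permutation is only wiring.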
32
DES: Hardware Implementation
Initial Permutation Expansion Permutation S Boxes P Permutation
⊕ ⊕
Final Permutation Key Generation State Machine
Complicated control logic!
34
DES: SETDATA instruction
SETDATA ars, art Initial Permutation Expansion Permutation S Boxes P Permutation
⊕ ⊕
Final Permutation Key Generation State Machine
35
DES: SETKEY instruction
Initial Permutation Expansion Permutation S Boxes P Permutation
⊕ ⊕
Final Permutation Key Generation State Machine SETKEY ars, art
36
DES: DES instruction
DES immediate Initial Permutation Expansion Permutation S Boxes P Permutation
⊕ ⊕
Final Permutation Key Generation State Machine
37
DES: GETDATA instruction
GETDATA ars, hilo Initial Permutation Expansion Permutation S Boxes P Permutation
⊕ ⊕
Final Permutation Key Generation State Machine
38
DES: Putting it together
GETDATA ars, hilo DES immediate SETDATA ars, art Initial Permutation Expansion Permutation S Boxes P Permutation
⊕ ⊕
Final Permutation Key Generation State Machine SETKEY ars, art
39
DES: Improved Program
Decryption:
SETKEY(K_hi, K_lo);
for (;;) {
    … /* read encrypted data */
    SETDATA(D_hi, D_lo);
    DES(DECRYPT1);
    DES(DECRYPT2); DES(DECRYPT2); DES(DECRYPT2);
    DES(DECRYPT2); DES(DECRYPT2); DES(DECRYPT2);
    DES(DECRYPT1);
    DES(DECRYPT2); DES(DECRYPT2); DES(DECRYPT2);
    DES(DECRYPT2); DES(DECRYPT2); DES(DECRYPT2);
    DES(DECRYPT1);
    DES(DECRYPT1);
    E_hi = GETDATA(hi);
    E_lo = GETDATA(lo);
    … /* write data */
}

Encryption:
SETKEY(K_hi, K_lo);
for (;;) {
    … /* read data */
    SETDATA(D_hi, D_lo);
    DES(ENCRYPT1);
    DES(ENCRYPT1);
    DES(ENCRYPT2); DES(ENCRYPT2); DES(ENCRYPT2);
    DES(ENCRYPT2); DES(ENCRYPT2); DES(ENCRYPT2);
    DES(ENCRYPT1);
    DES(ENCRYPT2); DES(ENCRYPT2); DES(ENCRYPT2);
    DES(ENCRYPT2); DES(ENCRYPT2); DES(ENCRYPT2);
    DES(ENCRYPT1);
    E_hi = GETDATA(hi);
    E_lo = GETDATA(lo);
    … /* write encrypted data */
}
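The pattern of immediates in the two loops lines up with the standard DES key-rotation schedule, if one reads the suffix 1 or 2 as the number of positions the 28-bit key halves are rotated in that round. That reading is an assumption; the slides do not spell out the immediate encoding. Under it, the sequences can be written as tables and checked with the hypothetical helpers `total_rotation` and `is_reverse`:

```c
enum { DES_ROUNDS = 16 };

/* Rotation amount per round as issued by the encryption loop
 * (ENCRYPT1 -> 1, ENCRYPT2 -> 2; this mapping is an assumption). */
static const int encrypt_rot[DES_ROUNDS] =
    {1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1};

/* Rotation amount per round as issued by the decryption loop. */
static const int decrypt_rot[DES_ROUNDS] =
    {1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 1};

/* Sum of all rotations across the 16 rounds. */
static int total_rotation(const int rot[DES_ROUNDS])
{
    int sum = 0;
    for (int i = 0; i < DES_ROUNDS; i++)
        sum += rot[i];
    return sum;
}

/* Returns 1 when b is a in reverse order. */
static int is_reverse(const int a[DES_ROUNDS], const int b[DES_ROUNDS])
{
    for (int i = 0; i < DES_ROUNDS; i++)
        if (a[i] != b[DES_ROUNDS - 1 - i])
            return 0;
    return 1;
}
```

The rotations sum to 28, the width of each key half, so the key state returns to its starting position after one block, which would explain why SETKEY sits outside the loop. The decryption sequence is the encryption sequence reversed.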
40
DES: Summary
Add 4 TIE instructions:
- 80 lines of TIE description
- No cycle time impact
- ~1700 additional gates
- Code size reduced
[Chart: DES Performance. Speedup (X) by block size (bytes): 1024, 64, 8, Mean. Values shown: 43, 50, 53, 72]
41
Outline
- Configurable processors
  - Architecture
  - Instruction extension
  - Software support
- An example
- Results
- Summary
42
Improvement over general purpose 32b RISC
[Chart: improvement in MIPS or MIPS/Watt, scale 1x to 55x]
- JPEG (image compression)
- Motion Estimation (video conferencing)
- FIR filter (signal processing)
- Viterbi Decoding (wireless communication)
- DES (content encryption)
Gate counts shown: Base + 7500 gates, Base + 6500 gates, Base + 900 gates, Base + 1000 gates, Base + 1700 gates
43
What is “EEMBC”?
EDN Embedded Microprocessor Benchmark Consortium
- Pronounced “Embassy”
- Non-profit consortium, funded by over 40 members
  - Including: ARM, AMD, IBM, Intel, LSI Logic, MIPS, Motorola, National Semi, NEC, TI, Toshiba… Tensilica, and more…
- Objective: provide independently certified benchmark scores relevant to deeply embedded processor applications
- Independent laboratory recreates and certifies all benchmark results: no tricks
- Five different benchmark suites
  - Each suite comprises a range (five to sixteen) of benchmarks representative of that product category
  - Example: Consumer: image compression, image filtering, color conversion
44
EEMBC Networking Benchmark
[Chart: Netmark Performance (scale 2 to 14) and Netmark Efficiency in Netmark/MHz (scale 0.000 to 0.045)]
Processors shown: IDT 32334/100, IDT79RC32364/100, NEC V832-143, AMD ElanSC520/133, Toshiba TMPR3927F-GH189/133, IDT79RC32V334-150, Toshiba TMPR3927F-GHM2000/133, NEC VR5432-167, Xtensa/200, IDT79RC64575IDtc/250, NEC VR5000, IDT79RC64575Algor/250, AMD K6-2/450, AMD K6-2E/400, Xtensa Optimized/200, AMD K6-2E+/500, AMD K6-IIIE+/550
Colors: Blue-Xtensa, Green-Desktop x86s, Maroon-64b RISCs, Orange-32b RISCs
- Comparable in Netmark to high-end desktop CPUs
- 2x in Netmark/MHz
- 59K total gates at 200MHz
45
EEMBC Telecom Benchmark
[Chart: Telemark Performance (scale 10 to 90) and Telemark Efficiency in Telemark/MHz (scale 0.000 to 0.450)]
Processors shown: AMD ElanSC520/133, IDT 32334-100, Analog Devices 21065L/60, NEC V832-143, IDT79RC32V334-150, Xtensa/200, NEC VR5432-167, IDT79RC64575Algor/250, NEC VR5000, AMD K6-2E/400, TI TMS320C6203/300, AMD K6-2E+/500, AMD K6-III+/550, IBM PowerPC750CX/500, TI TMS320C6203 C opt/300, TI TMS320C6203 Optimized/300, Xtensa Optimized/200
Colors: Blue-Xtensa, Green-Desktop x86s, Maroon-64b RISCs, Orange-32b RISCs, Gray-DSPs
- Beats all processors, including hand-optimized TI C6x
- 180K total gates at 200MHz
46
EEMBC Consumer Benchmark
[Chart: Consumermark Performance (scale 20 to 200) and Consumermark Efficiency in Consumermark/MHz (scale 0.00 to 1.00)]
Processors shown: ST20C2/50, AMD ElanSC520/133, NEC V832/143, National Geode GX1/200, NEC VR5432/167, Xtensa/200, NEC VR5000/250, AMD K6-2E/400, AMD K6-2E+/500, AMD K6-III+/550, Xtensa Optimized/200
Colors: Blue-Xtensa, Green-Desktop x86s, Maroon-64b RISCs, Orange-32b RISCs
- 6x in Consumermark and 12x in Consumermark/MHz
- 127K total gates at 200MHz
47
Summary
Optimality/ integration (e.g. mW, $) Flexibility/modularity (e.g. time-to-market)
ASIC FPGA
∆ ~10x ∆ ~10x
Instruction-set Configurable Processor Traditional Processor FPGA + Processor
48
Summary
Optimality/ integration (e.g. mW, $) Flexibility/modularity (e.g. time-to-market)
ASIC FPGA
∆ ~10x ∆ ~10x
Instruction-set Configurable Processor
General Processor FPGA + Processor
49
Summary
Optimality/ integration (e.g. mW, $) Flexibility/modularity (e.g. time-to-market)
ASIC FPGA
∆ ~10x ∆ ~10x
Traditional Processor FPGA + Processor
Instruction-set Configurable Processor
Benefit of SoC integration
- Higher bandwidth
- Lower cost
- Lower power
Benefit of IS configuration
- A cost-effective computing platform
Benefit of TIE compiler and SW tools
- Faster time-to-market
- Lower development cost
- Lower risk
50