slide-1
SLIDE 1

F-CPU: Year 4

Cedric Bail, Nicolas Boulay, Yann Guidon

F-CPU 19C3 presentation – p.1/64

slide-2
SLIDE 2

Plan

F-CPU 4 dummies
A simple SIMD character comparison
Another example: arbitrary byte shuffling in one byte
The hardware design flow
TCPA
Design
Call convention

F-CPU 19C3 presentation – p.2/64

slide-3
SLIDE 3

F-CPU 4 dummies
Yann Guidon

F-CPU 19C3 presentation – p.3/64

slide-4
SLIDE 4

Introduction

Goal: to design a microprocessor that can be used and modified by anyone, without industrial pressure
<RMS_beard=on> It’s all about freedom: this is ‘Freedom CPU’, not ‘Free CPU’
‘Year 4’ means 4th presentation to CCC and 4th year of existence

F-CPU 19C3 presentation – p.4/64

slide-5
SLIDE 5

Architecture

F-CPU is designed ‘from scratch’ and is not compatible with existing computers
The architecture is aimed at high efficiency for computation-intensive software
RISC features and methods:
  Fixed-size 32-bit instructions
  64 registers of 64 bits
  Load-store architecture
  No stack
  Register #0 is hardwired to 0
  Conditional move and jump/call/return
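As a rough illustration of the register model listed above, here is a minimal C sketch of a 64-entry register file whose register 0 always reads as zero. The names regfile, reg_read and reg_write are hypothetical, and discarding writes to register 0 is an assumption; the slide only says the register is hardwired to 0.

```c
#include <assert.h>
#include <stdint.h>

#define FCPU_NUM_REGS 64           /* 64 registers, per the slide */

typedef struct {
    uint64_t r[FCPU_NUM_REGS];     /* modelled at the minimum width of 64 bits */
} regfile;

/* Reading register 0 always yields 0, whatever the array holds. */
static uint64_t reg_read(const regfile *rf, unsigned n)
{
    assert(n < FCPU_NUM_REGS);
    return n == 0 ? 0 : rf->r[n];
}

/* Assumption: a write to register 0 is simply discarded. */
static void reg_write(regfile *rf, unsigned n, uint64_t value)
{
    assert(n < FCPU_NUM_REGS);
    if (n != 0)
        rf->r[n] = value;
}
```

With this model, a "move r0, rX" clears rX and any instruction can use r0 as a constant-zero operand.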

F-CPU 19C3 presentation – p.5/64

slide-6
SLIDE 6

Data types

Beware! A register is not equivalent to a number!
Registers are ‘at least’ 64 bits wide
Registers can have more than 64 bits!
It is simpler and more efficient to enlarge the registers than to decode more instructions per cycle (decoding and control logic would explode)
Register sizes can be any power of 2: 128, 256, 512, or even 32768 bits (in theory)

F-CPU 19C3 presentation – p.6/64

slide-7
SLIDE 7

Data types (2)

scalar data: aligned to the LSB, all MSB are cleared
  8, 16, 32 and 64-bit integers are supported
pointers: like scalar data, but the number of valid LSB is not known (depends on the implementation, could be 30 or 50)
SIMD data: 2**N scalar data
  8x8, 4x16 and 2x32-bit integers are supported for 64-bit implementations
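To make the difference between scalar and SIMD data concrete, the following C sketch adds two 64-bit words lane by lane, for the three lane widths named above. It emulates the idea in software on an ordinary 64-bit integer; simd_add and high_bits are hypothetical names, and nothing here is taken from the F-CPU sources.

```c
#include <assert.h>
#include <stdint.h>

/* Mask with the top bit of every lane set. */
static uint64_t high_bits(unsigned lane_bits)
{
    switch (lane_bits) {
    case 8:  return 0x8080808080808080ULL;
    case 16: return 0x8000800080008000ULL;
    default: return 0x8000000080000000ULL;   /* 32-bit lanes */
    }
}

/* Lane-wise addition: carries never cross a lane boundary. */
static uint64_t simd_add(uint64_t a, uint64_t b, unsigned lane_bits)
{
    assert(lane_bits == 8 || lane_bits == 16 || lane_bits == 32);
    const uint64_t h = high_bits(lane_bits);
    /* Add the low bits of each lane, then patch in the top bit with XOR. */
    return ((a & ~h) + (b & ~h)) ^ ((a ^ b) & h);
}
```

The same two operands give different results depending on the lane width, which is why a register cannot be treated as a single number.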

F-CPU 19C3 presentation – p.7/64

slide-8
SLIDE 8

Instruction Format

F-CPU 19C3 presentation – p.8/64

slide-9
SLIDE 9

FC0

1st implementation: FC0
Statically scheduled (scoreboard-based)
Single-issue core
Out-of-order completion
Many “execution units” around a “crossbar”
“Carpaccio” pipeline stages for higher frequency

F-CPU 19C3 presentation – p.9/64

slide-10
SLIDE 10

Ongoing work

(this is not complete or exhaustive)
VHDL model
C model
Manual
Boot monitor
Gcc port
Assembler
Linker
L4 Linux

F-CPU 19C3 presentation – p.10/64

slide-11
SLIDE 11

Simple SIMD character comparison

F-CPU 19C3 presentation – p.11/64

slide-12
SLIDE 12

The ROP2 (logic) unit

[Figure: ROP2 unit, schematic view for one byte. F-CPU Design Team, (C) Yann Guidon 8/31/2001, version: dec. 2, 2001. Files: rop2_unit.vhdl, rop2_xbar.vhdl. Labels in the schematic: A, B, C, S, FF, LUT, ROP2_function, ROP2_function_bit3, ROP2_mode, partial_MUX, partial_ROP, partial_OR, partial_AND.]

3-level signal amplification tree (1->4->16->64) performed by fanout_tree.
This is only an indication of the equation complexity. The circuit will be synthesised from the parametrised LUT.
The fanout is higher than that: 16 for the 64-bit version. fanout_tree is used to compensate for this.

F-CPU 19C3 presentation – p.12/64

slide-13
SLIDE 13

C example

char a;
...
if (a == TAB || a == CR || a == ' ' || a == 0) {
    ...
}

F-CPU 19C3 presentation – p.13/64

slide-14
SLIDE 14

Assembler example

a in Ra, temporary result in Rtemp, mask in Rmask:

loadaddri end_if, Rjmp ; prefetch
sdup.8 Ra, Rtemp ; duplicate a
loadcons[0] 0x2000, Rmask ; load constants
loadconsx[1] 0x090A, Rmask
xorn.and.32 Rmask, Rtemp, Rtemp
bnz Rtemp, Rjmp
...
end_if:
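The same idea can be sketched in portable C: copy the character into every byte lane, XNOR it with a word holding the four candidate bytes, then AND the bits of each lane together so that a lane ends up non-zero only on an exact match. The function names are hypothetical, and the exact behaviour of xorn.and.32 and of the branch is assumed rather than taken from the F-CPU manual. The two constants in the listing pack the bytes 0x20, 0x00, 0x09 and 0x0A; in ASCII, 0x0A is line feed, whereas carriage return is 0x0D.

```c
#include <stdint.h>

/* Copy one byte into the four byte lanes of a 32-bit word. */
static uint32_t dup8(uint8_t a)
{
    return (uint32_t)a * 0x01010101u;
}

/* Returns 1 if a equals any of the four bytes packed in mask, else 0. */
static int matches_any(uint8_t a, uint32_t mask)
{
    uint32_t x = ~(dup8(a) ^ mask);   /* XNOR: a lane is 0xFF iff it matches */

    /* AND the eight bits of each lane into bit 0 of that lane. */
    x &= x >> 4;
    x &= x >> 2;
    x &= x >> 1;

    return (x & 0x01010101u) != 0;    /* any lane fully equal? */
}
```

A call such as matches_any(c, 0x090A2000u) replaces the four comparisons of the C example with a handful of logic operations and a single test.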

F-CPU 19C3 presentation – p.14/64

slide-15
SLIDE 15

Arbitrary byte shuffling in one byte

F-CPU 19C3 presentation – p.15/64

slide-16
SLIDE 16

Random shuffling example

0 -> 3
1 -> 2
2 -> 4
3 -> 7
4 -> 5
5 -> 1
6 -> 0
7 -> 6

From this, we generate the following masks:

r3 = mask1 = 0x8040201008040201; // linear bit selection
r5 = mask2 = 0x4001028020100408; // permuted mask

F-CPU 19C3 presentation – p.16/64

slide-17
SLIDE 17

The assembly language source

sdup.b r1, r2 ; duplicate r1 into r2
and.or r2, r3, r4 ; first mask and combine
and r4, r5, r6 ; second mask
shri 32, r6, r7 ; gather the bits in log2
or r6, r7
shri 16, r6, r7
or r6, r7
shri 8, r6, r7
or r6, r7

9 instructions for shuffling 8 bits: this yields almost 1 instruction per bit!
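The sequence can be followed step by step in a C sketch that works on a plain 64-bit integer. The names SLIDE_PERM, permuted_mask and shuffle_bits are hypothetical, and the per-lane widening loop stands in for what the combined and.or instruction is assumed to do. The permuted mask is built from the mapping table rather than typed in; built that way it comes out as 0x4001022080100408, which differs from the constant printed on the slide in lanes 3 and 4 (those two bytes appear swapped there).

```c
#include <assert.h>
#include <stdint.h>

/* Destination bit for each source bit, copied from the mapping table. */
static const unsigned SLIDE_PERM[8] = { 3, 2, 4, 7, 5, 1, 0, 6 };

/* Lane i keeps only bit i: the "linear bit selection" mask. */
#define MASK1 0x8040201008040201ULL

/* Build the permuted mask: lane i holds the single bit perm[i]. */
static uint64_t permuted_mask(const unsigned perm[8])
{
    uint64_t m = 0;
    for (unsigned i = 0; i < 8; i++) {
        assert(perm[i] < 8);
        m |= (uint64_t)1 << (8 * i + perm[i]);
    }
    return m;
}

static uint8_t shuffle_bits(uint8_t x, uint64_t mask2)
{
    uint64_t r = (uint64_t)x * 0x0101010101010101ULL;  /* copy x into all 8 lanes */
    r &= MASK1;                                        /* lane i: bit i of x, or 0 */

    /* Widen every non-zero lane to 0xFF (the per-lane combine step). */
    uint64_t full = 0;
    for (unsigned i = 0; i < 8; i++)
        if ((r >> (8 * i)) & 0xFF)
            full |= (uint64_t)0xFF << (8 * i);

    r = full & mask2;   /* lane i: its destination bit, or 0 */
    r |= r >> 32;       /* fold the 8 lanes together in log2(8) = 3 steps */
    r |= r >> 16;
    r |= r >> 8;
    return (uint8_t)r;
}
```

Changing the permutation only changes the second mask; the instruction sequence stays the same, which is what makes the shuffle "arbitrary".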

F-CPU 19C3 presentation – p.17/64

slide-18
SLIDE 18

Powerup and BIST method

F-CPU 19C3 presentation – p.18/64

slide-19
SLIDE 19

The FC0 pipeline

[Figure: the FC0 pipeline. Labels: ROP2, SHL, INC, ASU, Register Set.]

F-CPU 19C3 presentation – p.19/64

slide-20
SLIDE 20

Popcount unit and LFSR

[Figure: the FC0 pipeline with BIST additions. Labels: ROP2, SHL, INC, ASU, Register Set, Signal Generator, compact signature, generate signature.]

F-CPU 19C3 presentation – p.20/64

slide-21
SLIDE 21

Popcount unit and LFSR

[Figure: XOR, POPCOUNT, MUX and LFSR blocks, with bus widths of 64 and 6 bits.]

F-CPU 19C3 presentation – p.21/64

slide-22
SLIDE 22

The hardware design flow
Nicolas Boulay

F-CPU 19C3 presentation – p.22/64

slide-23
SLIDE 23

A transistor

F-CPU 19C3 presentation – p.23/64

slide-24
SLIDE 24

A real transistor

F-CPU 19C3 presentation – p.24/64

slide-25
SLIDE 25

A wafer

F-CPU 19C3 presentation – p.25/64

slide-26
SLIDE 26

Some ASIC

F-CPU 19C3 presentation – p.26/64

slide-27
SLIDE 27

Another ASIC

F-CPU 19C3 presentation – p.27/64

slide-28
SLIDE 28

FPGA principle

F-CPU 19C3 presentation – p.28/64

slide-29
SLIDE 29

Making hardware

FPGA (field-programmable gate array)
Semi-custom, full custom (ASIC, Application-Specific Integrated Circuit)

F-CPU 19C3 presentation – p.29/64

slide-30
SLIDE 30

Design IP (or a core)

Nowadays, what used to be put on the mainboard is put on the same die (piece of silicon). Components are replaced by cores to create a System-on-Chip (SoC). F-CPU is a core. So a SoC could be made of a Fritz chip + F-CPU.

F-CPU 19C3 presentation – p.30/64

slide-31
SLIDE 31

TCPA

F-CPU 19C3 presentation – p.31/64

slide-32
SLIDE 32

GPL

Depending on the licence, we could be obliged to open all sources. But then the cores risk not being used (imagine if Linux were not allowed to run proprietary software). And seeing the code would not necessarily help to break the protection.

F-CPU 19C3 presentation – p.32/64

slide-33
SLIDE 33

LGPL

Only the core is protected, as with the Leon (a SPARC V8 clone).

F-CPU 19C3 presentation – p.33/64

slide-34
SLIDE 34

GPL+proprietary interface

Like the Linux kernel, we could choose to open certain interfaces (like the I/O bus but not the SDRAM bus).

F-CPU 19C3 presentation – p.34/64

slide-35
SLIDE 35

Licence

But the licence is a constant flamewar on the mailing list. The GPL is currently used, but it is too restrictive from my point of view. It’s also hard to accept that the GPL could cover hardware, too (something with sources and a "result").

F-CPU 19C3 presentation – p.35/64

slide-36
SLIDE 36

Design

F-CPU 19C3 presentation – p.36/64

slide-39
SLIDE 39

Design cycle

Write HDL, then simulate the RTL code (waveform)
Synthesise it to get a netlist (timing results + number of gates used)
Place and route to get the layout (GDS2 files + more precise timing results + area used (wires))

F-CPU 19C3 presentation – p.37/64

slide-40
SLIDE 40

Simulator

F-CPU sources are compatible with most compilers and have been tested with:
ncsim (Cadence, fastest on the market)
ModelSim
Simili (freeware, slower than ncsim)
ghdl (alpha version) (the story of a guy who wanted to learn Ada and VHDL, so he wrote a VHDL gcc front end in Ada)
ALDEC’s Riviera (nice but proprietary)
Vanilla VHDL (abandonware)

F-CPU 19C3 presentation – p.38/64

slide-41
SLIDE 41

Synthesiser

Design Compiler (Synopsys, 100 Keur/year... for ASIC)
Synplify (Synplicity, for FPGA)
_NO_ free software

F-CPU 19C3 presentation – p.39/64

slide-42
SLIDE 42

Place & Route

Cadence tools
Tendency to be merged with synthesis tools (for <130 nm technology)
Also _NO_ free software

F-CPU 19C3 presentation – p.40/64

slide-43
SLIDE 43

That’s NOT all folks !

Static timing analysis tool to verify synthesis (PrimeTime from Synopsys: 100 Keur/year).
Equivalence checking between the netlist and the RTL code (avoids slooow simulation at gate level).
ATPG (automatic test pattern generator) to create input vectors to test the chip at the fab, covering the maximum number of stuck-at faults with the minimum number of vectors.
BIST generator to test memory.
Formal proof tools to help find bugs in the RTL design.

F-CPU 19C3 presentation – p.41/64

slide-44
SLIDE 44

Tools conclusion

So a lot of free tools are missing!

F-CPU 19C3 presentation – p.42/64

slide-45
SLIDE 45

Call convention
Cedric Bail

F-CPU 19C3 presentation – p.43/64

slide-51
SLIDE 51

F-CPU call capacity

No specialised register
No stack pointer
No specific address pointer
63 general registers
No call
No stack

F-CPU 19C3 presentation – p.44/64

slide-52
SLIDE 52

What we need to do a call

Stack pointer
Return address
Return value
Parameters

F-CPU 19C3 presentation – p.45/64

slide-53
SLIDE 53

C source example

#include <stdio.h>

void hanoi(int N, char* D, char* B, char* I)
{
    if (N == 1)
        printf("move %s to %s", D, B);
    else {
        hanoi(N-1, D, I, B);
        printf("move %s to %s", D, B);
        hanoi(N-1, I, B, D);
    }
}

F-CPU 19C3 presentation – p.46/64

slide-57
SLIDE 57

The first call convention

R0 = always zero
R1-R61 = preserved across call
R62 = return address
R63 = stack pointer

F-CPU 19C3 presentation – p.47/64

slide-58
SLIDE 58

The cost

Before using a register, you need to store it in memory
Before doing a return, you need to load them back from memory

F-CPU 19C3 presentation – p.48/64

slide-60
SLIDE 60

Prologue example

storei -8, [sp], r1
storei -8, [sp], r2
storei -8, [sp], r3
storei -8, [sp], r4
storei -8, [sp], r5
storei -8, [sp], r6
storei -8, [sp], r7
storei -8, [sp], r62
addi 6 * 8, sp, r1
loadi +8, [r1], r2 ; char* I
loadi +8, [r1], r3 ; char* B
loadi +8, [r1], r4 ; char* D
loadi +8, [r1], r5 ; int N

F-CPU 19C3 presentation – p.49/64

slide-61
SLIDE 61

Epilogue example

loadi -8, [sp], r62
loadi -8, [sp], r7
loadi -8, [sp], r6
loadi -8, [sp], r5
loadi -8, [sp], r4
loadi -8, [sp], r3
loadi -8, [sp], r2
loadi -8, [sp], r1

F-CPU 19C3 presentation – p.50/64

slide-62
SLIDE 62

hanoi with first call convention

22 * 64-bit data words are stored
20 * 64-bit data words are loaded
No tail-recursive call

F-CPU 19C3 presentation – p.51/64

slide-63
SLIDE 63

Second call convention

R1-R15 = Parameters
R16-R31 = Temporary (not preserved across call)
R32-R61 = Saved temporary (preserved across call)
R62 = Stack pointer
R63 = Return address

F-CPU 19C3 presentation – p.52/64

slide-64
SLIDE 64

Prologue example

storei -8, [sp], r32
storei -8, [sp], r33
storei -8, [sp], r34
storei -8, [sp], r35
storei -8, [sp], r62

F-CPU 19C3 presentation – p.53/64

slide-65
SLIDE 65

Epilogue example

loadi +8, [sp], r62
loadi +8, [sp], r35
loadi +8, [sp], r34
loadi +8, [sp], r33
loadi +8, [sp], r32

F-CPU 19C3 presentation – p.54/64

slide-66
SLIDE 66

hanoi with second call convention

10 * 64-bit data words are stored
10 * 64-bit data words are loaded
Tail-recursive call
Recursive prologue

F-CPU 19C3 presentation – p.55/64

slide-67
SLIDE 67

Recursive prologue example

storei -8, [sp], r36
storei -8, [sp], r37
loadcons printf, r36
loopentry r37 ; Hanoi really starts here

F-CPU 19C3 presentation – p.56/64

slide-68
SLIDE 68

The maskload/store idea

R1-R15 = Parameters
R16-R31 = Temporary (not preserved across call)
R32-R57 = Saved temporary (preserved across call)
R58 = Mask register
R59 = Pointer to Procedure Linkage Table
R60 = Pointer to Global Offset Table
R61 = Frame pointer
R62 = Stack pointer
R63 = Return address

F-CPU 19C3 presentation – p.57/64

slide-69
SLIDE 69

Prologue example

Will save r48-r52, mr (r58), sp (r62), ra (r63)
Mask bits for r63..r48: 1100 0100 0001 1111 (0xC41F)

move r0, t2
loadcons.3 0xC41F, t2
and mr, t2, t3
maskstore t3, [sp]
move t2, r48
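The mask is simply one bit per register, with bit n standing for register rn. A small C sketch, using a hypothetical helper save_mask, shows how the register list above turns into the bit pattern; for r48-r52, r58, r62 and r63 the upper 16 bits come out as 0xC41F, which is the binary pattern 1100 0100 0001 1111 given on the slide.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: bit n of the result is set when register rn must be saved. */
static uint64_t save_mask(const unsigned *regs, unsigned count)
{
    uint64_t m = 0;
    for (unsigned i = 0; i < count; i++) {
        assert(regs[i] < 64);          /* only r0..r63 exist */
        m |= (uint64_t)1 << regs[i];
    }
    return m;
}
```

Because the registers of interest all sit in r48..r63, the whole mask fits in the top 16-bit slice of the register, which is assumed to be the slice that loadcons.3 fills.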

F-CPU 19C3 presentation – p.58/64

slide-70
SLIDE 70

Epilogue example

maskload r48, [sp]

F-CPU 19C3 presentation – p.59/64

slide-71
SLIDE 71

Problem

Asynchronous Complex Faults Never the same binary with the same code

F-CPU 19C3 presentation – p.60/64

slide-72
SLIDE 72

Solution

But we can do it with conditional load and store.

cstorel t3, [sp], r48
shiftli 1, t3, t3
msubi 8, sp, sp
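One way to read this sequence is as one step of a loop that walks the mask one bit at a time, storing a register and moving the stack pointer only when its bit is set. The slide does not spell out which bit cstorel tests or how msubi is conditioned, so the C sketch below makes its own assumptions: the top bit of the mask is tested, registers are visited from r63 down to r0, and the stack is an array of 64-bit slots growing towards index 0. The name masked_save is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Push regs[n] for every set bit n of mask, scanning from r63 down to r0.
 * sp is the index one past the current top of stack; the new sp is returned.
 */
static size_t masked_save(uint64_t mask, const uint64_t regs[64],
                          uint64_t *stack, size_t sp)
{
    for (int n = 63; n >= 0; n--) {
        if (mask >> 63) {            /* conditional store: the top bit decides */
            assert(sp > 0);
            stack[--sp] = regs[n];   /* conditional stack pointer update */
        }
        mask <<= 1;                  /* bring the next register's bit to the top */
    }
    return sp;
}
```

Every step is an ordinary instruction that either takes effect or does not, so the sequence contains no multi-cycle memory operation that could fault halfway through.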

F-CPU 19C3 presentation – p.61/64

slide-73
SLIDE 73

The current accepted call convention

R1-R15 = Parameters
R16-R31 = Temporary (not preserved across call)
R32-R58 = Saved temporary (preserved across call)
R59 = Pointer to Procedure Linkage Table
R60 = Pointer to Global Offset Table
R61 = Frame pointer
R62 = Stack pointer
R63 = Return address
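For a compiler or assembler author, this table boils down to a lookup from register number to role. The C sketch below encodes the ranges listed above; the enum and function names are hypothetical, and R0 is included as the always-zero register described earlier in the talk.

```c
#include <assert.h>

typedef enum {
    REG_ZERO,      /* r0: hardwired to 0 */
    REG_PARAM,     /* r1-r15 */
    REG_TEMP,      /* r16-r31: not preserved across call */
    REG_SAVED,     /* r32-r58: preserved across call */
    REG_PLT,       /* r59: pointer to Procedure Linkage Table */
    REG_GOT,       /* r60: pointer to Global Offset Table */
    REG_FRAME,     /* r61: frame pointer */
    REG_STACK,     /* r62: stack pointer */
    REG_RETURN     /* r63: return address */
} reg_role;

static reg_role reg_role_of(unsigned r)
{
    assert(r < 64);
    if (r == 0)  return REG_ZERO;
    if (r <= 15) return REG_PARAM;
    if (r <= 31) return REG_TEMP;
    if (r <= 58) return REG_SAVED;
    switch (r) {
    case 59: return REG_PLT;
    case 60: return REG_GOT;
    case 61: return REG_FRAME;
    case 62: return REG_STACK;
    default: return REG_RETURN;
    }
}

/* Only the "saved temporary" range is stated to be preserved across calls. */
static int preserved_across_call(unsigned r)
{
    return reg_role_of(r) == REG_SAVED;
}
```

A prologue generator would only need to save the registers for which preserved_across_call returns 1 and which the function actually modifies.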

F-CPU 19C3 presentation – p.62/64

slide-74
SLIDE 74

Linking solution

Use ELF to carry information on the registers used by each function and on the call graph
Clean address mode
No hidden register
Always the same result with the same code
Always the best result for the binary

F-CPU 19C3 presentation – p.63/64

slide-75
SLIDE 75

Questions ?

Cedric BAIL: cedric.bail@free.fr
Nicolas BOULAY: nico@seul.org
Yann GUIDON: whygee@f-cpu.org

F-CPU 19C3 presentation – p.64/64