
SLIDE 1

ARITH’26 | BOCCO Andrea | 11 June 2019

DYNAMIC PRECISION NUMERICS USING A VARIABLE-PRECISION UNUM TYPE I HW COPROCESSOR

SLIDE 2

INTRODUCTION: STATE OF THE ART

➢ Variable Precision (VP) computing has been investigated as a way to improve the convergence of algorithms, both in:

▪ Software (SW): GMP[2] and MPFR[3]
    ▪ Slow: they might not meet the requirements of high-speed applications
▪ Hardware (HW):
    ▪ Kulisch[4]: large fixed-point accumulator
    ▪ Schulte and Swartzlander[5]: mantissas divided into multiple words

➢ None of the previous works shows how to efficiently store VP Floating Point (FP) numbers in main memory

▪ They support the IEEE 754 FP format in main memory

[1] IEEE 754-2008. 2008. IEEE Standard for Floating-Point Arithmetic. https://doi.org/10.1109/IEEESTD.2008.4610935
[2] Torbjörn Granlund and the GMP development team. 2012. GNU MP: The GNU Multiple Precision Arithmetic Library. https://gmplib.org/
[3] Laurent Fousse, et al. MPFR: A Multiple precision Binary Floating-point Library with Correct Rounding. https://doi.org/10.1145/1236463.1236468
[4] Ulrich Kulisch. 2013. Computer arithmetic and validity: Theory, implementation, and applications.
[5] M. J. Schulte and E. E. Swartzlander. 2000. A family of variable precision interval arithmetic processors. https://doi.org/10.1109/12.859535

SLIDE 3

INTRODUCTION: MY WORK

Our previous work[6]: a VP FP hardware accelerator that:

▪ Supports the UNUM type I format in main memory
▪ Does computation internally with another (hardware-friendly) FP format
▪ Supports Interval Arithmetic (IA)

This work:

▪ Refines the UNUM type I FP format.
▪ Proposes a new VP FP architecture.
▪ Proposes a new programming model.
▪ Benchmarks our system.

[6] A. Bocco, Y. Durand, F. de Dinechin. 2019. SMURF: Scalar Multiple-precision UNUM RISC-V Floating-point Accelerator for Scientific Computing.

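[Figure: Rocket Chip block diagram showing the RISC-V Rocket tile and the UNUM coprocessor attached through RoCC, with labels for the FPU, the LSUs, the scratchpad, the L1 caches ($ L1) and RAM, and numbered callouts 1 to 5.]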

SLIDE 4


OUTLINE

Choice of the memory format: the UNUM type I
Refinements on the UNUM type I FP format
The adopted VP FP Architecture
The programming model
System benchmark: Gauss elimination solver
Conclusions


SLIDE 6

CHOICE OF THE MEMORY FORMAT: THE UNUM TYPE I

We decided to use the UNUM type I FP format in main memory.

It is a self-descriptive FP format made of 6 sub-fields, 3 more than a conventional IEEE 754 FP number.

WHY?
▪ UNUM is a VP FP format
▪ It self-encodes the exponent and fraction field lengths

However, UNUM type I has some peculiarities to be fixed:

▪ How to organize UNUM arrays in main memory
▪ How to organize the UNUM fields in memory

[Figure: UNUM type I field layout, in the order s, e, f, u, es-1, fs-1]

s: sign
e: exponent (es bits)
f: fraction (fs bits)
u: ubit
es-1: exponent size
fs-1: fraction size
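The self-describing layout on this slide can be illustrated with a small decoder. This is a minimal sketch, not the coprocessor's logic: it assumes the field order s, e, f, u, es-1, fs-1 shown here, the usual UNUM type I value rule (IEEE-like bias of 2^(es-1) - 1, with subnormals when e = 0), and a UNUM environment given as the widths `ess` and `fss` of the two size fields. The function name `decode_unum` and the returned dictionary are hypothetical, and the special encodings (infinities, NaNs) are deliberately not handled.

```python
from fractions import Fraction


def decode_unum(bits: str, ess: int, fss: int) -> dict:
    """Decode a finite UNUM type I given as a bit string s|e|f|u|es-1|fs-1.

    ess and fss are the widths of the es-1 and fs-1 fields (both >= 1).
    Special values (infinities, NaNs) are NOT recognized in this sketch.
    """
    if ess < 1 or fss < 1:
        raise ValueError("ess and fss must be at least 1")
    # The size fields have fixed widths and sit at the right end,
    # so they can be read before the variable-length fields.
    fs = int(bits[-fss:], 2) + 1                   # fraction length in bits
    es = int(bits[-(ess + fss):-fss], 2) + 1       # exponent length in bits
    ubit = int(bits[-(ess + fss + 1)])             # 1 means "inexact"
    expected = 1 + es + fs + 1 + ess + fss
    if len(bits) != expected:
        raise ValueError(f"expected {expected} bits, got {len(bits)}")
    sign = int(bits[0])
    e = int(bits[1:1 + es], 2)
    f = int(bits[1 + es:1 + es + fs], 2)
    bias = 2 ** (es - 1) - 1
    if e == 0:   # subnormal: no hidden bit
        magnitude = Fraction(f, 2 ** fs) * Fraction(2) ** (1 - bias)
    else:        # normal: hidden bit of 1
        magnitude = (1 + Fraction(f, 2 ** fs)) * Fraction(2) ** (e - bias)
    value = -magnitude if sign else magnitude
    return {"value": value, "ubit": ubit, "es": es, "fs": fs}
```

For example, in a (2,2) environment the 9-bit string `011100100` splits into s=0, e=11, f=1, u=0, es-1=01, fs-1=00, which this sketch decodes as the exact value 6.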


SLIDE 8

REFINEMENTS ON THE UNUM TYPE I FP FORMAT: UNUM FIELD ORGANIZATION

For a UNUM/ubound that spans multiple addresses in main memory, it is important to have the descriptor fields in the lower addresses.

➢ We have re-organized the order of the fields for UNUM and ubound

[Figure: reorganized field order, with memory drawn from LSB to MSB in words of p bits. UNUM: s u es-1 fs-1 e f. Ubound: s u es-1 fs-1 (left), s u es-1 fs-1 (right), e (left), e (right), f (left), f (right).]
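One consequence of placing the descriptor in the lower addresses is that a loader can tell how many memory words a number occupies after reading only the first word. The sketch below illustrates the idea under stated assumptions: the exact bit positions (s at bit 0, u at bit 1, then es-1 and fs-1) are assumed for illustration and are not taken from the slide, the descriptor is assumed to fit in the first p-bit word, and the function name is hypothetical.

```python
def unum_size_from_first_word(first_word: int, ess: int, fss: int, p: int = 64):
    """Return (total_bits, words) for a descriptor-first UNUM.

    Assumed layout, from the least significant bit upward:
    s (1 bit), u (1 bit), es-1 (ess bits), fs-1 (fss bits), then e and f.
    """
    if 2 + ess + fss > p:
        raise ValueError("descriptor does not fit in the first word")
    es = ((first_word >> 2) & ((1 << ess) - 1)) + 1           # exponent length
    fs = ((first_word >> (2 + ess)) & ((1 << fss) - 1)) + 1   # fraction length
    total_bits = 2 + ess + fss + es + fs
    words = -(-total_bits // p)                               # ceiling division
    return total_bits, words
```

With a (4,8) environment and both size fields at their maximum, the first word alone tells the loader that the number spans 286 bits, that is, five 64-bit words.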

SLIDE 9

REFINEMENTS ON THE UNUM TYPE I FP FORMAT: UNUM ARRAY ORGANIZATION

Handling a two-element UNUM array in main memory with p-bit parallelism

[Figure: two candidate layouts (1 and 2) for an array holding U1 (bit length 2p, words U1_0 and U1_1) and U2 (bit length 3p, words U2_0 to U2_2) at addresses @1' and @2', together with the result U3 = U1*U2 (words U3_0 to U3_2).]

! Array support: guarantee an affine addressing scheme


SLIDE 11

THE ADOPTED VP FP ARCHITECTURE

▪ 1 integer register file (iRF): 32 integer general-purpose registers (GPRs) + pc, in the main processor.
▪ 1 g-bound register file (gRF): 32 entries, in the co-processor.
▪ UNUMs/u-bounds are strictly considered as memory formats:
    ▪ Load operations: load UNUMs/u-bounds from main memory and convert them into internal g-bounds.
    ▪ Store operations: convert internal g-bounds (entries of the internal gRF) into u-bounds and store the latter in main memory.
▪ The coprocessor's internal parallelism is fixed to 64 bits
▪ Coprocessor status registers: DUE, SUE, MBB, WGP

[Figure: Rocket Chip block diagram showing the RISC-V Rocket tile and the UNUM coprocessor attached through RoCC, with labels for the FPU, the LSUs, the scratchpad, the L1 caches ($ L1) and RAM, numbered callouts 1 to 5, and a "NEW!" marker.]

SLIDE 12

THE MBB: MAXIMUM BYTE BUDGET

UNUM format is variable length (up to a maximum length)

▪ It is impossible to have compact arrays with random access to their elements

➢ We define the Maximum Byte Budget (MBB) as the maximum length that a UNUM number can have in main memory

➢ The user can address VP FP numbers by specifying their length with byte granularity.

[Figure: in the LSU, g-bounds g0 to g4 pass through G2U to become UNUMs u0 to u4, and through BMF to become u'0 to u'4, each occupying a slot of MBB bytes.]
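Giving every element the same byte budget is what makes random access possible: the address of element i becomes an affine function of i. A minimal sketch of that addressing, assuming a flat byte-addressed memory modeled as a `bytearray` and zero padding of unused bytes (both are illustration choices, and the helper names are hypothetical):

```python
def element_address(base: int, index: int, mbb: int) -> int:
    """Affine addressing: every element occupies exactly mbb bytes."""
    return base + index * mbb


def store_element(memory: bytearray, base: int, index: int, mbb: int,
                  payload: bytes) -> None:
    """Store a variable-length payload into its fixed-size MBB slot."""
    if len(payload) > mbb:
        raise ValueError("payload exceeds the Maximum Byte Budget")
    start = element_address(base, index, mbb)
    # Pad with zero bytes so the slot always has the same size.
    memory[start:start + mbb] = payload + bytes(mbb - len(payload))
```

A three-byte value stored at index 1 with `mbb=8` lands at byte offset 8, regardless of how long the element at index 0 actually is.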

SLIDE 13

THE BMF: BOUNDED MEMORY FORMAT

[Figure: BMF encodings over a bit length of MBB*8, for two cases. Case a, MBB >= max UNUM length: rows 1a to 9a. Case b, MBB < max UNUM length: rows 1b to 7b. Each row gives the fields s, u, es-1, fs-1, e, f and any unused bits. The rows cover the special encodings (infinities, left and right interval endpoints at -∞ and +∞, sNaN, qNaN) and the general case, with es_max and fs_max marking the maximum field sizes and ess', fss', ess'', fss'' marking the size-field widths.]
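Which of the two cases applies depends on how the chosen MBB compares with the longest UNUM of the current environment. The sketch below uses the standard UNUM type I length (sign, ubit, the two size fields, and exponent and fraction at their maximum sizes); whether the refined format adds anything to this count is not stated on the slide, so the formula is an assumption here, and the function names are hypothetical.

```python
def max_unum_bits(ess: int, fss: int) -> int:
    """Longest UNUM type I in an (ess, fss) environment, in bits."""
    # sign + ubit + size fields + largest exponent + largest fraction
    return 2 + ess + fss + 2 ** ess + 2 ** fss


def mbb_holds_any_unum(ess: int, fss: int, mbb: int) -> bool:
    """True when MBB bytes are enough for every UNUM of the environment."""
    return mbb * 8 >= max_unum_bits(ess, fss)
```

For the (4,8) environment this gives 286 bits, so an MBB of 36 bytes falls in the first case and an MBB of 35 bytes in the second.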


SLIDE 15

THE COPROCESSOR PROGRAMMING MODEL

Our hardware is best suited for VP kernels which exploit three different storage types:

The external (main memory) storage
The intermediate (L1 cache) storage
The internal (register-level) storage

Example kernel (iterative solver for A · x = b):

k = 0
while convergence not reached do
    for i := 1:n do
        σ = 0
        for j := 1:n do
            if j ≠ i then
                σ += a_ij x_j^(k)
            end
        end
        x_i^(k+1) = (1 / a_ii) (b_i − σ)
    end
    k = k + 1
end

[Figure: the system A · x = b with the loop levels marked (legend: outermost loop, intermediate loop, innermost loop), next to a block diagram of the Rocket tile (RISC-V, FPU, LSU, L1 cache, RAM) and the UNUM coprocessor (RoCC, LSU, scratchpad), with σ shown inside the UNUM coprocessor and numbered callouts 1 to 3.]
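The kernel on this slide reads as a Jacobi-style iteration, and the idea of keeping the accumulator at a higher precision than the values stored in memory can be tried in software. The following is only a software analogy, not the coprocessor's programming interface: Python's `decimal` module stands in for variable precision, `acc_digits` loosely plays the role of the register-level precision and `mem_digits` the role of the precision kept in memory. All names and default values are hypothetical.

```python
from decimal import Decimal, localcontext


def jacobi(a, b, iterations=100, acc_digits=50, mem_digits=20):
    """Jacobi iteration with a wide accumulator and narrower stored values."""
    a = [[Decimal(v) for v in row] for row in a]
    b = [Decimal(v) for v in b]
    n = len(b)
    x = [Decimal(0)] * n
    for _ in range(iterations):              # outermost loop
        x_next = []
        for i in range(n):                   # intermediate loop
            with localcontext() as ctx:
                ctx.prec = acc_digits        # wide, register-like precision
                sigma = Decimal(0)
                for j in range(n):           # innermost loop
                    if j != i:
                        sigma += a[i][j] * x[j]
                xi = (b[i] - sigma) / a[i][i]
            with localcontext() as ctx:
                ctx.prec = mem_digits        # narrower, memory-like precision
                x_next.append(+xi)           # unary plus rounds to ctx.prec
        x = x_next
    return x
```

On the diagonally dominant system 4x + y = 1, 2x + 3y = 2 the iterates approach x = 0.1 and y = 0.6; lowering `mem_digits` shows how the stored precision, rather than the accumulator, ends up limiting the result.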


SLIDE 17

SYSTEM BENCHMARK: GAUSS ELIMINATION SOLVER

Our system, benchmarked with a Gauss elimination solver in both UNUM (scalar) and ubound (interval) modes, showed:

▪ A gain of up to 65 decimal digits over IEEE double
▪ The result precision is constrained by the precision adopted in memory.
▪ Intervals do not always converge, but they are useful for estimating the computational error (Ax-b).
▪ A speed-up of 4-10x with respect to the MPFR software library
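The error estimate mentioned here, Ax-b, is only meaningful if it is evaluated more precisely than the solution itself. As a rough software illustration (not the benchmark code, which is not shown in the slides), the residual can be computed exactly with rational arithmetic; the function name is hypothetical.

```python
from fractions import Fraction


def residual(a, x, b):
    """Exact residual A x - b, computed with rational arithmetic."""
    return [
        sum(Fraction(aij) * Fraction(xj) for aij, xj in zip(row, x)) - Fraction(bi)
        for row, bi in zip(a, b)
    ]
```

Passing the exact solution as `Fraction` values yields a zero residual, while passing IEEE double values such as `[0.1, 0.6]` exposes the tiny rounding error carried by the inputs themselves.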


SLIDE 19

CONCLUSIONS

This work proposes a Variable Precision (VP) Floating Point (FP) computing system, based on RISC-V, for high-performance computing servers as an alternative to VP FP software routines.

It supports the UNUM/ubound format in main memory
It supports several UNUM environments: from (1,1) to (4,8), up to 256 mantissa bits
It supports a dedicated internal format in its register file
32 intervals; each interval endpoint can have up to 512 mantissa bits
With the adopted memory format (BMF), it supports VP FP in main memory
The user can set the memory footprint of data with byte granularity
With the adopted programming model, it is possible to extend VP FP high-precision variables in main memory.
The result precision can be significantly improved.
Its FLOPS performance is better than that of software libraries (MPFR) and stays within the same range as a regular fixed-precision IEEE FPU.

SLIDE 20

Leti, technology research institute Commissariat à l’énergie atomique et aux énergies alternatives Minatec Campus | 17 rue des Martyrs | 38054 Grenoble Cedex | France www.leti.fr

THANK YOU FOR YOUR ATTENTION!

Contacts: Andrea BOCCO andrea.bocco@cea.fr