DYNAMIC PRECISION NUMERICS USING A VARIABLE-PRECISION UNUM TYPE I HW COPROCESSOR


ARITH'26 | BOCCO Andrea | 11 June 2019


SLIDE 1

ARITH’26 | BOCCO Andrea | 11 June 2019

DYNAMIC PRECISION NUMERICS USING A VARIABLE-PRECISION UNUM TYPE I HW COPROCESSOR

SLIDE 2

INTRODUCTION: STATE OF THE ART

➢ Variable Precision (VP) computing has been investigated to improve the convergence of algorithms. It has been investigated in:

  • Software (SW): GMP[2] and MPFR[3]
    Slow: they might not meet the requirements of high-speed applications
  • Hardware (HW):
    Kulisch[4]: large fixed-point accumulator
    Schulte and Swartzlander[5]: mantissas divided into multiple words

➢ None of the previous works shows how to efficiently store VP Floating Point (FP) numbers in main memory

  • They support the IEEE 754 FP format in main memory

[1] IEEE 754-2008. 2008. IEEE Standard for Floating-Point Arithmetic. https://doi.org/10.1109/IEEESTD.2008.4610935
[2] Torbjörn Granlund and the GMP development team. 2012. GNU MP: The GNU Multiple Precision Arithmetic Library. https://gmplib.org/
[3] Laurent Fousse, et al. MPFR: A Multiple-Precision Binary Floating-Point Library with Correct Rounding. https://doi.org/10.1145/1236463.1236468
[4] Ulrich Kulisch. 2013. Computer Arithmetic and Validity: Theory, Implementation, and Applications.
[5] M. J. Schulte and E. E. Swartzlander. 2000. A Family of Variable-Precision Interval Arithmetic Processors. https://doi.org/10.1109/12.859535

SLIDE 3

INTRODUCTION: MY WORK

Our previous work[6]: a VP FP hardware accelerator that:

  • Supports the UNUM type I format in main memory
  • Does computation internally with another (hardware-friendly) FP format
  • Supports Interval Arithmetic (IA)

This work:

  • Refines the UNUM type I FP format.
  • Proposes a new VP FP architecture.
  • Proposes a new programming model.
  • Benchmarks our system.

[6] A. Bocco, Y. Durand, F. de Dinechin. 2019. SMURF: Scalar Multiple-precision UNUM RISC-V Floating-point Accelerator for Scientific Computing.

[Figure: Rocket Chip block diagram. Labels: Rocket Chip, RISC-V, Rocket tile, FPU, LSU, RoCC, UNUM co-proc, LSU, Scratchpad, $ L1 (two instances), RAM (two instances); numbered callouts 1 to 5.]

SLIDE 4

OUTLINE

  • Choice of the memory format: the UNUM type I
  • Refinements on the UNUM type I FP format
  • The adopted VP FP Architecture
  • The programming model
  • System benchmark: Gauss elimination solver
  • Conclusions

SLIDE 6

CHOICE OF THE MEMORY FORMAT: THE UNUM TYPE I

We decided to use the UNUM type I FP format in main memory

  • It is a self-descriptive FP format with 6 sub-fields
    (3 more than conventional IEEE 754 FP numbers)
  • WHY?
      • UNUM is a VP FP format
      • It self-encodes the exponent and fraction field lengths

However, UNUM type I has some peculiarities to be fixed:

  • How to organize UNUM arrays in main memory
  • How to organize the UNUM fields in memory

UNUM type I field layout:

  s      e          f          u      es-1            fs-1
  sign   exponent   fraction   ubit   exponent size   fraction size
         es bits    fs bits
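
As an illustration of what "self-descriptive" means here, the sketch below decodes a bit pattern laid out in the field order shown on this slide (s, e, f, u, es-1, fs-1, from MSB to LSB). It is a minimal software sketch, not the coprocessor's decoder: the function names and the `ess`/`fss` parameters (the widths of the two size fields) are assumptions, and the Inf/NaN patterns are deliberately left out.

```python
from fractions import Fraction


def decode_unum(bits: int, ess: int, fss: int) -> dict:
    """Split a UNUM type I bit pattern into its six fields.

    Assumed layout, MSB to LSB: s | e | f | u | es-1 | fs-1, where the two
    size fields are ess and fss bits wide (hypothetical parameter names).
    """
    fs = (bits & ((1 << fss) - 1)) + 1   # size field stores fs-1
    bits >>= fss
    es = (bits & ((1 << ess) - 1)) + 1   # size field stores es-1
    bits >>= ess
    u = bits & 1                         # ubit: 1 marks an inexact value
    bits >>= 1
    f = bits & ((1 << fs) - 1)           # fraction, fs bits
    bits >>= fs
    e = bits & ((1 << es) - 1)           # exponent, es bits
    bits >>= es
    s = bits & 1                         # sign
    return {"s": s, "e": e, "f": f, "u": u, "es": es, "fs": fs}


def unum_value(fields: dict) -> Fraction:
    """Exact value of a finite UNUM (Inf/NaN patterns are not handled)."""
    es, fs, e, f = fields["es"], fields["fs"], fields["e"], fields["f"]
    bias = (1 << (es - 1)) - 1
    hidden = 1 if e > 0 else 0           # subnormals have no hidden bit
    exponent = e - bias + (1 - hidden)
    magnitude = (hidden + Fraction(f, 1 << fs)) * Fraction(2) ** exponent
    return -magnitude if fields["s"] else magnitude


# 0 | 01 | 1 | 0 | 01 | 00  ->  es = 2, fs = 1, e = 1, f = 1
fields = decode_unum(0b0_01_1_0_01_00, ess=2, fss=2)
print(fields, unum_value(fields))
```

The decoder has to read the two size fields first, since they determine how many bits belong to e and f.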


SLIDE 8

REFINEMENTS ON THE UNUM TYPE I FP FORMAT:

  • UNUM FIELD ORGANIZATION

For a UNUM/ubound which spans multiple addresses in main memory it is important to have the descriptor fields present in the lower addresses.

➢ We have re-organized the order of the fields for UNUM and ubound

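[Figure: reordered field layouts, shown from LSB to MSB.
UNUM: s, u, es-1, fs-1, e, f.
Ubound: s, u, es-1, fs-1 (left), s, u, es-1, fs-1 (right), e (left), e (right), f (left), f (right).
Memory view in words of p bits, addresses from 00--00 to FF--FF, two numbered cases: U1 at address @1', and U1 and U2 at addresses @1' and @2'.]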
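
A short sketch of why descriptor-first ordering helps: when the descriptor sits in the lowest bits, the first memory word is enough to know the total length of the number. The exact bit order (s, u, es-1, fs-1, e, f starting from the LSB), the word width `P`, and both function names are assumptions made for illustration only.

```python
def pack_descriptor_first(s, u, es, fs, e, f, ess, fss):
    """Pack fields LSB-first as s, u, es-1, fs-1, e, f (assumed bit order)."""
    word, pos = 0, 0
    layout = ((s, 1), (u, 1), (es - 1, ess), (fs - 1, fss), (e, es), (f, fs))
    for value, width in layout:
        word |= (value & ((1 << width) - 1)) << pos
        pos += width
    return word, pos  # packed bits and total bit length


def length_from_first_word(first_word, ess, fss):
    """Recover the total bit length from the descriptor in the first word."""
    es = ((first_word >> 2) & ((1 << ess) - 1)) + 1
    fs = ((first_word >> (2 + ess)) & ((1 << fss) - 1)) + 1
    return 2 + ess + fss + es + fs


P = 8  # hypothetical memory word width in bits
packed, nbits = pack_descriptor_first(s=0, u=0, es=4, fs=6, e=9, f=33, ess=2, fss=3)
first_word = packed & ((1 << P) - 1)   # only the lowest-addressed word is read
print(nbits, length_from_first_word(first_word, ess=2, fss=3))
```

This only works when the descriptor (2 + ess + fss bits) fits in one memory word, which is the case in the example.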

SLIDE 9

REFINEMENTS ON THE UNUM TYPE I FP FORMAT:

  • UNUM ARRAY ORGANIZATION

Handling a two-element UNUM array in main memory with p-bit parallelism

[Figure: two memory layouts (1 and 2) of the array elements U1 and U2, stored between addresses 00--00 and FF--FF in words of p bits. U1 spans 2 words (U1_0, U1_1; bit length 2p) and U2 spans 3 words (U2_0 to U2_2; bit length 3p), with address labels @1', @2' and @2''. The result U3 = U1*U2 spans 3 words (U3_0 to U3_2) and is marked with "!".]

Array support: guarantee an affine addressing scheme


SLIDE 11

THE ADOPTED VP FP ARCHITECTURE

  • 1 integer register file (iRF): 32 integer general-purpose registers (GPRs) + pc, in the main processor.
  • 1 g-bound register file (gRF): 32 entries, in the co-processor.
  • UNUMs/u-bounds are strictly considered as memory formats:
      • Load operations: load UNUMs/u-bounds from main memory and convert them into internal g-bounds.
      • Store operations: convert internal g-bounds (entries of the internal gRF) into u-bounds and store the latter in main memory.
  • The coprocessor's internal parallelism is fixed to 64 bits.
  • Coprocessor's status registers:
      • DUE
      • SUE
      • MBB
      • WGP

[Figure: Rocket Chip block diagram. Labels: Rocket Chip, RISC-V, Rocket tile, FPU, LSU, RoCC, UNUM co-proc, LSU, Scratchpad, $ L1 (two instances), RAM (two instances); numbered callouts 1 to 5; a "NEW!" marker.]

SLIDE 12

THE MBB: MAXIMUM BYTE BUDGET

The UNUM format is variable length (up to a maximum length)

▪ Compact arrays of variable-length elements cannot offer random access to their elements

➢ We define the Maximum Byte Budget (MBB) as the maximum length that a UNUM number can have in main memory

➢ The user can address VP FP numbers by specifying their length with byte granularity.

[Figure: LSU data path. Labels: g0 to g4, G2U, u0 to u4, BMF, u'0 to u'4, MBB.]
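
A small sketch of the addressing consequence: if every array element occupies one MBB-sized slot, the address of element i is simply base + i * MBB, regardless of how many bytes each number actually uses. The `MBB` value, the `store`/`load` helpers, and the zero-padding policy are illustrative assumptions; in particular, this sketch rejects encodings longer than the budget instead of modelling how the hardware fits them.

```python
MBB = 6  # hypothetical Maximum Byte Budget, in bytes


def store(memory: bytearray, base: int, index: int, encoded: bytes) -> None:
    """Write one encoded number into its MBB-sized slot, zero-padding the rest."""
    if len(encoded) > MBB:
        raise ValueError("encoding exceeds the byte budget")
    addr = base + index * MBB            # affine address: base + i * stride
    memory[addr:addr + MBB] = encoded.ljust(MBB, b"\x00")


def load(memory: bytearray, base: int, index: int) -> bytes:
    """Read back the full MBB-sized slot of element `index`."""
    addr = base + index * MBB
    return bytes(memory[addr:addr + MBB])


memory = bytearray(64)
store(memory, base=8, index=0, encoded=b"\x11\x22")          # 2-byte number
store(memory, base=8, index=2, encoded=b"\xaa\xbb\xcc\xdd")  # 4-byte number
print(load(memory, base=8, index=2))
```

Element 2 is reachable without reading elements 0 and 1 first, which is the random-access property that compact variable-length arrays lack.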

SLIDE 13

THE BMF: BOUNDED MEMORY FORMAT

[Figure: table of BMF encodings for two cases, "MBB >= max unum length" and "MBB < max unum length", with rows numbered 1a) to 9a) and 1b) to 7b). Each row gives the fields s, u, es-1, fs-1, e and f for one value class: (-∞ left, +∞) right, -∞↓, +∞↓, sNaN, qNaN, and a regular value x. Other labels: es, fs, es_max, fs_max, ess', fss', ess'', fss'', UNUSED BITS, bit length MBB*8.]


SLIDE 15

THE COPROCESSOR PROGRAMMING MODEL

Our hardware is best suited for VP kernels which exploit three different storage types:

  • The external (main memory) storage
  • The intermediate (L1 cache) storage
  • The internal (register-level) storage

Example kernel (Jacobi iteration for A · x = b):

k = 0
while convergence not reached do
    for i := 1:n do
        σ = 0
        for j := 1:n do
            if j ≠ i then
                σ += a_{ij} x_j^{(k)}
            end
        end
        x_i^{(k+1)} = (1 / a_{ii}) (b_i − σ)
    end
    k = k + 1
end

[Figure: the system A · x = b with a legend for the outermost, intermediate and innermost loops; the accumulator σ is drawn inside the UNUM co-proc. Block diagram labels: Rocket tile, RISC-V, FPU, LSU, RoCC, UNUM co-proc, LSU, Scratchpad, $ L1, RAM; numbered callouts 1 to 3.]
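
The iterative kernel on this slide can be tried in software at a configurable precision. The sketch below is a plain Jacobi iteration using Python's standard `decimal` module as a stand-in for variable precision; it does not model the coprocessor, its instructions, or its storage hierarchy. The function name, the `digits` and `tol` parameters, and the test matrix are assumptions chosen for illustration.

```python
from decimal import Decimal, getcontext


def jacobi(A, b, digits, tol, max_iter=500):
    """Jacobi iteration for A x = b at a chosen working precision."""
    getcontext().prec = digits                      # working precision, in decimal digits
    n = len(b)
    A = [[Decimal(v) for v in row] for row in A]    # operands, as they would sit in memory
    b = [Decimal(v) for v in b]
    x = [Decimal(0)] * n                            # current iterate x^(k)
    for _ in range(max_iter):                       # outermost loop
        x_next = []
        for i in range(n):                          # intermediate loop
            sigma = Decimal(0)                      # accumulator, reused for every row
            for j in range(n):                      # innermost loop
                if j != i:
                    sigma += A[i][j] * x[j]
            x_next.append((b[i] - sigma) / A[i][i])
        step = max(abs(p - q) for p, q in zip(x_next, x))
        x = x_next
        if step < tol:                              # convergence test on the last update
            break
    return x


A = [[4, 1, 0], [1, 5, 2], [0, 2, 6]]   # diagonally dominant, so Jacobi converges
b = [5, 8, 8]                           # exact solution is (1, 1, 1)
x = jacobi(A, b, digits=50, tol=Decimal("1e-40"))
print(x)
```

Only the accumulator `sigma` is touched in the innermost loop, which is why it is the natural candidate for register-level storage.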


SLIDE 17

SYSTEM BENCHMARK: GAUSS ELIMINATION SOLVER

Our system, benchmarked with a Gauss elimination solver in both UNUM (scalar) and ubound (interval) modes, showed:

  • A gain of up to 65 decimal digits over IEEE double
  • The result precision is constrained by the precision adopted in memory.
  • Intervals do not always converge, but they are useful for estimating the computational error (Ax-b).
  • A speed-up of 4-10x with respect to the MPFR software library
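
The effect of the working precision on a Gauss elimination solver can be observed in software. The sketch below is not the benchmark from this talk and does not reproduce its figures: it solves a Hilbert system (an ill-conditioned test matrix chosen here as an assumption) with Python's `decimal` module at two precisions and compares each result with the exact rational solution. All function names are hypothetical.

```python
from decimal import Decimal, getcontext
from fractions import Fraction


def gauss_solve(A, b):
    """Gaussian elimination with partial pivoting; works for Decimal or Fraction."""
    n = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0] * n
    for i in reversed(range(n)):                        # back substitution
        acc = sum((M[i][j] * x[j] for j in range(i + 1, n)), M[i][i] * 0)
        x[i] = (M[i][n] - acc) / M[i][i]
    return x


def hilbert_error(n, digits):
    """Max error of a Decimal solve against the exact rational solution."""
    exact = gauss_solve(
        [[Fraction(1, i + j + 1) for j in range(n)] for i in range(n)],
        [Fraction(1)] * n,
    )
    getcontext().prec = digits                          # working precision, in decimal digits
    A = [[Decimal(1) / Decimal(i + j + 1) for j in range(n)] for i in range(n)]
    x = gauss_solve(A, [Decimal(1)] * n)
    return max(abs(Fraction(xi) - ei) for xi, ei in zip(x, exact))


low, high = hilbert_error(8, digits=16), hilbert_error(8, digits=80)
print(float(low), float(high))
```

The 16-digit run is roughly comparable to IEEE double; raising the working precision shrinks the error, which is the effect the benchmark quantifies on the real hardware.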

SLIDE 19

CONCLUSIONS

This work proposes a Variable Precision (VP) Floating Point (FP) computing system, based on RISC-V, for high-performance computing servers as an alternative to VP FP software routines.

  • It supports the UNUM/ubound format in main memory
      • It supports several UNUM environments: from (1,1) to (4,8), up to 256 mantissa bits
  • It supports a dedicated internal format in its register file
      • 32 intervals; each interval endpoint can have up to 512 mantissa bits
  • With the adopted memory format (BMF), it supports VP FP in main memory
      • The user can set the memory footprint of data with byte granularity
  • With the adopted programming model, it is possible to extend high-precision VP FP variables in main memory.
  • The result precision can be significantly improved.
  • Its FLOPS performance is better than that of software libraries (MPFR) and stays within the same range as a regular fixed-precision IEEE FPU.

SLIDE 20

Leti, technology research institute Commissariat à l’énergie atomique et aux énergies alternatives Minatec Campus | 17 rue des Martyrs | 38054 Grenoble Cedex | France www.leti.fr

THANK YOU FOR YOUR ATTENTION!

Contacts: Andrea BOCCO andrea.bocco@cea.fr