ENHANCING SCIENTIFIC COMPUTATION USING A VARIABLE PRECISION FPU WITH A RISC-V PROCESSOR



SLIDE 1

Y.Durand, C.Fabre, A. Bocco, T. Trevisan | IMPRENUM Project | Oct 2019

ENHANCING SCIENTIFIC COMPUTATION USING A VARIABLE PRECISION FPU WITH A RISC-V PROCESSOR

SLIDE 2

Applications

  • Computational Physics
  • Computational chemistry
  • Computational statistics
  • Computational geometry
  • Large PDEs
  • Finite elements, finite differences
  • ODEs
  • Optimization

USE CASES FOR (LARGE) VARIABLE PRECISION

Techniques & Kernels

  • Dense/sparse linear algebra
  • Solvers, eigenvalues
  • Numerical integration
  • RK, but not only…
  • Monte Carlo
  • Spectral techniques
  • FFT and others
  • Interval arithmetic

Our main focus today: linear algebra solvers. However, there are many other areas in scientific computing where variable precision is sought.

SLIDE 3

VARIABLE PRECISION FOR SCIENTIFIC COMPUTATION

We need:

  • 1. extended precision operators,
  • 2. dedicated accumulators in registers inside the FPU,
  • 3. extended precision storage in close memory.

JACOBI

  while convergence not reached do
    for i := 1:n do
      $\sigma$ := 0
      for j := 1:n do
        if j ≠ i then
          $\sigma$ += $a_{ij} x_j^{(k)}$
        end
      end
      $x_i^{(k+1)} = \frac{1}{a_{ii}} (b_i - \sigma)$
    end
    k = k+1
  end

  • Vector update: dense; requires high precision; should be kept in close memory
  • Accumulation: requires max precision; should be done inside the FPU
  • Matrix coeffs: read-only, sparse doubles; stay in remote memory

  while error > tolerance do
    augment precision
  end
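The Jacobi loop and its three precision levels can be tried out in software. The following is a minimal Python sketch, not the presenters' implementation: the standard `decimal` module stands in for the variable precision FPU, and the names `ACC_DIGITS`, `residual`, `jacobi` and `solve`, the 80-digit accumulator, the doubling policy and the 2x2 test system are all assumptions made for illustration. Matrix coefficients stay ordinary doubles, the accumulation of σ runs at the widest precision, and the vector x is rounded to an adjustable storage precision that is augmented while the error exceeds the tolerance.

```python
from decimal import Decimal, localcontext

ACC_DIGITS = 80  # assumed width of the accumulator, in decimal digits


def residual(a, b, x):
    """Max-norm of b - A x, evaluated at accumulator precision."""
    with localcontext() as ctx:
        ctx.prec = ACC_DIGITS
        worst = Decimal(0)
        for i in range(len(b)):
            r = Decimal(b[i])
            for j in range(len(b)):
                r -= Decimal(a[i][j]) * x[j]
            worst = max(worst, abs(r))
        return worst


def jacobi(a, b, digits, tol, max_iter=200):
    """Jacobi iteration with x stored at `digits` significant digits."""
    n = len(b)
    x = [Decimal(0)] * n
    for _ in range(max_iter):
        x_new = []
        for i in range(n):
            # Accumulation: widest precision (inside the FPU on the slide).
            with localcontext() as acc:
                acc.prec = ACC_DIGITS
                sigma = Decimal(0)
                for j in range(n):
                    if j != i:
                        sigma += Decimal(a[i][j]) * x[j]  # doubles converted exactly
                xi = (Decimal(b[i]) - sigma) / Decimal(a[i][i])
            # Vector update: rounded to the adjustable storage precision.
            with localcontext() as store:
                store.prec = digits
                x_new.append(+xi)  # unary plus rounds to the context precision
        x = x_new
        if residual(a, b, x) <= tol:
            break
    return x


def solve(a, b, tol, start_digits=8, max_digits=64):
    """Outer loop: while error > tolerance, augment precision."""
    digits = start_digits
    while True:
        x = jacobi(a, b, digits, tol)
        err = residual(a, b, x)
        if err <= tol or digits >= max_digits:
            return x, digits, err
        digits *= 2


a = [[4.0, 1.0], [1.0, 3.0]]  # read-only doubles
b = [1.0, 2.0]
tol = Decimal("1e-12")
x, digits, err = solve(a, b, tol)
print(digits, err, x)
```

In this example the exact solution is (1/11, 7/11), which cannot be stored in 8 digits accurately enough to meet the tolerance, so the outer loop augments the storage precision once before the residual drops below 1e-12.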

SLIDE 4

MORE IN DEPTH WITH JACOBI: EXECUTING ON THE V1 ACCELERATOR

  k = 0
  while convergence not reached do
    for i = 1:n do
      $\sigma$ = 0
      for j = 1:n do
        if j ≠ i then
          $\sigma$ += $a_{ij} x_j^{(k)}$
        end
      end
      $x_i^{(k+1)} = \frac{1}{a_{ii}} (b_i - \sigma)$
    end
    k = k+1
  end

[Figure: block diagram of the V1 accelerator. Labels: Rocket tile, RISC-V, FPU, L&S, L1$, RoCC, VP co-proc, L&S, scratchpad, L1/L2/L3$, RAM.]

Input data, RO, in RAM, double format (sparse)

Internal format, for accumulation (high precision)

Intermediate vector, adjustable format (dense)

SLIDE 5

VARIABLE PRECISION SYSTEM

[Figure: system block diagram. Labels: FPU, VP scratchpad, L1$, LLC$, distant shared memory.]

  • Standard core + specialized registers
  • V.P. Floating Point Unit (FPU)
  • Large size registers for accumulation (e.g. 64 512b registers)
  • Specific access to memory hierarchy
  • Large size (10s of MB) coherent close memory

SLIDE 6

| 6 Y.Durand | Oct 2018

PROGRAMMING MODEL: HARDWARE & SOFTWARE LAYERS

[Figure: layered hardware/software stack. Labels: application, domain specific library, solver & algorithms i/f, solvers & algorithms, VP solvers & algorithms, computation routines i/f, kernel, variable precision kernel, auxiliary support library, hardware.]

Variable precision is contained within kernel (BLAS-level) and solver (LAPACK-level) calls.
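A minimal Python sketch of what "contained within the kernel call" can look like from the application side. The function names `dot_double` and `dot_wide` are hypothetical, and `fractions.Fraction` is only a software stand-in for a wide hardware accumulator. Both kernels take and return ordinary doubles; only the internal accumulator differs.

```python
from fractions import Fraction


def dot_double(u, v):
    """Dot product with a plain double accumulator."""
    acc = 0.0
    for p, q in zip(u, v):
        acc += p * q
    return acc


def dot_wide(u, v):
    """Same interface, but the accumulator is exact inside the kernel."""
    acc = Fraction(0)
    for p, q in zip(u, v):
        acc += Fraction(p) * Fraction(q)  # doubles convert to Fraction exactly
    return float(acc)  # rounded once, on the way out


u = [1e16, 1.0, -1e16]
v = [1.0, 1.0, 1.0]
print(dot_double(u, v))  # 0.0: the 1.0 is absorbed by 1e16
print(dot_wide(u, v))    # 1.0
```

The caller never sees the extended format, which is the property the layering above relies on: applications and domain libraries keep their double-based interfaces.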

SLIDE 7

  • Augmenting accuracy inside the kernel reduces rounding errors → improves stability of the computation
  • Augmenting the mantissa during accumulation is not sufficient
  • The usual solution is to tweak the solver (pre-conditioning, etc.), but this is costly, hazardous and very limited
  • Another solution is to double the precision (quad!) in the intermediate calculation → huge impact on memory and on calculation time
  • Using specialized data types (GMP, MPFR) has the same pitfalls
  • At even higher cost in memory
  • Our solution:
  • Variable precision, byte-aligned data format for intermediate data in memory
  • Affordable memory footprint for intermediate data
  • Hardware support for variable precision in a hardware co-processor
  • Up to 4x64 bits of fractional part in the internal accumulator

RECAP: BENEFITS OF VARIABLE PRECISION
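The memory argument in this recap comes down to simple arithmetic. The sketch below compares the footprint of a dense intermediate vector stored as doubles, as quads, and in a byte-aligned variable precision format. The 10-byte element width and the vector length are assumptions chosen for illustration; the slides do not state a specific width.

```python
def footprint_bytes(n_elements, bytes_per_element):
    """Storage needed for a dense vector of fixed-width elements."""
    return n_elements * bytes_per_element


n = 1_000_000  # assumed vector length
widths = {
    "double": 8,
    "quad": 16,
    "byte-aligned VP (assumed 10 bytes)": 10,
}
for name, width in widths.items():
    print(f"{name:36s} {footprint_bytes(n, width) / 1e6:6.1f} MB")
```

With a byte-aligned format the element width can be raised one byte at a time, so the footprint grows in small steps instead of doubling as it does when moving from double to quad.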

SLIDE 8

| 8 Y.Durand | April 2019

PERSPECTIVES

  • Early investigation carried out by CEA
  • With support of other research projects
  • OPRECOMP, Imprenum, QUANTEX
  • First Use cases
  • Proof of concept = First FPGA prototype
  • Investigation on Compiler and library support
  • Mid-term Target : Proof of realization
  • Re-engineering with actual memory subsystem & infrastructure
  • Improve co-processor integration with processor
  • SW integration (libraries, execution model ?)
  • Main publications
  • Andrea Bocco, Yves Durand, and Florent de Dinechin. SMURF: Scalar multiple-precision unum RISC-V floating-point accelerator for scientific computing. In Conference on Next-Generation Arithmetic, March 2019.
  • Tiago Trevisan Jost, Andrea Bocco, Yves Durand, Christian Fabre, Florent de Dinechin, Anca Molnos, and Albert Cohen. Variable Precision Capabilities in RISC-V Processors. RISC-V Workshop Zurich, June 11-13, 2019.
  • Andrea Bocco, Yves Durand, and Florent de Dinechin. Dynamic precision numerics using a variable-precision UNUM type I HW coprocessor. In 26th IEEE Symposium on Computer Arithmetic (ARITH-26), June 2019.