Efficient Floating-Point Logarithm Unit for FPGAs: presentation by Nikolaos Alachiotis



slide-1
SLIDE 1

Efficient Floating-Point Logarithm Unit for FPGAs

Nikolaos Alachiotis, Alexandros Stamatakis

The Exelixis Lab, Dept. of Computer Science, TUM, Munich, Germany

slide-2
SLIDE 2

PRESENTATION OVERVIEW

  • Introduction
  • Approximation Strategy
  • Reconfigurable Architecture
  • Performance Evaluation
  • Conclusion and Future Work
slide-3
SLIDE 3

INTRODUCTION

  • The Project:

Design of HW accelerators for Phylogenetic Inference Programs

slide-4
SLIDE 4

INTRODUCTION

  • The Project:

Design of HW accelerators for Phylogenetic Inference Programs

Calculation of evolutionary relationships between organisms

Core function: the Phylogenetic Likelihood Function

slide-5
SLIDE 5

INTRODUCTION

  • The Project:

Design of HW accelerators for Phylogenetic Inference Programs

[Figure labels: tip probability vector, ancestral probability vector, virtual root]

slide-6
SLIDE 6

INTRODUCTION

  • The Project:

Design of HW accelerators for Phylogenetic Inference Programs

  • The Phylogenetic Likelihood Function:

85% of total execution time

  • Log-Likelihood Scores:

2% of total execution time

slide-7
SLIDE 7

INTRODUCTION

  • The Project:

Design of HW accelerators for Phylogenetic Inference Programs

  • The Phylogenetic Likelihood Function:

85% of total execution time

  • Log-Likelihood Scores:

2% of total execution time

Need for a resource-efficient logarithm function

slide-8
SLIDE 8

APPROXIMATION STRATEGY

“A Hardware-Independent Fast Logarithm Approximation with Adjustable Accuracy,” by O. Vinyals, G. Friedland. Tenth IEEE Inter. Symposium on Multimedia, pp. 61–65, 2008.

Open source C implementation: ICSILog 0.6 BETA

Floating-point number in IEEE-754 standard: sign | exponent | mantissa

Number = sign * 2^exponent * mantissa

slide-9
SLIDE 9

APPROXIMATION STRATEGY

Number = sign * 2^exponent * mantissa

LOG(Number) = LOG(2^exponent * mantissa)
            = LOG(2^exponent) + LOG(mantissa)
            = exponent * LOG(2) + LOG(mantissa)

  • Logarithm defined only for positive values
  • Multiplicative property of logarithm
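The decomposition above can be checked numerically. The following is a minimal sketch in Python, not the authors' code. It relies on the standard `math.frexp`, which returns a mantissa in [0.5, 1) rather than the IEEE-754 range [1, 2), so the mantissa is rescaled. The name `log_by_decomposition` is a hypothetical helper introduced only for illustration.

```python
import math

LOG2 = math.log(2.0)

def log_by_decomposition(x: float) -> float:
    """Compute log(x) as exponent * log(2) + log(mantissa) for x > 0."""
    if x <= 0.0:
        raise ValueError("logarithm is defined only for positive values")
    m, e = math.frexp(x)      # x == m * 2**e with 0.5 <= m < 1
    m, e = m * 2.0, e - 1     # rescale so that 1 <= m < 2, as in IEEE-754
    return e * LOG2 + math.log(m)

# Compare against the library logarithm for a few sample inputs.
for value in (0.001, 1.0, 3.5, 1e10):
    print(value, log_by_decomposition(value), math.log(value))
```

Here the mantissa term still uses the library logarithm. The approximation strategy replaces exactly that term with a table lookup.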

slide-10
SLIDE 10

APPROXIMATION STRATEGY

Number = sign * 2^exponent * mantissa

LOG(Number) = LOG(2^exponent * mantissa)
            = LOG(2^exponent) + LOG(mantissa)
            = exponent * LOG(2) + LOG(mantissa)

  • Logarithm defined only for positive values
  • Multiplicative property of logarithm

Lookup Table

slide-11
SLIDE 11

APPROXIMATION STRATEGY

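[Data-flow diagram: VALUE (bit 63: sign, bits 62 downto 52: exponent, bits 51 downto 0: mantissa) is split into its fields. The exponent is multiplied (X) by log(2), the leading mantissa bits (labelled "51-q MSBs") index the LUT, and the two results are added (+) to give LOG(VALUE) in the same sign/exponent/mantissa layout.]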

LOG(Value) = exponent * LOG(2) + LOG(mantissa)
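The lookup scheme can be sketched in software as follows. This is an illustration in the spirit of the table-based approach, not ICSILog itself and not the hardware design. The names `build_mantissa_lut`, `approx_log`, and `n_bits` (the number of leading mantissa bits used as table index) are assumptions of this sketch, as is the choice to treat the discarded low mantissa bits as zero. Special inputs and denormalized numbers are not handled.

```python
import math
import struct

LOG2 = math.log(2.0)

def build_mantissa_lut(n_bits: int) -> list:
    """Table of log(mantissa) for the 2**n_bits leading-bit patterns.

    Entry i holds log(1 + i / 2**n_bits), i.e. the mantissa with all
    lower bits treated as zero (an assumption of this sketch).
    """
    size = 1 << n_bits
    return [math.log(1.0 + i / size) for i in range(size)]

def approx_log(x: float, lut: list, n_bits: int) -> float:
    """Approximate log(x) for positive, normalized doubles."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    exponent = ((bits >> 52) & 0x7FF) - 1023               # bits 62 downto 52, unbiased
    index = (bits >> (52 - n_bits)) & ((1 << n_bits) - 1)  # leading mantissa bits
    return exponent * LOG2 + lut[index]

N_BITS = 12                       # 2**12 = 4096 table entries
LUT = build_mantissa_lut(N_BITS)
for value in (0.75, 2.0, 10.0, 12345.678):
    print(value, approx_log(value, LUT, N_BITS), math.log(value))
```

In this sketch the truncation error is bounded by roughly 2^-n_bits, so each additional index bit doubles the table size and halves the worst-case error. That is the accuracy-versus-memory trade-off the approach exposes.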

slide-12
SLIDE 12

LOGARITHM APPROXIMATION UNIT (LAU) ARCHITECTURE

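[Block diagram: the 64-bit input (bit 63: sign, bits 62 downto 52: exponent, bits 51 downto 0: mantissa) is processed into log(input), which has the same sign/exponent/mantissa layout. Block labels: CASE DETECT, SUB 2046, EXP LUT, MAN LUT, FP VAL, log(2), MULT, ADD, and two 1/0 multiplexers.]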

slide-13
SLIDE 13

LOGARITHM APPROXIMATION UNIT (LAU) ARCHITECTURE

[LAU block diagram repeated from slide 12]

INPUT CASE DETECTION

  • log(negative number) = NaN
  • log(NaN) = NaN
  • log(Inf) = Inf
  • log(-Inf) = NaN
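The four special cases listed for input case detection can be written as a small software check. This is a hedged sketch rather than the hardware logic. The function name `log_special_case` is hypothetical, and zero is deliberately left to the caller because the slide does not list it.

```python
import math

def log_special_case(x: float):
    """Return the special-case result for log(x), or None for ordinary inputs."""
    if math.isnan(x):
        return math.nan   # log(NaN) = NaN
    if x < 0.0:
        return math.nan   # negative numbers, including -Inf
    if math.isinf(x):
        return math.inf   # only +Inf can reach this point
    return None           # no special case applies

for value in (-1.0, math.nan, math.inf, -math.inf, 2.0):
    print(value, log_special_case(value))
```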

slide-14
SLIDE 14

LOGARITHM APPROXIMATION UNIT (LAU) ARCHITECTURE

[LAU block diagram repeated from slide 12]

CREATE THE EXPLUT INDEX

Exponent field    Decimal value
0                 -1023
1                 -1022
…                 …
1022              -1
1023              0
1024              1
…                 …
2046              1023

slide-16
SLIDE 16

LOGARITHM APPROXIMATION UNIT (LAU) ARCHITECTURE

[LAU block diagram repeated from slide 12]

CREATE THE EXPLUT INDEX

[Exponent table repeated from slide 14, with two EXP LUT labels]

slide-17
SLIDE 17

LOGARITHM APPROXIMATION UNIT (LAU) ARCHITECTURE

[LAU block diagram repeated from slide 12]

CREATE THE EXPLUT INDEX

[Exponent table repeated from slide 14, with two EXP LUT labels]

Annotations: exponent field X, decimal value X - 1023

slide-18
SLIDE 18

LOGARITHM APPROXIMATION UNIT (LAU) ARCHITECTURE

[LAU block diagram repeated from slide 12]

CREATE THE EXPLUT INDEX

[Exponent table repeated from slide 14, with two EXP LUT labels]

Annotations: X and 1023 - (X - 1023) = 2046 - X
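One plausible reading of the index arithmetic on the EXPLUT slides is that exponent fields below the bias are mirrored onto the upper half via 2046 - X, so that a single table of magnitudes serves both negative and positive exponents, with the sign handled separately. That reading is an assumption of this sketch, not something the slides state outright. The names `mirrored_field` and `exponent_magnitude_and_sign` are hypothetical.

```python
BIAS = 1023        # IEEE-754 double-precision exponent bias
MAX_FIELD = 2046   # largest exponent field of a finite double

def mirrored_field(x: int) -> int:
    """Map exponent field x (0..2046) onto the upper half 1023..2046."""
    if not 0 <= x <= MAX_FIELD:
        raise ValueError("exponent field out of range for a finite value")
    # Below the bias: 1023 - (x - 1023) = 2046 - x. Otherwise keep x.
    return MAX_FIELD - x if x < BIAS else x

def exponent_magnitude_and_sign(x: int):
    """Return (|x - 1023|, is_negative) using the mirrored field."""
    magnitude = mirrored_field(x) - BIAS   # 0..1023, i.e. 1024 distinct values
    return magnitude, x < BIAS

for field in (0, 1, 1022, 1023, 1024, 2046):
    print(field, exponent_magnitude_and_sign(field))
```

Under this reading the magnitude takes 1024 distinct values, which would be consistent with a 1024-entry table for double-precision inputs.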

slide-19
SLIDE 19

LOGARITHM APPROXIMATION UNIT (LAU) ARCHITECTURE

[LAU block diagram repeated from slide 12]

FLOATING-POINT VALUE

  • Single-precision values
  • Single-precision MULT and ADD
  • For single-precision inputs, EXPLUT contains 128 entries to construct a single-precision value
  • For double-precision inputs, EXPLUT contains 1024 entries to construct a single-precision value

slide-20
SLIDE 20

LOGARITHM APPROXIMATION UNIT (LAU) ARCHITECTURE

[LAU block diagram repeated from slide 12]

MANTISSA LUT

ICSILog 0.6 software

slide-21
SLIDE 21

PERFORMANCE EVALUATION

[Plot: Accuracy Versus Hardware Resources. Axes: Resources (number of 18Kb block RAMs) and Average Error (x10^3).]

slide-22
SLIDE 22

PERFORMANCE EVALUATION

[Plot: Accuracy Versus Hardware Resources. Axes: Resources (number of 18Kb block RAMs) and Average Error (x10^3).]

6 block RAMs = 4096 LUT entries

Log-Likelihood score deviation

Dataset (Organisms)    DP-GNU       DP-ICSILog
150                    -39606.3     -39606.6
218                    -134173.8    -134167.5
140 (Prot)             -124777.2    -124780.1
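The size of the score deviations can be made explicit with a little arithmetic. The sketch below uses only the six scores listed for the three datasets. The signs are assumed negative, as is usual for log-likelihoods, and neither the absolute nor the relative deviation depends on that assumption. The helper name `deviations` is hypothetical.

```python
# Log-likelihood scores per dataset: (DP-GNU, DP-ICSILog).
SCORES = {
    "150": (-39606.3, -39606.6),
    "218": (-134173.8, -134167.5),
    "140 (Prot)": (-124777.2, -124780.1),
}

def deviations(reference: float, approximate: float):
    """Return (absolute deviation, deviation relative to the reference)."""
    absolute = abs(approximate - reference)
    return absolute, absolute / abs(reference)

for dataset, (gnu, icsilog) in SCORES.items():
    absolute, relative = deviations(gnu, icsilog)
    print(f"{dataset}: absolute {absolute:.1f}, relative {relative:.2e}")
```

For the three datasets listed, the relative deviation stays below 1e-4.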

slide-23
SLIDE 23

PERFORMANCE EVALUATION

  • VIRTEX 5 SX95T for mapping and verification
  • XILINX ISE 10.1 and CHIPSCOPE Pro Analyzer
  • F. de Dinechin, C. Klein, B. Pasca, “Generating high-performance custom floating-point pipelines,” Proc. of FPL 2009.

slide-24
SLIDE 24

PERFORMANCE EVALUATION

Resource Utilization and Performance: Single Precision

[Bar chart: Slice Registers, Slice LUTs, and Occupied Slices for SP-FPLog versus SP-LAU.]

slide-25
SLIDE 25

PERFORMANCE EVALUATION

Resource Utilization and Performance: Single Precision

[Bar chart: BRAMs 18k, BRAMs 36k, and DSP48Es for SP-FPLog versus SP-LAU.]

slide-26
SLIDE 26

PERFORMANCE EVALUATION

Resource Utilization and Performance: Single Precision

[Bar chart: BRAMs 18k, BRAMs 36k, and DSP48Es for SP-FPLog versus SP-LAU.]

                 FPLog    LAU
Clock Latency    20       22
Max Frequency    244.7    353.5

slide-27
SLIDE 27

PERFORMANCE EVALUATION

Resource Utilization and Performance: Double Precision

[Bar chart: Slice Registers, Slice LUTs, and Occupied Slices for DP-FPLog versus DP-LAU.]

slide-28
SLIDE 28

PERFORMANCE EVALUATION

Resource Utilization and Performance: Double Precision

[Bar chart: BRAMs 18k, BRAMs 36k, and DSP48Es for DP-FPLog versus DP-LAU.]

slide-29
SLIDE 29

PERFORMANCE EVALUATION

Resource Utilization and Performance: Double Precision

[Bar chart: BRAMs 18k, BRAMs 36k, and DSP48Es for DP-FPLog versus DP-LAU.]

                 FPLog    LAU
Clock Latency    34       22
Max Frequency    192.3    320.6

slide-30
SLIDE 30

PERFORMANCE EVALUATION

Resource Utilization and Performance: Double Precision

DP-FPLog with same accuracy as DP-LAU

[Bar chart: Slice Registers, Slice LUTs, and Occupied Slices for DP-FPLog versus DP-LAU.]

slide-31
SLIDE 31

PERFORMANCE EVALUATION

Resource Utilization and Performance: Double Precision

DP-FPLog with same accuracy as DP-LAU

[Bar chart: BRAMs 18k, BRAMs 36k, and DSP48Es for DP-FPLog versus DP-LAU.]

slide-32
SLIDE 32

PERFORMANCE EVALUATION

Resource Utilization and Performance: Double Precision

DP-FPLog with same accuracy as DP-LAU

[Bar chart: BRAMs 18k, BRAMs 36k, and DSP48Es for DP-FPLog versus DP-LAU.]

                 FPLog    LAU
Clock Latency    20       22
Max Frequency    239.6    320.6

slide-33
SLIDE 33

PERFORMANCE EVALUATION

Performance: LAU vs SP/DP-ICSILog vs GNU Log vs MKL Log

[Bar chart: time in milliseconds for 100,000,000 logarithm calculations, single and double precision, for GNU Log (gnu), MKL Log (icc), SP-ICSILog, DP-ICSILog, SP-LAU, and DP-LAU.]

  • SP-LAU vs GNU-LOG: 11X, vs MKL-LOG: 1.6X
  • DP-LAU vs GNU-LOG: 18X, vs MKL-LOG: 2.5X
  • CPU: Intel Core2 DUO T9600 @ 2.8GHz, 6MB L2 Cache

slide-34
SLIDE 34

CONCLUSION and FUTURE WORK

AVAILABILITY

DP-ICSILog C implementation and SP/DP LAU FPGA core for Virtex4 and Virtex5 FPGAs:

http://wwwkrammer.in.tum.de/exelixis/nikos/ipcores.html

Or OpenCores.org, project name fp_log:

http://www.opencores.org/project,fp_log

slide-35
SLIDE 35

CONCLUSION and FUTURE WORK

RELATED PROJECTS

Implementation of a UDP/IP core for Virtex 5 FPGAs (optimized for PC-FPGA communication):

http://wwwkrammer.in.tum.de/exelixis/nikos/ipcores.html

Or OpenCores.org, project name udp_ip__core:

http://www.opencores.org/project,udp_ip__core

FUTURE WORK

  • Implementation of a resource-efficient exponential function
  • Integration of the LOG and EXP cores into the general Phylogenetic Architecture