Efficient Floating-Point Logarithm Unit for FPGAs (Nikolaos Alachiotis)

SLIDE 1

Efficient Floating-Point Logarithm Unit for FPGAs

The Exelixis Lab,

Dept. of Computer Science,

TUM, Munich, Germany

Nikolaos Alachiotis, Alexandros Stamatakis

SLIDE 2

PRESENTATION OVERVIEW

Introduction
Approximation Strategy
Reconfigurable Architecture
Performance Evaluation
Conclusion and Future Work

SLIDE 4

INTRODUCTION

The Project:

Design of HW accelerators for Phylogenetic Inference Programs

Calculation of evolutionary relationships between organisms

Core function: the Phylogenetic Likelihood Function

SLIDE 5

INTRODUCTION

The Project:

Design of HW accelerators for Phylogenetic Inference Programs

[Figure: phylogenetic tree with labels: tip probability vector, ancestral probability vector, virtual root]

SLIDE 7

INTRODUCTION

The Project:

Design of HW accelerators for Phylogenetic Inference Programs

The Phylogenetic Likelihood Function:

85% of total execution time

Log-Likelihood Scores:

2% of total execution time

Need for a resource-efficient logarithm function

SLIDE 8

APPROXIMATION STRATEGY

“A Hardware-Independent Fast Logarithm Approximation with Adjustable Accuracy,” by O. Vinyals, G. Friedland. Tenth IEEE International Symposium on Multimedia, pp. 61–65, 2008.

Open source C implementation: ICSILog 0.6 BETA

Floating-point number in IEEE-754 standard: sign | exponent | mantissa

Number = sign * 2^exponent * mantissa
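As an illustration of the sign/exponent/mantissa layout, the following minimal Python sketch splits a double-precision value into its three fields and rebuilds it. The function names are hypothetical and the sketch covers only normal numbers (implicit leading 1); it is not taken from ICSILog.

```python
import struct

def decompose_double(x: float):
    """Split a double into its raw IEEE-754 fields: sign, biased exponent, fraction."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]  # reinterpret the 64 bits as an integer
    sign = bits >> 63                                    # bit 63
    biased_exponent = (bits >> 52) & 0x7FF               # bits 62 downto 52
    fraction = bits & ((1 << 52) - 1)                    # bits 51 downto 0
    return sign, biased_exponent, fraction

def recompose_normal(sign: int, biased_exponent: int, fraction: int) -> float:
    """Rebuild a normal number as sign * 2^exponent * mantissa."""
    mantissa = 1.0 + fraction / 2.0**52   # implicit leading 1, so 1 <= mantissa < 2
    exponent = biased_exponent - 1023     # remove the double-precision bias
    return (-1.0) ** sign * 2.0**exponent * mantissa

sign, biased_exponent, fraction = decompose_double(10.0)
print(sign, biased_exponent, fraction)
print(recompose_normal(sign, biased_exponent, fraction))
```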

SLIDE 10

APPROXIMATION STRATEGY

Number = sign * 2^exponent * mantissa

LOG(Number) = LOG(2^exponent * mantissa)         (logarithm defined only for positive values)
            = LOG(2^exponent) + LOG(mantissa)    (multiplicative property of the logarithm)
            = exponent * LOG(2) + LOG(mantissa)

LOG(mantissa): Lookup Table

SLIDE 11

APPROXIMATION STRATEGY

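[Figure: datapath from VALUE to LOG(VALUE). Both are shown as sign (bit 63), exponent (bits 62 downto 52), and mantissa (bits 51 downto 0). The exponent is multiplied (X) by log(2), the mantissa bits labeled "51-q MSBs" address the LUT, and the two results are added (+).]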

LOG(Value) = exponent * LOG(2) + LOG(mantissa)
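The decomposition can be sketched in software as follows. This is a simplified illustration in the spirit of the lookup-table approach, not the ICSILog code: `Q`, `MAN_LUT`, and `approx_log` are hypothetical names, `Q = 12` is an assumed index width (giving 4096 table entries), and only positive normal inputs are handled.

```python
import math
import struct

Q = 12  # assumption: number of mantissa MSBs used as the LUT address (2^12 = 4096 entries)

# Entry i holds log(mantissa) for the mantissa whose top Q fraction bits equal i.
MAN_LUT = [math.log(1.0 + i / 2.0**Q) for i in range(1 << Q)]
LOG2 = math.log(2.0)

def approx_log(x: float) -> float:
    """Approximate the natural logarithm of a positive, normal double."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    exponent = ((bits >> 52) & 0x7FF) - 1023        # unbiased exponent
    index = (bits >> (52 - Q)) & ((1 << Q) - 1)     # top Q bits of the mantissa field
    return exponent * LOG2 + MAN_LUT[index]         # exponent * LOG(2) + LOG(mantissa)

for value in (0.001, 1.0, 3.7, 12345.678):
    print(value, approx_log(value), math.log(value))
```

Because the lower mantissa bits are truncated, the absolute error of this sketch stays below 2^-Q; a larger `Q` trades table size for accuracy.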

SLIDE 12

LOGARITHM APPROXIMATION UNIT (LAU) ARCHITECTURE

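[Figure: LAU block diagram. The input and the output log(input) are both shown as sign (bit 63), exponent (bits 62 downto 52), and mantissa (bits 51 downto 0). Block labels: CASE DETECT, SUB 2046, EXP LUT, MAN LUT, FP VAL, FP VAL log(2), MULT, ADD, and two 1/0 multiplexers.]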

SLIDE 13

LOGARITHM APPROXIMATION UNIT (LAU) ARCHITECTURE

INPUT CASE DETECTION

log(negative number) = NaN
log(NaN) = NaN
log(Inf) = Inf
log(-Inf) = NaN
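The four special cases can be expressed as checks on the raw bit fields. The sketch below is a software illustration only; `detect_special_case` is a hypothetical name, and zero or subnormal inputs, which the slide does not list, get no dedicated handling here.

```python
import math
import struct

def detect_special_case(x: float):
    """Return the fixed result for a special input, or None for the normal datapath."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    exponent_field = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)

    if exponent_field == 0x7FF and fraction != 0:
        return math.nan   # log(NaN) = NaN
    if sign == 1:
        return math.nan   # log(negative number) = NaN, which also covers log(-Inf)
    if exponent_field == 0x7FF:
        return math.inf   # log(Inf) = Inf
    return None           # regular input: use the LUT-based computation

for value in (-1.0, math.nan, math.inf, -math.inf, 2.5):
    print(value, detect_special_case(value))
```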

SLIDE 14

LOGARITHM APPROXIMATION UNIT (LAU) ARCHITECTURE

CREATE THE EXPLUT INDEX

Decimal value    Exponent
0                1023
1                1022
...              ...
1022             1
1023             0
1024             1
...              ...
2046             1023

SLIDE 17

LOGARITHM APPROXIMATION UNIT (LAU) ARCHITECTURE

CREATE THE EXPLUT INDEX

Offset of decimal value X from 1023: X - 1023

SLIDE 18

LOGARITHM APPROXIMATION UNIT (LAU) ARCHITECTURE

CREATE THE EXPLUT INDEX

Mirroring decimal value X around 1023: 1023 - (X - 1023) = 2046 - X
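One possible reading of the index creation is sketched below: exponent fields at or above 1023 are used directly, while fields below 1023 are first mirrored with 2046 - X, so that a single table addressed by the exponent magnitude serves both halves. This reading, the function name `exp_lut_index`, and the returned sign flag are assumptions made for illustration.

```python
BIAS = 1023  # exponent bias of IEEE-754 double precision

def exp_lut_index(x_field: int):
    """Map an exponent field X (0..2046) to (exponent magnitude, is_negative)."""
    if x_field >= BIAS:
        return x_field - BIAS, False   # X - 1023
    mirrored = 2 * BIAS - x_field      # 2046 - X reflects X around 1023
    return mirrored - BIAS, True       # same magnitude, negative exponent

for x_field in (0, 1, 1022, 1023, 1024, 2046):
    print(x_field, exp_lut_index(x_field))
```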

SLIDE 19

LOGARITHM APPROXIMATION UNIT (LAU) ARCHITECTURE

FLOATING-POINT VALUE

Single-precision values
Single-precision MULT and ADD
For single-precision inputs, the EXP LUT contains 128 entries to construct a single-precision value
For double-precision inputs, the EXP LUT contains 1024 entries to construct a single-precision value
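A minimal sketch of the exponent path for double-precision inputs, under the assumption that each of the 1024 EXP LUT entries stores an exponent magnitude already converted to a single-precision value, which is then multiplied by log(2) in single precision. The names `to_single`, `EXP_LUT`, `LOG2_SP`, and `exponent_term` are hypothetical.

```python
import math
import struct

def to_single(x: float) -> float:
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

# Assumed table contents: entry k is the integer k as a single-precision value.
EXP_LUT = [to_single(float(k)) for k in range(1024)]
LOG2_SP = to_single(math.log(2.0))  # log(2) as a single-precision constant

def exponent_term(magnitude: int, negative: bool) -> float:
    """Compute exponent * log(2) with single-precision rounding of the product."""
    value = EXP_LUT[magnitude]
    if negative:
        value = -value
    # The product of two single-precision values is exact in double precision,
    # so rounding it once emulates a single-precision multiplier.
    return to_single(value * LOG2_SP)

print(exponent_term(10, False), 10 * math.log(2.0))
```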

SLIDE 20

LOGARITHM APPROXIMATION UNIT (LAU) ARCHITECTURE

MANTISSA LUT

ICSILog 0.6 software

SLIDE 21

PERFORMANCE EVALUATION

Accuracy Versus Hardware Resources

[Figure: average error (x10^3) plotted against resources (number of 18Kb block RAMs)]

SLIDE 22

PERFORMANCE EVALUATION

Accuracy Versus Hardware Resources

6 block RAMs = 4096 LUT entries

Log-likelihood score deviation

Dataset (Organisms)    DP-GNU      DP-ICSILog
150                    39606.3     39606.6
218                    134173.8    134167.5
140 (Prot)             124777.2    124780.1
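To put the score deviation in proportion, the relative difference between the two columns can be computed directly from the table values. The sketch below uses only the numbers on the slide; the helper name `relative_deviation` is hypothetical.

```python
# Log-likelihood scores from the slide: dataset -> (DP-GNU, DP-ICSILog)
scores = {
    "150": (39606.3, 39606.6),
    "218": (134173.8, 134167.5),
    "140 (Prot)": (124777.2, 124780.1),
}

def relative_deviation(reference: float, approximate: float) -> float:
    """Relative difference of the approximate score with respect to the reference."""
    return abs(approximate - reference) / abs(reference)

for dataset, (gnu, icsilog) in scores.items():
    print(f"{dataset}: {relative_deviation(gnu, icsilog):.2e}")
```

For all three datasets the relative deviation is below 1e-4.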

SLIDE 23

PERFORMANCE EVALUATION

Virtex 5 SX95T for mapping and verification
Xilinx ISE 10.1 and ChipScope Pro Analyzer

F. de Dinechin, C. Klein, B. Pasca, “Generating high-performance custom floating-point pipelines,” Proc. of FPL 2009.

SLIDE 24

PERFORMANCE EVALUATION

Resource Utilization and Performance: Single Precision

[Figure: bar chart comparing SP-FPLog and SP-LAU in slice registers, slice LUTs, and occupied slices]

SLIDE 25

PERFORMANCE EVALUATION

Resource Utilization and Performance: Single Precision

[Figure: bar chart comparing SP-FPLog and SP-LAU in 18k BRAMs, 36k BRAMs, and DSP48Es]

SLIDE 26

PERFORMANCE EVALUATION

Resource Utilization and Performance: Single Precision

                          FPLog    LAU
Clock latency (cycles)    20       22
Max frequency (MHz)       244.7    353.5

SLIDE 27

PERFORMANCE EVALUATION

Resource Utilization and Performance: Double Precision

[Figure: bar chart comparing DP-FPLog and DP-LAU in slice registers, slice LUTs, and occupied slices]

SLIDE 28

PERFORMANCE EVALUATION

Resource Utilization and Performance: Double Precision

[Figure: bar chart comparing DP-FPLog and DP-LAU in 18k BRAMs, 36k BRAMs, and DSP48Es]

SLIDE 29

PERFORMANCE EVALUATION

Resource Utilization and Performance: Double Precision

                          FPLog    LAU
Clock latency (cycles)    34       22
Max frequency (MHz)       192.3    320.6

SLIDE 30

PERFORMANCE EVALUATION

Resource Utilization and Performance: Double Precision

DP-FPLog with same accuracy as DP-LAU

[Figure: bar chart comparing DP-FPLog and DP-LAU in slice registers, slice LUTs, and occupied slices]

SLIDE 31

PERFORMANCE EVALUATION

Resource Utilization and Performance: Double Precision

DP-FPLog with same accuracy as DP-LAU

[Figure: bar chart comparing DP-FPLog and DP-LAU in 18k BRAMs, 36k BRAMs, and DSP48Es]

SLIDE 32

PERFORMANCE EVALUATION

Resource Utilization and Performance: Double Precision

DP-FPLog with same accuracy as DP-LAU

                          FPLog    LAU
Clock latency (cycles)    20       22
Max frequency (MHz)       239.6    320.6

SLIDE 33

PERFORMANCE EVALUATION

Performance: LAU vs SP/DP-ICSILog vs GNU Log vs MKL Log

[Figure: bar chart of the time in milliseconds for 100,000,000 logarithm calculations, in single and double precision, for GNU Log (gnu), MKL Log (icc), SP-ICSILog, DP-ICSILog, SP-LAU, and DP-LAU]

SP-LAU vs GNU Log: 11x; vs MKL Log: 1.6x
DP-LAU vs GNU Log: 18x; vs MKL Log: 2.5x

CPU: Intel Core2 Duo T9600 @ 2.8GHz, 6MB L2 cache
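As a rough plausibility check, the time a hardware unit needs for 100,000,000 logarithms can be estimated from its clock latency and maximum frequency. The sketch assumes a fully pipelined unit that accepts one input per clock cycle and takes the frequencies to be in MHz; neither assumption is stated on the slides, and `pipelined_time_ms` is a hypothetical helper.

```python
def pipelined_time_ms(n_inputs: int, latency_cycles: int, freq_mhz: float) -> float:
    """Estimated run time in milliseconds, assuming one result per cycle after the pipeline fills."""
    total_cycles = latency_cycles + n_inputs - 1
    return total_cycles / (freq_mhz * 1e6) * 1e3

N = 100_000_000
# Latency and frequency values as reported for the LAU in the tables of this section.
print("SP-LAU:", pipelined_time_ms(N, 22, 353.5), "ms")
print("DP-LAU:", pipelined_time_ms(N, 22, 320.6), "ms")
```

Under these assumptions the 22-cycle latency is negligible next to the 100,000,000 cycles of streaming, so the estimate is governed almost entirely by the clock frequency.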

SLIDE 34

CONCLUSION and FUTURE WORK

AVAILABILITY

DP-ICSILog C implementation and SP/DP LAU FPGA core for Virtex4 and Virtex5 FPGAs
http://wwwkrammer.in.tum.de/exelixis/nikos/ipcores.html
or OpenCores.org, project name: fp_log
http://www.opencores.org/project,fp_log

SLIDE 35

CONCLUSION and FUTURE WORK

RELATED PROJECTS

Implementation of a UDP/IP core for Virtex 5 FPGAs (optimized for PC-FPGA communication)
http://wwwkrammer.in.tum.de/exelixis/nikos/ipcores.html
or OpenCores.org, project name: udp_ip__core
http://www.opencores.org/project,udp_ip__core

FUTURE WORK

Implementation of a resource-efficient exponential function
Integration of the LOG and EXP cores into the general Phylogenetic Architecture