SLIDE 1

A Hybrid Implementation of Hamming Weight

Enric Morancho

Computer Architecture Department
Universitat Politècnica de Catalunya, BarcelonaTech
Barcelona, Spain
enricm@ac.upc.edu

22nd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing
Torino, Italy, Feb. 12th-14th, 2014

Enric Morancho (UPC) A Hybrid Implementation of Hamming Weight Feb 2014 1 / 35

SLIDE 2

Outline

1. Introduction
2. Algorithms for computing hamming weight
3. Evaluation of existing implementations
4. Proposed hybrid implementation
5. Conclusion and future work

SLIDE 4

Introduction

What is hamming weight?

The hamming weight of a bitstring is the number of bits set to one in the bitstring

Hamming weight is also known as population count, sideways addition or bit counting

Applications: cryptography, chemical informatics, information theory

Bitstring lengths up to several thousands of bits

SLIDE 5

Introduction

Algorithms for computing hamming weight

Several algorithms have been proposed:

Naïve, memoization, parallel reduction, merged parallel reduction, bitslicing, ...

Some algorithms admit both scalar and vector implementations.

However, existing implementations expose either scalar parallelism or vector parallelism, not both.

This work proposes a hybrid scalar-vector implementation:

It exposes both kinds of parallelism simultaneously.

It is useful on platforms that can exploit both kinds of parallelism simultaneously.

SLIDE 7

Existing algorithms

Naïve

Iterates through the bits of the bitstring and accumulates each bit value.

Can be specialized to deal with sparse/dense bitstrings.

Poor performance due to not exploiting parallelism.

    /* Counts the set bits of w by testing one bit per iteration. */
    uint8_t hw_naive(uint32_t w)
    {
        uint8_t i, cnt = 0;
        for (i = 0; i < 32; i++, w = w >> 1)
            cnt += w & 0x1;
        return cnt;
    }
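The slide does not say how the naïve loop is specialized for sparse bitstrings. One common way, shown here only as a sketch, is to clear the lowest set bit on every iteration so that the loop runs once per set bit instead of once per bit position. The name hw_sparse is hypothetical.

```c
#include <stdint.h>

/* Hypothetical sparse-oriented variant of the naive loop: each iteration
 * clears the lowest set bit, so the body runs once per set bit rather
 * than once per bit position. */
uint8_t hw_sparse(uint32_t w)
{
    uint8_t cnt = 0;
    while (w != 0) {
        w &= w - 1;   /* clear the least-significant set bit */
        cnt++;
    }
    return cnt;
}
```

A dense-oriented variant could apply the same loop to the complemented word and subtract the result from 32.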

SLIDE 8

Existing algorithms

Memoization

Steps:

Defines a subword size (e.g. 8 bits).

Precomputes the hamming weight of all possible subwords.

Looks up the precomputed table for each subword of the bitstring and accumulates the results.

Admits both scalar and vector implementations.

Exposes more parallelism than the naïve implementation.

    /* T8[i] holds the hamming weight of the 8-bit value i. */
    uint8_t T8[256] = {0, 1, 1, 2, ..., 7, 8};

    uint8_t hw_memoization8(uint32_t w)
    {
        return T8[w & 0xFF] + T8[(w >> 8) & 0xFF] +
               T8[(w >> 16) & 0xFF] + T8[w >> 24];
    }
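The table initializer on the slide is elided. As an illustration of the precomputation step, the sketch below fills an equivalent table at start-up using the recurrence hw(i) = hw(i >> 1) + (i & 1). The names hw_table8, hw_table8_init and hw_lookup8 are hypothetical and not taken from the presentation.

```c
#include <stdint.h>

static uint8_t hw_table8[256];   /* hypothetical name; plays the role of T8 */

/* Fill the table with hw(i) = hw(i >> 1) + (i & 1). */
void hw_table8_init(void)
{
    hw_table8[0] = 0;
    for (int i = 1; i < 256; i++)
        hw_table8[i] = (uint8_t)(hw_table8[i >> 1] + (i & 1));
}

/* Same lookup scheme as hw_memoization8, using the generated table. */
uint8_t hw_lookup8(uint32_t w)
{
    return (uint8_t)(hw_table8[w & 0xFF] + hw_table8[(w >> 8) & 0xFF] +
                     hw_table8[(w >> 16) & 0xFF] + hw_table8[w >> 24]);
}
```

hw_table8_init must be called once before the first lookup.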

SLIDE 9

Existing algorithms

Parallel reduction at bit level

Tree reduction of the input word in ⌈log2(bits per word)⌉ levels.

Example on an 8-bit word with four bits set:

    Parallel reduction, level 1:  01 10 00 01
    Parallel reduction, level 2:  0011 0001
    Parallel reduction, level 3:  00000100

Admits both scalar and vector implementations.

    uint32_t hw_parallel(uint32_t w)
    {
        w = (w & 0x55555555) + ((w >>  1) & 0x55555555);  /* Level 1 */
        w = (w & 0x33333333) + ((w >>  2) & 0x33333333);  /* Level 2 */
        w = (w & 0x0F0F0F0F) + ((w >>  4) & 0x0F0F0F0F);  /* Level 3 */
        w = (w & 0x00FF00FF) + ((w >>  8) & 0x00FF00FF);  /* Level 4 */
        w = (w & 0x0000FFFF) + ((w >> 16) & 0x0000FFFF);  /* Level 5 */
        return w;
    }
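The evaluation later in the talk applies parallel reduction to 64-bit words. A minimal sketch of that extension, assuming it simply widens the masks and adds a sixth level, could look as follows. The name hw_parallel64 is hypothetical.

```c
#include <stdint.h>

/* Parallel reduction over a 64-bit word: six levels, since log2(64) = 6. */
uint32_t hw_parallel64(uint64_t w)
{
    w = (w & 0x5555555555555555ULL) + ((w >>  1) & 0x5555555555555555ULL);
    w = (w & 0x3333333333333333ULL) + ((w >>  2) & 0x3333333333333333ULL);
    w = (w & 0x0F0F0F0F0F0F0F0FULL) + ((w >>  4) & 0x0F0F0F0F0F0F0F0FULL);
    w = (w & 0x00FF00FF00FF00FFULL) + ((w >>  8) & 0x00FF00FF00FF00FFULL);
    w = (w & 0x0000FFFF0000FFFFULL) + ((w >> 16) & 0x0000FFFF0000FFFFULL);
    w = (w & 0x00000000FFFFFFFFULL) + ((w >> 32) & 0x00000000FFFFFFFFULL);
    return (uint32_t)w;
}
```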

SLIDE 10

Existing algorithms

Merged parallel reduction (or tree merging)

Deals with bitstrings larger than a word.

Merges the intermediate results of several parallel reductions and keeps processing just the combined result.

The degree of merging is limited by the widths of the accumulators.

Admits both scalar and vector implementations.

Example: merged parallel reduction of 3 words (wa, wb, wc)

    /* Level 1: 2-bit sums of wa and wb, then fold in the bits of wc */
    wa = (wa & 0x55555555) + ((wa >> 1) & 0x55555555);
    wb = (wb & 0x55555555) + ((wb >> 1) & 0x55555555);
    wa = wa + ( wc       & 0x55555555);
    wb = wb + ((wc >> 1) & 0x55555555);
    /* Level 2: 4-bit sums, then merge wb into wa */
    wa = (wa & 0x33333333) + ((wa >> 2) & 0x33333333);
    wb = (wb & 0x33333333) + ((wb >> 2) & 0x33333333);
    wa = wa + wb;
    /* Level 3 onwards: only the merged word is processed */
    wa = (wa & 0x0F0F0F0F) + ((wa >> 4) & 0x0F0F0F0F);
    ...
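The slide stops after level 3. Assuming the remaining levels follow the same pattern as hw_parallel, the three-word example can be completed into a self-contained function. The name hw_merged3 is hypothetical, and the comments track the largest value each field can hold to show that no field overflows.

```c
#include <stdint.h>

/* Merged parallel reduction of three 32-bit words (sketch). */
uint32_t hw_merged3(uint32_t wa, uint32_t wb, uint32_t wc)
{
    /* Level 1: 2-bit sums of wa and wb, then fold in even/odd bits of wc. */
    wa = (wa & 0x55555555) + ((wa >> 1) & 0x55555555);
    wb = (wb & 0x55555555) + ((wb >> 1) & 0x55555555);
    wa = wa + (wc & 0x55555555);           /* each 2-bit field <= 3 */
    wb = wb + ((wc >> 1) & 0x55555555);    /* each 2-bit field <= 3 */

    /* Level 2: 4-bit sums, then merge the two partial results. */
    wa = (wa & 0x33333333) + ((wa >> 2) & 0x33333333);   /* <= 6 */
    wb = (wb & 0x33333333) + ((wb >> 2) & 0x33333333);   /* <= 6 */
    wa = wa + wb;                          /* each 4-bit field <= 12 */

    /* Levels 3 to 5 operate on the merged word only. */
    wa = (wa & 0x0F0F0F0F) + ((wa >>  4) & 0x0F0F0F0F);  /* <= 24 */
    wa = (wa & 0x00FF00FF) + ((wa >>  8) & 0x00FF00FF);  /* <= 48 */
    wa = (wa & 0x0000FFFF) + ((wa >> 16) & 0x0000FFFF);  /* <= 96 */
    return wa;
}
```

Three separate calls to hw_parallel would run levels 3 to 5 three times; merging runs them once.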

SLIDE 11

Existing algorithms

Bitslicing

Transforms a (2^n − 1)-word bitstring w_0, ..., w_{2^n−2} into n words s_0, ..., s_{n−1} that preserve the hamming weight of the original bitstring in weighted form.

The implementation relies on the parallel emulation of bits_per_word bit adders by using bit-wise logical instructions.

Admits both scalar and vector implementations.

\[
\sum_{i=0}^{2^{n}-2} \mathrm{hw}(w_i) \;=\; \sum_{j=0}^{n-1} 2^{j} \cdot \mathrm{hw}(s_j)
\]
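To make the identity concrete, here is a sketch for n = 3, that is, seven 64-bit words reduced to three. It assumes the bit adders are arranged as a tree of carry-save adders, which is one common arrangement and not necessarily the one used in the talk. The names csa, hw64 and hw_slice7 are hypothetical.

```c
#include <stdint.h>

/* Full adder applied to all 64 bit positions at once (carry-save adder). */
static void csa(uint64_t a, uint64_t b, uint64_t c,
                uint64_t *sum, uint64_t *carry)
{
    uint64_t t = a ^ b;
    *sum   = t ^ c;
    *carry = (a & b) | (t & c);
}

/* Simple single-word hamming weight used for the final n words. */
static unsigned hw64(uint64_t w)
{
    unsigned cnt = 0;
    while (w != 0) {
        w &= w - 1;
        cnt++;
    }
    return cnt;
}

/* Bitslicing of 7 words into s0, s1, s2 with weights 1, 2 and 4. */
unsigned hw_slice7(const uint64_t w[7])
{
    uint64_t sa, ca, sb, cb, s0, cc, s1, s2;
    csa(w[0], w[1], w[2], &sa, &ca);
    csa(w[3], w[4], w[5], &sb, &cb);
    csa(sa, sb, w[6], &s0, &cc);   /* s0: weight-1 bits */
    csa(ca, cb, cc, &s1, &s2);     /* s1: weight-2 bits, s2: weight-4 bits */
    return hw64(s0) + 2 * hw64(s1) + 4 * hw64(s2);
}
```

Only three single-word hamming weights are computed for seven input words; the rest of the work is bit-wise logic.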

SLIDE 12

Existing algorithms

Processor support

Some processors offer a machine instruction to compute the hamming weight of a machine word

For instance: Mark II (1954), IBM Stretch (1961), CDC 6600 (1964), Cray 1 (1976), Sun SPARCv9 (1995), Alpha 21264A (1999), IBM Power5 (2004) and ARM Cortex-A8 (2005)

Since 2007, x86 processors supporting SSE4.2 offer the popcnt instruction.

It computes the hamming weight of a scalar 32-bit or 64-bit register.

                                AMD 15h             Intel Nehalem, Sandy Bridge/Haswell
                                32-bit    64-bit    32/64-bit
    Latency (cycles)            4         6         3
    Dispatch rate (inst/cyc)    1         0.25      1
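From C, the instruction is usually reached through a compiler facility rather than inline assembly. The sketch below assumes GCC or Clang and uses their __builtin_popcountll built-in, which is compiled to popcnt when the target enables it (for example with -mpopcnt or -msse4.2) and to a software fallback otherwise. The function name hw_popcnt is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hamming weight of a bitstring stored as nwords consecutive 64-bit words.
 * Assumes a GCC- or Clang-compatible compiler. */
uint64_t hw_popcnt(const uint64_t *bits, size_t nwords)
{
    uint64_t total = 0;
    for (size_t i = 0; i < nwords; i++)
        total += (uint64_t)__builtin_popcountll(bits[i]);
    return total;
}
```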

SLIDE 14

Evaluation of existing implementations

Evaluation environment

Our benchmark consists of computing the hamming weight of several randomly initialized bitstrings.

Bitstring words are located in consecutive memory locations.

We evaluate two scenarios:

  Uncached
  Cached

SLIDE 15

Evaluation of existing implementations

Evaluation environment

                              Intel Core i5-650          Intel Xeon E5-2630L
    Microarchitecture         Nehalem                    Sandy Bridge
    Frequency (max turbo)     3.2 (3.46) GHz             2 (2.5) GHz
    Cores                     2                          6
    Reorder Buffer entries    128 µ-ops                  168 µ-ops
    Scheduler entries         36 µ-ops                   54 µ-ops
    Peak dispatch rate        6 µ-ops/cycle (both platforms)
    DL1  Size and assoc.      32KB, 8-way, 64Byte lines (both platforms)
         Bandwidth            128 bits/cycle             256 bits/cycle
         In-flight loads      48                         64
         Simult. misses       10 (both platforms)
    L2                        256KB, 8-way, 64Byte lines (both platforms)
    L3                        4MB, 16-way, 64B           15MB, 20-way, 64B

SLIDE 16

Evaluation of existing implementations

Evaluated implementations

Single-word wide implementations

    Naïve       hw_naive implementation
    Mem-8       Memoization, 2^8-entry lookup table
    Mem-16      Memoization, 2^16-entry lookup table
    Par.Red.    Parallel reduction at bit level over 64-bit words
    SSE4.2      Uses 64-bit scalar instruction popcnt

Multi-word wide implementations

    Merged      Scalar merged par.red. on 30 64-bit words at level 3
    Merged-V    Vector merged par.red. on 30 128-bit words at level 3 (SSE2)
    Slice       Scalar bit slicing on 7 64-bit words
    Slice-V     Vector bit slicing on 7 128-bit words (SSE2)
    Mem-4       Vector memoization, 2^4-entry lookup table (SSSE3)
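Mem-4 is the vector half of the hybrid proposed later, so its lookup scheme is worth spelling out. The sketch below is a portable scalar model, not the SSSE3 code: it splits each byte into two nibbles and looks each one up in a 16-entry table. A vector version would presumably hold the table in one 128-bit register and perform sixteen lookups at once with a byte-shuffle instruction; that detail is an assumption, and the names hw_nibble and hw_mem4 are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* 2^4-entry table: hamming weight of every 4-bit value. */
static const uint8_t hw_nibble[16] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
};

/* Scalar model of 4-bit memoization over a byte array. */
uint64_t hw_mem4(const uint8_t *bytes, size_t nbytes)
{
    uint64_t total = 0;
    for (size_t i = 0; i < nbytes; i++)
        total += hw_nibble[bytes[i] & 0x0F] + hw_nibble[bytes[i] >> 4];
    return total;
}
```

A table this small fits in a single vector register, which is what makes the lookup feasible without memory accesses.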

SLIDE 17

Evaluation of existing implementations

Results on Nehalem platform: single-word wide/cached

SLIDE 18

Evaluation of existing implementations

Results on Nehalem platform: cached

SLIDE 19

Evaluation of existing implementations

Results on Sandy Bridge platform: cached

SLIDE 20

Evaluation of existing implementations

Results

SSE4.2 performs best.

Multi-word wide implementations outperform single-word implementations (except SSE4.2).

Vector implementations outperform scalar implementations of the same algorithm.

SLIDE 21

Evaluation of existing implementations

Conclusions

Although the scalar SSE4.2 implementation performs best...

The dispatch rate of the popcnt instruction is just 1 inst./cycle, that is, SSE4.2's peak performance is 8 bytes/cycle.

But DL1 bandwidth is 16 bytes/cycle (Nehalem) and 32 bytes/cycle (Sandy Bridge).

The SSE4.2 implementation is fully scalar and cannot exploit the unused dispatch ports to dispatch vector instructions.

We ask whether the SSE4.2 implementation can be outperformed by a hybrid implementation that makes use of both vector and scalar instructions.
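The bandwidth argument can be written out with the figures quoted on this slide. With one 64-bit popcnt dispatched per cycle, the scalar implementation consumes at most

\[
1\ \frac{\text{inst}}{\text{cycle}} \times 8\ \frac{\text{bytes}}{\text{inst}} = 8\ \frac{\text{bytes}}{\text{cycle}},
\qquad
\frac{8}{16} = 50\%\ \text{(Nehalem)},
\qquad
\frac{8}{32} = 25\%\ \text{(Sandy Bridge)}
\]

of the DL1 bandwidth, so half or more of the load bandwidth is left for a concurrent vector path.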

SLIDE 23

Proposed hybrid implementation

Design

Main idea: combine the SSE4.2 (scalar) and Mem-4 (vector) implementations into a hybrid implementation.

Distribute the bitstring words between the scalar and the vector functional units.

Steps

Iterate through the bitstring; each loop iteration processes a fixed-size chunk.

Statically distribute the chunk bytes between the scalar and vector functional units.

Design-space dimensions:

  S: number of chunk bytes processed by the scalar units
  V: number of chunk bytes processed by the vector units

Design-space exploration

Configurations (S,V) with chunk length up to 80 bytes:

(16,16), (32,16), (16,32), (48,16), (32,32), (16,48), (64,16), (48,32), (32,48) and (16,64).
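The chunking scheme can be sketched in portable C. This is a structural model only: the scalar path uses the GCC/Clang built-in __builtin_popcountll, and the vector path is stood in for by a byte-at-a-time nibble lookup, whereas the talk's implementation uses popcnt and SSSE3 vector code. All names (hw_hybrid, scalar_part, vector_part, nibble_hw) are hypothetical, and the handling of leftover bytes is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static const uint8_t nibble_hw[16] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
};

/* Scalar path: 64-bit population counts over s bytes (s multiple of 8). */
static uint64_t scalar_part(const uint8_t *p, size_t s)
{
    uint64_t total = 0;
    for (size_t i = 0; i < s; i += 8) {
        uint64_t w;
        memcpy(&w, p + i, sizeof w);   /* alignment-safe load */
        total += (uint64_t)__builtin_popcountll(w);
    }
    return total;
}

/* Stand-in for the vector path: nibble lookups, one byte at a time. */
static uint64_t vector_part(const uint8_t *p, size_t v)
{
    uint64_t total = 0;
    for (size_t i = 0; i < v; i++)
        total += nibble_hw[p[i] & 0x0F] + nibble_hw[p[i] >> 4];
    return total;
}

/* Each iteration handles one chunk of s + v bytes: the first s bytes go
 * to the scalar path, the next v bytes to the vector path. */
uint64_t hw_hybrid(const uint8_t *bits, size_t nbytes, size_t s, size_t v)
{
    assert(s % 8 == 0 && s + v > 0);
    uint64_t total = 0;
    size_t pos = 0;
    for (; pos + s + v <= nbytes; pos += s + v) {
        total += scalar_part(bits + pos, s);
        total += vector_part(bits + pos + s, v);
    }
    total += vector_part(bits + pos, nbytes - pos);   /* leftover bytes */
    return total;
}
```

Calling hw_hybrid(buf, len, 32, 48) corresponds to the (32,48) configuration. In a tuned version the two paths would be interleaved in a single loop body so that scalar and vector instructions can be dispatched in the same cycles.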

SLIDE 24

Design-space exploration

Nehalem platform: uncached

SLIDE 25

Design-space exploration

Nehalem platform: cached

SLIDE 26

Design-space exploration

Sandy Bridge platform: uncached

SLIDE 27

Design-space exploration

Sandy Bridge platform: cached

SLIDE 28

Design-space exploration

Conclusions

Some hybrid configurations outperform SSE4.2.

The performance potential is bigger on Sandy Bridge than on Nehalem.

The best hybrid configuration depends on the bitstring length.

However, we pick only one configuration for each platform: (32,32) on Nehalem and (32,48) on Sandy Bridge.

SLIDE 29

Results

Sandy Bridge platform: uncached

SLIDE 30

Results

Sandy Bridge platform: cached

SLIDE 31

Results

Sandy Bridge platform

Speedup of the (32,48) hybrid configuration with respect to SSE4.2, by bitstring length:

                       up to DL1    up to L2    up to L3    >L3
    Cached scenario    1.15         1.18        1.22        1.10

    Uncached scenario: 1.07, 1.10

SLIDE 33

Conclusions

Processors can exploit both scalar and vector parallelism, but applications expose only one kind of parallelism.

Some processor resources are not fully exploited.

Applications that admit both scalar and vector implementations may benefit from a hybrid implementation that exposes both kinds of parallelism simultaneously.

Case study: hamming weight

The (32,48) hybrid configuration outperforms what is, to the best of our knowledge, the best implementation of hamming weight by up to 1.22x on the Sandy Bridge platform.

SLIDE 34

Future work

Evaluating this technique on newer platforms (e.g. Haswell)

AVX2: vector integer instructions, 256-bit vector registers

Applying this technique to other problems
