SLIDE 1

Evaluating a Processing-in-Memory Architecture with the k-means Algorithm

Simon Bihel simon.bihel@ens-rennes.fr Lesly-Ann Daniel lesly-ann.daniel@ens-rennes.fr Florestan De Moor florestan.de-moor@ens-rennes.fr Bastien Thomas bastien.thomas@ens-rennes.fr May 4, 2017

University of Rennes I École Normale Supérieure de Rennes

SLIDE 2

With Help From…

Dominique Lavenier dominique.lavenier@irisa.fr

CNRS IRISA

David Furodet & the Upmem Team dfurodet@upmem.com

SLIDE 3

Context

  • Big-data workloads
  • End of Dennard scaling, end of Moore's law
  • Exascale: bandwidth and memory walls
  • Shift towards data-centric architectures

SLIDE 4

Table of contents

  • 1. The Upmem Architecture
  • 2. k-means Implementation for the Upmem Architecture
  • 3. Experimental Evaluation


SLIDE 5

The Upmem Architecture

SLIDE 6

Upmem architecture overview

[Diagram: a CPU connected over the DDR bus to a DIMM hosting 256 DPUs (0 … 255), each DPU paired with its own WRAM and MRAM.]

DPU: DRAM processing unit
WRAM: execution memory for programs
MRAM: main memory
DIMM: dual in-line memory module

SLIDE 7

A massively parallel architecture

Characteristics

  • Several DIMMs can be added to a CPU
  • A 16 GBytes DIMM embeds 256 DPUs
  • Each DPU can support up to 24 threads

The context is switched between DPU threads every clock cycle. The programming approach has to consider this fine-grained parallelism.



SLIDE 10

Upmem Architecture Overview

On the programming level, two programs must be specified:

  • Host program: runs on the CPU and orchestrates the execution
  • Tasklets: run on the DPUs and perform the data-intensive operations

Communication between the host and the DPUs goes through:

  • MRAM
  • Mailboxes

SLIDE 11

Drawbacks and advantages

Drawbacks: computation power

  • Frequency around 750 MHz
  • No floating point operations
  • Significant multiplication overhead (no hardware multiplier)

  • Explicit memory management

Advantages: data access

  • Parallelization power
  • Minimum latency
  • Increased bandwidth
  • Reduced power consumption


SLIDE 13

k-means Implementation for the Upmem Architecture

SLIDE 14

k-means Clustering Problem

Partition data ∈ ℝ^(n×m) into k clusters C₁ … C_k, where n is the number of points, m the number of attributes, and d the Euclidean distance:

argmin_C Σᵢ₌₁ᵏ Σ_{p∈Cᵢ} d(p, mean(Cᵢ))

Examples of applications:

  • Segmentation
  • Communities in social networks
  • Market research
  • Gene sequence analysis

SLIDE 15

k-means Standard Algorithm [6]

 1: function k-means(k, data, δ)
 2:     Choose C̃ := (c̃₁ … c̃_k) initial centroids
 3:     repeat
 4:         C := C̃
 5:         for all points p ∈ data do
 6:             j := argminᵢ d(p, cᵢ)          ▷ Find nearest cluster
 7:             Assign p to cluster Cⱼ
 8:         end for
 9:         for all i in {1 … k} do
10:             c̃ᵢ := mean(p ∈ Cᵢ)            ▷ Compute new centroids
11:         end for
12:     until ∥C̃ − C∥ ≤ δ                     ▷ Convergence criterion
13:     return C̃                              ▷ Return the final centroids
14: end function

SLIDE 16

k-means algorithm on Upmem

[Diagram: host/DPU execution flow. The host reads the data input, chooses the initial centroids and distributes the points across the DPUs; on each iteration it sends the centroids, the DPUs run the computations and the centroid update, and the host checks convergence before outputting the results.]

The points are distributed across the DPUs.

SLIDE 17

Implementation & Memory Management

  • int type used to store distances (easy to overflow)
  • MRAM layout: global variables (e.g. number of points), centers, points, new centers

SLIDE 18

Experimental Evaluation

SLIDE 19

Experimental Setup

Simulator

  • Architecture not yet manufactured
  • Cycle-Accurate simulator

Datasets

  • int values
  • Randomly generated (not uniformly: with clusters); we could not find ready-to-use large integer datasets

[Scatter plot of a generated dataset with visible clusters, both axes 200 to 1000]


SLIDE 21

Number of Threads

[Plot: runtime vs. number of threads (5 to 25), for three configurations]

  • High number of points (N=1000000, D=10, K=5)
  • High number of dimensions (N=500000, D=34, K=3)
  • High number of centroids (N=100000, D=2, K=10)

The three configurations do not share the same runtime scale.

SLIDE 22

Number of DPUs

[Plot: runtime (10 to 80 seconds) vs. number of DPUs (5 to 35)]

The total number of points is kept constant. The runtime is divided by the number of DPUs.

SLIDE 23

Comparison with sequential k-means

Dataset           16 DPUs (s)   1-core SeqC (s)   DPUs to match SeqC
Many Points       1.568         0.268             94
Many Dimensions   4.534         0.119             610
Many Centers      0.4353        0.0142            491

A large number of dimensions provides a large amount of multiplications to compute distances; a large number of centers provides a large amount of computation per memory transfer [2].

SLIDE 26

Conclusion

SLIDE 27

Conclusion

  • Ideal use case: programs with very little computation (e.g. genomic text processing [4, 5])
  • Even if there is no gain in time, power consumption might be reduced
  • Overflows when computing distances
  • Implemented k-means++ [1] with the GMP library (arbitrary-precision numbers), but the quantity of interest was the time per iteration


SLIDE 31

Going Further with the Hardware

Actual Physical Device

  • Evaluate how the program behaves at large scale
  • Impact on the DDR bus & communications

Hardware Multiplication

  • Currently: 40% of the executed instructions are multiplications, at 30 instructions per multiplication



SLIDE 34

Going Further with the k-means

Keep the distance to the current nearest centroid [3]: easy to add to our implementation (keep the distance in the DPU)

  + Avoids useless computations during the next iteration
  − Reduces the number of points per DPU

Define a border made of points that can switch cluster [7]: harder to integrate

  + Reduces the number of distance computations
  − Might involve the CPU

SLIDE 35

Thank You

SLIDE 36

References

SLIDE 37

References i

[1] D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1027-1035. Society for Industrial and Applied Mathematics, 2007.

[2] M. A. Bender, J. Berry, S. D. Hammond, B. Moore, B. Moseley, and C. A. Phillips. k-means clustering on two-level memory systems. In Proceedings of the 2015 International Symposium on Memory Systems, MEMSYS '15, pages 197-205, New York, NY, USA, 2015. ACM.

SLIDE 38

References ii

[3] A. M. Fahim, A. M. Salem, F. A. Torkey, and M. A. Ramadan. An efficient enhanced k-means clustering algorithm. Journal of Zhejiang University-SCIENCE A, 7(10):1626-1633, 2006.

[4] D. Lavenier, C. Deltel, D. Furodet, and J.-F. Roy. BLAST on UPMEM. Research Report RR-8878, INRIA Rennes - Bretagne Atlantique, Mar. 2016.

[5] D. Lavenier, C. Deltel, D. Furodet, and J.-F. Roy. MAPPING on UPMEM. Research Report RR-8923, INRIA, June 2016.

SLIDE 39

References iii

[6] S. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129-137, 1982.

[7] C. M. Poteraş, M. C. Mihăescu, and M. Mocanu. An optimized version of the k-means clustering algorithm. In Computer Science and Information Systems (FedCSIS), 2014 Federated Conference on, pages 695-699. IEEE, 2014.