Evaluating a Processing-in-Memory Architecture with the k-means Algorithm
Simon Bihel (simon.bihel@ens-rennes.fr), Lesly-Ann Daniel (lesly-ann.daniel@ens-rennes.fr), Florestan De Moor (florestan.de-moor@ens-rennes.fr), Bastien Thomas
With Help From…
Dominique Lavenier dominique.lavenier@irisa.fr
CNRS IRISA
David Furodet & the Upmem Team dfurodet@upmem.com
Context
- Big Data workloads
- End of Dennard scaling
- End of Moore's law
- Exascale: bandwidth and memory walls
- Shift towards data-centric architectures
Table of contents
- 1. The Upmem Architecture
- 2. k-means Implementation for the Upmem Architecture
- 3. Experimental Evaluation
The Upmem Architecture
Upmem architecture overview
[Diagram: a CPU connected over the DDR bus to a DIMM containing DPUs 0 to 255, each DPU with its own WRAM and MRAM]

- DPU: DRAM processing unit
- WRAM: execution memory for programs
- MRAM: main memory
- DIMM: dual in-line memory module
A massively parallel architecture
Characteristics
- Several DIMMs can be added to a CPU
- A 16 GBytes DIMM embeds 256 DPUs
- Each DPU can support up to 24 threads
The execution context is switched between DPU threads every clock cycle, so the programming approach has to take this fine-grained parallelism into account (see the sketch below).
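As a rough illustration of this model, the plain C sketch below (not the Upmem SDK; NR_TASKLETS and the per-tasklet entry point are assumed names) shows how a kernel would typically stride its points across the threads of one DPU so that the interleaved hardware contexts always have independent work.

#include <stdint.h>
#include <stdio.h>

#define NR_TASKLETS 16   /* assumed number of threads launched on one DPU (up to 24 supported) */

/* Each tasklet walks the points with a stride of NR_TASKLETS, so the
   interleaved hardware contexts always have independent work. */
static int64_t tasklet_body(unsigned tasklet_id, const int32_t *points, unsigned nr_points)
{
    int64_t local_sum = 0;                       /* per-tasklet accumulator */
    for (unsigned p = tasklet_id; p < nr_points; p += NR_TASKLETS)
        local_sum += points[p];                  /* real kernel: distance of point p to each centroid */
    return local_sum;
}

int main(void)
{
    int32_t points[64];
    for (unsigned i = 0; i < 64; i++) points[i] = (int32_t)i;

    int64_t total = 0;
    for (unsigned t = 0; t < NR_TASKLETS; t++)   /* on hardware, tasklets run concurrently */
        total += tasklet_body(t, points, 64);

    printf("%lld\n", (long long)total);          /* 2016 = 0 + 1 + ... + 63 */
    return 0;
}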
Upmem Architecture Overview
On the programming level, two programs must be specified:

- Host program (CPU): orchestrates the execution
- Tasklets (DPUs): perform the data-intensive operations

Communication between host and DPUs:
- MRAM
- Mailboxes
Drawbacks and advantages
Drawbacks: computation power
- Frequency around 750 MHz
- No floating point operations
- Significant multiplication overhead (no hardware multiplier)
- Explicit memory management
Advantages: data access
- Parallelization power
- Minimum latency
- Increased bandwidth
- Reduced power consumption
k-means Implementation for the Upmem Architecture
k-means Clustering Problem
Partition data ∈ ℝ^(n×m) into k clusters C_1 … C_k
- n (resp. m): number of points (resp. attributes)
- d: Euclidean distance

argmin_C ∑_{i=1}^{k} ∑_{p∈C_i} d(p, mean(C_i))

Examples of applications:
- Segmentation
- Communities in social networks
- Market research
- Gene sequence analysis
k-means Standard Algorithm [6]
function k-means(k, data, δ)
    Choose C̃ := (c̃_1 … c̃_k) initial centroids
    repeat
        C := C̃
        for all points p ∈ data do
            j := argmin_i d(p, c_i)        ▷ Find the nearest cluster
            Assign p to cluster C_j
        end for
        for all i ∈ {1 … k} do
            c̃_i := mean(p ∈ C_i)           ▷ Compute the new centroids
        end for
    until ‖C̃ − C‖ ≤ δ                      ▷ Convergence criterion
    return C̃                                ▷ Return the final centroids
end function
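For reference, a minimal sequential C sketch of one such iteration on integer data (function and variable names are illustrative, not taken from the presented implementation; distances are accumulated in 64 bits here, which postpones the overflow issue discussed later):

#include <stdint.h>
#include <stdlib.h>

/* One Lloyd iteration over n points of d integer attributes and k centroids. */
static void kmeans_iteration(const int32_t *points, int n, int d,
                             int32_t *centroids, int k, int *assignment)
{
    int64_t *sums = calloc((size_t)k * d, sizeof *sums);   /* per-cluster coordinate sums */
    int *counts = calloc((size_t)k, sizeof *counts);       /* per-cluster point counts */

    for (int p = 0; p < n; p++) {
        int best = 0;
        int64_t best_dist = INT64_MAX;
        for (int c = 0; c < k; c++) {                       /* find the nearest cluster */
            int64_t dist = 0;
            for (int j = 0; j < d; j++) {
                int64_t diff = (int64_t)points[p * d + j] - centroids[c * d + j];
                dist += diff * diff;                        /* one multiplication per attribute */
            }
            if (dist < best_dist) { best_dist = dist; best = c; }
        }
        assignment[p] = best;
        counts[best]++;
        for (int j = 0; j < d; j++)
            sums[best * d + j] += points[p * d + j];
    }

    for (int c = 0; c < k; c++)                             /* new centroid = mean of its points */
        if (counts[c] > 0)
            for (int j = 0; j < d; j++)
                centroids[c * d + j] = (int32_t)(sums[c * d + j] / counts[c]);

    free(sums);
    free(counts);
}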
k-means algorithm on Upmem
[Diagram: host/DPU workflow]

- Host: data input, choose the initial centroids, distribute the points across the DPUs
- Each iteration: the host sends the centroids, the DPUs start the centroids update (computations on their local points), and the host ends the centroids update
- Host: test convergence; if not converged, start a new iteration, otherwise output the results

A host-side sketch of this loop is given below.
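One plausible host-side view of the "end centroids update" step, assuming each DPU returns per-cluster partial coordinate sums and point counts so that the host can finish the update (this data split is an assumption; the slides only state that the host ends the update):

#include <stdint.h>

/* Merge the per-DPU partial results, recompute the centroids in place and
   return how far they moved (compared against the threshold δ by the caller).
   Array layouts are assumptions, not the authors' exact data format. */
static int64_t update_centroids(int nr_dpus, int k, int d,
                                const int64_t *partial_sums,   /* [dpu][cluster][dim] */
                                const int32_t *partial_counts, /* [dpu][cluster] */
                                int32_t *centroids)            /* [cluster][dim] */
{
    int64_t movement = 0;
    for (int c = 0; c < k; c++) {
        int64_t count = 0;
        for (int dpu = 0; dpu < nr_dpus; dpu++)
            count += partial_counts[dpu * k + c];
        for (int j = 0; j < d; j++) {
            int64_t sum = 0;
            for (int dpu = 0; dpu < nr_dpus; dpu++)
                sum += partial_sums[(dpu * k + c) * d + j];
            int32_t new_cj = count ? (int32_t)(sum / count) : centroids[c * d + j];
            int64_t diff = (int64_t)new_cj - centroids[c * d + j];
            movement += diff < 0 ? -diff : diff;
            centroids[c * d + j] = new_cj;
        }
    }
    return movement;
}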
Implementation & Memory Management
- Distances are stored in an int type (easy to overflow; see the example below)

MRAM contents:
- Global variables (e.g. number of points)
- Centers
- Points
- New centers
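A small example of why a 32-bit int is easy to overflow here: accumulating squared coordinate differences quickly exceeds INT32_MAX (illustrative values, not the presented datasets):

#include <stdint.h>
#include <stdio.h>

/* Squared Euclidean distance between two integer points, accumulated in
   64 bits; a 32-bit int overflows as soon as the true value exceeds INT32_MAX. */
static int64_t dist2(const int32_t *a, const int32_t *b, int d)
{
    int64_t acc = 0;
    for (int j = 0; j < d; j++) {
        int64_t diff = (int64_t)a[j] - b[j];
        acc += diff * diff;
    }
    return acc;
}

int main(void)
{
    /* 34 attributes that each differ by 100000 (illustrative values) */
    int32_t p[34], c[34];
    for (int j = 0; j < 34; j++) { p[j] = 100000; c[j] = 0; }

    int64_t d2 = dist2(p, c, 34);
    printf("true squared distance: %lld\n", (long long)d2);               /* 340000000000 */
    printf("fits in a 32-bit int?  %s\n", d2 <= INT32_MAX ? "yes" : "no"); /* no */
    return 0;
}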
Experimental Evaluation
Experimental Setup
Simulator
- Architecture not yet manufactured
- Cycle-Accurate simulator
Datasets
- Integer (int) attributes
- Randomly generated, not uniformly but with clusters, since we could not find ready-to-use large integer datasets
[Plot: the randomly generated 2-D dataset, coordinates roughly in the 200-1000 range]
Number of Threads
[Plot: runtime vs. number of threads (5 to 25), for three datasets]

Datasets with a high number of:
- points (N=1000000, D=10, K=5)
- dimensions (N=500000, D=34, K=3)
- centroids (N=100000, D=2, K=10)

The three curves are not plotted on the same runtime scale.
Number of DPUs
[Plot: runtime in seconds (10 to 80) vs. number of DPUs (5 to 35)]

The total number of points is kept constant: the runtime is divided by the number of DPUs.
Comparison with sequential k-means
Dataset: Many Points

Algorithm     16 DPUs   1-core SeqC
Runtime (s)   1.568     0.268

Faster than SeqC starting from about 94 DPUs (extrapolating from the 16-DPU runtime).
Comparison with sequential k-means
Dataset: Many Dimensions

Algorithm     16 DPUs   1-core SeqC
Runtime (s)   4.534     0.119

Faster than SeqC starting from about 610 DPUs. A large number of dimensions means a large number of multiplications to compute distances.
Comparison with sequential k-means
Dataset: Many Centers

Algorithm     16 DPUs   1-core SeqC
Runtime (s)   0.4353    0.0142

Faster than SeqC starting from about 491 DPUs. A large number of centers provides a large amount of computation per memory transfer [2].
Conclusion
Conclusion
- Ideal use case: programs with very little computation (e.g. genomic text processing [4, 5])
- Even when there is no gain in runtime, power consumption might be reduced
- Overflows when computing distances
- Implemented k-means++ [1] with the GMP library (arbitrary-precision numbers; see the sketch below), but the figure of interest remained the time per iteration
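For completeness, a minimal sketch of accumulating a squared distance with GMP's arbitrary-precision integers so it can never overflow (illustrative code, not the authors' k-means++ implementation; link with -lgmp):

#include <stdint.h>
#include <stdio.h>
#include <gmp.h>

/* Accumulate a squared Euclidean distance in an arbitrary-precision integer. */
static void dist2_gmp(mpz_t acc, const int32_t *a, const int32_t *b, int d)
{
    mpz_t diff;
    mpz_init(diff);
    mpz_set_ui(acc, 0);
    for (int j = 0; j < d; j++) {
        mpz_set_si(diff, (long)a[j] - (long)b[j]);
        mpz_addmul(acc, diff, diff);   /* acc += diff * diff, never overflows */
    }
    mpz_clear(diff);
}

int main(void)
{
    int32_t p[34], c[34];
    for (int j = 0; j < 34; j++) { p[j] = 100000; c[j] = 0; }

    mpz_t acc;
    mpz_init(acc);
    dist2_gmp(acc, p, c, 34);
    gmp_printf("%Zd\n", acc);          /* 340000000000 */
    mpz_clear(acc);
    return 0;
}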
Going Further with the Hardware
Actual Physical Device
- Evaluate how the program behaves at large scale
- Impact on the DDR bus & communications
Hardware Multiplication
- Currently: 40% of the instructions are multiplications, and each multiplication takes about 30 instructions (see the sketch below)
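The 30-instructions figure is consistent with a software shift-and-add multiplication; a straightforward C version of such a routine is sketched below (illustrative only; the DPU toolchain's actual sequence may differ).

#include <stdint.h>
#include <stdio.h>

/* Shift-and-add multiplication: with no hardware multiplier, a 32x32-bit
   product is built from up to 32 iterations of test, add and shift, which is
   on the order of the 30-instruction cost quoted above. */
static uint32_t soft_mul(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    while (b != 0) {
        if (b & 1)
            result += a;   /* add the current shifted multiplicand */
        a <<= 1;
        b >>= 1;
    }
    return result;
}

int main(void)
{
    printf("%u\n", soft_mul(1234, 5678));   /* 7006652 */
    return 0;
}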
Going Further with the k-means
Keep the distance to the current nearest centroid [3]
- Easy to add in our implementation: keep the distance in the DPU
- (+) Avoids useless computations during the next iteration
- (−) Reduces the number of points per DPU

Define a border made of points that can switch cluster [7]
- Harder to integrate
- (+) Reduces the number of distance computations
- (−) Might involve the CPU

A rough sketch of the first idea follows.
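A rough sketch of the first idea, under the usual reading of [3] (cache each point's distance to its assigned centroid and skip the full search over all centroids when that distance has not increased); function and variable names are illustrative:

#include <stdint.h>

static int64_t dist2(const int32_t *a, const int32_t *b, int d)
{
    int64_t acc = 0;
    for (int j = 0; j < d; j++) {
        int64_t diff = (int64_t)a[j] - b[j];
        acc += diff * diff;
    }
    return acc;
}

/* Assignment step with cached distances: each point keeps its current cluster
   and its last distance to that cluster's centroid.  When the distance to its
   own (updated) centroid has not increased, the full search is skipped
   (a heuristic, not an exact rule). */
static void assign_with_cache(const int32_t *points, int n, int d,
                              const int32_t *centroids, int k,
                              int *cluster, int64_t *cached_dist)
{
    for (int p = 0; p < n; p++) {
        int64_t d_own = dist2(&points[p * d], &centroids[cluster[p] * d], d);
        if (d_own <= cached_dist[p]) {       /* centroid did not move away: keep the assignment */
            cached_dist[p] = d_own;
            continue;
        }
        int best = cluster[p];               /* otherwise, fall back to the full search */
        int64_t best_dist = d_own;
        for (int c = 0; c < k; c++) {
            int64_t dc = dist2(&points[p * d], &centroids[c * d], d);
            if (dc < best_dist) { best_dist = dc; best = c; }
        }
        cluster[p] = best;
        cached_dist[p] = best_dist;
    }
}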
Thank You
References

[1] D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1027–1035. Society for Industrial and Applied Mathematics, 2007.

[2] M. A. Bender, J. Berry, S. D. Hammond, B. Moore, B. Moseley, and C. A. Phillips. k-means clustering on two-level memory systems. In Proceedings of the 2015 International Symposium on Memory Systems, MEMSYS '15, pages 197–205, New York, NY, USA, 2015. ACM.

[3] A. M. Fahim, A. M. Salem, F. A. Torkey, and M. A. Ramadan. An efficient enhanced k-means clustering algorithm. Journal of Zhejiang University-SCIENCE A, 7(10):1626–1633, 2006.

[4] D. Lavenier, C. Deltel, D. Furodet, and J.-F. Roy. BLAST on UPMEM. Research Report RR-8878, INRIA Rennes - Bretagne Atlantique, Mar. 2016.

[5] D. Lavenier, C. Deltel, D. Furodet, and J.-F. Roy. MAPPING on UPMEM. Research Report RR-8923, INRIA, June 2016.

[6] S. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.

[7] C. M. Poteraş, M. C. Mihăescu, and M. Mocanu.