

  1. Evaluating a Processing-in-Memory Architecture with the k-means Algorithm. Simon Bihel (simon.bihel@ens-rennes.fr), Lesly-Ann Daniel (lesly-ann.daniel@ens-rennes.fr), Florestan De Moor (florestan.de-moor@ens-rennes.fr), Bastien Thomas (bastien.thomas@ens-rennes.fr). May 4, 2017. University of Rennes I, École Normale Supérieure de Rennes.

  2. With Help From… Dominique Lavenier (dominique.lavenier@irisa.fr), CNRS, IRISA. David Furodet (dfurodet@upmem.com) & the Upmem team.

  3. Context: BIG DATA workloads, exascale, end of Dennard scaling, end of Moore's law, bandwidth and memory walls. Hence a shift towards data-centric architectures.

  4. Table of contents: 1. The Upmem Architecture 2. k-means Implementation for the Upmem Architecture 3. Experimental Evaluation

  5. The Upmem Architecture

  6. Upmem architecture overview. DPU: DRAM Processing Unit. DIMM: dual in-line memory module. MRAM: main memory. WRAM: execution memory for programs. [Diagram: a CPU connected over the DDR bus to a DIMM holding DPUs 0…255, each DPU with its own WRAM and MRAM.]

  7. A massively parallel architecture. Characteristics: • Several DIMMs can be added to a CPU • A 16-GByte DIMM embeds 256 DPUs • Each DPU can support up to 24 threads. The context is switched between DPU threads every clock cycle, so the programming approach has to consider this fine-grained parallelism.

  9. Upmem Architecture Overview. On a programming level, two programs must be specified: the host program (on the CPU) orchestrates the execution, and the tasklet program (on the DPUs) performs the data-intensive operations. Communication goes through the MRAM and through mailboxes.

  11. Drawbacks and advantages. Drawbacks (computation power): • Frequency around 750 MHz • No floating-point operations • Significant multiplication overhead (no hardware multiplier) • Explicit memory management. Advantages (data access): • Parallelization power • Minimal latency • Increased bandwidth • Reduced power consumption.

  13. k-means Implementation for the Upmem Architecture

  14. k-means Clustering Problem. Partition data ∈ R^(n×m) into k clusters C_1 … C_k minimizing argmin_C Σ_{i=1..k} Σ_{p ∈ C_i} d(p, mean(C_i)), where d is the Euclidean distance and n (resp. m) is the number of points (resp. attributes). Examples of applications: gene sequence analysis, market research, communities in social networks, segmentation.

  15. k-means Standard Algorithm [6]
      1:  function k-means(k, data, δ)
      2:    Choose C̃ := (c̃_1 … c̃_k) initial centroids
      3:    repeat
      4:      C := C̃
      5:      for all points p ∈ data do
      6:        j := argmin_i d(p, c_i)    ▷ Find nearest cluster
      7:        Assign p to cluster C_j
      8:      end for
      9:      for all i in {1 … k} do
      10:       c̃_i := mean(p ∈ C_i)      ▷ Compute new centroids
      11:     end for
      12:   until ‖C̃ − C‖ ≤ δ             ▷ Convergence criterion
      13:   return C̃                      ▷ Return the final centroids
      14: end function
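The loop above can be sketched in integer-only Python, mirroring the DPU constraint of having no floating-point unit; the function names, the squared-distance shortcut, and the integer-division rounding of the means are illustrative choices, not the authors' implementation.

```python
# A minimal integer-only sketch of the standard (Lloyd's) k-means loop.
# Squared distances avoid the square root, which preserves the argmin.

def squared_dist(p, c):
    # Squared Euclidean distance between two points given as int tuples.
    return sum((pi - ci) ** 2 for pi, ci in zip(p, c))

def kmeans(data, centroids, delta=0, max_iter=100):
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in data:
            j = min(range(len(centroids)),
                    key=lambda i: squared_dist(p, centroids[i]))
            clusters[j].append(p)
        # Update step: new centroid = integer-rounded mean of its cluster
        # (an empty cluster keeps its old centroid).
        new_centroids = [
            tuple(sum(coord) // len(c) for coord in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        # Convergence: stop when no centroid moved by more than delta.
        if all(squared_dist(a, b) <= delta
               for a, b in zip(centroids, new_centroids)):
            return new_centroids
        centroids = new_centroids
    return centroids

points = [(0, 0), (1, 1), (10, 10), (11, 11)]
print(kmeans(points, [(0, 0), (10, 10)]))  # → [(0, 0), (10, 10)]
```

With two well-separated groups and centroids seeded in each, the loop converges in one iteration.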

  16. k-means algorithm on Upmem. The points are distributed across the DPUs. [Flowchart: the host reads the data input, chooses the initial centroids, and distributes the points to the DPUs; each iteration, the host sends the centroids, the DPUs compute the centroids update, the host finalizes the update and tests convergence; once converged, it outputs the results.]

  17. Implementation & Memory Management. • int type to store distances (easy to overflow with distances). MRAM layout: • Global variables (e.g. number of points) • Centers • Points • New centers.
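A quick back-of-the-envelope check in Python shows why a 32-bit int distance overflows so easily; the coordinate amplitudes below are illustrative assumptions, not figures from the slides.

```python
# Worst-case squared Euclidean distance for m attributes whose values span
# [-A, A]: every attribute may differ by 2A, so the distance can reach
# m * (2A)^2.  The amplitudes used here are illustrative assumptions.

INT32_MAX = 2**31 - 1

def max_sq_dist(m, amplitude):
    # m attributes, each differing by at most 2 * amplitude.
    return m * (2 * amplitude) ** 2

# 34 attributes (the "many dimensions" dataset) with 16-bit signed values:
print(max_sq_dist(34, 2**15) > INT32_MAX)  # → True: does not fit in 32 bits
# 2-D points with values up to 1000:
print(max_sq_dist(2, 1000) > INT32_MAX)    # → False: fits comfortably
```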

  18. Experimental Evaluation

  19. Experimental Setup. Simulator: • Cycle-accurate simulator • Architecture not yet manufactured. Datasets: • Randomly generated (not uniformly, with clusters) • int values; could not find ready-to-use integer large datasets. [Figure: scatter plot of a generated 2-D dataset with visible clusters.]

  21. Number of Threads. Three datasets with respectively a high number of • points (N=1000000, D=2, K=10) • dimensions (N=100000, D=34, K=3) • centroids (N=500000, D=10, K=5). Not the same runtime scales. [Figure: runtime as a function of the number of threads, from 0 to 25.]

  22. Number of DPUs. Always the same number of points. Time is divided by the number of DPUs. [Figure: runtime dropping from about 80 seconds to a few seconds as the number of DPUs grows from 1 to 35.]

  23. Comparison with sequential k-means. Dataset: Many Points. Runtime (s): 1.568 on 16 DPUs vs 0.268 for SeqC on 1 core. Faster than SeqC with 94 DPUs.

  24. Comparison with sequential k-means. Dataset: Many Dimensions. Runtime (s): 4.534 on 16 DPUs vs 0.119 for SeqC on 1 core. Faster than SeqC with 610 DPUs. A large number of dimensions provides a large amount of multiplications to compute the distances.

  25. Comparison with sequential k-means. Dataset: Many Centers. Runtime (s): 0.4353 on 16 DPUs vs 0.0142 for SeqC on 1 core. Faster than SeqC with 491 DPUs. A large number of centers provides a large amount of computation per memory transfer [2].
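The break-even DPU counts quoted on these slides follow directly from the measured 16-DPU and SeqC runtimes, assuming runtime keeps scaling inversely with the number of DPUs (as the earlier DPU-scaling slide observed); a quick check in Python:

```python
# Break-even DPU counts implied by the measured runtimes, assuming runtime
# on n DPUs is t_16dpu * 16 / n; solve for the n where it matches SeqC.
import math

def breakeven_dpus(t_16dpu, t_seqc, dpus=16):
    return math.ceil(dpus * t_16dpu / t_seqc)

print(breakeven_dpus(1.568, 0.268))    # many points dataset     → 94
print(breakeven_dpus(4.534, 0.119))    # many dimensions dataset → 610
print(breakeven_dpus(0.4353, 0.0142))  # many centers dataset    → 491
```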

  26. Conclusion

  27. Conclusion • Ideal use case: programs with very little computation (e.g. genomic text processing [4, 5]) • Even if there is no gain in time, power consumption might be reduced • Overflows when computing distances • Implemented k-means++ [1] with the GMP library (arbitrary-precision numbers), but what was interesting was the time for one iteration.
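As a sketch of the k-means++ [1] seeding step in exact arithmetic: here Python's unbounded integers play the role of the GMP arbitrary-precision numbers, so the D(p)² weights cannot overflow. The function names and dataset are illustrative, not the authors' code.

```python
import random

def sq_dist(p, c):
    # Exact squared Euclidean distance; Python ints never overflow,
    # standing in for GMP arbitrary-precision numbers.
    return sum((a - b) ** 2 for a, b in zip(p, c))

def kmeanspp_seed(data, k, seed=0):
    rng = random.Random(seed)
    centroids = [rng.choice(data)]  # first centroid: uniform at random
    while len(centroids) < k:
        # D(p)^2: squared distance to the nearest centroid chosen so far.
        weights = [min(sq_dist(p, c) for c in centroids) for p in data]
        # Next centroid sampled with probability proportional to D(p)^2,
        # so far-away points are strongly preferred.
        centroids.append(rng.choices(data, weights=weights)[0])
    return centroids

pts = [(0, 0), (1, 0), (100, 100), (101, 100)]
print(kmeanspp_seed(pts, 2))  # very likely one seed from each cluster
```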

  31. Going Further with the Hardware. Actual physical device: • Evaluate how the program behaves at large scale • Impact on the DDR bus & communications. Hardware multiplication: • Currently, 40% of the executed instructions are multiplications, at 30 instructions per software multiplication.

  33. Going Further with the k-means. Keep the distance to the current nearest centroid [3]: easy to add in our implementation (keep the distance in the DPU); + avoids useless computations during the next iteration; − reduces the number of points per DPU. Define a border made of points that can switch cluster [7]: harder to integrate; + reduces the number of distance computations; − might involve the CPU.

  35. Thank You

  36. References

  37. References
      [1] D. Arthur and S. Vassilvitskii. k-means++: the advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1027–1035. Society for Industrial and Applied Mathematics, 2007.
      [2] M. A. Bender, J. Berry, S. D. Hammond, B. Moore, B. Moseley, and C. A. Phillips. k-means clustering on two-level memory systems. In Proceedings of the 2015 International Symposium on Memory Systems, MEMSYS '15, pages 197–205, New York, NY, USA, 2015. ACM.
