MicRun: A Framework for Scale-free Graph Algorithms on SIMD Architecture of the Xeon Phi
Jie Lin, Qingbo Wu, Yusong Tan, Jie Yu, Qi Zhang, Xiaoling Li and Lei Luo
College of Computer, National University of Defense Technology
10/7/2017
Outline
Section 1 Backgrounds & Motivation
  – Scale-free Graphs & Graph Algorithms
  – The Xeon Phi Architecture
Section 2 The MicRun Framework
  – Bucket Grouping Module
  – Auto-tuning Module
Section 3 Experiments & Conclusions
Backgrounds & Motivation
− Social Network Applications
− Chemical Molecular Structures
− Reference Citations

− The Sparsity Characteristic of Graphs
− The Connectivity of Vertices Follows a Power-law Distribution: $y = x^{-\gamma}$
[Figure: Number of Vertices versus Degree on logarithmic axes for (a) Higgs-twitter and (b) Soc-pokec]
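As a rough illustration of the power-law relation above, the sketch below estimates γ from a degree histogram with a least-squares fit on log-log axes. The function name `estimate_gamma` and the toy histogram are assumptions made for this example and are not part of MicRun.

```python
import math

def estimate_gamma(degree_counts):
    """Fit log(count) against log(degree) and return gamma = -slope.

    degree_counts maps a degree (> 0) to the number of vertices (> 0)
    that have this degree.
    """
    xs = [math.log(d) for d in degree_counts]
    ys = [math.log(c) for c in degree_counts.values()]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return -cov / var

# Toy histogram (assumed data) that follows count = 10000 * degree^(-2).
toy_histogram = {d: 10000.0 * d ** -2.0 for d in (1, 2, 4, 8, 16, 32)}
gamma = estimate_gamma(toy_histogram)
```

On real graphs the tail of the histogram is noisy, so a plain least-squares fit like this gives only a coarse estimate.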
− Load values of source vertices
− Load values of edges
− Compute (e.g., addition, minimum, etc.)
− Update destination vertices
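The four steps above can be sketched as one pass over an edge list. This is a minimal scalar sketch, not the MicRun implementation; `process_edges` and its parameters `combine` and `reduce_fn` are hypothetical names. With addition as the compute step and minimum as the update, the pass becomes a Bellman-Ford relaxation.

```python
def process_edges(edges, values, combine, reduce_fn):
    """Run one pass over (src, dst, weight) edges and return new vertex values.

    The input list `values` is read only; updates go to a copy.
    """
    new_values = list(values)
    for src, dst, weight in edges:
        src_val = values[src]                  # load value of source vertex
        candidate = combine(src_val, weight)   # load edge value and compute
        new_values[dst] = reduce_fn(new_values[dst], candidate)  # update destination
    return new_values

INF = float("inf")
edges = [(0, 1, 4.0), (0, 2, 1.0), (2, 1, 2.0)]
dist = [0.0, INF, INF]

# Bellman-Ford style: compute = addition, update = minimum.
for _ in range(len(dist) - 1):
    dist = process_edges(edges, dist, lambda v, w: v + w, min)
```

After the two passes, vertex 1 is reached through vertex 2 at cost 3.0 instead of directly at cost 4.0.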
− Architecture: Many Integrated Core (MIC)
− 512-bit VPU and four hyper-threads supported
− Frequency is more than 1.50 GHz
− Memory (GDDR5) is more than 8 GB
− 57-72 cores with optimized KNC instruction set
− Connects to the CPU via PCIe
− SIMD access locality is influenced by the access range
− Write conflicts can occur in SIMD parallelism

− Tiling: enhances data locality
− Grouping: removes parallel conflicts
− Related Citations
  – Efficient Parallel Graph Processing over CPU and MIC (Chen et al., CGO 2016): Reusing Data, Reorganization of graph
  – Optimizing scale-free SPVM on the Intel Xeon Phi (Tang et al., CGO 2015)
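The tiling and write-conflict ideas listed above can be sketched as follows. This is an illustration only: the 2D tiling by source and destination range, and the names `tile_edges` and `has_write_conflict`, are assumptions and are not taken from the cited works or from MicRun.

```python
def tile_edges(edges, tile_size):
    """Partition (src, dst) edges into tiles keyed by (src block, dst block).

    Edges in one tile touch a bounded range of vertices, which keeps the
    gathered and scattered values close together in memory.
    """
    tiles = {}
    for src, dst in edges:
        key = (src // tile_size, dst // tile_size)
        tiles.setdefault(key, []).append((src, dst))
    return tiles

def has_write_conflict(group):
    """Return True if two edges of one SIMD group update the same destination."""
    dsts = [dst for _, dst in group]
    return len(set(dsts)) != len(dsts)

edges = [(0, 1), (1, 1), (2, 3), (5, 1), (6, 7)]
tiles = tile_edges(edges, tile_size=4)
conflict = has_write_conflict(tiles[(0, 0)])  # edges (0, 1) and (1, 1) both write vertex 1
```

A group that reports a conflict cannot be executed as a single SIMD scatter, which is why the edges of a tile have to be regrouped first.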
− High Penalty when Using Greedy Grouping
− Difficult to Select the Optimal Tile Size
[Figure: (a) Time Overhead, in seconds, and (b) Memory Overhead on soc-pokec and higgs-twitter]
The MicRun Framework
− Tiling Module
− Bucket Grouping Module
− Auto-tuning Module
− Graph Algorithms
[Figure: Workflow of the MicRun Framework]
− Bucket Structure is introduced to construct groups
− Max-heap Optimization is used to improve efficiency
[Figure: (a) nnz in a tile, by source vertices (b) nnz transformed into groups using buckets, bucket number versus nnz in buckets]

Strategy                        Group1   Group2     Group3      Group4     Group5    Group6    SIMD utilization
SIMD Bucket                     7-1-2-4  11-3-9-12  14-6-10-13  15-5-8-D   16-D-D-D  NULL      16/20
Sequential (Chen et al., 2016)  1-2-3-4  5-6-7-8    9-10-11-12  13-14-D-D  15-D-D-D  16-D-D-D  16/24
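A minimal sketch of bucket-based grouping with a max-heap follows. It assumes that nnz writing to the same destination vertex conflict and therefore share a bucket, and that each group takes one nnz from each of the fullest buckets. The bucket key and the names `bucket_grouping` and `simd_utilization` are assumptions for illustration, not the exact MicRun procedure.

```python
import heapq
from collections import defaultdict

def bucket_grouping(nnz_dsts, width):
    """Build groups of at most `width` nnz with pairwise distinct destinations.

    nnz_dsts[i] is the destination vertex of nnz i.
    """
    buckets = defaultdict(list)
    for idx, dst in enumerate(nnz_dsts):
        buckets[dst].append(idx)
    # heapq is a min-heap, so sizes are negated to pop the fullest bucket first.
    heap = [(-len(items), dst) for dst, items in buckets.items()]
    heapq.heapify(heap)
    groups = []
    while heap:
        picked = [heapq.heappop(heap) for _ in range(min(width, len(heap)))]
        group = []
        for neg_size, dst in picked:
            group.append(buckets[dst].pop())
            if neg_size + 1 < 0:  # the bucket still holds nnz
                heapq.heappush(heap, (neg_size + 1, dst))
        groups.append(group)
    return groups

def simd_utilization(groups, width):
    """Fraction of SIMD lanes that hold a real nnz rather than padding."""
    return sum(len(g) for g in groups) / (len(groups) * width)

# Toy tile (assumed data): 12 nnz with bucket sizes 4, 3, 2, 2 and 1.
dsts = [0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 4]
groups = bucket_grouping(dsts, width=4)
utilization = simd_utilization(groups, width=4)
```

Draining the fullest buckets first keeps the number of groups close to the size of the largest bucket, which is the lower bound when conflicting nnz must land in different groups.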
− Extract Features Based on the Ideal Graph Application
  – The size of the adjacency matrix of a graph is related to its sparsity characteristic
  – The number of nnzs in the graph influences the overall memory usage
  – The number of nnzs in each column is related to the distribution of nnzs
  – The average stride between nnzs influences cache misses
  – The feature tuple is constructed as: (s, n, γ, NC, ST)
− Decision Tree Model is Employed
  – The training target OT is obtained by manual probing
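The feature tuple (s, n, γ, NC, ST) could be assembled roughly as sketched below. The precise feature definitions are not given on the slide, so this sketch assumes NC is the mean nnz count per non-empty column and ST is the mean column gap between consecutive nnz in row-major order. `extract_features` is a hypothetical helper, and γ is passed in rather than fitted here.

```python
def extract_features(num_vertices, nnz_coords, gamma):
    """Return (s, n, gamma, NC, ST) for a list of (row, col) nnz coordinates."""
    s = num_vertices          # dimension of the adjacency matrix
    n = len(nnz_coords)       # number of nnz
    col_counts = {}
    for _, col in nnz_coords:
        col_counts[col] = col_counts.get(col, 0) + 1
    # Assumed definition: mean nnz per non-empty column.
    nc = n / len(col_counts) if col_counts else 0.0
    # Assumed definition: mean column gap between consecutive nnz, row-major order.
    ordered = sorted(nnz_coords)
    strides = [abs(b[1] - a[1]) for a, b in zip(ordered, ordered[1:])]
    st = sum(strides) / len(strides) if strides else 0.0
    return (s, n, gamma, nc, st)

coords = [(0, 1), (0, 3), (1, 1), (2, 0)]
features = extract_features(num_vertices=4, nnz_coords=coords, gamma=2.0)
```

A tuple like this, paired with a manually probed target such as OT, would form one training sample for a decision tree classifier.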
[Equation: time model for T_total, with summations over i, j, k (upper limits p, q, t) of time terms T and nnz]
Experiments
− MIC node on the Tianhe-Ⅱ supercomputer
  – No. 1 in TOP500 (2013.6 – 2015.11)
  – 33.86 PFLOPS, 32,000 CPUs + 48,000 MICs
− The version of the Xeon Phi is 31S1P
− 57 x86 cores, 1.10 GHz, 4 hyper-threads per core
− The capacity of the L2 cache is 28.5 MB
− Intel ICC 13.0.0, -O3 enabled

− Bellman-Ford Algorithm
− PageRank Algorithm

− SNAP Dataset
− University of Florida Sparse Matrix Collection
− Grouping Time Overhead: decreases stably
− SIMD Utilization Ratio: converges to 1 faster
[Figure: (a) Time Overhead during Grouping Stage (b) SIMD Utilization by Two Grouping Strategies]

[Figure: (a) Comparison of Execution Time (b) Execution Time of Bellman-Ford (c) Execution Time of PageRank]
1.2x on average
SPEEDUP ACHIEVED BY OPT. AND AUTO. TILING OVER SEQUENTIAL TILING PERFORMANCE

                 Bellman-Ford                 PageRank
                 Opt.          Auto.          Opt.          Auto.
Datasets         Val    Size   Val    Size    Val    Size   Val    Size
lp_osa_60        1.08   1024   1.03   256     1.07   256    1.07   256
msdoor           1.11   1152   1.05   4096    1.14   512    1.14   512
rajat24          1.18   2048   1.09   256     1.09   768    1.09   768
Si87H76          1.05   128    1.05   128     1.14   128    1.03   512
higgs-twitter    1.26   896    1.13   3072    1.33   1024   1.21   640
kron-logn18      1.29   4096   1.29   4096    1.36   2048   1.25   1024

Optimal over Sequential: 1.05x ~ 1.36x
Auto-tuning over Sequential: 1.03x ~ 1.29x
Conclusions
− Grouping Module
  – Bucket structure is employed
  – Max-heap mechanism is embedded
− Auto-tuning Module
  – Decision Tree Classifier is introduced

− Enrich the built-in graph algorithms
− Extend the framework to the MIMD parallel level
The Tianhe-2 supercomputer is available online. Scientists are welcome to collaborate with us on developing new software and to access Tianhe-2 through the Internet.
You are welcome to contact us!
Email: linjie15@nudt.edu.cn
Thank you! Questions?