MicRun: A Framework for Scale-free Graph Algorithms on SIMD Architecture of the Xeon Phi. Jie Lin, Qingbo Wu, Yusong Tan, Jie Yu, Qi Zhang, Xiaoling Li and Lei Luo. College of Computer, National University of Defense Technology. 10/7/2017


SLIDE 1

MicRun: A Framework for Scale-free Graph Algorithms on SIMD Architecture of the Xeon Phi

Jie Lin, Qingbo Wu, Yusong Tan, Jie Yu, Qi Zhang, Xiaoling Li and Lei Luo
College of Computer, National University of Defense Technology

10/7/2017

SLIDE 2

Outline

Section 1 Background & Motivation

– Scale-free Graphs & Graph Algorithms
– The Xeon Phi Architecture

Section 2 The MicRun Framework

– Bucket Grouping Module
– Auto-tuning Module

Section 3 Experiments & Conclusions

SLIDE 3

Outline

Section 1 Background & Motivation

– Scale-free Graphs & Graph Algorithms
– The Xeon Phi Architecture

Section 2 The MicRun Framework

– Bucket Grouping Module
– Auto-tuning Module

Section 3 Experiments & Conclusions

SLIDE 4

Background & Motivation

  • Scale-free Graphs are Widely Used

− Social Network Applications
− Chemical Molecular Structures
− Reference Citations

  • Features of Scale-free Graphs

− The Graphs are Sparse
− The Connectivity of Vertices Follows a Power-law Distribution, y = x^(-γ)

[Figure: number of vertices versus degree on log-log axes for (a) Higgs-twitter and (b) Soc-pokec]

SLIDE 5

Background & Motivation

  • Graph Algorithms

Sequential computation steps:

− Load values of source vertices
− Load values of edges
− Compute (e.g. addition, minimum, etc.)
− Update destination vertices
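
The four sequential steps can be illustrated with one edge-by-edge relaxation pass of Bellman-Ford, one of the two algorithms used later in the experiments. This is a minimal scalar sketch, not the MicRun implementation; the function names and the (src, dst, weight) edge-list layout are assumptions made for illustration.

```python
INF = float("inf")


def relax_all_edges(edges, dist):
    """Run the four steps once for every edge (src, dst, weight)."""
    changed = False
    for src, dst, weight in edges:
        src_val = dist[src]              # 1. load value of the source vertex
        edge_val = weight                # 2. load value of the edge
        candidate = src_val + edge_val   # 3. compute (addition, then minimum)
        if candidate < dist[dst]:
            dist[dst] = candidate        # 4. update the destination vertex
            changed = True
    return changed


def bellman_ford(num_vertices, edges, source):
    """Repeat the pass until no distance changes (at most V - 1 passes)."""
    dist = [INF] * num_vertices
    dist[source] = 0.0
    for _ in range(num_vertices - 1):
        if not relax_all_edges(edges, dist):
            break
    return dist


edges = [(0, 1, 4.0), (0, 2, 1.0), (2, 1, 2.0), (1, 3, 1.0)]
print(bellman_ford(4, edges, 0))
```

PageRank fits the same pattern if step 3 accumulates a sum of contributions instead of taking a minimum.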
SLIDE 6

Background & Motivation

  • The Xeon Phi Architecture

− Architecture: Many Integrated Core (MIC)
− 512-bit VPU, four hyper-threads per core
− Frequency is more than 1.50 GHz
− Memory (GDDR5) is more than 8 GB
− 57-72 cores with the optimized KNC instruction set
− Connects to the host CPU over PCIe

SLIDE 7

Background & Motivation

  • Challenges of Executing Graph Algorithms on the Xeon Phi

− SIMD access locality is influenced by the access range
− Write conflicts can occur under SIMD parallelism
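
The write-conflict challenge can be made concrete with a small simulation. This is a sketch under stated assumptions: a 4-lane vector unit is modeled with Python lists, all lanes gather before any lane scatters, and a later lane overwrites an earlier one when two lanes target the same destination. The real outcome depends on the instruction set, and the function names are hypothetical.

```python
def simd_scatter_add(values, dst_idx, contrib):
    """Simulate one vector-wide gather, add, scatter on a copy of values."""
    out = list(values)
    gathered = [out[d] for d in dst_idx]                 # all lanes read first
    summed = [g + c for g, c in zip(gathered, contrib)]  # lane-wise addition
    for d, s in zip(dst_idx, summed):                    # then all lanes write
        out[d] = s                                       # later lane overwrites earlier lane
    return out


def scalar_add(values, dst_idx, contrib):
    """Reference result: apply the same updates one at a time."""
    out = list(values)
    for d, c in zip(dst_idx, contrib):
        out[d] += c
    return out


values = [0, 0, 0]
dst_idx = [0, 1, 1, 2]   # lanes 1 and 2 target the same destination vertex
contrib = [1, 1, 1, 1]
print(simd_scatter_add(values, dst_idx, contrib))  # [1, 1, 1]: one update is lost
print(scalar_add(values, dst_idx, contrib))        # [1, 2, 1]
```

When every lane targets a distinct destination the two functions agree, which is why edges are grouped so that no two in a group share a destination.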

SLIDE 8

Background & Motivation

  • Tiling-and-Grouping Strategy is Commonly Used

− Tiling → Enhances data locality
− Grouping → Removes parallel conflicts
− Related Citations

▪ Efficient Parallel Graph Processing over CPU and MIC (Chen et al., CGO 2016)
▪ Reusing Data Reorganization of Graph Applications (Jiang et al., IPDPS 2016)
▪ Optimizing Scale-free SpMV on the Intel Xeon Phi (Tang et al., CGO 2015)
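
The tiling half of the strategy can be sketched as a partition of the nonzeros (edges) into square blocks, so that each block touches a bounded range of source and destination vertices. The function name and the tile key layout are assumptions for illustration; the cited works may organize tiles differently.

```python
from collections import defaultdict


def tile_edges(edges, tile_size):
    """Partition (src, dst, value) nonzeros into square tiles.

    Assumed layout: a tile is keyed by (src // tile_size, dst // tile_size),
    so every nonzero in one tile touches at most tile_size distinct sources
    and tile_size distinct destinations.
    """
    tiles = defaultdict(list)
    for src, dst, val in edges:
        tiles[(src // tile_size, dst // tile_size)].append((src, dst, val))
    return dict(tiles)


edges = [(0, 1, 1.0), (1, 5, 1.0), (6, 2, 1.0), (7, 7, 1.0)]
for key, members in sorted(tile_edges(edges, tile_size=4).items()):
    print(key, members)
```

A smaller tile size narrows the access range per tile but produces more tiles, which is the trade-off behind the tile-size selection problem.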

SLIDE 9

Background & Motivation

  • New Challenges Appear

− High Penalty when Using Greedy Grouping
− Difficult to Select the Optimal Tile Size

[Figure: (a) Time Overhead: blocking time and grouping time in seconds for soc-pokec and higgs-twitter; (b) Memory Overhead: file size (MB) versus tile size (128 to 16384) for soc-pokec and higgs-twitter]

SLIDE 10

Outline

Section 1 Background & Motivation

– Scale-free Graphs & Graph Algorithms
– The Xeon Phi Architecture

Section 2 The MicRun Framework

– Bucket Grouping Module
– Auto-tuning Module

Section 3 Experiments & Conclusions

SLIDE 11

The MicRun Framework

  • Overview of the Framework and the Modules

− Tiling Module
− Bucket Grouping Module
− Auto-tuning Module
− Graph Algorithms

[Figure: Workflow of the MicRun Framework]

SLIDE 12

The MicRun Framework

  • Grouping Module

− Bucket Structure is introduced to construct groups
− Max-heap Optimization is used to improve efficiency

[Figure: (a) nnz in a tile, numbered 1 to 16 and plotted by source vertex and destination vertex; (b) nnz transformed into groups using buckets, showing the nnz held in buckets 1 to 8]

Groups formed from the 16 nnz, four lanes per group (D marks an unused lane):

Strategy                  Group1    Group2     Group3      Group4     Group5    Group6    SIMD utilization
Bucket                    7-1-2-4   11-3-9-12  14-6-10-13  15-5-8-D   16-D-D-D  NULL      16/20
Sequential (Chen. 2016)   1-2-3-4   5-6-7-8    9-10-11-12  13-14-D-D  15-D-D-D  16-D-D-D  16/24

Complexity annotation (shown twice on the slide): O(n^2)

SLIDE 13

The MicRun Framework

  • Grouping Module

− Bucket Structure is introduced to construct groups
− Max-heap Optimization is used to improve efficiency

[Figure: (a) nnz in a tile; (b) nnz transformed into groups using buckets]

Complexity annotations: O(n^2) and O(n*log(b))
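
One way to read the bucket structure and max-heap optimization is sketched here. It rests on assumptions not confirmed by the slides: one bucket per destination vertex, each SIMD group taking at most one nonzero from each of the fullest buckets, and a heap keyed by remaining bucket size. The function names are hypothetical, and this is not presented as the authors' exact procedure.

```python
import heapq
from collections import defaultdict


def bucket_grouping(nnz, width):
    """Group (src, dst) nonzeros into conflict-free SIMD groups.

    Assumption: nonzeros sharing a destination go in the same bucket, and
    a group draws at most one nonzero from any bucket.
    """
    buckets = defaultdict(list)
    for src, dst in nnz:
        buckets[dst].append((src, dst))

    # heapq is a min-heap, so sizes are negated to pop the fullest bucket first.
    heap = [(-len(items), dst) for dst, items in buckets.items()]
    heapq.heapify(heap)

    groups = []
    while heap:
        picked = [heapq.heappop(heap) for _ in range(min(width, len(heap)))]
        group = []
        for neg_size, dst in picked:
            group.append(buckets[dst].pop())
            if neg_size + 1 < 0:  # bucket still holds nonzeros
                heapq.heappush(heap, (neg_size + 1, dst))
        groups.append(group)
    return groups


def simd_utilization(groups, width):
    """Fraction of SIMD lanes that carry a real nonzero."""
    used = sum(len(g) for g in groups)
    return used / (len(groups) * width)


nnz = [(0, 0), (1, 0), (2, 0), (3, 1), (4, 1), (5, 2), (6, 3)]
groups = bucket_grouping(nnz, width=2)
print(len(groups), simd_utilization(groups, width=2))
```

Each nonzero causes at most one pop and one push on a heap of at most b buckets, which is consistent with the O(n*log(b)) annotation. Applied to the group counts reported for the 16-nnz example, the same utilization ratio gives 16/20 = 0.8 for bucket grouping and 16/24 ≈ 0.67 for sequential grouping.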

SLIDE 14

The MicRun Framework

  • Auto-tuning Module

− Extract Features Based on the Ideal Graph Application

▪ The size of the graph's adjacency matrix is related to its sparsity
▪ The number of nnz in the graph influences the total memory footprint
▪ The number of nnz in each column is related to the distribution of the nnz
▪ The average stride between nnz influences the cache miss rate
▪ The feature tuple is constructed as: (s, n, γ, NC, ST)

− Decision Tree Model is Employed

▪ The training target OT is obtained by manual probing

[Equation garbled in extraction: an expression for T_total as a sum over indices i, j, k with upper limits p, q, t; the remaining symbols (c, r, nc, g, comp, s, int, float, nnz) cannot be reliably reassembled]
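
The feature tuple (s, n, γ, NC, ST) could be computed from an edge list along the following lines. Every definition here is an assumption, since the slide gives only one-line descriptions: s is taken as the matrix dimension, n as the nnz count, NC as the mean nnz per non-empty column, ST as the mean gap between row-major positions of the nnz, and γ as the standard continuous maximum-likelihood estimate of a power-law exponent over column counts. The function name is hypothetical.

```python
import math
from collections import Counter


def extract_features(num_vertices, edges):
    """Return an illustrative (s, n, gamma, NC, ST) tuple for (src, dst) edges."""
    if not edges:
        raise ValueError("at least one edge is required")

    s = num_vertices                          # adjacency matrix dimension
    n = len(edges)                            # number of nonzeros
    col_counts = Counter(dst for _, dst in edges)
    nc = n / len(col_counts)                  # mean nnz per non-empty column

    # Assumed estimator: gamma = 1 + m / sum(ln(d / d_min)) over column counts.
    degrees = list(col_counts.values())
    d_min = min(degrees)
    log_sum = sum(math.log(d / d_min) for d in degrees)
    gamma = 1.0 + len(degrees) / log_sum if log_sum > 0 else float("inf")

    # Assumed stride: mean gap between row-major positions of sorted nonzeros.
    positions = sorted(src * num_vertices + dst for src, dst in edges)
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    st = sum(gaps) / len(gaps) if gaps else 0.0
    return s, n, gamma, nc, st


print(extract_features(4, [(0, 1), (1, 1), (2, 1), (3, 2)]))
```

A decision tree trained on such tuples, with manually probed tile sizes as the target, would then predict a tile size for an unseen graph; the training step is not shown here.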


SLIDE 15

Outline

Section 1 Background & Motivation

– Scale-free Graphs & Graph Algorithms
– The Xeon Phi Architecture

Section 2 The MicRun Framework

– Bucket Grouping Module
– Auto-tuning Module

Section 3 Experiments & Conclusions

SLIDE 16

Experiments

  • Platform

− MIC node on the Tianhe-Ⅱ supercomputer
− Xeon Phi model 31S1P
− 57 x86 cores at 1.10 GHz, 4 hyper-threads per core
− L2 cache capacity of 28.5 MB
− Intel ICC 13.0.0 with -O3 enabled

  • Graph Applications

− Bellman-Ford Algorithm
− PageRank Algorithm

  • Datasets

− SNAP Dataset
− University of Florida Sparse Matrix Collection

SLIDE 17

Experiments

  • College of Computer of NUDT
  • Hometown of Supercomputers: Tianhe-Ⅱ

– No. 1 in TOP500 (2013.6 – 2015.11)
– 33.86 PFLOPS, 32,000 CPUs + 48,000 MICs

SLIDE 18

Experiments

  • Bucket Grouping vs. Seq. Grouping (Chen. 2016)

− Grouping Time Overhead: decreases stably
− SIMD Utilization Ratio: converges to 1 faster

[Figure: (a) Time Overhead during Grouping Stage; (b) SIMD Utilization by the two Grouping Strategies]

SLIDE 19

Experiments

  • The Execution of Two Graph Algorithms

[Figure: (a) Comparison of Execution Time; (b) Execution Time of Bellman-Ford; (c) Execution Time of PageRank]

1.2x on Average

SLIDE 20

Experiments

  • The Performance of the Auto-tuning Module

Speedup achieved by OPT. and AUTO. tiling over sequential tiling performance (Val = speedup, Size = tile size)

                 Bellman-Ford                      PageRank
                 OPT. vs. SEQ.    AUTO. vs. SEQ.   OPT. vs. SEQ.    AUTO. vs. SEQ.
Datasets         Val     Size     Val     Size     Val     Size     Val     Size
lp_osa_60        1.08    1024     1.03    256      1.07    256      1.07    256
msdoor           1.11    1152     1.05    4096     1.14    512      1.14    512
rajat24          1.18    2048     1.09    256      1.09    768      1.09    768
Si87H76          1.05    128      1.05    128      1.14    128      1.03    512
higgs-twitter    1.26    896      1.13    3072     1.33    1024     1.21    640
kron-logn18      1.29    4096     1.29    4096     1.36    2048     1.25    1024

Optimal over Sequential: 1.05x ~ 1.36x
Auto-tuning over Sequential: 1.03x ~ 1.29x

SLIDE 21

Conclusions

  • The MicRun Framework

− Grouping Module

▪ Bucket structure is employed
▪ Max-heap mechanism is embedded

− Auto-tuning Module

▪ Decision Tree Classifier is introduced

  • Future work

− Enrich the built-in graph algorithms
− Extend the framework to MIMD-level parallelism

SLIDE 22

The Tianhe-2 supercomputer is available online. Scientists are welcome to collaborate with us on developing new software and to access Tianhe-2 through the Internet.

You are welcome to contact us!

Email: linjie15@nudt.edu.cn

SLIDE 23

Thank you! Questions?