the Xeon Phi Jie Lin, Qingbo Wu, Yusong Tan, Jie Yu, Qi Zhang, - PowerPoint PPT Presentation

MicRun: A Framework for Scale-free Graph Algorithms on SIMD Architecture of the Xeon Phi
Jie Lin, Qingbo Wu, Yusong Tan, Jie Yu, Qi Zhang, Xiaoling Li and Lei Luo
College of Computer, National University of Defense Technology
10/7/2017

Outline
Section 1 Background & Motivation
– Scale-free Graphs & Graph Algorithms
– The Xeon Phi Architecture
Section 2 The MicRun Framework
– Bucket Grouping Module
– Auto-tuning Module
Section 3 Experiments & Conclusions

Background & Motivation
• Scale-free Graphs are Widely Used
− Social Network Applications
− Chemical Molecular Structures
− Reference Citations
• Features of Scale-free Graphs
− Graphs are Sparse
− The Connectivity (Degree) of Vertices Follows a Power-law Distribution
[Figure: number of vertices vs. degree on log-log axes, with fit y = x^(−γ), for (a) Higgs-twitter and (b) Soc-pokec]

Background & Motivation
• Graph Algorithms
  ◦ Sequential computation steps
− Load values of source vertices
− Load values of edges
− Compute (e.g., addition, minimum, etc.)
− Update destination vertices
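The four steps can be illustrated with a scalar edge-relaxation pass of the kind used by Bellman-Ford, one of the two algorithms evaluated later. This is only a sketch: the names `relax_edges` and `bellman_ford` and the `(src, dst, weight)` edge layout are assumptions made for illustration, not part of the MicRun code.

```python
import math

def relax_edges(dist, edges):
    """One sequential pass over all edges; returns True if any value changed."""
    changed = False
    for src, dst, weight in edges:       # 2. load the edge value (weight)
        src_val = dist[src]              # 1. load the source vertex value
        candidate = src_val + weight     # 3. compute: addition ...
        if candidate < dist[dst]:        #    ... followed by minimum
            dist[dst] = candidate        # 4. update the destination vertex
            changed = True
    return changed

def bellman_ford(num_vertices, edges, source):
    """Repeat the relaxation pass until no vertex value changes."""
    dist = [math.inf] * num_vertices
    dist[source] = 0.0
    for _ in range(num_vertices - 1):
        if not relax_edges(dist, edges):
            break
    return dist

# Hypothetical 4-vertex graph: (source, destination, weight)
edges = [(0, 1, 4.0), (0, 2, 1.0), (2, 1, 2.0), (1, 3, 5.0)]
distances = bellman_ford(4, edges, 0)
```

Each loop iteration touches one edge at a time, which is what a SIMD version has to restructure: several edges are processed per instruction, so their destination updates happen together.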

Background & Motivation
• The Xeon Phi Architecture
− Architecture: Many Integrated Core (MIC)
− 512-bit VPU and four hyper-threads supported
− Frequency is more than 1.50 GHz
− Memory (GDDR5) is more than 8 GB
− 57-72 cores with an optimized KNC instruction set
− Connects to the CPU over PCIe

Background & Motivation
• Challenges of Executing Graph Algorithms on the Phi
− SIMD access locality is influenced by the access range
− Write conflicts can occur under SIMD parallelism
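The write-conflict problem can be seen in a small simulation. The sketch below models a 4-lane gather / compute / scatter step in plain Python; the function names and the "last lane to write wins" scatter order are assumptions of this model, not a description of the actual hardware behaviour.

```python
def simd_min_update(values, dst_idx, candidates):
    """Simulate one SIMD step: all lanes gather, compute, then scatter."""
    gathered = [values[d] for d in dst_idx]                       # every lane reads the old value
    results = [min(g, c) for g, c in zip(gathered, candidates)]   # lane-wise minimum
    for d, r in zip(dst_idx, results):                            # scatter: a later lane overwrites an earlier one
        values[d] = r
    return values

def scalar_min_update(values, dst_idx, candidates):
    """Reference result: apply the same updates one at a time."""
    for d, c in zip(dst_idx, candidates):
        values[d] = min(values[d], c)
    return values

start = [10.0, 10.0, 10.0]
dst_idx = [0, 1, 1, 2]              # lanes 1 and 2 both target vertex 1
candidates = [7.0, 3.0, 5.0, 9.0]

simd_result = simd_min_update(list(start), dst_idx, candidates)
scalar_result = scalar_min_update(list(start), dst_idx, candidates)
```

In this model vertex 1 ends up with 5.0 in the simulated SIMD step but 3.0 in the scalar reference, because the two lanes never see each other's update. Grouping edges so that no two lanes share a destination avoids the problem.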

Background & Motivation
• Tiling-and-Grouping Strategy is Commonly Used
− Tiling: enhances data locality
− Grouping: removes parallel conflicts
− Related Citations
  ◦ Efficient Parallel Graph Processing over CPU and MIC (Chen et al. CGO. 2016)
  ◦ Reusing Data Reorganization of Graph Applications (Jiang et al. IPDPS. 2016)
  ◦ Optimizing scale-free SpMV on the Intel Xeon Phi (Tang et al. CGO 2015)
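A minimal sketch of the two steps follows, assuming edges are `(src, dst)` pairs and that a conflict means two edges in one group share a destination. `tile_edges` and `greedy_group` are hypothetical helpers; the cited works may tile and group differently.

```python
from collections import defaultdict

def tile_edges(edges, tile_size):
    """Tiling: bucket edges by (source block, destination block) for locality."""
    tiles = defaultdict(list)
    for src, dst in edges:
        tiles[(src // tile_size, dst // tile_size)].append((src, dst))
    return dict(tiles)

def greedy_group(tile, simd_width):
    """Grouping: first-fit placement into groups with no repeated destination."""
    groups = []
    for src, dst in tile:
        for group in groups:
            # A group accepts the edge if it has a free lane and no conflict.
            if len(group) < simd_width and all(dst != d for _, d in group):
                group.append((src, dst))
                break
        else:
            groups.append([(src, dst)])   # no existing group fits: open a new one
    return groups

edges = [(0, 1), (2, 1), (3, 0), (1, 5), (4, 5), (6, 7)]
tiles = tile_edges(edges, tile_size=4)
groups = greedy_group(tiles[(0, 0)], simd_width=4)
```

The first-fit search scans existing groups for every edge, so its cost grows quadratically with the number of edges in a tile in the worst case.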

Background & Motivation
• New Challenges Appear
− High Penalty when Using Greedy Grouping
− Difficult to Select the Optimal Tile Size
[Figure: (a) Time Overhead: blocking time and grouping time in seconds for soc-pokec and higgs-twitter; (b) Memory Overhead: file size in MB for soc-pokec and higgs-twitter. Axis values cover the original graph ("orig") and tile sizes 128 to 16384]

The MicRun Framework
• Overview of the Framework and the Modules
− Tiling Module
− Bucket Grouping Module
− Auto-tuning Module
− Graph Algorithms
[Figure: Workflow of the MicRun Framework]

The MicRun Framework
• Grouping Module
− Bucket Structure is introduced to construct groups
− Max-heap Optimization is used to improve efficiency
[Figure: (a) nnz in a tile, with source vertices and destination vertices as axes; (b) nnz transformed into groups using buckets, showing the nnz in each bucket by bucket number; annotated O(n^2)]

Strategy                   Group1    Group2     Group3      Group4     Group5    Group6    Lanes used
SIMD Bucket                7-1-2-4   11-3-9-12  14-6-10-13  15-5-8-D   16-D-D-D  NULL      16/20
Sequential (Chen. 2016)    1-2-3-4   5-6-7-8    9-10-11-12  13-14-D-D  15-D-D-D  16-D-D-D  16/24
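The two ratios on this slide, 16/20 and 16/24, can be reproduced by counting lanes, assuming that `D` marks a lane with no nonzero assigned to it. That reading of `D` and the helper name `simd_utilization` are assumptions made for this sketch.

```python
def simd_utilization(groups, pad="D"):
    """Return (useful lanes, total lanes) for groups written like '15-5-8-D'."""
    lanes = [lane for group in groups for lane in group.split("-")]
    useful = sum(1 for lane in lanes if lane != pad)
    return useful, len(lanes)

# Groups copied from the slide; the NULL group of the bucket strategy is left out.
bucket_groups = ["7-1-2-4", "11-3-9-12", "14-6-10-13", "15-5-8-D", "16-D-D-D"]
sequential_groups = ["1-2-3-4", "5-6-7-8", "9-10-11-12",
                     "13-14-D-D", "15-D-D-D", "16-D-D-D"]

bucket_util = simd_utilization(bucket_groups)
sequential_util = simd_utilization(sequential_groups)
```

Both strategies place the same 16 nonzeros, but the bucket strategy needs five groups instead of six, so fewer lanes are wasted.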

The MicRun Framework
• Grouping Module
− Bucket Structure is introduced to construct groups
− Max-heap Optimization is used to improve efficiency
[Figure: (a) nnz in a tile; (b) nnz transformed into groups using buckets]
Complexity annotations on the slide: O(n^2) and O(n * log(b))
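One way a bucket structure and a max-heap could be combined is sketched below. It assumes one bucket per destination vertex, so that entries in the same bucket are exactly the ones that conflict, and it always draws from the fullest buckets first. This is a guess at the general idea under those assumptions, not the authors' implementation; `bucket_group` is a hypothetical name.

```python
import heapq
from collections import defaultdict

def bucket_group(tile, simd_width):
    """Build conflict-free groups by taking one edge from each of the fullest buckets."""
    # One bucket per destination vertex: edges in the same bucket conflict.
    buckets = defaultdict(list)
    for src, dst in tile:
        buckets[dst].append((src, dst))

    # heapq is a min-heap, so sizes are negated to pop the fullest bucket first.
    heap = [(-len(items), dst) for dst, items in buckets.items()]
    heapq.heapify(heap)

    groups = []
    while heap:
        # Pop up to simd_width distinct buckets for this group.
        picked = [heapq.heappop(heap) for _ in range(min(simd_width, len(heap)))]
        group = []
        for neg_size, dst in picked:
            group.append(buckets[dst].pop())
            if neg_size + 1 < 0:                       # bucket still holds edges
                heapq.heappush(heap, (neg_size + 1, dst))
        groups.append(group)
    return groups

tile = [(0, 1), (2, 1), (3, 1), (4, 2), (5, 2), (6, 3), (7, 4)]
groups = bucket_group(tile, simd_width=4)
```

In this sketch every edge costs one heap pop and at most one push, so with n edges and b buckets the work is O(n * log(b)), compared with the quadratic first-fit search of a sequential scan.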

The MicRun Framework
• Auto-tuning Module
− Extract Features Based on the Ideal Graph Application
[Equation not recoverable from the slide text: a total time T_total that appears to be written as three sums, over i = 1..p, j = 1..q and k = 1..t]
  ◦ The size of the adjacency matrix of a graph is related to its sparsity
  ◦ The number of nnz in the graph influences overall memory usage
  ◦ The number of nnz in each column is related to the distribution of the nnz
  ◦ The average stride between nnz influences cache misses
  ◦ The feature tuple is constructed as: (s, n, γ, N_C, S_T)
− Decision Tree Model is Employed
  ◦ The training target OT is obtained by manual probing
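A rough sketch of how four of the five features might be computed from an edge list is given below. The mapping of the symbols s, n, N_C and S_T to these exact definitions is an assumption, since the slide does not define them precisely, and γ is left out because estimating it needs a power-law fit. `extract_features` is a hypothetical helper.

```python
from collections import Counter

def extract_features(num_vertices, edges):
    """Compute assumed versions of s, n, N_C and S_T from (src, dst) pairs."""
    nnz = len(edges)

    # Assumed N_C: average number of nonzeros per non-empty column.
    per_column = Counter(dst for _, dst in edges)
    avg_nnz_per_column = nnz / len(per_column) if per_column else 0.0

    # Assumed S_T: average column distance between neighbouring nonzeros in a row.
    rows = {}
    for src, dst in edges:
        rows.setdefault(src, []).append(dst)
    strides = []
    for cols in rows.values():
        cols.sort()
        strides.extend(b - a for a, b in zip(cols, cols[1:]))
    avg_stride = sum(strides) / len(strides) if strides else 0.0

    return {"s": num_vertices, "n": nnz, "N_C": avg_nnz_per_column, "S_T": avg_stride}

edges = [(0, 1), (0, 5), (1, 2), (2, 1), (2, 3), (2, 9)]
features = extract_features(10, edges)
```

A tuple like this, extended with γ, would be the input row for a decision tree classifier whose label is the manually probed target OT.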

Experiments
• Platform
− MIC node on the Tianhe-Ⅱ supercomputer
− The version of the Xeon Phi is 31S1P
− 57 x86 cores, 1.10 GHz, 4 hyper-threads per core
− The capacity of the L2 cache is 28.5 MB
− Intel ICC 13.0.0, -O3 enabled
• Graph Applications
− Bellman-Ford Algorithm
− PageRank Algorithm
• Datasets
− SNAP Dataset
− University of Florida Sparse Matrix Collection

Experiments
• College of Computer of NUDT
• Hometown of Supercomputers: Tianhe-Ⅱ
– No. 1 in TOP500 (2013.6 – 2015.11)
– 33.86 PFLOPS, 32,000 CPUs + 48,000 MICs

Experiments
• Bucket Grouping vs. Seq. Grouping (Chen. 2016)
− Grouping Time Overhead
− SIMD Utilization Ratio
[Figure: (a) Time Overhead during Grouping Stage; (b) SIMD Utilization by Two Grouping Strategies. Slide annotations: "Decrease stably", "Converge to 1 faster"]

Experiments
• The Execution of Two Graph Algorithms
[Figure: (a) Comparison of Execution Time; (b) Execution Time of Bellman-Ford; (c) Execution Time of PageRank. Slide annotation: "1.2x on Average"]

Experiments
• The Performance of the Auto-tuning Module

SPEEDUP ACHIEVED BY OPT. AND AUTO. TILING OVER SEQUENTIAL TILING PERFORMANCE

                 Bellman-Ford                      PageRank
                 OPT. vs. SEQ.    AUTO. vs. SEQ.   OPT. vs. SEQ.    AUTO. vs. SEQ.
Datasets         Val     Size     Val     Size     Val     Size     Val     Size
lp_osa_60        1.08    1024     1.03    256      1.07    256      1.07    256
msdoor           1.11    1152     1.05    4096     1.14    512      1.14    512
rajat24          1.18    2048     1.09    256      1.09    768      1.09    768
Si87H76          1.05    128      1.05    128      1.14    128      1.03    512
higgs-twitter    1.26    896      1.13    3072     1.33    1024     1.21    640
kron-logn18      1.29    4096     1.29    4096     1.36    2048     1.25    1024

Optimal over Sequential: 1.05x ~ 1.36x
Auto-tuning over Sequential: 1.03x ~ 1.29x
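The two summary ranges follow directly from the speedup columns of the table. The short check below copies the "Val" columns as they appear on the slide; the dictionary layout is only a convenience for this sketch.

```python
# Speedups per dataset: (Bellman-Ford OPT, Bellman-Ford AUTO, PageRank OPT, PageRank AUTO)
speedups = {
    "lp_osa_60":     (1.08, 1.03, 1.07, 1.07),
    "msdoor":        (1.11, 1.05, 1.14, 1.14),
    "rajat24":       (1.18, 1.09, 1.09, 1.09),
    "Si87H76":       (1.05, 1.05, 1.14, 1.03),
    "higgs-twitter": (1.26, 1.13, 1.33, 1.21),
    "kron-logn18":   (1.29, 1.29, 1.36, 1.25),
}

# Pool both algorithms, as the summary lines on the slide do.
opt = [v for bf_opt, _, pr_opt, _ in speedups.values() for v in (bf_opt, pr_opt)]
auto = [v for _, bf_auto, _, pr_auto in speedups.values() for v in (bf_auto, pr_auto)]

opt_range = (min(opt), max(opt))
auto_range = (min(auto), max(auto))
```

The ranges come out as 1.05 to 1.36 for the manually probed optimum and 1.03 to 1.29 for auto-tuning, matching the summary lines.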

Conclusions
• The MicRun Framework
− Grouping Module
  ◦ Bucket structure is employed
  ◦ Max-heap mechanism is embedded
− Auto-tuning Module
  ◦ Decision Tree Classifier is introduced
• Future work
− Enrich the set of built-in graph algorithms
− Extend the framework to MIMD-level parallelism

The Tianhe-2 supercomputer is available online. Scientists are welcome to collaborate with us to develop new software and to access Tianhe-2 through the Internet. You are welcome to contact us!
Email: linjie15@nudt.edu.cn

Thank you! Questions?

