PerfExpert
Jim Browne, Ashay Rane and Leo Fialho
Petascale Tools Workshop, Madison WI, 2013

Agenda
1. Introduction
2. PerfExpert Modular Architecture
3. Understanding and Extending PerfExpert
4. Conclusions

Overview: why PerfExpert?
- Problem: HPC systems operate far below peak
  - Chip/node architectural complexity is growing rapidly
  - Performance optimization for these chips requires deep knowledge of architectures, code patterns, compilers, etc.
- Performance optimization tools
  - Powerful in the hands of experts
  - Require detailed performance and system expertise
- HPC application developers are domain experts, not computer gurus
  - Many HPC programmers/users do not use your tools (seriously)

Goal for PerfExpert: democratize optimization!
- Subgoals:
  - Make use of the tool as simple as possible
  - Start with only chip/node level optimization
  - Make it adaptable across multiple architectures
- How to accomplish?
  - Formulate the performance optimization task as a workflow of subtasks
  - Leverage the state-of-the-art: build on the best available tools for the subtasks to minimize the effort and cost of development
  - Automate the entire workflow

Introduction
The four stages of automatic performance optimization:
1. Measurement and attribution
2. Analysis, diagnosis and identification of bottlenecks
3. Selection of effective optimizations
4. Implementation of optimizations
Use of the state-of-the-art:
- Stage 1: HPCToolkit/Intel VTune, MACPO (based on ROSE)
- Stages 2 and 3: PerfExpert team
- Stage 4: PerfExpert team, based on ROSE, PIPS, Bison and Flex

Introduction
Uniqueness of PerfExpert:
- Nearly complete coverage of the first three stages of optimization at the chip/node level
- The framework for implementing optimizations is complete, and several optimizations are already implemented
- Integrates code-segment-focused and data-structure-based measurements (MACPO)
  - Code segment local measurement
  - Data structure specific traces
  - More accurate (associative) cache models
  - Strides by data structure and code segment
  - Architecture “independent” metrics

What can PerfExpert provide to you?
- Performance report:
  - Identification of bottlenecks by relevance
  - Performance analysis based on performance metrics
  - Recommendations for optimization
- There are three possible outputs:
  - Performance report only
  - List of recommendations
  - Fully automated code transformation

Performance Report

    Loop in function compute() at mm.c:8 (99.8% of the total runtime)
    ===============================================================================
    ratio to total instrns       %  0.........25...........50.........75........100
       - floating point      : 100  ***********************************************
       - data accesses       :  25  ************
    * GFLOPS (% max)         :  12  ******
       - packed              :   0  *
       - scalar              :  12  ******
    -------------------------------------------------------------------------------
    performance assessment    LCPI  good......okay......fair......poor......bad....
    * overall                : 3.0  >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>+
    upper bound estimates
    * data accesses          : 9.6  >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>+
       - L1d hits            : 0.9  >>>>>>>>>>>>>>>>>
       - L2d hits            : 1.8  >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
       - L2d misses          : 6.9  >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>+
    * instruction accesses   : 0.1  >
       - L1i hits            : 0.0  >
       - L2i hits            : 0.0  >
       - L2i misses          : 0.1  >
    * data TLB               : 4.6  >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>+
    * instruction TLB        : 0.0  >
    * branch instructions    : 0.1  >>
       - correctly predicted : 0.1  >>
       - mispredicted        : 0.0  >
    * floating-point instr   : 5.1  >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>+
       - fast FP instr       : 5.1  >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>+
       - slow FP instr       : 0.0  >
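In the report, each category under "upper bound estimates" equals the sum of its sub-items (for data accesses, 0.9 + 1.8 + 6.9 = 9.6). A minimal sketch of how such a bound could be assembled is given here in C. It assumes that each sub-item is an event count multiplied by a per-event latency and divided by the instruction count; that formula, the type and function names, and every number used with it are illustrative assumptions, not PerfExpert's actual code or measured values.

```c
#include <assert.h>
#include <stddef.h>

/* One contribution to an LCPI upper bound.
 * Field names are hypothetical, chosen for this sketch only. */
typedef struct {
    const char *name;      /* label, e.g. "L1d hits" */
    double events;         /* event count attributed to the code segment */
    double latency_cycles; /* assumed cost of one event, in cycles */
} lcpi_term;

/* Cycles per instruction attributable to one event class
 * (assumed formula: events * latency / instructions). */
double lcpi_term_value(const lcpi_term *t, double instructions)
{
    assert(instructions > 0.0);
    return t->events * t->latency_cycles / instructions;
}

/* Category bound: the sum of its sub-item contributions,
 * mirroring how the report's sub-items add up to the category line. */
double lcpi_category(const lcpi_term *terms, size_t n, double instructions)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += lcpi_term_value(&terms[i], instructions);
    return total;
}
```

As a usage example with made-up inputs: 300 L1d hits at 3 cycles, 200 L2d hits at 9 cycles and 30 L2d misses at 230 cycles, over 1000 instructions, give sub-items of 0.9, 1.8 and 6.9, which sum to roughly 9.6. These counts and latencies were picked only to reproduce the figures on the slide.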

List of Recommendations

    #--------------------------------------------------
    # Recommendations for mm.c:8
    #--------------------------------------------------
    #
    # This is a possible recommendation for this code segment
    #
    Description: change the order of loops
    Reason: this optimization may improve the memory access pattern and make it
            more cache and TLB friendly
    Pattern Recognizers: c loop2 f loop2
    Code example:
        loop i { loop j { ... } }
          =====>
        loop j { loop i { ... } }
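To make the recommendation concrete, here is a minimal sketch in C of changing the order of loops in a matrix multiply like the one reported for mm.c. The size N, the function names and the flat row-major layout are assumptions for illustration; the original mm.c is not reproduced in the slides.

```c
#include <stddef.h>

#define N 64 /* hypothetical matrix dimension */

/* Order i, j, k: the innermost loop reads b[k][j] with a stride of N
 * elements, which touches a new cache line on almost every iteration. */
void matmul_ijk(const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            for (size_t k = 0; k < N; k++)
                c[i * N + j] += a[i * N + k] * b[k * N + j];
}

/* Order i, k, j (the two inner loops interchanged): the innermost loop
 * now walks b and c with stride 1, which is friendlier to caches and TLB. */
void matmul_ikj(const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < N; i++)
        for (size_t k = 0; k < N; k++)
            for (size_t j = 0; j < N; j++)
                c[i * N + j] += a[i * N + k] * b[k * N + j];
}
```

Both functions accumulate into c, so the caller is expected to zero it first. For every element of c the products are added in the same order of k in both versions, so the results are identical; only the memory access pattern changes.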

Fully Automated Code Transformation

Before:

    void compute() {
      register int i, j, k;

      /* PERFEXPERT: start work here */
      /* PERFEXPERT: grandparent loop */
      for (i = 0; i < 1000; i++)
        /* PERFEXPERT: parent loop */
        for (j = 0; j < 1000; j++)
          /* PERFEXPERT: bottleneck */
          for (k = 0; k < 1000; k++)
            c[i][j] += (a[i][k] * b[k][j]);
    }

After:

    void compute() {
      register int i, j, k;
      //PIPS generated variable
      register int jp, kp;

    loop 6:
      for (i = 0; i <= 999; i++)
    loop 7:
        for(jp = 0; jp <= 999; jp += 1)
          for(kp = 0; kp <= 999; kp += 1)
            c[i][kp] += a[i][jp]*b[jp][kp];
    }

Agenda
1. Introduction
2. PerfExpert Modular Architecture
3. Understanding and Extending PerfExpert
4. Conclusions

Current Version: The Big Picture
[Figure: block diagram of the current architecture]
- Phases: Measurement and Analysis; Diagnose and Recommendation
- Components: Script, User Interface, Measurement (HPCToolkit), Analyzer and Recommender
- Data: binary object, general performance metrics, code bottlenecks and list of recommendations
- Legend: developed by the authors; input/output data

New Version: The Big Picture
[Figure: block diagram of the new architecture]
- Phases: Compilation; Measurement and Analysis; Diagnose and Recommendation; Code Transformation; Code Integration
- Components: Work Flow Script, User Interface, Compiler, HPCToolkit, MACPO, Analyzer, Optimization Formulator (ROSE), Support Database, Pattern Recognizer (Bison/Flex), Transformer (PIPS/ROSE), Integrator (ROSE), Standard Compiler
- Data: original source code, binary object code, general performance metrics, data access performance metrics added to previous output, code bottlenecks, code fragments to optimize and list of recommendations, code fragments to optimize and list of code transformers, optimized code fragments, optimized source code
- Legend: developed by the authors; input/output data

New Version: Work Flow Script
[Figure: the new-version block diagram, annotated with notes on the Work Flow Script]
- This is a shell script
- Accepts parameters
- Invokes all tools (including the compiler)
- Backward compatible
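The control flow of such a driver (accept a parameter, invoke each tool in order, stop when the requested output has been produced) can be sketched as follows. The slides say the real driver is a shell script; this C version only illustrates the idea. The stage names, the mapping of stages to the three outputs, and the stub function are all hypothetical and do not come from PerfExpert.

```c
#include <stddef.h>

/* The three outputs named in the slides, in increasing order of automation. */
typedef enum {
    OUT_REPORT_ONLY,
    OUT_RECOMMENDATIONS,
    OUT_TRANSFORMED_CODE
} output_level;

typedef int (*stage_fn)(void); /* returns 0 on success */

typedef struct {
    const char *name;         /* hypothetical stage label */
    output_level needed_from; /* lowest output level that requires this stage */
    stage_fn run;
} stage;

/* Placeholder: a real stage would invoke an external tool here. */
int stage_ok(void) { return 0; }

/* Assumed stage order and mapping to output levels, for illustration only. */
const stage example_stages[] = {
    { "compile",   OUT_REPORT_ONLY,      stage_ok },
    { "measure",   OUT_REPORT_ONLY,      stage_ok },
    { "analyze",   OUT_REPORT_ONLY,      stage_ok },
    { "recommend", OUT_RECOMMENDATIONS,  stage_ok },
    { "transform", OUT_TRANSFORMED_CODE, stage_ok },
    { "integrate", OUT_TRANSFORMED_CODE, stage_ok },
};

/* Runs the stages in order, skipping those not needed for the requested
 * output, and stops at the first failure.
 * Returns the number of stages completed, or -1 if a stage failed. */
int run_workflow(const stage *stages, size_t n, output_level want)
{
    int done = 0;
    for (size_t i = 0; i < n; i++) {
        if (stages[i].needed_from > want)
            continue;
        if (stages[i].run() != 0)
            return -1;
        done++;
    }
    return done;
}
```

Keeping the stages in a table means a new tool can be added by appending one entry, which is one plausible reading of the modular design the diagram describes.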
