Utilizing the Latest Features of Intel's Performance Monitoring Unit - PowerPoint PPT Presentation



SLIDE 1

Utilizing the Latest Features of Intel's Performance Monitoring Unit Scalable Tools Workshop 2019

Michael Chynoweth – Intel Fellow Contributors: Patrick Konsor, Sneha Gohad, Joe Olivas, Vishnu Naikawadi, Andi Kleen, Ahmad Yasin

SLIDE 2


Agenda

  • Timed Last Branch Records
  • Tagging and explaining microarchitectural issues
  • Eliminating frequent Performance Monitoring Interrupts
  • Extended Performance Event Based Sampling
  • Adaptive Performance Event Based Sampling
  • Reduction in overhead from Extended PEBS
SLIDE 3

Timed Last Branch Record HW Timing

Given a block of code:

3ae0: push rbp
3ae1: mov  rbp,rsp
3ae4: push r15
3ae6: push r14
3ae8: push rbx
3ae9: push rax
3aea: mov  rbx,qword ptr [rdi+28]
3aee: mov  r15,qword ptr [rdi+30]
3af2: mov  r14d,1
3af8: cmp  rbx,r15
3afb: jz   3b14

How long does this code take to run? Timed LBRs can give us individual timings:

3, 55, 26, 17, 8, 16, 6, 49, 24, 3, 23,116, 3, 3, 5, 15, 3, 19, 6, 26, 21, 2, 49, 146, 6, 17, 29, 19, 11, 147, 23, 3, 30, 7, 23, 19

Average: ~25 cycles, but the devil is in the details.

Timed LBRs record the exact core clock count between taken branches, giving core clock timing for this code.
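The per-trip timings above can be summarized with a short script. A minimal sketch; note that the 36 values shown on the slide are presumably an excerpt of the full sample set, so the computed mean lands near, not exactly at, the quoted ~25 cycles:

```python
# Summarize the timed-LBR cycle counts listed on the slide.
# These 36 values are an excerpt of the full 9.7K-sample data set.
timings = [3, 55, 26, 17, 8, 16, 6, 49, 24, 3, 23, 116, 3, 3, 5, 15, 3, 19,
           6, 26, 21, 2, 49, 146, 6, 17, 29, 19, 11, 147, 23, 3, 30, 7, 23, 19]

mean = sum(timings) / len(timings)
print(f"samples: {len(timings)}, mean: {mean:.1f} cycles")
```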

SLIDE 4

LBR Hardware Timing Histogram

[Histogram: Percentage (0-14%) vs. LBR Timing Bucket (20-100 cycles)]

Putting all those LBR timings together shows patterns.

Superblock 0x100083ae0 -> 0x100083afb: 11 instructions, 2 loads, 9.7K samples

SLIDE 5

LBR Timing Histogram

What behavior causes these spikes in time?

[Histogram: Percentage (0-14%) vs. LBR Timing Bucket (20-100 cycles)]

Superblock 0x100083ae0 -> 0x100083afb: 11 instructions, 2 loads, 9.7K samples

SLIDE 6

Timing Occurrences vs. Cost

Cost is the % of total cycles spent in each bucket.

[Histogram: Samples vs. Cost per LBR Timing Bucket; Percentage (0-45%) vs. bucket (20-100 cycles)]

Superblock 0x100083ae0 -> 0x100083afb: 11 instructions, 2 loads, 9.7K samples

SLIDE 7

Timing Occurrences vs. Cost

Cost is the % of total cycles spent in each bucket.

[Histogram: Samples vs. Cost per LBR Timing Bucket; Percentage (0-45%) vs. bucket (20-100 cycles)]

Both the occurrence rate and the cost of a spike are important for understanding its cause: the 3-cycle bucket is 12% of samples but only 1.4% of cost, while the >= 100-cycle bucket is just 5.4% of samples yet 41% of cost.

Superblock 0x100083ae0 -> 0x100083afb: 11 instructions, 2 loads, 9.7K samples
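The occurrences-vs-cost distinction can be reproduced from the excerpt of timed-LBR samples shown earlier. A sketch; since it uses only the 36 listed values rather than the full 9.7K samples, the percentages only approximate the slide's numbers:

```python
# Occurrences vs. cost for the >= 100-cycle spike bucket, using the
# 36-sample excerpt from the earlier slide (not the full data set).
timings = [3, 55, 26, 17, 8, 16, 6, 49, 24, 3, 23, 116, 3, 3, 5, 15, 3, 19,
           6, 26, 21, 2, 49, 146, 6, 17, 29, 19, 11, 147, 23, 3, 30, 7, 23, 19]

total_cycles = sum(timings)
spikes = [t for t in timings if t >= 100]       # the >= 100-cycle bucket
sample_share = len(spikes) / len(timings)       # how often the spike occurs
cost_share = sum(spikes) / total_cycles         # cycles it actually consumes

print(f"{sample_share:.1%} of samples, {cost_share:.1%} of cost")
```

Even in this small excerpt the rare long-latency trips dominate the total cycle cost, which is exactly the pattern the slide calls out.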

SLIDE 8

Attributing LBR Timing to Events

Multi-core superblock 0x100083ae0 -> 0x100083afb: 11 instructions, 2 loads, 9.7K samples

Metric             Count     Per Trip
Hit Count          1.40E+09  N/A
Precise L2 Hits    1.52E+08  10.8%
Precise L3 Hits    9.20E+07  6.6%
Precise L3 Misses  3.68E+07  2.6%

Spike    Samples  Cost
3        12.1%    1.4%
6        9.9%     2.4%
22       6.5%     5.7%
>= 100   5.4%     40.7%

[Histogram: Samples vs. Cost per LBR Timing Bucket; Percentage (0-45%) vs. bucket (20-100 cycles)]

  • LBR timing sample frequency is well correlated with load cache counters
  • The model won't know explicitly whether a superblock has loads, or how many

SLIDE 9

Attributing LBR Timing to Events

Multi-core superblock 0x100083ae0 -> 0x100083afb: 11 instructions, 2 loads, 9.7K samples

Spike    Samples  Cost
3        12.1%    1.4%
6        9.9%     2.4%
22       6.5%     5.7%
>= 100   5.4%     40.7%

Metric                   Count     % Cycles
CPU_CLK_UNHALTED.THREAD  2.76E+10  100%
Cycles L2 Hit (Derived)  4.60E+08  1.7%
Cycles L3 Hit (Derived)  1.93E+09  7.0%
Cycles L3 Miss           9.48E+09  34.3%

[Histogram: Samples vs. Cost per LBR Timing Bucket; Percentage (0-45%) vs. bucket (20-100 cycles)]

  • Spike cost is well correlated with CYCLE_ACTIVITY counters
  • LBR-based spike cost may be a more accurate estimate of L3 miss cost

SLIDE 10

Frequent Performance Monitoring Interrupts (PMIs) for Profiling are Intrusive and Evil

  • Expensive way to profile an application
    • Each PMI requires ~5-10k cycles, depending on tool, platform, and OS
    • Profiling at sub-millisecond granularities becomes costly with PMIs
    • 8 programmable and 4 fixed counters can overflow sub-millisecond
  • Suffers from blind spots when interrupts are masked
    • Forces tools/OS to support non-maskable interrupts for profiling
  • Runs code and touches data particular to the interrupt handler
    • Can perturb every microarchitectural state on the system

The goal is to eliminate the need for frequent performance monitoring interrupts.
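A back-of-envelope sketch of why this matters. The PMI cost and counter counts come from the slide; the 2.5 GHz core clock and the one-overflow-per-millisecond rate are assumptions for illustration only:

```python
# Rough PMI overhead estimate (assumed 2.5 GHz clock and 1 kHz overflow
# rate per counter; PMI cost is the slide's ~5-10k cycle range, mid-point).
cpu_hz = 2.5e9             # assumption, not from the slide
pmi_cost_cycles = 8_000    # mid-range of ~5-10k cycles per PMI
counters = 12              # 8 programmable + 4 fixed
overflows_per_sec = 1_000  # each counter overflowing once per millisecond

overhead_cycles = counters * overflows_per_sec * pmi_cost_cycles
overhead_pct = overhead_cycles / cpu_hz
print(f"PMI overhead: {overhead_pct:.1%} of one core")
```

Under these assumptions, sub-millisecond sampling on all counters burns several percent of a core on interrupt handling alone, before counting cache and TLB pollution from the handler itself.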

SLIDE 11

Extended Processor Event Based Sampling (PEBS)

  • Introduced in Ice Lake and Tremont
  • Supports output of all programmable and fixed counters without a PMI
  • Moves Precise Distribution of Instructions Retired to fixed counter 0
  • Advantages:
    • More precise event attribution
    • Avoids the need for an expensive interrupt
    • Avoids "blind spots" when interrupts are masked (without NMIs)

Event overflows now have two choices:
  #1 Performance Monitoring Interrupt (PMI): costs ~8k cycles
  #2 PEBS assist, no interrupt: cost estimated at ~500 cycles
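The per-sample saving from taking choice #2 follows directly from the two cycle figures on the slide:

```python
# Per-sample handling cost, using the cycle figures quoted on the slide.
pmi_cost_cycles = 8_000   # full Performance Monitoring Interrupt
pebs_cost_cycles = 500    # PEBS assist, no interrupt (slide's estimate)

savings = pmi_cost_cycles / pebs_cost_cycles
print(f"PEBS assist is ~{savings:.0f}x cheaper per sample than a PMI")
```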

SLIDE 12

Extended PEBS Improves Tagging to the Correct Instruction

[Chart: Extended PEBS Clocks% vs. Legacy Clocks% by disassembly offset. Extended PEBS tags cost to the correct load with the performance issue; legacy clocks tag it incorrectly to the next instruction.]

SLIDE 13

Adaptive Processor Event Based Sampling

  • Control information in the PEBS buffer: everything but the basic info group is optional
  • Only collect what is needed
  • Adds Last Branch Records and XMM registers
  • Greatly reduces collection cost where up to 96 LBR entries (32 records x FROM/TO/INFO) would otherwise need to be read
  • Collect multiple PEBS buffers per PMI

Record layout (field offsets when the configuration MSR is all 1s):

Offset  Group        Field name             Bits     New, or legacy name if different
0x0     Basic info   Record Format          [47:0]   <new>
                     Record Size            [63:48]  <new>
0x8                  Instruction Pointer    [63:0]   EventingIP
0x10                 TSC                             <legacy>
0x18                 Applicable Counters             <legacy>
0x20    Memory info  Memory Access Address  [63:0]   DLA
0x28                 Memory Auxiliary Info           DATA_SRC
0x30                 Memory Access Latency           Load Latency
0x38                 TSX Auxiliary Info              TSX Information
0x40    GPRs         RFLAGS                 [63:0]   <legacy>
0x48                 RIP
0x50                 RAX
...
0x88                 RDI
0x90                 R8
...
0xC8                 R15
0xD0    XMMs         XMM0                   [127:0]  <new>
...
0x1C0                XMM15
0x1C8   LBRs         LBR[tos].FROM          [63:0]   <new>
0x1D0                LBR[tos].TO
0x1D8                LBR[tos].INFO
...
0x4B0                LBR[tos-31].FROM
0x4B8                LBR[tos-31].TO
0x4C0                LBR[tos-31].INFO

The PEBS buffer will have the option to collect LBRs for lower overhead and higher sampling rates.
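The mandatory Basic info group at the top of the table can be decoded with ordinary struct unpacking. A hypothetical sketch using the offsets from the table above; `parse_basic_info` and the synthetic record bytes are made up for illustration, and the authoritative layout is in Intel's SDM:

```python
import struct

def parse_basic_info(buf: bytes) -> dict:
    """Unpack the four mandatory qwords of an Adaptive PEBS record."""
    fmt_and_size, ip, tsc, counters = struct.unpack_from("<4Q", buf, 0)
    return {
        "record_format": fmt_and_size & ((1 << 48) - 1),  # bits [47:0]
        "record_size": fmt_and_size >> 48,                # bits [63:48]
        "ip": ip,                                         # EventingIP
        "tsc": tsc,
        "applicable_counters": counters,
    }

# Synthetic record: format 0x1, size 0x20, IP reusing the superblock address.
rec = struct.pack("<4Q", (0x20 << 48) | 0x1, 0x100083AE0, 123_456_789, 0b1)
info = parse_basic_info(rec)
print(hex(info["ip"]), info["record_size"])
```

Because Record Format advertises which optional groups follow, a parser reads the basic group first, then conditionally decodes Memory info, GPRs, XMMs, and LBRs at the offsets enabled by the configuration MSR.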

SLIDE 14

Collection Overhead Reduced by Extended PEBS

Collecting reference clocks at an extremely high sampling rate (10K sample-after value):

[Chart: %Overhead (0-20%) and Collection Interrupts/s (20,000-160,000) vs. PEBS/PMI ratio: No Collection, 1, 10, 20, and 30 PEBS per PMI]

Extended PEBS hardware reduces collection overhead 9x at ~7 microsecond granularity.
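The ~7 microsecond figure is consistent with the 10K sample-after value divided by the reference clock rate. A sketch; the ~1.4 GHz reference clock frequency is an assumption, not stated on the slide:

```python
# Sampling granularity implied by the slide's sample-after value.
ref_clk_hz = 1.4e9           # assumed reference clock frequency
sample_after_value = 10_000  # from the slide

granularity_s = sample_after_value / ref_clk_hz
print(f"~{granularity_s * 1e6:.1f} microseconds between samples")
```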

SLIDE 15

Conclusions

  • Timed Last Branch Records come close to uncovering the exact cost of microarchitectural issues
  • Extended Performance Event Based Sampling
    • More precise tagging of performance issues and events
    • Allows more frequent sampling with lower overhead
  • Adaptive Performance Event Based Sampling
    • Allows users to collect only what is required in the PEBS buffer