Tanima Dey, Wei Wang, Jack W. Davidson, Mary L. Soffa
Department of Computer Science, University of Virginia
ISPASS 2011

Motivation
- The number of cores doubles every 18 months
- Expected: performance ∝ number of cores
- One of the bottlenecks is shared-resource contention
- For multi-threaded workloads, contention is unavoidable
- To reduce contention, it is necessary to understand where and how the contention is created

Shared Resource Contention in Chip-Multiprocessors
[Figure: Intel Quad Core Q9550 with cores C0-C3, one L1 cache per core, two L2 caches, and a front-side bus to memory; threads of Application 1 and Application 2 run on the cores.]

Scenario 1
- Multi-threaded applications
  - With co-runner
[Figure: threads of Application 1 and Application 2 on cores C0-C3, each core with an L1 cache, above the L2 caches and memory.]

Scenario 2
- Multi-threaded applications
  - Without co-runner
[Figure: threads of a single application on cores C0-C3, each core with an L1 cache, above the L2 caches and memory.]

Shared-Resource Contention
- Intra-application contention
  - Contention among threads from the same application (no co-runners)
- Inter-application contention
  - Contention among threads from co-running applications

Contributions
- A general methodology to evaluate a multi-threaded application's performance
  - Intra-application contention
  - Inter-application contention
  - Contention in the memory-hierarchy shared resources
- Characterizing applications facilitates better understanding of the application's resource sensitivity
- Thorough performance analyses and characterization of multi-threaded PARSEC benchmarks

Outline
- Motivation
- Contributions
- Methodology
- Measuring intra-application contention
- Measuring inter-application contention
- Related Work
- Summary

Methodology
- Designed to measure both intra- and inter-application contention for a targeted shared resource
  - L1-cache, L2-cache
  - Front Side Bus (FSB)
- Each application is run in two configurations
  - Baseline: threads do not share the targeted resource
  - Contention: threads share the targeted resource
- Multiple instances of the targeted resource
- Determine contention by comparing performance (gathering hardware performance counters' values)
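The two configurations above can be sketched as a choice of core sets. This is only an illustration: the slides do not say which mechanism was used to place threads, and the `L2_GROUPS` topology, the function names, and the use of Linux CPU affinity are all assumptions made for this example.

```python
import os
from typing import Dict, List

# Hypothetical topology (an assumption, not taken from the slides):
# each key is one instance of the targeted resource (here an L2 cache),
# each value lists the cores that share that instance.
L2_GROUPS: Dict[int, List[int]] = {0: [0, 1], 1: [2, 3]}


def baseline_cores(groups: Dict[int, List[int]], n_threads: int) -> List[int]:
    """Pick one core per resource instance, so no two threads share an instance."""
    cores = [members[0] for members in groups.values()]
    if n_threads > len(cores):
        raise ValueError("not enough separate resource instances for a baseline run")
    return cores[:n_threads]


def contention_cores(groups: Dict[int, List[int]], n_threads: int) -> List[int]:
    """Pick cores that all sit under the same resource instance, so threads share it."""
    for members in groups.values():
        if len(members) >= n_threads:
            return members[:n_threads]
    raise ValueError("no single resource instance is shared by enough cores")


def pin_current_process(cores: List[int]) -> None:
    """Restrict the calling process to the given cores (Linux only)."""
    os.sched_setaffinity(0, set(cores))


if __name__ == "__main__":
    print("baseline:  ", baseline_cores(L2_GROUPS, 2))
    print("contention:", contention_cores(L2_GROUPS, 2))
```

Note that `pin_current_process` only limits the process to a set of cores; binding each thread to one specific core would need per-thread affinity calls, which this sketch leaves out.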

Outline
- Motivation
- Contributions
- Methodology
- Measuring intra-application contention (see paper)
- Measuring inter-application contention
- Related Work
- Summary

Measuring inter-application contention
- L1-cache
[Figure: baseline configuration and contention configuration, each showing Application 1 and Application 2 threads on cores C0-C3 with per-core L1 caches, L2 caches, and memory.]

Measuring inter-application contention
- L2-cache
[Figure: baseline configuration and contention configuration, each showing Application 1 and Application 2 threads on cores C0-C3 with per-core L1 caches, L2 caches, and memory.]

Measuring inter-application contention
- FSB
[Figure: baseline configuration, showing Application 1 and Application 2 threads on cores C0, C2, C4, C6 and C1, C3, C5, C7 with per-core L1 caches, L2 caches, and memory.]

Measuring inter-application contention
- FSB
[Figure: contention configuration, showing Application 1 and Application 2 threads on cores C0, C2, C4, C6 and C1, C3, C5, C7 with per-core L1 caches, L2 caches, and memory.]

PARSEC Benchmarks

Application Domain    Benchmark(s)
------------------    ------------
Financial Analysis    Blackscholes (BS), Swaptions (SW)
Computer Vision       Bodytrack (BT)
Engineering           Canneal (CN)
Enterprise Storage    Dedup (DD)
Animation             Facesim (FA), Fluidanimate (FL)
Similarity Search     Ferret (FE)
Rendering             Raytrace (RT)
Data Mining           Streamcluster (SC)
Media Processing      Vips (VP), X264 (X2)

Experimental platform
- Platform 1: Yorkfield
  - Intel Quad core Q9550
  - 32 KB L1-D and L1-I cache
  - 6 MB L2-cache
  - 2 GB memory
  - Common FSB
[Figure: cores C0-C3, each with an L1 cache and HW-PF; L2 caches with HW-PF; FSB interfaces connecting over the FSB to the Memory Controller Hub (Northbridge) and memory.]

Experimental platform
- Platform 2: Harpertown
[Figure: cores C0, C2, C4, C6 and C1, C3, C5, C7, each with an L1 cache and HW-PF; L2 caches with HW-PF; FSB interfaces connecting over the FSB to the Memory Controller Hub (Northbridge) and memory.]

Performance Analysis
- Inter-application contention
  - For the i-th co-runner:

$$\mathrm{PercentPerformanceDifference}_i = \frac{(\mathrm{PerformanceBase}_i - \mathrm{PerformanceContend}_i) \times 100}{\mathrm{PerformanceBase}_i}$$

  - Absolute performance difference sum (APDS):

$$\mathrm{APDS} = \sum_i \left| \mathrm{PercentPerformanceDifference}_i \right|$$
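The two formulas on this slide can be worked through in a few lines. The sketch below is illustrative only: the function names, the co-runner labels, and the numbers are made up for the example, and "performance" stands for whatever metric was collected in both configurations.

```python
from typing import Mapping


def percent_performance_difference(base: float, contend: float) -> float:
    """(PerformanceBase - PerformanceContend) * 100 / PerformanceBase."""
    if base == 0:
        raise ValueError("baseline performance must be non-zero")
    return (base - contend) * 100.0 / base


def apds(base: Mapping[str, float], contend: Mapping[str, float]) -> float:
    """Absolute performance difference sum over all co-runners."""
    return sum(
        abs(percent_performance_difference(base[name], contend[name]))
        for name in base
    )


# Hypothetical measurements for two co-runners (not data from the talk).
base = {"co_runner_a": 200.0, "co_runner_b": 100.0}
contend = {"co_runner_a": 190.0, "co_runner_b": 104.0}

if __name__ == "__main__":
    for name in base:
        diff = percent_performance_difference(base[name], contend[name])
        print(f"{name}: {diff:+.1f}%")
    print(f"APDS = {apds(base, contend):.1f}")
```

With these made-up numbers the per-co-runner differences are +5.0% and -4.0%, so the APDS is 9.0. Taking absolute values keeps positive and negative differences from cancelling out in the sum.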

Inter-application contention
- L1-cache, for Streamcluster
[Figure: "Inter-application L1-cache Contention". x-axis: co-running benchmarks; y-axis: performance difference (%), from -8 to 8.]

Inter-application L1-cache contention
[Figure: "Streamcluster Inter-application L1-cache Contention". x-axis: co-running benchmarks; y-axis: performance difference (%), from -8 to 8.]

Inter-application contention
- L1-cache

Inter-application contention
- L2-cache

Inter-application contention
- FSB

Characterization

Benchmark        L1-cache    L2-cache        FSB
---------        --------    --------        ---
Blackscholes     none        none            none
Bodytrack        inter       inter           intra
Canneal          intra       inter           intra
Dedup            inter       intra, inter    intra, inter
Facesim          inter       inter           intra
Ferret           intra       intra, inter    intra
Fluidanimate     inter       inter           intra
Raytrace         none        none            intra
Streamcluster    inter       inter           intra
Swaptions        none        none            none
Vips             intra       inter           inter
X264             inter       intra, inter    intra

Summary
- The methodology generalizes contention analysis of multi-threaded applications
- New approach to characterize applications
- Useful for performance analysis of existing and future architectures or benchmarks
- Helpful for creating new workloads of diverse properties
- Provides insights for designing improved contention-aware scheduling methods

Related Work
- Cache contention
  - Knauerhase et al., IEEE Micro 2008
  - Zhuravlev et al., ASPLOS 2010
  - Xie et al., CMP-MSI 2008
  - Mars et al., HiPEAC 2011
- Characterizing parallel workload
  - Jin et al., NASA Technical Report 2009
- PARSEC benchmark suite
  - Bienia et al., PACT 2008
  - Bhadauria et al., IISWC 2009

Thank you!
