Tanima Dey
Wei Wang, Jack W. Davidson, Mary L. Soffa
Department of Computer Science, University of Virginia
ISPASS 2011


SLIDE 1

Tanima Dey

Wei Wang, Jack W. Davidson, Mary L. Soffa
Department of Computer Science, University of Virginia

ISPASS 2011

SLIDE 2

Motivation

- The number of cores doubles every 18 months
- Expected: performance scales with the number of cores
- One of the bottlenecks is shared-resource contention
- For multi-threaded workloads, contention is unavoidable
- To reduce contention, it is necessary to understand where and how the contention is created

SLIDE 3

Shared-Resource Contention in Chip-Multiprocessors

[Diagram: Intel Quad Core Q9550 — cores C0–C3, each with a private L1 cache; two shared L2 caches; a front-side bus to memory. Threads of Application 1 and Application 2 run on the cores.]

SLIDE 4

Scenario 1: Multi-threaded applications

- With co-runner

[Diagram: on the quad-core platform, Application 1's threads run alongside a co-running Application 2 thread, sharing the L2 caches and memory.]

SLIDE 5

Scenario 2: Multi-threaded applications

- Without co-runner

[Diagram: a single multi-threaded application's threads run alone on the quad-core platform, sharing the L2 caches and memory.]

SLIDE 6

Shared‐Resource Contention

- Intra-application contention
  - Contention among threads from the same application (no co-runners)
- Inter-application contention
  - Contention among threads from the co-running application

SLIDE 7

Contributions

- A general methodology to evaluate a multi-threaded application's performance
  - Intra-application contention
  - Inter-application contention
  - Contention in the memory-hierarchy shared resources
- Characterizing applications facilitates better understanding of the application's resource sensitivity
- Thorough performance analyses and characterization of multi-threaded PARSEC benchmarks

SLIDE 8

Outline

- Motivation
- Contributions
- Methodology
  - Measuring intra-application contention
  - Measuring inter-application contention
- Related Work
- Summary

SLIDE 9

Methodology

- Designed to measure both intra- and inter-application contention for a targeted shared resource
  - L1-cache, L2-cache
  - Front-side bus (FSB)
- Each application is run in two configurations
  - Baseline: threads do not share the targeted resource
  - Contention: threads share the targeted resource
- Requires multiple instances of the targeted resource
- Determine contention by comparing performance (gathering hardware performance counters' values)
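The baseline and contention configurations above come down to controlling which cores each application's threads may run on. A minimal sketch of such pinning on Linux using `os.sched_setaffinity`; the core-to-L2 mapping shown is an assumption (hypothetical enumeration) and should be checked against `/sys/devices/system/cpu/cpu*/cache` on the actual machine:

```python
import os

# Hypothetical Q9550 enumeration: cores 0-1 share one L2 cache, cores 2-3
# share the other. Verify on the target machine before trusting these sets.
BASELINE_L2 = {"app1": {0}, "app2": {2}}    # apps use different L2 caches
CONTENTION_L2 = {"app1": {0}, "app2": {1}}  # apps share an L2 cache

def pin(pid: int, cores: set) -> set:
    """Restrict a process (pid 0 = the caller) to the given cores (Linux only).

    Returns the affinity mask now in effect, so the placement can be
    verified before a measurement run starts.
    """
    os.sched_setaffinity(pid, cores)
    return os.sched_getaffinity(pid)
```

One would launch each application, pin it according to the chosen configuration, and then collect hardware performance counter values (e.g., with `perf stat`) for both runs.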

SLIDE 10

Outline

- Motivation
- Contributions
- Methodology
  - Measuring intra-application contention (see paper)
  - Measuring inter-application contention
- Related Work
- Summary

SLIDE 11

Measuring inter-application contention

- L1-cache

[Diagram: Baseline configuration — the two applications' threads placed so they do not compete for the same L1 caches; Contention configuration — threads placed so they do.]

SLIDE 12

Measuring inter-application contention

- L2-cache

[Diagram: Baseline configuration — the two applications' threads placed on cores backed by different L2 caches; Contention configuration — placed on cores sharing an L2 cache.]

SLIDE 13

Measuring inter-application contention

- FSB

[Diagram: Baseline configuration on the eight-core platform — the two applications' threads placed so they use different front-side buses.]
slide-14
SLIDE 14

Measuring inter-application contention

- FSB

[Diagram: Contention configuration on the eight-core platform — the two applications' threads placed so they share a front-side bus.]

SLIDE 15

PARSEC Benchmarks

Application Domain    Benchmark(s)
Financial Analysis    Blackscholes (BS), Swaptions (SW)
Computer Vision       Bodytrack (BT)
Engineering           Canneal (CN)
Enterprise Storage    Dedup (DD)
Animation             Facesim (FA), Fluidanimate (FL)
Similarity Search     Ferret (FE)
Rendering             Raytrace (RT)
Data Mining           Streamcluster (SC)
Media Processing      Vips (VP), X264 (X2)

SLIDE 16

Experimental platform

- Platform 1: Yorkfield
  - Intel Quad-core Q9550
  - 32 KB L1-D and L1-I caches
  - 6 MB L2-cache
  - 2 GB memory
  - Common FSB

[Diagram: cores C0–C3, each with an L1 cache and L1 hardware prefetcher; two L2 caches, each with an L2 hardware prefetcher and FSB interface; a common FSB to the memory controller hub (northbridge) and memory.]

SLIDE 17

Experimental platform

- Platform 2: Harpertown

[Diagram: eight cores C0–C7, each with an L1 cache and L1 hardware prefetcher; four L2 caches, each with an L2 hardware prefetcher and FSB interface; two FSBs to the memory controller hub (northbridge) and memory.]

SLIDE 18

Performance Analysis

- Inter-application contention
  - For the i-th co-runner:

    PercentPerformanceDifference_i = (PerformanceBase_i − PerformanceContend_i) × 100 / PerformanceBase_i

- Absolute performance difference sum:

    APDS = Σ_i | PercentPerformanceDifference_i |
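The two formulas translate directly into code. A small sketch (function and variable names are mine, mirroring the slide's symbols):

```python
def percent_performance_difference(base: float, contend: float) -> float:
    """Percent performance lost (negative = gained) by one co-runner,
    relative to its baseline-configuration performance."""
    return (base - contend) * 100.0 / base

def apds(pairs) -> float:
    """Absolute performance difference sum over all co-runners.

    `pairs` holds one (baseline, contention) performance pair per
    co-running application.
    """
    return sum(abs(percent_performance_difference(b, c)) for b, c in pairs)
```

For example, co-runners measured at (100, 90) and (200, 210) differ by +10% and −5%, so `apds([(100.0, 90.0), (200.0, 210.0)])` yields `15.0`.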

SLIDE 19

Inter-application contention

- L1-cache, for Streamcluster

[Chart: inter-application L1-cache contention — percent performance difference when Streamcluster co-runs with Blackscholes, Bodytrack, Canneal, Dedup, Facesim, Ferret, Fluidanimate, Raytrace, Swaptions, Vips, and X264.]

SLIDE 20

Inter-application L1-cache contention

Streamcluster

[Chart: inter-application L1-cache contention with Streamcluster — percent performance difference across the co-running benchmarks (Blackscholes, Bodytrack, Canneal, Dedup, Facesim, Ferret, Fluidanimate, Raytrace, Streamcluster, Swaptions, Vips, X264).]

SLIDE 21

Inter-application contention

- L1-cache

SLIDE 22

Inter-application contention

- L2-cache

SLIDE 23

Inter-application contention

- FSB

SLIDE 24

Characterization

Benchmark      L1-cache  L2-cache      FSB
Blackscholes   none      none          none
Bodytrack      inter     inter         intra
Canneal        intra     inter         intra
Dedup          inter     intra, inter  intra, inter
Facesim        inter     inter         intra
Ferret         intra     intra, inter  intra
Fluidanimate   inter     inter         intra
Raytrace       none      none          intra
Streamcluster  inter     inter         intra
Swaptions      none      none          none
Vips           intra     inter         inter
X264           inter     intra, inter  intra

SLIDE 25

Summary

- The methodology generalizes contention analysis of multi-threaded applications
- New approach to characterize applications
- Useful for performance analysis of existing and future architectures or benchmarks
- Helpful for creating new workloads with diverse properties
- Provides insights for designing improved contention-aware scheduling methods

SLIDE 26

Related Work

- Cache contention
  - Knauerhase et al., IEEE Micro 2008
  - Zhuravlev et al., ASPLOS 2010
  - Xie et al., CMP-MSI 2008
  - Mars et al., HiPEAC 2011
- Characterizing parallel workloads
  - Jin et al., NASA Technical Report 2009
- PARSEC benchmark suite
  - Bienia et al., PACT 2008
  - Bhadauria et al., IISWC 2009

SLIDE 27

Thank you!