SLIDE 1

Data-Centric Performance Measurement Technique for Chapel Programs

Hui Zhang, Jeffrey K. Hollingsworth {hzhang86, hollings}@cs.umd.edu
Department of Computer Science, University of Maryland, College Park

SLIDE 2

Introduction

  • Why PGAS (Partitioned Global Address Space)
    • Parallel programming is too hard
    • Unified solution for mixed-mode parallelism (multi-core + multi-node)
  • Why Chapel
    • Emerging PGAS language with productive features
    • Potential for performance improvement, yet few useful profilers for its end users
    • Insights for the future evolution of the language

SLIDE 3

Data-centric Profiling

    // A, B, C, n, i, and complex() are assumed to be declared elsewhere
    int busy(int *x) {
        // hotspot function
        *x = complex();
        return *x;
    }

    int main() {
        for (i = 0; i < n; i++) {
            A[i] = busy(&B[i]) + busy(&C[i-1]) + busy(&C[i+1]);
        }
    }

Code-centric Profiling

    main:    100% latency
    busy:    100% latency
    complex: 100% latency

Data-centric Profiling

    A: 100% latency
    B: 33.3% latency
    C: 66.7% latency
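The two views can be reproduced with a short sketch. This is not the profiler's implementation: the sample records, the `stack` and `vars` fields, and the assumption that each of the three calls to `busy` contributes one equally weighted sample are hypothetical, chosen only to mirror the example on this slide.

```python
from collections import Counter

# Hypothetical samples for one loop iteration: one sample per call to busy().
# "stack" is the call stack at the sample; "vars" are the variables whose
# values depend on the sampled work (the written operand and A[i]).
samples = [
    {"stack": ("main", "busy", "complex"), "vars": {"A", "B"}},  # busy(&B[i])
    {"stack": ("main", "busy", "complex"), "vars": {"A", "C"}},  # busy(&C[i-1])
    {"stack": ("main", "busy", "complex"), "vars": {"A", "C"}},  # busy(&C[i+1])
]

def code_centric(samples):
    """Inclusive share of samples per function on the call stack, in percent."""
    counts = Counter()
    for s in samples:
        for fn in set(s["stack"]):
            counts[fn] += 1
    return {fn: round(100.0 * c / len(samples), 1) for fn, c in counts.items()}

def data_centric(samples):
    """Share of samples attributed to each variable, in percent."""
    counts = Counter()
    for s in samples:
        for var in s["vars"]:
            counts[var] += 1
    return {var: round(100.0 * c / len(samples), 1) for var, c in counts.items()}

print(code_centric(samples))
print(data_centric(samples))
```

Every function shows the same inclusive share in the code-centric view, so it cannot rank anything, while the data-centric view separates B from C.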

SLIDE 4

Our Contribution

1. Data-centric profiling of PGAS programs
2. First Chapel-specific profiler
3. Profiled three benchmarks and improved their performance by up to 2.3x

SLIDE 5

Tool Framework

1: Intraprocedural Static Analysis

Module: Global Variables, Type Analysis (class, record)
Function: Local Variables, Parameters, Return Values
Data Flow Analysis, Control Flow Analysis

2: Monitored Execution

Run the Program with Sampling and Instrumentation Enabled (Node 1, Node 2, Node 3, Node 4)

3: Post Processing

Decode Context Sensitive Samples
Variable Profiles (Per Node)

4: GUI Presentation

Aggregate Data from All Nodes and Display
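Steps 3 and 4 combine per-node results into a single variable profile. A minimal sketch of such an aggregation follows, assuming (hypothetically) that each node reports its total sample count and, per variable, the number of samples blamed on it. The field names and numbers are illustrative and not taken from the tool.

```python
from collections import Counter

def aggregate(node_profiles):
    """Merge per-node blamed-sample counts into overall blame percentages."""
    blamed = Counter()
    total = 0
    for profile in node_profiles:
        total += profile["samples"]
        blamed.update(profile["blamed"])
    if total == 0:
        return {}
    return {var: 100.0 * n / total for var, n in blamed.items()}

# Hypothetical per-node profiles (not measured data).
node_profiles = [
    {"samples": 100, "blamed": {"A": 50, "B": 10}},
    {"samples": 100, "blamed": {"A": 30}},
]

print(aggregate(node_profiles))
```

Dividing by the sample total across all nodes, rather than averaging per-node percentages, keeps nodes with more samples from being under-weighted.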

SLIDE 6

Blame Definition

1) $\mathrm{BlameSet}(v) = \bigcup_{w \in W} \mathrm{BackwardSlice}(w)$

2) $\mathrm{isBlamed}(v, s) = \begin{cases} 1 & \text{if } s \in \mathrm{BlameSet}(v) \\ 0 & \text{otherwise} \end{cases}$

3) $\mathrm{BlamePercentage}(v, S) = \dfrac{\sum_{s \in S} \mathrm{isBlamed}(v, s)}{|S|}$

  • v: a certain variable
  • w: a write statement to v's memory region
  • W: a set of w (all write statements to v's memory region)
  • s: a sample
  • S: a set of samples
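Definition 1 can be sketched as a reachability computation over a statement dependence graph. How the tool actually builds its slices is not shown on the slide; the `deps` graph, the statement numbers, and the function names below are hypothetical and serve only to illustrate the union of backward slices.

```python
def backward_slice(deps, stmt):
    """Return stmt plus every statement it transitively depends on."""
    seen = set()
    stack = [stmt]
    while stack:
        cur = stack.pop()
        if cur in seen:
            continue
        seen.add(cur)
        stack.extend(deps.get(cur, ()))
    return seen

def blame_set(deps, writes):
    """BlameSet(v): union of the backward slices of all writes w in W."""
    result = set()
    for w in writes:
        result |= backward_slice(deps, w)
    return result

# Hypothetical dependence graph: statement -> statements it depends on.
deps = {1: [], 2: [1], 3: [], 4: [2, 3], 5: [3]}

print(sorted(blame_set(deps, [4])))     # one write to v
print(sorted(blame_set(deps, [2, 5])))  # two writes to v
```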

SLIDE 7

Blame Calculation Example

    1  a = 2;
    2  b = 3;        // Sample 1
    3  if a < b      // Sample 2
    4    a = b + 1;  // Sample 3
    5  c = a + b;    // Sample 4

Variable Name   BlameSet         Blame Samples     Blame
a               1, 3, 4          S2, S3            50%
b               2                S1                25%
c               1, 2, 3, 4, 5    S1, S2, S3, S4    100%
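The Blame column can be recomputed from the BlameSet column and the sample positions. In the sketch below the blame sets are taken as given from the table rather than derived by slicing, and the function and variable names are illustrative assumptions.

```python
# Blame sets as listed in the table (statement line numbers).
blame_sets = {
    "a": {1, 3, 4},
    "b": {2},
    "c": {1, 2, 3, 4, 5},
}

# Line at which each sample was taken, per the code comments.
sample_lines = {"S1": 2, "S2": 3, "S3": 4, "S4": 5}

def is_blamed(blame_set, line):
    """isBlamed(v, s): 1 if the sampled line is in v's blame set, else 0."""
    return 1 if line in blame_set else 0

def blame_percentage(blame_set, sample_lines):
    """BlamePercentage(v, S): share of samples blamed on v, in percent."""
    hits = sum(is_blamed(blame_set, line) for line in sample_lines.values())
    return 100.0 * hits / len(sample_lines)

for var, bset in blame_sets.items():
    blamed = [s for s, line in sample_lines.items() if line in bset]
    print(var, blamed, f"{blame_percentage(bset, sample_lines):.0f}%")
```

The percentages need not sum to 100%, because one sample can be blamed on several variables at once.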

SLIDE 8

GUI screenshots of MiniMD

[Figure: GUI screenshots of the code-centric view and the data-centric view]

SLIDE 9

Optimization Result - MiniMD

Execution Time (s)   w/o --fast   w/ --fast
Original             20.9         6.41
Optimized            9.2          2.5

SLIDE 10

Experiment - CLOMP

Name                                 Type                Blame   Context
partArray                            [partDomain] Part   99.5%   main
->partArray[i]                       Part                99.5%   main
->partArray[i].zoneArray[j]          Zone                99.0%   main
->partArray[i].zoneArray[j].value    real                99.0%   main
->partArray[i].residue               real                12.3%   main
remaining_deposit                    real                11.8%   update_part

SLIDE 11

Optimization Result - CLOMP

Execution Time (s), w/o --fast, for different problem sizes (#parts/#zones per part)

Problem size    Original   Optimized
1024/64,000     4.02       2.18
65536/10        4.79       4.4
12/640,000      3.87       1.82
65536/6400      7.88       7.14

SLIDE 12

Experiment - LULESH

[Figure: profiler output for LULESH, with its columns annotated 1 to 6]

1. Number of profiling samples in this function
2. Percentage of profiling samples in this function
3. Cumulative percentage of samples
4. Number of samples in this function and its callees
5. Percentage of samples in this function and its callees
6. Function name

SLIDE 13

Experiment - LULESH

Name        Type             Blame   Context
hgfz        8*real           30.8%   CalcFBHourglassForceForElems
hgfx        8*real           29.5%   CalcFBHourglassForceForElems
hgfy        8*real           29.2%   CalcFBHourglassForceForElems
shz         real             27.9%   CalcElemFBHourglassForce
hz          4*real           27.6%   CalcElemFBHourglassForce
shx         real             26.9%   CalcElemFBHourglassForce
shy         real             26.6%   CalcElemFBHourglassForce
hx          4*real           26.6%   CalcElemFBHourglassForce
hy          4*real           26.6%   CalcElemFBHourglassForce
hourgam     8*(4*real)       25.0%   CalcFBHourglassForceForElems
determ      [Elems] real     15.7%   CalcVolumeForceForElems
b_x         8*real           9.7%    IntegrateStressForElems
b_z         8*real           9.7%    IntegrateStressForElems
b_y         8*real           8.7%    IntegrateStressForElems
dvdx(y/z)   [Elems] 8*real   8.3%    CalcHourglassControlForElems
hourmodx    real             5.8%    CalcFBHourglassForceForElems
hourmody    real             5.1%    CalcFBHourglassForceForElems
hourmodz    real             4.8%    CalcFBHourglassForceForElems

SLIDE 14

Optimization Example - Loop

Code Snapshot of LULESH Hot Spot

SLIDE 15

Results for different loop optimizations

[Chart: Execution Time (s) for each loop optimization, axis range 11 to 13; values in chart order: 12.47, 12.04, 11.65, 12.95, 11.78, 12.59, 11.89, 12.6, 12.1, 12.33, 12.75]

U*: manual loop unrolling at place *

SLIDE 16

Optimization Result - LULESH

Execution Time (s)   w/o --fast   w/ --fast
Original             12.47        4.7
CENN                 11.57        4.59
P1                   11.65        4.54
VG                   9.98         3.39
best case            9.02         3.2

Performance Improvement: 27.7%

SLIDE 17

Updates & Future Work

  • Updates:
    – Built a prototype for multi-node Chapel
    – Optimized runtime instrumentation
    – Improved the graphical user interface
  • Future work:
    – Large-size problems on distributed systems
    – Further application of “Blame” in other fields

SLIDE 18

Conclusion

  • “Blame” application on PGAS programs
  • First Chapel-specific profiler
  • Benchmark optimization

Speed-ups    MiniMD   CLOMP   LULESH
Original     1        1       1
Optimized    2.3      2.1     1.4
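The reported speed-ups can be cross-checked against the execution times on the MiniMD, CLOMP, and LULESH result slides. The sketch below assumes the speed-ups refer to runs without --fast, that CLOMP uses the 12/640,000 problem size, and that LULESH uses the "best case" variant. These pairings are inferred from the charts and are not stated in the deck.

```python
def speedup(original_s, optimized_s):
    """Ratio of original to optimized execution time."""
    return original_s / optimized_s

def improvement_pct(original_s, optimized_s):
    """Reduction in execution time relative to the original, in percent."""
    return 100.0 * (original_s - optimized_s) / original_s

# (original, optimized) execution times in seconds, without --fast.
times = {
    "MiniMD": (20.9, 9.2),
    "CLOMP": (3.87, 1.82),    # assumed: 12/640,000 problem size
    "LULESH": (12.47, 9.02),  # assumed: "best case" variant
}

for name, (orig, opt) in times.items():
    print(f"{name}: {speedup(orig, opt):.1f}x")

print(f"LULESH improvement: {improvement_pct(12.47, 9.02):.1f}%")
```

Under these assumptions the results round to 2.3x, 2.1x, and 1.4x, and the LULESH improvement matches the 27.7% quoted on its result slide.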