Understanding and Tuning the Performance of Critical Sections with - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Understanding and Tuning the Performance of Critical Sections with Program Analysis and Software Visualization Tools

Michael Dilip Shah Advisor: Samuel Z. Guyer Monday July 31, 2017

1

slide-2
SLIDE 2

Why Care About Performance

  • Servers
  • Mobile
  • Games

Image Sources: www.facebook.com http://www.techcrok.com/ http://modloader-for-minecraft.en.softonic.com/

2

slide-3
SLIDE 3

Moore’s Law

"The number of transistors incorporated in a chip will approximately double every 24 months."

  • Gordon Moore, Intel co-founder

http://www-cs-faculty.stanford.edu/~eroberts/cs181/projects/2010-11/TechnologicalSingularity/pageviewa478.html?file=forfeasibility.html

3

slide-4
SLIDE 4

Moore’s Law

"The number of transistors incorporated in a chip will approximately double every 24 months."

  • Gordon Moore, Intel co-founder

http://www-cs-faculty.stanford.edu/~eroberts/cs181/projects/2010-11/TechnologicalSingularity/pageviewa478.html?file=forfeasibility.html

4

  • Physically (on the atomic scale), transistors are packed very tightly together
  • Heat becomes a problem
  • Energy consumption increases

slide-5
SLIDE 5

Now we use multiple processors to increase performance

Compute Y Compute Z

5

slide-6
SLIDE 6

Rendering an Image in Parallel

6

Sunflow – Java Multithreaded Raytracer

slide-7
SLIDE 7

Setup 16 threads

7

Sunflow – Java Multithreaded Raytracer

slide-8
SLIDE 8

Divide and Conquer

8

slide-9
SLIDE 9

Measure the performance

9

Threads   Time per frame
1         20 seconds
16        6 seconds

slide-10
SLIDE 10

Measure the performance

10

  • Why is this not 16 times faster?

Threads   Time per frame
1         20 seconds
16        6 seconds

slide-11
SLIDE 11

Amdahl’s Law

  • Performance is limited by the serial tasks in a program
  • The ratio of serial tasks to parallel tasks dictates the maximum speedup.

11

Amdahl’s Law: $\text{Speedup} = \dfrac{T_{\text{serial runtime}}}{T_{\text{parallel runtime}}}$
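
As a rough worked example, Amdahl's law is commonly written as Speedup = 1 / (s + (1 - s) / n), where s is the serial fraction of the work and n is the number of threads. The sketch below plugs in the Sunflow timings from the measurement slide (20 seconds on 1 thread, 6 seconds on 16 threads) and estimates the serial fraction they would imply, assuming the measurements follow this model exactly. The class and method names are illustrative and do not come from the presentation.

```java
final class AmdahlSketch {
    /** Amdahl's law: speedup on `threads` threads when `serialFraction` of the work is serial. */
    static double speedup(double serialFraction, int threads) {
        return 1.0 / (serialFraction + (1.0 - serialFraction) / threads);
    }

    /** Inverts Amdahl's law: the serial fraction implied by an observed speedup. */
    static double impliedSerialFraction(double observedSpeedup, int threads) {
        return (threads / observedSpeedup - 1.0) / (threads - 1.0);
    }

    public static void main(String[] args) {
        // Timings from the slide: 20 s on 1 thread, 6 s on 16 threads.
        double observed = 20.0 / 6.0;
        double s = impliedSerialFraction(observed, 16);
        System.out.printf("observed speedup = %.2f%n", observed);
        System.out.printf("implied serial fraction = %.3f%n", s);
        System.out.printf("speedup limit with unlimited threads = %.2f%n", 1.0 / s);
    }
}
```

Under that assumption, roughly a quarter of the frame time would be serial, which caps the speedup at just under 4x no matter how many threads are added.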

slide-12
SLIDE 12

Resources in a program are shared

12

  • Only 1 bunny in this scene
slide-13
SLIDE 13

Resources in a program are shared

13

  • Only 1 bunny in this scene
  • Two or more threads attempting to update a shared resource at the same time results in a data race

slide-14
SLIDE 14

Threads put in a waiting queue

14

  • A few threads work
  • Threads are blocked in order to enforce correctness

[Figure: queue of waiting threads, labeled "Blocked"]

slide-15
SLIDE 15

Java Concurrency – Synchronized Method Example

synchronized void modifyBunny() {
    // . . .
    // modify geometry for the bunny
    // . . .
}

15

slide-16
SLIDE 16

Synchronized – puts a lock over shared resources

synchronized void modifyBunny() {
    // . . .
    // modify geometry for the bunny
    // . . .
}

16
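
A minimal runnable sketch of the same idea, assuming a plain integer counter stands in for the bunny's geometry (the class `SharedBunny` and its methods are hypothetical, not from the presentation). Sixteen threads update the shared state through a synchronized method, so no update is lost.

```java
final class SharedBunny {
    private int edits = 0; // stands in for the bunny's shared geometry

    // Only one thread at a time may run this method on a given SharedBunny.
    synchronized void modifyBunny() {
        edits++; // read-modify-write: updates could be lost without the lock
    }

    synchronized int edits() {
        return edits;
    }

    /** Starts `threads` workers that each call modifyBunny() `perThread` times. */
    static int run(int threads, int perThread) {
        SharedBunny bunny = new SharedBunny();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    bunny.modifyBunny();
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join(); // wait for every worker before reading the result
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return bunny.edits();
    }

    public static void main(String[] args) {
        System.out.println(run(16, 10_000)); // prints 160000
    }
}
```

Removing `synchronized` from `modifyBunny()` would make the final count unpredictable, which is the data race described on the earlier slide.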

slide-17
SLIDE 17

Critical Sections Defined

  • A section of code that is executed by only one thread at a given time.

[Figure: Thread 1, Thread 2, …, Thread N contending for a critical section; waiting threads are labeled "Blocked"]

17

slide-18
SLIDE 18

Correctness (can be) Easy, Performance Hard

public class DrawPicture {
    DrawPicture(…) {…}
    lighting(…) {…}
    tesselate(…) {…}
    shadows(…) {…}
    geometry(…) {…}
    getPixel(…) {…}
    getNumLights(…) {…}
}

18

slide-19
SLIDE 19

Correctness (can be) Easy, Performance Hard

public class DrawPicture {
    DrawPicture(…) {…}
    synchronized lighting(…) {…}
    synchronized tesselate(…) {…}
    synchronized shadows(…) {…}
    synchronized geometry(…) {…}
    synchronized getPixel(…) {…}
    synchronized getNumLights(…) {…}
}

19

Good job, no data races here!

slide-20
SLIDE 20

http://www-cs-faculty.stanford.edu/~eroberts/cs181/projects/2010-11/TechnologicalSingularity/pageviewa478.html?file=forfeasibility.html

Correctness (can be) Easy, Performance Hard

Your program runs sequentially. Did you forget about Amdahl’s law?

20

slide-21
SLIDE 21

The Big Picture With Multithreaded Code

  • We want our software to run fast
  • Writing multithreaded code correctly is difficult
  • We use synchronized code when a common resource is shared amongst threads.

21

slide-22
SLIDE 22

The Problem

Real world programmers do not always understand the performance of their code in critical sections.

22

slide-23
SLIDE 23

Related Work

  • 2012, PLDI - Understanding and Detecting Real-World Performance Bugs
  • 332 previously unknown performance problems are found in the latest versions of MySQL, Apache, and Mozilla applications
  • “Developers frequently use inefficient code sequences that could be fixed by simple patches. These inefficient code sequences can cause significant performance degradation and resource waste, referred to as performance bugs. Meager increases in single threaded performance in the multi-core era and increasing emphasis on energy efficiency call for more effort in tackling performance bugs.”

23

slide-26
SLIDE 26

Related Work

  • 2013, ICSE - Toddler: Detecting Performance Problems via Similar Memory-Access Patterns
  • “detecting performance bugs usually requires time-consuming, manual analysis of execution profiles. The human effort for performance analysis limits the number of performance tests analyzed and enables performance bugs to easily escape to production.”

26

slide-28
SLIDE 28

Thesis Statement

Static, dynamic, and software visualization analysis tools focused on critical sections are needed to uncover performance variability in critical sections to avoid unintended software hangs.

28

slide-29
SLIDE 29

Thesis Statement

Static, dynamic, and software visualization analysis tools focused on critical sections are needed to uncover performance variability in critical sections to avoid unintended software hangs.

29

A potential bottleneck – remember only 1 thread of execution

slide-30
SLIDE 30

Thesis Statement

Static, dynamic, and software visualization analysis tools focused on critical sections are needed to uncover performance variability in critical sections to avoid unintended software hangs.

If we cannot estimate time accurately – does that impact user experience?

30

slide-31
SLIDE 31

Thesis Statement

Static, dynamic, and software visualization analysis tools focused on critical sections are needed to uncover performance variability in critical sections to avoid unintended software hangs.

New tools and analysis will provide insights into how to solve this problem.

31

slide-32
SLIDE 32

Program Analysis

  • Static Analysis
  • Dynamic Analysis

32

slide-33
SLIDE 33

Iceberg 2.0 Dynamic Analysis

Dynamic Analysis is information gathered when the program runs.

33

slide-34
SLIDE 34

34

Bytecode Instrumentation with Javassist

[Workflow: Compile with Java compiler → Write Transformation → Build Javaagent → Execute Program with Javaagent]

slide-35
SLIDE 35

35

Compile with Java compiler

slide-36
SLIDE 36
  • Leverage our previous static analysis to tell our dynamic analysis which methods to instrument

36

[Workflow: Compile with Java compiler → Write Transformation]

slide-37
SLIDE 37
  • Use the Javassist bytecode engineering library to transform actual Java bytecode.
  • Code that will be injected into critical sections
  • Care taken to minimally perturb the system

37

[Workflow: Compile with Java compiler → Write Transformation (Method Entry Probe, Method Exit Probe)]
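
To make the entry and exit probes concrete, here is a hand-written sketch of what an instrumented synchronized method might behave like. It is plain Java written by hand, not output from Javassist or from the Iceberg tool, and the class `ProbeSketch` and its fields are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class ProbeSketch {
    // One duration in nanoseconds per execution of the critical section.
    private final List<Long> durationsNs = new ArrayList<>();
    private int sharedState = 0;

    synchronized void criticalSection() {
        long entryNs = System.nanoTime(); // method entry probe
        try {
            sharedState++; // the original method body
        } finally {
            // Method exit probe: runs on normal return and on exceptions.
            durationsNs.add(System.nanoTime() - entryNs);
        }
    }

    /** Returns a copy so callers cannot modify the recorded samples. */
    synchronized List<Long> durations() {
        return new ArrayList<>(durationsNs);
    }

    public static void main(String[] args) {
        ProbeSketch sketch = new ProbeSketch();
        for (int i = 0; i < 5; i++) {
            sketch.criticalSection();
        }
        System.out.println(sketch.durations());
    }
}
```

Note that this sketch records the sample while still holding the lock, which lengthens the critical section slightly. That is one form of the perturbation the slide warns about.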

slide-38
SLIDE 38

38

[Workflow: Compile with Java compiler → Write Transformation → Build Javaagent]

  • Build transformation into a .jar file.
slide-39
SLIDE 39

39

[Workflow: Compile with Java compiler → Write Transformation → Build Javaagent → Execute Program with Javaagent]
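
The build and execute steps can be sketched with the standard `java.lang.instrument` API. This is a skeleton under stated assumptions, not the presenters' agent: the transformer below leaves every class unchanged, and the class names are hypothetical. A real agent would rewrite the bytecode at the marked point (the slides use Javassist for that).

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

final class TimingAgentSketch {
    /** The JVM calls premain before the program's main() when started with -javaagent. */
    public static void premain(String agentArgs, Instrumentation inst) {
        inst.addTransformer(new PassThroughTransformer());
    }

    static final class PassThroughTransformer implements ClassFileTransformer {
        @Override
        public byte[] transform(ClassLoader loader, String className,
                                Class<?> classBeingRedefined,
                                ProtectionDomain protectionDomain,
                                byte[] classfileBuffer) {
            // A real transformation would inspect className and rewrite
            // classfileBuffer here, for example to inject timing probes.
            // Returning null tells the JVM to keep the class unchanged.
            return null;
        }
    }
}
```

The agent jar's manifest needs a `Premain-Class` attribute naming the agent class, and the program is then launched with something like `java -javaagent:agent.jar -jar app.jar`, where both jar names are placeholders.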

slide-43
SLIDE 43
  • Record time spent within critical sections
  • Gather entries to and exits from methods
  • Variety of power in instrumentation
  • Can record stack
  • Can record thread contention
  • Can record full call tree
  • Interested in time for now

43

[Workflow: Compile with Java compiler → Write Transformation → Build Javaagent → Execute Program with Javaagent]

slide-44
SLIDE 44

Calibrating our tool with Microbenchmarks

  • 28 total microbenchmarks with synchronized methods
  • Input/Output
  • Networking
  • Allocations
  • Data Structures
  • Branching behavior
  • Nested synchronization
  • Sorting
  • Purpose: to validate that our tool works

44

slide-45
SLIDE 45

SortArrayListTest.addRandom5LetterWord

45

[Plot: time (ns) vs. Nth execution of the same method]

slide-47
SLIDE 47

SortLinkedListTest.addRandom5LetterWord

47

[Plot: time (ns) vs. Nth execution of the same method]

slide-49
SLIDE 49

StackTest.remove()

49

[Plot: time (ns) vs. Nth execution of the same method]

slide-51
SLIDE 51

StringTest.append()

51

[Plot: time (ns) vs. Nth execution of the same method]

slide-53
SLIDE 53

AllocationTest.remove()

53

[Plot: time (ns) vs. Nth execution of the same method]

slide-55
SLIDE 55

Our tool captures execution information

Variability in execution time exists: these plots are not flat

55

slide-56
SLIDE 56

Statistical model – Find executions more than one standard deviation away from the trend

56
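
A minimal sketch of such a filter, assuming the "trend" is simply the mean of all recorded times for one method (the slides do not say how the trend is computed, so this is an illustrative simplification). The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class DivergenceSketch {
    /** Returns the indices of samples more than one standard deviation from the mean. */
    static List<Integer> divergentExecutions(long[] timesNs) {
        List<Integer> flagged = new ArrayList<>();
        if (timesNs.length == 0) {
            return flagged;
        }
        double sum = 0;
        for (long t : timesNs) {
            sum += t;
        }
        double mean = sum / timesNs.length;

        double squaredDiffs = 0;
        for (long t : timesNs) {
            squaredDiffs += (t - mean) * (t - mean);
        }
        // Population standard deviation over all executions of the method.
        double stdDev = Math.sqrt(squaredDiffs / timesNs.length);

        for (int i = 0; i < timesNs.length; i++) {
            if (Math.abs(timesNs[i] - mean) > stdDev) {
                flagged.add(i);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        // Made-up timings: five similar executions and one slow one.
        long[] timesNs = {100, 102, 98, 101, 99, 400};
        System.out.println(divergentExecutions(timesNs)); // prints [5]
    }
}
```

With these made-up numbers the mean is 150 ns and the standard deviation is about 112 ns, so only the 400 ns execution is flagged.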

slide-58
SLIDE 58

Test on Real World Programs

58

  • Skulls – video game
  • Sunflow – raytracer
  • Mediaplayer – high-resolution .mp4 video player

slide-59
SLIDE 59

Metrics collected for each synchronized method

  • Number of times method executes

59

slide-60
SLIDE 60

Metrics collected for each synchronized method

  • Number of times method executes
  • Total time spent in that method
  • Min, max, average ranges

60
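
These per-method metrics map naturally onto Java's built-in `LongSummaryStatistics`. The sketch below is one possible way to collect them, not the tool's actual implementation, and `MethodMetricsSketch` is a hypothetical name.

```java
import java.util.LongSummaryStatistics;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class MethodMetricsSketch {
    private final Map<String, LongSummaryStatistics> byMethod = new ConcurrentHashMap<>();

    /** Records one execution of a synchronized method. */
    void record(String methodName, long durationNs) {
        // compute() runs atomically per key, so concurrent recorders do not interleave.
        byMethod.compute(methodName, (name, stats) -> {
            LongSummaryStatistics s = (stats == null) ? new LongSummaryStatistics() : stats;
            s.accept(durationNs);
            return s;
        });
    }

    /** Count, sum, min, max, and average for one method, or null if never recorded. */
    LongSummaryStatistics statsFor(String methodName) {
        return byMethod.get(methodName);
    }

    public static void main(String[] args) {
        MethodMetricsSketch metrics = new MethodMetricsSketch();
        // Made-up durations for illustration only.
        metrics.record("getRenderer", 10);
        metrics.record("getRenderer", 30);
        metrics.record("getRenderer", 20);
        System.out.println(metrics.statsFor("getRenderer"));
    }
}
```

The returned statistics object should be read only after recording has finished, since it is not itself thread-safe.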

slide-61
SLIDE 61

Java Movie Player

  • Analyzed 40 different synchronized methods
  • Iceberg tells us which methods diverged during most of their executions
  • Iceberg tells us how much total time we spend within each synchronized method
  • So we know if it is worth our time to fix.

61

slide-65
SLIDE 65

Java Movie Player – Example 1 | getRenderer

  • A method that returns a renderer to draw to the screen
  • A big critical section
  • Iteration
  • Control flow changes based on state
  • Called 2104 times during a 14-minute movie
  • Diverges 0.03% of the time

65

(source code tiny on purpose)

slide-66
SLIDE 66

Resulting Analysis

  • 40 synchronized methods analyzed in Movie Player
  • 86 synchronized methods analyzed in Skulls
  • 44 synchronized methods analyzed in Sunflow
  • 10 additional benchmarks in the database, graphics, and game domains analyzed.

66

slide-67
SLIDE 67

Resulting Analysis - 8 different behaviors

  • Behavior A: Several execution paths
  • Behavior B: High contention
  • Behavior C: Amortized data structures
  • Behavior D: Data locality (linked list versus array)
  • Behavior E: Large number of allocations
  • Behavior F: Input/Output events, such as logging
  • Behavior G: Generic getter/setter methods
  • Behavior H: Critical sections that only execute once

67

slide-68
SLIDE 68

Resulting Analysis

  • Our tool picks out synchronized methods with the highest variability
  • Resolutions
  • Split the function up into special use cases (smaller critical sections)
  • Variability is important: it makes profile-guided optimization difficult

68
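
One common way to get a smaller critical section is to move work that touches no shared state outside the lock. The sketch below illustrates that general technique. It is not taken from the presentation, which describes splitting a function into special use cases, and all names in it are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class NarrowLockSketch {
    private final List<String> words = new ArrayList<>();

    // Coarse: the string processing runs while the lock is held.
    synchronized void addCoarse(String raw) {
        words.add(normalize(raw));
    }

    // Narrow: only the update to the shared list is inside the critical section.
    void addNarrow(String raw) {
        String normalized = normalize(raw); // touches no shared state, so no lock
        synchronized (this) {
            words.add(normalized);
        }
    }

    synchronized int size() {
        return words.size();
    }

    synchronized String get(int index) {
        return words.get(index);
    }

    private static String normalize(String raw) {
        return raw.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        NarrowLockSketch sketch = new NarrowLockSketch();
        sketch.addCoarse("  Bunny ");
        sketch.addNarrow("TEAPOT");
        System.out.println(sketch.size() + " " + sketch.get(0) + " " + sketch.get(1));
    }
}
```

Both methods store the same result. The narrow version holds the lock for less time, so blocked threads wait only for the list update rather than for the string processing as well.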

slide-69
SLIDE 69

Reminder

Static, dynamic, and software visualization analysis tools focused on critical sections are needed to uncover performance variability in critical sections to avoid unintended software hangs.

We have confirmed performance variability!

69

slide-70
SLIDE 70

Now to focus on root cause

70

slide-71
SLIDE 71

Acknowledgements

The committee for their patience, feedback, guidance, and support!

71

slide-72
SLIDE 72

Understanding and Tuning the Performance of Critical Sections with Program Analysis and Software Visualization Tools

Michael Dilip Shah Advisor: Samuel Z. Guyer Monday July 31, 2017

Committee Members

  • Remco Chang
  • Kathleen Fisher
  • Mark Hempstead
  • Tao B. Schardl

72