This presentation uses the following non-standard fonts: Lato - - PowerPoint PPT Presentation

this presentation uses the following non standard fonts
SMART_READER_LITE
LIVE PREVIEW

This presentation uses the following non-standard fonts: Lato - - PowerPoint PPT Presentation

This presentation uses the following non-standard fonts: Lato (http://www.latofonts.com/lato-free-fonts/) FontAwesome (https://fortawesome.github.io/Font-Awesome/) You will need to download and install the TTF versions of these fonts for the


slide-1
SLIDE 1

This presentation uses the following non-standard fonts: Lato (http://www.latofonts.com/lato-free-fonts/) FontAwesome (https://fortawesome.github.io/Font-Awesome/) You will need to download and install the TTF versions

  • f these fonts for the slides to work correctly. A PDF

version with embedded fonts is available at http:// www.cs.grinnell.edu/~curtsinger/

slide-2
SLIDE 2

G R I N N E L L C O L L E G E

Coz: Finding Code that Counts with Causal Profiling

Charlie Curtsinger Emery D. Berger

slide-3
SLIDE 3

A few months ago, in a valley not so far away…

slide-4
SLIDE 4

This is Bob.

Bob has an idea
 for a startup.

slide-5
SLIDE 5
slide-6
SLIDE 6 found 8,000,000
 similar images

It’s going to totally disrupt image search.

gle gle

slide-7
SLIDE 7

gle gle

gle gle

found 8,000,000
 similar images

Take a picture Add it to 
 the database Send results Find similar pictures Send it to Ogle

The Prototype

slide-8
SLIDE 8
  • found 8,000,000

similar images

Send results Find similar pictures

slide-9
SLIDE 9
  • found 8,000,000

similar images

Send results Find similar pictures

slide-10
SLIDE 10

gle gle

gle gle gle gle gle gle gle gle gle gle gle gle gle gle gle gle gle gle gle gle

Ogle is totally disrupBng image search.

slide-11
SLIDE 11

gle gle

loading…

Ogle is too slow!

slide-12
SLIDE 12
  • SoCware Profilers

% cumulative time seconds calls name 20.05 8.02 1 9.56 3.82 1 19.95 7.98 1 45.19 11.31 1 5.25 2.10 1

Number of calls
 to each func8on Run8me for
 each func8on

slide-13
SLIDE 13
  • SoCware Profilers

% cumulative time seconds calls 20.05 8.02 1 9.56 3.82 1 19.95 7.98 1 45.19 11.31 1 5.25 2.10 1

Frequently executed code Code that runs for a long 8me

Are these places where Bob
 should focus on performance?

slide-14
SLIDE 14

Would this speed up Ogle?

gle gle

loading…

gle gle

loading…

slide-15
SLIDE 15

Would this speed up Ogle?

gle gle

loading…

Frequently executed code Code that runs for a long 8me

Profilers do a bad job finding important code in parallel programs.

We need beEer tools.

slide-16
SLIDE 16

What would speed up Ogle?

  • Hey, look over here

What would this informa8on look like?

slide-17
SLIDE 17

Program Speedup Speedup

?

  • Causal Profile

Tells you where opBmizaBons will make a difference

slide-18
SLIDE 18

Program Speedup Speedup

?

  • Causal Profile

If you speed up this much

  • The program will run this much faster
slide-19
SLIDE 19

Program Speedup Speedup

?

  • Causal Profile

this change

Run an experiment

How do we know causes this effect?

slide-20
SLIDE 20
  • Performance Experiments
  • If we could magically speed up …
slide-21
SLIDE 21
  • Performance Experiments
  • If we could magically speed up …
  • Speeding up by this much…
  • speeds up the program by this much.
slide-22
SLIDE 22
  • Performance Experiments
  • If we could magically speed up …
  • More speedup in …
  • leads to a larger program speedup.
slide-23
SLIDE 23
  • Performance Experiments
  • If we could magically speed up …
slide-24
SLIDE 24
  • Performance Experiments
  • If we could magically speed up …
  • No program speedup
slide-25
SLIDE 25
  • Performance Experiments
  • We’re going to have to do this without magic.
slide-26
SLIDE 26
  • Performance Experiments
  • We’re going to have to do this without magic.

Otherwise we’d just do this.

slide-27
SLIDE 27

Virtual Speedup

“Speed up” by slowing everything else down.

  • Each Bme runs, pause all other threads.
slide-28
SLIDE 28

Virtual Speedup

“Speed up” by slowing everything else down.

  • Each Bme runs, pause all other threads.
slide-29
SLIDE 29

Virtual Speedup

“Speed up” by slowing everything else down.

  • To account for the size of the delay…
slide-30
SLIDE 30

Virtual Speedup

“Speed up” by slowing everything else down.

  • Speeding up by this much…
  • speeds up the program by this much.
slide-31
SLIDE 31

Program Speedup Speedup

?

Speedup Results

slide-32
SLIDE 32
  • Program Speedup

Speedup

?

Speedup Results

slide-33
SLIDE 33

Virtual Speedup

“Speed up” by slowing everything else down.

slide-34
SLIDE 34

Virtual Speedup

“Speed up” by slowing everything else down.

  • A larger speedup has no addiBonal effect
slide-35
SLIDE 35
  • Program Speedup

Speedup

?

Speedup Results

slide-36
SLIDE 36
  • Program Speedup

Speedup

?

Speedup Results

slide-37
SLIDE 37

Virtual Speedup

“Speed up” by slowing everything else down.

  • Each Bme runs, pause all other threads.
slide-38
SLIDE 38

Virtual Speedup

“Speed up” by slowing everything else down.

  • Each Bme runs, pause all other threads.
slide-39
SLIDE 39

Virtual Speedup

“Speed up” by slowing everything else down.

  • Each Bme runs, pause all other threads.
slide-40
SLIDE 40
  • Program Speedup

Speedup

?

Speedup Results

slide-41
SLIDE 41
  • Program Speedup

Speedup

?

Speedup Results

slide-42
SLIDE 42
  • Program Speedup

Speedup

?

Speedup Results

slide-43
SLIDE 43

Virtual Speedup

“Speed up” by slowing everything else down.

  • Each Bme runs, pause all other threads.
slide-44
SLIDE 44

Virtual Speedup

  • “Speed up” by slowing everything else down.
  • Each Bme runs, pause all other threads.
slide-45
SLIDE 45

Virtual Speedup

  • “Speed up” by slowing everything else down.
  • Each Bme runs, pause all other threads.
slide-46
SLIDE 46
  • Program Speedup

Speedup

?

Speedup Results

slide-47
SLIDE 47
  • Program Speedup

Speedup

?

Speedup Results

slide-48
SLIDE 48

Virtual Speedup

  • “Speed up” by slowing everything else down.
  • Each Bme runs, pause all other threads.
slide-49
SLIDE 49

Virtual Speedup

  • “Speed up” by slowing everything else down.
  • Each Bme runs, pause all other threads.
slide-50
SLIDE 50

Virtual Speedup

  • “Speed up” by slowing everything else down.
  • Each Bme runs, pause all other threads.
slide-51
SLIDE 51

Virtual Speedup

  • “Speed up” by slowing everything else down.
  • Each Bme runs, pause all other threads.
slide-52
SLIDE 52
  • Program Speedup

Speedup

?

Speedup Results

slide-53
SLIDE 53
  • Program Speedup

Speedup

?

Speedup Results

  • Speeding up slows the program down!
slide-54
SLIDE 54

Virtual Speedup

  • “Speed up” by slowing everything else down.
  • Each Bme runs, pause all other threads.
slide-55
SLIDE 55

Virtual Speedup

  • “Speed up” by slowing everything else down.
  • Each Bme runs, pause all other threads.
slide-56
SLIDE 56
  • Program Speedup

Speedup

?

Speedup Results

slide-57
SLIDE 57
  • Program Speedup

Speedup

?

Speedup Results

slide-58
SLIDE 58
  • Program Speedup

Speedup

?

Speedup Results

slide-59
SLIDE 59
  • Program Speedup

Speedup

?

Speedup Results

slide-60
SLIDE 60

Is runOme meaningful?

gle gle

found 8,000,000
 similar images

Take a picture Add it to 
 the database Send results Find similar pictures Send it to Ogle

gle gle

slide-61
SLIDE 61

Is runOme meaningful?

gle gle

found 8,000,000
 similar images

gle gle

How fast do results come back? How long between
 request and response?

slide-62
SLIDE 62

Progress Points

  • Bob wants to send responses faster.

Each 8me this code runs… He marks as a progress point.

slide-63
SLIDE 63

Progress Points

  • Many requests running for many users.
slide-64
SLIDE 64

Progress Points

  • One progress point measures throughput.

If I speed up , how much faster do I run ?

?

slide-65
SLIDE 65

Progress Points

  • Bob wants to minimize response Bme.

He adds latency progress points.

transactions++ transactions--

slide-66
SLIDE 66

Progress Points

  • Bob wants to minimize response Bme.

He adds latency progress points.

transactions++ transactions-- Little’s Law: W= L / λ latency = transactions / throughput

slide-67
SLIDE 67

Coz: a Causal Profiler for Linux

> coz run --- ./ogle_server deploy

slide-68
SLIDE 68

Coz Produces Causal Profiles

  • Program Speedup

Speedup

?

  • Let’s use it to improve Ogle
slide-69
SLIDE 69

Using Causal Profiling on Ogle

gle gle found 8,000,000
 similar images
slide-70
SLIDE 70

Using Causal Profiling on Ogle

dedup

compression

ferret

image comparison

slide-71
SLIDE 71

Ferret

input

  • utput

segmentaOon feature extracOon indexing ranking

image comparison

slide-72
SLIDE 72
  • 0%
25% 50% 75% 100% 0% 25% 50% 75% 100% 0% 25% 50% 75% 100% line 320 line 358 line 255 0% 25% 50% 75% 100%

Line Speedup Program Speedup

Line
  • line 320
line 358 line 255

ranking indexing segmentaOon

Ferret

slide-73
SLIDE 73

Ferret

input

  • utput

segmentaOon feature extracOon indexing ranking

slide-74
SLIDE 74

Ferret

input

  • utput

segmentaOon feature extracOon indexing ranking

Probably doesn’t need as many threads

slide-75
SLIDE 75

Ferret

input

  • utput

segmentaOon feature extracOon indexing ranking

21% Speedup

slide-76
SLIDE 76

ranking

What did Causal Profiling predict?

Increased from 16 to 22 threads 27% increase in ranking throughput

slide-77
SLIDE 77 0% 0% 25% 50% 75% 100% Line Speedup
  • 0%
25% 50% 75% 100% 100% line 320 Program Speedup

ranking

Exactly what we observed What did Causal Profiling predict?

27% increase in ranking throughput Causal Profiling predicted a 21% improvement

slide-78
SLIDE 78

Using Causal Profiling on Ogle

dedup

compression

ferret

image comparison

slide-79
SLIDE 79

Dedup

Compression via deduplicaBon

slide-80
SLIDE 80

Dedup

Compression via deduplicaBon

slide-81
SLIDE 81

Dedup

Compression via deduplicaBon grumpycat1.jpg funisawful.jpg

slide-82
SLIDE 82

Dedup

Compression via deduplicaBon

hash_function( ) i =

slide-83
SLIDE 83

hash_function( )

Dedup

Compression via deduplicaBon

i =

slide-84
SLIDE 84

Dedup

Compression via deduplicaBon

i = hash_function( )

slide-85
SLIDE 85

Dedup

Compression via deduplicaBon

i = hash_function( )

slide-86
SLIDE 86

Dedup

Compression via deduplicaBon

i = hash_function( )

slide-87
SLIDE 87

Dedup

Compression via deduplicaBon

i = hash_function( )

slide-88
SLIDE 88

Dedup

Compression via deduplicaBon Hash table is accessed concurrently by many threads Causal Profiler says the loop that accesses this list is important

slide-89
SLIDE 89

Dedup

Compression via deduplicaBon More hash buckets should lead to fewer collisions

slide-90
SLIDE 90

Dedup

Compression via deduplicaBon More hash buckets should lead to fewer collisions

slide-91
SLIDE 91

Dedup

Compression via deduplicaBon More hash buckets should lead to fewer collisions

No performance improvement

slide-92
SLIDE 92

Dedup

Compression via deduplicaBon

What else could be causing collisions? i = hash_function( )

slide-93
SLIDE 93

Horrible hash func8on!

Bin UBlizaBon

2.3%

50 100 150 Original Version Collisions 250 500 750 1000

Bucket Index

Items per-bucket

Dedup

Compression via deduplicaBon

slide-94
SLIDE 94

Bin UBlizaBon

2.3%

50 100 150 Original Version Collisions

82%

250 500 750 1000

Bucket Index

50 100 150 After Optimization Collisions

9% Speedup

Items per-bucket Compression via deduplicaBon

Dedup

slide-95
SLIDE 95

What did Causal Profiling predict?

Before: 76.7 ACer: 2.09

Blocks per-bucket

Dedup

Compression via deduplicaBon

96% traversal speedup 9% predicted speedup,
 exactly what we observed

slide-96
SLIDE 96

Using Causal Profiling on Ogle

dedup

compression

ferret

image comparison

slide-97
SLIDE 97

Simple SQL Database #if THREAD_SAFE config_t global_config = { … .unlock = pthread_mutex_unlock, .getsize = sqlite_usable_size, .nextitem = sqlite_pagecache_next, … }; #endif

slide-98
SLIDE 98

Simple SQL Database .nextitem = sqlite_pagecache_next, void pthreadMutexLeave(lock* l) { global_config.unlock(l); }

Indirect Call

Cheap, but almost the same cost as pthread_mutex_unlock

slide-99
SLIDE 99

Simple SQL Database void pthreadMutexLeave(lock* l) { global_config.unlock(l); } void sqlite3MemSize(void* p) { global_config.getsize(p); } void pcache1Fetch(item* i) { global_config.nextitem(i); }

Coz highlights these lines

slide-100
SLIDE 100

Simple SQL Database

  • Line 16916
Line 18974 Line 40345 −50% −25% 0% 25% 0% 50% 0% 50% 0% 50% Line Speedup Program Speedup
slide-101
SLIDE 101

Simple SQL Database void pthreadMutexUnlock(lock* l) { global_config.unlock(l); } void sqlite3MemSize(void* p) { global_config.getsize(p); } void pcache1Fetch(item* i) { global_config.nextitem(i); }

slide-102
SLIDE 102

Simple SQL Database void pthreadMutexUnlock(lock* l) { pthread_mutex_unlock(l); } void sqlite3MemSize(void* p) { sqlite_usable_size(p); } void pcache1Fetch(item* i) { sqlite_pagecache_next(i); }

25% Speedup

slide-103
SLIDE 103

What do other profilers say?

% Runtime Symbol 85.55% _raw_spin_lock 1.76% x86_pmu_enable_all ... 30 lines … 0.10% rcu_irq_enter
 0.09% sqlite3MemSize
 0.09% source_load ... 26 lines … 0.03% __queue_work
 0.03% pcache1Fetch
 0.03% kmem_cache_free
 0.03% update_cfs_rq_blocked_load
 0.03% pthreadMutexLeave
 0.03% sqlite3MemMalloc

Just 0.15% of
 total runOme

slide-104
SLIDE 104

Using Causal Profiling on Ogle

dedup

compression

ferret

image comparison

25% 9% 21% 0.15% 14.38% 0.00% speedup
 with Coz % run8me

(whole func,on)

slide-105
SLIDE 105

Summary of OpOmizaOons

Benchmark Speedup Diff Size Change Summary memcached 9.39%

  • 6, +2

removed unnecessary locks sqlite 25.60%

  • 3, +3

removed DIY vtable implementaBon blackscholes 2.56%

  • 61, +4

manual common subexpression eliminaBon dedup 8.95%

  • 3, +3

fixed degenerate hash funcBon ferret 21.27%

  • 4, +4

rebalanced pipeline thread allocaBon fluidanimate 37.50%

  • 1, +0

removed custom barrier with high contenBon streamcluster 68.40%

  • 1, +0

removed custom barrier with high contenBon swapOons 15.80%

  • 10, +16

reordered loop nests

slide-106
SLIDE 106

gle gle

loading…

% cumulative time seconds calls name 20.05 8.02 1 9.56 3.82 1 19.95 7.98 1 45.19 11.31 1 5.25 2.10 1
slide-107
SLIDE 107

gle gle

loading…

% cumulative time seconds calls name 20.05 8.02 1 9.56 3.82 1 19.95 7.98 1 45.19 11.31 1 5.25 2.10 1
slide-108
SLIDE 108

gle gle

loading…

% cumulative time seconds calls name 20.05 8.02 1 9.56 3.82 1 19.95 7.98 1 45.19 11.31 1 5.25 2.10 1
slide-109
SLIDE 109

G R I N N E L L C O L L E G E

This is the profiler you are looking for.

coz-profiler.org

slide-110
SLIDE 110

G R I N N E L L C O L L E G E

This is the profiler you are looking for.

coz-profiler.org

slide-111
SLIDE 111

0% 20% 40% 60% b l a c k s c h

  • l

e s b

  • d

y t r a c k c a n n e a l d e d u p f a c e s i m f e r r e t f l u i d a n i m a t e f r e q m i n e r a y t r a c e s t r e a m c l u s t e r s w a p t i

  • n

s v i p s x 2 6 4 m e a n

Benchmark

Percent Overhead

Delays Sampling Startup

Overhead of COZ

slide-112
SLIDE 112

Virtual Speedup Accuracy

Virtual Program Speedup

  • 75%
  • 50%
  • 25%
0% 25% 50%

Real Program Speedup

  • 75%
  • 50%
  • 25%
0% 25% 50%

Parallel compression program pbzip2 The program speedup we get from
 actually speeding up a piece of code

slide-113
SLIDE 113

Virtual Speedup Accuracy

Virtual Program Speedup

  • 75%
  • 50%
  • 25%
0% 25% 50%

Real Program Speedup

  • 75%
  • 50%
  • 25%
0% 25% 50%

Parallel compression program pbzip2 The program speedup we get from
 virtually speeding up the same code.

slide-114
SLIDE 114

Virtual Speedup Accuracy

Virtual Program Speedup

  • 75%
  • 50%
  • 25%
0% 25% 50%

Real Program Speedup

  • 75%
  • 50%
  • 25%
0% 25% 50%

Parallel compression program pbzip2

Causal Profiling’s predic8ons closely match reality The program ran faster with
 the slow version of this code!