Continuous Profiling in Production: What, Why and How (PowerPoint PPT presentation transcript)

SLIDE 1

Continuous Profiling in Production: What, Why and How

Richard Warburton (@richardwarburto), Sadiq Jaffer (@sadiqj), https://www.opsian.com

SLIDE 2

• Why Performance Tools Matter
• Development isn’t Production
• Profiling vs Monitoring
• Continuous Profiling
• Conclusion

SLIDE 3

Known Knowns

SLIDE 4

Known Unknowns

SLIDE 5

Unknown Unknowns

SLIDE 6

• Why Performance Tools Matter
• Development isn’t Production
• Profiling vs Monitoring
• Continuous Profiling
• Conclusion

SLIDE 7

Development isn’t Production

• Performance testing in development can be easier
• May not have access to production
• Tooling often desktop-based
• Not representative of production

SLIDE 8

Unrepresentative Hardware

[Image: development hardware vs production hardware]

SLIDE 9

Unrepresentative Software

SLIDE 10

Unrepresentative Workloads

[Image: development workload vs production workload]

SLIDE 11

The JVM may have very different behaviour in production

• Hotspot does adaptive optimisation
• Production may optimise differently
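
One concrete way to compare JIT behaviour between environments (a sketch; app.jar is a placeholder, and PrintInlining is a diagnostic flag that must be unlocked first) is to log compilation decisions in both and diff the hot methods:

    # Log which methods Hotspot compiles, and at what tier
    java -XX:+PrintCompilation -jar app.jar

    # Log inlining decisions too
    java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining -jar app.jar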

SLIDE 12

SLIDE 13

• Why Performance Tools Matter
• Development isn’t Production
• Profiling vs Monitoring
• Continuous Profiling
• Conclusion

SLIDE 14

Ambient/Passive/System Metrics

• Preconfigured numerical measures
• e.g. CPU time usage / page-load times
• Cheap and sometimes effective

SLIDE 15

SLIDE 16

Logging

• Records arbitrary events emitted by the system being monitored
• log4j / slf4j / logback
• Logs of GC events
• Often manual; aids system understanding; expensive
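
For example, the GC-event logs mentioned above are typically enabled with JVM flags like these (a sketch; app.jar is a placeholder and the flag spelling depends on JDK version):

    # Java 8
    java -XX:+PrintGCDetails -Xloggc:gc.log -jar app.jar

    # JDK 9+ unified logging
    java -Xlog:gc*:file=gc.log -jar app.jar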

SLIDE 17

Coarse Grained Instrumentation

• Measures time within some instrumented section of the code
• e.g. time spent inside the controller layer of your web-app, or performing SQL queries
• More detailed and actionable, though expensive
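
A minimal hand-rolled sketch of this kind of instrumentation (the controller and repository here are invented for illustration; real systems usually feed a metrics library rather than stdout):

    import java.util.concurrent.TimeUnit;

    public class InstrumentedController {

        public String handleRequest(long id) {
            long start = System.nanoTime();
            try {
                return repo.readPerson(id); // the instrumented section
            } finally {
                long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
                System.out.println("controller.handleRequest took " + elapsedMs + "ms");
            }
        }

        // Stand-in repository so the sketch is self-contained.
        private final PersonRepository repo = new PersonRepository();

        static class PersonRepository {
            String readPerson(long id) { return "person-" + id; }
        }

        public static void main(String[] args) {
            System.out.println(new InstrumentedController().handleRequest(42));
        }
    }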

SLIDE 18

Production Profiling

• What methods use up CPU time?
• What lines of code allocate the most objects?
• Where are your CPU cache misses coming from?
• Automatic, can be cheap but often isn’t

SLIDE 19

Where Instrumentation can be blind in the Real World

Problem: every 5 seconds an HTTP endpoint would be really slow.
Instrumentation: on the servlet request, didn’t even show the pause!
Cause: Tomcat expired its resources cache every 5 seconds; on load, one resource scanned the entire classpath.

SLIDE 20

SLIDE 21

Surely a better way?

• Not just Metrics - Actionable Insights
• Diagnostics aren’t Diagnosis
• What about Profiling?

SLIDE 22

• Why Performance Tools Matter
• Development isn’t Production
• Profiling vs Monitoring
• Continuous Profiling
• Conclusion

SLIDE 23

How to use Continuous Profilers

1) Extract relevant time period and apps/machines
2) Choose a type of profile: CPU Time / Wallclock Time / Memory
3) View results to tell you what the dominant consumer of a resource is
4) Fix biggest bottleneck
5) Deploy / Iterate

SLIDE 24

CPU Time vs Wallclock Time

SLIDE 25

You need both CPU Time and Wallclock Time

CPU:
• Diagnose expensive computational hotspots and inefficient algorithms
• Spot code that should not be executing but is...

Wallclock:
• Diagnose blocking that stops CPU usage, e.g. blocking on external IO and lock contention issues
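
To make the distinction concrete, here is a small self-contained Java sketch: the busy loop burns CPU time and wallclock time together, while the sleep (standing in for blocking IO or lock contention) consumes wallclock time only:

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadMXBean;

    public class CpuVsWallclock {
        public static void main(String[] args) throws InterruptedException {
            ThreadMXBean threads = ManagementFactory.getThreadMXBean();
            long wallStart = System.nanoTime();
            long cpuStart = threads.getCurrentThreadCpuTime();

            // Compute-bound work: CPU time and wallclock time grow together.
            long acc = 0;
            for (long i = 0; i < 200_000_000L; i++) {
                acc += i;
            }

            // Blocking "IO": wallclock time keeps growing, CPU time barely moves.
            Thread.sleep(1_000);

            long wallMs = (System.nanoTime() - wallStart) / 1_000_000;
            long cpuMs = (threads.getCurrentThreadCpuTime() - cpuStart) / 1_000_000;
            System.out.println("acc=" + acc + " wall=" + wallMs + "ms cpu=" + cpuMs + "ms");
        }
    }

A CPU profile of this program would show only the loop; only a wallclock profile would surface the sleep.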

SLIDE 26

Profiling Hotspots

SLIDE 27

Profiling Treeviews

SLIDE 28

Profiling Flamegraphs

SLIDE 29

Instrumenting Profilers

• Add instructions to collect timings (e.g. the JVisualVM profiler)
• Inaccurate: modifies the behaviour of the program
• High overhead: >2x slower
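
Conceptually, an instrumenting profiler rewrites every method along these lines (a rough sketch, not any particular tool's output); the two timer reads and the recording call per invocation explain both the overhead and the distortion, since tiny methods may no longer be inlined:

    public class WhatInstrumentationDoes {

        // The method as written.
        static int compute(int x) {
            return x * x;
        }

        // The method as an instrumenting profiler effectively rewrites it.
        static int computeInstrumented(int x) {
            long start = System.nanoTime();
            try {
                return x * x;
            } finally {
                record("compute", System.nanoTime() - start);
            }
        }

        static void record(String method, long nanos) {
            // A real profiler aggregates into histograms; printing is for illustration.
            System.out.println(method + " took " + nanos + "ns");
        }

        public static void main(String[] args) {
            System.out.println(compute(7));
            System.out.println(computeInstrumented(7));
        }
    }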

SLIDE 30

Sampling/Statistical Profilers

WebServerThread.run()
Controller.doSomething()
Controller.next()
Repo.readPerson()
new Person()
View.printHtml()
???
???

SLIDE 31

Safepoints

• Mechanism for bringing Java application threads to a halt
• Safepoint polls added to compiled code read a known memory location
• Protecting that memory page triggers a segfault and suspends the threads

SLIDE 32

Safepoint Bias

WebServerThread.run()
Controller.doSomething()
Controller.next()
Repo.readPerson()
new Person()
View.printHtml()
???

SLIDE 33

Safepoint Bias after Inlining

WebServerThread.run()
Controller.doSomething()
Controller.next()
Repo.readPerson()
new Person()
View.printHtml()
???
???

SLIDE 34

Time to Safepoint

  • -XX:+PrintSafepointStatistics

[Diagram: application threads run until they hit the safepoint poll, then the VM operation executes]
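
To observe time-to-safepoint yourself, the flag above can be enabled on the command line (a sketch; app.jar is a placeholder, and this flag is a Java 8 era option that later JDKs replaced with unified logging):

    # Java 8: print statistics for every safepoint
    java -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 -jar app.jar

    # JDK 9+: the equivalent via unified logging
    java -Xlog:safepoint -jar app.jar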

SLIDE 35

Statistical Profiling in Java

Problem: getAllStackTraces is expensive to do frequently, inaccurate, and only gives us wallclock time.

Need ways to:
1. Interrupt the application
2. Sample the resource of interest
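
For concreteness, the naive approach looks like this (a minimal sketch, not a recommendation):

    import java.util.Map;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class NaiveSampler {
        public static void main(String[] args) throws InterruptedException {
            ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
            scheduler.scheduleAtFixedRate(NaiveSampler::sampleOnce, 0, 100, TimeUnit.MILLISECONDS);
            Thread.sleep(3_000); // sample for a few seconds, then stop
            scheduler.shutdown();
        }

        static void sampleOnce() {
            // Brings every thread to a safepoint and copies all stacks: expensive,
            // safepoint-biased, and wallclock-only (blocked threads are sampled
            // just as often as running ones).
            Map<Thread, StackTraceElement[]> stacks = Thread.getAllStackTraces();
            stacks.forEach((thread, frames) -> {
                if (frames.length > 0) {
                    System.out.println(thread.getName() + " @ " + frames[0]);
                }
            });
        }
    }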

SLIDE 36

Advanced Statistical Profiling in Java

  • Interrupt with OS signals
    ○ Delivered to handler on only one thread
    ○ Lightweight
  • Sample resource of interest
    ○ Use AsyncGetCallTrace to sample the stack
    ○ Examine JVM internals for other resources

SLIDE 37

Advanced Statistical Profiling in Java

• Approach not used by existing profilers (VisualVM and desktop commercial alternatives)
• Can give very low overheads (<1%) for reasonable sampling rates

SLIDE 38

People are put off by practical issues as much as technical ones

SLIDE 39

Barriers to Ad-Hoc Production Profiling

• Generally requires access to production
• Process involves manual work, so it’s hard to automate
• The low-overhead profilers are open source, without commercial support

SLIDE 40

What if we profiled all the time?

SLIDE 41

Historical Data

• Allows for post-hoc incident analysis
• Enables correlation with other data/metrics
• Performance regression analysis

SLIDE 42

Putting Samples in Context

• Application version
• Environment parameters (machine type, CPU, location, etc.)
• With ad-hoc profiling we can’t capture this context

SLIDE 43

How to implement Continuous Profiling

SLIDE 44

Google-wide profiling

• Article: Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers
• Profiling data and binaries collected, processed and made available for browser-based reporting
• “The system has been actively profiling nearly all machines at Google for several years”
• https://ai.google/research/pubs/pub36575

SLIDE 45

Self-build

  • Open source Java profilers suitable for production
    ○ Async-profiler
    ○ Honest Profiler
    ○ Flight Recorder
  • Need to collect and store profiles in a database
  • Tools for retrieving and visualising stored profiling data
    ○ Browser-based
    ○ Command line
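
As a taste of the self-build route (a sketch: <pid> and the file names are placeholders, and options vary between tool versions, so check each tool's docs):

    # async-profiler: sample CPU for 30 seconds and render a flame graph
    ./profiler.sh -e cpu -d 30 -f flamegraph.html <pid>

    # Flight Recorder (JDK 11+; Java 8 also required -XX:+UnlockCommercialFeatures)
    java -XX:StartFlightRecording=duration=60s,filename=recording.jfr -jar app.jar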

SLIDE 46

Opsian - Continuous Profiling

JVM Agents → Opsian Aggregation Service → Web Reports

SLIDE 47

Summary

• It’s possible to profile in production with low overhead
• To overcome practical issues we can profile production all the time
• We gain new capabilities by profiling all the time

SLIDE 48

• Why Performance Tools Matter
• Development isn’t Production
• Profiling vs Monitoring
• Continuous Profiling
• Conclusion

SLIDE 49

• Performance Matters
• Development isn’t Production
• Metrics can be unactionable
• Instrumentation has high overhead
• Continuous Profiling provides insight

SLIDE 50

We need an attitude shift on profiling + monitoring

SLIDE 51

Continuous
Proactive, not Reactive
Systematic, not Ad Hoc

SLIDE 52

Please do Production Profiling. All the time.

SLIDE 53

Any Questions?

https://www.opsian.com/

SLIDE 54

Live Demo?

SLIDE 55

Links

• Collector - Flame Graph
• Collector - Hot Spots

SLIDE 56

The End

SLIDE 57

Existing tools are blind

• Traditional profilers don’t work in production
• Metrics don’t give code-level visibility
• Instrumentation must be done ahead of time

SLIDE 58

How do we help?

• Reduce the risk of change
• Help you scale with happy customers
• Cut the cost of infrastructure

SLIDE 59

Production Visibility

• Actionable reports for causes of latency and CPU usage
• From high-level reports to line-level granularity
• Very low overhead (<1%) and always-on

SLIDE 60

Reduce the risk of change

• On-demand performance comparison between releases
• Accelerate root-cause analysis for performance regressions

SLIDE 61

Improve Developer Productivity

[Chart. Source: ZT RebelLabs Developer Productivity Report 2017]

SLIDE 62

Understand, don’t Overwhelm

Too Little:
• You can’t understand production problems

Too Much:
• Needle in a Haystack
• You are the problem (overhead)

SLIDE 63

Normalisation of Deviance

• “Some of the tests always fail, so we just ignore them.”
• “Some of the alerts get triggered regularly, so we just ignore them.”
• Alert false positives have a cost

SLIDE 64

Open Source Java Profilers

High Overhead:
• VisualVM
• hprof
• Twitter’s CPUProfile
• Anything GetAllStackTraces-based

Low Overhead:
• Async Profiler
• Honest Profiler
• Java Mission Control

SLIDE 65

Unactionable Metrics

Many metrics provide pretty graphs but don’t progress treatment

SLIDE 66

Profiling Support in the Linux Kernel

• perf and eBPF
• perf-map-agent for the JVM
• Hardware events (L1/L2/L3 cache misses, branch mispredictions, etc.)
• Take heed: potential security issues
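
A typical perf-based session for a JVM looks roughly like this (a sketch under assumptions: <pid> and app.jar are placeholders, perf-map-agent must be built for your JDK, and kernel settings such as perf_event_paranoid may need relaxing, which is where the security concerns come in):

    # Keep frame pointers so perf can walk Java stacks
    java -XX:+PreserveFramePointer -jar app.jar &

    # Sample with call graphs at 99Hz for 30 seconds
    perf record -F 99 -g -p <pid> -- sleep 30

    # Write /tmp/perf-<pid>.map so perf can resolve JIT-compiled methods
    perf-map-agent/bin/create-java-perf-map.sh <pid>

    perf report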

SLIDE 67

Customer Experience

SLIDE 68

Responsive Applications make more Money

• Amazon: 100ms of latency costs 1% of sales
• Google: an extra 500ms of search page generation time drops traffic by 20%

SLIDE 69

Stop Costly Downtime

SLIDE 70

Reduce Costs

SLIDE 71

Performance Optimisation Cycle

Problem Reported → Understand Cause / Bottleneck → Implement a Fix → Deploy and Validate Fix → (repeat)

SLIDE 72

What’s Hard?

Problem Reported → Understand Cause / Bottleneck → Implement a Fix → Deploy and Validate Fix

SLIDE 73

How do you find performance bottlenecks?

SLIDE 74