Panappticon: Event-Based Tracing to Optimize Mobile Application and Platform Performance

SLIDE 1

Panappticon: Event-Based Tracing to Optimize Mobile Application and Platform Performance

Lide Zhang†, David R. Bild†, Robert P. Dick†, Z. Morley Mao†, and Peter Dinda‡

† Department of Electrical Engineering and Computer Science, University of Michigan
‡ Department of Electrical Engineering and Computer Science, Northwestern University

2 October 2013

Supported, in part, by the NSF under award CNS-1059372.

SLIDE 2

Introduction Algorithms and implementation Findings Motivation Definitions

Outline

  • 1. Introduction
  • 2. Algorithms and implementation
  • 3. Findings

2 Zhang, Bild, Dick, Mao, and Dinda Panappticon

SLIDE 3

Goal: make smartphones faster

SLIDE 4

Why not make everything faster?

Making everything faster could increase cost, shorten battery lifespan, or reduce satisfaction with the user interface.

SLIDE 5

Instead, make some things faster

What things?

Whenever smartphone users perceive that they are waiting for the machine, we have an opportunity to improve user-perceived performance.

How do we know when a smartphone user perceives that they are waiting?

SLIDE 6

User-perceived transaction definition

The best definition: A series of operations in the system started by user input and ended by the resulting output to the user.

A definition 1.5 graduate students can implement infrastructure for in a reasonable amount of time: A series of operations in the system started by a screen touch or button press and ended by the resulting display update.

SLIDE 7

How to monitor and analyze a user-perceived transaction?

Questions

  • When does it start and end?
  • What are the causal relationships among events within the transaction?
  • What takes time during the transaction?

Answering these questions is hard!

  • The operating system and many user-level processes cooperate.
  • Processes synchronize and communicate in many ways.
  • Simultaneously running applications influence latencies of transactions via resource contention.
  • There are multiple ways to update the display.

SLIDE 8

Panopticon

A prison that has been radially arranged to allow a few guards to watch any prisoner at any time.

SLIDE 9

Panappticon

Smartphone infrastructure that monitors the detailed operations of multiple operating system and application processes to support identification and analysis of user-perceived transactions.

SLIDE 10

Who is Panappticon for?

  • Application designers: Optimize application performance.
  • Operating system designers: Optimize system policies.
  • Hardware designers: Choose the hardware changes that most improve user-perceived transaction latencies.

SLIDE 11

Related work

  • [Barham’04]: Developer-provided event semantics used for trace analysis on servers.
  • [Jovic’11]: Developers identify UI input methods. Unsuitable for multithreaded, asynchronous systems.
  • [Ravindranath’12]: Instruments binaries to support tracing. Handles multiple application threads, but not other processes or the kernel.

Panappticon handles multiple threads and processes, including kernel threads.

SLIDE 12

Introduction Algorithms and implementation Findings Approach and architecture overview Graph construction procedure Tracing performance overhead

Outline

  • 1. Introduction
  • 2. Algorithms and implementation
  • 3. Findings

SLIDE 17

Algorithm overview

[Figure: timeline of a UI thread and a worker thread divided into execution intervals. Labeled events: user input, submit an asynchronous task, IO block, display update.]

  • 1. Identify each execution interval.
  • 2. Identify causal relationships between intervals.
  • 3. Give intervals semantic labels.
  • 4. Do resource accounting along the critical path.
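
A minimal Python sketch of the last step, resource accounting along the critical path, assuming the earlier steps have already produced the intervals on that path. The `Interval` class, its field names, the state names, and the example durations are hypothetical and are not taken from Panappticon.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Interval:
    """One execution interval on the critical path (hypothetical structure)."""
    thread: str
    label: str
    # Milliseconds spent in each thread state during this interval.
    ms_by_state: dict = field(default_factory=dict)


def account(critical_path):
    """Sum the time per state over all intervals on the critical path."""
    totals = Counter()
    for interval in critical_path:
        totals.update(interval.ms_by_state)
    return dict(totals)


# Made-up example: input handling, an asynchronous task that blocks on IO,
# and the final display update.
path = [
    Interval("UI thread", "handle input", {"running": 5}),
    Interval("worker thread", "async task", {"running": 20, "io_block": 120}),
    Interval("UI thread", "display update", {"running": 10, "waiting_for_cpu": 15}),
]
print(account(path))
```

Summing per state rather than per thread matches the kind of breakdown shown later in the talk (network block, IO block, waiting for CPU).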

SLIDE 18

Panappticon architecture

[Figure: Panappticon architecture. Device side: applications, each on a Dalvik VM with a user logger (user-space framework); kernel with a kernel logger (kernel level); event collector (user-space application). Server side: server collector and user transaction analyzer.]

SLIDE 19

Data captured

  • Input events: screen touch and key press.
  • Display update events.
  • Causality between execution intervals: asynchronous task, enqueue/dequeue messages, IPC, forking a child thread (and locking primitives).
  • Resource accounting events: context switches (and thread state), blocking on IO and network.
  • Additional information to understand context: application name, foreground applications.
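
One way to picture the captured data is as a stream of typed event records. The sketch below is an illustration only: the `Event` fields, the event kind strings, and the `Category` names are assumptions made for this example, not Panappticon's actual trace format.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    INPUT = "input"
    DISPLAY = "display"
    CAUSALITY = "causality"
    RESOURCE = "resource"


# Hypothetical event kinds, grouped by the categories listed on the slide.
EVENT_CATEGORY = {
    "screen_touch": Category.INPUT,
    "key_press": Category.INPUT,
    "display_update": Category.DISPLAY,
    "async_task_submit": Category.CAUSALITY,
    "async_task_consume": Category.CAUSALITY,
    "message_enqueue": Category.CAUSALITY,
    "message_dequeue": Category.CAUSALITY,
    "ipc": Category.CAUSALITY,
    "thread_fork": Category.CAUSALITY,
    "context_switch": Category.RESOURCE,
    "io_block": Category.RESOURCE,
    "network_block": Category.RESOURCE,
}


@dataclass(frozen=True)
class Event:
    """A single trace record (field names are assumptions)."""
    timestamp_us: int
    pid: int
    tid: int
    kind: str
    app: str = ""  # context: application name, if known

    @property
    def category(self):
        return EVENT_CATEGORY[self.kind]


touch = Event(timestamp_us=0, pid=100, tid=100, kind="screen_touch", app="example.app")
print(touch.category)
```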

SLIDE 20

Relationship graph construction

  • 1. User input, enqueues message 1 (callback function for user input).
  • 2. Dequeues message 1 and submits asynchronous task 1.
  • 3. Consumes asynchronous task 1, blocks on IO, resumes, enqueues message 2.
  • 4. Dequeues message 2, triggers UI invalidate, UI display update.
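
The sequence above can be sketched as a small graph construction in Python. This is a simplified illustration, assuming each interval consumes at most one token (a message or task) and that every consumed token has a recorded producer; the dictionary layout and interval names are hypothetical.

```python
def build_parents(intervals):
    """Map each interval to the interval that produced the token it consumes."""
    producer = {}
    for iv in intervals:
        for token in iv["produces"]:
            producer[token] = iv["name"]
    return {
        iv["name"]: producer[iv["consumes"]]
        for iv in intervals
        if iv["consumes"] is not None
    }


def trace_back(parents, end):
    """Follow causal edges from the display update back to the user input."""
    path = [end]
    while path[-1] in parents:
        path.append(parents[path[-1]])
    return list(reversed(path))


# The four intervals from the slide, with the tokens linking them.
intervals = [
    {"name": "input callback", "thread": "UI", "consumes": None,
     "produces": ["message 1"]},
    {"name": "handle message 1", "thread": "UI", "consumes": "message 1",
     "produces": ["async task 1"]},
    {"name": "run async task 1", "thread": "worker", "consumes": "async task 1",
     "produces": ["message 2"]},
    {"name": "handle message 2", "thread": "UI", "consumes": "message 2",
     "produces": []},
]

parents = build_parents(intervals)
print(trace_back(parents, "handle message 2"))
```

Walking backward from the display update keeps only the intervals that causally led to it, which is what makes resource accounting along the critical path possible.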

SLIDE 21

Performance evaluation of Panappticon

[Figure: bar chart of user transaction time (ms, 0 to 900) per benchmark (asynctask, worker, service, webpage, k9, xword, npr, browser, read). Legend: Panappticon, Android.]

Average performance overhead with Panappticon is 6.1%.
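
A short Python sketch of how an average overhead figure of this kind can be computed, assuming it is the mean of per-benchmark relative slowdowns. The slides do not state the exact averaging method, and the timings below are invented for illustration.

```python
def mean_overhead(baseline_ms, traced_ms):
    """Average of per-benchmark relative slowdowns, as a fraction."""
    ratios = [(t - b) / b for b, t in zip(baseline_ms, traced_ms)]
    return sum(ratios) / len(ratios)


# Hypothetical transaction times (ms) without and with tracing enabled.
android = [100, 200, 400]
panappticon = [110, 210, 420]
print(f"{mean_overhead(android, panappticon):.1%}")
```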

SLIDE 22

Introduction Algorithms and implementation Findings Experiment overview Identifying inefficient application/OS design Identifying DVFS policy related performance degradation Impact of additional core on user-perceived transaction latency

Outline

  • 1. Introduction
  • 2. Algorithms and implementation
  • 3. Findings

SLIDE 23

Experimental goals

  • Identify application performance bugs.
  • Understand the impact of system policies, e.g., DVFS.
  • Understand the impact of hardware design decisions, e.g., multi-core versus single core.

The system configuration is switched randomly among four configurations during deployment.

SLIDE 24

Study overview

Platform                Galaxy Nexus, Android 4.1.2
Users                   14
Analyzed transactions   88,656
Duration                One month

SLIDE 25

User-perceived transaction durations

[Figure: CDF of user transaction time (s), log-scale x-axis from 0.0001 to 100.]

Transactions last 38.6 seconds at most. 2% of transactions last longer than 1 second. Both cores available. DVFS enabled.
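
Statistics like these come from the empirical distribution of transaction durations. A minimal Python sketch, using invented durations rather than the study's data:

```python
def empirical_cdf(durations):
    """Return (duration, cumulative fraction) pairs in ascending order."""
    xs = sorted(durations)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]


def fraction_longer_than(durations, threshold):
    """Fraction of transactions whose duration exceeds the threshold."""
    return sum(d > threshold for d in durations) / len(durations)


# Hypothetical transaction durations in seconds.
durations = [0.01, 0.02, 0.5, 2.0]
print(empirical_cdf(durations))
print(fraction_longer_than(durations, 1.0))
```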

SLIDE 26

Application commonly waits for CPU

Reddit News: a popular news application in the Android market with millions of downloads.

Columns: Total latency (s) | Network block (s) | IO block (s) | Waiting for CPU (s)
Values: 3.78 0.98 1.39 2.35 0.42 0.02 0.93 1.54 0.23 0.89 1.27 0.15 0.33

SLIDE 28

Cause of application stalls

[Figure: timeline, 0.0 to 3.5 s. Labels: Reddit News Network, SDCard CPU, Reddit News CPU Waiting, Reddit News CPU.]

Observations

  • System thread responsible for writing to the SD card often preempts the critical path thread.
  • Network downloads are temporally correlated with the SD card thread activity.
  • Possible application-level solution: defer saving until after the user transaction.
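
The suggested application-level fix can be sketched as a small deferral queue. This is a hypothetical illustration in Python, not code from Reddit News or Panappticon: `DeferredSaver`, `request_save`, and `on_transaction_end` are invented names, and a real Android application would use its own platform mechanisms.

```python
class DeferredSaver:
    """Queue save requests and run them only after the user transaction ends."""

    def __init__(self, save):
        self._save = save      # callable that performs the actual write
        self._pending = []

    def request_save(self, item):
        # Called on the critical path: only records the request, no IO here.
        self._pending.append(item)

    def on_transaction_end(self):
        # Called after the display update: perform the deferred writes.
        pending, self._pending = self._pending, []
        for item in pending:
            self._save(item)
        return len(pending)


written = []
saver = DeferredSaver(written.append)
saver.request_save("article-1")
saver.request_save("article-2")
print(written)                     # nothing written during the transaction
print(saver.on_transaction_end())  # number of writes flushed afterwards
```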

SLIDE 29

Transaction latency as function of DVFS policy

[Figure: empirical cumulative density versus transaction latency (log scale, 0.1 ms to 10 s), with DVFS off and DVFS on.]

517 ms additional delay at the 98th percentile. The DVFS governor significantly hurts the performance of long user transactions.

SLIDE 30

Cause of DVFS policy related latency increase

Interactive governor behavior

  • Evaluation interval: 20 ms.
  • Frequency increases when (1) the utilization in the window is above 85% or (2) on user input.
  • Duration to stay at high frequency: 60 ms.

Why does this make long transactions slow?

  • For shorter transactions, the frequency is boosted based on user interaction.
  • The frequency is allowed to drop after 60 ms.
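
The rules above can be explored with a toy simulation. The Python sketch below is a deliberate simplification and not the Android governor's code: it models only two frequency levels, assumes a utilization-triggered boost takes effect at the start of the next window, and uses hypothetical function and parameter names.

```python
EVAL_INTERVAL_MS = 20   # evaluation interval from the slide
UTIL_THRESHOLD = 0.85   # utilization above this triggers a boost
HOLD_MS = 60            # time to stay at high frequency after a boost


def simulate(utilizations, input_windows=frozenset()):
    """Return 'high' or 'low' for each 20 ms window.

    utilizations: CPU utilization (0..1) observed in each window.
    input_windows: indices of windows that begin with a user input.
    """
    levels = []
    hold_until = 0  # time (ms) before which the frequency stays high
    for i, util in enumerate(utilizations):
        start = i * EVAL_INTERVAL_MS
        if i in input_windows:
            hold_until = max(hold_until, start + HOLD_MS)
        levels.append("high" if start < hold_until else "low")
        # Utilization is evaluated at the end of the window (assumption).
        if util > UTIL_THRESHOLD:
            hold_until = max(hold_until, start + EVAL_INTERVAL_MS + HOLD_MS)
    return levels


# A transaction that starts with a touch and then mostly blocks on IO,
# so utilization stays low in every window.
print(simulate([0.3] * 5, input_windows={0}))
```

In this model a transaction with low CPU utilization keeps the boosted frequency only for the first 60 ms after the input, which is consistent with the explanation on the slide.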

SLIDE 31

Dependence of latency on transaction duration

[Figure: transaction latency with DVFS off versus DVFS on, log axes from 1 ms to 1 s, with reference lines y = x and y = 1.75x and a marker at 60 ms.]

The DVFS policy doesn’t hurt performance for transactions < 60 ms. +75% latency for transactions > 60 ms.
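
The two reference lines amount to a piecewise model of latency under the DVFS policy. A minimal Python sketch, assuming the 60 ms threshold applies to the DVFS-off latency and that a transaction of exactly 60 ms is unaffected (the slide does not specify either point):

```python
THRESHOLD_MS = 60
SLOWDOWN = 1.75  # +75% latency for long transactions


def latency_with_dvfs(latency_without_dvfs_ms):
    """Approximate DVFS-on latency from DVFS-off latency (both in ms)."""
    if latency_without_dvfs_ms <= THRESHOLD_MS:
        return latency_without_dvfs_ms
    return SLOWDOWN * latency_without_dvfs_ms


for ms in (40, 100, 1000):
    print(ms, latency_with_dvfs(ms))
```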

SLIDE 32

Impact of transaction time on DVFS policy and transaction time

[Figure: timeline, 0.0 to 2.0 s, of resource activity (network, IO, CPU) and CPU frequency (350 MHz, 700 MHz, 920 MHz, 1.2 GHz).]

Root cause

  • Disk IO forces CPU frequency low.
  • Transaction latency is strongly dependent on CPU frequency despite low CPU utilization.

SLIDE 33

Methods for improving DVFS policy behavior

  • Extend duration to stay at high frequency (60 ms).
  • Have DVFS policy treat IO and network blocks as CPU activity.
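
To illustrate the second suggestion, the Python sketch below computes window utilization with and without counting blocked time as activity. It is a hypothetical model only: the function name, its parameters, and the way blocked time is measured are assumptions, and it does not reflect how any real governor is implemented.

```python
def utilization(busy_ms, blocked_ms, window_ms=20, count_blocked_as_busy=False):
    """Utilization of one evaluation window, as a fraction between 0 and 1.

    busy_ms: time the CPU spent running in the window.
    blocked_ms: time the critical thread spent blocked on IO or network.
    """
    active = busy_ms + (blocked_ms if count_blocked_as_busy else 0)
    return min(active, window_ms) / window_ms


# A window in which the transaction runs for 2 ms and waits on IO for 18 ms.
print(utilization(2, 18))                              # below the 85% threshold
print(utilization(2, 18, count_blocked_as_busy=True))  # above the 85% threshold
```

With blocked time counted as activity, the window in the example would cross the 85% threshold and keep the frequency high while the transaction waits.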

SLIDE 34

Comparison of single- and dual-core transaction latencies

[Figure: empirical cumulative density versus transaction latency (log scale, 0.1 ms to 10 s), for 1 core and 2 cores.]

Observation: Additional cores don’t influence latencies of long transactions.

Implication: These applications do not have parallelized CPU-bound workloads for long transactions.

SLIDE 35

Suggestions

Parallelize CPU-intensive smartphone applications. Improve single-core performance.

SLIDE 36

Panappticon summary

Panappticon traces the data needed to extract user-perceived transactions. We used it briefly to find and understand some interesting application/OS performance problems; you can do better.

SLIDE 37

Thanks and survey

Thank you for attending!

Try Panappticon: Guide application/OS/hardware improvements based on user-perceived transaction latencies.
http://ziyang.eecs.umich.edu/projects/panappticon

Informal on-site survey: Who among you plans to use the tool or ideas described in this talk?
