Large-scale performance monitoring framework for cloud monitoring - PowerPoint PPT Presentation

Large-scale performance monitoring framework for cloud monitoring Run-Time Latency Detection in Production Julien Desfossez Michel Dagenais Décembre 2014 École Polytechnique de Montreal

Latency-tracker ● Kernel module to track down latency problems at run-time ● Simple API that can be called from anywhere in the kernel (tracepoints, kprobes, netfilter hooks, hardcoded in other module or the kernel tree source code) ● Keep track of entry/exit events and calls a callback if the delay between the two events is higher than a threshold 2

Latency tracker previous state ● Prototype working and stable ● Needed more testing use-cases ● Performance measurements was in progress ● Hashtable needed scaling optimization 3

Usage tracker = latency_tracker_create(); latency_tracker_event_in(tracker, key, threshold, timeout, callback); .... latency_tracker_event_out(tracker, key); If the delay between the event_in and event_out for the same key is higher than “threshold”, the callback function is called. The timeout parameter allows to launch the callback if the event_out takes too long to arrive (off-CPU profiling). 4

New feature: check current state ● It is now possible to query the current state of a request without removing the key : event = latency_tracker_get_event(tracker, key); latency_tracker_put_event(event); ● Stateful tracing ● Refcount-based ownership 5

Implemented use-cases ● Block layer latency – Delay between block request issue and complete ● Wake-up latency – Delay between sched_wakeup and sched_switch ● Network latency ● IRQ latency ● System call latency – Delay between the entry and exit of a system call ● Offcpu latency – How long a process has been scheduled out 6

System call latency ● Developed in collaboration with François Doray on syscall_entry: latency_tracker_event_in(current_pid); on syscall_exit: latency_tracker_event_out(current_pid); on sched_switch: event = latency_tracker_get_event(next_pid) if event && ((now – event->start) > threshold): dump_stack(next_pid) 7

System call latency ● Developed in collaboration with François Doray on syscall_entry: latency_tracker_event_in(current_pid); on syscall_exit: latency_tracker_event_out(current_pid); on sched_switch: event = latency_tracker_get_event(next_pid); if event && ((now – event->start) > threshold): dump_stack(next_pid); 8

System call latency example syscall_latency_stack: comm=sync, pid=32224 81136.460929 81136.461482 81136.467357 81136.470176 schedule _cond_resched _cond_resched schedule schedule_timeout sync_inodes_sb mempool_alloc schedule_timeout wait_for_completion sync_inodes_one_sb __split_and_process_ wait_for_completion sync_inodes_sb iterate_supers bio submit_bio_wait sync_inodes_one_sb sys_sync dm_request blkdev_issue_flush iterate_supers tracesys generic_make_reques ext4_sync_fs sys_sync t sync_fs_one_sb tracesys submit_bio iterate_supers submit_bio_wait sys_sync blkdev_issue_flush tracesys ext4_sync_fs sync_fs_one_sb Dynamically change the threshold: # echo 1000000 > /sys/module/latency_tracker_syscalls/parameters/usec_threshold 9

Off-cpu profiling on sched_switch(prev, next): latency_tracker_event_in(prev, cb ) latency_tracker_event_out(next) cb(): dump_stack(pid) on sched_wakeup(pid): event = latency_tracker_get_event(pid) if event && ((now – event->start) > threshold): dump_stack(current) 10

Off-cpu profiling example offcpu_sched_wakeup: offcpu_sched_switch: waker_comm= swapper/3 (0), comm= qemu-system-x86 , wakee_comm= qemu-system-x86 (7726), pid=7726, wakee_offcpu_delay=10000018451, delay=10000140896, waker_stack= stack= ttwu_do_wakeup schedule ttwu_do_activate.constprop.74 futex_wait_queue_me try_to_wake_up futex_wait wake_up_process do_futex hrtimer_wakeup SyS_futex __run_hrtimer system_call_fastpath hrtimer_interrupt local_apic_timer_interrupt smp_apic_timer_interrupt apic_timer_interrupt 11

Performance improvements ● Controlled memory allocation ● Lock-less free-list ● Out-of-context reallocation of memory if needed/enabled ● Now using userspace-rcu hashtable for lock-less insert and lookup (ported to the kernel by Mathieu Desnoyers: KURCU ?) ● Custom call_rcu thread to avoid the variable side- effects of the built-in one 12

CPU scaling 13

Overhead on sysbench oltp (MySQL) Test Average Overhead Baseline 63.26s LTTng sched 63.65s 0.61% LTTng syscalls 64.95s 2.66% Latency_tracker 65.36s 3.31% Latencytop 66.24s 4.70% LTTng all 70.24s 11% 14

Future Work ● Keep internal state of the current latency profile (last minute, last 5 minutes, last hours) ● Extract aggregated information about latencies as histograms ● Compare evolutions of latencies and major changes ● Analyse large data set of high-latency events to help create and understand latency profiles 15

Install it apt-get install git gcc make linux-headers-generic git clone https://github.com/jdesfossez/late ncy_tracker.git cd latency_tracker make 16

Questions ? 17

Large-scale performance monitoring framework for cloud monitoring - PowerPoint PPT Presentation

Large-scale performance monitoring framework for cloud monitoring Run-Time Latency Detection in Production Julien Desfossez Michel Dagenais Dcembre 2014 cole Polytechnique de Montreal Latency-tracker Kernel module to track down

KAFKA STREAMS CLOUD MONITORING AWS CLOUD MONITORING AWS APP CLOUD MONITORING AWS HTTP APP

Large-scale performance monitoring framework for cloud monitoring Live Trace Reading and

A large-scale International IPv6 Network A large-scale International IPv6 Network www.6net.org

Reliability of Cloud-Scale Systems (CS 598) Fall 2018 Tianyin Xu 1 Reliability of Cloud-Scale

FINANCING LARGE SCALE SOLAR Large Scale Solar Conference - Sydney Gloria Chan Director, Large

Building a Private Cloud Cloud Infrastructure Using Opensource Building a Private Cloud OSCON

OVERVIEW 2110414 Large Scale Computing Systems Natawut Nupairoj, Ph.D. Outline 2 Overview

Cloud Computing & Cloud Models Cloud Models Topics Defining cloud computing

Evaluation of a Failure Prediction Model for Large Scale Cloud Applications Mohammad S. Jassas

Large-Scale Machine Learning at Twitter 2 Large-Scale Machine Learning at Twitter Jimmy Lin and

INFRASTRUCTURE 2110414 Large Scale Computing Systems Natawut Nupairoj, Ph.D. Outline 2

SNR SNR- -cloud interaction cloud interaction cloud interaction SNR SNR cloud interaction

Cloud Cloud Cloud Cloud network Edge Edge Edge Edge as a Edge Edge Edge Edge Edge

Cloud Ross Mallace Commercial Director Cloud/SaaS Cloud is here. ALL By 2020 most core

Embracing Cloud Ian Apperley Agenda A little about me What is Cloud and where did it come

Are We Really Cloud-Native? Bert Ertman Cloud-Native Computing What is Cloud-Native? answer:

Latency Outliers Root Cause Analysis in the Field by Combining Aggregation and Tracing Tools

FAILURE AT NETFLIX VELOCITY Cannot Connect to the Netflix Service 0 0 Ms % IMPACT LATENCY

LIMELIGHT NETWORKS INVESTOR PRESENTATION July 2019 1 SAFE HARBOR STATEMENT Certain statements

Vectorization of single particle Tube DistanceToIn A particle can hit a tube in three ways:

Interrupt Coalescing in Xen with Scheduler Awareness Michael Peirce & Kevin Boos Outline

TDT24 Presentation - GPU Optimization Principles Johannes Kvam Department of Engineering

Family Week 2019 Building Connections in the Spirit of Hope A Family Blessing Blessed are we as

2017 2017 Launch Presentation #PAMS2017 TH THE JOU E JOURNE NEY THE HE BR BRIEF IEF

Large-scale performance monitoring framework for cloud monitoring - PowerPoint PPT Presentation

Large-scale performance monitoring framework for cloud monitoring Run-Time Latency Detection in Production Julien Desfossez Michel Dagenais Dcembre 2014 cole Polytechnique de Montreal Latency-tracker Kernel module to track down

KAFKA STREAMS CLOUD MONITORING AWS CLOUD MONITORING AWS APP CLOUD MONITORING AWS HTTP APP

Large-scale performance monitoring framework for cloud monitoring Live Trace Reading and

A large-scale International IPv6 Network A large-scale International IPv6 Network www.6net.org

Reliability of Cloud-Scale Systems (CS 598) Fall 2018 Tianyin Xu 1 Reliability of Cloud-Scale

FINANCING LARGE SCALE SOLAR Large Scale Solar Conference - Sydney Gloria Chan Director, Large

Building a Private Cloud Cloud Infrastructure Using Opensource Building a Private Cloud OSCON

OVERVIEW 2110414 Large Scale Computing Systems Natawut Nupairoj, Ph.D. Outline 2 Overview

Cloud Computing &amp; Cloud Models Cloud Models Topics Defining cloud computing

Evaluation of a Failure Prediction Model for Large Scale Cloud Applications Mohammad S. Jassas

Large-Scale Machine Learning at Twitter 2 Large-Scale Machine Learning at Twitter Jimmy Lin and

INFRASTRUCTURE 2110414 Large Scale Computing Systems Natawut Nupairoj, Ph.D. Outline 2

SNR SNR- -cloud interaction cloud interaction cloud interaction SNR SNR cloud interaction

Cloud Cloud Cloud Cloud network Edge Edge Edge Edge as a Edge Edge Edge Edge Edge

Cloud Ross Mallace Commercial Director Cloud/SaaS Cloud is here. ALL By 2020 most core

Embracing Cloud Ian Apperley Agenda A little about me What is Cloud and where did it come

Are We Really Cloud-Native? Bert Ertman Cloud-Native Computing What is Cloud-Native? answer:

Latency Outliers Root Cause Analysis in the Field by Combining Aggregation and Tracing Tools

FAILURE AT NETFLIX VELOCITY Cannot Connect to the Netflix Service 0 0 Ms % IMPACT LATENCY

LIMELIGHT NETWORKS INVESTOR PRESENTATION July 2019 1 SAFE HARBOR STATEMENT Certain statements

Vectorization of single particle Tube DistanceToIn A particle can hit a tube in three ways:

Interrupt Coalescing in Xen with Scheduler Awareness Michael Peirce &amp; Kevin Boos Outline

TDT24 Presentation - GPU Optimization Principles Johannes Kvam Department of Engineering

Family Week 2019 Building Connections in the Spirit of Hope A Family Blessing Blessed are we as

2017 2017 Launch Presentation #PAMS2017 TH THE JOU E JOURNE NEY THE HE BR BRIEF IEF

Cloud Computing & Cloud Models Cloud Models Topics Defining cloud computing

Interrupt Coalescing in Xen with Scheduler Awareness Michael Peirce & Kevin Boos Outline