LinuxCon 2010 Tracing Mini-Summit: A new unified Lockless Ring Buffer library for efficient kernel tracing

SLIDE 1

August 11th, 2010 Mathieu Desnoyers 1

LinuxCon 2010 Tracing Mini-Summit

A new unified Lockless Ring Buffer library for efficient kernel tracing

Presentation at: http://www.efficios.com/linuxcon2010-tracingsummit

E-mail: mathieu.desnoyers@efficios.com

SLIDE 2

> Presenter

  • Mathieu Desnoyers
  • EfficiOS Inc.
  • http://www.efficios.com
  • Author/Maintainer of LTTng, LTTV, Userspace RCU
  • Ph.D. in computer engineering

– Low-Impact Operating System Tracing

SLIDE 3

> Plan

  • History
  • Mandate
  • Genericity and Flexibility
  • Speed and Compactness
  • Reliability
  • Working together

SLIDE 4

> History

  • May 2005: LTTng implements its ring buffer from scratch

– Learns lessons from K42, RelayFS and LTT.

  • October 2005: LTTng becomes lock-less

– LTTng gets increasingly used by the industry and shipped with many embedded and RT Linux distributions since then.

  • 2008: Ftrace (lock-less in 2009)
  • 2010: Perf

SLIDE 5

> Mandate

  • Wish from Linus expressed at the Kernel Summit 2008 to have a common tracer infrastructure in the kernel

  • Asked by Steven Rostedt to come up with a unified solution

SLIDE 6

> Generic Ring Buffer Library

  • Input

– Data received as parameter from ring buffer library clients

  • Output

– Data available through a global or per-CPU file descriptor with splice, mmap or read.

– Or data available internally to the ring buffer client for reading

SLIDE 7

> Generic Ring Buffer Library

  • Derived from the LTTng ring buffer

– Exists since 2005

  • Goals

– Generic and flexible
– Clean API
– Fast and compact
– Reliable

SLIDE 8

> Genericity and Flexibility

  • Target Perf, Ftrace, LTTng and drivers
  • Not only tracer-specific

– Ring buffer sits in /lib

  • Achieve genericity without hurting performance

– Ring buffer clients
– Instantiate client-specific configurations
– Express the configuration as a constant client structure passed as a parameter to inline functions
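
A minimal C sketch of how a constant client structure passed to inline functions can keep generic code cheap. All names here (rb_config, rb_record_offset, the two example clients) are hypothetical and chosen for illustration; they are not taken from the library. The idea is that once the helper is inlined into a client wrapper, the configuration is a compile-time constant, so an optimizing compiler can drop the branch that client never takes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names: an illustration, not the library's real API. */
enum rb_alignment { RB_NATURAL, RB_PACKED };

struct rb_config {
	enum rb_alignment alignment;
};

/*
 * Generic inline helper shared by every client: returns the offset at
 * which a record with the given alignment requirement would start.
 */
static inline size_t rb_record_offset(const struct rb_config *cfg,
				      size_t offset, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0); /* power of two */
	if (cfg->alignment == RB_PACKED)
		return offset;                          /* no padding */
	return (offset + align - 1) & ~(align - 1);     /* round up */
}

/* Each client expresses its configuration as a constant structure. */
static const struct rb_config packed_client_config = {
	.alignment = RB_PACKED,
};
static const struct rb_config natural_client_config = {
	.alignment = RB_NATURAL,
};

/*
 * Client wrappers: cfg is a known constant here, so after inlining the
 * compiler can keep only the branch that applies to this client.
 */
static inline size_t packed_client_offset(size_t offset, size_t align)
{
	return rb_record_offset(&packed_client_config, offset, align);
}

static inline size_t natural_client_offset(size_t offset, size_t align)
{
	return rb_record_offset(&natural_client_config, offset, align);
}
```

With these definitions, packed_client_offset(5, 8) leaves the offset at 5, while natural_client_offset(5, 8) rounds it up to 8.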

SLIDE 9

> API: pre-cooked (simple) APIs

  • Create/destroy a channel

– Global buffer
– Per-CPU buffers

  • In-kernel write()
  • Read a file descriptor

– Global iterator

  • The library performs a fusion merge of per-CPU buffer events, based on a heap and quiescent states

– Per-CPU iterator
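
The heap-based part of the global iterator can be pictured as a k-way merge. The sketch below is an assumption-laden illustration, not the library's code: the struct and function names are invented, each per-CPU stream is a plain array already sorted by timestamp, and the quiescent-state handling mentioned on the slide (needed when buffers are still being written) is left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_STREAMS 8

/* Hypothetical record and stream types, for illustration only. */
struct event {
	uint64_t timestamp;
	int cpu;
};

struct stream {			/* one per-CPU buffer, in timestamp order */
	const struct event *events;
	size_t len;
	size_t pos;		/* next unread event */
};

struct merge_heap {		/* min-heap keyed on each stream's next timestamp */
	struct stream *slot[MAX_STREAMS];
	size_t count;
};

static uint64_t head_ts(const struct stream *s)
{
	return s->events[s->pos].timestamp;
}

static void sift_down(struct merge_heap *h, size_t i)
{
	for (;;) {
		size_t l = 2 * i + 1, r = l + 1, min = i;

		if (l < h->count && head_ts(h->slot[l]) < head_ts(h->slot[min]))
			min = l;
		if (r < h->count && head_ts(h->slot[r]) < head_ts(h->slot[min]))
			min = r;
		if (min == i)
			return;
		struct stream *tmp = h->slot[i];
		h->slot[i] = h->slot[min];
		h->slot[min] = tmp;
		i = min;
	}
}

/* Put every non-empty stream in the heap, then restore heap order. */
static void merge_init(struct merge_heap *h, struct stream *streams, size_t n)
{
	assert(n <= MAX_STREAMS);
	h->count = 0;
	for (size_t i = 0; i < n; i++) {
		streams[i].pos = 0;
		if (streams[i].len > 0)
			h->slot[h->count++] = &streams[i];
	}
	for (size_t i = h->count / 2; i-- > 0; )
		sift_down(h, i);
}

/* Next event in global timestamp order, or NULL once all streams are drained. */
static const struct event *merge_next(struct merge_heap *h)
{
	if (h->count == 0)
		return NULL;
	struct stream *s = h->slot[0];
	const struct event *ev = &s->events[s->pos++];

	if (s->pos == s->len)		/* stream drained: remove it */
		h->slot[0] = h->slot[--h->count];
	sift_down(h, 0);
	return ev;
}
```

Each call to merge_next() costs O(log n) in the number of streams, which is why a heap is a natural fit when one reader consumes many per-CPU buffers.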

SLIDE 10

> API: pre-cooked APIs

  • Mode

– Overwrite
– Discard

  • Channels

– Global
– Per-CPU

  • Global iterators
  • Per-CPU iterators
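
To make the two modes concrete, here is a deliberately tiny, single-threaded C sketch. It is not lockless and is not the library's implementation; the type and function names are invented for illustration. When the buffer is full, discard mode drops the incoming record and keeps the old data, while overwrite mode (flight recorder behaviour) gives up the oldest record to make room.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SLOTS 4

/* Hypothetical names, for illustration only. */
enum tiny_mode { MODE_OVERWRITE, MODE_DISCARD };

struct tiny_ring {
	enum tiny_mode mode;
	int slot[SLOTS];
	size_t head;	/* total records written */
	size_t tail;	/* total records consumed or overwritten */
	size_t lost;	/* records dropped (discard) or overwritten (overwrite) */
};

static void tiny_init(struct tiny_ring *r, enum tiny_mode mode)
{
	r->mode = mode;
	r->head = r->tail = r->lost = 0;
}

/* Returns true if the value was stored, false if it was discarded. */
static bool tiny_write(struct tiny_ring *r, int value)
{
	if (r->head - r->tail == SLOTS) {	/* buffer full */
		r->lost++;
		if (r->mode == MODE_DISCARD)
			return false;		/* keep old data, drop new */
		r->tail++;			/* overwrite: drop the oldest */
	}
	r->slot[r->head++ % SLOTS] = value;
	return true;
}

/* Returns true and fills *value with the oldest record, false if empty. */
static bool tiny_read(struct tiny_ring *r, int *value)
{
	assert(value != NULL);
	if (r->tail == r->head)
		return false;
	*value = r->slot[r->tail++ % SLOTS];
	return true;
}
```

Writing the values 1 to 6 into a four-slot ring leaves 1, 2, 3, 4 readable in discard mode and 3, 4, 5, 6 in overwrite mode, with two records counted as lost in both cases.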

SLIDE 11

> Advanced API

  • Client configuration
  • Client-provided callbacks

SLIDE 12

> Configuration

  • Buffers per-CPU or global
  • Overwrite or discard mode
  • Natural or packed alignment
  • Output

– splice(), mmap(), read(), iterator, client-specific

  • Memory allocation backend

– page, vmap, static

  • OOPS consistency, IPI barrier, wakeup

SLIDE 13

> Client-provided callbacks

  • Clock read
  • Event and sub-buffer header size
  • Sub-buffer begin/end
  • Buffer create/finalize
  • Record get

– For iterators
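
Client-provided callbacks of this kind are commonly expressed in C as a structure of function pointers. The sketch below assumes that shape; the structure layout, the field names and the demo client are all hypothetical, and only three of the callbacks listed above are modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback table, for illustration only. */
struct client_ops {
	uint64_t (*clock_read)(void);
	size_t (*event_header_size)(size_t payload_len);
	size_t (*subbuf_header_size)(void);
};

/* Demo client: a counter stands in for a real clock source. */
static uint64_t demo_clock;

static uint64_t demo_clock_read(void)
{
	return ++demo_clock;
}

static size_t demo_event_header_size(size_t payload_len)
{
	(void)payload_len;		/* fixed-size header in this demo */
	return sizeof(uint64_t);	/* room for one timestamp */
}

static size_t demo_subbuf_header_size(void)
{
	return 2 * sizeof(uint64_t);	/* begin and end timestamps */
}

static const struct client_ops demo_ops = {
	.clock_read = demo_clock_read,
	.event_header_size = demo_event_header_size,
	.subbuf_header_size = demo_subbuf_header_size,
};

/* Library side: bytes to reserve for one record (header plus payload). */
static size_t record_size(const struct client_ops *ops, size_t payload_len)
{
	assert(ops != NULL && ops->event_header_size != NULL);
	return ops->event_header_size(payload_len) + payload_len;
}

/* Library side: bytes left for records once the sub-buffer header is placed. */
static size_t subbuf_capacity(const struct client_ops *ops, size_t subbuf_size)
{
	size_t hdr = ops->subbuf_header_size();

	return subbuf_size > hdr ? subbuf_size - hdr : 0;
}
```

The library code never needs to know what a client stores in its headers or which clock it reads; it only asks the client for sizes and timestamps through the table.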

SLIDE 14

> Speed and Compactness

  • Fast paths

– Constant configuration structure
– Compiler removes unused code

  • Slow paths

– Configuration dynamically tested
– Same code shared amongst all clients

SLIDE 15

> Performance

  • Throughput
  • Scalability

SLIDE 16

> Throughput (overwrite mode)

  • Generic Ring Buffer Library

– 83-199 ns/entry (depending on configuration)

  • Ftrace

– 103-187 ns/entry

  • Perf

– Mode unavailable

SLIDE 17

> Throughput (discard mode)

  • Generic Ring Buffer Library

– 257 ns/entry written

  • Perf

– 423 ns/entry written

  • (approximation from Perf output)
  • Getting accurate results is hard: they are influenced by discarded events

SLIDE 18

> Scalability

SLIDE 19

> Reliability

  • LTTng

– Formal verification of the ring buffer algorithm at the architecture level (modeling execution on superscalar processors)

– Testing on large user-base

SLIDE 20

> Working together

  • Ever had the feeling you were trying to fit something square-shaped into a circle?

SLIDE 21

> Working together

  • Need to polish off the rough spots

SLIDE 22

> Working together

  • Trying to come up with a clean and flexible API
  • Nevertheless, it does not always map onto the current Ftrace and Perf APIs

  • Trying very hard not to bloat the API

SLIDE 23

> Working with Ftrace

  • Steven has been very helpful
  • I'm about 80% done with the Ftrace transition to the generic ring buffer library

SLIDE 24

> Ftrace odd-fitting pieces

  • Ftrace iteration code

– Huge set of API functions for iterating on stopped trace buffers without consuming data.

– Used for:

  • Dumping the same output with "cat" many times
  • Peeking at the next item to place brackets in function graph tracer output

– Could be replaced by a "rewind" ability and by modifying the function graph tracer plugin

SLIDE 25

> Perf

  • mmap()-based ABI between kernel and user-space for consuming data.

  • No kernel callback invoked when the consumer finishes reading data.

– Severely limits design choices

  • Does not support (and developers don't consider it a valid use-case) reading data while writing into a buffer in flight recorder mode.

SLIDE 26

> Perf

  • Does not use padding between sub-buffers

– No concept of sub-buffers
– All events are physically contiguous

  • Cannot create efficient chunks of data for splice() without copy

  • Cannot efficiently index trace without reading all events (increases delay before a large trace can be analyzed)

  • Basic data encapsulation principles
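
The indexing argument can be illustrated with a short C sketch. It assumes, hypothetically, that every fixed-size sub-buffer starts with a small header recording the time range it covers; the header layout and the function name are invented here and do not describe any existing trace format. With such headers a reader can locate the sub-buffer for a given timestamp by inspecting headers only, without decoding individual events.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical header stored at the start of each sub-buffer. */
struct subbuf_header {
	uint64_t ts_begin;	/* timestamp of the first event */
	uint64_t ts_end;	/* timestamp of the last event */
	uint32_t data_size;	/* bytes used; the rest is padding */
};

/*
 * Binary search over headers sorted by time: returns the index of the
 * first sub-buffer whose ts_end is not earlier than ts, or n if every
 * sub-buffer ends before ts. No event payload is read.
 */
static size_t subbuf_seek(const struct subbuf_header *hdr, size_t n,
			  uint64_t ts)
{
	size_t lo = 0, hi = n;

	assert(hdr != NULL || n == 0);
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (hdr[mid].ts_end < ts)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}
```

For n sub-buffers the seek touches about log2(n) headers, whereas a stream of physically contiguous events with no sub-buffer boundaries has to be scanned from the start.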

SLIDE 27

> Perf

  • Why do they hate sub-buffers so much?

– Claim of simplicity

  • False. The fast path ends up being both larger and slower than the generic ring buffer.

  • Why is this important?

– Shows how low-level Perf design choices prevent contributors from fulfilling basic end-user use-cases.

– Shows Perf developers' unwillingness to support use-cases other than kernel developers' own needs.

SLIDE 28

> Funding

  • Thanks to Ericsson for funding parts of this work.

SLIDE 29

> Questions?

– http://www.efficios.com

  • LTTng Information

– http://lttng.org
– ltt-dev@lists.casi.polymtl.ca

SLIDE 30

> API (per-CPU discard)

extern struct channel *ring_buffer_percpu_discard_create(size_t buf_size);
extern void ring_buffer_percpu_discard_destroy(struct channel *chan);
extern int ring_buffer_percpu_discard_write(struct channel *chan, const void *src, size_t len);

And map file operation "channel_payload_file_operations" from iterator.h to file descriptor.
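
A hedged usage sketch of these three calls. The real functions live in the kernel, so the definitions below are user-space stand-ins written only to make the calling sequence compilable: they keep the prototypes shown on the slide, but the single flat buffer, the contents of struct channel, and the return convention (0 on success, a negative errno value when a record is discarded) are assumptions, not documented behaviour.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in channel: one flat buffer instead of real per-CPU buffers. */
struct channel {
	size_t buf_size;
	size_t used;
	char *buf;
};

struct channel *ring_buffer_percpu_discard_create(size_t buf_size)
{
	struct channel *chan = malloc(sizeof(*chan));

	if (!chan)
		return NULL;
	chan->buf = malloc(buf_size);
	if (!chan->buf) {
		free(chan);
		return NULL;
	}
	chan->buf_size = buf_size;
	chan->used = 0;
	return chan;
}

void ring_buffer_percpu_discard_destroy(struct channel *chan)
{
	if (!chan)
		return;
	free(chan->buf);
	free(chan);
}

/* Assumed convention: 0 if stored, -ENOSPC if the record is discarded. */
int ring_buffer_percpu_discard_write(struct channel *chan, const void *src,
				     size_t len)
{
	assert(chan != NULL && src != NULL);
	if (len > chan->buf_size - chan->used)
		return -ENOSPC;		/* discard mode: drop the new record */
	memcpy(chan->buf + chan->used, src, len);
	chan->used += len;
	return 0;
}

/* Hypothetical client helper: log one NUL-terminated message. */
static int client_log(struct channel *chan, const char *msg)
{
	return ring_buffer_percpu_discard_write(chan, msg, strlen(msg) + 1);
}
```

A client would create the channel once, call the write function from its instrumentation sites, and destroy the channel on teardown; the stand-in has no consumer, so it simply starts discarding once it fills up.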