Mantis in Action, Neeraj Joshi and Justin Becker, 6/12/2015



slide-1
SLIDE 1

Mantis in Action

Neeraj Joshi and Justin Becker 6/12/2015

slide-2
SLIDE 2

Managing a complex operational environment is hard

slide-3
SLIDE 3
slide-4
SLIDE 4
slide-5
SLIDE 5
slide-6
SLIDE 6

Developing an understanding of what is going on

Knowing what works

slide-7
SLIDE 7

Developing an understanding of what is going on

Identifying what doesn’t work

slide-8
SLIDE 8

Developing an understanding of what is going on

Determining the impact when something doesn’t work

slide-9
SLIDE 9

Developing a deep understanding is hard, due to complexity

  • Hundreds of software services
  • Processing billions of requests
  • For millions of users
  • Operating in multiple data centers
  • Across the globe
slide-10
SLIDE 10

To help us (operators) make sense out of complexity, we need tools

Specifically, we need…

Insight tools

…to help comprehend what is going on in our operational environments
slide-11
SLIDE 11

Make the case that a relationship exists between complexity and comprehension

slide-12
SLIDE 12

So, to manage complex environments, we need to rethink insights and shift the curve

slide-13
SLIDE 13

Identified three insight ‘patterns’ to help shift the curve

  • Long tail analysis
  • Real time tracking and trending
  • Ad hoc investigation
slide-14
SLIDE 14

Grouped three patterns into a new effort

slide-15
SLIDE 15

Scalable Insights Initiative

slide-16
SLIDE 16

Goal is to help us manage (comprehend) our environments given an increase in complexity

slide-17
SLIDE 17

Specific insights tools that help shift the curve

  • Realtime Data Explorer
  • Realtime Search
  • Realtime Application Monitoring
slide-18
SLIDE 18

Other generic insights jobs that help shift the curve

  • Short term historical anomaly detection
  • Threshold-based anomaly detection
  • Realtime metrics generation
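
As an illustration of the threshold-based job listed above, here is a minimal sketch in Python. It assumes a stream is modeled as a plain iterable of numeric samples; the function name, the `upper` parameter, and the sample data are hypothetical and are not taken from Mantis.

```python
from typing import Iterable, Iterator, Tuple


def detect_threshold_anomalies(
    values: Iterable[float], upper: float
) -> Iterator[Tuple[int, float]]:
    """Yield (position, value) for every sample above the upper threshold."""
    for index, value in enumerate(values):
        if value > upper:
            yield index, value


# Hypothetical error-rate samples; anything above 10% is flagged.
error_rates = [0.01, 0.02, 0.15, 0.01, 0.30]
anomalies = list(detect_threshold_anomalies(error_rates, upper=0.10))
print(anomalies)  # [(2, 0.15), (4, 0.3)]
```

Because the detector is a generator, it flags each sample as it arrives instead of waiting for the full data set.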
slide-19
SLIDE 19

Overview of scalable insights in action

  • 65-75 total jobs running in 3 regions
  • Global access to data
  • Processing 4.7 million events per second at peak
  • 20 specific data sources and generic adapters
  • Ability to start up jobs in ~5 seconds
slide-20
SLIDE 20

Demo

slide-21
SLIDE 21
slide-22
SLIDE 22
slide-23
SLIDE 23
slide-24
SLIDE 24
slide-25
SLIDE 25

Mantis, a reactive stream processing system

slide-26
SLIDE 26

Some Basic Concepts

slide-27
SLIDE 27

Stream

A sequence of events over time: 1 2 3 4

slide-28
SLIDE 28

Higher Order Functions

Transformations applied to a Stream to create a new stream

Input stream (over time): 1 4 9 16

map (x => sqrt(x))

Output stream (over time): 1 2 3 4
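
The slide's `map` example can be sketched in a few lines of Python, assuming a stream is modeled as a plain iterator. This is an illustration of the concept only, not Mantis's API; `stream_map` is a hypothetical helper.

```python
import math
from typing import Callable, Iterable, Iterator


def stream_map(
    fn: Callable[[float], float], stream: Iterable[float]
) -> Iterator[float]:
    """Apply fn to each event as it arrives, producing a new stream."""
    for event in stream:
        yield fn(event)


# Input stream from the slide, in arrival order.
squares = iter([1, 4, 9, 16])
roots = list(stream_map(math.sqrt, squares))
print(roots)  # [1.0, 2.0, 3.0, 4.0]
```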

slide-29
SLIDE 29

Mantis Job

A sequence of functions applied to a stream

map, window, reduce, merge

Stream IN → SOURCE → Stage 1 → Stage 2 → … → Stage n → SINK → Result OUT
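
The source, stage, and sink structure can be sketched as follows, assuming streams are Python iterables and each stage is a function from one stream to another. All names here (`run_job`, `map_stage`, `window`) are hypothetical and only illustrate the idea of a job as a sequence of functions; they do not reflect how Mantis defines jobs.

```python
from typing import Callable, Iterable, Iterator, List

Stage = Callable[[Iterable], Iterator]


def map_stage(fn: Callable) -> Stage:
    """Build a stage that applies fn to every event."""
    def stage(stream: Iterable) -> Iterator:
        return (fn(event) for event in stream)
    return stage


def window(size: int) -> Stage:
    """Build a stage that groups consecutive events into lists of `size`."""
    def stage(stream: Iterable) -> Iterator[list]:
        batch = []
        for event in stream:
            batch.append(event)
            if len(batch) == size:
                yield batch
                batch = []
        if batch:  # emit the final, possibly shorter, window
            yield batch
    return stage


def run_job(source: Iterable, stages: List[Stage], sink: Callable) -> None:
    """Thread the source through each stage in order, then feed the sink."""
    stream = source
    for stage in stages:
        stream = stage(stream)
    for result in stream:
        sink(result)


results = []
run_job(
    source=range(1, 7),                  # Stream IN: 1..6
    stages=[
        map_stage(lambda x: x * x),      # Stage 1: square each event
        window(3),                       # Stage 2: windows of 3 events
        map_stage(sum),                  # Stage 3: reduce each window to a sum
    ],
    sink=results.append,                 # Result OUT
)
print(results)  # [14, 77]
```

Each stage only sees the stream produced by the previous one, so stages can be added, removed, or reordered without touching the source or the sink.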

slide-30
SLIDE 30

Mantis Job

Stream IN → SOURCE → Stage 1 → Stage 2 → … → Stage n → SINK → Result OUT

slide-31
SLIDE 31

Named Jobs

  • Job Name
  • Job Version

slide-32
SLIDE 32

Are Parameterized

Parameter Parameter Parameter

slide-33
SLIDE 33

Have SLAs

  • Min/Max instances of the job
  • Min/Max runtime
  • Perpetual or Transient

slide-34
SLIDE 34

Can be chained

Diagram labels: Device Logs Source Job, TopN Job, Server Logs Source Job, Anomaly Detector Job, Metrics Aggregator Job, Alert Service

slide-35
SLIDE 35

That is fine but...

slide-36
SLIDE 36

How does Mantis meet the Scalable Insights challenge?

slide-37
SLIDE 37

Key Requirements

  • Cost (utilization) sensitive
  • Optimized for low latency
  • High throughput
  • Resilient

slide-38
SLIDE 38

Minimizing Costs

slide-39
SLIDE 39

Elastic Clusters

slide-40
SLIDE 40

Elastic Jobs

slide-41
SLIDE 41

Job Autoscaling

Scaling Config

  • CPU Strategy
  • Network Strategy
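
A scaling config with a CPU strategy might look roughly like the sketch below. The slide does not show the actual configuration, so every field name, threshold, and the one-instance-at-a-time step are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class CpuStrategy:
    # Hypothetical thresholds, expressed as CPU utilization between 0 and 1.
    scale_up_above: float
    scale_down_below: float


@dataclass
class ScalingConfig:
    # Hypothetical bounds on the number of job instances.
    min_instances: int
    max_instances: int
    cpu: CpuStrategy


def desired_instances(
    current: int, cpu_utilization: float, config: ScalingConfig
) -> int:
    """Step the instance count by one, clamped to the configured bounds."""
    target = current
    if cpu_utilization > config.cpu.scale_up_above:
        target = current + 1
    elif cpu_utilization < config.cpu.scale_down_below:
        target = current - 1
    return max(config.min_instances, min(config.max_instances, target))


config = ScalingConfig(min_instances=2, max_instances=5,
                       cpu=CpuStrategy(scale_up_above=0.75,
                                       scale_down_below=0.25))
print(desired_instances(3, 0.90, config))  # 4: busy, so add an instance
print(desired_instances(5, 0.90, config))  # 5: already at the maximum
print(desired_instances(2, 0.10, config))  # 2: already at the minimum
```

A network strategy could follow the same shape, with bandwidth in place of CPU utilization.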

slide-42
SLIDE 42

Filtering at Source

Source Job

Consumer Job

Data Producers
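
The idea of filtering at the source can be sketched as a consumer handing its predicate to the source job, so that only matching events are forwarded. The event shape and function names below are hypothetical; the slide does not show how Mantis expresses such filters.

```python
from typing import Callable, Dict, Iterable, Iterator

Event = Dict[str, object]


def source_job(
    events: Iterable[Event], predicate: Callable[[Event], bool]
) -> Iterator[Event]:
    """Forward only the events that match the consumer's predicate."""
    return (event for event in events if predicate(event))


# Hypothetical events emitted by data producers.
produced = [
    {"device": "tv", "status": 200},
    {"device": "phone", "status": 500},
    {"device": "tv", "status": 503},
]

# The consumer job only cares about server errors.
errors = list(source_job(produced, lambda e: e["status"] >= 500))
print(len(produced), "produced,", len(errors), "forwarded")
```

Events that no consumer asked for never leave the source, which is where the cost saving comes from.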

slide-43
SLIDE 43

Low latency - High Throughput

slide-44
SLIDE 44

To block or not to block?

slide-45
SLIDE 45

RxNetty (non-blocking) vs Tomcat (blocking)

by Brendan Gregg @brendangregg

https://docs.google.com/presentation/d/18i-d72m7tD4wKlzm-1PCR8g62l66_9Btbg5-fuFRqf0/edit#slide=id.g761289dab_0_77

slide-46
SLIDE 46

CPU consumption

  • RxNetty consumes less CPU / request
  • Reduced thread migration
  • Lower object allocation rate

CPU consumed decreases as load increases

slide-47
SLIDE 47

Lower latency

  • RxNetty has lower latency under high load
  • Fewer lock contentions
  • Fewer thread migrations

Latency knee for Tomcat ~400
Latency knee for Netty ~700

slide-48
SLIDE 48

Async Processing

  • Non-blocking I/O
  • Async processing

slide-49
SLIDE 49

Designed for Resilience

http://signsofpolitics.blogspot.com/2009/03/around-and-about-resilience.html

slide-50
SLIDE 50

Server Resilience

  • Server crashes are inevitable
  • Server health is constantly monitored with heartbeats
  • Crashed servers are replaced
  • Lost jobs are relaunched
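
Heartbeat-based health monitoring can be sketched as a check over the last time each server was heard from. The server names, timestamps, and timeout below are made up for illustration, and the slide does not describe Mantis's actual detection logic.

```python
from typing import Dict, List


def find_crashed(
    last_heartbeat: Dict[str, float], now: float, timeout: float
) -> List[str]:
    """Return servers whose last heartbeat is older than the timeout."""
    return sorted(
        server
        for server, seen_at in last_heartbeat.items()
        if now - seen_at > timeout
    )


# Hypothetical heartbeat timestamps, in seconds.
last_heartbeat = {"worker-1": 100.0, "worker-2": 91.0, "worker-3": 99.5}
crashed = find_crashed(last_heartbeat, now=101.0, timeout=5.0)
print(crashed)  # ['worker-2']
```

A supervisor would then replace each server in `crashed` and relaunch the jobs that were running on it.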
slide-51
SLIDE 51

Network Resilience

  • Long-lived connections can fail
  • Connection topology is constantly monitored and corrected

slide-52
SLIDE 52

Backpressure

slide-53
SLIDE 53

Cold Source

Amazon SQS

slide-54
SLIDE 54

Hot Source

slide-55
SLIDE 55

Reactive push-pull (Cold Source)

f1 f2 f3 f4

Cold Source

1024 1024 1024 1024 100 100 100 100

slide-56
SLIDE 56

Push mode

f1 f2 f3 f4

Cold Source

Max Max Max Max 100 100 100 100

slide-57
SLIDE 57

Pull mode

f1 f2 f3 f4

Cold Source

1 1 1 1 1 1 1 1
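
One way to read slides 55 to 57 is that the numbers under each function are request sizes: the consumer asks a cold source for up to 1024 events at a time (push-pull), an unbounded number (push), or one at a time (pull). That reading is an interpretation of the diagrams, and the sketch below is a hypothetical illustration of it, not Mantis or RxJava code.

```python
from itertools import islice
from typing import Iterable, List, Optional, Tuple


def consume(
    source: Iterable[int], request_size: Optional[int]
) -> Tuple[List[int], int]:
    """Drain a cold source, asking for request_size events per request.

    request_size=None models push mode (an unbounded request).
    Returns the events received and the number of requests issued.
    """
    iterator = iter(source)
    received: List[int] = []
    requests = 0
    while True:
        requests += 1
        if request_size is None:
            batch = list(iterator)
        else:
            batch = list(islice(iterator, request_size))
        received.extend(batch)
        # A short (or unbounded) batch means the source is exhausted.
        if request_size is None or len(batch) < request_size:
            return received, requests


cold_source = range(100)                 # a cold source holding 100 events
_, push_pull_requests = consume(cold_source, 1024)
_, push_requests = consume(cold_source, None)
_, pull_requests = consume(cold_source, 1)
print(push_pull_requests, push_requests, pull_requests)  # 1 1 101
```

In pull mode the consumer issues one request per event, plus a final request that discovers the source is empty. Larger request sizes trade that fine-grained control for fewer round trips.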

slide-58
SLIDE 58

Backpressure Strategies (Hot Source)

f2 f3 f4

Hot Source

10 10 10 100 10 10 10

Strategy Function

90

  • Drop
  • Buffer
  • Scale-up
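
The three strategies can be sketched using the slide's numbers: a hot source emits 100 events, the consumer can take 10, and the remaining 90 go to a strategy function. The function names and signatures are hypothetical, chosen only to make the trade-offs concrete.

```python
import math
from collections import deque
from typing import Deque, List, Tuple


def split_by_capacity(
    events: List[int], capacity: int
) -> Tuple[List[int], List[int]]:
    """Return (events the consumer can take now, excess events)."""
    return events[:capacity], events[capacity:]


def drop(excess: List[int]) -> int:
    """Drop: discard the excess and report how many events were lost."""
    return len(excess)


def buffer(excess: List[int], queue: Deque[int]) -> None:
    """Buffer: hold the excess for later delivery."""
    queue.extend(excess)


def scale_up(incoming: int, capacity: int) -> int:
    """Scale-up: consumer instances needed to absorb the incoming rate."""
    return math.ceil(incoming / capacity)


events = list(range(100))                       # hot source emits 100 events
accepted, excess = split_by_capacity(events, capacity=10)

pending: Deque[int] = deque()
buffer(excess, pending)

print(len(accepted), len(excess))               # 10 90
print(drop(excess), len(pending), scale_up(100, 10))  # 90 90 10
```

Dropping loses data, buffering costs memory and adds latency, and scaling up costs more instances, so the right choice depends on what the job can tolerate.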

slide-59
SLIDE 59

Questions