Is this normal? Finding anomalies in real-time data.
SLIDE 1
SLIDE 2
Who am I?
I’m Theo (@postwait on Twitter).
I write a lot of code: 50+ open source projects and several commercial code bases.
I wrote “Scalable Internet Architectures.”
I sit on the ACM Queue and Professions boards.
I spend all day looking at telemetry data at Circonus.
SLIDE 3
What is real-time?
Hard real-time systems are those where the outputs of a system based on specific inputs are considered incorrect if the latency of their delivery is above a specified amount.
Soft real-time systems are similar, but “less useful” instead of “incorrect.”
I don’t design life support systems, avionics, or other systems where lives are at stake, so it’s a soft real-time life for me.
SLIDE 4
A survey of big data systems.
Traditional: Oracle, Postgres, MySQL, Teradata, Vertica, Netezza, Greenplum, Tableau, K
The shiny: Hadoop, Hive, HBase, Pig, Cassandra
The real-time: SQLstream, S4, Flumebase, Truviso, Esper, Storm
SLIDE 5
Big data the old way
Relational databases, both column store and not. Just work. Likely store more data than your “big data.”
SLIDE 6
Big data the distributed way
Distributed systems allow much larger data sets, but markedly change the data analytics methods.
Hard for existing quants to roll up their sleeves.
Highly scalable and accommodate growth.
SLIDE 7
Big data the real-time way
What we do needs a different approach.
The old (and even the distributed) systems do not design for soft real-time complex observation of data.
Notable exceptions are S4 and Storm.
SLIDE 8
So, what’s your problem?
We have telemetry...
over 10 trillion data points on near-line storage
growing super-linearly
SLIDE 9
Data, what kind?
Most data is numeric: counts, averages, derivatives, stddevs, etc.
Some data is: text changes (ssh fingerprints, production launches), histograms, highly dimensional event streams.
SLIDE 10
Data rates.
Quantity of data isn’t such a big deal (okay, yes it is, but we’ll get to that later).
The rate of new data arrival makes the problem hard:
low end: 15k data points / second
high end: 300k data points / second
growing rapidly
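To put those rates in perspective, here is a rough back-of-the-envelope sketch of the average time budget per data point if a single consumer had to keep up with the whole stream. The function name is made up for illustration; only the two rates come from the slide.

```python
def per_datum_budget_us(rate_per_second: float) -> float:
    """Average time available per data point, in microseconds,
    assuming one consumer has to keep up with the whole stream."""
    if rate_per_second <= 0:
        raise ValueError("rate must be positive")
    return 1_000_000 / rate_per_second


# Rates quoted on the slide: 15k and 300k data points per second.
for label, rate in (("low end", 15_000), ("high end", 300_000)):
    print(f"{label}: {per_datum_budget_us(rate):.1f} µs per data point")
```

At the high end that is only a few microseconds per data point on average, which is why anything slow has to stay off the hot path.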
SLIDE 11
What we use.
We use Esper.
Esper is very powerful, elegantly coded and performance focused.
Like any good tool that allows users to write queries...
http://www.flickr.com/photos/mcertou/
SLIDE 12
What we do with Esper
Detect absence in streams:
select b from pattern [every a=Event -> (timer:interval(30 sec) and not b=Event(id=a.id, metric=a.metric))]
Detect ad-hoc threshold violation:
select * from Event(id="host1", metric="disk1") where value > 95
- etc. etc. etc. [1]
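For readers who don't speak EPL, the two checks above can be approximated in plain Python. This is only a sketch of the idea, not how Esper evaluates patterns: it polls for silent streams instead of firing a timer per event, and the `Event` fields, class name, and function name are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Event:
    id: str
    metric: str
    value: float
    ts: float  # arrival time in seconds


class AbsenceDetector:
    """Tracks the last arrival per (id, metric) and reports silent streams."""

    def __init__(self, timeout: float = 30.0) -> None:
        self.timeout = timeout
        self.last_seen: Dict[Tuple[str, str], float] = {}

    def observe(self, ev: Event) -> None:
        self.last_seen[(ev.id, ev.metric)] = ev.ts

    def absent(self, now: float) -> List[Tuple[str, str]]:
        # A stream is absent if nothing arrived within the timeout.
        return sorted(
            key for key, ts in self.last_seen.items() if now - ts > self.timeout
        )


def violates_threshold(ev: Event, id: str, metric: str, limit: float) -> bool:
    """Ad-hoc threshold check for one specific stream."""
    return ev.id == id and ev.metric == metric and ev.value > limit


detector = AbsenceDetector(timeout=30.0)
detector.observe(Event("host1", "disk1", 97.0, ts=0.0))
detector.observe(Event("host2", "cpu", 12.0, ts=20.0))
print(detector.absent(now=31.0))
print(violates_threshold(Event("host1", "disk1", 97.0, ts=0.0), "host1", "disk1", 95))
```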
SLIDE 13
Making the problem harder.
So, it just wasn’t enough.
We want to do long term trending and apply that information to anomaly detection.
Think: Holt-Winters (or multivariate regressions).
Look at historic data.
Use that to predict the immediate future with some quantifiable confidence.
SLIDE 14
How we do it.
We implemented the Snowth for storage of data. [2]
We implemented a C/lua distributed system to analyze 4 weeks of data (~8k statistical aggregates) yielding a prediction with confidences (triple exponential smoothing). [3]
To keep the system real-time, we need to ensure that queries return in less than 2ms (our goal is 100µs).
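Triple exponential smoothing is easier to follow in code than in prose. What follows is a minimal additive Holt-Winters sketch in Python, not the C/lua system described above. The initialisation, the default smoothing parameters, and the use of a smoothed absolute error as the "confidence" are all assumptions chosen for illustration.

```python
from typing import Sequence, Tuple


def holt_winters_additive(
    series: Sequence[float],
    period: int,
    alpha: float = 0.5,  # level smoothing (assumed value)
    beta: float = 0.1,   # trend smoothing (assumed value)
    gamma: float = 0.1,  # seasonal and deviation smoothing (assumed value)
) -> Tuple[float, float]:
    """Return (one-step-ahead prediction, smoothed absolute error)."""
    if len(series) < 2 * period:
        raise ValueError("need at least two full seasonal periods")

    first, second = series[:period], series[period:2 * period]
    level = sum(first) / period
    trend = (sum(second) - sum(first)) / (period * period)
    season = [y - level for y in first]
    deviation = 0.0

    for t in range(period, len(series)):
        y = series[t]
        s_prev = season[t % period]
        predicted = level + trend + s_prev
        # Track how far off the one-step-ahead predictions have been.
        deviation = gamma * abs(y - predicted) + (1 - gamma) * deviation
        prev_level = level
        level = alpha * (y - s_prev) + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        season[t % period] = gamma * (y - level) + (1 - gamma) * s_prev

    next_prediction = level + trend + season[len(series) % period]
    return next_prediction, deviation


# A toy series with a repeating pattern of length 4.
prediction, deviation = holt_winters_additive([10.0, 20.0, 30.0, 20.0] * 3, period=4)
print(prediction, deviation)
```

With 5 minute windows a daily season would be `period=288` and a weekly one `period=2016`; the slides do not say which seasonality is used, so treat either as a guess.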
SLIDE 15
Cheating is winning.
Our predictions work on 5 minute windows.
4 weeks of data is 8064 windows.
Given Pred(T-8063 .. T0) -> (P1, C1)
Given Pred(T-8062 .. T0, P1) -> ~(P2, C2)
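Spelled out, the cheat is: drop the oldest window, append the first prediction P1 as if it had been observed, and predict again to get an approximate P2. Here is a sketch in Python, with a deliberately silly stand-in predictor. Both `predict_two_ahead` and `toy_predictor` are hypothetical names, and the real predictor is the triple exponential smoothing mentioned earlier.

```python
from typing import Callable, List, Sequence, Tuple

WINDOW_SECONDS = 5 * 60
WINDOWS_IN_4_WEEKS = 4 * 7 * 24 * 60 * 60 // WINDOW_SECONDS  # 8064

Prediction = Tuple[float, float]  # (predicted value, confidence)


def predict_two_ahead(
    history: Sequence[float],
    predict: Callable[[Sequence[float]], Prediction],
) -> Tuple[Prediction, Prediction]:
    """Predict the next window, then reuse that prediction as a
    stand-in observation to approximate the window after it."""
    p1, c1 = predict(history)
    shifted: List[float] = list(history[1:]) + [p1]  # drop oldest, append P1
    p2, c2 = predict(shifted)
    return (p1, c1), (p2, c2)


def toy_predictor(history: Sequence[float]) -> Prediction:
    """Stand-in predictor: mean and spread of the last three windows."""
    tail = list(history[-3:])
    return sum(tail) / len(tail), max(tail) - min(tail)


print(WINDOWS_IN_4_WEEKS)
print(predict_two_ahead([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], toy_predictor))
```

The second result is only approximate because it is built on a guess rather than on a real observation, which is what the `~` on the slide is flagging.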
SLIDE 16
Tolerably inaccurate.
When V arrives, we determine the prediction window WN we need.
If WN isn’t in cache, we assume V is within tolerances.
If WN+1 isn’t in cache, we query the Snowth for WN, WN+1, placing them in cache.
Cache accesses are local and always < 100µs.
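The lookup logic above fits in a few lines. This is a simplified, synchronous Python sketch. The `fetch` callable stands in for a Snowth query and is purely hypothetical, and treating the confidence as a plus-or-minus tolerance around the prediction is an assumption. A real system would presumably issue the query off the hot path; here the cache is read before it is filled so that a miss still answers "within tolerances" immediately.

```python
from typing import Callable, Dict, Iterable, Tuple

Prediction = Tuple[float, float]  # (predicted value, tolerance)


class PredictionCache:
    def __init__(
        self,
        fetch: Callable[[Iterable[int]], Dict[int, Prediction]],
        window_seconds: int = 300,
    ) -> None:
        self.fetch = fetch  # hypothetical stand-in for a storage query
        self.window_seconds = window_seconds
        self.cache: Dict[int, Prediction] = {}

    def within_tolerance(self, ts: float, value: float) -> bool:
        n = int(ts // self.window_seconds)  # window WN for this value
        hit = self.cache.get(n)             # read before any refill
        if n + 1 not in self.cache:
            # Load WN and WN+1 so that later values find them cached.
            self.cache.update(self.fetch([n, n + 1]))
        if hit is None:
            return True  # no prediction at hand: assume V is fine
        predicted, tolerance = hit
        return abs(value - predicted) <= tolerance


def fake_fetch(windows: Iterable[int]) -> Dict[int, Prediction]:
    """Pretend storage: every window predicts 10.0 plus or minus 2.0."""
    return {w: (10.0, 2.0) for w in windows}


cache = PredictionCache(fake_fetch)
print(cache.within_tolerance(0.0, 100.0))   # cache miss, assumed fine
print(cache.within_tolerance(10.0, 100.0))  # cached now, so it is checked
```

The trade-off is deliberate: the first value in an uncached window is never flagged, in exchange for never waiting on storage.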
SLIDE 17
I see challenges
How do I take offline data analytics techniques and apply them online to high-volume, low-latency event streams
quickly?
without deep expertise?
SLIDE 18
Thank you.
Circonus is hiring: software engineers, quants, and visualization engineers.
[1] http://esper.codehaus.org/tutorials/solution_patterns/solution_patterns.html
[2] http://omniti.com/surge/2011/speakers/theo-schlossnagle
[3] http://labs.omniti.com/people/jesus/papers/holtwinters.pdf