Economical machine learning via functional programming

Big Data Scala by the Bay, August 18, 2015. David Andrzejewski (@davidandrzej), Data Sciences Engineering, Sumo Logic.


SLIDE 1 Sumo Logic Confidential

Economical machine learning via functional programming

Big Data Scala by the Bay – August 18, 2015

David Andrzejewski - @davidandrzej Data Sciences Engineering, Sumo Logic

SLIDE 2

Sumo Logic

  • Machine data intelligence platform in AWS
  • Early-ish Scala adopter (2.7.7 in 2010)
  • Free trial for < 500 MB/day
SLIDE 3

This talk

  • Machine learning is useful, but...
  • ...brings additional engineering complexity
  • Functional programming techniques can help
SLIDE 4

Machine learning

  • Will robots...
    – take our jobs?
    – annihilate humanity?
  • Key clip art
    – robots studying
    – heads with gears in them

So hot right now

SLIDE 5

“Machine learning studies computer algorithms for learning to do stuff.”

  • Prof. Rob Schapire (COS 511 scribe notes)
SLIDE 6

What kinds of “stuff” can machines learn to do?

  • ...predict whether someone will click an ad
  • ...rank / recommend content by relevance
  • ...classify behavior as malicious or not
  • ...label images or text based on content

And how do they do it?

  • What: $f(x) = y$
  • How: a model $\hat{f}$
    – Estimate: $[(x_1, y_1), \ldots, (x_N, y_N)] \rightarrow \hat{f}(x)$
    – Predict: $\hat{f}(x) = \hat{y}$
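The estimate/predict split can be sketched as two functions. This is a minimal illustration, assuming a one-dimensional least-squares line as the model; `LinearModel`, `estimate`, and `predict` are hypothetical names, not code from the talk.

```scala
object EstimatePredict {
  // Hypothetical model: a line y = slope * x + intercept.
  final case class LinearModel(slope: Double, intercept: Double) {
    // Predict: apply the estimated function f-hat to a new input.
    def predict(x: Double): Double = slope * x + intercept
  }

  // Estimate: turn labeled examples [(x1, y1), ..., (xN, yN)] into f-hat
  // using ordinary least squares.
  def estimate(data: Seq[(Double, Double)]): LinearModel = {
    val n = data.size.toDouble
    val meanX = data.map(_._1).sum / n
    val meanY = data.map(_._2).sum / n
    val sxy = data.map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val sxx = data.map { case (x, _) => (x - meanX) * (x - meanX) }.sum
    require(sxx > 0.0, "need at least two distinct x values")
    val slope = sxy / sxx
    LinearModel(slope, meanY - slope * meanX)
  }
}
```

Both steps are pure functions: the same training data always yields the same model, and the same model always yields the same prediction.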

SLIDE 7

Rise of complementary goods

  • More Cloud (source: Forrester via Forbes)
  • Moore’s Law (source: Fairchild via computerhistory.org)
  • More Data (source: IDC via The Economist)

SLIDE 8

“Machine Learning disrupts software engineering.”

  • Léon Bottou (ICML 2015 keynote)
SLIDE 9

Technical debt

  • Tight coupling
  • Hidden dependencies
  • Code repetition / duplication
  • Statefulness
  • Duct-taped workarounds

“...you are sure that it will make further changes harder in the future.” – Martin Fowler

SLIDE 10

ML: new & exciting ways to shoot yourself in the foot

Trough of disillusionment?

Machine Learning: The High Interest Credit Card of Technical Debt

  • D. Sculley et al. (NIPS 2014 workshop)

Two big challenges in machine learning

  • Léon Bottou (ICML 2015 keynote)

  • Unreliable contracts
  • Unrealistic assumptions
  • Hard to
    – test and debug
    – safely improve
    – manage data/features
  • Easy to
    – erode boundaries
    – glue / hack / duct tape

A Systems View of Machine Learning

  • Joshua Bloom (PyData 2015 keynote)

SLIDE 11

Principal Payments

SLIDE 12

Principal Payments

N → ∞

SLIDE 13

Principal Payments

SLIDE 14

Principal Payments

Machine Learning

SLIDE 15

Out of the Tar Pit

  • Moseley & Marks (2006) – h/t Paco Nathan

  • Essential complexity: actual problem, business logic, SQL, declarative
  • Incidental complexity: “reality tax”, implementation detail, Hadoop, imperative

SLIDE 16

Control “complexity spend” with Functional Programming

  • Avoid mutable state
  • Minimize custom logic surface area
  • Facilitate local reasoning
  • Compose small, well-typed, well-tested functions
SLIDE 17

Functional programming (FP)

  • Big idea: pure functions
    – no side effects
    – referential transparency
  • Consequences
    – immutability
    – first-class functions
    – higher-order functions
  • Examples use scalaz
    – version 7.1.3
    – see the “learning scalaz” blog post series by eed3si9n
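A small sketch of the ideas above, using only the Scala standard library (no scalaz). The function names are illustrative assumptions, not code from the talk.

```scala
object PureFunctions {
  // Impure: the result depends on hidden mutable state, so two identical
  // calls can return different values.
  private var total = 0.0
  def addToTotalImpure(x: Double): Double = { total += x; total }

  // Pure: the same inputs always give the same output, with no side effects.
  def scale(factor: Double)(x: Double): Double = factor * x
  def shift(offset: Double)(x: Double): Double = x + offset

  // Functions are first-class values...
  val shiftDown: Double => Double = shift(-10.0)
  val halve: Double => Double = scale(0.5)

  // ...and andThen is a higher-order function that composes them.
  val normalize: Double => Double = shiftDown andThen halve
}
```

Because `normalize` is referentially transparent, any call such as `normalize(14.0)` can be replaced by its result without changing program behavior; the same is not true of `addToTotalImpure`.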

SLIDE 18
SLIDE 19
SLIDE 20
SLIDE 21

Case 0: your code, does it work?

Step 0: use the types

  • Useful tricks
    – Case class wrappers
    – Unboxed tagged types

    // before: three interchangeable Longs
    def getData(datasetId: Long, startTime: Long, endTime: Long)

    // after: distinct types the compiler can check
    def getData(datasetId: DatasetId, timeRange: DatasetInterval)
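One way the case class wrapper trick might look. `DatasetId` and `DatasetInterval` are named on the slide, but their fields, the `Timestamp` wrapper, and the body of `getData` are assumptions made up for this sketch.

```scala
object TypedIds {
  // Case class wrappers: each Long gets its own compile-time type, so
  // arguments can no longer be swapped silently.
  final case class DatasetId(value: Long)
  final case class Timestamp(millis: Long)

  // An interval that refuses to be constructed with its ends reversed.
  final case class DatasetInterval(start: Timestamp, end: Timestamp) {
    require(start.millis <= end.millis, "start must not be after end")
  }

  // Hypothetical stand-in for the real data access; it only describes the request.
  def getData(datasetId: DatasetId, timeRange: DatasetInterval): String =
    s"dataset ${datasetId.value} from ${timeRange.start.millis} to ${timeRange.end.millis}"
}
```

With these types, passing a start time where a dataset id is expected is a compile error rather than a runtime surprise.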

SLIDE 22

Case 0: your code, does it work?

Step 1: unit testing

Given known examples $(x_1, y_1), (x_2, y_2), \ldots, (x_k, y_k)$, check $f(x_i) \stackrel{?}{=} y_i$.

Testing for data scientists

  • Trey Causey (PyData 2015)

SLIDE 23

Case 0: your code, does it work?

Step 2: property testing

  • Define properties you expect to hold
  • Heuristics to probe edge cases
  • ML examples

    – bounded output
    – find weird edge cases (e.g., empty clusters)

  • Universal: $\forall x \; p(x, f(x))$
  • Conditional: $\forall x \; c(x) \implies p(x, f(x))$
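A hand-rolled sketch of universal and conditional properties, using only `scala.util.Random`. A real project would more likely use a property-testing library; `forAll`, `forAllWhere`, and `genDouble` here are hypothetical helpers, and the sigmoid is just a convenient function with a bounded output.

```scala
import scala.util.Random

object PropertyCheck {
  // Function under test: logistic sigmoid, expected to stay within [0, 1].
  def sigmoid(x: Double): Double = 1.0 / (1.0 + math.exp(-x))

  // Universal property: p(x) must hold for every generated x.
  // Returns the first counterexample found, if any.
  def forAll[A](gen: Random => A, trials: Int, seed: Long)(p: A => Boolean): Option[A] = {
    val rng = new Random(seed)
    Iterator.fill(trials)(gen(rng)).find(x => !p(x))
  }

  // Conditional property: inputs that fail the precondition c(x) pass trivially.
  def forAllWhere[A](gen: Random => A, trials: Int, seed: Long)(c: A => Boolean)(p: A => Boolean): Option[A] =
    forAll(gen, trials, seed)(x => !c(x) || p(x))

  // Generator heuristic: mix ordinary values with hand-picked edge cases.
  val edgeCases: Vector[Double] = Vector(0.0, -0.0, 1e308, -1e308, Double.MinPositiveValue)
  val genDouble: Random => Double = rng =>
    if (rng.nextInt(10) == 0) edgeCases(rng.nextInt(edgeCases.size))
    else (rng.nextDouble() - 0.5) * 2000.0
}
```

The bounded-output property holds for all inputs, while the stricter claim that the output lies strictly between 0 and 1 only holds under a precondition on the input size, which is the kind of edge case random probing tends to expose.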

SLIDE 24

Case 0: your code, does it work?

Step 3: statistical estimators are functions

  • Confidence intervals
  • PAC-style bounds
  • Property testing

    – custom generator: sets of datasets

$P(L \geq \epsilon) \leq \delta$
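As one concrete instance of a bound of this shape (an assumption of this sketch, not necessarily the bound used in the talk), Hoeffding's inequality for the mean of n i.i.d. values in [0, 1] gives $P(|\bar{x} - \mu| \geq \epsilon) \leq 2 e^{-2 n \epsilon^2}$. Solving for n tells a test how many samples it needs before asserting anything about an estimator.

```scala
object HoeffdingBound {
  // Upper bound on P(|sampleMean - trueMean| >= epsilon) for n i.i.d. values in [0, 1].
  def failureProbability(n: Int, epsilon: Double): Double =
    math.min(1.0, 2.0 * math.exp(-2.0 * n * epsilon * epsilon))

  // Smallest n for which the bound above is at most delta.
  def samplesNeeded(epsilon: Double, delta: Double): Int = {
    require(epsilon > 0.0 && delta > 0.0 && delta < 1.0)
    math.ceil(math.log(2.0 / delta) / (2.0 * epsilon * epsilon)).toInt
  }
}
```

A property test over generated datasets can then tolerate a failure rate of about delta instead of demanding that a randomized estimator be exactly right every time.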

SLIDE 25

Case 1: loose coupling via type class pattern

  Approach                  Example
  Hard-wired                def foo(x: MyBuzzType)
  Parametric polymorphism   def add[T](x: T)
  Variance annotation       class Stack[+T]
  Ad-hoc polymorphism       def sort[T : Ordering](xs: List[T])

SLIDE 26

Advantages of type classes

  • Retroactively extend external types (e.g., Joda Time)
  • Nicer than “wrapper class” / subtyping (blog post)
  • ML sweet spot: consumers “just short” of being polymorphic
  • Examples

    – Timestamped[T]
    – Featurable[T]
    – Labeled[T]
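A minimal sketch of what a `Timestamped[T]` type class could look like. The trait name comes from the slide; its method, the `LogEvent` type, and the `latest` consumer are assumptions for illustration.

```scala
import java.time.Instant

object TypeClasses {
  // Type class: anything a timestamp can be extracted from.
  trait Timestamped[T] {
    def timestamp(t: T): Instant
  }

  // Hypothetical event type owned by this code base.
  final case class LogEvent(at: Instant, message: String)

  object Timestamped {
    implicit val logEventTimestamped: Timestamped[LogEvent] =
      new Timestamped[LogEvent] { def timestamp(e: LogEvent): Instant = e.at }

    // Retroactive extension: a type we do not own (a plain tuple) gains the
    // capability without subclassing or a wrapper class.
    implicit def pairTimestamped[A]: Timestamped[(Instant, A)] =
      new Timestamped[(Instant, A)] { def timestamp(p: (Instant, A)): Instant = p._1 }
  }

  // Consumer written once against the type class (context-bound syntax).
  def latest[T: Timestamped](items: Seq[T]): Option[T] = {
    val ts = implicitly[Timestamped[T]]
    if (items.isEmpty) None else Some(items.maxBy(x => ts.timestamp(x).toEpochMilli))
  }
}
```

`latest` knows nothing about `LogEvent` or tuples; it only requires evidence that a timestamp can be obtained, which is what keeps the coupling loose.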

SLIDE 27
  • Basic k-fold CV
  • Stratified: need label info
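The difference between the two can be sketched as follows: basic k-fold is fully polymorphic, while the stratified version is "just short" of polymorphic because it needs a label. The `Labeled` trait name is from the earlier slide; its `label` method and both fold functions are assumptions of this sketch.

```scala
object StratifiedFolds {
  // Type class exposing a class label.
  trait Labeled[T] { def label(t: T): String }

  // Basic k-fold: needs nothing from T; item i goes to fold i mod k.
  def kFold[T](items: Seq[T], k: Int): Vector[Seq[T]] = {
    require(k > 0)
    val byFold = items.zipWithIndex.groupBy { case (_, i) => i % k }
    Vector.tabulate(k)(f => byFold.getOrElse(f, Seq.empty[(T, Int)]).map(_._1))
  }

  // Stratified k-fold: deal each label's items round-robin so every fold
  // sees roughly the same label proportions.
  def stratifiedKFold[T](items: Seq[T], k: Int)(implicit L: Labeled[T]): Vector[Seq[T]] = {
    val perLabel = items.groupBy(L.label).toSeq.sortBy(_._1).map(_._2)
    val perLabelFolds = perLabel.map(group => kFold(group, k))
    Vector.tabulate(k)(f => perLabelFolds.flatMap(folds => folds(f)))
  }
}
```

Any type with a `Labeled` instance can be stratified, including types defined elsewhere.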

SLIDE 28

Type class laws

  • What about transitivity? Let’s add trait TotalOrdering[T]

Property Testing + Type Classes
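A sketch of checking a type class law by brute force over sample values. It uses the standard library's `Ordering` rather than the `TotalOrdering` trait mentioned on the slide, and the rock-paper-scissors instance is an invented counterexample.

```scala
object OrderingLaws {
  // Law: if a <= b and b <= c then a <= c, checked over all sample triples.
  def transitive[T](samples: Seq[T])(implicit ord: Ordering[T]): Boolean =
    samples.forall { a =>
      samples.forall { b =>
        samples.forall { c =>
          !(ord.lteq(a, b) && ord.lteq(b, c)) || ord.lteq(a, c)
        }
      }
    }

  // A deliberately unlawful instance: rock < paper < scissors < rock.
  val cyclic: Ordering[String] = new Ordering[String] {
    private val beats = Map("rock" -> "scissors", "scissors" -> "paper", "paper" -> "rock")
    def compare(x: String, y: String): Int =
      if (x == y) 0 else if (beats(y) == x) -1 else 1
  }
}
```

The type checker accepts `cyclic` as an `Ordering[String]`; only a law check reveals that sorting with it is meaningless.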

SLIDE 29

Case 2: Monoids + Monoids = Monoids

  • Experimental evaluation code frequently manipulates results
  • How to combine?
SLIDE 30

Implementing Monoid

(I believe Shapeless can do this automagically...!?)
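A hand-written sketch of the idea, with a locally defined `Monoid` trait instead of scalaz. The `Results` field names are assumptions (the slides only show four integers), as are `mapMonoid` and `combineAll`.

```scala
object Monoids {
  trait Monoid[A] {
    def zero: A
    def append(x: A, y: A): A
  }

  // Hypothetical result record; the field meanings are guesses.
  final case class Results(tp: Int, fp: Int, tn: Int, fn: Int)

  // Field-wise addition: each Int is a monoid, so the record is too.
  implicit val resultsMonoid: Monoid[Results] = new Monoid[Results] {
    val zero: Results = Results(0, 0, 0, 0)
    def append(x: Results, y: Results): Results =
      Results(x.tp + y.tp, x.fp + y.fp, x.tn + y.tn, x.fn + y.fn)
  }

  // A Map whose values form a monoid is itself a monoid: merge by key.
  implicit def mapMonoid[K, V](implicit vm: Monoid[V]): Monoid[Map[K, V]] =
    new Monoid[Map[K, V]] {
      val zero: Map[K, V] = Map.empty
      def append(x: Map[K, V], y: Map[K, V]): Map[K, V] =
        y.foldLeft(x) { case (acc, (k, v)) =>
          acc.updated(k, acc.get(k).map(old => vm.append(old, v)).getOrElse(v))
        }
    }

  def combineAll[A](xs: Seq[A])(implicit m: Monoid[A]): A =
    xs.foldLeft(m.zero)(m.append)
}
```

Once the leaf type has an instance, nested structures such as `Map[String, Results]` combine with no further custom logic.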

SLIDE 31

Map(TestGroup -> Results(79,119,171,14),
    ControlGroup -> Results(34,77,136,112))

SLIDE 32

Distributed compute via monoid homomorphism

See: Twitter Algebird and related talks, Jimmy Lin “Monoidify!” paper

$f(s_1 + s_2) = f(s_1) \oplus f(s_2)$

[Figure: DATA blocks]
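The homomorphism property can be illustrated with a tiny summary type. Here f is `summarize`, `+` is sequence concatenation, and ⊕ is `combine`; all three names are assumptions of this sketch.

```scala
object Homomorphism {
  // f: map a chunk of data to a small summary (count, sum).
  def summarize(chunk: Seq[Double]): (Long, Double) =
    (chunk.size.toLong, chunk.sum)

  // The monoid operation on summaries: add component-wise.
  def combine(a: (Long, Double), b: (Long, Double)): (Long, Double) =
    (a._1 + b._1, a._2 + b._2)

  // The statistic of interest is recovered from the combined summary.
  def mean(summary: (Long, Double)): Double = summary._2 / summary._1
}
```

Because summarizing the concatenation equals combining the summaries, each partition can be summarized on a different machine and only the small summaries need to be shipped and merged.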

SLIDE 33

Monoidal classifiers: 400x faster than Weka

Algebraic Classifiers: a generic approach to fast cross-validation, online training, and parallel training - Izbicki, ICML13

SLIDE 34

Key trick: prefix-sum
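A sketch of how a prefix-sum might be used for cross-validation, assuming each fold has already been reduced to a monoidal model summary. `leaveOneOut` is a hypothetical helper, not code from the cited paper.

```scala
object PrefixSumCV {
  // For each fold i, combine the summaries of every other fold using
  // prefix and suffix scans: O(k) combines instead of O(k^2).
  def leaveOneOut[M](folds: Vector[M], zero: M)(combine: (M, M) => M): Vector[M] = {
    // prefix(i) = folds(0) + ... + folds(i - 1), with prefix(0) = zero
    val prefix = folds.scanLeft(zero)(combine)
    // suffix(i) = folds(i) + ... + folds(k - 1), with suffix(k) = zero
    val suffix = folds.scanRight(zero)(combine)
    Vector.tabulate(folds.size)(i => combine(prefix(i), suffix(i + 1)))
  }
}
```

Element i of the result is the model trained on everything except fold i. The operands stay in their original order, so the helper also works for monoids that are not commutative.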

SLIDE 35

Case 3: auditing computation with Writer Monad

Understanding multiclass predictions (credit: Kumar Avijit)

$f(x) = \arg\max_i \; w_i^T x$
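The prediction rule above can be sketched directly, assuming one dense weight vector per class; `dot` and `predict` are illustrative names.

```scala
object Multiclass {
  def dot(w: Vector[Double], x: Vector[Double]): Double =
    w.zip(x).map { case (wi, xi) => wi * xi }.sum

  // f(x) = argmax_i w_i^T x: the index of the class whose weight vector scores highest.
  def predict(weights: Vector[Vector[Double]], x: Vector[Double]): Int = {
    require(weights.nonEmpty, "need at least one class")
    weights.indices.maxBy(i => dot(weights(i), x))
  }
}
```

The function returns only the winning class, which is exactly why the reasons behind a prediction are hard to audit without extra instrumentation.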

SLIDE 36
SLIDE 37
SLIDE 38

Confusion matrix with “max significant feature”

Tracking illuminates “bad features”

SLIDE 39

How did we do that?

Writer Monad in simple drawings

SLIDE 40

How did we do that?

Writer Monad in simple drawings

SLIDE 41

How did we do that?

Writer Monad in simple drawings
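Since the drawings did not survive, here is a hand-rolled sketch of the Writer idea applied to multiclass prediction. The talk used scalaz; this `Writer` is a simplified stand-in with a fixed `Vector[String]` log, and `score` and `predict` are assumptions.

```scala
object WriterSketch {
  // A value paired with an accumulated log; flatMap concatenates the logs.
  final case class Writer[A](log: Vector[String], value: A) {
    def map[B](f: A => B): Writer[B] = Writer(log, f(value))
    def flatMap[B](f: A => Writer[B]): Writer[B] = {
      val next = f(value)
      Writer(log ++ next.log, next.value)
    }
  }

  def tell(msg: String): Writer[Unit] = Writer(Vector(msg), ())

  // Score one class and record which feature contributed the most.
  def score(label: String, w: Vector[Double], x: Vector[Double]): Writer[Double] = {
    val contributions = w.zip(x).map { case (wi, xi) => wi * xi }
    val top = contributions.indices.maxBy(i => contributions(i))
    tell(s"$label: max feature $top").map(_ => contributions.sum)
  }

  // Predict the argmax class; the audit trail rides along in the Writer.
  def predict(weights: Map[String, Vector[Double]], x: Vector[Double]): Writer[String] = {
    val labels = weights.keys.toVector.sorted
    val start = Writer(Vector.empty[String], Vector.empty[(String, Double)])
    val scored = labels.foldLeft(start) { (acc, l) =>
      for { soFar <- acc; s <- score(l, weights(l), x) } yield soFar :+ (l -> s)
    }
    scored.map(scores => scores.maxBy(_._2)._1)
  }
}
```

The prediction logic never mentions the log explicitly; `flatMap` does the bookkeeping, so the instrumentation stays out of the way of the model code.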

SLIDE 42

Case 4: stateful traversal

Example: sampling from p-th order autoregressive model

$x_t = \sum_{i=1}^{p} \phi_i x_{t-i} + \epsilon_t$

SLIDE 43

Case 4: stateful traversal

Re-arrange to take current window state as input

$x_t = \sum_{i=1}^{p} \phi_i x_{t-i} + \epsilon_t$

SLIDE 44

Case 4: stateful traversal

Partially apply the function for fixed parameters

$x_t = \sum_{i=1}^{p} \phi_i x_{t-i} + \epsilon_t$

SLIDE 45

Case 4: stateful traversal

Map function over random noise terms

$x_t = \sum_{i=1}^{p} \phi_i x_{t-i} + \epsilon_t$

Now we’ve got something like g: Window => (Window, Prediction)

SLIDE 46
SLIDE 47

  1. Convert each position into independent State calculation
  2. Traverse/Sequence to convert List[State] → State[List]
  3. Supply initial window state and run()
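The three steps can be sketched with a hand-rolled State monad. The talk used scalaz, so everything here (`State`, `sequence`, `arStep`, `sample`, and the window layout) is an assumption made for illustration; the sketch also assumes the zero-mean form of the autoregressive model.

```scala
object StateSketch {
  // A state transition that also yields a value: S => (S, A).
  final case class State[S, A](run: S => (S, A)) {
    def map[B](f: A => B): State[S, B] =
      State { s =>
        val (s1, a) = run(s)
        (s1, f(a))
      }
    def flatMap[B](f: A => State[S, B]): State[S, B] =
      State { s =>
        val (s1, a) = run(s)
        f(a).run(s1)
      }
  }

  // Step 2: List[State[S, A]] => State[S, List[A]], threading state left to right.
  // Not stack-safe for very long lists; fine for a sketch.
  def sequence[S, A](steps: List[State[S, A]]): State[S, List[A]] =
    steps.foldRight(State[S, List[A]](s => (s, Nil))) { (step, rest) =>
      for { a <- step; as <- rest } yield a :: as
    }

  type Window = Vector[Double] // the last p values, most recent last

  // Step 1: one AR(p) step with fixed coefficients and a given noise term.
  // coeffs(0) multiplies the most recent value.
  def arStep(coeffs: Vector[Double])(noise: Double): State[Window, Double] =
    State { window =>
      val next = coeffs.zip(window.reverse).map { case (c, x) => c * x }.sum + noise
      (window.tail :+ next, next)
    }

  // Step 3: supply the initial window and run.
  def sample(coeffs: Vector[Double], init: Window, noise: List[Double]): List[Double] = {
    require(coeffs.nonEmpty && init.size == coeffs.size, "window must hold p values")
    sequence(noise.map(e => arStep(coeffs)(e))).run(init)._2
  }
}
```

The window is never mutated: each step returns a new window, and `sequence` is the only place that knows how to thread it from one position to the next.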

SLIDE 48

Monoids, Monads, who cares?

  • Ubiquitous patterns
    – Monoids: generalized addition/combination
    – Monads: computation within context
  • Make it explicit and reap the rewards
    – type-checking
    – generalized wiring
    – optimization opportunities
    – common vocabulary

SLIDE 49
Manage ML tech debt with functional programming

Correctness

  • Type-oriented design
  • Function composition
  • Unit tests
  • Property tests
  • Bounds and randomized behavior

Loose coupling

  • Type class design pattern
  • Utility functions via ad hoc polymorphism
  • Chaining type classes
  • Law-checking

Monoid design

  • Combine data structures
  • Leverage general plumbing
  • Efficient distributed computation
  • Cross-fold validation

Monad design

  • Instrumented model prediction with Writer
  • Stateful traversal
  • Type-checked failure handling