Online Machine Learning and Data Mining, Edo Liberty (PowerPoint presentation)

SLIDE 1

ONLINE MACHINE LEARNING AND DATA MINING

EDO LIBERTY

SLIDE 2

STANDARD MACHINE LEARNING SETTING

[Diagram: Training Data (examples labeled “call” and “crawl”) → Train → Model; Test Data (“= ?”) → Apply Model → Label (“call”)]

SLIDE 3

STANDARD MACHINE LEARNING SETTING

Data is generated by a stochastic process

More training data is better

Predicting the future is impossible (in general)

Big ML means optimization on big data

SLIDE 4

MORE DATA IS OFTEN WORSE (MORE DATA = OLDER DATA)

SLIDE 5

OUR ACTIONS HEAVILY INFLUENCE THE DATA

SLIDE 6

THE FUTURE IS OFTEN NOT LIKE THE PAST!

Same story line or not?
1) The answer depends on the future
2) We have to decide now…

SLIDE 7

HAVING “A MODEL” IS COMPLETELY UNIMPORTANT

Elements of information theory, Cover, 1991
Efficient algorithms for universal portfolios, Kalai, Vempala, 2003
Efficient Algorithms for Online Game Playing and Universal Portfolio Management, Agarwal, Hazan, 2006

SLIDE 8

ONLINE ALGORITHMS (DECISION MAKING WITHOUT PREDICTING)

SLIDE 9

THE SKI RENTAL PROBLEM

Rent: $x / day
Buy: $1000

SLIDE 10

THE SKI RENTAL PROBLEM

Computation: prices 70 → decisions R
Cost: 70

SLIDE 11

THE SKI RENTAL PROBLEM

Computation: prices 70, 90 → decisions R, R
Cost: 70 + 90 = 160

SLIDE 12

THE SKI RENTAL PROBLEM

Computation: prices 70, 90, 80 → decisions R, R, B
Cost: 70 + 90 + 1000 = 1160

SLIDE 13

THE SKI RENTAL PROBLEM

Computation: prices 70, 90, 80, 70 → decisions R, R, B
Cost: 70 + 90 + 1000 = 1160

SLIDE 14

THE SKI RENTAL PROBLEM

Computation: prices 70, 90, 80, 70 → decisions R, R, R, R
Cost: 70 + 90 + 80 + 70 = 310

You should have rented all along…
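The totals in these ski rental frames can be checked with a few lines of Python. This is only a sketch: the function name `total_cost`, the "R"/"B" string encoding, and the assumption that nothing more is paid after buying are illustrative choices, not taken from the slides.

```python
def total_cost(prices, decisions, buy_price=1000):
    """Cost of a decision sequence: 'R' rents at that day's price, 'B' buys.

    Assumption (not stated on the slides): once skis are bought,
    no further payments are made on later days.
    """
    total = 0
    for price, decision in zip(prices, decisions):
        if decision == "B":
            return total + buy_price
        total += price
    return total


prices = [70, 90, 80, 70]
rent_then_buy = total_cost(prices, "RRB")    # 70 + 90 + 1000
always_rent = total_cost(prices, "RRRR")     # 70 + 90 + 80 + 70
```

With these four prices, renting twice and then buying costs 1160, while renting on all four days costs 310, which is the point of "you should have rented all along".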

SLIDE 15

THE SKI RENTAL PROBLEM

[Diagram: Input (daily rental prices; values shown include 70, 70, 90, 80, 79, 72, 88, 90) → Computation → Output (R, R, R, R, …, B). Two amounts of $1000 are marked.]

SLIDE 16

THE SKI RENTAL PROBLEM

[Figure labels: Algorithm, Buy, Optimal in hindsight]

ALG <= 2 OPT
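One classic strategy that satisfies a bound of this form is the break-even rule: keep renting while the total rent paid stays within the purchase price, then buy. The slides do not spell out the algorithm, so the sketch below is an assumption about what is meant; the function names and the rule that buying replaces that day's rental are illustrative.

```python
def break_even_ski_rental(prices, buy_price=1000):
    """Rent while cumulative rent stays <= buy_price, otherwise buy.

    Assumed strategy (not confirmed by the slides). Returns the list of
    decisions and the total amount paid.
    """
    rent_paid = 0
    decisions = []
    for price in prices:
        if rent_paid + price <= buy_price:
            rent_paid += price
            decisions.append("R")
        else:
            decisions.append("B")
            return decisions, rent_paid + buy_price
    return decisions, rent_paid


def optimal_in_hindsight(prices, buy_price=1000):
    """Best cost when all prices are known: rent every day or buy on day one."""
    return min(sum(prices), buy_price)


short_trip = [70, 90, 80, 70]
long_trip = [400, 400, 400]
```

If the rule never buys, it pays exactly the optimal cost. If it buys, it has paid at most `buy_price` in rent plus `buy_price` for the skis, while the optimum is `buy_price`, so the total is at most twice the optimum.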

SLIDE 17

ONLINE LINEAR CLASSIFICATION

SLIDE 18

ONLINE MACHINE LEARNING

Emails → Computation → Spam? N

SLIDE 19

ONLINE MACHINE LEARNING

Emails → Computation → Spam? N, N

SLIDE 20

ONLINE MACHINE LEARNING

Emails → Computation → Spam? N, N, Y

SLIDE 21

ONLINE MACHINE LEARNING

Computation → Y, N, N, Y, N, N, Y, N

Number of mistakes is compared to the best classifier in hindsight!
Variants of SGD have this property

Prediction, Learning, and Games, Cesa-Bianchi, Lugosi, 2006
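To make the predict-then-update loop concrete, here is a minimal perceptron-style online classifier, which can be viewed as a stochastic gradient step taken only on mistakes. The slides do not name a specific algorithm, so the choice of perceptron, the feature vectors, and the +1/-1 label encoding are assumptions made for illustration.

```python
def online_perceptron(stream, dim):
    """Process (features, label) pairs one at a time and count mistakes.

    Labels are +1 (spam) or -1 (not spam); this encoding is an
    illustrative assumption. The model is updated only after a mistake.
    """
    w = [0.0] * dim
    mistakes = 0
    for x, label in stream:
        score = sum(wi * xi for wi, xi in zip(w, x))
        prediction = 1 if score > 0 else -1      # predict before seeing the label
        if prediction != label:
            mistakes += 1
            w = [wi + label * xi for wi, xi in zip(w, x)]
    return w, mistakes


# Hypothetical toy stream: two alternating emails with 2 features each.
toy_stream = [([1.0, 2.0], 1), ([1.0, -2.0], -1)] * 10
weights, n_mistakes = online_perceptron(toy_stream, dim=2)
```

The quantity of interest is `n_mistakes`, which is what gets compared with the mistakes of the best fixed classifier chosen in hindsight.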

SLIDE 22

ONLINE PRINCIPAL COMPONENT ANALYSIS

Online Principal Components Analysis, Boutsidis, Garber, Karnin, Liberty, 2014
Online PCA with Spectral Bounds, Karnin, Liberty, 2015

SLIDE 23

$x_i$

SLIDE 24

$x_i$, $\Phi\Phi^T x_i$

SLIDE 25

$x_i$, $\Phi\Phi^T x_i$, $\|x_i - \Phi\Phi^T x_i\|$
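The three quantities on this slide (a point, its projection onto the subspace spanned by Φ, and the length of the residual) can be computed directly. The sketch below assumes Φ is a d-by-k matrix with orthonormal columns, stored as a list of rows; the function names and the example matrix are illustrative, not from the slides.

```python
import math


def project(phi, x):
    """Return phi @ phi.T @ x for a d-by-k matrix phi given as a list of rows.

    Assumes the columns of phi are orthonormal, so this is the orthogonal
    projection of x onto the span of those columns.
    """
    d, k = len(phi), len(phi[0])
    # y = phi^T x: the k low-dimensional coordinates of x
    y = [sum(phi[i][j] * x[i] for i in range(d)) for j in range(k)]
    # x_hat = phi y: the reconstruction back in d dimensions
    return [sum(phi[i][j] * y[j] for j in range(k)) for i in range(d)]


def residual_norm(phi, x):
    """Euclidean norm of x minus its projection."""
    x_hat = project(phi, x)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_hat)))


# Hypothetical example: project 3-d points onto the first two coordinates.
phi = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
x = [3.0, 4.0, 12.0]
```

Here the projection keeps the first two coordinates and the residual is the part of `x` along the dropped third coordinate.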

SLIDE 26

Eigenpets: https://bioramble.wordpress.com/2015/09/01/

SLIDE 27

ONLINE PRINCIPAL COMPONENT ANALYSIS

Online PCA with Spectral Bounds, Karnin, Liberty, 2015

SLIDE 28

SLIDE 29

ONLINE K-MEANS CLUSTERING

An Algorithm for Online K-Means Clustering, Liberty, Sriharsha, Sviridenko, 2014

SLIDE 30

K-MEANS CLUSTERING

http://en.wikipedia.org/wiki/MNIST_database
http://research.ics.aalto.fi/mi/software/ne/

SLIDE 31

K-MEANS CLUSTERING

  • Roughly 20,000 documents
  • 20 topics:
      – Graphics
      – PC hardware
      – Baseball
      – For-sale
      – Politics

http://qwone.com/~jason/20Newsgroups/
http://research.ics.aalto.fi/mi/software/ne/

SLIDE 32

K-MEANS CLUSTERING

1) One can cluster points fully online
2) Create only slightly more than k centers
3) Be competitive with the best offline clustering to k clusters

An Algorithm for Online K-Means Clustering, Liberty, Sriharsha, Sviridenko, 2015
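The three properties above can be illustrated with a simplified sketch of a fully online clustering loop: each arriving point is either assigned to its nearest existing center or, with probability proportional to its squared distance, opened as a new center. This is written in the spirit of the cited paper but is not a faithful reproduction of it; the initial cost, the doubling schedule, the constants, and the assumption that the stream length is known in advance are all simplifications made here.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def online_kmeans(points, k, seed=0):
    """Assign each point to a center as it arrives; may open more than k centers.

    Simplified sketch. Requires at least k + 1 points. The facility cost
    and the doubling rule below are illustrative assumptions.
    """
    rng = random.Random(seed)
    points = [list(p) for p in points]
    centers = points[:k + 1]                      # first k + 1 points become centers
    labels = list(range(len(centers)))
    closest_pair = min(sq_dist(a, b)
                       for i, a in enumerate(centers) for b in centers[i + 1:])
    cost = max(closest_pair, 1e-12) / (2 * k)     # assumed initial facility cost
    budget = 3 * k * (1 + math.log(len(points)))  # centers allowed per phase
    opened = 0
    for p in points[k + 1:]:
        d, nearest = min((sq_dist(p, c), j) for j, c in enumerate(centers))
        if rng.random() < min(d / cost, 1.0):     # far points are likely to open a center
            centers.append(p)
            labels.append(len(centers) - 1)
            opened += 1
            if opened >= budget:                  # too many new centers: raise the cost
                cost *= 2
                opened = 0
        else:
            labels.append(nearest)
    return centers, labels


# Hypothetical toy data: two well-separated groups.
toy_points = [(0, 0), (0, 1), (10, 10), (10, 11), (0, 0.5), (10, 10.5)]
toy_centers, toy_labels = online_kmeans(toy_points, k=2)
```

Each label is final at the moment its point arrives, which is what "fully online" means here, and the number of centers returned can exceed k.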

SLIDE 33

ONLINE K-MEANS CLUSTERING

[Plot: both axes tick from 0.2 to 1.2. Datasets in legend: 20news-binary, adult, ijcnn1, letter, magic04, maptaskcoref, nomao, poker, shuttle.binary, skin, vehv2binary, w8all]

An Algorithm for Online K-Means Clustering, Liberty, Sriharsha, Sviridenko, 2015
k-means++: the advantages of careful seeding, Arthur, Vassilvitskii, 2006

SLIDE 34

STREAMING ALGORITHMS OPEN SOURCE FROM YAHOO

EDO LIBERTY

SLIDE 35

DATASKETCHES.GITHUB.IO

SLIDE 36

[Diagram: The World → Data → Computation → Result]

SLIDE 37

DISTRIBUTED STORAGE

[Diagram: The World → Data (×4) → Computation → Result]

SLIDE 38

DISTRIBUTED MODEL (MAP/REDUCE, MESSAGE PASSING, …)

[Diagram: The World → Data + Compute (×8) → Computation → Result]

SLIDE 39

DISTRIBUTED MODEL (INDEXES, TABLES, DATABASES, …)

[Diagram: The World → Data + Compute (×8) → Computation; Query → Computation → Result]

SLIDE 40

BIG-DATA META INFOGRAPHIC

SLIDE 41

THE STREAMING COMPUTATIONAL MODEL

[Diagram labels: The World, Sketch, Query, Result]

SLIDE 42

THE STREAMING COMPUTATIONAL MODEL

[Diagram: an Iterator feeds the items 1, 7, 8, 1, 1, 7, 7 to the Computation, which maintains a Sketch that answers a Query]

O(n) Items
O(polylog(n)) Space

SLIDE 43

THE DISTRIBUTED STREAMING COMPUTATIONAL MODEL

[Diagram: The World → Sketch (×4) → Merge → Sketch]

SLIDE 44

Number of users (easy)

data → Map (count) → Reduce (sum)

SLIDE 45

Web Site Logs

Time   User ID  Site   Time Spent (Sec)  Items Viewed
9:00   U1       Apps   59                5
9:30   U2       Apps   179               15
10:00  U3       Music  29                3
1:00   U1       Music  89                10
…      …        …      …                 …

Financial Transactions System Log

Time   User ID  Site   Purchased    Revenue
9:00   U1       Apps   FaceTune     $3.99
9:30   U2       Apps   Minecraft    $6.99
10:00  U3       Music  Purple Rain  $1.29
10:05  U3       Apps   Minecraft    $6.99
…      …        …      …            …

Unique User Queries

  • Unique users viewing Apps since 9:45…?
  • Unique users visiting Apps site AND Music site?
  • Unique users visiting Apps site AND NOT Music site?

Quantile Queries

  • The median and 95%ile Time Spent seconds by ...?
  • A Frequency Histogram of Time Spent by Split-Points specified at query time?

Frequency Queries

  • The numbers of times each app was purchased

Join Queries

  • For all users that purchased Apps, what is the average / median time spent?

SLIDE 46

Number of unique users (hard)

data → Map (key=user) → Reduce (return 1) → Reduce (sum)
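The exact pipeline on this slide can be mimicked in a single process to show why it is the hard case: grouping by user means holding every distinct user key somewhere. The sketch below is illustrative only; the record layout (user, site) and the function name are assumptions, and a real map/reduce job would shuffle keys across machines rather than use a dictionary.

```python
from collections import defaultdict


def count_unique_users(records):
    """Exact distinct-user count in three stages, mirroring the slide.

    Records are assumed to be (user, site) pairs. Memory grows with the
    number of distinct users, which is the expensive part at scale.
    """
    # Map: key every record by its user
    groups = defaultdict(list)
    for user, site in records:
        groups[user].append(site)
    # Reduce 1: each user's group returns 1, regardless of its size
    ones = [1 for _ in groups]
    # Reduce 2: sum the ones
    return sum(ones)


# Hypothetical sample in the style of a web site log.
sample_log = [("U1", "Apps"), ("U2", "Apps"), ("U3", "Music"), ("U1", "Music")]
n_unique = count_unique_users(sample_log)
```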

SLIDE 47

Number of unique users (made easy)

data → Map (sketch) → Reduce (merge)
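To show what "map to a sketch, reduce by merging" can look like, here is a small k-minimum-values (KMV) estimator: each partition keeps only the k smallest hash values it has seen, and two sketches merge by keeping the k smallest values of their union. This is a textbook technique chosen for illustration; it is not the DataSketches API, and the class name, `k=256`, and the SHA-1 based hash are assumptions.

```python
import hashlib
import heapq
from functools import reduce


class KMVSketch:
    """Illustrative k-minimum-values sketch for counting distinct items."""

    def __init__(self, k=256):
        self.k = k
        self.values = set()          # the k smallest hash values seen so far

    @staticmethod
    def _hash(item):
        # Deterministic hash mapped into [0, 1)
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2 ** 64

    def update(self, item):
        self.values.add(self._hash(item))
        if len(self.values) > self.k:
            # O(k) eviction; a heap would be faster but this keeps the sketch short
            self.values.remove(max(self.values))

    def merge(self, other):
        merged = KMVSketch(self.k)
        merged.values = set(heapq.nsmallest(self.k, self.values | other.values))
        return merged

    def estimate(self):
        if len(self.values) < self.k:
            return float(len(self.values))       # fewer than k distinct items: exact
        return (self.k - 1) / max(self.values)


def sketch_partition(items, k=256):
    """Map step: summarize one partition of the data."""
    sketch = KMVSketch(k)
    for item in items:
        sketch.update(item)
    return sketch


def count_unique(partitions, k=256):
    """Reduce step: merge the per-partition sketches and query the result."""
    sketches = [sketch_partition(p, k) for p in partitions]
    return reduce(lambda a, b: a.merge(b), sketches).estimate()
```

Because duplicates hash to the same value, users that appear in several partitions are counted once after the merge, and each sketch stays at k values no matter how large its partition is.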

SLIDE 48

Current Sketch Implementations

Count Unique Sketches

– Both Theta Sketches* and HLL Sketches
– Estimating Cardinality of a stream of identifiers with duplicates
– Set Operations (e.g., Union, Intersection, and Difference)
– Can be extended to produce approximate Joins

Quantiles Sketches

– Normal or Inverse PMFs and CDFs of streams of numeric values, using after-the-fact queries.

Frequent Item Sketches

– Identify the Heavy Hitters of arbitrary objects from a stream of objects
– Estimate the frequency of any item from the stream
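As one concrete example of a frequent-items summary, the classic Misra-Gries algorithm keeps at most k - 1 counters and never overestimates a count. It is shown here only to illustrate the idea; the library's own frequent item sketches may use a different algorithm, and the function name and toy stream are assumptions.

```python
def misra_gries(stream, k):
    """Approximate item counts using at most k - 1 counters.

    For a stream of n items, each reported count is at most the true
    count and undercounts it by no more than n / k.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement everything and drop zeros
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Hypothetical stream of purchased app names.
purchases = ["a"] * 6 + ["b"] * 3 + ["c"]
summary = misra_gries(purchases, k=3)
```

Items that survive in `summary` are the heavy-hitter candidates, and looking up any item in it (defaulting to 0) gives a frequency estimate.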

SLIDE 49

DataSketches.GitHub.io Open Source Library

  • Dedicated to production quality Sketch implementations.
      – These are not toy algorithms!
      – Heavily used within Yahoo
  • Common Attributes
      – True streaming. Single pass, “one-touch” algorithms for either real-time or batch
      – All Sketches are Mergeable, which makes them highly parallelizable.
      – Designed for multiple large-scale computing environments:
          • Core of library is coded in Java with no external dependencies
          • Easy integration into virtually any system environment
          • Adaptors for Hadoop/Pig and Hadoop/Hive environments
          • Standard library promotes sharing across platforms and organizations
      – Maven deployable and registered with Maven Central Repository
          • http://search.maven.org/#search|ga|1|datasketches
      – Comprehensive unit tests and testing tools are provided
      – Extensive documentation with Systems Developers in mind
      – All algorithms are backed by published mathematical theory

SLIDE 50

Counting distinct elements example

10M sender domains from inbound emails

    $ less emails.csv | wc -l
    10000000
    $ head -n 5 emails.csv
    facebookmail.com
    jobsdbalert.co.id
    facebookmail.com
    twitter.com
    bonsplansdujour.net
    $ cat emails.csv | sort | uniq | wc -l
    ^C
    $ cat emails.csv | sort -u -S 100% | wc -l
    ^C
    $ cat emails.csv | sketch uniq
    47618 40772 55589
    $ cat emails.csv | sketch uniq 0.01
    53782 53351 54216

Notes on the session:

  • There are duplicates
  • Sorting: roughly 200 MB and several minutes of CPU (~25 seconds for numbers)
  • Sketch: < 10 KB of memory and 1.5 seconds!

SLIDE 51