ONLINE MACHINE LEARNING AND DATA MINING EDO LIBERTY

STANDARD MACHINE LEARNING SETTING [Diagram: labeled training data (“call”, “crawl”) → Train → Model; test data (“?”) → Apply → Label (“call”)] 2

STANDARD MACHINE LEARNING SETTING
- Predicting the future is impossible (in general)
- Big ML means optimization on big data
- Data is generated by a stochastic process
- More training data is better
3

MORE DATA IS OFTEN WORSE (MORE DATA = OLDER DATA) 4

OUR ACTIONS HEAVILY INFLUENCE THE DATA 5

THE FUTURE IS OFTEN NOT LIKE THE PAST! Same story line or not? 1) The answer depends on the future 2) We have to decide now… 6

HAVING “A MODEL” IS COMPLETELY UNIMPORTANT Elements of information theory, Cover, 1991 Efficient algorithms for universal portfolios, Kalai, Vempala, 2003 Efficient Algorithms for Online Game Playing and Universal Portfolio Management, Agarwal, Hazan, 2006 7

ONLINE ALGORITHMS (DECISION MAKING WITHOUT PREDICTING) 8

THE SKI RENTAL PROBLEM Rent: $x/day Buy: $1000 9

THE SKI RENTAL PROBLEM Input: 70. Computation: 70. Output: R 10

THE SKI RENTAL PROBLEM Input: 70 90. Computation: 70 + 90 = 160. Output: R R 11

THE SKI RENTAL PROBLEM Input: 70 90 80. Computation: 70 + 90 + 1000 = 1160. Output: R R B 12

THE SKI RENTAL PROBLEM Input: 70 90 80 70. Computation: 70 + 90 + 1000 + 0 = 1160. Output: R R B 13

THE SKI RENTAL PROBLEM Input: 70 90 80 70. Computation: 70 + 90 + 80 + 70 = 310. Output: R R R R. You should have rented all along… 14

THE SKI RENTAL PROBLEM Input: 70 90 80 70 90 88 72 79. Computation: $1000 $1000. Output: R R R R B 15

THE SKI RENTAL PROBLEM ALG ≤ 2·OPT [Figure labels: Algorithm, Buy, Optimal in hindsight] 16
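The slides state the guarantee ALG ≤ 2·OPT without spelling out the rule that achieves it. A minimal sketch, assuming the classic break-even rule (keep renting while the total rent paid stays within the purchase price, then buy); the function names `ski_rental` and `optimal_in_hindsight` are illustrative and not from the talk:

```python
def ski_rental(prices, buy_price):
    """Break-even rule: rent while total rent stays within buy_price, then buy.

    Each decision uses only the prices seen so far (an online algorithm).
    Returns the list of decisions and the total cost paid.
    """
    paid = 0
    owned = False
    decisions = []
    for price in prices:
        if owned:
            decisions.append("-")      # already own skis, nothing to pay
        elif paid + price <= buy_price:
            paid += price              # renting is still within budget
            decisions.append("R")
        else:
            paid += buy_price          # renting again would exceed the buy price
            owned = True
            decisions.append("B")
    return decisions, paid


def optimal_in_hindsight(prices, buy_price):
    """Best offline choice: rent every day or buy on day one."""
    return min(sum(prices), buy_price)


# Daily rental prices from the slides, with a $1000 purchase price.
season = [70, 90, 80, 70, 90, 88, 72, 79]
decisions, cost = ski_rental(season, 1000)
print(decisions, cost, optimal_in_hindsight(season, 1000))
```

If the rule never buys, it pays exactly the rent-only optimum. If it buys, it has paid at most the purchase price in rent plus the purchase itself, while the optimum is the purchase price, which gives the factor of 2.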

ONLINE LINEAR CLASSIFICATION 17

ONLINE MACHINE LEARNING Emails → Computation → Spam? N 18

ONLINE MACHINE LEARNING Emails → Computation → Spam? N N 19

ONLINE MACHINE LEARNING Emails → Computation → Spam? N N Y 20

ONLINE MACHINE LEARNING Number of mistakes is compared to the best classifier in hindsight! Variants of SGD have this property. Computation output (Spam?): N N Y N Y N N Y. Prediction, Learning, and Games, Cesa-Bianchi, Lugosi, 2006 21
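To make the predict-then-learn loop concrete, here is a minimal sketch using the perceptron, a simple mistake-driven update that can be read as an SGD variant. The talk does not say which algorithm is used, so the choice of perceptron, the function name `online_perceptron`, and the toy data are assumptions:

```python
def online_perceptron(stream, dim):
    """Predict each label before seeing it, then update only on mistakes.

    stream yields (x, y) pairs with x a list of floats and y in {-1, +1}.
    Returns the final weights, the number of mistakes, and all predictions.
    """
    w = [0.0] * dim
    mistakes = 0
    predictions = []
    for x, y in stream:
        score = sum(wi * xi for wi, xi in zip(w, x))
        y_hat = 1 if score > 0 else -1          # commit to a prediction first
        predictions.append(y_hat)
        if y_hat != y:                          # the true label is revealed afterwards
            mistakes += 1
            w = [wi + y * xi for wi, xi in zip(w, x)]
    return w, mistakes, predictions


# Toy stream: label is +1 when the first feature dominates, else -1.
examples = [([2.0, 0.0], 1), ([0.0, 2.0], -1), ([3.0, 1.0], 1), ([1.0, 3.0], -1)]
w, mistakes, predictions = online_perceptron(examples * 5, dim=2)
print(w, mistakes)
```

On linearly separable data like this toy stream, the number of perceptron mistakes is bounded by (R/γ)², where R is the largest input norm and γ the margin of the best separator, regardless of the order of the stream. Here R² = 10 and γ² = 2, so at most 5 mistakes are possible.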

ONLINE PRINCIPAL COMPONENT ANALYSIS Online Principal Components Analysis, Boutsidis, Garber, Karnin, Liberty 2014 Online PCA with Spectral Bounds, Karnin, Liberty, 2015 22

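[Figure: data point $x_i$] 23

[Figure: data point $x_i$ and projection $\Phi\Phi^T x$] 24

[Figure: data point $x_i$, projection $\Phi\Phi^T x$, and residual $\|x_i - \Phi^T \Phi x_i\|$] 25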

Eigenpets: https://bioramble.wordpress.com/2015/09/01/ 26

ONLINE PRINCIPAL COMPONENT ANALYSIS Online PCA with Spectral Bounds, Karnin, Liberty, 2015 27

ONLINE K-MEANS CLUSTERING An Algorithm for Online K-Means Clustering, Liberty, Sriharsha, Sviridenko, 2014 29

K-MEANS CLUSTERING http://en.wikipedia.org/wiki/MNIST_database http://research.ics.aalto.fi/mi/software/ne/ 30

K-MEANS CLUSTERING - Roughly 20,000 documents - 20 topics: - Graphics - PC hardware - Baseball - For-sale - Politics - … http://qwone.com/~jason/20Newsgroups/ http://research.ics.aalto.fi/mi/software/ne/ 31

K-MEANS CLUSTERING 1) One can cluster points fully online 2) Create only slightly more than k centers 3) Be competitive with the best offline clustering to k clusters An Algorithm for Online K-Means Clustering, Liberty, Sriharsha, Sviridenko 2015 32
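The three properties above can be illustrated with a simplified sketch: each arriving point is either assigned to its nearest existing center or, with probability proportional to its squared distance, becomes a new center. This is written in the spirit of the cited paper but is not its exact algorithm; the initial facility cost, the doubling schedule (`3 * k` new centers), and all names are assumptions made for illustration:

```python
import itertools
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def online_kmeans(points, k, seed=0):
    """Assign each point to a center as it arrives; centers are never moved.

    Returns (centers, labels), where labels[i] indexes into centers.
    Requires at least k + 1 points (with k >= 1) in the stream.
    """
    rng = random.Random(seed)
    stream = iter(points)
    # The first k + 1 points become centers.
    centers = list(itertools.islice(stream, k + 1))
    labels = list(range(len(centers)))
    # Assumed initial cost of opening a center, based on the closest initial pair.
    min_sq = min(sq_dist(a, b) for i, a in enumerate(centers) for b in centers[i + 1:])
    open_cost = max(min_sq / 2.0, 1e-12) / k
    opened, budget = 0, 3 * k              # assumed doubling schedule
    for p in stream:
        nearest, d2 = min(
            ((j, sq_dist(p, c)) for j, c in enumerate(centers)),
            key=lambda pair: pair[1],
        )
        if rng.random() < min(d2 / open_cost, 1.0):
            centers.append(p)              # far-away points tend to open new centers
            labels.append(len(centers) - 1)
            opened += 1
            if opened >= budget:           # too many centers: make opening costlier
                open_cost *= 2.0
                opened = 0
        else:
            labels.append(nearest)
    return centers, labels


# Three well-separated synthetic clusters, visited in random order.
gen = random.Random(1)
data = [(cx + gen.gauss(0, 1), cy + gen.gauss(0, 1))
        for cx, cy in [(0, 0), (10, 10), (-10, 10)] for _ in range(100)]
gen.shuffle(data)
centers, labels = online_kmeans(data, k=3)
print(len(centers), "centers for", len(data), "points")
```

Every decision is made when the point arrives and is never revisited, which is what "fully online" means here. The price is that the sketch may open more than k centers.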

ONLINE K-MEANS CLUSTERING [Plot: one point per dataset (20news-binary, adult, ijcnn1, letter, magic04, maptaskcoref, nomao, poker, shuttle.binary, skin, vehv2binary, w8all); both axes range from 0 to 1.2] An Algorithm for Online K-Means Clustering, Liberty, Sriharsha, Sviridenko 2015 k-means++: the advantages of careful seeding, Arthur, Vassilvitskii, 2006 33

STREAMING ALGORITHMS OPEN SOURCE FROM YAHOO EDO LIBERTY

DATASKETCHES.GITHUB.IO 35

The World → Data → Computation → Result 36

DISTRIBUTED STORAGE The World → Data (×4) → Computation → Result 37

DISTRIBUTED MODEL (MAP/REDUCE, MESSAGE PASSING, …) The World → Data + Compute (×8 nodes) → Computation → Result 38

DISTRIBUTED MODEL (INDEXES, TABLES, DATABASES, …) The World → Data + Compute (×8 nodes) → Computation; Query → Computation → Result 39

BIG-DATA META INFOGRAPHIC 40

THE STREAMING COMPUTATIONAL MODEL The World Sketch Result Query Result 41

THE STREAMING COMPUTATIONAL MODEL [Diagram: stream of O(n) items (1 7 8 1 0 1 7 7) → Iterator → Computation → Sketch using O(polylog(n)) space → Query] 42

THE DISTRIBUTED STREAMING COMPUTATIONAL MODEL The World → Sketch (×4) → Merge → Sketch 43

Number of users (easy): data → Map (count) → Reduce (sum) 44

Web Site Logs
  Time   User ID  Site   Time Spent (Sec)  Items Viewed
  9:00   U1       Apps   59                5
  9:30   U2       Apps   179               15
  10:00  U3       Music  29                3
  1:00   U1       Music  89                10
  …      …        …      …                 …
Financial Transactions System Log
  Time   User ID  Site   Purchased    Revenue
  9:00   U1       Apps   FaceTune     $3.99
  9:30   U2       Apps   Minecraft    $6.99
  10:00  U3       Music  Purple Rain  $1.29
  10:05  U3       Apps   Minecraft    $6.99
  …      …        …      …            …
Unique User Queries
  • Unique users viewing Apps since 9:45…?
  • Unique users visiting Apps site AND Music site?
  • Unique users visiting Apps site AND NOT Music site?
Frequency Queries
  • The number of times each app was purchased
Join Queries
  • For all users that purchased Apps, what is the average / median time spent?
Quantile Queries
  • The median and 95%ile Time Spent seconds by ...?
  • A Frequency Histogram of Time Spent by Split-Points specified at query time?
45

Number of unique users (hard): data → Map (key=user) → Reduce (return 1) → Reduce (sum) 46

Number of unique users (made easy): data → Map (sketch) → Reduce (merge) 47
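The "Map (sketch), Reduce (merge)" pattern can be sketched with a small K-Minimum-Values (KMV) summary: hash every identifier to a number in [0, 1), keep only the k smallest distinct hash values, and estimate the distinct count from the k-th smallest. This is an illustrative toy, not the DataSketches library's implementation; the class name `KMVSketch`, the use of SHA-1, and the default `k` are assumptions:

```python
import hashlib
import heapq


class KMVSketch:
    """Keeps the k smallest distinct hash values seen so far."""

    def __init__(self, k=1024):
        self.k = k
        self._heap = []         # negated hashes, so the largest kept hash is at the root
        self._members = set()   # the kept hash values, for duplicate detection

    @staticmethod
    def _hash(item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2.0 ** 64

    def _add_hash(self, h):
        if h in self._members:
            return                                   # duplicates change nothing
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def update(self, item):
        """Map step: feed one stream item into the sketch."""
        self._add_hash(self._hash(item))

    def merge(self, other):
        """Reduce step: combine two sketches into one summarizing both streams."""
        merged = KMVSketch(min(self.k, other.k))
        for h in self._members | other._members:
            merged._add_hash(h)
        return merged

    def estimate(self):
        if len(self._heap) < self.k:
            return float(len(self._heap))            # fewer than k distinct items: exact
        return (self.k - 1) / (-self._heap[0])       # k-th smallest hash sets the scale


# Two partitions with overlapping users, sketched independently and then merged.
part_a, part_b = KMVSketch(), KMVSketch()
for user in range(0, 30000):
    part_a.update(user)
for user in range(20000, 50000):
    part_b.update(user)
print(round(part_a.merge(part_b).estimate()))
```

Because the k smallest hashes of a union are contained in the union of each part's k smallest hashes, merging partition sketches gives the same result as sketching the whole stream at once. That property is what lets the reduce step be a simple merge.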

Current Sketch Implementations
Count Unique Sketches
  – Both Theta Sketches* and HLL Sketches
  – Estimating Cardinality of a stream of identifiers with duplicates
  – Set Operations (e.g., Union, Intersection, and Difference)
  – Can be extended to produce approximate Joins
Quantiles Sketches
  – Normal or Inverse PMF’s, CDF’s of streams of numeric values, using after-the-fact queries.
Frequent Item Sketches
  – Identify the Heavy Hitters of arbitrary objects from a stream of objects
  – Estimate the frequency of any item from the stream
48
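As an illustration of the frequent-items idea, here is a minimal sketch of the classic Misra-Gries summary, which finds heavy hitters with a fixed number of counters. The slides do not say which algorithm the library uses, so this is a stand-in, and the function name `misra_gries` is hypothetical:

```python
def misra_gries(stream, k):
    """Summarize a stream with at most k - 1 counters.

    Each returned count underestimates the true frequency by at most n / k,
    where n is the stream length, so any item occurring more than n / k
    times is guaranteed to appear in the result.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement everything and drop counters that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# 100 purchases: two popular apps and 20 apps bought once each.
purchases = ["Minecraft", "FaceTune", "Minecraft", "rare-%d", "Minecraft"] * 20
purchases = [p % i if "%d" in p else p for i, p in enumerate(purchases)]
print(misra_gries(purchases, k=5))
```

With n = 100 and k = 5 the error is at most 20, so "Minecraft" (60 purchases) must be reported with a count between 40 and 60. "FaceTune" sits exactly at the n / k threshold of 20 and is not guaranteed to survive.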

DataSketches.GitHub.io Open Source Library
• Dedicated to production quality Sketch implementations.
  – These are not toy algorithms!
  – Heavily used within Yahoo
• Common Attributes
  – True streaming. Single pass, “one-touch” algorithms for either real-time or batch
  – All Sketches are Mergeable, which makes them highly parallelizable.
  – Designed for multiple large-scale computing environments:
    • Core of library is coded in Java with no external dependencies
    • Easy integration into virtually any system environment
    • Adaptors for Hadoop/Pig and Hadoop/Hive environments
    • Standard library promotes sharing across platforms and organizations
  – Maven deployable and registered with Maven Central Repository
    • http://search.maven.org/#search|ga|1|datasketches
  – Comprehensive unit tests and testing tools are provided
  – Extensive documentation with Systems Developers in mind
  – All algorithms are backed by published mathematical theory
49

Counting distinct elements example
10M sender domains from inbound emails
  $ less emails.csv | wc -l
  10000000
  $ head -n 5 emails.csv
  facebookmail.com
  jobsdbalert.co.id
  facebookmail.com
  twitter.com
  bonsplansdujour.net
There are duplicates
  $ cat emails.csv | sort | uniq | wc -l
  ^C
Roughly 200 MB and several minutes of CPU (~25 seconds for numbers)
  $ cat emails.csv | sort -u -S 100% | wc -l
  ^C
  $ cat emails.csv | sketch uniq
  47618 40772 55589
< 10 KB of memory and 1.5 seconds!
  $ cat emails.csv | sketch uniq 0.01
  53782 53351 54216
50
