SLIDE 1

Data-Intensive Distributed Computing
CS 431/631 451/651 (Winter 2019)

Part 6: Data Mining (2/4)

Adam Roegiest
Kira Systems
March 5, 2019

These slides are available at http://roegiest.com/bigdata-2019w

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

SLIDE 2

The Task

Given: training data $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, where $\mathbf{x}_i$ is a (sparse) feature vector and $y_i$ is its label.

Induce a function $f$ such that the loss is minimized:

$\arg\min_f \frac{1}{n} \sum_{i=1}^{n} \ell(f(\mathbf{x}_i), y_i)$

where $\ell$ is the loss function. Typically, we consider functions of a parametric form:

$\arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \ell(f(\mathbf{x}_i; \theta), y_i)$

where $\theta$ are the model parameters.
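To make this concrete (an illustration of mine, not on the original slide): for logistic regression with labels $y \in \{-1, +1\}$ and weights $\mathbf{w}$, the loss is the log loss

$\ell(f(\mathbf{x}; \mathbf{w}), y) = \log\left(1 + e^{-y(\mathbf{w} \cdot \mathbf{x})}\right)$

whose per-instance gradient, $\left(\frac{1}{1 + e^{-y(\mathbf{w} \cdot \mathbf{x})}} - 1\right) y\,\mathbf{x}$, is exactly the expression that appears in the Spark code on Slide 5.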

SLIDE 3

Gradient Descent

Source: Wikipedia (Hills)

SLIDE 4

MapReduce Implementation

[Diagram: mappers each compute a partial gradient over their split of the training data; a single reducer sums the partial gradients and updates the model; iterate until convergence]

SLIDE 5

Spark Implementation

    val points = spark.textFile(...).map(parsePoint).persist()
    var w = // random initial vector
    for (i <- 1 to ITERATIONS) {
      val gradient = points.map { p =>
        p.x * (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y
      }.reduce((a, b) => a + b)
      w -= gradient
    }

[Diagram: as on the previous slide, mappers compute partial gradients and a reducer updates the model. Note that persist() caches the parsed points, so each iteration avoids re-reading and re-parsing the input.]

SLIDE 6

Gradient Descent

Source: Wikipedia (Hills)

SLIDE 7

Stochastic Gradient Descent

Source: Wikipedia (Water Slide)

SLIDE 8

Batch vs. Online

Gradient Descent vs. Stochastic Gradient Descent (SGD)

“Batch” learning: update the model after considering all training instances
“Online” learning: update the model after considering each (randomly selected) training instance

In practice… online is just as good! And it offers the opportunity to interleave prediction and learning!
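A minimal sketch of the two regimes in plain Scala (my illustration, not from the deck; Point, eta, and the helpers are assumptions, though grad matches the per-instance expression in the Spark code on Slide 5):

    // Batch vs. online updates for logistic regression, labels in {-1, +1}.
    case class Point(x: Array[Double], y: Double)

    // Per-instance gradient of the log loss (same expression as Slide 5).
    def grad(w: Array[Double], p: Point): Array[Double] = {
      val dot = w.zip(p.x).map { case (wi, xi) => wi * xi }.sum
      val scale = (1.0 / (1.0 + math.exp(-p.y * dot)) - 1.0) * p.y
      p.x.map(_ * scale)
    }

    // One gradient step with learning rate eta.
    def step(w: Array[Double], g: Array[Double], eta: Double): Array[Double] =
      w.zip(g).map { case (wi, gi) => wi - eta * gi }

    // "Batch" learning: one update per full pass over the data.
    def batchStep(w: Array[Double], data: Seq[Point], eta: Double): Array[Double] = {
      val total = data.map(grad(w, _))
        .reduce((a, b) => a.zip(b).map { case (u, v) => u + v })
      step(w, total, eta)
    }

    // "Online" learning (SGD): one update per randomly ordered instance.
    def sgdEpoch(w: Array[Double], data: Seq[Point], eta: Double): Array[Double] =
      scala.util.Random.shuffle(data).foldLeft(w)((v, p) => step(v, grad(v, p), eta))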

SLIDE 9

Practical Notes

We’ve solved the iteration problem! What about the single-reducer problem?

The order of the training instances is important!
Most common implementation: randomly shuffle the training instances
Mini-batching is a middle ground between batch and online updates (see the sketch below)
Single- vs. multi-pass approaches
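Mini-batching, sketched under the same assumptions as the previous snippet (the batch size b is a hypothetical tuning knob):

    // Middle ground: shuffle, then make one averaged update per b instances.
    def miniBatchEpoch(w0: Array[Double], data: Seq[Point],
                       eta: Double, b: Int): Array[Double] =
      scala.util.Random.shuffle(data).grouped(b).foldLeft(w0) { (w, batch) =>
        val avg = batch.map(grad(w, _))
          .reduce((p, q) => p.zip(q).map { case (u, v) => u + v })
          .map(_ / batch.size)
        step(w, avg, eta)
      }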

SLIDE 10

Ensembles

Source: Wikipedia (Orchestra)

SLIDE 11

Ensemble Learning

Learn multiple models, then combine their results to make a prediction.

Common implementation: train classifiers on different partitions of the input data. Embarrassingly parallel!

Combining predictions (sketched below):
Majority voting
Simple weighted voting
Model averaging
…
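The combination strategies, sketched (my illustration; assumes each model emits a label in {-1, +1}, or a raw score for averaging):

    // Majority voting over predicted labels.
    def majorityVote(labels: Seq[Double]): Double =
      if (labels.sum >= 0) 1.0 else -1.0

    // Simple weighted voting: each model's vote is scaled by its weight.
    def weightedVote(labels: Seq[Double], weights: Seq[Double]): Double = {
      val tally = labels.zip(weights).map { case (l, a) => l * a }.sum
      if (tally >= 0) 1.0 else -1.0
    }

    // Model averaging: average the models' raw scores (e.g., probabilities).
    def modelAverage(scores: Seq[Double]): Double = scores.sum / scores.size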

SLIDE 12

Ensemble Learning

Learn multiple models, then combine their results to make a prediction.

Why does it work?
If errors are uncorrelated, it is less likely that multiple classifiers are wrong at the same time
Reduces the variance component of the error
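A quick way to see the variance claim (my derivation, not on the slide): if the errors $e_j$ of $k$ models each have variance $\sigma^2$ and are pairwise uncorrelated, then

$\mathrm{Var}\!\left(\frac{1}{k}\sum_{j=1}^{k} e_j\right) = \frac{1}{k^2}\sum_{j=1}^{k}\mathrm{Var}(e_j) = \frac{\sigma^2}{k}$

so averaging $k$ uncorrelated models cuts the variance component of the error by a factor of $k$.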

SLIDE 13

MapReduce Implementation

[Diagram: four splits of training data feed four mappers; each mapper runs a learner and trains one model of the ensemble independently]

SLIDE 14

MapReduce Implementation

[Diagram: the same four mappers now feed two reducers, each of which runs a learner; training happens reducer-side]

SLIDE 15

MapReduce Implementation

How do we output the model?
Option 1: write the model out as “side data”
Option 2: emit the model as intermediate output

SLIDE 16

What about Spark?

Use mapPartitions, with f: (Iterator[T]) ⇒ Iterator[U], which transforms an RDD[T] into an RDD[U]: run the learner inside f, producing one model per partition (sketched below).
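A sketch of ensemble training with mapPartitions (illustrative; Model and trainOnPartition are hypothetical stand-ins for a classifier type and a sequential learner, and points is the RDD of parsed instances from Slide 5):

    // f: Iterator[Point] => Iterator[Model]: one learner per partition,
    // consuming that partition's instances and emitting a single model.
    val models: RDD[Model] = points.mapPartitions { instances =>
      Iterator.single(trainOnPartition(instances))
    }
    models.saveAsObjectFile("models/")   // hypothetical output path

With n partitions, this trains an n-model ensemble in a single pass, with no shuffle.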

SLIDE 17

Classifier Training and Making Predictions

[Diagram: classifier training is just like any other parallel Pig dataflow: (label, feature vector) pairs flow through the previous Pig dataflow’s map and reduce stages, and a Pig storage function writes out the trained models]

[Diagram: to make predictions, each model is loaded into a UDF, which maps a feature vector to a prediction]

SLIDE 18

Classifier Training

    training = load 'training.txt' using SVMLightStorage()
        as (target: int, features: map[]);
    store training into 'model/'
        using FeaturesLRClassifierBuilder();

Want an ensemble? Randomly shuffle the instances and train five models in parallel:

    training = foreach training generate label, features,
        RANDOM() as random;
    training = order training by random parallel 5;

Logistic regression + SGD (L2 regularization)
Pegasos variant (fully SGD or sub-gradient)

SLIDE 19

Making Predictions

    define Classify ClassifyWithLR('model/');
    data = load 'test.txt' using SVMLightStorage()
        as (target: double, features: map[]);
    data = foreach data generate target,
        Classify(features) as prediction;

Want an ensemble?

    define Classify ClassifyWithEnsemble('model/',
        'classifier.LR', 'vote');

SLIDE 20

Sentiment Analysis Case Study

Binary polarity classification: {positive, negative} sentiment
Use the “emoticon trick” to gather labeled data

Data:
Training: {1m, 10m, 100m} instances (50/50 split) from before 9/1/2011
Test: 500k positive / 500k negative tweets from 9/1/2011

Features:
Sliding-window byte 4-grams

Models + Optimization:
Logistic regression with SGD (L2 regularization)
Ensembles of various sizes (simple weighted voting)

Source: Lin and Kolcz. (2012) Large-Scale Machine Learning at Twitter. SIGMOD.

SLIDE 21

Ensembles trained on 10m examples beat a single classifier trained on 100m! And the extra accuracy comes “for free” from parallelism. Diminishing returns, though…

[Chart: accuracy vs. training size; legend: single classifier, 10m ensembles, 100m ensembles]

SLIDE 22

Supervised Machine Learning

[Diagram: training data → machine learning algorithm → model; the trained model is then used at testing/deployment time]

SLIDE 23

Evaluation

How do we know how well we’re doing?

We induce $f$ such that the loss is minimized… but we need end-to-end metrics!

Obvious metric: accuracy

SLIDE 24

Metrics

                         Actual positive                        Actual negative
Predicted positive       True Positive (TP)                     False Positive (FP) = Type I error
Predicted negative       False Negative (FN) = Type II error    True Negative (TN)

Precision = TP/(TP + FP)
Recall or TPR = TP/(TP + FN)
Fall-out or FPR = FP/(FP + TN)
Miss rate or FNR = FN/(FN + TP)
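The same definitions as code (straightforward Scala, my phrasing):

    // Metrics derived from the binary confusion matrix.
    case class Confusion(tp: Long, fp: Long, fn: Long, tn: Long) {
      def precision: Double = tp.toDouble / (tp + fp)
      def recall:    Double = tp.toDouble / (tp + fn)   // true positive rate
      def fallout:   Double = fp.toDouble / (fp + tn)   // false positive rate
      def missRate:  Double = fn.toDouble / (fn + tp)   // false negative rate
      def accuracy:  Double = (tp + tn).toDouble / (tp + fp + fn + tn)
    }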

SLIDE 25

ROC and PR Curves

Source: Davis and Goadrich. (2006) The Relationship Between Precision-Recall and ROC curves

AUC

SLIDE 26

Cross-Validation

Training/Testing Splits

[Diagram, animated over several otherwise-identical slides in the original deck: the data is divided into segments; one segment is held out as Test and the rest are used for Training, with the held-out segment rotating through every position. A sketch follows.]
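A minimal k-fold sketch (illustrative; the train and evaluate functions are caller-supplied hypotheticals):

    // Each fold takes one turn as the test set; the k scores are averaged.
    def crossValidate[P, M](data: Seq[P], k: Int)
                           (train: Seq[P] => M)
                           (evaluate: (M, Seq[P]) => Double): Double = {
      val foldSize = math.ceil(data.size.toDouble / k).toInt
      val folds = scala.util.Random.shuffle(data).grouped(foldSize).toVector
      val scores = folds.indices.map { i =>
        val test     = folds(i)
        val training = folds.patch(i, Nil, 1).flatten
        evaluate(train(training), test)
      }
      scores.sum / scores.size
    }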


SLIDE 32

Typical Industry Setup

[Timeline: train on older data, test on more recent data, then A/B test live, moving forward in time]

SLIDE 33

A/B Testing

[Diagram: traffic is split between Control and Treatment, X% vs. 100 - X%]

Gather metrics, compare alternatives.
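One common way to implement the split (my sketch; the per-experiment salt is an assumption): hash the user ID so each user deterministically, and consistently, lands in the same bucket.

    // Deterministic bucketing: the same user always sees the same variant.
    def assignVariant(userId: String, treatmentPercent: Int): String = {
      val hash = (userId + ":experiment-salt").hashCode
      val bucket = ((hash % 100) + 100) % 100   // 0..99, safe for negative hashCodes
      if (bucket < treatmentPercent) "treatment" else "control"
    }

Salting per experiment keeps assignments independent across the multiple, interacting tests mentioned on the next slide.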

SLIDE 34

A/B Testing: Complexities

Properly bucketing users
Novelty
Learning effects
Long- vs. short-term effects
Multiple, interacting tests
Nosy tech journalists
…

SLIDE 35

Supervised Machine Learning

[Diagram, repeated from Slide 22: training data → machine learning algorithm → model → testing/deployment]

SLIDE 36

Applied ML in Academia

Download an interesting dataset (it comes with the problem)
Run a baseline model (train/test)
Build a better model (train/test)
Does the new model beat the baseline?
Yes: publish a paper!
No: try again!

SLIDE 37

SLIDE 38

SLIDE 39

Fantasy

Extract features
Develop cool ML technique
#Profit

Reality

What’s the task?
Where’s the data?
What’s in this dataset?
What’s all the f#$!* crap?
Clean the data
Extract features
“Do” machine learning
Fail, iterate…

SLIDE 40

“It’s impossible to overstress this: 80% of the work in any data project is in cleaning the data.” – DJ Patil, “Data Jujitsu”

Source: Wikipedia (Jujitsu)

SLIDE 41

SLIDE 42

On finding things…

SLIDE 43

On naming things…

CamelCase
smallCamelCase
snake_case
camel_Snake
dunder__snake
userid
user_id

SLIDE 44

On feature extraction…

    ^(\\w+\\s+\\d+\\s+\\d+:\\d+:\\d+)\\s+ ([^@]+?)@(\\S+)\\s+(\\S+):\\s+(\\S+)\\s+(\\S+) \\s+((?:\\S+?,\\s+)*(?:\\S+?))\\s+(\\S+)\\s+(\\S+) \\s+\\[([^\\]]+)\\]\\s+\"(\\w+)\\s+([^\"\\\\]* (?:\\\\.[^\"\\\\]*)*)\\s+(\\S+)\"\\s+(\\S+)\\s+ (\\S+)\\s+\"([^\"\\\\]*(?:\\\\.[^\"\\\\]*)*) \"\\s+\"([^\"\\\\]*(?:\\\\.[^\"\\\\]*)*)\"\\s* (\\d*-[\\d-]*)?\\s*(\\d+)?\\s*(\\d*\\.[\\d\\.]*)? (\\s+[-\\w]+)?.*$

An actual Java regular expression used to parse log messages at Twitter, circa 2010.

Friction is cumulative!

SLIDE 45

Data Plumbing… Gone Wrong!

[scene: consumer internet company in the Bay Area…]

Frontend Engineer: develops a new feature, adds logging code to capture clicks
Data Scientist: analyzes user behavior, extracts insights to improve the feature

Data Scientist: Okay, let’s get going… where’s the click data?
Frontend Engineer: It’s over here…
Data Scientist: Well, that’s kinda non-intuitive, but okay… Oh, BTW, where’s the timestamp of the click?
Frontend Engineer: Well, it wouldn’t fit, so we had to shoehorn… Hang on, I don’t remember…
Frontend Engineer: Uh, bad news. Looks like we forgot to log it…
Data Scientist: [grumble, grumble, grumble]

SLIDE 46

Fantasy vs. Reality

Fantasy: extract features, develop cool ML technique, #Profit
Reality: what’s the task? Where’s the data? What’s in this dataset? What’s all the f#$!* crap? Clean the data, extract features, “do” machine learning, fail, iterate…

SLIDE 47

Congratulations, you’re halfway there…

Source: Wikipedia (Hills)

SLIDE 48

Congratulations, you’re halfway there…

Does it actually work? (A/B testing)
Is it fast enough?

Good, you’re two-thirds there…

SLIDE 49

Productionize

Source: Wikipedia (Oil refinery)

SLIDE 50

Productionize

What are your jobs’ dependencies?
How/when are your jobs scheduled?
Are there enough resources?
How do you know if it’s working?
Who do you call if it stops working?
Infrastructure (the plumbing) is critical here!

SLIDE 51

Takeaway lesson: most of data science isn’t glamorous!

Source: Wikipedia (Plumbing)