Data-Intensive Distributed Computing CS 431/631 451/651 (Fall 2020)


SLIDE 1

Data-Intensive Distributed Computing
CS 431/631 451/651 (Fall 2020)
Part 7: Data Mining (2/4)
Ali Abedi

These slides are available at https://www.student.cs.uwaterloo.ca/~cs451

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

SLIDE 2

Stochastic Gradient Descent

Source: Wikipedia (Water Slide)

SLIDE 3-5

Gradient Descent vs. Stochastic Gradient Descent (SGD)

Gradient descent considers all training instances in every iteration; stochastic gradient descent considers a single random instance in every iteration.

SLIDE 6

Mini-batching

A middle ground between batch gradient descent and stochastic gradient descent: considers a random subset of instances in every iteration.
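The contrast on these slides — all training instances per step (batch), one random instance (SGD), or a random subset (mini-batch) — comes down to one line in each update rule. A minimal NumPy sketch on a toy least-squares problem (the data, learning rate, and function names are mine, not from the course):

```python
import numpy as np

def batch_gd_step(w, X, y, lr=0.1):
    """Batch gradient descent: one step uses ALL training instances."""
    grad = X.T @ (X @ w - y) / len(y)  # gradient of mean squared error
    return w - lr * grad

def sgd_step(w, X, y, rng, lr=0.1):
    """Stochastic gradient descent: one step uses ONE random instance."""
    i = rng.integers(len(y))
    xi, yi = X[i], y[i]
    return w - lr * (xi @ w - yi) * xi

def minibatch_step(w, X, y, rng, batch_size=8, lr=0.1):
    """Mini-batching: one step uses a random SUBSET of instances."""
    idx = rng.choice(len(y), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return w - lr * Xb.T @ (Xb @ w - yb) / batch_size

# Toy problem: recover w* = [2.0, -1.0] from noiseless linear data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -1.0])

w = np.zeros(2)
for _ in range(500):
    w = sgd_step(w, X, y, rng)  # converges close to w*
```

Swapping `sgd_step` for `batch_gd_step` or `minibatch_step` changes only how much data each step touches, not the shape of the loop.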

SLIDE 7

Ensembles

Source: Wikipedia (Orchestra)

SLIDE 8

Ensemble Learning

Learn multiple models, then combine results from the different models to make a prediction.

Common implementation: train classifiers on different input partitions of the data. Embarrassingly parallel!

Combining predictions: majority voting, model averaging.
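The two combination rules named above are one-liners; a plain-Python sketch (the vote and score values are made up for illustration):

```python
from collections import Counter

def majority_vote(predictions):
    """Combine class predictions from several models: most common label wins."""
    return Counter(predictions).most_common(1)[0][0]

def model_average(scores):
    """Combine real-valued model outputs (e.g., probabilities) by averaging."""
    return sum(scores) / len(scores)

# Each classifier was (hypothetically) trained on a different partition of
# the training data -- embarrassingly parallel -- and here each one votes.
print(majority_vote(["positive", "negative", "positive"]))   # -> positive
print(round(model_average([0.9, 0.4, 0.8]), 2))              # -> 0.7
```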

SLIDE 9

Ensemble Learning

Learn multiple models, combine results from different models to make a prediction.

Why does it work? If the errors are uncorrelated, multiple classifiers being wrong at the same time is less likely. Ensembling reduces the variance component of error.
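The "uncorrelated errors" argument can be checked with a binomial calculation: if each of n independent classifiers errs with probability p, a majority vote errs only when more than half of them do. A sketch (n = 5 and p = 0.3 are arbitrary):

```python
from math import comb

def majority_error(n, p):
    """Probability that a majority of n independent classifiers, each wrong
    with probability p, is wrong (errors assumed uncorrelated)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# One classifier is wrong 30% of the time; an ensemble of 5 with a
# majority vote is wrong far less often -- IF the errors are uncorrelated.
print(round(majority_error(1, 0.3), 4))  # -> 0.3
print(round(majority_error(5, 0.3), 4))  # -> 0.1631
```

When the errors are correlated (e.g., all models trained on the same biased features), this bound does not hold — which is why diverse partitions help.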

SLIDE 10

MapReduce Implementation

SLIDE 11

Gradient Descent

[Diagram: training data is split across mappers; each mapper computes a partial gradient; a reducer combines them and updates the model; iterate until convergence.]
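The dataflow on this slide — mappers emit partial gradients over their shards, a reducer sums them and updates the model, and the driver iterates until convergence — can be mimicked in plain Python (an illustrative sketch, not course code; the toy data and learning rate are assumptions):

```python
import numpy as np

def map_gradient(partition, w):
    """Mapper: partial gradient (and count) over one shard of training data."""
    X, y = partition
    return X.T @ (X @ w - y), len(y)

def reduce_update(partials, w, lr=0.1):
    """Reducer: sum the partial gradients and apply one model update."""
    grad = sum(g for g, _ in partials)
    n = sum(c for _, c in partials)
    return w - lr * grad / n

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 2))
y = X @ np.array([1.5, -2.0])
partitions = [(X[i:i + 100], y[i:i + 100]) for i in range(0, 400, 100)]

w = np.zeros(2)
for _ in range(200):  # the driver iterates until convergence
    partials = [map_gradient(p, w) for p in partitions]  # map phase
    w = reduce_update(partials, w)                       # reduce phase
```

The expensive part in real MapReduce is that every iteration is a full job: reloading data and re-shuffling per step is exactly the overhead the SGD formulation on the next slides avoids.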

SLIDE 12

Stochastic Gradient Descent

[Diagram: training data is split across mappers; each mapper runs a learner over its records.]

No iteration! This is great because we no longer need iterations: each mapper goes through its records, applies the stochastic gradient descent update rule to each record, and updates its model. This process continues for all records.

SLIDE 13

Stochastic Gradient Descent

[Diagram: mappers run learners over their partitions of the training data; reducers collect the resulting models.]

No iteration!

SLIDE 14

MapReduce Implementation

How do we output the model?
Option 1: write the model out as “side data”
Option 2: emit the model as intermediate output

SLIDE 15

What about Spark?

mapPartitions with f: (Iterator[T]) ⇒ Iterator[U] turns an RDD[T] into an RDD[U]; run a learner inside each partition.
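Spark's mapPartitions applies a function of type Iterator[T] ⇒ Iterator[U] once per partition rather than once per element, which is what lets a learner consume a whole partition's records. A plain-Python stand-in for the semantics (no Spark involved; the per-partition "learner" is a deliberately trivial placeholder):

```python
def map_partitions(partitions, f):
    """Plain-Python stand-in for Spark's mapPartitions: apply
    f: Iterator[T] -> Iterator[U] once per partition, not once per element."""
    return [list(f(iter(part))) for part in partitions]

def train_learner(records):
    """One 'learner' per partition: here it just averages the labels it sees
    and emits a single model summary for its partition (a toy stand-in for
    running SGD over the partition's records)."""
    labels = [label for _, label in records]
    yield sum(labels) / len(labels)

partitions = [
    [("a", 1.0), ("b", 0.0)],
    [("c", 1.0), ("d", 1.0)],
]
models = map_partitions(partitions, train_learner)
print(models)  # -> [[0.5], [1.0]]
```

The key design point is that `train_learner` holds state across all records of a partition — exactly what per-element `map` cannot do.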

SLIDE 16

In practice …

Data scientists usually use the transformations provided in Spark ML:

val model = LinearRegressionWithSGD.train(parsedData, numIterations, stepSize)
val prediction = model.predict(point.features)

SLIDE 17

Sentiment Analysis Case Study

Source: Lin and Kolcz. (2012) Large-Scale Machine Learning at Twitter. SIGMOD.

Binary polarity classification: {positive, negative} sentiment
Use the “emoticon trick” to gather data

Data:
Test: 500k positive / 500k negative tweets from 9/1/2011
Training: {1m, 10m, 100m} instances from before (50/50 split)

Features: sliding-window byte 4-grams

Models + optimization: logistic regression with SGD (L2 regularization); ensembles of various sizes (simple weighted voting)

SLIDE 18

“for free”

Ensembles trained on 10m examples beat a single classifier trained on 100m! Diminishing returns…

[Chart legend: single classifier, 10m instances, 100m instances]

SLIDE 19

Supervised Machine Learning

[Diagram: training data feeds a machine learning algorithm that produces a model; at testing/deployment time, the model predicts labels for unseen instances.]

SLIDE 20

Evaluation

How do we know how well we’re doing?

Induce a model such that loss is minimized.

We need end-to-end metrics! Obvious metric: accuracy.

SLIDE 21

Metrics

True Positive (TP)
True Negative (TN)
False Positive (FP) = Type I Error
False Negative (FN) = Type II Error

(Confusion matrix: predicted positive/negative vs. actual positive/negative.)

Precision = TP/(TP + FP)
Recall or TPR = TP/(TP + FN)
Miss rate = FN/(FN + TP)
Fall-out or FPR = FP/(FP + TN)
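All four ratios (plus accuracy) fall out of the confusion-matrix counts; a quick sketch with invented counts. Note that recall and miss rate are complements:

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the slide's metrics from confusion-matrix counts."""
    return {
        "precision": tp / (tp + fp),
        "recall_tpr": tp / (tp + fn),
        "miss_rate_fnr": fn / (fn + tp),
        "fallout_fpr": fp / (fp + tn),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

# Invented counts for illustration.
m = classification_metrics(tp=40, fp=10, fn=20, tn=30)
print(m["precision"])             # -> 0.8
print(round(m["recall_tpr"], 3))  # -> 0.667
```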

SLIDE 22

SLIDE 23

ROC and PR Curves

Source: Davis and Goadrich. (2006) The Relationship Between Precision-Recall and ROC Curves

AUC

A receiver operating characteristic (ROC) curve is a graphical plot that illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied.
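The AUC can be computed without drawing the curve: it equals the probability that a randomly chosen positive instance is scored above a randomly chosen negative one (ties counting half). A sketch with made-up scores:

```python
def auc(scores, labels):
    """AUC as P(random positive outranks random negative), ties counting
    half -- equivalent to the area under the ROC curve traced by sweeping
    the classifier's discrimination threshold."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.7, 0.4, 0.3]
labels = [1, 1, 0, 1, 0]
print(round(auc(scores, labels), 3))  # -> 0.833 (5 of 6 pairs ranked correctly)
```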

SLIDE 24-29

Cross-Validation

Training/Testing Splits

[Diagrams: across the slides, the held-out test split rotates through successive folds of the data.]
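The rotating train/test splits pictured across these slides are k-fold cross-validation: each fold serves as the test split exactly once. A sketch of the index bookkeeping (fold counts are arbitrary):

```python
def kfold_splits(n, k):
    """Yield (train_indices, test_indices) for k-fold cross-validation:
    each of the k folds serves as the test split exactly once."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

for train, test in kfold_splits(n=10, k=5):
    print(test)  # each index appears in exactly one test split
```

Averaging the evaluation metric over the k folds gives a less noisy estimate than a single train/test split, at the cost of training k models.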

SLIDE 30

Typical Industry Setup

[Timeline: training → test → A/B test]

SLIDE 31

A/B Testing

Gather metrics, compare alternatives.

[Diagram: users are split into a Control group (X%) and a Treatment group (100 − X%).]
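Splitting users into control and treatment is usually done with a deterministic hash of the user and experiment identifiers, so a user stays in the same bucket across sessions and assignments are independent across experiments. A hedged sketch — the function and experiment names are hypothetical, not a description of any real system:

```python
import hashlib

def bucket(user_id, treatment_pct, experiment="exp_42"):
    """Deterministically assign a user to 'treatment' (X%) or
    'control' (100 - X%). Hashing (experiment, user_id) keeps the
    assignment stable across sessions and independent across experiments."""
    h = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "treatment" if int(h, 16) % 100 < treatment_pct else "control"

# The same user always lands in the same bucket for a given experiment.
print(bucket("alice", treatment_pct=10) == bucket("alice", treatment_pct=10))  # -> True
```

Salting the hash with the experiment name is what prevents the "multiple, interacting tests" problem on the next slide from being made worse by every test reusing the same user split.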

SLIDE 32

A/B Testing: Complexities

Properly bucketing users
Novelty
Learning effects
Long- vs. short-term effects
Multiple, interacting tests
Nosy tech journalists
…

SLIDE 33

Supervised Machine Learning

[Diagram: training data feeds a machine learning algorithm that produces a model; at testing/deployment time, the model predicts labels for unseen instances.]

SLIDE 34

Applied ML in Academia

Download interesting dataset (comes with the problem)
Run baseline model (train/test)
Build better model (train/test)
Does the new model beat the baseline?
Yes: publish a paper!
No: try again!

SLIDE 35

SLIDE 36

SLIDE 37

SLIDE 38

Fantasy:
Extract features
Develop cool ML technique
#Profit

Reality:
What’s the task?
Where’s the data?
What’s in this dataset?
What’s all the f#$!* crap?
Clean the data
Extract features
“Do” machine learning
Fail, iterate…

SLIDE 39

SLIDE 40

“It’s impossible to overstress this: 80% of the work in any data project is in cleaning the data.” – DJ Patil, “Data Jujitsu”

Source: Wikipedia (Jujitsu)

SLIDE 41

SLIDE 42

On finding things…

SLIDE 43

On naming things…

CamelCase, smallCamelCase, snake_case, camel_Snake, dunder__snake, userid, user_id

SLIDE 44

On feature extraction…

^(\\w+\\s+\\d+\\s+\\d+:\\d+:\\d+)\\s+ ([^@]+?)@(\\S+)\\s+(\\S+):\\s+(\\S+)\\s+(\\S+) \\s+((?:\\S+?,\\s+)*(?:\\S+?))\\s+(\\S+)\\s+(\\S+) \\s+\\[([^\\]]+)\\]\\s+\"(\\w+)\\s+([^\"\\\\]* (?:\\\\.[^\"\\\\]*)*)\\s+(\\S+)\"\\s+(\\S+)\\s+ (\\S+)\\s+\"([^\"\\\\]*(?:\\\\.[^\"\\\\]*)*) \"\\s+\"([^\"\\\\]*(?:\\\\.[^\"\\\\]*)*)\"\\s* (\\d*-[\\d-]*)?\\s*(\\d+)?\\s*(\\d*\\.[\\d\\.]*)? (\\s+[-\\w]+)?.*$

An actual Java regular expression used to parse log messages at Twitter circa 2010.

Friction is cumulative!

SLIDE 45

Data Plumbing… Gone Wrong!

[scene: consumer internet company in the Bay Area…]

Frontend Engineer: develops a new feature, adds logging code to capture clicks
Data Scientist: analyzes user behavior, extracts insights to improve the feature

“Okay, let’s get going… where’s the click data?”
“Well, that’s kinda non-intuitive, but okay…”
“Oh, BTW, where’s the timestamp of the click?”
“It’s over here…”
“Well, it wouldn’t fit, so we had to shoehorn…”
“Hang on, I don’t remember…”
“Uh, bad news. Looks like we forgot to log it…”
[grumble, grumble, grumble]

SLIDE 46

Fantasy:
Extract features
Develop cool ML technique
#Profit

Reality:
What’s the task?
Where’s the data?
What’s in this dataset?
What’s all the f#$!* crap?
Clean the data
Extract features
“Do” machine learning
Fail, iterate…

SLIDE 47

Congratulations, you’re halfway there…

Source: Wikipedia (Hills)

SLIDE 48

Does it actually work? Congratulations, you’re halfway there…

Is it fast enough? Good, you’re two-thirds there…

A/B testing

SLIDE 49

Productionize

Source: Wikipedia (Oil refinery)

SLIDE 50

Productionize

What are your jobs’ dependencies?
How/when are your jobs scheduled?
Infrastructure is critical here! Are there enough resources?
How do you know if it’s working?
Who do you call if it stops working? (plumbing)

SLIDE 51

Takeaway lesson: most of data science isn’t glamorous!

Source: Wikipedia (Plumbing)