

slide-1
SLIDE 1

Small Data Machine Learning

Andrei Zmievski

The goal is not a comprehensive introduction, but to plant a seed in your head to get you interested in this topic. Questions - now and later
slide-2
SLIDE 2

WORK

We are all superheroes, because we help our customers keep their mission-critical apps running smoothly. If interested, I can show you a demo of what I’m working on. Come find me.
slide-4
SLIDE 4

TRAVEL

slide-5
SLIDE 5

TAKE PHOTOS

slide-6
SLIDE 6

DRINK BEER

slide-7
SLIDE 7

MAKE BEER

slide-8
SLIDE 8

MATH

slide-9
SLIDE 9

MATH SOME

slide-10
SLIDE 10

MATH SOME AWE

slide-11
SLIDE 11

@a

For those of you who don’t know me: acquired in October 2008. Had a different account earlier, but then @k asked if I wanted it. Know many other single-letter Twitterers.
slide-12
SLIDE 12

FAME

Advantages
slide-13
SLIDE 13

FAME FORTUNE

slide-14
SLIDE 14

FAME FORTUNE

Wall Street Journal?!

slide-15
SLIDE 15

FAME FORTUNE FOLLOWERS

slide-16
SLIDE 16

lol, what?!

FAME FORTUNE FOLLOWERS

slide-17
SLIDE 17

140 − length("@a ") = 137

MAXIMUM REPLY SPACE!

slide-18
SLIDE 18

CONS

Disadvantages: visual filtering is next to impossible. Could be a set of hard-coded rules derived empirically.
slide-20
SLIDE 20

CONS

I hate humanity

slide-21
SLIDE 21

A D D

slide-22
SLIDE 22

Annoyance Driven Development

Best way to learn something is to be annoyed enough to create a solution based on the tech.
slide-23
SLIDE 23

Machine Learning to the Rescue!

slide-24
SLIDE 24

REPLYCLEANER

Even with false negatives, reduces garbage to where visual filtering is possible
  • uses trained model to classify tweets into good/bad
  • blocks the authors of the bad ones, since Twitter does not have a way to remove an individual tweet from the timeline
slide-30
SLIDE 30
slide-31
SLIDE 31

I still hate humanity
slide-33
SLIDE 33

Machine Learning

A branch of Artificial Intelligence. No widely accepted definition.
slide-34
SLIDE 34

“Field of study that gives computers the ability to learn without being explicitly programmed.” — Arthur Samuel (1959)

concerns the construction and study of systems that can learn from data
slide-35
SLIDE 35

SPAM FILTERING

slide-36
SLIDE 36

RECOMMENDATIONS

slide-37
SLIDE 37

TRANSLATION

slide-38
SLIDE 38

CLUSTERING

And many more: medical diagnoses, detecting credit card fraud, etc.
slide-39
SLIDE 39

supervised

unsupervised

Labeled dataset; training maps inputs to desired outputs. Examples: regression (predicting house prices) and classification (spam filtering).
slide-40
SLIDE 40

supervised

unsupervised

No labels in the dataset; the algorithm needs to find structure on its own. Example: clustering. We will be talking about classification, a supervised learning process.
slide-41
SLIDE 41

Feature

individual measurable property of the phenomenon under observation

usually numeric
slide-42
SLIDE 42
slide-43
SLIDE 43

Feature Vector

a set of features for an observation

Think of it as an array
slide-44
SLIDE 44

features

# of rooms
sq. m²
house age
yard?

Feature vector and weights vector. A 1 is added to pad the vector (accounts for the initial offset / bias / intercept weight and simplifies calculation). The dot product produces a linear predictor.
slide-49
SLIDE 49

features · parameters = prediction

[diagram: feature vector (1, # of rooms, sq. m², house age, yard?) dotted with the weights (45.7, 102.3, 0.94, 10.1, 83.0) gives the prediction: 758,013]

Feature vector and weights vector. A 1 is added to pad the vector (accounts for the initial offset / bias / intercept weight and simplifies calculation). The dot product produces a linear predictor.
slide-50
SLIDE 50

X = [1 x1 x2 ...]    θ = [θ0 θ1 θ2 ...]

dot product

X - input feature vector theta - weights
slide-51
SLIDE 51

X = [1 x1 x2 ...]    θ = [θ0 θ1 θ2 ...]

θ·X = θ0 + θ1x1 + θ2x2 + . . .

dot product

X - input feature vector theta - weights
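As a sketch (hypothetical Python; the talk's tool was PHP, but the arithmetic is identical), the 1-padded feature vector and the dot product look like this. The example numbers echo the house slide but the pairing of weight to feature is illustrative, not a trained model:

```python
def linear_predictor(theta, x):
    """Dot product theta . X over a feature vector x that is already
    padded with a leading 1 for the bias weight theta_0."""
    return sum(t * xi for t, xi in zip(theta, x))

# Hypothetical house example: [1 (bias), # of rooms, sq. m, age, has yard]
x = [1, 4, 120.0, 15, 1]
theta = [45.7, 102.3, 0.94, -10.1, 83.0]  # made-up weights for illustration
prediction = linear_predictor(theta, x)
```

The leading 1 is what lets the bias weight θ0 ride along in the same dot product instead of being a special case.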
slide-52
SLIDE 52

training data → learning algorithm → hypothesis

Hypothesis (decision function): what the system has learned so far Hypothesis is applied to new data
slide-53
SLIDE 53

hθ(X)

The task of our algorithm is to determine the parameters of the hypothesis.
slide-56
SLIDE 56

hθ(X)

parameters (θ), input data (X), prediction (y)

slide-57
SLIDE 57

LINEAR REGRESSION

[scatter plot: whisky age (years; ticks 5–40) vs. whisky price ($; ticks 80–200)]

Models the relationship between a scalar dependent variable y and one or more explanatory variables denoted X. Here X = whisky age, y = whisky price. Linear regression does not work well for classification because its output is unbounded. Thresholding on some value is tricky and does not produce good results.
slide-60
SLIDE 60

LOGISTIC REGRESSION

g(z) = 1 / (1 + e^−z)

[plot: sigmoid curve, asymptotes at 0 and 1, crossing 0.5 at z = 0]

Logistic function (also sigmoid function). Asymptotes at 0 and 1. Crosses 0.5 at origin. z is just our old dot product, the linear predictor. Transforms unbounded output into bounded.
slide-61
SLIDE 61

LOGISTIC REGRESSION

g(z) = 1 / (1 + e^−z)

[plot: sigmoid curve, asymptotes at 0 and 1, crossing 0.5 at z = 0]

z = θ · X

slide-62
SLIDE 62

hθ(X) = 1 / (1 + e^−θ·X)

Probability that y=1 for input X

LOGISTIC REGRESSION

If the hypothesis describes spam, then given X = body of email, h(X) = 0.7 means there’s a 70% chance it’s spam. Thresholding on that is up to you.
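A minimal sketch of the hypothesis (hypothetical Python; the talk's tool was PHP):

```python
import math

def h(theta, x):
    """Logistic hypothesis: estimated probability that y = 1 for input x
    (x padded with a leading 1, as before)."""
    z = sum(t * xi for t, xi in zip(theta, x))  # linear predictor theta . X
    return 1.0 / (1.0 + math.exp(-z))
```

With all-zero weights h is always 0.5: the model has no opinion either way until training moves θ.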
slide-63
SLIDE 63

Building the Tool

slide-64
SLIDE 64

Corpus

collection of source data used for training and testing the model

slide-65
SLIDE 65

Twitter MongoDB

phirehose

hooks into streaming API
slide-66
SLIDE 66

Twitter MongoDB

phirehose

8500 tweets

slide-67
SLIDE 67

Feature Identification

slide-68
SLIDE 68

independent & discriminant

Independent: feature A should not co-occur (correlate) with feature B highly. Discriminant: a feature should provide uniquely classifiable data (what letter a tweet starts with is not a good feature).
slide-69
SLIDE 69
possible features

  • @a at the end of the tweet
  • @a...
  • length < N chars
  • # of user mentions in the tweet
  • # of hashtags
  • language!
  • @a followed by punctuation and a word character (except for apostrophe)
  • …and more

slide-70
SLIDE 70

feature = extractor(tweet)

For each feature, write a small function that takes a tweet and returns a numeric value (floating-point).
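A hypothetical Python sketch of such extractor functions (the real tool was PHP; the feature definitions below are loosely based on the "possible features" slide, not the actual model):

```python
import re

def ends_with_a(tweet):
    """1.0 if the tweet ends with @a, else 0.0."""
    return 1.0 if tweet.rstrip().endswith("@a") else 0.0

def num_mentions(tweet):
    """Count of @user mentions in the tweet."""
    return float(len(re.findall(r"@\w+", tweet)))

def num_hashtags(tweet):
    """Count of #hashtags in the tweet."""
    return float(len(re.findall(r"#\w+", tweet)))

def short_tweet(tweet, n=20):
    """1.0 if the tweet is shorter than n characters (n is a guess)."""
    return 1.0 if len(tweet) < n else 0.0

EXTRACTORS = [ends_with_a, num_mentions, num_hashtags, short_tweet]

def feature_vector(tweet):
    """Pad with 1 for the bias term, then apply each extractor in order."""
    return [1.0] + [f(tweet) for f in EXTRACTORS]
```

Running `feature_vector` over every tweet in the corpus yields the array-of-arrays the next slide stores in the DB.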
slide-71
SLIDE 71

corpus extractors feature vectors

Run the set of these functions over the corpus and build up feature vectors Array of arrays Save to DB
slide-72
SLIDE 72

Language Matters

high correlation between the language of the tweet and its category (good/bad)
slide-73
SLIDE 73

Indonesian or Tagalog?

Garbage.

slide-74
SLIDE 74

id  Indonesian   3548
en  English      1804
tl  Tagalog       733
es  Spanish       329
so  Somali        305
ja  Japanese      300
pt  Portuguese    262
ar  Arabic        256
nl  Dutch         150
it  Italian       137
sw  Swahili       118
fr  French         92

Top 12 Languages

I guarantee you people aren’t tweeting at me in Swahili.
slide-75
SLIDE 75

Language Detection

Can’t trust the language field in user’s profile data. Used character N-grams and character sets for detection. Has its own error rate, so needs some post-processing.
slide-76
SLIDE 76

Language Detection

pear/Text_LanguageDetect · pecl/textcat

slide-77
SLIDE 77

EnglishNotEnglish

✓ Clean up text (remove mentions, links, etc.)
✓ Run language detection
✓ If unknown/low weight, pretend it’s English; else:
✓ If not a character set-determined language, try harder:
  ✓ Tokenize into words
  ✓ Diff against the English vocabulary
  ✓ If words remain, run a parts-of-speech tagger on each
  ✓ For NNS, VBZ, and VBD run a stemming algorithm
  ✓ If the result is in the English vocabulary, remove it from the remaining list
  ✓ If the remaining list is not empty, calculate:

      unusual_word_ratio = size(remaining) / size(words)

  ✓ If the ratio < 20%, pretend it’s English

A lot of this is heuristic-based, after some trial-and-error. Seems to help with my corpus.
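The final ratio test can be sketched like this (hypothetical Python; the cleanup, POS-tagging, and stemming steps from the checklist are omitted):

```python
def unusual_word_ratio(words, english_vocab):
    """Fraction of tokens not recognized as English, after the earlier
    cleanup/stemming steps have already pruned the list."""
    if not words:
        return 0.0
    remaining = [w for w in words if w.lower() not in english_vocab]
    return len(remaining) / len(words)

def looks_english(words, english_vocab, threshold=0.2):
    """The slide's rule: under 20% unusual words, pretend it's English."""
    return unusual_word_ratio(words, english_vocab) < threshold
```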
slide-78
SLIDE 78

BINARY CLASSIFICATION

Grunt work: built a web-based tool to display tweets a page at a time and select the good ones.
slide-79
SLIDE 79

INPUT: feature vectors
OUTPUT: labels (good/bad)

Had my input and output
slide-80
SLIDE 80

BIAS

CORRECTION

One more thing to address
slide-81
SLIDE 81

BIAS

CORRECTION

BAD

GOOD

99% = bad (fewer than 100 tweets were good). Training a model as-is would not produce good results. Need to adjust the bias.
slide-82
SLIDE 82

BIAS

CORRECTION

BAD

GOOD

slide-83
SLIDE 83

OVERSAMPLING

Oversampling: use multiple copies of good tweets to equalize with bad Problem: bias very high, each good tweet would have to be copied 100 times, and not contribute any variance to the good category
slide-85
SLIDE 85

UNDERSAMPLING

Undersampling: drop most of the bad tweets to equalize with good Problem: total corpus ends up being < 200 tweets, not enough for training
slide-87
SLIDE 87

SYNTHETIC OVERSAMPLING

Synthesize feature vectors by determining what constitutes a good tweet and do weighted random selection of feature values.
slide-88
SLIDE 88

chance   feature
90%      “good” language
70%      no hashtags
25%      1 hashtag
5%       2 hashtags
2%       @a at the end
85%      rand length > 10

The actual synthesis is somewhat more complex and was also trial-and-error based. Synthesized tweets + existing good tweets = 2/3 of the number of bad tweets in the training corpus (limited to 1000).
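A sketch of the weighted random synthesis (hypothetical Python; the probabilities come from the table above, but the feature layout is invented for illustration):

```python
import random

def synthesize_good_tweet_features(rng=random):
    """Draw one synthetic 'good' feature vector, each feature sampled
    independently with the slide's probabilities."""
    good_language = 1.0 if rng.random() < 0.90 else 0.0
    hashtags = rng.choices([0.0, 1.0, 2.0], weights=[70, 25, 5])[0]
    a_at_end = 1.0 if rng.random() < 0.02 else 0.0
    length_gt_10 = 1.0 if rng.random() < 0.85 else 0.0
    # leading 1.0 is the usual bias padding
    return [1.0, good_language, hashtags, a_at_end, length_gt_10]
```

Because each vector is drawn fresh, the synthetic examples contribute variance to the good class, unlike plain copying.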
slide-93
SLIDE 93

Model Training

We have the hypothesis (decision function) and the training set. How do we actually determine the weights/parameters?
slide-94
SLIDE 94

COST FUNCTION

Measures how far the prediction of the system is from the reality. The cost depends on the parameters. The less the cost, the closer we’re to the ideal parameters for the model.
slide-95
SLIDE 95

REALITY PREDICTION

COST FUNCTION

slide-96
SLIDE 96

COST FUNCTION

J(θ) = (1/m) · Σ_(i=1..m) Cost(hθ(x), y)

slide-97
SLIDE 97

Cost(hθ(x), y) = −log(hθ(x))       if y = 1
                 −log(1 − hθ(x))   if y = 0

LOGISTIC COST

slide-98
SLIDE 98

[plots: −log(hθ(x)) when y = 1 and −log(1 − hθ(x)) when y = 0]

Correct guess → Cost = 0; incorrect guess → Cost = huge

LOGISTIC COST

When y=1 and h(x) is 1 (good guess), cost is 0, but the closer h(x) gets to 0 (wrong guess), the more we penalize the algorithm. Same for y=0.
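The piecewise cost translates directly into code (hypothetical Python sketch; `h` stands for any hypothesis function):

```python
import math

def cost(h_x, y):
    """Cost of one prediction h_x in (0, 1) against the true label y:
    zero for a confident correct guess, huge for a confident wrong one."""
    return -math.log(h_x) if y == 1 else -math.log(1.0 - h_x)

def J(theta, xs, ys, h):
    """Average cost over the m training examples, given hypothesis h."""
    return sum(cost(h(theta, x), y) for x, y in zip(xs, ys)) / len(xs)
```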
slide-99
SLIDE 99

minimize cost

OVER θ

Finding the best values of Theta that minimize the cost
slide-100
SLIDE 100

GRADIENT DESCENT

Random starting point. Pretend you’re standing on a hill. Find the direction of the steepest descent and take a step. Repeat. Imagine a ball rolling down from a hill.
slide-101
SLIDE 101

θi = θi − α · ∂J(θ)/∂θi

GRADIENT DESCENT

Each step adjusts the parameters according to the slope
slide-102
SLIDE 102

θi = θi − α · ∂J(θ)/∂θi

each parameter: θi

Have to update them simultaneously (the whole vector at a time).
slide-103
SLIDE 103

θi = θi − α · ∂J(θ)/∂θi

learning rate: α

Controls how big a step you take If α is big have an aggressive gradient descent If α is small take tiny steps If too small, tiny steps, takes too long If too big, can overshoot the minimum and fail to converge
slide-104
SLIDE 104

θi = θi − α · ∂J(θ)/∂θi

derivative aka “the slope”: ∂J(θ)/∂θi

The slope indicates the steepness of the descent step for each weight, i.e. direction. Keep going for a number of iterations or until cost is below a threshold (convergence). Graph the cost function versus # of iterations and see where it starts to approach 0, past that are diminishing returns.
slide-105
SLIDE 105

θi = θi − α · Σ_(j=1..m) (hθ(x^(j)) − y^(j)) · xi^(j)

THE UPDATE ALGORITHM

Derivative for logistic regression simplifies to this term. Have to update the weights simultaneously!
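One simultaneous batch update, sketched in hypothetical Python (the talk's implementation was PHP):

```python
import math

def h(theta, x):
    """Logistic hypothesis over a 1-padded feature vector."""
    return 1.0 / (1.0 + math.exp(-sum(t * xi for t, xi in zip(theta, x))))

def gradient_step(theta, xs, ys, alpha):
    """One batch gradient-descent update for logistic regression.
    All hypotheses are computed with the *old* theta, and every parameter
    is updated simultaneously, as the slide stresses."""
    preds = [h(theta, x) for x in xs]  # frozen for the whole iteration
    return [t - alpha * sum((p - y) * x[i] for p, y, x in zip(preds, ys, xs))
            for i, t in enumerate(theta)]
```

On the worked example that follows (X1 = [1, 12.0] with y1 = 1, X2 = [1, −3.5] with y2 = 0, θ = [0.1, 0.1], α = 0.05), a single step yields θ ≈ [0.088, 0.305], matching the slides.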
slide-115
SLIDE 115

X1 = [1 12.0]   X2 = [1 -3.5]
y1 = 1          y2 = 0
θ = [0.1 0.1]

h(X1) = 1 / (1 + e^−(0.1 · 1 + 0.1 · 12.0)) = 0.786
h(X2) = 1 / (1 + e^−(0.1 · 1 + 0.1 · (−3.5))) = 0.438

T0 = 0.1 − 0.05 · ((h(X1) − y1) · X1₀ + (h(X2) − y2) · X2₀)
   = 0.1 − 0.05 · ((0.786 − 1) · 1 + (0.438 − 0) · 1)
   = 0.088

α= 0.05

Hypothesis for each data point based on current parameters. Each parameter is updated in order and result is saved to a temporary.
slide-116
SLIDE 116

X1 = [1 12.0]   X2 = [1 -3.5]
y1 = 1          y2 = 0
θ = [0.1 0.1]

h(X1) = 1 / (1 + e^−(0.1 · 1 + 0.1 · 12.0)) = 0.786
h(X2) = 1 / (1 + e^−(0.1 · 1 + 0.1 · (−3.5))) = 0.438

T1 = 0.1 − 0.05 · ((h(X1) − y1) · X1₁ + (h(X2) − y2) · X2₁)
   = 0.1 − 0.05 · ((0.786 − 1) · 12.0 + (0.438 − 0) · (−3.5))
   = 0.305

α= 0.05

Note that the hypotheses don’t change within the iteration.
slide-117
SLIDE 117

y1 = 1 y2 = 0 X1 = [1 12.0] X2 = [1 -3.5]

θ = [T0 T1] α= 0.05

Replace parameter (weights) vector with the temporaries.
slide-118
SLIDE 118

y1 = 1 y2 = 0 X1 = [1 12.0] X2 = [1 -3.5]

θ = [0.088 0.305] α= 0.05

Do next iteration
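Putting the pieces together, the iteration loop might look like this (hypothetical Python sketch; names invented). On the two-tweet example above the cost drops steadily and the two points end up on the right sides of the 0.5 boundary:

```python
import math

def h(theta, x):
    """Logistic hypothesis over a 1-padded feature vector."""
    return 1.0 / (1.0 + math.exp(-sum(t * xi for t, xi in zip(theta, x))))

def J(theta, xs, ys):
    """Average logistic cost over the training set."""
    return sum(-math.log(h(theta, x)) if y == 1 else -math.log(1.0 - h(theta, x))
               for x, y in zip(xs, ys)) / len(xs)

def train(theta, xs, ys, alpha=0.05, iterations=500):
    """Repeat the simultaneous update; graph J against the iteration
    count to spot convergence, as the earlier slide suggests."""
    for _ in range(iterations):
        preds = [h(theta, x) for x in xs]  # old theta only, per iteration
        theta = [t - alpha * sum((p - y) * x[i] for p, y, x in zip(preds, ys, xs))
                 for i, t in enumerate(theta)]
    return theta
```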
slide-119
SLIDE 119

CROSS TRAINING

Used to assess the results of the training.
slide-120
SLIDE 120

DATA

slide-121
SLIDE 121

DATA TRAINING

slide-122
SLIDE 122

DATA TRAINING TEST

Train model on training set, then test results on test set. Rinse, lather, repeat feature selection/synthesis/training until results are "good enough". Pick the best parameters and save them (DB, other).
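A minimal hold-out split helper (hypothetical Python; names and the 70/30 ratio are illustrative):

```python
import random

def split_corpus(vectors, labels, test_fraction=0.3, seed=42):
    """Shuffle, then hold out a test set; train on the rest and
    evaluate only on the held-out part."""
    idx = list(range(len(vectors)))
    random.Random(seed).shuffle(idx)  # fixed seed for reproducibility
    cut = int(len(idx) * (1 - test_fraction))
    train_idx, test_idx = idx[:cut], idx[cut:]
    train = ([vectors[i] for i in train_idx], [labels[i] for i in train_idx])
    test = ([vectors[i] for i in test_idx], [labels[i] for i in test_idx])
    return train, test
```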
slide-123
SLIDE 123

Putting It All Together

Let’s put our model to use, finally. The tool hooks into the Twitter Streaming API, and naturally that comes with the need to do certain error handling, etc. Once we get the actual tweet, though...
slide-124
SLIDE 124

Load the model

The weights we have calculated via training

Easiest is to load them from the DB (can be used to test different models).
slide-125
SLIDE 125

HARD CODED RULES

We apply some hardcoded rules to filter out the tweets we are certain are good or bad. The truncated RT ones don’t show up on the Web or other tools anyway, so it’s fine to skip those.
slide-126
SLIDE 126

HARD CODED RULES SKIP

truncated retweets: "RT @A ..."

slide-127
SLIDE 127

HARD CODED RULES SKIP

truncated retweets: "RT @A ..." @ mentions of friends

slide-128
SLIDE 128

HARD CODED RULES SKIP

truncated retweets: "RT @A ..." tweets from friends @ mentions of friends

slide-129
SLIDE 129

Classifying Tweets

This is the moment we’ve been waiting for.
slide-131
SLIDE 131

Classifying Tweets

GOOD BAD

slide-132
SLIDE 132

hθ(X) = 1 / (1 + e^−θ·X)

Remember this?

First is our hypothesis.
slide-133
SLIDE 133

hθ(X) = 1 / (1 + e^−θ·X)

Remember this?

θ·X = θ0 + θ1X1 + θ2X2 + . . .

First is our hypothesis.
slide-134
SLIDE 134

Finally

hθ(X) = 1 / (1 + e^−(θ0 + θ1X1 + θ2X2 + ...))

If h > threshold , tweet is bad, otherwise good

Remember that the output of h() is 0..1 (probability). Threshold is [0, 1], adjust it for your degree of tolerance. I used 0.9 to reduce false positives.
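The final classification step as a hypothetical Python sketch; here h is the probability that the tweet is bad, and the 0.9 threshold trades some missed garbage for fewer false positives, as in the talk:

```python
import math

GOOD, BAD = "good", "bad"

def classify(theta, x, threshold=0.9):
    """Classify a 1-padded feature vector x as good or bad."""
    z = sum(t * xi for t, xi in zip(theta, x))
    p_bad = 1.0 / (1.0 + math.exp(-z))  # probability in 0..1
    return BAD if p_bad > threshold else GOOD
```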
slide-137
SLIDE 137

extract features run the model act on the result

3 simple steps Invoke the feature extractor to construct the feature vector for this tweet. Evaluate the decision function over the feature vector (input the calculated feature parameters into the equation). Use the output of the classifier.
slide-138
SLIDE 138

BAD?

block user!

Also save the tweet to DB for future analysis.
slide-139
SLIDE 139

Lessons Learned

Blocking is the only option (and is final)
Streaming API delivery is incomplete
ReplyCleaner judged to be ~80% effective
Twitter API is a pain in the rear

slide-140
SLIDE 140

Lessons Learned

Blocking is the only option (and is final)
Streaming API delivery is incomplete
ReplyCleaner judged to be ~80% effective
Twitter API is a pain in the rear

  • Connection handling, backoff in case of problems, undocumented API errors, etc.
slide-141
SLIDE 141

Lessons Learned

Blocking is the only option (and is final)
Streaming API delivery is incomplete
ReplyCleaner judged to be ~80% effective
Twitter API is a pain in the rear

  • No way for a blocked person to get ahold of you via Twitter anymore, so when training the model, err on the side of caution.
slide-142
SLIDE 142

Lessons Learned

Blocking is the only option (and is final)
Streaming API delivery is incomplete
ReplyCleaner judged to be ~80% effective
Twitter API is a pain in the rear

  • Some tweets are shown on the website, but never seen through the API.
slide-143
SLIDE 143

Lessons Learned

Blocking is the only option (and is final)
Streaming API delivery is incomplete
ReplyCleaner judged to be ~80% effective
Twitter API is a pain in the rear

  • Lots of room for improvement.
slide-144
SLIDE 144

Lessons Learned

Blocking is the only option (and is final)
Streaming API delivery is incomplete
ReplyCleaner judged to be ~80% effective
Twitter API is a pain in the rear
PHP sucks at math-y stuff

  • Lots of room for improvement.
slide-145
SLIDE 145

NEXT STEPS

★ Realtime feedback
★ More features
★ Grammar analysis
★ Support Vector Machines or decision trees
★ Clockwork Raven for manual classification
★ Other minimization algos: BFGS, conjugate gradient
★ Wish pecl/scikit-learn existed

Click on the tweets that are bad and it immediately incorporates them into the model. Grammar analysis to eliminate the common "@a bar" or "two @a time" occurrences. SVMs more appropriate for biased data sets. Farm out manual classification to Mechanical Turk. May help avoid local minima, no need to pick alpha, often faster than GD.
slide-146
SLIDE 146

TOOLS

★ MongoDB
★ pear/Text_LanguageDetect
★ English vocabulary corpus
★ Parts-of-speech tagging
★ SplFixedArray
★ phirehose
★ Python’s scikit-learn (for validation)
★ Code sample

MongoDB (great fit for JSON data) English vocabulary corpus: http://corpus.byu.edu/ or http://www.keithv.com/software/wlist/ SplFixedArray in PHP (memory savings and slightly faster)
slide-147
SLIDE 147

LEARN

★ Coursera.org ML course ★ Ian Barber’s blog ★ FastML.com

slide-148
SLIDE 148

Questions?