Small Data Machine Learning
Andrei Zmievski
The goal is not a comprehensive introduction, but to plant a seed in your head to get you interested in this topic. Questions - now and later
WORK

We are all superheroes, because we help our customers keep their mission-critical apps running smoothly. If interested, I can show you a demo of what I'm working on. Come find me.

TRAVEL
TAKE PHOTOS
DRINK BEER
MAKE BEER
FAME
Advantages

FAME FORTUNE
Wall Street Journal?!
FAME FORTUNE FOLLOWERS
lol, what?!
140 - length("@a ") = 137

MAXIMUM REPLY SPACE!
CONS
Disadvantages: visual filtering is next to impossible. Could be a set of hard-coded rules derived empirically.
I hate humanity
REPLYCLEANER
Even with false negatives, it reduces garbage to where visual filtering is possible.
I still hate humanity
"Field of study that gives computers the ability to learn without being explicitly programmed." — Arthur Samuel (1959)
Machine learning concerns the construction and study of systems that can learn from data.

SPAM FILTERING
RECOMMENDATIONS
TRANSLATION
CLUSTERING
And many more: medical diagnoses, detecting credit card fraud, etc.

SUPERVISED vs. UNSUPERVISED

Supervised learning uses a labeled dataset; training maps inputs to desired outputs. Examples: regression (predicting house prices), classification (spam filtering).
FEATURE

An individual measurable property of the phenomenon under observation, usually numeric.

FEATURE VECTOR

A set of features for an observation. Think of it as an array.
features       parameters
1              45.7
# of rooms     102.3
house age      0.94
yard?          83.0

The feature vector and the weights (parameters) vector: a 1 is added to pad the feature vector to account for the initial offset / bias / intercept weight, which simplifies the calculation. The dot product of the two produces a linear predictor, i.e. the prediction.
X = [1  x1  x2  ...]        θ = [θ0  θ1  θ2  ...]

θ·X = θ0 + θ1x1 + θ2x2 + ...
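To make the padding and the dot product concrete, here is a minimal PHP sketch (the talk's language); the function name and example call are illustrative, with the parameter values taken from the house-price slide above.

<?php
// Dot product of the weights vector θ and a feature vector X padded
// with a leading 1 for the bias/intercept weight.
function linearPredictor(array $theta, array $features): float
{
    $x = array_merge([1.0], $features);   // pad with 1
    $z = 0.0;
    foreach ($x as $i => $xi) {
        $z += $theta[$i] * $xi;           // θ·X = θ0 + θ1x1 + θ2x2 + ...
    }
    return $z;
}

// θ = [45.7, 102.3, 0.94, 83.0]; features = [# of rooms, house age, yard?]
echo linearPredictor([45.7, 102.3, 0.94, 83.0], [3, 20, 1]);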
hθ(X) = y, where θ is the parameters, X is the input data, and y is the prediction.

The task of our algorithm is to determine the parameters of the hypothesis.

LINEAR REGRESSION
[Scatter plot: whisky age (5 to 40 years) on the x-axis vs. whisky price ($80 to $200) on the y-axis, with a fitted line]

Models the relationship between a scalar dependent variable y and one or more explanatory variables denoted X. Here X = whisky age, y = whisky price. Linear regression does not work well for classification because its output is unbounded; thresholding on some value is tricky and does not produce good results.

LOGISTIC REGRESSION
g(z) = 1 / (1 + e^−z)

z = θ·X

[Plot: sigmoid curve, asymptotes at 0 and 1, crossing 0.5 at the origin]

The logistic function (also called the sigmoid function). It asymptotes at 0 and 1 and crosses 0.5 at the origin. z is just our old dot product, the linear predictor. It transforms the unbounded output into a bounded one.

hθ(X) = 1 / (1 + e^−θ·X)
Probability that y=1 for input X
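As a PHP sketch (assuming X is already padded with the leading 1; the function name is my own):

<?php
// hθ(X) = 1 / (1 + e^−θ·X): the sigmoid squashes the unbounded linear
// predictor into a probability between 0 and 1.
function hypothesis(array $theta, array $x): float
{
    $z = 0.0;
    foreach ($x as $i => $xi) {
        $z += $theta[$i] * $xi;   // θ·X, with $x[0] = 1 for the bias
    }
    return 1.0 / (1.0 + exp(-$z));
}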
If the hypothesis describes spam, then given X = body of an email, h(X) = 0.7 means there's a 70% chance it's spam. Thresholding on that is up to you.

Corpus
collection of source data used for training and testing the model
phirehose

Hooks into the Twitter streaming API.
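A minimal consumer following phirehose's documented pattern of subclassing and implementing enqueueStatus(); the class name, credentials, filter terms, and storage step here are placeholders, not the talk's actual code.

<?php
require_once 'Phirehose.php';

// phirehose calls enqueueStatus() once per raw tweet; keep this method
// fast and do the heavy processing elsewhere.
class MentionCollector extends Phirehose
{
    public function enqueueStatus($status)
    {
        $tweet = json_decode($status, true);
        if ($tweet !== null) {
            // e.g. store in MongoDB for later labeling and training
        }
    }
}

$consumer = new MentionCollector('username', 'password', Phirehose::METHOD_FILTER);
$consumer->setTrack(['@a']);   // placeholder filter terms
$consumer->consume();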
Split tweets into words on any non-word character (except for apostrophe).
possible features
Indonesian or Tagalog?
Garbage.
Top 12 Languages

id  Indonesian  3548
en  English     1804
tl  Tagalog      733
es  Spanish      329
so  Somali       305
ja  Japanese     300
pt  Portuguese   262
ar  Arabic       256
nl  Dutch        150
it  Italian      137
sw  Swahili      118
fr  French        92
I guarantee you people aren't tweeting at me in Swahili.

pear/Text_LanguageDetect, pecl/textcat

Can't trust the language field in the user's profile data. Used character N-grams and character sets for detection. It has its own error rate, so it needs some post-processing.

unusual_word_ratio = size(remaining) / size(words)
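A PHP sketch of that ratio, assuming $vocabulary is a lookup map of known English words (the tokenization follows the split-on-non-word-characters rule above; the names are mine):

<?php
// Fraction of a tweet's words that are NOT in the English vocabulary.
// $vocabulary is assumed to be a [word => true] lookup map.
function unusualWordRatio(string $tweet, array $vocabulary): float
{
    // split on any non-word character (except for apostrophe)
    $words = preg_split("/[^\\w']+/u", mb_strtolower($tweet), -1, PREG_SPLIT_NO_EMPTY);
    if (count($words) === 0) {
        return 0.0;
    }
    $remaining = array_filter($words, fn ($w) => !isset($vocabulary[$w]));
    return count($remaining) / count($words);
}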
✓ If ratio < 20%, pretend it's English

English or NotEnglish

A lot of this is heuristic-based, after some trial and error. It seems to help with my corpus.

BINARY CLASSIFICATION
Grunt work: built a web-based tool to display tweets a page at a time and select the good ones.

INPUT / OUTPUT

Had my input and output.

CORRECTION

One more thing to address.
BAD / GOOD

99% of the tweets were bad (fewer than 100 tweets were good). Training a model as-is would not produce good results; we need to adjust the bias.
chance   feature
90%      "good" language
70%      no hashtags
25%      1 hashtag
5%       2 hashtags
2%       @a at the end
85%      rand length > 10

The actual synthesis is somewhat more complex and was also trial-and-error based. Synthesized tweets + existing good tweets = 2/3 of the number of bad tweets in the training corpus (limited to 1000).
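A rough PHP sketch of the synthesis idea, drawing from the chance table above (the real procedure was more complex; the feature names and the random-length rule are my simplification):

<?php
// Draw one synthetic "good" tweet's features from the chance table.
function chance(float $p): bool
{
    return mt_rand(1, 10000) <= $p * 10000;
}

function synthesizeFeatures(): array
{
    $r = mt_rand(1, 100);
    return [
        'good_language' => chance(0.90),
        'hashtags'      => $r <= 70 ? 0 : ($r <= 95 ? 1 : 2), // 70% / 25% / 5%
        'a_at_end'      => chance(0.02),                      // "@a" at the end
        'length'        => chance(0.85) ? mt_rand(11, 137)    // rand length > 10
                                        : mt_rand(1, 10),
    ];
}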
REALITY vs. PREDICTION

J(θ) = (1/m) Σ Cost(hθ(x), y)   (summed over all m training examples)

Measures how far the prediction of the system is from the reality. The cost depends on the parameters: the lower the cost, the closer we are to the ideal parameters for the model.

Cost(hθ(x), y) = −log(hθ(x))       if y = 1
Cost(hθ(x), y) = −log(1 − hθ(x))   if y = 0
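A direct PHP transcription of J(θ), as a sketch; hypothesis() is the function sketched earlier.

<?php
// J(θ) = (1/m) Σ Cost(hθ(x), y), with the logistic cost:
// −log(h) when y = 1, −log(1 − h) when y = 0.
function cost(array $theta, array $X, array $y): float
{
    $m = count($X);
    $sum = 0.0;
    foreach ($X as $j => $xj) {
        $h = hypothesis($theta, $xj);
        $sum += ($y[$j] == 1) ? -log($h) : -log(1.0 - $h);
    }
    return $sum / $m;
}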
LOGISTIC COST
[Plot: the two logistic cost curves, for y = 1 and y = 0. A correct guess has cost = 0; an incorrect guess has a huge cost.]
When y = 1 and h(x) is 1 (a good guess), the cost is 0, but the closer h(x) gets to 0 (a wrong guess), the more we penalize the algorithm. Same for y = 0.

GRADIENT DESCENT
Random starting point: pretend you're standing on a hill. Find the direction of the steepest descent and take a step. Repeat. Imagine a ball rolling down from a hill.

θi = θi − α ∂J(θ)/∂θi

Each step adjusts each parameter θi according to the slope. They have to be updated simultaneously (the whole vector at a time).

α is the learning rate and controls how big a step you take. If α is big, you have an aggressive gradient descent; if α is small, you take tiny steps. If it's too small, the tiny steps take too long; if it's too big, you can overshoot the minimum and fail to converge.

∂J(θ)/∂θi is the derivative, aka "the slope". The slope indicates the steepness of the descent step for each weight, i.e. the direction. Keep going for a number of iterations or until the cost is below a threshold (convergence). Graph the cost function versus the # of iterations and see where it starts to approach 0; past that are diminishing returns.

THE UPDATE ALGORITHM

θi = θi − α Σ (j=1..m) (hθ(Xj) − yj) Xj[i]

The derivative for logistic regression simplifies to this term. The weights have to be updated simultaneously!
X1 = [1  12.0]    y1 = 1
X2 = [1  -3.5]    y2 = 0
θ  = [0.1  0.1]   α = 0.05

The hypothesis for each data point is based on the current parameters:

h(X1) = 1 / (1 + e^−(0.1 · 1 + 0.1 · 12.0))   = 0.786
h(X2) = 1 / (1 + e^−(0.1 · 1 + 0.1 · (-3.5))) = 0.438

Each parameter is updated in order and the result is saved to a temporary. Note that the hypotheses don't change within the iteration:

T0 = 0.1 − 0.05 · ((h(X1) − y1) · X1[0] + (h(X2) − y2) · X2[0])
   = 0.1 − 0.05 · ((0.786 − 1) · 1 + (0.438 − 0) · 1)
   = 0.088

T1 = 0.1 − 0.05 · ((0.786 − 1) · 12.0 + (0.438 − 0) · (-3.5))
   = 0.305

Replace the parameter (weights) vector with the temporaries:

θ = [T0  T1] = [0.088  0.305]

Do the next iteration.
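Wrapping the update rule in a loop gives a batch gradient descent trainer. A PHP sketch under the same assumptions as the earlier snippets, with a fixed iteration count instead of a convergence test:

<?php
// Batch gradient descent: θi := θi − α Σj (h(Xj) − yj) · Xj[i].
// All parameters are updated simultaneously via a temporary vector.
function train(array $X, array $y, float $alpha, int $iterations): array
{
    $theta = array_fill(0, count($X[0]), 0.1);           // starting point

    for ($iter = 0; $iter < $iterations; $iter++) {
        // hypotheses don't change within an iteration
        $h = array_map(fn ($xj) => hypothesis($theta, $xj), $X);

        $temp = $theta;
        foreach ($theta as $i => $ti) {
            $grad = 0.0;
            foreach ($X as $j => $xj) {
                $grad += ($h[$j] - $y[$j]) * $xj[$i];
            }
            $temp[$i] = $ti - $alpha * $grad;
        }
        $theta = $temp;                                   // simultaneous update
    }
    return $theta;
}

// The worked example above: one iteration takes θ from [0.1, 0.1]
// to approximately [0.088, 0.305].
$theta = train([[1, 12.0], [1, -3.5]], [1, 0], 0.05, 1);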
Load the model

The weights we have calculated via training. Easiest is to load them from the DB (this can also be used to test different models).

We apply some hard-coded rules to filter out the tweets we are certain are good or bad:

truncated retweets: "RT @A ..."
@-mentions of friends
tweets from friends

The truncated RT ones don't show up on the Web or other tools anyway, so it's fine to skip those.

GOOD / BAD

This is the moment we've been waiting for.
First is our hypothesis:

hθ(X) = 1 / (1 + e^−θ·X)

θ·X = θ0 + θ1X1 + θ2X2 + ...

hθ(X) = 1 / (1 + e^−(θ0 + θ1X1 + θ2X2 + ...))
If h > threshold, the tweet is bad; otherwise it's good.
Remember that the output of h() is 0..1 (a probability). The threshold is in [0, 1]; adjust it for your degree of tolerance. I used 0.9 to reduce false positives.
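The final filtering step as a PHP sketch; extractFeatures() is a hypothetical helper that turns a tweet into the padded feature vector, and the threshold matches the talk's choice.

<?php
const THRESHOLD = 0.9;   // high threshold to reduce false positives

// h > threshold => classify the tweet as bad, otherwise good.
function isBad(array $theta, string $tweet): bool
{
    $x = extractFeatures($tweet);        // hypothetical: [1, x1, x2, ...]
    return hypothesis($theta, $x) > THRESHOLD;
}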
Lessons Learned

Blocking is the only option (and is final)
Streaming API delivery is incomplete
ReplyCleaner judged to be ~80% effective
Twitter API is a pain in the rear
PHP sucks at math-y stuff
NEXT STEPS

★ Realtime feedback
★ More features
★ Grammar analysis
★ Support Vector Machines or decision trees
★ Clockwork Raven for manual classification
★ Other minimization algos: BFGS, conjugate gradient
★ Wish pecl/scikit-learn existed

Realtime feedback: click on the tweets that are bad and it immediately incorporates them into the model. Grammar analysis would eliminate the common "@a bar" or "two @a time" occurrences. SVMs are more appropriate for biased data sets. Clockwork Raven farms out manual classification to Mechanical Turk. BFGS and conjugate gradient may help avoid local minima, need no choice of alpha, and are often faster than gradient descent.

TOOLS
★ MongoDB
★ pear/Text_LanguageDetect
★ English vocabulary corpus
★ Parts-Of-Speech tagging
★ SplFixedArray
★ phirehose
★ Python's scikit-learn (for validation)
★ Code sample

MongoDB is a great fit for JSON data. English vocabulary corpus: http://corpus.byu.edu/ or http://www.keithv.com/software/wlist/. SplFixedArray in PHP brings memory savings and is slightly faster.

LEARN
★ Coursera.org ML course
★ Ian Barber's blog
★ FastML.com