The Art of Predictive Analytics: More Data, Same Models [STUDY SLIDES]
Joseph Turian joseph@metaoptimize.com @turian MetaOptimize
2012.02.02
NOTE: These are the STUDY slides from my talk at the predictive analytics meetup: http://bit.ly/xVLBuS I have removed some graphics, and added some text.
Who am I?
Engineer with 20 yrs coding exp
PhD; 10 yrs exp: large-scale ML + NLP
Founded MetaOptimize
What is MetaOptimize?
Consultancy + community on:
Large-scale ML + NLP
Well-engineered solutions
“Both NLP and ML have a lot of folk wisdom about what works and what doesn't. [This site] is crucial for sharing this collective knowledge.” - @aria42
http://metaoptimize.com/qa/
“A lot of expertise in machine learning is simply developing effective biases.”
(quoted from memory)
What's a good choice of learning rate for the second layer of this neural net on image patches? [intuition] (Yoshua Bengio)
Occam's Razor is a great example of ML intuition
Without the aid of prejudice and custom, I should not be able to find my way across the room. (William Hazlitt)
It's fun to be a geek
Be an artist
How to build the world's biggest langid (langcat) model?
[removed graphic] + Vowpal Wabbit = Win
How to build the world's biggest langid (langcat) model? SOLVED.
The art of predictive analytics:
1) Know the data out there
2) Know the code out there
3) Intuition (bias)
A lot of data with one feature correlated with the label
Twitter sentiment analysis?
Awesome! RT @rupertgrintnet Harry Potter Marks Place in Film History http://bit.ly/Eusxi :)
“Distant supervision” (Go et al., 09) (Use emoticons as labels)
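A minimal sketch of that distant-supervision recipe, assuming a scikit-learn-style setup: emoticons act as noisy labels and are stripped from the text so the model cannot simply memorize them. The two tweets here are a stand-in corpus.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

POS = [":)", ":-)", ":D"]
NEG = [":(", ":-("]

def distant_label(tweet):
    """Return (label, text with emoticons removed), or None if unusable."""
    has_pos = any(e in tweet for e in POS)
    has_neg = any(e in tweet for e in NEG)
    if has_pos == has_neg:              # no emoticon, or conflicting ones
        return None
    for e in POS + NEG:
        tweet = tweet.replace(e, " ")
    return (1 if has_pos else 0, tweet)

tweets = ["Awesome! Harry Potter Marks Place in Film History :)",
          "Worst sequel ever :("]      # stand-in corpus
labeled = [lt for lt in map(distant_label, tweets) if lt is not None]
y, texts = zip(*labeled)

X = CountVectorizer(ngram_range=(1, 2)).fit_transform(texts)
clf = LogisticRegression().fit(X, y)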
Recipe: You know a lot about the problem
Smart Priors
You know a lot about the problem: Smart Priors
Yarowsky (1995), WSD:
1) One sense per collocation.
2) One sense per discourse.
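One way the second heuristic can be cashed out in code (a hedged sketch, not from the talk): smooth per-occurrence WSD predictions with a document-majority vote. The (doc_id, sense, confidence) tuples are an assumed input format.

from collections import Counter, defaultdict

def one_sense_per_discourse(predictions, threshold=0.9):
    """predictions: iterable of (doc_id, sense, confidence)."""
    by_doc = defaultdict(list)
    for doc_id, sense, conf in predictions:
        by_doc[doc_id].append((sense, conf))
    smoothed = {}
    for doc_id, preds in by_doc.items():
        # Prefer the majority among confident predictions in this discourse
        confident = [s for s, c in preds if c >= threshold]
        votes = confident if confident else [s for s, _ in preds]
        smoothed[doc_id] = Counter(votes).most_common(1)[0][0]
    return smoothed    # doc_id -> the one sense for that discourse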
Recipe: You know a lot about the problem
Create new features
You know a lot about the problem: Create new features
Error-analysis
What errors is your model making? DO SOME EXPLORATORY DATA ANALYSIS (EDA)
Andrew Ng: “Advice for applying ML” Where do the errors come from?
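A small EDA sketch in that spirit, assuming a fitted scikit-learn classifier clf and held-out X_dev, y_dev, texts_dev: print the confusion matrix, then read the model's most confident mistakes.

import numpy as np
from sklearn.metrics import confusion_matrix

pred = clf.predict(X_dev)
print(confusion_matrix(y_dev, pred))           # where do the errors come from?

conf = clf.predict_proba(X_dev).max(axis=1)    # model confidence per example
wrong = np.where(pred != y_dev)[0]
for i in wrong[np.argsort(-conf[wrong])][:20]: # 20 most confident mistakes
    print(conf[i], "gold:", y_dev[i], "pred:", pred[i], texts_dev[i])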
Recipe: You know a little about the problem
Semi-supervised learning
You know a little about the problem: Semi-supervised learning
JOINT semi-supervised learning:
Ando and Zhang (2005), Suzuki and Isozaki (2008), Suzuki et al. (2009), etc.
=> effective but task-specific
You know a little about the problem: Semi-supervised learning
Unsupervised learning, followed by Supervised learning
[Diagram: sup data → supervised training → sup model]
How can Bob improve his model?
[Diagram: the same pipeline, asking: semi-sup training?]
[Diagram: the same pipeline, asking: semi-sup training? more feats?]
[Diagram: sup data + more feats → sup models for sup task 1 and sup task 2]
More features can be used on different tasks
[Diagram: unsup data + sup data → joint semi-sup training → semi-sup model]
Joint semi-sup (the standard semi-sup setup)
[Diagram: unsup data → unsup pretraining → unsup model → semi-sup fine-tuning with sup data → semi-sup model]
Unsupervised, then supervised
[Diagram: unsup data → unsup training → unsup feats]
Use unsupervised learning to create new features
[Diagram: unsup feats + sup data → sup training → semi-sup model]
These features can then be shared with other people
[Diagram: unsup feats feeding sup task 1, sup task 2, and sup task 3]
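A minimal sketch of the "unsup training → unsup feats → sup training" pipeline above, assuming scikit-learn and synthetic arrays: cluster plentiful unlabeled data, use distances to the cluster centers as extra features, and train the same supervised model on the augmented representation.

import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X_unsup = rng.randn(10000, 20)             # lots of unlabeled data
X_sup = rng.randn(100, 20)                 # a little labeled data
y_sup = (X_sup[:, 0] > 0).astype(int)      # synthetic labels

km = MiniBatchKMeans(n_clusters=50, random_state=0).fit(X_unsup)
unsup_feats = km.transform(X_sup)          # distance to each cluster center
X_aug = np.hstack([X_sup, unsup_feats])    # original + unsupervised features

clf = LogisticRegression(max_iter=1000).fit(X_aug, y_sup)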
Recipe: You know almost nothing about the problem
Build cool generic features
Know almost nothing about problem: Build cool generic features
Word features (Turian et al., 2010)
http://metaoptimize.com/projects/wordreprs/
45
Brown clustering (Brown et al. 92)
(image from Terry Koo)
cluster(chairman) = '0010'
2-prefix(cluster(chairman)) = '00'
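A hedged sketch of turning Brown clusters into features; it assumes the paths-file format of Percy Liang's brown-cluster tool (bit-string, word, count, tab-separated).

def load_brown_clusters(path):
    clusters = {}
    with open(path) as f:
        for line in f:
            bits, word, _count = line.rstrip("\n").split("\t")
            clusters[word] = bits
    return clusters

def brown_features(word, clusters, prefixes=(2, 4, 6, 10)):
    """Prefixes of the bit-string path give features at several granularities."""
    bits = clusters.get(word)
    if bits is None:
        return []
    return ["brown_%d=%s" % (p, bits[:p]) for p in prefixes]

# e.g. cluster("chairman") = "0010" -> ["brown_2=00", "brown_4=0010", ...]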
46
50-dim embeddings: Collobert + Weston (2008)
t-SNE visualization by van der Maaten + Hinton (2008)
Know almost nothing about problem: Build cool generic features
Document features:
Document clustering
LSA/LDA
Deep model
Document features
Salakhutdinov + Hinton 06
Domain adaptation for sentiment analysis (Glorot et al. 11)
Document features example
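A minimal sketch of one such generic document feature, assuming scikit-learn: LSA as tf-idf followed by truncated SVD. An LDA model or a deep autoencoder (as in Salakhutdinov + Hinton 06 or Glorot et al. 11) would slot into the same place.

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

docs = ["the movie was great", "terrible film, do not watch",
        "stock prices fell sharply today"]          # stand-in corpus
lsa = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=2))
doc_feats = lsa.fit_transform(docs)                 # one dense vector per doc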
Recipe: You know a little about the problem Make more REAL training examples
Make more real training examples
Cuz you have some time
Amazon Mechanical Turk
Snow et al. 08 “Cheap and Fast – But is it Good?”
1K Turk labels per dollar
Average over ~5 Turkers to reduce noise
=> http://crowdflower.com/
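A sketch of that aggregation step, with an assumed input mapping item id -> list of Turker labels: a plain majority vote, plus the agreement rate as a cheap noise estimate.

from collections import Counter

def aggregate_turk_labels(raw):
    gold = {}
    for item_id, labels in raw.items():
        label, votes = Counter(labels).most_common(1)[0]
        gold[item_id] = (label, votes / len(labels))   # label + agreement
    return gold

# aggregate_turk_labels({"t1": ["pos", "pos", "neg", "pos", "pos"]})
# -> {"t1": ("pos", 0.8)}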
Soylent (Bernstein et al. 10)
Find-Fix-Verify: Crowd control design pattern
Soylent, a prototype...
Find a problem → Fix each problem → Verify quality
Make more real training examples
Active learning
Dualist (Settles 11) http://code.google.com/p/dualist/
Dualist (Settles 11) http://code.google.com/p/dualist/
Applications: document categorization, WSD, information extraction, Twitter sentiment analysis
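Dualist couples active learning with semi-supervised EM; the sketch below is just the active half, as plain uncertainty (margin) sampling. clf, X_pool, and the human-labeling step are assumed.

import numpy as np

def query_batch(clf, X_pool, k=10):
    """Pick the k pool instances the current model is least sure about."""
    proba = np.sort(clf.predict_proba(X_pool), axis=1)
    margin = proba[:, -1] - proba[:, -2]   # top-1 minus top-2 probability
    return np.argsort(margin)[:k]          # smallest margin = most uncertain

# for i in query_batch(clf, X_pool): labels[i] = ask_human(X_pool[i])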
You know a little about the problem: Make more training examples
FAKE training examples
FAKE training examples
Denoising autoencoder (AA), RBM
MNIST distortions (LeCun et al. 98)
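A minimal sketch of such fake examples, assuming 28x28 numpy arrays: jitter each image a pixel in each direction and reuse the label, since the class is invariant to small shifts. (np.roll wraps pixels around the border, a simplification of a true translation.)

import numpy as np

def shifted_copies(image, shifts=((0, 1), (0, -1), (1, 0), (-1, 0))):
    fakes = []
    for dy, dx in shifts:
        # Shift dy rows and dx columns; the label stays the same
        fakes.append(np.roll(np.roll(image, dy, axis=0), dx, axis=1))
    return fakes   # 4 distorted copies per original image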
No negative examples?
FAKE training examples
Multi-view / multi-modal
Multi-view / multi-modal
How do you evaluate an IR system, if you have no labels? See how good the title is at retrieving the body text.
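A label-free evaluation sketch along those lines, assuming parallel lists titles and bodies: index the bodies with tf-idf, query with the titles, and compute the mean reciprocal rank of each title's own body.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer()
B = vec.fit_transform(bodies)            # index the article bodies
Q = vec.transform(titles)                # titles act as queries
scores = (Q @ B.T).toarray()             # tf-idf rows are L2-normalized: cosine
own = scores[np.arange(len(titles)), np.arange(len(titles))]
ranks = (scores > own[:, None]).sum(axis=1) + 1
print("MRR:", (1.0 / ranks).mean())      # higher = better retrieval features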
2) KNOW THE DATA
Know the data
Labelled/structured data: ODP, Freebase, Wikipedia, DBpedia, etc.
Know the data
Unlabelled data: WaCky, ClueWeb09, CommonCrawl, Ngram corpora
Ngrams
Google, Bing, Google Books
Roll your own: Common Crawl
Know the data
Do something stupid on a lot of data
Do something stupid on a lot of data: Ngrams
Spell-checking
Phrase segmentation
Word breaking
Synonyms
Language models
See “An Overview of Microsoft Web N-gram Corpus and Applications” (Wang et al 10)
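For instance, word breaking needs nothing but unigram counts: a Norvig-style sketch, with a toy count table standing in for a web-scale ngram corpus.

import math
from functools import lru_cache

COUNTS = {"choose": 100, "spain": 80, "chooses": 30, "pain": 60}  # stand-in
TOTAL = 1e9                               # assumed corpus size

def logprob(word):
    # Penalty for unseen words, harsher for longer strings
    return math.log(COUNTS.get(word, 10.0 / 10 ** len(word)) / TOTAL)

@lru_cache(maxsize=None)
def segment(text):
    """Best segmentation of text under a unigram language model."""
    if not text:
        return ()
    splits = ((text[:i],) + segment(text[i:]) for i in range(1, len(text) + 1))
    return max(splits, key=lambda words: sum(map(logprob, words)))

print(segment("choosespain"))             # -> ('choose', 'spain')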
Do something stupid on a lot of data
Web-scale k-means for NER (Lin and Wu 09)
Do something stupid on a lot of data
Web-scale clustering
Know the data
Multi-modal learning
Multi-modal learning: images and captions
[Diagram: image features = caption features, e.g. an image tagged "facepalm"]
Multi-modal learning: titles and article body
[Diagram: article-body features = title features]
Multi-modal learning: audio and tags
[Diagram: audio features = tag features, e.g. "upbeat", "hip hop"]
3) IT'S MODELS ALL THE WAY DOWN
Break down a pipeline:
1-best (greedy), k-best, Finkel et al. (06)
Good code to build on:
Stanford NLP tools, clustering algorithms, Terry Koo's parser, etc.
Good code to build on YOUR MODEL
Eat your own dogfood:
Bootstrapping (Yarowsky 95)
Co-training (Blum + Mitchell 98)
EM (Nigam et al. 00)
Self-training (McClosky et al. 06)
Dualist (Settles 11): active learning + semi-sup learning
Eat your own dogfood
Cheap bootstrapping: one step of EM (Settles 11)
“Awesome! What a great movie!”
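A hedged sketch of that cheap bootstrapping loop, assuming a scikit-learn classifier clf, labeled sparse data (X, y), and an unlabeled pool X_u: pseudo-label, keep only the high-precision predictions, retrain.

import numpy as np
from scipy.sparse import vstack

clf.fit(X, y)                              # train on what you have
proba = clf.predict_proba(X_u)             # pseudo-label the unlabeled pool
confident = proba.max(axis=1) > 0.95       # low recall + high precision
X_aug = vstack([X, X_u[confident]])
y_aug = np.concatenate([y, clf.classes_[proba[confident].argmax(axis=1)]])
clf.fit(X_aug, y_aug)                      # one EM-ish step; repeat if it helps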
It's models all the way down
Use models to annotate
Low recall + high precision + lots of data = win
Use models to annotate Face modeling
Pose-invariant face features
It's models all the way down
Joins on large noisy data sets
Joins on large noisy data sets
ReVerb (Fader et al., 11) http://reverb.cs.washington.edu
Extractions over the entire ClueWeb09 (826 MB compressed)
ReVerb (Fader et al., 11)
Joins on noisy data sets (can clean up the data??)
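One cheap cleanup that makes such joins workable, sketched under the assumption of ReVerb-style (arg1, rel, arg2) string triples: normalize entity strings before joining, so near-duplicate surface forms line up.

import re

def norm(s):
    s = s.lower().strip()
    s = re.sub(r"[^\w\s]", "", s)          # drop punctuation
    return re.sub(r"\s+", " ", s)          # collapse whitespace

def join(extractions_a, extractions_b):
    """Join arg2 of A against arg1 of B on normalized strings."""
    index = {}
    for arg1, rel, arg2 in extractions_b:
        index.setdefault(norm(arg1), []).append((arg1, rel, arg2))
    for arg1, rel, arg2 in extractions_a:
        for match in index.get(norm(arg2), []):
            yield (arg1, rel, arg2) + match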
The art of predictive analytics:
1) Know the data out there
2) Know the code out there
3) Intuition (bias)
Summary of recipes:
Know your problem
Throw in good features
Use others' good models in your pipeline
Make more training examples
Use a lot of data
"It especially annoys me when racists are accused of 'discrimination.' The ability to discriminate is a precious facility; by judging all members of one 'race' to be the same, the racist precisely shows himself incapable of discrimination."
Other cool research to look at:
* Frustratingly easy domain adaptation (Daume 07)
* The Unreasonable Effectiveness of Data (Halevy et al. 09)
* Web-scale algorithms (search on http://metaoptimize.com/qa/)
* Self-taught learning (Raina et al. 07)
2012.02.02