SLIDE 1 Machine Learning Fast & Slow
Suman Deb Roy
Lead Data Scientist @ betaworks
SLIDE 2
SLIDE 3 bot
www.digg.com/messaging
www.poncho.is www.rundexter.com
SLIDE 4
Runway Art & Science The Last 10%
SLIDE 5 1: Poncho
- A weather cat that sends you
personalized weather messages.
- Algorithms + Humans
- Not every feature in weather
data has equal importance – what's actionable?
SLIDE 6 2: Digg Trending
– 10 million RSS feeds, 200 million tweets, 7.5 million new articles ranked each day
m.me/digg
SLIDE 7
3: Digg Deeper
SLIDE 8
4: Instapaper’s InstaRank
SLIDE 9
5: Scale Model
Communities Not Keywords
SLIDE 10
MACHINE LEARNING WAS HARD
IT'S STILL HARD
SLIDE 11 VALUE of Algorithms
[Figure. Labels: Varied Distribution; Historical Data; Similarity between training & test distributions (less varied dist); Prediction Error; Impact of a more complex algorithm; Historical Data Value]
SLIDE 12 Moving fast and slow
– Experience, Similar Problems, Pre-existing pipelines
– New type of data, Bootstrap, Scaling
– how to jump between states, when to change gears.
SLIDE 13
[Figure: quadrants labeled Conscious Slow, Conscious Fast, Unconscious Slow, Unconscious Fast; additional labels Fast, Planned, Slow]
SLIDE 14 Effects of moving Fast
– Refactoring code
– Improving unit tests
– Deleting dead code
– Reducing dependencies
– Tightening APIs
– Improving documentation
SLIDE 15 Effects of moving Slow
– Waiting team mates
– Uncertain quality assurance
– Piling up further requests
– Hypothesis might not be feedback driven
– Overthinking the solution
SLIDE 16 Maintenance
– How researchable, reusable, deployable
– Eroding abstraction boundaries
– Data influences ML behavior.
SLIDE 17 Data vs. Code Organization
- Snapshotting .. Detects bias
- Interface at the method, be procedural
– Easy to execute portions of the code.
- Separate hyper-arguments from parameters
– Parameter: How your model is specified
– Hyper-Arguments: How your algorithm should run
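One possible reading of the parameter / hyper-argument split is sketched below, assuming a toy one-variable linear model trained by gradient descent. The names `RunArgs`, `ModelParams`, and `fit` are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunArgs:
    """Hyper-arguments: how the algorithm should run (hypothetical names)."""
    learning_rate: float = 0.05
    epochs: int = 500


@dataclass
class ModelParams:
    """Parameters: how the model is specified (here y = weight * x + bias)."""
    weight: float = 0.0
    bias: float = 0.0


def fit(xs, ys, args: RunArgs) -> ModelParams:
    """Plain batch gradient descent on mean squared error."""
    params = ModelParams()
    n = len(xs)
    for _ in range(args.epochs):
        # Residuals use the parameters as they were at the start of the step.
        residuals = [params.weight * x + params.bias - y for x, y in zip(xs, ys)]
        grad_w = sum(2 * r * x for r, x in zip(residuals, xs)) / n
        grad_b = sum(2 * r for r in residuals) / n
        params.weight -= args.learning_rate * grad_w
        params.bias -= args.learning_rate * grad_b
    return params


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1
learned = fit(xs, ys, RunArgs(learning_rate=0.05, epochs=2000))
```

Keeping `RunArgs` immutable and separate from `ModelParams` means a run can be repeated with the same settings, while only the parameters need to be stored alongside the trained model.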
SLIDE 18 Unstable APIs
- Who owns the data stream?
- Who owns the model?
- Ownership by
– entire solution
– Expertise? DB? Pipelines? Algorithms? Stats
– Frozen versioning instead of continual
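As an illustration of frozen versioning, the sketch below assumes a tiny in-memory registry in which a published model version can never be overwritten and consumers must ask for an exact version. `ModelRegistry` and the "trending" model are hypothetical examples, not part of any betaworks system described in the talk.

```python
class ModelRegistry:
    """Hypothetical in-memory registry; published versions are immutable."""

    def __init__(self):
        self._models = {}

    def publish(self, name, version, model):
        key = (name, version)
        if key in self._models:
            raise ValueError(f"{name}=={version} is frozen and cannot be overwritten")
        self._models[key] = model

    def load(self, name, version):
        # Callers must name an exact version; there is no "latest" alias.
        return self._models[(name, version)]


registry = ModelRegistry()
registry.publish("trending", "1.0.0", lambda score: score > 0.5)
registry.publish("trending", "1.1.0", lambda score: score > 0.7)

# A downstream consumer pins 1.0.0, so publishing 1.1.0 does not change its behavior.
pinned = registry.load("trending", "1.0.0")
```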
SLIDE 19 Feature Erosion
- User behavior with new model could make
features of current model unimportant
- How can we detect this?
- How can we prevent this?
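One rough way to approach the detection question is to compare how strongly each feature tracks the outcome in an older window of data versus a recent one. The sketch below assumes Pearson correlation as a crude stand-in for feature importance; the function names, the `min_drop` threshold, and the sample data are all hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def eroded_features(old_window, new_window, target, min_drop=0.3):
    """Return features whose |correlation| with target fell by at least min_drop."""
    flagged = []
    for name in old_window:
        if name == target:
            continue
        before = abs(pearson(old_window[name], old_window[target]))
        after = abs(pearson(new_window[name], new_window[target]))
        if before - after >= min_drop:
            flagged.append(name)
    return flagged


# Made-up data: "shares" tracked clicks before the new model shipped, but not after.
old = {"shares": [1, 2, 3, 4, 5, 6], "recency": [6, 5, 4, 3, 2, 1], "clicks": [2, 4, 6, 8, 10, 12]}
new = {"shares": [3, 1, 3, 1, 3, 1], "recency": [6, 5, 4, 3, 2, 1], "clicks": [12, 10, 8, 6, 4, 2]}
flagged = eroded_features(old, new, target="clicks")
```

Correlation only captures linear relationships, so a production check would likely need a model-based importance measure; the windowed comparison is the part this sketch is meant to show.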
SLIDE 20 Predictor Variables
- Myth: If you add a few more variables, the
predictor will be better.
- If the predictors have realistic priors, their
coefficients could be appropriately pulled down (in expectation) and overfitting shouldn’t be such a problem
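The shrinkage effect of a prior can be sketched for a single predictor, assuming a zero-mean Gaussian prior on the coefficient and Gaussian noise, in which case the MAP estimate is ridge regression. The function names and the sample data are hypothetical.

```python
def ols_slope(xs, ys):
    """Least-squares slope for y = w * x (no intercept, no prior)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def map_slope(xs, ys, noise_var, prior_var):
    """MAP slope under a zero-mean Gaussian prior w ~ N(0, prior_var).

    With Gaussian noise of variance noise_var this equals ridge regression
    with penalty lam = noise_var / prior_var.
    """
    lam = noise_var / prior_var
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


# Made-up predictor that is mostly noise with respect to y.
xs = [1.0, -1.0, 2.0, -2.0]
ys = [0.5, 0.3, -0.2, 0.4]

w_ols = ols_slope(xs, ys)
w_map = map_slope(xs, ys, noise_var=1.0, prior_var=0.1)
```

A tighter prior (smaller `prior_var`) pulls the coefficient further toward zero, which is the sense in which an extra, weakly informative variable does less damage.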
SLIDE 21
Visualizations
Any ML algorithm must be seen to believe it.
SLIDE 22
Visualizations
SLIDE 23 Research vs. Production
- Collaboration looks very different based on
the end goals
- Do you need to master git or just get by?
- How quickly can you move something from
iPython to production grade?
SLIDE 24 Even the best tools..
- Let's talk about iPython notebooks:
– Version Control
– Fragmented Code is deadly for production grade.
– Security issue: all those open ports
– Code Reviews and Pull Requests.
SLIDE 25
Heuristic Escape
“Heuristic is an algorithm in a clown suit. It’s less predictable, it’s more fun, and it comes without a 30-day, money-back guarantee.”
― Steve McConnell, Code Complete
SLIDE 26 Domain of Impact
- Most engineers and computer scientists will
conceptualize domains as primarily a rational, evidence-based, problem-solving enterprise focused on well-defined conditions.
- But the real world is ... more complex!
- e.g., Trending News Algorithms
SLIDE 27 Invention vs. Innovation
- What is ML good at? Both?
- Not outside the box; instead, connect the boxes.
- innovation = improve significantly by adjusting
ML method
- invention = totally new ML method.
SLIDE 28 Fitting ML into the betaworks model
Nexus
Product C Company A Research Company B
SLIDE 29 Code & Data Residence
– Code transfer
- Core module
- Model updating component
- Analysis component
– Data transfer
- Infrastructure rebuild?
- Performance
- Maintenance
SLIDE 30 Research ready pipelines
Powered by deepNews
SLIDE 31 Second order Analysis
Powered by deepNews + Scale Model
SLIDE 32
Conversational Software
SLIDE 33
SLIDE 34
SLIDE 35
HBI
HUMAN BOT INTERCONNECTION
SLIDE 36
[Figure. Labels: APIs; Apps for transactional tasks; Topic Modeling; DBpedia; Freebase; trending topics; digg deeper; Affective Computing. End labels: MANY automated solutions, ZERO automated solutions]
SLIDE 37
[Figure. Labels: APIs; Apps for transactional tasks; LDA; LSA; DBpedia; Freebase; Trending topics; Digg deeper; LSTM?; Tone Analyzer?. End labels: HIGH VALUE, LOW VALUE of historical data]
SLIDE 38 Data Types by Company
- Digg has topic modeling/news data
- Scale Model has social graph data
- Poncho has weather data/editorialized
personality
- Giphy has gifs (emotion++)
- Instapaper has reading data
- Dexter has hooks to APIs
SLIDE 39 Transfer Learning
Yosinski et al., "How transferable are features in deep neural networks?", in NIPS 2014
SLIDE 40 To Sum up
- Constraints to ML solutions occur at three
levels:
– Algorithmic
– Data
– Humans
- These parameters lead to several oscillating
cycles of fast and slow impact of ML
SLIDE 41 ML 2016
- Understood by few, hyped by some, revered by
most.
- Can be the difference between a company scaling
vs. closing shop.
- Almost every company can have at least 1
product feature powered by ML.
- Be careful about bias in data.
SLIDE 42
data.betaworks.com Suman Deb Roy suman@betaworks.com | @_roysd