Concept Drift: Learning on Data Streams
Pádraig Cunningham
Director, Insight @ UCD; PI @ CeADAR
Online Learning & Concept Drift
Predictive Analytics
❏ The Old Days ❏ Static data, the concept doesn’t change
❏ Velocity & Variety rather than Volume ❏ Concept Drift ❏ Online Learning
❏ Learning on Data Streams ❏ Model Update
Tools
❏ MOA: Massive Online Analysis (moa.cms.waikato.ac.nz) ❏ from the makers of Weka ❏ Apache Spark ❏ RapidMiner (rapidminer.com)
A typical predictive analytics task
Heart attack patient admitted
❏ 19 variables measured during first 24 hours ❏ Blood pressure, age, + 17 other ordered and binary variables ❏ Considered important indicators of patient’s condition
Goal:
❏ Learn from historic data ❏ Build a model to identify high risk patients ❏ i.e. will not survive 30 days ❏ (based on evidence of initial 24-hour data)
No  Age  BMI  BP   Res.
1   60   20   140  Ok
2   60   21   145  Ok
3   85   23   130  Ok
4   81   22   160  No
5   70   24   170  No
6   72   26   135  No
7   81   26   145  No
8   66   23   155  No
Q   66   24   148  ?
❏ Consider just 3 features ❏ Assumed to be a static ‘concept’: this model is good for all time
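The table above can be turned into a minimal classifier sketch. Using a 1-nearest-neighbour rule on the three unscaled features (Age, BMI, BP) is my simplification for illustration, not something the slides prescribe:

```python
# 1-nearest-neighbour sketch over the three features in the table above.
# Unscaled Euclidean distance is a simplifying assumption.
patients = [
    ((60, 20, 140), "Ok"), ((60, 21, 145), "Ok"), ((85, 23, 130), "Ok"),
    ((81, 22, 160), "No"), ((70, 24, 170), "No"), ((72, 26, 135), "No"),
    ((81, 26, 145), "No"), ((66, 23, 155), "No"),
]

def predict(query):
    # return the label of the training example closest to the query
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(patients, key=lambda p: dist2(p[0], query))[1]

print(predict((66, 24, 148)))  # the query row Q from the table
```

The query row Q sits closest to patient 8, so the static model flags it as high risk. The point of the slides is that a model like this is only "good for all time" if the concept behind the data never moves.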
Predictive Analytics
AKA: Supervised ML
Volume: Big Data’s Little Secret
‘Big’ doesn’t really matter…
❏ Typically not a lot of data needed to build a good model
❏ Example: classification task on cancer microscopy images (prostate cancer, Gleason grade 3 vs. Gleason grade 4)
Velocity: Game Analytics
Task:
❏ User segmentation: predict return from new users ❏ Inputs: player profile ❏ Outputs: premium user, yes/no?
(Figures: game lifecycle (user numbers); technology adoption (user type))
Consider: a model trained on Early Adopters, used on the Late Majority.
Google Trends
Other game lifecycles
❏ FarmVille ❏ Angry Birds ❏ Flappy Bird
Energy Demand Prediction
Demand profile has different ‘regimes’
Concept Drift
Over time, things the model expects to be positive come up negative.
❏ Bad loans ❏ Antibiotic resistance ❏ Conversion / Churn prediction
(Figure: the concept at training time vs. later on)
Concept Drift: Spam Detection
Without retraining, error creeps up over time...
(Figure: error % vs. time, static model vs. updating model; Delany, Cunningham, Tsymbal, FLAIRS 2006)
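The creeping-error effect can be reproduced on a toy stream. This sketch is my own construction, not the FLAIRS 2006 experiment: a threshold ‘concept’ drifts over time, and a static learner is compared with one retrained on a sliding window:

```python
import random

random.seed(0)

def label(x, theta):
    # true concept: positive when x is above the threshold theta
    return 1 if x > theta else 0

def fit_threshold(window):
    # crude learner: put the threshold halfway between the two classes
    zeros = [x for x, y in window if y == 0]
    ones = [x for x, y in window if y == 1]
    if not zeros or not ones:
        return 0.5
    return (max(zeros) + min(ones)) / 2

# stream whose concept drifts: the true threshold moves 0.3 -> 0.7
stream = []
for t in range(1000):
    theta = 0.3 + 0.4 * t / 999
    x = random.random()
    stream.append((x, label(x, theta)))

static = fit_threshold(stream[:100])   # trained once, never updated
window = list(stream[:100])
model = fit_threshold(window)
static_err = updating_err = 0
for x, y in stream[100:]:
    static_err += label(x, static) != y
    updating_err += label(x, model) != y
    window = window[1:] + [(x, y)]     # slide the training window
    model = fit_threshold(window)      # retrain as new outcomes arrive

print(static_err, updating_err)
```

The static model keeps the threshold it learned early on, so its errors accumulate as the concept moves away from it, while the windowed model tracks the drift and stays close to the true threshold.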
Model Update Strategies
❏ First: when does new data (outcomes) become available?
❏ Immediately: energy consumption, financial prediction ❏ Time lag: credit scoring, online games (financial return) ❏ Sometimes never
(Figure: inputs and outputs along a timeline from t−1 to t, with the training window size marked)
❏ Key parameters
❏ window size ❏ update frequency
Model Update
❏ Retrain model on recent ‘window’ of data ❏ Window size
❏ Big enough: hundreds of examples ❏ Oldest data in window must be relevant
❏ Update frequency
❏ model should be up to date ❏ don’t need to retrain on every click
(Figure: window too big vs. about right; is there enough data?)
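The two knobs above, window size and update frequency, can be sketched as a small wrapper. The class name, parameters, and majority-class ‘learner’ are placeholders of mine; a real learner would be substituted into the retrain step:

```python
from collections import Counter, deque

class WindowedModel:
    # Sliding-window learner. The two knobs are window_size and
    # update_every; the "model" is a majority-class baseline to keep
    # the sketch short -- swap in any real learner in retrain().
    def __init__(self, window_size=200, update_every=50):
        self.window = deque(maxlen=window_size)  # old data ages out
        self.update_every = update_every
        self.seen = 0
        self.majority = None

    def observe(self, x, y):
        self.window.append((x, y))
        self.seen += 1
        if self.seen % self.update_every == 0:
            self.retrain()  # no need to retrain on every click

    def retrain(self):
        labels = [y for _, y in self.window]
        self.majority = Counter(labels).most_common(1)[0][0]

    def predict(self, x):
        return self.majority

m = WindowedModel(window_size=100, update_every=25)
for i in range(300):
    # abrupt concept drift halfway through the stream
    m.observe(i, "ok" if i < 150 else "risk")
print(m.predict(None))  # the old "ok" examples have aged out of the window
```

Because the window holds only the most recent 100 examples, the model forgets the pre-drift concept automatically; choosing the window size is a trade-off between having enough data and keeping only relevant data.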
Software
❏ MOA (from the makers of Weka)
❏ Apache Spark (from Apache)
Summary
The Old Days:
❏ Train static models on historic data
Predictive analytics on data streams
❏ Underlying ‘concept’ changes over time ❏ How will we know? ❏ Model can be updated using new data
Adaptive models
❏ Data workflows become a big issue ❏ New parameters to be considered ❏ Window size ❏ Update frequency