

SLIDE 1

Concept Drift:

Learning on Data Streams

Pádraig Cunningham Director Insight @ UCD PI @ CeADAR

SLIDE 2

Online Learning & Concept Drift

Predictive Analytics

❏ The Old Days
❏ Static data, the concept doesn’t change

Velocity & Variety rather than Volume

Concept Drift / Online Learning

❏ Learning on Data Streams
❏ Model Update

Tools

❏ MOA: Massive Online Analysis (moa.cms.waikato.ac.nz)
❏ from the makers of Weka
❏ Apache Spark
❏ RapidMiner (rapidminer.com)
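The ‘model update’ idea above can be sketched in plain Python as an online perceptron: the model is updated one example at a time as the stream arrives, rather than retrained from scratch. This is a toy sketch, not any tool's actual API; the tools listed (MOA, Spark, RapidMiner) provide production versions of this.

```python
# Toy online learner: a perceptron updated incrementally per streamed example.
import random

random.seed(0)

w = [0.0, 0.0, 0.0]        # model weights, updated incrementally

def predict(x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

def update(x, y, lr=0.1):  # one 'model update' step per streamed example
    err = y - predict(x)
    for i in range(len(w)):
        w[i] += lr * err * x[i]

for _ in range(2000):      # simulated data stream
    x = [random.uniform(-1, 1) for _ in range(3)]
    y = 1 if x[0] + x[1] > 0 else 0   # the underlying 'concept'
    update(x, y)

print(predict([1.0, 1.0, 0.0]), predict([-1.0, -1.0, 0.0]))
```

Each streamed example costs one cheap weight update, so the model stays current without ever revisiting old data.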

SLIDE 3

A typical predictive analytics task

Heart attack patient admitted

❏ 19 variables measured during first 24 hours
❏ Blood pressure, age, + 17 other ordered and binary variables
❏ Considered important indicators of patient’s condition

Goal:

❏ Learn from historic data
❏ Build a model to identify high risk patients
❏ i.e. will not survive 30 days
❏ (based on evidence of initial 24-hour data)

No   Age   BMI   BP    Res.
1    60    20    140   Ok
2    60    21    145   Ok
3    85    23    130   Ok
4    81    22    160   No
5    70    24    170   No
6    72    26    135   No
7    81    26    145   No
8    66    23    155   No
Q    66    24    148   ?

Consider just 3 features. The task is assumed to be a static ‘concept’: this model is good for all time.
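As an illustration of the task above, a 1-nearest-neighbour classifier on the three features (Age, BMI, BP) from the toy table can classify the query patient Q. The choice of classifier, the squared-Euclidean distance, and the lack of feature scaling are all simplifying assumptions, not part of the slides.

```python
# Toy version of the task: classify query patient Q with 1-nearest-neighbour
# on the three features (Age, BMI, BP) from the table above.

train = [
    (60, 20, 140, "Ok"), (60, 21, 145, "Ok"), (85, 23, 130, "Ok"),
    (81, 22, 160, "No"), (70, 24, 170, "No"), (72, 26, 135, "No"),
    (81, 26, 145, "No"), (66, 23, 155, "No"),
]

def predict(query):
    # Squared Euclidean distance over the 3 features (an assumption)
    def dist(row):
        return sum((a - b) ** 2 for a, b in zip(row[:3], query))
    return min(train, key=dist)[3]

print(predict((66, 24, 148)))   # query patient Q from the table
```

Here patient 8 (66, 23, 155) is the nearest neighbour, so Q is classified "No" (high risk).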

SLIDE 4

Predictive Analytics

AKA: Supervised ML

SLIDE 5

Volume: Big Data’s Little Secret

‘Big’ doesn’t really matter…

❏ Typically not a lot of data needed to build a good model

Example: a classification task on cancer microscopy images (prostate cancer, Gleason Grade 3 vs. Gleason Grade 4).

SLIDE 6

Velocity: Game Analytics

Task:

User segmentation: predict return from new users
❏ Inputs: player profile
❏ Outputs: premium user, yes/no?

[Charts: game lifecycle (user numbers); technology adoption (user type)]

Consider: a model trained on Early Adopters, then used on the Late Majority.

SLIDE 7

Google Trends

Other game lifecycles

Farmville, Angry Birds, Flappy Bird

SLIDE 8

Energy Demand Prediction

Demand profile has different ‘regimes’

SLIDE 9

Concept Drift

Over time, things that the model expects to be positive come up negative.

❏ Bad loans
❏ Antibiotic resistance
❏ Conversion / Churn prediction

[Figure: class distribution at training time vs. later on]

SLIDE 10

Concept Drift: Spam Detection

Without retraining error creeps up over time...

[Figure: error % over time for a static model vs. an updating model]

Delany, Cunningham, Tsymbal, FLAIRS 2006
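A minimal simulation of the effect in the figure, under assumptions of my own: a synthetic one-dimensional concept whose decision boundary drifts each time step. The static model keeps its original boundary, so its error creeps up; the updating model re-fits on each new batch and tracks the drift.

```python
# Simulate concept drift: a 1-D threshold concept whose boundary moves.
import random

random.seed(1)

def label(xs, b):          # ground-truth rule: x above the boundary is positive
    return [x > b for x in xs]

def error(pred, truth):
    return sum(p != t for p, t in zip(pred, truth)) / len(truth)

def fit_boundary(xs, ys):  # toy 'retraining': midpoint between the classes
    pos = [x for x, y in zip(xs, ys) if y]
    neg = [x for x, y in zip(xs, ys) if not y]
    return (min(pos) + max(neg)) / 2

static_b = updating_b = 0.3          # both models trained at time 0
static_err, updating_err = [], []
for t in range(10):
    true_b = 0.3 + 0.05 * t          # the concept drifts over time
    xs = [random.random() for _ in range(500)]
    ys = label(xs, true_b)
    static_err.append(error(label(xs, static_b), ys))
    updating_err.append(error(label(xs, updating_b), ys))
    updating_b = fit_boundary(xs, ys)   # retrain on the newest batch

print(round(static_err[-1], 2), round(updating_err[-1], 2))
```

The static model's error grows roughly with the distance the boundary has drifted, while the updating model lags the concept by only one batch.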

SLIDE 11

Model Update Strategies

❏ First: when does new data (outcomes) become available?

❏ Immediately: energy consumption, financial prediction
❏ Time lag: credit scoring, online games - financial return
❏ Sometimes never:

[Diagram: inputs and outputs over time (t-1, t); training window size marked, outcome at t still unknown (?)]

❏ Key parameters

❏ window size
❏ update frequency


SLIDE 12

Model Update

❏ Retrain model on recent ‘window’ of data
❏ Window size

❏ Big enough: hundreds of examples
❏ Oldest data in window must be relevant

❏ Update frequency

❏ model should be up to date
❏ don’t need to retrain on every click

[Figure: window too big vs. window about right: enough data?]
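The two key parameters, window size and update frequency, can be sketched with a fixed-size deque over a simulated stream. The numbers and the `retrain` function are hypothetical stand-ins for a real model fit on the window contents.

```python
# Sliding-window retraining: keep the newest WINDOW_SIZE examples,
# retrain every UPDATE_EVERY examples rather than on every click.
from collections import deque

WINDOW_SIZE = 300        # oldest data in the window must still be relevant
UPDATE_EVERY = 100       # keep the model up to date, but don't retrain per example

window = deque(maxlen=WINDOW_SIZE)   # fixed-size window drops oldest data
retrain_count = 0

def retrain(data):       # hypothetical stand-in for fitting a real model
    global retrain_count
    retrain_count += 1
    return len(data)     # pretend the 'model' is just its training-set size

model = None
for i in range(1000):                   # simulated stream of examples
    window.append((i, i % 2))           # (input, outcome) pairs
    if (i + 1) % UPDATE_EVERY == 0:
        model = retrain(list(window))

print(retrain_count, model)
```

With these settings, 1000 streamed examples trigger only 10 retrains, and each retrain sees at most the 300 most recent examples: the deque's `maxlen` implements the window, the modulo check implements the update frequency.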

SLIDE 13

Software

From the makers of Weka...

SLIDE 14

Software

From Apache...

SLIDE 15

Summary

The Old Days:

❏ Train static models on historic data

Predictive analytics on data streams

❏ Underlying ‘concept’ changes over time
❏ How will we know?
❏ Model can be updated using new data

Adaptive models

❏ Data workflows become a big issue
❏ New parameters to be considered
❏ Window size
❏ Update frequency