Real Data Real Issues what they never told me at Uni Phil Brierley - - PowerPoint PPT Presentation

real data real issues
SMART_READER_LITE
LIVE PREVIEW

Real Data Real Issues what they never told me at Uni Phil Brierley - - PowerPoint PPT Presentation

Real Data Real Issues what they never told me at Uni Phil Brierley WOMBAT 18 th February 2016 Data Prep v Predictive Modelling Time wise 80% - Data Prep 5% - Model Building 50% - Explaining results Data Scientists like to over


slide-1
SLIDE 1

Real Data – Real Issues

what they never told me at Uni Phil Brierley WOMBAT 18th February 2016

slide-2
SLIDE 2

Data Prep v Predictive Modelling

Time wise… 80% - Data Prep 5%

  • Model Building

50% - Explaining results

Data Scientists like to over deliver!

slide-3
SLIDE 3
  • Model building is now becoming a point and

click commodity

  • ‘Correct’ data preparation will never be this
  • Rubbish in – Rubbish out
  • Caveat Emptor

let the buyer beware

slide-4
SLIDE 4
  • Knowing your data is the most important thing
  • Don’t listen to ‘expert opinion’
  • The data contains all the questions

the ‘experts’ may have the answers

slide-5
SLIDE 5

This talk is nothing to do with Maths, Statistics or Algorithms Its to do with the 90% of your time you will spend getting your data ‘Algorithm Ready’

slide-6
SLIDE 6

4 Steps

1. Getting the Data

– Receiving the data

2. Sanity Checking the Data

– Is it consistent?

3. Preprocessing the Data

– Don’t introduce issues

  • 4. Predictive Modelling

– Last 5 mins of project timeline & budget!

Getting the data ‘Algorithm Ready’ But we’ll talk a little

  • n this
slide-7
SLIDE 7
  • 1. Getting the Data

A long process

  • but can be made less painful
slide-8
SLIDE 8

Getting the Data

  • Can involve 2 or 3 iterations if its not extracted

correctly

  • This can be avoided by specifying exactly how

you want it

slide-9
SLIDE 9

Getting the Data

  • Raw data only

– We’ll do any aggregation – Quicker for us and you – We need to know what has gone on

  • All the data (size permitting)

– We’ll decide what populations not to use – Maybe only filter on time – Much quicker, storage is cheap – Not saving us or you any time by doing filtering your end

slide-10
SLIDE 10

Getting the Data

  • Database dump if possible

– Detach database, we’ll reattach – Ensures our ‘solution’ can be run in your production environment (thinking ahead!)

  • Delimited Text Files

– Pipe delimited (|) – No quotes around text fields

slide-11
SLIDE 11

Getting the Data

  • EXCEL

– Excel generally means humans involved – BAD! – Hard to replicate exactly what has gone on (see above) – Excel does weird stuff (see later) – Source data won’t be Excel (hopefully) – Putting it in Excel to ‘help us’ is not actually helping (the first thing we do is try get it out of Excel)

slide-12
SLIDE 12
  • 2. Sanity Checking

Identifying Systematic Data Issues Data driven predictive modelling assumes the future will be like the past – we need to make sure the past is like the past

slide-13
SLIDE 13
  • 1. Acquisitions and Events

June 2,000

  • lifetime health cover (change in govt. policy)

acquisition Time (months) >> New Members Health Insurer

slide-14
SLIDE 14
  • 1. Acquisitions and Events

‘made up’ data ! Company Financial Statement Dates Date >> Number of Reports

slide-15
SLIDE 15
  • 2. Made up Dates

Company Financial Statement Times Time of Day >> Year

Market Open Before Market Open After Market Close

slide-16
SLIDE 16
  • 2. Made up Dates

Date of Announcement >>

slide-17
SLIDE 17
  • 2. Made up Dates

2am clock change Sometimes fudged ! Electricity Consumption

slide-18
SLIDE 18
  • 2. Made up Dates
  • Always look at date distributions
  • Dates usually cannot be ‘null’ in a database
  • Thus common to see system default dates
  • Happens when data sources are ‘merged’ or

dates are unknown

slide-19
SLIDE 19
  • 3. Moving Definitions

Data gets updated retrospectively

  • nce all claims are

received

Date >> Medicare Non- Medicare

hospital hospital non hospital non hospital

Health Insurance Claims – time lags

slide-20
SLIDE 20
  • 3. Moving Definitions

What is the reason for this? Date >>

Hospital Admissions

Number of Admissions

slide-21
SLIDE 21
  • 3. Moving Definitions

Has Congenital Heart Disease suddenly been cured? Number of Admissions for Congenital Heart Disease Date >> Admissions

slide-22
SLIDE 22
  • 3. Moving Definitions

Date >> Number of Admissions to Emergency Dept. 1st July

slide-23
SLIDE 23
  • 3. Moving Definitions
slide-24
SLIDE 24
  • 4. Data Capture Bias

Only ‘odd lots’ recorded prior to 2014

Data thus biased towards lower priced shares

  • dependent on

volume of shares, not $ amount

Date >> Volume of Trades – Exchange reporting policy

Minimum Volume

(Share Splits)

slide-25
SLIDE 25
  • 5. Real or Systematic ?

Time of Day Discharge Times from Hospital (you don’t actually clock out!) Time of 6pm allocated to everyone Chemotherapy Non - Chemotherapy

slide-26
SLIDE 26
  • 6. System Overrides

Exam Mark Freq

Looks OK?

slide-27
SLIDE 27
  • 6. System Overrides

Exam Mark Freq

Hmmm! What do you think the pass mark is?

slide-28
SLIDE 28
  • 3. Pre-processing data

Avoid Self Imposed Coding Errors

slide-29
SLIDE 29
  • 1. Integer Division
slide-30
SLIDE 30
  • 1. Integer Division

Documented – but who reads documentation!

slide-31
SLIDE 31
  • 2. Floats & Reals

Comparisons of numbers dependent on format

slide-32
SLIDE 32
  • 2. Floats & Reals

It is documented !!

slide-33
SLIDE 33
  • 3. Nulls

Null is ‘unknown’ – so any calculation on records containing a Null correctly returns Null (not necessarily intuitive)

slide-34
SLIDE 34
  • 3. Nulls

Technically correct as NULL means ‘I don’t know’ …but none the less, not what you might be expecting

slide-35
SLIDE 35
  • 4. Nulls (again)

The disappearing record Records = 5 Yes = 2 Not Yes = 2 2 + 2 != 5 (not intuitive)

slide-36
SLIDE 36
  • 5. Nulls (again, again)

Beware Function Defaults

slide-37
SLIDE 37
  • 5. Nulls (again, again)

This genuinely caught me out!

slide-38
SLIDE 38
  • 6. Beware software quirks
  • the midnight hour

(caught me out again!) Bang on midnight the seconds disappear

slide-39
SLIDE 39
  • 7. Stay away from Excel !!
  • 1-3
  • Excel will convert it to 3-Jan
  • Convert cell to text and it becomes 42007
  • mm/dd/yy or dd/mm/yy
  • Phil's Rules

– avoid Excel as it has a mind of it’s own. – data used for modelling should go nowhere near Excel

slide-40
SLIDE 40
  • 8. Damn Smileys!

(damn Microsoft)

slide-41
SLIDE 41
  • 4. Predictive Modelling

If it looks too good to be true – it normally is

slide-42
SLIDE 42

Predictive Modelling

  • These days, all you need to know is:

– Ensembling – Over fitting (or how to avoid it) – Calculating Variable Importance

  • Helps detect information leakage
slide-43
SLIDE 43

If it’s too good to be true…

  • 1. University Attrition

(voluntary or involuntary)

  • 2. Insurance Claims

(level of cover)

  • 3. ID is a proxy for the outcome

(kdd Cup)

slide-44
SLIDE 44

Thank you for listening