Big data: What are you missing? The risks of assuming data equals “all”
David J. Hand
Imperial College London and Winton Capital Management
6th January 2016
Theory of Big Data, UCL

BACKGROUND
The promise of big data:
McKinsey’s big data report: ‘we are on the cusp of a tremendous wave of innovation, productivity, and growth, as well as new modes of competition and value capture, all driven by big data as consumers, companies, and economic sectors exploit its potential’
and endless ditto by others

However, as I have argued elsewhere:
1) big data is not the solution: it’s what you do with it that counts
2) big data carries risks

Two kinds of big data opportunities
1) Computer science: through data manipulation
   merging, linking, matching, concatenating, sorting, basic arithmetic, ...
   Database heritage: could conceivably have all the data
   - e.g. stock in the warehouse
   - e.g. employees in the firm
2) Statistics: through inference and predictive analytics
   Many (most?) problems cannot have all the data
   - e.g. observations in clinical trials
   - e.g. forecasting
   - e.g. physics experiments

The challenges of big data
1) Computational and mathematical challenges
   - large n and/or d
   - speed of acquisition, real-time analysis
   Hand’s Law: the requirements for increased computer power always increase faster than the increase in power itself
2) Inferential and statistical challenges
   - complexity: networks, mixed data types, ...

3) Data challenges
   - data quality
   - non-stationarity
   - formulating the question
   - correlation vs causation
   - ...

THE AIM OF THIS PAPER:
To focus on one problem and show how it is pervasive in big data opportunities
- risking misleading conclusions
- incorrect understanding
- mistaken decisions
- wasted money
- ...
And to show what’s needed to tackle it

This is the problem of SELECTION BIAS

SOME EXAMPLES:
Example 1: Potholes
- Streetbump smartphone app
- Detects potholes using the accelerometer and, using GPS, emails their location to the local authority
- “Big data”, but no sophisticated computation or analytics
- But lower income people are less likely to have smartphones and cars, older people are less likely to have smartphones, ...
→ streets in richer areas get fixed

Example 2: Hurricane Sandy
20 million tweets between 27 October and 1 November 2012
But a distorted impression of where problems are:
- most tweets came from Manhattan
- few from “more severely affected locations, such as Breezy Point, Coney Island and Rockaway”
- because of relative density of population/smartphones
- because power outages meant phones were not recharged
→ distorted impression of where the damage occurred

Example 3: Retail finance scorecard construction
Aim: build a model to decide which applicants should be given a loan
Data: characteristics and (default/repay) outcome of those granted loans in the past
- but those granted loans in the past were selected on the basis of some previous scorecard
- they do not represent the entire population of applicants
Same structure for student selection, staff recruitment, ...
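The scorecard example can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the uniform `risk`, the noisy `old_score`, the `cutoff` and the function name are all hypothetical and are not taken from the talk. Unlike a real lender, the simulation also sees the outcomes of rejected applicants, which is what makes the comparison possible.

```python
import random

def simulate_accept_only_sample(n_applicants=100_000, cutoff=0.3, seed=0):
    """Compare the default rate among previously accepted applicants
    with the default rate in the full applicant population."""
    rng = random.Random(seed)
    all_defaults = 0
    accepted = 0
    accepted_defaults = 0
    for _ in range(n_applicants):
        risk = rng.random()                    # hypothetical true default probability
        old_score = risk + rng.gauss(0, 0.1)   # previous scorecard: a noisy view of risk
        defaulted = rng.random() < risk
        all_defaults += defaulted
        if old_score < cutoff:                 # only these applicants were granted loans
            accepted += 1
            accepted_defaults += defaulted
    return all_defaults / n_applicants, accepted_defaults / accepted

population_rate, accepted_rate = simulate_accept_only_sample()
print(f"default rate, all applicants:      {population_rate:.3f}")
print(f"default rate, accepted applicants: {accepted_rate:.3f}")
```

A model fitted only to the accepted applicants sees a much lower default rate, and a much narrower range of risk, than the population it will later be applied to.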

Example 4: Crime rates
Points to note:
1) The difference between the CSE&W and PRC
2) The dramatic fall in CSE&W from 1995

1) Crime Survey for England and Wales (CSE&W) versus Police Recorded Crime (PRC)
CSE&W: aged ≥ 16; children 10-15; not group residences; not crimes against commercial or public sector bodies; victim-based (does not include murder); not fraud and cyber; capping repeat victimisation; ...
PRC: reported to and recorded by police; crime defined by “Notifiable Offence List” (incl. murder, public order, ...); incl. residents of institutions and tourists; incl. commercial bodies
2) CSE&W: 19m in 1995 to 7m in y.e. June 2015
Less crime, or shifting patterns of crime, e.g. to fraud, not measured on CSE&W

Plastic card fraud in the UK, 2004-2014

Example 5: Publication bias
Relevant factors include:
- tendency not to submit negative results (file-drawer effect);
- positive results are more interesting to editors;
- anomalous results may be regarded as errors, and not submitted
In an exploration of publication bias in the Cochrane database of systematic reviews:
“In the meta-analyses of efficacy, outcomes favoring treatment had on average a 27% ... higher probability to be included than other outcomes. In the meta-analyses of safety, results showing no evidence of adverse effects were on average 78% ... more likely to be included than results demonstrating that adverse effects existed.” Kicinski et al. 2015
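The file-drawer effect can be sketched as a toy simulation. The true effect of zero, the publication rule and the probability `p_publish_other` are hypothetical choices for illustration only; they are not the figures estimated by Kicinski et al.

```python
import random
import statistics

def simulate_file_drawer(n_studies=20_000, true_effect=0.0, se=1.0,
                         p_publish_other=0.2, seed=1):
    """Return (mean of all study estimates, mean of published estimates)."""
    rng = random.Random(seed)
    all_estimates, published = [], []
    for _ in range(n_studies):
        estimate = rng.gauss(true_effect, se)
        all_estimates.append(estimate)
        significant_positive = estimate > 1.96 * se
        # Assumption: significant positive results are always published,
        # every other result only with probability p_publish_other.
        if significant_positive or rng.random() < p_publish_other:
            published.append(estimate)
    return statistics.fmean(all_estimates), statistics.fmean(published)

mean_all, mean_published = simulate_file_drawer()
print(f"mean estimate, all studies:       {mean_all:+.3f}")
print(f"mean estimate, published studies: {mean_published:+.3f}")
```

Although the true effect is zero, averaging only the published estimates suggests a positive effect, which is what a meta-analysis of the published literature would see.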

WHAT DRIVES SELECTION BIAS:
1) Natural mechanisms
Abraham Wald and the WWII bomber armour
The bullet holes in returning bombers showed where they could be hit without bringing them down
A lesson for business schools? Look at the failures, not the successes
Francis Bacon: “when they showed him hanging in a temple a picture of those who had paid their vows as having escaped shipwreck, and would have him say whether he did not now acknowledge the power of the gods, ‘Aye,’ asked he again, ‘but where are they painted that were drowned after their vows?’”
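The bomber story can be made concrete with a toy simulation. It is an illustration of the selection mechanism only, not Wald's actual analysis: the sections, the per-hit loss probabilities in `LOSS_PROB` and the number of hits per aircraft are all invented.

```python
import random
from collections import Counter

# Hypothetical probability that a single hit in each section downs the aircraft.
LOSS_PROB = {"engine": 0.8, "cockpit": 0.6, "fuselage": 0.1, "wings": 0.1}

def simulate_returning_holes(n_planes=20_000, hits_per_plane=3, seed=2):
    """Count bullet holes by section, for all aircraft and for returners only."""
    rng = random.Random(seed)
    sections = list(LOSS_PROB)
    holes_all = Counter()
    holes_returned = Counter()
    for _ in range(n_planes):
        # Hits land on each section with equal probability.
        hits = [rng.choice(sections) for _ in range(hits_per_plane)]
        holes_all.update(hits)
        survived = all(rng.random() >= LOSS_PROB[s] for s in hits)
        if survived:                      # only these aircraft can be inspected
            holes_returned.update(hits)
    return holes_all, holes_returned

holes_all, holes_returned = simulate_returning_holes()
for section in LOSS_PROB:
    print(f"{section:9s} all: {holes_all[section]:6d}  returned: {holes_returned[section]:6d}")
```

Hits are spread evenly over the sections, yet the returning aircraft show few engine and cockpit holes: the sections with the fewest observed holes are the ones where a hit is most dangerous.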

2) Non-response and refusals
LFS quarterly survey wave-specific response rates: March-May 2000 to July-Sept 2015
http://www.ons.gov.uk/ons/guide-method/method-quality/specific/labour-market/labour-force-survey/index.html

3) Self-selection
(i) The magazine survey which asks the one question: do you reply to magazine surveys?
(ii) The Literary Digest’s disastrous prediction that Landon would beat Roosevelt in the 1936 presidential election
Standard explanation: the prediction was based on polling people with phones, who are more likely to be Republican
But this is a myth
In fact 10m people were polled, but only 2.3m replied
A self-selected sample, and in this election the anti-Roosevelt voters felt more strongly than the pro-Roosevelt voters

(iii) The Actuary edition of July 2006 included an editorial which said
‘A couple of months ago I invited you - all 16,245 of you - to participate in our online survey concerning the sex of actuarial offspring. ... Well, I’m pleased to say that a number of you (13, in fact) replied to our poll.’
Particularly web-based surveys
- who replies?
- under-representation of some groups
- multiple responding
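The response rates quoted in these self-selection examples, and the way differential response distorts an estimate, can be worked through in a few lines. The counts (10m polled, 2.3m replies; 16,245 invited, 13 replies) come from the slides. The 60% true share and the per-group response rates of 15% and 35% are hypothetical, chosen only so that the overall response rate comes out at 23%.

```python
def respondent_share(true_share, response_rate_group, response_rate_others):
    """Share of respondents belonging to a group, when the group and
    everyone else reply to the survey at different rates."""
    responders_group = true_share * response_rate_group
    responders_others = (1 - true_share) * response_rate_others
    return responders_group / (responders_group + responders_others)

literary_digest_rate = 2.3e6 / 10e6      # 23% of those polled replied
actuary_rate = 13 / 16_245               # well under 0.1% replied

# Hypothetical: 60% of voters favour a candidate, but only 15% of them
# reply, against 35% of the candidate's opponents.
observed_share = respondent_share(0.60, 0.15, 0.35)

print(f"Literary Digest response rate: {literary_digest_rate:.1%}")
print(f"The Actuary response rate:     {actuary_rate:.2%}")
print(f"share among respondents:       {observed_share:.1%}")
```

Under these assumed rates a candidate backed by 60% of the population appears to have about 39% support among those who reply, however many millions of replies are collected.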

4) Data dredging
Test enough (true null) hypotheses and you expect some to be significant by chance
This does not have to be dishonest: if 1000 teams each test one true null hypothesis at the 5% level ...
Charles Babbage termed such data dredging “cooking”: “make multitudes of observations, and out of these to select only those which agree, or very nearly agree. If a hundred observations are made, the cook must be very unlucky if he cannot pick out fifteen or twenty which will do for serving up”
Robert Millikan, Gregor Mendel, ...
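The arithmetic behind the 1000-teams remark can be sketched as follows. It assumes the tests are independent and that each p-value is uniform on [0, 1] under a true null (as for a continuous test statistic); the function names are illustrative.

```python
import random

def expected_false_positives(n_tests, alpha):
    """Expected number of 'significant' results when every null is true."""
    return n_tests * alpha

def prob_at_least_one(n_tests, alpha):
    """Chance of at least one false positive among independent tests."""
    return 1 - (1 - alpha) ** n_tests

def simulate_teams(n_teams=1000, alpha=0.05, seed=3):
    """Each team tests one true null; count how many find 'significance'."""
    rng = random.Random(seed)
    # Under a true null the p-value is modelled as uniform on [0, 1].
    return sum(rng.random() < alpha for _ in range(n_teams))

print(expected_false_positives(1000, 0.05))   # about 50 teams expected
print(prob_at_least_one(20, 0.05))            # about 0.64 for just 20 tests
print(simulate_teams())                       # one simulated count
```

If only the teams with a significant result write it up, the literature contains dozens of apparent findings and no trace of the roughly 950 null results.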

5) Harking
Hypothesising after the results are known
Presenting post-hoc hypotheses as if they were a priori
Popperian science:
Step 1: data suggest theory
Step 2: theory is tested with new data
Step 3: loop through steps 1 and 2
Harking arises when the same data are used in Steps 1 and 2
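The difference between testing on the same data and testing on new data can be sketched with pure noise. Everything here is a hypothetical setup: 100 candidate predictors, 50 observations, and no real relationship between any predictor and the outcome.

```python
import random
import statistics

def pearson(x, y):
    """Sample Pearson correlation of two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def noise(rng, n):
    return [rng.gauss(0, 1) for _ in range(n)]

def harking_demo(n_obs=50, n_vars=100, n_reps=50, seed=4):
    """Mean |correlation| of the 'discovered' predictor on the data that
    suggested it, and on fresh data, averaged over n_reps repetitions."""
    rng = random.Random(seed)
    same_data, new_data = [], []
    for _ in range(n_reps):
        outcome = noise(rng, n_obs)
        predictors = [noise(rng, n_obs) for _ in range(n_vars)]
        # Step 1: the data suggest a theory (the most correlated predictor).
        best = max((pearson(x, outcome) for x in predictors), key=abs)
        # Harking: the same data are reused as the "test".
        same_data.append(abs(best))
        # Step 2 done properly: new observations of the chosen predictor and
        # the outcome. Everything is noise, so these are fresh independent draws.
        new_data.append(abs(pearson(noise(rng, n_obs), noise(rng, n_obs))))
    return statistics.fmean(same_data), statistics.fmean(new_data)

mean_same, mean_new = harking_demo()
print(f"mean |r| on the data that suggested the theory: {mean_same:.2f}")
print(f"mean |r| on new data:                           {mean_new:.2f}")
```

The apparent association survives only when the data that suggested the hypothesis are also used to test it.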

6) Feedback and asymmetric information
(i) The market for lemons
The buyer of a used car, with no further information on the vehicle in question, offers the average price of such vehicles
The seller can keep the better quality ones and sell only the poor quality ones
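The feedback in the lemons example can be sketched as an iteration. The assumptions are mine rather than the slide's: each car has a value known only to its seller, the buyer offers the average value of the cars currently on the market, and sellers withdraw any car worth more than the offer.

```python
def lemons_unravel(values, rounds=5):
    """Offers made in successive rounds as better cars leave the market."""
    on_market = list(values)
    offers = []
    for _ in range(rounds):
        if not on_market:
            break
        offer = sum(on_market) / len(on_market)   # buyer pays the average value
        offers.append(offer)
        # Sellers keep every car that is worth more than the offer.
        on_market = [v for v in on_market if v <= offer]
    return offers

# Hypothetical market: 1000 cars worth 1, 2, ..., 1000 (in arbitrary units).
offers = lemons_unravel(range(1, 1001))
print(offers)   # [500.5, 250.5, 125.5, 63.0, 32.0]
```

Each round the cars actually offered for sale are a selected, below-average sample, so the price the buyer infers from them keeps falling.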

(ii) Crimemaps

But
People will not bother to report minor crime if they feel there’s no point, or for other reasons
“More than 5.2 million people have not reported crimes for fear of deterring home buyers or renters since the online crime map was launched in February 2011”
“A quarter (24 per cent) of people would not report a crime for fear it would harm their chances of selling or renting their property”
http://www.directline.com/media/archive-2011/news-11072011

(iii) Evaluating new scorecards
Apply incumbent and challenger to a sample of customers
But this sample will have been accepted by the incumbent
→ data asymmetry
Standard scorecard performance measures favour the challenger
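The asymmetry in scorecard evaluation can be sketched by simulating two scorecards that are equally good by construction and comparing them on the customers the incumbent accepted. The latent `risk`, the noise levels, the default rule, the 50% acceptance rate and the use of AUC as the "standard performance measure" are all illustrative assumptions.

```python
import random

def auc(scores, labels):
    """Probability that a random defaulter scores higher than a random
    non-defaulter (Mann-Whitney form; assumes no tied scores)."""
    ranked = sorted(zip(scores, labels))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(rank for rank, (_, label) in enumerate(ranked, start=1) if label)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def compare_scorecards(n=100_000, accept_fraction=0.5, seed=5):
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        risk = rng.gauss(0, 1)                    # hypothetical latent riskiness
        incumbent = risk + rng.gauss(0, 1)        # both scorecards are equally
        challenger = risk + rng.gauss(0, 1)       # noisy views of the same risk
        defaulted = risk + rng.gauss(0, 1) > 1.0
        rows.append((incumbent, challenger, defaulted))
    rows.sort(key=lambda r: r[0])
    accepted = rows[: int(n * accept_fraction)]   # incumbent accepts the lowest scores

    def both_aucs(sample):
        labels = [r[2] for r in sample]
        return auc([r[0] for r in sample], labels), auc([r[1] for r in sample], labels)

    return {"all": both_aucs(rows), "accepted": both_aucs(accepted)}

results = compare_scorecards()
print("AUC (incumbent, challenger), all applicants:", results["all"])
print("AUC (incumbent, challenger), accepted only: ", results["accepted"])
```

On the full population the two scorecards perform alike, but on the accepted sample the incumbent's scores have been truncated by its own accept decision, so the challenger looks better.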

(iv) Credit card transaction fraud detection
Transaction stream terminated when incumbent detects a fraudulent transaction, not when the challenger does
→ data asymmetry
Standard fraud detection measures favour the incumbent
