Big data: What are you missing? The risks of assuming data equals “all”

SLIDE 1

Theory of Big Data, UCL

Big data: What are you missing?

the risks of assuming data equals “all”

David J. Hand, Imperial College London and Winton Capital Management

6th January 2016

SLIDE 2

BACKGROUND

The promise of big data: McKinsey’s big data report: ‘we are on the cusp of a tremendous wave of innovation, productivity, and growth, as well as new modes of competition and value capture, all driven by big data as consumers, companies, and economic sectors exploit its potential’ and endless ditto by others

SLIDE 3

However, as I have argued elsewhere:

1) big data is not the solution: it’s what you do with it that counts

2) big data carries risks

SLIDE 4

Two kinds of big data opportunities

1) Computer science: through data manipulation

merging, linking, matching, concatenating, sorting, basic arithmetic, ...

Database heritage: could conceivably have all the data

e.g. stock in the warehouse
e.g. employees in the firm

2) Statistics: through inference and predictive analytics

Many (most?) problems cannot have all the data

e.g. observations in clinical trials
e.g. forecasting
e.g. physics experiments

SLIDE 5

The challenges of big data

1) Computational and mathematical challenges

‐ large n and/or d
‐ speed of acquisition, real‐time analysis

Hand’s Law: the requirements for increased computer power always increase faster than the increase in power itself

2) Inferential and statistical challenges

‐ complexity – networks, mixed data types, ...

SLIDE 6

3) Data challenges

‐ data quality
‐ non‐stationarity
‐ formulating the question
‐ correlation vs causation
‐ ...

SLIDE 7

THE AIM OF THIS PAPER:

To focus on one problem and show how it is pervasive in big data opportunities

‐ risking misleading conclusions
‐ incorrect understanding
‐ mistaken decisions
‐ wasted money
‐ ...

And to show what’s needed to tackle it

SLIDE 8

This is the problem of

SELECTION BIAS

SLIDE 10

SOME EXAMPLES:

Example 1: Potholes

‐ Streetbump smartphone app
‐ Detects potholes using accelerometer and emails location to local authority using GPS
‐ “Big data”, but no sophisticated computation or analytics
‐ But lower income people less likely to have smartphones and cars, older people less likely to have smartphones, ...

→ streets in richer areas get fixed

SLIDE 11

Example 2: Hurricane Sandy

20 million tweets between 27 October and 1 November 2012

But a distorted impression of where problems are:
‐ most tweets came from Manhattan
‐ few from “more severely affected locations, such as Breezy Point, Coney Island and Rockaway”
‐ because of relative density of population/smartphones
‐ because power outages meant phones not recharged

→ distorted impression of where the damage occurred

SLIDE 12

Example 3: Retail finance scorecard construction

Aim: build model to decide which applicants should be given a loan

Data: characteristics and (default/repay) outcome of those granted loans in the past
‐ but those granted loans in the past were selected on the basis of some previous scorecard
‐ they do not represent the entire population of applicants

Same structure for student selection, staff recruitment, ...
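The scorecard problem can be illustrated with a small simulation. This is only a sketch under invented assumptions: the logistic link between score and default, the normal distribution of scores, and the cutoff of 0 are all hypothetical and do not come from the talk. It shows that the default rate seen among previously accepted applicants need not resemble the rate in the full applicant population.

```python
import math
import random


def default_probability(score: float) -> float:
    """Hypothetical truth: a higher score means a lower chance of default."""
    return 1.0 / (1.0 + math.exp(2.0 * score))


def simulate(n_applicants: int = 100_000, cutoff: float = 0.0, seed: int = 1):
    """Return (default rate over all applicants, default rate over accepted only)."""
    rng = random.Random(seed)
    all_defaults = 0
    accepted = 0
    accepted_defaults = 0
    for _ in range(n_applicants):
        score = rng.gauss(0.0, 1.0)  # score under the previous scorecard
        defaulted = rng.random() < default_probability(score)
        all_defaults += defaulted  # outcome if everyone were given a loan
        if score > cutoff:  # the previous scorecard only accepts these
            accepted += 1
            accepted_defaults += defaulted
    return all_defaults / n_applicants, accepted_defaults / accepted


population_rate, accepted_rate = simulate()
print(f"default rate, all applicants: {population_rate:.3f}")
print(f"default rate, accepted only:  {accepted_rate:.3f}")
```

In practice the first figure is never observed, because outcomes are only recorded for those who were granted loans; the simulation can print it only because it generates outcomes for everyone.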

SLIDE 13

Example 4: Crime rates

Points to note:
1) The difference between the CSE&W and PRC
2) The dramatic fall in CSE&W from 1995

SLIDE 14

1) Crime Survey for England and Wales (CSE&W) versus Police Recorded Crime (PRC)

CSE&W: aged ≥ 16; children 10‐15; not group residences; not crimes against commercial or public sector bodies; victim‐based (so does not include murder); not fraud and cyber; capping repeat victimisation; ...

PRC: reported to and recorded by police; crime defined by “Notifiable Offence List” (incl. murder, public order, ...); incl. residents of institutions and tourists; incl. commercial bodies

2) CSE&W: 19m in 1995 to 7m in y.e. June 2015

Less crime, or shifting patterns of crime, e.g. to fraud, which is not measured by the CSE&W

SLIDE 15

Plastic card fraud in the UK, 2004‐2014

SLIDE 16

Example 5: Publication bias

Relevant factors include:

‐ tendency not to submit negative results (file‐drawer effect)
‐ positive results are more interesting to editors
‐ anomalous results may be regarded as errors, and not submitted

In an exploration of publication bias in the Cochrane Database of Systematic Reviews:

“In the meta‐analyses of efficacy, outcomes favoring treatment had on average a 27% ... higher probability to be included than other outcomes. In the meta‐analyses of safety, results showing no evidence of adverse effects were on average 78% ... more likely to be included than results demonstrating that adverse effects existed.”

Kicinski et al., 2015

SLIDE 17

WHAT DRIVES SELECTION BIAS:

1) Natural mechanisms

Abraham Wald and the WWII bomber armour

The bullet holes in returning bombers showed where they could be hit without bringing them down

A lesson for business schools? Look at the failures, not the successes

Francis Bacon

“when they showed him hanging in a temple a picture of those who had paid their vows as having escaped shipwreck, and would have him say whether he did not now acknowledge the power of the gods — ‘Aye,’ asked he again, ‘but where are they painted that were drowned after their vows?’”

SLIDE 18

2) Non‐response and refusals

LFS quarterly survey wave‐specific response rates: March‐May 2000 to July‐Sept 2015

http://www.ons.gov.uk/ons/guide‐method/method‐quality/specific/labour‐market/labour‐force‐survey/index.html

SLIDE 19

3) Self‐selection

(i) The magazine survey which asks the one question: do you reply to magazine surveys?

(ii) The Literary Digest’s disastrous prediction that Landon would beat Roosevelt in the 1936 presidential election

Standard explanation: the prediction was based on polling people with phones, who were more likely to be Republican

But this is a myth

In fact 10m people were polled, but only 2.3m replied

A self‐selected sample, and in this election the anti‐Roosevelt voters felt more strongly than the pro‐Roosevelt voters
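A rough arithmetic sketch of how differential response distorts a poll. The true share of anti‐incumbent voters and the two response rates below are hypothetical; they are chosen only so that the overall response rate matches the 23% implied by 2.3m replies out of 10m, and are not estimates of the 1936 electorate.

```python
def respondent_share(true_share_anti: float,
                     response_anti: float,
                     response_pro: float) -> float:
    """Share of anti-incumbent voters among the people who actually reply."""
    replies_anti = true_share_anti * response_anti
    replies_pro = (1.0 - true_share_anti) * response_pro
    return replies_anti / (replies_anti + replies_pro)


# Hypothetical inputs (assumptions for illustration only)
true_share_anti = 0.40   # 40% of those polled oppose the incumbent
response_anti = 0.35     # they feel strongly, so more of them reply
response_pro = 0.15      # supporters reply less often

overall_response = (true_share_anti * response_anti
                    + (1.0 - true_share_anti) * response_pro)
share_in_replies = respondent_share(true_share_anti, response_anti, response_pro)

print(f"overall response rate:       {overall_response:.2f}")
print(f"anti share among those polled: {true_share_anti:.2f}")
print(f"anti share among replies:      {share_in_replies:.2f}")
```

Under these assumed rates a 40% minority becomes a clear majority of the replies, however many millions of forms are sent out.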

SLIDE 20

(iii) The Actuary edition of July 2006 included an editorial which said ‘A couple of months ago I invited you ‐ all 16,245 of you ‐ to participate in our online survey concerning the sex of actuarial offspring. ... Well, I’m pleased to say that a number of you (13, in fact) replied to our poll.’

Particularly web‐based surveys:
‐ who replies?
‐ under‐representation of some groups
‐ multiple responding

SLIDE 21

4) Data dredging

Test enough (true null) hypotheses and you expect some to be significant by chance

This does not have to be dishonest: if 1000 teams each test one true null hypothesis at the 5% level ....

Charles Babbage termed such data dredging “cooking”:

“make multitudes of observations, and out of these to select only those which agree, or very nearly agree. If a hundred observations are made, the cook must be very unlucky if he cannot pick out fifteen or twenty which will do for serving up”

Robert Millikan, Gregor Mendel, ....
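The 1000‐teams remark can be made concrete with a short calculation and simulation. It is a sketch that assumes the teams are independent and that each test has a continuous test statistic, so that a p‐value under a true null hypothesis is uniform on [0, 1].

```python
import random

ALPHA = 0.05      # significance level used by every team
N_TEAMS = 1000    # each team tests one true null hypothesis

# Expected number of "significant" results, and chance of at least one
expected_false_positives = ALPHA * N_TEAMS
prob_at_least_one = 1.0 - (1.0 - ALPHA) ** N_TEAMS

# Simulation: draw one uniform p-value per team
rng = random.Random(0)
p_values = [rng.random() for _ in range(N_TEAMS)]
significant = sum(p < ALPHA for p in p_values)

print(f"expected false positives: {expected_false_positives:.0f}")
print(f"P(at least one):          {prob_at_least_one:.6f}")
print(f"simulated false positives: {significant}")
```

About 50 teams can honestly report a significant finding; if only those results are written up, the literature consists entirely of false positives.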

SLIDE 22

5) Harking

Hypothesising after the results are known

Presenting post‐hoc hypotheses as if they were a priori

Popperian science:
Step 1: data suggest theory
Step 2: theory is tested with new data
Step 3: loop through steps 1 and 2

Harking arises when the same data are used in Steps 1 and 2
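A minimal simulation of why reusing the data matters. Everything here is an illustrative assumption: 40 candidate features and an outcome that are all pure noise, 50 cases per study, and 200 repetitions. The "theory" suggested in Step 1 is the feature most correlated with the outcome; it is then assessed once on the same data and once on fresh data.

```python
import math
import random


def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def noise(rng, n):
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def one_study(rng, n_cases=50, n_features=40):
    outcome = noise(rng, n_cases)
    features = [noise(rng, n_cases) for _ in range(n_features)]
    # Step 1: the data "suggest" the feature with the strongest correlation.
    # Testing it on the same data just reports that same maximum.
    same_data_r = max(abs(correlation(f, outcome)) for f in features)
    # Step 2 done properly: new data. The chosen feature is really noise,
    # so new measurements of it and of the outcome are fresh noise.
    new_data_r = abs(correlation(noise(rng, n_cases), noise(rng, n_cases)))
    return same_data_r, new_data_r


rng = random.Random(42)
results = [one_study(rng) for _ in range(200)]
mean_same = sum(r[0] for r in results) / len(results)
mean_new = sum(r[1] for r in results) / len(results)

print(f"mean |r| when tested on the same data: {mean_same:.2f}")
print(f"mean |r| when tested on new data:      {mean_new:.2f}")
```

There is no real relationship anywhere in this setup, yet the same‐data "test" reports a sizeable correlation on average, simply because the hypothesis was selected for fitting those data.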

SLIDE 23

6) Feedback and asymmetric information

(i) The market for lemons

The buyer of a used car, with no further information on the vehicle in question, offers the average price of such vehicles

The seller can keep the better quality ones and sell only the poor quality ones
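The feedback loop in the lemons example can be sketched as an iteration. The car values below (evenly spread between 0.001 and 1) and the rule that a seller withdraws any car worth more than the current offer are simplifying assumptions made for illustration.

```python
def market_rounds(values, rounds=8):
    """Return the buyer's offer in each round as better cars are withdrawn."""
    on_market = list(values)
    offers = []
    for _ in range(rounds):
        if not on_market:
            break
        # The buyer cannot tell cars apart, so offers the average value
        offer = sum(on_market) / len(on_market)
        offers.append(offer)
        # Sellers keep any car worth more than the offer
        on_market = [v for v in on_market if v <= offer]
    return offers


# Hypothetical true values of 1000 used cars
values = [i / 1000 for i in range(1, 1001)]
offers = market_rounds(values)

for round_number, offer in enumerate(offers, start=1):
    print(f"round {round_number}: offer = {offer:.4f}")
```

Each round the cars still for sale are a selected sample of the worse half, so the offer roughly halves until only the poorest cars remain.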

SLIDE 24

(ii) Crimemaps

SLIDE 25

But people will not bother to report minor crime if they feel there’s no point, or for other reasons

“More than 5.2 million people have not reported crimes for fear of deterring home buyers or renters since the online crime map was launched in February 2011”

“A quarter (24 per cent) of people would not report a crime for fear it would harm their chances of selling or renting their property”

http://www.directline.com/media/archive‐2011/news‐11072011

SLIDE 26

(iii) Evaluating new scorecards

Apply incumbent and challenger to a sample of customers

But this sample will have been accepted by the incumbent → data asymmetry

Standard scorecard performance measures favour the challenger

SLIDE 27

(iv) Credit card transaction fraud detection

Transaction stream terminated when incumbent detects a fraudulent transaction, not when the challenger does → data asymmetry

Standard fraud detection measures favour the incumbent

SLIDE 28

7) Gaming

Goodhart’s law: when a measure becomes a target, it ceases to be a good measure

“As soon as the Government attempts to regulate any particular set of financial assets, these become unreliable as indicators of economic trends”

Investors try to anticipate the effect of the regulation, and adapt to benefit from it

SLIDE 29

Campbell’s law: The more any quantitative social indicator is used for social decision‐making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor

‐ schools enter for public exams only those expected to excel
‐ ambulance response times

Bevan and Hamblin, 2009

SLIDE 30

8) The law

EU gender discrimination in insurance and credit

Credit scoring aims to “discriminate” between the good and bad credit risks
→ statistical models to estimate probability of repaying a loan
→ build the best model we can (benefits company and customer)

‐ include all variables which enhance predictive power
‐ models as sophisticated as we like

SLIDE 32

Practical and ethical problem: the US Equal Credit Opportunity Act of 1974 makes it illegal for creditors to discriminate against any applicant on the basis of race, colour, religion, national origin, sex, marital status, or age (similar in other countries)

Even though women are generally less risky than men

That is: it is illegal to treat differently people who belong to certain groups with known different degrees of risk

SLIDE 34

Disadvantaging females, who have to pay higher rates, and have their loan applications rejected more often

Advantaging males, who have to pay lower rates, and have their applications accepted more often

Contrast with insurance: where males and females could be charged different premiums

Until the European Court of Justice ruled in 2011 that, from 21 December 2012, the use of gender would not be permitted in determining insurance prices and benefits

SLIDE 35

Now imagine
‐ If the cost of driving insurance is equalised at a weighted mean of the previous male and female values;
‐ then more of the higher risk category will be able to drive on our roads;
‐ increasing the risk to all of us

In fact, nearly all age groups saw a drop in premiums, except
‐ women aged 17‐20 saw a rise in their premiums
‐ men of the same age saw the biggest drop

SLIDE 36

9) Underpowered studies

“Studies in psychology are endemically underpowered”

Bertamini and Munafo, 2012

The law of small numbers: the tendency to generalise from small samples

“the mistaken assumption that the law of large numbers applies to small numbers as well”

Hand, The Improbability Principle, p194

SLIDE 37

10) Conditional probabilities

Regression to the mean

Every technology is overhyped at its birth
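Regression to the mean as a consequence of conditioning on selection can be sketched as follows. The model is an assumption made for illustration: each case has a stable underlying level, every measurement adds independent noise of the same size, and the top 10% on a first measurement are selected.

```python
import random

rng = random.Random(7)
N_CASES = 20_000

# Hypothetical model: observed value = stable level + independent noise
level = [rng.gauss(0.0, 1.0) for _ in range(N_CASES)]
first = [v + rng.gauss(0.0, 1.0) for v in level]
second = [v + rng.gauss(0.0, 1.0) for v in level]

# Select the cases that looked best on the first measurement (top 10%)
cutoff = sorted(first)[int(0.9 * N_CASES)]
selected = [i for i in range(N_CASES) if first[i] >= cutoff]

mean_first = sum(first[i] for i in selected) / len(selected)
mean_second = sum(second[i] for i in selected) / len(selected)

print(f"selected cases, first measurement:  {mean_first:.2f}")
print(f"selected cases, second measurement: {mean_second:.2f}")
```

Nothing happened to the selected cases between the two measurements; they were chosen partly for favourable noise, which does not repeat. Anything singled out because of an extreme early showing will tend to look less extreme later.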

SLIDE 38

WHAT TO DO ABOUT IT:

1) Construct and stick to sampling frame

Or use “gold samples”

Draw some cases from throughout the sample space

Then standardise
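One reading of “then standardise” is direct standardisation: estimate the quantity within each stratum, then weight the strata by their known population shares instead of their shares in the sample. That interpretation is an assumption, as are the strata, counts and rates in this sketch.

```python
def raw_mean(stratum_means, sample_counts):
    """Mean over the sample as collected, weighting strata by sample size."""
    total = sum(sample_counts.values())
    return sum(stratum_means[s] * sample_counts[s] for s in sample_counts) / total


def standardised_mean(stratum_means, population_shares):
    """Mean with strata reweighted to their known population shares."""
    return sum(stratum_means[s] * population_shares[s] for s in population_shares)


# Hypothetical figures: potholes per km reported from two groups of streets
sample_counts = {"high smartphone use": 900, "low smartphone use": 100}
stratum_means = {"high smartphone use": 2.0, "low smartphone use": 6.0}
population_shares = {"high smartphone use": 0.6, "low smartphone use": 0.4}

raw = raw_mean(stratum_means, sample_counts)
standardised = standardised_mean(stratum_means, population_shares)

print(f"raw sample mean:   {raw:.1f}")
print(f"standardised mean: {standardised:.1f}")
```

This only works if every stratum has at least some cases, which is the point of drawing cases from throughout the sample space.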

SLIDE 39

2) Registers

e.g. in surveys of people

e.g. pre‐registration in clinical trials

September 2004: NEJM, Lancet, Annals of Internal Medicine, JAMA: required drug research sponsored by pharmaceutical companies to be pre‐registered in a public database as a pre‐condition for publication

SLIDE 40

3) Detecting, e.g. publication bias

Caliper tests: ratio of reported results just above and just below the critical value associated with (e.g.) p = 0.05

Funnel plots (and tests derived from them) are based on the law of small numbers
‐ large studies are likely to be published regardless of results
‐ small studies are likely to be published only if the results are “interesting”, i.e. significant
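A caliper test can be sketched as a count on either side of the critical value followed by a binomial test. The caliper width of 0.2, the one‐sided exact binomial test, and the list of z‐statistics are assumptions for illustration; the working premise is that, without selection, a result falling inside a narrow caliper is about equally likely to land on either side of the critical value.

```python
import math


def caliper_test(z_values, critical=1.96, width=0.2):
    """Count |z| just below and just above `critical`, with a one-sided p-value.

    The p-value is P(X >= just_above) for X ~ Binomial(n, 0.5), where n is
    the number of results inside the caliper.
    """
    just_below = sum(critical - width <= abs(z) < critical for z in z_values)
    just_above = sum(critical <= abs(z) < critical + width for z in z_values)
    n = just_below + just_above
    p_value = sum(math.comb(n, k) for k in range(just_above, n + 1)) / 2 ** n
    return just_below, just_above, p_value


# Hypothetical reported z-statistics from a set of published studies
reported_z = [1.80, 1.85, 1.90, 1.97, 1.98, 2.00, 2.01, 2.03,
              2.05, 2.08, 2.10, 2.12, 2.14, 1.20, 3.00]

below, above, p_value = caliper_test(reported_z)
print(f"just below: {below}, just above: {above}, p = {p_value:.3f}")
```

A marked excess of results just above the critical value, relative to just below it, is the signature that results were selected on significance.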

SLIDE 41

A relationship between sample size and effect size is suspicious

Hence the overabundance of points in the bottom right of the funnel and the dearth in the left
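How a publication filter produces a relationship between sample size and effect size can be seen in a small simulation. The assumptions are hypothetical: a true effect of 0.2, unit‐variance outcomes so that the standard error is 1/sqrt(n), large studies always published, and small studies published only when significant at the 5% level.

```python
import math
import random


def published_estimates(n_per_study, n_studies, rng,
                        true_effect=0.2, always_publish=False):
    """Return the effect estimates of the studies that get published."""
    se = 1.0 / math.sqrt(n_per_study)  # standard error of each estimate
    published = []
    for _ in range(n_studies):
        estimate = rng.gauss(true_effect, se)
        significant = abs(estimate) / se > 1.96
        if always_publish or significant:
            published.append(estimate)
    return published


rng = random.Random(3)
small = published_estimates(20, 2000, rng)                          # filtered
large = published_estimates(1000, 2000, rng, always_publish=True)   # unfiltered

mean_small = sum(small) / len(small)
mean_large = sum(large) / len(large)

print(f"small studies published: {len(small)} of 2000, mean effect {mean_small:.2f}")
print(f"large studies published: {len(large)} of 2000, mean effect {mean_large:.2f}")
```

The true effect is the same for every study, yet the published small studies report much larger effects on average, which is the asymmetry a funnel plot is designed to reveal.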

Copas 1999

SLIDE 42

4) Model the selection mechanism

Heckman selection models (Nobel Prize)

Copas publication bias correction models

SLIDE 43

CONCLUSION: The danger of selection bias

“Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves.”

Chris Anderson, Wired, in an article called ‘The end of theory: the data deluge makes the scientific method obsolete’

SLIDE 44

The danger is that you don’t know it’s happening

The numbers lie for themselves

SLIDE 45

A final example

Anthropic bias:

The extraordinary coincidence that the universe has exactly the right characteristics for human life to evolve

SLIDE 46

A final example

Anthropic bias:

The universe must be like it is, or we wouldn’t be here to see it

SLIDE 48

thanks