BIG DATA AND US 2 GARTNER HYPE CYCLE www.gartner.com - - PowerPoint PPT Presentation



slide-1
SLIDE 1

BIG DATA AND US

slide-2
SLIDE 2

GARTNER HYPE CYCLE

2

www.wikipedia.org www.gartner.com

slide-3
SLIDE 3

GARTNER HYPE CYCLE

3

Emerging technologies 2014

www.gartner.com

slide-4
SLIDE 4

TIME TO BE PRODUCTIVE

  • Large data hide true quantitative signal
  • Large data generate spurious correlations
  • Large data help mistake correlation for causation
  • Large data amplify bias and confounding

4

‘A big computer, a complex algorithm and a long time does not equal science’

— Robert Gentleman


slide-6
SLIDE 6

LARGE DATA HIDE SIGNAL

  • A simulation study
  • 100 subjects
  • 2 groups
  • 10 differentially abundant proteins
  • Plot the first two principal components
  • Expect good separation between the groups

6

Fan et al., National Science Review, 1:293, 2014

Figure panels: 2 proteins, 40 proteins, 200 proteins, 1,000 proteins

‘We are drowning in information but starved for knowledge’

— John Naisbitt
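
The simulation on this slide can be approximated with a short script. This is only a sketch: the slide gives 100 subjects, 2 groups and 10 differentially abundant proteins, while the mean shift of 1.0, the unit-variance Gaussian noise and the `separation` score are assumptions made here for illustration, not the settings of Fan et al.

```python
import numpy as np

def pca_scores(n_proteins, n_subjects=100, n_signal=10, shift=1.0, seed=0):
    """Simulate two groups and return (labels, scores on the first two PCs)."""
    rng = np.random.default_rng(seed)
    labels = np.repeat([0, 1], n_subjects // 2)
    x = rng.standard_normal((n_subjects, n_proteins))
    # Only the first min(n_signal, n_proteins) proteins differ between groups;
    # every other protein is pure noise. The shift size is an assumption.
    k = min(n_signal, n_proteins)
    x[labels == 1, :k] += shift
    # PCA via SVD of the column-centered data matrix.
    x -= x.mean(axis=0)
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    return labels, u[:, :2] * s[:2]

def separation(labels, scores):
    """Distance between group centroids, scaled by the within-group spread."""
    a, b = scores[labels == 0], scores[labels == 1]
    gap = np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))
    spread = np.sqrt((a.var(axis=0).sum() + b.var(axis=0).sum()) / 2)
    return gap / spread

if __name__ == "__main__":
    for m in (2, 40, 200, 1000):
        print(m, "proteins: separation =", round(separation(*pca_scores(m)), 2))
```

Under these assumptions the two groups separate on the first two components when few noise proteins are included, and the separation fades as noise proteins accumulate.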

slide-7
SLIDE 7

TIME TO BE PRODUCTIVE

  • Large data hide true quantitative signal
  • Large data generate spurious correlations
  • Large data help mistake correlation for causation
  • Large data amplify bias and confounding

7

slide-8
SLIDE 8

LARGE DATA HIDE SIGNAL

  • A simulation study
  • 60 subjects with quantitative phenotype
  • red: 800 proteins unrelated to phenotype
  • blue: 6,400 proteins unrelated to phenotype
  • Repeat 1,000 times

8

Fan et al., National Science Review, 1:293, 2014

Figure panels: max correlation between the phenotype and a protein; max correlation between the phenotype and a linear combination of 4 proteins

‘With four parameters I can fit an elephant, and with five I can make him wiggle his trunk’

— John von Neumann
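
A rough sketch of this simulation follows. The subject and protein counts come from the slide; the standard normal data and the way the 4 proteins are chosen (simply the 4 with the largest individual correlations) are assumptions made here, and the original study may select the combination differently.

```python
import numpy as np

def max_spurious_correlations(n_proteins, n_subjects=60, n_repeats=1000, seed=0):
    """Largest correlations between a phenotype and proteins unrelated to it.

    Returns two arrays of length n_repeats: the maximum absolute correlation
    with a single protein, and the multiple correlation with the 4 proteins
    that correlate best individually (an assumed, greedy choice).
    """
    rng = np.random.default_rng(seed)
    single = np.empty(n_repeats)
    combo = np.empty(n_repeats)
    for r in range(n_repeats):
        y = rng.standard_normal(n_subjects)
        x = rng.standard_normal((n_subjects, n_proteins))  # independent of y
        yc = (y - y.mean()) / y.std()
        xc = (x - x.mean(axis=0)) / x.std(axis=0)
        corr = xc.T @ yc / n_subjects          # Pearson correlation per protein
        single[r] = np.abs(corr).max()
        top4 = np.argsort(np.abs(corr))[-4:]
        # Least-squares fit of the phenotype on the 4 selected proteins.
        beta, *_ = np.linalg.lstsq(xc[:, top4], yc, rcond=None)
        combo[r] = np.corrcoef(xc[:, top4] @ beta, yc)[0, 1]
    return single, combo

if __name__ == "__main__":
    for p in (800, 6400):
        s, c = max_spurious_correlations(p, n_repeats=100)
        print(p, "proteins: mean max |r| =", round(s.mean(), 2),
              "| mean 4-protein R =", round(c.mean(), 2))
```

Every protein here is noise by construction, so any correlation the script reports is spurious; it grows with the number of proteins screened.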

slide-9
SLIDE 9

9

slide-10
SLIDE 10

TIME TO BE PRODUCTIVE

  • Large data hide true quantitative signal
  • Large data generate spurious correlations
  • Large data help mistake correlation for causation
  • Large data amplify bias and confounding

10

slide-11
SLIDE 11

SPURIOUS CORRELATIONS ABOUND

11

tylervigen.com/spurious-correlations

slide-12
SLIDE 12

SPURIOUS CORRELATIONS ABOUND

12

tylervigen.com/spurious-correlations

slide-13
SLIDE 13

SPURIOUS CORRELATIONS ABOUND

13

Figure axes: chocolate consumption (kg/yr/capita) vs. Nobel laureates per 10 million

New England Journal of Medicine, 367:1562 (2012)

  • Premier medical journal
  • Nobel prize is related to cognitive ability
  • Flavanols (organic molecules present in chocolate) are linked to cognitive ability
  • Technical flaws
  • Nobel prize winners between 1900-2011
  • Chocolate consumption after 2002
  • Countries with many Nobel prizes have a high Human Development Index and high per capita income
  • A. Jogalekar, Scientific American, 2012

Easy to dismiss when we understand the context

slide-14
SLIDE 14

SPURIOUS CORRELATIONS ABOUND

14

Not easy to dismiss when the context is unknown

Benabou et al., Princeton University

slide-15
SLIDE 15

SPURIOUS CORRELATIONS ABOUND

15

Not easy to dismiss when the context is unknown

Benabou et al., Princeton University

‘Correlation doesn’t imply causation, but it does waggle its eyebrows suggestively and gesture furtively while mouthing “look over there”’

slide-16
SLIDE 16

TIME TO BE PRODUCTIVE

  • Large data hide true quantitative signal
  • Large data generate spurious correlations
  • Large data help mistake correlation for causation
  • Large data amplify bias and confounding

16

slide-17
SLIDE 17

EXAMPLE

17

carat  color  price
0.23   E      326
0.21   E      326
0.23   E      327
0.29   I      334
0.31   J      335
...

Figure: price vs. carat; panels: 50 diamonds and 53,940 diamonds

slide-18
SLIDE 18

EXAMPLE

  • New discovery!

◆ later colors cost more!

18

Figure: price by color (D to J); panels: 50 diamonds and 53,940 diamonds

slide-19
SLIDE 19

EXAMPLE

  • Subject matter knowledge

◆ later colors are cheaper
◆ they also weigh more
◆ both color and weight affect price

19

Figure: carat by color and price by color (D to J), 53,940 diamonds

slide-20
SLIDE 20

EXAMPLE

20

Figure: price by color, within each carat group

‘To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of’

— Ronald Fisher
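
The reversal in the diamonds example can be reproduced on synthetic data. The sketch below does not use the real 53,940-diamond table: the numbers in `simulate_diamonds` (how much heavier later colors are, how price depends on carat and color) are invented for illustration so that the mechanism is visible.

```python
import numpy as np

COLORS = "DEFGHIJ"  # color grades from D to J, coded 0..6 below

def simulate_diamonds(n=5000, seed=0):
    """Synthetic diamonds: later colors weigh more but are cheaper per carat.

    All coefficients are made-up assumptions, not estimates from real data.
    """
    rng = np.random.default_rng(seed)
    color = rng.integers(0, len(COLORS), size=n)
    # Later colors are heavier on average; clip to keep weights positive.
    carat = np.clip(0.6 + 0.1 * color + rng.normal(0, 0.15, n), 0.2, None)
    # Price rises steeply with weight and falls with later color.
    price = 5000 * carat - 200 * color + rng.normal(0, 300, n)
    return color, carat, price

def slope(x, y):
    """Least-squares slope of y on x."""
    return np.polyfit(x, y, 1)[0]

def color_slopes(color, carat, price, edges=(0.4, 0.6, 0.8, 1.0, 1.2, 1.4)):
    """Slope of price on color: overall, and within each carat group."""
    marginal = slope(color, price)
    within = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (carat >= lo) & (carat < hi)
        within.append(slope(color[in_bin], price[in_bin]))
    return marginal, within

if __name__ == "__main__":
    marginal, within = color_slopes(*simulate_diamonds())
    print("overall price change per color step:", round(marginal))
    print("within carat groups:", [round(w) for w in within])
```

Ignoring weight, later colors appear to cost more; comparing diamonds of similar weight, the sign flips, which is the pattern the per-carat-group figure shows.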

slide-21
SLIDE 21

SUMMARY

  • More data ≠ more information
  • How should we:

◆ state clearly the scientific question
◆ follow the fundamental principles of experimental design
◆ quantify the right number of analytes
◆ select appropriate statistical methods
◆ use problem-specific biological and technological information

  • Data and algorithms do not substitute for thinking through the problem

21

‘There are no routine statistical questions, only questionable statistical routines’

— D. R. Cox