BIG DATA AND US


Aug 13, 2023


SLIDE 1

BIG DATA AND US

SLIDE 2

GARTNER HYPE CYCLE

www.wikipedia.org www.gartner.com

SLIDE 3

GARTNER HYPE CYCLE

Emerging technologies 2014

www.gartner.com

SLIDE 4

TIME TO BE PRODUCTIVE

Large data hide true quantitative signal
Large data generate spurious correlations
Large data help mistake correlation for causation
Large data amplify bias and confounding

‘A big computer, a complex algorithm and a long time does not equal science’

— Robert Gentleman

SLIDE 6

LARGE DATA HIDE SIGNAL

A simulation study
100 subjects
2 groups
10 differentially abundant proteins
Plot the first two principal components
Expect good separation between the groups

Fan et al., National Science Review, 1:293, 2014

Panels: 2 proteins, 40 proteins, 200 proteins, 1,000 proteins
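The simulation described on this slide can be sketched roughly as follows. This is a minimal illustration, not the code behind the cited figure: the effect size `shift`, the unit-variance Gaussian noise, the random seed, and the `pc_separation` score are all assumptions introduced here.

```python
import numpy as np

def pc_separation(n_proteins, rng, n_subjects=100, n_signal=10, shift=1.0):
    """Group separation in the plane of the first two principal components.

    Assumed setup: Gaussian noise with unit variance; the first `n_signal`
    proteins are shifted by `shift` in the second group (hypothetical value).
    """
    labels = np.repeat([0, 1], n_subjects // 2)
    X = rng.standard_normal((n_subjects, n_proteins))
    n_shifted = min(n_signal, n_proteins)
    X[labels == 1, :n_shifted] += shift

    # PCA via SVD of the column-centred data matrix
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :2] * S[:2]  # coordinates on PC1 and PC2

    # Distance between group centroids, relative to within-group spread
    a, b = scores[labels == 0], scores[labels == 1]
    gap = np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))
    within = np.sqrt(0.5 * (a.var(axis=0).sum() + b.var(axis=0).sum()))
    return gap / within

rng = np.random.default_rng(0)
separation = {m: pc_separation(m, rng) for m in (2, 40, 200, 1000)}
for m, s in separation.items():
    print(f"{m:>5} proteins: separation score {s:.2f}")
```

Under these assumptions the score tends to drop once the uninformative proteins greatly outnumber the 10 informative ones, because the leading components start to follow noise rather than the group difference.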

‘We are drowning in information but starved for knowledge’

— John Naisbitt

SLIDE 7

TIME TO BE PRODUCTIVE

Large data hide true quantitative signal
Large data generate spurious correlations
Large data help mistake correlation for causation
Large data amplify bias and confounding

SLIDE 8

LARGE DATA HIDE SIGNAL

A simulation study
60 subjects with quantitative phenotype
red: 800 proteins unrelated to phenotype
blue: 6,400 proteins unrelated to phenotype
Repeat 1,000 times

Fan et al., National Science Review, 1:293, 2014

Plots: max correlation between the phenotype and a protein; max correlation between the phenotype and a linear combination of 4 proteins
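The first panel of this simulation can be sketched as below. It is an illustration under assumptions, not the cited authors' code: the phenotype and proteins are drawn as independent standard normals, the number of repeats is reduced from 1,000 to 100 to keep the run short, and the linear-combination panel is not reproduced.

```python
import numpy as np

def max_abs_correlation(n_subjects, n_proteins, rng):
    """Largest |Pearson r| between a phenotype and proteins unrelated to it."""
    y = rng.standard_normal(n_subjects)
    X = rng.standard_normal((n_subjects, n_proteins))
    # Standardise, then one matrix product gives every correlation at once
    yz = (y - y.mean()) / y.std()
    Xz = (X - X.mean(axis=0)) / X.std(axis=0)
    r = Xz.T @ yz / n_subjects
    return float(np.abs(r).max())

rng = np.random.default_rng(0)
n_repeats = 100  # the slide uses 1,000; reduced here (assumption)
max_r_800 = np.array([max_abs_correlation(60, 800, rng) for _ in range(n_repeats)])
max_r_6400 = np.array([max_abs_correlation(60, 6400, rng) for _ in range(n_repeats)])

print("median max |r|,   800 proteins:", round(float(np.median(max_r_800)), 2))
print("median max |r|, 6,400 proteins:", round(float(np.median(max_r_6400)), 2))
```

Every protein here is pure noise, so any large correlation is spurious by construction; the maximum grows with the number of proteins screened.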

‘With four parameters I can fit an elephant, and with five I can make him wiggle his trunk’

— John von Neumann

SLIDE 9

SLIDE 10

TIME TO BE PRODUCTIVE

Large data hide true quantitative signal
Large data generate spurious correlations
Large data help mistake correlation for causation
Large data amplify bias and confounding

SLIDE 11

SPURIOUS CORRELATIONS ABOUND

tylervigen.com/spurious-correlations

SLIDE 12

SPURIOUS CORRELATIONS ABOUND

tylervigen.com/spurious-correlations

SLIDE 13

SPURIOUS CORRELATIONS ABOUND

Plot: chocolate consumption (kg/yr/capita) vs. Nobel laureates per 10 million population

New England Journal of Medicine, 367:1562 (2012)

Premier medical journal
Nobel prize is related to cognitive ability
Flavanols (organic molecules present in chocolate) are linked to cognitive ability

Technical flaws
Nobel prize winners between 1900-2011
Chocolate consumption after 2002
Countries with many Nobel prizes have a high Human Development Index and high per capita income

A. Jogalekar, Scientific American, 2012

Easy to dismiss when we understand the context

SLIDE 14

SPURIOUS CORRELATIONS ABOUND

Not easy to dismiss when the context is unknown

Benabou et al., Princeton University

SLIDE 15

SPURIOUS CORRELATIONS ABOUND

Not easy to dismiss when the context is unknown

Benabou et al., Princeton University

‘Correlation doesn’t imply causation, but it does waggle its eyebrows suggestively and gesture furtively while mouthing “look over there”’

SLIDE 16

TIME TO BE PRODUCTIVE

Large data hide true quantitative signal
Large data generate spurious correlations
Large data help mistake correlation for causation
Large data amplify bias and confounding

SLIDE 17

EXAMPLE

carat  color  price
0.23   E      326
0.21   E      326
0.23   E      327
0.29   I      334
0.31   J      335
...

Plots: price vs. carat, for 50 diamonds and for 53,940 diamonds

SLIDE 18

EXAMPLE

New discovery!

◆ later colors cost more!

Plots: price by color (D to J), for 50 diamonds and for 53,940 diamonds

SLIDE 19

EXAMPLE

Subject matter knowledge

◆ later colors are cheaper
◆ they also weigh more
◆ both color and weight affect price

Plots (53,940 diamonds): carat by color (D to J); price by color (D to J)

SLIDE 20

EXAMPLE

Plot: price by color, per carat group
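The reversal in this example can be imitated on synthetic data. The sketch below does not use the diamonds table; the color coding (0 = D to 6 = J), every coefficient, and the 0.05-carat group width are hypothetical values chosen so that later colors weigh more while being cheaper at a given weight.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

# Synthetic stand-in for the diamonds table (all numbers are assumptions)
color = rng.integers(0, 7, size=n).astype(float)                     # 0 = D ... 6 = J
carat = 0.5 + 0.1 * color + rng.gamma(shape=2.0, scale=0.15, size=n)  # later colors weigh more
price = 6000.0 * carat - 300.0 * color + rng.normal(0.0, 500.0, size=n)

def slope(x, y):
    """Least-squares slope of y on x."""
    xc = x - x.mean()
    return float(xc @ (y - y.mean()) / (xc @ xc))

def centre_within(values, group):
    """Subtract each group's mean from its members."""
    sums = np.bincount(group, weights=values)
    counts = np.maximum(np.bincount(group), 1)  # avoid 0/0 for unused group ids
    return values - (sums / counts)[group]

# Ignoring weight: price appears to rise with later colors
marginal_slope = slope(color, price)

# Per carat group (0.05-carat bins): compare only diamonds of similar weight
carat_group = np.floor(carat / 0.05).astype(int)
stratified_slope = slope(centre_within(color, carat_group),
                         centre_within(price, carat_group))

print("price change per color grade, ignoring carat:  ", round(marginal_slope))
print("price change per color grade, per carat group: ", round(stratified_slope))
```

The sign flips because weight is a confounder: it is associated with color and also drives price, so comparing within narrow carat groups removes most of its contribution.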

‘To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of’

— Ronald Fisher

SLIDE 21

SUMMARY

More data ≠ more information
How? We should:

◆ state clearly the scientific question
◆ follow the fundamental principles of experimental design
◆ quantify the right number of analytes
◆ select appropriate statistical methods
◆ use problem-specific biological and technological information

Data and algorithms do not substitute for thinking through the problem

‘There are no routine statistical questions, only questionable statistical routines’

— D. R. Cox