big data and us


  1. BIG DATA AND US

  2. GARTNER HYPE CYCLE www.gartner.com www.wikipedia.org

  3. GARTNER HYPE CYCLE: Emerging technologies 2014 www.gartner.com

  4. TIME TO BE PRODUCTIVE
     ● Large data hide true quantitative signal
     ● Large data generate spurious correlations
     ● Large data help mistake correlation for causation
     ● Large data amplify bias and confounding
     ‘A big computer, a complex algorithm and a long time does not equal science’ — Robert Gentleman

  6. LARGE DATA HIDE SIGNAL
     ● A simulation study
     ● 100 subjects
     ● 2 groups
     ● 10 differentially abundant proteins
     ● Plot the first two principal components
     ● Expect good separation between the groups
     Figure panels: 2 proteins, 40 proteins, 200 proteins, 1,000 proteins
     ‘We are drowning in information but starved for knowledge’ — John Naisbitt
     Fan et al., National Science Review, 1:293, 2014

  7. TIME TO BE PRODUCTIVE
     ● Large data hide true quantitative signal
     ● Large data generate spurious correlations
     ● Large data help mistake correlation for causation
     ● Large data amplify bias and confounding

  8. LARGE DATA HIDE SIGNAL
     ● A simulation study
     ● 60 subjects with quantitative phenotype
     ● red: 800 proteins unrelated to phenotype
     ● blue: 6400 proteins unrelated to phenotype
     ● Repeat 1,000 times
     Figure panels: max correlation between the phenotype and a protein; max correlation between the phenotype and a linear combination of 4 proteins
     ‘With four parameters I can fit an elephant, and with five I can make him wiggle his trunk’ — John von Neumann
     Fan et al., National Science Review, 1:293, 2014
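The first panel of this slide (maximum correlation between the phenotype and a single unrelated protein) can be reproduced in spirit with a short simulation. This is a sketch under assumptions: phenotype and proteins are drawn as independent standard normals, the maximum is taken over absolute Pearson correlations, and the default number of repeats is reduced from the slide's 1,000 to keep the run short. The second panel (linear combinations of 4 proteins) is not covered.

```python
import numpy as np


def max_spurious_correlation(n_proteins, n_subjects=60, n_repeats=50, seed=0):
    """Largest |Pearson r| between a phenotype and proteins that are unrelated to it.

    All data are independent standard normal draws (an assumption), so any
    correlation found here is spurious by construction.
    """
    rng = np.random.default_rng(seed)
    maxima = np.empty(n_repeats)
    for i in range(n_repeats):
        phenotype = rng.standard_normal(n_subjects)
        proteins = rng.standard_normal((n_subjects, n_proteins))
        # Standardize, then r = mean of the product of z-scores.
        p = (phenotype - phenotype.mean()) / phenotype.std()
        z = (proteins - proteins.mean(axis=0)) / proteins.std(axis=0)
        corr = p @ z / n_subjects
        maxima[i] = np.abs(corr).max()
    return maxima


if __name__ == "__main__":
    for m in (800, 6400):
        r = max_spurious_correlation(m)
        print(f"{m:>5} unrelated proteins: mean max |r| = {r.mean():.2f}")
```

With only 60 subjects, the best of several thousand unrelated proteins should still show a sizeable correlation with the phenotype, and the maximum should grow as more proteins are screened.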

  9. (no text on this slide)

  10. TIME TO BE PRODUCTIVE
      ● Large data hide true quantitative signal
      ● Large data generate spurious correlations
      ● Large data help mistake correlation for causation
      ● Large data amplify bias and confounding

  11. SPURIOUS CORRELATIONS ABOUND tylervigen.com/spurious-correlations

  12. SPURIOUS CORRELATIONS ABOUND tylervigen.com/spurious-correlations

  13. SPURIOUS CORRELATIONS ABOUND
      Easy to dismiss when we understand the context
      ● Premier medical journal
      ● Nobel prize is related to cognitive ability
      ● Flavanols (organic molecules present in chocolate) are linked to cognitive ability
      ● Technical flaws
      ● Nobel prize winners between 1900-2011
      ● Chocolate consumption after 2002
      ● Countries with many Nobel prizes have a high Human Development Index and high per capita income
      Figure axes: Nobel laureates per 10 million; chocolate consumption (kg/yr/capita)
      New England Journal of Medicine, 367:1562 (2012)
      A. Jogalekar, Scientific American, 2012

  14. SPURIOUS CORRELATIONS ABOUND
      Not easy to dismiss when the context is unknown
      Benabou et al., Princeton University

  15. SPURIOUS CORRELATIONS ABOUND
      Not easy to dismiss when the context is unknown
      ‘Correlation doesn’t imply causation, but it does waggle its eyebrows suggestively and gesture furtively while mouthing “look over there”’
      Benabou et al., Princeton University

  16. TIME TO BE PRODUCTIVE
      ● Large data hide true quantitative signal
      ● Large data generate spurious correlations
      ● Large data help mistake correlation for causation
      ● Large data amplify bias and confounding

  17. EXAMPLE
      53,940 diamonds:

      carat  color  price
      0.23   E      326
      0.21   E      326
      0.23   E      327
      0.29   I      334
      0.31   J      335
      ..............

      Figure panels: price vs. carat for 50 diamonds; price vs. carat for 53,940 diamonds

  18. EXAMPLE
      Data: carat, color, price for 53,940 diamonds (table as on slide 17)
      ● New discovery!
        ◆ later colors cost more!
      Figure panels: price by color (D to J) for 50 diamonds; price by color (D to J) for 53,940 diamonds

  19. EXAMPLE
      Data: carat, color, price for 53,940 diamonds (table as on slide 17)
      ● Subject matter knowledge
        ◆ later colors are cheaper
        ◆ they also weigh more
        ◆ Both color and weight affect price
      Figure panels (53,940 diamonds): price by color (D to J); carat by color (D to J)

  20. EXAMPLE
      Figure: price by color, per carat group
      ‘To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of’ — Ronald Fisher
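The confounding in the diamond example can be illustrated with synthetic data. The sketch below does not use the real diamonds data set: the coefficients, noise levels, and the 0 to 6 coding of colors D to J are invented so that later colors are heavier but cheaper at equal weight. It also uses regression adjustment for carat rather than the per-carat-group plots shown on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Synthetic stand-in for the diamonds data; every number here is an assumption.
color = rng.integers(0, 7, size=n)                    # 0 = D, ..., 6 = J
carat = 0.5 + 0.1 * color + rng.normal(0.0, 0.1, n)   # later colors weigh more
price = 8000 * carat - 300 * color + rng.normal(0.0, 500.0, n)  # later colors are cheaper per carat


def slopes(y, *predictors):
    """Least-squares slopes of y on the predictors (intercept fitted, not returned)."""
    design = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1:]


# Ignoring carat: color appears to raise the price (about 8000 * 0.1 - 300 = +500 by construction).
marginal = slopes(price, color)[0]
# Adjusting for carat recovers the negative color effect built into the data (about -300).
adjusted = slopes(price, color, carat)[0]

print(f"price change per color step, ignoring carat:      {marginal:+.0f}")
print(f"price change per color step, adjusting for carat: {adjusted:+.0f}")
```

The sign of the color effect flips once weight is taken into account, which mirrors the contrast between the ‘later colors cost more’ plot and the per-carat-group view.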

  21. SUMMARY
      ● More data ≠ more information
      ● How should we:
        ◆ state clearly the scientific question
        ◆ follow the fundamental principles of experimental design
        ◆ quantify the right number of analytes
        ◆ select appropriate statistical methods
        ◆ use problem-specific biological and technological information
      ● Data and algorithms do not substitute for thinking through the problem
      ‘There are no routine statistical questions, only questionable statistical routines’ — D. R. Cox
