LARGE DATASETS rogier.kievit@mrc-cbu.cam.ac.uk/@rogierK Outline - - PowerPoint PPT Presentation



slide-1
SLIDE 1

LARGE DATASETS

rogier.kievit@mrc-cbu.cam.ac.uk/@rogierK

slide-2
SLIDE 2

Outline

  • 1) What is big data?
  • 2) Why bother?
  • 3) Where to find large datasets
  • 4) Challenges, pitfalls and opportunities
slide-3
SLIDE 3

Big data?

  • ‘The Australian Square Kilometre Array Pathfinder (ASKAP) project currently acquires 7.5 terabytes/second of sample image data, a rate projected to increase 100-fold to 750 terabytes/second (~25 zettabytes per year) by 2025’
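As a rough sanity check on the quoted figures, the per-second rates can be converted to a yearly volume. This is a minimal sketch that assumes decimal (SI) units and a 365-day year; the function name is illustrative only.

```python
# Back-of-the-envelope conversion of the quoted ASKAP data rates.
# Assumptions (not stated on the slide): 1 TB = 1e12 bytes, 1 ZB = 1e21 bytes,
# and a non-leap year of 365 days with continuous acquisition.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000
TB = 10**12
ZB = 10**21


def zettabytes_per_year(terabytes_per_second: float) -> float:
    """Convert a sustained data rate in TB/s to ZB/year."""
    return terabytes_per_second * TB * SECONDS_PER_YEAR / ZB


current = zettabytes_per_year(7.5)    # quoted current rate
projected = zettabytes_per_year(750)  # quoted projected rate

print(f"current:   {current:.2f} ZB/year")
print(f"projected: {projected:.1f} ZB/year")
```

Under these assumptions the projected rate works out to a little under 24 zettabytes per year, which is in line with the "~25 zettabytes" in the quote.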

slide-4
SLIDE 4

Outline

  • 1) What is big data?
  • 2) Why bother?
  • 3) Where to find them
  • 4) Challenges, pitfalls and opportunities
slide-5
SLIDE 5

‘Well if I found an effect in a small sample, then there must be something there right?’

slide-6
SLIDE 6

Why bother

  • 1) It’s (almost) free
  • 2) More statistical power is always better
  • 3) Reproducible
  • 4) Generalizability/replicability
  • 5) Inspires new questions
  • 6) Look beyond your current domain
  • 7) Develop/apply/test new methods on real data

BF10 = 5.26 × 10^8315
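To illustrate the statistical power point, the sketch below approximates the power of a two-sided test for a correlation at several sample sizes. It uses the Fisher z approximation; the effect size r = 0.1 and the sample sizes are hypothetical values chosen for illustration and do not come from the slides.

```python
from math import atanh, sqrt
from statistics import NormalDist


def correlation_power(r: float, n: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided test of H0: rho = 0.

    Uses the Fisher z approximation, in which atanh(r_hat) is roughly
    normal with standard error 1 / sqrt(n - 3).
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = atanh(r) * sqrt(n - 3)  # expected test statistic under the alternative
    # Probability of landing in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


# Hypothetical small effect (r = 0.1) at increasing sample sizes.
for n in (30, 100, 1000, 10000):
    print(n, round(correlation_power(0.1, n), 3))
```

With a small effect like this, power is low at typical lab sample sizes and approaches 1 only in the thousands, which is the range the datasets in this talk operate in.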

slide-7
SLIDE 7

Procedure

  • 1) Find suitable data
  • 2) Apply
  • 3) Wait
  • 4) (Wait some more)
  • 5) Data!
slide-8
SLIDE 8

Open data types: Databases (cognitive neuro)

Sample         Size      Cost     Age     Data
Biobank        500.000   2000 £   43-73   everything
ABCD           10000     free     9-11    cognitive, neural, mental health
HCP            1000      <£1000   21-35   cognitive, neural, mental health
IMAGEN         1500      free     0-3     cognitive, behavioural, some neural
PNC            800       free     11–17   cognitive, behavioural, neural
NKI Rockland   800       free     6-18    cognitive, behavioural, some neural

LARGE DATASETS

Reach out online/ Google data

OASIS, ADNI, HABS, ENIGMA, and many more

slide-9
SLIDE 9

Integration with other open science practices?

  • Data sharing: By definition

  • Preregistration: Possible
  • Reproducibility
slide-10
SLIDE 10

Large public datasets in practice

slide-11
SLIDE 11
  • https://openpsychometrics.org/_rawdata/
  • Freely downloadable
  • e.g.: Stress, anxiety, depression
  • N=48.000 in 5 seconds
  • A (demanding) model fit excellently
  • Personality and demographic covariates explained >50% (!) of the variance in depression/anxiety/stress

Jacobucci, R., Brandmaier, A. M., & Kievit, R. A. (2018). Variable selection in structural equation models with regularized MIMIC Models. In press, AMPPS

slide-12
SLIDE 12

Big data by leveraging technical tools

  • ‘Math Garden’
  • Incentivisation
  • Free ‘participation’
  • Accessible through signed form

‘improving an online practice environment for math, currently containing over a billion responses’

Brinkhuis, M., Savi, A., Hofman, A., Coomans, F., van der Maas, H., & Maris, G. (2018). Learning As It Happens: A Decade of Analyzing and Shaping a Large-Scale Online Learning System.

slide-13
SLIDE 13
  • Survey of Health, Ageing and Retirement in Europe
  • Freely and easily available
  • N=111.000 (!) in 60 minutes
  • 6 waves
  • 27 European countries and Israel
  • Fit a series of complex growth models
  • Cognitive health
  • Immediate recall: word list of 10 words
  • 0-10, 4 waves
  • Proportion remembered
  • Mental health
  • EURO-D scale
  • Depressive symptoms
  • Inverted so that higher scores -> better mental health

[Figure: decline in memory vs. decline in mental health, r=.94]
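The two score transformations described on this slide (recall as a proportion, and depressive symptoms inverted so that higher means better) can be sketched as follows. The function names are hypothetical, and the EURO-D maximum of 12 is an assumption about the scale's range rather than something stated on the slide.

```python
# Hypothetical helpers; names and bounds are illustrative assumptions.
WORD_LIST_LENGTH = 10  # slide: list of 10 words, scored 0-10
EURO_D_MAX = 12        # assumption: EURO-D total score ranges from 0 to 12


def proportion_remembered(words_recalled: int) -> float:
    """Rescale an immediate-recall score (0-10) to a proportion (0-1)."""
    if not 0 <= words_recalled <= WORD_LIST_LENGTH:
        raise ValueError("recall score outside 0-10")
    return words_recalled / WORD_LIST_LENGTH


def inverted_euro_d(score: int) -> int:
    """Flip a depressive-symptom score so that higher means better mental health."""
    if not 0 <= score <= EURO_D_MAX:
        raise ValueError("EURO-D score outside assumed 0-12 range")
    return EURO_D_MAX - score


print(proportion_remembered(7))  # 7 of 10 words recalled
print(inverted_euro_d(3))        # few symptoms -> high inverted score
```

Inverting one of the two measures means that decline in both domains points in the same direction, which makes the correlation between the two slopes easier to read.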

slide-14
SLIDE 14
  • Case study: Biobank
  • Age-related decline (3 waves) in fluid intelligence
  • Data access, acquisition: All excellent
  • But cognitive data
  • Not 3 waves
  • Not fluid reasoning
  • Self-paced
  • Ceiling effects
  • Floor effects
  • Easy to remember
  • No slope variance in N=160.000
  • At the mercy of the data available

[Image captions: ‘Me preregistering’ / ‘Me now’]

Kievit, R. A., Fuhrmann, D., Borgeest, G. S., Simpson-Kent, I. L., & Henson, R. N. (2018). The neural determinants of age-related changes in fluid intelligence: a pre-registered, longitudinal analysis in UK Biobank. Wellcome open research, 3.
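Ceiling and floor effects of the kind listed in this case study can be screened for before committing to an analysis plan. The sketch below is one simple way to do that, assuming scores on a bounded scale; the function name and the toy scores are made up for illustration and are not Biobank data.

```python
def boundary_proportions(scores, minimum, maximum):
    """Return (floor, ceiling): the share of scores at the scale minimum and maximum."""
    n = len(scores)
    if n == 0:
        raise ValueError("no scores supplied")
    floor = sum(s == minimum for s in scores) / n
    ceiling = sum(s == maximum for s in scores) / n
    return floor, ceiling


# Made-up scores on a 0-10 scale, for illustration only.
toy_scores = [0, 0, 3, 5, 10, 10, 10, 10]
floor, ceiling = boundary_proportions(toy_scores, minimum=0, maximum=10)
print(f"at floor: {floor:.0%}, at ceiling: {ceiling:.0%}")
```

A large share of participants at either boundary leaves little room for measurable change, which is one plausible route to a finding like the absence of slope variance mentioned above.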

slide-15
SLIDE 15
  • 1) Time
  • 2) Effort
  • 3) Requirements

Beyond CBU: 36 emails… 10 phone calls… 3 months… to get a single signature.

Within CBU:

  • Anyone who shares an office with you has to sign an NDA
  • The computer cannot be on if anybody who has NOT signed the NDA is in the same room
  • The computer with the data cannot be connected to the internet or the CBU network
  • You have to enter a password every time you load the data
slide-16
SLIDE 16

Interim summary

  • Many benefits for researchers
  • Power, replication, precision
  • Widening access
  • Enrich existing paradigms
  • Learning/teaching data analysis
  • But… Where does this data come from?
slide-17
SLIDE 17

18 citations

slide-18
SLIDE 18

Cam-CAN data portal

  • 400 downloads
  • Managed access
slide-19
SLIDE 19

Summary

  • There is an ocean of data out there
  • It can be your primary focus, or complement other (e.g. experimental) work
  • Benefits
  • Power
  • Generalisability
  • Extensions/scope
  • Challenges
  • Cost (modest)
  • Time/effort (negligible, relatively)
  • Suitability (low to high)
  • Slanted towards individual differences/epidemiological (some experimental exists!)
  • Your data can, and where possible should, contribute to the ecosystem
  • Adapt your ethics forms to allow sharing
  • Don’t decide which data is valuable enough

slide-20
SLIDE 20

Questions?

rogier.kievit@mrc-cbu.cam.ac.uk/@rogierK