Reproducible Research. Liz Bageant, erb32@cornell.edu, Cornell University.



SLIDE 1

Reproducible and Collaborative Research

Liz Bageant | erb32@cornell.edu | Cornell University

SLIDE 2

Outline

  • 1. Scientific method and research failures
  • 2. Defining reproducible research
  • 3. Strategies for reproducibility
SLIDE 3

The scientific method

Observation → Ask Question → Background Research → Form Hypothesis → Design Experiment / Study → Carry Out Experiment / Study → Data Analysis → Conclusions → Report Results

Schematic courtesy of Erika Mudrak

  • 1. Scientific method and research failure
SLIDE 4

Continuum of research failure

Disorganization (failure of process) <--> Egregious behavior (failure of integrity)

  • Deliberate manipulation of data to get results
  • P-hacking
  • "Fishing expeditions"
SLIDE 5

P-hacking / fishing expedition

[Scientific-method schematic, annotated "P<0.05"]
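The danger of a fishing expedition is easy to demonstrate with a short simulation (a hypothetical sketch, not from the slides): comparing two groups of pure noise many times reliably produces some "significant" results at the P<0.05 threshold, even though no true effect exists anywhere.

```python
import random
import statistics

random.seed(2)

def looks_significant(x, y):
    """Crude two-sample comparison: True if |t| > 1.96 (roughly p < 0.05)."""
    nx, ny = len(x), len(y)
    se = (statistics.variance(x) / nx + statistics.variance(y) / ny) ** 0.5
    t = (statistics.mean(x) - statistics.mean(y)) / se
    return abs(t) > 1.96

# 100 "studies", each comparing two groups of pure noise: no true effect anywhere.
false_positives = 0
for _ in range(100):
    group_a = [random.gauss(0, 1) for _ in range(50)]
    group_b = [random.gauss(0, 1) for _ in range(50)]
    if looks_significant(group_a, group_b):
        false_positives += 1

# By construction, roughly 5% of these noise-only tests come out "significant".
print(false_positives)
```

Run enough specifications and something will cross the threshold by chance alone; that is exactly what the 0.05 cutoff guarantees.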
SLIDE 6

Continuum of research failure

Disorganization (failure of process) <--> Egregious behavior (failure of integrity)

  • Deliberate manipulation of data to get results
  • P-hacking
  • "Fishing expeditions"
  • HARK-ing
SLIDE 7

P-hack your way to scientific glory

https://projects.fivethirtyeight.com/p-hacking/
SLIDE 8

Hypothesizing After Results are Known (HARK-ing)

[Scientific-method schematic]
SLIDE 9

Is HARK-ing ever okay?

Research goals range from exploratory to confirmatory:

  • Exploratory research = hypothesis generation
  • Confirmatory research = hypothesis testing
SLIDE 10

Continuum of research failure

Disorganization (failure of process) <--> Egregious behavior (failure of integrity)

  • Deliberate manipulation of data to get results
  • P-hacking
  • "Fishing expeditions"
  • HARK-ing
  • "Garden of forking paths"
SLIDE 11

The garden of forking paths (Gelman and Loken, 2013)

[Scientific-method schematic: Data Analysis → Report Results, surrounded by analyst questions such as:]

  • This observation seems funny: should I throw it out?
  • Impute missing data?
  • Other studies control for X, so maybe I should add that in?
  • Should I log transform this?
  • Logit, probit or linear probability model?
  • Can we really assume that X is exogenous? Everyone else does.
  • I tried this thing but it wasn't significant; do I report it?
  • This distribution looks funny: how can I fix it?
  • Those results didn't make sense; should I report them anyway?
  • To winsorize or not to winsorize...
  • My interaction isn't significant... should I take it out?
SLIDE 12

Continuum of research failure

Disorganization (failure of process) <--> Egregious behavior (failure of integrity)

  • Deliberate manipulation of data to get results
  • P-hacking
  • "Fishing expeditions"
  • HARK-ing
  • "Garden of forking paths"
  • Coding errors
  • Poor documentation
SLIDE 13

To avoid the perils of the garden, HARK-ing, P-hacking, and silly mistakes...

  • Integrity! --> Be honest with yourself.
  • Transparency! --> Be honest with your readers.
  • Do you feel good enough about your decision-making processes to write them down for all to see?

Reproducible research!
SLIDE 14

Replicability vs reproducibility

  • Replicability
    – Essential to the scientific method
    – Repeating a study from scratch using new data, a new analyst, and new code
    – If a given relationship between X and Y is true, it should show up in multiple studies

  • 2. Defining Reproducibility
SLIDE 15

Replicability

[Scientific-method schematic]
SLIDE 16

Replicability vs reproducibility

  • Reproducibility
    – Getting the exact same result as an existing study using a new analyst, but the same data and code
    – Recently tractable due to computing and software advances
SLIDE 17

Reproducibility

[Scientific-method schematic]
SLIDE 18

Why reproducibility?

  • Facilitate transparency by communicating procedures easily
  • Identify inadvertent errors
  • Avoid embarrassment
  • Facilitate collaboration
  • Save time
  • Greater potential for extension of work --> higher impact over time
SLIDE 19

Who are you accountable to?

  • You!
  • You next week
  • You in 6 months
  • Colleagues/Coauthors
  • Reviewers
  • Researchers in your field
  • The public / the integrity of science
SLIDE 20

What are we aiming for?

  • Sufficient documentation to bring an unfamiliar user up to speed
    – Codebook
    – Readme file
    – Variable and value labels in the analysis data set
    – Effective comments in code

  • A single click executes your project from start to finish
    – Downloading
    – Reformatting
    – Cleaning and variable construction
    – Analysis
    – Output tables, graphs, figures
    – Reproducible report
SLIDE 21

How do we get there?

  • Separate the phases of data work
  • Systematic file and naming structures
  • Effective and organized scripting
  • Reproducible reports
SLIDE 22

Separate phases of data work

  • 1. Data conversion/cleaning/variable construction
  • 2. Analysis
  • 3. Report generation

  • 3. Strategies for Reproducibility
SLIDE 23

Naming conventions

  • Agree with your collaborators on naming conventions.
  • Human readable
    – Short, useful names
    – Information on content
  • Machine readable
    – Avoid special characters, spaces, etc.
    – Pick one style: CamelCase, ALLCAPS, lowercase, alloneword, underscore_between
    – Consistent naming to facilitate searching
  • Default ordering
    – Date format YYYYMMDD
    – Other numbers: add leading zeros
  • Never call something "final". It probably isn't.
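The payoff of YYYYMMDD dates and leading zeros is that plain lexicographic sorting, the default in any file browser or script, matches chronological and numeric order. A small sketch (the file names are hypothetical examples):

```python
# YYYYMMDD dates: lexicographic order equals chronological order.
dated = ["20240105_survey_clean.do", "20240311_survey_clean.do", "20241002_survey_clean.do"]
assert sorted(dated) == dated

# Without leading zeros, "10" sorts before "2".
unpadded = ["fig2.png", "fig10.png", "fig1.png"]
print(sorted(unpadded))  # ['fig1.png', 'fig10.png', 'fig2.png'] -- wrong order

# Zero-padding restores the intended ordering.
padded = ["fig01.png", "fig02.png", "fig10.png"]
assert sorted(padded) == padded
```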
SLIDE 24

Systematic file structure

  • Must be common to all users!
  • Choose a file structure and stick to it.
  • Make a skeleton of folders when you start a project.
SLIDE 25
  • /dta
    – /original : Copy of read-only original files, exactly as obtained
    – /stata raw : Data after conversion to format of choice
    – /clean : Variable- or module-specific clean files
    – /analysis : Data set(s) you will use for analysis
  • /documentation
    – /metadata : Any/all codebooks or metadata related to the data
    – /reports : Collection of documents where the data was used, cited, described
  • /do
    – /cleaning : Cleaning, merging, reshaping, variable construction scripts
    – /analysis : Analysis scripts
    – master.do : Script that sets up relative file paths and calls all scripts
  • /output (subfolders depend on type of project)
    – /figures
    – /tables
    – /old output : Keep for reference, if you choose
  • /writing
    – /paper 1, /paper 2 : Separate folders if multiple papers use the same data
    – /notes : Optional, as needed
    – /old drafts : Keep older versions of the paper, but get them out of the way
  • /temp : Get rid of clutter as you make it
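Making the skeleton of folders at project start can itself be scripted, so every collaborator gets an identical layout. A minimal sketch using the folder names above (spaces replaced with underscores, per the deck's own machine-readability advice; the project root name is hypothetical):

```python
import os

# Folder skeleton from the slide; create it once when the project starts.
SKELETON = [
    "dta/original", "dta/stata_raw", "dta/clean", "dta/analysis",
    "documentation/metadata", "documentation/reports",
    "do/cleaning", "do/analysis",
    "output/figures", "output/tables", "output/old_output",
    "writing/notes", "writing/old_drafts",
    "temp",
]

def make_skeleton(root):
    """Create every subfolder; exist_ok makes the script safe to re-run."""
    for sub in SKELETON:
        os.makedirs(os.path.join(root, sub), exist_ok=True)

make_skeleton("my_project")  # hypothetical project root
print(sorted(os.listdir("my_project")))
# ['do', 'documentation', 'dta', 'output', 'temp', 'writing']
```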

SLIDE 26

Scripting tips

  • Data + Script = Reproducible Output
  • Master script: runs the other scripts in the correct order
  • Modular scripting vs. one big file
    – Separate types of processes (cleaning, analysis)
    – Avoid repeating blocks of code: write a separate program for repeated processes
  • Notes/comments
    – Consistent headers
    – Useful comments, not expressions of feeling
  • Clarity > efficiency? Consider your collaborators.
  • Re-run the script from the beginning regularly. It must run!
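A master script of the kind described above sets relative paths and calls each phase in order. In the deck's folder layout that role belongs to master.do (Stata); here is a hypothetical Python sketch of the same idea, with invented script names:

```python
import subprocess
import sys
from pathlib import Path

# Relative paths from the project root, so the pipeline runs on any machine.
PROJECT_ROOT = Path.cwd()

# The phases of data work, in the order they must run (file names hypothetical).
PIPELINE = [
    "do/cleaning/01_convert.py",   # data conversion
    "do/cleaning/02_clean.py",     # cleaning and variable construction
    "do/analysis/01_models.py",    # analysis
    "do/analysis/02_tables.py",    # output tables and figures
]

def run_pipeline():
    """Run every script in order; check=True stops loudly on the first failure."""
    for script in PIPELINE:
        print(f"running {script}")
        subprocess.run([sys.executable, str(PROJECT_ROOT / script)], check=True)

# run_pipeline()  # uncomment once the scripts exist
```

Failing loudly on the first error enforces the "it must run from the beginning" rule: a half-broken pipeline never silently produces output.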
SLIDE 27

Reproducible Reports

  • Integrate code into the prose of your report
  • A single file executes all steps of the data process and outputs a final paper
  • Know exactly what data was used for the analysis, what code made which figure, etc.
  • Disadvantages: learning curve, initial investment
  • Alternative method: copy and paste
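The core idea of a reproducible report, computed numbers injected directly into the prose rather than copied in by hand, can be sketched in a few lines (a toy illustration with made-up numbers; real workflows use tools like knitr or Jupyter):

```python
import statistics

# Toy "analysis data set" (hypothetical numbers, for illustration only).
incomes = [12.0, 15.5, 9.8, 20.1, 14.3]

# Compute results and inject them into the prose: the report can never
# drift out of sync with the analysis, unlike copy-and-paste.
mean_income = statistics.mean(incomes)
report = (
    "# Results\n\n"
    f"Across {len(incomes)} households, mean income was {mean_income:.2f}.\n"
)

with open("report.md", "w") as f:
    f.write(report)

print(report)
```

If the data change, re-running the single script regenerates the paper with the correct figures, which is exactly the "single click from start to finish" goal from slide 20.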
SLIDE 28

Avoid research failures by implementing reproducible research techniques to improve organization and transparency:

  • 1. Separate phases of research
  • 2. Systematic file naming and structure
  • 3. Effective and organized scripting
  • 4. Reproducible reports
  • Prioritize elements that are attainable for you.

Your future self thanks you!

SLIDE 29

Additional resources

  • P-hack your way to scientific glory! https://projects.fivethirtyeight.com/p-hacking/
  • Gelman and Loken (2013) Garden of Forking Paths. http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf
