Principles of Data Management (for Biologists), Dr Joe Thorley



SLIDE 1

Principles of Data Management (for Biologists)

Dr Joe Thorley R.P.Bio.

Poisson Consulting

August 14th, 2017

SLIDE 2

Introduction

Biologists spend millions of dollars collecting data with little regard for its management.

SLIDE 3

Study Design

Study design should precede data management.

◮ Identify question(s)
  ◮ what do we want to know and why?
◮ Assess existing data/understanding
  ◮ what do we already know?
◮ Develop field protocol
  ◮ how much will it cost?
  ◮ how useful is the answer likely to be?

SLIDE 4

Data Management

Once a study design has been developed, data management begins. Data management cycles through 10 stages:

  1. data collection
  2. data backup
  3. data security
  4. data digitization
  5. data cleansing
  6. data tidying
  7. data documentation
  8. data analysis
  9. data reporting
  10. data archiving
SLIDE 5

Data Collection

Field crews should be trained, informed, and provided with standard protocols and data collection forms. Printed forms on waterproof paper provide a cheap, robust solution.

SLIDE 6

Data Backup

Duplicate data as soon as possible. A smartphone camera is a simple way to duplicate data and sync to the cloud.
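
Once photographed forms reach a computer, a second copy can be made and checked automatically. The following is a minimal sketch, assuming the photos are ordinary files; the file name, the `backup` folder and the helper names `sha256_of` and `backup_file` are hypothetical and chosen only for illustration.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_file(source: Path, backup_dir: Path) -> Path:
    """Copy source into backup_dir and confirm the copy is identical."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"Backup of {source} does not match the original")
    return target


# Demonstration in a temporary directory with a made-up photo of a field form.
with tempfile.TemporaryDirectory() as tmp:
    photo = Path(tmp) / "form_2017-08-14.jpg"
    photo.write_bytes(b"placeholder bytes standing in for a photographed form")
    copy = backup_file(photo, Path(tmp) / "backup")
    backup_ok = sha256_of(photo) == sha256_of(copy)
```

Comparing checksums rather than file sizes catches a copy that is the right length but corrupted.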

SLIDE 7

Data Security

Ensure the right people have access. Dropbox (https://www.dropbox.com) provides simple data security and sharing.

SLIDE 8

Data Digitization

Get the data into a useable electronic form. Excel is a useful data entry tool in the hands of a trained user.

SLIDE 9

Data Cleansing

Correct the inevitable errors. At best, errors add noise; at worst, they invalidate subsequent analyses!
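
Many errors can be flagged automatically before anyone corrects them by hand. The sketch below is one possible approach, not a prescribed method: the field names `depth_m` and `hour`, the plausible depth range and the example records are all assumptions made for illustration.

```python
def find_errors(records, min_depth=0.0, max_depth=100.0):
    """Return (row_number, message) pairs for records failing basic checks."""
    errors = []
    for row_number, record in enumerate(records, start=1):
        depth = record.get("depth_m")
        hour = record.get("hour")
        if depth is None:
            errors.append((row_number, "missing depth"))
        elif not min_depth <= depth <= max_depth:
            errors.append((row_number, f"depth {depth} m outside plausible range"))
        if hour is None:
            errors.append((row_number, "missing hour"))
        elif not 0 <= hour <= 23:
            errors.append((row_number, f"hour {hour} is not between 0 and 23"))
    return errors


# Made-up records: the second has a misplaced decimal, the third an impossible hour.
records = [
    {"site": "A", "depth_m": 12.5, "hour": 9},
    {"site": "B", "depth_m": 1250.0, "hour": 10},
    {"site": "C", "depth_m": 8.0, "hour": 25},
]
errors = find_errors(records)
```

The function only reports problems; deciding the correct value still requires going back to the original field form.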

SLIDE 10

Data Tidying

Tidy datasets are easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table (Wickham 2014).

SQLite (https://sqlite.org) is free, open-source, cross-platform, embedded database software.
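
The tidy structure can be sketched in SQLite using Python's built-in `sqlite3` module. The table names `Site` and `Visit` and the columns `Depth` and `Hour` follow the metadata example later in this deck; the `VisitDate` column, the constraints and the data values are assumptions added for illustration.

```python
import sqlite3

# In-memory database for illustration; pass a file path to persist it.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# One table per type of observational unit: sites, and visits to sites.
conn.execute(
    """
    CREATE TABLE Site (
        Site  TEXT PRIMARY KEY,
        Depth REAL NOT NULL
    )
    """
)
conn.execute(
    """
    CREATE TABLE Visit (
        Site      TEXT NOT NULL REFERENCES Site (Site),
        VisitDate TEXT NOT NULL,
        Hour      INTEGER NOT NULL CHECK (Hour BETWEEN 0 AND 23),
        PRIMARY KEY (Site, VisitDate, Hour)
    )
    """
)

# Each row is one observation; each column is one variable.
conn.executemany("INSERT INTO Site VALUES (?, ?)", [("A", 12.5), ("B", 8.0)])
conn.executemany(
    "INSERT INTO Visit VALUES (?, ?, ?)",
    [("A", "2017-08-14", 9), ("A", "2017-08-15", 10), ("B", "2017-08-14", 11)],
)
conn.commit()

site_count = conn.execute("SELECT COUNT(*) FROM Site").fetchone()[0]
visit_count = conn.execute("SELECT COUNT(*) FROM Visit").fetchone()[0]
```

Storing depth once per site, rather than repeating it on every visit row, is what keeps each type of observational unit in its own table.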

SLIDE 11

Relational Data

From R For Data Science (http://r4ds.had.co.nz) available via CC BY-NC-ND 3.0 US.
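
Relational data are linked through shared key columns, so separate tidy tables can be recombined when an analysis needs them together. A minimal sketch with `sqlite3` follows; the `Site` and `Visit` tables and their values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Site (Site TEXT PRIMARY KEY, Depth REAL);
    CREATE TABLE Visit (Site TEXT REFERENCES Site (Site), Hour INTEGER);
    INSERT INTO Site VALUES ('A', 12.5), ('B', 8.0);
    INSERT INTO Visit VALUES ('A', 9), ('A', 10), ('B', 11);
    """
)

# Join each visit to its site through the shared key column Site.
rows = conn.execute(
    """
    SELECT Visit.Site, Visit.Hour, Site.Depth
    FROM Visit
    INNER JOIN Site ON Site.Site = Visit.Site
    ORDER BY Visit.Site, Visit.Hour
    """
).fetchall()
```

Each visit row picks up the depth of its site without the depth ever being stored twice.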

SLIDE 12

Data Documentation

Data are just numbers and categories unless people know what they mean. A simple metadata table can provide a description and units for each variable.

Table  Column  Units    Description
Site   Depth   m        The tidally corrected depth
Visit  Hour    PST8PDT  The hour of the visit
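
A metadata table can live in the same database as the data, which makes it possible to check that every column has been documented. The sketch below assumes a SQLite database; the table name `MetaData`, the column names `TableName` and `ColumnName`, and the helper `undocumented_columns` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Site (Site TEXT PRIMARY KEY, Depth REAL);
    CREATE TABLE Visit (Site TEXT, Hour INTEGER);
    CREATE TABLE MetaData (
        TableName   TEXT NOT NULL,
        ColumnName  TEXT NOT NULL,
        Units       TEXT,
        Description TEXT NOT NULL,
        PRIMARY KEY (TableName, ColumnName)
    );
    """
)

# The two entries from the example metadata table.
conn.executemany(
    "INSERT INTO MetaData VALUES (?, ?, ?, ?)",
    [
        ("Site", "Depth", "m", "The tidally corrected depth"),
        ("Visit", "Hour", "PST8PDT", "The hour of the visit"),
    ],
)


def undocumented_columns(conn):
    """Return sorted (table, column) pairs that have no metadata entry."""
    documented = set(conn.execute("SELECT TableName, ColumnName FROM MetaData"))
    tables = [
        name
        for (name,) in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name != 'MetaData'"
        )
    ]
    missing = []
    for table in tables:
        # Table names come from sqlite_master, not from user input.
        for column in conn.execute(f"PRAGMA table_info({table})"):
            if (table, column[1]) not in documented:
                missing.append((table, column[1]))
    return sorted(missing)


missing = undocumented_columns(conn)
```

Here the check reports the two `Site` key columns, which the example metadata table does not yet describe.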

SLIDE 13

Data Analysis

Analytic code can be shared on GitHub (https://github.com).

SLIDE 14

GitHub bcgov

The province already has a GitHub account for sharing code.

SLIDE 15

Data Reporting

An answer only has value if decision-makers are aware of it. Zotero (https://www.zotero.org) is a free, easy-to-use tool to help you collect, organize, cite, and share your research sources. ResearchGate (https://www.researchgate.net) is a free way to share and discover research.

SLIDE 16

Data Archiving

Ensure others are able to use the data in perpetuity. Zenodo (https://zenodo.org) is free, citable, discoverable and long-term, with open, restricted and closed access options. It uses the same cloud infrastructure as CERN’s own Large Hadron Collider (LHC) research data.

SLIDE 17

Summary

Data management requires trained personnel with an understanding of the principles but does not have to be expensive and pays for itself many times over.

SLIDE 18

DFO

SLIDE 19

Parks

SLIDE 20

DataBC

The provincial government has DataBC.

SLIDE 21

CKAN

CKAN (https://ckan.org) is the world’s leading Open Source data portal platform. It is free and open source, with support for teams and private data. A key feature is an API (application programming interface) that allows code to interact with the repository.
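
As a rough sketch of what interacting with the repository through code can look like, the example below builds a request for CKAN's `package_search` action and reads dataset titles from a response. The portal address `https://ckan.example.org/`, the search term and the sample response are made up for illustration, and the helper names are hypothetical; no network request is made.

```python
import json
from urllib.parse import urlencode


def package_search_url(portal_url: str, query: str, rows: int = 5) -> str:
    """Build a URL for CKAN's package_search action."""
    params = urlencode({"q": query, "rows": rows})
    return f"{portal_url.rstrip('/')}/api/3/action/package_search?{params}"


def dataset_titles(response_text: str) -> list:
    """Extract dataset titles from a package_search JSON response."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise ValueError("CKAN reported an unsuccessful request")
    return [dataset["title"] for dataset in payload["result"]["results"]]


# Hypothetical portal address; substitute the address of a real CKAN portal.
url = package_search_url("https://ckan.example.org/", "salmon")

# Hand-written stand-in for a response, trimmed to the fields used here.
sample_response = json.dumps(
    {
        "success": True,
        "result": {
            "count": 2,
            "results": [
                {"name": "salmon-counts", "title": "Salmon Counts"},
                {"name": "salmon-habitat", "title": "Salmon Habitat"},
            ],
        },
    }
)
titles = dataset_titles(sample_response)
```

Against a real portal, the text returned by fetching `url` (for example with `urllib.request.urlopen`) would take the place of `sample_response`.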