Introduction to data management with applications in hydrobiology

david.kneis@tu-dresden.de, TU Dresden, Institute of Hydrobiology

Outline
◮ Motivation
◮ Basics about tables
◮ Example data set
◮ How to arrange data properly
◮ Software options for data storage
◮ Working with data in base R
◮ Topics not covered

Motivation (page 4): Typical sources of data
◮ Monitoring (e.g. water quality recorded over time)
◮ Snapshot sampling (e.g. abundance of river bed organisms)
◮ Experiments (e.g. response of system to treatment; with replication)
◮ Model outputs (e.g. scenario or sensitivity analysis)

Motivation (page 5): Why care about data management?
◮ It is the key to efficient data analysis
◮ It avoids inconsistency and loss of information
◮ It ensures re-usability by others (and by yourself at a later time)
◮ It is a must for serious research (traceability of results)
◮ It enables efficient version control and archiving
Investment in good data management always pays off.

Motivation (page 7): What is data management about?
1. Arranging data in tables with proper layout
2. Selecting software for data storage and manipulation
3. Understanding operations on tables
   ◮ merging, filters, aggregation
4. Knowing how to create inputs for specific analyses
   ◮ plotting, statistical tests
These will be the main subjects of this course.
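The three operations on tables named above (merging, filters, aggregation) can be sketched in base R as follows. This is only an illustration: the tables `samples` and `values`, their column names, and all numbers are invented and not part of the course material.

```r
# Hypothetical toy tables (names and numbers invented for illustration)
samples <- data.frame(
  sample_id = c(1, 2, 3),
  location  = c("A", "A", "B"),
  stringsAsFactors = FALSE
)
values <- data.frame(
  sample_id = c(1, 1, 2, 3),
  variable  = c("temp", "pH", "temp", "temp"),
  value     = c(12.5, 7.8, 14.0, 9.5),
  stringsAsFactors = FALSE
)

# Merging: combine rows of both tables that share the same sample_id
merged <- merge(values, samples, by = "sample_id")

# Filter: keep only the temperature records
temp <- subset(merged, variable == "temp")

# Aggregation: mean temperature per location
temp_mean <- aggregate(value ~ location, data = temp, FUN = mean)
print(temp_mean)
```

The result of each step is again a table, which is why such operations can be chained freely.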

Basics about tables (page 9): Data types
numeric: Weights, dimensions, concentrations, ...
integer: Number of offspring, ordinal and nominal data (classes), IDs
character: Nominal data (classes), IDs
logical: All kinds of dichotomous data
special types: Dates and times, images, ...
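As a small illustration of these data types in base R, the sketch below creates one value per type and inspects it. The variable names and values are invented; `factor` is shown as one common way to hold nominal classes, which R stores internally as integers.

```r
# One example value per data type listed above (all values are invented)
weight    <- 2.75                               # numeric (stored as "double")
offspring <- 12L                                # integer (note the L suffix)
species   <- "Daphnia magna"                    # character
present   <- TRUE                               # logical
sampled   <- as.Date("2019-05-14")              # date (a special type)
habitat   <- factor(c("pool", "riffle", "pool")) # nominal classes

print(typeof(weight))     # "double"
print(typeof(offspring))  # "integer"
print(typeof(species))    # "character"
print(typeof(present))    # "logical"
print(class(sampled))     # "Date"
print(typeof(habitat))    # "integer" (class labels are kept as levels)
```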

Basics about tables (page 10): Tables
◮ Most common and versatile data container.
◮ Columns are vectors of a particular data type.
◮ A table row is, in general, not a vector but a list (because types differ).
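The difference between columns and rows can be checked directly in base R. The following is a minimal sketch with a made-up table (`lakes` and its contents are hypothetical):

```r
# Hypothetical example table; lake names and depths are made up
lakes <- data.frame(
  name     = c("Lake A", "Lake B"),
  maxDepth = c(35.2, 8.1),
  natural  = c(TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Each column is a vector of a single type
col <- lakes$maxDepth
print(is.atomic(col))
print(typeof(col))

# A row mixes types, so it cannot be a plain vector; as a list it keeps them
row <- as.list(lakes[1, ])
print(is.list(row))
print(sapply(row, typeof))

# Forcing the row into a vector coerces every element to character
print(unlist(row))
```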

Basics about tables (page 11): Tables
Representation of tables in R
data.frame: Classic and commonly used, but its 'ugly' defaults are likely to confuse beginners
tibble: Good alternative
data.table: Another alternative

Basics about tables (page 12): Exercise: A simple data frame

    rm(list = ls())                      # start with an empty workspace
    options(stringsAsFactors = FALSE)    # only needed for R < 4.0, where the default was TRUE
    x <- read.table(file = "data/lakedepth.txt", sep = "\t", header = TRUE)
    print(typeof(x))          # type of object
    str(x)                    # structure (str() prints by itself)
    print(lapply(x, typeof))  # type of columns
    print(head(x))            # top rows
    print(x$maxDepth)         # access a column (returns a vector)
    print(x["maxDepth"])      # ... (returns a one-column data frame)
    print(x[, "maxDepth"])    # ... (returns a vector)
    print(x[1, ])             # access a row
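Because the input file of the exercise is not reproduced here, the sketch below uses a small stand-in table built in memory (lake names and depths are invented) to show that the column access variants differ in what they return. The `drop = FALSE` argument is an addition not used in the exercise.

```r
# Stand-in for the table read from file; all values are invented
x <- data.frame(
  lake     = c("Lake A", "Lake B", "Lake C"),
  maxDepth = c(35.2, 8.1, 72.0),
  stringsAsFactors = FALSE
)

a  <- x$maxDepth                     # plain numeric vector
b  <- x["maxDepth"]                  # data frame with one column
c1 <- x[, "maxDepth"]                # vector again (table structure is dropped)
d  <- x[, "maxDepth", drop = FALSE]  # data frame; drop = FALSE keeps the table

print(class(a))   # "numeric"
print(class(b))   # "data.frame"
print(class(c1))  # "numeric"
print(class(d))   # "data.frame"
```

The silent switch from table to vector in `x[, "maxDepth"]` is one of the data.frame defaults that tends to surprise beginners.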

Example data set (page 15): Screening a river for AMR genes
[Figure: numbered sampling points 1 to 30 with labels POS, HIR, HER, RHG, GOM, HAU; legend for mcr1 with levels from −Inf to −1]

Example data set (page 16): Summary
We sampled ...
◮ water and bottom sediment
◮ at multiple locations
◮ repeatedly, in monthly intervals
to analyze DNA extracts for ...
◮ the abundance of various antibiotic resistance genes
◮ the abundance of marker genes (e.g. 16S rRNA)
and we took physical and technical replicates.

Example data set (page 18): Why is this bad practice?
◮ Mixed information in columns and even in cells
◮ Multiple values per cell
◮ Many sub-tables on one spreadsheet
◮ Missing headers
◮ No software can read this out of the box
◮ Data become useless soon (missing headers and metadata)

How to arrange data properly (page 20): Objectives
Understand ...
◮ the main structure of a data set.
◮ how to split the data over separate tables.
◮ how individual tables are linked to each other.
◮ basic rules to achieve data integrity.

How to arrange data properly (page 25): Data dimensions
Consider the example data set (page 16). What are the major dimensions of the data?
◮ Compartment (water, sediment)
◮ Space (2-dimensional, sampling locations)
◮ Time
◮ Variable (here: gene)
A very common case in hydrobiological field research.
If you are not sure about dimensions, imagine some plots of the data. Which item(s) would appear on the x-axis or in the legend?
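One way to make the dimensions tangible is to store each of them as its own column and then decide which one goes on the x-axis and which one into the legend. The sketch below assumes a hypothetical table `obs`; location codes, the gene name, and all values are invented.

```r
# Hypothetical records: one column per dimension plus the measured value
obs <- data.frame(
  compartment = c("water", "water", "sediment", "sediment"),
  location    = c("L1", "L2", "L1", "L2"),
  date        = as.Date(rep("2019-05-01", 4)),
  variable    = rep("geneA", 4),
  value       = c(0.8, 1.4, 2.1, 2.9),
  stringsAsFactors = FALSE
)

# Cross-tabulate two dimensions: locations for the x-axis,
# compartments for the legend
tab <- xtabs(value ~ compartment + location, data = obs)
print(tab)

# A grouped bar plot of this layout could then be drawn with:
# barplot(tab, beside = TRUE, legend.text = TRUE)
```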

How to arrange data properly (page 29): Entities
Consider the example data set (page 16). What are the important entities?
◮ Samples
◮ Locations
◮ Compartments (dropped for simplicity)
◮ Variables
◮ Values (measured numerical properties)
This leads us to the entity-relationship model (ERM): https://en.wikipedia.org/wiki/Entity-relationship_model

How to arrange data properly (page 32): Entities and relations
◮ Multiple values, each measured on one particular sample
◮ Multiple samples, each taken at one particular location
◮ Each value relates to just one variable
◮ ...
Relations of type 1:1 and n:m also exist, and those need to be resolved (not discussed here).
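The relations listed above can be followed with `merge()` once every entity has its own table. The sketch below is a hypothetical layout: table names, column names, units, and all numbers are assumptions made for illustration, not the course's actual data model.

```r
# Hypothetical tables, one per entity; all IDs and numbers are invented
locations <- data.frame(location_id = c("L1", "L2"),
                        river_km = c(3.5, 12.0),
                        stringsAsFactors = FALSE)
samples <- data.frame(sample_id = c(1, 2, 3),
                      location_id = c("L1", "L1", "L2"),
                      stringsAsFactors = FALSE)
variables <- data.frame(variable_id = c("geneA", "16S"),
                        unit = c("copies/mL", "copies/mL"),
                        stringsAsFactors = FALSE)
values <- data.frame(sample_id = c(1, 1, 2, 3),
                     variable_id = c("geneA", "16S", "geneA", "geneA"),
                     value = c(120, 5e6, 90, 300),
                     stringsAsFactors = FALSE)

# Follow the relations: value -> sample -> location, and value -> variable
full <- merge(values, samples, by = "sample_id")
full <- merge(full, locations, by = "location_id")
full <- merge(full, variables, by = "variable_id")
print(full)
```

Each value ends up with exactly one sample, one location, and one variable attached, so the merged table has as many rows as `values`.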

How to arrange data properly (page 34): Attributes of entities
→ Attributes become table columns

How to arrange data properly (page 36): Tables and relations
◮ No orphaned records (e.g. only samples from known locations)
◮ No ambiguity (e.g. two samples cannot share the same ID)
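Both rules can be checked with a few lines of base R. The sketch below uses two hypothetical tables that deliberately violate the rules (one unknown location, one repeated sample ID); the names and contents are invented.

```r
# Hypothetical tables with deliberate errors
locations <- data.frame(location_id = c("L1", "L2"),
                        stringsAsFactors = FALSE)
samples <- data.frame(sample_id = c(1, 2, 2, 3),
                      location_id = c("L1", "L2", "L2", "L9"),
                      stringsAsFactors = FALSE)

# Orphaned records: samples pointing to a location not listed in 'locations'
orphans <- samples[!(samples$location_id %in% locations$location_id), ]
print(orphans)

# Ambiguity: sample IDs that occur more than once
dup_ids <- unique(samples$sample_id[duplicated(samples$sample_id)])
print(dup_ids)
```

A database system enforces such rules automatically; with plain tables in files, checks like these have to be run by hand.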

How to arrange data properly (page 38): Additional constraints
◮ Each table needs a unique primary key (green color)
◮ Further columns may require uniqueness (blue color)
◮ Constraints can apply to a single column or to a set of columns
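A uniqueness constraint on one column or on a set of columns can be tested as sketched below. The helper `is_valid_key` and the table `values` are hypothetical and only illustrate the idea; the helper also rejects missing values in key columns, which is an added assumption.

```r
# Hypothetical helper: TRUE if the given column(s) identify each row uniquely
# and contain no missing values
is_valid_key <- function(tbl, cols) {
  key <- tbl[, cols, drop = FALSE]
  !any(is.na(key)) && anyDuplicated(key) == 0
}

# Invented table: rows 3 and 4 repeat the same sample/variable combination
values <- data.frame(
  value_id    = c(1, 2, 3, 4),
  sample_id   = c(1, 1, 2, 2),
  variable_id = c("geneA", "16S", "geneA", "geneA"),
  value       = c(120, 5e6, 90, 95),
  stringsAsFactors = FALSE
)

print(is_valid_key(values, "value_id"))                     # single column
print(is_valid_key(values, c("sample_id", "variable_id")))  # set of columns
```

Here the single-column key passes, while the combined check fails because of the repeated combination in the last two rows.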

How to arrange data properly (page 39): Summary of basic steps
◮ Identify entities, attributes, and relations
◮ Optimize tables following the rules of 'normalization'
◮ Introduce single-table constraints (primary key, unique, non-emptiness) for data integrity
◮ Ensure integrity of table relations (foreign key constraints)
→ Look for courses and books on 'relational database design'

How to arrange data properly (page 40): Indicators of proper design
◮ Tables are strictly rectangular (well-defined number of rows and columns)
◮ Data are self-contained (all relevant metadata included)
◮ Tables and columns have intuitive names
◮ No redundancies (eliminates risk of inconsistency)
◮ Limited number of explicit missing values (saves memory)
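The link between redundancy and inconsistency can be illustrated with a small, invented example: if an attribute of a location is repeated in every sample record, a single typo makes the records contradict each other. All table names, column names, and values below are hypothetical.

```r
# Redundant layout: the river name is repeated for every sample
redundant <- data.frame(
  sample_id   = c(1, 2, 3),
  location_id = c("L1", "L1", "L1"),
  river       = c("RiverA", "RiverA", "Rivera"),  # typo in the third record
  stringsAsFactors = FALSE
)

# Distinct river names per location; more than one signals an inconsistency
n_names <- tapply(redundant$river, redundant$location_id,
                  function(v) length(unique(v)))
print(n_names)

# Layout without redundancy: the attribute is stored once per location
locations <- data.frame(location_id = "L1", river = "RiverA",
                        stringsAsFactors = FALSE)
samples <- redundant[, c("sample_id", "location_id")]
```

In the second layout the river name exists in exactly one place, so it cannot contradict itself.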

How to arrange data properly (page 41): Why is redundancy bad?
