Introduction to Data Science using R
Lecture 1
Aims
The aim of this module is to enable you to develop skills in some specific types of data analysis by providing supported practice in workshops and opportunities to apply them independently in ‘projects’.
These will help you become:
1. Analysing and using 3D structures in molecular bioscience research
2. Biological data science
3. Image analysis
4. Sequence analysis
Each option is about 15 hours of contact time.
At the end of this module the successful student will be able to:
1. Demonstrate the acquisition of skills in experimental design and data analysis, related to the option chosen within the module
2. Apply the skills learned to address novel bioscience problems
Broadest sense
At the end of this module the successful student will be able to:
1. Demonstrate the acquisition of skills in experimental design and data analysis, related to the options chosen within the module
2. Apply the skills learned to address novel bioscience problems
i.e., Devise reproducible strategies to import, tidy, transform, model and present data in R
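As a rough illustration of what "import, tidy, transform, model and present" can look like in a single script, here is a minimal sketch in base R. The data values, the column names and the temporary file are all invented for the example; they are assumptions, not module material.

```r
# Invented example data: heights (cm) of four plants under two treatments.
raw <- data.frame(
  plant   = 1:4,
  control = c(10.1, 9.8, 10.5, 10.0),
  treated = c(12.0, 11.6, 12.3, 11.9)
)

# Write the data to a temporary CSV so there is a file to import.
csv_path <- tempfile(fileext = ".csv")
write.csv(raw, csv_path, row.names = FALSE)

# Import: read the file into a data frame.
heights <- read.csv(csv_path)

# Tidy: one row per observation, one column per variable.
tidy <- data.frame(
  plant     = rep(heights$plant, times = 2),
  treatment = rep(c("control", "treated"), each = nrow(heights)),
  height    = c(heights$control, heights$treated)
)

# Transform: convert centimetres to millimetres.
tidy$height_mm <- tidy$height * 10

# Model: a simple linear model of height against treatment.
fit <- lm(height_mm ~ treatment, data = tidy)

# Present: group means and the fitted coefficients.
group_means <- aggregate(height_mm ~ treatment, data = tidy, FUN = mean)
print(group_means)
print(coef(fit))
```

Because every step is written down in the script, rerunning it on the same file repeats the whole analysis without any manual steps.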
What is Data Science?
Not the same as numeracy: you don’t have to be good at maths.
Not the same as Statistics: it includes statistical analysis but also what you have to do before and after.
Data Science: reproducible workflows for the simulation, collection, organisation, processing, analysis and presentation of data.
Data skills
Explanatory variables: choose / set / manipulate
Experiments (tests of ideas)
Response variables: measure
Experimental activity, Analyse, Visualise, Interpret and report, Simulation, Abstraction
Tidy (mental model and activity), Import, Transform, Explore, Model (statistics), Report, Simulate
Less than you probably think
~80% of your time is spent on getting data, cleaning data, aggregating data, reshaping data, and exploring data using exploratory data analysis and data visualization.
Data analysis means getting data, reshaping it, exploring it and visualizing it, as well as modelling.
Reproducibility: same data + same analysis = same results
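"Same data + same analysis = same results" can be checked directly in R. The sketch below is only an illustration: the function name `simulate_and_analyse` and the seed value are made up, and the "data" are simulated, so fixing the random seed is what makes them the same on each run.

```r
# Hypothetical helper: simulate 20 measurements and summarise them.
# Setting the seed fixes the simulated data, so the "same data" are
# generated every time the function is called with the same seed.
simulate_and_analyse <- function(seed) {
  set.seed(seed)
  x <- rnorm(20, mean = 5, sd = 1)  # simulated data
  mean(x)                           # the "analysis"
}

run1 <- simulate_and_analyse(58)  # 58 is an arbitrary seed
run2 <- simulate_and_analyse(58)

# Same data + same analysis: the two results are identical.
identical(run1, run2)
```

Without the `set.seed()` call the simulated values would differ between runs, and so would the result.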
Reproducibly: Tidy, Import, Transform, Explore, Model, Report, Simulate
Reproducible: scripting
Repeatable: protocol, lab book
Explanatory variables: choose / set / manipulate
Experiments (tests of ideas)
Response variables: measure
Experimental design, Analyse, Visualise, Interpret and report
Replication: within a study
Repeatable: between studies. Independently, without the use of original data but generally using the same methods.
Reproducible: the original data and original methods reproduce all of the findings
Methods need to be perfectly described
Patil et al. A statistical definition for reproducibility and replicability
That’s what it is... how will it work in this module?
What am I trying to do?
Or, my objectives...
Create a learning environment characterised by
I didn’t want
I did want you to
Core skills are taught with some recipes and lots of suggested practice examples.
6 workshops support you in learning how to create the assessed output: a reproducible analysis related to your project, a past ‘project’ or provided ‘projects’.
Do you need to revise? Access to the latest versions of Stage 1 and Stage 2
L01 Introduction to Data Science
W01 Developing independence and good practice - Tips
W02 Importing data
W03 Reproducibility 1
W04 Reproducibility 2
W05 Tidy data and Tidying data
W06 An Introduction to Machine Learning
W07 Project work
Weeks 6, 7 and 8, drop-ins
Note: timetable session titles may be incorrect... VLE is correct
i.e., Devise reproducible strategies to import, tidy and model data in R
The submission is a zip file of organised files including, at least, the Rmd, the knitted output (can be HTML, PDF or Word) and the data. An example is available on the VLE.
The Rmd should be well commented and contain everything needed to recreate, and understand the recreation of, the knitted output.
The knitted output should be no more than 1000 words.
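One possible way to organise the files before zipping is sketched below. The folder name, the file names and the data values are all hypothetical, and the project is built in a temporary directory purely for illustration; the example on the VLE remains the reference for what is expected.

```r
# Hypothetical project folder, created in a temporary location.
project <- file.path(tempdir(), "my_analysis")
dir.create(file.path(project, "data"), recursive = TRUE, showWarnings = FALSE)

# The data live inside the project, so the Rmd can read them with a
# relative path such as "data/measurements.csv" (invented values).
write.csv(
  data.frame(id = 1:3, value = c(2.5, 3.1, 2.8)),
  file.path(project, "data", "measurements.csv"),
  row.names = FALSE
)

# Empty placeholder for the Rmd that would hold the commented analysis.
file.create(file.path(project, "analysis.Rmd"))

# Check what the project contains before knitting and zipping.
files <- list.files(project, recursive = TRUE)
print(files)
```

Knitting would add the output file next to the Rmd, either with the Knit button in RStudio or with `rmarkdown::render()` if the rmarkdown package is installed. The folder can then be zipped with the operating system's tools or with `utils::zip()`, which relies on a zip program being available on the system.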
Do you need to revise? Access to the latest versions of Stage 1 and Stage 2
Other sources:
Google
Talk to people
R-bloggers
Stack Overflow
Foundation for Open Access Statistics
Teach others, especially house-keeping
#biol58M
Genome Res. 2015. 25: 1417-1422
Good Enough Practices in Scientific Computing
R for Data Science: Garrett Grolemund & Hadley Wickham