Introduction to Data Science using R: Lecture 1

SLIDE 1

Introduction to Data Science using R

Lecture 1

SLIDE 2

58M Aims

The aim of this module is to enable you to:

  • develop skills in some specific types of ‘data analysis’ through supported practice in workshops and opportunities to apply them independently in ‘projects’

These will help you become:

  • independent researchers
  • highly employable

SLIDE 3

Options: You do one of

  1. Analysing and using 3D structures in molecular bioscience research
  2. Biological data science
  3. Image analysis
  4. Sequence analysis

Each option is about 15 hours contact time

SLIDE 4

Learning Outcomes

At the end of this module the successful student will be able to:

  1. Demonstrate the acquisition of skills in experimental design and data analysis, related to the option chosen within the module
  2. Apply the skills learned to address novel bioscience problems

(Broadest sense)

SLIDE 5

Learning Outcomes: this option

At the end of this module the successful student will be able to:

  1. Demonstrate the acquisition of skills in experimental design and data analysis, related to the option chosen within the module
  2. Apply the skills learned to address novel bioscience problems

i.e., devise reproducible strategies to import, tidy, transform, model and present data in R

SLIDE 6

Overview

What is Data Science?

  • Not the same as numeracy: you don’t have to be good at maths
  • Not the same as Statistics: includes statistical analysis but also what you have to do before and after.

Data Science: reproducible workflows for the simulation, collection, organisation, processing, analysis and presentation of data.

SLIDE 7

Science

  • Explanatory variables: choose / set / manipulate
  • Experiments (tests of ideas)
  • Response variables: measure

Data skills: Experimental activity, Analyse, Visualise, Interpret and report, Simulation, Abstraction
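
To make the explanatory/response distinction concrete, here is a minimal sketch in base R. The experiment, the variable names (`dose`, `growth`) and all the numbers are invented for illustration; the data are simulated, not real.

```r
# Hypothetical experiment: the explanatory variable (dose) is chosen by the
# experimenter; the response variable (growth) is measured.
set.seed(1)                                     # fix the random numbers so reruns match

dose   <- rep(c(0, 5, 10), each = 4)            # explanatory: set by design, 4 replicates each
growth <- 2 + 0.3 * dose +
  rnorm(length(dose), sd = 0.5)                 # response: simulated here, measured in practice
experiment <- data.frame(dose, growth)

# Analyse: model the response as a function of the explanatory variable
fit <- lm(growth ~ dose, data = experiment)
coef(fit)

# Visualise: response on the y axis, explanatory variable on the x axis
plot(growth ~ dose, data = experiment)
abline(fit)
```

The formula `growth ~ dose` reads "response explained by explanatory", which is the same mental model as the slide.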

SLIDE 8

What is data science

  • Import
  • Tidy (mental model and activity)
  • Transform
  • Explore
  • Model (statistics)
  • Report
  • Simulate
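
The stages above can be sketched end to end in a few lines of base R. This is only an illustration under stated assumptions: the plant heights, the column names and the cm-to-mm conversion are invented, and the inline text stands in for a real data file.

```r
# Import: read.csv() normally takes a file path; inline text keeps the sketch self-contained
raw <- read.csv(text = "plant,week1,week2
a,1.2,2.3
b,1.0,2.1
c,1.4,2.9")

# Tidy: one observation per row, one variable per column
tidy <- data.frame(
  plant  = rep(raw$plant, times = 2),
  week   = rep(c(1, 2), each = nrow(raw)),
  height = c(raw$week1, raw$week2)
)

# Transform: derive a new variable (assumes heights were recorded in cm)
tidy$height_mm <- tidy$height * 10

# Explore: summarise the response for each week
aggregate(height ~ week, data = tidy, FUN = mean)

# Model: simple linear model of height against week
fit <- lm(height ~ week, data = tidy)

# Report: print the estimates, rounded for presentation
round(coef(fit), 2)
```

Because every step is in a script, rerunning it on the same file repeats the whole analysis without any manual editing of the data.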

SLIDE 9

How much of data science is using statistics?

Less than you probably think

  • ~80% of your time is spent on getting data, cleaning data, aggregating data, reshaping data, and exploring data using exploratory data analysis and data visualization.
  • Data analysis means: getting data, reshaping it, exploring it, and visualizing it, as well as modelling
  • Reproducibility: same data + same analysis = same results
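
A minimal sketch of "same data + same analysis = same results" in base R. The function name `analyse()`, the seed and the simulated data are assumptions made for illustration only.

```r
# The "analysis": a summary and a model, wrapped in a function so it is
# applied in exactly the same way every time (hypothetical example)
analyse <- function(d) {
  fit <- lm(y ~ x, data = d)
  c(mean_y = mean(d$y), slope = coef(fit)[["x"]])
}

# The "data": simulated here; set.seed() makes the simulation itself repeat exactly
set.seed(58)
dat <- data.frame(x = 1:20)
dat$y <- 3 + 2 * dat$x + rnorm(20)

first_run  <- analyse(dat)
second_run <- analyse(dat)

# Same data + same analysis = same results
identical(first_run, second_run)
```

The same check would fail if the data were edited by hand between runs, which is one reason to keep every step in a script.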

SLIDE 10

Reproducibility is a key feature

Reproducibly: Import, Tidy, Transform, Explore, Model, Report, Simulate

SLIDE 11

Rationale

  • Reproducible: scripting
  • Repeatable: protocol, lab book

  • Explanatory variables: choose / set / manipulate
  • Experiments (tests of ideas)
  • Response variables: measure

Experimental design, Analyse, Visualise, Interpret and report

SLIDE 12

Reproducible, Repeatable, Replicated

  • Replication: within a study
  • Repeatable: between studies. Independently, without the use of original data but generally using the same methods.
  • Reproducible: The original data and original methods reproduce all of the findings of a study.

Methods need to be perfectly described

Patil et al. A statistical definition for reproducibility and replicability
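
One way to picture the three terms is with a simulated "study" in base R. This is a rough sketch, not taken from Patil et al.: the functions `run_study()` and `estimate()`, the seeds and the true mean are all hypothetical.

```r
# Hypothetical method shared by both studies: estimate the mean of a measured response
estimate <- function(measurements) mean(measurements)

# Hypothetical data collection; n and true_mean are illustrative assumptions
run_study <- function(seed, n = 10, true_mean = 5) {
  set.seed(seed)
  rnorm(n, mean = true_mean)    # n replicate measurements: replication within a study
}

original_data   <- run_study(seed = 1)
original_result <- estimate(original_data)

# Reproducible: original data + original method gives the original finding again
reproduced <- estimate(original_data)
identical(reproduced, original_result)

# Repeatable: an independent study with new data but the same method;
# the result should be similar, not identical
repeat_data <- run_study(seed = 2)
repeated    <- estimate(repeat_data)
c(original = original_result, repeated = repeated)
```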

SLIDE 13

That’s what it is… how will it work in this module? What am I trying to do?

My objectives

SLIDE 14

Additional learning objectives

Or my objectives…

Create a learning environment characterised by

  • A focus on progress and improvement
  • Enjoyment and satisfaction
  • Interaction and exchange of ideas
  • Initiative and independence
  • Supported problem solving

SLIDE 15

Assessment, learning objectives and approach

I didn’t want

  • one size fits all
  • artificial/meaningless jumping through hoops
  • fear of failure and judgement to interfere

I did want you to

  • be able to work on problems you are interested in
  • be able to develop the skills needed for that
  • have more supported unstructured time
  • be assessed on what you can do (not what you can’t do)

Core skills taught with some recipes and lots of suggested practice examples

6 workshops support you in learning how to create the assessed output: a reproducible analysis related to your project, a past ‘project’, or provided ‘projects’

SLIDE 16

Module

Do you need to revise? Access to the latest versions of Stage 1 and Stage 2

  • L01 Introduction to Data Science
  • W01 Developing independence and good practice - Tips
  • W02 Importing data
  • W03 Reproducibility 1
  • W04 Reproducibility 2
  • W05 Tidy data and Tidying data
  • W06 An Introduction to Machine Learning
  • W07 Project work

Weeks 6, 7 and 8, drop-ins

Note: timetable session titles may be incorrect...VLE is correct

SLIDE 17

Assessment and the learning objectives

i.e., Devise reproducible strategies to import, tidy and model data in R

The submission is a zip file of organised files including, at least:

  • the Rmd
  • the knitted output (can be html, pdf or word)
  • the data

An example is available on the VLE.

The Rmd should be well commented and contain everything needed to recreate, and understand the recreation of, the knitted output.

The knitted output should be no more than 1000 words
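
Before zipping, it may help to check that the folder holds the three required kinds of file. The sketch below is not an official checker: the helper `check_submission()`, the folder layout, the file names and the list of data extensions are all assumptions for illustration. The example on the VLE is the reference for how to organise the files.

```r
# Hypothetical helper: does a submission folder contain an Rmd, a knitted
# output and some data? The accepted data extensions are an assumption.
check_submission <- function(dir) {
  files <- list.files(dir, recursive = TRUE)
  ext   <- tolower(tools::file_ext(files))
  c(
    rmd     = any(ext == "rmd"),
    knitted = any(ext %in% c("html", "pdf", "docx")),
    data    = any(ext %in% c("csv", "txt", "xlsx"))
  )
}

# Build a throwaway example folder so the sketch runs anywhere
submission <- file.path(tempdir(), "submission_example")
dir.create(file.path(submission, "data"), recursive = TRUE, showWarnings = FALSE)
invisible(file.create(
  file.path(submission, c("analysis.Rmd", "analysis.html", "data/raw.csv"))
))

# A named logical vector: TRUE for each kind of file that was found
check_submission(submission)
```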

SLIDE 18

Advice

Do you need to revise? Access to the latest versions of Stage 1 and Stage 2

Other sources

  • Google
  • Talk to people
  • R-bloggers
  • Stack Overflow
  • Foundation for Open Access Statistics

Teach others, especially house-keeping

SLIDE 19

Reading

  • #biol58M
  • Genome Res. 2015. 25: 1417-1422
  • Good Enough Practices in Scientific Computing
  • R for Data Science: Garrett Grolemund & Hadley Wickham
