Importing data into R
  1. Importing data into R: Workshop 3

  2. Objectives
By doing this workshop and carrying out the independent study, the successful student will be able to:
● Describe the breadth of data sources
● Devise reproducible strategies to import local and remote data in a variety of formats
A short talk outlines some of the possibilities, followed by opportunities for you to apply and combine ideas.

  3. Outline
● Where: stored locally (on your own computer) or remotely (on another computer/server)
● Format: varies; data may be structured as XML or JSON, held in databases, or may require harvesting from web pages
● How: base R functions; access to APIs for many forms of specialised data has been made easier with packages, e.g. Bioconductor
● Result: always the same: dataframes or dataframe-like structures (e.g. tibbles)

  4. Locally stored: txt, csv or similar files
● Essentially plain text (can be opened in Notepad and still make sense)
● Occasionally fixed-width columns; more commonly 'delimited' by a particular character
● Read in with the read.table() methods; read.table(file) is the minimum needed
● Remember that file location matters (see the sketch below)
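A minimal sketch of checking a file location before reading (the path ../data/structurepred.txt is taken from the example on the next slide):

# paths are relative to the working directory; check where R is looking
> getwd()
# list the files in a directory to confirm the path is right
> list.files("../data")
# then read with a relative path
> mydata <- read.table("../data/structurepred.txt", header = TRUE)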

  5. Locally stored: txt, csv or similar files

> mydata <- read.table("../data/structurepred.txt")
> str(mydata)
'data.frame': 91 obs. of 3 variables:
 $ V1: Factor w/ 91 levels "0.08","0.353",..: 91 84 31 32 37 18 25 89 88 3 ...
 $ V2: Factor w/ 4 levels "Abstruct","Predicto",..: 3 1 1 1 1 1 1 1 1 1 ...
 $ V3: Factor w/ 31 levels "1","10","11",..: 31 1 12 23 25 26 27 28 29 30 ...

Arguments depend on the data format. With header=T the first line is read as column names:

> mydata <- read.table("../data/structurepred.txt", header=T)
> str(mydata)
'data.frame': 90 obs. of 3 variables:
 $ rmsd: num 9.04 14.95 17.73 3.12 11.28 ...
 $ prog: Factor w/ 3 levels "Abstruct","Predicto",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ prot: int 1 2 3 4 5 6 7 8 9 10 ...

The other defaults are appropriate here (including sep).

  6. Locally stored: txt, csv or similar files

Arguments depend on the data format:

> mydata <- read.table("../data/Icd10Code.csv", header=T)
Error in read.table("../data/Icd10Code.csv", header = T) :
  more columns than column names

Try reading the first line only:

> mydata <- read.table("../data/Icd10Code.csv", header=F, nrows = 1)
> (mydata)
                V1
1 Code,Description

The default sep is the problem:

> mydata <- read.table("../data/Icd10Code.csv", header=T, sep=",")

OR

> mydata <- read.csv("../data/Icd10Code.csv")
> str(mydata)
'data.frame': 12131 obs. of 2 variables:
 $ Code       : Factor w/ 12131 levels "A00","A000","A001",..: 1 2 3 4 5 ...
 $ Description: Factor w/ 12079 levels "4-Aminophenol derivatives",..: 1822 1823 1824 1826 11605 ...

See the manual: the defaults depend on which read. method you use.

  7. Locally stored: special format files
● Cannot usually be opened in Notepad (and make sense)
● Often specific to particular software
● If you have the software, you can export the data as comma- or tab-delimited files and use a read.table method
● But it is much better to do the processing in the script: all steps are then documented and repeatable
● To determine how to read that type of file: Google. Keep googling.

  8. Locally stored: special format files

Packages are often the solution, e.g. haven or foreign ("Read Data Stored by Minitab, S, SAS, SPSS, Stata, Systat, Weka, dBase, ...").

# once only
> install.packages("haven")
# once each session
> library(haven)

e.g. read_sav() for SPSS files:

> mydata <- read_sav("../data/prac9a.sav")
> str(mydata)
Classes 'tbl_df', 'tbl' and 'data.frame': 120 obs. of 6 variables:
 $ terrsize: atomic 0.463 0.446 0.651 0.507 0.879 ...
  ..- attr(*, "label")= chr "Territory size (Ha)"
 $ country :Class 'labelled' atomic [1:120] 1 1 1 1 1 1 1 1 1 1 ...
 ...
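haven imports SPSS value labels as 'labelled' vectors, as the str() output above shows. A minimal sketch of one way to turn such a column into an ordinary R factor, using haven's as_factor():

# convert the labelled country codes to a factor carrying the SPSS labels
> mydata$country <- as_factor(mydata$country)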

  9. Files from the Internet
Why read from the web rather than saving the files and then reading them in normally? Repeatability: this is especially useful if you need to rerun analyses on regularly updated public data.
● Use the same methods as before; you just replace the file location with the URL of the data
● You still need to know the data format

  10. Files from the Internet

Data from a buoy (buoy #44025) off the coast of New Jersey at http://www.ndbc.noaa.gov/view_text_file.php?filename=44025h2011.txt.gz&dir=data/historical/stdmet/

# to make the code more readable we set a variable to the website address of the file:
> file <- "http://www.ndbc.noaa.gov/view_text_file.php?filename=44025h2011.txt.gz&dir=data/historical/stdmet/"
# data format: look on the web or use:
> readLines(file, n = 5)
[1] "#YY MM DD hh mm WDIR WSPD GST WVHT DPD APD MWD PRES ATMP WTMP DEWP VIS TIDE"
[2] "#yr mo dy hr mn degT m/s m/s m sec sec degT hPa degC degC degC mi ft"
[3] "2010 12 31 23 50 222 7.2 8.5 0.75 4.55 3.72 203 1022.2 6.9 6.7 3.5 99.0 99.00"
[4] "2011 01 01 00 50 233 6.0 6.8 0.76 4.76 3.77 196 1022.2 6.7 6.7 3.7 99.0 99.00"
[5] "2011 01 01 01 50 230 5.0 5.9 0.72 4.55 3.85 201 1021.9 6.8 6.7 3.5 99.0 99.00"

> mydata <- read.table(file, header = F, skip = 2)
> str(mydata)
'data.frame': 6358 obs. of 18 variables:
 $ V1 : int 2010 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
 $ V2 : int 12 1 1 1 1 1 1 1 1 1 ...
 ...
 $ V18: num 99 99 99 99 99 99 99 99 99 99 ...

The result would still need some 'tidying'; see the sketch below.
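A minimal tidying sketch, assuming the two '#' header lines shown above: take the first line, strip the leading '#', split it on whitespace and use the pieces as column names.

# build column names from the first header line
> hdr <- readLines(file, n = 1)
> names(mydata) <- strsplit(sub("^#", "", hdr), "\\s+")[[1]]
> str(mydata)  # columns now named YY, MM, DD, hh, mm, WDIR, ...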

  11. Web scraping
'Harvest', d'ya geddit??
What if the data are not in a file but on a webpage? One solution is to 'scrape' the data using the package rvest together with SelectorGadget, an extension to Chrome that lets you interactively determine which 'css selector' you need to extract the desired components from a page.
To see rvest in action we are going to retrieve the results of a Google Scholar search for Calvin Dytham (user TJUyl1gAAAAJ).

  12. Web scraping with rvest

> install.packages("XML")
> install.packages("rvest")
> library(rvest)
> library(magrittr)
> page <- read_html("https://scholar.google.co.uk/citations?user=TJUyl1gAAAAJ&hl=en")
# Specify the css selector in html_nodes() and extract the text with html_text(),
# then change the string to numeric using as.numeric().
> citations <- page %>% html_nodes("#gsc_a_b .gsc_a_c") %>% html_text() %>% as.numeric()
> years <- page %>% html_nodes("#gsc_a_b .gsc_a_y") %>% html_text() %>% as.numeric()
> citations
 [1] 1228  314  290  265  263  216  200  193  184  180  131  111  110  100   94   87   87   86
[19]   79   76
> years
 [1] 2011 2012 1999 2010 1999 2008 1999 2007 2011 2002 1998 2002 2003 2007 2005 2007 1995 2010
[19] 2009 2009
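A short follow-up sketch (not on the original slide): the two scraped vectors line up element by element, so they combine into the usual end product, a dataframe.

> papers <- data.frame(year = years, citations = citations)
> head(papers)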

  13. Data from databases
Many packages:
● For relational databases (Oracle, MSSQL, MySQL): RMySQL, RODBC
● For non-relational databases (MongoDB, Hadoop): rmongodb, rhbase
● For particular specialised research fields, packages for import, tidying and analysis: rentrez, Bioconductor
● ropensci: R packages that provide programmatic access to a variety of scientific data, the full text of journal articles, and repositories that provide real-time metrics of scholarly impact
A sketch of the relational case follows this list.
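A minimal sketch of querying a relational database through the DBI interface with RMySQL; the host, user, database and query here are placeholders, not a real server.

> library(DBI)
> library(RMySQL)
# open a connection (all details here are placeholders)
> con <- dbConnect(MySQL(), host = "db.example.org", user = "student",
                   password = "xxxx", dbname = "survey")
# queries return ordinary dataframes
> mydata <- dbGetQuery(con, "SELECT * FROM measurements")
> dbDisconnect(con)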

  14. And finally ...
● Connection to programming: many programming concepts are typically required for data import, for example input and output streams, pattern matching and loops
● When working with big datasets that take a while to read in, save your workspace (.RData file) and reload that rather than reading in and tidying the data each time (a sketch follows below)
Further reading:
● This R Data Import Tutorial Is Everything You Need
● Importing Data Into R - Part Two
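A minimal sketch of that save-and-reload pattern (the filename is an assumption):

# after the slow import and tidying, save the workspace once
> save.image("imported_data.RData")
# in later sessions, reload instead of re-importing
> load("imported_data.RData")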
