1
Importing data into R
Workshop 2
2
Learning outcomes
By following the slides and applying the techniques to the workshop examples, the successful student will be able to:
- Describe the breadth of data sources
- Devise reproducible strategies to import local and remote data in a variety of formats

A short talk outlining some possibilities, followed by opportunities for you to apply and combine ideas. Facilitated problem solving, rather than detailed tutorials.
3
Outline: four aspects to consider
- Where: stored locally (on your own computer) or remotely (on another computer/server)
- Format: various; structured as XML or JSON, in databases, or may require harvesting
- How: base R functions; access to APIs for many forms of specialised data has been made easier with packages, e.g. Bioconductor
- Result: often dataframes or dataframe-like structures (e.g. tibbles), sometimes specialised data structures
4
Revision: Locally stored: txt, csv or similar files
- Essentially plain text (can be opened in Notepad and make sense)
- Occasionally fixed-width columns; more commonly 'delimited' by a particular character
- Read in with the read.table() methods: read.table(file) is the minimum needed; other arguments have defaults
- Remember that file location matters
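The points above can be tried end to end. A minimal sketch, using a temporary file written on the fly so it runs anywhere (the column names are invented):

```r
# Write a small tab-delimited file to a temporary location,
# then read it back with read.table()
tmp <- tempfile(fileext = ".txt")
writeLines(c("rmsd\tprog", "9.04\tAbstruct", "14.95\tPredicto"), tmp)

# header = TRUE tells read.table() the first line holds column names;
# sep = "\t" says the columns are tab-delimited
mydata <- read.table(tmp, header = TRUE, sep = "\t")
str(mydata)    # 2 obs. of 2 variables

unlink(tmp)    # tidy up the temporary file
```

Because the file path is given explicitly, this sidesteps the "file location matters" problem; in the workshop you would use a relative path such as "../data/structurepred.txt".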
5
Arguments depend on data format
> mydata <- read.table("../data/structurepred.txt")
> str(mydata)
'data.frame': 91 obs. of 3 variables:
 $ V1: Factor w/ 91 levels "0.08","0.353",..: 91 84 31 32 37 18 25 89 88 3 ...
 $ V2: Factor w/ 4 levels "Abstruct","Predicto",..: 3 1 1 1 1 1 1 1 1 1 ...
 $ V3: Factor w/ 31 levels "1","10","11",..: 31 1 12 23 25 26 27 28 29 30 ...

> mydata <- read.table("../data/structurepred.txt", header = T)
> str(mydata)
'data.frame': 90 obs. of 3 variables:
 $ rmsd: num 9.04 14.95 17.73 3.12 11.28 ...
 $ prog: Factor w/ 3 levels "Abstruct","Predicto",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ prot: int 1 2 3 4 5 6 7 8 9 10 ...
The other defaults are appropriate here (incl sep)
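Note why the first call above produced factors everywhere: without header = TRUE the column names were read as data, turning every column into text. Character columns themselves also become factors or not depending on the stringsAsFactors argument (its default changed to FALSE in R 4.0). A sketch on a throwaway file:

```r
# Character columns become factors under stringsAsFactors = TRUE
# (the default before R 4.0); set it explicitly for reproducibility
tmp <- tempfile(fileext = ".txt")
writeLines(c("rmsd prog", "9.04 Abstruct", "14.95 Predicto"), tmp)

as_factor <- read.table(tmp, header = TRUE, stringsAsFactors = TRUE)
as_char   <- read.table(tmp, header = TRUE, stringsAsFactors = FALSE)

class(as_factor$prog)   # "factor"
class(as_char$prog)     # "character"

unlink(tmp)
```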
Revision: Locally stored: txt, csv or similar files
6
Arguments depend on data format

> mydata <- read.table("../data/Icd10Code.csv", header = T)
Error in read.table("../data/Icd10Code.csv", header = T) :
  more columns than column names

See manual: defaults depend on which read. method

Try reading the first line only:
> mydata <- read.table("../data/Icd10Code.csv", header = F, nrows = 1)
> mydata
                V1
1 Code,Description

The default sep is the problem:
> mydata <- read.table("../data/Icd10Code.csv", header = T, sep = ",")
OR
> mydata <- read.csv("../data/Icd10Code.csv")

> str(mydata)
'data.frame': 12131 obs. of 2 variables:
 $ Code       : Factor w/ 12131 levels "A00","A000","A001",..: 1 2 3 4 5 ...
 $ Description: Factor w/ 12079 levels "4-Aminophenol derivatives",..: 1822 1823 1824 1826 11605 ...
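The same diagnose-then-fix routine can be rehearsed on a throwaway file. A sketch (the file contents are invented; count.fields() is a base R helper for checking how many fields each line splits into under a given separator):

```r
# Diagnose a delimiter problem before reading a whole file
tmp <- tempfile(fileext = ".csv")
writeLines(c("Code,Description", "A00,Cholera", "A01,Typhoid"), tmp)

readLines(tmp, n = 1)          # peek at the header line
count.fields(tmp, sep = ",")   # 2 fields per line, so sep = "," fits

mydata <- read.csv(tmp)        # read.csv() presets sep = "," and header = TRUE

unlink(tmp)
```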
7
Locally stored: special format files
- Cannot usually be opened in Notepad (and make sense)
- Often specific to particular software
- Filepaths: no change
- Method/function: may differ
- If you have the software you can export it as comma or tab delimited and use a read.table method
- But it’s much better to do processing in the script: all steps documented and repeatable
- To determine how to read that type of file: Google
- Keep googling
8
Locally stored: special format files

Packages are often the solution, e.g. haven, foreign: read data stored by Minitab, S, SAS, SPSS, Stata, Systat, Weka, dBase, ...

> install.packages("haven")
* Already installed in Biology/Rlibs (on your own PC, do once)

> library(haven)
* once each session

e.g. read_sav:
> mydata <- read_sav("../data/prac9a.sav")
> str(mydata)
Classes 'tbl_df', 'tbl' and 'data.frame': 120 obs. of 6 variables:
 $ terrsize: atomic 0.463 0.446 0.651 0.507 0.879 ...
  ..- attr(*, "label")= chr "Territory size (Ha)"
 $ country :Class 'labelled' atomic [1:120] 1 1 1 1 1 1 1 1 1 1 ...
 ....
9
Files from the Internet
Why read from the web rather than saving the files then reading in normally? Repeatability.
- Especially useful if you need to rerun analyses on regularly updated public data
- Use the same methods as before: just replace the file location with the URL of the data
- You still need to know the data format
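"Replace the file location with the URL" really is the whole change. A sketch: to keep it runnable offline, a file:// URL to a temporary file stands in for the remote address (the column names are invented; for real data you would swap in an http(s):// address like the buoy example on the next slide):

```r
# read.csv() accepts a URL wherever a file path would go
tmp <- tempfile(fileext = ".csv")
writeLines(c("year,value", "2010,1.2", "2011,1.5"), tmp)

# file:// URL standing in for a remote http(s):// address
file_url <- paste0("file://", tmp)
mydata <- read.csv(file_url)

unlink(tmp)
```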
10
Files from the Internet
Data from a buoy (buoy #44025) off the coast of New Jersey at
http://www.ndbc.noaa.gov/view_text_file.php?filename=44025h2011.txt.gz&dir=data/historical/stdmet/
# to make the code more readable we set a variable to the website address of the file:
> file <- "http://www.ndbc.noaa.gov/view_text_file.php?filename=44025h2011.txt.gz&dir=data/historical/stdmet/"

# data format: look on the web or use:
> readLines(file, n = 5)
[1] "#YY MM DD hh mm WDIR WSPD GST WVHT DPD APD MWD PRES ATMP WTMP DEWP VIS TIDE"
[2] "#yr mo dy hr mn degT m/s m/s m sec sec degT hPa degC degC degC mi ft"
[3] "2010 12 31 23 50 222 7.2 8.5 0.75 4.55 3.72 203 1022.2 6.9 6.7 3.5 99.0 99.00"
[4] "2011 01 01 00 50 233 6.0 6.8 0.76 4.76 3.77 196 1022.2 6.7 6.7 3.7 99.0 99.00"
[5] "2011 01 01 01 50 230 5.0 5.9 0.72 4.55 3.85 201 1021.9 6.8 6.7 3.5 99.0 99.00"

> mydata <- read.table(file, header = F, skip = 2)
> str(mydata)
'data.frame': 6358 obs. of 18 variables:
 $ V1 : int 2010 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
 $ V2 : int 12 1 1 1 1 1 1 1 1 1 ...
 ...
 $ V18: num 99 99 99 99 99 99 99 99 99 99 ...
Would still need some ‘tidying’
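Part of that tidying: NOAA buoy files use values like 99 and 999 as missing-value codes, so those columns full of 99s above are really NAs. A sketch, with a small synthetic data frame standing in for the real download (column names taken from the header above):

```r
# A stand-in for two columns of the buoy data, where 99 means "missing"
mydata <- data.frame(WVHT = c(0.75, 99, 0.72),
                     VIS  = c(99, 99, 8.1))

# na.strings in read.table() can do this at import time; after the fact:
mydata[mydata == 99] <- NA

summary(mydata$WVHT)   # the 99 is now counted as an NA, not a wave height
```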
11
Web scraping
What if data are not in a file but on a webpage? One solution is to 'scrape' the data using the package rvest, together with a Chrome extension called SelectorGadget that lets you interactively determine which 'CSS selector' you need to extract the desired components from a page. To see rvest in action we are going to retrieve the results of a Google Scholar search for Calvin Dytham (user TJUyl1gAAAAJ).
D’ya geddit?? ‘harvest’
12
Web scraping with rvest
> install.packages("XML")
> install.packages("rvest")
> library(rvest)
> library(magrittr)
> page <- read_html("https://scholar.google.co.uk/citations?user=TJUyl1gAAAAJ&hl=en")

# Specify the css selector in html_nodes() and extract the text with html_text(),
# then change the string to numeric using as.numeric()
> citations <- page %>% html_nodes("#gsc_a_b .gsc_a_c") %>% html_text() %>% as.numeric()
> years <- page %>% html_nodes("#gsc_a_b .gsc_a_y") %>% html_text() %>% as.numeric()

> citations
 [1] 1228  314  290  265  263  216  200  193  184  180  131  111  110  100   94   87   87   86
[19]   79   76
> years
 [1] 2011 2012 1999 2010 1999 2008 1999 2007 2011 2002 1998 2002 2003 2007 2005 2007 1995 2010
[19] 2009 2009
13
Data from databases
Many packages:
- For relational databases (Oracle, MSSQL, MySQL): RMySQL, RODBC
- For non-relational databases (MongoDB, Hadoop): rmongodb, rhbase
- For particular specialised research fields, packages for import, tidying and analysis: rentrez, Bioconductor
- ropensci: R packages that provide programmatic access to a variety of scientific data, full text of journal articles, and repositories that provide real-time metrics of scholarly impact
14
Resulting data structures
- Dataframes or dataframe-like structures (e.g. tibbles)
- Specialised: for microarray expression data, flow cytometry data, image data, proteomic and transcriptomic data. These usually have an element which is the actual data in a dataframe
> library(EBImage)
> img1 <- readImage(file1)
> str(img1)
Formal class 'Image' [package "EBImage"] with 2 slots
  ..@ .Data    : num [1:768, 1:512] 0.447 0.451 0.463 0.455 0.463 ...
  ..@ colormode: int 0
> display(img1)
15
16
And finally ….
Connection to programming: many programming concepts are typically required for data import, for example input and output streams, pattern matching and loops.

When working with big datasets that take a while to read in, save your workspace (.RData) file and reload that, rather than reading in the data and tidying it each time.

Further reading:
- This R Data Import Tutorial Is Everything You Need
- Importing Data Into R - Part Two
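The save-and-reload tip can also be done per object with saveRDS()/readRDS(), which avoids clobbering the rest of your workspace. A sketch of the caching pattern (the tiny data frame stands in for a slow read-and-tidy step):

```r
# Cache a slow import so reruns skip the read-and-tidy step
cache <- tempfile(fileext = ".rds")

if (file.exists(cache)) {
  mydata <- readRDS(cache)          # fast path: reload the cached object
} else {
  mydata <- data.frame(x = 1:3)     # stands in for a slow read.table() + tidying
  saveRDS(mydata, cache)            # cache it for next time
}

unlink(cache)
```

In a real script the cache would be a fixed file in your project, not a tempfile, so it survives between sessions.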
17
Summary
Data can: be imported from locally or remotely stored files; scraped from webpages; read from local or remote databases; or accessed via APIs.

Tips
- Understand the data format: read documentation, open plain text files, use readLines
- Google import errors
- Experiment and test with toy examples
Data structures: mainly dataframes and tibbles; sometimes specialised structures with metadata (e.g. Bioconductor packages). Read documentation and google a lot.