1
Importing data into R
Workshop 2
2
Learning outcomes
By following the slides and applying the techniques to the workshop examples, the successful student will be able to:
- Describe the breadth of data sources
- Devise reproducible strategies to import local and remote data in a variety of formats

A short talk outlining some possibilities, followed by opportunities for you to apply and combine ideas. Facilitated problem solving, rather than detailed tutorials.
3
Outline: four aspects to consider
- Where: stored locally (on your own computer) or remotely (on another computer/server)
- Format: various; structured as XML or JSON, in databases, or may require harvesting
- How: base R functions; access to APIs for many forms of specialised data has been made easier with packages, e.g. Bioconductor
- Result: often dataframes or dataframe-like structures (e.g. tibbles), sometimes specialised data structures
4
Revision: Locally stored: txt, csv or similar files
- Essentially plain text (can be opened in Notepad and make sense)
- Occasionally fixed-width columns; more commonly 'delimited' by a particular character
- Read in with the read.table() methods: read.table(file) is the minimum needed; other arguments have defaults
- Remember that file location matters
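The points above can be tried end to end. A minimal sketch, using a temporary file written on the fly so it runs anywhere (the column names are invented):

```r
# Write a small tab-delimited file to a temporary location,
# then read it back with read.table()
tmp <- tempfile(fileext = ".txt")
writeLines(c("rmsd\tprog", "9.04\tAbstruct", "14.95\tPredicto"), tmp)

# header = TRUE tells read.table() the first line holds column names;
# sep = "\t" says the columns are tab-delimited
mydata <- read.table(tmp, header = TRUE, sep = "\t")
str(mydata)    # 2 obs. of 2 variables

unlink(tmp)    # tidy up the temporary file
```

Because the file path is given explicitly, this sidesteps the "file location matters" problem; in the workshop you would use a relative path such as "../data/structurepred.txt".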
5
Arguments depend on data format
> mydata <- read.table("../data/structurepred.txt")
> str(mydata)
'data.frame': 91 obs. of 3 variables:
 $ V1: Factor w/ 91 levels "0.08","0.353",..: 91 84 31 32 37 18 25 89 88 3 ...
 $ V2: Factor w/ 4 levels "Abstruct","Predicto",..: 3 1 1 1 1 1 1 1 1 1 ...
 $ V3: Factor w/ 31 levels "1","10","11",..: 31 1 12 23 25 26 27 28 29 30 ...

> mydata <- read.table("../data/structurepred.txt", header = T)
> str(mydata)
'data.frame': 90 obs. of 3 variables:
 $ rmsd: num 9.04 14.95 17.73 3.12 11.28 ...
 $ prog: Factor w/ 3 levels "Abstruct","Predicto",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ prot: int 1 2 3 4 5 6 7 8 9 10 ...
The other defaults are appropriate here (incl sep)
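Note why the first call above produced factors everywhere: without header = TRUE the column names were read as data, turning every column into text. Character columns themselves also become factors or not depending on the stringsAsFactors argument (its default changed to FALSE in R 4.0). A sketch on a throwaway file:

```r
# Character columns become factors under stringsAsFactors = TRUE
# (the default before R 4.0); set it explicitly for reproducibility
tmp <- tempfile(fileext = ".txt")
writeLines(c("rmsd prog", "9.04 Abstruct", "14.95 Predicto"), tmp)

as_factor <- read.table(tmp, header = TRUE, stringsAsFactors = TRUE)
as_char   <- read.table(tmp, header = TRUE, stringsAsFactors = FALSE)

class(as_factor$prog)   # "factor"
class(as_char$prog)     # "character"

unlink(tmp)
```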
Revision: Locally stored: txt, csv or similar files
6
Arguments depend on data format

> mydata <- read.table("../data/Icd10Code.csv", header = T)
Error in read.table("../data/Icd10Code.csv", header = T) :
  more columns than column names

See manual: defaults depend on which read. method

Try reading the first line only:
> mydata <- read.table("../data/Icd10Code.csv", header = F, nrows = 1)
> mydata
                V1
1 Code,Description

The default sep is the problem:
> mydata <- read.table("../data/Icd10Code.csv", header = T, sep = ",")
OR
> mydata <- read.csv("../data/Icd10Code.csv")

> str(mydata)
'data.frame': 12131 obs. of 2 variables:
 $ Code       : Factor w/ 12131 levels "A00","A000","A001",..: 1 2 3 4 5 ...
 $ Description: Factor w/ 12079 levels "4-Aminophenol derivatives",..: 1822 1823 1824 1826 11605 ...
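The same diagnose-then-fix routine can be rehearsed on a throwaway file. A sketch (the file contents are invented; count.fields() is a base R helper for checking how many fields each line splits into under a given separator):

```r
# Diagnose a delimiter problem before reading a whole file
tmp <- tempfile(fileext = ".csv")
writeLines(c("Code,Description", "A00,Cholera", "A01,Typhoid"), tmp)

readLines(tmp, n = 1)          # peek at the header line
count.fields(tmp, sep = ",")   # 2 fields per line, so sep = "," fits

mydata <- read.csv(tmp)        # read.csv() presets sep = "," and header = TRUE

unlink(tmp)
```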
7
Locally stored: special format files
- Cannot usually be opened in Notepad (and make sense)
- Often specific to particular software
- Filepaths: no change
- Method/function: may differ
- If you have the software you can export it as comma or tab delimited and use a read.table method
- But it’s much better to do processing in the script: all steps documented and repeatable
- To determine how to read that type of file: Google
- Keep googling
8
Locally stored: special format files

Packages are often the solution, e.g. haven, foreign: read data stored by Minitab, S, SAS, SPSS, Stata, Systat, Weka, dBase, ...

> install.packages("haven")
* Already installed in Biology/Rlibs (on your own PC, do once)

> library(haven)
* once each session

e.g. read_sav:
> mydata <- read_sav("../data/prac9a.sav")
> str(mydata)
Classes 'tbl_df', 'tbl' and 'data.frame': 120 obs. of 6 variables:
 $ terrsize: atomic 0.463 0.446 0.651 0.507 0.879 ...
  ..- attr(*, "label")= chr "Territory size (Ha)"
 $ country :Class 'labelled' atomic [1:120] 1 1 1 1 1 1 1 1 1 1 ...
 ....
9
Files from the Internet
Why read from the web rather than saving the files then reading in normally? Repeatability.
- Especially useful if you need to rerun analyses on regularly updated public data
- Use the same methods as before: just replace the file location with the URL of the data
- You still need to know the data format
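"Replace the file location with the URL" really is the whole change. A sketch: to keep it runnable offline, a file:// URL to a temporary file stands in for the remote address (the column names are invented; for real data you would swap in an http(s):// address like the buoy example on the next slide):

```r
# read.csv() accepts a URL wherever a file path would go
tmp <- tempfile(fileext = ".csv")
writeLines(c("year,value", "2010,1.2", "2011,1.5"), tmp)

# file:// URL standing in for a remote http(s):// address
file_url <- paste0("file://", tmp)
mydata <- read.csv(file_url)

unlink(tmp)
```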
10
Files from the Internet
Data from a buoy (buoy #44025) off the coast of New Jersey at
http://www.ndbc.noaa.gov/view_text_file.php?filename=44025h2011.txt.gz&dir=data/historical/stdmet/
# to make the code more readable we set a variable to the website address of the file:
> file <- "http://www.ndbc.noaa.gov/view_text_file.php?filename=44025h2011.txt.gz&dir=data/historical/stdmet/"

# data format: look on the web or use:
> readLines(file, n = 5)
[1] "#YY MM DD hh mm WDIR WSPD GST WVHT DPD APD MWD PRES ATMP WTMP DEWP VIS TIDE"
[2] "#yr mo dy hr mn degT m/s m/s m sec sec degT hPa degC degC degC mi ft"
[3] "2010 12 31 23 50 222 7.2 8.5 0.75 4.55 3.72 203 1022.2 6.9 6.7 3.5 99.0 99.00"
[4] "2011 01 01 00 50 233 6.0 6.8 0.76 4.76 3.77 196 1022.2 6.7 6.7 3.7 99.0 99.00"
[5] "2011 01 01 01 50 230 5.0 5.9 0.72 4.55 3.85 201 1021.9 6.8 6.7 3.5 99.0 99.00"

> mydata <- read.table(file, header = F, skip = 2)
> str(mydata)
'data.frame': 6358 obs. of 18 variables:
 $ V1 : int 2010 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
 $ V2 : int 12 1 1 1 1 1 1 1 1 1 ...
 ...
 $ V18: num 99 99 99 99 99 99 99 99 99 99 ...
Would still need some ‘tidying’
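Part of that tidying: NOAA buoy files use values like 99 and 999 as missing-value codes, so those columns full of 99s above are really NAs. A sketch, with a small synthetic data frame standing in for the real download (column names taken from the header above):

```r
# A stand-in for two columns of the buoy data, where 99 means "missing"
mydata <- data.frame(WVHT = c(0.75, 99, 0.72),
                     VIS  = c(99, 99, 8.1))

# na.strings in read.table() can do this at import time; after the fact:
mydata[mydata == 99] <- NA

summary(mydata$WVHT)   # the 99 is now counted as an NA, not a wave height
```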
11
Web scraping
What if data are not in a file but on a webpage? One solution is to 'scrape' the data using the package rvest, together with a Chrome extension called SelectorGadget that lets you interactively determine which 'CSS selector' you need to extract the desired components from a page. To see rvest in action we are going to retrieve the results of a Google Scholar search for Calvin Dytham (user TJUyl1gAAAAJ).
D’ya geddit?? ‘harvest’
12
Web scraping with rvest
> install.packages("XML")
> install.packages("rvest")
> library(rvest)
> library(magrittr)
> page <- read_html("https://scholar.google.co.uk/citations?user=TJUyl1gAAAAJ&hl=en")

# Specify the css selector in html_nodes() and extract the text with html_text(),
# then change the string to numeric using as.numeric()
> citations <- page %>% html_nodes("#gsc_a_b .gsc_a_c") %>% html_text() %>% as.numeric()
> years <- page %>% html_nodes("#gsc_a_b .gsc_a_y") %>% html_text() %>% as.numeric()

> citations
 [1] 1228  314  290  265  263  216  200  193  184  180  131  111  110  100   94   87   87   86
[19]   79   76
> years
 [1] 2011 2012 1999 2010 1999 2008 1999 2007 2011 2002 1998 2002 2003 2007 2005 2007 1995 2010
[19] 2009 2009
13
Data from databases
Many packages:
- For relational databases (Oracle, MSSQL, MySQL): RMySQL, RODBC
- For non-relational databases (MongoDB, Hadoop): rmongodb, rhbase
- For particular specialised research fields, packages for import, tidying and analysis: rentrez, Bioconductor
- ropensci: R packages that provide programmatic access to a variety of scientific data, full text of journal articles, and repositories that provide real-time metrics of scholarly impact
14
Resulting data structures
- Dataframes or dataframe-like structures (e.g. tibbles)
- Specialised: for microarray expression data, flow cytometry data, image data, proteomic and transcriptomic data. These usually have an element which is the actual data in a dataframe
> library(EBImage)
> img1 <- readImage(file1)
> str(img1)
Formal class 'Image' [package "EBImage"] with 2 slots
  ..@ .Data    : num [1:768, 1:512] 0.447 0.451 0.463 0.455 0.463 ...
  ..@ colormode: int 0
> display(img1)
15
16
And finally ….
Connection to programming: many programming concepts are typically required for data import, for example input and output streams, pattern matching and loops.

When working with big datasets that take a while to read in, save your workspace (.RData) file and reload that, rather than reading in the data and tidying it each time.

Further reading:
- This R Data Import Tutorial Is Everything You Need
- Importing Data Into R - Part Two
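The save-and-reload tip can also be done per object with saveRDS()/readRDS(), which avoids clobbering the rest of your workspace. A sketch of the caching pattern (the tiny data frame stands in for a slow read-and-tidy step):

```r
# Cache a slow import so reruns skip the read-and-tidy step
cache <- tempfile(fileext = ".rds")

if (file.exists(cache)) {
  mydata <- readRDS(cache)          # fast path: reload the cached object
} else {
  mydata <- data.frame(x = 1:3)     # stands in for a slow read.table() + tidying
  saveRDS(mydata, cache)            # cache it for next time
}

unlink(cache)
```

In a real script the cache would be a fixed file in your project, not a tempfile, so it survives between sessions.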
17
Summary
Data can: be imported from locally or remotely stored files; scraped from webpages; read from local or remote databases; or accessed via APIs.

Tips
- Understand the data format: read documentation, open plain text files, use readLines
- Google import errors
- Experiment and test with toy examples
Data structures: mainly dataframes and tibbles; sometimes specialised structures with metadata (e.g. Bioconductor packages). Read documentation and google a lot.