data import in r
play

/ Data Import in R Fraud Detection Course - 2019/2020 Nuno Moniz - PowerPoint PPT Presentation

/ Data Import in R Fraud Detection Course - 2019/2020 Nuno Moniz nuno.moniz@fc.up.pt / Today 1. Data Import in R 1.1. Datasets 1.2. Importing 1.3. Databases 1.4. Hands-on Data Import Fraud Detection Course 2019/2020 - Nuno


  1. /

  2. Data Import in R Fraud Detection Course - 2019/2020 Nuno Moniz nuno.moniz@fc.up.pt /

  3. Today 1. Data Import in R · 1.1. Datasets · 1.2. Importing · 1.3. Databases · 1.4. Hands-on Data Import Fraud Detection Course 2019/2020 - Nuno Moniz /

  4. Data Import in R /

  5. Datasets · The most common setting: data analysis tasks use source data sets sorted in tabular formats · Datasets are bi-dimensional structures (e.g. table) - Rows are observation a certain phenomenon, and - Columns contain information describing each observation · The most common data structure in R to store such tables is the data frame Today: How to import data stored in di�erent formats/infra-structures into an R data frame Fraud Detection Course 2019/2020 - Nuno Moniz /

  6. Internal R Data Sets · R installations include many data sets. Obtain them by doing the following: data() · New packages may add new datasets: data(package = "DMwR") # datasets from a specific package data(package = .packages(all.available = TRUE)) # from all packages · To load them, use the function data data(iris) head(iris,3) ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## 1 5.1 3.5 1.4 0.2 setosa ## 2 4.9 3.0 1.4 0.2 setosa ## 3 4.7 3.2 1.3 0.2 setosa Fraud Detection Course 2019/2020 - Nuno Moniz /

  7. RData File Format · All R objects (including data frames) can be saved in an RData �le: ds <- data.frame(x = rnorm(10), y = rnorm(10)) # generate data frame with random values save(ds, file="dummy.RData") # save the data frame to an RData file · To load the data frame back into R: rm(ds) # remove the variable ds from the R environment ds ## Error in eval(expr, envir, enclos): object 'ds' not found load("dummy.RData") # load the RData file with the data frame head(ds, 3) ## x y ## 1 -1.5090279 0.1893106 ## 2 0.2562916 1.4275355 ## 3 0.1648592 0.1893078 Fraud Detection Course 2019/2020 - Nuno Moniz /

  8. Import Text Files · Text �les: - A common way of storing and sharing data sets, - And one of the easiest also · Examples include: - CSV �les - Tabbed �les - Spaced �les - . . . Fraud Detection Course 2019/2020 - Nuno Moniz /

  9. Importing CSV Files · In CSV �les, the values in each line are separated by commas File "dummy.csv" Import to a data frame: ID, Name, Age ds <- read.csv("dummy.csv", 23424, Ana, 45 header = TRUE) head(ds) 11234, Charles, 23 77654, Susanne, 76 ## X ID Name Age ## 1 1 23424 Ana 45 ## 2 2 11234 Charles 23 ## 3 3 77654 Susanne 76 Fraud Detection Course 2019/2020 - Nuno Moniz /

  10. Importing CSV Files (alternate) · Some countries use the comma as a decimal separator (Portugal does, US doesn't) · An alternate method for importing CSV �les is based on the use of a semi-colon for separating values File "dummy.csv" Import to a data frame: ID, Name, Age ds <- read.csv2("dummy.csv") 23424, Ana, 45 head(ds) 11234, Charles, 23 77654, Susanne, 76 Fraud Detection Course 2019/2020 - Nuno Moniz /

  11. Importing Other Text Files · Files can also be presented as a regular .txt �le File "dummy.csv" ID, Name, Age 23424, Ana, ? 11234, Charles, 23 77654, Susanne, 76 · Import to a data frame: ds <- read.table("dummy.txt", header=TRUE, na.strings="???") · If you want to speci�cy the type of data in each column: ds <- read.table("dummy.txt", header=TRUE, na.strings="???", colClasses=c("integer","character","character")) Fraud Detection Course 2019/2020 - Nuno Moniz /

  12. Summary on Text Formats · Other functions with similar objective - read.table , read.csv , read.csv2 , read.delim , . . . · Relevant Parameters (very important): - sep : indicates the character used as values separator - dec : indicates the character used as decimal separator - header : indicates whether the �rst line contains the header column names - na.strings : indicates the vector of characters interpreted as unknown values · Calling this functions will result in the creation of a data frame · Beware of problem related to character encoding Fraud Detection Course 2019/2020 - Nuno Moniz /

  13. Importing Data from Spreadsheets (Method 1) ODBC connection (Windows) · Communication protocol between databases · Microsoft Excel is able to communicate using this protocol Installation · Install package RODBC Example library(RODBC) fc <- "C:\\Documents and Settings\\xpto\\My Documents\\calc.xls" # made-up cn <- odbcConnectExcel(fc) shs <- sqlTables(cn) dat <- sqlQuery(cn, paste("SELECT * FROM", shs$TABLE_NAME[1])) Fraud Detection Course 2019/2020 - Nuno Moniz /

  14. Importing Data from Spreadsheets (Method 2) Package gdata · This package includes the functions read.xls Installation · Install package gdata · Requires the availability of Perl Example library(gdata) fc <- "c:\\Documents and Settings\\xpto\\My Documents\\calc.xls" dat <- read.xls(fc, sheet=1) # ?read.xls for more examples Fraud Detection Course 2019/2020 - Nuno Moniz /

  15. Importing Data from Spreadsheets (Method 3) Using the clipboard · It's the easiest when importing small data tables Copy... And (sort of) paste ds <- read.table("clipboard", header=TRUE) ds ## ID Name Age ## 1 23424 Ana <NA> ## 2 11234 Charles 23 ## 3 77654 Susanne 76 Fraud Detection Course 2019/2020 - Nuno Moniz /

  16. Importing Data from Spreadsheets (Method 4) Package readxl · This packages includes the function read_excel Example library(readxl) fc <- "c:\\Documents and Settings\\xpto\\My Documents\\calc.xls" dat <- read_excel(fc, sheet=1) Fraud Detection Course 2019/2020 - Nuno Moniz /

  17. Connection to DB's · Take a look at the DBI package · This package provides a front-end interface to DBMS-speci�c drivers Fraud Detection Course 2019/2020 - Nuno Moniz /

  18. Connection to DB's (example) # it's a dummy example - it won't run properly library(DBI) library(RMySQL) drv <- dbDriver("MySQL") # Loading the MySQL driver con <- dbConnect(drv, dbname="transDB", # connecting to the DBMS username="myuser", password="mypasswd", host="localhost") # getting the results of a query as a data frame data <- dbGetQuery(con, "SELECT * FROM clients") dbDisconnect(con) # close connection Fraud Detection Course 2019/2020 - Nuno Moniz /

  19. Connection to DB's (another example) library(DBI) library(RMySQL) drv <- dbDriver("MySQL") # Loading the MySQL driver con <- dbConnect(drv,dbname="transDB", # connecting to the DBMS username="myuser",password="mypasswd", host="localhost") res <- dbSendQuery(con, "SELECT * FROM transactions") while (!dbHasCompleted(res)) { someData <- fetch(res, n = 50) # get the next 50 records on a data frame process(someData) # call some function that handles the current chunk } dbClearResult(res) # clear the results set dbDisconnect(con) # close connection Fraud Detection Course 2019/2020 - Nuno Moniz /

  20. More Data Importation · Databases - R has interfaces to all major DBMS (packages DBI , RMySQL , ROracle , etc.) · Formats from other statistical software - Minitab, S-Plus, SPSS, Stata, SAS, etc. - Packages foreign , Hmisc · For more information, check the R manual on Data Import/Export Fraud Detection Course 2019/2020 - Nuno Moniz /

  21. Hands-on: Importing Data in R · The site UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/machine-learning- databases/audiology/) (click the link) contains a data set of an Audiology problem - Download the dataset audiology.standardized.data and import that data into an R data frame. - Read the information on the web page to �nd out how unknown values are represented and make sure they are properly translated into R nomenclature Fraud Detection Course 2019/2020 - Nuno Moniz /

  22. /

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend