/
/ Data Import in R Fraud Detection Course - 2019/2020 Nuno Moniz - - PowerPoint PPT Presentation
/ Data Import in R Fraud Detection Course - 2019/2020 Nuno Moniz - - PowerPoint PPT Presentation
/ Data Import in R Fraud Detection Course - 2019/2020 Nuno Moniz nuno.moniz@fc.up.pt / Today 1. Data Import in R 1.1. Datasets 1.2. Importing 1.3. Databases 1.4. Hands-on Data Import Fraud Detection Course 2019/2020 - Nuno
/
Data Import in R
Fraud Detection Course - 2019/2020
Nuno Moniz nuno.moniz@fc.up.pt
/
Today
- 1. Data Import in R
1.1. Datasets 1.2. Importing 1.3. Databases 1.4. Hands-on Data Import · · · ·
Fraud Detection Course 2019/2020 - Nuno Moniz
/
Data Import in R
/
Datasets
Today: How to import data stored in dierent formats/infra-structures into an R data frame The most common setting: data analysis tasks use source data sets sorted in tabular formats Datasets are bi-dimensional structures (e.g. table) The most common data structure in R to store such tables is the data frame · · Rows are observation a certain phenomenon, and Columns contain information describing each observation
- ·
Fraud Detection Course 2019/2020 - Nuno Moniz
/
Internal R Data Sets
R installations include many data sets. Obtain them by doing the following: ·
data()
New packages may add new datasets: ·
data(package = "DMwR") # datasets from a specific package data(package = .packages(all.available = TRUE)) # from all packages
To load them, use the function data ·
data(iris) head(iris,3) ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## 1 5.1 3.5 1.4 0.2 setosa ## 2 4.9 3.0 1.4 0.2 setosa ## 3 4.7 3.2 1.3 0.2 setosa
Fraud Detection Course 2019/2020 - Nuno Moniz
/
RData File Format
All R objects (including data frames) can be saved in an RData le: ·
ds <- data.frame(x = rnorm(10), y = rnorm(10)) # generate data frame with random values save(ds, file="dummy.RData") # save the data frame to an RData file
To load the data frame back into R: ·
rm(ds) # remove the variable ds from the R environment ds ## Error in eval(expr, envir, enclos): object 'ds' not found load("dummy.RData") # load the RData file with the data frame head(ds, 3) ## x y ## 1 -1.5090279 0.1893106 ## 2 0.2562916 1.4275355 ## 3 0.1648592 0.1893078
Fraud Detection Course 2019/2020 - Nuno Moniz
/
Import Text Files
Text les: Examples include: · A common way of storing and sharing data sets, And one of the easiest also
- ·
CSV les Tabbed les Spaced les
- . . .
Fraud Detection Course 2019/2020 - Nuno Moniz
/
File "dummy.csv"
ID, Name, Age 23424, Ana, 45 11234, Charles, 23 77654, Susanne, 76
Import to a data frame:
Importing CSV Files
In CSV les, the values in each line are separated by commas ·
ds <- read.csv("dummy.csv", header = TRUE) head(ds) ## X ID Name Age ## 1 1 23424 Ana 45 ## 2 2 11234 Charles 23 ## 3 3 77654 Susanne 76
Fraud Detection Course 2019/2020 - Nuno Moniz
/
File "dummy.csv"
ID, Name, Age 23424, Ana, 45 11234, Charles, 23 77654, Susanne, 76
Import to a data frame:
Importing CSV Files (alternate)
Some countries use the comma as a decimal separator (Portugal does, US doesn't) An alternate method for importing CSV les is based on the use of a semi-colon for separating values · ·
ds <- read.csv2("dummy.csv") head(ds)
Fraud Detection Course 2019/2020 - Nuno Moniz
/
Importing Other Text Files
File "dummy.csv"
ID, Name, Age 23424, Ana, ? 11234, Charles, 23 77654, Susanne, 76
Files can also be presented as a regular .txt le · Import to a data frame: ·
ds <- read.table("dummy.txt", header=TRUE, na.strings="???")
If you want to specicy the type of data in each column: ·
ds <- read.table("dummy.txt", header=TRUE, na.strings="???", colClasses=c("integer","character","character"))
Fraud Detection Course 2019/2020 - Nuno Moniz
/
Summary on Text Formats
Other functions with similar objective Relevant Parameters (very important): Calling this functions will result in the creation of a data frame Beware of problem related to character encoding ·
read.table, read.csv, read.csv2, read.delim,
- . . .
· sep : indicates the character used as values separator dec : indicates the character used as decimal separator header : indicates whether the rst line contains the header column names na.strings : indicates the vector of characters interpreted as unknown values
- ·
·
Fraud Detection Course 2019/2020 - Nuno Moniz
/
Importing Data from Spreadsheets (Method 1)
ODBC connection (Windows)
Installation Example
Communication protocol between databases Microsoft Excel is able to communicate using this protocol · · Install package RODBC ·
library(RODBC) fc <- "C:\\Documents and Settings\\xpto\\My Documents\\calc.xls" # made-up cn <- odbcConnectExcel(fc) shs <- sqlTables(cn) dat <- sqlQuery(cn, paste("SELECT * FROM", shs$TABLE_NAME[1]))
Fraud Detection Course 2019/2020 - Nuno Moniz
/
Importing Data from Spreadsheets (Method 2)
Package gdata
Installation Example
This package includes the functions read.xls · Install package gdata Requires the availability of Perl · ·
library(gdata) fc <- "c:\\Documents and Settings\\xpto\\My Documents\\calc.xls" dat <- read.xls(fc, sheet=1) # ?read.xls for more examples
Fraud Detection Course 2019/2020 - Nuno Moniz
/
Copy... And (sort of) paste
Importing Data from Spreadsheets (Method 3)
Using the clipboard It's the easiest when importing small data tables ·
ds <- read.table("clipboard", header=TRUE) ds ## ID Name Age ## 1 23424 Ana <NA> ## 2 11234 Charles 23 ## 3 77654 Susanne 76
Fraud Detection Course 2019/2020 - Nuno Moniz
/
Importing Data from Spreadsheets (Method 4)
Package readxl
Example
This packages includes the function read_excel ·
library(readxl) fc <- "c:\\Documents and Settings\\xpto\\My Documents\\calc.xls" dat <- read_excel(fc, sheet=1)
Fraud Detection Course 2019/2020 - Nuno Moniz
/
Connection to DB's
Take a look at the DBI package This package provides a front-end interface to DBMS-specic drivers · ·
Fraud Detection Course 2019/2020 - Nuno Moniz
/
Connection to DB's (example)
# it's a dummy example - it won't run properly library(DBI) library(RMySQL) drv <- dbDriver("MySQL") # Loading the MySQL driver con <- dbConnect(drv, dbname="transDB", # connecting to the DBMS username="myuser", password="mypasswd", host="localhost") # getting the results of a query as a data frame data <- dbGetQuery(con, "SELECT * FROM clients") dbDisconnect(con) # close connection
Fraud Detection Course 2019/2020 - Nuno Moniz
/
Connection to DB's (another example)
library(DBI) library(RMySQL) drv <- dbDriver("MySQL") # Loading the MySQL driver con <- dbConnect(drv,dbname="transDB", # connecting to the DBMS username="myuser",password="mypasswd", host="localhost") res <- dbSendQuery(con, "SELECT * FROM transactions") while (!dbHasCompleted(res)) { someData <- fetch(res, n = 50) # get the next 50 records on a data frame process(someData) # call some function that handles the current chunk } dbClearResult(res) # clear the results set dbDisconnect(con) # close connection
Fraud Detection Course 2019/2020 - Nuno Moniz
/
More Data Importation
Databases Formats from other statistical software For more information, check the R manual on Data Import/Export · R has interfaces to all major DBMS (packages DBI, RMySQL, ROracle, etc.)
- ·
Minitab, S-Plus, SPSS, Stata, SAS, etc. Packages foreign, Hmisc
- ·
Fraud Detection Course 2019/2020 - Nuno Moniz
/
Hands-on: Importing Data in R
The site UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/machine-learning- databases/audiology/) (click the link) contains a data set of an Audiology problem · Download the dataset audiology.standardized.data and import that data into an R data frame. Read the information on the web page to nd out how unknown values are represented and make sure they are properly translated into R nomenclature
- Fraud Detection Course 2019/2020 - Nuno Moniz
/