/ Data Import in R Fraud Detection Course - 2019/2020 Nuno Moniz - - PowerPoint PPT Presentation

data import in r
SMART_READER_LITE
LIVE PREVIEW

/ Data Import in R Fraud Detection Course - 2019/2020 Nuno Moniz - - PowerPoint PPT Presentation

/ Data Import in R Fraud Detection Course - 2019/2020 Nuno Moniz nuno.moniz@fc.up.pt / Today 1. Data Import in R 1.1. Datasets 1.2. Importing 1.3. Databases 1.4. Hands-on Data Import Fraud Detection Course 2019/2020 - Nuno


slide-1
SLIDE 1

/

slide-2
SLIDE 2

/

Data Import in R

Fraud Detection Course - 2019/2020

Nuno Moniz nuno.moniz@fc.up.pt

slide-3
SLIDE 3

/

Today

  • 1. Data Import in R

1.1. Datasets 1.2. Importing 1.3. Databases 1.4. Hands-on Data Import · · · ·

Fraud Detection Course 2019/2020 - Nuno Moniz

slide-4
SLIDE 4

/

Data Import in R

slide-5
SLIDE 5

/

Datasets

Today: How to import data stored in dierent formats/infra-structures into an R data frame The most common setting: data analysis tasks use source data sets sorted in tabular formats Datasets are bi-dimensional structures (e.g. table) The most common data structure in R to store such tables is the data frame · · Rows are observation a certain phenomenon, and Columns contain information describing each observation

  • ·

Fraud Detection Course 2019/2020 - Nuno Moniz

slide-6
SLIDE 6

/

Internal R Data Sets

R installations include many data sets. Obtain them by doing the following: ·

data()

New packages may add new datasets: ·

data(package = "DMwR") # datasets from a specific package data(package = .packages(all.available = TRUE)) # from all packages

To load them, use the function data ·

data(iris) head(iris,3) ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## 1 5.1 3.5 1.4 0.2 setosa ## 2 4.9 3.0 1.4 0.2 setosa ## 3 4.7 3.2 1.3 0.2 setosa

Fraud Detection Course 2019/2020 - Nuno Moniz

slide-7
SLIDE 7

/

RData File Format

All R objects (including data frames) can be saved in an RData le: ·

ds <- data.frame(x = rnorm(10), y = rnorm(10)) # generate data frame with random values save(ds, file="dummy.RData") # save the data frame to an RData file

To load the data frame back into R: ·

rm(ds) # remove the variable ds from the R environment ds ## Error in eval(expr, envir, enclos): object 'ds' not found load("dummy.RData") # load the RData file with the data frame head(ds, 3) ## x y ## 1 -1.5090279 0.1893106 ## 2 0.2562916 1.4275355 ## 3 0.1648592 0.1893078

Fraud Detection Course 2019/2020 - Nuno Moniz

slide-8
SLIDE 8

/

Import Text Files

Text les: Examples include: · A common way of storing and sharing data sets, And one of the easiest also

  • ·

CSV les Tabbed les Spaced les

  • . . .

Fraud Detection Course 2019/2020 - Nuno Moniz

slide-9
SLIDE 9

/

File "dummy.csv"

ID, Name, Age 23424, Ana, 45 11234, Charles, 23 77654, Susanne, 76

Import to a data frame:

Importing CSV Files

In CSV les, the values in each line are separated by commas ·

ds <- read.csv("dummy.csv", header = TRUE) head(ds) ## X ID Name Age ## 1 1 23424 Ana 45 ## 2 2 11234 Charles 23 ## 3 3 77654 Susanne 76

Fraud Detection Course 2019/2020 - Nuno Moniz

slide-10
SLIDE 10

/

File "dummy.csv"

ID, Name, Age 23424, Ana, 45 11234, Charles, 23 77654, Susanne, 76

Import to a data frame:

Importing CSV Files (alternate)

Some countries use the comma as a decimal separator (Portugal does, US doesn't) An alternate method for importing CSV les is based on the use of a semi-colon for separating values · ·

ds <- read.csv2("dummy.csv") head(ds)

Fraud Detection Course 2019/2020 - Nuno Moniz

slide-11
SLIDE 11

/

Importing Other Text Files

File "dummy.csv"

ID, Name, Age 23424, Ana, ? 11234, Charles, 23 77654, Susanne, 76

Files can also be presented as a regular .txt le · Import to a data frame: ·

ds <- read.table("dummy.txt", header=TRUE, na.strings="???")

If you want to specicy the type of data in each column: ·

ds <- read.table("dummy.txt", header=TRUE, na.strings="???", colClasses=c("integer","character","character"))

Fraud Detection Course 2019/2020 - Nuno Moniz

slide-12
SLIDE 12

/

Summary on Text Formats

Other functions with similar objective Relevant Parameters (very important): Calling this functions will result in the creation of a data frame Beware of problem related to character encoding ·

read.table, read.csv, read.csv2, read.delim,

  • . . .

· sep : indicates the character used as values separator dec : indicates the character used as decimal separator header : indicates whether the rst line contains the header column names na.strings : indicates the vector of characters interpreted as unknown values

  • ·

·

Fraud Detection Course 2019/2020 - Nuno Moniz

slide-13
SLIDE 13

/

Importing Data from Spreadsheets (Method 1)

ODBC connection (Windows)

Installation Example

Communication protocol between databases Microsoft Excel is able to communicate using this protocol · · Install package RODBC ·

library(RODBC) fc <- "C:\\Documents and Settings\\xpto\\My Documents\\calc.xls" # made-up cn <- odbcConnectExcel(fc) shs <- sqlTables(cn) dat <- sqlQuery(cn, paste("SELECT * FROM", shs$TABLE_NAME[1]))

Fraud Detection Course 2019/2020 - Nuno Moniz

slide-14
SLIDE 14

/

Importing Data from Spreadsheets (Method 2)

Package gdata

Installation Example

This package includes the functions read.xls · Install package gdata Requires the availability of Perl · ·

library(gdata) fc <- "c:\\Documents and Settings\\xpto\\My Documents\\calc.xls" dat <- read.xls(fc, sheet=1) # ?read.xls for more examples

Fraud Detection Course 2019/2020 - Nuno Moniz

slide-15
SLIDE 15

/

Copy... And (sort of) paste

Importing Data from Spreadsheets (Method 3)

Using the clipboard It's the easiest when importing small data tables ·

ds <- read.table("clipboard", header=TRUE) ds ## ID Name Age ## 1 23424 Ana <NA> ## 2 11234 Charles 23 ## 3 77654 Susanne 76

Fraud Detection Course 2019/2020 - Nuno Moniz

slide-16
SLIDE 16

/

Importing Data from Spreadsheets (Method 4)

Package readxl

Example

This packages includes the function read_excel ·

library(readxl) fc <- "c:\\Documents and Settings\\xpto\\My Documents\\calc.xls" dat <- read_excel(fc, sheet=1)

Fraud Detection Course 2019/2020 - Nuno Moniz

slide-17
SLIDE 17

/

Connection to DB's

Take a look at the DBI package This package provides a front-end interface to DBMS-specic drivers · ·

Fraud Detection Course 2019/2020 - Nuno Moniz

slide-18
SLIDE 18

/

Connection to DB's (example)

# it's a dummy example - it won't run properly library(DBI) library(RMySQL) drv <- dbDriver("MySQL") # Loading the MySQL driver con <- dbConnect(drv, dbname="transDB", # connecting to the DBMS username="myuser", password="mypasswd", host="localhost") # getting the results of a query as a data frame data <- dbGetQuery(con, "SELECT * FROM clients") dbDisconnect(con) # close connection

Fraud Detection Course 2019/2020 - Nuno Moniz

slide-19
SLIDE 19

/

Connection to DB's (another example)

library(DBI) library(RMySQL) drv <- dbDriver("MySQL") # Loading the MySQL driver con <- dbConnect(drv,dbname="transDB", # connecting to the DBMS username="myuser",password="mypasswd", host="localhost") res <- dbSendQuery(con, "SELECT * FROM transactions") while (!dbHasCompleted(res)) { someData <- fetch(res, n = 50) # get the next 50 records on a data frame process(someData) # call some function that handles the current chunk } dbClearResult(res) # clear the results set dbDisconnect(con) # close connection

Fraud Detection Course 2019/2020 - Nuno Moniz

slide-20
SLIDE 20

/

More Data Importation

Databases Formats from other statistical software For more information, check the R manual on Data Import/Export · R has interfaces to all major DBMS (packages DBI, RMySQL, ROracle, etc.)

  • ·

Minitab, S-Plus, SPSS, Stata, SAS, etc. Packages foreign, Hmisc

  • ·

Fraud Detection Course 2019/2020 - Nuno Moniz

slide-21
SLIDE 21

/

Hands-on: Importing Data in R

The site UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/machine-learning- databases/audiology/) (click the link) contains a data set of an Audiology problem · Download the dataset audiology.standardized.data and import that data into an R data frame. Read the information on the web page to nd out how unknown values are represented and make sure they are properly translated into R nomenclature

  • Fraud Detection Course 2019/2020 - Nuno Moniz
slide-22
SLIDE 22

/