arato@biconsulting.hu rstats.budapestbi.hu R and Big Data Master - - PowerPoint PPT Presentation

arato biconsulting hu
SMART_READER_LITE
LIVE PREVIEW

arato@biconsulting.hu rstats.budapestbi.hu R and Big Data Master - - PowerPoint PPT Presentation

The Big Connection - R and Big Data Bence Arat arato@biconsulting.hu rstats.budapestbi.hu R and Big Data Master Code Code Code Data Data Data Data Data Data Server Server Server 1 Server 1 Server 2 Server 2 3 3 HDFS Node


slide-1
SLIDE 1

The Big Connection - R and Big Data Bence Arató arato@biconsulting.hu rstats.budapestbi.hu

slide-2
SLIDE 2

R and Big Data Server 1 Master Data Server 2 Data Server 3 Data Server 1 Data Server 2 Data Server 3 Data Code Code Code

HDFS Node HDFS Node HDFS Node

slide-3
SLIDE 3

SQL on Big Data

⚫ SQL on Hadoop – why?

⚫ Familiar interface for most users ⚫ BI tools (like Tableau, Power BI etc) also

uses SQL to connect

⚫ Many different engines

⚫ Hive, Impala, Drill, Presto, … ⚫ Most offers ODBC/JDBC driver, usable

from R

slide-4
SLIDE 4

Hadoop Tracker

blogs.gartner.com/merv-adrian/2017/12/29/december-2017-tracker-wheres-hadoop

Hadoop distributions may have a preferred engine (for example Hortonworks -> Hive, Cloudera -> Impala)

slide-5
SLIDE 5

Using Impala from R

⚫ Steps for trying Impala-based access

⚫ Download the Cloudera QuickStart VM ⚫ Download the Impala ODBC driver

5

slide-6
SLIDE 6

ODBC R package

6

⚫ The ODBC R Package

⚫ DBI interface for ODBC ⚫ Maintained by Rstudio

slide-7
SLIDE 7

ODBC R package

7

library(odbc) library(DBI) drv <- odbc::odbc() con <- dbConnect(drv, driver = "Cloudera ODBC Driver for Impala", host = "localhost", port = 21050, database = "default", uid = "", pwd = "" ) #list available tables dbListTables(con) dbListTables(con, table_name = "%port%")

slide-8
SLIDE 8

ODBC R package

8

# list fields dbListFields(con, "airports") # Load all data from SQL into a local data.frame df_airports <- dbReadTable(con, "airports") str(df_airports) # Read data using an SQL query query_results= dbGetQuery(con, "select * from airports where faa='SFO'") head(query_results) query_results= dbGetQuery(con, "select * from airports where name like '%London%'") head(query_results)

slide-9
SLIDE 9

Implyr package

9

⚫ The implyr package

⚫ dplyr SQL backend for Impala ⚫ Developed by Cloudera, Ian Cook ⚫ Uses the ODBC connector for data access

slide-10
SLIDE 10

Implyr

10

# packages install.packages("implyr") library(odbc) library(implyr) library(dplyr) drv <- odbc::odbc() impala <- src_impala( drv = drv, driver = "Cloudera ODBC Driver for Impala", host = "host", port = 21050, database = "default", uid = "username", pwd = "password" ) # list available tables src_tbls(impala)

slide-11
SLIDE 11

Implyr

11

# create airports reference airports_tbl <- tbl(impala, "airports") airports_tbl # Running SQL - refresh Impala metadata dbExecute(impala, "refresh airports") # Query data using SQL airport_cnt <- dbGetQuery( impala, "select count(*) from airports") airport_cnt airport_sfo <- dbGetQuery( impala, "select * from airports where faa='SFO'") head(airport_sfo) # same using dplyr airports_tbl %>% filter( FAA =="SFO")

slide-12
SLIDE 12

Spark

12

slide-13
SLIDE 13

13

slide-14
SLIDE 14

14

⚫ R access for Spark (examples)

⚫ SparkR (Databricks) ⚫ RevoScaler (Microsoft) ⚫ sparklyr (Rstudio)

slide-15
SLIDE 15

15

www.rstudio.com/resources/cheatsheets