arato@biconsulting.hu rstats.budapestbi.hu R and Big Data Master - - PowerPoint PPT Presentation

▶

Sep 13, 2022 17 likes •169 views

The Big Connection - R and Big Data Bence Arat arato@biconsulting.hu rstats.budapestbi.hu R and Big Data Master Code Code Code Data Data Data Data Data Data Server Server Server 1 Server 1 Server 2 Server 2 3 3 HDFS Node

SLIDE 1

The Big Connection - R and Big Data Bence Arató arato@biconsulting.hu rstats.budapestbi.hu

SLIDE 2

R and Big Data Server 1 Master Data Server 2 Data Server 3 Data Server 1 Data Server 2 Data Server 3 Data Code Code Code

HDFS Node HDFS Node HDFS Node

SLIDE 3

SQL on Big Data

⚫ SQL on Hadoop – why?

⚫ Familiar interface for most users ⚫ BI tools (like Tableau, Power BI etc) also

uses SQL to connect

⚫ Many different engines

⚫ Hive, Impala, Drill, Presto, … ⚫ Most offers ODBC/JDBC driver, usable

from R

SLIDE 4

Hadoop Tracker

blogs.gartner.com/merv-adrian/2017/12/29/december-2017-tracker-wheres-hadoop

Hadoop distributions may have a preferred engine (for example Hortonworks -> Hive, Cloudera -> Impala)

SLIDE 5

Using Impala from R

⚫ Steps for trying Impala-based access

⚫ Download the Cloudera QuickStart VM ⚫ Download the Impala ODBC driver

5

SLIDE 6

ODBC R package

6

⚫ The ODBC R Package

⚫ DBI interface for ODBC ⚫ Maintained by Rstudio

SLIDE 7

ODBC R package

7 library(odbc) library(DBI) drv <- odbc::odbc() con <- dbConnect(drv, driver = "Cloudera ODBC Driver for Impala", host = "localhost", port = 21050, database = "default", uid = "", pwd = "" ) #list available tables dbListTables(con) dbListTables(con, table_name = "%port%")

SLIDE 8

ODBC R package

8 # list fields dbListFields(con, "airports") # Load all data from SQL into a local data.frame df_airports <- dbReadTable(con, "airports") str(df_airports) # Read data using an SQL query query_results= dbGetQuery(con, "select * from airports where faa='SFO'") head(query_results) query_results= dbGetQuery(con, "select * from airports where name like '%London%'") head(query_results)

SLIDE 9

Implyr package

9

⚫ The implyr package

⚫ dplyr SQL backend for Impala ⚫ Developed by Cloudera, Ian Cook ⚫ Uses the ODBC connector for data access

SLIDE 10

Implyr

10 # packages install.packages("implyr") library(odbc) library(implyr) library(dplyr) drv <- odbc::odbc() impala <- src_impala( drv = drv, driver = "Cloudera ODBC Driver for Impala", host = "host", port = 21050, database = "default", uid = "username", pwd = "password" ) # list available tables src_tbls(impala)

SLIDE 11

Implyr

11 # create airports reference airports_tbl <- tbl(impala, "airports") airports_tbl # Running SQL - refresh Impala metadata dbExecute(impala, "refresh airports") # Query data using SQL airport_cnt <- dbGetQuery( impala, "select count() from airports") airport_cnt airport_sfo <- dbGetQuery( impala, "select from airports where faa='SFO'") head(airport_sfo) # same using dplyr airports_tbl %>% filter( FAA =="SFO")

SLIDE 12

Spark

12

SLIDE 13

13

SLIDE 14

14

⚫ R access for Spark (examples)

⚫ SparkR (Databricks) ⚫ RevoScaler (Microsoft) ⚫ sparklyr (Rstudio)

SLIDE 15

The Big Connection - R and Big Data Bence Arató arato@biconsulting.hu rstats.budapestbi.hu

R and Big Data Server 1 Master Data Server 2 Data Server 3 Data Server 1 Data Server 2 Data Server 3 Data Code Code Code

HDFS Node HDFS Node HDFS Node

SQL on Big Data

⚫ SQL on Hadoop – why?

⚫ Familiar interface for most users ⚫ BI tools (like Tableau, Power BI etc) also

uses SQL to connect

⚫ Many different engines

⚫ Hive, Impala, Drill, Presto, … ⚫ Most offers ODBC/JDBC driver, usable

from R

Hadoop Tracker

blogs.gartner.com/merv-adrian/2017/12/29/december-2017-tracker-wheres-hadoop

Hadoop distributions may have a preferred engine (for example Hortonworks -> Hive, Cloudera -> Impala)

Using Impala from R

⚫ Steps for trying Impala-based access

⚫ Download the Cloudera QuickStart VM ⚫ Download the Impala ODBC driver

5

ODBC R package

6

⚫ The ODBC R Package

⚫ DBI interface for ODBC ⚫ Maintained by Rstudio

ODBC R package

7

library(odbc) library(DBI) drv <- odbc::odbc() con <- dbConnect(drv, driver = "Cloudera ODBC Driver for Impala", host = "localhost", port = 21050, database = "default", uid = "", pwd = "" ) #list available tables dbListTables(con) dbListTables(con, table_name = "%port%")

ODBC R package

8

Implyr package

9

⚫ The implyr package

⚫ dplyr SQL backend for Impala ⚫ Developed by Cloudera, Ian Cook ⚫ Uses the ODBC connector for data access

Implyr

10

Implyr

11

Spark

12

13

14

⚫ R access for Spark (examples)

⚫ SparkR (Databricks) ⚫ RevoScaler (Microsoft) ⚫ sparklyr (Rstudio)

15

www.rstudio.com/resources/cheatsheets