Analytical Data Management with R


  1. Analytical Data Management with R (Hannes Mühleisen)

  2. Overview: 1. Motivations to use a Database 2. System Scenarios 3. R and Databases: State of the Art 4. Future Directions

  3. Database Example •Database that models a digital music store to keep track of artists and albums. •Things we need to store: •Information about artists. •What albums those artists released.

  4. 1960s Solution: Flat Files •Store the database as comma-separated value (CSV) files that we manage in our own code •Use a separate file per "entity" (artist, album) •Our analysis code has to parse the files each time it reads or updates records

  5. Flat File Example
     Artist (name, year, country)
     "Backstreet Boys",1994,"USA"
     "Ice Cube",1992,"USA"
     "Notorious BIG",1989,"USA"
     Album (name, artist, year)
     "Millennium","Backstreet Boys",1999
     "DNA","Backstreet Boys",2019
     "AmeriKKKa's Most Wanted","Ice Cube",1990

  6. Flat File Example: "Get the year Ice Cube went solo"
     Artist (name, year, country)
     "Backstreet Boys",1994,"USA"
     "Ice Cube",1992,"USA"
     "Notorious BIG",1989,"USA"
     df <- read.csv("artists.csv", header = FALSE,
                    col.names = c("name", "year", "country"))
     df[df$name == "Ice Cube", "year"]
     Multiple passes through the entire dataset!

  7. Data Integrity •How do we ensure that the artist is the same for each album entry? •What if someone overwrites the album year with an invalid string? •How do we store that there are multiple artists on an album? •How do we update several tables with all-or-nothing semantics? •How do we keep derived data up-to-date?
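A relational schema answers most of these questions declaratively. Here is a minimal sketch, with the table and column names invented for illustration and RSQLite standing in as the engine; keys and types guard values and references, a join table holds multiple artists per album, and a transaction gives all-or-nothing updates:

     library(DBI)
     con <- dbConnect(RSQLite::SQLite(), ":memory:")
     dbExecute(con, "PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
     dbExecute(con, "CREATE TABLE artist (name TEXT PRIMARY KEY, year INTEGER, country TEXT)")
     dbExecute(con, "CREATE TABLE album (id INTEGER PRIMARY KEY, name TEXT NOT NULL, year INTEGER)")
     # join table: an album can have several artists
     dbExecute(con, "CREATE TABLE album_artist (
         album_id INTEGER REFERENCES album(id),
         artist_name TEXT REFERENCES artist(name))")
     # all-or-nothing semantics across tables
     dbWithTransaction(con, {
       dbExecute(con, "INSERT INTO artist VALUES ('Ice Cube', 1992, 'USA')")
       dbExecute(con, "INSERT INTO album (name, year) VALUES ('AmeriKKKa''s Most Wanted', 1990)")
     })
     dbDisconnect(con)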

  8. Overview: 1. Motivations to use a Database 2. System Scenarios 3. R and Databases: State of the Art 4. Future Directions

  9. System Scenarios •In-Process Database •External Database •User-Defined Functions

  10. External DB •MySQL, PostgreSQL, SQL Server, Oracle, Redshift •Hive, Impala, BigQuery •(Spark) •Transferring large-ish datasets is slooow •Need complex SQL to fetch only the relevant data!
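The usual mitigation is to push the work into the database and transfer only the small result. A sketch, assuming a reachable PostgreSQL server and a hypothetical sales table (host, credentials, and table are placeholders):

     library(DBI)
     con <- dbConnect(RPostgres::Postgres(),
                      host = "db.example.com", dbname = "warehouse",
                      user = "analyst", password = "secret")
     # aggregate server-side: a handful of rows cross the wire
     # instead of the whole sales table
     res <- dbGetQuery(con, "
       SELECT region, SUM(amount) AS revenue
         FROM sales
        WHERE year = 2019
        GROUP BY region")
     dbDisconnect(con)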

  11. Client protocols? SELECT * FROM lineitem_sf10; [M. Raasveldt & H. Mühleisen: Don't Hold My Data Hostage - A Case For Client Protocol Redesign, VLDB 2017]

  12. In-Process DB •Transactional persistent data management •RSQLite, DuckDB, (MonetDBLite) •Faster, but still conversion overhead :/ •ALTREP to the rescue •Later: DuckDB
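In-process means the engine is linked into the R session itself: no server, no socket, a query is just a function call. A minimal RSQLite sketch (the file name is made up); the remaining cost is converting results into a data.frame:

     library(DBI)
     # the database engine runs inside the R process; the file is the database
     con <- dbConnect(RSQLite::SQLite(), "music.sqlite")
     dbWriteTable(con, "artist",
                  data.frame(name = "Ice Cube", year = 1992, country = "USA"))
     dbGetQuery(con, "SELECT year FROM artist WHERE name = 'Ice Cube'")
     dbDisconnect(con)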

  13. User-Defined Functions •PostgreSQL PL/R (Joe Conway) •MonetDB R UDFs •Oracle ~ •Spark ~ •SQL Server ~ •Can also be fast, but still some translation overhead.

  14. Kinds of UDFs for SELECT
      •Filter: SELECT b FROM t WHERE fun(a)
       UDF returns TRUE/FALSE; only rows where it returns TRUE are kept
      •Projection: SELECT fun(a, b) FROM t
       UDF returns a single scalar value that becomes part of the query result
      •Table-Returning: SELECT * FROM fun(42)
       UDF returns a whole intermediate result table

  15. Postgres PL/R
     CREATE OR REPLACE FUNCTION get_emps() RETURNS SETOF emp AS '
       names <- c("Joe","Jim","Jon")
       ages <- c(41,25,35)
       salaries <- c(250000,120000,50000)
       df <- data.frame(name = names, age = ages, salary = salaries)
       return(df)
     ' LANGUAGE 'plr';

  16. Postgres PL/R
     select * from get_emps();
      name | age |  salary
     ------+-----+-----------
      Joe  |  41 | 250000.00
      Jim  |  25 | 120000.00
      Jon  |  35 |  50000.00
     (3 rows)

  17. Overview: 1. Motivations to use a Database 2. System Scenarios 3. R and Databases: State of the Art 4. Future Directions

  18. DB Client APIs [Diagram: the client program issues generic commands and SQL queries to DBI plus a database-specific driver; the driver translates them into driver-specific commands and ships the SQL to the database server over its wire protocol; query results come back through the protocol and surface in R as a data.frame]

  19. Database APIs for R [Diagram: three layers stacked on the database and its driver: 1. Low-Level: DBI 2. Verbs: dplyr/dbplyr 3. Applications: dbplot, tidypredict]

  20. DBI •Basic API, adapts a database-specific API/protocol into a unified R API •Queries are strings, mostly SQL •Results are data.frame objects •dbConnect/dbDisconnect •dbListTables/dbListFields •dbWriteTable •dbGetQuery/dbExecute/dbReadTable
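The whole API in one breath; a sketch using RSQLite as the driver (any DBI backend works the same way, only the dbConnect() call changes):

     library(DBI)
     artists <- data.frame(name = c("Backstreet Boys", "Ice Cube", "Notorious BIG"),
                           year = c(1994, 1992, 1989),
                           country = "USA")
     con <- dbConnect(RSQLite::SQLite(), ":memory:")  # the driver-specific part
     dbWriteTable(con, "artist", artists)             # data.frame -> table
     dbListTables(con)                                # "artist"
     dbListFields(con, "artist")                      # "name" "year" "country"
     dbGetQuery(con, "SELECT name FROM artist WHERE year > 1990")  # result as data.frame
     dbExecute(con, "DELETE FROM artist WHERE year < 1990")        # returns affected row count
     dbDisconnect(con)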

  21. DBI •Lots of implementations: RMySQL, ROracle, RPostgreSQL, RRedshiftSQL, RClickhouse, RGreenplum, RMariaDB, RSQLite, virtuoso, sparklyr •Generic wrappers: RJDBC, odbc •Great if your DB vendor does not have an R-specific driver •Heroic effort by Kirill Müller: DBItest •Result: driver quality varies :/
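With the generic wrappers, only the connect call changes; a hedged odbc sketch (the DSN and credentials are placeholders for whatever your system's ODBC configuration defines):

     library(DBI)
     # same DBI verbs as before, different transport: the odbc package
     # wraps any system-installed ODBC driver behind the DBI API
     con <- dbConnect(odbc::odbc(), dsn = "MyWarehouse",
                      uid = "analyst", pwd = "secret")
     dbGetQuery(con, "SELECT 1 AS ok")
     dbDisconnect(con)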

  22. DBI? •Upside: can talk to databases •Downside: need to construct SQL strings :/ •A higher-level interface might be nice?

  23. dplyr •Data reorganisation thing in the "xyzverse" •dbplyr: extension to work with SQL DBs, Spark, ... •Mostly relational operators •Lazy evaluation, call chaining •Nicer than hand-rolling SQL (mostly)

  24. dplyr "verbs" & pipes %>%
     n %>%
       select(first_name, last_name, race_desc, sex, birth_age) %>%
       filter(as.integer(birth_age) > 66, sex == "MALE", race_desc == "WHITE") %>%
       group_by(first_name) %>%
       summarise(count = n()) %>%
       arrange(desc(count)) %>%
       head(10) -> old_white_men
     print(old_white_men)

  25. SQL translation
     show_query(old_white_men)
     SELECT * FROM (
       SELECT `first_name`, `last_name`, `race_desc`, `sex`, `birth_age`
       FROM `ncvoter`)
     WHERE ((CAST(`birth_age` AS INTEGER) > 66.0)
       AND (`sex` = 'MALE') AND (`race_desc` = 'WHITE'))
     What's the advantage of this approach?
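The advantage: the whole chain compiles into a single query that the database executes, and evaluation is lazy, so nothing is transferred until you ask for it. A sketch, assuming con is a DBI connection to a database that contains the ncvoter table:

     library(dplyr)
     voters <- tbl(con, "ncvoter")   # lazy reference, nothing fetched yet
     old_white_men <- voters %>%
       filter(as.integer(birth_age) > 66, sex == "MALE", race_desc == "WHITE") %>%
       count(first_name, sort = TRUE)
     show_query(old_white_men)       # inspect the SQL dbplyr generated
     collect(old_white_men)          # only now do rows cross the wire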

  26. dplyr? •Easy to use, hides huge query complexity •If things go wrong, debugging is challenging •Cost/benefit of additional layers: weigh carefully!

  27. Overview: 1. Motivations to use a Database 2. System Scenarios 3. R and Databases: State of the Art 4. Future Directions

  28. ALTREP •Luke Tierney, Gabe Becker & Tomas Kalibera •Abstract vectors, ELT()/GET_REGION() methods •Lazy conversion!
     static void monetdb_altrep_init_int(DllInfo *dll) {
       R_altrep_class_t cls = R_make_altinteger_class(/* .. */);
       R_set_altinteger_Elt_method(cls, monetdb_altrep_elt_integer);
       /* .. */
     }
     static int monetdb_altrep_elt_integer(SEXP x, R_xlen_t i) {
       int raw = ((int *) bataddr(x)->theap.base)[i];
       return raw == int_nil ? NA_INTEGER : raw;
     }
     https://svn.r-project.org/R/branches/ALTREP/ALTREP.html#introduction

  29. ALTREP, MonetDBLite & zero-copy
     library("DBI")
     con <- dbConnect(MonetDBLite::MonetDBLite(), "/tmp/dscdemo")
     dbGetQuery(con, "SELECT COUNT(*) FROM onebillion")
     # 1 1e+09
     system.time(a <- dbGetQuery(con, "SELECT i FROM onebillion"))
     #  user  system elapsed
     # 0.001   0.000   0.001
     .Internal(inspect(a$i))
     # @7fe2e66f5710 13 INTSXP g0c0 [NAM(2)] BAT #1352 int -> integer ALTREP-wrapped MonetDB Column

  30. RIP MonetDBLite •First in-process embedded analytical DBMS •On CRAN 2016-2019 •Showed the use case for embedded analytics •Also showed that re-using an existing DBMS is rather difficult

  31. DuckDB •Open-Source RDBMS created by the CWI Database Architectures research group •Purpose-built embedded analytical database •No external server management or configuration •Fast data transfer between R and DuckDB •Source Code: https://github.com/cwida/duckdb

  32. Why DuckDB •DuckDB is optimized for analytical use cases: •Read-mostly workloads •Complex queries that read large parts of the data •Bulk appends/updates •Traditional RDBMS (e.g. PostgreSQL, MySQL, SQLite) target the opposite: •Many small writes and updates •Simple queries that read only individual rows •Tight integration with analytics in R/Python/...

  33. Why DuckDB •Vectorized processing (DuckDB) •Optimized for CPU cache locality •SIMD instructions, pipelining •Small intermediates (fit in L3 cache) [Diagram: tuple-at-a-time vs. column-at-a-time vs. vectorized processing, each mapping a table to a result]

  34. Why DuckDB •Vectorized processing: intermediates fit in the L3 cache •Column-at-a-time: intermediates go to main memory [Diagram: memory hierarchy of a CPU core: L1 cache 32KB, latency 1ns; L2 cache 256KB, latency 5ns; L3 cache 20MB, latency 20ns; main memory 16GB-2TB, latency 100ns]

  35. Why DuckDB •TPC-H benchmark: an analytics benchmark modeled on a shipping company •Processes queries 20-40x faster than traditional systems because of the processing model [Chart: TPC-H runtimes, 0-200 s, for DuckDB, SQLite, PostgreSQL, MySQL]

  36. For the adventurous
     remotes::install_github("cwida/duckdb/tools/rpkg", build = FALSE)
     con <- dbConnect(duckdb::duckdb(), ":memory:")
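From there on it is plain DBI; a short usage sketch with made-up data (dbWriteTable and dbGetQuery behave as with any other driver; duckdb's dbDisconnect additionally takes shutdown = TRUE to stop the embedded engine):

     library(DBI)
     con <- dbConnect(duckdb::duckdb(), ":memory:")
     dbWriteTable(con, "artist",
                  data.frame(name = c("Notorious BIG", "Ice Cube", "Backstreet Boys"),
                             year = c(1989, 1992, 1994)))
     dbGetQuery(con, "SELECT name FROM artist ORDER BY year")
     dbDisconnect(con, shutdown = TRUE)  # also shuts the embedded engine down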

  37. Overview: 1. Motivations to use a Database 2. System Scenarios 3. R and Databases: State of the Art 4. The future is DuckDB
