Symbiosis Column Stores and R Statistics Hannes Mhleisen & - - PowerPoint PPT Presentation

symbiosis
SMART_READER_LITE
LIVE PREVIEW

Symbiosis Column Stores and R Statistics Hannes Mhleisen & - - PowerPoint PPT Presentation

Symbiosis Column Stores and R Statistics Hannes Mhleisen & Thomas Lumley Process Collect data Load data Filter, transform Analyze & Plot into R & aggregate data Publish paper Process Collect data Load data Filter,


slide-1
SLIDE 1

Symbiosis

Column Stores and R Statistics

Hannes Mühleisen & Thomas Lumley

slide-2
SLIDE 2

Collect data Load data into R Filter, transform & aggregate data Analyze & Plot Publish paper

Process

slide-3
SLIDE 3

Collect data Load data into R Filter, transform & aggregate data Analyze & Plot Publish paper

Process

slide-4
SLIDE 4

Running Example

  • Say you are Starfleet Research and want to

analyze warp drive performance (Coil Flux)

  • Lots of data (~1G CSV, 68M records)

class,speed,flux NX,1,11 Constitution,1,5 Galaxy,1,1 Defiant,1,3 Intrepid,1,1 NX,1,5

slide-5
SLIDE 5

Problem: #BiggerR

X

25 50 75 100 90 MB 690 MB 6 GB

92 5

R CSV Loading Time (s) Dataset

R

slide-6
SLIDE 6

Solution?

  • Use optimized data management system

for data loading & retrieval

  • ... like a relational database
  • ... like a analytics-optimized database
slide-7
SLIDE 7

Solution?

  • Use optimized data management system

for data loading & retrieval

  • ... like a relational database
  • ... like a analytics-optimized database
  • ... like MonetDB!
slide-8
SLIDE 8

Relational DBs 101

NX Constitution Galaxy Defiant Intrepid 1 1 1 1 1 3 8 3 6 1 class speed flux

slide-9
SLIDE 9

Postgres, Oracle, DB2, etc.:

NX Constitution Galaxy Defiant Intrepid 1 1 1 1 1 3 8 3 6 1 class speed flux NX Constitution Galaxy Defiant Intrepid 1 1 1 1 1 3 8 3 6 1

Conceptional Physical (on Disk)

slide-10
SLIDE 10

Column Store:

Peter A. Boncz, Martin L. Kersten, and Stefan Manegold. 2008. Breaking the memory wall in MonetDB.

  • Commun. ACM 51, 12 (December 2008), 77-85. DOI=10.1145/1409360.1409380

NX Constitution Galaxy Defiant Intrepid 1 1 1 1 1 3 8 3 6 1 class speed flux NX Constitution Galaxy Defiant Intrepid 1 1 1 1 1 3 8 3 6 1

slide-11
SLIDE 11
  • 0.01

1.00 100.00

Query Average time (s)

  • monetdb

postgres

TPC−H SF−100 Hot runs

Performance...

log!

slide-12
SLIDE 12

First Gains...

175 350 525 700 90 MB 690 MB 6 GB

700 35 1 92 5

CSV Loading Time (s) Dataset

R MonetDB

X

slide-13
SLIDE 13

But then...

...do we really want this?

data <- dbGetQuery(conn," SELECT t1,COUNT(t1) AS ct FROM ( SELECT CAST(flux as integer) AS t1 FROM starships WHERE ( (speed = 5) ) AND ( (class = 'NX') ) ) AS t WHERE t1 > 0 GROUP BY t1 ORDER BY t1 LIMIT 100; ") normalized <- data$ct/sum(data$ct)

slide-14
SLIDE 14

Enter monet.frame

data <- monet.frame(conn,"starships") nxw5 <- subset(data,class=="NX" & speed==5)$flux t <- tabulate(nxw5,100) normalized <- t/sum(t)

R-style data manipulation & aggregation

The virtual data object for R

slide-15
SLIDE 15

Meanwhile

data <- monet.frame(conn,"starships") SELECT * FROM starships; nxw5 <- subset(data,class=="NX" & speed==5)$flux SELECT * FROM starships WHERE class = 'NX' AND speed = 5; SELECT flux FROM starships WHERE class = 'NX' AND speed = 5; t <- tabulate(nxw5,100) SELECT t1,COUNT(t1) AS ct FROM (SELECT CAST(flux as integer) AS t1 FROM starships WHERE class = 'NX' AND speed = 5) AS t WHERE t1 > 0 GROUP BY t1 ORDER BY t1 LIMIT 100;

Actually executed

Behind the scenes:

slide-16
SLIDE 16

Implementation

# R core unique <- function(x, incomparables = FALSE, ...) UseMethod("unique") # MonetDB.R unique.monet.frame <- function (x, incomparables = FALSE, ...) as.vector(.col.func(x,"distinct",num=FALSE,aggregate=TRUE)) # On Shell unique(wcflux$flux) # result query: SELECT DISTINCT(flux) FROM starships;

slide-17
SLIDE 17

Flux Analysis Script

wcflux <- read.table("starships.csv",sep=",",header=T) classes <- sort(unique(wcflux$class)) wcflux5 <- subset(wcflux,speed==5)[c("class","flux")] plot(0,0,ylim = c(0,0.1),xlim = c(0,100),type = "n") for(i in 1:length(classes)){ tclass <- classes[[i]] ct <- tabulate(subset(wcflux5,class==tclass)$flux,100) normalized <- ct/sum(ct) lines(data.frame(x=seq(1,100),y=normalized)) }

slide-18
SLIDE 18

20 40 60 80 100 0.00 0.02 0.04 0.06 0.08 0.10 Warp Coil Flux (mC) Density

Density Plot of Warp Coil Flux per Starship Class (Warp 5)

Starship Class Constitution Defiant Galaxy Intrepid NX n=8999955

slide-19
SLIDE 19

Flux Analysis Script (2)

wcflux <- monet.frame(conn,"starships") classes <- sort(unique(wcflux$class)) wcflux5 <- subset(wcflux,speed==3)[c("class","flux")] plot(0,0,ylim = c(0,0.2),xlim = c(0,60),type = "n") for(i in 1:length(classes)){ tclass <- classes[[i]] ct <- tabulate(subset(wcflux5,class==tclass)$flux,60) normalized <- ct/sum(ct) lines(data.frame(x=seq(1,60),y=normalized)) }

changed!

slide-20
SLIDE 20

10 20 30 40 50 60 0.00 0.05 0.10 0.15 0.20 Warp Coil Flux (mC) Density

Density Plot of Warp Coil Flux per Starship Class (Warp 3)

Starship Class Constitution Defiant Galaxy Intrepid NX n=8999955

slide-21
SLIDE 21

SELECT DISTINCT(class) FROM starships; SELECT t1,COUNT(t1) AS ct FROM (SELECT CAST(flux as integer) AS t1 FROM starships WHERE ( (speed = 3) ) AND ( (class = 'Constitution') ) ) AS t WHERE t1 > 0 GROUP BY t1 ORDER BY t1 LIMIT 60;

  • - [...]

Generated SQL

slide-22
SLIDE 22

Performance

22,5 45 67,5 90 R MonetDB + R

6 20 16 70

Flux Analysis Time (s) System

Loading Execution

slide-23
SLIDE 23

Thank You!

Questions?

str() print() summary() names() dim() length() head() tail() [] $ subset() na.omit() sample() abs() sign() sqrt() floor() ceiling() trunc() round() signif() exp() log() sin() +

  • *

/ min() max() sum() range() sd() var() quantile() aggregate() merge() sort() range() tabulate()

CRAN: MonetDB.R

^ / ==

slide-24
SLIDE 24