Symbiosis Column Stores and R Statistics XLDB 2013, Hannes - - PowerPoint PPT Presentation

symbiosis
SMART_READER_LITE
LIVE PREVIEW

Symbiosis Column Stores and R Statistics XLDB 2013, Hannes - - PowerPoint PPT Presentation

Symbiosis Column Stores and R Statistics XLDB 2013, Hannes Mhleisen, CWI Database Architectures Why Statistical computing & graphics Free, Open Source, ... Data Handling, Calculations, ... Lots of contributed packages


slide-1
SLIDE 1

Symbiosis

Column Stores and R Statistics

XLDB 2013, Hannes Mühleisen, CWI Database Architectures

slide-2
SLIDE 2

Why

  • Statistical computing & graphics
  • Free, Open Source, ...
  • Data Handling, Calculations, ...
  • Lots of contributed packages
  • Pictures!
slide-3
SLIDE 3

Collect data Load data into R Filter & aggregate data Draw nice pictures Publish paper

Process

slide-4
SLIDE 4

Problem

X

25 50 75 100 90 MB 690 MB 6 GB

92 5

CSV Loading Time (s) Dataset

R

slide-5
SLIDE 5

Solution?

  • Use optimized data management system

for data loading & retrieval

  • ... like a analytics-optimized database
  • ... like MonetDB
slide-6
SLIDE 6

First Gains...

175 350 525 700 90 MB 690 MB 6 GB

700 35 1 92 5

CSV Loading Time (s) Dataset

R MonetDB

X

slide-7
SLIDE 7

But then...

...do we really want this?

data <- dbGetQuery(conn," SELECT t1,COUNT(t1) AS ct FROM ( SELECT CAST(flux as integer) AS t1 FROM starships WHERE ( (speed = 5) ) AND ( (class = 'NX') ) ) AS t WHERE t1 > 0 GROUP BY t1 ORDER BY t1 LIMIT 100; ") normalized <- data$ct/sum(data$ct)

slide-8
SLIDE 8

Enter monet.frame

data <- monet.frame(conn,"starships") nxw5 <- subset(data,class=="NX" & speed==5)$flux t <- tabulate(nxw5,100) normalized <- t/sum(t)

R-style data manipulation & aggregation

The virtual data object for R

slide-9
SLIDE 9

Meanwhile

data <- monet.frame(conn,"starships") SELECT * FROM starships; nxw5 <- subset(data,class=="NX" & speed==5)$flux SELECT * FROM starships WHERE class = 'NX' AND speed = 5; SELECT flux FROM starships WHERE class = 'NX' AND speed = 5; t <- tabulate(nxw5,100) SELECT t1,COUNT(t1) AS ct FROM (SELECT CAST(flux as integer) AS t1 FROM starships WHERE class = 'NX' AND speed = 5) AS t WHERE t1 > 0 GROUP BY t1 ORDER BY t1 LIMIT 100;

Actually executed

Behind the scenes:

slide-10
SLIDE 10

Small Example

  • Say you are Starfleet Research and want to

analyze warp drive performance (Coil Flux)

  • Lots of data (~1G CSV, 68M records)

class,speed,flux NX,1,11 Constitution,1,5 Galaxy,1,1 Defiant,1,3 Intrepid,1,1 NX,1,5

slide-11
SLIDE 11

Flux Analysis Script

wcflux <- read.table("starships.csv",sep=",",header=T) classes <- sort(unique(wcflux$class)) wcflux5 <- subset(wcflux,speed==5)[c("class","flux")] plot(0,0,ylim = c(0,0.1),xlim = c(0,100),type = "n") for(i in 1:length(classes)){ tclass <- classes[[i]] ct <- tabulate(subset(wcflux5,class==tclass)$flux,100) normalized <- ct/sum(ct) lines(data.frame(x=seq(1,100),y=normalized)) }

slide-12
SLIDE 12

20 40 60 80 100 0.00 0.02 0.04 0.06 0.08 0.10 Warp Coil Flux (mC) Density

Density Plot of Warp Coil Flux per Starship Class (Warp 5)

Starship Class Constitution Defiant Galaxy Intrepid NX n=8999955

slide-13
SLIDE 13

Flux Analysis Script (2)

wcflux <- monet.frame(conn,"starships") classes <- sort(unique(wcflux$class)) wcflux5 <- subset(wcflux,speed==3)[c("class","flux")] plot(0,0,ylim = c(0,0.2),xlim = c(0,60),type = "n") for(i in 1:length(classes)){ tclass <- classes[[i]] ct <- tabulate(subset(wcflux5,class==tclass)$flux,60) normalized <- ct/sum(ct) lines(data.frame(x=seq(1,60),y=normalized)) }

changed!

slide-14
SLIDE 14

10 20 30 40 50 60 0.00 0.05 0.10 0.15 0.20 Warp Coil Flux (mC) Density

Density Plot of Warp Coil Flux per Starship Class (Warp 3)

Starship Class Constitution Defiant Galaxy Intrepid NX n=8999955

slide-15
SLIDE 15

Performance

22,5 45 67,5 90 R MonetDB + R

6 20 16 70

Flux Analysis Time (s) System

Loading Execution

slide-16
SLIDE 16

Thank You!

Questions?

http://monetr.r-forge.r-project.org http://hannes.muehleisen.org

  • Hannes Mühleisen and Thomas Lumley:

Best of Both Worlds – Relational Databases and Statistics

25th International Conference on Scientific and Statistical Database Management (SSDBM2013), Jul. 2013