symbiosis
play

Symbiosis Column Stores and R Statistics XLDB 2013, Hannes - PowerPoint PPT Presentation

Symbiosis Column Stores and R Statistics XLDB 2013, Hannes Mhleisen, CWI Database Architectures Why Statistical computing & graphics Free, Open Source, ... Data Handling, Calculations, ... Lots of contributed packages


  1. Symbiosis Column Stores and R Statistics XLDB 2013, Hannes Mühleisen, CWI Database Architectures

  2. Why • Statistical computing & graphics • Free, Open Source, ... • Data Handling, Calculations, ... • Lots of contributed packages • Pictures!

  3. Process Collect data Load data Filter & Draw nice into R aggregate data pictures Publish paper

  4. Problem CSV Loading 100 92 75 Time (s) 50 25 5 X 0 90 MB 690 MB 6 GB Dataset R

  5. Solution? • Use optimized data management system for data loading & retrieval • ... like a analytics-optimized database • ... like MonetDB

  6. First Gains... CSV Loading 700 700 525 Time (s) 350 175 92 35 X 5 1 0 90 MB 690 MB 6 GB Dataset R MonetDB

  7. But then... data <- dbGetQuery(conn," SELECT t1,COUNT(t1) AS ct FROM ( SELECT CAST(flux as integer) AS t1 FROM starships WHERE ( (speed = 5) ) AND ( (class = 'NX') ) ) AS t WHERE t1 > 0 GROUP BY t1 ORDER BY t1 LIMIT 100; ") normalized <- data$ct/sum(data$ct) ...do we really want this?

  8. Enter monet.frame The virtual data object for R data <- monet.frame(conn,"starships") nxw5 <- subset(data,class=="NX" & speed==5)$flux t <- tabulate(nxw5,100) normalized <- t/sum(t) R-style data manipulation & aggregation

  9. Meanwhile Behind the scenes: data <- monet.frame(conn,"starships") SELECT * FROM starships; nxw5 <- subset(data,class=="NX" & speed==5)$flux SELECT * FROM starships WHERE class = 'NX' AND speed = 5; SELECT flux FROM starships WHERE class = 'NX' AND speed = 5; t <- tabulate(nxw5,100) SELECT t1,COUNT(t1) AS ct FROM (SELECT CAST(flux as integer) AS t1 FROM starships WHERE class = 'NX' AND speed = 5) AS t WHERE t1 > 0 GROUP BY t1 ORDER BY t1 LIMIT 100; Actually executed

  10. Small Example • Say you are Starfleet Research and want to analyze warp drive performance (Coil Flux) • Lots of data (~1G CSV, 68M records) class,speed,flux NX,1,11 Constitution,1,5 Galaxy,1,1 Defiant,1,3 Intrepid,1,1 NX,1,5

  11. Flux Analysis Script wcflux <- read.table("starships.csv",sep=",",header=T) classes <- sort(unique(wcflux$class)) wcflux5 <- subset(wcflux,speed==5)[c("class","flux")] plot(0,0,ylim = c(0,0.1),xlim = c(0,100),type = "n") for(i in 1:length(classes)){ tclass <- classes[[i]] ct <- tabulate(subset(wcflux5,class==tclass)$flux,100) normalized <- ct/sum(ct) lines(data.frame(x=seq(1,100),y=normalized)) }

  12. Density Plot of Warp Coil Flux per Starship Class (Warp 5) 0.10 Starship Class n=8999955 Constitution Defiant Galaxy Intrepid 0.08 NX 0.06 Density 0.04 0.02 0.00 0 20 40 60 80 100 Warp Coil Flux (mC)

  13. Flux Analysis Script (2) changed! wcflux <- monet.frame(conn,"starships") classes <- sort(unique(wcflux$class)) wcflux5 <- subset(wcflux,speed==3)[c("class","flux")] plot(0,0,ylim = c(0,0.2),xlim = c(0,60),type = "n") for(i in 1:length(classes)){ tclass <- classes[[i]] ct <- tabulate(subset(wcflux5,class==tclass)$flux,60) normalized <- ct/sum(ct) lines(data.frame(x=seq(1,60),y=normalized)) }

  14. Density Plot of Warp Coil Flux per Starship Class (Warp 3) 0.20 Starship Class n=8999955 Constitution Defiant Galaxy Intrepid NX 0.15 Density 0.10 0.05 0.00 0 10 20 30 40 50 60 Warp Coil Flux (mC)

  15. Performance Flux Analysis 90 20 67,5 70 Time (s) 45 22,5 6 16 0 R MonetDB + R System Execution Loading

  16. Thank You! Questions? • Hannes Mühleisen and Thomas Lumley: Best of Both Worlds – Relational Databases and Statistics 25th International Conference on Scientific and Statistical Database Management (SSDBM2013), Jul. 2013 http://monetr.r-forge.r-project.org http://hannes.muehleisen.org

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend