efficient data management and statistics with zero copy
play

Efficient Data Management and Statistics with Zero-Copy Integration - PowerPoint PPT Presentation

Efficient Data Management and Statistics with Zero-Copy Integration Jonathan Lajus & Hannes Mhleisen SSDBM 2014, 2014-06-30 Collect data Bottleneck, thanks David! Statistical Toolkit Filter, transform Load data Analyze & Plot


  1. Efficient Data Management and Statistics with Zero-Copy Integration Jonathan Lajus & Hannes Mühleisen SSDBM 2014, 2014-06-30

  2. Collect data Bottleneck, thanks David! Statistical Toolkit Filter, transform Load data Analyze & Plot & aggregate data Data Management System Publish paper/ Profit

  3. Data Transfer Options • “Socket-Style”, JDBC, ODBC, DBI, … • Serialization, copy, copy, copy, Deserialization • In-process embedding • No sockets (hopefully), but still conversion • Shared memory • No transfer altogether by extending DB or stats

  4. Zero-Copy • In the end, C-arrays of native types are everywhere • Hardware does not like Java objects (usually) • Hmm… • In-process data sharing by passing pointers • If both systems are based on C arrays, we could get away with metadata management

  5. Statistics Database SELECT...? (1) 0x00000000 Query Result Statistics Database (2) 0x00000000 0x10000000 Statistics Database Query Result 0x10000000! (3) 0x00000000 0x10000000

  6. Challenges • Compilation & Linking • symbol name clashes are likely • Read/Write synchronization • Memory management (who calls free()) • NA/NULL value encoding? • Complex Objects

  7. Proof-of-Concept Analyze & Plot Filter, transform & aggregate https://github.com/lajus/monetinr

  8. R SEXP Array ... 44 42 43 SEXP Header Reference MonetDB BAT head Column ... 2 0 1 Descriptor BAT Arrays Descriptor Reference Column ... tail 44 42 43 Descriptor

  9. Dress-up BAT MonetDB Descriptor Reference Column tail Descriptor ... R 44 SEXP Header 42 43 Reference + Garbage Collection Fun

  10. Experiments • data.table , high-performance R data access • MonetDB.R , DBI/socket-based DB access • Equvalent systems, different connection • RSQLite , embedded SQL database • Still needs conversion

  11. Predictions • Dedicated data management systems should be better at data management than pimped stats tools • In-process integration should make a big difference once result sets get large • Column store performance gain should be visible

  12. Setup • Typical data management tasks • Selection & Projection • Aggregation • Joins • 10MB, 100MB, 1GB, 10GB datasets • Desktop-class machine, 16 GB RAM, 3.4 Ghz i7, Fedora Linux

  13. 1% Rows Selected 10% Rows Selected 50% Rows Selected 10min ● ● ● ● 1min ● ● ● ● Execution Time (log) ● ● ● ● 10s ● ● ● ● ● ● ● ● ● ● 1s ● ● ● ● ● ● ● ● ● ● ● ● ● 100ms ● ● ● ● ● ● ● Beats Sockets & Stats Extensions data.table ● ● RSQLite ● 10ms ● ● MonetDB.R ● ● ● Prototype ● 100 MB 100 MB 100 MB 10 MB 10 MB 10 MB 10 GB 10 GB 10 GB 1 GB 1 GB 1 GB Dataset Size (log)

  14. 1 Group 500 Groups 10% Groups ● ● ● 10min ● ● ● Execution Time (log) ● ● ● ● ● ● 1min ● ● 10s ● ● ● ● ● ● ● ● ● ● 1s ● ● ● ● ● ● ● ● ● ● ● ● ● 100ms ● ● ● ● Complex queries / small result sets not worth it data.table ● ● RSQLite ● ● ● 10ms MonetDB.R ● ● Prototype ● 100 MB 100 MB 100 MB 10 MB 10 MB 10 MB 10 GB 10 GB 10 GB 1 GB 1 GB 1 GB Dataset Size (log)

  15. 1% Join Partner Size 10% Join Partner Size ● ● 10min ● Execution Time (log) ● ● 1min ● 10s ● ● ● ● ● ● 1s ● ● ● ● ● ● ● ● ● ● 100ms data.table ● ● ● RSQLite ● ● MonetDB.R ● ● ● ● Prototype ● 10ms 100 MB 100 MB 10 MB 10 MB 10 GB 10 GB 1 GB 1 GB Dataset Size (log)

  16. Conclusions • Zero-Copy possible • Vast performance benefits • But • Read/Write access? • Iterative processes? • Optimization?

  17. Special Thanks • Thomas Lumley (R) • Sjoerd Mullender (MonetDB)

  18. R UDFs in MonetDB CREATE FUNCTION kmeans (data FLOAT, ncluster INTEGER) RETURNS INTEGER LANGUAGE R { kmeans(data,ncluster)$cluster }; Watch the next MonetDB release…

  19. quantile(c(.05,.95)) ● PL/R 40 ● ● ● sys 30 ● sqltime Time (s) ● dumbtime ● udftime 20 ● vdumbtime ● plrtime ● 10 ● R in MonetDB ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 ● 1 K 1 K 1 K 1 K 1 K 10 K 10 K 10 K 10 K 10 K 100 K 100 K 100 K 100 K 100 K 200 K 1 M 1 M 1 M 1 M 10 M 10 M 10 M 10 M 100 M 100 M 100 M 100 M Rows

  20. Thank You! Questions?

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend