Efficient Data Management and Statistics with Zero-Copy Integration


Jonathan Lajus & Hannes Mühleisen, SSDBM 2014, 2014-06-30


slide-1
SLIDE 1

Efficient Data Management and Statistics with Zero-Copy Integration

SSDBM 2014, 2014-06-30

Jonathan Lajus & Hannes Mühleisen

slide-2
SLIDE 2

Workflow: Collect data → Load data → Filter, transform & aggregate data → Analyze & Plot → Publish paper / Profit

Components: Data Management System, Statistical Toolkit

Bottleneck, thanks David!

slide-3
SLIDE 3

Data Transfer Options

  • “Socket-Style”, JDBC, ODBC, DBI, …
  • Serialization, copy, copy, copy, Deserialization
  • In-process embedding
  • No sockets (hopefully), but still conversion
  • Shared memory
  • No transfer altogether by extending DB or stats
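
To make the "Serialization, copy, copy, copy, Deserialization" path concrete, here is a minimal Python sketch of a socket-style transfer. The function name `socket_style_transfer` and the wire format (packed native doubles) are assumptions chosen for illustration; real JDBC/ODBC/DBI drivers use their own protocols.

```python
import socket
import struct
from array import array


def socket_style_transfer(column):
    """Ship a column of doubles through a local socket pair.

    Illustrative only: single-threaded, so it is meant for small payloads
    that fit into the socket buffer.
    """
    fmt = f"={len(column)}d"
    payload = struct.pack(fmt, *column)        # serialization: first copy
    sender, receiver = socket.socketpair()
    try:
        sender.sendall(payload)                # copy into the kernel buffer
        received = bytearray()
        while len(received) < len(payload):    # copy back out of the kernel
            chunk = receiver.recv(4096)
            if not chunk:
                break
            received += chunk
    finally:
        sender.close()
        receiver.close()
    # deserialization: yet another copy into a fresh array
    return array("d", struct.unpack(fmt, bytes(received)))
```

The receiving side ends up with equal values in a different memory location, which is the cost the remaining options on this slide try to reduce.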
slide-4
SLIDE 4

Zero-Copy

  • In the end, C-arrays of native types are everywhere
  • Hardware does not like Java objects (usually)
  • Hmm…
  • In-process data sharing by passing pointers
  • If both systems are based on C arrays, we could get away with metadata management
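
The pointer-passing idea can be sketched in Python with the standard `array` and `ctypes` modules. The helper names `share_by_pointer` and `view_from_pointer` are hypothetical; the sketch only assumes that both sides live in one process and agree on the element type.

```python
import ctypes
from array import array


def share_by_pointer(column):
    """Return (address, length): the only metadata handed to the other side."""
    address, length = column.buffer_info()
    return address, length


def view_from_pointer(address, length):
    """Interpret existing memory as a C array of doubles; nothing is copied."""
    return (ctypes.c_double * length).from_address(address)


column = array("d", [42.0, 43.0, 44.0])               # owned by the "database" side
view = view_from_pointer(*share_by_pointer(column))   # used by the "statistics" side

column[0] = 1.0   # a write on one side is immediately visible through the view
```

Note that `from_address` does not keep `column` alive, so the view dangles if the owner frees or resizes the array. That is exactly the kind of ownership question a zero-copy design has to settle.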

slide-5
SLIDE 5

[Diagram: Statistics and Database share one address space starting at 0x00000000]

(1) Statistics asks Database: SELECT...?
(2) Database places the Query Result at 0x10000000
(3) Database answers Statistics with the address: 0x10000000!

slide-6
SLIDE 6

Challenges

  • Compilation & Linking
  • symbol name clashes are likely
  • Read/Write synchronization
  • Memory management (who calls free())
  • NA/NULL value encoding?
  • Complex Objects
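
The NA/NULL question can be illustrated with a small Python sketch. R represents a missing integer as the minimum 32-bit value; the second sentinel, `OTHER_SYSTEM_NULL`, and the function name are hypothetical and stand in for a partner system that chose a different encoding.

```python
from array import array

R_NA_INTEGER = -2**31            # R stores NA_integer_ as the minimum 32-bit int
OTHER_SYSTEM_NULL = 2**31 - 1    # hypothetical sentinel of the partner system


def recode_nulls_in_place(column, src_null, dst_null):
    """Rewrite NULL sentinels inside the shared array; return how many changed.

    If both systems use the same sentinel, nothing is touched and sharing
    stays free. Otherwise every value has to be inspected once.
    """
    if src_null == dst_null:
        return 0
    changed = 0
    for i, value in enumerate(column):
        if value == src_null:
            column[i] = dst_null
            changed += 1
    return changed
```

When the encodings differ, the recoding pass writes into memory that the other system may be reading, so it also runs into the read/write synchronization item on this list.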
slide-7
SLIDE 7

Analyze & Plot Filter, transform & aggregate

Proof-of-Concept

https://github.com/lajus/monetinr

slide-8
SLIDE 8

[Diagram: a MonetDB BAT (BAT Descriptor with head and tail Column Descriptors, each holding a Reference to an Array: 1 2 ... and 42 43 44 ...) next to an R SEXP (a Reference to a SEXP Header followed by its Array: 42 43 44 ...)]

slide-9
SLIDE 9

[Diagram: a single array (42 43 44 ...) carrying a SEXP Header, referenced by R and by MonetDB's BAT Descriptor through its tail Column Descriptor]

Dress-up

+ Garbage Collection Fun
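
The slide text does not spell out the mechanism, so the following Python sketch is only one possible reading of the diagram: a single allocation holds a header followed by the values, the statistics side references the whole block, and the database side references the values only. The header layout (`HEADER`, `HYPOTHETICAL_INT_TAG`) is invented for illustration and is not R's real SEXP layout.

```python
import struct

HEADER = struct.Struct("ii")     # hypothetical header: (type tag, element count)
HYPOTHETICAL_INT_TAG = 1         # made-up tag, not an R type code


def dress_up(values):
    """Allocate header + values in one block and hand out two views on it."""
    n = len(values)
    block = bytearray(HEADER.size + 4 * n)
    HEADER.pack_into(block, 0, HYPOTHETICAL_INT_TAG, n)
    struct.pack_into(f"{n}i", block, HEADER.size, *values)
    whole = memoryview(block)
    stats_ref = whole                          # header plus values
    db_ref = whole[HEADER.size:].cast("i")     # values only, typed as C ints
    return block, stats_ref, db_ref
```

While either view exists, Python refuses to resize `block` (a `BufferError` is raised), which hints at why garbage collection becomes "fun": neither side may release or move the memory while the other still holds a reference.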

slide-10
SLIDE 10

Experiments

  • data.table, high-performance R data access
  • MonetDB.R, DBI/socket-based DB access
  • Equivalent systems, different connection
  • RSQLite, embedded SQL database
  • Still needs conversion
slide-11
SLIDE 11

Predictions

  • Dedicated data management systems should be better at data management than pimped stats tools
  • In-process integration should make a big difference
  • Once result sets get large
  • Column store performance gain should be visible
  • Column store performance gain should be visible
slide-12
SLIDE 12

Setup

  • Typical data management tasks
  • Selection & Projection
  • Aggregation
  • Joins
  • 10MB, 100MB, 1GB, 10GB datasets
  • Desktop-class machine, 16 GB RAM, 3.4 GHz i7, Fedora Linux
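
The three task types can be sketched with Python's built-in `sqlite3` module. The table names (`t`, `p`), columns, and the `run_tasks` helper are hypothetical; the actual benchmark queries and schemas are not given on the slides.

```python
import sqlite3


def run_tasks(rows, partners, threshold):
    """Run one selection/projection, one aggregation and one join on toy tables."""
    con = sqlite3.connect(":memory:")
    try:
        con.execute("CREATE TABLE t (id INTEGER, grp INTEGER, val REAL)")
        con.execute("CREATE TABLE p (id INTEGER, label TEXT)")
        con.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)
        con.executemany("INSERT INTO p VALUES (?, ?)", partners)
        # Selection & projection: keep some rows, return some columns
        selection = con.execute(
            "SELECT id, val FROM t WHERE val < ? ORDER BY id", (threshold,)
        ).fetchall()
        # Aggregation: one result row per group
        aggregation = con.execute(
            "SELECT grp, SUM(val) FROM t GROUP BY grp ORDER BY grp"
        ).fetchall()
        # Join: match rows against a smaller partner table
        join = con.execute(
            "SELECT t.id, p.label FROM t JOIN p ON t.id = p.id ORDER BY t.id"
        ).fetchall()
    finally:
        con.close()
    return selection, aggregation, join
```

Selection and join results grow with the input, while aggregation results stay small. That difference matters here because transfer cost depends on result set size.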

slide-13
SLIDE 13

[Figure: Execution Time (log, 10ms to 10min) vs. Dataset Size (log, 10 MB to 10 GB); panels: 1% Rows Selected, 10% Rows Selected, 50% Rows Selected; series: data.table, RSQLite, MonetDB.R, Prototype]

Beats Sockets & Stats Extensions

slide-14
SLIDE 14

[Figure: Execution Time (log, 10ms to 10min) vs. Dataset Size (log, 10 MB to 10 GB); panels: 1 Group, 500 Groups, 10% Groups; series: data.table, RSQLite, MonetDB.R, Prototype]

Complex queries / small result sets not worth it

slide-15
SLIDE 15

[Figure: Execution Time (log, 10ms to 10min) vs. Dataset Size (log, 10 MB to 10 GB); panels: 1% Join Partner Size, 10% Join Partner Size; series: data.table, RSQLite, MonetDB.R, Prototype]

slide-16
SLIDE 16

Conclusions

  • Zero-Copy possible
  • Vast performance benefits
  • But
  • Read/Write access?
  • Iterative processes?
  • Optimization?
slide-17
SLIDE 17

Special Thanks

  • Thomas Lumley (R)
  • Sjoerd Mullender (MonetDB)
slide-18
SLIDE 18

CREATE FUNCTION kmeans (data FLOAT, ncluster INTEGER)
RETURNS INTEGER
LANGUAGE R {
    kmeans(data, ncluster)$cluster
};

R UDFs in MonetDB

Watch the next MonetDB release…

slide-19
SLIDE 19

[Figure: Time (s, 10 to 40) vs. Rows (1 K to 100 M) for quantile(c(.05,.95)); PL/R compared with R in MonetDB; legend (sys): sqltime, dumbtime, udftime, vdumbtime, plrtime]

slide-20
SLIDE 20

Thank You!

Questions?