  1. DATA ANALYTICS USING DEEP LEARNING GT 8803 FALL 2018 POOJA BHANDARY TOWARDS A UNIFIED ARCHITECTURE FOR IN-RDBMS ANALYTICS

  2. TODAY'S PAPER • SIGMOD 2012: In-RDBMS Analytics • Hazy project at the Department of Computer Science, University of Wisconsin-Madison.

  3. TODAY'S AGENDA • Motivation • Problem Overview • Key Idea • Technical Details • Experiments • Discussion

  4. Motivation

  5. Problem Overview • Ad hoc development cycle for incorporating new analytical tasks. • Performance optimization on a per-module basis. • Limited code reusability.

  6. In-RDBMS Analytics Architecture

  7. High-Level Idea • Devise a unified architecture capable of processing multiple data analytics techniques. • Frame analytical tasks as convex programming problems.

  8. Main Contributions • Bismarck • Identification of factors that impact performance, with relevant optimizations for each.

  9. Bismarck

  10. Convex Optimization

  11. Gradient Descent

  12. Incremental Gradient Descent • w^(k+1) = w^(k) − α_k ∇F(w^(k), z_k), i.e., one gradient step per data item z_k

  13. Incremental Gradient Descent • Data-access properties are amenable to an efficient in-RDBMS implementation. • IGD approximates the full gradient ∇F using only one term (one data item) at a time.
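As a toy sketch of the IGD update (assuming in-memory Python lists and a least-squares objective; all names here are illustrative, not the paper's code):

```python
import random

# Incremental gradient descent: update the model with the gradient of F
# on a SINGLE data item z per step, instead of the full gradient over
# all items. Toy objective: F(w, (x, y)) = (w*x - y)^2.

def igd(data, grad_f, dims, alpha=0.1, epochs=10):
    w = [0.0] * dims
    for _ in range(epochs):
        random.shuffle(data)                      # randomized visit order
        for z in data:
            g = grad_f(w, z)                      # gradient on one item
            w = [wi - alpha * gi for wi, gi in zip(w, g)]
    return w

def grad_ls(w, z):
    x, y = z
    return [2 * (w[0] * x - y) * x]

w = igd([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)], grad_ls, dims=1)
# w[0] approaches 2.0, the exact solution of y = 2x
```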

  14. Technical Approach • IGD can be implemented using a classic RDBMS abstraction: the UDA (user-defined aggregate).

  15. User-Defined Aggregate (UDA) • Initialize • Transition • Finalize
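The three callbacks can be sketched in Python, with a plain loop standing in for the RDBMS driving the aggregate over a table scan (hypothetical names, not any real UDA API):

```python
# One IGD epoch phrased as the three UDA callbacks listed above.
# The driver loop stands in for the RDBMS feeding tuples to the
# aggregate; a least-squares gradient is used for concreteness.

def initialize(dims):
    # Set up the aggregation state: the model plus a fixed step size.
    return {"w": [0.0] * dims, "alpha": 0.1}

def transition(state, tup):
    # Called once per tuple: take one incremental gradient step.
    x, y = tup
    pred = sum(wi * xi for wi, xi in zip(state["w"], x))
    grad = [2 * (pred - y) * xi for xi in x]
    state["w"] = [wi - state["alpha"] * gi
                  for wi, gi in zip(state["w"], grad)]
    return state

def finalize(state):
    # Emit the aggregate's result: the learned model.
    return state["w"]

def run_epoch(rows, dims):
    state = initialize(dims)
    for tup in rows:              # the RDBMS would drive this scan
        state = transition(state, tup)
    return finalize(state)

w = run_epoch([([1.0], 2.0), ([2.0], 4.0)], dims=1)
```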

  16.

  17. Performance Optimizations • Data Ordering • Parallelizing Gradient Computations • Avoiding Shuffling Overhead

  18. Data Ordering • Data is often clustered in RDBMSs, which can slow convergence. • Shuffling at every epoch can be computationally expensive. • Solution: shuffle once.
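The shuffle-once tactic is a few lines in sketch form (illustrative, not Bismarck's implementation):

```python
import random

# "Shuffle once": pay the cost of randomizing the scan order a single
# time, then reuse that same order for every epoch, instead of
# re-shuffling the data before each pass.

def igd_shuffle_once(data, step, epochs):
    order = list(range(len(data)))
    random.shuffle(order)             # one-time shuffling cost
    for _ in range(epochs):
        for i in order:               # identical order every epoch
            step(data[i])

seen = []
igd_shuffle_once(list("abcdef"), seen.append, epochs=2)
# both epochs visit the six items in the same random order
```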

  19. Parallelizing Gradient Computations • Pure UDA (shared-nothing): requires a merge function; can lead to sub-optimal runtimes. • Shared-memory UDA: implemented in user space; the model being learned is kept in shared memory and updated concurrently by parallel threads.
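A minimal sketch of the shared-memory variant, assuming Python threads and lock-free model updates (illustrative; the actual system maintains the model in shared memory inside the database's user space):

```python
import threading

# Shared-memory idea: all workers take gradient steps on ONE shared
# model, without locks. Toy least-squares data; names are illustrative.

def parallel_igd(partitions, grad_f, dims, alpha=0.05, epochs=20):
    w = [0.0] * dims                          # model "in shared memory"

    def worker(part):
        for _ in range(epochs):
            for z in part:
                g = grad_f(w, z)
                for i, gi in enumerate(g):
                    w[i] -= alpha * gi        # unsynchronized update

    threads = [threading.Thread(target=worker, args=(p,))
               for p in partitions]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w

def grad_ls(w, z):
    x, y = z
    return [2 * (w[0] * x - y) * x]

parts = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (0.5, 1.0)]]
w = parallel_igd(parts, grad_ls, dims=1)
# w[0] lands near 2.0 despite the racy, lock-free updates
```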

  20. Avoiding Shuffling Overhead • Shuffling even once may not be feasible for very large datasets. • Straightforward reservoir sampling can slow convergence by discarding data items that would otherwise speed it up.

  21. Multiplexed Reservoir Sampling • Combines reservoir sampling with the concurrent-update model. • Multiplexes gradient steps over both the reservoir sample and the data not placed in the reservoir buffer.
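One plausible single-threaded sketch of the buffering logic, with a stand-in `grad_step` for a single IGD update (Bismarck actually multiplexes two concurrent workers over a shared model):

```python
import random

# Build a size-k reservoir (Algorithm R) while still taking a gradient
# step on every tuple that does NOT end up buffered, so no data item is
# wasted. Single-threaded for clarity; in the paper a second worker
# concurrently steps over the reservoir in later epochs.

def multiplexed_pass(stream, grad_step, k):
    reservoir = []
    for n, z in enumerate(stream):
        if len(reservoir) < k:
            reservoir.append(z)              # buffer the first k tuples
        else:
            j = random.randrange(n + 1)      # keep z with prob. k/(n+1)
            if j < k:
                evicted = reservoir[j]
                reservoir[j] = z
                grad_step(evicted)           # step on the evicted tuple
            else:
                grad_step(z)                 # step on the unbuffered tuple
    return reservoir

calls = []
res = multiplexed_pass(range(100), calls.append, k=10)
# len(res) == 10; exactly 90 of the 100 tuples got a gradient step
```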

  22. Multiplexed Reservoir Sampling

  23. Evaluation 1) Implement Bismarck on PostgreSQL and two commercial databases. 2) Compare its performance against the native analytics tools provided by those RDBMSs.

  24. Tasks and Datasets 1. Logistic Regression (LR) - Forest, DBLife 2. Support Vector Machine (SVM) - Forest, DBLife 3. Low-Rank Matrix Factorization (LMF) - MovieLens 4. Conditional Random Fields Labeling (CRF) - CoNLL

  25. Benchmarking Results

      Dataset           Task  | PostgreSQL          | DBMS A              | DBMS B (8 segments)
                              | BISMARCK   MADlib   | BISMARCK   Native   | BISMARCK   Native
      Forest (Dense)    LR    | 8.0        43.5     | 40.2       489.0    | 3.7        17.0
      Forest (Dense)    SVM   | 7.5        140.2    | 32.7       66.7     | 3.3        19.2
      DBLife (Sparse)   LR    | 0.8        N/A      | 9.8        20.6     | 2.3        N/A
      DBLife (Sparse)   SVM   | 1.2        N/A      | 11.6       4.8      | 4.1        N/A
      MovieLens         LMF   | 36.0       29325.7  | 394.7      N/A      | 11.9       17431.3

  26. Impact of Data Ordering

  27. Scalability Test

  28. Strengths 1. Incorporating a new task requires changing only a few lines of code. 2. Shorter development cycles. 3. Performance optimizations are generic rather than per-module.

  29. Weaknesses • The effect of clustering on the convergence rate is argued only theoretically. • Applies only to analytical tasks expressible as convex optimization problems.

  30. Reflections

  31. References
