SLIDE 1

DATA ANALYTICS USING DEEP LEARNING

GT 8803 FALL 2018 POOJA BHANDARY

TOWARDS A UNIFIED ARCHITECTURE FOR IN-RDBMS ANALYTICS

SLIDE 2

GT 8803 // Fall 2018

TODAY’S PAPER

  • SIGMOD 2012
  • In-RDBMS Analytics: Hazy project at the Department of Computer Science, University of Wisconsin–Madison.

SLIDE 3

TODAY’S AGENDA

  • Motivation
  • Problem Overview
  • Key Idea
  • Technical Details
  • Experiments
  • Discussion

SLIDE 4

Motivation

SLIDE 5

Problem Overview

  • Ad hoc development cycle for incorporating new analytical tasks.
  • Performance optimization on a per-module basis.
  • Limited code reusability.

SLIDE 6

In-RDBMS Analytics Architecture

SLIDE 7

High Level Idea

  • Devise a unified architecture capable of processing multiple data-analytics techniques.
  • Frame analytical tasks as convex programming problems.
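One way to see why convex programming unifies these tasks: each analytical task only needs to supply a per-example (sub)gradient of its convex loss, and the same solver runs them all. A minimal sketch, with hypothetical function names (`logistic_grad`, `hinge_grad` are illustrative, not from the paper):

```python
import math

def dot(w, x):
    # Inner product of two equal-length vectors represented as lists.
    return sum(wi * xi for wi, xi in zip(w, x))

def logistic_grad(w, x, y):
    """Gradient of the logistic loss log(1 + exp(-y * <w, x>)) w.r.t. w."""
    scale = -y / (1.0 + math.exp(y * dot(w, x)))
    return [scale * xi for xi in x]

def hinge_grad(w, x, y):
    """Subgradient of the SVM hinge loss max(0, 1 - y * <w, x>) w.r.t. w."""
    if y * dot(w, x) < 1.0:
        return [-y * xi for xi in x]
    return [0.0] * len(w)

# The surrounding solver never needs to know which task it is running:
w = [0.0, 0.0]
for grad in (logistic_grad, hinge_grad):
    g = grad(w, [1.0, 2.0], 1.0)
    w = [wi - 0.1 * gi for wi, gi in zip(w, g)]
```

Swapping in a different convex task means swapping in a different gradient function, which is why only a few lines change per new task.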

SLIDE 8

Main Contributions

  • Bismarck: a unified architecture for in-RDBMS analytics.
  • Identification of factors that impact performance, and suggestion of relevant optimizations.

SLIDE 9

Bismarck

SLIDE 10

Convex Optimization

SLIDE 11

Gradient Descent

SLIDE 12

Incremental Gradient Descent

  • w^(k+1) = w^(k) − α_k ∇F(w^(k), z_k)

SLIDE 13

Incremental Gradient Descent

  • Data-access properties are amenable to an efficient in-RDBMS implementation.
  • IGD approximates the full gradient ∇F using only one term at a time.
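The update w ← w − α ∇F(w, z) touches one data item per step, so one epoch is a single sequential scan over the data, exactly the access pattern an RDBMS is good at. A toy sketch (variable names and the logistic-loss choice are illustrative, not from the paper):

```python
import math

def igd_epoch(w, data, grad, alpha):
    """One epoch of incremental gradient descent: a single sequential
    pass over the data, updating the model after every item."""
    for x, y in data:
        g = grad(w, x, y)
        w = [wi - alpha * gi for wi, gi in zip(w, g)]
    return w

def logistic_grad(w, x, y):
    # Gradient of log(1 + exp(-y * <w, x>)) with respect to w.
    s = sum(wi * xi for wi, xi in zip(w, x))
    scale = -y / (1.0 + math.exp(y * s))
    return [scale * xi for xi in x]

# Toy data: the label matches the sign of the first feature.
data = [([1.0, 0.5], 1.0), ([-1.0, 0.2], -1.0), ([2.0, -0.3], 1.0)]
w = [0.0, 0.0]
for _ in range(50):
    w = igd_epoch(w, data, logistic_grad, alpha=0.1)
```

After a few epochs the first weight turns positive, reflecting the label pattern in the toy data.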

SLIDE 14

Technical Approach

  • IGD can be implemented using a classic RDBMS abstraction called a UDA (user-defined aggregate).

SLIDE 15

User-Defined Aggregate (UDA)

  • Initialize
  • Transition
  • Finalize
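The three UDA callbacks map directly onto IGD: initialize sets the model, transition takes one gradient step per tuple, and finalize emits the result. A hypothetical Python sketch of this mapping (real UDAs are registered with the database and typically written in C; the class and method names here are illustrative):

```python
class IGDAggregate:
    """Sketch of IGD expressed as a user-defined aggregate.  The RDBMS
    calls initialize() once, transition() once per tuple scanned, and
    finalize() when the scan completes."""

    def __init__(self, grad, dim, alpha):
        self.grad = grad    # per-example gradient supplied by the task
        self.dim = dim
        self.alpha = alpha
        self.w = None

    def initialize(self):
        # Set up the initial aggregation state (the model).
        self.w = [0.0] * self.dim

    def transition(self, x, y):
        # One incremental gradient step for the current tuple.
        g = self.grad(self.w, x, y)
        self.w = [wi - self.alpha * gi for wi, gi in zip(self.w, g)]

    def finalize(self):
        # Emit the trained model as the aggregate's result.
        return self.w
```

Because the gradient function is the only task-specific piece, plugging a new convex task into this aggregate touches just that one argument.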

SLIDE 16

SLIDE 17

Performance Optimizations

  • Data Ordering
  • Parallelizing Gradient Computations
  • Avoiding Shuffling Overhead

SLIDE 18

Data Ordering

  • Data is often clustered in RDBMSs, which can lead to slower convergence.
  • Shuffling at every epoch can be computationally expensive.
  • Solution: shuffle once.
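The shuffle-once idea can be sketched as follows: pay the cost of one random permutation up front, then reuse that same order for every epoch (a minimal sketch; the step function and names are illustrative, not the paper's API):

```python
import random

def shuffle_once_train(data, epochs, step, state=0.0):
    """Shuffle the data a single time, then run every epoch over the
    same randomized order, avoiding a per-epoch shuffle cost."""
    order = list(range(len(data)))
    random.shuffle(order)          # paid once, not once per epoch
    for _ in range(epochs):
        for i in order:            # same randomized order, reused
            state = step(state, data[i])
    return state
```

This breaks the pathological clustered order while keeping each epoch a plain sequential pass over the permuted data.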

SLIDE 19

Parallelizing Gradient Computations

  • Pure UDA - Shared Nothing

Requires a merge function. Can lead to sub-optimal run time results.

  • Shared-Memory UDA

Implemented in the user space. The model to be learned is maintained in shared memory and is concurrently updated by parallel threads.
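A sketch of the shared-memory variant, assuming a Hogwild-style lock-free scheme: one model array shared by all worker threads, updated concurrently with no merge step. This Python version only illustrates the structure (CPython's GIL serializes the updates; the real system runs native threads inside the database), and all names are illustrative:

```python
import threading

def worker(w, rows, grad, alpha):
    # Each worker reads and writes the single shared model directly.
    for x, y in rows:
        g = grad(w, x, y)
        for j in range(len(w)):
            w[j] -= alpha * g[j]   # lock-free update of the shared model

def shared_memory_train(data, grad, dim, alpha, n_threads=4):
    w = [0.0] * dim                # one model in shared memory
    chunks = [data[i::n_threads] for i in range(n_threads)]
    threads = [threading.Thread(target=worker, args=(w, c, grad, alpha))
               for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w                       # no merge function needed
```

Compared with the pure (shared-nothing) UDA, there is nothing to merge at the end; the trade-off is that threads may read slightly stale model values.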

SLIDE 20

Avoiding Shuffling Overhead

  • Shuffling even once might not be feasible for very large datasets.
  • Straightforward reservoir sampling can slow convergence by discarding data items that would have sped it up.

SLIDE 21

Multiplexed Reservoir Sampling

  • Combines the reservoir-sampling idea with the concurrent-update model.
  • Multiplexes gradient steps over both the reservoir sample and the data items that are not placed in the reservoir buffer.
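A single-threaded sketch of the multiplexing idea, under the assumption that a gradient step is applied to every item that does not stay in the reservoir, rather than discarding it (callback names `on_dropped`/`on_sampled` are illustrative, and the real system runs the two update streams concurrently):

```python
import random

def multiplexed_reservoir(stream, k, on_dropped, on_sampled, passes=1):
    """Maintain a size-k uniform reservoir over the stream, but also
    take a gradient step (on_dropped) on every item that does not stay
    in the reservoir.  Extra passes (on_sampled) then run over the
    buffered sample."""
    reservoir = []
    for n, item in enumerate(stream):
        if len(reservoir) < k:
            reservoir.append(item)         # buffer is still filling
        else:
            j = random.randrange(n + 1)    # standard reservoir sampling
            if j < k:
                dropped = reservoir[j]     # evicted item still gets used
                reservoir[j] = item
            else:
                dropped = item             # item never enters the buffer
            on_dropped(dropped)            # step on the non-retained item
    for _ in range(passes):
        for item in reservoir:             # extra epochs over the sample
            on_sampled(item)
    return reservoir
```

Every data item contributes a gradient step exactly once on the way in, and the buffered sample supports additional epochs without rereading the full dataset.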

SLIDE 22

Multiplexed Reservoir Sampling

SLIDE 23

Evaluation

1) Implement Bismarck over PostgreSQL and two other commercial databases. 2) Compare its performance with the native analytical tools provided by those RDBMSs.

SLIDE 24

Tasks and Datasets

  1. Logistic Regression (LR) – Forest, DBLife
  2. Support Vector Machine (SVM) – Forest, DBLife
  3. Low-Rank Matrix Factorization (LMF) – MovieLens
  4. Conditional Random Fields Labeling (CRF) – CoNLL

SLIDE 25

Benchmarking Results


  Dataset          Task   PostgreSQL           DBMS A               DBMS B (8 segments)
                          BISMARCK  MADlib     BISMARCK  Native     BISMARCK  Native
  Forest (Dense)   LR     8.0       43.5       40.2      489.0      3.7       17.0
  Forest (Dense)   SVM    7.5       140.2      32.7      66.7       3.3       19.2
  DBLife (Sparse)  LR     0.8       N/A        9.8       20.6       2.3       N/A
  DBLife (Sparse)  SVM    1.2       N/A        11.6      4.8        4.1       N/A
  MovieLens        LMF    36.0      29325.7    394.7     N/A        11.9      17431.3

SLIDE 26

Impact of Data Ordering

SLIDE 27

Scalability Test

SLIDE 28

Strengths

  1. Incorporating a new task requires changing only a few lines of code.
  2. Shorter development cycles.
  3. Performance optimizations are generic, applying across tasks.

SLIDE 29

Weaknesses

  • The effect of data clustering on the convergence rate is established only through theoretical inference.
  • Applies only to analytical tasks that can be expressed as convex optimization problems.

SLIDE 30

Reflections

SLIDE 31

References
