Kerim Y. Oktay, Vaibhav Khadilkar, Bijit Hore, Murat Kantarcioglu, - - PowerPoint PPT Presentation



SLIDE 1

Kerim Y. Oktay, Vaibhav Khadilkar, Bijit Hore, Murat Kantarcioglu, Sharad Mehrotra, Bhavani Thuraisingham

SLIDE 2

Cloud Computing

• Cloud computing, like Software as a Service and the DAS model, offers many advantages:
  • Better availability
  • Reduced costs
  • Unlimited scalability and elasticity

[Figure: cloud computing services: database, app server, code, email, multimedia]

SLIDE 3

Hybrid Cloud

Integrates local infrastructure with public cloud resources

Extra Advantages

  • The flexibility of shifting workload to the public cloud when the private cloud is overwhelmed (cloud bursting)
  • Utilizing in-house resources along with public resources

Cons

  • Sensitive data exposure
  • Public cloud resource allocation cost (both storage and computing)

[Figure: hybrid cloud spanning a public/external cloud and a private/internal cloud]

SLIDE 4

Data & Computation Partitioning Challenge

Student

s_id  name     ssn   dept
1     James    1234  CS
2     Charlie  4321  EE
3     John     5645  CS
4     Matt     8743  ECON

Q1: SELECT name, ssn FROM Student
Q2: SELECT dept, count(*) FROM Student GROUP BY dept

  • Q1 contains sensitive information
  • Q2 execution is more expensive

How to partition the table? How to split the computation?

[Figure labels: Constraints, Sensitive]
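The two queries can be tried against the four-row Student table with a short sketch using Python's built-in sqlite3 module. The in-memory database, the helper names, and the choice of ssn as the only sensitive column are assumptions made for illustration; the slide does not state exactly which columns are marked sensitive.

```python
import sqlite3

# Assumption: only ssn is treated as sensitive (not spelled out on the slide).
SENSITIVE_COLUMNS = {"ssn"}

ROWS = [
    (1, "James", 1234, "CS"),
    (2, "Charlie", 4321, "EE"),
    (3, "John", 5645, "CS"),
    (4, "Matt", 8743, "ECON"),
]


def build_student_db():
    """Create an in-memory copy of the Student table from the slide."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Student (s_id INTEGER, name TEXT, ssn INTEGER, dept TEXT)"
    )
    conn.executemany("INSERT INTO Student VALUES (?, ?, ?, ?)", ROWS)
    return conn


def touches_sensitive(selected_columns):
    """Return True if a query reads at least one sensitive column."""
    return bool(SENSITIVE_COLUMNS & set(selected_columns))


conn = build_student_db()
q1_rows = conn.execute("SELECT name, ssn FROM Student").fetchall()
q2_counts = dict(
    conn.execute("SELECT dept, count(*) FROM Student GROUP BY dept").fetchall()
)

print(touches_sensitive(["name", "ssn"]))  # Q1 reads ssn
print(touches_sensitive(["dept"]))         # Q2 reads only dept
print(q2_counts)
```

Under this assumption Q1 would have to stay on the private side, while Q2, which only aggregates over dept, is a candidate for the public side.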

SLIDE 5

Our Hybrid Cloud Architecture

[Figure: architecture diagram. Inputs: Relations R, Queries Q, Constraints C. Layers: User Interface Layer, Statistics Gathering Layer, Data and Query Management Layer. Private cloud (Hive / Hadoop / HDFS): labeled R, Qpriv and "Results for Qpriv". Public cloud (Hive / Hadoop / HDFS): labeled Rpub, Qpub and "Results for Qpub".]

SLIDE 6

Design Spectrum

• Data Model
  • Relational, Semi-structured, Key-Value Stores, Text
• Sensitivity Model
  • Attribute Level, Privacy Associations, View-Based
• Partitioning Models
  • Workload Partitioning, Intra-query Parallelism, Dynamic Workload
• Minimization Priority
  • Running Time, Sensitive Data Disclosure, Monetary Cost

SLIDE 7

Outline of Solution

• Notation
• Formulate the Computation Partitioning Problem (CPP)
• Solution to CPP
• Experimental Results

SLIDE 8

Notation

  • sens(R'): the estimated number of sensitive cells in dataset R'
  • baseTables(q): the estimated minimum set of data items necessary to answer query q ∈ Q
  • runT_x(q): the estimated running time of query q ∈ Q at site x (either public or private)
  • ORunT(Q', Q''): the overall execution time of the queries in Q', given that the queries in Q'' are executed on the public cloud

\[
ORunT(Q', Q'') = \max\left( \sum_{q \in Q''} freq(q) \times runT_{pub}(q),\; \sum_{q \in Q' \setminus Q''} freq(q) \times runT_{priv}(q) \right)
\]
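The overall running time takes the larger of the two per-site totals, since the public and private sides work in parallel. The sketch below illustrates this with made-up numbers; the dictionary layout is an assumption, and freq(q) is read here as the number of times q occurs in the workload, which the slide does not define explicitly.

```python
def orunt(workload, public, freq, run_pub, run_priv):
    """Overall running time of `workload` when the queries in `public`
    run on the public cloud and all remaining queries run privately."""
    pub_time = sum(freq[q] * run_pub[q] for q in workload if q in public)
    priv_time = sum(freq[q] * run_priv[q] for q in workload if q not in public)
    # Both sites run concurrently, so the slower side dominates.
    return max(pub_time, priv_time)


# Hypothetical workload: frequencies and per-site running times in seconds.
workload = ["q1", "q2", "q3"]
freq = {"q1": 1, "q2": 2, "q3": 1}
run_pub = {"q1": 10, "q2": 5, "q3": 20}
run_priv = {"q1": 50, "q2": 25, "q3": 100}

print(orunt(workload, set(), freq, run_pub, run_priv))         # everything private
print(orunt(workload, {"q3"}, freq, run_pub, run_priv))        # q3 public
print(orunt(workload, {"q1", "q3"}, freq, run_pub, run_priv))  # q1 and q3 public
```

With these numbers, moving more work to the public side keeps shrinking the private total until the two sides are roughly balanced.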

SLIDE 9

Detailed Hybrid Cloud Architecture

[Figure: detailed architecture diagram. Inputs: Relations R, Queries Q, Constraints C. Statistics Gathering Layer: labeled runT_x(q), baseTables(q). Data and Query Management Layer: Computation Partitioning Module, Monetary Cost Estimator, Disclosure Risk Estimator, and a label "SR". Private cloud (Hive / Hadoop / HDFS): labeled R, Qpriv. Public cloud (Hive / Hadoop / HDFS): labeled Rpub, Qpub.]

SLIDE 10

Computation Partitioning Problem (CPP)

• Find a subset of the given query workload, Qpub ⊆ Q, and a subset of the given dataset, Rpub ⊆ R, such that:

\[
\begin{aligned}
\text{minimize} \quad & ORunT(Q, Q_{pub}) \\
\text{subject to} \quad & (1)\; store(R_{pub}) + \sum_{q \in Q_{pub}} freq(q) \times proc(q) \le MC \\
& (2)\; sens(R_{pub}) \le DC \\
& (3)\; \forall q \in Q_{pub}:\; baseTables(q) \subseteq R_{pub}
\end{aligned}
\]

• MC and DC are user-defined constraints
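The three constraints can be checked mechanically for any candidate pair (Qpub, Rpub). The following is a minimal sketch under stated assumptions: data items are whole tables, storage cost and sensitive-cell counts add up across tables, and all names and numbers are hypothetical.

```python
def store(r_pub, tables):
    """Storage monetary cost of the public partition (assumed additive)."""
    return sum(tables[t]["store"] for t in r_pub)


def sens(r_pub, tables):
    """Estimated number of sensitive cells in the public partition."""
    return sum(tables[t]["sens"] for t in r_pub)


def is_feasible(q_pub, r_pub, queries, tables, mc, dc):
    """Check constraints (1)-(3) for a candidate public partition."""
    money = store(r_pub, tables) + sum(
        queries[q]["freq"] * queries[q]["proc"] for q in q_pub
    )
    within_budget = money <= mc                    # constraint (1)
    within_risk = sens(r_pub, tables) <= dc        # constraint (2)
    covered = all(                                 # constraint (3)
        queries[q]["base_tables"] <= r_pub for q in q_pub
    )
    return within_budget and within_risk and covered


# Hypothetical statistics for two tables and two queries.
tables = {
    "customer": {"store": 5, "sens": 30},
    "lineitem": {"store": 20, "sens": 10},
}
queries = {
    "q1": {"freq": 2, "proc": 3, "base_tables": {"lineitem"}},
    "q2": {"freq": 1, "proc": 4, "base_tables": {"customer", "lineitem"}},
}

print(is_feasible({"q1"}, {"lineitem"}, queries, tables, mc=30, dc=15))
print(is_feasible({"q1", "q2"}, {"lineitem"}, queries, tables, mc=100, dc=100))
```

The second call fails only because q2 needs the customer table, which is not part of the public partition.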

SLIDE 11

Metrics in CPP

• Query Execution Time (runT_x(q))
• Monetary Costs
  • store(Rpub): storage monetary cost of the public cloud partition
  • proc(q): processing monetary cost of a public-side query q
• Sensitive Data Disclosure Risk (sens(Rpub))
  • Estimated number of sensitive cells within Rpub

\[
runT_x(q) = \frac{\sum_{operator \in q} \left( inpSize(operator) + outSize(operator) \right)}{w_x}
\]
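A running-time estimate of this kind can be sketched as total operator I/O divided by the site weight w_x. The reading of the formula (input size plus output size, summed over a query's operators) is an assumption based on the surviving symbols inpSize, outSize, operator and w_x; the weights are the values reported in the experimental setting, and the operator sizes are invented.

```python
# Site weights in MB/sec, as reported in the experimental setting.
W_MB_PER_SEC = {"pub": 40.0, "priv": 8.0}


def run_time(operators, site):
    """Estimate the running time in seconds of one query at `site`.

    `operators` is a list of (input_mb, output_mb) pairs, one pair per
    operator in the query plan (hypothetical representation).
    """
    total_io_mb = sum(inp + out for inp, out in operators)
    return total_io_mb / W_MB_PER_SEC[site]


# Hypothetical two-operator plan: a scan/filter followed by an aggregation.
plan = [(1000, 200), (200, 40)]

print(run_time(plan, "pub"))
print(run_time(plan, "priv"))
```

Because the weights differ by a factor of five, the same plan is estimated to take five times longer on the private side.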

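SLIDE 12

Solution to CPP

• CPP can be simplified to finding only Qpub (given Qpub, the smallest Rpub that satisfies constraint (3) is the union of baseTables(q) over q ∈ Qpub)
• Dynamic programming approach
• CPP(Q, MC, DC) = Qpub
  • Input: query set Q, monetary constraint MC, disclosure constraint DC
  • Output: Qpub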

SLIDE 13

Example

Q = {q1, q2, q3}

• If MC < 25 or DC < 20
  • CPP({q1, q2, q3}, MC, DC) = CPP({q1, q2}, MC, DC)
  • q3 can only run on the private side.

SLIDE 14

Example

Q = {q1, q2, q3}

If q3 can run on both sides:

• Case 1: What if q3 runs on the private side?
  • CPP({q1, q2, q3}, MC, DC) = CPP({q1, q2}, MC, DC)

SLIDE 15

Example

Q = {q1, q2, q3}, Q_2 = {q1, q2}

• Case 2: What if q3 runs on the public side?
  • CPP(Q, MC, DC) = MIN_TIME(CPP(Q_2, j, k) + q3)
  • where MC-25 ≤ j ≤ MC-15 (max-min possible monetary cost of q3) and DC-20 ≤ k ≤ DC-0 (max-min possible disclosure risk of q3)
• Choose the minimum overall running time between Case 1 and Case 2
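The two-case analysis (last query private versus last query public) can be sketched as a plain recursion. This is an exhaustive illustration rather than the dynamic program itself: it assumes each query has one fixed monetary cost and one fixed disclosure risk when placed on the public side, whereas the slides bound these by a min-max range, and it does not memoize over the budget values j and k. All names and numbers are hypothetical.

```python
def cpp(queries, mc, dc):
    """Return (overall_time, public_set) for the best feasible split.

    `queries` is a list of (name, time_pub, time_priv, cost, risk) tuples.
    """
    def rec(i, mc_left, dc_left, pub_time, priv_time):
        if i < 0:
            # Both sides run in parallel; the slower one decides.
            return max(pub_time, priv_time), frozenset()
        name, t_pub, t_priv, cost, risk = queries[i]
        # Case 1: query i stays on the private side.
        best = rec(i - 1, mc_left, dc_left, pub_time, priv_time + t_priv)
        # Case 2: query i moves to the public side, if the budgets allow it.
        if cost <= mc_left and risk <= dc_left:
            time, chosen = rec(
                i - 1, mc_left - cost, dc_left - risk, pub_time + t_pub, priv_time
            )
            if time < best[0]:
                best = (time, chosen | {name})
        return best

    return rec(len(queries) - 1, mc, dc, 0, 0)


# Hypothetical workload: (name, time_pub, time_priv, cost, risk).
workload = [
    ("q1", 10, 40, 10, 5),
    ("q2", 20, 60, 15, 10),
    ("q3", 30, 90, 25, 20),
]

print(cpp(workload, mc=40, dc=30))
print(cpp(workload, mc=20, dc=30))  # budget too small for q3 on the public side
```

With the smaller budget, q3 is forced onto the private side and the choice reduces to the sub-problem over q1 and q2, which mirrors the first example case.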

SLIDE 16

Experimental Setting

• Experimental Setting
  • Private cloud: 14 nodes, located at UTD; Pentium IV, 4 GB RAM, 290-320 GB disk space
  • Public cloud: 38 nodes, located at UCI; AMD dual core, 8 GB RAM, 631 GB disk space
  • Hadoop 0.20.2 and Hive 0.7.1
• Dataset and Statistics Collection
  • 100 GB TPC-H data
• Query Workload
  • 40 queries containing modified versions of TPC-H queries Q1, Q3, Q6, and Q11

SLIDE 17

Experimental Setting

• Estimation of Weight (w_x)
  • Running all 22 TPC-H queries for a 300 GB dataset
  • w_pub ≈ 40 MB/sec, w_priv ≈ 8 MB/sec
• Resource Allocation Cost
  • Amazon S3 pricing for storage and communication
    • Storage = $0.140/GB + PUT, Communication = $0.120/GB + GET
    • PUT = $0.01 per 1,000 requests, GET = $0.01 per 10,000 requests
  • Amazon EC2 and EMR pricing for processing
    • $0.085 + $0.015 = $0.1/hour
• Sensitivity
  • Customer: c_name, c_phone, c_address attributes
  • Lineitem: all attributes in 1%, 5%, or 10% of tuples
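The listed prices translate directly into simple cost functions. The sketch below uses only the rates quoted on the slide; the function names, the request counts, and the reading of the processing price as a per-hour rate for a single instance are assumptions for illustration.

```python
# Rates quoted on the slide (US dollars).
S3_STORAGE_PER_GB = 0.140
S3_TRANSFER_PER_GB = 0.120
PUT_PER_REQUEST = 0.01 / 1000
GET_PER_REQUEST = 0.01 / 10000
PROCESSING_PER_HOUR = 0.085 + 0.015  # EC2 plus EMR


def storage_cost(gigabytes, put_requests):
    """Cost of storing data on the public side, including PUT requests."""
    return gigabytes * S3_STORAGE_PER_GB + put_requests * PUT_PER_REQUEST


def communication_cost(gigabytes, get_requests):
    """Cost of transferring data, including GET requests."""
    return gigabytes * S3_TRANSFER_PER_GB + get_requests * GET_PER_REQUEST


def processing_cost(hours):
    """Cost of public-side processing (assumed: one instance, per hour)."""
    return hours * PROCESSING_PER_HOUR


# Hypothetical usage: 100 GB stored with 2,000 PUTs, 10 GB fetched with
# 10,000 GETs, and 5 hours of processing.
print(round(storage_cost(100, 2000), 2))
print(round(communication_cost(10, 10000), 2))
print(round(processing_cost(5), 2))
```

Functions like these could supply the storage and processing terms of the monetary constraint once the public partition and query set are fixed.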

SLIDE 18

Experimental Results

SLIDE 19

Experimental Results

SLIDE 20

Future Work

• Extend the work to enable intra-query parallelism
• Support dynamically changing (or arriving) workloads
• Extend this work to other cloud computing technologies
• Support different sensitivity models