Kerim Y. Oktay, Vaibhav Khadilkar, Bijit Hore, Murat Kantarcioglu, Sharad Mehrotra, Bhavani Thuraisingham
Cloud Computing
Like Software as a Service and the DAS model, cloud computing offers many advantages:
Better availability
Reduced costs
Unlimited scalability and elasticity
[Figure: the cloud hosting a database, app server, code, email, and multimedia]
Hybrid Cloud
Integrates local infrastructure with public cloud resources
Extra Advantages
Public cloud resources can be used when the private cloud is overwhelmed (Cloud Bursting)
Cons
[Figure: Public/External cloud and Private/Internal cloud]
Constraints
Student
s_id  name     ssn   dept
1     James    1234  CS
2     Charlie  4321  EE
3     John     5645  CS
4     Matt     8743  ECON
Q1: SELECT name, ssn FROM Student
Q2: SELECT dept, count(*) FROM Student GROUP BY dept
[Figure label: Sensitive]
How to partition the table?
How to split computation?
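One simple way to explore the two questions above is a strict column split, sketched below in Python. It assumes that only the ssn column is sensitive, which the slide does not state, and the helper names (public_partition, can_run_publicly) are hypothetical. It is deliberately stricter than the formulation later in the deck, which bounds disclosure with a constraint rather than forbidding it.

```python
from collections import Counter

# Rows of the Student table from the slide.
STUDENT = [
    {"s_id": 1, "name": "James", "ssn": 1234, "dept": "CS"},
    {"s_id": 2, "name": "Charlie", "ssn": 4321, "dept": "EE"},
    {"s_id": 3, "name": "John", "ssn": 5645, "dept": "CS"},
    {"s_id": 4, "name": "Matt", "ssn": 8743, "dept": "ECON"},
]

# Assumption: only ssn is sensitive (not stated on the slide).
SENSITIVE_COLUMNS = {"ssn"}


def public_partition(rows, sensitive):
    """Return a copy of the table with the sensitive columns dropped."""
    return [{k: v for k, v in row.items() if k not in sensitive} for row in rows]


def can_run_publicly(query_columns, sensitive):
    """A query may run on the public side if it reads no sensitive column."""
    return not (set(query_columns) & set(sensitive))


Q1_COLUMNS = ["name", "ssn"]  # SELECT name, ssn FROM Student
Q2_COLUMNS = ["dept"]         # SELECT dept, count(*) FROM Student GROUP BY dept

public_rows = public_partition(STUDENT, SENSITIVE_COLUMNS)

# Q2 can be answered entirely from the public partition.
q2_result = Counter(row["dept"] for row in public_rows)
```

Under this assumption Q2 can run on the public side, while Q1 reads ssn and stays private.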
[Figure: system architecture. Inputs: Relations R, Queries Q, Constraints C. Layers: User Interface Layer, Statistics Gathering Layer, Data and Query Management Layer. Two Hive/Hadoop/HDFS stacks: Private (R, Qpriv) and Public (Rpub, Qpub). Outputs: Results for Qpriv, Results for Qpub.]
Data Model
Sensitivity Model
Partitioning Models
Partitioning, Intra-query Parallelism, Dynamic Workload
Minimization Priority
Cost
dataset R’
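baseTables(q): the tables necessary to answer query q ∈ Q
runT_x(q): the running time of query q at site x (either public or private)
ORunT(Q', Q''): the overall running time of the queries in Q', given that queries in Q'' are executed on the public cloud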
$$
\mathrm{ORunT}(Q', Q'') = \max\left( \sum_{q \in Q''} \mathrm{freq}(q) \times \mathrm{runT}_{pub}(q),\ \sum_{q \in Q' - Q''} \mathrm{freq}(q) \times \mathrm{runT}_{priv}(q) \right)
$$
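The overall running time on this slide is the larger of two frequency-weighted totals, one over the queries run on the public side and one over those run on the private side, presumably because the two sides work in parallel. A minimal Python sketch follows. The function name, the input layout, and the example numbers are all hypothetical.

```python
def overall_running_time(queries, public_ids):
    """Return max(total public time, total private time).

    queries:    dict mapping query id -> (freq, run_time_public, run_time_private)
    public_ids: set of query ids executed on the public cloud
    """
    pub_total = sum(
        freq * run_pub
        for qid, (freq, run_pub, _run_priv) in queries.items()
        if qid in public_ids
    )
    priv_total = sum(
        freq * run_priv
        for qid, (freq, _run_pub, run_priv) in queries.items()
        if qid not in public_ids
    )
    return max(pub_total, priv_total)


# Hypothetical workload: (frequency, seconds on public, seconds on private).
WORKLOAD = {
    "q1": (2, 10, 50),
    "q2": (1, 20, 100),
    "q3": (3, 5, 25),
}

all_private = overall_running_time(WORKLOAD, set())
split = overall_running_time(WORKLOAD, {"q1", "q2"})
```

With these made-up numbers, moving q1 and q2 to the public side lowers the overall time from 275 to 75 seconds, and the private side remains the bottleneck.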
[Figure: detailed system architecture. Inputs: Relations R, Queries Q, Constraints C. Statistics Gathering Layer (runT_x(q), baseTables(q)). Data and Query Management Layer: Computation Partitioning Module, Monetary Cost Estimator, Disclosure Risk Estimator, SR. Two Hive/Hadoop/HDFS stacks: Private (R, Qpriv) and Public (Rpub, Qpub).]
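MC and DC are user-defined constraints, with Qpub ⊆ Q and Rpub ⊆ R.
$$
\begin{aligned}
\text{minimize}\quad & \mathrm{ORunT}(Q, Q_{pub}) \\
\text{subject to}\quad & (1)\ \mathrm{store}(R_{pub}) + \sum_{q \in Q_{pub}} \mathrm{freq}(q) \times \mathrm{proc}(q) \le MC \\
& (2)\ \mathrm{sens}(R_{pub}) \le DC \\
& (3)\ \forall q \in Q_{pub}:\ \mathrm{baseTables}(q) \subseteq R_{pub}
\end{aligned}
$$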
Query Execution Time (runT_x(q))
Monetary Costs: storing the public partition (store(Rpub)) and processing each query q (proc(q))
Sensitive Data Disclosure Risk (sens(Rpub))
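For a handful of queries, the trade-off among running time, monetary cost, and disclosure risk can be explored by exhaustive search. The Python sketch below is not the dynamic programming approach presented in the deck. It assumes that the public partition is exactly the union of the base tables of the public queries, and that storage cost and sensitivity add up per table. The function name, data layout, and numbers are hypothetical.

```python
from itertools import combinations


def solve_cpp(queries, tables, mc, dc):
    """Try every candidate public query set; return (best_time, best_public_set).

    queries: id -> dict(freq, run_pub, run_priv, proc, base_tables)
    tables:  name -> dict(store, sens)   # assumed additive per table
    mc, dc:  monetary and disclosure constraints
    """
    ids = sorted(queries)
    best_time, best_set = float("inf"), None
    for k in range(len(ids) + 1):
        for subset in combinations(ids, k):
            q_pub = set(subset)
            # Public partition: the base tables of every public query.
            r_pub = set().union(*(queries[q]["base_tables"] for q in q_pub))
            money = sum(tables[t]["store"] for t in r_pub) + sum(
                queries[q]["freq"] * queries[q]["proc"] for q in q_pub
            )
            risk = sum(tables[t]["sens"] for t in r_pub)
            if money > mc or risk > dc:
                continue  # violates a user-defined constraint
            pub_time = sum(queries[q]["freq"] * queries[q]["run_pub"] for q in q_pub)
            priv_time = sum(
                queries[q]["freq"] * queries[q]["run_priv"] for q in ids if q not in q_pub
            )
            overall = max(pub_time, priv_time)
            if overall < best_time:
                best_time, best_set = overall, q_pub
    return best_time, best_set


# Hypothetical tables and queries.
TABLES = {"A": {"store": 5, "sens": 0}, "B": {"store": 10, "sens": 20}}
QUERIES = {
    "q1": {"freq": 1, "run_pub": 10, "run_priv": 40, "proc": 5, "base_tables": {"A"}},
    "q2": {"freq": 1, "run_pub": 10, "run_priv": 40, "proc": 5, "base_tables": {"A"}},
    "q3": {"freq": 1, "run_pub": 20, "run_priv": 80, "proc": 10, "base_tables": {"A", "B"}},
}

loose = solve_cpp(QUERIES, TABLES, mc=100, dc=100)
tight = solve_cpp(QUERIES, TABLES, mc=100, dc=10)
```

With loose constraints the sketch sends q1 and q3 to the public side. With DC = 10, table B cannot be exposed, so q3 stays private. The search is exponential in the number of queries, so it only serves to illustrate the formulation on small inputs.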
$$
\mathrm{runT}_x(q) = \frac{\mathrm{inpSize}(q)}{w_x}
$$
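Reading the running-time estimate as a query's input size divided by the weight w_x of the site (an interpretation, consistent with the weights being reported in MB/sec later in the deck), a small Python sketch looks like this. The function name and the 1000 MB input size are hypothetical. The two weights are the ones the deck reports.

```python
# Site weights reported in the deck, in MB/sec.
W_PUB = 40.0
W_PRIV = 8.0


def estimate_run_time(inp_size_mb, weight_mb_per_sec):
    """Estimate running time in seconds as input size divided by site weight."""
    if weight_mb_per_sec <= 0:
        raise ValueError("weight must be positive")
    return inp_size_mb / weight_mb_per_sec


# Hypothetical query that reads 1000 MB of input.
pub_seconds = estimate_run_time(1000.0, W_PUB)
priv_seconds = estimate_run_time(1000.0, W_PRIV)
```

Under this reading, the same query is estimated to take five times longer on the private side than on the public side, matching the ratio of the two weights.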
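CPP can be simplified to only finding Qpub, since Rpub can then be taken as the tables the public queries need, i.e. the union of baseTables(q) over q ∈ Qpub.
Dynamic Programming Approach
CPP(Q, MC, DC) = Qpub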
Input: query set Q, monetary constraint MC, disclosure constraint DC
Output: Qpub
If MC < 25 or DC < 20, q3 can only run on the private side.
What if q3 runs on the private side?
where MC-25 ≤ i ≤ MC-15 and DC-20 ≤ j ≤ DC-0 (i and j being the monetary and disclosure budgets left for the remaining queries)
Choose the minimum overall running time
What if q3 runs on the public side?
Max-min possible monetary cost by q3
Max-min possible disclosure risk by q3
Experimental Setting
Private Cloud: 14 nodes, located at UTD, Pentium IV, 4GB RAM, 290-320GB disk space
Public Cloud: 38 nodes, located at UCI, AMD Dual Core, 8GB RAM, 631GB disk space
Hadoop 0.20.2 and Hive 0.7.1
Dataset and Statistic Collection
100GB TPC-H Data
Query Workload
40 queries containing modified versions of Q1, Q3, Q6, Q11
Estimation of Weight (w_x)
Running all 22 TPC-H queries for a 300GB dataset: w_pub ≈ 40MB/sec, w_priv ≈ 8MB/sec
Resource Allocation Cost
Amazon S3 Pricing for storage and communication
Storage = $0.140/GB + PUT, Communication = $0.120/GB + GET
PUT = $0.01/1000 requests, GET = $0.01/10000 requests
Amazon EC2 and EMR Pricing for processing
$0.085 + $0.015 = $0.1/hour
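The pricing figures above can be combined into a rough cost estimator. The Python sketch below assumes that the hourly rate applies per instance and that request counts are known. Both are assumptions, as are the function names and the example quantities.

```python
# Prices quoted on the slide (US dollars).
STORAGE_PER_GB = 0.140
COMMUNICATION_PER_GB = 0.120
PUT_PER_REQUEST = 0.01 / 1000
GET_PER_REQUEST = 0.01 / 10000
PROCESSING_PER_HOUR = 0.085 + 0.015  # EC2 + EMR


def storage_cost(gigabytes, put_requests):
    """Storage cost: per-GB charge plus PUT request charges."""
    return gigabytes * STORAGE_PER_GB + put_requests * PUT_PER_REQUEST


def communication_cost(gigabytes, get_requests):
    """Communication cost: per-GB charge plus GET request charges."""
    return gigabytes * COMMUNICATION_PER_GB + get_requests * GET_PER_REQUEST


def processing_cost(hours, instances=1):
    """Processing cost, assuming the hourly rate is charged per instance."""
    return hours * instances * PROCESSING_PER_HOUR


# Hypothetical usage: store 100 GB with 1000 PUTs, run 38 instances for 2 hours.
example_storage = storage_cost(100, 1000)
example_processing = processing_cost(2, instances=38)
```

In the formulation earlier in the deck, estimates of this kind would stand in for store(Rpub) and proc(q).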
Sensitivity
Customer: c_name, c_phone, c_address attributes
Lineitem: all attributes in 1%, 5%, or 10% of tuples
Extend work to enable intra-query parallelism
Support Dynamically Changing (or arriving)
Extend this work to other cloud computing
Support Different Sensitivity Models