
Kerim Y. Oktay, Vaibhav Khadilkar, Bijit Hore, Murat Kantarcioglu, Sharad Mehrotra, Bhavani Thuraisingham



  1. Kerim Y. Oktay, Vaibhav Khadilkar, Bijit Hore, Murat Kantarcioglu, Sharad Mehrotra, Bhavani Thuraisingham

  2. Cloud Computing
      - [Diagram: cloud computing services: app code, server, email, database, multimedia]
      - Models such as Software as a Service and DAS offer many advantages:
          - Better availability
          - Reduced costs
          - Unlimited scalability and elasticity

  3. Hybrid Cloud
      - Integrates local infrastructure with public cloud resources
      - [Diagram: private (internal) cloud combined with public (external) cloud forms the hybrid cloud]
      - Extra advantages
          - The flexibility of shifting workload to the public cloud when the private cloud is overwhelmed (cloud bursting)
          - Utilizing in-house resources along with public resources
      - Cons
          - Sensitive data exposure
          - Public cloud resource allocation cost (both storage and computing)

  4. Data & Computation Partitioning Challenge
      - Table Student (labeled "Sensitive" on the slide):

        | s_id | name    | ssn  | dept |
        |------|---------|------|------|
        | 1    | James   | 1234 | CS   |
        | 2    | Charlie | 4321 | EE   |
        | 3    | John    | 5645 | CS   |
        | 4    | Matt    | 8743 | ECON |

      - Q1: SELECT name, ssn FROM Student
      - Q2: SELECT dept, count(*) FROM Student GROUP BY dept
      - How to split the computation? How to partition the table?
      - Constraints
          - Q1 contains sensitive information
          - Q2 execution is more expensive

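  5. Our Hybrid Cloud Architecture
      - [Architecture diagram]
      - Inputs: queries \(Q\), constraints \(C\), relations \(R\)
      - Layers: User Interface Layer, Statistics Gathering Layer, Data and Query Management Layer
      - Public cloud: receives \(R_{pub}\) and \(Q_{pub}\); runs Hive on Hadoop and HDFS; returns the results for \(Q_{pub}\)
      - Private cloud: receives \(R\) and \(Q_{priv}\); runs Hive on Hadoop and HDFS; returns the results for \(Q_{priv}\)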

  6. Design Spectrum
      - Data model: relational, semi-structured, key-value stores, text
      - Sensitivity model: attribute level, privacy associations, view-based
      - Partitioning models: workload partitioning, intra-query parallelism, dynamic workload
      - Minimization priority: running time, sensitive data disclosure, monetary cost

  7. Outline of Solution
      - Notation
      - Formulate the Computation Partitioning Problem (CPP)
      - Solution to CPP
      - Experimental results

  8. Notation
      - \(sens(R')\): the estimated number of sensitive cells in dataset \(R'\)
      - \(baseTables(q)\): the estimated minimum set of data items necessary to answer query \(q \in Q\)
      - \(runT_x(q)\): the estimated running time of query \(q \in Q\) at site \(x\) (either public or private)
      - \(ORunT(Q', Q'')\): the overall execution time of the queries in \(Q'\), given that the queries in \(Q''\) are executed on the public cloud:

        \[
        ORunT(Q', Q'') = \max \left\{
          \sum_{q \in Q''} freq(q) \times runT_{pub}(q),\;
          \sum_{q \in Q' \setminus Q''} freq(q) \times runT_{priv}(q)
        \right\}
        \]
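The definition of \(ORunT\) can be illustrated with a minimal Python sketch. It assumes the public and private sites run in parallel, so the slower site determines the overall time, and that \(freq(q)\) is the number of times query \(q\) occurs in the workload. The function name `orun_t` and all numbers are hypothetical and chosen only for illustration.

```python
from typing import Dict, Set


def orun_t(workload: Set[str], public: Set[str],
           freq: Dict[str, int],
           run_pub: Dict[str, float],
           run_priv: Dict[str, float]) -> float:
    """ORunT(Q', Q''): the maximum of the public-side and private-side totals."""
    # Queries in Q'' run on the public cloud.
    pub_time = sum(freq[q] * run_pub[q] for q in public)
    # The remaining queries (Q' minus Q'') run on the private cloud.
    priv_time = sum(freq[q] * run_priv[q] for q in workload - public)
    return max(pub_time, priv_time)


# Hypothetical workload statistics (running times in seconds).
freq = {"q1": 2, "q2": 1, "q3": 1}
run_pub = {"q1": 10.0, "q2": 20.0, "q3": 30.0}
run_priv = {"q1": 50.0, "q2": 100.0, "q3": 150.0}
workload = {"q1", "q2", "q3"}

all_private = orun_t(workload, set(), freq, run_pub, run_priv)
split = orun_t(workload, {"q2", "q3"}, freq, run_pub, run_priv)
print(all_private, split)  # 350.0 100.0
```

With everything on the private side the estimate is 350 seconds; moving `q2` and `q3` to the public side lowers it to 100 seconds, because the private side is then left with only the two runs of `q1`.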

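  9. Detailed Hybrid Cloud Architecture
      - [Architecture diagram]
      - Inputs: queries \(Q\), constraints \(C\), relations \(R\), SR
      - Statistics Gathering Layer: produces \(runT_x(q)\) and \(baseTables(q)\)
      - Data and Query Management Layer: Monetary Cost Estimator, Computation Partitioning Module, Disclosure Risk Estimator
      - Public cloud: receives \(R_{pub}\) and \(Q_{pub}\); runs Hive on Hadoop and HDFS
      - Private cloud: receives \(R\) and \(Q_{priv}\); runs Hive on Hadoop and HDFS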

  10. Computation Partitioning Problem (CPP)
      - Find a subset of the given query workload, \(Q_{pub} \subseteq Q\), and a subset of the given dataset, \(R_{pub} \subseteq R\), that solve:

        \[
        \begin{aligned}
        \text{minimize}\quad & ORunT(Q, Q_{pub}) \\
        \text{subject to}\quad
          & (1)\;\; store(R_{pub}) + \sum_{q \in Q_{pub}} freq(q) \times proc(q) \le MC \\
          & (2)\;\; sens(R_{pub}) \le DC \\
          & (3)\;\; \forall q \in Q_{pub}:\; baseTables(q) \subseteq R_{pub}
        \end{aligned}
        \]

      - \(MC\) and \(DC\) are user-defined constraints
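To make the three constraints concrete, here is a hedged Python sketch of a feasibility check for one candidate pair \((Q_{pub}, R_{pub})\). It assumes that storage cost and sensitive-cell counts add up per table, which the slides do not state; the function name `is_feasible`, the table names, and all numbers are hypothetical.

```python
from typing import Dict, Set


def is_feasible(q_pub: Set[str], r_pub: Set[str],
                freq: Dict[str, int], proc: Dict[str, float],
                base_tables: Dict[str, Set[str]],
                store_cost: Dict[str, float], sens_cells: Dict[str, int],
                mc: float, dc: int) -> bool:
    """Check constraints (1)-(3) of CPP for one candidate partition."""
    # (3) every public query must find all of its base tables in R_pub.
    if any(not base_tables[q] <= r_pub for q in q_pub):
        return False
    # (1) storage cost plus frequency-weighted processing cost within MC.
    # Assumption: store(R_pub) is the sum of per-table storage costs.
    money = (sum(store_cost[r] for r in r_pub)
             + sum(freq[q] * proc[q] for q in q_pub))
    # (2) number of sensitive cells exposed within DC.
    # Assumption: sens(R_pub) is the sum of per-table sensitive-cell counts.
    risk = sum(sens_cells[r] for r in r_pub)
    return money <= mc and risk <= dc


# Hypothetical statistics for two tables and two queries.
store_cost = {"customer": 2.0, "lineitem": 10.0}
sens_cells = {"customer": 300, "lineitem": 50}
base_tables = {"q1": {"lineitem"}, "q2": {"customer", "lineitem"}}
freq = {"q1": 3, "q2": 1}
proc = {"q1": 1.0, "q2": 2.0}

ok = is_feasible({"q1"}, {"lineitem"}, freq, proc, base_tables,
                 store_cost, sens_cells, mc=15.0, dc=100)
too_risky = is_feasible({"q1", "q2"}, {"customer", "lineitem"}, freq, proc,
                        base_tables, store_cost, sens_cells, mc=20.0, dc=100)
missing_table = is_feasible({"q2"}, {"lineitem"}, freq, proc, base_tables,
                            store_cost, sens_cells, mc=100.0, dc=1000)
print(ok, too_risky, missing_table)  # True False False
```

The second candidate stays within the monetary budget (17 against 20) but exposes 350 sensitive cells against a limit of 100; the third violates constraint (3) because `customer` is not on the public side.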

  11. Metrics in CPP
      - Query execution time \(runT_x(q)\):

        \[
        runT_x(q) = \frac{\sum_{operator \in q} \big( inpSize(operator) + outSize(operator) \big)}{w_x}
        \]

      - Monetary costs
          - \(store(R_{pub})\): storage monetary cost of the public cloud partition
          - \(proc(q)\): processing monetary cost of a public-side query \(q\)
      - Sensitive data disclosure risk \(sens(R_{pub})\)
          - Estimated number of sensitive cells within \(R_{pub}\)
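The execution-time metric can be sketched in Python as follows. The sketch assumes the formula is read as the sum of every operator's input and output sizes divided by the site weight \(w_x\), with sizes in MB and \(w_x\) in MB/sec; this reading is an interpretation of the slide rather than a confirmed definition. The function name `estimate_run_time` and the operator sizes are hypothetical, while the two weights are the approximate values reported later in the experimental setting.

```python
from typing import List, Tuple


def estimate_run_time(operators: List[Tuple[float, float]],
                      w_mb_per_sec: float) -> float:
    """Estimate runT_x(q) in seconds.

    operators: one (input size in MB, output size in MB) pair per operator.
    w_mb_per_sec: the site weight w_x, read here as a throughput in MB/sec.
    """
    total_mb = sum(inp + out for inp, out in operators)
    return total_mb / w_mb_per_sec


# Hypothetical query plan: a scan that reads 4000 MB and emits 800 MB,
# followed by an aggregation that reads those 800 MB and emits almost nothing.
plan = [(4000.0, 800.0), (800.0, 0.0)]

t_pub = estimate_run_time(plan, 40.0)   # w_pub is about 40 MB/sec
t_priv = estimate_run_time(plan, 8.0)   # w_priv is about 8 MB/sec
print(t_pub, t_priv)  # 140.0 700.0
```

Under these assumptions the same plan is estimated at 140 seconds on the public side and 700 seconds on the private side, which is the kind of gap that makes shifting work to the public cloud attractive.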

  12. Solution to CPP
      - CPP can be simplified to only finding \(Q_{pub}\)
      - Dynamic programming approach: \(CPP(Q, MC, DC) = Q_{pub}\)
          - Input: query set \(Q\), monetary constraint \(MC\), disclosure constraint \(DC\)
          - Output: \(Q_{pub}\)

  13. Example
      - \(Q = \{q_1, q_2, q_3\}\)
      - If \(MC < 25\) or \(DC < 20\), \(q_3\) can only run on the private side:
          - \(CPP(\{q_1, q_2, q_3\}, MC, DC) = CPP(\{q_1, q_2\}, MC, DC)\)

  14. Example
      - \(Q = \{q_1, q_2, q_3\}\)
      - If \(q_3\) can run on both sides:
      - Case 1: what if \(q_3\) runs on the private side?
          - \(CPP(\{q_1, q_2, q_3\}, MC, DC) = CPP(\{q_1, q_2\}, MC, DC)\)

  15. Example
      - \(Q = \{q_1, q_2, q_3\}\), \(Q^2 = \{q_1, q_2\}\)
      - Case 2: what if \(q_3\) runs on the public side?
          - \(CPP(Q, MC, DC) = MIN\_TIME(CPP(Q^2, j, k) + q_3)\), where \(MC - 25 \le j \le MC - 15\) and \(DC - 20 \le k \le DC - 0\)
          - 25 and 15 are the maximum and minimum possible monetary cost incurred by \(q_3\)
          - 20 and 0 are the maximum and minimum possible disclosure risk incurred by \(q_3\)
      - Choose the minimum overall running time between Case 1 and Case 2
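The two cases in the example can be illustrated with a small Python sketch. This is a plain recursive enumeration (exponential in the number of queries), not the authors' dynamic program, which tabulates over the budget ranges \(j\) and \(k\). It assumes \(R_{pub}\) is the union of the base tables of the public queries and that storage cost and sensitive-cell counts add up per table. The name `solve_cpp` and all statistics are hypothetical.

```python
from typing import Dict, FrozenSet, List, Set, Tuple


def solve_cpp(queries: List[str], freq: Dict[str, int],
              run_pub: Dict[str, float], run_priv: Dict[str, float],
              proc: Dict[str, float], base_tables: Dict[str, Set[str]],
              store_cost: Dict[str, float], sens_cells: Dict[str, int],
              mc: float, dc: int) -> Tuple[float, FrozenSet[str]]:
    """Return (best overall running time, Q_pub) by trying both cases per query."""
    best: Tuple[float, FrozenSet[str]] = (float("inf"), frozenset())

    def recurse(i: int, q_pub: FrozenSet[str], r_pub: FrozenSet[str],
                pub_time: float, priv_time: float) -> None:
        nonlocal best
        if i < 0:  # every query has been assigned to a side
            total = max(pub_time, priv_time)
            if total < best[0]:
                best = (total, q_pub)
            return
        q = queries[i]
        # Case 1: q runs on the private side; the budgets are untouched.
        recurse(i - 1, q_pub, r_pub, pub_time, priv_time + freq[q] * run_priv[q])
        # Case 2: q runs on the public side, if both budgets still hold.
        new_r = r_pub | base_tables[q]
        new_q = q_pub | {q}
        money = (sum(store_cost[r] for r in new_r)
                 + sum(freq[p] * proc[p] for p in new_q))
        risk = sum(sens_cells[r] for r in new_r)
        if money <= mc and risk <= dc:
            recurse(i - 1, new_q, new_r, pub_time + freq[q] * run_pub[q], priv_time)

    recurse(len(queries) - 1, frozenset(), frozenset(), 0.0, 0.0)
    return best


# Hypothetical statistics: q3 needs the sensitive customer table.
queries = ["q1", "q2", "q3"]
freq = {"q1": 1, "q2": 1, "q3": 1}
run_pub = {"q1": 10.0, "q2": 20.0, "q3": 30.0}
run_priv = {"q1": 50.0, "q2": 100.0, "q3": 150.0}
proc = {"q1": 5.0, "q2": 5.0, "q3": 10.0}
base_tables = {"q1": {"lineitem"}, "q2": {"lineitem"},
               "q3": {"customer", "lineitem"}}
store_cost = {"customer": 10.0, "lineitem": 5.0}
sens_cells = {"customer": 20, "lineitem": 0}

loose = solve_cpp(queries, freq, run_pub, run_priv, proc, base_tables,
                  store_cost, sens_cells, mc=100.0, dc=100)
tight = solve_cpp(queries, freq, run_pub, run_priv, proc, base_tables,
                  store_cost, sens_cells, mc=100.0, dc=10)
print(loose[0], sorted(loose[1]))  # 50.0 ['q2', 'q3']
print(tight[0], sorted(tight[1]))  # 150.0 ['q1', 'q2']
```

With a loose disclosure limit the best split keeps only `q1` private; once the limit drops below the 20 sensitive cells of `customer`, `q3` is forced onto the private side and dominates the overall running time.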

  16. Experimental Setting
      - Experimental setting
          - Private cloud: 14 nodes, located at UTD, Pentium IV, 4 GB RAM, 290-320 GB disk space
          - Public cloud: 38 nodes, located at UCI, AMD dual core, 8 GB RAM, 631 GB disk space
          - Hadoop 0.20.2 and Hive 0.7.1
      - Dataset and statistics collection
          - 100 GB TPC-H data
      - Query workload
          - 40 queries containing modified versions of Q1, Q3, Q6, Q11

  17. Experimental Setting
      - Estimation of the weight \(w_x\)
          - Running all 22 TPC-H queries on a 300 GB dataset
          - \(w_{pub} \approx 40\) MB/sec, \(w_{priv} \approx 8\) MB/sec
      - Resource allocation cost
          - Amazon S3 pricing for storage and communication
              - Storage = $0.140/GB + PUT, Communication = $0.120/GB + GET
              - PUT = $0.01 per 1,000 requests, GET = $0.01 per 10,000 requests
          - Amazon EC2 and EMR pricing for processing
              - $0.085 + $0.015 = $0.1/hour
      - Sensitivity
          - Customer: c_name, c_phone, c_address attributes
          - Lineitem: all attributes in 1%, 5%, or 10% of tuples
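The pricing figures above translate into simple arithmetic, sketched here in Python. The rates are the ones listed on the slide; the billing period for storage is not stated there, and treating the hourly rate as a per-instance rate is an assumption. The function names and the example quantities are hypothetical.

```python
# Rates as listed on the slide (US dollars).
STORAGE_PER_GB = 0.140
COMMUNICATION_PER_GB = 0.120
PUT_PER_REQUEST = 0.01 / 1000
GET_PER_REQUEST = 0.01 / 10000
PROCESSING_PER_HOUR = 0.085 + 0.015  # EC2 plus EMR


def storage_cost(gb: float, put_requests: int) -> float:
    """Storage = per-GB charge plus PUT request charges."""
    return gb * STORAGE_PER_GB + put_requests * PUT_PER_REQUEST


def communication_cost(gb: float, get_requests: int) -> float:
    """Communication = per-GB charge plus GET request charges."""
    return gb * COMMUNICATION_PER_GB + get_requests * GET_PER_REQUEST


def processing_cost(hours: float, instances: int = 1) -> float:
    """Assumption: the hourly rate applies to each instance separately."""
    return hours * instances * PROCESSING_PER_HOUR


# Hypothetical quantities: 100 GB stored with 2,000 PUTs, 10 GB transferred
# with 10,000 GETs, and 10 instances running for 2 hours.
store = storage_cost(100, 2000)
comm = communication_cost(10, 10000)
proc = processing_cost(2.0, instances=10)
print(round(store, 2), round(comm, 2), round(proc, 2))  # 14.02 1.21 2.0
```

Costs of this form are the inputs to the monetary constraint \(MC\): storage feeds the cost of keeping \(R_{pub}\) on the public side and the hourly rate feeds the processing cost of each public query.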

  18. Experimental Results

  19. Experimental Results

  20. Future Work
      - Extend the work to enable intra-query parallelism
      - Support dynamically changing (or arriving) workloads
      - Extend this work to other cloud computing technologies
      - Support different sensitivity models
