 
              Cloud Analytics Data Warehousing Marco Serafini COMPSCI 532 Lecture 18
Trivia • How does Amazon make money? • Selling books? • Entertainment?
Migrating to the Cloud • ELASTICITY • Pay-as-you-go • Unlimited scale • COST • HW procurement at scale • Cluster management at scale
Cloud Computing • Shared resources • Multiple tenants sharing resources (with isolation) • Economy of scale • Elastic provisioning • Can easily add and remove resources on the fly • Pay as you go only when used • Different flavors • IaaS, PaaS, SaaS • Public, private cloud
Cloud Offerings • Computing nodes • Example: AWS EC2 • Full nodes with local storage and pre-installed OS • Very large number of instance types: compute optimized, memory optimized, storage optimized, with GPUs, burstable… • Storage services • Example: AWS S3 • Key-value stores (put/get), file systems • Higher-level services • Example: DBMS
Storage Disaggregation • Computing nodes (e.g. EC2) • Feature-rich machines • Storage services (e.g. S3) • On cheaper, storage-heavy machines • Limited read/write interface • Advantages for cloud provider • Provision storage and computation independently • Advantages for users • Storage services cheaper • Network bandwidth ~ I/O bandwidth
Cloud Storage Types
STORAGE | PERFORMANCE | ACCESS | APPENDS | AVAILABILITY | PRICE
✓ OBJECT (S3) -- Shared X Low
✓ ✓ FILE SYSTEM (EFS) - Shared High
✓ BLOCK (EBS) + Instance (*) X Mid
✓ INSTANCE-LOCAL ++ Instance X High (**)
(*) Can be detached from an instance and reattached to another
(**) Storage-heavy instances are expensive
From Shared-Nothing Architecture… [Figure: four COMPUTE nodes, each with its own local storage (LS)] Principle: move computation to data
…To Hybrid Architectures [Figure: four COMPUTE nodes with local storage (LS) run arbitrary computation; a shared STORAGE SERVICE offers read/write only] Cannot move computation to data!
Scheduling Low-Priority Tasks • Helps increase hardware utilization • Spot instances • Allocated in real time based on live bidding • Can be revoked at any time (with notice) • Serverless computing • Example: AWS Lambda • Each of these services comes with its own pricing
Goals: Push-Button Analytics • Easily parallelize single-threaded code • Eliminate cluster management overhead • Deployment of nodes • Installation • Configuration • Even cloud offerings have their complexities • Many instance types • Many services • Solution: Serverless functions
Goal: Push-Button Analytics • Use "serverless" components • No need to select a specific cluster size • System auto-scales up and down on demand • Building blocks • Serverless functions (AWS Lambdas) • Cloud storage services (AWS S3) • This paper implements MapReduce in this setting
Serverless Functions • Single threaded code • Invoked through HTTP requests • Cloud platform takes care of • Deployment • Load balancing • Performance isolation • No need to • Deploy servers • Configure clusters
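A minimal sketch of what such a function can look like in Python. The `handler(event, context)` entry point and the `statusCode`/`body` response shape follow the AWS Lambda convention for HTTP-triggered functions; the payload field `numbers` and the computation are hypothetical and only for illustration.

```python
import json


def handler(event, context):
    """Entry point called by the platform once per request.

    The user writes only this single-threaded function; deployment,
    load balancing, and isolation are left to the cloud platform.
    """
    # The HTTP request body arrives as a JSON string (assumed field: "numbers").
    payload = json.loads(event.get("body") or "{}")
    numbers = payload.get("numbers", [])
    result = sum(x * x for x in numbers)
    return {"statusCode": 200, "body": json.dumps({"sum_of_squares": result})}
```

Locally the function can be exercised by calling `handler({"body": "{\"numbers\": [1, 2, 3]}"}, None)`.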
Challenges with Lambdas • No local storage, need to use remote cloud storage • For example S3 • No function-to-function communication • Again, remote storage is needed to share state between functions • Short maximum running time
Remote vs. Local Storage
State and Fault Tolerance • State is lost after execution • Inputs and outputs need to be persisted • Fault tolerance • Re-execute function • Requires atomic writes to check which executions have succeeded
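A minimal sketch of re-execution with atomic writes. It assumes a local directory as a stand-in for remote storage and uses write-to-temp-then-rename as the atomic write; the names `atomic_put` and `run_with_retries` are hypothetical.

```python
import os
import tempfile


def atomic_put(directory, key, data):
    """Write to a temporary file, then rename it into place.

    Readers see either the complete output or no output at all.
    """
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    os.replace(tmp_path, os.path.join(directory, key))


def run_with_retries(task_id, func, arg, directory, max_attempts=3):
    """Re-execute a stateless task until its output exists in storage."""
    path = os.path.join(directory, f"{task_id}.out")
    for _ in range(max_attempts):
        if os.path.exists(path):
            break  # a previous attempt already committed its output
        try:
            atomic_put(directory, f"{task_id}.out", func(arg))
        except Exception:
            continue  # function state is lost; simply run it again
    if not os.path.exists(path):
        raise RuntimeError(f"task {task_id} failed {max_attempts} times")
    with open(path, "rb") as f:
        return f.read()
```

Because the output appears only once it is complete, the existence of the output key is enough to decide whether a task has succeeded.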
Registering Functions • Registering a new Lambda function is slow • Solution • Register a single generic Lambda function • Serialize the code that needs to be executed • Store the code (and the input data) on S3 • Generic Lambda function loads code and executes it
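The generic-function idea can be sketched as follows. This is an illustration under assumptions, not the paper's implementation: a Python dict (`STORE`) stands in for an S3 bucket, the key layout is hypothetical, and the standard-library `marshal` module is used to serialize the code object, which only works for functions without closures and with the same Python version on both sides.

```python
import marshal
import pickle
import types

STORE = {}  # hypothetical stand-in for an S3 bucket: key -> bytes


def submit(task_id, func, arg):
    """Client side: serialize the code and its input, then upload both."""
    STORE[f"{task_id}/code"] = marshal.dumps(func.__code__)
    STORE[f"{task_id}/input"] = pickle.dumps(arg)


def generic_handler(event):
    """The single registered function: load code and input, run, store output."""
    task_id = event["task_id"]
    code = marshal.loads(STORE[f"{task_id}/code"])
    func = types.FunctionType(code, globals())  # rebuild a callable from the code
    arg = pickle.loads(STORE[f"{task_id}/input"])
    STORE[f"{task_id}/output"] = pickle.dumps(func(arg))
```

Only `generic_handler` is registered once; each new user function costs an upload rather than a registration.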
Remote Storage Scalability
Semantics • Map is easy • Execute one function per element of the list • Map + single Reducer • E.g. parallel featurization + single-server ML • MapReduce • Many Lambdas needed, many small intermediate files • Use Redis, an in-memory key-value store • Parameter server • Use Redis
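A small local sketch of the MapReduce case, showing why many small intermediate files appear: with M mappers and R reducers there are M × R intermediate entries. The dict `KV` stands in for the shared store (S3 or Redis), threads stand in for Lambda invocations, and all names are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

KV = {}  # hypothetical stand-in for the shared store: key -> value


def map_task(map_id, text, num_reducers):
    """One 'Lambda': count words and write one intermediate entry per reducer."""
    buckets = [defaultdict(int) for _ in range(num_reducers)]
    for word in text.split():
        # Deterministic partitioning so every mapper sends a word to the same reducer.
        buckets[sum(word.encode()) % num_reducers][word] += 1
    for r, counts in enumerate(buckets):
        KV[f"inter/{map_id}/{r}"] = dict(counts)


def reduce_task(r, num_mappers):
    """One 'Lambda': read this reducer's entry from every mapper and merge."""
    total = defaultdict(int)
    for m in range(num_mappers):
        for word, count in KV[f"inter/{m}/{r}"].items():
            total[word] += count
    KV[f"out/{r}"] = dict(total)


def word_count(chunks, num_reducers=2):
    with ThreadPoolExecutor() as pool:
        # Functions never talk to each other; all data flows through KV.
        list(pool.map(lambda p: map_task(p[0], p[1], num_reducers), enumerate(chunks)))
        list(pool.map(lambda r: reduce_task(r, len(chunks)), range(num_reducers)))
    result = {}
    for r in range(num_reducers):
        result.update(KV[f"out/{r}"])
    return result
```

For example, `word_count(["a b a", "b c"])` uses two mappers and two reducers, so four intermediate entries are written before the reducers start.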
The Cost of Scaling Up • Using more nodes does not always imply higher cost • Lower latency → lower cost per node
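The arithmetic behind this can be sketched under a simple assumed model: cost = nodes × runtime × price per node-hour, with runtime shrinking as nodes are added. The function name, the `parallel_efficiency` parameter, and all prices are hypothetical.

```python
def job_cost(nodes, hours_on_one_node, price_per_node_hour, parallel_efficiency=1.0):
    """Return (runtime in hours, total cost) under pay-as-you-go billing."""
    runtime = hours_on_one_node / (nodes * parallel_efficiency)
    return runtime, nodes * runtime * price_per_node_hour


# With perfect scaling, 4x the nodes gives 1/4 of the runtime at the same total cost.
one_node = job_cost(1, 8, 2.0)
four_nodes = job_cost(4, 8, 2.0)
```

Under this model each node is paid for a shorter time, so total cost grows only to the extent that scaling is imperfect (`parallel_efficiency` below 1).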
Data Warehousing Architectures
Data Warehousing • Analytical (OLAP) relational queries • Different architectures • Snowflake: shared-disk + caching at compute nodes • Redshift: shared-nothing, store all data at compute nodes • Redshift Spectrum: serverless workers executing on-demand and reading from S3 • Let’s discuss these architectures and compare them
Snowflake • Shared-disk architecture • Data is stored on S3, all nodes can access it • But nodes keep a distributed cache • Challenges • Heterogeneous workloads • No one-size-fits-all hardware configuration • Membership changes • Large data shuffles when a node fails/is removed • Online upgrade • It is similar to changing all the nodes in the system
Snowflake Architecture • Data Storage • Based on S3: high throughput, high latency • Used also for intermediate data • Virtual Warehouses • Responsible for query execution • Stateless (restarted in their entirety) • Shared cache (low latency on hot data, most data cold) • Cloud Services • Query parsing, access control, optimization • Snapshot isolation with multi-versioning • Metadata on external key-value store
Snowflake Advantages • Storage on S3 is cheaper • Use expensive local disk only for hot data • All services (except storage) are stateless • Simpler fault tolerance and membership change
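The "local disk only for hot data" idea can be illustrated with a read-through cache in front of remote storage. This is a sketch under assumptions: a dict stands in for S3, the eviction policy is LRU (chosen here for illustration), and the class name is hypothetical.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Small local cache in front of a large, cheap remote store."""

    def __init__(self, remote, capacity):
        self.remote = remote        # stand-in for S3: key -> bytes
        self.capacity = capacity    # how many objects fit on local disk
        self.local = OrderedDict()  # least recently used entry first
        self.remote_reads = 0

    def get(self, key):
        if key in self.local:
            self.local.move_to_end(key)  # hot data is served locally
            return self.local[key]
        self.remote_reads += 1           # cold data: pay the remote latency
        value = self.remote[key]
        self.local[key] = value
        if len(self.local) > self.capacity:
            self.local.popitem(last=False)  # evict the least recently used entry
        return value
```

The cache holds no authoritative state: if a node is lost, its cache is simply refilled from remote storage, which is what makes fault tolerance and membership changes simpler.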
Redshift • Classical shared-nothing architecture • Initially based on PostgreSQL but heavily re-optimized for OLAP • Runs on EC2, explicit provisioning • All data pre-loaded on instance storage • Query compilation • S3 for backup only
Redshift Spectrum • Serverless query executor • Number of workers dynamically assigned • Stateless • Reads data directly from S3 • Scale out to leverage storage and computation bandwidth
Comparison Setup • Benchmark: TPC-H • 1 TB uncompressed data • 1 execution of the query suite • Configuration • Default: 4 nodes, memory optimized (r4.8xlarge) • Redshift: analogous node that offers SSD storage (dc2) • Athena: opaque
Comparison: Initialization Time • Paid every time we shut down and restart the system • Load metadata and (optionally) data
Comparison: Runtime • Pre-loading pays off • Initialization delay is easily amortized • Caching less helpful • Cost • Athena: pay for data scanned only • Other systems: mainly running time • Spectrum: scan + running time
Comparison: Execution Cost • RS can amortize loading costs • Athena • Serverless • Pay per amount of data scanned • RS Spectrum • Similar scheme as Athena • But must add RS cluster cost
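The three pricing schemes can be written down as simple cost functions. This is a sketch of the structure only: the function names are hypothetical and every price and size in the example is a made-up placeholder, not an actual AWS price or a measured result.

```python
def scan_only_cost(tb_scanned, price_per_tb):
    """Athena-style: pay only for the amount of data scanned."""
    return tb_scanned * price_per_tb


def cluster_cost_per_run(nodes, load_hours, query_hours, runs, price_per_node_hour):
    """Provisioned cluster: pay for running time; loading is amortized over runs."""
    return nodes * (load_hours / runs + query_hours) * price_per_node_hour


def scan_plus_cluster_cost(tb_scanned, price_per_tb, nodes, query_hours, price_per_node_hour):
    """Spectrum-style: scan charge plus the cost of the cluster that must be running."""
    return scan_only_cost(tb_scanned, price_per_tb) + nodes * query_hours * price_per_node_hour


# Placeholder numbers: loading dominates a single run but fades after a few runs.
single_run = cluster_cost_per_run(4, 2.0, 0.5, runs=1, price_per_node_hour=1.0)
four_runs = cluster_cost_per_run(4, 2.0, 0.5, runs=4, price_per_node_hour=1.0)
```

The `runs` parameter captures the amortization argument: the more often the query suite is executed after one load, the closer the per-run cost gets to pure query time.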
Storage Cost Per Day • Instance storage + EBS very expensive • S3 backup cheaper
Pushing Down Computation? • One should always move computation to data • But disaggregated storage cannot compute! [Figure: four COMPUTE nodes with local storage (LS) run arbitrary computation; a shared STORAGE SERVICE offers read/write only]
S3 Select • Computation on the storage layer • Simple selection and projection queries on structured data (e.g. CSV or Parquet) • Simple aggregations (e.g. sum)
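The effect of pushing selection and projection to the storage layer can be simulated locally. The sketch below does not use the S3 Select API (the real service takes an SQL expression); `storage_select` is a hypothetical function that plays the role of the storage service and returns only the matching rows and requested columns.

```python
import csv
import io


def storage_select(csv_bytes, columns, predicate):
    """Runs 'inside the storage service': filter rows, keep only some columns."""
    reader = csv.DictReader(io.StringIO(csv_bytes.decode()))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if predicate(row):                                   # selection
            writer.writerow({c: row[c] for c in columns})    # projection
    return out.getvalue().encode()


data = b"id,name,price\n1,apple,3\n2,pear,12\n3,plum,20\n"
# Roughly: SELECT name FROM object WHERE price > 10
result = storage_select(data, ["name"], lambda r: int(r["price"]) > 10)
```

The compute node receives `result` instead of `data`, so fewer bytes cross the network between storage and compute.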
PushdownDB • Stateless query execution with S3 Select • Example: Bloom join • Standard hash join but push down Bloom filter to filter out rows that will not join
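A minimal sketch of a Bloom join. The filter size, the hash construction, and the `large_scan(predicate)` callback (which stands in for a storage-side scan that accepts a pushed-down filter) are assumptions for illustration, not PushdownDB's actual interface.

```python
import hashlib


class BloomFilter:
    """Compact set summary: no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        return all((self.bits >> p) & 1 for p in self._positions(key))


def bloom_join(small, large_scan, key):
    # 1. Build the hash table and a Bloom filter over the small table's keys.
    table, bloom = {}, BloomFilter()
    for row in small:
        table.setdefault(row[key], []).append(row)
        bloom.add(row[key])
    # 2. Push the filter down: the scan returns only rows that might join.
    candidates = large_scan(lambda row: bloom.might_contain(row[key]))
    # 3. Standard hash-join probe; any false positives are dropped here.
    return [{**s, **l} for l in candidates for s in table.get(l[key], [])]
```

The join result is the same as for a plain hash join; the Bloom filter only reduces how many rows of the large table leave the storage layer.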
TPC-H Results • Great speedups with S3 Select