Cloud Analytics Data Warehousing Marco Serafini COMPSCI 532


  1. Cloud Analytics Data Warehousing Marco Serafini COMPSCI 532 Lecture 18

  2. Trivia • How does Amazon make money? • Selling books? • Entertainment?

  3. Migrating to the Cloud • ELASTICITY • COST • Pay-as-you-go • HW procurement at scale • Unlimited scale • Cluster management at scale

  4. Cloud Computing • Shared resources • Multiple tenants sharing resources (with isolation) • Economy of scale • Elastic provisioning • Can easily add and remove resources on the fly • Pay as you go only when used • Different flavors • IaaS, PaaS, SaaS • Public, private cloud

  5. Cloud Offerings • Computing nodes • Example: AWS EC2 • Full nodes with local storage and pre-installed OS • Very large number of instance types: compute optimized, memory optimized, storage optimized, with GPUs, burstable… • Storage services • Example: AWS S3 • Key-value stores (put/get), file systems • Higher-level services • Example: DBMS

  6. Storage Disaggregation • Computing nodes (e.g. EC2) • Feature-rich machines • Storage services (e.g. S3) • On cheaper, storage-heavy machines • Limited read/write interface • Advantages for cloud provider • Provision storage and computation independently • Advantages for users • Storage services cheaper • Network bandwidth ~ I/O bandwidth

  7. Cloud Storage Types • Columns: STORAGE, PERFORMANCE, ACCESS, APPENDS, AVAILABILITY, PRICE • OBJECT (S3): --, Shared, X, ✓, Low • FILE SYSTEM (EFS): -, Shared, ✓, ✓, High • BLOCK (EBS): +, Instance (*), ✓, X, Mid • INSTANCE-LOCAL: ++, Instance, ✓, X, High (**) • (*) Can be detached from an instance and reattached to another • (**) Storage-heavy instances are expensive

  8. From Shared-Nothing Architecture… • [Diagram: four COMPUTE nodes, each with its own local storage (LS)] • Principle: move computation to data

  9. …To Hybrid Architectures • [Diagram: four COMPUTE nodes with local storage (LS) running arbitrary computation, on top of a STORAGE SERVICE that is read/write only] • Cannot move computation to data!

  10. Scheduling Low-Priority Tasks • Helps increase hardware utilization • Spot instances • Allocated in real-time based on live bidding • Can be revoked any time (with notice) • Serverless computing • Example: AWS Lambda • Each of these services comes with its own pricing

  11. Goals: Push-Button Analytics • Easily parallelize single-threaded code • Eliminate cluster management overhead • Deployment of nodes • Installation • Configuration • Even cloud offerings have their complexities • Many instance types • Many services • Solution: Serverless functions

  12. Goal: Push-Button Analytics • Use "serverless" components • No need to select a specific cluster size • System auto-scales up and down on demand • Building blocks • Serverless functions (AWS Lambda) • Cloud storage services (AWS S3) • This paper implements MapReduce in this setting

  13. Serverless Functions • Single-threaded code • Invoked through HTTP requests • Cloud platform takes care of • Deployment • Load balancing • Performance isolation • No need to • Deploy servers • Configure clusters

  14. Challenges with Lambdas • No local storage, need to use remote cloud storage • For example S3 • No function-to-function communication • Again need remote storage to share data between functions • Short maximum running time

  15. Remote vs. Local Storage

  16. State and Fault Tolerance • State is lost after execution • Inputs and outputs need to be persisted • Fault tolerance • Re-execute function • Requires atomic writes to check what has succeeded
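
The re-execution scheme on this slide can be sketched as follows. This is a minimal illustration, not the mechanism of any particular system: `ObjectStore` is a hypothetical in-memory stand-in for a cloud storage service, and `put_if_absent`, `run_task`, and `run_with_retries` are names invented for this example. The point it shows is that a persisted result doubles as a success marker, so a retried task is safe to run again.

```python
import threading


class ObjectStore:
    """Hypothetical in-memory stand-in for a remote object store."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put_if_absent(self, key, value):
        """Atomically write value under key; return False if key exists."""
        with self._lock:
            if key in self._data:
                return False
            self._data[key] = value
            return True

    def exists(self, key):
        with self._lock:
            return key in self._data

    def get(self, key):
        with self._lock:
            return self._data[key]


def run_task(store, task_id, func, arg):
    """Run func(arg) unless a result for task_id was already persisted."""
    key = f"result/{task_id}"
    if not store.exists(key):
        result = func(arg)
        # The atomic write is the commit point: only the first writer wins,
        # so duplicate executions cannot overwrite a committed result.
        store.put_if_absent(key, result)
    return store.get(key)


def run_with_retries(store, task_id, func, arg, max_attempts=3):
    """Re-execute a stateless task until it succeeds or attempts run out."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return run_task(store, task_id, func, arg)
        except Exception as err:  # a crashed function leaves no state behind
            last_error = err
    raise RuntimeError(f"task {task_id} failed") from last_error
```

Because the function itself keeps no state, the only thing the retry loop has to check is whether the output key exists.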

  17. Registering Functions • Registering a new Lambda function is slow • Solution • Register a single generic Lambda function • Serialize the code that needs to be executed • Store the code (and the input data) on S3 • Generic Lambda function loads code and executes it
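
A rough sketch of the generic-function idea, under several assumptions: a plain `dict` stands in for S3, the names `upload_job` and `generic_handler` are hypothetical, and the standard-library `marshal` module is used to serialize the function's bytecode. `marshal` only handles simple functions without closures or references to module-level names, so a real system would likely need a more capable serializer.

```python
import builtins
import marshal
import types


def upload_job(store, job_id, func, data):
    """Client side: persist the serialized code and its input."""
    store[f"{job_id}/code"] = marshal.dumps(func.__code__)
    store[f"{job_id}/input"] = data


def generic_handler(store, job_id):
    """The single registered function: load the code, run it, persist output."""
    code = marshal.loads(store[f"{job_id}/code"])
    # Rebuild a callable from the bytecode; only builtins are made available.
    func = types.FunctionType(code, {"__builtins__": builtins})
    result = func(store[f"{job_id}/input"])
    store[f"{job_id}/output"] = result
    return result


def word_count(text):
    return len(text.split())


store = {}  # stand-in for the storage service
upload_job(store, "job1", word_count, "move computation to data")
output = generic_handler(store, "job1")
```

Only `generic_handler` would need to be registered once; each new job is just two more objects in storage.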

  18. Remote Storage Scalability

  19. Semantics • Map is easy • Execute one function per element of the list • Map + single Reducer • E.g. parallel featurization + single-server ML • MapReduce • Many Lambdas needed, many small intermediate files • Use Redis, an in-memory key-value store • Parameter server • Use Redis
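
To make the MapReduce case concrete, here is a small word-count sketch in which mappers and reducers never talk to each other directly and exchange intermediate data only through a key-value store. Everything here is an assumption for illustration: `KVStore` is an in-memory stand-in for a service such as Redis, and a thread pool stands in for concurrently invoked functions.

```python
import threading
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


class KVStore:
    """Hypothetical in-memory stand-in for a shared key-value store."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def append(self, key, value):
        with self._lock:
            self._data.setdefault(key, []).append(value)

    def get(self, key):
        with self._lock:
            return list(self._data.get(key, []))


def mapper(store, line, num_reducers):
    """Count words in one input element and write partial counts to the store."""
    counts = defaultdict(int)
    for word in line.split():
        counts[word] += 1
    for word, n in counts.items():
        partition = sum(word.encode()) % num_reducers  # deterministic partitioning
        store.append(f"shuffle/{partition}", (word, n))


def reducer(store, partition):
    """Read one shuffle partition from the store and sum the counts."""
    totals = defaultdict(int)
    for word, n in store.get(f"shuffle/{partition}"):
        totals[word] += n
    return dict(totals)


def word_count(lines, num_reducers=2):
    store = KVStore()
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Map phase: one invocation per element of the list.
        list(pool.map(lambda line: mapper(store, line, num_reducers), lines))
        # Reduce phase: starts only after all mappers have written their output.
        partials = list(pool.map(lambda p: reducer(store, p), range(num_reducers)))
    result = {}
    for part in partials:
        result.update(part)
    return result
```

Each mapper writes one small entry per word and partition, which is the "many small intermediate files" pattern that motivates a low-latency store.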

  20. The Cost of Scaling Up • Using more nodes does not always imply higher cost • Lower latency → lower cost per node
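
The arithmetic behind this slide can be sketched with a simple pay-as-you-go model. The function name, the `efficiency` parameter, and all numbers are hypothetical; the model assumes billing proportional to node-hours and nothing else.

```python
def job_cost(num_nodes, single_node_hours, price_per_node_hour, efficiency=1.0):
    """Return (runtime_hours, total_cost) under a pay-per-node-hour model.

    efficiency is the fraction of ideal linear speedup actually achieved
    (1.0 means that n nodes finish n times faster than one node).
    """
    runtime_hours = single_node_hours / (num_nodes * efficiency)
    total_cost = num_nodes * runtime_hours * price_per_node_hour
    return runtime_hours, total_cost


# Hypothetical job: 8 hours on one node, 0.5 currency units per node-hour.
one_node = job_cost(1, 8.0, 0.5)
four_nodes_ideal = job_cost(4, 8.0, 0.5)
four_nodes_lossy = job_cost(4, 8.0, 0.5, efficiency=0.8)
```

With ideal speedup the four-node run finishes in a quarter of the time for the same total cost, because each node is billed for a quarter of the hours. The total only rises when the speedup is less than linear.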

  21. Data Warehousing Architectures

  22. Data Warehousing • Analytical (OLAP) relational queries • Different architectures • Snowflake: shared-disk + caching at compute nodes • Redshift: shared-nothing, store all data at compute nodes • Redshift Spectrum: serverless workers executing on-demand and reading from S3 • Let's discuss these architectures and compare them

  23. Snowflake • Shared-disk architecture • Data is stored on S3, all nodes can access it • But nodes keep a distributed cache • Challenges • Heterogeneous workloads • No one-size-fits-all hardware configuration • Membership changes • Large data shuffles when a node fails/is removed • Online upgrade • It is similar to changing all the nodes in the system

  24. Snowflake Architecture • Data Storage • Based on S3: high throughput, high latency • Used also for intermediate data • Virtual Warehouses • Responsible for query execution • Stateless (restarted in their entirety) • Shared cache (low latency on hot data, most data cold) • Cloud Services • Query parsing, access control, optimization • Snapshot isolation with multi-versioning • Metadata on external key-value store

  25. Snowflake Advantages • Storage on S3 is cheaper • Use expensive local disk only for hot data • All services (except storage) are stateless • Simpler fault tolerance and membership change
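
The "local disk only for hot data" point can be illustrated with a small read-through cache in front of remote storage. This is a sketch under assumptions, not a description of Snowflake's implementation: the class name `CachedReader` is hypothetical, a `dict` stands in for S3, and least-recently-used eviction is chosen here only because it is simple.

```python
from collections import OrderedDict


class CachedReader:
    """Read-through cache: local copies of hot objects, remote store for the rest."""

    def __init__(self, remote, capacity):
        self.remote = remote          # authoritative copy of all data
        self.capacity = capacity      # how many objects fit on local disk
        self.cache = OrderedDict()    # ordered from least to most recently used
        self.remote_reads = 0

    def read(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)      # mark as recently used
            return self.cache[key]
        self.remote_reads += 1               # slow path: fetch from remote storage
        value = self.remote[key]
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict the least recently used object
        return value


remote = {"a": 1, "b": 2, "c": 3}
node = CachedReader(remote, capacity=2)
values = [node.read(k) for k in ["a", "b", "a", "c"]]
```

Since the cache holds only copies, discarding the node and starting a fresh `CachedReader` loses no data, which is what makes fault tolerance and membership changes simpler.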

  26. Redshift • Classical shared-nothing architecture • Initially based on PostgreSQL but heavily re-optimized for OLAP • Runs on EC2, explicit provisioning • All data pre-loaded on instance storage • Query compilation • S3 for backup only

  27. Redshift Spectrum • Serverless query executor • Number of workers dynamically assigned • Stateless • Reads data directly from S3 • Scale out to leverage storage and computation bandwidth

  28. Comparison Setup • Benchmark: TPC-H • 1 TB uncompressed data • 1 execution of the query suite • Configuration • Default: 4 nodes, memory optimized (r4.8xlarge) • Redshift: analogous node that offers SSD storage (dc2) • Athena: opaque

  29. Comparison: Initialization Time • Paid every time we shut down and restart the system • Load metadata and (optionally) data

  30. Comparison: Runtime • Pre-loading pays off • Initialization delay is easily amortized • Caching less helpful • Cost • Athena: pay data scan only • Other systems: mainly running time • Spectrum: scan + running time

  31. Comparison: Execution Cost • RS can amortize loading costs • Athena • Serverless • Pay per amount of data scanned • RS Spectrum • Similar scheme as Athena • But must add RS cluster cost
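
The three pricing schemes compared on these slides can be written down as simple cost functions. All function names and prices below are hypothetical placeholders, not actual AWS rates; the sketch only shows which quantities each scheme depends on.

```python
def scan_based_cost(tb_scanned, price_per_tb):
    """Serverless, Athena-style scheme: pay only for the data scanned."""
    return tb_scanned * price_per_tb


def cluster_cost(num_nodes, hours, price_per_node_hour):
    """Provisioned cluster: pay for running time, whatever the query touches."""
    return num_nodes * hours * price_per_node_hour


def spectrum_style_cost(tb_scanned, price_per_tb, num_nodes, hours,
                        price_per_node_hour):
    """Scan charge plus the cost of the cluster that issues the query."""
    return (scan_based_cost(tb_scanned, price_per_tb)
            + cluster_cost(num_nodes, hours, price_per_node_hour))


# Hypothetical query: scans 0.5 TB and keeps 4 nodes busy for 15 minutes.
athena_like = scan_based_cost(0.5, price_per_tb=5.0)
cluster_only = cluster_cost(4, 0.25, price_per_node_hour=2.0)
spectrum_like = spectrum_style_cost(0.5, 5.0, 4, 0.25, 2.0)
```

Under this model a scan-priced system gets cheaper when queries read less data, while a provisioned cluster gets cheaper when queries finish faster.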

  32. Storage Cost Per Day • Instance storage + EBS very expensive • S3 backup cheaper

  33. Pushing Down Computation? • One should always move computation to data • But disaggregated storage cannot compute! • [Diagram: four COMPUTE nodes with local storage (LS) running arbitrary computation, on top of a STORAGE SERVICE that is read/write only]

  34. S3 Select • Computation on the storage layer • Simple selection and projection queries on structured data (e.g. CSV or Parquet) • Simple aggregations (e.g. sum)
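
The benefit of selection and projection at the storage layer can be shown with a small simulation. This does not use the S3 Select API: `get_object` and `select` are hypothetical functions that stand in for a plain read and a pushed-down query over a CSV object held in memory.

```python
import csv
import io

# Hypothetical CSV object held by the storage service.
CSV_OBJECT = "id,region,amount\n1,EU,10\n2,US,25\n3,EU,7\n4,US,3\n"


def get_object(obj):
    """Plain read interface: the whole object crosses the network."""
    return obj


def select(obj, columns, predicate):
    """Pushed-down query: filter rows and project columns at the storage layer."""
    reader = csv.DictReader(io.StringIO(obj))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        if predicate(row):                                # selection
            writer.writerow([row[c] for c in columns])    # projection
    return out.getvalue()


# Without pushdown: fetch everything, then filter on the compute node.
full = get_object(CSV_OBJECT)
client_side = [int(r["amount"])
               for r in csv.DictReader(io.StringIO(full))
               if r["region"] == "EU"]

# With pushdown: only the matching values are returned.
pushed = select(CSV_OBJECT, ["amount"], lambda r: r["region"] == "EU")
pushdown_side = [int(line) for line in pushed.splitlines()]
```

Both paths compute the same answer; the difference is how many bytes leave the storage service, `len(full)` versus `len(pushed)`.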

  35. PushdownDB • Stateless query execution with S3 Select • Example: Bloom join • Standard hash join but push down Bloom filter to filter out rows that will not join
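
A Bloom join along these lines might look as follows. This is a sketch, not PushdownDB's code: it assumes the storage layer can evaluate a Bloom-filter membership test, which is simulated here by the hypothetical `storage_scan` function, and the filter size and hash construction are arbitrary choices.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit set stored in a Python integer

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def storage_scan(rows, key, bloom):
    """Simulated storage layer: return only rows whose key may be in the filter."""
    return [row for row in rows if bloom.might_contain(row[key])]


def bloom_join(build_rows, probe_rows, key):
    """Hash join that pushes a Bloom filter on the build keys down to the scan."""
    bloom = BloomFilter()
    table = {}
    for row in build_rows:                      # build phase on the smaller input
        bloom.add(row[key])
        table.setdefault(row[key], []).append(row)
    candidates = storage_scan(probe_rows, key, bloom)
    # Exact probe: the hash table removes any Bloom false positives.
    joined = [{**b, **p} for p in candidates for b in table.get(p[key], [])]
    return joined, len(candidates)


customers = [{"cid": 1, "name": "a"}, {"cid": 2, "name": "b"}]
orders = [{"cid": i % 50, "oid": i} for i in range(200)]
joined, rows_shipped = bloom_join(customers, orders, "cid")
```

The filter may let a few non-matching rows through, so the compute node still performs the exact hash-table probe; the saving is in how few of the 200 order rows have to be shipped.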

  36. TPC-H Results • Great speedups with S3 Select
