Cloud Analytics Data Warehousing Marco Serafini COMPSCI 532 - - PowerPoint PPT Presentation

SLIDE 1

Cloud Analytics Data Warehousing

Marco Serafini

COMPSCI 532 Lecture 18

SLIDE 2

Trivia

  • How does Amazon make money?
  • Selling books?
  • Entertainment?
SLIDE 3

Migrating to the Cloud

  • ELASTICITY
  • Pay-as-you-go
  • Unlimited scale
  • COST
  • HW procurement at scale
  • Cluster management at scale
SLIDE 4

Cloud Computing

  • Shared resources
  • Multiple tenants sharing resources (with isolation)
  • Economy of scale
  • Elastic provisioning
  • Can easily add and remove resources on the fly
  • Pay as you go only when used
  • Different flavors
  • IaaS, PaaS, SaaS
  • Public, private cloud
SLIDE 5

Cloud Offerings

  • Computing nodes
  • Example: AWS EC2
  • Full nodes with local storage and pre-installed OS
  • Very large number of instance types: compute optimized, memory optimized, storage optimized, with GPUs, burstable…

  • Storage services
  • Example: AWS S3
  • Key-value stores (put/get), file systems
  • Higher-level services
  • Example: DBMS
SLIDE 6

Storage Disaggregation

  • Computing nodes (e.g. EC2)
  • Feature-rich machines
  • Storage services (e.g. S3)
  • On cheaper, storage-heavy machines
  • Limited read/write interface
  • Advantages for cloud provider
  • Provision storage and computation independently
  • Advantages for users
  • Storage services cheaper
  • Network bandwidth ~ I/O bandwidth
SLIDE 7

Cloud Storage Types

STORAGE            PERFORMANCE  ACCESS        APPENDS  AVAILABILITY  PRICE
OBJECT (S3)        •            Shared        X        ✓             Low
FILE SYSTEM (EFS)  •            Shared        ✓        ✓             High
BLOCK (EBS)        +            Instance (*)  ✓        X             Mid
INSTANCE-LOCAL     ++           Instance      ✓        X             High (**)

(*) Can be detached from an instance and reattached to another
(**) Storage-heavy instances are expensive

SLIDE 8

From Shared-Nothing Architecture…

[Figure: four COMPUTE nodes, each with its own local storage (LS)]

Principle: move computation to data

SLIDE 9

…To Hybrid Architectures

[Figure: four COMPUTE nodes, each with local storage (LS), all connected to a shared STORAGE SERVICE]

Compute nodes: arbitrary computation. Storage service: read/write only. Cannot move computation to data!

SLIDE 10

Scheduling Low-Priority Tasks

  • Helps increase hardware utilization
  • Spot instances
  • Allocated in real-time based on live bidding
  • Can be revoked any time (with notice)
  • Serverless computing
  • Example: AWS Lambda
  • Each of these services comes with its own pricing
SLIDE 11
SLIDE 12

Goals: Push-Button Analytics

  • Easily parallelize single-threaded code
  • Eliminate cluster management overhead
  • Deployment of nodes
  • Installation
  • Configuration
  • Even cloud offerings have their complexities
  • Many instance types
  • Many services
  • Solution: Serverless functions
SLIDE 13

Goal: Push-Button Analytics

  • Use “serverless” components
  • No need to select a specific cluster size
  • System auto-scales up and down on demand
  • Building blocks
  • Serverless functions (AWS Lambdas)
  • Cloud storage services (AWS S3)
  • This paper implements MapReduce in this setting
SLIDE 14

Serverless Functions

  • Single-threaded code
  • Invoked through HTTP requests
  • Cloud platform takes care of
  • Deployment
  • Load balancing
  • Performance isolation
  • No need to
  • Deploy servers
  • Configure clusters
SLIDE 15

Challenges with Lambdas

  • No local storage, need to use remote cloud storage
  • For example S3
  • No function-to-function communication
  • Again need remote storage to share state between functions
  • Short maximum running time
SLIDE 16

Remote vs. Local Storage

SLIDE 17

State and Fault Tolerance

  • State is lost after execution
  • Inputs and outputs need to be persisted
  • Fault tolerance
  • Re-execute function
  • Require atomic writes to check what has succeeded
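
A minimal sketch of re-execution with atomic writes, assuming a local directory stands in for remote storage such as S3. The names `atomic_write` and `run_task` are hypothetical. An output file becomes visible only after it is completely written, so its presence tells a retrying caller that the task already succeeded.

```python
import os
import tempfile


def atomic_write(path, data):
    """Write data so that readers see either no file or the complete file."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        f.write(data)
    os.replace(tmp_path, path)  # atomic rename within one filesystem


def run_task(task_id, func, arg, out_dir):
    """Run func(arg) unless an earlier attempt already persisted its output."""
    out_path = os.path.join(out_dir, f"{task_id}.out")
    if os.path.exists(out_path):
        return False  # already succeeded, nothing to redo
    atomic_write(out_path, str(func(arg)))
    return True
```

Calling `run_task` twice with the same `task_id` executes the function once; the second call sees the persisted output and returns `False`.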
SLIDE 18

Registering Functions

  • Registering a new Lambda function is slow
  • Solution
  • Register a single generic Lambda function
  • Serialize the code that needs to be executed
  • Store the code (and the input data) on S3
  • Generic Lambda function loads code and executes it
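
The generic-function idea can be sketched locally. This is an illustration only: a Python dict stands in for S3, `upload_function` and `generic_handler` are hypothetical names, and the standard-library `marshal` module is used as a simple serializer (it handles only plain functions without closures, and its format depends on the Python version; a real system would likely use a richer serializer).

```python
import marshal
import types

storage = {}  # stand-in for S3: key -> stored object


def upload_function(key, func):
    """Serialize the function's bytecode and store it under key."""
    storage[key] = marshal.dumps(func.__code__)


def generic_handler(event):
    """The single registered function: load code and input, then execute."""
    code = marshal.loads(storage[event["code_key"]])
    func = types.FunctionType(code, globals())
    return func(storage[event["input_key"]])
```

The client uploads the code and the input once, then invokes `generic_handler` with the two keys; no new function has to be registered per job.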
SLIDE 19

Remote Storage Scalability

SLIDE 20

Semantics

  • Map is easy
  • Execute one function per element of the list
  • Map + single Reducer
  • E.g. parallel featurization + single-server ML
  • MapReduce
  • Many Lambdas needed, many small intermediate files
  • Use Redis, an in-memory key-value store
  • Parameter server
  • Use Redis
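
A rough sketch of the "Map + single Reducer" pattern, assuming a local thread pool stands in for parallel Lambda invocations. The names `serverless_map` and `featurize` are hypothetical, and the word-count feature is a toy example.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def serverless_map(func, items, max_workers=8):
    """Run func once per element; each call stands in for one Lambda invocation."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(func, items))


def featurize(text):
    """Toy per-element feature: number of words."""
    return len(text.split())


docs = ["a b c", "d e", "f"]
features = serverless_map(featurize, docs)     # map stage, one call per element
total = reduce(lambda a, b: a + b, features)   # single reducer on one machine
```

The map stage needs no communication between invocations, which is why it fits the serverless setting easily; a full MapReduce shuffle would need intermediate storage between the two stages.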
SLIDE 21

The Cost of Scaling Up

  • Using more nodes does not always imply higher cost
  • Lower latency → lower cost per node
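
The arithmetic behind this claim can be sketched as follows, assuming pay-per-time pricing and a hypothetical `efficiency` parameter (the fraction of ideal linear speedup achieved). The function name and all numbers are illustrative, not taken from the lecture.

```python
def job_cost(nodes, single_node_hours, price_per_node_hour, efficiency=1.0):
    """Total cost of a job spread over `nodes` machines.

    efficiency is the fraction of ideal linear speedup (1.0 means perfect scaling).
    """
    runtime_hours = single_node_hours / (nodes * efficiency)
    return nodes * runtime_hours * price_per_node_hour
```

With perfect scaling the total is the same for 1 node or 4 nodes, because each node runs for a quarter of the time. The total cost rises only to the extent that scaling is less than linear.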
SLIDE 22

Data Warehousing Architectures

SLIDE 23

Data Warehousing

  • Analytical (OLAP) relational queries
  • Different architectures
  • Snowflake: shared-disk + caching at compute nodes
  • Redshift: shared-nothing, store all data at compute nodes
  • Redshift Spectrum: serverless workers executing on-demand and reading from S3

  • Let’s discuss these architectures and compare them
SLIDE 24

Snowflake

  • Shared-disk architecture
  • Data is stored on S3, all nodes can access it
  • But nodes keep a distributed cache
  • Challenges
  • Heterogeneous workloads
  • No one-size-fits-all hardware configuration
  • Membership changes
  • Large data shuffles when a node fails/is removed
  • Online upgrade
  • It is similar to changing all the nodes in the system
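
To illustrate the membership-change challenge, here is a sketch of consistent hashing, one common way to limit how much cached data moves when a node leaves. The slide does not name this technique; it is assumed here for illustration, and `HashRing` is a hypothetical class.

```python
import bisect
import hashlib


def _hash(key):
    """Map a string to a large integer position on the ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class HashRing:
    """Consistent-hash ring mapping file names to cache nodes."""

    def __init__(self, nodes, replicas=64):
        # Each node owns several points on the ring to even out the load.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(replicas)
        )
        self._points = [h for h, _ in self._ring]

    def node_for(self, key):
        """Return the node owning the first ring point after the key's hash."""
        idx = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

When a node is removed from the ring, only the files that were assigned to that node change owner; files cached on the surviving nodes stay where they are.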
SLIDE 25

Snowflake Architecture

  • Data Storage
  • Based on S3: high throughput, high latency
  • Used also for intermediate data
  • Virtual Warehouses
  • Responsible for query execution
  • Stateless (restarted in their entirety)
  • Shared cache (low latency on hot data, most data cold)
  • Cloud Services
  • Query parsing, access control, optimization
  • Snapshot isolation with multi-versioning
  • Metadata on external key-value store

SLIDE 26

Snowflake Advantages

  • Storage on S3 is cheaper
  • Use expensive local disk only for hot data
  • All services (except storage) are stateless
  • Simpler fault tolerance and membership change
SLIDE 27

Redshift

  • Classical shared-nothing architecture
  • Initially based on PostgreSQL but heavily re-optimized for OLAP

  • Runs on EC2, explicit provisioning
  • All data pre-loaded on instance storage
  • Query compilation
  • S3 for backup only
SLIDE 28

Redshift Spectrum

  • Serverless query executor
  • Number of workers dynamically assigned
  • Stateless
  • Reads data directly from S3
  • Scale out to leverage storage and computation bandwidth

SLIDE 29
SLIDE 30

Comparison Setup

  • Benchmark: TPC-H
  • 1 TB uncompressed data
  • 1 execution of the query suite
  • Configuration
  • Default: 4 nodes, memory optimized (r4.8xlarge)
  • Redshift: analogous node that offers SSD storage (dc2)
  • Athena: opaque
SLIDE 31

Comparison: Initialization Time

  • Paid every time we shut down and restart the system
  • Load metadata and (optionally) data

SLIDE 32

Comparison: Runtime

  • Pre-loading pays off
  • Initialization delay is easily amortized
  • Caching less helpful
  • Cost
  • Athena: pay for data scanned only
  • Other systems: mainly running time
  • Spectrum: scan + running time
SLIDE 33

Comparison: Execution Cost

  • Redshift (RS) can amortize loading costs
  • Athena
  • Serverless
  • Pay per amount of data scanned
  • RS Spectrum
  • Similar scheme as Athena
  • But must add RS cluster cost
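
The three pricing schemes above can be written as a small cost model. This is a sketch only: the function names are hypothetical, and any prices passed in are placeholders rather than actual AWS rates.

```python
def athena_cost(tb_scanned, price_per_tb):
    """Serverless: pay only for the amount of data scanned."""
    return tb_scanned * price_per_tb


def cluster_cost(nodes, hours, price_per_node_hour):
    """Provisioned cluster: pay for running time."""
    return nodes * hours * price_per_node_hour


def spectrum_cost(tb_scanned, price_per_tb, nodes, hours, price_per_node_hour):
    """Spectrum: scan charge plus the Redshift cluster that issues the query."""
    return athena_cost(tb_scanned, price_per_tb) + cluster_cost(
        nodes, hours, price_per_node_hour
    )
```

Under this model a scan-priced system costs nothing while idle, whereas a provisioned cluster keeps charging for as long as it runs, which is why amortizing load time over many queries matters for Redshift.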

SLIDE 34

Storage Cost Per Day

  • EBS very expensive
  • Instance storage + S3 backup cheaper

SLIDE 35

Pushing Down Computation?

  • One should always move computation to data
  • But disaggregated storage cannot compute!

[Figure: four COMPUTE nodes, each with local storage (LS), connected to a shared STORAGE SERVICE. Compute nodes run arbitrary computation; the storage service is read/write only.]

SLIDE 36

S3 Select

  • Computation on the storage layer
  • Simple selection and projection queries on structured data (e.g. CSV or Parquet)

  • Simple aggregations (e.g. sum)
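
The effect of storage-side selection and projection can be emulated locally. In this sketch an in-memory CSV string stands in for an S3 object and a Python predicate stands in for the SQL expression the real service accepts; the `select` function is hypothetical and is not the S3 Select API.

```python
import csv
import io

# Stand-in for a CSV object stored in S3.
CSV_OBJECT = "id,name,price\n1,apple,3\n2,pear,5\n3,plum,9\n"


def select(obj, columns, predicate):
    """Emulate storage-side selection (predicate) and projection (columns)."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.DictReader(io.StringIO(obj)):
        if predicate(row):
            writer.writerow([row[c] for c in columns])
    return out.getvalue()


# Only the matching rows and requested column cross the network.
pushed = select(CSV_OBJECT, ["name"], lambda r: int(r["price"]) > 4)
```

The benefit is that `pushed` is smaller than the full object, so less data travels from the storage service to the compute node.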
SLIDE 37

PushdownDB

  • Stateless query execution with S3 Select
  • Example: Bloom join
  • Standard hash join, but push down a Bloom filter to discard rows that will not join
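
A minimal sketch of the Bloom join idea, run entirely in memory. The `BloomFilter` class and `bloom_join` function are hypothetical, and the filter sizes are arbitrary; in the real setting the filtering step would be evaluated by the storage service rather than on the compute node.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter backed by a Python int used as a bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means definitely absent; True may be a false positive.
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def bloom_join(build_rows, probe_rows, key):
    """Hash join whose probe side is pre-filtered by a Bloom filter."""
    bloom = BloomFilter()
    table = {}
    for row in build_rows:
        bloom.add(row[key])
        table.setdefault(row[key], []).append(row)
    # This filter is the step that would be pushed down to storage.
    candidates = [r for r in probe_rows if bloom.might_contain(r[key])]
    # The hash-table lookup removes any false positives the filter let through.
    return [{**b, **p} for p in candidates for b in table.get(p[key], [])]
```

Because a Bloom filter never produces false negatives, the join result is the same as a plain hash join; the filter only reduces how many probe rows have to be transferred.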

SLIDE 38

TPC-H Results

  • Great speedups with S3 select