Data-Intensive Distributed Computing CS 431/631 451/651 (Fall 2019) - - PowerPoint PPT Presentation

data intensive distributed computing
SMART_READER_LITE
LIVE PREVIEW

Data-Intensive Distributed Computing CS 431/631 451/651 (Fall 2019) - - PowerPoint PPT Presentation

Data-Intensive Distributed Computing CS 431/631 451/651 (Fall 2019) Part 5: Analyzing Relational Data (1/3) October 10, 2019 Ali Abedi These slides are available at https://www.student.cs.uwaterloo.ca/~cs451 This work is licensed under a


slide-1
SLIDE 1

Data-Intensive Distributed Computing

Part 5: Analyzing Relational Data (1/3)

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details

CS 431/631 451/651 (Fall 2019) Ali Abedi October 10, 2019

These slides are available at https://www.student.cs.uwaterloo.ca/~cs451

1

slide-2
SLIDE 2

Structure of the Course

“Core” framework features and algorithm design

Analyzing Text Analyzing Graphs Analyzing Relational Data Data Mining

2

slide-3
SLIDE 3

Evolution of Enterprise Architectures

Next two sessions: techniques, algorithms, and

  • ptimizations for relational processing

3

slide-4
SLIDE 4

Monolithic Application

users

4

slide-5
SLIDE 5

Frontend Backend

users

5

slide-6
SLIDE 6

6

Edgar F. Codd

  • Inventor of the relational model for DBs
  • SQL was created based on his work
  • Turing award winner in 1981
slide-7
SLIDE 7

Frontend Backend

users database

7

slide-8
SLIDE 8

An organization should retain data that result from carrying

  • ut its mission and exploit those data to generate insights

that benefit the organization, for example, market analysis, strategic planning, decision making, etc.

Business Intelligence

8

slide-9
SLIDE 9

Frontend Backend

users database

BI tools

analysts

9

slide-10
SLIDE 10

Frontend Backend

users database

BI tools

analysts

Why is my application so slow? Why does my analysis take so long?

10

slide-11
SLIDE 11

Database Workloads

OLTP (online transaction processing)

Typical applications: e-commerce, banking, airline reservations User facing: real-time, low latency, highly-concurrent Tasks: relatively small set of “standard” transactional queries Data access pattern: random reads, updates, writes (small amounts of data)

OLAP (online analytical processing)

Typical applications: business intelligence, data mining Back-end processing: batch workloads, less concurrency Tasks: complex analytical queries, often ad hoc Data access pattern: table scans, large amounts of data per query

11

slide-12
SLIDE 12

OLTP and OLAP Together?

Downsides of co-existing OLTP and OLAP workloads

Poor memory management Conflicting data access patterns Variable latency

Solution?

users and analysts

12

slide-13
SLIDE 13

Source: Wikipedia (Warehouse)

Build a data warehouse!

13

slide-14
SLIDE 14

Frontend Backend

users

BI tools

analysts ETL

(Extract, Transform, and Load)

Data Warehouse OLTP database OLTP database for user- facing transactions OLAP database for data warehousing

14

slide-15
SLIDE 15

Customer Billing Order Inventory OrderLine

A Simple OLTP Schema

15

slide-16
SLIDE 16

Dim_Customer Dim_Date Dim_Product Fact_Sales Dim_Store

A Simple OLAP Schema

16

slide-17
SLIDE 17

ETL

Transform

Data cleaning and integrity checking Schema conversion Field transformations

When does ETL happen?

Extract Load

17

slide-18
SLIDE 18

Frontend Backend

users

BI tools

analysts ETL

(Extract, Transform, and Load)

Data Warehouse OLTP database

My data is a day old… Meh.

18

slide-19
SLIDE 19

Frontend Backend

users

BI tools

analysts ETL

(Extract, Transform, and Load)

Data Warehouse OLTP database

Frontend Backend

users

Frontend Backend

external APIs OLTP database OLTP database

19

slide-20
SLIDE 20

What do you actually do?

Dashboards Report generation Ad hoc analyses

20

slide-21
SLIDE 21

store product

slice and dice

Common operations

roll up/drill down pivot

OLAP Cubes

21

slide-22
SLIDE 22

OLAP Cubes: Challenges

Fundamentally, lots of joins, group-bys and aggregations

How to take advantage of schema structure to avoid repeated work?

Cube materialization

Realistic to materialize the entire cube? If not, how/when/what to materialize?

22

slide-23
SLIDE 23

Frontend Backend

users

BI tools

analysts ETL

(Extract, Transform, and Load)

Data Warehouse OLTP database

Frontend Backend

users

Frontend Backend

external APIs OLTP database OLTP database

23

slide-24
SLIDE 24

Fast forward…

24

slide-25
SLIDE 25

“On the first day of logging the Facebook clickstream, more than 400 gigabytes of data was collected. The load, index, and aggregation processes for this data set really taxed the Oracle data warehouse. Even after significant tuning, we were unable to aggregate a day

  • f clickstream data in less than 24 hours.”

Jeff Hammerbacher, Information Platforms and the Rise of the Data Scientist. In, Beautiful Data, O’Reilly, 2009. 25

slide-26
SLIDE 26

Frontend Backend

users

BI tools

analysts ETL

(Extract, Transform, and Load)

Data Warehouse OLTP database

Facebook context?

26

slide-27
SLIDE 27

Frontend Backend

users

BI tools

analysts ETL

(Extract, Transform, and Load)

Data Warehouse “OLTP”

Adding friends Updating profiles Likes, comments … Feed ranking Friend recommendation Demographic analysis …

27

slide-28
SLIDE 28

Frontend Backend

users analysts ETL

(Extract, Transform, and Load)

“OLTP”

PHP/MySQL

data scientists

Hadoop

  • r ELT?

28

slide-29
SLIDE 29

What’s changed?

Dropping cost of disks

Cheaper to store everything than to figure out what to throw away

29

slide-30
SLIDE 30

What’s changed?

Dropping cost of disks

Cheaper to store everything than to figure out what to throw away

Rise of social media and user-generated content

Large increase in data volume

Growing maturity of data mining techniques

Demonstrates value of data analytics

Types of data collected

From data that’s obviously valuable to data whose value is less apparent

30

slide-31
SLIDE 31

a useful service analyze user behavior to extract insights transform insights into action

$

(hopefully)

  • Google. Facebook. Twitter. Amazon. Uber.

Virtuous Product Cycle

31

slide-32
SLIDE 32

What do you actually do?

Dashboards Report generation Ad hoc analyses

“Descriptive” “Predictive”

Data products

32

slide-33
SLIDE 33

a useful service analyze user behavior to extract insights transform insights into action

$

(hopefully)

  • Google. Facebook. Twitter. Amazon. Uber.

data science data products

Virtuous Product Cycle

33

slide-34
SLIDE 34

“On the first day of logging the Facebook clickstream, more than 400 gigabytes of data was collected. The load, index, and aggregation processes for this data set really taxed the Oracle data warehouse. Even after significant tuning, we were unable to aggregate a day

  • f clickstream data in less than 24 hours.”

Jeff Hammerbacher, Information Platforms and the Rise of the Data Scientist. In, Beautiful Data, O’Reilly, 2009. 34

slide-35
SLIDE 35

Frontend Backend

users data scientists ETL

(Extract, Transform, and Load)

“OLTP” Hadoop

35

slide-36
SLIDE 36

Frontend Backend

users ETL

(Extract, Transform, and Load)

Hadoop Wait, so why not use a database to begin with?

The Irony…

“OLTP” data scientists

36

slide-37
SLIDE 37

Why not just use a database?

  • Scalability. Cost.

SQL is awesome

37

slide-38
SLIDE 38

Databases are great…

If your data has structure (and you know what the structure is) If you know what queries you’re going to run ahead of time If your data is reasonably clean

Databases are not so great…

If your data has little structure (or you don’t know the structure) If you don’t know what you’re looking for If your data is messy and noisy

38

slide-39
SLIDE 39

“there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are unknown unknowns – the ones we don't know we don't know…” – Donald Rumsfeld

Source: Wikipedia

39

slide-40
SLIDE 40

Databases are great…

If your data has structure (and you know what the structure is) If you know what queries you’re going to run ahead of time If your data is reasonably clean

Databases are not so great…

If your data has little structure (or you don’t know the structure) If you don’t know what you’re looking for If your data is messy and noisy

40

slide-41
SLIDE 41

Don’t need to know the schema ahead of time Many analyses are better formulated imperatively Raw scans are the most common operations Much faster data ingest rate

Advantages of Hadoop dataflow languages

41

slide-42
SLIDE 42

What do you actually do?

Dashboards Report generation Ad hoc analyses

“Descriptive” “Predictive”

Data products

42

slide-43
SLIDE 43

Frontend Backend

users

BI tools

analysts ETL

(Extract, Transform, and Load)

Data Warehouse OLTP database

Frontend Backend

users

Frontend Backend

external APIs OLTP database OLTP database

43

slide-44
SLIDE 44

Frontend Backend

users

Frontend Backend

users

Frontend Backend

external APIs

“Traditional” BI tools SQL on Hadoop Other tools

Data Warehouse “Data Lake” data scientists OLTP database ETL

(Extract, Transform, and Load)

OLTP database OLTP database

44

slide-45
SLIDE 45

Twitter’s data warehousing architecture (2012)

45

slide-46
SLIDE 46

~2010

~150 people total ~60 Hadoop nodes ~6 people use analytics stack daily

~2012

~1400 people total 10s of Ks of Hadoop nodes, multiple DCs 10s of PBs total Hadoop DW capacity ~100 TB ingest daily dozens of teams use Hadoop daily 10s of Ks of Hadoop jobs daily

46

slide-47
SLIDE 47

How does ETL actually happen?

Twitter’s data warehousing architecture (2012)

47

slide-48
SLIDE 48

Scribe Daemons (Production Hosts)

Main Hadoop DW

Main Datacenter Staging Hadoop Cluster

HDFS

Scribe Aggregators Scribe Daemons (Production Hosts) Datacenter Staging Hadoop Cluster

HDFS

Scribe Aggregators Scribe Daemons (Production Hosts) Datacenter Staging Hadoop Cluster

HDFS

Scribe Aggregators

Importing Log Data

48

slide-49
SLIDE 49

What’s Next?

Two developing trends…

49

slide-50
SLIDE 50

Frontend Backend

users database

BI tools

analysts

50

slide-51
SLIDE 51

Frontend Backend

users

BI tools

analysts ETL

(Extract, Transform, and Load)

Data Warehouse OLTP database

Frontend Backend

users

Frontend Backend

external APIs OLTP database OLTP database

51

slide-52
SLIDE 52

Frontend Backend

users

Frontend Backend

users

Frontend Backend

external APIs

“Traditional” BI tools SQL on Hadoop Other tools

Data Warehouse “Data Lake” data scientists OLTP database ETL

(Extract, Transform, and Load)

OLTP database OLTP database

My data is a day old… I refuse to accept that!

52

slide-53
SLIDE 53

ETL OLAP OLTP What if you didn’t have to do this?

53

slide-54
SLIDE 54

HTAP Hybrid Transactional/Analytical Processing (HTAP)

54

slide-55
SLIDE 55

Frontend Backend

users

Frontend Backend

users

Frontend Backend

external APIs

“Traditional” BI tools SQL on Hadoop Other tools

Data Warehouse “Data Lake” data scientists OLTP database ETL

(Extract, Transform, and Load)

OLTP database OLTP database

55

slide-56
SLIDE 56

Frontend Backend

users

Frontend Backend

users

Frontend Backend

external APIs

“Traditional” BI tools SQL on Hadoop Other tools

Data Warehouse “Data Lake” data scientists HTAP database ETL

(Extract, Transform, and Load)

HTAP database HTAP database Analytics

tools

data scientists

Analytics tools

data scientists

56

slide-57
SLIDE 57

Frontend Backend

users

Frontend Backend

users

Frontend Backend

external APIs

“Traditional” BI tools SQL on Hadoop Other tools

Data Warehouse “Data Lake” data scientists ETL

(Extract, Transform, and Load)

Everything In the cloud! IaaS / Load balance aaS

OLTP database OLTP database OLTP database

DBaaS (e.g., RDS) DBaaS (e.g., RedShift) S3 “Cloudified” tools ELT aaS

57