CS226 Big-Data Management (Instructor: Ahmed Eldawy)

SLIDE 1

CS226 Big-Data Management

Instructor: Ahmed Eldawy

SLIDE 2

Welcome (back) to UCR!

SLIDE 3

Class information

Classes: Monday, Wednesday, Friday 1:00 – 1:50 PM at Humanities and Social Sciences 1501
Instructor: Ahmed Eldawy
TA: Saheli Ghosh
Office hours: TBD
Website: http://www.cs.ucr.edu/~eldawy/19FCS226/
iLearn (Any UCRX students?)
Email: eldawy@ucr.edu
Subject: “[CS226] …”

SLIDE 4

Course work

Active participation in the class (5%)
Reading and review tasks (10%)
Assignments (20%)
Mid-term (15%)
Project (50%)

SLIDE 5

Project

Groups of 4-5 students

Milestones:
Group Selection
Project proposal (5%)
Literature survey (10%)
Report outline (5%)
Class presentation (5%)
Final report (15%)
Poster presentation (10%)

SLIDE 6

Course goals

What are your goals?
Understand what big data means
Identify the internal components of big data platforms
Recognize the differences between different big data platforms
Explain how a distributed query runs on big data

SLIDE 7

Super Hero

SLIDE 8

Big-data Expert

Understand how the big-data platforms really work
Control those thousands of processors efficiently to carry out your task

SLIDE 9

Syllabus

Overview of big data
Big-data storage
Big-data processing
Big-data indexing
Big-SQL processing
Programming packages

SLIDE 10

Introduction

SLIDE 13

Jan 2012: World Economic Forum Report

SLIDE 14

Interest in Big Data in the US

■ March 2012: Obama administration unveils BIG DATA initiative: $200 Million in R&D investment

■ June 2013: Washington Post is calling Obama “The Big Data President”

SLIDE 15

Interest in Big Data in Europe

March 2014: David Cameron and Angela Merkel discuss Big Data at a computer expo in Hannover, Germany

SLIDE 16

The Market of Big Data

SLIDE 17

Four (formerly Three) V’s of Big Data

SLIDE 18

Big Data Vs Big Computation

Full scans (e.g., log processing)
Range scans
Point lookups
Iterations
Joins (self, binary, or multiway)
Proximity queries
Closures and graph traversals

SLIDE 19

Big Data Applications

Web search
Marketing and advertising
Data cleaning
Knowledge base
Information retrieval
Internet of Things (IoT)
Visualization
Behavioral studies

SLIDE 20

Publicly Available Datasets

Data.gov
Data.gov.uk
Twitter Streaming API
Yahoo! Webscope [http://webscope.sandbox.yahoo.com/]
GDELT [http://www.gdeltproject.org/]
Instagram API

SLIDE 21

Big Data Landscape 2012

http://mattturck.com/2012/06/29/a-chart-of-the-big-data-ecosystem/

SLIDE 22

Big Data Landscape 2014

http://mattturck.com/2014/05/11/the-state-of-big-data-in-2014-a-chart/

SLIDE 23

Big Data Landscape 2016

http://mattturck.com/2016/02/01/big-data-landscape/

SLIDE 24

Big Data Landscape 2018

SLIDE 25

Components of Big Data

SLIDE 26

Storage of Big Data

Data is growing faster than Moore’s Law
Too much data to fit on a single machine
Partitioning
Replication
Fault-tolerance

SLIDE 27

Hadoop Distributed File System (HDFS)

The most widely used distributed file system
Fixed-size partitioning
3-way replication
Write-once read-many

Figure: a file stored as a sequence of blocks (128MB 128MB 128MB 128MB 128MB 128MB …)
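
The fixed-size partitioning and 3-way replication listed above can be sketched in a few lines. This is a toy model, not HDFS code: the function names, the node names, and the round-robin placement are assumptions made for illustration (real HDFS uses a rack-aware placement policy).

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # 128MB, the block size shown on the slide
REPLICATION = 3                 # 3-way replication

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the byte length of each fixed-size block; the last may be shorter."""
    n_blocks = math.ceil(file_size / block_size)
    return [min(block_size, file_size - i * block_size) for i in range(n_blocks)]

def place_replicas(n_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes (hypothetical round-robin policy)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return [[nodes[(b + r) % len(nodes)] for r in range(replication)]
            for b in range(n_blocks)]

# A 300MB file becomes two full blocks plus one shorter block.
blocks = split_into_blocks(300 * 1024 * 1024)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
```

Because every block lives on three different nodes, losing any single node still leaves two copies of each block available.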

SLIDE 28

Indexing

Data-aware organization
Global Index partitions the records into blocks
Local Indexes organize the records in a partition

Challenges:
Big volume
HDFS limitation
New programming paradigms
Ad-hoc indexes

Figure labels: global index, local indexes
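
A minimal sketch of the two-level idea, assuming one-dimensional integer keys: the global index is a sorted list of split points that routes a key to a partition, and the local index is simply a sorted list inside each partition. The class name `TwoLevelIndex` and the sample boundaries are hypothetical.

```python
from bisect import bisect_left

class TwoLevelIndex:
    def __init__(self, keys, boundaries):
        # Global index: sorted split points. Partition i holds keys k with
        # boundaries[i-1] < k <= boundaries[i]; the last partition holds the rest.
        self.boundaries = sorted(boundaries)
        self.partitions = [[] for _ in range(len(self.boundaries) + 1)]
        for k in keys:
            self.partitions[bisect_left(self.boundaries, k)].append(k)
        # Local index: here just a sorted list per partition.
        for p in self.partitions:
            p.sort()

    def contains(self, key):
        # Step 1: the global index picks the single partition to inspect.
        part = self.partitions[bisect_left(self.boundaries, key)]
        # Step 2: the local index finds the key inside that partition.
        i = bisect_left(part, key)
        return i < len(part) and part[i] == key

idx = TwoLevelIndex([5, 42, 17, 99, 63, 8], boundaries=[20, 60])
found = idx.contains(63)
missing = idx.contains(20)
```

A point lookup touches one partition instead of scanning all of them, which is the benefit of the global index on a distributed file system.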

SLIDE 29

Fault Tolerance

Replication
Redundancy
Multiple masters

SLIDE 30

Streaming

Sub-second latency for queries
One scan over the data
(Partial) preprocessing
Continuous queries
Eviction strategies
In-memory indexes

Figure: a processing window over the stream …1000100010101011101110101010110111010111011101110100…
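
As an illustration of a continuous query with eviction, the sketch below keeps a count of 1 bits over the most recent items of a stream. The class name `SlidingWindowCounter`, the window size of 4, and the choice of query are assumptions; the input is the first 11 bits of the stream shown on the slide.

```python
from collections import deque

class SlidingWindowCounter:
    """Continuous query: number of 1 bits among the last `size` stream items."""

    def __init__(self, size):
        self.size = size
        self.window = deque()  # in-memory state, bounded by the window size
        self.ones = 0

    def push(self, bit):
        # Each item is seen exactly once (one scan over the data).
        self.window.append(bit)
        self.ones += bit
        if len(self.window) > self.size:
            # Eviction: the oldest item leaves the window.
            self.ones -= self.window.popleft()
        return self.ones

counter = SlidingWindowCounter(size=4)
results = [counter.push(int(b)) for b in "10001000101"]
```

The answer is updated in constant time per item, so the query result is always available without rescanning the stream.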

SLIDE 31

Task Execution

MapReduce
  Map-Shuffle-Reduce
  Resiliency through materialization

Resilient Distributed Datasets (RDD)
  Directed-Acyclic-Graph (DAG)
  In-memory processing
  Resiliency through lineages

Hyracks
Stragglers
Load balance

Figure: mappers M1, M2, …, Mm and reducers R1, R2, …, Rn
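
The Map-Shuffle-Reduce pattern can be sketched as three plain functions. This single-process word count only shows the data flow; it has no distribution, materialization, or fault tolerance, and the function names are chosen for illustration.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line.
    return [(word, 1) for line in lines for word in line.split()]

def shuffle_phase(pairs):
    # Shuffle: group all values by key so one reducer sees all values of a key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine the values of each key into a single result.
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data", "big computation", "data data"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
```

In a real cluster the map and reduce functions run on many machines in parallel, and the shuffle is the step that moves data across the network.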

SLIDE 32

Query Optimization

Finding the most efficient query plan
e.g., grouped aggregation
Cost model (CPU – Disk – Network)

Figure: two plans for grouped aggregation: (Agg, Agg, Agg with Merge, Merge) vs. (Partition, Partition, Partition with Agg, Agg)
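
To make the grouped-aggregation example concrete, the sketch below runs a count-per-key query under two plans and reports a toy network cost. The cost measure (number of records leaving a node, assuming every record is shipped) and the function names are simplifying assumptions, not the cost model of any particular system.

```python
from collections import Counter

def plan_agg_then_merge(partitions):
    """Aggregate locally on each node, then merge the partial results."""
    partials = [Counter(p) for p in partitions]   # local Agg on each node
    shipped = sum(len(c) for c in partials)       # one record per local group
    merged = Counter()
    for c in partials:                            # Merge
        merged.update(c)
    return dict(merged), shipped

def plan_partition_then_agg(partitions):
    """Repartition the raw records by key, then aggregate once per key."""
    shipped = sum(len(p) for p in partitions)     # every raw record is shipped
    result = Counter(k for p in partitions for k in p)  # Agg after Partition
    return dict(result), shipped

data = [["a", "a", "b"], ["a", "c", "c"], ["b", "b", "b"]]
result1, cost1 = plan_agg_then_merge(data)
result2, cost2 = plan_partition_then_agg(data)
```

Both plans return the same answer, but on this input the first ships 5 records and the second ships 9. When almost every key is distinct, local aggregation saves little, which is why an optimizer needs a cost model to choose.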

SLIDE 33

Provenance

Debugging in distributed systems is painful
We need to keep track of transformations on each record
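
One simple way to keep track of transformations on each record is to carry a history alongside the value. The sketch below is a hypothetical illustration (the `Traced` class and the helper names are invented here), not the provenance mechanism of any specific platform.

```python
from dataclasses import dataclass, field

@dataclass
class Traced:
    value: object
    history: list = field(default_factory=list)  # names of the steps applied so far

def transform(records, name, fn):
    """Apply fn to every record and append the step name to its history."""
    return [Traced(fn(r.value), r.history + [name]) for r in records]

def keep(records, name, pred):
    """Filter records; survivors remember that they passed this step."""
    return [Traced(r.value, r.history + [name]) for r in records if pred(r.value)]

records = [Traced(v) for v in [1, 2, 3]]
records = transform(records, "double", lambda x: x * 2)
records = keep(records, "gt_2", lambda x: x > 2)
```

When an output value looks wrong, its `history` shows which steps produced it, which is the kind of information that is hard to recover after the fact in a distributed job.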

SLIDE 34

Big Graphs

Motivated by social networks
Billions of nodes and trillions of edges
Tens of thousands of insertions per second
Complex queries with graph traversals
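
A typical traversal query on a social graph is "who is within k hops of this user". The sketch below answers it with a breadth-first search over an in-memory adjacency list; the tiny graph and the function name `within_hops` are made up for illustration, and a real big-graph system would have to distribute this across machines.

```python
from collections import deque

def within_hops(graph, start, max_hops):
    """Breadth-first traversal: nodes reachable from start in at most max_hops edges."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for nbr in graph.get(node, ()):
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist

graph = {"ann": ["bob", "cat"], "bob": ["dan"], "cat": ["dan"], "dan": ["eve"]}
friends = within_hops(graph, "ann", 2)
```

Each extra hop can multiply the number of visited nodes, which is why traversals over billions of nodes are expensive.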

SLIDE 35

Hadoop Ecosystem

Hadoop Distributed File System (HDFS)
Yet Another Resource Negotiator (YARN)
MapReduce
Query Engine
Administration
Pig

SLIDE 36

Spark Ecosystem

Hadoop Distributed File System (HDFS)
Yet Another Resource Negotiator (YARN)
Kubernetes
Resilient Distributed Dataset (RDD) a.k.a. Spark Core
Data Frames
MLlib
GraphX
SparkR
Spark Streaming
Spark SQL
SLIDE 37

Hyracks Data-parallel Platform
Algebricks Algebra Layer
Hadoop MapReduce Compatibility
Pregelix
Hivesterix
AsterixDB
Other compilers
Hyracks jobs
Pregel Jobs
MapReduce Jobs
PigLatin
HiveQL
AsterixQL

SLIDE 38

Impala

Hadoop Distributed File System (HDFS)
Yet Another Resource Negotiator (YARN)
Query Executor
Query Planner
Query Parser

SLIDE 39

SpatialHadoop

Hadoop Distributed File System (HDFS) + Spatial Indexing
Yet Another Resource Negotiator (YARN)
MapReduce Processing + Spatial Query Processing
Spatial Visualization
Pig Latin + Pigeon

SLIDE 40

Reading Material

“The Age of Analytics in a Data-driven World” [Executive Summary] by McKinsey & Company
