of Big Data 10/01/2018 25 Storage of Big Data Data is growing - - PowerPoint PPT Presentation


SLIDE 1

Components of Big Data

10/01/2018 25

SLIDE 2

Storage of Big Data

  • Data is growing faster than Moore’s Law
  • Too much data to fit on a single machine
  • Partitioning
  • Replication
  • Fault-tolerance

SLIDE 3

Hadoop Distributed File System (HDFS)

  • The most widely used distributed file system
  • Fixed-size partitioning
  • 3-way replication
  • Write-once read-many

[Figure: a file divided into fixed-size blocks: 128MB 128MB 128MB 128MB 128MB 128MB …]
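As a rough illustration of fixed-size partitioning and 3-way replication, the sketch below splits a file of a given size into 128 MB blocks and assigns each block to three nodes. The function names, the node names, and the round-robin placement are assumptions made for this example; real HDFS uses a rack-aware placement policy that is not modeled here.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the block size shown on the slide
REPLICATION = 3                 # 3-way replication, as listed on the slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering a file of `file_size` bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        # Every block is full-sized except possibly the last one.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes.

    Round-robin placement is a simplifying assumption for illustration only.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return [
        [nodes[(b + r) % len(nodes)] for r in range(replication)]
        for b in range(num_blocks)
    ]
```

For example, a 300 MB file yields two full 128 MB blocks and one 44 MB block, and losing any single node still leaves two copies of every block.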

SLIDE 4

Indexing

  • Data-aware organization
  • Global Index partitions the records into blocks
  • Local Indexes organize the records in a partition
  • Challenges:
      • Big volume
      • HDFS limitation
      • New programming paradigms
      • Ad-hoc indexes

[Figure: global index, local indexes]
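The global/local split can be sketched as a two-level lookup: a global index picks the partition, and a local index finds the record inside it. The class below is a hypothetical in-memory example that assumes unique, sortable keys and range partitioning; it is not how any particular system implements its indexes.

```python
import bisect


class TwoLevelIndex:
    """Hypothetical two-level index over (key, value) records with unique keys."""

    def __init__(self, records, num_partitions):
        if not records or num_partitions < 1:
            raise ValueError("need at least one record and one partition")
        ordered = sorted(records, key=lambda r: r[0])
        size = -(-len(ordered) // num_partitions)  # ceiling division
        # Range partitioning: each partition holds a contiguous run of keys.
        self.partitions = [ordered[i:i + size] for i in range(0, len(ordered), size)]
        # Global index: the smallest key stored in each partition.
        self.boundaries = [p[0][0] for p in self.partitions]
        # Local index: the sorted keys of each partition.
        self.local_keys = [[k for k, _ in p] for p in self.partitions]

    def lookup(self, key):
        # Global step: find the last partition whose smallest key is <= key.
        p = bisect.bisect_right(self.boundaries, key) - 1
        if p < 0:
            return None
        # Local step: binary search inside that single partition.
        keys = self.local_keys[p]
        i = bisect.bisect_left(keys, key)
        if i < len(keys) and keys[i] == key:
            return self.partitions[p][i][1]
        return None
```

A lookup touches only one partition, which is the point of the global index when each partition is a separate block on a different machine.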

SLIDE 5

Fault Tolerance

  • Replication
  • Redundancy
  • Multiple masters

SLIDE 6

Streaming

  • Sub-second latency for queries
  • One scan over the data
  • (Partial) preprocessing
  • Continuous queries
  • Eviction strategies
  • In-memory indexes

[Figure: a processing window over a bit stream …1000100010101011101110101010110111010111011101110100…]
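A continuous query over a processing window can be sketched as follows. The example keeps a running sum of the most recent items, sees each item once, and evicts the oldest item when the window is full. The class name and the count-based eviction rule are assumptions chosen for illustration; time-based windows are another common choice.

```python
from collections import deque


class SlidingWindowSum:
    """Continuous query: running sum over the most recent `size` items."""

    def __init__(self, size):
        self.size = size
        self.window = deque()  # in-memory state holding the current window
        self.total = 0

    def push(self, x):
        """Consume one stream item and return the updated answer."""
        self.window.append(x)
        self.total += x
        if len(self.window) > self.size:
            # Eviction: drop the oldest item and subtract its contribution.
            self.total -= self.window.popleft()
        return self.total
```

Each update costs constant time and the state never exceeds the window size, so the answer is always available without rescanning the stream.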

SLIDE 7

Task Execution

  • MapReduce
      • Map-Shuffle-Reduce
      • Resiliency through materialization
  • Resilient Distributed Datasets (RDD)
      • Directed-Acyclic-Graph (DAG)
      • In-memory processing
      • Resiliency through lineages
  • Hyracks
  • Stragglers
  • Load balance

[Figure: map tasks M1 M2 … Mm and reduce tasks R1 R2 Rn]
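The Map-Shuffle-Reduce pattern can be sketched on a single machine with word count as the job. All function names here are hypothetical; a real framework runs the map and reduce calls on different machines and writes intermediate results to disk, which is what "resiliency through materialization" refers to.

```python
from collections import defaultdict


def map_phase(documents, map_fn):
    """Apply map_fn to every input and collect the emitted (key, value) pairs."""
    pairs = []
    for doc in documents:
        pairs.extend(map_fn(doc))
    return pairs


def shuffle_phase(pairs):
    """Group values by key so each key is handled by exactly one reduce call."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reduce_fn):
    """Apply reduce_fn to each key and its list of values."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def word_count(documents):
    pairs = map_phase(documents, lambda doc: [(w, 1) for w in doc.split()])
    groups = shuffle_phase(pairs)
    return reduce_phase(groups, lambda word, counts: sum(counts))
```

Calling `word_count(["big data", "big graphs"])` counts "big" twice and the other two words once each.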

SLIDE 8

Query Optimization

  • Finding the most efficient query plan
  • e.g., grouped aggregation
  • Cost model (CPU – Disk – Network)

[Figure: two alternative plans for grouped aggregation compared ("Vs"), built from Agg, Merge, and Partition operators]
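To make the grouped-aggregation example concrete, the sketch below evaluates the same query with two plans and counts how many records cross the network in each. The plan names, the record-count cost measure, and the input layout are assumptions for this illustration; a real optimizer would also weigh CPU and disk costs.

```python
from collections import Counter


def plan_partition_then_agg(partitions, num_reducers):
    """Plan A: ship every record to a reducer chosen by key hash, then aggregate."""
    reducers = [Counter() for _ in range(num_reducers)]
    shipped = 0
    for part in partitions:
        for key, value in part:
            reducers[hash(key) % num_reducers][key] += value
            shipped += 1  # every input record crosses the network
    result = Counter()
    for r in reducers:
        result.update(r)  # keys are disjoint across reducers
    return dict(result), shipped


def plan_local_agg_then_merge(partitions):
    """Plan B: aggregate inside each partition first, then merge partial sums."""
    merged = Counter()
    shipped = 0
    for part in partitions:
        local = Counter()
        for key, value in part:
            local[key] += value
        shipped += len(local)  # only one partial sum per distinct key is shipped
        merged.update(local)
    return dict(merged), shipped
```

Both plans return the same sums, but when keys repeat within a partition the second plan ships fewer records, so a cost model that prices network transfer would prefer it.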

SLIDE 9

Provenance

  • Debugging in distributed systems is painful
  • We need to keep track of transformations on each record
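One simple way to keep track of transformations on each record is to carry a history alongside the value. The `TracedRecord` class below is a hypothetical sketch of that idea, not the mechanism of any specific provenance system.

```python
from dataclasses import dataclass, field


@dataclass
class TracedRecord:
    """A value plus the list of (step name, input value) pairs that produced it."""
    value: object
    history: list = field(default_factory=list)

    def apply(self, name, fn):
        # Return a new record; the old one is left unchanged so that
        # earlier stages can still be inspected while debugging.
        return TracedRecord(fn(self.value), self.history + [(name, self.value)])
```

After `TracedRecord(" 42 ").apply("strip", str.strip).apply("to_int", int)`, the history lists each step with the value it received, which shows where a bad record went wrong.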

SLIDE 10

Big Graphs

  • Motivated by social networks
  • Billions of nodes and trillions of edges
  • Tens of thousands of insertions per second
  • Complex queries with graph traversals
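A query with a graph traversal can be sketched as a bounded breadth-first search, for instance "everyone within two hops of a user". The `Graph` class, its undirected edges, and the in-memory adjacency sets are assumptions for this small example; at the scale described above the graph would be partitioned across machines.

```python
from collections import defaultdict, deque


class Graph:
    """Hypothetical in-memory undirected graph stored as adjacency sets."""

    def __init__(self):
        self.adj = defaultdict(set)

    def add_edge(self, u, v):
        # Insertions are cheap: two set updates per edge.
        self.adj[u].add(v)
        self.adj[v].add(u)

    def within_hops(self, start, max_hops):
        """Return all nodes reachable from `start` in at most `max_hops` edges."""
        seen = {start}
        frontier = deque([(start, 0)])
        while frontier:
            node, dist = frontier.popleft()
            if dist == max_hops:
                continue  # do not expand beyond the hop limit
            for neighbor in self.adj.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    frontier.append((neighbor, dist + 1))
        return seen - {start}
```

On a chain a-b-c-d, `within_hops("a", 2)` returns b and c but not d.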

SLIDE 11

Hadoop Ecosystem

  • Hadoop Distributed File System (HDFS)
  • Yet Another Resource Negotiator (YARN)
  • MapReduce
  • Query Engine
  • Administration
  • Pig

SLIDE 12

Spark Ecosystem

  • Hadoop Distributed File System (HDFS)
  • Yet Another Resource Negotiator (YARN)
  • Kubernetes
  • Resilient Distributed Dataset (RDD), a.k.a. Spark Core
  • Data Frames
  • MLlib
  • GraphX
  • SparkR
  • Spark Streaming
  • Spark SQL

SLIDE 13

  • Hyracks Data-parallel Platform
  • Algebricks Algebra Layer
  • Hadoop MapReduce Compatibility
  • Pregelix
  • HiveSterix
  • AsterixDB
  • Other compilers
  • Hyracks jobs
  • Pregel Jobs
  • MapReduce Jobs
  • PigLatin
  • HiveQL
  • AsterixQL

SLIDE 14

Impala

  • Hadoop Distributed File System (HDFS)
  • Yet Another Resource Negotiator (YARN)
  • Query Executor
  • Query Planner
  • Query Parser

SLIDE 15

SpatialHadoop

  • Hadoop Distributed File System (HDFS) + Spatial Indexing
  • Yet Another Resource Negotiator (YARN)
  • MapReduce Processing + Spatial Query Processing
  • Spatial Visualization
  • Pig Latin + Pigeon

SLIDE 16

Reading Material

“The Age of Analytics in a Data-driven World” [Executive Summary] by McKinsey & Company
