Big Data Processing Techniques Chentao Wu Associate Professor - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Big Data Processing Techniques

Chentao Wu Associate Professor

  • Dept. of Computer Science and Engineering

wuct@cs.sjtu.edu.cn

slide-2
SLIDE 2

Schedule

  • lec1: Introduction on big data and cloud computing
  • lec2: Introduction on data storage
  • lec3: Data reliability (Replication/Archive/EC)
  • lec4: Data consistency problem
  • lec5: Block level storage and file storage
  • lec6: Object-based storage
  • lec7: Distributed file system
  • lec8: Metadata management
slide-3
SLIDE 3

Final Grade

  • Attendance 20%
  • Projects 80%
  • Projects will be given in the following classes.
  • Place: Room 317, SEIEE-4th Building
  • Time: 8:00-11:40
  • Date: Friday of 1st, 2nd, 3rd, 5th week
slide-4
SLIDE 4

Collaborators

slide-5
SLIDE 5

Contents

Introduction to Big Data

1

slide-6
SLIDE 6

Big Data Definition

  • No single standard definition…

“Big Data” is data whose scale, diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it…

slide-7
SLIDE 7

Types of Data

  • Structured
  • Semi-Structured/Quasi-Structured/Unstructured

Unstructured

  • Data that has no inherent structure and is usually stored as different types of files.
  • E.g. Text documents, PDFs, images, and videos

Quasi-Structured

  • Textual data with erratic formats that can be formatted with effort and software tools
  • E.g. Clickstream data

Semi-Structured

  • Textual data files with an apparent pattern, enabling analysis
  • E.g. Spreadsheets and XML files

Structured

  • Data having a defined data model, format, structure
  • E.g. Database
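The four categories can be illustrated with a minimal sketch that reads one tiny sample of each kind. The sample records, the table name `users`, and the clickstream line format are all assumptions made up for illustration; only Python standard-library calls are used.

```python
import re
import sqlite3
import xml.etree.ElementTree as ET

# Structured: a database table with a defined data model (hypothetical "users" table).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])
structured_names = [row[0] for row in conn.execute("SELECT name FROM users ORDER BY id")]

# Semi-structured: XML tags expose an apparent pattern that a parser can follow.
semi = "<users><user><name>Alice</name></user><user><name>Bob</name></user></users>"
semi_names = [user.findtext("name") for user in ET.fromstring(semi).iter("user")]

# Quasi-structured: an erratic text format that needs a hand-written pattern
# (the log line layout here is invented).
click = "2020-01-01 10:00:01 GET /cart?item=42 ua=Mozilla"
match = re.search(r"item=(\d+)", click)
item_id = int(match.group(1)) if match else None

# Unstructured: no inherent structure, so only generic operations apply.
text = "Big data requires new techniques to extract value."
word_count = len(text.split())

print(structured_names, semi_names, item_id, word_count)
```

The amount of custom parsing code grows as the structure weakens: the table is queried by column name, the XML by tag, the clickstream line by a regular expression, and the free text only by generic splitting.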

Increasing Growth

slide-8
SLIDE 8

Characteristics of big data

(1-Scale: Volume)

  • Data Volume
  • 44x increase from 2009 to 2020
  • From 0.8 ZettaBytes to 44 ZB
  • Data volume is increasing exponentially

Exponential increase in collected/generated data
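To make "exponential" concrete, the following sketch takes the 44x figure at face value and derives the implied constant yearly growth factor and doubling time. It assumes smooth compound growth over the whole period, which is a simplification.

```python
import math

GROWTH_FACTOR = 44        # overall increase quoted on the slide
YEARS = 2020 - 2009       # length of the period in years

# Constant yearly factor r such that r ** YEARS == GROWTH_FACTOR.
annual_factor = GROWTH_FACTOR ** (1 / YEARS)

# Time needed for the volume to double at that yearly factor.
doubling_time = math.log(2) / math.log(annual_factor)

print(f"annual growth factor: {annual_factor:.2f}")
print(f"doubling time: {doubling_time:.1f} years")
```

Under this assumption the data volume grows by roughly 41 percent per year, which corresponds to doubling about every two years.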

slide-9
SLIDE 9

Characteristics of big data

(2-Complexity: Variety)

  • Various formats, types, and structures
  • Text, numerical, images, audio, video, sequences, time series, social media data, multi-dim arrays, etc.
  • Static data vs. streaming data
  • A single application can be generating/collecting many types of data

To extract knowledge, all these types of data need to be linked together

slide-10
SLIDE 10

Characteristics of big data

(3-Speed: Velocity)

  • Data is being generated fast and needs to be processed fast
  • Online Data Analytics
  • Late decisions → missing opportunities
  • Examples
  • E-Promotions: Based on your current location, your purchase history, what you like → send promotions right now for the store next to you
  • Healthcare monitoring: sensors monitoring your activities and body → any abnormal measurements require immediate reaction
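The healthcare example can be sketched as a stream processor that reacts to each reading as it arrives instead of waiting for a batch. The function name `detect_abnormal`, the heart-rate samples, and the 50 to 120 range are hypothetical values chosen for illustration only.

```python
from typing import Iterable, Iterator, Tuple

Reading = Tuple[int, float]  # (timestamp in seconds, measured value)

def detect_abnormal(readings: Iterable[Reading],
                    low: float = 50.0,
                    high: float = 120.0) -> Iterator[Reading]:
    """Yield each out-of-range reading as soon as it is seen."""
    for timestamp, value in readings:
        if value < low or value > high:
            yield timestamp, value

# A made-up heart-rate stream; in practice this would be an unbounded source.
heart_rate_stream = [(0, 72.0), (1, 75.0), (2, 140.0), (3, 71.0), (4, 45.0)]

alerts = list(detect_abnormal(heart_rate_stream))
print(alerts)
```

Because `detect_abnormal` is a generator, an alert is produced the moment the abnormal reading is consumed, which is the property the slide calls out: a late decision is a missed opportunity.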

slide-11
SLIDE 11

Big Data (3Vs)

slide-12
SLIDE 12

Big Data (4Vs)

slide-13
SLIDE 13

Big Data (5Vs/6Vs)

Volume

  • Massive volumes of data
  • Challenges in storage and analysis

Velocity

  • Rapidly changing data
  • Challenges in real-time analysis

Variety

  • Diverse data from numerous sources
  • Challenges in integration and analysis

Variability

  • Constantly changing meaning of data
  • Challenges in gathering and interpretation

Veracity

  • Varying quality and reliability of data
  • Challenges in transforming and trusting data

Value

  • Cost-effectiveness and business value

slide-14
SLIDE 14

Harnessing Big Data

  • OLTP: Online Transaction Processing (DBMSs)
  • OLAP: Online Analytical Processing (Data Warehousing)
  • RTAP: Real-Time Analytics Processing (Big Data Architecture & technology)
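The difference between the three processing styles can be sketched on one small data set. This is an illustration using an in-memory SQLite database, not a description of any particular product; the `orders` table and the sample events are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")

events = [("east", 20.0), ("west", 35.0), ("east", 15.0)]
running_totals = {}

for region, amount in events:
    # OLTP style: one short transaction per business event, touching a single row.
    with conn:
        conn.execute("INSERT INTO orders (region, amount) VALUES (?, ?)", (region, amount))
    # RTAP style: keep the analytic result up to date as each event arrives.
    running_totals[region] = running_totals.get(region, 0.0) + amount

# OLAP style: one read-only query that scans and aggregates the accumulated rows.
batch_totals = dict(conn.execute("SELECT region, SUM(amount) FROM orders GROUP BY region"))

print(batch_totals, running_totals)
```

Both aggregates end up equal; the difference is when the answer is available. The OLAP query produces it only when it is run over the stored data, while the running totals are current after every event.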
slide-15
SLIDE 15

Who’s Generating Big Data

  • Social media and networks (all of us are generating data)
  • Scientific instruments (collecting all sorts of data)
  • Mobile devices (tracking all objects all the time)
  • Sensor technology and networks (measuring all kinds of data)

  • Progress and innovation are no longer hindered by the ability to collect data
  • But by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion

slide-16
SLIDE 16

The Model Has Changed…

  • The Model of Generating/Consuming Data has Changed

Old Model: Few companies are generating data, all others are consuming data

New Model: All of us are generating data, and all of us are consuming data

slide-17
SLIDE 17

What’s driving Big Data

  • Ad-hoc querying and reporting
  • Data mining techniques
  • Structured data, typical sources
  • Small to mid-size datasets
  • Optimizations and predictive analytics
  • Complex statistical analysis
  • All types of data, and many sources
  • Very large datasets
  • More of a real-time
slide-18
SLIDE 18

Value of Big Data Analytics

  • Big data is more real-time in nature than traditional DW applications
  • Traditional DW architectures (e.g. Exadata, Teradata) are not well-suited for big data apps
  • Shared nothing, massively parallel processing, scale out architectures are well-suited for big data apps
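A minimal sketch of the shared-nothing idea: records are hash-partitioned so that each node owns a disjoint slice of the data, every node aggregates its own slice independently, and only the small partial results are merged. Threads stand in for separate machines here, and `NUM_NODES`, `node_for`, and the sample records are assumptions for illustration.

```python
import zlib
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

NUM_NODES = 4  # scaling out means raising this number, not buying a bigger machine

def node_for(key: str) -> int:
    """Deterministically assign a key to one node (hash partitioning)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_NODES

def local_aggregate(partition):
    """Work done by one node using only its own data."""
    totals = Counter()
    for key, value in partition:
        totals[key] += value
    return totals

records = [("east", 20), ("west", 35), ("east", 15), ("north", 5), ("west", 10)]

# Distribute the records; no partition is visible to any other node.
partitions = [[] for _ in range(NUM_NODES)]
for key, value in records:
    partitions[node_for(key)].append((key, value))

# Each simulated node aggregates in parallel.
with ThreadPoolExecutor(max_workers=NUM_NODES) as pool:
    partials = list(pool.map(local_aggregate, partitions))

# Merge the per-node results into the final answer.
result = Counter()
for partial in partials:
    result.update(partial)

print(dict(result))
```

Because all records with the same key land on the same node, the merge step never has to reconcile conflicting partial sums for a key, which keeps coordination between nodes minimal.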

slide-19
SLIDE 19

Challenges in Handling Big Data

  • The bottleneck is in technology
  • New architecture, algorithms, techniques are needed
  • Also in technical skills
  • Experts in using the new technology and dealing with big data

slide-20
SLIDE 20

Big Data Landscape

slide-21
SLIDE 21

Big Data Technology

slide-22
SLIDE 22

Contents

Introduction to Cloud Computing

2

slide-23
SLIDE 23

What is Cloud Computing?

  • A cloud is a collection of network-accessible hardware and software resources
  • Consists of IT resource pools deployed in data centers
  • Cloud model enables consumers to hire IT resources as services

A model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., servers, storage, networks, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.

– U.S. National Institute of Standards and Technology, Special Publication 800-145

Cloud Computing

slide-24
SLIDE 24

What is Cloud Computing? (Cont'd)

Cloud Infrastructure

Applications Platform Software Network Compute Storage

LAN/WAN

Laptop Tablet and Mobile Desktop

slide-25
SLIDE 25

Essential Cloud Characteristics

1) On-demand self-service

2) Broad Network Access

3) Resource Pooling

4) Rapid Elasticity

5) Measured Service
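Several of these characteristics can be pictured with a toy model of a shared pool. The class `ResourcePool`, its method names, and the tenant names are hypothetical; this is a sketch of the concepts, not any provider's API.

```python
class ResourcePool:
    """A shared pool of vCPUs serving several tenants (resource pooling)."""

    def __init__(self, total_vcpus: int):
        self.total_vcpus = total_vcpus
        self.allocated = {}  # tenant -> vCPUs currently held
        self.usage = {}      # tenant -> vCPU-hours consumed (measured service)

    def free(self) -> int:
        return self.total_vcpus - sum(self.allocated.values())

    def provision(self, tenant: str, vcpus: int) -> None:
        """On-demand self-service: the tenant asks, no operator is involved."""
        if vcpus > self.free():
            raise RuntimeError("pool exhausted")
        self.allocated[tenant] = self.allocated.get(tenant, 0) + vcpus

    def release(self, tenant: str, vcpus: int) -> None:
        """Rapid elasticity: capacity is returned as quickly as it was taken."""
        current = self.allocated.get(tenant, 0)
        if vcpus > current:
            raise ValueError("cannot release more than is allocated")
        self.allocated[tenant] = current - vcpus

    def tick(self, hours: int = 1) -> None:
        """Meter usage so each tenant can be billed for what it actually held."""
        for tenant, vcpus in self.allocated.items():
            self.usage[tenant] = self.usage.get(tenant, 0) + vcpus * hours

pool = ResourcePool(total_vcpus=16)
pool.provision("P", 4)
pool.provision("Q", 8)
pool.tick()
pool.provision("P", 4)   # P scales out for a busy hour
pool.tick()
pool.release("P", 6)     # and scales back in afterwards
pool.tick()

print(pool.usage, pool.free())
```

Tenant P is metered for 4, then 8, then 2 vCPUs over the three hours, so its bill follows its actual consumption rather than its peak.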

Cloud Infrastructure

slide-26
SLIDE 26

Cloud Service Models

1) Infrastructure as a Service (IaaS)

2) Platform as a Service (PaaS)

3) Software as a Service (SaaS)

Cloud Infrastructure

slide-27
SLIDE 27

Infrastructure as a Service

The capability provided to the consumer is to provision processing, storage, networks, and other fundamental computing resources where the consumer is able to deploy and run arbitrary software, which can include operating systems and applications. The consumer does not manage or control the underlying cloud infrastructure but has control over operating systems, storage, and deployed applications; and possibly limited control of select networking components (e.g., host firewalls).

– U.S. National Institute of Standards and Technology, Special Publication 800-145

Infrastructure as a Service

Cloud Infrastructure

Provider’s Resources Consumer’s Resources

slide-28
SLIDE 28

Platform as a Service

Cloud Infrastructure

Provider’s Resources Consumer’s Resources

The capability provided to the consumer is to deploy onto the cloud infrastructure consumer-created or acquired applications created using programming languages, libraries, services, and tools supported by the provider. The consumer does not manage or control the underlying cloud infrastructure including network, servers, operating systems, or storage, but has control over the deployed applications and possibly configuration settings for the application-hosting environment.

– U.S. National Institute of Standards and Technology, Special Publication 800-145

Platform as a Service

slide-29
SLIDE 29

Software as a Service

Cloud Infrastructure

Provider’s Resources

The capability provided to the consumer is to use the provider’s applications running on a cloud infrastructure. The applications are accessible from various client devices through either a thin client interface, such as a web browser (e.g., web-based email), or a program interface. The consumer does not manage or control the underlying cloud infrastructure including network, servers, operating systems, storage, or even individual application capabilities, with the possible exception of limited user-specific application configuration settings.

– U.S. National Institute of Standards and Technology, Special Publication 800-145
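The three NIST definitions differ mainly in which layers the consumer controls. The sketch below encodes that split as a lookup table. The layer names and the function `who_controls` are illustrative, and the table deliberately simplifies: it ignores the "possibly limited" networking control in IaaS and the configuration settings mentioned for PaaS and SaaS.

```python
LAYERS = ["network", "servers", "operating system", "storage", "application"]

# Layers the consumer controls under each service model, per the definitions quoted.
CONSUMER_CONTROLS = {
    "IaaS": {"operating system", "storage", "application"},
    "PaaS": {"application"},
    "SaaS": set(),
}

def who_controls(model: str, layer: str) -> str:
    """Return 'consumer' or 'provider' for a given service model and layer."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    return "consumer" if layer in CONSUMER_CONTROLS[model] else "provider"

for model in CONSUMER_CONTROLS:
    row = ", ".join(f"{layer}={who_controls(model, layer)}" for layer in LAYERS)
    print(f"{model}: {row}")
```

Reading the rows from IaaS to SaaS, control moves steadily from the consumer to the provider, which is the trade between flexibility and management effort that the three models offer.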

Software as a Service

slide-30
SLIDE 30

Cloud Deployment Models

1) Public Cloud

2) Private Cloud

3) Community Cloud

4) Hybrid Cloud

Cloud Infrastructure

slide-31
SLIDE 31

Public Cloud

Enterprise P

Cloud Provider’s Resources

Enterprise Q Individual R

slide-32
SLIDE 32

Private Cloud

Enterprise P

Resources of Enterprise P

1) On-premise Private Cloud

Cloud Provider’s Resources

Dedicated for Enterprise P

Enterprise P

2) Externally-hosted Private Cloud

slide-33
SLIDE 33

Community Cloud

  • On-premise Community Cloud

Resources of Enterprise P

Enterprise P

Resources of Enterprise Q

Enterprise Q Enterprise R

slide-34
SLIDE 34

Community Cloud

  • Externally-hosted Community Cloud

Cloud Provider’s Resources

Dedicated for Community Enterprise P Enterprise Q Enterprise R Community Users

slide-35
SLIDE 35

Hybrid Cloud

Enterprise P

Resources of Enterprise P

Individual R

Cloud Provider’s Resources

Enterprise Q

slide-36
SLIDE 36

Contents

Industrial Solutions

3

slide-37
SLIDE 37

Hadoop

  • Apache top level project, open-source implementation of frameworks for reliable, scalable, distributed computing and data storage.
  • It is a flexible and highly-available architecture for large scale computation and data processing on a network of commodity hardware.
  • Designed to answer the question: “How to process big data with reasonable cost and time?”
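Hadoop's distributed computing framework is built around the MapReduce programming model, which the slides do not spell out. The following single-process sketch shows the three phases on a word count. It uses plain Python rather than the Hadoop API, and the function names and sample documents are assumptions for illustration.

```python
from collections import defaultdict

def map_phase(document: str):
    """Map: emit a (word, 1) pair for every word in one input split."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all values that share a key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key: str, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)

documents = ["big data big value", "data storage"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())

print(counts)
```

In a real cluster each call to `map_phase` would run on the node that stores that split of the data and the reducers would run in parallel on different keys, which is how the model keeps cost and time reasonable on commodity hardware.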

slide-38
SLIDE 38

Origin of Hadoop (1)

  • Search Engine in 1990’s
slide-39
SLIDE 39

Origin of Hadoop (2)

  • Search Engine in 1998 and 2010’s
slide-40
SLIDE 40

Origin of Hadoop (3)

2005: Doug Cutting and Michael J. Cafarella developed Hadoop to support distribution for the Nutch search engine project. The project was funded by Yahoo.

2006: Yahoo gave the project to the Apache Software Foundation.

slide-41
SLIDE 41

Origin of Hadoop (4)

2003 2004 2006

slide-42
SLIDE 42

Hadoop Framework

slide-43
SLIDE 43

Google

slide-44
SLIDE 44

Compute

slide-45
SLIDE 45

Storage

slide-46
SLIDE 46

Amazon

  • AWS is Amazon’s umbrella description of all of their web-based technology services.
  • Mainly infrastructure services:

  • Amazon Elastic Compute Cloud (EC2)
  • Amazon Simple Storage Service (S3)
  • Amazon Simple Queue Service (SQS)
  • Amazon CloudFront
  • Amazon SimpleDB
slide-47
SLIDE 47

Amazon

slide-48
SLIDE 48

AWS Management Console

slide-49
SLIDE 49

Microsoft Azure (1)

slide-50
SLIDE 50

Microsoft Azure (2)

slide-51
SLIDE 51

Microsoft Azure (3)

slide-52
SLIDE 52

Aliyun Framework (1)

slide-53
SLIDE 53

Aliyun Framework (2)

slide-54
SLIDE 54

Thank you!