Cloud Big Data Architectures Lynn Langit QCon Sao Paulo, Brazil - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Cloud Big Data Architectures

Lynn Langit

QCon Sao Paulo, Brazil 2016

slide-2
SLIDE 2

About this Workshop

Real-world Cloud Scenarios w/AWS, Azure and GCP

1. Big Data Solution Types
2. Data Pipelines
3. ETL and Visualization
4. Bonus…(if time allows)

slide-3
SLIDE 3

Save ALL of your Data

slide-4
SLIDE 4

What is the ACTUAL Cost of ✘ Saving all Data ✘ Using newer technologies ✘ Going beyond Relational

slide-5
SLIDE 5

About this Workshop

Real-world Cloud Scenarios w/AWS, Azure and GCP

1. When to use which type of Big Data Solution
2. The new world of Data Pipelines
3. ETL and Visualization Practicalities
4. Bonus…(if time allows)

slide-6
SLIDE 6

1.

Big Data – Yes!

But what kind?

slide-7
SLIDE 7

Pattern 1

✘Which type(s) of Big Data work best?

  • - when to use Hadoop
  • - when to use NoSQL

and which type, i.e. key-value, document, graph, etc.

  • - when to use Big Relational

and what type of workload for hot, warm or cold data

slide-8
SLIDE 8

Choice… is good, right?

slide-9
SLIDE 9

When do I use…? ✘ Hadoop ✘ NoSQL ✘ Big Relational

slide-10
SLIDE 10

Size Matters

slide-11
SLIDE 11
slide-12
SLIDE 12

One Vendor’s View


slide-13
SLIDE 13
slide-14
SLIDE 14

Where is Hadoop Used?

slide-15
SLIDE 15

Hadoop is your LAST CHOICE

✘Volume

  • 10 TB or greater to start
  • Growth of 25% YOY
  • Where FROM
  • Where TO

✘Velocity and Variety

  • Spark over HIVE
  • Kafka and Samza

✘Veracity

  • Pay, train and hire team
  • Top $$$ for talent
  • IF you can find it
  • WATCH OUT for Cloud Vendors who promise ‘easy access’
  • Complexity of ecosystem
  • Cloudera knows best
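
The volume thresholds above (10 TB to start, 25% growth year over year) can be turned into a quick projection. This is a minimal sketch, assuming simple compound growth; the function name and the three-year horizon are illustrative choices, not part of the talk.

```python
def project_volume_tb(start_tb, yearly_growth, years):
    """Return the projected data volume in TB at the end of each year.

    Assumes compound growth at a constant yearly rate (an illustrative
    simplification; real growth is rarely this smooth).
    """
    volumes = []
    current = start_tb
    for _ in range(years):
        current *= 1 + yearly_growth
        volumes.append(current)
    return volumes


# Slide figures: 10 TB starting volume, 25% growth year over year.
projection = project_volume_tb(start_tb=10, yearly_growth=0.25, years=3)
print(projection)
```

Running the same function with your own starting volume and growth rate gives a rough first answer to the "how much now, how much later" question before committing to Hadoop.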

slide-16
SLIDE 16

When do I use…? ✘ Hadoop ✘ NoSQL ✘ Big Relational

slide-17
SLIDE 17

225

NoSQL Database Types to Choose From

slide-18
SLIDE 18

Let’s review some NoSQL concepts

Key-Value

Redis, Riak, Aerospike

Graph

Neo4j

Document

MongoDB

Wide-Column

Cassandra, HBase
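
To make the four NoSQL families concrete, here is a hedged sketch that models the same small piece of data in each shape using plain Python structures. The user names, keys and the `followed_by` helper are hypothetical; real products (Redis, MongoDB, Cassandra, Neo4j) each have their own APIs, which are not shown here.

```python
import json

# Key-value: an opaque value looked up by a single key.
key_value_store = {"user:42": json.dumps({"name": "Ana", "city": "Sao Paulo"})}

# Document: a nested, self-describing record.
document_store = {
    "users": [
        {"_id": 42, "name": "Ana", "address": {"city": "Sao Paulo"}, "follows": [7]}
    ]
}

# Wide-column: row key -> column family -> columns.
wide_column_store = {
    "42": {
        "profile": {"name": "Ana", "city": "Sao Paulo"},
        "activity": {"last_login": "2016-03-01"},
    }
}

# Graph: nodes plus explicit, typed relationships.
graph_nodes = {42: {"name": "Ana"}, 7: {"name": "Bruno"}}
graph_edges = [(42, "FOLLOWS", 7)]


def followed_by(user_id):
    """Return the names of the users that user_id follows (graph traversal)."""
    return [
        graph_nodes[dst]["name"]
        for src, rel, dst in graph_edges
        if src == user_id and rel == "FOLLOWS"
    ]


print(json.loads(key_value_store["user:42"])["name"])
print(followed_by(42))
```

The point of the comparison is the access pattern: the key-value form only answers "give me user 42", while the graph form answers relationship questions directly.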

slide-19
SLIDE 19

slide-20
SLIDE 20

Key Questions - Storage

✘Volume – how much now, what growth rate?
✘Variety – what type(s) of data? ‘rectangular’, ‘graph’, ‘k-v’, etc.
✘Velocity – batches, streams, both, what ingest rate?
✘Veracity – current state (quality) of data, amount of duplication of data stores, existence of authoritative (master) data management?

slide-21
SLIDE 21

NoSQL Example

✘Open Source is Free

  • Rapid iteration, innovation
  • Can start up for free (on premise)
  • Can ‘rent’ for cheap or free on the cloud
  • Can use with the command line for free
  • Some vendors offer free online training
  • Ex. www.neo4j.org

✘Not Free

  • Constant releases
  • Can be deceptively hard to set up (time is money)
  • Don’t forget to turn it off if on the cloud!
  • GUI tools, support, training cost $$$
  • Ex. www.neo4j.com

slide-22
SLIDE 22

Practice

Applying Concepts - NoSQL

slide-23
SLIDE 23

NoSQL Applied

Log Files

  • ???

Product Catalogs

  • ???

Social Games

  • ???

Social aggregators

  • ???

Line-of-Business

  • ???
slide-24
SLIDE 24

NoSQL Applied

Log Files

  • Columnstore
  • HBase

Product Catalogs

  • Key/Value
  • Redis

Social Games

  • Document
  • MongoDB

Social aggregators

  • Graph
  • Neo4j

Line-of-Business

  • RDBMS
  • SQL Server
slide-25
SLIDE 25

More than NoSQL

NoSQL

✘ Non-relational
✘ Can be optimized in-memory
✘ Eventually consistent
✘ Schema on Read
✘ Example: Aerospike

NewSQL

✘ Relational plus more
✘ Often in-memory
✘ Some kind of SQL-layer
✘ Schema on Write
✘ Example: MemSQL

U-SQL

✘ What???
✘ Microsoft’s universal SQL language
✘ Example: Azure Data Lake
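
The "Schema on Read" versus "Schema on Write" distinction can be sketched in a few lines. This is an illustration only: the field names, `REQUIRED_FIELDS` and both helper functions are hypothetical and do not correspond to any vendor API.

```python
import json

# Hypothetical schema: field name -> expected Python type.
REQUIRED_FIELDS = {"id": int, "name": str}


def write_with_schema(table, record):
    """Schema on write: validate at insert time and reject bad records."""
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(record.get(field), expected_type):
            raise ValueError(f"field {field!r} must be {expected_type.__name__}")
    table.append(record)


def read_with_schema(raw_lines):
    """Schema on read: store anything, apply structure only when querying."""
    rows = []
    for line in raw_lines:
        doc = json.loads(line)
        rows.append({"id": int(doc.get("id", -1)), "name": str(doc.get("name", ""))})
    return rows


table = []
write_with_schema(table, {"id": 1, "name": "Ana"})

raw_lines = ['{"id": "2"}', '{"id": 3, "name": "Bruno", "extra": true}']
rows = read_with_schema(raw_lines)
print(table, rows)
```

Schema on write pushes the cost to ingest (bad records fail early); schema on read pushes it to every query, which has to cope with missing or mistyped fields.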

slide-26
SLIDE 26

Focus

slide-27
SLIDE 27

How Best to Store your Data?

          Complexity   Scalability   Developer Cost
RDBMS     easy         medium        low
NoSQL     medium       big           high
Hadoop    hard         huge          very high

slide-28
SLIDE 28

Real World Big Data -- When do I use what?

RDBMS 65% NoSQL 30% Hadoop 5%

slide-29
SLIDE 29

Do the Cloud Vendors Understand Big Data Realities?

slide-30
SLIDE 30

Cloud Big Data Vendors - Storage

AWS

✘ 5-10X market share of next competitor ✘ Most complete offering ✘ Most mature offering ✘ Notable: Big Relational

GCP

✘ Lean, mean and cheap ✘ Fastest player ✘ Requires top developers ✘ Notable: Query as a Service

Azure

✘ Catching up ✘ Best tooling integration ✘ Notable: On-premise integration

slide-31
SLIDE 31


AWS Console 17 Data services

slide-32
SLIDE 32


GCP Console 8 Data Services

slide-33
SLIDE 33


Azure Console 15 Data Services

slide-34
SLIDE 34

Cloud Offerings – Big Data

                                     AWS                           Google                        Microsoft
Managed RDBMS                        RDS, Aurora                   Cloud SQL                     Azure SQL
Data Warehouse                       Redshift                      BigQuery                      Azure SQL Data Warehouse
NoSQL buckets                        S3, Glacier                   Cloud Storage, Nearline       Azure Blobs, StorSimple
NoSQL Key-Value / NoSQL Wide Column  DynamoDB                      Big Table, Cloud Datastore    Azure Tables
NoSQL Document / NoSQL Graph         MongoDB on EC2, Neo4j on EC2  MongoDB on GCE, Neo4j on GCE  DocumentDB, Neo4j on Azure
Hadoop                               Elastic MapReduce             DataProc                      Data Lake, HDInsight

slide-35
SLIDE 35

Practice

Applying Concepts – Real Cost of Storage Types

slide-36
SLIDE 36

Cloud NoSQL Applied – AWS

Log Files
Product Catalogs
Social Games
Social aggregators
Line-of-Business

slide-37
SLIDE 37

Cloud NoSQL Applied – AWS

Log Files

  • Stream or Hadoop
  • Kinesis or EMR

Product Catalogs

  • Key/Value
  • DynamoDB

Social Games

  • Document
  • MongoDB

Social aggregators

  • Graph
  • Neo4j

Line-of-Business

  • RDBMS
  • RDS
slide-38
SLIDE 38

???

The fastest growing cloud-based Big Data products are…

slide-39
SLIDE 39

Relational

The fastest growing cloud-based Big Data products are…

slide-40
SLIDE 40

When do I use…? ✘ Hadoop ✘ NoSQL ✘ Big Relational

slide-41
SLIDE 41

Practice

Applying Concepts – Real Cost of Storage Types

slide-42
SLIDE 42

Reasons to use Big Relational Cloud Services

Developers
DevOps
Cloud Vendors – AWS

Developers
DevOps
Cloud Vendors – GCP

slide-43
SLIDE 43

Reasons to use Big Relational Cloud Services

Developers

  • Most know RDBMS query patterns
  • Many know basic administration

DevOps

  • Most know RDBMS administration
  • Many know basic RDBMS queries
  • Many know query optimization

Cloud Vendors - AWS

  • Aurora – RDBMS up to 64 TB
  • Redshift – $1k USD / 1 TB / year
  • Rich partner ecosystem – ETL
  • Integration with AWS products

Developers

  • Most know coding language patterns to interact with RDBMS systems

DevOps

  • Familiar RDBMS security patterns
  • Familiar auditing
  • Partner tooling integration

Cloud Vendors - GCP

  • BigQuery – familiar SQL queries
  • No hassle streaming ingest
  • No hassle pay-as-you-go
  • Zero administration
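
The Redshift figure on this slide ($1k USD per TB per year) makes a back-of-the-envelope storage budget easy to compute. The sketch below treats that 2016 slide figure as an assumption, not as current pricing; the function name and the sample volumes are hypothetical.

```python
# Assumption: the slide's 2016 figure of roughly $1k USD per TB per year.
# Check the vendor's current price list before budgeting with it.
USD_PER_TB_YEAR = 1_000


def yearly_costs_usd(volumes_tb, usd_per_tb_year=USD_PER_TB_YEAR):
    """Return the storage cost for each year, given the volume held that year."""
    return [volume * usd_per_tb_year for volume in volumes_tb]


# Hypothetical volumes: 10 TB growing 25% per year.
costs = yearly_costs_usd([10, 12.5, 15.625])
total = sum(costs)
print(costs, total)
```

This ignores compute, ETL tooling and staff time, which the talk argues are usually the larger share of the real cost.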

slide-44
SLIDE 44

My top Big Data Cloud Services

slide-45
SLIDE 45

ETL is 75% of all Big Data Projects

Surveying, cleaning and loading data is the majority of the billable time for new Big Data projects.
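
The three activities named above (surveying, cleaning and loading) can be sketched end to end with the standard library. This is a minimal illustration under stated assumptions: the sample rows, the `customers` table and the cleaning rules (drop rows without an email, keep the first row per id) are all hypothetical.

```python
import sqlite3

# Hypothetical raw input with the usual problems: padding, nulls, duplicates.
raw_rows = [
    {"id": "1", "email": " Ana@Example.com "},
    {"id": "2", "email": None},
    {"id": "1", "email": "ana@example.com"},
]


def survey(rows):
    """Profile the data before changing it: row count, nulls, duplicate ids."""
    ids = [row["id"] for row in rows]
    return {
        "rows": len(rows),
        "missing_email": sum(1 for row in rows if not row["email"]),
        "duplicate_ids": len(ids) - len(set(ids)),
    }


def clean(rows):
    """Drop rows without an email, normalize the email, keep first row per id."""
    seen, cleaned = set(), []
    for row in rows:
        if not row["email"] or row["id"] in seen:
            continue
        seen.add(row["id"])
        cleaned.append((int(row["id"]), row["email"].strip().lower()))
    return cleaned


def load(rows):
    """Load cleaned rows into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")
    conn.executemany("INSERT INTO customers VALUES (?, ?)", rows)
    conn.commit()
    return conn


report = survey(raw_rows)
conn = load(clean(raw_rows))
print(report, conn.execute("SELECT * FROM customers").fetchall())
```

Even in this toy case, two of the three functions exist only to deal with data quality, which is consistent with the claim that ETL dominates project time.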

slide-46
SLIDE 46

About this Workshop

Real-world Cloud Scenarios w/AWS, Azure and GCP

1. When to use which type of Big Data Solution
2. The new world of Data Pipelines
3. ETL and Visualization Practicalities
4. Bonus…(if time allows)

slide-47
SLIDE 47

2.

Data Pipelines

Build vs. Buy

slide-48
SLIDE 48

Pattern 2

✘How to build optimized cloud-based data pipelines?

  • - Cloud-based ETL tools and processes
  • - includes load-testing patterns and security practices
  • - including connecting between different vendor clouds
slide-49
SLIDE 49

Key Questions – Ingestion and ETL

✘Volume – how much and how fast, now and future?
✘Variety – what type(s) of data, any pre-processing needed?
✘Velocity – batches or streaming?
✘Veracity – verification on ingest needed? new data needed?

slide-50
SLIDE 50

Together

How does your data pipeline flow?

slide-51
SLIDE 51

Considering… ✘ Initial Load/Transform ✘ Data Quality ✘ Batch vs. Stream

slide-52
SLIDE 52

Pipeline Phases

Phase 0: Eval Current Data - Quality & Quantity
Phase 1: Get New Data - Free or Premium
Phase 2: Build MVP & Forecast volume and growth
Phase 3: Load test at scale
Phase 4: Deploy – secure, audit and monitor

slide-53
SLIDE 53

Cloud Big Data Vendors - ETL

AWS

✘ 5X market share of next competitor
✘ Notable: Many, strong ETL Partners

GCP

✘ Lean, mean and cheap
✘ Fastest player
✘ Notable: DataFlow requires Java or Python developers

Azure

✘ Difficulty with scale
✘ Best tooling integration
✘ Notable: Nothing

slide-54
SLIDE 54

How Best to Ingest and ETL your Data?

          Complexity   Scalability   Developer Cost
RDBMS     medium       medium        low
NoSQL     medium       big           high
Hadoop    hard         huge          very high

slide-55
SLIDE 55

Considering… ✘ Initial Load/Transform ✘ Data Quality ✘ Batch vs. Stream

slide-56
SLIDE 56

Building a Streaming Pipeline

Stream Interval Window
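
The stream, interval and window idea can be illustrated with a tumbling (fixed, non-overlapping) window count. This is a sketch in plain Python, not the API of Kinesis, DataFlow or Stream Analytics; the event tuples and the 10-second interval are hypothetical.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Group (timestamp, key) events into fixed windows and count per key.

    Each event falls into exactly one window, identified by the window's
    start time (timestamp rounded down to a multiple of window_seconds).
    """
    windows = defaultdict(lambda: defaultdict(int))
    for timestamp, key in events:
        window_start = (timestamp // window_seconds) * window_seconds
        windows[window_start][key] += 1
    return {start: dict(counts) for start, counts in sorted(windows.items())}


# Hypothetical stream: (seconds since start, event type).
events = [(0, "click"), (3, "click"), (9, "view"), (10, "click"), (14, "view"), (21, "view")]
counts = tumbling_window_counts(events, window_seconds=10)
print(counts)
```

A production pipeline also has to handle late and out-of-order events, which this sketch deliberately ignores.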

slide-57
SLIDE 57
slide-58
SLIDE 58

Near Real-time Streams

Load Test All The Things
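
A load test does not need much machinery to get started. The sketch below pushes synthetic events through a processing function and reports throughput; `load_test`, the event shape and the list used as a sink are all hypothetical stand-ins for a real pipeline stage.

```python
import time


def load_test(process_event, n_events):
    """Send n_events synthetic events to process_event and measure throughput."""
    processed = 0
    start = time.perf_counter()
    for i in range(n_events):
        # Synthetic event; a real test would mimic production payloads.
        process_event({"id": i, "payload": "x" * 100})
        processed += 1
    elapsed = time.perf_counter() - start
    rate = processed / elapsed if elapsed > 0 else float("inf")
    return {"events": processed, "seconds": elapsed, "events_per_second": rate}


# Stand-in for a pipeline stage: append each event to an in-memory list.
sink = []
result = load_test(sink.append, n_events=10_000)
print(result)
```

Measured rates depend entirely on the machine and the stage under test, so compare runs against each other rather than against a fixed number.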

slide-59
SLIDE 59

Key Questions - Streaming

✘Volume – how much data now and predicted over next 12 months?
✘Variety – what types of data now and future?
✘Velocity – volume of input data / time now and near future?
✘Veracity – volume of EXISTING data now

slide-60
SLIDE 60

Cloud Big Data Vendors - Streaming

AWS

✘ 5X market share of next competitor
✘ Most complete offering
✘ Most mature offering
✘ Notable: Kinesis Firehose

GCP

✘ Lean, mean and cheap
✘ Fastest player
✘ Requires top developers
✘ Notable: DataFlow flexible

Azure

✘ Catching up
✘ Best tooling integration
✘ Notable: Stream Analytics integration with other products

slide-61
SLIDE 61


AWS Console 17 Data services

slide-62
SLIDE 62


GCP Console 8 Data Services

slide-63
SLIDE 63


Azure Console 15 Data Services

slide-64
SLIDE 64

Cloud Offerings – Data and Pipelines

                                     AWS                            Google                             Microsoft
Managed RDBMS                        RDS, Aurora                    Cloud SQL                          Azure SQL
Data Warehouse                       Redshift                       BigQuery                           Azure SQL Data Warehouse
NoSQL buckets                        S3, Glacier                    Cloud Storage, Nearline            Azure Blobs, StorSimple
NoSQL Key-Value / NoSQL Wide Column  DynamoDB                       Big Table, Cloud Datastore         Azure Tables
Streaming or ML                      Kinesis, AWS Machine Learning  DataFlow, Google Machine Learning  StreamInsight, Azure ML
NoSQL Document / NoSQL Graph         MongoDB on EC2, Neo4j on EC2   MongoDB on GCE, Neo4j on GCE       DocumentDB, Neo4j on Azure
Hadoop                               Elastic MapReduce              DataProc                           Data Lake, HDInsight
Cloud ETL                            Data Pipelines                 DataFlow                           Azure Data Pipeline

slide-65
SLIDE 65

How Best to Stream your Data?

            Complexity       Scalability   Developer Cost
Batches     easy             medium        low
Windows     difficult        big           high
Real-time   very difficult   huge          high

slide-66
SLIDE 66

Practice

Applying Concepts

slide-67
SLIDE 67

Designing Cloud Data Pipelines

Log Files
Product Catalogs
Social Games
Social aggregators
Line-of-Business

slide-68
SLIDE 68

About this Workshop

Real-world Cloud Scenarios w/AWS, Azure and GCP

1. When to use which type of Big Data Solution
2. The new world of Data Pipelines
3. ETL and Visualization Practicalities
4. Bonus…(if time allows)

slide-69
SLIDE 69

3.

Making Sense of Data

Analytics and Presentation

slide-70
SLIDE 70

Pattern 3

✘How best to Query and Visualize

  • - When to use business analytics vs. predictive analytics (machine learning)
  • - how best to present data to clients - partner visualization products or roll your own

slide-71
SLIDE 71

Making Sense of Data

Machine Learning Reports Presentation

slide-72
SLIDE 72

Key Questions - Query

✘Volume ✘Variety ✘Velocity ✘Veracity

slide-73
SLIDE 73

Graphs

What is nature of your questions?

slide-74
SLIDE 74
slide-75
SLIDE 75

Cloud Big Data Vendors - Query

AWS

✘ 5X market share of next competitor
✘ Most complete offering
✘ Most mature offering
✘ Notable: Big Relational

GCP

✘ Lean, mean and cheap
✘ Fastest player
✘ Notable: Flexible, powerful machine learning

Azure

✘ WATCH OUT – Cost!
✘ Notable: Developer Tooling

slide-76
SLIDE 76

Query Languages

SQL

Everyone knows it But how well do they know it?

NoSQL Vendor Language

Too many to list How will you learn it?

Cypher

Query language for graph databases The future?

ORM

Good, bad or horrible? Again, how well do they know it?

HIVE

Shown in too many vendor demos Really hard to make performant

Machine Learning Queries

SciPy, NumPy or Python; R Language; Julia Language; many more…
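
As a reminder of why SQL stays the baseline among the query languages listed above, here is a small aggregation run through Python's built-in `sqlite3` module. The `orders` table and its rows are hypothetical sample data; SQLite stands in for whichever relational engine is actually in use.

```python
import sqlite3

# In-memory database with hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ana", 10.0), ("ana", 15.0), ("bruno", 7.5)],
)

# Total spend per customer, largest first.
totals = conn.execute(
    "SELECT customer, SUM(amount) AS total "
    "FROM orders GROUP BY customer ORDER BY total DESC"
).fetchall()
print(totals)
```

The same question asked of a key-value or document store usually needs a vendor-specific query language or application code, which is the learning cost the slide points at.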

slide-77
SLIDE 77

Practice

Applying Concepts – Understanding D3

slide-78
SLIDE 78

How Best to Query your Data?

          Business Analytics   Predictive Analytics   Developer Cost
RDBMS
NoSQL
Hadoop

slide-79
SLIDE 79

How Best to Query your Data?

          Business Analytics   Predictive Analytics   Developer Cost
RDBMS     easy                 medium                 low
NoSQL     hard                 very hard              very high
Hadoop    hard                 hard                   very high

slide-80
SLIDE 80

Machine Learning aka Predictive Analytics

AWS

ML for developers GUI-based

GCP

3 Flavors of ML Python-based languages

Azure

ML for Data Scientists R Language
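
The difference between business analytics and predictive analytics can be shown on a tiny dataset: the first summarizes what already happened, the second fits a model and extrapolates. The sketch below uses ordinary least squares on hypothetical monthly sales; it is not how any of the vendor ML services work internally.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: sales for months 1 to 4.
months = [1, 2, 3, 4]
sales = [10.0, 12.0, 14.0, 16.0]

# Business analytics: describe the past.
average_sales = sum(sales) / len(sales)

# Predictive analytics: fit a trend and forecast month 5.
slope, intercept = fit_line(months, sales)
forecast_month_5 = slope * 5 + intercept
print(average_sales, forecast_month_5)
```

Real predictive work adds validation on held-out data; a perfect fit on four points, as here, says nothing about forecast quality.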

slide-81
SLIDE 81

Presentation

If you can’t see it, it’s not worth it.

slide-82
SLIDE 82

Dashboards ✘ More than KPIs ✘ Mobile ✘ Alerts ✘ Data Stories

Innovation in Data Visualization

Reports ✘ Level of Detail ✘ Meaningful Taxonomies ✘ Fast enough ✘ Drill for Data

slide-83
SLIDE 83

D3

The language of Data Visualization

slide-84
SLIDE 84
slide-85
SLIDE 85

Cloud Big Data Vendors - Visualization

AWS

✘ Most complete offering
✘ Notable: Partners & QuickSight

GCP

✘ Big Query Partners
✘ Notable: New Dashboards

Azure

✘ Integrated
✘ Notable: PowerBI

slide-86
SLIDE 86

About this Workshop

Real-world Cloud Scenarios w/AWS, Azure and GCP

1. When to use which type of Big Data Solution
2. The new world of Data Pipelines
3. ETL and Visualization Practicalities
4. Bonus…(if time allows)

slide-87
SLIDE 87

4.

About IoT

It’s happening now

slide-88
SLIDE 88


Data Generation Device

slide-89
SLIDE 89

IoT is Big Data Realized

slide-90
SLIDE 90

The IoT Market

$235,000,000,000 by the year 2017

20 Billion devices

And a lot of users

slide-91
SLIDE 91

IoT all the Things

slide-92
SLIDE 92

Cloud Big Data Vendors - IoT

AWS

✘ First to market
✘ Most complete offering
✘ Most mature offering
✘ Notable: AWS IoT Rules

GCP

✘ Still in Beta
✘ Fastest player
✘ Requires top developers
✘ Notable: Weave

Azure

✘ Catching up
✘ Best tooling integration
✘ Notable: Device Mgmt.

slide-93
SLIDE 93

Save ALL of your Data

slide-94
SLIDE 94

The Next Generation…

slide-95
SLIDE 95

Obrigada! (Thank you!)

Any questions?

You can find me at @lynnlangit