Cloud Big Data Architectures
Lynn Langit
QCon Sao Paulo, Brazil 2016
About this Workshop
Real-world Cloud Scenarios w/AWS, Azure and GCP
1. Big Data Solution Types
2. Data Pipelines
3. ETL and Visualization
4. Bonus… (if time allows)
What is the ACTUAL Cost of
✘ Saving all Data
✘ Using newer technologies
✘ Going beyond Relational
About this Workshop
Real-world Cloud Scenarios w/AWS, Azure and GCP
1. When to use which type of Big Data Solution
2. The new world of Data Pipelines
3. ETL and Visualization Practicalities
4. Bonus… (if time allows)
But what kind?
Pattern 1
✘Which type(s) of Big Data work best?
and which type, i.e. key-value, document, graph, etc.
and what type of workload for hot, warm or cold data
Choice… is good, right?
When do I use…?
✘ Hadoop
✘ NoSQL
✘ Big Relational
Size Matters
One Vendor’s View
Where is Hadoop Used?
Hadoop is your LAST CHOICE
✘Volume
✘ 10 TB or greater to start
✘ Growth of 25% YOY
✘ Where FROM
✘ Where TO
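To make the volume bullets concrete, here is a minimal Python sketch of the compound-growth arithmetic. It assumes the slide's starting point of 10 TB and 25% year-over-year growth; the function name and the five-year horizon are illustrative assumptions, not part of the talk.

```python
def project_volume_tb(start_tb: float, yoy_growth: float, years: int) -> list[float]:
    """Return the projected volume (in TB) at the end of each year, compounding annually."""
    volumes = []
    current = start_tb
    for _ in range(years):
        current *= 1 + yoy_growth  # apply one year of growth
        volumes.append(current)
    return volumes


# Assumed inputs: 10 TB today, 25% YOY growth, 5-year horizon (hypothetical).
projection = project_volume_tb(10.0, 0.25, 5)
for year, volume in enumerate(projection, start=1):
    print(f"Year {year}: {volume:.1f} TB")
```

Under these assumptions the data set roughly triples in five years (about 30.5 TB), which is worth knowing before sizing a cluster.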
✘Velocity and Variety
✘ Spark over Hive
✘ Kafka and Samza
✘Veracity
✘ Pay, train and hire team
✘ Top $$$ for talent
✘ IF you can find it
✘ WATCH OUT for Cloud Vendors who promise ‘easy access’
✘ Complexity of ecosystem
✘ Cloudera knows best
When do I use…?
✘ Hadoop
✘ NoSQL
✘ Big Relational
NoSQL Database Types to Choose From
Let’s review some NoSQL concepts
Key-Value
Redis, Riak, Aerospike
Graph
Neo4j
Document
MongoDB
Wide-Column
Cassandra, HBase
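As a rough illustration of the four NoSQL types above, the sketch below models the same hypothetical player record in each shape using plain Python structures. None of this is a vendor API; the record, keys and field names are made up for the example.

```python
import json

# Key-value: an opaque value looked up by a single key.
key_value = {"player:42": '{"name": "Ana", "score": 1200}'}
decoded = json.loads(key_value["player:42"])  # the application decodes the value

# Document: a nested, self-describing record that can be queried by field.
document = {"_id": 42, "name": "Ana", "scores": [{"game": "chess", "points": 1200}]}

# Wide-column: row key -> column family -> column -> value.
# Different rows may carry different columns.
wide_column = {"player:42": {"profile": {"name": "Ana"}, "scores": {"chess": 1200}}}

# Graph: nodes plus explicit relationships (edges) between them.
graph = {
    "nodes": {42: {"label": "Player", "name": "Ana"}, 7: {"label": "Game", "name": "chess"}},
    "edges": [(42, "PLAYS", 7)],
}
```

The data is the same in every case; what changes is which access pattern is cheap (lookup by key, query by field, scan by column, or traverse by relationship).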
Key Questions - Storage
✘ Volume – how much now, what growth rate?
✘ Variety – what type(s) of data? ‘rectangular’, ‘graph’, ‘k-v’, etc…
✘ Velocity – batches, streams, both, what ingest rate?
✘ Veracity – current state (quality) of data, amount of duplication of data stores, existence of authoritative (master) data management?
✘ Open Source is Free
✘ Not Free
money)
NoSQL Example
Applying Concepts - NoSQL
NoSQL Applied
Log Files
Product Catalogs
Social Games
Social aggregators
Line-of-Business
More than NoSQL
NoSQL
✘ Non-relational
✘ Can be optimized in-memory
✘ Eventually consistent
✘ Schema on Read
✘ Example: Aerospike
NewSQL
✘ Relational plus more
✘ Often in-memory
✘ Some kind of SQL-layer
✘ Schema on Write
✘ Example: MemSQL
U-SQL
✘ What???
✘ Microsoft’s universal SQL language
✘ Example: Azure Data Lake
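The "Schema on Write" versus "Schema on Read" distinction above can be sketched in a few lines of Python. This is only an illustration under assumptions: SQLite (from the standard library) stands in for a relational or NewSQL store, a list of raw JSON strings stands in for a NoSQL store, and the table, fields and default value are hypothetical.

```python
import json
import sqlite3

# Schema on write: the table definition is enforced when rows are inserted.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER NOT NULL, action TEXT NOT NULL)")
conn.execute("INSERT INTO events VALUES (?, ?)", (1, "login"))
try:
    conn.execute("INSERT INTO events VALUES (?, ?)", (2, None))  # violates NOT NULL
    rejected_on_write = False
except sqlite3.IntegrityError:
    rejected_on_write = True  # the bad row never reaches storage

# Schema on read: raw records are stored as-is; structure is applied by the reader.
raw_records = ['{"user_id": 1, "action": "login"}', '{"user_id": 2}']


def read_event(line: str) -> dict:
    """Apply the expected shape at read time, choosing a default for a missing field."""
    record = json.loads(line)
    return {"user_id": record["user_id"], "action": record.get("action", "unknown")}


events = [read_event(line) for line in raw_records]
```

In the first half the incomplete row is rejected at insert time; in the second half it is stored anyway and every reader has to decide what a missing field means.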
Focus
How Best to Store your Data?
         Complexity  Scalability  Developer Cost
RDBMS    easy        medium       low
NoSQL    medium      big          high
Hadoop   hard        huge         very high
Real World Big Data -- When do I use what?
RDBMS 65%
NoSQL 30%
Hadoop 5%
Do the Cloud Vendors Understand Big Data Realities?
Cloud Big Data Vendors - Storage
AWS
✘ 5-10X market share of next competitor
✘ Most complete offering
✘ Most mature offering
✘ Notable: Big Relational
GCP
✘ Lean, mean and cheap
✘ Fastest player
✘ Requires top developers
✘ Notable: Query as a Service
Azure
✘ Catching up
✘ Best tooling integration
✘ Notable: On-premise integration
AWS Console 17 Data services
GCP Console 8 Data Services
Azure Console 15 Data Services
Cloud Offerings – Big Data
                              AWS                           Google                        Microsoft
Managed RDBMS                 RDS, Aurora                   Cloud SQL                     Azure SQL
Data Warehouse                Redshift                      BigQuery                      Azure SQL Data Warehouse
NoSQL buckets                 S3, Glacier                   Cloud Storage, Nearline       Azure Blobs, StorSimple
NoSQL Key-Value, Wide Column  DynamoDB                      Big Table, Cloud Datastore    Azure Tables
NoSQL Document, Graph         MongoDB on EC2, Neo4j on EC2  MongoDB on GCE, Neo4j on GCE  DocumentDB, Neo4j on Azure
Hadoop                        Elastic MapReduce             DataProc                      Data Lake, HDInsight
Applying Concepts – Real Cost of Storage Types
Cloud NoSQL Applied – AWS
Log Files
Hadoop
EMR
Product Catalogs
Social Games
Social aggregators
Line-of-Business
The fastest growing cloud-based Big Data products are…
When do I use…?
✘ Hadoop
✘ NoSQL
✘ Big Relational
Applying Concepts – Real Cost of Storage Types
Reasons to use Big Relational Cloud Services
Developers
✘ Most know RDBMS query patterns
✘ Many know basic administration
DevOps
✘ Most know RDBMS administration
✘ Many know basic RDBMS queries
✘ Many know query optimization
Cloud Vendors - AWS
✘ Aurora – RDBMS up to 64 TB
✘ Redshift – $1k USD / 1 TB / year
✘ Rich partner ecosystem – ETL
✘ Integration with AWS products
Developers
✘ Most know coding language patterns to interact with RDBMS systems
DevOps
✘ Familiar RDBMS security patterns
✘ Familiar auditing
✘ Partner tooling integration
Cloud Vendors - GCP
✘ BigQuery – familiar SQL queries
✘ No hassle streaming ingest
✘ No hassle pay-as-you-go
✘ Zero administration
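The Redshift figure quoted above ($1k USD per TB per year) lends itself to a quick back-of-the-envelope estimate. The sketch below is only that: the rate is the 2016 number from the slide, the 10 TB and 50 TB volumes are hypothetical, and real pricing depends on node type, region and reservation terms.

```python
# Rate quoted on the slide (2016); an assumption to check against current pricing.
REDSHIFT_USD_PER_TB_YEAR = 1_000


def yearly_cost_usd(volume_tb: float, usd_per_tb_year: float = REDSHIFT_USD_PER_TB_YEAR) -> float:
    """Estimate yearly cost as a flat rate per TB; ignores compute tiers and discounts."""
    return volume_tb * usd_per_tb_year


cost_10_tb = yearly_cost_usd(10)  # hypothetical 10 TB warehouse
cost_50_tb = yearly_cost_usd(50)  # hypothetical 50 TB warehouse
print(f"10 TB: ${cost_10_tb:,.0f} per year, 50 TB: ${cost_50_tb:,.0f} per year")
```

Setting this against the developer cost of a Hadoop or NoSQL team is the comparison the slides are pointing at.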
My top Big Data Cloud Services
ETL is 75% of all Big Data Projects
Surveying, cleaning and loading data is the majority of the billable time for new Big Data projects.
Build vs. Buy
Pattern 2
✘ How to build optimized cloud-based data pipelines?
Key Questions – Ingestion and ETL
✘ Volume – how much and how fast, now and future?
✘ Variety – what type(s) of data, any pre-processing needed?
✘ Velocity – batches or streaming?
✘ Veracity – verification on ingest needed? new data needed?
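As one way to picture the "verification on ingest" question above, here is a minimal Python sketch that splits incoming records into accepted and rejected sets based on required fields. The function name, the field names and the sample records are all hypothetical; real pipelines typically add type checks, deduplication and a quarantine store.

```python
def verify_on_ingest(records, required_fields):
    """Split records into (accepted, rejected) by checking required fields are present and non-empty."""
    accepted, rejected = [], []
    for record in records:
        missing = [field for field in required_fields if record.get(field) in (None, "")]
        if missing:
            # Keep the reason with the record so it can be reviewed or repaired later.
            rejected.append({"record": record, "missing": missing})
        else:
            accepted.append(record)
    return accepted, rejected


# Hypothetical batch: the second record has no timestamp, the third has an empty device_id.
batch = [
    {"device_id": "a1", "timestamp": 1000, "value": 3.2},
    {"device_id": "b2", "value": 1.1},
    {"device_id": "", "timestamp": 1002, "value": 0.4},
]
accepted, rejected = verify_on_ingest(batch, required_fields=["device_id", "timestamp"])
```

Rejecting at ingest costs some throughput, but it keeps bad rows from surfacing later as confusing query results.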
How does your data pipeline flow?
Considering…
✘ Initial Load/Transform
✘ Data Quality
✘ Batch vs. Stream
Pipeline Phases
Phase 0: Eval Current Data - Quality & Quantity
Phase 1: Get New Data - Free or Premium
Phase 2: Build MVP & Forecast volume and growth
Phase 3: Load test at scale
Phase 4: Deploy – secure, audit and monitor
Cloud Big Data Vendors - ETL
AWS
✘ 5X market share of next competitor
✘ Notable: Many, strong ETL Partners
GCP
✘ Lean, mean and cheap
✘ Fastest player
✘ Notable: DataFlow requires Java or Python developers
Azure
✘ Difficulty with scale
✘ Best tooling integration
✘ Notable: Nothing
How Best to Ingest and ETL your Data?
         Complexity  Scalability  Developer Cost
RDBMS    medium      medium       low
NoSQL    medium      big          high
Hadoop   hard        huge         very high
Considering…
✘ Initial Load/Transform
✘ Data Quality
✘ Batch vs. Stream
Building a Streaming Pipeline
Stream Interval Window
Near Real-time Streams
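To illustrate what a stream interval window does, the sketch below groups timestamped events into fixed, non-overlapping (tumbling) windows and counts them per key. It is a pure-Python illustration, not the Kinesis, DataFlow or Stream Analytics API; the event tuples and the 10-second window are assumptions chosen for the example.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per key inside fixed, non-overlapping windows.

    `events` is an iterable of (timestamp_seconds, key) pairs.
    Returns {window_start: {key: count}} ordered by window start.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for timestamp, key in events:
        # Every timestamp maps to exactly one window: [window_start, window_start + size).
        window_start = (timestamp // window_seconds) * window_seconds
        counts[window_start][key] += 1
    return {start: dict(per_key) for start, per_key in sorted(counts.items())}


# Hypothetical click stream: (seconds since start, event type).
events = [(0, "click"), (3, "click"), (9, "view"), (10, "click"), (19, "view"), (21, "click")]
windows = tumbling_window_counts(events, window_seconds=10)
for start, per_key in windows.items():
    print(f"[{start}, {start + 10}): {per_key}")
```

A managed streaming service additionally has to handle late and out-of-order events, which is a large part of why windows rate as harder than batches.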
Load Test All The Things
Key Questions - Streaming
✘ Volume – how much data now and predicted over next 12 months?
✘ Variety – what types of data now and future?
✘ Velocity – volume of input data / time now and near future?
✘ Veracity – volume of EXISTING data now
Cloud Big Data Vendors - Streaming
AWS
✘ 5X market share of next competitor
✘ Most complete offering
✘ Most mature offering
✘ Notable: Kinesis Firehose
GCP
✘ Lean, mean and cheap
✘ Fastest player
✘ Requires top developers
✘ Notable: DataFlow flexible
Azure
✘ Catching up
✘ Best tooling integration
✘ Notable: Stream Analytics integration with other products
Cloud Offerings – Data and Pipelines
                              AWS                            Google                            Microsoft
Managed RDBMS                 RDS, Aurora                    Cloud SQL                         Azure SQL
Data Warehouse                Redshift                       BigQuery                          Azure SQL Data Warehouse
NoSQL buckets                 S3, Glacier                    Cloud Storage, Nearline           Azure Blobs, StorSimple
NoSQL Key-Value, Wide Column  DynamoDB                       Big Table, Cloud Datastore        Azure Tables
Streaming or ML               Kinesis, AWS Machine Learning  DataFlow, Google Machine Learning  StreamInsight, Azure ML
NoSQL Document, Graph         MongoDB on EC2, Neo4j on EC2   MongoDB on GCE, Neo4j on GCE      DocumentDB, Neo4j on Azure
Hadoop                        Elastic MapReduce              DataProc                          Data Lake, HDInsight
Cloud ETL                     Data Pipelines                 DataFlow                          Azure Data Pipeline
How Best to Stream your Data?
            Complexity      Scalability  Developer Cost
Batches     easy            medium       low
Windows     difficult       big          high
Real-time   very difficult  huge         high
Applying Concepts
Designing Cloud Data Pipelines
Log Files
Product Catalogs
Social Games
Social aggregators
Line-of-Business
Analytics and Presentation
Pattern 3
✘ How best to Query and Visualize
learning)
roll your own
Making Sense of Data
Machine Learning
Reports
Presentation
Key Questions - Query
✘ Volume
✘ Variety
✘ Velocity
✘ Veracity
What is nature of your questions?
Cloud Big Data Vendors - Query
AWS
✘ 5X market share of next competitor
✘ Most complete offering
✘ Most mature offering
✘ Notable: Big Relational
GCP
✘ Lean, mean and cheap
✘ Fastest player
✘ Notable: Flexible, powerful machine learning
Azure
✘ WATCH OUT – Cost!
✘ Notable: Developer Tooling
Query Languages
SQL
✘ Everyone knows it
✘ But how well do they know it?
NoSQL Vendor Language
✘ Too many to list
✘ How will you learn it?
Cypher
✘ Query language for graph databases
✘ The future?
ORM
✘ Good, bad or horrible?
✘ Again, how well do they know it?
Hive
✘ Shown in too many vendor demos
✘ Really hard to make performant
Machine Learning Queries
✘ SciPy, NumPy or Python
✘ R Language
✘ Julia Language
✘ Many more…
Applying Concepts – Understanding D3
How Best to Query your Data?
         Business Analytics  Predictive Analytics  Developer Cost
RDBMS    easy                medium                low
NoSQL    hard                very hard             very high
Hadoop   hard                hard                  very high
Machine Learning aka Predictive Analytics
AWS
✘ ML for developers
✘ GUI-based
GCP
✘ 3 Flavors of ML
✘ Python-based languages
Azure
✘ ML for Data Scientists
✘ R Language
Presentation
If you can’t see it, it’s not worth it.
Dashboards
✘ More than KPIs
✘ Mobile
✘ Alerts
✘ Data Stories
Innovation in Data Visualization
Reports
✘ Level of Detail
✘ Meaningful Taxonomies
✘ Fast enough
✘ Drill for Data
The language of Data Visualization
Cloud Big Data Vendors - Visualization
AWS
✘ Most complete offering
✘ Notable: Partners & QuickSight
GCP
✘ BigQuery Partners
✘ Notable: New Dashboards
Azure
✘ Integrated
✘ Notable: Power BI
It’s happening now
Data Generation Device
IoT is Big Data Realized
The IoT Market
By the year
And a lot of users
IoT all the Things
Cloud Big Data Vendors - IoT
AWS
✘ First to market
✘ Most complete offering
✘ Most mature offering
✘ Notable: AWS IoT Rules
GCP
✘ Still in Beta
✘ Fastest player
✘ Requires top developers
✘ Notable: Weave
Azure
✘ Catching up
✘ Best tooling integration
✘ Notable: Device Mgmt.
The Next Generation…
‘brigada! (Thank you!)
Any questions?
You can find me at @lynnlangit