Innovation at AWS Eric Ferreira ericfe@amazon.com Principal - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Innovation at AWS

Eric Ferreira ericfe@amazon.com Principal Database Engineer Amazon Redshift

slide-2
SLIDE 2

The Amazon Flywheel

Focus on things that stay the same

  • Price
  • Selection
  • Delivery

slide-3
SLIDE 3

Applying this at AWS

slide-4
SLIDE 4

Amazon Redshift

Focus on things that stay the same

  • Performance
  • Value
  • Simplicity

slide-5
SLIDE 5

Adopt a retail mindset

slide-6
SLIDE 6

  • Customers have choice
  • Delight them and they’ll stay
  • Earn their business one hour at a time

slide-7
SLIDE 7

Start with the Customer

Work Backwards

slide-8
SLIDE 8

What Do Customers Want?

  • What problems are customers facing?
  • How will my service alleviate this pain?
  • Why will this idea delight customers?
  • Why can I do this better than anyone else?
slide-9
SLIDE 9

What we heard from customers about DW

  • Complicated to install, maintain, operate
  • Require large upfront payments
  • Too expensive
  • Always running out of capacity
slide-10
SLIDE 10

Press Release

Describe the product in terms of customer value

  • Why will customers care?
  • Is it newsworthy?
  • How is this differentiated?

slide-11
SLIDE 11

FAQ

Answer customer questions

  • How does this help me?
  • How do I get started?
  • How will this work with my ETL/BI tools?
  • When should I use this vs. Hadoop?

slide-12
SLIDE 12

2 pizza teams

  • An individual team should be no larger than can be fed by two pizzas.
  • Beyond this size, you define contracts and interfaces with other teams.
  • Attention is a scarce resource. Time is a scarce resource.
  • Apply attention and time to changing reality, not communicating status.

slide-13
SLIDE 13

Build the Product

  • Assemble a Team
  • Build
  • Internal Beta
  • Private Beta
  • Launch
  • Iterate

slide-14
SLIDE 14

Iterate

slide-15
SLIDE 15

Add Features that matter Raise Value Increase Adoption Get Feedback

slide-16
SLIDE 16

Redshift pushes a new DB version every two weeks. 120+ features since launch

  • Service Launch (2/14)
  • PDX (4/2)
  • Temp Credentials (4/11)
  • Unload Encrypted Files DUB (4/25)
  • NRT (6/5)
  • JDBC Fetch Size (6/27)
  • Unload logs (7/5)
  • 4 byte UTF-8 (7/18)
  • Statement Timeout (7/22)
  • SHA1 Builtin (7/15)
  • Timezone, Epoch, Autoformat (7/25)
  • WLM Timeout/Wildcards (8/1)
  • CRC32 Builtin, CSV, Restore Progress (8/9)
  • UTF-8 Substitution (8/29)
  • JSON, Regex, Cursors (9/10)
  • Split_part, Audit tables (10/3)
  • SIN/SYD (10/8)
  • HSM Support (11/11)
  • Kinesis EMR/HDFS/SSH copy, Distributed Tables, Audit Logging/CloudTrail, Concurrency, Resize Perf., Approximate Count Distinct, SNS Alerts, Cross Region Backup (11/13)
  • SOC1/2/3 (5/8)
  • Sharing snapshots (7/18)
  • Resource Level IAM (8/9)
  • PCI (8/22)
  • Distributed Tables, Single Node Cursor Support, Maximum Connections to 500 (12/13)
  • EIP Support for VPC Clusters (12/28)
  • New query monitoring system tables and diststyle all (1/13)
  • Redshift on DW2 (SSD) Nodes (1/23)
  • Compression for COPY from SSH, Fetch size support for single node clusters, new system tables with commit stats, row_number(), strtol() and query termination (2/13)
  • Resize progress indicator & Cluster Version (3/21)
  • Regex_Substr, COPY from JSON (3/25)
  • 50 slots, COPY from EMR, ECDHE ciphers (4/22)
  • 3 new regex features, Unload to single file, FedRAMP (5/6)

slide-17
SLIDE 17

[Diagram: AWS big data services arranged under Collect, Store, and Analyze: AWS Database Migration Service, Kinesis, Direct Connect, AWS IoT, AWS Import/Export Snowball, S3, Glacier, DynamoDB, EMR, Redshift, Machine Learning, QuickSight, Athena, EC2, Elasticsearch, Lambda, AWS Glue]

slide-18
SLIDE 18

Collection & Storage

  • Store anything
  • Object storage
  • Designed for 99.999999999% durability
  • Scalable & Cost effective; $0.023/GB-Mo
  • Integrated with Amazon Glacier
  • Support for multiple encryption methods; integrated with AWS KMS, with support for external HSMs

Amazon S3

slide-19
SLIDE 19

Data Management & ETL

  • Hive Metastore-compatible data catalog with integrated crawlers for schema, data type, and partition inference
  • Generates Python code to move data from source to destination
  • Edit jobs using your favorite IDE and share snippets via Git
  • Runs jobs in Spark containers that auto-scale based on SLA
  • Serverless with no infrastructure to manage; pay only for the resources you consume

AWS Glue

slide-20
SLIDE 20

Amazon RDS for Aurora

  • MySQL compatible with up to 5x better performance on the same hardware: 100,000 writes/sec & 500,000 reads/sec
  • Scalable with up to 64 TB in a single database, up to 15 read replicas
  • Highly available, durable, and fault-tolerant custom SSD storage layer: 6-way replicated across 3 Availability Zones

  • Transparent encryption for data at rest using AWS KMS
  • Stored procedures in Aurora can invoke AWS Lambda functions
  • MySQL & PostgreSQL compatible engines
slide-21
SLIDE 21

Structured Data Processing

  • Petabyte-scale relational, MPP, data warehousing clusters with the ability to join across exabytes of data in S3 using Redshift Spectrum, a serverless scale-out query layer that charges $5/TB scanned
  • Fully managed with SSD and HDD platforms
  • Built-in end-to-end security, including customer-managed keys
  • Fault tolerant. Automatically recovers from disk and node failures
  • Data automatically backed up to Amazon S3 with cross-region backup capability for global disaster recovery
  • $1,000/TB/Year; start at $0.25/hour. Provision in minutes; scale from 160GB to 2PB of compressed data with just a few clicks

Amazon Redshift

slide-22
SLIDE 22

Semi-structured / Unstructured Data Processing

  • Hadoop, Hive, Presto, Spark, Tez, Impala etc.
    – Release 5.3: Hadoop 2.7.3, Hive 2.1, Spark 2.1, Zeppelin, Presto, HBase 1.2.3 and HBase on S3, Phoenix, Tez, Flink.
    – New applications added within 30 days of their open source release
  • Fully managed, autoscaling clusters with support for on-demand and spot pricing
  • Support for HDFS and S3 filesystems enabling separated compute and storage; multiple clusters can run against the same data in S3
  • HIPAA-eligible. Support for end-to-end encryption, IAM/VPC, S3 client-side encryption with customer managed keys and AWS KMS

Amazon EMR

slide-23
SLIDE 23

Serverless Query Processing

  • Serverless query service for querying data in S3 using standard SQL, with no infrastructure to manage
  • No data loading required; query directly from Amazon S3
  • Use standard ANSI SQL queries with support for joins, JSON, and window functions
  • Support for multiple data formats including text, CSV, TSV, JSON, Avro, ORC, Parquet
  • Pay per query, only when you’re running queries, based on data scanned. If you compress your data, you pay less and your queries run faster

Amazon Athena

slide-24
SLIDE 24

Serverless Event Processing

  • Serverless compute service that runs your code in response to events
  • Extend AWS services with user-defined custom logic
  • Write custom code in Node.js, Python, and Java
  • Pay only for the requests served and compute time required; billing in increments of 100 milliseconds

AWS Lambda
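
The "increments of 100 milliseconds" billing point can be illustrated with a little arithmetic. This is only a sketch: it assumes each invocation's duration is rounded up to the next 100 ms increment (the slide states the increment but not the rounding rule), and `billed_duration_ms` is a hypothetical helper, not an AWS API.

```python
import math

BILLING_INCREMENT_MS = 100  # increment size quoted on the slide


def billed_duration_ms(actual_ms):
    """Round a run time up to the next billing increment (assumed rule)."""
    return math.ceil(actual_ms / BILLING_INCREMENT_MS) * BILLING_INCREMENT_MS


for actual in (1, 100, 101, 250):
    print(actual, "ms runs are billed as", billed_duration_ms(actual), "ms")
```

Under this assumption, a function that finishes just past an increment boundary pays for the whole next increment, which is why short, predictable handlers are the cheapest to run.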

slide-25
SLIDE 25

Stream Processing

  • Real-time stream processing
  • High throughput; elastic
  • Highly available; data replicated across multiple Availability Zones with configurable retention
  • S3, Redshift, DynamoDB Integrations
  • Kinesis Streams for custom streaming applications; Kinesis Firehose for easy integration with Amazon S3 and Redshift; Kinesis Analytics for streaming SQL

Amazon Kinesis

slide-26
SLIDE 26

Search and Operational Analytics

  • Distributed search and analytics engine
  • Managed service using Elasticsearch and Kibana
  • Fully managed; Zero admin
  • Highly Available and Reliable
  • Tightly integrated with other AWS services

Amazon Elasticsearch Service

slide-27
SLIDE 27

Predictive Applications

  • Easy to use, managed service built for developers; deploy models in seconds
  • Robust, powerful technology based on Amazon’s internal systems
  • Create models using your data already stored in the AWS cloud; deploy models in batch and real time modes
  • Spark on Amazon EMR also available for custom machine learning applications

Amazon ML

slide-28
SLIDE 28

Business Intelligence

  • Fast and cloud-powered
  • Easy to use, no infrastructure to manage
  • Scales to 100s of thousands of users
  • Quick calculations with SPICE
  • 1/10th the cost of legacy BI software

Amazon QuickSight

slide-29
SLIDE 29

Amazon Redshift

slide-30
SLIDE 30

Columnar

MPP

OLAP

AWS IAM Amazon VPC Amazon SWF Amazon S3 AWS KMS Amazon Route 53 Amazon CloudWatch Amazon EC2

PostgreSQL Amazon Redshift

slide-31
SLIDE 31

Redshift Cluster Architecture

  • Massively parallel, shared nothing
  • Leader node
    – SQL endpoint
    – Stores metadata
    – Coordinates parallel SQL processing
  • Compute nodes
    – Local, columnar storage
    – Executes queries in parallel
    – Load, backup, restore

[Diagram: SQL clients/BI tools connect via JDBC/ODBC to the leader node; leader and compute nodes (labeled 128GB RAM, 16TB disk, 16 cores) are linked by 10 GigE (HPC); ingestion, backup, and restore flow to and from S3 / EMR / DynamoDB / SSH]

slide-32
SLIDE 32

Brute force only takes you so far…

slide-33
SLIDE 33

Designed for I/O Reduction

  • Columnar storage
  • Data compression
  • Zone maps

aid  loc  dt
1    SFO  2016-09-01
2    JFK  2016-09-14
3    SFO  2017-04-01
4    JFK  2017-05-14

  • Accessing dt with row storage:
    – Need to read everything
    – Unnecessary I/O

CREATE TABLE audience (
  aid INT      --audience_id
 ,loc CHAR(3)  --location
 ,dt  DATE     --date
);

slide-34
SLIDE 34

Designed for I/O Reduction

  • Columnar storage
  • Data compression
  • Zone maps

aid  loc  dt
1    SFO  2016-09-01
2    JFK  2016-09-14
3    SFO  2017-04-01
4    JFK  2017-05-14

  • Accessing dt with columnar storage:
    – Only scan blocks for relevant column

CREATE TABLE audience (
  aid INT      --audience_id
 ,loc CHAR(3)  --location
 ,dt  DATE     --date
);
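
The difference between reading dt from row storage and from columnar storage can be sketched with a toy model. The code below counts how many values must be touched in each layout for the four-row audience table from the slides. It is an illustration of the general idea only and says nothing about Redshift's actual on-disk block format; the function names are hypothetical.

```python
# Toy model of the slide's audience table (aid, loc, dt).
rows = [
    (1, "SFO", "2016-09-01"),
    (2, "JFK", "2016-09-14"),
    (3, "SFO", "2017-04-01"),
    (4, "JFK", "2017-05-14"),
]


def read_dt_row_store(rows):
    """Row layout: whole rows are read, so aid and loc come along with dt."""
    values_read = 0
    dts = []
    for row in rows:
        values_read += len(row)
        dts.append(row[2])
    return dts, values_read


def read_dt_column_store(rows):
    """Column layout: each column is stored on its own, so only dt is read."""
    columns = {
        "aid": [r[0] for r in rows],
        "loc": [r[1] for r in rows],
        "dt": [r[2] for r in rows],
    }
    dts = columns["dt"]
    return dts, len(dts)


row_dts, row_cost = read_dt_row_store(rows)
col_dts, col_cost = read_dt_column_store(rows)
print("values read, row store:", row_cost)
print("values read, column store:", col_cost)
```

Both functions return the same dt values; only the amount of data touched differs, and the gap widens as the table gains columns.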

slide-35
SLIDE 35

Designed for I/O Reduction

  • Columnar storage
  • Data compression
  • Zone maps

aid  loc  dt
1    SFO  2016-09-01
2    JFK  2016-09-14
3    SFO  2017-04-01
4    JFK  2017-05-14

  • Columns grow and shrink independently
  • Effective compression ratios due to like data
  • Reduces storage requirements
  • Reduces I/O

CREATE TABLE audience (
  aid INT     ENCODE LZO
 ,loc CHAR(3) ENCODE BYTEDICT
 ,dt  DATE    ENCODE RUNLENGTH
);
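
To show why "like data" compresses well, here is a minimal sketch of two general techniques whose names match the encodings on the slide: dictionary encoding and run-length encoding. These are textbook versions written for illustration, not Redshift's BYTEDICT or RUNLENGTH implementations, and the sorted dt sample is made up for the example.

```python
from itertools import groupby


def dictionary_encode(values):
    """Replace each value with a small integer code plus a lookup dictionary."""
    dictionary = []
    index = {}
    codes = []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


loc = ["SFO", "JFK", "SFO", "JFK"]  # loc column from the slide
# Hypothetical sorted date column with repeated values.
dt_sorted = ["2016-09-01", "2016-09-01", "2016-09-01", "2016-09-14"]

print(dictionary_encode(loc))
print(run_length_encode(dt_sorted))
```

Dictionary encoding pays off when a column has few distinct values, and run-length encoding pays off when equal values sit next to each other, which is common in a sorted column.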

slide-36
SLIDE 36

Designed for I/O Reduction

  • Columnar storage
  • Data compression
  • Zone maps

aid  loc  dt
1    SFO  2016-09-01
2    JFK  2016-09-14
3    SFO  2017-04-01
4    JFK  2017-05-14

CREATE TABLE audience (
  aid INT      --audience_id
 ,loc CHAR(3)  --location
 ,dt  DATE     --date
);

  • In-memory block metadata
  • Contains per-block MIN and MAX value
  • Effectively prunes blocks which cannot contain data for a given query
  • Eliminates unnecessary I/O
slide-37
SLIDE 37

Zone Maps

SELECT COUNT(*) FROM LOGS WHERE DATE = '09-JUNE-2013'

Unsorted Table

  • MIN: 01-JUNE-2013 MAX: 20-JUNE-2013
  • MIN: 08-JUNE-2013 MAX: 30-JUNE-2013
  • MIN: 12-JUNE-2013 MAX: 20-JUNE-2013
  • MIN: 02-JUNE-2013 MAX: 25-JUNE-2013

Sorted By Date

  • MIN: 01-JUNE-2013 MAX: 06-JUNE-2013
  • MIN: 07-JUNE-2013 MAX: 12-JUNE-2013
  • MIN: 13-JUNE-2013 MAX: 18-JUNE-2013
  • MIN: 19-JUNE-2013 MAX: 24-JUNE-2013
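
The effect of sorting on zone maps can be checked with the MIN/MAX pairs from this slide. The sketch below keeps one (min, max) pair per block and returns the blocks whose range could contain the queried date. It models the pruning idea only; `blocks_to_scan` and `june` are hypothetical helpers.

```python
from datetime import date


def june(day):
    """Helper for the June 2013 dates used on the slide."""
    return date(2013, 6, day)


def blocks_to_scan(zone_map, target):
    """Return indexes of blocks whose [min, max] range may contain target."""
    return [i for i, (lo, hi) in enumerate(zone_map) if lo <= target <= hi]


unsorted_blocks = [
    (june(1), june(20)),
    (june(8), june(30)),
    (june(12), june(20)),
    (june(2), june(25)),
]
sorted_blocks = [
    (june(1), june(6)),
    (june(7), june(12)),
    (june(13), june(18)),
    (june(19), june(24)),
]

target = june(9)  # WHERE DATE = '09-JUNE-2013'
print("unsorted:", blocks_to_scan(unsorted_blocks, target))  # [0, 1, 3]
print("sorted:  ", blocks_to_scan(sorted_blocks, target))    # [1]
```

With the unsorted table three of the four blocks must be read, while the sorted table needs only one, because sorting makes the block ranges non-overlapping.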

slide-38
SLIDE 38

Data Distribution

  • Distribution style is a table property which dictates how that table’s data is distributed throughout the cluster:
    – KEY: Value is hashed, same value goes to same location (slice)
    – ALL: Full table data goes to first slice of every node
    – EVEN: Round robin
  • Goals:
    – Distribute data evenly for parallel processing
    – Minimize data movement during query processing

[Diagram: three panels labeled KEY, ALL, and EVEN, each showing Node 1 (Slice 1, Slice 2) and Node 2 (Slice 3, Slice 4)]
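
The three distribution styles can be sketched for the two-node, four-slice layout shown in the diagram. This is a simplified model under stated assumptions: `zlib.crc32` stands in for whatever hash function Redshift actually uses, and the row data and function names are hypothetical.

```python
import zlib

NODES = 2
SLICES_PER_NODE = 2
TOTAL_SLICES = NODES * SLICES_PER_NODE


def distribute_key(rows, key):
    """KEY: hash the key column so equal values land on the same slice."""
    slices = [[] for _ in range(TOTAL_SLICES)]
    for row in rows:
        hashed = zlib.crc32(str(row[key]).encode("utf-8"))  # stand-in hash
        slices[hashed % TOTAL_SLICES].append(row)
    return slices


def distribute_all(rows):
    """ALL: a full copy of the table on the first slice of every node."""
    slices = [[] for _ in range(TOTAL_SLICES)]
    for node in range(NODES):
        slices[node * SLICES_PER_NODE] = list(rows)
    return slices


def distribute_even(rows):
    """EVEN: round robin across all slices."""
    slices = [[] for _ in range(TOTAL_SLICES)]
    for i, row in enumerate(rows):
        slices[i % TOTAL_SLICES].append(row)
    return slices


rows = [
    {"aid": 1, "loc": "SFO"},
    {"aid": 2, "loc": "JFK"},
    {"aid": 3, "loc": "SFO"},
    {"aid": 4, "loc": "JFK"},
]

for name, layout in [
    ("KEY(loc)", distribute_key(rows, "loc")),
    ("ALL", distribute_all(rows)),
    ("EVEN", distribute_even(rows)),
]:
    print(name, [len(s) for s in layout])
```

KEY on a low-cardinality column such as loc can leave slices empty, which illustrates the tension between the two stated goals: co-locating equal values reduces data movement in joins, but may distribute rows unevenly.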

slide-39
SLIDE 39

What is next?

When your data sets become so large and diverse that you have to start innovating around how to collect, store, process, analyze and share them

slide-40
SLIDE 40

The Dark Data Problem

Most generated data is unavailable for analysis

[Chart: Data Volume by Year, 1990 to 2020, with two series: Generated Data and Available for Analysis]

Sources: Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011; IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares

slide-41
SLIDE 41

The tyranny of “OR”

Amazon EMR

  • Directly access data in S3
  • Scale out to thousands of nodes
  • Open data formats
  • Popular big data frameworks
  • Anything you can dream up and code

Amazon Redshift

  • Super-fast local disk performance
  • Sophisticated query optimization
  • Join-optimized data formats
  • Query using standard SQL
  • Optimized for data warehousing

slide-42
SLIDE 42

Customers want

  • sophisticated query optimization and scale-out processing
  • super fast performance and support for open formats
  • the throughput of local disk and the scale of S3

slide-43
SLIDE 43

Amazon Redshift Spectrum

slide-44
SLIDE 44

Amazon Redshift Spectrum

Run SQL queries directly against data in S3 using thousands of nodes

  • Fast @ exabyte scale
  • Elastic & highly available
  • On-demand, pay-per-query
  • High concurrency: Multiple clusters access same data
  • No ETL: Query data in-place using open file formats
  • Full Amazon Redshift SQL support

[Diagram: SQL queries running against S3]

slide-45
SLIDE 45

Life of a query

Step 1. Query: SELECT COUNT(*) FROM S3.EXT_TABLE GROUP BY…

[Diagram: JDBC/ODBC client, Amazon Redshift cluster, nodes 1, 2, 3, 4 ... N, Amazon S3 (exabyte-scale object storage), Data Catalog (Apache Hive Metastore)]

slide-46
SLIDE 46

Life of a query

Step 2. Query is optimized and compiled at the leader node, which determines what gets run locally and what goes to Amazon Redshift Spectrum

slide-47
SLIDE 47

Life of a query

Step 3. Query plan is sent to all compute nodes

slide-48
SLIDE 48

Life of a query

Step 4. Compute nodes obtain partition info from Data Catalog; dynamically prune partitions

slide-49
SLIDE 49

Life of a query

Step 5. Each compute node issues multiple requests to the Amazon Redshift Spectrum layer

slide-50
SLIDE 50

Life of a query

Step 6. Amazon Redshift Spectrum nodes scan your S3 data

slide-51
SLIDE 51

Life of a query

Step 7. Amazon Redshift Spectrum projects, filters, joins and aggregates

slide-52
SLIDE 52

Life of a query

Step 8. Final aggregations and joins with local Amazon Redshift tables done in-cluster

slide-53
SLIDE 53

Life of a query

Step 9. Result is sent back to client
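
The middle of the "Life of a query" sequence (partition pruning, fan-out of scan requests, partial aggregation, and the final in-cluster aggregation) can be sketched as a toy pipeline. Everything here is hypothetical: the in-memory `s3_table`, the function names, and the one-request-per-file fan-out are assumptions made for illustration, not a description of how Redshift Spectrum is implemented.

```python
from collections import Counter

# Hypothetical external table: partition value -> files -> loc values per row.
s3_table = {
    "2017-04-01": [["SFO", "JFK"], ["SFO"]],
    "2017-04-02": [["JFK"], ["JFK", "SFO"]],
    "2017-05-14": [["SFO", "SFO"]],
}


def prune_partitions(catalog_partitions, wanted):
    """Partition pruning: keep only partitions the query can touch."""
    return [p for p in catalog_partitions if p in wanted]


def spectrum_scan(file_rows):
    """Stand-in for one scan request: read a file, return partial counts."""
    return Counter(file_rows)


def run_query(table, wanted_dates):
    """COUNT(*) ... GROUP BY loc over the selected partitions."""
    partitions = prune_partitions(table.keys(), wanted_dates)
    partials = []
    for partition in partitions:
        for file_rows in table[partition]:  # assumed: one request per file
            partials.append(spectrum_scan(file_rows))
    final = Counter()
    for partial in partials:  # final aggregation, done in the cluster
        final.update(partial)
    return dict(final), len(partials)


result, requests = run_query(s3_table, {"2017-04-01", "2017-04-02"})
print(result, "from", requests, "scan requests")
```

The May partition is never read, and only small partial counts travel back for the final merge, which is the point of pushing filtering and aggregation down to the scan layer.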

slide-54
SLIDE 54

Running an analytic query over an exabyte in S3
slide-55
SLIDE 55

Now let’s run a query over an exabyte of data in S3

Roughly 140 TB of customer item order detail records for each day over past 20 years. 190 million files across 15,000 partitions in S3. One partition per day for USA and rest of world. Need a billion-fold reduction in data processed. Running this query using a 1000 node Hive cluster would take over 5 years.*

  • Compression: 5X
  • Columnar file format: 10X
  • Scanning with 2500 nodes: 2500X
  • Static partition elimination: 2X
  • Dynamic partition elimination: 350X
  • Redshift’s query optimizer: 40X
  • Total reduction: 3.5B X

* Estimated using 20 node Hive cluster & 1.4TB, assume linear
* Query used a 20 node DC1.8XLarge Amazon Redshift cluster
* Not actual sales data - generated for this demo based on data format used by Amazon Retail.
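
The total on this slide is the product of the individual reduction factors. A short check, using only the numbers listed above (the dictionary labels are paraphrased from the slide):

```python
import math

# Reduction factors as listed on the slide.
reduction_factors = {
    "compression": 5,
    "columnar file format": 10,
    "scanning with 2500 nodes": 2500,
    "static partition elimination": 2,
    "dynamic partition elimination": 350,
    "query optimizer": 40,
}

total_reduction = math.prod(reduction_factors.values())
print(f"{total_reduction:,}")  # 3,500,000,000
```

Because the factors multiply, each technique contributes independently: dropping any one of them costs its full factor in the total.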

slide-56
SLIDE 56

Amazon Redshift Spectrum is fast

  • Leverages Amazon Redshift’s advanced cost-based optimizer
  • Pushes down projections, filters, aggregations and join reduction
  • Dynamic partition pruning to minimize data processed
  • Automatic parallelization of query execution against S3 data
  • Efficient join processing within the Amazon Redshift cluster

slide-57
SLIDE 57

Amazon Redshift Spectrum is cost-effective

  • You pay for your Amazon Redshift cluster plus $5 per TB scanned from S3
  • Each query can leverage 1000s of Amazon Redshift Spectrum nodes
  • You can reduce the TB scanned and improve query performance by:
    – Partitioning data
    – Using a columnar file format
    – Compressing data
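
The pay-per-scan model is easy to put into numbers. The sketch below uses the $5 per TB price from this slide and, as sample input, the 1 TB of text versus 130 GB of Parquet figure quoted later in the deck. It assumes 1 TB = 1000 GB and ignores the cluster cost and any minimum charge per query; `spectrum_scan_cost` is a hypothetical helper.

```python
PRICE_PER_TB_SCANNED = 5.00  # USD, as quoted on the slide


def spectrum_scan_cost(gb_scanned, gb_per_tb=1000):
    """Scan charge in USD for a query that reads gb_scanned from S3."""
    return gb_scanned / gb_per_tb * PRICE_PER_TB_SCANNED


text_cost = spectrum_scan_cost(1000)    # full scan of 1 TB of text
parquet_cost = spectrum_scan_cost(130)  # same data as 130 GB of Parquet

print(f"text:    ${text_cost:.2f}")
print(f"parquet: ${parquet_cost:.2f}")
```

A query that reads only some columns of the Parquet data, or only some partitions, would scan less than the full 130 GB and cost proportionally less.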

slide-58
SLIDE 58

Amazon Redshift Spectrum is secure

  • End-to-end data encryption
    – Encrypt S3 data using SSE and AWS KMS
    – Encrypt all Amazon Redshift data using KMS, AWS CloudHSM or your on-premises HSMs
    – Enforce SSL with perfect forward encryption using ECDHE
  • Virtual private cloud
    – Amazon Redshift leader node in your VPC. Compute nodes in private VPC. Spectrum nodes in private VPC, store no state.
  • Alerts & notifications
    – Communicate event-specific notifications via email, text message, or call with Amazon SNS
  • Audit logging
    – All API calls are logged using AWS CloudTrail
    – All SQL statements are logged within Amazon Redshift
  • Certifications & compliance
    – PCI/DSS, FedRAMP, SOC1/2/3, HIPAA/BAA

slide-59
SLIDE 59

Amazon Redshift Spectrum uses standard SQL

  • Redshift Spectrum seamlessly integrates with your existing SQL & BI apps
  • Support for complex joins, nested queries & window functions
  • Support for data partitioned in S3 by any key
    – Date, Time and any other custom keys, e.g., Year, Month, Day, Hour

slide-60
SLIDE 60

Defining External Schema and Creating Tables

Define an external schema in Amazon Redshift using the Amazon Athena data catalog or your own Apache Hive Metastore

CREATE EXTERNAL SCHEMA <schema_name>

Query external tables using <schema_name>.<table_name>

Register external tables using Athena, your Hive Metastore client, or from Amazon Redshift CREATE EXTERNAL TABLE syntax

CREATE EXTERNAL TABLE <table_name>
[PARTITIONED BY <column_name, data_type, …>]
STORED AS file_format
LOCATION s3_location
[TABLE PROPERTIES property_name=property_value, …];

slide-61
SLIDE 61

Amazon Redshift Spectrum – Current support

File formats

  • Parquet
  • CSV
  • Sequence
  • RCFile
  • ORC (coming soon)
  • RegExSerDe (coming soon)

Compression

  • Gzip
  • Snappy
  • Lzo (coming soon)
  • Bz2

Encryption

  • SSE with AES256
  • SSE KMS with default key

Column types

  • Numeric: bigint, int, smallint, float, double and decimal
  • Char/varchar/string
  • Timestamp
  • Boolean
  • DATE type can be used only as a partitioning key

Table type

  • Non-partitioned table (s3://mybucket/orders/..)
  • Partitioned table (s3://mybucket/orders/date=YYYY-MM-DD/..)
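
The partitioned layout above follows the Hive-style `key=value` path convention. A small sketch of building and parsing such prefixes is shown below; the helper names and the object file name `part-0000.parquet` are made up for the example, while the bucket path comes from the slide.

```python
from datetime import date


def partition_prefix(base, day):
    """Build the prefix for one daily partition, e.g. .../date=2017-04-01/."""
    return f"{base}/date={day.isoformat()}/"


def parse_partition(key):
    """Extract key=value partition components from an object key."""
    parts = {}
    for segment in key.split("/"):
        if "=" in segment:
            name, value = segment.split("=", 1)
            parts[name] = value
    return parts


prefix = partition_prefix("s3://mybucket/orders", date(2017, 4, 1))
print(prefix)
print(parse_partition(prefix + "part-0000.parquet"))
```

Because the partition value is encoded in the path, a query that filters on date can skip whole prefixes without opening any files in them.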

slide-62
SLIDE 62

Converting to Parquet and ORC using Amazon EMR

You can use Hive CREATE TABLE AS SELECT to convert data

CREATE TABLE data_converted
STORED AS PARQUET
AS SELECT col_1, col_2, col_3 FROM data_source

Or use Spark: 20 lines of PySpark code, running on Amazon EMR

  • 1TB of text data reduced to 130 GB in Parquet format with snappy compression
  • Total cost of EMR job to do this: $5

https://github.com/awslabs/aws-big-data-blog/tree/master/aws-blog-spark-parquet-conversion

slide-63
SLIDE 63

Is Amazon Redshift Spectrum useful if I don’t have an exabyte?

  • Your data will get bigger
    – On average, data warehousing volumes grow 10x every 5 years
    – The average Amazon Redshift customer doubles data each year
  • Amazon Redshift Spectrum makes data analysis simpler
    – Access your data without ETL pipelines
    – Teams using Amazon EMR, Athena & Redshift can collaborate using the same data lake
  • Amazon Redshift Spectrum improves availability and concurrency
    – Run multiple Amazon Redshift clusters against common data
    – Isolate jobs with tight SLAs from ad hoc analysis
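
The two growth figures on this slide are quoted over different periods, so a quick conversion helps compare them. This is plain compound-growth arithmetic on the slide's numbers and assumes steady growth from year to year.

```python
# Figures quoted on the slide.
INDUSTRY_GROWTH_PER_5_YEARS = 10  # "grow 10x every 5 years"
REDSHIFT_GROWTH_PER_YEAR = 2      # "doubles data each year"

# Convert each figure to the other's time base, assuming compound growth.
industry_per_year = INDUSTRY_GROWTH_PER_5_YEARS ** (1 / 5)
redshift_per_5_years = REDSHIFT_GROWTH_PER_YEAR ** 5

print(f"10x every 5 years is about {industry_per_year:.2f}x per year")
print(f"doubling every year is {redshift_per_5_years}x every 5 years")
```

On the same time base, yearly doubling works out to 32x over five years, roughly three times the 10x industry average.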

slide-64
SLIDE 64

Over 20 customers helped preview Amazon Redshift Spectrum

slide-65
SLIDE 65

The Emerging Analytics Architecture

Layers: Storage, Serverless Compute, Data Processing

  • Amazon Athena: Interactive Query
  • AWS Glue: ETL & Data Catalog
  • Amazon S3: Exabyte-scale Object Storage
  • Amazon Kinesis Firehose: Real-Time Data Streaming
  • Amazon EMR: Managed Hadoop Applications
  • AWS Lambda: Trigger-based Code Execution
  • AWS Glue Data Catalog: Hive-compatible Metastore
  • Amazon Redshift Spectrum: Fast @ Exabyte scale
  • Amazon Redshift: Petabyte-scale Data Warehousing

slide-66
SLIDE 66

Resources

  • Amazon Redshift Engineering’s Advanced Table Design Playbook
    https://aws.amazon.com/blogs/big-data/amazon-redshift-engineerings-advanced-table-design-playbook-preamble-prerequisites-and-prioritization/
  • https://github.com/awslabs/amazon-redshift-utils
    – Admin scripts: collection of utilities for running diagnostics on your cluster
    – Admin views: collection of utilities for managing your cluster, generating schema DDL, etc.
    – ColumnEncodingUtility: gives you the ability to apply optimal column encoding to an established schema with data already loaded
  • https://github.com/awslabs/amazon-redshift-monitoring
  • https://github.com/awslabs/amazon-redshift-udfs
slide-67
SLIDE 67

Thank You!