Innovation at AWS
Eric Ferreira ericfe@amazon.com Principal Database Engineer Amazon Redshift
The Amazon Flywheel: focus on the things that stay the same (price, selection, delivery). Applying this at AWS: focus on the things that stay the same.
Assemble a team → Build → Internal beta → Private beta → Launch → Iterate
Add features that matter → Raise value → Increase adoption → Get feedback
Feature timeline:
– Service Launch (2/14)
– PDX (4/2)
– Temp Credentials (4/11)
– Unload Encrypted Files
– DUB (4/25)
– NRT (6/5)
– JDBC Fetch Size (6/27)
– Unload logs (7/5)
– 4-byte UTF-8 (7/18)
– SHA1 Builtin (7/15)
– Statement Timeout (7/22)
– Timezone, Epoch, Autoformat (7/25)
– WLM Timeout/Wildcards (8/1)
– CRC32 Builtin, CSV, Restore Progress (8/9)
– UTF-8 Substitution (8/29)
– JSON, Regex, Cursors (9/10)
– Split_part, Audit tables (10/3)
– SIN/SYD (10/8)
– HSM Support (11/11)
– Kinesis, EMR/HDFS/SSH copy, Distributed Tables, Audit Logging/CloudTrail, Concurrency, Resize Perf., Approximate Count Distinct, SNS Alerts, Cross-Region Backup (11/13)
– SOC1/2/3 (5/8)
– Sharing snapshots (7/18)
– Resource-Level IAM (8/9)
– PCI (8/22)
– Distributed Tables, Single-Node Cursor Support, Maximum Connections to 500 (12/13)
– EIP Support for VPC Clusters (12/28)
– New query monitoring system tables and DISTSTYLE ALL (1/13)
– Redshift on DW2 (SSD) Nodes (1/23)
– Compression for COPY from SSH, fetch size support for single-node clusters, new system tables with commit stats, row_number(), strtol(), and query termination (2/13)
– Resize progress indicator & Cluster Version (3/21)
– Regex_Substr, COPY from JSON (3/25)
– 50 slots, COPY from EMR, ECDHE ciphers (4/22)
– 3 new regex features, Unload to single file, FedRAMP (5/6)
[Diagram: AWS services for Collect, Store, and Analyze, including Kinesis, Direct Connect, AWS IoT, AWS Database Migration Service, AWS Import/Export Snowball, Amazon S3, Glacier, DynamoDB, EMR, Redshift, Machine Learning, QuickSight, Athena, EC2, Elasticsearch, and Lambda]
AWS Glue (slide fragments):
– …AWS KMS, with support for external HSMs
– …Amazon S3
– Crawlers for schema, data type, and partition inference
– …destination
– Pay only for the resources you consume
[Slide fragments: ~100,000 writes/sec & 500,000 reads/sec on the same hardware; replicas; a storage layer 6-way replicated across 3 Availability Zones]

Amazon Redshift:
– Join across exabytes of data in S3 using Redshift Spectrum, a serverless scale-out query layer that charges $5/TB scanned
– Cross-region backup capability for global disaster recovery
– Scale from 160 GB to 2 PB of compressed data with just a few clicks
Amazon EMR:
– Release 5.3: Hadoop 2.7.3, Hive 2.1, Spark 2.1, Zeppelin, Presto, HBase 1.2.3 and HBase on S3, Phoenix, Tez, Flink
– New applications added within 30 days of their open source release
– …and spot pricing
– …and storage; multiple clusters can run against the same data in S3
– Client-side encryption with customer managed keys and AWS KMS
Amazon Athena:
– …with no infrastructure to manage
– …window functions
– Avro, ORC, Parquet
– …run faster
AWS Lambda:
– Run code in response to events
– …required; billing in increments of 100 milliseconds
Amazon Kinesis:
– …Availability Zones with configurable retention
– Kinesis Firehose for easy integration with Amazon S3 and Redshift; Kinesis Analytics for streaming SQL
Amazon Elasticsearch Service
Amazon ML (machine learning applications):
– Deploy models in seconds
– …internal systems
– …AWS cloud; deploy models in batch and real-time modes
Amazon QuickSight
Columnar OLAP
[Diagram: related AWS services: AWS IAM, Amazon VPC, Amazon SWF, Amazon S3, AWS KMS, Amazon Route 53, Amazon CloudWatch, Amazon EC2]
Amazon Redshift (derived from PostgreSQL)
Leader Node
– SQL endpoint
– Stores metadata
– Coordinates parallel SQL processing

Compute Nodes (each 128 GB RAM, 16 TB disk, 16 cores)
– Local, columnar storage
– Executes queries in parallel
– Load, backup, restore

[Diagram: SQL clients/BI tools connect to the leader node via JDBC/ODBC; the leader node connects to the compute nodes over 10 GigE (HPC); ingestion, backup, and restore flow between the compute nodes and S3 / EMR / DynamoDB / SSH]
Row-based storage:
– Need to read everything
– Unnecessary I/O

aid | loc | dt
----|-----|-----------
1   | SFO | 2016-09-01
2   | JFK | 2016-09-14
3   | SFO | 2017-04-01
4   | JFK | 2017-05-14

CREATE TABLE audience (
   aid INT
  ,loc CHAR(3)
  ,dt  DATE
);
Columnar storage (same audience table and DDL):
– Only scan blocks for the relevant column
Compression (same audience table):

CREATE TABLE audience (
   aid INT     ENCODE LZO
  ,loc CHAR(3) ENCODE BYTEDICT
  ,dt  DATE    ENCODE RUNLENGTH
);
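Picking encodings by hand is optional; Redshift can recommend them from a sample of already-loaded data. A minimal sketch against the audience table above:

```sql
-- ANALYZE COMPRESSION samples existing rows and reports a suggested
-- encoding for each column of the table.
ANALYZE COMPRESSION audience;
```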
Zone maps (same audience table): in-memory block metadata records the MIN and MAX value of each block, so Redshift scans only the blocks that contain data for a given query.
SELECT COUNT(*) FROM LOGS WHERE DATE = '09-JUNE-2013'

Unsorted blocks (overlapping ranges; several blocks must be scanned):
  MIN: 01-JUNE-2013  MAX: 20-JUNE-2013
  MIN: 08-JUNE-2013  MAX: 30-JUNE-2013
  MIN: 12-JUNE-2013  MAX: 20-JUNE-2013
  MIN: 02-JUNE-2013  MAX: 25-JUNE-2013

Sorted blocks (disjoint ranges; only one block must be scanned):
  MIN: 01-JUNE-2013  MAX: 06-JUNE-2013
  MIN: 07-JUNE-2013  MAX: 12-JUNE-2013
  MIN: 13-JUNE-2013  MAX: 18-JUNE-2013
  MIN: 19-JUNE-2013  MAX: 24-JUNE-2013
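Zone maps prune best when blocks are sorted on the filtered column; a minimal sketch using a sort key (table and column names are illustrative):

```sql
-- Hypothetical table: a sort key keeps blocks in date order, so each
-- block's MIN/MAX range is narrow and disjoint.
CREATE TABLE logs (
   log_id  BIGINT
  ,logdate DATE
  ,message VARCHAR(256)
)
SORTKEY (logdate);

-- Only blocks whose MIN/MAX range covers this date are scanned:
SELECT COUNT(*) FROM logs WHERE logdate = '2013-06-09';
```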
Data is distributed throughout the cluster across slices:

[Diagram: Node 1 holds Slice 1 and Slice 2, Node 2 holds Slice 3 and Slice 4; the layout is shown three times in the original slide]
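Which slice a row lands on is governed by the table's distribution style; a sketch of the three styles (table and column names are illustrative):

```sql
-- KEY: rows with the same distribution key value land on the same slice,
-- co-locating matching rows for joins.
CREATE TABLE orders (order_id BIGINT, cust_id INT) DISTSTYLE KEY DISTKEY (cust_id);

-- EVEN: rows are spread round-robin across slices.
CREATE TABLE clicks (click_id BIGINT, url VARCHAR(256)) DISTSTYLE EVEN;

-- ALL: a full copy of the table on every node (small dimension tables).
CREATE TABLE locations (loc CHAR(3), city VARCHAR(64)) DISTSTYLE ALL;
```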
[Chart: data volume by year, 1990–2020: Generated Data vs. Available for Analysis]
Sources: Gartner, User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011; IDC, Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
Data lake (S3) processing:
– Directly access data in S3
– Scale out to thousands of nodes
– Open data formats
– Popular big data frameworks
– Anything you can dream up and code

Amazon Redshift:
– Super-fast local disk performance
– Sophisticated query optimization
– Join-optimized data formats
– Query using standard SQL
– Optimized for data warehousing

Amazon Redshift Spectrum:
– Fast @ exabyte scale
– Elastic & highly available
– On-demand, pay-per-query
– High concurrency: multiple clusters access the same data
– No ETL: query data in place using open file formats
– Full Amazon Redshift SQL support
– Data stays in S3
Query lifecycle with Amazon Redshift Spectrum:

[Diagram: SQL clients connect to Amazon Redshift via JDBC/ODBC; compute nodes 1, 2, 3, 4, … N call out to the Amazon Redshift Spectrum layer, which reads from Amazon S3 (exabyte-scale object storage) and a Data Catalog (Apache Hive Metastore compatible)]

1. A query is issued: SELECT COUNT(*) FROM S3.EXT_TABLE GROUP BY…
2. The query is optimized and compiled at the leader node, which determines what runs locally and what goes to Amazon Redshift Spectrum.
3. The query plan is sent to all compute nodes.
4. Compute nodes obtain partition info from the Data Catalog and dynamically prune partitions.
5. Each compute node issues multiple requests to the Amazon Redshift Spectrum layer.
6. Amazon Redshift Spectrum nodes scan your S3 data.
7. Amazon Redshift Spectrum projects, filters, joins, and aggregates.
8. Final aggregations and joins with local Amazon Redshift tables are done in-cluster.
9. The result is sent back to the client.
Roughly 140 TB of customer item order detail records for each day over the past 20 years; 190 million files across 15,000 partitions in S3 (one partition per day, for USA and rest of world). We need a billion-fold reduction in data processed. Running this query on a 1,000-node Hive cluster would take over 5 years.*

* Estimated using a 20-node Hive cluster & 1.4 TB of data, assuming linear scaling
* Query used a 20-node DC1.8XLarge Amazon Redshift cluster
* Not actual sales data; generated for this demo based on the data format used by Amazon Retail
– Leverages Amazon Redshift's advanced cost-based optimizer
– Pushes down projections, filters, aggregations, and join reduction
– Dynamic partition pruning to minimize data processed
– Automatic parallelization of query execution against S3 data
– Efficient join processing within the Amazon Redshift cluster
You pay for your Amazon Redshift cluster plus $5 per TB scanned from S3. Each query can leverage thousands of Amazon Redshift Spectrum nodes. You can reduce the TB scanned, and improve query performance, by:
– Partitioning data
– Using a columnar file format
– Compressing data
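Because Spectrum bills per TB scanned, a partition filter cuts both cost and latency; an illustrative query against a hypothetical date-partitioned external table:

```sql
-- Hypothetical external table partitioned by saledate: the WHERE clause
-- lets Spectrum scan a single partition instead of the full dataset.
SELECT COUNT(*)
FROM s3_ext.orders
WHERE saledate = '2017-05-14';
```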
Security: end-to-end data encryption; alerts & notifications; virtual private cloud; audit logging; certifications & compliance.
– Encrypt S3 data using SSE and AWS KMS
– Encrypt all Amazon Redshift data using KMS, AWS CloudHSM, or your on-premises HSMs
– Enforce SSL with perfect forward secrecy using ECDHE
– Amazon Redshift leader node in your VPC; compute nodes in a private VPC; Spectrum nodes in a private VPC, storing no state
– Communicate event-specific notifications via email, text message, or call with Amazon SNS
– All API calls are logged using AWS CloudTrail
– All SQL statements are logged within Amazon Redshift
– Certifications: PCI/DSS, FedRAMP, SOC1/2/3, HIPAA/BAA
Partition by date, time, and any other custom keys (e.g., year, month, day, hour).
CREATE EXTERNAL SCHEMA <schema_name> …

CREATE EXTERNAL TABLE <table_name>
  [PARTITIONED BY (<column_name> <data_type>, …)]
  STORED AS file_format
  LOCATION s3_location
  [TABLE PROPERTIES (property_name=property_value, …)];

Options (slide fragments): file formats; compression; encryption; column types (…and decimal); partitioning key; table type.

LOCATION examples:
– s3://mybucket/orders/.. (unpartitioned)
– s3://mybucket/orders/date=YYYY-MM-DD/.. (partitioned)
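A concrete sketch of the DDL above (schema, bucket, database, and IAM role names are all illustrative):

```sql
-- Hypothetical names throughout; the IAM role must grant S3 and catalog access.
CREATE EXTERNAL SCHEMA s3_ext
FROM DATA CATALOG DATABASE 'orders_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

CREATE EXTERNAL TABLE s3_ext.orders (
   order_id BIGINT
  ,amount   DECIMAL(10,2)
)
PARTITIONED BY (saledate DATE)
STORED AS PARQUET
LOCATION 's3://mybucket/orders/';
```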
You can use Hive CREATE TABLE AS SELECT to convert data:

CREATE TABLE data_converted
STORED AS PARQUET
AS SELECT col_1, col_2, col_3 FROM data_source;
Or use Spark: about 20 lines of PySpark code running on Amazon EMR.
https://github.com/awslabs/aws-big-data-blog/tree/master/aws-blog-spark-parquet-conversion
Your data will get bigger:
– On average, data warehousing volumes grow 10x every 5 years
– The average Amazon Redshift customer doubles data each year

Amazon Redshift Spectrum makes data analysis simpler:
– Access your data without ETL pipelines
– Teams using Amazon EMR, Athena & Redshift can collaborate using the same data lake

Amazon Redshift Spectrum improves availability and concurrency:
– Run multiple Amazon Redshift clusters against common data
– Isolate jobs with tight SLAs from ad hoc analysis
[Diagram: the AWS analytics portfolio spanning storage, serverless compute, and data processing]
– Amazon Athena: interactive query
– AWS Glue: ETL & data catalog
– Amazon S3: exabyte-scale object storage
– Amazon Kinesis Firehose: real-time data streaming
– Amazon EMR: managed Hadoop applications
– AWS Lambda: trigger-based code execution
– AWS Glue Data Catalog: Hive-compatible metastore
– Amazon Redshift Spectrum: fast @ exabyte scale
– Amazon Redshift: petabyte-scale data warehousing
https://aws.amazon.com/blogs/big-data/amazon-redshift-engineerings- advanced-table-design-playbook-preamble-prerequisites-and-prioritization/
– Admin scripts: collection of utilities for running diagnostics on your cluster
– Admin views: collection of utilities for managing your cluster, generating schema DDL, etc.
– ColumnEncodingUtility: apply optimal column encoding to an established schema with data already loaded