Data Warehouses Google BigQuery AWS Redshift Azure Data Lake - PowerPoint PPT Presentation

Data Warehouses Google BigQuery AWS Redshift Azure Data Lake Apache Hive

Motiv tivation ation  What if you want unlimited capacity while supporting fast querying (i.e. like Google Search)?  Small transactional in-memory databases support fast queries, but do not scale (SQL, MySQL etc.)  Large file systems support scale, but can not (natively) support querying (GCS, S3)  NoSQL databases store massive datasets via distributed hash-table, but are difficult to query efficiently (i.e. scan and query, not SQL) Portland State University CS 430P/530 Internet, Web & Cloud Systems

Da Data ta warehou rehouses ses  Typically used for On-line Analytical Processing (OLAP) apps  e.g. Log processing for site/app analytics  Store large datasets organized for write once, read/query many access (WORM)  Optimized for large reads and writes  Does not require transactional properties of On-line Transaction Processing (OLTP)  e.g. No need for ACID (i.e SQL/Spanner semantics)  Implemented via cheap disks and slower CPUs Portland State University CS 430P/530 Internet, Web & Cloud Systems

BigQuery

Bi BigQ gQuer uery  Gartner's Magic Quadrant report on public cloud services "Google’s differentiation factor lies in its deep investments in analytics and ML. Many customers who choose Google for strategic adoption have applications that are anchored by BigQuery." Portland State University CS 430P/530 Internet, Web & Cloud Systems

Bi BigQ gQuer uery  Fully managed, no-ops data warehouse  Developed by Google when MapReduce on 24 hours of logs took 24 hours to execute  Supports fast, streaming data storage  100k rows and hundreds of TB per second  High-performance querying via SQL-like interface  Near real-time analysis of massive datasets via replication and parallelism  Allows one to bring code to where data is (in the cloud)  How? Portland State University CS 430P/530 Internet, Web & Cloud Systems

Initi itially: ally: Fl Flat at-file ile (row-based based) ) st storage ge for logs gs  Log data typically stored in a flat file in file system  Example: packet traces 09:59:27.329166 IP 10.218.128.13.62562 > mashimaro.cs.pdx.edu.http: Flags [.], ack 419, win 258, length 0 09:59:33.011286 IP 10.218.128.13.62562 > mashimaro.cs.pdx.edu.http: Flags [F.], seq 373, ack 419, win 258, length 0 09:59:33.011297 IP mashimaro.cs.pdx.edu.http > 10.218.128.13.62562: Flags [F.], seq 419, ack 374, win 237, length 0 09:59:33.017169 IP 10.218.128.13.62562 > mashimaro.cs.pdx.edu.http: Flags [.], ack 420, win 258, length 0  Consider calculating inter-arrival times of packets  Stored as a flat file? Reads entire trace  Common access pattern for data analytics is to access only one piece of data (i.e. slice or column) from each entry (row) Portland State University CS 430P/530 Internet, Web & Cloud Systems

Idea ea #1: Column umn-orient oriented ed st storag age  Split columns into separate contiguously stored files for performance  Reduces data accesses for column-oriented queries  Achieves better compression  Grouping of similar data types in columns  Replicate to support parallel querying  Only common columns needed in queries replicated Portland State University CS 430P/530 Internet, Web & Cloud Systems

Idea ea #2: Em Empl ploy y se serverles erless s que ueryin ying  Queries spawn off computing and storage resources to execute  Up to 2,000 nodes/shards if available  Done over a petabit network in backend data center  Pay per query with minimal cost to store data Portland State University CS 430P/530 Internet, Web & Cloud Systems

Bi BigQ gQuer uery dem emo  Copy and paste the query into editor  Use settings to specify "Legacy SQL"  Note how much data will be accessed SELECT name, sum(number) as name_count FROM [bigquery-public-data:usa_names.usa_1910_2013] WHERE gender='F' GROUP BY name ORDER BY name_count DESC LIMIT 10 SELECT language, SUM(views) as views FROM [bigquery-samples:wikipedia_benchmark.Wiki1B] // 1 b rows WHERE regexp_match(title,"Goog.*") GROUP BY language ORDER BY views DESC  Cached results are free  Check timing Portland State University CS 430P/530 Internet, Web & Cloud Systems

BigQ Bi gQuer uery dem emo  Larger query (Preview only. DO NOT RUN) SELECT language, SUM(views) as views FROM [bigquery-samples:wikipedia_benchmark.Wiki100B] // 100 b rows WHERE regexp_match(title,"G.*o.*o.*g") GROUP BY language ORDER BY views DESC  Pricing  < $0.02 per GB stored per month (first TB free)  But, $5 per TB processed  Do NOT do a “SELECT *”  Pay attention to dry run cost estimate before execution! Portland State University CS 430P/530 Internet, Web & Cloud Systems

Pub ublic lic datase tasets ts on Bi BigQ gQuer uery  QuickDraw with Google  50 million drawings  https://quickdraw.withgoogle.com/data  Github  Find out whether programmers prefer tabs or spaces  NYC public data  Find out which neighborhoods have the most car thefts  Find out which neighborhoods have issues with rat infestation (311 calls on rats)  NOAA ICODE ship data from 1662  Find ships nearby when Titanic sank  Bitcoin and Ethereum block-chains Portland State University CS 430P/530 Internet, Web & Cloud Systems

Data Notebooks iPython, Jupyter Google Cloud Datalab Azure Notebooks

Da Data ta notebooks ebooks  Interactive authoring tool  Combine program code (Python) with rich document elements (text, figures, equations, links)  e.g. Like a Google Doc that can execute code  Used to document data exploration, transformation, analysis, and visualization  Notebook includes the data products and artifacts along with code that generated them  Allows one to disseminate results in a reproducible manner! Portland State University CS 430P/530 Internet, Web & Cloud Systems

Da Data ta notebooks ebooks  Initially iPython (interactive Python) run locally  Now Jupyter  Server-based notebooks  Interpreter runs on server, wrapped in HTML and served via web  Server installed with all packages and data for producing artifacts within code Portland State University CS 430P/530 Internet, Web & Cloud Systems

Installing stalling Ju Jupyter er locally cally virtualenv -p python3 env source env/bin/activate pip install jupyter jupyter-notebook  Launches a web server that hosts the interactive notebook as a web app  Visit URL in browser Portland State University CS 430P/530 Internet, Web & Cloud Systems

Bi Binder nder  Combine Github and Docker with Jupyter  Github repo contains notebook and dependencies for launching it (via requirements.txt or app.yaml )  Binder parses dependencies and builds Docker image to run notebook  Allows you to replicate notebook environment of others Portland State University CS 430P/530 Internet, Web & Cloud Systems

Use sed to disse sseminat minate e exp xperimen eriments ts Portland State University CS 430P/530 Internet, Web & Cloud Systems

Go Google gle Cloud ud Da Datalab talab  Hosted Juypter instance  For analyzing data in the cloud  Avoid downloading data  Avoid installing all of the GCP libraries  Service automatically spins up a Jupyter instance on a Compute Engine VM  Access to BigQuery and Cloud Storage  Access to services such as Machine Learning Engine Portland State University CS 430P/530 Internet, Web & Cloud Systems

BigQuery, DataNotebooks Labs

Data Warehouses Google BigQuery AWS Redshift Azure Data Lake - PowerPoint PPT Presentation

Data Warehouses Google BigQuery AWS Redshift Azure Data Lake Apache Hive Motiv tivation ation What if you want unlimited capacity while supporting fast querying (i.e. like Google Search)? Small transactional in-memory databases

Towards Benchmarking Stream Data Warehouses Arian Br, Lukasz Golab 02.11.2012 Stream Data

BUSINESS ANALYTICS CHAPTER 29 LECTURE OUTLINE Data warehouses Comparison with

Towards a Formal Model for View Maintenance in Data Warehouses D. Agrawal, A. El Abbadi, A.

Data Science in the Wild Lecture 12: Memory-Based Data Warehouses Eran Toch Data Science in the

Data Mining for Knowledge Management Data Warehouses Themis Palpanas University of Trento

Carnegie Mellon Univ. Problem Dept. of Computer Science Getting the data: Data Warehouses,

Developing Data Warehouses with Quality in Mind Yannis Vassiliou National Technical University

XCube XML for Data Warehouses Wolfgang Hmmer Andreas Bauer Gunnar Harde DOLAP 2003 New

The DataPath System: A Data-Centric Analytic Processing Engine for Large Data Warehouses Subi

MaSM: Efficient Online Updates in Data Warehouses Manos Athanassoulis 1 Shimin Chen 2 Anastasia

Altinity Building Multi-Petabyte Data Warehouses with ClickHouse Alexander Zaitsev LifeSteet,

Value-driven Approach Designing Extended Data Warehouses Nabila BERKANI & Selma KHOURI

in data warehouses Manfred Jeusfeld University of Skvde, Sweden 1 (c) 2015 M. Jeusfeld,

Dynamic Workload Management for Very Large Data Warehouses Juggling Feathers and Bowling Balls

Towards a Formal Model for View Maintenance in Data Warehouses D. Agrawal , A. El Abbadi , A.

Benchmarking Summarizability Processing in Colocated with XML Warehouses with ACM CIKM 2012

File Syst ems Last t ime we t alked about disk int ernals 11: File Syst em Basics Despit

Flat Datacenter Storage Edmund B. Nightingale, Jeremy Elson, et al. 6.S897 Motivation Imagine a

Streaming Grand Challenge Overview Graham Heyes February 12 th 2019 Where are we now? Online :

C S C I 1 2 7 0 I n t r o d u c t i o n t o D a t a b a s e S y s

Next Generation Testing Cdric Beust, Google Alexandru Popescu, InfoQ Alexandru Popescu, InfoQ

More classes S4 S4 was the second OOP system introduced to R. It is much more formal than S3,

SLAC Internet Measurement Data Les Cottrell , Jerrod Williams, Connie Logg, Paola Grosso SLAC,

Cray User Group May 2009 James H. Laros III Sandia National Laboratories Sandia is a

Data Warehouses Google BigQuery AWS Redshift Azure Data Lake - PowerPoint PPT Presentation

Data Warehouses Google BigQuery AWS Redshift Azure Data Lake Apache Hive Motiv tivation ation What if you want unlimited capacity while supporting fast querying (i.e. like Google Search)? Small transactional in-memory databases

Towards Benchmarking Stream Data Warehouses Arian Br, Lukasz Golab 02.11.2012 Stream Data

BUSINESS ANALYTICS CHAPTER 29 LECTURE OUTLINE Data warehouses Comparison with

Towards a Formal Model for View Maintenance in Data Warehouses D. Agrawal, A. El Abbadi, A.

Data Science in the Wild Lecture 12: Memory-Based Data Warehouses Eran Toch Data Science in the

Data Mining for Knowledge Management Data Warehouses Themis Palpanas University of Trento

Carnegie Mellon Univ. Problem Dept. of Computer Science Getting the data: Data Warehouses,

Developing Data Warehouses with Quality in Mind Yannis Vassiliou National Technical University

XCube XML for Data Warehouses Wolfgang Hmmer Andreas Bauer Gunnar Harde DOLAP 2003 New

The DataPath System: A Data-Centric Analytic Processing Engine for Large Data Warehouses Subi

MaSM: Efficient Online Updates in Data Warehouses Manos Athanassoulis 1 Shimin Chen 2 Anastasia

Altinity Building Multi-Petabyte Data Warehouses with ClickHouse Alexander Zaitsev LifeSteet,

Value-driven Approach Designing Extended Data Warehouses Nabila BERKANI &amp; Selma KHOURI

in data warehouses Manfred Jeusfeld University of Skvde, Sweden 1 (c) 2015 M. Jeusfeld,

Dynamic Workload Management for Very Large Data Warehouses Juggling Feathers and Bowling Balls

Towards a Formal Model for View Maintenance in Data Warehouses D. Agrawal , A. El Abbadi , A.

Benchmarking Summarizability Processing in Colocated with XML Warehouses with ACM CIKM 2012

File Syst ems Last t ime we t alked about disk int ernals 11: File Syst em Basics Despit

Flat Datacenter Storage Edmund B. Nightingale, Jeremy Elson, et al. 6.S897 Motivation Imagine a

Streaming Grand Challenge Overview Graham Heyes February 12 th 2019 Where are we now? Online :

C S C I 1 2 7 0 I n t r o d u c t i o n t o D a t a b a s e S y s

Next Generation Testing Cdric Beust, Google Alexandru Popescu, InfoQ Alexandru Popescu, InfoQ

More classes S4 S4 was the second OOP system introduced to R. It is much more formal than S3,

SLAC Internet Measurement Data Les Cottrell , Jerrod Williams, Connie Logg, Paola Grosso SLAC,

Cray User Group May 2009 James H. Laros III Sandia National Laboratories Sandia is a

Value-driven Approach Designing Extended Data Warehouses Nabila BERKANI & Selma KHOURI