

SLIDE 1

FROM FLAT FILES TO DECONSTRUCTED DATABASE

The evolution and future of the Big Data ecosystem.

Julien Le Dem @J_ julien.ledem@wework.com April 2018

SLIDE 2

Julien Le Dem (@J_)

Principal Data Engineer

  • Author of Parquet
  • Apache member
  • Apache PMCs: Arrow, Kudu, Heron, Incubator, Pig, Parquet, Tez
  • Used Hadoop first at Yahoo in 2007
  • Formerly Twitter Data platform and Dremio
SLIDE 3

Agenda

❖ At the beginning there was Hadoop (2005)
❖ Actually, SQL was invented in the 70s: “MapReduce: A major step backwards”
❖ The deconstructed database
❖ What next?

SLIDE 4

At the beginning there was Hadoop

SLIDE 5

Hadoop

Storage: a distributed file system
Execution: MapReduce

Based on Google’s GFS and MapReduce papers

SLIDE 6

Great at looking for a needle in a haystack

SLIDE 7

Great at looking for a needle in a haystack … with snowplows

SLIDE 8

Original Hadoop abstractions

Execution: Map/Reduce

  • Simple
  • Flexible/Composable
  • Logic and optimizations tightly coupled
  • Ties execution with persistence

M M M → shuffle → R R R (map tasks read locally, reduce tasks write locally)

Storage: File System

Just flat files:

  • Any binary
  • No schema
  • No standard
  • The job must know how to split the data
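The Map/Reduce abstraction above fits in a few lines of code. A minimal, single-process sketch (plain Python, no Hadoop; word count chosen as the classic example job) of the map → shuffle → reduce phases:

```python
from collections import defaultdict

def map_phase(records, mapper):
    # Apply the user's map function to every input record,
    # emitting (key, value) pairs.
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    # Group all values by key, as the framework's shuffle phase does.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    # Apply the user's reduce function to each key's list of values.
    return {key: reducer(key, values) for key, values in groups.items()}

# Word count: the user supplies only the map and reduce logic.
def word_mapper(line):
    for word in line.split():
        yield word, 1

def count_reducer(word, counts):
    return sum(counts)

lines = ["the needle in the haystack", "the haystack"]
result = reduce_phase(shuffle(map_phase(lines, word_mapper)), count_reducer)
print(result)  # {'the': 3, 'needle': 1, 'in': 1, 'haystack': 2}
```

Note how everything besides `word_mapper` and `count_reducer` is framework plumbing: the job's logic and the execution machinery are tightly coupled, which is exactly the limitation the following slides address.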

SLIDE 9

“MapReduce: A major step backwards” (2008)

SLIDE 10

SQL

Databases have been around for a long time.

Relational model:

  • First described in 1969 by Edgar F. Codd
  • Originally SEQUEL, developed in the early 70s at IBM

Global standard:

  • 1986: first SQL standard
  • Updated in 1989, 1992, 1996, 1999, 2003, 2006, 2008, 2011, 2016
  • Universal data access language
  • Widely understood
SLIDE 11

Underlying principles of relational databases

Standard: SQL is understood by many.

Separation of logic and optimization:

  • High level language focusing on logic (SQL)
  • Separation of schema and application
  • Indexing
  • Optimizer

Evolution:

  • Views
  • Schemas

Integrity:

  • Transactions
  • Integrity constraints
  • Referential integrity

SLIDE 12

Relational Database abstractions

Storage: Tables, an abstracted notion of data.

  • Defines a schema
  • Format/layout decoupled from queries
  • Has statistics/indexing
  • Can evolve over time

Execution: SQL, which decouples the logic of the query from:

  • Optimizations
  • Data representation
  • Indexing

Example: SELECT a, AVG(b) FROM FOO GROUP BY a
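A query like this runs unchanged on any relational engine, because it states only the logic. A small sketch using Python's built-in sqlite3 with a toy FOO table (the rows are invented for illustration):

```python
import sqlite3

# In-memory database: the engine owns layout, statistics, and optimization;
# the query states only what to compute.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE FOO (a TEXT, b REAL)")
conn.executemany(
    "INSERT INTO FOO VALUES (?, ?)",
    [("x", 1.0), ("x", 3.0), ("y", 10.0)],  # toy rows, invented for the example
)

# The exact query from the slide.
rows = conn.execute("SELECT a, AVG(b) FROM FOO GROUP BY a").fetchall()
print(rows)  # [('x', 2.0), ('y', 10.0)]
```

Swapping sqlite3 for any other SQL engine leaves the query untouched; only the storage and execution beneath it change.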

SLIDE 13

Query evaluation

Syntax → Semantic → Optimization → Execution

SLIDE 14

A well integrated system

SELECT f.c, AVG(b.d) FROM FOO f JOIN BAR b ON f.a = b.b WHERE f.d = x GROUP BY f.c

The user’s query goes through Syntax → Semantic → Optimization → Execution. The resulting plan (Scan FOO / Scan BAR → FILTER → JOIN → GROUP BY → Select) is backed at every stage by the storage layer: table metadata (schema, stats, layout, …), columnar data, and push downs.

SLIDE 15

So why? Why Hadoop? Why Snowplows?

SLIDE 16

The relational model was constrained

Constraints are good: they allow optimizations.

  • Statistics
  • Pick the best join algorithm
  • Change the data layout
  • Reusable optimization logic

But we need the right constraints and the right abstractions. Traditional SQL implementations:

  • Flat schema
  • Inflexible schema evolution
  • History rewrite required
  • No lower level abstractions
  • Not scalable
SLIDE 17

It’s just code

Hadoop is flexible and scalable:

  • Room to scale algorithms that are not part of the standard
  • Machine learning
  • Your imagination is the limit

No data shape constraint:

  • Nested data structures
  • Unstructured text with semantic annotations
  • Graphs
  • Non-uniform schemas

Open source:

  • You can improve it
  • You can expand it
  • You can reinvent it
SLIDE 18

You can actually implement SQL with this

SLIDE 19

SELECT f.c, AVG(b.d) FROM FOO f JOIN BAR b ON f.a = b.b WHERE f.d = x GROUP BY f.c

A parser turns the query into a plan (Scan FOO / Scan BAR → FILTER → JOIN → GROUP BY → Select) whose operators execute as MapReduce jobs over the flat files.

And they did… (open-sourced 2009)
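To make SQL-on-MapReduce concrete, here is a minimal sketch (plain Python, invented toy data; the JOIN is omitted for brevity) of how the WHERE filter and the GROUP BY/AVG of a query like the one above map onto the map, shuffle, and reduce phases:

```python
from collections import defaultdict

# Toy FOO rows (invented): (a, b, c, d) tuples.
FOO = [("k1", 1.0, "g1", 5), ("k1", 3.0, "g1", 5), ("k2", 8.0, "g2", 7)]
x = 5

def map_foo(row):
    # Map phase: apply the WHERE filter, emit (group key, value) pairs.
    a, b, c, d = row
    if d == x:           # FILTER pushed into the mapper
        yield c, b       # GROUP BY key -> column to aggregate

# Shuffle: group values by key (the framework does this in real MapReduce).
groups = defaultdict(list)
for row in FOO:
    for key, value in map_foo(row):
        groups[key].append(value)

# Reduce phase: one AVG per group.
result = {key: sum(vals) / len(vals) for key, vals in groups.items()}
print(result)  # {'g1': 2.0}
```

A real SQL-on-Hadoop engine generates exactly this kind of map and reduce logic from the parsed query plan, one MapReduce stage per plan operator (or group of fused operators).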

SLIDE 20

10 years later

SLIDE 21

The deconstructed database

Author: gamene https://www.flickr.com/photos/gamene/4007091102

SLIDE 22

The deconstructed database

SLIDE 23

The deconstructed database

Query model, machine learning, storage, batch execution, data exchange, stream processing.

SLIDE 24

We can mix and match individual components

*not exhaustive!

Specialized components: stream processing, storage, execution, SQL, stream persistence, streams, resource management, machine learning.

SLIDE 25

We can mix and match individual components

Storage

Row oriented or columnar; immutable or mutable; stream storage vs analytics optimized.

Query model

SQL, functional, …

Machine learning

Training models.

Data exchange

Row oriented or columnar.

Batch execution

Optimized for high throughput and historical analysis.

Streaming execution

Optimized for high throughput and low latency processing.
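The row oriented vs columnar distinction in the data exchange layer is easy to see in code. A sketch (plain Python, toy records invented for illustration) converting between the two layouts:

```python
# Row oriented: one record per entry; natural for serving and record-at-a-time
# processing.
rows = [
    {"a": 1, "b": "x"},
    {"a": 2, "b": "y"},
    {"a": 3, "b": "x"},
]

# Columnar: one contiguous list per column; natural for scans and vectorized
# execution, since a query touching only "a" never materializes "b".
def to_columnar(rows):
    return {name: [row[name] for row in rows] for name in rows[0]}

def to_rows(cols):
    names = list(cols)
    return [dict(zip(names, values)) for values in zip(*cols.values())]

cols = to_columnar(rows)
print(cols)                    # {'a': [1, 2, 3], 'b': ['x', 'y', 'x']}
assert to_rows(cols) == rows   # the conversion round-trips losslessly
```

A standard columnar exchange format exists precisely so that components can hand each other the `cols` form directly, instead of paying this conversion on every hop between systems.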

SLIDE 26

Emergence of standard components

SLIDE 27

Emergence of standard components

Columnar storage

Apache Parquet as the columnar representation at rest.

SQL parsing and optimization

Apache Calcite as a versatile query optimizer framework.

Schema model

Apache Avro as a pipeline friendly schema for the analytics world.

Columnar exchange

Apache Arrow as the next generation in-memory representation and no-overhead data exchange.

Table abstraction

Netflix’s Iceberg has great potential to provide snapshot isolation and layout abstraction on top of distributed file systems.

SLIDE 28

The deconstructed database’s optimizer: Calcite

Schema plugins, optimizer rules, storage, and execution engines all plug into Calcite. The query (SELECT f.c, AVG(b.d) FROM FOO f JOIN BAR b ON f.a = b.b WHERE f.d = x GROUP BY f.c) goes through Syntax → Semantic → Optimization, and the optimized plan (Scan FOO / Scan BAR → FILTER → JOIN → GROUP BY → Select) is handed to the execution engine.

SLIDE 29

Apache Calcite is used in:

Streaming SQL

  • Apache Apex
  • Apache Flink
  • Apache SamzaSQL
  • Apache StormSQL

Batch SQL

  • Apache Hive
  • Apache Drill
  • Apache Phoenix
  • Apache Kylin

SLIDE 30

The deconstructed database’s storage

Storage systems vary along two axes: columnar vs row oriented (optimized for analytics vs optimized for serving), and immutable vs mutable (stream storage vs analytics optimized).

Query integration: to be performant, a query engine requires deep integration with the storage layer, implementing push downs and a vectorized reader that produces data in an efficient representation (for example Apache Arrow).

SLIDE 31

Storage: Push downs

PROJECTION: read only the columns that are needed.

  • Columnar storage makes this efficient.

PREDICATE: evaluate filters during the scan to:

  • Leverage storage properties (min/max stats, partitioning, sort, etc.)
  • Avoid decoding skipped data
  • Reduce IO

AGGREGATION: to further reduce IO, aggregation can also be implemented during the scan to:

  • Minimize materialization of intermediary data
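All three push downs can be sketched in a few lines; a minimal, single-process sketch (plain Python, with invented row groups and min/max statistics standing in for a real columnar file format) answering a query of the shape SELECT SUM(a) FROM t WHERE c = 5:

```python
# Toy columnar file: row groups, each holding per-column value lists plus
# min/max statistics for the filter column (names and data invented).
row_groups = [
    {"stats": {"c": (1, 4)}, "cols": {"a": [10, 20], "b": ["p", "q"], "c": [1, 4]}},
    {"stats": {"c": (5, 5)}, "cols": {"a": [30, 40], "b": ["r", "s"], "c": [5, 5]}},
    {"stats": {"c": (6, 9)}, "cols": {"a": [50],     "b": ["t"],      "c": [7]}},
]

def scan_sum(groups, value_col, filter_col, filter_val):
    """SELECT SUM(value_col) WHERE filter_col = filter_val, pushed into the scan."""
    total = 0
    for rg in groups:
        lo, hi = rg["stats"][filter_col]
        if not (lo <= filter_val <= hi):
            continue                      # PREDICATE: skip via min/max stats, no decode
        values = rg["cols"][value_col]    # PROJECTION: only two columns are read;
        flags = rg["cols"][filter_col]    # column "b" is never touched
        # AGGREGATION: sum during the scan, no intermediary rows materialized
        total += sum(v for v, f in zip(values, flags) if f == filter_val)
    return total

print(scan_sum(row_groups, "a", "c", 5))  # 70 (only the second row group is decoded)
```

The first and third row groups are eliminated by their statistics alone, which is the IO saving the slide describes.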

SLIDE 32

The deconstructed database interchange: Apache Arrow

SELECT SUM(a) FROM t WHERE c = 5 GROUP BY b

Scanners read the Parquet files with projection and predicate push downs (read only the needed columns), emit Arrow batches to partial aggregations, then a shuffle feeds the final aggregations that produce the result.

SLIDE 33

Storage: Stream persistence

Open source projects (incumbent and interesting newcomers); features:

  • State of consumer: how do we recover from failure
  • Snapshot
  • Decoupling reads from writes
  • Parallel reads
  • Replication
  • Data isolation
SLIDE 34

Big Data infrastructure blueprint

SLIDE 35

Big Data infrastructure blueprint

SLIDE 36

Big Data infrastructure blueprint

Data infra: stream persistence feeds stream processing and batch processing. Analysts consume real time dashboards, interactive analysis, and periodic dashboards; engineers publish in real time and in batch through Data APIs. Supporting metadata services: schema registry, dataset metadata, scheduling/job dependencies. Data storage: S3, GCS, or HDFS, holding Parquet and Avro files. (Legend in the original diagram: processing, metadata, monitoring/observability, persistence, UI.)

SLIDE 37

The Future

SLIDE 38

Still Improving

Better interoperability:

  • More efficient interoperability: continued Apache Arrow adoption

A better data abstraction:

  • A better metadata repository
  • A better table abstraction: Netflix/Iceberg
  • Common push down implementations (filter, projection, aggregation)

Better data governance:

  • Global data lineage
  • Access control
  • Protect privacy
  • Record user consent
SLIDE 39

Some predictions

SLIDE 40

A common access layer

A distributed access service centralizes:

  • Schema evolution
  • Access control/anonymization
  • Efficient push downs
  • Efficient interoperability

The table abstraction layer (schema, push downs, access control, anonymization) sits between the engines (SQL: Impala, Drill, Presto, Hive, …; batch: Spark, MR, Beam; ML: TensorFlow, …) and the storage: file systems (HDFS, S3, GCS) and other stores (HBase, Cassandra, Kudu, …).
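What such a common access layer might expose can be sketched as an interface. This is a speculative sketch of the prediction, not any existing API; every name below is invented for illustration:

```python
from dataclasses import dataclass
from typing import Callable, Iterator

@dataclass
class ScanRequest:
    columns: list                 # projection push down
    predicate: Callable           # predicate push down (row -> bool, simplified)

class Table:
    """Hypothetical table abstraction: every engine (SQL, batch, ML) sees the
    same scan contract; storage details stay behind it."""

    def __init__(self, schema, rows):
        self.schema = schema      # schema evolution is owned here, not by engines
        self._rows = rows

    def scan(self, req: ScanRequest) -> Iterator[dict]:
        # Access control / anonymization hooks would also apply here.
        for row in self._rows:
            if req.predicate(row):
                yield {c: row[c] for c in req.columns}

# Toy usage: one table, one engine-agnostic scan with both push downs applied.
t = Table({"a": int, "d": int}, [{"a": 1, "d": 5}, {"a": 2, "d": 7}])
out = list(t.scan(ScanRequest(columns=["a"], predicate=lambda r: r["d"] == 5)))
print(out)  # [{'a': 1}]
```

The point of the prediction is that the scan contract, not the storage format, becomes the shared surface: engines above and stores below can then evolve independently.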

SLIDE 41

A multi tiered streaming batch storage system

Batch-stream storage integration: a convergence of Kafka and Kudu.

  • Schema evolution
  • Access control/anonymization
  • Efficient push downs
  • Efficient interoperability

A time based ingestion API feeds three tiers: 1) an in-memory row oriented store, 2) an in-memory column oriented store, 3) an on-disk column oriented store. A stream consumer API and a batch consumer API both support projection, predicate, and aggregation push downs; the stream consumer mainly reads from the in-memory tiers, the batch consumer mainly from the on-disk columnar store.

SLIDE 42

Questions?

Julien Le Dem @J_ julien.ledem@wework.com April 2018

SLIDE 43

THANK YOU!

julien.ledem@wework.com Julien Le Dem @J_ April 2018