 
FROM FLAT FILES TO DECONSTRUCTED DATABASE
The evolution and future of the Big Data ecosystem.
Julien Le Dem @J_ julien.ledem@wework.com
April 2018
Julien Le Dem @J_
Principal Data Engineer
• Author of Parquet
• Apache member
• Apache PMCs: Arrow, Kudu, Heron, Incubator, Pig, Parquet, Tez
• Used Hadoop first at Yahoo in 2007
• Formerly Twitter Data platform and Dremio
Agenda
❖ At the beginning there was Hadoop (2005)
❖ Actually, SQL was invented in the 70s: “MapReduce: A major step backwards”
❖ The deconstructed database
❖ What next?
At the beginning there was Hadoop
Hadoop
• Storage: a distributed file system
• Execution: MapReduce
Based on Google’s GFS and MapReduce papers
Great at looking for a needle in a haystack … with snowplows
Original Hadoop abstractions
Execution: Map/Reduce
● Simple
● Flexible/Composable
● Logic and optimizations tightly coupled
● Ties execution with persistence
Storage: File System
● Just flat files
● Any binary
● No schema
● No standard
● The job must know how to split the data
[Diagram: mappers (M) read locally, a shuffle moves the data, reducers (R) write locally]
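To make the Map/Reduce abstraction on this slide concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps. It is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the word-count job are assumptions chosen for illustration. Note how the job itself has to know how to parse the flat input.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Mapper: the job decides how to parse one raw line of a flat file.
    for word in line.split():
        yield word.lower(), 1

def shuffle(pairs):
    # Shuffle: group every value emitted by the mappers under its key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reducer: combine all values that share a key.
    return key, sum(values)

def run_job(lines):
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

counts = run_job(["needle hay hay", "hay Needle"])
```

The parsing logic, the grouping key, and the aggregation all live in user code, which is the tight coupling of logic and optimization that the slide points out.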
“MapReduce: A major step backwards” (2008)
Databases have been around for a long time
Relational model
• First described in 1969 by Edgar F. Codd
SQL
• Originally SEQUEL, developed in the early 70s at IBM
• 1986: first SQL standard
• Updated in 1989, 1992, 1996, 1999, 2003, 2006, 2008, 2011, 2016
Global Standard
• Universal data access language
• Widely understood
Underlying principles of relational databases
Standard
• SQL is understood by many
Separation of logic and optimization
• High level language focusing on logic (SQL)
• Indexing
• Optimizer
Separation of Schema and Application
• Evolution
• Views
• Schemas
Integrity
• Transactions
• Integrity constraints
• Referential integrity
Relational Database abstractions
Execution: SQL
SELECT a, AVG(b) FROM FOO GROUP BY a
● Decouples logic of query from:
  ○ Optimizations
  ○ Data representation
  ○ Indexing
Storage: Tables
● Abstracted notion of data
● Defines a Schema
● Format/layout decoupled from queries
● Has statistics/indexing
● Can evolve over time
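As a small illustration of logic being decoupled from indexing, the sketch below runs the slide's query with Python's built-in `sqlite3` module before and after adding an index. The sample rows, the index name `foo_a`, and the added `ORDER BY a` (only there to make the output order deterministic) are assumptions, not part of the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE FOO (a TEXT, b REAL)")
conn.executemany(
    "INSERT INTO FOO VALUES (?, ?)",
    [("x", 1.0), ("x", 3.0), ("y", 10.0)],
)

# The query states only the logic, not how to execute it.
QUERY = "SELECT a, AVG(b) FROM FOO GROUP BY a ORDER BY a"
before = conn.execute(QUERY).fetchall()

# Change the physical design; the query text does not change.
conn.execute("CREATE INDEX foo_a ON FOO (a)")
after = conn.execute(QUERY).fetchall()
```

The result is the same in both cases; only the options available to the engine for executing the query have changed.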
Query evaluation
Syntax → Semantic → Optimization → Execution
A well integrated system
SELECT f.c, AVG(b.d) FROM FOO f JOIN BAR b ON f.a = b.b WHERE f.d = x GROUP BY f.c
[Diagram: the user's query goes through Syntax, Semantic, Optimization, and Execution. The plan trees are built from Select, GROUP BY, JOIN, FILTER, Scan FOO, and Scan BAR. Storage exposes a Table made of columnar data plus metadata (schema, stats, layout, …) and receives push downs.]
So why? Why Hadoop? Why Snowplows?
The relational model was constrained
Constraints are good: they allow optimizations
• Statistics
• Pick the best join algorithm
• Change the data layout
• Reusable optimization logic
We need the right constraints and the right abstractions
Traditional SQL implementations:
• Flat schema
• Inflexible schema evolution
• History rewrite required
• No lower level abstractions
• Not scalable
Hadoop is flexible and scalable
No data shape constraint
• Nested data structures
• Unstructured text with semantic annotations
• Graphs
• Non-uniform schemas
It’s just code
• Room to scale algorithms that are not part of the standard
• Machine learning
• Your imagination is the limit
Open source
• You can improve it
• You can expand it
• You can reinvent it
You can actually implement SQL with this
And they did…
SELECT f.c, AVG(b.d) FROM FOO f JOIN BAR b ON f.a = b.b WHERE f.d = x GROUP BY f.c
[Diagram: Parser → plan tree (Select, GROUP BY, JOIN, FILTER, Scan FOO, Scan BAR) → Execution over FOO and BAR]
(open-sourced 2009)
10 years later
The deconstructed database Author: gamene https://www.flickr.com/photos/gamene/4007091102
The deconstructed database
[Diagram of components: Query model, Data Exchange, Stream Processing, Batch execution, Machine learning, Storage]
We can mix and match individual components
Categories: Stream processing, Machine learning, Specialized Components, Streams, Stream persistence, Storage, Execution, SQL, Resource management
*not exhaustive!
We can mix and match individual components
Storage
• Row oriented or columnar
• Immutable or mutable
• Stream storage vs analytics optimized
Query model
• SQL
• Functional
• …
Machine Learning
• Training models
Data Exchange
• Row oriented
• Columnar
Streaming Execution
• Optimized for high throughput and low latency processing
Batch Execution
• Optimized for high throughput and historical analysis
Emergence of standard components
Columnar Storage
• Apache Parquet as columnar representation at rest.
SQL parsing and optimization
• Apache Calcite as a versatile query optimizer framework.
Schema model
• Apache Avro as pipeline friendly schema for the analytics world.
Columnar Exchange
• Apache Arrow as the next generation in-memory representation and no-overhead data exchange.
Table abstraction
• Netflix’s Iceberg has great potential to provide snapshot isolation and layout abstraction on top of distributed file systems.
The deconstructed database’s optimizer: Calcite
SELECT f.c, AVG(b.d) FROM FOO f JOIN BAR b ON f.a = b.b WHERE f.d = x GROUP BY f.c
[Diagram: Calcite covers the Syntax, Semantic, and Optimization steps. The pluggable parts are Storage, Schema plugins, Optimizer rules, and the Execution engine. The plan trees are built from Select, GROUP BY, JOIN, FILTER, Scan FOO, and Scan BAR.]
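One way to picture an optimizer rule is as a function that rewrites a plan tree. The sketch below is not Calcite's API: the `Scan`, `Filter`, and `Join` classes and the `push_filter_below_join` rule are hypothetical, and the rule assumes an inner join and only handles a filter at the root of the plan. It moves the slide's `f.d = x` filter below the join so that less data reaches the join.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Scan:
    table: str

@dataclass(frozen=True)
class Filter:
    child: object
    table: str       # the single table the predicate refers to
    predicate: str

@dataclass(frozen=True)
class Join:
    left: object
    right: object
    condition: str

def tables(node):
    # Collect the tables read anywhere below a plan node.
    if isinstance(node, Scan):
        return {node.table}
    if isinstance(node, Filter):
        return tables(node.child)
    return tables(node.left) | tables(node.right)

def push_filter_below_join(node):
    # Rule: Filter(Join(l, r)) becomes Join(Filter(l), r) when the
    # predicate only touches l (and symmetrically for r). Inner joins only.
    if isinstance(node, Filter) and isinstance(node.child, Join):
        join = node.child
        if node.table in tables(join.left):
            pushed = push_filter_below_join(Filter(join.left, node.table, node.predicate))
            return Join(pushed, join.right, join.condition)
        if node.table in tables(join.right):
            pushed = push_filter_below_join(Filter(join.right, node.table, node.predicate))
            return Join(join.left, pushed, join.condition)
    return node

plan = Filter(Join(Scan("FOO"), Scan("BAR"), "f.a = b.b"), "FOO", "f.d = x")
optimized = push_filter_below_join(plan)
```

A real optimizer framework applies many such rules and picks among alternatives using cost and statistics; this sketch shows only the shape of a single rewrite.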
Apache Calcite is used in:
Batch SQL
• Apache Hive
• Apache Drill
• Apache Phoenix
• Apache Kylin
• …
Streaming SQL
• Apache Apex
• Apache Flink
• Apache SamzaSQL
• Apache StormSQL
• …
The deconstructed database’s storage
[Grid of storage systems: Columnar (optimized for analytics) vs Row oriented (optimized for serving), Immutable vs Mutable]
Query integration
• To be performant, a query engine requires deep integration with the storage layer.
• This means implementing push downs and a vectorized reader that produces data in an efficient representation (for example Apache Arrow).
Storage: Push downs
PROJECTION: read only what you need
Read only the columns that are needed:
• Columnar storage makes this efficient.
PREDICATE: filter
Evaluate filters during scan to:
• Leverage storage properties (min/max stats, partitioning, sort, etc)
• Avoid decoding skipped data.
• Reduce IO.
AGGREGATION: avoid materializing intermediary data
To reduce IO, aggregation can also be implemented during the scan to:
• Minimize materialization of intermediary data.
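The projection and predicate push downs above can be sketched with a toy columnar layout. This is not the Parquet format or any library's reader: `make_row_group` and `scan` are hypothetical helpers, and a "row group" here is just a dict of column lists with per-column min/max statistics.

```python
def make_row_group(rows):
    # Store each column separately and keep min/max statistics per column.
    columns = {name: [row[name] for row in rows] for name in rows[0]}
    stats = {name: (min(vals), max(vals)) for name, vals in columns.items()}
    return {"columns": columns, "stats": stats}

def scan(row_groups, projection, filter_column, equals):
    rows, groups_read = [], 0
    for group in row_groups:
        low, high = group["stats"][filter_column]
        if not low <= equals <= high:
            continue  # predicate push down: skip the group using stats only
        groups_read += 1
        values = group["columns"][filter_column]
        matches = [i for i, v in enumerate(values) if v == equals]
        # Projection push down: only the requested columns are touched.
        rows.extend(
            {name: group["columns"][name][i] for name in projection}
            for i in matches
        )
    return rows, groups_read

groups = [
    make_row_group([{"a": 1, "b": "x", "c": 1}, {"a": 2, "b": "y", "c": 2}]),
    make_row_group([{"a": 3, "b": "x", "c": 5}, {"a": 4, "b": "y", "c": 6}]),
]
rows, groups_read = scan(groups, projection=["a", "b"], filter_column="c", equals=5)
```

The first row group is never decoded because its statistics for `c` exclude the value 5, and column `c` itself never appears in the output rows.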
The deconstructed database interchange: Apache Arrow
SELECT SUM(a) FROM t WHERE c = 5 GROUP BY b
[Diagram: Parquet files → Scanner (projection and predicate push downs; projection push down: read only a and b) → Arrow batches → Partial Agg → Shuffle → Agg → Result, with three Scanner / Partial Agg / Agg pipelines running in parallel]
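The partial aggregation, shuffle, and final aggregation stages of this query can be sketched as below. Plain Python dicts of column lists stand in for Arrow batches, the filter on `c = 5` is assumed to have been applied already by the scanner, and the function names and sample data are hypothetical.

```python
from collections import defaultdict

def partial_agg(batch):
    # Each scanner output is pre-aggregated locally: SUM(a) per value of b.
    sums = defaultdict(int)
    for a, b in zip(batch["a"], batch["b"]):
        sums[b] += a
    return dict(sums)

def shuffle(partials, num_workers):
    # Route every group key to exactly one final aggregator.
    buckets = [defaultdict(list) for _ in range(num_workers)]
    for partial in partials:
        for key, value in partial.items():
            buckets[hash(key) % num_workers][key].append(value)
    return buckets

def final_agg(bucket):
    # Combine the partial sums that arrived for each key.
    return {key: sum(values) for key, values in bucket.items()}

batches = [
    {"a": [1, 2, 3], "b": ["x", "y", "x"]},
    {"a": [10, 20], "b": ["y", "z"]},
]
partials = [partial_agg(batch) for batch in batches]
result = {}
for bucket in shuffle(partials, num_workers=2):
    result.update(final_agg(bucket))
```

Only one partial sum per key and per batch crosses the shuffle, rather than every row, which is the point of aggregating before exchanging data.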
Storage: Stream persistence
Open source projects: Incumbent, Interesting
Features
• State of consumer: how do we recover from failure
• Snapshot
• Decoupling reads from writes
• Parallel reads
• Replication
• Data isolation
Big Data infrastructure blueprint
Legend: Persistence, Processing, Metadata, UI
[Diagram of the data infrastructure: real time publishing and batch publishing through a Data API; stream persistence; stream processing; real time dashboards; schema registry; datasets metadata; data storage (S3, GCS, HDFS) (Parquet, Avro); interactive analysis; monitoring / observability; batch processing; periodic dashboards; scheduling / job deps; users: Analyst, Eng]
The Future
Still Improving
A better data abstraction
• A better metadata repository
• A better table abstraction: Netflix/Iceberg
• Common push down implementations (filter, projection, aggregation)
Better interoperability
• More efficient interoperability: continued Apache Arrow adoption
Better data governance
• Global data lineage
• Access control
• Protect privacy
• Record user consent
Some predictions
A common access layer
[Diagram, top to bottom: SQL (Impala, Drill, Presto, Hive, …), Batch (Spark, MR, Beam), ML (TensorFlow, …) → Distributed access service → Table abstraction layer (Schema, push downs, access control, anonymization) → File System (HDFS, S3, GCS) and Other storage (HBase, Cassandra, Kudu, …)]
Centralizes:
• Schema evolution
• Access control/anonymization
• Efficient push downs
• Efficient interoperability
A multi tiered streaming batch storage system
Batch-Stream storage integration: convergence of Kafka and Kudu
• Schema evolution
• Access control/anonymization
• Efficient push downs
• Efficient interoperability
[Diagram: Time based Ingestion API → 1) In memory row oriented store → 2) In memory column oriented store → 3) On disk column oriented store. A Stream consumer API and a Batch consumer API each offer Projection, Predicate, and Aggregation push downs, and each is marked "Mainly reads from here" against one of the tiers.]