Data Access for Data Science, April 17, 2018, Jacques Nadeau

Jacques Nadeau
Co-Founder & CTO, Dremio
PMC Chair, Apache Arrow
PMC, Apache Calcite

Agenda
• Apache Arrow
• Using Dremio for Self Service Data Access
• Data Access Example (notebook + Dremio)
• Reflections & Caching Overview
• Caching Impact Example

Getting Data Ready for Analysis Is Hard
• Data can be hard to find
• Many modern data systems have poor quality interfaces
• Data is rarely in a single system
• Data access is frequently slow
• Some types of issues can only be solved by IT tickets
• Doing late stage data curation makes reproduction and collaboration difficult: “do I copy and edit?”
There should be a new, self-service data access tier.

Apache Arrow

Apache Arrow
• Standard for columnar in-memory processing and transport
• Focused on Columnar In-Memory Analytics
  1. 10-100x speedup on many workloads
  2. Common data layer enables companies to choose best of breed systems
  3. Designed to work with any programming language
  4. Support for both relational and complex data
• Consensus Driven: developed by contributors leading 13+ key OSS projects

Arrow: Fast Exchange, Fast Processing
Focus on GPU and CPU Efficiency
• Cache locality
• Super-scalar and vectorized operation
• Minimal structure overhead
• Constant value access
High Performance Sharing & Interchange
• Zero Overhead Encoding
• Scatter/Gather Optimized
• Direct Memory definition
• Designed for RDMA and shared memory access
[Figure: Traditional Memory compared with Arrow Memory]
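
The "cache locality" and "constant value access" points can be illustrated with a toy columnar layout. This is a minimal sketch using only the Python standard library, not Arrow's API: the sample records and the helper names `value_at` and `column_sum` are hypothetical, and `array.array` merely stands in for a contiguous, fixed-width column buffer.

```python
from array import array

# Hypothetical sample records (not from the talk).
rows = [
    {"id": 1, "price": 9.5},
    {"id": 2, "price": 12.0},
    {"id": 3, "price": 7.25},
]

# Columnar layout: one contiguous, fixed-width buffer per field.
ids = array("q", (r["id"] for r in rows))        # 64-bit signed integers
prices = array("d", (r["price"] for r in rows))  # 64-bit floats


def value_at(column, i):
    """Constant-time access: a fixed-width value sits at offset i * itemsize."""
    return column[i]


def column_sum(column):
    """Scan one contiguous buffer without touching the other fields."""
    return sum(column)


total = column_sum(prices)
third_id = value_at(ids, 2)
```

Summing `prices` reads only the price buffer, which is the access pattern a columnar format is designed for; a row-oriented layout would have to step over every `id` along the way.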

Arrow Components
• Core Libraries
• Building Blocks
• Major Integrations

Arrow: Core Libraries
• Java Library
• C++ Library
• Python Library
• C Library
• Ruby Library
• JavaScript Library
• Rust Library

Arrow Building Blocks (in project)
• Plasma: Shared memory caching layer, originally created in Ray
• Arrow RPC*: RPC/IPC interchange library (active development)
• Arrow Kernels*: Common data manipulation components
• Feather: Fast ephemeral format for movement of data between R/Python
*soon

Arrow Integrations
• Dremio: OSS project, Sabot Engine executes entirely on Arrow memory
• Pandas: Move seamlessly to/from Arrow as a means for communication, serialization, fast processing
• Parquet: Read and write Parquet quickly to/from Arrow. C++ library builds directly on Arrow.
• Spark: Supports conversion to Pandas via Arrow construction using Arrow Java Library
• GOAI (GPU Open Analytics Initiative): Leverages Arrow as internal representation (including libgdf and GPU dataframe)

Apache Arrow Adoption
• Arrow downloads increased 44x since April (currently ~100K per month)
[Chart: Monthly PyPI downloads (~40% of all downloads)]

Dremio: a system for self-service data access

About Dremio
• Built on Apache Arrow, Apache Calcite, Apache Parquet
• Launched in July 2017
• Self-Service Data Platform
• Make Data Accessible to whatever tool
• The Narwhal’s name is Gnarly
• Apache-Licensed
• Easy extension, customization and enterprise flexibility
• SDKs for sources, functions, file formats, security
• Execution, Input and Output are all built on native Arrow

Google Docs for your Data
• Powerful & Intuitive UX for Data: Find, manage and share data regardless of size & location
• Live Data Curation: AI-powered curation of data without creating a single copy

Self-Service Data Access Platform
• SQL Data Caching: Data access at interactive speed, without cubes or BI extracts
• Data Catalog: Data Discovery, Security and Personal Data Assets
• Data Access: RDBMS, MongoDB, Elasticsearch, Hadoop, S3, NAS, Excel, JSON
• Data Curation: Wrangle, prepare, enrich any source without making copies of your data

Data Access Example

Leveraging Underlying Source Capabilities Example

Reflections: an advanced form of caching

Access isn’t Enough: Reducing Distance to Data
Distance to Data:
• Work to Be Done
• Resources Required
• Time to Complete
[Figure: the distance to data separating the raw data from "what you want"]

The basic concept behind a relational cache
• Maintain derived data that is between what you want and the raw data
• Shortens distance to data (DTD)
• Reduces resource requirements & latency
• Materialization can be derived from raw data via arbitrary operator DAG
[Figure: a reflection sits between the raw data and "what you want"; the new DTD is shorter than the original DTD, and the difference is the cost reduction]
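
The idea of derived data that sits between the raw data and the question can be sketched with the standard-library `sqlite3` module. This is an illustration of the concept only, not how Dremio stores or maintains reflections: the tables `raw_sales` and `sales_by_region_product` and their contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_sales (region TEXT, product TEXT, amount INTEGER);
    INSERT INTO raw_sales VALUES
        ('east', 'a', 10), ('east', 'b', 5),
        ('west', 'a', 7),  ('west', 'a', 3);
    -- Derived data kept "between" the raw data and the question being asked.
    CREATE TABLE sales_by_region_product AS
        SELECT region, product, SUM(amount) AS amount
        FROM raw_sales GROUP BY region, product;
""")

# The same question, answered from either source.
question = "SELECT region, SUM(amount) FROM {source} GROUP BY region ORDER BY region"
from_raw = conn.execute(question.format(source="raw_sales")).fetchall()
from_cache = conn.execute(question.format(source="sales_by_region_product")).fetchall()

# The derived table has fewer rows to scan than the raw table.
raw_rows = conn.execute("SELECT COUNT(*) FROM raw_sales").fetchone()[0]
cached_rows = conn.execute("SELECT COUNT(*) FROM sales_by_region_product").fetchone()[0]
```

Both sources give the same answer, but the derived table is smaller, so less work remains between it and the result. On real data the gap between the two row counts is where the resource and latency savings would come from.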

It doesn’t have to be a trivial relationship…
[Figure: two reflections (Reflection 1, Reflection 2) sit between the raw data and the three "what you want" targets; the new DTD is shorter than the original DTD, giving a cost reduction]

You already do this today (manually)!!
Materializations (manually created):
• Cleansed
• Partitioned by region or time
• Summarized for a particular purpose
Users choose depending on need:
• Data Scientists & Analysts trained to use different tables depending on the use case
• Custom datasets, summarization and/or extraction for modeling, reports and dashboards

Dremio can make the decisions so you don’t have to
Copy-and-pick:
• Logical Model: Data Scientist picks best optimization
• Physical Optimizations (transform, sort, partition, aggregate): Data Engineer designs and maintains
• Source Table
Reflections:
• Logical Model: Dremio picks best optimization
• Physical Optimizations (reflections): Dremio designs and maintains
• Source Table
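
What "picking the best optimization" means can be sketched as a tiny selection routine. This is a hypothetical illustration, not Dremio's planner: the `Candidate` class, the `pick_source` function, the dataset names, and the use of row count as the only cost measure are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    columns: frozenset  # columns available from this source
    row_count: int      # stand-in for the cost of scanning it


def pick_source(needed, candidates):
    """Return the cheapest candidate that still contains every needed column."""
    usable = [c for c in candidates if needed <= c.columns]
    if not usable:
        raise ValueError("no candidate covers the requested columns")
    return min(usable, key=lambda c: c.row_count)


candidates = [
    Candidate("source_table", frozenset({"region", "day", "product", "amount"}), 1_000_000),
    Candidate("reflection_by_region_day", frozenset({"region", "day", "amount"}), 5_000),
    Candidate("reflection_by_region", frozenset({"region", "amount"}), 50),
]

choice_region = pick_source(frozenset({"region", "amount"}), candidates)
choice_product = pick_source(frozenset({"product", "amount"}), candidates)
```

A query that needs only `region` and `amount` lands on the smallest reflection, while a query that needs `product` falls back to the source table. The person writing the query never has to know which physical table was used.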

Cache Matching: Example Scenarios
• Aggregation Rollup
• Join/Agg Transposition
• Costing & Partitioning
[Figure: for each scenario, operator trees for the User Query, the Reflection Definition, and the Alternative Plan, marking the Target and the Materialization. The trees are built from nodes such as S(t1), S(r1), A(a, b, sum(c)), A(a, sum(c) as c’), F(c’ < 10), P(a, c), Join(t1.id = t2.id), and partitioning by a or b with pruning on a.]
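
The aggregation rollup scenario can be sketched in plain Python. The reading of the slide's notation is an assumption: `S` is taken as a scan, `A` as a group-by aggregate, and `F` as a filter, with `r1` as the stored reflection over `t1`. The sample rows and the helper names `aggregate` and `user_query` are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows of t1 as (a, b, c) tuples.
t1 = [("x", "p", 4), ("x", "q", 3), ("y", "p", 20), ("x", "p", 1)]


def aggregate(rows, key_len):
    """Group by the first key_len fields and sum the last field."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[:key_len]] += row[-1]
    return [key + (total,) for key, total in sorted(totals.items())]


# Reflection definition: A(a, b, sum(c)) over S(t1), stored as r1.
r1 = aggregate(t1, key_len=2)


def user_query(rows):
    """F(c' < 10) over A(a, sum(c) as c')."""
    return [(a, c) for a, c in aggregate(rows, key_len=1) if c < 10]


direct = user_query(t1)       # plan that scans the raw table
alternative = user_query(r1)  # plan that scans the reflection instead
```

The rollup is valid because a sum of partial sums equals the overall sum, so re-aggregating the finer-grained reflection by `a` gives the same totals as aggregating the raw rows. The same substitution would not be safe for aggregates that cannot be recombined this way, such as a distinct count.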

Reflections
• A reflection is a materialization designed to accelerate operations
• Transparent to data consumers
• Not required on day 1… you can add reflections at any time
• One reflection can help accelerate queries on thousands of different virtual datasets (logical definitions)
• Reflections are persisted (S3, HDFS, local disks, etc.) so there’s no memory overhead
• Columnar on disk (Parquet) and Columnar in memory (Arrow)
• Elastic, scales to 1000+ nodes

Reflection Impact Example

In conclusion

Distribution of Responsibilities
Data Access Platform:
• Index, secure, expose, share and curate datasets
• Expose data from different systems in a standard namespace
• Allow live cleanup and curation capabilities
• Data manipulation that should be reproducible and shared
• Disconnect physical concerns from logical needs
• Cache intermediate results to accelerate common user patterns
• Get to an interesting slice of data
BYO Data Science & BI Solutions:
• Analyze Data
• Experiment and perform what-if analysis
• Derive Conclusions
• Build Models
• … and everything else that results in an output that isn’t a dataset

Self-Service Data Access
• SQL Data Caching: Data access at interactive speed, without cubes or BI extracts
• Data Catalog: Data Discovery, Security and Personal Data Assets
• Data Access: RDBMS, MongoDB, Elasticsearch, Hadoop, S3, NAS, Excel, JSON
• Data Curation: Wrangle, prepare, enrich any source without making copies of your data

Join the Community!
Come see me for Office hours!
• Download: dremio.com/download
• GitHub: github.com/dremio/dremio-oss, github.com/apache/arrow
• Dremio Community: community.dremio.com
• Arrow Mailing list: dev@arrow.apache.org
• Twitter: @intjesus, @DremioHQ, @ApacheArrow
