Apache Calcite for Enabling SQL Access to NoSQL Data Systems such - - PowerPoint PPT Presentation

apache calcite for enabling sql access to nosql data
SMART_READER_LITE
LIVE PREVIEW

Apache Calcite for Enabling SQL Access to NoSQL Data Systems such - - PowerPoint PPT Presentation

Apache Calcite for Enabling SQL Access to NoSQL Data Systems such as Apache Geode Christian Tzolov Whoami Christian Tzolov Engineer at Pivotal, Big-Data, Hadoop, Spring Cloud Dataflow, Apache Geode, Apache HAWQ, Apache Committer, Apache


slide-1
SLIDE 1

Apache Calcite for Enabling SQL Access to NoSQL Data Systems such as Apache Geode

Christian Tzolov

slide-2
SLIDE 2

2

Christian Tzolov

Engineer at Pivotal, Big-Data, Hadoop, Spring Cloud Dataflow, Apache Geode, Apache HAWQ, Apache Committer, Apache Crunch PMC member

Whoami

Disclaimer This talk expresses my personal opinions. It is not read or approved by Pivotal and does not necessarily reflect the views and opinions of Pivotal nor does it constitute any official communication of Pivotal. Pivotal does not support any of the code shared here.

ctzolov@pivotal.io blog.tzolov.net twitter: @christzolov https://nl.linkedin.com/in/tzolov

slide-3
SLIDE 3

3

Big Data Landscape 2016

  • Volume
  • Velocity
  • Varity
  • Scalability
  • Latency
  • Consistency vs. Availability (CAP)
slide-4
SLIDE 4

4

  • {Old | New} SQL
  • Custom APIs

– Key / Value – Fluent APIs – REST APIs

  • {My} Query Language

Unified Data Access? At What Cost?

Data Access

slide-5
SLIDE 5

5

  • Apache Apex
  • Apache Drill
  • Apache Flink
  • Apache Hive
  • Apache Kylin
  • Apache Phoenix
  • Apache Samza
  • Apache Storm
  • Cascading
  • Qubole Quark
  • SQL-Gremlin

SQL?

  • Apache Geode
slide-6
SLIDE 6

Geode Adapter - Overview

Geode A API a and nd O OQL SQL SQL/JDBC/ JDBC/ODBC ODBC Conv nvert S SQL r relationa nal expressions ns i int nto O OQL q queries Geode Adapter (Geode Client) Geode Server Geode Server Geode Server Data Data Data Push d down t n the r relationa nal expressions ns s supported b by G Geode OQL a and nd f falls b back ck t to t the C Calci cite Enu numerable A Adapter f for t the r rest Enumerable Adapter Apache Calcite Spring Data Geode Spring ng D Data A API f for int nteract cting ng w with G Geode Parse S SQL, co conv nverts i int nto relationa nal e expression a n and nd

  • ptimizes
slide-7
SLIDE 7

SQL Relational Expressions

7

Scan Scan Join Filter Project Customer [c] BookOrder [b] (on customerNumber) (b.totalPrice > 0) (c.firstName, b.totalPrice)

SELECT b."totalPrice", c."firstName” FROM "BookOrder" as b INNER JOIN "Customer" as c ON b."customerNumber" = c."customerNumber” WHERE b."totalPrice" > 0;

Scan Scan Join Project Customer [c] BookOrder [b] (on customerNumber) (totalPrice > 0) (c.firstName, b.totalPrice) Project (firstName, customerNumber) Filter (totalPrice, customerNumber) Project

  • ptimize
slide-8
SLIDE 8

Geode Push Down Candidates

8

Relational Operator Geode Support LIMIT YES (without OFFSET) PROJECT YES FILTER YES JOIN For collocated Regions only AGGREGATE YES for GROUP BY, DISTINCT, MAX, MIN, SUM, AVG, COUNT http://bit.ly/2eKApd0 SORT YES

slide-9
SLIDE 9

Apache Geode? “… in-memory, distributed database with strong consistency built to support low latency transactional applications at extreme scale”

slide-10
SLIDE 10

Why Apache Geode?

10

5,700 train stations 4.5 million tickets per day 20 million daily users 1.4 billion page views per day 40,000 visits per second 7,000 stations 72,000 miles of track 23 million passengers daily 120,000 concurrent users 10,000 transactions per minute

https://pivotal.io/big-data/case-study/distributed-in-memory-data-management-solution https://pivotal.io/big-data/case-study/scaling-online-sales-for-the-largest-railway-in-the-world-china-railway-corporation

China Railway

slide-11
SLIDE 11

11

  • In-Memory Data Storage

– Over 100TB Memory – JVM Heap + Off Heap

  • Any Data Format

– Key-Value/Object Store

  • ACID and JTA Compliant

Transactions

  • HA and Linear Scalability
  • Strong Consistency
  • Streaming and Event Processing

– Listeners – Distributed Functions – Continuous OQL Queries

  • Multi-site / Inter-cluster
  • Full Text Search (Lucene indexes)
  • Embedded and Standalone
  • Top Level Apache Project

Apache Geode Features

slide-12
SLIDE 12

Apache Geode Concepts

Cache Server (member) Cache Region 1 Region N

Val Ke y v1 k1 v2 k2

Cach che - In-memory collection

  • f Regi

Regions Region - n - consistent, di distr stributed uted Ma Map (key-value), Partitioned or Replicated Cach cheServer – proce cess connected to the distributed system with created Cach che

Client (member) Locator (member)

Client nt –read and modify the content of the distributed system Loca cator – tracks system members and provides membership information

… Listeners Functions

Funct nctions ns – distributed, concurrent data processing Listene ner – event handler. Registers for one or more events and notified when they occur

slide-13
SLIDE 13

Geode Topology

Cache Server Cache Server Cache Server Cache Data Cache Data Cache Data

Peer-to-Peer

Cache Server Cache Server Cache Server Cache Data Cache Data Cache Data Client Local Cache

pool

Client-Server

Cache Server Cache Server Gateway Sender … Cache Server Gateway Receiver Cache Server Cache Server Cache Data Cache Data Cache Data Cache Data Gateway Receiver Cache Server … Gateway Sender Cache Server Cache Server Cache Data Cache Data Cache Data Cache Data

WAN Multi-site Boundary

Multi-Site

slide-14
SLIDE 14

Geode Client API

  • Client Cache
  • Key / Value - Region GET, PUT, REMOVE
  • OQL – QueryService
slide-15
SLIDE 15

Geode Data Types & Serialization

  • Key-Value with complex value formats
  • Portable Data eXchange (PDX) Serialization – Delta propagation, schema

evolution, polyglot support …

  • Object Query Language (OQL)

SELECT p.name FROM /Person p WHERE p.pet.type = “dino”

{ id: 1, name: “Fred”, age: 42, pet: { name: “Barney”, type: “dino” } }

single field deserialization nested fields

slide-16
SLIDE 16

Geode Demo (GFSH and OQL)

  • Connect to Geode cluster,
  • List available Regions
  • Run OQL query
slide-17
SLIDE 17

Apache Calcite?

Java framework that allows SQL interface and advanced query optimization, for virtually any data system

  • Query Parser, Validator and Optimizer(s)
  • JDBC drivers - local and remote
  • SQL Streaming
  • Agnostic to data storage and processing
  • SQL completes vs. NoSQL integrity
slide-18
SLIDE 18

Calcite Data Types

  • Catalog – namespaces accessed in queries
  • Schema - collection of schemas and tables
  • Table - single data set, collection of rows
  • RelDataType – SQL fields types in a Table

Your Data System Data System Data Types Calcite Schema SQL Engine Table Table

JDBC

Table …

Data Type Mapping SELECT title, author FROM test.BookMaster

Data Type Fields Schema Table

slide-19
SLIDE 19

Calcite Data Types: RelDataType

19

Type of a scalar expression or row

  • RelDataTypeFactory – RelDataType factory
  • JavaTypeFactory - registers Java classes as record types
  • JavaTypeFactoryImpl - Java Reflection to build RelDataTypes
  • SqlTypeFactoryImpl - Default implementation with all SQL types
slide-20
SLIDE 20

Geode to Calcite Data Types Mapping

20

Geode Cache Region 1 Region K

Val Key v1 k1 v2 k2

… Calcite Schema Table 1 Table K

Col1 Col2 ColN V(M,1) RowM V(M,2) V(M,N) V(2,1) Row2 V(2,2) V(2,N) V(1,1) Row1 V(1,2) V(1,N)

… Regions are mapped into Tables Geode Cache is mapped into Calcite Schema Geode Key/Value is mapped into Table Row Create Column Types (RelDataType) from Geode Value class (JavaTypeFact ctoryImpl)

slide-21
SLIDE 21

Calcite Bootstrap Flow

21

Typical calcite initialization flow

Model (JSON) SchemaFactory Schema Tables

Creates Creates Conf nfigures C Calci cite Creates

slide-22
SLIDE 22

Calcite Model

{ version: '1.0', defaultSchema: 'TEST', schemas: [ { name: 'TEST', type: 'custom', factory: 'org.apache.calcite.adapter.geode.simple.GeodeSchemaFactory',

  • perand: {

locatorHost: 'localhost', locatorPort: '10334', regions: 'BookMaster', pdxSerializablePackagePath: 'net.tzolov.geode.bookstore.domain.*' } }] }

Reference to your adapter schema factory implementation class Parameters to be passed to your adapter schema factory implementation The p path t to < <my-model>.json json i is p passed a as J JDBC co conne nnect ction a n argument nt: :

!connect jdbc:calcite:model=target/test-classes/<my-model-path>.json︎

Schema Name

Model SchemaFactory Schema Tables

slide-23
SLIDE 23

Geode Calcite Schema and Schema Factory

public class GeodeSchemaFactory implements SchemaFactory { public Schema create(SchemaPlus parentSchema, String schemaName, Map<String, Object> operand) { String locatorHost = (String) operand.get(“locatorHost”); int locatorPort = … String[] regionNames = … String pdxPackagePath = … return new GeodeSchema(locatorHost, locatorPort, regionNames, pdxPackagePath); } } public class GeodeSchema extends AbstractSchema { private String regionName = .. protected Map<String, Table> getTableMap() { final ImmutableMap.Builder<String, Table> builder = ImmutableMap.builder(); Region region = … Get Geode Region by region name … Class valueClass= … Find region’s value type … builder.put(regionName, new GeodeScannableTable(regionName, valueClass, clientCache)); return tableMap; }

Retrieves the parameters set in the model.json Create an Adapter Schema instance with the provided parameters. Create GeodeScannableTable instance for each Geode Region

Model SchemaFactory Schema Tables

slide-24
SLIDE 24

Geode Scannable Table

public class GeodeScannableTable extends AbstractTable implements ScannableTable { public RelDataType getRowType(RelDataTypeFactory typeFactory) { return new JavaTypeFactoryImpl().createStructType(valueClass); } public Enumerable<Object[]> scan(DataContext root) { return new AbstractEnumerable<Object[]>() { public Enumerator<Object[]> enumerator() { return new GeodeEnumerator<Object[]>(clientCache, regionName); } } public class GeodeEnumerator<E> implements Enumerator<E> { private E current; private SelectResults geodeIterator; public GeodeEnumerator(ClientCache clientCache, String regionName) { geodeterator = clientCache.getQueryService().newQuery("select * from /" + regionName).execute().iterator(); } public boolean moveNext() { current = convert(geodeIterator.next()); return true;} public E current() {return current;} public abstract E convert(Object geodeValue) { Convert PDX value into RelDataType row }

Defined in the Linq4j sub-project Retrieves the entire Region!! Converts Geode value response into Calcite row data Uses reflection (or pdx-instance) to builds RelDataType from value’s class type Returns an Enumeration over the entire target data store

Model SchemaFactory Schema Tables

slide-25
SLIDE 25

Geode Demo (Scannable Tables)

$ ./sqlline ︎ sqlline> !connect jdbc:calcite:model=target/test-classes/model2.json admin admin︎ ︎ jdbc:calcite> !tables ︎ jdbc:calcite> SELECT * FROM "BookMaster”;︎ jdbc:calcite> SELECT "yearPublished", AVG("retailCost") AS “AvgRetailCost” FROM "BookMaster" GROUP BY "yearPublished";︎ jdbc:calcite> SELECT b."totalPrice", c."firstName” FROM "BookOrder" AS b INNER JOIN "Customer" AS c ON b."customerNumber" = c."customerNumber” WHERE b."totalPrice" > 0;︎ ︎ ︎

Without and With Implementation

slide-26
SLIDE 26

Non-Relational Tables

26

Scanned without intermediate relational expression.

  • ScannableTable - can be scanned
  • FilterableTable - can be scanned, applying supplied filter expressions
  • ProjectableFilterableTable - can be scanned, applying supplied filter expressions

and projecting a given list of columns

Enumerable<Object[]> scan(DataContext root, List<RexNode> filters, int[] projects); Enumerable<Object[]> scan(DataContext root, List<RexNode> filters); Enumerable<Object[]> scan(DataContext root);

slide-27
SLIDE 27

Calcite Ecosystem

27

Several “semi-independent” projects.

JDBC and Avatica Linq4j Expression Tree Enumerable Adapter Relational

  • Relational Expressions
  • Row Expression
  • Optimization Rules
  • Planner …

SQL Parser & AST

Port of LINQ (Language-Integrated Query) to Java. Local and Remote JDBC driver Converts SQL queries Into AST (SqlNode …)

3rd party Adapters

Method for translating executable code into data (LINQ/MSN port) Default (In-memory) Data Store Adapter implementation. Leverages Linq4j Relational Algebra, expression,

  • ptimizations …

Interpreter

Complies Java code generated from linq4j “Expressions”. Part of the physical plan implementer

slide-28
SLIDE 28

Calcite SQL Query Execution Flow

28

Enumerable Interpreter Prepare SQL, Relational, Planner Geode Adapter Binder JDBC Geode Cluster

1 2 3 4 5 6 7 7 7

  • 2. Parse SQL, convert to rel.
  • expressions. Va

Valid lidate te and Opti Optimi mize ze them

  • 3. Start building a physical plan from

the relation expressions

  • 4. Implement the Geode relations and

encode them as Expression t n tree

  • 5. Pass the Expression tree to the

Interpreter to generate Java code

  • 6. Generate and Compile a Binder

instance that on ‘bind()’ call runs Geodes’ query method

  • 1. On new SQL query JD

JDBC BC delegates to Pr Prepar epare to prepare the query execution

  • 7. JDBC uses the newly compiled

Binder to perform the query on the Geode Cluster

Calcite Framework Geode Adapter

2

slide-29
SLIDE 29

Linq4j and Expression Tree

29

(Node) Visitor Node

+ accept(Visitor) + evaluate(Node)

Expression Statement Block Condition For Goto Label Switch Throw Try While Binary Constant MethodCall Parameter Member … … Queryable Enumberable QueryProvider

slide-30
SLIDE 30

Bindable Generated Code

30

Calcite via Enumerable Converts Expressions into Java Code

slide-31
SLIDE 31

Calcite Relational Expressions

31

RelNode Relationa nal expression TableScan Project Filter Aggregate Join Intersect Sort RexlNode Ro Row-level expressions

Project, Sort field fields s Filter, Join co cond nditions ns

Input Column Ref Literal Struct field access Function call Window expressions

*

RelTrait

*

Physica cal attribute

  • f a relation
slide-32
SLIDE 32

Calcite Relational Expressions (2)

32

RelNode

+ register(RelOptPlander) + List<RelNode> getInputs();

RelOptPlanner

+findBestExp():RelNode

RexNode RelTrait Convention NONE

* *

EnumberableConvention RelOptRule

+ onMatch(call)

<<register>> <<create>>

MyDBConvention ConverterRule

+ RelNode convert(RelNode)

Converts from one calling convention to another

Convertor

Indicate that it converts a physical attribute only! <<rules>>

*

<<inputs>>

*

<<root>> Query optimizer: Transforms a relational expression according to a given set of rules and a cost model.

RelOptCluster

Rule transforms an expression into another. It has a list of Operands, which determine whether the rule can be applied to a particular section of the tree.

RelOptRuleOperand

*

<<fire criteria>> Calling convention used to represent a single data source. Inputs to a relational expression must be in the same convention

slide-33
SLIDE 33

Calcite Adapter Patterns

33

MyAdapterRel

+ implement(implContext)

MyAdapterConvention Convention.Impl(“MyAdapter”)

Common interface for all MyAdapter Relation Expressions. Provides implementation callback method called as part of physical plan implementation

ImplContext

+ implParm1 + implParm2 …

RelNode MyAdapterTable

+ toRel(optTable) + asQueryable(provider,…)

MyAdapterQueryable

+ myQuery(params) : Enumetator

TranslatableTable

<<instance of>>

AbstractQueryableTable AbstractTableQueryable

<<create>> Can convert queries in Expression myQuer myQuery() implements the call to your DB It is called by the auto generated code. It must return an Enumberable instance

MyAdapterScan

+ register(planer) {

Registers all MyAdapter Rules

}

<<create>>

MyAdapterToEnumerableConvertorRule

  • perands: (RelNode.class,

MyAdapterConvention, EnumerableConvention)

ConverterRue TableScan MyAdapterToEnumerableConvertor

+ implement(EnumerableRelImplementor) {

ctx = new MyAdapterRel.ImplContext() getImputs().implement(ctx) return BlockBuild.append( MY_QUERY_REF, Expressions.constant(ctx.implParms1), Expressions.constant(ctx.implParms2) …

EnumerableRel ConvertorImpl

<<create on match >>

MyAdapterProject MyAdapterFilter MyAdapterXXX RelOptRule MyAdapterProjectRu MyAdapterFilterRule MyAdapterXXXRule

<<create on match >> Recursively call the implement on each MyAdapter Relation Expression Encode the myQuery(params) call as Expressions

MY_QUERY_REF = Types.lookupMethod( MyAdapterQueryable.class, ”myQuery”, String.class String.class);

1 3 4 5 2 6 7 8 9 Calcite Framework MyAdapter components

slide-34
SLIDE 34

Calcite with Geode - Without Implementation

34

SELECT b."totalPrice", c."firstName" FROM "BookOrder" as b INNER JOIN "Customer" as c ON b."customerNumber" = c."customerNumber" WHERE b."totalPrice" > 0;

slide-35
SLIDE 35

Calcite with Geode - With Implementation

35

SELECT b."totalPrice", c."firstName" FROM "BookOrder" as b INNER JOIN "Customer" as c ON b."customerNumber" = c."customerNumber" WHERE b."totalPrice" > 0;

slide-36
SLIDE 36

Calcite JDBC Connection

36

slide-37
SLIDE 37

What About Testing?

37

slide-38
SLIDE 38

TODO

38

  • Improve nested data structures support
  • Push down Join for collocated data sets
  • Push down the COUNT expression
  • Beyond OQL (e.g. implement Join, aggregations with custom functions)
  • Leverage Calcite Streaming with Geode
  • Transaction Support
  • Table Statistics based on Region statistics
  • Benchmarks and estimate the Calcite SQL overhead compared to pure OQL
slide-39
SLIDE 39

References

39

  • Apache Geode Adapter for Apache Calcite: https://github.com/tzolov/calcite
  • Introduction to Apache Calcite (2016) : http://bit.ly/2fB1iBz
  • Apache Calcite Overview (2014) : http://bit.ly/2fMJgbS
  • Introduction to Apache Geode (2016) : http://bit.ly/1Rfztbd
  • Apache Calcite Project (2016) : https://calcite.apache.org
  • Apache Geode Project (2016) : http://geode.apache.org
  • Geode Object Query Language (OQL) : http://bit.ly/2eKywgp
  • Expression Tree Basic: http://bit.ly/2flBiXH
slide-40
SLIDE 40

Credits

40

  • To Dan Baskette for suggesting and motivating the project
  • To Apache Geode and Apache Calcite for the inspiring projects and for the

productive discussions

  • To Pivotal for letting me work on projects like this
slide-41
SLIDE 41