PRESTO TO PRESTO: DISTRIBUTED QUERIES IN HYBRID DATALAKE

SLIDE 1

PRESTO TO PRESTO: DISTRIBUTED QUERIES IN HYBRID DATALAKE

Sajumon “Saj” Joseph, Principal Architect, Comcast

Agenda:

  • Query distribution challenge
  • Presto-to-Presto (P2P) connector
  • P2P challenges
  • P2P best practices
  • P2P future work

SLIDE 2

DATA ACCESS: 2018/EARLY 2019

[Diagram: data access paths. Labels: Teradata SQL; Hadoop; Teradata; Query Grid; ANSI SQL; Oracle; Cassandra; SQL Server; MongoDB; AWS S3]

SLIDE 3

CURRENT ARCHITECTURE CHALLENGES

  • Cloud native apps cannot access on-premise datasets
  • Distributed queries and compute
  • Network connectivity
  • Presto upgrades, version management

SLIDE 4

DATA ACCESS: 2019

[Diagram: data access paths. Labels: Teradata SQL; Hadoop; Teradata; Query Grid; ANSI SQL; Oracle; Cassandra; SQL Server; MongoDB]

SLIDE 5

DATA ACCESS: 2019

[Diagram: data access paths. Labels: Hadoop; Teradata; ANSI SQL; AWS S3; Aurora DB]

SLIDE 6

QUERY FABRIC – DISTRIBUTED QUERIES SOLUTION

  • Federated access
  • Secure access
  • Cost savings

SLIDE 7

PRESTO TO PRESTO (P2P) CONNECTOR

  • Based on JDBC, treats remote Presto cluster as a JDBC data source
  • Uses pre-established trust between Presto clusters for authentication
  • Adopts current filtering capabilities available for JDBC connectors

SLIDE 8

PRESTO-TO-PRESTO ON-PREMISE USER

[Diagram labels: Presto single endpoint; On-premise Presto; Cloud Presto; P2P; datalake1; datawarehouse; datalake2; Teradata (On-premise); Hadoop (On-premise); AWS (S3)]

SLIDE 9

PRESTO-TO-PRESTO CLOUD USER

[Diagram labels: Presto single endpoint; On-premise Presto; Cloud Presto; P2P (shown twice); datalake1; datawarehouse; datalake2; Teradata (On-premise); Hadoop (On-premise); AWS (S3)]

SLIDE 10

PRESTO-TO-PRESTO SECURE CONNECTION FLOW

[Diagram labels: Presto single endpoint; On-premise Presto; Cloud Presto; steps numbered 1 to 5]

Step 1: User authenticates in the on-premise cluster and submits a query.
Step 2: The on-premise cluster connects to the Cloud Presto cluster, passing its client certificate and the user identity.
Step 3: User access to the dataset is validated by checking the user identity against Ranger.
Step 4: Cloud Presto runs the query.
Step 5: Results are returned.
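The five steps can be modeled in a few lines of Python. This is only a toy sketch of the control flow, not the connector's code: the class names, the certificate CN `onprem-presto`, the table name, and the in-memory policy dictionary (standing in for Ranger) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CloudCluster:
    """Toy stand-in for the Cloud Presto cluster (hypothetical model)."""
    trusted_client_cns: set   # certificate CNs accepted from peer clusters
    policies: dict            # user -> set of readable tables; stands in for Ranger
    tables: dict = field(default_factory=dict)  # table name -> rows

    def handle(self, client_cn, user, table):
        # Step 2: accept the connection only from a trusted client certificate.
        if client_cn not in self.trusted_client_cns:
            raise PermissionError("untrusted client certificate: " + client_cn)
        # Step 3: validate the delegated user identity against the policy store.
        if table not in self.policies.get(user, set()):
            raise PermissionError(user + " may not read " + table)
        # Steps 4 and 5: run the query and return the results.
        return list(self.tables.get(table, []))


@dataclass
class OnPremCluster:
    """Toy stand-in for the on-premise cluster that users log in to."""
    client_cn: str
    remote: CloudCluster
    known_users: set

    def query(self, user, table):
        # Step 1: the user authenticates locally before submitting a query.
        if user not in self.known_users:
            raise PermissionError("unknown user: " + user)
        # Step 2: forward the cluster certificate CN plus the user identity.
        return self.remote.handle(self.client_cn, user, table)


cloud = CloudCluster(
    trusted_client_cns={"onprem-presto"},
    policies={"alice": {"datalake2.events"}},
    tables={"datalake2.events": [("e1",), ("e2",)]},
)
onprem = OnPremCluster(client_cn="onprem-presto", remote=cloud,
                       known_users={"alice", "bob"})
rows = onprem.query("alice", "datalake2.events")
```

The point of the model is that the remote cluster makes two separate checks: it trusts the calling cluster by certificate, and it authorizes the end user by the identity that cluster passes along.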

SLIDE 11

PRESTO-TO-PRESTO CONNECTOR CLIENT CONFIGURATION

etc/catalog/datalake1.properties

connector.name=presto
connection-url=jdbc:presto://<host>:<port>/<catalog>
presto.SSL=true
presto.SSLTrustStorePath=<path of public certificate of remote cluster>
presto.SSLKeyStorePath=<path of client java keystore>
presto.SSLKeyStorePassword=<client java keystore password>
presto.clientTags=<client tags>
unsupported-type-handling=CONVERT_TO_VARCHAR
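When several remote clusters are registered, a catalog file like this one could be generated rather than written by hand. The sketch below is a minimal illustration: the property keys are the ones on the slide, while the function name and every example value (host, port, paths, password, tags) are hypothetical. It assumes multiple client tags are joined with commas.

```python
def render_p2p_catalog(host, port, catalog, truststore, keystore,
                       keystore_password, client_tags):
    """Return the text of a P2P catalog properties file (keys as on the slide)."""
    props = [
        ("connector.name", "presto"),
        ("connection-url", f"jdbc:presto://{host}:{port}/{catalog}"),
        ("presto.SSL", "true"),
        ("presto.SSLTrustStorePath", truststore),
        ("presto.SSLKeyStorePath", keystore),
        ("presto.SSLKeyStorePassword", keystore_password),
        # Assumption: several tags are written as one comma-separated value.
        ("presto.clientTags", ",".join(client_tags)),
        ("unsupported-type-handling", "CONVERT_TO_VARCHAR"),
    ]
    return "\n".join(f"{key}={value}" for key, value in props) + "\n"


# Hypothetical values, for illustration only.
text = render_p2p_catalog(
    host="cloud-presto.example.com",
    port=8443,
    catalog="hive",
    truststore="/etc/presto/remote-cluster.pem",
    keystore="/etc/presto/client.jks",
    keystore_password="changeit",
    client_tags=["p2p", "onprem"],
)
```

In practice the keystore password would come from a secret store rather than a literal in code.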

SLIDE 12

PRESTO-TO-PRESTO REMOTE CLUSTER CONFIGURATION

On Coordinator node:

etc/config.properties

http-server.authentication.type=JWT,PASSWORD,CERTIFICATE,KERBEROS
http-server.https.truststore.path=<path of trust store containing client cluster certificate>

SLIDE 13

PRESTO-TO-PRESTO – REMOTE CLUSTER CONFIGURATION

On Coordinator node:

etc/rules.json

{
  "principals": [
    {
      "principal": "CN=<client CN>.*",
      "user": "(.*)",
      "allow": true
    }
  ]
}

More details on Principal rules: https://prestosql.io/docs/current/security/built-in-system-access-control.html
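To make the effect of this rule concrete, here is a simplified Python model of how a principal rule might be evaluated. It is a sketch under stated assumptions, not Presto's implementation: it assumes the patterns must match the whole string, that the first matching rule decides, and that a request with no matching rule is denied. The function name and the certificate subject are hypothetical.

```python
import re


def can_set_user(rules, principal, user):
    """Return True if the first rule matching principal and user allows it."""
    for rule in rules:
        # The principal pattern is matched against the full certificate subject.
        if not re.fullmatch(rule["principal"], principal):
            continue
        # An optional user pattern restricts which user names may be requested.
        if "user" in rule and not re.fullmatch(rule["user"], user):
            continue
        return bool(rule.get("allow", False))
    return False  # assumption: no matching rule means the request is denied


rules = [{"principal": "CN=onprem-presto.*", "user": "(.*)", "allow": True}]
allowed = can_set_user(rules, "CN=onprem-presto,OU=Data,O=Example", "alice")
denied = can_set_user(rules, "CN=someone-else,O=Example", "alice")
```

With `"user": "(.*)"`, a cluster presenting the trusted certificate may run queries as any user, which is what allows the client cluster to delegate user identities.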

SLIDE 14

PRESTO-TO-PRESTO IMPLEMENTATION CHALLENGES

CHALLENGES:

  • Delegate identity
  • Data types support
  • Query optimization

SOLUTION:

  • Certificate-based authentication
  • JDBC-based data types
  • LIMIT and PROJECTION pushdown
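The idea behind LIMIT and PROJECTION pushdown can be illustrated with a small Python sketch. It is conceptual only and does not reproduce the connector's code: the function names and the table name are hypothetical, and the sketch simply composes the SQL text that would be sent to the remote cluster.

```python
def quote_identifier(name):
    """Quote a column name, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_remote_sql(table, columns=None, limit=None):
    """Compose the SQL text sent to the remote cluster for one table scan."""
    # Projection pushdown: request only the columns the local plan needs.
    if columns:
        select_list = ", ".join(quote_identifier(c) for c in columns)
    else:
        select_list = "*"
    sql = f"SELECT {select_list} FROM {table}"
    # LIMIT pushdown: let the remote cluster stop early instead of
    # shipping every row across the network.
    if limit is not None:
        sql += f" LIMIT {int(limit)}"
    return sql


pushed = build_remote_sql("hive.web.events", ["user_id", "ts"], limit=10)
unpushed = build_remote_sql("hive.web.events")
```

Without pushdown, the local cluster would fetch every column and row and then discard most of them, which is costly when the two clusters are in different networks.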

SLIDE 15

PRESTO-TO-PRESTO BEST PRACTICES

  • Use views on remote cluster
  • Use client tags to control resource usage in remote cluster
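Client tags are typically used on the remote cluster to route queries into a resource group with its own limits. The Python sketch below models that selection in simplified form. It assumes a selector matches when every tag it lists is present on the query and that the first matching selector wins. The group names and tags are hypothetical, and real selectors can also match on other attributes that are ignored here.

```python
def pick_resource_group(selectors, query_client_tags):
    """Return the group of the first selector whose tags are all on the query."""
    tags = set(query_client_tags)
    for selector in selectors:
        required = set(selector.get("clientTags", []))
        # A selector without clientTags matches any query (catch-all).
        if required <= tags:
            return selector["group"]
    return None


# Hypothetical selectors: P2P traffic gets its own group, the rest is ad hoc.
selectors = [
    {"clientTags": ["p2p"], "group": "global.p2p"},
    {"group": "global.adhoc"},
]
p2p_group = pick_resource_group(selectors, ["p2p", "onprem"])
default_group = pick_resource_group(selectors, [])
```

Tagging all P2P traffic from the client cluster lets the remote cluster cap its concurrency and memory separately from local users.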

SLIDE 16

PRESTO-TO-PRESTO NEXT STEPS

  • Complex data type (java.sql.Struct) support
  • Parallel Presto-to-Presto connector
  • Enhanced error reporting

SLIDE 17

CONCLUDING THOUGHTS

P2P – DISTRIBUTED QUERY IN HYBRID DATA LAKE

P2P enables processing on our “hybrid” data lake:

  • Support for multiple storage locations
  • Delegate identity
  • Centralized support for querying
  • Distributed queries
  • Cost savings

Contact Details:

Sajumon Joseph
Sajumon_Joseph@cable.comcast.com