 
PRESTO TO PRESTO: DISTRIBUTED QUERIES IN HYBRID DATALAKE
Sajumon “Saj” Joseph, Principal Architect, Comcast

Agenda:
• Query distribution challenge
• Presto-to-Presto (P2P) connector
• P2P challenges
• P2P best practices
• P2P future work
DATA ACCESS: 2018/EARLY 2019
[Diagram: access paths labeled ANSI SQL and Teradata SQL; components shown: AWS, S3, MongoDB, SQL Server, Cassandra, Query Grid, Teradata, Hadoop, Oracle]
CURRENT ARCHITECTURE CHALLENGES
• Cloud native apps cannot access on-premise datasets
• Distributed queries and compute
• Network connectivity
• Presto upgrades, version management
DATA ACCESS: 2019
[Diagram: access paths labeled Teradata SQL and ANSI SQL; components shown: MongoDB, SQL Server, Cassandra, Query Grid, Teradata, Hadoop, Oracle]
DATA ACCESS: 2019
[Diagram: access paths labeled ANSI SQL; components shown: AWS, S3, Aurora DB, Teradata, Hadoop]
QUERY FABRIC – DISTRIBUTED QUERIES SOLUTION
• Federated access
• Secure access
• Cost savings
PRESTO TO PRESTO (P2P) CONNECTOR
• Based on JDBC, treats remote Presto cluster as a JDBC data source
• Uses pre-established trust between Presto clusters for authentication
• Adopts current filtering capabilities available for JDBC connectors
PRESTO-TO-PRESTO ON-PREMISE USER
[Diagram: Presto single endpoint; cloud Presto and on-premise Presto clusters linked by P2P; catalogs shown: datawarehouse, datalake1, datalake2; backends shown: Hadoop (on-premise), Teradata (on-premise), AWS (S3)]
PRESTO-TO-PRESTO CLOUD USER
[Diagram: Presto single endpoint; cloud Presto and on-premise Presto clusters linked by two P2P connections; catalogs shown: datawarehouse, datalake1, datalake2; backends shown: AWS (S3), Hadoop (on-premise), Teradata (on-premise)]
PRESTO-TO-PRESTO SECURE CONNECTION FLOW
[Diagram: Presto single endpoint; cloud Presto and on-premise Presto clusters, with arrows numbered 1 to 5]
Step 1: User authenticates in the on-premise cluster and submits a query.
Step 2: The on-premise cluster connects to the cloud Presto cluster, passing its client certificate and the user identity.
Step 3: User access to the dataset is validated by checking the user identity against Ranger.
Step 4: Cloud Presto runs the query.
Step 5: Results are returned.
PRESTO-TO-PRESTO CONNECTOR CLIENT CONFIGURATION
etc/catalog/datalake1.properties

    connector.name=presto
    connection-url=jdbc:presto://<host>:<port>/<catalog>
    presto.SSL=true
    presto.SSLTrustStorePath=<path of public certificate of remote cluster>
    presto.SSLKeyStorePath=<path of client java keystore>
    presto.SSLKeyStorePassword=<client java keystore password>
    presto.clientTags=<client tags>
    unsupported-type-handling=CONVERT_TO_VARCHAR
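A catalog file like the one above is easy to get subtly wrong, so a small pre-deployment check can help. The following is a minimal sketch, not part of the connector: the helper names, the example host and paths, and the choice of which keys count as required are all assumptions made for illustration.

```python
# Hypothetical pre-flight check for a P2P catalog file (not part of Presto).
# Assumption: when presto.SSL=true, the trust store, key store, and key store
# password settings shown on the slide must all be present.
REQUIRED_ALWAYS = ("connector.name", "connection-url")
REQUIRED_WITH_SSL = (
    "presto.SSLTrustStorePath",
    "presto.SSLKeyStorePath",
    "presto.SSLKeyStorePassword",
)


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def missing_keys(props):
    """Return the required keys that are absent or empty, in a fixed order."""
    required = list(REQUIRED_ALWAYS)
    if props.get("presto.SSL", "false").lower() == "true":
        required.extend(REQUIRED_WITH_SSL)
    return [key for key in required if not props.get(key)]


# Example values are placeholders, not real hosts or paths.
EXAMPLE = """
connector.name=presto
connection-url=jdbc:presto://remote.example.com:8443/hive
presto.SSL=true
presto.SSLTrustStorePath=/etc/presto/remote-cluster.pem
"""

if __name__ == "__main__":
    print(missing_keys(parse_properties(EXAMPLE)))
```

Here the example file enables SSL but omits the key store settings, so the check reports those two keys as missing.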
PRESTO-TO-PRESTO REMOTE CLUSTER CONFIGURATION
On Coordinator node: etc/config.properties

    http-server.authentication.type=JWT,PASSWORD,CERTIFICATE,KERBEROS
    http-server.https.truststore.path=<path of trust store containing client cluster certificate>
PRESTO-TO-PRESTO – REMOTE CLUSTER CONFIGURATION
On Coordinator node: etc/rules.json

    "principals": [
      {
        "principal": "CN=<client CN>.*",
        "user": "(.*)",
        "allow": true
      }
    ]

More details on Principal rules: https://prestosql.io/docs/current/security/built-in-system-access-control.html
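To see what a principal rule of this shape permits, the matching can be modeled in a few lines. This is a simplified sketch, not Presto's implementation: it assumes rules are evaluated in order, that patterns must match the whole string, and that no matching rule means denial. It also ignores the optional principal_to_user field. The CN value is a made-up placeholder.

```python
import re

# Simplified model of principal rules (an assumption-laden sketch, not
# Presto's code). The CN below is a hypothetical client cluster name.
RULES = [
    {"principal": "CN=onprem-presto.*", "user": "(.*)", "allow": True},
]


def is_allowed(rules, principal, user):
    """Return the 'allow' value of the first rule matching principal and user."""
    for rule in rules:
        if not re.fullmatch(rule["principal"], principal):
            continue
        user_pattern = rule.get("user")
        if user_pattern is not None and re.fullmatch(user_pattern, user):
            return rule["allow"]
    # Assumption: a request that matches no rule is denied.
    return False


if __name__ == "__main__":
    print(is_allowed(RULES, "CN=onprem-presto.example.com", "alice"))
    print(is_allowed(RULES, "CN=unknown-host", "alice"))
```

With the `(.*)` user pattern, any user name is accepted once the certificate's principal matches, which is what lets the trusted client cluster pass along the identity of whichever user submitted the query.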
PRESTO-TO-PRESTO IMPLEMENTATION CHALLENGES

CHALLENGES:             SOLUTION:
• Delegate identity     Certificate-based authentication
• Data types support    JDBC-based data types
• Query optimization    LIMIT and PROJECTION pushdown
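LIMIT and projection pushdown mean the remote cluster receives a query that names only the needed columns and row count, rather than shipping whole tables across the network. The sketch below illustrates the idea of generating such a remote query; the function names and the table and column names are hypothetical, and this is not the connector's actual code.

```python
# Illustrative sketch of projection and LIMIT pushdown (hypothetical helper,
# not the P2P connector's implementation).
def quote_ident(name):
    """Double-quote an SQL identifier, escaping embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_remote_query(table, columns, limit=None):
    """Build the SQL sent to the remote cluster.

    Only the requested columns are selected (projection pushdown), and the
    row limit is applied remotely (LIMIT pushdown).
    """
    select_list = ", ".join(quote_ident(c) for c in columns) if columns else "*"
    qualified = ".".join(quote_ident(part) for part in table.split("."))
    sql = f"SELECT {select_list} FROM {qualified}"
    if limit is not None:
        if limit < 0:
            raise ValueError("limit must be non-negative")
        sql += f" LIMIT {limit}"
    return sql


if __name__ == "__main__":
    print(build_remote_query("sales.orders", ["order_id", "total"], limit=10))
```

Without pushdown, the equivalent remote query would be a full `SELECT *` scan, with columns and rows discarded only after they had crossed the link between clusters.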
PRESTO-TO-PRESTO BEST PRACTICES
• Use views on remote cluster
• Use client tags to control resource usage in remote cluster
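Client tags become useful on the remote cluster when resource group selectors route tagged queries into a group with its own limits. The following is a simplified model of that routing, assuming the first selector whose clientTags are all present on the query wins. The tag "p2p" and the group names are hypothetical, and real selectors support more conditions than are modeled here.

```python
# Simplified model of routing queries to resource groups by client tag
# (a sketch under stated assumptions, not Presto's selector logic).
SELECTORS = [
    {"clientTags": ["p2p"], "group": "remote.p2p"},  # hypothetical P2P group
    {"group": "global.adhoc"},                       # catch-all selector
]


def select_group(selectors, query_tags):
    """Return the group of the first selector whose tags are all on the query."""
    tags = set(query_tags)
    for selector in selectors:
        if set(selector.get("clientTags", [])) <= tags:
            return selector["group"]
    return None


if __name__ == "__main__":
    print(select_group(SELECTORS, ["p2p"]))
    print(select_group(SELECTORS, []))
```

Setting presto.clientTags in the client catalog file to the tag the remote selectors expect would then place all P2P traffic in one group, where concurrency and memory limits can be applied separately from local users.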
PRESTO-TO-PRESTO NEXT STEPS
• Complex data type (java.sql.Struct) support
• Parallel Presto-to-Presto connector
• Enhanced error reporting
CONCLUDING THOUGHTS
P2P – DISTRIBUTED QUERY IN HYBRID DATA LAKE

P2P enables processing on our “hybrid” data lake:
• Support for multiple storage locations
• Delegate identity
• Centralized support for querying
• Distributed queries
• Cost savings

Contact Details:
Sajumon Joseph
Sajumon_Joseph@cable.comcast.com