Peregrine: workload optimization for cloud query engines
Alekh Jindal, Hiren Patel, Abhishek Roy, Shi Qiao, Zhicheng Yin, Rathijit Sen, Subru Krishnan
Need to reach by 10, can we drive faster? Sure!
Need to reach by 10, can we drive faster? Sorry, we don’t have a DBA
Reality Check for customers:
Reality Check for providers:
[Figure: on-premise setting. A database vendor and its developers serve Customer 1, Customer 2, ..., Customer n, each with its own DB, DBA, and users.]
Fragmented on-premise workloads
[Figure: cloud setting. Data services DS1, DS2, DS3, ..., DSn with their developers, users, and workloads.]
Job metadata: name, user, account, submit/start/end times
Query plans: logical, physical, stage graph, estimates
Runtime statistics: operator-wise observables
Task level logs: start/end events
Machine counters: CPU, IO, etc.
Several TBs of metadata / day
Instrument, log, and collect workload characteristics
[Figure: logs and metrics from the logical plan, physical plan, stage graph, and tasks are combined with signatures into a denormalized, anonymized view (Workload IR).]
[Figure: queries and the data they read and produce.]
Query templates appear
Queries over same datasets have similarities
Queries depend on datasets produced by previous queries
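The dependency between queries and the datasets produced by previous queries can be made concrete with a small sketch. It assumes each job is logged with the datasets it reads and writes; the function name `job_dependencies`, the tuple layout, and the dataset paths are hypothetical and are not taken from the talk.

```python
from collections import defaultdict


def job_dependencies(jobs):
    """Map each job to the earlier jobs whose outputs it reads.

    `jobs` is a list of (job_id, inputs, outputs) tuples in submission order
    (an assumed log layout, for illustration only).
    """
    producer = {}            # dataset -> most recent job that wrote it
    deps = defaultdict(set)  # job -> set of upstream jobs
    for job_id, inputs, outputs in jobs:
        for dataset in inputs:
            if dataset in producer:
                deps[job_id].add(producer[dataset])
        for dataset in outputs:
            producer[dataset] = job_id
    return dict(deps)


# Hypothetical three-job pipeline.
jobs = [
    ("ingest", ["raw/clicks"], ["clean/clicks"]),
    ("sessions", ["clean/clicks"], ["agg/sessions"]),
    ("report", ["agg/sessions", "clean/clicks"], ["out/report"]),
]
deps = job_dependencies(jobs)
```

Here `report` depends on both `sessions` and `ingest`, while `ingest` has no upstream job because nothing in the log produced `raw/clicks`.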
* Towards a Learning Optimizer for Shared Clouds. Chenggang Wu, Alekh Jindal, Saeed Amizadeh, Hiren Patel, Wangchao Le, Shi Qiao, Sriram Rao. VLDB 2019.
* Computation Reuse in Analytics Job Service at Microsoft. Alekh Jindal, Shi Qiao, Hiren Patel, Jarod Yin, Jieming Di, Malay Bag, Marc Friedman, Yifung Lin, Konstantinos Karanasos, Sriram Rao. SIGMOD 2018.
* Selecting Subexpressions to Materialize at Datacenter Scale. Alekh Jindal, Konstantinos Karanasos, Sriram Rao, Hiren Patel. VLDB 2018.
[Figure: percentage of overlapping jobs, users with overlapping jobs, and overlapping subgraphs for cluster1 through cluster5.]
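As a rough illustration of how the three overlap percentages could be computed, the sketch below assumes each job record carries a user and one signature per subgraph occurrence. The definitions used here are assumptions, not the talk's stated method: a subgraph is "overlapping" when its signature appears in more than one job, and a job or user is overlapping when it contains at least one such subgraph. The function and field names are hypothetical.

```python
from collections import Counter


def _pct(part, whole):
    """Percentage of `part` in `whole`, 0.0 for an empty denominator."""
    return 100.0 * part / whole if whole else 0.0


def overlap_stats(jobs):
    """jobs: list of dicts with 'user' and 'signatures' (one per subgraph)."""
    # Count in how many distinct jobs each signature occurs.
    jobs_per_sig = Counter(sig for job in jobs for sig in set(job["signatures"]))
    shared = {sig for sig, n in jobs_per_sig.items() if n > 1}

    overlapping_jobs = [j for j in jobs if shared & set(j["signatures"])]
    all_users = {j["user"] for j in jobs}
    overlapping_users = {j["user"] for j in overlapping_jobs}
    occurrences = [sig for j in jobs for sig in j["signatures"]]

    return {
        "jobs": _pct(len(overlapping_jobs), len(jobs)),
        "users": _pct(len(overlapping_users), len(all_users)),
        "subgraphs": _pct(sum(s in shared for s in occurrences), len(occurrences)),
    }


# Hypothetical workload: only signature "a" is shared across jobs.
stats = overlap_stats([
    {"user": "alice", "signatures": ["a", "b"]},
    {"user": "bob", "signatures": ["a", "c"]},
    {"user": "carol", "signatures": ["d"]},
    {"user": "alice", "signatures": ["e"]},
])
```

In this toy workload two of four jobs, two of three users, and two of six subgraph occurrences overlap.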
* Dependency-driven analytics: A compass for uncharted data oceans. R. Mavlyutov, C. Curino, B. Asipov, and P. Cudré-Mauroux. CIDR 2017.
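[Figure: a query enters the query engine (compiler, optimizer, scheduler, runtime), which returns a result. The workload passes through workload representation and workload optimization into the feedback service. The engine performs feedback lookup and action using query annotations, rules, and configs.]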
Annotation: signature --> actions
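A minimal sketch of the signature-to-actions lookup follows, assuming annotations are stored as a mapping from a subexpression signature to a list of actions. The class `FeedbackService`, the function `actions_for_plan`, and the action names (`materialize`, `set_cardinality`) are hypothetical placeholders, not the SCOPE or Peregrine API.

```python
from dataclasses import dataclass, field


@dataclass
class FeedbackService:
    """Toy in-memory store of annotations: signature -> list of actions."""
    annotations: dict = field(default_factory=dict)

    def publish(self, signature, action):
        # Called by the workload optimization side to record feedback.
        self.annotations.setdefault(signature, []).append(action)

    def lookup(self, signature):
        # Called by the query engine during compilation or optimization.
        return self.annotations.get(signature, [])


def actions_for_plan(plan_signatures, service):
    """Return the actions to apply, keyed by the signatures that have feedback."""
    found = {}
    for sig in plan_signatures:
        actions = service.lookup(sig)
        if actions:
            found[sig] = actions
    return found


service = FeedbackService()
service.publish("sig-1", {"action": "materialize"})
service.publish("sig-1", {"action": "set_cardinality", "rows": 1200})
service.publish("sig-9", {"action": "materialize"})

# A new query whose plan contains subexpressions sig-1 and sig-2.
todo = actions_for_plan(["sig-1", "sig-2"], service)
```

Keeping the lookup keyed only by signature means the engine needs no knowledge of how the feedback was derived; subexpressions without annotations simply follow the default optimization path.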
Extensions Jar
Optimizer Rule1: Online materialize
Optimizer Rule2: Computation Reuse
SCOPE Modifications to compiler/optimizer
Pluggable extensions from outside
SCOPE
Compiler flags
[Figure: the query engine (compiler, optimizer, scheduler, runtime) takes a query and returns a result. The feedback service runs view selection to produce selected views and learns cardinality to produce cardinality models.]
[Figure: workload representation. Connectors, parsers, and enumerators for SCOPE, Spark, Hive, and other engines populate the workload repository with the IR, query subexpressions, and common subexpressions, identified by strict and recurring signatures.]
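One way to picture the enumerator and the two signature kinds is sketched below. It assumes that a strict signature hashes a subexpression together with its concrete parameter values, while a recurring signature hashes only the parameterized template so that repeated runs of the same script match; that reading of the two labels, along with the `Op` class and all function names, is an assumption made for illustration.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Op:
    """Hypothetical plan node."""
    name: str                      # operator type, e.g. "Filter"
    template: str = ""             # parameterized form, e.g. "country = @c"
    params: tuple = ()             # concrete values bound in this run
    children: list = field(default_factory=list)


def _digest(*parts):
    return hashlib.sha256("|".join(parts).encode()).hexdigest()[:16]


def strict_signature(op):
    """Hash of the exact subexpression, including parameter values."""
    kids = [strict_signature(c) for c in op.children]
    return _digest(op.name, op.template, repr(op.params), *kids)


def recurring_signature(op):
    """Hash of the template only, ignoring values that change per run."""
    kids = [recurring_signature(c) for c in op.children]
    return _digest(op.name, op.template, *kids)


def enumerate_subexpressions(op):
    """Yield every subexpression (subtree) of a plan, root first."""
    yield op
    for child in op.children:
        yield from enumerate_subexpressions(child)


def daily_plan(date):
    scan = Op("Scan", "clicks/@date", (date,))
    return Op("Filter", "country = @c", ("US",), [scan])


monday, tuesday = daily_plan("2019-01-07"), daily_plan("2019-01-08")
n_subexpressions = len(list(enumerate_subexpressions(monday)))
```

Under these assumptions the Monday and Tuesday plans share a recurring signature but differ in their strict signatures, whereas two compilations of the Monday plan agree on both.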
Workload-aware Query Engines
Workload Representation: query plan instrumentation (metadata, plans, statistics, ...); ingest, parse, enumerate; Workload Intermediate Representation (IR); signatures; feature store
Workload Optimization: mathematical solvers, machine learning, graph analytics
Patterns:
Sharing: multi-query optimization, e.g., CloudViews
Recurring: learned optimizations, e.g., Learned Cardinality
Coordinating: dependency-driven optimizations, e.g., physical design for pipeline
Workload Feedback: insights, recommendations, and self-tuning for users, via dashboard, alerts, and the feedback service (query annotations)
https://azuredata.microsoft.com/labs/gsl