PGCon 2020
Tatsuro Yamada, Julien Rouhaud
Tatsuro Yamada
- Works for NTT Comware as a Database Engineer
- PostgreSQL Contributor
- Oracle_fdw Committer
- pg_plan_advsr Author
- Member of PGConf.Asia Organizers since 2016
Julien Rouhaud
- Works for VMware
- PostgreSQL Contributor
- HypoPG Author
- pg_qualstats Co-Author
Who we are
Agenda
1. What is a Query plan?
2. Why can an inefficient plan be chosen?
3. Our Challenges to get an efficient plan
4. Conclusion
- 1. What is a Query plan?
SQL
- a declarative language
Query Plan
- the planner's steps to retrieve the requested data
- contains multiple plan nodes (e.g. scan, join, aggregate, …)
- tree structure
Planner
- selects the query plan it estimates to be the most efficient, based on calculated cost
Does the planner select an efficient plan every time?
Inefficient plan
- tends to result in a long query execution time
- the EXPLAIN command allows checking a query plan
  e.g. EXPLAIN (ANALYZE, BUFFERS) SELECT ….
Inefficient plan
In this example, using an Index can reduce the execution time
QUERY PLAN
- Nested Loop ([...] rows=313475000 width=16) ([...] rows=0 loops=1)
- > Seq Scan on pgqs t2 ([...] rows=25078 width=8) ([...] rows=25000 loops=1)
Filter: (val1 = 0) Rows Removed by Filter: 25000
- > Materialize ([...] rows=12500 width=8) ([...] rows=0 loops=25000)
- > Seq Scan on pgqs t1 ([...] rows=12500 width=8) ([...] rows=0 loops=1)
Filter: ((val1 = 0) AND (val2 = 0)) Rows Removed by Filter: 50000
Why can an inefficient plan be chosen?
https://rjuju.github.io/postgresql/2020/02/28/pg_qualstats-2-selectivity-error.html
- 2. Why can an inefficient plan be chosen?
Problem: Inefficient plan
Reasons: Bad scan method, Bad join method, Bad join order
Root causes: Inaccurate statistics, Index deficiency, Planner specification
2.1. Index deficiency or not working
Problems
- No suitable index available
- Good candidate index can't be used
Solutions
- Create suitable indexes and check how queries are written
- Better to create the minimum number of indexes needed for good performance
- If you have thousands of queries, this is tough work
What are statistics?
- Sampled data from a table
e.g.
- Number of rows
- Number of unique values
- Most common values
How does planner use statistics?
- To calculate the cardinality/selectivity of nodes, and cost of query plans
- Planner assumes the lowest cost plan is the most efficient plan
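The statistics above can be inspected directly. As a sketch (the table name pgqs is taken from the demo plan and purely illustrative), the standard catalogs expose them:

```sql
-- Row-count estimates used by the planner live in pg_class;
-- per-column statistics (distinct values, most common values) in pg_stats.
SELECT reltuples, relpages
FROM pg_class
WHERE relname = 'pgqs';

SELECT attname, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'pgqs';
```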
2.2. Inaccurate statistics
(diagram: query and statistics feed the planner, which produces a plan)
In what cases can statistics be inaccurate?
2.2. Inaccurate statistics
Cases
- a. Autovacuum not properly tuned
- b. Usage of temporary tables
- c. Massive imports directly followed by queries or in the same transaction
2.2. Inaccurate statistics
- a. autovacuum not properly tuned
Problem
- autovacuum appears to not work properly
Solution
- Tune the thresholds for triggering analyze using GUC parameters:
- autovacuum_analyze_threshold
- autovacuum_analyze_scale_factor
- And ideally on a per-table basis
- ALTER TABLE tbl SET (autovacuum_analyze_... = …)
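As a sketch, per-table tuning of the analyze thresholds could look like this (the table name and values are illustrative, not recommendations):

```sql
-- Trigger an automatic ANALYZE after ~5% of rows change
-- instead of the 10% global default (table/values are illustrative).
ALTER TABLE orders SET (
    autovacuum_analyze_threshold    = 1000,
    autovacuum_analyze_scale_factor = 0.05
);
```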
2.2. Inaccurate statistics
- b. Usage of temporary tables
Problem
- Temporary tables have no statistics by default, as autovacuum can't see them
- The planner then uses a default row count to calculate costs
Solution
- Run manual ANALYZE on temporary tables
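A minimal sketch of the fix (table and column names are illustrative):

```sql
-- Temporary tables are invisible to autovacuum, so analyze them manually
-- once populated; otherwise the planner falls back to default row counts.
CREATE TEMPORARY TABLE tmp_sales AS
    SELECT * FROM sales WHERE sale_date >= DATE '2020-01-01';

ANALYZE tmp_sales;

SELECT customer_id, sum(amount)
FROM tmp_sales
GROUP BY customer_id;
```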
2.2. Inaccurate statistics
- c. Massive imports followed by queries in the same
transaction
Problem
- A query executed just after importing a large number of rows can lead to an inefficient plan, since the planner can't use up-to-date statistics
Solutions
- Executing ANALYZE manually is better than waiting for autovacuum
- Divide the transaction into three processing steps, such as import, analyze, and query processing, if possible
Before: imports + query (in one transaction) / After: imports → analyze → query
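A sketch of the three-step split (table and file names are illustrative):

```sql
-- Step 1: import
BEGIN;
COPY big_table FROM '/tmp/import.csv' (FORMAT csv);
COMMIT;

-- Step 2: refresh statistics before querying
ANALYZE big_table;

-- Step 3: query, now planned with up-to-date statistics
SELECT count(*) FROM big_table WHERE status = 'pending';
```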
2.3. Planner specification
Selectivity estimation for AND-ed quals
Problem
- By default, the planner assumes that columns are totally independent for selectivity estimation
Example: WHERE zipcode = 75 AND city = ‘Paris’
- Paris’s zipcode is 75, so both predicates are actually redundant
- If both predicates retain 1% of the rows, applying both should also retain 1% of the rows
- But by default the planner multiplies the selectivities, assuming that only 0.01% of the rows will be retained
2.3. Planner specification
Solution
- Detect those correlated columns
- Create extended statistics (CREATE STATISTICS)
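Using the zipcode/city example above, a sketch of the fix with extended statistics (PostgreSQL 10+; the table name is illustrative):

```sql
-- Declare a functional dependency between the two correlated columns.
CREATE STATISTICS addr_zip_city (dependencies)
    ON zipcode, city FROM addresses;

-- Extended statistics are collected during ANALYZE.
ANALYZE addresses;

-- The planner can now avoid multiplying the two selectivities.
EXPLAIN SELECT * FROM addresses
WHERE zipcode = 75 AND city = 'Paris';
```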
Causes and Solutions
Problem: Inefficient plan
Reasons: Bad scan method, Bad join method, Bad join order
Root causes and solutions:
- Index deficiency → Create index
- Inaccurate statistics → Appropriate settings
- Planner specification → Create extended statistics
Is it possible to apply the solutions easily?
Plan Tuning
- 3. Our Challenges to get an efficient plan
3.1. Automatic indexing tool
- pg_qualstats
3.2. Auto query plan tuning tool
- pg_plan_advsr
3.3. What's next?
3.1. Auto index and extended statistics advising tool: pg_qualstats
3.1.a What is it?
3.1.b How does index suggestion work?
3.1.c Demo
3.1.d Summary
3.1.a What is pg_qualstats
- PostgreSQL extension that keeps track of predicate
  (WHERE/JOIN clause) statistics during query execution
- Selectivity, error in selectivity estimation, number of executions, type of scan…
- Aggregated per unique query (same as pg_stat_statements)
- And a global index advisor using those statistics
- Takes the server's actual workload into account
- Uses multiple heuristics to suggest the minimum number of new indexes
- E.g. preference for multi-column indexes
- Returns new index suggestions and lists the predicates that cannot be automatically optimized
- E.g. operators that don't have any index AM compatibility
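As a sketch of how the advisor is invoked (the function name and JSON keys below follow the pg_qualstats 2 announcement; verify them against your installed version):

```sql
-- Suggested index DDL, one statement per row.
SELECT v
FROM json_array_elements(
         pg_qualstats_index_advisor()->'indexes') AS v;

-- Predicates that could not be optimized automatically.
SELECT v
FROM json_array_elements(
         pg_qualstats_index_advisor()->'unoptimised') AS v;
```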
3.1.b How does index suggestion work
- Retrieve the list of “interesting” predicates
- Filter more than 50% of the rows
- Using a sequential scan
- For queries run more than 10 times
- …
3.1.b How does index suggestion work
- Build the full set of paths, with each AND-ed predicate combination
3.1.b How does index suggestion work
- Compute a score for each path, giving each predicate a weight corresponding to its number of « simple » predicates
3.1.b How does index suggestion work
- The highest score is the « best » index
- a single index optimizing the highest number of predicates
- Generate the index DDL given the predicate list
- The correct column order corresponds to the ascending weight in the path!
- Remove all optimized predicates from the list, and start again until all predicates are optimized
3.1.c Demo – the queries
3.1.c Demo – optimizable predicates
3.1.c Demo – non optimizable predicates
- The ILIKE (~~*) can’t be optimized automatically. DBA
attention is required here to know how to properly handle this.
3.1.d Summary
- Useful extension to extract knowledge on your production
workload
- Helps DBA to focus on the “complex” optimization part thanks to
the global index suggestion
- But also gives a lot of other information that can be used for other parts of optimization
- Automatically find correlated columns
- Automatically find problematic statistics
- …
3.2. Auto query plan tuning tool: pg_plan_advsr
3.2.a What is this?
3.2.b How it works
3.2.c Demo
3.2.d Summary
3.2.a What is pg_plan_advsr and what are its merits
pg_plan_advsr
- allows automatic tuning of a query plan by using a feedback loop
Advantages
- The machine performs the tuning, rather than humans
- Tuning is possible even when there is no expert
- For experts, it is possible to gain new insights
- Can find an optimal plan
- Since it can control each plan node, it can create an optimal plan that might not be selected by vanilla PostgreSQL
- All plan history is available during the tuning process
- Moreover, any plan from the history can be reproduced
3.2.a What is pg_plan_advsr and what are its merits
Use-cases
- System development field
- Tuning during performance tests and performance troubles
- Regular reporting process
- Analytical processing
- Batch processing
- Field of study
- Verify planner's capability
- You can check the performance of the selected plan when there is no estimation error
How does this tool achieve automatic tuning?
Note:
- pg_plan_advsr is in the POC phase, so we recommend using it in a verification environment
- Refer to the manual for details of its functional restrictions
3.2.b How does auto plan tuning work?
(diagram: Query → Parser → Planner → Executor → Result, with pg_plan_advsr hooked in)
- 1. Detect estimation errors
- 2. Correct estimation errors by using feedback info on the next query execution
- 3. Record feedback info and error info [1]
[1] The difference between estimated rows and actual rows
(Implemented with EXPLAIN ANALYZE: a PDCA cycle with an added feedback loop)
How to correct an estimation error
3.2.b Rough concept of pg_plan_advsr
- 1st iteration: Nested Loop (A B), estimate: 1, actual: 1000
  - Seq Scan (A), Index Scan (B)
- Feedback info (hints) fixes the estimation error
- 2nd iteration: Hash Join (A B), estimate: 1000, actual: 1000
  - Seq Scan (A), Index Scan (B)
3.2.c Short Demonstration
- 31c.sql from Join Order Benchmark
- Analytical query
- Join count: 10
- Aggregate functions: 4
- Data size: 10GB
- Parameters
- max_parallel_workers_per_gather = 0
- max_parallel_workers = 0
- geqo_threshold = 20
- from_collapse_limit = 16
- join_collapse_limit = 16
- random_page_cost = 2
- Environment
- PostgreSQL 11.8
- CentOS 7.8 on VirtualBox
- i5-6300U 2.4GHz 2core / 4thread
- Mem 16GB
- SSD 512GB
3.2.c Short Demonstration
Scenario
- 1. Prewarm all tables (done)
- 2. Execute “psql -f 31c.sql” 17 times for auto plan tuning
- > auto_tune.sh
- 3. Check Plan history table to see plan changes
- 4. Re-verification:
- > Baseline plan: 31c_org.sql
- > Tuned plan: 31c_hinted.sql
Demo: Auto Plan Tuning
(demo video at 3x speed, showing iterations, execution time, and the plan history table)
Plan history table shows some changes
id  | pgsp_queryid | pgsp_planid | diff_of_joins | execution_time (ms)
87  | 1458904999   | 863916075   | 943869        | 9444.67
88  | …            | 1204820903  | 119616        | 1839.69
89  | …            | 3427660743  | 83425         | 1403.17
90  | …            | 916206257   | 41572         | 3709.36
91  | …            | 1314881321  | 124359        | 2327.76
92  | …            | 3744232778  | 12080         | 1582.87
93  | …            | 1330829809  | 101159        | 1579.09
94  | …            | 1610196430  | 110410        | 1634.02
95  | …            | 2071239096  | 120713        | 2021.11
96  | …            | 3418115224  | 25800         | 3000.66
97  | …            | 3498353310  | 75603         | 2629.61
98  | …            | 245020999   | 8996          | 1971.09
99  | …            | 2367663326  | 337358        | 3799.11
100 | 1458904999   | 2565516663  | …             | 1152.47
…   | …            | …           | …             | …
Performance Improved 8.2 times
(9.4s -> 1.1s)
Baseline Tuned
Optimizer Hints to reproduce the Optimized plan
/*+ LEADING( ((((((it1 ((it2 ((cn mc )mi_idx ))mi ))ci )n )t )mk )k ) ) HASHJOIN(ci cn it1 it2 k mc mi mi_idx mk n t) NESTLOOP(ci cn it1 it2 mc mi mi_idx mk n t) NESTLOOP(ci cn it1 it2 mc mi mi_idx n t) NESTLOOP(ci cn it1 it2 mc mi mi_idx n) NESTLOOP(ci cn it1 it2 mc mi mi_idx) NESTLOOP(cn it1 it2 mc mi mi_idx) NESTLOOP(cn it2 mc mi mi_idx) NESTLOOP(cn it2 mc mi_idx) NESTLOOP(cn mc mi_idx) NESTLOOP(cn mc) SEQSCAN(it1) SEQSCAN(it2) SEQSCAN(cn) INDEXSCAN(mc) INDEXSCAN(mi_idx) INDEXSCAN(mi) INDEXSCAN(ci) INDEXSCAN(n) INDEXSCAN(t) INDEXSCAN(mk) SEQSCAN(k) */
(pgsp_planid: 2565516663)
You can bring the hints to other environments to reproduce the plan!
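The hints use the pg_hint_plan syntax, so replaying a tuned plan elsewhere is a matter of prepending the hint comment to the query. A toy sketch (tables t1/t2 are hypothetical; requires the pg_hint_plan extension; the real hint body is the long comment shown above):

```sql
LOAD 'pg_hint_plan';

-- Force the join order, join method, and scan methods recorded
-- during tuning (hint body shortened to a two-table example).
/*+ LEADING((t1 t2)) NESTLOOP(t1 t2) SEQSCAN(t1) INDEXSCAN(t2) */
EXPLAIN
SELECT *
FROM t1 JOIN t2 ON t1.id = t2.id
WHERE t1.val = 0;
```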
Demo: Re-verification
Original query (Baseline plan) Hinted query (Optimized plan)
Result of Re-verification
Query          | Plan                   | Execution time (ms)
31c_org.sql    | Baseline plan          | 9398.37
31c_hinted.sql | Optimized plan (Tuned) | 1109.77
8.4x faster: proved the effectiveness of auto plan tuning!
3.2.d Summary of pg_plan_advsr
Merit for DBA
- It is possible to improve query performance by automatically optimizing
the query plan (OK even without an expert)
- The optimized plan can be reproduced on other environments
How it works
- Optimizes the plan by using a feedback loop that improves the cost calculation on each query execution
Demo
- A ten-join query's performance improved about 8x with auto plan tuning
3.3. What's next?
Making both extensions work together

Causes of inefficient plan | Solutions               | pg_qualstats | pg_plan_advsr | Both
Index deficiency           | Find missing index      | ✔            |               | ✔
Inaccurate statistics      | Appropriate settings    |              |               |
Planner specification      | Find correlated columns | possible     |               | possible
Planner specification      | Plan tuning             |              | ✔             | ✔
3.3. Make the extensions work together
- We recommend the following procedure to get an efficient plan:
- Step 1. Test your queries
- Step 2. Extract appropriate indexes with pg_qualstats and add them
- Step 3. Find queries with multiple joins and bad estimates to feed pg_plan_advsr
- Step 4. Optimize the plans with pg_plan_advsr
- 4. Conclusion
- 1. What is a Query plan? and 2. Why can inefficient plans be chosen?
- Explained query plans, what causes inefficient plans, and how to resolve them
- 3. Our Challenges to get an efficient plan
- pg_qualstats:
automatic index advising tool, and detection of queries with bad selectivity estimates
- pg_plan_advsr:
automatic query plan tuning tool
- Proposal to the PostgreSQL community
- Advising features and adaptive query optimization have large merit for DBAs;
we'd like to propose building them into vanilla PostgreSQL
Thanks
yamatattsu at gmail.com rjuju123 at gmail.com
"elephants beach walk" by Senorhorst Jahnsen is licensed under CC BY 2.0
Q&A See you in Zoom!
References
- Query text of 31c.sql
- Feedback Info of 31c.sql (example)
- Source code
- Related Talk and Extension
- Tools to help inspect query plans
Query text of 31c.sql
EXPLAIN ANALYZE
SELECT MIN(mi.info) AS movie_budget,
       MIN(mi_idx.info) AS movie_votes,
       MIN(n.name) AS writer,
       MIN(t.title) AS violent_liongate_movie
FROM cast_info AS ci,
     company_name AS cn,
     info_type AS it1,
     info_type AS it2,
     keyword AS k,
     movie_companies AS mc,
     movie_info AS mi,
     movie_info_idx AS mi_idx,
     movie_keyword AS mk,
     name AS n,
     title AS t
WHERE ci.note IN ('(writer)', '(head writer)', '(written by)', '(story)', '(story editor)')
  AND cn.name LIKE 'Lionsgate%'
  AND it1.info = 'genres'
  AND it2.info = 'votes'
  AND k.keyword IN ('murder', 'violence', 'blood', 'gore', 'death', 'female-nudity', 'hospital')
  AND mi.info IN ('Horror', 'Action', 'Sci-Fi', 'Thriller', 'Crime', 'War')
  AND t.id = mi.movie_id
  AND t.id = mi_idx.movie_id
  AND t.id = ci.movie_id
  AND t.id = mk.movie_id
  AND t.id = mc.movie_id
  AND ci.movie_id = mi.movie_id
  AND ci.movie_id = mi_idx.movie_id
  AND ci.movie_id = mk.movie_id
  AND ci.movie_id = mc.movie_id
  AND mi.movie_id = mi_idx.movie_id
  AND mi.movie_id = mk.movie_id
  AND mi.movie_id = mc.movie_id
  AND mi_idx.movie_id = mk.movie_id
  AND mi_idx.movie_id = mc.movie_id
  AND mk.movie_id = mc.movie_id
  AND n.id = ci.person_id
  AND it1.id = mi.info_type_id
  AND it2.id = mi_idx.info_type_id
  AND k.id = mk.keyword_id
  AND cn.id = mc.company_id;
Feedback Info of 31c.sql (example)
ROWS(ci cn it1 it2 k mc mi mi_idx mk n t #2825) ROWS(ci cn it1 it2 mc mi mi_idx mk n t #107406) ROWS(ci cn it1 it2 mc mi mi_idx n t #1017) ROWS(ci cn it1 it2 mc mi mi_idx n #1017) ROWS(ci cn it1 it2 mc mi mi_idx #1017) ROWS(cn it1 it2 mc mi mi_idx #1235) ROWS(cn it2 mc mi mi_idx #1295) ROWS(cn it2 mc mi_idx #1449) ROWS(cn mc mi_idx #4370) ….
Source code
- pg_qualstats
https://github.com/powa-team/pg_qualstats
- pg_plan_advsr
https://github.com/ossc-db/pg_plan_advsr
Related Talk and Extension
Talk
- PGConf.Russia 2019
- Auto plan tuning using feedback by Tatsuro Yamada
https://pgconf.ru/en/2019/242844
- PGConf.Eu 2018
- Auto plan tuning using feedback by Tatsuro Yamada
https://www.postgresql.eu/events/pgconfeu2018/schedule/session/2132-auto-plan-tuning-using-feedback-loop/
Extension
- AQO by Postgres Professional
- https://github.com/postgrespro/aqo
Tools to help inspect query plans
- Visualization tools
- explain.depesz.com
- pev2
- pg_flame