depth estimation for ranking query optimization
play

Depth Estimation for Ranking Query Optimization - PowerPoint PPT Presentation

Depth Estimation for Ranking Query Optimization KarlSchnaitter,UCSantaCruz JoshuaSpiegel,BEASystems,Inc. NeoklisPolyzotis,UCSantaCruz Relational Ranking Queries SELECT h.hid, r.rid, e.eid FROM Hotels h,


  1. Depth Estimation for Ranking Query Optimization Karl�Schnaitter,�UC�Santa�Cruz Joshua�Spiegel,�BEA�Systems,�Inc. Neoklis�Polyzotis,�UC�Santa�Cruz

  2. Relational Ranking Queries SELECT h.hid, r.rid, e.eid FROM Hotels h, Restaurants r, Events e WHERE h.city = r.city AND r.city = e.city RANK BY 0.3/h.price + 0.5*r.rating + 0.2*isMusic(e) LIMIT 10 • A� ���������� for�each�table�in�[0,1] � � (h) = 1/h.price Hotels: � � (r) = r.rating Restaurants: � � (e) = isMusic(e) Events: • Combined�with�a� ������������������ � ( � � ,� � � ,� � � )�=�0.3* � � +�0.5* � � +�0.2* � � • Return�top� �� results�based�on� � In�this�case,� � =�10 2

  3. Ranking Query Execution SELECT h.hid, r.rid, e.eid FROM Hotels h, Restaurants r, Events e WHERE h.city = r.city AND r.city = e.city RANK BY 0.3/h.price + 0.5*r.rating + 0.2*isMusic(e) LIMIT 10 conventional�plan rank4aware�plan ���������������� ���������������� ��������� Rank join Join Rank join � Join � � � � � Ordered 3 by score

  4. Depth Estimation • ����� : number�of�accessed�tuples – Indicates�execution�cost – Linked�to�memory�consumption Rank join ���������� � � ����������� • ������������� Estimate�depths�for� each operator�in�a�rank4aware�plan 4

  5. Depth Estimation Methods • Ilyas�et�al.�(SIGMOD�2004) – Uses�probabilistic�model�of�data – Assumes�relations�of�equal�size�and�a� scoring�function�that�sums�scores – Limited�applicability • Li�et�al.�(SIGMOD�2005) – Samples�a�subset�of�rows�from�each�table – Independent�samples�give�a�poor�model� of�join�results 5

  6. Our Solution: DEEP • �� pth� � stimation�for� � hysical�plans • Strengths�of�DEEP – A�principled methodology • Uses statistical�model�of�data�distribution • Formally�computes�depth�over�statistics – Efficient�estimation�algorithms – Widely�applicable • Works�with state4of4the4art�physical�plans • Realizable�with�common�data�synopses 6

  7. Outline • Preliminaries • DEEP�Framework • Experimental�Results 7

  8. Outline • Preliminaries • DEEP�Framework • Experimental�Results 8

  9. Monotonic Functions • A�function� � ( � 1 ,..., � � )�is� ��������� if � � ( � � ≤y � )� � � ( � 1 ,..., � � )�≤ � (y 1 ,...,y � ) f(x) x 9

  10. Monotonic Functions • A�function� � ( � 1 ,..., � � )�is� ��������� if � � ( � � ≤y � )� � � ( � 1 ,..., � � )�≤ � (y 1 ,...,y � ) • Most�scoring�functions�are�monotonic – E.g.�sum,�product,�avg,�max,�min • Monotonicity�enables�bound�on�score – In�example�query,�score�was 0.3/h.price�+�0.5*r.rating�+�0.2*isMusic(e) – Given�a�restaurant� � ,�upper�bound�is 0.3*1 +�0.5*r.rating�+�0.2*1 10

  11. Hash Rank Join [IAE04] • The� ���������� ��� algorithm – Joins�inputs�sorted�by�score – Returns�results�with�highest�score • Main�ideas – Alternate�between�inputs�based�on� ������������! – Score�bounds�allow�early�termination Bound: 1.8 Bound: 1.7 L a b L R a b R Query: Top result from L R with scoring function x 1.0 y 1.0 S ( b L , b R ) = b L + b R y 0.8 z 0.9 Result: y — — — w 0.7 Score: 1.8 — — — 11

  12. HRJN* [IAE04] • The�HRJN*�pull�strategy: a) Pull�from�the�input�with�highest�bound b) If�(a)�is�a�tie,�pull�from�input�with�the� smaller�number�of�pulls�so�far c) If�(b)�is�a�tie,�pull�from�the�left Bound: 2.0 1.8 Bound: 2.0 1.9 1.7 Query: Top result from L R L a b L R a b R with scoring function x 1.0 y 1.0 S ( b L , b R ) = b L + b R y 0.8 z 0.9 Result: y � ? w 0.7 Score: 1.8 12

  13. Outline • Preliminaries • DEEP�Framework • Experimental�Results 13

  14. Supported Operators Evidence�in�favor�of�HRJN* – Pull�strategy�has�strong�properties� • Within�constant�factor�of�optimal�cost • Optimal�for�a�significant�class�of�inputs • More�details�in�the�paper – Efficient�in experiments�[IAE04] � DEEP�explicitly�supports�HRJN* – Easily�extended�to�other�join�operators – Selection�operators�too 14

  15. DEEP: Conceptual View ������������� defined in Depth terms of Statistical Computation Data Model �������������� defined in Estimation terms of Statistics Data Algorithms Interface Synopsis 15

  16. Statistics Model • Statistics�yield�the�distribution�of� scores�for�base�tables�and�joins � " � " ( � " ) F L 1.0 5 0.9 2 F L R b L b R F L R ( b L , b R ) 0.8 3 1.0 1.0 6 0.6 12 1.0 0.5 4 0.4 8 1.0 0.7 3 0.9 0.7 2 � � � � ( � � ) F R 0.6 0.7 2 1.0 3 0.7 1 0.5 2 16

  17. Statistics Interface • DEEP�accesses�statistics�with�two�methods – ������# ( � ):�Return�frequency�of� � – ��������� ( � , � ):�Return�next�lowest�score�on�dimension� � b L b R F L R ( b L , b R ) 1.0 1.0 6 ������# ( � )�=�3 1.0 0.5 4 ��������� ( � ,1)=0.9 � 1.0 0.7 3 ��������� ( � ,2)=0.5 0.9 0.7 2 0.6 0.7 2 • The interface�allows�for�efficient�algorithms – Abstracts�the�physical statistics�format – Allows�statistics�to�be�generated�on4the4fly 17

  18. Statistics Implementation • Interface�can�be�implemented�over� common�types�of�data�synopses • Can�use�a�histogram�if a) Base�score�function�is�invertible,�or b) Base�score�measures�distance • Assume�uniformity�&�independence�if a) Base�score�function�is�too�complex,�or b) Sufficient�statistics�are�not�available 18

  19. Depth Estimation Overview Top4 � query�plan Estimates�made Value Score of the k th best s 1 1. s 1 tuple out of 1 1 2. Depths of needed l 1 and r 1 l 1 r 1 1 to output score of s 1 th best C s 2 3. Score of the l 1 s 2 tuple out of 2 2 l 2 r 2 4. Depths of needed l 2 and r 2 2 to output score of s 2 B A 19

  20. Estimating Terminal Score • Suppose�we�want� • Idea the�10 th best�score – Sort�by�total�score – Sum�frequencies b L b R F L R ( b L , b R ) b L + b R F L R ( b L ,b R ) sum 1.0 1.0 6 2.0 6 6 1.0 0.5 4 1.7 3 9 1.0 0.7 3 1.6 2 11 0.9 0.7 2 1.5 4 0.6 0.7 2 1.3 2 � term =�1.6 20

  21. Estimation Algorithm • Idea:�Only�process�necessary�statistics 1 0.7 0.5 b L b R F L R ( b L , b R ) 6 3 4 1 1.0 1.0 6 2 0.9 1.0 0.5 4 1.0 0.7 3 0.8 0.9 0.7 2 2 0.6 0.7 2 0.6 � term =�1.6 • Algorithm�relies�solely�on� ������# and� ��������� – Avoids�materializing�complete�table • Worst4case�complexity�equivalent�to�sorting�table – More�efficient�in�practice 21

  22. Depth Estimation Overview Top4 � query�plan Estimates�made Value Score of the k th best s 1 1. s 1 tuple out of 1 1 2. Depths of needed l 1 and r 1 l 1 r 1 1 to output score of s 1 th best C s 2 3. Score of the l 1 s 2 tuple out of 2 2 l 2 r 2 4. Depths of needed l 2 and r 2 2 to output score of s 2 B A 22

  23. Estimating Depth for HRJN* �������� � ≤ ≤ depth�of�HRJN*� ≤ $ ≤ ≤ Input Score Bounds Example: � term =�1.6 >�S term � " � " ( � " ) � " + 1 � " ( � " ) > S term � > S term 1.0 5 2.0 5 >�S term 1.9 2 0.9 2 $ =�S term 1.8 3 0.8 3 = S term 1.6 4 0.6 12 =�S term 0.4 8 1.4 8 <�S term 11� ≤ depth� ≤ 15 <�S term • Estimation�algorithm <�S term < S term – Access�via� ������# and� ��������� <�S term – Similar�to�estimation�of � term 23

  24. Outline • Preliminaries • DEEP�Framework • Experimental�Results 24

  25. Experimental Setting • TPC4H�data�set – Total�size�of�1�GB – Varying�amount�of�skew • Workloads�of�250�queries – Top410,�top4100,�top41000�queries – One�or�two�joins�per�query • Error�metric:� ���������������%������� 25

  26. Depth Estimation Techniques • DEEP – Uses 150�KB�TuG�synopsis�[SP06] • Probabilistic�[IAE04] – Uses�same�TuG�synopsis – Modified�to�handle�single4join�queries� with�varying�table�sizes • Sampling�[LCIS05] – 5%�sample�=�4.6�MB 26

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend