advanced database systems
play

ADVANCED DATABASE SYSTEMS Parallel Join Algorithms (Sorting) @ - PowerPoint PPT Presentation

Lect ure # 18 ADVANCED DATABASE SYSTEMS Parallel Join Algorithms (Sorting) @ Andy_Pavlo // 15- 721 // Spring 2019 CMU 15-721 (Spring 2019) 2 PRO J ECT # 2 This Week Status Meetings Monday April 8 th Code Review Submission


  1. Lect ure # 18 ADVANCED DATABASE SYSTEMS Parallel Join Algorithms (Sorting) @ Andy_Pavlo // 15- 721 // Spring 2019

  2. CMU 15-721 (Spring 2019) 2 PRO J ECT # 2 This Week → Status Meetings Monday April 8 th → Code Review Submission → Update Presentation → Design Document

  3. CMU 15-721 (Spring 2019) 3 PARALLEL J O IN ALGO RITH M S Perform a join between two relations on multiple threads simultaneously to speed up operation. Two main approaches: → Hash Join → Sort-Merge Join

  4. CMU 15-721 (Spring 2019) 4 SIMD Background Parallel Sort-Merge Join Evaluation

  5. CMU 15-721 (Spring 2019) 5 S IN GLE I N STRUCTIO N, M ULTIPLE D ATA A class of CPU instructions that allow the processor to perform the same operation on multiple data points simultaneously. All major ISAs have microarchitecture support SIMD operations. → x86 : MMX, SSE, SSE2, SSE3, SSE4, AVX → PowerPC : Altivec → ARM : NEON

  6. CMU 15-721 (Spring 2019) 6 SIM D EXAM PLE X + Y = Z 8 7 x 1 y 1 x 1 +y 1 6 X 5 x 2 y 2 x 2 +y 2 + = 4 Z ⋮ ⋮ ⋮ 3 x n y n x n +y n 2 SISD 1 + 1 for (i=0; i<n; i++) { 1 Z[i] = X[i] + Y[i]; 1 } Y 1 1 1 1 1

  7. CMU 15-721 (Spring 2019) 6 SIM D EXAM PLE X + Y = Z 8 7 x 1 y 1 x 1 +y 1 6 X 5 x 2 y 2 x 2 +y 2 + = 4 Z ⋮ ⋮ ⋮ 3 x n y n x n +y n 2 SISD 1 + 9 1 for (i=0; i<n; i++) { 1 Z[i] = X[i] + Y[i]; 1 } Y 1 1 1 1 1

  8. CMU 15-721 (Spring 2019) 6 SIM D EXAM PLE X + Y = Z 8 7 x 1 y 1 x 1 +y 1 6 X 5 x 2 y 2 x 2 +y 2 + = 4 Z ⋮ ⋮ ⋮ 3 x n y n x n +y n 2 SISD 1 + 9 8 7 6 5 4 3 2 1 for (i=0; i<n; i++) { 1 Z[i] = X[i] + Y[i]; 1 } Y 1 1 1 1 1

  9. CMU 15-721 (Spring 2019) 6 SIM D EXAM PLE 128-bit SIMD Register X + Y = Z 8 7 x 1 y 1 x 1 +y 1 8 7 6 5 6 X 5 x 2 y 2 x 2 +y 2 + = 4 Z ⋮ ⋮ ⋮ 3 x n y n x n +y n 2 SIMD + 1 9 8 7 6 1 for (i=0; i<n; i++) { 128-bit SIMD Register 1 1 1 1 1 Z[i] = X[i] + Y[i]; 1 128-bit SIMD Register } Y 1 1 1 1 1

  10. CMU 15-721 (Spring 2019) 6 SIM D EXAM PLE X + Y = Z 8 7 x 1 y 1 x 1 +y 1 6 X 5 x 2 y 2 x 2 +y 2 + = 4 Z ⋮ ⋮ ⋮ 3 4 3 2 1 x n y n x n +y n 2 SIMD + 1 9 8 7 6 5 4 3 2 1 for (i=0; i<n; i++) { 1 Z[i] = X[i] + Y[i]; 1 } Y 1 1 1 1 1 1 1 1 1

  11. CMU 15-721 (Spring 2019) 7 SIM D TRADE- O FFS Advantages: → Significant performance gains and resource utilization if an algorithm can be vectorized. Disadvantages: → Implementing an algorithm using SIMD is still mostly a manual process. → SIMD may have restrictions on data alignment. → Gathering data into SIMD registers and scattering it to the correct locations is tricky and/or inefficient.

  12. CMU 15-721 (Spring 2019) 8 SO RT- M ERGE J O IN (R ⨝ S) Phase #1: Sort → Sort the tuples of R and S based on the join key. Phase #2: Merge → Scan the sorted relations and compare tuples. → The outer relation R only needs to be scanned once.

  13. CMU 15-721 (Spring 2019) 9 SO RT- M ERGE J O IN (R ⨝ S) Relation R Relation S SORT! SORT!

  14. CMU 15-721 (Spring 2019) 9 SO RT- M ERGE J O IN (R ⨝ S) Relation R Relation S MERGE! SORT! SORT! ⨝

  15. CMU 15-721 (Spring 2019) 9 SO RT- M ERGE J O IN (R ⨝ S) Relation R Relation S MERGE! SORT! SORT! ⨝

  16. CMU 15-721 (Spring 2019) 10 PARALLEL SO RT- M ERGE J O IN S Sorting is always the most expensive part. Use hardware correctly to speed up the join algorithm as much as possible. → Utilize as many CPU cores as possible. → Be mindful of NUMA boundaries. → Use SIMD instructions where applicable. MULTI- CORE, MAIN- MEMORY JOINS: SORT VS. HASH REVISITED VLDB 2013

  17. CMU 15-721 (Spring 2019) 11 PARALLEL SO RT- M ERGE J O IN (R ⨝ S) Phase #1: Partitioning (optional) → Partition R and assign them to workers / cores. Phase #2: Sort → Sort the tuples of R and S based on the join key. Phase #3: Merge → Scan the sorted relations and compare tuples. → The outer relation R only needs to be scanned once.

  18. CMU 15-721 (Spring 2019) 12 PARTITIO N IN G PH ASE Approach #1: Implicit Partitioning → The data was partitioned on the join key when it was loaded into the database. → No extra pass over the data is needed. Approach #2: Explicit Partitioning → Divide only the outer relation and redistribute among the different CPU cores. → Can use the same radix partitioning approach we talked about last time.

  19. CMU 15-721 (Spring 2019) 13 SO RT PH ASE Create runs of sorted chunks of tuples for both input relations. It used to be that Quicksort was good enough. But NUMA and parallel architectures require us to be more careful…

  20. CMU 15-721 (Spring 2019) 14 CACH E- CO N SCIO US SO RTIN G Level #1: In-Register Sorting → Sort runs that fit into CPU registers. Level #2: In-Cache Sorting → Merge Level #1 output into runs that fit into CPU caches. → Repeat until sorted runs are ½ cache size. Level #3: Out-of-Cache Sorting → Used when the runs of Level #2 exceed the size of caches. SORT VS. HASH REVISITED: FAST JOIN IMPLEMENTATION ON MODERN M MULTI- CORE C CPUS VLDB 2009

  21. CMU 15-721 (Spring 2019) 15 CACH E- CO N SCIO US SO RTIN G UNSORTED Level #1 Level #2 Level #3 SORTED

  22. CMU 15-721 (Spring 2019) 16 LEVEL # 1 SO RTIN G N ETWO RKS Abstract model for sorting keys. → Fixed wiring “paths” for lists with the same # of elements. → Efficient to execute on modern CPUs because of limited data dependencies and no branches. Input Output 9 5 3 6

  23. CMU 15-721 (Spring 2019) 16 LEVEL # 1 SO RTIN G N ETWO RKS Abstract model for sorting keys. → Fixed wiring “paths” for lists with the same # of elements. → Efficient to execute on modern CPUs because of limited data dependencies and no branches. Input Output 5 9 9 5 3 3 6 6

  24. CMU 15-721 (Spring 2019) 16 LEVEL # 1 SO RTIN G N ETWO RKS Abstract model for sorting keys. → Fixed wiring “paths” for lists with the same # of elements. → Efficient to execute on modern CPUs because of limited data dependencies and no branches. Input Output 5 3 9 9 5 3 5 3 6 6

  25. CMU 15-721 (Spring 2019) 16 LEVEL # 1 SO RTIN G N ETWO RKS Abstract model for sorting keys. → Fixed wiring “paths” for lists with the same # of elements. → Efficient to execute on modern CPUs because of limited data dependencies and no branches. Input Output 5 3 9 3 9 5 3 5 3 6 6

  26. CMU 15-721 (Spring 2019) 16 LEVEL # 1 SO RTIN G N ETWO RKS Abstract model for sorting keys. → Fixed wiring “paths” for lists with the same # of elements. → Efficient to execute on modern CPUs because of limited data dependencies and no branches. Input Output 5 3 9 3 9 5 3 5 3 6 6

  27. CMU 15-721 (Spring 2019) 16 LEVEL # 1 SO RTIN G N ETWO RKS Abstract model for sorting keys. → Fixed wiring “paths” for lists with the same # of elements. → Efficient to execute on modern CPUs because of limited data dependencies and no branches. Input Output 5 3 9 3 9 6 5 5 5 3 5 6 3 6 6 9 6 9

  28. CMU 15-721 (Spring 2019) 16 LEVEL # 1 SO RTIN G N ETWO RKS Abstract model for sorting keys. → Fixed wiring “paths” for lists with the same # of elements. → Efficient to execute on modern CPUs because of limited wires = [9,5,3,6] data dependencies and no branches. wires[0] = min (wires[0], wires[1]) Input Output wires[1] = max (wires[0], wires[1]) 5 3 wires[2] = min (wires[2], wires[3]) 9 3 wires[3] = max (wires[2], wires[3]) 9 6 5 wires[0] = min (wires[0], wires[2]) 5 5 wires[2] = max (wires[0], wires[2]) 3 5 6 wires[1] = min (wires[1], wires[3]) 3 6 wires[3] = max (wires[1], wires[3]) 6 9 6 9 wires[1] = min (wires[1], wires[2]) wires[2] = max (wires[1], wires[2])

  29. CMU 15-721 (Spring 2019) 16 LEVEL # 1 SO RTIN G N ETWO RKS Abstract model for sorting keys. → Fixed wiring “paths” for lists with the same # of elements. → Efficient to execute on modern CPUs because of limited wires = [9,5,3,6] data dependencies and no branches. 1 wires[0] = min (wires[0], wires[1]) 1 Input Output wires[1] = max (wires[0], wires[1]) 5 3 wires[2] = min (wires[2], wires[3]) 9 3 wires[3] = max (wires[2], wires[3]) 9 6 5 2 wires[0] = min (wires[0], wires[2]) 5 5 2 wires[2] = max (wires[0], wires[2]) 3 5 6 wires[1] = min (wires[1], wires[3]) 3 6 3 wires[3] = max (wires[1], wires[3]) 6 9 6 9 3 wires[1] = min (wires[1], wires[2]) wires[2] = max (wires[1], wires[2])

  30. CMU 15-721 (Spring 2019) 17 LEVEL # 1 SO RTIN G N ETWO RKS 12 21 4 13 9 8 6 7 1 14 3 0 5 11 15 10 Instructions: → 4 LOAD

  31. CMU 15-721 (Spring 2019) 17 LEVEL # 1 SO RTIN G N ETWO RKS Sort Across Registers 12 21 4 13 9 8 6 7 1 14 3 0 5 11 15 10 Instructions: → 4 LOAD

  32. CMU 15-721 (Spring 2019) 17 LEVEL # 1 SO RTIN G N ETWO RKS Sort Across Registers 12 21 4 13 1 8 3 0 9 8 6 7 5 11 4 7 1 14 3 0 9 14 6 10 5 11 15 10 12 21 15 13 Instructions: Instructions: → 4 LOAD → 10 MIN/MAX

  33. CMU 15-721 (Spring 2019) 17 LEVEL # 1 SO RTIN G N ETWO RKS Sort Across Transpose Registers Registers 12 21 4 13 1 8 3 0 1 5 9 12 9 8 6 7 5 11 4 7 8 11 14 21 1 14 3 0 9 14 6 10 3 4 6 15 5 11 15 10 12 21 15 13 0 7 10 13 Instructions: Instructions: Instructions: → 4 LOAD → 10 MIN/MAX → 8 SHUFFLE → 4 STORE

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend