
Relational Graph Processing on GPUs


  1. Multipredicate Join Algorithms for Accelerating Relational Graph Processing on GPUs
      Haicheng Wu (1), Daniel Zinn (2), Molham Aref (2), Sudhakar Yalamanchili (1)
      (1) Georgia Institute of Technology; (2) LogicBlox Inc.
      SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

  2. System Diversity Today
      - Hardware diversity is mainstream
      - Amazon EC2 GPU instances
      - Mobile platforms (DSP, GPUs)
      - Keeneland System (GPUs)
      - Cray Titan (GPUs)

  3. GPU and CUDA
      - GPU is a many-core co-processor
        - 1000s of cores
        - 1000s of concurrent threads
        - Higher memory bandwidth
        - Smaller memory capacity
      - CUDA and OpenCL are the dominant programming models
      - Well suited for data-parallel apps
        - Molecular Dynamics, Options Pricing, Ray Tracing, etc.
      - Commodity: led by NVIDIA, AMD, and Intel
      - [Figure: CPU (multi-core, 2-16 cores; main memory ~128 GB at ~50 GB/s) connected over PCI-E (16 GB/s) to the GPU (~3000 cores; GPU memory ~6 GB at ~300 GB/s). Steps: ① input data, ② launch kernel, ③ execute, ④ result.]
      - [Figure: a CUDA kernel is split into Cooperative Thread Arrays (CTAs) that run on a Streaming Multiprocessor (SM) with ALUs, registers, and shared memory. Threads are grouped into warps (Warp 1 ... Warp N) and may diverge between "branch" and "end of branch". Coalesced access reads consecutive DRAM addresses 0, 4, 8, C, 10, 14, 18, 1C.]

  4. Relational Queries and Data Analytics
      - The Opportunity
        - Significant potential data parallelism
      - The Problem
        - Need to process 1-50 TBs of data [1]
        - Small memory capacity and small PCIe bandwidth
        - Irregularity: fine-grained computation, data dependent, low locality
      - [Figure labels: Applications; Large Graphs.]
      - [1] Independent Oracle Users Group. A New Dimension to Data Warehousing: 2011 IOUG Data Warehousing Survey.

  5. The Challenge
      - Candidate application domains: Large Graphs
      - New applications and software stacks: …… LargeQty(p) <- Qty(q), q > 1000. ……
      - New accelerator architectures
      - Relational computations over massive unstructured data sets: sustain 10X – 100X throughput over multicore

  6. Multipredicate Join
      - Goal: implementation of Leapfrog Triejoin (LFTJ) on GPU
        - A worst-case optimal multi-predicate join algorithm
        - Details (e.g., complexity analysis) in T. L. Veldhuizen, ICDT 2014
      - Benefits
        - Smaller memory footprint and less data movement
        - No data reorganization (e.g. sorting or rebuilding a hash table) after changing the join key
      - Approach
        - CPU version
        - CPU-friendly GPU version
        - Customized GPU version

  7. An Important Example – Graph Problems
      - Finding cliques (multi-predicate join)
      - triangle(x,y,z) <- E(x,y), E(y,z), E(x,z), x<y<z.
      - 4cl(x,y,z,w) <- E(x,y), E(x,z), E(x,w), E(y,z), E(y,w), E(z,w), x<y<z<w.
      - Edge (From, To): (0,1), (1,2), (1,3), (2,3), (2,4), (3,5)
      - [Figure: example graph on nodes 0 to 5.]
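
As a reference point for what the triangle rule computes, here is a brute-force sketch in Python over the Edge relation from the slide. The function name `triangles` and the nested-loop strategy are illustrative assumptions, not the authors' implementation. It checks every pair of edges, which is the baseline that LFTJ avoids.

```python
# Edge relation from the slide, as (From, To) pairs.
EDGES = [(0, 1), (1, 2), (1, 3), (2, 3), (2, 4), (3, 5)]

def triangles(edges):
    """Evaluate triangle(x,y,z) <- E(x,y), E(y,z), E(x,z), x<y<z by brute force."""
    e = set(edges)
    result = []
    for (x, y) in sorted(e):            # candidate for E(x,y)
        for (y2, z) in sorted(e):       # candidate for E(y,z)
            # Join on y, then check the third predicate E(x,z) and the ordering.
            if y2 == y and (x, z) in e and x < y < z:
                result.append((x, y, z))
    return result

print(triangles(EDGES))  # [(1, 2, 3)]
```

The only triangle in the example graph is (1, 2, 3), formed by the edges (1,2), (2,3), and (1,3).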

  8. Leapfrog Join (LFJ)
      - LFJ is the base of LFTJ
      - Essentially multi-way intersections
      - Basic primitives: seek(), next()
      - [Figure: three sorted lists with the seek()/next() calls made during the join. A = 0 1 3 4 5 6 7 8 9 11, annotated seek(2), seek(8), seek(10); B = 0 2 6 7 8 9, annotated seek(3), seek(8), seek(10); C = 2 4 5 8 10, annotated seek(6), next(). Courtesy: T. L. Veldhuizen, ICDT 2014.]
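
The multi-way intersection built from seek() and next() can be sketched in Python as follows. The class `SortedIter` and the function `leapfrog_join` are hypothetical names chosen for this illustration, and the sketch assumes each input is a sorted list without duplicates. seek() is done with a binary search from the current position.

```python
from bisect import bisect_left

class SortedIter:
    """Cursor over a sorted, duplicate-free list with the LFJ primitives."""
    def __init__(self, keys):
        self.keys, self.pos = keys, 0
    def at_end(self):
        return self.pos >= len(self.keys)
    def key(self):
        return self.keys[self.pos]
    def next(self):
        self.pos += 1
    def seek(self, target):
        # Advance to the first key >= target, never moving backwards.
        self.pos = bisect_left(self.keys, target, self.pos)

def leapfrog_join(lists):
    """Return the intersection of all sorted lists."""
    iters = [SortedIter(keys) for keys in lists]
    if any(it.at_end() for it in iters):
        return []
    iters.sort(key=lambda it: it.key())   # smallest key first
    out, p = [], 0
    hi = iters[-1].key()                  # largest key seen so far
    while True:
        it = iters[p]                     # iterator holding the smallest key
        if it.key() == hi:
            out.append(hi)                # smallest == largest: all agree
            it.next()
        else:
            it.seek(hi)                   # leapfrog over the current maximum
        if it.at_end():
            return out
        hi = it.key()
        p = (p + 1) % len(iters)

A = [0, 1, 3, 4, 5, 6, 7, 8, 9, 11]
B = [0, 2, 6, 7, 8, 9]
C = [2, 4, 5, 8, 10]
print(leapfrog_join([A, B, C]))  # [8]
```

On the three lists from the slide the only common key is 8. Each iterator is moved only by seek() or next(), so no list is scanned element by element.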

  9. Trie Data Structure
      - LFTJ works on a trie data structure
      - Edge (From, To): (0,1), (1,2), (1,3), (2,3), (2,4), (3,5)
      - [Figure: trie for Edge. Root; From level: 0, 1, 2, 3; To level: 1 (under 0); 2, 3 (under 1); 3, 4 (under 2); 5 (under 3).]
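
One way to store this two-level trie is as three flat arrays, sketched below in Python. The layout (`keys`, `offsets`, `values`) and the function name `build_trie` are assumptions made for illustration; the slides do not say how the trie is laid out in memory.

```python
EDGES = [(0, 1), (1, 2), (1, 3), (2, 3), (2, 4), (3, 5)]

def build_trie(pairs):
    """Build a two-level trie from (From, To) pairs as flat sorted arrays.

    keys[i]    : i-th distinct From value (first trie level)
    values     : all To values, grouped by From (second trie level)
    offsets[i] : children of keys[i] are values[offsets[i]:offsets[i + 1]]
    """
    keys, offsets, values = [], [], []
    for frm, to in sorted(set(pairs)):
        if not keys or keys[-1] != frm:
            keys.append(frm)              # new first-level node
            offsets.append(len(values))   # its children start here
        values.append(to)
    offsets.append(len(values))           # sentinel closing the last group
    return keys, offsets, values

keys, offsets, values = build_trie(EDGES)
print(keys)     # [0, 1, 2, 3]
print(offsets)  # [0, 1, 3, 5, 6]
print(values)   # [1, 2, 3, 3, 4, 5]
```

The `keys` and `values` arrays match the From and To levels drawn in the figure, and both levels stay sorted, which is what seek() needs.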

  10. LFTJ Algorithm – join 3 tries
      [Figure, repeated on slides 10-25 with the current step highlighted: three copies of the Edge trie, labeled E(x,y), E(y,z), and E(x,z), aligned to the variable levels x, y, z. Each trie has Root, first level 0 1 2 3, and second level 1 | 2 3 | 3 4 | 5.]

  11. LFTJ Algorithm – open() level x

  12. LFTJ Algorithm – seek(0) in E(x,z) level x

  13. LFTJ Algorithm – open() level y

  14. LFTJ Algorithm – seek(1) in E(y,z) level y

  15. LFTJ Algorithm – open() level z

  16. LFTJ Algorithm – seek(2) in E(x,z) level z, failed

  17. LFTJ Algorithm – up() to level y

  18. LFTJ Algorithm – up() to level x

  19. LFTJ Algorithm – seek(1) in E(x,z) level x

  20. LFTJ Algorithm – open() level y

  21. LFTJ Algorithm – seek(2) in E(y,z) level y

  22. LFTJ Algorithm – open() level z

  23. LFTJ Algorithm – seek(3) in E(x,z) level z

  24. LFTJ Algorithm – next()

  25. LFTJ Algorithm – final result
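
The walkthrough on slides 10-25 can be approximated by the following Python sketch for the triangle query. It is a simplification under stated assumptions: the trie is a plain dict, the depth-first open()/up() steps are expressed as nested loops, and each level uses a two-way leapfrog intersection. The names `intersect` and `lftj_triangles` are hypothetical, and the sketch relies on every edge having From < To, so x<y<z holds without an explicit check.

```python
from bisect import bisect_left

def intersect(a, b):
    """Leapfrog-style intersection of two sorted lists; seek() is a binary search."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i = bisect_left(a, b[j], i)   # seek(b[j]) in a
        else:
            j = bisect_left(b, a[i], j)   # seek(a[i]) in b
    return out

def lftj_triangles(edges):
    # Trie as a dict: first-level key -> sorted list of second-level keys.
    trie = {}
    for frm, to in sorted(set(edges)):
        trie.setdefault(frm, []).append(to)
    top = sorted(trie)
    result = []
    for x in top:                                  # level x: E(x,y) and E(x,z)
        # Level y: children of x in E(x,y) joined with the top level of E(y,z).
        for y in intersect(trie[x], top):
            # Level z: children of y in E(y,z) joined with children of x in E(x,z).
            for z in intersect(trie[y], trie[x]):
                result.append((x, y, z))
    return result

EDGES = [(0, 1), (1, 2), (1, 3), (2, 3), (2, 4), (3, 5)]
print(lftj_triangles(EDGES))  # [(1, 2, 3)]
```

For x = 0 the level-z intersection is empty, which corresponds to the failed seek on slide 16. For x = 1 and y = 2 the intersection yields z = 3, matching the successful seek(3) on slide 23.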

  26. LFTJ Algorithm – short conclusion
      - Very simple set of primitives to implement
      - A sequential algorithm
        - Traverses the trie in depth-first order
      - Two methods for applying this technique with GPUs
        - CPU algorithm per GPU thread
        - Customized data-parallel application

  27. LFTJ-GPU: First Algorithm
      - Evenly map the top level of the leftmost trie to GPU threads
      - Run sequential LFTJ in each GPU thread
        - seek() is implemented as binary search
        - Data-dependent control flow
        - No spatial or temporal locality
      - [Figure: example mapping to 2 GPU threads. The top-level keys 0 1 2 3 of the leftmost trie are split between threads t0 and t1; the tries E(x,y), E(y,z), and E(x,z) are the same as on slides 10-25.]
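
The mapping described on this slide can be simulated on the CPU with the Python sketch below. It is not CUDA and not the authors' kernel: each "thread" is an ordinary function call, and the names `partition`, `thread_work`, and `contains` are hypothetical. It assumes contiguous, equal-sized chunks of top-level keys per thread, and it simplifies the per-thread search to nested loops with a binary-search probe at the last level.

```python
from bisect import bisect_left

def contains(sorted_keys, target):
    """seek(target) as a binary search; True if it lands exactly on target."""
    i = bisect_left(sorted_keys, target)
    return i < len(sorted_keys) and sorted_keys[i] == target

def partition(keys, num_threads):
    """Split the top-level keys into contiguous chunks, one per thread."""
    chunk = max(1, -(-len(keys) // num_threads))   # ceiling division
    return [keys[i:i + chunk] for i in range(0, len(keys), chunk)]

def thread_work(trie, xs):
    """Sequential triangle search over the top-level keys owned by one thread."""
    found = []
    for x in xs:
        for y in trie[x]:
            for z in trie.get(y, []):
                if contains(trie[x], z):           # probe E(x,z)
                    found.append((x, y, z))
    return found

EDGES = [(0, 1), (1, 2), (1, 3), (2, 3), (2, 4), (3, 5)]
trie = {}
for frm, to in sorted(EDGES):
    trie.setdefault(frm, []).append(to)

chunks = partition(sorted(trie), 2)
print(chunks)                                   # [[0, 1], [2, 3]]
print([thread_work(trie, xs) for xs in chunks]) # [[(1, 2, 3)], []]
```

In this example the first thread finds the only triangle and the second finds none. The amount of work per thread depends on the data, which is the data-dependent control flow noted on the slide.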
