DBMS on a modern processor: where does time go?


  1. DBMS on a modern processor: where does time go? Anastasia Ailamaki, David DeWitt, Mark Hill and David Wood, University of Wisconsin–Madison. Presented by: Bogdan Simion

  2. Current DBMS Performance

  3. Where is query execution time spent? Identify performance bottlenecks in CPU and memory

  4. Outline • Motivation • Background • Query execution time breakdown • Experimental results and discussions • Conclusions

  5. Hardware performance standards • Processors are designed and evaluated with simple programs • Benchmarks: SPEC, LINPACK • What about DBMSs?

  6. DBMS bottlenecks • Initially, the bottleneck was I/O • Nowadays: memory- and compute-intensive apps • Modern platforms: – sophisticated execution hardware – fast, non-blocking caches and memory • Still… – DBMS hardware behaviour is suboptimal compared to scientific workloads

  7. Execution pipeline [diagram: instruction pool fed by fetch/decode, dispatch/execute, and retire units; L1 I-cache and L1 D-cache backed by the L2 cache and main memory] Stalls can be overlapped with useful work!

  8. Execution time breakdown: T_Q = T_C + T_M + T_B + T_R − T_OVL • T_C: computation • T_M: memory stalls (L1D, L1I, L2D, L2I, DTLB, ITLB) • T_B: branch-misprediction stalls • T_R: stalls on execution resources (functional units, dependency stalls) • T_OVL: stall time overlapped with useful work
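The breakdown model above can be sketched with hypothetical hardware-counter values; all numbers below are made up for illustration, not measurements from the paper:

```python
# Sketch of the execution-time model T_Q = T_C + T_M + T_B + T_R - T_OVL,
# using made-up cycle counts (not measurements from the paper).
def query_time(t_c, t_m, t_b, t_r, t_ovl):
    """Total query execution cycles from the component times.

    t_ovl is stall time hidden behind useful work by the out-of-order
    core, so it is subtracted once from the sum of the components.
    """
    return t_c + t_m + t_b + t_r - t_ovl

# Hypothetical counters for one query (cycles):
t_q = query_time(t_c=400, t_m=450, t_b=70, t_r=120, t_ovl=40)
# Stall fraction = everything except pure computation:
stall_fraction = (t_q - 400) / t_q
```

With these illustrative numbers the stall fraction comes out to 60%, matching the slides' observation that stalls account for at least half of execution time.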

  9. DB setup • DB is memory resident => no I/O interference • No dynamic or random parameters, no concurrency control among transactions

  10. Workload choice • Simple queries: – single-table range selections (sequential, index) – two-table equijoins • Easy to set up and run • Fully controllable parameters • Isolates basic operations • Enables iterative hypotheses! • Building blocks for complex workloads?

  11. Execution Time Breakdown (%) [chart: query execution time (%) for DBMSs A–D on the 10% sequential scan, 10% indexed range selection, and join (no index) queries; stacked bars show computation, memory, branch mispredictions, and resource stalls] • Stalls take at least 50% of execution time • Memory stalls are the major bottleneck

  12. Memory Stalls Breakdown (%) [chart: memory stall time (%) for DBMSs A–D on the same three queries; components: L1 data, L1 instruction, L2 data, L2 instruction] • L1 data and L2 instruction stalls play a minor role • L2 data and L1 instruction stalls dominate • Memory bottlenecks vary across DBMSs and queries

  13. Effect of Record Size [chart: L2 data misses per record and L1 instruction misses per record vs. record size (20, 48, 100, 200 bytes) for systems A–D] • L2D misses increase with record size: poor locality + page crossings (except D) • L1I misses increase: page-boundary-crossing costs
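The L2 data trend can be approximated with a toy model; the cache-line and page sizes below are assumptions for illustration, not the paper's actual platform parameters:

```python
def l2d_misses_per_record(record_size, line_size=32, page_size=8192):
    """Toy model of L2 data misses per record in a cold sequential scan.

    Assumptions (illustrative, not the paper's platform): 32-byte cache
    lines, 8 KB slotted pages. Each cache line of record data is fetched
    once (little reuse across records), and records straddling a page
    boundary add a small amount of extra traffic.
    """
    line_misses = record_size / line_size     # one miss per line touched
    page_crossings = record_size / page_size  # fraction of records that cross
    return line_misses + page_crossings

# Misses grow roughly linearly with record size, as on the slide:
trend = [l2d_misses_per_record(s) for s in (20, 48, 100, 200)]
```

The model reproduces the qualitative shape of the chart (monotone growth with record size) rather than any measured values.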

  14. Memory Bottlenecks • Memory is important – increasing memory–processor performance gap – deeper memory hierarchies expected • Stalls due to L2 cache data misses – expensive fetches from main memory – L2 grows (8 MB), but will be slower • Stalls due to L1 I-cache misses – buffer-pool code is expensive – L1 I-cache not likely to grow as much as L2

  15. Branch Mispredictions Are Expensive [chart: branch misprediction rates and their share of query execution time (%) for DBMSs A–D; sequential scan, index scan, join (no index)] • Rates are low, but the contribution to execution time is significant • Prediction is a compiler task, but decisive for L1I performance
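Why a low rate can still cost real time: a back-of-the-envelope estimate of the execution-time fraction lost to mispredictions. The branch frequency, flush penalty, and CPI below are assumed values for illustration, not figures from the paper:

```python
def mispred_time_fraction(mispred_rate, branch_freq, penalty_cycles, cpi):
    """Fraction of execution time spent on branch-misprediction stalls.

    Mispredictions per instruction (rate * branch frequency) times the
    pipeline flush penalty, divided by total cycles per instruction.
    """
    return mispred_rate * branch_freq * penalty_cycles / cpi

# Assumed numbers: 2% misprediction rate, 1 branch per 5 instructions,
# a 17-cycle flush penalty, and a CPI of 1.7:
frac = mispred_time_fraction(0.02, 0.2, 17, 1.7)
```

Even with only a 2% misprediction rate, these assumed parameters yield about 4% of execution time lost, before counting the L1I misses that mispredicted paths drag in.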

  16. Mispredictions vs. L1-I Misses [chart: branch mispredictions and L1 I-cache misses per 1000 instructions for DBMSs A–D on the three queries] • More branch mispredictions incur more L1I misses • Index code is more complicated and needs optimization

  17. Resource-related Stalls [chart: dependency-related stalls (T_DEP) and functional-unit-related stalls (T_FU) as % of query execution time for DBMSs A–D; sequential scan, index scan, join (no index)] • High T_DEP for all systems: low ILP opportunity • A's sequential scan: memory-unit load buffers?

  18. Microbenchmarks vs. TPC: CPI Breakdown [chart: clock ticks per instruction for System B and System D, comparing sequential scan with TPC-D and secondary-index scan with TPC-C; components: computation, memory, branch misprediction, resource] • Sequential scan breakdown is similar to TPC-D • Secondary-index scan and TPC-C: higher CPI, dominated by memory stalls (mostly L2 data and instruction)
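CPI charts like the one above normalize each stall category by the retired instruction count. A minimal sketch of that bookkeeping, with invented counter values (not the systems' real numbers):

```python
def cpi_components(cycle_counts, instructions):
    """Per-component CPI: cycles attributed to each stall category,
    divided by retired instructions. All counts here are illustrative."""
    return {name: cycles / instructions for name, cycles in cycle_counts.items()}

# Hypothetical counters for one run (cycles per category, 1e9 instructions):
breakdown = cpi_components(
    {"computation": 8e8, "memory": 9e8, "branch": 1e8, "resource": 2e8},
    instructions=1e9,
)
total_cpi = sum(breakdown.values())
```

Stacking the per-component values recovers the total CPI bar, which is how the slide compares microbenchmarks against TPC-D and TPC-C runs.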

  19. Conclusions • Execution time breakdown shows clear trends • L1I and L2D misses are the major memory bottlenecks • We need to: – reduce page-crossing costs – optimize the instruction stream – optimize data placement in the L2 cache – reduce stalls at all levels • Full TPC benchmarks may not be necessary to locate bottlenecks

  20. Five years later – Becker et al., 2004 • Same DBMSs, setup, workloads (memory resident) and metrics • Outcome: stalls still take a large share of time – Sequential scans: L1I stalls and branch mispredictions much lower – Index scans: no improvement – Joins: improvements, similar to sequential scans – Bottleneck shifted to L2D misses => must improve data placement – What works well on some hardware doesn't on others

  21. Five years later – Becker et al., 2004 • System C on a quad P3 700 MHz, 4 GB RAM, 16 KB L1, 2 MB L2 • System B on a single P4 3 GHz, 1 GB RAM, 8 KB L1D + 12K-uop trace cache, 512 KB L2, BTB 8x larger than the P3's • P3 results: – similar to 5 years ago: major bottlenecks are L1I and L2D • P4 results: – memory stalls almost entirely due to L1D and L2D misses – L1D stalls higher: smaller cache and larger cache line – L1I stalls removed by the trace cache (especially for sequential scan, though some remain for index) • Hardware awareness is important!

  22. References • DBMS on a modern processor: where does time go? Revisited – CMU Tech Report 2004 • Anastassia Ailamaki – VLDB’99 talk slides

  23. Questions?
