SLIDE 1

DBMS on a modern processor: where does time go?

Anastasia Ailamaki, David DeWitt, Mark Hill and David Wood University of Wisconsin‐Madison

Presented by: Bogdan Simion

SLIDE 2

Current DBMS Performance


SLIDE 3

Where is query execution time spent?

Identify performance bottlenecks in CPU and memory

SLIDE 4

Outline

  • Motivation
  • Background
  • Query execution time breakdown
  • Experimental results and discussions
  • Conclusions
SLIDE 5

Hardware performance standards

  • Processors are designed and evaluated with simple programs
  • Benchmarks: SPEC, LINPACK
  • What about DBMSs?
SLIDE 6

DBMS bottlenecks

  • Initially, the bottleneck was I/O
  • Nowadays, DBMSs are memory- and compute-intensive apps
  • Modern platforms:
    – sophisticated execution hardware
    – fast, non-blocking caches and memory
  • Still …
    – DBMS hardware behaviour is suboptimal compared to scientific workloads

SLIDE 7

Execution pipeline

[Diagram: execution pipeline. The FETCH/DECODE UNIT, DISPATCH/EXECUTE UNIT, and RETIRE UNIT share an INSTRUCTION POOL; the L1 I-cache and L1 D-cache feed the pipeline, backed by the L2 CACHE and MAIN MEMORY]

Stalls can be overlapped with useful work!

SLIDE 8

Execution time breakdown

  • TC – Computation
  • TM – Memory stalls (L1D, L1I, L2D, L2I, DTLB, ITLB)
  • TB – Branch mispredictions
  • TR – Stalls on execution resources (functional units, dependency stalls)

TQ = TC + TM + TB + TR − TOVL

SLIDE 9

DB setup

  • DB is memory resident => no I/O interference
  • No dynamic or random parameters, no concurrency control among transactions

SLIDE 10

Workload choice

  • Simple queries:
    – single-table range selections (sequential, index)
    – two-table equijoins
  • Easy to set up and run
  • Fully controllable parameters
  • Isolate basic operations
  • Enable iterative hypotheses
  • Building blocks for complex workloads?
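The sequential-scan range selection used as a building block here can be sketched as a tight loop over a memory-resident table. The record layout and field names below are hypothetical, chosen to match the memory-resident setup:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 48-byte record; the payload pads the record to a
 * chosen size, as in the record-size experiments. */
struct record {
    int  a1;          /* selection attribute         */
    int  a2;          /* aggregated attribute        */
    char payload[40]; /* padding to the record size  */
};

/* Sequential-scan range selection: sum a2 over records with
 * lo < a1 < hi. Every record is touched, so the access pattern
 * is a pure sequential sweep over the table. */
static long scan_range(const struct record *tbl, size_t n, int lo, int hi)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        if (tbl[i].a1 > lo && tbl[i].a1 < hi)
            sum += tbl[i].a2;
    return sum;
}
```

A loop this simple is what makes the microbenchmark attractive: selectivity (via lo/hi) and record size are fully controllable, and the hardware behaviour is easy to reason about.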
SLIDE 11

[Charts: Execution time breakdown (%) — panels: 10% Sequential Scan, 10% Indexed Range Selection, Join (no index); x-axis: DBMS A–D; stacked bars: Computation, Memory, Branch mispredictions, Resource]

  • Stalls take at least 50% of execution time
  • Memory stalls are the major bottleneck

SLIDE 12

[Charts: Memory stall time breakdown (%) — panels: 10% Sequential Scan, 10% Indexed Range Selection, Join (no index); x-axis: DBMS A–D; stacked bars: L1 Data, L1 Instruction, L2 Data, L2 Instruction]

  • L1 data cache and L2 instruction cache stalls are unimportant
  • L2 data and L1 instruction stalls dominate
  • Memory bottlenecks vary across DBMSs and queries

SLIDE 13

Effect of Record Size

[Charts: misses per record vs. record size (20, 48, 100, 200 bytes) for the 10% Sequential Scan — panels: L1 instruction misses / record, L2 data misses / record; series: Systems A–D]

  • L2D increase: less locality + page boundary crossing (except D)
  • L1I increase: page boundary crossing costs
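The upward L2D trend has a simple back-of-the-envelope explanation: in a cold sequential scan every cache line of the table is fetched exactly once, so larger records cost proportionally more misses. A sketch of that model (my approximation, not the paper's measurement; it ignores page-crossing effects):

```c
#include <assert.h>

/* Approximate L2 data misses per record for a cold sequential scan:
 * each cache line is fetched once, so a record of record_size bytes
 * accounts for about record_size / line_size misses. */
static double l2d_misses_per_record(double record_size, double line_size)
{
    return record_size / line_size;
}
```

With 64-byte lines, 200-byte records cost roughly 3 misses per record while 20-byte records cost about 0.3, matching the direction of the measured trend.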

SLIDE 14

Memory Bottlenecks

  • Memory is important
    – increasing memory-processor performance gap
    – deeper memory hierarchies expected
  • Stalls due to L2 cache data misses
    – expensive fetches from main memory
    – L2 grows (8MB), but will be slower
  • Stalls due to L1 I-cache misses
    – buffer pool code is expensive
    – L1 I-cache not likely to grow as much as L2

SLIDE 15

Branch Mispredictions Are Expensive

[Charts: branch misprediction rates and misprediction stall time (% of query execution time, 0–25%) — x-axis: DBMS A–D; series: Sequential Scan, Index Scan, Join (no index)]

  • Rates are low, but the contribution to execution time is significant
  • A compiler task, but decisive for L1I performance
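Why mispredictions are expensive and how selection code can sidestep them: a data-dependent branch can be rewritten as branch-free predication, trading a possible pipeline flush for a cheap arithmetic operation. A sketch (my illustration, not code from any of the measured systems):

```c
#include <assert.h>
#include <stddef.h>

/* Branching selection: when roughly half the keys qualify, the if
 * is hard to predict and each miss flushes the pipeline. */
static long count_branching(const int *keys, size_t n, int lo, int hi)
{
    long count = 0;
    for (size_t i = 0; i < n; i++)
        if (keys[i] > lo && keys[i] < hi)
            count++;
    return count;
}

/* Predicated selection: the comparison result (0 or 1) is added
 * directly, so there is no data-dependent branch to mispredict. */
static long count_branchfree(const int *keys, size_t n, int lo, int hi)
{
    long count = 0;
    for (size_t i = 0; i < n; i++)
        count += (keys[i] > lo) & (keys[i] < hi);
    return count;
}
```

Both functions return the same result; only the predictability of the control flow differs, which is exactly the kind of compiler-level concern the slide points at.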

SLIDE 16

Mispredictions vs. L1-I Misses

[Charts: events per 1000 instructions — panels: 10% Sequential Scan, 10% Indexed Range Selection, Join (no index); x-axis: DBMS A–D; bars: branch mispredictions, L1 I-cache misses]

  • More branch mispredictions incur more L1I misses
  • Index code is more complicated and needs optimization

SLIDE 17

Resource‐related Stalls

[Charts: resource-related stalls (% of query execution time, 0–25%) — panels: dependency-related stalls (TDEP), functional unit-related stalls (TFU); x-axis: DBMS A–D; series: Sequential Scan, Index Scan, Join (no index)]

  • High TDEP for all systems: low ILP opportunity
  • A's sequential scan: memory unit load buffers?

SLIDE 18

Microbenchmarks vs. TPC

CPI Breakdown

[Charts: CPI breakdown (clock ticks, 0–3.5) for Systems B and D — x-axis: sequential scan, TPC-D, 2ary index, TPC-C; stacked bars: Computation, Memory, Branch misprediction, Resource]

  • Sequential-scan breakdown is similar to TPC-D
  • Secondary index and TPC-C: higher CPI and memory stalls (mostly L2 D&I)

SLIDE 19

Conclusions

  • Execution time breakdown shows trends
  • L1I and L2D are the major memory bottlenecks
  • We need to:
    – reduce page-crossing costs
    – optimize the instruction stream
    – optimize data placement in the L2 cache
    – reduce stalls at all levels
  • TPC may not be necessary to locate bottlenecks
SLIDE 20

Five years later – Becker et al 2004

  • Same DBMSs, setup, and workloads (memory resident), same metrics
  • Outcome: stalls still take a lot of time
    – sequential scans: L1I stalls and branch mispredictions much lower
    – index scans: no improvement
    – joins: improvements, similar to sequential scans
    – bottleneck shifted to L2D misses => must improve data placement
    – what works well on some hardware doesn't on other hardware

SLIDE 21

Five years later – Becker et al 2004

  • C on a quad P3 700MHz, 4G RAM, 16K L1, 2M L2
  • B on a single P4 3GHz, 1G RAM, 8K L1D + 12K-uop trace cache, 512K L2, BTB 8x larger than the P3's
  • P3 results:
    – similar to 5 years ago: major bottlenecks are L1I and L2D
  • P4 results:
    – memory stalls almost entirely due to L1D and L2D stalls
    – L1D stalls higher: smaller cache and larger cache line
    – L1I stalls removed by the trace cache (especially for sequential scan, but some remain for index scan)

Hardware awareness is important!

SLIDE 22

References

  • "DBMS on a modern processor: where does time go? Revisited" – CMU Tech Report, 2004
  • Anastassia Ailamaki – VLDB'99 talk slides
SLIDE 23

Questions?