Database servers on chip multiprocessors: limitations and opportunities

SLIDE 1

Database servers on chip multiprocessors: limitations and opportunities

  • N. Hardavellas
  • N. Mancheril
  • I. Pandis
  • R. Johnson
  • A. Ailamaki
  • B. Falsafi

Presented by Benjamin Reilly, September 27, 2011

SLIDE 2

The fattened cache (and CPU)

Growing cache capacity brings growing cache latency: more data on hand, but a higher cost to retrieve it. CPUs show a similar development trend: continually larger and more complex.

SLIDE 3

OVERVIEW

  • Motivation
  • Experiment design
  • Results and observations
  • What now?
  • Summary and discussion

SLIDE 4

Dividing the CMPs

CMPs? Chip multiprocessors: several cores on one chip sharing on-chip resources (caches). Designs vary in the following (see the sketch after this list):

  • # of cores
  • # of hardware threads (“contexts”)
  • Execution order
  • Pipeline depth
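A minimal sketch of these axes as a configuration record; the struct and the example values below are illustrative assumptions, not figures from the paper:

    // Hypothetical record of the CMP design axes listed above.
    struct CmpDesign {
        unsigned cores;              // number of cores on the chip
        unsigned contexts_per_core;  // hardware threads ("contexts") per core
        bool     out_of_order;       // execution order: OoO vs. in-order
        unsigned pipeline_depth;     // number of pipeline stages
    };

    // Rough caricatures of the two camps introduced on the next slides;
    // the numbers are assumptions consistent with the slides' ranges.
    constexpr CmpDesign fat_camp  {4, 1, true, 20};   // few wide OoO cores
    constexpr CmpDesign lean_camp {8, 4, false, 6};   // many simple in-order cores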

SLIDE 5

The ‘Fat Camp’ (FC)

Key characteristics

  • Few, but powerful cores
  • Few (1-2) hardware contexts
  • OoO: out-of-order execution
  • ILP: instruction-level parallelism

[Figure: Core 0 and Core 1, each with its thread context(s)]

SLIDE 6

Hiding data stalls: FC

OoO (out-of-order execution) + ILP (instruction-level parallelism): while op 0 waits for its input, the core computes the independent operations op 1 and op 2 ahead of program order instead of idling.

[Figure: execution timeline of ops 0, 1, and 2 for the sequence a += b; b += c; d += e]
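Restating the slide's example in C++ (the function is an illustrative assumption; only the a/b/c/d/e statements come from the slide):

    // Why OoO + ILP hides a stall: if the load feeding op 0 misses in
    // cache, the core can still issue the other two statements.
    void ilp_example(long& a, long& b, long c, long& d, long e) {
        a += b;  // op 0: reads b, writes a
        b += c;  // op 1: writes b; only a WAR hazard with op 0, which
                 // register renaming removes
        d += e;  // op 2: fully independent, so it executes while op 0
                 // waits on memory
    }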

SLIDE 7

The ‘Lean Camp’ (LC)

Key characteristics

  • Many, but weaker cores
  • Several (4+) hardware contexts
  • In-order execution (simpler)

[Figure: eight lean cores, Core 0 through Core 7]

SLIDE 8

Hiding data stalls: LC

Hardware contexts are interleaved in round-robin fashion, skipping contexts that are in data stalls.

[Figure: per-context execution timeline; legend: Running, Idle (runnable), Stalled (non-runnable)]
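A small sketch of that round-robin policy; an illustrative assumption, the paper gives no code:

    #include <array>
    #include <cstddef>
    #include <iostream>

    enum class State { Running, Idle, Stalled };
    struct Context { State state; };

    // Pick the next context after `current` in round-robin order,
    // skipping contexts stalled on data; if every context is stalled,
    // stay put and let the pipeline idle this cycle.
    template <std::size_t N>
    std::size_t next_context(const std::array<Context, N>& ctx,
                             std::size_t current) {
        for (std::size_t step = 1; step <= N; ++step) {
            std::size_t cand = (current + step) % N;
            if (ctx[cand].state != State::Stalled) return cand;
        }
        return current;
    }

    int main() {
        // Context 1 is stalled, so ownership of the next cycle passes 0 -> 2.
        std::array<Context, 4> ctx{{{State::Running}, {State::Stalled},
                                    {State::Idle},    {State::Idle}}};
        std::cout << next_context(ctx, 0) << "\n";  // prints 2
    }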

SLIDE 9

(Un)saturated workloads

Workloads

  • DSS (decision support)
  • OLTP (online transaction processing)

Number of requests

  • Saturated: work always available for each hardware context
  • Unsaturated: work not always available
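Pinned down as code (a hypothetical helper, just to make the definition concrete):

    // A workload saturates the chip when concurrent requests can keep
    // every hardware context busy; names here are assumptions.
    bool is_saturated(unsigned concurrent_requests,
                      unsigned cores, unsigned contexts_per_core) {
        return concurrent_requests >= cores * contexts_per_core;
    }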

SLIDE 10

LC vs. FC Performance

LC has slower response time in unsaturated workloads: +12% when FC exploits little ILP, +70% when FC exploits high ILP.

[Chart: response time, LC vs. FC, unsaturated workloads]

SLIDE 11

LC vs. FC Performance

LC has higher throughput in saturated workloads: +70% (ILP is not significant for FC here).

[Chart: throughput, LC vs. FC, saturated workloads]

SLIDE 12

LC vs. FC Performance

Observations:

  • FC spends 46-64% of execution time on data stalls
  • At best (saturated workloads), LC spends 76-80% of its time on computation

SLIDE 13

Data stall breakdown

Larger (and hence slower) caches are decreasingly optimal. Consider three components of data cache stalls:

  • 1. Cache size

SLIDE 14

Data stall breakdown

[Charts: CPI contributions for OLTP and DSS] L2 hit stalls are responsible for an increasingly large portion of the CPI.

SLIDE 15

Data stall breakdown

  • 2. Per-chip core integration

                 SMP           CMP
  Processing     4x 1-core     1x 4-core
  L2 cache(s)    4 MB / CPU    16 MB shared

Fewer cores per chip = fewer L2 hits

SLIDE 16

Data stall breakdown

  • 3. On-chip core count

8 cores:

  • 9% superlinear increase in throughput (for DSS)

16 cores:

  • 26% sublinear decrease (OLTP)
  • Too much pressure on the L2

SLIDE 17

How do we apply this?

1. Increase parallelism

  • Divide! (more threads ⇒ more saturation)
  • Pipeline/OLP (producer-consumer pairs; see the sketch after this list)
  • Partition input (not ideal; static and complex)
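A minimal sketch of the pipeline option: one thread plays a producing operator (say, a scan), another the consuming operator (an aggregate), so two hardware contexts stay busy. Everything below is an illustrative assumption, not the paper's implementation:

    #include <condition_variable>
    #include <iostream>
    #include <mutex>
    #include <queue>
    #include <thread>

    std::queue<int> q;          // tuple buffer between the two operators
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    void producer() {           // stand-in for a table scan
        for (int tuple = 0; tuple < 1000; ++tuple) {
            { std::lock_guard<std::mutex> lk(m); q.push(tuple); }
            cv.notify_one();
        }
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_one();
    }

    void consumer() {           // stand-in for an aggregate operator
        long long sum = 0;
        for (;;) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [] { return !q.empty() || done; });
            if (q.empty() && done) break;
            int tuple = q.front();
            q.pop();
            lk.unlock();        // release the lock while doing operator work
            sum += tuple;
        }
        std::cout << "sum = " << sum << "\n";
    }

    int main() {
        std::thread p(producer), c(consumer);
        p.join();
        c.join();
    }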

SLIDE 18

How do we apply this?

2. Improve data locality

  • Reduce data stalls to help with unsaturated workloads
  • Halt producers in favour of consumers
  • Use cache-friendly algorithms (see the sketch after this list)

3. Use staged DBs

  • Partition work by groups of relational operators
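One classic example of what "cache-friendly" means here (an assumed illustration, not an algorithm from the paper): the same reduction over an N x N matrix in two traversal orders:

    #include <cstddef>
    #include <vector>

    constexpr std::size_t N = 4096;  // matrix stored row-major in a flat vector

    long long sum_row_major(const std::vector<int>& a) {
        long long s = 0;
        for (std::size_t i = 0; i < N; ++i)
            for (std::size_t j = 0; j < N; ++j)
                s += a[i * N + j];  // consecutive addresses: about one
                                    // miss per cache line
        return s;
    }

    long long sum_col_major(const std::vector<int>& a) {
        long long s = 0;
        for (std::size_t j = 0; j < N; ++j)
            for (std::size_t i = 0; i < N; ++i)
                s += a[i * N + j];  // N-element stride: can miss on
                                    // nearly every access
        return s;
    }

Both functions compute the same sum; only the access order, and hence the time spent in data stalls, differs.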

SLIDE 19

Summary & Discussion

  • 1. LC typically performs better than FC
  • LC is best under saturated workloads.
  • Is there room for FC CMPs in DB applications?
  • 2. L2 hits are a bottleneck
  • Why were DBs ignored in HW design?
  • How can we avoid incurring the cost of an L2 hit?
