
SLIDE 1

Dynamic Fine-Grained Scheduling for Energy-Efficient Main-Memory Queries

Iraklis Psaroudakis (EPFL, SAP AG), Thomas Kissinger (TU Dresden), Danica Porobic (EPFL), Thomas Ilsche (TU Dresden), Erietta Liarou (EPFL), Pinar Tözün (EPFL), Anastasia Ailamaki (EPFL), Wolfgang Lehner (TU Dresden)

SLIDE 2

We need to make the DBMS power-aware

Why care about power?


Monthly datacenter costs [J. R. Hamilton]:

  • Servers: 57%
  • Networking equipment: 8%
  • Power distribution & cooling: 18%
  • Power: 13%
  • Other: 4%

About 30% of the costs are power-related, and the dynamic fraction is increasing

[Chart: power vs. utilization, servers today vs. the ideal of energy proportionality]

Getting there:

  • Power management features
  • Power-aware software
SLIDE 3

Power management features

  • Dynamic voltage and frequency scaling (DVFS)
  • Turbo boost
  • Idle states (C-states)
  • Power-related H/W counters


[Diagram: frequency range, DVFS from 1.2GHz to 2.9GHz, Turbo Boost above 2.9GHz]

We can exploit these to improve energy efficiency

SLIDE 4

Current approaches

  • Black box: treat the DBMS as an opaque application

– e.g. dynamic concurrency throttling [TPDS13], which gives unpredictable behavior

  • Query optimizer: add power costs to the cost model [ICDE10]

– coarse-grained, without low-level tuning

We need fine-grained energy-awareness in the database

SLIDE 5

Fine-grained energy-aware scheduling

  • Parameters:

– parallelism
– thread placement
– data placement
– dynamic voltage and frequency scaling (DVFS)

[Diagram: example query plan]

How do you schedule this query plan?
Calibration of operators under different parameters

SLIDE 6

Concurrent partitioned scans

  • Each thread scans 128MB of integers for 5 seconds
  • Maximize

performance per power = throughput / power

– under different parallelism, scheduling, and frequency settings

  • Machine

– Two 8-core Intel Xeon E5-2690, HT enabled, 64GB RAM, frequencies from 1.2GHz to 2.9GHz

  • Power measurements

– Hardware performance counters: RAPL (CPU & DRAM)
– External equipment

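The RAPL counters above expose cumulative energy in microjoules, so average power is an energy delta divided by the sampling interval. A minimal sketch, assuming the Linux powercap sysfs interface (the `intel-rapl:0` path is platform-dependent):

```python
import time

# Assumed Linux sysfs location of the package-level RAPL energy counter
# (cumulative microjoules; the counter wraps around). Path varies by platform.
RAPL_ENERGY = "/sys/class/powercap/intel-rapl:0/energy_uj"

def average_power_watts(e0_uj, e1_uj, seconds, max_range_uj=2**32):
    """Average power over an interval from two cumulative energy readings."""
    delta_uj = e1_uj - e0_uj
    if delta_uj < 0:            # counter wrapped between the two samples
        delta_uj += max_range_uj
    return delta_uj / 1e6 / seconds

def sample_package_power(seconds=1.0):
    """Read the RAPL counter twice and report average watts in between."""
    with open(RAPL_ENERGY) as f:
        e0 = int(f.read())
    time.sleep(seconds)
    with open(RAPL_ENERGY) as f:
        e1 = int(f.read())
    return average_power_watts(e0, e1, seconds)
```

Throughput per watt then follows by dividing an operator's measured throughput by this power figure. The wraparound range used here is an assumption; the true value is published in the adjacent `max_energy_range_uj` file.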
SLIDE 7

[Chart: throughput per watt vs. # threads, "Auto" frequency, power measured via RAPL]

Socket-fill scheduling

memory bandwidth saturation

[Diagram: socket-fill placement, threads fill Socket 1 (Cores 1–8 & HT) before Socket 2 (Cores 9–16 & HT)]

SLIDE 8

[Chart: throughput per watt vs. # threads, "Auto" frequency, measured via RAPL and via external equipment]

Socket-fill scheduling

constant difference between RAPL and external measurements

[Diagram: socket-fill placement, threads fill Socket 1 (Cores 1–8 & HT) before Socket 2 (Cores 9–16 & HT)]

SLIDE 9

[Chart: throughput per watt vs. # threads at 1.2GHz, 2.0GHz, 2.9GHz, and Auto]

Socket-fill scheduling

the best frequency is an intermediate one; each frequency saturates at a different point

[Diagram: socket-fill placement, threads fill Socket 1 (Cores 1–8 & HT) before Socket 2 (Cores 9–16 & HT)]

SLIDE 10

Socket-fill HT scheduling


[Chart: throughput per watt vs. # threads at 1.2GHz, 2.0GHz, 2.9GHz, and Auto]

HT draws negligible power

[Diagram: socket-fill HT placement, both hardware threads of each core are used before moving to the next core]

SLIDE 11

Socket-wise scheduling


[Chart: throughput per watt vs. # threads at 1.2GHz, 2.0GHz, 2.9GHz, and Auto]

avoids socket-specific bandwidth saturation

[Diagram: socket-wise placement, threads alternate between Socket 1 and Socket 2, one core at a time]

SLIDE 12

Socket-wise HT scheduling


[Chart: throughput per watt vs. # threads at 1.2GHz, 2.0GHz, 2.9GHz, and Auto]

best energy efficiency (1.3×)

[Diagram: socket-wise HT placement, both hardware threads of a core are used, alternating between Socket 1 and Socket 2]
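The four schedules on slides 7–12 differ only in the order in which worker threads are assigned to hardware contexts. A minimal sketch of that ordering, assuming a simplified 2-socket, 8-core, 2-hardware-thread model (the numbering is illustrative, not the OS's):

```python
# Simplified model of the experiment machine: 2 sockets x 8 cores x 2 HW threads.
SOCKETS, CORES, THREADS = 2, 8, 2

def placement_order(socket_wise=False, use_ht=False):
    """Order in which worker threads are assigned to (socket, core, hw_thread).

    socket-fill (default): fill one socket completely before the next.
    socket-wise          : alternate sockets to spread memory-bandwidth demand.
    use_ht               : occupy both hardware threads of a core early.
    """
    if use_ht:
        # both HW threads of a core come before the next core
        slots = [(s, c, t) for s in range(SOCKETS)
                           for c in range(CORES)
                           for t in range(THREADS)]
    else:
        # all physical cores first, HW-thread siblings last
        slots = [(s, c, t) for t in range(THREADS)
                           for s in range(SOCKETS)
                           for c in range(CORES)]
    if socket_wise:
        # interleave the sockets' slot lists (in sibling-pair chunks when
        # hyper-threading is used, so a core's two threads stay adjacent)
        step = THREADS if use_ht else 1
        per_socket = [[x for x in slots if x[0] == s] for s in range(SOCKETS)]
        chunks = [[ps[i:i + step] for i in range(0, len(ps), step)]
                  for ps in per_socket]
        slots = [x for group in zip(*chunks) for chunk in group for x in chunk]
    return slots
```

On Linux, each generated slot would then be mapped to an OS CPU id and pinned with `os.sched_setaffinity`.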

SLIDE 13

Parallel aggregation

  • a = Σᵢ(bᵢ + cᵢ), over 4GB arrays

  • Minimize

energy-delay product (EDP) = response time (s) × energy (kJ)

– under different parallelism, scheduling, and memory placement

  • Machine

– Two 8-core Intel Xeon E5-2640, HT disabled, 256GB of RAM

  • Memory placement

– On the first socket
– Interleaved
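The metric above is simple enough to sketch directly; the two measurements below are hypothetical numbers for illustration, not the paper's results:

```python
def energy_delay_product(response_time_s, energy_kj):
    """EDP = response time (s) x energy (kJ); lower is more efficient."""
    return response_time_s * energy_kj

# Hypothetical example: socket-wise placement spreads the bandwidth-bound
# aggregation over both sockets' memory controllers, so it finishes sooner
# at similar power and its EDP is lower.
edp_fill = energy_delay_product(response_time_s=8.0, energy_kj=1.6)
edp_wise = energy_delay_product(response_time_s=5.0, energy_kj=1.5)
assert edp_wise < edp_fill
```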

SLIDE 14

Parallel aggregation


[Charts: EDP (kJ × sec, log scale) vs. # threads for socket-fill and socket-wise scheduling, with memory interleaved (left) and memory on the first socket (right)]

when bandwidth-constrained, socket-wise is better

SLIDE 15

Main-memory memory-bound operations

  • Intermediate frequency has best efficiency

– Different saturation points

  • Avoid memory bandwidth saturation

– by data and thread placement

  • Up to 4× better energy efficiency

SLIDE 16

Fine-grained energy awareness


[Diagram: feedback loop between measurements, calibration analysis, and runtime decisions (this paper)]

  • Measurements: power, CPU utilization, memory utilization (hardware counters and/or external equipment)
  • Calibration analysis: of operators and parameters (e.g. energy efficiency vs. # threads, power over time)
  • Runtime decisions: scheduling, resource allocation, power management (parallelism, data & thread placement, DVFS)
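The measurement-calibration-decision loop can be sketched as a lookup over calibrated settings; the table entries below are invented for illustration:

```python
# Hypothetical calibration results: (threads, freq_ghz, placement) -> measured
# throughput per watt. Real entries would come from the measurement phase.
calibration = {
    (8,  1.2, "socket-fill"): 2.1,
    (8,  2.0, "socket-wise"): 3.4,
    (16, 2.0, "socket-wise"): 3.9,
    (16, 2.9, "socket-wise"): 3.1,
}

def best_config(table):
    """Runtime decision: pick the calibrated setting with the best efficiency."""
    return max(table, key=table.get)

print(best_config(calibration))   # (16, 2.0, 'socket-wise')
```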

Thank you!

SLIDE 17

References

  • [J. R. Hamilton] J. R. Hamilton. Internet-Scale Datacenter Economics: Where the Costs and Opportunities Lie. HPTS, 2011.
  • [TPDS13] D. Li, B. R. de Supinski, M. Schulz, D. S. Nikolopoulos, and K. W. Cameron. Strategies for energy-efficient resource management of hybrid programming models. IEEE TPDS, 24(1):144-157, 2013.
  • [ICDE10] Z. Xu, Y.-C. Tu, and X. Wang. Exploring power-performance tradeoffs in database systems. In ICDE, pages 485-496, 2010.
