
SLIDE 1

OLTP on Hardware Islands

Danica Porobic, Ippokratis Pandis*, Miguel Branco, Pınar Tözün, Anastasia Ailamaki

Data-Intensive Applications and Systems Lab, EPFL; *IBM Research - Almaden

SLIDE 2

Hardware topologies have changed

[Figure: a single core]

SLIDE 3

Hardware topologies have changed

[Figure: an SMP of four single-core processors]

Topology    Core-to-core latency
SMP         high (~100 ns)

SLIDE 4

Hardware topologies have changed

[Figure: a multicore chip (CMP) next to an SMP of single-core processors]

Topology    Core-to-core latency
CMP         low (~10 ns)
SMP         high (~100 ns)

SLIDE 5

Hardware topologies have changed

[Figure: a CMP, an SMP of CMPs forming Islands, and an SMP of single-core processors]

Topology      Core-to-core latency
CMP           low (~10 ns)
SMP of CMP    variable (10-100 ns)
SMP           high (~100 ns)

Variable latencies affect performance & predictability

SLIDE 6

Deploying OLTP on Hardware Islands

[Figure: throughput vs. % of multisite transactions in the workload. Shared-everything is stable; shared-nothing swings between the best and the worst performance. Thread-to-core assignment matters as well.]

HW + Workload -> Optimal OLTP configuration
SLIDE 7

Outline

  • Introduction
  • Hardware Islands
  • OLTP on Hardware Islands
  • Conclusions and future work


SLIDE 8

Multisocket multicores

Communication latency varies significantly

[Figure: two sockets, each with several cores with private L1/L2 caches, a shared L3, and a memory controller, connected by inter-socket links. Latency: <10 cycles within a core's private caches, ~50 cycles through the on-chip L3, ~500 cycles over inter-socket links.]
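These latencies can be probed directly. Below is a minimal sketch, assuming Linux with GCC or Clang; the core IDs and iteration count are placeholders. It pins two threads to chosen cores and ping-pongs a flag between them, approximating the core-to-core round trip:

```cpp
// ping_pong.cpp: rough probe of core-to-core round-trip latency.
// Build: g++ -O2 -pthread ping_pong.cpp -o ping_pong
#include <atomic>
#include <chrono>
#include <cstdio>
#include <pthread.h>
#include <sched.h>
#include <thread>

static std::atomic<int> flag{0};

// Pin the calling thread to one core so the OS cannot migrate it.
static void pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main() {
    const int iters = 1000000;
    const int core_a = 0, core_b = 1;  // compare same-socket vs. cross-socket pairs

    std::thread peer([&] {
        pin_to_core(core_b);
        for (int i = 0; i < iters; i++) {
            while (flag.load(std::memory_order_acquire) != 1) {}  // wait for ping
            flag.store(2, std::memory_order_release);             // pong
            while (flag.load(std::memory_order_acquire) != 0) {}  // wait for reset
        }
    });

    pin_to_core(core_a);
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; i++) {
        flag.store(1, std::memory_order_release);                 // ping
        while (flag.load(std::memory_order_acquire) != 2) {}      // wait for pong
        flag.store(0, std::memory_order_release);                 // reset
    }
    auto end = std::chrono::steady_clock::now();
    peer.join();

    std::printf("round trip: %.1f ns\n",
                std::chrono::duration<double, std::nano>(end - start).count() / iters);
    return 0;
}
```

Running it for a core pair on the same socket versus a pair on different sockets makes the order-of-magnitude gap above visible.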

SLIDE 9

Placement of application threads

[Figure: throughput of a counter microbenchmark (Mtps) and of TPC-C Payment (Ktps) on the 8-socket x 10-core and 4-socket x 6-core machines, comparing OS-chosen, Spread, and Island thread placements. Island placement wins by 39-47%; leaving placement to the OS is unpredictable.]

Islands-aware placement matters
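The two placements differ only in how worker i maps to a core. A minimal sketch, assuming Linux and socket-major core numbering (real machines may number cores differently; check lscpu or hwloc first):

```cpp
// placement.cpp: "spread" vs. "island" thread-to-core assignment.
#include <cstdio>
#include <pthread.h>
#include <sched.h>

constexpr int SOCKETS = 4;            // e.g., the 4-socket x 6-core machine
constexpr int CORES_PER_SOCKET = 6;

// Spread: round-robin workers across sockets.
int spread_core(int worker) {
    int socket = worker % SOCKETS;
    int slot   = worker / SOCKETS;
    return socket * CORES_PER_SOCKET + slot;
}

// Island: fill one socket completely before spilling to the next.
int island_core(int worker) {
    return worker;                    // identity under socket-major numbering
}

// Pin the calling thread to the chosen core.
void pin_self(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main() {
    for (int w = 0; w < 8; w++)
        std::printf("worker %d: spread -> core %d, island -> core %d\n",
                    w, spread_core(w), island_core(w));
    return 0;
}
```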

SLIDE 10

Impact of sharing data among threads

[Figure, log scale: counter microbenchmark throughput (Mtps) with a single counter, a counter per socket, and a counter per core (gaps of 18.7x and 516.8x), and local-only TPC-C Payment throughput (Ktps) for shared-everything vs. shared-nothing (4.5x), on the 8-socket x 10-core and 4-socket x 6-core machines.]

Fewer sharers lead to higher performance
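The counter results come from cache-line ping-ponging: every increment of a shared counter drags its cache line across cores, while per-core counters stay local. A minimal sketch of the two extremes; the thread and iteration counts are placeholders:

```cpp
// counters.cpp: one shared atomic counter vs. per-thread padded counters.
// Build: g++ -O2 -std=c++17 -pthread counters.cpp -o counters
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

constexpr int THREADS = 8;
constexpr long ITERS = 10'000'000;

// Pad to a full cache line so per-thread counters never share one.
struct alignas(64) PaddedCounter { std::atomic<long> value{0}; };

template <typename Body>
double run(Body body) {               // returns throughput in M increments/s
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> ts;
    for (int t = 0; t < THREADS; t++) ts.emplace_back(body, t);
    for (auto& th : ts) th.join();
    double secs = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - start).count();
    return THREADS * ITERS / secs / 1e6;
}

int main() {
    std::atomic<long> shared{0};
    std::vector<PaddedCounter> local(THREADS);

    // (a) every increment fights over one cache line
    double s = run([&](int) { for (long i = 0; i < ITERS; i++) shared++; });
    // (b) increments stay in each core's own cache line
    double l = run([&](int t) { for (long i = 0; i < ITERS; i++) local[t].value++; });

    std::printf("shared: %.1f Mtps   per-thread: %.1f Mtps\n", s, l);
    return 0;
}
```

A per-socket counter sits between the two extremes: the line still bounces, but only among the cores of one socket.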

SLIDE 11

Outline

  • Introduction
  • Hardware Islands
  • OLTP on Hardware Islands

– Experimental setup
– Read-only workloads
– Update workloads
– Impact of skew

  • Conclusions and future work


SLIDE 12

Experimental setup

  • Shore-MT

– Top-of-the-line open-source storage manager
– Enabled shared-nothing capability

  • Multisocket servers

– 4-socket, 6-core Intel Xeon E7530, 64GB RAM
– 8-socket, 10-core Intel Xeon E7-L8867, 192GB RAM

  • Disabled hyper-threading
  • OS: Red Hat Enterprise Linux 6.2, kernel 2.6.32
  • Profiler: Intel VTune Amplifier XE 2011
  • Workloads: TPC-C, microbenchmarks


SLIDE 13

Microbenchmark workload

  • Singlesite version

– Probe/update N rows from the local partition

  • Multisite version

– Probe/update 1 row from the local partition
– Probe/update N-1 rows uniformly from any partition
– Partitions may reside on the same instance

  • Input size: 10 000 rows/core
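A sketch of the two transaction types' row selection, with a hypothetical Partition stand-in for a shared-nothing instance (the real benchmark runs inside Shore-MT):

```cpp
// microbench.cpp: row selection for singlesite vs. multisite transactions.
#include <cstdint>
#include <random>
#include <vector>

struct Partition {                      // stand-in for one instance's data
    std::vector<int64_t> rows;
    void probe(int64_t row) { (void)row; /* read or update one row */ }
};

// Singlesite: touch N rows of the local partition only.
void singlesite_xct(Partition& local, int n, std::mt19937_64& rng) {
    std::uniform_int_distribution<size_t> pick(0, local.rows.size() - 1);
    for (int i = 0; i < n; i++) local.probe(local.rows[pick(rng)]);
}

// Multisite: 1 row from the local partition, N-1 rows uniformly from
// any partition (which may happen to be the local instance again).
void multisite_xct(std::vector<Partition>& parts, size_t local_id,
                   int n, std::mt19937_64& rng) {
    std::uniform_int_distribution<size_t> any_part(0, parts.size() - 1);
    Partition& local = parts[local_id];
    std::uniform_int_distribution<size_t> pick(0, local.rows.size() - 1);
    local.probe(local.rows[pick(rng)]);
    for (int i = 1; i < n; i++) {
        Partition& p = parts[any_part(rng)];
        std::uniform_int_distribution<size_t> r(0, p.rows.size() - 1);
        p.probe(p.rows[r(rng)]);
    }
}

int main() {
    std::mt19937_64 rng(42);
    std::vector<Partition> parts(24);               // e.g., 24 islands
    for (auto& p : parts) p.rows.resize(10'000);    // 10 000 rows/core
    singlesite_xct(parts[0], 10, rng);
    multisite_xct(parts, 0, 10, rng);
    return 0;
}
```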


SLIDE 14

Software System Configurations

[Figure: the cores of one machine grouped into instances]

1 Island: shared-everything, a single instance spanning all cores
4 Islands: one instance per socket on the 4-socket machine
24 Islands: shared-nothing, one instance per core
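How a configuration maps cores to instances can be sketched in a few lines. This assumes equal-sized islands and socket-major core numbering; both are assumptions for illustration, not the paper's deployment code:

```cpp
// islands.cpp: group core IDs into equal-sized instances.
#include <cstdio>
#include <initializer_list>

int island_of(int core, int n_islands, int n_cores) {
    return core / (n_cores / n_islands);  // e.g., 24 cores, 4 islands -> 6 each
}

int main() {
    const int cores = 24;                 // the 4-socket x 6-core machine
    for (int islands : {1, 4, 24}) {
        std::printf("%2d island(s):", islands);
        for (int c = 0; c < cores; c++)
            std::printf(" %d", island_of(c, islands, cores));
        std::printf("\n");
    }
    return 0;
}
```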

SLIDE 15

Increasing % of multisite xcts: reads

[Figure: throughput (KTps) vs. % of multisite transactions for 1 Island, 4 Islands, and 24 Islands. With few multisite transactions, 24 Islands leads because it uses no locks or latches while 1 Island suffers contention for shared data; as the multisite fraction grows, messaging overhead takes over, and coarser configurations need fewer messages per transaction.]

Finer-grained configurations are more sensitive to distributed transactions

SLIDE 16

Where are the bottlenecks? Read case

[Figure: time per transaction (µs) for 4 Islands, 10 rows, at 0%, 50%, and 100% multisite transactions, broken into logging, locking, communication, xct management, and xct execution.]

Communication overhead dominates

SLIDE 17

Increasing size of multisite xct: read case

[Figure: time per transaction (µs) vs. number of rows retrieved (20-100) for 1, 4, 12, and 24 Islands. Costs grow with physical contention as more instances join each transaction, and flatten once all instances are involved.]

Communication costs rise until all instances are involved in every transaction
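A back-of-the-envelope model, not from the talk, shows why the curves flatten. If a transaction draws N rows uniformly from P equal partitions, the expected number of distinct instances it touches is

```latex
\[
  \mathbb{E}[\text{instances touched}]
    = P\left(1 - \left(1 - \frac{1}{P}\right)^{N}\right)
\]
```

For P = 24 this is already about 13.7 instances at N = 20 and about 23.7 at N = 100, so beyond a modest transaction size nearly every instance participates and the communication cost stops growing.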

SLIDE 18

Increasing % of multisite xcts: updates

[Figure: throughput (KTps) vs. % of multisite transactions for 1 Island, 4 Islands, and 24 Islands. Shared-nothing configurations still avoid latches, but distributed updates require two rounds of messages, extra logging, and locks held longer.]

Distributed update transactions are more expensive
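The "two rounds of messages" is the classic two-phase commit pattern. A skeletal sketch with a hypothetical Participant interface (Shore-MT's actual distributed-commit code is more involved):

```cpp
// two_phase_commit.cpp: skeletal 2PC coordinator (hypothetical interface).
#include <cstdio>
#include <vector>

struct Participant {
    bool prepare() { /* force PREPARE log record */ return true; }  // round 1
    void commit()  { /* force COMMIT record, release locks */ }     // round 2
    void abort()   { /* roll back, release locks */ }
};

// Locks at every participant stay held through both rounds, which is
// why distributed updates hold locks longer than single-site ones.
bool two_phase_commit(std::vector<Participant>& parts) {
    bool all_yes = true;
    for (auto& p : parts) all_yes = p.prepare() && all_yes;  // round 1: votes
    for (auto& p : parts) {                                  // round 2: decision
        if (all_yes) p.commit(); else p.abort();
    }
    return all_yes;
}

int main() {
    std::vector<Participant> parts(3);   // three shared-nothing instances
    std::printf("committed: %d\n", two_phase_commit(parts));
    return 0;
}
```

The two loops are the two message rounds, and each prepare/commit implies a forced log write, matching the extra logging the slide points to.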

SLIDE 19

Where are the bottlenecks? Update case

[Figure: time per transaction (µs) for 4 Islands, 10 rows, at 0%, 50%, and 100% multisite transactions, broken into logging, locking, communication, xct management, and xct execution.]

Communication overhead dominates

SLIDE 20

Increasing size of multisite xct: update case

[Figure: time per transaction (µs) vs. number of rows updated (20-100) for 1, 4, 12, and 24 Islands. Logging stays efficient thanks to Aether*, but contention increases as more instances participate in each transaction.]

*R. Johnson, et al.: Aether: a scalable approach to logging. VLDB 2010

Shared-everything exposes constructive interference

SLIDE 21

Effects of skewed input

[Figure: throughput (KTps) vs. skew factor (0.25-1) for 1 Island, 4 Islands, and 24 Islands, for local-only and 50% multisite workloads. With high skew, few instances are highly loaded and hot data becomes contended; larger instances can balance the load.]

4 Islands effectively balance skew and contention
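Assuming the skew factor parameterizes a Zipfian access distribution, as is typical in such benchmarks (an assumption here, since the slide does not say), a minimal sampler looks like this:

```cpp
// zipf.cpp: sample partition IDs with Zipfian skew (assumed model).
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

// Build the CDF of a Zipf(s) distribution over n items once, then
// sample by binary search. s = 0 is uniform; larger s is more skewed.
struct Zipf {
    std::vector<double> cdf;
    Zipf(int n, double s) : cdf(n) {
        double sum = 0;
        for (int i = 0; i < n; i++) sum += 1.0 / std::pow(i + 1, s);
        double acc = 0;
        for (int i = 0; i < n; i++) {
            acc += 1.0 / std::pow(i + 1, s) / sum;
            cdf[i] = acc;
        }
    }
    int operator()(std::mt19937_64& rng) {
        double u = std::uniform_real_distribution<>(0, 1)(rng);
        return std::lower_bound(cdf.begin(), cdf.end(), u) - cdf.begin();
    }
};

int main() {
    std::mt19937_64 rng(1);
    Zipf zipf(24, 0.75);                      // 24 partitions, skew 0.75
    std::vector<int> hits(24, 0);
    for (int i = 0; i < 100000; i++) hits[zipf(rng)]++;
    std::printf("hottest partition gets %d of 100000 accesses\n", hits[0]);
    return 0;
}
```

At s = 0 every partition is equally likely; near s = 1 the hottest partition absorbs a large share of the accesses, which is what concentrates load on a few instances in the figure.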

SLIDE 22

OLTP systems on Hardware Islands

  • Shared-everything: stable, but non-optimal
  • Shared-nothing: fast, but sensitive to workload
  • OLTP Islands: a robust middle ground

– Runs on close cores
– Small instances limit contention between threads
– Few instances simplify partitioning

  • Future work:

– Automatically choose and set up the optimal configuration
– Dynamically adjust to workload changes

Thank you!
