 
OLTP on Hardware Islands
Danica Porobic, Ippokratis Pandis*, Miguel Branco, Pınar Tözün, Anastasia Ailamaki
Data-Intensive Application and Systems Lab, EPFL
*IBM Research - Almaden
Hardware topologies have changed
• SMP: high core-to-core latency (~100ns)
• CMP: low core-to-core latency (~10ns)
• SMP of CMP (Islands): variable core-to-core latency (10-100ns)
• Variable latencies affect performance & predictability
Deploying OLTP on Hardware Islands
• [Figure: throughput vs. % multisite transactions in workload, for Shared-Nothing and Shared-Everything]
• [Figure: performance vs. thread-to-core assignment, ranging from worst to best]
• HW + Workload -> Optimal OLTP configuration
Outline
• Introduction
• Hardware Islands
• OLTP on Hardware Islands
• Conclusions and future work
Multisocket multicores
• [Figure: cores with L1 and L2 caches, L3 caches, memory controllers, and inter-socket links; latencies labeled <10 cycles, 50 cycles, and 500 cycles]
• Communication latency varies significantly
Placement of application threads
• [Figure: Counter microbenchmark, 8 sockets x 10 cores; throughput (Mtps) for OS, Spread, and Island placement]
• [Figure: TPC-C Payment, 4 sockets x 6 cores; throughput (Ktps) for OS, Spread, and Island placement]
• Figure annotations: differences of 39%, 40%, and 47%; "Unpredictable"
• Islands-aware placement matters
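
The slide does not define "Spread" and "Island" beyond their names. The sketch below shows one plausible reading: "Island" packs threads onto as few sockets as possible, "Spread" deals them out across sockets. The function names and the assumption that cores are numbered socket by socket are illustrative and not taken from the talk.

```python
def place_threads(num_threads, sockets, cores_per_socket, policy):
    """Return a list of core ids, one per thread.

    Assumption: cores are numbered socket by socket, so core c sits on
    socket c // cores_per_socket. Real machines may number cores differently.
    """
    total = sockets * cores_per_socket
    if not 0 <= num_threads <= total:
        raise ValueError("num_threads must fit on the machine")
    if policy == "island":
        # Fill one socket completely before moving on to the next.
        return list(range(num_threads))
    if policy == "spread":
        # Deal threads out round-robin across sockets.
        return [
            (i % sockets) * cores_per_socket + i // sockets
            for i in range(num_threads)
        ]
    raise ValueError(f"unknown policy: {policy}")


def sockets_used(cores, cores_per_socket):
    """Return the sorted list of sockets that the given cores belong to."""
    return sorted({c // cores_per_socket for c in cores})
```

On Linux, each worker could then be pinned to its core with `os.sched_setaffinity(0, {core})`; the "OS" placement corresponds to not pinning at all and leaving the choice to the scheduler.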
Impact of sharing data among threads
• [Figure: Counter microbenchmark, 8 sockets x 10 cores; throughput (Mtps) for counter per core, counter per socket, and single counter]
• [Figure: TPC-C Payment, local-only, 4 sockets x 6 cores; throughput (Ktps, log scale) for shared nothing and shared everything]
• Figure annotations: 18.7x, 516.8x, 4.5x
• Fewer sharers lead to higher performance
Outline
• Introduction
• Hardware Islands
• OLTP on Hardware Islands
  – Experimental setup
  – Read-only workloads
  – Update workloads
  – Impact of skew
• Conclusions and future work
Experimental setup
• Shore-MT
  – Top-of-the-line open source storage manager
  – Enabled shared-nothing capability
• Multisocket servers
  – 4-socket, 6-core Intel Xeon E7530, 64GB RAM
  – 8-socket, 10-core Intel Xeon E7-L8867, 192GB RAM
• Disabled hyper-threading
• OS: Red Hat Enterprise Linux 6.2, kernel 2.6.32
• Profiler: Intel VTune Amplifier XE 2011
• Workloads: TPC-C, microbenchmarks
Microbenchmark workload
• Singlesite version
  – Probe/update N rows from the local partition
• Multisite version
  – Probe/update 1 row from the local partition
  – Probe/update N-1 rows uniformly from any partition
  – Partitions may reside on the same instance
• Input size: 10,000 rows/core
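
The access pattern described on this slide can be sketched as a small generator. This is a minimal illustration rather than the benchmark driver used in the talk: the function names, the seeding, and the assumption of one partition per core are hypothetical.

```python
import random

ROWS_PER_PARTITION = 10_000  # slide: 10,000 rows/core; one partition per core is assumed


def make_transaction(local_partition, num_partitions, n_rows, multisite, rng):
    """Return the (partition, row) pairs touched by one transaction (n_rows >= 1)."""
    def pick(partition):
        return (partition, rng.randrange(ROWS_PER_PARTITION))

    if not multisite:
        # Singlesite: all N rows come from the local partition.
        return [pick(local_partition) for _ in range(n_rows)]
    # Multisite: 1 local row, then N-1 rows drawn uniformly from any partition
    # (which may again be the local one).
    accesses = [pick(local_partition)]
    for _ in range(n_rows - 1):
        accesses.append(pick(rng.randrange(num_partitions)))
    return accesses


def make_workload(num_partitions, n_rows, pct_multisite, count, seed=0):
    """Generate `count` transactions, pct_multisite percent of them multisite."""
    rng = random.Random(seed)
    workload = []
    for _ in range(count):
        local = rng.randrange(num_partitions)
        multisite = rng.random() * 100 < pct_multisite
        workload.append(
            make_transaction(local, num_partitions, n_rows, multisite, rng)
        )
    return workload
```

For example, `make_workload(24, 10, 50, 1000)` would produce 1000 ten-row transactions for a 24-partition database, roughly half of them multisite.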
Software System Configurations
• [Figure: three deployments, from 1 Island (shared-everything) through 4 Islands to 24 Islands (shared-nothing)]
Increasing % of multisite xcts: reads
• [Figure: throughput (KTps) vs. % multisite transactions, for 1 Island, 4 Islands, and 24 Islands]
• Figure annotations: no locks or latches; messaging overhead; fewer messages per transaction; contention for shared data
• Finer grained configurations are more sensitive to distributed transactions
Where are the bottlenecks? Read case
• [Figure: time per transaction (µs) at 0%, 50%, and 100% multisite transactions; 4 Islands, 10 rows; broken down into xct execution, xct management, communication, locking, and logging]
• Communication overhead dominates
Increasing size of multisite xct: read case
• [Figure: time per transaction (µs) vs. number of rows retrieved, for 1 Island, 4 Islands, 12 Islands, and 24 Islands]
• Figure annotations: physical contention; more instances per transaction; all instances involved in a transaction
• Communication costs rise until all instances are involved in every transaction
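
The saturation effect named in the takeaway can be estimated with a short calculation. The sketch below assumes the multisite access pattern from the microbenchmark slide (1 local row plus N-1 rows drawn uniformly) and additionally assumes that partitions are spread evenly over instances, so a uniform draw over partitions is also uniform over instances. The function name is illustrative.

```python
def expected_instances(n_rows, num_instances):
    """Expected number of distinct instances touched by one multisite transaction.

    The local instance is always involved. Each of the other k-1 instances is
    missed by all N-1 uniform draws with probability (1 - 1/k) ** (N - 1).
    """
    if n_rows < 1 or num_instances < 1:
        raise ValueError("n_rows and num_instances must be at least 1")
    k = num_instances
    miss = (1 - 1 / k) ** (n_rows - 1)
    return 1 + (k - 1) * (1 - miss)
```

Under these assumptions a 10-row transaction already touches nearly all instances in a 4-Island deployment, while a 24-Island deployment approaches full involvement only for much larger transactions.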
Increasing % of multisite xcts: updates
• [Figure: throughput (KTps) vs. % multisite transactions, for 1 Island, 4 Islands, and 24 Islands]
• Figure annotations: no latches; 2 rounds of messages, plus extra logging, plus locks held longer
• Distributed update transactions are more expensive
Where are the bottlenecks? Update case
• [Figure: time per transaction (µs) at 0%, 50%, and 100% multisite transactions; 4 Islands, 10 rows; broken down into xct execution, xct management, communication, locking, and logging]
• Communication overhead dominates
Increasing size of multisite xct: update case
• [Figure: time per transaction (µs) vs. number of rows updated, for 1 Island, 4 Islands, 12 Islands, and 24 Islands]
• Figure annotations: more instances per transaction; increased contention; efficient logging with Aether*
• Shared everything exposes constructive interference
*R. Johnson, et al: Aether: a scalable approach to logging, VLDB 2010
Effects of skewed input
• [Figure: two panels, "Local only" and "50% multisite"; throughput (KTps) vs. skew factor from 0 to 1, for 1 Island, 4 Islands, and 24 Islands]
• Figure annotations: few instances are highly loaded; larger instances can balance load; contention for hot data; still few hot instances
• 4 Islands effectively balance skew and contention
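
A toy model can illustrate why larger instances absorb skew better. The slide does not say how the skew factor is defined; the sketch below assumes a Zipf-like distribution in which the partition of rank r receives load proportional to 1 / r^skew, and it groups consecutive partitions into instances. Both choices, and all names, are assumptions made for illustration, not the setup used in the talk.

```python
def zipf_shares(num_partitions, skew):
    """Fraction of accesses per partition under an assumed Zipf-like skew."""
    weights = [1 / (rank ** skew) for rank in range(1, num_partitions + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def instance_loads(shares, num_instances):
    """Sum partition shares per instance, grouping consecutive partitions."""
    if len(shares) % num_instances != 0:
        raise ValueError("partitions must divide evenly among instances")
    per_instance = len(shares) // num_instances
    return [
        sum(shares[i * per_instance:(i + 1) * per_instance])
        for i in range(num_instances)
    ]


def imbalance(loads):
    """Load of the hottest instance relative to the average instance."""
    return max(loads) / (sum(loads) / len(loads))
```

Comparing `imbalance(instance_loads(zipf_shares(24, 1.0), 24))` with the same call for 4 instances shows the hottest of 24 single-partition instances carrying a larger multiple of the average load than the hottest of 4 larger instances, in line with the slide's annotations.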
OLTP systems on Hardware Islands
• Shared-everything: stable, but non-optimal
• Shared-nothing: fast, but sensitive to workload
• OLTP Islands: a robust middle ground
  – Run on close cores
  – Small instances limit contention between threads
  – Few instances simplify partitioning
• Future work:
  – Automatically choose and set up the optimal configuration
  – Dynamically adjust to workload changes
Thank you!