Database servers on chip multiprocessors: limitations and opportunities

SLIDE 1

Database servers on chip multiprocessors: limitations and opportunities

  • N. Hardavellas
  • N. Mancheril
  • I. Pandis
  • R. Johnson
  • A. Ailamaki
  • B. Falsafi

Presented by Benjamin Reilly, September 27, 2011

SLIDE 2

The fattened cache (and CPU)

Growing cache capacity brings growing cache latency: more data on hand, but a higher cost to retrieve it. CPUs show a similar development trend: continually larger and more complex.

SLIDE 3

OVERVIEW

  • Motivation
  • Experiment design
  • Results and observations
  • What now?
  • Summary and discussion

SLIDE 4

Dividing the CMPs

CMPs? Chip multiprocessors: several cores on one chip sharing on-chip resources (caches). Designs vary in the following (see the sketch after this list):

  • # of cores
  • # of hardware threads (“contexts”)
  • Execution order
  • Pipeline depth
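A minimal sketch of these axes as a configuration record; the struct and the example values below are illustrative assumptions, not figures from the paper:

    // Hypothetical record of the CMP design axes listed above.
    struct CmpDesign {
        unsigned cores;              // number of cores on the chip
        unsigned contexts_per_core;  // hardware threads ("contexts") per core
        bool     out_of_order;       // execution order: OoO vs. in-order
        unsigned pipeline_depth;     // number of pipeline stages
    };

    // Rough caricatures of the two camps introduced on the next slides;
    // the numbers are assumptions consistent with the slides' ranges.
    constexpr CmpDesign fat_camp  {4, 1, true, 20};   // few wide OoO cores
    constexpr CmpDesign lean_camp {8, 4, false, 6};   // many simple in-order cores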

SLIDE 5

The ‘Fat Camp’ (FC)

Key characteristics

  • Few, but powerful cores
  • Few (1-2) hardware contexts
  • OoO: out-of-order execution
  • ILP: instruction-level parallelism

[Figure: Core 0 and Core 1, each with its thread context(s)]

SLIDE 6

Hiding data stalls: FC

OoO (out-of-order execution) + ILP (instruction-level parallelism): while op 0 waits for its input, the core computes the independent operations op 1 and op 2 ahead of program order instead of idling.

[Figure: execution timeline of ops 0, 1, and 2 for the sequence a += b; b += c; d += e]
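Restating the slide's example in C++ (the function is an illustrative assumption; only the a/b/c/d/e statements come from the slide):

    // Why OoO + ILP hides a stall: if the load feeding op 0 misses in
    // cache, the core can still issue the other two statements.
    void ilp_example(long& a, long& b, long c, long& d, long e) {
        a += b;  // op 0: reads b, writes a
        b += c;  // op 1: writes b; only a WAR hazard with op 0, which
                 // register renaming removes
        d += e;  // op 2: fully independent, so it executes while op 0
                 // waits on memory
    }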

SLIDE 7

The ‘Lean Camp’ (LC)

Key characteristics

  • Many, but weaker cores
  • Several (4+) hardware contexts
  • In-order execution (simpler)

[Figure: eight lean cores, Core 0 through Core 7]

SLIDE 8

Hiding data stalls: LC

Hardware contexts are interleaved in round-robin fashion, skipping contexts that are in data stalls.

[Figure: per-context execution timeline; legend: Running, Idle (runnable), Stalled (non-runnable)]
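A small sketch of that round-robin policy; an illustrative assumption, the paper gives no code:

    #include <array>
    #include <cstddef>
    #include <iostream>

    enum class State { Running, Idle, Stalled };
    struct Context { State state; };

    // Pick the next context after `current` in round-robin order,
    // skipping contexts stalled on data; if every context is stalled,
    // stay put and let the pipeline idle this cycle.
    template <std::size_t N>
    std::size_t next_context(const std::array<Context, N>& ctx,
                             std::size_t current) {
        for (std::size_t step = 1; step <= N; ++step) {
            std::size_t cand = (current + step) % N;
            if (ctx[cand].state != State::Stalled) return cand;
        }
        return current;
    }

    int main() {
        // Context 1 is stalled, so ownership of the next cycle passes 0 -> 2.
        std::array<Context, 4> ctx{{{State::Running}, {State::Stalled},
                                    {State::Idle},    {State::Idle}}};
        std::cout << next_context(ctx, 0) << "\n";  // prints 2
    }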

SLIDE 9

(Un)saturated workloads

Workloads

  • DSS (decision support)
  • OLTP (online transaction processing)

Number of requests

  • Saturated: work always available for each hardware context
  • Unsaturated: work not always available
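Pinned down as code (a hypothetical helper, just to make the definition concrete):

    // A workload saturates the chip when concurrent requests can keep
    // every hardware context busy; names here are assumptions.
    bool is_saturated(unsigned concurrent_requests,
                      unsigned cores, unsigned contexts_per_core) {
        return concurrent_requests >= cores * contexts_per_core;
    }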

SLIDE 10

LC vs. FC Performance

LC has slower response time in unsaturated workloads: +12% when FC exploits little ILP, +70% when FC exploits high ILP.

[Chart: response time, LC vs. FC, unsaturated workloads]

SLIDE 11

LC vs. FC Performance

LC has higher throughput in saturated workloads: +70% (ILP is not significant for FC here).

[Chart: throughput, LC vs. FC, saturated workloads]

SLIDE 12

LC vs. FC Performance

Observations:

  • FC spends 46-64% of execution time on data stalls
  • At best (saturated workloads), LC spends 76-80% of its time on computation

SLIDE 13

Data stall breakdown

Larger (and hence slower) caches are decreasingly optimal. Consider three components of data cache stalls:

  • 1. Cache size

SLIDE 14

Data stall breakdown

[Charts: CPI contributions for OLTP and DSS] L2 hit stalls are responsible for an increasingly large portion of the CPI.

SLIDE 15

Data stall breakdown

  • 2. Per-chip core integration

                 SMP           CMP
  Processing     4x 1-core     1x 4-core
  L2 cache(s)    4 MB / CPU    16 MB shared

Fewer cores per chip = fewer L2 hits

SLIDE 16

Data stall breakdown

  • 3. On-chip core count

8 cores:

  • 9% superlinear increase in throughput (for DSS)

16 cores:

  • 26% sublinear decrease (OLTP)
  • Too much pressure on the L2

SLIDE 17

How do we apply this?

1. Increase parallelism

  • Divide! (more threads ⇒ more saturation)
  • Pipeline/OLP (producer-consumer pairs; see the sketch after this list)
  • Partition input (not ideal; static and complex)
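A minimal sketch of the pipeline option: one thread plays a producing operator (say, a scan), another the consuming operator (an aggregate), so two hardware contexts stay busy. Everything below is an illustrative assumption, not the paper's implementation:

    #include <condition_variable>
    #include <iostream>
    #include <mutex>
    #include <queue>
    #include <thread>

    std::queue<int> q;          // tuple buffer between the two operators
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    void producer() {           // stand-in for a table scan
        for (int tuple = 0; tuple < 1000; ++tuple) {
            { std::lock_guard<std::mutex> lk(m); q.push(tuple); }
            cv.notify_one();
        }
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_one();
    }

    void consumer() {           // stand-in for an aggregate operator
        long long sum = 0;
        for (;;) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [] { return !q.empty() || done; });
            if (q.empty() && done) break;
            int tuple = q.front();
            q.pop();
            lk.unlock();        // release the lock while doing operator work
            sum += tuple;
        }
        std::cout << "sum = " << sum << "\n";
    }

    int main() {
        std::thread p(producer), c(consumer);
        p.join();
        c.join();
    }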

SLIDE 18

How do we apply this?

2. Improve data locality

  • Reduce data stalls to help with unsaturated workloads
  • Halt producers in favour of consumers
  • Use cache-friendly algorithms (see the sketch after this list)

3. Use staged DBs

  • Partition work by groups of relational operators
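One classic example of what "cache-friendly" means here (an assumed illustration, not an algorithm from the paper): the same reduction over an N x N matrix in two traversal orders:

    #include <cstddef>
    #include <vector>

    constexpr std::size_t N = 4096;  // matrix stored row-major in a flat vector

    long long sum_row_major(const std::vector<int>& a) {
        long long s = 0;
        for (std::size_t i = 0; i < N; ++i)
            for (std::size_t j = 0; j < N; ++j)
                s += a[i * N + j];  // consecutive addresses: about one
                                    // miss per cache line
        return s;
    }

    long long sum_col_major(const std::vector<int>& a) {
        long long s = 0;
        for (std::size_t j = 0; j < N; ++j)
            for (std::size_t i = 0; i < N; ++i)
                s += a[i * N + j];  // N-element stride: can miss on
                                    // nearly every access
        return s;
    }

Both functions compute the same sum; only the access order, and hence the time spent in data stalls, differs.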

SLIDE 19

Summary & Discussion

  • 1. LC typically performs better than FC
  • LC is best under saturated workloads.
  • Is there room for FC CMPs in DB applications?
  • 2. L2 hits are a bottleneck
  • Why were DBs ignored in HW design?
  • How can we avoid incurring the cost of an L2 hit?
