
Application-controlled Frequency Scaling Jons-Tobias Wamhoff - PowerPoint PPT Presentation

Application-controlled Frequency Scaling. Jons-Tobias Wamhoff, Stephan Diestelhorst, Christof Fetzer (Technische Universität Dresden, Germany); Patrick Marlier, Pascal Felber (Université de Neuchâtel, Switzerland); Dave Dice (Oracle Labs, USA)


  1. Application-controlled Frequency Scaling
     Jons-Tobias Wamhoff, Stephan Diestelhorst, Christof Fetzer (Technische Universität Dresden, Germany)
     Patrick Marlier, Pascal Felber (Université de Neuchâtel, Switzerland)
     Dave Dice (Oracle Labs, USA)

  2. Overview
     • Dynamic voltage and frequency scaling (DVFS)
       - traditionally: used to save energy or boost sequential bottlenecks/serial peak loads
       - today: improve performance by exposing asymmetric properties of applications
     • Outline
       - Recap DVFS features on current x86 multicores
       - DVFS properties: latency and power
       - Applying DVFS at the application level

  3. P- and C-states
     • P-states: performance states
       - predefined frequency/voltage pairs
       - controlled through machine-specific registers (MSRs, privileged rdmsr / wrmsr)
     • C-states: power states
       - trade entry/wakeup latency for higher power savings
       - entered by hlt or monitor / mwait
     [Figure: P-states ordered by frequency/voltage (P turbo, …, P base, …, P slow); C-states C0 and halted C1-Cn]
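As an illustration of the MSR route mentioned on this slide, the following is a minimal sketch of reading one 64-bit MSR from user space. It assumes a Linux system with the `msr` kernel module loaded, which exposes `/dev/cpu/<N>/msr` to privileged users; the helper name `read_msr` and its `path` override are hypothetical and exist only for this example. The register `0x198` is Intel's `IA32_PERF_STATUS`; AMD parts use different register numbers, and the bit layout of the returned value is model-specific, so it is not decoded here.

```python
import os
import struct

# Intel architectural MSR holding the current performance status.
IA32_PERF_STATUS = 0x198


def read_msr(register, cpu=0, path=None):
    """Return the raw 64-bit value of one MSR.

    `path` is a hypothetical override so the function can be pointed at
    a regular file; by default it opens the msr device of the given CPU,
    which requires root privileges.
    """
    dev = path if path is not None else f"/dev/cpu/{cpu}/msr"
    fd = os.open(dev, os.O_RDONLY)
    try:
        # The msr device interprets the file offset as the register number.
        raw = os.pread(fd, 8, register)
    finally:
        os.close(fd)
    if len(raw) != 8:
        raise OSError(f"short read from {dev} at offset {register:#x}")
    return struct.unpack("<Q", raw)[0]
```

On a suitable machine, `read_msr(IA32_PERF_STATUS, cpu=0)` run as root would return the raw register contents. Writing a P-state request follows the same pattern with a write at the offset of the control register, and is the privileged `wrmsr` path the slide refers to.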

  4. AMD Turbo CORE & Intel Turbo Boost
     • Voltage and frequency domain: module vs. package
     • Boosting: deterministic vs. thermal
     • AMD only: asymmetric frequencies with manual boost
     [Figure: AMD (x86, FPU, x86) and Intel (HT, HT) core diagrams annotated with per-core states P base, P turbo, P slow and ≥ C1]

  5. Evaluation Setup
     • Critical sections (CS) protected by MCS queue lock
     • Decorations on acquire/release → trigger DVFS
     • Variable size of CS → amortize DVFS cost
     • Effective CS frequency: $f_{CS} = f_{base} \cdot \frac{t_{CS}}{t_{A+CS+R}}$
     • Energy for 1 hour at P base: $E_{NORM} = E_{sample} \cdot \frac{t_{A+CS+R}}{t_{CS}}$
     [Figure: timeline at f_Pbase showing acquire entry, acquire exit and release, with t_wait and t_CS marked]
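A small worked example may help with the two metrics on this slide. The sketch below reads them as f_CS = f_base · t_CS / t_(A+CS+R) and E_NORM = E_sample · t_(A+CS+R) / t_CS, where t_(A+CS+R) is the measured time for acquire, critical section and release. The function names and all numeric inputs are illustrative assumptions, not measurements from the talk.

```python
def effective_cs_frequency(f_base_ghz, t_cs, t_total):
    """Effective CS frequency: f_base scaled by t_CS / t_(A+CS+R).

    t_cs and t_total must use the same unit (cycles or seconds).
    """
    if t_cs <= 0 or t_total <= 0:
        raise ValueError("durations must be positive")
    return f_base_ghz * t_cs / t_total


def normalized_energy(e_sample, t_cs, t_total):
    """Normalized energy: E_sample scaled by t_(A+CS+R) / t_CS."""
    if t_cs <= 0 or t_total <= 0:
        raise ValueError("durations must be positive")
    return e_sample * t_total / t_cs


# Illustrative inputs only: a 100k-cycle CS whose acquire + CS + release
# takes 125k cycles in total, on a hypothetical 3.0 GHz base clock.
f_cs = effective_cs_frequency(3.0, 100_000, 125_000)
e_norm = normalized_energy(0.4, 100_000, 125_000)
print(f"f_CS = {f_cs:.2f} GHz, E_NORM = {e_norm:.2f} kWh")
```

Under this reading, lock overhead lowers f_CS below the base frequency and raises E_NORM by the same factor, while a boosted critical section can push f_CS above the base frequency because the total time drops below t_CS.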

  6. Automatic Frequency Scaling
     • Decoration: spinning vs. blocking
     • P-state transitions triggered by hardware
     [Figure: frequency timeline between f_Pbase and f_Pturbo with phases t_wait, t_ramp and t_CS, and transitions t(P base → C halt), t(C halt → P base) and t(P turbo → P base). OS: halt entry and wakeup. CPU: deeper C-state, boosted P-state.]

  7. Blocking vs. Spinning Locks
     [Figure: four panels (Frequency AMD, Frequency Intel, Energy AMD, Energy Intel) comparing spin and futex locks. x-axis: Size CS (cycles, log); y-axes: f_CS (GHz) and E_NORM (kWh). Plot annotations: 1.5M, 4M, 10k, 1M, t_wait = 7M, t_wait = 70k.]

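  8. Manual Frequency Scaling
     • Decoration: spin and application-level DVFS control
     [Figure: frequency timeline across f_Pslow, f_Pbase and f_Pturbo with phases t_wait, t_ramp and t_CS, and transitions t(P slow → P turbo), t(P turbo → P base) and t(P base → P slow)]
     Costs annotated in the figure (three values per row, in the order they appear on the slide):
       ioctl:      1k,  1k,   1k
       wrmsr:      28k, 2k,   23k
       transition: 2k,  225k, 1k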
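To get a feel for how a fixed P-state switching cost is amortized by the size of the critical section, here is a minimal back-of-the-envelope sketch. It is not a model from the talk. It assumes that the work in the CS scales perfectly with core frequency, that the whole switching overhead sits on the critical path, and that the overhead is expressed in base-frequency cycles. The function name, the 30k-cycle overhead and the 3.0/4.0 GHz frequencies are illustrative assumptions.

```python
def break_even_cs_cycles(overhead_cycles, f_base_ghz, f_boost_ghz):
    """Smallest CS size (in base-frequency cycles) at which boosting pays off.

    Unboosted time:  n / f_base
    Boosted time:    n / f_boost + overhead / f_base
    Setting both equal gives n = overhead / (1 - f_base / f_boost).
    """
    if overhead_cycles < 0:
        raise ValueError("overhead must be non-negative")
    if f_boost_ghz <= f_base_ghz:
        raise ValueError("boost frequency must exceed base frequency")
    return overhead_cycles / (1.0 - f_base_ghz / f_boost_ghz)


# Illustrative numbers only: 30k cycles of switching overhead,
# boosting from a hypothetical 3.0 GHz to 4.0 GHz.
print(break_even_cs_cycles(30_000, 3.0, 4.0))
```

In this simplified model, critical sections shorter than the returned size lose more time to the transition than they gain from the higher clock, which is the intuition behind varying the CS size in the evaluation.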

  9. Manual Lock Boosting
     • spin: static P base
     • owner: dynamically boost
     • delegate: dedicated wrmsr core
     • migrate: statically boosted core
     [Figure: two panels (Frequency AMD, Energy AMD) comparing spin, ownr (owner), dlgt (delegate) and mgrt (migrate). x-axis: Size CS (cycles, log); y-axes: f_CS (GHz) and E_NORM (kWh). Plot annotations: 200k, 400k, 600k; futex: 1.5M.]

  10. TURBO Library
      • Convenient programmatic application-level DVFS control
      • Testbed to explore challenges of future heterogeneous cores
      [Figure: layered architecture on top of the Linux kernel and hardware interfaces]
        Execution control: ThreadRegistry (create/register), ThreadControl (decorate locks, barriers, …: boosting/profiling)
        Performance configuration: Thread (migrate to core), P-States (setting & configuration), PerformanceMonitor (low-level profiling)
        Hardware abstraction: Topology, PCI-Configuration, MSR-Interface, PerfEvent (P-states, HW counters)
      https://bitbucket.org/donjonsn/turbo
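The idea of decorating a lock so that acquire and release trigger frequency changes can be sketched in a few lines. The example below is not the TURBO library's API. `DecoratedLock`, `on_acquired` and `on_release` are hypothetical names, and the hooks are plain callables where a real system would issue a P-state request or record a profiling sample.

```python
import threading


class DecoratedLock:
    """Mutex that runs user-supplied hooks around the critical section."""

    def __init__(self, on_acquired=None, on_release=None):
        self._lock = threading.Lock()
        # Default to no-op hooks so the class also works undecorated.
        self._on_acquired = on_acquired or (lambda: None)
        self._on_release = on_release or (lambda: None)

    def __enter__(self):
        self._lock.acquire()
        try:
            # Hook runs while holding the lock, e.g. "request boost".
            self._on_acquired()
        except BaseException:
            self._lock.release()
            raise
        return self

    def __exit__(self, exc_type, exc, tb):
        try:
            # Hook runs before the lock is handed over, e.g. "drop boost".
            self._on_release()
        finally:
            self._lock.release()
        return False  # do not swallow exceptions from the critical section


# Usage with stand-in hooks that only record what would happen.
events = []
lock = DecoratedLock(on_acquired=lambda: events.append("boost"),
                     on_release=lambda: events.append("unboost"))
with lock:
    events.append("critical section")
print(events)
```

Keeping the hooks as injected callables separates the locking logic from the privileged frequency control, so the same wrapper could be used for boosting, profiling, or nothing at all.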

  11. Boosting Applications
      • Expose application knowledge
        - Asymmetric software transactional memory: up to 50% speedup with only 2% more energy
      • Tradeoffs when IPC depends on core frequency
        - Hash table resize in memcached: 9% speedup but 22% higher frequency
      • Outweigh P-state latency by delegating CS
        - High cross-module round-trip delay (2k cycles)
        - Intra-module delay scales with P-state (P boost: 280 cycles)

  12. Next Steps
      • Intel Haswell-EP supports per-core P-states
        - Allows giving hints
      • Application domains
        - Real-time scheduling
        - Fork-join benchmarks
        - …?
