

  1. HWP: Hardware Support to Reconcile Cache Energy, Complexity, Performance and WCET Estimates in Multicore Real-Time Systems. Pedro Benedicte, Carles Hernandez, Jaume Abella, Francisco J. Cazorla. Euromicro Conference on Real-Time Systems, July 4. Barcelona, Spain

  2. Performance in Real-Time Systems
  • Future, more complex features require an increase in guaranteed performance
  • COTS hardware used in the HPC/commodity domain offers higher performance
  • Common features: multicores and caches
  • Example platforms: NXP T2080 (avionics/rail), Zynq UltraScale+, NVIDIA Pascal, SnapDragon (space/automotive domains)
  • We focus on multicores with multilevel caches (MLC)

  3. At the heart of MLC: the write policy
  • The write policy affects several metrics: complexity, coherence, performance, energy, and reliability

  4. Contributions
  • Analysis of the most used policies: Write-Through (WT) and Write-Back (WB)
  • Survey of write policies used in commercial processors
  • Proposal of HWP, a Hybrid Write Policy that tries to take the best of both policies
  • Evaluation: guaranteed performance, energy, reliability, coherence complexity

  5. Assumptions
  • Multicore system: a private level of cache per core, a shared level of cache (L2), and a bus connecting the cores to L2 and memory
  • Reliability: parity when no correction is needed, SECDED otherwise; L2 is ECC-protected
  • Coherence: snooping-based protocol, good with a moderate number of cores (also applicable to directory-based); we assume write-invalidate (MESI)
  [Diagram: cores with private L1 caches connected by a bus to a shared ECC-protected L2 and memory]

  6. Write-Through
  [Diagram: on "write A", the core writes A into its L1 and also sends it over the bus to L2]

  7. Write-Through
  [Diagram: on a later "read A" from another core, A is served from the always up-to-date L2]

  8. Write-Through: metrics
  [Diagram: WT rated on performance, energy, coherence simplicity, and reliability cost; L1 caches need only parity, L2 needs ECC]

  9. Shared bus writes
  • With WT, each write is sent to the bus
  • A store occupies the bus for k cycles, so the bus admits 1/k accesses per cycle without saturation
  • With 4 cores accessing, the bus admits only 1/(4·k) accesses per core per cycle
  • WT increases the load on the bus with writes
  [Diagram: bus timeline showing back-to-back k-cycle store slots, with 1 core and with 4 cores]
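The slide's bus-budget argument can be sketched as follows; the function name is illustrative, not from the paper.

```python
# A store occupies the shared bus for k cycles, so the bus absorbs at most
# 1/k accesses per cycle; with n cores writing, each core's budget is 1/(n*k).

def bus_write_budget(k: int, n_cores: int) -> float:
    """Maximum sustainable write rate per core (accesses/cycle) before saturation."""
    return 1.0 / (n_cores * k)

# With 2-cycle stores: one core may issue up to 0.5 writes/cycle,
# but with 4 cores the per-core budget shrinks to 0.125 writes/cycle.
print(bus_write_budget(2, 1))  # 0.5
print(bus_write_budget(2, 4))  # 0.125
```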

  10. Store percentage in real-time applications
  • 9% stores on average
  • Data-intensive real-time applications have a higher percentage of memory operations
  • With 4 cores: 36% of bus accesses are stores
  • If a store takes more than 3 bus cycles, the bus saturates
  [Charts: per-benchmark store percentage for MediaBench and EEMBC automotive]
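A rough saturation check behind the slide's numbers (4 cores, ~9% stores per core, ~3 bus cycles per store); the helper name is illustrative.

```python
# Aggregate bus utilization from stores alone: per-core store rate times
# the number of cores times the cycles each store occupies the bus.

def bus_utilization(store_rate_per_core: float, n_cores: int, store_cycles: int) -> float:
    """Fraction of bus bandwidth consumed by stores; > 1.0 means saturation."""
    return store_rate_per_core * n_cores * store_cycles

u = bus_utilization(0.09, 4, 3)
print(round(u, 2))  # 1.08 -> already above 1.0 at ~3 cycles per store
print(u > 1.0)      # True, i.e. the bus is saturated
```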

  11. WT: reliability and coherence complexity
  • Reliability:
    • dL1 never keeps dirty data, so there is no need to correct data in dL1: just detect the error and re-request the line from L2
    • Parity in dL1: 1 extra bit per 64-bit line, a 1.6% overhead
    • Data in L2 is always up to date, so L2 uses SECDED: 8 check bits per 64-bit line, a 12.5% overhead
  • Coherence:
    • Data is always in L2, so there is no dirty state
    • A simple valid/invalid protocol is enough
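The reliability cost on the slide can be made concrete with a small sketch; the helper name is illustrative, not the paper's hardware.

```python
# WT needs only error *detection* in dL1 (parity), because a clean copy
# always exists in L2; L2 itself needs SECDED to correct single-bit errors.

def even_parity(word: int) -> int:
    """One extra bit of even parity over a 64-bit data word."""
    return bin(word & (2**64 - 1)).count("1") % 2

# Storage overhead per 64-bit line, as on the slide:
parity_overhead = 1 / 64   # 1 parity bit per 64 data bits
secded_overhead = 8 / 64   # 8 SECDED check bits per 64 data bits (Hamming 72,64)
print(f"{parity_overhead:.1%}")  # 1.6%
print(f"{secded_overhead:.1%}")  # 12.5%
```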

  12. Write-through: summary
  1. Stores to the bus can create contention and affect guaranteed performance
  2. More accesses to the bus and L2 increase energy consumption
  3. Only parity is required in L1
  4. Simple coherence protocol
  [Diagram: radar chart of the four metrics; higher is better]

  13. Write-Back
  [Diagram: on "write A", the core writes A only into its L1; L2 is not updated]

  14. Write-Back
  [Diagram: on a later "read A", the dirty copy of A must be provided by the owning L1 and written back to L2]

  15. Write-Back: metrics
  [Diagram: WB rated on performance, energy, coherence simplicity, and reliability cost; both the L1 caches and L2 need ECC]

  16. Write-back: summary
  • Reduced pressure on the bus improves guaranteed performance and energy consumption
  • ECC (SECDED) is required for the private caches, since L1 can hold dirty data
  • Coherence protocol complexity increases, due to the tracking of private dirty lines

  17. Write Policies in Commercial Architectures

  Processor            | Cores | Frequency | L1 WT?          | L1 WB?
  ARM Cortex R5        | 1-2   | 160 MHz   | Yes, ECC/parity | Yes, ECC/parity
  ARM Cortex M7        | 1-2   | 200 MHz   | Yes, ECC        | Yes, ECC
  Freescale PowerQUICC | 1     | 250 MHz   | Yes, ECC        | Yes, parity
  Freescale P4080      | 8     | 1.5 GHz   | No              | Yes, ECC
  Cobham LEON 3        | 2     | 100 MHz   | Yes, parity     | No
  Cobham LEON 4        | 4     | 150 MHz   | Yes, parity     | No

  • There is a mixture of WT/WB implementations
  • No obvious best solution: either can be appropriate depending on the requirements

  18. WT and WB comparison
  • Each policy has pros and cons; we want to get the best of each policy
  [Diagram: WT and WB metric profiles combined into HWP]

  19. Hybrid Write Policy: main idea
  • Observations:
    • Coherence is complex with WB because shared cache lines may be dirty in the local L1 caches
    • Private data is unaffected by cache coherence
    • A significant percentage of data is accessed by only one processor (even in parallel applications), so it needs no coherence management
  • Based on these observations, we propose HWP:
    • Shared data is managed as in a WT cache
    • Private data is managed as in a WB cache
  • Elements to consider: classifying data as private/shared; implementation (cost, complexity…)
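The two cases above can be sketched as a minimal behavioral model of HWP's store handling; the data structures and names are illustrative, not the paper's hardware.

```python
# Stores to shared pages behave like write-through (L1 and L2 both updated,
# the line is never dirty); stores to private pages behave like write-back
# (only L1 updated, line marked dirty, written to L2 on eviction).

l1 = {}              # addr -> (value, dirty)
l2 = {}              # addr -> value
PAGE = 0x1000        # assumed page size
shared_pages = {0x1000}  # assumed classification of page 0x1000 as shared

def hwp_store(addr: int, value: int) -> None:
    page = addr & ~(PAGE - 1)
    if page in shared_pages:       # shared data: write-through
        l1[addr] = (value, False)
        l2[addr] = value           # L2 always up to date, no dirty state
    else:                          # private data: write-back
        l1[addr] = (value, True)   # dirty; reaches L2 only on eviction

hwp_store(0x1008, 7)   # shared -> reaches L2 immediately
hwp_store(0x2010, 9)   # private -> stays dirty in L1, no bus traffic
```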

  20. Hybrid Write Policy: shared data
  [Diagram: on "write A" to shared data, A is written to L1 and sent over the bus to L2, as in WT]

  21. Hybrid Write Policy: shared data
  [Diagram: on a later "read A", A is served from the up-to-date L2]

  22. Hybrid Write Policy: private data
  [Diagram: on "write A" to private data, A stays dirty in L1 and is not sent to the bus, as in WB]

  23. Hybrid Write Policy: metrics
  [Diagram: HWP rated on performance, energy, coherence simplicity, and reliability cost; both the L1 caches and L2 need ECC]

  24. Private/Shared data classification
  • The hardware needs to know whether data is shared or private
  • Page granularity is the natural choice for the OS: if any data in a page is shared, the whole page is classified as shared
  • Supporting techniques already exist both in the OS (Linux) and in real hardware platforms (LEON3)
  • Possible techniques:
    • Dynamic classification: raises predictability issues in real-time systems
    • Software address partitioning: the solution we assume
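The software address-partitioning option can be sketched as a fixed split of the address space; the boundary value and function name are assumptions for illustration.

```python
# With a fixed partition, a page-level check is enough to tell the cache
# whether a line is shared or private: no dynamic tracking is needed, which
# avoids the predictability issues of dynamic classification.

PAGE_SIZE = 4096
SHARED_BASE = 0x8000_0000  # assumed: pages at or above this boundary are shared

def is_shared(addr: int) -> bool:
    """Page-granularity private/shared classification via address partitioning."""
    page_base = addr & ~(PAGE_SIZE - 1)
    return page_base >= SHARED_BASE

print(is_shared(0x0040_1234))  # False -> private, handled write-back
print(is_shared(0x8000_0010))  # True  -> shared, handled write-through
```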

  25. Implementation
  • Small hardware modifications

  26. HWP: summary
  • Guaranteed performance: accesses to the bus are limited to shared data
  • Energy: consumption in the bus and L2 is also reduced
  • Reliability: sensitive data can be marked as shared so it is always backed in L2; for critical applications SECDED is still needed, since private data can live in L1 without a copy in L2
  • Coherence: same complexity as WT

  27. WT, WB and HWP comparison
  [Diagram: metric profiles of Write-Through, Write-Back, and the Hybrid Write Policy]

  28. Evaluation: setup
  • SoCLib simulator for cycle counts, CACTI for energy
  • Architecture based on the NGMP, with 8 cores instead of 4
  • Private iL1 and dL1 per core, shared L2
  • Benchmarks: EEMBC automotive, MediaBench

  29. Methodology
  • 4 different mixes built from single-thread benchmarks
  • Different percentages of shared data are assumed to evaluate the different scenarios
  • Model for bus contention [1]: uses PMCs to count the types of accesses of the competing cores
  • With this model we obtain partially time-composable WCET estimates
  • The model accounts only for the worst-case accesses the other cores actually make: if the task makes 100 bus accesses and the other tasks make 50, only those 50 potential interferences are counted, yielding tighter WCET estimates

  [1] J. Jalle et al. Bounding resource contention interference in the next-generation microprocessor (NGMP)
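The contention bound described above can be sketched as follows; the function name and the slot latency are illustrative assumptions, not values from the paper.

```python
# The interference a task can suffer on the bus is bounded by the accesses
# the *other* cores actually perform (counted with PMCs), since each
# interference requires a competing access; it cannot exceed the task's
# own access count either.

def contention_bound(own_accesses: int, others_accesses: int, slot_cycles: int) -> int:
    """Worst-case bus interference cycles for the task under analysis."""
    return min(own_accesses, others_accesses) * slot_cycles

# Slide's example: the task makes 100 bus accesses, the competitors make 50
# in total, so only 50 accesses can actually interfere.
print(contention_bound(100, 50, 2))  # 100 cycles, assuming 2-cycle bus slots
```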

  30. Guaranteed performance
  • Normalized WCET under bus contention, with 10% of data shared
  • WT does not scale well with the number of cores
  • HWP scales similarly to WB, with some degradation due to shared accesses
  [Plot: normalized WCET vs. number of cores for WT, HWP and WB]

  31. Guaranteed performance
  • Results for 0%, 10%, 20% and 40% shared data, each plot normalized to its own single-core case
  • The same trends are seen across all setups

  32. Energy
  • Coherence energy is higher with the WB policy
  • Reliability has a small energy cost
  • Main difference across policies: L2 access energy
  [Charts: energy breakdown for EEMBC and MediaBench]

  33. Coherence
  • Invalidation messages: WT sends a high number; WB and HWP only broadcast for shared data
  • Shared dirty data communication: significant impact in WB
  [Charts: invalidation messages and shared-dirty-data communication for EEMBC and MediaBench]

  34. Conclusions
  • Both WT and WB offer tradeoffs across the different metrics
  • There is no single best policy, as commercial architectures show
  • HWP tries to improve on this: not perfect, but better overall
    • Guaranteed performance and energy similar to WB
    • Coherence complexity like WT

  35. Thank you! Any questions?
  Pedro Benedicte, Carles Hernandez, Jaume Abella, Francisco J. Cazorla
