HWP: Hardware Support to Reconcile Cache Energy, Complexity, Performance and WCET Estimates in Multicore Real-Time Systems. Pedro Benedicte, Carles Hernandez, Jaume Abella, Francisco J. Cazorla. July 4th, Euromicro Conference on Real-Time Systems.


SLIDE 1

Euromicro Conference on Real-Time Systems, July 4th, Barcelona, Spain

HWP: Hardware Support to Reconcile Cache Energy, Complexity, Performance and WCET Estimates in Multicore Real-Time Systems

Pedro Benedicte, Carles Hernandez, Jaume Abella, Francisco J. Cazorla

SLIDE 2

Performance in Real-Time Systems

  • Future, more complex features require an increase in guaranteed performance

  • COTS hardware used in the HPC/commodity domain offers higher performance

  • Common features:
  • Multicores
  • Caches

[Examples: NVIDIA Pascal (auto), SnapDragon (auto), NXP T2080 (avionics/rail), Zynq UltraScale+ EC (space/auto)]

  • We focus on multicore with multilevel caches (MLC)
SLIDE 3

Write policy

At the heart of an MLC design sits the write policy, which affects several metrics: performance, energy, reliability, coherence, and complexity.

SLIDE 4

Contributions

  • Analysis of most used policies
  • Write-Through (WT)
  • Write-Back (WB)
  • Write policies used in commercial processors
  • Proposal: HWP, the Hybrid Write Policy
  • Aims to take the best of both policies
  • Evaluation
  • Guaranteed Performance
  • Energy
  • Reliability
  • Coherence complexity


SLIDE 5

Assumptions

  • Multi-core system
  • Private level of cache
  • Shared level of cache
  • Bus to connect the different cores
  • Reliability
  • Parity when no correction needed
  • SECDED otherwise
  • Coherence
  • Snooping-based protocol
  • Works well with a moderate number of cores
  • Can also be applied to directory-based protocols
  • We assume write-invalidate (MESI)


[Figure: cores with private L1 caches connected over a bus to a shared L2 with ECC, backed by memory]

SLIDE 6

Write-Through

[Figure: a core writes A; with WT the write updates L1 and propagates over the bus to L2]

SLIDE 7

Write-Through

[Figure: another core reads A; with WT the up-to-date copy is always available from L2]

SLIDE 8

Write-Through

[Figure: WT error protection: parity in the L1 caches, ECC in L2]

[Table: WT scorecard across performance, energy, coherence simplicity and reliability cost]

SLIDE 9

Shared bus writes

  • Each write is sent to the bus
  • A store occupies the bus for k cycles
  • The bus admits 1/k accesses per cycle without saturation
  • With 4 cores accessing, each core gets 1/(4·k)
  • WT increases the load on the bus with its write traffic

[Figure: back-to-back bus accesses, each occupying the bus for k cycles]
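The capacity argument above can be written down as a tiny model (an illustrative simplification, not the paper's analysis):

```python
def max_access_rate(n_cores, k):
    """Accesses per cycle each core can issue without saturating a bus
    where every access occupies the bus for k cycles."""
    # The bus serves at most 1/k accesses per cycle in total; shared
    # evenly among n_cores cores, each core gets 1/(n_cores * k).
    return 1.0 / (n_cores * k)
```

With 4 cores and k-cycle stores, each core can sustain only 1/(4·k) accesses per cycle, so WT's extra write traffic eats directly into this budget.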

SLIDE 10

Store percentage in real-time applications

  • 9% stores on average
  • Data-intensive real-time applications have a higher percentage of memory operations
  • 4 cores: 36% stores
  • If a store takes > 3 cycles, the bus is saturated

[Figure: histograms of store percentage (5-30%) for EEMBC automotive and MediaBench]
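Plugging the slide's numbers into a simple utilization check (hedged: this assumes one instruction per core per cycle, which is the pessimistic simplification the 36% figure implies):

```python
def bus_saturated(n_cores, store_ratio, k_cycles):
    """True if aggregate store traffic exceeds bus capacity, assuming
    each core issues one instruction per cycle and each store holds
    the bus for k_cycles."""
    demanded = n_cores * store_ratio * k_cycles  # bus cycles demanded per cycle
    return demanded > 1.0

# 4 cores x 9% stores = 36% of cycles carry a store; at 3 bus cycles
# per store the demand is 1.08 bus cycles per cycle, so the bus saturates.
```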

SLIDE 11

WT: reliability and coherence complexity

  • Reliability:
  • dL1 does not keep dirty data
  • No need to correct data in dL1
  • Just detect the error and re-fetch from L2
  • Parity in dL1: 1.6% overhead
  • Data in L2 is always up to date
  • SECDED in L2: 12.5% overhead
  • Coherence:
  • Data is always in L2, no dirty state
  • A simple valid/invalid protocol is enough

[Figure: a 64-bit line with one parity bit in dL1 vs. a 64-bit line with SECDED bits in L2]
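The overhead figures follow from the per-word check-bit counts (assuming 64-bit protection granularity, as the slide's diagram suggests):

```python
# Parity: 1 check bit per 64 data bits; SECDED on a 64-bit word needs
# 8 check bits (extended Hamming (72,64)).
parity_overhead = 1 / 64    # = 0.015625, i.e. about 1.6%
secded_overhead = 8 / 64    # = 0.125, i.e. 12.5%
```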

SLIDE 12

Write-through: summary

  • 1. Stores to the bus can create contention and affect guaranteed performance
  • 2. More accesses to the bus and L2 increase energy consumption
  • 3. Only requires parity in L1
  • 4. Simple coherence protocol

[Figure: WT scorecard on the four points above (higher is better)]

SLIDE 13

Write-Back

[Figure: a core writes A; with WB the write stays in L1 as a dirty line]

SLIDE 14

Write-Back

[Figure: another core reads A; with WB the dirty copy must be supplied by the owning L1]

SLIDE 15

Write-Back

[Figure: WB error protection: ECC in both the L1 caches and L2]

[Table: WB scorecard across performance, energy, coherence simplicity and reliability cost]

SLIDE 16

Write-back: summary

  • Reduced pressure on the bus improves guaranteed performance and energy consumption
  • ECC (SECDED) is required for private caches
  • There can be dirty data in L1
  • Increased coherence protocol complexity, due to the tracking of private dirty lines


SLIDE 17

Write Policies in Commercial Architectures

  • There is a mixture of WT/WB implementations
  • No obvious solution
  • Both solutions can be appropriate depending on the requirements

Processor            | Cores | Frequency | L1 WT?          | L1 WB?
ARM Cortex-R5        | 1-2   | 160 MHz   | Yes, ECC/parity | Yes, ECC/parity
ARM Cortex-M7        | 1-2   | 200 MHz   | Yes, ECC        | Yes, ECC
Freescale PowerQUICC | 1     | 250 MHz   | Yes, ECC        | Yes, parity
Freescale P4080      | 8     | 1.5 GHz   | No              | Yes, ECC
Cobham LEON3         | 2     | 100 MHz   | Yes, parity     | No
Cobham LEON4         | 4     | 150 MHz   | Yes, parity     | No

SLIDE 18

WT and WB comparison

[Table: side-by-side WT vs. WB scorecard]

  • Each policy has pros and cons
  • We want to get the best of each policy: HWP

SLIDE 19

Hybrid Write Policy: main idea

  • Observations:
  • Coherence is complex with WB because shared cache lines may be dirty in local L1 caches
  • Private data is unaffected by cache coherence
  • A significant percentage of data is accessed by only one processor (even in parallel applications), so no coherence management is needed for it

  • Based on these observations, we propose HWP:
  • Shared data is managed as in a WT cache
  • Private data is managed as in a WB cache
  • Elements to consider:
  • Classify data as private/shared
  • Implementation (cost, complexity…)

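A minimal software model of HWP's per-store decision (names and structure are illustrative, not the paper's hardware):

```python
PAGE_SIZE = 4096

class Cache:
    """Toy cache: a store plus a dirty set, just enough to show the policy."""
    def __init__(self):
        self.data = {}
        self.dirty = set()
    def write(self, addr, value):
        self.data[addr] = value

def handle_store(addr, value, l1, l2, shared_pages):
    l1.write(addr, value)                 # both policies update L1
    if addr // PAGE_SIZE in shared_pages:
        l2.write(addr, value)             # shared data: write-through to L2
    else:
        l1.dirty.add(addr)                # private data: write-back, deferred to eviction
```

Evictions of dirty private lines would still write back to L2, exactly as in a plain WB cache.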

SLIDE 20

Hybrid Write Policy

[Figure: shared data, write A: the write updates L1 and propagates to L2, as in WT]

SLIDE 21

Hybrid Write Policy

[Figure: shared data, read A: the up-to-date copy is available from L2, as in WT]

SLIDE 22

Hybrid Write Policy

[Figure: private data, write A: the write stays in L1 as a dirty line, as in WB]

SLIDE 23

Hybrid Write Policy

[Figure: HWP error protection: ECC in the L1 caches and L2]

[Table: HWP scorecard across performance, energy, coherence simplicity and reliability cost]

SLIDE 24

Private/Shared data classification

  • The hardware needs to know whether data is shared or private
  • Page granularity is optimal for the OS
  • If any data in a page is shared, the whole page is classified as shared
  • Techniques already exist in both an OS (Linux) and a real hardware platform (LEON3)
  • Possible techniques:
  • Dynamic classification: predictability issues in RTS
  • Software address partitioning: we assume this solution

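With software address partitioning (the option assumed on this slide), the classification the hardware needs reduces to a range check; the boundary below is a hypothetical value, not one from the paper:

```python
PAGE_SIZE = 4096
SHARED_BASE = 0x8000_0000   # hypothetical partition boundary: pages at or
                            # above this address are treated as shared

def page_is_shared(addr):
    """A page is shared if any datum in it may be accessed by more than
    one core; with address partitioning this is a static range check."""
    page_base = (addr // PAGE_SIZE) * PAGE_SIZE
    return page_base >= SHARED_BASE
```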

SLIDE 25

Implementation

  • Small hardware modifications


SLIDE 26

HWP: summary

  • Guaranteed performance
  • Accesses to the bus are limited to shared data
  • Energy consumption of the bus and L2 is also reduced
  • Reliability
  • Sensitive data can be marked as shared so it is always backed in L2
  • For critical applications SECDED is still needed, since private data can be in L1 and not in L2
  • Coherence
  • Same coherence complexity as WT


SLIDE 27

WT, WB and HWP comparison

[Table: three-way scorecard: WT vs. WB vs. HWP]

SLIDE 28

Evaluation: Setup


  • SoCLib simulator for cycles
  • CACTI for energy usage
  • Architecture based on NGMP
  • With 8 cores instead of 4
  • Private iL1 and dL1, shared L2
  • Benchmarks:
  • EEMBC automotive, MediaBench
SLIDE 29

Methodology

  • 4 different mixes built from single-thread benchmarks
  • We assume different percentages of shared data to evaluate the different scenarios
  • Model for bus contention [1]
  • Uses PMCs to count the type of the competing cores' accesses
  • With this model we obtain partially time-composable WCET estimates
  • In short, the model accounts for the worst possible accesses the other cores actually DO make
  • Example: a task makes 100 bus accesses and the other tasks make 50 bus accesses
  • The model then accounts for only the 50 potential interferences
  • Tighter WCET estimates


[1] J. Jalle et al. Bounding resource contention interference in the next-generation microprocessor (NGMP)
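The accounting idea behind [1] can be sketched like this (a simplification of the actual model, with k an assumed per-interference delay in bus cycles):

```python
def contention_bound(task_accesses, competitor_accesses, k):
    """Worst-case extra bus cycles charged to a task: each interfering
    access can delay it by up to k cycles, but no more interferences
    are charged than the competitors actually perform."""
    return min(task_accesses, competitor_accesses) * k
```

For the slide's example, only the 50 competitor accesses are charged against the task's 100, instead of pessimistically assuming every one of the 100 is delayed.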

SLIDE 30

Guaranteed performance


  • Normalized WCET with bus contention
  • 10% of the data is shared
  • WT does not scale well with the number of cores
  • HWP scales similarly to WB, with some degradation due to shared accesses

[Figure: normalized WCET vs. number of cores for WT, HWP and WB]

SLIDE 31

Guaranteed performance

[Figure: four panels with 0%, 10%, 20% and 40% shared data]

  • Each plot is normalized to its own single-core result
  • The same trends appear across all setups
SLIDE 32

Energy

  • Coherence energy is higher with the WB policy
  • Reliability has a small energy cost
  • The main difference is L2 access energy

[Figure: energy breakdown for EEMBC and MediaBench]

SLIDE 33

Coherence

  • Invalidation messages
  • WT issues a high number of them
  • WB and HWP broadcast only for shared data
  • Shared dirty data communication
  • Significant impact for WB

[Figure: invalidation messages and shared dirty data communication for EEMBC and MediaBench]

SLIDE 34

Conclusions

  • Both WT and WB offer tradeoffs across the different metrics
  • There is no single best policy, as commercial architectures show
  • HWP aims to improve on this
  • Not perfect, but better overall
  • Guaranteed performance and energy similar to WB
  • Coherence complexity like WT


SLIDE 35

Thank you! Any questions?

Pedro Benedicte, Carles Hernandez, Jaume Abella, Francisco J. Cazorla