HWP: Hardware Support to Reconcile Cache Energy, Complexity, Performance and WCET Estimates in Multicore Real-Time Systems. Pedro Benedicte, Carles Hernandez, Jaume Abella, Francisco J. Cazorla. July 4th, Euromicro Conference on Real-Time Systems.


SLIDE 1

Euromicro Conference on Real-Time Systems, July 4th, Barcelona, Spain

HWP: Hardware Support to Reconcile Cache Energy, Complexity, Performance and WCET Estimates in Multicore Real-Time Systems

Pedro Benedicte, Carles Hernandez, Jaume Abella, Francisco J. Cazorla

SLIDE 2

Performance in Real-Time Systems

  • Future, more complex features require an increase in guaranteed performance

  • COTS hardware used in the HPC/commodity domain offers higher performance

  • Common features:
  • Multicores
  • Caches

[Examples: NVIDIA Pascal (auto), SnapDragon (auto), NXP T2080 (avionics/rail), Zynq UltraScale+ EC (space/auto)]

  • We focus on multicore with multilevel caches (MLC)
SLIDE 3

Write policy

At the heart of an MLC design sits the write policy, which affects several metrics: performance, energy, reliability, coherence, and complexity.

SLIDE 4

Contributions

  • Analysis of most used policies
  • Write-Through (WT)
  • Write-Back (WB)
  • Write policies used in commercial processors
  • Proposal: HWP, the Hybrid Write Policy
  • Aims to take the best of both policies
  • Evaluation
  • Guaranteed Performance
  • Energy
  • Reliability
  • Coherence complexity


SLIDE 5

Assumptions

  • Multi-core system
  • Private level of cache
  • Shared level of cache
  • Bus to connect the different cores
  • Reliability
  • Parity when no correction needed
  • SECDED otherwise
  • Coherence
  • Snooping-based protocol
  • Works well with a moderate number of cores
  • Can also be applied to directory-based protocols
  • We assume write-invalidate (MESI)


[Figure: cores with private L1 caches connected over a bus to a shared L2 with ECC, backed by memory]

SLIDE 6

Write-Through

[Figure: a core writes A; with WT the write updates L1 and propagates over the bus to L2]

SLIDE 7

Write-Through

[Figure: another core reads A; with WT the up-to-date copy is always available from L2]

SLIDE 8

Write-Through

[Figure: WT error protection: parity in the L1 caches, ECC in L2]

[Table: WT scorecard across performance, energy, coherence simplicity and reliability cost]

SLIDE 9

Shared bus writes

  • Each write is sent to the bus
  • A store occupies the bus for k cycles
  • The bus admits 1/k accesses per cycle without saturation
  • With 4 cores accessing, each core gets 1/(4·k)
  • WT increases the load on the bus with its write traffic

[Figure: back-to-back bus accesses, each occupying the bus for k cycles]
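The capacity argument above can be written down as a tiny model (an illustrative simplification, not the paper's analysis):

```python
def max_access_rate(n_cores, k):
    """Accesses per cycle each core can issue without saturating a bus
    where every access occupies the bus for k cycles."""
    # The bus serves at most 1/k accesses per cycle in total; shared
    # evenly among n_cores cores, each core gets 1/(n_cores * k).
    return 1.0 / (n_cores * k)
```

With 4 cores and k-cycle stores, each core can sustain only 1/(4·k) accesses per cycle, so WT's extra write traffic eats directly into this budget.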

SLIDE 10

Store percentage in real-time applications

  • 9% stores on average
  • Data-intensive real-time applications have a higher percentage of memory operations
  • 4 cores: 36% stores
  • If a store takes > 3 cycles, the bus is saturated

[Figure: histograms of store percentage (5-30%) for EEMBC automotive and MediaBench]
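Plugging the slide's numbers into a simple utilization check (hedged: this assumes one instruction per core per cycle, which is the pessimistic simplification the 36% figure implies):

```python
def bus_saturated(n_cores, store_ratio, k_cycles):
    """True if aggregate store traffic exceeds bus capacity, assuming
    each core issues one instruction per cycle and each store holds
    the bus for k_cycles."""
    demanded = n_cores * store_ratio * k_cycles  # bus cycles demanded per cycle
    return demanded > 1.0

# 4 cores x 9% stores = 36% of cycles carry a store; at 3 bus cycles
# per store the demand is 1.08 bus cycles per cycle, so the bus saturates.
```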

SLIDE 11

WT: reliability and coherence complexity

  • Reliability:
  • dL1 does not keep dirty data
  • No need to correct data in dL1
  • Just detect the error and re-fetch from L2
  • Parity in dL1: 1.6% overhead
  • Data in L2 is always up to date
  • SECDED in L2: 12.5% overhead
  • Coherence:
  • Data is always in L2, no dirty state
  • A simple valid/invalid protocol is enough

[Figure: a 64-bit line with one parity bit in dL1 vs. a 64-bit line with SECDED bits in L2]
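The overhead figures follow from the per-word check-bit counts (assuming 64-bit protection granularity, as the slide's diagram suggests):

```python
# Parity: 1 check bit per 64 data bits; SECDED on a 64-bit word needs
# 8 check bits (extended Hamming (72,64)).
parity_overhead = 1 / 64    # = 0.015625, i.e. about 1.6%
secded_overhead = 8 / 64    # = 0.125, i.e. 12.5%
```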

SLIDE 12

Write-through: summary

  • 1. Stores to the bus can create contention and affect guaranteed performance
  • 2. More accesses to the bus and L2 increase energy consumption
  • 3. Only requires parity in L1
  • 4. Simple coherence protocol

[Figure: WT scorecard on the four points above (higher is better)]

SLIDE 13

Write-Back

[Figure: a core writes A; with WB the write stays in L1 as a dirty line]

SLIDE 14

Write-Back

[Figure: another core reads A; with WB the dirty copy must be supplied by the owning L1]

SLIDE 15

Write-Back

[Figure: WB error protection: ECC in both the L1 caches and L2]

[Table: WB scorecard across performance, energy, coherence simplicity and reliability cost]

SLIDE 16

Write-back: summary

  • Reduced pressure on the bus improves guaranteed performance and energy consumption
  • ECC (SECDED) is required for private caches
  • There can be dirty data in L1
  • Increased coherence protocol complexity, due to the tracking of private dirty lines


SLIDE 17

Write Policies in Commercial Architectures

  • There is a mixture of WT/WB implementations
  • No obvious solution
  • Both solutions can be appropriate depending on the requirements

Processor            | Cores | Frequency | L1 WT?          | L1 WB?
ARM Cortex-R5        | 1-2   | 160 MHz   | Yes, ECC/parity | Yes, ECC/parity
ARM Cortex-M7        | 1-2   | 200 MHz   | Yes, ECC        | Yes, ECC
Freescale PowerQUICC | 1     | 250 MHz   | Yes, ECC        | Yes, parity
Freescale P4080      | 8     | 1.5 GHz   | No              | Yes, ECC
Cobham LEON3         | 2     | 100 MHz   | Yes, parity     | No
Cobham LEON4         | 4     | 150 MHz   | Yes, parity     | No

SLIDE 18

WT and WB comparison

[Table: side-by-side WT vs. WB scorecard]

  • Each policy has pros and cons
  • We want to get the best of each policy: HWP

SLIDE 19

Hybrid Write Policy: main idea

  • Observations:
  • Coherence is complex with WB because shared cache lines may be dirty in local L1 caches
  • Private data is unaffected by cache coherence
  • A significant percentage of data is accessed by only one processor (even in parallel applications), so no coherence management is needed for it

  • Based on these observations, we propose HWP:
  • Shared data is managed as in a WT cache
  • Private data is managed as in a WB cache
  • Elements to consider:
  • Classify data as private/shared
  • Implementation (cost, complexity…)

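A minimal software model of HWP's per-store decision (names and structure are illustrative, not the paper's hardware):

```python
PAGE_SIZE = 4096

class Cache:
    """Toy cache: a store plus a dirty set, just enough to show the policy."""
    def __init__(self):
        self.data = {}
        self.dirty = set()
    def write(self, addr, value):
        self.data[addr] = value

def handle_store(addr, value, l1, l2, shared_pages):
    l1.write(addr, value)                 # both policies update L1
    if addr // PAGE_SIZE in shared_pages:
        l2.write(addr, value)             # shared data: write-through to L2
    else:
        l1.dirty.add(addr)                # private data: write-back, deferred to eviction
```

Evictions of dirty private lines would still write back to L2, exactly as in a plain WB cache.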

SLIDE 20

Hybrid Write Policy

[Figure: shared data, write A: the write updates L1 and propagates to L2, as in WT]

SLIDE 21

Hybrid Write Policy

[Figure: shared data, read A: the up-to-date copy is available from L2, as in WT]

SLIDE 22

Hybrid Write Policy

[Figure: private data, write A: the write stays in L1 as a dirty line, as in WB]

SLIDE 23

Hybrid Write Policy

[Figure: HWP error protection: ECC in the L1 caches and L2]

[Table: HWP scorecard across performance, energy, coherence simplicity and reliability cost]

SLIDE 24

Private/Shared data classification

  • The hardware needs to know whether data is shared or private
  • Page granularity is optimal for the OS
  • If any data in a page is shared, the whole page is classified as shared
  • Techniques already exist in both an OS (Linux) and a real hardware platform (LEON3)
  • Possible techniques:
  • Dynamic classification: predictability issues in RTS
  • Software address partitioning: we assume this solution

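With software address partitioning (the option assumed on this slide), the classification the hardware needs reduces to a range check; the boundary below is a hypothetical value, not one from the paper:

```python
PAGE_SIZE = 4096
SHARED_BASE = 0x8000_0000   # hypothetical partition boundary: pages at or
                            # above this address are treated as shared

def page_is_shared(addr):
    """A page is shared if any datum in it may be accessed by more than
    one core; with address partitioning this is a static range check."""
    page_base = (addr // PAGE_SIZE) * PAGE_SIZE
    return page_base >= SHARED_BASE
```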

SLIDE 25

Implementation

  • Small hardware modifications


SLIDE 26

HWP: summary

  • Guaranteed performance
  • Accesses to the bus are limited to shared data
  • Energy consumption of the bus and L2 is also reduced
  • Reliability
  • Sensitive data can be marked as shared so it is always backed in L2
  • For critical applications SECDED is still needed, since private data can be in L1 and not in L2
  • Coherence
  • Same coherence complexity as WT


SLIDE 27

WT, WB and HWP comparison

[Table: three-way scorecard: WT vs. WB vs. HWP]

SLIDE 28

Evaluation: Setup


  • SoCLib simulator for cycles
  • CACTI for energy usage
  • Architecture based on NGMP
  • With 8 cores instead of 4
  • Private iL1 and dL1, shared L2
  • Benchmarks:
  • EEMBC automotive, MediaBench
SLIDE 29

Methodology

  • 4 different mixes built from single-thread benchmarks
  • We assume different percentages of shared data to evaluate the different scenarios
  • Model for bus contention [1]
  • Uses PMCs to count the type of the competing cores' accesses
  • With this model we obtain partially time-composable WCET estimates
  • In short, the model accounts for the worst possible accesses the other cores actually DO make
  • Example: a task makes 100 bus accesses and the other tasks make 50 bus accesses
  • The model then accounts for only the 50 potential interferences
  • Tighter WCET estimates


[1] J. Jalle et al. Bounding resource contention interference in the next-generation microprocessor (NGMP)
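The accounting idea behind [1] can be sketched like this (a simplification of the actual model, with k an assumed per-interference delay in bus cycles):

```python
def contention_bound(task_accesses, competitor_accesses, k):
    """Worst-case extra bus cycles charged to a task: each interfering
    access can delay it by up to k cycles, but no more interferences
    are charged than the competitors actually perform."""
    return min(task_accesses, competitor_accesses) * k
```

For the slide's example, only the 50 competitor accesses are charged against the task's 100, instead of pessimistically assuming every one of the 100 is delayed.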

SLIDE 30

Guaranteed performance


  • Normalized WCET with bus contention
  • 10% of the data is shared
  • WT does not scale well with the number of cores
  • HWP scales similarly to WB, with some degradation due to shared accesses

[Figure: normalized WCET vs. number of cores for WT, HWP and WB]

SLIDE 31

Guaranteed performance

[Figure: four panels with 0%, 10%, 20% and 40% shared data]

  • Each plot is normalized to its own single-core result
  • The same trends appear across all setups
SLIDE 32

Energy

  • Coherence energy is higher with the WB policy
  • Reliability has a small energy cost
  • The main difference is L2 access energy

[Figure: energy breakdown for EEMBC and MediaBench]

SLIDE 33

Coherence

  • Invalidation messages
  • WT issues a high number of them
  • WB and HWP broadcast only for shared data
  • Shared dirty data communication
  • Significant impact for WB

[Figure: invalidation messages and shared dirty data communication for EEMBC and MediaBench]

SLIDE 34

Conclusions

  • Both WT and WB offer tradeoffs across the different metrics
  • There is no single best policy, as commercial architectures show
  • HWP aims to improve on this
  • Not perfect, but better overall
  • Guaranteed performance and energy similar to WB
  • Coherence complexity like WT


SLIDE 35

Thank you! Any questions?

Pedro Benedicte, Carles Hernandez, Jaume Abella, Francisco J. Cazorla