DACOTA: Post-silicon validation of the
memory subsystem in multi-core designs
Andrew DeOrio Ilya Wagner Valeria Bertacco
Advanced Computer Architecture Laboratory University of Michigan, Ann Arbor HPCA 2009
2/22
Intel Polaris Tilera TILE64
3/22
Core 0
Core N-1
4/22
– Ad-hoc
– Austin, Malik, Sorin
– Intel, AMD
bugs exposed: 98%, effort: 70%
bugs exposed: 2%, effort: 30%
bugs exposed: <1%, effort: 0%
Logic Sim.
module dut; assign x = ~y | z; always @ * begin … end
RTL
5/22
6/22
Bug AW38 (No Fix): Instruction fetch may cause a livelock during snoops of the L1 data cache
Excerpt from Specification Update [Nov. 2007]
7/22
– Enable self-detection of memory ordering errors
– Coherence and consistency errors
[Figure: test generator; prototype's final state = simulation final state; critical path self check]
8/22
Core 0
Core N-1
– stores ordering info
– uses cache storage temporarily
– starts when storage fills
– distributed algorithm on individual cores
[Figure: timeline alternating benchmark execution and check]
9/22
[Figure: DACOTA components: L2 cache, interconnect, data access vector, and a Dacota control block at each core (Core 0 through Core N-1)]
10/22
– Tracks the order of memory accesses to one line
– Entry for each core stores a sequence ID
data access vector
– Stores a sequence of access vectors in program order
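The two structures above can be sketched in software as follows. This is a minimal illustration, not the hardware design: the class name `Tracker`, the `access` method, and the update rule (each access increments the accessing core's sequence ID and appends a snapshot of the vector to that core's log) are assumptions made for the example. The slides state only that the vector holds one sequence ID per core and that the log holds access vectors in program order.

```python
class Tracker:
    """Toy model of per-line access vectors and per-core activity logs.

    Hypothetical sketch: the update rule below is an assumption, not
    taken from the slides.
    """

    def __init__(self, num_cores):
        self.num_cores = num_cores
        # line address -> list with one sequence ID per core
        self.access_vectors = {}
        # one activity log per core, kept in program order
        self.activity_logs = [[] for _ in range(num_cores)]

    def access(self, core, op, line):
        # Create the line's vector on first touch, all sequence IDs at 0.
        vec = self.access_vectors.setdefault(line, [0] * self.num_cores)
        # Assumed rule: the accessing core bumps its own sequence ID.
        vec[core] += 1
        # The core logs a snapshot of the vector, in program order.
        self.activity_logs[core].append((op, line, tuple(vec)))


tracker = Tracker(num_cores=2)
tracker.access(0, "ST", 0xA)
tracker.access(1, "LD", 0xA)
tracker.access(0, "ST", 0xA)
```

After these three accesses, core 0's log holds two snapshots for line 0xA and core 1's log holds one. In the design described on the slides the log uses cache storage temporarily; the plain Python lists here stand in for that storage.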
[Figure: per-core counter values (1, 2, 3) recorded in the data access vector; L2 cache, interconnect, and a Dacota control block at Core 0 through Core N-1]
11/22
L2 Cache Interconnect
L1 Cache
Core 0
L1 Cache
Core N-1
ST A1 ST B1 ST A2 LD B1 ST C1 ST D1 ST C2 ST E1 ST B2 ST B1 ST A2 ST C2 ST E1 … ST A1
12/22
Issue Order
[C1] store to address 0xC
[C0] load from address 0xC
[C1] load from address 0xB
[C0] store to address 0xA
[C0] store to address 0xB
[C1] load from address 0xA
<data> 0xB 1 1 0 0xC
tag data log
<data> 0xA 1 1 0 0xB <data> 0xB 1 1 0 0xB <data> 0xA 0 0 0xA <data> 0xA <data> 0xC <data> 0xC 1 1 0 0xA 1 1 0 0xC
Actual Order
[C1] store to address 0xC
[C0] load from address 0xC
[C1] load from address 0xA
[C0] store to address 0xA
[C0] store to address 0xB
[C1] load from address 0xB
13/22
1 1 0 0xA 1 1 0 0xB ST 0xA ST 0xB LD 0xA LD 0xB
1 1 0 0xB 0 0 0xA
1 1 0 0xC 1 1 0 0xC LD 0xC ST 0xC
address reference edges program order edges
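A graph built from program order edges and address reference edges is commonly checked for cycles, since a cycle means no single global order can satisfy every edge. The sketch below assumes that interpretation. The function `has_cycle`, the node labels, and the two-core example are hypothetical and are not taken from the slides; the example is the familiar case where one core sees a later store but misses an earlier one.

```python
from collections import defaultdict


def has_cycle(edges):
    """Return True if the directed graph given as (src, dst) pairs has a cycle."""
    graph = defaultdict(list)
    nodes = set()
    for src, dst in edges:
        graph[src].append(dst)
        nodes.update((src, dst))

    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the DFS stack, finished
    color = {n: WHITE for n in nodes}

    def visit(n):
        color[n] = GRAY
        for m in graph[n]:
            if color[m] == GRAY:  # back edge: a cycle exists
                return True
            if color[m] == WHITE and visit(m):
                return True
        color[n] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in list(nodes))


# Hypothetical example: core 0 stores A then B; core 1 loads B then A.
program_order = [("C0:ST A", "C0:ST B"), ("C1:LD B", "C1:LD A")]

# Core 1 observes the new B but the old A, so its load of A precedes the store to A.
bad_reference = [("C0:ST B", "C1:LD B"), ("C1:LD A", "C0:ST A")]
# Core 1 observes both new values.
good_reference = [("C0:ST B", "C1:LD B"), ("C0:ST A", "C1:LD A")]

violation = has_cycle(program_order + bad_reference)
consistent = not has_cycle(program_order + good_reference)
```

The depth-first search is a stand-in chosen for brevity; the slides describe the check as a distributed algorithm running on individual cores, which this single-threaded sketch does not model.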
14/22
[Figure: L2 cache with data access vector; Dacota control block at Core 0 through Core N-1]
directory-based MOESI cache coherence 4MB Mesh network
15/22
Bug: description (cycles to expose bug)
shared-store: store to a shared line may not invalidate other caches (0.3M)
invisible-store: store message may not reach all cores (1.3M)
store-alloc1: store allocation in any core may not occur properly (1.9M)
store-alloc2: store allocation in one core may not occur properly (2.3M)
reorder1: invalid store reordering (all cores) (1.4M)
reorder2: invalid store reordering (one core) (2.8M)
reorder3: invalid store reordering (single address, all cores) (2.9M)
reorder4: invalid store reordering (single address, one core) (5.6M)
– Directed random stimulus: memory intensive
– SPLASH2 Benchmarks
16/22
[Figure: performance overhead (%), split into computation overhead and communication overhead, with average; axis 20 to 120, one bar labeled 299]
17/22
[Figure: performance overhead (%), split into computation overhead and communication overhead, with average; axis 10 to 50]
Pre-Silicon: 100,000,000 %
Traditional Post-Silicon: 10,000 %
DACOTA Post-Silicon: 60 %
18/22
Area Overhead - Storage
DACOTA: 544 B
Chen, et al., 2008: 617,472 B
Meixner, et al., 2006: 940,032 B

Performance overhead
Pre-Silicon: 100,000,000 %
Traditional Post-Silicon: 10,000 %
DACOTA Post-Silicon: 60 %
Runtime: 0 %
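To put these figures on a common scale, the short calculation below converts each overhead percentage into a slowdown factor and each storage figure into a ratio against DACOTA's 544 B. It assumes that an overhead of p % means a run takes (1 + p/100) times as long; the helper names `slowdown_factor` and `storage_ratio` are introduced here for illustration only.

```python
def slowdown_factor(overhead_percent):
    """Assumed reading: p % overhead means (1 + p/100) times the baseline time."""
    return 1 + overhead_percent / 100


def storage_ratio(other_bytes, dacota_bytes=544):
    """How many times more storage than DACOTA's 544 B."""
    return other_bytes / dacota_bytes


# Figures as listed on the slide.
performance_overhead = {
    "Pre-Silicon": 100_000_000,
    "Traditional Post-Silicon": 10_000,
    "DACOTA Post-Silicon": 60,
}
storage_bytes = {
    "Chen, et al., 2008": 617_472,
    "Meixner, et al., 2006": 940_032,
}

slowdowns = {name: slowdown_factor(p) for name, p in performance_overhead.items()}
ratios = {name: storage_ratio(b) for name, b in storage_bytes.items()}
```

Under that reading, DACOTA runs at about 1.6 times the baseline time versus about 101 times for traditional post-silicon, and the two compared schemes need roughly 1,135 and 1,728 times the storage.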
19/22
[Figure: overhead due to communication (%) vs. core activity log entries (64 to 2048). Panels: random benchmarks (large_1000_shared, barrier, locks, small_0_shared, average) and SPLASH2 benchmarks (radix, lu, cholesky, fft, average)]
20/22
[Figure: overhead due to checking algorithm (%) vs. core activity log entries (64 to 2048). Panels: SPLASH2 benchmarks (radix, lu, cholesky, fft, average) and random benchmarks (large_1000_shared, barrier, locks, small_0_shared, average); an "ideal trade-off" point is marked]
21/22
Meixner, et al., 2006; Chen, et al., 2008
protection against transient faults
functional errors
Dill, et al., 1992; Abts, et al., 1993; Pong, et al., 1997; German, et al., 2003
possible for abstract protocol
implementation
Josephson, et al., 2006; Paniccia, et al., 1998; Whetsel, et al., 1991; Tsang, et al., 2000
DeOrio, et al., 2008
but not consistency
22/22
– Enables self-detection of memory ordering errors
– 100x more coverage than traditional post-silicon
– 0.01% area overhead on OpenSPARC T1
– Disable on shipment