Valeria Bertacco Advanced Computer Architecture Laboratory - - PowerPoint PPT Presentation


DACOTA: Post-silicon validation of the memory subsystem in multi-core designs. Andrew DeOrio, Ilya Wagner, Valeria Bertacco. Advanced Computer Architecture Laboratory, University of Michigan, Ann Arbor. HPCA 2009.


SLIDE 1

DACOTA: Post-silicon validation of the memory subsystem in multi-core designs

Andrew DeOrio, Ilya Wagner, Valeria Bertacco

Advanced Computer Architecture Laboratory, University of Michigan, Ann Arbor

HPCA 2009

SLIDE 2

Multi-core Designs

  • Many simple processors
  • Communicate through interconnect network

[Figure: Intel Polaris and Tilera TILE64]

SLIDE 3

Complex Multi-core: Memory Subsystem

  • Cache coherence: the ordering of operations to a single cache line
  • Memory consistency: controls the ordering of operations among different memory addresses

[Figure: Core 0 ... Core N-1, each with an L1 cache, connected through an interconnect to the L2 cache]

The memory subsystem is hard to verify

SLIDE 4

The Verification Landscape

Pre-Silicon (bugs exposed: 98%, effort: 70%)

  • Slow: ~Hz
  • Stimuli generators
  • Random testers
  • Formal verification

Post-Silicon (bugs exposed: 2%, effort: 30%)

  • Fast: at-speed
  • Early HW prototypes
  • Hard-to-find bugs
  • Relatively new technology
    – Ad-hoc

Runtime (bugs exposed: <1%, effort: 0%)

  • Fast: at-speed
  • Research ideas
    – Austin, Malik, Sorin
  • Microcode patching
    – Intel, AMD

[Figure: pre-silicon flow, with stimuli driving a logic simulation of the RTL]

SLIDE 5

Post-Silicon Validation Today

[Figure: a test generator drives both the silicon prototype and the simulation servers; the prototype's final state is compared (=) with the simulation final state; the critical path is marked]

Simulation is the bottleneck of the validation process

SLIDE 6

Escaped Bugs in the Memory Subsystem

  • 10% of the bugs that made it to product are related to the memory subsystem

Excerpt from Specification Update [Nov. 2007]:

bug AW38 (No Fix): Instruction fetch may cause a livelock during snoops of the L1 data cache

Memory related bugs are hard to find

SLIDE 7

Post-Silicon Design Goals

  • High coverage
    – Enable self-detection of memory ordering errors
    – Coherence and consistency errors
  • Low area impact
  • No performance impact after shipment

[Figure: the same validation flow (test generator, prototype's final state = simulation final state), with a self check marked on the critical path]

SLIDE 8

DACOTA: Data Coloring for Consistency Testing and Analysis

Post-silicon validation for the memory subsystem

  • Logging
    – stores ordering info
    – uses cache storage temporarily
  • Checking
    – starts when storage fills
    – distributed algorithm on individual cores

[Figure: Core 0 ... Core N-1, each with an L1 cache, connected through an interconnect to the L2 cache; a timeline alternates benchmark execution and check phases]
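The alternation between logging and checking can be pictured with a minimal sketch. It assumes a fixed log capacity and a caller-supplied `check` function; both names, and the final flush of a partially filled log, are illustrative assumptions rather than details taken from the slides.

```python
def run_with_checks(operations, log_capacity, check):
    """Log operations until the log is full, then run a check and start over.

    `operations` is any iterable of logged events, `log_capacity` is the
    (hypothetical) number of entries the borrowed cache storage can hold,
    and `check` is a callable that inspects one full log.
    """
    log = []
    results = []
    for op in operations:
        log.append(op)                   # logging phase: record ordering info
        if len(log) == log_capacity:     # storage is full
            results.append(check(log))   # checking phase
            log = []                     # resume the benchmark with an empty log
    if log:                              # assumption: also check a partial log at the end
        results.append(check(log))
    return results
```

With `check=len`, ten operations and a capacity of four produce three check phases that see 4, 4 and 2 entries.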

SLIDE 9

Low Overhead Logging Architecture

  • DACOTA controller augments cache controller logic
  • Reconfigures a portion of cache for activity log

[Figure: Core 0 ... Core N-1, each with DACOTA control logic and a cache holding data and access vectors (the DACOTA components), connected through an interconnect to the L2 cache]
SLIDE 10

Low Overhead Logging Architecture

  • Attach access vector to each cache line
    – Tracks the order of memory accesses to one line
    – Entry for each core stores a sequence ID
  • Allocate space for activity log
    – Stores a sequence of access vectors in program order

[Figure: a cache line with its data and access vector (one entry for each core) and a counter, updated by the sequence 1. core 0 store, 2. core 1 store, 3. core 0 store; Core 0 ... Core N-1, each with DACOTA control logic, connect through an interconnect to the L2 cache]
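One way to read the access vector is sketched below. The update rule is an assumption: a per-line counter is incremented on every access, the accessing core's entry takes the new value, and a snapshot of the vector is appended to that core's activity log. The names `TrackedLine` and `record_access`, the address `0xA`, and the use of 0 for "not accessed yet" are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TrackedLine:
    """One cache line plus its access vector (one entry per core)."""
    num_cores: int
    counter: int = 0                      # assumed per-line sequence counter
    access_vector: list = field(default_factory=list)

    def __post_init__(self):
        # 0 means "this core has not touched the line yet" (assumed encoding).
        self.access_vector = [0] * self.num_cores


def record_access(line, core, op, addr, activity_logs):
    """Stamp the access with the next sequence ID and log a snapshot."""
    line.counter += 1
    line.access_vector[core] = line.counter
    activity_logs[core].append((op, addr, tuple(line.access_vector)))


# The three stores from the slide: core 0, core 1, core 0.
line = TrackedLine(num_cores=2)
logs = {0: [], 1: []}
for core in (0, 1, 0):
    record_access(line, core, "ST", 0xA, logs)
```

Under these assumptions the vector ends at `[3, 2]`: core 0's latest access has sequence ID 3 and core 1's has sequence ID 2.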

SLIDE 11

Checking Algorithm – On Site

  • Compares activity logs from L1 caches
  • Distributed algorithm runs on cores
  • 1. Aggregate logs
  • 2. Construct graph (protocol specific)
  • many protocols supported: SC, TSO, processor consistency, weak consistency
  • 3. Search graph for cycles, indicating ordering violation

[Figure: Core 0 ... Core N-1 with their L1 caches, interconnect and L2 cache; activity logs of stores and loads (ST A1, ST B1, ST A2, LD B1, ST C1, ST D1, ST C2, ST E1, ST B2, …) are aggregated into a graph]
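Step 3 is a standard cycle search on a directed graph. The sketch below uses a plain recursive depth-first search over a dictionary of successor lists; it is a generic illustration and not the distributed on-chip algorithm from the talk. The function name `find_cycle` and the graph encoding are assumptions.

```python
def find_cycle(graph):
    """Return a list of nodes forming a cycle, or None if the graph is acyclic.

    `graph` maps each node to an iterable of its successors.
    """
    UNSEEN, ACTIVE, DONE = 0, 1, 2
    state = {}
    path = []          # nodes on the current DFS path

    def visit(node):
        state[node] = ACTIVE
        path.append(node)
        for succ in graph.get(node, ()):
            succ_state = state.get(succ, UNSEEN)
            if succ_state == ACTIVE:
                # Back edge: the path from succ to node, plus succ, is a cycle.
                return path[path.index(succ):] + [succ]
            if succ_state == UNSEEN:
                found = visit(succ)
                if found:
                    return found
        path.pop()
        state[node] = DONE
        return None

    for node in list(graph):
        if state.get(node, UNSEEN) == UNSEEN:
            found = visit(node)
            if found:
                return found
    return None
```

For example, `find_cycle({"a": ["b"], "b": ["c"], "c": ["a"]})` reports the cycle a, b, c, a, while a chain without the edge back to `a` returns `None`. Recursion keeps the sketch short; very large graphs would need an explicit stack.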

SLIDE 12

Example – Sequential Consistency

Issue Order

  [C1] store to address 0xC
  [C0] load from address 0xC
  [C1] load from address 0xB
  [C0] store to address 0xA
  [C0] store to address 0xB
  [C1] load from address 0xA

Actual Order

  [C1] store to address 0xC
  [C0] load from address 0xC
  [C1] load from address 0xA
  [C0] store to address 0xA
  [C0] store to address 0xB
  [C1] load from address 0xB

[Figure: Core 0 and Core 1, each with DACOTA control logic and a cache showing tag, data and log columns; the access vectors and activity logs for addresses 0xA, 0xB and 0xC fill in as the operations execute]

SLIDE 13

Example – Sequential Consistency

[Figure: the activity logs of both cores and the graph built from them, with nodes LD 0xC, ST 0xC, ST 0xA, ST 0xB, LD 0xA and LD 0xB, connected by program order edges and address reference edges]

cycle indicates violation
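The example can be replayed in a few lines. The sketch below encodes each operation as a hypothetical `(core, kind, address)` tuple, derives program order edges from the issue order and address reference edges from the actual order, and uses Python's `graphlib` (3.9+) to test for a cycle. The edge rules are a simplified reading of the slides for sequential consistency: every pair of consecutive operations on one core is ordered, and two consecutive accesses to one address are ordered when at least one is a store. The sketch also relies on every operation being unique, which holds here but not in general.

```python
from graphlib import TopologicalSorter, CycleError

# (core, kind, address) for the six operations of the example.
ISSUE_ORDER = [
    (1, "ST", 0xC), (0, "LD", 0xC), (1, "LD", 0xB),
    (0, "ST", 0xA), (0, "ST", 0xB), (1, "LD", 0xA),
]
ACTUAL_ORDER = [
    (1, "ST", 0xC), (0, "LD", 0xC), (1, "LD", 0xA),
    (0, "ST", 0xA), (0, "ST", 0xB), (1, "LD", 0xB),
]


def program_order_edges(issue_order):
    """Edge between consecutive operations issued by the same core."""
    last_op, edges = {}, []
    for op in issue_order:
        core = op[0]
        if core in last_op:
            edges.append((last_op[core], op))
        last_op[core] = op
    return edges


def address_reference_edges(actual_order):
    """Edge between consecutive accesses to one address if either is a store."""
    last_op, edges = {}, []
    for op in actual_order:
        addr = op[2]
        prev = last_op.get(addr)
        if prev is not None and "ST" in (prev[1], op[1]):
            edges.append((prev, op))
        last_op[addr] = op
    return edges


def violates_sc(issue_order, actual_order):
    """True if the combined ordering graph contains a cycle."""
    predecessors = {}
    edges = program_order_edges(issue_order) + address_reference_edges(actual_order)
    for src, dst in edges:
        predecessors.setdefault(dst, set()).add(src)
    try:
        TopologicalSorter(predecessors).prepare()   # raises CycleError on a cycle
    except CycleError:
        return True
    return False
```

With the orders from the previous slide, `violates_sc(ISSUE_ORDER, ACTUAL_ORDER)` is true: the cycle runs LD 0xB, LD 0xA, ST 0xA, ST 0xB and back to LD 0xB. Replaying the issue order as the actual order yields no cycle.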

SLIDE 14

Experimental Setup

  • Implemented checkers in GEMS simulator
  • Created buggy versions of cache controllers
  • TSO consistency model

[Figure: simulated system, labeled "16 cores", "Mesh network", "4MB", "L2 Cache" and "directory-based MOESI cache coherence"; each core has DACOTA control logic, with data and access vectors in its cache]

SLIDE 15

Experimental Setup

  • Bugs inspired by bugs found in processor errata
  • Injected one at a time

Bug              Description                                              Cycles to Expose Bug
shared-store     store to a shared line may not invalidate other caches   0.3M
invisible-store  store message may not reach all cores                    1.3M
store-alloc1     store allocation in any core may not occur properly      1.9M
store-alloc2     store allocation in one core may not occur properly      2.3M
reorder1         invalid store reordering (all cores)                     1.4M
reorder2         invalid store reordering (one core)                      2.8M
reorder3         invalid store reordering (single address, all cores)     2.9M
reorder4         invalid store reordering (single address, one core)      5.6M

  • Testbenches
    – Directed random stimulus: memory intensive
    – SPLASH2 Benchmarks

SLIDE 16

Performance Impact - Random

[Figure: bar chart of performance overhead (%) for the random benchmarks, split into computation overhead and communication overhead, with the average; the axis runs from 0 to 120 and one value is labeled 299]

SLIDE 17

Performance Impact – SPLASH2

[Figure: bar chart of performance overhead (%) for the SPLASH2 benchmarks, split into computation overhead and communication overhead, with the average; the axis runs from 0 to 50]

Pre-Silicon                100,000,000 %
Traditional Post-Silicon   10,000 %
DACOTA Post-Silicon        60 %

100x more tests!

SLIDE 18

Area Impact

Area Overhead - Storage

DACOTA                   544 B
Chen, et al., 2008       617,472 B
Meixner, et al., 2006    940,032 B

Pre-Silicon                100,000,000 %
Traditional Post-Silicon   10,000 %
DACOTA Post-Silicon        60 %
Runtime                    0 %

  • Implemented DACOTA in Verilog
  • 0.01% overhead in OpenSPARC T1
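As a quick check of scale, the storage figures above can be turned into ratios. The sketch only divides the numbers quoted on the slide; the dictionary and function names are illustrative.

```python
# Storage overhead in bytes, as quoted on the slide.
STORAGE_BYTES = {
    "DACOTA": 544,
    "Chen, et al., 2008": 617_472,
    "Meixner, et al., 2006": 940_032,
}


def storage_ratio(other, baseline="DACOTA"):
    """How many times more storage `other` needs than `baseline`."""
    return STORAGE_BYTES[other] / STORAGE_BYTES[baseline]
```

By these figures the two runtime schemes need roughly 1,135 and 1,728 times the storage that DACOTA does.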
SLIDE 19

Communication Overhead

[Figure: two charts of overhead due to communication (%) versus core activity log entries (64, 128, 256, 512, 1024, 2048). Random benchmarks: large_1000_shared, barrier, locks, small_0_shared and the average, on an axis from 0 to 350. SPLASH2 benchmarks: radix, lu, cholesky, fft and the average, on an axis from 0 to 40]

SLIDE 20

Checking Algorithm Overhead

[Figure: two charts of overhead due to the checking algorithm (%) versus core activity log entries (64, 128, 256, 512, 1024, 2048). SPLASH2 benchmarks: radix, lu, cholesky, fft and the average, on an axis from 0 to 120. Random benchmarks: large_1000_shared, barrier, locks, small_0_shared and the average, on an axis from 0 to 700. An "ideal trade-off" is marked]

SLIDE 21

Related Work

Pre-Silicon: Dill, et al., 1992; Abts, et al., 1993; Pong, et al., 1997; German, et al., 2003

  • Formal verification possible for abstract protocol
  • Insufficient for implementation

Post-Silicon: Josephson, et al., 2006; Paniccia, et al., 1998; Whetsel, et al., 1991; Tsang, et al., 2000

  • Post-Si testing

Post-Silicon: DeOrio, et al., 2008

  • Post-Si verification
  • Verifies coherence, but not consistency

Runtime: Meixner, et al., 2006; Chen, et al., 2008

  • Effective for protection against transient faults
  • Problematic for functional errors
  • High area overhead

SLIDE 22

Conclusions

  • DACOTA is an on-chip post-silicon debugging solution for detecting errors in memory ordering
    – Enables self-detection of memory ordering errors
  • Effective at catching bugs
    – 100x more coverage than traditional post-silicon
  • Very low area overhead
    – 0.01% area overhead on OpenSPARC T1
  • No performance impact to end user
    – Disable on shipment