System Challenges and Opportunities for Transactional Memory



slide-1
SLIDE 1

System Challenges and Opportunities for Transactional Memory

JaeWoong Chung

Computer System Lab, Stanford University

slide-2
SLIDE 2

My thesis is about

  • Computer system design that helps leverage hardware parallelism
  • Transactional memory (TM) for easy parallel programming
  • Contributions
      • Challenges to building an efficient and practical TM system
      • Opportunities to use TM beyond parallel programming

slide-3
SLIDE 3

Multi Core Processors

  • No more frequency race
  • The era of multi cores
  • Parallel programming is not easy
      • Split a sequential task into multiple sub tasks

[Figure: performance versus year for Pentium (1993), Pentium 4 (2000), and Core Duo (2006)]

slide-4
SLIDE 4

Locking is hard to use

  • Synchronize access to shared data
  • Coarse-grain locking
      • Easy to program
      • The other task is blocked
  • Fine-grain locking
      • High concurrency
      • Hard to use
          • Deadlock, priority inversion, …
      • High locking overhead

[Figure: object reference graph (e.g. Java and C++) with objects 1 to 6 linked by references and accessed by Task1 and Task2]
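The trade-off between the two locking styles can be sketched in Python. This is only an illustration: the `Node` class, the function names, and the choice of which objects each task touches are assumptions, not taken from the slide. The fine-grain version avoids deadlock by always acquiring locks in a fixed order, which is exactly the kind of discipline that makes it hard to use.

```python
import threading

class Node:
    """One object in the reference graph, with its own lock (hypothetical)."""
    def __init__(self, ident):
        self.ident = ident
        self.value = 0
        self.lock = threading.Lock()

GRAPH_LOCK = threading.Lock()  # the single lock used by coarse-grain locking

def coarse_update(nodes):
    # Coarse-grain: easy to write, but any other task is blocked meanwhile.
    with GRAPH_LOCK:
        for n in nodes:
            n.value += 1

def fine_update(nodes):
    # Fine-grain: tasks touching disjoint objects can proceed concurrently.
    # Locks are acquired in a fixed global order (by ident) to avoid deadlock.
    ordered = sorted(nodes, key=lambda n: n.ident)
    for n in ordered:
        n.lock.acquire()
    try:
        for n in nodes:
            n.value += 1
    finally:
        for n in reversed(ordered):
            n.lock.release()

def run_tasks(update, graph, rounds=1000):
    # Assumed access pattern: Task1 touches 1, 3, 5 and Task2 touches 5, 2, 6.
    def task(idents):
        for _ in range(rounds):
            update([graph[i] for i in idents])
    threads = [threading.Thread(target=task, args=(ids,))
               for ids in ([1, 3, 5], [5, 2, 6])]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Calling `run_tasks(fine_update, graph)` with `graph = {i: Node(i) for i in range(1, 7)}` increments the shared object 5 from both tasks while objects private to one task are never contended.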

slide-5
SLIDE 5

Transactional Memory

slide-6
SLIDE 6

Transactional Memory

  • Atomic and isolated execution of instructions
      • Atomicity: all or nothing
      • Isolation: no intermediate results
  • Programmer
      • A transaction encloses instructions
      • Logically sequential execution of transactions
  • TM system
      • Transactions are executed in parallel without conflict
      • If conflict, one of them is aborted and restarts

TX_Begin
  // Instructions for Task1
TX_End

TX_Begin
  // Instructions for Task2
TX_End

slide-7
SLIDE 7

TM Example

  • Data versioning
      • At TX_Begin, save register values
      • At write, save old memory values
  • Conflict detection
      • Read-set and write-set per transaction
      • Conflict detection with set comparison

[Figure: object graph with objects 1 to 6; Tx1 accesses objects 1 and 3, Tx2 accesses objects 2, 5, and 6, with each access marked as a read (R) or a write (W)]
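The two mechanisms on this slide can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the `Transaction` class and `conflicts` function are hypothetical names, memory is a plain dictionary, old values are kept in an undo log, and the register checkpoint taken at TX_Begin is not modeled.

```python
class Transaction:
    """Tracks a read-set, a write-set, and an undo log for one transaction."""
    def __init__(self, name, memory):
        self.name = name
        self.memory = memory
        self.read_set = set()
        self.write_set = set()
        self.undo_log = {}  # address -> value before the first write

    def read(self, addr):
        self.read_set.add(addr)
        return self.memory[addr]

    def write(self, addr, value):
        # Data versioning: save the old value once, then update in place.
        if addr not in self.undo_log:
            self.undo_log[addr] = self.memory[addr]
        self.write_set.add(addr)
        self.memory[addr] = value

    def abort(self):
        # Roll back by restoring every saved old value.
        for addr, old in self.undo_log.items():
            self.memory[addr] = old
        self.read_set.clear()
        self.write_set.clear()
        self.undo_log.clear()

def conflicts(a, b):
    """Set comparison: a write by one transaction overlaps any access by the other."""
    return bool(a.write_set & (b.read_set | b.write_set)
                or b.write_set & a.read_set)
```

For example, a transaction that reads object 1 and writes object 3 does not conflict with one that reads 2 and writes 5 and 6; as soon as the second one also reads object 3, the set comparison reports a conflict and one of the two is aborted.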

slide-8
SLIDE 8

TM Benefits

  • Logically sequential execution of transactions
  • Optimistic concurrency control for parallel transaction execution
  • No deadlock, priority inversion, or convoying
  • TM system handles pathological cases
  • Composability
  • Error recovery

slide-9
SLIDE 9

TM System Design

  • Many proposals in hardware and software
  • Hardware acceleration for TM is crucial for performance
      • HTM is 2~3 times faster than STM
      • Correctness: strong isolation
  • Hardware TM
      • In the beginning
          • Register checkpoint
      • At memory access
          • Set read/write bits per cache line
          • Buffer new values in cache or log old values
      • Conflict detection
          • With cache coherence protocol
          • With transaction validation protocol

slide-10
SLIDE 10

Hardware TM Example

  • TM hardware
  • TM programs

[Figure: Core 1 and Core 2, each with a register checkpoint (Regs') and an L1 cache whose lines hold ADDR, DATA, R, and W fields, connected to memory over a bus. Core 1 runs Tx1 and holds lines 1, 3, and 5; Core 2 runs Tx2 and holds lines 2 and 6. The animation shows a load of address 1 and a load of address 5 that triggers a conflict.]
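A rough Python sketch of the per-line read/write bits and of conflict detection on a coherence request follows. The `TxCache` class and its method names are hypothetical, and which accesses in the figure are loads or stores is an assumption, since the figure did not survive extraction.

```python
class TxCache:
    """L1 cache model that keeps only the R and W bits per cache line."""
    def __init__(self, name):
        self.name = name
        self.lines = {}  # address -> {"R": bool, "W": bool}

    def _line(self, addr):
        return self.lines.setdefault(addr, {"R": False, "W": False})

    def load(self, addr):
        self._line(addr)["R"] = True   # transactional read sets the R bit

    def store(self, addr):
        self._line(addr)["W"] = True   # transactional write sets the W bit

    def snoop(self, addr, is_write):
        """Check a remote coherence request against the local R/W bits."""
        line = self.lines.get(addr)
        if line is None:
            return False
        # A remote access to a written line conflicts; a remote write
        # to a line that was only read conflicts as well.
        return line["W"] or (is_write and line["R"])

    def clear(self):
        self.lines.clear()             # on commit or abort
```

With Core 1 having loaded address 1 and stored to addresses 3 and 5, a remote load of address 5 arriving on the bus is flagged as a conflict, while a remote load of address 1 is not.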

slide-11
SLIDE 11

Challenges and Opportunities

  • How to build an efficient TM system tuned for the common case?
  • How to build a practical TM system that deals with the uncommon case?
  • Can we use TM to support system software?
  • Can we use TM to improve other important system metrics?

slide-12
SLIDE 12

Contributions

  • Challenges to building TM systems
      • Common case behavior of parallel programs
          • Extract architectural parameters for efficient TM system design
      • TM virtualization
          • Overcome the limitation of TM hardware
  • Opportunities for systems beyond parallel programming
      • Multithreading for dynamic binary translation
          • Guarantee correctness of DBT
      • Support for reliability, security, and fast memory snapshot
          • Improve important system metrics other than performance

slide-13
SLIDE 13

Outline

  • Software parallelization: a major issue for performance
  • Transactional memory
  • Challenges to building TM systems
      • Common case behavior of parallel programs
      • TM virtualization
  • Opportunities for systems beyond parallel programming
      • Multithreading for dynamic binary translation
      • Support for reliability, security, and fast memory snapshot
  • Conclusion

slide-14
SLIDE 14

Challenges to Building TM Systems

slide-15
SLIDE 15

Challenge 1: Common Case Behavior of Parallel Programs

  • Goal
      • Understand the common case behavior of TM programs
  • Few TM programs available
      • More TM programs now, but for research purposes
  • Few efficient TM systems as development tools

“chicken & egg problem”

slide-16
SLIDE 16

Inferring Transactions in Multithreaded Programs

  • Analyze existing parallel programs
      • Assumption: the inherent parallelism remains regardless of programming tools
  • Mapping programming primitives to transactions

Programming primitive    Transaction primitive
Lock/Unlock              Begin/End
Parallel_For             Begin/End
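The mapping in the table can be illustrated with a small trace analyzer in Python. This is a sketch under assumptions, not the tool used in the study: the trace format (a list of `"lock"`, `"unlock"`, and `"instr"` events for one thread) and the function name are hypothetical, and nested locks are flattened into the outermost transaction.

```python
def infer_transactions(trace):
    """Return the length, in instructions, of each inferred transaction.

    A transaction begins at an outermost "lock" event and ends at the
    matching "unlock"; "instr" events outside any lock are not counted.
    """
    lengths = []
    depth = 0      # current lock nesting depth
    current = 0    # instructions seen in the open transaction
    for event in trace:
        if event == "lock":
            if depth == 0:
                current = 0          # Lock   -> transaction Begin
            depth += 1
        elif event == "unlock":
            if depth == 0:
                raise ValueError("unlock without a matching lock")
            depth -= 1
            if depth == 0:
                lengths.append(current)  # Unlock -> transaction End
        elif event == "instr" and depth > 0:
            current += 1
    return lengths
```

The resulting list of lengths is the kind of raw data from which averages and percentiles of transaction length can then be computed.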

slide-17
SLIDE 17

Parallel Applications

  • Different domains and different languages

Languages    Applications
Java         MolDyn, MonteCarlo, RayTracer, Crypt, LUFact, Series, SOR, SparseMatmult, SPECjbb2000, PMD, HSQLDB
Pthread      Apache, Kingate, Bp-vision, Localize, Ultra Tic Tac Toe, MPEG2, AOL Server
ANL          Barnes, Mp3d, Ocean, Radix, FMM, Cholesky, Radiosity, FFT, Volrend, Water-N2, Water-Spatial
OpenMP       APPLU, Equake, Art, Swim, CG, BT, IS

slide-18
SLIDE 18

Key Metrics of TM Programs

  • Measure the key metrics of TM programs
  • Use the metrics to make suggestions for TM designs

Key metrics                                   Architectural parameters
Transaction length                            TM primitive overhead
Read-/write-set size                          Buffer size
Write-set to length ratio                     Transaction commit/abort overhead
Frequency of nesting & I/O in transactions    Support for nesting and system calls

slide-19
SLIDE 19

Transaction Length

Number of instructions executed in transaction

Length in instructions
Application         Avg     50th %    95th %    Max
Java average        5949    149       4256      13519488
Pthreads average    879     805       1056      22591
ANL average         256     114       772       16782

Observation: Up to 95% of transactions < 5000 instructions

Suggestion: Light-weight transactional primitives

Observation: Rare but long transactions

Suggestion: Transaction over context-switching

slide-20
SLIDE 20

Read-/Write-Set Size

[Figure: two plots of read-set size and write-set size in Kbytes (log scale) versus percentile of transactions, for ANL, Java, and Pthreads applications]

  • Bytes of data read/written by transaction
  • Observation: 98% of transactions have <16KB read-set and <6KB write-set
      • Suggestion: 32K L1 cache is sufficient
  • Observation: few very large transactions > 32K
      • Suggestion: need for buffer space virtualization

slide-21
SLIDE 21

Outline

  • Software parallelization: a major issue for performance
  • Transactional memory
  • Challenges to building TM systems
      • Common case behavior of parallel programs
      • TM virtualization
  • Opportunities for systems beyond parallel programming
      • Multithreading for dynamic binary translation
      • Support for reliability, security, and fast memory snapshot
  • Conclusion

slide-22
SLIDE 22

Challenge 2: TM Virtualization

  • Problem
      • Limited hardware resources tuned for common cases
          • E.g. buffer size for 99% of transactions
      • How do we cover uncommon cases as well?
  • Cache as buffer for transactional data
      • What if cache capacity is exhausted? => Space virtualization
  • What if a transaction is interrupted?
      • Time virtualization
  • What if transactions are nested deeply?
      • Depth virtualization

slide-23
SLIDE 23

XTM: eXtended TM

  • Goals
      • Virtualize TM space, time, and depth at low HW cost
      • Completely transparent to user SW
      • Minimize interference with coexisting HW transactions
  • Assumption
      • Overflows, interrupts, and deep nesting are rare
  • Approach
      • Transactional data and metadata in virtual memory
          • Using virtual memory support in OS
      • Data versioning & conflict detection at page granularity
          • Similar to page-based software DSM systems

slide-24
SLIDE 24

XTM Overview

  • Basic operation
      • On HTM overflow, roll back and restart in SW mode
      • At the first access, create a copy of the original (master) page
      • Change the address mapping to the copy (private page)
          • Transactional data in private page, committed data in master page
      • At commit, make the private page the new master page
      • All orchestrated by the operating system (no HW)
  • Conflict detection
      • Use TLB shoot-downs to gain exclusive page access
  • Hardware requirement
      • Overflow exception
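The basic operation above can be sketched as page-granularity copy-on-access in Python. This is an illustrative model only: `PagedMemory`, `SoftwareTx`, and the tiny `PAGE_SIZE` are hypothetical, a dictionary stands in for the page table, a single transaction is modeled, and conflict detection through TLB shoot-downs is not modeled at all.

```python
PAGE_SIZE = 4  # words per page; deliberately tiny for illustration

class PagedMemory:
    """Committed state: one master page per page number."""
    def __init__(self, num_pages):
        self.master = {p: [0] * PAGE_SIZE for p in range(num_pages)}

class SoftwareTx:
    """A transaction running in software mode after an HTM overflow."""
    def __init__(self, mem):
        self.mem = mem
        self.private = {}  # page number -> private copy of that page

    def _page(self, addr):
        page, offset = divmod(addr, PAGE_SIZE)
        if page not in self.private:
            # First access: copy the master page and remap to the copy.
            self.private[page] = list(self.mem.master[page])
        return self.private[page], offset

    def read(self, addr):
        page, offset = self._page(addr)
        return page[offset]

    def write(self, addr, value):
        page, offset = self._page(addr)
        page[offset] = value  # transactional data stays in the private page

    def commit(self):
        # Commit: each private page becomes the new master page.
        for number, copy in self.private.items():
            self.mem.master[number] = copy
        self.private = {}

    def abort(self):
        self.private = {}     # discard private pages; master is untouched
```

Until `commit` is called, the master pages hold only committed data, so an abort costs nothing more than dropping the private copies.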

slide-25
SLIDE 25

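XTM Example

[Figure: timeline of an XTM transaction (HTM overflow, transactional read, transactional write, commit) showing the page table pointer, the master page table and master page, the per-transaction page table for level 0 with its private copy, and the XTM metadata with R and W bits]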

slide-26
SLIDE 26

Hardware Acceleration

  • XTM-g
      • Gradual page-by-page switching
          • Reduce the switch overhead from hardware mode to software mode
          • A portion of transactional data in private pages, the rest in the cache
      • Hardware requirement: OV (overflow) bit in page table
  • XTM-e
      • Additional buffer for overflowed read/write bits
          • Reduce false conflicts at page granularity
      • Hardware requirement: eviction buffer

slide-27
SLIDE 27

Time Virtualization

  • Interrupt and context-switch procedure

[Figure: flowchart. On an interrupt, ask "Can wait?"; if yes, wait for the short transaction to finish. If no, ask "Young transaction?"; if yes, abort the young transaction, otherwise switch to software mode. The figure also carries the labels "Naïve Approach" and "Rare case".]
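The interrupt procedure can be written as a short decision function in Python. This is a sketch of one reading of the flowchart: the function name, the `tx_age` measure, and the `young_threshold` value are hypothetical, and the order of the branches is an assumption because the figure did not survive extraction intact.

```python
def on_interrupt(can_wait, tx_age, young_threshold=1000):
    """Choose how to handle an interrupt that arrives during a transaction.

    can_wait        -- True if the interrupt can be delayed briefly
    tx_age          -- work done by the transaction so far (assumed unit: instructions)
    young_threshold -- hypothetical cutoff below which aborting is cheap
    """
    if can_wait:
        return "wait"                # let the short transaction finish
    if tx_age < young_threshold:
        return "abort"               # little work is lost by restarting
    return "switch_to_software"      # virtualize the long transaction
```

Only the last branch needs the virtualization machinery; the first two handle the interrupt without leaving hardware mode.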

slide-28
SLIDE 28

Performance Analysis

  • XTM causes no cost for applications without overflow
  • XTM-g presents a good cost/performance tradeoff point
  • 20% faster to 50% slower than a fully-hardware solution

[Figure: normalized execution time of XTM, XTM-g, XTM-e, and VTM for tomcatv [37.7%], volrend [0.01%], radix [0.26%], micro-P10 [39.2%], micro-P20 [60.3%], and micro-P30 [60.8%], broken down into Useful, Idle, Violations, Commit, Validation, and Versioning; the axis runs from 0.5 to 3 and one clipped bar is labeled 8.3]

slide-29
SLIDE 29

Outline

  • Software parallelization: a major issue for performance
  • Transactional memory
  • Challenges to building TM systems
      • Common case behavior of parallel programs
      • TM virtualization
  • Opportunities for systems beyond parallel programming
      • Multithreading for dynamic binary translation
      • Support for reliability, security, and fast memory snapshot
  • Conclusion

slide-30
SLIDE 30

Opportunities for Systems beyond Parallel Programming

slide-31
SLIDE 31

Opportunity 1: Dynamic Binary Translation

  • DBT
      • Binary code is translated at run-time
      • PIN, Valgrind, DynamoRIO, StarDBT, etc.
  • DBT use cases
      • Translation on new target architecture
      • JIT optimizations in virtual machines
      • Binary instrumentation
          • Profiling, security, debugging, …

[Figure: the DBT framework, together with a DBT tool, turns the original binary into the translated binary]

slide-32
SLIDE 32

Example: Dynamic Information Flow Tracking (DIFT)

  • Track untrusted data
      • A taint bit per memory byte
      • Security policy uses the taint bit
          • E.g. no syscall with untrusted data

Original instructions                   Instrumented instructions
t = XX; // untrusted data from network  taint(t) = 1;
…….
swap t, u1;                             swap taint(t), taint(u1);
u2 = u1;                                taint(u2) = taint(u1);

[Figure: variables t, u1, and u2 with their taint bits]
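The taint propagation in this example can be sketched in Python. The sketch makes simplifying assumptions: `TaintedMemory` and its methods are hypothetical names, taint is tracked per variable rather than per memory byte, and `syscall` is a stand-in for the security policy check rather than a real system call.

```python
class TaintedMemory:
    """Values plus one taint bit per variable (a simplification of per-byte taint)."""
    def __init__(self):
        self.value = {}
        self.taint = {}

    def store(self, name, value, untrusted=False):
        self.value[name] = value
        self.taint[name] = 1 if untrusted else 0

    def copy(self, dst, src):
        self.value[dst] = self.value[src]   # original instruction
        self.taint[dst] = self.taint[src]   # instrumented instruction

    def swap(self, a, b):
        self.value[a], self.value[b] = self.value[b], self.value[a]
        self.taint[a], self.taint[b] = self.taint[b], self.taint[a]

    def syscall(self, name):
        # Security policy: no syscall with untrusted data.
        if self.taint[name]:
            raise PermissionError(f"{name} holds untrusted data")
        return self.value[name]
```

Running the slide's sequence (store untrusted data in `t`, swap `t` with `u1`, then copy `u1` to `u2`) leaves `u1` and `u2` tainted and `t` clean, so a syscall using `u2` is rejected.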

slide-33
SLIDE 33

Problem: DBT with Parallel Program

  • Atomicity between original and instrumented instructions for correctness

Thread 1                        Thread 2
swap t, u1;                     u2 = u1;
swap taint(t), taint(u1);       taint(u2) = taint(u1);

[Figure: values and taint bits of t, u1, and u2 when the two threads interleave; the outcome is marked "Security Breach!!"]
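The breach can be reproduced deterministically by replaying one interleaving of the four instructions in Python. This is a sketch with assumed details: the `run` function, the step labels, and the initial values of `u1` and `u2` are hypothetical; only the four instructions come from the slide.

```python
def run(schedule):
    """Execute the four instructions in the given order and return the final state."""
    value = {"t": "XX", "u1": 0, "u2": 0}   # t holds untrusted data
    taint = {"t": 1, "u1": 0, "u2": 0}
    steps = {
        # Thread 1: swap t, u1 and its instrumented counterpart.
        "T1.orig":  lambda: value.update(t=value["u1"], u1=value["t"]),
        "T1.instr": lambda: taint.update(t=taint["u1"], u1=taint["t"]),
        # Thread 2: u2 = u1 and its instrumented counterpart.
        "T2.orig":  lambda: value.update(u2=value["u1"]),
        "T2.instr": lambda: taint.update(u2=taint["u1"]),
    }
    for step in schedule:
        steps[step]()
    return value, taint

# Each original instruction is immediately followed by its taint update.
atomic = ["T1.orig", "T1.instr", "T2.orig", "T2.instr"]
# Thread 2 runs between Thread 1's data swap and its taint swap.
racy = ["T1.orig", "T2.orig", "T2.instr", "T1.instr"]
```

Under the `racy` schedule, `u2` receives the untrusted value while `taint(u1)` is still 0, so `u2` ends up holding untrusted data with a clear taint bit.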

slide-34
SLIDE 34

How to Guarantee Atomicity?

  • Easy but unsatisfactory solutions
      • No multithreaded programs (StarDBT)
      • Serialization (Valgrind)
  • Hard solution: locking
      • Idea: enclose original and instrumented instructions with a lock
      • Fine-grained locks
          • Locking overhead, convoying, limited scope of DBT optimizations
      • Coarse-grained locks
          • Performance degradation
      • Lock nesting between app & DBT locks
          • Potential deadlock
      • Tool developers should be feature + multithreading experts

slide-35
SLIDE 35

Transactional Memory for Correctness of DBT

  • Idea
      • Original and instrumented instructions in a transaction
  • Advantages
      • Atomic execution
      • High performance through optimistic concurrency
      • Support for nested transactions

Thread 1                          Thread 2
TX_Begin                          TX_Begin
  swap t, u1;                       u2 = u1;
  swap taint(t), taint(u1);         taint(u2) = taint(u1);
TX_End                            TX_End

slide-36
SLIDE 36

Granularity of Transaction Instrumentation

  • Per instruction: short
      • High overhead of executing TX_Begin and TX_End
      • Limited scope for DBT optimizations
  • Per basic block: long
      • Amortizing the TX_Begin and TX_End overhead
      • Easy to match TX_Begin and TX_End
  • Per trace: longer
      • Further amortization of the overhead
      • Potentially high transaction conflict
  • Profile-based sizing: dynamic
      • Optimize transaction size based on transaction abort ratio
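Profile-based sizing can be sketched as a simple feedback rule in Python. Everything concrete here is an assumption made for illustration: the function name, the halving and doubling policy, the `high` and `low` abort-ratio thresholds, and the size limits are hypothetical and are not values from the talk.

```python
def next_tx_size(size, commits, aborts,
                 high=0.10, low=0.01, min_size=1, max_size=64):
    """Pick the next transaction size (in basic blocks) from the abort ratio.

    size            -- current transaction size
    commits, aborts -- counts observed over the last profiling window
    high, low       -- hypothetical abort-ratio thresholds
    """
    total = commits + aborts
    if total == 0:
        return size                      # nothing observed yet
    ratio = aborts / total
    if ratio > high:
        return max(min_size, size // 2)  # too many conflicts: shrink
    if ratio < low:
        return min(max_size, size * 2)   # few conflicts: amortize more
    return size
```

The rule trades the two effects listed above against each other: larger transactions amortize the TX_Begin and TX_End overhead, while smaller ones reduce the chance of conflict.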

slide-37
SLIDE 37

Baseline Performance Results

  • 8 CPUs
  • Software TM and DIFT on PIN
  • 41% overhead on average
  • Transactions at the DBT trace granularity

[Figure: normalized overhead (%), on an axis from 0% to 60%, for Barnes, Equake, Fmm, Radiosity, Radix, Swim, Tomcatv, Water, and Water-spatial]

slide-38
SLIDE 38

Hardware Acceleration

  • Overhead reduction
      • 28% with register checkpoint
      • 12% with register checkpoint + hardware signature
      • 6% with full hardware TM

[Figure: normalized overhead (%), on an axis from 0% to 60%, for Barnes, Equake, FMM, Radiosity, Radix, Swim, Tomcatv, Water, and Water-spatial, comparing SW Only, Register Checkpoint, Reg. + HW Signature, and Full HW]

slide-39
SLIDE 39

Outline

  • Software parallelization: a major issue for performance
  • Transactional memory
  • Challenges to building TM systems
      • Common case behavior of parallel programs
      • TM virtualization
  • Opportunities for systems beyond parallel programming
      • Multithreading for dynamic binary translation
      • Support for reliability, security, and fast memory snapshot
  • Conclusion

slide-40
SLIDE 40

Opportunity 2: Improving Other System Metrics

  • TM hardware consists of
      • Fine-grain data versioning HW
      • Fine-grain access tracking HW
      • Fast exception handlers
  • Can use such HW for other purposes
      • Reliability, security, …
  • The benefits for SW
      • Finer granularity (compared to VM-based approach)
      • User-level event handling (compared to VM-based approach)
      • No instrumentation overhead (compared to DBT-based approach)
      • Simplified code (compared to DBT-based approach)

slide-41
SLIDE 41

Outline for TM Hardware Application

  • Reliability
      • Global & local checkpoints (data versioning)
  • Security
      • Fine-grain read/write barriers (address tracking)
      • Isolated execution (data versioning)
  • Memory snapshot (data versioning)
      • Concurrent garbage collector
      • Dynamic memory profiler

slide-42
SLIDE 42

Memory Snapshot

  • Snapshot
      • Read-only image
      • Multiple regions
      • Shared by multiple threads
  • Applications
      • Service threads that analyze memory in parallel with app threads
      • Garbage collection, memory profiling (heap & stack), …

[Figure: garbage collection example in which mutator threads operate on memory while collector threads operate on the read-only snapshot]

slide-43
SLIDE 43

TM Hardware ⇒ Snapshot

  • Feature correspondence
      • TM metadata ⇒ track data written since snapshot
      • TM versioning ⇒ storage for progressive snapshot
          • Including virtualization mechanism
      • TM conflict detection ⇒ catch errors
          • Writes to read-only snapshot
  • Differences & additions
      • Data versioning for a single thread vs. multiple threads
      • Table to record snapshot regions
  • Resulting snapshot system
      • Fast: O(# CPUs) scan (create) and O(1) write/read
      • Small memory footprint: O(# memory locations written)
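The footprint claim can be illustrated with a software model of a progressive snapshot in Python. This is a sketch under assumptions, not the hardware design: `SnapshotMemory` is a hypothetical class, a dictionary plays the role of the TM versioning storage, and the per-CPU work of creating the snapshot and the table of snapshot regions are not modeled.

```python
class SnapshotMemory:
    """Memory with one read-only snapshot built progressively on writes."""
    def __init__(self, size):
        self.mem = [0] * size
        self.old = None  # address -> value at snapshot time, written locations only

    def take_snapshot(self):
        # Nothing is copied up front; old values are saved lazily on write.
        self.old = {}

    def write(self, addr, value):
        if self.old is not None and addr not in self.old:
            self.old[addr] = self.mem[addr]  # first write since the snapshot
        self.mem[addr] = value

    def read(self, addr):
        return self.mem[addr]                # application view, O(1)

    def snapshot_read(self, addr):
        if self.old is None:
            raise RuntimeError("no snapshot has been taken")
        # Unwritten locations are read directly from current memory.
        return self.old.get(addr, self.mem[addr])
```

Only locations written after `take_snapshot` occupy extra space, which matches the O(# memory locations written) footprint, and both views are read with a constant number of lookups.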

slide-44
SLIDE 44

GC Overhead

  • Parallel GC: stop app threads & run GC threads
      • 20% to 30% overhead for memory intensive apps
  • Snapshot GC ⇒ GC is essentially free
      • Fast: stop app, take snapshot, then run GC & app concurrently
      • Simple: +100 lines over parallel GC by Boehm
      • Fundamentally simpler than any other concurrent GC

slide-45
SLIDE 45

Conclusion

  • Challenges to building TM systems
      • Common case behavior of parallel programs
          • Extract architectural parameters for efficient TM system design
      • TM virtualization
          • Overcome the limitation of TM hardware
  • Opportunities for systems beyond parallel programming
      • Multithreading for dynamic binary translation
          • Fix correctness issue for DBT
      • Support for reliability, security, and fast memory snapshot
          • Improve important system metrics other than performance

slide-46
SLIDE 46

Acknowledgement

  • KyungHae, wife
  • Parents, brother, in-laws
  • Prof. Kozyrakis, advisor
  • Prof. Olukotun, associate advisor
  • Prof. Garcia-Molina
  • Prof. Saraswat
  • TCC group mates and research colleagues
  • Korean mafia