

SLIDE 1

Stories, Not Words: Abstract Datatype Instruction Sets

Martha Kim Columbia University Workshop on New Directions in Computer Architecture 6/5/2011

Sunday, June 5, 2011

SLIDE 2

The Utilization Wall

  • Exponential decrease in the percentage of transistors that can be operated at full frequency.
  • In a 45nm TSMC process, 7% of a 300mm² die can operate at full frequency.
  • In 32nm, only 3.5%.

Moore’s Law (manufacturable transistors) vs. power budget (operable transistors)

Goulding et al. Conservation cores: Reducing the energy of mature computations. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 205–218, Pittsburgh, Pennsylvania, March 2010.

SLIDE 3

Specialization Is a Promising Approach

  • R. Hameed et al., “Understanding sources of inefficiency in general-purpose chips,” ISCA ’10
  • G. Venkatesh et al., “Conservation cores: reducing the energy of mature computations,” ASPLOS ’10
  • J. Kelm, D. Johnson, W. Tuohy, S. Lumetta, and S. Patel, “Cohesion: a hybrid memory model for accelerators,” ISCA ’10
  • H. Franke et al., “Introduction to the wire-speed processor and architecture,” IBM Journal of Research and Development, vol. 54, no. 1, pp. 3:1–3:11, 2010
  • V. Govindaraju, C. Ho, and K. Sankaralingam, “Dynamically Specialized Datapaths for Energy Efficient Computing,” HPCA ’11
  • M. Lyons, M. Hempstead, G. Wei, and D. Brooks, “The Accelerator Store framework for high-performance, low-power accelerator-based systems,” Computer Architecture Letters, vol. 9, no. 2, pp. 53–56, 2010
  • C. Cascaval, S. Chatterjee, H. Franke, K. Gildea, and P. Pattnaik, “A taxonomy of accelerator architectures and their programming models,” IBM Journal of Research and Development, vol. 54, no. 5, p. 5, 2010
  • R. Hou et al., “Efficient data streaming with on-chip accelerators: Opportunities and challenges,” HPCA ’11
  • N. Goulding et al., “GreenDroid: A mobile application processor for silicon’s dark future,” Hot Chips ’10

SLIDE 4

An Ideal Accelerator System

  • High Performance
  • Low Energy
  • Easy to Program
  • Software Portability

SLIDE 5

Accelerator Design Processes

Application

SLIDE 6

Accelerator Design Processes

Application Microarch.

SLIDE 7

Accelerator Design Processes

Application Microarch. Arch.

SLIDE 8

Accelerator Design Processes

!

Application Microarch. Arch.

Application Microarch. Arch.

SLIDE 9

Accelerator Design Processes

!

Application Microarch. Arch.

Application Microarch. Arch.

Application Arch.

SLIDE 10

Accelerator Design Processes

!

Application Microarch. Arch.

Application Microarch. Arch.

Application Arch. Microarch.

SLIDE 11

Extending Software Abstractions to Hardware

Application Libraries Machine Code Micro-ops Execution core Caches Memory

SLIDE 12

Extending Software Abstractions to Hardware

Application Libraries Machine Code Micro-ops Execution core Caches Memory

SLIDE 13

Extending Software Abstractions to Hardware

Application Libraries Machine Code Micro-ops Execution core Caches Memory

Raise HW/SW interface

SLIDE 14

Extending Software Abstractions to Hardware

Application Libraries Machine Code Micro-ops Execution core Caches Memory

Raise HW/SW interface

Extend interfaces from libraries to hardware

SLIDE 15

Extending Software Abstractions to Hardware

Application Libraries Machine Code Micro-ops Execution core Caches Memory

Raise HW/SW interface

Extend interfaces from libraries to hardware Exploit interfaces with specialized hardware

SLIDE 16

Abstract Datatype Processing

SW Arch UArch

SLIDE 17

Abstract Datatype Processing

class HashTable

put(k,v) v get(k)

SW Arch UArch

SLIDE 18

Abstract Datatype Processing

class HashTable

put(k,v)
v get(k)

put $h, $k, $v
get $h, $k, $v

SW Arch UArch

SLIDE 19

Hash Table Processor

Abstract Datatype Processing

class HashTable

put(k,v)
v get(k)

put $h, $k, $v
get $h, $k, $v

SW Arch UArch
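The slide's idea can be sketched in code. A minimal sketch (names and Python framing are illustrative, not from the talk): software programs against the HashTable interface, and the same two operations also serve as the architectural instructions (put $h, $k, $v / get $h, $k, $v) that a hash table processor would implement.

```python
class HashTable:
    """The abstract datatype: state is reachable only through put/get."""

    def __init__(self):
        self._entries = {}        # backing storage, hidden from callers

    def put(self, k, v):          # software call; maps to: put $h, $k, $v
        self._entries[k] = v

    def get(self, k):             # software call; maps to: get $h, $k, $v
        return self._entries.get(k)


h = HashTable()
h.put("cat", 3)
assert h.get("cat") == 3
assert h.get("dog") is None       # absent key
```

Because callers can only go through put/get, the implementation behind the interface is free to be a library routine or a specialized hardware unit.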

SLIDE 20

Compilation & Execution

[diagram: a Sequence Labeling application built on SparseVec and HashTable libraries; compiled code dispatches to SV, HT, and general-purpose (GP) execution units]
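Dispatch by datatype can be sketched as a routing table (opcode names and unit labels here are hypothetical): each instruction's opcode family determines which unit executes it, with the general-purpose core as the default.

```python
# Opcode families mapped to execution units; anything else runs on GP.
UNIT_FOR_OPCODE = {
    "sv.set": "SV", "sv.get": "SV", "sv.dot": "SV",   # sparse-vector unit
    "ht.put": "HT", "ht.get": "HT",                   # hash-table unit
}


def dispatch(opcode):
    """Route an opcode to its execution unit; default to the GP core."""
    return UNIT_FOR_OPCODE.get(opcode, "GP")


assert dispatch("sv.dot") == "SV"
assert dispatch("ht.put") == "HT"
assert dispatch("add") == "GP"    # ordinary instructions stay on the CPU
```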

SLIDE 21

The Software Fallback

[diagram: the same binary on two machines; where no SV unit is present, dispatch routes SV instructions to the GP core as a software fallback]
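A sketch of the fallback mechanism (all names hypothetical): on a machine without the SV unit, dispatch resolves a sparse-vector instruction to the equivalent library routine, so the same binary still runs correctly.

```python
def sv_dot_software(v1, v2):
    """Library (software) dot product over {index: value} sparse vectors."""
    return sum(x * v2.get(i, 0.0) for i, x in v1.items())


ACCELERATED = {}                        # empty: this machine has no SV unit
FALLBACK = {"sv.dot": sv_dot_software}  # software implementations


def execute(opcode, *args):
    """Prefer the hardware unit when present, else fall back to software."""
    impl = ACCELERATED.get(opcode) or FALLBACK[opcode]
    return impl(*args)


p = execute("sv.dot", {0: 2.0, 3: 1.0}, {3: 4.0})
assert p == 4.0   # only index 3 overlaps: 1.0 * 4.0
```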

SLIDE 22

An Ideal Accelerator System

  • High Performance
  • Low Energy
  • Easy to Use: align hardware interfaces with those the software is already using
  • Portability: a software fallback plan

SLIDE 23

Sparse Vector Accelerator

Enforcing Data Encapsulation

set $v,$i,$x

CPU

get $v,$i,$x
dot $v1,$v2,$p

SLIDE 24

Sparse Vector Accelerator

Enforcing Data Encapsulation

set $v,$i,$x

CPU

get $v,$i,$x
dot $v1,$v2,$p

[diagram: vector indices and values held in accelerator-private storage, not CPU-visible memory]

SLIDE 25

Sparse Vector Accelerator

Enforcing Data Encapsulation

set $v,$i,$x

CPU

get $v,$i,$x
dot $v1,$v2,$p

[diagram: vector indices and values held in accelerator-private storage, not CPU-visible memory]

SLIDE 26

Sparse Vector Accelerator

Enforcing Data Encapsulation

set $v,$i,$x

CPU

get $v,$i,$x
dot $v1,$v2,$p

[diagram: vector indices and values held in accelerator-private storage, not CPU-visible memory]
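The encapsulation the slide enforces can be sketched as follows (a hypothetical model, not the actual accelerator): the CPU only issues set/get/dot against opaque vector handles, so the unit is free to choose its own internal storage format.

```python
class SparseVecUnit:
    """Model of a sparse-vector accelerator with private storage."""

    def __init__(self):
        self._store = {}                      # private: handle -> {index: value}

    def vset(self, v, i, x):                  # set $v, $i, $x
        self._store.setdefault(v, {})[i] = x

    def vget(self, v, i):                     # get $v, $i, $x
        return self._store[v].get(i, 0.0)

    def dot(self, v1, v2):                    # dot $v1, $v2, $p
        a, b = self._store[v1], self._store[v2]
        return sum(x * b.get(i, 0.0) for i, x in a.items())


u = SparseVecUnit()
u.vset(0, 2, 3.0)
u.vset(0, 7, 1.0)
u.vset(1, 2, 2.0)
assert u.dot(0, 1) == 6.0   # only index 2 overlaps: 3.0 * 2.0
```

Because the layout is never architecturally visible, the unit can reorganize data (e.g., the specialized VecStore on the next slide) without breaking software.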

SLIDE 27

Specialized Caching for Sparse Vectors

[plot: hit rate vs. storage capacity (128–2048 B), comparing a standard cache with the specialized VecStore]

SLIDE 28

Key Reuse in Hash Tables

[plot: percentage of hash operations vs. number of keys (log scale), for LZW Compress and Parser]

SLIDE 29

Key Reuse in Hash Tables

[plot: percentage of hash operations vs. number of keys (log scale), for LZW Compress and Parser]

SLIDE 30

Key Reuse in Hash Tables

[plot: percentage of hash operations vs. number of keys (log scale), for LZW Compress and Parser]

386-entry table: 26% of entries account for 99% of dynamic accesses

SLIDE 31

Key Reuse in Hash Tables

[plot: percentage of hash operations vs. number of keys (log scale), for LZW Compress and Parser]

386-entry table: 26% of entries account for 99% of dynamic accesses
94K-entry table: 0.1% of entries account for 75% of dynamic accesses
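This skew suggests that a small cache of hot entries in front of the main entry store could absorb most accesses. A sketch with a tiny LRU cache (sizes and names hypothetical):

```python
from collections import OrderedDict


class KeyCache:
    """Tiny LRU cache fronting a much larger entry store."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()
        self.hits = self.misses = 0

    def lookup(self, key, entrystore):
        if key in self.data:
            self.data.move_to_end(key)         # refresh LRU position
            self.hits += 1
        else:
            self.misses += 1
            self.data[key] = entrystore[key]   # fetch from the entry store
            if len(self.data) > self.capacity:
                self.data.popitem(last=False)  # evict least recently used
        return self.data[key]


entrystore = {k: k * k for k in range(1000)}
cache = KeyCache(capacity=8)
# A skewed stream: most lookups hit a handful of hot keys.
for key in [1, 2, 3] * 100 + list(range(100)):
    cache.lookup(key, entrystore)
assert cache.hits > cache.misses   # the tiny cache absorbs most traffic
```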

SLIDE 32

Exploiting Key Reuse


Hash Table Accelerator (HTX)

put $h,$k,$v get $h,$k,$v HTX-M HTX-C

SLIDE 33

Exploiting Key Reuse

[plot: reduction in HTX-M and entrystore accesses vs. cache capacity (log scale), for Compress and Parser]

Hash Table Accelerator (HTX)

put $h,$k,$v get $h,$k,$v HTX-M HTX-C

SLIDE 34

Summary

  • Extend software’s encapsulated datatypes into hardware accelerators
  • Natural alignment with standard software engineering
  • Accelerator utility across all applications that use a particular type
  • A software fallback that ensures portability
  • Aggressive optimization of computation and data movement

SLIDE 35

Research Challenges

  • What are the appropriate types to target?
  • What is the lower bound in complexity? Is there a maximum number of types a hardware system can support?
  • How do I implement polymorphism efficiently? (e.g., a priority queue with arbitrary types and a user-defined sort function)
  • How do I optimize enforcement of data encapsulation? (copy-on-read is conservative)
  • Can the execution model support parallel execution?
  • What is type-specific coherence like? Simpler? Uglier?
  • What is the appropriate system-level resource allocation between general and specialized hardware? Between different types?

SLIDE 36

Thank You
