 
Stories, Not Words: Abstract Datatype Instruction Sets
Martha Kim, Columbia University
Workshop on New Directions in Computer Architecture, 6/5/2011
Sunday, June 5, 2011
The Utilization Wall
• Exponential decrease in the percentage of transistors that can be operated at full frequency
[Figure: Moore's Law (manufacturable transistors) vs. power budget (operable transistors)]
• In a 45nm TSMC process, 7% of a 300mm² die can operate at full frequency
• In 32nm, 3.5%
G. Venkatesh et al. Conservation cores: Reducing the energy of mature computations. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 205–218, Pittsburgh, Pennsylvania, March 2010.
Specialization Is a Promising Approach
• R. Hameed et al., "Understanding sources of inefficiency in general-purpose chips," ISCA '10.
• G. Venkatesh et al., "Conservation cores: reducing the energy of mature computations," ASPLOS '10.
• J. Kelm, D. Johnson, W. Tuohy, S. Lumetta, and S. Patel, "Cohesion: a hybrid memory model for accelerators," ISCA '10.
• H. Franke et al., "Introduction to the wire-speed processor and architecture," IBM Journal of Research and Development, vol. 54, no. 1, pp. 3:1–3:11, 2010.
• V. Govindaraju, C. Ho, and K. Sankaralingam, "Dynamically specialized datapaths for energy efficient computing," HPCA '11.
• M. Lyons, M. Hempstead, G. Wei, and D. Brooks, "The Accelerator Store framework for high-performance, low-power accelerator-based systems," Computer Architecture Letters, vol. 9, no. 2, pp. 53–56, 2010.
• C. Cascaval, S. Chatterjee, H. Franke, K. Gildea, and P. Pattnaik, "A taxonomy of accelerator architectures and their programming models," IBM Journal of Research and Development, vol. 54, no. 5, p. 5, 2010.
• R. Hou et al., "Efficient data streaming with on-chip accelerators: Opportunities and challenges," HPCA '11.
• N. Goulding et al., "GreenDroid: A mobile application processor for silicon's dark future," Hot Chips '10.
An Ideal Accelerator System
• High performance
• Low energy
• Easy to program
• Software portability
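Accelerator Design Processes
[Figure, built up over several slides: three design flows, each starting from an Application. The first flow goes Application → Microarch. → Arch. and is flagged with "!"; the other two define the Arch. before the Microarch.]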
Extending Software Abstractions to Hardware
[Figure: system stack of Application, Libraries, Machine Code, Micro-ops, Execution core, Caches, Memory]
• Raise the HW/SW interface
• Extend interfaces from libraries to hardware
• Exploit those interfaces with specialized hardware
Abstract Datatype Processing
• SW: class HashTable, with put(k,v) and v = get(k)
• Arch: put $h, $k, $v and get $h, $k, $v
• UArch: Hash Table Processor
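A minimal Python sketch of the three layers, assuming a toy model in which each hash table is addressed by an integer handle. The names `HashTableProcessor`, `alloc`, and `execute` are hypothetical, not from the talk. Software keeps its familiar put/get interface, and each call lowers to one put or get instruction carrying the handle ($h), key ($k), and value ($v).

```python
class HashTableProcessor:
    """Hypothetical stand-in for the 'Hash Table Processor' (UArch).
    Table contents live here and are reachable only through put/get."""

    def __init__(self):
        self._tables = {}  # handle -> dict, hidden from software

    def alloc(self):
        """Create an empty table and return its handle (the $h operand)."""
        handle = len(self._tables)
        self._tables[handle] = {}
        return handle

    def execute(self, op, h, k, v=None):
        """Execute one ADT instruction (the Arch layer)."""
        if op == "put":  # put $h,$k,$v
            self._tables[h][k] = v
            return None
        if op == "get":  # get $h,$k,$v: the result is written to $v
            return self._tables[h].get(k)
        raise ValueError(f"unknown opcode: {op}")


class HashTable:
    """Software-visible class (the SW layer); each method lowers to one
    ADT instruction instead of running a software hash table."""

    def __init__(self, hw):
        self._hw = hw
        self._h = hw.alloc()  # handle register $h

    def put(self, k, v):
        self._hw.execute("put", self._h, k, v)

    def get(self, k):
        return self._hw.execute("get", self._h, k)


hw = HashTableProcessor()
t = HashTable(hw)
t.put("apple", 3)
print(t.get("apple"))  # 3
print(t.get("pear"))   # None
```

In this sketch software never sees the table's layout, only the instruction interface, which leaves the hardware free to organize storage as it sees fit.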
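Compilation & Execution
[Figure: a Sequence Labeling application uses the SparseVec and HashTable types; a dispatcher routes work to the general-purpose core (GP), the sparse vector unit (SV), and the hash table unit (HT)]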
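One way to picture the dispatch step is as a routing function keyed on the datatype each compiled instruction operates on. This is a hedged Python sketch: the `UNIT_FOR_TYPE` table, the `(datatype, opcode)` instruction encoding, and the instruction stream are all illustrative assumptions, not the talk's actual compiler output or hardware.

```python
from collections import Counter

# Hypothetical mapping from the datatype an instruction operates on to
# the unit that executes it; untyped instructions go to the GP core.
UNIT_FOR_TYPE = {"SparseVec": "SV", "HashTable": "HT"}


def dispatch(instr):
    """Return the unit ('GP', 'SV', or 'HT') that should run instr."""
    dtype, _opcode = instr
    return UNIT_FOR_TYPE.get(dtype, "GP")


# Illustrative instruction stream as (datatype, opcode) pairs;
# None marks ordinary general-purpose instructions.
stream = [
    (None, "add"),
    ("HashTable", "get"),
    ("SparseVec", "dot"),
    (None, "cmp"),
    ("SparseVec", "set"),
]

routes = [dispatch(i) for i in stream]
print(routes)  # ['GP', 'HT', 'SV', 'GP', 'SV']
print(Counter(routes))
```

Because routing depends only on the type, any application that uses SparseVec or HashTable would reach the matching unit without application-specific changes.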
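The Software Fallback
[Figure: two dispatch configurations, each with a dispatcher feeding a GP core and an SV unit; operations with no matching accelerator run in software on GP]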
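A hedged Python sketch of the fallback idea: the dispatcher is built for a particular machine's set of available units, and any datatype whose accelerator is missing is sent to the GP core, where a software implementation runs. The names `make_dispatcher` and `UNIT_FOR_TYPE` are hypothetical.

```python
# Hypothetical model of the software fallback. The same program runs on
# every machine; only the set of available units changes.
UNIT_FOR_TYPE = {"SparseVec": "SV", "HashTable": "HT"}


def make_dispatcher(available_units):
    """Build a dispatch function for a machine with the given units."""
    def dispatch(dtype):
        unit = UNIT_FOR_TYPE.get(dtype)
        if unit is not None and unit in available_units:
            return unit  # accelerator present: use it
        return "GP"      # fallback: software implementation on GP
    return dispatch


full = make_dispatcher({"GP", "SV", "HT"})
no_ht = make_dispatcher({"GP", "SV"})  # e.g., a chip without a hash table unit

for dtype in ["SparseVec", "HashTable", None]:
    print(dtype, full(dtype), no_ht(dtype))
```

The design choice this illustrates is portability: a binary compiled against the datatype instructions still runs when an accelerator is absent, just without that accelerator's benefit.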
An Ideal Accelerator System
• High performance
• Low energy
• Easy to use: align hardware interfaces with those software already uses
• Portability: a software fallback plan
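Enforcing Data Encapsulation
[Figure: the CPU accesses sparse vectors only through set $v,$i,$x, get $v,$i,$x, and dot $v1,$v2,$p, executed by the Sparse Vector Accelerator; an animation traces vector entries (v, i, x) between the CPU and the accelerator]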
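A minimal Python sketch of an encapsulated sparse vector unit, assuming vectors are named by integer ids and stored as index-to-value maps of nonzeros. The slide does not detail the actual enforcement mechanism (the closing slide lists copy-on-read as a conservative option), so this sketch simply keeps storage private and exposes only set, get, and dot. The class name and storage layout are hypothetical.

```python
class SparseVectorAccelerator:
    """Hypothetical model: vectors live inside the unit and are reachable
    only through set/get/dot, so the CPU never holds a pointer to them."""

    def __init__(self):
        self._vecs = {}  # vector id -> {index: nonzero value}

    def set(self, v, i, x):
        """set $v,$i,$x: store x at index i of vector v."""
        vec = self._vecs.setdefault(v, {})
        if x == 0:
            vec.pop(i, None)  # keep only nonzero entries
        else:
            vec[i] = x

    def get(self, v, i):
        """get $v,$i,$x: return element i of vector v (0 if absent)."""
        return self._vecs.get(v, {}).get(i, 0)

    def dot(self, v1, v2):
        """dot $v1,$v2,$p: return the dot product of two vectors."""
        a = self._vecs.get(v1, {})
        b = self._vecs.get(v2, {})
        if len(a) > len(b):
            a, b = b, a  # iterate over the sparser vector
        return sum(x * b[i] for i, x in a.items() if i in b)


sva = SparseVectorAccelerator()
sva.set(0, 3, 2.0)
sva.set(0, 10, 1.0)
sva.set(1, 3, 4.0)
sva.set(1, 7, 5.0)
print(sva.get(0, 3))  # 2.0
print(sva.get(0, 5))  # 0
print(sva.dot(0, 1))  # 8.0
```

Since callers only see these three operations, the unit could change its internal representation (or cache it specially) without breaking software.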
Specialized Caching for Sparse Vectors
[Plot: hit rate (0–100%) vs. storage capacity (128 to 2048 B) for a standard cache and VecStore]
Key Reuse in Hash Tables
[Plot: percentage of hash operations (0–100%) vs. number of keys (0.1 to 100,000, log scale) for LZW Compress and Parser]
• 386-entry table: 26% of the table receives 99% of dynamic accesses
• 94K-entry table: 0.1% of the table receives 75% of dynamic accesses
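The curves above can be read as cumulative coverage: sort keys by access count and ask what fraction of all hash operations the k hottest keys account for. A small Python sketch of that computation over an access trace; the function names and the trace are made up for illustration and are not the talk's data.

```python
from collections import Counter


def coverage_curve(trace):
    """Cumulative fraction of hash operations that touch the k hottest
    keys, for k = 1 .. number of distinct keys."""
    counts = sorted(Counter(trace).values(), reverse=True)
    total = len(trace)
    curve, running = [], 0
    for c in counts:
        running += c
        curve.append(running / total)
    return curve


def keys_needed(trace, target):
    """Smallest number of distinct keys covering >= target of operations."""
    for k, frac in enumerate(coverage_curve(trace), start=1):
        if frac >= target:
            return k
    return None


# Made-up trace: one hot key dominates, a few colder keys appear.
trace = ["a"] * 7 + ["b"] * 2 + ["c"]
print(coverage_curve(trace))    # [0.7, 0.9, 1.0]
print(keys_needed(trace, 0.9))  # 2
# Fraction of the table needed to cover 90% of operations:
print(keys_needed(trace, 0.9) / len(set(trace)))
```

A steep early rise in this curve is what makes a small specialized store for hot keys attractive.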
Exploiting Key Reuse
[Figure: the Hash Table Accelerator (HTX), built from HTX-C and HTX-M, executes put $h,$k,$v and get $h,$k,$v]
[Plot: reduction in HTX-M accesses (0–100%) vs. cache capacity (1 to 1000) for Compress and Parser, counting HTX-M accesses and HTX-M Entrystore accesses]
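A hedged Python sketch of how key reuse cuts traffic to the backing table. It assumes, from the plot's axes, that HTX-C acts as a small cache in front of HTX-M; the talk does not spell out either unit's design, so the LRU replacement, write-back policy, class name, and trace here are all hypothetical.

```python
from collections import OrderedDict


class CachedTable:
    """Assumed structure: an LRU cache (HTX-C stand-in) in front of the
    full table (HTX-M stand-in). Counts accesses that reach the table."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()  # key -> (value, dirty), LRU first
        self.backing = {}           # full table
        self.backing_accesses = 0

    def _insert(self, key, value, dirty):
        self.cache[key] = (value, dirty)
        self.cache.move_to_end(key)
        if len(self.cache) > self.capacity:
            old_key, (old_val, old_dirty) = self.cache.popitem(last=False)
            if old_dirty:  # write back only modified entries
                self.backing[old_key] = old_val
                self.backing_accesses += 1

    def put(self, key, value):  # put $h,$k,$v
        self._insert(key, value, dirty=True)

    def get(self, key):  # get $h,$k,$v
        if key in self.cache:  # hit: served without touching the table
            self.cache.move_to_end(key)
            return self.cache[key][0]
        self.backing_accesses += 1  # miss: read the backing table
        value = self.backing.get(key)
        self._insert(key, value, dirty=False)
        return value


def run(ops, capacity):
    """Replay (op, key, value) tuples; return backing-table accesses.
    Entries left dirty in the cache at the end are not flushed."""
    table = CachedTable(capacity)
    for op, key, value in ops:
        if op == "put":
            table.put(key, value)
        else:
            table.get(key)
    return table.backing_accesses


ops = ([("put", "a", 1), ("put", "b", 2)]
       + [("get", "a", None)] * 8 + [("get", "b", None)] * 2)
base = run(ops, capacity=0)    # no cache: every operation reaches the table
cached = run(ops, capacity=1)
print(base, cached)                           # 12 4
print(f"reduction: {1 - cached / base:.0%}")  # reduction: 67%
```

Even a one-entry cache removes most backing accesses here because the trace reuses a single hot key, which is the effect the key-reuse measurements suggest.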
Summary
Extend software's encapsulated datatypes into hardware accelerators:
• Natural alignment with standard software engineering
• Accelerator utility for every application that uses a particular type
• A software fallback that ensures portability
• Aggressive optimization of computation and data movement
Research Challenges
• What are the appropriate types to target? What is the lower bound in complexity?
• Is there a maximum number of types a hardware system can support?
• How do I implement polymorphism efficiently? (e.g., a priority queue with arbitrary element types and a user-defined sort function)
• How do I optimize enforcement of data encapsulation? (Copy-on-read is conservative.)
• Can the execution model support parallel execution?
• What is type-specific coherence like? Simpler? Uglier?
• What is the appropriate system-level resource allocation between general-purpose and specialized hardware? Between different types?
Thank You