Lists Revisited: Cache Conscious STL lists. Leonor Frias, Jordi Petit, Salvador Roura. Departament de Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya.

Overview Goal: Improve STL list performance in the most common settings using a cache-conscious data structure. Previous work: Either ✷ double-linked list implementations: easily cope with the Standard's requirements ✷ theoretical cache-conscious data structures: do not take any of these requirements into account. Main contribution: merging both approaches. Main problem: dealing with STL list iterator functionality. Work done: analysis, design, implementation and a comprehensive experimental study.

Index 1. Introduction and motivation 2. Problem and our approach 3. Design 4. Experiments 5. Conclusions and further work

Standard Template Library (STL) Core of the C++ standard library [International Standard ISO/IEC 14882 1998]. Elements: ✷ containers: list, vector, map, ... ✷ iterators: high-level pointers ✷ algorithms: sort, reverse, find, ... Implementation: based on the classical literature on algorithms and data structures.

Improve performance Use the memory hierarchy effectively for known/regular access patterns → cache-conscious algorithms & data structures. General idea: organize data such that the logical access pattern ≈ physical memory locations. Models: ✷ cache-aware ✷ cache-oblivious [Frigo et al. 1999]

STL lists A container with forward and backward traversal that supports insertion and deletion in constant time. STL list iterator properties: ✷ there may be an arbitrary number of them ✷ list operations cannot invalidate them (except iterators to an erased element). Straightforward implementation: a double-linked list. This is what all known STL implementations do!
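The iterator properties above can be checked directly against `std::list`. The following is a minimal sketch; the function name `iterators_survive_updates` is made up for this illustration and is not from the slides.

```cpp
#include <iterator>
#include <list>

// Returns true if iterators taken before a series of updates still refer
// to the same elements afterwards (hypothetical helper for illustration).
bool iterators_survive_updates() {
    std::list<int> l = {10, 20, 30, 40};

    std::list<int>::iterator it20 = std::next(l.begin(), 1);  // refers to 20
    std::list<int>::iterator it40 = std::next(l.begin(), 3);  // refers to 40
    std::list<int>::iterator also40 = it40;  // several iterators may share an element

    l.insert(it20, 15);                // constant time; list is 10 15 20 30 40
    l.erase(std::next(l.begin(), 3));  // erases 30;     list is 10 15 20 40
    l.push_front(5);                   //                list is 5 10 15 20 40

    // None of the surviving iterators was invalidated by the updates.
    return *it20 == 20 && *it40 == 40 && *also40 == 40 &&
           std::next(it20) == it40;
}
```

Any cache-conscious replacement has to keep this behaviour, even though elements may then move inside memory when the list is modified.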

Double-linked lists cache performance Pointer-based data structures cannot guarantee good cache performance. [Figure: Traversal using libstdc++. Scaled time (in microsec) versus list size (in 10^4), with curves for no-modification, 1- to 64-insert-erase, and sort.] It is worth trying a cache-conscious approach!

Previous work on cache-conscious lists [Demaine 2002] Cache-aware: partition into Θ(n/B) pieces with between B/2 and B elements each. ✷ Traversal: O(n/B) amortized ✷ Update: constant. Cache-oblivious: uses the packed memory structure, an array of Θ(n) size with uniformly distributed gaps. ✷ Traversal: O(n/B) amortized ✷ Update: O((log² n)/B) (lower by partitioning the array) ⊲ Amortized constant with self-organizing structures (updates may break the uniformity until the list is reorganized when traversed).
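A small worked calculation shows why the cache-aware partition gives Θ(n/B) pieces: if every piece holds between B/2 and B elements, the number of pieces lies between n/B and 2n/B. The sketch below assumes B is even and at least 2, and that a list shorter than B/2 is stored in a single underfull piece; `PieceBounds` and `piece_bounds` are illustrative names only.

```cpp
#include <algorithm>
#include <cstddef>

// Lower and upper bound on the number of pieces (hypothetical helper type).
struct PieceBounds {
    std::size_t min_pieces;
    std::size_t max_pieces;
};

// Bounds on the number of pieces for n elements when each piece holds
// between B/2 and B elements. Assumes B is even and B >= 2.
PieceBounds piece_bounds(std::size_t n, std::size_t B) {
    if (n == 0) return PieceBounds{0, 0};
    const std::size_t half = B / 2;
    // At most B elements per piece: at least ceil(n / B) pieces.
    const std::size_t lo = (n + B - 1) / B;
    // At least B/2 elements per piece: at most floor(n / (B/2)) pieces,
    // but never fewer than one piece for a non-empty list.
    const std::size_t hi = std::max<std::size_t>(1, n / half);
    return PieceBounds{lo, hi};
}
```

For n = 1000 and B = 100 this gives between 10 and 20 pieces, so a full traversal touches a number of pieces proportional to n/B.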

Problem Pointers + cache-conscious data structure: physical and logical locations are not independent. Iterators cannot be trivial pointers ⇒ iterators must be reached whenever a modification occurs. Main issue: an unbounded number of iterators may point to the same element. Ways of achieving Θ(1) operations: ✷ the number of iterators is arbitrarily restricted ✷ iterators pointing to the same element share some data. STL lists are not traversed as a whole but step by step ⇒ NO self-organizing strategies.

Our approach Efficient data access + full iterator functionality + (constant) worst case costs compliant with the Standard. Base: cache-aware solution. Common list usages: ✷ Only a few iterators on a list instance ✷ Many traversals are performed due to sequential access ✷ Frequent modifications at any position ✷ Small/plain old data (POD) types (*) Implicit or explicit in the general cache-conscious literature

Basic design Double-linked list of buckets. What else? 1. how to arrange the elements inside a bucket 2. how to reorganize the buckets on insertion/deletion 3. how to manage iterators 4. bucket capacity? → Determined experimentally
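To make the basic design concrete, here is a minimal sketch of a double-linked list of fixed-capacity buckets. It is not the authors' implementation: the class name `BucketList`, the contiguous arrangement inside a bucket, and the append-only interface are assumptions chosen to keep the example short, and iterators and reorganization are left out entirely.

```cpp
#include <array>
#include <cstddef>
#include <memory>
#include <utility>

// Hypothetical sketch: elements live contiguously inside buckets, and the
// buckets themselves form a double-linked list.
template <typename T, std::size_t Capacity>
class BucketList {
    struct Bucket {
        std::array<T, Capacity> elems{};
        std::size_t count = 0;         // number of used slots
        Bucket* prev = nullptr;        // non-owning back link
        std::unique_ptr<Bucket> next;  // owning forward link
    };

    std::unique_ptr<Bucket> head_;
    Bucket* tail_ = nullptr;
    std::size_t size_ = 0;
    std::size_t buckets_ = 0;

public:
    BucketList() = default;
    BucketList(const BucketList&) = delete;
    BucketList& operator=(const BucketList&) = delete;

    // Unlink buckets one by one to avoid deep recursive destruction.
    ~BucketList() {
        while (head_) head_ = std::move(head_->next);
    }

    void push_back(const T& value) {
        if (!tail_ || tail_->count == Capacity) {
            std::unique_ptr<Bucket> b(new Bucket());
            b->prev = tail_;
            Bucket* raw = b.get();
            if (tail_) tail_->next = std::move(b);
            else head_ = std::move(b);
            tail_ = raw;
            ++buckets_;
        }
        tail_->elems[tail_->count++] = value;
        ++size_;
    }

    // Traversal visits each bucket once and scans its elements in order,
    // so consecutive elements are usually adjacent in memory.
    template <typename F>
    void for_each(F f) const {
        for (const Bucket* b = head_.get(); b; b = b->next.get())
            for (std::size_t i = 0; i < b->count; ++i) f(b->elems[i]);
    }

    std::size_t size() const { return size_; }
    std::size_t bucket_count() const { return buckets_; }
};
```

With capacity 4, appending ten elements yields three buckets, so a traversal follows three links instead of ten.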

Arrangement of elements

Reorganization of buckets Preserve the data structure invariant after a modification: ✷ minimum bucket occupancy ✷ arrangement coherency ✷ ... Main issue: keeping a balance between: ✷ high occupancy ✷ few buckets accessed ✷ few element movements
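One simple way to preserve a minimum-occupancy invariant is to split a full bucket on insertion and to borrow from, or merge with, a neighbour on deletion. The sketch below illustrates that generic strategy only; it is not the reorganization algorithm from the talk. The capacity of 4, the half-full minimum, and all names are assumptions, and buckets are modelled with `std::vector` for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>
#include <vector>

using Bucket = std::vector<int>;
using Buckets = std::list<Bucket>;

constexpr std::size_t kCapacity = 4;            // hypothetical bucket capacity
constexpr std::size_t kMinFill = kCapacity / 2; // hypothetical minimum occupancy

// Insert value at position pos of bucket b, splitting b first if it is full.
void insert_in_bucket(Buckets& bs, Buckets::iterator b, std::size_t pos, int value) {
    assert(pos <= b->size());
    const std::ptrdiff_t half = static_cast<std::ptrdiff_t>(kMinFill);
    if (b->size() == kCapacity) {
        // Move the upper half into a new bucket placed right after b.
        Buckets::iterator next = bs.insert(std::next(b), Bucket(b->begin() + half, b->end()));
        b->resize(kMinFill);
        if (pos > kMinFill) { b = next; pos -= kMinFill; }
    }
    b->insert(b->begin() + static_cast<std::ptrdiff_t>(pos), value);
}

// Erase position pos of bucket b; borrow from or merge with a neighbour
// if b would otherwise drop below the minimum occupancy.
void erase_from_bucket(Buckets& bs, Buckets::iterator b, std::size_t pos) {
    assert(pos < b->size());
    b->erase(b->begin() + static_cast<std::ptrdiff_t>(pos));
    if (bs.size() == 1) {               // a lone bucket may be underfull
        if (b->empty()) bs.erase(b);
        return;
    }
    if (b->size() >= kMinFill) return;
    Buckets::iterator nb = std::next(b);
    const bool nb_is_next = nb != bs.end();
    if (!nb_is_next) nb = std::prev(b);
    if (nb->size() > kMinFill) {        // borrow a single element
        if (nb_is_next) { b->push_back(nb->front()); nb->erase(nb->begin()); }
        else { b->insert(b->begin(), nb->back()); nb->pop_back(); }
    } else if (nb_is_next) {            // merge: the result fits in one bucket
        b->insert(b->end(), nb->begin(), nb->end());
        bs.erase(nb);
    } else {
        nb->insert(nb->end(), b->begin(), b->end());
        bs.erase(b);
    }
}

// Every bucket respects the capacity; with two or more buckets, each one
// also holds at least kMinFill elements.
bool invariant_holds(const Buckets& bs) {
    for (const Bucket& b : bs) {
        if (b.size() > kCapacity) return false;
        if (bs.size() > 1 && b.size() < kMinFill) return false;
    }
    return true;
}

std::vector<int> flatten(const Buckets& bs) {
    std::vector<int> out;
    for (const Bucket& b : bs) out.insert(out.end(), b.begin(), b.end());
    return out;
}

// Appends 0..n-1 one at a time, always at the end of the last bucket.
Buckets build_sequence(int n) {
    Buckets bs(1);  // start with one empty bucket
    for (int i = 0; i < n; ++i)
        insert_in_bucket(bs, std::prev(bs.end()), bs.back().size(), i);
    return bs;
}
```

The trade-off named on the slide is visible even here: splitting in half keeps occupancy at or above 50% and touches at most two buckets, at the price of moving up to half a bucket of elements.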

Iterators management Key idea: all the iterators referring to an element are identified with a dynamic node (relayer) that points to it. Figure 1: Bucket of pairs. Figure 2: 2-level list.
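The relayer idea can be sketched as one extra level of indirection: iterators hold a shared relayer, and only the relayer records where the element currently lives. The code below is an illustrative assumption rather than the authors' design. It models a single bucket as a `std::vector`, uses `std::shared_ptr`/`std::weak_ptr` so that a relayer disappears with its last iterator, and all names (`Relayer`, `RelayedBucket`, `iterator_at`, `insert_at`) are hypothetical.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// One relayer per referenced element; it stores the element's current slot.
struct Relayer {
    std::size_t slot;
};

class RelayedBucket {
    struct Cell {
        int value;
        std::weak_ptr<Relayer> relayer;  // empty while no iterator refers to the cell
    };
    std::vector<Cell> cells_;

public:
    class Iterator {
        const RelayedBucket* bucket_;
        std::shared_ptr<Relayer> relayer_;

    public:
        Iterator(const RelayedBucket* b, std::shared_ptr<Relayer> r)
            : bucket_(b), relayer_(std::move(r)) {}
        int operator*() const { return bucket_->cells_[relayer_->slot].value; }
        std::size_t slot() const { return relayer_->slot; }
        bool shares_relayer_with(const Iterator& o) const { return relayer_ == o.relayer_; }
    };

    void push_back(int v) { cells_.push_back(Cell{v, std::weak_ptr<Relayer>()}); }

    // All iterators to the same element reuse that element's relayer.
    Iterator iterator_at(std::size_t i) {
        std::shared_ptr<Relayer> r = cells_[i].relayer.lock();
        if (!r) {
            r = std::make_shared<Relayer>(Relayer{i});
            cells_[i].relayer = r;
        }
        return Iterator(this, r);
    }

    // Inserting shifts later elements; each moved element needs one relayer
    // update, no matter how many iterators refer to it.
    void insert_at(std::size_t i, int v) {
        cells_.insert(cells_.begin() + static_cast<std::ptrdiff_t>(i),
                      Cell{v, std::weak_ptr<Relayer>()});
        for (std::size_t j = i + 1; j < cells_.size(); ++j)
            if (std::shared_ptr<Relayer> r = cells_[j].relayer.lock()) r->slot = j;
    }
};
```

Because the update cost depends on the number of moved elements and not on the number of iterators, an unbounded number of iterators on one element does not break the constant-time bound for a bucket of fixed capacity.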

Set up Our three implementations: ✷ bucket-pairs ✷ 2-level-cont ✷ 2-level-link, compared against libstdc++ in GCC 4.0.1. Basic environment: ✷ 64-bit Sun Workstation, AMD Opteron CPU at 2.4 GHz ✷ 1 GB main memory ✷ 64 KB + 64 KB 2-way associative L1 cache, 1024 KB 16-way associative L2 cache and 64 bytes per cache line. Other: Pentium 4, 3.06 GHz hyperthreading CPU, 900 MB of main memory and 512 KB L2 cache.

Which experiments Performance measures: ✷ wall-clock times ✷ cache performance data: Pin [Luk et al. 2005]. Types of experiments: ✷ lists with no iterators ✷ lists with iterators ✷ lists with several bucket capacities ✷ LEDA. Lists are measured before and after element reorganization (by sorting).
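A wall-clock measurement of this kind can be sketched as follows: time a full traversal of a freshly built `std::list`, sort it so that logical order no longer matches allocation order, and time the traversal again. This is not the authors' benchmark harness. The struct and function names, the value range, and the use of `std::chrono::steady_clock` are assumptions, and the slides' time scaling is not reproduced.

```cpp
#include <chrono>
#include <cstddef>
#include <list>
#include <numeric>
#include <random>

// Results of one measurement (hypothetical helper type).
struct TraversalTiming {
    double before_us;      // traversal time before sorting, in microseconds
    double after_us;       // traversal time after sorting, in microseconds
    long long sum_before;  // checksums keep the traversal from being optimized away
    long long sum_after;
};

// Times one full traversal and stores the sum of the elements in sum.
static double traverse_us(const std::list<int>& l, long long& sum) {
    const auto t0 = std::chrono::steady_clock::now();
    sum = std::accumulate(l.begin(), l.end(), 0LL);
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(t1 - t0).count();
}

TraversalTiming time_traversals(std::size_t n, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_int_distribution<int> dist(0, 1000000);

    std::list<int> l;
    for (std::size_t i = 0; i < n; ++i) l.push_back(dist(gen));

    TraversalTiming t{};
    t.before_us = traverse_us(l, t.sum_before);
    l.sort();  // relinks nodes: logical neighbours are no longer allocated together
    t.after_us = traverse_us(l, t.sum_after);
    return t;
}
```

Actual numbers depend on the allocator, the compiler flags and the machine, so a single run like this only indicates the trend; cache miss counts would need a tool such as Pin, as on the slide.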

Traversal before [Figure: Traversal before shuffling (0% iterator load and bucket capacity 100). Scaled time (in microsec) versus list size (in 10^4) for gcc, bucket-pairs, 2-level-cont and 2-level-link.]

Traversal after [Figure: Traversal after shuffling (0% iterator load and bucket capacity 100). Scaled time (in microsec) versus list size (in 10^4) for gcc, bucket-pairs, 2-level-cont and 2-level-link.]

Pin Traversal after [Figure: Traversal after shuffling (0% iterator load and bucket capacity 100). Scaled number of L2 cache accesses (log scale), misses and total, versus list size (in 10^4) for gcc, bucket-pairs, 2-level-cont and 2-level-link.]

Insert before [Figure: Insert traversal before shuffling (0% iterator load and bucket capacity 100). Scaled time (in microsec) versus list size (in 10^4) for gcc, bucket-pairs, 2-level-cont and 2-level-link.]

Insert after [Figure: Insert traversal after shuffling (0% iterator load and bucket capacity 100). Scaled time (in microsec) versus list size (in 10^4) for gcc, bucket-pairs, 2-level-cont and 2-level-link.]

Intensive insertion [Figure: Insert after shuffling (0% iterator load and bucket capacity 100). Scaled time (in microsec) versus list size (in 10^4) for gcc, bucket-pairs, 2-level-cont and 2-level-link.]

Internal sort [Figure: Sort (0% iterator load and bucket capacity 100). Scaled time (in microsec) versus list size (in 10^4) for gcc, bucket-pairs, 2-level-cont and 2-level-link.]

Effect of bucket capacity [Figure: Insert traversal after shuffling (list size 486 × 10^4 and 0% iterator load). Scaled time (in microsec) versus bucket capacity (10 to 1000) for gcc, bucket-pairs, 2-level-cont and 2-level-link.]

Iterators [Figure: Traversal after shuffling (list size 486 × 10^4 and bucket capacity 100). Scaled time (in microsec) versus percentage of iterator load (0 to 100) for gcc, bucket-pairs, 2-level-cont and 2-level-link.]

LEDA [Figure: Traversal after shuffling (0% iterator load and bucket capacity 100). Scaled time (in microsec) versus list size (in 10^4) for gcc, leda and 2-level-link.]

Conclusions (1) A pioneering effort to show the importance of porting existing theory and practice on cache-conscious data structures to standard libraries such as the STL. Provided three standard-compliant cache-conscious list implementations. This is not straightforward, although it is based on simple existing data structures. ✷ Complied with the standard requirements, in particular those on iterators. We have provided two standard-compliant iterator designs. ✷ The algorithms involved must be designed carefully to preserve some properties.

Conclusions (2) Provided a comprehensive experimental study. Our implementations are preferable in many (common) situations to classical double-linked list implementations, such as GCC (or LEDA). Specifically: ✷ 5-10 times faster traversals ✷ 3-5 times faster internal sort ✷ still competitive with an (unusually) big load of iterators ✷ bucket capacity is not a critical parameter. Between our implementations: ✷ 2-level linked implementation ✷ linked bucket implementation

What next? My webpage: www.lsi.upc.edu/~lfrias Extended article: reorganization algorithm analysed in detail. ✷ Using amortized analysis, we show that the number of created/destroyed buckets is asymptotically optimal.

Thank you Questions?
