Hash Table Design and Optimization for Software Virtual Switches - - PowerPoint PPT Presentation

▶

Mar 07, 2024 440 likes •612 views

Hash Table Design and Optimization for Software Virtual Switches PRESENTER: REN WANG YIPENG WANG, SAMEH GOBRIEL, REN WANG, CHARLIE TAI, CRISTIAN DUMITRESCU INTEL LABS OUTLINE Background and motivation Survey and understanding

SLIDE 1

Hash Table Design and Optimization for Software Virtual Switches

PRESENTER: REN WANG YIPENG WANG, SAMEH GOBRIEL, REN WANG, CHARLIE TAI, CRISTIAN DUMITRESCU INTEL LABS

SLIDE 2

OUTLINE

Background and motivation
Survey and understanding
Analysis

SLIDE 3

Background

 We found the most common data structure used in virtual switch is hash table.

 wildcarding match (tuple space search): routing table, ACL  exact match: con-track table, flow cache, etc.

 Comparing to tree based data structure, hash table based data structure has certain advantages:

More parallelism: no pointer chasing Faster rule updates

SLIDE 4

Background

 Hash table lookup is also one of the most time consuming stage during packet processing:

E.g. Open vSwitch (100k rules, 20 subtables)

 A major source of hash table lookup overhead is memory access latency.

IO Preprocessing Rule lookup

thers

Execution percentage ~8% ~5% ~78% ~9%

SLIDE 5

Motivation

 Hash table is a simple data structure, but there are many different design and implementations.  Understanding of hash table performance and how to design an efficient hash table structure is the key to a good software switch. A general guideline to hash table designs will benefit future vswitch development.

SLIDE 6

Basic hash table structure

The evolution of hash table algorithms: single array -> bucket-based -> n-hash

Key A Key A Key A

SLIDE 7

Cuckoo hashing

Cuckoo hash algorithm: existing keys can be displaced to alternative bucket

Key A

Hash func

Key B Key A Key B

SLIDE 8

Survey

We also studied into various open source virtual switch applications to learn their

implementations.

Three major purposes these applications use hash table for:
Routing table/ACL – tuple space search
Connect tracking table – exact match
Flow cache – exact/signature match with replacement policy

SLIDE 9

Observations

Set-associative table and cuckoo hash are widely used.
Bucket size is usually 4-8 entries
Cache alignment
Vectorization
Capacity guarantee is needed in telcom use cases
Linked list based hash table as extended table
Software techniques to improve performance:
Software pipelining
Batching
Read write concurrency
Optimistic locking
Intel TSX

SLIDE 10

Analysis

Table organization and data structure
Number of keys per bucket
More entries in a bucket can directly improve the table utilizations.
Conclusion: when table utilization is important, cuckoo hash should be used. Multiple hash function and

multiple ways per bucket also help a lot.

0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%

1 2 4 8 16

load factors vs. assoc.

1 hash 2hash cuckoo

1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 90 92 94 96 98 100

Cuckoo hash insertion cycles vs. load factor

SLIDE 11

Analysis

Separate key storage and cache alignment
Hash tables could store key-data pair in a separate memory location, and only keep signatures and

index in the table.

Pros: signature and index are easily to be cache aligned, benefit cache miss case.
Cons: requires another memory jump when hit.
Out tests show that with optimized DPDK hash tables, storing keys in or outside the table does not

show major difference with 16 or 32-byte key size.

However, cache alignment will improve hash table lookup speed by 6.5-16.7% in our DPDK based

performance test.

11 SIG

Key+data

SLIDE 12

Analysis

Hash table based cache
When use hash table for flow cache we need to consider cache miss ratio.
4-8 ways per bucket can already keep the miss ratio to be reasonable low.
We propose a new AVX-based LRU implementation.
Use Intel AVX instruction to permute the bucket.

0.00% 5.00% 10.00% 15.00% 20.00% 25.00% 30.00% 1 2 4 8 16

miss ratio vs. ways/bucket

miss ratio 5 10 15 insert lookup

speed comparison (lower the better)

avx age ll bplru tplru

lookup

void adjust_location(int location, bucket* bucket){ __m256i array = avx_load(bucket) __m256i permute_pattern = avx_load(permute_index[location]) __m256i permuted_array = avx_permute(array, permute_pattern) avx_store (bucket, permuted_array) }

SLIDE 13

Analysis

Software pipelining and batching
Batching can enable us to prefetch hash table bucket for different lookup keys.
Together with batching, software pipelining can further improve performance.
Software pipelining + batching easily improve performance by 2X in our test case.

20 40 60 80 100 120 140 no opt. prefetch pipelined

cycle/pkt with prefetch or pipelining

SLIDE 14

Analysis- Vectorization

Besides using Intel AVX instruction for LRU operation, we can also use AVX instruction to

perform signature comparison.

We compare three mechanisms:
No vectorization.
Horizontal vectorization: compare one key’s signature to all signatures in a bucket.
Vertical vecotrization: compare all key’s signatures in a batch to different entries across different

buckets.

Observation:
Vertical or scalar better for low table utilization.
Horizontal better for high table utilization.
An adaptive method could benefit.

20 40 60 80 100 120 140

4.2 3.8 2.4 1.05 1

lookup cycle vs. avg entry location

scalar vert hori

SLIDE 15

Future directions of Hash Table Design

Cuckoo hash + extended linked list design
Linked list based hash table provides capacity guarantee.
Cuckoo hash table provides high table utilization and constant table lookup time.
The combination of both to achieve both capacity guarantee and better utilization.
Adaptive vrouter
From the study, we found no single data structure could fit all use cases.
Runtime decision based on traffic patterns could benefit.
During runtime, a “learning” (e.g., trial and rank) phase to try various hash table data structures.

SLIDE 16

Conclusion

We investigated multiple hash table algorithms and implementations in

popular virtual switches.

We analyzed various hash table designs and provide guide lines for different

use cases.

We proposed Intel AVX based LRU cache implementation and adaptive

signature comparison.

We proposed future directions on hash table design in virtual switches.