Scalable String Matching on the Cell BE Processor

SLIDE 1

Georgia Tech, Sony/Toshiba/IBM Workshop on Software and Applications for the Cell BE Processor, Atlanta, GA, June 19, 2007

Scalable String Matching on the Cell BE Processor

Daniele Scarpazza, Oreste Villa, Fabrizio Petrini

Applied Computer Science Group Pacific Northwest National Laboratory

fabrizio.petrini@pnl.gov

SLIDE 2

Outline

The problem

Network Intrusion Detection Systems (NIDS) are becoming an essential part of data centers

At the heart of a NIDS there is a string matching algorithm

The Aho-Corasick algorithm

A Deterministic Finite Automaton (DFA)

Multicore Processors

An interesting opportunity to accelerate keyword scanning

Most existing work has been done on FPGAs and specialized processors

Goals and challenges

Scalability of the dictionary and the network speed

DFAs with very high speed

Two SPEs can handle a 10 Gbit/sec rate with a transition table of less than 200 KB

SLIDE 3

[Figure: timeline from 2003 to 2013 with eras labeled "Medieval Times", "Renaissance Period", and "Industrial Age"; annotations: "Small Number of Traditional Cores", "SMT", "100 Threads", "Arrays of Throughput Cores".]

The advent of teraflop-scale, many-core processors.

Courtesy of Doug Carmean, Intel

SLIDE 4

Set Pattern Matching Problem

Find the patterns P = {P1, P2, ..., Pq} in a text T

Aho and Corasick proposed an interesting algorithm for multi-pattern string matching

Uses a state machine

Important problem in a number of fields

Text processing, biology, network security, etc.
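
To make the problem statement concrete, here is a minimal brute-force baseline. It is not the Aho-Corasick algorithm; it only pins down what "find all patterns of P in T" should return. The function name `naive_set_match` is hypothetical, and the dictionary and text are the ones used in the example slides of this deck.

```python
def naive_set_match(patterns, text):
    """Return sorted (start_index, pattern) pairs for every occurrence.

    Baseline only: it scans the text once per pattern, so the cost grows
    with the number of patterns q. A state machine avoids the repeated scans.
    """
    hits = []
    for p in patterns:
        start = text.find(p)
        while start != -1:
            hits.append((start, p))
            start = text.find(p, start + 1)  # step by 1 to keep overlapping hits
    return sorted(hits)


matches = naive_set_match(["her", "iris", "he", "is"], "the iris for her")
print(matches)
```

Any faster matcher should report the same set of (position, pattern) pairs, which makes this baseline useful as a reference in tests.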

SLIDE 5

Aho-Corasick - Example

Automaton states: ϖ (root), h, i, he, ir, is, her, iri, iris

P = {her, iris, he, is}

T = “the iris for her”

SLIDE 6

Aho-Corasick - Example

Automaton states: ϖ (root), h, i, he, ir, is, her, iri, iris

P = {her, iris, he, is}

T = “the iris for her”

SLIDE 7

Aho-Corasick - Example

Automaton states: ϖ (root), h, i, he, ir, is, her, iri, iris

P = {her, iris, he, is}

T = “the iris for her”

SLIDE 8

First Step: Keyword Tree
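
The slide's figure is not recoverable from this transcript, so the following is a minimal sketch of the first step for the example dictionary P = {her, iris, he, is}. The names `build_keyword_tree`, `goto`, and `out` are illustrative choices, not taken from the slides.

```python
def build_keyword_tree(patterns):
    """Build the keyword tree (trie) for a set of patterns.

    goto[s] maps a character to the child of state s.
    out[s] is the set of patterns that end exactly at state s.
    State 0 is the root (the empty prefix).
    """
    goto = [{}]
    out = [set()]
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})          # allocate a new state for this prefix
                out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(p)
    return goto, out


goto, out = build_keyword_tree(["her", "iris", "he", "is"])
num_states = len(goto)
print(num_states, sorted(goto[0]))
```

Each state corresponds to one distinct prefix of the dictionary, which is why the example yields one state per prefix listed on the example slides (root, h, i, he, ir, is, her, iri, iris).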

SLIDE 9

Second Step: Failed Transitions (Non-deterministic Finite Automaton, NFA)
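
A minimal sketch of this step, assuming the standard breadth-first construction of failure links from the Aho-Corasick algorithm; the slide's own figure is not available here. All function and variable names are illustrative. The keyword tree builder is repeated so that the block runs on its own.

```python
from collections import deque


def build_keyword_tree(patterns):
    """Trie: goto[s][ch] -> child state, out[s] -> patterns ending at s."""
    goto, out = [{}], [set()]
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(p)
    return goto, out


def add_failure_links(goto, out):
    """fail[s] is the state for the longest proper suffix of s that is
    also a prefix of some pattern. Outputs are merged along the links."""
    fail = [0] * len(goto)
    queue = deque(goto[0].values())      # depth-1 states fail to the root
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]              # follow failed transitions
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]       # e.g. "iris" also reports "is"
            queue.append(t)
    return fail


def search(text, goto, fail, out):
    """Scan the text once, following failure links on a mismatch."""
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        hits.extend((i - len(p) + 1, p) for p in out[s])
    return sorted(hits)


goto, out = build_keyword_tree(["her", "iris", "he", "is"])
fail = add_failure_links(goto, out)
hits = search("the iris for her", goto, fail, out)
print(hits)
```

In this form a single input character may trigger several failure hops, so the work per character is not constant. That is the motivation for precomputing every transition in the following steps.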

SLIDE 10

Extend Failed Transitions for Each Character

SLIDE 11

Build an Optimized Deterministic Finite Automaton (DFA)
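
The last two steps can be sketched as follows: resolve the failed transition for every (state, character) pair ahead of time, so that scanning needs exactly one table lookup per input character. This is the textbook DFA construction, not necessarily the optimized layout the authors used; `build_dfa`, `delta`, and `scan` are hypothetical names.

```python
from collections import deque


def build_dfa(patterns):
    """Return (delta, out): delta[s][ch] is the next state for every
    state s and every character ch of the dictionary alphabet."""
    goto, out = [{}], [set()]
    for p in patterns:                       # step 1: keyword tree
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(p)

    alphabet = sorted({ch for p in patterns for ch in p})
    fail = [0] * len(goto)
    delta = [dict() for _ in goto]
    for ch in alphabet:                      # root: missing edges loop to root
        delta[0][ch] = goto[0].get(ch, 0)

    queue = deque(goto[0].values())
    while queue:                             # BFS: shallower rows are complete
        s = queue.popleft()
        for ch in alphabet:
            if ch in goto[s]:
                t = goto[s][ch]
                fail[t] = delta[fail[s]][ch]
                out[t] |= out[fail[t]]
                delta[s][ch] = t
                queue.append(t)
            else:                            # failed transition, resolved now
                delta[s][ch] = delta[fail[s]][ch]
    return delta, out


def scan(text, delta, out):
    """One lookup per character; characters outside the alphabet go to root."""
    s, hits = 0, []
    for i, ch in enumerate(text):
        s = delta[s].get(ch, 0)
        hits.extend((i - len(p) + 1, p) for p in out[s])
    return sorted(hits)


delta, out = build_dfa(["her", "iris", "he", "is"])
dfa_hits = scan("the iris for her", delta, out)
print(len(delta), dfa_hits)
```

The price of constant work per character is table size: one cell per state and symbol, which is the speed versus dictionary size trade-off raised on the next slide.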

SLIDE 12

Design Challenges: Speed vs. Size of the Dictionary

SLIDE 13

Mapping the Aho-Corasick Algorithm on the Cell Processor: Data Streaming and SIMD Parallelism

[Figure: Cell BE block diagram showing the PPE, SPE0 to SPE7, MIC, BIF, IOIF0, IOIF1, and the data arbiter.]

SLIDE 14

Aho-Corasick: A Multi-level Parallelization

General approach

Multithreaded parallelism within a Synergistic Processing Unit (SPU), using multiple segments/connections of the input stream

SIMD parallelism, pipeline parallelism (even/odd pipelines of the SPU)

An arsenal of techniques: loop unrolling, removing speculation, restricted pointers, etc.

Using multiple SPUs to increase processing bandwidth/dictionary size

Dynamic loading of dictionaries

SLIDE 15

Aggregate Main Memory Bandwidth: Memory Access Traffic Explicitly Orchestrated at User Level

SLIDE 16

SIMD and Pipeline Parallelism

[Figure: data flow for advancing 16 DFAs together. Data labels: 16 interleaved input streams; 16 input characters; 16 input symbols; 16 offsets to the transition table cells; current state pointers for the 16 DFAs; addresses of the cells containing the next state pointers; State Transition Table; next state pointers for the 16 DFAs; final state flags for the 16 DFAs. Operation labels: SIMD shl, SIMD shr, shifts (<<, >> 1) and a split, 16 SISD adds, 16 loads, and two groups of 16 SISD ands with the masks 0xFFFFFFFE and 0x00000001.]
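
The exact data flow of the figure cannot be recovered from this transcript, so the sketch below is only one plausible reading of its labels, written in plain Python with list comprehensions standing in for the 16 lanes. The assumptions are: each table cell is a 32-bit word holding the byte offset of the next state's row, bit 0 of that word is the final state flag (which is what the masks 0xFFFFFFFE and 0x00000001 suggest), and a row has 32 cells of 4 bytes. The toy automaton and all names are hypothetical.

```python
SYMBOLS = 32                       # input symbols per state (from the slides)
CELL_BYTES = 4                     # assumption: 32-bit cells
ROW_BYTES = SYMBOLS * CELL_BYTES   # 128, so row offsets leave bit 0 free
PTR_MASK = 0xFFFFFFFE              # keeps the next state pointer
FLAG_MASK = 0x00000001             # keeps the final state flag


def encode_stt(rows, final_states):
    """rows[s][sym] is a next state index. Returns a flat list of cells,
    each cell = row offset of the next state, with bit 0 set if final."""
    cells = []
    for row in rows:
        for nxt in row:
            cells.append(nxt * ROW_BYTES | (1 if nxt in final_states else 0))
    return cells


def step16(stt, state_ptrs, symbols):
    """Advance all lanes by one input symbol."""
    offsets = [sym << 2 for sym in symbols]                 # symbol -> byte offset
    addresses = [p + o for p, o in zip(state_ptrs, offsets)]
    loaded = [stt[a >> 2] for a in addresses]               # one load per lane
    next_ptrs = [v & PTR_MASK for v in loaded]
    flags = [v & FLAG_MASK for v in loaded]
    return next_ptrs, flags


def run_lockstep(stt, streams):
    """Run one DFA per stream; return the number of matches per lane."""
    ptrs = [0] * len(streams)
    counts = [0] * len(streams)
    for symbols in zip(*streams):          # one symbol from every stream
        ptrs, flags = step16(stt, ptrs, symbols)
        counts = [c + f for c, f in zip(counts, flags)]
    return counts


# Toy 3-state DFA that matches the symbol sequence (1, 2).
row0 = [0] * SYMBOLS; row0[1] = 1
row1 = [0] * SYMBOLS; row1[1] = 1; row1[2] = 2
row2 = [0] * SYMBOLS; row2[1] = 1
stt = encode_stt([row0, row1, row2], final_states={2})

streams = [[1, 2, 1, 2] if lane % 2 == 0 else [2, 1, 0, 2] for lane in range(16)]
counts = run_lockstep(stt, streams)
print(counts)
```

Storing row offsets instead of state indices removes a multiplication from the inner loop, and packing the flag into bit 0 means a single load yields both the next state and the match indication.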

SLIDE 17

Local Storage Usage

Total size of the local store: 256 k

Case    DFA state transition table              Input buffer 0  Input buffer 1  Code and stack
Case 1  190 k (1520 states, 32 input symbols)   16 k            16 k            34 k
Case 2  206 k (1648 states, 32 input symbols)   8 k             8 k             34 k
Case 3  214 k (1712 states, 32 input symbols)   4 k             4 k             34 k
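
The three layouts can be checked with a little arithmetic. The sketch below assumes 4 bytes per transition table cell; the slides do not state the cell size, but that value reproduces the 190 k, 206 k, and 214 k table sizes exactly. Names such as `stt_kb` are illustrative.

```python
LOCAL_STORE_KB = 256
CELL_BYTES = 4          # assumption: one 32-bit cell per (state, symbol) pair


def stt_kb(states, symbols=32, cell_bytes=CELL_BYTES):
    """Size of a dense state transition table in KB."""
    return states * symbols * cell_bytes / 1024


cases = {
    "Case 1": {"states": 1520, "buffer_kb": 16, "code_kb": 34},
    "Case 2": {"states": 1648, "buffer_kb": 8, "code_kb": 34},
    "Case 3": {"states": 1712, "buffer_kb": 4, "code_kb": 34},
}

tables = {}
totals = {}
for name, c in cases.items():
    tables[name] = stt_kb(c["states"])
    # one table, two input buffers (double buffering), code and stack
    totals[name] = tables[name] + 2 * c["buffer_kb"] + c["code_kb"]
    print(name, tables[name], totals[name])
```

Under this assumption every case fills the 256 KB local store exactly: halving the input buffers frees 16 KB, which buys 128 more states at 128 bytes per state.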

SLIDE 18

Overlapping Computation with Communication

[Figure: timeline with two rows. Computation: process buffer 0 (25.64 us), process buffer 1 (25.64 us), process buffer 0 (25.64 us), and so on. Data transfer: load buffer 0 (5.94 us) and load buffer 1 (5.94 us), alternating.]
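
A short calculation shows why the transfers can be hidden and what rate the timings imply. The buffer size is an assumption: the slide gives only the times, and 16 KB (the Case 1 input buffer from the local storage slide) is used here. The names are illustrative.

```python
PROCESS_US = 25.64          # time to process one input buffer (from the slide)
LOAD_US = 5.94              # time to load one input buffer (from the slide)
BUFFER_BYTES = 16 * 1024    # assumption: 16 KB input buffers


def per_spe_gbps(buffer_bytes=BUFFER_BYTES, process_us=PROCESS_US):
    """Sustained rate when every load is hidden behind computation."""
    bits = buffer_bytes * 8
    return bits / (process_us * 1e-6) / 1e9


# With double buffering, the next buffer is loaded while the current one
# is processed, so the load is hidden as long as it is the shorter task.
transfer_hidden = LOAD_US <= PROCESS_US
one_spe = per_spe_gbps()
print(transfer_hidden, round(one_spe, 2), round(2 * one_spe, 2))
```

Under the 16 KB assumption one SPE sustains a little over 5 Gbit/sec, which is consistent with the outline's statement that two SPEs can handle a 10 Gbit/sec rate.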

SLIDE 19

Schedule of a Dynamic State Transition Table (STT) Replacement

[Figure: timeline with two rows.]

Computation (25.64 us each): process buffer 0 (match against STT 0), process buffer 1 (match against STT 0), process buffer 0 (match against STT 1), process buffer 1 (match against STT 1), process buffer 0 (match against STT 0), and so on

Data transfer: load input to buffer 0 or buffer 1 (5.94 us each), interleaved with loading the next STT in two chunks, chunk 1/2 (48 kbyte, 17.83 us) and chunk 2/2 (47 kbyte, 17.46 us), alternately into STT 1 and STT 0
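
The timings on this slide can be checked against a simple budget. The sketch assumes that transfers are issued one after another and that, while two buffers are matched against the current STT, the transfer side has to bring in both chunks of the next STT plus at most two input buffers. That grouping is a reading of the flattened timeline, not something the slide states explicitly; the names are illustrative.

```python
PROCESS_US = 25.64                        # per buffer, per STT (from the slide)
LOAD_INPUT_US = 5.94                      # per input buffer (from the slide)
STT_CHUNKS = [(48, 17.83), (47, 17.46)]   # (kbyte, us) per chunk (from the slide)

stt_kbyte = sum(kb for kb, _ in STT_CHUNKS)
stt_load_us = sum(us for _, us in STT_CHUNKS)

# Assumed window: two buffers are processed against the resident STT.
compute_window_us = 2 * PROCESS_US

# Assumed worst case for the same window: next STT plus two input loads.
transfer_us = stt_load_us + 2 * LOAD_INPUT_US
slack_us = compute_window_us - transfer_us

print(stt_kbyte, round(stt_load_us, 2), round(transfer_us, 2), round(slack_us, 2))
```

Under these assumptions the transfers fit inside the compute window with a few microseconds to spare, so replacing the STT does not stall the matching loop; the cost is that each input buffer is processed once per STT.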

SLIDE 20

Throughput Provided by STT Replacement with a Variable Number of Tiles (1 to 8)

SLIDE 21

Conclusion

Multi-core processors are competitive with FPGAs and specialized network processors

Multiple data streaming options to perform string matching

Performance from 40 Gbits/sec to 5 Gbits/sec

With small dictionaries

Future work includes

Addressing larger dictionaries

Compression of the STT

Paper available at http://hpc.pnl.gov/people/fabrizio/papers/smtps07.pdf