slide-1
SLIDE 1

Generalized Pattern Matching Micro-Engine

Yuanwei Fang*, Raihan Rasool‡, Dilip Vasudevan*, Andrew A. Chien*†

* University of Chicago, † Argonne National Laboratory, ‡ King Faisal University

slide-2
SLIDE 2

Big Data Applications

  • Deep Packet Inspection
  • Bioinformatics (DNA Alignment)
  • JSON/XML Parsing
  • Signal Triggering

6/24/2014 UNIVERSITY OF CHICAGO


slide-3
SLIDE 3

Deep Packet Inspection

  • High-speed network: 100 Gb/s
  • Growing number of patterns: 6000 Snort rules
  • Speed requirement: > 75 Tera DFAops/s
  • Power budget: 200 W
  • Energy efficiency requirement: > 375 Gops/J

slide-4
SLIDE 4

Bioinformatics (DNA Alignment)

  • Genome size: 130G base pairs
  • Bioinformatics database: millions of species
  • Speed requirement: > 1 Tera DFAops/s
  • Power budget: 200 W
  • Energy efficiency requirement: > 5 Gops/J

slide-5
SLIDE 5

Deterministic Finite Automata (DFA)


slide-6
SLIDE 6

Programmable Approaches

Intel Xeon E5-2600: 17 G DFAops/second at 130 W, i.e. 0.13 Gops/J
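The efficiency figures quoted so far all follow from dividing throughput by power. The short sketch below redoes that arithmetic with the numbers from the slides; the helper name `gops_per_joule` is made up for illustration.

```python
def gops_per_joule(ops_per_second: float, watts: float) -> float:
    """Energy efficiency in Gops/J: (ops/s) divided by (J/s), scaled by 1e9."""
    return ops_per_second / watts / 1e9

# Figures quoted on the slides.
dpi_target = gops_per_joule(75e12, 200)  # deep packet inspection requirement
dna_target = gops_per_joule(1e12, 200)   # DNA alignment requirement
xeon = gops_per_joule(17e9, 130)         # Intel Xeon E5-2600

print(f"DPI target: {dpi_target:.0f} Gops/J")
print(f"DNA target: {dna_target:.0f} Gops/J")
print(f"Xeon:       {xeon:.2f} Gops/J")
print(f"DPI target / Xeon: {dpi_target / xeon:.0f}x")
```

The last line shows how far a general-purpose CPU sits from the deep packet inspection target, which is the gap the rest of the talk addresses.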


slide-7
SLIDE 7

Approach

  • Workload
      • M input characters (M DFA transitions per rule)
      • N DFA rules, each applied to the M input characters
  • Goal
      • Compute N x M transitions efficiently
  • Approach
      • Parallelize DFA execution
      • Fused instructions
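As a reference point for the workload above, here is a minimal software sketch of the N x M computation on a conventional core: every one of N table-driven DFAs consumes every one of M input characters, one transition at a time. The helper `substring_dfa` (a KMP-style automaton builder), the tiny alphabet, and the example patterns are assumptions made for illustration; they are not taken from the slides.

```python
from typing import Dict, List, Set, Tuple

Table = List[Dict[str, int]]  # table[state][symbol] -> next state

def substring_dfa(pattern: str, alphabet: str) -> Tuple[Table, Set[int]]:
    """Build a DFA that enters state len(pattern) right after each occurrence."""
    m = len(pattern)
    table: Table = [{c: 0 for c in alphabet} for _ in range(m + 1)]
    table[0][pattern[0]] = 1
    restart = 0  # state reached by the longest proper border seen so far
    for j in range(1, m + 1):
        for c in alphabet:
            table[j][c] = table[restart][c]      # mismatch: behave like restart
        if j < m:
            table[j][pattern[j]] = j + 1          # match: advance
            restart = table[restart][pattern[j]]
    return table, {m}

def run_all(dfas, text: str):
    """Run every DFA over every character: N x M transitions in total."""
    matches = []      # (rule index, position of the last matched character)
    transitions = 0
    for rule, (table, accepting) in enumerate(dfas):
        state = 0
        for pos, ch in enumerate(text):
            state = table[state][ch]              # one DFAop
            transitions += 1
            if state in accepting:
                matches.append((rule, pos))
    return matches, transitions

alphabet = "abc"
dfas = [substring_dfa(p, alphabet) for p in ("ab", "ca", "bb")]
matches, transitions = run_all(dfas, "cabcabb")
print(transitions)  # N x M = 3 x 7
print(matches)
```

Each inner-loop iteration is a dependent table lookup followed by an accept check, which is the per-transition cost that parallel execution and fused instructions aim to reduce.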


slide-8
SLIDE 8

What Is a Micro-Engine

The Generalized Pattern Matching Micro-Engine (GenPM) is one micro-engine in the 10x10 approach

[Figure: 10x10 core block diagram. A basic RISC CPU sits alongside the micro-engines (2, 3, 4, 6, 7, 8 and GenPM), each with its own I-cache; all share the L1 data cache / local memory.]

slide-9
SLIDE 9

GenPM Micro Architecture


slide-10
SLIDE 10

Fused Instructions: Multi-Step

[Figure: multi-step datapath, drawn as two chained copies. Labels: string buffer (a, b, c), current state, ALU, address, local memory, next state, Acc_Vec, accept.]

slide-11
SLIDE 11

Fused Instructions: Multi-Step

[Figure: animation step showing the first datapath copy. Labels: string buffer (a, b, c), current state, ALU, address, local memory, next state 1, Acc_Vec, accept.]

slide-12
SLIDE 12

Fused Instructions: Multi-Step

[Figure: animation step showing both datapath copies. Labels: string buffer (a, b, c), current state, ALU, address, local memory, next state 1, Acc_Vec, accept.]

slide-13
SLIDE 13

Fused Instructions: Multi-Step

[Figure: animation step showing both datapath copies followed by a CHECK on Acc_Vec. Labels: string buffer (a, b, c), current state, ALU, address, local memory, next state 1, Acc_Vec, accept.]
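The figure labels suggest that a fused multi-step operation chains several table lookups (current state plus the next character from the string buffer gives an address into local memory, which returns the next state), accumulates accept information in Acc_Vec, and performs a single CHECK at the end. The sketch below is a software model of that reading only; the exact hardware semantics are not given on the slides, and the function names, the example DFA for "ab", and the step length are assumptions.

```python
# Hand-written DFA for the pattern "ab" over the alphabet {a, b, c} (illustrative).
TABLE = [
    {"a": 1, "b": 0, "c": 0},  # state 0: nothing matched
    {"a": 1, "b": 2, "c": 0},  # state 1: seen "a"
    {"a": 1, "b": 0, "c": 0},  # state 2: seen "ab" (accepting)
]
ACCEPTING = {2}

def multi_step(table, accepting, state, window):
    """Model of one fused operation: several transitions, one accumulated flag."""
    acc = False
    for ch in window:                       # chained lookups, no branch in between
        state = table[state][ch]
        acc = acc or (state in accepting)   # accumulate instead of checking
    return state, acc

def scan(text, step):
    """Return the start offsets of the windows in which some match ended."""
    state, flagged = 0, []
    for start in range(0, len(text), step):
        state, acc = multi_step(TABLE, ACCEPTING, state, text[start:start + step])
        if acc:                             # the single CHECK per fused operation
            flagged.append(start)
    return flagged

flagged = scan("ccabccccabcc", step=4)
print(flagged)
```

With a step length of k, the accept check runs once per k characters rather than once per character; the price is that a flagged window only says a match ended somewhere inside it.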

slide-14
SLIDE 14

Parallel DFA: Vector Instruction

[Figure: an SSE ADD vector instruction, one addition per lane.]

slide-15
SLIDE 15

Parallel DFA: Vector Instruction

[Figure: the GMVSNEXT vector instruction, one DFAop per lane.]
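The analogy between the two figures can be modeled in software: where a vector ADD applies one addition per lane, a vector DFA step applies one transition per lane. The sketch below assumes each lane holds a different DFA and that all lanes see the same input character, which matches the N-rules-by-M-characters workload but is not spelled out on the slides. `vector_add`, `vector_next`, and the two example DFAs are illustrative names, not GenPM instructions.

```python
# Hand-written DFAs over the alphabet {a, b}; state 2 is accepting in both.
AB = [{"a": 1, "b": 0}, {"a": 1, "b": 2}, {"a": 1, "b": 0}]  # matches "ab"
BB = [{"a": 0, "b": 1}, {"a": 0, "b": 2}, {"a": 0, "b": 2}]  # matches "bb"

def vector_add(xs, ys):
    """Lane-wise addition, the familiar SIMD pattern."""
    return [x + y for x, y in zip(xs, ys)]

def vector_next(tables, states, ch):
    """Lane-wise DFA step: every lane performs one DFAop on the same symbol."""
    return [table[state][ch] for table, state in zip(tables, states)]

lanes = [AB, BB]
states = [0, 0]
history = []
for ch in "abb":
    states = vector_next(lanes, states, ch)
    history.append(list(states))

print(history)
```

Unlike an addition, each lane's operation is a memory lookup, so a hardware version needs enough local memory bandwidth to serve all lanes at once.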

slide-16
SLIDE 16

GenPM Code Example

[Figure: GenPM code listing, annotated with three phases.]

  • Data movement
  • Multi-step parallel DFA execution
  • Find precise matching position
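The original code listing did not survive extraction, so the sketch below is not the GenPM code. It is a plain Python model of the three annotated phases under stated assumptions: several DFAs advance together over a window of `step` characters, accept flags are accumulated per lane, and a flagged lane is replayed one character at a time from its saved entry state to recover the exact match position. All names and the example DFAs are hypothetical.

```python
# Hand-written DFAs over the alphabet {a, b}; each entry is (table, accepting).
AB = ([{"a": 1, "b": 0}, {"a": 1, "b": 2}, {"a": 1, "b": 0}], {2})  # "ab"
BB = ([{"a": 0, "b": 1}, {"a": 0, "b": 2}, {"a": 0, "b": 2}], {2})  # "bb"
DFAS = [AB, BB]

def find_matches(dfas, text, step):
    # Phase 1: data movement. Plain copies stand in for loading the tables
    # into local memory and the input into the string buffer.
    tables = [table for table, _ in dfas]
    accepting = [acc_states for _, acc_states in dfas]
    states = [0] * len(dfas)
    matches = []
    for start in range(0, len(text), step):
        window = text[start:start + step]
        entry = list(states)              # saved so a flagged lane can be replayed
        acc = [False] * len(dfas)
        # Phase 2: multi-step parallel DFA execution over the window.
        for ch in window:
            states = [t[s][ch] for t, s in zip(tables, states)]
            acc = [a or (s in f) for a, s, f in zip(acc, states, accepting)]
        # Phase 3: find the precise matching position in flagged lanes only.
        for lane, hit in enumerate(acc):
            if not hit:
                continue
            s = entry[lane]
            for offset, ch in enumerate(window):
                s = tables[lane][s][ch]
                if s in accepting[lane]:
                    matches.append((lane, start + offset))
    return sorted(matches)

print(find_matches(DFAS, "aabbab", step=3))
```

The result does not depend on the step length, since the replay recovers exactly the positions a one-character-at-a-time scan would report; a longer step only changes how often the slow path runs.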

slide-17
SLIDE 17

Methodology

  • Design space: parallelism and step length
  • Baseline
      • 32-bit 6-stage in-order RISC
      • 4GB DDR3 DRAM
      • 32KB L1 I-cache, 24KB L1 D-cache, 512KB L2 (modeled on Intel Silverthorne)
  • GenPM
      • 1MB local memory (up to 64 banks)
      • Vector and fused instructions
  • Performance/power model
      • Core: 32nm synthesis by Synopsys Processor Designer
      • Memories: MARSSX86/CACTI 6 + DRAMSim2
  • Workload
      • 64 Snort rules from the 2.9.5.6 snapshot, 10KB random network dump


slide-18
SLIDE 18

Performance

[Figure: bar chart of speedup versus RISC, by step length.]

Step length    GenPM_8way    GenPM_64way
1              36            289
8              243           1947
16             300           2498

slide-19
SLIDE 19

Energy Efficiency

[Figure: bar chart of energy improvement versus RISC, by step length.]

Step length    GenPM_8way    GenPM_64way
1              31            213
8              151           861
16             174           980

slide-20
SLIDE 20

Throughput/watt (absolute)

[Figure: bar chart of throughput per watt (Gops/J) versus step length (1, 8, 16) for GenPM_8way and GenPM_64way; the axis runs up to 40 Gops/J.]

Scaled to a 75 W chip, GenPM delivers > 2.6 Tera DFAops/second
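As a quick consistency check, the 75 W scaling claim can be converted back into Gops/J and set against the Xeon figure quoted earlier in the talk. The variable names below are illustrative; the inputs are the numbers from the slides.

```python
genpm_ops_per_second = 2.6e12  # "> 2.6 Tera DFAops/second"
chip_watts = 75.0              # "a 75 W chip"
xeon_gops_per_joule = 0.13     # Intel Xeon E5-2600 figure quoted earlier

# Energy efficiency implied by the scaling claim, in Gops/J.
genpm_eff = genpm_ops_per_second / chip_watts / 1e9
ratio = genpm_eff / xeon_gops_per_joule

print(f"GenPM at 75 W: {genpm_eff:.1f} Gops/J")
print(f"Relative to the Xeon figure: about {ratio:.0f}x")
```

The implied efficiency falls within the range of the throughput-per-watt chart, so the scaling claim and the chart agree with each other.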

slide-21
SLIDE 21

Energy Breakdown

[Figure: stacked bars (0% to 100% of total energy) broken down into LM, L1_I, L1_D, L2, DRAM, and Core, for RISC, GenPM_8B_1S, GenPM_8B_8S, GenPM_8B_16S, GenPM_64B_1S, GenPM_64B_8S, and GenPM_64B_16S.]

LM_max = 83%

slide-22
SLIDE 22

General Comparison


slide-23
SLIDE 23

Related Work

  • ASIC: [Brodie et al., ISCA 2006], [Titanic System RXP], [Cisco SCE]
  • FPGA: [Yang Xu et al., ANCS 2011], [T Song et al., INFOCOM 2008], [I Sourdis et al., VLSI 2008]
  • CPU: [Mytkowicz et al., ASPLOS 2014], [Intel HyperScan]
  • GPU: [Vasiliadis G et al., CCS 2011], [Lin CH et al., INFOCOM 2012]
  • SoC: [C Johnson et al., ISSCC 2010], [Cavium Octeon], [IBM PowerEN]


slide-24
SLIDE 24

Summary

  • GenPM is a high-performance, energy-efficient accelerator for pattern matching workloads
  • ISA exploits parallelism and multi-step execution
  • Scaled to a 75 W chip, GenPM delivers > 2.6 Tera DFAops/second
  • GenPM approaches ASIC efficiency and integrates it into a programmable core


slide-25
SLIDE 25

Future Work

  • DFA table compression
  • Scale up with multiple GenPM micro-engines
  • Explore more applications


slide-26
SLIDE 26

Acknowledgements

  • Defense Advanced Research Projects Agency (DARPA)
  • Agilent Technologies (now Keysight Technologies)
  • Synopsys Academic program
  • Dr. Tung Hoang and members of the Large Scale Systems Group in the Department of Computer Science
