Leftmost Longest Regular Expression Matching in Reconfigurable Logic

Kubilay Atasu
IBM Research - Zurich
kat@zurich.ibm.com

Abstract—Regular expression (regex) matching is an essential part of text analytics and network intrusion detection systems. The leftmost longest regex matching feature enables finding a leftmost derivation of an input text and helps resolve ambiguities that can arise in natural-language parsing. We show that leftmost longest regex matching can be efficiently performed in a data-flow pipeline by combining a recently proposed regex-matching architecture with simple last-in first-out (LIFO) buffers and streaming filter units, without creating significant back-pressure or using costly sorting operations. The techniques we propose can be used to compute overlapping and non-overlapping leftmost longest and rightmost longest regex matches. In addition, we show that the latency of the LIFO buffers can be hidden by overlapping the processing of subsequent input streams, without replicating the buffer space. Experiments on an Altera Stratix IV FPGA show a 200-fold improvement of the processing rates compared with a multithreaded software implementation.

Fig. 1. Using FPGA-based accelerators for text analysis significantly improves the query-processing rates and enables real-time response latencies.

I. INTRODUCTION

We live in a data-centric world. Data is driving discovery in many fields, such as healthcare analytics, cyber-security, weather forecasting, and computational astrophysics. So-called big data has become a new natural resource, and discovering insights in big data will be a key capability of future computing platforms. The explosion in the size of datasets is leading to a paradigm shift in system design. The need to achieve an efficient integration of massive data and computation is resulting in a major re-thinking of memory hierarchies and computing fabrics in datacentres. Data-centric systems diverge from traditional computer architectures in two main aspects. First, to improve the bandwidth of data access, computation is being moved closer to the data [1]. Secondly, the energy consumption of datacentres is increasing at an alarming rate, and energy costs are starting to exceed equipment costs [2]. Scaling up datacentre performance simply by increasing the number of processor cores is no longer economically feasible. To improve both performance and energy efficiency and to exploit the data-access bandwidth more efficiently, data-centric systems are increasingly relying on heterogeneous compute resources, such as graphics-processing units (GPUs) and field-programmable gate arrays (FPGAs).

The process of extracting information from large-scale unstructured text is called text analytics; it has applications in business analytics, healthcare, and security intelligence. Analyzing unstructured text and extracting the insights hidden in it at high bandwidth and low latency are computationally challenging tasks. In particular, text analytics functions rely heavily on regexes and dictionaries for locating named entities, e.g., person and company names, in free text [3], [4]. Typically, these regex and dictionary matching tasks, which are implemented using finite-state machines, dominate the runtime of text analytics systems [5]. The processing of finite-state-machine-based tasks does not map well onto general-purpose processors [6]. However, FPGAs are an ideal medium for executing such tasks because of the massive parallelism they offer, which can be exploited at bit-level granularity [7].

Fig. 1 illustrates a use case of FPGA-based accelerators in a business analytics platform that continuously collects news entries from different data sources and indexes them using a local news search engine. When a user submits a news search query that contains a set of keywords, e.g., "IBM" and "Switzerland", the news search engine retrieves all news entries that contain these keywords from its index. After that, the relevant news entries are parsed word by word to identify phrases that might, for instance, reveal a business expansion strategy of "IBM" in "Switzerland", e.g., the opening of a new office or the announcement of a new strategic partnership. This second stage acts as a second level of filtering, and only those entries that contain interesting and useful information are transferred to the user, preferably in almost real time. This requires a deeper analysis of the news entries and is thus computationally much more intensive than a simple keyword lookup in an index. When thousands of users submit news search queries concurrently, this second stage becomes a computational bottleneck. One way of eliminating this bottleneck is to scale up the number of processor cores, which, however, results in higher space and energy consumption and lower reliability. A more promising solution can involve combining an existing processor with a hardware accelerator, which boosts
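As a point of reference for the leftmost longest semantics discussed in the abstract, the following minimal Python sketch computes non-overlapping leftmost longest matches of a pattern in a text. It is a software illustration only, assuming POSIX-style semantics: the function name leftmost_longest_matches and the use of Python's re module are choices made here for exposition and have no connection to the paper's LIFO-buffer and streaming-filter hardware pipeline.

    import re

    def leftmost_longest_matches(pattern, text):
        """Return non-overlapping leftmost longest matches as (start, end, substring)."""
        compiled = re.compile(pattern)
        matches = []
        pos = 0
        while pos <= len(text):
            m = compiled.search(text, pos)      # leftmost position with any match
            if m is None:
                break
            start = m.start()
            best_end = m.end()
            # Among all matches starting at `start`, take the longest one by
            # probing end positions from right to left (simple but slow; the
            # paper's hardware pipeline avoids this kind of rescanning).
            for end in range(len(text), start, -1):
                if compiled.fullmatch(text, start, end):
                    best_end = end
                    break
            matches.append((start, best_end, text[start:best_end]))
            # Resume after the reported match (non-overlapping semantics);
            # step past zero-length matches to guarantee progress.
            pos = best_end if best_end > start else start + 1
        return matches

    print(leftmost_longest_matches("a|ab", "xxabxx"))   # [(2, 4, 'ab')]

Note that Python's re engine by itself implements leftmost-first (Perl-style) alternation, so a plain search with the pattern a|ab would report "a"; the POSIX-style leftmost longest semantics targeted by the paper report "ab" instead.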

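The introduction observes that regex and dictionary matching is implemented with finite-state machines and that such processing maps poorly onto general-purpose processors. The hypothetical sketch below, a hand-written automaton for the single dictionary term "IBM", makes the reason visible: each input character forces a state transition that depends on the previous one, leaving little instruction-level parallelism. The state encoding and function names are invented here for illustration and are not taken from the paper.

    # States: 0 = nothing matched, 1 = saw "I", 2 = saw "IB", 3 = saw "IBM".
    def step(state, ch):
        if ch == 'I':
            return 1                      # "I" can always restart a match
        if state == 1 and ch == 'B':
            return 2
        if state == 2 and ch == 'M':
            return 3
        return 0

    def find_term(text):
        """Report the end offset of every occurrence of "IBM" in `text`."""
        state, hits = 0, []
        for i, ch in enumerate(text):     # one transition per input character:
            state = step(state, ch)       # this serial dependency is what limits
            if state == 3:                # FSM throughput on general-purpose CPUs
                hits.append(i + 1)
        return hits

    print(find_term("IBM opened an office; IBM Research - Zurich"))  # [3, 25]

An FPGA, by contrast, can evaluate many such automata side by side and exploit parallelism down to the bit level, which is the property the paper builds on.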