Co-Evaluation of Pattern Matching Algorithms on IoT Devices with Embedded GPUs
Charalampos Stylianopoulos Simon Kindström Magnus Almgren Olaf Landsiedel Marina Papatriantafilou
Distributed Computing and Systems
• IoT security is a concern
• Recent attacks:
  - Show that IoT security is lacking
  - Underline the need for
[Image: thermostat]
• Challenges
  - Resource constrained devices
  - More connected devices -> More traffic to inspect
• NIDS
  - Performance bottleneck
  - Not tailored to hardware
Input stream: … http://some.site.com/get.asp?f=/etc/passwd … GET HTTP try_backdoor…
Pattern set: … /etc/passwd admin.dll get.asp backdoor …
Search for all patterns, anywhere in the network stream.
Compare all network traffic against all malicious signatures
more than 70% of running time [1]
[1] "Generating realistic workloads for network intrusion detection systems", Antonatos et al.
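The task on this slide can be written down as a naive baseline: search for every pattern separately, at every position of the input. This is only a sketch to make the problem concrete. The function name `naive_multi_match` is hypothetical, and the input is the slide's example with the elided parts ("…") left out.

```python
def naive_multi_match(stream: bytes, patterns: list[bytes]) -> list[tuple[int, bytes]]:
    """Return (offset, pattern) for every occurrence of every pattern."""
    matches = []
    for pattern in patterns:
        # One full scan of the input per pattern: simple, but the cost
        # grows with the number of patterns.
        start = stream.find(pattern)
        while start != -1:
            matches.append((start, pattern))
            start = stream.find(pattern, start + 1)
    return sorted(matches)


stream = b"http://some.site.com/get.asp?f=/etc/passwd GET HTTP try_backdoor"
patterns = [b"/etc/passwd", b"admin.dll", b"get.asp", b"backdoor"]
matches = naive_multi_match(stream, patterns)
```

Scanning the input once per pattern does not scale to thousands of signatures, which is why dedicated multi-pattern algorithms are used instead.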
• Co-evaluation of pattern matching algorithms
  - Evaluate existing implementations
  - Influence the design of new ones
• Target embedded GPUs
  - Deep look into their architectural features
• Extensive evaluation
  - Different datasets, patterns
  - Energy efficiency
[1] "Gnort: High Performance Network Intrusion Detection Using Graphics Processors", Vasiliadis et al., RAID 2008
[2] "APUNet: Revitalizing GPU as Packet Processing Accelerator", Go et al., NSDI 2017
[3] "A highly-efficient memory-compression scheme for GPU-accelerated intrusion detection systems", Bellekens et al., SINCONF 2017
Source: Energy efficient run-time mapping and thread partitioning of concurrent OpenCL applications on CPU-GPU MPSoCs
• Small number of cores/threads
• No main memory on the GPU
  - Shared main memory between CPU and GPU
• No local memory on chip
• Vectorization in each GPU thread
• Separate instruction counter per GPU thread
  - No need to worry about divergent execution
Algorithms by platform:
  CPU: Aho Corasick (state machine based), DFC (filtering based)
  GPU:
• Used in many Network Intrusion Detection Systems
• Builds a State Machine (SM) from all the patterns
• Traverses the SM reading the input byte by byte
“Efficient String Matching: An Aid to Bibliographic Search”, A. Aho, M. Corasick, Comm. ACM ’75
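The two steps above (build a state machine from all patterns, then traverse it one input byte at a time) can be sketched as follows. This is a compact textbook-style version of Aho Corasick, not the implementation evaluated in the talk. The names `build_automaton` and `search` are hypothetical.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto transitions, failure links and output lists."""
    goto = [{}]   # goto[state][byte] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # patterns recognised when reaching each state
    for pattern in patterns:
        state = 0
        for byte in pattern:
            if byte not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][byte] = len(goto) - 1
            state = goto[state][byte]
        out[state].append(pattern)
    # Breadth-first pass: depth-1 states keep failure link 0,
    # deeper states derive theirs from their parent's link.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for byte, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and byte not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(byte, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def search(stream, automaton):
    """Read the input byte by byte and report (offset, pattern) matches."""
    goto, fail, out = automaton
    state = 0
    matches = []
    for i, byte in enumerate(stream):
        while state and byte not in goto[state]:
            state = fail[state]
        state = goto[state].get(byte, 0)
        for pattern in out[state]:
            matches.append((i - len(pattern) + 1, pattern))
    return matches


automaton = build_automaton([b"he", b"she", b"his", b"hers"])
matches = sorted(search(b"ushers", automaton))
```

Every input byte triggers at least one state lookup, so on large pattern sets the state machine no longer fits in cache and the traversal becomes memory bound.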
Limitations
Benefits
• Aho Corasick
• DFC
The DFC algorithm
• Creates a filter from patterns
• Quickly filters out parts of the input
“DFC: Accelerating String Pattern Matching for Network Applications”, Choi et al. NSDI’16
Pattern set: … activate admin.dll backdoor get.asp …
Filter (8 KB): entries ac ad ab ... ba bb ... ge ..., with bits … 1 1 1 1
Input stream: … this is an input
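The figure above can be read as a bitmap with one bit per 2-byte value, which gives 2^16 bits = 8 KB. A minimal sketch of that idea follows, assuming the filter is indexed by the first two bytes of each pattern. The function names are hypothetical, and 1-byte patterns are left out for brevity.

```python
def build_direct_filter(patterns):
    """Set one bit per 2-byte pattern prefix (2**16 bits = 8 KB)."""
    bitmap = bytearray(2**16 // 8)
    for pattern in patterns:
        if len(pattern) < 2:
            raise ValueError("this sketch assumes patterns of at least 2 bytes")
        index = (pattern[0] << 8) | pattern[1]
        bitmap[index >> 3] |= 1 << (index & 7)
    return bitmap


def candidate_offsets(stream, bitmap):
    """Return the offsets whose 2-byte window passes the filter."""
    hits = []
    for i in range(len(stream) - 1):
        index = (stream[i] << 8) | stream[i + 1]
        if bitmap[index >> 3] & (1 << (index & 7)):
            hits.append(i)
    return hits


patterns = [b"activate", b"admin.dll", b"backdoor", b"get.asp"]
bitmap = build_direct_filter(patterns)
clean = candidate_offsets(b"this is an input", bitmap)
suspicious = candidate_offsets(b"try_backdoor get.asp", bitmap)
```

In the second input the window "ac" inside "backdoor" also passes the filter, although "activate" is not present. The filter only discards positions that cannot match, so surviving positions still need to be verified.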
The DFC algorithm (continued)
• Progressive filtering
  - in cache
• Verification
  - in memory
Figure: Initial filter, then pattern length specific filters (1B, 2-3B, 4-7B, 8B or longer), then hash tables
“DFC: Accelerating String Pattern Matching for Network Applications”, Choi et al. NSDI’16
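The progressive filtering and verification steps can be sketched as a small pipeline: a first filter on 2 bytes, a second filter for longer patterns, and an exact comparison against a table only for positions that survive. This is a simplified illustration and not the DFC implementation. The two length classes (2-3 bytes, 4+ bytes), the use of Python sets and dicts in place of bitmaps and hash tables, and all names are assumptions.

```python
def build_stages(patterns):
    """Build the filter stages and verification tables for this sketch."""
    initial = set()      # stage 1: 2-byte prefixes of every pattern
    long_filter = set()  # stage 2: 4-byte prefixes of patterns with 4+ bytes
    short_table = {}     # verification: 2-byte prefix -> patterns of 2-3 bytes
    long_table = {}      # verification: 4-byte prefix -> patterns of 4+ bytes
    for pattern in patterns:
        if len(pattern) < 2:
            raise ValueError("this sketch assumes patterns of at least 2 bytes")
        initial.add(pattern[:2])
        if len(pattern) >= 4:
            long_filter.add(pattern[:4])
            long_table.setdefault(pattern[:4], []).append(pattern)
        else:
            short_table.setdefault(pattern[:2], []).append(pattern)
    return initial, long_filter, short_table, long_table


def filter_and_verify(stream, patterns):
    """Return (offset, pattern) matches using filter stages, then verification."""
    initial, long_filter, short_table, long_table = build_stages(patterns)
    matches = []
    for i in range(len(stream) - 1):
        window = stream[i:i + 2]
        if window not in initial:        # most positions stop here
            continue
        for pattern in short_table.get(window, ()):
            if stream.startswith(pattern, i):
                matches.append((i, pattern))
        key = stream[i:i + 4]
        if key in long_filter:           # second, length-specific stage
            for pattern in long_table[key]:
                if stream.startswith(pattern, i):
                    matches.append((i, pattern))
    return matches


patterns = [b"activate", b"admin.dll", b"backdoor", b"get.asp"]
matches = filter_and_verify(b"try_backdoor get.asp", patterns)
```

The window "ac" inside "backdoor" passes the first stage but is dropped by the second one, so no full comparison is spent on it.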
Algorithms by platform:
  CPU: Aho Corasick (state machine based), DFC (filtering based)
  GPU: PFAC [1] (state machine based), DFC (GPU) (filtering based), HYBRID
[1] “Accelerating Pattern Matching Using a Novel Parallel Algorithm on GPUs”, Lin et al., TOC 2013
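PFAC, as described by Lin et al., assigns one GPU thread to each input position and lets it walk a state machine without failure links until no transition exists. The sketch below simulates those threads with a sequential Python loop, so it shows the logic only and says nothing about GPU performance. All function names are hypothetical.

```python
def build_trie(patterns):
    """Build a pattern trie with no failure links."""
    children = [{}]    # children[state][byte] -> next state
    terminal = [None]  # pattern ending at each state, if any
    for pattern in patterns:
        state = 0
        for byte in pattern:
            if byte not in children[state]:
                children.append({})
                terminal.append(None)
                children[state][byte] = len(children) - 1
            state = children[state][byte]
        terminal[state] = pattern
    return children, terminal


def pfac_thread(stream, start, trie):
    """Work of one logical thread: match only patterns beginning at `start`."""
    children, terminal = trie
    state = 0
    found = []
    for i in range(start, len(stream)):
        state = children[state].get(stream[i])
        if state is None:  # no valid transition: this thread terminates
            break
        if terminal[state] is not None:
            found.append((start, terminal[state]))
    return found


def pfac_match(stream, patterns):
    """Run one logical thread per input byte (sequentially in this sketch)."""
    trie = build_trie(patterns)
    matches = []
    for start in range(len(stream)):
        matches.extend(pfac_thread(stream, start, trie))
    return matches


matches = pfac_match(b"ushers", [b"he", b"she", b"his", b"hers"])
```

Most threads stop after one or two bytes, and no thread depends on another, which is what makes the approach a natural fit for a GPU.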
More in the paper…
CPU: 4 ARM big.LITTLE
GPU: ARM Mali-T628 (6 shader cores)
Memory: 2GB RAM
Sensors: On board energy sensors
  - 3 publicly available traffic traces
  - 1 randomly generated data set
  - 2183 patterns (from Snort)
  - 5000 patterns (emergingthreats.net)
• Goal of the evaluation:
  1. How fast we can process the input (execution time)
  2. How much energy we spend on processing (energy consumption)
  3. Effect of datasets and number of patterns
  4. Influence the design of new algorithms
• Versions:
  - CPU: Aho-Corasick, DFC
  - GPU: PFAC, DFC on GPU (w/wo vectorization), HYBRID (w/wo vectorization)
Figure labels: CPU Versions, GPU Versions (Vect), CPU->GPU, Post-processing
(Post-processing = Output which and how many patterns matched, on the CPU)
Figure panels: 2183 patterns, 5000 patterns
• Bigger Filter = Slower access time (green trend, left y-axis)
• Higher hit ratio -> Less verification (red trend, right y-axis)
• Conclusions
  - New hardware features (embedded GPUs) can alleviate the bottleneck of pattern matching
  - Architecture characteristics important for high performance and low energy consumption
  - Possible to design new algorithms tailored to the hardware
• Future Work
  - Overlap CPU/GPU execution (heterogeneous design)
  - More algorithms and devices (e.g. Nvidia’s Jetson Nano)
  - Integrate with existing systems (e.g. Snort)
• Code available online
more than 70% (includes pattern matching)
underutilized (e.g. vector instructions?)