Classification of URL Bitstreams using Bag of Bytes



  1. Classification of URL Bitstreams using Bag of Bytes
Keiichi Shima & Hiroshi Abe (IIJ Innovation Institute, Inc.), Daisuke Miyamoto (Nara Institute of Science and Technology), Tomohiro Ishihara, Kazuya Okada & Yuji Sekiya (The University of Tokyo), Yusuke Doi & Hirochika Asai (Preferred Networks, inc.)
Network Intelligence 2018, 2018-02-19
This work was supported by JST CREST Grant Number JPMJCR1783, Japan.

  2. Outstanding AI works • In recent years, AI, and more specifically Deep Learning (DL), has been getting notable attention • Especially in media recognition fields, such as image and voice recognition • Some researchers are also trying to apply DL in different fields (e.g. factory robots, games) • Back to our work: are we getting a benefit from AI technologies?

  3. Difficulties • DL (and Machine Learning (ML) in general) requires information to be converted into vectors • Such a vector is called a feature vector • Designing the model of a feature vector requires deep knowledge of the target information domain

  4. Why is DL so hot? • Because recent DL applications don't require features to be extracted manually • A neural network learns which parts of the information are important from a large number of examples • For example, we can just feed the binary photo data into a neural network and that's it • Well, it is not quite that simple, anyway :)

  5. What we try to achieve • We investigate whether the approach used for image recognition can be applied to network information • Just feed (almost) raw data and let the machine extract features • No need to acquire deep domain-specific knowledge before analyzing

  6. Back to URLs • Phishing is one of the major techniques used to steal personal information • 1,220,523 attacks were reported in 2016 (*1) • Several services (products) exist to defend against them • URL whitelisting • Content inspection (*1) Anti-Phishing WG report: http://docs.apwg.org/reports/apwg_trends_report_q4_2016.pdf

  7. URL features? • Challenges • Are there any hidden features in the URL strings used for phishing sites? • Is it possible to distinguish “white” URLs from “black” URLs just by looking at the URL strings? • We try to vectorize URLs for use as input to ML methods without any domain-specific knowledge

  8. URL Classification with a Stupid URL Vectorizer
We need vectors: to utilize ML/DL techniques, we need to encode target entities into vectors. OK then, how can we encode URLs into vectors?
How to vectorize? We invented a stupidly simple method to vectorize a URL, as shown below:
• Split characters
• Convert the URL into HEX values
• Extract 8-bit values by shifting 4 bits at a time over the HEX values
• Count the number of unique values for the host part and the path part respectively (bag of features)
Example: www.iij.ad.jp/index.html
→ w w w . i i j . a d . j p / i n d e x . h t m l
→ 7777772E69696A2E61642E6A702F696E6465782E68746D6C
→ 77,77,77,77,77,72,2E,E6,69,96,69,96,6A,A2,2E,E6,61,16,64,42,2E,E6,6A,A7,70,02,2F,F6,69,96,6E,E6,64,46,65,57,78,82,2E,E6,68,87,74,46,6D,D6,6C
Classification using URL2CSV and SVM: we tried to classify 25,000 “white URLs” captured at the WIDE project and 26,000 “black URLs” provided by phishtank.com. The result shows that the vector trends of white URLs and black URLs are quite different and distinguishable with high accuracy.
Keiichi SHIMA (IIJ Innovation Institute / WIDE Muscle Learning Team)
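The 4-bit sliding-window extraction above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' code; `byte_pairs` is a hypothetical helper name.

```python
def byte_pairs(s: str) -> list[str]:
    """Hex-encode a string, then slide an 8-bit (two hex digit) window
    over the nibble stream with a 4-bit stride (one hex digit at a time)."""
    nibbles = s.encode("ascii").hex().upper()
    return [nibbles[i:i + 2] for i in range(len(nibbles) - 1)]

# The host part of the slide's example URL:
pairs = byte_pairs("www.iij.ad.jp")
# First pairs: 77,77,77,77,77,72,2E,...  (26 nibbles yield 25 pairs)
```

Note that the stride of one nibble makes consecutive pairs overlap, so byte boundaries inside the string do not matter.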

  9. How to vectorize?
www.iij.ad.jp (host part): 16×1, 2E×3, 42×1, 61×1, 64×1, 69×2, 6A×2, 70×1, 72×1, 77×5, 96×2, A2×1, A7×1, E6×3 → 256-dimensional sparse vector
index.html (path part): 2E×1, 46×2, 57×1, 64×1, 65×1, 68×1, 69×1, 6C×1, 6D×1, 6E×1, 74×1, 78×1, 82×1, 87×1, 96×1, D6×1, E6×2 → 256-dimensional sparse vector
The two vectors are concatenated into a 512-dimensional sparse vector.
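Putting the two slides together, the whole bag-of-bytes vectorizer fits in a short function. A minimal sketch, assuming the host and path parts are counted separately and the vectors are concatenated as described above; `url2vec` is a hypothetical name, not from the paper.

```python
from urllib.parse import urlsplit

def url2vec(url: str) -> list[int]:
    """Bag-of-bytes sketch: count the 8-bit values taken at 4-bit offsets
    from the host part and the path part, then concatenate the two
    256-dimensional count vectors into one 512-dimensional vector."""
    parts = urlsplit(url if "://" in url else "//" + url)
    vec = [0] * 512
    for offset, text in ((0, parts.hostname or ""),
                         (256, parts.path.lstrip("/"))):
        nibbles = text.encode("ascii").hex()
        for i in range(len(nibbles) - 1):
            vec[offset + int(nibbles[i:i + 2], 16)] += 1
    return vec

v = url2vec("www.iij.ad.jp/index.html")
# v[0x77] == 5 (the "77" pairs in the host part), v[0x2E] == 3
```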

  10. Neural network topology
• Input: a 512-dimensional vector generated from a URL string (batch shape (100, 512), float32)
• Linear mapping to 256 nodes (W: (256, 512), b: (256), float32) → (100, 256) → ReLU → Dropout
• Linear mapping to 256 nodes (W: (256, 256), b: (256), float32) → (100, 256) → ReLU → Dropout
• Reduction to 2 nodes (W: (2, 256), b: (2), float32) → (100, 2)
• Loss calculation: SoftmaxCrossEntropy against (100,) int32 labels
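The forward pass of this 512-256-256-2 MLP can be sketched with plain NumPy. The random weight initialization is purely illustrative (the paper trains these parameters; the actual framework and initializer are not stated on the slide), and inverted dropout is assumed for the Dropout layers.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(n_in: int, n_out: int):
    # Hypothetical small random weights; in the paper these are learned.
    return (rng.normal(0, 0.05, (n_out, n_in)).astype(np.float32),
            np.zeros(n_out, dtype=np.float32))

W1, b1 = layer(512, 256)
W2, b2 = layer(256, 256)
W3, b3 = layer(256, 2)

def forward(x: np.ndarray, train: bool = False, p: float = 0.5) -> np.ndarray:
    """Linear -> ReLU -> Dropout, twice, then a linear reduction to 2 logits.
    Dropout (inverted-scaling variant) is active only at training time."""
    relu = lambda z: np.maximum(z, 0)
    h = relu(x @ W1.T + b1)
    if train:
        h *= rng.binomial(1, 1 - p, h.shape) / (1 - p)
    h = relu(h @ W2.T + b2)
    if train:
        h *= rng.binomial(1, 1 - p, h.shape) / (1 - p)
    return h @ W3.T + b3  # 2 logits, fed to softmax cross-entropy loss

logits = forward(np.zeros((100, 512), dtype=np.float32))  # batch of 100
```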

  11. Classify using the neural network
TABLE I. URL DATASETS FOR TRAINING
• Blacklist 1 (26,722 URLs): phishing site URLs reported at PhishTank.com before 2017-04-25. This list is used as a blacklist for learning and testing in conjunction with Whitelist 1.
• Blacklist 2 (68,172 URLs): phishing site URLs reported at PhishTank.com before 2017-10-03. This list is used to cleanse the target access log captured at the anonymous research organization X.
• Whitelist 1 (26,722 URLs): a sampled list of the URL access log captured at the anonymous research organization X on 2017-04-25, excluding the entries listed in Blacklist 2. This list is used for learning and testing in conjunction with Blacklist 1.

  12. Classify using the neural network
• Graylist: 142,749,999 URLs (captured on 2017-04-25) → exclude Blacklist 2 (68,172 URLs, before 2017-10-03) → sample → Whitelist (26,722 URLs)
• Blacklist 1: 26,722 URLs (before 2017-04-25) → Blacklist (26,722 URLs)
• Use 10% of the URLs for training, and use the rest for validation
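The 10%/90% train/validation split can be sketched as follows. This is a generic shuffled split under an assumed fixed seed; the slide does not say how the split was actually drawn.

```python
import random

def split(urls, train_frac: float = 0.1, seed: int = 42):
    """Shuffle the combined white+black URL list, then take train_frac
    for training and leave the rest for validation."""
    urls = list(urls)
    random.Random(seed).shuffle(urls)  # deterministic for reproducibility
    k = int(len(urls) * train_frac)
    return urls[:k], urls[k:]

# 26,722 white + 26,722 black URLs (placeholder items stand in for URLs)
train, valid = split(range(26722 * 2))
```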

  13. Accuracy and Loss
TABLE II. RESULTS OF ACCURACY AND TRAINING TIME USING WHITELIST 1 AND BLACKLIST 1 IN TABLE I
• Our method: Adam 94.18% (32 s), AdaDelta 93.54% (31 s), SGD 88.29% (31 s)
• eXpose [6]: Adam 90.52% (119 s), AdaDelta 91.31% (119 s), SGD 77.99% (116 s)
• Our approach achieved better accuracy than the eXpose (*1) work, which takes a similar approach but uses a more complex deep neural network
(*1) J. Saxe and K. Berlin, “eXpose: A character-level convolutional neural network with embeddings for detecting malicious URLs, file paths and registry keys,” CoRR, vol. abs/1702.08568, February 2017.

  14. Prediction results
TABLE IV. PREDICTION RESULTS OF THE DATASET SHOWN IN TABLE III USING THE NEURAL NETWORK MODEL TRAINED WITH THE DATASET SHOWN IN TABLE I
• Our method: Accuracy 95.17%, Precision 93.76%, Recall 96.78%, F-measure 0.9525
• eXpose: Accuracy 92.99%, Precision 93.00%, Recall 92.99%, F-measure 0.9299
• We predicted the future dataset of 2017-05-25 using the model trained with the dataset of 2017-04-25
• Our approach achieved 95% accuracy, which was again better than eXpose
Fig. 5. ROC curves and AUC values measured with the prediction datasets shown in Table III using our model and the eXpose model
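The F-measure column is consistent with the precision and recall columns; the standard F1 formula (harmonic mean of precision and recall) reproduces it:

```python
def f_measure(precision: float, recall: float) -> float:
    """F1 score: harmonic mean of precision and recall (here in percent)."""
    return 2 * precision * recall / (precision + recall)

ours = f_measure(93.76, 96.78) / 100    # ~0.9525, matching Table IV
expose = f_measure(93.00, 92.99) / 100  # ~0.9299, matching Table IV
```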

  15. Discussion • Difficulties in creating proper datasets • It is almost impossible to make a pure white dataset • Difficulties in comparison • In most cases, the dataset used for evaluation is not disclosed (as in our case) • We need to make efforts to build shared datasets
