Date: 2017/05/09
Author: Mayank Kejriwal, Pedro Szekely
Source: ACM WWW '17
Advisor: Jia-ling Koh
Speaker: Yi-hui Lee
Information Extraction in Illicit Web Domains
Outline: Introduction, Approach, Experiment, Conclusion
ages (of human trafficking victims), locations, prices of services, posting dates
extraction approach:
[Figure: pipeline from the input (Dark Web) through four steps: preprocessing, applying recognizers, word representation learning, and a supervised classifier]
Readability Text Extractor (RTE):
NLTK:
GeoNames-Cities
GeoNames-States
RegEx-Ages: use regular expressions
Dictionary-Names: person names
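A RegEx-Ages style recognizer could look roughly like the sketch below. The pattern, the function name `recognize_age_candidates`, and the sample text are illustrative assumptions, not the authors' actual expressions. The idea is only that such a recognizer favors recall: it marks every plausible candidate and leaves the decision about which candidates are real ages to a later step.

```python
import re

# Hypothetical high-recall pattern: any standalone 1-2 digit number
# is treated as a candidate age. This is an assumption for illustration.
AGE_CANDIDATE = re.compile(r"\b\d{1,2}\b")


def recognize_age_candidates(text):
    """Return (start, end, token) for every candidate age found in text."""
    return [(m.start(), m.end(), m.group()) for m in AGE_CANDIDATE.finditer(text)]


sample = "Hi, I am 23 years old, call 555-0100 after 9 pm"
candidates = recognize_age_candidates(sample)
for start, end, token in candidates:
    print(start, end, token)
```

Here both "23" and "9" are returned as candidates even though only the first is an age, which is the kind of false positive a high-recall recognizer is expected to produce.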
D1: The cow is in the farm.
D2: I jumped over the farm.
D3: I saw a cow in the farm.

         D1  D2  D3
The       1   0   0
cow       1   0   1
jumped    0   1   0
the       1   1   1
moon      0   0   0
farm      1   1   1
sim(cow, farm) = 2/(sqrt(2)*sqrt(3)) = 0.82
sim(cow, moon) = 0
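The similarity computation can be checked with a short sketch. It assumes that sim is cosine similarity over binary term-document vectors (dot product divided by the product of the vector norms) and that tokens are case-sensitive, as the separate "The" and "the" rows suggest. Under those assumptions sim(cow, farm) comes out near 0.82. The helper names are hypothetical.

```python
import math

docs = {
    "D1": "The cow is in the farm.",
    "D2": "I jumped over the farm.",
    "D3": "I saw a cow in the farm.",
}


def tokenize(text):
    # Assumption: strip the final period and split on whitespace, case-sensitive.
    return text.replace(".", "").split()


def term_vector(term):
    """Binary vector: 1 if the term occurs in D1, D2, D3 respectively."""
    return [1 if term in tokenize(docs[name]) else 0 for name in sorted(docs)]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


print(round(cosine(term_vector("cow"), term_vector("farm")), 2))
print(cosine(term_vector("cow"), term_vector("moon")))
```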
Random Indexing [27]
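As a rough illustration of the general random indexing idea (not the exact configuration used in the paper), each word can be given a sparse random index vector of +1/-1 entries, and a word's context vector is then the sum of the index vectors of its neighbors within a window. The dimensionality, the number of nonzero entries, the window size, and all function names below are assumptions chosen to keep the example small.

```python
import random
from collections import defaultdict

DIM = 20      # assumed dimensionality (real systems use far more)
NONZERO = 4   # assumed number of +1/-1 entries per index vector
WINDOW = 2    # assumed context window on each side


def index_vector(rng):
    """Sparse ternary vector: half of the nonzero entries +1, half -1."""
    vec = [0] * DIM
    for i, pos in enumerate(rng.sample(range(DIM), NONZERO)):
        vec[pos] = 1 if i % 2 == 0 else -1
    return vec


def random_indexing(sentences, seed=0):
    """Return (index vectors, context vectors) for tokenized sentences."""
    rng = random.Random(seed)
    index = {}
    for tokens in sentences:
        for tok in tokens:
            if tok not in index:
                index[tok] = index_vector(rng)
    context = defaultdict(lambda: [0] * DIM)
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            lo, hi = max(0, i - WINDOW), min(len(tokens), i + WINDOW + 1)
            for j in range(lo, hi):
                if j == i:
                    continue
                # Add the neighbor's index vector to this word's context vector.
                for k, val in enumerate(index[tokens[j]]):
                    context[tok][k] += val
    return index, dict(context)


sentences = [
    "The cow is in the farm".split(),
    "I jumped over the farm".split(),
    "I saw a cow in the farm".split(),
]
index, context = random_indexing(sentences)
print(context["cow"])
```

Because the index vectors are sparse and nearly orthogonal, the context vectors stay at a fixed, small dimensionality regardless of vocabulary size.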
Aggregate vectors -> l2-normalization
I saw a cow jumped over the farm
saw = [1, 0, 0, …, 1, 0]
a = [1, 1, 1, …, 1, 1]
cow = [1, 0, 1, …, 0, 0]
jumped = [0, 0, 0, …, 1, 1]
aggregate = [1, 0, 0, …, 1, 0, 1, 1, 1, …, 1, 1, ……, 0, 1]
l2-normalization = [0.0001, 0, 0, …, 0.0001, 0, 0.0001, 0.0001, 0.0001, …, 0.0001, 0.0001, ……, 0.0000, 1]
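The aggregation and normalization step can be sketched as follows. The slide's example reads as if the word vectors are placed one after another, so this sketch assumes aggregation means concatenation; summing the vectors element-wise would be another possible reading. The toy three-dimensional vectors and the function names are hypothetical.

```python
import math


def aggregate(vectors):
    """Concatenate the context word vectors in order (assumed reading)."""
    combined = []
    for vec in vectors:
        combined.extend(vec)
    return combined


def l2_normalize(vec):
    """Scale vec to unit Euclidean length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return list(vec)
    return [x / norm for x in vec]


# Toy word vectors standing in for the context words of a candidate.
context_vectors = [[1, 0, 1], [1, 1, 1], [0, 0, 1]]
feature = l2_normalize(aggregate(context_vectors))
print(feature)
```

After normalization the feature vector has unit length, so candidates with longer or denser contexts do not dominate the classifier input purely through magnitude.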
Classifier: Random forest
Research conducted in the DARPA MEMEX program
Stanford Named Entity Recognition system (NER)
An Information Extraction approach that is suitable for illicit Web domains.
The approach relies on deriving word representations from an initial corpus, and on training a supervised contextual classifier using external high-recall recognizers and a handful of manually verified annotations.
End Human Trafficking hackathon organized by the office of the District Attorney of New York