Information Extraction in Illicit Web Domains (2017/05/09)



SLIDE 1

Information Extraction in Illicit Web Domains

Date: 2017/05/09
Author: Mayank Kejriwal, Pedro Szekely
Source: ACM WWW '17
Advisor: Jia-ling Koh
Speaker: Yi-hui Lee

SLIDE 2

Outline

  • Introduction
  • Approach
  • Experiment
  • Conclusion

SLIDE 3

Introduction

  • Information Extraction:

SLIDE 4

Introduction(cont.)

  • Information Extraction on the Dark Web (human trafficking):

    • ages (of human trafficking victims)
    • locations
    • prices of services
    • posting dates

SLIDE 5

Introduction(cont.)

  • A high-level overview of the proposed information extraction approach:

input: Dark Web
step 1: preprocessing
step 2: apply recognizers
step 3: word representation learning
step 4: supervised classifier
output: annotated corpus

SLIDE 6

Outline

  • Introduction
  • Approach
  • Experiment
  • Conclusion

SLIDE 7

Approach

  • Step 1. Preprocessing:

Readability Text Extractor (RTE) -> Mercury Web Parser

NLTK:

  • RTE string output -> sentence tokenize -> word tokenize -> list of tokens
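
The tokenization chain in Step 1 can be sketched roughly as follows. The slide uses NLTK for this; the sketch below swaps in two simple regular-expression tokenizers from the Python standard library so that it runs without any downloads. The function names `sentence_tokenize`, `word_tokenize` and `preprocess` are hypothetical, and the splitting rules are a simplification, not NLTK's behavior.

```python
import re


def sentence_tokenize(text):
    # Naive stand-in for a sentence tokenizer: split after '.', '!' or '?'
    # when followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def word_tokenize(sentence):
    # Naive stand-in for a word tokenizer: runs of word characters
    # (optionally with one internal apostrophe) or single punctuation marks.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)


def preprocess(rte_output):
    # RTE string output -> sentences -> words -> flat list of tokens.
    tokens = []
    for sentence in sentence_tokenize(rte_output):
        tokens.extend(word_tokenize(sentence))
    return tokens


if __name__ == "__main__":
    print(preprocess("The cow is in the farm. I jumped over the farm."))
```

The flat token list is what the later steps (recognizers, word representations) would consume.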

SLIDE 8

Approach(cont.)

  • Step 2. Apply recognizers:

    • GeoNames-Cities
    • GeoNames-States
    • RegEx-Ages: use regular expressions
    • Dictionary-Names: person names
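
A minimal sketch of how dictionary and regular-expression recognizers like these might mark candidate annotations. The tiny `CITIES` and `NAMES` sets, the two-digit age pattern and the function name `apply_recognizers` are all assumptions made for illustration; the slides only name GeoNames and a person-name dictionary as the sources.

```python
import re

# Hypothetical toy dictionaries standing in for GeoNames and a name list.
CITIES = {"charlotte", "seattle"}
NAMES = {"charlotte", "amber"}

# Assumed age pattern: any two-digit number. Deliberately permissive.
AGE_PATTERN = re.compile(r"\d{2}")


def apply_recognizers(tokens):
    # Return (position, token, label) candidates. A token may receive
    # several labels; the recognizers favor recall over precision.
    candidates = []
    for i, tok in enumerate(tokens):
        low = tok.lower()
        if AGE_PATTERN.fullmatch(tok):
            candidates.append((i, tok, "age"))
        if low in CITIES:
            candidates.append((i, tok, "city"))
        if low in NAMES:
            candidates.append((i, tok, "name"))
    return candidates


if __name__ == "__main__":
    tokens = ["Charlotte", "23", "in", "Charlotte", "60", "roses"]
    for candidate in apply_recognizers(tokens):
        print(candidate)
```

In this toy input "Charlotte" is flagged as both a city and a name, and "60" is flagged as an age although it reads like a price. Ambiguous candidates of this kind are what the contextual classifier in Step 4 is meant to sort out.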

SLIDE 9

Approach(cont.)

  • Step 3. Word Representation learning:

D1: The cow is in the farm.
D2: I jumped over the farm.
D3: I saw a cow in the farm.

word     D1  D2  D3
The       1   0   0
cow       1   0   1
jumped    0   1   0
over      0   1   0
the       1   1   1
moon      0   0   0
farm      1   1   1

sim(cow, farm) = 2 / (sqrt(2) * sqrt(3)) = 0.82 (cosine similarity)
sim(cow, moon) = 0
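
The similarity on this slide can be checked with a short sketch, assuming it is meant as cosine similarity: the dot product of two rows divided by the product of their Euclidean norms. The row vectors are taken from the term-document matrix for D1 to D3; the function name and the choice to return 0 for an all-zero row are assumptions.

```python
import math

# Rows of the term-document matrix (columns are D1, D2, D3).
TERM_DOC = {
    "cow": [1, 0, 1],
    "farm": [1, 1, 1],
    "moon": [0, 0, 0],
}


def cosine_similarity(u, v):
    # Dot product divided by the product of the two Euclidean norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumed convention: a word that never occurs is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    print(round(cosine_similarity(TERM_DOC["cow"], TERM_DOC["farm"]), 2))
    print(cosine_similarity(TERM_DOC["cow"], TERM_DOC["moon"]))
```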

SLIDE 10

Approach(cont.)

  • Step 3. Word Representation learning:

Random Indexing [27]

  • each element of a word's vector is randomly assigned -1, 0, or 1
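
A rough sketch of Random Indexing in its common textbook form, which may differ in detail from the variant cited as [27]. Each word first gets a sparse random index vector with entries in -1, 0, 1, and a word's context vector is then the sum of the index vectors of its neighbors. The parameters `dim`, `nonzero`, `window` and `seed`, and both function names, are assumptions for illustration.

```python
import random


def make_index_vector(dim, nonzero, rng):
    # Mostly zeros, with `nonzero` randomly chosen positions set to +1 or -1.
    vec = [0] * dim
    for pos in rng.sample(range(dim), nonzero):
        vec[pos] = rng.choice((-1, 1))
    return vec


def random_indexing(sentences, dim=20, nonzero=4, window=2, seed=0):
    rng = random.Random(seed)
    index, context = {}, {}
    # Pass 1: assign every distinct token a random index vector.
    for tokens in sentences:
        for tok in tokens:
            if tok not in index:
                index[tok] = make_index_vector(dim, nonzero, rng)
                context[tok] = [0] * dim
    # Pass 2: add the index vectors of neighbors inside the window.
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j == i:
                    continue
                neighbor = index[tokens[j]]
                ctx = context[tok]
                for k in range(dim):
                    ctx[k] += neighbor[k]
    return index, context


if __name__ == "__main__":
    index, context = random_indexing([["I", "saw", "a", "cow", "in", "the", "farm"]])
    print(index["cow"])
    print(context["cow"])
```

Words that appear in similar contexts end up with similar context vectors, without ever building the full term-document matrix.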
SLIDE 11

Approach(cont.)

  • Step 4. Supervised Contextual Classifier:

Aggregate vectors -> l2-normalization

I saw a cow jumped over the farm

saw = [1, 0, 0, …, 1, 0]
a = [1, 1, 1, …, 1, 1]
cow = [1, 0, 1, …, 0, 0]
jumped = [0, 0, 0, …, 1, 1]
over = [0, 0, 1, …, 0, 1]

aggregate = [1, 0, 0, …, 1, 0, 1, 1, 1, …, 1, 1, ……, 0, 1]
l2-normalization = [0.0001, 0, 0, …, 0.0001, 0, 0.0001, 0.0001, 0.0001, …, 0.0001, 0.0001, ……, 0, 0.0001]
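
A small sketch of the aggregate-then-normalize step. It reads the slide's example as concatenating the context words' vectors in order, which is an assumption; summing the vectors element-wise would be another plausible reading. The function names and the short toy vectors are hypothetical.

```python
import math


def aggregate(vectors):
    # Assumed reading of the slide: concatenate the word vectors in order.
    out = []
    for vec in vectors:
        out.extend(vec)
    return out


def l2_normalize(vec):
    # Scale the vector to unit Euclidean length; leave all-zero vectors as is.
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return list(vec)
    return [x / norm for x in vec]


if __name__ == "__main__":
    # Toy 3-dimensional vectors for two context words.
    saw = [1, 0, 0]
    cow = [1, 0, 1]
    features = l2_normalize(aggregate([saw, cow]))
    print(features)
```

After normalization every nonzero entry of a 0/1 aggregate has the same value, and the resulting feature vector has length 1 regardless of how many context words were aggregated.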

SLIDE 12

Approach(cont.)

  • Step 4. Supervised Contextual Classifier:

Classifier: Random forest

SLIDE 13

Outline

  • Introduction
  • Approach
  • Experiment
  • Conclusion

SLIDE 14

Experiment

  • Datasets and Ground-truths:

Research conducted in the DARPA MEMEX program

  • Ground-truths:
SLIDE 15

Experiment(cont.)

  • Baselines:

Stanford Named Entity Recognition system (NER)

SLIDE 16

Experiment(cont.)

  • Evaluation:

SLIDE 17

Experiment(cont.)

  • Feature selection:

SLIDE 18

Outline

  • Introduction
  • Approach
  • Experiment
  • Conclusion

SLIDE 19

Conclusion

  • We presented a lightweight, feature-agnostic Information Extraction approach that is suitable for illicit Web domains.

  • Our approach relies on unsupervised derivation of word representations from an initial corpus, and the training of a supervised contextual classifier using external high-recall recognizers and a handful of manually verified annotations.

  • Real-world settings:

End Human Trafficking hackathon organized by the office of the District Attorney of New York