
Information Extraction in Illicit Web Domains

Mayank Kejriwal
Information Sciences Institute, USC Viterbi School of Engineering
kejriwal@isi.edu

Pedro Szekely
Information Sciences Institute, USC Viterbi School of Engineering
pszekely@isi.edu

ABSTRACT

Extracting useful entities and attribute values from illicit domains such as human trafficking is a challenging problem with the potential for widespread social impact. Such domains employ atypical language models, have ‘long tails’ and suffer from the problem of concept drift. In this paper, we propose a lightweight, feature-agnostic Information Extraction (IE) paradigm specifically designed for such domains. Our approach uses raw, unlabeled text from an initial corpus, and a few (12-120) seed annotations per domain-specific attribute, to learn robust IE models for unobserved pages and websites. Empirically, we demonstrate that our approach can outperform feature-centric Conditional Random Field baselines by over 18% F-Measure on five annotated sets of real-world human trafficking datasets in both low-supervision and high-supervision settings. We also show that our approach is demonstrably robust to concept drift, and can be efficiently bootstrapped even in a serial computing environment.

Keywords

Information Extraction; Named Entity Recognition; Illicit Domains; Feature-agnostic; Distributional Semantics

1. INTRODUCTION

Building knowledge graphs (KG) over Web corpora is an important problem that has galvanized effort from multiple communities over two decades [12], [29]. Automated knowledge graph construction from Web resources involves several different phases. The first phase involves domain discovery, which constitutes identification of sources, followed by crawling and scraping of those sources [7]. A contemporaneous ontology engineering phase is the identification and design of key classes and properties in the domain of interest (the domain ontology) [33].

Once a set of (typically unstructured) data sources has been identified, an Information Extraction (IE) system needs to extract structured data from each page in the corpus [11], [14], [21], [15]. In IE systems based on statistical learning, sequence labeling models like Conditional Random Fields (CRFs) can be trained and used for tagging the scraped text from each data source with terms from the domain ontology [24], [15]. With enough data and computational power, deep neural networks can also be used for a range of collective natural language tasks, including chunking and extraction of named entities and relationships [10].

While IE has been well-studied both for cross-domain Web sources (e.g. Wikipedia) and for traditional domains like biomedicine [32], [20], it is less well-studied (Section 2) for dynamic domains that undergo frequent changes in content and structure. Such domains include news feeds, social media, advertising, and online marketplaces, but also illicit domains like human trafficking. Automatically constructing knowledge graphs containing important information like ages (of human trafficking victims), locations, prices of services and posting dates over such domains could have widespread social impact, since law enforcement and federal agencies could query such graphs to glean rapid insights [28].

Illicit domains pose some formidable challenges for traditional IE systems, including deliberate information obfuscation, non-random misspellings of common words, high occurrences of out-of-vocabulary and uncommon words, frequent (and non-random) use of Unicode characters, sparse content and heterogeneous website structure, to name only a few [28], [1], [13]. While some of these characteristics are shared by more traditional domains like chat logs and Twitter, both information obfuscation and extreme content heterogeneity are unique to illicit domains. While this paper only considers the human trafficking domain, similar kinds of problems are prevalent in other illicit domains that have a sizable Web (including Dark Web) footprint, including terrorist activity, and sales of illegal weapons and counterfeit goods [9].

As real-world illustrative examples, consider the text fragments ‘Hey gentleman im neWYOrk and i’m looking for generous...’ and ‘AVAILABLE NOW! ?? - (4 two 4) six 5 two - 0 9 three 1 - 21’. In the first instance, the correct extraction for a Name attribute is neWYOrk, while in the second instance, the correct extraction for an Age attribute is 21. It is not obvious what features should be engineered in a statistical learning-based IE system to achieve robust performance on such text. To compound the problem, wrapper induction systems from the Web IE literature cannot always be applied in such domains, as many important attributes can only be found in text descriptions, rather than in the templates that wrapper-based Web extractors traditionally rely on [21].


Constructing an IE system that is robust to these problems is an important first step in delivering structured knowledge bases to investigators and domain experts. In this paper, we study the problem of robust information extraction in dynamic, illicit domains with unstructured content that does not necessarily correspond to a typical natural language model, and that can vary tremendously between different Web domains, a problem denoted more generally as concept drift [31]. Illicit domains like human trafficking also tend to exhibit a ‘long tail’; hence, a comprehensive solution should not rely on information extractors being tailored to pages from a small set of Web domains.

There are two main technical challenges that such domains present to IE systems. First, as the brief examples above illustrate, feature engineering in such domains is difficult, mainly due to the atypical (and varying) representation of information. Second, investigators and domain experts require a lightweight system that can be quickly bootstrapped. Such a system must be able to generalize from few (≈10-150) manual annotations, but be incremental from an engineering perspective, especially since a given illicit Web page can quickly (i.e. within hours) become obsolete in the real world, and the search for leads and information is always ongoing. In effect, the system should be designed for streaming data.

We propose an information extraction approach that is able to address the challenges above, especially the variance between Web pages and the small training set per attribute, by combining two sequential techniques in a novel paradigm. The overall approach is illustrated in Figure 1. First, a high-recall recognizer, which could range from an exhaustive Linked Data source like GeoNames (e.g. for extracting locations) to a simple regular expression (e.g. for extracting ages), is applied to each page in the corpus to derive a set of candidate annotations for an attribute per page. In the second step, we train and apply a supervised feature-agnostic classification algorithm, based on learning word representations from random projections, to classify each candidate as correct/incorrect for its attribute.

Contributions We summarize our main contributions as follows: (1) We present a lightweight feature-agnostic information extraction system for a highly heterogeneous, illicit domain like human trafficking. Our approach is simple to implement, does not require extensive parameter tuning or infrastructure setup, and is incremental with respect to the data, which makes it suitable for deployment in streaming-corpus settings. (2) We show that the approach generalizes well even when only a small corpus is available after the initial domain-discovery phase, and is robust to the problem of concept drift encountered in large Web corpora. (3) We test our approach extensively on a real-world human trafficking corpus containing hundreds of thousands of Web pages and millions of unique words, many of which are rare and highly domain-specific. Evaluations show that our approach outperforms traditional Named Entity Recognition baselines that require manual feature engineering. Specific empirical highlights are provided below.

Empirical highlights Comparisons against CRF baselines based on the latest Stanford Named Entity Recognition system (including pre-trained models as well as new models that we trained on human trafficking data) show that, on average, across five ground-truth datasets, our approach outperforms the next best system on the recall metric by about 6%, and on the F1-Measure metric by almost 20% in low-supervision settings (30% training data), and by almost 20% on both metrics in high-supervision settings (70% training data). Concerning efficiency, in a serial environment, we are able to derive word representations on a 43 million word corpus in under an hour. Degradation in average F1-Measure score achieved by the system is less than 2% even when the underlying raw corpus expands by a factor of 18, showing that the approach is reasonably robust to concept drift.

Structure of the paper Section 2 describes some related work on Information Extraction. Section 3 provides details of key modules in our approach. Section 4 describes experimental evaluations, and Section 5 concludes the work.

2. RELATED WORK

Information Extraction (IE) is a well-studied research area both in the Natural Language Processing community and in the World Wide Web community; the reader is referred to the survey by Chang et al. for an accessible coverage of Web IE approaches [8]. In the NLP literature, IE problems have predominantly been studied as Named Entity Recognition and Relationship Extraction [15], [16]. The scope of Web IE has been broad in recent years, extending from wrappers to Open Information Extraction (OpenIE) [21], [3].

In the Semantic Web, domain-specific extraction of entities and properties is a fundamental aspect of constructing instance-rich knowledge bases (from unstructured corpora) that contribute to the Semantic Web vision and to ecosystems like Linked Open Data [4], [19]. A good example of such a system is Lodifier [2]. This work is along the same lines, in that we are interested in user-specified attributes and wish to construct a knowledge base (KB) with those attribute values using raw Web corpora. However, we are not aware of any IE work in the Semantic Web that has used word representations to accomplish this task, or that has otherwise outperformed state-of-the-art systems without manual feature engineering.

The work presented in this paper is structurally similar to the geolocation prediction system (from Twitter) by Han et al. and also to ADRMine, an adverse drug reaction (ADR) extraction system for social media [18], [26]. Unlike these works, our system is not optimized for specific attributes like locations and drug reactions, but generalizes to a range of attributes. Also, as mentioned earlier, illicit domains involve challenges not characteristic of social media, notably information obfuscation.

In recent years, state-of-the-art results have been achieved in a variety of NLP tasks using word representation methods like neural embeddings [25]. Unlike the problem covered in this paper, those papers typically assume an existing KB (e.g. Freebase), and attempt to infer additional facts in the KB using word representations. In contrast, we study the problem of constructing and populating a KB per domain-specific attribute from scratch with only a small set of initial annotations from crawled Web corpora.

The problem studied in this paper also has certain resemblances to OpenIE [3]. One assumption in OpenIE systems is that a given fact (codified, for example, as an RDF triple) is observed in multiple pages and contexts, which allows the system to learn new ‘extraction patterns’ and rank facts by confidence. In illicit domains, a ‘fact’ may only be observed once; furthermore, the arcane and high-variance language models employed in the domain make direct application of any extraction pattern-based approach problematic.


Figure 1: A high-level overview of the proposed information extraction approach

To the best of our knowledge, the specific problem of devising feature-agnostic, low-supervision IE approaches for illicit Web domains has not been studied in prior work.

3. APPROACH

Figure 1 illustrates the architecture of our approach. The input is a Web corpus containing relevant pages from the domain of interest, and high-recall recognizers (described in Section 3.3) typically adapted from freely available Web resources like Github and GeoNames. In keeping with the goals of this work, we do not assume that this initial corpus is static. That is, following an initial short set-up phase, more pages are expected to be added to the corpus in a streaming fashion. Given a set of pre-defined attributes (e.g. City, Name, Age) and around 10-100 manually verified annotations for each attribute, the goal is to learn an IE model that accurately extracts attribute values from each page in the corpus without relying on expert feature engineering. Importantly, while the pages are single-domain (e.g. human trafficking), they are multi-Web-domain, meaning that the system must handle not only pages from new websites as they are added to the corpus, but also concept drift in the new pages compared to the initial corpus.

3.1 Preprocessing

The first module in Figure 1 is an automated pre-processing algorithm that takes as input a streaming set of HTML pages. In real-world illicit domains, the key information of interest to investigators (e.g. names and ages) typically occurs either in the text or the title of the page, not the template of the website. Even when the information occasionally occurs in a template, it must be appropriately disambiguated to be useful (for example, ‘Virginia’ in South Africa vs. ‘Virginia’ in the US). Wrapper-based IE systems [21] are often inapplicable as a result. As a first step in building a more suitable IE model, we scrape the text from each HTML website by using a publicly available text extractor called the Readability Text Extractor (RTE; https://www.readability.com/developers/api). Although multiple tools are available for text extraction from HTML [17] (an informal comparison may be accessed at https://www.diffbot.com/benefits/comparison/), our early trials showed that RTE is particularly suitable for noisy Web domains, owing to its tuneability, robustness and support for developers. We tune RTE to achieve high recall, thus ensuring that the relevant text in the page is captured in the scraped text with high probability. Note that, because of the varied structure of websites, such a setting also introduces noise in the scraped text (e.g. wayward HTML tags). Furthermore, unlike natural language documents, scraped text can contain many irrelevant numbers, Unicode and punctuation characters, and may not be regular. Because of the presence of numerous tab and newline markers, there is no obvious natural language sentence structure in the scraped text; we also found sentence ambiguity in the actual text displayed on the browser-rendered website (in a few human trafficking sample pages), due to the language models employed in these pages. In the most general case, we found that RTE returned a set of strings, with each string corresponding to a set of sentences.

To serialize the scraped text as a list of tokens, we use the word and sentence tokenizers from the NLTK package on each RTE string output [5]. We apply the sentence tokenizer first, and to each sentence it returns (which often does not correspond to an actual sentence, due to rampant use of extraneous punctuation characters), we apply the standard NLTK word tokenizer. The final output of this process is a list of tokens. In the rest of this section, this list of tokens is assumed as representing the HTML page from which the requisite attribute values need to be extracted.

3.2 Deriving Word Representations

In principle, given some annotated data, a sequence labeling model like a Conditional Random Field (CRF) can be trained and applied on each block of scraped text to extract values for each attribute [24], [15].


In practice, as we empirically demonstrate in Section 4, CRFs prove to be problematic for illicit domains. First, the size of the training data available for each CRF is relatively small, and because of the nature of illicit domains, methods like distant supervision or crowdsourcing cannot be used in an obvious, timely manner to elicit annotations from users. A second problem with CRFs, and other traditional machine learning models, is the careful feature engineering that is required for good performance. With small amounts of training data, good features are essential for generalization. In the case of illicit domains, it is not always clear what features are appropriate for a given attribute. Even common features like capitalization can be misleading, as there are many capitalized words in the text that are not of interest (and vice versa).

To alleviate feature engineering and manual annotation effort, we leverage the entire raw corpus in our model learning phase, rather than just the pages that have been annotated. Specifically, we use an unsupervised algorithm to represent each word in the corpus in a low-dimensional vector space. Several algorithms exist in the literature for deriving such representations, including neural embedding algorithms such as Word2vec [25] and the algorithm by Bollegala et al. [6], as well as simpler alternatives [27].

Given the dynamic nature of streaming illicit-domain data, and the numerous word representation learning algorithms in the literature, we adapted the random indexing (RI) algorithm for deriving contextual word representations [27]. Random indexing methods mathematically rely on the Johnson-Lindenstrauss Lemma, which states that if points in a vector space are of sufficiently high dimension, then they may be projected into a suitable lower-dimensional space in a way which approximately preserves the distances between the points.

The original random indexing algorithm was designed for incremental dimensionality reduction and text mining applications. We adapt this algorithm for learning word representations in illicit domains. Before describing these adaptations, we define some key concepts below.

Definition 1. Given parameters $d \in \mathbb{Z}^+$ and $r \in [0, 1]$, a context vector is defined as a $d$-dimensional vector, of which exactly $\lfloor dr \rfloor$ elements are randomly set to $+1$, exactly $\lfloor dr \rfloor$ elements are randomly set to $-1$, and the remaining $d - 2\lfloor dr \rfloor$ elements are set to $0$.

We denote the parameters $d$ and $r$ in the definition above as the dimension and sparsity ratio parameters respectively.

Intuitively, a context vector is defined for every atomic unit in the corpus. Let us denote the universe of atomic units as $U$, assumed to be a partially observed countably infinite set. In the current scenario, every unigram (a single ‘token’) in the dataset is considered an atomic unit. Extending the definition to also include higher-order n-grams is straightforward, but was found to be unnecessary in our early empirical investigations. The universe is only partially observed because of the incompleteness (i.e. streaming, dynamic nature) of the initial corpus.

Figure 2: An example illustrating the naive Random Indexing algorithm with unigram atomic units and a (2, 2)-context window as context

The actual vector space representation of an atomic unit is derived by defining an appropriate context for the unit. Formally, a context is an abstract notion that is used for assigning distributional semantics to the atomic unit. The distributional semantics hypothesis (also called Firth’s axiom) states that the semantics of an atomic unit (e.g. a word) is defined by the contexts in which it occurs [22]. In this paper, we only consider short contexts appropriate for noisy streaming data. In this vein, we define the notion of a (u, v)-context window below:

Definition 2. Given a list $t$ of atomic units and an integer position $0 < i \le |t|$, a $(u, v)$-context window is defined by the set $S \setminus \{t[i]\}$, where $S$ is the set of atomic units inclusively spanning positions $\max(i - u, 1)$ and $\min(i + v, |t|)$.

Using just these two definitions, a naive version of the RI algorithm is illustrated in Figure 2 for the sentence ‘the cow jumped over the moon’, assuming a (2, 2)-context window and unigrams as atomic units. For each new word encountered by the algorithm, a context vector (Definition 1) is randomly generated, and the representation vector for the word is initialized to the 0 vector. Once generated, the context vector for the word remains fixed, but the representation vector is updated with each occurrence of the word. The update happens as follows. Given the context of the word (ranging from a set of 2-4 words), an aggregation is first performed on the corresponding context vectors. In Figure 2, for example, the aggregation is an unweighted sum. Using the aggregated vector (denoted by the symbol $\vec{a}$), we update the representation vector using the equation below, with $\vec{w}_i$ being the representation vector derived after the $i$-th occurrence of word $w$:

$$\vec{w}_{i+1} = \vec{w}_i + \vec{a} \qquad (1)$$

In principle, using this simple algorithm, we could learn a vector space representation for every atomic unit. One issue with a naive embedding of every atomic unit into a vector space is the presence of rare atomic units. These are especially prevalent in illicit domains, not just in the form of rare words, but also as sequences of Unicode characters, sequences of HTML tags, and numeric units (e.g. phone numbers), each of which only occurs a few times (often, only once) in the corpus.

To address this issue, we define below the notion of a compound unit that is based on a pre-specified condition.

Definition 3. Given a universe $U$ of atomic units and a binary condition $R : U \to \{True, False\}$, the compound unit $C_R$ is defined as the largest subset of $U$ such that $R$ evaluates to $True$ on every member of $C_R$.


Table 1: The compound units implemented in the current prototype

high-idf-units | Units occurring in fewer than a fraction θ (by default, 1%) of the initial corpus
pure-num-units | Numerical units
alpha-num-units | Alpha-numeric units that contain at least one alphabet and one number
pure-punct-units | Units with only punctuation symbols
alpha-punct-units | Units that contain at least one alphabet and one punctuation character
nonascii-unicode-units | Units that only contain non-ASCII characters

Example: For ‘rare’ words, we could define the compound unit high-idf-units to contain all atomic units that are below some document frequency threshold (e.g. 1%) in the corpus.

In our implemented prototype, we defined six mutually exclusive compound units (that is, an intersection of any two compound units will always be the empty set), described and enumerated in Table 1. We modify the naive RI algorithm by only learning a single vector for each compound unit. Intuitively, each atomic unit $w$ in a compound unit $C$ is replaced by a special dummy symbol $w_C$; hence, after algorithm execution, each atomic unit in $C$ is represented by the single vector $\vec{w}_C$.
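A minimal sketch of the adapted algorithm follows (assumptions: `compound_symbol` is a hypothetical helper that maps a token to its compound-unit dummy symbol from Table 1, or returns the token unchanged; the parameter defaults mirror Section 4.4):

```python
import numpy as np
from collections import defaultdict

def context_vector(d=200, r=0.01, rng=np.random):
    """Definition 1: exactly floor(d*r) entries set to +1, floor(d*r)
    entries set to -1, and the remaining d - 2*floor(d*r) set to 0."""
    k = int(d * r)
    vec = np.zeros(d)
    idx = rng.choice(d, size=2 * k, replace=False)
    vec[idx[:k]], vec[idx[k:]] = 1.0, -1.0
    return vec

def random_indexing(tokens, u=2, v=2, d=200, r=0.01):
    """Adapted RI over a token list with a (u, v)-context window."""
    # Replace each atomic unit belonging to a compound unit by its
    # dummy symbol, so only one vector is learned per compound unit.
    tokens = [compound_symbol(t) for t in tokens]
    ctx = defaultdict(lambda: context_vector(d, r))  # fixed per unit
    rep = defaultdict(lambda: np.zeros(d))           # updated per occurrence
    for i, w in enumerate(tokens):
        window = tokens[max(i - u, 0):i] + tokens[i + 1:i + v + 1]
        # Equation (1): add the unweighted sum of the window's context vectors
        rep[w] += np.sum([ctx[c] for c in window], axis=0)
    return rep
```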

3.3 Applying High-Recall Recognizers

For a given attribute (e.g. City) and a given corpus, we define a recognizer as a function that, if known, can be used to exactly determine the instances of the attribute occurring in the corpus. Formally:

Definition 4. A recognizer $R_A$ for attribute $A$ is a function that takes a list $t$ of tokens and positions $i$ and $j \ge i$ as inputs, and returns True if the tokens contiguously spanning $t[i] : t[j]$ are instances of $A$, and False otherwise.

It is important to note that, per the definition above, a recognizer cannot annotate latent instances that are not directly observed in the list of tokens.

Since the ‘ideal’ recognizer is not known, the broad goal of IE is to devise models that approximate it (for a given attribute) with high accuracy. Accuracy is typically measured in terms of precision and recall metrics. We formulate a two-pronged approach whereby, rather than develop a single recognizer that has both high precision and recall (and requires considerable expertise to design), we first obtain a list of candidate annotations that have high recall in expectation, and then use supervised classification in a second step to improve precision of the candidate annotations.

More formally, let $R_A$ be denoted an $\eta$-recall recognizer if the expected recall of $R_A$ is at least $\eta$. Due to the explosive growth in data, many resources on the Web can be used for bootstrapping recognizers that are ‘high-recall’ in that $\eta$ is in the range of 90-100%. The high-recall recognizers currently used in the prototype described in this paper (detailed further in Section 4.2) rely on knowledge bases (e.g. GeoNames) from Linked Open Data [4], dictionaries from the Web, and broad heuristics, such as regular expression extractors, found in public Github repositories. In our experience, we found that even students with basic knowledge of GitHub and Linked Open Data sources are able to construct such recognizers. One important reason why constructing such recognizers is relatively hassle-free is that they are typically monotonic, i.e. new heuristics and annotation sources can be freely integrated, since we do not worry about precision at this step.

We note that in some cases, domain knowledge alone is enough to guarantee 100% recall for well-designed recognizers for certain attributes. In HT, this is true for location attributes like city and state, since advertisements tend to state locations without obfuscation, and we use GeoNames, an exhaustive knowledge base of locations, as our recognizer. Manual inspection of the ground-truth data showed that the recall of utilized recognizers for attributes like Name and Age is also high (in many cases, 100%). Thus, although 100% recall cannot be guaranteed for any recognizer, it is still reasonable to assume that $\eta$ is high.

A much more difficult problem is engineering a recognizer to simultaneously achieve high recall and high precision. Even for recognizers based on curated knowledge bases like GeoNames, many non-locations get annotated as locations. For example, the word ‘nice’ is a city in France, but is also a commonly occurring adjective. Other common words like ‘for’, ‘hot’, ‘com’, ‘kim’ and ‘bella’ also occur in GeoNames as cities and would be annotated. Using a standard Named Entity Recognition system does not always work because of the language modeling problem (e.g. missing capitalization) in illicit domains. In the next section, we show how the context surrounding the annotated word can be used to classify the annotation as correct or incorrect. We note that, because the recognizers are high-recall, a successful classifier would yield both high precision and recall.
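To illustrate how lightweight such recognizers can be, the following is a minimal sketch of a dictionary-based city recognizer and a regex-based age recognizer that emit Definition 4-style spans (the dictionary contents and the age regex are illustrative assumptions; the actual extractors and dictionaries are linked in Section 4.2):

```python
import re

# Illustrative dictionary; the prototype uses the full GeoNames dataset,
# loaded into a trie for memory efficiency.
CITY_DICT = {"phoenix", "omaha", "shreveport", "rochester"}

def city_recognizer(tokens):
    """Yield (i, j) spans whose tokens match a known city name.
    Case-insensitive, single-token spans for simplicity."""
    for i, tok in enumerate(tokens):
        if tok.lower() in CITY_DICT:
            yield (i, i)

# Illustrative age heuristic: bare two-digit numbers in a plausible range.
AGE_RE = re.compile(r"^(1[89]|[2-5][0-9])$")

def age_recognizer(tokens):
    for i, tok in enumerate(tokens):
        if AGE_RE.match(tok):
            yield (i, i)
```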

3.4 Supervised Contextual Classifier

To address the precision problem, we train a classifier using contextual features. Rather than rely on a domain expert to provide a set of hand-crafted features, we derive a feature vector per candidate annotation using the notion of a context window (Definition 2) and the word representation vectors derived in Section 3.2. This process of supervised contextual classification is illustrated in Figure 3.

Figure 3: An illustration of supervised contextual classification on an example annotation (‘Phoenix’)

Specifically, for each annotation (which could comprise multiple contiguous tokens, e.g. ‘Salt Lake City’, in the list of tokens representing the website) annotated by a recognizer, we consider the tokens in the (u, v)-context window around the annotation. We aggregate the vectors of those tokens into a single vector by performing an unweighted sum, followed by l2-normalization. We use this aggregate vector as the contextual feature vector for that annotation. Note that, unlike the representation learning phase, where the surrounding context vectors were aggregated into an existing representation vector, the contextual feature vector is obtained by summing the actual representation vectors.

For each attribute, a supervised machine learning classifier (e.g. random forest) is trained using between 12-120 labeled annotations, and for new data, the remaining annotations can be classified using the trained classifier. Although the number of dimensions in the feature vectors is quite low compared to tf-idf vectors (hundreds vs. millions), a second round of dimensionality reduction can be applied by using (either supervised or unsupervised) feature selection for further empirical benefits (Section 4).
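A minimal sketch of this feature construction (names are illustrative; `rep` stands for the word-representation lookup derived in Section 3.2):

```python
import numpy as np

def contextual_feature_vector(tokens, span, rep, u=2, v=2, d=200):
    """Aggregate representation vectors in the (u, v)-context window
    around an annotation span, then l2-normalize.

    tokens: token list for the page; span: (i, j) 0-indexed positions
    of the candidate annotation; rep: dict mapping token -> np.ndarray.
    """
    i, j = span
    context = tokens[max(i - u, 0):i] + tokens[j + 1:j + v + 1]
    vec = np.zeros(d)
    for tok in context:
        # Sum the actual representation vectors of the context tokens.
        vec += rep.get(tok, np.zeros(d))
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec
```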

4. EVALUATIONS

4.1 Datasets and Ground-truths

We train the word representations on four real-world human trafficking datasets of increasing size, the details of which are provided in Table 2. Since we assume a ‘streaming’ setting in this paper, each larger dataset in Table 2 is a strict superset of the smaller datasets. The largest dataset is itself a subset of the overall human trafficking corpus that was scraped as part of research conducted in the DARPA MEMEX program (http://www.darpa.mil/program/memex).

Since ground-truth extractions for the corpus are unknown, we randomly sampled websites from the overall corpus (hence, it is possible that there are websites in the ground-truth that are not part of the corpora in Table 2), applied the four high-recall recognizers described in Section 4.2, and for each annotated set, manually verified whether the extractions were correct or incorrect for the corresponding attribute. The details of this sampled ground-truth are captured in Table 3. Each annotation set is named using the format GT-{RawField}-{AnnotationAttribute}, where RawField can be either the HTML title or the scraped text (Section 3.1), and AnnotationAttribute is the attribute of interest for annotation purposes.

Table 2: Four human trafficking corpora for which word representations are (independently) learned

Name | Num. websites | Total word count | Unique word count
D-10K | 10,000 | 2,351,036 | 1,030,469
D-50K | 50,000 | 11,758,647 | 5,141,375
D-100K | 100,000 | 23,536,935 | 10,277,732
D-ALL | 184,132 | 43,342,278 | 18,940,260

Table 3: Five ground-truth datasets on which the classifier (Section 3.4) and baselines are evaluated

Name | Pos. ann. | Neg. ann. | Recognizer Used
GT-Text-City | 353 | 15,783 | GeoNames-Cities
GT-Text-State | 100 | 16,036 | GeoNames-States
GT-Title-City | 37 | 513 | GeoNames-Cities
GT-Text-Name | 162 | 14,337 | Dictionary-Names
GT-Text-Age | 116 | 14,306 | RegEx-Ages

4.2 System

The overall system requires developing two components for each attribute: a high-recall recognizer and a classifier for pruning annotations. We developed four high-recall recognizers, namely GeoNames-Cities, GeoNames-States, RegEx-Ages and Dictionary-Names. The first two of these rely on the freely available GeoNames dataset (http://www.geonames.org/) [30]; we use the entire dataset for our experiments, which involves modeling each GeoNames dictionary as a trie, owing to its large memory footprint. For extracting ages, we rely on simple regular expressions and heuristics that were empirically verified to capture a broad set of age representations; the age extractors we used are available in the Github repository at https://github.com/usc-isi-i2/dig-age-extractor. For the name attribute, we gather freely available Name dictionaries on the Web, in multiple countries and languages, and use the dictionaries in a case-insensitive recognition algorithm to locate names in the raw field (i.e. text or title); for replication, the full set of dictionaries used may be accessed at https://github.com/usc-isi-i2/dig-dictionaries/tree/master/person-names.

4.3 Baselines

We use different variants of the Stanford Named Entity Recognition system (NER) as our baselines [15]. For the first set of baselines, we use two pre-trained models trained on different English language corpora (details are available at http://nlp.stanford.edu/software/CRF-NER.shtml#Models). Specifically, we use the 3-Class and 4-Class pre-trained models; in all Stanford NER pre-trained models, the distributional similarity option was enabled, which is known to boost F1-Measure scores. We use the LOCATION class label for determining city and state annotations, and the PERSON label for name annotations. Unfortunately, there is no specific label corresponding to age annotations in the pre-trained models; hence, we do not use the pre-trained models as age annotation baselines.

It is also possible to re-train the underlying NER system on a new dataset. For the second set of baselines, therefore, we re-train the NER models by randomly sampling 30% and 70% of each annotation set in Table 3 respectively, with the remaining annotations used for testing. The features and values that were employed in the re-trained models are enumerated in Table 4. Further documentation on these feature settings may be found on the NERFeatureFactory page (http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ie/NERFeatureFactory.html).

Table 4: Stanford NER features that were used for re-training the model on our annotation sets

useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
useDisjunctive=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC



All training and testing experiments were done in ten independent trials (when evaluating the pre-trained models, the training set is ignored and only the testing set is classified). We use default parameter settings, and report average results for each experimental run. Experimentation using other configurations, features and values is left for future studies.

4.4 Setup and Parameters

Parameter tuning System parameters were set as follows. The number of dimensions in Definition 1 was set at 200, and the sparsity ratio was set at 0.01. These parameters are similar to those suggested in previous word representation papers; they were also found to yield intuitive results on semantic similarity experiments (described further in Section 4.6). To avoid the problem of rare words, numbers, punctuation and tags, we used the six compound unit classes described earlier in Table 1. In all experiments where defining a context was required, we used symmetric (2, 2)-context windows; using bigger windows was not found to offer much benefit. We trained a random forest model with default hyperparameters (10 trees, with Gini Impurity as the split criterion) as the supervised classifier, used supervised k-best feature selection with k set to 20 (Section 3.4), and with the Analysis of Variance (ANOVA) F-statistic between class label and feature used as the feature scoring function. Because of the class skew in Table 3 (i.e. the ‘positive’ class is typically much smaller than the ‘negative’ class), we oversampled the positive class for balanced training of the supervised contextual classifier.

Metrics The metrics used for evaluating IE effectiveness are Precision, Recall and F1-Measure.

Implementation In the interests of demonstrating a reasonably lightweight system, all experiments in this paper were run on a serial iMac with a 4 GHz Intel core i7 processor and 32 GB RAM. All code (except the Stanford NER code) was written in the Python programming language, and has been made available on a public Github repository (https://github.com/mayankkejriwal/fast-word-embeddings) with documentation and examples. We used Python’s Scikit-learn library (v0.18) for the machine learning components of the prototype.
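A minimal sketch of this classifier configuration in Scikit-learn (the oversampling shown is simple resampling with replacement, an assumption about the exact balancing scheme):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.utils import resample

def train_contextual_classifier(X, y, k=20, seed=42):
    """X: array of contextual feature vectors (one per candidate
    annotation); y: 1 for correct annotations, 0 for incorrect."""
    # Oversample the positive class to balance training data.
    pos, neg = X[y == 1], X[y == 0]
    pos_up = resample(pos, n_samples=len(neg), replace=True,
                      random_state=seed)
    X_bal = np.vstack([pos_up, neg])
    y_bal = np.array([1] * len(pos_up) + [0] * len(neg))
    # ANOVA F-statistic k-best feature selection, then a random forest
    # with default hyperparameters (10 trees, Gini impurity).
    model = make_pipeline(
        SelectKBest(f_classif, k=k),
        RandomForestClassifier(n_estimators=10, criterion="gini",
                               random_state=seed))
    return model.fit(X_bal, y_bal)
```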

4.5 Results

Performance against baselines Table 5 illustrates system performance on Precision, Recall and F1-Measure metrics against the re-trained and pre-trained baseline models, where the re-trained model and our approach were trained on 30% of the annotations in Table 3. We used the word representations derived from the D-ALL corpus. On average, the proposed system performs the best on F1-Measure and recall metrics. The re-trained NER is the most precise system, but at the cost of much lower recall (<30%). The good performance of the pre-trained baseline on the City attribute demonstrates the importance of having a large training corpus, even if the corpus is not directly from the test domain. On the other hand, the complete failure of the pre-trained baseline on the Name attribute illustrates the dangers of using out-of-domain training data. As noted earlier, language models in illicit domains can significantly differ from natural language models; in fact, names in human trafficking websites are often represented in a variety of misleading ways.

Recognizing that 30% training data may constitute a sample size too small to make reliable judgments, we also tabulate the results in Table 6 when the training percentage is set at 70. Performance improves for both the re-trained baseline and our system. Performance declines for the pre-trained baseline, but this may be because of the sparseness of positive annotations in the smaller test set.

We also note that performance is relatively well-balanced for our system; on all datasets and all metrics, the system achieves scores greater than 50%. This suggests that our approach has a degree of robustness that the CRFs are unable to achieve; we believe that this is a direct consequence of using contextual word representation-based feature vectors.

Figure 4: Empirical run-time of the adapted random indexing algorithm on the corpora in Table 2

Runtimes We recorded the runtimes for learning word representations using the random indexing algorithm described earlier on the four datasets in Table 2, and plot the runtimes in Figure 4 as a function of the total number of words in each corpus. In agreement with the expected theoretical time-complexity of random indexing, the empirical run-time is linear in the number of words, for fixed parameter settings. More importantly, the absolute times show that the algorithm is extremely lightweight: on the D-ALL corpus, we are able to learn representations in under an hour. We note that these results do not employ any obvious parallelization or the multi-core capabilities of the machine. The linear scaling properties of the algorithm show that it can be used even for very large Web corpora. In future, we will investigate an implementation of the algorithm in a distributed setting.

Robustness to corpus size and quality One issue with using large corpora to derive word representations is concept drift. The D-ALL corpus, for example, contains tens of different Web domains, even though they all pertain to human trafficking. An interesting empirical issue is whether a smaller corpus (e.g. D-10K or D-50K) contains enough data for the derived word representations to converge to reasonable values. Not only would this alleviate initial training times, but it would also partially compensate for concept drift, since a smaller corpus would be expected to contain fewer unique Web domains.

Tables 7 and 8 show that such generalization is possible. The best F1-Measure performance, in fact, is achieved for D-10K, although the average F1-Measures vary by a margin of less than 2% in all cases. We cite this as further evidence of the robustness of the overall approach.


Table 5: Comparative results of three systems on precision (P), recall (R) and F1-Measure (F) when training percentage is 30. For the pre-trained baselines, we only report the best results across all applicable models

Ground-truth Dataset | Our System (P/R/F) | Re-trained Baseline (P/R/F) | Pre-trained Baseline (P/R/F)
GT-Text-City | 0.5207/0.5050/0.5116 | 0.9855/0.1965/0.3225 | 0.7206/0.7406/0.7299
GT-Text-State | 0.7852/0.6887/0.7310 | 0.64/0.0598/0.1032 | 0.2602/0.8831/0.3993
GT-Title-City | 0.5374/0.5524/0.5406 | 0.8633/0.1651/0.2685 | 0.8524/0.7341/0.7852
GT-Text-Name | 0.7201/0.5850/0.6388 | 1/0.2103/0.3351 | 0/0/0
GT-Text-Age | 0.8993/0.9156/0.9068 | 0.9102/0.7859/0.8412 | N/A
Average | 0.6925/0.6493/0.6658 | 0.8798/0.2835/0.3741 | 0.4583/0.5895/0.4786

Table 6: Comparative results of three systems when training percentage is 70

Ground-truth Dataset | Our System (P/R/F) | Re-trained Baseline (P/R/F) | Pre-trained Baseline (P/R/F)
GT-Text-City | 0.5633/0.6081/0.5841 | 0.9434/0.3637/0.5000 | 0.6893/0.7401/0.7128
GT-Text-State | 0.7916/0.7269/0.7502 | 0.7833/0.2128/0.2971 | 0.1661/0.7830/0.2655
GT-Title-City | 0.6403/0.6667/0.6437 | 0.9417/0.3333/0.4790 | 0.9133/0.6384/0.7289
GT-Text-Name | 0.7174/0.6818/0.6960 | 1/0.3747/0.5140 | 0/0/0
GT-Text-Age | 0.9252/0.9273/0.9251 | 0.9254/0.8454/0.8804 | N/A
Average | 0.7276/0.7222/0.7198 | 0.9188/0.4260/0.5341 | 0.4422/0.5404/0.4268

Table 7: A comparison of F1-Measure scores of our system (30% training data), with word representations trained on different corpora

Ground-truth | D-10K | D-50K | D-100K | D-ALL
GT-Text-City | 0.4980 | 0.5058 | 0.4909 | 0.5116
GT-Text-State | 0.7362 | 0.7385 | 0.7526 | 0.7310
GT-Title-City | 0.6148 | 0.5638 | 0.5061 | 0.5406
GT-Text-Name | 0.6756 | 0.6881 | 0.6920 | 0.6388
GT-Text-Age | 0.9387 | 0.9364 | 0.9171 | 0.9068
Average | 0.6927 | 0.6865 | 0.6717 | 0.6658

Table 8: A comparison of F1-Measure scores of our system (70% training data), with word representations trained on different corpora

Ground-truth | D-10K | D-50K | D-100K | D-ALL
GT-Text-City | 0.5925 | 0.5781 | 0.5716 | 0.5841
GT-Text-State | 0.7357 | 0.7641 | 0.7246 | 0.7502
GT-Title-City | 0.6424 | 0.6428 | 0.6364 | 0.6437
GT-Text-Name | 0.7665 | 0.7091 | 0.7333 | 0.6960
GT-Text-Age | 0.9311 | 0.9634 | 0.9347 | 0.9251
Average | 0.7336 | 0.7315 | 0.7201 | 0.7198

Effects of feature selection Finally, we evaluate the effects of feature selection in Figure 5 on the GT-Text-Name dataset, with training percentage set at 30 (results on the other datasets were qualitatively similar; we omit full reproductions herein). The results show that, although performance is reasonably stable for a wide range of k, some feature selection is necessary for better generalization.

4.6 Discussion

Table 9 contains some examples (in bold in the original table) of cities that got correctly extracted, with the bold term being assigned the highest score by the contextual classifier that was trained for cities. The examples provide good evidence for the kinds of variation (i.e. concept drift) that are often observed in real-world human trafficking data over multiple Web domains. Some domains, for example, were found to have the same kind of structured format as the second row of Table 9 (i.e. Location: followed by the actual locations), but many other domains were far more heterogeneous.

Figure 5: Effects of additional feature selection on the GT-Text-Name dataset (30% training data)

Table 9: Some representative examples of correct city extractions using the proposed method

. . . 1332 SOUTH 119TH STREET, OMAHA NE 68144 . . .
. . . Location: Bossier City/Shreveport . . .
. . . to service the areas of Salt Lake City Ogden,Farmington,Centerville,Bountiful . . .
. . . 4th August 2015 in rochester ny, new york . . .
. . . willing to Travel ( Cali, Miami, New York, Memphis . . .
. . . More girls from Salt Lake City, UT . . .

The results in this section also illustrate the merits of unsupervised feature engineering and contextual supervision. In principle, there is no reason why the word representation learning module in Figure 1 cannot be replaced by a more adaptive algorithm like Word2vec [25].


Table 10: Examples of semantic similarity using random indexing vectors from D-10K and D-ALL

Seed-token | D-10K | D-ALL
tall | figure, attractive | fit, cute
florida | california, ohio | california, texas
green | blue, brown | blue, brown
attractive | fit, figure | elegant, fit
open-minded | playful, sweet | passionate, playful

We note again that, before applying such algorithms, it is important to deal with the heterogeneity problem that arises from having many different Web domains present in the corpus. While earlier results in this section (Tables 7 and 8) showed that random indexing is reasonably stable as more websites are added to the corpus, we also verify this robustness qualitatively using a few domain-specific examples in Table 10. We ran the qualitative experiment as follows: for each seed token (e.g. ‘tall’), we searched for the two nearest neighbors in the semantic space induced by random indexing by applying cosine similarity, using two different word representation datasets (D-10K and D-ALL). As the results in Table 10 show, the induced distributional semantics are stable; even when the nearest neighbors are different (e.g. for ‘tall’), their semantics still tend to be similar.

Another important point implied by both the qualitative and quantitative results on D-10K is that random indexing is able to generalize quickly even on small amounts of data. To the best of our knowledge, it is an open question (theoretically and empirically), at the time of writing, whether state-of-the-art neural embedding-based word representation learners can (1) generalize on small quantities of data, especially in a single epoch (‘streaming data’), and (2) adequately compensate for concept drift with the same degree of robustness, and in the same lightweight manner, as the random indexing method that we adapted and evaluated in this paper. A broader empirical study on this issue is warranted.

Figure 6: Visualizing city contextual classifier inputs (with colors indicating ground-truth labels) using the t-SNE tool

Concerning contextual supervision, we qualitatively visualize the inputs to the contextual city classifier using the t-SNE tool [23]. We use the ground-truth labels to determine the color of each point in the projected 2d space. The plot in Figure 6 shows that there is a reasonable separation of labels; interestingly, there are also ‘sub-clusters’ among the positively labeled points. Each sub-cluster provides evidence for a similar context; the number of sub-clusters even in this small sample of points again illustrates the heterogeneity in the underlying data.

A last issue that we mention is the generalization of the method to more unconventional attributes than the ones evaluated herein. In ongoing work, we have experimented with more domain-specific attributes such as ethnicity (of escorts), and have achieved similar performance. In general, the presented method is applicable whenever the context around the extraction is a suitable clue for disambiguation.
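For concreteness, the nearest-neighbor search behind Table 10 can be sketched as follows (assuming `rep` maps tokens to their random-indexing vectors):

```python
import numpy as np

def nearest_neighbors(seed, rep, k=2):
    """Return the k tokens whose random-indexing vectors have the
    highest cosine similarity to the seed token's vector."""
    v = rep[seed]
    v = v / np.linalg.norm(v)
    scores = {}
    for tok, w in rep.items():
        if tok == seed:
            continue
        norm = np.linalg.norm(w)
        if norm > 0:
            scores[tok] = float(np.dot(v, w) / norm)
    return sorted(scores, key=scores.get, reverse=True)[:k]

# e.g. nearest_neighbors('tall', rep) on D-10K would, per Table 10,
# return tokens like 'figure' and 'attractive'.
```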
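Similarly, a Figure 6-style visualization can be sketched with Scikit-learn’s t-SNE implementation (an assumption on our part; the paper itself uses the t-SNE tool of [23]). Here X is the matrix of contextual feature vectors and y the ground-truth labels:

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def visualize(X, y):
    """Project contextual feature vectors to 2d and color by label."""
    proj = TSNE(n_components=2).fit_transform(X)
    plt.scatter(proj[:, 0], proj[:, 1], c=y)
    plt.show()
```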

5. CONCLUSION

In this paper, we presented a lightweight, feature-agnostic Information Extraction approach that is suitable for illicit Web domains. Our approach relies on unsupervised derivation of word representations from an initial corpus, and the training of a supervised contextual classifier using external high-recall recognizers and a handful of manually verified annotations. Experimental evaluations show that our approach can outperform feature-centric CRF-based approaches for a range of generic attributes. Key modules of our prototype are publicly available (see the Github repository linked in Section 4.4) and can be efficiently bootstrapped in a serial computing environment. Some of these modules are already being used in real-world settings. For example, they were recently released as tools for graduate-level participants in the End Human Trafficking hackathon (https://ehthackathon.splashthat.com/) organized by the office of the District Attorney of New York. At the time of writing, the system is being actively maintained and updated.

Acknowledgements The authors gratefully acknowledge the efforts of Lingzhe Teng, Rahul Kapoor and Vinay Rao Dandin, for sampling and producing the ground-truths in Table 3. This research is supported by the Defense Advanced Research Projects Agency (DARPA) and the Air Force Research Laboratory (AFRL) under contract number FA8750-14-C-0240. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA, AFRL, or the U.S. Government.

6. REFERENCES

[1] H. Alvari, P. Shakarian, and J. K. Snyder. A non-parametric learning approach to identify online human trafficking. In Intelligence and Security Informatics (ISI), 2016 IEEE Conference on, pages 133-138. IEEE, 2016.
[2] I. Augenstein, S. Padó, and S. Rudolph. Lodifier: Generating linked data from unstructured text. In Extended Semantic Web Conference, pages 210-224. Springer, 2012.
[3] M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni. Open information extraction from the web. In IJCAI, volume 7, pages 2670-2676, 2007.
[4] F. Bauer and M. Kaltenböck. Linked open data: The essentials. Edition mono/monochrom, Vienna, 2011.
[5] S. Bird. NLTK: the natural language toolkit. In Proceedings of the COLING/ACL on Interactive Presentation Sessions, pages 69-72. Association for Computational Linguistics, 2006.
[6] D. Bollegala, T. Maehara, and K.-i. Kawarabayashi. Embedding semantic relations into word representations. arXiv preprint arXiv:1505.00161, 2015.
[7] S. Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data. Elsevier, 2002.
[8] C.-H. Chang, M. Kayed, M. R. Girgis, and K. F. Shaalan. A survey of web information extraction systems. IEEE Transactions on Knowledge and Data Engineering, 18(10):1411-1428, 2006.
[9] H. Chen. Dark web forum portal. In Dark Web, pages 257-270. Springer, 2012.
[10] R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160-167. ACM, 2008.
[11] J. Cowie and W. Lehnert. Information extraction. Communications of the ACM, 39(1):80-91, 1996.
[12] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. Learning to construct knowledge bases from the world wide web. Artificial Intelligence, 118(1):69-113, 2000.
[13] A. Dubrawski, K. Miller, M. Barnes, B. Boecking, and E. Kennedy. Leveraging publicly available data to discern patterns of human-trafficking activity. Journal of Human Trafficking, 1(1):65-85, 2015.
[14] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates. Web-scale information extraction in KnowItAll: (preliminary results). In Proceedings of the 13th International Conference on World Wide Web, pages 100-110. ACM, 2004.
[15] J. R. Finkel, T. Grenager, and C. Manning. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 363-370. Association for Computational Linguistics, 2005.
[16] K. Fundel, R. Küffner, and R. Zimmer. RelEx: Relation extraction using dependency parse trees. Bioinformatics, 23(3):365-371, 2007.
[17] S. Gupta, G. Kaiser, D. Neistadt, and P. Grimm. DOM-based content extraction of HTML documents. In Proceedings of the 12th International Conference on World Wide Web, pages 207-214. ACM, 2003.
[18] B. Han, P. Cook, and T. Baldwin. Text-based twitter user geolocation prediction. Journal of Artificial Intelligence Research, 49:451-500, 2014.
[19] J. Heflin and J. Hendler. A portrait of the semantic web in action. IEEE Intelligent Systems, 16(2):54-59, 2001.
[20] J. Kazama, T. Makino, Y. Ohta, and J. Tsujii. Tuning support vector machines for biomedical named entity recognition. In Proceedings of the ACL-02 Workshop on Natural Language Processing in the Biomedical Domain - Volume 3, pages 1-8. Association for Computational Linguistics, 2002.
[21] N. Kushmerick. Wrapper Induction for Information Extraction. PhD thesis, University of Washington, 1997.
[22] A. Lenci. Distributional semantics in linguistic and cognitive research. Italian Journal of Linguistics, 20(1):1-31, 2008.
[23] L. v. d. Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579-2605, 2008.
[24] A. McCallum and W. Li. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, pages 188-191. Association for Computational Linguistics, 2003.
[25] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111-3119, 2013.
[26] A. Nikfarjam, A. Sarker, K. O'Connor, R. Ginn, and G. Gonzalez. Pharmacovigilance from social media: mining adverse drug reaction mentions using sequence labeling with word embedding cluster features. Journal of the American Medical Informatics Association, page ocu041, 2015.
[27] M. Sahlgren. An introduction to random indexing. In Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering, TKE, volume 5, 2005.
[28] P. Szekely, C. A. Knoblock, J. Slepicka, A. Philpot, A. Singh, C. Yin, D. Kapoor, P. Natarajan, D. Marcu, K. Knight, et al. Building and using a knowledge graph to combat human trafficking. In International Semantic Web Conference, pages 205-221. Springer, 2015.
[29] Q. Wang, B. Wang, and L. Guo. Knowledge base completion using embeddings and rules. In Proceedings of the 24th International Joint Conference on Artificial Intelligence, pages 1859-1865, 2015.
[30] M. Wick and C. Boutreux. GeoNames. GeoNames Geographical Database, 2011.
[31] G. Widmer and M. Kubat. Learning in the presence of concept drift and hidden contexts. Machine Learning, 23(1):69-101, 1996.
[32] F. Wu and D. S. Weld. Open information extraction using Wikipedia. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 118-127. Association for Computational Linguistics, 2010.
[33] A. Zouaq and R. Nkambou. A survey of domain ontology engineering: methods and tools. In Advances in Intelligent Tutoring Systems, pages 103-119. Springer, 2010.