

SLIDE 1

7/26/2015

Successful Data Mining Methods for NLP

Jiawei Han (UIUC), Heng Ji (RPI), Yizhou Sun (NEU)
http://hanj.cs.illinois.edu/slides/dmnlp15.pptx
http://nlp.cs.rpi.edu/paper/dmnlp15.pptx

Introduction: Where do NLP and DM Meet?

SLIDE 2

Slightly Different Research Philosophies

 NLP: Deep understanding of individual words, phrases and sentences (“micro‐level”); focus on unstructured text data

 Data Mining (DM): High‐level (statistical) understanding, discovery and synthesis of the most salient information (“macro‐level”); historically more on structured and semi‐structured data

(Figure: NewsNet (Tao et al., 2014), a news network related to “Health Care Bill”)

Advantages of NLP

Construct graphs/networks with fine‐grained semantics from unstructured texts

Use large‐scale annotations for real‐world data

Advantages of DM: Deep understanding through structured/correlation inference

Using a structured representation (e.g., graph, network) as a bridge to capture interactions between NLP and DM

Example: Heterogeneous Information Networks [Han et al., 2010; Sun et al., 2012]

DM Solution: Data to Networks to Knowledge (D2N2K)

Data → Networks → Knowledge

SLIDE 3

Major theme of this tutorial

Applying novel DM methods to solve traditional NLP problems

Integrating DM and NLP, transforming Data to Networks to Knowledge

Road Map of this tutorial

Effective Network Construction by Leveraging Information Redundancy

Theme I: Phrase Mining and Topic Modeling from Large Corpora

Theme II: Entity Extraction and Linking by Relational Graph Construction

Mining Knowledge from Structured Networks

Theme III: Search and Mining Structured Graphs and Heterogeneous Networks

Looking forward to the Future

A Promising Direction: Integrating DM and NLP

Theme I: Phrase Mining and Topic Modeling from Large Corpora

SLIDE 4

Why Phrase Mining?

 Phrase: Minimal, unambiguous semantic unit; basic building block for information networks and knowledge bases

 Unigrams vs. phrases: unigrams (single words) are ambiguous. Example: “United”: United States? United Airlines? United Parcel Service? A phrase is a natural, meaningful, unambiguous semantic unit. Example: “United States” vs. “United Airlines”

 Mining semantically meaningful phrases: transform text data from word granularity to phrase granularity; enhance the power and efficiency of manipulating unstructured data using database technology


Mining Phrases: Why Not Just Use NLP Methods?

Phrase mining: Originated from the NLP community as “chunking”

Model it as a sequence labeling problem (B‐NP, I‐NP, O, …)

Need annotation and training

Annotate hundreds of POS‐tagged documents as training data

Train a supervised model based on part‐of‐speech features

Recent trend:

Use distributional features based on web n‐grams (Bergsma et al., 2010)

State‐of‐the‐art performance: ~95% accuracy, ~88% phrase‐level F‐score

Limitations

High annotation cost, not scalable to a new language, domain or genre

May not fit domain‐specific, dynamic, emerging applications

Scientific domains, query logs, or social media, e.g., Yelp, Twitter

Use only local features, no ranking, no links to topics
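For concreteness, the B‐NP/I‐NP/O labels above can be decoded into noun‐phrase chunks roughly as follows (a toy sketch; the tokens and tags are invented):

```python
def bio_to_chunks(tokens, tags):
    """Decode B-NP/I-NP/O tags into noun-phrase chunks."""
    chunks, current = [], []
    for tok, tag in zip(tokens, tags):
        if tag == "B-NP":                 # a new chunk starts here
            if current:
                chunks.append(" ".join(current))
            current = [tok]
        elif tag == "I-NP" and current:   # continue the open chunk
            current.append(tok)
        else:                             # O (or stray I-NP): close any open chunk
            if current:
                chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks

tokens = ["the", "united", "states", "announced", "a", "new", "policy"]
tags   = ["B-NP", "I-NP", "I-NP", "O", "B-NP", "I-NP", "I-NP"]
print(bio_to_chunks(tokens, tags))  # ['the united states', 'a new policy']
```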

SLIDE 5

Data Mining Approaches for Phrase Mining

General principle: Corpus‐based; fully exploit information redundancy and data‐driven criteria to determine phrase boundaries and salience; use local evidence to adjust corpus‐level statistics

Phrase Mining and Topic Modeling from Large Corpora

Strategy 1: Simultaneously Inferring Phrases and Topics

Bigram topical model [Wallach’06], topical n‐gram model [Wang, et al.’07], phrase‐discovering topic model [Lindsey, et al.’12]

Strategy 2: Post Topic Modeling Phrase Construction

Topic labeling [Mei et al.’07], TurboTopics [Blei & Lafferty’09], KERT [Danilevsky, et al.’14]

Strategy 3: First Phrase Mining then Topic Modeling

ToPMine [El‐Kishky, et al., VLDB’15]

Integration of Phrase Mining with Document Segmentation

SegPhrase [Liu, et al., SIGMOD’15]


Strategy 1: Simultaneously Inferring Phrases and Topics

Bigram Topic Model [Wallach’06]

Probabilistic generative model that conditions on previous word and topic when drawing next word

Topical N‐Grams (TNG) [Wang, et al.’07]

Probabilistic model that generates words in textual order

Create n‐grams by concatenating successive bigrams (a generalization of Bigram Topic Model)

Phrase‐Discovering LDA (PDLDA) [Lindsey, et al.’12]

Viewing each sentence as a time‐series of words, PDLDA posits that the generative parameter (topic) changes periodically

Each word is drawn based on previous m words (context) and current phrase topic

High model complexity: tends to overfit; high inference cost: slow

SLIDE 6

Strategy 2: Post Topic Modeling Phrase Construction

TurboTopics [Blei & Lafferty’09] – Phrase construction as a post‐processing step to Latent Dirichlet Allocation

Perform Latent Dirichlet Allocation on corpus to assign each token a topic label

Merge adjacent unigrams with the same topic label via a distribution‐free permutation test on an arbitrary‐length back‐off model

End recursive merging when all significant adjacent unigrams have been merged

 KERT [Danilevsky et al.’14]: Phrase construction as a post‐processing step to Latent Dirichlet Allocation

Perform frequent pattern mining on each topic

Perform phrase ranking based on four different criteria


Example of TurboTopics

Perform LDA on the corpus to assign each token a topic label

E.g., “… phase(11) transition(11) … game(153) theory(127) …”, where each number is the token’s topic label

Then merge adjacent unigrams with the same topic label
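A minimal sketch of that merging pass, using the slide’s token/topic pairs and omitting TurboTopics’ permutation‐test significance check:

```python
def merge_same_topic(tokens):
    """One TurboTopics-style merging pass: join adjacent unigrams that share
    a topic label. `tokens` is a list of (word, topic_id) pairs from LDA."""
    merged = []
    for word, topic in tokens:
        if merged and merged[-1][1] == topic:
            merged[-1] = (merged[-1][0] + " " + word, topic)
        else:
            merged.append((word, topic))
    return merged

tagged = [("phase", 11), ("transition", 11), ("game", 153), ("theory", 127)]
print(merge_same_topic(tagged))
# [('phase transition', 11), ('game', 153), ('theory', 127)]
```

Note that “game theory” stays unmerged here because its tokens drew different topics, exactly the failure mode the slide illustrates.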

SLIDE 7

Framework of KERT

1. Run bag‐of‐words model inference and assign a topic label to each token

2. Extract candidate keyphrases within each topic (frequent pattern mining)

3. Rank the keyphrases in each topic (comparability property: directly compare phrases of mixed lengths)

Top keyphrases in one topic under different ranking schemes:

kpRel [Zhao et al. 11] | KERT (‐popularity) | KERT (‐discriminativeness) | KERT (‐concordance) | KERT [Danilevsky et al. 14]
learning | effective | support vector machines | learning | learning
classification | text | feature selection | classification | support vector machines
selection | probabilistic | reinforcement learning | selection | reinforcement learning
models | identification | conditional random fields | feature | feature selection
algorithm | mapping | constraint satisfaction | decision | conditional random fields
features | task | decision trees | bayesian | classification
decision | planning | dimensionality reduction | trees | decision trees


Strategy 3: First Phrase Mining then Topic Modeling

ToPMine [El‐Kishky et al. VLDB’15]

First phrase construction, then topic mining

Contrast with KERT: topic modeling, then phrase mining

The ToPMine Framework:

Perform frequent contiguous pattern mining to extract candidate phrases and their counts

Perform agglomerative merging of adjacent unigrams as guided by a significance score; this segments each document into a “bag‐of‐phrases”

The newly formed bags‐of‐phrases are passed as input to PhraseLDA, an extension of LDA that constrains all words in a phrase to share the same latent topic
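The frequent contiguous pattern mining step might be sketched as follows (the tiny corpus, `max_n`, and `min_support` are invented for illustration):

```python
from collections import Counter

def contiguous_patterns(docs, max_n=4, min_support=2):
    """Count contiguous n-grams (candidate phrases) with count >= min_support.
    A toy stand-in for ToPMine's first step; real corpora need a smarter pass."""
    counts = Counter()
    for doc in docs:
        toks = doc.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(toks) - n + 1):
                counts[tuple(toks[i:i + n])] += 1
    return {p: c for p, c in counts.items() if c >= min_support}

docs = ["support vector machine training",
        "a support vector machine classifier",
        "support vector regression"]
pats = contiguous_patterns(docs)
print(pats[("support", "vector")])             # 3
print(pats[("support", "vector", "machine")])  # 2
```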
SLIDE 8

Why First Phrase Mining then Topic Modeling?

With Strategy 2, tokens in the same phrase may be assigned to different topics

Ex: “knowledge discovery using least squares support vector machine classifiers …”

“Knowledge discovery” and “support vector machine” should have coherent topic labels

Solution: switch the order of phrase mining and topic model inference

Techniques

Phrase mining and document segmentation

Topic model inference with phrase constraint

[knowledge discovery] using [least squares] [support vector machine] [classifiers] …


Phrase Mining: Frequent Pattern Mining + Statistical Analysis

Example segmented corpus: “[Markov blanket] [feature selection] for [support vector machines] … [knowledge discovery] using [least squares] [support vector machine] [classifiers] … [support vector] for [machine learning] …”

Phrase | Raw freq. | True freq.
[support vector machine] | 90 | 80
[vector machine] | 95 | 0
[support vector] | 100 | 20

Quality phrases are selected based on a significance score [Church et al.’91]:

α(P1, P2) ≈ (f(P1●P2) − µ0(P1, P2)) / √f(P1●P2)

where f(P1●P2) is the observed frequency of the merged phrase and µ0(P1, P2) = f(P1) f(P2) / N is its expected frequency under independence in a corpus of N words.
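Reading µ0(P1, P2) as the expected co‐occurrence count f(P1)·f(P2)/N under independence (the standard interpretation of this score), a sketch with invented counts:

```python
from math import sqrt

def significance(f1, f2, f12, total_words):
    """Significance of merging phrases P1, P2 (Church et al. '91 style):
    alpha ~= (f(P1.P2) - mu0) / sqrt(f(P1.P2)),
    where mu0 = f(P1) * f(P2) / N is the expected count under independence."""
    mu0 = f1 * f2 / total_words
    return (f12 - mu0) / sqrt(f12)

# "support vector" occurring 100 times in a 1M-word corpus (counts invented)
print(round(significance(f1=120, f2=150, f12=100, total_words=1_000_000), 2))
# 10.0 -- far above chance, so the merge is significant
```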

SLIDE 9

Collocation Mining

Collocation: A sequence of words that occur more frequently than expected

Often “interesting”, and due to their non‐compositionality, often carry information not conveyed by their constituent terms (e.g., “made an exception”, “strong tea”)

Many different measures used to extract collocations from a corpus [Dunning 93, Pederson 96]

E.g., mutual information, t‐test, z‐test, chi‐squared test, likelihood ratio

Many of these measures can be used to guide the agglomerative phrase‐segmentation algorithm
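As a sketch of one such measure, pointwise mutual information over invented bigram counts:

```python
from math import log2

def pmi(f_xy, f_x, f_y, n):
    """Pointwise mutual information of a bigram: log2( p(xy) / (p(x) p(y)) ).
    All counts below are invented for illustration."""
    return log2((f_xy / n) / ((f_x / n) * (f_y / n)))

n = 100_000
print(round(pmi(30, 80, 200, n), 2))   # "strong tea": well above chance
print(round(pmi(1, 500, 200, n), 2))   # "powerful tea": about 0 (chance level)
```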


ToPMine: Phrase LDA (Constrained Topic Modeling)

The generative model for PhraseLDA is the same as LDA

Difference: the model incorporates constraints obtained from the “bag‐of‐phrases” input

Its chain graph shows that all words in a phrase are constrained to take on the same topic values

[knowledge discovery] using [least squares] [support vector machine] [classifiers] …

Topic model inference with phrase constraints

SLIDE 10

Example Topical Phrases: A Comparison

ToPMine [El‐Kishky et al. 14], Strategy 3 (67 seconds):

Topic 1 | Topic 2
information retrieval | feature selection
social networks | machine learning
web search | semi supervised
search engine | large scale
information extraction | support vector machines
question answering | active learning
web pages | face recognition

PDLDA [Lindsey et al. 12], Strategy 1 (3.72 hours):

Topic 1 | Topic 2
social networks | information retrieval
web search | text classification
time series | machine learning
search engine | support vector machines
management system | information extraction
real time | neural networks
decision trees | text categorization


ToPMine: Experiments on DBLP Abstracts

SLIDE 11

ToPMine: Topics on Associated Press News (1989)


ToPMine: Experiments on Yelp Reviews

SLIDE 12

Efficiency: Running Time of Different Strategies

Strategy 1: Generate bag‐of‐words → generate sequence of tokens

Strategy 2: Post bag‐of‐words model inference, visualize topics with n‐grams

Strategy 3: Prior bag‐of‐words model inference, mine phrases and impose to the bag‐of‐words model

Running time: strategy 3 > strategy 2 > strategy 1 (“>” means outperforms, i.e., runs faster)


Coherence of Topics: Comparison of Strategies

Strategy 1: Generate bag‐of‐words → generate sequence of tokens

Strategy 2: Post bag‐of‐words model inference, visualize topics with n‐grams

Strategy 3: Prior bag‐of‐words model inference, mine phrases and impose to the bag‐of‐words model

Coherence measured by z‐score: strategy 3 > strategy 2 > strategy 1

SLIDE 13

Phrase Intrusion: Comparison of Strategies

Phrase intrusion measured by average number of correct answers: strategy 3 > strategy 2 > strategy 1


Phrase Quality: Comparison of Strategies

Phrase quality measured by z‐score: strategy 3 > strategy 2 > strategy 1

SLIDE 14

Mining Phrases: Why Not Use Raw Frequency Based Methods?

Traditional data‐driven approaches

Frequent pattern mining

If AB is frequent, AB is likely to be a phrase

Raw frequency could NOT reflect the quality of phrases

E.g., freq(vector machine) ≥ freq(support vector machine)

Need to rectify the frequency based on segmentation results

Phrasal segmentation will tell which words should be treated as a whole phrase and which remain unigrams


SegPhrase: From Raw Corpus to Quality Phrases and Segmented Corpus

Raw frequency could NOT reflect the quality of phrases

E.g., freq(vector machine) ≥ freq(support vector machine)

Need to rectify the frequency based on segmentation results

Build a candidate phrase set by frequent pattern mining

Mine frequent k‐grams; k is typically small, e.g., 6 in our experiments

Popularity measured by the raw frequency of words and phrases mined from the corpus

Document 1: Citation recommendation is an interesting but challenging research problem in data mining area.

Document 2: In this study, we investigate the problem in the context of heterogeneous information networks using data mining technique.

Document 3: Principal Component Analysis is a linear dimensionality reduction technique commonly used in machine learning applications.

Pipeline: Input Raw Corpus → Phrase Mining → Quality Phrases → Phrasal Segmentation → Segmented Corpus

SLIDE 15

SegPhrase: The Overall Framework

ClassPhrase: Frequent pattern mining, feature extraction, classification

SegPhrase: Phrasal segmentation and phrase quality estimation

SegPhrase+: One more round to enhance mined phrase quality


What Kind of Phrases Are of “High Quality”?

Judging the quality of phrases

Popularity

“information retrieval” vs. “cross‐language information retrieval”

Concordance

“powerful tea” vs. “strong tea”

“active learning” vs. “learning classification”

Informativeness

“this paper” (frequent but not discriminative, not informative)

Completeness

“vector machine” vs. “support vector machine”

SLIDE 16

Feature Extraction: Concordance

 Partition a phrase into two parts to check whether their co‐occurrence is significantly higher than pure chance

Example partitions: “support vector | machine”, “this paper | demonstrates”

Pointwise mutual information: PMI(⟨ul, ur⟩) = log [ p(v) / (p(ul) p(ur)) ], where v is the phrase and ul, ur its left and right parts

Pointwise KL divergence: PKL(v ‖ ⟨ul, ur⟩) = p(v) log [ p(v) / (p(ul) p(ur)) ]

The additional factor p(v) multiplied with pointwise mutual information leads to less bias towards rarely occurring phrases
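A sketch contrasting the two measures; all probabilities are invented to illustrate the rare‐phrase bias:

```python
from math import log

def pmi(p_uv, p_u, p_v):
    """Pointwise mutual information over probabilities."""
    return log(p_uv / (p_u * p_v))

def pkl(p_uv, p_u, p_v):
    """Pointwise KL divergence = p(uv) * PMI(u, v): weighting PMI by the
    phrase probability damps the bias toward rare phrases."""
    return p_uv * pmi(p_uv, p_u, p_v)

# A rare pair can have a huge PMI but a tiny PKL (probabilities invented)
rare = dict(p_uv=1e-7, p_u=2e-7, p_v=2e-7)
freq = dict(p_uv=1e-4, p_u=5e-4, p_v=4e-4)
print(pmi(**rare) > pmi(**freq))   # True: PMI favors the rare pair
print(pkl(**rare) < pkl(**freq))   # True: PKL prefers the frequent one
```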


Feature Extraction: Informativeness

Deriving Informativeness

Quality phrases typically start and end with a non‐stopword

“machine learning is” vs. “machine learning”

Use average IDF over words in the phrase to measure the semantics

Usually, the probability that a quality phrase appears in quotes, brackets, or connected by dashes is higher (punctuation features)

“state‐of‐the‐art”

We can also incorporate features using some NLP techniques, such as POS tagging, chunking, and semantic parsing

SLIDE 17

Classifier

Limited Training

Labels: Whether a phrase is a quality one or not

“support vector machine”: 1

“the experiment shows”: 0

For ~1GB corpus, only 300 labels

Random Forest as our classifier

Predicted phrase quality scores lie in [0, 1]

Bootstrap many different datasets from limited labels
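The slides use a Random Forest; as a dependency‐free illustration of the same bootstrap‐and‐vote idea, here is a toy ensemble of one‐feature decision stumps (the features, labels, and values are all invented, not SegPhrase’s):

```python
import random

def train_stump(data):
    """Fit a one-feature threshold stump (tiny decision tree) minimizing
    training error; data is a list of (features, label) with labels in {0,1}."""
    best = None
    for j in range(len(data[0][0])):          # try every feature
        for x, _ in data:                     # try every observed threshold
            thr = x[j]
            for sign in (1, -1):              # and both orientations
                err = sum(1 for xv, y in data
                          if (1 if sign * (xv[j] - thr) >= 0 else 0) != y)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    _, j, thr, sign = best
    return lambda x: 1 if sign * (x[j] - thr) >= 0 else 0

def bagged_quality_score(data, x, n_trees=25, seed=0):
    """Phrase quality in [0, 1]: fraction of stumps, each trained on a
    bootstrap resample of the limited labels, that vote 'quality phrase'."""
    rng = random.Random(seed)
    votes = 0
    for _ in range(n_trees):
        boot = [rng.choice(data) for _ in data]   # bootstrap resample
        votes += train_stump(boot)(x)
    return votes / n_trees

# Labels: 1 = quality phrase, 0 = not; the two features stand in for
# (concordance, informativeness) and every value is invented.
labeled = [((0.9, 0.8), 1), ((0.8, 0.9), 1), ((0.2, 0.1), 0), ((0.1, 0.3), 0)]
print(bagged_quality_score(labeled, (0.95, 0.95)))  # high score expected
print(bagged_quality_score(labeled, (0.05, 0.05)))  # low score expected
```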


SegPhrase: Why Do We Need Phrasal Segmentation in Corpus?

Phrasal segmentation can tell which phrase is more appropriate

Ex: A standard [feature vector] [machine learning] setup is used to describe...

Rectified phrase frequency (expected influence): occurrences that the segmentation does not support, e.g., “vector machine” inside “[feature vector] [machine learning]”, are not counted towards the rectified frequency

SLIDE 18

SegPhrase: Segmentation of Phrases

Partition a sequence of words by maximizing the likelihood, considering:

Phrase quality score: ClassPhrase assigns a quality score to each phrase

Probability in the corpus

Length penalty: a factor relative to 1 that trades off shorter against longer phrases

Filter out phrases with low rectified frequency: bad phrases are expected to occur rarely in the segmentation results
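A simplified dynamic‐programming sketch of such a segmenter; the per‐segment score combines a phrase‐quality lookup with a length penalty, but the exact form, parameters, and quality values here are illustrative, not SegPhrase’s actual likelihood:

```python
from math import log

def segment(tokens, quality, theta=0.1, penalty=0.5, max_len=4):
    """Viterbi-style segmentation maximizing the sum of segment log-scores:
    score(segment) = log(quality) + (len - 1) * log(penalty).
    `quality` maps known phrases to [0, 1]; unknown unigrams get `theta`."""
    n = len(tokens)
    best = [float("-inf")] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            phrase = " ".join(tokens[j:i])
            q = quality.get(phrase, theta if i - j == 1 else 1e-6)
            score = best[j] + log(q) + (i - j - 1) * log(penalty)
            if score > best[i]:
                best[i], back[i] = score, j
    segs, i = [], n
    while i > 0:                      # recover the best segmentation
        segs.append(" ".join(tokens[back[i]:i]))
        i = back[i]
    return segs[::-1]

quality = {"support vector machine": 0.9, "feature vector": 0.8,
           "machine learning": 0.85}
print(segment("a support vector machine classifier".split(), quality))
# ['a', 'support vector machine', 'classifier']
```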


SegPhrase+: Enhancing Phrasal Segmentation

SegPhrase+: One more round for enhanced phrasal segmentation

Feedback

Using rectified frequency, re‐compute those features previously computed based on raw frequency

Process:

Classification → Phrasal segmentation (SegPhrase)

Classification → Phrasal segmentation → Classification → Phrasal segmentation (SegPhrase+)

Effects on computing quality scores, e.g., distinguishing “np hard in the strong sense” from “np hard in the strong”, and handling “data base management system”

SLIDE 19

Performance Study: Methods to Be Compared

Other phrase mining methods to be compared:

NLP chunking based methods

Chunks as candidates

Sorted by TF‐IDF and C‐value (K. Frantzi et al., 2000)

Unsupervised raw frequency based methods

ConExtr (A. Parameswaran et al., VLDB 2010)

ToPMine (A. El‐Kishky et al., VLDB 2015)

Supervised method

KEA, designed for single document keyphrases (O. Medelyan & I. H. Witten, 2006)


Performance Study: Experimental Setting

Datasets

Popular Wiki Phrases

Based on internal links

~7K high quality phrases

Pooling

Sampled 500 * 7 Wiki‐uncovered phrases

Evaluated by 3 reviewers independently

Dataset | #docs | #words | #labels
DBLP | 2.77M | 91.6M | 300
Yelp | 4.75M | 145.1M | 300

SLIDE 20

Performance: Precision Recall Curves on DBLP

Compare with other baselines: TF‐IDF, C‐Value, ConExtr, KEA, ToPMine, SegPhrase+

Compare with our 3 variations: TF‐IDF, ClassPhrase, SegPhrase, SegPhrase+


Performance Study: Processing Efficiency

SegPhrase+ is linear in the size of the corpus!

SLIDE 21

Experimental Results: Interesting Phrases Generated (From the Titles and Abstracts of SIGMOD)

Query: SIGMOD

Rank | SegPhrase+ | Chunking (TF‐IDF & C‐Value)
1 | data base | data base
2 | database system | database system
3 | relational database | query processing
4 | query optimization | query optimization
5 | query processing | relational database
… | … | …
51 | sql server | database technology
52 | relational data | database server
53 | data structure | large volume
54 | join query | performance study
55 | web service | web service
… | … | …
201 | high dimensional data | efficient implementation
202 | location based service | sensor network
203 | xml schema | large collection
204 | two phase locking | important issue
205 | deep web | frequent itemset
… | … | …

(Some phrases are found only by SegPhrase+; others only by Chunking.)


Experimental Results: Interesting Phrases Generated (From the Titles and Abstracts of SIGKDD)

Query: SIGKDD

Rank | SegPhrase+ | Chunking (TF‐IDF & C‐Value)
1 | data mining | data mining
2 | data set | association rule
3 | association rule | knowledge discovery
4 | knowledge discovery | frequent itemset
5 | time series | decision tree
… | … | …
51 | association rule mining | search space
52 | rule set | domain knowledge
53 | concept drift | important problem
54 | knowledge acquisition | concurrency control
55 | gene expression data | conceptual graph
… | … | …
201 | web content | optimal solution
202 | frequent subgraph | semantic relationship
203 | intrusion detection | effective way
204 | categorical attribute | space complexity
205 | user preference | small set
… | … | …

(Some phrases are found only by SegPhrase+; others only by Chunking.)

SLIDE 22

Experimental Results: Similarity Search

In response to a user’s phrase query, SegPhrase+ finds high‐quality, semantically similar phrases

In DBLP, query on “data mining” and “OLAP”

In Yelp, query on “blu‐ray”, “noodle”, and “valet parking”


Recent Progress

Distant Training: No need of human labeling

Training using general knowledge bases

E.g., Freebase, Wikipedia

Quality Estimation for Unigrams

Integration of phrases and unigrams in one uniform framework

Demo based on DBLP abstract

Multiple languages: beyond English corpora

Extensible to mining quality phrases in multiple languages

Recent progress: SegPhrase+ works on Chinese, Arabic and Spanish

SLIDE 23

High Quality Phrases Generated from Chinese Wikipedia

Rank | Phrase | In English
… | … | …
62 | 首席_执行官 | CEO
63 | 中间_偏右 | Center‐right
… | … | …
84 | 百度_百科 | Baidu Baike
85 | 热带_气旋 | Tropical cyclone
86 | 中国科学院_院士 | Fellow of the Chinese Academy of Sciences
… | … | …
1001 | 十大_中文_金曲 | Top‐10 Chinese Songs
1002 | 全球_资讯网 | Global News Website
1003 | 天一阁_藏_明代_科举_录_选刊 | (a Chinese book title)
… | … | …
9934 | 国家_戏剧_院 | National Theater
9935 | 谢谢_你 | Thank you
… | … | …


Top-Ranked Phrases Generated from English Gigaword

Northrop Grumman, Ashfaq Kayani, Sania Mirza, Pius Xii, Shakhtar Donetsk, Kyaw Zaw Lwin

Ratko Mladic, Abdolmalek Rigi, Rubin Kazan, Rajon Rondo, Rubel Hossain, bluefin tuna

Psv Eindhoven, Nicklas Bendtner, Ryo Ishikawa, Desmond Tutu, Landon Donovan, Jannie du Plessis

Zinedine Zidane, Uttar Pradesh, Thor Hushovd, Andhra Pradesh, Jafar_Panahi, Marouane Chamakh

Rahm Emanuel, Yakubu Aiyegbeni, Salva Kiir, Abdelhamid Abou Zeid, Blaise Compaore, Rickie Fowler

Andry Rajoelina, Merck Kgaa, Js Kabylie, Arjun Atwal, Andal Ampatuan Jnr, Reggio Calabria, Ghani Baradar

Mahela Jayawardene, Jemaah Islamiyah, quantitative easing, Nodar Kumaritashvili, Alviro Petersen

Rumiana Jeleva, Helio Castroneves, Koumei Oda, Porfirio Lobo, Anastasia Pavlyuchenkova

Thaksin Shinawatra, Evgeni_Malkin, Salvatore Sirigu, Edoardo Molinari, Yoshito Sengoku

Otago Highlanders, Umar Akmal, Shuaibu Amodu, Nadia Petrova, Jerzy Buzek, Leonid Kuchma,

Alona Bondarenko, Chosun Ilbo, Kei Nishikori, Nobunari Oda, Kumbh Mela, Santo_Domingo

Nicolae Ceausescu, Yoann Gourcuff, Petr Cech, Mirlande Manigat, Sulieman Benn, Sekouba Konate

SLIDE 24

Summary and Future Work

Strategy 1: Generate bag‐of‐words → generate sequence of tokens

Integrated complex model; phrase quality and topic inference rely on each other

Slow and overfitting

Strategy 2: Post bag‐of‐words model inference, visualize topics with n‐grams

Phrase quality relies on topic labels for unigrams

Can be fast; generally high‐quality topics and phrases

Strategy 3: Prior bag‐of‐words model inference, mine phrases and impose to the bag‐of‐words model

Topic inference relies on correct segmentation of documents, but is not very sensitive to segmentation errors

Can be fast; generally high‐quality topics and phrases


Summary and Future Work (Cont’d)

SegPhrase+: A new phrase mining framework

Integrating phrase mining with phrasal segmentation

Requires only limited training or distant training

Generates high‐quality phrases, close to human judgement

Linearly scalable on time and space

Looking forward: High‐quality, scalable phrase mining

Facilitate entity recognition and typing in large corpora (See the next part of this tutorial)

Combine with linguistic‐rich patterns

Transform massive unstructured data into semi‐structured knowledge networks

SLIDE 25

References for Theme 1: Phrase Mining for Concept Extraction

 D. M. Blei and J. D. Lafferty. Visualizing Topics with Multi‐Word Expressions. arXiv:0907.1013, 2009

 K. Church, W. Gale, P. Hanks, D. Hindle. Using Statistics in Lexical Analysis. In U. Zernik (ed.), Lexical Acquisition: Exploiting On‐Line Resources to Build a Lexicon. Lawrence Erlbaum, 1991

 M. Danilevsky, C. Wang, N. Desai, X. Ren, J. Guo, J. Han. Automatic Construction and Ranking of Topical Keyphrases on Collections of Short Documents. SDM’14

 A. El‐Kishky, Y. Song, C. Wang, C. R. Voss, J. Han. Scalable Topical Phrase Mining from Text Corpora. VLDB’15

 K. Frantzi, S. Ananiadou, H. Mima. Automatic Recognition of Multi‐Word Terms: the C‐value/NC‐value Method. Int. Journal on Digital Libraries, 3(2), 2000

 R. V. Lindsey, W. P. Headden III, M. J. Stipicevic. A Phrase‐Discovering Topic Model Using Hierarchical Pitman‐Yor Processes. EMNLP‐CoNLL’12

 J. Liu, J. Shang, C. Wang, X. Ren, J. Han. Mining Quality Phrases from Massive Text Corpora. SIGMOD’15

 O. Medelyan and I. H. Witten. Thesaurus Based Automatic Keyphrase Indexing. IJCDL’06

 Q. Mei, X. Shen, C. Zhai. Automatic Labeling of Multinomial Topic Models. KDD’07

 A. Parameswaran, H. Garcia‐Molina, A. Rajaraman. Towards the Web of Concepts: Extracting Concepts from Large Datasets. VLDB’10

 X. Wang, A. McCallum, X. Wei. Topical N‐grams: Phrase and Topic Discovery, with an Application to Information Retrieval. ICDM’07

Theme II: Entity Extraction and Linking by Relational Graph Construction and Propagation

SLIDE 26

Session 1. Task and Traditional NLP Approach


Entity Extraction and Linking

Plain text: “The best BBQ I’ve tasted in Phoenix! I had the pulled pork sandwich with coleslaw and baked beans for lunch. … The owner is very nice. …”

Text with typed entities: “The best BBQ:FOOD I’ve tasted in Phoenix:LOC! I had the [pulled pork sandwich]:FOOD with coleslaw:FOOD and [baked beans]:FOOD for lunch. … The owner:JOB_TITLE is very nice. …”

Target types: FOOD, LOCATION, JOB_TITLE, EVENT, ORGANIZATION

Task: identify token spans as entity mentions in documents and label their types, enabling structured analysis of an unstructured text corpus

SLIDE 27

Traditional NLP Approach: Data Annotation

Extracted and linked entities can be used in a variety of ways:

Serve as primitives for information extraction and knowledge base population

Assist question answering, …

Traditional named entity recognition systems are designed for major types (e.g., PER, LOC, ORG) and general domains (e.g., news)

Require additional steps to adapt to new domains/types

Expensive human labor on annotation

500 documents for entity extraction; 20,000 queries for entity linking

Unsatisfactory annotation agreement due to varying granularity levels and scopes of types

Entities obtained by entity linking techniques have limited coverage and freshness

>50% unlinkable entity mentions in Web corpus [Lin et al., EMNLP’12]

>90% in our experiment corpora: tweets, Yelp reviews, …


Traditional NLP Approach: Feature Engineering

 Typical Entity Extraction Features (Li et al., 2012)

N‐gram: Unigram, bigram and trigram token sequences in the context window

Part‐of‐Speech: POS tags of the context words

Gazetteers: person names, organizations, countries and cities, titles, idioms, etc.

Word clusters: word clusters / embeddings

Case and Shape: Capitalization and morphology analysis based features

Chunking: NP and VP Chunking tags

Global feature: Sentence level and document level structure/position features

SLIDE 28

Traditional NLP Approach: Feature Engineering

Typical entity linking features (Ji et al., 2011), by mention/concept attribute:

Name / Spelling match: exact string match, acronym match, alias match, string matching, …
Name / KB link mining: name pairs mined from KB text redirect and disambiguation pages
Name / Name gazetteer: organization and geo‐political entity abbreviation gazetteers
Document surface / Lexical: words in KB facts, KB text, mention name, mention text; tf.idf of words and ngrams
Document surface / Position: mention name appears early in KB text
Document surface / Genre: genre of the mention text (newswire, blog, …)
Document surface / Local context: lexical and part‐of‐speech tags of context words
Entity context / Type: mention concept type, subtype
Entity context / Relation/Event: concepts co‐occurring, attributes/relations/events with the mention
Entity context / Coreference: co‐reference links between the source document and the KB text
Entity context / Profiling: slot fills of the mention, concept attributes stored in the KB infobox
Entity context / Concept: ontology extracted from KB text
Entity context / Topic: topics (identity and lexical similarity) for the mention text and KB text
Entity context / KB link mining: attributes extracted from hyperlink graphs of the KB text
Popularity / Web: top KB text ranked by search engine and its length
Popularity / Frequency: frequency in KB texts

Session 2. ClusType: Entity Recognition and Typing by Relation Phrase-Based Clustering

SLIDE 29

Entity Recognition and Typing: A Data Mining Solution

Acquire labels for a small amount of instances

Construct a relational graph to connect labeled instances and unlabeled instances

Construct edges based on coarse‐grained data‐driven statistics instead of fine‐grained linguistic similarity

Mention correlation

Text co‐occurrence

Semantic relatedness based on knowledge graph embeddings

Social networks

Label propagation across the graph


Case Study 1: Entity Extraction

Goal: recognizing entity mentions of target types with minimal/no human supervision and with no requirement that entities can be found in a KB.

Two kinds of efforts towards this goal:

Weak supervision: relies on manually selected seed entities in applying pattern‐based bootstrapping methods or label propagation methods to identify more entities

Both assume seeds are unambiguous and sufficiently frequent → requires careful seed selection by a human

Distant supervision: leverages entity information in KBs to reduce human supervision (cont.)

SLIDE 30

Typical Workflow of Distant Supervision

Detect entity mentions from text

Map candidate mentions to KB entities of target types

Use confidently mapped {mention, type} to infer types of remaining candidate mentions
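The three steps might be sketched as follows (toy KB and mentions; a real system also disambiguates ambiguous mappings):

```python
def distant_labels(mentions, kb, target_types):
    """Split candidate mentions into seeds (confidently mapped to a KB type)
    and unlabeled mentions whose types must be inferred.
    `kb` maps a lowercased surface name to a type string."""
    seeds, unlabeled = {}, []
    for m in mentions:
        t = kb.get(m.lower())
        if t in target_types:
            seeds[m] = t          # confidently mapped {mention: type}
        else:
            unlabeled.append(m)   # type to be inferred later
    return seeds, unlabeled

kb = {"phoenix": "LOCATION", "pulled pork sandwich": "FOOD"}
mentions = ["Phoenix", "pulled pork sandwich", "in-and-out"]
seeds, unlabeled = distant_labels(mentions, kb, {"FOOD", "LOCATION"})
print(seeds)      # {'Phoenix': 'LOCATION', 'pulled pork sandwich': 'FOOD'}
print(unlabeled)  # ['in-and-out']
```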


Problem Definition

Distantly‐supervised entity recognition in a domain‐specific corpus

Given:

a corpus D

a knowledge base (e.g., Freebase)

a set of target types (T) from a KB

 Detect candidate entity mentions from corpus D

 Categorize each candidate mention by target types or Not‐Of‐Interest (NOI), with distant supervision

SLIDE 31

Challenge I: Domain Restriction

Most existing work assumes entity mentions are already extracted by existing entity detection tools, e.g., noun phrase chunkers

Usually trained on general‐domain corpora like news articles (clean, grammatical)

Make use of various linguistics features (e.g., semantic parsing structures)

Do not work well on specific, dynamic or emerging domains (e.g., tweets, Yelp reviews)

E.g., “in‐and‐out” from Yelp review may not be properly detected


Challenge II: Name Ambiguity

Multiple entities may share the same surface name

Previous methods simply output a single type/type distribution for each surface name, instead of an exact type for each entity mention

Example: “Washington” may refer to a sports team, the U.S. government, the U.S. capital city, or Washington State:

“While Griffin is not the part of Washington’s plan on Sunday’s game, …” → sports team

“…has concern that Kabul is an ally of Washington.” → U.S. government

“He has office in Washington, Boston and San Francisco.” → U.S. capital city

“… news from Washington indicates that the congress is going to …” → U.S. government

“It is one of the best state parks in Washington.” → Washington State

SLIDE 32

Challenge III: Context Sparsity

A variety of contextual clues are leveraged to find sources of shared semantics across different entities

Keywords, Wiki concepts, linguistic patterns, textual relations, …

There are often many ways to describe even the same relation between two entities

Previous methods have difficulties in handling entity mention with sparse (infrequent) context

ID | Sentence | Freq
1 | The magnitude 9.0 quake caused widespread devastation in [Kesennuma city] | 12
2 | … tsunami that ravaged [northeastern Japan] last Friday | 31
3 | The resulting tsunami devastate [Japan]’s northeast | 244


Our Solution

Domain‐agnostic phrase mining algorithm: extracts candidate entity mentions with minimal linguistic assumptions (e.g., part‐of‐speech (POS) tagging, which is far cheaper than semantic parsing) → addresses domain restriction

Do not simply merge entity mentions with identical surface names; model each mention based on its surface name and context, in a scalable way → addresses name ambiguity

Mine relation phrases co‐occurring with entity mentions; infer synonymous relation phrases. This helps form connecting bridges among entities that do not share identical context but do share synonymous relation phrases → addresses context sparsity

SLIDE 33

A Relation Phrase-Based Entity Recognition Framework

POS-constrained phrase segmentation for mining candidate entity mentions and relation phrases simultaneously

Construct a heterogeneous graph to represent the available information in a unified form; entity mentions are kept as individual objects to be disambiguated, linked to entity surface names and relation phrases

66

A Relation Phrase-Based Entity Recognition Framework

 With the constructed graph, formulate a graph-based semi-supervised learning of two tasks jointly:

  • Type propagation on the heterogeneous graph
  • Multi-view relation phrase clustering

Type information propagates among entities bridged via synonymous relation phrases, and the derived entity argument types serve as good features for clustering relation phrases. The two tasks mutually enhance each other, leading to quality recognition of unlinkable entity mentions.

SLIDE 34

67

Framework Overview

1. Perform phrase mining on a POS-tagged corpus to extract candidate entity mentions and relation phrases
2. Construct a heterogeneous graph to encode our insights on modeling the type for each entity mention
3. Collect seed entity mentions as labels by linking extracted mentions to the KB
4. Estimate type indicators for unlinkable candidate mentions with the proposed type propagation integrated with relation phrase clustering on the constructed graph

68

Candidate Generation

An efficient phrase mining algorithm incorporating both corpus‐level statistics and syntactic constraints

Global significance score: filters low-quality candidates; generic POS tag patterns: remove phrases with improper syntactic structure

Extending TopMine, the algorithm partitions the corpus into segments that meet both the significance threshold and the POS patterns → candidate entity mentions & relation phrases

Algorithm workflow:
1. Mine frequent contiguous patterns
2. Perform greedy agglomerative merging while enforcing the syntactic constraints
  • Entity mentions: consecutive nouns
  • Relation phrases: patterns shown in the table
3. Terminate when the next highest-score merge does not meet a pre-defined significance threshold

(Relation phrase: a phrase that denotes a unary or binary relation in a sentence.)
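The merging loop above can be sketched in a few lines. This is a minimal, corpus-statistics-only sketch: the POS-pattern constraints of the actual algorithm are omitted, and the significance function is an assumed TopMine-style z-score, not ClusType's exact formula. The toy counts are hypothetical.

```python
import math
from collections import Counter

def significance(cnt, total, a, b):
    """TopMine-style significance of merging adjacent segments a and b:
    deviation of the observed count of the merged phrase from the
    independence expectation, in standard-deviation units."""
    expected = cnt[a] * cnt[b] / total
    if expected == 0:
        return 0.0
    return (cnt[a + b] - expected) / math.sqrt(expected)

def segment(tokens, cnt, total, threshold=2.0):
    """Greedily merge the best-scoring adjacent pair of segments until no
    merge meets the significance threshold; returns phrase segments."""
    segs = [(t,) for t in tokens]
    while len(segs) > 1:
        scores = [significance(cnt, total, segs[i], segs[i + 1])
                  for i in range(len(segs) - 1)]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] < threshold:
            break
        segs[best:best + 2] = [segs[best] + segs[best + 1]]
    return segs

# Toy n-gram counts (hypothetical): "data mining" is a strong phrase.
cnt = Counter({('the',): 50, ('data',): 10, ('mining',): 10,
               ('data', 'mining'): 8, ('the', 'data'): 1})
print(segment(['the', 'data', 'mining'], cnt, total=1000))
# [('the',), ('data', 'mining')]
```

"data mining" is merged because its observed count far exceeds the independence expectation; "the data mining" is not, because its trigram count does not.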
SLIDE 35

69

Candidate Generation

Example output of candidate generation on NYT news articles

Entity detection performance comparison with an NP chunker

Recall is most critical for this step, since later we cannot detect the misses (i.e., false negatives)

70

Construction of Heterogeneous Graphs

With three types of objects extracted from corpus: candidate entity mentions, entity surface names, and relation phrases

We can construct a heterogeneous graph to enforce several hypotheses for modeling type of each entity mention (introduced in the following slides)

Basic idea for constructing the graph: the more two objects are likely to share the same label, the larger the weight will be associated with their connecting edge

Three types of links:

  • 1. Mention-name links: (many-to-one) mappings between entity mentions and surface names
  • 2. Name-relation phrase links: corpus-level co-occurrence between surface names and relation phrases
  • 3. Mention correlation links: distributional similarity between entity mentions
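The first two link types can be assembled directly from extraction triples; a minimal sketch (the triples, variable names, and simple count weighting are illustrative assumptions, and the third link type is built separately from mention feature similarity):

```python
import numpy as np

# Hypothetical extractions: (mention_id, surface_name, relation_phrase).
# Mention ids double as row indices.
triples = [(0, 'Washington', 'is an ally of'),
           (1, 'Washington', "'s plan on"),
           (2, 'Kabul', 'is an ally of')]

mentions = sorted({m for m, _, _ in triples})
names = sorted({n for _, n, _ in triples})
phrases = sorted({p for _, _, p in triples})

# 1. Mention-name links: a many-to-one bi-adjacency matrix Pi.
Pi = np.zeros((len(mentions), len(names)))
# 2. Name-relation phrase links: corpus-level co-occurrence counts.
W = np.zeros((len(names), len(phrases)))
for m, n, p in triples:
    Pi[m, names.index(n)] = 1
    W[names.index(n), phrases.index(p)] += 1
```

Each mention row of `Pi` has exactly one nonzero entry (the many-to-one mapping), while `W` aggregates evidence over the whole corpus.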

SLIDE 36

71

Entity Mention-Surface Name Subgraph

Directly modeling the type indicator of each entity mention in label propagation → intractable size of the parameter space

Both the entity name and the surrounding relation phrases provide strong cues on the type of a candidate entity mention → model the type of each entity mention by (1) the type indicator of its surface name and (2) the type signatures of its surrounding relation phrases (more details in the following slides)

Example: "…has concerns whether Kabul is an ally of Washington" → Washington: GOVERNMENT

With M candidate mentions and n surface names, a bi-adjacency matrix represents the mention-name mapping

72

Entity Name-Relation Phrase Subgraph

Aggregated co‐occurrences between entity surface names and relation phrases across corpus  weight importance of different relation phrases for surface names  use connected edges as bridges to propagate type information

Left/right entity argument of relation phrase: for each mention, assign it as the left (right, resp.) argument to the closest relation phrase on its right (left, resp.) in a sentence

Type signature of relation phrase: Two type indicators for its left and right arguments

With l distinct relation phrases, the mapping between mentions and relation phrases is captured by two bi-adjacency matrices for this subgraph (one for left arguments, one for right arguments)
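The left/right argument assignment rule can be written down directly from its description. This phrase-centric sketch (token positions and names are illustrative) attaches, for each relation phrase, the closest mention on either side:

```python
def assign_arguments(mention_positions, phrase_positions):
    """For each relation phrase, take as its LEFT argument the closest
    mention to its left, and as its RIGHT argument the closest mention
    to its right (positions are token offsets within one sentence).
    A phrase-centric restatement of the mention-centric rule."""
    args = {}
    for pp in phrase_positions:
        left = max((mp for mp in mention_positions if mp < pp), default=None)
        right = min((mp for mp in mention_positions if mp > pp), default=None)
        args[pp] = (left, right)
    return args

# "[Kabul] is an ally of [Washington]": mentions at 0 and 5, phrase at 1.
print(assign_arguments([0, 5], [1]))  # {1: (0, 5)}
```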

SLIDE 37

73

Mention Correlation Subgraph

An entity mention may have ambiguous name and ambiguous relation phrases

E.g., "White House" and "felt" in the first sentence of the figure

Other co‐occurring mentions may provide good hints to the type of an entity mention

E.g., “birth certificate” and “rose garden” in the Figure

Construct a KNN graph based on feature vectors (surface names of co-occurring entity mentions)

→ Propagate type information between candidate mentions of each surface name, based on the hypothesis that mentions with highly similar contexts tend to share the same type
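A small sketch of building that KNN subgraph over mention feature vectors; the cosine similarity measure and the value of k are assumptions, not the paper's exact configuration:

```python
import numpy as np

def knn_graph(X, k=2):
    """Build a KNN adjacency over rows of feature matrix X using cosine
    similarity; edge weights are the similarities to the k neighbors."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T
    np.fill_diagonal(S, -1.0)          # exclude self-loops
    A = np.zeros_like(S)
    for i in range(len(S)):
        nbrs = np.argsort(S[i])[-k:]   # indices of the k most similar rows
        A[i, nbrs] = S[i, nbrs]
    return np.maximum(A, A.T)          # symmetrize
```

Type information can then be propagated along the resulting weighted edges.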

74

Relation Phrase Clustering

Observation: many relation phrases have very few occurrences in the corpus

~37% of relation phrases have fewer than 3 unique entity surface names (as right or left arguments)

 Hard to model their type signature based on aggregated co‐occurrences with entity surface names (i.e., Hypothesis 1)

Softly clustering synonymous relation phrases:  the type signatures of frequent relation phrases can help infer the type signatures of infrequent (sparse) ones that have similar cluster memberships

SLIDE 38

75

Relation Phrase Clustering

Existing work on relation phrase clustering utilizes string similarity, context words, and entity arguments to cluster synonymous relation phrases

String similarity and distributional similarity may be insufficient to distinguish two relation phrases; type information is particularly helpful in such cases

We propose to leverage the type signature of a relation phrase, and a general relation phrase clustering method that incorporates different features (type signatures, string features, context features), further integrated with the graph-based type propagation in a mutually enhancing framework, based on the following hypothesis

76

Type Inference: A Joint Optimization Problem

Mention modeling & mention correlation (Hypothesis 2); multi-view relation phrase clustering (Hypotheses 3 & 4); type propagation between entity surface names and relation phrases (Hypothesis 1)

SLIDE 39

77

The ClusType Algorithm

Can be efficiently solved by alternating minimization based on a block coordinate descent algorithm

Algorithm complexity is linear in the number of entity mentions, relation phrases, clusters, clustering features and target types

The ClusType algorithm:
  • Update type indicators and type signatures
  • For each view, perform single-view NMF until convergence
  • Update the consensus matrix and the relative weights of the different views
  • Repeat until the objective converges
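The per-view factorization step can be illustrated with generic Lee-Seung multiplicative updates — a stand-in sketch for "single-view NMF", not ClusType's exact joint update rules:

```python
import numpy as np

def single_view_nmf(X, k, iters=300, seed=0):
    """Factor a nonnegative matrix X ~= U @ V with multiplicative
    updates; each update never increases the squared reconstruction
    error (generic NMF, not the paper's exact objective)."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], k))
    V = rng.random((k, X.shape[1]))
    for _ in range(iters):
        U *= (X @ V.T) / (U @ V @ V.T + 1e-12)
        V *= (U.T @ X) / (U.T @ U @ V + 1e-12)
    return U, V
```

In a block coordinate descent scheme, such a subroutine is run for one block of variables while the others are held fixed, then the blocks are cycled until the overall objective converges.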

78

Experiment Setting

Datasets: 2013 New York Times news (~110k docs) [event, PER, LOC, ORG]; Yelp Reviews (~230k) [Food, Job, …]; 2011 Tweets (~300k) [event, product, PER, LOC, …]

Seed mention sets: < 7% of extracted mentions can be mapped to Freebase entities

Evaluation sets: manually annotate mentions of target types for subsets of the corpora

Evaluation metrics: Follows named entity recognition evaluation (Precision, Recall, F1)
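The evaluation metric is the standard exact-match NER scoring; for concreteness, a small helper over sets of (span, type) predictions (the data shapes are assumptions):

```python
def ner_prf1(predicted, gold):
    """Exact-match NER scoring: predicted and gold are sets of
    (mention_span, type) pairs; returns (precision, recall, F1)."""
    tp = len(predicted & gold)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = {(('Washington',), 'GOV'), (('Kabul',), 'GOV')}
pred = {(('Washington',), 'GOV'), (('Boston',), 'LOC')}
print(ner_prf1(pred, gold))  # (0.5, 0.5, 0.5)
```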

Compared methods

  • Pattern: Stanford pattern-based learning
  • SemTagger: bootstrapping method that trains a contextual classifier based on seed mentions
  • FIGER: distantly-supervised sequence labeling method trained on a Wikipedia corpus
  • NNPLB: label propagation using ReVerb assertions and seed mentions
  • APOLLO: mention-level label propagation using Wikipedia concepts and KB entities
  • ClusType-NoWm: ignores mention correlation
  • ClusType-NoClus: conducts only type propagation
  • ClusType-TwoStep: performs hard clustering first, then type propagation

SLIDE 40

79

Comparing ClusType with Other Methods and Its Variants

  • vs. FIGER: shows the effectiveness of our candidate generation and the proposed hypotheses on type propagation
  • vs. NNPLB and APOLLO: ClusType not only utilizes semantically rich relation phrases as type cues, but also clusters synonymous relation phrases to tackle context sparsity
  • vs. variants: (i) models mention correlation for name disambiguation; (ii) integrates clustering in a mutually enhancing way

46.08% and 48.94% improvement in F1 score over the best baseline on the Tweet and Yelp datasets, respectively

80

Comparing on Different Entity Types

Obtains larger gains on organization and person (more entities with ambiguous surface names); modeling types at the entity mention level is critical for name disambiguation. The superior performance on product and food mainly comes from the domain independence of our method; both NNPLB and SemTagger require sophisticated linguistic feature generation, which is hard to adapt to new types.

SLIDE 41

81

Comparing with a Trained NER System

Compare with Stanford NER, which is trained on general‐domain corpora including ACE corpus and MUC corpus, on three types: PER, LOC, ORG

ClusType and its variants outperform Stanford NER on both dynamic corpus (NYT) and domain‐specific corpus (Yelp)

ClusType has lower precision but higher recall and F1 score on Tweets → the superior recall of ClusType mainly comes from domain-independent candidate generation

82

Example Output and Relation Phrase Clusters

Extracts more mentions and predicts types with higher accuracy

Not only synonymous relation phrases, but also sparse and frequent relation phrases can be clustered together

→ boosts sparse relation phrases with the type information of frequent relation phrases
SLIDE 42

83

Testing on Context Sparsity and Surface Name Popularity

Context sparsity:

Group A: frequent relation phrases

Group B: sparse relation phrases

ClusType obtains superior performance over its variants on Group B

→ clustering relation phrases is critical for sparse relation phrases

Surface name popularity:

Group A: high frequency surface name

Group B: infrequent surface name

ClusType outperforms its variants on Group B

 Handles well mentions with insufficient corpus statistics

84

Conclusions and Future Work

Study distantly‐supervised entity recognition for domain‐specific corpora and propose a novel relation phrase‐based framework

A data‐driven, domain‐agnostic phrase mining algorithm for candidate entity mentions and relation phrase generation

Integrate relation phrase clustering with type propagation on heterogeneous graphs, solved as a joint optimization problem. Ongoing work:

Extend to role discovery for scientific concepts  paper profiling (research/demo)

Study of relation phrase clustering, such as

joint entity/relation clustering

synonymous relation phrase canonicalization

Study of joint entity and relation phrase extraction with phrase mining

SLIDE 43

Session 3. Entity Linking Based on Semi-Supervised Graph Regularization

86

Example tweet: "Stay up Hawk Fans. We are going through a slump, but we have to stay positive. Go Hawks!"

Entity Linking: A Relational Graph Approach

SLIDE 44

87

Relevant Mention Detection: Meta Path

A meta‐path is a path defined over a network and composed of a sequence of relations between different object types (Sun et al., 2011)

Each meta-path represents a semantic relation (more in the next theme)

Meta-paths between two mentions (M: mention, T: tweet, U: user, H: hashtag):

  • M-T-M
  • M-T-U-T-M
  • M-T-H-T-M
  • M-T-U-T-M-T-H-T-M
  • M-T-H-T-M-T-U-T-M

Schema of a Heterogeneous Information Network in Twitter
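Counting meta-path instances over such a schema reduces to multiplying incidence matrices. A toy sketch with assumed matrices (rows/columns index mentions, tweets, and users):

```python
import numpy as np

# Toy incidence matrices (hypothetical): MT[i, j] = 1 if mention i
# occurs in tweet j; TU[j, u] = 1 if tweet j is posted by user u.
MT = np.array([[1, 0, 0],
               [0, 1, 0],
               [0, 0, 1]])
TU = np.array([[1, 0],
               [1, 0],
               [0, 1]])

# Number of M-T-M path instances between mentions (same tweet):
mtm = MT @ MT.T
# Number of M-T-U-T-M instances (same user, possibly different tweets):
mtutm = MT @ TU @ TU.T @ MT.T
```

Here mentions 0 and 1 never co-occur in a tweet (`mtm[0, 1] == 0`) but are connected through user 0's two tweets (`mtutm[0, 1] == 1`) — exactly the kind of evidence the M-T-U-T-M path contributes.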

88

Relational Graph Construction

Local Compatibility

Mention Features (e.g., idf, keyphraseness)

Concept Features (e.g., # of incoming/outgoing links)

Mention + Concept Features (e.g., prior popularity, tf)

Context Features (e.g., capitalization, tf‐idf)


  • Semantic Relatedness (SR)
  • SR between two mentions: meta-paths
  • SR between two concepts: link structure in Wikipedia (Milne and Witten, 2008)
  • Linear combination of these three graphs

 Coreference: at least one meta-path exists between two similar mentions

SLIDE 45

89

A Deep Semantic Relatedness Model (DSRM)

Semantic Knowledge Graphs

[Figure: a semantic knowledge graph around "Miami Heat" — type: Professional Sports Team (description); founded: 1988; location: Miami; coach: Erik Spoelstra; roster member: Dwyane Wade; part of the National Basketball Association; plus an unrelated node "Titanic".]

90

The DSRM Architecture

[Figure: the DSRM architecture. For each concept pair (ci, cj), a sparse feature vector over entity descriptions (Di, ~1M terms), connected entities (Ci, ~4M), relations (Ri, ~3.2K) and entity types (CTi, ~1.6K) is compressed by a word hashing layer to ~105K dimensions (50K + 50K + 3.2K + 1.6K), then passed through multi-layer nonlinear projections to a 300-dimensional semantic layer; semantic relatedness SR(ci, cj) is the cosine similarity of the two semantic vectors. The input is illustrated with the knowledge subgraph around "Miami Heat".]
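The relatedness computation at the top of the architecture is a cosine over the two semantic-layer vectors. A dimension-reduced sketch — the layer sizes, the tanh nonlinearity, and the random weights are placeholders for the trained model:

```python
import numpy as np

def semantic_layer(x, weights):
    """Project a concept's (hashed) feature vector through stacked
    nonlinear layers to the semantic layer."""
    h = x
    for W in weights:
        h = np.tanh(W @ h)
    return h

def sr(xi, xj, weights):
    """SR(ci, cj): cosine similarity of the two semantic vectors."""
    hi, hj = semantic_layer(xi, weights), semantic_layer(xj, weights)
    return float(hi @ hj / (np.linalg.norm(hi) * np.linalg.norm(hj)))
```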

SLIDE 46

91

Data and Scoring Metric

  • Data
  • A Wikipedia dump on May 3, 2013
  • A portion of Freebase limited to the Wikipedia concepts
  • Wikification: a public data set including 502 messages from 28 users (Meij et al., 2012)
  • Semantic relatedness: a benchmark test set including 3,314 concepts as testing queries (Ceccarelli et al., 2013)

  • Scoring Metric
  • Wikification
  • Standard precision, recall and F1
  • Semantic relatedness
  • nDCG
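For reference, nDCG over graded relevance scores in ranked order — a standard formulation (the exact gain/discount variant used in the benchmark may differ):

```python
import math

def ndcg(relevances, k=None):
    """nDCG@k with gain 2^rel - 1 and log2 position discount."""
    def dcg(rels):
        return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels))
    cut = relevances[:k] if k else list(relevances)
    ideal = sorted(relevances, reverse=True)[:len(cut)]
    return dcg(cut) / dcg(ideal) if dcg(ideal) > 0 else 0.0

print(ndcg([3, 2, 1]))  # 1.0 (already ideally ordered)
```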

92

Overall Performance

  • TagMe: unsupervised model
  • Meij: supervised model; use 100% labeled data
  • SSRegu: use 50% labeled data

7.5% absolute F1 gain over the state-of-the-art supervised model:

Method          Precision  Recall  F1
TagMe           32.9%      42.3%   37.0%
Meij            39.3%      59.8%   47.5%
SSRegu + M&W    65.0%      44.1%   52.5%
SSRegu + DSRM   59.0%      51.6%   55.0%

SLIDE 47

93

Semantic Relatedness: Examples

Semantic relatedness scores between a sample of concepts and the concept ”National Basketball Association” in sports domain.

Concept              M&W   DSRM
New York City        0.92  0.22
New York Knicks      0.78  0.79
Washington, D.C.     0.80  0.30
Washington Wizards   0.60  0.85
Atlanta              0.71  0.39
Atlanta Hawks        0.53  0.83
Houston              0.55  0.37
Houston Rockets      0.49  0.80

94

Summary on Relational Graph-Based Entity Linking

Compared to traditional label propagation‐based methods

Data driven methods to construct relational graphs

Integrate multi-view clustering with graph-based label propagation

Compared to traditional distantly‐supervised methods

Domain‐independent entity candidate generation: jointly extract entity mentions and their relation phrases in an unsupervised way

Resolve context (relation phrase) sparsity by integrating type propagation with relation clustering

Fully exploit entries and structures in knowledge bases

Potential Applications to other NLP Tasks

Where data annotation is costly

Where a relational graph among labeled seeds and unlabeled instances can be constructed based on data-driven metrics

Toward fully “Liberal” IE: Discover schema and extract/link entities simultaneously

SLIDE 48

95

References for Theme 2: Entity Recognition and Typing

 X. Ren, A. El-Kishky, C. Wang, F. Tao, C. R. Voss, H. Ji and J. Han. ClusType: Effective Entity Recognition and Typing by Relation Phrase-Based Clustering. KDD'15.
 H. Huang, Y. Cao, X. Huang, H. Ji and C. Lin. Collective Tweet Wikification based on Semi-supervised Graph Regularization. ACL'14.
 T. Lin, O. Etzioni, et al. No noun phrase left behind: detecting and typing unlinkable entities. EMNLP'12.
 N. Nakashole, T. Tylenda, and G. Weikum. Fine-grained semantic typing of emerging entities. ACL'13.
 R. Huang and E. Riloff. Inducing domain-specific semantic class taggers from (almost) nothing. ACL'10.
 X. Ling and D. S. Weld. Fine-grained entity recognition. AAAI'12.
 W. Shen, J. Wang, P. Luo, and M. Wang. A graph-based approach for ontology population with named entities. CIKM'12.
 S. Gupta and C. D. Manning. Improved pattern learning for bootstrapped entity extraction. CoNLL'14.
 P. P. Talukdar and F. Pereira. Experiments in graph-based semi-supervised learning methods for class-instance acquisition. ACL'10.
 Z. Kozareva and E. Hovy. Not all seeds are equal: Measuring the quality of text mining seeds. NAACL'10.
 L. Galárraga, G. Heitz, K. Murphy, and F. M. Suchanek. Canonicalizing open knowledge bases. CIKM'14.

Theme III. Search and Mining Structured Graphs and Heterogeneous Networks for NLP

96

SLIDE 49

97

Outline

Directly converting a large amount of data into knowledge is too ambitious and unrealistic

Once we construct graphs or heterogeneous networks, we can perform many powerful search and mining tasks

Many NLP problems are about graphs and networks – what’s the DM point of view?

Search: Graph index creation and approximate structural search

NLP Application: Schemaless queries

Classification: Graph pattern mining and pattern‐based classification

NLP Application: Distinguishing authors based on their writing styles

Meta‐path‐based similarity search in heterogeneous networks

NLP Application: Entity morph decoding

Meta‐path‐based mining in heterogeneous networks: Prediction and recommendation

NLP Application: Knowledge base completion; multi‐hop question answering, causal event prediction

Section 1: Search Graphs and Networks

SLIDE 50

99

Graph Indexing

Graph query: Find all the graphs in a graph DB containing a given query graph

[Figure: a graph DB with graphs (a), (b), (c) and a substructure query graph Q. Path indices (C, C-C, C-C-C, C-C-C-C) cannot prune (a) & (b), yet only graph (c) actually contains Q.]

Index should be a powerful tool

Path‐index may not work well

Solution: Index directly on substructures (i.e., graphs)

100

Many NLP Problems can be Converted into Graph Search

 Entity Linking (Pan et al., 2015): a query graph constructed from natural-language documents using Abstract Meaning Representation is matched against a knowledge graph

SLIDE 51

101

Many NLP Problems can be Converted into Graph Search

 Bursty Information Network Decipherment (Tao et al., 2015, submission): a query graph constructed from Chinese is matched against an English knowledge graph

102

gIndex: Indexing Frequent and Discriminative Substructures

Why index frequent substructures?

Too many substructures to index

Size‐increasing support threshold

Large structures will likely be indexed well by their substructures

Why discriminative substructures? Reduce the index size by an order of magnitude.

[Figure: size-increasing support curve — the min-support threshold grows with substructure size.]

Selection: given a set of already selected structures f1, f2, …, fn and a new structure x, the extra indexing power of x is measured by Pr(x | f1, f2, …, fn); when this probability is small enough, x is a discriminative structure and should be included in the index.

Experiments show gIndex is small, effective and stable: all structures (> 10^6) → frequent (~10^5) → discriminative (~10^3)
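The selection rule can be sketched over support sets (sets of graph ids containing each substructure). The intersection approximation of Pr(x | f1, …, fn) and the gamma cutoff follow the gIndex idea; the data structures and the toy candidates are assumptions:

```python
def select_discriminative(candidates, gamma=0.5):
    """candidates: (name, support_set, indexed_substructures), enumerated
    in size-increasing order.  Keep x only if its support shrinks enough
    relative to the intersection of its already-indexed substructures'
    supports, i.e. Pr(x | f1..fn) ~ |D(x)| / |D(f1) & ... & D(fn)|
    is small, so x adds real pruning power."""
    index = {}
    for name, support, subs in candidates:
        covered = [index[s] for s in subs if s in index]
        base = set.intersection(*covered) if covered else None
        if base is None or len(support) <= gamma * len(base):
            index[name] = support
    return index

candidates = [('C',       {1, 2, 3, 4}, []),
              ('C-C',     {1, 2, 3, 4}, ['C']),         # no extra power
              ('benzene', {1},          ['C', 'C-C'])]  # strong pruning
print(sorted(select_discriminative(candidates)))  # ['C', 'benzene']
```

'C-C' is skipped because it filters exactly the same graphs as 'C'; 'benzene' is kept because it prunes the candidate set from four graphs down to one.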

SLIDE 52

103

Structure Similarity Search Assisted with Graph Indices

 The necessity can be clearly shown in search for

similar chemical compounds

 How to conduct approximate search? Build indices covering all similar subgraphs? No! Idea: (1) keep the index structure; (2) select features in the query space

Query relaxation measure: features matter more than strict substructure matching

 Only a set of smaller features needs to be indexed

(a) caffeine (b) diurobromine (c) viagra

Query graph Data graphs

Feature set

104

Schemaless Queries over Knowledge Graphs

Knowledge Graphs are ubiquitous

Queries on KGs are mostly schemaless NL queries

Challenges: ambiguities in query resolution

[Figure: an example knowledge graph with typed nodes (Video, Photo, Music, Business, University; Yellowstone NP, Bison, Mammal, Football League, Country) and edges such as join, like, follow; one node card reads — Name: Bison; Class: Mammal; Phylum: Chordate; Order: Even-toed ungulate; Comment: "Bison are large even-toed ungulates within the subfamily …". Courtesy of Shengqi Yang, UCSB.]

 Ex: When was UIUC founded? And how?

SLIDE 53

105

Major Challenges in Schemaless Query Answering

 Extracting entities and relationships from natural language queries
 Mismatch between knowledge graph and queries: figure out the best transformation

Knowledge graph ↔ query examples:
  • "the University of Illinois at Urbana-Champaign" ↔ "UIUC"
  • "neoplasm" ↔ "tumor"
  • "Doctor" ↔ "Dr."
  • "Barack Obama" ↔ "Obama"
  • "Jeffrey Jacob Abrams" ↔ "J. J. Abrams"
  • "teacher" ↔ "educator"
  • "1980" ↔ "~30"
  • "3 mi" ↔ "4.8 km"
  • "Hinton" - "DNNresearch" - "Google" ↔ "Hinton" - "Google"

106

Processing of Schemaless Queries by Pattern Matching

Users freely post queries in natural language, without knowledge on data graphs

Schemaless query (SLQ) system finds results through a set of transformations

Ex. Query: "Prof., ~70 yrs, Google, UT"

A match: Geoffrey Hinton (Professor, 1947) — University of Toronto — DNNresearch — Google

  • Acronym transformation: 'UT' → 'University of Toronto'
  • Abbreviation transformation: 'Prof.' → 'Professor'
  • Numeric transformation: '~70' → '1947'
  • Structural transformation: an edge → a path

SLIDE 54

107

Automatic Generation of Training Data

1. Sampling: a set of subgraphs is randomly extracted from the data graph
2. Query generation: queries are generated by randomly adding transformations on the extracted subgraphs
3. Searching: the generated queries are searched on the data graph
4. Labeling: results are labeled based on the original subgraphs
5. Training: the queries and labeled results are used to estimate the weights of the different transformations in the ranking model

[Figure: the pipeline on a toy knowledge graph — sample subgraphs around "Tom", add transformations, label results such as Tom Cruise vs. Samuel Tom, then train the ranking model.]

108

Online Query Processing Using Knowledge Graphs

 Finding the results to a query → a configuration of latent random variables in CRFs

Inference in Conditional Random Fields (CRFs):
  • Top-1 result: computing the most likely assignment
  • Approximate inference: loopy belief propagation
  • Top-K result generation: finding the M most probable configurations using loopy belief propagation [Yanover et al., NIPS'04]

Experiments: Queries generated on YAGO2

SLQ [Yang, et al, VLDB’14] shows high performance

SLIDE 55

Section 2: Graph Pattern Mining and Pattern-Based Classification

110

Application to NLP: Authorship Classification

 Writing style is one of the best indicators of original authorship
 Substructures: k-embedded-edge subtree patterns hold more information than basic syntactic features such as function words, POS (part-of-speech) tags, and rewrite rules
 Ex: for a k-embedded-edge subtree t mined from NYT journalists Jack Healy and Eric Dash, on average 21.2% of Jack's sentences contained t while only 7.2% of Eric's did

SLIDE 56

111

Structural Pattern Tells More on Authorship Classification

Binned information gain score distribution of various feature sets

FW: function words

POS: POS tags

BPOS: bigram POS tags

RR: rewrite rules

k-ee: k-embedded-edge subtrees

[Figure: # features for the FW, POS, RR and k-ee feature sets]

Accuracy comparison on # authors and various datasets

Section 3: Similarity Computation in Heterogeneous Networks

SLIDE 57

113

Structures Facilitate Mining Heterogeneous Networks

 Network construction: generates structured networks from unstructured text data  Each node: an entity; each link: a relationship between entities  Each node/link may have attributes, labels, and weights  Heterogeneous, multi‐typed networks: e.g., Medical network: Patients, doctors,

diseases, contacts, treatments

[Figure: example heterogeneous networks — the DBLP bibliographic network (Venue, Paper, Author), the IMDB movie network (Actor, Movie, Director, Studio), and the Facebook network.]

114

Why Mining Typed Heterogeneous Networks?

 A homogeneous network can be derived from its “parent” heterogeneous network  Ex. Coauthor networks from the original author‐paper‐conference networks  Heterogeneous networks carry richer info. than the projected homogeneous networks  Typed nodes & links imply more structures, leading to richer discovery  Ex.: DBLP: A Computer Science bibliographic database (network) Knowledge hidden in DBLP Network Mining Functions

Knowledge hidden in the DBLP network → mining function:
  • Who are the leading researchers on Web search? → Ranking
  • Who are the peer researchers of Jure Leskovec? → Similarity search
  • Whom will Christos Faloutsos collaborate with? → Relationship prediction
  • Which relationships are most influential for an author to decide her topics? → Relation strength learning
  • How has the field of Data Mining emerged and evolved? → Network evolution
  • Which authors are rather different from their peers in IR? → Outlier/anomaly detection

SLIDE 58

115

Meta-Path and Similarity Search in Heterogeneous Networks

 Similarity measure/search is the basis for cluster analysis
 Who are the most similar to Christos Faloutsos based on the DBLP network?
 Meta-path: a meta-level description of a path between two objects — a path on the network schema that denotes an existing or concatenated relation between two object types
 Different meta-paths carry different semantics:
  • A-P-A (Author-Paper-Author), the co-authorship meta-path → Christos' students or close collaborators
  • A-P-V-P-A (Author-Paper-Venue-Paper-Author) → authors working in similar fields with similar reputation

116

Existing Popular Similarity Measures for Networks

 Random walk (RW): the probability of a random walk starting at x and ending at y, following meta-path P — used in Personalized PageRank (P-PageRank) (Jeh and Widom, 2003); favors highly visible objects (i.e., objects with large degrees):

s(x, y) = Σ_{p ∈ P} prob(p)

 Pairwise random walk (PRW): the probability of a pairwise random walk starting at (x, y) and ending at a common object z, following meta-paths (P1, P2), where P2^{-1} runs from y to z — used in SimRank (Jeh and Widom, 2002); favors pure objects (i.e., objects with highly skewed in-link or out-link distributions):

s(x, y) = Σ_{(p1, p2) ∈ (P1, P2)} prob(p1) · prob(p2)

Note: P-PageRank and SimRank do not distinguish object types and relationship types.

SLIDE 59

117

Which Similarity Measure Is Better for Finding Peers?

 Meta-path: APCPA (author-paper-conference-paper-author)
 Mike publishes a similar number of papers as Bob and Mary
 Other measures find Mike closer to Jim

Author\Conf.   SIGMOD  VLDB  ICDM  KDD
Mike           2       1     -     -
Jim            50      20    -     -
Mary           2       -     1     -
Bob            2       1     -     -
Ann            -       -     1     1

Measure\Author  Jim     Mary    Bob     Ann
P-PageRank      0.376   0.013   0.016   0.005
SimRank         0.716   0.572   0.713   0.184
Random Walk     0.8983  0.0238  0.0390  -
Pairwise R.W.   0.5714  0.4440  0.5556  -
PathSim (APCPA) 0.083   0.8     1       -

Who is more similar to Mike?

Comparison of multiple measures on a toy example: PathSim favors peers — objects with strong connectivity and similar visibility under a given meta-path
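PathSim on this toy example is two lines of linear algebra. The author-conference counts below are a reconstruction of the flattened table (Mary's and Ann's venue columns are inferred from the reported scores):

```python
import numpy as np

# Rows: Mike, Jim, Mary, Bob, Ann; cols: SIGMOD, VLDB, ICDM, KDD.
W = np.array([[ 2,  1, 0, 0],
              [50, 20, 0, 0],
              [ 2,  0, 1, 0],
              [ 2,  1, 0, 0],
              [ 0,  0, 1, 1]])

M = W @ W.T  # APCPA path-instance counts between authors

def pathsim(i, j):
    """PathSim: 2 * (# paths between i and j) / (# i-i + # j-j paths)."""
    return 2 * M[i, j] / (M[i, i] + M[j, j])

print(round(pathsim(0, 1), 3), pathsim(0, 2), pathsim(0, 3))
# 0.083 0.8 1.0
```

The self-path terms in the denominator penalize Jim's huge visibility, so Bob (identical profile) scores 1 while Jim scores only 0.083 — the "peer" behavior the slide describes.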

118

Example with DBLP: Find Academic Peers by PathSim

  • Anhai Doan — CS, Wisconsin; database area; PhD 2002
  • Jun Yang — CS, Duke; database area; PhD 2001
  • Amol Deshpande — CS, Maryland; database area; PhD 2004
  • Jignesh Patel — CS, Wisconsin; database area; PhD 1998

Meta-Path: Author-Paper-Venue-Paper-Author

SLIDE 60

119

Application to NLP: Entity Morph Resolution

Examples of morphs (coded aliases) and their real targets:
  • "Conquer West King" (平西王) = "Bo Xilai" (薄熙来)
  • "Tender Beef Pentagon" (嫩牛五方) = "Yang Mi" (杨幂)
  • "instant noodles" (方便面) = "Zhou Yongkang" (周永康)
  • "The Hutt" = Chris Christie

120

Target Candidate Identification

  • Considering all entities would be too overwhelming

SLIDE 61

121

Further Narrow down by Social Correlation

122

Further Narrow down by Social Correlation

 A morph and its target are likely to be posted by two users with strong social

correlation at Weibo and Twitter respectively (test data: average social correlation = 0.923)

 Explicit Social Correlation between users from re‐tweet, mentioning, reply and

follower networks (Wen and Lin, 2010)

 Compute the degree of separation in user interactions and the amount of

interactions

 Infer implicit social correlation between users by topic modeling and opinion mining

 Users who share similar interests are likely to post similar information and opinions

 Measure content similarity of the messages posted by two users

 Temporal + Social Correlation: Narrow down the number of target candidates to 1%

with 100% accuracy

SLIDE 62

123

Conquer West King from Chongqing fell from power, still need to sing red songs?

There is no difference between that guy’s plagiarism and Buhou’s gang crackdown.

Remember that Buhou said that his family was not rich at the press conference a few days before he fell from power. His son Bo Guagua is supported by his scholarship.

Bo Xilai: ten thousand letters of accusation have been received during Chongqing gang crackdown.

The webpage of "Tianze Economic Study Institute" owned by the liberal party has been closed. This is the first affected website of the liberal party after Bo Xilai fell from power.

Bo Xilai gave an explanation about the source of his son, Bo Guagua’s tuition.

Bo Xilai led Chongqing city leaders and 40 district and county party and government leaders to sing red songs.

Weibo (censored) Twitter and Chinese News (uncensored)

Cross-genre Comparable Corpora

124

[Figure: a heterogeneous information network over entities such as Bo Guagua, Conquer West King, Chongqing, Earth Common, Bo Xilai, CCP, Wen Jiabao, Best Actor, China (types PER, GPE, ORG), with edges labeled Children, Top_employee, Justice, Affiliation, Located, Member.]

Each node is an entity; an edge is a semantic relation, event, sentiment, semantic role, dependency relation or co-occurrence, associated with confidence values.

Constructing Heterogeneous Information Networks

SLIDE 63

125

Meta-paths

Network schema — M: morphs, E: entities, EV: events, NP: non-entity noun phrases

  • Meta‐path Examples:
  • M – E – E
  • M – EV – E
  • M – NP – E

126

Data and Scoring Metric


  • Data
  • Time frame: 05/01/2012‐06/30/2012
  • 1555K Chinese messages from Weibo
  • 66K formal web documents from embedded URL
  • 25K Chinese messages from English Twitter for sensitive morphs
  • Test on 107 morph entities in Weibo, 23 of them are sensitive
  • Scoring Metric
  • Ck: the number of correctly resolved morphs at top position k
  • T: the total number of morphs in ground truth

Acc@k = C_k / T
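The metric as code (the dictionary shapes and the toy data are assumptions for illustration):

```python
def acc_at_k(ranked, gold, k):
    """Acc@k = C_k / T: fraction of morphs whose true target appears in
    the top-k ranked candidates.  ranked: morph -> ranked target list;
    gold: morph -> true target."""
    correct = sum(1 for m, t in gold.items() if t in ranked.get(m, [])[:k])
    return correct / len(gold)

ranked = {'平西王': ['Bo Xilai', 'Wen Jiabao'],
          '方便面': ['Yang Mi', 'Zhou Yongkang']}
gold = {'平西王': 'Bo Xilai', '方便面': 'Zhou Yongkang'}
print(acc_at_k(ranked, gold, 1), acc_at_k(ranked, gold, 2))  # 0.5 1.0
```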

SLIDE 64

127

Overall Performance

Acc@k                   k=1    k=5    k=10   k=20
Homogeneous network     23.4%  41.6%  47.7%  51.9%
Heterogeneous network   37.9%  59.4%  65.9%  70.1%

Section 4: Mining Heterogeneous Information Networks: Prediction and Recommendation

SLIDE 65

129

Relationship Prediction vs. Link Prediction

Link prediction in homogeneous networks [Liben‐Nowell and Kleinberg, 2003, Hasan et al., 2006]

E.g., friendship prediction

Relationship prediction in heterogeneous networks

Different types of relationships need different prediction models

Different connection paths need to be treated separately!

Meta-path-based approach to define topological features

130

PathPredict: Meta-Path Based Relationship Prediction

Meta path‐guided prediction of links and relationships

Philosophy: Meta path relationships among similar typed links share similar semantics and are comparable and inferable

Co‐author prediction (A—P—A)

Use topological features encoded by meta-paths, e.g., citation relations between authors (A—P→P—A)

Meta‐paths between authors under length 4

[Table: meta-paths and their semantic meanings]

SLIDE 66

131

The Power of PathPredict: Experiment on DBLP

Explain the prediction power of each meta‐path

Wald Test for logistic regression

Higher prediction accuracy than using projected homogeneous network

11% higher in prediction accuracy

Co‐author prediction for Jian Pei: Only 42 among 4809 candidates are true first‐time co‐authors!

(Features collected in [1996, 2002]; test period in [2003, 2009])

Author Citation Time Prediction in DBLP

  • Top‐4 meta‐paths for author citation time prediction
  • Predict when Philip S. Yu will cite a new author

Social relations are less important in author citation prediction than in co‐author prediction (under a Weibull distribution assumption)

Semantics of the top meta‐paths: study the same topic; co‐cited by the same paper; follow co‐authors' citations; follow the citations of authors who study the same topic

slide-67
SLIDE 67


Enhancing the Power of Recommender Systems by Heterogeneous Info. Network Analysis

Heterogeneous relationships complement each other

Users and items with limited feedback can be connected to the network by different types of paths

Connect new users or items in the information network

Different users may require different models: Relationship heterogeneity makes personalized recommendation models easier to define

(Example network: movies Avatar, Titanic, Aliens, Revolutionary Road linked to James Cameron, Kate Winslet, Leonardo Dicaprio, Zoe Saldana, and to the genres Adventure and Romance)

Collaborative filtering methods suffer from the data sparsity issue

A small set of users & items have a large number of ratings; most users & items have a small number of ratings (a long tail over the # of ratings)

Personalized recommendation with heterogeneous networks [WSDM'14]


Relationship Heterogeneity Based Personalized Recommendation Models

Different users may have different behaviors or preferences

Ex.: Aliens may be liked as a James Cameron fan, an 80s sci‐fi fan, or a Sigourney Weaver fan

Different users may be interested in the same movie for different reasons

Two levels of personalization

  • Data level: most recommendation methods use one model for all users and rely on personal feedback to achieve personalization
  • Model level: with different entity relationships, we can learn personalized models for different users to further distinguish their differences

slide-68
SLIDE 68


Preference Propagation-Based Latent Features

(Example network: users Alice, Bob, Charlie; movies Titanic, Revolutionary Road, Skyfall, King Kong; actors Kate Winslet, Naomi Watts, Ralph Fiennes; director Sam Mendes; genre: drama; tag: Oscar Nomination)

1. Generate L different meta‐paths (path types) connecting users and items
2. Propagate user implicit feedback along each meta‐path
3. Calculate latent features for users and items for each meta‐path with an NMF‐related method


Recommendation Models

Observation 1: Different meta‐paths may have different importance

Global Recommendation Model (1): r(u_i, e_j) = Σ_{q=1..L} θ_q · U_i^(q) · V_j^(q)

Observation 2: Different users may require different models

Personalized Recommendation Model (2): r(u_i, e_j) = Σ_{k=1..c} sim(C_k, u_i) · Σ_{q=1..L} θ_q^(k) · U_i^(q) · V_j^(q)

where r is the ranking score, U_i^(q) and V_j^(q) are the q‐th meta‐path latent features for user i and item j, c is the total number of soft user clusters, and sim(C_k, u_i) is the user‐cluster similarity.
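A minimal sketch of the global model's scoring, Eq. (1); the nested-list layout for latent features is an assumption for illustration:

```python
def global_rank_score(theta, user_latent, item_latent, i, j):
    """Eq. (1): r(u_i, e_j) = sum_q theta_q * (U_i^(q) . V_j^(q)).
    user_latent[q][i] and item_latent[q][j] hold the q-th meta-path
    latent-feature vectors (layout is an illustrative assumption)."""
    return sum(t * sum(u * v for u, v in zip(user_latent[q][i], item_latent[q][j]))
               for q, t in enumerate(theta))
```

The personalized model of Eq. (2) then mixes several such scores, weighted by the user's similarity to each soft cluster.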

slide-69
SLIDE 69


Parameter Estimation

  • Bayesian personalized ranking (Rendle et al., UAI'09)
  • Objective function (3): min_Θ −Σ log σ( r(u_i, e_a) − r(u_i, e_b) ), summed over each correctly ranked item pair, i.e., u_i gave feedback to e_a but not to e_b; σ is the sigmoid function

Learning Personalized Recommendation Model

  • Soft‐cluster users with NMF + k‐means
  • For each user cluster, learn one model with Eq. (3)
  • Generate a personalized model for each user on the fly with Eq. (2)
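The BPR-style objective of Eq. (3) can be sketched as follows (item scores are taken as precomputed; the function names are hypothetical):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bpr_objective(score, pairs):
    """Negative log-likelihood in the style of Eq. (3): sum over
    correctly ranked pairs (a, b) -- the user gave feedback to a but
    not to b -- of -log sigmoid(score[a] - score[b]); minimized
    during training."""
    return -sum(math.log(sigmoid(score[a] - score[b])) for a, b in pairs)
```

The larger the margin between the observed item's score and the unobserved one's, the smaller the loss.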


Experiments: Heterogeneous Network Modeling Enhances the Quality of Recommendation

Datasets

Comparison methods

Popularity: recommend the most popular items to users

Co‐click: conditional probabilities between items

NMF: non‐negative matrix factorization on user feedback

Hybrid‐SVM: use Rank‐SVM with plain features (utilize both user feedback and information network)

HeteRec personalized recommendation (HeteRec‐p) leads to the best recommendation

slide-70
SLIDE 70


Summary of Theme III

  • Network representation provides more extensive analysis on text corpora

  • Relatively homogeneous, clean networks, where common substructures appear frequently:

    • Graph indexing + approximate structural search

    • Graph pattern mining

  • Heterogeneous, noisy, knowledge‐sparse networks, but with a clear meta‐path schema:

    • Meta‐path based similarity search

    • Meta‐path guided prediction, classification and recommendation in heterogeneous networks
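Meta-path based similarity search is typified by PathSim (Sun et al., VLDB'11); a sketch assuming precomputed meta-path instance counts pc[a][b]:

```python
def pathsim(pc, x, y):
    """PathSim: s(x, y) = 2 * |paths x->y| / (|paths x->x| + |paths y->y|),
    where pc[a][b] counts meta-path instances from a to b (precomputed;
    the dict-of-dicts layout is an illustrative assumption)."""
    return 2.0 * pc[x][y] / (pc[x][x] + pc[y][y])
```

Normalizing by the self-path counts keeps highly connected hub nodes from dominating every similarity ranking.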


References for Theme III: Graph Search and Mining

  • C. Borgelt and M. R. Berthold, “Mining molecular fragments: Finding relevant substructures of molecules”, ICDM'02
  • S. Kim, H. Kim, T. Weninger, J. Han, H. D. Kim, “Authorship Classification: A Discriminative Syntactic Tree Mining Approach”, SIGIR'11
  • M. Kuramochi and G. Karypis, “Frequent subgraph discovery”, ICDM'01
  • S. Nijssen and J. Kok, “A quickstart in frequent structure mining can make a difference”, KDD'04
  • X. Yan and J. Han, “gSpan: Graph‐Based Substructure Pattern Mining”, ICDM'02
  • X. Yan and J. Han, “CloseGraph: Mining Closed Frequent Graph Patterns”, KDD'03
  • X. Yan, P. S. Yu, and J. Han, “Graph Indexing: A Frequent Structure‐based Approach”, SIGMOD'04
  • X. Yan, P. S. Yu, and J. Han, “Substructure Similarity Search in Graph Databases”, SIGMOD'05
  • S. Yang, Y. Wu, H. Sun, X. Yan, “Schemaless and Structureless Graph Querying”, VLDB'14
  • F. Zhu, Q. Qu, D. Lo, X. Yan, J. Han, and P. S. Yu, “Mining Top‐K Large Structural Patterns in a Massive Network”, VLDB'11

slide-71
SLIDE 71


References for Theme III: Heterogeneous Network Search & Mining

  • H. Huang, Z. Wen, D. Yu, H. Ji, Y. Sun, J. Han and H. Li, “Resolving Entity Morphs in Censored Data”, ACL'13
  • D. Yu, H. Huang, T. Cassidy, H. Ji, C. Wang, S. Zhi, J. Han, C. Voss and M. Magdon‐Ismail, “The Wisdom of Minority: Unsupervised Slot Filling Validation based on Multi‐dimensional Truth‐Finding”, COLING'14
  • Y. Sun and J. Han, “Mining Heterogeneous Information Networks: Principles and Methodologies”, Morgan & Claypool Publishers, 2012
  • Y. Sun, J. Han, P. Zhao, Z. Yin, H. Cheng, T. Wu, “RankClus: Integrating Clustering with Ranking for Heterogeneous Information Network Analysis”, EDBT'09
  • Y. Sun, J. Han, X. Yan, P. S. Yu, and T. Wu, “PathSim: Meta Path‐Based Top‐K Similarity Search in Heterogeneous Information Networks”, VLDB'11
  • Y. Sun, J. Han, C. C. Aggarwal, and N. Chawla, “When Will It Happen? Relationship Prediction in Heterogeneous Information Networks”, WSDM'12
  • X. Yu, X. Ren, Y. Sun, Q. Gu, B. Sturt, U. Khandelwal, B. Norick, and J. Han, “Personalized Entity Recommendation: A Heterogeneous Information Network Approach”, WSDM'14
  • Y. Sun, R. Barber, M. Gupta, C. Aggarwal and J. Han, “Co‐Author Relationship Prediction in Heterogeneous Bibliographic Networks”, ASONAM'11

Conclusions and Future Research Directions


slide-72
SLIDE 72


Conclusions

Data mining and NLP have been working on a common goal: Turning data into knowledge (D2K), but with different methodologies

Data mining explores massive amounts of data, but does less in‐depth analysis of individual documents

The two fields will benefit each other by integrating their technologies and methodologies

We have been proposing a D2N2K (data to network to knowledge) methodology

Phrase mining in massive corpora

Entity recognition and typing by correlation and cluster analysis

Construction of massive typed heterogeneous information networks

Mining actionable knowledge from such “semi‐structured” information networks

Lots to be done in the future!


Looking Forward

Lots of unanswered questions and research issues

What is the best framework for integration and joint inference?

Is there an ideal common representation, or a layer in between? Even go beyond Heterogeneous Information Networks?

Apply D2N2K framework to other NLP applications

Network search based Collective Entity Linking

Cross‐lingual information network alignment

Semantic graph based Machine Translation

Expert finding, research recommendation, …

slide-73
SLIDE 73


Tools

Check our research package dissemination portal:

IlliMine http://illimine.cs.uiuc.edu/

Phrase Mining

https://github.com/shangjingbo1226/SegPhrase

Entity Typing

http://shanzhenren.github.io/ClusType

Entity Morph Decoding

http://nlp.cs.rpi.edu/software/morphdecoding.tar.gz

Graph Mining Tools

gSpan: http://www.cs.ucsb.edu/~xyan/software/gSpan.htm


Acknowledgement

Network Science CTA

slide-74
SLIDE 74


BACKUP


Different Input Focus

  • NLP: Unstructured text data

  • Data Mining (DM): More on structured and semi‐structured data

slide-75
SLIDE 75


NLP Challenge 1: Data and Knowledge Sparsity

Data Sparsity: Need to obtain high‐level statistics as global evidence

Training a typical supervised model needs a large number of instances

500 documents for phrase chunking, and entity, relation, event extraction

20,000 queries for entity linking

Knowledge Sparsity: Require global knowledge acquired from a wider context with low cost

Example source text: “Translation out of hype‐speak: some kook made threatening noises at Brownback and got arrested.” KB entry: Samuel Dale “Sam” Brownback (born September 12, 1956) is an American politician, the 46th and current Governor of Kansas.

Who is Brownback? → background data/knowledge aggregation

150

DM Solution: Leverage Information Redundancy

Acquire a large amount of related documents from multiple sources (genres, languages, data modalities)

Learn high‐level data‐driven statistics to extract, type and link information

Frequent pattern mining, ranking based on different criteria

Popularity: ‘information retrieval’ vs. ‘cross‐language information retrieval’

Concordance: ‘active learning’ vs. ‘learning classification’

Completeness: ‘vector machine’ vs. ‘support vector machine’

Construct relational graphs for weakly‐supervised label propagation

slide-76
SLIDE 76


NLP Challenge 2: From Data to Knowledge (D2K)

Directly converting a large amount of unstructured data into knowledge is too ambitious and unrealistic

Example 1

Input: Millions of discussion forum posts under censorship

Output: Resolve each implicit entity mention to its real target

Example 2

Input: 15 years of non‐parallel Chinese and English news articles

Output: Entity translation pairs

Example 3

Input: Millions of multi‐source documents reporting conflicting information

Output: Track each company's top employees over time; complete knowledge bases

Session 1. Phrase Mining: A Data Mining Approach

slide-77
SLIDE 77


TNG: Experiments on Research Papers


Phrase Mining and Topic Modeling: An Example

Motivation: Unigrams (single words) can be difficult to interpret

Ex.: The topic that represents the area of Machine Learning

Unigrams: learning, reinforcement, support, machine, vector, selection, feature, random, …

versus

Phrases: learning, support vector machines, reinforcement learning, feature selection, conditional random fields, classification, decision trees, …

slide-78
SLIDE 78


KERT: Topical Keyphrase Extraction & Ranking

learning, support vector machines, reinforcement learning, feature selection, conditional random fields, classification, decision trees, …

Topical keyphrase extraction & ranking

  • knowledge discovery using least squares support vector machine classifiers
  • support vectors for reinforcement learning
  • a hybrid approach to feature selection
  • pseudo conditional random fields
  • automatic web page classification in a dynamic and hierarchical way
  • inverse time dependency in convex regularized learning
  • postprocessing decision trees to extract actionable knowledge
  • variance minimization least squares support vector machines
  • …

Unigram topic assignment: Topic 1 & Topic 2


Framework of KERT

  • 1. Run bag‐of‐words model inference and assign topic label to each token
  • 2. Extract candidate keyphrases within each topic
  • 3. Rank the keyphrases in each topic

Popularity: ‘information retrieval’ vs. ‘cross‐language information retrieval’

Discriminativeness: only frequent in documents about topic t

Concordance: ‘active learning’ vs. ‘learning classification’

Completeness: ‘vector machine’ vs. ‘support vector machine’

Comparability property: directly compare phrases of mixed lengths (all measures computed via frequent pattern mining)

slide-79
SLIDE 79


Summary: Strategies on Topical Phrase Mining

Strategy 1: Generate bag‐of‐words → generate sequence of tokens

Integrated complex model; phrase quality and topic inference rely on each other

Slow and prone to overfitting

Strategy 2: Post bag‐of‐words model inference, visualize topics with n‐grams

Phrase quality relies on topic labels for unigrams

Can be fast; generally high‐quality topics and phrases

Strategy 3: Prior bag‐of‐words model inference, mine phrases and impose them on the bag‐of‐words model

Topic inference relies on correct segmentation of documents, but is not sensitive to occasional segmentation errors

Can be fast; generally high‐quality topics and phrases

Session 2. Simultaneously Inferring Phrases and Topics

slide-80
SLIDE 80


Session 3. Post Topic Modeling Phrase Construction Session 4. First Phrase Mining then Topic Modeling

slide-81
SLIDE 81


Session 5. SegPhrase: Integration of Phrase Mining with Document Segmentation


Mining Phrases: Why Not Use Raw Frequency Based Methods?

Traditional data‐driven approaches

Frequent pattern mining

If AB is frequent, AB is likely to be a phrase

Raw frequency could NOT reflect the quality of phrases

E.g., freq(vector machine) ≥ freq(support vector machine)

Need to rectify the frequency based on segmentation results

Phrasal segmentation will tell which words should be treated as a whole phrase and which remain unigrams

slide-82
SLIDE 82


ClassPhrase I: Pattern Mining for Candidate Set

Build a candidate phrase set by frequent pattern mining

Mining frequent k‐grams

k is typically small, e.g. 6 in our experiments

Popularity measured by the raw frequency of words and phrases mined from the corpus
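A sketch of this candidate-generation step (a naive counter for illustration; a production miner would add pruning):

```python
from collections import Counter

def frequent_ngrams(docs, max_k=6, min_sup=2):
    """Candidate phrases: every k-gram (k <= max_k) whose raw corpus
    frequency reaches min_sup (popularity measured by raw frequency)."""
    counts = Counter()
    for doc in docs:
        toks = doc.lower().split()
        for k in range(1, max_k + 1):
            for i in range(len(toks) - k + 1):
                counts[" ".join(toks[i:i + k])] += 1
    return {g: c for g, c in counts.items() if c >= min_sup}
```

Keeping k small (e.g., 6) bounds the candidate set while still covering realistic phrase lengths.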


Classifier

Limited Training

Labels: Whether a phrase is a quality one or not

“support vector machine”: 1

“the experiment shows”: 0

For ~1GB corpus, only 300 labels

Random Forest as our classifier

Predicted phrase quality scores lie in [0, 1]

Bootstrap many different datasets from limited labels

slide-83
SLIDE 83


SegPhrase: Why Do We Need Phrasal Segmentation in Corpus?

Phrasal segmentation can tell which phrase is more appropriate

Ex.: “A standard feature vector machine learning setup is used to describe…”; segmentation must choose [feature vector] [machine learning] over [vector machine]

Rectified phrase frequency (expected influence): occurrences not selected by the segmentation (e.g., ‘vector machine’ above) are not counted towards the rectified frequency

166

SegPhrase: Segmentation of Phrases

Partition a sequence of words by maximizing the likelihood, considering:

Phrase quality score: ClassPhrase assigns a quality score to each candidate phrase

Probability in corpus

Length penalty α: when α < 1, it favors shorter phrases

Filter out phrases with low rectified frequency

Bad phrases are expected to rarely occur in the segmentation results
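The segmentation objective can be sketched as a Viterbi-style dynamic program (the quality dictionary and the tiny unigram floor are illustrative assumptions, not SegPhrase's exact scoring):

```python
import math

def segment(tokens, quality, alpha=1.0, max_len=6):
    """Best phrasal segmentation by dynamic programming; each segment
    scores log(quality * alpha**length), with alpha < 1 penalizing
    longer phrases."""
    n = len(tokens)
    best = [(-math.inf, -1)] * (n + 1)   # (score, backpointer)
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            phrase = " ".join(tokens[start:end])
            # unseen unigrams get a small floor so segmentation never fails
            q = quality.get(phrase, 1e-6 if end - start == 1 else 0.0)
            if q <= 0.0:
                continue
            score = best[start][0] + math.log(q * alpha ** (end - start))
            if score > best[end][0]:
                best[end] = (score, start)
    segs, end = [], n
    while end > 0:
        start = best[end][1]
        segs.append(" ".join(tokens[start:end]))
        end = start
    return segs[::-1]
```

Counting phrase occurrences only where the segmentation selects them yields exactly the rectified frequencies described above.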

slide-84
SLIDE 84


SegPhrase+: Enhancing Phrasal Segmentation

SegPhrase+: One more round for enhanced phrasal segmentation

Feedback

Using rectified frequency, re‐compute those features previously computed based on raw frequency

Process

Classification → Phrasal segmentation (SegPhrase)

Classification → Phrasal segmentation → Classification → Phrasal segmentation (SegPhrase+)

Effects on computing quality scores, e.g.:

‘np hard in the strong sense’ vs. ‘np hard in the strong’

‘data base management system’

Example tweet: “Stay up Hawk Fans. We are going through a slump, but we have to stay positive. Go Hawks!”

Linking Entities to Knowledge Base

slide-85
SLIDE 85


Recognizing Typed Entities

Plain text: “The best BBQ I’ve tasted in Phoenix! I had the pulled pork sandwich with coleslaw and baked beans for lunch. … The owner is very nice. …”

With typed entities: “The best BBQ:FOOD I’ve tasted in Phoenix:LOC! I had the [pulled pork sandwich]:FOOD with coleslaw:FOOD and [baked beans]:FOOD for lunch. … The owner:JOB_TITLE is very nice. …”

Target types: FOOD, LOCATION, JOB_TITLE, EVENT, ORGANIZATION

Task: identify token spans as entity mentions in documents and label their types, enabling structured analysis of an unstructured text corpus (plain text → text with typed entities)


Entity Linking: A Relational Graph Approach

Relational graph: each pair of mention m and concept c forms a node

  • The model (adapted from Zhu et al., 2003)
    • y_i: the label of node i
    • W: the weight matrix of the relational graph

Three types of relations: local compatibility, coreference, semantic relatedness
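A sketch of the underlying graph-regularization idea (harmonic label propagation in the style of Zhu et al., 2003; the dense weight-matrix layout is a simplification of the actual model):

```python
def propagate(W, labels, iters=50):
    """Semi-supervised label propagation: unlabeled nodes repeatedly
    take the weighted average of their neighbors' labels, while
    labeled nodes stay fixed. W is a symmetric weight matrix."""
    n = len(W)
    y = [labels.get(i, 0.0) for i in range(n)]
    for _ in range(iters):
        for i in range(n):
            if i in labels:
                continue
            total = sum(W[i][j] for j in range(n) if j != i)
            if total:
                y[i] = sum(W[i][j] * y[j] for j in range(n) if j != i) / total
    return y
```

With edges weighted by local compatibility, coreference, and semantic relatedness, confident mention-concept pairs pull ambiguous neighbors toward the right labels.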

slide-86
SLIDE 86


Relational Graph Construction (con’t)


  • Semantic Relatedness (SR)
  • SR between two mentions: meta path
  • SR between two concepts: link structure in Wikipedia (Milne and Witten, 2008)
  • Linear Combination of these three graphs


Models for Comparison

  • TagMe: an unsupervised model based on prior popularity and semantic

relatedness of a single message (Ferragina and Scaiella, 2010)

  • Meij: the state‐of‐the‐art supervised approach based on the random

forest model (Meij et al., 2012)

  • SSRegu: our proposed semi‐supervised graph regularization model with

all three types of relations

slide-87
SLIDE 87


Quality of Semantic Relatedness

DSRM vs. the standard relatedness method M&W (Milne and Witten, 2008)


Impact of Semantic Relatedness on Concept Disambiguation

  • News dataset: 4,485 mentions (Hoffart et al., 2011)
  • AIDA: an unsupervised collective inference method (Hoffart et al., 2011)
  • Our methods are completely unsupervised

(Compared systems: TagMe, Meij, SSRegu+M&W, SSRegu+DSRM on the Tweet Set; SSRegu+M&W, SSRegu+DSRM, AIDA on the News Dataset)

slide-88
SLIDE 88


Remaining Challenges

Mention detection is the performance bottleneck

Mention disambiguation: city and country names that refer to sports teams (e.g., “Miami” ‐> “Miami Heat”)

Incorporate user interests

Non‐linkable entity mention recognition and clustering


Error Distribution


Transformation-Based Graph Querying

Transformation‐based graph querying produces many results

How to suggest the “best” results to users?

Different transformations should be weighted differently

How to determine the weights? They should be learned from data

Evaluate a Candidate Match: Ranking Function

Features

Node matching feature:

Edge matching feature:

Matching score example: given a single‐node query “Geoffrey Hinton”, nodes matching “G. Hinton” (abbreviation transformation) should be ranked higher than nodes matching “Hinton” (last‐token transformation)

F_V(v, φ(v)) = Σ_i α_i · f_i(v, φ(v))

F_E(e, φ(e)) = Σ_j β_j · g_j(e, φ(e))

P(φ(Q) | Q) ∝ exp( Σ_{v ∈ V_Q} F_V(v, φ(v)) + Σ_{e ∈ E_Q} F_E(e, φ(e)) )

where f_i are node‐matching features with weights α_i, g_j are edge‐matching features with weights β_j, and φ maps query nodes and edges to candidate nodes and edges.

slide-89
SLIDE 89


Potential Challenges in Applying Graph Indices + Search to NLP

The graphs automatically constructed from natural language texts might:

Include many more diverse types and weights

Include more ambiguity

Knowledge sparsity; hard to generalize unique nodes/substructures

Include more noise and errors


Mining Frequent (Sub)Graph Patterns

Data mining research has developed many scalable graph pattern mining methods

Given a labeled graph dataset D = {G1, G2, …, Gn}, the supporting graph set of a subgraph g is Dg = {Gi | g ⊆ Gi, Gi ∈ D}.

support(g) = |Dg|/ |D|

A (sub)graph g is frequent if support(g) ≥ min_sup
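A sketch of the support computation; for brevity it tests labeled-edge-set containment rather than the full subgraph isomorphism that real miners such as gSpan perform:

```python
def support(pattern_edges, graphs):
    """support(g) = |Dg| / |D|. A graph 'contains' the pattern here
    when the pattern's labeled edge set is a subset of its edges
    (a simplification of subgraph isomorphism testing)."""
    covered = sum(1 for g in graphs if pattern_edges <= g)
    return covered / len(graphs)
```

A pattern is frequent when this value reaches min_sup.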

Ex.: Chemical structures

(Figure: a graph dataset of chemical structures (A), (B), (C) and the frequent graph patterns (1), (2) mined with min_sup = 2, i.e., support = 67%)

Alternative:

Mining frequent subgraph patterns from a single large graph or network

Documents can be viewed as graph structures as well

slide-90
SLIDE 90


Efficient Graph Pattern Mining Methods

Effective Graph pattern mining methods have been developed for different scenarios

Mining graph “transaction” datasets

  • Ex. gSpan (Yan et al, ICDM’02), CloseGraph (Yan et al., KDD’03), …

Mining frequent large subgraph structures in single massive network

  • Ex. Spider‐Mine (Zhu, et al., VLDB’11)

Graph pattern mining forms building blocks for graph classification, clustering, compression, comparison, and correlation analysis

Graph indexing and graph similarity search

gIndex (SIGMOD'04): Graph indexing by graph pattern mining

Supports both exact and similarity‐based graph query answering


Mining Collaboration Patterns in DBLP Networks

Data description: 600 top confs, 9 major CS areas, 15071 authors in DB/DM

Author labeled by # of papers published in DB/DM

Prolific (P): >=50, Senior (S): 20~49, Junior (J): 10~19, Beginner(B): 5~9

(Figure: collaboration patterns found, each with matching instances in the data)

slide-91
SLIDE 91


Applications of Graph Pattern Mining

Bioinformatics

Gene networks, protein interactions, metabolic pathways

Chem‐informatics: Mining chemical compound structures

Social networks, web communities, tweets, …

Cell phone networks, computer networks, …

Web graphs, XML structures, semantic Web, information networks

Software engineering: Program execution flow analysis

Building blocks for graph classification, clustering, compression, comparison, and correlation analysis

Graph indexing and graph similarity search


Logistic Regression-Based Prediction Model

  • Training and test pair: <x_i, y_i> = <history feature list, future relationship label>
  • Logistic Regression Model
    • Model the probability for each relationship as P(y_i = 1 | x_i) = exp(x_i β) / (1 + exp(x_i β))
    • β is the coefficient vector for the features (including a constant 1)
  • MLE estimation
    • Maximize the likelihood of observing all the relationships in the training data

Example training pairs (meta‐path instance counts as features; label from A‐P‐A):

pair          A‐P‐A‐P‐A   A‐P‐V‐P‐A   A‐P‐T‐P‐A   A‐P→P‐A   label
<Mike, Ann>   4           5           100         3         Yes = 1
<Mike, Jim>   1           20          2                     No = 0
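A sketch of the resulting predictor (β would be fit by MLE; here it is simply applied to a feature vector):

```python
import math

def predict_prob(beta, x):
    """P(y = 1 | x) = 1 / (1 + exp(-(beta0 + beta . x))); beta[0] is
    the constant term, beta[1:] the meta-path feature coefficients."""
    z = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    return 1.0 / (1.0 + math.exp(-z))
```

Larger meta-path counts with positive coefficients push the predicted relationship probability toward 1.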

slide-92
SLIDE 92


When Will It Happen?

From “whether” to “when”

“Whether”: Will Jim rent the movie “Avatar” on Netflix?

“When”: When will Jim rent the movie “Avatar”?

What is the probability Jim will rent “Avatar” within 2 months?

By when will Jim rent “Avatar” with 90% probability?

What is the expected time it will take for Jim to rent “Avatar”?

May provide useful information to supply chain management

“Whether” output: P(X = 1) = ?   “When” output: a distribution over time!

Generalized Linear Model under Weibull Distribution Assumption

The Relationship Building Time Prediction Model

Solution

Directly model the relationship building time: P(Y = t)

Candidate distributions: geometric, exponential, Weibull

Use a generalized linear model

Deal with censoring (the relationship builds beyond the observed time interval)

Training framework (T: right censoring)
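Under the Weibull assumption, the quantities above (probability within time t, time at a given probability, expected time) all follow from the CDF; a sketch using the standard scale/shape parameterization (an assumption about the paper's exact notation):

```python
import math

def weibull_cdf(t, scale, shape):
    """P(relationship built by time t) under Weibull(scale, shape):
    F(t) = 1 - exp(-(t / scale) ** shape). A right-censored pair only
    tells us the event lies beyond T, contributing 1 - F(T) to the
    likelihood."""
    return 1.0 - math.exp(-((t / scale) ** shape))
```

A generalized linear model ties the scale parameter to the meta-path features, so different pairs get different predicted time distributions.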