Successful Data Mining Methods for NLP
Jiawei Han (UIUC), Heng Ji (RPI), Yizhou Sun (NEU)
July 26, 2015
http://hanj.cs.illinois.edu/slides/dmnlp15.pptx
http://nlp.cs.rpi.edu/paper/dmnlp15.pptx

Introduction: Where do NLP and DM Meet?
NLP: Deep understanding of individual words, phrases, and sentences ("micro-level"); focuses on unstructured text data
Data Mining (DM): High-level (statistical) understanding, discovery, and synthesis of the most salient information ("macro-level"); historically focused more on structured and semi-structured data
Example: NewsNet (Tao et al., 2014), a network built from news related to the "Health Care Bill"
Advantages of NLP
Construct graphs/networks with fine‐grained semantics from unstructured texts
Use large‐scale annotations for real‐world data
Advantages of DM: Deep understanding through structured/correlation inference
Using a structured representation (e.g., graph, network) as a bridge to capture interactions between NLP and DM
Example: Heterogeneous Information Networks [Han et al., 2010; Sun et al., 2012]
Data → Networks → Knowledge
Major theme of this tutorial
Applying novel DM methods to solve traditional NLP problems
Integrating DM and NLP, transforming Data to Networks to Knowledge
Road Map of this tutorial
Effective Network Construction by Leveraging Information Redundancy
Theme I: Phrase Mining and Topic Modeling from Large Corpora
Theme II: Entity Extraction and Linking by Relational Graph Construction
Mining Knowledge from Structured Networks
Theme III: Search and Mining Structured Graphs and Heterogeneous Networks
Looking forward to the Future
Phrase: Minimal, unambiguous semantic unit; the basic building block for information networks and knowledge bases

Unigrams vs. phrases
Unigrams (single words) are often ambiguous
Example: "United": United States? United Airlines? United Parcel Service?
Phrase: A natural, meaningful, unambiguous semantic unit
Example: "United States" vs. "United Airlines"
Mining semantically meaningful phrases
Transform text data from word granularity to phrase granularity
Enhance the power and efficiency of manipulating unstructured data using database technology
Phrase mining originated from the NLP community as "chunking"
Model it as a sequence labeling problem (B‐NP, I‐NP, O, …)
Need annotation and training
Annotate hundreds of POS tagged documents as training data
Train a supervised model based on part‐of‐speech features
Recent trend:
Use distributional features based on web n‐grams (Bergsma et al., 2010)
State‐of‐the‐art Performance: ~95% accuracy, ~88% phrase‐level F‐score
Limitations
High annotation cost, not scalable to a new language, domain or genre
May not fit domain‐specific, dynamic, emerging applications
Scientific domains, query logs, or social media, e.g., Yelp, Twitter
Use only local features, no ranking, no links to topics
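The sequence-labeling formulation above can be made concrete with a small decoder: given per-token B-NP/I-NP/O labels (hard-coded here; a real chunker would predict them from POS features), adjacent B/I tokens are grouped into noun phrases. A minimal sketch with illustrative data:

```python
# Sketch: decoding BIO chunk labels (B-NP, I-NP, O) into noun phrases.
# Tokens and labels are illustrative, not output of a trained tagger.
def decode_chunks(tokens, labels):
    """Group tokens labeled B-NP/I-NP into phrases; O tokens are skipped."""
    phrases, current = [], []
    for tok, lab in zip(tokens, labels):
        if lab == "B-NP":                # a new noun-phrase chunk starts
            if current:
                phrases.append(" ".join(current))
            current = [tok]
        elif lab == "I-NP" and current:  # continue the open chunk
            current.append(tok)
        else:                            # O (or stray I-NP): close any open chunk
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases

tokens = ["United", "Airlines", "serves", "the", "United", "States"]
labels = ["B-NP", "I-NP", "O", "B-NP", "I-NP", "I-NP"]
print(decode_chunks(tokens, labels))  # ['United Airlines', 'the United States']
```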
General principle: Corpus-based; fully exploit information redundancy and data-driven criteria to determine phrase boundaries and salience; use local evidence to adjust corpus-level statistics
Phrase Mining and Topic Modeling from Large Corpora
Strategy 1: Simultaneously Inferring Phrases and Topics
Bigram topical model [Wallach’06], topical n‐gram model [Wang, et al.’07], phrase discovering topic model [Lindsey, et al.’12]
Strategy 2: Post Topic Modeling Phrase Construction
Label topic [Mei et al.’07], TurboTopic [Blei & Lafferty’09], KERT [Danilevsky, et al.’14]
Strategy 3: First Phrase Mining then Topic Modeling:
ToPMine [El-Kishky, et al., VLDB'15]
Integration of Phrase Mining with Document Segmentation
SegPhrase [Liu, et al., SIGMOD’15]
Bigram Topic Model [Wallach’06]
Probabilistic generative model that conditions on previous word and topic when drawing next word
Topical N‐Grams (TNG) [Wang, et al.’07]
Probabilistic model that generates words in textual order
Create n‐grams by concatenating successive bigrams (a generalization of Bigram Topic Model)
Phrase‐Discovering LDA (PDLDA) [Lindsey, et al.’12]
Viewing each sentence as a time‐series of words, PDLDA posits that the generative parameter (topic) changes periodically
Each word is drawn based on previous m words (context) and current phrase topic
High model complexity: tends to overfit; high inference cost: slow
TurboTopics [Blei & Lafferty’09] – Phrase construction as a post‐processing step to Latent Dirichlet Allocation
Perform Latent Dirichlet Allocation on corpus to assign each token a topic label
Merge adjacent unigrams with the same topic label by a distribution‐free permutation test on arbitrary‐length back‐off model
End recursive merging when all significant adjacent unigrams have been merged
KERT [Danilevsky et al.'14] – Phrase construction as a post-processing step to Latent Dirichlet Allocation
Perform frequent pattern mining on each topic
Perform phrase ranking based on four different criteria
Perform LDA on the corpus to assign each token a topic label
E.g., "… phase(topic 11) transition(topic 11) … game(topic 153) theory(topic 127) …"
Then merge adjacent unigrams with the same topic label
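The merging step can be sketched in a few lines: after LDA assigns each token a topic, adjacent tokens sharing a topic label are collapsed into a candidate phrase. Note the real TurboTopics procedure also requires a permutation-test significance check before merging, which this illustrative sketch omits:

```python
# Sketch of Strategy 2's post-processing idea: merge adjacent tokens that
# share an LDA topic label. Token/topic pairs are made up for illustration;
# TurboTopics additionally gates each merge with a significance test.
def merge_same_topic(tokens_with_topics):
    merged, cur_words, cur_topic = [], [], None
    for word, topic in tokens_with_topics:
        if topic == cur_topic:
            cur_words.append(word)       # extend the current same-topic run
        else:
            if cur_words:
                merged.append((" ".join(cur_words), cur_topic))
            cur_words, cur_topic = [word], topic
    if cur_words:
        merged.append((" ".join(cur_words), cur_topic))
    return merged

seq = [("phase", 11), ("transition", 11), ("game", 153), ("theory", 127)]
print(merge_same_topic(seq))
# [('phase transition', 11), ('game', 153), ('theory', 127)]
```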
Frequent pattern mining; comparability property: directly compare phrases of mixed lengths

kpRel [Zhao et al.'11] | KERT (-popularity) | KERT (-discriminativeness) | KERT (-concordance) | KERT [Danilevsky et al.'14]
learning | effective | support vector machines | learning | learning
classification | text | feature selection | classification | support vector machines
selection | probabilistic | reinforcement learning | selection | reinforcement learning
models | identification | conditional random fields | feature | feature selection
algorithm | mapping | constraint satisfaction | decision | conditional random fields
features | task | decision trees | bayesian | classification
decision | planning | dimensionality reduction | trees | decision trees
… | … | … | … | …
ToPMine [El‐Kishky et al. VLDB’15]
First phrase construction, then topic mining
Contrast with KERT: topic modeling, then phrase mining
The ToPMine Framework:
Perform frequent contiguous pattern mining to extract candidate phrases and their counts
Perform agglomerative merging of adjacent unigrams as guided by a significance score—This segments each document into a “bag‐of‐phrases”
The newly formed bags-of-phrases are passed as input to PhraseLDA, an extension of LDA that constrains all words in a phrase to share the same topic
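ToPMine's first step can be illustrated with a minimal contiguous n-gram counter: candidate phrases are contiguous patterns meeting a support threshold. The corpus, threshold, and maximum length below are illustrative; the real framework then agglomeratively merges unigrams guided by a significance score:

```python
# Minimal sketch of frequent contiguous pattern mining for candidate phrases.
# A toy corpus and support threshold stand in for a large collection.
from collections import Counter

def contiguous_ngrams(docs, max_n=3, min_support=2):
    counts = Counter()
    for doc in docs:
        toks = doc.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(toks) - n + 1):
                counts[" ".join(toks[i:i + n])] += 1
    # keep only patterns meeting the support threshold
    return {p: c for p, c in counts.items() if c >= min_support}

docs = ["support vector machine training",
        "a support vector machine classifier",
        "support vector regression"]
cands = contiguous_ngrams(docs)
print(cands["support vector machine"])  # 2
print(cands["support vector"])          # 3
```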
With Strategy 2, tokens in the same phrase may be assigned to different topics
E.g., "knowledge discovery" and "support vector machine" should each have coherent topic labels
Solution: switch the order of phrase mining and topic model inference
Techniques
Phrase mining and document segmentation
Topic model inference with phrase constraint
[knowledge discovery] using [least squares] [support vector machine] [classifiers] …
Example segmentations: [Markov blanket] [feature selection] for [support vector machines]; [knowledge discovery] using [least squares] [support vector machine] [classifiers]; … [support vector] for [machine learning] …

Phrase | Raw freq. | True freq.
[support vector machine] | 90 | 80
[vector machine] | 95 | 0
[support vector] | 100 | 20

Quality phrases are identified based on the significance score [Church et al.'91]:
α(P1, P2) ≈ (f(P1●P2) − µ0(P1, P2)) / √f(P1●P2)
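The significance score can be sketched directly from the formula: it compares the observed frequency of the merged phrase against µ0, the frequency expected if the two parts occurred independently. The independence estimate below (f(P1)·f(P2)/N) is one simple choice, and the counts are illustrative:

```python
# Sketch of the significance score guiding agglomerative merging:
# alpha(P1, P2) ≈ (f(P1●P2) - mu0(P1, P2)) / sqrt(f(P1●P2)).
# mu0 here is a simple independence estimate; all counts are made up.
import math

def significance(f_p1, f_p2, f_merged, total_positions):
    mu0 = f_p1 * f_p2 / total_positions   # expected co-occurrence count
    return (f_merged - mu0) / math.sqrt(f_merged)

# Observed merges (80) vastly exceed the independence expectation (~0.012),
# so the pair is a significant phrase candidate.
print(round(significance(100, 120, 80, 1_000_000), 2))  # 8.94
```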
Collocation: A sequence of words that occur more frequently than expected
Often "interesting" and, due to their non-compositionality, often convey information not carried by their constituent terms (e.g., "made an exception", "strong tea")
Many different measures used to extract collocations from a corpus [Dunning 93, Pederson 96]
E.g., mutual information, t‐test, z‐test, chi‐squared test, likelihood ratio
Many of these measures can be used to guide the agglomerative phrase‐ segmentation algorithm
The generative model for PhraseLDA is the same as LDA
Difference: the model incorporates constraints
The chain-graph structure constrains all words in a phrase to take on the same topic value
[knowledge discovery] using [least squares] [support vector machine] [classifiers] …
Topic model inference with phrase constraints
ToPMine [El-Kishky et al.'14], Strategy 3 (67 seconds):

Topic 1 | Topic 2
information retrieval | feature selection
social networks | machine learning
web search | semi supervised
search engine | large scale
information extraction | support vector machines
question answering | active learning
web pages | face recognition
… | …

PDLDA [Lindsey et al.'12], Strategy 1 (3.72 hours):

Topic 1 | Topic 2
social networks | information retrieval
web search | text classification
time series | machine learning
search engine | support vector machines
management system | information extraction
real time | neural networks
decision trees | text categorization
… | …
Strategy 1: Generate bag‐of‐words → generate sequence of tokens
Strategy 2: Post bag‐of‐words model inference, visualize topics with n‐grams
Strategy 3: Prior bag‐of‐words model inference, mine phrases and impose to the bag‐of‐words model
Running time: strategy 3 > strategy 2 > strategy 1 (">" means outperforms, i.e., runs faster)
Coherence measured by z‐score: strategy 3 > strategy 2 > strategy 1
Phrase intrusion measured by average number of correct answers: strategy 3 > strategy 2 > strategy 1
Phrase quality measured by z‐score: strategy 3 > strategy 2 > strategy 1
Traditional data‐driven approaches
Frequent pattern mining
If AB is frequent, AB is likely to be a phrase
But raw frequency does not reflect the quality of phrases
E.g., freq(vector machine) ≥ freq(support vector machine)
Need to rectify the frequency based on segmentation results
Phrasal segmentation will tell
Some words should be treated as a whole phrase whereas others are still unigrams
Build a candidate phrase set by frequent pattern mining
Mining frequent k‐grams; k is typically small, e.g. 6 in our experiments
Popularity is measured by the raw frequency of words and phrases mined from the corpus
Example documents:
Document 1: Citation recommendation is an interesting but challenging research problem in data mining area.
Document 2: In this study, we investigate the problem in the context of heterogeneous information networks using data mining technique.
Document 3: Principal Component Analysis is a linear dimensionality reduction technique commonly used in machine learning applications.

Pipeline: Raw Corpus → Phrase Mining → Quality Phrases → Phrasal Segmentation → Segmented Corpus
ClassPhrase: Frequent pattern mining, feature extraction, classification
SegPhrase: Phrasal segmentation and phrase quality estimation
SegPhrase+: One more round to enhance mined phrase quality
Judging the quality of phrases
Popularity
“information retrieval” vs. “cross‐language information retrieval”
Concordance
“powerful tea” vs. “strong tea”
“active learning” vs. “learning classification”
Informativeness
“this paper” (frequent but not discriminative, not informative)
Completeness
“vector machine” vs. “support vector machine”
Deriving Concordance
Partition a phrase into two parts ⟨ul, ur⟩ and check whether their co-occurrence is significantly higher than pure random
E.g., "support vector machine" vs. "this paper demonstrates"
Pointwise mutual information: PMI(ul, ur) = log [ p(v) / (p(ul) p(ur)) ], where v denotes the whole phrase
Pointwise KL divergence: PKL(v ∥ ⟨ul, ur⟩) = p(v) log [ p(v) / (p(ul) p(ur)) ]
The additional p(v) multiplied with the pointwise mutual information leads to less bias towards rarely occurring phrases
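The difference between the two measures can be seen numerically: PMI rewards a rare chance collocation more than a genuinely frequent phrase, while the extra p(v) factor in pointwise KL divergence damps that bias. The probabilities below are illustrative:

```python
# Sketch contrasting PMI with pointwise KL divergence for concordance.
# PKL = p(v) * PMI, which down-weights rarely occurring phrases.
import math

def pmi(p_v, p_left, p_right):
    return math.log(p_v / (p_left * p_right))

def pkl(p_v, p_left, p_right):
    return p_v * pmi(p_v, p_left, p_right)

# A frequent quality phrase vs. a rare chance collocation
# (p(phrase), p(left part), p(right part)); values are made up.
frequent = (1e-4, 1e-3, 1e-3)
rare = (1e-7, 1e-5, 1e-5)
print(pmi(*frequent) < pmi(*rare))   # True: PMI prefers the rare pairing
print(pkl(*frequent) > pkl(*rare))   # True: PKL prefers the frequent phrase
```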
Deriving Informativeness
Quality phrases typically start and end with a non‐stopword
“machine learning is” vs. “machine learning”
Use average IDF over words in the phrase to measure the semantics
Usually, the probabilities of a quality phrase in quotes, brackets, or connected by dash should be higher (punctuation information)
“state‐of‐the‐art”
We can also incorporate features using some NLP techniques, such as POS tagging, chunking, and semantic parsing
Limited Training
Labels: Whether a phrase is a quality one or not
“support vector machine”: 1
“the experiment shows”: 0
For ~1GB corpus, only 300 labels
Random Forest as our classifier
Predicted phrase quality scores lie in [0, 1]
Bootstrap many different datasets from limited labels
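The "bootstrap many datasets from limited labels" idea is essentially ensemble bagging: resample the ~300 labels with replacement, train one weak classifier per resample, and average the votes into a quality score in [0, 1]. A self-contained toy sketch (the features, labels, and stump classifier are stand-ins, not the paper's actual feature set or random forest):

```python
# Sketch: bagging simple classifiers bootstrapped from limited labels,
# mirroring the random-forest setup described above. All data is synthetic.
import random

random.seed(0)
labeled = []
for _ in range(300):                       # ~300 labeled "phrases"
    x = (random.random(), random.random()) # two toy features
    labeled.append((x, int(x[0] + x[1] > 1.0)))  # toy quality label

def train_stump(sample):
    """Pick the single-feature threshold with lowest error on the sample."""
    best = (0, 0.5, 2.0)                   # (feature, threshold, error)
    for feat in (0, 1):
        for thr in (0.3, 0.5, 0.7):
            err = sum((x[feat] > thr) != y for x, y in sample) / len(sample)
            if err < best[2]:
                best = (feat, thr, err)
    return best[:2]

# Bootstrap: resample with replacement, train one weak model per resample
forest = [train_stump(random.choices(labeled, k=len(labeled)))
          for _ in range(25)]

def quality_score(x):
    votes = sum(x[feat] > thr for feat, thr in forest)
    return votes / len(forest)             # score lies in [0, 1]

print(quality_score((0.9, 0.9)), quality_score((0.05, 0.05)))  # 1.0 0.0
```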
Phrasal segmentation can tell which phrase is more appropriate
Ex: A standard [feature vector] [machine learning] setup is used to describe...
Rectified phrase frequency (expected influence): occurrences that cross segment boundaries are not counted towards the rectified frequency
Example: in the segmentation above, "vector machine" spans the boundary between [feature vector] and [machine learning], so this occurrence is not counted
Partition a sequence of words by maximizing the likelihood, considering:
Phrase quality score (ClassPhrase assigns a quality score to each phrase)
Probability in the corpus
Length penalty α: when α < 1, it favors shorter phrases
Filter out phrases with low rectified frequency
Bad phrases are expected to rarely occur in the segmentation results
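Likelihood-maximizing segmentation is naturally solved with dynamic programming: the best segmentation of each prefix extends the best segmentation of a shorter prefix by one segment, scored by quality, corpus probability, and length penalty. In this sketch the quality scores, default probabilities, and penalty form are illustrative stand-ins for the ClassPhrase scores described above:

```python
# Sketch: phrasal segmentation via dynamic programming. Each segment
# contributes log(quality) plus a per-extra-word length penalty; with
# alpha < 1 longer segments are penalized. All scores are made up.
import math

def segment(tokens, quality, alpha=0.8, max_len=4):
    n = len(tokens)
    best = [float("-inf")] * (n + 1)   # best log-likelihood of each prefix
    back = [0] * (n + 1)
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            phrase = " ".join(tokens[j:i])
            # unseen candidates get tiny default scores (multi-word: tinier)
            q = quality.get(phrase, 1e-3 if i - j == 1 else 1e-6)
            score = best[j] + math.log(q) + (i - j - 1) * math.log(alpha)
            if score > best[i]:
                best[i], back[i] = score, j
    segs, i = [], n                    # reconstruct the segmentation
    while i > 0:
        segs.append(" ".join(tokens[back[i]:i]))
        i = back[i]
    return segs[::-1]

quality = {"support vector machine": 0.9, "feature vector": 0.8,
           "machine learning": 0.9}
print(segment("a standard feature vector machine learning setup".split(),
              quality))
# ['a', 'standard', 'feature vector', 'machine learning', 'setup']
```

Note how the low default score for unseen multi-word candidates keeps "vector machine" from being chosen across the true segment boundary.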
SegPhrase+: One more round for enhanced phrasal segmentation
Feedback
Using rectified frequency, re‐compute those features previously computed based on raw frequency
Process:
SegPhrase: classification → phrasal segmentation
SegPhrase+: classification → phrasal segmentation → classification → phrasal segmentation
Effects on computing quality scores, e.g.:
"np hard in the strong sense" vs. "np hard in the strong"
"data base management system"
Other phrase mining methods to be compared:
NLP chunking based methods
Chunks as candidates
Sorted by TF‐IDF and C‐value (K. Frantzi et al., 2000)
Unsupervised raw frequency based methods
ConExtr (A. Parameswaran et al., VLDB 2010)
ToPMine (A. El‐Kishky et al., VLDB 2015)
Supervised method
KEA, designed for single document keyphrases (O. Medelyan & I. H. Witten, 2006)
Datasets
Popular Wiki Phrases
Based on internal links
~7K high quality phrases
Pooling
Sampled 500 * 7 Wiki‐uncovered phrases
Evaluated by 3 reviewers independently
Dataset | #docs | #words | #labels
DBLP | 2.77M | 91.6M | 300
Yelp | 4.75M | 145.1M | 300
Compared with other baselines: TF-IDF, C-Value, ConExtr, KEA, ToPMine, SegPhrase+
Compared among variants: TF-IDF, ClassPhrase, SegPhrase, SegPhrase+
SegPhrase+ scales linearly with the size of the corpus!
Query "SIGMOD":

Rank | SegPhrase+ | Chunking (TF-IDF & C-Value)
1 | data base | data base
2 | database system | database system
3 | relational database | query processing
4 | query optimization | query optimization
5 | query processing | relational database
… | … | …
51 | sql server | database technology
52 | relational data | database server
53 | data structure | large volume
54 | join query | performance study
55 | web service | web service
… | … | …
201 | high dimensional data | efficient implementation
202 | location based service | sensor network
203 | xml schema | large collection
204 | two phase locking | important issue
205 | deep web | frequent itemset

(Highlighting on the original slide marks phrases found only by SegPhrase+ or only by Chunking.)
Query "SIGKDD":

Rank | SegPhrase+ | Chunking (TF-IDF & C-Value)
1 | data mining | data mining
2 | data set | association rule
3 | association rule | knowledge discovery
4 | knowledge discovery | frequent itemset
5 | time series | decision tree
… | … | …
51 | association rule mining | search space
52 | rule set | domain knowledge
53 | concept drift | important problem
54 | knowledge acquisition | concurrency control
55 | gene expression data | conceptual graph
… | … | …
201 | web content |
202 | frequent subgraph | semantic relationship
203 | intrusion detection | effective way
204 | categorical attribute | space complexity
205 | user preference | small set

(Highlighting on the original slide marks phrases found only by SegPhrase+ or only by Chunking.)
Similarity search: in response to a user's phrase query, SegPhrase+ finds high-quality, semantically similar phrases
In DBLP, query on “data mining” and “OLAP”
In Yelp, query on “blu‐ray”, “noodle”, and “valet parking”
Distant Training: no need for human labeling
Training using general knowledge bases
E.g., Freebase, Wikipedia
Quality Estimation for Unigrams
Integration of phrases and unigrams in one uniform framework
Demo based on DBLP abstract
Multiple languages: beyond English corpora
Extensible to mining quality phrases in multiple languages
Recent progress: SegPhrase+ works on Chinese, Arabic and Spanish
Rank | Phrase | In English
… | … | …
62 | 首席_执行官 | CEO
63 | 中间_偏右 | Middle-right
… | … | …
84 | 百度_百科 | Baidu Pedia
85 | 热带_气旋 | Tropical cyclone
86 | 中国科学院_院士 | Fellow of Chinese Academy of Sciences
… | … | …
1001 | 十大_中文_金曲 | Top-10 Chinese Songs
1002 | 全球_资讯网 | Global News Website
1003 | 天一阁_藏_明代_科举_录_选刊 | A Chinese book name
… | … | …
9934 | 国家_戏剧_院 | National Theater
9935 | 谢谢_你 | Thank you
… | … | …
Northrop Grumman, Ashfaq Kayani, Sania Mirza, Pius Xii, Shakhtar Donetsk, Kyaw Zaw Lwin
Ratko Mladic, Abdolmalek Rigi, Rubin Kazan, Rajon Rondo, Rubel Hossain, bluefin tuna
Psv Eindhoven, Nicklas Bendtner, Ryo Ishikawa, Desmond Tutu, Landon Donovan, Jannie du Plessis
Zinedine Zidane, Uttar Pradesh, Thor Hushovd, Andhra Pradesh, Jafar_Panahi, Marouane Chamakh
Rahm Emanuel, Yakubu Aiyegbeni, Salva Kiir, Abdelhamid Abou Zeid, Blaise Compaore, Rickie Fowler
Andry Rajoelina, Merck Kgaa, Js Kabylie, Arjun Atwal, Andal Ampatuan Jnr, Reggio Calabria, Ghani Baradar
Mahela Jayawardene, Jemaah Islamiyah, quantitative easing, Nodar Kumaritashvili, Alviro Petersen
Rumiana Jeleva, Helio Castroneves, Koumei Oda, Porfirio Lobo, Anastasia Pavlyuchenkova
Thaksin Shinawatra, Evgeni_Malkin, Salvatore Sirigu, Edoardo Molinari, Yoshito Sengoku
Otago Highlanders, Umar Akmal, Shuaibu Amodu, Nadia Petrova, Jerzy Buzek, Leonid Kuchma,
Alona Bondarenko, Chosun Ilbo, Kei Nishikori, Nobunari Oda, Kumbh Mela, Santo_Domingo
Nicolae Ceausescu, Yoann Gourcuff, Petr Cech, Mirlande Manigat, Sulieman Benn, Sekouba Konate
Strategy 1: Generate bag‐of‐words → generate sequence of tokens
Integrated complex model; phrase quality and topic inference rely on each other
Slow and prone to overfitting
Strategy 2: Post bag‐of‐words model inference, visualize topics with n‐grams
Phrase quality relies on topic labels for unigrams
Can be fast; generally high‐quality topics and phrases
Strategy 3: Prior bag‐of‐words model inference, mine phrases and impose to the bag‐of‐words model
Topic inference relies on correct segmentation of documents, but is not very sensitive to segmentation errors
Can be fast; generally high‐quality topics and phrases
SegPhrase+: A new phrase mining framework
Integrating phrase mining with phrasal segmentation
Requires only limited training or distant training
Generates high‐quality phrases, close to human judgement
Linearly scalable in time and space
Looking forward: High‐quality, scalable phrase mining
Facilitate entity recognition and typing in large corpora (See the next part of this tutorial)
Combine with linguistic‐rich patterns
Transform massive unstructured data into semi‐structured knowledge networks
D. M. Blei and J. D. Lafferty. Visualizing Topics with Multi-Word Expressions. arXiv:0907.1013, 2009.
K. Church, W. Gale, P. Hanks, D. Hindle. Using Statistics in Lexical Analysis. In U. Zernik (ed.), Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon. Lawrence Erlbaum, 1991.
M. Danilevsky, C. Wang, N. Desai, X. Ren, J. Guo, J. Han. Automatic Construction and Ranking of Topical Keyphrases on Collections of Short Documents. SDM'14.
A. El-Kishky, Y. Song, C. Wang, C. R. Voss, J. Han. Scalable Topical Phrase Mining from Text Corpora. VLDB'15.
K. Frantzi, S. Ananiadou, H. Mima. Automatic Recognition of Multi-Word Terms: the C-value/NC-value Method. 2000.
R. V. Lindsey, W. P. Headden III, M. J. Stipicevic. A Phrase-Discovering Topic Model Using Hierarchical Pitman-Yor Processes. EMNLP-CoNLL'12.
J. Liu, J. Shang, C. Wang, X. Ren, J. Han. Mining Quality Phrases from Massive Text Corpora. SIGMOD'15.
O. Medelyan and I. H. Witten. Thesaurus Based Automatic Keyphrase Indexing. IJCDL'06.
Q. Mei, X. Shen, C. Zhai. Automatic Labeling of Multinomial Topic Models. KDD'07.
A. Parameswaran, H. Garcia-Molina, A. Rajaraman. Towards the Web of Concepts: Extracting Concepts from Large Datasets. VLDB'10.
X. Wang, A. McCallum, X. Wei. Topical N-grams: Phrase and Topic Discovery, with an Application to Information Retrieval. ICDM'07.
Plain text:
The best BBQ I've tasted in Phoenix! I had the pulled pork sandwich with coleslaw and baked beans for lunch. … The owner is very nice. …

Text with typed entities:
The best BBQ:Food I've tasted in Phoenix:LOC! I had the [pulled pork sandwich]:Food with coleslaw:Food and [baked beans]:Food for lunch. … The owner:JOB_TITLE is very nice. …

Target types: FOOD, LOCATION, JOB_TITLE, EVENT, ORGANIZATION, …
Task: identify token spans as entity mentions in documents and label their types, enabling structured analysis of an unstructured text corpus
Extracting and linking entities can be used in a variety of ways:
serve as primitives for information extraction and knowledge base population
assist question answering,…
Traditional named entity recognition systems are designed for major types (e.g., PER, LOC, ORG) and general domains (e.g., news)
Require additional steps to adapt to new domains/types
Expensive human labor on annotation
500 documents for entity extraction; 20,000 queries for entity linking
Unsatisfactory annotation agreement due to the various granularity levels and scopes of types
Entities obtained by entity linking techniques have limited coverage and freshness
>50% unlinkable entity mentions in Web corpus [Lin et al., EMNLP’12]
>90% in our experiment corpora: tweets, Yelp reviews, …
Typical Entity Extraction Features (Li et al., 2012)
Typical Entity Linking Features (Ji et al., 2011):

Attribute | Description
Name: Spelling match | Exact string match, acronym match, alias match, string matching, …
Name: KB link mining | Name pairs mined from KB text, redirect, and disambiguation pages
Name: Gazetteer | Organization and geo-political entity abbreviation gazetteers
Document surface: Lexical | Words in KB facts, KB text, mention name, mention text; tf.idf of words and n-grams
Document surface: Position | Mention name appears early in KB text
Document surface: Genre | Genre of the mention text (newswire, blog, …)
Document surface: Local Context | Lexical and part-of-speech tags of context words
Entity Context: Type | Mention concept type, subtype
Entity Context: Relation/Event | Concepts co-occurred, attributes/relations/events with mention
Entity Context: Coreference | Coreference links between the source document and the KB text
Profiling | Slot fills of the mention, concept attributes stored in KB infobox
Concept | Ontology extracted from KB text
Topic | Topics (identity and lexical similarity) for the mention text and KB text
KB Link Mining | Attributes extracted from hyperlink graphs of the KB text
Popularity: Web | Top KB text ranked by search engine and its length
Popularity: Frequency | Frequency in KB texts
Acquire labels for a small amount of instances
Construct a relational graph to connect labeled instances and unlabeled instances
Construct edges based on coarse‐grained data‐driven statistics instead of fine‐ grained linguistic similarity
Mention correlation
Text co‐occurrence
Semantic relatedness based on knowledge graph embeddings
Social networks
Label propagation across the graph
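The propagation step above can be sketched with a standard clamped label-propagation loop: seed nodes carry known type labels, and every other node repeatedly adopts the weighted average of its neighbors' label distributions. The graph and seeds below are toy examples, not real mention data:

```python
# Minimal sketch of label propagation on a relational graph: seeds are
# clamped; other nodes converge to the average of their neighbors.
import numpy as np

def propagate(W, seeds, n_types, iters=50):
    n = W.shape[0]
    F = np.zeros((n, n_types))             # per-node type distributions
    for node, t in seeds.items():
        F[node, t] = 1.0
    d_inv = 1.0 / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)
    for _ in range(iters):
        F = d_inv * (W @ F)                # average neighbors' distributions
        for node, t in seeds.items():      # clamp the seed labels
            F[node] = 0.0
            F[node, t] = 1.0
    return F

# 4-node chain 0-1-2-3: node 0 seeded with type 0, node 3 with type 1
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
F = propagate(W, {0: 0, 3: 1}, n_types=2)
print(F[1].argmax(), F[2].argmax())  # 0 1 (each node takes its nearer seed's type)
```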
Goal: recognizing entity mentions of target types with minimal/no human supervision and with no requirement that entities can be found in a KB.
Two kinds of efforts towards this goal:
Weak supervision: relies on manually selected seed entities in applying pattern-based bootstrapping methods or label propagation methods to identify more entities
Both assume seeds are unambiguous and sufficiently frequent, which requires careful seed selection by humans
Distant supervision: leverages entity information in KBs to reduce human supervision
Detect entity mentions from text
Map candidate mentions to KB entities of target types
Use confidently mapped {mention, type} to infer types of remaining candidate mentions
Distantly‐supervised entity recognition in a domain‐specific corpus
Given:
a corpus D
a knowledge base (e.g., Freebase)
a set of target types (T) from a KB
Detect candidate entity mentions from corpus D
Categorize each candidate mention by target types or Not-Of-Interest (NOI), with distant supervision
Most existing work assumes entity mentions are already extracted by existing entity detection tools, e.g., noun phrase chunkers
Usually trained on general‐domain corpora like news articles (clean, grammatical)
Make use of various linguistics features (e.g., semantic parsing structures)
Do not work well on specific, dynamic or emerging domains (e.g., tweets, Yelp reviews)
E.g., “in‐and‐out” from Yelp review may not be properly detected
Multiple entities may share the same surface name
Previous methods simply output a single type/type distribution for each surface name, instead of an exact type for each entity mention
Example: "Washington" may refer to a sports team, the U.S. government, the U.S. capital city, or Washington State:
"While Griffin is not the part of Washington's plan on Sunday's game, …" → sports team
"…has concern that Kabul is an ally of Washington." → U.S. government
"He has office in Washington, Boston and San Francisco" → U.S. capital city
Further ambiguous mentions: "… news from Washington indicates that the congress is going to…"; "It is one of the best state parks in Washington."
A variety of contextual clues are leveraged to find sources of shared semantics across different entities
Keywords, Wiki concepts, linguistic patterns, textual relations, …
There are often many ways to describe even the same relation between two entities
Previous methods have difficulties in handling entity mention with sparse (infrequent) context
ID | Sentence | Freq
1 | The magnitude 9.0 quake caused widespread devastation in [Kesennuma city] | 12
2 | … tsunami that ravaged [northeastern Japan] last Friday | 31
3 | The resulting tsunami devastate [Japan]'s northeast | 244
Domain-agnostic phrase mining algorithm: extracts candidate entity mentions with minimal linguistic assumptions (addresses domain restriction)
E.g., part-of-speech (POS) tagging is far cheaper than semantic parsing
Do not simply merge entity mentions with identical surface names: model each mention based on its surface name and context, in a scalable way (addresses name ambiguity)
Mine relation phrases co-occurring with entity mentions and infer synonymous relation phrases: helps form connecting bridges among entities that do not share identical context, but share synonymous relation phrases (addresses context sparsity)
POS‐constrained phrase segmentation for mining candidate entity mentions and relation phrases, simultaneously
Construct a heterogeneous graph to represent the available information in a unified form; entity mentions are kept as individual objects to be disambiguated, linked to entity surface names and relation phrases
With the constructed graph, formulate a graph-based semi-supervised learning problem:
Type propagation on the heterogeneous graph, jointly with multi-view relation phrase clustering
Propagate type information among entities bridged via synonymous relation phrases
Derived entity argument types serve as good features for clustering relation phrases
The two tasks mutually enhance each other, leading to quality recognition of unlinkable entity mentions
1. Perform phrase mining on a POS-tagged corpus to extract candidate entity mentions and relation phrases
2. Construct a heterogeneous graph to encode our insights on modeling the type for each entity mention
3. Collect seed entity mentions as labels by linking extracted mentions to the KB
4. Estimate type indicators for unlinkable candidate mentions with the proposed type propagation integrated with relation phrase clustering on the constructed graph
An efficient phrase mining algorithm incorporating both corpus-level statistics and syntactic constraints
Global significance score filters low-quality candidates; generic POS tag patterns remove phrases with improper syntactic structure
By extending ToPMine, the algorithm partitions the corpus into segments that meet both the significance threshold and the POS patterns → candidate entity mentions & relation phrases
Algorithm workflow:
1. Mine frequent contiguous patterns
2. Perform greedy-agglomerative merging while enforcing the syntactic constraints
3. Terminate when the next highest-score merge does not meet a pre-defined significance threshold
Relation phrase: a phrase that denotes a unary or binary relation in a sentence
Example output of candidate generation on NYT news articles
Entity detection performance comparison with an NP chunker
Recall is most critical for this step, since later we cannot detect the misses (i.e., false negatives)
With three types of objects extracted from corpus: candidate entity mentions, entity surface names, and relation phrases
We can construct a heterogeneous graph to enforce several hypotheses for modeling type of each entity mention (introduced in the following slides)
Basic idea for constructing the graph: the more two objects are likely to share the same label, the larger the weight will be associated with their connecting edge
Three types of links:
between entity mentions and entity surface names
between entity surface names and relation phrases
similarity links between entity mentions
Directly modeling the type indicator of each entity mention in label propagation leads to an intractable parameter space
Both the entity name and the surrounding relation phrases provide strong cues on the type of a candidate entity mention
Model the type of each entity mention by (1) the type indicator of its surface name and (2) the type signatures of its surrounding relation phrases (more details in the following slides)
Example: "…has concerns whether Kabul is an ally of Washington" → Washington: GOVERNMENT (candidate types for the surface name "Washington": government, state; relation phrase: "is an ally of")
With M candidate mentions and n surface names, a bi-adjacency matrix represents the mapping between mentions and surface names
Aggregated co-occurrences between entity surface names and relation phrases across the corpus weight the importance of different relation phrases for each surface name; the connecting edges serve as bridges to propagate type information
Left/right entity argument of a relation phrase: for each mention, assign it as the left (right, resp.) argument of the closest relation phrase on its right (left, resp.) in the sentence
Type signature of a relation phrase: two type indicators, one for its left and one for its right argument
With l different relation phrases, the mapping between mentions and relation phrases is represented by two bi-adjacency matrices for this subgraph
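The argument-assignment rule above can be sketched directly: each mention becomes the left argument of the closest relation phrase to its right, and the right argument of the closest relation phrase to its left. Positions are token offsets, and the sentence and spans are illustrative:

```python
# Sketch of left/right argument assignment for relation phrases.
# mentions / relation_phrases are (text, position) pairs; data is toy.
def assign_arguments(mentions, relation_phrases):
    args = {rp: {"left": None, "right": None} for rp, _ in relation_phrases}
    for m_text, m_pos in mentions:
        right_rps = [(rp, p) for rp, p in relation_phrases if p > m_pos]
        left_rps = [(rp, p) for rp, p in relation_phrases if p < m_pos]
        if right_rps:   # closest relation phrase on the mention's right
            rp = min(right_rps, key=lambda x: x[1])[0]
            args[rp]["left"] = m_text
        if left_rps:    # closest relation phrase on the mention's left
            rp = max(left_rps, key=lambda x: x[1])[0]
            args[rp]["right"] = m_text
    return args

# "Kabul is an ally of Washington"
mentions = [("Kabul", 0), ("Washington", 5)]
relations = [("is an ally of", 1)]
print(assign_arguments(mentions, relations))
# {'is an ally of': {'left': 'Kabul', 'right': 'Washington'}}
```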
An entity mention may have an ambiguous name and ambiguous relation phrases
E.g., "White House" and "felt" in the first sentence of the figure
Other co-occurring mentions may provide good hints to the type of an entity mention
E.g., "birth certificate" and "rose garden" in the figure
Construct a KNN graph based on the feature vector f (surface names of co-occurring entity mentions)
Propagate type information between candidate mentions of each surface name based on this mention correlation graph
Observation: many relation phrases have very few occurrences in the corpus
~37% relation phrases have <3 unique entity surface names (in right or left arguments)
Hard to model their type signature based on aggregated co‐occurrences with entity surface names (i.e., Hypothesis 1)
Softly clustering synonymous relation phrases: the type signatures of frequent relation phrases can help infer the type signatures of infrequent (sparse) ones that have similar cluster memberships
Existing work on relation phrase clustering utilizes string similarity, context words, and entity arguments to cluster synonymous relation phrases
String similarity and distributional similarity may be insufficient to resolve two relation phrases; type information is particularly helpful in such cases
We leverage the type signatures of relation phrases and propose a general multi-view relation phrase clustering method that incorporates different features, further integrated with the graph-based type propagation in a mutually enhancing framework, based on the following hypothesis
Clustering views: type signatures, string features, context features
Mention modeling & mention correlation (Hypo 2) Multi‐view relation phrases clustering (Hypo 3 & 4) Type propagation between entity surface names and relation phrases (Hypo 1)
The ClusType algorithm:
Repeat until the objective converges:
Update type indicators and type signatures: for each view, perform single-view NMF until convergence
Update the consensus matrix and the relative weights of different views
The problem can be efficiently solved by alternating minimization based on a block coordinate descent algorithm
Algorithm complexity is linear in the number of entity mentions, relation phrases, clusters, clustering features, and target types
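The inner single-view NMF step can be illustrated with generic multiplicative updates on a tiny matrix; this is a sketch under simplifying assumptions (Lee-Seung-style Frobenius updates, toy data), not ClusType's actual multi-view objective or solver.

```python
# Toy single-view NMF via multiplicative updates: factor X ~ W.H with
# non-negative factors and check the reconstruction error shrinks.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def nmf(X, k, iters=500, eps=1e-9):
    n, m = len(X), len(X[0])
    # deterministic, slightly asymmetric non-negative init
    W = [[0.5 + 0.01 * ((i + j) % 3) for j in range(k)] for i in range(n)]
    H = [[0.5 + 0.01 * ((i + j) % 5) for j in range(m)] for i in range(k)]
    for _ in range(iters):
        WH, Wt = matmul(W, H), transpose(W)
        num, den = matmul(Wt, X), matmul(Wt, WH)   # H <- H * (WtX)/(WtWH)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        WH, Ht = matmul(W, H), transpose(H)
        num, den = matmul(X, Ht), matmul(WH, Ht)   # W <- W * (XHt)/(WHHt)
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return W, H

def err(X, W, H):
    WH = matmul(W, H)
    return sum((X[i][j] - WH[i][j]) ** 2
               for i in range(len(X)) for j in range(len(X[0])))

X = [[1, 0, 1], [0, 1, 0], [1, 0, 1]]  # rank-2 toy matrix
W, H = nmf(X, 2)
print(round(err(X, W, H), 4))
```

In the full algorithm this update is run per view, followed by the consensus/weight update, inside the outer block-coordinate-descent loop.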
Datasets: 2013 New York Times news (~110k docs) [event, PER, LOC, ORG]; Yelp Reviews (~230k) [Food, Job, …]; 2011 Tweets (~300k) [event, product, PER, LOC, …]
Seed mention sets: <7% of extracted mentions are mapped to Freebase entities
Evaluation sets: manually annotate mentions of target types for subsets of the corpora
Evaluation metrics: Follows named entity recognition evaluation (Precision, Recall, F1)
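For concreteness, mention-level precision/recall/F1 can be computed over exact (span, type) matches; the mentions below are made up.

```python
# Standard NER-style evaluation: a prediction counts as correct only if both
# the span and the type match a gold mention exactly.

def prf1(gold, pred):
    """gold, pred: sets of (doc_id, start, end, type) tuples."""
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = {("d1", 0, 2, "FOOD"), ("d1", 5, 6, "LOC"), ("d2", 1, 3, "PER")}
pred = {("d1", 0, 2, "FOOD"), ("d1", 5, 6, "ORG"), ("d2", 1, 3, "PER")}
print(prf1(gold, pred))  # P = R = 2/3 (the LOC/ORG mismatch is an error)
```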
Compared methods
Pattern: Stanford pattern-based learning; SemTagger: bootstrapping method that trains a contextual classifier based on seed mentions; FIGER: distantly-supervised sequence labeling method trained on a Wiki corpus; NNPLB: label propagation using ReVerb assertions and seed mentions; APOLLO: mention-level label propagation using Wiki concepts and KB entities;
ClusType-NoWm: ignores mention correlation; ClusType-NoClus: conducts only type propagation; ClusType-TwoStep: first performs hard clustering, then type propagation
Hypotheses on type propagation: our method (i) uses relation phrases as type cues, but clusters only synonymous relation phrases to tackle context sparsity; (ii) integrates clustering in a mutually enhancing way
46.08% and 48.94% improvement in F1 score compared to the best baseline on the Tweet and the Yelp datasets, respectively
Obtains larger gain on organization and person (more entities with ambiguous surface names)
Modeling types at the entity mention level is critical for name disambiguation
Superior performance on product and food mainly comes from the domain independence of our method
Both NNPLB and SemTagger require sophisticated linguistic feature generation, which is hard to adapt to new types
Compare with Stanford NER, which is trained on general‐domain corpora including ACE corpus and MUC corpus, on three types: PER, LOC, ORG
ClusType and its variants outperform Stanford NER on both dynamic corpus (NYT) and domain‐specific corpus (Yelp)
ClusType has lower precision but higher recall and F1 score on Tweets; the superior recall of ClusType mainly comes from its domain-independent candidate generation
Extracts more mentions and predicts types with higher accuracy
Not only synonymous relation phrases but also sparse and frequent relation phrases can be clustered together, boosting sparse relation phrases with type information
Context sparsity:
Group A: frequent relation phrases
Group B: sparse relation phrases
ClusType obtains superior performance over its variants on Group B
Clustering relation phrases is critical for sparse relation phrases
Surface name popularity:
Group A: high frequency surface name
Group B: infrequent surface name
ClusType outperforms its variants on Group B
Handles well mentions with insufficient corpus statistics
Study distantly‐supervised entity recognition for domain‐specific corpora and propose a novel relation phrase‐based framework
A data‐driven, domain‐agnostic phrase mining algorithm for candidate entity mentions and relation phrase generation
Integrate relation phrase clustering with type propagation on heterogeneous graphs, and formulate it as a joint optimization problem. Ongoing work:
Extend to role discovery for scientific concepts and paper profiling (research/demo)
Study of relation phrase clustering, such as
joint entity/relation clustering
synonymous relation phrase canonicalization
Study of joint entity and relation phrase extraction with phrase mining
Stay up Hawk Fans. We are going through a slump, but we have to stay positive. Go Hawks!
A meta‐path is a path defined over a network and composed of a sequence of relations between different object types (Sun et al., 2011)
Each meta path represents a semantic relation (more in the next theme)
Meta paths between two mentions
M‐T‐M
M-T-U-T-M
M‐T‐H‐T‐M
M‐T‐U‐T‐M‐T‐H‐T‐M
M-T-H-T-M-T-U-T-M
(M: mention, T: tweet, U: user, H: hashtag)
Schema of a Heterogeneous Information Network in Twitter
Local Compatibility
Mention Features (e.g., idf, keyphraseness)
Concept Features (e.g., # of incoming/outgoing links)
Mention + Concept Features (e.g., prior popularity, tf)
Context Features (e.g., capitalization, tf‐idf)
Coreference: at least one meta path exists between two similar mentions
Semantic Knowledge Graphs
[Figure: example knowledge graph around "Miami Heat": Type: Professional Sports Team; Location: Miami; Founded: 1988; Coach: Erik Spoelstra; Roster Member: Dwyane Wade; league: National Basketball Association; shown alongside the unrelated concept "Titanic"]
[Figure: DSRM deep neural network: for each concept, a feature vector over descriptions Di (1m), categories Ci (4m), relations Ri (3.2k) and concept types CTi (1.6k); a word hashing layer of 105k units (50k + 50k + 3.2k + 1.6k); multi-layer non-linear projections of 300-300-300 units; and semantic vectors x, y whose cosine similarity gives the semantic relatedness SR(ci, cj)]
queries (Ceccarelli et al., 2013)
7.5% absolute F1 gain over the state-of-the-art supervised models:
Method          Precision  Recall  F1
TagMe           32.9%      42.3%   37.0%
Meij            39.3%      59.8%   47.5%
SSRegu + M&W    65.0%      44.1%   52.5%
SSRegu + DSRM   59.0%      51.6%   55.0%
Semantic relatedness scores between a sample of concepts and the concept "National Basketball Association" in the sports domain:
Concept             M&W   DSRM
New York City       0.92  0.22
New York Knicks     0.78  0.79
Washington, D.C.    0.80  0.30
Washington Wizards  0.60  0.85
Atlanta             0.71  0.39
Atlanta Hawks       0.53  0.83
Houston             0.55  0.37
Houston Rockets     0.49  0.80
Compared to traditional label propagation‐based methods
Data driven methods to construct relational graphs
Integrate multi‐view clustering with graph‐based label propagation
Compared to traditional distantly‐supervised methods
Domain‐independent entity candidate generation: jointly extract entity mentions and their relation phrases in an unsupervised way
Resolve context (relation phrase) sparsity by integrating type propagation with relation clustering
Fully exploit entries and structures in knowledge bases
Potential Applications to other NLP Tasks
Where data annotation is costly
Where a relational graph among labeled seeds and unlabeled instances can be constructed based
Toward fully “Liberal” IE: Discover schema and extract/link entities simultaneously
X. Ren, A. El-Kishky, C. Wang, F. Tao, C. R. Voss, H. Ji and J. Han. ClusType: Effective Entity Recognition and Typing by Relation Phrase-Based Clustering. KDD'15.
H. Huang, Y. Cao, X. Huang, H. Ji and C. Lin. Collective Tweet Wikification based on Semi-supervised Graph Regularization. ACL'14.
T. Lin, O. Etzioni, et al. No noun phrase left behind: detecting and typing unlinkable entities. EMNLP'12.
N. Nakashole, T. Tylenda, and G. Weikum. Fine-grained semantic typing of emerging entities. ACL'13.
R. Huang and E. Riloff. Inducing domain-specific semantic class taggers from (almost) nothing. ACL'10.
X. Ling and D. S. Weld. Fine-grained entity recognition. AAAI'12.
W. Shen, J. Wang, P. Luo, and M. Wang. A graph-based approach for ontology population with named entities. CIKM'12.
S. Gupta and C. D. Manning. Improved pattern learning for bootstrapped entity extraction. CoNLL'14.
P. P. Talukdar and F. Pereira. Experiments in graph-based semi-supervised learning methods for class-instance acquisition. ACL'10.
Z. Kozareva and E. Hovy. Not all seeds are equal: Measuring the quality of text mining seeds. NAACL'10.
L. Galárraga, G. Heitz, K. Murphy, and F. M. Suchanek. Canonicalizing open knowledge bases. CIKM'14.
Directly converting a large amount of data into knowledge is too ambitious and unrealistic
Once we construct graphs or heterogeneous networks, we can perform many powerful search and mining tasks
Many NLP problems are about graphs and networks – what’s the DM point of view?
Search: Graph index creation and approximate structural search
NLP Application: Schemaless queries
Classification: Graph pattern mining and pattern‐based classification
NLP Application: Distinguishing authors based on their writing styles
Meta‐path‐based similarity search in heterogeneous networks
NLP Application: Entity morph decoding
Meta‐path‐based mining in heterogeneous networks: Prediction and recommendation
NLP Application: Knowledge base completion; multi‐hop question answering, causal event prediction
Graph query: Find all the graphs in a graph DB containing a given query graph
Graph DB: graphs (a), (b), (c); query graph Q
Path-indices C, C-C, C-C-C, C-C-C-C cannot prune (a) & (b); only graph (c) contains Q
Index should be a powerful tool
Path‐index may not work well
Solution: Index directly on substructures (i.e., graphs)
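The filtering logic can be sketched on toy labeled graphs (hypothetical data): a graph survives path-based filtering iff its path set contains every query path, which is only a necessary condition for actually containing the query, so path indices can fail to prune.

```python
# Toy sketch of path-based index filtering over string-labeled graphs.

def path_labels(adj, labels, max_len=4):
    """All label sequences along simple paths of up to max_len nodes."""
    out = set()
    def walk(node, visited, seq):
        out.add(tuple(seq))
        if len(seq) < max_len:
            for nxt in adj.get(node, []):
                if nxt not in visited:
                    walk(nxt, visited | {nxt}, seq + [labels[nxt]])
    for start in adj:
        walk(start, {start}, [labels[start]])
    return out

def filter_candidates(db, query_paths):
    """A graph survives filtering iff it contains every query path."""
    return [gid for gid, paths in db.items() if query_paths <= paths]

# two toy graphs: g1 = A-B-C chain, g2 = A-C chain; query = an A-B edge
g1 = path_labels({1: [2], 2: [1, 3], 3: [2]}, {1: "A", 2: "B", 3: "C"})
g2 = path_labels({1: [2], 2: [1]}, {1: "A", 2: "C"})
q = path_labels({1: [2], 2: [1]}, {1: "A", 2: "B"})
print(filter_candidates({"g1": g1, "g2": g2}, q))  # ['g1']
```

Indexing whole substructures instead of paths (as gIndex does) tightens this filter.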
Entity Linking (Pan et al., 2015): query graph constructed from NL documents using Abstract Meaning Representation, matched against a knowledge graph
Bursty Information Network Decipherment (Tao et al., 2015, in submission): query graph constructed from Chinese text, matched against an English knowledge graph
Why index frequent substructures?
Too many substructures to index
Size‐increasing support threshold
Large structures will likely be indexed well by their substructures
Why discriminative substructures?
[Figure: size-increasing support threshold: the min-support threshold grows with the structure size]
Reduce the index size by an order of magnitude
Selection: Given a set of selected structures f1, f2, …, fn, and a new structure x, the extra indexing power is measured by Pr(x | f1, f2, …, fn); when this probability is small enough, x is a discriminative structure and should be included in the index
Experiments show gIndex is small, effective and stable
All structures (>10^6) → frequent (~10^5) → discriminative (~10^3)
The necessity can be clearly shown in searching for similar chemical compounds
How to conduct approximate search? Build indices covering all the similar subgraphs? No!
Idea: (1) keep the index structure; (2) select features in the query space
Query relaxation measure: features are more important than strict substructure matching
Only need to index a set of smaller features
[Figure: (a) caffeine, (b) diurobromine, (c) viagra; query graph, data graphs, and feature set]
[Figure: heterogeneous data graph: Video, Photo, Music, Business, University nodes with join/like/follow edges; Yellowstone NP, Bison, Mammal, Football League, Country; entity record: Name: Bison, Class: Mammal, Phylum: Chordate, Order: Even-toed ungulate, Comment: Bison are large even-toed ungulates within the subfamily ...] Courtesy of Shengqi Yang, UCSB
Ex: When was UIUC founded? And how?
Extracting entities and relationships from natural language queries
Mismatch between knowledge graph and queries: figure out the best transformation
Knowledge Graph                                     Query
"the University of Illinois at Urbana-Champaign"    "UIUC"
"neoplasm"                                          "tumor"
"Doctor"                                            "Dr."
"Barack Obama"                                      "Obama"
"Jeffrey Jacob Abrams"                              "J. J. Abrams"
"teacher"                                           "educator"
"1980"                                              "~30"
"3 mi"                                              "4.8 km"
"Hinton" - "DNNresearch" - "Google"                 "Hinton" - "Google"
…
Users freely post queries in natural language, without knowledge on data graphs
Schemaless query (SLQ) system finds results through a set of transformations
Ex.
Query: "Prof., ~70 yrs", "UT", "Google"
A match: Geoffrey Hinton (Professor, 1947), University of Toronto, DNNresearch, Google
Acronym transformation: 'UT' → 'University of Toronto'
Abbreviation transformation: 'Prof.' → 'Professor'
Numeric transformation: '~70' → '1947'
Structural transformation: an edge → a path
1. Sampling: a set of subgraphs is randomly extracted from the data graph
2. Query generation: queries are generated by randomly adding transformations to the extracted subgraphs
3. Searching: the generated queries are searched on the data graph
4. Labeling: the returned results are labeled
5. Training: queries + labeled results are used to estimate the weights of different transformations
Finding the results to a query = finding a configuration of latent random variables in CRFs
Inference in Conditional Random Fields (CRFs)
Top-1 result: computing the most likely assignment
Approximate inference: Loopy Belief Propagation
Top-K result generation: finding the M most probable configurations using Loopy Belief Propagation [Yanover et al., NIPS'04]
Experiments: Queries generated on YAGO2
SLQ [Yang et al., VLDB'14] shows high performance
Writing style is one of the best indicators of original authorship
Substructures: k-embedded-edge subtree patterns hold more information than basic syntactic features: function words, POS (part-of-speech) tags, and rewrite rules
Ex.: a k-embedded-edge subtree t mined from NYT journalists Jack Healy and Eric Dash: on average, 21.2% of Jack's sentences contained t while only 7.2% of Eric's did
Binned information gain score distribution of various feature sets:
FW: function words
POS: POS tags
BPOS: bigram POS tags
RR: rewrite rules
k-ee: k-embedded-edge subtrees
# features for the FW, POS, RR and k-ee feature sets
Accuracy comparison on # authors and various datasets
Network construction: generates structured networks from unstructured text data
Each node: an entity; each link: a relationship between entities
Each node/link may have attributes, labels, and weights
Heterogeneous, multi-typed networks, e.g., a medical network: patients, doctors, diseases, contacts, treatments
Examples: the DBLP bibliographic network (Venue, Paper, Author); the IMDB movie network (Actor, Movie, Director, Studio); the Facebook network
A homogeneous network can be derived from its "parent" heterogeneous network
Ex.: coauthor networks from the original author-paper-conference network
Heterogeneous networks carry richer information than the projected homogeneous networks
Typed nodes & links imply more structures, leading to richer discovery
Ex.: DBLP, a Computer Science bibliographic database (network)
Knowledge hidden in the DBLP network → mining functions:
Who are the leading researchers on Web search? → Ranking
Who are the peer researchers of Jure Leskovec? → Similarity Search
Whom will Christos Faloutsos collaborate with? → Relationship Prediction
Which relationships are most influential for an author to decide her topics? → Relation Strength Learning
How has the field of Data Mining emerged and evolved? → Network Evolution
Which authors are rather different from their peers in IR? → Outlier/anomaly detection
Similarity measure/search is the basis for cluster analysis
Who are the most similar to Christos Faloutsos based on the DBLP network?
Meta-Path: meta-level description of a path between two objects
A path on the network schema
Denotes an existing or concatenated relation between two object types
Different meta-paths tell different semantics
Meta-path A-P-A (Author-Paper-Author), co-authorship: Christos' students or close collaborators
Meta-path A-P-V-P-A (Author-Paper-Venue-Paper-Author): work in similar fields with similar reputation
Random walk (RW): the probability of a random walk starting at x and ending at y, following meta-path P:
s(x, y) = Σ_{p ∈ P} prob(p), summing over path instances p of P from x to y
Used in Personalized PageRank (P-PageRank) (Jeh and Widom, 2003)
Favors highly visible objects (i.e., objects with large degrees)
Pairwise random walk (PRW): the probability of a pairwise random walk starting at (x, y) and ending at a common object z, following meta-paths (P1, P2):
s(x, y) = Σ_{(p1, p2) ∈ (P1, P2^-1)} prob(p1) · prob(p2)
Used in SimRank (Jeh and Widom, 2002)
Favors pure objects (i.e., objects with highly skewed distribution in their in-links or out-links)
Note: P-PageRank and SimRank do not distinguish relationship types
Comparison of multiple measures: a toy example. Who is more similar to Mike?
Meta-path APCPA: Mike publishes a similar number of papers as Bob and Mary, while other measures find Mike closer to Jim.

Author\Conf.     SIGMOD  VLDB  ICDM  KDD
Mike             2       1     0     0
Jim              50      20    0     0
Mary             2       0     1     0
Bob              2       1     0     0
Ann              0       0     1     1

Measure\Author   Jim     Mary    Bob     Ann
P-PageRank       0.376   0.013   0.016   0.005
SimRank          0.716   0.572   0.713   0.184
Random Walk      0.8983  0.0238  0.0390  -
Pairwise R.W.    0.5714  0.4440  0.5556  -
PathSim (APCPA)  0.083   0.8     1       -

PathSim favors peers: objects with strong connectivity and similar visibility under a given meta-path
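The PathSim scores in the toy example can be reproduced directly from the author-conference counts (the zero entries of the count matrix are inferred to be consistent with the reported scores):

```python
# PathSim for meta-path APCPA: with author-conference count matrix W, the
# commuting matrix is M = W.Wt and s(x, y) = 2*M[x][y] / (M[x][x] + M[y][y]).

def pathsim(w, x, y):
    dot = lambda a, b: sum(u * v for u, v in zip(a, b))
    m_xy, m_xx, m_yy = dot(w[x], w[y]), dot(w[x], w[x]), dot(w[y], w[y])
    return 2 * m_xy / (m_xx + m_yy)

# author -> publication counts in (SIGMOD, VLDB, ICDM, KDD)
W = {"Mike": [2, 1, 0, 0], "Jim": [50, 20, 0, 0],
     "Mary": [2, 0, 1, 0], "Bob": [2, 1, 0, 0], "Ann": [0, 0, 1, 1]}
for a in ["Jim", "Mary", "Bob"]:
    print(a, round(pathsim(W, "Mike", a), 3))  # Jim 0.083, Mary 0.8, Bob 1.0
```

Because the score normalizes by both endpoints' self-visibility, the prolific Jim no longer dominates, matching the "peer" intuition.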
Anhai Doan: CS, Wisconsin; Database area; PhD 2002
Jun Yang: CS, Duke; Database area; PhD 2001
Amol Deshpande: CS, Maryland; Database area; PhD 2004
Jignesh Patel: CS, Wisconsin; Database area; PhD 1998
Meta-path: Author-Paper-Venue-Paper-Author
"Conquer West King" (平西王) → "Bo Xilai" (薄熙来)
Other morph examples: "Tender Beef Pentagon" (嫩牛五方), "instant noodles" (方便面), with targets including "Zhou Yongkang" (周永康) and "Yang Mi" (杨幂); "The Hutt" → Chris Christie
A morph and its target are likely to be posted by two users with strong social correlation on Weibo and Twitter, respectively (test data: average social correlation = 0.923)
Explicit social correlation between users from re-tweet, mentioning, reply and follower networks (Wen and Lin, 2010)
Compute the degree of separation in user interactions and the amount of interactions
Infer implicit social correlation between users by topic modeling and opinion mining
Users who share similar interests are likely to post similar information
Measure content similarity of the messages posted by two users
Temporal + social correlation: narrow down the number of target candidates to 1% with 100% accuracy
Conquer West King from Chongqing fell from power, still need to sing red songs?
There is no difference between that guy’s plagiarism and Buhou’s gang crackdown.
Remember that Buhou said that his family was not rich at the press conference a few days before he fell from power. His son Bo Guagua is supported by his scholarship.
Bo Xilai: ten thousand letters of accusation have been received during Chongqing gang crackdown.
The webpage of the "Tianze Economic Study Institute", the first affected website of the liberal party after Bo Xilai fell from power.
Bo Xilai gave an explanation about the source of his son, Bo Guagua’s tuition.
Bo Xilai led Chongqing city leaders and 40 district and county party and government leaders to sing red songs.
Weibo (censored) vs. Twitter and Chinese News (uncensored)
[Figure: heterogeneous information network over entities Bo Guagua, Conquer West King, Chongqing, Bo Xilai, CCP, Wen Jiabao, Best Actor, China, with typed nodes (PER, GPE, ORG) and edges such as Children, Top_employee, Justice, Affiliation, Located, Member]
Each node is an entity; an edge is a semantic relation, event, sentiment, semantic role, dependency relation or co-occurrence, associated with confidence values
Network Schema: M: Morphs; E: Entities; EV: Events; NP: Non-Entity Noun Phrases
Evaluation metric: Acc@k = C_k / T, where C_k is the number of morphs whose correct target appears in the top-k candidates and T is the total number of morph queries
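Assuming the standard top-k accuracy definition, the metric is a short function; the candidate lists below are made up for illustration.

```python
# Accuracy@k: fraction of queries whose gold target appears among the
# top-k ranked candidates.

def acc_at_k(ranked, gold, k):
    correct = sum(1 for q, cands in ranked.items() if gold[q] in cands[:k])
    return correct / len(gold)

ranked = {"morph-1": ["Bo Xilai", "Wen Jiabao"],
          "morph-2": ["Wen Jiabao", "Bo Xilai"]}
gold = {"morph-1": "Bo Xilai", "morph-2": "Bo Xilai"}
print(acc_at_k(ranked, gold, 1))  # 0.5
print(acc_at_k(ranked, gold, 2))  # 1.0
```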
Accuracy@k of morph resolution:
Network        k=1    k=5    k=10   k=20
Homogeneous    23.4%  41.6%  47.7%  51.9%
Heterogeneous  37.9%  59.4%  65.9%  70.1%
Link prediction in homogeneous networks [Liben‐Nowell and Kleinberg, 2003, Hasan et al., 2006]
E.g., friendship prediction
Relationship prediction in heterogeneous networks
Different types of relationships need different prediction models
Different connection paths need to be treated separately!
Meta-path-based approach to define topological features
Meta path‐guided prediction of links and relationships
Philosophy: meta-path relationships among similarly typed links share similar semantics and are comparable and inferable
Co‐author prediction (A—P—A)
Use topological features encoded by meta paths, e.g., citation relations between authors (A—P→P—A)
Meta‐paths between authors under length 4
Meta‐Path Semantic Meaning
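A minimal sketch of such meta-path topological features (toy incidence matrix, made-up data): the number of A-P-A path instances between authors x and y is entry (x, y) of A·Aᵀ, and longer meta-paths chain further incidence matrices the same way.

```python
# Meta-path instance counts via matrix products over a toy author-paper graph.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

# rows: authors a0..a2; columns: papers p0..p3 (1 = authored)
AP = [[1, 1, 0, 0],
      [1, 0, 1, 0],
      [0, 0, 1, 1]]
APA = matmul(AP, transpose(AP))  # A-P-A commuting matrix
print(APA[0][1])  # a0 and a1 share 1 paper -> 1 A-P-A path instance
```

These counts (one per meta-path) become the feature vector fed to the prediction model.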
Explain the prediction power of each meta‐path
Wald Test for logistic regression
Higher prediction accuracy than using projected homogeneous network
11% higher in prediction accuracy
Co-author prediction for Jian Pei: only 42 among 4809 candidates are true first-time co-authors!
(Features collected in [1996, 2002]; test period in [2003, 2009])
Top-4 meta-paths for author citation time prediction (e.g., predict when Philip S. Yu will cite a new author), under a Weibull distribution assumption:
Study the same topic
Co-cited by the same paper
Follow co-authors' citations
Follow the citations of authors who study the same topic
Social relations are less important in author citation prediction than in co-author prediction.
Heterogeneous relationships complement each other
Users and items with limited feedback can be connected to the network by different types of paths
Connect new users or items in the information network
Different users may require different models: Relationship heterogeneity makes personalized recommendation models easier to define
[Figure: movie information network: Avatar, Titanic, Aliens, Revolutionary Road; director James Cameron; actors Kate Winslet, Leonardo DiCaprio, Zoe Saldana; genres Adventure, Romance]
Collaborative filtering methods suffer from the data sparsity issue
A small set of users and items have a large number of ratings
Most users & items have a small number of ratings
[Figure: long-tail distribution of # ratings over # users or items]
Personalized recommendation with heterogeneous networks [WSDM'14]
Different users may have different behaviors or preferences
Ex.: different users may be interested in the same movie (Aliens) for different reasons: a James Cameron fan, an 80s sci-fi fan, a Sigourney Weaver fan
Data level: use personal feedback to achieve personalization
Model level: learn personalized models for different users to further distinguish their differences
[Figure: user-item information network: users Alice, Bob, Charlie; movies Titanic, Revolutionary Road, Skyfall, King Kong; connected via actors Kate Winslet, Naomi Watts, Ralph Fiennes, director Sam Mendes, genre: drama, tag: Oscar Nomination]
Generate L different meta-paths (path types) connecting users and items
Propagate user implicit feedback along each meta-path
Calculate latent features for users and items for each meta-path with an NMF-related method
Observation 1: different meta-paths may have different importance
Observation 2: different users may require different models
Eq. (1), global model: ranking score r(u_i, e_j) = Σ_{q=1..L} θ_q · U_i^(q) · V_j^(q), where U_i^(q) and V_j^(q) are the q-th meta-path latent features for user i and item j
Eq. (2), personalized model: mix the models of the c total soft user clusters, weighted by user-cluster similarity
Learning the personalized recommendation model:
Eq. (3): minimize a ranking objective over each correctly ranked item pair, i.e., pairs where the user gave feedback to one item but not the other
Soft-cluster users with NMF + k-means
For each user cluster, learn one model with Eq. (3)
Generate a personalized model for each user on the fly with Eq. (2)
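The personalized scoring idea can be sketched as a cluster-weighted mixture of per-cluster meta-path models; this follows the spirit of Eq. (2) with made-up names and dimensions, not the paper's exact formulation.

```python
# Personalized score = sum over soft clusters of (membership weight) x
# (that cluster's meta-path-weighted latent-feature score).

def personalized_score(membership, cluster_weights, user_feats, item_feats):
    """membership: the user's soft cluster weights.
    cluster_weights[c][q]: importance of meta-path q under cluster c.
    user_feats[q], item_feats[q]: latent feature vectors along meta-path q."""
    score = 0.0
    for c, m in enumerate(membership):
        for q, theta in enumerate(cluster_weights[c]):
            dot = sum(u * v for u, v in zip(user_feats[q], item_feats[q]))
            score += m * theta * dot
    return score

# toy setup: 2 clusters, 2 meta-paths, 2-dim latent features (all made up)
print(personalized_score([0.7, 0.3],
                         [[1.0, 0.0], [0.0, 1.0]],
                         [[1.0, 0.0], [0.0, 1.0]],
                         [[0.5, 0.5], [0.5, 0.5]]))
```

Users with different cluster memberships thus weight the same meta-paths differently.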
Datasets
Comparison methods
Popularity: recommend the most popular items to users
Co‐click: conditional probabilities between items
NMF: non‐negative matrix factorization on user feedback
Hybrid‐SVM: use Rank‐SVM with plain features (utilize both user feedback and information network)
HeteRec personalized recommendation (HeteRec‐p) leads to the best recommendation
Network representation provides more extensive analysis of text
Relatively homogeneous, clean, common substructures appear
Heterogeneous, noisy, knowledge sparsity, but clear meta‐path schema
Syntactic Tree Mining Approach", SIGIR'11
KDD'04
SIGMOD'04
Massive Network", VLDB'11
Data”, ACL’13
Wisdom of Minority: Unsupervised Slot Filling Validation based on Multi‐dimensional Truth‐ Finding”, COLING’14
Morgan & Claypool Publishers, 2012
Heterogeneous Information Network Analysis", EDBT’09
Heterogeneous Information Networks”, VLDB'11
Heterogeneous Information Networks", WSDM'12
Recommendation: A Heterogeneous Information Network Approach", WSDM'14
Heterogeneous Bibliographic Networks", ASONAM'11
Data mining and NLP have been working on a common goal: Turning data into knowledge (D2K), but with different methodologies
Data mining explores more on massive amounts of data, but with less in-depth analysis of individual documents
Two fields will benefit each other by integrating their technologies/methodologies
We have been proposing a D2N2K (data to network to knowledge) methodology
Phrase mining in massive corpora
Entity recognition and typing by correlation and cluster analysis
Construction of massive typed heterogeneous information networks
Mining actionable knowledge from such “semi‐structured” information networks
Lots to be done in the future!
Lots of unanswered questions and research issues
What is the best framework for integration and joint inference?
Is there an ideal common representation, or a layer in between? Even go beyond Heterogeneous Information Networks?
Apply D2N2K framework to other NLP applications
Network search based Collective Entity Linking
Cross‐lingual information network alignment
Semantic graph based Machine Translation
Expert finding, research recommendation, …
…
Check our research package dissemination portals
IlliMine http://illimine.cs.uiuc.edu/
Phrase Mining
https://github.com/shangjingbo1226/SegPhrase
Entity Typing
http://shanzhenren.github.io/ClusType
Entity Morph Decoding
http://nlp.cs.rpi.edu/software/morphdecoding.tar.gz
Graph Mining Tools
gSpan: http://www.cs.ucsb.edu/~xyan/software/gSpan.htm
NLP: unstructured text data
Data Mining (DM): more on structured and semi-structured data
Data Sparsity: Need to obtain high‐level statistics as global evidence
Training a typical supervised model needs a large number of instances
500 documents for phrase chunking, and entity, relation, event extraction
20,000 queries for entity linking
Knowledge Sparsity: Require global knowledge acquired from a wider context with low cost
Ex.: "Translation out of hype-speak: some kook made threatening noises at Brownback and go arrested"
Who is Brownback? → background data/knowledge aggregation
Source KB: Samuel Dale "Sam" Brownback (born September 12, 1956) is an American politician, the 46th and current Governor of Kansas.
Acquire a large amount of related documents from multiple sources (genres, languages, data modalities)
Learn high‐level data‐driven statistics to extract, type and link information
Frequent pattern mining, ranking based on different criteria
Popularity: ‘information retrieval’ vs. ‘cross‐language information retrieval’
Concordance: 'active learning' vs. 'learning classification'
Completeness: ‘vector machine’ vs. ‘support vector machine’
Construct relational graphs for weakly‐supervised label propagation
Directly converting a large amount of unstructured data into knowledge is too ambitious and unrealistic
Example 1
Input: Millions of discussion forum posts under censorship
Output: Resolve each implicit entity mention to its real target
Example 2
Input: 15 years of non-parallel Chinese and English news articles
Output: Entity translation pairs
Example 3
Input: Millions of multi-source documents reporting conflicting information
Output: Track each company's top employees over time; complete knowledge bases
Motivation: Unigrams (single words) can be difficult to interpret
Ex.: The topic that represents the area of Machine Learning
learning, reinforcement, support, machine, vector, selection, feature, random, …
versus
learning, support vector machines, reinforcement learning, feature selection, conditional random fields, classification, decision trees, …
learning, support vector machines, reinforcement learning, feature selection, conditional random fields, classification, decision trees, …
Topical keyphrase extraction & ranking
knowledge discovery using least squares support vector machine classifiers support vectors for reinforcement learning a hybrid approach to feature selection pseudo conditional random fields automatic web page classification in a dynamic and hierarchical way inverse time dependency in convex regularized learning postprocessing decision trees to extract actionable knowledge variance minimization least squares support vector machines …
Unigram topic assignment: Topic 1 & Topic 2
Popularity: ‘information retrieval’ vs. ‘cross‐language information retrieval’
Discriminativeness: only frequent in documents about topic t
Concordance: 'active learning' vs. 'learning classification'
Completeness: 'vector machine' vs. 'support vector machine'
Frequent pattern mining
Comparability property: directly compare phrases of mixed lengths
Strategy 1: Generate bag‐of‐words → generate sequence of tokens
Integrated complex model; phrase quality and topic inference rely on each other
Slow and prone to overfitting
Strategy 2: Post bag‐of‐words model inference, visualize topics with n‐grams
Phrase quality relies on topic labels for unigrams
Can be fast; generally high‐quality topics and phrases
Strategy 3: Prior to bag-of-words model inference, mine phrases and impose them on the bag-of-words model
Topic inference relies on correct segmentation of documents, but is not overly sensitive to it
Can be fast; generally high‐quality topics and phrases
Traditional data‐driven approaches
Frequent pattern mining
If AB is frequent, likely AB could be a phrase
Raw frequency could NOT reflect the quality of phrases
E.g., freq(vector machine) ≥ freq(support vector machine)
Need to rectify the frequency based on segmentation results
Phrasal segmentation will tell
Some words should be treated as a whole phrase whereas others are still unigrams
Build a candidate phrase set by frequent pattern mining
Mining frequent k‐grams
k is typically small, e.g. 6 in our experiments
Popularity measured by the raw frequency of words and phrases mined from the corpus
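Candidate generation by frequent k-gram mining can be sketched with a counter over a toy corpus (the corpus and support threshold below are illustrative):

```python
# Count all contiguous n-grams up to length max_k and keep those meeting
# a minimum raw-frequency (support) threshold.
from collections import Counter

def frequent_ngrams(docs, max_k=6, min_sup=2):
    counts = Counter()
    for doc in docs:
        toks = doc.lower().split()
        for n in range(1, max_k + 1):
            for i in range(len(toks) - n + 1):
                counts[tuple(toks[i:i + n])] += 1
    return {g: c for g, c in counts.items() if c >= min_sup}

docs = ["support vector machine training",
        "a support vector machine classifier",
        "machine learning with support vectors"]
freq = frequent_ngrams(docs)
print(freq[("support", "vector", "machine")])  # 2
```

In SegPhrase-style pipelines these raw counts are only the starting point; they are later rectified by segmentation.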
Limited Training
Labels: Whether a phrase is a quality one or not
“support vector machine”: 1
“the experiment shows”: 0
For ~1GB corpus, only 300 labels
Random Forest as our classifier
Predicted phrase quality scores lie in [0, 1]
Bootstrap many different datasets from limited labels
Phrasal segmentation can tell which phrase is more appropriate
Ex: A standard feature vector machine learning setup is used to describe...
Rectified phrase frequency (expected influence)
Example: in "[feature vector] [machine learning] setup", the embedded occurrence of "vector machine" is not counted towards its rectified frequency
Partition a sequence of words by maximizing the likelihood
Considering:
Phrase quality score: ClassPhrase assigns a quality score to each phrase
Probability in the corpus
Length penalty: when it is larger than 1, it favors shorter phrases
Filter out phrases with low rectified frequency
Bad phrases are expected to rarely occur in the segmentation results
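The segmentation step can be sketched as a simple dynamic program over log-scores; this is a simplification (fixed unigram score, no learned length-penalty exponent), not SegPhrase's exact model.

```python
# Quality-guided phrasal segmentation: pick the split maximizing the product
# of per-segment scores; multi-word segments must be known quality phrases.
import math

def segment(tokens, quality, unigram_score=0.1, max_len=3):
    """quality: phrase -> score in (0, 1]. Returns the best segmentation."""
    n = len(tokens)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            seg = " ".join(tokens[i:j])
            score = quality.get(seg, unigram_score if j - i == 1 else 0.0)
            if score > 0 and best[i][0] + math.log(score) > best[j][0]:
                best[j] = (best[i][0] + math.log(score), i)
    out, j = [], n
    while j > 0:  # backtrack
        i = best[j][1]
        out.append(" ".join(tokens[i:j]))
        j = i
    return out[::-1]

q = {"feature vector": 0.9, "machine learning": 0.9, "vector machine": 0.8}
print(segment("a standard feature vector machine learning setup".split(), q))
# ['a', 'standard', 'feature vector', 'machine learning', 'setup']
```

Note how the DP prefers "feature vector" + "machine learning" over the spurious "vector machine", exactly the rectification effect described above.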
SegPhrase+: One more round for enhanced phrasal segmentation
Feedback
Using rectified frequency, re‐compute those features previously computed based on raw frequency
Process
Classification Phrasal segmentation // SegPhrase
Classification Phrasal segmentation // SegPhrase+
Effects on computing quality scores, e.g.:
"np hard in the strong sense" vs. "np hard in the strong"
"data base management system"
Stay up Hawk Fans. We are going through a slump, but we have to stay positive. Go Hawks!
Plain text: "The best BBQ I've tasted in Phoenix! I had the pulled pork sandwich with coleslaw and baked beans for lunch. ... The owner is very nice. ..."
With typed entities: "The best BBQ:Food I've tasted in Phoenix:LOC! I had the [pulled pork sandwich]:Food with coleslaw:Food and [baked beans]:Food for lunch. … The owner:JOB_TITLE is very nice. …"
Target types: FOOD, LOCATION, JOB_TITLE, EVENT, ORGANIZATION, …
Identifying token spans as entity mentions in documents and labeling their types
Enabling structured analysis of an unstructured text corpus: plain text → text with typed entities
Relational graph: each pair of mention m and concept c forms a node
y_i: the label of node i; W: weight matrix of the relational graph
Edge types: local compatibility, coreference, semantic relatedness
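Collective inference over such a relational graph can be illustrated with simple label propagation. This is a toy sketch, not the tutorial's algorithm: scores spread along the weighted edges (W) while labeled seed nodes stay clamped, and the graph and weights are invented.

```python
# Toy sketch of collective inference on a mention/concept relational graph:
# propagate label scores y over weighted edges W (from local compatibility,
# coreference, semantic relatedness), keeping seed nodes fixed.
# Graph topology and weights below are made up for illustration.

# adjacency: node -> list of (neighbor, weight)
W = {
    0: [(1, 0.9), (2, 0.3)],
    1: [(0, 0.9), (2, 0.5)],
    2: [(0, 0.3), (1, 0.5), (3, 0.8)],
    3: [(2, 0.8)],
}
seeds = {0: 1.0, 3: 0.0}            # labeled mention-concept nodes
y = {i: seeds.get(i, 0.5) for i in W}

for _ in range(50):                 # iterate to approximate convergence
    new_y = {}
    for i, nbrs in W.items():
        if i in seeds:              # clamp labeled nodes
            new_y[i] = seeds[i]
        else:                       # weighted average of neighbors
            total = sum(w for _, w in nbrs)
            new_y[i] = sum(w * y[j] for j, w in nbrs) / total
    y = new_y

print({i: round(v, 2) for i, v in sorted(y.items())})
```

Node 1, tied strongly to the positive seed, ends up with a higher score than node 2, which sits closer to the negative seed: exactly the smoothing behavior the weight matrix W is meant to encode.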
DSRM vs. a standard relatedness method: M&W (Milne and Witten, 2008)
[Figure: linking performance of TagMe, Meij, AIDA, SSRegu+M&W, and SSRegu+DSRM on the Tweet Set and the News Dataset]
Mention detection is the performance bottleneck
Mention disambiguation: city and country names that refer to sports teams (e.g., “Miami” ‐> “Miami Heat”)
Incorporate user interests
Non‐linkable entity mention recognition and clustering
[Figure: error distribution]
Transformation‐based graph querying produces many results
How to suggest the “best” results to users?
Different transformations should be weighted differently
How to determine the weights? They shall be learned
Evaluate a Candidate Match: Ranking Function
Features
Node matching feature:
Edge matching feature:
Matching score: Ex.: given a single‐node query “Geoffrey Hinton”, nodes matching “G. Hinton” (abbreviation transformation) shall be ranked higher than nodes matching “Hinton” (last‐token transformation)
F_V(v, φ(v)) = Σ_i α_i · f_i(v, φ(v))    (node matching)
F_E(e, φ(e)) = Σ_j β_j · g_j(e, φ(e))    (edge matching)
P(φ(Q) | Q) ∝ exp( Σ_{v ∈ V_Q} F_V(v, φ(v)) + Σ_{e ∈ E_Q} F_E(e, φ(e)) )
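A minimal numeric sketch of this ranking function, under hand-picked (not learned) weights: the feature indicators and the weight values `alpha` are assumptions introduced here for illustration, not part of the tutorial.

```python
# Hypothetical sketch of the ranking function: a candidate match phi is
# scored by exp(sum of node-matching scores + sum of edge-matching scores),
# each a weighted sum of transformation features. Weights are hand-set.
import math

alpha = [1.5, 0.5]   # weights for (abbreviation, last-token) node features

def F_V(feats):
    # F_V(v, phi(v)) = sum_i alpha_i * f_i(v, phi(v))
    return sum(a * f for a, f in zip(alpha, feats))

def score(node_feature_lists, edge_feature_sums=()):
    # P(phi(Q) | Q) proportional to exp(node scores + edge scores)
    return math.exp(sum(F_V(f) for f in node_feature_lists) +
                    sum(edge_feature_sums))

# Single-node query "Geoffrey Hinton":
abbrev_match = score([[1, 0]])   # "G. Hinton": abbreviation feature fires
last_tok     = score([[0, 1]])   # "Hinton":    last-token feature fires
print(abbrev_match > last_tok)   # abbreviation match ranks higher
```

Because abbreviation carries the larger weight, the "G. Hinton" candidate outranks "Hinton", matching the example above; learning would set `alpha`/`beta` from data instead.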
The graphs automatically constructed from natural language texts might:
Include many more diverse types and weights
Include more ambiguity
Suffer from knowledge sparsity: unique nodes/substructures are hard to generalize
Include more noise and errors
Data mining research has developed many scalable graph pattern mining methods
Given a labeled graph dataset D = {G1, G2, …, Gn}, the supporting graph set of a subgraph g is Dg = {Gi | g ⊆ Gi, Gi ∈ D}
support(g) = |Dg|/ |D|
A (sub)graph g is frequent if support(g) ≥ min_sup
Ex.: Chemical structures
[Figure: a graph dataset of three chemical compounds (A)–(C) and the frequent graph patterns (1)–(2) mined with min_sup = 2 (support = 67%)]
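The support definition above can be computed directly for tiny labeled graphs. This sketch uses brute-force subgraph-isomorphism checking (fine for toy sizes, nothing like the scalable miners the tutorial refers to), and the graphs are invented examples rather than the slide's compounds.

```python
# Sketch of support(g) = |D_g| / |D| for tiny node-labeled graphs, using
# brute-force subgraph-isomorphism. Toy graphs, not the slide's chemicals.
from itertools import permutations

def contains(G, g):
    """True if pattern g (nodes: {id: label}, edges: set of id pairs)
    occurs as a subgraph of G under some label-preserving mapping."""
    gn, Gn = list(g["nodes"]), list(G["nodes"])
    for perm in permutations(Gn, len(gn)):
        m = dict(zip(gn, perm))
        if all(g["nodes"][u] == G["nodes"][m[u]] for u in gn) and \
           all((m[u], m[v]) in G["edges"] or (m[v], m[u]) in G["edges"]
               for u, v in g["edges"]):
            return True
    return False

def support(g, D):
    return sum(contains(G, g) for G in D) / len(D)

D = [  # graph "transaction" dataset
    {"nodes": {1: "C", 2: "N", 3: "O"}, "edges": {(1, 2), (2, 3)}},
    {"nodes": {1: "C", 2: "N"},         "edges": {(1, 2)}},
    {"nodes": {1: "C", 2: "O"},         "edges": {(1, 2)}},
]
pattern = {"nodes": {"a": "C", "b": "N"}, "edges": {("a", "b")}}
print(support(pattern, D))  # the C-N pattern occurs in 2 of 3 graphs
```

A pattern is frequent when this value meets min_sup; real miners (gSpan, etc.) avoid the factorial enumeration used here.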
Alternative:
Mining frequent subgraph patterns from a single large graph or network
Documents can be viewed as graph structures as well
Effective graph pattern mining methods have been developed for different scenarios
Mining graph “transaction” datasets
Mining frequent large subgraph structures in a single massive network
Graph pattern mining forms building blocks for graph classification, clustering, compression, comparison, and correlation analysis
Graph indexing and graph similarity search
gIndex (SIGMOD’05): graph indexing by graph pattern mining
Supports both precise and similarity‐based graph query search and answering
Data description: 600 top conferences, 9 major CS areas, 15,071 authors in DB/DM
Authors labeled by # of papers published in DB/DM
Prolific (P): >= 50, Senior (S): 20–49, Junior (J): 10–19, Beginner (B): 5–9
[Figure: patterns found and their matching instances in the data]
Bioinformatics
Gene networks, protein interactions, metabolic pathways
Chem‐informatics: Mining chemical compound structures
Social networks, web communities, tweets, …
Cell phone networks, computer networks, …
Web graphs, XML structures, semantic Web, information networks
Software engineering: Program execution flow analysis
Building blocks for graph classification, clustering, compression, comparison, and correlation analysis
Graph indexing and graph similarity search
Training and test pairs: <x_i, y_i> = <history feature list, future relationship label>
Logistic Regression Model
Model the probability for each relationship as p_i = e^(β·x_i) / (1 + e^(β·x_i))
β is the coefficient vector for the features (including a constant 1)
MLE estimation
Maximize the likelihood of observing all the relationships in the training data
Example features (meta‐path counts: A‐P‐A‐P‐A, A‐P‐V‐P‐A, A‐P‐T‐P‐A, A‐P‐>P‐A, A‐P‐A): <Mike, Ann>: 4 5 100 3 → Yes = 1; <Mike, Jim>: 1 20 2 → No = 0
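The training setup above can be sketched end-to-end with a hand-rolled logistic regression fit by gradient ascent on the log-likelihood (MLE). The feature vectors and labels below are illustrative stand-ins, not the tutorial's DBLP data, and only three meta-path features plus the constant are used.

```python
# Sketch of the logistic-regression link-prediction setup: each pair gets a
# meta-path feature vector (with a constant 1) and a 0/1 label; coefficients
# beta maximize the likelihood via stochastic gradient ascent.
# Feature values are illustrative, not from the tutorial's dataset.
import math

def sigmoid(z):
    # numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# <pair> -> ([1 (constant), A-P-A-P-A, A-P-V-P-A, A-P-T-P-A], label)
train = [
    ([1, 4, 5, 100], 1),   # e.g. <Mike, Ann>: relationship forms
    ([1, 1, 20, 2],  0),   # e.g. <Mike, Jim>: no relationship
    ([1, 3, 6, 80],  1),
    ([1, 0, 15, 5],  0),
]

beta = [0.0] * 4
lr = 0.01
for _ in range(2000):                  # gradient ascent on log-likelihood
    for x, y in train:
        p = sigmoid(sum(b * xi for b, xi in zip(beta, x)))
        for k in range(4):
            beta[k] += lr * (y - p) * x[k]

for x, y in train:
    p = sigmoid(sum(b * xi for b, xi in zip(beta, x)))
    print(y, round(p, 2))
```

After fitting, each training pair's predicted probability sits on the correct side of 0.5; in practice one would regularize and evaluate on held-out pairs.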
From “whether” to “when”
“Whether”: Will Jim rent the movie “Avatar” in Netflix?
“When”: When will Jim rent the movie “Avatar”?
What is the probability Jim will rent “Avatar” within 2 months? P(T ≤ 2)
By when will Jim rent “Avatar” with 90% probability? t such that P(T ≤ t) = 0.9
What is the expected time it will take for Jim to rent “Avatar”? E[T]
May provide useful information to supply chain management
Output for “whether”: P(X = 1) = ?  Output for “when”: a distribution of time!
Generalized Linear Model under a Weibull distribution assumption
Solution: model the time with a parametric distribution
Geometric distribution, exponential distribution, Weibull distribution
Deal with censoring (relationships that form beyond the observed time window)
[Figure: training framework with right censoring at time T]
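Under the Weibull assumption, the three "when" queries from the previous slide have closed-form answers. The shape and scale parameters below are made up (in practice they come from fitting the GLM with censoring); the formulas themselves are the standard Weibull CDF, its inverse, and its mean.

```python
# Sketch of the "when" queries under a Weibull assumption with shape k and
# scale lam: CDF F(t) = 1 - exp(-(t/lam)^k) answers P(T <= t); its inverse
# gives quantiles; the mean is lam * Gamma(1 + 1/k). Parameters are made up.
import math

k, lam = 1.5, 3.0   # hypothetical fitted shape and scale (time in months)

def cdf(t):
    return 1.0 - math.exp(-((t / lam) ** k))

def quantile(p):
    # inverse CDF: smallest t with P(T <= t) = p
    return lam * (-math.log(1.0 - p)) ** (1.0 / k)

# P(Jim rents "Avatar" within 2 months)
print(round(cdf(2.0), 3))
# By when will Jim rent it with 90% probability?
print(round(quantile(0.9), 2))
# Expected time until Jim rents it
print(round(lam * math.gamma(1 + 1 / k), 2))
```

The same three quantities are what the distribution-valued output of the "when" model buys over the single probability of the "whether" model.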