

Slide 1 (6/27/13)

Ontology Generation for Large Email Collections

Grace Hui Yang and Jamie Callan Carnegie Mellon University

Roadmap

— Introduction — Subtasks in Ontology Learning — Supervised Hierarchical Clustering

Framework

— Experimental Results — User Study

Slide 2

Introduction

— An ontology is a data model that represents a set of concepts within a domain and the set of pairwise relationships between those concepts.
  • Examples: WordNet, ODP
— Ontology learning is the task of constructing a well-defined ontology given
  • a text corpus, or
  • a set of concept terms

Introduction

— In eRulemaking, a large number of email comments are sent to the agency every day
  • An ontology offers a convenient way to summarize the important topics in the email comments
— In Information Retrieval and Natural Language Processing, there is a need to know the relationships among terms/phrases/concepts
  • An ontology offers relational associations between items

Slide 3

Subtasks in Ontology Learning

— Concept Extraction — Synonym Detection — Relationship Formulation by Clustering — Cluster Labeling


Slide 4

Concept Extraction

Noun N-gram Mining
— Each sentence is parsed by a part-of-speech (POS) tagger
— An n-gram generator then scans through to identify noun sequences
— Bigrams and trigrams are ranked by their frequencies of occurrence
— Longer Named Entities

Concept Filtering
— Web-based POS error detection
— Assumption: among the first 10 Google snippets, a valid concept appears more than a threshold (4 in our case)
— Remove POS errors
  • protect/NN polar/NN bear/NN
— Remove spelling errors
  • Pullution, polor bear
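The mining step above can be sketched in a few lines. Assuming the input has already been POS-tagged (the tagger itself is out of scope here), the function below collects maximal runs of noun-tagged tokens and counts their bigrams and trigrams; all names and the `min_count` cutoff are illustrative, not from the slides:

```python
from collections import Counter

def noun_ngrams(tagged_sentences, min_count=2):
    """Count noun bigrams/trigrams over maximal runs of noun-tagged tokens."""
    counts = Counter()
    for sent in tagged_sentences:
        run = []
        for word, tag in list(sent) + [("", "")]:  # sentinel flushes the last run
            if tag.startswith("NN"):
                run.append(word.lower())
            else:
                for n in (2, 3):  # bigrams and trigrams only
                    for i in range(len(run) - n + 1):
                        counts[" ".join(run[i:i + n])] += 1
                run = []
    # rank candidates by frequency, dropping rare ones
    return [(g, c) for g, c in counts.most_common() if c >= min_count]
```

The filtering stage (Google-snippet validation) would then prune this ranked list.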

Slide 5

Clustering

— Hierarchical Clustering
— Different strategies for concepts at different abstraction levels
  • Concrete concepts at the lower levels
    – camp, basketball, car
  • Abstract concepts at the higher levels
    – economy, math, study

Bottom-Up Hierarchical Clustering

— Find syntactic and semantic evidence for concrete concepts
  • Concept candidates are organized into groups based on the 1st sense of the head noun in WordNet
  • One of their common head nouns is selected as the parent concept for the group
    – pollution subsumes water pollution, air pollution
— Create high-accuracy concept forests at the lower level of the ontology
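A minimal sketch of this head-noun grouping, using the last token of each phrase as a simplified stand-in for the WordNet head-noun lookup the slides describe:

```python
from collections import defaultdict

def group_by_head_noun(concepts):
    """Group noun-phrase concepts by a shared head noun (approximated here
    as the last token) and promote that head noun as the group's parent."""
    groups = defaultdict(list)
    for concept in concepts:
        groups[concept.split()[-1]].append(concept)
    # keep only groups where the shared head noun carries real evidence
    return {head: kids for head, kids in groups.items() if len(kids) > 1}
```

For example, "water pollution" and "air pollution" end up under the parent "pollution", matching the slide's example.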

Slide 6

High Accuracy Ontology Fragments

Continue to be Bottom-Up

— Two problems in the previous step
  • Animal species and bear species are sisters
  • Different fragments need to be further grouped
— Solution: use WordNet hypernyms to construct a higher level
  • Concepts at the leaf level are looked up in WordNet. If one is another's hypernym, the former is promoted to be the parent of the latter.
    – species subsumes animal species, which subsumes bear species
  • Concepts in a WordNet hypernym chain are connected
    – Their hypernym in WordNet is used to label the group
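The promotion step can be sketched as follows; `hypernym_of` is a hypothetical stand-in for a WordNet lookup of a concept's direct hypernym:

```python
def promote_hypernyms(leaves, hypernym_of):
    """Turn leaf concepts into (parent, child) edges: if one leaf is
    another's hypernym, the former becomes the parent of the latter."""
    leaf_set = set(leaves)
    edges = []
    for concept in leaves:
        parent = hypernym_of.get(concept)
        if parent in leaf_set:
            edges.append((parent, concept))
    return edges
```

With a chain like bear species → animal species → species, this yields exactly the subsumption edges the slide describes.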

Slide 7

Ontology Fragments after WordNet Refinement

Different fragments are grouped

Continue to be Bottom-Up

— Problem
  • Still a forest
  • Many concepts at the top level are not grouped
— In any clustering algorithm, we need a metric
  • Hard to know the metric to measure distance for those top-level nodes
  • Learn it!
Slide 8

Supervised Hierarchical Clustering

— Learn for whom?
  • Concepts at lower levels, since they are highly accurate
  • User feedback
— Learn what?
  • A distance metric function
— After learning, then what?
  • Apply the distance metric function to the higher level to get distance scores
  • Then use any clustering algorithm to group the concepts based on the distance scores

Training Data from Lower Levels

— A set of concepts x^(i) on the i-th level of the ontology hierarchy
— Distance matrix y^(i)
  • The matrix entry corresponding to concepts x^(i)_j and x^(i)_k is y^(i)_jk ∈ {0, 1}
  • y^(i)_jk = 0 if x^(i)_j and x^(i)_k are in the same group;
  • y^(i)_jk = 1 otherwise.

Slide 9

Training Data from Lower Levels

e.g., for four concepts forming two groups:

         ⎡ 0 0 1 1 ⎤
y^(i) =  ⎢ 0 0 1 1 ⎥
         ⎢ 1 1 0 0 ⎥
         ⎣ 1 1 0 0 ⎦

Learning the Distance Metric

— The distance metric is represented as a Mahalanobis distance
  • Φ(x^(i)_j, x^(i)_k) represents a set of pairwise underlying feature functions
  • A is a positive semi-definite matrix, the parameter we need to learn
— Parameter estimation by minimizing squared errors
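One plausible way to write this down, consistent with the ingredients the slide names (pairwise features Φ, PSD parameter A, binary labels y^(i), squared-error loss); the exact form in the paper may differ:

```latex
% Mahalanobis-style distance between two concepts on level i
d_A\bigl(x^{(i)}_j, x^{(i)}_k\bigr)
  = \Phi\bigl(x^{(i)}_j, x^{(i)}_k\bigr)^{\top} A \,
    \Phi\bigl(x^{(i)}_j, x^{(i)}_k\bigr)

% Least-squares estimation of A against the binary labels
\min_{A \succeq 0}\; \sum_{j<k}
  \Bigl( d_A\bigl(x^{(i)}_j, x^{(i)}_k\bigr) - y^{(i)}_{jk} \Bigr)^{2}
```

The constraint A ⪰ 0 is what makes this a semi-definite program, matching the SDP solvers listed on the next slide.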

Slide 10

Solve the Optimization Problem

— Optimization can be done with
  • Newton's method
  • Interior-point methods
  • Any standard semi-definite programming (SDP) solver
    – SeDuMi, YALMIP

Underlying Feature Functions

slide-11
SLIDE 11

6/27/13 ¡ 11 ¡

Generate Distance Scores for Higher Level

— We have learned A!
— For any pair of concepts (x^(i+1)_l, x^(i+1)_m) at the higher level
— The corresponding entry in the distance matrix y^(i+1) is computed with the learned metric:
  y^(i+1)_lm = Φ(x^(i+1)_l, x^(i+1)_m)^T A Φ(x^(i+1)_l, x^(i+1)_m)
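This scoring step can be sketched in plain Python; the names are illustrative, and `features[l][m]` is assumed to hold the pairwise feature vector Φ(x_l, x_m):

```python
def mahalanobis_score(phi, A):
    """phi^T A phi for a feature vector phi and parameter matrix A (lists)."""
    n = len(phi)
    return sum(phi[r] * A[r][c] * phi[c] for r in range(n) for c in range(n))

def score_level(features, A):
    """Fill the distance matrix y for one level using the learned A."""
    n = len(features)
    return [[mahalanobis_score(features[l][m], A) for m in range(n)]
            for l in range(n)]
```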

K-medoids Clustering for Higher Level Concepts

— A flat clustering at each level
— Use one of the concepts as the cluster center
— Estimate the number of clusters by Gap statistics [Tibshirani et al. 2000]
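A bare-bones k-medoids sketch over a precomputed distance matrix; the gap-statistic choice of k and the learned distances themselves are outside this snippet, and the update rule shown is the simple "swap to the best member" variant:

```python
import random

def k_medoids(D, k, iters=20, seed=0):
    """Toy k-medoids: D is a precomputed distance matrix (list of lists);
    cluster centers are always actual data points (medoids)."""
    rng = random.Random(seed)
    n = len(D)
    medoids = rng.sample(range(n), k)
    for _ in range(iters):
        # assign every point to its nearest medoid
        clusters = {m: [] for m in medoids}
        for p in range(n):
            clusters[min(medoids, key=lambda m: D[p][m])].append(p)
        # move each medoid to the member minimizing total intra-cluster distance
        new_medoids = [min(members, key=lambda c: sum(D[c][p] for p in members))
                       if members else m
                       for m, members in clusters.items()]
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    # final assignment under the converged medoids
    clusters = {m: [] for m in medoids}
    for p in range(n):
        clusters[min(medoids, key=lambda m: D[p][m])].append(p)
    return clusters
```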

Slide 12

Supervised Hierarchical Clustering

— Repeat the learning process at each level
  • Learn the parameter matrix A from the lower level
  • Generate distance scores for the higher level
  • Cluster the higher level
  • Move one level up
    – Previous testing data now becomes training data!
    – Always trust groupings in the lower level, since they are relatively more accurate

Cluster Labeling

— Problem:
  • Concepts are grouped together, but nameless
— Need to find a good name representing the meaning of the entire group
— Solution:
  • A web-based approach
  • Send a query formed by concatenating the child concepts to Google
  • Parse the top 10 snippets
  • The most frequent word is selected as the parent (label) of the group
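The labeling heuristic can be sketched as below. In the real pipeline the snippets come from querying Google with the concatenated child concepts; that fetch is stubbed out here, and the stopword list is an illustrative placeholder:

```python
from collections import Counter
import re

STOPWORDS = {"the", "a", "an", "of", "and", "or", "to", "in", "for", "is", "are", "on"}

def label_cluster(snippets):
    """Pick the most frequent non-stopword across the snippets as the
    parent label for the cluster."""
    counts = Counter()
    for snippet in snippets:
        for word in re.findall(r"[a-z]+", snippet.lower()):
            if word not in STOPWORDS:
                counts[word] += 1
    return counts.most_common(1)[0][0] if counts else None
```

For a cluster of {water pollution, air pollution}, snippets about pollution would make "pollution" the winning label.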

Slide 13

Experimental Results

— Datasets

Component-based Performance Analysis

Slide 14

Component-based Performance Analysis

Error Analysis

Slide 15

Software Contributions

— Combines many techniques into a unified framework
  • pattern-based (concept mining)
  • knowledge-based (use of WordNet)
  • web-based (concept filtering and cluster naming)
  • machine learning (supervised clustering)
— Effectively combines the strengths of automatic systems and human knowledge via relevance feedback
— Works on harder datasets that do not contain broad, diverse concepts and hence require higher accuracy

Slide 16

What is Next?

— Is bottom-up the best way to do it?
  • Maybe not
  • Incremental clustering saves the most effort
— We have used different techniques for concepts at different levels; how can this be formally generalized?
  • Model concept abstractness explicitly
— We have tested on domain-specific corpora; what about more general-purpose corpora?
  • Can we reconstruct WordNet or ODP?