A semi-supervised approach to extracting multiword entity names from user reviews Olga Vechtomova ovechtom@uwaterloo.ca

Introduction • Semi-supervised approach to extracting entities (both single words and multiword units) of a specific semantic class from user-written reviews • Can be applied to extract different classes of entities • Task: extraction of dish names from restaurant reviews • Identification and removal of subjective modifiers • Novel use of BM25 as distributional similarity measure • Comparison with other similarity measures

Computing similarity between seeds and single words • Pre-processing: perform dependency parsing of the corpus (Stanford parser) • Examples of dependency triples: – gnocchi with brown butter and crispy sage leaves • amod NN JJ (butter, brown) • prep_with NN NN (gnocchi, butter) • nn NNS NN (leaves, crispy) • nn NNS NN (leaves, sage) • prep_with NN NNS (gnocchi, leaves) • conj_and NN NNS (butter, leaves)

Computing similarity between seeds and single words • Build feature vectors – Take all dependency triples containing seed words – Transform them into vector features by replacing the seed word with X • Seed word: butter – amod NN JJ (butter, brown) → amod NN JJ (X, brown) – prep_with NN NN (gnocchi, butter) → prep_with NN NN (gnocchi, X) – conj_and NN NNS (butter, leaves) → conj_and NN NNS (X, leaves) • A vector for each word consists of these features and their frequencies of occurrence with this word
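
The feature construction on this slide can be sketched in Python. This is only an illustration, not the author's code: the tuple layout `(relation, head_pos, dep_pos, head, dependent)` and the function name `build_vectors` are assumptions, and the triples are the ones shown for the seed word "butter".

```python
from collections import Counter, defaultdict

# Dependency triples from the slide, stored as
# (relation, head_pos, dep_pos, head_word, dep_word). This layout is an assumption.
triples = [
    ("amod", "NN", "JJ", "butter", "brown"),
    ("prep_with", "NN", "NN", "gnocchi", "butter"),
    ("conj_and", "NN", "NNS", "butter", "leaves"),
]

def build_vectors(triples):
    """Map each word to a Counter of features, where a feature is the
    triple with that word's own slot replaced by the placeholder "X"."""
    vectors = defaultdict(Counter)
    for rel, head_pos, dep_pos, head, dep in triples:
        # Feature seen from the head word's side.
        vectors[head][(rel, head_pos, dep_pos, "X", dep)] += 1
        # Feature seen from the dependent word's side.
        vectors[dep][(rel, head_pos, dep_pos, head, "X")] += 1
    return vectors

vectors = build_vectors(triples)
for feature, freq in vectors["butter"].items():
    print(feature, freq)
```

Each word ends up with a sparse vector of features and their frequencies, which is the input the similarity measures operate on.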

BM25-based distributional similarity measure • Compute similarity between the vectors of each candidate and seed using BM25 with query weights (Spärck Jones et al., 2000):

$$\mathrm{Sim}(c, s) = \sum_{f=1}^{F} \frac{TF\,(k_1 + 1)}{K + TF} \times QTF \times IDF_f$$

• F - number of features in common between candidate word c and seed s • TF – frequency of occurrence of feature f with the candidate word • QTF – frequency of occurrence of f with the seed

BM25-based distributional similarity measure

$$IDF_f = \log \frac{N}{n_f}$$

• n_f - number of candidate word vectors the feature f occurs in • N - number of candidate word vectors
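
A minimal Python sketch of the BM25-based similarity follows, using TF, QTF and IDF as defined on these slides. Several things here are assumptions rather than details from the slides: the defaults `k1=1.2` and `b=0.75`, the standard BM25 length normalization `K = k1 * ((1 - b) + b * len / avg_len)` with vector length taken as the sum of feature frequencies, and the toy feature names `f1` to `f3`.

```python
import math

def idf(feature, candidate_vectors):
    """IDF_f = log(N / n_f) over the candidate word vectors."""
    n_f = sum(1 for vec in candidate_vectors.values() if feature in vec)
    return math.log(len(candidate_vectors) / n_f)

def bm25_similarity(cand, seed_vec, candidate_vectors, k1=1.2, b=0.75):
    """Sum the BM25 weight of every feature shared by candidate and seed."""
    cand_vec = candidate_vectors[cand]
    lengths = [sum(v.values()) for v in candidate_vectors.values()]
    avg_len = sum(lengths) / len(lengths)
    # Assumed: standard BM25 length normalization applied to the candidate vector.
    K = k1 * ((1 - b) + b * sum(cand_vec.values()) / avg_len)
    score = 0.0
    for feature, qtf in seed_vec.items():
        tf = cand_vec.get(feature, 0)
        if tf == 0:
            continue  # feature not shared with the candidate
        score += tf * (k1 + 1) / (K + tf) * qtf * idf(feature, candidate_vectors)
    return score

# Toy vectors (hypothetical features and counts).
candidate_vectors = {
    "pasta": {"f1": 2, "f2": 1},
    "waiter": {"f3": 4},
    "sauce": {"f1": 1, "f3": 1},
}
seed_vec = {"f1": 3, "f2": 1}

ranking = sorted(candidate_vectors,
                 key=lambda c: bm25_similarity(c, seed_vec, candidate_vectors),
                 reverse=True)
print(ranking)
```

Candidates that share more, and rarer, features with the seed rank higher; a candidate with no shared features scores zero.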

Other distributional similarity methods • Lin’s measure (Lin 1998) – uses PMI to calculate association between a word and a feature (dependency triple) • Weeds and Weir (2003) – adapt the concepts of precision and recall to compute similarity • Kotlerman et al. (2009) – propose a directional (asymmetric) measure (balAPinc), aimed at finding more specialized terms

Computing similarity between seeds and single words • Feature-Seed co-occurrence threshold ( t ): • Use only those features that occur with at least t seeds
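
The threshold can be illustrated with a small Python sketch. The function name `filter_features` and the toy seed vectors are assumptions made for illustration.

```python
from collections import Counter

def filter_features(seed_vectors, t):
    """Keep only the features that occur with at least t distinct seeds."""
    seed_counts = Counter()
    for vec in seed_vectors.values():
        seed_counts.update(set(vec))  # count each feature once per seed
    return {f for f, count in seed_counts.items() if count >= t}

# Hypothetical seed vectors: feature -> frequency with that seed.
seed_vectors = {
    "pizza": {"a": 2, "b": 1},
    "pasta": {"a": 1},
    "soup": {"b": 3, "c": 1},
}

print(sorted(filter_features(seed_vectors, 2)))
```

A higher t keeps fewer features, retaining only those shared by several seeds.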

Effect of seed threshold (t) on MAP (Nouns) [Figure: MAP (y-axis, 0 to 0.7) against seed threshold t (x-axis, 1 to 7) for lin, weeds, balapinc and bm25]

Effect of seed threshold (t) on MAP (Adjectives) [Figure: MAP (y-axis, 0 to 0.9) against seed threshold t (x-axis, 1 to 7) for lin, weeds, balapinc and bm25]

Removing subjective modifiers • delicious italian pizza • Use top ranked adjectives as subjective (parameter a ) • In each NP find the rightmost occurrence of a subj. adjective • Remove all words preceding and including this adjective • Rationale: modifiers in English are used in a certain order – opinion, size, age, shape, colour, origin, material, purpose
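
The removal rule can be sketched as a short Python function. The token-list representation of a noun phrase and the name `strip_subjective` are assumptions; the set of subjective adjectives stands in for the top a ranked adjectives.

```python
def strip_subjective(np_tokens, subjective):
    """Drop everything up to and including the rightmost subjective adjective."""
    last = -1
    for i, token in enumerate(np_tokens):
        if token.lower() in subjective:
            last = i  # remember the rightmost match
    return np_tokens[last + 1:]

subjective = {"delicious", "great"}  # hypothetical top-ranked adjectives
print(strip_subjective(["delicious", "italian", "pizza"], subjective))
```

Because opinion adjectives tend to come first in English modifier order, whatever remains after the rightmost one is more likely to belong to the entity name.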

Effect of the number of top ranked adjectives removed on performance [Figure: MAP (y-axis, 0 to 0.45) against number of top ranked adjectives a (x-axis, 0 to 400 in steps of 50, then "all") for lin, weeds, balapinc and bm25]

Ranking noun phrases • Intuition: the further away the noun is from the NP head the less its score should contribute to the NP score – “restaurant pizza” and “pizza restaurant” • Discount noun scores based on distance from the end of NP – Log-linear: $D = 1 - \log_{10}(d_i)$ – Linear: $D = 1 - ((d_i - 1) \times 0.1)$ – No discount: $D = 1$ – 0.5 discount

Ranking noun phrases

$$\mathrm{NPscore} = \frac{\sum_{i=1}^{n} D\, w_i}{n}$$

• w_i - seed-similarity score of the noun • D - discount function • n - number of words in the NP
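
A hedged Python sketch of the NP scoring is given here. It assumes that d_i counts positions from the end of the NP with the head at d_i = 1 (so the head is never discounted), and that words without a seed-similarity score contribute 0. The "0.5 discount" variant is omitted because its formula is not given on the slides, and the scores below are made up.

```python
import math

def discount(d, kind="log"):
    """Discount for a word at distance d from the end of the NP (head: d = 1)."""
    if kind == "log":
        return 1 - math.log10(d)
    if kind == "linear":
        return 1 - (d - 1) * 0.1
    return 1.0  # no discount

def np_score(words, scores, kind="log"):
    """Average of the discounted seed-similarity scores over the n words of the NP."""
    n = len(words)
    total = 0.0
    for i, word in enumerate(words):
        d = n - i  # assumed convention: the last word (head) has d = 1
        total += discount(d, kind) * scores.get(word, 0.0)
    return total / n

scores = {"pizza": 0.9, "restaurant": 0.1}  # hypothetical similarity scores
print(np_score(["restaurant", "pizza"], scores))
print(np_score(["pizza", "restaurant"], scores))
```

With any discount applied, "restaurant pizza" scores higher than "pizza restaurant" because the dish-like noun is the head; without a discount the two phrases tie.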

Effect of the discount factors on performance [Figure: MAP (y-axis, 0.3 to 0.39) for lin, weeds, balapinc and bm25 under each discount: 0.5 discount, linear, log, no discount]

Evaluation • Corpus of 157,865 restaurant reviews from Citygrid • 2 annotators labeled dish names and subjective adjectives in 600 reviews: – 1000 unique dish names (MWUs and single nouns) – 573 unique single word food/dish names – 472 unique subjective adjectives • Seed sets: – 20 seed sets, 10 single words each

Results - Ranking of single noun food/dish names

Run              MAP      P@50    P@100   P@200
Lin (t=3)        0.5255   0.968   0.893   0.7755
Weeds (t=1)      0.5501   0.886   0.795   0.7445
balAPinc (t=2)   0.5836   0.964   0.91    0.82
BM25 (t=1)       0.5705   0.97    0.893   0.811

Results - Ranking of adjectives

Run              MAP        P@50    P@100   P@200
Lin (t=3)        0.7442     0.914   0.883   0.842
Weeds (t=1)      0.7334     0.916   0.878   0.835
balAPinc (t=2)   0.7296     0.892   0.859   0.812
BM25 (t=1)       0.7744**   0.922   0.889   0.861

Results - Ranking of multiword dish names

Run               MAP      P@50    P@100   P@200
Lin (a=50)        0.3738   0.92    0.831   0.759
Weeds (a=50)      0.3483   0.854   0.787   0.684
balAPinc (a=50)   0.3742   0.886   0.814   0.7245
BM25 (a=100)      0.3814   0.832   0.779   0.715
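
For readers unfamiliar with the metrics in these tables, here is a small Python sketch of P@k and average precision over one ranked list. It uses the standard definitions; how the reported MAP is aggregated (for example, as a mean over the seed sets) is an assumption not spelled out on the slides, and the toy ranking is invented.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top k ranked items that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k

def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant item appears."""
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

ranked = ["pizza", "waiter", "pasta", "table"]  # hypothetical system ranking
relevant = {"pizza", "pasta"}                   # hypothetical annotated dish names

print(precision_at_k(ranked, relevant, 2))
print(average_precision(ranked, relevant))
```

MAP would then be the mean of `average_precision` over several runs, such as one per seed set.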

Effect of the number of seeds on performance [Figure: MAP (y-axis, 0 to 0.7) for bm25 and balapinc with 5, 10, 15 and 20 seeds]

Future work • Better method to detect boundaries of multiword entity names is needed, e.g. – “arugula salad with fresh parmesan” – “made by hand fries in a sundae dish with three different dips” – “pasta with lamb, olives, goat cheese and rosemary” • Application to other types of entities – Could be applied to identify different types of entities in user reviews – Promising results of a small-scale evaluation of other aspects of restaurant reviews (e.g., ambiance/atmosphere; people/staff)

Questions?
