Content-based Linked Data Summarization
Andrejs Abele
Supervisor: Paul Buitelaar
Mentor: Georgeta Bordea

Introduction
1. Motivation
2. Datasets
3. Approach
4. Evaluation
5. Experiments
6. Conclusion & Future work

Terminology
● Linked data
● Automatic summarization:
  ○ Extraction-based summarization
  ○ Abstraction-based summarization
● Single-document summarization
● Multi-document summarization

Motivation
● Audience: data scientists and developers
● Datahub contains 8 731 datasets
● [Diagram: datasets (Wikipedia, DynaMed, outbreak database, US Census Data, ...) are passed to a Summarizer, which produces a table with the columns Dataset, Description, Top entries]
● DBpedia
  ○ Description: Encyclopedia. Contains information about science, technology, math, history …
  ○ Top entries: history, structure, ...
● outbreakdatabase
  ○ Description: Provides summaries of significant food and water related outbreaks occurring since ...
  ○ Top entries: Outbreak, illness, ...

Datasets
● DBpedia - English DBpedia dump ( 866 461 004 )
  <http://dbpedia.org/resource/Lignum_nephriticum> <http://www.w3.org/2000/01/rdf-schema#label> "Lignum nephriticum"@en .
  <http://dbpedia.org/resource/Mithridate> <http://dbpedia.org/ontology/wikiPageInterLanguageLink> <http://br.dbpedia.org/resource/Mitridates> .
  <http://dbpedia.org/resource/Uguisu_no_fun> <http://dbpedia.org/ontology/abstract> "Uguisu no fun (\u9DAF\u306E\u7CDE), which literally means \u201Cnightingale feces\u201D in Japanese, refers to the excrement (fun) produced by a particular nightingale called the Japanese bush warbler (Cettia diphone) (uguisu). The droppings have been used in facials since ancient Japanese times. Recently, the product has been used in the Western world. This facial has been referred to as the \u201CGeisha Facial\u201D. The facial is supposed to lighten the skin and balance skin tones that have acne or sun damage."@en .
  …
● WikiAbstracts - Wikipedia abstracts ( 4 636 227 )
● acquis - Acquis English corpus ( 23228 )

Experiment
1. Extract information about one topic from the linked dataset
2. Determine the most important terms
3. Create a summary from the extracted words
4. Compare the summary to the Wikipedia article about the topic

Ranking methods
● Normalized Term Frequency (TF/N)
● Term Frequency - Inverse Document Frequency (TF*IDF)
  $\mathrm{IDF}(t) = \ln(N_d / N_{dt})$
● Taxonomy extraction
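The TF/N and TF*IDF rankings listed above can be sketched in a few lines of Python. This is only an illustration, not the code used in the experiments. It assumes that N_d is the total number of documents and N_dt the number of documents containing term t (the usual reading of the formula, which the slide does not spell out). The function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def normalized_tf(tokens):
    """Normalized term frequency: count of each term divided by N, the number of tokens."""
    counts = Counter(tokens)
    n = len(tokens)
    return {term: count / n for term, count in counts.items()}


def idf(term, documents):
    """IDF(t) = ln(N_d / N_dt); assumed: N_d = all documents, N_dt = documents containing t."""
    n_d = len(documents)
    n_dt = sum(1 for doc in documents if term in doc)
    # Terms that occur in no document get 0.0 here to avoid dividing by zero (a design choice).
    return math.log(n_d / n_dt) if n_dt else 0.0


def rank_terms(topic_tokens, documents):
    """Score every topic term by TF * IDF and return (term, score) pairs, best first."""
    tf = normalized_tf(topic_tokens)
    scores = {term: tf[term] * idf(term, documents) for term in tf}
    # Ties are broken alphabetically so the ranking is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy data: each document is a set of stemmed terms.
example_documents = [{"herbal", "medicin"}, {"medicin", "tradit"}, {"medicin"}, {"moth"}]
example_topic = ["herbal", "medicin", "medicin", "tradit"]
example_ranking = rank_terms(example_topic, example_documents)
```

A frequent term such as "medicin" has the highest TF but a low IDF, so rarer topic terms can outrank it.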

Evaluation
ROUGE-N: N-gram based co-occurrence statistics
ROUGE output:
1. Recall
2. Precision
3. F-measure
pyrouge - https://github.com/andersjo/pyrouge.git
Parameters used:
● m - uses Porter stemmer
● s - removes stopwords (around, as, aside, ask, asking, ...)
● n - max-ngram
● l - n-words
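To make the three ROUGE outputs concrete, here is a minimal sketch of ROUGE-N computed from clipped n-gram overlap. It is not pyrouge or the official ROUGE script, and it applies no stemming or stopword removal. The function names and the example sentences are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Return (recall, precision, f_measure) based on clipped n-gram overlap."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    # Counter intersection keeps, per n-gram, the smaller of the two counts.
    overlap = sum((cand & ref).values())
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    if precision + recall == 0:
        return recall, precision, 0.0
    return recall, precision, 2 * precision * recall / (precision + recall)


# Hypothetical example: a 4-term summary against a 5-term reference.
summary = "kampo herbal medicin japan".split()
abstract = "kampo is japanes herbal medicin".split()
r, p, f = rouge_n(summary, abstract, n=1)
```

When the summary and the reference have the same number of n-grams, recall and precision coincide, so all three scores are equal.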

Experiments
● Term preprocessing
  ○ Stemming
  ○ Removing stopwords
  ○ Part-of-speech tagging

Experiment 1
IDF data sources:
1. All literals from DBpedia
2. Wikipedia abstracts
3. acquis
4. extracted literals
Pipeline (from the slide diagram):
● extracted literals → Compute TF
● IDF data source → Compute IDF
● Compute TF + Compute IDF → Compute TF-IDF → Ranked term list (230)
● Ranked term list (230) + Wikipedia article abstract (230) → ROUGE (using stemming) → ROUGE output

Extract information about one topic
1. grep for all triples containing <http://dbpedia.org/page/Category:Traditional_medicine>
2. get all subjects and objects and merge them into a list
3. use the list to grep for all related triples from DBpedia
4. upload the triples to a triplestore
5. query for unique subjects and objects, where the object is a literal
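The five steps above can be approximated in memory, without grep or a triplestore. The sketch below is a simplification under stated assumptions: the regular expression covers only simple N-Triples lines (IRIs and quoted literals, no blank nodes), matching is done on whole subject and object terms rather than on raw text as grep would, and all function names are hypothetical.

```python
import re

# Matches an IRI in angle brackets, or a quoted literal with an optional language tag or datatype.
TERM = re.compile(r'<[^>]*>|"(?:[^"\\]|\\.)*"(?:@[A-Za-z-]+|\^\^<[^>]*>)?')


def parse_triple(line):
    """Split one N-Triples line into (subject, predicate, object), or None if it does not parse."""
    terms = TERM.findall(line)
    return tuple(terms[:3]) if len(terms) >= 3 else None


def topic_literals(lines, category_iri):
    """Return the unique (subject, literal) pairs for resources linked to category_iri."""
    triples = [t for t in map(parse_triple, lines) if t]
    # Step 1: triples that mention the category.
    seed = [t for t in triples if category_iri in t]
    # Step 2: all subjects and objects of those triples, merged into one set.
    resources = {t[0] for t in seed} | {t[2] for t in seed}
    # Step 3: every triple whose subject or object is one of those resources.
    related = [t for t in triples if t[0] in resources or t[2] in resources]
    # Steps 4 and 5, done in memory: keep unique pairs whose object is a literal.
    return sorted({(s, o) for s, _, o in related if o.startswith('"')})
```

Calling `topic_literals(lines, "<http://dbpedia.org/page/Category:Traditional_medicine>")` on the lines of a dump would yield pairs like those on the "Topic specific data" slide. For a dump of the size reported here, a streaming pass or a real triplestore would be needed instead of in-memory lists.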

Topic specific data ( 369 )
?s ?O
<http://dbpedia.org/resource/Kampo> ", alternatively shortened as just Kanpō, is the Japanese study and adaptation of Traditional Chinese medicine (TCM). The fundamental principles of Chinese medicine came to Japan between the 7th and 9th centuries. Since then, the Japanese have created their own unique herbal medical system and diagnosis. Kampo uses most of the Chinese medical system including acupuncture and moxibustion but is primarily concerned with the study of herbs."
<http://dbpedia.org/resource/Kampo> "Kampo"
<http://dbpedia.org/resource/Apocroustic> "Apocroustics, in pre-modern medicine, were medications intended to stop the flux of malignant humours to a diseased part. They were usually cold, astringent, and consisting of large particles."
...
● Text is stemmed using the Lucene library and merged into one document:
  altern shorten as just Kanp is the Japanes studi and adapt of Tradit Chines medicin TCM The fundament principl of Chines medicin came to Japan between the 7th and 9th centuri Sinc then the Japanes have creat their own uniqu herbal medic system and diagnosi Kampo us most of the Chines medic system includ acupunctur and moxibust but is primarili concern with the studi of herb Kampo Apocroust in pre modern medicin were medic intend to stop the flux of malign humour to a diseas part Thei were usual cold astring and consist of larg particl

IDF datasets
● Input is a standard triple (S P O)
  <http://dbpedia.org/resource/Irani_traditional_medicine> <http://www.w3.org/2000/01/rdf-schema#label> "Irani traditional medicine"@en .
  <http://dbpedia.org/resource/Lignum_nephriticum> <http://www.w3.org/2000/01/rdf-schema#label> "Lignum nephriticum"@en
● Using the Jena parser, filter out the literals
  Irani traditional medicine
  Lignum nephriticum
● Words are stemmed using the Lucene library
  (Irani, tradit, medicin)
  (Lignum, nephriticum)
● Calculate IDF
  0.09230952124 medicin
  0.03787453865 tradit
  0.02862030703 herbal
  0.01959969126 medic
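The following sketch mirrors the steps on this slide while treating each triple's literal as one document, as the conclusion describes. It swaps in simpler tools than the ones named above: a regular expression instead of the Jena parser, and lowercasing with a naive tokenizer instead of Lucene stemming. The function names are hypothetical and the IDF formula is the ln(N_d / N_dt) form from the ranking slide.

```python
import math
import re
from collections import Counter

# A quoted literal at the end of an N-Triples line, with an optional language tag or datatype.
LITERAL = re.compile(r'"((?:[^"\\]|\\.)*)"(?:@[A-Za-z-]+|\^\^<[^>]*>)?\s*\.?\s*$')


def literal_of(line):
    """Return the literal text of a triple, or None if its object is not a literal."""
    match = LITERAL.search(line)
    return match.group(1) if match else None


def idf_table(lines):
    """Map each term to ln(N_d / N_dt), counting every literal as one document."""
    documents = []
    for line in lines:
        text = literal_of(line)
        if text is not None:
            # Naive tokenization (assumption): lowercase runs of ASCII letters and digits, no stemming.
            documents.append(set(re.findall(r"[a-z0-9]+", text.lower())))
    document_frequency = Counter(term for doc in documents for term in doc)
    n_d = len(documents)
    return {term: math.log(n_d / n_dt) for term, n_dt in document_frequency.items()}
```

Triples whose object is an IRI contribute no document, so only labels, abstracts and other literals influence the IDF values.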

Experiment 1 results (R = recall, P = precision, F = F-measure)

                     Without stopword removal      With stopwords removed
IDF source           R        P        F           R        P        F
DBpedia              0.21304  0.21304  0.21304     0.25373  0.17617  0.20795
Wikipedia            0.18261  0.18261  0.18261     0.22388  0.14493  0.17595
acquis               0.15217  0.15217  0.15217     0.19403  0.12093  0.149
extracted literals   0.21739  0.21645  0.21692     0.24627  0.17647  0.20561

Experiment 2
IDF data sources:
1. All literals from DBpedia
2. Wikipedia abstracts
Pipeline (from the slide diagram):
● extracted literals → POS tagging → Compute TF
● IDF data source → POS tagging → Compute IDF
● Compute TF + Compute IDF → Compute TF-IDF → Ranked term list (230)
● Ranked term list (230) + Wikipedia article abstract (230) → ROUGE (using stemming) → ROUGE output

Part-of-speech tagging
● Extract all literals
  Trisuloides sericea is a moth of the Noctuidae family. It is found in South-east Asia. The wingspan is about 24 mm.
  Khvorakabad is a village in Mazraeh Now Rural District, in the Central District of Ashtian County, Markazi Province, Iran. At the 2006 census, its population was 72, in 23 families.
● Tag the text using the Stanford part-of-speech tagger (3.5.0)
  Trisuloides_NNS sericea_NN is_VBZ a_DT moth_NN of_IN the_DT Noctuidae_NNP family_NN ._. It_PRP is_VBZ found_VBN in_IN South-east_JJ Asia_NNP ._. The_DT wingspan_NN is_VBZ about_IN 24_CD mm_NN ._.
  Khvorakabad_NNP is_VBZ a_DT village_NN in_IN Mazraeh_NNP Now_NNP Rural_NNP District_NNP ,_, in_IN the_DT Central_NNP District_NNP of_IN Ashtian_NNP County_NNP ,_, Markazi_NNP Province_NNP ,_, Iran_NNP ._. At_IN the_DT 2006_CD census_NN ,_, its_PRP$ population_NN was_VBD 72_CD ,_, in_IN 23_CD families_NNS ._.
● Keep only verbs and nouns (NN, NNS, NNP, NNPS, VB, VBD, VBG, VBN, VBP, VBZ)
  Trisuloides_NNS, sericea_NN, is_VBZ, moth_NN, Noctuidae_NNP, family_NN, is_VBZ, found_VBN, Asia_NNP, wingspan_NN, is_VBZ, mm_NN, Khvorakabad_NNP, is_VBZ, village_NN, Mazraeh_NNP, Now_NNP, Rural_NNP, District_NNP, Central_NNP, District_NNP, Ashtian_NNP, County_NNP, Markazi_NNP, Province_NNP, Iran_NNP, census_NN, population_NN, was_VBD, families_NNS
● Compute TF-IDF
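The filtering step can be sketched as a small function over text that is already tagged in the word_TAG format shown above. Running the Stanford tagger itself is outside the scope of this sketch. The function name is hypothetical, and unlike the slide, which lists the tagged tokens, it returns the bare words.

```python
# Penn Treebank tags for nouns and verbs, as listed on the slide.
KEEP_TAGS = {"NN", "NNS", "NNP", "NNPS", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}


def filter_nouns_and_verbs(tagged_text):
    """Return the words whose tag is a noun or verb tag, in their original order."""
    kept = []
    for token in tagged_text.split():
        # Split on the last underscore so words that contain "_" keep their full form.
        word, separator, tag = token.rpartition("_")
        if separator and tag in KEEP_TAGS:
            kept.append(word)
    return kept


example_words = filter_nouns_and_verbs(
    "Trisuloides_NNS sericea_NN is_VBZ a_DT moth_NN of_IN the_DT Noctuidae_NNP family_NN ._."
)
```

The resulting word list can then be fed into the same TF-IDF computation as in Experiment 1.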

Results (R = recall, P = precision, F = F-measure)

                     Without stopword removal      With stopwords removed
IDF source           R        P        F           R        P        F
DBpedia              0.17826  0.17826  0.17826     0.26119  0.16509  0.20231
Wikipedia            0.16087  0.16087  0.16087     0.23881  0.14884  0.18338

Experiment 3
Taxonomy parameters:
● MincommonDoc=2
● MincommonDoc=3
Pipeline (from the slide diagram):
● extracted literals → Split in documents → Saffron → Generate taxonomy → Ranked term list (230)
● Ranked term list (230) + Wikipedia article abstract (230) → ROUGE (using stemming) → ROUGE output

Results (R = recall, P = precision, F = F-measure)

Taxonomy
                     Without stopword removal      With stopwords removed
Setting              R        P        F           R        P        F
MinComDoc=2 Words    0.31739  0.28968  0.3029      0.49254  0.27049  0.34921
MinComDoc=2 Terms    0.17826  0.25309  0.20918     0.29104  0.24528  0.26621
MinComDoc=3 Words    0.12174  0.73684  0.20896     0.20896  0.73684  0.32559
MinComDoc=3 Terms    0.05652  0.65     0.104       0.09701  0.65     0.16882

POS
                     Without stopword removal      With stopwords removed
IDF source           R        P        F           R        P        F
DBpedia              0.17826  0.17826  0.17826     0.26119  0.16509  0.20231
Wikipedia            0.16087  0.16087  0.16087     0.23881  0.14884  0.18338

Stemmed
                     Without stopword removal      With stopwords removed
IDF source           R        P        F           R        P        F
DBpedia              0.21304  0.21304  0.21304     0.25373  0.17617  0.20795
Wikipedia            0.18261  0.18261  0.18261     0.22388  0.14493  0.17595
acquis               0.15217  0.15217  0.15217     0.19403  0.12093  0.149
extracted literals   0.21739  0.21645  0.21692     0.24627  0.17647  0.20561

Conclusion
● TF-IDF, considering triples as documents, shows good results
● Taxonomy extraction provided the best results
Future work
● Automatically extract categories/topics from a dataset
● Generate n-gram summaries for topics, based on a model trained on the full dataset
● Gather relevant statistics about datasets
● Create a more precise evaluation method
