Content-based Linked Data Summarization Andrejs Abele Supervisor: - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Content-based Linked Data Summarization

Andrejs Abele

Supervisor: Paul Buitelaar Mentor: Georgeta Bordea

slide-2
SLIDE 2

Introduction

  • 1. Motivation
  • 2. Datasets
  • 3. Approach
  • 4. Evaluation
  • 5. Experiments
  • 6. Conclusion & Future work
slide-3
SLIDE 3

Terminology

  • Linked data
  • Automatic summarization:
      ○ Extraction-based summarization
      ○ Abstraction-based summarization
  • Single-document summarization
  • Multi-document summarization
slide-4
SLIDE 4

Motivation

Datahub contains 8 731 datasets

Dataset | Description | Top entries
DBpedia | Encyclopedia. Contains information about science, technology, math, history … | history, structure, ...
Outbreakdatabase | Provides summaries of significant food and water related outbreaks occurring since ... | Outbreak, illness, ...
... | ... | ...

Wikipedia, DynaMed, US Census Data, Outbreak database, ... → Summarizer → Data scientists and developers

slide-5
SLIDE 5

Datasets

DBpedia - English DBpedia dump (866 461 004)

<http://dbpedia.org/resource/Lignum_nephriticum> <http://www.w3.org/2000/01/rdf-schema#label> "Lignum nephriticum"@en .
<http://dbpedia.org/resource/Mithridate> <http://dbpedia.org/ontology/wikiPageInterLanguageLink> <http://br.dbpedia.org/resource/Mitridates> .
<http://dbpedia.org/resource/Uguisu_no_fun> <http://dbpedia.org/ontology/abstract> "Uguisu no fun (\u9DAF\u306E\u7CDE), which literally means \u201Cnightingale feces\u201D in Japanese, refers to the excrement (fun) produced by a particular nightingale called the Japanese bush warbler (Cettia diphone) (uguisu). The droppings have been used in facials since ancient Japanese times. Recently, the product has been used in the Western world. This facial has been referred to as the \u201CGeisha Facial\u201D. The facial is supposed to lighten the skin and balance skin tones that have acne or sun damage."@en .
…

WikiAbstracts - Wikipedia abstracts (4 636 227)
acquis - Acquis English corpus (23228)

slide-6
SLIDE 6

Experiment

  • 1. Extract information about one topic from a linked dataset
  • 2. Determine the most important terms
  • 3. Create a summary from the extracted words
  • 4. Compare the summary to the Wikipedia article about the topic

slide-7
SLIDE 7

Ranking methods

  • Normalized Term Frequency (TF/N)
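  • Term Frequency - Inverse Document Frequency (TF*IDF), with $\mathrm{IDF}(t) = \ln(N_d / N_{dt})$, where $N_d$ is the number of documents and $N_{dt}$ the number of documents containing term $t$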

  • Taxonomy extraction
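
The first two ranking methods can be illustrated with a minimal Python sketch. It follows the formulas on this slide (TF/N and IDF(t) = ln(Nd/Ndt)), but the function names, the toy tokens, and the choice to give unseen terms an IDF of 0 are assumptions made for illustration, not details taken from the experiments.

```python
import math
from collections import Counter

def normalized_tf(tokens):
    """TF/N: count of each term divided by the total number of tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

def idf(documents):
    """IDF(t) = ln(N_d / N_dt) over a list of tokenized documents."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each term once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

def rank_tf_idf(topic_tokens, idf_scores, default_idf=0.0):
    """Rank topic terms by TF*IDF; default_idf for unseen terms is an assumption."""
    tf = normalized_tf(topic_tokens)
    scores = {t: tf[t] * idf_scores.get(t, default_idf) for t in tf}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical, already stemmed toy data.
background = [["medicin", "herbal"], ["medicin", "moth"], ["villag", "moth"], ["medicin"]]
topic = ["medicin", "herbal", "herbal", "tradit"]
idf_scores = idf(background)
ranked = rank_tf_idf(topic, idf_scores)
```

In this toy setup the frequent background term "medicin" is pushed below the rarer "herbal", which is the behaviour TF*IDF is meant to provide.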
slide-8
SLIDE 8

Evaluation

ROUGE-N

N-gram based co-occurrence statistics
pyrouge - https://github.com/andersjo/pyrouge.git

Parameters used:
  • m - uses Porter stemmer
  • s - removes stopwords (around, as, aside, ask, asking, ...)
  • n - max-ngram
  • l - n-words

ROUGE output:
  • 1. Recall
  • 2. Precision
  • 3. F-measure
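
To make the three output values concrete, the following is a simplified sketch of an n-gram overlap score in the spirit of ROUGE-N. It is not the ROUGE script or pyrouge: it does no stemming or stopword removal, and the function names and sample sentences are assumptions for illustration only.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    """Return (recall, precision, f_measure) from clipped n-gram overlap."""
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    overlap = sum((cand & ref).values())  # clipped counts of shared n-grams
    ref_total = sum(ref.values())
    cand_total = sum(cand.values())
    recall = overlap / ref_total if ref_total else 0.0
    precision = overlap / cand_total if cand_total else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return recall, precision, f_measure

# Hypothetical summary and reference text.
summary = "kampo is japanese herbal medicine".split()
reference = "kampo is the japanese study of chinese medicine".split()
r1, p1, f1 = rouge_n(summary, reference, n=1)
r2, p2, f2 = rouge_n(summary, reference, n=2)
```

Recall is measured against the reference and precision against the generated summary, so a short summary made of well-chosen terms can score high precision with low recall.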

slide-9
SLIDE 9

Experiments

  • Term preprocessing
      ○ Stemming
      ○ Removing stopwords
      ○ Part-of-speech tagging

slide-10
SLIDE 10

Experiment 1

Pipeline:
  • 1. Compute TF from the extracted literals
  • 2. Compute IDF from the IDF data source
  • 3. Compute TF-IDF
  • 4. Ranked term list (230)
  • 5. ROUGE (using stemming) against the Wikipedia article abstract (230)
  • 6. ROUGE output

IDF data sources:
  • 1. All literals from DBpedia
  • 2. Wikipedia abstracts
  • 3. acquis
  • 4. Extracted literals
slide-11
SLIDE 11

Extract information about one topic

  • 1. grep for all triples containing <http://dbpedia.org/page/Category:Traditional_medicine>
  • 2. get all subjects and objects and merge them into a list
  • 3. use the list to grep for all related triples from DBpedia
  • 4. upload the triples to a triplestore
  • 5. query for unique subjects and objects, where the object is a literal
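
The extraction steps can be approximated in memory with a short Python sketch. This is only an illustration under stated assumptions: the slides use grep and a triplestore, while the sketch uses a regular expression over N-Triples lines and a set lookup instead. The sample triples, the subject predicate, and all function names are hypothetical.

```python
import re

# Rough N-Triples pattern: <subject> <predicate> object .  (blank nodes are ignored)
TRIPLE = re.compile(r'^(<[^>]*>)\s+(<[^>]*>)\s+(.+?)\s*\.\s*$')

def parse_triple(line):
    """Return (subject, predicate, object) or None if the line does not match."""
    match = TRIPLE.match(line)
    return match.groups() if match else None

def topic_triples(lines, topic_uri):
    """Steps 1-3: seed triples mentioning the topic, then all triples about their resources."""
    seed = [parse_triple(line) for line in lines if topic_uri in line]
    resources = set()
    for s, _p, o in (t for t in seed if t):
        resources.add(s)
        if o.startswith("<"):
            resources.add(o)
    parsed = (parse_triple(line) for line in lines)
    return [t for t in parsed if t and (t[0] in resources or t[2] in resources)]

def literal_pairs(triples):
    """Step 5 stand-in: unique (subject, object) pairs whose object is a literal."""
    return sorted({(s, o) for s, _p, o in triples if o.startswith('"')})

# Hypothetical sample data; the category URI is the one shown on the slide.
TOPIC = "<http://dbpedia.org/page/Category:Traditional_medicine>"
KAMPO = "<http://dbpedia.org/resource/Kampo>"
LABEL = "<http://www.w3.org/2000/01/rdf-schema#label>"
sample = [
    KAMPO + " <http://purl.org/dc/terms/subject> " + TOPIC + " .",
    KAMPO + " " + LABEL + ' "Kampo"@en .',
    "<http://dbpedia.org/resource/Mithridate> " + LABEL + ' "Mithridate"@en .',
]
related = topic_triples(sample, TOPIC)
pairs = literal_pairs(related)
```

On a full DBpedia dump the lines would need to be streamed from disk rather than held in a list, and step 4 (loading into a triplestore) is left out here.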

slide-12
SLIDE 12

Topic specific data (369)

?s ?O

<http://dbpedia.org/resource/Kampo> ", alternatively shortened as just Kanpō, is the Japanese study and adaptation of Traditional Chinese medicine (TCM). The fundamental principles of Chinese medicine came to Japan between the 7th and 9th centuries. Since then, the Japanese have created their own unique herbal medical system and diagnosis. Kampo uses most of the Chinese medical system including acupuncture and moxibustion but is primarily concerned with the study of herbs."
<http://dbpedia.org/resource/Kampo> "Kampo"
<http://dbpedia.org/resource/Apocroustic> "Apocroustics, in pre-modern medicine, were medications intended to stop the flux of malignant humours to a diseased part. They were usually cold, astringent, and consisting of large particles."
...

  • Text gets stemmed using the Lucene library and merged into one document

altern shorten as just Kanp is the Japanes studi and adapt of Tradit Chines medicin TCM The fundament principl of Chines medicin came to Japan between the 7th and 9th centuri Sinc then the Japanes have creat their own uniqu herbal medic system and diagnosi Kampo us most of the Chines medic system includ acupunctur and moxibust but is primarili concern with the studi of herb Kampo Apocroust in pre modern medicin were medic intend to stop the flux of malign humour to a diseas part Thei were usual cold astring and consist of larg particl

slide-13
SLIDE 13

IDF datasets

  • Input is a standard triple (S P O)

<http://dbpedia.org/resource/Irani_traditional_medicine> <http://www.w3.org/2000/01/rdf-schema#label> "Irani traditional medicine"@en .
<http://dbpedia.org/resource/Lignum_nephriticum> <http://www.w3.org/2000/01/rdf-schema#label> "Lignum nephriticum"@en .

  • Using the Jena parser, extract the literals

Irani traditional medicine
Lignum nephriticum

  • Words get stemmed using the Lucene library

(Irani, tradit, medicin)
(Lignum, nephriticum)

  • Calculate IDF

0.09230952124 medicin
0.03787453865 tradit
0.02862030703 herbal
0.01959969126 medic
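
A rough Python sketch of this pipeline, treating the literal of each triple as one document. It is a stand-in under stated assumptions: a regular expression replaces the Jena parser, lowercase word splitting replaces Lucene stemming, and the function names are hypothetical. It is not expected to reproduce the scores listed above.

```python
import math
import re
from collections import Counter

# Matches a trailing literal object, with optional language tag or datatype.
LITERAL = re.compile(r'"((?:[^"\\]|\\.)*)"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?\s*\.\s*$')

def literal_of(line):
    """Return the literal text of an N-Triples line, or None if the object is not a literal."""
    match = LITERAL.search(line)
    return match.group(1) if match else None

def literal_documents(lines):
    """One document (set of lowercase tokens) per triple with a literal object."""
    literals = (literal_of(line) for line in lines)
    return [set(re.findall(r"[a-z0-9]+", lit.lower())) for lit in literals if lit is not None]

def idf_from_documents(docs):
    """IDF(t) = ln(N_d / N_dt) with one document per literal."""
    doc_freq = Counter(term for doc in docs for term in doc)
    return {term: math.log(len(docs) / df) for term, df in doc_freq.items()}

LABEL = "<http://www.w3.org/2000/01/rdf-schema#label>"
triples = [
    "<http://dbpedia.org/resource/Irani_traditional_medicine> " + LABEL + ' "Irani traditional medicine"@en .',
    "<http://dbpedia.org/resource/Lignum_nephriticum> " + LABEL + ' "Lignum nephriticum"@en .',
    "<http://dbpedia.org/resource/Mithridate> <http://dbpedia.org/ontology/wikiPageInterLanguageLink> <http://br.dbpedia.org/resource/Mitridates> .",
]
docs = literal_documents(triples)
idf_scores = idf_from_documents(docs)
```

The third sample triple has a resource as its object, so it contributes no document.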

slide-14
SLIDE 14

Experiment 1 result

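IDF data source | R (no stopword removal) | P (no stopword removal) | F (no stopword removal) | R (stopwords removed) | P (stopwords removed) | F (stopwords removed)
DBpedia | 0.21304 | 0.21304 | 0.21304 | 0.25373 | 0.17617 | 0.20795
Wikipedia | 0.18261 | 0.18261 | 0.18261 | 0.22388 | 0.14493 | 0.17595
acquis | 0.15217 | 0.15217 | 0.15217 | 0.19403 | 0.12093 | 0.149
extracted literals | 0.21739 | 0.21645 | 0.21692 | 0.24627 | 0.17647 | 0.20561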

slide-15
SLIDE 15

Experiment 2

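Pipeline:
  • 1. Run the POS tagger on the extracted literals and on the IDF data source
  • 2. Compute TF from the tagged extracted literals
  • 3. Compute IDF from the tagged IDF data source
  • 4. Compute TF-IDF
  • 5. Ranked term list (230)
  • 6. ROUGE (using stemming) against the Wikipedia article abstract (230)
  • 7. ROUGE output

IDF data sources:
  • 1. All literals from DBpedia
  • 2. Wikipedia abstracts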
slide-16
SLIDE 16

Part of speech tagging

  • Extract all literals

Trisuloides sericea is a moth of the Noctuidae family. It is found in South-east Asia. The wingspan is about 24 mm. Khvorakabad is a village in Mazraeh Now Rural District, in the Central District of Ashtian County, Markazi Province, Iran. At the 2006 census, its population was 72, in 23 families.

  • Tag the text using the Stanford part-of-speech tagger (3.5.0)

Trisuloides_NNS sericea_NN is_VBZ a_DT moth_NN of_IN the_DT Noctuidae_NNP family_NN ._. It_PRP is_VBZ found_VBN in_IN South-east_JJ Asia_NNP ._. The_DT wingspan_NN is_VBZ about_IN 24_CD mm_NN ._. Khvorakabad_NNP is_VBZ a_DT village_NN in_IN Mazraeh_NNP Now_NNP Rural_NNP District_NNP ,_, in_IN the_DT Central_NNP District_NNP of_IN Ashtian_NNP County_NNP ,_, Markazi_NNP Province_NNP ,_, Iran_NNP ._. At_IN the_DT 2006_CD census_NN ,_, its_PRP$ population_NN was_VBD 72_CD ,_, in_IN 23_CD families_NNS ._.

  • Keep only verbs and nouns (NN, NNS, NNP, NNPS, VB, VBD, VBG, VBN, VBP, VBZ)

Trisuloides_NNS, sericea_NN, is_VBZ, moth_NN, Noctuidae_NNP, family_NN, is_VBZ, found_VBN, Asia_NNP, wingspan_NN, is_VBZ, mm_NN, Khvorakabad_NNP, is_VBZ, village_NN, Mazraeh_NNP, Now_NNP, Rural_NNP, District_NNP, Central_NNP, District_NNP, Ashtian_NNP, County_NNP, Markazi_NNP, Province_NNP, Iran_NNP, census_NN, population_NN, was_VBD, families_NNS

  • Compute TF-IDF
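
The filtering step can be sketched in a few lines of Python, assuming the tagger output uses the word_TAG format shown above. The function name is hypothetical, and the tagging itself is not reproduced here.

```python
# Penn Treebank noun and verb tags listed on the slide.
KEEP_TAGS = {"NN", "NNS", "NNP", "NNPS", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}

def keep_nouns_and_verbs(tagged_text):
    """Keep the words whose tag is a noun or verb tag; input tokens look like word_TAG."""
    kept = []
    for token in tagged_text.split():
        # Split on the last underscore so words containing "_" stay intact.
        word, sep, tag = token.rpartition("_")
        if sep and tag in KEEP_TAGS:
            kept.append(word)
    return kept

tagged = "The_DT wingspan_NN is_VBZ about_IN 24_CD mm_NN ._."
words = keep_nouns_and_verbs(tagged)
```

The kept words can then be passed to the TF-IDF computation in place of the full token list.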
slide-17
SLIDE 17

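Results

IDF data source | R (no stopword removal) | P (no stopword removal) | F (no stopword removal) | R (stopwords removed) | P (stopwords removed) | F (stopwords removed)
DBpedia | 0.17826 | 0.17826 | 0.17826 | 0.26119 | 0.16509 | 0.20231
Wikipedia | 0.16087 | 0.16087 | 0.16087 | 0.23881 | 0.14884 | 0.18338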

slide-18
SLIDE 18

Experiment 3

Pipeline:
  • 1. Split the extracted literals into documents
  • 2. Generate a taxonomy (Saffron)
  • 3. Ranked term list (230)
  • 4. ROUGE (using stemming) against the Wikipedia article abstract (230)
  • 5. ROUGE output

Taxonomy parameters:
  • MincommonDoc=2
  • MincommonDoc=3
slide-19
SLIDE 19

Results

Taxonomy | R (no stopword removal) | P (no stopword removal) | F (no stopword removal) | R (stopwords removed) | P (stopwords removed) | F (stopwords removed)
MinComDoc=2 Words | 0.31739 | 0.28968 | 0.3029 | 0.49254 | 0.27049 | 0.34921
MinComDoc=2 Terms | 0.17826 | 0.25309 | 0.20918 | 0.29104 | 0.24528 | 0.26621
MinComDoc=3 Words | 0.12174 | 0.73684 | 0.20896 | 0.20896 | 0.73684 | 0.32559
MinComDoc=3 Terms | 0.05652 | 0.65 | 0.104 | 0.09701 | 0.65 | 0.16882

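POS | R (no stopword removal) | P (no stopword removal) | F (no stopword removal) | R (stopwords removed) | P (stopwords removed) | F (stopwords removed)
DBpedia | 0.17826 | 0.17826 | 0.17826 | 0.26119 | 0.16509 | 0.20231
Wikipedia | 0.16087 | 0.16087 | 0.16087 | 0.23881 | 0.14884 | 0.18338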

Stemmed | R (no stopword removal) | P (no stopword removal) | F (no stopword removal) | R (stopwords removed) | P (stopwords removed) | F (stopwords removed)
DBpedia | 0.21304 | 0.21304 | 0.21304 | 0.25373 | 0.17617 | 0.20795
Wikipedia | 0.18261 | 0.18261 | 0.18261 | 0.22388 | 0.14493 | 0.17595
acquis | 0.15217 | 0.15217 | 0.15217 | 0.19403 | 0.12093 | 0.149
extracted literals | 0.21739 | 0.21645 | 0.21692 | 0.24627 | 0.17647 | 0.20561

slide-20
SLIDE 20

Conclusion

  • TF-IDF, considering triples as documents, shows good results (F-measure up to 0.21692)
  • Taxonomy extraction provided the best results (F-measure up to 0.34921 with MinComDoc=2, words, stopwords removed)

Future work

  • Automatically extract categories/topics from a dataset
  • Generate N-gram summaries for topics, based on a model that is trained on the full dataset
  • Gather relevant statistics about datasets
  • Create a more precise evaluation method