Content-based Linked Data Summarization
Andrejs Abele
Supervisor: Paul Buitelaar Mentor: Georgeta Bordea
Introduction
1. Motivation
2. Datasets
3. Approach
4. Evaluation
5. Experiments
6. Conclusion & Future work
Terminology
Datahub contains 8 731 datasets
Dataset: DBpedia
Description: Encyclopedia. Contains information about science, technology, math, history …
Top entries: history, structure, ...

Dataset: ...
Description: Provides summaries of significant food and water related outbreaks
Top entries: Outbreak, illness, ...

Other labels on the slide: Wikipedia, DynaMed, US Census Data, database
Summarizer: for data scientists and developers
<http://dbpedia.org/resource/Lignum_nephriticum> <http://www.w3.org/2000/01/rdf-schema#label> "Lignum nephriticum"@en .
<http://dbpedia.org/resource/Mithridate> <http://dbpedia.org/ontology/wikiPageInterLanguageLink> <http://br.dbpedia.org/resource/Mitridates> .
<http://dbpedia.org/resource/Uguisu_no_fun> <http://dbpedia.org/ontology/abstract> "Uguisu no fun (\u9DAF\u306E\u7CDE), which literally means \u201Cnightingale feces\u201D in Japanese, refers to the excrement (fun) produced by a particular nightingale called the Japanese bush warbler (Cettia diphone) (uguisu). The droppings have been used in facials since ancient Japanese times. Recently, the product has been used in the Western world. This facial has been referred to as the \u201CGeisha Facial\u201D. The facial is supposed to lighten the skin and balance skin tones that have acne or sun damage."@en .
…
N-gram based co-occurrence statistics
pyrouge - https://github.com/andersjo/pyrouge.git
Parameters used:
m - uses Porter stemmer
s - removes stopwords (around, as, aside, ask, asking, ...)
n - max-ngram
l - n-words
1. Recall
2. Precision
3. F-measure
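The three scores can be illustrated with a minimal n-gram overlap sketch. It is not pyrouge or the ROUGE toolkit: stemming and stopword removal are left out, the F-measure is assumed to be the balanced harmonic mean, and the function names and sample strings are made up for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Clipped n-gram overlap scores: (recall, precision, f_measure)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # min count per shared n-gram
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    total = recall + precision
    f_measure = 2 * recall * precision / total if total else 0.0
    return recall, precision, f_measure


# Hypothetical ranked term list (candidate) and reference text.
r, p, f = rouge_n("medicine traditional herbal medical",
                  "traditional medicine uses herbal remedies")
print(f"R={r:.3f} P={p:.3f} F={f:.3f}")
```

Recall divides the overlap by the reference length and precision divides it by the candidate length, so the two are equal whenever candidate and reference contain the same number of n-grams.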
Compute TF → Compute IDF → Compute TF-IDF → Ranked term list (230)
IDF data sources:
1. All literals from DBpedia
2. Wikipedia abstracts
3. acquis
4. extracted literals
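The TF, IDF and TF-IDF steps could look roughly like the sketch below. The exact weighting used in this work is not given on the slides, so the smoothed IDF formula, the function name `tf_idf_ranking` and the toy token lists are all assumptions. Term frequency comes from the dataset being summarized, document frequency from the chosen IDF data source.

```python
import math
from collections import Counter


def tf_idf_ranking(dataset_docs, idf_docs, top_k=230):
    """Rank the terms of a dataset by TF-IDF.

    dataset_docs: token lists extracted from the dataset being summarized.
    idf_docs: token lists from the IDF data source, one list per document.
    """
    # Term frequency over the whole dataset, normalized by total token count.
    tf = Counter(tok for doc in dataset_docs for tok in doc)
    total = sum(tf.values())
    # Document frequency: number of IDF-source documents containing the term.
    df = Counter(tok for doc in idf_docs for tok in set(doc))
    n_docs = len(idf_docs)
    scores = {
        term: (count / total) * math.log((1 + n_docs) / (1 + df[term]))
        for term, count in tf.items()
    }
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Toy, already-stemmed token lists (hypothetical data).
dataset = [["irani", "tradit", "medicin"], ["lignum", "nephriticum"],
           ["herbal", "medicin"]]
background = [["villag", "district", "counti"], ["moth", "famili"],
              ["tradit", "music"], ["district", "popul"]]
ranked = tf_idf_ranking(dataset, background)
print([term for term, _ in ranked])
```

In this toy run "medicin" ranks first because it is frequent in the dataset and absent from the background documents, while "tradit" ranks last because the background already contains it.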
Diagram labels: extracted literals; IDF data source; Wikipedia article abstract (230); ROUGE (using stemming)
<http://dbpedia.org/page/Category:Traditional_medicine>
?s ?O
<http://dbpedia.org/resource/Kampo> ", alternatively shortened as just Kanpō, is the Japanese study and adaptation of Traditional Chinese medicine (TCM). The fundamental principles of Chinese medicine came to Japan between the 7th and 9th … Chinese medical system including acupuncture and moxibustion but is primarily concerned with the study of herbs."
<http://dbpedia.org/resource/Kampo> "Kampo"
<http://dbpedia.org/resource/Apocroustic> "Apocroustics, in pre-modern medicine, were medications intended to stop the flux …
altern shorten as just Kanp is the Japanes studi and adapt of Tradit Chines medicin TCM The fundament principl of Chines medicin came to Japan between the 7th and 9th centuri Sinc then the Japanes have creat their own uniqu herbal medic system and diagnosi Kampo us most of the Chines medic system includ acupunctur and moxibust but is primarili concern with the studi
usual cold astring and consist of larg particl
<http://dbpedia.org/resource/Irani_traditional_medicine> <http://www.w3.org/2000/01/rdf-schema#label> "Irani traditional medicine"@en .
<http://dbpedia.org/resource/Lignum_nephriticum> <http://www.w3.org/2000/01/rdf-schema#label> "Lignum nephriticum"@en
Irani traditional medicine
Lignum nephriticum
(Irani, tradit, medicin) (Lignum, nephriticum)
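The step from triples to per-triple token lists can be sketched as follows. The regular expression is a deliberate simplification and not a full N-Triples parser (typed literals are ignored, for example). The names `extract_literals` and `tokenize` are hypothetical, and stemming, which the slides apply with a Porter stemmer, is left out here.

```python
import re

# Simplified pattern: subject IRI, predicate IRI, then a quoted literal with an
# optional language tag. An assumption for illustration, not a full parser.
LITERAL_TRIPLE = re.compile(
    r'^<(?P<s>[^>]*)>\s+<(?P<p>[^>]*)>\s+"(?P<lit>(?:[^"\\]|\\.)*)"(?:@[A-Za-z-]+)?\s*\.?\s*$'
)


def extract_literals(lines):
    """Yield (subject, literal) for every triple whose object is a literal."""
    for line in lines:
        match = LITERAL_TRIPLE.match(line.strip())
        if match:
            yield match.group("s"), match.group("lit")


def tokenize(literal):
    """Split a literal into word tokens; stemming would follow this step."""
    return re.findall(r"[A-Za-z0-9]+", literal)


triples = [
    '<http://dbpedia.org/resource/Irani_traditional_medicine> '
    '<http://www.w3.org/2000/01/rdf-schema#label> "Irani traditional medicine"@en .',
    '<http://dbpedia.org/resource/Mithridate> '
    '<http://dbpedia.org/ontology/wikiPageInterLanguageLink> '
    '<http://br.dbpedia.org/resource/Mitridates> .',
    '<http://dbpedia.org/resource/Lignum_nephriticum> '
    '<http://www.w3.org/2000/01/rdf-schema#label> "Lignum nephriticum"@en',
]
# One token list per literal-valued triple; the IRI-valued triple is skipped.
docs = [tokenize(literal) for _, literal in extract_literals(triples)]
print(docs)
```

Each literal becomes its own token list, which is what allows a triple to be treated as a document in the TF-IDF step.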
0.09230952124 medicin
0.03787453865 tradit
0.02862030703 herbal
0.01959969126 medic
IDF data source       Without stopword removal       With stopwords removed
                      R        P        F            R        P        F
DBpedia               0.21304  0.21304  0.21304      0.25373  0.17617  0.20795
Wikipedia             0.18261  0.18261  0.18261      0.22388  0.14493  0.17595
acquis                0.15217  0.15217  0.15217      0.19403  0.12093  0.149
extracted literals    0.21739  0.21645  0.21692      0.24627  0.17647  0.20561
Compute TF → Compute IDF → Compute TF-IDF → Ranked term list (230)
IDF data sources:
1. All literals from DBpedia
2. Wikipedia abstracts
Diagram labels: extracted literals (POS tagger); IDF data source (POS tagger); Wikipedia article abstract (230); ROUGE (using stemming)
Trisuloides sericea is a moth of the Noctuidae family. It is found in South-east Asia. The wingspan is about 24 mm. Khvorakabad is a village in Mazraeh Now Rural District, in the Central District of Ashtian County, Markazi Province,
Trisuloides_NNS sericea_NN is_VBZ a_DT moth_NN of_IN the_DT Noctuidae_NNP family_NN ._. It_PRP is_VBZ found_VBN in_IN South-east_JJ Asia_NNP ._. The_DT wingspan_NN is_VBZ about_IN 24_CD mm_NN ._. Khvorakabad_NNP is_VBZ a_DT village_NN in_IN Mazraeh_NNP Now_NNP Rural_NNP District_NNP ,_, in_IN the_DT Central_NNP District_NNP of_IN Ashtian_NNP County_NNP ,_, Markazi_NNP Province_NNP ,_, Iran_NNP ._. At_IN the_DT 2006_CD census_NN ,_, its_PRP$ population_NN was_VBD 72_CD ,_, in_IN 23_CD families_NNS ._.
Trisuloides_NNS, sericea_NN, is_VBZ, moth_NN, Noctuidae_NNP, family_NN, is_VBZ, found_VBN, Asia_NNP, wingspan_NN, is_VBZ, mm_NN, Khvorakabad_NNP, is_VBZ, village_NN, Mazraeh_NNP, Now_NNP, Rural_NNP, District_NNP, Central_NNP, District_NNP, Ashtian_NNP, County_NNP, Markazi_NNP, Province_NNP, Iran_NNP, census_NN, population_NN, was_VBD, families_NNS
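The filtering from tagged text to the kept tokens can be sketched as below. Judging only from the example above, the tokens that survive carry noun (NN*) or verb (VB*) tags. That rule is inferred from the example rather than stated on the slides, and the function name `filter_by_pos` is hypothetical.

```python
def filter_by_pos(tagged_text, keep_prefixes=("NN", "VB")):
    """Keep `word_TAG` tokens whose tag starts with one of the kept prefixes."""
    kept = []
    for token in tagged_text.split():
        # Split on the last underscore so words containing "_" stay intact.
        word, _, tag = token.rpartition("_")
        if word and tag.startswith(keep_prefixes):
            kept.append(token)
    return kept


sample = "The_DT wingspan_NN is_VBZ about_IN 24_CD mm_NN ._."
print(filter_by_pos(sample))
```

Applied to the sample sentence, this keeps `wingspan_NN`, `is_VBZ` and `mm_NN`, and drops the determiner, preposition, number and punctuation.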
IDF data source       Without stopword removal       With stopwords removed
                      R        P        F            R        P        F
DBpedia               0.17826  0.17826  0.17826      0.26119  0.16509  0.20231
Wikipedia             0.16087  0.16087  0.16087      0.23881  0.14884  0.18338
Split into documents → Generate taxonomy (Saffron) → Ranked term list (230)
Taxonomy parameters:
MincommonDoc=2
MincommonDoc=3
Diagram labels: extracted literals; Wikipedia article abstract (230); ROUGE (using stemming)
Taxonomy              Without stopword removal       With stopwords removed
                      R        P        F            R        P        F
MinComDoc=2 Words     0.31739  0.28968  0.3029       0.49254  0.27049  0.34921
MinComDoc=2 Terms     0.17826  0.25309  0.20918      0.29104  0.24528  0.26621
MinComDoc=3 Words     0.12174  0.73684  0.20896      0.20896  0.73684  0.32559
MinComDoc=3 Terms     0.05652  0.65     0.104        0.09701  0.65     0.16882
POS
IDF data source       Without stopword removal       With stopwords removed
                      R        P        F            R        P        F
DBpedia               0.17826  0.17826  0.17826      0.26119  0.16509  0.20231
Wikipedia             0.16087  0.16087  0.16087      0.23881  0.14884  0.18338
Stemmed
IDF data source       Without stopword removal       With stopwords removed
                      R        P        F            R        P        F
DBpedia               0.21304  0.21304  0.21304      0.25373  0.17617  0.20795
Wikipedia             0.18261  0.18261  0.18261      0.22388  0.14493  0.17595
acquis                0.15217  0.15217  0.15217      0.19403  0.12093  0.149
extracted literals    0.21739  0.21645  0.21692      0.24627  0.17647  0.20561
TF-IDF considering triples as documents shows good results
Taxonomy extraction provided the best results