Creation, Population and Preprocessing of Experimental Data Sets for Evaluation of Applications for the Semantic Web
G. Frivolt, J. Suchal, R. Veselý, P. Vojtek, O. Vozár, M. Bieliková
- 1
Creation, Population and Preprocessing of Experimental Data Sets for - - PowerPoint PPT Presentation
Creation, Population and Preprocessing of Experimental Data Sets for Evaluation of Applications for the Semantic Web G. Frivolt, J. Suchal, R. Vesel, G. Frivolt, J. Suchal, R. Vesel, P. Vojtek, O. Vozr, M. Bielikov
Creation, Population and Preprocessing of Experimental Data Sets for Evaluation of Applications for the Semantic Web
Lack of suitable data sets for experimental evaluation of Semantic Web oriented applications (e.g., a faceted browser)
Preserve as much information as possible from the original data sources
Existing data sets miss meta-data, or contain it only sparsely
Project MAPEKUS
Create a semantic layer over digital libraries: background for inferencing, analysis of social networks
Improve quality of the obtained data: identify duplicated and malformed data
Provide visual navigation in the data set
Data Sources
Data from the scientific publications domain
Digital libraries: ACM (www.acm.org), Springer (www.springer.com)
Meta-data repository: DBLP (www.informatik.uni-trier.de/~ley/db/)
Data Process Flow
[Diagram: data process flow over wrapping, cleaning, duplicate identification and cluster-based visualization]
How did we gather data?
Wrapper induction: giving positive and negative examples of patterns on the web pages
Wrapper induction exploits machine learning techniques for generalization of patterns
XPath-based learning of patterns; generalization of patterns' attributes using Bayesian networks
Gathered data is stored in structured form in an ontology
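A minimal Java sketch of the pattern-generalization idea: positional predicates that differ across positive XPath examples are dropped, yielding a pattern that matches all of them. The class and method names are hypothetical, and the Bayesian-network generalization of attributes used in the project is not modeled here.

```java
import java.util.Arrays;
import java.util.List;

/** Minimal sketch: generalize positive XPath examples into one pattern
 *  by dropping positional predicates that differ across examples.
 *  Names are hypothetical, not the project's actual classes. */
public class XPathGeneralizer {

    public static String generalize(List<String> positiveExamples) {
        String[] result = positiveExamples.get(0).split("/");
        for (String example : positiveExamples.subList(1, positiveExamples.size())) {
            String[] steps = example.split("/");
            for (int i = 0; i < result.length && i < steps.length; i++) {
                if (!result[i].equals(steps[i])) {
                    // Assumes the steps agree on the element name but not the
                    // index: drop the positional predicate, keep the tag.
                    result[i] = result[i].replaceAll("\\[\\d+\\]", "");
                }
            }
        }
        return String.join("/", result);
    }

    public static void main(String[] args) {
        List<String> examples = Arrays.asList(
            "/html/body/div[2]/span[1]",
            "/html/body/div[5]/span[1]");
        // Prints: /html/body/div/span[1]
        System.out.println(generalize(examples));
    }
}
```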
Wrapped data (depends on data source):
publication instances: name, abstract, year
publication categories, topics and keywords
authorship relation
isReferencedBy and references relations between publications
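As an illustration only, a hypothetical Java shape of one wrapped record mirroring the properties listed above (the project itself stores these as OWL instances, not Java objects):

```java
import java.util.List;

/** Hypothetical in-memory shape of one wrapped publication record. */
public class Publication {
    String name;                      // publication title
    String abstractText;              // "abstract" is a reserved word in Java
    int year;
    List<String> categories;          // categories, topics and keywords
    List<String> authors;             // authorship relation
    List<Publication> references;     // "references" relation
    List<Publication> referencedBy;   // inverse: "isReferencedBy"
}
```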
Why clean the data?
Inconsistencies:
in source data (name misspellings, diacritics)
created during the wrapping process
from source integration (the same author in two sources) – relevant for social networks of authors
Non-invasive cleaning: tagging inconsistent data (without removal)
Data Preprocessing
Single-pass instance cleaning
Cleaning in the scope of one instance (without relations)
Set of filters, each filter for a particular purpose:
correcting capital letters in names and surnames
separating first names and surnames
One pass through all instances – linear time complexity
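A minimal sketch of such a filter chain in Java: each filter fixes one kind of problem, and every instance flows through the chain exactly once, so the pass is linear in the number of instances. The filter shown is illustrative, not the project's actual code.

```java
import java.util.List;
import java.util.function.UnaryOperator;

/** Sketch of single-pass instance cleaning with per-purpose filters. */
public class InstanceCleaner {

    // Capitalize each name part ("jOHN smyth" -> "John Smyth").
    static final UnaryOperator<String> FIX_CAPITALS = name -> {
        StringBuilder out = new StringBuilder();
        for (String part : name.trim().split("\\s+")) {
            out.append(Character.toUpperCase(part.charAt(0)))
               .append(part.substring(1).toLowerCase())
               .append(' ');
        }
        return out.toString().trim();
    };

    public static void main(String[] args) {
        List<UnaryOperator<String>> filters = List.of(FIX_CAPITALS);
        List<String> authors = List.of("jOHN smyth", "mária BIELIKOVÁ");
        for (String author : authors) {           // one pass over all instances
            for (UnaryOperator<String> f : filters) {
                author = f.apply(author);
            }
            System.out.println(author);
        }
    }
}
```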
Data Preprocessing
Duplicate identification – combination of two methods:
comparison of data properties (e.g., author names, publication titles)
comparison of object properties (e.g., co-authors, references, relations between author and publication)
Duplicate Identification
Data properties comparison
Standard string metrics: Levenshtein distance, Monge-Elkan, N-grams
Special string metrics: distance of different characters on the keyboard; name metrics considering abbreviations (J. Smyth = John Smyth)
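Sketches of two of the named metrics plus an abbreviation-aware name check, written from their standard definitions; the project's own implementations may differ, and Monge-Elkan and the keyboard-distance metric are omitted.

```java
import java.util.HashSet;
import java.util.Set;

/** Sketches of data-property string metrics. */
public class StringMetrics {

    /** Classic dynamic-programming Levenshtein edit distance. */
    public static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
        return d[a.length()][b.length()];
    }

    /** Dice-style similarity over character trigrams, in [0, 1]. */
    public static double trigramSimilarity(String a, String b) {
        Set<String> ga = grams(a), gb = grams(b);
        Set<String> common = new HashSet<>(ga);
        common.retainAll(gb);
        return 2.0 * common.size() / (ga.size() + gb.size());
    }

    private static Set<String> grams(String s) {
        Set<String> g = new HashSet<>();
        for (int i = 0; i + 3 <= s.length(); i++) g.add(s.substring(i, i + 3));
        return g;
    }

    /** Abbreviation-aware name match: "J. Smyth" matches "John Smyth". */
    public static boolean namesMatch(String a, String b) {
        String[] pa = a.split("\\s+"), pb = b.split("\\s+");
        if (pa.length != pb.length) return false;
        for (int i = 0; i < pa.length; i++) {
            String x = pa[i].replace(".", ""), y = pb[i].replace(".", "");
            boolean abbrev = x.length() == 1 || y.length() == 1;
            if (abbrev ? x.charAt(0) != y.charAt(0) : !x.equalsIgnoreCase(y))
                return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("Smyth", "Smith"));          // 1
        System.out.println(trigramSimilarity("Smyth", "Smythe"));   // ~0.86
        System.out.println(namesMatch("J. Smyth", "John Smyth"));   // true
    }
}
```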
Duplicate Identification
Object properties comparison
For each object property the similarity is computed from the number of matches
For example: the similarity of two authors depends on the number of conjoint co-authors
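A sketch of this idea in Java, assuming a Jaccard-style normalization of the shared co-author count; the slides fix only that the similarity comes from the number of matches, so the exact normalization is an assumption.

```java
import java.util.HashSet;
import java.util.Set;

/** Sketch of object-property comparison for two author candidates. */
public class ObjectSimilarity {

    public static double coauthorSimilarity(Set<String> coauthorsA,
                                            Set<String> coauthorsB) {
        Set<String> shared = new HashSet<>(coauthorsA);
        shared.retainAll(coauthorsB);                 // conjoint co-authors
        Set<String> all = new HashSet<>(coauthorsA);
        all.addAll(coauthorsB);
        return all.isEmpty() ? 0.0 : (double) shared.size() / all.size();
    }

    public static void main(String[] args) {
        Set<String> a = Set.of("Suchal", "Vojtek", "Bielikova");
        Set<String> b = Set.of("Suchal", "Vojtek", "Frivolt");
        System.out.println(coauthorSimilarity(a, b));  // 2 shared of 4 total = 0.5
    }
}
```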
Cluster-based Navigation
Graph extraction from the ontology: preparation for clustering
Hierarchical clustering: clustering methods from the JUNG library; layers generated using a bottom-up approach
Results stored in a relational database: speed and simplicity
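A standalone sketch of bottom-up layer generation using average-linkage agglomeration: the two most similar clusters are repeatedly merged, and every merge yields the next, coarser layer. The actual pipeline delegates the clustering itself to JUNG; this only illustrates the layering.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of bottom-up (agglomerative) layer generation. */
public class BottomUpLayers {

    public static List<List<List<String>>> buildLayers(List<String> nodes,
                                                       double[][] similarity) {
        List<List<String>> clusters = new ArrayList<>();
        for (String n : nodes) clusters.add(new ArrayList<>(List.of(n)));

        List<List<List<String>>> layers = new ArrayList<>();
        layers.add(copy(clusters));
        while (clusters.size() > 1) {
            int bi = 0, bj = 1;
            double best = -1;
            for (int i = 0; i < clusters.size(); i++)
                for (int j = i + 1; j < clusters.size(); j++) {
                    double s = linkage(clusters.get(i), clusters.get(j), nodes, similarity);
                    if (s > best) { best = s; bi = i; bj = j; }
                }
            clusters.get(bi).addAll(clusters.remove(bj));   // merge the best pair
            layers.add(copy(clusters));
        }
        return layers;   // layer 0 = singletons, last layer = one cluster
    }

    /** Average-linkage similarity between two clusters. */
    private static double linkage(List<String> a, List<String> b,
                                  List<String> nodes, double[][] sim) {
        double sum = 0;
        for (String x : a) for (String y : b)
            sum += sim[nodes.indexOf(x)][nodes.indexOf(y)];
        return sum / (a.size() * b.size());
    }

    private static List<List<String>> copy(List<List<String>> cs) {
        List<List<String>> out = new ArrayList<>();
        for (List<String> c : cs) out.add(new ArrayList<>(c));
        return out;
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("p1", "p2", "p3");
        double[][] sim = {{1, .9, .1}, {.9, 1, .2}, {.1, .2, 1}};
        // [[[p1], [p2], [p3]], [[p1, p2], [p3]], [[p1, p2, p3]]]
        System.out.println(buildLayers(nodes, sim));
    }
}
```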
ACM visualization (1500 publications)
Evaluation – Identification of Duplicates
Sampling (publication counts): 1 000, 5 000, 10 000, 20 000
10 runs for each sample size
Injected 100 generated duplicates
All data from DBLP; duplicates already present in DBLP were ignored
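How the scores on the following slides follow from this injected-duplicate protocol: with 100 known synthetic duplicates per run, precision, recall and F1 reduce to simple counts over the detector's output. The detected/correct counts below are made-up illustrative numbers.

```java
/** Sketch of scoring a run of the duplicate detector. */
public class DuplicateEvaluation {

    public static void main(String[] args) {
        int injected = 100;      // known true duplicates per run
        int detected = 102;      // pairs the detector flagged (illustrative)
        int correct  = 85;       // flagged pairs that are real injected duplicates

        double precision = (double) correct / detected;
        double recall    = (double) correct / injected;
        double f1        = 2 * precision * recall / (precision + recall);

        System.out.printf("P=%.2f R=%.2f F1=%.2f%n", precision, recall, f1);
    }
}
```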
Evaluation – Identification of Duplicates
80,00 100,00 120,00 140,00
ed duplicates
Wrong 0,00 20,00 40,00 60,00 1000 5000 10000 20000
detected Sample size (number of publications)
Wrong Missing Correct
Evaluation – Identification of Duplicates

Sample size   1 000   5 000   10 000   20 000
Precision      0.79    0.79     0.81     0.85
Recall         0.90    0.92     0.78     0.77
F1             0.84    0.85     0.80     0.81
Evaluation – Identification of Duplicates in Real Data
[Chart: duplicate count vs. sample size (1 000 – 5 000 publications) for ACM, DBLP, Springer, and their combination]
Conclusions
ACM, Springer and DBLP data sources were:
stored in a meta-data preserving format (OWL)
made available online: http://mapekus.fiit.stuba.sk
Data evaluation:
data cleaning (duplicate identification)
case study of data set processing – cluster-based visual navigation
Future Work
Make the integrated and cleaned data sets available:
add to this "pack" also the cluster-based visual navigator of the data
Create smaller, focused data sets in specialized sub-domains for experimental reasons:
software engineering
user modeling