BOA: Bootstrapping the Linked Data Web. Daniel Gerber, Axel-Cyrille Ngonga Ngomo


  1. http://aksw.org/files/boa.pdf BOA Bootstrapping the Linked Data Web Daniel Gerber, Axel-Cyrille Ngonga Ngomo AKSW, Universität Leipzig http://www.volunteer-conservation-peru.org

  2. Motivation • most knowledge bases extracted from (semi)-structured data • Linked Data Cloud grows • BUT: only 15-20 % of information • How can we extract data from the document-oriented web?

  3. Idea • start with triples from the Data Web • extract natural language patterns which express predicates found in triples • combine patterns & NLP to find labels which stand in relation with predicate • generate RDF and feed it into Data Web

  4. Related Work • NLP & RDF: • Fox, Extractiv, Alchemy, OpenCalais • NELL [CAR+10] • initial ontology: 100+ categories/relations • PROSPERA [NDA+11] • harvesting of n-grams-itemset patterns ➤ generalisation without adding noise • [JUR+10] M. Mintz, S. Bills, R. Snow, and D. Jurafsky. Distant supervision for relation extraction without labeled data. ACL, pages 1003 – 1011, 2009.

  5. The BOA approach

  6. Corpus extraction • Crawler • Seed Pages, removes HTML • Cleaner • SBD, UTF-8 filters to remove noise • Indexer • sentences get index

  7. Knowledge acquisition • Class C that serves as the rdfs:domain or as the rdfs:range of predicate p • knowledge base for background knowledge • extract statements with entities of rdf:type C as subject or object • db:Place , db:Person , db:Organisation

  8. Pattern Search • set of entities s and o connected through p • find sentences which contain s & o, strip the rest • replace labels with variables (?D?, ?R?) • A BOA pattern is a pair P = (μ(p), θ), where μ(p) is p’s URI and θ is a natural language representation of p. • A BOA pattern mapping is a function M such that M(p) = S , where S is the set of natural language representations for p. • Occurrences, sentences, labels p is learned from, number of occurrences for each label combination
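The pattern search step above can be sketched roughly as follows. This is a minimal illustration, not the BOA implementation: the function names `extract_pattern` and `build_pattern_mapping` are hypothetical, labels are matched by plain substring search, and the subject is assumed to map to `?D?` (domain) and the object to `?R?` (range).

```python
from collections import Counter, defaultdict


def extract_pattern(sentence, subject_label, object_label):
    """Return the text between both labels with placeholders, or None.

    Assumption: subject -> ?D? (domain), object -> ?R? (range).
    """
    s_pos = sentence.find(subject_label)
    o_pos = sentence.find(object_label)
    if s_pos == -1 or o_pos == -1:
        return None  # sentence does not contain both labels
    # Order the two mentions by their position in the sentence.
    mentions = sorted([(s_pos, subject_label, "?D?"),
                       (o_pos, object_label, "?R?")])
    (pos1, label1, var1), (pos2, _label2, var2) = mentions
    between_start = pos1 + len(label1)
    if between_start > pos2:
        return None  # the two mentions overlap
    # Keep only the text between the labels and strip the rest.
    return f"{var1}{sentence[between_start:pos2]}{var2}"


def build_pattern_mapping(triples, sentences):
    """Sketch of M(p) = S: predicate -> patterns with occurrence counts."""
    mapping = defaultdict(Counter)
    for subject_label, predicate, object_label in triples:
        for sentence in sentences:
            pattern = extract_pattern(sentence, subject_label, object_label)
            if pattern is not None:
                mapping[predicate][pattern] += 1
    return mapping


if __name__ == "__main__":
    triples = [("Albert Einstein", "birthPlace", "Ulm")]
    sentences = ["Albert Einstein was born in Ulm, Germany."]
    print(dict(build_pattern_mapping(triples, sentences)))
```

Keeping a count per pattern mirrors the slide's note that occurrences are recorded for each label combination; a fuller version would also store the source sentences and labels.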

  9. Pattern Scoring • Pattern Filtering: Length, Stop Words, Occurrence • Support: the pattern is used across several triples in the background knowledge • Typicity: the pattern allows ?D? and ?R? to be mapped to entities whose rdf:type matches the domain/range of p • Specificity: the pattern is used exclusively to express p (IDF adapted to patterns) • (Similarity: how similar a pattern is to the label of the predicate) • Combine Support, Typicity, Specificity to calculate local maximum
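The slides do not give the scoring formulas, so the following is only a sketch under stated assumptions: the filter thresholds, the stop-word list, and the rule "drop patterns made only of stop words" are hypothetical, and `specificity` is one plausible IDF-style reading of "used exclusively to express p", not the authors' definition.

```python
import math

# Hypothetical stop-word list; the real list is not given in the slides.
STOP_WORDS = {"a", "an", "the", "of", "in", "and", "is", "was", ",", "'s"}


def passes_filters(pattern, occurrences,
                   min_tokens=1, max_tokens=6, min_occurrences=2):
    """Length, stop-word and occurrence filters (thresholds are assumptions)."""
    tokens = [t for t in pattern.lower().split() if t not in ("?d?", "?r?")]
    if not (min_tokens <= len(tokens) <= max_tokens):
        return False
    if all(t in STOP_WORDS for t in tokens):
        return False  # the pattern carries no content word
    return occurrences >= min_occurrences


def specificity(pattern, mapping):
    """IDF-style score: log(#predicates / #predicates using the pattern)."""
    using = sum(1 for patterns in mapping.values() if pattern in patterns)
    if using == 0:
        return 0.0
    return math.log(len(mapping) / using)


if __name__ == "__main__":
    # Toy mapping: predicate -> {pattern: occurrence count}.
    mapping = {
        "birthPlace": {"?D? was born in ?R?": 5, "?D? , ?R?": 3},
        "deathPlace": {"?D? died in ?R?": 4, "?D? , ?R?": 2},
    }
    for predicate, patterns in mapping.items():
        for pattern, count in patterns.items():
            if passes_filters(pattern, count):
                print(predicate, pattern, specificity(pattern, mapping))
```

A pattern shared by every predicate scores zero here, which matches the intuition that it says nothing specific about any one relation.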

  10. RDF Generation • use the top-n patterns for each relation • find sentences which contain a pattern • NER-tag each sentence • look for tokens whose classes match the domain/range • extract labels • URI retrieval: if a match scores above the threshold, do not create a new URI
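A rough sketch of the generation steps above, under simplifying assumptions: a capitalised-word regex and a hand-written `entity_types` dictionary stand in for a real NER tagger, the function names are hypothetical, and URI retrieval is left out, so the output is label-level triples rather than RDF.

```python
import re

# Stand-in for NER: one or more capitalised words (an assumption).
LABEL = r"[A-Z]\w*(?: [A-Z]\w*)*"


def pattern_to_regex(pattern):
    """Turn '?D? was born in ?R?' into a regex with named groups D and R."""
    parts = re.split(r"(\?D\?|\?R\?)", pattern)
    pieces = []
    for part in parts:
        if part == "?D?":
            pieces.append(f"(?P<D>{LABEL})")
        elif part == "?R?":
            pieces.append(f"(?P<R>{LABEL})")
        else:
            pieces.append(re.escape(part))  # literal pattern text
    return re.compile("".join(pieces))


def generate_triples(sentences, predicate, pattern,
                     domain, range_, entity_types):
    """Extract (subject, predicate, object) label triples from sentences."""
    regex = pattern_to_regex(pattern)
    triples = []
    for sentence in sentences:
        for match in regex.finditer(sentence):
            d_label, r_label = match.group("D"), match.group("R")
            # Keep the match only if both classes fit domain and range.
            if (entity_types.get(d_label) == domain
                    and entity_types.get(r_label) == range_):
                triples.append((d_label, predicate, r_label))
    return triples


if __name__ == "__main__":
    entity_types = {"Marie Curie": "Person", "Warsaw": "Place",
                    "Apple": "Organisation", "Cupertino": "Place"}
    sentences = ["Marie Curie was born in Warsaw in 1867.",
                 "Apple was born in Cupertino."]
    print(generate_triples(sentences, "birthPlace", "?D? was born in ?R?",
                           "Person", "Place", entity_types))
```

The second sentence matches the pattern textually but is rejected because its subject is not a Person, which is the role the domain/range check plays on the slide.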

  11. Demo

  12. http://139.18.2.164:8080/boa

  13. Evaluation • Corpora • en_wiki (44.7M), en_news (256.1M) • Background Knowledge • Organisation, Place and Person (283 relations, from 1 to 471920 triples each) • Parameters • top-1 and top-2 patterns, kappa, 500 sentences for Typicity, 100 example sentences for 12 different KBs

  14. Results

  15. Examples (top-2 patterns per relation URI, by corpus) • foundationPerson (Domain/Range: Organisation/Person) • en-wiki: 1. R , co-founder of D; 2. R , founder of D • en-news: 1. R, the co-founder of D; 2. R, founder of the D • subsidiary (Domain/Range: Organisation/Organisation) • en-wiki: 1. R , a subsidiary of D; 2. D's acquisition of R • en-news: 1. R, a division of D; 2. - (R , a division of D) • birthPlace (Domain/Range: Person/PopulatedPlace) • en-wiki: 1. D has been named in the R; 2. - (D , the mayor of R) • en-news: 1. D was born in R; 2. D, MP for R

  16. Discussion • we can use patterns from wiki for every corpus • we create many new triples • we create correct triples • we need 15 minutes for one iteration • Q1 & Q2 answered with YES

  17. Future Work • Train NER on DBpedia classes • Iteration 1+ • Human feedback • Pattern generalization • rdf:type extractor • Languages/Corpora • Webservices

  18. Thank you! Questions?

  19. References • [NDA+11] Ndapandula Nakashole, Martin Theobald, and Gerhard Weikum. Scalable knowledge harvesting with high precision and high recall. In WSDM, pages 227–236, Hong Kong, 2011. • [CAR+10] Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hruschka Jr., and Tom M. Mitchell. Toward an architecture for never-ending language learning. In AAAI, 2010.
