semantic question answering on big data

Semantic Question Answering on Big Data, Tatiana Erekhinskaya, July 2016



  1. Semantic Question Answering on Big Data
     Tatiana Erekhinskaya, July 2016

  2. The Goal
     Challenge:
     • Find answers to complex questions in large structured and unstructured data resources
     • Sample question: List Chinese researchers who worked with Kuznetsov, have publications on Zika virus and studied in US
     Solution:
     • Convert data into RDF storage
     • Convert questions into SPARQL
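
As a rough illustration of the idea on this slide, the sample question can be read as a conjunction of triple patterns that share one variable. The sketch below uses a toy in-memory triple set and a small pattern matcher in plain Python, not a real RDF store or SPARQL engine. All predicate names (`nationality`, `workedWith`, `publishedOn`, `studiedIn`) and all data are hypothetical, chosen only to mirror the question.

```python
# Toy triple store of (subject, predicate, object). All names are hypothetical.
TRIPLES = {
    ("li_wei", "nationality", "China"),
    ("li_wei", "workedWith", "Kuznetsov"),
    ("li_wei", "publishedOn", "Zika virus"),
    ("li_wei", "studiedIn", "US"),
    ("chen_yu", "nationality", "China"),
    ("chen_yu", "workedWith", "Kuznetsov"),
    ("chen_yu", "studiedIn", "China"),
}


def match(patterns, triples, binding=None):
    """Yield variable bindings that satisfy every pattern.

    Terms starting with '?' are variables; everything else must match exactly.
    """
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        new = dict(binding)
        ok = True
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against an earlier binding.
                if new.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield from match(rest, triples, new)


# The sample question as four patterns over the same variable ?r.
QUESTION = [
    ("?r", "nationality", "China"),
    ("?r", "workedWith", "Kuznetsov"),
    ("?r", "publishedOn", "Zika virus"),
    ("?r", "studiedIn", "US"),
]

answers = sorted({b["?r"] for b in match(QUESTION, TRIPLES)})
print(answers)  # ['li_wei']
```

In a SPARQL query the same four constraints would appear as triple patterns in the WHERE clause, with `?r` in the SELECT clause.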

  3. Outline
     • System Architecture
     • NLP & Semantic Parsing
     • RDF Representation
     • Plain English Query to SPARQL
     • Experiments & Results
     • Use Cases & Future Work

  4. System Architecture
     [Diagram; component labels: Input Docs, User Question, Deep NLP, Knowledge Extraction, Question Processing, Semantic to RDF, Semantic to SPARQL, RDF Store]

  5. Natural Language Processing

  6. Concept Extraction
     • Hybrid approach combines machine learning classifiers, cascade of finite-state automata, and lexicons
     • Uses existing medical ontologies: MeSH, SNOMED and UMLS Metathesaurus
     • 80+ types of named entities: demographics, disease, symptom, dosage, severity, time course, onset, alleviating and aggravating factors

  7. Semantic Parsing
     • Extracts 26 predefined binary relation types: AGENT, THEME, LOCATION, TIME, etc.
     • Maximum granularity, not limited to verb arguments: VALUE, PROPERTY, QUANTITY
     • Robust basic representation, not for end users
     [Figure: POSSESSION, QUANTITY, VALUE, and PROPERTY relations annotated on the phrase "100 subjects with type 2 diabetes"]

  8. Semantic Calculus
     • Defines how and under what conditions a chain of relations can be combined into a high-level custom relation
     • Axioms: Possession(c1, c2) & ISA(c1, disease) & ISA(c2, organism) → HasDisease(c1, c2)
     [Figure: HAS_DISEASE, QUANTITY, and SEVERITY relations annotated on the phrase "100 subjects with type 2 diabetes"]
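
The axiom on this slide can be sketched as a simple rule over extracted relations. This is only an illustration under stated assumptions: the tuple layout `(name, c1, c2)`, the `isa` lookup table, and the function name `apply_has_disease_axiom` are hypothetical, and the argument order follows the axiom as written on the slide (c1 is the disease, c2 the organism).

```python
def apply_has_disease_axiom(relations, isa):
    """Derive HAS_DISEASE(c1, c2) from POSSESSION(c1, c2) when
    ISA(c1, disease) and ISA(c2, organism) both hold.

    relations: set of (name, c1, c2) tuples (hypothetical layout)
    isa: dict mapping a concept to its class (hypothetical lookup)
    """
    derived = set()
    for name, c1, c2 in relations:
        if (
            name == "POSSESSION"
            and isa.get(c1) == "disease"
            and isa.get(c2) == "organism"
        ):
            derived.add(("HAS_DISEASE", c1, c2))
    return derived


# Basic relations for "100 subjects with type 2 diabetes" (toy encoding).
relations = {
    ("POSSESSION", "type 2 diabetes", "subjects"),
    ("QUANTITY", "100", "subjects"),
}
isa = {"type 2 diabetes": "disease", "subjects": "organism"}

derived = apply_has_disease_axiom(relations, isa)
print(derived)  # {('HAS_DISEASE', 'type 2 diabetes', 'subjects')}
```

Relations that do not satisfy every premise, such as QUANTITY here, are left untouched, so the basic representation stays available alongside the custom relation.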

  9. RDF & SPARQL

  10. RDF Representation
      • 6.3 MB of text → 13 M triples, 1 GB of RDF XML
      • Keep only relations of interest and tokens that participate in these relations
      • For tokens: named entity type or is-event flag, lemma, synset, and reference sentence
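
The filtering described on this slide might look roughly like the sketch below: keep only relations of interest, then emit token-level triples only for tokens that take part in a kept relation. The `Token` fields follow the slide's list, but the predicate names (`lemma`, `synset`, `sentence`, `neType`, `isEvent`), the identifier formats, and the helper names are assumptions for illustration, not the system's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Token:
    token_id: str
    lemma: str
    synset: str            # hypothetical synset identifier
    sentence_id: str       # reference sentence
    ne_type: Optional[str] = None
    is_event: bool = False


def token_triples(tok):
    """Emit the per-token triples: lemma, synset, sentence, NE type or event flag."""
    triples = [
        (tok.token_id, "lemma", tok.lemma),
        (tok.token_id, "synset", tok.synset),
        (tok.token_id, "sentence", tok.sentence_id),
    ]
    if tok.ne_type is not None:
        triples.append((tok.token_id, "neType", tok.ne_type))
    elif tok.is_event:
        triples.append((tok.token_id, "isEvent", "true"))
    return triples


def document_triples(tokens, relations, relations_of_interest):
    """Keep relations of interest and only the tokens that participate in them."""
    kept = [r for r in relations if r[0] in relations_of_interest]
    used_ids = {tid for _, src, tgt in kept for tid in (src, tgt)}
    triples = []
    for tok in tokens:
        if tok.token_id in used_ids:
            triples.extend(token_triples(tok))
    triples.extend((src, name, tgt) for name, src, tgt in kept)
    return triples


tokens = [
    Token("t1", "subject", "subject.n.01", "s1"),
    Token("t2", "diabetes", "diabetes.n.01", "s1", ne_type="disease"),
    Token("t3", "enroll", "enroll.v.01", "s1", is_event=True),
]
relations = [("HAS_DISEASE", "t2", "t1"), ("OTHER", "t3", "t1")]

triples = document_triples(tokens, relations, {"HAS_DISEASE"})
for t in triples:
    print(t)
```

Token `t3` produces no triples here because its only relation is not in the set of interest.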

  11. Reasoning on the RDF Store
      • OWLPrime
      • SameAs: mentions
      • Lexical chains: WordNet-based relation sequence
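
One way to picture the SameAs bullet: mentions linked by sameAs statements fall into equivalence classes, so a fact stated about one mention can be found through any other mention in its class. The sketch below groups mentions with a small union-find structure. It is a stand-in written for illustration, not how OWLPrime inference is implemented in a triple store, and the mention identifiers are hypothetical.

```python
def same_as_classes(pairs):
    """Group mentions into equivalence classes from sameAs pairs (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    classes = {}
    for x in list(parent):
        classes.setdefault(find(x), set()).add(x)
    return list(classes.values())


# Hypothetical mention ids: m1..m3 refer to one entity, m4..m5 to another.
pairs = [("m1", "m2"), ("m2", "m3"), ("m4", "m5")]
classes = sorted(sorted(c) for c in same_as_classes(pairs))
print(classes)  # [['m1', 'm2', 'm3'], ['m4', 'm5']]
```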

  12. Question Processing
      • Full NLP & semantic parsing
      • Expected answer type recognition (_human or organization, _date or _time, etc.)
      • Answer type terms: “which cartel”
      • Maximum entropy model

  13. SPARQL Query Formulation

  14. Query Relaxation
      • Synset relaxation: include hyponyms, parts, derivations
      • On empty results: drop variable-description triples and semantic relations with little importance
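
The two relaxation steps on this slide can be sketched as a fallback loop. Everything concrete here is an assumption made for illustration: the toy fact set, the `RELATED` table standing in for hyponyms, parts, and derivations, the numeric importance scores, and the order in which the steps are tried (strict query first, then synset relaxation, then dropping the least important patterns).

```python
# Toy facts: (event, relation, value). All data is hypothetical.
FACTS = {
    ("e1", "AGENT", "cartel"),
    ("e1", "THEME", "cocaine"),
}
# Hypothetical stand-in for synset expansion (hyponyms, parts, derivations).
RELATED = {"drug": {"cocaine", "heroin"}}


def run_query(patterns, facts):
    """Return events that satisfy every (relation, allowed_values, importance) pattern."""
    events = {s for s, _, _ in facts}
    return sorted(
        e for e in events
        if all(any((e, rel, v) in facts for v in values) for rel, values, _ in patterns)
    )


def relax(patterns, facts, related):
    """Try the strict query, then synset relaxation, then drop weak patterns."""
    results = run_query(patterns, facts)
    if results:
        return results, "strict"

    # Step 1: widen each allowed value set with related synsets.
    expanded = [
        (rel, values | set().union(*(related.get(v, set()) for v in values)), imp)
        for rel, values, imp in patterns
    ]
    results = run_query(expanded, facts)
    if results:
        return results, "synset"

    # Step 2: drop the least important pattern, one at a time.
    remaining = sorted(expanded, key=lambda p: p[2], reverse=True)
    while len(remaining) > 1:
        remaining.pop()
        results = run_query(remaining, facts)
        if results:
            return results, "dropped"
    return [], "failed"


q1 = [("AGENT", {"cartel"}, 1.0), ("THEME", {"drug"}, 0.8)]
q2 = [("AGENT", {"cartel"}, 1.0), ("LOCATION", {"Mexico"}, 0.2)]
print(relax(q1, FACTS, RELATED))  # (['e1'], 'synset')
print(relax(q2, FACTS, RELATED))  # (['e1'], 'dropped')
```

The first query succeeds once "drug" is widened to include "cocaine"; the second only succeeds after the low-importance LOCATION constraint is dropped.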

  15. Experiments & Results

  16. Experimental Data
      • Illicit Drugs domain
      • 584 documents: Wikipedia + documents
      • 6.3 MB of plain text
      • 6,729,854 RDF triples
      • 546 MB of RDF XML

  17. Results: Question Answering
      344 questions
      • Free text-search: 47% MRR
      • Semantic Approach: 66% MRR
      • Factoid: 85% MRR
      • Definition: 78% MRR
      • List: 68% MRR
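
For readers unfamiliar with the metric, MRR (mean reciprocal rank) averages 1/rank of the first correct answer over all questions, counting 0 when no correct answer is returned. The sketch below computes it on made-up rankings; the question data and function name are illustrative only and unrelated to the evaluation set reported on this slide.

```python
def mean_reciprocal_rank(ranked_answers, gold):
    """Average of 1/rank of the first correct answer per question (0 if none)."""
    total = 0.0
    for answers, correct in zip(ranked_answers, gold):
        rr = 0.0
        for rank, answer in enumerate(answers, start=1):
            if answer in correct:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(ranked_answers)


# Three toy questions: correct at rank 1, correct at rank 2, never correct.
ranked_answers = [["a", "b"], ["x", "y"], ["p", "q"]]
gold = [{"a"}, {"y"}, {"z"}]

mrr = mean_reciprocal_rank(ranked_answers, gold)
print(mrr)  # (1 + 0.5 + 0) / 3 = 0.5
```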

  18. Results: NL to SPARQL
      34 manually annotated questions
      • SELECT clauses: 85%
      • WHERE clauses on triple level: 78%
      • WHERE clauses on question level: 65%
      Relaxation usage: 68% of queries
      • Synset relaxation sufficient for 31%

  19. Error Analysis
      • 73% caused by faulty or missing semantic relations
      • 16% caused by query conversion: yes/no questions and procedural questions

  20. Conclusion
      Use Cases:
      • Processing PubMed for quality measures
      • National Security: terrorism, law enforcement
      • Foreign languages
      Future Work:
      • Integration with Linked Data
      • Rapid Customization
