Conclusion and review: Domain-specific search (DSS)

  1. Conclusion and review

  2. Domain-specific search (DSS)

  3. Emerging opportunities for DSS: fighting human trafficking; predicting cyberattacks; accurate geopolitical forecasting; stopping penny stock fraud

  4. General search: Google Knowledge Graph. DSS: domain-specific knowledge graphs. How do we construct domain-specific knowledge graphs over web data for powerful DSS applications?

  5. Knowledge Graphs for DSS

  6. Challenges

  7. Many Document Features
     • Text paragraphs without formatting: “Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR.”
     • Grammatical sentences plus some formatting & links
     • Non-grammatical snippets, rich formatting & links
     • Tables
     • Charts

  8. Scope: genre specific; short tail; long tail (e.g., forums)

  9. Pattern Complexity
     • Closed set: U.S. states
       – “He was born in Alabama …”
       – “The big Wyoming sky…”
     • Regular set: U.S. phone numbers
       – “Phone: (413) 545-1323”
       – “The CALD main office can be reached at 412-268-1299”
     • Complex: U.S. postal addresses
       – “University of Arkansas, P.O. Box 140, Hope, AR 71802”
       – “Headquarters: 1128 Main Street, 4th Floor, Cincinnati, Ohio 45210”
     • Ambiguous, needing context: person names
       – “…was among the six houses sold by Hope Feldman that year.”
       – “Pawel Opalinski, Software Engineer at WhizBang Labs.”
     • Unusual language models
       – “YOU don't wanna miss out on ME :) Perfect lil booty Green eyes Long curly black hair Im a Irish, Armenian and Filipino mixed princess :) ❤ Kim ❤ 7 ○ 7~7two7~7four77 ❤ HH 80 roses ❤ Hour 120 roses ❤ 15 mins 60 roses”
     Courtesy of Andrew McCallum
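The "regular set" class on this slide (U.S. phone numbers) is the one case where a single regular expression goes a long way. The sketch below is only an illustration, not the extractor used in the talk: `PHONE_RE` is an assumed pattern written to cover the two phone formats quoted on the slide, and `extract_phones` is a hypothetical helper name.

```python
import re

# Assumed pattern: optional parenthesised area code, then groups of 3, 3 and
# 4 digits separated by at most one '-', '.' or whitespace character.
PHONE_RE = re.compile(r"\(?\b(\d{3})\)?[-.\s]?(\d{3})[-.\s]?(\d{4})\b")


def extract_phones(text):
    """Return every matched phone number, normalised to NNN-NNN-NNNN."""
    return ["-".join(m.groups()) for m in PHONE_RE.finditer(text)]


print(extract_phones("Phone: (413) 545-1323"))
# ['413-545-1323']
print(extract_phones("The CALD main office can be reached at 412-268-1299"))
# ['412-268-1299']

# The obfuscated number from the "unusual language models" example is missed,
# which is why that class needs more than a regular expression.
print(extract_phones("7 ○ 7~7two7~7four77"))
# []
```

The last call shows the limit of the approach: purposeful obfuscation breaks the digit groups the pattern relies on.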

  10. Small amount of relevant content; irrelevant content very similar to relevant content

  11. Spreadsheets Created For Human Consumption

  12. Databases with PDF Code Books

  13. Data In Web Tables

  14. Knowledge Graph Construction: Long-tail vs. Short-tail

  15. Extracting Data from Semi-structured Sources: Wrappers
      • NAME: Casablanca Restaurant
      • STREET: 220 Lincoln Boulevard
      • CITY: Venice
      • PHONE: (310) 392-5751
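One common way to think of a wrapper is as a set of per-field rules that locate a value between a left and a right landmark in the page source. The sketch below follows that idea under stated assumptions: the `PAGE` markup is hypothetical (the slides do not show the source page), and `RULES` and `apply_wrapper` are illustrative names rather than any tool's API.

```python
import re

# Hypothetical listing markup; the real page structure is not in the slides.
PAGE = """
<div class="listing"><b>Casablanca Restaurant</b><br>
220 Lincoln Boulevard, Venice<br>
Phone: (310) 392-5751</div>
"""

# Each rule is (left landmark, right landmark), both written as regexes.
RULES = {
    "NAME": (r"<b>", r"</b>"),
    "STREET": (r"</b><br>\s*", r","),
    "CITY": (r",\s*", r"<br>"),
    "PHONE": (r"Phone:\s*", r"</div>"),
}


def apply_wrapper(page, rules):
    """Extract one record by taking the shortest text between landmarks."""
    record = {}
    for field, (left, right) in rules.items():
        match = re.search(left + r"(.*?)" + right, page, flags=re.DOTALL)
        record[field] = match.group(1).strip() if match else None
    return record


print(apply_wrapper(PAGE, RULES))
```

Rules like these are brittle across sites, which is one reason wrappers suit short-tail sources with many pages in the same template.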

  16. Information Integration in Karma

  17. Karma semi-automatically builds semantic models … and provides a nice GUI to edit them. Karma uses semantic models to create knowledge graphs.

  18. Practical Considerations for Extractions: Long tail vs. short tail
      • How good (precision/recall) is necessary?
        – High precision when showing KG nodes to users
        – High recall when used for ranking results
      • How long does it take to construct?
        – Minutes, hours, days, months
      • What expertise do I need?
        – None (domain expertise), patience (annotation), scripting, machine learning guru
      • What tools can I use?
        – Many …
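To make the precision/recall question concrete, the two measures can be computed by comparing an extractor's output against a hand-labelled gold set. This is a minimal sketch; `precision_recall` and the sample values are assumptions made up for illustration, not results from the talk.

```python
def precision_recall(extracted, gold):
    """Compare extracted values with a gold set; return (precision, recall)."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Made-up example: 4 extracted values, 3 gold values, 2 in common.
extracted = {"412-268-1299", "413-545-1323", "000-000-0000", "111-111-1111"}
gold = {"412-268-1299", "413-545-1323", "310-392-5751"}

precision, recall = precision_recall(extracted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
# precision=0.50 recall=0.67
```

An extractor feeding a user-facing KG view would be tuned toward the first number, and one feeding a ranker toward the second.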

  19. Data Tables: Matrix Table, List Table, Entity Table

  20. Table Type Classification
      • Feature-based supervised classification
        – Cafarella’08
        – Crestan’11
        – Eberius’15
      • Deep Learning
        – Nishida’17

  21. Semantic + Structure Embedding

  22. Data Extraction Techniques
      • Glossary
      • Regular expressions
      • Natural language rules
      • Named entity recognition
      • Sequence labeling (Conditional Random Fields)
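The simplest technique in this list, glossary extraction, amounts to matching text against a closed list of known terms. The sketch below shows one possible way to do it; the `GLOSSARY` contents are a deliberately truncated, assumed sample, and `build_matcher` and `extract_terms` are hypothetical helper names.

```python
import re

# Assumed, truncated glossary of U.S. state names for illustration only.
GLOSSARY = ["Alabama", "Wyoming", "Virginia", "West Virginia"]


def build_matcher(terms):
    """Compile one case-insensitive pattern, trying longer terms first."""
    ordered = sorted(terms, key=len, reverse=True)
    alternatives = "|".join(re.escape(term) for term in ordered)
    return re.compile(r"\b(?:" + alternatives + r")\b", flags=re.IGNORECASE)


def extract_terms(text, terms):
    """Return glossary terms found in text, in their canonical spelling."""
    canonical = {term.lower(): term for term in terms}
    matcher = build_matcher(terms)
    return [canonical[m.group(0).lower()] for m in matcher.finditer(text)]


print(extract_terms("He was born in Alabama", GLOSSARY))
# ['Alabama']
print(extract_terms("She moved from west virginia to Virginia", GLOSSARY))
# ['West Virginia', 'Virginia']
```

Ordering the alternatives longest-first keeps "West Virginia" from being reported as "Virginia". A glossary cannot resolve ambiguity on its own, so a term that is also a common word or a person's name still needs context.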

  23. Searching Knowledge Graphs

  24. Many problems with ‘strict’ execution
        SELECT ?ad WHERE {
          ?ad a :Ad ;
            :hair_color 'Auburn' ;
            :review_site_id 'cg9469f' ;
            :price_per_hour '500' ;
            :name 'Claire Gold' ;
            :ethnicity 'Asian' .
        }
      • :hair_color: synonyms and typos, e.g., “red”, “brunette”
      • :review_site_id: not present
      • :price_per_hour: numbers hard to match
      • :name: Claire is a common name; Gold is a domain word
      • :ethnicity: slang, e.g., “FOB” for Asian; inference, e.g., “Japanese”
      • Result from the NoSQL store: no results

  25. Candidate Generation
      • Keyword expansion
      • Context broadening
      • Constraint relaxation
      Query reformulation turns the original query into a series of queries (query 1 through query n) that range from precision-oriented to recall-oriented. Example with the ethnicity constraint relaxed to a variable:
        SELECT ?ad ?ethnicity WHERE {
          ?ad a :Ad ;
            :hair_color 'Auburn' ;
            :review_site_id 'cg9469f' ;
            :price_per_hour '500' ;
            :name 'Claire Gold' ;
            :ethnicity ?ethnicity .
        }
      The reformulated queries run against Elastic Search (100M entities) and return ranked candidates.
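Constraint relaxation can be sketched without a search engine: drop constraints from the strict query one subset at a time and rank each entity by the strictest relaxed query it still satisfies. The code below is an in-memory illustration under stated assumptions, not the Elastic Search pipeline from the slide; the three ad records and all function names are hypothetical.

```python
from itertools import combinations

# The strict constraints from the example query.
CONSTRAINTS = {
    "hair_color": "Auburn",
    "review_site_id": "cg9469f",
    "price_per_hour": "500",
    "name": "Claire Gold",
    "ethnicity": "Asian",
}

# Hypothetical extracted ads; extractions are incomplete and noisy.
ADS = [
    {"id": "ad1", "hair_color": "Auburn", "name": "Claire Gold", "price_per_hour": "500"},
    {"id": "ad2", "hair_color": "red", "name": "Claire", "ethnicity": "Japanese"},
    {"id": "ad3", "hair_color": "Auburn", "ethnicity": "Asian"},
]


def relaxed_queries(constraints):
    """Yield constraint subsets from strictest (all) to loosest (one)."""
    keys = sorted(constraints)
    for size in range(len(keys), 0, -1):
        for subset in combinations(keys, size):
            yield {key: constraints[key] for key in subset}


def matches(entity, query):
    """Strict execution: every constraint must match exactly."""
    return all(entity.get(key) == value for key, value in query.items())


def generate_candidates(entities, constraints):
    """Return (id, score) pairs, score = size of strictest query satisfied."""
    best = {}
    for query in relaxed_queries(constraints):
        for entity in entities:
            if entity["id"] not in best and matches(entity, query):
                best[entity["id"]] = len(query)
    return sorted(best.items(), key=lambda item: -item[1])


print([ad["id"] for ad in ADS if matches(ad, CONSTRAINTS)])
# []
print(generate_candidates(ADS, CONSTRAINTS))
# [('ad1', 3), ('ad3', 2)]
```

Enumerating every subset is exponential in the number of constraints, which is acceptable for a handful of properties. `ad2` still scores nothing because exact matching misses "red" and "Japanese"; that gap is what keyword expansion is meant to cover.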

  26. Knowledge Graph Completion

  27. For Web extractions, noise is inevitable
      • Thousands of web domains
      • Many page formats
      • Distracting & irrelevant content
      • Purposeful obfuscation
      • Poor grammar & spelling
      • Tables
      To reach its potential, a constructed KG must be completed or identified

  28. Entity Resolution: many nodes refer to the same underlying entity

  29. Entity Resolution is fundamentally non-linear
      • Theoretically quadratic in the number of nodes, even if the ‘resolution rule’ were known
      • In practice, the number of ‘duplicates’ tends to grow linearly, and duplicates overlap in non-trivial ways
      • How to devise efficient algorithms?
      50 years of research has agreed on a two-step solution: Knowledge graph → Execute blocking → Candidate set → Execute similarity → Resolved entities
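The two-step solution can be sketched in a few lines: blocking groups nodes by a cheap key so that only nodes in the same block are compared, and a similarity function then decides which candidate pairs are duplicates. Everything specific here is an assumption chosen for illustration: the blocking key (first three letters of the last token), Jaccard token similarity, the 0.3 threshold, and the name variants.

```python
from collections import defaultdict
from itertools import combinations

# Illustrative node labels; the name variants are made up for this sketch.
NAMES = ["Andrew McCallum", "A. McCallum", "Astro Teller",
         "Dr. Astro Teller", "Hope Feldman"]


def blocking_key(name):
    """Assumed cheap key: first three letters of the last token."""
    tokens = name.lower().split()
    return tokens[-1][:3] if tokens else ""


def candidate_pairs(names):
    """Step 1 (blocking): only pair up names that share a blocking key."""
    blocks = defaultdict(list)
    for name in names:
        blocks[blocking_key(name)].append(name)
    for block in blocks.values():
        yield from combinations(block, 2)


def jaccard(a, b):
    """Token-set overlap between two names, between 0.0 and 1.0."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def resolve(names, threshold=0.3):
    """Step 2 (similarity): keep candidate pairs scoring above threshold."""
    return [pair for pair in candidate_pairs(names) if jaccard(*pair) >= threshold]


print(len(list(combinations(NAMES, 2))), "pairs without blocking")
# 10 pairs without blocking
print(len(list(candidate_pairs(NAMES))), "pairs after blocking")
# 2 pairs after blocking
print(resolve(NAMES))
# [('Andrew McCallum', 'A. McCallum'), ('Astro Teller', 'Dr. Astro Teller')]
```

Blocking trades recall for speed: a duplicate whose variants land in different blocks is never compared, so the choice of key matters as much as the similarity function.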

  30. Other research frontiers for KG Completion
      Other things explored in the literature:
      • Domain knowledge
        – Collective ER methods have tried to exploit these systematically
      • Multi-type Entity Resolution
        – Extremely useful for knowledge graphs, lots more work to be done
      • Entity Resolution + Ontologies + IE Confidences
        – Probabilistic Graphical Models like Probabilistic Soft Logic
      • Knowledge graph embeddings
        – Useful for link prediction and triples classification
        – Recall the Microsoft-founded_in-Seattle example earlier
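To show how embeddings support link prediction, the sketch below scores candidate triples with a translation-style function in the spirit of TransE, where a triple (h, r, t) is plausible when h + r lies close to t. The slides do not name a specific embedding model, so TransE is an assumption here, and the 2-d vectors are hand-picked toy values rather than learned embeddings.

```python
import math

# Toy vectors chosen by hand so the example works; real ones are learned.
ENTITIES = {
    "Microsoft": (1.0, 0.0),
    "Seattle": (1.0, 1.0),
    "Venice": (-1.0, 0.5),
}
RELATIONS = {"founded_in": (0.0, 1.0)}


def transe_score(head, relation, tail):
    """Negative L2 distance between head + relation and tail."""
    return -math.sqrt(sum((h + r - t) ** 2
                          for h, r, t in zip(head, relation, tail)))


def rank_tails(head_name, relation_name):
    """Link prediction: rank all other entities as the missing tail."""
    head, relation = ENTITIES[head_name], RELATIONS[relation_name]
    candidates = [name for name in ENTITIES if name != head_name]
    return sorted(candidates,
                  key=lambda name: transe_score(head, relation, ENTITIES[name]),
                  reverse=True)


print(rank_tails("Microsoft", "founded_in"))
# ['Seattle', 'Venice']
```

Triples classification would use the same score with a threshold: a triple is accepted when its score is above a cutoff tuned on held-out data.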

  31. THANK YOU! QUESTIONS...
