Conclusion and review: Domain-specific search (DSS)

SLIDE 1

Conclusion and review

SLIDE 2

Domain-specific search (DSS)

SLIDE 3

Emerging opportunities for DSS

  • Fighting human trafficking
  • Predicting cyberattacks
  • Stopping penny stock fraud
  • Accurate geopolitical forecasting

SLIDE 4

How do we construct domain-specific knowledge graphs over web data for powerful DSS applications?

  • General search: Google Knowledge Graph
  • DSS: domain-specific knowledge graphs

SLIDE 5

Knowledge Graphs for DSS

SLIDE 6

Challenges

SLIDE 7

Many Document Features

  • Text paragraphs without formatting, e.g.: “Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR.”
  • Grammatical sentences plus some formatting & links
  • Non-grammatical snippets, rich formatting & links
  • Tables
  • Charts

SLIDE 8

Scope

Short tail

Genre specific (e.g., forums)

Long tail

SLIDE 9

Pattern Complexity

  • Closed set (U.S. states)
– He was born in Alabama…
– The big Wyoming sky…
  • Regular set (U.S. phone numbers)
– Phone: (413) 545-1323
– The CALD main office can be reached at 412-268-1299
  • Complex (U.S. postal addresses)
– University of Arkansas P.O. Box 140 Hope, AR 71802
– Headquarters: 1128 Main Street, 4th Floor Cincinnati, Ohio 45210
  • Ambiguous, needing context (person names)
– …was among the six houses sold by Hope Feldman that year.
– Pawel Opalinski, Software Engineer at WhizBang Labs.

Courtesy of Andrew McCallum

  • Unusual language models
– “YOU don't wanna miss out on ME :) Perfect lil booty Green eyes Long curly black hair Im a Irish, Armenian and Filipino mixed princess :) ❤ Kim ❤ 7○7~7two7~7four77 ❤ HH 80 roses ❤ Hour 120 roses ❤ 15 mins 60 roses”
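
The "regular set" case on this slide can be sketched with a regular expression. This is a minimal illustration, assuming U.S.-style numbers formatted like the two slide examples; the pattern and the name `find_us_phones` are assumptions made for this example, not part of the original deck.

```python
import re

# Illustrative pattern (an assumption): a parenthesised area code or a bare
# 3-digit area code, followed by 3 digits, a separator, and 4 digits.
US_PHONE = re.compile(r"(?:\(\d{3}\)\s*|\b\d{3}[-.\s])\d{3}[-.\s]\d{4}\b")

def find_us_phones(text):
    """Return all substrings of text that look like U.S. phone numbers."""
    return US_PHONE.findall(text)

print(find_us_phones("Phone: (413) 545-1323"))
print(find_us_phones("The CALD main office can be reached at 412-268-1299"))
```

A pattern like this finds nothing in the obfuscated number of the ad example (digits mixed with words and symbols), which is one reason such text is listed as a separate challenge.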

SLIDE 10

small amount of relevant content

irrelevant content very similar to relevant content

SLIDE 11

Spreadsheets Created For Human Consumption

SLIDE 12

Databases with PDF Code Books

SLIDE 13

Data In Web Tables

SLIDE 14

Knowledge Graph Construction: Long-tail vs. Short-tail

SLIDE 15

Extracting Data from Semi-structured Sources

NAME: Casablanca Restaurant
STREET: 220 Lincoln Boulevard
CITY: Venice
PHONE: (310) 392-5751

Wrappers
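
A wrapper maps one site's page template to a fixed set of fields. The sketch below assumes a hypothetical HTML fragment and hand-written landmark rules; the markup, the rule set, and the name `apply_wrapper` are invented for illustration and are not taken from the slides.

```python
import re

# Hypothetical page fragment; the markup is an assumption made for this example.
PAGE = """
<h1>Casablanca Restaurant</h1>
<p class="addr">220 Lincoln Boulevard, Venice</p>
<p class="tel">Phone: (310) 392-5751</p>
"""

# One rule per field: a landmark before the value, the captured value,
# and a landmark after it.
RULES = {
    "NAME": r"<h1>(.*?)</h1>",
    "STREET": r'<p class="addr">(.*?),',
    "CITY": r'<p class="addr">.*?,\s*(.*?)</p>',
    "PHONE": r"Phone:\s*(.*?)</p>",
}

def apply_wrapper(page, rules):
    """Apply each field rule to the page; fields with no match map to None."""
    record = {}
    for field, pattern in rules.items():
        match = re.search(pattern, page)
        record[field] = match.group(1).strip() if match else None
    return record

print(apply_wrapper(PAGE, RULES))
```

Rules like these are tied to one page template, so each new site in the long tail needs its own wrapper.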

SLIDE 16

Information Integration in Karma

SLIDE 17

Knowledge Graphs

Karma uses semantic models to create knowledge graphs

Karma semi-automatically builds semantic models … and provides a nice GUI to edit them

SLIDE 18

Practical Considerations for Extractions

  • How good (precision/recall) is necessary?
– High precision when showing KG nodes to users
– High recall when used for ranking results
  • How long does it take to construct?
– Minutes, hours, days, months
  • What expertise do I need?
– None (domain expertise), patience (annotation), scripting, machine learning guru
  • What tools can I use?
– Many …

Long tail vs. short tail
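
The precision/recall question above can be made concrete with a small calculation. This is a minimal sketch, assuming extractions and gold annotations are compared as exact string matches; the function name `precision_recall` and the sample values are illustrative only.

```python
def precision_recall(extracted, gold):
    """Precision = correct / extracted; recall = correct / gold."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Illustrative data: three extracted values, four gold values, two in common.
extracted = ["(310) 392-5751", "412-268-1299", "71802"]
gold = ["(310) 392-5751", "412-268-1299", "(413) 545-1323", "Venice"]
print(precision_recall(extracted, gold))
```

An extractor tuned for display to users would accept lower recall in exchange for higher precision; one feeding a ranker would make the opposite trade.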

SLIDE 19

Data Tables

  • Entity table
  • Matrix table
  • List table

SLIDE 20

Table Type Classification

  • Feature-based supervised classification
– Cafarella’08
– Crestan’11
– Eberius’15
  • Deep learning
– Nishida’17

SLIDE 21

Semantic + Structure Embedding

SLIDE 22

Data Extraction Techniques

  • Glossary
  • Regular expressions
  • Natural language rules
  • Named entity recognition
  • Sequence labeling (Conditional Random Fields)
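
The simplest technique in this list, glossary extraction, can be sketched as a longest-match lookup over word n-grams. The glossary contents, the `max_words` limit, and the name `glossary_extract` are assumptions for illustration; real glossaries are much larger and usually stored in a trie or similar index.

```python
import re

def glossary_extract(text, glossary, max_words=3):
    """Return glossary terms found in text, preferring the longest match."""
    terms = {term.lower(): term for term in glossary}
    tokens = re.findall(r"\w+", text.lower())
    found, i = [], 0
    while i < len(tokens):
        # Try the longest n-gram starting at position i first.
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in terms:
                found.append(terms[candidate])
                i += n
                break
        else:
            i += 1  # no glossary term starts at this token
    return found

GLOSSARY = ["Alabama", "Wyoming", "New York", "York"]  # illustrative entries
print(glossary_extract("He was born in Alabama and moved to New York.", GLOSSARY))
```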

SLIDE 23

Searching Knowledge Graphs

SLIDE 24

Many problems with ‘strict’ execution: the query below, run against the NoSQL store, returns no results.

  • synonyms “red”
  • typos “brunette”
  • not present
  • numbers hard to match
  • Claire is a common name
  • Gold is a domain word
  • slang, e.g., “FOB” for Asian
  • inference, e.g., “Japanese”

SELECT ?ad WHERE {
  ?ad a :Ad ;
      :hair_color 'Auburn' ;
      :review_site_id 'cg9469f' ;
      :price_per_hour '500' ;
      :name 'Claire Gold' ;
      :ethnicity 'Asian' .
}

SLIDE 25

Candidate Generation

SELECT ?ad ?ethnicity WHERE {
  ?ad a :Ad ;
      :hair_color 'Auburn' ;
      :review_site_id 'cg9469f' ;
      :price_per_hour '500' ;
      :name 'Claire Gold' ;
      :ethnicity ?ethnicity .
}

Query Reformulation: query 1, query 2, query 3, query 4, …, query n (Precision ↔ Recall)

  • Keyword expansion
  • Context broadening
  • Constraint relaxation

Elasticsearch (100M entities) → Ranked candidates
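
Of the three reformulation strategies, constraint relaxation is the easiest to sketch: drop constraints from the strict query and rank candidates by how many constraints they still satisfy. The example below is a rough illustration only. An in-memory list stands in for the Elasticsearch index, and the names `relaxed_queries` and `rank_candidates` and the sample ads are hypothetical.

```python
from itertools import combinations

def relaxed_queries(constraints, min_kept=1):
    """Yield constraint subsets, most specific (largest) first."""
    items = sorted(constraints.items())
    for size in range(len(items), min_kept - 1, -1):
        for subset in combinations(items, size):
            yield dict(subset)

def rank_candidates(ads, constraints):
    """Rank ads by how many constraints they satisfy; drop ads matching none."""
    scored = []
    for ad in ads:
        score = sum(1 for key, value in constraints.items() if ad.get(key) == value)
        if score:
            scored.append((score, ad["id"]))
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [ad_id for _, ad_id in ordered]

CONSTRAINTS = {"hair_color": "Auburn", "name": "Claire Gold", "ethnicity": "Asian"}
ADS = [  # illustrative records, not real data
    {"id": "a1", "hair_color": "Auburn", "name": "Claire Gold"},
    {"id": "a2", "ethnicity": "Asian"},
    {"id": "a3", "hair_color": "Blonde"},
]
print(len(list(relaxed_queries(CONSTRAINTS))))
print(rank_candidates(ADS, CONSTRAINTS))
```

The strict query would return nothing here, because no ad satisfies all three constraints. Ranking by partial matches still surfaces the closest candidates.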

SLIDE 26

Knowledge Graph Completion

SLIDE 27

For Web extractions, noise is inevitable

  • Thousands of web domains
  • Many page formats
  • Distracting & irrelevant content
  • Purposeful obfuscation
  • Poor grammar & spelling
  • Tables

To reach its potential, a constructed KG must be completed or identified

SLIDE 28

Entity Resolution

Many nodes refer to the same underlying entity

SLIDE 29

Entity Resolution is fundamentally non-linear

  • Theoretically quadratic in the number of nodes, even if the ‘resolution rule’ were known
  • In practice, the number of ‘duplicates’ tends to grow linearly, and duplicates overlap in non-trivial ways
  • How to devise efficient algorithms?

50 years of research has agreed on a two-step solution:

Knowledge graph → Execute blocking → Candidate set → Execute similarity → Resolved entities
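
The two-step pipeline can be sketched as follows. This is a minimal example, assuming token-based blocking on names and Jaccard similarity with a fixed threshold. These specific choices, the function names, and the sample records are assumptions for illustration; the slides do not prescribe a particular blocking key or similarity function.

```python
from collections import defaultdict
from itertools import combinations

def tokens(name):
    return set(name.lower().split())

def blocking(records):
    """Step 1: only records that share a name token become candidate pairs."""
    blocks = defaultdict(set)
    for rid, name in records.items():
        for tok in tokens(name):
            blocks[tok].add(rid)
    candidates = set()
    for ids in blocks.values():
        candidates.update(combinations(sorted(ids), 2))
    return candidates

def jaccard(a, b):
    return len(a & b) / len(a | b)

def resolve(records, threshold=0.5):
    """Step 2: keep candidate pairs whose name similarity meets the threshold."""
    return sorted(
        pair for pair in blocking(records)
        if jaccard(tokens(records[pair[0]]), tokens(records[pair[1]])) >= threshold
    )

RECORDS = {1: "Claire Gold", 2: "claire gold", 3: "Claire Golding", 4: "Pawel Opalinski"}
print(sorted(blocking(RECORDS)))  # 3 candidate pairs instead of all 6
print(resolve(RECORDS))
```

Blocking avoids the quadratic comparison of every pair, at the cost of missing duplicates that share no blocking key.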

SLIDE 30

Other research frontiers for KG Completion

Other things explored in the literature:

  • Domain knowledge
– Collective ER methods have tried to exploit this systematically
  • Entity Resolution + Ontologies + IE confidences
– Probabilistic graphical models like Probabilistic Soft Logic
  • Knowledge graph embeddings
– Useful for link prediction and triple classification
– Recall the Microsoft-founded_in-Seattle example earlier
  • Multi-type Entity Resolution
– Extremely useful for knowledge graphs, lots more work to be done
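
To show how embeddings support link prediction, the sketch below scores triples with a TransE-style translation distance. TransE is one common embedding model and is used here as an assumption, since the slides do not name a specific one. The 2-d vectors are hand-picked so that the slide's example triple scores as plausible. They are not learned, and they say nothing about whether the triple is factually correct.

```python
import math

# Hand-picked toy vectors (assumptions); real embeddings are learned from the graph.
ENTITY = {
    "Microsoft": (1.0, 0.0),
    "Seattle": (1.0, 1.0),
    "Cincinnati": (3.0, -1.0),
}
RELATION = {"founded_in": (0.0, 1.0)}

def transe_score(head, relation, tail):
    """TransE-style distance ||h + r - t||; lower means more plausible."""
    h, r, t = ENTITY[head], RELATION[relation], ENTITY[tail]
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

def predict_tail(head, relation):
    """Link prediction: rank every other entity as a candidate tail."""
    others = [e for e in ENTITY if e != head]
    return sorted(others, key=lambda tail: transe_score(head, relation, tail))

print(predict_tail("Microsoft", "founded_in"))
```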

SLIDE 31

THANK YOU! QUESTIONS...