
Probase Haixun Wang Microsoft Research Asia Short Text Document - PowerPoint PPT Presentation



  1. Probase • Haixun Wang • Microsoft Research Asia

  2. Short Text • Document Title • Search • Caption • Ad keywords • Question • Anchor text

  3. The big question • How does the mind get so much out of so little? • Our minds build rich models of the world and make strong generalizations from input data that is sparse, noisy, and ambiguous – in many ways far too limited to support the inferences we make. • How do we do it?

  4. Science 331, 1279 (2011) • MIT • CMU • Berkeley • Stanford

  5. If the mind goes beyond the data given, another source of information must make up the difference.

  6. h1: all and only horses • h2: all horses except Clydesdales • h3: all animals

  7. Posterior ∝ likelihood × prior: $P(h \mid d) \propto P(d \mid h)\,P(h)$ • h1: all and only horses • h2: all horses except Clydesdales • h3: all animals

  8. Which is “kiki” and which is “bouba”?

  9. sound • shape • zigzaggedness

  10. Another example Pablo Picasso 25 Oct 1881 Spanish

  11. Probase: a semantic network for text understanding • Concepts • Entities • isA • isPropertyOf • Co-occurrence

  12. isA Extraction • Hearst patterns – NP such as NP, NP, ..., and|or NP – such NP as NP,* or|and NP – NP, NP*, or other NP – NP, NP*, and other NP – NP, including NP,* or|and NP – NP, especially NP,* or|and NP • Examples: “domestic animals such as cats and dogs …”; “animals other than cats such as dogs …” • “… is a …” pattern – NP is a/an/the NP • Examples: “China is a developing country.”; “Life is a box of chocolate.”
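To make the pattern matching on this slide concrete, here is a minimal sketch of two of the Hearst patterns ("NP such as NP, ..." and "NP, NP, and other NP") in Python. It is not the Probase extractor: the function name `extract_isa` is hypothetical, and a noun phrase is approximated by a single word instead of the output of a real NP chunker.

```python
import re

# Assumption: the super-concept is a single word; real systems use an NP chunker.
SUCH_AS = re.compile(r"\b(?P<concept>\w+) such as (?P<items>[^.;]+)")
OTHER = re.compile(r"(?P<items>[^.;]+?),? (?:and|or) other (?P<concept>\w+)")
# Splits an instance list on commas and on "and" / "or".
SPLIT = re.compile(r",\s*(?:and |or )?|\s+(?:and|or)\s+")


def extract_isa(sentence):
    """Return (instance, concept) pairs found by two simplified Hearst patterns."""
    pairs = []
    text = sentence.lower()
    for pattern in (SUCH_AS, OTHER):
        match = pattern.search(text)
        if match is None:
            continue
        concept = match.group("concept")
        for item in SPLIT.split(match.group("items")):
            item = item.strip()
            if item:
                pairs.append((item, concept))
    return pairs


print(extract_isa("domestic animals such as cats and dogs"))
print(extract_isa("cats, dogs, and other animals"))
print(extract_isa("animals other than cats such as dogs"))
```

On the last sentence this purely syntactic matcher attaches "dogs" to "cats" rather than to "animals", which is the kind of ambiguity the next two slides illustrate.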

  13. “… animals other than cats such as dogs …” • Candidate readings: dogs isA animals; dogs isA cats

  14. “… household pets other than animals such as reptiles, aquarium fish …” • Candidate readings: reptiles isA household pets; reptiles isA animals

  15. Iterative Information Extraction • Syntactic patterns • Knowledge

  16. Semantic Drifts A drifting point 10%-20% Precision Improvement

  17. Probase Concepts (2.7 million+) • Basic watercolor techniques • countries • Celebrity wedding dress designers • Probase isA error rate: <1% @1 and <10% for a random pair

  18. A traditional taxonomy

  19. “python” in Probase “python”

  20. # of descendants (WordNet)

  21. Transitivity does not always hold furniture plastic material chair film

  22. # of descendants (early version of Probase)

  23. Probase Scores (foundation for inferencing) • Typicality • Vagueness • Representativeness • Ambiguity • Similarity

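  24. Typicality (example concept: bird) • $P(e \mid c) = \dfrac{n(c, e) + \alpha}{\sum_{e_i \in c} n(c, e_i) + \alpha N}$ • $P(c \mid e) = \dfrac{n(c, e) + \alpha}{\sum_{e \in c_i} n(c_i, e) + \alpha N}$ • “robin” is a more typical bird than a “penguin”: $p(\text{robin} \mid \text{bird}) > p(\text{penguin} \mid \text{bird})$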
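The smoothed typicality scores on this slide, P(e|c) and P(c|e) computed from co-occurrence counts n(c, e) with additive smoothing α, can be sketched as follows. This is an illustration under assumptions: the slide does not define N, so the code takes N to be the number of distinct entities (for P(e|c)) or concepts (for P(c|e)), and the counts are invented for the example.

```python
from collections import defaultdict


def typicality(counts, alpha=1.0):
    """counts maps (concept, entity) -> n(c, e). Returns two scoring functions."""
    entities = {e for _, e in counts}
    concepts = {c for c, _ in counts}
    total_per_concept = defaultdict(float)  # sum over e_i of n(c, e_i)
    total_per_entity = defaultdict(float)   # sum over c_i of n(c_i, e)
    for (c, e), n in counts.items():
        total_per_concept[c] += n
        total_per_entity[e] += n

    def p_e_given_c(e, c):
        # Assumption: N = number of distinct entities, so the scores sum to 1 over e.
        return (counts.get((c, e), 0) + alpha) / (total_per_concept[c] + alpha * len(entities))

    def p_c_given_e(c, e):
        # Assumption: N = number of distinct concepts, so the scores sum to 1 over c.
        return (counts.get((c, e), 0) + alpha) / (total_per_entity[e] + alpha * len(concepts))

    return p_e_given_c, p_c_given_e


# Invented counts, for illustration only.
counts = {("bird", "robin"): 90, ("bird", "penguin"): 10, ("animal", "penguin"): 30}
p_e_given_c, p_c_given_e = typicality(counts)
print(p_e_given_c("robin", "bird") > p_e_given_c("penguin", "bird"))
```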

  25. Representativeness (basic level of categorization) • Microsoft isA: company … software company … largest OS vendor • company: high typicality p(c|e) • largest OS vendor: high typicality p(e|c) • software company: $\max_c\, p(c \mid e) \cdot p(e \mid c)$
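A small sketch of the selection rule on this slide, which picks the concept maximizing p(c|e) · p(e|c). The function name `basic_level_concept` is hypothetical and the probabilities are invented to mirror the Microsoft example; they are not Probase values.

```python
def basic_level_concept(p_c_given_e, p_e_given_c):
    """Pick the concept c that maximizes p(c|e) * p(e|c) for one entity e.

    Both arguments map concept -> probability for the same fixed entity.
    """
    return max(p_c_given_e, key=lambda c: p_c_given_e[c] * p_e_given_c.get(c, 0.0))


# Invented numbers for the entity "Microsoft" (illustration only).
p_c_given_e = {"company": 0.60, "software company": 0.30, "largest OS vendor": 0.01}
p_e_given_c = {"company": 0.001, "software company": 0.05, "largest OS vendor": 0.90}
print(basic_level_concept(p_c_given_e, p_e_given_c))
```

With these numbers "company" loses because Microsoft is only one of very many companies, and "largest OS vendor" loses because few mentions of Microsoft use that concept; the product favors the middle level.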

  26. Vagueness • Examples: key players, factors, items, things, reasons, … • $V(C) = \dfrac{|\{e_i \mid P(C \mid e_i) \ge c,\ \forall e_i \in C\}|}{N(C)}$ • (Do people whom you regard highly regard you highly?)
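The score on this slide counts how many entities e_i of a concept C have P(C|e_i) at or above a threshold c, divided by N(C). A minimal sketch, assuming N(C) is the number of entities listed under C (the slide does not define it) and using invented probabilities:

```python
def fraction_above_threshold(p_concept_given_entity, threshold):
    """p_concept_given_entity maps each entity e_i of one concept C to P(C|e_i)."""
    if not p_concept_given_entity:
        return 0.0
    strong = sum(1 for p in p_concept_given_entity.values() if p >= threshold)
    # Assumption: N(C) is the number of entities listed under C.
    return strong / len(p_concept_given_entity)


# Invented values (illustration only): entities of "things" rarely point back to it.
things = {"book": 0.01, "car": 0.02, "idea": 0.03, "money": 0.01}
countries = {"china": 0.7, "india": 0.8, "brazil": 0.6, "monaco": 0.4}
print(fraction_above_threshold(things, 0.5), fraction_above_threshold(countries, 0.5))
```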

  27. Ambiguity • Probase defines 3 levels of ambiguity – Level 0 (1 sense): apple juice – Level 1 (2 or more related senses): Google – Level 2 (2 or more senses): python • Concepts form clusters, clusters form senses (through isA relation) • [Figure: concepts region, creature, crop, food, animal, country, city, state, fruit, vegetable, meat, predator]

  28. Similarity • microsoft, ibm 0.933 • google, apple 0.378 ?? • $\mathrm{sim}(t_1, t_2) = \max_{x,y}\, \mathrm{cosine}\big(c_x(t_1),\, c_y(t_2)\big)$
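The formula on this slide takes the best cosine similarity over all pairs of concept clusters of the two terms. A minimal sketch, assuming each cluster c_x(t) is represented as a sparse dictionary of concept weights; the representation and the weights are assumptions for illustration, not Probase data.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors given as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def sim(clusters_1, clusters_2):
    """Maximum cosine over every pair of concept clusters of the two terms."""
    return max((cosine(a, b) for a in clusters_1 for b in clusters_2), default=0.0)


# Invented cluster vectors (illustration only).
apple = [{"fruit": 1.0, "food": 0.5}, {"company": 1.0, "brand": 0.4}]
google = [{"company": 1.0, "search engine": 0.6}]
print(sim(apple, google))
```

Taking the maximum over sense pairs lets an ambiguous term such as "apple" match on its company sense even when its fruit sense has nothing in common with the other term.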

  29. Applications • Query Understanding – Head/Modifier/Constraint detection • … • SRL (semantic role labeling) with FrameNet – e.g. “Tom broke the window.” (Tom: agent; the window: patient)

  30. Example: FrameNet • Frame: Apply_heat (FE1, FE2, FE3, FE4) • Concept, P(c|FE): heat source 0.19; Large metal 0.04; Kitchen appliance 0.02 • Instance, P(w|FE): Stove 0.00019; Radiator* 0.00015; Oven 0.00015; Grill* 0.00014; Heater* 0.00013; Fireplace* 0.00013; Lamp* 0.00013; Hair dryer* 0.00012; Candle* 0.00012

  31. Example: Head and Modifier Detection • toy kid • cover iphone (accessory, smart phone) • seattle hotel jobs

  32. When concepts are too specific • Example: mobile windows operating system / head large and inferential software vendor / modifier • No generalization power • $\text{million}^2$ patterns

  33. When concepts are too general • Head, Modifier: modem, comcast; wireless router, comcast; … → (Device/Head, Company/Modifier) • Conflict • Head, Modifier: netflix, touchpad; skype, windows phone; … → (Device/Modifier, Company/Head)

  34. Knowledge Bases: WordNet, Wikipedia, Freebase, Probase
      • Cat
        – WordNet: Feline; Felid; Adult male; Man; Gossip; Gossiper; Gossipmonger; Rumormonger; Rumourmonger; Newsmonger; Woman; Adult female; Stimulant; Stimulant drug; Excitant; Tracked vehicle; ...
        – Wikipedia: Domesticated animals; Cats; Felines; Invasive animal species; Cosmopolitan species; Sequenced genomes; Animals described in 1758
        – Freebase: TV episode; Creative work; Musical recording; Organism classification; Dated location; Musical release; Book; Musical album; Film character; Publication; Character species; Top level domain; Animal; Domesticated animal; ...
        – Probase: Animal; Pet; Species; Mammal; Small animal; Thing; Mammalian species; Small pet; Animal species; Carnivore; Domesticated animal; Companion animal; Exotic pet; Vertebrate; ...
      • IBM
        – WordNet: N/A
        – Wikipedia: Companies listed on the New York Stock Exchange; IBM; Cloud computing providers; Companies based in Westchester County, New York; Multinational companies; Software companies of the United States; Top 100 US Federal Contractors; ...
        – Freebase: Business operation; Issuer; Literature subject; Venture investor; Competitor; Software developer; Architectural structure owner; Website owner; Programming language designer; Computer manufacturer/brand; Customer; Operating system developer; Processor manufacturer; ...
        – Probase: Company; Vendor; Client; Corporation; Organization; Manufacturer; Industry leader; Firm; Brand; Partner; Large company; Fortune 500 company; Technology company; Supplier; Software vendor; Global company; Technology company; ...
      • Language
        – WordNet: Communication; Auditory communication; Word; Higher cognitive process; Faculty; Mental faculty; Module; Text; Textual matter
        – Wikipedia: Languages; Linguistics; Human communication; Human skills; Wikipedia articles with ASCII art
        – Freebase: Employer; Written work; Musical recording; Musical artist; Musical album; Literature subject; Query; Periodical; Type profile; Journal; Quotation subject; Type/domain equivalent topic; Broadcast genre; Periodical subject; Video game content descriptor; ...
        – Probase: Instance of: Cognitive function; Knowledge; Cultural factor; Cultural barrier; Cognitive process; Cognitive ability; Cultural difference; Ability; Characteristic. Attribute of: Film; Area; Book; Publication; Magazine; Country; Work; Program; Media; City; ...

  35. What can Probase do? enable understanding and make up for the lack of depth

  36. Knowledge bases • Covers every topic? • Knows about everything in a topic? • Contains rich connections? • Breadth and density enable understanding

  37. Concept Learning India China Brazil emerging market country

  38. taste body smell wine

  39. Understanding Web Tables • Example attributes: website, president, city, motto, state, type, director • [Figure: # of Concepts vs. # of Attributes]

  40. population china country

  41. collector of fine china earthenware

  42. Bayesian • For a mixture of instances and properties: Noisy-Or model $P(c \mid t_l) = 1 - \big(1 - P(c \mid t_l, z_l = 1)\big)\big(1 - P(c \mid t_l, z_l = 0)\big)$, where $z_l = 1$ indicates $t_l$ is an entity and $z_l = 0$ indicates $t_l$ is a property • Bayesian rule gives: $P(c \mid T) \propto P(c) \prod_{l}^{L} P(t_l \mid c) \propto \dfrac{\prod_{l} P(c \mid t_l)}{P(c)^{L-1}}$
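The two steps on this slide, a Noisy-Or combination of the entity and property readings of each term, followed by a naive Bayes combination across terms, can be sketched as below. The function names and all probabilities are invented for illustration (loosely following the "population china" example from slide 40); the normalization step is added so the scores read as a distribution.

```python
import math


def noisy_or(p_as_entity, p_as_property):
    """P(c|t) = 1 - (1 - P(c|t, z=1)) * (1 - P(c|t, z=0))."""
    return 1.0 - (1.0 - p_as_entity) * (1.0 - p_as_property)


def concept_posterior(prior, per_term):
    """Combine per-term scores: P(c|T) proportional to prod_l P(c|t_l) / P(c)^(L-1).

    prior maps concept -> P(c); per_term is a list of dicts concept -> P(c|t_l).
    """
    num_terms = len(per_term)
    scores = {}
    for concept, p_c in prior.items():
        product = math.prod(term.get(concept, 0.0) for term in per_term)
        scores[concept] = product / p_c ** (num_terms - 1)
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()} if total else scores


# Invented numbers (illustration only) for the short text "population china".
prior = {"country": 0.5, "earthenware": 0.5}
china = {"country": 0.7, "earthenware": 0.3}
population = {"country": noisy_or(0.0, 0.9), "earthenware": noisy_or(0.0, 0.1)}
print(concept_posterior(prior, [china, population]))
```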

  43. iPad apple company device

  44. [Figure: common co-occurrence neighbours of “ipad” and “apple” (cooccur1, cooccur2), e.g. iphone, ipod, google, mac, ipod touch, app, apps, microsoft, adobe, acer, cell phones, android, tablet, 3g, iphone os, steve jobs; concept clusters with no filtering vs. filtering: device, tablet, company, product]

  45. Modeling Co-occurrence Probase + LDA model Wikipedia Concept Topic Word
