IN4080 2020 FALL NATURAL LANGUAGE PROCESSING, Jan Tore Lønning

  1. IN4080 – 2020 FALL NATURAL LANGUAGE PROCESSING, Jan Tore Lønning

  2. Vectors, Distributions, Embeddings. Lecture 5, Sept 14

  3. Today
     - Lexical semantics
     - Vector models of documents
     - tf-idf weighting
     - Word-context matrices
     - Word embeddings with dense vectors

  4. The meaning of words
     - Words (lecture 2)
     - Type – token
     - Word – lexeme – lemma
     - Meaning?

  5. Look into the dictionary
     - [Figure: dictionary entry for "pepper, n." (pronunciation, forms, etymology, and numbered senses covering the spice, the plant Piper nigrum, and capsicums), annotated with the labels lemma, sense, and definition]
     - A word with several senses is called polysemous
     - If two different words look and sound the same, they are called homonyms
     - How to tell: one word or several?
       - Common origin
       - But not waterproof/easy to see

  11. Relations between senses

     | Term | Definition | Examples |
     |---|---|---|
     | Synonymy | Have the same meaning in all(?)/some(?) contexts | sofa-couch, bus-coach, big-large |
     | Antonymy | Opposites with respect to a feature of meaning | true-false, strong-weak, up-down |
     | Hyponym-hyperonym | The <hyponym> is a type-of the <hyperonym> | rose → flower, cow → animal, car → vehicle |
     | Similarity | | cow-horse, boy-girl |
     | Related | | money-bank, fish-water |

  12. Resources for lexical semantics: WordNet
     - https://wordnet.princeton.edu
     - To each word: one or more synsets
     - Relations between the synsets
     - [Figure: the words lounge and couch linked to their synsets: {lounge, waiting room, waiting area}; {sofa, couch, lounge}; {couch (psych. bench)}; {couch (coat of paint)}]

  13. What does ongchoi mean?
     - Suppose you see these sentences:
       - Ong choi is delicious sautéed with garlic.
       - Ong choi is superb over rice
       - Ong choi leaves with salty sauces
     - And you've also seen these:
       - …spinach sautéed with garlic over rice
       - Chard stems and leaves are delicious
       - Collard greens and other salty leafy greens
     - Conclusion: Ongchoi is a leafy green like spinach, chard, or collard greens

  14. Similar
     - Related (first-order association, syntagmatic): ong choi with delicious, sautéed with garlic, over rice
     - Similar (second-order association, paradigmatic): ong choi and spinach

  15. The distributional hypothesis
     - Words that occur in similar contexts have similar meanings
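
The hypothesis can be illustrated with a small sketch in plain Python. It assumes a toy corpus adapted from the ongchoi sentences (lower-cased, accents dropped, plus one unrelated sentence added for contrast), uses the whole sentence as the context of a word, and compares context counts with cosine similarity. The helper names `context_vector` and `cosine` are illustrative choices, not taken from the lecture.

```python
from collections import Counter
from math import sqrt

# Toy corpus (assumption): simplified versions of the example sentences,
# plus one unrelated sentence for contrast.
sentences = [
    "ongchoi is delicious sauteed with garlic",
    "ongchoi is superb over rice",
    "spinach sauteed with garlic over rice",
    "chard stems and leaves are delicious",
    "the car drove down the road",
]

def context_vector(target, sentences):
    """Count the words that share a sentence with `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        if target in tokens:
            counts.update(t for t in tokens if t != target)
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

ongchoi = context_vector("ongchoi", sentences)
spinach = context_vector("spinach", sentences)
car = context_vector("car", sentences)

print("ongchoi vs spinach:", cosine(ongchoi, spinach))
print("ongchoi vs car:    ", cosine(ongchoi, car))
```

In this toy setting ongchoi and spinach share several context words, while ongchoi and car share none, so the first similarity is higher.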

  17. Shakespeare (from J & M)
     - Vectors are similar for the two comedies
     - Different than the historical dramas
     - Comedies have more fools and wit and fewer battles.
     - Notice similarity to text classification
       - Mandatory 2A, multinomial
       - The document represented by a vector with the occurrences of 35,000 terms
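
As a minimal sketch of what "a document represented by a vector of term occurrences" means, the following builds count vectors over a shared vocabulary. The two miniature documents are invented for illustration and stand in for whole plays; the real vectors on the slide have about 35,000 dimensions.

```python
from collections import Counter

# Hypothetical miniature "plays" (assumption, not real Shakespeare text).
docs = {
    "comedy": "the fool and the wit met the good fool",
    "history": "the battle was a good battle and a long battle",
}

# The vocabulary fixes the meaning of each vector position.
vocabulary = sorted({tok for text in docs.values() for tok in text.split()})

def count_vector(text, vocabulary):
    """Return the raw term counts of `text`, one entry per vocabulary term."""
    counts = Counter(text.split())
    return [counts[term] for term in vocabulary]

vectors = {name: count_vector(text, vocabulary) for name, text in docs.items()}

print(vocabulary)
for name, vec in vectors.items():
    print(name, vec)
```

Every document gets a vector of the same length, so documents can be compared position by position.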

  18. Document classification
     - The word vectors were used as basis for classification
     - If two documents had the same vectors they were put in the same class
     - Documents are similar = on the same side of the separating hyperplane
     - A problem to draw 35,000 dimensions

  19. Information retrieval (IR)
     - Documents placed in the same n-dimensional space as in classification
     - Retrieve documents similar to a given document
     - [Figure: the plays plotted with fool on the x-axis and battle on the y-axis: Henry V [4,13], Julius Caesar [1,7], Twelfth Night [58,0], As You Like It [36,1]]

  20. Cosine similarity
     - Several possible ways to define similarity, e.g.,
       - Euclidean
       - Manhattan
     - Most common: cosine
     - Do the arrows point in the same direction?
     - [Figure: the plays drawn as arrows, with fool on the x-axis and battle on the y-axis: Henry V [4,13], Julius Caesar [1,7], Twelfth Night [58,0], As You Like It [36,1]]
     - $\cos(\mathbf{v}, \mathbf{w}) = \frac{\mathbf{v} \cdot \mathbf{w}}{|\mathbf{v}|\,|\mathbf{w}|} = \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\,\sqrt{\sum_{i=1}^{N} w_i^2}}$

  21. Let us try: $\cos(w_1, w_2)$

     Full vectors:

     | | AYLI | TwNi | JuCa | HenV |
     |---|---|---|---|---|
     | AYLI | 1.000 | 0.950 | 0.945 | 0.949 |
     | TwNi | 0.950 | 1.000 | 0.809 | 0.822 |
     | JuCa | 0.945 | 0.809 | 1.000 | 0.999 |
     | HenV | 0.949 | 0.822 | 0.999 | 1.000 |

     battles & fools:

     | | AYLI | TwNi | JuCa | HenV |
     |---|---|---|---|---|
     | AYLI | 1.000 | 1.000 | 0.169 | 0.321 |
     | TwNi | 1.000 | 1.000 | 0.141 | 0.294 |
     | JuCa | 0.169 | 0.141 | 1.000 | 0.988 |
     | HenV | 0.321 | 0.294 | 0.988 | 1.000 |
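
The "battles & fools" numbers can be recomputed from the two-dimensional [fool, battle] vectors given on the plot. This is a sketch in plain Python; the function name `cosine` and the dictionary `plays` are illustrative choices.

```python
from math import sqrt

# [fool, battle] counts as shown on the plot.
plays = {
    "AYLI": (36, 1),   # As You Like It
    "TwNi": (58, 0),   # Twelfth Night
    "JuCa": (1, 7),    # Julius Caesar
    "HenV": (4, 13),   # Henry V
}

def cosine(v, w):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(vi * wi for vi, wi in zip(v, w))
    length_v = sqrt(sum(vi * vi for vi in v))
    length_w = sqrt(sum(wi * wi for wi in w))
    return dot / (length_v * length_w)

# Print the similarity matrix row by row, three decimals per entry.
for a in plays:
    row = " ".join(f"{cosine(plays[a], plays[b]):.3f}" for b in plays)
    print(a, row)
```

The two comedies point in almost the same direction, as do the two historical plays, while a comedy and a historical play are close to orthogonal.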

  23. Ways of counting: Term frequency
     Alternatives:
     - Raw counts/absolute frequencies, TwNi = (0, 80, 58, 15)
     - Binary counts (Mandatory 2A), TwNi = (0, 1, 1, 1)
     - Variants of normalization:
       - Rel. frequency, $(0, \frac{80}{80+58+15}, \frac{58}{80+58+15}, \frac{15}{80+58+15})$
         - TfidfTransformer(use_idf=False, norm="l1")
       - Length normalize, $(0, \frac{80}{\sqrt{80^2+58^2+15^2}}, \frac{58}{\sqrt{80^2+58^2+15^2}}, \frac{15}{\sqrt{80^2+58^2+15^2}})$
         - TfidfTransformer(use_idf=False, norm="l2")
     - Sublinear TF: $1 + \log(\mathrm{tf})$, 0 when tf = 0
       - TfidfTransformer(use_idf=False, sublinear_tf=True)
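
The alternatives can be computed by hand for the Twelfth Night vector. The sketch below uses plain Python instead of scikit-learn so that it runs without dependencies; it assumes the natural logarithm for the sublinear variant, and the variable names are illustrative.

```python
from math import log, sqrt

tf = [0, 80, 58, 15]  # raw counts from the slide

# Binary counts: 1 if the term occurs at all.
binary = [1 if c > 0 else 0 for c in tf]

# Relative frequency (L1 normalization): entries sum to 1.
total = sum(tf)
relative = [c / total for c in tf]

# Length normalization (L2): the vector gets Euclidean length 1.
length = sqrt(sum(c * c for c in tf))
l2_normalized = [c / length for c in tf]

# Sublinear tf: 1 + log(tf), and 0 when tf = 0 (natural log assumed).
sublinear = [1 + log(c) if c > 0 else 0.0 for c in tf]

print("binary:   ", binary)
print("relative: ", [round(x, 3) for x in relative])
print("l2:       ", [round(x, 3) for x in l2_normalized])
print("sublinear:", [round(x, 3) for x in sublinear])
```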

  24. Normalize or not?
     - The cosine similarity measure does a form of length normalization:
       - Raw counts, relative counts, and length-normalized counts yield the same cosine similarity
     - For other measures, it matters whether we normalize
       - e.g. L2 distance is relatively large between documents of different lengths
     - Sublinear scaling compresses the difference between terms that occur often and terms that occur very often:
       - If term1 occurs 100 times and term2 occurs 10 times:
         - with raw counts, term1 will be considered 10 times more frequent than term2
         - but less than 2 times as important with sublinear scaling ($1 + \log \mathrm{tf}$ gives a ratio of 1.5 with base-10 logarithms and about 1.7 with natural logarithms)
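
A small check of these claims, as a sketch in plain Python. The vector `other` is a made-up second document, and `doc_x3` simulates the same document at three times the length; both are assumptions for illustration only.

```python
from math import log, sqrt

def cosine(v, w):
    dot = sum(a * b for a, b in zip(v, w))
    return dot / (sqrt(sum(a * a for a in v)) * sqrt(sum(b * b for b in w)))

def euclidean(v, w):
    return sqrt(sum((a - b) ** 2 for a, b in zip(v, w)))

doc = [0, 80, 58, 15]            # raw counts from the term-frequency slide
other = [5, 60, 10, 30]          # hypothetical second document
doc_x3 = [3 * c for c in doc]    # same proportions, three times as long

# Cosine ignores the length of the vectors ...
print(cosine(doc, other), cosine(doc_x3, other))
# ... while Euclidean (L2) distance grows with the length difference.
print(euclidean(doc, other), euclidean(doc_x3, other))

# Sublinear scaling: 100 vs. 10 occurrences (natural log assumed).
ratio_raw = 100 / 10
ratio_sublinear = (1 + log(100)) / (1 + log(10))
print(ratio_raw, round(ratio_sublinear, 2))
```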

  25. Inverse document frequency
     - Intuition: A word occurring in a large proportion of documents is not a good discriminator.
     - $\mathrm{idf}_t = \log \frac{N}{\mathrm{df}_t}$, where $N$ is the number of documents and $\mathrm{df}_t$ is the number of documents containing $t$.
       - TfidfTransformer(use_idf=True, smooth_idf=False)
     - Smooth (avoid dividing by zero): $\mathrm{idf}_t = \log \frac{N}{\mathrm{df}_t + 1} + 1$
       - TfidfTransformer(use_idf=True, smooth_idf=True)
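
A minimal sketch of document frequency and idf over a made-up four-document collection, in plain Python. The unsmoothed function follows the first formula above. The smoothed function uses the form documented for scikit-learn's `smooth_idf=True`, log((1 + N) / (1 + df)) + 1, which may differ in detail from the smoothed formula on the slide; scikit-learn also adds 1 to the unsmoothed idf, which is not done here.

```python
from math import log

# Hypothetical collection: each document is reduced to its set of terms.
docs = [
    {"the", "fool", "wit"},
    {"the", "battle"},
    {"the", "fool"},
    {"the", "good", "battle"},
]
N = len(docs)

def df(term):
    """Number of documents containing `term`."""
    return sum(term in d for d in docs)

def idf(term):
    """Unsmoothed idf: log(N / df). Fails with ZeroDivisionError if df is 0."""
    return log(N / df(term))

def idf_smooth(term):
    """Smoothed idf (scikit-learn style): defined even for unseen terms."""
    return log((1 + N) / (1 + df(term))) + 1

for term in ["the", "fool", "wit", "unseen"]:
    plain = idf(term) if df(term) > 0 else None
    print(term, df(term), plain, round(idf_smooth(term), 3))
```

A term that occurs in every document ("the") gets an unsmoothed idf of 0, so it contributes nothing to discriminating between documents, while rarer terms get higher weights.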
