

SLIDE 1

LDA

SLIDE 2

LDA

[Credits: Mike Smith, Las Vegas Sun 2013]

SLIDE 3

[Credits: IITD Library]

SLIDE 4

SLIDE 5

SLIDE 6

SLIDE 7

In text, the hidden variables are the thematic structure. What are the topics that describe this collection? How does a new document fit into the topic structure?

SLIDE 8

Credits: [David Blei, KDD12]

SLIDE 9

  • Credits: [David Blei, KDD12]
SLIDE 10

P(topics, proportions, assignments | documents)
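The posterior above is what LDA inference approximates in practice. A minimal sketch using scikit-learn's variational-Bayes `LatentDirichletAllocation`; the toy corpus and the choice of two topics are illustrative assumptions, not from the slides:

```python
# A hedged sketch, not the deck's code: approximate
# P(topics, proportions, assignments | documents) with scikit-learn's
# variational-Bayes LDA. The toy corpus and n_components=2 are assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "bayesian network inference markov",
    "gradient regression posterior inference",
    "markov planning reinforcement",
    "backpropagation convolution dropout lstm",
]

counts = CountVectorizer().fit_transform(docs)   # document-term count matrix
lda = LatentDirichletAllocation(n_components=2, random_state=0)
proportions = lda.fit_transform(counts)          # per-document topic proportions

# Each row is a distribution over the 2 topics and sums to ~1.
print(proportions.shape)
```

`lda.components_` holds the fitted (unnormalized) topic-word weights, the β side of the model.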

SLIDE 11

[Plate diagram: topic proportions θ, topic assignments a_{e,o}, observed words X_{e,o}, topics β]

SLIDE 12

[Plate diagram: θ, a_{e,o}, X_{e,o}, β]

SLIDE 13

[Plate diagram: θ, a_{e,o}, X_{e,o}, β with Dirichlet prior η]

  • θ: per-document topic proportions
  • β: per-topic word distributions
SLIDE 14

[Plate diagram: Dirichlet prior α on topic proportions θ]

SLIDE 15

[Credits: Wikipedia]

SLIDE 16

SLIDE 17

SLIDE 18

SLIDE 19

[Plate diagram: θ, a_{e,o}, X_{e,o}, β]

SLIDE 20

Topic 1: PGM (γ2): Bayesian 0.1, Markov 0.09, Network 0.07, Inference 0.07, …
Topic 2: ML (γ3): Inference 0.2, Posterior 0.15, Regression 0.1, Gradient 0.09, …
Topic 3: AI (γ4): Markov 0.09, Reinforcement 0.08, Planning 0.08, …
Topic 4: Deep Learning (γ5): Backpropagation 0.15, Convolution 0.1, LSTM 0.09, Dropout 0.07, …

Document proportions γe: Topic 1: 0.7, Topic 2: 0.1, Topic 3: 0.15, Topic 4: 0.05
Token "Markov" with assignment a_{e,o} = Topic 1, observed word X_{e,o}
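The slide's numbers illustrate the token-assignment step: the posterior over which topic generated a token is proportional to the document's topic proportion times the topic's word probability. The dictionaries below transcribe the slide's values; the Bayes-rule computation is a sketch of standard LDA reasoning, not code from the deck:

```python
# Transcribed from the slide: per-topic word probabilities and one
# document's topic proportions. The Bayes-rule step below sketches how a
# token's topic assignment is scored; it is not code from the deck.
topics = {
    "PGM": {"bayesian": 0.1, "markov": 0.09, "network": 0.07, "inference": 0.07},
    "ML": {"inference": 0.2, "posterior": 0.15, "regression": 0.1, "gradient": 0.09},
    "AI": {"markov": 0.09, "reinforcement": 0.08, "planning": 0.08},
    "DL": {"backpropagation": 0.15, "convolution": 0.1, "lstm": 0.09, "dropout": 0.07},
}
proportions = {"PGM": 0.7, "ML": 0.1, "AI": 0.15, "DL": 0.05}

# P(topic | word) is proportional to P(topic) * P(word | topic)
word = "markov"
scores = {t: proportions[t] * dist.get(word, 0.0) for t, dist in topics.items()}
total = sum(scores.values())
posterior = {t: s / total for t, s in scores.items()}
print(max(posterior, key=posterior.get))  # PGM: 0.7 * 0.09 dominates
```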

SLIDE 21

SLIDE 22

β = 1

SLIDE 23

β = 10

SLIDE 24

β = 100

SLIDE 25

β = 1

SLIDE 26

β = 0.1

SLIDE 27

β = 0.01
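The β sweep on slides 22-27 illustrates how the symmetric Dirichlet concentration controls sparsity: large β gives near-uniform distributions, small β piles the mass on a few entries. A minimal numpy sketch; the dimension (10) and sample count are arbitrary choices, not from the slides:

```python
# A minimal sketch of the sweep: samples from a symmetric Dirichlet get
# sparser as the concentration parameter shrinks. Dimension (10) and the
# number of samples are arbitrary choices, not from the slides.
import numpy as np

rng = np.random.default_rng(0)
avg_max = {}
for beta in (100.0, 1.0, 0.01):
    samples = rng.dirichlet([beta] * 10, size=500)
    # The largest entry of each sample measures how peaked it is:
    # near 1/10 for large beta, near 1.0 for tiny beta.
    avg_max[beta] = samples.max(axis=1).mean()
    print(beta, round(avg_max[beta], 3))
```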

SLIDE 28

q(γ, ι, A | x) = q(γ, ι, A, x) / ∫_{γ,ι} Σ_A q(γ, ι, A, x)
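The denominator of the posterior on this slide is the hard part: it sums (and integrates) the joint over every assignment vector A, which grows exponentially in the number of tokens. A toy sketch with a hypothetical unnormalized joint over N binary assignments, just to show the 2^N-term normalizer:

```python
# Toy sketch of why the denominator is intractable: it sums the joint over
# every assignment vector A (2**N terms here). The "joint" is a made-up
# unnormalized score, not the model on the slide.
from itertools import product

def joint(assignments):
    # Hypothetical unnormalized joint that favors assignments with many 1s.
    return 2.0 ** sum(assignments)

N = 10
Z = sum(joint(a) for a in product([0, 1], repeat=N))  # 2**N = 1024 terms
posterior_all_ones = joint((1,) * N) / Z
print(round(posterior_all_ones, 4))  # 2**10 / 3**10, about 0.0173
```

Doubling N squares the number of terms, which is why LDA resorts to variational inference or sampling instead of exact normalization.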

SLIDE 29

y_{1:O}, A_{1:N}

SLIDE 30

ξ

SLIDE 31

r(γ, A)

SLIDE 32

SLIDE 33

SLIDE 34

o(A_{1:O})

SLIDE 35

[Equation fragment involving ι and A_{-j}]

SLIDE 36

LDA

SLIDE 37

iPad (TYPE: Launch, DATE: Mar 7)
Steve Jobs (TYPE: Death, DATE: Oct 6)
Yelp (TYPE: IPO, DATE: March 2)

SLIDE 38

Claim: This is worth investigating

SLIDE 39

http://statuscalendar.com

  • [Prachi] Events shown as URL

SLIDE 40

[Nupur] Model Architecture
[Happy] Normalization?
[Shantanu, Surag] Error Accumulation
[Himanshu, Prachi] Reliance on POS tagger

SLIDE 41

Since the spread of the printing press: TimeBank, MUC & ACE competitions

  • Limited to narrow domains
  • Performance is still not great
SLIDE 42

Short
Easy to write (even on mobile devices)
Instantly and widely disseminated
Many irrelevant messages
Many redundant messages

SLIDE 43

`2m', `2ma', `2mar', `2mara', `2maro', `2marrow', `2mor', `2mora', `2moro', `2morow', `2morr', `2morro', `2morrow', `2moz', `2mr', `2mro', `2mrrw', `2mrw', `2mw', `tmmrw', `tmo', `tmoro', `tmorrow', `tmoz', `tmr', `tmro', `tmrow', `tmrrow', `tmrrw', `tmrw', `tmrww', `tmw', `tomaro', `tomarow', `tomarro', `tomarrow', `tomm', `tommarow', `tommarrow', `tommoro', `tommorow', `tommorrow', `tommorw', `tommrow', `tomo', `tomolo', `tomoro', `tomorow', `tomorro', `tomorrw', `tomoz', `tomrw', `tomz'

β€œThe Hobbit has FINALLY started filming! I cannot wait!” β€œwatchng american dad.”

SLIDE 44

SLIDE 45
  • Annotated 2400 tweets (about 34K tokens)
  • Train on in-domain data
SLIDE 46

[Chart: Precision, Recall, and F1 for Stanford NER vs. T-NER]

SLIDE 47

SLIDE 48

SLIDE 49

SLIDE 50

SLIDE 51

SLIDE 52

Sports, Politics, Product releases, …
Allow more customized calendars
Could be useful in upstream tasks

SLIDE 53

Might start talking about different things
Might want to focus on different groups of users

SLIDE 54

Generative probabilistic models
Discovers types which match the data
No need to annotate individual events
Don’t need to commit to a specific set of types
Modular; can integrate into various applications

SLIDE 55

Each event phrase is modeled as a mixture of types

Each event type is associated with a distribution over entities and dates
P(SPORTS | cheered) = 0.6, P(POLITICS | cheered) = 0.4

[Happy, Arindam, Akshay, Surag, Dinesh R] Liked
[Akshay] New entities?
[Anshul] Sensitive to parameters

SLIDE 56

1,000 iterations of burn-in
Parallelized sampling (approximation) using MPI [Newman et al. 2009]

[Happy, Nupur] Disliked manual annotation
[Anshul] ‘Legal’, ‘Food’ not event categories

SLIDE 57

SLIDE 58

Using types discovered by the topic model
Supervised classification using 10-fold cross validation
Treat event phrases like a bag of words

[Nupur] Multiple entity events?
[Nupur, Anshul] Very simple baseline
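The baseline described above can be sketched with scikit-learn: bag-of-words features, a linear classifier, and 10-fold cross validation. The two-type toy phrases below are invented stand-ins for the annotated event phrases, not data from the paper:

```python
# Sketch of the baseline: bag-of-words features, a linear classifier, and
# 10-fold cross validation. Phrases and labels are invented stand-ins.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

sports = ["game kickoff", "match goal", "season opener", "team win",
          "playoff game", "score goal", "coach win", "final match",
          "league game", "tournament final"]
politics = ["vote election", "debate vote", "senate bill", "campaign rally",
            "election poll", "president speech", "congress bill", "policy debate",
            "primary vote", "governor race"]
phrases = sports + politics
labels = ["SPORTS"] * len(sports) + ["POLITICS"] * len(politics)

model = make_pipeline(CountVectorizer(), LogisticRegression())
scores = cross_val_score(model, phrases, labels, cv=10)  # stratified 10-fold
print(round(scores.mean(), 2))
```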

SLIDE 59

SLIDE 60

What they ate for lunch
Entities such as McDonalds would be frequent on most days
Only show if entities appear more than expected

SLIDE 61

G² = Σ_{y∈{f,¬f}, z∈{e,¬e}} P_{y,z} × log(P_{y,z} / F_{y,z})

[Happy, Akshay, Shantanu, Nupur, Anshul, Rishab, Dinesh R] Liked
[Barun, Shantanu] Same event on multiple days?
[Rishab] Why not χ²?
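A sketch of the G² (log-likelihood ratio) statistic the slide's formula appears to define, under the assumption that P_{y,z} are observed counts and F_{y,z} are expected counts under independence; the 2×2 counts below are invented:

```python
# Hedged reconstruction of the G^2 (log-likelihood ratio) statistic over
# the 2x2 table {f, not-f} x {e, not-e}: observed counts O against expected
# counts E under independence. All counts below are invented.
import math

O = {("f", "e"): 30, ("f", "ne"): 70, ("nf", "e"): 120, ("nf", "ne"): 9780}

total = sum(O.values())
row = {r: sum(v for (rr, _), v in O.items() if rr == r) for r in ("f", "nf")}
col = {c: sum(v for (_, cc), v in O.items() if cc == c) for c in ("e", "ne")}

g2 = 2 * sum(
    o * math.log(o / (row[r] * col[c] / total))  # E[r,c] = row * col / total
    for (r, c), o in O.items()
)
print(round(g2, 2))  # large value: "f" and "e" co-occur far more than chance
```

A high G² for an (entity, date) pair means it appears together far more often than the marginals predict, which is the ranking signal for the calendar.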

SLIDE 62

[Akshay, Barun] Liked

SLIDE 63

No Named Entity Recognition
Rely on significance test to rank n-grams
A few extra heuristics (filter out temporal expressions, etc.)

End-to-end Evaluation

SLIDE 64