EVALUATION via Negativa INFORMATION RETRIEVAL

SLIDE 1

EVALUATION via Negativa

Mike Tian-Jian Jiang, Chen-Wei Shih, Chan-Hung Kuo, Richard Tzong-Han Tsai, and Wen-Lian Hsu

National Tsing Hua University, Academia Sinica, Taiwan

Chinese Word Segmentation (中文分詞)

INFORMATION RETRIEVAL

SLIDE 2

Fundamental Unit?

a meta-communication

SLIDE 3

What is a Word?

to linguistics

SLIDE 4

“... the smallest free form that may be uttered in isolation with semantic or pragmatic content (with literal or practical meaning) ...”

http://en.wikipedia.org/wiki/Word

SLIDE 5

“... the task of defining what constitutes a ‘word’ involves determining where one word ends and another word begins...”

http://en.wikipedia.org/wiki/Word#Word_boundaries

SLIDE 6

Word Boundary?

  • Phonology
  • Morphology
  • Orthography
  • Compound? Multi-word expression?
  • Multi-word vs. multiword vs. multi word
  • CJKV?
  • Multi-character expression?

SLIDE 7

What is a Word?

to computational linguistics

SLIDE 8

Standard de jure?

  • Academia Sinica Balanced Corpus
  • Chinese Treebank of the University of Pennsylvania

  • City University of Hong Kong
  • Microsoft Research Asia
  • Peking University

SLIDE 9

... then match standards

the higher the accuracy, the better the communication?

SLIDE 10

What is a Word?

to computational linguistics applications

SLIDE 11

e.g. Information Retrieval

SLIDE 12

Standard de facto?

  • Word n-gram
  • Character n-gram
  • Hybrid
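The three indexing choices above can be sketched on one of the deck's own examples, 西土耳其 (Western Turkey). This is a minimal illustration, not the authors' indexing pipeline: the function names and the segmentation `["西", "土耳其"]` are assumptions made for the example, and "hybrid" is taken here to mean the union of word unigrams and character bigrams.

```python
def char_ngrams(text, n=2):
    """Overlapping character n-grams; no word segmenter is required."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def word_ngrams(words, n=1):
    """N-grams over segmenter output; quality depends on the segmentation."""
    return ["".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def hybrid_terms(words, n=2):
    """Word unigrams plus character n-grams, duplicates removed (assumed definition)."""
    terms = list(words)
    for gram in char_ngrams("".join(words), n):
        if gram not in terms:
            terms.append(gram)
    return terms


# Hypothetical segmenter output for 西土耳其 (Western Turkey).
segmented = ["西", "土耳其"]

char_terms = char_ngrams("西土耳其", 2)
word_terms = word_ngrams(segmented, 1)
mixed_terms = hybrid_terms(segmented, 2)
```

With character bigrams the index never contains 土耳其 as a single term, while the word index contains it only if the segmenter produced it; the hybrid index carries both kinds of term.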

SLIDE 13

Monotonic or not?

Do better word segmentation (WS) results yield better IR outcomes?

SLIDE 14

Is it finite?

How to evaluate WS-to-application influence?

SLIDE 15

Via Negativa

“It describes God by saying what he is not, rather than what he is, because as finite beings we can not recognize God's attributes in any real and full sense and because God is beyond what our language can positively describe.”

http://www.blackwellreference.com/public/tocnode?id=g9781405106795_chunk_g978140510679515_ss1-58

http://www.blackmetal.com/scans0710/teratism-via-negativa.jpg

SLIDE 16

Binary Classification?

clinical trial?

SLIDE 17

Something about Evaluation

SLIDE 18

IR Evaluation

  • Data
    • TREC, NTCIR, etc.
  • Metrics
    • P@k, MRR, MAP, etc.
  • Doubts
    • Pooling bias
    • Score standardization
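For reference, the three metrics named above can be computed as follows. This is a small sketch using the standard textbook definitions; the document IDs, rankings, and relevance judgments are made-up toy data, not taken from TREC or NTCIR.

```python
def precision_at_k(ranked, relevant, k):
    """P@k: fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant document appears."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


# Toy runs: query -> (ranked list, set of relevant documents). All hypothetical.
runs = {
    "q1": (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    "q2": (["d5", "d6", "d7"], {"d6"}),
}

p_at_2 = precision_at_k(*runs["q1"], k=2)
mrr = sum(reciprocal_rank(r, rel) for r, rel in runs.values()) / len(runs)
mean_ap = sum(average_precision(r, rel) for r, rel in runs.values()) / len(runs)
```

MRR and MAP are simply the per-query reciprocal rank and average precision averaged over all queries.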

SLIDE 19

CWS Evaluation

  • Recall and precision counted by
    • Boundary
    • Token
    • Constituent
  • Similarity?
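The difference between boundary-based and token-based counting can be seen on a single string. The sketch below assumes a gold segmentation 西 / 土耳其 and a hypothetical system output 西 / 土 / 耳其; the helper names are illustrative, and constituent-based counting is not covered.

```python
def token_spans(words):
    """Each word as a (start, end) character span."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def boundaries(words):
    """Character offsets of the internal word boundaries."""
    cuts, pos = set(), 0
    for word in words[:-1]:
        pos += len(word)
        cuts.add(pos)
    return cuts


def precision_recall(gold, predicted):
    """Precision and recall of a predicted set against a gold set."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


gold = ["西", "土耳其"]        # assumed gold segmentation
system = ["西", "土", "耳其"]  # hypothetical system output

token_p, token_r = precision_recall(token_spans(gold), token_spans(system))
boundary_p, boundary_r = precision_recall(boundaries(gold), boundaries(system))
```

The same over-segmentation error costs more under token counting (one of three predicted tokens is correct) than under boundary counting (one of two predicted boundaries is correct), which is one reason scores from different counting schemes are not directly comparable.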

SLIDE 20

WS-to-IR

  • Peng et al. (2002)
    • WS: 44-70%, IR: ↗
    • WS: 70-77%, IR: ⤴
    • WS: 85-95%, IR: ⤵
  • He et al. (2002)
    • WS: ↗ (91-94%), IR: ⤴

SLIDE 21

Why Inconclusive?

  • WS accuracy ranges?
  • WS/IR evaluation metrics?
  • Query length?
  • Term types?

SLIDE 22

Term Type

  • Kwok (2002)
    • Insensitive: stop-words; frequent non-content-bearing terms
    • Monotonic: content-bearing terms
    • Non-monotonic:
      • 西土耳其 (Western Turkey)
  • Semantic, syntactic, or surface?
    • 农 (agricultural) / 作物 (plants)
    • 旱 (drought) / 灾 (disaster) vs. 春旱 (spring drought) vs. 旱区 (area of drought disaster)
  • Recall or precision?
    • 火 (fire) / 山 (mountain) vs. 火山 (volcano)

SLIDE 23

Surface Pattern

  • Ambiguity
    • Combinatorial
      • 西土耳其、农作物、旱灾、春旱、旱区、火山, etc.
    • Overlapping
      • 施政 (practice policy) / 伟 (great) vs. 施 (Shih) / 政伟 (Zheng-Wei)
  • Which is more harmful?

http://www.definicionabc.com/general/gestalt-psicologia.php
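The two ambiguity types can be told apart mechanically once a lexicon is fixed. The sketch below uses the common definitions (combinatorial: a lexicon word that also splits into two lexicon words; overlapping: two lexicon words that share characters without one containing the other). The toy lexicon and function names are assumptions for illustration only.

```python
def combinatorial_splits(text, lexicon):
    """Ways to split a lexicon word into two shorter lexicon words."""
    if text not in lexicon:
        return []
    return [
        (text[:i], text[i:])
        for i in range(1, len(text))
        if text[:i] in lexicon and text[i:] in lexicon
    ]


def overlapping_pairs(text, lexicon):
    """Pairs of lexicon words in text that partially overlap each other."""
    found = [
        (i, j)
        for i in range(len(text))
        for j in range(i + 1, len(text) + 1)
        if text[i:j] in lexicon
    ]
    return [
        (text[a:b], text[c:d])
        for (a, b) in found
        for (c, d) in found
        if a < c < b < d  # second word starts inside the first and ends after it
    ]


# Hypothetical toy lexicon built from the examples on this slide.
lexicon = {"火", "山", "火山", "施", "施政", "政伟", "伟"}

volcano_splits = combinatorial_splits("火山", lexicon)
name_overlaps = overlapping_pairs("施政伟", lexicon)
```

Here 火山 (volcano) is flagged as combinatorial because 火 and 山 are words in their own right, while 施政伟 is flagged as overlapping because 施政 and 政伟 compete for the shared character 政.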

SLIDE 24

SLIDE 25

Is it finite?

How to evaluate WS-to-IR influence?

SLIDE 26

IR Is Rallying

  • Indexing models
  • Retrieval models
  • Data collections
  • Evaluation metrics

SLIDE 27

Tractable Simulation?

http://imgs.xkcd.com/store/glen_shirts/g_try_science_shirt_2.jpg

SLIDE 28

Balanced

NTCIR (long) and Sogou (short) query collections

SLIDE 29

Pragmatical WS

accuracy-controlled systems on different standards

CRF trained on 1, 1/2, 1/4, ..., 1/16384 of the Bakeoff 2005 data

http://scifun.files.wordpress.com/2010/07/1278929569066.jpg
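One way to obtain such a series of training sets is repeated halving, since 1/16384 = 1/2^14 gives 15 sizes in total. The slides do not say how the subsets were drawn, so the sketch below is only an assumption: it shuffles once with a fixed seed and takes nested prefixes, and `halving_subsets` and the dummy corpus are hypothetical names and data.

```python
import random


def halving_subsets(sentences, steps=14, seed=0):
    """Return {k: subset} where subset k holds about 1/2**k of the sentences.

    Assumption: subsets are nested prefixes of one fixed shuffle, so each
    smaller training set is contained in every larger one.
    """
    rng = random.Random(seed)
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    subsets = {}
    for k in range(steps + 1):
        size = max(1, len(shuffled) // (2 ** k))
        subsets[k] = shuffled[:size]
    return subsets


# Dummy corpus standing in for segmented training sentences.
corpus = [f"sentence-{i}" for i in range(32768)]
subsets = halving_subsets(corpus)
sizes = [len(subsets[k]) for k in sorted(subsets)]
```

Each subset would then be used to train a separate segmenter, yielding systems whose accuracy is controlled by the amount of training data rather than by the model.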

SLIDE 30

SLIDE 31

Popularity

similarity (MAP) to a black box’s preference (top-100)

SLIDE 32

SLIDE 33

Correlation≠Causation

TNR and NPV may imply something

http://imgs.xkcd.com/store/imgs/correlation_shirt_300.png
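For readers less familiar with the two measures, TNR (true negative rate, or specificity) and NPV (negative predictive value) are both built on the true-negative cell of a binary confusion matrix. The sketch below uses the standard definitions; the counts are invented, and the slides do not specify what is treated as the positive or negative class, so no mapping to WS or IR outcomes is implied.

```python
def tnr_npv(tp, fp, tn, fn):
    """True negative rate and negative predictive value from confusion counts.

    TNR = TN / (TN + FP): share of actual negatives that were called negative.
    NPV = TN / (TN + FN): share of negative calls that were actually negative.
    """
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    npv = tn / (tn + fn) if (tn + fn) else 0.0
    return tnr, npv


# Invented counts, for illustration only.
tnr, npv = tnr_npv(tp=40, fp=10, tn=30, fn=20)
```

Unlike precision and recall, neither measure uses the true-positive count, which fits the deck's idea of evaluating by what is not the case.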

SLIDE 34

Discussion

  • 上海滩 (the Bund of Shanghai)
    • MSR: 上海滩,上海 / 滩,上 / 海 / 滩
    • PKU: 上海滩,上海 / 滩,上 / 海滩
  • May be caused by...
    • Standard differences?
    • Lexicon disappearances?

SLIDE 35

Concerns

  • Accuracy-controlled WS systems other than CRF?
  • The same training data, different standards?
  • Conventional/comparative IR experiments?
    • Lucene? Lemur/Indri?
    • TREC and NTCIR?
  • Silver standards?
  • Relaxation of negative patterns?
  • Graphical or n-best list output of WS?
  • Oracle precision, recall, TNR, NPV, etc.?
  • Applications other than IR?
  • Out-of-vocabulary?

SLIDE 36

<(_ _)>
