slide-1
SLIDE 1

Overview of the Recognizing Inference in TExt (RITE-2) at NTCIR-10

Yotaro Watanabe (Tohoku University), Yusuke Miyao (NII), Junta Mizuno (Tohoku University), Tomohide Shibata (Kyoto University), Hideki Shima (CMU), Cheng-Wei Lee (Academia Sinica), Chuan-Jie Lin (National Taiwan Ocean University), Teruko Mitamura (CMU), Hiroshi Kanayama (IBM Research), Kohichi Takeda (IBM Research), Shuming Shi (MSRA), Noriko Kando (NII)

RITE-2: Recognizing Inference in TExt @ NTCIR-10

slide-2
SLIDE 2

Overview of RITE-2

  • RITE-2 is a generic benchmark task that addresses the common semantic inference required by various NLP/IA applications

The 10th NTCIR Conference

t1: The Kamakura Shogunate was considered to have begun in 1192, but the current leading theory is that it was effectively formed in 1185.
t2: The Kamakura Shogunate began in Japan in the 12th century.

Can t2 be inferred from t1? (entailment?)

slide-3
SLIDE 3

Motivation

  • Natural Language Processing (NLP) / Information Access (IA) applications

Ø Question Answering, Information Retrieval, Information Extraction, Text Summarization, Automatic Evaluation for Machine Translation, Complex Question Answering

  • Current entailment recognition systems are not yet mature enough

Ø The highest accuracy on the Japanese BC subtask in NTCIR-9 RITE was only 58%
Ø There is still ample room to advance entailment recognition technologies by addressing the task


slide-4
SLIDE 4

RITE vs. RITE-2

[Figure: "Pyramid of entailment recognition technology". From foundation-oriented to application-oriented layers: linguistic phenomena-level inference (negation, modification, coordination, quantification, case alternation, phrase relations, lexical relations; UnitTest), sentence-level inference (BC: entailment?; MC: paraphrase? entailment? contradiction?), multiple sentence-level inference over documents with world knowledge (Search), and applications (QA, IR, bio apps). RITE covered the foundation-oriented layers; RITE-2 extends toward the application-oriented ones.]


slide-5
SLIDE 5

RITE-2 Subtasks


slide-6
SLIDE 6

BC and MC subtasks

  • BC subtask

Ø Entailment (t1 entails t2) or Non-Entailment (otherwise)

  • MC subtask

Ø Bi-directional Entailment (t1 entails t2 & t2 entails t1) Ø Forward Entailment (t1 entails t2 & t2 does not entail t1) Ø Contradiction (t1 contradicts t2 or cannot be true at the same time) Ø Independence (otherwise)


t1: The Kamakura Shogunate was considered to have begun in 1192, but the current leading theory is that it was effectively formed in 1185.
t2: The Kamakura Shogunate began in Japan in the 12th century.

BC: Y or N
MC: B, F, C or I
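As an illustration of how the two subtasks relate (the mapping below is my reading of the label definitions above, not an official tool), an MC label determines the corresponding BC label, since both B and F mean that t1 entails t2:

```python
# Hypothetical helper: collapse a 4-way MC label into the binary BC label.
# B (bi-directional) and F (forward) both mean t1 entails t2 -> Y;
# C (contradiction) and I (independence) mean non-entailment -> N.
MC_TO_BC = {"B": "Y", "F": "Y", "C": "N", "I": "N"}

def bc_label_from_mc(mc_label: str) -> str:
    return MC_TO_BC[mc_label]
```

The reverse direction does not hold: a BC label of Y cannot tell B and F apart.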

slide-7
SLIDE 7

Development of BC and MC data


  • Retrieve pairs of sentences
  • Edit the pairs if needed
  • For each example, 5 annotators assigned a semantic label; an example was accepted only if 4 or more annotators assigned it the same label

RITE-2 BC, MC data
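The acceptance rule above can be sketched as follows (a minimal illustration; function and variable names are mine):

```python
from collections import Counter

def accept_example(labels, min_agree=4):
    """Apply the RITE-2 BC/MC filtering rule: keep an example only if
    at least `min_agree` of its annotators assigned the same label.
    Returns the agreed label, or None if the example is discarded."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_agree else None
```

A 4-of-5 majority is accepted, while a 3-2 split is discarded; this strict filtering is what slide 24 credits for part of the accuracy jump over NTCIR-9.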

slide-8
SLIDE 8

Entrance Exam subtasks (Japanese only)

The 10th NTCIR Conference

8

Entrance exam problem:

t1: スレイマン1世は数多くの軍事的成功を収めてオスマン帝国を最盛期に導いた。(Suleiman I achieved many military successes and led the Ottoman Empire to its peak.)
t2: オスマン帝国ではスレイマン1世の時代が最盛期であった。(The Ottoman Empire's peak was during the reign of Suleiman I.)

Source: National Center Test for University Admission (Daigaku Nyushi Center Shiken)

slide-9
SLIDE 9

Entrance Exam subtask: BC and Search

  • Entrance Exam BC

Ø Binary-classification problem ( Entailment or Non- entailment ) Ø t1 and t2 are given

  • Entrance Exam Search

Ø Binary-classification problem ( Entailment or Non- entailment ) Ø t2 and a set of documents are given

v Systems are required to search sentences in Wikipedia and textbooks to decide semantic labels


slide-10
SLIDE 10

UnitTest (Japanese only)

  • Motivation

Ø Evaluate how systems can handle linguistic phenomena that affects entailment relations

  • Task definition

Ø Binary classification problem (same as BC subtask)


Category: modifier
t1: In the Meiji Constitution, a legally clear distinction between the Imperial Family and Japan had been allowed.
t2: In the Meiji Constitution, a distinction between the Imperial Family and Japan had been allowed.

Category: meronymy
t1: In the Meiji Constitution, a distinction between the Imperial Family and Japan had been allowed.
t2: In the Meiji Constitution, a distinction between the Emperor and Japan had been allowed.

slide-11
SLIDE 11


Development of the UnitTest data

  • Procedure

Ø Sentence pairs {<t1, t2>} were sampled from the BC subtask data Ø An annotator transformed each sampled sentence pair from t1 to t2 by breaking down the pair in a set of linguistic phenomena

  • [Kaneko+ 13] (to appear in ACL 2013)


[Diagram: BC subtask data → sampling → sampled sentence pairs → break down → UnitTest data (e.g., pair 1: <t1, t2> is decomposed into 1.1: <t1, t2>, 1.2: <t1, t2>, …)]

slide-12
SLIDE 12

Distribution of the linguistic phenomena in UnitTest data


Phenomenon             dev  test
lexical: synonymy       10    10
lexical: hypernymy       6     3
lexical: meronymy        1     1
lexical: entailment      1     -
phrase: synonymy        45    35
phrase: hypernymy        3     -
phrase: entailment      28    45
case alternation         9     7
modifier                30    42
nominalization           2     1
coreference             12     4
clause                  29    14
relative clause         10     8
transparent head         2     1
list                    11     3
quantity                 1     -
scrambling              16    15
inference                4     2
implicit relation       10    18
apposition               3     1
temporal                 2     1
spatial                  4     1
disagree: lexical        5     2
disagree: phrase        25    25
disagree: modality       2     1
disagree: spatial        1     1
disagree: temporal       -     1
Total                  272   241

slide-13
SLIDE 13

RITE4QA (Chinese only)

  • Motivation

Ø Can an entailment recognition system rank a set of unordered answer candidates in QA?

  • Dataset

Ø Developed from NTCIR-7 and NTCIR-8 CLQA data

v t1: answer-candidate-bearing sentence v t2: a question in an affirmative form

  • Requirements

Ø Generate confidence scores for ranking process


slide-14
SLIDE 14

Evaluation Metrics

  • Macro F1 and Accuracy (BC, MC, ExamBC, ExamSearch and UnitTest)

  • Correct Answer Ratio (Entrance Exam)

Ø Y/N labels are mapped into answer selections, and the accuracy of those answers is calculated

  • Top1 and MRR (RITE4QA)


$$\mathrm{MacroF1} = \frac{1}{|C|} \sum_{c \in C} \mathrm{F1}_c$$

$$\mathrm{Top1} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} [\text{top answer for } q_i \text{ is correct}]$$

$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}$$

$$\mathrm{Accuracy} = 100 \times \frac{N_{\mathrm{correct}}}{N_{\mathrm{examples}}}$$
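Under the definitions above, the metrics can be implemented directly (a sketch; the official rite2eval tool, not this code, defines the exact behavior, e.g. tie handling):

```python
def macro_f1(f1_per_class):
    # Unweighted mean of per-class F1 scores.
    return sum(f1_per_class) / len(f1_per_class)

def top1(ranks):
    # ranks[i] is the rank of the correct answer for question i (1 = top).
    return sum(1 for r in ranks if r == 1) / len(ranks)

def mrr(ranks):
    # Mean reciprocal rank of the correct answer.
    return sum(1.0 / r for r in ranks) / len(ranks)

def accuracy(n_correct, n_examples):
    return 100.0 * n_correct / n_examples
```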

slide-15
SLIDE 15

Organization Effort

slide-16
SLIDE 16
  • We provided pre-processed data and tools to lower barriers to entry

Generic Framework


[Diagram: sentences/documents → Linguistic Analyzer → Entailment Recognizer → Evaluator → evaluation results (accuracies, F1-values, …)]

(1) Provided pre-processed data
(2) Provided a fundamental entailment recognition tool
(3) Provided RITE-2 evaluators

slide-17
SLIDE 17
  • Morphological and syntactic analysis

Ø MeCab [Kudo+ 05] + CaboCha [Kudo+ 02] Ø Juman + KNP Ø Provided as XML data

  • Search results for the Exam Search subtask

Ø Used TSUBAKI [Shinzato+ 11] to provide search results Ø Provided at most five search results extracted from Wikipedia and textbooks


(1) Pre-processed data

slide-18
SLIDE 18

(2) A fundamental entailment recognition tool (Baseline tool)

  • Features

Ø a machine learning-based entailment recognition system Ø simple features are implemented (Feature Extractor)

v Bag-of- {content words, aligned chunks, head words} v Ratio of aligned {content words, aligned chunks}

Ø new features can be easily added Ø outputs files compatible with the format of the RITE-2 formal run


[Diagram: Baseline Tool — instance (XML) → Feature Extractor → Machine Learning → relation (Y or N / B, F, C or I)]
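A toy version of the overlap features (illustrative only; the actual Feature Extractor works on aligned chunks and head words from the parsed XML, which is omitted here):

```python
def overlap_features(t1_words, t2_words):
    """Bag-of-content-words overlap and the ratio of t2 covered by t1,
    the simplest of the baseline tool's feature types."""
    t1, t2 = set(t1_words), set(t2_words)
    aligned = t1 & t2
    return {
        "overlap_count": len(aligned),
        "coverage_ratio": len(aligned) / len(t2) if t2 else 0.0,
    }
```

Feature dictionaries like this would then be fed to a standard classifier to predict Y/N or B/F/C/I.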

slide-19
SLIDE 19

(3) RITE-2 Evaluators

  • Generic Evaluator (all of the subtasks)
  • Additional Evaluator (Entrance Exam)

Ø Calculate correct answer ratio


$ java -jar rite2eval.jar -g RITE2_JA_test_bc.xml -s output_bc.txt

  • Example output:

Label    #    Precision          Recall             F1
N      354    60.18 (204/339)    57.63 (204/354)    58.87
Y      256    44.65 (121/271)    47.27 (121/256)    45.92

Accuracy: 53.28 (325/610)
Macro F1: 52.40

Confusion Matrix
gold \ sys     N     Y
N            204   150
Y            135   121
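The numbers in the sample output can be reproduced from the confusion matrix alone (a sketch, not the evaluator's actual code):

```python
def scores_from_confusion(matrix, labels):
    """Per-label precision/recall/F1 plus accuracy and macro F1 from a
    gold-by-system confusion matrix (rows = gold, columns = system)."""
    n = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(labels)))
    result = {"accuracy": 100.0 * correct / n}
    f1s = []
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        gold_total = sum(matrix[i])                  # row sum
        sys_total = sum(row[i] for row in matrix)    # column sum
        p = tp / sys_total if sys_total else 0.0
        r = tp / gold_total if gold_total else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        result[label] = {"P": 100 * p, "R": 100 * r, "F1": 100 * f1}
        f1s.append(100 * f1)
    result["macro_f1"] = sum(f1s) / len(f1s)
    return result

# Confusion matrix from the sample run above (gold rows N, Y):
scores = scores_from_confusion([[204, 150], [135, 121]], ["N", "Y"])
```

This recovers Accuracy 53.28, Macro F1 52.40, and the per-label F1 values shown above.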

slide-20
SLIDE 20

RITE-2 Formal Run Participation

slide-21
SLIDE 21

Number of submissions


NTCIR-10 RITE-2    JA   CT   CS   Total
BC                 41   20   21     82
MC                 20   21   21     62
Exam BC            31    -    -     31
Exam Search         4    -    -      4
UnitTest           14    -    -     14
RITE4QA             -   12   10     22
Total             110   53   52    215

NTCIR-9 RITE       JA   CT   CS   Total
Total              65   70   77    212

slide-22
SLIDE 22

Countries/Regions of Participants


Japan: 15 groups
Taiwan: 8 groups
China: 3 groups
India: 1 group
Ireland: 1 group

slide-23
SLIDE 23

Formal Run Results

slide-24
SLIDE 24

BC ( Japanese)


[Chart: Macro F1 and Accuracy for all BC (Japanese) runs. Best system: Macro F1 80.49, Accuracy 81.64]

  • The best system achieved over 80% accuracy (the highest score in the BC subtask at NTCIR-9 RITE was 58%)

  • The difference is caused by

Ø Advancement of entailment recognition technologies
Ø Strict data filtering during data development


slide-25
SLIDE 25

BC (Traditional/Simplified Chinese)


[Charts: Macro F1 and Accuracy for BC runs in Traditional and Simplified Chinese. Top scores: Macro F1 67.14 / Accuracy 67.76 and Macro F1 73.84 / Accuracy 74.65]

  • The top scores are almost the same as those in NTCIR-9 RITE

slide-26
SLIDE 26

MC (Japanese)

[Chart: Macro F1 and Accuracy for all MC (Japanese) runs. Best system: Macro F1 59.96, Accuracy 69.53]

  • The top system achieved approx. 70% accuracy (the highest accuracy in NTCIR-9 RITE was only 51%)

slide-27
SLIDE 27

MC (Japanese, F1 for each label)

[Chart: per-label F1 for all MC (Japanese) runs. Best F1: Forward Entailment 76.47, Bi-directional Entailment 69.29, Contradiction 28.57]

Difficulty: Contradiction >>> Bi-directional > Forward Ent.

slide-28
SLIDE 28

MC (Traditional/Simplified Chinese)

[Charts: Macro F1 and Accuracy for MC runs in Traditional and Simplified Chinese. Best TC: Macro F1 46.32, Accuracy 51.99; best SC: Macro F1 56.82, Accuracy 61.08]

  • The top system in TC achieved approx. 52% of accuracy
  • The top system in SC achieved over 60% of accuracy
slide-29
SLIDE 29

Exam BC (Japanese)

[Chart: Macro F1, Accuracy and Correct Answer Ratio for Exam BC (Japanese) runs. Best: Macro F1 67.15, Accuracy 70.31; highest Correct Answer Ratio: 57.41]

  • If candidate sentences from the knowledge sources (Wikipedia and textbooks) are already obtained, the best system can answer more than 57% of exam questions correctly

slide-30
SLIDE 30

Exam Search

  • The best system could answer 34% of questions correctly in the search task setting


[Chart: Macro F1, Accuracy and Correct Answer Ratio for Exam Search runs. Best: Macro F1 58.12, Accuracy 64.51; highest Correct Answer Ratio: 34.26]

slide-31
SLIDE 31

UnitTest

[Chart: Macro F1, Accuracy, Y-F1 and N-F1 for UnitTest runs. Best: Macro F1 77.77, Accuracy 90.87, N-F1 60.71]

  • Since almost all of the examples are Y (Y: 219, N: 29), improving performance at detecting "N" is important

  • Due to limited space, per-category performance cannot be shown here

slide-32
SLIDE 32

RITE4QA (Traditional/Simplified Chinese)

[Charts: Top1 accuracy and MRR for RITE4QA runs in Traditional and Simplified Chinese. Top scores: Top1 27.33 / MRR 34.57 and Top1 28.00 / MRR 33.77]

slide-33
SLIDE 33

Review of Participants’ Systems

slide-34
SLIDE 34

Participants' approaches

  • Category

Ø Statistical (50%) Ø Hybrid (27%) Ø Rule-based (23%)

  • Fundamental approach

Ø Overlap-based (77%) Ø Alignment-based (63%) Ø Transformation-based (23%)


slide-35
SLIDE 35

Summary of types of information explored

Ø Character/word overlap (85%) Ø Syntactic information (67%) Ø Temporal/numerical information (63%) Ø Named entity information (56%) Ø Predicate-argument structure (44%) Ø Entailment relations (30%) Ø Polarity information (7%) Ø Modality information (4%)


slide-36
SLIDE 36

Summary of Resources Explored

  • Japanese

Ø Wikipedia (10) Ø Japanese WordNet (9) Ø ALAGIN Entailment DB (5) Ø Nihongo Goi-Taikei (2) Ø Bunruigoihyo (2) Ø Iwanami Dictionary (2)

  • Chinese

Ø Chinese WordNet (3) Ø TongYiCi CiLin (3) Ø HowNet (2)


slide-37
SLIDE 37

Advanced approaches

  • Logical approaches

Ø Dependency-based Compositional Semantics (DCS) [BnO], Markov Logic [EHIME], Natural Logic [THK]

  • Alignment

Ø GIZA [CYUT], ILP [FLL], Labeled Alignment [bcNLP, THK]

  • Search Engine

Ø Google and Yahoo [DCUMT]

  • Deep Learning

Ø RNN language models [DCUMT]

  • Probabilistic Models

Ø N-gram HMM [DCUMT], LDA [FLL]

  • Machine Translation

Ø [ JUNLP, JAIST, KC99]


slide-38
SLIDE 38

Oral Presentations (6/20 13:00-)

  • [DCUMT] Tsuyoshi Okita. Local Graph Matching with Active Learning for Recognizing Inference in Text at NTCIR-10.
  • [SKL] Shohei Hattori and Satoshi Sato. Team SKL's Strategy and Experience in RITE2.
  • [BnO] Ran Tian, Yusuke Miyao, Takuya Matsuzaki and Hiroyoshi Komatsu. BnO at NTCIR-10 RITE: A Strong Shallow Approach and an Inference-based Textual Entailment Recognition System.
  • [FLL] Takuya Makino, Seiji Okajima and Tomoya Iwakura. FLL: Local Alignments based Approach for NTCIR-10 RITE-2.
  • [KDR] Daniel Andrade, Masaaki Tsuchida, Takashi Onishi and Kai Ishikawa. Detecting Contradiction in Text by Using Lexical Mismatch and Structural Similarity.
  • [NTTD] Megumi Ohki, Takashi Suenaga, Daisuke Satoh, Yuji Nomura and Toru Takaki. Expanded Dependency Structure based Textual Entailment Recognition System of NTTDATA for NTCIR10-RITE2.
  • [IASL] Cheng-Wei Shih, Chad Liu, Cheng-Wei Lee and Wen-Lian Hsu. IASL RITE System at NTCIR-10.
  • [WHUTE] Han Ren, Hongmiao Wu, Chen Lv, Donghong Ji and Jing Wan. The WHUTE System in NTCIR-10 RITE Task.
  • [bcNLP] Xiao-Lin Wang, Hai Zhao and Bao-Liang Lu. BCMI-NLP Labeled-Alignment-Based Entailment System for NTCIR-10 RITE-2 Task.
  • [IMTKU] Chun Tu, Min-Yuh Day, Shih-Jhen Huang, Hou-Cheng Vong and Sih-Wei Wu. IMTKU Textual Entailment System for Recognizing Inference in Text at NTCIR-10 RITE2.


slide-39
SLIDE 39

Conclusion

  • NTCIR-10 RITE-2

Ø Benchmark task of evaluating systems that infer semantic relations between sentences Ø Two subtasks were added

v Exam Search: provided more realistic task setting v UnitTest: enabled us fine-grained evaluation and analysis of RITE systems

Ø Organization Efforts

v Provided pre-processed data (XML), Baseline tool and Evaluation tools

Ø 28 teams participated! (NTCIR-9 RITE: 24 teams) Ø Diverse advanced approaches and resources were explored


RITE-2 was a success!