NTCIR-7 Almost-Unsupervised Cross-Language Opinion Analysis NLCL - - PowerPoint PPT Presentation



SLIDE 1

NTCIR-7

Almost-Unsupervised Cross-Language Opinion Analysis

NLCL group Taras Zagibalov* T.Zagibalov@sussex.ac.uk John Carroll J.A.Carroll@sussex.ac.uk

Department of Informatics University of Sussex

* supported by the Ford Foundation International Fellowships Program.

SLIDE 2

19/12/2008 NLCL 2

Overview

  • Introduction
  • Tasks
  • Our Approach
  • Lexical Item Extraction
  • Relevance Classification
  • Subjectivity Classification
  • Results
  • Error Analysis and Conclusion
SLIDE 3

Introduction

  • Our main focus is the portability of natural language processing systems across languages
  • Our basic approach is almost unsupervised

SLIDE 4

Tasks

  • Japanese
  • English
  • Simplified Chinese
  • Traditional Chinese
SLIDE 5

Tasks

  • Relevance Classification
  • Subjectivity Classification
  • Opinion Classification
  • Target Detection
  • Opinion Holder Detection
SLIDE 6

Our Approach

  • Lexical Item Extraction
  • Relevance Classification
  • Subjectivity Classification
SLIDE 7

Lexical Item Extraction

Lexical Item (LI) extraction problems:

  • Word boundary detection in Chinese and Japanese
  • Idioms and collocations
SLIDE 8

Lexical Item Extraction

LI extraction technique used:

  • Any sequence of characters that occurs at least three times is a candidate LI
  • If the frequency of an LI is the same as that of a shorter sub-unit, the shorter one is deleted

SLIDE 9

Lexical Item Extraction

LI extraction technique used:

  • Any sequence of characters that occurs at least three times is a candidate LI
  • If the frequency of an LI is the same as that of a shorter sub-unit, the shorter one is deleted

LI candidate | Frequency | Length | Kept?
美国司法 | 31 | 4 | √
美国司 | 31 | 3 | X
司 | 519 | 1 |
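The two extraction rules can be sketched in Python. This is an illustrative reconstruction, not the authors' code; `min_count`, `max_len` and the toy string are assumptions.

```python
from collections import Counter

def extract_lexical_items(text, min_count=3, max_len=6):
    """Rule 1: every character n-gram (up to max_len) occurring at
    least min_count times is a candidate lexical item."""
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    candidates = {li: c for li, c in counts.items() if c >= min_count}
    # Rule 2: if a shorter sub-unit has the same frequency as a longer
    # candidate that contains it, delete the shorter one.
    kept = set(candidates)
    for shorter in candidates:
        for longer in candidates:
            if shorter != longer and shorter in longer \
                    and candidates[shorter] == candidates[longer]:
                kept.discard(shorter)
                break
    return {li: candidates[li] for li in kept}

# Toy example: "ab", "bc", "a", etc. always occur inside "abc" with the
# same frequency, so only "abc" survives.
items = extract_lexical_items("abcxabcyabcz", min_count=3, max_len=3)
```

This mirrors the 美国司法 / 美国司 example above: the shorter string with the same frequency is pruned, while an independent unit like 司 (frequency 519) is kept.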

SLIDE 10

Relevance Classification

  • All LIs are ranked according to their frequency in each document
  • LI frequency ranks are compared across all the documents
  • LIs with the biggest rank differences are selected as relevance indicators

SLIDE 11

Relevance Classification

  • All LIs are ranked according to their frequency in each document
  • LI frequency ranks are compared across all the documents
  • LIs with the biggest rank differences are selected as relevance indicators

LI | Topic 1 rank | Topic 2 rank | Difference |
the | 2 | 3 | 1 | X
netscape | 10 | 10 | 0 |
law | 24 | 6 | 18 |
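The rank-difference idea can be sketched as follows, assuming per-topic frequency dictionaries; the `min_diff` threshold and the treatment of items unseen in one topic are illustrative assumptions, not the actual settings.

```python
def rank_items(freqs):
    """Rank lexical items by frequency within a topic: rank 1 = most frequent."""
    ordered = sorted(freqs, key=freqs.get, reverse=True)
    return {li: rank for rank, li in enumerate(ordered, start=1)}

def relevance_indicators(topic1_freqs, topic2_freqs, min_diff=10):
    """Select LIs whose frequency rank differs most between two topics.
    Items unseen in one topic are ranked last (an assumption)."""
    r1, r2 = rank_items(topic1_freqs), rank_items(topic2_freqs)
    last = max(len(r1), len(r2)) + 1
    return {li for li in set(r1) | set(r2)
            if abs(r1.get(li, last) - r2.get(li, last)) >= min_diff}
```

As in the table above, a function word like "the" ranks near the top in every topic (small difference, rejected), while a topic word like "law" shows a large rank difference and is selected.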

SLIDE 12

Relevance Classification

Example:

Topic: 'What is the relationship between AOL and Netscape?' (N11)

Relevance indicators: america online, appliances, designed, dominant, link, maker, netscape, online, operating, services, start-ups, sun, technological change, they have, windows
SLIDE 13

Subjectivity Classification

  • For each LI we found immediate neighbours:

第五次缔约方大会的中国代表团

SLIDE 14

Subjectivity Classification

  • For each LI we found immediate neighbours:

第五次缔约方大会的中国代表团 → 中国 : 的_0, 大会的_0, 代表团_1

SLIDE 15

Subjectivity Classification

  • For each neighbour word we calculated a chi-square (χ²) score
  • LIs with χ² > 3.84 were included in the list
  • All such words were ranked according to their score
  • The lists of every two headwords were compared to find how many context words they shared
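The χ² score for a headword–neighbour pair can be sketched from a 2×2 contingency table; the table formulation is an assumption, since the slides give only the 3.84 critical value (p < 0.05, one degree of freedom).

```python
def chi_square(n11, n12, n21, n22):
    """Chi-square for a 2x2 contingency table:
    n11 = headword and neighbour co-occur, n12 = headword without neighbour,
    n21 = neighbour without headword, n22 = neither occurs."""
    n = n11 + n12 + n21 + n22
    num = n * (n11 * n22 - n12 * n21) ** 2
    den = (n11 + n12) * (n21 + n22) * (n11 + n21) * (n12 + n22)
    return num / den if den else 0.0

CRITICAL = 3.84  # chi-square critical value at p < 0.05, 1 degree of freedom

def is_associated(n11, n12, n21, n22):
    """Keep a neighbour only if its score exceeds the critical value."""
    return chi_square(n11, n12, n21, n22) > CRITICAL
```

An independent 2×2 table scores 0; a strongly skewed one clears the 3.84 threshold.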

SLIDE 16

Subjectivity Classification

  • Syntactic and semantic relations separated:

跟 中国 经济 的 快速 / 对 美国 经济 的 信心

Syntactic relations: 跟 + 中国; 中国 + 经济; 美国 + 经济; 经济 + 的
Semantic relations: 中国 + 美国

SLIDE 17

Subjectivity Classification

Headword | Context words
中国 | 经济, 跟
美国 | 经济, 对
经济 | 中国, 的
的 | 经济, 快速

  • Good pairs: 中国 + 美国
  • Bad pairs: 中国 + 经济; 美国 + 经济; 经济 + 的

SLIDE 18

Subjectivity Classification

  • Syntactic and semantic relations separated:

there are good years and bad years / stable and good conditions

Syntactic relations: are + good; good + years; and + bad; and + good
Semantic relations: good + bad

SLIDE 19

Subjectivity Classification

Headword | Context words
good | and, years
bad | and, years
and | bad, good
years | bad, and

  • Good pairs: good + bad
  • Bad pairs: and + bad; and + good; and + years; years + bad; good + years
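The shared-context comparison behind good vs. bad pairs can be sketched as a simple overlap measure; the exact measure used is not stated on the slides, so the Jaccard-style overlap here is an assumption.

```python
def context_overlap(ctx_a, ctx_b):
    """Fraction of context words two headwords share (Jaccard overlap;
    the actual measure used is an assumption)."""
    a, b = set(ctx_a), set(ctx_b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Context lists from the English example above.
contexts = {
    "good":  ["and", "years"],
    "bad":   ["and", "years"],
    "and":   ["bad", "good"],
    "years": ["bad", "and"],
}
```

"good" and "bad" share all of their context words, so they form a good (semantic) pair; "good" and "and" share none, so the pair is rejected.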

SLIDE 20

Subjectivity Classification

Filtering the paired headwords:

  • Filter 1: exclude all pairs with too small an association score (score less than x̄ − 1.96σ)
  • Filter 2: delete all words that occur in too many pairs (LIs that occur in more than x̄ + 1.96σ pairs)
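A minimal sketch of the two filters, reading both thresholds as mean ± 1.96σ over the score and pair-count distributions; the slides show only "1.96σ", so the mean-relative reading, the sample data and the use of the sample standard deviation are assumptions.

```python
from statistics import mean, stdev

def filter_low_scores(pair_scores):
    """Filter 1: exclude pairs whose association score falls below
    mean - 1.96 * sigma of all pair scores."""
    scores = list(pair_scores.values())
    cutoff = mean(scores) - 1.96 * stdev(scores)
    return {pair: s for pair, s in pair_scores.items() if s >= cutoff}

def filter_frequent_words(pairs):
    """Filter 2: find words that occur in more than
    mean + 1.96 * sigma pairs (to be deleted)."""
    counts = {}
    for a, b in pairs:
        counts[a] = counts.get(a, 0) + 1
        counts[b] = counts.get(b, 0) + 1
    cutoff = mean(counts.values()) + 1.96 * stdev(counts.values())
    return {w for w, c in counts.items() if c > cutoff}
```

Such outlier filters drop near-zero association scores and promiscuous words (like function words) that pair with almost everything.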

SLIDE 21

Subjectivity Classification

RunID1:

Use manually filtered words: important, difficult, effective, popular, successful, easily, troubled, striking, best, bad, painful, strong, good

Result: low recall

SLIDE 22

Subjectivity Classification

  • RunID1: use manually filtered words
  • RunID2: RunID1 + (χ² > average)
  • RunID3: RunID1 + (χ² > 3.84)

SLIDE 23

Subjectivity Classification

Classification algorithm:

  1. If a sentence contains a relevance marker → RELEVANT
  2. If a sentence is RELEVANT and contains a subjectivity marker → OPINIONATED
  3. Otherwise → NA
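The three-rule cascade can be sketched directly; matching markers by lower-cased substring containment, and the sample marker sets, are illustrative assumptions.

```python
def classify(sentence, relevance_markers, subjectivity_markers):
    """Three-rule cascade from the slide: NA unless a relevance marker
    is found; RELEVANT sentences with a subjectivity marker become
    OPINIONATED."""
    s = sentence.lower()
    if not any(m in s for m in relevance_markers):
        return "NA"            # rule 3: no relevance marker
    if any(m in s for m in subjectivity_markers):
        return "OPINIONATED"   # rule 2: relevant + subjectivity marker
    return "RELEVANT"          # rule 1: relevance marker only

# Hypothetical marker sets in the spirit of topic N11 (illustrative only).
rel = {"netscape", "aol"}
subj = {"good", "bad", "successful"}
```

Note that a subjectivity marker alone is not enough: a sentence must first pass the relevance test before it can be labelled opinionated.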

SLIDE 24

Results: Trad. Chinese (lenient)

[Chart: precision (P), recall (R) and F-value for relevance (-rel) and opinion (-opin) classification across runs]

SLIDE 25

Results: Simp. Chinese (lenient)

[Chart: precision (P), recall (R) and F-value for relevance (-rel) and opinion (-opin) classification across runs]

SLIDE 26

Results: Japanese (lenient)

[Chart: precision (P), recall (R) and F-value for relevance (-rel) and opinion (-opin) classification across runs]

SLIDE 27

Results: English (lenient)

[Chart: precision (P), recall (R) and F-value for relevance (-rel) and opinion (-opin) classification across runs]

SLIDE 28

Best results (lenient)

Language | Sub-task (RunID) | Precision | Recall | F-value
T. Chinese | Relevance (3) | 48.2 | 68.9 | 56.7
T. Chinese | Opinion (3) | 27.7 | 84.6 | 41.7
S. Chinese | Relevance (3) | 97.1 | 58.5 | 73.0
S. Chinese | Opinion (3) | 43.2 | 69.9 | 53.4
Japanese | Relevance (3)* | 47.7 | 63.8 | 54.6
Japanese | Opinion (3)* | 30.2 | 91.0 | 45.3
English | Relevance (3) | 87.5 | 41.1 | 55.6
English | Opinion (3) | 47.6 | 74.2 | 58.0

*Note that the RunID3 results were obtained after the official submission.

SLIDE 29

Error Analysis

  • Small amount of data
  • More noise with higher recall
  • Word segmentation for the Asian languages

发展中国家 ('developing countries') : 发展中 + 国家 / 发展 + 中国 + 家

  • POS tagging
SLIDE 30

Conclusion

  • A simple, almost-unsupervised cross-lingual system
  • Satisfactory results for the Japanese and English tasks
  • Rather poor performance for both Chinese tasks
SLIDE 31

Future Work

  • Reduce noise
  • Automate subjectivity marker selection
  • Develop an unsupervised, language-independent (quasi-)POS tagging technique

SLIDE 32

ありがとうございます 謝謝 谢谢 Thank you

SLIDE 33

Results: Traditional Chinese (lenient)

Sub-task (RunID) | Precision | Recall | F-value
Relevance (1) | 84.9 | 14.5 | 24.8
Opinion (1) | 53.6 | 26.8 | 35.7
Relevance (2) | 86.4 | 28.6 | 43.0
Opinion (2) | 49.4 | 50.6 | 50.0
Relevance (3) | 85.7 | 41.1 | 55.6
Opinion (3) | 47.6 | 74.2 | 58.0

SLIDE 34

Results: Simplified Chinese (lenient)

Sub-task (RunID) | Precision | Recall | F-value
Relevance (1) | 96.3 | 32.6 | 48.7
Opinion (1) | 44.3 | 39.9 | 42.0
Relevance (2) | 97.5 | 28.0 | 43.5
Opinion (2) | 48.2 | 36.9 | 41.8
Relevance (3) | 97.1 | 58.5 | 73.0
Opinion (3) | 43.2 | 69.9 | 53.4

SLIDE 35

Results: Japanese (lenient)

Sub-task (RunID) | Precision | Recall | F-value
Relevance (1) | 53.7 | 18.9 | 28.0
Opinion (1) | 42.6 | 22.3 | 29.3
Relevance (2) | | |
Opinion (2) | | |
Relevance (3)* | 47.7 | 63.8 | 54.6
Opinion (3)* | 30.2 | 91.0 | 45.3

*Note that the RunID3 results were obtained after the official submission.

SLIDE 36

Results: English (lenient)

Sub-task (RunID) | Precision | Recall | F-value
Relevance (1) | 13.0 | 6.8 | 9.0
Opinion (1) | 37.8 | 10.1 | 16.0
Relevance (2) | 17.5 | 14.4 | 15.8
Opinion (2) | 33.8 | 18.6 | 24.0
Relevance (3) | 48.2 | 68.9 | 56.7
Opinion (3) | 27.7 | 84.6 | 41.7