NTCIR-7 Almost-Unsupervised Cross-Language Opinion Analysis NLCL - - PowerPoint PPT Presentation



SLIDE 1

NTCIR-7

Almost-Unsupervised Cross-Language Opinion Analysis

NLCL group Taras Zagibalov* T.Zagibalov@sussex.ac.uk John Carroll J.A.Carroll@sussex.ac.uk

Department of Informatics University of Sussex

* supported by the Ford Foundation International Fellowships Program.

SLIDE 2

19/12/2008 NLCL 2

Overview

  • Introduction
  • Tasks
  • Our Approach
  • Lexical Item Extraction
  • Relevance Classification
  • Subjectivity Classification
  • Results
  • Error Analysis and Conclusion
SLIDE 3

Introduction

  • Our main focus is the portability of natural language processing systems across languages
  • Our basic approach is almost unsupervised

SLIDE 4

Tasks

  • Japanese
  • English
  • Simplified Chinese
  • Traditional Chinese
SLIDE 5

Tasks

  • Relevance Classification
  • Subjectivity Classification
  • Opinion Classification
  • Target Detection
  • Opinion Holder Detection
SLIDE 6

Our Approach

  • Lexical Item Extraction
  • Relevance Classification
  • Subjectivity Classification
SLIDE 7

Lexical Item Extraction

Lexical Item (LI) extraction problems:

  • Word boundary detection in Chinese and Japanese
  • Idioms and collocations
SLIDE 8

Lexical Item Extraction

LI extraction technique used:

  • Any sequence of characters that occurs at least three times is a candidate LI
  • If the frequency of an LI is the same as that of a shorter sub-unit, the shorter one is deleted

SLIDE 9

Lexical Item Extraction

LI extraction technique used:

  • Any sequence of characters that occurs at least three times is a candidate LI
  • If the frequency of an LI is the same as that of a shorter sub-unit, the shorter one is deleted

LI candidate | Frequency | Length | Kept?
美国司法 | 31 | 4 | √
美国司 | 31 | 3 | X
司 | 519 | 1 |
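The two extraction rules can be sketched in Python. This is an illustrative reconstruction, not the authors' code; `min_count`, `max_len` and the toy string are assumptions.

```python
from collections import Counter

def extract_lexical_items(text, min_count=3, max_len=6):
    """Rule 1: every character n-gram (up to max_len) occurring at
    least min_count times is a candidate lexical item."""
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    candidates = {li: c for li, c in counts.items() if c >= min_count}
    # Rule 2: if a shorter sub-unit has the same frequency as a longer
    # candidate that contains it, delete the shorter one.
    kept = set(candidates)
    for shorter in candidates:
        for longer in candidates:
            if shorter != longer and shorter in longer \
                    and candidates[shorter] == candidates[longer]:
                kept.discard(shorter)
                break
    return {li: candidates[li] for li in kept}

# Toy example: "ab", "bc", "a", etc. always occur inside "abc" with the
# same frequency, so only "abc" survives.
items = extract_lexical_items("abcxabcyabcz", min_count=3, max_len=3)
```

This mirrors the 美国司法 / 美国司 example above: the shorter string with the same frequency is pruned, while an independent unit like 司 (frequency 519) is kept.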

SLIDE 10

Relevance Classification

  • All LIs are ranked according to their frequency in each document
  • LI frequency ranks are compared across all the documents
  • LIs with the biggest rank differences are selected as relevance indicators

SLIDE 11

Relevance Classification

  • All LIs are ranked according to their frequency in each document
  • LI frequency ranks are compared across all the documents
  • LIs with the biggest rank differences are selected as relevance indicators

LI | Topic 1 rank | Topic 2 rank | Difference |
the | 2 | 3 | 1 | X
netscape | 10 | 10 | 0 |
law | 24 | 6 | 18 |
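The rank-difference idea can be sketched as follows, assuming per-topic frequency dictionaries; the `min_diff` threshold and the treatment of items unseen in one topic are illustrative assumptions, not the actual settings.

```python
def rank_items(freqs):
    """Rank lexical items by frequency within a topic: rank 1 = most frequent."""
    ordered = sorted(freqs, key=freqs.get, reverse=True)
    return {li: rank for rank, li in enumerate(ordered, start=1)}

def relevance_indicators(topic1_freqs, topic2_freqs, min_diff=10):
    """Select LIs whose frequency rank differs most between two topics.
    Items unseen in one topic are ranked last (an assumption)."""
    r1, r2 = rank_items(topic1_freqs), rank_items(topic2_freqs)
    last = max(len(r1), len(r2)) + 1
    return {li for li in set(r1) | set(r2)
            if abs(r1.get(li, last) - r2.get(li, last)) >= min_diff}
```

As in the table above, a function word like "the" ranks near the top in every topic (small difference, rejected), while a topic word like "law" shows a large rank difference and is selected.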

SLIDE 12

Relevance Classification

Example:

Topic: 'What is the relationship between AOL and Netscape?' (N11)

Relevance indicators: america online, appliances, designed, dominant, link, maker, netscape, online, operating, services, start-ups, sun, technological change, they have, windows
SLIDE 13

Subjectivity Classification

  • For each LI we found immediate neighbours:

第五次缔约方大会的中国代表团

SLIDE 14

Subjectivity Classification

  • For each LI we found immediate neighbours:

第五次缔约方大会的中国代表团 → 中国 : 的_0, 大会的_0, 代表团_1

SLIDE 15

Subjectivity Classification

  • For each neighbour word we calculated a chi-square (χ²) score
  • LIs with χ² > 3.84 were included in the list
  • All such words were ranked according to their score
  • The lists of every two headwords were compared to find how many context words they shared
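The χ² score for a headword–neighbour pair can be sketched from a 2×2 contingency table; the table formulation is an assumption, since the slides give only the 3.84 critical value (p < 0.05, one degree of freedom).

```python
def chi_square(n11, n12, n21, n22):
    """Chi-square for a 2x2 contingency table:
    n11 = headword and neighbour co-occur, n12 = headword without neighbour,
    n21 = neighbour without headword, n22 = neither occurs."""
    n = n11 + n12 + n21 + n22
    num = n * (n11 * n22 - n12 * n21) ** 2
    den = (n11 + n12) * (n21 + n22) * (n11 + n21) * (n12 + n22)
    return num / den if den else 0.0

CRITICAL = 3.84  # chi-square critical value at p < 0.05, 1 degree of freedom

def is_associated(n11, n12, n21, n22):
    """Keep a neighbour only if its score exceeds the critical value."""
    return chi_square(n11, n12, n21, n22) > CRITICAL
```

An independent 2×2 table scores 0; a strongly skewed one clears the 3.84 threshold.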

SLIDE 16

Subjectivity Classification

  • Syntactic and semantic relations separated:

跟 中国 经济 的 快速 / 对 美国 经济 的 信心

Syntactic relations: 跟 + 中国; 中国 + 经济; 美国 + 经济; 经济 + 的
Semantic relations: 中国 + 美国

SLIDE 17

Subjectivity Classification

Headword | Context words
中国 | 经济, 跟
美国 | 经济, 对
经济 | 中国, 的
的 | 经济, 快速

  • Good pairs: 中国 + 美国
  • Bad pairs: 中国 + 经济; 美国 + 经济; 经济 + 的

SLIDE 18

Subjectivity Classification

  • Syntactic and semantic relations separated:

there are good years and bad years / stable and good conditions

Syntactic relations: are + good; good + years; and + bad; and + good
Semantic relations: good + bad

SLIDE 19

Subjectivity Classification

Headword | Context words
good | and, years
bad | and, years
and | bad, good
years | bad, and

  • Good pairs: good + bad
  • Bad pairs: and + bad; and + good; and + years; years + bad; good + years
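The shared-context comparison behind good vs. bad pairs can be sketched as a simple overlap measure; the exact measure used is not stated on the slides, so the Jaccard-style overlap here is an assumption.

```python
def context_overlap(ctx_a, ctx_b):
    """Fraction of context words two headwords share (Jaccard overlap;
    the actual measure used is an assumption)."""
    a, b = set(ctx_a), set(ctx_b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Context lists from the English example above.
contexts = {
    "good":  ["and", "years"],
    "bad":   ["and", "years"],
    "and":   ["bad", "good"],
    "years": ["bad", "and"],
}
```

"good" and "bad" share all of their context words, so they form a good (semantic) pair; "good" and "and" share none, so the pair is rejected.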

SLIDE 20

Subjectivity Classification

Filtering the paired headwords:

  • Filter 1: exclude all pairs with too small an association score (score less than x̄ − 1.96σ)
  • Filter 2: delete all words that occur in too many pairs (LIs that occur in more than x̄ + 1.96σ pairs)
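A minimal sketch of the two filters, reading both thresholds as mean ± 1.96σ over the score and pair-count distributions; the slides show only "1.96σ", so the mean-relative reading, the sample data and the use of the sample standard deviation are assumptions.

```python
from statistics import mean, stdev

def filter_low_scores(pair_scores):
    """Filter 1: exclude pairs whose association score falls below
    mean - 1.96 * sigma of all pair scores."""
    scores = list(pair_scores.values())
    cutoff = mean(scores) - 1.96 * stdev(scores)
    return {pair: s for pair, s in pair_scores.items() if s >= cutoff}

def filter_frequent_words(pairs):
    """Filter 2: find words that occur in more than
    mean + 1.96 * sigma pairs (to be deleted)."""
    counts = {}
    for a, b in pairs:
        counts[a] = counts.get(a, 0) + 1
        counts[b] = counts.get(b, 0) + 1
    cutoff = mean(counts.values()) + 1.96 * stdev(counts.values())
    return {w for w, c in counts.items() if c > cutoff}
```

Such outlier filters drop near-zero association scores and promiscuous words (like function words) that pair with almost everything.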

SLIDE 21

Subjectivity Classification

RunID1:

Use manually filtered words: important, difficult, effective, popular, successful, easily, troubled, striking, best, bad, painful, strong, good

Result: low recall

SLIDE 22

Subjectivity Classification

  • RunID1: use manually filtered words
  • RunID2: RunID1 + (χ² > average)
  • RunID3: RunID1 + (χ² > 3.84)

SLIDE 23

Subjectivity Classification

Classification algorithm:

  1. If a sentence contains a relevance marker → RELEVANT
  2. If a sentence is RELEVANT and contains a subjectivity marker → OPINIONATED
  3. Otherwise → NA
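The three-rule cascade can be sketched directly; matching markers by lower-cased substring containment, and the sample marker sets, are illustrative assumptions.

```python
def classify(sentence, relevance_markers, subjectivity_markers):
    """Three-rule cascade from the slide: NA unless a relevance marker
    is found; RELEVANT sentences with a subjectivity marker become
    OPINIONATED."""
    s = sentence.lower()
    if not any(m in s for m in relevance_markers):
        return "NA"            # rule 3: no relevance marker
    if any(m in s for m in subjectivity_markers):
        return "OPINIONATED"   # rule 2: relevant + subjectivity marker
    return "RELEVANT"          # rule 1: relevance marker only

# Hypothetical marker sets in the spirit of topic N11 (illustrative only).
rel = {"netscape", "aol"}
subj = {"good", "bad", "successful"}
```

Note that a subjectivity marker alone is not enough: a sentence must first pass the relevance test before it can be labelled opinionated.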

SLIDE 24

Results: Trad. Chinese (lenient)

[Chart: precision (P), recall (R) and F-value for relevance (-rel) and opinion (-opin) classification across runs]

SLIDE 25

Results: Simp. Chinese (lenient)

[Chart: precision (P), recall (R) and F-value for relevance (-rel) and opinion (-opin) classification across runs]

SLIDE 26

Results: Japanese (lenient)

[Chart: precision (P), recall (R) and F-value for relevance (-rel) and opinion (-opin) classification across runs]

SLIDE 27

Results: English (lenient)

[Chart: precision (P), recall (R) and F-value for relevance (-rel) and opinion (-opin) classification across runs]

SLIDE 28

Best results (lenient)

Language | Sub-task (RunID) | Precision | Recall | F-value
T. Chinese | Relevance (3) | 48.2 | 68.9 | 56.7
T. Chinese | Opinion (3) | 27.7 | 84.6 | 41.7
S. Chinese | Relevance (3) | 97.1 | 58.5 | 73.0
S. Chinese | Opinion (3) | 43.2 | 69.9 | 53.4
Japanese | Relevance (3)* | 47.7 | 63.8 | 54.6
Japanese | Opinion (3)* | 30.2 | 91.0 | 45.3
English | Relevance (3) | 87.5 | 41.1 | 55.6
English | Opinion (3) | 47.6 | 74.2 | 58.0

*Note that the RunID3 results were obtained after the official submission.

SLIDE 29

Error Analysis

  • Small amount of data
  • More noise with higher recall
  • Word segmentation for the Asian languages

发展中国家 ('developing countries') : 发展中 + 国家 / 发展 + 中国 + 家

  • POS tagging
SLIDE 30

Conclusion

  • A simple, almost-unsupervised cross-lingual system
  • Satisfactory results for the Japanese and English tasks
  • Rather poor performance for both Chinese tasks
SLIDE 31

Future Work

  • Reduce noise
  • Automate subjectivity marker selection
  • Develop an unsupervised, language-independent (quasi-)POS tagging technique

SLIDE 32

ありがとうございます 謝謝 谢谢 Thank you

SLIDE 33

Results: Traditional Chinese (lenient)

Sub-task (RunID) | Precision | Recall | F-value
Relevance (1) | 84.9 | 14.5 | 24.8
Opinion (1) | 53.6 | 26.8 | 35.7
Relevance (2) | 86.4 | 28.6 | 43.0
Opinion (2) | 49.4 | 50.6 | 50.0
Relevance (3) | 85.7 | 41.1 | 55.6
Opinion (3) | 47.6 | 74.2 | 58.0

SLIDE 34

Results: Simplified Chinese (lenient)

Sub-task (RunID) | Precision | Recall | F-value
Relevance (1) | 96.3 | 32.6 | 48.7
Opinion (1) | 44.3 | 39.9 | 42.0
Relevance (2) | 97.5 | 28.0 | 43.5
Opinion (2) | 48.2 | 36.9 | 41.8
Relevance (3) | 97.1 | 58.5 | 73.0
Opinion (3) | 43.2 | 69.9 | 53.4

SLIDE 35

Results: Japanese (lenient)

Sub-task (RunID) | Precision | Recall | F-value
Relevance (1) | 53.7 | 18.9 | 28.0
Opinion (1) | 42.6 | 22.3 | 29.3
Relevance (2) | | |
Opinion (2) | | |
Relevance (3)* | 47.7 | 63.8 | 54.6
Opinion (3)* | 30.2 | 91.0 | 45.3

*Note that the RunID3 results were obtained after the official submission.

SLIDE 36

Results: English (lenient)

Sub-task (RunID) | Precision | Recall | F-value
Relevance (1) | 13.0 | 6.8 | 9.0
Opinion (1) | 37.8 | 10.1 | 16.0
Relevance (2) | 17.5 | 14.4 | 15.8
Opinion (2) | 33.8 | 18.6 | 24.0
Relevance (3) | 48.2 | 68.9 | 56.7
Opinion (3) | 27.7 | 84.6 | 41.7