slide-1
SLIDE 1

1

An Empirical Study on Selective Sampling in Active Learning for Splog Detection

Taichi Katayama (1), Takehito Utsuro (1), Yuuki Sato (2), Takayuki Yoshinaka (3), Yasuhide Kawada (4), Tomohiro Fukuhara (5)

(1) University of Tsukuba, (2) Konami Corporation, (3) Tokyo Denki University, (4) Navix Co., Ltd., (5) University of Tokyo

AIRWeb 2009, April 21st, 2009, Madrid, Spain (WWW 2009)

slide-2
SLIDE 2

2

Background

  • Opinion mining from blogs
  • Splogs are serious noise in opinion mining

– e.g., large-scale statistics (Mar. 2008)

  • 40% of Japanese blog articles in BuzzPulse, nifty are splogs (Oct. 2007 to Feb. 2008)
  • Automatic detection is highly desirable.
slide-3
SLIDE 3

3

keyword stuffed blog

slide-4
SLIDE 4

4

Blog snippet retrieved with “FC Tokyo”

[Screenshot labels: rumor of “FC Tokyo” (a football team in Japan); “FC Tokyo”.]

slide-5
SLIDE 5

5

Blog snippet retrieved with “LOUIS VUITTON Key case”

[Screenshot label: pop-up advertisement automatically inserted by the blog host system.]

slide-6
SLIDE 6

6

$50 Software Package for Massive Splog Creation

Featuring

  • SEO
  • Affiliate Program

[Diagram: satellite sites send in-links to the main site.]

slide-8
SLIDE 8

8

Previous studies on splog detection

  • [P.Kolari 2007]

– Words
– URLs
– Anchor texts
– Links
– HTML meta tags

  • [Y.-R.Lin 2007]

– Temporal self-similarities of posting time, posting contents, and affiliated links

  • [G.Mishne 2005]

– Language models among the blog post, the comment, and pages linked by the comments

slide-9
SLIDE 9

9

Evaluation with two data sets

“Do splogs change over time?”

  • 1. Years 2007-2008 (720 sites)
  • 2. Years 2008-2009 (720 sites)
slide-10
SLIDE 10

10

Recall/precision curves with the confidence measure

[Two plots, one for splog detection and one for authentic blog detection; x-axis: Recall (%), y-axis: Precision (%). Each plot compares two training sets: 2007-2008 (720 sites) versus 2007-2008 (360 sites) + 2008-2009 (360 sites). Test set: 2008-2009 (40 sites).]

slide-11
SLIDE 11

11

Purpose of This Research (1)

  • Splog/authentic blog data sets need to be updated continuously, year by year
  • How can human supervision be reduced?
  • Can an active learning framework help?
slide-12
SLIDE 12

12

Purpose of This Research (2)

  • Optimal strategies for selective sampling in active learning
  • Guided by a certain confidence measure

– random samples
– samples balanced with a confidence measure
– samples with the least confidence

slide-13
SLIDE 13

13

Outline

  • 1. Definition of splog sites
  • 2. Splog detection by machine learning

– SVM
– Confidence measure
– Features

  • 3. Active learning
  • 4. Evaluation
  • 5. Future work
slide-14
SLIDE 14

14

Definition of splog sites

  • If one of the following holds for a given blog site, it is most likely a splog:

– no originally written text is included
– originally written text is included, but many “links to affiliated sites”, “advertisement articles”, or “articles with adult content” are also included (judged individually by considering the contents of each blog)

  • Otherwise, the given blog site is an authentic blog

slide-15
SLIDE 15

15

Splog Detection by SVMs

  • a tool

– TinySVM

  • the kernel function:

– 2nd order linear

  • confidence measure

– the distance from the separating hyperplane to each test instance
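
As an illustration of the confidence measure, the sketch below computes the distance from a point to a separating hyperplane. It assumes a plain linear decision function w·x + b; the weights, bias, and sample points are hypothetical, and it does not reproduce TinySVM's interface or the kernel used in the study (with a non-linear kernel the distance would be measured in the kernel's feature space).

```python
import math

def signed_distance(weights, bias, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(w * w for w in weights))
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return score / norm

def confidence(weights, bias, x):
    """Confidence measure: absolute distance to the separating hyperplane."""
    return abs(signed_distance(weights, bias, x))

# Hypothetical 2-D hyperplane 3*x1 + 4*x2 - 5 = 0 (positive side stands for "splog").
w, b = [3.0, 4.0], -5.0
near = confidence(w, b, [1.0, 0.5])   # lies on the hyperplane: least confident
far = confidence(w, b, [3.0, 4.0])    # far from the hyperplane: highly confident
```

An instance near the hyperplane gets a small value (low confidence) and an instance far from it gets a large value, which is the quantity the selective sampling strategies rank by.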

slide-16
SLIDE 16

16

A Confidence Measure

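[Figure: splog and authentic blog instances on either side of the separating hyperplane, with a lower bound of the confidence measure marked on the splog side and another on the authentic blog side.]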

slide-17
SLIDE 17

17

Features for splog detection

1. Total frequency of URLs not linked from splogs
2. Co-occurrence between noun phrases and splogs

  • Sum, over noun phrases $w$, of $\phi^2(\text{splog}, \text{noun phrase})$

3. Noun phrases in anchor texts and linked URLs

  • Total frequency of anchor text noun phrases in splogs, out-linked to splog URLs and Blacklist URLs
  • Total frequency of anchor text noun phrases in splogs, out-linked to authentic blog URLs and Whitelist URLs

slide-18
SLIDE 18

18

Feature 1: URLs not linked from splogs

[Diagram: URLs linked from splogs and from authentic blogs in the training set.]

  • Blacklist: URLs included only in splogs, with more than one inward link from splogs
  • Whitelist: URLs included only in authentic blogs, with more than one inward link from authentic blogs
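
One way to read the Blacklist/Whitelist definition is sketched below. The input layout (a list of labeled sites with their out-linked URLs), the function name `build_url_lists`, and the reading of "more than one inward link" as "linked from at least two sites of that class" are assumptions made for illustration; the example URLs are placeholders.

```python
from collections import defaultdict

def build_url_lists(training_sites, min_links=2):
    """training_sites: iterable of (label, urls) with label 'splog' or 'authentic'.

    Returns (blacklist, whitelist): URLs linked only from one class,
    by at least `min_links` sites of that class.
    """
    splog_links = defaultdict(int)
    auth_links = defaultdict(int)
    for label, urls in training_sites:
        counter = splog_links if label == "splog" else auth_links
        for url in set(urls):          # count each site once per URL
            counter[url] += 1
    blacklist = {u for u, n in splog_links.items()
                 if n >= min_links and u not in auth_links}
    whitelist = {u for u, n in auth_links.items()
                 if n >= min_links and u not in splog_links}
    return blacklist, whitelist

sites = [
    ("splog", ["http://shop.example/a", "http://news.example/x"]),
    ("splog", ["http://shop.example/a"]),
    ("authentic", ["http://news.example/x", "http://wiki.example/y"]),
    ("authentic", ["http://wiki.example/y"]),
]
blacklist, whitelist = build_url_lists(sites)
```

In this toy input, `http://news.example/x` ends up on neither list because it is linked from both classes.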

slide-19
SLIDE 19

19

Value of the Whitelist URLs feature

$$\sum_{u} \log\bigl(\text{total frequency of } u \text{ in the whole training instances of authentic blog homepages}\bigr) \times \bigl(\text{total frequency of } u \text{ in the test instance}\bigr)$$

u: Whitelist URLs

slide-21
SLIDE 21

21

Feature2: Noun Phrases

[Diagram: occurrences of a noun phrase w in the splogs and authentic blogs of the training set.]

w: a noun phrase

freq(splog, w) = a, freq(splog, ¬w) = b, freq(authentic blog, w) = c, freq(authentic blog, ¬w) = d

slide-22
SLIDE 22

22

Value of the splog noun phrase feature

$$\sum_{w} \log \phi^2(\text{splog}, w) \times \bigl(\text{total frequency of } w \text{ in the test instance}\bigr)$$

$$\phi^2(\text{splog}, w) = \frac{(ad - bc)^2}{(a+b)(a+c)(b+d)(c+d)}$$
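
The φ² statistic for a noun phrase can be computed from a 2×2 contingency table, as in the sketch below. It assumes a and c count splog and authentic blog occurrences with the noun phrase w, and b and d count those without it; the function name and the zero-denominator convention are illustrative choices, not taken from the slides.

```python
def phi_squared(a, b, c, d):
    """phi^2 = (ad - bc)^2 / ((a+b)(a+c)(b+d)(c+d)) for a 2x2 table.

    a: splog with w,      b: splog without w,
    c: authentic with w,  d: authentic without w.
    Returns 0.0 when a row or column is empty (assumed convention).
    """
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    if denom == 0:
        return 0.0
    return (a * d - b * c) ** 2 / denom

perfect = phi_squared(10, 0, 0, 10)   # w appears in every splog and in no authentic blog
none = phi_squared(5, 5, 5, 5)        # w is equally common in both classes
```

A value near 1 means the noun phrase is strongly associated with one class; a value near 0 means it carries little information about the class.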

slide-24
SLIDE 24

24

Feature 3: Noun Phrases in Anchor Texts and Linked URLs

[Diagram: a splog site s whose anchor texts (<a href=> w </a>) link out to Blacklist URLs, splog URLs, Whitelist URLs, authentic blog URLs, and other URLs.]

w: a noun phrase in anchor text
AncfB(w,s) = frequency of w in anchor texts of s that link to Blacklist URLs or splog URLs
AncfW(w,s) = frequency of w in anchor texts of s that link to Whitelist URLs or authentic blog URLs

slide-25
SLIDE 25

25

Noun Phrases in Anchor Texts and linked URLs: two features

$$\sum_{w} \log\Bigl(\sum_{s \,\in\, \text{training splog homepages}} \mathrm{AncfB}(w, s)\Bigr) \times \mathrm{AncfB}(w, t)$$

the value of the feature named “anchor text noun phrase out-linked to Blacklist URLs” for a test instance blog homepage

$$\sum_{w} \log\Bigl(\sum_{s \,\in\, \text{training splog homepages}} \mathrm{AncfW}(w, s)\Bigr) \times \mathrm{AncfW}(w, t)$$

the value of the feature named “anchor text noun phrase out-linked to Whitelist URLs” for a test instance blog homepage

w: noun phrase, s: a training splog homepage, t: a test instance blog homepage

slide-26
SLIDE 26

26

Framework of Active learning

  • Pool of unlabeled instances: initial size of 3504 (1296 splogs and 2208 authentic blogs)
  • Training set: initial size of 10 (4 splogs and 6 authentic blogs)
  • One cycle:

– train an SVM classifier on the training set
– selective sampling picks 4 unlabeled sites from the pool
– human supervision labels the 4 sites
– the 4 labeled sites are added to the training set

  • 250 cycles, up to 1010 training instances
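
The cycle described on this slide can be sketched as a pool-based loop. Everything below is a toy stand-in under stated assumptions: instances are integers, `oracle` plays the role of human supervision, `train_threshold` is a trivial threshold classifier rather than an SVM, and `select_random` corresponds only to the "Random" baseline. The sizes (3504 in the pool, 10 initial, 4 per cycle, 250 cycles) follow the slide.

```python
import random

def active_learning(pool, initial, oracle, train, select, cycles=250, batch=4):
    """Pool-based active learning: train, sample a batch, label it, repeat."""
    labeled = dict(initial)                    # instance -> label
    unlabeled = set(pool) - set(labeled)
    for _ in range(cycles):
        if not unlabeled:
            break
        model = train(labeled)
        for x in select(model, sorted(unlabeled), batch):
            labeled[x] = oracle(x)             # human supervision in the real setting
            unlabeled.discard(x)
    return train(labeled), labeled

def train_threshold(labeled):
    """Stand-in classifier (not an SVM): midpoint between the two class means."""
    splog = [x for x, y in labeled.items() if y == "splog"]
    auth = [x for x, y in labeled.items() if y == "authentic"]
    return (sum(splog) / len(splog) + sum(auth) / len(auth)) / 2

_rng = random.Random(0)                        # fixed seed for repeatability

def select_random(model, candidates, k):
    """'Random' baseline: ignore the model and sample k candidates."""
    return _rng.sample(candidates, min(k, len(candidates)))

def oracle(x):
    """Hypothetical ground truth for the toy data."""
    return "splog" if x < 1300 else "authentic"

initial = {x: oracle(x) for x in
           [100, 400, 700, 1000, 1500, 2000, 2500, 3000, 3300, 3500]}
model, labeled = active_learning(range(3514), initial, oracle,
                                 train_threshold, select_random)
```

Swapping `select_random` for a confidence-based selector is the only change needed to try a different sampling strategy in this sketch.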

slide-27
SLIDE 27

27

Statistics of Splog/Authentic Blogs Data Set

Data Sets         # of splogs   # of authentic blogs   total
Years 2008-2009   1445          2459                   3904

slide-28
SLIDE 28

28

Strategies of selective sampling(1/2)

[Four diagrams, one per strategy (Low, High, High/Low, Balanced), each showing which instances are selected relative to the separating hyperplane. Legend: splog, authentic blog.]
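
The slide gives only the strategy names, so the sketch below is one plausible reading, not the authors' definition: "low" takes the least confident instances, "high" the most confident, "high_low" half from each end, and "balanced" instances spread evenly across the confidence ranking. The function name and the sample data are hypothetical, and the class-specific variants (Sp/Au) are not covered.

```python
def select_by_confidence(scored, k, strategy):
    """scored: list of (instance, confidence); returns k instances.

    Assumes len(scored) >= k. Strategy semantics are an interpretation.
    """
    ranked = sorted(scored, key=lambda pair: pair[1])   # least confident first
    if strategy == "low":
        chosen = ranked[:k]
    elif strategy == "high":
        chosen = ranked[-k:]
    elif strategy == "high_low":
        half = k // 2
        chosen = ranked[:k - half] + ranked[len(ranked) - half:]
    elif strategy == "balanced":
        step = (len(ranked) - 1) / (k - 1) if k > 1 else 0
        chosen = [ranked[round(i * step)] for i in range(k)]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [inst for inst, _ in chosen]

# Nine hypothetical unlabeled sites with confidences 0.1 .. 0.9.
scored = [(name, (i + 1) / 10) for i, name in enumerate("abcdefghi")]
picks = {s: select_by_confidence(scored, 4, s)
         for s in ("low", "high", "high_low", "balanced")}
```

Each strategy returns the 4 sites that would be sent for human labeling in one cycle.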

slide-29
SLIDE 29

29

Strategies of selective sampling(2/2)

[Four diagrams, one per strategy (Low-Sp/Au, High-Au, High/Low-Au, Balanced-Sp/Au), each showing which instances are selected relative to the separating hyperplane. Legend: splog, authentic blog.]

slide-31
SLIDE 31

31

Measure for Performance evaluation after active learning cycles

  • Recall/Precision

– Splog detection
– Authentic blog detection is considered in a similar fashion

  • “|Tr| = 3500”, “Random”

– “|Tr| = 3500” indicates a classifier trained with the whole 3504 instances in the pool
– “Random” indicates a classifier trained with randomly selected training instances

$$\text{recall} = \frac{|Ts(\text{splog}) \cap Ts(LBD_s)|}{|Ts(\text{splog})|}, \qquad \text{precision} = \frac{|Ts(\text{splog}) \cap Ts(LBD_s)|}{|Ts(LBD_s)|}$$
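
The recall and precision measures for splog detection can be sketched with plain sets. The sketch assumes each test site has a signed confidence score that is positive on the splog side, and that Ts(LBD_s) is the set of sites whose score is at or above the lower bound; the site names, scores, and function name are hypothetical.

```python
def recall_precision(scores, reference_splogs, lbd_s):
    """scores: site -> signed confidence (positive = splog side).

    Ts(LBD_s) is taken to be the sites with score >= lbd_s.
    """
    detected = {site for site, s in scores.items() if s >= lbd_s}
    hits = detected & reference_splogs
    recall = len(hits) / len(reference_splogs) if reference_splogs else 0.0
    precision = len(hits) / len(detected) if detected else 0.0
    return recall, precision

scores = {"s1": 2.0, "s2": 1.2, "s3": 0.3, "a1": 0.8, "a2": -1.5}
reference = {"s1", "s2", "s3"}                        # Ts(splog)
strict = recall_precision(scores, reference, 1.0)     # high lower bound
loose = recall_precision(scores, reference, 0.0)      # low lower bound
```

Sweeping the lower bound from high to low trades precision for recall, which is how a recall/precision curve is traced.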

slide-32
SLIDE 32

32

Lower Bound of the Confidence Measure

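[Figure: test instances around the separating hyperplane, with the lower bound LBD_s of the confidence measure marked on the splog side. Legend: splog, authentic blog.]

Ts(LBD_s): the set of test instances on the splog side whose confidence is at or above the lower bound LBD_s
Ts(splog): the set of reference splog sites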

slide-34
SLIDE 34

34

Recall/precision curve of splog detection

[Plot: x-axis Recall (%), y-axis Precision (%). Curves: High/Low-Au, Low-Sp/Au, High-Au, Random, Balanced-Sp/Au, |Tr|=3500.]

slide-35
SLIDE 35

35

Recall/precision curve of authentic blog detection

[Plot: x-axis Recall (%), y-axis Precision (%). Curves: High/Low-Au, Low-Sp/Au, High-Au, Random, Balanced-Sp/Au, |Tr|=3500.]

slide-36
SLIDE 36

36

Evaluation results: comparison of strategies for selective sampling

  • Splog/authentic blog detection: |Tr|=3500, Random, High/Low, Balance, High, Low
  • Previous studies of active learning for text classification tasks: Low, Random

slide-37
SLIDE 37

37

Support Vectors

  • Only the support vectors affect the position of the separating hyperplane
  • The number of support vectors can be regarded as the complexity of the learning task

[Figure: separating hyperplane with the support vectors marked.]
slide-38
SLIDE 38

38

Changes in # of Support Vectors

[Plot: x-axis # of training instances, y-axis # of support vectors. Curves: High/Low-Au, Low-Sp/Au, High-Au, Random, Balanced-Sp/Au. Annotations on the plot: Random, High/Low, Balance, Low, High.]

slide-39
SLIDE 39

39

Evaluation result: # of support vectors

  • The number of support vectors increases linearly
  • Performance of splog/authentic blog detection increases much more slowly
  • About 20% of training instances are constantly selected as support vectors
  • In this task, more effective features should be added.

slide-40
SLIDE 40

40

Change in maximum precision of splog detection with recall at 30%

[Plot: x-axis # of training instances, y-axis Precision (%). Curves: High/Low-Au, Low-Sp/Au, High-Au, Random, Balanced-Sp/Au, |Tr|=3500.]
slide-41
SLIDE 41

41

Change in maximum precision of authentic blog detection with recall at 30%

[Plot: x-axis # of training instances, y-axis Precision (%). Curves: High/Low-Au, Low-Sp/Au, High-Au, Random, Balanced-Sp/Au, |Tr|=3500.]
slide-43
SLIDE 43

43

Future work

  • Incorporating other features

– Post time and intervals
– HTML structures

  • Manual examination of support vectors
slide-44
SLIDE 44

44

Thanks for your attention