An Empirical Study on Selective Sampling in Active Learning for - PowerPoint PPT Presentation

An Empirical Study on Selective Sampling in Active Learning for Splog Detection
Taichi Katayama (1), Takehito Utsuro (1), Yuuki Sato (2), Takayuki Yoshinaka (3), Yasuhide Kawada (4), Tomohiro Fukuhara (5)
(1) University of Tsukuba, (2) Konami Corporation, (3) Tokyo Denki University, (4) Navix Co., Ltd., (5) University of Tokyo
AIRWeb 2009, April 21, 2009, Madrid, Spain (WWW 2009)

Background
• Opinion Mining from Blogs
• Splogs are Serious Noise in Opinion Mining
  – e.g., larger scale statistics (Mar. 2008)
    • 40% of Japanese blog articles in BuzzPulse, nifty are splogs (Oct. 2007 to Feb. 2008)
• Automatic detection is highly desirable.

Keyword-stuffed blog (screenshot)

Blog snippet retrieved with “FC Tokyo” (a football team in Japan): rumor of “FC Tokyo” (screenshot)

Blog snippet retrieved with “LOUIS VUITTON Key case”; the pop-up advertisement is automatically inserted by the blog host system (screenshot)

$50 Software Package for Massive Splog Creation
Featuring
• SEO
• Affiliate Program
[Diagram: satellite sites, each with an in-link to the main site]

Previous studies on splog detection
• [P.Kolari 2007]
  – Words
  – URLs
  – Anchor texts
  – Links
  – HTML meta tags
• [Y.-R.Lin 2007]
  – Temporal self-similarities of
    • Posting time
    • Posting contents
    • Affiliated links
• [G.Mishne 2005]
  – Language models among the blog post, the comment, and pages linked by the comments

Evaluation with two data sets
“Do splogs change over time?”
1. Years 2007-2008 (720 sites)
2. Years 2008-2009 (720 sites)

Recall/Precision curves with confidence measure
[Figure: two panels, “Splog detection” and “Authentic blog detection”; x-axis: Recall (%); y-axis: Precision (%); curves: Train 07-08 (720 sites) vs. Train 07-08 (360 sites) + 08-09 (360 sites); Test 08-09 (40 sites)]

Purpose of This Research (1)
• Need for continuously updating splog/authentic blog data sets year by year
• How can human supervision be reduced?
• Can an active learning framework help?

Purpose of This Research (2)
• Optimal Strategies for Selective Sampling in Active Learning
• Guided by a Certain Confidence Measure
  – random samples
  – samples with the least confidence
  – samples balanced with a confidence measure

Outline
1. Definition of splog sites
2. Splog detection by machine learning
  – SVM
  – Confidence Measure
  – Features
3. Active learning
4. Evaluation
5. Future work

Definition of splog sites
• If one of the following holds for the given blog site, then it is most likely a splog
  – originally written text is not included
  – originally written text is included, but many
    • “links to affiliated sites” or
    • “advertisement articles” or
    • “articles with adult content”
    are included (judged individually by considering the contents of each blog)
• Otherwise, the given blog site is an authentic blog

Splog Detection by SVMs
• a tool
  – TinySVM
• the kernel function:
  – 2nd order / linear
• confidence measure
  – the distance from the separating hyperplane to each test instance
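
A minimal sketch of the confidence measure for the linear-kernel case only. It assumes the trained SVM is available as a weight vector `w` and a bias `b`; the values below are hypothetical and are not TinySVM output. The distance from the separating hyperplane to an instance x is |w·x + b| / ||w||. For the 2nd order kernel, the decision value would instead come from the kernel expansion over the support vectors.

```python
import math

def decision_value(w, b, x):
    """Signed decision value w.x + b of a linear SVM."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def confidence(w, b, x):
    """Geometric distance from x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(decision_value(w, b, x)) / norm

# Hypothetical weights and feature vectors, for illustration only.
w, b = [3.0, 4.0], -5.0
near = [1.0, 0.6]   # close to the hyperplane: low confidence
far = [3.0, 4.0]    # far from the hyperplane: high confidence

conf_near = confidence(w, b, near)
conf_far = confidence(w, b, far)
```

Instances such as `near` are the ones a least-confidence sampling strategy would send to a human annotator first.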

A Confidence Measure
[Figure: splog and authentic blog instances on either side of the separating hyperplane, with a lower bound (splog) and a lower bound (authentic blog) on the distance from the hyperplane]

Features for splog detection
1. Total frequency of URLs not linked from splogs
2. Co-occurrence between noun phrases and splogs
  • Sum of $\phi^2(\text{splog}, \text{noun phrase } w)$
3. Noun phrases in anchor texts and linked URLs
  • Total frequency of anchor text noun phrases
    • in splogs
    • out-linked to splog URLs and Blacklist URLs
  • Total frequency of anchor text noun phrases
    • in splogs
    • out-linked to authentic blog URLs and Whitelist URLs

Feature 1: URLs not linked from splogs
• Whitelist: URLs included only in authentic blogs, with more than one inward link from authentic blogs
• Blacklist: URLs included only in splogs, with more than one inward link from splogs
[Diagram: splogs and authentic blogs linking to URLs]
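
A rough sketch of how the Whitelist and Blacklist could be built from labeled training sites. The function name, the input layout, and the example URLs are hypothetical. It also assumes that inward links are counted once per linking site, which the slide does not state.

```python
from collections import Counter

def build_url_lists(sites):
    """sites: iterable of (label, urls), label being 'splog' or 'authentic'.
    Returns (whitelist, blacklist) as sets of URLs."""
    splog_links, authentic_links = Counter(), Counter()
    for label, urls in sites:
        target = splog_links if label == "splog" else authentic_links
        # Assumption: each linking site counts once per URL.
        target.update(set(urls))
    # Whitelist: linked only from authentic blogs, by more than one of them.
    whitelist = {u for u, n in authentic_links.items()
                 if n > 1 and u not in splog_links}
    # Blacklist: linked only from splogs, by more than one of them.
    blacklist = {u for u, n in splog_links.items()
                 if n > 1 and u not in authentic_links}
    return whitelist, blacklist

# Hypothetical training sites.
sites = [
    ("authentic", ["http://a.example/", "http://shared.example/"]),
    ("authentic", ["http://a.example/", "http://once.example/"]),
    ("splog", ["http://s.example/", "http://shared.example/"]),
    ("splog", ["http://s.example/"]),
]
whitelist, blacklist = build_url_lists(sites)
```

In this toy input, `http://shared.example/` lands in neither list because both classes link to it, and `http://once.example/` is dropped because it has only one inward link.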

Value of the Whitelist URLs feature
$$\sum_{u} \log\left( \left(\text{total frequency of } u \text{ in the whole training instances of authentic blog homepages}\right) \times \left(\text{total frequency of } u \text{ in the test instance}\right) \right)$$
u: Whitelist URLs
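
A small sketch of the Whitelist URLs feature value, read as a sum over Whitelist URLs u of the log of (training frequency of u in authentic blog homepages) × (frequency of u in the test instance). The function and argument names are hypothetical. Skipping URLs whose product is zero is an assumption made here because log(0) is undefined; the slide does not say how that case is handled.

```python
import math
from collections import Counter

def whitelist_feature(whitelist, train_authentic_urls, test_urls):
    """Sum of log(train_freq(u) * test_freq(u)) over whitelist URLs u."""
    train_freq = Counter(train_authentic_urls)
    test_freq = Counter(test_urls)
    total = 0.0
    for u in whitelist:
        product = train_freq[u] * test_freq[u]
        if product > 0:  # assumption: zero-count URLs contribute nothing
            total += math.log(product)
    return total

# Hypothetical counts: "a" occurs 3 times in training and twice in the test instance.
value = whitelist_feature({"a", "b"}, ["a", "a", "a", "b"], ["a", "a"])
```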

Feature 2: Noun Phrases
Counts over the training set, for a noun phrase w:
• freq(splog, w) = a
• freq(splog, ¬w) = b
• freq(authentic blog, w) = c
• freq(authentic blog, ¬w) = d

Value of the splog noun phrase feature
$$\phi^2(\text{splog}, w) = \frac{(ad - bc)^2}{(a+b)(a+c)(b+d)(c+d)}$$
$$\sum_{w} \log\left( \phi^2(\text{splog}, w) \times \left(\text{total frequency of } w \text{ in the test instance}\right) \right)$$
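
A short sketch of the co-occurrence statistic, computed as (ad − bc)² divided by (a+b)(a+c)(b+d)(c+d), with a, b, c, d the four training-set counts for a noun phrase w. The function name and the example counts are hypothetical, and returning 0.0 for a degenerate table (zero denominator) is an assumption, not something the slides specify.

```python
def phi_squared(a, b, c, d):
    """Association between 'splog' and a noun phrase w from a 2x2 table:
    a = freq(splog, w),     b = freq(splog, not w),
    c = freq(authentic, w), d = freq(authentic, not w)."""
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    if denom == 0:
        return 0.0  # assumption: a degenerate table carries no association
    return (a * d - b * c) ** 2 / denom

# Hypothetical counts for one noun phrase.
strong = phi_squared(30, 10, 5, 55)   # w occurs mostly in splogs
perfect = phi_squared(5, 0, 0, 5)     # w occurs in splogs only, and in all of them
none = phi_squared(1, 1, 1, 1)        # w is independent of the class
```

The value ranges from 0 (no association) to 1 (perfect association), so noun phrases can be ranked by how strongly they co-occur with splogs.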

Feature 3: Noun Phrases in Anchor Texts and linked URLs
w: a noun phrase in the anchor text of a splog site s, e.g. <a href="...">... w ...</a>
[Diagram: anchors in s out-link to authentic blog URLs and Whitelist URLs, to splog URLs and Blacklist URLs, and to other URLs]
AncfW(w, s) = freq of w in anchor texts out-linked to authentic blog URLs and Whitelist URLs
AncfB(w, s) = freq of w in anchor texts out-linked to splog URLs and Blacklist URLs

Noun Phrases in Anchor Texts and linked URLs: two features
• the value of a feature named “anchor text noun phrase out-linked to Blacklist URLs” for a test instance blog homepage
$$\sum_{w} \sum_{s} \log\left( \mathrm{AncfB}(w, s) \times \mathrm{AncfB}(w, t) \right)$$
• the value of a feature named “anchor text noun phrase out-linked to Whitelist URLs” for a test instance blog homepage
$$\sum_{w} \sum_{s} \log\left( \mathrm{AncfW}(w, s) \times \mathrm{AncfW}(w, t) \right)$$
w: noun phrase; s: a training splog homepage (the sums over s run over the training splog homepages); t: a test instance blog homepage
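
A sketch of one anchor-text feature, read as a double sum over noun phrases w and training splog homepages s of the log of Ancf(w, s) × Ancf(w, t). The same function serves the Blacklist and Whitelist variants, depending on which counts are passed in. The data layout (one dict of counts per homepage) is hypothetical, and skipping zero products is an assumption made to avoid log(0).

```python
import math

def anchor_feature(train_counts, test_counts):
    """train_counts: list of dicts, one per training splog homepage s,
    mapping noun phrase w -> Ancf(w, s).
    test_counts: dict mapping w -> Ancf(w, t) for the test homepage t."""
    total = 0.0
    for counts_s in train_counts:
        for w, n_s in counts_s.items():
            product = n_s * test_counts.get(w, 0)
            if product > 0:  # assumption: zero products are skipped
                total += math.log(product)
    return total

# Hypothetical anchor-text counts for two training splogs and one test homepage.
train = [{"cheap bags": 2, "free": 1}, {"cheap bags": 3}]
test = {"cheap bags": 2}
value = anchor_feature(train, test)  # log(2*2) + log(3*2)
```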

Framework of Active Learning
• Training set (initial size of 10: 4 splog and 6 authentic blog)
• Training an SVM classifier
• Selective sampling in active learning: 4 unlabeled sites are taken from the pool of unlabeled instances (initial size of 3504: 1296 splog and 2208 authentic blog)
• Human supervision: the 4 sites are labeled and added to the training set
• 250 cycles, up to 1010 training instances
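
A toy sketch of the cycle above: train, select 4 unlabeled instances, obtain their labels, add them to the training set, repeat. The classifier here is a stand-in and not an SVM: a 1-D threshold halfway between the class means, with distance to the threshold playing the role of the confidence measure. The data, the `oracle` function, and all names are hypothetical. Only random and least-confidence sampling are shown; the balanced strategy is left out because the slides do not define it.

```python
import random

def train_midpoint(labeled):
    """Stand-in for SVM training: threshold halfway between the class means."""
    pos = [x for x, y in labeled if y == 1]
    neg = [x for x, y in labeled if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0

def select(pool, threshold, batch, strategy, rng):
    """Pick `batch` instances at random, or those closest to the boundary."""
    if strategy == "random":
        return rng.sample(pool, batch)
    return sorted(pool, key=lambda x: abs(x - threshold))[:batch]

def active_learning(labeled, pool, oracle, cycles, batch=4,
                    strategy="least_confidence", seed=0):
    rng = random.Random(seed)
    labeled, pool = list(labeled), list(pool)
    for _ in range(cycles):
        if len(pool) < batch:
            break
        threshold = train_midpoint(labeled)
        for x in select(pool, threshold, batch, strategy, rng):
            pool.remove(x)
            labeled.append((x, oracle(x)))  # human supervision step
    return labeled, pool, train_midpoint(labeled)

# Toy run: 3 initial labels, a pool of 100 points, 5 cycles of 4 samples.
oracle = lambda x: 1 if x >= 0.5 else 0
initial = [(0.1, 0), (0.2, 0), (0.9, 1)]
pool = [i / 100 for i in range(100)]
labeled_out, pool_out, threshold = active_learning(initial, pool, oracle, cycles=5)
```

With the settings on the slide, the same bookkeeping gives 10 + 250 × 4 = 1010 training instances after the last cycle.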

Statistics of Splog/Authentic Blogs Data Set
• Data set: Years 2008-2009
• # of splogs: 1445
• # of authentic blogs: 2459
• total: 3904
