Improving Generalization by Data Categorization - Ling Li, Amrit Pratap, Hsuan-Tien Lin, and Yaser Abu-Mostafa - PowerPoint PPT Presentation

SLIDE 1

Improving Generalization by Data Categorization

Ling Li, Amrit Pratap, Hsuan-Tien Lin, and Yaser Abu-Mostafa

Learning Systems Group, Caltech

ECML/PKDD, October 4, 2005

SLIDE 2

Examples in Learning

A Learning System: Unknown Target f → Examples {(x_i, y_i)} → Learner

Examples are essential since they act as the information gateway between the target and the learner.

Not All Examples Are Equally Useful

1. Surprising examples carry more information
   - but garbage examples are also surprising (Guyon et al., 1996) ×
2. Noisy examples and outliers ×
3. Examples beyond the ability of the learner ×

Can we improve learning by automatically categorizing examples?

SLIDE 3

Improved Generalization

[Bar chart: average test error (%) on the australian, breast, cleveland, german, heart, pima, and votes84 datasets, comparing the original set with the filtered set.]

SLIDE 4

Categorize Examples

Which examples are “bad”? Close-to-boundary examples are informative.

Three categories: typical, critical, and noisy.

The automatic data categorization is for better learning; the criteria are usually related to how useful or reliable the example is to learning, such as the margin.

SLIDE 5

Intrinsic Function

The target f : X → {−1, 1} comes from thresholding an intrinsic function f_r : X → R, that is, f(x) = sign(f_r(x)).

Examples of f_r(x) (a toy sketch follows):
1. The credit score of the applicant x minus some threshold
2. The signed Euclidean distance of x to the boundary
3. The probability of x belonging to class 1, minus 0.5

Properties:
- Problem-dependent (e.g., the knowledge of experts)
- Tells the usefulness or reliability of an example
- Unknown
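As a toy illustration of example 2 above (not from the slides), the sketch below defines f_r as the signed distance to a made-up circular boundary and thresholds it to get the binary target; every name and the boundary itself are hypothetical.

```python
import numpy as np

# Toy intrinsic function: the signed Euclidean distance of x to a circular
# boundary of radius 1 (positive inside, negative outside).  The circle and
# all names here are hypothetical, chosen only to illustrate the thresholding.
def f_r(x):
    return 1.0 - np.linalg.norm(x, axis=-1)

# The binary target comes from thresholding the intrinsic function.
def f(x):
    return np.where(f_r(x) >= 0, 1, -1)

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(5, 2))   # a few random 2-D inputs
print(f_r(X))                         # real-valued intrinsic values
print(f(X))                           # labels in {-1, +1}
```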

SLIDE 6

Intrinsic Margin and Data Categorization

For an example (x, y), its intrinsic margin is y · f_r(x), which can be treated as a measure of how close x is to the decision boundary.
- Small positive: near the boundary → critical
- Large positive: deep in the class territory → typical
- Negative: mislabeled → noisy
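If the intrinsic margin were known, categorization would reduce to thresholding it. A minimal Python sketch follows; the threshold values t_low and t_high are hypothetical, not from the paper.

```python
import numpy as np

def categorize(y, fr_x, t_low=0.0, t_high=0.5):
    """Categorize examples by their intrinsic margin y * f_r(x).

    t_low and t_high are hypothetical thresholds: a negative margin
    (below t_low) means noisy, a small positive margin means critical,
    and a large positive margin means typical.
    """
    margin = y * fr_x
    return np.where(margin < t_low, "noisy",
           np.where(margin < t_high, "critical", "typical"))

y = np.array([+1, -1, +1, -1])
fr_x = np.array([0.9, -0.2, -0.4, 0.3])     # made-up intrinsic values
print(categorize(y, fr_x))                  # ['typical' 'critical' 'noisy' 'noisy']
```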

SLIDE 7

Monotonic Estimate

However, the intrinsic margin is unknown.

[Diagram: the intrinsic margin axis, partitioned into noisy, critical, and typical regions, mapped onto the axis of a monotonic estimate with the same three regions.]

A monotonic estimate of the intrinsic margin, together with two proper thresholds, yields the three categories.

SLIDE 8

Selection Cost

For an example (x, y), a hypothesis g may classify it either wrongly or correctly. Consider the difference of the expected out-of-sample errors π(g):

    E_g[ π(g) | g(x) ≠ y ]  −  E_g[ π(g) | g(x) = y ]

We may select to trust (x, y), or not; the difference above is the cost we pay when we make that selection, so we call it the selection cost.
- Negative → noisy
- Small positive → critical
- Large positive → typical

We actually estimate a scaled version of the selection cost (Nicholson, 2002). The model used for learning should also be used for the estimation.
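One plausible way to approximate the two conditional expectations, sketched below, is to train many hypotheses with the same model on bootstrap samples and use each hypothesis's out-of-bag error as a proxy for π(g). This is only an illustrative assumption, not necessarily the estimator of Nicholson (2002); scikit-learn's DecisionTreeClassifier stands in for "the model for learning".

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def selection_cost_estimate(X, y, n_models=50, seed=0):
    """Rough per-example estimate of the selection cost.

    For each bootstrap-trained hypothesis g, the out-of-bag error is used as a
    proxy for pi(g).  Per example, the result is the mean pi(g) over hypotheses
    that misclassify it minus the mean pi(g) over hypotheses that classify it
    correctly, mirroring the definition above.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    err_wrong = np.zeros(n); cnt_wrong = np.zeros(n)
    err_right = np.zeros(n); cnt_right = np.zeros(n)
    pis = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)              # bootstrap sample
        oob = np.setdiff1d(np.arange(n), idx)         # out-of-bag examples
        g = DecisionTreeClassifier(max_depth=3).fit(X[idx], y[idx])
        pred = g.predict(X)
        pi_g = np.mean(pred[oob] != y[oob]) if len(oob) else 0.0
        pis.append(pi_g)
        wrong = pred != y
        err_wrong[wrong] += pi_g; cnt_wrong[wrong] += 1
        err_right[~wrong] += pi_g; cnt_right[~wrong] += 1
    # Examples that every hypothesis classifies the same way lack one of the
    # conditional means; fall back to the overall mean error for that term.
    overall = float(np.mean(pis))
    mean_wrong = np.divide(err_wrong, cnt_wrong, out=np.full(n, overall), where=cnt_wrong > 0)
    mean_right = np.divide(err_right, cnt_right, out=np.full(n, overall), where=cnt_right > 0)
    return mean_wrong - mean_right
```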

SLIDE 9

SVM Confidence Margin

The soft-margin support vector machine (SVM) (Vapnik, 1995) finds a large-confidence hyperplane classifier in the feature space. The confidence margin is a meaningful estimate of the intrinsic margin, better than the one used in (Guyon et al., 1996).
- Confidence margin ≤ 1: support vectors → critical
- Negative confidence margin → noisy
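A minimal sketch of this rule using scikit-learn's SVC follows; the RBF kernel and C = 1 are illustrative choices rather than the paper's settings, and the labels are assumed to be in {−1, +1}.

```python
import numpy as np
from sklearn.svm import SVC

def categorize_by_svm_margin(X, y):
    """Categorize examples by the SVM confidence margin y * g(x).

    Assumes y is in {-1, +1}.  The RBF kernel and C = 1 are illustrative
    choices of feature space and regularization, not the paper's settings.
    """
    svm = SVC(kernel="rbf", C=1.0).fit(X, y)
    margin = y * svm.decision_function(X)     # confidence margin
    # margin <= 1: support vectors -> critical; negative margin -> noisy
    return margin, np.where(margin < 0, "noisy",
                   np.where(margin <= 1, "critical", "typical"))
```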

SLIDE 10

AdaBoost Sample Weight

AdaBoost (Freund & Schapire, 1996) is an algorithm to improve the accuracy of a base learner. It iteratively generates an ensemble of base hypotheses, gradually forcing the base learner to focus on “hard” examples by giving misclassified examples higher sample weights. The sample weight is thus a consensus among the base hypotheses on the “hardness” of the example:
- If an example is too “hard”, it is probably noisy.
- If an example is too “easy”, it is probably typical.

The negative average sample weight over the iterations is a robust estimate of the intrinsic margin.
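The following sketch is a minimal discrete AdaBoost loop that records the sample weights after every round and returns their negative average; decision stumps and 100 rounds are illustrative choices, not necessarily the paper's, and labels are assumed to be in {−1, +1}.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_margin_estimate(X, y, n_rounds=100):
    """Negative average AdaBoost sample weight as an intrinsic-margin estimate.

    A minimal discrete AdaBoost loop with decision stumps as the base learner.
    Assumes y is in {-1, +1}.
    """
    n = len(y)
    w = np.full(n, 1.0 / n)                    # uniform initial sample weights
    weight_sum = np.zeros(n)
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.clip(np.sum(w[pred != y]), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)  # hypothesis weight
        w = w * np.exp(-alpha * y * pred)      # up-weight misclassified examples
        w = w / w.sum()
        weight_sum += w
    # "Hard" (often up-weighted, probably noisy) examples get the most negative
    # values; "easy" (typical) examples stay near zero.  Only the ordering
    # matters, since the categorization thresholds a monotonic estimate.
    return -weight_sum / n_rounds
```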

SLIDE 11

Scatter Plot

3-5-1 NNet

[Scatter plot: SVM confidence margin versus intrinsic margin for the 3-5-1 NNet data.]

SLIDE 12

Scatter Plot

Sin (Merler et al., 2004)

[Scatter plot: scaled selection cost versus intrinsic margin for the Sin data.]

SLIDE 13

ROC Curves

Sin

[ROC curves on the Sin data, 1 − false negative rate versus false positive rate, for the scaled selection cost, negative AdaBoost weight, SVM confidence margin, AdaBoost misclassification ratio, and SVM Lagrange coefficient.]

SLIDE 14

Fingerprint Plot

3-5-1 NNet

[Fingerprint plot: estimated intrinsic value versus example index for the 3-5-1 NNet data.]

SLIDE 15

Fingerprint Plot

Sin

[Fingerprint plot: estimated intrinsic value versus example index for the Sin data.]

SLIDE 16

2-D Plot

Yin-Yang (http://www.work.caltech.edu/ling/data/yinyang.html)

[2-D plot of the Yin-Yang data, with examples marked as typical; critical, clean; critical, mislabeled; noisy, mislabeled; and noisy, clean.]
SLIDE 17

Real-World Data

Utilize Data Categorization

It is now possible to treat different categories differently:
- Noisy examples: remove
- Critical examples: emphasize
- Typical examples: reduce
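A minimal sketch of one simple way to act on the categories follows; the emphasis factor and the typical-keep fraction are made-up knobs, and the exact treatment used in the paper may differ.

```python
import numpy as np

def build_filtered_set(X, y, categories, emphasis=2, keep_typical=0.5, seed=0):
    """Build a training set that treats the three categories differently.

    emphasis and keep_typical are made-up knobs: noisy examples are removed,
    critical examples are duplicated `emphasis` times, and each typical
    example is kept with probability `keep_typical`.
    """
    rng = np.random.default_rng(seed)
    keep = []
    for i, c in enumerate(categories):
        if c == "noisy":
            continue                          # remove noisy examples
        elif c == "critical":
            keep.extend([i] * emphasis)       # emphasize critical examples
        elif rng.random() < keep_typical:
            keep.append(i)                    # reduce typical examples
    keep = np.asarray(keep, dtype=int)
    return X[keep], y[keep]
```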

Average test error (%):

dataset      orig. dataset   selection cost   SVM margin     AdaBoost weight
australian   16.65 ± 0.19    15.23 ± 0.20     14.83 ± 0.18   13.92 ± 0.16
breast        4.70 ± 0.11     6.44 ± 0.13      3.40 ± 0.10    3.32 ± 0.10
cleveland    21.64 ± 0.31    18.24 ± 0.30     18.91 ± 0.29   18.56 ± 0.30
german       26.11 ± 0.20    30.12 ± 0.15     24.59 ± 0.20   24.68 ± 0.22
heart        21.93 ± 0.43    17.33 ± 0.34     17.59 ± 0.32   18.52 ± 0.37
pima         26.14 ± 0.20    35.16 ± 0.20     24.02 ± 0.19   25.15 ± 0.20
votes84       5.20 ± 0.14     6.45 ± 0.17      5.03 ± 0.13    4.91 ± 0.13

SLIDE 18

Conclusion

Contributions

1. Proposed three methods for automatically categorizing examples.

The methods are from different parts of learning theory. They all gave reasonable categorization results.

2. Tested learning with categorized data.

A simple strategy is enough to improve learning. The categorization results can be used in conjunction with a large variety of learning algorithms.

3. Showed experimentally that data categorization is powerful.

Future Work
- Estimate the optimal thresholds (say, using a validation set)
- Better utilize the categorization in learning
- Extend the framework to problems other than classification