SLIDE 1
Robustness and Generalization
Huan Xu, The University of Texas at Austin
Department of Electrical and Computer Engineering
COLT, June 29, 2010
Joint work with Shie Mannor
SLIDE 4
What is Robustness?
- Robustness is the property that the performance on a training
sample and on a similar testing sample is close.
SLIDE 7
What is Robustness?
- Robust decision making/optimization:
- Consider a general decision problem: find v such that ℓ(v, ξ) is small.
- If ℓ(v, ξ′) is also small for ξ′ ≈ ξ, then v is robust to perturbation of the parameter.
- Robust optimization: min_v max_{ξ′≈ξ} ℓ(v, ξ′)
- Robustness in machine learning:
- Robust optimization was introduced to machine learning to handle observation noise (e.g., [Lanckriet et al. 2003]; [Lebret and El Ghaoui 1997]; [Shivaswamy et al. 2006]).
- It was then discovered that SVM and Lasso can both be rewritten as robust optimization (of the empirical loss), and the RO formulation implies consistency [HX, Caramanis and SM 2009; 2010].
- This paper formalizes this observation for general learning algorithms.
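A minimal numerical sketch of the min-max formulation above (not from the talk; the quadratic loss, the interval uncertainty set, and the grid search are illustrative assumptions):

    import numpy as np

    # Robust decision making: min_v max_{xi' ~ xi} l(v, xi'), here with
    # l(v, xi) = (v - xi)^2 and the interval uncertainty set |xi' - xi| <= r.
    xi, r = 1.0, 0.3
    vs = np.linspace(-1.0, 3.0, 1001)        # candidate decisions v
    xis = np.linspace(xi - r, xi + r, 201)   # perturbed parameters xi'

    worst_case = np.array([max((v - x) ** 2 for x in xis) for v in vs])
    v_robust = vs[np.argmin(worst_case)]     # minimizer of the worst-case loss
    print(v_robust)                          # ~1.0, the center of the interval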
SLIDE 8
Difference with Stability
Non-stable algorithm: [figure]
SLIDE 9
Difference with Stability
Stable algorithm: [figure]
SLIDE 10
Difference with Stability
Non-robust algorithm: [figure]
SLIDE 11
Difference with Stability
Robust algorithm: [figure]
SLIDE 12
Outline
- 1. Algorithmic Robustness and Generalization Bound
- 2. Robust Algorithms
- 3. (Weak) Robustness is Necessary and Sufficient for
(Asymptotic) Generalizability
SLIDE 16
Notations
- Training sample set s of n training samples (s1, · · · , sn).
- Z and H are the set from which each sample is drawn and the hypothesis set, respectively.
- As is the hypothesis learned given training set s.
- For each hypothesis h ∈ H and a point z ∈ Z, there is an associated loss ℓ(h, z) ∈ [0, M].
- In supervised learning, we decompose Z = Y × X, and use z|x and z|y to denote the x-component and y-component of a point z.
- The covering number of a metric space T: N(ε, T, ρ).
SLIDE 17
Motivating example 1: Large Margin Classifier
An algorithm As has a margin γ if As(x) = As(sj|x) for all x with ‖x − sj|x‖₂ < γ, for j = 1, · · · , n.
Example
Fix γ > 0 and put K = 2N(γ/2, X, ‖ · ‖₂). If As has a margin γ, then Z can be partitioned into K disjoint sets, denoted by {Ci}_{i=1}^K, such that if sj and z ∈ Z belong to the same Ci, then |ℓ(As, sj) − ℓ(As, z)| = 0.
SLIDE 21
Motivating example 2: Linear Regression
The norm-constrained linear regression algorithm is
As = arg min_{w∈R^m: ‖w‖₂≤c} Σ_{i=1}^n |si|y − w⊤si|x|,  (0.1)
Example
Fix ε > 0 and let K = N(ε/2, X, ‖ · ‖₂) × N(ε/2, Y, | · |). Consider the norm-constrained linear regression algorithm as in (0.1). The set Z can be partitioned into K disjoint sets {Ci}_{i=1}^K, such that if sj and z ∈ Z belong to the same Ci, then |ℓ(As, sj) − ℓ(As, z)| ≤ (c + 1)ε.
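One way to compute (0.1) numerically is projected subgradient descent; the solver, step size, and synthetic data below are illustrative assumptions, not part of the talk:

    import numpy as np

    # Norm-constrained linear regression (0.1):
    #   min over ||w||_2 <= c of sum_i |s_i|y - w^T s_i|x|
    rng = np.random.default_rng(0)
    n, m, c = 200, 5, 2.0
    X = rng.normal(size=(n, m))              # the s_i|x
    y = X @ rng.normal(size=m)               # the s_i|y

    w = np.zeros(m)
    for t in range(1, 2001):
        g = -X.T @ np.sign(y - X @ w)        # subgradient of the absolute loss
        w -= g / (np.sqrt(t) * n)            # diminishing step size
        norm = np.linalg.norm(w)
        if norm > c:
            w *= c / norm                    # project onto the l2 ball of radius c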
SLIDE 23
Algorithmic Robustness
Definition (Algorithmic Robustness)
Algorithm A is (K, ε(s))-robust if
- Z can be partitioned into K disjoint sets, denoted by {Ci}_{i=1}^K;
- such that ∀s ∈ s,
s, z ∈ Ci ⟹ |ℓ(As, s) − ℓ(As, z)| ≤ ε(s).  (0.2)
Remark:
- The definition requires that the difference in loss between a training sample and a testing sample "similar to" it is small.
- The property jointly depends on the solution of the algorithm and the training set.
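To make Definition (0.2) concrete, here is a sketch that measures the smallest ε(s) for which a loss is (K, ε(s))-robust on Z = [0, 1] under a uniform partition; the partition and all names are illustrative assumptions:

    import numpy as np

    def robustness_eps(loss, train, test, K):
        """Worst loss gap between a training sample and test points in its cell."""
        cell = lambda z: min(int(z * K), K - 1)   # index of the partition set C_i
        eps = 0.0
        for s in train:
            for z in test:
                if cell(z) == cell(s):            # s, z in the same C_i
                    eps = max(eps, abs(loss(s) - loss(z)))
        return eps

    # E.g., a 1-Lipschitz loss gives eps <= 1/K under this uniform partition.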
SLIDE 25
Generalization property of robust algorithms – the main theorem
Theorem
Let ℓ̂(·) and ℓemp(·) denote the expected loss and the training loss. If s consists of n i.i.d. samples, and A is (K, ε(s))-robust, then for any δ > 0, with probability at least 1 − δ,
|ℓ̂(As) − ℓemp(As)| ≤ ε(s) + M √( (2K ln 2 + 2 ln(1/δ)) / n ).
Remark: The bound depends on the partitioning of the sample space.
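To get a feel for the rate, one can plug numbers into the bound (M = 1, K = 100, ε(s) = 0.05, and δ = 0.01 are illustrative assumptions, not values from the talk):

    import numpy as np

    M, K, eps, delta = 1.0, 100, 0.05, 0.01
    for n in (10**3, 10**5, 10**7):
        gap = eps + M * np.sqrt((2 * K * np.log(2) + 2 * np.log(1 / delta)) / n)
        print(n, round(gap, 4))
    # The sqrt term vanishes as n grows, leaving only the eps(s) term.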
SLIDE 28
Proof of the Main Theorem
- Let Ni be the set of indices of points of s that fall into Ci. Since the samples are i.i.d., (|N1|, · · · , |NK|) is a multinomial random vector with parameters n and (µ(C1), · · · , µ(CK)).
- The Bretagnolle-Huber-Carol inequality gives
Pr( Σ_{i=1}^K | |Ni|/n − µ(Ci) | ≥ λ ) ≤ 2^K exp(−nλ²/2).
- Hence, with probability at least 1 − δ,
Σ_{i=1}^K | |Ni|/n − µ(Ci) | ≤ √( (2K ln 2 + 2 ln(1/δ)) / n ).  (0.3)
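A Monte Carlo sanity check of the Bretagnolle-Huber-Carol inequality; the uniform cell probabilities and parameters are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(1)
    n, K, lam = 500, 4, 0.25
    mu = np.full(K, 1 / K)                        # uniform mu(C_i)
    counts = rng.multinomial(n, mu, size=20000)   # draws of (|N_1|, ..., |N_K|)
    dev = np.abs(counts / n - mu).sum(axis=1)     # sum_i | |N_i|/n - mu(C_i) |
    print((dev >= lam).mean(), 2**K * np.exp(-n * lam**2 / 2))
    # The empirical frequency stays below the 2^K exp(-n lam^2 / 2) bound.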
SLIDE 31
Proof of the Main Theorem (Cont.)
Furthermore,
|ℓ̂(As) − ℓemp(As)|
= | Σ_{i=1}^K E[ℓ(As, z) | z ∈ Ci] µ(Ci) − (1/n) Σ_{i=1}^n ℓ(As, si) |
≤ | Σ_{i=1}^K E[ℓ(As, z) | z ∈ Ci] |Ni|/n − (1/n) Σ_{i=1}^n ℓ(As, si) |
+ | Σ_{i=1}^K E[ℓ(As, z) | z ∈ Ci] µ(Ci) − Σ_{i=1}^K E[ℓ(As, z) | z ∈ Ci] |Ni|/n |
- The first term is bounded by (1/n) Σ_{i=1}^K Σ_{j∈Ni} max_{z∈Ci} |ℓ(As, sj) − ℓ(As, z)| ≤ ε(s).
- The second term is bounded by max_{z∈Z} |ℓ(As, z)| Σ_{i=1}^K | |Ni|/n − µ(Ci) | ≤ M Σ_{i=1}^K | |Ni|/n − µ(Ci) |.
- Combining the two bounds with (0.3) yields the theorem.
SLIDE 34
Additional Results: Pseudo Robustness
- Robustness – "similar performance" around each training sample.
- Pseudo robustness – "similar performance" around some training samples:
Definition (Pseudo Robustness)
Algorithm A is (K, ε(s), n̂(s)) pseudo robust if
- Z can be partitioned into K disjoint sets, denoted as {Ci}_{i=1}^K,
- and there exists a subset ŝ of training samples with |ŝ| = n̂(s),
- such that ∀s ∈ ŝ,
s, z ∈ Ci ⟹ |ℓ(As, s) − ℓ(As, z)| ≤ ε(s).
SLIDE 36
Additional Results: Pseudo Robustness
Theorem
If s consists of n i.i.d. samples, and A is (K, ε(s), n̂(s)) pseudo robust, then for any δ > 0, with probability at least 1 − δ,
|ℓ̂(As) − ℓemp(As)| ≤ (n̂(s)/n) ε(s) + M ( (n − n̂(s))/n + √( (2K ln 2 + 2 ln(1/δ)) / n ) ).
- The additional term is due to the "non-robust" training samples.
SLIDE 37
Outline
- 1. Algorithmic Robustness and Generalization Bound
- 2. Robust Algorithms
- 3. (Weak) Robustness is Necessary and Sufficient for
(Asymptotic) Generalizability
SLIDE 38
Which algorithms are robust?
Example (Majority Voting)
Let Y = {−1, +1}. Partition X into C1, · · · , CK, and use C(x) to denote the set to which x belongs. A new sample xa ∈ X is labeled by
As(xa) = 1 if Σ_{si∈C(xa)} 1(si|y = 1) ≥ Σ_{si∈C(xa)} 1(si|y = −1), and −1 otherwise.
If the loss function is ℓ(As, z) = f(z|y, As(z|x)) for some function f, then MV is (2K, 0) robust.
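A minimal sketch of the Majority Voting rule over a fixed partition of X = [0, 1) into K cells; the uniform partition and data handling are illustrative assumptions:

    import numpy as np

    def mv_fit_predict(train_x, train_y, test_x, K):
        """Label each test point by the majority label of its cell C(x)."""
        cell = lambda x: min(int(x * K), K - 1)
        votes = np.zeros(K)                  # net vote: sum of s_i|y in each cell
        for x, y in zip(train_x, train_y):   # labels y in {-1, +1}
            votes[cell(x)] += y
        # votes[i] >= 0 iff #(y = 1) >= #(y = -1) in C_i, matching the rule above
        return np.array([1 if votes[cell(x)] >= 0 else -1 for x in test_x])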
SLIDE 41
Which algorithms are robust?
Theorem
Fix γ > 0 and a metric ρ of Z. Suppose A satisfies |ℓ(As, z1) − ℓ(As, z2)| ≤ ε(s) for all z1, z2 with z1 ∈ s and ρ(z1, z2) ≤ γ, and N(γ/2, Z, ρ) < ∞. Then A is (N(γ/2, Z, ρ), ε(s)) robust.
Example (Lipschitz continuous functions)
If Z is compact w.r.t. metric ρ and ℓ(As, ·) is Lipschitz continuous with Lipschitz constant c(s), i.e., |ℓ(As, z1) − ℓ(As, z2)| ≤ c(s)ρ(z1, z2), ∀z1, z2 ∈ Z, then A is (N(γ/2, Z, ρ), c(s)γ) robust for all γ > 0.
- Similarly, SVM, Lasso, feed-forward neural networks and PCA are robust.
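The Lipschitz example makes the trade-off in γ concrete: on Z = [0, 1] with ρ(z, z′) = |z − z′|, the covering number N(γ/2, Z, ρ) is roughly 1/γ, so shrinking ε(s) = c(s)γ inflates K in the bound. A sketch under these illustrative assumptions:

    import numpy as np

    M, c, n, delta = 1.0, 2.0, 10**5, 0.01   # assumed constants
    for gamma in (0.5, 0.1, 0.02):
        K = int(np.ceil(1 / gamma))          # ~ N(gamma/2, [0, 1], |.|)
        eps = c * gamma                      # robustness parameter eps(s)
        bound = eps + M * np.sqrt((2 * K * np.log(2) + 2 * np.log(1 / delta)) / n)
        print(gamma, round(bound, 4))        # gamma trades eps(s) against K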
SLIDE 42
Which algorithms are robust?
A large margin classifier is a classification rule such that most of the training samples are "far away" from the classification boundary. We denote the distance of a point x to a classification rule ∆ by D(x, ∆).
Example (Large-margin classifier)
If there exist γ and n̂ such that Σ_{i=1}^n 1(D(si|x, As) > γ) ≥ n̂, then algorithm A is (2N(γ/2, X, ρ), 0, n̂) pseudo robust, provided that N(γ/2, X, ρ) < ∞.
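A sketch of the count n̂ in this example, with D(x, As) stubbed as |x| for a hypothetical boundary at the origin (data and boundary are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.normal(size=1000)                # training points s_i|x
    gamma = 0.5
    n_hat = int((np.abs(x) > gamma).sum())   # samples with margin above gamma
    print(n_hat)                             # the n-hat in the pseudo-robustness guarantee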
SLIDE 43
Outline
- 1. Algorithmic Robustness and Generalization Bound
- 2. Robust Algorithms
- 3. (Weak) Robustness is Necessary and Sufficient for
(Asymptotic) Generalizability
SLIDE 46
(Asymptotic) generalizability
Finite sample bound → asymptotic property
Definition
- 1. A learning algorithm A generalizes w.r.t. s if
lim sup_n { E_t[ℓ(As(n), t)] − (1/n) Σ_{i=1}^n ℓ(As(n), si) } ≤ 0.
- 2. A learning algorithm A generalizes w.p. 1 if it generalizes w.r.t. almost every s.
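As an illustrative contrast (not from the talk), a label-memorizing algorithm on pure-noise labels has zero training loss while its expected 0-1 test loss stays near 1/2, so the gap does not vanish and the algorithm fails to generalize:

    import numpy as np

    rng = np.random.default_rng(3)
    for n in (100, 1000, 10000):
        # Training loss of the memorizer is exactly 0; estimate its test loss
        # by comparing arbitrary predictions against fresh random {0, 1} labels.
        test_loss = (rng.integers(0, 2, n) != rng.integers(0, 2, n)).mean()
        print(n, round(test_loss - 0.0, 3))  # the gap stays near 0.5 for all n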
SLIDE 49
Weak robustness
Robustness → weak robustness
- Robustness requires that the sample space can be partitioned into disjoint subsets such that if a testing sample belongs to the same partitioning set as a training sample, then the two have similar loss.
- Weak robustness generalizes this notion by considering the average loss of testing samples and training samples: if for a large (in the probabilistic sense) subset of Z^n, the testing error is close to the training error, then the algorithm is weakly robust.
SLIDE 50
Weak robustness (cont.)
Definition
- 1. A learning algorithm A is weakly robust w.r.t. s if there exists a sequence {Dn ⊆ Z^n} such that Pr(t(n) ∈ Dn) → 1, where t(n) are n i.i.d. testing samples, and
lim sup_n max_{ŝ(n)∈Dn} { (1/n) Σ_{i=1}^n ℓ(As(n), ŝi) − (1/n) Σ_{i=1}^n ℓ(As(n), si) } ≤ 0.
- 2. A learning algorithm A is a.s. weakly robust if it is weakly robust w.r.t. almost every s.
SLIDE 51
All Learning is Robust!
Theorem
- 1. An algorithm A generalizes w.r.t. s if and only if it is weakly
robust w.r.t. s.
- 2. An algorithm A generalizes w.p. 1 if and only if it is a.s.
weakly robust.
SLIDE 53
Conclusion
Summary:
- Propose algorithmic robustness.
- Present a finite sample bound based on algorithmic robustness.
- Show that weak robustness is necessary and sufficient for generalizability.
Future Directions:
- Adaptive partition?
- Other robust algorithms?
- Better rate?