PAC-Bayes theory in supervised learning (La théorie PAC-Bayes en apprentissage supervisé)
Presentation at the LRI, Université Paris XI
François Laviolette, Laboratoire du GRAAL, Université Laval, Québec, Canada


  1. Outline: The mathematics of the PAC-Bayes Theory; PAC-Bayes bounds and algorithms; References.
PAC-Bayes theory in supervised learning (La théorie PAC-Bayes en apprentissage supervisé)
Presentation at the LRI, Université Paris XI
François Laviolette, Laboratoire du GRAAL, Université Laval, Québec, Canada
14 December 2010

  2. Summary
Today, I intend to:
- present the mathematics underlying the PAC-Bayes theory,
- present algorithms that consist in minimizing a PAC-Bayes bound, and
- compare the latter with existing algorithms.

  3. Definitions (Derivation of the classical PAC-Bayes bound)
Each example $(x, y) \in X \times \{-1, +1\}$ is drawn according to a distribution $D$. The (true) risk $R(h)$ and the training error $R_S(h)$ are defined as:
$$R(h) \stackrel{\text{def}}{=} \mathbb{E}_{(x,y)\sim D}\, I(h(x) \neq y)\,; \qquad R_S(h) \stackrel{\text{def}}{=} \frac{1}{m}\sum_{i=1}^{m} I(h(x_i) \neq y_i)\,.$$
The learner's goal is to choose a posterior distribution $Q$ on a space $\mathcal{H}$ of classifiers such that the risk of the $Q$-weighted majority vote $B_Q$ is as small as possible, where
$$B_Q(x) \stackrel{\text{def}}{=} \operatorname{sgn}\Big(\mathbb{E}_{h\sim Q}\, h(x)\Big)\,.$$
$B_Q$ is also called the Bayes classifier.
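These definitions can be sketched numerically. The toy setup below (decision stumps on 1-D inputs, the labeling rule, and all names) is purely illustrative and not from the presentation:

```python
import numpy as np

# Hypothetical toy hypothesis space H: decision stumps on 1-D inputs.
def make_stump(threshold, sign):
    return lambda x: sign * np.where(x >= threshold, 1, -1)

H = [make_stump(t, s) for t in (-0.5, 0.0, 0.5) for s in (+1, -1)]

rng = np.random.default_rng(0)
X = rng.normal(size=200)
y = np.where(X >= 0.1, 1, -1)          # labels from an unknown "true" rule

def empirical_risk(h, X, y):
    """R_S(h) = (1/m) * sum_i I(h(x_i) != y_i)."""
    return float(np.mean(h(X) != y))

def bayes_classifier(Q, X):
    """B_Q(x) = sgn(E_{h~Q} h(x)), the Q-weighted majority vote.
    Ties yield 0, which counts as an error against labels in {-1, +1}."""
    votes = sum(q * h(X) for q, h in zip(Q, H))
    return np.sign(votes)

Q = np.full(len(H), 1.0 / len(H))       # uniform posterior, for illustration
print(empirical_risk(lambda X_: bayes_classifier(Q, X_), X, y))
```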

  4. The Gibbs classifier
The PAC-Bayes approach does not directly bound the risk of $B_Q$. It bounds the risk of the Gibbs classifier $G_Q$: to predict the label of $x$, $G_Q$ draws $h$ from $\mathcal{H}$ according to $Q$ and predicts $h(x)$. The risk and the training error of $G_Q$ are thus defined as:
$$R(G_Q) = \mathbb{E}_{h\sim Q}\, R(h)\,; \qquad R_S(G_Q) = \mathbb{E}_{h\sim Q}\, R_S(h)\,.$$
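Concretely, for a finite hypothesis space the Gibbs risk is just the Q-weighted average of the individual risks. A minimal sketch, with made-up risk values:

```python
import numpy as np

# Hypothetical risks R(h) of three classifiers and posterior weights Q.
risks = np.array([0.10, 0.25, 0.40])
Q = np.array([0.5, 0.3, 0.2])           # must sum to 1

# R(G_Q) = E_{h~Q} R(h): a plain weighted average.
gibbs_risk = float(np.dot(Q, risks))    # 0.5*0.10 + 0.3*0.25 + 0.2*0.40
print(gibbs_risk)
```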

  5. $G_Q$, $B_Q$, and $KL(Q\|P)$
If $B_Q$ misclassifies $x$, then at least half of the classifiers (under measure $Q$) err on $x$. Hence:
$$R(B_Q) \leq 2\, R(G_Q)\,.$$
Thus, an upper bound on $R(G_Q)$ gives rise to an upper bound on $R(B_Q)$.
PAC-Bayes makes use of a prior distribution $P$ on $\mathcal{H}$. The risk bound depends on the Kullback-Leibler divergence:
$$KL(Q\|P) \stackrel{\text{def}}{=} \mathbb{E}_{h\sim Q} \ln \frac{Q(h)}{P(h)}\,.$$
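For a finite hypothesis space, this KL divergence is a simple sum. A sketch with an assumed uniform prior and an illustrative posterior:

```python
import numpy as np

def kl_divergence(Q, P):
    """KL(Q || P) = E_{h~Q} ln(Q(h)/P(h)) for discrete distributions."""
    Q, P = np.asarray(Q, float), np.asarray(P, float)
    mask = Q > 0                        # 0 * ln(0/p) = 0 by convention
    return float(np.sum(Q[mask] * np.log(Q[mask] / P[mask])))

P = np.array([0.25, 0.25, 0.25, 0.25])  # uniform prior over 4 classifiers
Q = np.array([0.70, 0.10, 0.10, 0.10])  # posterior concentrated on one h

print(kl_divergence(Q, P))   # > 0: the bound pays for moving Q away from P
print(kl_divergence(P, P))   # 0: no price when the posterior equals the prior
```

The divergence term is the "complexity price" in every bound that follows: the further the learned Q drifts from the prior P, the looser the guarantee.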

  6. A PAC-Bayes bound to rule them all! (J.R.R. Tolkien, roughly, or John Langford, less roughly)
Theorem 1 (Germain et al., 2009). For any distribution $D$ on $X \times Y$, for any set $\mathcal{H}$ of classifiers, for any prior distribution $P$ of support $\mathcal{H}$, for any $\delta \in (0, 1]$, and for any convex function $\mathcal{D} : [0,1] \times [0,1] \to \mathbb{R}$, we have
$$\Pr_{S\sim D^m}\!\left(\forall Q \text{ on } \mathcal{H}:\ \mathcal{D}\big(R_S(G_Q), R(G_Q)\big) \leq \frac{1}{m}\left[ KL(Q\|P) + \ln\!\left(\frac{1}{\delta}\, \mathbb{E}_{S\sim D^m}\, \mathbb{E}_{h\sim P}\, e^{m\, \mathcal{D}(R_S(h), R(h))}\right)\right]\right) \geq 1 - \delta\,.$$

  7. Proof of Theorem 1
Since $\mathbb{E}_{h\sim P}\, e^{m\, \mathcal{D}(R_S(h), R(h))}$ is a non-negative random variable, Markov's inequality gives
$$\Pr_{S\sim D^m}\!\left(\mathbb{E}_{h\sim P}\, e^{m\, \mathcal{D}(R_S(h), R(h))} \leq \frac{1}{\delta}\, \mathbb{E}_{S\sim D^m}\, \mathbb{E}_{h\sim P}\, e^{m\, \mathcal{D}(R_S(h), R(h))}\right) \geq 1 - \delta\,.$$
Hence, by taking the logarithm on each side of the inequality and by transforming the expectation over $P$ into an expectation over $Q$:
$$\Pr_{S\sim D^m}\!\left(\forall Q:\ \ln\!\left[\mathbb{E}_{h\sim Q}\, \frac{P(h)}{Q(h)}\, e^{m\, \mathcal{D}(R_S(h), R(h))}\right] \leq \ln\!\left[\frac{1}{\delta}\, \mathbb{E}_{S\sim D^m}\, \mathbb{E}_{h\sim P}\, e^{m\, \mathcal{D}(R_S(h), R(h))}\right]\right) \geq 1 - \delta\,.$$
Then, exploiting the fact that the logarithm is a concave function, an application of Jensen's inequality yields
$$\Pr_{S\sim D^m}\!\left(\forall Q:\ \mathbb{E}_{h\sim Q} \ln\!\left[\frac{P(h)}{Q(h)}\, e^{m\, \mathcal{D}(R_S(h), R(h))}\right] \leq \ln\!\left[\frac{1}{\delta}\, \mathbb{E}_{S\sim D^m}\, \mathbb{E}_{h\sim P}\, e^{m\, \mathcal{D}(R_S(h), R(h))}\right]\right) \geq 1 - \delta\,.$$

  8. Proof of Theorem 1 (cont.)
From basic logarithm properties, and from the fact that $\mathbb{E}_{h\sim Q} \ln \frac{P(h)}{Q(h)} = -KL(Q\|P)$, we now have
$$\Pr_{S\sim D^m}\!\left(\forall Q:\ -KL(Q\|P) + \mathbb{E}_{h\sim Q}\, m\, \mathcal{D}(R_S(h), R(h)) \leq \ln\!\left[\frac{1}{\delta}\, \mathbb{E}_{S\sim D^m}\, \mathbb{E}_{h\sim P}\, e^{m\, \mathcal{D}(R_S(h), R(h))}\right]\right) \geq 1 - \delta\,.$$
Then, since $\mathcal{D}$ is assumed convex, again by Jensen's inequality we have
$$\mathbb{E}_{h\sim Q}\, m\, \mathcal{D}(R_S(h), R(h)) \geq m\, \mathcal{D}\!\left(\mathbb{E}_{h\sim Q} R_S(h),\ \mathbb{E}_{h\sim Q} R(h)\right) = m\, \mathcal{D}\big(R_S(G_Q), R(G_Q)\big)\,,$$
which immediately implies the result. □

  9. Applicability of Theorem 1
How can we estimate $\ln\!\left(\frac{1}{\delta}\, \mathbb{E}_{S\sim D^m}\, \mathbb{E}_{h\sim P}\, e^{m\, \mathcal{D}(R_S(h), R(h))}\right)$?

  10. Seeger's bound (2002)
Seeger Bound. For any $D$, any $\mathcal{H}$, any $P$ of support $\mathcal{H}$, any $\delta \in (0, 1]$, we have
$$\Pr_{S\sim D^m}\!\left(\forall Q \text{ on } \mathcal{H}:\ \mathrm{kl}\big(R_S(G_Q), R(G_Q)\big) \leq \frac{1}{m}\left[ KL(Q\|P) + \ln\frac{\xi(m)}{\delta}\right]\right) \geq 1 - \delta\,,$$
where $\mathrm{kl}(q, p) \stackrel{\text{def}}{=} q \ln\frac{q}{p} + (1-q)\ln\frac{1-q}{1-p}$, and where $\xi(m) \stackrel{\text{def}}{=} \sum_{k=0}^{m} \binom{m}{k} (k/m)^k (1 - k/m)^{m-k}$.
Note: $\xi(m) \leq 2\sqrt{m}$.
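The bound is implicit in $R(G_Q)$: one inverts the binary kl by finding the largest $p$ whose divergence from the empirical Gibbs risk stays below the right-hand side. A minimal sketch (the input values to `seeger_bound` at the bottom are made up; bisection is one of several ways to invert kl):

```python
import math

def xi(m):
    """xi(m) = sum_{k=0}^m C(m,k) (k/m)^k (1-k/m)^(m-k); satisfies xi(m) <= 2*sqrt(m).
    Naive evaluation: fine for m up to ~1000, floats overflow well beyond that."""
    return sum(math.comb(m, k) * (k / m) ** k * (1 - k / m) ** (m - k)
               for k in range(m + 1))

def kl(q, p):
    """Binary KL divergence: kl(q, p) = q ln(q/p) + (1-q) ln((1-q)/(1-p))."""
    def term(a, b):
        return 0.0 if a == 0 else a * math.log(a / b)
    return term(q, p) + term(1 - q, 1 - p)

def seeger_bound(emp_risk, m, kl_qp, delta):
    """Largest p with kl(emp_risk, p) <= (KL(Q||P) + ln(xi(m)/delta)) / m,
    found by bisection; it upper-bounds R(G_Q) with probability >= 1 - delta."""
    rhs = (kl_qp + math.log(xi(m) / delta)) / m
    lo, hi = emp_risk, 1.0 - 1e-12
    for _ in range(100):
        mid = (lo + hi) / 2
        if kl(emp_risk, mid) <= rhs:
            lo = mid
        else:
            hi = mid
    return lo

# Illustrative numbers only: R_S(G_Q) = 0.1, m = 1000, KL(Q||P) = 5, delta = 0.05.
print(seeger_bound(emp_risk=0.1, m=1000, kl_qp=5.0, delta=0.05))
```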

  11. Graphical illustration of the Seeger bound
[Figure: plot of $\mathrm{kl}(0.1\,\|\,R(Q))$ as a function of $R(Q)$, with the resulting lower bound (borne inf) and upper bound (borne sup) on the risk marked where the curve crosses the right-hand side of the bound.]

  12. Proof of the Seeger bound
Follows immediately from Theorem 1 by choosing $\mathcal{D}(q, p) = \mathrm{kl}(q, p)$. Indeed, in that case we have
$$\begin{aligned}
\mathbb{E}_{S\sim D^m}\, \mathbb{E}_{h\sim P}\, e^{m\,\mathrm{kl}(R_S(h), R(h))} &= \mathbb{E}_{h\sim P}\, \mathbb{E}_{S\sim D^m} \left(\frac{R_S(h)}{R(h)}\right)^{m R_S(h)} \left(\frac{1 - R_S(h)}{1 - R(h)}\right)^{m(1 - R_S(h))} \\
&= \mathbb{E}_{h\sim P} \sum_{k=0}^{m} \Pr_{S\sim D^m}\!\left(R_S(h) = \tfrac{k}{m}\right) \left(\frac{k/m}{R(h)}\right)^{k} \left(\frac{1 - k/m}{1 - R(h)}\right)^{m-k} \\
&= \sum_{k=0}^{m} \binom{m}{k} (k/m)^k (1 - k/m)^{m-k} \qquad (1) \\
&\leq 2\sqrt{m}\,. \qquad \Box
\end{aligned}$$
Note that, in line (1) of the proof, $\Pr_{S\sim D^m}\!\left(R_S(h) = \frac{k}{m}\right)$ is replaced by the probability mass function of the binomial. This is only true if the examples of $S$ are drawn i.i.d. So this result is no longer valid in the non-i.i.d. case, even if Theorem 1 is.

  13. McAllester's bound (1998)
Put $\mathcal{D}(q, p) = \frac{1}{2}(q - p)^2$; Theorem 1 then gives
McAllester Bound. For any $D$, any $\mathcal{H}$, any $P$ of support $\mathcal{H}$, any $\delta \in (0, 1]$, we have
$$\Pr_{S\sim D^m}\!\left(\forall Q \text{ on } \mathcal{H}:\ \frac{1}{2}\big(R_S(G_Q) - R(G_Q)\big)^2 \leq \frac{1}{m}\left[ KL(Q\|P) + \ln\frac{\xi(m)}{\delta}\right]\right) \geq 1 - \delta\,,$$
where $\xi(m) \stackrel{\text{def}}{=} \sum_{k=0}^{m} \binom{m}{k} (k/m)^k (1 - k/m)^{m-k}$ as before.
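Unlike Seeger's bound, this one can be solved for $R(G_Q)$ in closed form: $\frac{1}{2}(R_S - R)^2 \leq \varepsilon$ gives $R \leq R_S + \sqrt{2\varepsilon}$. A sketch, with the same made-up input values as before:

```python
import math

def xi(m):
    """xi(m) = sum_{k=0}^m C(m,k) (k/m)^k (1-k/m)^(m-k).
    Naive evaluation: fine for m up to ~1000, floats overflow well beyond that."""
    return sum(math.comb(m, k) * (k / m) ** k * (1 - k / m) ** (m - k)
               for k in range(m + 1))

def mcallester_bound(emp_risk, m, kl_qp, delta):
    """Solve (1/2)(R_S - R)^2 <= (KL(Q||P) + ln(xi(m)/delta)) / m for R:
    R(G_Q) <= R_S(G_Q) + sqrt(2 * (KL(Q||P) + ln(xi(m)/delta)) / m)."""
    eps = (kl_qp + math.log(xi(m) / delta)) / m
    return emp_risk + math.sqrt(2.0 * eps)

# Illustrative numbers only: R_S(G_Q) = 0.1, m = 1000, KL(Q||P) = 5, delta = 0.05.
print(mcallester_bound(emp_risk=0.1, m=1000, kl_qp=5.0, delta=0.05))
```

The closed form is the price of looseness: since $\mathrm{kl}(q, p) \geq \frac{1}{2}(q - p)^2$ on the relevant range, the Seeger bound is never worse than this one for the same inputs.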
