Active Learning for Sparse Bayesian Multilabel Classification


  1. Active Learning for Sparse Bayesian Multilabel Classification. Deepak Vasisht (MIT & IIT Delhi), Andreas Damianou (University of Sheffield), Manik Varma (MSR India), Ashish Kapoor (MSR Redmond)

  2. Multilabel Classification Given a set of datapoints, the goal is to annotate them with a set of labels.

  3. Multilabel Classification Given a set of datapoints, the goal is to annotate them with a set of labels.

  4. Multilabel Classification Given a set of datapoints, the goal is to annotate them with a set of labels. x_i ∈ R^d: feature vector, d: dimension of the feature space

  5. Multilabel Classification Given a set of datapoints, the goal is to annotate them with a set of labels. x_i ∈ R^d: feature vector, d: dimension of the feature space. Example labels: Iraq, Sea, Sky, Flowers, Human, Brick, Sun

  6. Multilabel Classification Given a set of datapoints, the goal is to annotate them with a set of labels. x_i ∈ R^d: feature vector, d: dimension of the feature space. Example labels: Iraq, Sea, Sky, Flowers, Human, Brick, Sun
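A minimal sketch of how a dataset in this setting can be represented, assuming NumPy; the sizes and array names are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

N, d, L = 1000, 50, 7                 # datapoints, feature dimension, number of labels
X = rng.normal(size=(N, d))           # row i is the feature vector x_i in R^d
Y = rng.integers(0, 2, size=(N, L))   # row i is a binary label vector (Sea, Sky, Flowers, ...)

# Multilabel classification: learn a map from X to {0,1}^L,
# where each datapoint may carry several labels at once.
```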

  7. Training

  8. Training

  9. Training WikiLSHTC has 325k labels. Good luck with that!!

  10. Training Is Expensive • Training data can also be very expensive to obtain, e.g. genomic or chemical data • Getting each label incurs additional cost

  11. Training Is Expensive • Training data can also be very expensive to obtain, e.g. genomic or chemical data • Getting each label incurs additional cost. Need to reduce the required training data.

  12. Active Learning [figure: an N x L label matrix, rows 1..N are datapoints, columns 1..L are labels (Iraq, Flowers, Sun, Sky, ...)]

  13. Active Learning [figure: the same matrix with a few 0/1 entries revealed]

  14. Active Learning [figure: the same partially revealed label matrix]

  15. Active Learning [figure: the label matrix] Which data points should I label?

  16. Active Learning [figure: the label matrix] For a particular datapoint, which labels should I reveal?

  17. Active Learning [figure: the label matrix] Can I choose datapoint-label pairs to annotate?
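The three questions above amount to choosing different index sets in a partially observed label matrix. A minimal sketch of the three query modes, assuming NumPy; all names are illustrative:

```python
import numpy as np

N, L = 1000, 7
Y_observed = np.full((N, L), np.nan)      # NaN marks a label that has not been revealed yet

# Mode 1: query whole datapoints, i.e. reveal entire rows of Y
rows_to_label = [3, 17, 42]

# Mode 2: for one datapoint, query a subset of its labels (some columns of one row)
point, labels_to_reveal = 3, [0, 2]

# Mode 3: query individual datapoint-label pairs (single (i, j) entries)
pairs_to_annotate = [(3, 0), (17, 5), (42, 2)]

# Once the annotator responds, the queried entries are filled in, e.g.:
Y_observed[3, 0] = 1.0
```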

  18. In this talk • An active learner for multilabel classification that: • answers all three of the questions above • is computationally cheap • is non-myopic and near-optimal • incorporates label sparsity • achieves higher accuracy than the state of the art

  19. Classification

  20. Classification Model* [graphical model: feature vector x_i connected to the labels y_i^1, ..., y_i^L] *Kapoor et al., NIPS 2012

  21. Classification Model* [graphical model: x_i connected to a compressed space z_i^1, ..., z_i^k, which maps through Φ to the labels y_i^1, ..., y_i^L] *Kapoor et al., NIPS 2012

  22. Classification Model* [graphical model: x_i mapped by W into the compressed space z_i^1, ..., z_i^k, which maps through Φ to the labels y_i^1, ..., y_i^L] *Kapoor et al., NIPS 2012

  23. Classification Model* [graphical model: as above, with sparsity variables α_i^1, ..., α_i^L attached to the labels y_i^1, ..., y_i^L] *Kapoor et al., NIPS 2012

  24. Classification Model: Potentials. Feature-to-compressed-space potential: f_{x_i}(W, z_i) = exp(-||W^T x_i - z_i||^2 / (2 σ^2))

  25. Classification Model: Potentials. f_{x_i}(W, z_i) = exp(-||W^T x_i - z_i||^2 / (2 σ^2)); label-to-compressed-space potential: g_Φ(y_i, z_i) = exp(-||Φ y_i - z_i||^2 / (2 χ^2))
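A minimal sketch of the two potentials as written above, assuming NumPy; shapes are taken as W: d x k, Φ: k x L, x_i in R^d, y_i in {0,1}^L, z_i in R^k, and σ, χ are bandwidth parameters:

```python
import numpy as np

def f_x(W, x_i, z_i, sigma):
    """Feature potential: exp(-||W^T x_i - z_i||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((W.T @ x_i - z_i) ** 2) / (2.0 * sigma ** 2))

def g_phi(Phi, y_i, z_i, chi):
    """Label potential: exp(-||Phi y_i - z_i||^2 / (2 chi^2))."""
    return np.exp(-np.sum((Phi @ y_i - z_i) ** 2) / (2.0 * chi ** 2))
```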

  26. Classification Model: Priors. y_i^j ~ N(0, 1/α_i^j)

  27. Classification Model: Priors. y_i^j ~ N(0, 1/α_i^j); α_i^j ~ Γ(α_i^j; a_0, b_0)

  28. Sparsity Priors: a_0 = 10^-6, b_0 = 10^-6
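To see why a Gamma prior with a_0 = b_0 = 10^-6 encourages sparsity, α can be integrated out analytically: the marginal prior over each y_i^j is a Student-t, sharply peaked at zero with heavy tails. A sketch of that marginal log-density, assuming the shape/rate parametrisation of the Gamma and using SciPy's log-gamma function:

```python
import numpy as np
from scipy.special import gammaln

a0 = b0 = 1e-6   # sparsity hyperparameters from the slide

def log_marginal_prior(y, a0=a0, b0=b0):
    """log p(y) with alpha integrated out of N(y; 0, 1/alpha) * Gamma(alpha; a0, b0).
    The result is a Student-t density: sharply peaked at y = 0 and heavy-tailed,
    which is what pushes most label weights towards zero."""
    return (a0 * np.log(b0) + gammaln(a0 + 0.5) - gammaln(a0)
            - 0.5 * np.log(2.0 * np.pi)
            - (a0 + 0.5) * np.log(b0 + 0.5 * y ** 2))

print(log_marginal_prior(np.array([1e-3, 0.1, 1.0, 10.0])))
# the density near zero is orders of magnitude higher than away from it
```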

  29. Classification Model: f_{x_i}(W, z_i) = exp(-||W^T x_i - z_i||^2 / (2 σ^2)); g_Φ(y_i, z_i) = exp(-||Φ y_i - z_i||^2 / (2 χ^2)); y_i^j ~ N(0, 1/α_i^j); α_i^j ~ Γ(α_i^j; a_0, b_0)

  30. Classification Model [same model as slide 29] Problem: Exact inference is intractable.

  31. Inference: Variational Bayes [the graphical model from slide 23]

  32. Inference: Variational Bayes [the posterior over the compressed variables z_i is approximated by a Gaussian]

  33. Inference: Variational Bayes [the posteriors over z_i and the labels y_i are approximated by Gaussians]

  34. Inference: Variational Bayes [the posteriors over z_i and y_i are approximated by Gaussians; the posterior over the sparsity variables α_i is approximated by a Gamma]

  35. Inference: Variational Bayes [the resulting posterior over the labels y_i is an approximate Gaussian]

  36. Active Learning Criteria • Entropy is a measure of uncertainty. For a random variable X, the entropy is H(X) = -Σ_i P(x_i) log P(x_i) • Picks points far apart from each other • For a Gaussian process, H = (1/2) log|Σ| + const
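A minimal sketch of the Gaussian entropy referenced above, with the additive constant written out; assumes NumPy:

```python
import numpy as np

def gaussian_entropy(Sigma):
    """Differential entropy of N(mu, Sigma): 0.5 * log det(2*pi*e*Sigma)."""
    k = Sigma.shape[0]
    _, logdet = np.linalg.slogdet(Sigma)
    return 0.5 * (k * np.log(2.0 * np.pi * np.e) + logdet)
```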

  37. Active Learning Criteria • Mutual Information measures the reduction in uncertainty over the unlabeled space: MI(A, B) = H(A) - H(A|B) • Has been used successfully in past work for regression
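For jointly Gaussian variables the constant terms cancel and MI(A, B) reduces to a difference of log-determinants of covariance blocks. A sketch, assuming a joint covariance matrix Sigma and index lists for A and B (names are illustrative):

```python
import numpy as np

def gaussian_mutual_information(Sigma, idx_A, idx_B):
    """MI(A, B) = H(A) - H(A|B) = 0.5 * (log|S_AA| + log|S_BB| - log|S_joint|)."""
    logdet = lambda M: np.linalg.slogdet(M)[1]
    S_AA = Sigma[np.ix_(idx_A, idx_A)]
    S_BB = Sigma[np.ix_(idx_B, idx_B)]
    joint = list(idx_A) + list(idx_B)
    S_joint = Sigma[np.ix_(joint, joint)]
    return 0.5 * (logdet(S_AA) + logdet(S_BB) - logdet(S_joint))
```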

  38. Active Learning: Mutual Information • We have already modeled the distribution over the labels Y as a Gaussian process • The goal is to select the subset of labels that offers the maximum reduction in entropy over the remaining space: A* = argmax_{A ⊆ U} H(Y_{U\A}) - H(Y_{U\A} | A)

  39. Active Learning: Mutual Information [same objective as slide 38] Problem: Variance is not preserved across layers

  40. Idea: Collapsed Variational Bayes

  41. Idea: Collapsed Variational Bayes [the full graphical model with potentials f_{x_i}, g_Φ and the priors on y_i^j and α_i^j]

  42. Idea: Collapsed Variational Bayes [the same model]

  43. Idea: Collapsed Variational Bayes [the same model] Integrate to get a Gaussian distribution over Y

  44. Idea: Collapsed Variational Bayes [the same model] Integrate to get a Gaussian distribution over Y; use Variational Bayes for sparsity
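The point of collapsing is that the linear-Gaussian layers can be integrated out in closed form, leaving a single Gaussian whose covariance can be fed to the entropy and mutual-information computations. The sketch below is a simplified illustration under an assumed standard-normal prior on the columns of W, not the exact collapsed posterior of the paper:

```python
import numpy as np

def collapsed_label_covariance(X, sigma2, chi2):
    """Covariance across the N datapoints for each compressed coordinate of Phi*y,
    after integrating out W (assumed prior N(0, I) on each column) and z.
    For this linear-Gaussian chain it is X X^T + (sigma2 + chi2) I."""
    N = X.shape[0]
    return X @ X.T + (sigma2 + chi2) * np.eye(N)
```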

  45. Active Learning: Mutual Information [same objective as slide 38: A* = argmax_{A ⊆ U} H(Y_{U\A}) - H(Y_{U\A} | A)]

  46. Active Learning: Mutual Information [same objective as slide 38] Problem: Computing the Mutual Information still needs exponential time

  47. Solution: Approximate Mutual Information • Approximate the final distribution over Y by a Gaussian • Use the Gaussian to estimate the mutual information, MI_hat • Theorem 1: as a_0 → 0 and b_0 → 0, MI_hat → MI

  48. Active Learning: Mutual Information [same objective as slide 38] Problem: The subset selection problem is NP-complete

  49. Solution: Use Submodularity • Under some weak conditions, the objective is submodular • Submodularity ensures that the greedy solution is within a constant factor of the optimal solution

  50. Algorithm • Input: feature vectors for a set of unlabeled instances U and a budget n • Iteratively add to the labeled set A the datapoint x that leads to the maximum increase in MI: x ← argmax_{x ∈ U\A} MI_hat(A ∪ {x}) - MI_hat(A)
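A hedged sketch of the greedy loop, assuming the approximate mutual information MI_hat is computed from a Gaussian covariance over the pool (for instance the collapsed covariance sketched earlier); function and variable names are illustrative:

```python
import numpy as np

def greedy_select(Sigma, budget):
    """Greedily pick `budget` pool indices, each time adding the point x that
    maximises MI_hat(A + {x}) - MI_hat(A) under the Gaussian approximation."""
    N = Sigma.shape[0]
    selected, remaining = [], list(range(N))
    logdet = lambda idx: np.linalg.slogdet(Sigma[np.ix_(idx, idx)])[1]

    def mi_hat(A):
        # MI between the selected set A and the rest of the pool.
        rest = [i for i in range(N) if i not in A]
        if not A or not rest:
            return 0.0
        return 0.5 * (logdet(A) + logdet(rest) - logdet(list(range(N))))

    for _ in range(budget):
        # naive re-evaluation of the gain for every candidate; fine for a sketch
        gain, best = max((mi_hat(selected + [x]) - mi_hat(selected), x) for x in remaining)
        selected.append(best)
        remaining.remove(best)
    return selected
```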
