
Semisupervised Learning, Transfer Learning, and the Future at a Glance
Shan-Hung Wu <shwu@cs.nthu.edu.tw>
Department of Computer Science, National Tsing Hua University, Taiwan
Machine Learning


1. Semisupervised GAN

The generator and the discriminator do not need to play a zero-sum game. In a semisupervised GAN [11], the discriminator learns one extra class, "fake," in addition to the K real classes:

- Softmax output units $a^{(L)} = \hat{\rho} \in \mathbb{R}^{K+1}$ for $P(y \mid x, \Theta) \sim \mathrm{Categorical}(\rho)$
- Cost function ($L$ labeled, $M$ fake, $N - L$ unlabeled points):

$$\arg\min_{\Theta_{\mathrm{gen}}} \max_{\Theta_{\mathrm{dis}}} \sum_{n=1}^{L} \sum_{j=1}^{K} \mathbb{1}(y^{(n)} = j) \log \hat{\rho}^{(n)}_j + \sum_{m=1}^{M} \log \hat{\rho}^{(m)}_{K+1} + \sum_{n=L+1}^{N} \log\bigl(1 - \hat{\rho}^{(n)}_{K+1}\bigr)$$

- Real, labeled points should be classified correctly
- Generated points should be identified as fake
- Real, unlabeled points can be in any class except K + 1
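
To make the three terms concrete, here is a minimal sketch of the discriminator-side cost, assuming PyTorch (the slides give no code; the function and tensor names are illustrative):

```python
import torch
import torch.nn.functional as F

def discriminator_loss(logits_lab, y_lab, logits_fake, logits_unlab):
    """All logits have K+1 columns; column K is the extra "fake" class.
    Returns the negative of the slide's objective, for gradient descent."""
    K = logits_lab.size(1) - 1
    # 1) real, labeled points should be classified correctly
    loss_lab = F.cross_entropy(logits_lab, y_lab)
    # 2) generated points should be identified as fake: maximize log rho_{K+1}
    loss_fake = -F.log_softmax(logits_fake, dim=1)[:, K].mean()
    # 3) unlabeled real points can be any class except K+1:
    #    maximize log(1 - rho_{K+1})
    rho_fake = F.softmax(logits_unlab, dim=1)[:, K]
    loss_unlab = -torch.log1p(-rho_fake + 1e-8).mean()
    return loss_lab + loss_fake + loss_unlab
```

Per the argmin over Θ_gen, the generator is trained to minimize the same objective, i.e., to make its samples hard to identify as fake.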

2. Performance

State-of-the-art classification performance given:
- 100 labeled points (out of 60K) in MNIST
- 4K labeled points (out of 50K) in CIFAR-10

With generators: (results figure)

3. Outline

- Semisupervised Learning: Label Propagation; Semisupervised GAN; Semisupervised Clustering
- Transfer Learning: Multitask Learning & Weight Initialization; Domain Adaptation; Zero-Shot Learning; Unsupervised TL
- The Future at a Glance

4. Clustering

Clustering is an ill-posed problem. E.g., how should the following images be clustered into two groups? (figure)

5. Semisupervised Clustering

Different users may have different answers: user-perceived clusters ≠ clusters learned from the data.

Semisupervised clustering: ask the user for some side information to better uncover his or her perspective. In what form?

6. Point-Level Supervision

Side info: must-links and/or cannot-links.

Constrained K-means [13]: assign points to clusters without violating the constraints, as in the sketch below.
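
A minimal NumPy sketch of this idea in the COP-K-means style (the conflict-handling policy and the names are my assumptions, not the exact algorithm of [13]): each point goes to the nearest center whose cluster does not violate a constraint.

```python
import numpy as np

def violates(i, k, assign, must_link, cannot_link):
    # Cluster k is illegal for point i if a must-linked partner was already
    # placed in another cluster, or a cannot-linked partner is already in k.
    for a, b in must_link:
        if a == i and assign[b] not in (-1, k):
            return True
        if b == i and assign[a] not in (-1, k):
            return True
    for a, b in cannot_link:
        if (a == i and assign[b] == k) or (b == i and assign[a] == k):
            return True
    return False

def constrained_kmeans(X, K, must_link, cannot_link, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        assign = np.full(len(X), -1)
        for i in range(len(X)):
            # try clusters from nearest to farthest, skipping violations
            for k in np.argsort(((X[i] - centers) ** 2).sum(axis=1)):
                if not violates(i, k, assign, must_link, cannot_link):
                    assign[i] = k
                    break
            if assign[i] == -1:  # constraints conflict: no feasible cluster
                raise ValueError("constraints cannot be satisfied")
        new_centers = np.array([X[assign == k].mean(axis=0)
                                if (assign == k).any() else centers[k]
                                for k in range(K)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign, centers
```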

7. Sampling Bias

The sampling of pairwise constraints matters: in many applications it cannot be uniform.

E.g., suppose we want to cluster the products in an e-commerce website, using the click-streams provided by a user to obtain must-links implicitly. The user is unlikely to click products uniformly; instead, he or she may, e.g., click the products with the lowest prices.

8. Feature-Level Supervision I

Side info: perception vectors $\{p^{(n)} \in \mathbb{R}^B\}_{n=1}^{N}$
- E.g., bag-of-words vectors of the "reasons" (text) behind the must-/cannot-links
- $B$ is the vocabulary size
- $p^{(n)} \neq 0$ if point $n$ is covered by a must-/cannot-link

9. Feature-Level Supervision II

How do we get perception vectors when clustering products in an e-commerce website?
- Use the click-streams provided by the user as must-links
- Use the query that triggered the clicks as the perception vector (see the sketch below)

How do we learn from the perception vectors?
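
For instance, the queries could be turned into bag-of-words perception vectors like this (a small sketch assuming scikit-learn; the data and variable names are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

# one query per product; products that were never clicked get an empty string
queries = ["cheap red shoes", "red shoes sale", "", "wireless mouse"]
vectorizer = CountVectorizer()
vectorizer.fit([q for q in queries if q])     # vocabulary of size B
P = vectorizer.transform(queries).toarray()   # (N, B); unclicked rows stay zero
```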

10. Perception-Embedding Clustering

Perception-embedding clustering [4]: map every $x^{(n)} \in \mathbb{R}^D$ to a dense $f^{(n)} \in \mathbb{R}^B$ and cluster based on the $f^{(n)}$'s.

Cost function for the mapping:

$$\arg\min_{F, W, b} \|XW + 1_N b^\top - F\|^2 + \lambda \|S(F - P)\|^2,$$

where $X \in \mathbb{R}^{N \times D}$, $W \in \mathbb{R}^{D \times B}$, $b \in \mathbb{R}^{B}$, $S \in \mathbb{R}^{N \times N}$, and $F, P \in \mathbb{R}^{N \times B}$.

The embedding (parametrized by $W$ and $b$) applies to all points, thereby avoiding sampling bias.
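
Since the cost is quadratic in each block of variables, alternating minimization solves it. A minimal NumPy sketch, under the assumption (mine, not stated on the slide) that S is a diagonal 0/1 selector for the points that carry perception vectors:

```python
import numpy as np

def perception_embedding(X, P, s, lam=1.0, n_iter=50):
    """X: (N,D) features, P: (N,B) perception vectors, s: (N,) 0/1 mask."""
    N = X.shape[0]
    Xa = np.hstack([X, np.ones((N, 1))])   # absorb the bias b into W
    F = P.astype(float).copy()
    for _ in range(n_iter):
        # (W,b)-step: least-squares fit of F from X
        Wb, *_ = np.linalg.lstsq(Xa, F, rcond=None)
        A = Xa @ Wb                        # = XW + 1_N b^T
        # F-step: row-wise closed form of ||A - F||^2 + lam ||S(F - P)||^2
        F = (A + lam * s[:, None] * P) / (1.0 + lam * s[:, None])
    return F, Wb[:-1], Wb[-1]              # embeddings, W, b
```

The rows of F can then be clustered with plain K-means, and a new point x is embedded as W^T x + b.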

11. Outline

- Semisupervised Learning: Label Propagation; Semisupervised GAN; Semisupervised Clustering
- Transfer Learning: Multitask Learning & Weight Initialization; Domain Adaptation; Zero-Shot Learning; Unsupervised TL
- The Future at a Glance

12. Transfer Learning

In practice, we may not have enough data/supervision in X to generalize well in a task.
- Semisupervised learning: learn from unlabeled data
- Transfer learning: learn from data in other domains

Define the source and target tasks over X^(source) and X^(target). The goal is to use X^(source) to get better results in the target task (or vice versa). How? By learning the "correlations" between X^(source) and X^(target).

13. Branches [10]

(figure: branches of transfer learning, from [10])

14. Few-, One-, and Zero-Shot Learning

How much data do we need in X^(target) to allow knowledge transfer?
- Not much: transfer learning
- Very little: few-shot learning
- Only one example: one-shot learning
- None: zero-shot learning (how is that possible?)

15. Outline

- Semisupervised Learning: Label Propagation; Semisupervised GAN; Semisupervised Clustering
- Transfer Learning: Multitask Learning & Weight Initialization; Domain Adaptation; Zero-Shot Learning; Unsupervised TL
- The Future at a Glance

16. Multitask Learning

Jointly learn the source and target models, where both X^(source) and X^(target) have labels. The models share weights that capture the correlation between the data/tasks (see the sketch below).

Which layers should be shared in deep NNs?
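
A minimal sketch of such hard weight sharing, assuming PyTorch (names and sizes are illustrative): a shared trunk feeds one head per task.

```python
import torch.nn as nn

class MultitaskNet(nn.Module):
    def __init__(self, d_in, d_hidden, k_source, k_target):
        super().__init__()
        self.shared = nn.Sequential(               # weights shared across tasks
            nn.Linear(d_in, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d_hidden), nn.ReLU())
        self.head_source = nn.Linear(d_hidden, k_source)  # task-specific heads
        self.head_target = nn.Linear(d_hidden, k_target)

    def forward(self, x, task):
        h = self.shared(x)
        return self.head_source(h) if task == "source" else self.head_target(h)
```

Training sums (or alternates) the two task losses, so gradients from both tasks shape the shared trunk.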

17. Weight Sharing

Which layers to share is application dependent, e.g.:
- Shallow layers in image object recognition, to share filters/feature detectors
- Deep layers in speech transcription, to share the word map

18. Weight Initialization

A simpler way to transfer knowledge is to initialize the weights of the target model to those of the source model. This is very common in deep learning:
- Training a CNN over ImageNet [5] may take a week
- Many pre-trained NNs are available on the Internet, e.g., in the Model Zoo

It acts as a regularization technique rather than an optimization technique [3]. Which weights to borrow also depends on the application (see the sketch below).
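
A minimal sketch with torchvision (assuming its pretrained ResNet-18 as the source model; the 10-class target task is hypothetical):

```python
import torch.nn as nn
from torchvision import models

# initialize from ImageNet-pretrained weights (torchvision >= 0.13 API)
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
# replace only the classification head for the new target task
model.fc = nn.Linear(model.fc.in_features, 10)
```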

19. Fine-Tuning I

In addition to borrowing weights, we may update (fine-tune) them when training the target model.

Results from two CNNs (A and B) over ImageNet [14]: (figure)

20. Fine-Tuning II

Caution: fine-tuning does not always help!

To fine-tune or not?
- Large X^(target), similar X^(source): Yes
- Large X^(target), different X^(source): Yes (often still beneficial in practice)
- Small X^(target), similar X^(source): No (to avoid overfitting)
- Small X^(target), different X^(source): No; instead, train a simple classifier (e.g., a linear SVM) on top of the borrowed features (see the sketch after this list)
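
For the "small, similar" case, a minimal sketch that freezes the borrowed weights and trains only the new head (continuing the hypothetical ResNet-18 example above; the hyperparameters are illustrative):

```python
import torch

for p in model.parameters():
    p.requires_grad = False        # freeze all borrowed weights
for p in model.fc.parameters():
    p.requires_grad = True         # ...except the new classification head
optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-3, momentum=0.9)
```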

21. Outline

- Semisupervised Learning: Label Propagation; Semisupervised GAN; Semisupervised Clustering
- Transfer Learning: Multitask Learning & Weight Initialization; Domain Adaptation; Zero-Shot Learning; Unsupervised TL
- The Future at a Glance

22. Domain Adaptation

(figure)

23. Domain Adversarial Networks

Goal: learn domain-invariant features that help the source model adapt to the target task.

Approach: a domain classifier plus a gradient reversal layer [7], as sketched below.
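
The gradient reversal layer is the identity on the forward pass and multiplies the gradient by −λ on the backward pass, so the feature extractor learns to fool the domain classifier. A minimal sketch, assuming PyTorch:

```python
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)                  # identity on the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None  # reversed gradient; none for lam

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# usage: features -> label classifier (as usual), and
#        grad_reverse(features) -> domain classifier
```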

24. Outline

- Semisupervised Learning: Label Propagation; Semisupervised GAN; Semisupervised Clustering
- Transfer Learning: Multitask Learning & Weight Initialization; Domain Adaptation; Zero-Shot Learning; Unsupervised TL
- The Future at a Glance

25. Zero-Shot Learning

Zero-shot learning: transfer learning with X^(source) but an empty X^(target). How is that possible?

26. Label Representations

Side information: the semantic representations Ψ(y) of labels, e.g., "has paws," "has stripes," or "is black" for animal classes.

Assume that labels in different domains share the same semantic space. The embedding function Ψ can be learned jointly with the model (e.g., in Google Neural Machine Translation) or separately (e.g., in [1]).

27. Why Does Zero-Shot Learning Work?

In task A, a model uses the labeled pairs (x^(i), y^(i)) to learn the map between the spaces of Φ(x) and Ψ(y).

In task B (with zero shots), the model predicts the label of a point x′ by:
1. obtaining Φ(x′), then
2. following the map to find Ψ(y′) (see the sketch below)
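
A minimal NumPy sketch of the prediction step (all names are illustrative; the map here is assumed to be a linear regressor W learned on task A, and prediction is nearest-label by cosine similarity in the semantic space):

```python
import numpy as np

def zero_shot_predict(phi_x, W, Psi, label_names):
    """phi_x: (D,) features; W: (D,B) learned map; Psi: (C,B) label embeddings."""
    z = phi_x @ W                                   # predicted semantic vector
    sims = Psi @ z / (np.linalg.norm(Psi, axis=1) * np.linalg.norm(z) + 1e-8)
    return label_names[int(np.argmax(sims))]        # nearest (unseen) label
```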

28. Outline

- Semisupervised Learning: Label Propagation; Semisupervised GAN; Semisupervised Clustering
- Transfer Learning: Multitask Learning & Weight Initialization; Domain Adaptation; Zero-Shot Learning; Unsupervised TL
- The Future at a Glance
