Fast Direct Search in an Optimally Compressed Continuous Target Space for Efficient Multi-Label Active Learning


SLIDE 1

Introduction Methodology Results

Fast Direct Search in an Optimally Compressed Continuous Target Space for Efficient Multi-Label Active Learning

Weishi Shi and Qi Yu

B. Thomas Golisano College of Computing and Information Sciences
Rochester Institute of Technology

Jun 2019

Weishi Shi and Qi Yu Multi-label Active Learning

SLIDE 2

Multi-Label Active Learning

Multi-label classification (ML-C) aims to learn a model that automatically assigns a set of relevant labels to a data instance. Multi-label problems naturally arise in many applications, including various image classification and video/audio recognition tasks. Data labeling for model training becomes more labor intensive because each label in a potentially large label space must be checked, which makes active learning especially important.

Key challenges for multi-label active learning:
  • The sampling measure is hard to design due to label correlations.
  • Rare labels are much harder to detect.
  • Computational cost increases quickly with the number of labels.



SLIDE 4

CS-BPCA Label Transformation

We propose a principled two-level label transformation strategy (Compressed Sensing (CS) + Bayesian Principal Component Analysis (BPCA)) that enables multi-label active learning to be performed in an optimally compressed target space.

CS-BPCA: Two-level Label Transformation

  Original label space (Y) --CS--> Compressed space (R) --BPCA--> Target space (U)

An MOGP operates on the target space U: data samples are compressed for sampling in the forward direction and recovered for prediction in the reverse direction.

Key properties of the transformed label space:
  • Optimally compressed: the optimal compression rate is determined automatically.
  • Orthogonal: label correlation is fully decoupled.
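The two-level transformation can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it uses a random Gaussian projection as the CS encoder and ordinary PCA in place of Bayesian PCA (BPCA would infer the number of retained components automatically); all dimensions and matrices below are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy multi-label matrix Y: n instances, d sparse binary labels.
n, d = 200, 50
Y = (rng.random((n, d)) < 0.05).astype(float)

# Level 1 -- Compressed Sensing: project the sparse label vectors
# with a random Gaussian sensing matrix (a standard CS encoder).
m = 20                                # compressed dimension (assumed)
Phi = rng.normal(size=(m, d)) / np.sqrt(m)
R = Y @ Phi.T                         # compressed space R, shape (n, m)

# Level 2 -- PCA on the compressed codes (stand-in for Bayesian PCA).
R_centered = R - R.mean(axis=0)
_, S, Vt = np.linalg.svd(R_centered, full_matrices=False)
k = int(np.sum(S > 1e-8 * S[0]))      # crude effective-rank cut-off
U = R_centered @ Vt[:k].T             # continuous target space U

# Columns of U are orthogonal scores, so label correlation is decoupled.
C = U.T @ U
assert np.allclose(C - np.diag(np.diag(C)), 0.0, atol=1e-6)
```

The key property the sketch demonstrates is the last assertion: after the second-level rotation, the target dimensions are mutually orthogonal, so a sampling function can treat them independently.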


SLIDE 5

Multi-output GP (MOGP) based Data Sampling

Two key benefits:
  • Outputs the predictive entropy, which provides an informative measure for uncertainty-based data sampling.
  • Uses a flexible covariance function to precisely capture the covariance structure of the input data.

A flexible kernel function:

  k(xᵢ, xⱼ) = θ₀ exp{−(θ₁/2)‖xᵢ − xⱼ‖²} + θ₂ xᵢᵀxⱼ + θ₃

Applying the MOGP to the optimally compressed target space works because the space is:
  • Continuous: consistent with the MOGP assumption;
  • Compact: efficient computation;
  • Weighted: precise sampling;
  • Orthogonal: label correlation is decoupled.
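The kernel above combines a squared-exponential term, a linear term, and a bias. A direct sketch of it (the θ values below are illustrative, not from the paper):

```python
import numpy as np

def mogp_kernel(xi, xj, theta):
    """k(xi, xj) = t0 * exp(-(t1/2) * ||xi - xj||^2) + t2 * xi^T xj + t3
    (squared-exponential + linear + bias terms)."""
    t0, t1, t2, t3 = theta
    sq = float(np.sum((xi - xj) ** 2))
    return t0 * np.exp(-0.5 * t1 * sq) + t2 * float(xi @ xj) + t3

def kernel_matrix(X, theta):
    """Gram matrix K with K[a, b] = k(X[a], X[b])."""
    m = X.shape[0]
    K = np.empty((m, m))
    for a in range(m):
        for b in range(m):
            K[a, b] = mogp_kernel(X[a], X[b], theta)
    return K

X = np.random.default_rng(1).normal(size=(5, 3))
theta = (1.0, 0.5, 0.1, 0.01)         # hypothetical hyper-parameters
K = kernel_matrix(X, theta)
assert np.allclose(K, K.T)            # a valid covariance is symmetric
```

The squared-exponential part captures smooth local similarity, the linear part captures global trends, and θ₃ acts as a constant offset; together they give the flexibility the slide refers to.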


SLIDE 6

Gradient-free Hyper-parameter Optimization

High computational cost of gradient-based methods:
  • Computing the gradient of the likelihood over each hyper-parameter until convergence (via p iterations): O(|θ|·p·m³). This must be run multiple times because the likelihood is non-convex.
  • Constructing the covariance matrix of the input data: O(m²·n).
  • Overall complexity: O(|θ|(p·m³ + m²·n)).

Fast kernel re-estimation for covariance matrix construction: we separate out two blocks of computation that are invariant to θ and only partially update the kernel matrix, reducing covariance matrix construction from O(m²·n) to O(m²).
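The re-estimation trick follows directly from the kernel's form: the pairwise squared distances and inner products do not depend on θ, so they can be cached once (O(m²·n)) and every subsequent kernel rebuild is a scalar recombination (O(m²)). A sketch, with toy sizes and illustrative θ values:

```python
import numpy as np

# One-time, theta-invariant blocks: O(m^2 * n) each.
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 8))           # m = 50 points, n = 8 features

sq = np.sum(X ** 2, axis=1)
D2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)   # ||xi - xj||^2
G = X @ X.T                                        # xi^T xj

def rebuild_kernel(theta):
    """Re-estimate K for a new theta in O(m^2): the distance block D2
    and Gram block G never change, only the scalar mixture does."""
    t0, t1, t2, t3 = theta
    return t0 * np.exp(-0.5 * t1 * D2) + t2 * G + t3

K1 = rebuild_kernel((1.0, 0.5, 0.1, 0.01))
K2 = rebuild_kernel((2.0, 0.1, 0.0, 0.1))  # cheap: no pass over X
assert K1.shape == (50, 50) and np.allclose(K1, K1.T)
```

Because hyper-parameter search evaluates the likelihood at many candidate θ values, moving the O(m²·n) work out of the inner loop is what makes the direct search methods on the next slide affordable.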


SLIDE 7

Gradient-free Hyper-parameter Optimization

Bayesian Optimization (B-OPT):
  • Uses expected improvement as a cheap surrogate of the likelihood to choose a candidate θ from the grid search space.
  • Requires a predefined grid search space.

Simplex Optimization (S-OPT):
  • Explores the search space by evolving (i.e., expanding, reflecting, and contracting) a simplex.
  • Explores the search space automatically.

Overall complexity reduction: O(|θ|(p·m³ + m²·n)) → O(q·m³ + m²), where q ≪ p.
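The simplex search described for S-OPT is the Nelder-Mead scheme. A minimal sketch of gradient-free hyper-parameter optimization on a single-output GP marginal likelihood, using SciPy's Nelder-Mead implementation; the toy data, the log-parametrization, and the three-parameter kernel are assumptions for the example, not the paper's setup:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 2))                  # m = 30 toy inputs
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)

sq = np.sum(X ** 2, axis=1)
D2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)   # theta-invariant block

def neg_log_marginal(log_theta):
    """Negative GP log marginal likelihood (up to an additive constant);
    optimizing log-parameters keeps every theta positive."""
    t0, t1, noise = np.exp(log_theta)
    K = t0 * np.exp(-0.5 * t1 * D2) + (noise + 1e-8) * np.eye(len(y))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return 0.5 * float(y @ alpha) + float(np.log(np.diag(L)).sum())

# Gradient-free simplex search: each step reflects, expands, or
# contracts a simplex of candidate thetas; no likelihood gradients.
res = minimize(neg_log_marginal, x0=np.zeros(3), method="Nelder-Mead")
assert res.fun <= neg_log_marginal(np.zeros(3))
```

Each simplex step costs one likelihood evaluation (one O(m³) Cholesky) rather than |θ| gradient computations, which is where the q ≪ p saving comes from.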


SLIDE 8

Benchmark Datasets and Compared Models

Summary of Datasets

Dataset    Domain       Instances  Features  Labels  Label Card.  Label Sparsity
Delicious  web          8172       500       157     5.56         0.03
BookMark   publication  38548      2150      136     3.45         0.02
WebAPI     software     9166       5659      90      2.50         0.02
Corel5K    images       5000       499       132     3.25         0.02
Bibtex     text         7013       1836      127     2.4          0.02

Competitive active learning models for multi-label classification:
  • Type I models: perform active learning in a compressed label space (CS-MIML, CS-BR, CS-RR).
  • Type II models: perform active learning in the original label space (MMC, Adaptive).


SLIDE 9

Comparison Results

[Figure: Comparison Result I. F-score vs. number of active iterations (50 to 500) on the WebAPI, Delicious, Bookmark, Corel5K, and Bibtex datasets, comparing CS-BPCA-GP against the Type I models CS-BR, CS-RR, and CS-MIML.]

[Figure: Comparison Result II. F-score vs. number of active iterations (50 to 500) on the reduced WebAPI, Delicious, Bookmark, Corel5K, and Bibtex datasets, comparing CS-BPCA-GP against the Type II models MMC and Adaptive.]


SLIDE 10

Rare Label Prediction Comparison

[Figure: Rare Label Prediction Comparison. Per-label recall, with labels ordered by increasing label frequency, on the Bookmark, Delicious, Corel5K, Bibtex, and Web-API datasets, comparing Adaptive and CS-BPCA-GP.]

The proposed model is effective at detecting rare labels by leveraging label correlation.


SLIDE 11

CPU Time of Hyper-parameter Optimization

Dataset    GA     B-OPT  S-OPT
Delicious  1.83   0.17   0.20
BookMark   15.0   0.80   0.79
WebAPI     10.10  0.54   0.55
Corel5K    0.58   0.08   0.08
Bibtex     8.71   0.48   0.51

The proposed direct search methods learn the kernel parameters 10 to 15 times faster than the gradient-based methods.


SLIDE 12

Conclusions

  • Propose a two-level CS-BPCA process to generate an optimally compressed, weighted, orthogonal, and continuous target space to support multi-label data sampling.
  • Propose an MOGP-based sampling function that accurately captures the covariance structure of the input data.
  • Propose gradient-free hyper-parameter optimization to enable fast online active learning.
  • Apply to real-world multi-label datasets from diverse domains to evaluate the effectiveness of the proposed model.

Poster ID: 261
