Extreme F-measure Maximization

Kalina Jasinska¹, Karlson Pfannschmidt², Róbert Busa-Fekete², Krzysztof Dembczyński¹
¹ Intelligent Decision Support Systems Laboratory (IDSS), Poznań University of Technology, Poland
² Department of …


  1. Theoretical results
• Estimation of the threshold on a validation set is statistically consistent, with provable regret bounds.²

² N. Nagarajan, S. Koyejo, P. Ravikumar, and I. Dhillon. Consistent binary classification with generalized performance metrics. In NIPS 27, pages 2744–2752, 2014.
H. Narasimhan, R. Vaish, and S. Agarwal. On the statistical consistency of plug-in classifiers for non-decomposable performance measures. In NIPS, 2014.
S. Puthiya Parambath, N. Usunier, and Y. Grandvalet. Optimizing F-measures by cost-sensitive classification. In NIPS 27, pages 2123–2131, 2014.
W. Kotłowski and K. Dembczyński. Surrogate regret bounds for generalized classification performance metrics. In ACML, 2015.
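Plug-in threshold tuning on a validation set can be sketched as follows: sort the examples by predicted probability and evaluate the F-measure at every possible cut in one pass. This is a minimal illustration of the idea, not the implementation used in the cited works.

```python
import numpy as np

def tune_threshold(probs, labels):
    """Pick the threshold on a validation set that maximizes the F-measure.

    Candidate thresholds are the predicted probabilities themselves;
    sorting lets us evaluate F for every cut in O(n log n).
    """
    order = np.argsort(-probs)          # descending by predicted probability
    y = labels[order]
    tp = np.cumsum(y)                   # true positives if top-k are predicted positive
    k = np.arange(1, len(y) + 1)        # number of predicted positives at each cut
    f = 2 * tp / (k + y.sum())          # F1 = 2 TP / (|pred+| + |true+|)
    best = np.argmax(f)
    return probs[order][best], f[best]  # (threshold, F-measure at that threshold)
```

For example, `tune_threshold(np.array([0.9, 0.8, 0.3, 0.2]), np.array([1, 0, 1, 0]))` selects the cut that predicts the top three examples positive, where F = 0.8.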


  6. Online F-measure optimization
• Online update of the threshold, exploiting that F* = F(τ*) = 2τ*.³
• Converges to the optimal threshold.
• Requires storing only a small constant number of auxiliary variables.
• Can either be applied on a validation set or run simultaneously with training of the class probability model.
• For large validation sets, one pass over the data should give an accurate estimate of the threshold.

³ R. Busa-Fekete, B. Szörényi, K. Dembczyński, and E. Hüllermeier. Online F-measure optimization. In NIPS 29, 2015.


  33. Online F-measure Maximization
• In each round t:
◮ Example x_t is observed,
◮ Model g is applied to x_t to get η̂(x_t) = P̂(y_t = 1 | x_t),
◮ Prediction ŷ_t is computed as ŷ_t = ⟦η̂(x_t) ≥ τ_{t−1}⟧,
◮ Label y_t is revealed,
◮ Threshold τ_t is computed as τ_t = F_t/2 = a_t/b_t, with a_t = a_{t−1} + y_t ŷ_t and b_t = b_{t−1} + y_t + ŷ_t (a_0 and b_0 → prior).
[Figure: the rounds unfolding over a stream x_1, x_2, …: for each x_t the estimate η̂(x_t), the prediction ŷ_t, the revealed label y_t, and the updated threshold τ_t.]
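The round-by-round update above can be sketched in a few lines. The function name and the default prior (a_0, b_0) are illustrative choices, not taken from the paper; the stream here supplies the probability estimate directly in place of a trained model g.

```python
def online_f_measure_threshold(stream, eta_hat, a0=1.0, b0=2.0):
    """OFO-style online threshold update.

    `stream` yields (x_t, y_t) pairs; `eta_hat(x)` estimates P(y=1|x).
    a accumulates true positives, b accumulates |true+| + |pred+|,
    so F_t = 2a/b and the threshold is tau_t = F_t/2 = a/b.
    """
    a, b = a0, b0
    tau = a / b
    for x_t, y_t in stream:
        y_hat = 1 if eta_hat(x_t) >= tau else 0   # predict with tau_{t-1}
        a += y_t * y_hat                          # a_t = a_{t-1} + y_t * y_hat_t
        b += y_t + y_hat                          # b_t = b_{t-1} + y_t + y_hat_t
        tau = a / b                               # tau_t = F_t / 2
    return tau
```

Only a, b, and τ are stored, which is the "small constant number of auxiliary variables" from the previous slide.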


  36. Beyond binary problems
• All the above approaches work well.
• Computational issues can almost be ignored in binary problems.
• But do they scale to X-MLC?


  45. Macro-averaging of the F-measure
• m labels.
• Test set of size n, {(x_i, y_i)}_{i=1}^n.
• The true label vector: y_i = (y_{i1}, …, y_{im}).
• The predicted label vector: ŷ_i = (ŷ_{i1}, …, ŷ_{im}).
• The macro F-measure:
  F^M = (1/m) ∑_{j=1}^m F(y_{·j}, ŷ_{·j}) = (1/m) ∑_{j=1}^m [ 2 ∑_{i=1}^n y_{ij} ŷ_{ij} / ( ∑_{i=1}^n y_{ij} + ∑_{i=1}^n ŷ_{ij} ) ].
[Figure: two n × m binary matrices side by side, the true labels y_{ij} and the predicted labels ŷ_{ij}; the macro F-measure averages the per-column (per-label) F-measures.]
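The formula above amounts to computing the binary F-measure down each column of the label matrices and averaging. A minimal sketch (the zero-denominator convention for labels with no positives at all is my choice; the slides do not specify it):

```python
import numpy as np

def macro_f_measure(Y, Y_hat):
    """Macro-averaged F-measure over n x m binary label matrices.

    Each column j is one label; F is averaged over the m columns.
    A column with no true and no predicted positives contributes
    F = 0 here (an assumed convention for the 0/0 case).
    """
    tp = (Y * Y_hat).sum(axis=0)                 # per-label true positives
    denom = Y.sum(axis=0) + Y_hat.sum(axis=0)    # |true+| + |pred+| per label
    f = np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 0.0)
    return f.mean()
```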


  53. Macro-averaging of the F-measure
• Can be solved by reduction to m independent binary F-measure maximization problems.⁴
• Can we use the above threshold tuning methods?
• A naive adaptation of them can be costly:
◮ We need CPEs for all labels and examples in the validation set.
◮ For m > 10⁵ and n > 10⁵, at least 10¹⁰ predictions have to be computed and potentially stored.
• Solution:
◮ To compute the F-measure we only need the true positive labels (y_{ij} = 1) and the predicted positive labels (ŷ_{ij} = 1).
◮ Therefore, to reduce the complexity, we need to deliver sparse probability estimates (SPEs).

⁴ O. Koyejo, N. Natarajan, P. Ravikumar, and I. S. Dhillon. Consistent multilabel classification. In NIPS 29, 2015.
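The observation that only positives matter can be made concrete: if the true and predicted positives are kept as sparse index sets per label, the macro F-measure costs time proportional to the number of positives rather than n·m. This is an illustrative sketch of why SPEs suffice, not the authors' implementation.

```python
def macro_f_from_sparse(true_pos, pred_pos, m):
    """Macro F-measure from sparse positives only.

    `true_pos[j]` and `pred_pos[j]` are sets of example indices with
    y_ij = 1 and y_hat_ij = 1; labels absent from both dicts have
    F = 0 and are skipped, so the n x m matrices never materialize.
    """
    total = 0.0
    for j in set(true_pos) | set(pred_pos):
        t = true_pos.get(j, set())
        p = pred_pos.get(j, set())
        denom = len(t) + len(p)          # |true+| + |pred+| for label j
        if denom:
            total += 2 * len(t & p) / denom
    return total / m                     # absent labels contribute F = 0
```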

  54. Outline
1 Extreme multi-label classification
2 The F-measure
3 Efficient sparse probability estimators
4 Experimental results
5 Summary


  57. Efficient sparse probability estimators
• Sparse probability estimates (SPEs): CPEs of the top labels, or CPEs exceeding a given threshold.
• We need multi-label classifiers that deliver SPEs efficiently: efficient sparse probability estimators.
• Two examples: FastXML⁵ and PLT.⁶

⁵ Y. Prabhu and M. Varma. FastXML: A fast, accurate and stable tree-classifier for extreme multi-label learning. In KDD, pages 263–272. ACM, 2014.
⁶ K. Jasinska and K. Dembczyński. Consistent label tree classifiers for extreme multi-label classification. In The ICML Workshop on Extreme Classification, 2015.

  58. FastXML
• Based on standard decision trees.⁷
• Uses an ensemble of trees to improve predictive performance.
• Sparse linear classifiers trained to maximize nDCG in the internal nodes.
• Empirical label distributions in the leaves.
• Very efficient training procedure.
[Figure: a decision tree with linear splits w_k · x ≥ 0 in the internal nodes and sparse label-probability estimates η̂_j(x) stored in the leaves.]

⁷ L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth and Brooks, Monterey, CA, 1984.


  60. FastXML
• Most importantly: FastXML delivers SPEs.
◮ Leaf nodes cover only a small part of the feature space.
[Figure: the same tree; the example is routed to a single leaf, which holds a short sparse list of labels with their estimated probabilities.]
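Prediction in such a tree ensemble can be sketched as follows: route the example down each tree by the w · x ≥ 0 tests, average the sparse leaf distributions, and return only the top-scoring labels. The `Node` structure and function names are hypothetical; this only illustrates how a FastXML-style tree yields sparse estimates, not the actual implementation.

```python
import numpy as np

class Node:
    """Internal node with a linear split w, or a leaf with a sparse
    label distribution {label: probability}. Hypothetical structure."""
    def __init__(self, w=None, left=None, right=None, dist=None):
        self.w, self.left, self.right, self.dist = w, left, right, dist

def predict_spe(trees, x, top_k=5):
    """Return the top-k (label, probability) pairs for x, averaged
    over the ensemble — only labels seen in the reached leaves ever
    get a nonzero score, so the estimates stay sparse."""
    scores = {}
    for root in trees:
        node = root
        while node.dist is None:                       # descend to a leaf
            node = node.right if node.w @ x >= 0 else node.left
        for label, p in node.dist.items():             # accumulate sparse dist
            scores[label] = scores.get(label, 0.0) + p / len(trees)
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]
```

Because each example visits one leaf per tree, the cost per prediction depends on tree depth and leaf sparsity, not on the total number of labels m.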
