Smaller, more accurate regression forests using tree alternating optimization


SLIDE 1

Smaller, more accurate regression forests using tree alternating optimization

Arman Zharmagambetov and Miguel Á. Carreira-Perpiñán

Dept. of Computer Science and Engineering
University of California, Merced

ICML, July 2020

SLIDE 2

Overview

A forest (= ensemble of decision trees) is a powerful machine learning method that has been successfully applied in many applications: computer vision, speech processing, NLP, etc. Forests are often (part of) the winning methods in ML competitions and challenges. Some examples of forests:

• Random forests train each tree independently on a different data sample (bagging).
• Boosted Trees sequentially train trees on reweighted versions of the data.

In both cases, the forest prediction is obtained by weighted averaging or voting. We focus on regression forests, where the prediction is a continuous scalar or vector, using bagging.
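For concreteness, a minimal sketch of this train-independently-then-average scheme, using off-the-shelf CART regressors from scikit-learn as placeholder base learners (not the TAO trees introduced later); the helper names fit_bagged_forest and forest_predict are hypothetical.

    # Minimal bagging sketch: T trees trained independently on bootstrap samples;
    # the forest prediction is the average of the trees' predictions.
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def fit_bagged_forest(X, y, n_trees=30, seed=0):
        rng = np.random.default_rng(seed)
        trees = []
        for _ in range(n_trees):
            idx = rng.integers(0, len(X), size=len(X))   # bootstrap sample of the training data
            trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
        return trees

    def forest_predict(trees, X):
        return np.mean([t.predict(X) for t in trees], axis=0)   # average the trees' predictions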

SLIDE 3

Overview

Random forests (and their variations) for regression have important advantages:
• high predictive accuracy
• few hyperparameters to set
• reasonably fast to train and scalable to large datasets
• considered to be robust against overfitting.

But they have an important disadvantage: the individual trees they learn are far from accurate. This is due to two reasons:
• Each tree is axis-aligned (a decision node tests a single feature, e.g. “if x7 ≥ 3 then go right”, illustrated in the small sketch below). This is a very restrictive model, particularly for correlated features.
• Standard tree learning algorithms, based on greedy recursive tree growth (such as CART), are highly suboptimal.

There exist a few works that propose forests of more complex trees (Menze et al., 2000; Breiman, 2001; Frank & Kramer, 2004), but they mostly focus on classification rather than regression and improve only marginally over conventional axis-aligned trees.
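As a tiny illustration of the axis-aligned vs. oblique distinction (the feature vector and weights below are hypothetical, purely for illustration):

    import numpy as np

    x = np.array([0.2, -1.0, 3.5])               # hypothetical input with features x1, x2, x3
    axis_aligned_goes_right = x[2] >= 3.0        # axis-aligned test: "if x3 >= 3 then go right"
    w, w0 = np.array([0.5, -0.2, 1.0]), -3.0     # hypothetical hyperplane parameters
    oblique_goes_right = w @ x + w0 >= 0         # oblique test: "if w^T x + w0 >= 0 then go right"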

SLIDE 4

Our idea

We address both issues by:

• using trees with more complex nodes (oblique, i.e., hyperplane nodes)
• using a better optimization algorithm to learn the tree.

The resulting forests are smaller, shallower and much more accurate, consistently over various datasets. We build on a recently proposed algorithm for learning classification trees, Tree Alternating Optimization (TAO) (Carreira-Perpiñán et al., 2018). TAO finds good approximate optima of an objective function over a tree with predetermined structure, and it applies to trees beyond axis-aligned splits. We adapt TAO to the regression case and then use it in combination with bagging to learn forests of oblique trees.

SLIDE 5

Optimizing a single tree with TAO: general formulation

We consider trees whose nodes make hard decisions (not soft trees). Optimizing such trees is difficult because they are not differentiable. Assuming a tree structure T is given (say, complete binary of depth ∆), consider the following optimization problem over its parameters, given a training set {(xn, yn)}n=1..N:

E(Θ) = ∑n=1..N L(yn, T(xn; Θ)) + α ∑i∈N φi(θi)

Θ = {θi}i∈N is the set of parameters of all tree nodes. The loss function L(y, z) is the squared error ‖y − z‖₂² in our case (although it is possible to use other losses, such as the least absolute deviation or a robust loss). The regularization term φi (e.g. the ℓ1 norm) penalizes the parameters θi of each node.
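A small sketch of how this objective could be evaluated for an oblique tree with linear leaves, under the squared-error loss and ℓ1 penalty stated above. The dictionary-based node layout ('w', 'w0', 'left', 'right' for decision nodes, 'g' for leaves) and the theta_list argument (one flat parameter vector per node) are hypothetical, chosen only to make the formula concrete; this is not the authors' implementation.

    import numpy as np

    def tree_predict(node, x):
        """T(x; Theta): route x through the hard decision nodes, then apply the leaf predictor."""
        if 'g' in node:                               # leaf: linear predictor g_i(x) = W x + b
            W, b = node['g']
            return W @ x + b
        child = node['right'] if node['w'] @ x + node['w0'] >= 0 else node['left']
        return tree_predict(child, x)

    def objective(root, theta_list, X, Y, alpha):
        """E(Theta): squared-error data term plus an l1 penalty on every node's parameters."""
        loss = sum(np.sum((y - tree_predict(root, x)) ** 2) for x, y in zip(X, Y))
        penalty = alpha * sum(np.abs(theta).sum() for theta in theta_list)
        return loss + penalty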

SLIDE 6

Optimizing a single tree with TAO: separability of nodes

Our TAO algorithm for regression is based on 3 theorems: the separability condition, the reduced problem over a leaf, and the reduced problem over a decision node.

1. Separability condition

Consider any pair of nodes i and j. Assume the parameters of all other nodes (Θrest) are fixed. If nodes i and j are not descendants of each other, then E(Θ) can be rewritten as:

E(Θ) = Ei(θi) + Ej(θj) + Erest(Θrest)

In other words, the separability condition states that any set of non-descendant nodes of a tree can be optimized independently. Note that Erest(Θrest) can be treated as a constant since we fix Θrest.

SLIDE 7

Optimizing a single tree with TAO: leaves

One example of a set of non-descendant nodes is the set of all the leaves. Optimizing over the parameters of one leaf is given by the following theorem.

2. Reduced problem over a leaf

If i is a leaf, the optimization of E(Θ) over θi can be equivalently written as:

minθi Ei(θi) = ∑n∈Ri L(yn, gi(xn; θi)) + α φi(θi)

The reduced set Ri contains the training instances that reach leaf i. Each leaf i has a predictor function gi(x; θi): ℝ^D → ℝ^K (we use a constant or linear regressor) that produces the actual output.

Therefore, solving the reduced problem over a leaf i amounts to fitting the leaf’s predictor gi to the instances in its reduced set so as to minimize the original loss (e.g. squared error).
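For a linear leaf with squared-error loss and an ℓ1 penalty, this reduced problem is essentially a Lasso fit on the leaf's reduced set. A sketch using scikit-learn (the reaches_leaf mask is hypothetical and would come from routing the training set through the current tree; sklearn's own scaling of the two terms differs slightly from the formula above):

    import numpy as np
    from sklearn.linear_model import Lasso

    def fit_leaf(X, y, reaches_leaf, alpha):
        """Solve the leaf's reduced problem: fit g_i on the instances in R_i only."""
        Xi, yi = X[reaches_leaf], y[reaches_leaf]    # reduced set R_i
        leaf = Lasso(alpha=alpha).fit(Xi, yi)        # squared error + l1 penalty (up to sklearn's scaling)
        return leaf                                  # leaf.predict plays the role of g_i(x; theta_i)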

SLIDE 8

Optimizing a single tree with TAO: decision nodes

An example of a set of non-descendant nodes is the set of all decision nodes at the same depth. Optimizing over the parameters of one decision node is given by the following theorem.

3. Reduced problem over a decision node

If i is a decision node, the optimization of E(Θ) over θi can be equivalently written as:

minθi Ei(θi) = ∑n∈Ri lin(fi(xn; θi)) + α φi(θi)

where Ri is the reduced set of node i and (assuming binary trees) fi(x; θi): ℝ^D → {left, right} is the decision function in node i, which sends instance xn to the corresponding child of i. We consider oblique trees, with hyperplane decision functions “go right if wiᵀx + wi0 ≥ 0” (where θi = {wi, wi0}). lin(·) is the loss incurred if xn chooses the right or left subtree (with all other nodes of the tree fixed).
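A sketch of lin for one instance, reusing the hypothetical tree_predict helper from the earlier sketch: with all other nodes fixed, the loss of sending xn to a child is simply the squared error of that subtree's prediction.

    import numpy as np

    def child_losses(node_i, x, y):
        """Return (l_in(left), l_in(right)) for one instance (x, y) at decision node i."""
        return (np.sum((y - tree_predict(node_i['left'],  x)) ** 2),   # loss if x goes left
                np.sum((y - tree_predict(node_i['right'], x)) ** 2))   # loss if x goes right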

SLIDE 9

Optimizing a single tree with TAO: decision nodes (cont.)

The reduced problem over a decision node can be equivalently rewritten as a weighted 0/1 loss binary classification problem on the node’s reduced-set instances:

minθi Ei(θi) = ∑n∈Ri Lin(yin, fi(xn; θi)) + α φi(θi)

where the weighted 0/1 loss Lin(yin, ·) for instance n ∈ Ri is defined as Lin(yin, y) = lin(y) − lin(yin) ∀y ∈ {left, right}, and yin = arg miny lin(y) is a “pseudolabel” indicating the child which gives the lowest value of the regression loss L for instance xn under the current tree. For hyperplane nodes, this problem is NP-hard, but it can be approximated by using a convex surrogate loss (we use the logistic loss). Hence, if φi is an ℓ1 norm, this requires solving an ℓ1-regularized logistic regression.
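A sketch of this step under the stated assumptions: the pseudolabel is the child with the smaller loss, the instance weight is the loss difference, and the weighted 0/1 problem is approximated by an ℓ1-regularized logistic regression. Here loss_left and loss_right would hold the lin values for the instances in Ri (e.g. computed as in the child_losses sketch above), and C is only a rough stand-in for 1/α; this is an illustrative surrogate, not the authors' exact solver.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def fit_decision_node(Xi, loss_left, loss_right, alpha):
        """Approximate the weighted 0/1 problem at node i via l1-regularized logistic regression."""
        pseudolabel = (loss_right < loss_left).astype(int)   # y_in: 1 means "right" child is better
        weight = np.abs(loss_right - loss_left)              # cost of misclassifying instance n
        clf = LogisticRegression(penalty='l1', solver='liblinear', C=1.0 / alpha)
        clf.fit(Xi, pseudolabel, sample_weight=weight)
        return clf   # clf's hyperplane plays the role of f_i: w^T x + w0 >= 0 -> go right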

SLIDE 10

Pseudocode for training a single TAO tree

TAO repeatedly alternates optimizing over sets of nodes by training a (binary) classifier in each decision node and a regressor in each leaf, while monotonically decreasing the objective function E(Θ).

input: training set; initial tree T(·; Θ) of depth ∆
N0, …, N∆ ← nodes at depth 0, …, ∆, respectively
R1 ← {1, …, N}
repeat
    for d = 0 to ∆
        parfor i ∈ Nd
            if i is a leaf then
                θi ← train regressor gi on reduced set Ri
            else
                θi ← train decision function fi on Ri
                compute the reduced sets of each child of i
until stop
prune dead subtrees of T
return T
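The pseudocode repeatedly needs the reduced sets Ri. A small sketch of that operation, with the same hypothetical dict-based node layout as in the earlier sketches: route every training instance from the root with the current decision functions and keep those whose path passes through node i.

    def reduced_set(root, X, node_i):
        """Indices of the training instances whose root-to-leaf path passes through node_i."""
        idx = []
        for n, x in enumerate(X):
            node, visited = root, [root]
            while 'g' not in node:                   # descend using the current decision functions
                node = node['right'] if node['w'] @ x + node['w0'] >= 0 else node['left']
                visited.append(node)
            if any(v is node_i for v in visited):
                idx.append(n)
        return idx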

SLIDE 11

Ensemble of TAO trees

Out of many ways to ensemble individual learners (bagging, boosting, etc.), we choose a simple one: we train each TAO tree independently (and in parallel) on a random subset of M samples of the available training data (N instances). If M = N this is bagging. The forest prediction is the average of its trees’ predictions.

Although our TAO regression algorithm works more generally, our experiments use:
• oblique trees (fi is linear), which are more powerful than axis-aligned trees
• constant- and linear-predictor leaves (gi).

We initialize each TAO tree T from a complete tree of depth ∆ and random node parameters. We train each tree with an ℓ1 regularizer to achieve a more compact model: it encourages the weight vectors of individual nodes to be sparse and leads to more dead subtrees, which can be pruned.
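A sketch of this initialization under the same hypothetical node layout: a complete binary oblique tree of the given depth with random node parameters (the particular distribution and the helper name are assumptions for illustration).

    import numpy as np

    def random_complete_tree(depth, n_features, n_outputs, rng):
        """Complete binary oblique tree of the given depth with random node parameters."""
        if depth == 0:                                        # leaf: random linear predictor W x + b
            return {'g': (rng.standard_normal((n_outputs, n_features)),
                          rng.standard_normal(n_outputs))}
        return {'w': rng.standard_normal(n_features),         # random hyperplane w_i
                'w0': rng.standard_normal(),                  # random bias w_i0
                'left':  random_complete_tree(depth - 1, n_features, n_outputs, rng),
                'right': random_complete_tree(depth - 1, n_features, n_outputs, rng)}

For example, random_complete_tree(7, D, K, np.random.default_rng(s)) with a different seed s per tree would give the T initial trees, before TAO training and pruning are applied.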

SLIDE 12

Experiments: standard benchmarks and algorithms

TAO-c: our oblique trees with constant leaves, TAO-l: our oblique trees with linear leaves. See the paper for extended results, additional datasets, etc.

(Etest: test error; T: number of trees; ∆: tree depth.)

cpuact (N=8k, D=21):
Forest      Etest        T     ∆
CART        3.63±0.32    1     25
TAO-c       2.71±0.04    1     6
RF          2.62±0.04    100   36
ARF         2.62±0.01    50    15
AdaBoost    2.61±0.16    100   10
RF          2.60±0.01    1k    37
XGBoost     2.60±0.00    100   10
ET          2.58±0.03    100   45
TAO-l       2.58±0.02    1     5
XGBoost     2.57±0.00    1k    10
AdaBoost    2.56±0.11    1k    10
ET          2.49±0.03    1k    50
TAO-c       2.39±0.05    30    7
TAO-l       2.35±0.01    30    5

CT slice (N=54k, D=384):
Forest      Etest        T     ∆
CART        2.71±0.06    1     51
TAO-c       1.54±0.05    1     7
AdaBoost    1.48±0.03    100   10
XGBoost     1.45±0.00    100   10
AdaBoost    1.31±0.01    1k    10
XGBoost     1.18±0.00    1k    10
TAO-l       1.16±0.02    1     5
ET          1.06±0.01    100   82
RF          1.03±0.01    100   71
cRF         1.00         1k    –
RF          0.97±0.01    1k    78
TAO-c       0.89±0.02    30    7
TAO-l       0.58±0.03    30    6

The TAO regression forests are smaller (fewer and shallower trees) yet consistently more accurate, particularly if using linear predictors at the leaves.

SLIDE 13

Experiments: MNIST digit rotation

Task: given an MNIST image, predict a class-dependent rotation of it (N=60k,D=784,K=784).

(Etest ×10⁻²: test error; #pars.: total number of parameters; T: number of trees; ∆: tree depth.)

Forest      Etest ×10⁻²         #pars.   FLOPS    T     ∆
AdaBoost    >24 hours runtime   –        –        39k   25
CART        23.08±0.12          120k     28       1     28
RF          14.38±0.23          7.6M     2 830    100   39
RF          14.08±0.25          68M      28k      1k    40
ET          13.83±0.12          12M      3 091    100   35
TAO-c       13.76±0.09          9M       42k      30    29
ET          13.72±0.13          109M     3 360    1k    38
XGBoost     10.35±0.00          180M     613k     39k   25
TAO-l       9.63±0.17           288k     4 491    1     7
TAO-l       6.59±0.11           7.7M     126k     30    7

The improvement of TAO regression forests over other methods is clear, confirming our earlier results even more drastically.

SLIDE 14

Experiments: MNIST digit rotation (cont.)

[Figure: sample test images showing the input, the ground-truth output, and the outputs predicted by RF (T = 1k), ET (T = 1k), XGBoost (T = 39k), TAO-l (T = 1) and TAO-l (T = 30).]

SLIDE 15

Experiments: forest hyperparameters

Hyperparameter exploration (cpuact dataset): depth ∆, number of TAO iterations I and number of trees T. Each column fixes one factor (∆ = 7, I = 40 or T = 20, respectively) and varies the other two. In the paper, we also explore the effect of various diversity mechanisms.

[Figure: training error Etrain (top row) and test error Etest (bottom row) on cpuact, plotted vs. the number of iterations I (curves for T = 10, 20, 30, with ∆ = 7 fixed) and vs. the depth ∆ (curves for T = 10, 20, 30 with I = 40 fixed, and curves for I = 20, 40, 80 with T = 20 fixed).]

Note how the forest can eventually overfit, which suggests that TAO is optimizing each tree well.
SLIDE 16

Conclusion

Our hypothesis was that using more complex trees and a better tree optimization would produce more accurate forests. This was indeed the case, across a range of datasets we tried:

• Our TAO regression forests outperform all competing algorithms we tested in terms of accuracy.
• The TAO forests are smaller in terms of model size: number of trees, total number of parameters, depth.

Their design in terms of hyperparameter tuning remains as simple as with random forests or boosting: we choose the tree depth and number of trees as large as computationally possible, but without overfitting. This makes our TAO forests a model of immediate, widespread practical applicability and impact, and suggests they could become the state of the art in tree ensembles. In separate papers, we have also found that TAO trees bring significant improvements in classification (rather than regression) and with boosting (rather than bagging).
