SLIDE 1 Examples of MM Algorithms
Kenneth Lange
Departments of Biomathematics, Human Genetics, and Statistics, University of California, Los Angeles; joint work with Eric Chi (NCSU), Joong-Ho Won (Seoul NU), Jason Xu (Duke), and Hua Zhou (UCLA)
de Leeuw Seminar, April 26, 2018
SLIDE 2 Introduction to the MM Principle
- 1. The MM principle is not an algorithm, but a prescription or principle for constructing optimization algorithms.
- 2. The EM algorithm from statistics is a special case.
- 3. An MM algorithm operates by creating a surrogate function that minorizes or majorizes the objective function. When the surrogate function is optimized, the objective function is driven uphill or downhill as needed.
- 4. In minimization MM stands for majorize/minimize, and in maximization MM stands for minorize/maximize.
SLIDE 3 History of the MM Principle
- 1. Anticipators: HO Hartley (1958, EM algorithms), AG McKendrick (1926, epidemiology), CAB Smith (1957, gene counting), E Weiszfeld (1937, facilities location), F Yates (1934, multiple classification)
- 2. Ortega and Rheinboldt (1970) enunciate the principle in the context of line search methods.
- 3. de Leeuw (1977) presents an MM algorithm for multidimensional scaling contemporary with the classic Dempster et al. (1977) paper.
SLIDE 4
MM Application Areas
a) robust regression, b) logistic regression, c) quantile regression, d) variance components, e) multidimensional scaling, f) correspondence analysis, g) medical imaging, h) convex programming, i) DC programming, j) geometric programming, k) survival analysis, l) nonnegative matrix factorization, m) discriminant analysis, n) cluster analysis, o) Bradley-Terry model, p) DNA sequence analysis, q) Gaussian mixture models, r) paired and multiple comparisons, s) variable selection, t) support vector machines, u) X-ray crystallography, v) facilities location, w) signomial programming, x) importance sampling, y) image restoration, and z) manifold embedding.
SLIDE 5 Rationale for the MM Principle
- 1. It can generate an algorithm that avoids matrix inversion.
- 2. It can separate the parameters of a problem.
- 3. It can linearize an optimization problem.
- 4. It can deal gracefully with equality and inequality constraints.
- 5. It can restore symmetry.
- 6. It can turn a non-smooth problem into a smooth problem.
SLIDE 6 Majorization and Definition of the Algorithm
- 1. A function g(θ | θn) is said to majorize the function f(θ) at θn provided
      f(θn) = g(θn | θn)   (tangency at θn)
      f(θ) ≤ g(θ | θn)     (domination for all θ).
   The majorization relation between functions is closed under the formation of sums, nonnegative products, limits, and composition with an increasing function.
- 2. A function g(θ | θn) is said to minorize the function f(θ) at θn provided −g(θ | θn) majorizes −f(θ).
- 3. In minimization, we choose a majorizing function g(θ | θn) and minimize it. This produces the next point θn+1 in the algorithm.
SLIDES 7–15 MM Algorithm in Action
[Figure: a nine-frame animation plotting f(x) against x. Each frame minimizes the current majorizing surrogate, moving the iterate from a "very bad" point to a "less bad" point as f(x) decreases.]
SLIDE 16 Descent Property
- 1. An MM minimization algorithm satisfies the descent property f(θn+1) ≤ f(θn), with strict inequality unless both g(θn+1 | θn) = g(θn | θn) and f(θn+1) = g(θn+1 | θn).
- 2. The descent property follows from the definitions and the chain
      f(θn+1) ≤ g(θn+1 | θn) ≤ g(θn | θn) = f(θn).
- 3. The descent property makes the MM algorithm very stable.
SLIDE 17 Example 1: Minimum of cos(x)
The univariate function f(x) = cos(x) achieves its minimum of −1 at odd multiples of π and its maximum of 1 at even multiples of π. For a given xn, the second-order Taylor expansion
   cos(x) = cos(xn) − sin(xn)(x − xn) − (1/2) cos(z)(x − xn)^2
holds for some z between x and xn. Because |cos(z)| ≤ 1, the surrogate function
   g(x | xn) = cos(xn) − sin(xn)(x − xn) + (1/2)(x − xn)^2
majorizes f(x). Solving d/dx g(x | xn) = 0 gives the MM algorithm
   xn+1 = xn + sin(xn)
for minimizing f(x) and represents an instance of the quadratic upper bound principle.
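The quadratic upper bound iteration for cos(x) is short enough to code directly. The sketch below (function name ours) implements xn+1 = xn + sin(xn) with a simple stopping rule:

```python
import math

def mm_minimize_cos(x0, tol=1e-12, max_iter=200):
    """MM iteration for minimizing cos(x): x_{n+1} = x_n + sin(x_n)."""
    x = x0
    for _ in range(max_iter):
        x_new = x + math.sin(x)  # minimizer of the quadratic surrogate g(x | x_n)
        if abs(x_new - x) < tol:
            break
        x = x_new
    return x_new
```

Started at x0 = 2, the iterates reproduce the MM column of the table on the next slide and converge to π.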
SLIDE 18 Majorization of cos x
[Figure: plot of f(x) = cos(x) together with the majorizing surrogates g(x | x0) and g(x | x1).]
SLIDE 19 MM and Newton Iterates for Minimizing cos(x)

  n   MM xn         Newton yn
  0   2.00000000    2.00000000
  1   2.90929743    4.18503986
  2   3.13950913    2.46789367
  3   3.14159265    3.26618628
  4   3.14159265    3.14094391
  5   3.14159265    3.14159265
SLIDE 20 Example 2: Robust Regression
According to Geman and McClure, robust regression can be achieved by minimizing the amended linear regression criterion
   f(β) = Σ_{i=1}^m (yi − x*i β)^2 / [c + (yi − x*i β)^2].
Here yi and xi are the response and the predictor vector for case i, and c > 0. Majorization is achieved via the concave function h(s) = s/(c + s). In view of the linear majorization h(s) ≤ h(sn) + h′(sn)(s − sn), substituting (yi − x*i β)^2 for s gives the surrogate function
   g(β | βn) = Σ_{i=1}^m wni (yi − x*i β)^2 + constant,
where the weight wni equals h′(s) evaluated at sn = (yi − x*i βn)^2. The update βn+1 is found by minimizing this weighted least squares criterion.
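The weighted least squares update amounts to iteratively reweighted least squares. A minimal sketch (function name ours; h′(s) = c/(c + s)^2 for the Geman-McClure h):

```python
import numpy as np

def geman_mcclure_regression(X, y, c=1.0, n_iter=100):
    """MM for the Geman-McClure criterion via iteratively reweighted least squares.

    Each step majorizes f(beta) by a weighted least squares surrogate with
    weights w_ni = h'(s_ni) = c / (c + s_ni)^2, where s_ni = (y_i - x_i' beta_n)^2.
    """
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # ordinary least squares start
    for _ in range(n_iter):
        r = y - X @ beta                  # residuals at the current iterate
        w = c / (c + r ** 2) ** 2         # Geman-McClure weights h'(s_n)
        sw = np.sqrt(w)
        beta = np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)[0]
    return beta
```

Because the weights downweight cases with large residuals, gross outliers exert little influence on the fitted β.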
SLIDE 21 Majorization of h(s) = s/(1+s) at sn = 1
[Figure: plot of h(s) = s/(1+s) together with its linear majorizer h(1) + h′(1)(s − 1), tangent at sn = 1.]
SLIDE 22 Example 3: Missing Data in K-Means Clustering
Lloyd’s algorithm is one of the earliest and simplest algorithms for K-means clustering. A recent paper extends K-means clustering to missing data. For subject i we observe an indexed set of components yij of a vector yi ∈ Rd. Call the index set Oi. Subjects must be assigned to one of K clusters. Let Ck denote the set of subjects currently assigned to cluster k. With this notation we seek to minimize the objective function
   Σ_{k=1}^K Σ_{i∈Ck} Σ_{j∈Oi} (yij − µkj)^2,
where µk is the center of cluster k. Reference: Chi JT, Chi EC, Baraniuk RG (2016) k-POD: A method for k-means clustering of missing data. The American Statistician 70:91–99
SLIDE 23 Reformulation of Lloyd’s Algorithm
Lloyd’s algorithm alternates cluster reassignment with re-estimation of cluster centers. If we fix the centers, then subject i should be reassigned to the cluster k minimizing the quantity Σ_{j∈Oi} (yij − µkj)^2. Re-estimation of the cluster centers relies on the MM principle. The surrogate function
   Σ_{k=1}^K Σ_{i∈Ck} [ Σ_{j∈Oi} (yij − µkj)^2 + Σ_{j∉Oi} (µnkj − µkj)^2 ]
majorizes the objective around the cluster centers µnk at the current iteration n. Note that the extra terms are nonnegative and vanish when µk = µnk.
SLIDE 24 Center Updates under Lloyd’s Algorithm
If we define ˜ ynij =
j ∈ Oi µnkj j ∈ Oi, then the surrogate can be rewritten as K
k=1
y nj − µk2. Its minimum is achieved at the revised centers µn+1,i = 1 |Ci|
˜ y nj. In other words, the center equals the within cluster average over the combination of the observed data and the imputed data. The MM principle restores symmetry and leads to exact updates.
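A minimal sketch of the center update in the k-POD style (function name and NaN encoding of missing entries are ours): impute each missing component with the current center of the subject's cluster, then average within clusters.

```python
import numpy as np

def kpod_center_update(Y, labels, centers):
    """One MM center update for K-means with missing data (k-POD style sketch).

    Y: (n, d) array with np.nan marking missing entries yij, j not in Oi.
    labels: (n,) current cluster assignments.
    centers: (K, d) current centers mu_n.
    """
    # Impute: y~_nij = y_ij if observed, mu_nkj otherwise.
    Y_tilde = np.where(np.isnan(Y), centers[labels], Y)
    new_centers = centers.copy()
    for k in range(centers.shape[0]):
        members = Y_tilde[labels == k]
        if len(members):  # within-cluster average of completed data
            new_centers[k] = members.mean(axis=0)
    return new_centers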
SLIDE 25 Robust Version of Lloyd’s Algorithm
It is worth mentioning that the same considerations apply to other objective functions. For instance, if we substitute ℓ1 norms for sums of squares, then the missing component majorization works with the term |µnkj − µkj| replacing the term (µnkj − µkj)^2. In this case, each component of the update µn+1,kj equals the corresponding median of the completed data points ỹni assigned to cluster k. This version of clustering is less subject to the influence of outliers.
SLIDE 26 Strengths and Weaknesses of K-Means
- 1. Strength: Speed and simplicity of implementation
- 2. Strength: Ease of interpretation
- 3. Weakness: Based on spherical clusters
- 4. Weakness: Lloyd’s algorithm attracted to local minima
- 5. Weakness: Distortion by outliers
- 6. Weakness: Choice of the number of classes K
SLIDE 27 K-Harmonic Means
The K-harmonic means clustering algorithm (KHM) is a clustering method that is less sensitive to initialization than K-means (B Zhang et al (1999) Hewlett-Packard Technical Report). It minimizes the criterion
   f−1(µ) = Σ_{i=1}^n K / [ Σ_{k=1}^K 1/‖yi − µk‖^2 ].
The corresponding K-means criterion without missing data is
   f−∞(µ) = Σ_{i=1}^n min_{1≤k≤K} ‖yi − µk‖^2.
Zhang et al devised an ad hoc algorithm for minimizing f−1(µ) without realizing that it is an MM algorithm. Can we justify their algorithm and extend it to a broader context?
SLIDE 28 Power Means
The power mean of order s of K nonnegative numbers x1, . . . , xK is
   Ms(x) = [ (1/K) Σ_{k=1}^K xk^s ]^{1/s}.
The choices s = 1 and s = −1 correspond to the arithmetic and harmonic means. The special case s = 0 is defined by continuity to be the geometric mean (x1 · · · xK)^{1/K}. One can check that Ms(x) is continuous, positively homogeneous, and symmetric in its arguments. Again by continuity, Ms(0) = 0. The gradient
   ∂/∂xj Ms(x) = [ (1/K) Σ_{k=1}^K xk^s ]^{1/s − 1} (1/K) xj^{s−1}
shows that Ms(x) is strictly increasing in each variable. The inequality Ms(x) ≤ Mt(x) for s ≤ t and the limits lim_{s→−∞} Ms(x) = min{x1, . . . , xK} and lim_{s→∞} Ms(x) = max{x1, . . . , xK} are exercises in classical analysis.
SLIDE 29 Relevance of Power Means to K-Means
Our comments on power means suggest the clustering criterion
   fs(µ) = Σ_{i=1}^n Ms(‖yi − µ1‖^2, . . . , ‖yi − µK‖^2) = Σ_{i=1}^n [ (1/K) Σ_{k=1}^K ‖yi − µk‖^{2s} ]^{1/s},
consistent with our previous notation f−∞(µ) (K-means) and f−1(µ) (harmonic mean). The cluster centers µk (columns of µ) can be estimated by minimizing fs(µ). We can track the solution matrices to the minimum of f−∞(µ). The advantage of this strategy is that the surface fs(µ) is less bumpy than the surface f−∞(µ). For example, in the linear case s = 1, all centers coincide at the single global minimum. The following slides illustrate how most local minima flatten into nonexistence as s → 1.
SLIDE 30 Objective function surface: K-means
Figure: A cross-section of the K-means objective for n = 100 simulated data points from K = 3 clusters in dimension d = 1. Two cluster centers vary along the axes, holding the third center fixed at its true value.
SLIDE 31 Objective function surface: power means
(a) s = −10.0 (b) s = −1.0 (KHM)
SLIDE 32 Objective function surface: power means
(c) s = −0.2 (d) s = 0.3
SLIDE 33 An MM Power Means Clustering Algorithm
Derivation of the MM algorithm depends on the concavity of the power mean function Ms(x) for s ≤ 1. For s > 1, Ms(x) is convex. (Proofs omitted.) Concavity entails the inequality
   Ms(x) ≤ Ms(xn) + dMs(xn)(x − xn) for all x ≥ 0.
Substituting ‖yi − µk‖^2 for xk yields the majorization
   fs(µ) ≤ fs(µn) + Σ_{i=1}^n Σ_{k=1}^K wnik (‖yi − µk‖^2 − ‖yi − µnk‖^2),
where the weights wnik are positive numbers derived from the partial derivatives of Ms(x). The MM algorithm gives the minimum of the surrogate as
   µn+1,k = ( Σ_{i=1}^n wnik yi ) / ( Σ_{i=1}^n wnik ).
Thus, all updates µn+1,k stay within the convex hull of the data points.
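One MM update can be sketched in a few vectorized lines (function name ours; the weights come from the gradient formula for Ms given earlier, evaluated at the current squared distances):

```python
import numpy as np

def power_means_update(Y, centers, s=-1.0):
    """One MM update for power means clustering (sketch).

    Each new center is a weighted average of the data points,
    so it stays inside their convex hull.
    """
    K = centers.shape[0]
    # Squared distances ||y_i - mu_k||^2, shape (n, K).
    D = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    D = np.maximum(D, 1e-12)                 # guard against zero distances
    inner = (D ** s).mean(axis=1)            # (1/K) sum_k d_ik^s
    # w_nik = dM_s/dx_k at the current distances.
    W = inner[:, None] ** (1.0 / s - 1.0) * (D ** (s - 1.0)) / K
    return (W.T @ Y) / W.sum(axis=0)[:, None]
```

With K = 1 the weights are constant and the update reduces to the ordinary mean, as it should.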
SLIDE 34 Simulation study
- Sample n = 2500 points according to a standard multivariate normal distribution from K = 50 randomly sized clusters
- When d = 2, this is exactly the same setting as the original K-harmonic means paper, but we will vary d.
- The center matrix µtrue has uniform random entries scaled up by a scale factor r randomly chosen between 15 and 30
- Performance measure: KM(x, µ̂) / KM(x, µopt), where KM denotes the K-means objective function, µ̂ is the estimate of the centers, and µopt is the estimate obtained by running Lloyd’s algorithm initialized at µtrue.
SLIDE 35 Performance comparison
d = 2 d = 5 d = 10 d = 30 d = 100 d = 200 Lloyd’s 1.151 1.415 1.538 1.617 1.603 1.794 KHM 1.012 1.934 2.636 2.599 2.485 2.665 s0 = −1.0 1.012 1.066 1.111 1.509 2.308 2.190 s0 = −3.0 1.032 1.082 1.081 1.143 1.662 1.485 s0 = −10.0 1.035 1.197 1.212 1.138 1.104 1.131 s0 = −20.0 1.066 1.268 1.272 1.231 1.140 1.178
- Here s0 is the initial power mean index; recall that s → −∞.
- Initialized each algorithm from matching randomized centers,
averaged over 25 trials
- Same message under K-means++ and other initializations and
different performance measures (variation of information, adjusted random index)
- Power means perform best. Harmonic means outperforms standard
K-means only in low dimensions.
25
SLIDE 36 Background on Distance Majorization
- 1. The Euclidean distance dist(x, C) = min_{y∈C} ‖x − y‖ can be equivalently expressed using projection onto C: dist(x, C) = ‖x − PC(x)‖.
- 2. The closest point PC(x) in C to x exists and is unique when C is closed and convex. For a nonconvex set, PC(x) may be multi-valued. Many projection operators PC(x) have explicit formulas or reduce to simple algorithms.
- 3. The standard distance majorization is dist(x, C) ≤ g(x | xn) = ‖x − PC(xn)‖.
- 4. The function dist(x, C) is typically non-differentiable at boundary points even for convex C; however, dist(x, C)^2 is differentiable whenever PC(x) is single valued. In this case, one can calculate ∇ dist(x, C)^2 = 2[x − PC(x)].
SLIDE 37 Sample Projection Operators
- 1. If C = {x ∈ Rp : ‖x − z‖ ≤ r} is a closed ball, then
      PC(y) = z + (r/‖y − z‖)(y − z)  for y ∉ C,
      PC(y) = y                        for y ∈ C.
- 2. If C = [a, b] is a closed rectangle in Rp, then PC(y) has entries
      PC(y)i = ai if yi < ai,  yi if yi ∈ [ai, bi],  bi if yi > bi.
- 3. If C = {x ∈ Rp : a*x = b} for a ≠ 0 is a hyperplane, then
      PC(y) = y − [(a*y − b)/‖a‖^2] a.
- 4. If C is the unit sphere (surface of the unit ball), then PC(x) = x/‖x‖ for all x ≠ 0. However, PC(0) = C.
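The first three operators translate directly into code (function names ours):

```python
import numpy as np

# Projection onto the closed ball {x : ||x - z|| <= r}.
def proj_ball(y, z, r):
    d = np.linalg.norm(y - z)
    return y if d <= r else z + (r / d) * (y - z)

# Projection onto the rectangle [a, b]: componentwise clipping.
def proj_box(y, a, b):
    return np.clip(y, a, b)

# Projection onto the hyperplane {x : a @ x = b}.
def proj_hyperplane(y, a, b):
    return y - (a @ y - b) / (a @ a) * a
```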
SLIDE 38 Example 4a: Averaged Projections
Let S1, . . . , Sm be closed sets. The method of averaged projections attempts to find a point in their intersection S = ∩m
j=1Sj. To derive the
algorithm, consider the proximity function f (x) =
m
dist(x, Sj)2. It’s minimum value of 0 is attained by any x ∈ ∩m
j=1Sj. The surrogate
g(x | xn) =
m
x − PSj(xn)2 majorizes f (x). The minimum point of g(x | xn), xn+1 = 1 m
m
PSj(xn), defines the averaged projection. The MM principle guarantees that xn+1 decreases the proximity function.
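The averaged projection update can be sketched in a few lines (function name ours; the caller supplies one projection operator per set):

```python
import numpy as np

def averaged_projections(x0, projections, n_iter=200):
    """Averaged projections: x_{n+1} = (1/m) sum_j P_{S_j}(x_n).

    `projections` is a list of projection operators onto the sets S_j.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x = sum(P(x) for P in projections) / len(projections)
    return x
```

For two intersecting hyperplanes the iterates converge linearly to a point of the intersection.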
SLIDE 39 Depiction of Averaged Projections
[Figure]
SLIDE 40 Example 4b: Alternating Projections
For two closed sets S1 and S2, consider the problem of minimizing the proximity function f(x) = dist(x, S2)^2 subject to the constraint x ∈ S1. Clearly, S1 ∩ S2 ≠ ∅ is equivalent to a minimum value of 0. The function g(x | xn) = ‖x − PS2(xn)‖^2 majorizes f(x) on S1 and is minimized by taking xn+1 = PS1 ◦ PS2(xn). This is von Neumann’s method of alternating projections for finding x ∈ S1 ∩ S2.
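The composition PS1 ◦ PS2 is equally simple to code (function name ours). As an illustration, intersecting the x-axis with the unit sphere lands on (1, 0):

```python
import numpy as np

def alternating_projections(x0, P1, P2, n_iter=100):
    """von Neumann alternating projections: x_{n+1} = P_{S1}(P_{S2}(x_n))."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x = P1(P2(x))
    return x
```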
SLIDE 41 Depiction of Alternating Projections
[Figure]
SLIDE 42 Example 5: Intensity-Modulated Radiation Therapy
This problem involves optimizing beamlet intensities in radiation oncology. Mathematically, both domain and range constraints are important. The tumor and surrounding tissues are divided into voxels.
The goals/constraints:
- 1. Sufficiently irradiate cancerous (target) tissue
- 2. Minimize radiation to normal tissue
- 3. Impose nonnegativity constraints on the entries of x.
The dose d = Ax is a linear map of the beamlet intensities x. Lower bound Lj on target region j: for all voxels i in region j, require di ≥ Lj. Upper bound Uj on non-target region j: for all voxels i in region j, cap the radiation at di ≤ Uj.
SLIDE 43 MM for Multiset Nonlinear Split Feasibility
For a smooth function h(x), consider the problem of finding x ∈ ∩iCi such the h(x) ∈ ∩jQj. This problem can be attacked by minimizing f (x) = 1 2
dist(x, Ci)2 + 1 2
dist[h(x), Qj]2. A split feasible point exits if and only if the minimum value is 0. The MM principle suggests minimizing the surrogate g(x | xn) = 1 2
x − PCi(xn)2 + 1 2
h(x) − PQj[h(xn)]2 to find an improved point xn+1. When h(x) = Ax, the MM update involves solving a system of linear equations and reduces to the iterative projection algorithm of Censor & Elfving (1994). In the nonlinear case,
- ne can exploit the inexact minimization
xn+1 = xn − d2g(xn | xn)−1∇g(xn | xn) provided by applying one step of Newton’s method to the surrogate.
31
SLIDE 44 MM for Multiset Nonlinear Split Feasibility
The gradient and Hessian of the surrogate are
   ∇g(xn | xn) = Σ_i [xn − PCi(xn)] + Σ_j ∇h(xn){h(xn) − PQj[h(xn)]}
   d^2g(xn | xn) = Σ_i I + Σ_j ∇h(xn)dh(xn) + Σ_j d^2h(xn){h(xn) − PQj[h(xn)]}
                 ≈ (# of i’s) I + (# of j’s) ∇h(xn)dh(xn).
When all constraints Qj are satisfied, PQj[h(xn)] = h(xn), and the approximation is exact. Dropping the sum involving d^2h(xn) to avoid this tensor is analogous to the Gauss-Newton maneuver in nonlinear regression. The approximation to the Hessian is positive definite and well conditioned. Step halving is seldom necessary.
SLIDE 45 Graphical Display of IMRT Solution
1,000–5,000 beamlets and nearly 100,000 voxels, but only 5–10 regions
Figure: Solutions to the voxel-by-voxel split feasibility problem on a cross-section of liver data (left) and prostate data (right).
SLIDE 46 Proximal Distance Algorithm
- 1. Problem: Minimize a continuous function f(x) subject to x ∈ C.
- 2. Let xρ minimize the unconstrained function f(x) + (ρ/2) dist(x, C)^2 for ρ > 0. Then any cluster point of xρ as ρ → ∞ is feasible and attains the constrained minimum value of f(x). If f(x) is coercive and possesses a unique minimum point x∞, then xρ → x∞.
- 3. The proximal distance method minimizes f(x) + (ρ/2) dist(x, C)^2 by distance majorization. If f(x) is convex, then this MM procedure is a concave-convex algorithm.
- 4. For many choices of f(x), the proximal operator
      xn+1 = prox_{ρ^{-1}f}[PC(xn)] = argmin_x [ f(x) + (ρ/2)‖x − PC(xn)‖^2 ]
   is explicitly known.
- 5. In practice, ρ is gradually increased to some large value, say 10^5.
SLIDE 47 Example 6: Sparse Dominant Eigenvector
- 1. For a symmetric matrix A, the dominant eigenvector maximizes
xtAx subject to x = 1.
- 2. One can introduce sparsity by requiring that at most k components
- f x be nonzero. The constraint set Sk is the unit sphere with this
additional sparsity constraint.
- 3. The projection operator PSk(y) sets to 0 all but the k largest
components of y in absolute value. It then replaces the result ˜ y by ˜ y/˜ y.
- 4. A sparse dominant eigenvector is then found by minimizing
f (x) = − 1
2xtAx subject to x ∈ Sk.
- 5. The proximal distance update solves 0 = −Ax + ρ[x − PSk(xn)] in
the form xn+1 = (ρI − A)−1ρPSk(xn) =
∞
(ρ−1A)nPSk(xn).
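A minimal sketch of the projection and the proximal distance update (function names and the starting point x0 are ours; ρ is held fixed rather than annealed, for simplicity):

```python
import numpy as np

def proj_sparse_sphere(y, k):
    """P_{S_k}: zero out all but the k largest-magnitude entries, then renormalize."""
    z = np.zeros_like(y)
    keep = np.argsort(np.abs(y))[-k:]
    z[keep] = y[keep]
    return z / np.linalg.norm(z)

def sparse_dominant_eigenvector(A, k, x0, rho=1e4, n_iter=500):
    """Proximal distance iteration x_{n+1} = (rho I - A)^{-1} rho P_{S_k}(x_n)."""
    M = rho * np.eye(A.shape[0]) - A
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x = np.linalg.solve(M, rho * proj_sparse_sphere(x, k))
    return proj_sparse_sphere(x, k)  # report a feasible (k-sparse, unit) point
```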
SLIDE 48 Plot of ‖Ax − λx‖ for A a 100 × 100 Symmetric Matrix
[Figure: residual error plotted against sparsity level.]
SLIDE 49 Remaining Challenges
- 1. Devise new MM algorithms, particularly for high dimensional and
nonconvex problems.
- 2. Quantify the local rate of convergence of the MM algorithm in the
presence of complex constraints. When does an MM algorithm converge at a sublinear rate?
- 3. Estimate the computational complexity of various MM algorithms.
- 4. Devise new annealing schemes to avoid local minima.
- 5. Devise better ways of accelerating MM and EM algorithms.
- 6. Write Julia and R packages for various MM algorithms. Parallel and
GPU versions especially needed.
SLIDE 50 References
- 1. de Leeuw J (1977) Applications of convex analysis to
multidimensional scaling. Recent Developments in Statistics (editors Barra JR, Brodeau F, Romie G, Van Cutsem B), North Holland, Amsterdam, pp 133–146
- 2. de Leeuw J, Heiser WJ (1977) Convergence of correction matrix algorithms for multidimensional scaling. Geometric Representations of Relational Data (editors Lingoes JC, Roskam E, Borg I), pp. 735–752, Mathesis Press
- 3. de Leeuw J (2016) Block Relaxation Methods in Statistics. Internet
Book
- 4. Hunter DR, Lange K (2004) A tutorial on MM algorithms.
American Statistician 58:30–37
- 5. Lange K (2013) Optimization, 2nd Edition. Springer
- 6. Lange K (2016) MM Optimization Algorithms. SIAM