Optimization Algorithms for Data Analysis
Stephen Wright, University of Wisconsin-Madison. Fields Institute, June 2010.


  1. Optimization Algorithms for Data Analysis. Stephen Wright, University of Wisconsin-Madison. Fields Institute, June 2010.

  2. Introduction: Data Analysis. Learn how to make inferences from data. Related fields: data mining, machine learning, support vector machines, classification, regression. Given a (possibly huge) number of examples ("training data") and the known inferences for each data point, seek rules that can be used to make inferences about future instances. Among the many possible rules that explain the examples, seek simple ones. Simple rules provide insight into the most important features of the data (needles in the haystack), are inexpensive to apply to new instances, and can generalize better to the underlying problem because they do not over-fit the particular set of examples used. We need to set parameters that trade off data fitting against generalizability; tuning/validation data are useful for this.

  3. Important Tool: Sparse Optimization. Optimization has been a key technology in data analysis for many years (least squares, robust regression, support vector machines). The need for simple, approximate solutions that draw essential insights from large data sets motivates sparse optimization. In sparse optimization, we look for a simple approximate solution of the optimization problem rather than a (more complicated) exact solution. Occam's Razor: simple explanations of the observations are preferable to complicated explanations. Noisy or sampled data does not justify solving the problem exactly. Simple solutions are sometimes more robust to data inexactness, are often easier to actuate, implement, store, and explain, and may conform better to prior knowledge. When the solution is represented in an appropriate basis, simplicity or structure shows up as sparsity in x (i.e., few nonzero components).

  4. Optimization Tools Needed. Biological and biomedical applications use many tools from large-scale optimization: quadratic programming, integer programming, semidefinite programming. The extreme scale motivates the use of other tools too, e.g. stochastic gradient methods. Sparsity requires additional algorithmic tools. (It often introduces structured nonsmooth functions into the objective or constraints.) Effectiveness depends critically on exploiting the structure of the application class.

  5. This Talk. We discuss sparse optimization and other optimization techniques relevant to problems in biological and medical sciences: (1) optimization in classification (SVM) and sparse optimization in sparse classification; (2) regularized logistic regression; (3) tensor decompositions for multiway data arrays; (4) cancer treatment planning; (5) semidefinite programming for cluster analysis; (6) integer programming for genetically optimal captive breeding programs. (More time for some topics than others!)

  6. 1. Optimization in Classification. Have feature vectors x_1, x_2, ..., x_n ∈ R^m (real vectors) and binary labels y_1, y_2, ..., y_n = ±1. Seek a hyperplane defined by coefficients (w, b), with w^T x + b = 0, that separates the points according to their classification: w^T x_i + b ≥ 1 ⇒ y_i = +1 and w^T x_i + b ≤ −1 ⇒ y_i = −1, for most training examples i = 1, 2, ..., n. Choose (w, b) to balance between fitting this particular set of training examples and not over-fitting, so that the classifier would not change much if presented with other training examples drawn from the same (unknown) underlying distribution.

  7. Linear SVM Classifier (figure).

  8.–11. Separable Data Set: Possible Separating Planes (figure sequence).

  12.–15. More Data Shows Max-Margin Separator is Best (figure sequence).

  16. For separable data, find the maximum-margin classifier by solving
     \min_{(w,b)} \tfrac{1}{2}\|w\|^2  subject to  w^T x_i + b \ge 1 if y_i = +1,  w^T x_i + b \le -1 if y_i = -1.
     Penalized formulation: for suitable λ > 0, solve
     \min_{(w,b)} \frac{\lambda}{2} w^T w + \frac{1}{m} \sum_{i=1}^{m} \max\left(1 - y_i [w^T x_i + b],\, 0\right).
     (Also works for non-separable data.) Dual formulation:
     \max_{\alpha} \; e^T \alpha - \tfrac{1}{2} \alpha^T Y^T K Y \alpha  subject to  \alpha^T y = 0,\ 0 \le \alpha \le \tfrac{1}{\lambda m}\mathbf{1},
     where y = (y_1, y_2, ..., y_m)^T, Y = diag(y), and K_ij = x_i^T x_j is the kernel.
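A minimal sketch of the penalized (hinge-loss) formulation above, solved by plain subgradient descent in NumPy. This is only an illustration, not the method discussed in the talk; the function name, step size, iteration count, and synthetic data are all assumptions.

```python
# Minimal sketch: minimize (lambda/2) w'w + (1/m) sum_i max(1 - y_i(w'x_i + b), 0)
# by subgradient descent. Step size, iteration count, and data are illustrative.
import numpy as np

def linear_svm_subgradient(X, y, lam=0.1, step=0.01, iters=2000):
    m, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(iters):
        margins = y * (X @ w + b)          # y_i (w'x_i + b)
        active = margins < 1.0             # examples with nonzero hinge loss
        grad_w = lam * w - (y[active, None] * X[active]).sum(axis=0) / m
        grad_b = -y[active].sum() / m
        w -= step * grad_w
        b -= step * grad_b
    return w, b

# Toy usage on synthetic separable data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])
w, b = linear_svm_subgradient(X, y)
print("training accuracy:", np.mean(np.sign(X @ w + b) == y))
```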

  17. Nonlinear Support Vector Machines (figure).

  18. Nonlinear SVM. To get a nonlinear classifier, map x into a higher-dimensional space H via φ : R^n → H, and do linear classification in H to find w ∈ H, b ∈ R. When the hyperplane is projected back into R^n, it gives a nonlinear surface (often not contiguous). In the "lifted" space, the primal problem is
     \min_{(w,b)} \frac{\lambda}{2} w^T w + \frac{1}{m} \sum_{i=1}^{m} \max\left(1 - y_i [w^T \phi(x_i) + b],\, 0\right).
     By the optimality conditions (and a representation theorem), the optimal w has the form
     w = \sum_{i=1}^{m} \alpha_i y_i \phi(x_i).
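As a quick check of the representer form w = Σ_i α_i y_i φ(x_i), the sketch below uses an explicit degree-2 polynomial feature map (an assumed example; the talk does not fix a particular φ) and verifies that w^T φ(x) agrees with the kernel expansion Σ_i α_i y_i k(x_i, x).

```python
# Sketch: with an explicit degree-2 polynomial feature map phi (an assumed example),
# check that w = sum_i alpha_i y_i phi(x_i) gives w'phi(x) = sum_i alpha_i y_i k(x_i, x),
# where k(s, t) = (1 + s't)^2 = <phi(s), phi(t)>.
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2, np.sqrt(2) * x1 * x2])

def k_poly(s, t):
    return (1.0 + s @ t) ** 2

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 2))
y = np.array([1.0, -1.0, 1.0, -1.0, 1.0])
alpha = rng.uniform(size=5)               # arbitrary coefficients, just for the check

w = sum(a * yi * phi(xi) for a, yi, xi in zip(alpha, y, X))
x_new = rng.normal(size=2)
print(w @ phi(x_new))                                                     # via the feature map
print(sum(a * yi * k_poly(xi, x_new) for a, yi, xi in zip(alpha, y, X)))  # via the kernel
```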

  19. Kernel. By substitution, obtain a finite-dimensional problem in (α, b) ∈ R^{m+1}:
     \min_{(\alpha,b)} \frac{\lambda}{2} \alpha^T \Psi \alpha + \frac{1}{m} \sum_{i=1}^{m} \max\left(1 - \Psi_{i\cdot}\,\alpha - y_i b,\, 0\right),
     where Ψ_ij = y_i y_j φ(x_i)^T φ(x_j). WLOG we can impose the bounds α_i ∈ [0, 1/(λm)]. We don't need to define φ explicitly! Instead, define a kernel function k(s, t) that indicates the similarity of s and t in H; implicitly, k(s, t) = ⟨φ(s), φ(t)⟩. The Gaussian kernel k_G(s, t) := exp(−‖s − t‖_2^2 / (2σ^2)) is popular. Thus define Ψ_ij = y_i y_j k(x_i, x_j) in the problem above.
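A short sketch of how the matrix Ψ_ij = y_i y_j k_G(x_i, x_j) might be assembled with the Gaussian kernel; σ and the toy data are illustrative choices, not values from the talk.

```python
# Sketch: build Psi_ij = y_i y_j k_G(x_i, x_j) with the Gaussian kernel.
# sigma and the toy data are illustrative.
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    """K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) for the rows x_i of X."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(X**2, axis=1)[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def build_psi(X, y, sigma=1.0):
    return np.outer(y, y) * gaussian_kernel_matrix(X, sigma)   # Psi_ij = y_i y_j K_ij

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
y = np.array([1.0, 1.0, -1.0, -1.0, 1.0, -1.0])
Psi = build_psi(X, y)
print(Psi.shape)          # (6, 6): m x m and dense, as noted on the next slide
```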

  20. The Classifier. Given a solution (α, b), we can classify a new point x by evaluating
     \sum_{i=1}^{m} \alpha_i y_i k(x, x_i) + b
     and checking whether it is positive (classified as +1) or negative (class −1). Difficulties: Ψ is generally large (m × m) and dense; specialized techniques are needed to solve the classification problem for (α, b); and the classifier can be expensive to apply, since it requires m kernel evaluations. Many specialized algorithms have been proposed since about 1998, drawing heavily on optimization while also exploiting problem structure.
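A sketch of this classifier as a small Python function; α and b are placeholders that would in practice come from solving the kernelized problem on the previous slide.

```python
# Sketch: classify a new point x by the sign of sum_i alpha_i y_i k(x, x_i) + b.
# alpha and b are placeholders here; each prediction costs m kernel evaluations.
import numpy as np

def gaussian_kernel(s, t, sigma=1.0):
    return np.exp(-np.sum((s - t)**2) / (2.0 * sigma**2))

def predict(x, X_train, y_train, alpha, b, sigma=1.0):
    score = sum(a * yi * gaussian_kernel(x, xi, sigma)
                for a, yi, xi in zip(alpha, y_train, X_train)) + b
    return 1 if score > 0 else -1
```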

  21. Approximate Kernel. Propose an algorithm that replaces Ψ by a low-rank approximation and then uses stochastic approximation to solve the resulting problem. Using a Nyström method [Drineas & Mahoney 05], choose c indices from {1, 2, ..., m} and evaluate those rows/columns of Ψ. By factoring this submatrix, we can construct a rank-r approximation Ψ ≈ VV^T, where V ∈ R^{m×r} (with r ≤ c). Replace Ψ ← VV^T in the problem and change variables γ = V^T α to get
     \min_{(\gamma,b)} \frac{\lambda}{2} \gamma^T \gamma + \frac{1}{m} \sum_{i=1}^{m} \max\left(1 - v_i^T \gamma - y_i b,\, 0\right),
     where v_i^T is the i-th row of V. This has the same form as the linear SVM, with feature vectors y_i v_i, i = 1, 2, ..., m.
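A rough sketch of a Nyström-style construction of V with Ψ ≈ VV^T; uniform column sampling and the eigendecomposition-based square root are assumptions for illustration and do not reproduce the exact procedure of [Drineas & Mahoney 05].

```python
# Rough Nystrom-style sketch: sample c columns of Psi, factor the c x c block,
# and form V with Psi ~ VV^T. Uniform sampling and the pseudo-inverse square
# root are assumptions for illustration.
import numpy as np

def nystrom_factor(Psi, c, rng):
    m = Psi.shape[0]
    idx = rng.choice(m, size=c, replace=False)
    C = Psi[:, idx]                          # m x c block of sampled columns
    W = Psi[np.ix_(idx, idx)]                # c x c intersection block
    vals, vecs = np.linalg.eigh(W)
    keep = vals > 1e-10
    W_inv_sqrt = vecs[:, keep] @ np.diag(1.0 / np.sqrt(vals[keep])) @ vecs[:, keep].T
    return C @ W_inv_sqrt                    # m x c, effective rank r <= c

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)
sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2.0 * X @ X.T
Psi = np.outer(y, y) * np.exp(-np.maximum(sq, 0.0) / 2.0)   # Gaussian kernel, sigma = 1
V = nystrom_factor(Psi, c=20, rng=rng)
print("relative error:", np.linalg.norm(Psi - V @ V.T) / np.linalg.norm(Psi))
```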
