Monday, February 2, 2015

Notes on Course 10-725 Convex Optimization

@(Numerical Optimization)[Notes, Numerical Optimization]

My personal impression is that this course is designed for Machine Learning (ML) and Statistics (Stat), focusing on machine-learning-related algorithms.
My targets for optimization knowledge:
1. Gradient descent
2. AdaGrad
3. RMSProp
4. Newton's method
5. Quasi-Newton methods
6. Interior point methods
7. Primal-dual path following
8. Decomposition methods
9. Coordinate descent
10. Proximal gradient methods
11. ADMM

These will be classified into first-order methods, second-order methods, interior point methods, and the rest.

Other methods related to machine learning will be a plus. The more methods I cover, the better a bird's-eye view of the optimization methods in machine learning I will get.

Lecture 2

Introduction to gradient descent.
Possible problems with gradient descent:
1. Saddle points: not local minima, but the gradient is zero.
2. The algorithm may never converge.
3. Oscillation resulting from an inappropriate step size.

How do we determine when to stop gradient descent?
Two different approaches:
1. From ML & Stat: we may need to use a held-out data set to decide when to stop, because the target is generalization.
2. From numerical optimization: an accuracy threshold $\epsilon$, i.e. stop when the suboptimality $|f(x^{(k)}) - f(x^\star)|$ falls below $\epsilon$.
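The numerical-optimization stopping rule can be sketched in a few lines. In practice $f(x^\star)$ is unknown, so this sketch (my own toy example, not from the lecture) stops on the gradient norm instead; the test function and step size are made-up values:

```python
import numpy as np

def gradient_descent(grad, x0, step=0.1, tol=1e-8, max_iter=10000):
    """Plain gradient descent; stop when the gradient norm falls below tol."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # numerical stopping criterion
            break
        x = x - step * g
    return x

# Toy problem: minimize f(x) = (x - 3)^2, gradient 2(x - 3), minimizer x = 3.
x_star = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=[0.0])
```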

Lecture 3 Convexity

We have convex sets and convex functions.
Some concepts related to a set $C$:
1. Boundary of the set $C$, i.e. $\mathrm{bd}(C)$, defined as the closure minus the interior: $\mathrm{bd}(C) = \mathrm{cl}(C) \setminus \mathrm{int}(C)$.
2. Interior of the set $C$, defined as follows: $\mathrm{int}(C) = \{x \in C : B(x, \epsilon) \subseteq C \text{ for some } \epsilon > 0\}$.
3. Relative interior ($\mathrm{relint}\,C$), defined as follows: when $C$ belongs to some affine subspace, the interior of $C$ within this subspace is called the relative interior. This is useful when we talk about sets without knowing the exact dimension of the space containing them.
Example of relative interior: a norm ball in $\mathbb{R}^2$ has a nonempty interior. But if this same set lives in $\mathbb{R}^3$, it has no interior, yet it has a relative interior.
4. Closed set, defined as follows: $\mathrm{bd}(C) \subseteq C$, i.e. all the points of the boundary also belong to the set $C$.
5. Open set, defined as follows: $\mathrm{int}(C) = C$; equivalently, its complement is closed.
6. Compact set: closed and bounded. When we say bounded, we mean there is a ball of finite radius that contains the set.

Primal & Dual representation of Convex set

The primal representation of a convex set is the convex hull of a set of points, while the dual representation is the intersection of half-spaces (possibly infinitely many).

  1. For any convex set $C$, we can find a supporting hyperplane at any point on the boundary $\mathrm{bd}(C)$.
    A supporting hyperplane means:
    i. the set lies on one side of the hyperplane;
    ii. the intersection between the hyperplane and the convex set is not empty.
  2. For any two convex sets with empty intersection, there is a separating hyperplane. If both sets are closed and at least one of them is compact, then we get a strictly separating hyperplane.
Epigraph and sub-level sets

A function is called convex when it satisfies the following definition:
1. The domain of the function is a convex set. This is important and easily ignored.
2. For any $x$ and $y$ in the domain and $t \in [0, 1]$, $f(tx + (1-t)y) \le t f(x) + (1-t) f(y)$, i.e. the line segment connecting $(x, f(x))$ and $(y, f(y))$ lies above the function curve. And this definition implies that the epigraph of $f$ is convex!!!

Important property: every sub-level set $\{x : f(x) \le c\}$ of a convex function is convex (the converse does not hold).

Criteria for a convex function:
1. Zero order: the definition above holds.
2. First order: $f(y) \ge f(x) + \nabla f(x)^T (y - x)$, i.e. at any point we can get a global under-estimator of the function.
3. Second order: the Hessian is positive semi-definite.
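The second-order criterion is easy to check numerically for a quadratic $f(x) = \frac{1}{2}x^T Q x$, whose Hessian is the constant matrix $Q$. A small sketch (the matrix is a made-up example):

```python
import numpy as np

# Hessian of f(x) = 0.5 x^T Q x is Q itself; f is convex iff Q is
# positive semi-definite, i.e. all eigenvalues are >= 0.
Q = np.array([[2.0, 0.5],
              [0.5, 1.0]])
eigenvalues = np.linalg.eigvalsh(Q)       # sorted ascending
is_convex = bool(np.all(eigenvalues >= 0.0))
```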

Summary of lecture 3:
1. definition of compact set
2. primal & dual representation of convex set
3. different criteria of convex function

Lecture 4

Explains more details about convex sets and convex functions. Important concepts related to convex sets: open, closed, and compact.
1. Strictly convex: $f(tx + (1-t)y) < t f(x) + (1-t) f(y)$ for $x \ne y$ and $t \in (0, 1)$; an important property of the objective function, with a significant effect on the performance of the iteration.
2. $m$-strongly convex: the function satisfies $f(y) \ge f(x) + \nabla f(x)^T (y - x) + \frac{m}{2}\|y - x\|^2$. In fact, this is an augmented version of the first-order characterization of convex functions. $m$-strong convexity is a stronger condition than strict convexity; it means the function is lower bounded by a quadratic.
3. Lipschitz, or Lipschitz continuous: the function satisfies $|f(x) - f(y)| \le L\|x - y\|$. The Lipschitz condition constrains the function to be smooth, i.e. the function cannot grow faster than a linear function. Lipschitz continuity can be applied to the function itself or to its gradient. Few interesting convex problems have an objective that grows no faster than linearly, but it is usually okay to assume the gradient cannot grow faster than a linear function.

Convergence result for gradient descent

If the objective function is convex and we want to stop when the suboptimality $f(x^{(k)}) - f(x^\star)$ is less than $\epsilon$, how many iterations do we need and how should we set the step size?
1. If $f$ is Lipschitz, then we need $O(1/\epsilon^2)$ iterations, and the step size should be diminishing.
2. If $\nabla f$ is Lipschitz, then we need $O(1/\epsilon)$ iterations, and the step size can be constant.
3. If $f$ is additionally strongly convex, then we need $O(\log(1/\epsilon))$ iterations, with a constant step size.

But the constant hidden in $O(\log(1/\epsilon))$ is related to the conditioning of the function $f$. The conditioning of a function means the ratio of the largest eigenvalue to the smallest eigenvalue of its Hessian.
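For a quadratic $f(x) = \frac{1}{2}x^T Q x$ these constants are explicit: the gradient $Qx$ is Lipschitz with $L = \lambda_{\max}(Q)$, $f$ is strongly convex with $m = \lambda_{\min}(Q)$, and the condition number is $\kappa = L/m$. A small numerical sketch with a made-up diagonal $Q$:

```python
import numpy as np

# For f(x) = 0.5 x^T Q x with Q symmetric positive definite:
#   gradient Lipschitz constant L = largest eigenvalue of Q,
#   strong convexity constant  m = smallest eigenvalue of Q,
#   condition number       kappa = L / m.
Q = np.diag([10.0, 1.0])
eigs = np.linalg.eigvalsh(Q)              # sorted ascending
m, L = eigs[0], eigs[-1]
kappa = L / m
```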

Extension of Gradient Descent

This includes subgradient methods, proximal operators, FISTA, mirror descent, conjugate gradient, and stochastic gradient descent. All these methods are called first-order methods.

Lecture 5 Derivation of convergence rate of gradient descent

More on the derivation of the convergence analysis for gradient descent. One interesting property:
when the gradient of a function is Lipschitz with constant $L$, the entries of the Hessian are also bounded, i.e. $|\nabla^2 f(x)_{ij}| \le L$;
also, if the function itself is Lipschitz continuous with constant $G$, then the gradient of the function is bounded by $G$, i.e. $\|\nabla f(x)\| \le G$.

Lecture 6 Subgradient

The subgradient is an extension of the gradient. For a convex function, differentiable or not, we can always find a subgradient. But for a non-convex function, we can't guarantee a subgradient exists.
For subgradients, there is a subgradient calculus for computing subgradients of complex functions, and also optimality conditions and convergence analysis.

Lecture 7 Subgradient continue

Continuing the discussion of subgradients, with the example of finding a point in the intersection of $n$ closed convex sets. One interesting phenomenon: if we want to find a point in the intersection of two closed convex sets, the method recovers the classical alternating projections algorithm.
Another problem with the subgradient method is the slow convergence rate, i.e. $O(1/\epsilon^2)$ iterations.
Another interesting property:
a function being Lipschitz is an easier condition to satisfy than its gradient being Lipschitz continuous. If the gradient is Lipschitz continuous, then at every point the function is bounded above by a quadratic (a bowl-shaped curve).
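A minimal sketch of the subgradient method on the toy function $f(x) = |x - 2|$ (my own example, not from the lecture): a diminishing step size $1/k$, and we track the best point seen so far, since the iterates themselves are not monotone:

```python
import numpy as np

def subgradient_abs(x, c=2.0):
    """A valid subgradient of f(x) = |x - c|; at x = c we return 0,
    which belongs to the subdifferential [-1, 1]."""
    return float(np.sign(x - c))

x = 10.0
best_f = abs(x - 2.0)
for k in range(1, 5001):
    x -= (1.0 / k) * subgradient_abs(x)   # diminishing step size 1/k
    best_f = min(best_f, abs(x - 2.0))    # keep the best function value seen
```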

Lecture 8 Generalized Gradient Descent

Focuses on objective functions that decompose as $f(x) = g(x) + h(x)$, where $g$ is convex and differentiable, and $h$ is convex but not necessarily differentiable. The intuition:

$$x^+ = \arg\min_z \; g(x) + \nabla g(x)^T (z - x) + \frac{1}{2t}\|z - x\|_2^2 + h(z)$$

This intuition works as follows: we take a second-order-style Taylor expansion of $g$ (with the Hessian replaced by $\frac{1}{t} I$) and keep $h$ untouched, then we find the minimum of this approximation as our next point.
If the function $h$ happens to be the $\ell_1$ norm, the result is called the soft-thresholding operator: a practical approach for the LASSO problem.
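The soft-thresholding operator and the resulting proximal-gradient (ISTA) iteration for the lasso can be sketched as follows. This is a minimal illustration with made-up toy data, assuming the step size satisfies $t \le 1/L$ with $L$ the largest eigenvalue of $A^T A$:

```python
import numpy as np

def soft_threshold(v, t):
    """Prox of t*||.||_1: shrink each entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, b, lam, step, iters=500):
    """Proximal gradient (ISTA) for 0.5*||Ax - b||^2 + lam*||x||_1:
    gradient step on the smooth part, then soft-thresholding."""
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = soft_threshold(x - step * A.T @ (A @ x - b), step * lam)
    return x

# Toy problem with A = I: the lasso solution is soft_threshold(b, lam).
A = np.eye(2)
b = np.array([3.0, 0.5])
x_hat = ista(A, b, lam=1.0, step=1.0)
```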

Lecture 9 Acceleration

Accelerates gradient descent by using information from the previous direction. There are lots of different names for this method, mostly coming from Nesterov. The accelerated version of ISTA is called FISTA.
Another important property of acceleration: this method is not a descent method at every step. We also need some special treatment of line search when applying acceleration.

Lecture 10 Matrix differentials

For scalar functions of a vector variable, or vector-valued functions, there are too many partial derivatives, and we need a clever way to manage the details. This is why we need vector calculus and matrix calculus. In the worst situation, we calculate each partial derivative by hand and then fill it into a vector or matrix.
But for simplicity of formula derivation, we need an easier way to derive the gradient or Hessian of a function; we want formulas similar to the scalar case.
Matrix derivatives do not obey a simple chain rule or product rule, even though the sum rule still applies to multivariate functions. So we work with the differential form, and then convert back to the derivative form.

Lecture 11 Matrix differentials & Newton method

Continues the discussion of matrix differentials; one of the important rules for matrix differentials is the application of the trace operator.
Newton's method originates from nonlinear equation solving. In order to solve a nonlinear equation $f(x) = 0$, a first-order approximation is derived first, $f(x + \Delta x) \approx f(x) + f'(x)\,\Delta x$; then we set it to zero and get a rough estimate of the solution, $\Delta x = -f(x)/f'(x)$. After this, we iterate many times until we get a satisfactory precision. This is the scalar version of Newton's method; we then apply it to multivariate functions. The necessary condition for a local minimum is that the gradient equals zero, so we apply the same method to the gradient and find an improvement that makes the gradient zero. Derived as follows: $\nabla f(x + \Delta x) \approx \nabla f(x) + \nabla^2 f(x)\,\Delta x$; we hope to find a $\Delta x$ that makes this zero, which gives $\Delta x = -\nabla^2 f(x)^{-1} \nabla f(x)$. We have another derivation of Newton's method: write down the second-order approximation of the function at $x$, $f(x + \Delta x) \approx f(x) + \nabla f(x)^T \Delta x + \frac{1}{2}\Delta x^T \nabla^2 f(x)\,\Delta x$, and minimize it over $\Delta x$. These two derivations give the same answer. For the Newton step to be a descent direction, the Hessian matrix must be positive definite.
(ps: the Hessian matrix is the Jacobian applied to the gradient)
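The update $\Delta x = -\nabla^2 f(x)^{-1}\nabla f(x)$ can be sketched in a few lines. On a toy quadratic (my own example), a single Newton step lands exactly on the minimizer:

```python
import numpy as np

def newton_minimize(grad, hess, x0, tol=1e-10, max_iter=50):
    """Newton's method for minimization: solve hess(x) dx = -grad(x)
    and step, until the gradient is numerically zero."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x + np.linalg.solve(hess(x), -g)
    return x

# Toy problem: f(x, y) = x^2 + 2y^2, gradient (2x, 4y), constant Hessian.
grad = lambda v: np.array([2.0 * v[0], 4.0 * v[1]])
hess = lambda v: np.diag([2.0, 4.0])
x_star = newton_minimize(grad, hess, [5.0, -3.0])
```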

Lecture 12 Newton Methods

Newton's method with line search is called damped Newton. Newton's method converges quadratically only when the current point is near the optimum. When the current point is far from the optimum, we need to perform line search to ensure sufficient decrease in the function value.
Consider an objective function that is strongly convex, with an equality constraint. We can solve this kind of problem with Newton's method, i.e.

$$\min_x \; f(x) \quad \text{s.t.} \quad Ax = b$$

One simple way to solve this problem is to use the second-order approximation of $f$ to replace the real objective function and use Newton's method on the resulting subproblem, which means solving the KKT linear system

$$\begin{bmatrix} \nabla^2 f(x) & A^T \\ A & 0 \end{bmatrix} \begin{bmatrix} \Delta x \\ w \end{bmatrix} = \begin{bmatrix} -\nabla f(x) \\ 0 \end{bmatrix}$$

We can also call this approach the sequential quadratic programming approach.

Lecture 13 Linear Programming

A problem whose objective function and constraints are all linear functions is called a linear program. Any linear programming problem can be converted to a standard form. In this standard form, all the variables are required to be non-negative and all the constraints must be equality constraints.
But how do we convert to standard form? In most scenarios we can use the following tricks:
1. Use a slack variable to convert an inequality constraint into an equality constraint. For example, we can convert $a^T x \le b$ into $a^T x + s = b$ and add another constraint $s \ge 0$.
2. If some variable may be negative, then we can replace it with the difference of two non-negative variables and two more constraints. For example, $x$ can be converted to $x = x^+ - x^-$ with $x^+ \ge 0$ and $x^- \ge 0$.

I find these two tricks quite useful. When converting an ordinary linear program into standard form, we'll find that #constraints < #variables, which means some variables are free. We call the non-free variables the basis and find corners (vertices) with these variables.
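The slack-variable trick can be checked numerically. A small sketch (the data $A$, $b$ and the feasible point are made-up toy values): the inequality $Ax \le b$ becomes the equality $[A \; I]\,[x; s] = b$ with $s \ge 0$:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 1.0]])
b = np.array([4.0, 6.0])

# Standard form: append one slack variable per inequality row.
A_std = np.hstack([A, np.eye(2)])        # coefficient matrix for [x; s]

x = np.array([1.0, 1.0])                 # feasible for A x <= b, since A x = [3, 4]
s = b - A @ x                            # slacks absorb the gap; must be >= 0
z = np.concatenate([x, s])               # standard-form variable [x; s]
feasible = bool(np.allclose(A_std @ z, b) and np.all(s >= 0.0))
```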

Lecture 14 Linear Programming

Solving linear programs with the simplex method. This method works with the following intuition: a standard-form LP has $m$ equality constraints and $n$ variables. From the perspective of solving linear equations, we must assume $m \le n$; when equality holds we get only one solution. From all $n$ variables, we can select $m$ of them and solve the corresponding linear system to get one vertex of the corresponding polyhedron. This is the first insight. The second insight is that by selecting a different variable to enter the basis, we move from one vertex to another.
About degeneracy, which means the selected $m \times m$ submatrix is singular: we can't get a unique solution for the linear system.

Lecture 15 Duality

Derives the duality for the linear programming problem, but I don't think the derivation is good.

Lecture 16 KKT Condition

For a general optimization problem, we can derive the dual problem of the primal problem. The KKT conditions specify necessary conditions for the primal and dual solutions. But for a convex optimization problem, the KKT conditions are also sufficient, which means that if we can find primal & dual feasible points that satisfy the KKT conditions, these points are the primal & dual solutions.
For machine learning problems we need to convert between the constrained form (C) and the Lagrangian form (L); for example (for convex $f$ and $g$):

(C) Constrained form: $\min_x \; f(x)$ subject to $g(x) \le t$

(L) Lagrangian form: $\min_x \; f(x) + \lambda\, g(x)$

These two forms are equivalent from the perspective of the KKT conditions.

Lecture 17 Duality Correspondence

Define core concepts like conjugate function, conjugate norm and generalized inequality and so on. With these concepts, we can define a unified view of convex optimization. Also introduce some tricks for deriving dual optimization problem.
The conjugate function of a function $f$ is defined as follows:

$$f^*(y) = \sup_x \; y^T x - f(x)$$

Functions of this form are quite common when we have affine equality constraints in a problem. And we also know the conjugates of some widely used functions.
1. Conjugate function of the indicator function $I_C$:

$$I_C^*(y) = \sup_{x \in C} y^T x$$

This means the conjugate function of the indicator function is the support function of the set $C$. For a very special case, when $C$ is a subspace, the conjugate is the indicator function of its orthogonal complement.
2. Conjugate function of a norm: the conjugate of $f(x) = \|x\|$ is the indicator function of the unit ball of the dual norm.
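The definition $f^*(y) = \sup_x (y^T x - f(x))$ can be sanity-checked numerically on a grid. A small sketch for $f(x) = \frac{1}{2}x^2$, whose conjugate is known to be $f^*(y) = \frac{1}{2}y^2$:

```python
import numpy as np

# Approximate f*(y) = sup_x (y*x - f(x)) on a dense grid, for f(x) = 0.5 x^2.
# The true conjugate is f*(y) = 0.5 y^2.
xs = np.linspace(-10.0, 10.0, 100001)
f_vals = 0.5 * xs ** 2

def conjugate_numeric(y):
    return float(np.max(y * xs - f_vals))

vals = [conjugate_numeric(y) for y in (-2.0, 0.5, 3.0)]
```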
Another very important concept in optimization is the cone and the convex cone. A set $C$ is called a cone when $x \in C$ implies $tx \in C$ for all $t \ge 0$. A set is called a convex cone when $x_1, x_2 \in C$ implies $t_1 x_1 + t_2 x_2 \in C$ for all $t_1, t_2 \ge 0$.
With the cone concept, we can convert most problems in optimization into a unified form, then solve them with only one package.
Examples include CVX and TFOCS. These two packages are used as general optimization packages for convex optimization. They convert any problem into a cone optimization problem internally.

Lecture 18 Use of duality

Describes the primal and dual problems of the lasso. Most of the time focuses on the interpretation of the optimal solution from the KKT conditions and the dual problem. One important rule: characterize the solution using the dual problem and the KKT conditions.
Talks about the dual subgradient method, which calculates a subgradient of the dual problem.
Also talks about the decomposition of primal and dual problems for convex optimization. Using decomposition converts one big problem into a set of problems of smaller size.
In the last minutes, talks about the augmented Lagrangian for the primal problem; using this trick accelerates the convergence of the dual subgradient method.

Lecture 19 ADMM and Mirror Descent

ADMM

The Alternating Direction Method of Multipliers represents the combination of two methods:
1. the decomposition method
2. the augmented Lagrangian method

For example, suppose we have a problem of the following form:

$$\min_x \; f(x) + g(x)$$

This is an unconstrained optimization problem, but we can convert it into the following form:

$$\min_{x, z} \; f(x) + g(z) \quad \text{s.t.} \quad x = z$$

This transformation is equivalent to the previous formulation, but with this new formulation we decompose a big problem into two different subproblems, with only one constraint coupling the two variables.
With ADMM we need to minimize the following augmented Lagrangian:

$$L_\rho(x, z, u) = f(x) + g(z) + u^T (x - z) + \frac{\rho}{2}\|x - z\|_2^2$$

This is the transformed objective function of the augmented Lagrangian method, but the subsequent optimization steps are quite different: we minimize over $x$ and $z$ alternately, then update the dual variable:

$$x^{k+1} = \arg\min_x \; L_\rho(x, z^k, u^k), \quad z^{k+1} = \arg\min_z \; L_\rho(x^{k+1}, z, u^k), \quad u^{k+1} = u^k + \rho\,(x^{k+1} - z^{k+1})$$
From my personal understanding, ADMM uses a transformation to convert a single problem into multiple smaller problems with multiple consensus constraints (like $a = b$, $c = d$, and so on), then uses the augmented Lagrangian method to transform the problem, and at last uses the dual decomposition method to optimize each variable independently.
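To make the alternating structure concrete, here is a minimal ADMM sketch (my own toy example, not the lecture's) for the lasso split $\min \frac{1}{2}\|Ax-b\|^2 + \lambda\|z\|_1$ subject to $x = z$, written with the scaled dual variable $u$ and an arbitrary penalty $\rho = 1$:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def admm_lasso(A, b, lam, rho=1.0, iters=200):
    """ADMM for 0.5||Ax-b||^2 + lam||z||_1 s.t. x = z (scaled dual form)."""
    n = A.shape[1]
    x, z, u = np.zeros(n), np.zeros(n), np.zeros(n)
    AtA, Atb = A.T @ A, A.T @ b
    for _ in range(iters):
        # x-update: minimize the smooth part plus the quadratic penalty
        x = np.linalg.solve(AtA + rho * np.eye(n), Atb + rho * (z - u))
        # z-update: proximal step on the l1 term
        z = soft_threshold(x + u, lam / rho)
        # dual update on the consensus constraint x = z
        u = u + x - z
    return z

# Toy data with A = I: the lasso solution is soft_threshold(b, lam) = [2, 0].
z_hat = admm_lasso(np.eye(2), np.array([3.0, 0.5]), lam=1.0)
```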

Mirror Descent

The reason this method is called mirror descent is the way it works: for the primal space we define a mirror function $\phi$; at each iteration, we move along a direction in the dual (gradient) space and then mirror back to the primal space.
We say a function $\phi$ is $\sigma$-strongly convex when it satisfies the following condition:

$$\phi(y) \ge \phi(x) + \nabla \phi(x)^T (y - x) + \frac{\sigma}{2}\|y - x\|^2$$

And we can define the Bregman divergence as follows:

$$D_\phi(y, x) = \phi(y) - \phi(x) - \nabla \phi(x)^T (y - x)$$

This divergence is used to measure the size of the higher-order Taylor series terms. With this divergence definition, we can define the minimization of each iteration:

$$x_{k+1} = \arg\min_x \; \nabla f(x_k)^T x + \frac{1}{t} D_\phi(x, x_k)$$

And we can get a solution satisfying:

$$\nabla \phi(x_{k+1}) = \nabla \phi(x_k) - t\, \nabla f(x_k)$$

This needs the gradient of $\phi$ to have a special property: it must be invertible, so that we can map back to the primal space.

For some special $\phi$ we can directly get a closed-form solution:
1. Define $\phi(x) = \sum_i x_i \log x_i$ (negative entropy on the simplex). Then $x_{k+1, i} \propto x_{k, i}\, e^{-t \nabla_i f(x_k)}$; this method is called exponentiated gradient.
2. Define $\phi(x) = \frac{1}{2}\|x\|_2^2$; then the corresponding method is plain gradient descent.
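Case 1 can be sketched as an exponentiated-gradient update on the probability simplex. This is my own toy example: a linear cost, so the mass should concentrate on the cheapest coordinate; the cost vector and step size are made-up values:

```python
import numpy as np

def exponentiated_gradient(grad, x0, step=0.5, iters=300):
    """Mirror descent with the negative-entropy mirror map:
    multiplicative update, then renormalize back onto the simplex."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x * np.exp(-step * grad(x))
        x = x / x.sum()
    return x

# Minimize c^T x over the simplex; the optimum puts all mass on argmin(c).
c = np.array([1.0, 0.5, 2.0])
x_hat = exponentiated_gradient(lambda x: c, np.ones(3) / 3)
```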

Lecture 20 Quadratic Program and Cone Program

Concepts and definition of quadratic program and cone program and semi definite program.

Quadratic Program

A quadratic program refers to an optimization problem of the following form:

$$\min_x \; \frac{1}{2} x^T Q x + c^T x \quad \text{s.t.} \quad Ax \le b$$

When $Q$ is a positive semi-definite matrix, the quadratic program becomes a convex problem. Many problems in machine learning are convex quadratic programs, like the SVM.

Second Order Cone Programming

For SOCP, we have problems with the following structure:

$$\min_x \; c^T x \quad \text{s.t.} \quad \|A_i x + b_i\|_2 \le c_i^T x + d_i, \; i = 1, \dots, m$$

where each constraint requires a point to lie in a second-order cone. QP is a special case of SOCP.

Semi Definite Program

SDP has a broader definition; SOCP is also a special case of SDP. SDP is widely used in machine learning and statistics.

$$\min_X \; \mathrm{tr}(C X) \quad \text{s.t.} \quad \mathrm{tr}(A_i X) = b_i, \; X \succeq 0$$

and we require $X$ to be symmetric and positive semi-definite. We also call constraints of this kind Linear Matrix Inequalities. One interesting thing about the positive semi-definite cone is that its dual cone is also the positive semi-definite cone, i.e. it is self-dual.

We get special structure from LP, QP, SOCP, and SDP; the next step is general convex programming, i.e. without specific structure to exploit.

Lecture 21 Quadratic & Cone Program duality; SVM

The derivation of the dual problem for QP and cone programming is trivial. But I found the real application of the dual cone, and the benefit of using cone programming to unify all the problem representations. Here is one example; assume we're required to solve such a problem:

$$\min_x \; c^T x \quad \text{s.t.} \quad Ax = b, \; x \in K$$

We can derive the dual problem of the primal by forming the Lagrangian; the inner minimization stays bounded only when the multiplier is restricted to $K^*$, i.e. the dual cone of the cone $K$. This usage exactly matches the definition of the dual cone, interesting!!!

For the SVM, it is all still the same tricks. No special interesting points.

Lecture 22 SVM & Interior point method

Derives the dual problem of the SVM; the dual solution provides another insight about the SVM. The most important insight is that the solution is just a combination of training points. And we can use the kernel trick to fully exploit this specific structure.

Interior Point Method

One of the most important methods for solving all kinds of optimization problems. The basic intuition about interior point methods is that the solution can be found by following points in the interior of the feasible region. In this section, the analytic center is derived as the starting point of the interior path leading to the optimal solution.

Lecture 23 Interior point method

Derives the interior point method for linear programming. The interior point method for linear programming is a method that starts at the analytic center and ends at the optimal point. The interior point method can be interpreted as the solution of the following problem:

$$\min_x \; t\, c^T x - \sum_i \log x_i \quad \text{s.t.} \quad Ax = b$$

This is one example of a barrier method, using the log barrier. In order to solve the original problem, we need to make $t$ larger and larger.
Another interesting property of this problem is that we can estimate the duality gap using the parameter $t$: with $m$ inequality constraints, the gap is $m/t$.
We can also interpret the interior point method as a perturbed version of the KKT conditions.
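A minimal sketch of the barrier idea on a toy one-dimensional problem of my own (minimize $x$ subject to $x \ge 1$): for each $t$ we minimize $t x - \log(x - 1)$ with damped Newton steps. The central-path point is $x(t) = 1 + 1/t$, so the suboptimality matches the $m/t$ gap estimate with $m = 1$ constraint:

```python
def central_path_point(t, x0=2.0, tol=1e-10, max_iter=100):
    """Damped Newton on the barrier objective t*x - log(x - 1)."""
    x = x0
    for _ in range(max_iter):
        g = t - 1.0 / (x - 1.0)          # derivative of the barrier objective
        if abs(g) < tol:
            break
        h = 1.0 / (x - 1.0) ** 2         # second derivative
        step = g / h
        while x - step <= 1.0:           # damp to stay strictly feasible
            step *= 0.5
        x -= step
    return x

# As t grows, x(t) = 1 + 1/t approaches the true optimum x* = 1.
xs = [central_path_point(t) for t in (1.0, 10.0, 100.0)]
```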

Lecture 24 Interior Point Method continued

Continues deriving the interior point method from the perspective of the perturbed KKT conditions. Interior point methods for linear programming and nonlinear programming can both be derived from this idea. We take the optimality conditions, add a tiny perturbation to each variable, drop the second-order terms, and then solve the corresponding linear system, the same idea as differentials.

Lecture 25 Coordinate Descent

Coordinate descent is a class of algorithms that only works for certain kinds of functions:
1. convex and differentiable functions;
2. functions of the form $f(x) + h(x)$, where $f(x)$ must be convex and differentiable, and $h(x)$ only needs to be convex but must be separable.
One interesting phenomenon about coordinate descent is that its convergence is often faster in practice than other first-order methods, and it is widely used in machine learning and statistics.
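A minimal sketch of cyclic coordinate descent for the separable case $f(x) + h(x)$ with $h = \lambda\|\cdot\|_1$ (the lasso; my own toy data). Each coordinate update is an exact one-dimensional minimization, computed by soft-thresholding:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def cd_lasso(A, b, lam, iters=100):
    """Cyclic coordinate descent for 0.5||Ax-b||^2 + lam||x||_1."""
    n = A.shape[1]
    x = np.zeros(n)
    col_sq = (A ** 2).sum(axis=0)            # ||A_j||^2 for each column
    for _ in range(iters):
        for j in range(n):
            # residual with the contribution of coordinate j removed
            r = b - A @ x + A[:, j] * x[j]
            # exact minimization over x_j alone
            x[j] = soft_threshold(A[:, j] @ r, lam) / col_sq[j]
    return x

# Toy data with A = I: the solution is soft_threshold(b, lam) = [2, 0].
x_hat = cd_lasso(np.eye(2), np.array([3.0, 0.5]), lam=1.0)
```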

Lecture 26 Path Algorithm

The path algorithm is an algorithm that gives the exact solution for every value of the regularization parameter $\lambda$. It's too hard for me; skipped.

Lecture 27 Dual Averaging

Originating from the subgradient method, the dual averaging method assigns equal importance to each subgradient and optimizes the following problem at step $k$:

$$x_{k+1} = \arg\min_x \; \bar g_k^T x + \frac{\beta_k}{2}\|x\|_2^2$$

Here $\bar g_k$ represents the averaged subgradient and $\beta_k$ is a growing regularization coefficient. Using this method, we can achieve a better convergence rate than the subgradient method. Using Regularized Dual Averaging (RDA), we can get sparse solutions for online optimization problems.

Summary of This Course

This course introduces first-order methods; that introduction is excellent and gives a state-of-the-art description. For second-order methods, they do not talk about CG or L-BFGS, but they still give a very good presentation of duality in lectures 17 and 18.
The introduction of QP, SOCP and SDP is also very interesting, especially the examples they used. Wonderful! From coordinate descent and dual averaging, I learned a lot.