Thursday, September 17, 2015

AM207 Monte Carlo Method and Stochastic Optimization


Lecture 1

Introductory lecture; I have almost forgotten all the material.

Lab session 1

Introduction to probability distributions, with Python examples.

Lecture 2 Basic Monte Carlo Method

  1. Importance sampling
    This method is only for computing the integral of some function, not for drawing samples.

  2. Rejection sampling
    This is the first method for actually drawing samples from a distribution (a sketch follows below).

One important issue with Monte Carlo methods is that the variance of the estimate depends on the sample size. We need better ways to control the variance of the estimator.
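
Here is a minimal sketch of rejection sampling; the unnormalized Beta(2,2)-shaped target on [0, 1], the uniform proposal, and the envelope constant M are my own illustrative choices, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    # Unnormalized target density on [0, 1], proportional to Beta(2, 2).
    return x * (1 - x)

M = 0.25  # envelope: max of target(x) on [0, 1], attained at x = 0.5

def rejection_sample(n):
    samples = []
    while len(samples) < n:
        x = rng.uniform(0, 1)   # draw from the proposal q(x) = Uniform(0, 1)
        u = rng.uniform(0, M)   # accept x with probability target(x) / M
        if u < target(x):
            samples.append(x)
    return np.array(samples)

draws = rejection_sample(10_000)
print(draws.mean())  # close to 0.5, the mean of Beta(2, 2)
```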

Lecture 3 Variance Reduction

Techniques for reducing the variance of Monte Carlo estimators.

  1. Control Variates

  2. Antithetic Variates
    requires the integrand to be monotonic (a sketch follows this list).

  3. Stratified Sampling
    split the domain into multiple intervals and perform the estimation on each interval.
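
As a quick illustration of variance reduction, here is a sketch of antithetic variates for estimating the integral of exp(x) over [0, 1] (a toy example of my own; exp is monotonic, so each pair is negatively correlated):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Plain Monte Carlo estimate of the integral of exp(x) over [0, 1].
u = rng.uniform(size=n)
plain = np.exp(u)

# Antithetic variates: pair u with 1 - u and average within each pair.
half = rng.uniform(size=n // 2)
anti = 0.5 * (np.exp(half) + np.exp(1 - half))

print(plain.mean(), plain.var() / n)       # estimate and variance of the mean
print(anti.mean(), anti.var() / (n // 2))  # same target, smaller variance
```

The true value is e - 1 ≈ 1.718; both estimators should be close, with the antithetic one showing a smaller variance.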

Lecture 4 Bayesian Formalism

One simple distinction between Bayesian and frequentist methods is what they return: a Bayesian method returns a distribution, while a frequentist method returns a point estimate.
Bayesian methods start with a prior distribution on the parameter; after observing some data, the prior is updated to give the posterior distribution on the parameter.

Lecture 5 Bayesian Formalism: Part 2 && MCMC

Sampling method:
1. Inverse transformation
2. Rejection Sampling
3. MCMC

MCMC is the most widely used approach for sampling from high-dimensional, complex distributions. What properties do we expect an MCMC chain to have?
1. Aperiodic: we do not want patterns (or loops) in the random walk. If such a pattern exists, the corresponding samples will not be random.
2. Irreducible: at any step, the probability of reaching any point is larger than 0. We do not want the random walk to end at a deterministic sample; this way the chain can reach any region of the domain.
3. Detailed balance: between any two sample points x and y, the forward and backward probability flows are equal, i.e. p(x) T(x -> y) = p(y) T(y -> x). This guarantees the samples come from the distribution we want.

Lecture 6 MCMC

Starts the real introduction to MCMC with detailed balance.
First, the Metropolis method, which assumes a symmetric proposal probability q(y|x). We draw a new point y from q(y|x), where x is the current point, then accept the new point with probability

a = min(1, p(y) / p(x)).

Perform a uniform sampling between 0 and 1 to get a value u. If u is less than a, the new point is accepted; otherwise we stay at the current point.
The interpretation of this acceptance probability is quite simple: if the next point has a higher probability than the current point, it is accepted unconditionally; if it has a smaller probability, it is accepted only with some probability. In practice, a Gaussian or uniform proposal satisfies the symmetry requirement.
What if the proposal distribution is not symmetric? We correct the acceptance probability in the Metropolis-Hastings way:

a = min(1, p(y) q(x|y) / (p(x) q(y|x))).

In the usual case, though, we still use a symmetric proposal distribution rather than an asymmetric one (sketch below).
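
A minimal Metropolis sampler with a symmetric Gaussian proposal; the standard normal target is an illustrative choice of mine:

```python
import numpy as np

rng = np.random.default_rng(2)

def log_target(x):
    # Log of an unnormalized standard normal density (illustrative target).
    return -0.5 * x * x

def metropolis(n_steps, step=1.0):
    x = 0.0
    chain = np.empty(n_steps)
    for t in range(n_steps):
        y = x + rng.normal(scale=step)  # symmetric Gaussian proposal
        # Accept with probability min(1, p(y) / p(x)), done in log space.
        if np.log(rng.uniform()) < log_target(y) - log_target(x):
            x = y
        chain[t] = x
    return chain

chain = metropolis(20_000)
print(chain[5_000:].mean(), chain[5_000:].std())  # roughly 0 and 1 after burn-in
```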

Lecture 7 MCMC Convergence

For MCMC sampling methods, convergence is a serious issue. Convergence means we are getting samples according to the target distribution. Before convergence there is a phase called burn-in; samples from this stage cannot be used to estimate statistics. There is no reliable method to determine when burn-in has finished, only some heuristics:
1. Trace of variables
Plot the iteration number on the horizontal axis and the value of the variable on the vertical axis. We watch how the variable changes over time; if it looks random, the chain may have converged.
2. Geweke test
The Geweke test computes a statistic from the means and variances of two non-overlapping segments of the chain:

z = (mean_A - mean_B) / sqrt(Var(mean_A) + Var(mean_B))

In the usual case, the value of z should be between -2 and 2. When using this method, we just split the current samples into two parts, calculate the two means and the corresponding variances, and compute z. This seems useful (sketch below).
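
A rough sketch of the Geweke diagnostic; note that I use naive variance estimates of the segment means here, rather than the spectral-density estimates of the full test:

```python
import numpy as np

def geweke_z(chain, first=0.1, last=0.5):
    """Geweke z-score: compare the mean of the first 10% of the chain
    with the mean of the last 50%; |z| > 2 suggests non-convergence."""
    chain = np.asarray(chain)
    a = chain[: int(first * len(chain))]
    b = chain[-int(last * len(chain)):]
    return (a.mean() - b.mean()) / np.sqrt(a.var() / len(a) + b.var() / len(b))
```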

Lecture 8 Gibbs Sampling

Introduces another test statistic for MCMC convergence. Gibbs sampling is only briefly mentioned.

Lecture 9 More on Gibbs Sampling

The key point: Gibbs sampling is a special case of the Metropolis-Hastings algorithm in which the proposal distribution is the full conditional distribution (so every proposal is accepted). All the rest is examples; a sketch follows.
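
A minimal Gibbs sampler for a bivariate standard normal with correlation rho, an illustrative case of mine where both full conditionals are known normals:

```python
import numpy as np

rng = np.random.default_rng(3)
rho = 0.8  # correlation of the illustrative bivariate standard normal

def gibbs(n_steps):
    x, y = 0.0, 0.0
    chain = np.empty((n_steps, 2))
    for t in range(n_steps):
        # Each full conditional of a bivariate normal is itself normal.
        x = rng.normal(rho * y, np.sqrt(1 - rho ** 2))
        y = rng.normal(rho * x, np.sqrt(1 - rho ** 2))
        chain[t] = x, y
    return chain

chain = gibbs(20_000)
print(np.corrcoef(chain[5_000:].T))  # off-diagonal entries close to rho
```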

Lecture 10 Slice MC and Multivariate Slice MC

Data Augmentation
The key idea of data augmentation: we want to sample from a distribution p(x), so we augment it with a joint distribution p(x, y) whose marginal over x is p(x). Then we sample from the joint and discard y to get samples from p(x).

Slice Sampling
One example of data augmentation: the auxiliary variable is a uniform height under the density curve (sketch below).
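
A rough sketch of a one-dimensional slice sampler with a stepping-out procedure; the interval width w and the Gaussian target are my own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)

def slice_sample(target, x0, n_steps, w=1.0):
    """1-D slice sampler with stepping-out; target is an unnormalized density."""
    x = x0
    out = np.empty(n_steps)
    for t in range(n_steps):
        y = rng.uniform(0, target(x))   # auxiliary height under the curve
        lo = x - w * rng.uniform()      # randomly placed initial interval
        hi = lo + w
        while target(lo) > y:           # step out until both ends leave the slice
            lo -= w
        while target(hi) > y:
            hi += w
        while True:                     # sample uniformly, shrinking on rejection
            x_new = rng.uniform(lo, hi)
            if target(x_new) > y:
                x = x_new
                break
            if x_new < x:
                lo = x_new
            else:
                hi = x_new
        out[t] = x
    return out

draws = slice_sample(lambda x: np.exp(-0.5 * x * x), 0.0, 10_000)
print(draws.mean(), draws.std())  # roughly 0 and 1
```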

Lecture 11 Hierarchical Bayesian Method

Nothing special, just standard hierarchical models: directed acyclic graphical models. But the derivation of the conditional distributions is quite hard, and I don't know how well this carries over to more complex models.

Lecture 12 Hamiltonian Monte Carlo

Monte Carlo and Markov chain Monte Carlo methods are all based on random walks. From this perspective, two aspects deserve attention:
1. How to propose a new point (based on the current point or other information)?
2. The criterion for accepting a proposed point. This just needs to ensure detailed balance, so that we eventually sample points according to the target distribution.

For an MCMC algorithm to work, we need to ensure detailed balance, and to ensure the random walk has no patterns and can reach every part of the space.

Hamiltonian Monte Carlo is one type of MCMC, motivated by one fact: for efficient sampling, the step size should be large where the probability surface is flat, and small where the surface varies a lot. This ensures the space is explored fully.

HMC is motivated by physical systems and Newton's laws of motion. Even though I do not know the details of HMC, I know the main improvement is in the way new candidate points are proposed (making it much more efficient).

Lecture 13 Parallel Tempering

Start multiple chains with different temperatures. The key idea is the use of temperature: any probability distribution p(x) can be rewritten as p(x) ∝ exp(log p(x)) and tempered as p_T(x) ∝ exp(log p(x) / T). This turns any distribution into a Boltzmann distribution, where T is the temperature. A higher temperature flattens the distribution and increases the acceptance rate when sampling; a low temperature decreases it. The core idea of parallel tempering is to run multiple MCMC chains at different temperatures, some with T = 1 and others with higher temperatures, occasionally swapping states between neighboring chains, so that MCMC explores more of the space more efficiently.

Lecture 14 Simulated Annealing

Borrowing the concept of temperature from lecture 13. Start with a high temperature, then decrease it slowly, hoping to jump out of local minima. The reason: a high temperature smooths the objective (or distribution). A sketch follows.
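
A minimal simulated annealing sketch on a hypothetical one-dimensional multimodal objective; the objective, proposal width, and cooling schedule are all my own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(5)

def objective(x):
    # A hypothetical multimodal function with many local minima.
    return x ** 2 + 10 * np.sin(3 * x)

x, T, best = 5.0, 10.0, 5.0
for step in range(20_000):
    y = x + rng.normal(scale=0.5)
    # Metropolis rule at temperature T: always accept improvements,
    # accept uphill moves with probability exp(-(f(y) - f(x)) / T).
    if rng.uniform() < np.exp(min(0.0, -(objective(y) - objective(x)) / T)):
        x = y
    if objective(x) < objective(best):
        best = x
    T = max(1e-3, T * 0.9995)  # slowly lower the temperature

print(best, objective(best))
```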

Lecture 15 Stochastic Gradient Descent

Discusses optimization algorithms; nothing special.

Lecture 16 Time Series

Beyond the i.i.d. assumption about data points, there are other forms of data. A time series is a special form in which each data point depends on previous data points. Several concepts relate to time-series data:
1. Autocorrelation (see the sketch after this list).
2. Autoregression.
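
A small sketch computing the sample autocorrelation, checked on an AR(1) series whose theoretical autocorrelation at lag k is 0.7^k (my own toy example):

```python
import numpy as np

def autocorrelation(x, max_lag):
    """Sample autocorrelation of a series for lags 0..max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[:len(x) - k], x[k:]) / denom
                     for k in range(max_lag + 1)])

rng = np.random.default_rng(6)
x = np.zeros(10_000)
for t in range(1, len(x)):
    x[t] = 0.7 * x[t - 1] + rng.normal()   # AR(1) process
print(autocorrelation(x, 5))               # roughly 1, 0.7, 0.49, ...
```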

Lecture 17 Times Series & HMM

Continues with time-series models and HMMs; I'm not interested.

Lecture 18 HMM & Kalman Filter

Models like the HMM and Kalman filter; I already know the details.

Lecture 19 Expectation Maximization

Some derivations, with GMM as the example. Nothing special from this lecture.

Lecture 20 Gaussian Process

Gaussian processes are quite complicated. Prediction for a new point is based on the seen points. A kernel is used to construct the covariance matrix of a Gaussian distribution, like the Gram matrix. The new point and the original points form a joint Gaussian distribution, which can be reduced to a conditional distribution for prediction (sketch below). Some other parts are still unclear to me.
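
As I understand it, the prediction reduces to conditioning a joint Gaussian. A minimal sketch; the RBF kernel, length scale, and noise level are my own assumptions:

```python
import numpy as np

def rbf_kernel(a, b, ell=1.0):
    # Squared-exponential kernel; on training points this is the Gram matrix.
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

def gp_predict(x_train, y_train, x_test, noise=1e-2):
    """Posterior mean/variance of a GP: condition the joint Gaussian of
    (train, test) outputs on the observed training targets."""
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    K_s = rbf_kernel(x_test, x_train)
    mean = K_s @ np.linalg.solve(K, y_train)
    cov = rbf_kernel(x_test, x_test) - K_s @ np.linalg.solve(K, K_s.T)
    return mean, np.diag(cov)

x = np.linspace(0, 5, 20)
y = np.sin(x)
mean, var = gp_predict(x, y, np.array([2.5]))
print(mean, var)  # mean close to sin(2.5), small variance
```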

Lecture 21 Graphical Model

Covering graphical models in one lecture is impossible; skipped.

Lecture 22 Graphical Model cont.

Skipped as well.

Summary

After finishing this course, I think it is not so advanced, but the most important thing is that it taught me more details about MCMC, so I can read more books, understand them, and implement some algorithms myself. Overall, this is a good course for MCMC. The optimization, graphical model, and EM parts offered nothing special.

Thursday, August 6, 2015

Notes on Foundation of Machine Learning

@(Machine Learning)[Notes, Machine Learning]


Chapter 1 Introduction

Three different criteria for learning algorithm:
1. Time complexity
2. memory complexity
3. sample complexity: how many samples are needed to learn concepts in the hypothesis set.

Some other concepts: training set, validation set, and test set. One important thing I learned from this chapter is the notion of a Learning Scenario. Different learning scenarios differ in the type of training and testing data (labeled or not), how the training and testing data arrive, and how they are used. We have the following scenarios:
1. Supervised Learning
2. Unsupervised Learning
3. Active Learning
4. Dimension Reduction
5. Transductive Learning
6. Online Learning

We also have learning tasks; a task is the target we want to achieve:
1. Classification
2. Regression
3. Ranking
4. Sequence labeling

and so on.

Chapter 2 PAC Learning Framework

Machine learning uses a learning algorithm A to find one hypothesis h among a hypothesis class H according to a specific loss function L. Two criteria for a hypothesis h:
1. loss on the available training sample, called the Empirical Risk;
2. average loss over all possible samples, called the Risk.

The most important question is the relation between empirical risk and risk. This kind of relation is called a Generalization Bound, because in machine learning low risk, not low empirical risk, is the ultimate goal. Low risk means good generalization ability. If the generalization bound is known, we can estimate the real performance of the hypothesis returned by the learning algorithm A. Under certain circumstances, we can put the same question another way: how many samples do we need so that the returned hypothesis has risk less than a prescribed ε?
Basic background for the derivation: each training sample comes from a sample space X, and all samples are distributed according to a distribution D. There is a concept family C; each concept c maps a sample into the target value space Y, i.e. c: X -> Y. The learning algorithm uses a hypothesis space H to approximate the concept class C. Each hypothesis h maps a sample into Y. Here we assume the hypothesis space H is finite, with size |H|.
Now PAC-learnability is defined as follows:
A concept class C is PAC-learnable if there exist an algorithm A and a polynomial function poly(·,·,·,·) such that for all ε > 0 and δ > 0, for all distributions D on X and any target concept c in C, if the sample size m ≥ poly(1/ε, 1/δ, n, size(c)), then the following statement holds:
Pr[R(h_S) ≤ ε] ≥ 1 - δ.

In order to derive the bounds for the consistent and inconsistent cases, we need the union bound and Hoeffding's inequality.
Some generalization to above statement:
1. The definition assumes the label is a deterministic function of the input instance, i.e. we can compute the label from the input data. In real applications, however, the relation between input and output is stochastic: there is a conditional distribution over labels for each training instance, so we assume a distribution over the joint input-output space X × Y. One possible explanation: we do not have enough features to distinguish the training instances. For example, using height and weight to classify a person as man or woman: for a given height and weight, there is some probability of each class.
2. When the output is a distribution conditioned on the input sample, we have the Bayes error. This is the minimum error the best classifier can achieve.
3. We can decompose the difference between the risk of the hypothesis returned by the learning algorithm and the Bayes risk into a combination of estimation error and approximation error.
4. According to the generalization bound, the risk of a hypothesis depends on the empirical error and on the ratio between the number of hypotheses and the sample size. We can select among hypothesis sets of different sizes for model selection, or use regularization to perform model selection (see the bound below).
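
For reference, the finite-hypothesis generalization bound I remember from this chapter (inconsistent case), holding with probability at least 1 - δ simultaneously for all h in H:

$$
R(h) \le \hat{R}(h) + \sqrt{\frac{\log|H| + \log\frac{2}{\delta}}{2m}}
$$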

Chapter 3 Rademacher Complexity and VC Dimension

Rademacher complexity and VC dimension are two different ways to describe the complexity of a hypothesis space. When the number of hypotheses is infinite, the generalization bound derived in chapter 2 is useless; Rademacher complexity and VC dimension evaluate the complexity of a hypothesis set by something other than the number of hypotheses.
With these two notions of hypothesis complexity, combined with concentration inequalities, we get uniform generalization bounds.

Chapter 4 SVM

Starts with the linearly separable case and then extends to the non-separable case. The derivation of the primal and dual problems is the way I expected. Most important is the derivation of a generalization bound for SVM based on the margin rather than the feature dimension. One more interesting thing: using the hinge loss actually optimizes the generalization bound of the SVM.

Chapter 5 Kernel Method

For kernel methods, just know the implicit mapping from the input space to a high-dimensional space. The other material is about positive definite symmetric (PDS) kernels; for the rest, like RKHS, it's unnecessary to know all the details.

Chapter 6 Boosting

Boosting starts from one question: can we build a strong classifier out of weak classifiers? What is the definition of a weak classifier? How can we combine weak classifiers? What is the generalization bound, and what factors affect the generalization ability?
Boosting is a strategy; AdaBoost is one concrete algorithm, usable when the base weak classifiers form a hypothesis set mapping the input space to {-1, +1}. The basic idea is to assign a weight to each training sample, give more weight to the samples that are harder to classify, and combine the classifiers from each iteration linearly to get the final classifier.
AdaBoost optimizes an exponential loss function and can be interpreted as a coordinate descent algorithm.
Maximum margin cannot fully explain the performance of AdaBoost.
One disadvantage of AdaBoost: it is sensitive to noisy data. This matters, because noisy data is everywhere and hard to control.
Another question is the selection of the base classifier. Usually we use a decision tree as the base learner and train a tree on the currently weighted samples (a sketch follows).
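
A compact AdaBoost sketch using depth-1 sklearn trees (stumps) as the weak learner, with labels in {-1, +1}; the implementation details are my own, following the standard algorithm:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, n_rounds=50):
    """AdaBoost with decision stumps; y must be in {-1, +1}."""
    m = len(y)
    w = np.full(m, 1.0 / m)          # sample weights, updated every round
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum()
        if err >= 0.5:               # no better than random: stop
            break
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        # Misclassified (harder) samples get more weight next round.
        w *= np.exp(-alpha * y * pred)
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def predict(stumps, alphas, X):
    # Linear combination of the weak classifiers, then take the sign.
    scores = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(scores)
```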

Chapter 7 Online Learning

Online learning handles data points one by one: when the learning algorithm receives a data point, it makes a prediction and incurs a loss on that prediction.
Online learning differs from the techniques of the previous chapters: no fixed distribution is assumed on the input. This means we can't use risk to evaluate a hypothesis, so online learning uses another set of notions for hypothesis evaluation: the mistake model and regret analysis.
Online learning with expert advice: experts give advice for each input sample.
The mistake model is simple: count the number of mistakes the learning algorithm makes during T rounds of learning. Regret is defined as the algorithm's cumulative loss minus the loss of the single best expert.

All analyses of online learning algorithms bound the total number of mistakes or the regret over T rounds of learning.
In the experts setting, we have the Weighted Majority (WM), Randomized Weighted Majority (RWM), and Exponentially Weighted Majority (EWM) algorithms; a sketch of WM follows.
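
A minimal Weighted Majority sketch, with predictions and labels in {0, 1}; beta = 0.5 is the usual halving choice:

```python
import numpy as np

def weighted_majority(expert_preds, labels, beta=0.5):
    """Weighted Majority: multiply the weight of every erring expert
    by beta; predict by weighted vote.
    expert_preds: (T, n_experts) array of 0/1 predictions."""
    w = np.ones(expert_preds.shape[1])
    mistakes = 0
    for preds, y in zip(expert_preds, labels):
        vote = 1 if w[preds == 1].sum() >= w[preds == 0].sum() else 0
        mistakes += int(vote != y)
        w[preds != y] *= beta        # penalize the wrong experts
    return w, mistakes
```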
When analyzing the mistake bound or regret, we usually derive upper and lower bounds on the total expert weight and combine them to get the result.
For linear classification, the perceptron and winnow algorithms are the two standard approaches.
Online learning assumes no distribution over the data, while batch algorithms assume a fixed distribution D. What happens when we train an online learning algorithm on data generated from a fixed distribution? The risk of the resulting (averaged) hypothesis can be bounded by the average online loss plus a term of order sqrt(1/T), where T is the number of learning rounds.

Chapter 8 Multi-Class Classification

Compared with binary classification, multi-class classification differs in three aspects:
1. A large number of classes requires more computational resources.
2. Classes can be unbalanced: some classes have many more samples than others, which requires changing the loss function.
3. There may be internal structure among the output classes, which should be used in the design of the loss function.

This chapter starts with the generalization bound for multi-class classification. I found the standard steps for building a generalization bound:
1. find the Rademacher complexity of the hypothesis set;
2. derive the relation between Rademacher complexity and the margin, then get a bound that relies on the margin.

These two steps have appeared multiple times in this book.
For multi-class classification:
1. Multi-class SVM, defining the margin as the difference between the score of the correct label and the second-largest label score.
2. Boosting: same as AdaBoost; it is used for the multi-label problem.
3. Decision tree: one important point is that a decision tree splits the input space into multiple regions, and each region gets its own label.

Other methods, called aggregated methods:
1. one vs. rest
2. one vs. one
3. error-correcting codes, a quite interesting method that uses Hamming distance to decide the label of an instance.

Multi-class classification and sequence labeling
For sequence labeling, the following objective functions are provided:
1. Maximum Margin objective
2. Struct-SVM objective

Chapter 9 Ranking

Ranking is a new task compared with classification and regression, although it can be solved using the same techniques.
For the ranking task, there are two different approaches:
1. score-based approach: train a hypothesis mapping from X to R. At prediction time, each instance gets a score from the hypothesis and is ranked according to this score.
2. preference-based approach: I do not understand this method! It splits ranking into two stages: the first stage trains a preference function from the training sample, and the second stage performs ranking based on the scores of the preference function.

Score-based approach for ranking
This is a pair-wise ranking method: the training data are pairs of data points and the objective is the average number of mis-ranked pairs. Each instance gets a score from the hypothesis; SVM-style and RankBoost methods can be used to train the hypothesis. The analysis of the generalization bound is based on Rademacher complexity or the margin, much like SVM and AdaBoost.
One interesting scenario is bipartite ranking, where every instance belongs to one of two classes and the only requirement is that positive instances score higher than negative ones. This special scenario can be solved with boosting methods.

Preference-based ranking
I don't understand the point of this method.

Chapter 10 Regression

Regression has real-valued targets. This chapter derives the generalization bound of linear regression based on Rademacher complexity. Another concept, similar to the VC dimension, is the pseudo-dimension, which is based on shattering. Concrete regression algorithms include linear regression (with a linear hypothesis), ridge regression, kernel ridge regression, support vector regression, lasso, and group lasso. Some analyses are based on Rademacher complexity, not on the margin.

Chapter 11 Algorithmic Stability

I only care about one definition of stability: when two samples S and S' differ by a single data point, the difference in the output of the learned hypothesis can be bounded by some constant β. This is called (uniform) stability.

Chapter 12 Dimension Reduction

PCA is the most standard dimensionality reduction method, aiming to preserve the maximum variance of the original data. Kernel PCA is a version based on a PDS kernel and has many connections with other nonlinear dimensionality reduction methods.
Other methods include Isomap (keeps pairwise distances), LLE (keeps neighborhood relations), and Laplacian Eigenmaps (keeps neighbor distances).

Chapter 13 Learning Automata and Languages

Skipped, don’t think it’s relevant to my work.

Chapter 14 Reinforcement Learning

Reinforcement learning is based on one abstraction: the MDP (Markov decision process), which includes states, actions, rewards, the transition probability (depending on the state and action), and the reward probability.
There are two different assumptions:
1. The transition probability and reward probability is known to learner.
2. The transition probability and reward probability is unknown to learner.

A policy is a mapping from states to actions: π: S -> A.

In the first case, we only need to plan actions based on the known quantities. In the second case, the learner needs to estimate the parameters from interaction with the environment. For an unknown environment, there are model-free approaches, which estimate the policy directly, and model-based approaches, which first estimate the transition and reward probabilities and then derive the policy. Q-learning is the most famous algorithm that estimates the policy directly (a sketch follows below).
All these methods are based on stochastic approximation.
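
A toy Q-learning sketch on a hypothetical 5-state chain MDP; the MDP, learning rate, and epsilon-greedy exploration are my own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical chain MDP: 5 states in a row, actions 0 = left, 1 = right,
# reward 1 only on entering the rightmost (terminal) state.
n_states, n_actions, gamma, alpha, eps = 5, 2, 0.9, 0.1, 0.3
Q = np.zeros((n_states, n_actions))

for episode in range(2_000):
    s = 0
    while s != n_states - 1:
        # Epsilon-greedy exploration.
        a = rng.integers(n_actions) if rng.uniform() < eps else int(Q[s].argmax())
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0
        # Model-free update: move Q(s, a) toward r + gamma * max_a' Q(s', a').
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q.argmax(axis=1))  # action 1 (right) for every non-terminal state
```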


Monday, June 22, 2015

Summary on Machine Learning Topics

@(Machine Learning)[Summary,Machine Learning]


Machine Learning composed of three critical components
1. A specific task: each algorithm works on a specific task, not as a general approach to everything. Tasks differ from each other by the type of training data, the type of testing data, how we get the training data, and how we use the testing data.
2. Experience: refers to anything that can be represented in an appropriate format. In my personal understanding, experience is just the interaction of objects, like a human and the environment. Human interaction should and always will be the main source of improvement.
3. Performance: the most important step in developing an application. We first need to set the performance evaluation for our task; then we can find appropriate ways to improve. Under most circumstances we can't optimize the performance directly; we can only optimize a surrogate of the final performance. For most problems this idea works very well.

So before we start, we need to select the task type, the performance evaluation metric, the type of experience, and, most importantly, the model we'll use.
There are at least three aspects of a machine learning model:
1. Hypothesis space. This is the most important thing in a machine learning model; the hypothesis defines how to map the input to the output, i.e. the relationship between input and output. For each hypothesis space, we need a way to evaluate its complexity, and a way to quantify the gap between the performance of a specific hypothesis on training data and on testing data; this needs the help of learning theory.
2. Loss function. Among all the hypotheses in the hypothesis space, we pick a specific one as the final result of learning; the loss function is the criterion for this selection.
3. Learning algorithm. With the hypothesis space and loss function selected, we need a learning algorithm to find the hypothesis. Most learning algorithms are numerical optimization algorithms, with exceptions like decision trees. Numerical optimization for machine learning has different requirements than standard numerical optimization: machine learning focuses on generalization ability rather than numerical precision.

Most machine learning books focus on these three aspects. The hypothesis space is the key topic of machine learning; we also see discussions of loss functions. Learning algorithms require books from the numerical optimization community rather than machine learning, but with the increasing size of data, distributed and fast optimization has become more and more important.

Loss Function

There are two different approaches to the loss function:
1. Maximum likelihood of the data. This is the form suggested by probability theory for fitting parameters. When the output of the hypothesis can be interpreted as a probability, maximum likelihood is a natural principle for the loss function.
2. Empirical risk minimization: another important principle. This approach optimizes a surrogate of the real loss function. For example, we actually want to optimize the 0-1 loss for classification, but it is too hard to optimize, so we use the log loss or hinge loss instead. The cost is evaluated on the training data, i.e. the empirical risk, and we optimize that.

Hypotheses space

A hypothesis describes the relation of input and output. We can classify hypotheses by different criteria: linear vs. non-linear, parametric vs. non-parametric.
Summary of hypothesis spaces:
1. Linear hypothesis space: combined with different loss functions, we get different classifiers. Combined with the log loss it's Logistic Regression; combined with the hinge loss it's the Support Vector Machine.
2. Tree hypotheses: a decision tree is a non-parametric hypothesis space; boosted trees optimize a chosen loss (AdaBoost uses the exponential loss). We have GBDT, RF, and so on.
3. Kernel-based hypotheses: we can combine kernel methods with linear methods. This is a non-parametric hypothesis, because the final result is based on the training data.
4. Graphical model hypotheses: described with exponential family distributions, so we can use different loss functions.
5. Neural network hypotheses: a highly complex non-convex hypothesis family; hard to optimize and easy to over-fit.

Learning Algorithm

For most machine learning models, we use a learning algorithm to find a hypothesis according to the loss function. Some algorithms are greedy, like decision tree building, but most involve numerical optimization. This means optimization is at the heart of machine learning; many models rely on numerical optimization heavily. A summary of optimization algorithms:
1. First-order methods
First-order methods are very cheap compared with other methods and are widely used for neural networks. They include Stochastic Gradient Descent, Proximal Gradient, Subgradient Descent, Mirror Descent, and so on.
2. Second-order methods
The most important is Newton's method, plus related methods like Conjugate Gradient and L-BFGS. Machine learning mostly avoids Interior Point Methods and Sequential Quadratic Programming, because these methods cost too much memory.

Some very hot topics in learning algorithms are regularization and distributed optimization.

A short summary of numerical optimization, which is itself a very big topic. We can classify algorithms by objective type, constraints, and so on. From this perspective we have:

| Problem Type | Mathematical Description |
| --- | --- |
| Linear Programming | linear objective, linear constraints |
| Quadratic Programming | quadratic objective, linear constraints |
| Second Order Cone Programming | with second-order cone constraints |
| Semi-Definite Programming | with positive semi-definite matrix constraints |

I think for engineering purposes, most work will focus on learning algorithms, especially large-scale training and distributed optimization. The development of hypothesis spaces and loss functions will be slower.


Saturday, May 9, 2015

10-701 Course notes

Notes on 10-701 Introduction to Machine Learning

Course from CMU 2014 Fall, general introduction to machine learning.

Lecture 1 Introduction to Machine Learning

Basic introduction to machine learning. The definitions of Task, Experience, and Performance are quite clear to me now.
Task: differences between tasks come from the type of training data, the type of testing data, how the training and testing data arrive, and how they are used. We have the following scenarios:
1. Supervised Learning: Batch of training data with label and do not know testing data when training. like classification and regression.
2. Unsupervised Learning: a batch of training data without labels; make predictions on unseen data points. It's hard to evaluate the result of unsupervised learning. Typical tasks include clustering, dimensionality reduction, and density estimation.
3. Semi-supervised Learning: Batch of training data with label and without label, making predictions on unseen data points.
4. Transductive Learning: a batch of labeled training data plus unlabeled test data; we only make predictions on those test points.
5. Online Learning: the learner receives data points sequentially: an unseen data point arrives, the learner makes a prediction, then receives the true label. The goal is to minimize the cumulative loss.
6. Reinforcement Learning: training and testing data are mixed; the learner is required to interact with the environment.
7. Active Learning: the learner adaptively selects training samples to be labeled, hoping to get better performance with fewer samples.
8. Clustering: unsupervised learning, find different data regions.
9. Dimension Reduction: project high dimension data into low-dimension.
10. Density Estimation, Sparse coding

The above is just a simple list of possible scenarios; we can find much more complex scenarios for machine learning algorithms.
Experience: training samples.
Performance: we have all kinds of performance criteria, like:
1. squared loss
2. hinge loss
3. log loss
4. absolute loss
5. negative log-likelihood
6. negative cross entropy

Lecture 2 Bayesian Optimal Classifier and Naive Bayes Classifier

For a classification problem, we can derive the Bayes optimal classifier, i.e. the best possible classifier given the correct joint distribution of the features and the class label.

Using Bayes' rule, the classifier predicts the y maximizing P(y|x) ∝ P(x|y) P(y).

The only problem with this formula: we do not have the exact distribution, and learning the full joint distribution needs too many samples.
The Naive Bayes classifier solves this with a strong assumption: all features are conditionally independent given the class label.
In practice, we need to smooth the probability estimates for each feature (see the sketch below).
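
A minimal Bernoulli naive Bayes sketch with Laplace (add-alpha) smoothing, for binary features and two classes; the implementation details are my own:

```python
import numpy as np

def train_bernoulli_nb(X, y, alpha=1.0):
    """X: binary feature matrix, y: labels in {0, 1}."""
    priors, likelihoods = [], []
    for c in (0, 1):
        Xc = X[y == c]
        priors.append(len(Xc) / len(X))
        # P(feature = 1 | class), smoothed so no probability is exactly 0 or 1.
        likelihoods.append((Xc.sum(axis=0) + alpha) / (len(Xc) + 2 * alpha))
    return np.array(priors), np.array(likelihoods)

def predict_nb(priors, likelihoods, x):
    # Conditional independence: the log-posterior is a sum over features.
    log_post = np.log(priors)
    for c in (0, 1):
        p = likelihoods[c]
        log_post[c] += np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
    return int(np.argmax(log_post))
```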

Lecture 3 Perceptron

Uses the linear model as an example to introduce the perceptron algorithm, which trains the model in an online fashion.
The model has the form

y = sign(w · x)

where w is the model parameter. When we have a batch of linearly separable training data, we need to solve a feasibility problem:

find w such that y_i (w · x_i) > 0 for all i.

For this problem, we can train with:
a. Linear Programming;
b. the Perceptron algorithm

When solving this with linear programming, we can't send the problem directly to an LP solver because of the strict inequality. In most cases a solver will convert the strict inequality into y_i (w · x_i) >= 0, which has the trivial solution w = 0. So we need to transform the problem. Assume the data is linearly separable; then there is a solution w, and we can find the minimum value m = min_i y_i (w · x_i) > 0, so the problem converts to the following form:

find w such that y_i (w · x_i) >= m for all i.

Since we have no constraints on w, we can divide both sides by m and get the following form:

find w such that y_i (w · x_i) >= 1 for all i.

Finally, we can send this problem to an LP solver.

The perceptron algorithm has a very simple update rule: when the current model predicts the wrong label on (x_i, y_i), add y_i x_i to the current parameter vector.
Key point for a perceptron implementation:
1. the data should be randomly shuffled.

Convergence issues
1. Convergence is guaranteed under the linear-separability assumption.
2. If the samples are not linearly separable, it does not converge; using the averaged parameters gives convergence in practice.

Proof of perceptron convergence?
It relies on margin theory: γ = max_w min_i y_i (w · x_i) / ||w||, the largest possible margin of the decision boundary; the number of mistakes is bounded by (R/γ)², where R bounds the norms of the inputs (sketch below).
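
A minimal perceptron sketch with random shuffling each epoch, labels in {-1, +1}:

```python
import numpy as np

def perceptron(X, y, n_epochs=10, seed=0):
    """Perceptron: on a mistake, add y_i * x_i to the weights."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):   # random shuffling matters in practice
            if y[i] * (w @ X[i]) <= 0:      # wrong (or zero-margin) prediction
                w += y[i] * X[i]
    return w
```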

Lecture 4 Logistic Regression and Perceptron

Logistic regression is a transformation of a linear classifier, aiming at a probabilistic interpretation of the classification result.
Another topic is generative vs. discriminative models. A discriminative model models the conditional relation between the label and the features; a generative model models the joint distribution of features and labels. The point is the different assumptions each makes about the training data. A discriminative model makes no extra assumptions about the data, but a generative model needs a way to model the data distribution, which is not an easy task under most circumstances.

Lecture 5 Analysis of Convergence of Perceptron

Analyzes the convergence of the perceptron in the non-separable case, and introduces the margin perceptron:
update the parameters not only on mistakes but whenever the margin y_i (w · x_i) falls below a threshold; this leads to better performance.

Lecture 6 Kernel Method & RHKS

Not interested, skip.

Lecture 7 Kernel Methods Continued; Support Vector Machine

Skip kernel method part and focus on support vector machine part.
Still derives the algorithm from the maximum-margin viewpoint; this is already clear to me.
Extends linear SVM with the kernel trick from the dual space.

Recitation 3 Optimization in Machine Learning

Most of the optimization methods rely on backtracking line search and a sufficient-decrease condition.

Lecture 8 Linear Regression

Linear regression predicts a real target value. Using l2 regularization is equivalent to a Gaussian prior on the parameters; using l1 regularization is equivalent to a Laplace prior on the parameters.
Using l2 regularization also makes the matrix X^T X + λI invertible (sketch below). Nothing more.
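
The closed-form ridge solution as a short sketch (lam is the regularization strength):

```python
import numpy as np

def ridge(X, y, lam=1.0):
    """Closed-form ridge regression: the l2 penalty makes
    X^T X + lam * I invertible even when X^T X is singular."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```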

Recitation 4

Derivation of SVM, kernel method.

Lecture 9 Nonlinear Regression

Still about regression, but with polynomial features; in this setting we have many more features than the original ones.
Nonlinear feature expansion of the original features.
Error decomposition of training and testing error.

Ridge regression has a dual problem, with an interpretation similar to SVM's.

Gaussian Process for Regression is also talked about.

Lecture 10 Optimizatioin & Convergence Analysis

Lectures about optimization: gradient descent, Newton's method, convergence analysis.
Two major targets for optimization:
1. Maximum Likelihood Estimation
2. ERM-type objectives, like the margin, and so on.

Gradient descent and Newton's method.

Recitation 5

Kernel methods: kernel methods are a strategy for learning complex functions.

Lecture 11 Continue the Optimization Topics

Talks about Newton's method with equality constraints and the barrier method for inequality constraints; also finds another explanation of mirror descent.

For a non-parametric model, we do not have a pre-defined parametric form, and we hope the model can become more complex when it gets more data.

Mirror descent works in the following way:
1. at the current point x_t, compute the gradient ∇f(x_t);
2. solve the optimization problem x_{t+1} = argmin_x { <∇f(x_t), x> + (1/η) D(x, x_t) }, where D is a Bregman-type distance.

Lecture 12 Nonparametric methods

Density estimation based on histograms and kernels.
For classification, kNN is the nonparametric method. Weighted least-squares regression and locally linear regression can have the same effect as kernel methods.

Lecture 13 Decision Tree & Boosting

Several critical points for decision trees:
1. the criterion for selecting the attribute for node splitting;
2. the way of node splitting: only two-way splits, or multi-way splits?
3. node pruning or a tree-stopping rule.

ID3: does not allow re-use of an attribute for node splitting.

Pre-pruning:
1. fixed tree depth
2. fixed number of leaves

Post-pruning:
1. chi-square test of independence: first convert the tree into a list of rules, then eliminate rules independent of the labels.

Boosting combines multiple weak learners to give better performance; each iteration updates the sample weights.

Lecture 14 Boosting & Graphical Model

Boosting optimizes an exponential loss function; we then optimize an upper bound on it to find appropriate parameters.
The key idea of graphical models is the structure of the joint distribution: we can use structure to save a lot of parameters.

Lecture 15 Graphical Model

Variable elimination works like dynamic programming for inference in graphical models. Introduction to directed and undirected graphical models.

Lecture 16 Graphical Model

A Bayesian network is a DAG; we have d-separation for all the nodes.
A factor graph is a bipartite graph whose nodes represent factors and variables. Variable elimination is used to compute marginal and conditional probability distributions.

Lecture 17 Belief Propagation

Belief propagation only works on tree-like graphs. For other cases we need extra machinery: use variable elimination and other methods to build a clique tree, then run belief propagation.
For non-tree graphical models, we can use the junction tree algorithm to convert the graph to a clique tree, then run belief propagation on the clique tree.

Lecture 18 Graphical Model & Learning Theory

Plate notation is used for graphical models with a large number of variables.
Learning theory deals with sample complexity:
1. how do we evaluate the complexity of the hypothesis space we want to learn from;
2. how many samples do we need to learn a hypothesis;
3. the difference between training error and test error, and its relation to hypothesis complexity and sample size.

Most of these questions are for theoretical study; they provide intuition rather than exact solutions to specific problems.

A consistent hypothesis is one that achieves zero error on the training data. For such hypotheses, we use the union bound to get an upper bound. For hypotheses with training error larger than zero, we need another inequality (Hoeffding's) to find the upper bound.

Lecture 19 VC Dimension and Bounds

The PAC learning framework describes the gap between training error and testing error in terms of the complexity of the hypothesis family.
If the hypothesis family is finite, we can use the number of hypotheses to describe its complexity.
If the hypothesis family is infinite, we can use the VC dimension or Rademacher complexity instead.

VC dimension: the maximum number of points that can be classified exactly under any labeling. We only need to find one placement of the training points such that for every possible labeling there is a hypothesis classifying the samples correctly (shattering).
Rademacher complexity: for m training samples, assign random labels σ_i ∈ {-1, +1} with probability 1/2 each. Then R(H) = E_σ [ sup_{h ∈ H} (1/m) Σ_i σ_i h(x_i) ].

Both VC dimension and Rademacher complexity describe complexity, and both have their own defects; the main use of these bounds is to guide the design of the optimization objective.

For example, the complexity of a decision tree depends on the number of leaves; the complexity of a linear classifier depends on the margin.

Lecture 20 Clustering Method

Not interested, skip.

Lecture 21 EM algorithm & PCA

For models with hidden variables, EM is used to maximize the likelihood of the data. EM can be viewed as a coordinate ascent algorithm, alternating between the distribution of the hidden variables and the model parameters.
EM just optimizes a lower bound on the likelihood of the data.

PCA can be explained from two different perspectives:
1. maximize the variance of the projected data points;
2. minimize the residual between the original data points and their projections.

Lecture 22 PCA & Neural Network

PCA: obtain the principal components through the eigendecomposition (or SVD) of the covariance matrix; take the top-k eigenvectors (sketch below).
Disadvantages of PCA:
1. Can't use information like class labels.
2. Only suits data with a Gaussian-like distribution.
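
A minimal PCA sketch via SVD of the centered data matrix, which is equivalent to taking the top-k eigenvectors of the covariance matrix:

```python
import numpy as np

def pca(X, k):
    """Project X onto its top-k principal components."""
    Xc = X - X.mean(axis=0)                          # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                              # directions of max variance
    return Xc @ components.T, components
```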

For deep learning, we have deep belief networks and deep neural networks. A deep belief network is a graphical model, directed or undirected; a deep neural network is a neural network with many hidden layers.

For training, we can use SGD or other optimizers.

Lecture 23 Neural Network & BP & Spectral Method

Problems of deep neural networks:
1. Overfitting
2. Non-convexity
3. Poor conditioning: some gradients can vanish and some can explode to be extremely large.

Conditioning problems:
1. explosion/shrinkage of gradients;
2. derivatives near 0 will be very small.

Solution:
1. gradient clipping: normalize the norm of the gradient.

Overfitting:
1. Early stopping
2. Penalty with l1 or l2.

Spectral Method skipped.

Monday, February 2, 2015

Notes on Course 10-725 Convex Optimization

@(Numerical Optimization)[Notes, Numerical Optimization]


From my personal feeling, this course is designed for Machine Learning (ML) and Statistics (Stat), focusing on machine-learning-related algorithms.
My targets for optimization knowledge:
1. Gradient descent
2. AdaGrad
3. RMSProp
4. Newton method
5. Quasi-Newton method
6. Interior point method
7. Primal-dual path following
8. Decomposition method
9. Coordinate descent method
10. Proximal gradient method
11. ADMM methods

These will be classified into first-order methods, second-order methods, interior point methods, and the rest.

Other methods related to machine learning are a plus; the more related methods I learn, the better a bird's-eye view of optimization in machine learning I will get.

Lecture 2

Introduction to gradient descent.
Possible problems with gradient descent:
1. Saddle points: not a local minimum, but the gradient is zero.
2. The algorithm may never converge.
3. Oscillation resulting from an inappropriate step size.

How to determine when to stop gradient descent?
Two different approaches:
1. From ML & Stat: we may need a held-out data set to decide when to stop, because the target is generalization.
2. From numerical optimization: reach a prescribed accuracy ε, i.e. the gap |f(x_k) - f(x*)| between the current point and the optimal point falls below ε.

Lecture 3 Convexity

We have convex sets and convex functions.
Some concepts related to a set C:
1. boundary of C (bd C): the points whose every neighborhood contains points both inside and outside C.
2. interior of C (int C): the points x in C such that some small ball around x is entirely contained in C.
3. relative interior (rel int C): when C lies in a lower-dimensional affine subspace, the interior of C taken within that subspace is called the relative interior. This is useful when we talk about a set without knowing the exact dimension of the space containing it.
Example of relative interior: a norm ball in R^2 has an interior. But if the same set lives inside R^3, it has no interior, only a relative interior.
4. closed set: C contains all its boundary points, i.e. bd C ⊆ C.
5. open set: C contains none of its boundary points; equivalently, C = int C.
6. compact set: C is closed and bounded. Bounded means there is a ball that contains C.

Primal & Dual representation of Convex set

The primal representation of a convex set is the convex hull of a set of points, while the dual representation is an intersection of half-spaces (possibly infinitely many).

  1. For any convex set, we can find a supporting hyperplane at any point on the boundary.
    A supporting hyperplane means:
    i. the set lies on one side of the hyperplane;
    ii. the intersection between the hyperplane and the convex set is not empty.
  2. For any two convex sets with empty intersection, there is a separating hyperplane. If both sets are closed and at least one of them is compact, we get a strict separating hyperplane.
Epigraph and sub-level sets

A function is convex when it satisfies the following definition:
1. The domain of the function is a convex set. This is important and easily ignored.
2. For any x and y in the domain and any t in [0, 1], f(t x + (1 - t) y) ≤ t f(x) + (1 - t) f(y): the line segment connecting (x, f(x)) and (y, f(y)) lies above the function curve. This definition implies the epigraph is convex!!!

Important property: every sub-level set of a convex function is a convex set (the converse does not hold).

Criteria for a convex function:
1. zeroth order: the definition above;
2. first order: f(y) ≥ f(x) + ∇f(x)^T (y - x), i.e. at any point we get a global under-estimator of the function;
3. second order: the Hessian is positive semi-definite.

Summary of lecture 3:
1. definition of compact set
2. primal & dual representation of convex set
3. different criteria of convex function

Lecture 4

Explains more details about convex sets and convex functions. Important concepts for convex sets: open, closed, and compact.
1. Strictly convex: the defining inequality is strict for x ≠ y; an important property of the objective function, with significant impact on the performance of iterative methods.
2. m-strongly convex: f(y) ≥ f(x) + ∇f(x)^T (y - x) + (m/2) ||y - x||^2. In fact, this is an augmented version of the first-order condition for convexity. m-strong convexity is a stronger condition than strict convexity; it means the function is lower-bounded by a quadratic.
3. Lipschitz continuity: |f(x) - f(y)| ≤ L ||x - y||. The Lipschitz condition constrains the function to be smooth: it can't grow faster than a linear function. Lipschitz continuity can be applied to the function itself or to its gradient. Few convex problems have a globally Lipschitz objective, but it is usually fine to assume the gradient doesn't grow faster than a linear function.

Convergence result for gradient descent

If we want the suboptimality f(x_k) - f(x*) to drop below ε, the iteration counts and step sizes are:
1. f Lipschitz: O(1/ε²) iterations; the step size should decrease, e.g. like 1/sqrt(k).
2. ∇f Lipschitz: O(1/ε) iterations; the step size can be a constant (1/L).
3. f additionally strongly convex: O(log(1/ε)) iterations; the step size can be a constant.

The constants in these rates depend on the conditioning of f, i.e. the ratio of the largest to the smallest eigenvalue of the Hessian.

Extension of Gradient Descent

includes subgradient methods, proximal operators, FISTA, mirror descent, conjugate gradient, and stochastic gradient descent. All these methods are called first-order methods.

Lecture 5 Derivation of convergence rate of gradient descent

More on the derivation of the convergence analysis for gradient descent. One interesting property:
if the gradient of f is Lipschitz with constant L, the entries of the Hessian matrix are also bounded by L;
and if f itself is Lipschitz continuous with constant L, then its gradient is bounded by L.

Lecture 6 Subgradient

The subgradient is an extension of the gradient. For a convex function, differentiable or not, we can always find a subgradient; for a non-convex function, a subgradient is not guaranteed to exist.
There is a subgradient calculus for computing subgradients of complex functions, along with optimality conditions and convergence analysis.

Lecture 7 Subgradient continue

Continues the talk on subgradients, with the example of finding a point in the intersection of n closed convex sets. One interesting trick: to find a point in the intersection of two closed convex sets, we can alternately project onto each set.
Another problem with the subgradient method is its slow convergence rate, O(1/ε²).
Another interesting property:
f being Lipschitz is a weaker condition than its gradient being Lipschitz continuous. If the gradient is Lipschitz continuous, the function is upper-bounded by a quadratic expansion around any point.

Lecture 8 Generalized Gradient Descent

Focuses on objective functions that decompose as f(x) = g(x) + h(x), where g is convex and differentiable and h is convex but not necessarily differentiable. The intuition: x⁺ = argmin_z g(x) + ∇g(x)^T (z - x) + (1/2t) ||z - x||² + h(z). That is, we take a quadratic approximation of g, keep h untouched, and take the minimizer of this approximation as the next point.
If h happens to be the l1 norm, the result is called the soft-thresholding operator: one practical approach for the LASSO problem (sketch below).
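
A minimal ISTA (proximal gradient) sketch for the lasso; the step size should be at most 1/L, where L is the largest eigenvalue of X^T X (an illustrative implementation of the standard algorithm):

```python
import numpy as np

def soft_threshold(z, t):
    # Proximal operator of t * ||.||_1.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista(X, y, lam, step, n_iters=500):
    """Gradient step on the smooth least-squares part g,
    then soft-thresholding for the l1 part h."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y)
        w = soft_threshold(w - step * grad, step * lam)
    return w
```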

Lecture 9 Acceleration

Accelerates gradient descent by using information from the previous direction. This method has many names, mostly due to Nesterov. The accelerated version of ISTA is called FISTA.
Another important property of acceleration: it is not always a descent method, so line search needs some special treatment when combined with acceleration.

Lecture 10 Matrix differentials

For a multivariate scalar function or a multivariate vector function, there are too many partial derivatives, and we need a clever way to manage the details. This is why we need vector calculus and matrix calculus. In the worst case, we compute every partial derivative by hand and fill them into a vector or matrix.
But to keep formula derivations simple, we need a cleaner way to derive the gradient or Hessian of a function, with formulas similar to the scalar case.
Matrix derivatives do not obey simple chain or product rules (only the sum rule carries over to multivariate functions), so we work with differential forms and then convert back to derivative form.

Lecture 11 Matrix differentials & Newton method

Continues the discussion of matrix differentials; one of the important rules is the application of the trace operator. Newton's method originates from nonlinear equation solving: to solve f(x) = 0, take the first-order approximation f(x₀) + f'(x₀)(x - x₀), set it to 0, and get a rough estimate of the solution, x = x₀ - f(x₀)/f'(x₀). We iterate until the precision is satisfactory. That is the scalar version; for optimization we apply the same idea to the gradient. The necessary condition for a local minimum is that the gradient equals 0, so we look for a step that zeroes the gradient: ∇f(x + Δx) ≈ ∇f(x) + ∇²f(x) Δx = 0, which gives Δx = -(∇²f(x))⁻¹ ∇f(x). There is another derivation: write down the second-order approximation of f at x and minimize it; both derivations give the same answer (sketch below). For the Newton step to be a descent direction, the Hessian must be positive definite.
(ps: the Hessian is the Jacobian of the gradient)
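
A minimal Newton iteration sketch, with a hand-coded gradient and Hessian for an illustrative objective f(x) = x₀⁴ + x₁²:

```python
import numpy as np

def newton(grad, hess, x0, n_iters=20):
    """Newton's method: solve hess(x) @ dx = -grad(x) at each iteration."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        dx = np.linalg.solve(hess(x), -grad(x))
        x = x + dx
    return x

grad = lambda x: np.array([4 * x[0] ** 3, 2 * x[1]])
hess = lambda x: np.array([[12 * x[0] ** 2, 0.0], [0.0, 2.0]])
print(newton(grad, hess, [1.0, 1.0]))  # approaches the minimum at (0, 0)
```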

Lecture 12 Newton Methods

Newton's method with line search is called damped Newton. Newton's method converges quickly only when the current point is near the optimum; when it is far from the optimum, we need to perform line search to ensure sufficient decrease in the function value.
For a strongly convex objective with an equality constraint,

minimize f(x) subject to A x = b,

we can still solve the problem with Newton's method: replace the real objective by its second-order approximation at the current point and solve the resulting equality-constrained quadratic problem for the step.

We can also call this approach sequential quadratic programming.

Lecture 13 Linear Programming

A problem whose objective function and constraints are all linear is called linear programming. Any linear program can be converted to a standard form in which all variables are non-negative and all constraints are equalities.
But how to convert to standard form? Two useful tricks:
1. use a slack variable to convert an inequality constraint into an equality: convert a^T x ≤ b into a^T x + s = b and add the constraint s ≥ 0;
2. if some variable may be negative, replace it with the difference of two non-negative variables: x = x⁺ - x⁻ with the two extra constraints x⁺ ≥ 0, x⁻ ≥ 0.

I find these two tricks quite useful. After converting an ordinary linear program to standard form, we find #constraints < #variables, which means some variables are free; we call the non-free variables the basis and find corners with them.

Lecture 14 Linear Programming

Solves linear programming with the simplex method, which works from the following intuition. A standard LP has m equality constraints and n variables; from the perspective of solving linear equations we must assume m ≤ n (when equality holds we get only one solution). Selecting m of the n variables and solving the corresponding linear system gives one vertex of the polyhedron; that is the first insight. The second insight: by selecting a different variable to enter the basis, we move from one vertex to another.
Degeneracy means the m × m basis matrix is singular, so we can't get a unique solution to the linear system.

Lecture 15 Duality

Derives duality for the linear programming problem, but I don't think the derivation is good.

Lecture 16 KKT Condition

For a general optimization problem, we can derive the dual of the primal problem. The KKT conditions specify necessary conditions for primal and dual solutions; for a convex optimization problem, the KKT conditions are also sufficient. This means that if we find primal and dual feasible points satisfying the KKT conditions, these points are the primal and dual solutions.
For machine learning problems we often need to convert between the constrained form (C) and the Lagrangian form (L). Here is an example:
(C) Constrained form:

minimize f(x) subject to g(x) ≤ t;

(L) Lagrangian form:

minimize f(x) + λ g(x).

These two forms are equivalent from the perspective of the KKT conditions.

Lecture 17 Duality Correspondence

Defines core concepts like the conjugate function, the dual norm, generalized inequalities, and so on. With these concepts we get a unified view of convex optimization, plus some tricks for deriving dual problems.
The conjugate of a function f is defined as:

f*(y) = sup_x ( y^T x - f(x) ).

Functions of this form are quite common when a problem has affine equality constraints. We also know the conjugates of some widely used functions:
1. conjugate of an indicator function:

I_C*(y) = sup_{x in C} y^T x,

i.e. the conjugate of the indicator function of a set C is the support function of C. For a special case: when C is the unit ball of a norm, the support function is the dual norm.
2. conjugate of a norm:

the conjugate of f(x) = ||x|| is the indicator function of the unit ball of the dual norm.

Another very important concept in optimization is the cone and the convex cone. A set C is a cone when x ∈ C implies t x ∈ C for every t ≥ 0. A set C is a convex cone when x₁, x₂ ∈ C implies t₁ x₁ + t₂ x₂ ∈ C for all t₁, t₂ ≥ 0.
With cone concepts, we can convert most optimization problems into a unified form and then solve them with only one package.
Examples include CVX and TFOCS, general-purpose packages for convex optimization; internally they convert every problem into a cone program.

Lecture 18 Use of duality

Describes the primal and dual problems of the lasso. Most of the time is spent interpreting the optimal solution via the KKT conditions and the dual problem; one important technique is characterizing the solution using the dual problem and the KKT conditions.
Talks about the dual subgradient method, which uses subgradients of the dual problem.
Also talks about decomposition of primal and dual problems for convex optimization: decomposition converts one big problem into a set of smaller problems.
In the last minutes, talks about the augmented Lagrangian for the primal problem; this trick accelerates the convergence of the dual subgradient method.

Lecture 19 ADMM and Mirror Descent

ADMM

The Alternating Direction Method of Multipliers combines two methods:
1. the decomposition method
2. the augmented Lagrangian method

For example, suppose we have a problem of the form

minimize f(x) + g(x).

This is unconstrained, but we can convert it into the form

minimize f(x) + g(z) subject to x = z.

This transformation is equivalent to the previous formulation, but it decomposes a big problem into two different parts coupled by only one constraint.
With ADMM we minimize the augmented Lagrangian

L_ρ(x, z, u) = f(x) + g(z) + u^T (x - z) + (ρ/2) ||x - z||²,

which is the transformed objective of the augmented Lagrangian method, but the subsequent optimization steps are quite different:

x ← argmin_x L_ρ(x, z, u)
z ← argmin_z L_ρ(x, z, u)
u ← u + ρ (x - z)

From my personal understanding, ADMM transforms a single problem into multiple smaller problems with consensus constraints (like x = z), then applies the augmented Lagrangian transformation, and finally uses dual decomposition to optimize each variable separately (sketch below).
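
A compact ADMM sketch for the lasso in scaled-dual form; the splitting f(w) = ||Xw - y||²/2, g(z) = lam·||z||₁ with the consensus constraint w = z is the standard example, and the details are my own:

```python
import numpy as np

def admm_lasso(X, y, lam, rho=1.0, n_iters=200):
    """ADMM for lasso; u is the scaled dual variable for w = z."""
    d = X.shape[1]
    w, z, u = np.zeros(d), np.zeros(d), np.zeros(d)
    A = X.T @ X + rho * np.eye(d)   # fixed matrix of the w-update
    Xty = X.T @ y
    for _ in range(n_iters):
        w = np.linalg.solve(A, Xty + rho * (z - u))                    # quadratic step
        z = np.sign(w + u) * np.maximum(np.abs(w + u) - lam / rho, 0)  # soft threshold
        u = u + w - z                                                  # dual update
    return z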

Mirror Descent

This method is called mirror descent because of the way it works: for the primal space we define a mirror map ψ (with dual map ∇ψ); at each iteration we move in the dual space and then mirror back to the primal space.
We say a function ψ is strongly convex when it satisfies

ψ(y) ≥ ψ(x) + ∇ψ(x)^T (y - x) + (σ/2) ||y - x||².

The Bregman divergence is defined as

D_ψ(x, y) = ψ(x) - ψ(y) - ∇ψ(y)^T (x - y),

which measures the size of the higher-order Taylor terms. With this divergence, each iteration solves

x_{t+1} = argmin_x { η <∇f(x_t), x> + D_ψ(x, x_t) },

and the solution satisfies

∇ψ(x_{t+1}) = ∇ψ(x_t) - η ∇f(x_t),

which requires the gradient map of ψ to have some special (invertibility) properties.

For some special ψ we get closed-form solutions:
1. define ψ(x) = Σ_i x_i log x_i (negative entropy): the update multiplies each coordinate by exp(-η g_i) and renormalizes; this method is called exponentiated gradient (a sketch follows this list).
2. define ψ(x) = (1/2) ||x||²: the corresponding method is plain gradient descent.
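
A minimal exponentiated gradient sketch: mirror descent on the probability simplex with the entropy mirror map (the linear objective is an illustrative choice of mine):

```python
import numpy as np

def exponentiated_gradient(grad, x0, eta=0.1, n_iters=100):
    """Multiplicative update followed by renormalization onto the simplex."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = x * np.exp(-eta * grad(x))
        x = x / x.sum()
    return x

# Minimize <c, x> over the probability simplex: mass moves to the cheapest index.
c = np.array([0.3, 0.1, 0.5])
print(exponentiated_gradient(lambda x: c, np.ones(3) / 3))
```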

Lecture 20 Quadratic Program and Cone Program

Concepts and definition of quadratic program and cone program and semi definite program.

Quadratic Program

Quadratic programming refers to optimization problems of the form

minimize (1/2) x^T Q x + c^T x subject to A x ≤ b.

When Q is positive semi-definite, the quadratic program is convex. Many problems in machine learning are QPs, like SVM.

Second Order Cone Programming

For SOCP, we have problems with the following structure:

minimize c^T x subject to ||A_i x + b_i||₂ ≤ c_i^T x + d_i,

i.e. the constraints are second-order cone constraints. QP is a special case of SOCP.

Semi Definite Program

SDP has a broader definition; SOCP is also a special case of SDP. SDP is widely used in machine learning and statistics.

minimize tr(C X) subject to tr(A_i X) = b_i, X ⪰ 0,

where X is required to be symmetric and positive semi-definite. Constraints of this kind are called Linear Matrix Inequalities. One interesting thing about the positive semi-definite cone is that its dual cone is also the positive semi-definite cone, i.e. it is self-dual.

We exploit special structure for LP, QP, SOCP, and SDP; the next step is general convex programming, i.e. without specific structure to exploit.

Lecture 21 Quadratic & Cone Program duality; SVM

The derivation of the dual problem for QP and cone programming is trivial, but here I found the real application of the dual cone and the benefit of using cone programming to unify problem presentations. One example: assume we are required to solve such a problem as

minimize c^T x subject to A x = b, x ∈ K,

where K is a cone. Deriving the dual, the slack c - A^T y is restricted to lie in K*, the dual cone of K, defined by K* = { y : y^T x ≥ 0 for all x ∈ K }. This usage exactly matches the definition of the dual cone; interesting!!!

For SVM, it's all still the same tricks; no particularly interesting point.

Lecture 22 SVM & Interior point method

Derives the dual problem of SVM; the dual solution provides another insight about SVM. The most important insight: the solution is just a combination of the training points, and we can use the kernel trick to fully exploit this specific structure.

Interior Point Method

One of the most important methods for solving all kinds of optimization problems. The basic intuition: the solution can be found by moving through points in the interior of the feasible region. In this section, the analytic center is derived as the starting point of the interior path leading to the optimum.

Lecture 23 Interior point method

Derives the interior point method for linear programming: a method that starts at the analytic center and ends at the optimum. The interior point method can be interpreted as solving

minimize t c^T x + φ(x) subject to A x = b,

where φ is the log-barrier of the inequality constraints. This is one example of the barrier method using the log barrier; to solve the original problem we make t larger and larger.
Another interesting property: we can estimate the duality gap from the parameter t (for an LP with m inequality constraints, the gap is m/t).
We can also interpret the interior point method as a perturbed version of the KKT conditions.

Lecture 24 Interior Point Method continued

Continues deriving the interior point method from the perspective of the perturbed KKT conditions; interior point methods for linear and nonlinear programming can both be derived from this idea. We take the optimality conditions, perturb each variable by a tiny improvement, drop the second-order terms, and solve the resulting linear system, just like differentiation.

Lecture 25 Coordinate Descent

Coordinate descent is a class of algorithms that only works for certain kinds of functions:
1. convex and differentiable functions;
2. functions of the form f(x) = g(x) + Σ_i h_i(x_i), where g must be convex and differentiable, and h only needs to be convex but must be separable.
One interesting phenomenon about coordinate descent is that its convergence is often faster than first-order methods, and it is widely used in machine learning and statistics.

Lecture 26 Path Algorithm

A path algorithm gives the exact solution for every value of the regularization parameter λ; it's too hard for me, skip.

Lecture 27 Dual Averaging

Originating from the subgradient method, dual averaging assigns equal importance to every past subgradient and roughly solves

x_{t+1} = argmin_x { <ḡ_t, x> + (1/η_t) ψ(x) },

where ḡ_t is the averaged subgradient and ψ is a strongly convex prox function. Using this method, we can achieve a better convergence rate than the plain subgradient method. Using Regularized Dual Averaging (RDA), we can get sparse solutions for online optimization problems.

Summary of This Course

This course introduces first-order methods excellently, with a state-of-the-art treatment. For second-order methods they do not cover CG or L-BFGS, but the presentation of duality in lectures 17 and 18 is very good.
The introduction of QP, SOCP, and SDP is also very interesting, especially the examples they used, wonderful! From coordinate descent and dual averaging I learned a lot.