2016年4月8日星期五

Using locale information

Kernel Method (or Non-parametric Method)

Seems there are two different definition of “kernel methods” in machine learning. One definition related with RKHS and so on. The other one, just like some non-parametric methods. And the latter one is the topic in post, the main idea is about how to use localized information to get a model.

Unlike linear model, which construct a global function over all the sample spaces; kernel methods works by construct a localized function for each new sample point . We can see how this method can be applied to different tasks.

When apply this method to regression task, for each new sample point , it will construct a weight matrix based on some kernel function . This kernel function will assign higher weights to closer training points based on some norm. Then a weighted regression will be performed, getting a brand new predicting function and return the predicted value.

For regression task, there is also another kernel method. Which will weight samples within the neighborhood of new sample point and return a weighted average of response variable value.

When apply this method to density estimation, it will also construct a weight kernel decaying with the distance from the point . And then perform classification according to bayes rule. We can also use mixture of Gaussian to estimate the density more clearly for each classes.

For all the methods mentioned above, there will be a issue of bias and variance trade off.

Written with StackEdit.

2016年4月1日星期五

Notes on Linear Regression

Having read the linear regression chapter of Element of Statistical Learning, method is different compared with Pattern Recognition and Machine Learning. After the introduction of least square methods, ESL will talk about the variant of the estimator ( ). Well, this is something quite new to me.
The first question is why we need to do this ? What’s the benefits of doing such kind of inference? But more interesting point is, with assumption of truly underlying model is linear model: ESL gives hypothesis testing and interval estimation of the parameters. This is quite new, but the question would be what if the real underlying model isn’t linear. I think this is the most common scenario.
For other point, ESL give a detailed analysis and comparison of different shrinkage method, this is a clear description of “bias variance decomposition”. And also other advanced method like lasso path and LAR algorithm.

2016年1月19日星期二

Notes on Pattern Recognition and Machine Learning

Chapter 1 Introduction

Three important parts: Probability distribution, decision theory and information theory.
From decision theory, loss function is provided; from information theory, entropy and KL is provided. From probability, conditional, joint probability; bayesian formalism and frequentist formalism.
Other topics about high dimension situation: for naive methods, requirement on data size grow exponentially with the number of dimensions and high-dimension is counter-intuitive. But we also have other insight on high dimension data: real data exist in a manifold of high dimension and local smoothness is guaranteed.
For model selection: we have cross validation from frequentist, other method combine model complexity and training performance from bayesian.

Chapter 2 Probability distribution

Focus on the probability distribution related to machine learning. Specially focus on Gaussian distribution. One important thing I learned from this chapter is how to derive the conditional and marginal distribution from a joint Gaussian distribution: Gaussian distribution has two very important components, first is the quadratic term involving precision matrix, second is the mean term; we can find the corresponding distribution by completing this quadratic terms.
Another important stuff about Gaussian distribution is the precision matrix, this is very helpful when derive the conditional and marginal distribution.
Other stuff in chapter 2 about Exponential Family, which is a generalized concepts with density function form where is sufficient statistics, is a normalizing constant depend on parameter . or called Partition Function.
For probability density, despite parametric form there is Nonparametric form. For Nonparametric probability density, we can have Nearest Neighbour Method and Kernel Method two different approaches. But these two ideas all come from one basic principal of estimating probability.

Chapter 3 Linear Regression

This book focus on bayesian approach to every model ( or hypothesis ). For linear regression, there are several different prospect for derivation:
1. MLE: assuming a gaussian distribution of noisy.
2. Geometry point: Projecting target value into the range of columns space of data samples.

For regularization part, assuming a gaussian prior distribution on parameter .
The ultimate purpose of learning model is predicting target value for new input data point , how to expression the uncertainly of predicted value ? Frequentist and Bayesian have different method:
1. Bayesian expression the uncertainly through of posterior distribution of parameter .
2. Frequentist will make a point estimate of at first, then through a series of though experiment to determine the uncertainty.

One important stuff: Bias-Variance Decompositioin is used for Frequentist, because the interpretation of Bias and Variance depend on the following ideas:
We have a set of different data sets, each data set comprised of N data points. From each set, learning algorithm will get an point estimate of , based on this parameter, prediction made on new data point . Since there are multiple data set from some unknown distribution , then can take expectation and variance of all the predicted values on new data point. This is the origin of Bias and Variance. Different model have different bias and variance depend on the model complexity. Thus the control of model complexity is vital for machine learning.

But what is the Bayesian approach to Linear Regress Estimate & Model Selection ?
Start with a prior distribution over parameter (mostly choose gaussian), then updating this distribution when new data point observed.

Frequentist start model selection with cross validation. But Bayesian will do it based on model evidence.

Another question would be the variant based on linear regression?
There are lots of variation of simple linear regression.
1. In original linear regression, original data point is used. But can use Basis Function to transform the data point at first : from -> , can have many basis function, then get the linear representation of data point. Some of the well-known basis function is: gaussian function, wavelet function and sigmoid function.
2. Another extension would be the norm of regularization. From to and . If a purely linear regression and norm regularization, this is called LASSO.

Other stuffs ?
1. Hypothesis complexity of hypothesis for linear regression. Used to derive the generalization bound and sample complexity.

Chapter 4 Linear Classification

For classification, there are three different approaches:
1. Discriminant Function: From training instance to class label directly.
2. Probabilistic Generative Model: model the joint probability of instance and class label.
3. Probabilistic Discriminative Model: model the conditional probability of class label give training instance.

In this chapter, it’s about how to use linear model to realize three different approaches.
For discriminant function, Linear Regression, Finsher Discriminant Analysis and Perceptron algorithm. Linear Regression and Finsher Discriminant function with different objective function. For perceptron, it’s quiet unusual. It’s hard to find a appropriate category for this algorithm.
For probabilistic generative model, it’s modeling as follow:

where represent the class conditional probability distribution. For binary and multi-class classification, if class conditional probability is gaussian and share the same covariance matrix, then the posterior of class label has the following form: . The function is called activation function and function classed link function in statistics. So one question would be this: which specific form of class conditional probability distribution will lead to a linear model? Answer is exponential family distribution with shared scaling parameters.
For discriminative function, it’s modeling the conditional probability directly. Linear discriminative model has the following form:

activation function is sigmoid function for binary classification, soft-max function for multi-class classification. Since nonlinear activation function, there is no closed form solution for this problem, only be solved through iterative approach. **I**terative **R**egularized **L**east **S**quare (IRLS) is apply the newton method to linear discriminative model. Another point need attention: when using 1-of-K coding schema for class label, optimize the negative loglikelihood function of training data is the same as optimize the negative cross entropy function of training data. this is true when binary using 0 and 1 to represent different class. (But in normal case, everyone is using 1 and -1, interesting).
According to the spirit of this book, There is a bayesian version of logistic regression. But the posterior of parameter given training data is intractable. So laplace approximation is used to approximate the posterior distribution.
How does Laplace Approximation works? It approximate the target distribution with a Gaussian. And this gaussian sit on the model, its precision matrix is the negative of the hessian of the target probability density at the mode.
Summary: Approaches to classification, cross-entropy, probit regression, laplace approximation, BIC.

Chapter 5 Neural Network

Most important concepts: Neural Network is adaptive linear model. Or can be understand as hierarchical linear model. Because each layer of neural network is just perform linear model operation plus some nonlinear activation. So neural network is itself nonlinear but composed of linear models. The most important motivation is the the input of linear model can be the output of other linear model. When thinking in this way, it is basis expansion but with adaptive basis. Much more interesting than aspects from neuron inspiration.
When recognized as adaptive linear model, neural network need some objective function: cross-entropy for classification, least square for regression. These concepts are all from the previous chapters.
But adaptive linear model is hard to calculate the gradient and hessian. So the back-propagation comes into help.
Using back propagation, gradient and hessian can be calculated easily.
As a new function mapping different from linear model, it need some ways to perform model selection. The old regularization on all the parameters still works well. However, as a adaptive linear model, itself is hierarchical!
For neural network, some other approaches can be used: consistent prior, tangent propagation, convolution and soft weigh sharing. I think all these techniques are too much complicated, don’t know the real application in real open-source tools.
Neural network is a function space and it can be used to do anything. Mixture density network is using neural network to predict the mixing coefficients, mean parameters and covariance matrix. Too much parameters, i don’t think this is good.
Still, this book is for Bayesian Method. Bayesian neural network for regression and classification. Using lapalace approximation, posterior distribution can be approximated to give predictive distribution and so on. Using so many approximation, what’s the meaning of getting a distribution rather than a single parameter. I doubt the effectiveness of the bayesion for neural network.

Chapter 6 Kernel Method

Previous chapters focus on linear method and its extension (i mean Neural Network). But kernel method is very different from kernel method, it involves nonlinear mapping in the model directly.
All the linear method can get a dual representation. In this representation, model is represented with a kernel function involved.
For kernel function, possible kernel should have positive definite gram matrix. And there are multiple ways to construct new kernel function:
1. Composite new kernel function according to a set of rules.
2. Composite new kernel function with probabilistic generative model, i.e. combine kernel function with a mixture model way.

This is the keypoint about the kernel function, others are skipped.
Another important knowledge is Gaussian Process, it seems that i can understand it now. From prior distribution, any existed data point has a distribution of output . Gaussian Process means the distribution of is gaussian. Well, most interesting point is we do not need to worry about selecting a proper prior over the parameter .

Chapter 7 Sparse Kernel Machine

Start with SVM algorithm, more interesting is Relevance Vector Machine. This is a Bayesian version of Support Vector Machine. The only different from Bayesian linear model, is the prior distribution for parameter is composed element-wise way:

So with this variant, sparse effect is achieved. The bayesian version is existed for classification & regression model at the same time.

Chapter 8 Graphical Model

Graphical model has different representation: directed & undirected. Each graphical representation specify the factorization of the joint probability distribution and conditional independence of the joint distribution. The factorization and conditional independence decide the efficiency of inference and learning algorithm.
For directed graphical model, there is d-separation to decide the conditional independence. For undirected graphical model, this is much simpler.
Inference on chain model & tree model is very simple. One important information: message passing used to passing message from other nodes about current node.
For general graphical model, loopy belief propagation, variational inference and sampling is the solution. Junction tree need complicated steps, i don’t think is used widely.
Some thing like factor graph and clique tree is another different representation of same graphical model. Different representation do not change the underlying distribution, just for the computation convenience.
So for graphical model, information is about structure of the joint distribution and how to use the structure to accelerate the computation.

Chapter 9 EM & Mixture Models

EM is used to get maximum likelihood estimator of some models with latent variable. The introduce of latent variable is used to simplify the computation of likelihood of observed data, even though the latent variable do not have any physical interpretation.
EM algorithm will try to get posterior distribution of hidden variable given observed data at first. Then it will calculate the expected complete data log-likelihood under the posterior distribution of hidden variables. At last, it will maximize the expectation with respect to model parameters.
For a long time, I can’t understand EM because don’t know how to get the distribution, then calculate the expectation under this distribution. Finally, i get the point. The key is not the separate the distribution and expectation, but to find the expectation of complete likelihood. With this information, the only thing required is the expectation of the posterior distribution.
If the posterior distribution is easy, EM can be very simple. Just maximize the complete likelihood of data, but replace the value of hidden variable with expected value of hidden variable. When posterior distribution is complex, approximate inference method is required. We can get the expected with different ways.

Chapter 10 Variational Inference

Variational inference is used for inference problem: marginal problem, posterior problem and MAP inference. If the distribution is very complex, we can make it simple by adding more extra conditional independence. Mean field approximation works by assuming some group of variables are independent even though not. Variatioanl inference minimized the following KL divergence:

and is factorized with different group of variables, i.e. . Apply this to the KL function then can get a function for minimization.
For a long time, i don’t understand this algorithm. Just because do not understand how to evaluate the expectation of complete data log-likelihood under some distribution.

Chapter 11 Sampling Method

The fantastic name “Monte Carlo Method” do not reveal the real content of this subject. Numerical Sampling method is using computer generated pseudo random number generator to get the real sample from target distribution, then perform all kind of inference operations.
So generally, there are specific sampling method, MCMC sampling method. Using MCMC method, there are several conditions:
1. aperiodic: This means there is no circle in the state traveling space.
2. irreducible: This means all the space can be explored.
3. reversible: also called detailed balance.
4. ergodicity: starting with any possible point, the convergence distribution is the same.

Among all the methods, Gibbs Sampling is the most important one. Another interesting method is Hamilton Monte Carlo Method.
From my understanding, the most important part for MCMC like method:
1. How to propose new point in the whole domain of the distribution.
2. How to use detailed balance to determine the acceptance of new point.

There are other general idea related with sampling: Data Augmentation. Data Augmentation works in the following way: If you want to sample from distribution , but you construct another distribution which satisfy at first. The variable called auxiliary variable. With new distribution , it’s much simpler to sample from it. So we get samples from then just drop the part. Slice sampling, Hamilton Monte Carlo method belong to this kind of general idea.

Chapter 12 Continuous Latent Variable

In this chapter, author gives some very important idea on dimension reduction task. In dimension reduction, there are some continuous latent variable, then after some kind of transformation become a high dimension space. But the latent variable only exist in a small manifold of high dimension. But in real word, data do not exist in a small manifold. How to interpret this? Well, data point do not exist in the small manifold, these data points will be interpreted as real data point plus some noise. PCA is interpreted as model with continuous latent variable, plus a Gaussian noise. Start with PCA, there are many other dimension reduction algorithm can be derived. Amazing!

Chapter 13 Sequential Data

Sequential data has the basic model: HMM for discrete latent variable, it’s the extension of mixture of Gaussians; LDS for continuous latent variable, it’s extension of PCA like models. It’s just the application of Graphical Model.

Combing Models

For combing models, it’s called Ensemble Method. Widely used method: Random Forest, AdaBoost, GBDT. For the rest, I don’t know the real applications.

Summary of Reading

For PRML, it covers the Linear Method for Classification/Regression, Neural Network, Kernel Method, SVM, Graphical Model, Variational Inference, Numerical Sampling, Ensemble Learning. I think the basic understand of machine learning techniques is acquired.

Written with StackEdit.

2015年9月17日星期四

AM207 Monte Carlo Method and Stochastic Optimization

AM207 Monte Carlo Method and Stochastic Optimization

Lecture 1

Introduction lecture, almost forget all the materials.

Lab session 1

Introduction to probability distribution with python as example.

Lecture 2 Basic Monte Carlo Method

  1. Important sampling
    This method is just for calculate the integral of some function.

  2. Rejection Sampling
    This is the first method for sampling from a distribution in real.

One important issue with monte carlo method is the variance of estimation depend on the sample size. We need find better ways to control the variance of estimator.

Lecture 3 Variance Reduction

Reduce the variance of monte carlo methods.

  1. Control Variates

  2. Antithetic Variates
    require the function must be mononic.

  3. Stratification variate
    split the variable into multiple interval and perform calculation on each interval.

Lecture 4 Bayesian Formalism

One simple distinction between bayesian and frequentist would be their returned result. For bayesian method, it will return a distribution; frequentist will return a number.
Bayesian method start with a prior distribution on parameter, after observing some data, updating the prior get the posterior distribution on the parameter.

Lecture 5 Bayesian Formalism: Part 2 && MCMC

Sampling method:
1. Inverse transformation
2. Rejection Sampling
3. MCMC

MCMC is the widely used approach to sample from high dimension and complex distributions. What’s the properties we expect MCMC to have ?
1. aperiodic: This means we do not want pattern (or loops) existed in the random walking. If we have this pattern, then the corresponding samples will not be random.
2. Irreducible: for any step , the probability of any point would large than . this means we do not want the random walk end up with a deterministic ending sample. So that we can get to any space in the domain.
3. detailed balance: from sample point and , the forward and backward probability is the same. This will guarantee the corresponding samples will come from the distribution we want.

Lecture 6 MCMC

Start real introduction into MCMC with detailed balance:
First talk about the Metroplis method for MCMC, by assuming a symmetric transition probability . First we get a new point from distribution where is current point, then we accept the new point with probability

Perform a uniform sampling between 0 and 1, get a prob . If less than then new point is accepted; otherwise stay on current point.
The interpretation of prob is quite simple: if the next point has a higher probability than current point, it is accepted without any condition; if the next point has a smaller probability than current point, it is accepted with a probability. In practice, using gaussian or uniform distribution will satisfy the symmetric requirement.
if we have a proposal distribution which is not symmetric ? We correct the probability in the following way:

then so we have the following new acceptance function
But in usual case, we still using symmetric proposal distribution instead of a unsymmetric.

Lecture 7 MCMC Convergence

For MCMC sampling method, convergence will be a serious issue. Convergence means we get the samples according to target distribution. Before convergence, we have a phase called burning in, samples from this stage can’t be used for estimate statistics. There is no reliable method to determine when the burning is finished, only some available heuristic variable to show.
1. Trace of variables
Using iteration number as the horizon line, actual value of variable as the vertical line. We plot this line to see the change of variable along time, if it looks like random Then it may have converged.
2. Geweke test
Geweke test used to calculate a estimator based on the mean and variance of two non-overlapping sequence of samples. Here is the definition:

In usual case, the value of should between -2 and 2. When using this method, we just split current samples into two parts, calculate the mean , and corresponding variance , . this seems useful.

Lecture 8 Gibbs Sampling

Another test statistics for convergence of MCMC. Gibbs sampling just mentioned.

Lecture 9 More on Gibbs Sampling

Just know the gibbs sampling is a special form of metroplolis-hasting algorithm when the proposal distribution is conditional distribution. Then all the rest is examples.

Lecture 10 Slice MC and Multivariate Slice MC

Data Augmentation
The key idea about data augmentation: If we want to sample from probability distribution , but we augment it with joint probability whose marginal distribution of is . Then sampling from and abandon to get samples from .

Slice Sampling
One example of data augmentation.

Lecture 11 Hierarchical Bayesian Method

Nothing special, just normal hierarchical models: directed acyclic graphical model. But the derivation of conditional distribution is quite hard. I don’t know the applicability to other more complex models.

Lecture 12 Hamiltonian Monte Carlo

Monte Carlo or Markov Chain Monte Carlo methods are all based on random walk. From this perspective, there are several interesting aspects need attention:
1. How to propose new point (based on current point or other information) ?
2. Criterion for accepting new proposed point. This is just need to ensure the detailed balance, so that finally sampling points according to target distribution.

For MCMC algorithm to work, need to ensure the detailed balance, ensure the random walk would not have patterns and can reach every possible space.

Hamiltonian Monte Carlo method is one type of MCMC, motivated by one face: for efficient sampling, when the probability surface is flat, the step size should be large; when the probability surface has very variability, the step size should be small. So can ensure the full sampling.

Hamiltonian Monte Carlo Method motivated by physical system and newton first, second law. Even though I do not know the details of HMC, i can know the main improvement is the way to propose new candidate point (make it much efficient).

Lecture 13 Parallel Tempering

Start multiple with different temperature. The key idea the the using of Temperature. For any probability distribution can be transformed into . This will turn any distribution into boltzmann distribution and is the temperate. Higher temperate will increase the rate of acceptance when sampling. Low temperate will decrease the rate of acceptance. The core idea of parallel tempering is the run multiple chain of MCMC with different temperate. Some of the chain with and others with higher temperate. So that MCMC will explore more space and more efficient.

Lecture 14 Simulated Annealing

Borrowing the concept of temperate form lecture 13. Start with high temperate, then decreasing the temperate slowly. Hope to jump out of local minimum. The reason is this: The high temperate will smooth the objective (or distribution).

Lecture 15 Stochastic Gradient Descent

Discuss the optimization algorithm, nothing special.

Lecture 16 Time Series

Despite the I.I.D assumptions about data points, there are other forms of data. Time series is a special form, data are depend on previous data points. There are multiple concepts related to time series data points:
1. Autocorrelation.
2. Autoregression.

Lecture 17 Times Series & HMM

Keep talking about time series related model & HMM, not interested.

Lecture 18 HMM & Kalman Filter

Model like HMM and kalman filter. Know the details.

Lecture 19 Expectation Maximization

Some kind of derivation with example of GMM. Nothing special information from this lecture.

Lecture 20 Gaussian Process

Gaussian process is too much complicated. The prediction for new point based on seen points. Using kernel to construct the covariance matrix for a gaussian distribution like the Gram Matrix. The new point and the original point will construct a joint distributiion which can be simplified to a conditional distribution. Something else not clear.

Lecture 21 Graphical Model

One lecture for graphical model will be impossible, skipped.

Lecture 22 Graphical Model cont.

Skipped as well.

Summary

After finishing this course, i think this course is not so advanced but the most important stuff is let me know some more details about MCMC. So that i can read some more book & understand & implement some algorithm myself. Whatever, this is a good course for MCMC. But for Optimization and Graphical Model part, EM part, nothing special.

2015年8月6日星期四

Notes on Foundation of Machine Learning

@(Machine Learning)[Notes, Machine Learning]

Notes on Foundation of Machine Learning

Chapter 1 Introduction

Three different criteria for learning algorithm:
1. Time complexity
2. memory complexity
3. sample complexity: how many samples are need to learning concepts in hypothesis set.

Some other concepts like training set, validation set and test set. One important thing i learned from chapter is what so called Learning Scenario. Different learning scenarios differed from each other by the type of training and testing data (with label, without label), how the training data and testing data is arrived, how to use the training and testing data. We have following scenarios:
1. Supervised Learning
2. Unsupervised Learning
3. Active Learning
4. Dimension Reduction
5. Transductive Learning
6. Online Learning

Then we also have learning tasks, task means the target we want to get:
1. Classification
2. Regression
3. Ranking
4. Sequence labeling

and so on.

Chapter 2 PAC Learning Framework

Machine learning is using a learning algorithm to find one hypothesis among hypothesis class according to specific loss function . Two potential criterion for one hypothesis :
1. loss on available training sample called Empirical Risk ;
2. average on all the possible samples called Risk

The most important question would be the correlation between empirical risk and risk. This kind of relation is called Generalization Bound. Because in machine learning, low risk is the ultimate goal not low empirical risk. Low risk means good generalization ability. If the generalization bound is clear, we can estimate the real performance of hypothesis returned by learning algorithm . Under certain circumstance, we can put the same question in another way: How many samples we need so that the returned hypothesis has a risk less the prescribed .
Basic background for derivation. Each training sample comes from sample space , all these samples distributed according to distribution . There is a concept family , each of the concept will map a sample into target value space , i.e. . Learning algorithm use hypothesis space to approximate the concept class . Each of the hypothesis will map sample into . Here we assume the possible hypothesis of is finite, i.e. size is .
Now PAC-Learnable defined as follow:
A concept class is PAC-Learnable, if exists algorithm and a polynomial function , for all the and , for all the distribution on and any target concept in , if sample size then the following statements hold:
.

In order to derive the bound for consistent and inconsistent case, we need union bound and hoedding inequality.
Some generalization to above statement:
1. Assume the label according to input instance is deterministic, which means we can get the label from the input data. However, in real application the relation of input and output is stochastic. Which means we have the following distribution for each training instance . So we have so assume a distribution over input and output space, i.e. . One possible explanation for this is: we do not have enough features to distinguish each training instances. So like use height and weight to classify a person is man or woman. For a set of height and weight, there is a possibility for man and woman.
2. When the output is a distribution conditioned on input sample , We have the bayesian error. This is the minimum error the best classifier can get.
3. We can decompose the risk difference between hypothesis returned by learning algorithm and bayesian risk into a combination of estimation error and approximation error.
4. According to generalization bound, the risk of hypothesis depend on empirical error and ratio of sample size and number of hypothesis. We can find a hypothesis among different hypothesis size for model selection. Or we can use regularization to perform model selection.

Lecture 3 Rademacher Complexity and VC dimension

Rademacher Complexity and VC dimension are two different means to describe the complexity of hypothesis spaces. Because when the number of hypothesis is infinite, the generalization boudn derived in chapter 2 is useless. Rademacher complexity and VC dimension is another way to evaluate the complexity of hypothesis set except the number of hypothesis.
With these two different notation of hypothesis complexity, combined with concentration of inequality to give a uniform generalization bound.

Chapter 4 SVM

Start with linear separable case and then extend to non-separable case. The derivation of primal and dual problem is the way i expected. The most important is the derivation of generalization bound for SVM based on margin rather than feature dimension . One more interesting thing is using hinge loss is actually optimization the generalization bound of SVM.

Chapter 5 Kernel Method

For Kernel method, just know the implicit mapping from input space to high dimension space. The other work is about Positive Definite Symmetric kernel, the rest like RKHS, it’s unnecessary to know all these stuffs.

Chapter 6 Boosting

Boosting comes with one question: can we build a strong classifier with weak classifier ? what’s the definition of weak classifier? How can we combine this weak classifier? What’s its generalization bound and what’s the factor that impact the generalization ability.
Boosting is a strategy, AdaBoost is one real algorithm that can be used when the base weak classifier is a hypothesis set mapping from input space to . One basic idea is assign weight to each training sample, for samples harder to classify by nature, we give them more weight and then combine the classifier from each iteration linearly to get final classifier.
AdaBoost is optimize a exponential loss function and can be interpreted as coordinate descent algorithm.
Maximum margin can’t explain the performance of adaboost.
One disadvantage of AdaBoost: prone to noisy data. This is not good, because noisy data is everywhere and hard to control.
Another question is the selection of base classifier. Usually we can use decision tree as base learner and trained a tree model based on the current samples.

Chapter 7 Online Learning

Online learning handle the data point one by one, when the learning algorithm receive one data point, then make prediction and have a loss related to the prediction.
Online learning different from the techniques in the previous chapter: No fixed distribution is assumed on input . This means we can’t use risk to evaluate the hypothesis. So online learning using another set of notation for hypothesis evaluation: mistake model and regret analysis.
Online learning model with expert advice: When doing online learning, experts will give advice for each input sample .
Mistake model is simple, i.e. the number of mistake learning algorithm made during round of learning. Regret defined as follow: the difference of cumulative loss minus the loss of expert give the minimal loss.

All the analysis of online learning algorithm will give results on the total number of mistakes or regret of T rounds learning.
In the setting of experts, have Weighted Majority (WM), Randomized Weight Majority(RWM) and Exponential Weighted Majority(EWM) algorithms.
When analyzing the mistake bound and regret, it’s usually by deriving the upper and lower bound of weight, then get the results.
For linear classification, perceptron and winnow algorithm are two standards approaches.
Since online learning do not have distribution over data , batch algorithm assume a distribution . What’s the result training a online learning algorithm on data with generated from fixed distribution ? The risk can be bounded by the average loss and some factor of where is the size of dimension, is the number of learning rounds.

Chapter 8 Multi-Class Classification

Compared with binary classification, multi-class classification differed in three aspects:
1. Large number of class require more computational resources.
2. There are unbalanced classes in multi-class, some class have much larger samples than the other one, need the change of loss function.
3. There are some internal structures inside the output classes, should be used in the design of loss function.

This chapter starts with the generalization bound for multi-class classification. I fond the standard steps for building generalization bound:
1. find the rademacher complexity of the hypothesis set.
2. Derive the relation of rademacher complexity and margin, then get a bound rely on the margin.

There two different approaches have appeared multiple times in this book.
For multi-class classification:
1. Multi-class SVM by define margin as the different between correct label and second largest label score.
2. boosting: Same as AdaBoost. It’s used for multi-label problem.
3. Decision tree: One important knowledge about decision tree is decision tree will split the input space into multiple regions. Each region will have its label.

Other methods called: Aggregated Method
1. one V.S. rest
2. one V.S. one
3. Error Correction codes, quite interesting method, using hamming distance to classify the label of instance.

Multi-class classification and sequence labeling
For sequence labeling, following objective function provided:
1. Maximum Margin objective
2. Struct-SVM objective

Chapter 9 Ranking

Ranking is a new task compared with classification and regression, although this problem can be solved using same techniques when solving classification and regression.
For ranking task, there are two different approaches:
1. score-based approach: This method will train a hypothesis mapping from to . When predicting, each instance will get a score from hypothesis then ranking according to this score.
2. preference-based approach: I do not understand this method! This method split ranking into two stages. First stage will train a preference function from training sample, second stage will perform ranking based on the score of preference function.

Score-based approach for ranking
This is pair-wise ranking method, training data is pair of data points and objective function is average mis-ranked pair. For score-based approach, each instance will get a score from hypothesis . Can use SVM and RankBoost method to training the hypothesis. The analysis of generalization bound is based on rademacher complexity or margin, very much like SVM and AdaBoost.
One interesting scenario is bipartie ranking. In this scenario, all the instance can be classified into two classes. The only requirement is the pos has a higher than neg. This special scenario can be solve using boost method.

Preference-based ranking
i don’t know what’s the meaning of this method.

Chapter 10 Regression

Regression with target value of real. This chapter derive the generalization of linear regression based on rademacher complexity. Other concept similar to VC-dimension is Psudo-Dimension which based on the concept of shattering. The real algorithm for regression has linear regression (with linear hypothesis), ridge regression, kernel ridge regression, support vector regression, lasso and group lasso algorithm. Some analysis are based on rademacher complexity not on margin.

Chapter 11 Algorithmic Stability

Only care one definition of stability: when two samples and only different by one data sample, but the hypothesis output difference can be bounded by some constant number . This is called stability.

Chapter 12 Dimension Reduction

PCA is the most formal one of dimensional reduction method, aimed at preserve the maximum variance of original data. Kernel PCA is a version based on a PDS kernel, has many connections with other non-linear dimension reduction method.
Other methods like Isomap (keep the pariwise distance), LLE (keep neighbour distance) and Lapancian Eigenmap (keep the neighbor distance).

Chapter 13 Learning Automata and Languages

Skipped, don’t think it’s relevant to my work.

Chapter 14 Reinforcement Learning

Reinforcement learning based on on abstraction: MDP (markov decision process). Which includes state, action, reward, transition probability (depend on the sate and action), reward probability.
There are two different assumptions:
1. The transition probability and reward probability is known to learner.
2. The transition probability and reward probability is unknown to learner.

A policy is a mapping from state to action:

In the first case, we only need to plan action based on the known statistics. In the second case, learner need to estimate the parameters from the interaction between environment. For unknown environment, have model-free approach which estimate the policy directly. Also have model-based approach which estimate the transition probability and reward probability at first, then estimate the policy. Q-learning is the most famous algorithm estimate the policy directly.
All these methods based on the stochastic approximate method.

Written with StackEdit.

2015年6月22日星期一

Summary on Machine Learning Topics

@(Machine Learning)[Summary,Machine Learning]

Summary on Machine Learning Topics

Machine Learning composed of three critical components
1. Specific Task, each algorithm works with specific task not a general approach to everything. Tasks different from each other by the type of training data, type of testing data, how we get the training data and how we use the testing data.
2. Experience. Refers to everything can be represented with appropriate format. For my personal understanding, experience just the interaction of objects, like human and environment. Human interaction should and always be the main source of improvement.
3. Performance. Most important step during the development of application. We need to first set the performance evaluation of our task, then we can find appropriate ways to improve. Under most circumstance, we can’t optimize the performance directly, we can only optimize the Surrogate of our final performance. For most of the problem, this idea should works very well.

So before we start a work, we need to select the type, performance evaluation metric, type of experience and most important the model we’ll use.
There are at least three aspects related with machine learning models:
1. Hypotheses spaces. This is the most important thing in a machine learning model, hypothesis defines how to mapping the input to output, what’s the relationship between input and output. For each hypothesis space, we will need a way to evaluate the Complexity, a way to qualify the different of performance of specific hypothesis on training data and testing data, this need the help of Learning Theory.
2. Loss function. Among all the hypotheses in the hypothesis space, we will find specific one as our final result of learning, loss function will be the criterion for selection.
3. Learning algorithm. With hypothesis and hypothesis selected, we need one learning algorithm to select the hypothesis. Most of the learning algorithm just one of the numerical optimization algorithms, but we have exception like Decision Tree. Numerical optimization algorithm for machine learning have different requirement with standard one, machine learning focus on the generalization ability rather than numerical precision.

Most of the book about machine learning will focus on these three aspects. Topics about the hypothesis spaces is the key of machine learning. We’ll also see talks on loss function. Learning algorithm require the book from numerical optimization community, rather than machine learning. But with the increasing size of data, distributed optimization and fast optimization has become more and more important.

Loss Function

There are two different approaches for loss function:
1. Maximum Likelihood of data. This is a suggest form probability when fitting parameter. When the output of hypothesis space can be interpreted as probability, maximum likelihood will be a possible principle for loss function.
2. Empirical Risk Minimization: This is another important principle. This approach will optimize the surrogate of our real loss function. For example, we actually want to optimize loss for classification task, but this function is too hard to optimize. We use log loss and hinge loss instead. And the cost is evaluated on the training data, i.e. . Then we optimize this function based on empirical data.

Hypotheses space

Hypothesis describe the relation of input and output. We can classify the type of hypothesis according to different criterion. We have linear, non-linear; parametric and non-parametric hypothesis.
Summarize the hypothesis space as follow:
1. Linear hypothesis space: when combined with different loss function, we have different classifier. Combined with log loss, it’s Logistic Regression; combined with hinge loss, it’s Support Vector Machine.
2. Tree hypothesis: Decision tree is actually an non-parametric hypothesis spaces, GBDT is actually optimize the exponential loss function. We have GDBT, RF and so on.
3. Kernel related hypothesis: We can combine kernel method with linear method, This is an non-parametric hypotheis method, because final result based on the training data.
4. Graphical Model hypothesis: this hypothesis is described with exponential family distribution. So we can use different loss function.
5. Neural Network hypothesis: Highly complex non-convex hypothesis family. It’s hard to optimize and easy to over-fitting.

Learning Algorithm

For most of the machine learning models, we need to use learning algorithm to find a hpothesis according to loss function. Some of the algorithms are greedy like decision tree building. But most of the algorithms involves numerical optimization algorithm. This means optimization is at the heart of machine learning, many models rely on numerical optimization heavily. Here are the summaries of optimization algorithms:
1. First order method
First order method is very cheap compared with other methods. And is widely used for Neural Network. First order method include Stochastic Gradient Descent, Proximal Gradient, Sub-Gradient Descent, Mirror Descent and so on.
2. Second order method
Most important would be Newton Method, some other similar method like Conjugate Descent, L-BFGS and so on. Under most circumstance machine learning would not use Interior Point Method and Sequential Quadratic Programming, because these methods will cost too much memory.

Some very hot topics on Learning algorithm would be the regularization and distributed optimization. For distributed optimization, we have .

A short summary on the numerical optimization topic, itself is a very big topic. We can classify the algorithm based on objective, constraints and so on. From this perspective we have

Problem Type Mathematical Description
Linear Programming
Quadratic Programming with constraints
Second Order Cone Programming with second order constraints
Semi-Definite Programming with positive definite constraints

I think for engineering purpose, most work will focus on Learning algorithm, especially for large scale training and distributed optimization. The development of hypothesis space and loss function would be slower.

Written with StackEdit.

2015年5月9日星期六

10-701 Course notes

Notes on 10-701 Introduction to Machine Learning

Course from CMU 2014 Fall, general introduction to machine learning.

Lecture 1 Introduction to Machine Learning

Basic introduction about machine learning. The definition of Task, Experience and Performance is quite clear for me now.
Task: difference of task comes form different perspective, type of training data, type of testing data, how the training and testing data is arrived and how the training and testing data is used. We have following different scenarios:
1. Supervised Learning: Batch of training data with label and do not know testing data when training. like classification and regression.
2. Unsupervised Learning: Batch of training data without label and make prediction on unseen data points. It’s hard to evaluate the result of unsupervised learning. Typical tasks include clustering, dimensionality reduction and density estimation.
3. Semi-supervised Learning: Batch of training data with label and without label, making predictions on unseen data points.
4. Transductive Learning: Batch of training data with label and testing data without label, we only make predictions on these test point.
5. Online Learning: Learner receive training and testing data point in a fixed fashion. One unseen data point come, make prediction, and receive a true label. The total goal is the minimized the cumulative loss.
6. Reinforcement Learning: training data and testing data are mixed, the learned is required to interact with the environment.
7. Active Learning: Learning adaptively select training sample for label, hope to get better performance with fewer samples.
8. Clustering: unsupervised learning, find different data regions.
9. Dimension Reduction: project high dimension data into low-dimension.
10. Density Estimation, Sparse coding

The above list is just a simple list of possible scenarios, we can find much complex scenario for machine learning algorithm.
Experience: Training samples
Performance: We all all kinds of performance criteria like:
1. squared loss
2. hinge loss
3. log loss
4. absolute loss
5. negative log-likelihood
6. negative cross entropy

Lecture 2 Bayesian Optimal Classifier and Naive Bayes Classifier

For classification problem, we can derive the bayesian optimal classifier, i.e. the best possible classification result given the correct joint distribution of training data and class label.

After using bayes rule, we can derive the following formula:

The only problem for this formula: we do not have the exact distribution and learning the distribution need too much sample.
Naive Bayes Classifier solve this problem with a strong assumption, all the features are conditional independent give the class label.
In practical, we need to smoothing the probability distribution for each feature.

Lecture 3 Perceptron

Using linear model as example introduce perceptron algorithm. Perceptron used to training model in online version.
For model in the following form:

where is model parameter. When we have a bunch of linear separable training data, we will need to solve a feasibility problem:

For this problem, we can train it with:
a. Linear Programming;
b. Perceptron algorithm

When solve this problem with linear programming, we can’t send the problem directly to LP solver: because of the strict inequality. Most of the cases, solver will convert strict inequality into , this will have a trivial solution, i.e. . So we need perform transformation to this optimization problem. Assume the data is linear separable, then we will find an solution to this problem , after getting this solution, we can find the minimum value of , then the problem will convert to following form:


Since we don’t have constraints on , we can divide both side by , then we can get following form:


Finally, we can send this problem to LP solver.

When using Perceptron algorithm, we will have a very easy update rule:
when current model predict wrong label, we’ll add the to current parameter.
key point for perceptron implementation:
1. data should be random pertubed.

Convergence Issue
1. converge guaranteed under liner separable assumption of data.
2. not if sample not linear separable, use averaged parameter will get convergence assumption.

Proof of perceptron convergence?
Margin Theory, i.e. . This means the largest possible margin between decision boundary.

Lecture 4 Logistic Regression and Perceptron

Logistic regression is a transformation of linear classifier, hoping to get a probabilistic interpretation of classification result.
Another topic is about generative model and discriminative model. For distriminative model, we have model the conditional relation between label and features. For generative model, we model the joint distribution of features and labels. The point is the different assumptions model make on the training data. With discriminative model, there is no extra assumptions about the data. But with generative model, we need to find a way to model the distribution of data , this is not an easy talk under most circumstances.

Lecture 5 Analysis of Convergence of Perceptron

Analyze the convergence for non-separable case perceptron, also introduce the margin perceptron:
update the parameter when , this will lead to better performance.

Lecture 6 Kernel Method & RHKS

Not interested, skip.

Lecture 7 Kernel Methods Continued; Support Vector Machine

Skip kernel method part and focus on support vector machine part.
Still derive the algorithm from maximum-margin point, clear already.
Extend linear SVM with kernel trick from dual space.

Recitation 3 Optimization in Machine Learning

Most of the optimization rely on backtracking and sufficient decrease condition.

Lecture 8 Linear Regression

Linear regression used to predict real target value. Using regularization equivalent to using a gaussian prior on the parameters. Using regularization equivalent to using a laplance prior on the parameters.
Using a regularization will get a invertible matrix. Nothing more.

Recitation 4

Derivation of SVM, kernel method.

Lecture 9 Nonlinear Regression

Still about regression, but talk about polymonial features. In this kind of setting, we have much more features than the original feature.
Nonlinear feature expansion of original features.
Error decomposition of training and testing error.

Ridge regression can have dual problem and find a similar interpretation of SVM.

Gaussian Process for Regression is also talked about.

Lecture 10 Optimizatioin & Convergence Analysis

Lectures about optimization, gradient descent, Newton’s method, convergence analysis.
Two major target for optimization:
1. Maximum Likihihood Estimate
2. ERM like margin and so on.

Gradient Descent and newton.

Recitation 5

Kernel Method: Kernel method are a strategy to learn complex function.

Lecture 11 Continue the Optimization Topics

Talk about newton method with equality constraints, barrier method for inequality method; Find another explanation about mirror descent.

For non-parametric model, we do not have pre-defined distribution for model and hope our model can can more complex when have more data.

Mirror descent works in the following way:
1. for variable calculate .
2. solve following optimization problem:

Lecture 12 Nonparametric methods

Density estimation based on histogram and kernel method.
For classification task, kNN is the nonparametric methods. Weighted least regression, locally linear regression can have same effects as kernel methods.

Lecture 13 Decision Tree & Boosting

Several critical points for decision tree:
1. Criteria to select attribute for node splitting
2. Way of node splitting, only two-way splitting or multiple-way splitting ?
3. node pruning or tree stopping rule.

ID3: do not all the re-use of attribute for node splitting.

Pre-pruning:
1. fixed tree depth
2. fixed number of leafs

Post-pruning:
1. chi-square test of independence: convert tree into a list of rules at first, then eliminate rules independent of labels.

Boosting used to combine multiple weak learner to give better performance. Each iteration update sample weights.

Lecture 14 Boosting & Graphical Model

Boosting is used to optimize a exponential loss function. Then we optimize the bound find appropriate parameters.
Key idea for graphical model is the structure of joint distribution. We can use structure to save a lot of parameters.

Lecture 15 Graphical Model

Variable elimination works in a dynamic programming way for graphical model inference. Introduction to directed and undirected graphical model.

Lecture 16 Graphical Model

Bayesian is a DAG, we have d-separation for all the nods.
Factor graph is a biparte graph. Node represent factor and variables. variable elimination used to compute the marginal and conditional probability distribution.

Lecture 17 Belief Propagation

Belief propagation only works for tree-like graph. For other cases, need other helps. Use variable elimination and other methods to get clique tree, then running belief propagation.
For non-tree like graphical model, we can use junction tree algorithm to convert to clique tree, then running belief propagation on the clique tree.

Lecture 18 Graphical Model & Learning Theory

Plate notation used for graphical model, plate notation used for model with large number of variables.
Learning theory deals with the sample complexity:
1. how do we evaluate the complexity of hypothesis we want to learn
2. how many samples do we need to learn a hypotheses
3. the difference between training error and test error, its relation with hypothesis complexity and sample numbers.

Most of the questions are for theoretical study, they can only for intuition but not accurate solution of some problem.

Consistent hypothesis refers to a hypothesis that achieve 0 error rate on training data. For this kind of hypotheses, we need to use union bound to get the upper bound. For hypothesis with training error larger than zero, we need another inequality to find the upper bound.

Lecture 19 VC Dimension and Bounds

PAC learning framework provide a way to describe the different between training error and testing error which depends on the complexity of the hypothesis family.
If the hypothesis family is finite, then we can use the number of hypothesis to describe the complexity of the hypothesis family.
If the hypothesis family is infinite, then we may use VC dimension and Redmacher complexity to describe the complexity.

VC dimension: The maximum number of points which can be classified exactly for any labeling. We only need to find one placement of training points and for each possible labeling, we can find a hypothesis to classify the samples correctly.
Redmacher complexity: for training samples, we give training labels random with probability and . Then .

Whatever VC and Redmarcher used to describe the complexity and they all have their only defect, the only use of these bounds will guide the design of optimization objective.

For example, the complexity of decision tree depends on the number of leafs; the complexity of linear classifier depend on the margin (or more exactly).

Lecture 20 Clustering Method

Not interested, skip.

Lecture 21 EM algorithm & PCA

For models with hidden variables, EM will be used to maximize the likelihood of the data. EM can be viewed as a co-ordinate ascent algorithm, optimize the distribution of hidden variable and joint probability distribution.
EM is just optimize the lower bound of likelihood of data.

PCA can be explained in two different perspectives:
1. maximize the variance of projected data points.
2. minimize the residual of original data points and projected data points.

Lecture 22 PCA & Neural Network

PCA: obtain the principle component through SVD of co-variate matrix, get the top-k eigen vector.
Disadvantage of PCA:
1. Can’t use information like class label.
2. Can only handle data with guassian like distribution.

For deep learning, we have deep belief network and deep neural network. Deep belief network is just graphical models, directed or undirected. Deep neural network is neural network with many hidden layers.

For training algorithm, we can use SGD or Others.

Lecture 23 Neural Network & BP & Spectral Method

Problem of Deep Neural Network
1. Overfitting
2. non-convex
3. Poor conditioning: some gradient can vanished and some of them will be exploded into infinite large and so on.

Conditioning problem
1. explosion/shringe of gradient.
2. derivative about 0 will be very small.

Solution:
1. grading clipping: normalize the norm of gradient.

Overfitting
1. Early Stopping
2. Penalty with or .

Spectral Method skipped.