Monday, June 22, 2015

Summary on Machine Learning Topics

@(Machine Learning)[Summary,Machine Learning]


Machine learning is composed of three critical components:
1. Specific task. Each algorithm works on a specific task rather than being a general approach to everything. Tasks differ from each other in the type of training data, the type of testing data, how we obtain the training data, and how we use the testing data.
2. Experience. This refers to anything that can be represented in an appropriate format. In my personal understanding, experience is just the interaction of objects, such as a human and the environment. Human interaction should, and usually will, be the main source of improvement.
3. Performance. This is the most important step during the development of an application. We need to first fix the performance evaluation of our task; only then can we find appropriate ways to improve it. Under most circumstances, we cannot optimize the performance directly; we can only optimize a surrogate of our final performance measure. For most problems, this idea works very well.

So before we start a piece of work, we need to select the type of task, the performance evaluation metric, the type of experience and, most importantly, the model we will use.
There are at least three aspects related to a machine learning model:
1. Hypothesis space. This is the most important part of a machine learning model: a hypothesis defines how to map the input to the output, i.e., what the relationship between input and output is. For each hypothesis space, we need a way to evaluate its complexity and a way to quantify the difference in performance of a specific hypothesis on training data versus testing data; this requires the help of learning theory.
2. Loss function. Among all the hypotheses in the hypothesis space, we will pick a specific one as the final result of learning; the loss function is the criterion for this selection.
3. Learning algorithm. With the hypothesis space and loss function chosen, we need a learning algorithm to actually select the hypothesis. Most learning algorithms are essentially numerical optimization algorithms, but there are exceptions such as decision tree induction. Numerical optimization for machine learning has different requirements from the standard setting: machine learning cares about generalization ability rather than numerical precision. A minimal end-to-end sketch of these three aspects follows below.
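
As a concrete illustration of the three aspects, here is a minimal sketch (plain NumPy, hypothetical toy data) where the hypothesis space is linear, the loss function is the log loss, and the learning algorithm is batch gradient descent:

```python
import numpy as np

# Hypothetical toy binary classification data: labels in {0, 1}.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# 1. Hypothesis space: linear functions h(x) = sigmoid(w . x + b).
def predict_proba(w, b, X):
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

# 2. Loss function: log loss (negative log-likelihood) averaged over the data.
def log_loss(w, b, X, y):
    p = predict_proba(w, b, X)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# 3. Learning algorithm: batch gradient descent on the empirical risk.
w, b = np.zeros(2), 0.0
lr = 0.1
for _ in range(500):
    p = predict_proba(w, b, X)
    grad_w = X.T @ (p - y) / len(y)   # gradient of the average log loss w.r.t. w
    grad_b = np.mean(p - y)           # gradient w.r.t. the bias
    w -= lr * grad_w
    b -= lr * grad_b

print("final training log loss:", log_loss(w, b, X, y))
```

Keeping the same linear hypothesis but swapping the log loss for the hinge loss would give an SVM-style classifier instead, which is exactly the point made in the hypothesis-space section below.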

Most books about machine learning focus on these three aspects. Topics about hypothesis spaces are the key to machine learning, and we also see plenty of discussion of loss functions. Learning algorithms, however, require books from the numerical optimization community rather than from machine learning. With the increasing size of data, distributed optimization and fast optimization have become more and more important.

Loss Function

There are two different approaches to choosing the loss function:
1. Maximum likelihood of the data. This is the form suggested by probability theory when fitting parameters. When the output of the hypothesis can be interpreted as a probability, maximum likelihood is a natural principle for defining the loss function.
2. Empirical risk minimization. This is another important principle. This approach optimizes a surrogate of our real loss function. For example, for a classification task we actually want to optimize the 0-1 loss, but that function is too hard to optimize directly, so we use the log loss or the hinge loss instead. The cost is evaluated on the training data, i.e. the empirical risk $\hat{R}(f) = \frac{1}{n}\sum_{i=1}^{n} L\bigl(y_i, f(x_i)\bigr)$, and we then optimize this function based on the empirical data.
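
To make the surrogate idea concrete, with labels $y \in \{-1, +1\}$ and a real-valued score $f(x)$, the 0-1 loss and the two common surrogates mentioned above can be written as:

$$
\begin{aligned}
\text{0-1 loss: } & L_{0/1}\bigl(y, f(x)\bigr) = \mathbf{1}\bigl[\,y\,f(x) \le 0\,\bigr] \\
\text{log loss: } & L_{\log}\bigl(y, f(x)\bigr) = \log\bigl(1 + e^{-y f(x)}\bigr) \\
\text{hinge loss: } & L_{\text{hinge}}\bigl(y, f(x)\bigr) = \max\bigl(0,\ 1 - y\,f(x)\bigr)
\end{aligned}
$$

Both surrogates are convex in $f(x)$, which is what makes them tractable to optimize, unlike the 0-1 loss itself.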

Hypothesis Space

A hypothesis describes the relation between input and output. We can classify hypotheses according to different criteria: linear versus non-linear, parametric versus non-parametric.
The common hypothesis spaces can be summarized as follows:
1. Linear hypothesis space: combined with different loss functions, we get different classifiers. Combined with the log loss it gives Logistic Regression; combined with the hinge loss it gives the Support Vector Machine.
2. Tree hypotheses: a decision tree is actually a non-parametric hypothesis space. Boosted trees optimize a chosen loss greedily; boosting with the exponential loss recovers AdaBoost, while GBDT handles general differentiable losses. We have GBDT, Random Forests, and so on.
3. Kernel-related hypotheses: we can combine kernel methods with linear methods. This is a non-parametric hypothesis space, because the final result is expressed in terms of the training data; see the sketch after this list.
4. Graphical model hypotheses: these hypotheses are described with exponential family distributions, so we can use different loss functions.
5. Neural network hypotheses: a highly complex, non-convex hypothesis family. It is hard to optimize and easy to overfit.
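
To illustrate why kernel hypotheses are non-parametric, here is a minimal sketch (plain NumPy, hypothetical toy data) of kernel ridge regression with a Gaussian (RBF) kernel. Note that the learned predictor keeps the training points around and uses them at prediction time:

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.uniform(-3, 3, size=(50, 1))              # hypothetical toy data
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.normal(size=50)

def rbf_kernel(A, B, gamma=0.5):
    # K[i, j] = exp(-gamma * ||A_i - B_j||^2)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# Fit: solve (K + lambda * I) alpha = y  (kernel ridge regression).
lam = 1e-2
K = rbf_kernel(X_train, X_train)
alpha = np.linalg.solve(K + lam * np.eye(len(X_train)), y_train)

# Predict: the hypothesis is a weighted sum of kernels centered on the
# *training* points, which is what makes the model non-parametric.
def predict(X_new):
    return rbf_kernel(X_new, X_train) @ alpha

print(predict(np.array([[0.0], [1.5]])))
```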

Learning Algorithm

For most machine learning models, we need a learning algorithm to find a hypothesis according to the loss function. Some algorithms are greedy, like decision tree building, but most involve numerical optimization. This means optimization is at the heart of machine learning, and many models rely heavily on numerical optimization. Here is a summary of the optimization algorithms:
1. First-order methods
First-order methods are very cheap compared with other methods and are widely used for neural networks. They include Stochastic Gradient Descent, Proximal Gradient, Subgradient Descent, Mirror Descent, and so on.
2. Second-order methods
The most important one is Newton's method, along with related methods such as Conjugate Gradient, L-BFGS, and so on. Under most circumstances machine learning does not use Interior Point Methods or Sequential Quadratic Programming, because these methods cost too much memory. A small sketch comparing a first-order and a second-order method follows below.
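
As a small sketch of the difference between the two families (on a hypothetical toy problem), the snippet below minimizes a smooth convex one-dimensional objective with gradient descent (first order) and with Newton's method (second order); Newton's method typically needs far fewer iterations but requires the second derivative:

```python
import numpy as np

# Hypothetical toy objective: f(w) = log(1 + exp(w)) + 0.5 * (w - 1)^2
def grad(w):
    s = 1.0 / (1.0 + np.exp(-w))        # sigmoid(w)
    return s + (w - 1.0)

def hess(w):
    s = 1.0 / (1.0 + np.exp(-w))
    return s * (1.0 - s) + 1.0

# First-order method: plain gradient descent with a fixed step size.
w = 5.0
for _ in range(100):
    w -= 0.5 * grad(w)
print("gradient descent:", w)

# Second-order method: Newton's method rescales the step by the curvature.
w = 5.0
for _ in range(10):
    w -= grad(w) / hess(w)
print("newton's method: ", w)
```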

Some very hot topics in learning algorithms are regularization and distributed optimization. For distributed optimization, we have .

Here is a short summary of the numerical optimization topic, which is itself a very big subject. We can classify the algorithms based on the objective, the constraints, and so on. From this perspective, the common problem classes and their standard forms are:

| Problem Type | Mathematical Description |
| --- | --- |
| Linear Programming | $\min_x\ c^T x \ \text{s.t.}\ Ax \le b$ |
| Quadratic Programming | $\min_x\ \tfrac{1}{2} x^T Q x + c^T x \ \text{s.t.}\ Ax \le b$ |
| Second Order Cone Programming | $\min_x\ c^T x \ \text{s.t.}\ \lVert A_i x + b_i \rVert_2 \le c_i^T x + d_i$ |
| Semi-Definite Programming | $\min_X\ \operatorname{tr}(CX) \ \text{s.t.}\ \operatorname{tr}(A_i X) = b_i,\ X \succeq 0$ |
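
As an illustration of one of these standard forms, here is a minimal sketch of a small quadratic program solved with CVXPY (assuming the `cvxpy` package is installed; the problem data is hypothetical):

```python
import numpy as np
import cvxpy as cp

# Hypothetical QP data: minimize 0.5 * x'Qx + c'x subject to Ax <= b.
Q = np.array([[2.0, 0.5], [0.5, 1.0]])
c = np.array([-1.0, -1.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])

x = cp.Variable(2)
objective = cp.Minimize(0.5 * cp.quad_form(x, Q) + c @ x)
constraints = [A @ x <= b]
prob = cp.Problem(objective, constraints)
prob.solve()

print("optimal x:", x.value)
```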

I think that for engineering purposes, most work will focus on the learning algorithm, especially large-scale training and distributed optimization. The development of hypothesis spaces and loss functions will be slower.

Written with StackEdit.