追风人: 五月 2015

Notes on 10-701 Introduction to Machine Learning

Course from CMU 2014 Fall, general introduction to machine learning.

Lecture 1 Introduction to Machine Learning

Basic introduction about machine learning. The definition of Task, Experience and Performance is quite clear for me now.
Task: difference of task comes form different perspective, type of training data, type of testing data, how the training and testing data is arrived and how the training and testing data is used. We have following different scenarios:
1. Supervised Learning: Batch of training data with label and do not know testing data when training. like classification and regression.
2. Unsupervised Learning: Batch of training data without label and make prediction on unseen data points. It’s hard to evaluate the result of unsupervised learning. Typical tasks include clustering, dimensionality reduction and density estimation.
3. Semi-supervised Learning: Batch of training data with label and without label, making predictions on unseen data points.
4. Transductive Learning: Batch of training data with label and testing data without label, we only make predictions on these test point.
5. Online Learning: Learner receive training and testing data point in a fixed fashion. One unseen data point come, make prediction, and receive a true label. The total goal is the minimized the cumulative loss.
6. Reinforcement Learning: training data and testing data are mixed, the learned is required to interact with the environment.
7. Active Learning: Learning adaptively select training sample for label, hope to get better performance with fewer samples.
8. Clustering: unsupervised learning, find different data regions.
9. Dimension Reduction: project high dimension data into low-dimension.
10. Density Estimation, Sparse coding

The above list is just a simple list of possible scenarios, we can find much complex scenario for machine learning algorithm.
Experience: Training samples
Performance: We all all kinds of performance criteria like:
1. squared loss
2. hinge loss
3. log loss
4. absolute loss
5. negative log-likelihood
6. negative cross entropy

Lecture 2 Bayesian Optimal Classifier and Naive Bayes Classifier

For classification problem, we can derive the bayesian optimal classifier, i.e. the best possible classification result given the correct joint distribution of training data and class label.
$\hat{y} = argmax_{y} {p(Y=y | X = x)}$
After using bayes rule, we can derive the following formula:
$\hat{y} = argmax_{y} \; {p(Y=y) P(X=x|Y=y)}$
The only problem for this formula: we do not have the exact distribution and learning the distribution need too much sample.
Naive Bayes Classifier solve this problem with a strong assumption, all the features are conditional independent give the class label.
In practical, we need to smoothing the probability distribution for each feature.

Lecture 3 Perceptron

Using linear model as example introduce perceptron algorithm. Perceptron used to training model in online version.
For model in the following form:
$y = sign(w ^ T x)$
where $w$ is model parameter. When we have a bunch of linear separable training data, we will need to solve a feasibility problem:
$argmin_{w} \; 1$
$\quad s.t. \; y_{i} w ^ T x_{i} \gt 0$

For this problem, we can train it with:
a. Linear Programming;
b. Perceptron algorithm

When solve this problem with linear programming, we can’t send the problem directly to LP solver: because of the strict inequality. Most of the cases, solver will convert strict inequality into $\ge$ , this will have a trivial solution, i.e. $w = 0$ . So we need perform transformation to this optimization problem. Assume the data is linear separable, then we will find an solution to this problem $w ^ *$ , after getting this solution, we can find the minimum value of $\epsilon = min(y w ^ T x)$ , then the problem will convert to following form:
$argmin_{x} \; 1$
$\quad s.t. \; y_{i} w ^ T x_{i} \ge \epsilon$
Since we don’t have constraints on $w$ , we can divide both side by $epsilon$ , then we can get following form:
$argmin_{w} \; 1$
$\quad s.t. \; y_{i} w ^ T x_{i} \ge 1$
Finally, we can send this problem to LP solver.

When using Perceptron algorithm, we will have a very easy update rule:
when current model predict wrong label, we’ll add the $y_{i} x_{i}$ to current parameter.
key point for perceptron implementation:
1. data should be random pertubed.

Convergence Issue
1. converge guaranteed under liner separable assumption of data.
2. not if sample not linear separable, use averaged parameter will get convergence assumption.

Proof of perceptron convergence?
Margin Theory, i.e. $argmax_{w} argmin_{i} y_{i} w ^ T x_{i}$ . This means the largest possible margin between decision boundary.

Lecture 4 Logistic Regression and Perceptron

Logistic regression is a transformation of linear classifier, hoping to get a probabilistic interpretation of classification result.
Another topic is about generative model and discriminative model. For distriminative model, we have model the conditional relation between label and features. For generative model, we model the joint distribution of features and labels. The point is the different assumptions model make on the training data. With discriminative model, there is no extra assumptions about the data. But with generative model, we need to find a way to model the distribution of data $p(x)$ , this is not an easy talk under most circumstances.

Lecture 5 Analysis of Convergence of Perceptron

Analyze the convergence for non-separable case perceptron, also introduce the margin perceptron:
update the parameter $w$ when $y_{i} w ^ T x_{i} \lt 1$ , this will lead to better performance.

Lecture 6 Kernel Method & RHKS

Not interested, skip.

Lecture 7 Kernel Methods Continued; Support Vector Machine

Skip kernel method part and focus on support vector machine part.
Still derive the algorithm from maximum-margin point, clear already.
Extend linear SVM with kernel trick from dual space.

Recitation 3 Optimization in Machine Learning

Most of the optimization rely on backtracking and sufficient decrease condition.

Lecture 8 Linear Regression

Linear regression used to predict real target value. Using $L_2$ regularization equivalent to using a gaussian prior on the parameters. Using $L_1$ regularization equivalent to using a laplance prior on the parameters.
Using a $L_2$ regularization will get a invertible matrix. Nothing more.

Recitation 4

Derivation of SVM, kernel method.

Lecture 9 Nonlinear Regression

Still about regression, but talk about polymonial features. In this kind of setting, we have much more features than the original feature.
Nonlinear feature expansion of original features.
Error decomposition of training and testing error.

Ridge regression can have dual problem and find a similar interpretation of SVM.

Gaussian Process for Regression is also talked about.

Lecture 10 Optimizatioin & Convergence Analysis

Lectures about optimization, gradient descent, Newton’s method, convergence analysis.
Two major target for optimization:
1. Maximum Likihihood Estimate
2. ERM like margin and so on.

Gradient Descent and newton.

Recitation 5

Kernel Method: Kernel method are a strategy to learn complex function.

Lecture 11 Continue the Optimization Topics

Talk about newton method with equality constraints, barrier method for inequality method; Find another explanation about mirror descent.

For non-parametric model, we do not have pre-defined distribution for model and hope our model can can more complex when have more data.

Mirror descent works in the following way:
1. for variable $z$ calculate $z_{i+1} = z_{i} - \eta g$ .
2. solve following optimization problem: $argmin_{w} P(w) - z ^ T w$

Lecture 12 Nonparametric methods

Density estimation based on histogram and kernel method.
For classification task, kNN is the nonparametric methods. Weighted least regression, locally linear regression can have same effects as kernel methods.

Lecture 13 Decision Tree & Boosting

Several critical points for decision tree:
1. Criteria to select attribute for node splitting
2. Way of node splitting, only two-way splitting or multiple-way splitting ?
3. node pruning or tree stopping rule.

ID3: do not all the re-use of attribute for node splitting.

Pre-pruning:
1. fixed tree depth
2. fixed number of leafs

Post-pruning:
1. chi-square test of independence: convert tree into a list of rules at first, then eliminate rules independent of labels.

Boosting used to combine multiple weak learner to give better performance. Each iteration update sample weights.

Lecture 14 Boosting & Graphical Model

Boosting is used to optimize a exponential loss function. Then we optimize the bound find appropriate parameters.
Key idea for graphical model is the structure of joint distribution. We can use structure to save a lot of parameters.

Lecture 15 Graphical Model

Variable elimination works in a dynamic programming way for graphical model inference. Introduction to directed and undirected graphical model.

Lecture 16 Graphical Model

Bayesian is a DAG, we have d-separation for all the nods.
Factor graph is a biparte graph. Node represent factor and variables. variable elimination used to compute the marginal and conditional probability distribution.

Lecture 17 Belief Propagation

Belief propagation only works for tree-like graph. For other cases, need other helps. Use variable elimination and other methods to get clique tree, then running belief propagation.
For non-tree like graphical model, we can use junction tree algorithm to convert to clique tree, then running belief propagation on the clique tree.

Lecture 18 Graphical Model & Learning Theory

Plate notation used for graphical model, plate notation used for model with large number of variables.
Learning theory deals with the sample complexity:
1. how do we evaluate the complexity of hypothesis we want to learn
2. how many samples do we need to learn a hypotheses
3. the difference between training error and test error, its relation with hypothesis complexity and sample numbers.

Most of the questions are for theoretical study, they can only for intuition but not accurate solution of some problem.

Consistent hypothesis refers to a hypothesis that achieve 0 error rate on training data. For this kind of hypotheses, we need to use union bound to get the upper bound. For hypothesis with training error larger than zero, we need another inequality to find the upper bound.

Lecture 19 VC Dimension and Bounds

PAC learning framework provide a way to describe the different between training error and testing error which depends on the complexity of the hypothesis family.
If the hypothesis family is finite, then we can use the number of hypothesis to describe the complexity of the hypothesis family.
If the hypothesis family is infinite, then we may use VC dimension and Redmacher complexity to describe the complexity.

VC dimension: The maximum number of points which can be classified exactly for any labeling. We only need to find one placement of training points and for each possible labeling, we can find a hypothesis to classify the samples correctly.
Redmacher complexity: for $m$ training samples, we give training labels random with probability $p(\delta = +1) = \frac{1}{2}$ and $p(\delta = -1) = \frac{1}{2}$ . Then $\epsilon(H) = E_{\delta} \left[ sup_{h \in H} \left( \sum_{i=1}^{m} { \delta_i h(X_i) } \right) \right]$ .

Whatever VC and Redmarcher used to describe the complexity and they all have their only defect, the only use of these bounds will guide the design of optimization objective.

For example, the complexity of decision tree depends on the number of leafs; the complexity of linear classifier depend on the margin (or $\|w\|$ more exactly).

Lecture 20 Clustering Method

Not interested, skip.

Lecture 21 EM algorithm & PCA

For models with hidden variables, EM will be used to maximize the likelihood of the data. EM can be viewed as a co-ordinate ascent algorithm, optimize the distribution of hidden variable and joint probability distribution.
EM is just optimize the lower bound of likelihood of data.

PCA can be explained in two different perspectives:
1. maximize the variance of projected data points.
2. minimize the residual of original data points and projected data points.

Lecture 22 PCA & Neural Network

PCA: obtain the principle component through SVD of co-variate matrix, get the top-k eigen vector.
Disadvantage of PCA:
1. Can’t use information like class label.
2. Can only handle data with guassian like distribution.

For deep learning, we have deep belief network and deep neural network. Deep belief network is just graphical models, directed or undirected. Deep neural network is neural network with many hidden layers.

For training algorithm, we can use SGD or Others.

Lecture 23 Neural Network & BP & Spectral Method

Problem of Deep Neural Network
1. Overfitting
2. non-convex
3. Poor conditioning: some gradient can vanished and some of them will be exploded into infinite large and so on.

Conditioning problem
1. explosion/shringe of gradient.
2. derivative about 0 will be very small.

Solution:
1. grading clipping: normalize the norm of gradient.

Overfitting
1. Early Stopping
2. Penalty with $L_1$ or $L_2$ .

Spectral Method skipped.

追风人

2015年5月9日星期六

10-701 Course notes