龙星计划笔记

龙星计划的机器学习视频，2012年的。主讲者包括余凯和张瞳。这个笔记是为了记录看视频中觉得有趣的内容，以免自己忘记了。

线性模型
形式如下： $y = w ^ T x$
Boosting & Bagging
这里主要的内容就是如何从弱分类器构造强分类器，Boosting就是这样的方法。AdaBoost通过对样本给予不同的权重来实现这个效果，GBDT是通过不同的target来做拟合。

2. Linear Model

线性模型使用下面的方法来解决： $y = w ^ T x$ .对于这个问题来说，有如下的问题需要解决：

如果估计模型参数 $w$
估计出来的模型参数 $w$ 有多好，还能不能继续改进。
在p个参数之中，哪些变量更重要些。

9. Learning Theory

Learning Theory的核心内容在于generalization analysis，也就是如何从training error估计test error。如果training error和test error的差距太大，那么就会产生over-fitting，如果有一个办法可以很好的估计training error和test error的误差，那么可以有效的避免误差。
1. Generalization analysis主要处理一下的问题：
1. 在训练集上获得的最好模型与最好的模型由多大的差距
2. 如何由训练集上的误差来估计这个模型在测试集上的错误。
Generalization analysis的主要的工具就是Uniform Convergence这个方法。推到Uniform Convergence的主要方法就是几个概率不等式。

10. Optimization in Machine Learning

Gradient Descent
Proximal Gradient
Coordinate Descent
Dual Coordinate Descent
LBFGS (second order method)

11. Online Learning

Regret analysis instead of Generalization analysis
Approaches to Online Learning
- Perceptron Learning algorithm

original online learning algorithm.

Online convex optimization
keypoint: transform the batch method to online optimization method.
- Online gradient descent
- Online proximal gradient descent (L1-regularized solved by proximal gradient contains subtle problem which do not leads to sparse solution. Dual averaging should solve the problem.)
- online second order algorithm also works.
(Online optimization faster convergence than batch method)

Stochastic Gradient Descent is the workhorse of online learning, works much better than other algorithm.

12. Sparsity Models

Used when face high dimention data, only when some of the variable are useful.
convex relaxation of sparsity models.
1. Use L1-norm instead of L-0 norm
2. use greedy (forward stage-wise algorithm to select variable) algorithm to select the problem.
L1-norm regularization is just a convex-relaxation of L0-norm normalization.
Other kind of sparsity

Structure sparisity of some problem, like group sparsity.

13. Graphical Model

Key point to this kind of model
Definition of graphical model
1. variable
2. Conditional independence.

Inference
Learning

14. Structure Learning

Structure exists in input, model and output.
input structure like:
Naive Bayes
model structure
model sparsity, model group sparisity and other structures.

output structure
most important is the sequence labeling task.
HMM
MEMM
CRF
M3N

other development.

Learning algorithms
- perceptron learning
- max-margin learning
- so on.

15. Deep Learning & Feature Learning

Deep learning is architecture for machine learning. We use deep learning approach to learn more representative feature for further task.

CNN like with deep layer network.
autoencoders for feature learning with unsupervised data. (love this)

16. Transfer Learning & Semi-Supervised Learning

differetn task share same input space.(Transfer learning, also called multi-task learning)
typical example:
Hierarchiel linear model
nerual netowrk

common approach: Learn a common feature representation, then use this feature to perform different task.
Bayesian approach to transfer learning.

17. Recommendation

CF is the workhorse for recommendation.
CF treated as matrix completion problem get the best performance.
SVD like matrix decomposition (low-rank matrix factorization) gives the best idea.

19. Application of Machine Learning on WEB

Query classification
Key point of web-based classification is to use multiple sources of information to peform ranking. For example, we want to classify a short query, we may use the label of the related web pages.
Ranking

GBDT with different cost function.

Dragon Start 2010 course

Generative approach to machine learning leads to Bayesian Error, which stands for the best possible performance.

amazing facts about kNN classifier:
1. close to optimal (Bayesian Error)
2. Density estimation used.

Lecture 2. Generative & Discriminative

Generative vs. Discriminative
Naive Bayes vs. Logistic Regression
Naive Bayes need the data fits gaussian distribution.
Logistic regression do no assume the distribution. Naive Bayes convergence faster than logistic regression.

Lecture 3. Neural Network

BP like algorithm.
DBN, (Boltaman machine), CNN algorithm for image classification.

Lecture 4. Support Vector Machine

max-margin classifiers, talk about primal problem, dual problem and SMO algorithm.

Lecture 5. Learning Theory

Goal of Learning Theory
- Estimate the testing error from training error.
- Relation of current training error and the best training error under same class of model.
- Sample complexity, how many samples needed to train a model so as to achieve desired accuracy.
- how many iterations and resources need to run the algorithm.
Notes on the bound of Learning Theory
1. All the bounds from learning theory are too loose, thus may not give the exact number. but the trends is okay.
2. Some assumptions are made in the derivation of error bounds, be careful. Some of the bounds are not satisfied.

Inequality
- Union equality
- Hoeffding inequality

PAC
finite number of hypothesis
infinite number of hypothesis, VC dimension. (some set of points that can be shuttered by the hypothesis.)

Lecture 6. Over-fitting & Model selection

Big problem when dealing with learning problem, the phenomenon is you get a relative good score in training set yet bad on test set. (Watch out: You can’t judge you model is over-fitted by its performance on test set. If you change your model when its performance is poor on test set, you will overfit to test set.)

One intuitive explanation of over-fitting is you model fit a sampling or some kind of random error too much. Every model will over-fit to

Bias-Variance trade off

We can decompose the test error into two terms, first one is bias and the other one is variance. We assume the true hypothesis is $h(x)$ and our hypothesis is $\hat{y}(x)$ . Here is the derivation:
$E_D\left[ (\hat{y}(x;D) - h(x) ) ^ 2 \right] = (E_D \left[ \hat{y}(x;D) \right] - h(x) ) ^ 2 + E_D\left[ { (\hat{y}(x;D) - E_D \left[ {\hat{y}(x;D)} \right] ) ^ 2 } \right]$ . Here we get the bias term and variance term. The explanation as follow:
We draw a fixed size of $m$ samples from underlying yet unknown distribution $\Phi$ to form training data $D$ . And then we train our hypothesis on the sampled data $D$ and test a fixed point $x$ . We repeat this experiment for infinite times and get the expectation of our prediction $E_D \left[ \hat{y}(x;D) \right]$ . Then the expectation of difference between our trained hypothesis and true hypothesis is decomposed into two terms. One is the difference of prediction expectation and true result, the other is the variance of our prediction.

Explanation & Advice from Learning Theory

Q. Under which condition will a learning model be consistent.
A. A model will be consistent if and only if the VC dimension of target hypothesis space is finite. (A model consistent means the our trained hypothesis can approach the true hypothesis, this means a model will generalize well when its VC dimension is finite.)

Q. What’s the main reason for the difference between training and test error for a training sample of finite size $m$ ?
A. A formula as follow: $\left| Error_{test} - Error_{train} \right| \le \; g(d/m)$ , here the $d$ is the VC dimension of our hypothesis class and $m$ is the training sample size.

Insight form Learning Theory: For fixed sample size, complex model will lead to larger difference between training and testing error; For fixed VC dimension, more sample will lower the performance gap between training and testing.

Generalization Theory & Structural Risk Minimization

$Risk \; Expectation = Empirical \; Risk + Structural \; Risk$ , The empirical risk is the training error, the structure risk can be viewed as the function of VC dimension and sample size. We have the following bound:
$Generalization = TrainingError + \sqrt{\frac{d(\ln{(2m/d)} + 1) - \ln{\delta}}{m}}$
So we not only need to perform empirical risk minimization, but also need structural risk minimization. And we know the structural risk minimization comes from two perspect: VC dimension and sample size, but this two factors combined in a complex way.
For example, we get series models $H_1, \; H_2, \; ... \; H_n$ and their VC dimension non-decreased order. as model complexity increased, the empirical risk get down but structural risk get increased. We need a comprised model which give the best generalization performance.

Q. how can be control the VC dimension of hypothesis classes
- Reduce the number of parameters (works in most cases)
- If the number of parameters is fixed, we can use regularization.

Q. Connection between Bias-Variance and Structural Risk
A. Low structural risk model will be biased easily,a more complex model (which means high structural risk) will have high variance.

Against Over-fitting

Avoid over-fitting need to choose one model among lots of model. One basic yet very useful way would be

Cross Validation

practical advice for cross validation
- small data size need larger $K$ .
- larger K will reduce the variance.

bias-variance trade off in cross validation. the number of folds $K$ .
- larger K leads to lower bias estimate of the generalization error, but this estimate can be high variance.
- smaller value K will leads to higher bias but lower variance.

Regularization

regularization will reduce the structural risk, but you will need some kind of regularization parameters that need be determined through cross validation.

another interesting point about the regularization is the regularization can be interpretation of bayesian statistics. Like L2-norm regularization corresponds to gaussian prior, lapalce prior leads to L1-norm and so on.

Feature Selection

very effective through L1-norm regularization.
Another approach is to rank the feature and delete the feature that is not important. Ranking method listed below:
Mutual Information and So no.
Approaches to feature selection:
- filter ( ranking feature individually)
- wrapper (test the performance on some validation set)
- embedding (eliminate the feature at the same time of training models)

Information Criterion

like AIC/TIC, models the likelihood but also the model complexity.

Bayesian Model Averaging

Model the complexity through the following formula:
$P (D|M) = \log P(D | \hat{ \theta }_{ML}) - \frac{k}{2}\log{N}$
Often much better way than cross validation.

Lecture 7. Dimensionality Reduction

Parametric-driven dimensionality reduction
data-driven dimensionality reduction.
Linear dimensionality reduction:
- PCA
- LDA

Nonlinear dimensionality reduction:
- Isomap
- LLE

Lecture 8. Spectral Clustering

Some kind of matrix and then using SVD like methods to extract the features.

Spectral Clustering supplement to k-means clustering method.
Form the similarity matrix, calculate the degree matrix and get some kind of eigen value to get the cluster.

k-mean aimed at solving problem with gaussian-like distribution, but spectral clustering cal solve problem with much fancy structure.

Question: Need too much memory.

Lecture 9. Probabilistic Graphical Model

Graph model contains independence assumption and easy be factored.

directed graphical model can be factored according to conditional probability density. Independence can be retrieved from d-separation.

Undirected model is markov random fields, composed into factors.

Lecture 10. EM & Mixture Model

Optimize over the lower bound of full likelihood function.

Learning
Inference

Lecture 11. Exact Inference

Three most common inference problem
- likelihood of observed variable
- posterior of variable given evidence (observed variable)
- MAP estimate of variable Y given evidence X.

Algorithm
- Variable Elimination
- Message Passing
- Junction Tree Algorithm (Tree of cliques)

Lecture 12. Learning

Estimate parameters
- ML of completely observed of given structure. (Easy to calculate, but need to think data sparsity problem)
- Learning partially observed GM. (Need EM algorithm)
Learning depends on the inference algorithm.

Learning Structure model is hard.

Lecture 13. from HMM to CRF

CRF is a markov random field based on the observation which overcome label bias problem in HMM. The reason for label bias is easy, because HMM is an locally normalized model whereas CRF is global normalized model which could avoid the impact of locally feature sparsity.

Lecture 14. Latent Variable Model

Hierarchry Model for Text and Vision

Lecture 15. Approximate Inference

Mean field
Gibbs Sampling
for approximate inference

Lecture 16. Maximum Margin Markov Network

Transfer what the objective from svm to markov network, i.e. maximize the margin.

Key point:
Test with synthetic data to test the algorithm and result.

2014年10月14日星期二

Notes on Dragon Star