Graphical Models

Language for describing the independence relations in lots of variables, useful tool for high dimensional data processing. Two different set of graphical models: directed graphical model(bayesian network), undirected graphical model(markov random fields).

Key concept in graphical model: Probability distribution, factorization, independence set.

Bayesian Network

Concept of d-separation

Two variables are separated by some conditions, then they are independent.

If one probability distribution P factorize over G, then I-map of G contains all the independence of P.
If I(G) is satisfied by some distribution P, then P can factorize over G.

What does this mean?
In a probability distribution P, we can read the independence implied by P easily from graph G. And a graph G can can easily show us the relation between variables in P. Graph G as a bridge that connect the probability distribution P and independence set I(G) in a very natural way.

Structure CPD

Conditional Probability Distributions are very important to implement bayesian network. we can so define structures for CPD.
- Tree structure CPD
different variables are composed together in a tree structure.
- Multiplexer CPD
one catogerical value decide which parents are used to affect the son.
- Noisy OR kind model
Forget the details.

Markov Random Fields

One important difference between BN and MRF is that, the factorization over a MRF is not unique. This means we can write different probability P that can factorize over graph H.
If probability P factorize over H, then the independence set I(H) is a subset of independence of P implied.
Only for positive distribution, If the independence set of P is the same as independence of graph H, then P can factorize over H.

From factor to log-linear model

log-linear model is used to encode local structures into global information. Factors give value for each assigned variable in its scope, but in log-linear model, we give difference parameters to different configuration of variables. Well, from my perspective, this idea prevent us from defining the probability distribution for each factor and how to parameterize them, that would be very difficult.

parameter sharing is very common in graphical model. In BN, if we assume the time invariant and markov property, then we can replcate parameter between different time step. In MRF, we need to ignore the position and share parameters between different subsets of variables.
from my person options, parameter sharing because of data sparsity problem. We do not have enough data to learn complex model.

Notes

Not all the probability distribution can be represented as graph(either BN or MRF).
BN and MRF has its own ability in express probability distribution, they can not replace each other.
How to design a graphical model

Inference in Graphical Model

“Inference” means answer some queries after we get the parameter of the model. There are three different kind of inference problems in general:
- Marginal Probability
In this problem, we need to estimate the probability of $p(x_1, ..., x_k)$ ,
which means you need to summarization over other variables.
- Conditional Probability
- Maximum A Posterior(MAP) estimation.

MAP Inference

If perform on a clique tree algorithm, we turn the algorithm from sum-product algorithm to max-product algorithm. The reason is simple, sum-product is only used for compute the marginal distribution but the MAP problem is about the most probable one.

Another perspective from optimization is to convert the inference to optimization problem and utilize dual decomposition to solve the problem.

Learning in PGM

Learning the parameters in PGM is important, because we want to use it in real environment. For the learning of PGM, there are two different aspects: Structure and Parameter
There is structure in the graphical model, in the most circumstances, we have know the structure in advance. But in very rare situation, we need to learn the structure from the data. And this is called Knowledge Discovery
Parameters are very important for BN and MRF, this should be learned from the data.
So we have different situations when we perform learning:
1. Known structure, Complete data
2. Known structure, Incomplete data
3. Unknown structure, complete data
4. Unknown structure, Incomplete data
5. Unknown structure with hidden variable, Incomplete data

But for the training task, we have several different as follow:
1. Log-likelihood of training data (Corresponding to generative model learning)
2. Conditional log-likelihood learning of training data (Discriminate learning)
Of course, regularization and cross-validation is necessary for every machine learning task.

MLE estimate of BN

Easy, just counts the occurrence.
But the problem lies in the data sparsity, we get worse result with limited data and complex model. But we can simplify the model even it is wrong and get a better generalization ability with limited data.

EM algorithm focus on the expected sufficient statistics in the E step and use these expected sufficient statistics in the Maximization step.

Written with StackEdit.

龙星计划笔记

龙星计划的机器学习视频，2012年的。主讲者包括余凯和张瞳。这个笔记是为了记录看视频中觉得有趣的内容，以免自己忘记了。

线性模型
形式如下： $y = w ^ T x$
Boosting & Bagging
这里主要的内容就是如何从弱分类器构造强分类器，Boosting就是这样的方法。AdaBoost通过对样本给予不同的权重来实现这个效果，GBDT是通过不同的target来做拟合。

2. Linear Model

线性模型使用下面的方法来解决： $y = w ^ T x$ .对于这个问题来说，有如下的问题需要解决：

如果估计模型参数 $w$
估计出来的模型参数 $w$ 有多好，还能不能继续改进。
在p个参数之中，哪些变量更重要些。

9. Learning Theory

Learning Theory的核心内容在于generalization analysis，也就是如何从training error估计test error。如果training error和test error的差距太大，那么就会产生over-fitting，如果有一个办法可以很好的估计training error和test error的误差，那么可以有效的避免误差。
1. Generalization analysis主要处理一下的问题：
1. 在训练集上获得的最好模型与最好的模型由多大的差距
2. 如何由训练集上的误差来估计这个模型在测试集上的错误。
Generalization analysis的主要的工具就是Uniform Convergence这个方法。推到Uniform Convergence的主要方法就是几个概率不等式。

10. Optimization in Machine Learning

Gradient Descent
Proximal Gradient
Coordinate Descent
Dual Coordinate Descent
LBFGS (second order method)

11. Online Learning

Regret analysis instead of Generalization analysis
Approaches to Online Learning
- Perceptron Learning algorithm

original online learning algorithm.

Online convex optimization
keypoint: transform the batch method to online optimization method.
- Online gradient descent
- Online proximal gradient descent (L1-regularized solved by proximal gradient contains subtle problem which do not leads to sparse solution. Dual averaging should solve the problem.)
- online second order algorithm also works.
(Online optimization faster convergence than batch method)

Stochastic Gradient Descent is the workhorse of online learning, works much better than other algorithm.

12. Sparsity Models

Used when face high dimention data, only when some of the variable are useful.
convex relaxation of sparsity models.
1. Use L1-norm instead of L-0 norm
2. use greedy (forward stage-wise algorithm to select variable) algorithm to select the problem.
L1-norm regularization is just a convex-relaxation of L0-norm normalization.
Other kind of sparsity

Structure sparisity of some problem, like group sparsity.

13. Graphical Model

Key point to this kind of model
Definition of graphical model
1. variable
2. Conditional independence.

Inference
Learning

14. Structure Learning

Structure exists in input, model and output.
input structure like:
Naive Bayes
model structure
model sparsity, model group sparisity and other structures.

output structure
most important is the sequence labeling task.
HMM
MEMM
CRF
M3N

other development.

Learning algorithms
- perceptron learning
- max-margin learning
- so on.

15. Deep Learning & Feature Learning

Deep learning is architecture for machine learning. We use deep learning approach to learn more representative feature for further task.

CNN like with deep layer network.
autoencoders for feature learning with unsupervised data. (love this)

16. Transfer Learning & Semi-Supervised Learning

differetn task share same input space.(Transfer learning, also called multi-task learning)
typical example:
Hierarchiel linear model
nerual netowrk

common approach: Learn a common feature representation, then use this feature to perform different task.
Bayesian approach to transfer learning.

17. Recommendation

CF is the workhorse for recommendation.
CF treated as matrix completion problem get the best performance.
SVD like matrix decomposition (low-rank matrix factorization) gives the best idea.

19. Application of Machine Learning on WEB

Query classification
Key point of web-based classification is to use multiple sources of information to peform ranking. For example, we want to classify a short query, we may use the label of the related web pages.
Ranking

GBDT with different cost function.

Dragon Start 2010 course

Generative approach to machine learning leads to Bayesian Error, which stands for the best possible performance.

amazing facts about kNN classifier:
1. close to optimal (Bayesian Error)
2. Density estimation used.

Lecture 2. Generative & Discriminative

Generative vs. Discriminative
Naive Bayes vs. Logistic Regression
Naive Bayes need the data fits gaussian distribution.
Logistic regression do no assume the distribution. Naive Bayes convergence faster than logistic regression.

Lecture 3. Neural Network

BP like algorithm.
DBN, (Boltaman machine), CNN algorithm for image classification.

Lecture 4. Support Vector Machine

max-margin classifiers, talk about primal problem, dual problem and SMO algorithm.

Lecture 5. Learning Theory

Goal of Learning Theory
- Estimate the testing error from training error.
- Relation of current training error and the best training error under same class of model.
- Sample complexity, how many samples needed to train a model so as to achieve desired accuracy.
- how many iterations and resources need to run the algorithm.
Notes on the bound of Learning Theory
1. All the bounds from learning theory are too loose, thus may not give the exact number. but the trends is okay.
2. Some assumptions are made in the derivation of error bounds, be careful. Some of the bounds are not satisfied.

Inequality
- Union equality
- Hoeffding inequality

PAC
finite number of hypothesis
infinite number of hypothesis, VC dimension. (some set of points that can be shuttered by the hypothesis.)

Lecture 6. Over-fitting & Model selection

Big problem when dealing with learning problem, the phenomenon is you get a relative good score in training set yet bad on test set. (Watch out: You can’t judge you model is over-fitted by its performance on test set. If you change your model when its performance is poor on test set, you will overfit to test set.)

One intuitive explanation of over-fitting is you model fit a sampling or some kind of random error too much. Every model will over-fit to

Bias-Variance trade off

We can decompose the test error into two terms, first one is bias and the other one is variance. We assume the true hypothesis is $h(x)$ and our hypothesis is $\hat{y}(x)$ . Here is the derivation:
$E_D\left[ (\hat{y}(x;D) - h(x) ) ^ 2 \right] = (E_D \left[ \hat{y}(x;D) \right] - h(x) ) ^ 2 + E_D\left[ { (\hat{y}(x;D) - E_D \left[ {\hat{y}(x;D)} \right] ) ^ 2 } \right]$ . Here we get the bias term and variance term. The explanation as follow:
We draw a fixed size of $m$ samples from underlying yet unknown distribution $\Phi$ to form training data $D$ . And then we train our hypothesis on the sampled data $D$ and test a fixed point $x$ . We repeat this experiment for infinite times and get the expectation of our prediction $E_D \left[ \hat{y}(x;D) \right]$ . Then the expectation of difference between our trained hypothesis and true hypothesis is decomposed into two terms. One is the difference of prediction expectation and true result, the other is the variance of our prediction.

Explanation & Advice from Learning Theory

Q. Under which condition will a learning model be consistent.
A. A model will be consistent if and only if the VC dimension of target hypothesis space is finite. (A model consistent means the our trained hypothesis can approach the true hypothesis, this means a model will generalize well when its VC dimension is finite.)

Q. What’s the main reason for the difference between training and test error for a training sample of finite size $m$ ?
A. A formula as follow: $\left| Error_{test} - Error_{train} \right| \le \; g(d/m)$ , here the $d$ is the VC dimension of our hypothesis class and $m$ is the training sample size.

Insight form Learning Theory: For fixed sample size, complex model will lead to larger difference between training and testing error; For fixed VC dimension, more sample will lower the performance gap between training and testing.

Generalization Theory & Structural Risk Minimization

$Risk \; Expectation = Empirical \; Risk + Structural \; Risk$ , The empirical risk is the training error, the structure risk can be viewed as the function of VC dimension and sample size. We have the following bound:
$Generalization = TrainingError + \sqrt{\frac{d(\ln{(2m/d)} + 1) - \ln{\delta}}{m}}$
So we not only need to perform empirical risk minimization, but also need structural risk minimization. And we know the structural risk minimization comes from two perspect: VC dimension and sample size, but this two factors combined in a complex way.
For example, we get series models $H_1, \; H_2, \; ... \; H_n$ and their VC dimension non-decreased order. as model complexity increased, the empirical risk get down but structural risk get increased. We need a comprised model which give the best generalization performance.

Q. how can be control the VC dimension of hypothesis classes
- Reduce the number of parameters (works in most cases)
- If the number of parameters is fixed, we can use regularization.

Q. Connection between Bias-Variance and Structural Risk
A. Low structural risk model will be biased easily,a more complex model (which means high structural risk) will have high variance.

Against Over-fitting

Avoid over-fitting need to choose one model among lots of model. One basic yet very useful way would be

Cross Validation

practical advice for cross validation
- small data size need larger $K$ .
- larger K will reduce the variance.

bias-variance trade off in cross validation. the number of folds $K$ .
- larger K leads to lower bias estimate of the generalization error, but this estimate can be high variance.
- smaller value K will leads to higher bias but lower variance.

Regularization

regularization will reduce the structural risk, but you will need some kind of regularization parameters that need be determined through cross validation.

another interesting point about the regularization is the regularization can be interpretation of bayesian statistics. Like L2-norm regularization corresponds to gaussian prior, lapalce prior leads to L1-norm and so on.

Feature Selection

very effective through L1-norm regularization.
Another approach is to rank the feature and delete the feature that is not important. Ranking method listed below:
Mutual Information and So no.
Approaches to feature selection:
- filter ( ranking feature individually)
- wrapper (test the performance on some validation set)
- embedding (eliminate the feature at the same time of training models)

Information Criterion

like AIC/TIC, models the likelihood but also the model complexity.

Bayesian Model Averaging

Model the complexity through the following formula:
$P (D|M) = \log P(D | \hat{ \theta }_{ML}) - \frac{k}{2}\log{N}$
Often much better way than cross validation.

Lecture 7. Dimensionality Reduction

Parametric-driven dimensionality reduction
data-driven dimensionality reduction.
Linear dimensionality reduction:
- PCA
- LDA

Nonlinear dimensionality reduction:
- Isomap
- LLE

Lecture 8. Spectral Clustering

Some kind of matrix and then using SVD like methods to extract the features.

Spectral Clustering supplement to k-means clustering method.
Form the similarity matrix, calculate the degree matrix and get some kind of eigen value to get the cluster.

k-mean aimed at solving problem with gaussian-like distribution, but spectral clustering cal solve problem with much fancy structure.

Question: Need too much memory.

Lecture 9. Probabilistic Graphical Model

Graph model contains independence assumption and easy be factored.

directed graphical model can be factored according to conditional probability density. Independence can be retrieved from d-separation.

Undirected model is markov random fields, composed into factors.

Lecture 10. EM & Mixture Model

Optimize over the lower bound of full likelihood function.

Learning
Inference

Lecture 11. Exact Inference

Three most common inference problem
- likelihood of observed variable
- posterior of variable given evidence (observed variable)
- MAP estimate of variable Y given evidence X.

Algorithm
- Variable Elimination
- Message Passing
- Junction Tree Algorithm (Tree of cliques)

Lecture 12. Learning

Estimate parameters
- ML of completely observed of given structure. (Easy to calculate, but need to think data sparsity problem)
- Learning partially observed GM. (Need EM algorithm)
Learning depends on the inference algorithm.

Learning Structure model is hard.

Lecture 13. from HMM to CRF

CRF is a markov random field based on the observation which overcome label bias problem in HMM. The reason for label bias is easy, because HMM is an locally normalized model whereas CRF is global normalized model which could avoid the impact of locally feature sparsity.

Lecture 14. Latent Variable Model

Hierarchry Model for Text and Vision

Lecture 15. Approximate Inference

Mean field
Gibbs Sampling
for approximate inference

Lecture 16. Maximum Margin Markov Network

Transfer what the objective from svm to markov network, i.e. maximize the margin.

Key point:
Test with synthetic data to test the algorithm and result.

2014年10月26日星期日

Basic Ideas on Graphical Model