Sunday, October 26, 2014

Basic Ideas on Graphical Models

Graphical Models

Graphical models are a language for describing independence relations among many variables, and a useful tool for high-dimensional data processing. There are two different families of graphical models: directed graphical models (Bayesian networks) and undirected graphical models (Markov random fields).

Key concepts in graphical models: probability distribution, factorization, and independence set.

Bayesian Network

Concept of d-separation

If two variables are d-separated given some conditioning set, then they are conditionally independent given that set.

  1. If a probability distribution P factorizes over G, then G is an I-map of P: every independence in I(G) holds in P.
  2. Conversely, if a distribution P satisfies all the independencies in I(G), then P factorizes over G.

What does this mean?
Given a distribution P that factorizes over G, we can read the independencies implied by P directly from the graph, and the graph G clearly shows the relations between the variables in P. The graph G acts as a bridge that connects the probability distribution P and the independence set I(G) in a very natural way.
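
Here is a minimal sketch (a binary chain A → B → C with made-up CPTs, not taken from the post) showing both directions at once: the joint is built by factorizing over the graph, and the d-separation statement C ⊥ A | B can then be verified numerically.

```python
# Minimal sketch: a binary chain A -> B -> C with made-up CPTs.  The joint is
# built by factorizing over the graph, and we check the d-separation statement
# C is independent of A given B by computing P(C | A, B) and seeing it ignores A.
p_a = {0: 0.6, 1: 0.4}                      # P(A)
p_b_given_a = {0: {0: 0.7, 1: 0.3},         # P(B | A): p_b_given_a[a][b]
               1: {0: 0.2, 1: 0.8}}
p_c_given_b = {0: {0: 0.9, 1: 0.1},         # P(C | B): p_c_given_b[b][c]
               1: {0: 0.4, 1: 0.6}}

def joint(a, b, c):
    """Factorized joint P(A, B, C) = P(A) * P(B | A) * P(C | B)."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]

for b in (0, 1):
    for c in (0, 1):
        for a in (0, 1):
            p_ab = sum(joint(a, b, cc) for cc in (0, 1))   # P(A=a, B=b)
            print(f"P(C={c} | A={a}, B={b}) = {joint(a, b, c) / p_ab:.3f}")
```

For each (b, c) pair the printed conditional is the same for a = 0 and a = 1, which is exactly the independence the graph promises.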

Structured CPDs

Conditional probability distributions (CPDs) are essential for implementing a Bayesian network, and we can define structured forms for them.
- Tree-structured CPD
The parent variables are composed together in a tree structure, so only part of the parent configuration matters along each branch.
- Multiplexer CPD
One categorical variable decides which parent is used to affect the child.
- Noisy-OR model
I forget the details, but the idea is that each active parent independently has some probability of turning the child on (see the sketch after this list).
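
A minimal sketch of the standard noisy-OR CPD (the parameter names here are my own): each active parent independently "fires" with its own probability, and an optional leak term lets the child turn on even when all parents are off.

```python
def noisy_or(parent_values, lambdas, leak=0.0):
    """P(child = 1 | parents) under a noisy-OR model.

    parent_values : list of 0/1 parent assignments
    lambdas       : per-parent activation probabilities
    leak          : probability the child turns on with no active parent
    """
    p_off = 1.0 - leak
    for x, lam in zip(parent_values, lambdas):
        if x == 1:
            p_off *= (1.0 - lam)       # this parent fails to activate the child
    return 1.0 - p_off

# Example: two parents, only the second one is active.
print(noisy_or([0, 1], lambdas=[0.8, 0.6], leak=0.05))  # 1 - 0.95 * 0.4 = 0.62
```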

Markov Random Fields

  1. One important difference between a BN and an MRF is that the factorization over an MRF is not unique: we can write different distributions P that factorize over the same graph H (see the sketch after this list for what such a factorization looks like).
  2. If a distribution P factorizes over H, then the independence set I(H) is a subset of the independencies implied by P.
  3. Only for positive distributions: if P satisfies the independence set I(H), then P factorizes over H (the Hammersley–Clifford theorem).
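
A minimal sketch of what "factorizes over H" means for an MRF, with made-up pairwise factors over a triangle graph: the unnormalized product of factors is divided by a partition function Z, computed here by brute force.

```python
import itertools

# Made-up pairwise factors over the triangle H = {(A,B), (B,C), (A,C)}.
phi_ab = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}
phi_bc = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}
phi_ac = {(0, 0): 1.0, (0, 1): 4.0, (1, 0): 4.0, (1, 1): 1.0}

def unnormalized(a, b, c):
    """Product of all factors for one assignment."""
    return phi_ab[(a, b)] * phi_bc[(b, c)] * phi_ac[(a, c)]

# Partition function Z: brute-force sum over all assignments.
Z = sum(unnormalized(a, b, c)
        for a, b, c in itertools.product((0, 1), repeat=3))

def prob(a, b, c):
    """P(a, b, c) = (1/Z) * product of factors."""
    return unnormalized(a, b, c) / Z

print(Z, prob(0, 0, 0))
```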

From factor to log-linear model

A log-linear model is used to encode local structure into the global distribution. A factor assigns a value to each full assignment of the variables in its scope, whereas a log-linear model assigns different parameters (feature weights) to different configurations of the variables. From my perspective, this spares us from having to define a probability distribution for each factor and decide how to parameterize it, which would be very difficult.
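
A minimal sketch of the log-linear form P(x) ∝ exp(Σ_k θ_k f_k(x)), with made-up indicator features and weights: each local feature only contributes a weight, and the global normalizer Z ties everything together.

```python
import math
import itertools

# Made-up features over three binary variables x = (x0, x1, x2) and weights.
features = [
    lambda x: 1.0 if x[0] == x[1] else 0.0,   # f1: x0 agrees with x1
    lambda x: 1.0 if x[1] == x[2] else 0.0,   # f2: x1 agrees with x2
    lambda x: float(x[0]),                    # f3: bias on x0
]
theta = [1.2, 0.8, -0.5]                      # one weight per feature

def score(x):
    """Log of the unnormalized probability: sum_k theta_k * f_k(x)."""
    return sum(t * f(x) for t, f in zip(theta, features))

assignments = list(itertools.product((0, 1), repeat=3))
Z = sum(math.exp(score(x)) for x in assignments)   # global normalizer

def prob(x):
    return math.exp(score(x)) / Z

print(prob((1, 1, 1)))
```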

Parameter sharing

Parameter sharing is very common in graphical models. In a BN, if we assume time invariance and the Markov property, then we can replicate parameters across different time steps. In an MRF, we can ignore position and share parameters between different subsets of variables.
In my personal opinion, parameter sharing exists because of the data sparsity problem: we do not have enough data to learn a complex model.
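
A minimal sketch of temporal parameter sharing (made-up numbers): a single transition CPD is reused at every time step instead of learning a separate table per step.

```python
import numpy as np

# Time-invariant Markov chain: every step reuses the same transition CPD.
initial = np.array([0.5, 0.5])                 # P(X_1)
transition = np.array([[0.9, 0.1],             # P(X_{t+1} | X_t), shared by all t
                       [0.3, 0.7]])

def marginal_at(t):
    """Marginal P(X_t), obtained by applying the one shared CPD t-1 times."""
    p = initial
    for _ in range(t - 1):
        p = p @ transition
    return p

print(marginal_at(5))
```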

Notes

  1. Not every probability distribution can be perfectly represented as a graph (either BN or MRF).
  2. BNs and MRFs each have their own power in expressing probability distributions; they cannot replace each other.
  3. How to design a graphical model?

Inference in Graphical Model

“Inference” means answering queries after we have obtained the parameters of the model. There are three different kinds of inference problems in general:
- Marginal Probability
In this problem, we need to estimate the probability p(x1, ..., xk),
which means summing out the other variables (a brute-force sketch follows this list).
- Conditional Probability
- Maximum a posteriori (MAP) estimation
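
A brute-force sketch of the first two query types over a small made-up joint: a marginal sums out the other variables, and a conditional restricts to the evidence and renormalizes.

```python
import itertools

# Made-up joint over three binary variables, stored as a table and normalized.
joint = {x: 1.0 / 8 for x in itertools.product((0, 1), repeat=3)}
joint[(1, 1, 1)] = 0.3                       # make it non-uniform
total = sum(joint.values())
joint = {x: p / total for x, p in joint.items()}

def marginal_x1(v):
    """P(X1 = v): sum out X2 and X3."""
    return sum(p for x, p in joint.items() if x[0] == v)

def conditional_x1_given_x3(v, e):
    """P(X1 = v | X3 = e): restrict to the evidence, then renormalize."""
    num = sum(p for x, p in joint.items() if x[0] == v and x[2] == e)
    den = sum(p for x, p in joint.items() if x[2] == e)
    return num / den

print(marginal_x1(1), conditional_x1_given_x3(1, 1))
```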

MAP Inference

When performed on a clique tree, the algorithm changes from the sum-product algorithm to the max-product algorithm. The reason is simple: sum-product is only used to compute marginal distributions, while the MAP problem asks for the single most probable assignment.
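
A minimal sketch of max-product on a chain with made-up potentials (essentially the Viterbi algorithm): the sums of sum-product are replaced by maxima, and back-pointers recover the most probable assignment.

```python
import numpy as np

# Made-up node and edge potentials for a 3-node chain with 2 states each.
node_pot = np.array([[0.6, 0.4],     # psi_t(x_t) for t = 0, 1, 2
                     [0.3, 0.7],
                     [0.5, 0.5]])
edge_pot = np.array([[0.9, 0.1],     # psi(x_t, x_{t+1}), shared across edges
                     [0.2, 0.8]])

T, K = node_pot.shape
msg = np.zeros((T, K))               # max-product messages
back = np.zeros((T, K), dtype=int)   # back-pointers for decoding
msg[0] = node_pot[0]
for t in range(1, T):
    scores = msg[t - 1][:, None] * edge_pot * node_pot[t][None, :]
    back[t] = scores.argmax(axis=0)  # best previous state for each current state
    msg[t] = scores.max(axis=0)      # max instead of sum

# Backtrack the MAP assignment from the last node.
x = [int(msg[T - 1].argmax())]
for t in range(T - 1, 0, -1):
    x.append(int(back[t][x[-1]]))
print(list(reversed(x)))
```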

Another perspective, from optimization, is to convert the inference task into an optimization problem and use dual decomposition to solve it.

Learning in PGM

Learning the parameters of a PGM is important because we want to use the model in a real environment. Learning in PGMs has two different aspects: structure and parameters.
Every graphical model has a structure; in most circumstances we know the structure in advance, but in rarer situations we need to learn the structure from the data, which is called knowledge discovery.
Parameters are essential for both BNs and MRFs and should be learned from the data.
So we face different situations when we perform learning:
1. Known structure, complete data
2. Known structure, incomplete data
3. Unknown structure, complete data
4. Unknown structure, incomplete data
5. Unknown structure with hidden variables, incomplete data

For the training objective, we have several different choices, as follows:
1. Log-likelihood of the training data (corresponding to generative learning)
2. Conditional log-likelihood of the training data (discriminative learning)
Of course, regularization and cross-validation are necessary for every machine learning task.

MLE Estimate of a BN

Easy: just count the occurrences.
The problem lies in data sparsity: with limited data and a complex model we get worse results. But we can simplify the model, even if the simpler model is wrong, and obtain better generalization ability with limited data.
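
A minimal sketch of counting-based MLE for a single CPD P(child | parent), using made-up binary data.

```python
from collections import Counter

# Made-up observations of (parent, child) pairs.
data = [(0, 0), (0, 1), (0, 0), (1, 1), (1, 1), (1, 0), (0, 0), (1, 1)]

pair_counts = Counter(data)                            # counts of (parent, child)
parent_counts = Counter(parent for parent, _ in data)  # counts of parent alone

def mle_cpd(child, parent):
    """MLE estimate of P(child | parent): co-occurrence count over parent count."""
    return pair_counts[(parent, child)] / parent_counts[parent]

print(mle_cpd(child=1, parent=0), mle_cpd(child=1, parent=1))  # 0.25, 0.75
```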

The EM algorithm computes the expected sufficient statistics in the E step and uses these expected sufficient statistics in the maximization (M) step.
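
A minimal sketch of EM on the classic two-coin mixture (made-up data and initialization, uniform prior over which coin was used): the E step computes expected sufficient statistics and the M step re-estimates the parameters from them.

```python
import numpy as np

flips = np.array([9, 8, 2, 7, 3])      # heads out of n = 10 tosses per session
n = 10
theta = np.array([0.6, 0.5])           # initial head probabilities of the two coins

for _ in range(20):
    # E step: responsibility of each coin for each session (uniform prior on coin).
    lik = np.array([t ** flips * (1 - t) ** (n - flips) for t in theta])
    resp = lik / lik.sum(axis=0)
    # Expected sufficient statistics: expected heads and expected tosses per coin.
    exp_heads = resp @ flips
    exp_tosses = resp.sum(axis=1) * n
    # M step: re-estimate each coin's bias from the expected counts.
    theta = exp_heads / exp_tosses

print(theta)
```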

Written with StackEdit.
