Tuesday, November 4, 2014

Graphical Model Series

Graphical Model

Graphical models come from the combination of graph theory and probability theory. In some people's opinion, this is one of the two major approaches to Machine Learning, the other being Statistical Learning. This probabilistic approach to machine learning is quite different from ideas like SVM, although I can't articulate the difference between PGM and Statistical Learning in more detail.

Key points of PGM

PGM is used to define high-dimensional probability distributions that have some internal structure, rather than being unstructured.
One of the most important kinds of structure in a probability distribution is independence or conditional independence. The independence structure can be calculated from the probability density function, but a simpler way is to read it from the corresponding graph structure. So now we have three different concepts: a. graph structure, b. probability distribution, c. set of independences.
Roughly speaking, the graph structure links the probability distribution and the set of independences. Different types of graphical models have different correspondences.
There are two different types of graphs: directed graphs and undirected graphs. The inference and learning algorithms are significantly different for the two types. There are also two other representations: the Temporal Model and the Plate Model. A temporal model is used to represent the evolution of a set of random variables over time, so the most important work is to describe the dependence between subsequent time steps; two further assumptions are usually made: time invariance and the Markov assumption. A plate model is used to represent sets of variables that follow the same distribution. One famous example is LDA.
For each type of graphical model, inference and learning are very important. Inference means you already have the parameters of your model; learning means you need to learn the parameters or the structure of your model. Inference includes computing marginal distributions, conditional distributions, and the MAP assignment. For learning, we have different algorithms for different graphs. One important thing to know: inference algorithms will be involved when performing learning.
Other topics related to graphical models are the Ising model, the Boltzmann machine, and the restricted Boltzmann machine; these have become quite popular recently.

Other topics

One important concept when reading materials about PGM is the factor. A factor is defined on a set of random variables and gives a value for each possible configuration; one familiar expression would be $\exp(w_1 x_1 + w_2 x_2 + \cdots + w_n x_n)$. This definition looks very much like a probability density, except that a factor is un-normalized, which means it does not sum to one. A conditional distribution $P(A|B)$ can be written as $\frac{f(A,B)}{\sum_{A} f(A,B)}$. Using this trick, we can unify directed and undirected models. Like probability densities, we can perform several operations on factors: product and marginalization.
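To make the factor operations concrete, here is a minimal sketch in Python; the table-based representation and the function names are my own choices for illustration, not an established library API:

```python
import math
from itertools import product

# A factor is represented as (variables, table): `variables` is a tuple of
# variable names, `table` maps each assignment (a tuple of values) to a
# non-negative number. Values are assumed binary {0, 1} for simplicity.

def factor_product(f1, f2):
    """Multiply two factors; the result is defined on the union of their scopes."""
    vars1, t1 = f1
    vars2, t2 = f2
    out_vars = tuple(dict.fromkeys(vars1 + vars2))  # union, order preserved
    out_table = {}
    for assignment in product((0, 1), repeat=len(out_vars)):
        a = dict(zip(out_vars, assignment))
        out_table[assignment] = (t1[tuple(a[v] for v in vars1)] *
                                 t2[tuple(a[v] for v in vars2)])
    return out_vars, out_table

def factor_marginalize(f, var):
    """Sum out one variable from a factor."""
    vars_, table = f
    idx = vars_.index(var)
    out_vars = vars_[:idx] + vars_[idx + 1:]
    out_table = {}
    for assignment, value in table.items():
        key = assignment[:idx] + assignment[idx + 1:]
        out_table[key] = out_table.get(key, 0.0) + value
    return out_vars, out_table

# Example: un-normalized factors of the form exp(w1*x1 + w2*x2).
phi = (("X1", "X2"),
       {(x1, x2): math.exp(0.5 * x1 - 1.0 * x2)
        for x1 in (0, 1) for x2 in (0, 1)})
psi = (("X2", "X3"),
       {(x2, x3): math.exp(0.3 * x2 * x3)
        for x2 in (0, 1) for x3 in (0, 1)})

joint = factor_product(phi, psi)          # factor over ("X1", "X2", "X3")
print(factor_marginalize(joint, "X2"))    # sum out X2
```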

Directed Model

A directed graphical model is also called a Bayesian network (BN). In this graphical model, nodes are connected with arrows; the variable at the head depends on the variable at the tail.
So in this model, the parameters are a collection of conditional probability tables. The learning and inference processes are also based on these conditional probability distributions.
As we said previously, a graphical model encodes a probability distribution and a set of independences. For a Bayesian network, we can write down a probability distribution according to the connections between nodes. When we say a probability distribution $P(X_1, X_2, \ldots, X_n)$ factorizes over a graph $G$, we mean the probability distribution can be decomposed into a product of conditional probability distributions, and these conditional distributions are exactly the ones we read from the graph. Note that, given the conditional distributions, this factorization determines a unique joint distribution, so each BN represents a specific probability distribution.
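Concretely (using the standard notation where $\mathrm{Pa}_G(X_i)$ denotes the parents of $X_i$ in $G$), factorizing over $G$ means:

$$P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Pa}_G(X_i))$$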

Derivation of D-separation

Each probability distribution implies a set of conditional independences $X \perp Y \mid Z$ (or marginal independences), which is hard to calculate from the density function. But we can read them from the graph structure. There are several different ways a pair of variables can be connected through a third:
1. $X \rightarrow Y$
2. $X \rightarrow Z \rightarrow Y$
3. $X \leftarrow Z \rightarrow Y$
4. $X \rightarrow Z \leftarrow Y$
We need to discuss the relation between X and Y in each situation.
For situation 1, X and Y are directly connected, so the graph implies no independence between them.
For situation 2, we need to calculate $P(X,Y)$ in both cases: when Z is given and when it is not.
If Z is not given, $P(X,Y) = \sum_Z P(X,Y,Z) = \sum_Z P(X)P(Z|X)P(Y|Z) = P(X)\sum_Z P(Z|X)P(Y|Z)$, using the factorization of $P(X,Y,Z)$ given by the Bayesian network. From this result we can see that X and Y are not independent in general.
If Z is given, $P(X,Y|Z) = \frac{P(X,Y,Z)}{P(Z)} = \frac{P(X)P(Z|X)P(Y|Z)}{P(Z)} = \frac{P(X,Z)P(Y|Z)}{P(Z)} = P(X|Z)P(Y|Z)$. So in this situation, X is independent of Y given Z.
For situation 3, if Z is not given, $P(X,Y) = \sum_Z P(X,Y,Z) = \sum_Z P(Z)P(X|Z)P(Y|Z)$; from this factorization we see that X and Y are not independent in general. If Z is given, then $P(X,Y|Z) = P(X|Z)P(Y|Z)$, so X and Y are independent given Z.
For situation 4, if Z is not given, $P(X,Y) = \sum_Z P(X,Y,Z) = \sum_Z P(X)P(Y)P(Z|X,Y) = P(X)P(Y)$. So X and Y are independent when Z is not observed. When Z is observed, $P(X,Y|Z) = \frac{P(X)P(Y)P(Z|X,Y)}{P(Z)}$, so X and Y are not independent in general. There is one more subtle case: even when Z itself is not observed, if one of its descendants is observed, X and Y will not be independent either. This special structure is called a v-structure.
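This "explaining away" behaviour of the v-structure can be verified numerically. The sketch below uses binary variables with arbitrary, hand-picked CPTs (the numbers are purely illustrative): X and Y come out independent marginally, but dependent once Z is observed.

```python
from itertools import product

# A tiny v-structure X -> Z <- Y with binary variables and arbitrary CPTs.
p_x = {0: 0.7, 1: 0.3}
p_y = {0: 0.6, 1: 0.4}
p_z_given_xy = {  # P(Z=1 | X, Y); P(Z=0 | X, Y) is the complement
    (0, 0): 0.1, (0, 1): 0.8, (1, 0): 0.7, (1, 1): 0.99,
}

def joint(x, y, z):
    pz1 = p_z_given_xy[(x, y)]
    return p_x[x] * p_y[y] * (pz1 if z == 1 else 1.0 - pz1)

# Marginally, P(X, Y) equals P(X) P(Y): X and Y are independent.
for x, y in product((0, 1), repeat=2):
    p_xy = sum(joint(x, y, z) for z in (0, 1))
    print(x, y, round(p_xy, 4), round(p_x[x] * p_y[y], 4))

# Conditioned on Z = 1, P(X, Y | Z) no longer equals P(X | Z) P(Y | Z).
p_z1 = sum(joint(x, y, 1) for x, y in product((0, 1), repeat=2))
p_x1_given_z1 = sum(joint(1, y, 1) for y in (0, 1)) / p_z1
p_y1_given_z1 = sum(joint(x, 1, 1) for x in (0, 1)) / p_z1
p_x1y1_given_z1 = joint(1, 1, 1) / p_z1
print(round(p_x1y1_given_z1, 4), "vs", round(p_x1_given_z1 * p_y1_given_z1, 4))
```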
Consider a sequence of random variables $X_1 - X_2 - \cdots - X_n$ in a Bayesian network, where in this representation we ignore the direction of the arrows; this is a trail from $X_1$ to $X_n$. When nothing is observed, we call it an active trail if there is no v-structure in the path.
We say the trail from $X_1$ to $X_n$ is active given Z when:
a. for any v-structure $X_i \rightarrow X_{i+1} \leftarrow X_{i+2}$ on the trail, either $X_{i+1}$ or one of its descendants is in Z;
b. no other $X_i$ on the trail is in Z.
This means that, along an active trail, only the middle node of a v-structure (or one of its descendants) may be observed.
For a Bayesian network, X and Y are d-separated given Z if there is no active trail between X and Y given Z.
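Here is a brute-force sketch of this definition for small graphs (my own illustrative code, not the efficient Bayes-ball procedure): enumerate every undirected trail between the two variables and test whether any of them is active given Z.

```python
def descendants(edges, node):
    """All descendants of `node` in a DAG given as (parent, child) pairs."""
    children = {}
    for p, c in edges:
        children.setdefault(p, []).append(c)
    found, stack = set(), [node]
    while stack:
        for c in children.get(stack.pop(), []):
            if c not in found:
                found.add(c)
                stack.append(c)
    return found

def trails(edges, src, dst):
    """All simple undirected paths between src and dst."""
    nbrs = {}
    for p, c in edges:
        nbrs.setdefault(p, set()).add(c)
        nbrs.setdefault(c, set()).add(p)
    paths = []
    def dfs(node, path):
        if node == dst:
            paths.append(path)
            return
        for n in nbrs.get(node, ()):
            if n not in path:
                dfs(n, path + [n])
    dfs(src, [src])
    return paths

def is_active(edges, path, observed):
    """A trail is active given `observed` iff no triple on it is blocked."""
    eset = set(edges)
    for a, b, c in zip(path, path[1:], path[2:]):
        if (a, b) in eset and (c, b) in eset:          # v-structure a -> b <- c
            if b not in observed and not (descendants(edges, b) & observed):
                return False
        elif b in observed:                             # chain or fork blocked by b
            return False
    return True

def d_separated(edges, x, y, observed):
    return not any(is_active(edges, p, observed) for p in trails(edges, x, y))

# Example: X -> Z <- Y, Z -> W.  X and Y are d-separated given {}, but not given {W}.
g = [("X", "Z"), ("Y", "Z"), ("Z", "W")]
print(d_separated(g, "X", "Y", set()))   # True
print(d_separated(g, "X", "Y", {"W"}))   # False
```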

Factorization & Independence Map

From the definition of d-separation, we can find lots of conditional independences in the graph. We can also factorize a distribution over the BN. The relation is as follows:
1. If the probability distribution P factorizes over the graph G, then all the conditional independences read from G are also satisfied by P. We call G an I-map of P.
2. If P satisfies all the conditional independences read from G (G is an I-map of P), then P factorizes over G.
The two theorems together give the following statement: a Bayesian network encodes both the factorization of the probability distribution and the set of independences. By "uniquely" we mean that one Bayesian network corresponds to one factorized form of the distribution. (But this is not true for Markov random fields.)

Another representation of Bayesian network

A Bayesian network can be represented in two other forms: the Plate Model and the Temporal Model.
A temporal model is used to represent the evolution of a set of variables over time. A typical temporal model is the HMM. In fact, only two kinds of variables are involved: a. state variables, b. observation variables. In a temporal model, we need a transition probability distribution and an initial probability distribution; then we can perform modeling.
A plate model is used to describe sets of random variables that share the same ancestor. For example, in LDA, each document has the same form of distribution and depends on the same prior.
The reason for developing temporal and plate models is simple: parameter sharing. In some applications we have lots of random variables. If we assign different parameters to variables of the same type, we run into serious problems: data sparsity and overfitting. To avoid this, we need to perform parameter sharing.
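Parameter sharing is exactly what keeps a temporal model compact: the same transition and emission tables are reused at every time step. A minimal forward-algorithm sketch for an HMM (the numbers below are made up purely for illustration):

```python
import numpy as np

# Two hidden states, three observation symbols; the same transition and
# emission matrices are shared across all time steps (time invariance).
init = np.array([0.6, 0.4])                       # P(S_1)
trans = np.array([[0.7, 0.3],                     # P(S_t | S_{t-1})
                  [0.2, 0.8]])
emit = np.array([[0.5, 0.4, 0.1],                 # P(O_t | S_t)
                 [0.1, 0.3, 0.6]])

def forward(observations):
    """Return P(O_1, ..., O_T) computed by the forward algorithm."""
    alpha = init * emit[:, observations[0]]
    for obs in observations[1:]:
        alpha = (alpha @ trans) * emit[:, obs]    # same trans/emit at every step
    return alpha.sum()

print(forward([0, 2, 1, 1]))
```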

Undirected Model

An undirected model is a graph structure without directions on the edges. From my perspective, an undirected model removes the conditional distributions, i.e. the locally normalized factors, from the factorization and adds a global normalization instead. This adds exponential difficulty to inference and learning, but it also gives more power in modeling the data.
Different from a directed network, factors become much more important in building the connection between the graph and the probability distribution. When factorizing over a Bayesian network, we use conditional probability distributions to compose the joint distribution over $X_1, X_2, \ldots, X_n$. When we factorize over an MRF, we need the help of factors.

Factorization over MRF

In order to factorize over an MRF, we need to define a set of factors on it. But is there a criterion for grouping random variables into factors? Unfortunately, from my personal point of view, there is no clear method for this. When we start from a set of factors, there is no requirement that the variables in the same factor be fully connected in advance.

General Gibbs Distribution

Given a set of factors $\Phi = \{\phi_1(D_1), \ldots, \phi_k(D_k)\}$ whose scopes together cover $X_1, \ldots, X_n$, define the Gibbs distribution as follows:

$$\tilde{P}(X_1, \ldots, X_n) = \prod_{i=1}^{k} \phi_i(D_i)$$

$$Z_\Phi = \sum_{X_1, \ldots, X_n} \tilde{P}(X_1, \ldots, X_n)$$

$$P(X_1, \ldots, X_n) = \frac{1}{Z_\Phi} \tilde{P}(X_1, \ldots, X_n)$$

Here we see the trouble with MRFs: we need a normalizing constant $Z_\Phi$ that sums over all possible values of the random variables in the scope. This operation is a huge burden for a complex MRF.
Another concept is the induced Markov network: in this network, random variables belonging to the same scope must be fully connected. For example, if a factor has scope $\{X_1, X_2, X_3\}$, then we must have 3 edges among them.
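As a tiny concrete illustration (the factor values below are arbitrary, chosen only to show the normalization step and the induced edges), consider a Gibbs distribution over three binary variables with two factors:

```python
from itertools import product

# Factors phi_1(X1, X2) and phi_2(X2, X3) over binary variables.
phi1 = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 5.0}
phi2 = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 4.0}

def unnormalized(x1, x2, x3):
    return phi1[(x1, x2)] * phi2[(x2, x3)]

# Partition function: sum over every joint assignment -- exactly the step
# that becomes intractable for large Markov networks.
Z = sum(unnormalized(*xs) for xs in product((0, 1), repeat=3))

def gibbs(x1, x2, x3):
    return unnormalized(x1, x2, x3) / Z

print(Z, gibbs(1, 1, 1))

# Induced Markov network: each factor scope becomes a fully connected set,
# giving edges (X1, X2) and (X2, X3) here -- but no edge between X1 and X3.
```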

Factorization & Independence

When we say a probability distribution P factorizes over a graph G, we mean the density function is the same as a Gibbs distribution over G. One thing needs more attention: for a specific graph G, we can define different kinds of factorizations over its scopes. The factorization is not unique, whereas for a BN it is.
For independence between X and Y given Z, the criterion is much simpler in an MN: the only requirement is that every path between X and Y passes through some variable in Z.
Define $I(P) = \{(X \perp Y \mid Z) : X \perp Y \mid Z \text{ holds in } P\}$ and $I(G) = \{(X \perp Y \mid Z) : X \text{ is separated from } Y \text{ given } Z \text{ in } G\}$.
1. If P factorizes over G, then $I(G) \subseteq I(P)$.
2. If P is a positive distribution and $I(G) \subseteq I(P)$, then P factorizes over G.
Unlike a BN, we need an extra constraint (positivity) to establish the relation between P and G from the independence set.

Log-linear model

One widely used factorization method for Markov networks is the log-linear model. In an MN we define a factor over a set of random variables, but in a log-linear model we define feature functions over the variables of the same factor, and for each feature (or each assignment of its variables) we attach a specific weight $w_i$. I think this is very convenient for parameter learning.
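Written out explicitly (in the standard log-linear notation, with feature functions $f_j$ defined on scopes $D_j$ and weights $w_j$), the model is:

$$P(X_1, \ldots, X_n) = \frac{1}{Z} \exp\Big(\sum_{j} w_j f_j(D_j)\Big)$$

so each factor takes the form $\phi_j(D_j) = \exp(w_j f_j(D_j))$, and learning reduces to estimating the weights $w_j$.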

Conditional Random Field

This is another variation of the MN. More precisely, it's a discriminative model for sequence labeling tasks, widely used in natural language processing and image processing. The model directly models the distribution $P(Y|X)$, where Y is the label sequence and X is the input sequence. The rest is the same as an MN, but it's a conditional model: when performing inference, we need to perform the normalization separately for every input instance.
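For the common linear-chain case, the conditional model can be written (with feature functions $f_k$, weights $w_k$, and an input-dependent normalizer $Z(X)$) as:

$$P(Y \mid X) = \frac{1}{Z(X)} \exp\Big(\sum_{t}\sum_{k} w_k f_k(y_{t-1}, y_t, X, t)\Big), \qquad Z(X) = \sum_{Y'} \exp\Big(\sum_{t}\sum_{k} w_k f_k(y'_{t-1}, y'_t, X, t)\Big)$$

which makes explicit that the normalization is computed separately for each input sequence $X$.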

Written with StackEdit.
