Graphical Model
Graphical models come from the combination of graph theory and probability theory. In some people's opinion, this is one of the two major approaches to machine learning, the other being statistical learning. This probabilistic approach to machine learning is quite different from ideas like the SVM, though I cannot articulate every difference between PGM and statistical learning.
Key points of PGM
A PGM is used to define a high-dimensional probability distribution that has internal structure, rather than being unstructured.
One of the most important structures in a probability distribution is independence, or conditional independence. The independence structure can be computed from the probability density function, but a simpler way is to read it from the corresponding graph structure. So we now have three different concepts: a. graph structure b. probability distribution c. set of independences.
Roughly speaking, the graph structure links the probability distribution and the set of independences. Different types of graphical models have different correspondences.
There are two different types of graphs: directed and undirected. The inference and learning algorithms differ significantly between the two types. There are also two other representations: the temporal model and the plate model. A temporal model represents the evolution of a set of random variables over time, so the most important work is describing the dependence between subsequent time steps; two further assumptions are usually made: time invariance and the Markov property. A plate model represents a set of variables that follow the same distribution. One famous example is LDA.
For each type of graphical model, inference and learning are very important. Inference means you already have the parameters of your model; learning means you need to learn the parameters or the structure of your model. Inference includes computing marginal distributions, conditional distributions, and MAP assignments. For learning, we have different algorithms for different graphs. One important thing to know: inference algorithms are involved when performing learning.
Other topics related to graphical models are the Ising model, the Boltzmann machine, and the restricted Boltzmann machine; these have been quite popular recently.
Other stuff
One important concept when reading materials about PGM is the factor. A factor is defined on a set of random variables and gives a value for each possible configuration of those variables; the most familiar example is a (conditional) probability table.
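As a minimal sketch of the idea (the helper name `make_factor` and the numbers are my own, for illustration only): a factor is just a table from configurations of its scope to non-negative values, which need not sum to one.

```python
from itertools import product

def make_factor(scope_sizes, values):
    """A factor: maps each joint configuration of its scope to a non-negative number."""
    configs = list(product(*(range(s) for s in scope_sizes)))
    assert len(configs) == len(values)
    return dict(zip(configs, values))

# Scope of two binary variables (A, B); the values are arbitrary affinities,
# not probabilities -- a factor is more general than a distribution.
phi = make_factor([2, 2], [30.0, 5.0, 1.0, 10.0])
print(phi[(0, 0)])  # affinity of the configuration A=0, B=0
```

A conditional probability table is the special case where, for each fixed configuration of the conditioning variables, the values sum to one.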
Directed Model
A directed graphical model is also called a Bayesian network (BN). In this graphical model, nodes are connected by arrows; the variable at the head of an arrow depends on the variable at the tail.
So in this model, the parameters are a collection of conditional probability distribution tables. The learning and inference processes are also based on these conditional probability distributions.
As we said previously, a graphical model encodes a probability distribution and a set of independences. For a Bayesian network, we can write the probability distribution according to the connections between the nodes. When we say a probability distribution $P$ factorizes over a Bayesian network $G$, we mean $P(X_1, \dots, X_n) = \prod_i P(X_i \mid \mathrm{Pa}_G(X_i))$, where $\mathrm{Pa}_G(X_i)$ denotes the parents of $X_i$ in $G$.
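The factorization can be sketched for the smallest possible network, a two-node chain $A \rightarrow B$ (all the CPD numbers below are assumed, purely for illustration):

```python
# Toy Bayesian network A -> B: the joint is the product of the CPDs P(A) and P(B | A).
p_a = {0: 0.6, 1: 0.4}                       # P(A)
p_b_given_a = {0: {0: 0.9, 1: 0.1},          # P(B | A=0)
               1: {0: 0.3, 1: 0.7}}          # P(B | A=1)

def joint(a, b):
    """P(A=a, B=b) = P(A=a) * P(B=b | A=a), i.e. the BN factorization."""
    return p_a[a] * p_b_given_a[a][b]

total = sum(joint(a, b) for a in (0, 1) for b in (0, 1))
print(total)  # the product of locally normalized CPDs is itself normalized
```

Note that no global normalization is needed: because every CPD is locally normalized, the product automatically sums to one.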
Derivation of D-separation
Each probability distribution implies a set of conditional independences. For a trail $X - Z - Y$ of three variables, there are four possible structures:
1. $X \rightarrow Z \rightarrow Y$ (causal chain)
2. $X \leftarrow Z \leftarrow Y$ (evidential chain)
3. $X \leftarrow Z \rightarrow Y$ (common cause)
4. $X \rightarrow Z \leftarrow Y$ (v-structure)
We need to talk about the relation between $X$ and $Y$ in each case, depending on whether $Z$ is observed.
For situation 1, we know $X$ and $Y$ are dependent in general, but observing $Z$ blocks the trail, so $X \perp Y \mid Z$.
For situation 2, we need to calculate the joint $P(X, Y)$ by summing out $Z$; by symmetry the conclusion is the same as in situation 1.
If $Z$ is unobserved, the trail is active; if $Z$ is observed, the trail is blocked.
For situation 3, if $Z$ is observed then $X \perp Y \mid Z$: once the common cause is known, its two effects become independent.
For situation 4 (the v-structure), the behavior is reversed: if neither $Z$ nor any of its descendants is observed then $X \perp Y$, but observing $Z$ (or any descendant of $Z$) makes $X$ and $Y$ dependent.
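The surprising v-structure case can be checked numerically. Below is a sketch of a toy network $X \rightarrow Z \leftarrow Y$ with assumed CPDs (an OR-like gate): $X$ and $Y$ are independent marginally, but become dependent once $Z$ is observed.

```python
from itertools import product

# Toy v-structure X -> Z <- Y; all variables binary, all numbers assumed.
px = [0.5, 0.5]
py = [0.5, 0.5]
# P(Z=1 | X, Y): Z tends to fire when X or Y is on.
pz1 = {(0, 0): 0.1, (0, 1): 0.8, (1, 0): 0.8, (1, 1): 0.95}

def p(x, y, z):
    pz = pz1[(x, y)] if z == 1 else 1.0 - pz1[(x, y)]
    return px[x] * py[y] * pz

# Marginally, X and Y are independent: P(x, y) == P(x) * P(y).
pxy = {(x, y): sum(p(x, y, z) for z in (0, 1)) for x, y in product((0, 1), repeat=2)}
print(all(abs(pxy[(x, y)] - px[x] * py[y]) < 1e-12 for x, y in pxy))  # True

# Conditioning on Z = 1 couples them: P(x, y | z) != P(x | z) * P(y | z).
z_total = sum(p(x, y, 1) for x, y in product((0, 1), repeat=2))
pxy_g = {(x, y): p(x, y, 1) / z_total for x, y in product((0, 1), repeat=2)}
px_g = {x: pxy_g[(x, 0)] + pxy_g[(x, 1)] for x in (0, 1)}
py_g = {y: pxy_g[(0, y)] + pxy_g[(1, y)] for y in (0, 1)}
print(abs(pxy_g[(1, 1)] - px_g[1] * py_g[1]) > 1e-6)  # True: dependence appears
```

This is the "explaining away" effect: once the common effect is observed, learning about one cause changes our belief about the other.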
For a trail $X_1 - \dots - X_n$ of random variables in a Bayesian network,
we say the trail from $X_1$ to $X_n$ is active given a set of observed variables $Z$ if:
a. for any v-structure $X_{i-1} \rightarrow X_i \leftarrow X_{i+1}$ on the trail, $X_i$ or one of its descendants is in $Z$;
b. no other node on the trail is in $Z$.
This means that, along an active trail, only the middle node of a v-structure may be observed.
For a Bayesian network, we say $X$ and $Y$ are d-separated given $Z$ if there is no active trail between them given $Z$; d-separation implies the conditional independence $X \perp Y \mid Z$.
Factorization & Independence Map
From the definition of d-separation, we can find many conditional independences in the graph. We can also factorize over the BN to form a distribution. The relation is as follows:
1. If the probability distribution $P$ factorizes over the graph $G$, then every d-separation in $G$ corresponds to a conditional independence that holds in $P$ (i.e., $G$ is an I-map of $P$).
2. If $G$ is an I-map of $P$, then $P$ factorizes over $G$.
The two theorems give the following statement: a Bayesian network uniquely encodes the correspondence between the probability distribution and the set of independences. By uniquely, we mean one Bayesian network corresponds to one factorized form of the distribution. (But this is not true for Markov random fields.)
Other representations of Bayesian networks
A Bayesian network can be represented in two other forms: the plate model and the temporal model.
A temporal model represents the evolution of a set of variables over time. A typical temporal model is the HMM; in fact, only two kinds of variables are involved: a. state variables b. observation variables. In a temporal model, we need a transition probability distribution and an initial probability distribution (plus, for an HMM, an observation probability distribution); then we can perform modeling.
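As a concrete illustration, here is a minimal sketch of the forward recursion for a toy 2-state HMM; all the numbers (initial, transition, and observation probabilities) are assumed purely for illustration:

```python
import numpy as np

pi = np.array([0.6, 0.4])            # initial state distribution
A = np.array([[0.7, 0.3],            # A[i, j] = P(state j at t+1 | state i at t)
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],            # B[i, o] = P(observation o | state i)
              [0.2, 0.8]])

def likelihood(obs):
    """P(observation sequence) via the forward recursion:
    alpha_t = (alpha_{t-1} @ A) * B[:, o_t], summed over states at the end."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return float(alpha.sum())

print(likelihood([0, 1, 0]))
```

Note how parameter sharing shows up here: the same matrices `A` and `B` are reused at every time step (time invariance), so the number of parameters does not grow with the sequence length.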
A plate model describes a set of random variables that share the same ancestor. For example, in LDA, every document has the same form of distribution and depends on the same prior.
The reason for developing the temporal and plate models is simple: parameter sharing. In some applications we have a huge number of random variables, and setting different parameters for variables of the same type leads to serious problems: data sparsity and overfitting. To avoid these problems, we share parameters.
Undirected Model
An undirected model is a graph structure without direction. From my perspective, an undirected model removes the locally normalized conditional-probability factors from the factorization and adds a global normalization instead. This adds exponential difficulty to inference and learning, but it also gives more power in modeling the data.
Different from a directed network, the factor becomes much more important in building the connection between the graph and the probability distribution. When factorizing over a Bayesian network, we use conditional probability distributions to compose the joint distribution over all variables; over a Markov network, we compose the joint distribution from general (unnormalized) factors instead.
Factorization over MRF
In order to factorize over an MRF, we need to define factors over it. But is there a criterion for grouping random variables into factors? Unfortunately, from my personal point of view, there is no clear method for this. There is no need for the variables in the same factor to be fully connected in advance.
General Gibbs Distribution
Given a set of factors $\Phi = \{\phi_1(D_1), \dots, \phi_K(D_K)\}$, the Gibbs distribution is defined as $P(X_1, \dots, X_n) = \frac{1}{Z} \prod_{k=1}^{K} \phi_k(D_k)$, where $Z = \sum_{X_1, \dots, X_n} \prod_{k=1}^{K} \phi_k(D_k)$ is the partition function.
Here we see the trouble with MRFs: we need a normalizing constant $Z$, and computing it requires summing over exponentially many configurations.
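A minimal sketch of the Gibbs distribution for three binary variables with two assumed pairwise factors (all the factor values are arbitrary, for illustration only):

```python
from itertools import product

# Two factors over binary variables: phi1 on (A, B), phi2 on (B, C).
phi1 = {(0, 0): 30.0, (0, 1): 5.0, (1, 0): 1.0, (1, 1): 10.0}
phi2 = {(0, 0): 100.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 100.0}

def unnorm(a, b, c):
    """Unnormalized measure: the product of all factors."""
    return phi1[(a, b)] * phi2[(b, c)]

# The partition function Z sums the factor product over EVERY configuration --
# this brute-force sum is exactly the exponential cost the text warns about.
Z = sum(unnorm(a, b, c) for a, b, c in product((0, 1), repeat=3))

def p(a, b, c):
    return unnorm(a, b, c) / Z

print(sum(p(a, b, c) for a, b, c in product((0, 1), repeat=3)))  # normalizes to 1
```

With only 3 binary variables the sum has $2^3 = 8$ terms, but with $n$ variables it has $2^n$, which is why approximate inference is usually needed for MRFs.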
Another concept is the induced Markov network: in this kind of network, random variables belonging to the same factor scope must be fully connected. For example, if we have the factors $\phi_1(A, B, C)$ and $\phi_2(B, C, D)$, the induced network contains the edges $A-B$, $A-C$, $B-C$, $B-D$, and $C-D$.
Factorization & Independence
When we say a probability distribution $P$ factorizes over a Markov network $H$, we mean $P$ can be written as a Gibbs distribution whose factors induce the graph $H$.
For independence between $X$ and $Y$ given a set $Z$, define separation: $X$ and $Y$ are separated in $H$ given $Z$ if every path between $X$ and $Y$ passes through a node in $Z$. Then:
1. If $P$ factorizes over $H$, then separation in $H$ implies conditional independence in $P$ (i.e., $H$ is an I-map of $P$).
2. If $H$ is an I-map of $P$ and $P$ is positive ($P > 0$ for every configuration), then $P$ factorizes over $H$ (the Hammersley-Clifford theorem).
Unlike the BN case, we need an extra constraint (positivity) to confirm the relation between the factorization and the set of independences.
Log-linear model
One widely used factorization method for Markov networks is the log-linear model. In an MN we define a factor over a set of random variables, but in a log-linear model we define feature functions over the variables of each factor, and we attach a weight to each feature: $P(X) = \frac{1}{Z} \exp\left( \sum_j w_j f_j(D_j) \right)$.
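To see why this form is fully general, here is a sketch (toy numbers assumed) showing that any table factor can be rewritten in log-linear form by using one indicator feature per configuration, with weight $w = \log \phi(\text{config})$:

```python
import math

# A positive table factor over two binary variables (values assumed for illustration).
phi = {(0, 0): 30.0, (0, 1): 5.0, (1, 0): 1.0, (1, 1): 10.0}

# One indicator feature per configuration; its weight is the log of the table entry.
weights = {cfg: math.log(v) for cfg, v in phi.items()}

def loglinear(a, b):
    """exp(sum_j w_j * f_j(a, b)), where f_j is the indicator of configuration j."""
    return math.exp(sum(w * ((a, b) == cfg) for cfg, w in weights.items()))

print(all(abs(loglinear(a, b) - phi[(a, b)]) < 1e-9 for a, b in phi))  # True
```

The practical advantage runs the other way: instead of one weight per configuration, a log-linear model can use a few shared features across many factors, which is another form of parameter sharing.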
Conditional Random Field
It is another variation of the MN. More precisely, it is a discriminative model for sequence labeling tasks, widely used in natural language processing and image processing. The model directly models the conditional distribution $P(Y \mid X) = \frac{1}{Z(X)} \prod_k \phi_k(X, Y)$, where the partition function $Z(X)$ depends on the input $X$, so we never need to model the distribution of $X$ itself.
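A minimal linear-chain CRF sketch by brute-force enumeration (the emission and transition scores below are hypothetical toy values, and real CRFs use dynamic programming instead of enumerating all labelings):

```python
import math
from itertools import product

def emit(x_t, y_t):
    """Hypothetical emission score: reward label matching the input symbol."""
    return 2.0 if x_t == y_t else 0.0

def trans(y_prev, y_t):
    """Hypothetical transition score: reward label continuity."""
    return 1.0 if y_prev == y_t else 0.0

def score(x, y):
    s = sum(emit(xt, yt) for xt, yt in zip(x, y))
    s += sum(trans(a, b) for a, b in zip(y, y[1:]))
    return s

def p_y_given_x(x, y):
    """P(y | x): note the partition function Z(x) depends on the input x."""
    zx = sum(math.exp(score(x, yy)) for yy in product((0, 1), repeat=len(x)))
    return math.exp(score(x, y)) / zx

x = (0, 0, 1)
best = max(product((0, 1), repeat=3), key=lambda y: p_y_given_x(x, y))
print(best)  # the highest-probability labeling for this toy input
```

Because the normalization is per input `x`, the model spends no capacity on modeling `x` itself; that is what makes it discriminative.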
Written with StackEdit.