Thursday, September 17, 2015

AM207 Monte Carlo Method and Stochastic Optimization

Lecture 1

Introduction lecture; I have almost forgotten all the material.

Lab session 1

Introduction to probability distributions, with Python examples.

Lecture 2 Basic Monte Carlo Method

  1. Importance Sampling
    This method is used to estimate the integral of a function by drawing from a proposal distribution and reweighting the samples.

  2. Rejection Sampling
    This is the first method that actually draws samples from a target distribution. (A sketch of both methods follows below.)

One important issue with Monte Carlo methods is that the variance of the estimate depends on the sample size. We need better ways to control the variance of the estimator.
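
A minimal sketch of both ideas, not taken from the lecture: importance sampling reweights draws from a simple proposal to estimate an expectation, and rejection sampling accepts or rejects proposals to obtain true draws from the target. The standard normal target and the specific proposals are my own illustrative choices.

    import numpy as np

    rng = np.random.default_rng(0)

    # Importance sampling: estimate E_p[x^2] for p = N(0, 1) by drawing from a
    # wider proposal q = N(0, 2^2) and reweighting with w = p(x) / q(x).
    def importance_sampling(n=100_000):
        x = rng.normal(0.0, 2.0, size=n)                             # samples from q
        log_w = -0.5 * x**2 - (-0.5 * (x / 2.0)**2 - np.log(2.0))    # log p - log q, up to a shared constant
        w = np.exp(log_w)
        return np.sum(w * x**2) / np.sum(w)                          # self-normalized estimate, should be near 1

    # Rejection sampling: draw x uniformly on [-5, 5] and a height u uniformly
    # under the maximum of the density; keep x only if u falls below p(x).
    def rejection_sampling(n=10_000):
        samples = []
        peak = 1.0 / np.sqrt(2 * np.pi)                              # maximum of the N(0, 1) density
        while len(samples) < n:
            x = rng.uniform(-5, 5)
            u = rng.uniform(0, peak)
            if u < np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi):
                samples.append(x)
        return np.array(samples)

    print(importance_sampling())        # ~1.0
    print(rejection_sampling().std())   # ~1.0, the accepted points follow N(0, 1)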

Lecture 3 Variance Reduction

Techniques for reducing the variance of Monte Carlo estimators.

  1. Control Variates

  2. Antithetic Variates
    Requires the integrand to be monotonic. (See the sketch after this list.)

  3. Stratified Sampling
    Split the domain into multiple intervals and perform the estimate on each interval.
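
A small sketch of antithetic variates on a toy integral, my own example rather than the lecture's: estimate E[exp(U)] for U ~ Uniform(0, 1) (true value e - 1). Because exp is monotonic, pairing each draw u with 1 - u gives negatively correlated values whose average has lower variance at the same cost.

    import numpy as np

    rng = np.random.default_rng(1)

    f = np.exp
    n = 10_000

    # Plain Monte Carlo estimate with n draws.
    u = rng.uniform(size=n)
    plain = f(u)

    # Antithetic variates: n/2 pairs (u, 1 - u), still n function evaluations.
    u_half = rng.uniform(size=n // 2)
    paired = 0.5 * (f(u_half) + f(1.0 - u_half))

    print(plain.mean(), plain.var() / n)             # estimate and variance of the mean
    print(paired.mean(), paired.var() / (n // 2))    # same cost, noticeably smaller variance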

Lecture 4 Bayesian Formalism

One simple distinction between the Bayesian and frequentist approaches is the form of the result: a Bayesian method returns a distribution, while a frequentist method returns a single number.
A Bayesian method starts with a prior distribution on the parameter; after observing some data, it updates the prior to obtain the posterior distribution of the parameter.
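
A tiny worked example of this prior-to-posterior update, using the standard Beta-Binomial conjugate pair; the numbers are my own illustration, not from the lecture.

    import numpy as np
    from scipy import stats

    # Prior on a coin's bias: theta ~ Beta(2, 2).
    a, b = 2.0, 2.0

    # Observe 7 heads in 10 flips.
    heads, flips = 7, 10

    # Conjugacy: the posterior is Beta(a + heads, b + tails).
    posterior = stats.beta(a + heads, b + (flips - heads))

    print(posterior.mean())            # posterior mean of theta
    print(posterior.interval(0.95))    # 95% credible interval: a distribution, not a single number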

Lecture 5 Bayesian Formalism: Part 2 && MCMC

Sampling methods:
1. Inverse transformation
2. Rejection Sampling
3. MCMC

MCMC is the most widely used approach for sampling from high-dimensional, complex distributions. What properties do we expect an MCMC chain to have?
1. Aperiodic: we do not want patterns (or loops) in the random walk. If such a pattern exists, the corresponding samples will not be random.
2. Irreducible: at any step, the probability of reaching any point in the domain is greater than zero. We do not want the random walk to end up in a deterministic absorbing state; this way the chain can reach any part of the domain.
3. Detailed balance: for any pair of points x and y, the probability flow forward equals the probability flow backward, i.e. p(x) T(x -> y) = p(y) T(y -> x). This guarantees the samples come from the distribution we want.

Lecture 6 MCMC

Start the real introduction to MCMC with detailed balance.
First, the Metropolis method, which assumes a symmetric proposal probability q(x' | x) = q(x | x'). We first draw a new point x' from q(x' | x), where x is the current point, then accept the new point with probability

    alpha = min(1, p(x') / p(x))

Next, draw a uniform sample u between 0 and 1. If u is less than alpha, the new point is accepted; otherwise we stay at the current point.
The interpretation of alpha is quite simple: if the new point has a higher probability than the current point, it is accepted unconditionally; if it has a smaller probability, it is accepted only with that probability. In practice, a Gaussian or uniform proposal distribution satisfies the symmetry requirement.
What if the proposal distribution is not symmetric? We correct the acceptance probability with the ratio of proposal densities, which gives the Metropolis-Hastings acceptance function

    alpha = min(1, (p(x') q(x | x')) / (p(x) q(x' | x)))

In the usual case, though, we still use a symmetric proposal distribution rather than an asymmetric one.
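
A minimal random-walk Metropolis sketch with a symmetric Gaussian proposal, targeting a standard normal purely for illustration; the acceptance test is done in log space.

    import numpy as np

    rng = np.random.default_rng(2)

    def log_p(x):
        """Unnormalized log target: a standard normal."""
        return -0.5 * x**2

    def metropolis(n_samples=50_000, step=1.0, x0=0.0):
        """Random-walk Metropolis with a symmetric Gaussian proposal."""
        samples = np.empty(n_samples)
        x = x0
        for i in range(n_samples):
            x_new = x + rng.normal(0.0, step)    # symmetric proposal q(x' | x)
            # Accept with probability min(1, p(x') / p(x)), computed in log space.
            if np.log(rng.uniform()) < log_p(x_new) - log_p(x):
                x = x_new
            samples[i] = x
        return samples

    s = metropolis()
    print(s.mean(), s.std())    # should be close to 0 and 1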

Lecture 7 MCMC Convergence

For MCMC sampling methods, convergence is a serious issue. Convergence means we are getting samples from the target distribution. Before convergence there is a phase called burn-in; samples from this stage cannot be used to estimate statistics. There is no reliable method to determine when burn-in has finished, only some heuristic diagnostics.
1. Trace of variables
Plot the iteration number on the horizontal axis and the value of the variable on the vertical axis. We look at this trace to see how the variable changes over time; if it looks like random noise around a stable level, the chain may have converged.
2. Geweke test
The Geweke test computes a z-score from the means and variances of two non-overlapping segments of the samples. Here is the definition:

    z = (mean_A - mean_B) / sqrt(var_A + var_B)

In the usual case, the value of z should lie between -2 and 2. To apply the method, we split the current samples into two parts, compute the means mean_A and mean_B and the corresponding variances var_A and var_B. This seems useful.
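
A small sketch of this diagnostic, assuming the z-score form above and the common convention of comparing the first 10% of the chain against the last 50%; the window fractions are my choice, not from the lecture.

    import numpy as np

    def geweke_z(chain, first=0.1, last=0.5):
        """Geweke-style z-score: compare the mean of the first 10% of the
        chain with the mean of the last 50%, scaled by their variances."""
        chain = np.asarray(chain)
        n = len(chain)
        a = chain[: int(first * n)]
        b = chain[int((1.0 - last) * n):]
        return (a.mean() - b.mean()) / np.sqrt(a.var(ddof=1) + b.var(ddof=1))

    # Usage: values roughly inside (-2, 2) are consistent with convergence.
    # z = geweke_z(samples_from_some_chain)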

Lecture 8 Gibbs Sampling

Another test statistic for MCMC convergence was discussed. Gibbs sampling was only briefly mentioned.

Lecture 9 More on Gibbs Sampling

The main point is that Gibbs sampling is a special case of the Metropolis-Hastings algorithm in which the proposal distribution is the full conditional distribution (so every proposal is accepted). The rest of the lecture is examples.
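
A minimal Gibbs sampler sketch for a correlated bivariate normal, where both full conditionals are one-dimensional normals that can be drawn exactly; the example is my own illustration.

    import numpy as np

    rng = np.random.default_rng(3)

    def gibbs_bivariate_normal(n_samples=20_000, rho=0.8):
        """Alternately draw from the exact full conditionals of a zero-mean
        bivariate normal with unit variances and correlation rho."""
        x, y = 0.0, 0.0
        samples = np.empty((n_samples, 2))
        cond_sd = np.sqrt(1.0 - rho**2)
        for i in range(n_samples):
            x = rng.normal(rho * y, cond_sd)    # x | y ~ N(rho * y, 1 - rho^2)
            y = rng.normal(rho * x, cond_sd)    # y | x ~ N(rho * x, 1 - rho^2)
            samples[i] = (x, y)
        return samples

    s = gibbs_bivariate_normal()
    print(np.corrcoef(s.T)[0, 1])    # should be close to 0.8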

Lecture 10 Slice MC and Multivariate Slice MC

Data Augmentation
The key idea of data augmentation: if we want to sample from a distribution p(x), we augment it with a joint distribution p(x, y) whose marginal over x is p(x). Then we sample from p(x, y) and discard y to obtain samples from p(x).

Slice Sampling
One example of data augmentation.
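
A sketch of a univariate slice sampler in the style of Neal's stepping-out and shrinkage procedure, targeting a standard normal for illustration; the bracket width and other details are arbitrary choices here.

    import numpy as np

    rng = np.random.default_rng(4)

    def slice_sample(p, x0, width=2.0, n_samples=20_000):
        """Augment x with a height u ~ Uniform(0, p(x)), then draw the next x
        uniformly from the slice {x : p(x) > u}."""
        x = x0
        samples = np.empty(n_samples)
        for i in range(n_samples):
            u = rng.uniform(0.0, p(x))        # auxiliary height under the curve
            lo = x - width * rng.uniform()    # randomly positioned initial bracket around x
            hi = lo + width
            while p(lo) > u:                  # step out until the bracket covers the slice
                lo -= width
            while p(hi) > u:
                hi += width
            while True:                       # shrink until a proposal lands on the slice
                x_new = rng.uniform(lo, hi)
                if p(x_new) > u:
                    x = x_new
                    break
                if x_new < x:
                    lo = x_new
                else:
                    hi = x_new
            samples[i] = x
        return samples

    p = lambda x: np.exp(-0.5 * x**2)         # unnormalized standard normal density
    s = slice_sample(p, 0.0)
    print(s.mean(), s.std())                  # roughly 0 and 1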

Lecture 11 Hierarchical Bayesian Method

Nothing special, just standard hierarchical models: directed acyclic graphical models. But the derivation of the conditional distributions is quite hard. I am not sure how applicable this is to other, more complex models.

Lecture 12 Hamiltonian Monte Carlo

Monte Carlo and Markov Chain Monte Carlo methods are all based on random walks. From this perspective, several aspects deserve attention:
1. How do we propose a new point (based on the current point or other information)?
2. What is the criterion for accepting a proposed point? This only needs to ensure detailed balance, so that the final samples follow the target distribution.

For an MCMC algorithm to work, we need to ensure detailed balance, and to ensure the random walk has no repeating patterns and can reach every part of the space.

Hamiltonian Monte Carlo is one type of MCMC, motivated by one fact: for efficient sampling, when the probability surface is flat the step size should be large, and when the probability surface varies a lot the step size should be small. This ensures the space is sampled fully.

Hamiltonian Monte Carlo is motivated by physical systems and Newton's laws of motion. Even though I do not know the details of HMC, I can tell that the main improvement is in how new candidate points are proposed (making the sampler much more efficient).
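
A rough sketch of the proposal mechanism as I understand it: leapfrog integration of Hamiltonian dynamics followed by a Metropolis accept/reject on the change in total energy. The standard normal target and the step sizes are my own choices for illustration.

    import numpy as np

    rng = np.random.default_rng(5)

    # Target: standard normal, so the potential energy is U(q) = 0.5 * q^2.
    def U(q):
        return 0.5 * q**2

    def grad_U(q):
        return q

    def hmc_step(q, step=0.1, n_leapfrog=20):
        """One HMC transition: resample momentum, run a leapfrog trajectory,
        then accept or reject based on the change in total energy H = U + K."""
        p = rng.normal()                                 # fresh momentum
        q_new, p_new = q, p
        p_new -= 0.5 * step * grad_U(q_new)              # half step for momentum
        for _ in range(n_leapfrog):
            q_new += step * p_new                        # full step for position
            p_new -= step * grad_U(q_new)                # full step for momentum
        p_new += 0.5 * step * grad_U(q_new)              # undo the extra half momentum step
        dH = (U(q_new) + 0.5 * p_new**2) - (U(q) + 0.5 * p**2)
        return q_new if np.log(rng.uniform()) < -dH else q

    q, samples = 0.0, np.empty(20_000)
    for i in range(len(samples)):
        q = hmc_step(q)
        samples[i] = q
    print(samples.mean(), samples.std())                 # roughly 0 and 1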

Lecture 13 Parallel Tempering

Start multiple chains with different temperatures. The key idea is the use of temperature: any probability distribution p(x) can be rewritten as a Boltzmann-like distribution p_T(x) proportional to exp(log p(x) / T), where T is the temperature. A higher temperature flattens the distribution and increases the acceptance rate when sampling; a lower temperature decreases the acceptance rate. The core idea of parallel tempering is to run multiple MCMC chains at different temperatures, some at T = 1 and others at higher temperatures, and to occasionally swap states between chains. This way MCMC explores more of the space and is more efficient.
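
A compact sketch of the idea on a bimodal toy target; the temperature ladder, swap rule, and tuning are illustrative choices of mine, not from the lecture.

    import numpy as np

    rng = np.random.default_rng(6)

    def log_p(x):
        """Unnormalized log target: an equal mixture of normals at -4 and +4."""
        return np.logaddexp(-0.5 * (x - 4.0)**2, -0.5 * (x + 4.0)**2)

    def parallel_tempering(temps=(1.0, 2.0, 5.0, 10.0), n_iter=20_000, step=1.0):
        """Run one Metropolis chain per temperature on p(x)^(1/T) and
        occasionally propose swaps between adjacent temperatures."""
        temps = np.array(temps)
        x = np.zeros(len(temps))
        samples = np.empty(n_iter)
        for i in range(n_iter):
            # Within-chain Metropolis move at each temperature.
            for k, T in enumerate(temps):
                x_new = x[k] + rng.normal(0.0, step)
                if np.log(rng.uniform()) < (log_p(x_new) - log_p(x[k])) / T:
                    x[k] = x_new
            # Propose swapping the states of a random adjacent pair of chains.
            k = rng.integers(len(temps) - 1)
            log_a = (log_p(x[k + 1]) - log_p(x[k])) * (1.0 / temps[k] - 1.0 / temps[k + 1])
            if np.log(rng.uniform()) < log_a:
                x[k], x[k + 1] = x[k + 1], x[k]
            samples[i] = x[0]                    # keep only the T = 1 chain
        return samples

    s = parallel_tempering()
    print(s.mean())                              # near 0, since both modes get visited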

Lecture 14 Simulated Annealing

Borrowing the concept of temperature from Lecture 13. Start with a high temperature, then decrease the temperature slowly, hoping to jump out of local minima. The reason is this: a high temperature smooths the objective (or the distribution).
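
A minimal simulated-annealing sketch on a toy one-dimensional objective; the objective, cooling rate, and step size are arbitrary choices for illustration (the global minimum is near x = -0.5).

    import numpy as np

    rng = np.random.default_rng(7)

    def objective(x):
        """Simple multi-modal test function."""
        return 0.1 * x**2 + np.sin(3.0 * x)

    def simulated_annealing(x0=5.0, T0=5.0, cooling=0.999, n_iter=20_000, step=0.5):
        """Metropolis-style moves on exp(-objective(x) / T) while slowly lowering T."""
        x, T = x0, T0
        best = x
        for _ in range(n_iter):
            x_new = x + rng.normal(0.0, step)
            delta = objective(x_new) - objective(x)
            # Always accept downhill moves; accept uphill moves with prob exp(-delta / T).
            if delta < 0 or rng.uniform() < np.exp(-delta / T):
                x = x_new
            if objective(x) < objective(best):
                best = x
            T *= cooling                    # geometric cooling schedule
        return best

    print(simulated_annealing())            # should end up near the global minimum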

Lecture 15 Stochastic Gradient Descent

Discusses optimization algorithms; nothing special.

Lecture 16 Time Series

Besides the usual i.i.d. assumption about data points, there are other forms of data. Time series are a special form in which each data point depends on the previous data points. There are several concepts related to time-series data (a small sketch of both follows the list):
1. Autocorrelation.
2. Autoregression.
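
A small sketch of both concepts using a simulated AR(1) autoregressive series and its sample autocorrelation; the example is my own, not from the lecture.

    import numpy as np

    rng = np.random.default_rng(8)

    # AR(1) autoregression: each point depends linearly on the previous one.
    phi, n = 0.7, 5_000
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.normal()

    def autocorrelation(series, lag):
        """Sample autocorrelation of a series at a given positive lag."""
        centered = series - series.mean()
        return np.dot(centered[:-lag], centered[lag:]) / np.dot(centered, centered)

    print(autocorrelation(x, 1))    # close to phi = 0.7
    print(autocorrelation(x, 5))    # close to phi^5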

Lecture 17 Times Series & HMM

Continues with time-series models and the HMM; not very interesting to me.

Lecture 18 HMM & Kalman Filter

Models like the HMM and the Kalman filter; worth knowing the details.

Lecture 19 Expectation Maximization

Some derivation, with a GMM example. Nothing particularly new in this lecture.

Lecture 20 Gaussian Process

Gaussian processes are quite complicated. Prediction for a new point is based on the points already seen. A kernel is used to construct the covariance matrix of a Gaussian distribution, like the Gram matrix. The new point and the original points form a joint Gaussian distribution, which can be reduced to a conditional distribution for prediction. Some of the rest is still not clear to me.
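
A short sketch of that conditioning step for Gaussian-process regression with a squared-exponential (RBF) kernel; the training points and kernel parameters are arbitrary illustrative choices.

    import numpy as np

    def rbf_kernel(a, b, length=1.0, scale=1.0):
        """Squared-exponential kernel, used to build the Gram matrix."""
        d = a[:, None] - b[None, :]
        return scale * np.exp(-0.5 * (d / length) ** 2)

    # Observed training points and the test points where we want predictions.
    x_train = np.array([-2.0, -1.0, 0.5, 2.0])
    y_train = np.sin(x_train)
    x_test = np.linspace(-3, 3, 7)
    noise = 1e-6    # small jitter / observation noise variance

    # Joint Gaussian over (train, test), then condition on the training outputs.
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_test)
    K_ss = rbf_kernel(x_test, x_test)

    K_inv = np.linalg.inv(K)
    mean = K_s.T @ K_inv @ y_train        # posterior predictive mean
    cov = K_ss - K_s.T @ K_inv @ K_s      # posterior predictive covariance

    print(mean)                           # predictions at the test points
    print(np.sqrt(np.diag(cov)))          # predictive standard deviations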

Lecture 21 Graphical Model

Covering graphical models in a single lecture is impossible, so I skipped it.

Lecture 22 Graphical Model cont.

Skipped as well.

Summary

After finishing this course, I think it is not especially advanced, but the most valuable part is that it taught me more details about MCMC, so I can now read further books, understand them, and implement some of the algorithms myself. Overall, this is a good course for MCMC. But for the optimization, graphical model, and EM parts, nothing special.