Sunday, September 21, 2014

Notes on "Neural Networks for Machine Learning"

I attended a course on Coursera named "Neural Networks for Machine Learning", taught by Geoffrey Hinton. Here are some of the important topics.

Most Important Contents

Speeding up learning and improving generalization are the most important tasks in neural networks.

Learning Problems in Neural Networks

The most widely used optimization method for neural networks is back-propagation. Sometimes this can be slow, but a few modifications can speed up the training process.

Momentum

This trick is just like the "heavy ball" method in subgradient optimization: it combines the current gradient with the previous search direction to form a new search direction.
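
A minimal sketch of this update (the names `velocity`, `lr`, and `mu` are mine, not from the lecture):

```python
import numpy as np

def momentum_step(w, grad, velocity, lr=0.01, mu=0.9):
    """Heavy-ball/momentum update: keep a fraction mu of the
    previous search direction and add the current gradient step."""
    velocity = mu * velocity - lr * grad
    return w + velocity, velocity

# usage: keep one velocity array per parameter, initialized to zeros
w = np.zeros(5)
v = np.zeros_like(w)
w, v = momentum_step(w, np.random.randn(5), v)
```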

Adaptive Learning Rate

From my personal understanding, adaptive learning rates help with sparse data. Data are sparse in almost every task: for frequent features the gradients are sufficient for learning, but for rare features the gradients may be so small that they have no significant impact on the parameters. An adaptive learning rate can amplify those gradients so that the parameters still receive meaningful updates.
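
A sketch of the per-weight adaptive gains described in the lectures (the 0.05/0.95 constants are the illustrative values Hinton uses; the function name is mine):

```python
import numpy as np

def adaptive_gain_step(w, grad, prev_grad, gain, lr=0.01):
    """Grow a per-weight gain additively while the gradient keeps
    its sign; shrink it multiplicatively when the sign flips."""
    agree = np.sign(grad) == np.sign(prev_grad)
    gain = np.where(agree, gain + 0.05, gain * 0.95)
    return w - lr * gain * grad, gain
```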

Rprop

Rprop handles the current gradient differently to make better use of sparse features: effectively, the gradient is divided by its own magnitude, so only its sign is used, together with a step size adapted separately for each weight. Its mini-batch variant, RMSprop, instead divides the gradient by a running average of its magnitude to make learning more efficient.
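
A sketch of the RMSprop update under that description (hyper-parameter values are only illustrative):

```python
import numpy as np

def rmsprop_step(w, grad, mean_sq, lr=0.001, decay=0.9, eps=1e-8):
    """Divide the gradient by a running RMS of its magnitude, so the
    per-weight step behaves sign-like, as in rprop."""
    mean_sq = decay * mean_sq + (1 - decay) * grad ** 2
    return w - lr * grad / (np.sqrt(mean_sq) + eps), mean_sq
```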

Mini-Batch

The mini-batch trick, where each weight update is computed on a small batch of training examples rather than a single example or the whole set, seems to improve training performance.
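
A sketch of a mini-batch iterator (the batch size of 32 is arbitrary):

```python
import numpy as np

def minibatches(X, y, batch_size=32, rng=np.random.default_rng(0)):
    """Yield shuffled mini-batches: each update then uses a gradient
    that is cheaper than full batch but less noisy than online."""
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]
```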

Learning for RNNs

Recurrent neural networks have their own problems in training. The usual methods are BPTT (Back-Propagation Through Time), Echo State Networks, and so on.
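
A minimal sketch of BPTT for a vanilla RNN (only the recurrent-weight gradient is shown; input-weight gradients and biases are omitted for brevity):

```python
import numpy as np

def rnn_forward(x_seq, h0, Wxh, Whh):
    """Unroll the network in time; every hidden state must be kept
    so that gradients can later flow backwards through the steps."""
    h, hs = h0, [h0]
    for x in x_seq:
        h = np.tanh(Wxh @ x + Whh @ h)
        hs.append(h)
    return hs

def bptt_Whh(x_seq, hs, dh_last, Whh):
    """Push a gradient on the last hidden state back through time;
    this repeated multiplication is what makes gradients vanish or explode."""
    dWhh, dh = np.zeros_like(Whh), dh_last
    for t in reversed(range(len(x_seq))):
        dz = dh * (1 - hs[t + 1] ** 2)   # back through tanh
        dWhh += np.outer(dz, hs[t])
        dh = Whh.T @ dz                  # to the previous time step
    return dWhh
```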

Second-Order Methods

Gradient descent is a first-order training method; we can also use second-order (Newton-like) methods. However, the Hessian is hard to compute and even harder to store. Conjugate gradient can get an approximation without creating the Hessian explicitly.
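
A sketch of conjugate gradient solving H x = b: note that it only ever calls `hvp(v)`, a Hessian-vector product, which can be computed without materializing H (e.g. with the R-operator or finite differences):

```python
import numpy as np

def conjugate_gradient(hvp, b, n_iters=50, tol=1e-10):
    """Solve H x = b using only Hessian-vector products, so the
    Hessian is never formed or stored explicitly."""
    x = np.zeros_like(b)
    r = b - hvp(x)      # residual
    p = r.copy()        # search direction
    rs = r @ r
    for _ in range(n_iters):
        Hp = hvp(p)
        alpha = rs / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p   # next conjugate direction
        rs = rs_new
    return x
```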

Improve Generalization Ability

Learning is important, but generalization ability is even more important. Previous results show that neural networks are prone to over-fitting. Techniques used to improve generalization ability include:

Early Stopping

Stop the learning process before the model starts to over-fit. This can be achieved by testing the model on a validation set: when performance on the validation set starts to drop, stop training.
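
A sketch of this loop, assuming `train_one_epoch` and `validation_loss` callbacks (both names are mine); the `patience` counter avoids stopping on a single noisy epoch:

```python
def train_with_early_stopping(train_one_epoch, validation_loss,
                              max_epochs=100, patience=5):
    """Stop once the validation loss has not improved for
    `patience` consecutive epochs."""
    best, best_epoch, bad = float("inf"), 0, 0
    for epoch in range(max_epochs):
        train_one_epoch()
        loss = validation_loss()
        if loss < best:
            best, best_epoch, bad = loss, epoch, 0
            # in practice, also checkpoint the weights here
        else:
            bad += 1
            if bad >= patience:
                break
    return best_epoch, best
```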

Regularization

Regularization is widely used to prevent over-fitting, for example penalizing the L2-norm or the L1-norm of the parameters.
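
A sketch of how the two penalties enter the gradient (`lam` is the regularization strength; values are illustrative):

```python
import numpy as np

def l2_grad(w, data_grad, lam=1e-4):
    """L2 / weight decay: adds lam/2 * ||w||^2 to the loss, i.e.
    lam * w to the gradient, shrinking all weights toward zero."""
    return data_grad + lam * w

def l1_subgrad(w, data_grad, lam=1e-4):
    """L1: adds lam * ||w||_1, i.e. lam * sign(w) to the (sub)gradient,
    which tends to drive many weights exactly to zero."""
    return data_grad + lam * np.sign(w)
```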

Add Noise to Input and Activity

For linear neurons, adding Gaussian noise to the input or the activity is equivalent to adding an L2-norm regularizer. For non-linear neurons this is much more complicated, but it still has good effects.
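
A sketch of input noise injection (`sigma` is the noise standard deviation; for a linear neuron the expected effect matches an L2 penalty scaled by sigma squared):

```python
import numpy as np

def noisy_inputs(X, sigma=0.1, rng=np.random.default_rng(0)):
    """Corrupt each input with fresh Gaussian noise every pass."""
    return X + rng.normal(0.0, sigma, size=X.shape)
```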

Bayesian Approach

We can overcome the high variance of neural networks with bagging: in this method we train multiple neural networks and combine their results to give a better answer. This approach works for regression problems. The Bayesian approach is similar but more complicated and more powerful: it uses MCMC to generate samples from the posterior distribution over the parameters and combines the resulting predictions.
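
A sketch of the simple bagging/averaging case (the full Bayesian version would instead average predictions weighted by MCMC samples from the posterior over weights):

```python
import numpy as np

def bagged_predict(models, X):
    """Average the predictions of several independently trained
    networks; averaging reduces the variance of the estimate."""
    return np.mean([model(X) for model in models], axis=0)
```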

For deep neural networks, we can improve performance with two useful tricks: dropout and pre-training.

Dropout

Dropout can be treated as a huge mixture of experts. It works as follows: during training, in each hidden layer we ignore some of the neurons' activities at random; during back-propagation, those activities are treated as zero.
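
A sketch of the inverted-dropout variant (scaling by 1/(1-p) at training time so no test-time rescaling is needed; the lecture's original formulation instead halves the outgoing weights at test time):

```python
import numpy as np

def dropout_forward(h, p_drop=0.5, train=True, rng=np.random.default_rng(0)):
    """Randomly zero a fraction p_drop of the activities during training
    and remember the mask for the backward pass."""
    if not train:
        return h, None
    mask = (rng.random(h.shape) >= p_drop) / (1.0 - p_drop)
    return h * mask, mask

def dropout_backward(dh, mask):
    """Gradient flows only through the units that were kept."""
    return dh * mask
```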

Pre-Training

Pre-training means that before training a supervised neural network, we first train an unsupervised network and then use its weights to initialize the supervised one. This works because deep neural networks perform poorly when initialized with random weights; pre-training adds useful information to the weights. It is usually done with restricted Boltzmann machines (stacked to form a deep belief network), which are unsupervised, energy-based models.
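
A sketch of one contrastive-divergence (CD-1) update for a single RBM layer (biases are omitted for brevity; `v0` is a batch of visible vectors):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, lr=0.1, rng=np.random.default_rng(0)):
    """Lower the energy of the data, raise it for one-step
    reconstructions: W += lr * (positive - negative statistics)."""
    h0 = sigmoid(v0 @ W)                                 # hidden probabilities
    h_sample = (rng.random(h0.shape) < h0).astype(float)
    v1 = sigmoid(h_sample @ W.T)                         # reconstruction
    h1 = sigmoid(v1 @ W)
    W += lr * (v0.T @ h0 - v1.T @ h1) / len(v0)
    return W
```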

Other Applications

Other contents I found interesting include word embeddings for language models and deep autoencoders. Word embeddings have proved useful in many applications, but I think autoencoders are even more interesting because they can be used for dimensionality reduction.
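
A sketch of a tiny linear autoencoder doing dimensionality reduction (with linear units and squared error it learns the same subspace as PCA; all names and hyper-parameters are mine):

```python
import numpy as np

def train_autoencoder(X, k=2, lr=0.01, epochs=200, rng=np.random.default_rng(0)):
    """Encode d-dimensional rows of X down to k dimensions and decode
    back, minimizing squared reconstruction error by gradient descent."""
    n, d = X.shape
    We = rng.normal(0, 0.1, (d, k))   # encoder weights
    Wd = rng.normal(0, 0.1, (k, d))   # decoder weights
    for _ in range(epochs):
        Z = X @ We          # low-dimensional codes
        E = Z @ Wd - X      # reconstruction error
        Wd -= lr * (Z.T @ E) / n
        We -= lr * (X.T @ (E @ Wd.T)) / n
    return We, Wd
```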

New to me

  1. Boltzmann machines and Hopfield networks. Is this kind of energy-based model like the Ising model? (See the energy function sketched after this list.)
  2. Pre-training with deep Boltzmann machines.
  3. Deep autoencoders (encoder and decoder) for dimensionality reduction.
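
On item 1: as far as I understand, the Hopfield energy does have the same form as an Ising Hamiltonian, with the learned weights $w_{ij}$ playing the role of the couplings (a quick note from memory, not from the course):

$$E(\mathbf{s}) = -\frac{1}{2}\sum_{i \neq j} w_{ij}\, s_i s_j - \sum_i \theta_i s_i$$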

Written with StackEdit.
