Thursday, August 6, 2015

Notes on Foundations of Machine Learning

@(Machine Learning)[Notes, Machine Learning]


Chapter 1 Introduction

Three different criteria for a learning algorithm:
1. Time complexity
2. Memory complexity
3. Sample complexity: how many samples are needed to learn the concepts in the hypothesis set.

Some other concepts: training set, validation set, and test set. One important thing I learned from this chapter is the so-called learning scenario. Learning scenarios differ from each other in the type of training and testing data (with or without labels), in how the training and testing data arrive, and in how the training and testing data are used. We have the following scenarios:
1. Supervised Learning
2. Unsupervised Learning
3. Active Learning
4. Dimension Reduction
5. Transductive Learning
6. Online Learning

Then we also have learning tasks; a task is the target we want to achieve:
1. Classification
2. Regression
3. Ranking
4. Sequence labeling

and so on.

Chapter 2 PAC Learning Framework

Machine learning uses a learning algorithm to find one hypothesis $h$ in a hypothesis class $H$ according to a specific loss function $L$. Two potential criteria for a hypothesis $h$:
1. its loss on the available training sample, called the empirical risk $\hat{R}(h)$;
2. its average loss over all possible samples, called the risk $R(h)$.

The most important question is the relation between the empirical risk and the risk; this kind of relation is called a generalization bound. In machine learning, low risk, not low empirical risk, is the ultimate goal: low risk means good generalization ability. If the generalization bound is clear, we can estimate the real performance of the hypothesis returned by the learning algorithm. Under certain circumstances, we can put the same question another way: how many samples do we need so that the returned hypothesis has risk less than a prescribed $\epsilon$?
Basic background for the derivation: each training sample comes from the sample space $X$, and samples are distributed according to a distribution $D$. There is a concept family $C$; each concept maps a sample into the target value space $Y$, i.e. $c: X \to Y$. The learning algorithm uses a hypothesis space $H$ to approximate the concept class $C$; each hypothesis likewise maps $X$ into $Y$. Here we assume the hypothesis set $H$ is finite, with size $|H|$.
Now PAC-learnability is defined as follows:
A concept class $C$ is PAC-learnable if there exist an algorithm $A$ and a polynomial function $poly(\cdot,\cdot,\cdot,\cdot)$ such that for all $\epsilon > 0$ and $\delta > 0$, for all distributions $D$ on $X$, and for any target concept $c \in C$, if the sample size satisfies $m \geq poly(1/\epsilon, 1/\delta, n, size(c))$, then the following holds:

$$\Pr_{S \sim D^m}\big[R(h_S) \leq \epsilon\big] \geq 1 - \delta.$$
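
To make the sample-complexity question concrete, the consistent, finite-$|H|$ bound $m \geq \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)$ is easy to evaluate numerically. A minimal sketch in Python (the function name is mine):

```python
import math

def pac_sample_size(h_size, epsilon, delta):
    """Sample size sufficient for a consistent learner with a finite
    hypothesis set: m >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)

# e.g. |H| = 2^20 boolean hypotheses, epsilon = 0.05, delta = 0.01
print(pac_sample_size(2 ** 20, 0.05, 0.01))  # 370
```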

In order to derive the bounds for the consistent and inconsistent cases, we need the union bound and Hoeffding's inequality.
Some generalizations of the above statement:
1. Above we assumed the label of an input instance is deterministic, i.e. the label can be derived from the input data. In real applications, however, the relation between input and output is stochastic, so we instead assume a distribution $D$ over the joint input-output space $X \times Y$. One possible explanation: we do not have enough features to distinguish every training instance. For example, if we use only height and weight to classify a person as man or woman, then for a given height-weight pair there is some probability of each class.
2. When the output is a distribution conditioned on the input sample $x$, we have the Bayes error: the minimum error the best possible classifier can achieve.
3. We can decompose the difference between the risk of the hypothesis returned by the learning algorithm and the Bayes risk into estimation error plus approximation error.
4. According to the generalization bound, the risk of a hypothesis depends on its empirical error and on the ratio between the (log) number of hypotheses and the sample size. For model selection, we can search over hypothesis sets of different sizes, or use regularization. A sketch of the bound itself follows.
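
A minimal sketch of the finite-$|H|$, inconsistent-case bound referenced in point 4 (the function name is mine):

```python
import math

def generalization_bound(emp_risk, h_size, m, delta):
    """Finite-|H|, inconsistent-case bound:
    R(h) <= emp_risk + sqrt((ln|H| + ln(2/delta)) / (2m))."""
    slack = math.sqrt((math.log(h_size) + math.log(2.0 / delta)) / (2.0 * m))
    return emp_risk + slack

# With |H| = 1000 hypotheses, 5000 samples, 5% empirical error:
print(generalization_bound(0.05, 1000, 5000, delta=0.05))  # ~0.083
```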

Chapter 3 Rademacher Complexity and VC Dimension

Rademacher complexity and VC dimension are two different ways to describe the complexity of a hypothesis space. When the number of hypotheses is infinite, the generalization bound derived in Chapter 2 is useless; Rademacher complexity and VC dimension measure the complexity of a hypothesis set by means other than counting hypotheses.
With these two notions of hypothesis complexity, combined with concentration inequalities, one can derive uniform generalization bounds. A small numerical sketch of the empirical Rademacher complexity follows.
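
A minimal Monte Carlo sketch of the empirical Rademacher complexity $\hat{\mathfrak{R}}_S(H) = \mathbb{E}_\sigma\big[\sup_{h \in H} \frac{1}{m}\sum_i \sigma_i h(x_i)\big]$ for a small finite hypothesis set of threshold classifiers, assuming NumPy (names are mine):

```python
import numpy as np

def empirical_rademacher(preds, n_trials=2000, seed=0):
    """Monte Carlo estimate of E_sigma[ sup_h (1/m) sum_i sigma_i * h(x_i) ],
    given preds: an (n_hypotheses, m) array of each h's +-1 outputs on S."""
    rng = np.random.default_rng(seed)
    m = preds.shape[1]
    total = 0.0
    for _ in range(n_trials):
        sigma = rng.choice([-1.0, 1.0], size=m)   # Rademacher variables
        total += np.max(preds @ sigma) / m        # sup over the hypothesis set
    return total / n_trials

# Threshold classifiers h_t(x) = sign(x - t) on 50 random points:
x = np.sort(np.random.default_rng(1).normal(size=50))
preds = np.array([np.where(x > t, 1.0, -1.0) for t in np.linspace(-2, 2, 40)])
print(empirical_rademacher(preds))
```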

Chapter 4 SVM

Start with the linearly separable case and then extend to the non-separable case. The derivation of the primal and dual problems is the way I expected. The most important part is the derivation of a generalization bound for SVM that depends on the margin rather than on the feature dimension. One more interesting point: using the hinge loss actually optimizes the generalization bound of SVM; a minimal hinge-loss trainer is sketched below.
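
A minimal sketch of that view: subgradient descent directly on the regularized hinge loss, assuming NumPy and omitting the bias term for brevity (names are mine):

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Subgradient descent on the regularized hinge loss
    (1/m) * sum_i max(0, 1 - y_i * w.x_i) + lam * ||w||^2, with y in {-1, +1}."""
    w = np.zeros(X.shape[1])
    m = len(y)
    for _ in range(epochs):
        margins = y * (X @ w)
        active = margins < 1                       # points violating the margin
        grad = -(X[active] * y[active, None]).sum(axis=0) / m + 2 * lam * w
        w -= lr * grad
    return w

# Toy separable data
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w = train_linear_svm(X, y)
print(np.sign(X @ w))  # should match y
```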

Chapter 5 Kernel Method

For the kernel method, the key is the implicit mapping from the input space to a high-dimensional space. The other core topic is positive definite symmetric (PDS) kernels; the rest, like the RKHS machinery, is not necessary to know in full detail.
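
The defining property of a PDS kernel, that every Gram matrix it produces is positive semidefinite, is easy to check numerically. A small sketch with the Gaussian kernel (names are mine):

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    """Gram matrix of the Gaussian kernel k(x, x') = exp(-||x-x'||^2 / (2 sigma^2))."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

# A PDS kernel must yield a positive semidefinite Gram matrix on any sample:
X = np.random.default_rng(0).normal(size=(10, 3))
K = gaussian_kernel_matrix(X)
print(np.linalg.eigvalsh(K).min() >= -1e-10)  # True (up to numerical noise)
```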

Chapter 6 Boosting

Boosting starts with several questions: can we build a strong classifier out of weak classifiers? What is the definition of a weak classifier? How can we combine these weak classifiers? What is the generalization bound, and which factors affect the generalization ability?
Boosting is a strategy; AdaBoost is one concrete algorithm, applicable when the base weak classifiers form a hypothesis set mapping the input space to $\{-1, +1\}$. The basic idea is to assign a weight to each training sample, give more weight to samples that are inherently harder to classify, and combine the classifiers from all iterations linearly into the final classifier, as in the sketch below.
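
A minimal sketch of AdaBoost with one-feature threshold stumps as the weak learners, assuming NumPy (names are mine; the stump search is deliberately brute-force):

```python
import numpy as np

def train_adaboost(X, y, n_rounds=20):
    """Minimal AdaBoost with threshold stumps, y in {-1, +1}.
    Returns a list of (feature, threshold, polarity, alpha)."""
    m = len(y)
    D = np.full(m, 1.0 / m)                       # sample weights
    ensemble = []
    for _ in range(n_rounds):
        best, best_err = None, np.inf
        for j in range(X.shape[1]):               # exhaustive stump search
            for t in np.unique(X[:, j]):
                for s in (1, -1):
                    pred = s * np.where(X[:, j] > t, 1.0, -1.0)
                    err = D[pred != y].sum()
                    if err < best_err:
                        best_err, best = err, (j, t, s)
        eps = max(best_err, 1e-10)
        alpha = 0.5 * np.log((1 - eps) / eps)     # weak learner's vote
        j, t, s = best
        pred = s * np.where(X[:, j] > t, 1.0, -1.0)
        D *= np.exp(-alpha * y * pred)            # upweight misclassified points
        D /= D.sum()
        ensemble.append((j, t, s, alpha))
    return ensemble

def predict_adaboost(ensemble, X):
    score = sum(a * s * np.where(X[:, j] > t, 1.0, -1.0)
                for j, t, s, a in ensemble)
    return np.sign(score)
```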
AdaBoost optimizes an exponential loss function and can be interpreted as a coordinate descent algorithm.
Maximum margin alone cannot explain the performance of AdaBoost.
One disadvantage of AdaBoost: it is sensitive to noisy data. This is a real problem, because noisy data is everywhere and hard to control.
Another question is the selection of the base classifier. Usually decision trees are used as base learners, training a tree on the currently weighted samples at each iteration.

Chapter 7 Online Learning

Online learning handles data points one by one: when the learning algorithm receives a data point, it makes a prediction and incurs a loss related to that prediction.
Online learning differs from the techniques of the previous chapters in that no fixed distribution is assumed on the input. This means we cannot use the risk to evaluate a hypothesis, so online learning uses another set of notions for evaluation: the mistake model and regret analysis.
Online learning model with expert advice: at each round, experts give advice on the current input sample $x_t$.
The mistake model is simple: count the number of mistakes the learning algorithm makes during $T$ rounds of learning. Regret is the algorithm's cumulative loss minus the cumulative loss of the expert with minimal loss:

$$R_T = \sum_{t=1}^{T} L(\hat{y}_t, y_t) - \min_{i} \sum_{t=1}^{T} L(y_{t,i}, y_t).$$

All the analyses of online learning algorithms give results on the total number of mistakes or on the regret over $T$ rounds of learning.
In the experts setting, we have the Weighted Majority (WM), Randomized Weighted Majority (RWM), and Exponential Weighted Majority (EWM) algorithms; an RWM sketch follows.
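
A minimal sketch of one RWM round, assuming NumPy and binary predictions (names are mine):

```python
import numpy as np

def rwm_round(weights, expert_preds, true_label, beta=0.5, rng=None):
    """One round of Randomized Weighted Majority: sample an expert in
    proportion to its weight, then penalize every expert that erred."""
    rng = rng or np.random.default_rng()
    p = weights / weights.sum()
    chosen = rng.choice(len(weights), p=p)
    prediction = expert_preds[chosen]
    weights[expert_preds != true_label] *= beta   # multiplicative penalty
    return prediction, weights

# Three experts over a few rounds; expert 0 is always right here.
rng = np.random.default_rng(0)
w = np.ones(3)
for y in [1, 0, 1, 1, 0]:
    preds = np.array([y, 1 - y, rng.integers(0, 2)])
    yhat, w = rwm_round(w, preds, y, rng=rng)
print(w)  # expert 0 keeps weight 1.0; experts that err decay
```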
When analyzing the mistake bound and the regret, the usual technique is to derive upper and lower bounds on the total weight and combine them to get the result.
For linear classification, the Perceptron and Winnow algorithms are two standard approaches; the Perceptron's mistake-driven update is sketched below.
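
A minimal sketch of the online Perceptron, assuming NumPy (names are mine):

```python
import numpy as np

def perceptron_online(stream):
    """Online Perceptron: on each mistake, add y*x to the weight vector.
    `stream` yields (x, y) pairs with y in {-1, +1}."""
    w = None
    mistakes = 0
    for x, y in stream:
        if w is None:
            w = np.zeros_like(x)
        if y * (w @ x) <= 0:       # mistake (or undecided)
            w += y * x             # Perceptron update
            mistakes += 1
    return w, mistakes

data = [(np.array([1.0, 2.0]), 1), (np.array([-2.0, -1.0]), -1),
        (np.array([2.0, 0.5]), 1), (np.array([-1.0, -2.0]), -1)]
w, mistakes = perceptron_online(iter(data))
print(w, mistakes)
```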
Online learning assumes no distribution over the data, while batch algorithms assume a fixed distribution $D$. What happens if we train an online learning algorithm on data generated from a fixed distribution? The risk of the resulting hypothesis can be bounded by its average online loss plus a term that shrinks with the number of learning rounds $T$ (on the order of $\sqrt{\log(1/\delta)/T}$).

Chapter 8 Multi-Class Classification

Compared with binary classification, multi-class classification differs in three aspects:
1. A large number of classes requires more computational resources.
2. Classes can be unbalanced: some classes have far more samples than others, which calls for a change in the loss function.
3. There may be internal structure among the output classes, which should be exploited in the design of the loss function.

This chapter starts with the generalization bound for multi-class classification. I found the standard recipe for building a generalization bound:
1. Find the Rademacher complexity of the hypothesis set.
2. Derive the relation between the Rademacher complexity and the margin, then obtain a bound that relies on the margin.

These two steps have appeared multiple times in this book.
For multi-class classification:
1. Multi-class SVM, which defines the margin as the difference between the score of the correct label and the second-largest label score.
2. Boosting: the same idea as AdaBoost; it can also be used for multi-label problems.
3. Decision trees: the important point is that a decision tree splits the input space into multiple regions, and each region gets its own label.

Other methods, called aggregated methods (a one-vs-rest sketch follows this list):
1. One vs. rest
2. One vs. one
3. Error-correcting codes: quite an interesting method, which uses the Hamming distance to decide the label of an instance.
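
A minimal sketch of the one-vs-rest reduction around any binary learner, assuming NumPy; the least-squares scorer in the demo is just a stand-in (all names are mine):

```python
import numpy as np

def train_one_vs_rest(X, y, train_binary):
    """One-vs-rest: train one binary scorer per class (label +1 for that
    class, -1 for all others). `train_binary(X, y_pm)` returns a weight
    vector; any binary learner, e.g. the SVM sketch above, will do."""
    return {c: train_binary(X, np.where(y == c, 1, -1)) for c in np.unique(y)}

def predict_one_vs_rest(models, X):
    """Predict the class whose binary scorer gives the highest score."""
    classes = sorted(models)
    scores = np.stack([X @ models[c] for c in classes], axis=1)
    return np.array(classes)[np.argmax(scores, axis=1)]

# Demo with a trivial least-squares binary scorer:
lsq = lambda X, y: np.linalg.lstsq(X, y.astype(float), rcond=None)[0]
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [-1.0, -1.0]])
y = np.array([0, 0, 1, 1, 2])
models = train_one_vs_rest(X, y, lsq)
print(predict_one_vs_rest(models, X))  # recovers [0, 0, 1, 1, 2]
```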

Multi-class classification and sequence labeling
For sequence labeling, the following objective functions are provided:
1. Maximum Margin objective
2. Struct-SVM objective

Chapter 9 Ranking

Ranking is a new task compared with classification and regression, although it can be solved with the same techniques used for classification and regression.
For the ranking task, there are two different approaches:
1. Score-based approach: train a hypothesis mapping from $X$ to $\mathbb{R}$. At prediction time, each instance gets a score from the hypothesis, and instances are ranked according to this score.
2. Preference-based approach: I do not fully understand this method. It splits ranking into two stages: the first stage trains a preference function on the training sample, and the second stage performs ranking based on the preference function's scores.

Score-based approach for ranking
This is a pairwise ranking method: the training data consists of pairs of data points, and the objective function is the average number of mis-ranked pairs. In the score-based approach, each instance gets a score from the hypothesis $h$. SVM-style methods and RankBoost can be used to train the hypothesis. The analysis of the generalization bound is based on Rademacher complexity or on the margin, very much like SVM and AdaBoost.
One interesting scenario is bipartite ranking, where every instance belongs to one of two classes and the only requirement is that positive instances score higher than negative ones (see the loss sketch below). This special scenario can be solved with a boosting method.
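
A minimal sketch of the bipartite pairwise mis-ranking loss, assuming NumPy (names are mine):

```python
import numpy as np

def pairwise_misranking_error(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs where the positive instance
    does not score strictly higher: the bipartite pairwise ranking loss
    (equals 1 - AUC when there are no ties)."""
    diffs = scores_pos[:, None] - scores_neg[None, :]   # all pos-neg pairs
    return np.mean(diffs <= 0)

pos = np.array([2.1, 1.7, 0.9])
neg = np.array([1.5, -0.3])
print(pairwise_misranking_error(pos, neg))  # 1 mis-ranked pair out of 6
```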

Preference-based ranking
I still don't really understand what this method does.

Chapter 10 Regression

Regression is the task where the target value is real-valued. This chapter derives the generalization bound of linear regression based on Rademacher complexity. Another concept, similar to the VC dimension, is the pseudo-dimension, which is also based on the notion of shattering. Concrete regression algorithms include linear regression (with a linear hypothesis), ridge regression, kernel ridge regression, support vector regression, Lasso, and group Lasso; a ridge regression sketch follows. Some of the analyses are based on Rademacher complexity rather than on the margin.
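
A minimal sketch of ridge regression via its closed-form solution, assuming NumPy (names are mine):

```python
import numpy as np

def ridge_regression(X, y, lam=1.0):
    """Closed-form ridge regression: w = (X^T X + lam * I)^(-1) X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Noisy 1-feature toy data, y ~ 3x + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=100)
print(ridge_regression(X, y, lam=0.5))  # close to [3.0]
```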

Chapter 11 Algorithmic Stability

We only need one definition of stability here: if two training samples $S$ and $S'$ differ by a single data point, yet the difference in loss between the two learned hypotheses can be bounded by some constant $\beta$ at every point, the algorithm is called uniformly stable. A small numerical illustration follows.
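
A small numerical illustration of this idea, assuming NumPy: train ridge regression on two samples that differ in one point and compare the two hypotheses' per-point losses (names are mine; this checks the definition empirically, it is not a proof):

```python
import numpy as np

def ridge(X, y, lam=1.0):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Build S and S' differing in a single training point,
# then compare the two hypotheses' losses on fresh test points.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)
X2, y2 = X.copy(), y.copy()
X2[0], y2[0] = rng.normal(size=3), rng.normal()   # replace one point

w, w2 = ridge(X, y), ridge(X2, y2)
Xt = rng.normal(size=(50, 3))
yt = Xt @ np.array([1.0, -2.0, 0.5])
gap = np.abs((Xt @ w - yt) ** 2 - (Xt @ w2 - yt) ** 2)
print(gap.max())  # small: the per-point loss difference is uniformly tiny
```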

Chapter 12 Dimension Reduction

PCA is the most classical dimensionality reduction method; it aims to preserve the maximum variance of the original data (a minimal sketch follows below). Kernel PCA is a version based on a PDS kernel and has many connections with other nonlinear dimensionality reduction methods.
Other methods include Isomap (preserves pairwise geodesic distances), LLE (preserves local neighbor relations), and Laplacian Eigenmaps (preserves neighbor distances).
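
A minimal sketch of PCA via the eigendecomposition of the covariance matrix, assuming NumPy (names are mine):

```python
import numpy as np

def pca(X, k):
    """Project X onto the k directions of maximum variance, via the
    eigendecomposition of the sample covariance matrix."""
    Xc = X - X.mean(axis=0)                 # center the data
    cov = Xc.T @ Xc / (len(X) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)  # ascending eigenvalues
    top = eigvecs[:, ::-1][:, :k]           # top-k principal directions
    return Xc @ top

# 3-D data that varies along a single direction:
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1)) * np.array([[3.0, 2.0, 0.1]])
print(pca(X, 1).std())  # the one projection captures all the variance
```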

Chapter 13 Learning Automata and Languages

Skipped; I don't think it's relevant to my work.

Chapter 14 Reinforcement Learning

Reinforcement learning is based on one abstraction: the MDP (Markov decision process), which includes states, actions, rewards, transition probabilities (depending on the state and action), and reward probabilities.
There are two different assumptions:
1. The transition and reward probabilities are known to the learner.
2. The transition and reward probabilities are unknown to the learner.

A policy is a mapping from states to actions: $\pi: S \to A$.

In the first case, we only need to plan actions based on the known statistics. In the second case, the learner needs to estimate the parameters from its interactions with the environment. For an unknown environment, there are model-free approaches, which estimate the policy directly, and model-based approaches, which first estimate the transition and reward probabilities and then derive the policy. Q-learning is the most famous algorithm that estimates the policy directly; its update rule is sketched below.
All these methods are based on stochastic approximation.
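
A minimal sketch of the tabular Q-learning update, assuming NumPy (names are mine):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

# Tiny 2-state, 2-action example: one transition observation.
Q = np.zeros((2, 2))
q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
print(Q)  # Q[0, 1] moved toward the observed return
```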

Written with StackEdit.