Sunday, March 24, 2019

Interesting Design of VW

Design & Learning on VW

VW (Vowpal Wabbit) is a very popular machine learning tool with many attractive features: feature hashing, online learning, and even support for distributed training.
I have always been interested in its internals and wanted to learn more about it, so I recently started reading its source code.
This post is about the design choices in VW that I found interesting.

Feature Representation

For any machine learning tool, the most important thing is the representation of the features it accepts.
In VW, a feature is still represented as a <key, value> pair.
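As a rough illustration of how a <key, value> feature can be combined with feature hashing, the sketch below maps a string feature name to an index in a fixed-size weight table. The names `Feature`, `hash_feature` and `num_bits` are made up for this example, and `std::hash` is only a stand-in for whatever hash function VW actually uses, so treat it as a sketch and not as VW's implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>

// A sparse feature: hashed index (the key) plus a numeric value.
struct Feature {
    std::uint64_t index;
    float value;
};

// Hash a feature name into a table of 2^num_bits slots (num_bits < 64).
// std::hash is only a stand-in for a real feature hash function.
inline Feature hash_feature(const std::string& name, float value, unsigned num_bits) {
    const std::uint64_t mask = (std::uint64_t{1} << num_bits) - 1;
    const std::uint64_t h = std::hash<std::string>{}(name);
    return Feature{h & mask, value};
}
```

The same name always lands in the same slot, so no dictionary from names to indices has to be stored; the price is that two different names may collide.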

IO & Data Parsing

To handle IO, VW uses a custom class to represent open files.
One interesting thing about data parsing: the structure of a sample line seems to follow a grammar that an LL parser could handle?

Feature Combination

VW has the concept of a feature namespace. I think this is an essential feature for large-scale machine learning, where features come from multiple sources. One use is building n-gram-like combinations of features across different namespaces.
VW supports general interactions involving multiple namespaces, but the widely used options are just quadratic and cubic feature combinations.
Another interesting thing about VW: only 256 feature namespaces are available in total, and I don't know why :)

Because features get combined, generating all the combinations offline and storing them in a file would produce a huge file. So the combinations are generated online, while the features are being processed.
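A minimal sketch of generating quadratic combinations on the fly instead of materializing them on disk. It assumes that a namespace is just a vector of (index, value) pairs, that the value of an interaction is the product of the two values, and that the combined index comes from a simple hash mix; `Feature`, `quadratic` and the multiplier constant are illustrative choices, not VW's actual code.

```cpp
#include <cstdint>
#include <vector>

struct Feature {
    std::uint64_t index;
    float value;
};

// Cross every feature in namespace `a` with every feature in namespace `b`.
// The combined index is a cheap hash mix masked to 2^num_bits slots (num_bits < 64).
inline std::vector<Feature> quadratic(const std::vector<Feature>& a,
                                      const std::vector<Feature>& b,
                                      unsigned num_bits) {
    const std::uint64_t mask = (std::uint64_t{1} << num_bits) - 1;
    const std::uint64_t multiplier = 0x9E3779B97F4A7C15ULL;  // arbitrary odd constant
    std::vector<Feature> out;
    out.reserve(a.size() * b.size());
    for (const Feature& fa : a) {
        for (const Feature& fb : b) {
            const std::uint64_t idx = (fa.index * multiplier) ^ fb.index;
            out.push_back(Feature{idx & mask, fa.value * fb.value});
        }
    }
    return out;
}
```

Two namespaces with m and n features produce m*n interaction features per example, which is why storing them offline grows so quickly.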

Learner

One very interesting design choice is that learners are composable.
Learning in VW is just a set of functions that follow the same interface.

    struct func_data
    { using fn = void(*)(void* data);
      void* data;
      base_learner* base;
      fn func;
    };

    inline func_data tuple_dbf(void* data, base_learner* base, void (*func)(void*))
    { func_data foo;
      foo.data = data;
      foo.base = base;
      foo.func = func;
      return foo;
    }

    struct learn_data
    { using fn = void(*)(void* data, base_learner& base, void* ex);
      using multi_fn = void(*)(void* data, base_learner& base, void* ex, size_t count, size_t step, polyprediction* pred, bool finalize_predictions);

      void* data;
      base_learner* base;
      fn learn_f;
      fn predict_f;
      fn update_f;
      multi_fn multipredict_f;
    };

VW uses a struct of function pointers to represent all the functionality of a learner. This is also a very interesting design.
So basically, there are not many classes in VW.
In the struct learn_data, you can find a pointer to a base learner. In this way, a complex learner can be composed from simple base learners. This is fascinating.
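To make the composition idea concrete, here is a small self-contained sketch in the same style: each learner is a struct holding its own state, a pointer to a base learner, and a function pointer, and a wrapper learner does its own work and then delegates to its base. The types `toy_learner`, `sum_state` and `scale_state` and both learn functions are invented for illustration and are much simpler than VW's real interface.

```cpp
struct toy_learner;
using learn_fn = void (*)(void* data, toy_learner* base, double& example);

// A learner is just state + base pointer + function pointer.
struct toy_learner {
    void* data;
    toy_learner* base;
    learn_fn learn_f;

    void learn(double& example) { learn_f(data, base, example); }
};

// Bottom of the stack: accumulates every example it sees.
struct sum_state { double total = 0.0; };
inline void sum_learn(void* data, toy_learner* /*base*/, double& example) {
    static_cast<sum_state*>(data)->total += example;
}

// Wrapper: rescales the example, then hands it to the base learner.
struct scale_state { double factor = 1.0; };
inline void scale_learn(void* data, toy_learner* base, double& example) {
    example *= static_cast<scale_state*>(data)->factor;
    if (base != nullptr) base->learn(example);
}
```

Stacking a `scale_learn` learner on top of a `sum_learn` learner gives a new learner without defining any new class, which is the point of the design.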

That's all I have learned so far!!!

Written with StackEdit.

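Monday, December 31, 2018

2018 in Review

Only a few hours remain before 2019, so I am writing this summary to look back on the year.

Major Life Events

The past year was a real roller coaster.

Resignation

I joined my first company in 2013 and finally left it on January 1, 2018. I spent my last year and a half there in a complete daze: unhappy with my manager, unable to find a way forward, and lost in every sense. Thinking back on that time always stirs up a lot of feelings. Still, the experience taught me the essence of the workplace: once you leave, the tea goes cold. I had no real sense of purpose over those years, and only when I left did I realize that you need to set a direction for your own life.

Marriage

Around that time my girlfriend was longing for life abroad, so she encouraged me to look for opportunities overseas. I frantically asked everyone for referrals, and luckily my current company eventually took me in. My girlfriend, however, still had 6 months to go before graduating.
To make the move abroad go smoothly, we quickly registered our marriage and had a rushed wedding. Honestly, it feels like I tricked her into it before she had time to react. I have no regrets, though; I had been thinking about this for years!

New Job

After resigning I rested at home for almost a month, though I was hardly idle. First came two wedding banquets, one at my family's home and one at my wife's. Then I prepared for the IELTS exam, cleared the bar with an average score of 6, and started the visa application. I finally received my work visa 4 days before my start date and began life at the new company.
I have to "praise" myself here: wherever I go, the stock falls! When I joined the new company its stock was at an all-time high; less than two months later the bad news kept coming. Meanwhile my former employer's stock has been soaring.
One more piece of bad news: I am also a Party member.

Reunion Abroad

For the first few months after joining, the only thing I cared about was when my wife would arrive. Because of my procrastination, she had to wait at home for 3 months before she could join me. I need to reflect on this seriously: I still do not think things through thoroughly enough. Come to think of it, I have the same problem at work, and I need to fix it.

My Wife's Pregnancy

After 6 months apart, we were finally reunited in September! Absence makes the heart grow fonder, and my wife got pregnant!!! Completely unprepared!!! The original plan was to enjoy two years of freedom first, but damn, it happened right away.
Our whole routine was turned upside down. My wife throws up once every morning and once more at some random time during the day. She has no appetite at all and gets by on a little fruit and a bit of rice each day.
Here in the UK there is hardly anything around that she can eat. The food in Chinatown is unremarkable and does nothing for her appetite. My cooking is hopeless, so feeding her properly was too much of a challenge. In the end she had to go back to China. Looking back, what was I doing all this for?

Looking Ahead

A new year should start with good wishes. I hope that in the new year mother and child are both safe and healthy and rejoin me soon. I hope to take another step forward in my career. I hope our parents on both sides stay healthy and that life goes smoothly for them.
Beyond these life goals, I should also hold myself to higher standards.

Complete, Big-Picture Thinking

Over the past year I changed jobs, which was really a change of course after my career hit an obstacle. It made me see a major blind spot in how I think: I tend to consider things from a single, one-sided angle, and my understanding of a situation is always stuck at one particular stage instead of covering the whole picture from start to finish.

Clear and Accurate Communication

I did not communicate promptly and accurately with colleagues and managers about work, which repeatedly led to problems. This also needs to improve.

Focused Effort

Both at work and in my spare time I always feel the urge to pursue everything at once. This approach is flawed, because a person's energy is always limited. At work I need to identify the key problems and put everything into solving them. In my spare-time study I should likewise break through at one point first instead of trying to understand every detail of everything.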


Wednesday, December 26, 2018

Parameter Server Architecture

General Introduction

A parameter server (PS) is widely used in large-scale machine learning systems. The general idea is to distribute the parameters across multiple machines in order to handle data and parameters that are too large for a single machine.
Alongside the multiple parameter servers there are multiple worker nodes, which carry out the computation and reduce the time required to train the model.
But several problems arise when designing and implementing a PS system:

Communication across multiple machines.
Synchronization between multiple machines.
Storage system for parameters.

If these 3 problems are handled, the general design is nailed down.
In the following, I will introduce the parameter server designed by Mu Li. In my opinion, it is a well-designed system with very beautiful engineering.

Communication

In this system, communication is handled through ZMQ (ZeroMQ), so the implementation complexity is delegated to that library.

Synchronization

Getting multiple machines to synchronize is a very difficult problem. The key is the message design: each message carries a timestamp, and for every pair of communicating machines the timestamp uniquely identifies the message.
The system lets each worker node wait for a specific message identified by its timestamp. As long as all the machines wait on the same timestamp, they all act on the same timeline.
Since a timestamp only works between 2 nodes, how are broadcast situations handled? The solution is to build multiple p2p connections between the nodes.

Storage System

Actually, the storage is just a hash_map. Amazingly easy!!! :)
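As an illustration of how little is needed for the storage layer, the sketch below keeps parameters in a `std::unordered_map` and exposes push (add an update) and pull (read a value). The class name `KVStore` and the method names `push` and `pull` are assumptions for this example and do not reproduce the actual parameter server API.

```cpp
#include <cstdint>
#include <unordered_map>

// Toy parameter store: key -> weight, backed by a hash map.
class KVStore {
public:
    // Add `delta` to the stored value; missing keys start at 0.
    void push(std::uint64_t key, float delta) { table_[key] += delta; }

    // Read the current value; missing keys read as 0 and are not inserted.
    float pull(std::uint64_t key) const {
        const auto it = table_.find(key);
        return it == table_.end() ? 0.0f : it->second;
    }

private:
    std::unordered_map<std::uint64_t, float> table_;
};
```

A real server would add key partitioning across machines and concurrency control, but the core data structure can stay this simple.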

Thursday, August 18, 2016

Scope Rule for Identifier

Scope Rules

Every programming language defines identifiers according to specific rules. Each expression in a program involves identifiers, so a simple question arises: which entity does each identifier in the expression refer to? Answering this is called name resolution, and the specific rules are defined by each programming language.
For name resolution, the compiler must know the name binding from identifier to entity. The scope of a name binding is the part of the program text in which the binding is valid. At different locations in the program text, the name binding can be different.

Scope Rule

Generally speaking, the scope of an identifier is the region of program text in which the entity can be accessed through the identifier. So scope is a property of the identifier. We can also define the name context, which is the union of the scopes of all identifiers.

Scope Level

Depending on the level at which an identifier is defined, we get function scope, module scope, and so on.
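The post does not name a language, so the following example uses C++ purely as an illustration; there the closest analogue of module scope is namespace (file) scope. The function `scope_demo` is a made-up name, and it shows the same identifier bound to three different entities depending on where it appears.

```cpp
int value = 1;  // namespace (file) scope

// Records which `value` is visible at three points and packs them into one number.
inline int scope_demo() {
    int seen_outer = value;     // still the namespace-scope `value`
    int value = 2;              // function scope: shadows the outer binding from here on
    int seen_function = value;  // the function-scope `value`
    int seen_block = 0;
    {
        int value = 3;          // block scope: shadows the function-scope binding
        seen_block = value;     // the block-scope `value`
    }
    return seen_outer * 100 + seen_function * 10 + seen_block;
}
```

Each digit of the result corresponds to one binding of `value`, which makes the point that the binding depends on the location in the program text.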


Friday, April 8, 2016

Using Local Information

Kernel Method (or Non-parametric Method)

There seem to be two different definitions of “kernel methods” in machine learning. One is related to RKHS (reproducing kernel Hilbert spaces) and so on. The other is closer to non-parametric methods. The latter is the topic of this post; the main idea is how to use localized information to build a model.

Unlike a linear model, which constructs a global function over the whole sample space, a kernel method works by constructing a localized function for each new sample point $x_0$. Let's see how this method can be applied to different tasks.

When applied to a regression task, for each new sample point $x_0$ the method constructs a weight matrix $W_{x_0}$ based on some kernel function $k(x, x_0)$. The kernel function assigns higher weights to training points that are closer to $x_0$ under some norm. A weighted regression is then performed, giving a brand-new prediction function $\hat{f}_{x_0}$, which is evaluated at $x_0$ to return the predicted value.
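Written out for a locally linear fit, the weighted regression described above looks as follows. The symbols $X$ (design matrix with rows $x_i^T$), $y$ (vector of responses) and $\beta$ are notation assumed here, not taken from the post.

```latex
\[
\hat{\beta}(x_0) = \arg\min_{\beta} \sum_{i=1}^{N} k(x_i, x_0)\,\bigl(y_i - x_i^T \beta\bigr)^2
                 = \bigl(X^T W_{x_0} X\bigr)^{-1} X^T W_{x_0}\, y
\]
\[
W_{x_0} = \mathrm{diag}\bigl(k(x_1, x_0), \dots, k(x_N, x_0)\bigr), \qquad
\hat{f}_{x_0}(x_0) = x_0^T \hat{\beta}(x_0)
\]
```

A new $\hat{\beta}(x_0)$ is solved for every query point, which is what makes the fitted function local.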

For regression, there is also another kernel method, which weights the samples within the neighborhood of the new sample point $x_0$ and returns a weighted average of their response values.
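This kernel-weighted average is often called the Nadaraya-Watson estimator. A minimal sketch for scalar inputs follows; the Gaussian kernel, the bandwidth parameter `h`, and the function names are choices made for this example only.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Gaussian kernel on scalar inputs with bandwidth h > 0.
inline double gaussian_kernel(double x, double x0, double h) {
    const double u = (x - x0) / h;
    return std::exp(-0.5 * u * u);
}

// Kernel-weighted average of the responses ys at the query point x0.
// Assumes xs and ys have the same, non-zero length.
inline double kernel_average(const std::vector<double>& xs,
                             const std::vector<double>& ys,
                             double x0, double h) {
    double weighted_sum = 0.0;
    double weight_total = 0.0;
    for (std::size_t i = 0; i < xs.size(); ++i) {
        const double w = gaussian_kernel(xs[i], x0, h);
        weighted_sum += w * ys[i];
        weight_total += w;
    }
    return weighted_sum / weight_total;
}
```

A small `h` makes the prediction follow the nearest points closely, while a large `h` moves it toward the global mean of the responses.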

When applied to density estimation, the method also constructs a weight kernel that decays with the distance from the point $x_0$. Classification can then be performed with the estimated class densities according to Bayes' rule. We can also use a mixture of Gaussians to estimate the density of each class.

All the methods mentioned above face a bias-variance trade-off, typically controlled by the width of the kernel.


Friday, April 1, 2016

Notes on Linear Regression

Having read the linear regression chapter of The Elements of Statistical Learning (ESL), I found its approach different from that of Pattern Recognition and Machine Learning. After introducing the least squares method, ESL discusses the variance of the estimator ($\hat{\beta}$). Well, this is something quite new to me.
The first question is why we need to do this. What is the benefit of this kind of inference? The more interesting point is that, under the assumption that the true underlying model is linear, ESL gives hypothesis tests and interval estimates for the parameters. This is quite new to me, but the question is what happens if the real underlying model isn't linear. I think that is the most common scenario.
On another point, ESL gives a detailed analysis and comparison of different shrinkage methods, which is a clear illustration of the “bias-variance decomposition”. It also covers more advanced methods such as the lasso path and the LAR (least angle regression) algorithm.
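For reference, the sampling properties of the least squares estimator can be summarized as below. This assumes the standard setting $y = X\beta + \varepsilon$ with fixed $X$ and $\varepsilon \sim N(0, \sigma^2 I)$; the notation is assumed here and not quoted from the post.

```latex
\[
\hat{\beta} = (X^T X)^{-1} X^T y, \qquad
\mathrm{Var}(\hat{\beta}) = (X^T X)^{-1} \sigma^2, \qquad
\hat{\beta} \sim N\bigl(\beta,\ (X^T X)^{-1} \sigma^2\bigr)
\]
```

The hypothesis tests and confidence intervals follow from this distribution, which is also why they depend on the linear model being correct.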

Tuesday, January 19, 2016

Notes on Pattern Recognition and Machine Learning

Chapter 1 Introduction

Three important parts: probability theory, decision theory and information theory.
Decision theory provides loss functions; information theory provides entropy and KL divergence; probability theory provides conditional and joint probability, along with the Bayesian and frequentist formalisms.
Other topics concern high-dimensional data: for naive methods, the amount of data required grows exponentially with the number of dimensions, and high-dimensional spaces are counter-intuitive. But there are also other insights about high-dimensional data: real data often lies on a lower-dimensional manifold within the high-dimensional space, and it is usually locally smooth.
For model selection, the frequentist side offers cross-validation, while other methods, from the Bayesian side, combine model complexity with training performance.

Chapter 2 Probability distribution

This chapter focuses on the probability distributions relevant to machine learning, with special attention to the Gaussian distribution. One important thing I learned from this chapter is how to derive the conditional and marginal distributions from a joint Gaussian: the exponent of a Gaussian has two important components, a quadratic term involving the precision matrix and a linear term involving the mean, and we can identify the corresponding distribution by completing the square.
Another important point about the Gaussian distribution is the precision matrix, which is very helpful when deriving the conditional and marginal distributions.
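The results of completing the square can be stated compactly. The partition of $x$ into blocks $x_a$ and $x_b$ and the block notation for the precision matrix $\Lambda$ are assumed here.

```latex
\[
x = \begin{pmatrix} x_a \\ x_b \end{pmatrix}, \quad
\mu = \begin{pmatrix} \mu_a \\ \mu_b \end{pmatrix}, \quad
\Lambda = \Sigma^{-1} = \begin{pmatrix} \Lambda_{aa} & \Lambda_{ab} \\ \Lambda_{ba} & \Lambda_{bb} \end{pmatrix}
\]
\[
p(x_a \mid x_b) = N\bigl(x_a \mid \mu_{a|b},\ \Lambda_{aa}^{-1}\bigr), \qquad
\mu_{a|b} = \mu_a - \Lambda_{aa}^{-1} \Lambda_{ab} (x_b - \mu_b)
\]
\[
p(x_a) = N\bigl(x_a \mid \mu_a,\ \Sigma_{aa}\bigr)
\]
```

The conditional is simplest in terms of the precision matrix, while the marginal is simplest in terms of the covariance matrix.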
Chapter 2 also covers the exponential family, a general class of distributions with densities of the form $p(x; \eta) = A(\eta) h(x) \exp(\eta^T u(x))$, where $u(x)$ is the sufficient statistic and $A(\eta)$ is a normalizing coefficient that depends on the parameter $\eta$; its reciprocal is called the partition function.
Besides parametric forms, probability densities can also be estimated non-parametrically. For non-parametric density estimation there are two different approaches, the nearest-neighbour method and the kernel method, but both come from one basic principle for estimating probability.
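That shared principle can be written in one line. Here $N$ is the total number of data points, $V$ the volume of a small region around $x$, and $K$ the number of points falling inside it; the symbols are assumed notation.

```latex
\[
p(x) \simeq \frac{K}{N V}
\]
```

The kernel method fixes $V$ and counts $K$ from the data, whereas the nearest-neighbour method fixes $K$ and finds the $V$ needed to contain that many points.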

Chapter 3 Linear Regression

This book takes a Bayesian approach to every model (or hypothesis). For linear regression, there are several different perspectives for the derivation:
1. MLE: assume Gaussian noise.
2. Geometric view: project the target vector onto the column space of the data matrix.

For regularization, assume a Gaussian prior distribution on the parameter $w$.
The ultimate purpose of learning a model is to predict the target value for a new input point $x$, so how do we express the uncertainty of the predicted value $y(x)$? Frequentists and Bayesians use different methods:
1. Bayesians express the uncertainty through the posterior distribution of the parameter $w$.
2. Frequentists first make a point estimate of $w$, then determine the uncertainty through a series of thought experiments.

One important point: the bias-variance decomposition is a frequentist concept, because the interpretation of bias and variance depends on the following idea:
Suppose we have a collection of different data sets, each consisting of N data points. From each data set, the learning algorithm produces a point estimate of $w$, and a prediction is made for a new data point $x$ based on this estimate. Since the data sets are drawn from some unknown distribution $P$, we can take the expectation and variance of the predicted values $y(x; \hat{w})$ at the new data point. This is the origin of bias and variance. Different models have different bias and variance depending on their complexity, so controlling model complexity is vital in machine learning.
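In symbols, with $D$ denoting a data set, $\hat{w}_D$ the estimate obtained from it, and $h(x) = E[t \mid x]$ the ideal regression function (all assumed notation), the expected squared error of the prediction at $x$ splits into two terms:

```latex
\[
E_D\bigl[(y(x; \hat{w}_D) - h(x))^2\bigr]
  = \underbrace{\bigl(E_D[y(x; \hat{w}_D)] - h(x)\bigr)^2}_{\text{bias}^2}
  + \underbrace{E_D\bigl[(y(x; \hat{w}_D) - E_D[y(x; \hat{w}_D)])^2\bigr]}_{\text{variance}}
\]
```

The expectation over data sets $D$ is exactly the thought experiment described above, which is why the decomposition is a frequentist notion.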

But what is the Bayesian approach to linear regression estimation and model selection?
Start with a prior distribution over the parameter $w$ (usually Gaussian), then update this distribution as new data points are observed.
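For a Gaussian prior and Gaussian noise the update has a closed form. In the sketch below, $\Phi$ is the design matrix, $t$ the vector of targets and $\beta$ the noise precision; these symbols are assumed notation.

```latex
\[
p(w) = N(w \mid m_0, S_0), \qquad
p(w \mid t) = N(w \mid m_N, S_N)
\]
\[
S_N^{-1} = S_0^{-1} + \beta\, \Phi^T \Phi, \qquad
m_N = S_N \bigl(S_0^{-1} m_0 + \beta\, \Phi^T t\bigr)
\]
```

Because the posterior is again Gaussian, it can serve as the prior for the next batch of data, which gives the sequential update.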

Frequentists start model selection with cross-validation, whereas Bayesians do it based on the model evidence.

Another question: what variants are built on linear regression?
There are many variations of simple linear regression.
1. In the original linear regression, the raw data point $x$ is used. But we can first transform the data point with basis functions, from $x$ to $\Phi(x)$, and then fit a model that is linear in the transformed representation. Many basis functions are possible; some well-known ones are Gaussian, wavelet and sigmoid functions.
2. Another extension is the norm used for regularization: from $L_2$ to $L_1$ and $L_q$. Plain linear regression with $L_1$ regularization is called LASSO.

Other stuff?
1. The hypothesis complexity of linear regression, which is used to derive generalization bounds and sample complexity.

Chapter 4 Linear Classification

For classification, there are three different approaches:
1. Discriminant Function: From training instance to class label directly.
2. Probabilistic Generative Model: model the joint probability of instance and class label.
3. Probabilistic Discriminative Model: model the conditional probability of the class label given the training instance.

This chapter is about how to use linear models to realize the three approaches.
For discriminant functions, there are linear regression (least squares), Fisher discriminant analysis and the perceptron algorithm. Linear regression and Fisher's discriminant differ in their objective functions. The perceptron is quite unusual; it's hard to find an appropriate category for this algorithm.
For the probabilistic generative model, the posterior is obtained from Bayes' theorem:
$p(C_k | \phi) = \frac{p(\phi | C_k) \, p(C_k)}{\sum_j p(\phi | C_j) \, p(C_j)}$
where $p(\phi | C_k)$ is the class-conditional probability distribution. For binary and multi-class classification, if the class-conditional distributions are Gaussian and share the same covariance matrix, then the posterior of the class label has the following form: $p(C_k | \phi ) = f(w ^ T \phi + \text{constant})$. The function $f$ is called the activation function, and its inverse $f^{-1}$ is called the link function in statistics. So one question is: which specific forms of class-conditional distribution lead to a linear model? The answer is exponential family distributions with a shared scale parameter.
For the discriminative model, the conditional probability is modeled directly. A linear discriminative model has the following form:
$p(C_k | \phi) = f( w ^ T \phi + w_0)$
The activation function $f$ is the sigmoid function for binary classification and the softmax function for multi-class classification. Because the activation function is nonlinear, there is no closed-form solution, and the problem can only be solved iteratively. Iterative Reweighted Least Squares (IRLS) applies Newton's method to the linear discriminative model. Another point that needs attention: with the 1-of-K coding scheme for class labels, minimizing the negative log-likelihood of the training data is the same as minimizing the cross-entropy. For the binary case this holds when 0 and 1 are used to represent the two classes. (But in practice many people use 1 and -1, interesting.)
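For the binary case the Newton step can be written out as follows. Here $t_n \in \{0, 1\}$ are the labels, $y_n$ the predicted probabilities, $\Phi$ the design matrix with rows $\phi_n^T$, and the bias $w_0$ is absorbed into $w$; the notation is assumed for this sketch.

```latex
\[
E(w) = -\sum_{n=1}^{N} \bigl\{ t_n \ln y_n + (1 - t_n) \ln (1 - y_n) \bigr\}, \qquad y_n = \sigma(w^T \phi_n)
\]
\[
\nabla E(w) = \Phi^T (y - t), \qquad
H = \Phi^T R\, \Phi, \qquad R = \mathrm{diag}\bigl(y_n (1 - y_n)\bigr)
\]
\[
w^{\text{new}} = w^{\text{old}} - H^{-1} \nabla E(w)
              = \bigl(\Phi^T R\, \Phi\bigr)^{-1} \Phi^T R\, z, \qquad
z = \Phi\, w^{\text{old}} - R^{-1} (y - t)
\]
```

The last form is a weighted least squares problem whose weights $R$ change at every iteration, which is where the name comes from.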
In keeping with the spirit of this book, there is a Bayesian version of logistic regression. But the posterior of the parameters given the training data is intractable, so the Laplace approximation is used to approximate the posterior distribution.
How does the Laplace approximation work? It approximates the target distribution with a Gaussian. This Gaussian is centered at the mode of the target distribution, and its precision matrix is the negative of the Hessian of the log of the target density at the mode.
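In formulas, for a target density known up to its normalizer, $p(z) = f(z)/Z$, with mode $z_0$ (the symbols $f$, $Z$, $z_0$ and $A$ are assumed notation):

```latex
\[
p(z) = \frac{1}{Z} f(z), \qquad \nabla f(z) \big|_{z = z_0} = 0
\]
\[
A = -\nabla \nabla \ln f(z) \big|_{z = z_0}, \qquad
q(z) = N\bigl(z \mid z_0,\ A^{-1}\bigr)
\]
```

The approximation only uses local information at the mode, so it can miss global features such as multiple modes.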
Summary: Approaches to classification, cross-entropy, probit regression, laplace approximation, BIC.

Chapter 5 Neural Network

The most important concept: a neural network is an adaptive linear model, or it can be understood as a hierarchical linear model, because each layer of a neural network just performs a linear operation followed by a nonlinear activation. So a neural network is itself nonlinear but composed of linear models. The key motivation is that the input of one linear model can be the output of another. Seen this way, it is a basis expansion, but with adaptive basis functions. I find this much more interesting than the inspiration from neurons.
Viewed as an adaptive linear model, a neural network needs an objective function: cross-entropy for classification, least squares for regression. These concepts all come from the previous chapters.
But for an adaptive linear model the gradient and Hessian are hard to calculate, which is where back-propagation comes in.
Using back-propagation, the gradient and Hessian can be computed efficiently.
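The core recursion for the gradient is short. Below, $a_j$ is the pre-activation of unit $j$, $z_j = h(a_j)$ its output, $E_n$ the error on one training example, and $k$ ranges over the units that unit $j$ feeds into; the notation is assumed for this sketch.

```latex
\[
a_j = \sum_i w_{ji} z_i, \qquad z_j = h(a_j)
\]
\[
\delta_j \equiv \frac{\partial E_n}{\partial a_j} = h'(a_j) \sum_k w_{kj}\, \delta_k, \qquad
\frac{\partial E_n}{\partial w_{ji}} = \delta_j z_i
\]
```

The errors $\delta$ are computed once at the output layer and then propagated backwards, so every derivative is obtained in a single backward pass.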
As a function mapping different from the linear model, a neural network needs some way to perform model selection. The old regularization on all the parameters still works well. However, keep in mind that an adaptive linear model is itself hierarchical!
For neural networks, some other approaches can be used: consistent priors, tangent propagation, convolution and soft weight sharing. I think all these techniques are too complicated, and I don't know whether they are really used in open-source tools.
A neural network defines a function space and can be used to model almost anything. A mixture density network uses a neural network to predict the mixing coefficients, means and covariances of a mixture model. That is too many parameters; I don't think this is good.
Still, this book is about Bayesian methods, so there are Bayesian neural networks for regression and classification. Using the Laplace approximation, the posterior distribution can be approximated to give a predictive distribution and so on. With so many approximations, what is the point of getting a distribution instead of a single parameter estimate? I doubt the effectiveness of the Bayesian approach for neural networks.

Chapter 6 Kernel Method

The previous chapters focus on linear methods and their extension (I mean neural networks). But kernel methods are very different from linear methods: they bring a nonlinear mapping into the model directly.
Many linear models can be rewritten in a dual representation, in which the model is expressed through a kernel function.
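For regularized least squares regression, for example, the dual form of the prediction is as follows. Here $\phi$ is the feature mapping, $K$ the Gram matrix, $t$ the vector of training targets and $\lambda$ the regularization coefficient; the notation is assumed for this sketch.

```latex
\[
k(x, x') = \phi(x)^T \phi(x'), \qquad K_{nm} = k(x_n, x_m)
\]
\[
y(x) = k(x)^T \bigl(K + \lambda I_N\bigr)^{-1} t, \qquad
k(x) = \bigl(k(x_1, x), \dots, k(x_N, x)\bigr)^T
\]
```

The feature vector $\phi(x)$ never appears on its own, only inside kernel evaluations, so the feature space can be very high-dimensional.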
A valid kernel function must have a positive semidefinite Gram matrix. There are multiple ways to construct new kernel functions:
1. Compose a new kernel from existing kernels according to a set of rules.
2. Construct a new kernel from a probabilistic generative model, e.g. by combining kernels in the manner of a mixture model.

These are the key points about kernel functions; the others are skipped.
Another important topic is the Gaussian process, which I think I can understand now. Given a prior distribution over $w$, the output $y(x; w)$ at any data point has a distribution. A Gaussian process means that the outputs $y(x; w)$ at any finite set of points are jointly Gaussian. The most interesting point is that we do not need to worry about choosing a proper prior over the parameter $w$, because the prior is placed over functions directly through the kernel.

Chapter 7 Sparse Kernel Machine

This chapter starts with the SVM algorithm; more interesting is the Relevance Vector Machine (RVM), a Bayesian counterpart of the Support Vector Machine. The only difference from the Bayesian linear model is that the prior distribution over the parameter $w$ is factorized element-wise, with a separate precision $\alpha_i$ for each weight:
$p(w) = \prod_{i=1}^{M}{p(w_i)} = \prod_{i=1}^{M}{N(w_i|0, \alpha_i ^ {-1})}$
With this variant, sparsity is achieved. The Bayesian version exists for both classification and regression.

Chapter 8 Graphical Model

Graphical models come in different representations: directed and undirected. Each graphical representation specifies a factorization of the joint probability distribution and its conditional independence properties. The factorization and the conditional independences determine the efficiency of inference and learning algorithms.
For directed graphical models, d-separation determines conditional independence. For undirected graphical models, this is much simpler.
Inference on chain and tree models is very simple. One important idea: message passing, in which the other nodes pass messages to the current node.
For general graphical models, loopy belief propagation, variational inference and sampling are the solutions. The junction tree algorithm needs complicated steps; I don't think it is widely used.
Things like factor graphs and clique trees are just different representations of the same graphical model. A different representation does not change the underlying distribution; it is only for computational convenience.
So for graphical models, the key information is the structure of the joint distribution and how to use that structure to accelerate computation.

Chapter 9 EM & Mixture Models

EM is used to obtain maximum likelihood estimates for models with latent variables. Latent variables are introduced to simplify the computation of the likelihood of the observed data, even though they may not have any physical interpretation.
The EM algorithm first computes the posterior distribution of the hidden variables given the observed data. Then it calculates the expected complete-data log-likelihood under that posterior. Finally, it maximizes this expectation with respect to the model parameters.
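The steps above can be written as two formulas. $X$ denotes the observed data, $Z$ the hidden variables and $\theta$ the model parameters; the notation is assumed for this sketch.

```latex
\[
\text{E step:}\quad Q(\theta, \theta^{\text{old}}) = \sum_{Z} p(Z \mid X, \theta^{\text{old}}) \, \ln p(X, Z \mid \theta)
\]
\[
\text{M step:}\quad \theta^{\text{new}} = \arg\max_{\theta} Q(\theta, \theta^{\text{old}})
\]
```

The two steps are repeated, with $\theta^{\text{new}}$ becoming $\theta^{\text{old}}$, until the likelihood or the parameters stop changing.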
For a long time I couldn't understand EM, because I didn't know how to first get the distribution and then calculate the expectation under it. Finally, I got the point. The key is not to separate the distribution from the expectation, but to go straight for the expectation of the complete-data log-likelihood. With this view, the only thing required is the expectations under the posterior distribution.
If the posterior distribution is easy, EM can be very simple: just maximize the complete-data likelihood, but replace the values of the hidden variables with their expected values (this works when the complete-data log-likelihood is linear in the hidden variables, as in a mixture of Gaussians). When the posterior distribution is complex, an approximate inference method is required, and the expectation can be obtained in different ways.

Chapter 10 Variational Inference

Variational inference is used for inference problems: marginals, posteriors and MAP inference. If the distribution is very complex, we can make it simpler by imposing extra conditional independence assumptions. The mean field approximation works by assuming that some groups of variables are independent even though they are not. Variational inference minimizes the following KL divergence:
$KL(Q \| P) = \int Q(x) \log \frac{Q(x)}{P(x)} \, dx$
where $Q$ factorizes over groups of variables, i.e. $Q(x) = q_1(x_1) \cdots q_n(x_n)$. Substituting this into the KL divergence gives a function to minimize.
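Minimizing with respect to one factor $q_j$ while holding the others fixed gives the standard mean field update, written here with the symbols used above (the normalizer of $P$ is absorbed into the constant):

```latex
\[
\ln q_j^{*}(x_j) = E_{i \neq j}\bigl[\ln P(x)\bigr] + \text{const}, \qquad
E_{i \neq j}\bigl[\ln P(x)\bigr] = \int \ln P(x) \prod_{i \neq j} q_i(x_i)\, dx_i
\]
```

Each factor depends on expectations under the other factors, so the updates are cycled through the factors until they converge.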
For a long time I didn't understand this algorithm, simply because I didn't understand how to evaluate the expectation of the complete-data log-likelihood under some distribution.

Chapter 11 Sampling Method

The fancy name “Monte Carlo method” does not reveal the real content of this subject. Numerical sampling uses a computer's pseudo-random number generator to draw samples from the target distribution, and then performs all kinds of inference with them.
Broadly, there are sampling methods for specific distributions and MCMC sampling methods. For MCMC methods, several conditions matter:
1. Aperiodic: the chain does not get trapped in cycles when moving through the state space.
2. Irreducible: the whole space can be explored.
3. Reversible: also called detailed balance.
4. Ergodic: starting from any point, the chain converges to the same distribution.

Among all the methods, Gibbs sampling is the most important one. Another interesting method is Hamiltonian Monte Carlo.
From my understanding, the most important parts of an MCMC-like method are:
1. How to propose a new point in the domain of the distribution.
2. How to use detailed balance to decide whether to accept the new point.
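The two steps in the list above can be sketched with a random-walk Metropolis sampler. This is a minimal example that assumes a symmetric Gaussian proposal and uses an unnormalized standard normal as the target; the function names, the step size and the seed are all illustrative choices.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Acceptance probability for a symmetric proposal: min(1, p(x') / p(x)).
// p_current must be > 0; the densities may be unnormalized.
inline double acceptance_probability(double p_current, double p_proposed) {
    return std::min(1.0, p_proposed / p_current);
}

// Unnormalized standard normal density, used as the target here.
inline double target_density(double x) { return std::exp(-0.5 * x * x); }

// Random-walk Metropolis: propose x' = x + N(0, step^2), then accept or reject.
inline std::vector<double> metropolis(std::size_t n_samples, double step, unsigned seed) {
    std::mt19937 rng(seed);
    std::normal_distribution<double> proposal(0.0, step);
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    std::vector<double> samples;
    samples.reserve(n_samples);
    double x = 0.0;
    for (std::size_t i = 0; i < n_samples; ++i) {
        const double candidate = x + proposal(rng);
        const double a = acceptance_probability(target_density(x), target_density(candidate));
        if (uniform(rng) < a) x = candidate;  // otherwise keep the current point
        samples.push_back(x);
    }
    return samples;
}
```

With a symmetric proposal, this acceptance rule is what makes the chain satisfy detailed balance with respect to the target, and only ratios of densities are needed, so the normalizing constant never has to be known.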

There is another general idea related to sampling: data augmentation. It works as follows: if you want to sample from a distribution $p(x)$, you first construct another distribution $q(x,y)$ which satisfies $p(x)=\int q(x,y) \, dy$. The variable $y$ is called an auxiliary variable. If $q$ is chosen well, it is much simpler to sample from than $p$, so we draw samples from $q$ and then just drop the $y$ part. Slice sampling and Hamiltonian Monte Carlo belong to this general idea.

Chapter 12 Continuous Latent Variable

In this chapter, the author presents some very important ideas about dimensionality reduction. In dimensionality reduction, there are continuous latent variables which, after some transformation, are mapped into a high-dimensional space. The latent variables therefore occupy only a low-dimensional manifold of the high-dimensional space. But in the real world, data does not lie exactly on such a manifold. How do we interpret this? Each observed data point is interpreted as a point on the manifold plus some noise. PCA can be interpreted as a model with continuous latent variables plus Gaussian noise. Starting from PCA, many other dimensionality reduction algorithms can be derived. Amazing!
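The probabilistic view of PCA can be written down directly. In this sketch $z$ is the low-dimensional latent variable, $W$ the linear map into data space, $\mu$ the mean and $\sigma^2$ the noise variance; the notation is assumed here.

```latex
\[
p(z) = N(z \mid 0, I), \qquad
x = W z + \mu + \epsilon, \qquad \epsilon \sim N(0, \sigma^2 I)
\]
\[
p(x \mid z) = N\bigl(x \mid W z + \mu,\ \sigma^2 I\bigr), \qquad
p(x) = N\bigl(x \mid \mu,\ W W^T + \sigma^2 I\bigr)
\]
```

The term $W z + \mu$ is the point on the low-dimensional manifold, and $\epsilon$ is the noise that moves the observed data off it.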

Chapter 13 Sequential Data

Sequential data has two basic models: the HMM for discrete latent variables, which extends mixture models such as the mixture of Gaussians, and the LDS (linear dynamical system) for continuous latent variables, which extends PCA-like models. Both are just applications of graphical models.

Chapter 14 Combining Models

Combining models is also known as ensemble methods. Widely used methods: Random Forest, AdaBoost, GBDT. For the rest, I don't know of real applications.

Summary of Reading

PRML covers linear methods for classification and regression, neural networks, kernel methods, SVM, graphical models, variational inference, numerical sampling and ensemble learning. I think I have now acquired a basic understanding of machine learning techniques.

