Tuesday, December 31, 2019

2019 Year-End Summary

At my wife's request, here is my year-end summary for 2019. I hope to learn from the past year's lessons and have a wonderful 2020.

Reflections on Life

I can't help but marvel that I'll turn 30 in 2020. I never expected to reach this milestone so quickly, and compared with the people I admire, I still lag far behind in both career and life. I hope to catch up as much as I can in the years ahead.

My 2019

The Birth of Our Baby

For the sake of her health during the pregnancy, my wife went back from the UK to our hometown; we hoped that resting at home would be good for both her and the baby. Of course, we were also worried that our unfamiliarity with the NHS might affect the delivery.
The baby finally came into this world in June, and fortunately both mother and baby are healthy. I am deeply grateful to my wife, who drank soy milk like crazy at home to bring the baby safely into the world :(.
To welcome the baby, I first took one month of parental leave, and then in December took another two months to go home and stay with my wife and the baby. It turns out raising a newborn is utterly exhausting and nerve-racking.

Work

I'm still at the same company, though the year brought some small ups and downs at work. In the first half, the team I worked on was dissolved outright over questions about its direction, and I moved to a new team; fortunately my performance review wasn't much affected.
Even more happened in the second half: my direct manager changed several times, and my projects kept being reshuffled. Luckily, the work itself went fairly smoothly. I don't know the final outcome yet, but I expect it to be a good one.
I learned a lot from all these changes, most importantly that individual ability matters: only strong personal skills let you handle the risks that come with constant change.

Learning

After my wife returned to China, I lived alone in London. Living alone gets a bit boring, so I spent the idle time studying related topics.
What I learned, however, was neither systematic nor deep; it was scattered, a little of this and a little of that, and I doubt it will help my career much. I hope to improve on this in the future.

My 2020

I have too many ideas for 2020, but in short:

Family

Because of my own planning mistakes, my wife doesn't currently have a job she is happy with. I hope she can find a satisfying next step in the new year.
I still worry a lot about the family's income, since all of it comes from salary. I hope to find new income sources in the new year and gradually reduce the long-term risk.
In the new year I also hope to do some things that raise the odds of the family being together, since my wife still has her next step to choose, after all.

Work

At work, I have to voice my long-standing wish: I'd still like to get promoted :(. Beyond that, I want to systematically improve my planning and coordination skills; I'm sorely lacking there right now and feel I can only act as a small-scale executor, which really isn't good enough.
Another goal is to systematically deepen my understanding of the entire ads system and build the ability to drive large projects independently.


Machine Learning the future

These are my notes from the course Machine Learning the future.

Key Points for the Future

Several important pieces need to be solved for better machine learning models in the future:

  1. Online Learning
  2. Representation
  3. Exploration
  4. Reinforcement

Other topics were covered as well, but I don't think they are as important here. All four of the topics above have a huge impact on real-world applications.

Online Learning

Several questions need to be solved to get a stable online optimization algorithm:

Sample Imbalance

Sample imbalance is a widely known issue in real applications; there are multiple ways to handle it:

  1. Downsample the negatives, which creates a balanced dataset.
  2. Give rare samples higher weights.

Using weighted samples creates a challenge for the gradient update rule: how should the weight information be used?
Naively multiplying the gradient by the weight can overshoot the update and may produce a worse result.
Ideally, a weight of w should act like running the same example w times, with each implicit run updating the model (and hence the gradient) in turn.
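To make the intuition concrete, here is a small numpy sketch (function names and constants are mine, not from the course) contrasting the naive one-shot weighted update with an update that approximates w repeated passes over the same example, for squared loss:

    import numpy as np

    def naive_weighted_step(w, x, y, weight, lr=0.1):
        # One-shot update: multiply the squared-loss gradient by the weight.
        grad = 2 * (w @ x - y) * x
        return w - lr * weight * grad

    def importance_aware_step(w, x, y, weight, lr=0.1, substeps=1000):
        # Approximate "seeing the example `weight` times": many tiny updates,
        # recomputing the gradient after each one so the update self-damps.
        for _ in range(substeps):
            grad = 2 * (w @ x - y) * x
            w = w - lr * (weight / substeps) * grad
        return w

    w0, x, y = np.zeros(3), np.array([1.0, 2.0, 3.0]), 1.0
    print(naive_weighted_step(w0, x, y, weight=50.0))      # overshoots wildly
    print(importance_aware_step(w0, x, y, weight=50.0))    # prediction ~ y

I believe VW implements a closed-form version of this idea (importance-weight-aware updates), which avoids the inner loop entirely.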

Learning Rate

Online learning is sensitive to the learning rate. Mostly this can be handled with adaptive algorithms like AdaGrad.
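For reference, a minimal sketch of the per-coordinate AdaGrad rule (my own toy code):

    import numpy as np

    def adagrad_step(w, grad, accum, base_lr=0.1, eps=1e-8):
        # Accumulate squared gradients per coordinate; frequently-updated
        # coordinates get a smaller effective learning rate over time.
        accum = accum + grad ** 2
        w = w - base_lr * grad / (np.sqrt(accum) + eps)
        return w, accum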

Scale of Feature Values

Different features have different units and scales; without correction, features with larger values dominate the weight update.
We can try mean-variance normalization as a preprocessing trick, but subtracting the mean destroys the sparsity pattern in the training data (zero entries stop being zero).
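One common compromise, sketched below with illustrative names, is to rescale without mean-centering so that zero entries stay zero:

    import numpy as np

    def scale_only_normalize(X):
        # Divide each column by its standard deviation but skip the mean
        # subtraction, so the sparsity pattern of X is untouched.
        std = X.std(axis=0)
        std[std == 0] = 1.0          # leave constant columns as they are
        return X / std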

Explore/Exploit Tradeoff

For interactive services, exploration/exploitation is a common topic for service providers. Before solving this problem, we need to be able to evaluate different policies.

Uniform Randomization Logging for Data Collection

In the usual case, each action is taken based on the predicted probabilities. But for a small percentage of traffic, we uniformly select one of the K actions and log the result accordingly.
This way we have data to evaluate models developed offline. Otherwise the production model would always dominate the training samples, and an offline-developed model would never get the opportunity to be selected.
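A toy sketch of such a logging policy (the function name and the epsilon-mixture details are my own illustration, not from the course):

    import random

    def serve_and_log(context, model, actions, log, epsilon=0.05):
        # Mostly follow the production model, but divert a small slice of
        # traffic to a uniformly random action; log the probability with
        # which the chosen action was taken, for later off-policy evaluation.
        greedy = model(context)
        action = random.choice(actions) if random.random() < epsilon else greedy
        prob = epsilon / len(actions) + (1.0 - epsilon) * (action == greedy)
        log.append({"context": context, "action": action, "prob": prob})
        return action

Recording the action's probability alongside the action itself is what makes unbiased off-policy estimates possible later.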

UCB/LinUCB/Bandits

Contextual bandit algorithms are widely used to solve the explore/exploit problem. LinUCB is a promising algorithm here, under the assumption that the expected reward is linear in the context features.
LinUCB has notably been applied in news recommendation systems to improve CTR.
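For reference, a minimal sketch of the disjoint variant of LinUCB (my own code; the notation follows the usual paper presentation):

    import numpy as np

    class LinUCB:
        def __init__(self, n_arms, dim, alpha=1.0):
            self.alpha = alpha
            self.A = [np.eye(dim) for _ in range(n_arms)]    # X^T X + I per arm
            self.b = [np.zeros(dim) for _ in range(n_arms)]  # X^T r per arm

        def choose(self, x):
            # Pick the arm with the highest upper confidence bound.
            scores = []
            for A, b in zip(self.A, self.b):
                A_inv = np.linalg.inv(A)
                theta = A_inv @ b                            # ridge estimate
                scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
            return int(np.argmax(scores))

        def update(self, arm, x, reward):
            self.A[arm] += np.outer(x, x)
            self.b[arm] += reward * x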

Offline Policy Evaluation

Evaluation is the most important step for both products and machine learning problems. There are two aspects of evaluation:

  1. What’s the evaluation metric?
  2. What’s the evaluation data?

The choice of data determines the validity of the evaluation metric. For systems that use bandits/policies to do online learning, a good offline evaluation setup determines the iteration speed of offline modeling.
There are multiple ways to do offline evaluation of reinforcement learning algorithms (off-policy evaluation).
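The simplest such estimator is inverse propensity scoring (IPS) over randomized logs like those described above; a sketch (my code), assuming each logged event also recorded its reward:

    def ips_estimate(logged, policy):
        # Reweight each logged reward by 1/prob whenever the candidate policy
        # would have chosen the same action as the logging policy did.
        total = 0.0
        for e in logged:
            if policy(e["context"]) == e["action"]:
                total += e["reward"] / e["prob"]
        return total / len(logged)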

Learning to Search

The underlying problem is structured prediction: jointly optimizing a complex loss function over interdependent predictions. There are multiple approaches to this problem:

  1. Optimize each piece independently.
  2. Multi-task learning
  3. Graphical models

Learning to search is another framework for this problem. The general idea is to treat structured prediction as a reinforcement learning problem,
where the true labels act as the best available policy (an oracle) that guides the learning.

Sunday, May 12, 2019

ZMQ Notes

ZMQ Important points

ZMQ is widely used as a reliable & easy-to-use replacement for raw sockets.

Sync & Async Style Sockets

In ZMQ, the sockets can be classified into two different categories:

  1. Sync: like REQ/REP
  2. Async: DEALER/ROUTER

Envelope

The concept of an envelope is used to identify the source of each message.
For a REQ socket:

  1. Send the message: REQ prepends an empty frame to the real message.
  2. Receive the message: REQ strips every frame up to and including the empty frame, returning only the payload.

For a REP socket:

  1. Receive the message: REP strips and saves all frames up to and including the empty frame (the envelope), then returns the rest of the message to the application.
  2. Send the message: REP prepends the saved envelope to the reply.

For a ROUTER socket:

  1. Receive the message: ROUTER automatically prepends an identity frame naming the peer the message came from.
  2. Send the message: ROUTER removes the first frame and uses it as the identity to route the message, checking that such a peer exists.

A DEALER socket does nothing special: it sends whatever frames you give it, verbatim.
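A small pyzmq demo of these envelope mechanics (the endpoint and payloads are arbitrary):

    import zmq

    ctx = zmq.Context()
    router = ctx.socket(zmq.ROUTER)
    router.bind("tcp://127.0.0.1:5555")
    req = ctx.socket(zmq.REQ)
    req.connect("tcp://127.0.0.1:5555")

    req.send(b"hello")                    # REQ prepends an empty frame for us
    identity, empty, payload = router.recv_multipart()

    # To reply, ROUTER needs the identity frame back on the front; it strips
    # that frame and uses it to pick the destination peer.
    router.send_multipart([identity, b"", b"world"])
    print(req.recv())                     # b"world" -- envelope already stripped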


Saturday, April 27, 2019

Petuum Bosen Design

Overview

Petuum is a system for distributed machine learning. Its underlying idea is the Parameter Server, which was proposed to handle model training with huge amounts of data and huge model sizes.

This kind of modeling is usually found in the online advertising business; only at that scale do we have billions of training examples and billions of model parameters. Because that capacity can't be handled by a single machine, a parameter server distributes the model parameters, and the parameter updating, across multiple machines.

For a distributed system, several problems always need to be addressed:

  1. Consistency handling

  2. Synchronization across multiple machines

Role Assignment

In a typical parameter server system (e.g., the one implemented by Mu Li), each machine is assigned a single role: Scheduler, Server, or Worker.

Usually, the scheduler organizes & coordinates the workers and servers: it issues the order to start training, controls the distribution of workload across workers, etc.

The worker role is usually responsible for computing sufficient statistics from the training data, requesting workload from the scheduler, sending sufficient statistics to the servers for model updating, and requesting fresh parameters from the servers.

The server role is usually responsible for parameter storage: receiving sufficient statistics from the workers, updating the parameters held on the local machine, and answering parameter requests from worker nodes.

In this design, the following steps are carried out:

  1. Start the scheduler, which waits for the servers and workers to register.

  2. Start the servers and workers, which connect to the scheduler to register. At the same time, the scheduler broadcasts each newly joined node to all existing nodes to share the global information.

  3. The scheduler node issues the command to start work.

  4. Each worker node requests workload from the scheduler, requests the parameters it needs from the servers, computes sufficient statistics from its local data, and sends those statistics to the servers.

  5. Each server node receives sufficient statistics from the workers, refreshes its local parameters using those statistics, and answers requests from the worker nodes.

  6. All training activity stops.

However, Bosen has no such explicit role assignment, which struck me as a weird design. How does each machine know its role and the global network information, i.e., all the machines involved in the job?

Is there really no scheduler node? Bosen adopts a very static design: each running node is given a specification file that lists all the IP address/port pairs involved in one parameter server run.

Even better, each IP address/port pair is assigned a unique number. This way, the network topology is known before the job even starts, without any scheduler node involved.
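As a purely illustrative example (I am not reproducing Bosen's exact file format here), such a specification file just enumerates unique IDs with their addresses:

    # one line per node: <node-id> <ip> <port>
    0 192.168.1.10 9999
    1 192.168.1.11 9999
    2 192.168.1.12 9999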

Of course, I don't think this is a very flexible design, but it solves the problem in a different way 🙂

But here is another question: given the list of available machines, how do we know a machine's role from its ID? The answer is even more astonishing! Each involved machine runs two kinds of threads: background threads and server threads.

This way, the worker role and the server role coexist on the same machine. It also makes threads the first-class citizens of the system: the worker/server roles are played by threads rather than by machines.

Even better, each machine can run multiple background/server threads serving the worker/server roles.

But is there really no scheduler node? Not quite. The machine with ID 0 runs a special thread, the name node thread. The name node plays the scheduler's role: every background thread and server thread registers with it and gets the associated topology from it.

So inside this framework, here is a summary of the thread types:

  1. Application thread: runs the business logic, using the parameters to do some computation.

  2. Background thread: maintains the local parameters, handles requests from the application threads, and communicates with the server threads.

  3. Server thread: holds the ground truth of the parameters, maintains global synchronization, and serves requests from the background threads.

This means application threads only ever talk to background threads.

Table Creation Flow

So how do you create a table inside the Bosen system? Here are the steps:

  1. Create a PSTableGroupOption; this struct carries the global information: the number of tables to be created and the number of communication channels in each client.

  2. Create a TableOption struct to configure each table specifically.

  3. Finish the table creation and start the application threads, which then access the tables directly.

There are several important points behind this flow:

  1. Two classes provide static methods for global access: PSTableGroup and GlobalContext. All the important requests are issued through static methods of the PSTableGroup class, and the GlobalContext class contains all the information shared across threads.

  2. Application threads can only use global interfaces like PSTableGroup; PSTableGroup forwards their messages to the relevant threads for processing.

What happens behind these API calls?

  1. The init thread (i.e., the main thread) issues a create-table request (containing the table's meta information) to a background thread.

  2. A background thread receives this request from the init thread.

  3. The first background thread then sends a create-table request to the name node.

  4. Upon receiving the create-table request, the name node asks all server threads to create the table with the specified meta information.

  5. Each server thread creates the table using the meta information, then sends back a response indicating that the table was created successfully.

  6. Once the name node has received success messages from all server threads, it sends a reply to each background thread to confirm the request.

  7. Each background thread creates a local version of the table, which stores a temporary copy of the parameters fetched from the servers.

This scenario shows the interaction pattern: application threads hold a local copy of the table and send requests to background threads; background threads send requests to the server threads and handle the replies from the server threads as well.

Synchronization of Table Change

Another key issue for a parameter server is synchronization. Two kinds of synchronization are needed:

  1. Synchronization of actions across different workers & servers.

  2. Synchronization of parameter update between worker and server.

In PS (by Mu Li), all synchronization is controlled by message timestamps. Each parameter table can have its own timestamp, and by controlling the dependencies between timestamps you control the synchronization.

I think the design in Bosen is similar. In Bosen, each application thread has its own timestamp, tracked in a vector clock that is accessible to all application threads.

Each time an application thread completes a round of changes, it advances its own timestamp. Once the global minimum timestamp advances, the background threads on that machine start synchronizing the accumulated changes to the servers.
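A toy version of the idea (my sketch, not Bosen's actual class):

    class VectorClock:
        # Each application thread owns one slot; the minimum across slots is
        # the globally completed timestamp that synchronization can rely on.
        def __init__(self, n_threads):
            self.clock = [0] * n_threads

        def tick(self, tid):
            self.clock[tid] += 1
            return min(self.clock)   # unchanged until the slowest thread ticks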

But how are changes pushed to the servers, and in what format?

The answer: updates are organized by the AbstractOpLog class. Each table/row has its own OpLog.
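A toy op-log sketch (again my own code, not Bosen's API): accumulate per-row deltas locally and ship them when the clock advances:

    class OpLog:
        def __init__(self):
            self.deltas = {}                       # row_id -> {col: delta}

        def inc(self, row, col, delta):
            row_log = self.deltas.setdefault(row, {})
            row_log[col] = row_log.get(col, 0.0) + delta

        def flush(self, send):
            send(self.deltas)                      # push accumulated updates
            self.deltas = {}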

Differences from Other Systems

In general, Bosen is a very interesting parameter server system. But there are other similar systems: Parameter Server by Mu Li (PS) and Multiverso by Microsoft.
What are the differences between these systems?
Bosen provides only indirect access to the local table. If an application thread wants to access a table, it can only get a handle to the underlying table and must request all changes through the static class interface. All the actual changes and sync messaging are controlled by the background threads, which are also responsible for communicating with the servers.
PS uses a different design, which exposes all the technical details to the table user. You get direct access to the table, but you need to understand the sync design, how to control it, and the dependencies between different timestamps. This gives much stronger control and is more flexible than the previous approach.
Multiverso sits somewhere between the two: you get table access, but you don't need to worry about sync too much. I think only synchronous algorithms are allowed?


Saturday, March 30, 2019

Distributed Machine Learning Approach

Distributed Machine Learning

Distributed machine learning systems are very popular across different areas & companies.
From my understanding, there are two different approaches to the problem:

  1. AllReduce (MPI)
  2. Parameter Server

All Reduce

For this approach, we split the task across multiple machines, then organize the machines into a binary tree (surprised, huh?).
When the task starts running, the partial results are first passed up from the leaf nodes to the root, then the aggregated result is passed back down from the root node to all the leaf nodes.
This approach is very simple, and the implementation is relatively easy as well.
But how do we organize the machines into a binary tree? The trick is to set up a dedicated server that sends each joining node the information about all the other nodes.
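Here is a toy, in-process simulation of the reduce/broadcast flow over a binary tree (heap-style indexing, no real networking; names are mine):

    def tree_allreduce(values):
        n = len(values)
        up = list(values)
        # Reduce phase: each node (heap layout) folds in its children,
        # bottom-up, so the root ends up with the global sum.
        for i in reversed(range(n)):
            for child in (2 * i + 1, 2 * i + 2):
                if child < n:
                    up[i] += up[child]
        # Broadcast phase: the root's aggregate flows back to every node.
        return [up[0]] * n

    print(tree_allreduce([1.0, 2.0, 3.0, 4.0]))    # [10.0, 10.0, 10.0, 10.0]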

Parameter Server

Parameter Server is used for large models that must be split across multiple machines.
There are multiple choices here:

  1. Sync computation
  2. Async computation

For a successful Parameter Server setup, we need three types of roles (a toy sketch follows the list):

  1. Scheduler: controls the timestamps between the servers & workers.
  2. Server: model storage. In rare cases the servers also need to do some computation, but mostly they just serve as storage.
  3. Worker: the computation unit. In most cases, the workers compute the gradient updates & send the results to the servers.
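As promised above, here is a toy synchronous parameter-server step (in-process, no networking; all names are mine):

    import numpy as np

    def server_step(params, worker_grads, lr=0.1):
        # Server role: aggregate the workers' gradients and apply the update
        # to the stored parameters.
        return params - lr * np.mean(worker_grads, axis=0)

    params = np.zeros(4)
    grads = [np.ones(4), 2 * np.ones(4)]           # computed by two workers
    print(server_step(params, grads))              # [-0.15 -0.15 -0.15 -0.15]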

That's my simple summary of distributed machine learning tips :)

Sunday, March 24, 2019

Interesting Design of VW

Design & Learning on VW

VW is a very popular machine learning tool, with many fancy features: feature hashing, online learning, and even support for distributed runs.
I've always been very interested in the internal details of this tool and wanted to learn more about it, so recently I started reading its source code.
This post is about the interesting designs I found in it.

Feature Representation

For every machine learning tool, the most important thing is the representation of the features it accepts.
In VW, features are still represented as <key, value> pairs.

IO & Data Parsing

To handle IO, VW uses a custom class to represent the open files.
One interesting thing about data parsing: the structure of an example line seems to follow an LL parser?

Feature Combination

VW has the concept of feature namespaces. I think this is an essential feature for large-scale machine learning, where we have multiple sources of features. One usage is building n-grams of features across different namespaces.
VW supports general interactions involving multiple namespaces, but the widely used options are just the quadratic & cubic feature combinations.
Another interesting point about VW: only 256 namespaces are available in total, presumably because a namespace is identified by a single byte (its first character) :)

Given the feature combinations, generating all the features offline and storing the combinations in a file would be huge. So the feature processing happens online, as each example is parsed.

Learner

One very interesting design: learners are composable.
A learner in VW is just a set of functions following the same interface.

    // Excerpted from VW's source (reformatted): a learner is plain data --
    // function pointers plus a pointer to the wrapped base learner.
    struct func_data
    {
      using fn = void (*)(void* data);

      void* data;          // learner-specific state, passed back into each call
      base_learner* base;  // the wrapped base learner, enabling composition
      fn func;
    };

    inline func_data tuple_dbf(void* data, base_learner* base, void (*func)(void*))
    {
      func_data foo;
      foo.data = data;
      foo.base = base;
      foo.func = func;
      return foo;
    }

    struct learn_data
    {
      using fn = void (*)(void* data, base_learner& base, void* ex);
      using multi_fn = void (*)(void* data, base_learner& base, void* ex, size_t count,
                                size_t step, polyprediction* pred, bool finalize_predictions);

      void* data;
      base_learner* base;
      fn learn_f;               // train on one example
      fn predict_f;             // predict without updating
      fn update_f;              // update the model for one example
      multi_fn multipredict_f;  // predict for a batch of examples at once
    };

VW uses a struct of function pointers to represent all the functionality of a learner, which is also a very interesting design.
So basically, there aren't many classes in VW.
In the learn_data struct you can find a base learner: complex learners can be composed from simple base learners. This is fascinating.
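To see why this composes so nicely, here is a toy Python analogue (not VW's real API): a learner is a bundle of functions, and a reduction wraps a base learner's functions with its own:

    def sgd_learner(lr=0.01):
        state = {"w": 0.0}
        def learn(x, y):
            state["w"] += lr * (y - state["w"] * x) * x   # squared-loss SGD
        def predict(x):
            return state["w"] * x
        return {"learn": learn, "predict": predict}

    def clipping_reduction(base, lo=-1.0, hi=1.0):
        # Keep the base learner's training, but clip its predictions.
        def predict(x):
            return min(hi, max(lo, base["predict"](x)))
        return {"learn": base["learn"], "predict": predict}

    learner = clipping_reduction(sgd_learner())
    learner["learn"](1.0, 2.0)
    print(learner["predict"](1.0))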

That’s all the learning!!!
