Sunday, May 12, 2019

ZMQ Notes

ZMQ Important points

ZMQ is widely used as a reliable & easy-to-use replacement for raw sockets

Sync & Async Style Sockets

In ZMQ, the sockets can be classified into two different categories:

  1. Sync: like REQ/REP
  2. Async: like DEALER/ROUTER

Envelope

The concept of envelope is used to identify the source of each message.
For a REQ socket:

  1. Sending a message: REQ prepends an empty delimiter frame to the real message.
  2. Receiving a message: REQ strips everything up to and including the empty delimiter frame and returns only the payload.

For a REP socket:

  1. Receiving a message: REP strips and saves the envelope (all frames up to and including the empty delimiter), then returns the rest of the message to the application.
  2. Sending a message: REP prepends the reply with the saved envelope.

For a ROUTER socket:

  1. Receiving a message: ROUTER automatically prepends the identity frame of the sending peer.
  2. Sending a message: ROUTER removes the identity frame, checks that a peer with that identity exists, and routes the message to it.

For a DEALER socket: nothing is added or removed; it sends whatever frames you give it.
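The framing rules above can be sketched as plain list operations on frames. This is an illustrative simulation, not the libzmq API; frames are modeled as lists of bytes objects and all function names are mine.

```python
def req_send(payload):
    """REQ prepends an empty delimiter frame before the payload."""
    return [b""] + payload

def rep_recv(frames):
    """REP saves the envelope (everything up to and including the empty
    delimiter) and hands only the payload to the application."""
    i = frames.index(b"")
    envelope, payload = frames[: i + 1], frames[i + 1:]
    return envelope, payload

def rep_send(envelope, reply):
    """REP prepends the saved envelope so the reply can be routed back."""
    return envelope + reply

def router_recv(identity, frames):
    """ROUTER prepends the identity frame of the sending peer."""
    return [identity] + frames

def router_send(frames):
    """ROUTER pops the identity frame and uses it to pick the peer."""
    return frames[0], frames[1:]

# A REQ -> ROUTER round trip:
wire = req_send([b"hello"])               # [b"", b"hello"]
at_router = router_recv(b"peer-1", wire)  # [b"peer-1", b"", b"hello"]
peer, rest = router_send(at_router)       # routed back to b"peer-1"
```

Tracing one message through these functions makes it clear why a DEALER talking to a ROUTER must add the empty delimiter itself: nobody else will.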

Written with StackEdit.

Saturday, April 27, 2019

Petuum Bosen Design

Overview

Petuum is a system for distributed machine learning. The underlying idea of this system is the Parameter Server, which was proposed to handle model training with a huge amount of data and a huge model size.

This kind of modeling is usually found in the online-advertising business. Only at that level do we have billions of training examples and billions of model parameters. Because that capacity can't be handled by a single machine, a parameter server distributes the model parameters and the parameter updates across multiple machines.

For a distributed system, there are several problems that always need to be addressed:

  1. Consistency handling

  2. Synchronization across multiple machines

Role Assignment

For a typical parameter server system (e.g., the one implemented by Mu Li), each machine is assigned a single role: Scheduler, Server, or Worker.

Usually, the scheduler organizes & coordinates workers and servers, issues the order to start training, controls the distribution of workload across workers, etc.

The worker role is usually responsible for computing sufficient statistics from the training data, requesting workload from the scheduler, sending sufficient statistics to servers for model updating, and requesting new parameters from servers.

The server role is usually responsible for parameter storage: receiving sufficient statistics from workers, updating parameters on the local machine, and answering parameter requests from worker nodes.

In this design, the following steps are carried out:

  1. Start the scheduler, which waits for the registration of servers and workers.

  2. Start the servers and workers, which connect to the scheduler to register. At the same time, the scheduler broadcasts each newly joined node to all existing nodes to share the global membership.

  3. The scheduler node issues the command to start work.

  4. Worker nodes request workload from the scheduler, request the needed parameters from servers, compute sufficient statistics from local data, and send those statistics to the servers.

  5. Server nodes receive sufficient statistics from worker nodes, refresh local parameters using the statistics, and answer requests from worker nodes.

  6. All training activity stops.
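The registration phase (steps 1 and 2) can be sketched as a toy scheduler. The class and method names are illustrative, not part of any real parameter server codebase.

```python
class Scheduler:
    def __init__(self):
        self.nodes = {}          # node_id -> role ("server" or "worker")

    def register(self, node_id, role):
        """A joining server/worker registers; the scheduler then shares
        the updated membership with everyone (the broadcast in step 2)."""
        self.nodes[node_id] = role
        return dict(self.nodes)  # global view sent back to all nodes

    def start_work(self):
        """Step 3: issue the start command once everyone has joined."""
        return [nid for nid, role in self.nodes.items() if role == "worker"]

sched = Scheduler()
sched.register(0, "server")
view = sched.register(1, "worker")   # view now contains both nodes
workers = sched.start_work()         # start command goes to node 1
```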

However, Bosen has no such explicit role assignment. This seemed like a weird design to me. How does each machine know its role, know the global network information, and know which machines are involved in the job?

There is no scheduler node? This is because Bosen adopts a very fixed design: for each running node, there is a specification file listing all the IP address/port pairs involved in one parameter server run.

Even better, each IP address/port pair is assigned a unique number. So without the involvement of a scheduler node, the network topology is known before the system starts.

Of course, I don't think this is a very flexible design, but it solves the problem in a different way 🙂

But there is another question: if we have the list of available machines, how do we know a machine's role from its ID? The answer is even more astonishing! On each involved machine, there are two kinds of threads: background threads and server threads.

In this way, the worker role and the server role exist on the same machine. Also in this way, the thread becomes the first-class citizen of this system: the worker/server role is played by a thread rather than by a machine.

Even better, each machine can have multiple background/server threads serving the worker/server roles.

But is there really no scheduler node? The answer is no. Inside the machine with ID 0, there is a special "Name Node" thread. The name node thread plays the role of the scheduler: all background threads and server threads register with it and get the associated topology from it.

So inside this framework, here is a summary of the thread types:

  1. Application thread: runs the business logic, using parameters for computation.

  2. Background thread: maintains the local parameters, answers requests from application threads, and communicates with server threads.

  3. Server thread: holds the ground truth of the parameters, maintains global synchronization, and serves requests from background threads.

This means application threads only ever talk to background threads.

Table Creation Flow

So how is a table created inside the Bosen system? Here are the steps:

  1. Create a PSTableGroupOption; you give the global information in this struct: the number of tables to be created and the number of communication channels in each client.

  2. Create a TableOption struct to configure each table specifically.

  3. Finish the table creation and start application threads to access the tables directly.

There are several important pieces of information behind these scenarios:

  1. Some classes provide static methods for global access: PSTableGroup and GlobalContext. All the important requests are issued through static methods of the PSTableGroup class, and the GlobalContext class contains all the information shared across different threads.

  2. Application threads can only access global interfaces like PSTableGroup; PSTableGroup forwards their messages to the related threads for processing.

What happens behind these API calls?

  1. The init thread (i.e., the main thread) issues a create-table request (containing the meta information about this table) to a background thread.

  2. The background thread receives this request from the init thread.

  3. The first background thread then sends a create-table request to the Name Node.

  4. Upon receiving a create-table request, the Name Node issues a request to all server threads to create the table with the specified meta information.

  5. Each server thread creates the table using the meta information, then sends back a response indicating that the table was created successfully.

  6. Once the Name Node has received success messages from all server threads, it sends a reply to each background thread to confirm the request.

  7. The background thread creates a local version of the table, which stores temporary parameters from the server.

From this scenario we learn the interaction mode: application threads hold a local copy of the table and send requests to background threads; background threads send requests to server threads and handle the replies from the server threads as well.
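The handshake above can be re-enacted with plain objects. This is a toy sketch of the message flow; the class names are mine, and the real Bosen code passes messages between threads over queues rather than making direct calls.

```python
class ServerThread:
    def __init__(self):
        self.tables = {}

    def create_table(self, table_id, meta):
        self.tables[table_id] = {"meta": meta, "rows": {}}
        return True                      # step 5: ack back to the Name Node

class NameNode:
    def __init__(self, servers):
        self.servers = servers

    def create_table(self, table_id, meta):
        # Step 4: fan the request out to every server thread;
        # step 6: reply to the background thread once all servers ack.
        return all(s.create_table(table_id, meta) for s in self.servers)

class BackgroundThread:
    def __init__(self, name_node):
        self.name_node = name_node
        self.local_tables = {}

    def create_table(self, table_id, meta):
        if self.name_node.create_table(table_id, meta):   # steps 3-6
            self.local_tables[table_id] = {}              # step 7: local copy
            return True
        return False

servers = [ServerThread(), ServerThread()]
bg = BackgroundThread(NameNode(servers))
ok = bg.create_table(0, {"rows": 1000})   # table 0 now exists everywhere
```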

Synchronization of Table Change

Another key issue for a parameter server is synchronization. Two kinds of synchronization are needed:

  1. Synchronization of actions across different workers & servers.

  2. Synchronization of parameter updates between workers and servers.

In PS (by Mu Li), all synchronization is controlled by a message timestamp. Even though each parameter table can have a different timestamp, once you set the timestamps up in a controlled way, you can control the sync.

I think the design is similar in Bosen. In Bosen, each application thread has its own timestamp, tracked in a vector_block that is accessible by all application threads.

Each time an application thread makes a change, it updates its corresponding timestamp. Once the global minimum timestamp changes, the workers in one machine start synchronizing with the server.

But how are changes pushed to the server, and in what format?

The answer: the updates are organized by the AbstractOpLog class. Each table/row has its own OpLog.
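A minimal sketch of the oplog idea, assuming updates are additive deltas buffered per row and flushed to the server in one batch; the RowOpLog name and its methods are mine, not the AbstractOpLog interface.

```python
from collections import defaultdict

class RowOpLog:
    def __init__(self):
        self.deltas = defaultdict(float)   # column -> accumulated delta

    def inc(self, col, delta):
        self.deltas[col] += delta          # merge updates locally

    def flush(self, server_row):
        """Apply the buffered deltas to the server copy, then clear."""
        for col, d in self.deltas.items():
            server_row[col] = server_row.get(col, 0.0) + d
        self.deltas.clear()

oplog = RowOpLog()
oplog.inc(3, 0.5)
oplog.inc(3, 0.25)          # merged with the previous update
server_row = {}
oplog.flush(server_row)     # one message carries the combined delta
```

Buffering like this is what makes bounded-staleness schemes cheap: many local updates collapse into one network message.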

Difference with Other System

In general, Bosen is a very interesting parameter server system. But there are other similar systems: Parameter Server by Mu Li (PS) and Multiverso by Microsoft.
What's the difference between these systems?
Bosen provides only indirect access to the local table. If an application thread wants to access a table, it can only get a pointer to the underlying table and must request all changes through a static class interface. All the relevant changes and sync messages are handled by the background thread, which is also responsible for communicating with the server.
PS uses a different design, which exposes all the technical details to the table user. You have direct access to the table; you need to know how the sync is designed, how to control it, and the dependencies between different timestamps. This gives much stronger control and is more flexible than the previous approach.
Multiverso is somewhere between those two: you get direct table access, but you don't need to worry too much about sync. I think only synchronous algorithms are allowed?

Written with StackEdit.

Saturday, March 30, 2019

Distributed Machine Learning Approach

Distributed Machine Learning

Distributed machine learning systems are very popular across different areas & companies.
From my understanding, there are 2 different approaches to this problem:

  1. AllReduce (MPI)
  2. Parameter Server

All Reduce

For this approach, we split the task across multiple machines, then organize the machines into a binary tree (surprised, huh?)
When the task starts running, the information is first passed from the leaf nodes up to the root, then the aggregated information is passed from the root node back down to all the leaf nodes.
This approach is very simple & its implementation is relatively easy as well.
But how do we organize the machines into a binary tree? The trick is to set up a dedicated server that sends back the information about all the other nodes.
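The up-then-down flow can be sketched for the reduce phase on a single machine. This is an illustrative recursion over an in-memory tree, not a networked implementation.

```python
def allreduce_sum(tree, values, node=0):
    """tree maps a node id to its children; values maps a node id to its
    local number. Returns the global sum that, in a real system, would
    then be broadcast back down from the root to every leaf."""
    total = values[node]
    for child in tree.get(node, []):
        # Reduce phase: each child's subtotal flows up to its parent.
        total += allreduce_sum(tree, values, child)
    return total

# A 7-node binary tree with node 0 as the root.
tree = {0: [1, 2], 1: [3, 4], 2: [5, 6]}
values = {i: i + 1 for i in range(7)}     # local values 1..7
total = allreduce_sum(tree, values)       # 1 + 2 + ... + 7 = 28
```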

Parameter Server

A Parameter Server is used for large models that must be split across multiple machines.
There are multiple choices here:

  1. Sync computation
  2. Async computation

For a successful setup of Parameter Server, we need 3 types of roles:

  1. Scheduler: controls the timestamps between servers & workers.
  2. Server: model storage; in rare cases, servers also need to do some computation, but mostly they just serve as storage.
  3. Worker: computation unit. In most cases, workers compute the gradient update & send the result to the server.

That’s a simple summary of distributed machine learning tips :)

Sunday, March 24, 2019

Interesting Design of VW

Design & Learning on VW

VW is a very popular machine learning tool, with many fancy features: feature hashing, online learning, and even support for running the application in a distributed fashion.
I’ve always been very interested in the internal details of this tool and wanted to learn more about it, so recently I started reading its source code.
This post is about the interesting designs in this tool.

Feature Representation

For every machine learning tool, the most important thing is the representation of the features it accepts.
In VW, a feature is still represented as a <key, value> pair.

IO & Data Parsing

To handle IO, VW uses a custom class to represent the open files.
One interesting thing about data parsing: the structure of a sample line follows an LL-parser?

Feature Combination

VW has the concept of feature namespaces. I think this is an essential feature for large-scale machine learning, when we have multiple sources of features. One of its uses is forming n-grams of features between different namespaces.
VW supports general interactions involving multiple namespaces, but the widely used options are just quadratic & cubic feature combinations.
Another interesting thing about VW: only 256 namespaces are available in total, presumably because a namespace is identified by a single byte :)

Considering there is feature combination, generating all the features offline and storing the combinations in a file would produce a huge file. So feature processing is done in an online fashion.

Learner

One very interesting design: learners are composable.
A learner in VW is just a set of functions following the same interface.

```cpp
struct func_data
{
  using fn = void (*)(void* data);

  void* data;
  base_learner* base;   // the learner this one wraps
  fn func;
};

inline func_data tuple_dbf(void* data, base_learner* base, void (*func)(void*))
{
  func_data foo;
  foo.data = data;
  foo.base = base;
  foo.func = func;
  return foo;
}

struct learn_data
{
  using fn = void (*)(void* data, base_learner& base, void* ex);
  using multi_fn = void (*)(void* data, base_learner& base, void* ex, size_t count,
      size_t step, polyprediction* pred, bool finalize_predictions);

  void* data;
  base_learner* base;   // the wrapped base learner
  fn learn_f;
  fn predict_f;
  fn update_f;
  multi_fn multipredict_f;
};
```

VW uses a struct of function pointers to represent all the functionality of a learner. This is also a very interesting design.
So basically, there are not many classes in VW.
In the learn_data struct, you can find a base learner. In this way, complex learners can be composed from basic simple learners. This is fascinating.
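The same composition idea can be sketched in Python: a learner is a record of functions plus a pointer to a base learner, so new behavior wraps old behavior. This is an analogue of the reduction-stack pattern, with names of my own choosing, not VW's actual code.

```python
class Learner:
    """A learner is just data plus functions, like VW's learn_data."""
    def __init__(self, predict, base=None):
        self.base = base          # the wrapped base learner, if any
        self._predict = predict   # function pointer analogue

    def predict(self, x):
        return self._predict(self, x)

# Base learner: a plain linear score.
base = Learner(lambda self, x: 2.0 * x)

# Reduction: clip the base learner's prediction into [0, 1],
# delegating the real work to self.base.
clipped = Learner(lambda self, x: min(1.0, max(0.0, self.base.predict(x))),
                  base=base)

y = clipped.predict(3.0)   # base gives 6.0, the reduction clips it to 1.0
```

Because each reduction only knows its immediate base, arbitrarily deep stacks (e.g., one-against-all on top of a binary learner) compose without any class hierarchy.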

That’s all the learning!!!

Written with StackEdit.

Monday, December 31, 2018

2018 Summary

With only a few hours left before 2019, I’m writing a summary to look back on this year.

Major Life Events

The past year was truly stormy.

Quitting My Job

I joined my first company in 2013 and finally left it on January 1, 2018. In my last year and a half there I was in a complete daze! I was unhappy with my manager, but I couldn’t find a way forward myself, and I was lost in all kinds of confusion. Thinking back on that time always fills me with emotion. Still, I learned the essence of the “workplace” from this: once you leave, you’re forgotten. I had no sense of direction in those years, but only when leaving the company did I realize that life still needs a direction you set for yourself.

Getting Married

At that time, my girlfriend happened to be yearning for life abroad, so she encouraged me to try opportunities overseas. I begged everywhere for referrals, and fortunately my current employer took me in. But my girlfriend still had to wait 6 months before she could graduate.
To make the move abroad go smoothly, my girlfriend and I rushed to get our marriage certificate and married in haste. Speaking of the marriage, it feels like I snatched her up before she could react. But I have no regrets; after all, I had been thinking about it for years!

Starting the New Job

After leaving, I rested at home for almost a month, though I wasn’t idle. First we held two wedding ceremonies, one at my family’s home and one at my wife’s. Then I prepared for the IELTS exam, crossed the passing line with an average score of 6, and successfully started the visa process. I got my work visa just 4 days before my start date and began life at the new company.
I do have to “praise” myself: wherever I go, the stock falls! When I joined the new company, its stock was at a historic high; within two months, bad news kept coming. Meanwhile, my former employer’s stock soared.
One more piece of bad news: I’m also a Party member.

Reunion Abroad

A few months after starting, the only thing I cared about was: when would my wife arrive? Because of my own procrastination, she had to wait at home for 3 months before joining me. I need to reflect seriously on this; I still don’t think things through thoroughly enough. Come to think of it, I have similar problems at work, and they need fixing.

My Wife’s Pregnancy

After 6 months apart, we were finally reunited in September! A brief separation beats a honeymoon, and my wife got pregnant!!! Completely unplanned!!! The original plan was to enjoy ourselves for two years first, but she got pregnant right then.
It felt like our whole rhythm of life was thrown off. My wife would vomit once every morning, plus once more at a completely random time of day. She had no interest in food at all, getting by each day on a little fruit and a bit of rice.
Living in the UK, there was nothing good to eat around us either. The food in Chinatown left no impression, and my wife had no appetite for any of it. My cooking is even worse, so the challenge of feeding her was too great. In the end she had to go back to China. Looking back, what was it all for?

Outlook

The start of a new year should come with good wishes. I hope that in the new year, mother and child stay safe and reunite with me soon. I hope to take another step forward in my career. I hope both our parents stay healthy and life goes smoothly for them.
Besides these life goals, I should also hold myself to higher standards.

Complete, Big-Picture Thinking

Over the past year I changed jobs, a transition after my career hit an obstacle. It made me realize a big blind spot in the way I think: I tend to consider things from a single, one-sided angle, and my understanding always stays at one particular stage, never taking the whole picture and the whole process into account.

Clear and Accurate Communication

I failed to communicate promptly and accurately with colleagues and managers about work, which led to repeated problems. This also needs improvement.

Focused Effort

Both at work and in my spare time, I always have the urge to bloom everywhere at once. But that mindset is flawed, because a person’s energy is limited. At work I need to grab the key problems and solve them with full force. In spare-time learning, I should likewise break through at a single point first rather than trying to learn every detail across the board.

Written with StackEdit.
December 31, 2018

Wednesday, December 26, 2018

Parameter Server Architecture

General Introduction

Parameter servers are widely used in large-scale machine learning systems. The general idea of PS is to distribute the parameters across multiple machines to handle the extra size of the data & parameters.
With multiple parameter servers available, there are also multiple worker nodes available to do the related computation & reduce the time required to finish model training.
But there are several problems when designing & implementing a PS system:

  1. communication across multiple machines
  2. synchronization between multiple machines.
  3. Storage system for parameters.

If these 3 problems are handled, then the general design is nailed.
In the following, I will introduce the parameter server designed by Mu Li. In my personal opinion, this is a well-designed system with very beautiful engineering.

Communication

In this system, communication is handled through ZMQ, so the implementation complexity is absorbed by that library.

Synchronization

Letting multiple machines synchronize is a very difficult problem. The key point is message design: each message has a timestamp, and for every pair of communicating machines, the timestamp uniquely identifies the message.
There is a design in the system where each worker node can wait for a specific message identified by its timestamp. As long as all machines wait on the same timestamp, all machines can act on the same timeline.
Since a timestamp only works between 2 nodes, how do we handle broadcast situations? The solution is to build multiple p2p connections between the nodes.
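The wait-on-timestamp idea can be sketched with a small tracker, assuming consecutive integer timestamps between one pair of nodes; the class and method names are mine, not the PS API.

```python
class Tracker:
    """Tracks which timestamps have been acknowledged by the peer."""
    def __init__(self):
        self.acked = set()

    def ack(self, ts):
        self.acked.add(ts)

    def ready(self, ts):
        """True once every timestamp up to ts has been acknowledged,
        i.e. a blocking wait_for(ts) call would return."""
        return all(t in self.acked for t in range(ts + 1))

t = Tracker()
t.ack(0)
t.ack(2)
done_0 = t.ready(0)   # True: timestamp 0 is complete
done_2 = t.ready(2)   # False: timestamp 1 is still outstanding
```

Blocking on `ready(ts)` for the same ts on every node is what lines all the machines up on one timeline.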

Storage System

Actually the storage is just a hash_map. Amazingly easy!!! :)

Thursday, August 18, 2016

Scope Rule for Identifier

Scope Rules

In each programming language, an identifier is defined by some specific rules. Each expression of a program involves different identifiers, so a simple question arises: what do these identifiers refer to in the expression? This is called name resolution, and the specific rules are defined by each programming language itself.
For name resolution, the compiler must know the name binding, from identifier to entity. The scope of a name binding is the part of the program text in which the binding is valid. At different locations in the program text, the name binding can be different.

Scope Rule

Generally speaking, the scope of an identifier is the region of program text in which the entity can be accessed through the identifier. So scope is a property of the identifier. We can also speak of the name context, which is the union of the scopes of all identifiers.

Scope Level

Depending on the level of definition, one gets function scope, module scope, and so on.
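A small Python illustration of these levels: the same identifier is bound to different entities at module scope and function scope, and name resolution picks the innermost binding visible at that point in the program text.

```python
x = "module"          # module-scope binding of the identifier x

def outer():
    x = "function"    # a new binding that shadows the module-scope x
    def inner():
        return x      # resolves to the enclosing function's binding
    return inner()

resolved_inside = outer()   # name resolution inside outer(): "function"
resolved_outside = x        # name resolution at module level: "module"
```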

Written with StackEdit.