Wednesday, December 26, 2018

Parameter Server Architecture

General Introduction

A parameter server (PS) is widely used to build large-scale machine learning systems. The general idea of a PS is to distribute the parameters across multiple machines, to handle data and parameter sizes that exceed a single machine's capacity.
Alongside the multiple parameter servers, multiple worker nodes carry out the related computation, reducing the time required to train the model.
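
To make the idea of distributing parameters concrete, here is a minimal sketch of range-based key partitioning, one common way to decide which server owns which parameter. The constants and the ServerForKey helper are hypothetical, chosen for illustration rather than taken from any particular implementation.

```cpp
#include <cstdint>
#include <iostream>

// Hypothetical sizing: a 32-bit key space split evenly across 4 servers.
constexpr uint64_t kMaxKey = 1ULL << 32;
constexpr int kNumServers = 4;

// Range partitioning: each server owns one contiguous slice of the key space.
int ServerForKey(uint64_t key) {
  return static_cast<int>(key / (kMaxKey / kNumServers));
}

int main() {
  std::cout << "key 7          -> server " << ServerForKey(7) << "\n";
  std::cout << "key 3000000000 -> server " << ServerForKey(3000000000ULL) << "\n";
}
```
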
But there are several problems to solve when designing and implementing a PS system:

  1. Communication across multiple machines.
  2. Synchronization between multiple machines.
  3. A storage system for the parameters.

If these three problems are handled, the general design is essentially nailed down.
In the following, I will introduce the parameter server designed by Mu Li. In my opinion, it is a well-designed system with very elegant engineering.

Communication

In this system, communication is handled through ZMQ, so most of the low-level implementation complexity is delegated to that library.
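
As a minimal sketch of the kind of messaging ZMQ takes care of, the example below wires up a request/reply pair using ZMQ's plain C API (link with -lzmq). It uses the in-process transport so it runs as a single program; this is only an illustration of the library, not the actual transport code of Mu Li's system.

```cpp
#include <zmq.h>
#include <cstdio>

int main() {
  void* ctx = zmq_ctx_new();

  // "Server" side: bind a reply socket.
  void* rep = zmq_socket(ctx, ZMQ_REP);
  zmq_bind(rep, "inproc://node");

  // "Worker" side: connect a request socket and send a message.
  void* req = zmq_socket(ctx, ZMQ_REQ);
  zmq_connect(req, "inproc://node");
  zmq_send(req, "push", 4, 0);

  // Server receives the request and replies with an ack.
  char buf[16];
  int n = zmq_recv(rep, buf, sizeof(buf) - 1, 0);
  if (n >= 0) buf[n] = '\0';
  std::printf("server got: %s\n", buf);
  zmq_send(rep, "ack", 3, 0);

  // Worker receives the ack.
  n = zmq_recv(req, buf, sizeof(buf) - 1, 0);
  if (n >= 0) buf[n] = '\0';
  std::printf("worker got: %s\n", buf);

  zmq_close(req);
  zmq_close(rep);
  zmq_ctx_destroy(ctx);
}
```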

Synchronization

Getting multiple machines to synchronize is a very difficult problem. The key point is the message design: each message carries a timestamp, and for every pair of communicating machines the timestamp uniquely identifies a message.
The system builds a mechanism on top of this: each worker node can wait for a specific message identified by its timestamp. As long as all the machines wait on the same timestamp, they act on the same timeline.
Since a timestamp is only meaningful between two nodes, how are broadcast situations handled? The solution is to build multiple point-to-point (p2p) connections between the nodes.
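
Below is a minimal sketch of such a wait-by-timestamp mechanism, assuming each node keeps, per peer, the highest timestamp acknowledged so far. The Tracker class and its Wait/Notify methods are hypothetical names, not the system's actual API.

```cpp
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>

// Tracks, for one peer, the highest timestamp acknowledged so far.
class Tracker {
 public:
  // Block until the message with this timestamp has been acknowledged.
  void Wait(int timestamp) {
    std::unique_lock<std::mutex> lk(mu_);
    cv_.wait(lk, [&] { return acked_ >= timestamp; });
  }

  // Called when an ack for `timestamp` arrives from the peer.
  void Notify(int timestamp) {
    {
      std::lock_guard<std::mutex> lk(mu_);
      if (timestamp > acked_) acked_ = timestamp;
    }
    cv_.notify_all();
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  int acked_ = -1;
};

int main() {
  Tracker t;
  // Simulate a peer acknowledging timestamps 0 and 1.
  std::thread peer([&] { t.Notify(0); t.Notify(1); });
  t.Wait(1);  // Returns once timestamp 1 has been acknowledged.
  std::printf("synchronized at timestamp 1\n");
  peer.join();
}
```

Because each tracker covers only one pair of nodes, a broadcast amounts to performing the same send-and-wait over every p2p connection.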

Storage System

Actually, the storage is just a hash map (hash_map). Amazingly easy! :)
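
Here is a minimal sketch of what that means on the server side, assuming parameters are keyed by integer IDs. hash_map is the old pre-standard container, so std::unordered_map stands in for it here, and the push/pull semantics shown are illustrative.

```cpp
#include <cstdint>
#include <iostream>
#include <unordered_map>

int main() {
  // The server's parameter store: key -> value.
  std::unordered_map<uint64_t, float> store;

  // A "push" from a worker merges its update into the stored entry.
  store[42] += 0.5f;
  store[42] += 0.25f;

  // A "pull" simply reads the current value back.
  std::cout << "key 42 -> " << store[42] << "\n";  // prints 0.75
}
```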
