Distributed Machine Learning
Distributed machine learning systems are widely used across many fields and companies.
From my understanding, there are two main approaches to the problem:
- AllReduce (MPI)
- Parameter Server
AllReduce
For this approach, we split the task across multiple machines, then organize the machines into a binary tree (surprised, huh?).
When the task starts running, the information is first passed from the leaf nodes up to the root, then the aggregated information is passed from the root back down to all the leaf nodes.
This approach is very simple, and the implementation is relatively easy as well.
But how do we organize the machines into a binary tree? The trick is to set up a dedicated server that sends each node the information about all the other nodes.
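To make the two phases concrete, here is a minimal single-process sketch of a sum AllReduce over a binary tree. It is only a simulation, not real MPI code: the function name `tree_allreduce` and the heap-style layout (node `i` has children `2*i+1` and `2*i+2`) are my own assumptions for illustration, and every node, not just the leaves, contributes a value.

```python
def tree_allreduce(values):
    """Simulate a sum-AllReduce over a binary tree with len(values) nodes.

    Assumption for this sketch: nodes use a heap layout, so node i has
    children 2*i+1 and 2*i+2, and node 0 is the root.
    """
    n = len(values)
    if n == 0:
        return []

    partial = list(values)

    # Reduce phase: walk from the last node up towards the root. Children
    # always have a larger index than their parent, so each node's subtree
    # is fully summed before the node is folded into its own parent.
    for i in range(n - 1, 0, -1):
        parent = (i - 1) // 2
        partial[parent] += partial[i]

    # Broadcast phase: the root now holds the total, push it back down.
    result = [None] * n
    result[0] = partial[0]
    for i in range(1, n):
        result[i] = result[(i - 1) // 2]
    return result


if __name__ == "__main__":
    # Five "machines", each holding one local value (e.g. a gradient).
    print(tree_allreduce([1, 2, 3, 4, 5]))
```

After the call, every node holds the same aggregated value, which is the whole point of AllReduce. In a real cluster each loop iteration would be a network message between a child and its parent.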
Parameter Server
A Parameter Server is used for large models that have to be split across multiple machines.
There are two choices for how the computation runs:
- Sync computation
- Async computation
For a successful Parameter Server setup, we need 3 types of roles:
- Scheduler: controls the timestamps between servers and workers.
- Server: stores the model. In rare cases the server also needs to do some computation, but mostly it just serves as storage.
- Worker: the computation unit. In most cases, a worker finishes the gradient update and sends the result to the server.
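The three roles can be sketched in a few lines of plain Python. This is a toy, single-process illustration under my own assumptions, not how any particular framework does it: the names `Server`, `Worker`, `train_sync`, `pull` and `push` are hypothetical, the model is a one-parameter fit of `y = w * x`, and the scheduler is reduced to a loop that runs synchronous rounds instead of tracking timestamps.

```python
class Server:
    """Server role: stores the model and applies updates sent by workers."""

    def __init__(self, dim, lr=0.1):
        self.weights = [0.0] * dim
        self.lr = lr

    def pull(self):
        # Workers fetch a copy of the current parameters.
        return list(self.weights)

    def push(self, grad):
        # Plain gradient descent step on the stored parameters.
        self.weights = [w - self.lr * g for w, g in zip(self.weights, grad)]


class Worker:
    """Worker role: computes a gradient on its own shard of the data."""

    def __init__(self, shard):
        self.shard = shard  # list of (x, y) pairs for the model y = w * x

    def gradient(self, weights):
        w = weights[0]
        n = len(self.shard)
        # Gradient of the mean squared error (w*x - y)^2 with respect to w.
        return [sum(2 * (w * x - y) * x for x, y in self.shard) / n]


def train_sync(server, workers, steps):
    """Scheduler role (simplified): run one synchronous round per step."""
    for _ in range(steps):
        weights = server.pull()
        # Sync mode: wait for every worker before touching the model.
        grads = [wk.gradient(weights) for wk in workers]
        avg = [sum(g[i] for g in grads) / len(grads) for i in range(len(weights))]
        server.push(avg)
    return server.pull()


if __name__ == "__main__":
    server = Server(dim=1)
    workers = [Worker([(1.0, 2.0), (2.0, 4.0)]), Worker([(3.0, 6.0)])]
    print(train_sync(server, workers, steps=50))  # weight should approach 2.0
```

This shows the sync variant, where the model is updated once per round after all workers report. In the async variant each worker would pull and push on its own schedule, without waiting for the others.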
That's my quick summary of distributed machine learning :)