Distributed Machine Learning

Running the project

To compile the Thrift code, run make, which generates the Thrift files in gen-py/.

Ensure that the project directory sits in the same directory as the thrift directory; otherwise, change the path in the .py files.

To run the server with one compute node, enter these three commands in separate terminals on the same computer:

python3 compute_node.py <port> <load_probability>

python3 coordinator_node.py <port> <scheduling_policy>

python3 client.py <ip_of_coordinator_node> <port_of_coordinator_node>

Example:

python3 compute_node.py 5106 0.3

python3 coordinator_node.py 5105 2

python3 client.py 127.0.0.1 5105

Here, we're running a compute node with a load probability of 30%. On the same computer, we're running the coordinator node with a scheduling policy of 2. The coordinator node process then reads the compute_nodes.txt file to find our compute nodes, either on localhost or on a different network. We then connect our client to the coordinator to initialize our distributed machine learning algorithm.

If you wish to run on different computers, edit compute_nodes.txt so that each line has the format: <ip>,<port>

Populating Compute Nodes

You can populate compute_nodes.txt with IP addresses and port numbers in CSV format (e.g. localhost,9091) so that the coordinator node can connect to the nodes that are active. If a compute node is not active, it is ignored when the coordinator initializes. If the coordinator fails to get a response back from a compute node, it skips that node's results during aggregation.
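To make the file format concrete, here is a minimal sketch of how lines of the form <ip>,<port> could be read into (host, port) pairs. The function name parse_compute_nodes is hypothetical and is not taken from coordinator_node.py; the actual parsing in the project may differ.

```python
from typing import List, Tuple


def parse_compute_nodes(text: str) -> List[Tuple[str, int]]:
    """Parse lines of the form <ip>,<port> into (host, port) pairs.

    Hypothetical helper for illustration only; not the project's own code.
    """
    nodes = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        host, _, port = line.partition(",")
        nodes.append((host.strip(), int(port)))
    return nodes


if __name__ == "__main__":
    sample = "localhost,9091\n127.0.0.1,5106\n"
    print(parse_compute_nodes(sample))
```

In practice the text would come from reading compute_nodes.txt, and each pair would then be probed to see whether the node is active.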

Client

client.py and client_thrift.thrift create a connection to the coordinator and initialize it. The client then waits to receive a response from the coordinator about the aggregated and trained data.

Coordinator Node

coordinator_node.py and coordinator_node_thrift.thrift act as both a client and a server: the coordinator is a server for our client and a client for our compute nodes. It first sends a heartbeat to find active nodes and stores a connection to each one. If a node is inactive, the coordinator occasionally sends heartbeats to see if it is alive and, if so, sends it training data. If all compute nodes go down, it enters a retry loop for 3 attempts before aborting. It uses threads to poll for inactive compute nodes and to check whether each node has finished training. Once training is done, it aggregates the data and sends the accuracy back to our client.
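The heartbeat sweep and the 3-attempt retry loop described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in coordinator_node.py: the names find_active_nodes and wait_for_nodes are hypothetical, and ping is a placeholder callable standing in for whatever Thrift heartbeat call the project defines.

```python
import time
from typing import Callable, Iterable, List, Tuple

Node = Tuple[str, int]


def find_active_nodes(nodes: Iterable[Node], ping: Callable[[Node], bool]) -> List[Node]:
    """Return the nodes that answer a heartbeat; errors count as inactive."""
    active = []
    for node in nodes:
        try:
            if ping(node):
                active.append(node)
        except Exception:
            # An unreachable node is treated as inactive rather than fatal.
            pass
    return active


def wait_for_nodes(nodes: Iterable[Node], ping: Callable[[Node], bool],
                   attempts: int = 3, delay: float = 0.0) -> List[Node]:
    """Repeat the heartbeat sweep up to `attempts` times before giving up."""
    nodes = list(nodes)
    for _ in range(attempts):
        active = find_active_nodes(nodes, ping)
        if active:
            return active
        time.sleep(delay)  # pause between sweeps (assumed; 0 by default)
    raise RuntimeError("no compute nodes responded after %d attempts" % attempts)
```

Passing ping as a parameter keeps the sketch independent of the RPC layer, so the retry logic can be exercised without running any compute nodes.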

Compute Node

compute_node.py and compute_node_thrift.thrift are our compute nodes. Each node trains a local ML model after receiving a task from the coordinator node, then sends the gradients back for aggregation.
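As one possible picture of the aggregation step, the sketch below averages gradient vectors element-wise. Plain averaging is an assumption made for illustration; the README does not say which aggregation rule the coordinator uses, and average_gradients is a hypothetical name.

```python
from typing import List, Sequence


def average_gradients(gradients: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of equally sized gradient vectors.

    Assumes simple averaging; the project's real aggregation may differ.
    """
    if not gradients:
        raise ValueError("no gradients to aggregate")
    length = len(gradients[0])
    if any(len(g) != length for g in gradients):
        raise ValueError("gradient vectors must have the same length")
    count = len(gradients)
    # zip(*gradients) groups the i-th component from every node together.
    return [sum(column) / count for column in zip(*gradients)]


if __name__ == "__main__":
    print(average_gradients([[1.0, 2.0], [3.0, 4.0]]))
```

A node that failed to respond would simply be left out of the list passed in, which matches the skip behavior described under Populating Compute Nodes.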

What I've Learned

I've learned how Interface Definition Languages (IDLs) connect computers and allow programs written in different languages, such as Python and C++, to communicate using RPC. Whereas REST APIs use a consistent format of JSON or XML over HTTP, RPC allows us to run code on another computer as if it were local. I've also learned how RPC can be used to distribute computing, how to manage scheduling and load balancing in distributed systems, and how message queues allow for asynchronous distributed computing. Beyond that, I've learned to use retry loops, heartbeats, and other techniques to achieve fault tolerance.
