In order to compile thrift code, use the command make which will generate thrift files in gen-py/*
Ensure that the project directory is on the same directory as the thrift directory or otherwise change the path in the .py files.
To run the server with one compute node enter these three commands in different terminals on the same computer:
python3 compute_node.py <port> <load_probability>
python3 coordinator_node.py <port> <scheduling_policy>
python3 client.py <ip_of_coordinator_node> <port_of_coordinator_node>
Example:
python3 compute_node.py 5106 0.3
python3 coordinator_node.py 5105 2
python3 client.py 127.0.0.1 5105
Here, we're running a compute node with load probability of 30%. On the same computer, we're running the coordinator node with a scheduling policy of 2. The coordinator node process then uses the compute_nodes.txt file to find our compute nodes either on localhost or on a different network. We then connect our client to the coordinator to initialize our distributed machine learning algorithm.
If you wish to run on different computers, change the compute_nodes.txt in the format:
<ip>,<port>
You can populate the compute_nodes.txt with IP addresses and port numbers in a CSV format (e.g. localhost,9091) to allow the coordinator node to connect to those which are active. If a compute node is not active, it will be ignored on coordinator initialization. If it fails to get a response back, it will skip the compute node's aggregation of result data.
client.py and client_thrift.thrift will create a connected to the coordinator and initialize it. It will then wait to receive a response from the coordinator about the aggregated and trained data.
coordinator_node.py and coordinator_node_thrift.thrift acts both as a client and server. It is a server for our client and a client for our compute nodes. It will first send a heartbeat to find active nodes and create a stored connection to each one. If a node is inactive, it will occassionally send heartbeats to see if it is alive and if so, send training data. If all compute nodes go down, it will enter a retry loop for 3 attempts before aborting. It utilizes threads to poll for inactive compute nodes and whether the node is finished training. Once done, it will aggregate the data and send the accuracy back to our client.
compute_node.py and compute_node_thrift.thrift are our compute nodes. These nodes will train a local ML model after receiving a task from the coordinator node. Once received, it will train then send back the gradients for aggregation.
I've learned how Interface Definition Languages (IDLs) work to connect computers and allow different languages to communicate such as Python and C++ using RPC. Whereas with REST APIs which use a consistent format of JSON or XML over HTTP, RPC allows us to run code from another computer as if it was local. I've also learned how RPC can be utilized to distribute computing, managing scheduling and load balancing of distributed systems, as well as message queues to allow for asynchronous distributed computing. Not only that, utilizing retry loops, heartbeats, and other things to learn fault-tolerance.