
An implementation of a distributed scheduler using zookeeper


irapuan/distributed_scheduler


Distributed Scheduler

This project was created as part of my studies on distributed systems. I had a lot of fun building this solution; unfortunately, it required a lot of time, something I don't have much of to spare. Because of that, I intentionally left some items behind. They are listed at the bottom of this README file, so please check that list while evaluating this solution.

Points I found really nice and worth a double check

  • Usage of leader election to enable resilience.
  • Usage of retries in axios requests to add resilience.
  • An architecture that enables scaling every component as needed.
  • Use of Docker for all services, which would later enable deployment of the entire solution to Kubernetes.

Overall architecture

Architecture of this system

This system comprises multiple services, and the idea is to provide an architecture that can handle many timers simultaneously. The architecture is divided into the following components:

  • An API that receives requests from the users and saves them into a repository. For this project, I chose MongoDB.
  • A Scheduler that, every second, pulls all timers that are about to go off and sends them to be processed by workers.
  • Workers are small background jobs that take the timers that need to be fired and process them individually. The scheduler and workers communicate via a distributed queue; for this project, I selected RabbitMQ.
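The per-second pull described above boils down to selecting the timers whose fire time has passed but that have not been dispatched yet. The sketch below illustrates that selection as a pure function; the `Timer` shape and its field names are assumptions for illustration, not the project's actual schema.

```typescript
// Hypothetical shape of a timer record (field names are illustrative).
interface Timer {
  id: string;
  url: string;        // endpoint the worker will call when the timer fires
  fireAt: number;     // epoch milliseconds when the timer should go off
  dispatched: boolean; // true once the scheduler has queued it
}

// Select timers that are due at `now` and not yet handed to the queue.
function selectDueTimers(timers: Timer[], now: number): Timer[] {
  return timers.filter(t => !t.dispatched && t.fireAt <= now);
}
```

In the real system this filter would be a MongoDB query run by the scheduler loop, but the decision logic is the same.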

The diagram below shows how a timer is created and how it is processed by the system.

sequenceDiagram
    User->>+API: POST /timers
    API->>+MongoDB: Save
    Scheduler->>+MongoDB: Every second pull timers
    Scheduler->>+RabbitMQ: Insert into the queue
    Worker->>+RabbitMQ: Process item

Dependency and resilience analysis

Resilience analysis

Key architectural decisions:

Use of Zookeeper to handle leader election

The scheduler needs to process timers every second. To make this process resilient, I decided to use a cluster of schedulers, so that in case the main scheduler dies, another node in the cluster can take over. An alternative approach would be a more straightforward system that could automatically restart the main process after each crash. Below is an evaluation of the differences and the reasoning for choosing one over the other.

The advantage of the clustered approach is that it provides a much more resilient architecture: since each cluster node can run on a different machine, the system keeps operating even if one machine breaks down. The disadvantage is that the architecture becomes more complex and harder to develop and maintain. The opposite applies to the alternative: a solution that heals itself on the same machine would be much simpler, but less resilient if the entire host machine crashes. On this same topic, we could also implement the leader election mechanism directly in the scheduler software, but that would require a lot of work and more time than I had for this system. Therefore, I decided to use an off-the-shelf solution, and Zookeeper is an excellent open-source alternative.
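In ZooKeeper-style leader election, each scheduler instance typically creates an ephemeral sequential znode, and the instance holding the lowest sequence number acts as the leader; when it dies, its znode disappears and the next-lowest takes over. The sketch below shows only that core decision; the function and node names are illustrative and not taken from the project's `ClusterManagement.ts`.

```typescript
// Decide whether this scheduler instance is the leader, given the znode it
// created and the full list of candidate znodes. Znode names end in a
// zero-padded sequence number, e.g. "node_0000000003".
function isLeader(myNode: string, allNodes: string[]): boolean {
  const seq = (name: string) =>
    parseInt(name.slice(name.lastIndexOf("_") + 1), 10);
  // The candidate with the lowest sequence number is the leader.
  const lowest = allNodes.reduce((a, b) => (seq(a) <= seq(b) ? a : b));
  return lowest === myNode;
}
```

The ephemeral property of the znodes is what makes the scheme self-healing: a crashed scheduler's znode is removed by ZooKeeper automatically, so no manual cleanup is needed before a new leader can be chosen.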

Use of a NoSQL database

Here it was more a question of personal preference; modern relational databases can also scale horizontally as needed. But the simple architecture of MongoDB and the simple, flat format of the timer record made it clear that Mongo would be a good fit for this project.
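To make the "simple, flat format" point concrete, a timer record can be a single small self-contained document with no relations to join. The object below is a hypothetical example; the project's actual schema and field names may differ.

```typescript
// A hypothetical timer document as it might be stored in MongoDB.
// One document per timer; no joins or foreign keys needed.
const exampleTimer = {
  _id: "64f0c2a1e4b0a1b2c3d4e5f6",          // illustrative ObjectId string
  url: "http://localhost:8080/callback",     // endpoint the worker will call
  fireAt: new Date("2030-01-01T00:00:00Z"),  // when the timer should go off
  dispatched: false,                         // set once the scheduler queues it
};
```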

Scheduler sending the information to workers instead of processing the timers itself

This decision avoids long-running requests blocking the Scheduler from continuing its work of polling for new timers every second. To illustrate this point, think of an HTTP request that takes too long to respond; now imagine multiple timers pointing at a slow host. Processing all of them would take some time, blocking the Scheduler and making the entire system less responsive. Adding workers to process the timers frees the Scheduler to do one thing: poll new timers and dispatch them to be processed.
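Since workers are the ones calling slow or flaky hosts, this is also where the retries mentioned earlier pay off. The helper below is an illustrative sketch of such a retry wrapper; the name and parameters are assumptions, not the project's actual code, which applies retries to its axios requests.

```typescript
// Retry a failing async call up to `attempts` times, waiting `delayMs`
// between attempts. Rethrows the last error if every attempt fails.
async function withRetries<T>(
  fn: () => Promise<T>,
  attempts: number,
  delayMs: number
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (i < attempts - 1) {
        await new Promise(res => setTimeout(res, delayMs));
      }
    }
  }
  throw lastError;
}
```

Wrapping the worker's HTTP call this way keeps one unreachable host from failing a timer outright, while the queue keeps the Scheduler insulated from the delay.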

How to build the system

npm run build

How to run the system

The first step is to create a .env file with some environment variables. The easiest way is to copy .env.sample, which contains all the variables needed to run the services on localhost.

cp .env.sample .env

Manually (good for debugging)

Start docker-compose from the root folder, only for the static services:

npm run static_services

Run the API.

node build/api/app.js

Run the scheduler.

node build/scheduler/app.js

Run the worker.

node build/worker/app.js

Go to the swagger page at: http://localhost:3000/docs

Run it all at once

This method runs all static and dynamic dependencies:

npm run up

Go to the swagger page at: http://localhost:3000/docs

Run performance tests.

For this test to work, Apache ab must be installed. Also, change the IP address in the file tests/ab_payload.json to reflect your own IP address.

npm run up
npm run performance_test

Run the end-to-end test.

Before running it, change the IP address on line 30 of the test file; there is an explanation in the test file.

npm run up
npm run e2e

Open tasks.

  • Create worker app.
  • Add rabbitmq integration.
  • Add performance tests.
  • Change services to load configuration from the environment.
  • Refactor ClusterManagement.ts: remove dependencies to make it easier to test, and split the responsibilities of managing a cluster and electing a leader.
  • Refactor QueueClient to remove dependencies and make it testable.
  • Add multi-stage build to the docker images to reduce its size.
  • (Partially) Add unit tests for all services.
  • Handle security, including in-transit security, at-rest security, and the Docker images.
  • Create K8s deployment yaml.
  • Add healthcheck endpoints to all services.
  • Add observability functionality.
  • (Partially) Add inter-communication resiliency, aka retries and circuit breakers.
  • Add ESLint to the project.

Known issues.

  • When a worker points to a host that it cannot reach, it can sometimes crash.
