A distributed task queue with Raft consensus, built from scratch in Go.
Think: mini-SQS + mini-Celery, with the hard parts implemented.
- Raft Consensus - From-scratch implementation with leader election, log replication, and snapshots
- At-Least-Once Delivery - Tasks are never lost; may be processed multiple times
- Visibility Timeout - SQS-style lease mechanism prevents duplicate processing
- Automatic Failover - Leader dies? New one elected in seconds. Zero data loss.
- Idempotency Keys - Built-in deduplication for safe client retries
- Priority Queues - Higher priority tasks processed first
- Dead-Letter Queue - Failed tasks moved after max retries
- Observability - Prometheus metrics, structured logging
```
┌─────────────────────────────────────────────────────────────┐
│                       DQueue Cluster                        │
│                                                             │
│  ┌──────────┐        ┌──────────┐        ┌──────────┐       │
│  │  Node 1  │◀──────▶│  Node 2  │◀──────▶│  Node 3  │       │
│  │ (Leader) │  Raft  │(Follower)│  Raft  │(Follower)│       │
│  └────┬─────┘        └──────────┘        └──────────┘       │
│       │                                                     │
│       │ HTTP API                                            │
└───────┼─────────────────────────────────────────────────────┘
        │
   ┌────┴────┐
   │ Clients │
   │ Workers │
   └─────────┘
```
- Go 1.22+
- Docker & Docker Compose (for cluster mode)
- Make
```bash
# Build
make build

# Run single node (no Raft, for quick testing)
./bin/dqueue --id=dev --bootstrap=true
```

```bash
# Start cluster
docker-compose -f deploy/docker-compose.yml up -d

# Check health
curl http://localhost:8081/health

# View logs
docker-compose -f deploy/docker-compose.yml logs -f
```

```bash
# Using CLI
./bin/dqueue-cli enqueue -type email -payload '{"to":"user@example.com"}'

# Using curl
curl -X POST http://localhost:8081/tasks \
  -H "Content-Type: application/json" \
  -d '{"type":"email","payload":{"to":"user@example.com"}}'
```

Start an example worker:

```bash
./bin/dqueue-worker --server http://localhost:8081
```

POST /tasks
```jsonc
{
  "type": "email",
  "payload": {"to": "user@example.com", "subject": "Hello"},
  "priority": 10,           // optional: higher = more urgent
  "max_retries": 3,         // optional: default 3
  "idempotency_key": "abc"  // optional: for deduplication
}
```

POST /tasks/lease
```json
{
  "worker_id": "worker-1"
}
```

Returns the next available task. The task is "invisible" to other workers until it is ACKed or its visibility timeout expires.
POST /tasks/{id}/ack
```json
{
  "worker_id": "worker-1"
}
```

POST /tasks/{id}/nack
```json
{
  "worker_id": "worker-1",
  "reason": "database connection failed"
}
```

The task will be retried with exponential backoff, or moved to the dead-letter queue after max retries.
GET /tasks/{id}
GET /health
GET /stats

Tasks may be delivered multiple times if:
- Worker crashes after processing but before ACKing
- Network issues cause ACK to be lost
- Visibility timeout expires during processing
Your task handlers should be idempotent.
Exactly-once delivery is impossible in a distributed system without coordination, such as two-phase commit, between the queue and its consumers. We chose at-least-once because:
- It's what SQS and most production queues use
- It's simpler and more performant
- Idempotent handlers are a well-understood pattern
Watch automatic leader election:
```bash
# Terminal 1: Start cluster
docker-compose -f deploy/docker-compose.yml up

# Terminal 2: Run failover demo
./scripts/demo-failover.sh
```

What happens:

- Identify the current leader
- Enqueue tasks
- Kill the leader (`docker kill`)
- Watch a new leader get elected (~3 seconds)
- Verify zero task loss
Run the chaos test to verify reliability under failure:
```bash
./scripts/chaos-test.sh
```

This script:
- Continuously enqueues tasks
- Kills the leader every 10 seconds
- Verifies no data loss after 60 seconds
Results: Zero committed tasks lost. Transient failures during leader transitions are expected (clients should retry).
| Flag | Default | Description |
|---|---|---|
| `--id` | (required) | Unique node ID |
| `--data-dir` | `./data` | Data directory |
| `--http-addr` | `:8080` | HTTP API address |
| `--grpc-addr` | `:9090` | gRPC (Raft) address |
| `--peers` | (empty) | Peer addresses: `id1=addr1,id2=addr2` |
| `--bootstrap` | `false` | Bootstrap a new cluster |
| `--lease-timeout` | `30s` | Visibility timeout |
| `--max-retries` | `3` | Max retries before dead-letter |
The Raft log serves as the write-ahead log. Queue state is derived by replaying committed entries (event sourcing). This avoids double-write complexity.
Leases (visibility timeouts) are NOT replicated via Raft. This is a deliberate trade-off:
- Pro: Much higher throughput (no consensus per lease)
- Con: If leader dies, leased tasks become available again
- This matches how SQS visibility timeout works
We implement core Raft without extensions like pre-vote, joint consensus, or linearizable reads. See docs/raft.md for rationale.
```
├── cmd/
│   ├── dqueue/           # Queue node binary
│   ├── dqueue-worker/    # Example worker
│   └── dqueue-cli/       # CLI tool
├── pkg/
│   ├── raft/             # Raft consensus implementation
│   ├── queue/            # Task queue logic
│   ├── storage/          # Unified WAL storage
│   ├── transport/        # gRPC + HTTP transport
│   └── client/           # Go SDK
├── docs/
│   ├── architecture.md   # System design
│   ├── raft.md           # Raft implementation details
│   └── failure-modes.md  # Failure scenarios
├── deploy/
│   ├── Dockerfile
│   └── docker-compose.yml
└── scripts/
    ├── demo-basic.sh
    ├── demo-failover.sh
    └── chaos-test.sh
```
```bash
# Build all binaries
make build

# Run tests
make test

# Generate protobuf (requires protoc)
make proto

# Build Docker image
make docker-build
```

- Architecture - System design and data flow
- Raft Implementation - Consensus details and non-goals
- Failure Modes - How failures are handled
- Raft Paper - "In Search of an Understandable Consensus Algorithm"
- Amazon SQS - Visibility timeout inspiration
MIT