A high-performance data processing system built with Rust, DataFusion, Kafka, and Apache Arrow.
This project implements a scalable data processing pipeline that:
- Processes CSV and Parquet files using DataFusion
- Exposes a REST API for data queries and management
- Consumes messages from Kafka for real-time processing
- Provides comprehensive metrics and monitoring
- High-Performance SQL Processing: Execute SQL queries against CSV and Parquet data
- Real-time Data Processing: Consume and process Kafka messages
- REST API: Query data and manage the system via HTTP endpoints
- Metrics & Monitoring: Track system performance with Prometheus-compatible metrics
- Scalable Architecture: Designed for horizontal scaling
- Rust: Core programming language
- DataFusion: SQL query engine based on Apache Arrow
- Apache Arrow: In-memory columnar data format
- Apache Parquet: Efficient columnar storage format
- Kafka: Distributed event streaming platform
- Docker & Docker Compose: Containerization and orchestration
- Docker and Docker Compose
- Rust (for development)
- Git
-
Clone the repository:
git clone https://github.com/xerxes-y/Datafusion.git cd Datafusion -
Start the services:
docker compose up -d
-
Verify the services are running:
docker compose ps
- Health Check:
GET http://localhost:3000/health - Metrics:
GET http://localhost:9000/metrics - Query Data:
POST http://localhost:3000/query{ "sql": "SELECT * FROM transactions LIMIT 10" }
# Build the project
cargo build --release
# Run tests
cargo test
# Run locally
cargo runsrc/main.rs: Application entry pointsrc/api.rs: HTTP API server implementationsrc/kafka.rs: Kafka consumer implementationsrc/query.rs: DataFusion query executionsrc/udf.rs: User-defined functionssrc/observability.rs: Metrics and monitoringsrc/config.rs: Configuration management
The application can be configured using environment variables:
| Variable | Description | Default |
|---|---|---|
RUST_LOG |
Logging level | info |
KAFKA_BROKER |
Kafka broker address | kafka:9092 |
KAFKA_TOPIC |
Kafka topic to consume | transactions |
API_PORT |
API server port | 3000 |
The system is designed to handle:
- Up to 100,000 transactions per second for simple queries
- 50,000 TPS for complex aggregations
- Sub-millisecond latency for data operations
The project includes the following services:
- datafusion: The main application
- kafka: Message broker for streaming data
- zookeeper: Required for Kafka coordination
-
Kafka Connection Issues
- Ensure Zookeeper is healthy
- Check Kafka logs:
docker compose logs kafka
-
Permission Issues
- The containers run as root to avoid volume permission problems
- Ensure your host system allows this
-
Performance Issues
- Adjust Kafka and JVM settings in docker-compose.yaml
- Consider increasing container resources
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (
git checkout -b feature/amazing-feature) - Commit your changes (
git commit -m 'Add some amazing feature') - Push to the branch (
git push origin feature/amazing-feature) - Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.