DataFusion Processing System

A high-performance data processing system built with Rust, DataFusion, Kafka, and Apache Arrow.

Overview

This project implements a scalable data processing pipeline that:

Processes CSV and Parquet files using DataFusion
Exposes a REST API for data queries and management
Consumes messages from Kafka for real-time processing
Provides comprehensive metrics and monitoring

Features

High-Performance SQL Processing: Execute SQL queries against CSV and Parquet data
Real-time Data Processing: Consume and process Kafka messages
REST API: Query data and manage the system via HTTP endpoints
Metrics & Monitoring: Track system performance with Prometheus-compatible metrics
Scalable Architecture: Designed for horizontal scaling

Technology Stack

Rust: Core programming language
DataFusion: SQL query engine based on Apache Arrow
Apache Arrow: In-memory columnar data format
Apache Parquet: Efficient columnar storage format
Kafka: Distributed event streaming platform
Docker & Docker Compose: Containerization and orchestration

Getting Started

Prerequisites

Docker and Docker Compose
Rust (for development)
Git

Installation

Clone the repository:

git clone https://github.com/xerxes-y/Datafusion.git
cd Datafusion

Start the services:
```
docker compose up -d
```
Verify the services are running:
```
docker compose ps
```

API Endpoints

Health Check: GET http://localhost:3000/health
Metrics: GET http://localhost:9000/metrics

Query Data: POST http://localhost:3000/query

{
  "sql": "SELECT * FROM transactions LIMIT 10"
}

Development

Building from Source

# Build the project
cargo build --release

# Run tests
cargo test

# Run locally
cargo run

Project Structure

src/main.rs: Application entry point
src/api.rs: HTTP API server implementation
src/kafka.rs: Kafka consumer implementation
src/query.rs: DataFusion query execution
src/udf.rs: User-defined functions
src/observability.rs: Metrics and monitoring
src/config.rs: Configuration management

Configuration

The application can be configured using environment variables:

Variable	Description	Default
`RUST_LOG`	Logging level	`info`
`KAFKA_BROKER`	Kafka broker address	`kafka:9092`
`KAFKA_TOPIC`	Kafka topic to consume	`transactions`
`API_PORT`	API server port	`3000`

Performance

The system is designed to handle:

Up to 100,000 transactions per second for simple queries
50,000 TPS for complex aggregations
Sub-millisecond latency for data operations

Docker Compose Services

The project includes the following services:

datafusion: The main application
kafka: Message broker for streaming data
zookeeper: Required for Kafka coordination

Troubleshooting

Common Issues

Kafka Connection Issues
- Ensure Zookeeper is healthy
- Check Kafka logs: docker compose logs kafka
Permission Issues
- The containers run as root to avoid volume permission problems
- Ensure your host system allows this
Performance Issues
- Adjust Kafka and JVM settings in docker-compose.yaml
- Consider increasing container resources

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Fork the repository
Create your feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add some amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
data		data
src		src
.env		.env
.gitignore		.gitignore
ARCHITECTURE.md		ARCHITECTURE.md
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
Datafusion.iml		Datafusion.iml
Dockerfile		Dockerfile
README.md		README.md
docker-compose.yaml		docker-compose.yaml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DataFusion Processing System

Overview

Features

Technology Stack

Getting Started

Prerequisites

Installation

API Endpoints

Development

Building from Source

Project Structure

Configuration

Performance

Docker Compose Services

Troubleshooting

Common Issues

Contributing

License

Acknowledgments

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

DataFusion Processing System

Overview

Features

Technology Stack

Getting Started

Prerequisites

Installation

API Endpoints

Development

Building from Source

Project Structure

Configuration

Performance

Docker Compose Services

Troubleshooting

Common Issues

Contributing

License

Acknowledgments

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages