This repository contains the infrastructure and code for the Rusha Spark 3.5.8 Base, a unified Spark environment designed for production SQL interfaces, machine learning experimentation, and interactive development.
The core of this project is a base Spark image (rusha-spark-3.5.8-base) that serves as the foundation for multiple Spark-based services, such as the Spark Thrift Server.
It includes native support for:
- Nessie: Git-like versioning for Data Lakes.
- Apache Iceberg: High-performance table format.
- Delta Lake: Reliable data lakes with ACID transactions.
- Unity Catalog: Unified governance for data and AI.
- Hive: Metadata management and SQL interface.
- AWS Integration: Full S3 and AWS SDK support.
- Hadoop Native Decompressors: Robust support for concatenated (multi-member) GZIP files using Hadoop 3.3.6.
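
To make the "concatenated (multi-member) GZIP" item above concrete, here is a minimal sketch of what such a file looks like. It uses Python's standard `gzip` module purely for illustration and says nothing about how the Hadoop 3.3.6 codec is implemented; the helper names are hypothetical.

```python
import gzip


def concat_gzip_members(chunks):
    """Compress each chunk on its own and join the resulting GZIP members."""
    return b"".join(gzip.compress(chunk) for chunk in chunks)


def read_all_members(data):
    """Decompress every member of a (possibly concatenated) GZIP stream."""
    # gzip.decompress reads all members, not just the first one.
    return gzip.decompress(data)


if __name__ == "__main__":
    stream = concat_gzip_members([b"row-1\n", b"row-2\n"])
    print(read_all_members(stream))
```

A reader that stops after the first member would silently return only part of the data, which is the failure mode that multi-member support avoids.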
graph TD
subgraph Repo [GitHub Repository]
Code[Source Code / Scala / Python]
Docs[Consolidated Documentation]
Release[GitHub Releases / Tags]
end
subgraph Build_Process [Build & Registry]
Local[Local Workspace] -->|docker build| BaseImage[Base Spark Image]
BaseImage -->|push| Registry[Container Registry]
Release -->|Trigger| Local
end
subgraph Runtime [Services]
Registry -->|Pull| Thrift[Spark Thrift Server]
Registry -->|Pull| MLflow[MLflow Server]
Registry -->|Pull| DevNode[Spark Dev Node]
end
- Docker & Docker Buildx
- Java 17 & SBT (for building Scala components)
- Python 3.12 & Poetry
The following variables are required for various services. It is recommended to manage them via a .env file (ignored by git).
| Variable | Description | Default |
|---|---|---|
| SPARK_MASTER | URL of the Spark master | - |
| SPARK_WAREHOUSE_DIR | Directory for Spark warehouse | - |
| METASTORE_URIS | URIs for the Hive metastore | - |
| DOCKER_REGISTRY | (Optional) Destination container registry | - |
| DOCKER_REPO | (Optional) Destination repository name | rusha-spark-3.5.3-base |
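
As a rough sketch of how the variables above could be checked before starting a service, the snippet below parses `.env`-style text and reports missing entries. It assumes the three variables not marked optional are the required ones and that the file uses plain `KEY=VALUE` lines; the function names and sample values are hypothetical and not part of this repository.

```python
# Assumption: the variables not marked "(Optional)" in the table are required.
REQUIRED = ("SPARK_MASTER", "SPARK_WAREHOUSE_DIR", "METASTORE_URIS")


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_required(values, required=REQUIRED):
    """Return the required variable names that are absent or empty."""
    return [name for name in required if not values.get(name)]


if __name__ == "__main__":
    # Hypothetical sample values, for illustration only.
    sample = "SPARK_MASTER=local[*]\n# comment\nMETASTORE_URIS=thrift://localhost:9083\n"
    print(missing_required(parse_env(sample)))
```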
To build the base image locally:
./local/build.sh

By default, this builds and tags the image locally. To push to a remote registry, set the DOCKER_REGISTRY environment variable in your .env file.
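
The difference between a local tag and a registry-qualified one can be sketched as follows. This is an illustration of the naming rule only, not the logic inside `./local/build.sh`; the function name, the `latest` default tag, and the example registry host are assumptions.

```python
def image_reference(repo, tag="latest", registry=None):
    """Build an image reference, prefixing the registry only when one is set."""
    name = f"{repo}:{tag}"
    if registry:
        # Avoid a double slash if the registry value ends with '/'.
        return f"{registry.rstrip('/')}/{name}"
    return name


if __name__ == "__main__":
    # Without DOCKER_REGISTRY: a local-only tag.
    print(image_reference("rusha-spark-3.5.8-base"))
    # With a (hypothetical) DOCKER_REGISTRY: a pushable reference.
    print(image_reference("rusha-spark-3.5.8-base", "v1.2.3", "registry.example.com"))
```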
For security in this public repository, internal registry pushes are performed locally via make using your host credentials. This serves as our local "pipeline".
- Configure Environment: Ensure your local .env file contains the internal DOCKER_REGISTRY, DOCKER_REPO, and AWS_REGION.
- Build & Push:

      make push

- Formal Release: This project uses semantic versioning. Releasing automatically updates pyproject.toml, pushes to GitHub, creates a GitHub Release, and pushes the versioned image to the internal registry.

      make patch  # Increment v0.1.0 -> v0.1.1
      make minor  # Increment v0.1.0 -> v0.2.0
      make major  # Increment v0.1.0 -> v1.0.0
      # OR use an explicit version
      make release VERSION=v1.2.3

This workflow ensures that internal registry details and credentials are never exposed in GitHub Actions or public CI/CD logs.
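
The patch, minor, and major increments listed above follow the usual semantic-versioning rule: bumping one component resets the components to its right. The sketch below shows that rule in isolation; it is not the Makefile's actual implementation, and the `bump` helper is hypothetical.

```python
import re

# Accepts "v1.2.3" or "1.2.3"; pre-release suffixes are out of scope here.
_VERSION = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def bump(version, part):
    """Return the next version string for part in {'major', 'minor', 'patch'}."""
    match = _VERSION.match(version)
    if match is None:
        raise ValueError(f"not a semantic version: {version!r}")
    major, minor, patch = (int(group) for group in match.groups())
    if part == "major":
        major, minor, patch = major + 1, 0, 0
    elif part == "minor":
        minor, patch = minor + 1, 0
    elif part == "patch":
        patch += 1
    else:
        raise ValueError(f"unknown part: {part!r}")
    return f"v{major}.{minor}.{patch}"


if __name__ == "__main__":
    for part in ("patch", "minor", "major"):
        print(part, bump("v0.1.0", part))
```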
Proper releases are managed via Git tags. Creating a tag triggers a versioned build.

./scripts/release.sh v1.0.0

Spark Thrift Server: Starts the Hive Thrift Server 2 on Kubernetes or locally.
- Usage: ./start_thrift_server_k8s.sh or via Docker Compose.
- Port: 10000
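
Once the server is started, a quick way to confirm that port 10000 is accepting connections is a plain TCP check, sketched below with the standard library. The helper names are hypothetical, and the `default` database in the JDBC URL is an assumption; a successful TCP connection only shows that something is listening, not that queries will succeed.

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def jdbc_url(host="localhost", port=10000, database="default"):
    """Build a HiveServer2-style JDBC URL for clients such as beeline."""
    return f"jdbc:hive2://{host}:{port}/{database}"


if __name__ == "__main__":
    print(jdbc_url(), "reachable:", is_port_open("localhost", 10000))
```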
MLflow Server: Manages the machine learning lifecycle.
- URL: http://localhost:5000
Iceberg REST Catalog: Provides a REST interface for Apache Iceberg catalogs.
- URL: http://localhost:8181
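
As a sketch of how a Spark session might be pointed at the REST endpoint above, the snippet below assembles the standard Iceberg catalog properties into a dictionary and into `--conf` arguments. The catalog name `rest_catalog` and both helper names are hypothetical; how this repository actually wires the catalog is not shown in this document.

```python
def iceberg_rest_conf(catalog="rest_catalog", uri="http://localhost:8181"):
    """Return Spark properties for an Iceberg catalog backed by a REST service."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": uri,
    }


def to_submit_args(conf):
    """Flatten a properties dict into spark-submit style '--conf k=v' arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(iceberg_rest_conf())))
```

The same dictionary could be passed key by key to a `SparkSession` builder; keeping it as plain data makes it easy to reuse across the Thrift Server and a dev node.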
- No Secrets: Never commit secrets or .env files. Audit history regularly.
- Local Context: Always build from the local workspace to ensure uncommitted changes are tested.
- Dependencies: Scala dependencies are managed via build.sbt and Python dependencies via pyproject.toml.
Maintained by Rusha Corp.