This is the quickest end-to-end way to run CDC (Postgres -> Kafka/Debezium -> Hudi on MinIO -> Trino).
From the project root:
```bash
make start
```
`make start` now runs a startup script (`scripts/start-sequence.sh`) that boots services in sequence and waits for readiness checks before moving ahead.
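For reference, here is a minimal sketch of the kind of sequencing such a script performs. The real `scripts/start-sequence.sh` is the source of truth; the service names and readiness probes below are assumptions based on this stack:

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a sequenced startup; see scripts/start-sequence.sh
# for the real logic. Service names and probes are assumptions.
set -euo pipefail
cd hudi-datalake

docker-compose up -d hudidb
# Wait until Postgres accepts connections.
until docker exec hudidb pg_isready -U postgres >/dev/null 2>&1; do sleep 2; done

docker-compose up -d kafka schema-registry
# Wait until Schema Registry answers HTTP requests (assumes port 8081 is mapped to the host).
until curl -sf http://localhost:8081/subjects >/dev/null; do sleep 2; done

docker-compose up -d minio hive-metastore trino
echo "all services started"
```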
If you want a raw compose startup without the sequencing checks:
```bash
make start-quick
make connector
```
To stop:
```bash
make stop
```
Full reset (remove volumes):
```bash
make reset
```
To edit the architecture diagram, use the editable source: `screenshots/diagram-v2.mmd`.
If you want to update the diagram image in `readme.md` (`screenshots/diagram.jpg`), keep these flows in sync:
- PostgreSQL -> Debezium -> Kafka
- Spark Hudi Streamer <- Kafka + Schema Registry
- Spark -> MinIO (Hudi files)
- Spark -> Hive Metastore (table sync)
- Trino -> Hive Metastore + MinIO
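If you have Node available, one way to re-render the image from the Mermaid source is mermaid-cli. This tooling is an assumption, not part of the repo, and mermaid-cli emits PNG/SVG/PDF rather than JPG, so converting to the JPG the readme embeds needs an extra step such as ImageMagick:

```bash
# Render the editable Mermaid source to PNG, then convert to JPG.
npx -y @mermaid-js/mermaid-cli -i screenshots/diagram-v2.mmd -o screenshots/diagram.png
convert screenshots/diagram.png screenshots/diagram.jpg   # requires ImageMagick
```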
To run the stack step by step without the make targets, start from the project root:
```bash
cd hudi-datalake
docker-compose up -d
```
Verify:
```bash
docker ps
```
You should see containers like `hudidb`, `kafka`, `schema-registry`, `hive-metastore`, `trino`, and `minio`.
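A quick scripted version of the same check. Container names may carry compose prefixes, which the substring match below tolerates:

```bash
# Report each expected container as OK or MISSING.
for c in hudidb kafka schema-registry hive-metastore trino minio; do
  docker ps --format '{{.Names}}' | grep -q "$c" && echo "OK: $c" || echo "MISSING: $c"
done
```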
Register the Debezium connector:
```bash
curl -X POST http://localhost:8083/connectors \
  -H 'Content-Type: application/json' \
  --data @hudi-datalake/connector.json
```
Check status:
```bash
curl http://localhost:8083/connectors/transactions-connector/status
```
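The payload lives in `hudi-datalake/connector.json`. As a rough sketch only, a Debezium Postgres source config for this stack usually looks like the following; every value here is an assumption (and Debezium 1.x uses `database.server.name` instead of `topic.prefix`), so defer to the real file:

```bash
# Hypothetical shape of hudi-datalake/connector.json; all values are assumptions.
cat <<'EOF'
{
  "name": "transactions-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "plugin.name": "pgoutput",
    "database.hostname": "hudidb",
    "database.port": "5432",
    "database.user": "postgres",
    "database.password": "postgres",
    "database.dbname": "dev",
    "table.include.list": "v1.retail_transactions",
    "topic.prefix": "cdc"
  }
}
EOF
```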
Keep this running in a separate terminal (from the project root):
```bash
docker-compose -f hudi-datalake/docker-compose.spark.yml run --rm spark-hudi-streamer
```
Run from any terminal:
```bash
docker exec hudidb psql -U postgres -d dev -c "ALTER TABLE v1.retail_transactions REPLICA IDENTITY FULL;"
docker exec hudidb psql -U postgres -d dev -c "SET search_path TO v1; INSERT INTO retail_transactions VALUES (1001, CURRENT_DATE, 11, 'BOSTON', 'MA', 2, 44.50);"
docker exec hudidb psql -U postgres -d dev -c "SET search_path TO v1; UPDATE retail_transactions SET quantity = quantity + 1, total = total + 10 WHERE tran_id = 2;"
docker exec hudidb psql -U postgres -d dev -c "SET search_path TO v1; DELETE FROM retail_transactions WHERE tran_id = 3;"
```
Query the results with Trino:
```bash
docker exec -it trino trino --execute "SHOW TABLES FROM hudi.default;"
docker exec -it trino trino --execute "SELECT tran_id, store_city, quantity, total FROM hudi.default.retail_transactions ORDER BY tran_id;"
```
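If Trino shows no changes, it helps to confirm the events reached Kafka first. The topic name below is an assumption derived from the usual Debezium naming (`<prefix>.<schema>.<table>`); the command names match Confluent images (Apache Kafka images ship them with a `.sh` suffix), and since this stack uses Schema Registry the payloads are likely Avro, so the plain console consumer may print binary:

```bash
# List topics to find the real CDC topic name, then peek at a few events.
docker exec kafka kafka-topics --bootstrap-server localhost:9092 --list
docker exec kafka kafka-console-consumer --bootstrap-server localhost:9092 \
  --topic cdc.v1.retail_transactions --from-beginning --max-messages 5
```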
Stop only services:
```bash
cd hudi-datalake
docker-compose down
```
Stop and remove volumes (full reset):
```bash
cd hudi-datalake
docker-compose down -v
```
- Use `config/spark-config-s3.properties` when Spark runs on the host.
- Use `config/spark-config-s3-docker.properties` when Spark runs in Docker.
- If `docker-compose` shows `ContainerConfig` recreate issues, run: `cd hudi-datalake && docker-compose down --remove-orphans && docker-compose up -d`
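The two property files usually differ only in how Spark reaches MinIO (localhost from the host vs the compose service name from inside Docker). A hypothetical excerpt, with the endpoint and credentials as assumptions:

```bash
# Hypothetical excerpt of config/spark-config-s3.properties (host variant).
# For the Docker variant, the endpoint would point at the minio service instead,
# e.g. http://minio:9000. Check the real files in config/.
cat <<'EOF'
spark.hadoop.fs.s3a.endpoint=http://localhost:9000
spark.hadoop.fs.s3a.access.key=minioadmin
spark.hadoop.fs.s3a.secret.key=minioadmin
spark.hadoop.fs.s3a.path.style.access=true
spark.hadoop.fs.s3a.connection.ssl.enabled=false
EOF
```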