Running Guide (Simple)

This guide is the quickest way to run the CDC pipeline end to end (Postgres -> Debezium/Kafka -> Hudi on MinIO -> Trino).

Quick start with Make

From project root:

make start

make start runs a startup script (scripts/start-sequence.sh) that boots the services in sequence and waits for each readiness check to pass before moving on to the next service.
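The readiness-wait idea can be illustrated with a small retry helper. This is only a sketch of the general pattern, not the contents of scripts/start-sequence.sh; the function name wait_for and its argument order are assumptions made for illustration.

```shell
# Hypothetical helper (not taken from scripts/start-sequence.sh):
# retry a command until it succeeds or the attempts run out.
# Usage: wait_for <max_attempts> <delay_seconds> <command> [args...]
wait_for() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    # The check passes as soon as the command exits with status 0.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Sleep between attempts, but not after the last failed one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example (assumes the Kafka Connect REST port used later in this guide):
#   wait_for 30 2 curl -fsS http://localhost:8083/connectors
```

A helper like this keeps each readiness check to a single line and makes the timeout explicit (attempts multiplied by delay).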

For a plain compose startup without the sequencing checks:

make start-quick
make connector

To stop:

make stop

Full reset (remove volumes):

make reset

Architecture diagram (editable)

The editable source is screenshots/diagram-v2.mmd.

If you update the diagram image used in readme.md (screenshots/diagram.jpg), keep these flows in sync:

PostgreSQL -> Debezium -> Kafka
Spark Hudi Streamer <- Kafka + Schema Registry
Spark -> MinIO (Hudi files)
Spark -> Hive Metastore (table sync)
Trino -> Hive Metastore + MinIO

1) Start services

From project root:

cd hudi-datalake
docker-compose up -d

Verify:

docker ps

You should see containers like: hudidb, kafka, schema-registry, hive-metastore, trino, minio.

2) Create Debezium connector

From project root (the connector.json path is relative to it):

curl -X POST http://localhost:8083/connectors \
  -H 'Content-Type: application/json' \
  --data @hudi-datalake/connector.json

Check status:

curl http://localhost:8083/connectors/transactions-connector/status
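The status response is JSON. A minimal sketch for checking it from a script, assuming the usual Kafka Connect status shape (a "state" field for the connector and one per task), is shown here. The function names connector_states and all_running are hypothetical, and the parsing uses grep and sed rather than a JSON parser, so treat it as a convenience check only.

```shell
# Hypothetical helpers; they assume the status JSON carries "state": "<VALUE>"
# fields for the connector and for each task.

# Print every "state" value found in the JSON read from stdin, one per line.
connector_states() {
  grep -o '"state"[[:space:]]*:[[:space:]]*"[A-Z_]*"' |
    sed 's/.*"\([A-Z_]*\)"$/\1/'
}

# Succeed only if at least one state is present and every state is RUNNING.
all_running() {
  states="$(connector_states)"
  [ -n "$states" ] || return 1
  ! printf '%s\n' "$states" | grep -qv '^RUNNING$'
}

# Example:
#   curl -fsS http://localhost:8083/connectors/transactions-connector/status \
#     | all_running && echo "connector and tasks are running"
```

If any state is not RUNNING (for example FAILED), all_running returns a non-zero status, and the full status output usually includes a trace explaining why.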

3) Start Hudi streamer in continuous mode

Keep this running in a separate terminal (from project root):

docker-compose -f hudi-datalake/docker-compose.spark.yml run --rm spark-hudi-streamer

4) Generate source CDC changes (INSERT/UPDATE/DELETE)

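Run from any terminal. The first command sets REPLICA IDENTITY FULL so that Postgres logs the complete previous row for UPDATE and DELETE, which lets Debezium emit full before-images: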

docker exec hudidb psql -U postgres -d dev -c "ALTER TABLE v1.retail_transactions REPLICA IDENTITY FULL;"

docker exec hudidb psql -U postgres -d dev -c "SET search_path TO v1; INSERT INTO retail_transactions VALUES (1001, CURRENT_DATE, 11, 'BOSTON', 'MA', 2, 44.50);"

docker exec hudidb psql -U postgres -d dev -c "SET search_path TO v1; UPDATE retail_transactions SET quantity = quantity + 1, total = total + 10 WHERE tran_id = 2;"

docker exec hudidb psql -U postgres -d dev -c "SET search_path TO v1; DELETE FROM retail_transactions WHERE tran_id = 3;"
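The INSERT, UPDATE and DELETE commands above share the same docker exec and psql prefix. A small wrapper, sketched here, makes it easier to fire further changes while testing. The function name v1_sql is hypothetical; the container, user, database and schema names are the ones used in this step.

```shell
# Hypothetical wrapper around the docker exec + psql prefix used in this step.
# Usage: v1_sql "<SQL statement ending in a semicolon>"
v1_sql() {
  # Runs the statement inside the hudidb container with search_path set to v1.
  docker exec hudidb psql -U postgres -d dev \
    -c "SET search_path TO v1; $1"
}

# Examples:
#   v1_sql "UPDATE retail_transactions SET quantity = quantity + 1 WHERE tran_id = 2;"
#   v1_sql "DELETE FROM retail_transactions WHERE tran_id = 3;"
```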

5) Query in Trino (Presto)

docker exec -it trino trino --execute "SHOW TABLES FROM hudi.default;"

docker exec -it trino trino --execute "SELECT tran_id, store_city, quantity, total FROM hudi.default.retail_transactions ORDER BY tran_id;"

6) Stop

Stop services but keep volumes:

cd hudi-datalake
docker-compose down

Stop and remove volumes (full reset):

cd hudi-datalake
docker-compose down -v

Notes

Use config/spark-config-s3.properties when Spark runs on host.
Use config/spark-config-s3-docker.properties when Spark runs in Docker.
If docker-compose fails with ContainerConfig errors while recreating containers, remove the old containers and start again:

cd hudi-datalake && docker-compose down --remove-orphans && docker-compose up -d
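The choice between the two Spark property files in the notes above can be captured in a small helper when scripting runs. This is a sketch only: the function name spark_config_for and its host/docker argument are assumptions, while the file paths are the ones listed in the notes.

```shell
# Hypothetical helper: print the Spark properties file for a given run mode.
# Usage: spark_config_for host|docker
spark_config_for() {
  case "$1" in
    host)   echo "config/spark-config-s3.properties" ;;
    docker) echo "config/spark-config-s3-docker.properties" ;;
    *)
      echo "usage: spark_config_for host|docker" >&2
      return 2
      ;;
  esac
}

# Example:
#   props="$(spark_config_for docker)"
```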
