This project is a hybrid data processing pipeline that leverages Apache Flink (via PyFlink) and native Python to process large CSV files, clean and chunk them, and store them in a local SQLite database.
The pipeline performs the following steps:
- **Flink Job Execution:**
  - Reads large CSV files in chunks using pandas.
  - Cleans and formats the data rows.
  - Splits them into smaller `chunk_*.csv` files for manageable processing.
- **Python-SQLite Integration:**
  - Reads each chunked file.
  - Inserts cleaned data into a local `hsi.db` SQLite database.
This setup is containerized using Docker and deploys Flink in Session Cluster mode via Docker Compose.
cadence/
├── data/
│ ├── uploads/ # Input CSV files
│ ├── chunks/ # Output files written by Flink for SQLite ingestion
│ └── hsi.db # SQLite database (auto-created)
├── flink_pipeline.py # Main processing script
├── Dockerfile # Builds the PyFlink client container
├── docker-compose.yml # Spins up Flink session cluster (JobManager + TaskManager)
└── requirements.txt # Python dependencies
- Docker & Docker Compose
- Python 3.10 (used in container)
- CSV input files in the `data/uploads/` directory
From the root directory:
```bash
docker compose up -d
```

This starts:

- JobManager on port `8081` (Web UI)
- TaskManager connected to the same Flink session
Access the Flink dashboard: http://localhost:8081
```bash
docker build -t pyflink-client -f Dockerfile .

# Replace cadence_flink-net with your local network name (see `docker network ls`)
docker run --rm \
  --network=cadence_flink-net \
  -v "$(pwd)/data:/app/data" \
  pyflink-client
```

This will:
- Process files inside `/app/data/uploads`
- Output chunked `chunk_*.csv` files into `/app/data/chunks`
- Populate the `hsi.db` SQLite database
- **Preprocesses Input Files**
  - Cleans columns like `Index`, `Open`, `CloseUSD`, etc.
  - Converts rows into CSV string format.
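The cleaning step can be sketched roughly as follows; `row_to_csv` and the exact column handling are illustrative assumptions, not the actual code in `flink_pipeline.py`:

```python
import pandas as pd

# A minimal sketch of the row-cleaning step. The column names mirror the
# hsi_data schema used later in this README; `row_to_csv` is an
# illustrative helper, not the actual function in flink_pipeline.py.
COLUMNS = ["Index", "Date", "Open", "High", "Low", "Close",
           "Adj Close", "Volume", "CloseUSD"]

def row_to_csv(row: pd.Series) -> str:
    """Strip stray whitespace and join the row's fields into one CSV line."""
    return ",".join(str(row[col]).strip() for col in COLUMNS)

sample = pd.Series({
    "Index": " HSI ", "Date": "2020-01-02", "Open": 28249.0, "High": 28560.0,
    "Low": 28189.0, "Close": 28543.5, "Adj Close": 28543.5,
    "Volume": 1423900000, "CloseUSD": 3662.1,
})
print(row_to_csv(sample))  # "HSI,2020-01-02,28249.0,...,3662.1"
```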
- **Writes Chunked Files**
  - Each chunk contains 10,000 rows by default.
  - Files are saved to `/app/data/chunks/chunk_N.csv`.
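The chunk-writing step can be approximated with pandas alone; `write_chunks` is an illustrative name (the 10,000-row default matches the behavior described above):

```python
import os
import pandas as pd

# Sketch of the chunking step, assuming the data/ layout described above.
# `write_chunks` is an illustrative name, not the actual function in
# flink_pipeline.py; the 10,000-row default matches this README.
CHUNK_SIZE = 10_000

def write_chunks(src_csv: str, out_dir: str, chunk_size: int = CHUNK_SIZE) -> int:
    """Split src_csv into chunk_N.csv files of at most chunk_size rows each."""
    os.makedirs(out_dir, exist_ok=True)
    count = 0
    # chunksize= turns read_csv into an iterator of DataFrames,
    # so the full file is never held in memory at once.
    for i, chunk in enumerate(pd.read_csv(src_csv, chunksize=chunk_size)):
        chunk.to_csv(os.path.join(out_dir, f"chunk_{i}.csv"), index=False)
        count += 1
    return count
```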
- **SQLite Insertion**
  - Loads each `chunk_*.csv` file.
  - Inserts rows into the SQLite table `hsi_data`:
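The insertion step can be sketched with the standard-library `sqlite3` module plus pandas; `load_chunks` is an illustrative helper, not the project's actual code, and the schema below repeats the `hsi_data` table definition from this README:

```python
import glob
import os
import sqlite3
import pandas as pd

# Sketch of the ingestion step: load every chunk_*.csv and append its rows
# to the hsi_data table. The schema matches the CREATE TABLE statement in
# this README; `load_chunks` is an illustrative name.
SCHEMA = """
CREATE TABLE IF NOT EXISTS hsi_data (
    index_name TEXT,
    date TEXT,
    open REAL,
    high REAL,
    low REAL,
    close REAL,
    adj_close REAL,
    volume INTEGER,
    close_usd REAL
)
"""

def load_chunks(db_path: str, chunks_dir: str) -> int:
    """Insert every chunk_*.csv in chunks_dir into db_path; return row count."""
    inserted = 0
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(SCHEMA)
        for path in sorted(glob.glob(os.path.join(chunks_dir, "chunk_*.csv"))):
            df = pd.read_csv(path)
            # pandas maps CSV columns to table columns by name and handles
            # the numeric type conversion when writing through the connection
            df.to_sql("hsi_data", conn, if_exists="append", index=False)
            inserted += len(df)
        conn.commit()
    finally:
        conn.close()
    return inserted
```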
```sql
CREATE TABLE hsi_data (
    index_name TEXT,
    date TEXT,
    open REAL,
    high REAL,
    low REAL,
    close REAL,
    adj_close REAL,
    volume INTEGER,
    close_usd REAL
);
```

Listed in `requirements.txt`:
```text
apache-flink
pyflink
pandas
```

These are installed inside the PyFlink client Docker image.
To stop the Flink cluster:
```bash
docker compose down
```

To clear generated data:

```bash
rm -rf data/chunks/*
rm -f data/hsi.db
```

- Stream to Kafka instead of writing to disk
- Add SQL-based filtering or aggregation
- Use Flink Table API for advanced processing
- Extend to output to PostgreSQL or Cloud DB
Built by Bethvour Chike.

This is a hybrid Flink-native and Python-native data engineering project for fast, local CSV ingestion and analysis.