This project builds a complete machine learning pipeline that forecasts cinema audience numbers using ticket images as the primary data source. OCR is applied to both online ticket screenshots and physical ticket photos to extract structured information that feeds a time-series forecasting system.
The workflow spans Computer Vision → Data Engineering → Feature Engineering → ML/DL Models → Forecasting → Insights.
Cinema owners need to accurately forecast the audience count for upcoming shows to optimize:
- Staff scheduling
- Screen allocation
- Inventory and concessions
- Revenue planning
- Marketing efforts
However, the available data is unstructured:
- Online ticket booking screenshots
- Images of physical tickets bought at the cinema
Your goal is to transform these raw images into a clean forecasting dataset and then build models to predict upcoming audience demand.
Extract raw text from ticket images using image preprocessing and OCR techniques.
Convert noisy OCR output into structured fields such as:
- Movie name
- Show date
- Show time
- Screen number
- Seat number
- Ticket price
- Ticket quantity
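The parsing step could be sketched with regular expressions, as below. This is a minimal sketch, not the project's actual parser: the "Label: value" ticket layout, the DD/MM/YYYY date format, the 12-hour time format, the currency prefixes, and the names `Ticket`, `PATTERNS`, `SAMPLE`, and `parse_ticket` are all assumptions made for illustration. Ticket quantity is inferred from the number of seats, which is also an assumption.

```python
import re
from dataclasses import dataclass
from datetime import date, datetime, time
from typing import Optional, Tuple


@dataclass
class Ticket:
    movie: Optional[str] = None
    show_date: Optional[date] = None
    show_time: Optional[time] = None
    screen: Optional[int] = None
    seats: Tuple[str, ...] = ()
    price: Optional[float] = None
    quantity: Optional[int] = None


# Hypothetical patterns for an assumed "Label: value" layout.
# Real tickets will need patterns tuned to their own layout and OCR noise.
PATTERNS = {
    "movie": re.compile(r"movie\s*[:\-]\s*(.+)", re.I),
    "date": re.compile(r"date\s*[:\-]\s*(\d{2}/\d{2}/\d{4})", re.I),
    "time": re.compile(r"time\s*[:\-]\s*(\d{1,2}:\d{2})\s*([AP]M)", re.I),
    "screen": re.compile(r"screen\s*[:\-]?\s*(\d+)", re.I),
    "seats": re.compile(r"seats?\s*[:\-]\s*([A-Z]\d+(?:\s*,\s*[A-Z]\d+)*)", re.I),
    "price": re.compile(r"price\s*[:\-]?\s*(?:rs\.?|inr|\$)?\s*(\d+(?:\.\d{1,2})?)", re.I),
}


def parse_ticket(text: str) -> Ticket:
    """Extract fields from OCR text; fields that cannot be found stay None."""
    found = {}
    for key, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            found[key] = match.groups()

    ticket = Ticket()
    if "movie" in found:
        ticket.movie = found["movie"][0].strip()
    if "date" in found:
        try:  # an OCR misread such as 99/99/2024 is left as missing
            ticket.show_date = datetime.strptime(found["date"][0], "%d/%m/%Y").date()
        except ValueError:
            pass
    if "time" in found:
        hhmm, ampm = found["time"]
        try:
            ticket.show_time = datetime.strptime(f"{hhmm} {ampm.upper()}", "%I:%M %p").time()
        except ValueError:
            pass
    if "screen" in found:
        ticket.screen = int(found["screen"][0])
    if "seats" in found:
        ticket.seats = tuple(s.strip().upper() for s in found["seats"][0].split(","))
        ticket.quantity = len(ticket.seats)  # assumption: one ticket per seat
    if "price" in found:
        ticket.price = float(found["price"][0])
    return ticket


# Made-up OCR output used only to exercise the sketch.
SAMPLE = """\
Movie: Example Movie
Show Date: 14/02/2024
Show Time: 07:30 PM
Screen 3
Seats: F12, F13
Price: Rs. 250.00
"""

ticket = parse_ticket(SAMPLE)
print(ticket)
```

Leaving unparsed fields as `None` rather than guessing keeps the noise visible, so the preprocessing stage can decide how to handle missing values.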
Implement a robust preprocessing workflow that:
- Handles missing values
- Detects and corrects outliers
- Removes duplicates
- Normalizes numeric features (fit the scaling parameters on the training split only, to avoid leakage)
- Performs a time-aware train/validation/test split
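The time-aware split might look like the following sketch, which sorts records by show date and cuts them chronologically so that validation and test data always come after the training data. The row layout (dictionaries with `show_date` and `audience` keys), the 60/20/20 proportions, and the helper names `time_aware_split` and `fit_standardizer` are assumptions for illustration. The scaling statistics are computed from the training portion only, so nothing from later periods leaks into training.

```python
from datetime import date, timedelta
from statistics import mean, pstdev


def time_aware_split(rows, date_key="show_date", val_frac=0.2, test_frac=0.2):
    """Split rows chronologically: oldest go to train, newest go to test."""
    ordered = sorted(rows, key=lambda r: r[date_key])
    n = len(ordered)
    n_val = round(n * val_frac)
    n_test = round(n * test_frac)
    n_train = n - n_val - n_test
    train = ordered[:n_train]
    val = ordered[n_train:n_train + n_val]
    test = ordered[n_train + n_val:]
    return train, val, test


def fit_standardizer(train_rows, key):
    """Return a z-score function whose mean and std come from train only."""
    values = [r[key] for r in train_rows]
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # guard against a constant column
    return lambda x: (x - mu) / sigma


# Made-up daily records, deliberately given in reverse order.
rows = [
    {"show_date": date(2024, 1, 1) + timedelta(days=i), "audience": 100 + i}
    for i in reversed(range(10))
]

train, val, test = time_aware_split(rows)
scale = fit_standardizer(train, "audience")

print(len(train), len(val), len(test))
print([round(scale(r["audience"]), 2) for r in val])
```

A random shuffle would let the model see future shows during training, which tends to make validation scores look better than real forecasting performance.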
Generate both low-level and high-level visualizations:
- Daily booking trends
- Monthly patterns
- Seasonality decomposition
- Heatmaps (for example, bookings by weekday and show time)
- Festival or holiday-based analysis
Create meaningful features for forecasting:
- Temporal features (weekday, month, week number)
- Lag features (lag_1, lag_7, lag_30)
- Rolling window features (mean, standard deviation)
- Holiday or festival indicators
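The feature list above can be sketched in plain Python as follows. It assumes one audience count per calendar day with no gaps, so that "lag_7" really means seven days earlier; the function name `build_features`, the column names, and the holiday set are hypothetical. Rolling statistics are computed over the days before the current one, so the target value never appears in its own features.

```python
from datetime import date, timedelta
from statistics import mean, stdev


def build_features(dates, counts, lags=(1, 7, 30), window=7, holidays=frozenset()):
    """Build temporal, lag, rolling, and holiday features for a daily series."""
    rows = []
    for i, (d, y) in enumerate(zip(dates, counts)):
        row = {
            "date": d,
            "y": y,
            "weekday": d.weekday(),        # Monday = 0
            "month": d.month,
            "week": d.isocalendar()[1],    # ISO week number
            "is_holiday": int(d in holidays),
        }
        # Lag features: the value k days earlier, or None when history is too short.
        for k in lags:
            row[f"lag_{k}"] = counts[i - k] if i >= k else None
        # Rolling features over the previous `window` days (current day excluded).
        past = counts[max(0, i - window):i]
        has_full_window = len(past) == window
        row[f"roll_mean_{window}"] = mean(past) if has_full_window else None
        row[f"roll_std_{window}"] = stdev(past) if has_full_window else None
        rows.append(row)
    return rows


# Made-up series: 40 consecutive days with a steadily rising audience.
dates = [date(2024, 1, 1) + timedelta(days=i) for i in range(40)]
counts = [100 + i for i in range(40)]
holidays = frozenset({date(2024, 1, 1)})  # hypothetical holiday calendar

feats = build_features(dates, counts, holidays=holidays)
print(feats[30])
```

With pandas, the same idea is usually expressed with `Series.shift` for lags and `Series.rolling` for window statistics; rows whose lags are `None` here would typically be dropped before training.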
Implement at least three forecasting models:
- ARIMA or Auto-ARIMA
- XGBoost
- LSTM or GRU
Tune model hyperparameters using Optuna or another tuning framework.
Evaluate models using the following metrics:
- RMSE
- MAE
- MAPE
- R²
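For reference, the four metrics can be computed from their standard definitions as in the sketch below. The function name `regression_metrics` and the example numbers are made up for illustration. One choice here is an assumption rather than a convention: MAPE skips observations whose actual value is zero, because the percentage error is undefined there.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, MAPE (in percent), and R² for two equal-length lists."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and the same length")

    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)

    # Assumption: rows with a zero actual value are excluded from MAPE.
    nonzero = [(t, e) for t, e in zip(y_true, errors) if t != 0]
    if nonzero:
        mape = 100.0 * sum(abs(e / t) for t, e in nonzero) / len(nonzero)
    else:
        mape = float("nan")

    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot if ss_tot else float("nan")

    return {"rmse": rmse, "mae": mae, "mape": mape, "r2": r2}


# Made-up actual and predicted audience counts.
metrics = regression_metrics([100, 200, 300, 400], [110, 190, 310, 390])
print(metrics)
```

RMSE penalizes large misses more heavily than MAE, while MAPE is scale-free but unstable for shows with very small audiences, so reporting them together gives a more balanced picture.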
Produce final predictions and summarize insights using visualizations and reports.
src/
├── ocr.py
├── parser.py
├── preprocess.py
├── features.py
├── visualize.py
├── data_loader.py
├── evaluate.py
│
├── models/
│   ├── arima_model.py
│   ├── xgb_model.py
│   └── lstm_model.py
│
└── pipeline/
    ├── pipeline.py
    └── stages/
Data directories:
data/
├── raw/
├── interim/
└── processed/
Additional directories:
plots/
reports/
models/
artifacts/
logs/
A minimal dataset is provided:
- data/raw/ → sample ticket images
- data/interim/ → sample OCR text output
- data/processed/ → structured dataset
You may extend or augment this dataset as needed.
Ticket Images
↓
OCR Extraction
↓
Parsed Fields
↓
Cleaned Data
↓
Feature Engineering
↓
Train/Validation/Test Split
↓
Model Training
↓
Hyperparameter Optimization
↓
Evaluation
↓
Final Forecast and Insights