This folder contains the data processing pipeline for aligning Yelp review text with images, cleaning the dataset, creating a normalized helpfulness target, and producing train/val/test splits.
The pipeline now consists of four stages:
- Advanced Data Alignment - Matches images to review text using AI-powered similarity scoring
- Initial Cleaning - Filters and prepares the aligned dataset for downstream tasks
- Helpfulness Target + Sampling - Creates a normalized helpfulness score and performs exploratory checks
- Dataset Split - Produces train/val/test JSON files for modeling
Stage 1 (01_advanced_data_align.ipynb) aligns images with text by using the SigLIP model to match each photo with its most relevant review.
- Model: Google SigLIP (so400m-patch14-384) - State-of-the-art for image-text matching
- Processing: Batch processing with GPU acceleration (float16 precision)
- Algorithm: Greedy assignment based on cosine similarity scores
- Threshold: 0.15 minimum similarity score to filter weak matches
- Review data: <YOUR_DATA_PATH>/yelp/text/yelp_academic_dataset_review.json
- Photo metadata: <YOUR_DATA_PATH>/yelp/image/photos.json
- Photo files: <YOUR_DATA_PATH>/yelp/image/photos/
- Aligned dataset: <YOUR_DATA_PATH>/yelp/dataset_aligned.json
- Each review is enriched with matched photo_ids and captions
Note: Replace <YOUR_DATA_PATH> with your actual data directory path in the notebook configuration section.
- Loads all reviews and photos into memory, grouped by business ID
- Generates text embeddings for reviews (batch size: 64)
- Generates image embeddings for photos (batch size: 64)
- Computes similarity matrix between text and image embeddings
- Assigns photos to reviews using greedy algorithm (max 1 photo per review by default)
- Saves aligned dataset with photo_ids and captions added to each review
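The similarity and assignment steps above can be sketched in plain Python. This is a minimal illustration, not the notebook's code: the names `cosine_similarity`, `greedy_assign`, `max_photos_per_review`, and `used_photos` are hypothetical, and the rule that each photo goes to at most one review is an assumption. The notebook itself works on SigLIP embeddings in batches on the GPU.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def greedy_assign(text_embs, image_embs, threshold=0.15, max_photos_per_review=1):
    """Greedily assign photos to reviews, best-scoring pairs first.

    Assumption: a photo is assigned to at most one review.
    Returns a dict mapping review index -> list of photo indices.
    """
    # Score every (review, photo) pair.
    pairs = [
        (cosine_similarity(text, image), r_idx, p_idx)
        for r_idx, text in enumerate(text_embs)
        for p_idx, image in enumerate(image_embs)
    ]
    # Visit pairs from the highest similarity to the lowest.
    pairs.sort(key=lambda pair: pair[0], reverse=True)

    assignments = {r_idx: [] for r_idx in range(len(text_embs))}
    used_photos = set()
    for score, r_idx, p_idx in pairs:
        if score < threshold:
            break  # every remaining pair is weaker than the threshold
        if p_idx in used_photos:
            continue  # photo already matched to another review
        if len(assignments[r_idx]) >= max_photos_per_review:
            continue  # review already has its maximum number of photos
        assignments[r_idx].append(p_idx)
        used_photos.add(p_idx)
    return assignments
```

Reviews whose best score falls below the threshold end up with an empty list, which is what Stage 2 later filters out.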
BATCH_SIZE_TEXT = 64 # Text embedding batch size
BATCH_SIZE_IMAGE = 64 # Image embedding batch size
THRESHOLD = 0.15 # Minimum similarity score

Stage 2 (02_initial_clean.ipynb) filters the aligned dataset to retain only reviews with associated images.
- Loads the aligned dataset from Stage 1
- Reports dataset statistics (rows, columns, total images)
- Filters out reviews with empty photo_ids
- Saves cleaned dataset
Input: <YOUR_DATA_PATH>/yelp/dataset_aligned.json
Output: <YOUR_DATA_PATH>/yelp/data.json
Note: Update the input/output paths in the notebook to match your data directory.
# Keep only rows where photo_ids is not empty
df = df[df['photo_ids'].astype(bool)]

torch
numpy
pandas
pillow
transformers
tqdm
torchvision
- GPU: Recommended for faster processing (uses CUDA if available)
- Memory: Enough RAM to load the entire review dataset (several GB)
- Storage: Space for model weights (~1.5 GB for SigLIP)
- Install dependencies (if not already installed):
  pip install torch numpy pandas pillow transformers tqdm torchvision
- Configure paths:
  - Open 01_advanced_data_align.ipynb and update the CONFIGURATION section with your data paths
  - Open 02_initial_clean.ipynb and update the INPUT_PATH and OUTPUT_PATH variables
  - Open 04_sample_01.ipynb if you need to customize the helpfulness target or plotting options
  - Open 03_data_split.ipynb if you need to change split ratios or input/output paths
- Run Stage 1 - Data Alignment: Execute all cells in 01_advanced_data_align.ipynb
- Run Stage 2 - Data Cleaning: Execute all cells in 02_initial_clean.ipynb
- Run Stage 3 - Helpfulness Target + Sampling: Execute all cells in 04_sample_01.ipynb to create the helpful column (log + MinMax scaling on useful), filter heavy-engagement reviews, and visualize distributions. The notebook saves the result to /home/ranveer/mukul/data/yelp/dataset_01.json by default.
- Run Stage 4 - Dataset Split: Execute all cells in 03_data_split.ipynb to generate /home/ranveer/mukul/data/yelp/data/train_data.json, val_data.json, and test_data.json from the input dataset (defaults to /home/ranveer/mukul/data/yelp/dataset_02.json).
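For orientation, a train/val/test split like the one Stage 4 produces might look like the sketch below. The 80/10/10 ratios, the seed, and the `split_dataset` name are assumptions for illustration only; check 03_data_split.ipynb for the ratios it actually uses.

```python
import random


def split_dataset(records, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle records reproducibly and cut them into train/val/test lists.

    The test split receives whatever remains after train and val.
    """
    shuffled = list(records)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)

    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test
```

Fixing the seed keeps the three files stable across reruns, so a model evaluated later sees the same test reviews.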
To modify alignment behavior, adjust parameters in Stage 1:
- THRESHOLD: Increase to be more selective, decrease to allow more matches
- Max photos per review: Modify the condition in the greedy_assignment() function:
  if len(review_assignments[r_idx]) >= 1: # Change 1 to desired max
The final cleaned dataset (data.json) contains Yelp reviews with the following additional fields:
- photo_ids: List of matched photo IDs
- captions: List of corresponding photo captions
- helpful: Normalized helpfulness score derived from useful via log1p + MinMax scaling (added in Stage 3 when running 04_sample_01.ipynb)
All original review fields are preserved (review_id, business_id, user_id, text, stars, date, useful, funny, cool).
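As a rough illustration of how the helpful field relates to useful, the sketch below applies log1p followed by min-max scaling to a list of counts. The `helpfulness_scores` name and the handling of empty or constant input are assumptions; the notebook may rely on library scalers and also filters heavy-engagement reviews, which is not shown here.

```python
import math


def helpfulness_scores(useful_counts):
    """Map raw useful counts to [0, 1] via log1p followed by min-max scaling."""
    if not useful_counts:
        return []
    # log1p compresses the long tail of highly voted reviews and maps 0 to 0.
    logged = [math.log1p(count) for count in useful_counts]
    lo, hi = min(logged), max(logged)
    if hi == lo:
        # Assumption: a constant column maps to 0.0 rather than dividing by zero.
        return [0.0 for _ in logged]
    return [(value - lo) / (hi - lo) for value in logged]
```

The log step matters because useful votes are heavily skewed: without it, a few very popular reviews would push almost every other score close to zero.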
- Stage 1 processes businesses that have both reviews and photos
- Processing time depends on dataset size and GPU availability
- Float16 precision is used to optimize VRAM usage on GPU
- Corrupt or missing images are handled gracefully with placeholder images
After processing:
- Only reviews with matched photos are retained
- Total image count is available in Stage 2 output
- Dataset dimensions are reported before and after filtering