Data Processing Pipeline

This folder contains the data processing pipeline for aligning Yelp review text with images, cleaning the dataset, creating a normalized helpfulness target, and producing train/val/test splits.

Overview

The pipeline now consists of four stages:

Advanced Data Alignment - Matches images to review text using SigLIP embedding similarity scores
Initial Cleaning - Filters and prepares the aligned dataset for downstream tasks
Helpfulness Target + Sampling - Creates a normalized helpfulness score and performs exploratory checks
Dataset Split - Produces train/val/test JSON files for modeling

Pipeline Stages

1. Advanced Data Alignment (`01_advanced_data_align.ipynb`)

This notebook performs intelligent image-text alignment using the SigLIP model to match photos with their most relevant reviews.

Key Features

Model: Google SigLIP (so400m-patch14-384), a vision-language model trained for image-text matching
Processing: Batch processing with GPU acceleration (float16 precision)
Algorithm: Greedy assignment based on cosine similarity scores
Threshold: 0.15 minimum similarity score to filter weak matches

Input Data

Review data: <YOUR_DATA_PATH>/yelp/text/yelp_academic_dataset_review.json
Photo metadata: <YOUR_DATA_PATH>/yelp/image/photos.json
Photo files: <YOUR_DATA_PATH>/yelp/image/photos/
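
Stage 1 reads these files and groups the records by business ID. The following is a minimal sketch of that grouping, assuming both JSON files are newline-delimited (one object per line, as in the Yelp Open Dataset). The helper name `group_by_business` and the toy records are hypothetical and not taken from the notebook.

```python
import json
from collections import defaultdict


def group_by_business(lines):
    """Group JSON-lines records by their business_id field."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        record = json.loads(line)
        groups[record["business_id"]].append(record)
    return dict(groups)


# Toy records standing in for lines of the review and photo files.
review_lines = [
    '{"review_id": "r1", "business_id": "b1", "text": "Great tacos"}',
    '{"review_id": "r2", "business_id": "b2", "text": "Slow service"}',
    '{"review_id": "r3", "business_id": "b1", "text": "Nice patio"}',
]
photo_lines = [
    '{"photo_id": "p1", "business_id": "b1", "caption": "tacos"}',
]

reviews_by_biz = group_by_business(review_lines)
photos_by_biz = group_by_business(photo_lines)

# Only businesses with both reviews and photos can be aligned.
alignable = sorted(set(reviews_by_biz) & set(photos_by_biz))
print(alignable)
```

With real data, an open file handle can be passed in place of the list, for example `group_by_business(open(path, encoding="utf-8"))`.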

Output

Aligned dataset: <YOUR_DATA_PATH>/yelp/dataset_aligned.json
Each review is enriched with matched photo_ids and captions

Note: Replace <YOUR_DATA_PATH> with your actual data directory path in the notebook configuration section.

Processing Steps

Loads all reviews and photos into memory, grouped by business ID
Generates text embeddings for reviews (batch size: 64)
Generates image embeddings for photos (batch size: 64)
Computes similarity matrix between text and image embeddings
Assigns photos to reviews using a greedy algorithm (max 1 photo per review by default)
Saves the aligned dataset with photo_ids and captions added to each review
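
The similarity step can be illustrated with plain NumPy. This is a sketch under the assumption that text and image embeddings arrive as 2-D arrays with one row per item. The function name `cosine_similarity_matrix` and the toy vectors are hypothetical, and the notebook may compute the same quantity directly in PyTorch.

```python
import numpy as np


def cosine_similarity_matrix(text_emb, image_emb):
    """Return an (n_reviews, n_photos) matrix of cosine similarities."""
    text_emb = np.asarray(text_emb, dtype=np.float32)
    image_emb = np.asarray(image_emb, dtype=np.float32)
    # L2-normalize each row so the dot product equals the cosine similarity.
    text_unit = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    image_unit = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    return text_unit @ image_unit.T


# Toy 2-D embeddings: two reviews, three photos.
text_emb = [[1.0, 0.0], [0.0, 2.0]]
image_emb = [[3.0, 0.0], [1.0, 1.0], [0.0, 5.0]]

sim = cosine_similarity_matrix(text_emb, image_emb)
print(sim.shape)
print(np.round(sim, 3))
```

Entries below THRESHOLD would then be ignored when photos are assigned to reviews.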

Configuration Parameters

BATCH_SIZE_TEXT = 64       # Text embedding batch size
BATCH_SIZE_IMAGE = 64      # Image embedding batch size
THRESHOLD = 0.15           # Minimum similarity score

2. Initial Cleaning (`02_initial_clean.ipynb`)

This notebook filters the aligned dataset to retain only reviews with associated images.

Processing Steps

Loads the aligned dataset from Stage 1
Reports dataset statistics (rows, columns, total images)
Filters out reviews with empty photo_ids
Saves cleaned dataset

Input

<YOUR_DATA_PATH>/yelp/dataset_aligned.json

Output

<YOUR_DATA_PATH>/yelp/data.json

Note: Update the input/output paths in the notebook to match your data directory.

Key Operations

# Keep only rows where photo_ids is not empty
df = df[df['photo_ids'].astype(bool)]
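
To see why this mask works, here is a small self-contained example on toy data (the rows are made up for illustration). It relies on Python treating an empty list as false, so `astype(bool)` marks reviews without photos as `False`.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "review_id": ["r1", "r2", "r3"],
        "photo_ids": [["p1"], [], ["p2", "p3"]],
    }
)

# bool([]) is False and bool(["p1"]) is True, so the mask drops unmatched reviews.
mask = df["photo_ids"].astype(bool)
filtered = df[mask]

# Total number of images across the remaining reviews.
total_images = int(filtered["photo_ids"].map(len).sum())
print(len(df), len(filtered), total_images)
```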

Requirements

Python Packages

torch
numpy
pandas
pillow
transformers
tqdm
torchvision

Hardware Requirements

GPU: Recommended for faster processing (uses CUDA if available)
Memory: Enough RAM to load the entire review dataset (several GB)
Storage: Space for model weights (~1.5 GB for SigLIP)

Usage

Running the Pipeline

Install dependencies (if not already installed):

pip install torch numpy pandas pillow transformers tqdm torchvision

Configure paths:
- Open 01_advanced_data_align.ipynb and update the CONFIGURATION section with your data paths
- Open 02_initial_clean.ipynb and update the INPUT_PATH and OUTPUT_PATH variables
- Open 04_sample_01.ipynb if you need to customize the helpfulness target or plotting options
- Open 03_data_split.ipynb if you need to change split ratios or input/output paths
Run Stage 1 - Data Alignment: Execute all cells in 01_advanced_data_align.ipynb
Run Stage 2 - Data Cleaning: Execute all cells in 02_initial_clean.ipynb
Run Stage 3 - Helpfulness Target + Sampling: Execute all cells in 04_sample_01.ipynb to create the helpful column (log1p + MinMax scaling on useful), filter heavy-engagement reviews, and visualize distributions. The notebook saves the result to /home/ranveer/mukul/data/yelp/dataset_01.json by default; change this hard-coded path to match your data directory.
Run Stage 4 - Dataset Split: Execute all cells in 03_data_split.ipynb to generate train_data.json, val_data.json, and test_data.json in /home/ranveer/mukul/data/yelp/data/. The input defaults to /home/ranveer/mukul/data/yelp/dataset_02.json, which differs from the Stage 3 default output (dataset_01.json), so set the input path to the file you want to split.
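
For orientation, a shuffled train/val/test split can be sketched with the standard library alone. The 80/10/10 ratios, the seed, and the function name `split_dataset` are assumptions made for illustration; the actual ratios are whatever 03_data_split.ipynb configures.

```python
import random


def split_dataset(records, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle records and split them into train/val/test lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_val = int(len(shuffled) * val_frac)
    n_test = int(len(shuffled) * test_frac)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test


# Toy records standing in for reviews.
records = [{"review_id": f"r{i}"} for i in range(100)]
train, val, test = split_dataset(records)
print(len(train), len(val), len(test))
```

Each list can then be written out with `json.dump` to the corresponding output file.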

Customization

To modify alignment behavior, adjust parameters in Stage 1:

THRESHOLD: Increase to be more selective, decrease to allow more matches
Max photos per review: Modify the condition in the greedy_assignment() function:
```
if len(review_assignments[r_idx]) >= 1:  # Change 1 to desired max
```
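
To show how THRESHOLD and the per-review cap interact, here is a hypothetical reconstruction of a greedy assignment in pure Python. It is not the notebook's code: the signature, the `max_per_review` parameter, and the rule that each photo goes to at most one review are all assumptions of this sketch.

```python
def greedy_assignment(sim, threshold=0.15, max_per_review=1):
    """Assign each photo to at most one review, best-scoring pairs first.

    sim[r][p] is the similarity between review r and photo p.
    Returns a dict mapping review index -> list of photo indices.
    """
    # Collect every candidate pair that clears the threshold.
    pairs = [
        (score, r_idx, p_idx)
        for r_idx, row in enumerate(sim)
        for p_idx, score in enumerate(row)
        if score >= threshold
    ]
    pairs.sort(key=lambda t: t[0], reverse=True)  # highest similarity first

    review_assignments = {r_idx: [] for r_idx in range(len(sim))}
    used_photos = set()
    for score, r_idx, p_idx in pairs:
        if p_idx in used_photos:
            continue  # photo already taken by a better match
        if len(review_assignments[r_idx]) >= max_per_review:
            continue  # review already holds its maximum number of photos
        review_assignments[r_idx].append(p_idx)
        used_photos.add(p_idx)
    return review_assignments


# Two reviews, three photos.
sim = [
    [0.40, 0.30, 0.05],
    [0.35, 0.10, 0.20],
]
print(greedy_assignment(sim))                    # {0: [0], 1: [2]}
print(greedy_assignment(sim, max_per_review=2))  # {0: [0, 1], 1: [2]}
```

Raising the cap lets review 0 keep its second-best photo, while the 0.05 and 0.10 scores never qualify because they fall below the threshold.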

Output Data Format

The final cleaned dataset (data.json) contains Yelp reviews with the following additional fields:

photo_ids: List of matched photo IDs
captions: List of corresponding photo captions
helpful: Normalized helpfulness score derived from useful via log1p + MinMax scaling (added in Stage 3 when running 04_sample_01.ipynb)

All original review fields are preserved (review_id, business_id, user_id, text, stars, date, useful, funny, cool).
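
The helpful score described above can be sketched as follows. This is a minimal pure-Python version of log1p followed by min-max scaling; the function name `helpfulness_scores` is hypothetical, and the notebook may use a library scaler instead.

```python
import math


def helpfulness_scores(useful_counts):
    """Map raw useful vote counts to [0, 1] via log1p then min-max scaling."""
    logged = [math.log1p(u) for u in useful_counts]
    lo, hi = min(logged), max(logged)
    if hi == lo:
        return [0.0 for _ in logged]  # all counts equal: no spread to scale
    return [(x - lo) / (hi - lo) for x in logged]


useful = [0, 1, 3, 15]
scores = helpfulness_scores(useful)
print([round(s, 3) for s in scores])  # [0.0, 0.25, 0.5, 1.0]
```

The log transform compresses the long tail of vote counts, so a review with 15 useful votes scores only four times higher than one with a single vote rather than fifteen times.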

Performance Notes

Stage 1 processes businesses that have both reviews and photos
Processing time depends on dataset size and GPU availability
Float16 precision is used to optimize VRAM usage on GPU
Corrupt or missing images are replaced with placeholder images instead of raising an error
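
A placeholder fallback of this kind might look like the sketch below, using Pillow. The function name `load_image_or_placeholder`, the blank black placeholder, and its 384x384 size are assumptions chosen for illustration, not details confirmed by the notebook.

```python
from PIL import Image

PLACEHOLDER_SIZE = (384, 384)  # assumed; matches the model's 384-pixel input


def load_image_or_placeholder(path):
    """Open an image as RGB, or return a blank placeholder if that fails."""
    try:
        with Image.open(path) as img:
            return img.convert("RGB")
    except OSError:
        # Covers missing files and most unreadable or truncated image data.
        return Image.new("RGB", PLACEHOLDER_SIZE)


img = load_image_or_placeholder("does_not_exist.jpg")
print(img.size, img.mode)
```

Returning a same-sized image keeps batch shapes consistent, though a blank placeholder is unlikely to clear the similarity threshold for any review.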

Dataset Statistics

After processing:

Only reviews with matched photos are retained
Total image count is available in Stage 2 output
Dataset dimensions are reported before and after filtering
