mukulkumar-codes/yelp_data_processing

Data Processing Pipeline

This folder contains the data processing pipeline for aligning Yelp review text with images, cleaning the dataset, creating a normalized helpfulness target, and producing train/val/test splits.

Overview

The pipeline consists of four stages:

  1. Advanced Data Alignment - Matches images to review text using SigLIP image-text similarity scores
  2. Initial Cleaning - Filters and prepares the aligned dataset for downstream tasks
  3. Helpfulness Target + Sampling - Creates a normalized helpfulness score and performs exploratory checks
  4. Dataset Split - Produces train/val/test JSON files for modeling

Pipeline Stages

1. Advanced Data Alignment (01_advanced_data_align.ipynb)

This notebook aligns images with review text, using the SigLIP model to match each photo with its most relevant review.

Key Features

  • Model: Google SigLIP (so400m-patch14-384), a vision-language model trained for image-text matching
  • Processing: Batch processing with GPU acceleration (float16 precision)
  • Algorithm: Greedy assignment based on cosine similarity scores
  • Threshold: 0.15 minimum similarity score to filter weak matches

Input Data

  • Review data: <YOUR_DATA_PATH>/yelp/text/yelp_academic_dataset_review.json
  • Photo metadata: <YOUR_DATA_PATH>/yelp/image/photos.json
  • Photo files: <YOUR_DATA_PATH>/yelp/image/photos/

Output

  • Aligned dataset: <YOUR_DATA_PATH>/yelp/dataset_aligned.json
  • Each review is enriched with matched photo_ids and captions

Note: Replace <YOUR_DATA_PATH> with your actual data directory path in the notebook configuration section.

Processing Steps

  1. Loads all reviews and photos into memory, grouped by business ID
  2. Generates text embeddings for reviews (batch size: 64)
  3. Generates image embeddings for photos (batch size: 64)
  4. Computes similarity matrix between text and image embeddings
  5. Assigns photos to reviews using greedy algorithm (max 1 photo per review by default)
  6. Saves aligned dataset with photo_ids and captions added to each review
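Steps 4 and 5 can be sketched in plain Python as follows. This is a minimal illustration, not the notebook's greedy_assignment() implementation: the function names cosine and greedy_assign, the max_per_review parameter, and the rule that each photo is used at most once are assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def greedy_assign(text_embs, image_embs, threshold=0.15, max_per_review=1):
    """Assign photos to reviews, best-scoring pairs first.

    Assumption for this sketch: a photo is assigned to at most one review.
    """
    # Score every (review, photo) pair.
    pairs = [
        (cosine(t, im), r_idx, p_idx)
        for r_idx, t in enumerate(text_embs)
        for p_idx, im in enumerate(image_embs)
    ]
    # Highest similarity first; the sort is stable, so ties keep their order.
    pairs.sort(key=lambda pair: -pair[0])

    assignments = {r_idx: [] for r_idx in range(len(text_embs))}
    used_photos = set()
    for score, r_idx, p_idx in pairs:
        if score < threshold:
            break  # every remaining pair is weaker than the threshold
        if p_idx in used_photos or len(assignments[r_idx]) >= max_per_review:
            continue
        assignments[r_idx].append(p_idx)
        used_photos.add(p_idx)
    return assignments

# Toy 2-D embeddings for one business: two reviews, three photos.
text_embs = [[1.0, 0.0], [0.0, 1.0]]
image_embs = [[0.9, 0.1], [0.1, 0.9], [-1.0, 0.0]]
result = greedy_assign(text_embs, image_embs)
```

In this toy case the third photo scores below the threshold against both reviews, so it stays unassigned and the corresponding review lists contain only the two strong matches.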

Configuration Parameters

BATCH_SIZE_TEXT = 64       # Text embedding batch size
BATCH_SIZE_IMAGE = 64      # Image embedding batch size
THRESHOLD = 0.15           # Minimum similarity score
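To illustrate how a batch size like the ones above is typically applied, here is a small sketch that slices a list into fixed-size batches. The helper name batched is hypothetical, and the notebook may batch its inputs differently.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

BATCH_SIZE_TEXT = 64
texts = [f"review {i}" for i in range(150)]

# The last batch is smaller when the total is not a multiple of the batch size.
sizes = [len(batch) for batch in batched(texts, BATCH_SIZE_TEXT)]
```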

2. Initial Cleaning (02_initial_clean.ipynb)

This notebook filters the aligned dataset to retain only reviews with associated images.

Processing Steps

  1. Loads the aligned dataset from Stage 1
  2. Reports dataset statistics (rows, columns, total images)
  3. Filters out reviews with empty photo_ids
  4. Saves cleaned dataset

Input

  • <YOUR_DATA_PATH>/yelp/dataset_aligned.json

Output

  • <YOUR_DATA_PATH>/yelp/data.json

Note: Update the input/output paths in the notebook to match your data directory.

Key Operations

# Keep only rows where photo_ids is not empty
df = df[df['photo_ids'].astype(bool)]
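The effect of this stage can be seen on a toy example. The sketch below uses plain Python records instead of a DataFrame so that it runs without pandas; the summarize helper and the sample records are made up for illustration.

```python
records = [
    {"review_id": "r1", "photo_ids": ["p1", "p2"], "captions": ["", "patio"]},
    {"review_id": "r2", "photo_ids": [], "captions": []},
    {"review_id": "r3", "photo_ids": ["p3"], "captions": ["menu"]},
]

def summarize(rows):
    """Report row count, column count, and total number of matched images."""
    columns = {key for row in rows for key in row}
    total_images = sum(len(row["photo_ids"]) for row in rows)
    return {"rows": len(rows), "columns": len(columns), "total_images": total_images}

before = summarize(records)

# An empty list is falsy, so this keeps only reviews with at least one photo.
cleaned = [row for row in records if row["photo_ids"]]

after = summarize(cleaned)
```

The image total is unchanged by the filter, because the dropped reviews had no photos to begin with.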

Requirements

Python Packages

torch
numpy
pandas
pillow
transformers
tqdm
torchvision

Hardware Requirements

  • GPU: Recommended for faster processing (uses CUDA if available)
  • Memory: Enough RAM to hold the entire review dataset (several GB)
  • Storage: Space for model weights (~1.5 GB for SigLIP)

Usage

Running the Pipeline

  1. Install dependencies (if not already installed):

    pip install torch numpy pandas pillow transformers tqdm torchvision
  2. Configure paths:

    • Open 01_advanced_data_align.ipynb and update the CONFIGURATION section with your data paths
    • Open 02_initial_clean.ipynb and update the INPUT_PATH and OUTPUT_PATH variables
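    • Open 04_sample_01.ipynb (Stage 3) if you need to customize the helpfulness target or plotting options
    • Open 03_data_split.ipynb (Stage 4) if you need to change split ratios or input/output paths. Note that the numeric prefixes of these two notebooks do not match their stage order.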
  3. Run Stage 1 - Data Alignment: Execute all cells in 01_advanced_data_align.ipynb

  4. Run Stage 2 - Data Cleaning: Execute all cells in 02_initial_clean.ipynb

  5. Run Stage 3 - Helpfulness Target + Sampling: Execute all cells in 04_sample_01.ipynb to create the helpful column (log1p + MinMax scaling on useful), filter heavy-engagement reviews, and visualize distributions. The notebook saves the result to /home/ranveer/mukul/data/yelp/dataset_01.json by default; change this path to match your data directory.

  6. Run Stage 4 - Dataset Split: Execute all cells in 03_data_split.ipynb to generate /home/ranveer/mukul/data/yelp/data/train_data.json, val_data.json, and test_data.json from the input dataset (defaults to /home/ranveer/mukul/data/yelp/dataset_02.json). This default input differs from the Stage 3 default output (dataset_01.json), so check the input path before running.
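The Stage 4 split can be sketched as a shuffled partition of the records. The split ratios used by 03_data_split.ipynb are not stated here, so the 80/10/10 fractions, the seed, and the function name split_dataset are all assumptions for illustration.

```python
import random

def split_dataset(records, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle a copy of the records and cut it into train/val/test lists.

    The fractions and seed are illustrative defaults, not the notebook's.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_val = int(len(shuffled) * val_frac)
    n_test = int(len(shuffled) * test_frac)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test

reviews = [{"review_id": f"r{i}"} for i in range(100)]
train, val, test = split_dataset(reviews)
```

Each of the three lists could then be written out with json.dump to produce the train, val, and test files.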

Customization

To modify alignment behavior, adjust parameters in Stage 1:

  • THRESHOLD: Increase to be more selective, decrease to allow more matches
  • Max photos per review: Modify the condition in greedy_assignment() function:
    if len(review_assignments[r_idx]) >= 1:  # Change 1 to desired max

Output Data Format

The processed dataset contains Yelp reviews with the following additional fields. The Stage 2 output (data.json) carries photo_ids and captions; helpful appears only after Stage 3:

  • photo_ids: List of matched photo IDs
  • captions: List of corresponding photo captions
  • helpful: Normalized helpfulness score derived from useful via log1p + MinMax scaling (added in Stage 3 when running 04_sample_01.ipynb)

All original review fields are preserved (review_id, business_id, user_id, text, stars, date, useful, funny, cool).
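The helpful field described above can be sketched as a log1p transform followed by min-max scaling to the range [0, 1]. This is a plain-Python illustration under assumptions: the notebook may use a library scaler instead, and the function name helpfulness_scores is hypothetical.

```python
import math

def helpfulness_scores(useful_counts):
    """Map raw useful counts to [0, 1] via log1p, then min-max scaling."""
    logged = [math.log1p(count) for count in useful_counts]
    lo, hi = min(logged), max(logged)
    if hi == lo:
        # All values identical: scaling is undefined, so return zeros.
        return [0.0 for _ in logged]
    return [(value - lo) / (hi - lo) for value in logged]

# log1p of 0, 1, 3, 15 is ln 1, ln 2, ln 4, ln 16, so the scaled scores
# come out at approximately 0, 0.25, 0.5, and 1.
scores = helpfulness_scores([0, 1, 3, 15])
```

The log1p step compresses the long tail of highly voted reviews before scaling, so a few very popular reviews do not push every other score toward zero.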

Performance Notes

  • Stage 1 processes businesses that have both reviews and photos
  • Processing time depends on dataset size and GPU availability
  • Float16 precision is used to optimize VRAM usage on GPU
  • Corrupt or missing images are handled gracefully with placeholder images
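The first note, that Stage 1 only processes businesses with both reviews and photos, can be sketched as grouping each file by business ID and intersecting the keys. This sketch assumes both files are newline-delimited JSON whose records carry a business_id field; the helper name group_by_business and the sample lines are made up for illustration.

```python
import json
from collections import defaultdict

def group_by_business(json_lines):
    """Group newline-delimited JSON records by their business_id field."""
    groups = defaultdict(list)
    for line in json_lines:
        line = line.strip()
        if line:  # skip blank lines
            record = json.loads(line)
            groups[record["business_id"]].append(record)
    return groups

review_lines = [
    '{"review_id": "r1", "business_id": "b1", "text": "Great tacos"}',
    '{"review_id": "r2", "business_id": "b2", "text": "Slow service"}',
]
photo_lines = [
    '{"photo_id": "p1", "business_id": "b1", "caption": "tacos"}',
    '{"photo_id": "p2", "business_id": "b3", "caption": "storefront"}',
]

reviews_by_business = group_by_business(review_lines)
photos_by_business = group_by_business(photo_lines)

# Only businesses present in both groupings are candidates for alignment.
shared = sorted(reviews_by_business.keys() & photos_by_business.keys())
```

With real data, the same function could be fed an open file handle, since iterating over a file yields one line at a time.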

Dataset Statistics

After processing:

  • Only reviews with matched photos are retained
  • Total image count is available in Stage 2 output
  • Dataset dimensions are reported before and after filtering
