FuCongResearchSquad/ReSID

ReSID

This repository provides a PyTorch reference implementation of the main models and training procedures described in our paper:

Yu Liang*, Zhongjin Zhang*, Yuxuan Zhu, Kerui Zhang, Zhiluohan Guo, Wenhang Zhou, Zonqi Yang, Kangle Wu, Yabo Ni, Anxiang Zeng, Cong Fu, Jianxin Wang, and Jiazhi Xia. Rethinking Generative Recommender Tokenizer: Recsys-Native Encoding and Semantic Quantization Beyond LLMs.

Paper & Resources

Overview

We propose ReSID, a recommendation-native, principled SID framework that rethinks representation learning and quantization from the perspective of information preservation and sequential predictability, without relying on LLMs. ReSID consists of two components: (i) Field-Aware Masked Auto-Encoding (FAMAE), which learns predictive-sufficient item representations from structured features, and (ii) Globally Aligned Orthogonal Quantization (GAOQ), which produces compact and predictable SID sequences by jointly reducing semantic ambiguity and prefix-conditional uncertainty.
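To make the idea of field-level masking concrete, here is a minimal sketch in plain Python. It is not the FAMAE implementation from this repository: the field names, the `[MASK]` placeholder, and the `mask_fields` helper are all hypothetical, and the sketch covers only the masking step (hide some structured fields, keep them as reconstruction targets), not the encoder or the training loss.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder value for a hidden field


def mask_fields(item, mask_ratio=0.5, rng=None):
    """Hide a random subset of an item's fields.

    Returns (masked_item, targets): the item with hidden fields replaced by
    MASK, and the original values a model would be asked to reconstruct.
    """
    rng = rng or random.Random()
    fields = list(item)
    # Always hide at least one field so there is something to predict.
    n_mask = max(1, round(len(fields) * mask_ratio))
    hidden = set(rng.sample(fields, n_mask))
    masked = {f: (MASK if f in hidden else v) for f, v in item.items()}
    targets = {f: item[f] for f in hidden}
    return masked, targets


# Hypothetical item with four structured fields (names are illustrative only).
item = {"category": "Guitars", "brand": "Fender", "price_bucket": 3, "store": "A"}
masked, targets = mask_fields(item, mask_ratio=0.5, rng=random.Random(0))
print(masked)
print(targets)
```

In a masked auto-encoding setup, a model would receive `masked` as input and be trained to predict `targets`. See the paper and the code under model/ for how ReSID actually does this.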

[Figure: overview of ReSID (image not included)]

Project Structure

The structure of this repository is as follows:

.
├── config/                   # All *.yaml configuration files for the pipeline
├── dataset/                  # Amazon-2023 review dataset processing code / downloaded dataset folder
├── model/                    # Model implementations
├── logger.py                 # Logging utilities for printing runtime outputs
├── main.py                   # Main entry point for training and evaluation
├── metrics.py                # Evaluation-related code
├── requirements.txt          # List of required Python packages and dependencies
├── run_pipelines.py          # One-click script to run the full ReSID pipeline
├── trainer.py                # Training script
├── utils.py                  # Training utilities, mainly for data loading
└── README.md                 # This file

Experiments

Setup

We recommend installing the dependencies listed in requirements.txt. This setup has been tested on Ubuntu 18.04 with CUDA 12.4 and Python 3.12.

pip3 install -r requirements.txt
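After installation, a quick sanity check can confirm that the interpreter matches the tested Python version and that PyTorch can see a GPU. The helper below is a hypothetical sketch and not part of this repository. It treats Python 3.12 as the tested version rather than a hard requirement.

```python
import importlib.util
import sys


def check_environment(tested_python=(3, 12)):
    """Report whether the local setup resembles the tested configuration."""
    report = {
        # True only if the major.minor version equals the tested one.
        "python_matches_tested": sys.version_info[:2] == tested_python,
        # find_spec returns None when the package is not installed.
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": False,
    }
    if report["torch_installed"]:
        import torch

        report["cuda_available"] = torch.cuda.is_available()
    return report


print(check_environment())
```

If `cuda_available` is false, the `--device cuda:0` setting used in the training command will likely not work on this machine.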

Data

Option 1: Download the processed dataset (recommended)

Download the processed dataset from Hugging Face:

After downloading, place the extracted dataset folder directly under dataset/, e.g.:

ReSID/
└── dataset/
    └── Musical_Instruments/   # the extracted dataset directory
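The layout above can be verified before training with a few lines of Python. This is a hypothetical helper, not something shipped with the repository. It only checks that the dataset directory exists in the expected place, and makes no assumption about the files inside it.

```python
from pathlib import Path


def dataset_is_in_place(repo_root, dataset_name):
    """Return True if <repo_root>/dataset/<dataset_name>/ exists as a directory."""
    return (Path(repo_root) / "dataset" / dataset_name).is_dir()


# Run from the repository root; prints False if the folder is missing.
print(dataset_is_in_place(".", "Musical_Instruments"))
```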

Option 2: Reproduce the dataset from raw Amazon-2023 reviews (from scratch)

  1. Download the raw Amazon-2023 review subsets and statistics:

    bash dataset/download_amazon_2023.sh
    bash dataset/download_amazon_2023_statistics.sh
  2. Preprocess the downloaded data:

    python dataset/data_process.py

After processing, the generated dataset will be saved under dataset/ (as configured in the scripts).

Training

To run ReSID, use the following command:

python run_pipelines.py --dataset Musical_Instruments --device cuda:0

Set --dataset to the name of the dataset you want to run.
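To run the pipeline on several datasets in sequence, the command can be wrapped in a small driver script. The sketch below is hypothetical and uses only the `--dataset` and `--device` flags shown above. `Musical_Instruments` is the only dataset name confirmed by this README, so any other names you add are your own assumption. By default it only prints the commands (`dry_run=True`).

```python
import subprocess
import sys


def build_command(dataset, device="cuda:0"):
    """Build the run_pipelines.py invocation for one dataset."""
    return [sys.executable, "run_pipelines.py",
            "--dataset", dataset, "--device", device]


def run_all(datasets, device="cuda:0", dry_run=True):
    """Run (or, with dry_run=True, just print) the pipeline per dataset."""
    for name in datasets:
        cmd = build_command(name, device)
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the loop if a run exits with an error.
            subprocess.run(cmd, check=True)


run_all(["Musical_Instruments"])
```

Call `run_all([...], dry_run=False)` from the repository root to actually launch the runs.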

Results

[Figure: experimental results (image not included)]

Citation

If you find this repository helpful, please consider citing our paper:

@misc{ReSID,
      title={Rethinking Generative Recommender Tokenizer: Recsys-Native Encoding and Semantic Quantization Beyond LLMs}, 
      author={Yu Liang and Zhongjin Zhang and Yuxuan Zhu and Kerui Zhang and Zhiluohan Guo and Wenhang Zhou and Zonqi Yang and Kangle Wu and Yabo Ni and Anxiang Zeng and Cong Fu and Jianxin Wang and Jiazhi Xia},
      year={2026},
      eprint={2602.02338},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2602.02338}, 
}
