Retrieval Service Fails with Faiss CUBLAS Error During PPO Training: CUBLAS_STATUS_EXECUTION_FAILED #157

@Ymmm330

Description

The retrieval service fails with a Faiss CUBLAS error during PPO training.
I allocated five A800 80GB GPUs to retrieval and six to training, but when I run train_ppo.sh the retrieval service crashes with a "cublas failed (13)" error, which also terminates the training process. The error occurs during matrix multiplication in Faiss. I followed the README instructions carefully and verified that GPU status is normal. Could you please help identify whether this is caused by memory pressure, a configuration problem, or something else? Thank you for your excellent work and for any guidance you can provide!
The error screenshot is as follows:

Image
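
To help separate a memory problem from a configuration problem, a minimal sketch like the one below could log per-GPU memory while the retrieval service is running. It assumes `nvidia-smi` is available on the PATH. The helper names (`parse_gpu_memory`, `gpus_below_headroom`, `query_gpu_memory`) and the 4096 MiB headroom threshold are hypothetical choices for illustration and are not part of the repository.

```python
import subprocess

# Standard nvidia-smi query: one CSV row per GPU, values in MiB, no header or units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse rows of the form 'index, used, total' into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, used, total = (int(field.strip()) for field in line.split(","))
        rows.append(
            {
                "index": index,
                "used_mib": used,
                "total_mib": total,
                "free_mib": total - used,
            }
        )
    return rows


def gpus_below_headroom(rows, min_free_mib):
    """Return the indices of GPUs whose free memory is below min_free_mib."""
    return [row["index"] for row in rows if row["free_mib"] < min_free_mib]


def query_gpu_memory():
    """Run nvidia-smi once and return the parsed per-GPU memory rows."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)


if __name__ == "__main__":
    rows = query_gpu_memory()
    for row in rows:
        print(row)
    # 4096 MiB is an arbitrary illustrative threshold, not a Faiss requirement.
    print("GPUs with little free memory:", gpus_below_headroom(rows, 4096))
```

Running this shortly before the crash on the GPUs assigned to retrieval would show whether they are close to full. If they are not, a configuration or driver cause becomes more likely than memory exhaustion.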
