I'm encountering an issue where the retrieval service fails with a Faiss cuBLAS error during PPO training.
I allocated 5 A800 80GB GPUs for retrieval and 6 for training, but when I run train_ppo.sh the retrieval service crashes with a "cublas failed (13)" error, which also terminates the training process. The error occurs during matrix multiplication operations in Faiss. I've followed the README instructions carefully and verified that the GPU status is normal. Could you please help identify whether this is caused by memory issues, configuration problems, or other factors? Thank you for your excellent work and any guidance you can provide!
The error screenshot is as follows:
