Demo on reward models

This is more of a suggestion than an issue. I think it would broaden the scope of the repo considerably to add an example that applies posteriors to improve the robustness of a Bradley-Terry reward model (for example, using Llama-3-8B-Instruct and the hh-rlhf dataset).
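
For concreteness, the Bradley-Terry objective I have in mind models the probability that the chosen response beats the rejected one as `sigmoid(r_chosen - r_rejected)`, and trains on the negative log of that. Below is a minimal pure-Python sketch of the loss only. The function names are hypothetical and not part of posteriors, and the rewards are assumed to be scalar outputs of a reward head; how posteriors would wrap this is left open.

```python
import math


def bradley_terry_nll(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the chosen response is preferred.

    Under Bradley-Terry, P(chosen > rejected) = sigmoid(r_chosen - r_rejected).
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) equals softplus(-margin); this form avoids overflow
    # for large |margin|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def mean_bradley_terry_nll(pairs) -> float:
    """Average loss over an iterable of (reward_chosen, reward_rejected) pairs."""
    pairs = list(pairs)
    return sum(bradley_terry_nll(c, r) for c, r in pairs) / len(pairs)


if __name__ == "__main__":
    # Toy rewards, made up purely for illustration.
    toy_pairs = [(1.2, 0.3), (0.1, 0.4), (2.0, -1.0)]
    print(mean_bradley_terry_nll(toy_pairs))
```

In a real demo the scalar rewards would come from the language model with a reward head, and the loss would be computed in batch on tensors rather than on Python floats.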