Currently, we only evaluate pass@1. We are also considering the following evaluation metrics:

* CodeBERTScore ([paper link](https://arxiv.org/abs/2302.05527))
* Calibration (see the [HELM paper](https://arxiv.org/pdf/2211.09110.pdf))
* Robustness (see the [HELM paper](https://arxiv.org/pdf/2211.09110.pdf))
* CodeScore ([paper link](https://arxiv.org/pdf/2301.09043.pdf))
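
For reference, pass@1 is the k = 1 case of pass@k. A minimal sketch of the commonly used unbiased estimator, 1 - C(n - c, k) / C(n, k), is shown below. It assumes `n` samples are generated per problem and `c` of them pass the tests. The function name `pass_at_k` is hypothetical and this is not necessarily how this project computes the metric.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem (illustrative helper, assumed name).

    n: number of samples generated for the problem
    c: number of those samples that pass all tests
    k: number of attempts allowed
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples fail, every size-k subset contains a passing sample.
    if n - c < k:
        return 1.0
    # 1 minus the probability that a random size-k subset contains only failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # With k = 1 the estimator reduces to the fraction of passing samples, c / n.
    print(pass_at_k(n=10, c=3, k=1))
```

The benchmark-level score would then be the mean of this per-problem value over all problems.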