Added a notebook evaluating recommendations produced by different embedding models. #136
Conversation
Pull request overview
Adds cookbook documentation entries for new evaluation notebooks that assess embedding-model impact on recommendation quality, aligning with the repo’s existing “cookbook” workflow examples.
Changes:
- Adds a link to an intrinsic embedding evaluation notebook.
- Adds a link to an extrinsic red-team evaluation notebook comparing embedding models on recommendation outcomes.
2. [Evaluate Embedding Model](./evaluate_embedding_model.ipynb) - Intrinsic embedding quality metrics (inter-cluster distance, misclassification rate, intra-cluster K-means distance).
3. [Embedding Model Comparison: Red Team Evaluation](./embeddings_comparison_red_team.ipynb) - Extrinsic task-level evaluation comparing how different embedding models affect recommendation quality using the red team dataset. Computes accuracy, precision, recall, and F1-score for add and remove recommendations.
The new notebook entries in the Evaluation section don’t include the existing “Open In Colab” badge/link that all the other cookbook notebooks use. For consistency and easier access, consider adding the same Colab badge links for these two notebooks as well.
Suggested change:
2. [Evaluate Embedding Model](./evaluate_embedding_model.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/IBM/responsible-prompting-api/blob/develop/cookbook/evaluate_embedding_model.ipynb) - Intrinsic embedding quality metrics (inter-cluster distance, misclassification rate, intra-cluster K-means distance).
3. [Embedding Model Comparison: Red Team Evaluation](./embeddings_comparison_red_team.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/IBM/responsible-prompting-api/blob/develop/cookbook/embeddings_comparison_red_team.ipynb) - Extrinsic task-level evaluation comparing how different embedding models affect recommendation quality using the red team dataset. Computes accuracy, precision, recall, and F1-score for add and remove recommendations.
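As a rough illustration of the intrinsic metrics the first notebook names (inter-cluster distance, misclassification rate, intra-cluster K-means distance), here is a minimal pure-Python sketch on toy 2-D "embeddings". All data and function names are hypothetical; the notebook's actual implementation may differ.

```python
import math

def centroid(vectors):
    """Mean vector of a cluster."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def intra_cluster_distance(cluster):
    """Mean distance of a cluster's members to their own centroid (K-means-style)."""
    c = centroid(cluster)
    return sum(euclidean(v, c) for v in cluster) / len(cluster)

def inter_cluster_distance(cluster_a, cluster_b):
    """Distance between two cluster centroids."""
    return euclidean(centroid(cluster_a), centroid(cluster_b))

def misclassification_rate(clusters):
    """Fraction of embeddings that are closer to another cluster's centroid
    than to their own (nearest-centroid misclassification)."""
    cents = [centroid(c) for c in clusters]
    wrong = total = 0
    for own, members in enumerate(clusters):
        for v in members:
            total += 1
            nearest = min(range(len(cents)), key=lambda k: euclidean(v, cents[k]))
            if nearest != own:
                wrong += 1
    return wrong / total

# Toy 2-D "embeddings" for two well-separated value clusters (illustrative only)
safe = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]]
harmful = [[1.0, 1.0], [0.9, 1.0], [1.0, 0.9]]
print(misclassification_rate([safe, harmful]))  # → 0.0: no point crosses clusters
```

Intuitively, a good embedding model should yield large inter-cluster distances, small intra-cluster distances, and a low misclassification rate on labeled sentence clusters.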
…edding models. Signed-off-by: ArionDas <ariondasad@gmail.com>
Force-pushed cfe7409 to b58365e.
@santanavagner It has metrics + plots for different models and subtypes. Let me know if you want to extend to more models or datasets.

Any updates on this?

Hi Arion, @ArionDas I will review this PR and bring it up when I meet with the team this month. I think the notebook is good as is; no need to extend to other models/datasets for now. Thank you for your patience.
Embedding Recommendation Quality: Red Team Evaluation
Problem
The recommender system defaults to `all-MiniLM-L6-v2` (384-dim) for cosine-similarity-based prompt recommendations. No systematic comparison exists to quantify how the choice of embedding model affects recommendation quality (additions of value-aligned sentences, removals of harmful sentences).

Metrics
Evaluation follows the methodology from *Can LLMs Recommend More Responsible Prompts?* (IUI '25). TP/FP/TN/FN are classified separately for add and remove recommendations:
Derived: Accuracy, Precision, Recall, F1-Score.
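The derived metrics follow directly from the per-type confusion counts. A minimal sketch, with illustrative counts that are not taken from the notebook:

```python
def derived_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F1 from a TP/FP/TN/FN quadruple."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Confusion counts are kept separate for "add" and "remove" recommendations,
# since the two error modes differ (missing a value-aligned addition vs.
# failing to flag a harmful sentence for removal). Counts below are made up.
add_scores = derived_metrics(tp=30, fp=5, tn=50, fn=15)
remove_scores = derived_metrics(tp=20, fp=10, tn=60, fn=10)
print(add_scores["f1"])  # → 0.75 (precision 6/7, recall 2/3)
```

Guarding each ratio against a zero denominator keeps the sketch well-defined when a model produces no positives of a given type.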
Ground truth is derived from the red team dataset's 8 prompt `(Test_Type, Test_SubType)` combinations across 40 adversarial prompts (`EmbeddedAmbiguity`, `LocalAmbiguity`, `Valence`, `Novelty`).

Results