
Added a notebook evaluating recommendations produced by different embedding models. #136

Open

ArionDas wants to merge 1 commit into IBM:main from ArionDas:embedding_recommendation_comparison

Conversation


@ArionDas ArionDas commented Feb 28, 2026

Embedding Recommendation Quality: Red Team Evaluation

Problem

The recommender system defaults to all-MiniLM-L6-v2 (384-dim) for cosine similarity-based prompt recommendations. No systematic comparison exists to quantify how embedding model choice affects recommendation quality (additions of value-aligned sentences, removals of harmful sentences).
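The cosine-similarity scoring at the heart of the recommender can be sketched with toy vectors. The 384-dim size matches all-MiniLM-L6-v2 as named above; the function and the random vectors here are illustrative, not the repo's actual code:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 384-dim vectors standing in for all-MiniLM-L6-v2 embeddings.
rng = np.random.default_rng(0)
prompt_vec = rng.normal(size=384)
candidate_vec = prompt_vec + rng.normal(scale=0.1, size=384)  # near-duplicate of the prompt
unrelated_vec = rng.normal(size=384)                          # independent draw
```

A recommendation fires when the similarity between a prompt and a candidate sentence crosses the add/remove threshold, so the near-duplicate scores much higher than the unrelated vector.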

Metrics

Evaluation follows the methodology from Can LLMs Recommend More Responsible Prompts? (IUI '25). TP/FP/TN/FN are classified separately for add and remove:

  • Add TP: recommendation produced when value-aligned addition is needed
  • Add FP: recommendation produced when none is needed
  • Remove TP: removal flagged when harmful content exists
  • Remove FP: removal flagged on benign content

Derived: Accuracy, Precision, Recall, F1-Score.
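The derived metrics follow directly from the per-action counts. A minimal sketch (the function name is illustrative; the notebook computes these separately for add and remove):

```python
def prf1(tp: int, fp: int, tn: int, fn: int):
    """Accuracy, precision, recall, and F1 from TP/FP/TN/FN counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1
```

For example, 2 add TPs, 1 add FP, 1 add TN, and 0 add FNs give accuracy 0.75, precision 2/3, recall 1.0, and F1 0.8.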

Ground truth is derived from the red team dataset's 8 (PromptTest_Type, Test_SubType) combinations across 40 adversarial prompts (EmbeddedAmbiguity, LocalAmbiguity, Valence, Novelty).

Results

Thresholds: add 0.3–0.5, remove 0.3–0.5

| Model | Dim | Add Acc | Add Prec | Add Rec | Add F1 | Remove Acc | Remove Prec | Remove Rec | Remove F1 |
|---|---|---|---|---|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | 0.45 | 0.68 | 0.50 | 0.58 | 0.60 | 0.46 | 0.40 | 0.43 |
| bge-large-en-v1.5 | 1024 | 0.75 | 0.75 | 1.00 | 0.86 | 0.38 | 0.38 | 1.00 | 0.55 |
| multilingual-e5-large | 1024 | 0.25 | 0.00 | 0.00 | 0.00 | 0.38 | 0.38 | 1.00 | 0.55 |

No existing logic modified. Notebook reuses recommend_prompt(), populate_json(), get_embedding_func() from recommendation_handler.py. All thresholds and model configs are parameterized. Adding a new model requires one dict entry in MODEL_CONFIGS.
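A hypothetical shape for such a `MODEL_CONFIGS` entry (the key names and values here are assumptions for illustration, not the notebook's actual schema):

```python
# Illustrative MODEL_CONFIGS: each model is one dict entry, so adding a
# new model to the comparison only requires appending another entry.
MODEL_CONFIGS = {
    "all-MiniLM-L6-v2": {
        "model_id": "sentence-transformers/all-MiniLM-L6-v2",
        "dim": 384,
        "add_threshold": 0.3,
        "remove_threshold": 0.3,
    },
    "bge-large-en-v1.5": {
        "model_id": "BAAI/bge-large-en-v1.5",
        "dim": 1024,
        "add_threshold": 0.3,
        "remove_threshold": 0.3,
    },
}
```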

Copilot AI review requested due to automatic review settings February 28, 2026 22:45

Copilot AI left a comment


Pull request overview

Adds cookbook documentation entries for new evaluation notebooks that assess embedding-model impact on recommendation quality, aligning with the repo’s existing “cookbook” workflow examples.

Changes:

  • Add links to an intrinsic embedding evaluation notebook.
  • Add links to an extrinsic red-team evaluation notebook comparing embedding models on recommendation outcomes.


Comment on lines +17 to +18
2. [Evaluate Embedding Model](./evaluate_embedding_model.ipynb) - Intrinsic embedding quality metrics (inter-cluster distance, misclassification rate, intra-cluster K-means distance).
3. [Embedding Model Comparison: Red Team Evaluation](./embeddings_comparison_red_team.ipynb) - Extrinsic task-level evaluation comparing how different embedding models affect recommendation quality using the red team dataset. Computes accuracy, precision, recall, and F1-score for add and remove recommendations.

Copilot AI Feb 28, 2026


The new notebook entries in the Evaluation section don’t include the existing “Open In Colab” badge/link that all the other cookbook notebooks use. For consistency and easier access, consider adding the same Colab badge links for these two notebooks as well.

Suggested change
2. [Evaluate Embedding Model](./evaluate_embedding_model.ipynb) - Intrinsic embedding quality metrics (inter-cluster distance, misclassification rate, intra-cluster K-means distance).
3. [Embedding Model Comparison: Red Team Evaluation](./embeddings_comparison_red_team.ipynb) - Extrinsic task-level evaluation comparing how different embedding models affect recommendation quality using the red team dataset. Computes accuracy, precision, recall, and F1-score for add and remove recommendations.
2. [Evaluate Embedding Model](./evaluate_embedding_model.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/IBM/responsible-prompting-api/blob/develop/cookbook/evaluate_embedding_model.ipynb) - Intrinsic embedding quality metrics (inter-cluster distance, misclassification rate, intra-cluster K-means distance).
3. [Embedding Model Comparison: Red Team Evaluation](./embeddings_comparison_red_team.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/IBM/responsible-prompting-api/blob/develop/cookbook/embeddings_comparison_red_team.ipynb) - Extrinsic task-level evaluation comparing how different embedding models affect recommendation quality using the red team dataset. Computes accuracy, precision, recall, and F1-score for add and remove recommendations.

…edding models.

Signed-off-by: ArionDas <ariondasad@gmail.com>
@ArionDas ArionDas force-pushed the embedding_recommendation_comparison branch from cfe7409 to b58365e on February 28, 2026 22:47

ArionDas commented Mar 1, 2026

@santanavagner
I've added a notebook for evaluating recommendations from different embedding models, in relation to issue #100.

It includes metrics and plots for the different models and subtypes. The dataset used is the red team dataset already present in the repo.

Let me know if you'd like me to extend this to more models or datasets.
Thank you.

cc: @cassiasamp @Mystic-Slice


ArionDas commented Mar 6, 2026

Any updates on this?
@santanavagner

@Mystic-Slice
Collaborator

Hi @ArionDas, sorry for the delayed reply.

I will review this PR and bring it up when I meet with the team this month.

I think the notebook is good as is. No need to extend to other models/datasets for now. Thank you for your patience.
