
Added a notebook evaluating recommendations produced by different embedding models. #136

Open

ArionDas wants to merge 1 commit into IBM:main from ArionDas:embedding_recommendation_comparison

Conversation


@ArionDas ArionDas commented Feb 28, 2026

Embedding Recommendation Quality: Red Team Evaluation

Problem

The recommender system defaults to all-MiniLM-L6-v2 (384-dim) for cosine similarity-based prompt recommendations. No systematic comparison exists to quantify how embedding model choice affects recommendation quality (additions of value-aligned sentences, removals of harmful sentences).
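The cosine-similarity scoring at the heart of the recommender can be sketched with toy vectors. The 384-dim size matches all-MiniLM-L6-v2 as named above; the function and the random vectors here are illustrative, not the repo's actual code:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 384-dim vectors standing in for all-MiniLM-L6-v2 embeddings.
rng = np.random.default_rng(0)
prompt_vec = rng.normal(size=384)
candidate_vec = prompt_vec + rng.normal(scale=0.1, size=384)  # near-duplicate of the prompt
unrelated_vec = rng.normal(size=384)                          # independent draw
```

A recommendation fires when the similarity between a prompt and a candidate sentence crosses the add/remove threshold, so the near-duplicate scores much higher than the unrelated vector.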

Metrics

Evaluation follows the methodology from Can LLMs Recommend More Responsible Prompts? (IUI '25). TP/FP/TN/FN are classified separately for add and remove:

  • Add TP: recommendation produced when value-aligned addition is needed
  • Add FP: recommendation produced when none is needed
  • Remove TP: removal flagged when harmful content exists
  • Remove FP: removal flagged on benign content

Derived: Accuracy, Precision, Recall, F1-Score.
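The derived metrics follow directly from the per-action counts. A minimal sketch (the function name is illustrative; the notebook computes these separately for add and remove):

```python
def prf1(tp: int, fp: int, tn: int, fn: int):
    """Accuracy, precision, recall, and F1 from TP/FP/TN/FN counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1
```

For example, 2 add TPs, 1 add FP, 1 add TN, and 0 add FNs give accuracy 0.75, precision 2/3, recall 1.0, and F1 0.8.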

Ground truth is derived from the red team dataset's 8 (PromptTest_Type, Test_SubType) combinations across 40 adversarial prompts (EmbeddedAmbiguity, LocalAmbiguity, Valence, Novelty).

Results

Thresholds: add 0.3–0.5, remove 0.3–0.5

| Model | Dim | Add Acc | Add Prec | Add Rec | Add F1 | Remove Acc | Remove Prec | Remove Rec | Remove F1 |
|---|---|---|---|---|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | 0.45 | 0.68 | 0.50 | 0.58 | 0.60 | 0.46 | 0.40 | 0.43 |
| bge-large-en-v1.5 | 1024 | 0.75 | 0.75 | 1.00 | 0.86 | 0.38 | 0.38 | 1.00 | 0.55 |
| multilingual-e5-large | 1024 | 0.25 | 0.00 | 0.00 | 0.00 | 0.38 | 0.38 | 1.00 | 0.55 |

No existing logic modified. Notebook reuses recommend_prompt(), populate_json(), get_embedding_func() from recommendation_handler.py. All thresholds and model configs are parameterized. Adding a new model requires one dict entry in MODEL_CONFIGS.
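A hypothetical shape for such a `MODEL_CONFIGS` entry (the key names and values here are assumptions for illustration, not the notebook's actual schema):

```python
# Illustrative MODEL_CONFIGS: each model is one dict entry, so adding a
# new model to the comparison only requires appending another entry.
MODEL_CONFIGS = {
    "all-MiniLM-L6-v2": {
        "model_id": "sentence-transformers/all-MiniLM-L6-v2",
        "dim": 384,
        "add_threshold": 0.3,
        "remove_threshold": 0.3,
    },
    "bge-large-en-v1.5": {
        "model_id": "BAAI/bge-large-en-v1.5",
        "dim": 1024,
        "add_threshold": 0.3,
        "remove_threshold": 0.3,
    },
}
```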

Copilot AI review requested due to automatic review settings February 28, 2026 22:45

Copilot AI left a comment


Pull request overview

Adds cookbook documentation entries for new evaluation notebooks that assess embedding-model impact on recommendation quality, aligning with the repo’s existing “cookbook” workflow examples.

Changes:

  • Add links to an intrinsic embedding evaluation notebook.
  • Add links to an extrinsic red-team evaluation notebook comparing embedding models on recommendation outcomes.


Comment on lines +17 to +18
2. [Evaluate Embedding Model](./evaluate_embedding_model.ipynb) - Intrinsic embedding quality metrics (inter-cluster distance, misclassification rate, intra-cluster K-means distance).
3. [Embedding Model Comparison: Red Team Evaluation](./embeddings_comparison_red_team.ipynb) - Extrinsic task-level evaluation comparing how different embedding models affect recommendation quality using the red team dataset. Computes accuracy, precision, recall, and F1-score for add and remove recommendations.

Copilot AI Feb 28, 2026


The new notebook entries in the Evaluation section don’t include the existing “Open In Colab” badge/link that all the other cookbook notebooks use. For consistency and easier access, consider adding the same Colab badge links for these two notebooks as well.

Suggested change
2. [Evaluate Embedding Model](./evaluate_embedding_model.ipynb) - Intrinsic embedding quality metrics (inter-cluster distance, misclassification rate, intra-cluster K-means distance).
3. [Embedding Model Comparison: Red Team Evaluation](./embeddings_comparison_red_team.ipynb) - Extrinsic task-level evaluation comparing how different embedding models affect recommendation quality using the red team dataset. Computes accuracy, precision, recall, and F1-score for add and remove recommendations.
2. [Evaluate Embedding Model](./evaluate_embedding_model.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/IBM/responsible-prompting-api/blob/develop/cookbook/evaluate_embedding_model.ipynb) - Intrinsic embedding quality metrics (inter-cluster distance, misclassification rate, intra-cluster K-means distance).
3. [Embedding Model Comparison: Red Team Evaluation](./embeddings_comparison_red_team.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/IBM/responsible-prompting-api/blob/develop/cookbook/embeddings_comparison_red_team.ipynb) - Extrinsic task-level evaluation comparing how different embedding models affect recommendation quality using the red team dataset. Computes accuracy, precision, recall, and F1-score for add and remove recommendations.

…edding models.

Signed-off-by: ArionDas <ariondasad@gmail.com>
@ArionDas ArionDas force-pushed the embedding_recommendation_comparison branch from cfe7409 to b58365e on February 28, 2026 22:47

ArionDas commented Mar 1, 2026

@santanavagner
I've added a notebook for evaluating recommendations from different embedding models, in relation to issue #100.

It includes metrics and plots for the different models and subtypes. The dataset used is the red team dataset already present in the repo.

Let me know if you'd like me to extend this to more models or datasets.
Thank you.

cc: @cassiasamp @Mystic-Slice


ArionDas commented Mar 6, 2026

Any updates on this?
@santanavagner

@Mystic-Slice
Collaborator

Hi @ArionDas, sorry for the delayed reply.

I will review this PR and bring it up when I meet with the team this month.

I think the notebook is good as is. No need to extend to other models/datasets for now. Thank you for your patience.
