
implementation of skills rag #2821

Open
blublinsky wants to merge 1 commit into openshift:main from blublinsky:skill-rag

Conversation


@blublinsky blublinsky commented Mar 17, 2026

Description

  1. Add SkillsRAG hybrid retrieval system for selecting the best skill for a user query, mirroring the existing ToolsRAG pattern (dense + BM25 sparse fusion with configurable alpha/top_k/threshold)
  2. Add SkillsConfig Pydantic model (skills_dir, embed_model_path, alpha, top_k, threshold) wired into OLSConfig with YAML parsing, equality, and cache invalidation on reload
  3. Add skills_rag cached property on AppConfig with three guard clauses (config present → directory exists → skills loaded) before committing to embedding model initialization
  4. Add 5 OpenShift troubleshooting skills in Anthropic-style Markdown format: pod-failure-diagnosis, degraded-operator-recovery, node-not-ready, route-ingress-troubleshooting, namespace-troubleshooting
     - load_content reads the entire skill directory tree recursively, filters binary files via a UnicodeDecodeError catch (no suffix allowlist), strips skill.md frontmatter, and separates additional files with ## {relative_path} headers
  5. Add 30 unit tests (27 for SkillsRAG + 3 for SkillsConfig) covering loading, indexing, retrieval, subdirectory recursion, binary file exclusion, and config validation
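
The dense + BM25 fusion described in item 1 can be sketched roughly as an alpha-weighted blend of normalized score maps. This is a hypothetical illustration of the configurable alpha/top_k/threshold behavior, not the PR's actual code; the function and helper names here (`fuse_scores`, `_normalize`) are invented for the example.

```python
def _normalize(scores: dict[str, float]) -> dict[str, float]:
    """Min-max normalize scores into [0, 1]; constant scores map to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def fuse_scores(
    dense: dict[str, float],
    sparse: dict[str, float],
    alpha: float = 0.5,
    top_k: int = 3,
    threshold: float = 0.0,
) -> list[tuple[str, float]]:
    """Blend normalized dense and sparse scores: alpha*dense + (1-alpha)*sparse,
    then drop results below threshold and keep the top_k best."""
    dense_n, sparse_n = _normalize(dense), _normalize(sparse)
    fused = {
        key: alpha * dense_n.get(key, 0.0) + (1 - alpha) * sparse_n.get(key, 0.0)
        for key in set(dense_n) | set(sparse_n)
    }
    ranked = sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
    return [(k, s) for k, s in ranked if s >= threshold][:top_k]
```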

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Konflux configuration change

Related Tickets & Documents

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.
  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

@openshift-ci openshift-ci bot requested review from bparees and joshuawilson March 17, 2026 13:14

openshift-ci bot commented Mar 17, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign bparees for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

description="Path to directory containing skill subdirectories",
)

embed_model_path: Optional[str] = Field(
Contributor:

Should this be configurable? We need to use the same embedding as we have for RAG/ToolsRAG.

Contributor Author:

We are doing exactly what we do in ToolsRAG. It's local configuration only (per Ashutosh's request), not exposed in the CR.
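
The local-only configuration mentioned here can be illustrated with a hypothetical YAML fragment. The field names come from the PR description (skills_dir, embed_model_path, alpha, top_k, threshold), but the exact key nesting and paths are assumptions for illustration only:

```yaml
# Hypothetical local-only SkillsConfig sketch; key nesting and paths are assumed.
ols_config:
  skills:
    skills_dir: /app-root/skills
    embed_model_path: /app-root/embeddings_model
    alpha: 0.5
    top_k: 1
    threshold: 0.3
```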

for child in sorted(skills_path.iterdir()):
if not child.is_dir():
continue
skill_file = child / "skill.md"
Contributor:

While this is consistent with what the PR introduces, the usual format is SKILL.md (uppercase), so some further tuning will be required to make the lookup case-insensitive.

Contributor Author:

fixed

except (UnicodeDecodeError, ValueError):
continue
if entry.name == "skill.md":
parts.insert(0, raw.split("---", 2)[2].strip())
Contributor:

There is a lib for parsing frontmatter:

import frontmatter

post = frontmatter.load("skills/pod-failure-diagnosis/skill.md")
post.metadata  # {"name": "pod-failure-diagnosis", "description": "..."}
post.content   # everything after the frontmatter block

Worth adding it as a new dependency.

Contributor Author:


good idea, fixed
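
For reference, a stdlib-only equivalent of the frontmatter stripping discussed above, roughly what the PR's `raw.split("---", 2)[2]` does but tolerant of files with no frontmatter block at all. This is a sketch, not the PR's final code (which adopted the frontmatter library):

```python
def strip_frontmatter(raw: str) -> str:
    """Return the body of a skill.md, dropping a leading '---'-delimited
    frontmatter block if one is present."""
    if raw.startswith("---"):
        parts = raw.split("---", 2)
        if len(parts) == 3:
            return parts[2].strip()
    return raw.strip()
```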

)


class SkillsRAG:
Contributor:

Can you explore the option of reusing ToolsRAG? This seems like a reimplementation of it. Maybe you can just create a new instance of that index populated with the skills, dynamically wrapping each skill as a tool:

StructuredTool.from_function(
    name="skill_pod_failure_diagnosis",
    description="Diagnose pods stuck in CrashLoopBackOff, ImagePullBackOff, Pending...",
    func=lambda: skill.load_content(),
)

Basically:

  1. Create a ToolsRAG instance
  2. Index skills into it using name + description as the document text
  3. At query time, call retrieve_hybrid(query) to get the best match - future work
  4. Load the matched skill's content and inject into system prompt - future work

Contributor Author:

We never considered straight reuse. The conversation was about reusing the approach.

The reason SkillsRAG is its own class (~80 lines) rather than a ToolsRAG instance is that ToolsRAG carries a lot of tool-specific concerns that don't apply to skills: server filtering in both dense and sparse paths, tool_json metadata, _convert_langchain_tool_to_dict, server-grouped return format, and remove_tools for dynamic MCP updates. We'd have to bypass or stub out most of that.

Contributor:

My point here is that if we wrap each skill as a StructuredTool (just name + description, func never called) and populate a new ToolsRAG instance with them, we can use retrieve_hybrid to find the matching skill and completely avoid creating SkillsRAG.
The retrieval logic is identical between the two classes. The only new code needed is a skill loader that reads directories, parses frontmatter, and returns list[StructuredTool].
It would be just a way of reusing existing components to get the appropriate results for the given query. After that, skills have a different path than tools.

Contributor Author:

A single index mixing skills and tools will require post-filtering to separate them. The shared infrastructure (QdrantStore, BM25Okapi) is already reused; SkillsRAG only adds ~30 lines of domain-specific wiring that doesn't fit ToolsRAG's server-centric model (server filtering, tool_json metadata, grouped-by-server return format).
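
The reviewer's skills-as-tools idea above can be modeled without any framework dependency. The sketch below uses a plain dataclass in place of langchain's StructuredTool (the real proposal would use `StructuredTool.from_function`); `SkillTool` and `skill_to_tool` are invented names for illustration:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SkillTool:
    """Tool-like wrapper: name + description serve as the indexed document
    text; load is only called once a skill has been selected."""
    name: str
    description: str
    load: Callable[[], str]


def skill_to_tool(name: str, description: str, body: str) -> SkillTool:
    """Wrap a skill as a tool-like record, mirroring the suggested
    skill_<name> naming convention."""
    return SkillTool(
        name=f"skill_{name.replace('-', '_')}",
        description=description,
        load=lambda: body,
    )
```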


bparees commented Mar 17, 2026

is the expectation that this functionality is later migrated into LCORE? Or is it going to remain an OLS-specific capability carried within the OLS codebase?


onmete commented Mar 17, 2026

@bparees We are currently focused on the troubleshooting scenario. Any activities regarding OLS adopting LCORE are postponed; thinking about if/how this can be in LCORE too sadly falls into those postponed activities.

But generally speaking, given how "hot topic" the skills are, I would assume that if we have it in OLS, it should be in LCORE before we migrate.

@blublinsky
Contributor Author

> is the expectation that this functionality is later migrated into LCORE? Or is it going to remain an OLS-specific capability carried within the OLS codebase?

Llama Stack supports skills differently. For now it's OLS-only. Let's get some experience first and evaluate how useful it is.

@blublinsky
Contributor Author

/retest


openshift-ci bot commented Mar 17, 2026

@blublinsky: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

  • ci/prow/e2e-ols-cluster (commit df88c16, required): rerun with /test e2e-ols-cluster

Full PR test history. Your PR dashboard.

