OPRUN-4042: olmv1 catalogd graphql service#2025

Open
grokspawn wants to merge 3 commits into openshift:master from grokspawn:olmv1-graphql

Contributor

@grokspawn grokspawn commented May 28, 2026

This proposal enables a new catalogd service endpoint in OCP that serves catalog data via GraphQL. The feature has merged upstream, and we'd like to enable it downstream so that the OLMv1 console interactions and emergent catalog interactions can evolve.

@openshift-ci openshift-ci Bot requested review from dustymabe and syed May 28, 2026 16:24
@grokspawn grokspawn changed the title olmv1 catalogd graphql service OPRUN-4042: olmv1 catalogd graphql service May 28, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 28, 2026

openshift-ci-robot commented May 28, 2026

@grokspawn: This pull request references OPRUN-4042 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "5.0.0" version, but no target version was set.


@grokspawn
Contributor Author

Adding the reviewers from my frontmatter...
/assign @everettraven
/assign @tmshort
/assign @joelanford


### Non-Goals

1. Replacing or deprecating the existing `/api/v1/all` or `/api/v1/metas` endpoints.

Member

As far as I know, the metas endpoint is defunct now because Console's OLMv1 integration uses the all endpoint and caches what it needs.

  1. Is that correct?
  2. If so, should we remove the metas endpoint and its indexing, all of which is still behind a TPNU (TechPreviewNoUpgrade) feature gate?

Contributor Author

AFAIK you are correct, but it feels like that's a matter for #1749, not this review.
I can pull references to it from this EP if you like.

Member

I think it may be reasonable to say that we think this feature obsoletes that feature, in the context of this EP.

Contributor Author

I have absolutely no problem with dropping the /api/v1/metas EP or feature code.
However, I sought to keep the design scopes disjoint. Even removing the existing Metas API requires code, test, openshift-api, and OTE changes, which ought to be identified elsewhere, and technically this feature doesn't care whether the other feature exists or is enabled.
I think it's not likely that anyone would use this EP as a reference for what happened to that feature.

### Drawbacks

The dynamic schema discovery approach trades type precision for zero-maintenance adaptability. Deeply nested or polymorphic fields (e.g., `properties[].value`) are serialized as JSON strings rather than fully typed, which limits the GraphQL introspection benefit for those fields. A future enhancement could add specialized type-union handling for well-known property types.
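
To make the tradeoff concrete, here is a minimal Python sketch of what serializing a polymorphic `properties[].value` as a JSON string means for a client. It is an illustration only, not the catalogd implementation: the helper name `serialize_properties` and the `example.flag` property type are hypothetical.

```python
import json


def serialize_properties(properties):
    """Return each property with its polymorphic `value` encoded as a JSON string."""
    return [
        {"type": p["type"], "value": json.dumps(p["value"], sort_keys=True)}
        for p in properties
    ]


# Two properties whose values have different shapes (an object and a boolean).
bundle_properties = [
    {"type": "olm.package", "value": {"packageName": "example-operator", "version": "1.2.3"}},
    {"type": "example.flag", "value": True},  # hypothetical property type
]

served = serialize_properties(bundle_properties)

# The schema only says "value is a string", so the client decodes it itself
# and has to branch on `type` to know what shape to expect.
decoded = [json.loads(p["value"]) for p in served]
```

Introspection in this sketch can only report `value` as a string, so any knowledge of the shape behind each `type` lives in the client.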

Member

I'm curious whether we evaluated build-time schema generation versus run-time schema discovery. There are tradeoffs for sure, but build-time generation might give us more predictability and testability, and the maintenance cost might not be much higher (if at all) if we could automate generating the schemas from the types.

Contributor Author

@grokspawn grokspawn May 28, 2026

This proposal is engineered for the lightest touch possible to give us testable functionality. I can pretty much guarantee that we'll want to adjust limits (parsing, query, etc.) and otherwise tune this for GA.
However, build-time schema generation has two challenges that this surmounts:

  1. the need to coordinate with another team to include information which might or might not be enabled on the instance being used (committing space, complexity, etc.);
  2. the ability to couple the concern of supplying the data with harvesting the data so that we have the smallest iterable space

### Goals

1. Provide a server-side query endpoint that supports field selection, nested-object traversal, and pagination for FBC catalog data.
2. Automatically adapt the GraphQL schema when FBC schemas evolve, requiring zero code changes in catalogd.

Contributor

Do we actually want a fully dynamic GraphQL schema here? How will clients (like Console presumably?) be able to trust that their queries will be consistently valid?

Contributor Author

@grokspawn grokspawn May 28, 2026

This proposal assumes that producers and consumers will have some discussion as the underlying FBC schemas evolve. The assumption is that new schemas (or the deprecation of existing schemas) will not be a surprise to consumers, and that the approach to deriving the appropriate GraphQL schema name is documented and straightforward (and can be made more accessible by publishing it as a library, as mentioned here).

While it is possible for a client to determine the appropriate graphql schema nomenclature to be used from first principles by interrogating the service, it's an inefficient approach likely only useful for LLMs and human experimenters.
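
As a rough illustration of what such a derivation could look like, the Python sketch below assumes a camel-casing rule inferred from the `olmPackage` type name that appears elsewhere in this thread. The authoritative rule is whatever the upstream documentation specifies; `graphql_type_name` is a hypothetical helper, not an existing library function.

```python
import re


def graphql_type_name(fbc_schema):
    """Camel-case an FBC schema name (ASSUMED rule, e.g. olm.package -> olmPackage)."""
    # Split on anything that is not valid in a GraphQL name segment.
    parts = [p for p in re.split(r"[^0-9A-Za-z]+", fbc_schema) if p]
    if not parts:
        raise ValueError(f"cannot derive a type name from {fbc_schema!r}")
    head, *tail = parts
    return head.lower() + "".join(p[:1].upper() + p[1:] for p in tail)
```

Under that assumed rule, `graphql_type_name("io.openshift.lifecycle")` yields `ioOpenshiftLifecycle`, so a consumer who knows the FBC schema name can build the query name without interrogating the service.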

Contributor Author

The advantage of a fully dynamic schema here is that if I add io.openshift.lifecycle schema to my FBC, I don't need to plumb it through the graphql serving layer to expose it, or backport a change.

Member

I think the larger concern is that a dynamically-generated-at-runtime schema could break for an existing API. For example, what if we deprecate a field and teams stop using it? Schema discovery would claim "that's not a field", when in fact it is a field, but no one has populated it.

If a client requests a non-existent field according to the generated schema, what happens?

btw, I'm not sure this is the fatal flaw it sounds like it is. I'm just trying to think through it. We constantly dance around this idea that FBC can/will evolve, that it is up to clients to keep up, and that catalogd is not responsible for the back-compat of the FBC it serves. Maybe catalogd isn't, but OCP is?

Contributor

> I think the larger concern is that a dynamically-generated-at-runtime schema could break for an existing API. For example, what if we deprecate a field and teams stop using it. Schema discovery would claim "that's not a field", when in fact it is a field, but no one has populated it.
>
> If a client requests a non-existent field according to the generated schema, what happens?

This is the concern that I have. If Console attempts to fetch some kind of metadata for display purposes that happens to not exist for a given catalog, what is the response they get from the server when the request they send contains that non-existent field? How do they determine whether or not the request failed because of that specific field being invalid?

> We constantly dance around this idea that FBC can/will evolve, that it is up to clients to keep up, and that catalogd is not responsible for the back-compat of the FBC it serves. Maybe catalogd isn't, but OCP is?

My interpretation here is that while FBC can evolve, it does have an API surface and should be treated the same as any other API. Catalogd just serving the raw data is fine, but clients should be able to expect any queries they craft to be reliable. A field no longer existing should be considered a breaking change from the client perspective, and that happening dynamically means that clients cannot trust that their queries are stable.

Have we considered that a prerequisite to making the querying of catalog contents easier might be to refine how the catalog content is actually presented?

For example, do we need some reasonable baseline schema for all catalog content blobs and a way to identify what the schema for those blobs would entail (maybe something like an OpenAPI schema?)?

Contributor Author

I think these are tertiary concerns. The grammar of GraphQL interactions for this case is a 200 (OK) HTTP status and an encapsulated GraphQL "cannot query field X on type Y" message, which simply states that the thing doesn't exist.
And if a product built on valid expressions of X on Y breaks because the FBC stopped encoding the value or otherwise broke the API contract... then it's a failure of the FBC evolution (which broke the contract) and the client (who failed to notice that FBC broke the contract).
IMO it's incredibly meaningful for FBC to respect a responsible API evolution path.
But I don't think that the service layer has any responsibility to enforce it.
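
For illustration, a minimal Python sketch of how a client might recognize that case. It assumes the response body follows the usual GraphQL envelope (`data` / `errors`) and that the message uses the `Cannot query field "X" on type "Y"` wording common to GraphQL validators; the helper name `unknown_fields` and the sample bodies are hypothetical.

```python
import re

# Wording assumed from common GraphQL validators; the service's exact text may differ.
UNKNOWN_FIELD = re.compile(r'Cannot query field "([^"]+)" on type "([^"]+)"')


def unknown_fields(body):
    """Return (field, type) pairs for every unknown-field error in a response body."""
    found = []
    for err in body.get("errors") or []:
        match = UNKNOWN_FIELD.search(err.get("message", ""))
        if match:
            found.append((match.group(1), match.group(2)))
    return found


# Both bodies would arrive with HTTP 200; only the envelope distinguishes them.
ok_body = {"data": {"olmPackage": [{"name": "example-operator"}]}}
bad_body = {
    "errors": [{"message": 'Cannot query field "properties" on type "olmPackage".'}]
}
```

Because the HTTP status is 200 in both cases, the client cannot rely on status codes and has to inspect `errors` explicitly.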

And FBC does need to evolve. And it should evolve in the direction of making interactions (like querying) more straightforward.
But the service layer doesn't need that problem to be solved in order to be useful. It can give you what's there and even open the contents up for discovery.

Contributor

IIRC, there is no guaranteed baseline here. FBC is just a bunch of JSON blobs with a very hand-wavy schema.

With the currently proposed approach, there is no real way for a client to have a consistent experience for the same schema across multiple catalogs.

For example, imagine the following scenario:

  • As a programmatic client, I want to check for the existence of a property on the olm.package schema via the properties field that is documented for that schema across catalogs.
  • Catalog A has at least one olm.package entry that specifies the properties field.
  • Catalog B does not have any olm.package entries that specify the properties field.

When I query Catalog A asking for the inclusion of the properties field I get a response back where that field is either omitted (due to serialization) or present containing additional properties (may or may not contain the one I'm looking for).

When I execute the same query against Catalog B, I get a response back saying that there is "no such field properties for type olmPackage".

Now, as a client of this GraphQL API my request logic is exponentially more challenging to implement because for every request I now have to:

  • Issue a request for the fields that I want to query.
  • IF I receive a valid response, continue normally.
  • ELSE:
    1. Identify the offending non-existent field
    2. Update my query to remove the offending field
    3. Re-issue the query
    4. Repeat steps 1-3 until I get a valid response
    5. Process the valid response, assuming that each field I had to remove from the request is equivalent to "omitted" and handle that accordingly based on the documentation for that schema.
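
The loop above can be sketched in Python roughly as follows. This is only an illustration of the client-side burden: `fake_execute` is a hypothetical stand-in for the GraphQL endpoint that reports one unknown field at a time, and the error wording is an assumption (a real server may report several invalid fields in one response).

```python
import re

UNKNOWN_FIELD = re.compile(r'Cannot query field "([^"]+)" on type "([^"]+)"')


def fake_execute(available, type_name, fields):
    """Hypothetical endpoint stand-in: rejects the first field it does not know."""
    for field in fields:
        if field not in available:
            msg = f'Cannot query field "{field}" on type "{type_name}".'
            return {"errors": [{"message": msg}]}
    return {"data": {type_name: [{field: None for field in fields}]}}


def query_with_retry(execute, type_name, fields):
    """Re-issue the query, dropping one rejected field per round trip."""
    wanted, dropped = list(fields), []
    while wanted:
        body = execute(type_name, wanted)
        if "errors" not in body:
            # Dropped fields must be treated as "omitted" by the caller.
            return body["data"], dropped
        match = UNKNOWN_FIELD.search(body["errors"][0]["message"])
        if match is None or match.group(1) not in wanted:
            raise RuntimeError("unrecognized GraphQL error")
        wanted.remove(match.group(1))
        dropped.append(match.group(1))
    raise RuntimeError("no queryable fields remain")


# Catalog B: no olm.package blob populates `properties`.
def catalog_b(type_name, fields):
    return fake_execute({"name", "defaultChannel"}, type_name, fields)


data, dropped = query_with_retry(catalog_b, "olmPackage", ["name", "properties"])
```

Here the client needs two round trips against Catalog B and ends up with `properties` in `dropped`, which it then has to interpret as "omitted".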

This seems like a really poor client-side UX to me and this behavior doesn't actually have anything to do with whether or not the blob schema actually changed, just whether or not any blobs happen to specify that field from the schema.

While I don't think we can entirely get away from the notion that different catalogs may contain different content blobs (a limitation of the catalog system, and of the fact that not all catalogs undergo the same strict validation), we can provide the primitives within this system to tell a deterministic query language what the definition of a given schema is and have it build the queryable types from that.

That at least puts the full onus of breaking schema changes on the catalog blob schema maintainers.
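
A small Python sketch of the difference, under stated assumptions: `DECLARED_FIELDS` is a hypothetical declaration of the `olmPackage` shape (nothing like it exists in the proposal as discussed here), and both catalogs are made-up sample data.

```python
# Hypothetical declared shape for one schema, analogous in spirit to a CRD.
DECLARED_FIELDS = {"olmPackage": ["name", "defaultChannel", "properties"]}


def discovered_fields(blobs):
    """Runtime discovery: only fields that at least one blob populates."""
    return sorted({key for blob in blobs for key in blob})


def declared_fields(type_name):
    """Declared schema: the same field set regardless of catalog contents."""
    return sorted(DECLARED_FIELDS[type_name])


def resolve(blob, field):
    """With a declared schema, an unpopulated field resolves to None (null)."""
    return blob.get(field)


catalog_a = [
    {
        "name": "a-operator",
        "defaultChannel": "stable",
        "properties": [{"type": "example.flag", "value": True}],
    }
]
catalog_b = [{"name": "b-operator", "defaultChannel": "stable"}]
```

With discovery, `properties` is queryable on Catalog A but not on Catalog B; with the declared field set, the same query is valid against both and simply returns null where nothing is populated.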

Contributor Author

I think these are absolutely valid, compelling concerns.
I just think that they belong to the user<->FBC contract, and any construct of FBC like a default catalog.

We already enforce bundle metadata format, channel head metadata pruning (or not), etc.

When the underlying format is constrained to comply with client contracts, this middle layer remains deterministic. But this restraint is manifested by this layer, not implemented in it. Custom schema implemented in FBC may or may not have the same constraints or compliance intervals, so the service layer is not opinionated by design.

We've always had Hyrum's law where independent tooling is dependent on specific information in the catalog, and this service layer isn't intended to change that.
FBC is still the API. GraphQL just becomes a way to interrogate it.

Contributor

@everettraven everettraven Jun 5, 2026

I'm not saying that the service layer needs to be opinionated. I'm saying that it needs a way to reliably determine what schemas exist in a catalog and their shape such that it presents a consistent interface for that schema.

The Kubernetes API server isn't opinionated on the shape of your CR, but it can't consistently serve requests for that custom API shape if it doesn't know what it is supposed to look like (via a CRD).

IMO, that should be included within the scope of this work to ensure a reasonable UX when interacting with the GraphQL interface.

Regardless of my opinion, if you are intentionally making this UX tradeoff as part of this design I'd at least want some form of documentation that these things were considered and why the decision was made to reject the work associated with it (to me this seems like enough of a shift to be an explicit "alternatives" approach)

Contributor

openshift-ci Bot commented May 28, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from joelanford. For more information see the Code Review Process.


Signed-off-by: grokspawn <jordan@nimblewidget.com>
Signed-off-by: grokspawn <jordan@nimblewidget.com>
@grokspawn
Contributor Author

I updated the proposal to a file-backed approach, implemented in operator-framework/operator-controller#2732.
I also added new reviewers from our console peers and resolved a bunch of comments.
In order to try to make the cleanest landing for newly-added folks as well as to help me keep the accounting honest, I'll mark comments resolved when I think I've addressed them.
If you disagree or if we need to add more to the conversation, please un-resolve the comment: resolving is only meant to make it easier to retain momentum here and isn't punitive in any way.

Contributor

openshift-ci Bot commented May 29, 2026

@grokspawn: all tests passed!


5 participants