OPRUN-4042: olmv1 catalogd graphql service#2025
Conversation
|
@grokspawn: This pull request references OPRUN-4042 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "5.0.0" version, but no target version was set. DetailsIn response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Adding the reviewers from my frontmatter... |
|
|
||
| ### Non-Goals | ||
|
|
||
| 1. Replacing or deprecating the existing `/api/v1/all` or `/api/v1/metas` endpoints. |
There was a problem hiding this comment.
As far as I know, metas endpoint is defunct now because Console's OLMv1 integration is using the all endpoint and caching what it needs.
- Is that correct?
- If so, should we remove the
metasendpoint and indexing, all of which is still behind a TPNU feature gate?
There was a problem hiding this comment.
AFAIK you are correct, but it feels like that's a matter for #1749, not this review.
I can pull references to it from this EP if you like.
There was a problem hiding this comment.
I think it may be reasonable to say that we think this feature obsoletes that feature, in the context of this EP.
There was a problem hiding this comment.
I have absolutely no problem with dropping the /api/v1/metas EP or feature code.
However, I sought to keep the design scopes disjoint since even removal of the existing Metas API requires code/test/openshift-api/OTE changes which ought to be identified elsewhere and technically this feature doesn't care whether the other feature exists/is-enabled.
I think it's not likely that anyone would use this EP as a reference for what happened to that feature.
|
|
||
| ### Drawbacks | ||
|
|
||
| The dynamic schema discovery approach trades type precision for zero-maintenance adaptability. Deeply nested or polymorphic fields (e.g., `properties[].value`) are serialized as JSON strings rather than fully typed, which limits the GraphQL introspection benefit for those fields. A future enhancement could add specialized type-union handling for well-known property types. |
There was a problem hiding this comment.
I'm curious if we eval'ed build-time schema generation vs run-time schema discovery? There are tradeoffs for sure, but maybe build-time generation would give us more predictability/testability and maybe the maintenance costs wouldn't be much higher (if at all) if we could automate the generation of the schemas from the types.
There was a problem hiding this comment.
This proposal is engineered for the lightest touch possible to give us testable functionality. I can pretty much guarantee that we'll wish to adjust limits (parsing, query, etc.) and otherwise tune this to suit for GA.
However, build-time schema generation has two challenges that this surmounts:
- the need to coordinate with another team to include information which might or might not be enabled on the instance being used (committing space, complexity, etc.);
- the ability to couple the concern of supplying the data with harvesting the data so that we have the smallest iterable space
| ### Goals | ||
|
|
||
| 1. Provide a server-side query endpoint that supports field selection, nested-object traversal, and pagination for FBC catalog data. | ||
| 2. Automatically adapt the GraphQL schema when FBC schemas evolve, requiring zero code changes in catalogd. |
There was a problem hiding this comment.
Do we actually want a fully dynamic GraphQL schema here? How will clients (like Console presumably?) be able to trust that their queries will be consistently valid?
There was a problem hiding this comment.
This proposition assumes that the producers and consumers will have some discussion as the underlying FBC schemas evolve. The assumption here is that new schemas (or deprecation of existing schemas) will not be a surprise to consumers, and the approach to deriving the appropriate graphql schema name is documented and straightforward (and can further be made more available by ensuring it is available as a library as mentioned here).
While it is possible for a client to determine the appropriate graphql schema nomenclature to be used from first principles by interrogating the service, it's an inefficient approach likely only useful for LLMs and human experimenters.
There was a problem hiding this comment.
The advantage of a fully dynamic schema here is that if I add io.openshift.lifecycle schema to my FBC, I don't need to plumb it through the graphql serving layer to expose it, or backport a change.
There was a problem hiding this comment.
I think the larger concern is that a dynamically-generated-at-runtime schema could break for an existing API. For example, what if we deprecate a field and teams stop using it. Schema discovery would claim "that's not a field", when in fact it is a field, but no one has populated it.
If a client requests a non-existent field according to the generated schema, what happens?
btw, I'm not sure this is the fatal flaw it sounds like it is. I'm just trying to think through it. We constantly dance around this idea that FBC can/will evolve, that it is up to clients to keep up, and that catalogd is not responsible for the back-compat of the FBC it serves. Maybe catalogd isn't, but OCP is?
There was a problem hiding this comment.
I think the larger concern is that a dynamically-generated-at-runtime schema could break for an existing API. For example, what if we deprecate a field and teams stop using it. Schema discovery would claim "that's not a field", when in fact it is a field, but no one has populated it.
If a client requests a non-existent field according to the generated schema, what happens?
This is the concern that I have. If Console attempts to fetch some kind of metadata for display purposes that happens to not exist for a given catalog, what is the response they get from the server when the request they send contains that non-existent field? How do they determine whether or not the request failed because of that specific field being invalid?
We constantly dance around this idea that FBC can/will evolve, that it is up to clients to keep up, and that catalogd is not responsible for the back-compat of the FBC it serves. Maybe catalogd isn't, but OCP is?
My interpretation here is that while FBC can evolve, it does have an API surface and should be treated the same as any other API. Catalogd just serving the raw data is fine, but clients should be able to expect that any queries they craft to be reliable. A field no longer existing should be considered a breaking change from the client perspective and that happening dynamically means that clients cannot trust that their queries are stable.
Have we considered that a prerequisite to making the querying of catalog contents easier might be to refine how the catalog content is actually presented?
For example, do we need some reasonable baseline schema for all catalog content blobs and a way to identify what the schema for those blobs would entail (maybe something like an OpenAPI schema?)?
There was a problem hiding this comment.
I think these are tertiary concerns. The grammar of graphql interactions for this case are a 200 (OK) HTTP status and an encapsulated graphql "cannot query field X on type Y" message which simply holds that the thing doesn't exist.
And if a product built on valid expressions of X on Y breaks because the FBC stopped encoding the value or otherwise broke the API contract... then it's a failure of the FBC evolution (which broke the contract) and the client (who failed to notice that FBC broke the contract).
IMO it's incredibly meaningful for FBC to respect a responsible API evolution path.
But I don't think that the service layer has any responsibility to enforce it.
And FBC does need to evolve. And it should evolve in the direction of making interactions (like querying) more straightforward.
But the service layer doesn't need that problem to be solved in order to be useful. It can give you what's there and even open the contents up for discovery.
There was a problem hiding this comment.
IIRC, there is no guaranteed baseline here. FBC is just a bunch of JSON blobs with a very hand-wavy schema.
With the currently proposed approach, there is no real way for a client to have a consistent experience provided the same schema across multiple catalogs.
For example, imagine the following scenario:
- As a programmatic client, I want to check for the existence of a property on the
olm.packageschema via thepropertiesfield that is documented for that schema across catalogs. - Catalog A has at least one
olm.packageentry that specifies thepropertiesfield. - Catalog B does not have any
olm.packageentries that specify thepropertiesfield.
When I query Catalog A asking for the inclusion of the properties field I get a response back where that field is either omitted (due to serialization) or present containing additional properties (may or may not contain the one I'm looking for).
When I execute the same query against Catalog B, I get a response back saying that there is "no such field properties for type olmPackage".
Now, as a client of this GraphQL API my request logic is exponentially more challenging to implement because for every request I now have to:
- Issue a request for the fields that I want to query.
- IF i receive a valid response, continue normally.
- ELSE:
- Identify the offending non-existent field
- Update my query to remove the offending field
- Re-issue the query
- Repeat steps 1-3 until I get a valid response
- Process the valid response, assuming that each field I had to remove from the request is equivalent to "omitted" and handle that accordingly based on the documentation for that schema.
This seems like a really poor client-side UX to me and this behavior doesn't actually have anything to do with whether or not the blob schema actually changed, just whether or not any blobs happen to specify that field from the schema.
While I don't think we can entirely get away from the notion that different catalogs may contain different content blobs, which is a limitation of the catalog system and the fact that not all catalogs will undergo the same strict validation, we can provide the primitives within this system to tell a deterministic querying language what the definition of a given schema is and it build the queryable types from that.
That at least puts the full onus of breaking schema changes on the catalog blob schema maintainers.
There was a problem hiding this comment.
I think these are absolutely valid, compelling concerns.
I just think that they belong to the user<->FBC contract, and any construct of FBC like a default catalog.
We already enforce bundle metadata format, channel head metadata pruning (or not), etc.
When the underlying format is constrained to comply with client contracts, this middle layer remains deterministic. But this restraint is manifested by this layer, not implemented in it. Custom schema implemented in FBC may or may not have the same constraints or compliance intervals, so the service layer is not opinionated by design.
We've always had Hyrum's law where independent tooling is dependent on specific information in the catalog, and this service layer isn't intended to change that.
FBC is still the API. GraphQL just becomes a way to interrogate it.
There was a problem hiding this comment.
I'm not saying that the service layer needs to be opinionated. I'm saying that it needs a way to reliably determine what schemas exist in a catalog and their shape such that it presents a consistent interface for that schema.
The Kubernetes API server isn't opinionated on the shape of your CR, but it can't consistently serve requests for that custom API shape if it doesn't know what it is supposed to look like (via a CRD).
IMO, that should be included within the scope of this work to ensure a reasonable UX when interacting with the GraphQL interface.
Regardless of my opinion, if you are intentionally making this UX tradeoff as part of this design I'd at least want some form of documentation that these things were considered and why the decision was made to reject the work associated with it (to me this seems like enough of a shift to be an explicit "alternatives" approach)
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
Signed-off-by: grokspawn <jordan@nimblewidget.com>
Signed-off-by: grokspawn <jordan@nimblewidget.com>
|
I updated the proposal to a file-backed approach, implemented in operator-framework/operator-controller#2732. |
|
@grokspawn: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
This proposal enables a new catalogd service endpoint in OCP to provide catalogd data via graphql. This feature was merged in the upstream and we'd like to enable downstream to be able to evolve the OLMv1 console interactions and emergent catalog interactions.