Skip to content

Handle network failures in portal and remote channel endpoints#14813

Merged
rtibbles merged 1 commit into
learningequality:developfrom
rtibbles:handle-network-failures
Jun 25, 2026
Merged

Handle network failures in portal and remote channel endpoints#14813
rtibbles merged 1 commit into
learningequality:developfrom
rtibbles:handle-network-failures

Conversation

@rtibbles

@rtibbles rtibbles commented Jun 6, 2026

Copy link
Copy Markdown
Member

Summary

  • The views that proxy to KDP and Studio (portal register/validate_token, remote channel list/retrieve) crashed with 500s on network-level failures: NetworkLocationResponseFailure.response can be None, timeouts and connection failures were uncaught, and a non-404 Studio error fell off the end of an except block (200 with null body on list, spurious 404 on retrieve)
  • Network-level failures now consistently return 503 {"status": "offline"}
  • The portal endpoints no longer reflect upstream bodies wholesale: error responses are reduced to the error constants the frontend matches on, and validate_token's success response to the project name the frontend uses; non-JSON bodies (e.g. CDN error pages) are treated as network-level failures

References

Reviewer guidance

  • To exercise the real paths: the facility register flow (invalid token, valid token, already registered) against live KDP, and a Studio channel import with the network disconnected — the latter should now report offline rather than erroring
  • Behaviour change to weigh: a non-404 Studio error in RemoteChannelViewSet previously gave 200 with a null body (list) or a spurious 404 (retrieve); both now give 503 offline — worth checking the device plugin's channel import flow handles that
  • Any portal error other than the two whitelisted constants now reaches the client as an empty list at the upstream status code

AI usage

Written with Claude Code, test-first. I reviewed the diff and directed two revisions — first dropping reflection of raw upstream bytes, then reducing the reflection to the explicit whitelist of error constants — and verified the affected suites pass.

@github-actions github-actions Bot added DEV: backend Python, databases, networking, filesystem... SIZE: medium labels Jun 6, 2026
@github-actions

github-actions Bot commented Jun 6, 2026

Copy link
Copy Markdown
Contributor

@rtibbles rtibbles marked this pull request as ready for review June 6, 2026 04:40
@rtibbles rtibbles force-pushed the handle-network-failures branch from 11b6493 to c2e289d Compare June 9, 2026 16:39
Comment thread kolibri/core/content/api.py Outdated
return channels
except NetworkLocationResponseFailure as e:
if e.response.status_code == 404:
if e.response is not None and e.response.status_code == 404:

@akolson akolson Jun 23, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should use HTTPStatus.NOT_FOUND or from rest_framework import status instead of hardcoded codes?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, seems reasonable, will update!

@akolson akolson left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code changes look correct to me! I've left a comment for your consideration. Thanks @rtibbles

Fixes a family of crashes in the views that proxy to the Kolibri Data
Portal and Studio, surfaced while removing real network access from the
test suite:

- NetworkLocationResponseFailure.response is None when the request
  failed without an HTTP response, so accessing response.status_code or
  response.json() on it raised AttributeError (500) in the portal
  register and validate_token actions, the remote channel lookup, and
  the v2 community library lookup.
- A non-404 upstream error in the remote channel lookup fell off the
  end of the except block, so list returned 200 with a null body and
  retrieve returned a spurious 404.
- NetworkLocationResponseTimeout was uncaught in the remote channel and
  validate_token actions, and connection failures were uncaught in
  register, all returning 500.
- A non-JSON error body (e.g. a Cloudflare HTML error page) crashed
  register, and validate_token reflected the raw bytes back to the
  client.

Network-level failures now consistently return 503 with an "offline"
status, and 404 from the content server still maps to Http404. The
portal endpoints no longer reflect upstream response bodies wholesale:
error bodies are reduced to the recognized error constants the frontend
matches on (with ALREADY_REGISTERED_FOR_COMMUNITY now mirrored in
error_constants.py), the validate_token success response is reduced to
the project name the frontend uses, and a missing or non-JSON body is
treated as another network-level failure: 503 offline.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@rtibbles rtibbles force-pushed the handle-network-failures branch from c2e289d to 99355ed Compare June 25, 2026 01:04

@akolson akolson left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @rtibbles

@rtibbles rtibbles merged commit 9a267df into learningequality:develop Jun 25, 2026
68 checks passed
@rtibbles rtibbles deleted the handle-network-failures branch June 25, 2026 15:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

DEV: backend Python, databases, networking, filesystem... SIZE: medium

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants