fix: replace hardcoded "images" dataset check with generic corruption handling #40

Merged
gordonmurray merged 1 commit into lance-format:main from gordonmurray:fix/generic-corruption-handling
Apr 7, 2026

gordonmurray (Collaborator) commented Apr 7, 2026

Fixes #19

Problem

/datasets/{name}/rows had a hardcoded branch that forced a schema-only corrupted_but_readable_schema response whenever the dataset was named images, regardless of whether the data was actually corrupted. Two failure modes fell out of this:

  1. Any healthy dataset named images was incorrectly surfaced as corrupted.
  2. Any corrupted dataset with a different name got no special handling.

Change

Remove the name-based check and rely on the existing except block around the read path. Any dataset that fails to read (corruption, format error, unreadable bytes) now falls back to the same informational single-row response that was already documented as the graceful-degradation path. Healthy datasets named images are read normally.

Also drop the fallback log level from error to warning, since graceful degradation is an expected path rather than an error condition.
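
The change described above can be sketched roughly as follows. This is not the project's code: `read_rows`, the `reader` callable, the logger name, and the wording of the `error` field are hypothetical stand-ins, and only the `error` / `dataset` / `details` keys and the warning log level come from this PR's description.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger("rows_endpoint_sketch")  # hypothetical logger name

Row = dict[str, Any]


def read_rows(name: str, reader: Callable[[str], list[Row]]) -> dict[str, Any]:
    """Return the dataset's rows, or one informational row if the read fails."""
    try:
        # No branch on the dataset name: every dataset takes the same read path.
        rows = reader(name)
    except Exception as exc:
        # Graceful degradation is an expected path, so log at warning, not error.
        logger.warning("Falling back to informational row for %r: %s", name, exc)
        info = {
            "error": "dataset could not be read",  # assumed wording
            "dataset": name,
            "details": str(exc),
        }
        return {"rows": [info], "total": 1}
    return {"rows": rows, "total": len(rows)}


def healthy_reader(name: str) -> list[Row]:
    """Stand-in for a successful table read."""
    return [{"id": i, "label": f"img{i}"} for i in range(5)]


def corrupt_reader(name: str) -> list[Row]:
    """Stand-in for a read that fails on corrupted data."""
    raise OSError(f"unreadable bytes in {name}")
```

In this sketch, `read_rows("images", healthy_reader)` returns the five real rows, while `read_rows("images", corrupt_reader)` returns the single informational row. The outcome depends only on whether the read succeeds, never on the name.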

Verification

Smoke-tested locally against a temp LanceDB directory containing two tables:

  • normal (10 rows, with a vector column): read normally, returns rows and totals as expected.
  • images (5 rows): previously this would return the hardcoded corrupted_but_readable_schema single-row response. With this change it now returns the real data:
{
  "rows": [
    {"id": 0, "label": "img0"},
    {"id": 1, "label": "img1"},
    {"id": 2, "label": "img2"}
  ],
  "total": 5,
  "limit": 3,
  "offset": 0
}
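
The response above is consistent with plain offset/limit paging over a 5-row table. A minimal sketch of that envelope, using an in-memory list in place of a LanceDB table (the `page` helper is hypothetical and is not taken from the project):

```python
from typing import Any

Row = dict[str, Any]


def page(rows: list[Row], limit: int, offset: int = 0) -> dict[str, Any]:
    """Slice rows into a rows/total/limit/offset envelope."""
    if limit < 0 or offset < 0:
        raise ValueError("limit and offset must be non-negative")
    return {
        "rows": rows[offset : offset + limit],
        "total": len(rows),  # size of the whole table, not of this page
        "limit": limit,
        "offset": offset,
    }


# Five rows shaped like the smoke-test table, requested with limit=3.
images = [{"id": i, "label": f"img{i}"} for i in range(5)]
response = page(images, limit=3)
```

Here `response` holds the first three rows with `total` equal to 5, matching the shape of the output shown above.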

The fallback path still triggers on any read failure and returns the existing error / dataset / details informational row, so the graceful-degradation contract is unchanged for actually-corrupted datasets.

Notes

  • No test additions in this PR; a proper endpoint test suite is tracked separately in #28 ("test: add API endpoint tests").
  • No API shape changes for healthy datasets. The only observable behavior change is that healthy datasets named images now return their real rows instead of the synthetic schema-info row.

… handling

The `/datasets/{name}/rows` endpoint had a hardcoded branch that forced a
schema-only "corrupted_but_readable_schema" response whenever the dataset
was named `images`, regardless of whether the data was actually corrupted.
Any healthy dataset sharing that name was incorrectly shown as corrupted,
and any corrupted dataset with a different name got no special handling.

Remove the name-based check and rely on the existing exception handler
around the read path. Any dataset that fails to read (corruption, format
error, unreadable bytes) now falls back to the same informational single-row
response, matching the graceful-degradation behavior already documented for
the endpoint. Healthy datasets named `images` are read normally.

Also drop the log level for the fallback from `error` to `warning`, since
graceful degradation is an expected path rather than an error condition.

Fixes lance-format#19
gordonmurray force-pushed the fix/generic-corruption-handling branch from 21231d9 to a516b7e on April 7, 2026 19:17
gordonmurray merged commit fb5debe into lance-format:main on Apr 7, 2026
12 checks passed
gordonmurray deleted the fix/generic-corruption-handling branch on April 7, 2026 20:22