Enhancement of Weaviate migration script#691

Open
sorphwer wants to merge 5 commits into langgenius:main from sorphwer:main
Conversation

@sorphwer sorphwer commented Mar 1, 2026

This PR enhances the Weaviate migration script based on our internal experiments.

Key improvement:
Fixes the following failure scenarios in the old script:

  1. Schema type mismatch: Old script copies properties as-is, preserving uuid type for document_id/doc_id. Dify expects text type, so the migrated collection appears successful but Dify fails at runtime.

  2. UUID object insertion failure: When source collection has uuid-typed fields, the Weaviate client returns Python UUID objects. Writing these into text-typed fields causes batch insert errors, leading to data loss or migration abort.

  3. moduleConfig rejection: Stale moduleConfig on chunk_index from older Weaviate versions can cause collection creation to fail on newer Weaviate, aborting migration entirely.

  4. Partial migration blindspot: Collections already migrated for vectorConfig but still carrying wrong property types were skipped with "NEW SCHEMA (skip)", leaving silent incompatibilities.
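
Scenario 2 above can be illustrated with a minimal sketch. It assumes the source objects arrive as plain Python dicts whose uuid-typed fields hold `uuid.UUID` instances; the helper name `stringify_uuids` and the sample data are hypothetical and are not taken from the migration script.

```python
import uuid
from typing import Any


def stringify_uuids(value: Any) -> Any:
    """Return a copy of value with every uuid.UUID replaced by its string form."""
    if isinstance(value, uuid.UUID):
        return str(value)
    if isinstance(value, dict):
        return {key: stringify_uuids(item) for key, item in value.items()}
    if isinstance(value, list):
        return [stringify_uuids(item) for item in value]
    return value


# Hypothetical properties, shaped like what a client might return for a
# source collection whose document_id field is uuid-typed.
source_properties = {
    "document_id": uuid.UUID("123e4567-e89b-12d3-a456-426614174000"),
    "text": "A dog can bark loud",
    "chunk_index": 0,
}
clean_properties = stringify_uuids(source_properties)
```

Applying such a conversion before each batch insert means a text-typed target field only ever receives strings.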

Handle uuid→text conversion for document_id/doc_id and remove spurious
moduleConfig from chunk_index during schema migration. This fixes
property type incompatibilities that could cause issues even when
vectorConfig is already correct.
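
The schema-side changes described here might look roughly like the sketch below. It assumes property definitions are dicts in the Weaviate REST schema shape (`name`, `dataType`, optional `moduleConfig`); the function names `normalize_property` and `needs_property_fix` are hypothetical, and the script in this PR may structure the logic differently.

```python
import copy

# Properties that Dify expects as text, per the PR description.
TEXT_ID_PROPERTIES = {"document_id", "doc_id"}


def normalize_property(prop: dict) -> dict:
    """Return a copy of one property definition with the described fixes applied."""
    fixed = copy.deepcopy(prop)
    # uuid -> text for the ID fields (scenario 1).
    if fixed.get("name") in TEXT_ID_PROPERTIES and fixed.get("dataType") == ["uuid"]:
        fixed["dataType"] = ["text"]
    # Drop stale moduleConfig from chunk_index (scenario 3).
    if fixed.get("name") == "chunk_index":
        fixed.pop("moduleConfig", None)
    return fixed


def needs_property_fix(properties: list[dict]) -> bool:
    """True if any property would change, even when vectorConfig is already new (scenario 4)."""
    return any(normalize_property(prop) != prop for prop in properties)
```

A check like `needs_property_fix` is one way to avoid skipping a collection solely because its vectorConfig already looks migrated.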

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sorphwer sorphwer requested a review from RiskeyL as a code owner March 1, 2026 03:54
@dosubot dosubot bot added the size:L (This PR changes 100-499 lines, ignoring generated files.) and bug (Something isn't working) labels Mar 1, 2026
@sorphwer sorphwer requested a review from ZhouhaoJiang March 1, 2026 03:57
sorphwer and others added 2 commits March 1, 2026 11:57
Document how to configure Weaviate connection for both in-container
and local (port-forward) scenarios, and clarify derived values.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reorder replace_old_collection to prevent data loss on failure:
- Fetch schema BEFORE deleting anything
- Wrap data copy in try/except to preserve migrated collection on error
- Add count verification after copy, keep migrated as backup on mismatch
- Only delete the migrated collection after full verification passes
- Print recovery instructions (collection name) on every failure path

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
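
The ordering described in this commit message can be sketched against an in-memory stand-in. This is only an illustration of the sequence (fetch schema, copy, verify, then delete); the `store` dict replaces a real Weaviate client, and the collection names used with it are hypothetical.

```python
def replace_old_collection(store: dict, old_name: str, migrated_name: str) -> bool:
    """Replace old_name with the contents of migrated_name.

    `store` maps collection name -> {"schema": dict, "objects": list} and is a
    stand-in for a real client, used only to show the order of operations.
    """
    # 1. Fetch the schema BEFORE deleting anything.
    migrated = store.get(migrated_name)
    if migrated is None:
        print(f"Nothing to do: {migrated_name} not found")
        return False
    schema = dict(migrated["schema"])
    expected = len(migrated["objects"])

    # 2. Recreate the old collection and copy data; keep the migrated copy on error.
    store[old_name] = {"schema": schema, "objects": []}
    try:
        for obj in migrated["objects"]:
            store[old_name]["objects"].append(dict(obj))
    except Exception as exc:
        print(f"Copy failed ({exc}); your data is still in {migrated_name}")
        return False

    # 3. Verify counts; on mismatch keep the migrated collection as a backup.
    actual = len(store[old_name]["objects"])
    if actual != expected:
        print(f"Count mismatch ({actual} != {expected}); backup kept in {migrated_name}")
        return False

    # 4. Only now is it safe to delete the migrated collection.
    del store[migrated_name]
    return True
```

Every failure path returns before step 4, so the migrated collection survives and its name is printed for recovery.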
@RiskeyL
Contributor

RiskeyL commented Mar 1, 2026

Hi @DhruvGorasiya, could you please take a look at this PR and review the updates to the migration script?

@sorphwer
Author

sorphwer commented Mar 2, 2026

The updated code comes from our internal experiment: https://gist.github.com/sorphwer/a5ae5f2eab649d0913a5b7e811e95321

@sorphwer
Author

sorphwer commented Mar 17, 2026

@RiskeyL We would like to have this PR merged as soon as possible, since we want to link to the official Dify doc from our enterprise documentation. Is there anything we can do to speed up the review process?

CC @ZhouhaoJiang

@RiskeyL
Contributor

RiskeyL commented Mar 17, 2026

@sorphwer Sorry for the delay. We have reached out to the Weaviate team to review this script update, and the review is currently ongoing. We will push to get this merged as soon as possible.

@dudanogueira

Hi @sorphwer !! Duda from Weaviate here :)

Thanks for the PR!

1 and 2) Do you believe we could keep UUID datatype somehow?
From Weaviate side, it's more performant to keep those fields as UUID. Maybe we can change this at Dify level?

# Example of filtering with UUID.
# Assumes a local Weaviate instance with the text2vec-openai module configured.
import weaviate
import weaviate.classes as wvc

client = weaviate.connect_to_local()

# Start from a clean collection.
client.collections.delete("TestUUID")
collection = client.collections.create(
    "TestUUID",
    properties=[
        wvc.config.Property(name="document_id", data_type=wvc.config.DataType.UUID),
        wvc.config.Property(name="text", data_type=wvc.config.DataType.TEXT),
    ],
    vector_config=wvc.config.Configure.Vectors.text2vec_openai(),
)
collection.data.insert({"document_id": "123e4567-e89b-12d3-a456-426614174000", "text": "A dog can bark loud"})
collection.data.insert({"document_id": "123e4567-e89b-12d3-a456-426614174001", "text": "A cat can climb a tree"})
collection.data.insert({"document_id": "123e4567-e89b-12d3-a456-426614174002", "text": "A bird can fly"})

# Only objects whose document_id is in this list should be returned.
doc_ids = [
    "123e4567-e89b-12d3-a456-426614174000",
    "123e4567-e89b-12d3-a456-426614174001",
]

query = collection.query.near_text(
    query="Human best friend",
    filters=wvc.query.Filter.by_property("document_id").contains_any(doc_ids),
    return_metadata=wvc.query.MetadataQuery(distance=True),
)
for o in query.objects:
    print(o.metadata.distance, o.properties)

client.close()

Also, it would be best to target the migration to latest 1.36.latest. I will make a PR for that.

I also noticed that all properties from a KB are the same, so we can use multitenancy instead of having each KB per collection. This could be a good opportunity to migrate it, but would require some code changes.

The advantage here is, for large KB deployments, you could offload unused tenants from memory.

Another interesting feature to expose is the dynamic index, where only KBs above a certain object-count threshold would be loaded into memory.

Using Fetch instead of iterator

Using fetch is not the suggested way to read all objects from a collection. Instead, we suggest the iterator. We have a migration guide here that covers consuming the iterator on the source and batching to the target.

Also, we suggest using a fixed batch size. Dynamic batching increases the batch size according to latency, and because source and target may run side by side, latency may be low and the client may overwhelm the server on big migrations.

I would love to jump on a call with you so we can discuss this further! Let me know if this works for you.

Thanks!

…ommendation

Replace manual cursor-based fetch_objects pagination with collection.iterator()
and switch batch.dynamic() to batch.fixed_size() to prevent overwhelming the
server during co-located migrations. Addresses review feedback from Weaviate team.
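
A minimal sketch of the pattern this commit describes is shown below. It assumes `source` and `target` are Weaviate v4 collection objects (or anything exposing the same `iterator` and `batch.fixed_size` interface); the function name `copy_objects` and the default batch size are hypothetical, and the vector is passed through unchanged, which may need adjusting for a given vector configuration.

```python
def copy_objects(source, target, batch_size: int = 100) -> int:
    """Stream every object from source into target using a fixed-size batch.

    Returns the number of objects handed to the batch, which a caller can
    compare with the source count afterwards.
    """
    copied = 0
    # A fixed batch size keeps the load on the server predictable, unlike
    # dynamic batching, which grows the batch when latency is low.
    with target.batch.fixed_size(batch_size=batch_size) as batch:
        # The iterator walks the whole collection without manual cursor handling.
        for obj in source.iterator(include_vector=True):
            batch.add_object(
                properties=obj.properties,
                vector=obj.vector,  # assumption: passed through unchanged
                uuid=obj.uuid,
            )
            copied += 1
    return copied
```

Because the function only relies on these two methods, it can be exercised with small stand-in objects before being pointed at a real cluster.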
ZhouhaoJiang previously approved these changes Mar 27, 2026
@dosubot dosubot bot added the lgtm (This PR has been approved by a maintainer) label Mar 27, 2026
@sorphwer
Author

Hi, @dudanogueira
Thanks for your kind and detailed reply.

I've updated the PR with two changes as you suggested:

  1. fetch_objects → iterator
  2. batch.dynamic() → batch.fixed_size()

The background of this PR is that some of our clients failed to run the migration script, and we'd like this PR to address those failures. The updated script aligns with what Dify produces when uploading documents on a fresh install (using the Weaviate V4 client). This fix is specifically intended to help clients successfully upgrade from 1.26 → 1.27, as we're aware there are additional migration steps required between 1.27 and later versions.

As for the UUID issue, I believe that warrants a dedicated discussion/issue/PR.

Regards,

@RiskeyL
Contributor

RiskeyL commented Mar 28, 2026

Hi @sorphwer @dudanogueira, does this PR need further discussion before merging, or are we good to proceed? Also, should the migration docs be updated alongside the script changes?

@sorphwer
Author

Hi @sorphwer @dudanogueira, does this PR need further discussion before merging, or are we good to proceed? Also, should the migration docs be updated alongside the script changes?

@RiskeyL We don't need to update the migration doc. If @dudanogueira approves the latest code change, the PR can be merged.
