Enhancement of Weaviate migration script #691
sorphwer wants to merge 5 commits into langgenius:main
Conversation
Handle uuid→text conversion for document_id/doc_id and remove spurious moduleConfig from chunk_index during schema migration. This fixes property type incompatibilities that could cause issues even when vectorConfig is already correct.

Fixes the following failure scenarios in the old script:

1. Schema type mismatch: Old script copies properties as-is, preserving uuid type for document_id/doc_id. Dify expects text type, so the migrated collection appears successful but Dify fails at runtime.
2. UUID object insertion failure: When source collection has uuid-typed fields, the Weaviate client returns Python UUID objects. Writing these into text-typed fields causes batch insert errors, leading to data loss or migration abort.
3. moduleConfig rejection: Stale moduleConfig on chunk_index from older Weaviate versions can cause collection creation to fail on newer Weaviate, aborting migration entirely.
4. Partial migration blindspot: Collections already migrated for vectorConfig but still carrying wrong property types were skipped with "NEW SCHEMA (skip)", leaving silent incompatibilities.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
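The schema part of this commit can be sketched as a pure function over a REST-style property list. This is a minimal illustration, not the code from the PR: the helper name `normalize_properties` and the constant `UUID_TO_TEXT_FIELDS` are hypothetical, and the sketch assumes properties look like `{"name": ..., "dataType": [...]}` as in Weaviate's schema JSON.

```python
import copy

# Hypothetical constant: the fields the commit message says must be text.
UUID_TO_TEXT_FIELDS = {"document_id", "doc_id"}


def normalize_properties(properties):
    """Return a copy of a property list with the two fixes applied.

    - document_id / doc_id: dataType ["uuid"] becomes ["text"]
    - chunk_index: any moduleConfig entry is dropped
    """
    fixed = []
    for prop in properties:
        prop = copy.deepcopy(prop)  # never mutate the source schema
        name = prop.get("name")
        if name in UUID_TO_TEXT_FIELDS and prop.get("dataType") == ["uuid"]:
            prop["dataType"] = ["text"]
        if name == "chunk_index":
            prop.pop("moduleConfig", None)
        fixed.append(prop)
    return fixed
```

Working on a deep copy keeps the source schema untouched, so the original definition is still available if the migration has to be retried.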
Document how to configure Weaviate connection for both in-container and local (port-forward) scenarios, and clarify derived values. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
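As a rough illustration of what "derived values" could mean here, the sketch below derives host, port, and TLS settings from a single endpoint URL. Everything in it is an assumption rather than the script's actual configuration: the function name `derive_connection`, the example URLs, and the gRPC port default of 50051 are illustrative only. The returned keys mirror the keyword arguments of the Python client's `weaviate.connect_to_custom`.

```python
from urllib.parse import urlparse


def derive_connection(endpoint, grpc_port=50051):
    """Derive connection settings from one endpoint URL (hypothetical helper)."""
    parsed = urlparse(endpoint)
    secure = parsed.scheme == "https"
    return {
        "http_host": parsed.hostname,
        # Fall back to the scheme's default port when none is given.
        "http_port": parsed.port or (443 if secure else 80),
        "http_secure": secure,
        "grpc_host": parsed.hostname,
        "grpc_port": grpc_port,
        "grpc_secure": secure,
    }


# In-container: reach Weaviate by its service name (assumed to be "weaviate").
in_container = derive_connection("http://weaviate:8080")
# Local: reach Weaviate through a port-forward on localhost.
local = derive_connection("http://localhost:8080")
```

With this shape, only the endpoint changes between the two scenarios; the remaining values follow from it.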
Reorder replace_old_collection to prevent data loss on failure:

- Fetch schema BEFORE deleting anything
- Wrap data copy in try/except to preserve migrated collection on error
- Add count verification after copy, keep migrated as backup on mismatch
- Only delete the migrated collection after full verification passes
- Print recovery instructions (collection name) on every failure path

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
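The ordering described in this commit can be sketched against an in-memory stand-in for Weaviate. This is not the PR's implementation: `store` and `schemas` are plain dicts used only for illustration, and the `copy_object` parameter is a hypothetical hook that makes the failure path easy to exercise.

```python
def replace_old_collection(store, schemas, old_name, migrated_name, copy_object=dict):
    """Replace old_name with the contents of migrated_name, failing safely.

    store:   dict of collection name -> list of objects (stand-in for Weaviate)
    schemas: dict of collection name -> schema dict
    """
    # 1. Fetch the schema BEFORE deleting anything.
    schema = schemas.get(migrated_name)
    if schema is None:
        print(f"No schema found for {migrated_name}; nothing was deleted.")
        return False
    expected = len(store[migrated_name])

    # 2. Drop and recreate the old collection with the migrated schema.
    store[old_name] = []
    schemas[old_name] = {**schema, "class": old_name}

    # 3. Copy data; on any error the migrated collection stays as a backup.
    try:
        for obj in store[migrated_name]:
            store[old_name].append(copy_object(obj))
    except Exception as exc:
        print(f"Copy failed ({exc}). Your data is still in: {migrated_name}")
        return False

    # 4. Verify counts before touching the backup.
    actual = len(store[old_name])
    if actual != expected:
        print(f"Count mismatch ({actual} != {expected}). Backup kept in: {migrated_name}")
        return False

    # 5. Only now is it safe to delete the migrated collection.
    del store[migrated_name]
    del schemas[migrated_name]
    return True
```

Every early return leaves `migrated_name` in place and prints its name, so an operator always knows where the surviving copy lives.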
Hi @DhruvGorasiya, could you please take a look at this PR and review the updates to the migration script?

The updated code comes from our internal experiment: https://gist.github.com/sorphwer/a5ae5f2eab649d0913a5b7e811e95321

@RiskeyL We would like to have this PR merged ASAP, since we want to link to the official Dify doc from our enterprise documentation. Is there anything we can do to speed up the review process?

@sorphwer Sorry for the delay. We have reached out to the Weaviate team to review this script update, and the review is currently ongoing. We will push to get this merged as soon as possible.

Hi @sorphwer !! Duda from Weaviate here :) Thanks for the PR!

1 and 2) Do you believe we could keep the UUID datatype somehow?

    # Example of filtering with UUID
    import weaviate
    import weaviate.classes as wvc

    # Assumes a local Weaviate instance with the text2vec-openai module enabled
    client = weaviate.connect_to_local()

    client.collections.delete("TestUUID")
    collection = client.collections.create(
        "TestUUID",
        properties=[
            wvc.config.Property(name="document_id", data_type=wvc.config.DataType.UUID),
            wvc.config.Property(name="text", data_type=wvc.config.DataType.TEXT),
        ],
        vector_config=wvc.config.Configure.Vectors.text2vec_openai(),
    )
    collection.data.insert({"document_id": "123e4567-e89b-12d3-a456-426614174000", "text": "A dog can bark loud"})
    collection.data.insert({"document_id": "123e4567-e89b-12d3-a456-426614174001", "text": "A cat can climb a tree"})
    collection.data.insert({"document_id": "123e4567-e89b-12d3-a456-426614174002", "text": "A bird can fly"})

    # Restrict the vector search to two of the three document IDs
    doc_ids = [
        "123e4567-e89b-12d3-a456-426614174000",
        "123e4567-e89b-12d3-a456-426614174001",
    ]
    query = collection.query.near_text(
        query="Human best friend",
        filters=wvc.query.Filter.by_property("document_id").contains_any(doc_ids),
        return_metadata=wvc.query.MetadataQuery(distance=True),
    )
    for o in query.objects:
        print(o.metadata.distance, o.properties)

    client.close()

Also, it would be best to target the migration at the latest 1.36 release (1.36.latest). I will make a PR for that.

I also noticed that all properties from a KB are the same, so we could use multi-tenancy instead of having one collection per KB. This could be a good opportunity to migrate to it, but it would require some code changes. The advantage is that, for large KB deployments, you could offload unused tenants from memory. Another interesting feature to expose is the dynamic index, where only KBs above a certain object threshold would be loaded into memory.

Using Fetch instead of iterator

This is not the suggested way to read all objects from a collection. Instead, we suggest the iterator. We have a migration guide here that covers consuming the iterator on the source and batching to the target. Also, we suggest using a fixed batch size: dynamic batching increases the batch size according to latency, and because source and target may run side by side, latency may be low and the client may overwhelm the server on big migrations.

I would love to jump on a call with you so we can discuss this further! Let me know if this works for you. Thanks!

…ommendation Replace manual cursor-based fetch_objects pagination with collection.iterator() and switch batch.dynamic() to batch.fixed_size() to prevent overwhelming the server during co-located migrations. Addresses review feedback from Weaviate team.
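The read-and-write loop this commit describes might look roughly like the sketch below. It is an illustration under assumptions, not the PR's code: the function name `copy_objects` and the default batch size of 100 are hypothetical, `source` and `target` are expected to be Weaviate Python client (v4) collection objects, and the vector is forwarded unchanged, which may need adjusting depending on whether the target uses named vectors.

```python
def copy_objects(source, target, batch_size=100):
    """Copy every object from source to target using iterator + fixed-size batches.

    source, target: collection objects from the Weaviate Python client (v4).
    Returns the number of objects handed to the batcher.
    """
    copied = 0
    # A fixed batch size keeps the load on the server predictable, unlike
    # dynamic batching, which grows the batch when latency is low.
    with target.batch.fixed_size(batch_size=batch_size) as batch:
        # The iterator pages through the whole collection internally,
        # so no manual cursor handling is needed.
        for obj in source.iterator(include_vector=True):
            batch.add_object(
                properties=obj.properties,
                vector=obj.vector,  # assumption: forwarded as returned
                uuid=obj.uuid,      # preserve the original object ID
            )
            copied += 1
    return copied
```

The returned count only reflects what was submitted; comparing object counts on both collections afterwards remains the actual verification step.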

Hi @dudanogueira, I've updated the PR with two changes, as you suggested:

- Replaced the manual cursor-based fetch_objects pagination with collection.iterator()
- Switched batch.dynamic() to batch.fixed_size()

The background of this PR is that some of our clients failed to run the migration script, and we'd like this PR to address those failures. The updated script aligns with what Dify produces when uploading documents on a fresh install. As for the UUID issue, I believe that warrants a dedicated discussion/issue/PR.

Regards,

Hi @sorphwer @dudanogueira, does this PR need further discussion before merging, or are we good to proceed? Also, should the migration docs be updated alongside the script changes?

@RiskeyL We don't need to update the migration doc. If @dudanogueira approves the latest code change, the PR can be merged.

This PR enhances the script based on our internal experiment.
Key improvement:
Fixes the following failure scenarios in the old script:
Schema type mismatch: Old script copies properties as-is, preserving uuid type for document_id/doc_id. Dify expects text type, so the migrated collection appears successful but Dify fails at runtime.
UUID object insertion failure: When source collection has uuid-typed fields, the Weaviate client returns Python UUID objects. Writing these into text-typed fields causes batch insert errors, leading to data loss or migration abort.
moduleConfig rejection: Stale moduleConfig on chunk_index from older Weaviate versions can cause collection creation to fail on newer Weaviate, aborting migration entirely.
Partial migration blindspot: Collections already migrated for vectorConfig but still carrying wrong property types were skipped with "NEW SCHEMA (skip)", leaving silent incompatibilities.
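Two of the scenarios above lend themselves to a small sketch: converting UUID values before insertion, and deciding whether an already-migrated collection still needs a property fix instead of being skipped. The helper names `stringify_uuids` and `needs_property_fix` are hypothetical, and the sketch assumes REST-style property definitions (`{"name": ..., "dataType": [...]}`).

```python
import uuid


def stringify_uuids(properties):
    """Convert uuid.UUID values to str so they fit text-typed fields."""
    return {
        key: str(value) if isinstance(value, uuid.UUID) else value
        for key, value in properties.items()
    }


def needs_property_fix(schema_properties):
    """True if a collection still carries uuid-typed IDs or a stale moduleConfig."""
    for prop in schema_properties:
        name = prop.get("name")
        if name in {"document_id", "doc_id"} and prop.get("dataType") == ["uuid"]:
            return True
        if name == "chunk_index" and "moduleConfig" in prop:
            return True
    return False


row = {
    "document_id": uuid.UUID("123e4567-e89b-12d3-a456-426614174000"),
    "text": "hello",
}
clean_row = stringify_uuids(row)
```

Running a check like `needs_property_fix` in addition to the vectorConfig check is one way to avoid skipping collections that only look fully migrated.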