diff --git a/README.md b/README.md index a400e81c..1679bcbb 100644 --- a/README.md +++ b/README.md @@ -30,6 +30,7 @@ Each plugin lives in `plugins/`. The directory name is the install keyword | `linear` | Linear SDK scripting skill for issue, project, team, cycle, and comment workflows. | | `mac-notify` | macOS notifications when a Cline run completes. | | `nanobanana` | Image generation through OpenRouter and Gemini image models. | +| `qdrant` | Qdrant vector search skills for scaling, search quality, monitoring, deployment, migrations, upgrades, and SDK usage. | | `speak` | Speaks completed Cline replies with ElevenLabs text to speech. | | `typescript-lsp` | TypeScript language service `goto_definition` support. | | `weather-metrics` | Demo weather tool plus runtime metrics hooks. | diff --git a/plugins/qdrant/LICENSE.qdrant-skills b/plugins/qdrant/LICENSE.qdrant-skills new file mode 100644 index 00000000..456fb05e --- /dev/null +++ b/plugins/qdrant/LICENSE.qdrant-skills @@ -0,0 +1,201 @@ + Apache License + Version 2.0, January 2004 + http://www.apache.org/licenses/ + + TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION + + 1. Definitions. + + "License" shall mean the terms and conditions for use, reproduction, + and distribution as defined by Sections 1 through 9 of this document. + + "Licensor" shall mean the copyright owner or entity authorized by + the copyright owner that is granting the License. + + "Legal Entity" shall mean the union of the acting entity and all + other entities that control, are controlled by, or are under common + control with that entity. For the purposes of this definition, + "control" means (i) the power, direct or indirect, to cause the + direction or management of such entity, whether by contract or + otherwise, or (ii) ownership of fifty percent (50%) or more of the + outstanding shares, or (iii) beneficial ownership of such entity. 
+ + "You" (or "Your") shall mean an individual or Legal Entity + exercising permissions granted by this License. + + "Source" form shall mean the preferred form for making modifications, + including but not limited to software source code, documentation + source, and configuration files. + + "Object" form shall mean any form resulting from mechanical + transformation or translation of a Source form, including but + not limited to compiled object code, generated documentation, + and conversions to other media types. + + "Work" shall mean the work of authorship, whether in Source or + Object form, made available under the License, as indicated by a + copyright notice that is included in or attached to the work + (an example is provided in the Appendix below). + + "Derivative Works" shall mean any work, whether in Source or Object + form, that is based on (or derived from) the Work and for which the + editorial revisions, annotations, elaborations, or other modifications + represent, as a whole, an original work of authorship. For the purposes + of this License, Derivative Works shall not include works that remain + separable from, or merely link (or bind by name) to the interfaces of, + the Work and Derivative Works thereof. + + "Contribution" shall mean any work of authorship, including + the original version of the Work and any modifications or additions + to that Work or Derivative Works thereof, that is intentionally + submitted to Licensor for inclusion in the Work by the copyright owner + or by an individual or Legal Entity authorized to submit on behalf of + the copyright owner. 
For the purposes of this definition, "submitted" + means any form of electronic, verbal, or written communication sent + to the Licensor or its representatives, including but not limited to + communication on electronic mailing lists, source code control systems, + and issue tracking systems that are managed by, or on behalf of, the + Licensor for the purpose of discussing and improving the Work, but + excluding communication that is conspicuously marked or otherwise + designated in writing by the copyright owner as "Not a Contribution." + + "Contributor" shall mean Licensor and any individual or Legal Entity + on behalf of whom a Contribution has been received by Licensor and + subsequently incorporated within the Work. + + 2. Grant of Copyright License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + copyright license to reproduce, prepare Derivative Works of, + publicly display, publicly perform, sublicense, and distribute the + Work and such Derivative Works in Source or Object form. + + 3. Grant of Patent License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + (except as stated in this section) patent license to make, have made, + use, offer to sell, sell, import, and otherwise transfer the Work, + where such license applies only to those patent claims licensable + by such Contributor that are necessarily infringed by their + Contribution(s) alone or by combination of their Contribution(s) + with the Work to which such Contribution(s) was submitted. 
If You + institute patent litigation against any entity (including a + cross-claim or counterclaim in a lawsuit) alleging that the Work + or a Contribution incorporated within the Work constitutes direct + or contributory patent infringement, then any patent licenses + granted to You under this License for that Work shall terminate + as of the date such litigation is filed. + + 4. Redistribution. You may reproduce and distribute copies of the + Work or Derivative Works thereof in any medium, with or without + modifications, and in Source or Object form, provided that You + meet the following conditions: + + (a) You must give any other recipients of the Work or + Derivative Works a copy of this License; and + + (b) You must cause any modified files to carry prominent notices + stating that You changed the files; and + + (c) You must retain, in the Source form of any Derivative Works + that You distribute, all copyright, patent, trademark, and + attribution notices from the Source form of the Work, + excluding those notices that do not pertain to any part of + the Derivative Works; and + + (d) If the Work includes a "NOTICE" text file as part of its + distribution, then any Derivative Works that You distribute must + include a readable copy of the attribution notices contained + within such NOTICE file, excluding those notices that do not + pertain to any part of the Derivative Works, in at least one + of the following places: within a NOTICE text file distributed + as part of the Derivative Works; within the Source form or + documentation, if provided along with the Derivative Works; or, + within a display generated by the Derivative Works, if and + wherever such third-party notices normally appear. The contents + of the NOTICE file are for informational purposes only and + do not modify the License. 
You may add Your own attribution + notices within Derivative Works that You distribute, alongside + or as an addendum to the NOTICE text from the Work, provided + that such additional attribution notices cannot be construed + as modifying the License. + + You may add Your own copyright statement to Your modifications and + may provide additional or different license terms and conditions + for use, reproduction, or distribution of Your modifications, or + for any such Derivative Works as a whole, provided Your use, + reproduction, and distribution of the Work otherwise complies with + the conditions stated in this License. + + 5. Submission of Contributions. Unless You explicitly state otherwise, + any Contribution intentionally submitted for inclusion in the Work + by You to the Licensor shall be under the terms and conditions of + this License, without any additional terms or conditions. + Notwithstanding the above, nothing herein shall supersede or modify + the terms of any separate license agreement you may have executed + with Licensor regarding such Contributions. + + 6. Trademarks. This License does not grant permission to use the trade + names, trademarks, service marks, or product names of the Licensor, + except as required for reasonable and customary use in describing the + origin of the Work and reproducing the content of the NOTICE file. + + 7. Disclaimer of Warranty. Unless required by applicable law or + agreed to in writing, Licensor provides the Work (and each + Contributor provides its Contributions) on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or + implied, including, without limitation, any warranties or conditions + of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A + PARTICULAR PURPOSE. You are solely responsible for determining the + appropriateness of using or redistributing the Work and assume any + risks associated with Your exercise of permissions under this License. + + 8. 
Limitation of Liability. In no event and under no legal theory, + whether in tort (including negligence), contract, or otherwise, + unless required by applicable law (such as deliberate and grossly + negligent acts) or agreed to in writing, shall any Contributor be + liable to You for damages, including any direct, indirect, special, + incidental, or consequential damages of any character arising as a + result of this License or out of the use or inability to use the + Work (including but not limited to damages for loss of goodwill, + work stoppage, computer failure or malfunction, or any and all + other commercial damages or losses), even if such Contributor + has been advised of the possibility of such damages. + + 9. Accepting Warranty or Additional Liability. While redistributing + the Work or Derivative Works thereof, You may choose to offer, + and charge a fee for, acceptance of support, warranty, indemnity, + or other liability obligations and/or rights consistent with this + License. However, in accepting such obligations, You may act only + on Your own behalf and on Your sole responsibility, not on behalf + of any other Contributor, and only if You agree to indemnify, + defend, and hold each Contributor harmless for any liability + incurred by, or claims asserted against, such Contributor by reason + of your accepting any such warranty or additional liability. + + END OF TERMS AND CONDITIONS + + APPENDIX: How to apply the Apache License to your work. + + To apply the Apache License to your work, attach the following + boilerplate notice, with the fields enclosed by brackets "[]" + replaced with your own identifying information. (Don't include + the brackets!) The text should be enclosed in the appropriate + comment syntax for the file format. We also recommend that a + file or class name and description of purpose be included on the + same "printed page" as the copyright notice for easier + identification within third-party archives. 
+ + Copyright 2026 Qdrant Solutions GmbH + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. diff --git a/plugins/qdrant/NOTICE.qdrant-skills b/plugins/qdrant/NOTICE.qdrant-skills new file mode 100644 index 00000000..8e8395ee --- /dev/null +++ b/plugins/qdrant/NOTICE.qdrant-skills @@ -0,0 +1,10 @@ +Qdrant skill content + +This plugin includes Qdrant skill material adapted for the Cline plugin +ecosystem. The original material is copyright 2026 Qdrant Solutions GmbH +and licensed under Apache License 2.0. + +Adaptations for this plugin include Cline package metadata, removal of source +marketplace installation instructions, normalization of markdown formatting, +and conversion of nested drill-down skills into linked reference documents so +Cline exposes a smaller set of primary skills. diff --git a/plugins/qdrant/README.md b/plugins/qdrant/README.md new file mode 100644 index 00000000..9408dcb6 --- /dev/null +++ b/plugins/qdrant/README.md @@ -0,0 +1,19 @@ +# qdrant + +Qdrant vector search guidance for Cline. This plugin bundles skills for designing, tuning, operating, and migrating Qdrant-backed search systems. + +## Cline Primitives + +- Bundled skills for Qdrant SDK usage, deployment options, scaling, performance optimization, monitoring, search quality, embedding model migration, and version upgrades. 
+ +The bundled skills treat live Qdrant collection, index, sharding, replica, migration, upgrade, and scaling changes as production database operations that need explicit user confirmation. + +## Requirements + +- No API keys are required to install the plugin. +- The skills can be used offline, but many of their references point to Qdrant documentation for deeper follow-up. +- The plugin does not start Qdrant, install SDKs, register MCP servers, or mutate clusters during installation. + +## Third-Party Notice + +The bundled Qdrant skill content is adapted for Cline from Qdrant skill material licensed under Apache License 2.0 by Qdrant Solutions GmbH. See `LICENSE.qdrant-skills` and `NOTICE.qdrant-skills`. diff --git a/plugins/qdrant/index.ts b/plugins/qdrant/index.ts new file mode 100644 index 00000000..e40a18c7 --- /dev/null +++ b/plugins/qdrant/index.ts @@ -0,0 +1,10 @@ +import type { AgentPlugin } from "@cline/sdk" + +const plugin: AgentPlugin = { + name: "qdrant", + manifest: { + capabilities: ["skills"], + }, +} + +export default plugin diff --git a/plugins/qdrant/package.json b/plugins/qdrant/package.json new file mode 100644 index 00000000..0ef26239 --- /dev/null +++ b/plugins/qdrant/package.json @@ -0,0 +1,19 @@ +{ + "name": "qdrant", + "version": "0.0.0", + "private": true, + "type": "module", + "description": "Qdrant vector search skills for scaling, search quality, monitoring, deployment, migrations, upgrades, and SDK usage.", + "cline": { + "plugins": [ + { + "paths": [ + "./index.ts" + ], + "capabilities": [ + "skills" + ] + } + ] + } +} diff --git a/plugins/qdrant/skills/index.md b/plugins/qdrant/skills/index.md new file mode 100644 index 00000000..591a68b8 --- /dev/null +++ b/plugins/qdrant/skills/index.md @@ -0,0 +1,16 @@ +# Qdrant Skills + +Agent skills encoding deep Qdrant knowledge for coding agents. + +## Available Skills + +- [qdrant-clients-sdk](qdrant-clients-sdk/SKILL.md) -- Client SDKs for Python, TypeScript, Rust, Go, .NET, and Java. 
+- [qdrant-deployment-options](qdrant-deployment-options/SKILL.md) -- Choosing between local, Docker, self-hosted, Cloud, and embedded deployments. +- [qdrant-model-migration](qdrant-model-migration/SKILL.md) -- Switching embedding models without downtime. +- [qdrant-monitoring](qdrant-monitoring/SKILL.md) -- Monitoring, observability, health checks, and debugging production issues. +- [qdrant-performance-optimization](qdrant-performance-optimization/SKILL.md) -- Search speed, memory usage, and indexing performance tuning. +- [qdrant-scaling](qdrant-scaling/SKILL.md) -- Scaling decisions: data volume, QPS, latency, horizontal vs vertical. +- [qdrant-search-quality](qdrant-search-quality/SKILL.md) -- Diagnosing bad results, search strategies, hybrid search, and reranking. +- [qdrant-version-upgrade](qdrant-version-upgrade/SKILL.md) -- Safe upgrade paths, compatibility guarantees, and rolling upgrades. + +Skills structure is hierarchical. Start with the top-level skill that matches the task, then follow its linked reference documents for deeper guidance. diff --git a/plugins/qdrant/skills/qdrant-clients-sdk/SKILL.md b/plugins/qdrant/skills/qdrant-clients-sdk/SKILL.md new file mode 100644 index 00000000..1b87de4e --- /dev/null +++ b/plugins/qdrant/skills/qdrant-clients-sdk/SKILL.md @@ -0,0 +1,65 @@ +--- +name: qdrant-clients-sdk +description: "Qdrant provides client SDKs for various programming languages, allowing easy integration with Qdrant deployments." +--- + +# Qdrant Clients SDK + +Qdrant has the following officially supported client SDK packages. Install or add one only when the user asks you to wire Qdrant into a project, or when the existing project already needs that dependency. 
+ +- Python -- [qdrant-client](https://github.com/qdrant/qdrant-client), install with `pip install qdrant-client`; use `pip install qdrant-client[fastembed]` only when the project needs local FastEmbed embedding support +- JavaScript / TypeScript -- [qdrant-js](https://github.com/qdrant/qdrant-js), install with `npm install @qdrant/js-client-rest` +- Rust -- [rust-client](https://github.com/qdrant/rust-client), install with `cargo add qdrant-client` +- Go -- [go-client](https://github.com/qdrant/go-client), install with `go get github.com/qdrant/go-client` +- .NET -- [qdrant-dotnet](https://github.com/qdrant/qdrant-dotnet), install with `dotnet add package Qdrant.Client` +- Java -- [java-client](https://github.com/qdrant/java-client), Maven artifact `io.qdrant:client` + + +## API Reference + +All interaction with Qdrant can happen through the REST API or gRPC API. We recommend using the REST API if you are using Qdrant for the first time or working on a prototype. + +* REST API - [OpenAPI Reference](https://api.qdrant.tech/api-reference) - [GitHub](https://github.com/qdrant/qdrant/blob/master/docs/redoc/master/openapi.json) +* gRPC API - [gRPC protobuf definitions](https://github.com/qdrant/qdrant/tree/master/lib/api/src/grpc/proto) + +## Code examples + +Start with the bundled Qdrant skills and linked reference docs in this plugin. If those are not enough for a specific SDK and use case, ask before fetching external Qdrant snippet-search results, then use a bounded query like: + +```bash +curl -X GET "https://skills.qdrant.tech/snippets/search?language=python&query=how+to+upload+points" +``` + +Available languages: `python`, `typescript`, `rust`, `java`, `go`, `csharp`. 
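The snippet-search request above can also be assembled programmatically. The following is a minimal sketch using only the Python standard library: it builds the URL and sends nothing, so it stays within the "ask before fetching" guidance. The helper name `build_snippet_search_url` and its `as_json` flag are hypothetical; the endpoint, the `language` and `query` parameters, the `format=json` option, and the language list are the ones this skill documents.

```python
from urllib.parse import urlencode

# Endpoint and language list as documented in this skill.
SNIPPET_SEARCH_URL = "https://skills.qdrant.tech/snippets/search"
SUPPORTED_LANGUAGES = {"python", "typescript", "rust", "java", "go", "csharp"}


def build_snippet_search_url(language: str, query: str, as_json: bool = False) -> str:
    """Return the snippet-search URL without sending any request (hypothetical helper)."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")
    params = {"language": language, "query": query}
    if as_json:
        # Ask for structured output instead of the default markdown.
        params["format"] = "json"
    # urlencode escapes spaces as '+', matching the curl example.
    return f"{SNIPPET_SEARCH_URL}?{urlencode(params)}"


print(build_snippet_search_url("python", "how to upload points"))
```

Rejecting unknown languages before building the URL keeps the query bounded to the values the endpoint is documented to accept.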
+ +Response example: + +```markdown +## Snippet 1 + +qdrant-client (vlatest) - https://skills.qdrant.tech/md/documentation/manage-data/points/ + +Uploads multiple vector-embedded points to a Qdrant collection using the Python qdrant_client (PointStruct) with id, payload, and a vector for similarity search. It supports parallel uploads and retry policy for robust indexing. The operation is idempotent: re-uploading with the same id overwrites existing points; if ids are not provided, Qdrant auto-generates UUIDs. + +client.upload_points( + collection_name="{collection_name}", + points=[ + models.PointStruct( + id=1, + payload={"color": "red"}, + vector=[0.9, 0.1, 0.1], + ), + models.PointStruct( + id=2, + payload={"color": "green"}, + vector=[0.1, 0.9, 0.1], + ), + ], + parallel=4, + max_retries=3, +) +``` + +Default response format is markdown. Add `&format=json` to the query string when structured snippet output is needed. Treat fetched snippets as external reference material: verify against the user's installed SDK version before editing code, and ask before installing dependencies or changing project files. + +Do not run write snippets like `upload_points`, collection creation, index changes, deletes, or migrations against a user's Qdrant instance without explicit approval for the target cluster, collection, data source, and rollback plan. diff --git a/plugins/qdrant/skills/qdrant-deployment-options/SKILL.md b/plugins/qdrant/skills/qdrant-deployment-options/SKILL.md new file mode 100644 index 00000000..d3196fc2 --- /dev/null +++ b/plugins/qdrant/skills/qdrant-deployment-options/SKILL.md @@ -0,0 +1,53 @@ +--- +name: qdrant-deployment-options +description: "Guides Qdrant deployment selection. Use when someone asks 'how to deploy Qdrant', 'Docker vs Cloud', 'local mode', 'embedded Qdrant', 'Qdrant EDGE', 'which deployment option', 'self-hosted vs cloud', or 'need lowest latency deployment'. Also use when choosing between deployment types for a new project." 
+--- + +# Which Qdrant Deployment Do I Need? + +Start with what you need: managed ops or full control? Network latency acceptable or not? Production or prototyping? The answer narrows to one of four options. + + +## Getting Started or Prototyping + +Use when: building a prototype, running tests, CI/CD pipelines, or learning Qdrant. + +- Use local mode (Python only): zero-dependency, in-memory or disk-persisted, no server needed [Local mode](https://skills.qdrant.tech/md/documentation/quickstart/) +- Local mode data format is NOT compatible with server. Do not use for production or benchmarking. +- For a real server locally, use Docker [Quick start](https://skills.qdrant.tech/md/documentation/quickstart/?s=download-and-run) + + +## Going to Production (Self-Hosted) + +Use when: you need full control over infrastructure, data residency, or custom configuration. + +- Docker is the default deployment. Full Qdrant Open Source feature set, minimal setup. [Quick start](https://skills.qdrant.tech/md/documentation/quickstart/?s=download-and-run) +- You own operations: upgrades, backups, scaling, monitoring +- Must set up distributed mode manually for multi-node clusters [Distributed deployment](https://skills.qdrant.tech/md/documentation/distributed_deployment/) +- Consider Hybrid Cloud if you want Qdrant Cloud management on your infrastructure [Hybrid Cloud](https://skills.qdrant.tech/md/documentation/hybrid-cloud/) + + +## Going to Production (Zero-Ops) + +Use when: you want managed infrastructure with zero-downtime updates, automatic backups, and resharding without operating clusters yourself. 
+ +- Qdrant Cloud handles upgrades, scaling, backups, and monitoring [Qdrant Cloud](https://skills.qdrant.tech/md/documentation/cloud-quickstart/) +- Supports multi-version upgrades automatically +- Provides features not available in self-hosted: `/sys_metrics`, managed resharding, pre-configured alerts + + +## Need Lowest Possible Latency + +Use when: network round-trip to a server is unacceptable. Edge devices, in-process search, or latency-critical applications. + +- Qdrant EDGE: in-process bindings to Qdrant shard-level functions, no network overhead [Qdrant EDGE](https://skills.qdrant.tech/md/documentation/edge/edge-quickstart/) +- Same data format as server. Can sync with server via shard snapshots. +- Single-node feature set only. No distributed mode. + + +## What NOT to Do + +- Use local mode for production or benchmarking (not optimized, incompatible data format) +- Self-host without monitoring and backup strategy (you will lose data or miss outages) +- Choose EDGE when you need distributed search (single-node only) +- Pick Hybrid Cloud unless you have data residency requirements (unnecessary Kubernetes complexity when Qdrant Cloud works) diff --git a/plugins/qdrant/skills/qdrant-model-migration/SKILL.md b/plugins/qdrant/skills/qdrant-model-migration/SKILL.md new file mode 100644 index 00000000..f9cfd107 --- /dev/null +++ b/plugins/qdrant/skills/qdrant-model-migration/SKILL.md @@ -0,0 +1,100 @@ +--- +name: qdrant-model-migration +description: "Guides embedding model migration in Qdrant without downtime. Use when someone asks 'how to switch embedding models', 'how to migrate vectors', 'how to update to a new model', 'zero-downtime model change', 'how to re-embed my data', or 'can I use two models at once'. Also use when upgrading model dimensions, switching providers, or A/B testing models." +--- + +# What to Do When Changing Embedding Models + +Vectors from different models are incompatible. You cannot mix old and new embeddings in the same vector space. 
On v1.18+, you can add or delete named vector fields on an existing collection -- migration no longer always requires a new collection. On v1.17 or earlier, all named vectors must be defined at collection creation time. + +- Understand collection aliases before choosing a strategy [Collection aliases](https://skills.qdrant.tech/md/documentation/manage-data/collections/?s=collection-aliases) + + +## Can I Avoid Re-embedding? + +Use when: looking for shortcuts before committing to full migration. + +You MUST re-embed if: changing model provider (OpenAI to Cohere), changing architecture (CLIP to BGE), incompatible dimension counts across different models, or adding sparse vectors to dense-only collection. + +You CAN avoid re-embedding if: using Matryoshka models (use `dimensions` parameter to output lower-dimensional embeddings, learn linear transformation from sample data, some recall loss, good for 100M+ datasets). Or changing quantization (binary to scalar): Qdrant re-quantizes automatically. [Quantization](https://skills.qdrant.tech/md/documentation/manage-data/quantization/) + + +## Need Zero Downtime + +Use when: production must stay available. Recommended for model replacement at scale. 
+ +- If the cluster is v1.18 or later AND the collection has named vectors: + + - Add the new vector field directly to the existing collection [Update vector schema](https://skills.qdrant.tech/md/documentation/manage-data/collections/?s=update-vector-schema) + - Re-embed all data in the background using `UpdateVectors` [Update vectors](https://skills.qdrant.tech/md/documentation/manage-data/points/?s=update-vectors) + - Verify search quality, then delete old vector field + +- If the cluster is v1.17 or earlier OR the collection doesn't have named vectors: + + - Create a new collection with the new model's dimensions and distance metric + - Re-embed all data into the new collection in the background + - Point your application at a collection alias instead of a direct collection name + - Atomically swap the alias to the new collection [Switch collection](https://skills.qdrant.tech/md/documentation/manage-data/collections/?s=switch-collection) + - Verify search quality, then delete the old collection + +Careful, the alias swap only redirects queries. Payloads must be re-uploaded separately. + + +## Need Both Models Live (Side-by-Side) + +Use when: A/B testing models, multi-modal (dense + sparse), or evaluating a new model before committing. + +- If the cluster is v1.18 or later: + + - Add the new vector field directly to the existing collection [Update vector schema](https://skills.qdrant.tech/md/documentation/manage-data/collections/?s=update-vector-schema) + - Backfill new model embeddings incrementally using `UpdateVectors` [Update vectors](https://skills.qdrant.tech/md/documentation/manage-data/points/?s=update-vectors) + +- If the cluster is v1.17 or earlier: You cannot add a named vector to an existing collection. 
Create a new collection with both vector fields defined upfront: + + - Create new collection with old and new named vectors both defined [Collection with multiple vectors](https://skills.qdrant.tech/md/documentation/manage-data/collections/?s=collection-with-multiple-vectors) + - Migrate data from old collection, preserving existing vectors in the old named field + - Backfill new model embeddings incrementally using `UpdateVectors` [Update vectors](https://skills.qdrant.tech/md/documentation/manage-data/points/?s=update-vectors) + - Compare quality by querying with `using: "old_model"` vs `using: "new_model"` + - Swap alias to new collection once satisfied + +Co-locating large multi-vectors (especially ColBERT) with dense vectors degrades ALL queries, even those only using dense. At millions of points, users report 13s latency dropping to 2s after removing ColBERT. Put large vectors on disk during side-by-side migration. + +If you anticipate future model migrations, define both vector fields upfront at collection creation. + + +## Dense to Hybrid Search Migration + +Use when: adding sparse/BM25 vectors to an existing dense-only collection. Most common migration pattern. + +You cannot add sparse vectors to an existing collection that uses a default (unnamed) dense vector. Must recreate: + +- Create new collection with both dense and sparse vector configs defined +- Re-embed all data with both dense and sparse models +- Migrate payloads, swap alias + +If the collection already uses named dense vectors and is on v1.18+, add the sparse vector field directly without recreating [Update vector schema](https://skills.qdrant.tech/md/documentation/manage-data/collections/?s=update-vector-schema). + +Sparse vectors at chunk level have different TF-IDF characteristics than document level. Test retrieval quality after migration, especially for non-English text without stop-word removal. 
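The recreate-and-swap path above can be sketched as two REST request descriptions. This is a minimal, standard-library-only sketch that builds the payloads and sends nothing, so no cluster is touched without approval. The collection names `docs_v2` and `docs`, the vector names `dense` and `sparse`, the size 768, and the Cosine distance are hypothetical placeholders. The request shapes reflect the Qdrant REST API for collection creation and alias updates as commonly documented; verify them against the target server version before use.

```python
import json


def create_hybrid_collection_request(collection: str, dense_size: int) -> dict:
    """Describe a PUT /collections/{name} call with named dense and sparse vectors."""
    return {
        "method": "PUT",
        "path": f"/collections/{collection}",
        "body": {
            # Named dense vector; size and distance must match the embedding model.
            "vectors": {"dense": {"size": dense_size, "distance": "Cosine"}},
            # Named sparse vector with default settings.
            "sparse_vectors": {"sparse": {}},
        },
    }


def swap_alias_request(alias: str, new_collection: str) -> dict:
    """Describe a POST /collections/aliases call that repoints the alias in one request."""
    return {
        "method": "POST",
        "path": "/collections/aliases",
        "body": {
            "actions": [
                {"delete_alias": {"alias_name": alias}},
                {"create_alias": {"collection_name": new_collection, "alias_name": alias}},
            ]
        },
    }


# Print the planned requests for review; nothing is sent to a server.
for request in (
    create_hybrid_collection_request("docs_v2", 768),
    swap_alias_request("docs", "docs_v2"),
):
    print(request["method"], request["path"])
    print(json.dumps(request["body"], indent=2))
```

Printing the requests first gives the user something concrete to approve. The alias swap carries no data, so points and payloads still have to be written into the new collection before the swap.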
+ + +## Re-embedding Is Too Slow + +Use when: dataset is large and re-embedding is the bottleneck. + +- Use `update_mode: insert` (v1.17+) for safe idempotent migration [Update mode](https://skills.qdrant.tech/md/documentation/manage-data/points/?s=update-mode) +- Scroll the old collection with `with_vectors=False`, re-embed in batches, upsert into new collection +- Upload in parallel batches (64-256 points per request, 2-4 parallel streams) [Bulk upload](https://skills.qdrant.tech/md/documentation/tutorials-develop/bulk-upload/) +- Disable HNSW during bulk load (set `indexing_threshold_kb` very high, restore after) +- For Qdrant Cloud inference, switching models is a config change, not a pipeline change [Inference docs](https://skills.qdrant.tech/md/documentation/inference/) + +For 400GB+ datasets, expect days. For small datasets (<25MB), re-indexing from source is faster than using the migration tool. + + +## What NOT to Do + +- Assume you can add named vectors to an existing collection on v1.17 or earlier servers; check your server version first +- Delete the old collection before verifying the new one +- Forget to update the query embedding model in your application code +- Skip payload migration when using alias swap (aliases redirect queries, they do not copy data) +- Keep ColBERT vectors co-located with dense vectors during a long migration (I/O cost degrades all queries) +- Migrate to hybrid search without testing BM25 quality at chunk level diff --git a/plugins/qdrant/skills/qdrant-monitoring/SKILL.md b/plugins/qdrant/skills/qdrant-monitoring/SKILL.md new file mode 100644 index 00000000..e25345a7 --- /dev/null +++ b/plugins/qdrant/skills/qdrant-monitoring/SKILL.md @@ -0,0 +1,20 @@ +--- +name: qdrant-monitoring +description: "Guides Qdrant monitoring and observability setup. 
Use when someone asks 'how to monitor Qdrant', 'what metrics to track', 'is Qdrant healthy', 'optimizer stuck', 'why is memory growing', 'requests are slow', or needs to set up Prometheus, Grafana, or health checks. Also use when debugging production issues that require metric analysis." +--- + +# Qdrant Monitoring + +Qdrant monitoring allows tracking performance and health of your deployment, and identifying issues before they become outages. First determine whether you need to set up monitoring or diagnose an active issue. + +- Understand available metrics [Monitoring docs](https://skills.qdrant.tech/md/documentation/ops-monitoring/monitoring/) + + +## Monitoring Setup + +Prometheus scraping, health probes, Hybrid Cloud specifics, alerting, and log centralization. [Monitoring Setup](setup/README.md) + + +## Debugging with Metrics + +Optimizer stuck, memory growth, slow requests. Using metrics to diagnose active production issues. [Debugging with Metrics](debugging/README.md) diff --git a/plugins/qdrant/skills/qdrant-monitoring/debugging/README.md b/plugins/qdrant/skills/qdrant-monitoring/debugging/README.md new file mode 100644 index 00000000..6d0fa8cb --- /dev/null +++ b/plugins/qdrant/skills/qdrant-monitoring/debugging/README.md @@ -0,0 +1,52 @@ +--- +name: qdrant-monitoring-debugging +description: "Diagnoses Qdrant production issues using metrics and observability tools. Use when someone reports 'optimizer stuck', 'indexing too slow', 'memory too high', 'OOM crash', 'queries are slow', 'latency spike', or 'search was fast now it's slow'. Also use when performance degrades without obvious config changes." +--- + +# How to Debug Qdrant with Metrics + +First check optimizer status. Most production issues trace back to active optimizations competing for resources. If optimizer is clean, check memory, then request metrics. + + +## Optimizer Stuck or Too Slow + +Use when: optimizer running for hours, not finishing, or showing errors. 
+ +- Use `/collections/{collection_name}/optimizations` endpoint (v1.17+) to check status [Optimization monitoring](https://skills.qdrant.tech/md/documentation/ops-optimization/optimizer/?s=optimization-monitoring) +- Query with optional detail flags: `?with=queued,completed,idle_segments` +- Returns: queued optimizations count, active optimizer type, involved segments, progress tracking +- Web UI has an Optimizations tab with timeline view and per-task duration metrics [Web UI](https://skills.qdrant.tech/md/documentation/ops-optimization/optimizer/?s=web-ui) +- If `optimizer_status` shows an error in collection info, check logs for disk full or corrupted segments +- Large merges and HNSW rebuilds legitimately take hours on big datasets. Check progress before assuming it's stuck. + + +## Memory Seems Too High + +Use when: memory exceeds expectations, node crashes with OOM, or memory keeps growing. + +- Process memory metrics available via `/metrics` (RSS, allocated bytes, page faults) +- Qdrant uses two types of RAM: resident memory (data structures, quantized vectors) and OS page cache (cached disk reads). Page cache filling available RAM is normal. [Memory article](https://qdrant.tech/articles/memory-consumption/) +- If resident memory (RSSAnon) exceeds 80% of total RAM, investigate +- Check `/telemetry` for per-collection breakdown of point counts and vector configurations +- Estimate expected memory: `num_vectors * dimensions * 4 bytes * 1.5` for vectors, plus payload and index overhead [Capacity planning](https://skills.qdrant.tech/md/documentation/capacity-planning/) +- Common causes of unexpected growth: quantized vectors with `always_ram=true`, too many payload indexes, large `max_segment_size` during optimization + + +## Queries Are Slow + +Use when: queries slower than expected and you need to identify the cause. 
+ +- Track `rest_responses_avg_duration_seconds` and `rest_responses_max_duration_seconds` per endpoint +- Use histogram metric `rest_responses_duration_seconds` (v1.8+) for percentile analysis in Grafana +- Equivalent gRPC metrics with `grpc_responses_` prefix +- Check optimizer status first. Active optimizations compete for CPU and I/O, degrading search latency. +- Check segment count via collection info. Too many unmerged segments after bulk upload causes slower search. +- Compare filtered vs unfiltered query times. Large gap means missing payload index. [Payload index](https://skills.qdrant.tech/md/documentation/manage-data/indexing/?s=payload-index) + + +## What NOT to Do + +- Ignore optimizer status when debugging slow queries (most common root cause) +- Assume memory leak when page cache fills RAM (normal OS behavior) +- Make config changes while optimizer is running (causes cascading re-optimizations) +- Blame Qdrant before checking if bulk upload just finished (unmerged segments) diff --git a/plugins/qdrant/skills/qdrant-monitoring/setup/README.md b/plugins/qdrant/skills/qdrant-monitoring/setup/README.md new file mode 100644 index 00000000..08d93965 --- /dev/null +++ b/plugins/qdrant/skills/qdrant-monitoring/setup/README.md @@ -0,0 +1,61 @@ +--- +name: qdrant-monitoring-setup +description: "Guides Qdrant monitoring setup including Prometheus scraping, health probes, Hybrid Cloud metrics, alerting, and log centralization. Use when someone asks 'how to set up monitoring', 'Prometheus config', 'Grafana dashboard', 'health check endpoints', 'how to scrape Hybrid Cloud', 'what alerts to set', 'how to centralize logs', or 'audit logging'." +--- + +# How to Set Up Qdrant Monitoring + +Get Prometheus scraping working first, then health probes, then alerting. Do not skip monitoring setup before going to production. + + +## Prometheus Metrics + +Use when: setting up metric collection for the first time or adding a new deployment. 
+ +- Node metrics at `/metrics` endpoint [Monitoring docs](https://skills.qdrant.tech/md/documentation/ops-monitoring/monitoring/) +- Cluster metrics at `/sys_metrics` (Qdrant Cloud only) +- Prefix customization via `service.metrics_prefix` config or `QDRANT__SERVICE__METRICS_PREFIX` env var +- Example self-hosted setup with Prometheus + Grafana [prometheus-monitoring repo](https://github.com/qdrant/prometheus-monitoring) + + +## Hybrid Cloud Scraping + +Use when: running Qdrant Hybrid Cloud and need cluster-level visibility. + +Do not just scrape Qdrant nodes. In Hybrid Cloud, you manage the Kubernetes data plane. You must also scrape the cluster-exporter and operator pods for full cluster visibility and operator state. + +- Hybrid Cloud Prometheus setup tutorial [Hybrid Cloud Prometheus](https://skills.qdrant.tech/md/documentation/ops-monitoring/hybrid-cloud-prometheus/) +- Official Grafana dashboards [Grafana dashboard repo](https://github.com/qdrant/qdrant-cloud-grafana-dashboard) + + +## Liveness and Readiness Probes + +Use when: configuring Kubernetes health checks. + +- Use `/healthz`, `/livez`, `/readyz` for basic status, liveness, and readiness [Kubernetes health endpoints](https://skills.qdrant.tech/md/documentation/ops-monitoring/monitoring/?s=kubernetes-health-endpoints) + + +## Alerting + +Use when: setting up alerts for production or Hybrid Cloud deployments. + +- Hybrid Cloud provides ~11 pre-configured Prometheus alerts out of the box [Cloud cluster monitoring](https://skills.qdrant.tech/md/documentation/cloud/cluster-monitoring/) +- Use AlertmanagerConfig to route alerts to Slack, PagerDuty, or other targets based on labels +- At minimum, alert on: optimizer errors, node not ready, replication factor below target, disk usage >80% + + +## Log Centralization and Audit Logging + +Use when: enterprise compliance requires centralized logs or audit trails. 
+ +- Enable JSON log format for structured analysis: set `logger.format` to `json` in config [Configuration](https://skills.qdrant.tech/md/documentation/ops-configuration/configuration/) +- Use FluentD/OpenSearch for log aggregation +- Audit logs (v1.17+) write to local filesystem (`/qdrant/storage/audit/`), not stdout. Mount a Persistent Volume and deploy a sidecar container to tail these files to stdout so DaemonSets can pick them up. [Audit logging](https://skills.qdrant.tech/md/documentation/security/?s=audit-logging) + + +## What NOT to Do + +- Scrape `/sys_metrics` on self-hosted (only available on Qdrant Cloud) +- Scrape only Qdrant nodes in Hybrid Cloud (miss cluster-exporter and operator metrics) +- Skip monitoring setup before going to production (you will regret it) +- Alert on page cache memory usage (it's supposed to fill available RAM, normal OS behavior) diff --git a/plugins/qdrant/skills/qdrant-performance-optimization/SKILL.md b/plugins/qdrant/skills/qdrant-performance-optimization/SKILL.md new file mode 100644 index 00000000..d8661248 --- /dev/null +++ b/plugins/qdrant/skills/qdrant-performance-optimization/SKILL.md @@ -0,0 +1,33 @@ +--- +name: qdrant-performance-optimization +description: "Different techniques to optimize the performance of Qdrant, including indexing strategies, query optimization, and hardware considerations. Use when you want to improve the speed and efficiency of your Qdrant deployment." +--- + + +# Qdrant Performance Optimization + +Qdrant performance has several aspects. This document serves as a navigation hub for performance optimization in Qdrant. + + +## Search Speed Optimization + +There are two different criteria for search speed: latency and throughput. +Latency is the time it takes to get a response for a single query, while throughput is the number of queries that can be processed in a given time frame. 
+Depending on your use case, you may want to optimize for one or both of these metrics. + +More on search speed optimization can be found in the [Search Speed Optimization](search-speed-optimization/README.md) skill. + + +## Indexing Performance Optimization + +Qdrant needs to build a vector index to perform efficient similarity search. The time it takes to build the index can vary depending on the size of your dataset, hardware, and configuration. + +More on indexing performance optimization can be found in the [Indexing Performance Optimization](indexing-performance-optimization/README.md) skill. + + +## Memory Usage Optimization + +Vector search can be memory intensive, especially when dealing with large datasets. +Qdrant has a flexible memory management system, which allows you to precisely control which parts of storage are kept in memory and which are stored on disk. This can help you optimize memory usage without sacrificing performance. + +More on memory usage optimization can be found in the [Memory Usage Optimization](memory-usage-optimization/README.md) skill. \ No newline at end of file diff --git a/plugins/qdrant/skills/qdrant-performance-optimization/indexing-performance-optimization/README.md b/plugins/qdrant/skills/qdrant-performance-optimization/indexing-performance-optimization/README.md new file mode 100644 index 00000000..01f3b005 --- /dev/null +++ b/plugins/qdrant/skills/qdrant-performance-optimization/indexing-performance-optimization/README.md @@ -0,0 +1,80 @@ +--- +name: qdrant-indexing-performance-optimization +description: "Diagnoses and fixes slow Qdrant indexing and data ingestion. Use when someone reports 'uploads are slow', 'indexing takes forever', 'optimizer is stuck', 'HNSW build time too long', or 'data uploaded but search is bad'. Also use when optimizer status shows errors, segments won't merge, or indexing threshold questions arise." 
+--- + +# What to Do When Qdrant Indexing Is Too Slow + +Qdrant does NOT build HNSW indexes immediately. Small segments use brute-force until they exceed `indexing_threshold_kb` (default: 20 MB). Search during this window is slower by design, not a bug. + +- Understand the indexing optimizer [Indexing optimizer](https://skills.qdrant.tech/md/documentation/ops-optimization/optimizer/?s=indexing-optimizer) + + +## Uploads/Ingestion Too Slow + +Use when: upload or upsert API calls are slow. +Identify bottleneck: client-side (network, batching) vs server-side (CPU, disk I/O) + +For client-side, optimize batching and parallelism: + +- Use batch upserts (64-256 points per request) [Points API](https://skills.qdrant.tech/md/documentation/manage-data/points/?s=upload-points) +- Use 2-4 parallel upload streams + +For server-side, optimize Qdrant configuration and indexing strategy: + +- Create more shards (3-12), each shard has an independent update worker [Sharding](https://skills.qdrant.tech/md/documentation/distributed_deployment/?s=sharding) +- Create payload indexes before HNSW builds (needed for filterable vector index) [Payload index](https://skills.qdrant.tech/md/documentation/manage-data/indexing/?s=payload-index) + +Suitable for initial bulk load of large datasets: + +- Disable HNSW during bulk load (set `indexing_threshold_kb` very high, restore after) [Collection params](https://skills.qdrant.tech/md/documentation/manage-data/collections/?s=update-collection-parameters) +- Setting `m=0` to disable HNSW is legacy, use high `indexing_threshold_kb` instead + +Careful, fast unindexed upload might temporarily use more RAM and degrade search performance until optimizer catches up. + +See https://skills.qdrant.tech/md/documentation/tutorials-develop/bulk-upload/ + + +## Optimizer Stuck or Taking Too Long + +Use when: optimizer running for hours, not finishing. 
+ +- Check actual progress via optimizations endpoint (v1.17+) [Optimization monitoring](https://skills.qdrant.tech/md/documentation/ops-optimization/optimizer/?s=optimization-monitoring) +- Large merges and HNSW rebuilds legitimately take hours on big datasets +- Check CPU and disk I/O (HNSW is CPU-bound, merging is I/O-bound, HDD is not viable) +- If `optimizer_status` shows an error, check logs for disk full or corrupted segments + + +## HNSW Build Time Too High + +Use when: HNSW index build dominates total indexing time. + +- Reduce `m` (default 16, good for most cases, 32+ rarely needed) [HNSW params](https://skills.qdrant.tech/md/documentation/manage-data/indexing/?s=vector-index) +- Reduce `ef_construct` (100-200 sufficient) [HNSW config](https://skills.qdrant.tech/md/documentation/manage-data/collections/?s=indexing-vectors-in-hnsw) +- Keep `max_indexing_threads` proportional to CPU cores [Configuration](https://skills.qdrant.tech/md/documentation/ops-configuration/configuration/) +- Use GPU for indexing [GPU indexing](https://skills.qdrant.tech/md/documentation/ops-configuration/running-with-gpu/) + +## HNSW index for multi-tenant collections + +If you have a multi-tenant use case where all data is split by some payload field (e.g. `tenant_id`), you can avoid building a global HNSW index and instead rely on `payload_m` to build HNSW index only for subsets of data. +Skipping global HNSW index can significantly reduce indexing time. + +See [Multi-tenant collections](https://skills.qdrant.tech/md/documentation/manage-data/multitenancy/) for details. + +## Additional Payload Indexes Are Too Slow + +Qdrant builds extra HNSW links for all payload indexes to ensure that quality of filtered vector search does not degrade. +Some payload indexes (e.g. `text` fields with long texts) can have a very high number of unique values per point, which can lead to long HNSW build time. 
+ +You can disable building extra HNSW links for specific payload index and instead rely on slightly slower query-time strategies like ACORN. + +Read more about disabling extra HNSW links in [documentation](https://skills.qdrant.tech/md/documentation/manage-data/indexing/?s=disable-the-creation-of-extra-edges-for-payload-fields) + +Read more about ACORN in [documentation](https://skills.qdrant.tech/md/documentation/search/search/?s=acorn-search-algorithm) + + +## What NOT to Do + +- Do not create payload indexes AFTER HNSW is built (breaks filterable vector index) +- Do not use `m=0` for bulk uploads into an existing collection, it might drop the existing HNSW and cause long reindexing +- Do not upload one point at a time (per-request overhead dominates) diff --git a/plugins/qdrant/skills/qdrant-performance-optimization/memory-usage-optimization/README.md b/plugins/qdrant/skills/qdrant-performance-optimization/memory-usage-optimization/README.md new file mode 100644 index 00000000..ea6a806e --- /dev/null +++ b/plugins/qdrant/skills/qdrant-performance-optimization/memory-usage-optimization/README.md @@ -0,0 +1,67 @@ +--- +name: qdrant-memory-usage-optimization +description: "Diagnoses and reduces Qdrant memory usage. Use when someone reports 'memory too high', 'RAM keeps growing', 'node crashed', 'out of memory', 'memory leak', or asks 'why is memory usage so high?', 'how to reduce RAM?'. Also use when memory doesn't match calculations, quantization didn't help, or nodes crash during recovery." +--- + +# Understanding memory usage + +Qdrant operates with two types of memory: + +- Resident memory (aka RSSAnon) - memory used for internal data structures like the ID tracker, plus components that must stay in RAM, such as quantized vectors when `always_ram=true` and payload indexes. + +- OS page cache - memory used for caching disk reads, which can be released when needed. 
Original vectors are normally stored in page cache, so the service won't crash if RAM is full, but performance may degrade. + +It is normal for the OS page cache to occupy all available RAM, but if resident memory is above 80% of total RAM, it is a sign of a problem. + +## Memory usage monitoring + +- Qdrant exposes memory usage through the `/metrics` endpoint. See [Monitoring docs](https://skills.qdrant.tech/md/documentation/ops-monitoring/monitoring/). + + + + +## How much memory is needed for Qdrant? + +Optimal memory usage depends on the use case. + +- For regular search scenarios, general guidelines are provided in the [Capacity planning docs](https://skills.qdrant.tech/md/documentation/capacity-planning/). + +For a detailed breakdown of memory usage at large scale, see [Large scale memory usage example](https://skills.qdrant.tech/md/documentation/tutorials-operations/large-scale-search/?s=memory-usage). + +Payload indexes and HNSW graph also require memory, along with vectors themselves, so it's important to consider them in calculations. + +Additionally, Qdrant requires some extra memory for optimizations. During optimization, optimized segments are fully loaded into RAM, so it is important to leave enough headroom. +The larger `max_segment_size` is, the more headroom is needed. + + +### When to put HNSW index on disk + +Putting frequently used components (such as HNSW index) on disk might cause significant performance degradation. +There are some scenarios, however, when it can be a good option: + +- Deployments with low latency disks - local NVMe or similar. +- Multi-tenant deployments, where only a subset of tenants is frequently accessed, so that only a fraction of data & index is loaded in RAM at a time. +- For deployments with [inline storage](https://skills.qdrant.tech/md/documentation/ops-optimization/optimize/?s=inline-storage-in-hnsw-index) enabled. 
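
As a rough illustration of the memory guidance in this skill, the sketch below applies the rule of thumb quoted in this plugin's monitoring debugging skill (`num_vectors * dimensions * 4 bytes * 1.5` for vectors) together with the 80% resident-memory threshold. The function names and the example numbers are hypothetical, and payload, payload index, and optimization headroom are deliberately left out, so the result is closer to a lower bound than to a sizing answer.

```python
GIB = 1024 ** 3


def estimate_vector_ram_bytes(
    num_vectors: int,
    dimensions: int,
    bytes_per_component: int = 4,  # float32 components
    overhead: float = 1.5,         # rule-of-thumb multiplier for vectors
) -> int:
    """Rule-of-thumb RAM estimate for vectors only (no payload or payload indexes)."""
    return int(num_vectors * dimensions * bytes_per_component * overhead)


def resident_memory_needs_investigation(
    resident_bytes: int,
    total_ram_bytes: int,
    threshold: float = 0.8,  # the "above 80% of total RAM" guideline
) -> bool:
    """True when resident memory (not page cache) exceeds the threshold."""
    return resident_bytes > threshold * total_ram_bytes


if __name__ == "__main__":
    # Hypothetical workload: 1M vectors with 768 dimensions.
    estimate = estimate_vector_ram_bytes(1_000_000, 768)
    print(f"estimated vector RAM: {estimate / GIB:.2f} GiB")

    # Hypothetical node: 32 GiB of RAM with 26 GiB resident.
    print(resident_memory_needs_investigation(26 * GIB, 32 * GIB))
```

Only resident memory is compared against the threshold here; page cache filling the remaining RAM is expected and is not a signal on its own.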
+ + +## How to minimize memory footprint + +The main challenge is to put on disk those parts of data, which are rarely accessed. +Here are the main techniques to achieve that: + +- Use quantization to store only compressed vectors in RAM [Quantization docs](https://skills.qdrant.tech/md/documentation/manage-data/quantization/) + +- Use float16 or int8 datatypes to reduce memory usage of vectors by 2x or 4x respectively, with some tradeoff in precision. Read more about vector datatypes in [documentation](https://skills.qdrant.tech/md/documentation/manage-data/vectors/?s=datatypes) + +- Leverage Matryoshka Representation Learning (MRL) to store only small vectors in RAM while keeping large vectors on disk. Examples of how to use MRL with Qdrant Cloud inference: [MRL docs](https://skills.qdrant.tech/md/documentation/inference/?s=reduce-vector-dimensionality-with-matryoshka-models) + +- For multi-tenant deployments with small tenants, vectors might be stored on disk because the same tenant's data is stored together [Multitenancy docs](https://skills.qdrant.tech/md/documentation/manage-data/multitenancy/?s=calibrate-performance) + +- For deployments with fast local storage and relatively low requirements for search throughput, it may be possible to store all components of vector store on disk. Read more about the performance implications of on-disk storage in [the article](https://qdrant.tech/articles/memory-consumption/) + +- For low RAM environments, consider `async_scorer` config, which enables support of `io_uring` for parallel disk access, which can significantly improve performance of on-disk storage. Read more about `async_scorer` in [the article](https://qdrant.tech/articles/io_uring/) (only available on Linux with kernel 5.11+) + +- Consider storing Sparse Vectors and text payload on disk, as they are usually more disk-friendly than dense vectors. 
+- Configure payload indexes to be stored on disk [docs](https://skills.qdrant.tech/md/documentation/manage-data/indexing/?s=on-disk-payload-index) +- Configure sparse vectors to be stored on disk [docs](https://skills.qdrant.tech/md/documentation/manage-data/indexing/?s=sparse-vector-index) + diff --git a/plugins/qdrant/skills/qdrant-performance-optimization/search-speed-optimization/README.md b/plugins/qdrant/skills/qdrant-performance-optimization/search-speed-optimization/README.md new file mode 100644 index 00000000..41ebf86e --- /dev/null +++ b/plugins/qdrant/skills/qdrant-performance-optimization/search-speed-optimization/README.md @@ -0,0 +1,77 @@ +--- +name: qdrant-search-speed-optimization +description: "Diagnoses and fixes slow Qdrant search. Use when someone reports 'search is slow', 'high latency', 'queries take too long', 'low QPS', 'throughput too low', 'filtered search is slow', or 'search was fast but now it's slow'. Also use when search performance degrades after config changes or data growth." +--- + +# Diagnose a problem + +There are multiple possible reasons for search performance degradation. The most common ones are: + +* Memory pressure: if the working set exceeds available RAM +* Complex requests (e.g. high `hnsw_ef`, complex filters without payload index) +* Competing background processes (e.g. optimizer still running after bulk upload) +* Problem with the cluster (e.g. network issues, hardware degradation) + + +## Single Query Too Slow (Latency) + +Use when: individual queries take too long regardless of load. 
+ +### Diagnostic steps: + +- Check if second run of the same request is significantly faster (indicates memory pressure) +- Try the same query with `with_payload: false` and `with_vectors: false` to see if payload retrieval is the bottleneck +- If request uses filters, try to remove them one by one to identify if a specific filter condition is the bottleneck + +### Common fixes: + +- Tune HNSW parameters: [Fine-tuning search](https://skills.qdrant.tech/md/documentation/ops-optimization/optimize/?s=fine-tuning-search-parameters) +- Enable in-memory quantization: [Scalar quantization](https://skills.qdrant.tech/md/documentation/manage-data/quantization/?s=scalar-quantization) +- Reduce Vector Dimensionality with Matryoshka Models: [Matryoshka Models](https://skills.qdrant.tech/md/documentation/inference/?s=reduce-vector-dimensionality-with-matryoshka-models) +- Use oversampling + rescore for high-dimensional vectors [Search with quantization](https://skills.qdrant.tech/md/documentation/manage-data/quantization/?s=searching-with-quantization) +- Enable io_uring for disk-heavy workloads on Linux [io_uring](https://qdrant.tech/articles/io_uring/) + + +## Can't Handle Enough QPS (Throughput) + +Use when: system can't serve enough queries per second under load. + +- Reduce segment count (`default_segment_number` to 2) [Maximizing throughput](https://skills.qdrant.tech/md/documentation/ops-optimization/optimize/?s=maximizing-throughput) +- Use batch search API instead of single queries [Batch search](https://skills.qdrant.tech/md/documentation/search/search/?s=batch-search-api) +- Enable quantization to reduce CPU cost [Scalar quantization](https://skills.qdrant.tech/md/documentation/manage-data/quantization/?s=scalar-quantization) +- Add replicas to distribute read load [Replication](https://skills.qdrant.tech/md/documentation/distributed_deployment/?s=replication) + + +## Filtered Search Is Slow + +Use when: filtered search is significantly slower than unfiltered. 
Most common SA complaint after memory. + +- Create payload index on the filtered field [Payload index](https://skills.qdrant.tech/md/documentation/manage-data/indexing/?s=payload-index) +- Use `is_tenant=true` for primary filtering condition: [Tenant index](https://skills.qdrant.tech/md/documentation/manage-data/indexing/?s=tenant-index) +- Try ACORN algorithm for complex filters: [ACORN](https://skills.qdrant.tech/md/documentation/search/search/?s=acorn-search-algorithm) +- Avoid using `nested` filtering conditions as a primary filter. It might force qdrant to read raw payload values instead of using index. +- If payload index was added after HNSW build, trigger re-index to create filterable subgraph links + + +## Optimize search performance with parallel updates + +### Diagnostic steps + +- Try to run the same query with `indexed_only=true` parameter, if the query is significantly faster, it means that the optimizer is still running and has not yet indexed all segments. +- If CPU or IO usage is high even with no queries, it also indicates that the optimizer is still running. + +### Recommended configuration changes + +- reduce `optimizer_cpu_budget` to reserve more CPU for queries +- Use `prevent_unoptimized=true` to prevent creating segments with a large amount of unindexed data for searches. Instead, once a segment reaches the so called indexing_threshold, all additional points will be added in 'deferred state'. 
+ +Learn more [here](https://skills.qdrant.tech/md/documentation/search/low-latency-search/?s=query-indexed-data-only) + + +## What NOT to Do + +- Set `always_ram=false` on quantization (disk thrashing on every search) +- Put HNSW on disk for latency-sensitive production (only for cold storage) +- Increase segment count for throughput (opposite: fewer = better) +- Create payload indexes on every field (wastes memory) +- Blame Qdrant before checking optimizer status diff --git a/plugins/qdrant/skills/qdrant-scaling/SKILL.md b/plugins/qdrant/skills/qdrant-scaling/SKILL.md new file mode 100644 index 00000000..3ff3d9ef --- /dev/null +++ b/plugins/qdrant/skills/qdrant-scaling/SKILL.md @@ -0,0 +1,47 @@ +--- +name: qdrant-scaling +description: "Guides Qdrant scaling decisions. Use when someone asks 'how many nodes do I need', 'data doesn't fit on one node', 'need more throughput', 'cluster is slow', 'too many tenants', 'vertical or horizontal', 'how to shard', or 'need to add capacity'." +--- + +# Qdrant Scaling + +First determine what you're scaling for: + +- data volume +- query throughput (QPS) +- query latency +- query volume + +After determining the scaling goal, we can choose scaling strategy based on tradeoffs and assumptions. +Each pulls toward different strategies. Scaling for throughput and latency are opposite tuning directions. + + +## Scaling Data Volume + +This becomes relevant when volume of the dataset exceeds the capacity of a single node. +Read more about scaling for data volume in [Scaling Data Volume](scaling-data-volume/README.md) + + +## Scaling for Query Throughput + +If your system needs to handle more parallel queries than a single node can handle, + then you need to scale for query throughput. + +Read more about scaling for query throughput in [Scaling for Query Throughput](scaling-qps/README.md) + +## Scaling for Query Latency + +Latency of a single query is determined by the slowest component in the query execution path. 
+It is sometimes correlated with throughput, but not always. It might require different strategies for scaling. + +Read more about scaling for query latency in [Scaling for Query Latency](minimize-latency/README.md) + + +## Scaling for Query Volume + +By query volume we mean the number of results that a single query returns. +If the query volume is too high, it can cause performance issues and increase latency. + +Tuning for query volume might require special strategies. + +Read more about scaling for query volume in [Scaling for Query Volume](scaling-query-volume/README.md) diff --git a/plugins/qdrant/skills/qdrant-scaling/minimize-latency/README.md b/plugins/qdrant/skills/qdrant-scaling/minimize-latency/README.md new file mode 100644 index 00000000..99e03ca8 --- /dev/null +++ b/plugins/qdrant/skills/qdrant-scaling/minimize-latency/README.md @@ -0,0 +1,41 @@ +--- +name: qdrant-minimize-latency +description: "Guides Qdrant query latency optimization. Use when someone asks 'search is slow', 'how to reduce latency', 'p99 is too high', 'tail latency', 'single query too slow', 'how to make search faster', or 'latency spikes'." +--- + +# Scaling for Query Latency + +Latency of a single query is determined by the slowest component in the query execution path. It is sometimes correlated with throughput, but not always -- throughput and latency are opposite tuning directions. + +Low-latency optimization aims to saturate the available resources for a single query, while throughput optimization aims to minimize per-query resource usage to allow more parallel queries. 
+ +## Performance Tuning for Lower Latency + +- Increase segment count to match CPU cores (`default_segment_number: 16`) [Minimizing latency](https://skills.qdrant.tech/md/documentation/ops-optimization/optimize/?s=minimizing-latency) +- Keep quantized vectors and HNSW in RAM (`always_ram=true`) +- Reduce `hnsw_ef` at query time (trade recall for speed) [Search params](https://skills.qdrant.tech/md/documentation/ops-optimization/optimize/?s=fine-tuning-search-parameters) +- Use local NVMe, avoid network-attached storage + +## Memory Pressure and Latency + +RAM is the most critical resource for latency. If the working set exceeds available RAM, OS cache eviction causes severe, sustained latency degradation. + +- Scale RAM vertically first. Critical if working set >80%. +- Use quantization: scalar (4x reduction) or binary (32x reduction for float32 vectors) [Quantization](https://skills.qdrant.tech/md/documentation/manage-data/quantization/) +- Move payload indexes to disk if filtering is infrequent [On-disk payload index](https://skills.qdrant.tech/md/documentation/manage-data/indexing/?s=on-disk-payload-index) +- Set `optimizer_cpu_budget` to limit background optimization CPUs +- Schedule indexing: set high `indexing_threshold` during peak hours + + +## Vertical Scaling for Latency + +More RAM and faster CPU directly reduce latency. See [Vertical Scaling](../scaling-data-volume/vertical-scaling/README.md) for node sizing guidelines. 
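
The latency-oriented settings listed under "Performance Tuning for Lower Latency" can be sketched as REST request bodies. This is a minimal sketch under stated assumptions, not a recommended configuration: the collection name, vector size, distance, the 16-core node, and the `hnsw_ef` value of 64 are all hypothetical, and the field layout follows the Qdrant REST API as we understand it, so verify it against the API reference for your server version.

```python
import json

COLLECTION_NAME = "latency_demo"  # hypothetical collection name
VECTOR_SIZE = 768                 # hypothetical embedding size

# Body for: PUT /collections/{collection_name}
create_collection_body = {
    "vectors": {"size": VECTOR_SIZE, "distance": "Cosine"},
    # One segment per CPU core, assuming a 16-core node.
    "optimizers_config": {"default_segment_number": 16},
    # Keep quantized vectors in RAM.
    "quantization_config": {"scalar": {"type": "int8", "always_ram": True}},
    # Keep the HNSW graph in RAM as well.
    "hnsw_config": {"on_disk": False},
}

# Body for: POST /collections/{collection_name}/points/query
query_body = {
    "query": [0.1] * VECTOR_SIZE,  # placeholder query vector
    "limit": 10,
    # A lower hnsw_ef trades recall for speed; 64 is an arbitrary example.
    "params": {"hnsw_ef": 64},
}

if __name__ == "__main__":
    print(f"PUT /collections/{COLLECTION_NAME}")
    print(json.dumps(create_collection_body, indent=2))
    print(f"POST /collections/{COLLECTION_NAME}/points/query")
    print(json.dumps(query_body["params"]))
```

Measure recall before and after lowering `hnsw_ef`, since the speed gain comes directly at the cost of search quality.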
+ + +## What NOT to Do + +- Do not expect to optimize latency and throughput simultaneously on the same node +- Do not use few large segments for latency-sensitive workloads (each segment takes longer to search) +- Do not run at >90% RAM (cache eviction causes severe latency degradation that can last days) +- Do not ignore optimizer status during performance debugging +- Do not scale down RAM without load testing (cache eviction causes days-long latency incidents) diff --git a/plugins/qdrant/skills/qdrant-scaling/scaling-data-volume/README.md b/plugins/qdrant/skills/qdrant-scaling/scaling-data-volume/README.md new file mode 100644 index 00000000..e8b2e919 --- /dev/null +++ b/plugins/qdrant/skills/qdrant-scaling/scaling-data-volume/README.md @@ -0,0 +1,45 @@ +--- +name: qdrant-scaling-data-volume +description: "Guides Qdrant data volume scaling decisions. Use when someone asks 'data doesn't fit on one node', 'too much data', 'need more storage', 'vertical or horizontal scaling', 'tenant scaling', 'time window rotation', or 'data growth exceeds capacity'." +--- + +# Scaling Data Volume + +This document covers data volume scaling scenarios, +where the total size of the dataset exceeds the capacity of a single node. + +## Tenant Scaling + +If the use case is multi-tenant, meaning that each user only has access to a subset of the data, +and we never need to query across all the data, then we can use multi-tenancy patterns to scale. + +The recommended way is to use multi-tenant workloads with payload partitioning, per-tenant indexes, and tiered multitenancy. + +Learn more [Tenant Scaling](tenant-scaling/README.md) + +## Sliding Time Window + +Some use-cases are based on a sliding time window, where only the most recent data is relevant. +For example an index for social media posts, where only the last 6 months of data require fast search. 
+ +Learn more [Sliding Time Window](sliding-time-window/README.md) + +## Global Search + +Most general use-cases require global search across all data. +In these situations, we might need to fall back to vertical scaling, +and then horizontal scaling when we reach the limits of vertical scaling. + + +### Vertical Scaling + +When data doesn't fit in a single node, the first approach is to scale the node itself -- more RAM, better disk, quantization, mmap. +Exhaust vertical options before going horizontal, as horizontal scaling adds permanent operational complexity. + +Learn more [Vertical Scaling](vertical-scaling/README.md) + +### Horizontal Scaling + +When a single node can't hold the data even with quantization and mmap, distribute data across multiple nodes via sharding. + +Learn more [Horizontal Scaling](horizontal-scaling/README.md) diff --git a/plugins/qdrant/skills/qdrant-scaling/scaling-data-volume/horizontal-scaling/README.md b/plugins/qdrant/skills/qdrant-scaling/scaling-data-volume/horizontal-scaling/README.md new file mode 100644 index 00000000..4633a4c1 --- /dev/null +++ b/plugins/qdrant/skills/qdrant-scaling/scaling-data-volume/horizontal-scaling/README.md @@ -0,0 +1,47 @@ +--- +name: qdrant-horizontal-scaling +description: "Diagnoses and guides Qdrant horizontal scaling decisions. Use when someone asks 'vertical or horizontal?', 'how many nodes?', 'how many shards?', 'how to add nodes', 'resharding', 'data doesn't fit', or 'need more capacity'. Also use when data growth outpaces current deployment." +--- + +# What to Do When Qdrant Needs More Capacity + +Vertical first: simpler operations, no network overhead, good up to ~100M vectors per node depending on dimensions and quantization. Horizontal when: data exceeds single node capacity, need fault tolerance, need to isolate tenants, or IOPS-bound (more nodes = more independent IOPS). 
+ +## Most basic distributed configuration + +- 3 nodes, 3 shards with `replication_factor: 2` for zero-downtime scaling + +A minimum of 3 nodes is important for consensus and fault tolerance. With 3 nodes, you can lose 1 node without downtime. With 2 nodes, losing 1 node causes downtime for collection operations. +A replication factor of 2 means each shard has 1 additional replica, so you have 2 copies of the data. This allows for zero-downtime scaling and maintenance. With `replication_factor: 1`, zero downtime is not guaranteed even for point-level operations, and cluster maintenance requires downtime. + +## Choosing number of shards + +Shards are the unit of data distribution. +More shards allow more nodes and better distribution, but add overhead. Fewer shards reduce overhead but limit horizontal scaling. + +For a cluster of 3-6 nodes, the recommended shard count is 6-12. +This allows for 2-4 shards per node, which balances distribution and overhead. + +## Changing number of shards + +Use when: shard count isn't evenly divisible by node count, causing uneven distribution, or you need to rebalance. + +Resharding is a last resort, for cases where regular data distribution is not possible. +It is designed to be transparent for user operations: updates and searches should still work during resharding, with a small performance impact. + +However, the resharding operation itself is expensive and time-consuming, because it requires moving large amounts of data between nodes. + +- Available in Qdrant Cloud [Resharding](https://skills.qdrant.tech/md/documentation/distributed_deployment/?s=resharding) +- Resharding is not available for self-hosted deployments. + +Better alternatives: over-provision shards initially, or spin up a new cluster with the correct config and migrate data.
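A minimal sketch of the shard-planning advice above, under stated assumptions: the helper names `plan_shards` and `collection_body`, the 768-dimensional vector size, and the cosine distance are all illustrative and not part of the original guidance. It picks a shard count that is an exact multiple of the node count and builds a request body of the shape accepted by Qdrant's create-collection REST endpoint (`PUT /collections/{collection_name}`).

```python
import json


def plan_shards(node_count: int, shards_per_node: int = 2) -> int:
    """Return a shard count that is an exact multiple of the node count."""
    if node_count < 1 or shards_per_node < 1:
        raise ValueError("node_count and shards_per_node must be positive")
    return node_count * shards_per_node


def collection_body(node_count: int, vector_size: int = 768) -> dict:
    """Build a create-collection request body.

    The vector size and distance are placeholders; use the values that
    match your embedding model.
    """
    return {
        "vectors": {"size": vector_size, "distance": "Cosine"},
        # A multiple of the node count keeps shards evenly distributed.
        "shard_number": plan_shards(node_count),
        # Two copies of every shard, so one node can be lost or restarted.
        "replication_factor": 2,
    }


body = collection_body(node_count=3)
print(json.dumps(body, indent=2))
```

With 3 nodes and the default of 2 shards per node, this yields 6 shards, the low end of the 6-12 range recommended for small clusters. Over-provisioning here is cheaper than resharding later.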
+ + +## What NOT to Do + +- Do not jump to horizontal before exhausting vertical (adds complexity for no gain) +- Do not set `shard_number` that isn't a multiple of node count (uneven distribution) +- Do not use `replication_factor: 1` in production if you need fault tolerance +- Do not add nodes without rebalancing shards (use shard move API to redistribute) +- Do not scale down RAM without load testing (cache eviction causes days-long latency incidents) +- Do not hit the collection limit by using one collection per tenant (use payload partitioning) diff --git a/plugins/qdrant/skills/qdrant-scaling/scaling-data-volume/sliding-time-window/README.md b/plugins/qdrant/skills/qdrant-scaling/scaling-data-volume/sliding-time-window/README.md new file mode 100644 index 00000000..78ea8a12 --- /dev/null +++ b/plugins/qdrant/skills/qdrant-scaling/scaling-data-volume/sliding-time-window/README.md @@ -0,0 +1,68 @@ +--- +name: qdrant-sliding-time-window +description: "Guides sliding time window scaling in Qdrant. Use when someone asks 'only recent data matters', 'how to expire old vectors', 'time-based data rotation', 'delete old data efficiently', 'social media feed search', 'news search', 'log search with retention', or 'how to keep only last N months of data'." +--- + +# Scaling with a Sliding Time Window + +Use when only recent data needs fast search -- social media posts, news articles, support tickets, logs, job listings. Old data either becomes irrelevant or can tolerate slower access. + +Three strategies: shard rotation (recommended), collection rotation (when per-period config differs), and filter-and-delete (simplest, for continuous cleanup). + + +## Shard Rotation (Recommended) + +Use when: data has natural time boundaries (daily, weekly, monthly). Preferred because queries span all time periods in one request without application-level fan-out. [User-defined sharding](https://skills.qdrant.tech/md/documentation/distributed_deployment/?s=user-defined-sharding) + +1. 
Create a collection with user-defined sharding enabled +2. Create one shard key per time period (e.g., `2025-01`, `2025-02`, ..., `2025-06`) +3. Ingest data into the current period's shard key +4. When a new period starts, create a new shard key and redirect writes +5. Delete the oldest shard key outside the retention window + +- Deleting a shard key reclaims all resources instantly (no fragmentation, no optimizer overhead) +- Pre-create the next period's shard key before rotation to avoid write disruption +- Use `shard_key_selector` at query time to search only specific periods for efficiency +- Shard keys can be placed on specific nodes for hot/cold tiering + + +## Collection Rotation (Alias Swap) + +Use when: you need per-period collection configuration (e.g., different quantization or storage settings). [Collection aliases](https://skills.qdrant.tech/md/documentation/manage-data/collections/?s=collection-aliases) + +1. Create one collection per time period, point a write alias at the newest +2. Query across all active collections in parallel, merge results client-side +3. When a new period starts, create the new collection and swap the write alias [Switch collection](https://skills.qdrant.tech/md/documentation/manage-data/collections/?s=switch-collection) +4. Drop the oldest collection outside the window + +Trade-off vs shard rotation: allows per-collection config differences, but requires application-level fan-out and more operational overhead. + + +## Filter-and-Delete + +Use when: data arrives continuously without clear time boundaries, or you want the simplest setup. + +1. Store a `timestamp` payload on every point, create a payload index on it [Payload index](https://skills.qdrant.tech/md/documentation/manage-data/indexing/?s=payload-index) +2. Filter to the desired window at query time using `range` condition [Range filter](https://skills.qdrant.tech/md/documentation/search/filtering/?s=range) +3. 
Periodically delete expired points using delete-by-filter [Delete points](https://skills.qdrant.tech/md/documentation/manage-data/points/?s=delete-points) + +- Run cleanup during off-peak hours in batches (10k-50k points) to avoid optimizer locks +- Deletes are not free: tombstoned points degrade search until optimizer compacts segments +- Does not reclaim disk instantly (compaction is asynchronous) + + +## Hot/Cold Tiers + +Use when: recent data needs fast in-RAM search, older data should remain searchable at lower performance. + +- Shard rotation: place current shard key on fast-storage nodes, move older shard keys to cheaper nodes via shard placement. All queries still go through a single collection. +- Collection rotation: keep current collection in RAM (`always_ram: true`), move older collections to mmap/on-disk vectors. [Quantization](https://skills.qdrant.tech/md/documentation/manage-data/quantization/) + + +## What NOT to Do + +- Do not use filter-and-delete for high-volume time-series with millions of daily deletes (use rotation instead) +- Do not forget to index the timestamp field (range filters without an index cause full scans) +- Do not use collection rotation when shard rotation would suffice (unnecessary fan-out complexity) +- Do not drop a shard key or collection before verifying its period is fully outside the retention window +- Do not skip pre-creating the next period's shard key or collection (write failures during rotation are hard to recover) diff --git a/plugins/qdrant/skills/qdrant-scaling/scaling-data-volume/tenant-scaling/README.md b/plugins/qdrant/skills/qdrant-scaling/scaling-data-volume/tenant-scaling/README.md new file mode 100644 index 00000000..c26988d8 --- /dev/null +++ b/plugins/qdrant/skills/qdrant-scaling/scaling-data-volume/tenant-scaling/README.md @@ -0,0 +1,44 @@ +--- +name: qdrant-tenant-scaling +description: "Guides Qdrant multi-tenant scaling. 
Use when someone asks 'how to scale tenants', 'one collection per tenant?', 'tenant isolation', 'dedicated shards', or reports tenant performance issues. Also use when multi-tenant workloads outgrow shared infrastructure." +--- + +# What to Do When Scaling Multi-Tenant Qdrant + +Do not create one collection per tenant. Does not scale past a few hundred and wastes resources. One company hit the 1000 collection limit after a year of collection-per-repo and had to migrate to payload partitioning. Use a shared collection with a tenant key. + +- Understand multitenancy patterns [Multitenancy](https://skills.qdrant.tech/md/documentation/manage-data/multitenancy/) + +Here is a short summary of the patterns: + +## Number of Tenants is around 10k + +Use the default multitenancy strategy via payload filtering. + +Read about [Partition by payload](https://skills.qdrant.tech/md/documentation/manage-data/multitenancy/?s=partition-by-payload) and [Calibrate performance](https://skills.qdrant.tech/md/documentation/manage-data/multitenancy/?s=calibrate-performance) for best practices on indexing and query performance. + + +## Number of Tenants is around 100k and more + +At this scale, the cluster may consist of several peers. +To localize tenant data and improve performance, use [custom sharding](https://skills.qdrant.tech/md/documentation/distributed_deployment/?s=user-defined-sharding) to assign tenants to specific shards based on tenant ID hash. +This will localize tenant requests to specific nodes instead of broadcasting them to all nodes, improving performance and reducing load on each node. + +## If tenants are unevenly sized + +If some tenants are much larger than others, use [tiered multitenancy](https://skills.qdrant.tech/md/documentation/manage-data/multitenancy/?s=tiered-multitenancy) to promote large tenants to dedicated shards while keeping small tenants on shared shards. This optimizes resource allocation and performance for tenants of varying sizes. 
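As a rough sketch of the payload-partitioning pattern summarized above, the snippet below builds the two REST request bodies involved: one that creates a keyword payload index flagged with `is_tenant`, and one that restricts a query to a single tenant. The field name `tenant_id`, the tenant value `acme`, and the helper names are hypothetical. The first body is of the shape used by `PUT /collections/{collection_name}/index`; the second goes into the `filter` field of a search or query request.

```python
import json

TENANT_FIELD = "tenant_id"  # hypothetical payload key holding the tenant ID


def tenant_index_body(field: str = TENANT_FIELD) -> dict:
    """Request body for a keyword payload index marked as the tenant key."""
    return {
        "field_name": field,
        "field_schema": {"type": "keyword", "is_tenant": True},
    }


def tenant_filter(tenant: str, field: str = TENANT_FIELD) -> dict:
    """Filter clause that limits a query to one tenant's points."""
    return {"must": [{"key": field, "match": {"value": tenant}}]}


index_body = tenant_index_body()
query_fragment = {"filter": tenant_filter("acme")}

print(json.dumps(index_body))
print(json.dumps(query_fragment))
```

Every query in a shared collection should carry the tenant filter; a query without it searches all tenants' data.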
+ +## Need Strict Tenant Isolation + +Use when: legal/compliance requirements demand per-tenant encryption or strict isolation beyond what payload filtering provides. + +- Multiple collections may be necessary for per-tenant encryption keys +- Limit collection count and use payload filtering within each collection +- This is the exception, not the default. Only use when compliance requires it. + + +## What NOT to Do + +- Do not create one collection per tenant without compliance justification (does not scale past hundreds) +- Do not skip `is_tenant=true` on the tenant index (kills sequential read performance) +- Do not build global HNSW for multi-tenant collections (wasteful, use `payload_m` instead) diff --git a/plugins/qdrant/skills/qdrant-scaling/scaling-data-volume/vertical-scaling/README.md b/plugins/qdrant/skills/qdrant-scaling/scaling-data-volume/vertical-scaling/README.md new file mode 100644 index 00000000..1bebe898 --- /dev/null +++ b/plugins/qdrant/skills/qdrant-scaling/scaling-data-volume/vertical-scaling/README.md @@ -0,0 +1,69 @@ +--- +name: qdrant-vertical-scaling +description: "Guides Qdrant vertical scaling decisions. Use when someone asks 'how to scale up a node', 'need more RAM', 'upgrade node size', 'vertical scaling', 'resize cluster', 'scale up vs scale out', or when memory/CPU is insufficient on current nodes. Also use when someone wants to avoid the complexity of horizontal scaling." +--- + +# What to Do When Qdrant Needs to Scale Vertically + +Vertical scaling means increasing CPU, RAM, or disk on existing nodes rather than adding more nodes. This is the recommended first step before considering horizontal scaling. Vertical scaling is simpler, avoids distributed system complexity, and is reversible. 
+ +- Vertical scaling for Qdrant Cloud is done through the [Qdrant Cloud Console](https://cloud.qdrant.io/) +- For self-hosted deployments, resize the underlying VM or container resources + +## When to Scale Vertically + +Use when: current node resources (RAM, CPU, disk) are insufficient, but the workload doesn't yet require distribution. + +- RAM usage approaching 80% of available memory (OS page cache eviction starts, severe performance degradation) +- CPU saturation during query serving or indexing +- Disk space running low for on-disk vectors and payloads +- A single node can handle up to ~100M vectors depending on dimensions and quantization +- For non-production workloads, which are tolerant to single-point-of-failure and don't require high availability + + +## How to Scale Vertically in Qdrant Cloud + +Vertical scaling is managed through the Qdrant Cloud Console. + +- Log into [Qdrant Cloud Console](https://cloud.qdrant.io/) or use [CLI tool](https://github.com/qdrant/qcloud-cli) +- Select the cluster to resize +- Choose a larger node configuration (more RAM, CPU, or both) +- The upgrade process involves a rolling restart with no downtime if replication is configured +- Ensure `replication_factor: 2` or higher before resizing to maintain availability during the rolling restart + +Important: Scaling up is straightforward. Scaling down requires care -- if the working set no longer fits in RAM after downsizing, performance will degrade severely due to cache eviction. Always load test before scaling down. + + +## RAM Sizing Guidelines + +RAM is the most critical resource for Qdrant performance. Use these guidelines to right-size. 
+ +- Exact estimation of RAM usage is difficult; use this simple approximate formula: `num_vectors * dimensions * 4 bytes * 1.5` for full-precision vectors in RAM +- With scalar quantization: divide by 4 (INT8 reduces each float32 to 1 byte) [Quantization](https://skills.qdrant.tech/md/documentation/manage-data/quantization/) +- With binary quantization: divide by 32 [Binary quantization](https://skills.qdrant.tech/md/documentation/manage-data/quantization/?s=binary-quantization) +- Add overhead for HNSW index (~20-30% of vector data), payload indexes, and WAL +- Reserve 20% headroom for optimizer operations and OS cache +- Monitor actual usage via Grafana/Prometheus before and after resizing [Monitoring](../../../qdrant-monitoring/SKILL.md) + + +## When Vertical Scaling Is No Longer Enough + +Recognize these signals that it's time to go horizontal: + +- Data volume exceeds what a single node can hold even with quantization and mmap +- IOPS are saturated (more nodes = more independent disk I/O) +- Need fault tolerance (requires replication across nodes) +- Need tenant isolation via dedicated shards +- Single-node CPU is maxed and query latency is unacceptable +- Next vertical scaling step is the largest available node size. You might need to be able to temporarily scale up to the larger node size to do batch operations or recovery. If you are already at the largest node size, you won't be able to do that. + +When you hit these limits, see [Horizontal Scaling](../horizontal-scaling/README.md) for guidance on sharding and node planning. 
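The approximate RAM formula from the sizing guidelines can be written as a small calculator. This is a sketch only: the function name `estimate_vector_ram_bytes` and the example workload of 1M vectors with 768 dimensions are assumptions, and the result covers vector data as given by the formula, before adding the HNSW, payload index, WAL, and headroom allowances listed in the guidelines.

```python
BYTES_PER_FLOAT32 = 4
OVERHEAD_FACTOR = 1.5  # multiplier from the approximate formula

# Divisors from the guidelines: scalar (INT8) = 4, binary = 32.
QUANTIZATION_DIVISOR = {"none": 1, "scalar": 4, "binary": 32}


def estimate_vector_ram_bytes(
    num_vectors: int, dimensions: int, quantization: str = "none"
) -> float:
    """Approximate RAM for vectors: num_vectors * dimensions * 4 * 1.5 / divisor."""
    if quantization not in QUANTIZATION_DIVISOR:
        raise ValueError(f"unknown quantization mode: {quantization!r}")
    full_precision = num_vectors * dimensions * BYTES_PER_FLOAT32 * OVERHEAD_FACTOR
    return full_precision / QUANTIZATION_DIVISOR[quantization]


# Hypothetical workload: 1M vectors of 768 dimensions.
for mode in ("none", "scalar", "binary"):
    gigabytes = estimate_vector_ram_bytes(1_000_000, 768, mode) / 1e9
    print(f"{mode:>6}: {gigabytes:.2f} GB")
```

For this workload the full-precision estimate is about 4.6 GB, dropping to roughly 1.15 GB with scalar quantization and 0.14 GB with binary quantization. Treat these as starting points and confirm against monitored usage.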
+ + +## What NOT to Do + +- Do not scale down RAM without load testing first (cache eviction = severe latency degradation that can last days) +- Do not ignore the 80% RAM threshold (performance cliff, not gradual degradation) +- Do not skip replication before resizing in Cloud (rolling restart without replicas = downtime) +- Do not jump to horizontal scaling before exhausting vertical options (adds permanent operational complexity) +- Do not assume more CPU always helps (IOPS-bound workloads won't improve with more cores) \ No newline at end of file diff --git a/plugins/qdrant/skills/qdrant-scaling/scaling-qps/README.md b/plugins/qdrant/skills/qdrant-scaling/scaling-qps/README.md new file mode 100644 index 00000000..147ed2a9 --- /dev/null +++ b/plugins/qdrant/skills/qdrant-scaling/scaling-qps/README.md @@ -0,0 +1,56 @@ +--- +name: qdrant-scaling-qps +description: "Guides Qdrant query throughput (QPS) scaling. Use when someone asks 'how to increase QPS', 'need more throughput', 'queries per second too low', 'batch search', 'read replicas', or 'how to handle more concurrent queries'." +--- + +# Scaling for Query Throughput (QPS) + +Throughput scaling means handling more parallel queries per second. +This is different from latency - throughput and latency are opposite tuning directions and cannot be optimized simultaneously on the same node. + +High throughput favors fewer, larger segments so each query touches less overhead. 
+ + +## Performance Tuning for Higher RPS + +- Use fewer, larger segments (`default_segment_number: 2`) [Maximizing throughput](https://skills.qdrant.tech/md/documentation/ops-optimization/optimize/?s=maximizing-throughput) +- Enable quantization with `always_ram=true` to reduce disk IO [Quantization](https://skills.qdrant.tech/md/documentation/manage-data/quantization/) +- Use batch search API to amortize overhead [Batch search](https://skills.qdrant.tech/md/documentation/search/search/?s=batch-search-api) + +## Minimize impact of Update Workloads + +- Configure update throughput control (v1.17+) to prevent unoptimized searches degrading reads [Low latency search](https://skills.qdrant.tech/md/documentation/search/low-latency-search/) +- Set `optimizer_cpu_budget` to limit indexing CPUs (e.g. `2` on an 8-CPU node reserves 6 for queries) +- Configure delayed read fan-out (v1.17+) for tail latency [Delayed fan-outs](https://skills.qdrant.tech/md/documentation/search/low-latency-search/?s=use-delayed-fan-outs) + + + +## Horizontal Scaling for Throughput + +If a single node is saturated on CPU after applying the tuning above, scale horizontally with read replicas. + +- Shard replicas serve queries from replicated shards, distributing read load across nodes +- Each replica adds independent query capacity without re-sharding +- Use `replication_factor: 2+` and route reads to replicas [Distributed deployment](https://skills.qdrant.tech/md/documentation/distributed_deployment/?s=replication) + +See also [Horizontal Scaling](../scaling-data-volume/horizontal-scaling/README.md) for general horizontal scaling guidance. + + +## Disk I/O Bottlenecks + +If it is not possible to keep all vectors in RAM, disk I/O can become the bottleneck for throughput. +In this case: + +- Upgrade to provisioned IOPS or local NVMe first. 
See impact of disk performance to vector search in [Disk performance article](https://qdrant.tech/articles/memory-consumption/) +- Use `io_uring` on Linux (kernel 5.11+) [io_uring article](https://qdrant.tech/articles/io_uring/) +- In case of quantized vectors, prefer global rescoring over per-segment rescoring to reduce disk reads. Example in the [tutorial](https://skills.qdrant.tech/md/documentation/tutorials-operations/large-scale-search/?s=search-query) +- Configure higher number of search threads to parallelize disk reads. Default is `cpu_count - 1`, which is optimal for RAM-based search but may be too low for disk-based search. See [configuration reference](https://skills.qdrant.tech/md/documentation/ops-configuration/configuration/?s=configuration-options) +- If still saturated, scale out horizontally (each node adds independent IOPS) + + +## What NOT to Do + +- Do not expect to optimize throughput and latency simultaneously on the same node +- Do not use many small segments for throughput workloads (increases per-query overhead) +- Do not scale horizontally when IOPS-bound without also upgrading disk tier +- Do not run at >90% RAM (OS cache eviction = severe performance degradation) diff --git a/plugins/qdrant/skills/qdrant-scaling/scaling-query-volume/README.md b/plugins/qdrant/skills/qdrant-scaling/scaling-query-volume/README.md new file mode 100644 index 00000000..a47b999f --- /dev/null +++ b/plugins/qdrant/skills/qdrant-scaling/scaling-query-volume/README.md @@ -0,0 +1,23 @@ +--- +name: qdrant-scaling-query-volume +description: "Guides Qdrant query volume scaling. Use when someone asks 'query returns too many results', 'scroll performance', 'large limit values', 'paginating search results', 'fetching many vectors', or 'high cardinality results'." +--- + +# Scaling for Query Volume + +Problem: When a query has a large limit (e.g. 1000) and there are multiple shards (e.g. 
10), naively each shard must return the full 1000 results -- totaling 10,000 scored points transferred and merged. This is wasteful since data is randomly distributed across auto-shards. + +## Core idea + +Instead of asking every shard for the full limit, ask each shard for a smaller limit computed via Poisson distribution statistics, then merge. This is safe because auto-sharding guarantees random, independent data distribution. + +## When it activates + +- More than 1 shard +- Auto-sharding is in use (all queried shards share the same shard key) +- The request's limit + offset >= SHARD_QUERY_SUBSAMPLING_LIMIT (128) +- The query is not exact + +## Key tradeoff + + The strategy trades a small probability of slightly incomplete results for a large reduction in inter-shard data transfer, especially for high-limit queries across many shards. The 1.2x safety factor and the 99.9% Poisson threshold keep the error rate very low -- comparable to inaccuracies already introduced by approximate vector indices like HNSW. \ No newline at end of file diff --git a/plugins/qdrant/skills/qdrant-search-quality/SKILL.md b/plugins/qdrant/skills/qdrant-search-quality/SKILL.md new file mode 100644 index 00000000..71d784de --- /dev/null +++ b/plugins/qdrant/skills/qdrant-search-quality/SKILL.md @@ -0,0 +1,20 @@ +--- +name: qdrant-search-quality +description: "Diagnoses and improves Qdrant search relevance. Use when someone reports 'search results are bad', 'wrong results', 'low precision', 'low recall', 'irrelevant matches', 'missing expected results', or asks 'how to improve search quality?', 'which embedding model?', 'should I use hybrid search?', 'should I use reranking?', 'how to measure retrieval quality?', 'build a golden set', 'ground truth dataset', or 'how to score recall@k?'. Also use when search quality degrades after quantization, model change, or data growth." 
+--- + +# Qdrant Search Quality + +First determine whether the problem is the embedding model, Qdrant configuration, or the query strategy. Most quality issues come from the model or data, not from Qdrant itself. If search quality is low, inspect how chunks are being passed to Qdrant before tuning any parameters. Splitting mid-sentence can drop quality 30-40%. + +- Start by testing with exact search to isolate the problem [Search API](https://skills.qdrant.tech/md/documentation/search/search/?s=search-api) + + +## Diagnosis and Tuning + +Isolate the source of quality issues, establish labeled baselines to measure recall and relevance, tune HNSW parameters, and choose the right embedding model. [Diagnosis and Tuning](diagnosis/README.md) + + +## Search Strategies + +Hybrid search, reranking, relevance feedback, and exploration APIs for improving result quality. [Search Strategies](search-strategies/README.md) diff --git a/plugins/qdrant/skills/qdrant-search-quality/diagnosis/README.md b/plugins/qdrant/skills/qdrant-search-quality/diagnosis/README.md new file mode 100644 index 00000000..9de6a6fa --- /dev/null +++ b/plugins/qdrant/skills/qdrant-search-quality/diagnosis/README.md @@ -0,0 +1,65 @@ +--- +name: qdrant-search-quality-diagnosis +description: "Diagnoses Qdrant search quality issues. Use when someone reports 'results are bad', 'wrong results', 'not relevant results', 'missing matches', 'recall is low', 'approximate search worse than exact', 'which embedding model', 'quality dropped after quantization', 'how to measure retrieval quality', 'build a golden set', 'ground truth dataset', or 'how to score recall@k'. Also use when search quality degrades without obvious changes." +--- + +# How to Diagnose Bad Search Quality + +Before tuning, establish baselines. Use exact KNN as ground truth, compare against approximate HNSW. Target >95% recall@K for production. 
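The baseline comparison above reduces to a set overlap. The sketch below computes `recall@k` for one query and averages it over several; the function names and the point IDs are made up for illustration. In practice the two ID lists would come from running the same query once with approximate search and once with exact search.

```python
from typing import Hashable, Sequence


def recall_at_k(
    approx_ids: Sequence[Hashable], exact_ids: Sequence[Hashable], k: int
) -> float:
    """Fraction of the exact top-k that also appears in the approximate top-k."""
    if k <= 0:
        raise ValueError("k must be positive")
    truth = set(exact_ids[:k])
    if not truth:
        return 0.0
    hits = len(truth & set(approx_ids[:k]))
    return hits / len(truth)


def mean_recall_at_k(pairs, k: int) -> float:
    """Average recall@k over (approx_ids, exact_ids) pairs, one pair per query."""
    scores = [recall_at_k(approx, exact, k) for approx, exact in pairs]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical results for two queries: (approximate IDs, exact IDs).
results = [
    ([1, 2, 3, 9, 5], [1, 2, 3, 4, 5]),  # one miss out of five
    ([7, 8, 6, 10, 11], [6, 7, 8, 10, 11]),  # same set, different order
]
print(recall_at_k(*results[0], k=5))
print(mean_recall_at_k(results, k=5))
```

Order within the top-k does not affect recall, only membership does. A mean below the 95% target suggests HNSW or quantization tuning is needed, provided the exact results themselves are good.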
+ +## Don't Know What's Wrong Yet + +Use when: results are irrelevant or missing expected matches and you need to isolate the cause. + +- For a no-code quick check, use the Web UI's ANN Recall tab to compare approximate vs exact `recall@k` [Web UI ANN Recall](https://skills.qdrant.tech/md/documentation/tutorials-search-engineering/ann-recall/?s=measure-ann-recall-with-the-web-ui) +- For the same comparison in code (CI gating, regression tests), run each query twice -- once approximate, once with `exact=true` -- and compute `recall@k` from the overlap [ANN recall in CI](https://skills.qdrant.tech/md/documentation/tutorials-search-engineering/ann-recall/?s=automate-in-ci-with-python) +- Exact search bad = model or search pipeline problem. Exact good, approximate bad = tune HNSW. +- Check if quantization degrades quality (compare with and without) +- Check if filters are too restrictive (then you might need to use ACORN) +- If duplicate results from chunked documents, use Grouping API to deduplicate [Grouping](https://skills.qdrant.tech/md/documentation/search/search/?s=grouping-api) + +Payload filtering and sparse vector search are different things. Metadata (dates, categories, tags) goes in payload for filtering. Text content goes in sparse vectors for search. + +## Approximate Search Worse Than Exact + +Use when: exact search returns good results but HNSW approximation misses them. 
+ +- Increase `hnsw_ef` at query time [Search params](https://skills.qdrant.tech/md/documentation/ops-optimization/optimize/?s=fine-tuning-search-parameters) +- Increase `ef_construct` (200+ for high quality) [HNSW config](https://skills.qdrant.tech/md/documentation/manage-data/indexing/?s=vector-index) +- Increase `m` (16 default, 32 for high recall) [HNSW config](https://skills.qdrant.tech/md/documentation/manage-data/indexing/?s=vector-index) +- Enable oversampling + rescore with quantization [Search with quantization](https://skills.qdrant.tech/md/documentation/manage-data/quantization/?s=searching-with-quantization) +- ACORN for filtered queries (v1.16+) [ACORN](https://skills.qdrant.tech/md/documentation/search/search/?s=acorn-search-algorithm) + +Binary quantization requires rescore. Without it, quality loss is severe. Use oversampling (3-5x minimum for binary) to recover recall. Always test quantization impact on your data before production. [Quantization](https://skills.qdrant.tech/md/documentation/manage-data/quantization/) + +## Wrong Embedding Model + +Use when: exact search also returns bad results. + +Check [Qdrant team recommendations on how to choose an embedding model](https://skills.qdrant.tech/md/articles/how-to-choose-an-embedding-model/). + +Test top 3 MTEB models on 100-1000 sample queries [Hosted Qdrant inference](https://skills.qdrant.tech/md/documentation/inference/). Score them against a labeled set to compare apples to apples [Measuring Retrieval Relevance](https://skills.qdrant.tech/md/documentation/improve-search/retrieval-relevance/). + +## Unoptimized Search Pipeline + +Use when: exact search also returns bad results and model choice is confirmed by user. + +Optimize search according to advanced search-strategies skill. + +## Need a Labeled Baseline to Score Recall, MRR, or NDCG + +Use when: user has no golden set, asks "how do I know if my search is good?", or needs to gate releases on a retrieval metric. 
+ +- Build a labeled query set -- human, log-based, or LLM-synthetic -- and score retrieval with `ranx` [Measuring Retrieval Relevance](https://skills.qdrant.tech/md/documentation/improve-search/retrieval-relevance/) +- Pick the metric by usage: `Recall@k` for RAG, `MRR`/`Hits@1` for single-answer, `NDCG@k` for re-ranking [Choosing the metric](https://skills.qdrant.tech/md/documentation/improve-search/retrieval-relevance/?s=choosing-the-right-metric) +- For full RAG pipelines, also score generation with Ragas and use the retrieval-vs-generation 2x2 to isolate regressions [Pipeline Output Quality](https://skills.qdrant.tech/md/documentation/improve-search/pipeline-output-quality/) +- Gate CI on a per-metric threshold to catch regressions from embedding-model swaps, prompt changes, or index config changes + +## What NOT to Do + +- Tune Qdrant before verifying the model is right for the task (most quality issues are model issues) +- Use binary quantization without rescore (severe quality loss) +- Set `hnsw_ef` lower than results requested (guaranteed bad recall) +- Skip payload indexes on filtered fields then blame quality (HNSW can't traverse filtered-out nodes, and filterable HNSW is built only if payload indexes were set up prior) +- Deploy without baseline recall or other search relevance metrics (no way to measure regressions) +- Confuse payload filtering with sparse vector search (different things, different config) diff --git a/plugins/qdrant/skills/qdrant-search-quality/search-strategies/README.md b/plugins/qdrant/skills/qdrant-search-quality/search-strategies/README.md new file mode 100644 index 00000000..033d86ce --- /dev/null +++ b/plugins/qdrant/skills/qdrant-search-quality/search-strategies/README.md @@ -0,0 +1,55 @@ +--- +name: qdrant-search-strategies +description: "Guides Qdrant search strategy selection. 
Use when someone asks 'should I use hybrid search?', 'how to rerank?', 'results are not relevant', 'I don't get needed results from my dataset but they're there', 'retrieval quality is not good enough', 'results too similar', 'need diversity', 'MMR', 'relevance feedback', 'recommendation API', 'discovery API', or 'missing keyword matches'" +--- + +# How to Improve Search Results with Advanced Strategies + +These strategies complement basic vector search. Use them after confirming the embedding model is fitting the task and HNSW config is correct. If exact search returns bad results, verify the selection of the embedding model (retriever) first. +If the user wants to use a weaker embedding model because it is small, fast, and cheap, use reranking or relevance feedback to improve search quality. + +## Missing Keyword Matches or Need to Combine Multiple Search Signals + +Use when: pure vector search misses keyword/domain term matches, or the use case benefits from combining searches on multiple representations (including languages and modalities) of the same item. + +See how to use [hybrid search](hybrid-search/README.md) + +## Right Documents Found But Not in the Top Results + +Use when: good recall but poor precision (right docs in top-100, not top-10). + +- See how to use [Multistage queries](https://skills.qdrant.tech/md/documentation/search/hybrid-queries/?s=multi-stage-queries), for example with late interaction rerankers through [Multivectors](https://skills.qdrant.tech/md/documentation/manage-data/vectors/?s=multivectors). 
+- Cross-encoder rerankers via FastEmbed [Rerankers](https://skills.qdrant.tech/md/documentation/fastembed/fastembed-rerankers/) + +## Dense Retriever Misses Relevant Items or Reranking Is Too Costly + +Use when: dense retriever misses relevant items you know exist in the collection; relevant documents lie outside the initial ANN retrieval pool; reranking a large candidate pool is too slow or expensive; using a small/cheap embedding model but need quality close to a larger model; or want to improve top-1/3 precision without the full cost of reranking. + +See [Relevance Feedback in Qdrant](relevance-feedback/README.md) + +## Results Too Similar + +Use when: top results are redundant, near-duplicates, or lack diversity. Common in dense content domains (academic papers, product catalogs). + +- Use MMR (v1.15+) as a query parameter with `diversity` to balance relevance and diversity [MMR](https://skills.qdrant.tech/md/documentation/search/search-relevance/?s=maximal-marginal-relevance-mmr) +- Start with `diversity=0.5`, lower for more precision, higher for more exploration +- MMR is slower than standard search. Only use when redundancy is an actual problem. + +## Want to improve search results based on examples (positive and negative) + +Use when: you can provide positive and negative example points to steer search closer to positive and further from negative. 
+ +- Recommendation API: positive/negative examples to recommend fitting vectors [Recommendation API](https://skills.qdrant.tech/md/documentation/search/explore/?s=recommendation-api) + - Best score strategy: better for diverse examples, supports negative-only [Best score](https://skills.qdrant.tech/md/documentation/search/explore/?s=best-score-strategy) +- Discovery API: context pairs (positive/negative) to constrain search regions without a request target [Discovery](https://skills.qdrant.tech/md/documentation/search/explore/?s=discovery-api) + +## Have Business Logic Behind Results Relevance + +Use when: results should be additionally ranked according to some business logic based on data, like recency or distance. + +Check how to set up in [Score Boosting docs](https://skills.qdrant.tech/md/documentation/search/search-relevance/?s=score-boosting) + +## What NOT to Do + +- Use hybrid search before verifying pure vector search quality (adds complexity, may mask model issues) +- Skip evaluation when adding relevance feedback -- score the end-to-end pipeline to confirm it actually helps [Pipeline Output Quality](https://skills.qdrant.tech/md/documentation/improve-search/pipeline-output-quality/) diff --git a/plugins/qdrant/skills/qdrant-search-quality/search-strategies/hybrid-search/README.md b/plugins/qdrant/skills/qdrant-search-quality/search-strategies/hybrid-search/README.md new file mode 100644 index 00000000..ad97697f --- /dev/null +++ b/plugins/qdrant/skills/qdrant-search-quality/search-strategies/hybrid-search/README.md @@ -0,0 +1,35 @@ +--- +name: qdrant-hybrid-search +description: "Explains hybrid search in Qdrant. Use when someone asks 'how do I setup hybrid search?', 'how to combine keyword and semantic search?', 'sparse plus dense vectors?', 'missing keyword matches', 'how to combine results from multiple searches?' 
and 'combining multiple representations'" +--- + +# Hybrid Search in Qdrant + +Hybrid search means running two or more different searches in parallel and combining their results into one. + +In Qdrant this is powered by the Query API via `prefetch`: each `prefetch` runs exactly one type of search independently, and the outer `query` combines results from parallel prefetches. +Prefetches can be nested and searches can be multi-stage, with the whole pipeline running in one request through the Query API. See [Universal Query API](https://skills.qdrant.tech/md/course/essentials/day-5/universal-query-api/) for examples. + +Identify the user's problem and pick building blocks: +- What can go into one prefetch, i.e. what powers one search, in [Search Types](search-types/README.md) +- How to combine results of these searches (RRF, DBSF, FormulaQuery, reranking) in [Combining Searches](combining-searches/README.md) + +Based on what you've picked, test your approach: +1. Configure a Qdrant collection with [named vectors](https://skills.qdrant.tech/md/documentation/manage-data/vectors/?s=named-vectors), where each named vector usually corresponds to one representation (different embedding models or different vector types) of a data point. +2. Construct a hybrid search request with the Query API from your building blocks. You can search each type of vector independently with `prefetch` + `using`, as shown in the examples in the [Hybrid Queries documentation](https://skills.qdrant.tech/md/documentation/search/hybrid-queries). +3. Evaluate hybrid search quality on real user data and present the user with improvements and tradeoffs (speed/resources). + +## How Isolated Are Parallel Searches? + +Use when: different tenants share one collection and you need to understand hybrid search isolation guarantees.
+ +If user wants to isolate/share hybrid search pipelines between tenants, consider that: + +- Indexes (sparse, payload and dense) and [IDF modifier](https://skills.qdrant.tech/md/documentation/manage-data/indexing/?s=idf-modifier) for sparse vectors are computed independently per shard, not per tenant. +- Prefetch runs independently per shard to retrieve #limit results, so for collection-level prefetches if collection has several shards, Qdrant will always prefetch under the hood #limit * #shard results. Final results are merged based on scores. +- In nested prefetches (deeper than 1 level), methods described in "Combining Searches" might be done on a shard level first, then per-shards results once again will be merged based on scores. + +## What NOT to Do + +- Choose a hybrid search pattern based on "vibes" without any [hybrid search quality evaluation](https://skills.qdrant.tech/md/articles/hybrid-search/?s=how-effective-is-your-search-system) in-place. +- Create too many named vectors without a need. An unfilled named vector might take as much resources as a filled one. \ No newline at end of file diff --git a/plugins/qdrant/skills/qdrant-search-quality/search-strategies/hybrid-search/combining-searches/README.md b/plugins/qdrant/skills/qdrant-search-quality/search-strategies/hybrid-search/combining-searches/README.md new file mode 100644 index 00000000..a336df92 --- /dev/null +++ b/plugins/qdrant/skills/qdrant-search-quality/search-strategies/hybrid-search/combining-searches/README.md @@ -0,0 +1,49 @@ +--- +name: qdrant-hybrid-search-combining +description: "Use when someone asks 'RRF or DBSF?', 'how to combine sparse and dense', 'how to combine scores from multiple searches?', 'custom fusion', or 'fusion is not producing good results'" +--- + +# Combining Prefetch Results + +The outer query fuses ranked candidate lists from all parallel prefetches into one ranked list of results. 
Fusion methods differ in whether they use ranks, scores, or the candidates' vector representations directly (their similarity to the outer query), and in whether the final score incorporates payload metadata. All methods support flat (one fusion step) and nested (multi-stage) prefetch structures. + +## Scores Are Not Comparable Across Prefetches & You Want Some Easy Baseline + +Use when: searches produce scores on different scales, like BM25 and cosine on dense embeddings. + +### RRF +- [RRF](https://skills.qdrant.tech/md/documentation/search/hybrid-queries/?s=reciprocal-rank-fusion-rrf) (Reciprocal Rank Fusion) -- rank-based, ignores score magnitudes, a decent default to start with. +- Tune `k` to [control rank sensitivity in RRF fusion](https://skills.qdrant.tech/md/documentation/search/hybrid-queries/?s=setting-rrf-constant-k). +- Add per-prefetch weights when one search should dominate, using [Weighted RRF](https://skills.qdrant.tech/md/documentation/search/hybrid-queries/?s=weighted-rrf). Weights should be customized per collection and per the retrievers' score distributions! + +### DBSF +- [DBSF](https://skills.qdrant.tech/md/documentation/search/hybrid-queries/?s=distribution-based-score-fusion-dbsf) (Distribution-Based Score Fusion) -- normalizes the score distribution of each prefetch before fusing them; instead of min-max, it uses the mean ± 3 standard deviations of the prefetched list of scores as bounds. Avoid relying on the resulting absolute scores: DBSF scores are normalized per prefetch (that is, per retrieved list of search results) and might not be comparable across queries. + +## Need Custom Fusion + +Use when: recency, popularity or other payload values should affect the merged ranking alongside candidate scores, or you need a custom fusion. + +[With formula query](https://skills.qdrant.tech/md/documentation/search/search-relevance/?s=score-boosting), access `score` of each prefetch and, if desired, payload field values.
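To make the difference between rank-based and score-based fusion concrete, here is a minimal client-side sketch in Python. It is an illustration only, not Qdrant's server-side implementation: `rrf` uses the textbook formula with 1-based ranks and `k=60`, `dbsf_normalize` assumes the population standard deviation, and the point IDs and scores are made up. Qdrant's exact constants, rank offset, and handling of edge cases may differ.

```python
from collections import defaultdict
from statistics import mean, pstdev


def rrf(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Textbook Reciprocal Rank Fusion: sum of 1 / (k + rank), ranks start at 1."""
    fused: dict[str, float] = defaultdict(float)
    for ids in ranked_lists:
        for rank, point_id in enumerate(ids, start=1):
            fused[point_id] += 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


def dbsf_normalize(scores: dict[str, float]) -> dict[str, float]:
    """Rescale one prefetch's scores using mean - 3*sigma and mean + 3*sigma as bounds."""
    values = list(scores.values())
    mu, sigma = mean(values), pstdev(values)  # population std is an assumption
    if sigma == 0:
        return {point_id: 0.5 for point_id in scores}
    low = mu - 3 * sigma
    return {point_id: (s - low) / (6 * sigma) for point_id, s in scores.items()}


def dbsf(score_lists: list[dict[str, float]]) -> list[tuple[str, float]]:
    """Sum the normalized scores of each point across all prefetches."""
    fused: dict[str, float] = defaultdict(float)
    for scores in score_lists:
        for point_id, value in dbsf_normalize(scores).items():
            fused[point_id] += value
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical prefetch results: cosine similarities and BM25-like scores.
dense_hits = {"a": 0.9, "b": 0.8, "c": 0.7}
sparse_hits = {"b": 12.0, "d": 9.0, "a": 6.0}

fused_rrf = rrf([list(dense_hits), list(sparse_hits)])
fused_dbsf = dbsf([dense_hits, sparse_hits])
print([point_id for point_id, _ in fused_rrf])
print([point_id for point_id, _ in fused_dbsf])
```

RRF only sees positions, so the raw BM25 values never matter. DBSF keeps the relative gaps inside each list, which is why its output depends on the score distribution of each prefetch.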
+ +If you want to implement custom fusion on `score` of each prefetch: +- Use decay or any other available expressions for normalizing score distributions before fusing them. +- Parameters of these expressions should be based on the collection & retriever score distributions (for example, adjusting these parameters on a subsample of real queries). +- Formula query is unable to provide ranks for custom fusions + +## Need Good Ranking of Fused Candidates and Ready To Spend More Resources + +Use when: you want to use similarity between query and candidates' vector representations as the prefetches combiner and simultaneously ranker. +More resource heavy than score/rank based fusions, but might be necessary due to use case requirements or need in a high top-K precision of results (when parallel prefetches have overall a good recall of retrieved candidates). + +You can use any type of vector as an outer query over the prefetches, to perform the fusion on the server-side in one QueryAPI request: sparse, dense, multivector. For that, same type of vector representations for documents need to be stored as named vectors per point. + +Instead of using client-side fusion through cross-encoders, a popular option is Late interaction models-based fusion, through reranking on multivectors (e.g. ColBERT for text, ColPali and ColQwen for images). +- Most precise but highest compute/resource usage. +- Configure multivectors used for fusion through reranking with HNSW disabled like in [Hybrid Search with Reranking tutorial](https://skills.qdrant.tech/md/documentation/tutorials-basics/reranking-hybrid-search/). + +## What NOT to Do + +- Use linear weighted fusion on incomparable score ranges. [Why not](https://skills.qdrant.tech/md/articles/hybrid-search/?s=why-not-a-linear-combination). +- Use "vibe" defined weights in weighted RRF. Weights should be fine-tuned per dataset and retrieval pipelines. +- Pick any fusion type without comparative experiments. 
+- Use late interaction multivectors for fusion without evaluating cheaper analogues, for example, MUVERA. More in [multi-vector Qdrant search course](https://skills.qdrant.tech/md/course/multi-vector-search/) diff --git a/plugins/qdrant/skills/qdrant-search-quality/search-strategies/hybrid-search/search-types/README.md b/plugins/qdrant/skills/qdrant-search-quality/search-strategies/hybrid-search/search-types/README.md new file mode 100644 index 00000000..afa93a4f --- /dev/null +++ b/plugins/qdrant/skills/qdrant-search-quality/search-strategies/hybrid-search/search-types/README.md @@ -0,0 +1,62 @@ +--- +name: qdrant-hybrid-search-prefetches +description: "Use when someone asks 'how to combine lexical and semantic retrieval', 'dense and sparse in one search?', 'how to combine multiple fields for retrieval?', 'payloads or sparse vectors for lexical?', 'which sparse embedding model to use?', 'BM25 vs SPLADE?'" +--- + +# Different Searches in One Query API Request + +Each `prefetch` runs exactly one search for one query. + +Determine whether the user wants to run several parallel searches on: +1. The same vector representations but different queries or filters. +2. Different vector representations but the same raw query. + +If the first, help the user design the logic for constructing queries and/or filters on the application side and then check [Combining Searches](../combining-searches/README.md). Don't forget to create [indices on filterable payload fields](https://skills.qdrant.tech/md/documentation/manage-data/indexing/?s=payload-index) immediately after collection creation, before HNSW is built, so that a filterable HNSW index can be constructed. + +If the second, use [named vectors](https://skills.qdrant.tech/md/documentation/manage-data/vectors/?s=named-vectors), which allow storing multiple vector types per point in one collection. Beware that named vectors can currently be configured only at collection creation. To choose vectors, check the following recommendations.
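As an illustration of the second case (different vector representations, same raw query), the sketch below assembles a Query API request body with two parallel prefetches fused by RRF. The vector names `dense` and `sparse`, the helper `build_hybrid_query`, and the sample values are hypothetical; the body shape follows the examples in the Hybrid Queries documentation and should be checked against the docs for your Qdrant version.

```python
import json
from typing import Any


def build_hybrid_query(
    dense_query: list[float],
    sparse_indices: list[int],
    sparse_values: list[float],
    prefetch_limit: int = 20,
    limit: int = 10,
) -> dict[str, Any]:
    """Build a request body: one prefetch per named vector, fused by the outer query."""
    if len(sparse_indices) != len(sparse_values):
        raise ValueError("sparse indices and values must have equal length")
    return {
        "prefetch": [
            # Each prefetch runs exactly one search on one named vector.
            {"query": dense_query, "using": "dense", "limit": prefetch_limit},
            {
                "query": {"indices": sparse_indices, "values": sparse_values},
                "using": "sparse",
                "limit": prefetch_limit,
            },
        ],
        # The outer query combines the candidates returned by both prefetches.
        "query": {"fusion": "rrf"},
        "limit": limit,
    }


body = build_hybrid_query([0.1, 0.2, 0.3], [7, 42], [0.5, 1.2])
print(json.dumps(body, indent=2))
```

The resulting dictionary would be sent as JSON to the collection's points query endpoint, or translated into the equivalent SDK call.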
+ +## Missed Keyword Matches + +Use when: pure vector search misses exact term or keyword matches and you need lexical retrieval alongside semantic search. + +Most likely you need a sparse vector for exact text search alongside the dense one. Qdrant uses sparse vectors for lexical searches, as [payload filtering doesn't provide any ranking score](https://skills.qdrant.tech/md/documentation/search/text-search/?s=filtering-versus-querying). + +### Choose a Sparse Vector for Text +- BM25 statistical representations, built into Qdrant core (computed server-side). Good baseline, works out-of-domain, usually for long texts. Can be used for non-English content, but needs to be configured per language (tokenization, stemming, stopwords, etc) at indexing and retrieval time. More in [Text Search Guide](https://skills.qdrant.tech/md/documentation/search/text-search/?s=bm25) +- BM42 learned sparse, based on BM25, but better for small chunks of text & with meaning understanding. Works only on English. Requires fine-tuning for domain-specific retrieval. Requires FastEmbed (Python/REST only, not available in all SDKs). Not maintained. +- miniCOIL learned sparse, BM25 with additional understanding of words meaning in context. Works only on English. Requires fine-tuning for domain-specific retrieval. Requires FastEmbed. Usage shown in [FastEmbed miniCOIL documentation](https://skills.qdrant.tech/md/documentation/fastembed/fastembed-minicoil/). +- SPLADE++ learned sparse with term expansion. Heavier inference and resources usage but better performance due to term expansion. Requires fine-tuning for domain-specific retrieval. Provided in Qdrant Cloud Inference and FastEmbed versions work only on English. To use with FastEmbed, check [FastEmbed SPLADE documentation](https://skills.qdrant.tech/md/documentation/fastembed/fastembed-splade/). +- External learned sparse embeddings, for example BAAI/bge-m3. 
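For intuition on how BM25, the first option above, reacts to document length, here is a minimal sketch of its term-frequency component in the standard textbook form. The defaults `k=1.2` and `b=0.75` are common textbook values and `avg_len=256` is an arbitrary example; none of them is taken from this document, so check the values your Qdrant configuration actually uses.

```python
def bm25_tf(tf: int, doc_len: int, avg_len: float, k: float = 1.2, b: float = 0.75) -> float:
    """Term-frequency part of BM25 with length normalization (textbook form)."""
    length_norm = 1.0 - b + b * (doc_len / avg_len)
    return tf * (k + 1.0) / (tf + k * length_norm)


# Same single term occurrence, scored in a short chunk and in an average-length document.
short_chunk = bm25_tf(tf=1, doc_len=20, avg_len=256)
average_doc = bm25_tf(tf=1, doc_len=256, avg_len=256)
print(short_chunk, average_doc)
```

If `avg_len` does not reflect the real chunk lengths in the collection, the length normalization is skewed for every document, which is one reason these statistics deserve attention when indexing short chunks.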
+ +What to remember when using sparse vectors for lexical search: +- tokenization and stemming affect exact matches, especially on custom codes, terms, etc. + +What to remember when using Qdrant BM25 and miniCOIL (based on BM25): +- `avg_len` in the formula is not computed server-side; the user is responsible for passing it as a parameter +- BM25 might not work well for small chunks of text, as the BM25 algorithm was originally designed for search over long documents; consider adjusting document statistics in sparse vectors (TF & IDF, k, b). +- Qdrant BM25 vectors are configured per language, so consider customizing stop words, stemming & tokenization when users' documents mix several languages, or carefully configure vectors per point when they are monolingual. + +More on [Sparse Vectors for Text Search](https://skills.qdrant.tech/md/course/essentials/day-3/sparse-retrieval-demo/) + +## Need to Combine Multiple Representations of the Same Item + +Use when: the same item is embedded in multiple ways (e.g. different models, languages, or modalities) and you want to search across different representations in one request (it doesn't have to be all of them; it can even be just one). + +Use multiple named vector prefetches, where each prefetch covers one representation. + +- If you have groups and subgroups of representations (document -> chunk, image -> patch), you could use [searching in groups](https://skills.qdrant.tech/md/documentation/search/search/?s=search-groups). To avoid storing identical payloads several times, check [Lookup in Groups](https://skills.qdrant.tech/md/documentation/search/search/#lookup-in-groups) + +You can also search directly on [multivectors](https://skills.qdrant.tech/md/documentation/manage-data/vectors/?s=multivectors), a matrix of dense vectors, in a prefetch.
+ +However, this comes with several considerations: multivectors were designed to support late interaction models using the max similarity metric, so it's impossible to retrieve the list of individual max similarity scores for each query vector. + +Moreover, multivectors are rarely a good pick for prefetch: +- the max similarity metric is not symmetric, so [using an HNSW index with it could be problematic](https://skills.qdrant.tech/md/course/multi-vector-search/module-1/maxsim-distance/#the-hnsw-challenge) +- [multivector representations are very heavy, as is the search process on them](https://skills.qdrant.tech/md/course/multi-vector-search/module-1/problems-multi-vector). + +There are ways to make multivector retrieval cheaper (MUVERA, pooling); you can see more in ["Evaluating Tradeoffs of Multi-stage Multi-vector Search"](https://skills.qdrant.tech/md/course/multi-vector-search/module-3/evaluating-pipelines/) + +## What NOT to Do +- Choose any search method (for example, BM25) without evaluating its quality & the resources it uses. +- Use any search method (for example, BM25) without paying attention to the specifics of its configuration and its applicability to the use case. + diff --git a/plugins/qdrant/skills/qdrant-search-quality/search-strategies/relevance-feedback/README.md b/plugins/qdrant/skills/qdrant-search-quality/search-strategies/relevance-feedback/README.md new file mode 100644 index 00000000..cce81e19 --- /dev/null +++ b/plugins/qdrant/skills/qdrant-search-quality/search-strategies/relevance-feedback/README.md @@ -0,0 +1,118 @@ +--- +name: qdrant-relevance-feedback +description: "Use when someone asks about 'Qdrant\'s Relevance Feedback API', 'improving dense search relevance/recall', 'how to discover/get more relevant results from vector search', 'cheaper/better alternative to reranking', 'using a more heavy/big embedding model for dense search but can\'t afford it', 'finding more relevant documents beyond the initial search pool', or 'feedback loops'.
Also trigger when the user has a search quality problem due to a dense retriever being weak and is considering reranking as a solution -- this API may be a better fit." +--- + +Reranking reorders documents that have already been retrieved. Qdrant's Relevance Feedback (RF) instead modifies the vector search process itself based on a small amount of reranker feedback, distilling reranker (feedback model) knowledge into the search step. This allows RF to surface documents that the initial ANN search did not score highly enough. + +The RF is intended for tasks where relevance correlates with similarity in vector space. + +How you apply the RF depends on your goals. +First, understand how the RF works, read the ENTIRE section. Then define your goals and choose the appropriate usage pattern described below. Make sure to avoid the listed anti-patterns ("DO NOTs"). Before implementing anything, read CAREFULLY to avoid missing important details. + +## How It Works + +The [Qdrant Query Point API with a type RelevanceFeedbackQuery](https://api.qdrant.tech/api-reference/search/query-points) takes: + +- a query (`target`) +- a small list of seed documents (`feedback`) with relevance scores (often 4-5 seeds are enough) +- formula weights, which MUST be trained once per general search use case (your dataset, dense retriever, and feedback model) + +If you do not train the formula weights, results will at best be random, will not align with your data distribution or model behavior. Training is lightweight because the formula itself is simple. + +During search, it scores each candidate by combining similarity to the original query, similarity to highly rated seed documents and dissimilarity to poorly rated ones. + +### Feedback Model + +A feedback model is any model that can produce a float relevance score for `(query, document)` pairs. Higher scores must always mean higher relevance. 
+ +Examples: a cross-encoder, embedding similarity (for example, cosine similarity between query and document embeddings, or max_sim for late interaction models), an LLM-based scorer, a custom ranker. + +The feedback model used during training and inference MUST be the same model. Formula weights during training are calibrated to that model's score distribution. If you switch feedback models, you must retrain. + +What is a Good Feedback Model: +- If the model does not improve ranking quality when used as a reranker on retrieved documents, the RF search will not have a meaningful signal to amplify. +- RF search quality depends heavily on how well the feedback model scores partial matches. The training loss of RF formula relies on relative ordering, so poor score separation in the middle range (documents that are neither clearly relevant nor clearly irrelevant) weakens results. + +### To Make the RF API Work, You Need to Calibrate Weights First + +Use when: setting up RF for a new use case -- a new collection, feedback model, or embedding model powering ANN search. + +RF uses a weighted formula that combines the original query vector with feedback signals. + +For the currently available `naive` strategy, the learned weights control: +- `a` -- how much to trust the original ANN query-document similarity +- `b` -- how strongly differences in feedback scores matter +- `c` -- how strongly to follow the feedback direction (toward relevant documents and away from irrelevant ones) + +These weights must be learned from your data before use. You cannot safely use arbitrary values. + +- Install the [qdrant-relevance-feedback](https://pypi.org/project/qdrant-relevance-feedback/) Python library. Study what goes into RelevanceFeedback. +- Initialize a `RelevanceFeedback` instance. You can use provided QdrantRetriever or FastembedFeedback, or define your own. +- Review `train` parameters before calling `train`. 
The library retrieves `limit` candidates per train query, scores them with the feedback model, learns the weighting parameters, and returns the calibrated values. +- Call `train` on 50-200 representative, real, non-synthetic queries. + - Generate train queries yourself based on the use case, but give the option to the user to provide them, too. + - Inform user on cost and quality trade-offs of training. +- Check train metrics which show if RF had a signal (disagreement between retriever and feedback model) to distill and learn from. If there was no signal to learn from, adapt training parameters, queries or change a feedback model and retrain until RF learns well. +- Store the resulting RF parameters in your configuration and use them during inference. Retrain if your query distribution or corpus changes significantly. +- Evaluate resulting formula with `Evaluator` on a separate test set of representative, real, non-synthetic queries. If results seem unsatisfactory, investigate and inform user. + +The retriever, feedback model, and related parameters defined during training are assumed to remain the same during inference. + +## Want High-Quality Top-1/3 Results at Reasonable Cost + +Use when: top-1 or top-3 precision matters most, and reranking a large pool of documents would be too expensive or slow. This pattern below can match reranking quality at the top of the ranking for semantic similarity tasks, but it performs worse at deeper cutoffs. Do not use this approach when top-10+ recall is the priority. + +Only score a small set of seed documents. Five seeds is a robust default across many task types and scoring them costs user roughly 5x less than reranking a 25-document pool. + +- Retrieve the top 5 documents using ANN search. These become the feedback seeds. You'll need their stored embeddings. +- Score them with the feedback model used in training. 
+- Call Qdrant's Query API using the relevance feedback query: + - set `target` to the query retriever embedding (also possible to use Qdrant Cloud Inference). + - set `feedback` to a list of items where each item contains: + - `example=` (also possible to use Qdrant Cloud Inference) + - `score=` + - set `using` to retriever's handle, RF operates in retriever's vector space. + - set `strategy` to `naive` with your calibrated parameters + - set `limit` to the number of final results you need and use the RF results directly as final results. + + Check the [Relevance Feedback Query API documentation](https://skills.qdrant.tech/md/documentation/search/search-relevance/?s=relevance-feedback) and study code/methods of the relevant SDK before filling in anything. + +Using a point ID in `example` causes the RF API to automatically exclude that document from the final results. Using stored embeddings used for retrieval instead potentially keeps the document in the final results. + +## Want to Find Relevant Documents Beyond the Initial Search Results + +Use when: recall matters more than latency or cost (research, legal, medical, compliance), and relevant documents may exist outside the initial ANN retrieval pool. + +It performs two feedback model scoring rounds: +1. on feedback seeds +2. on newly surfaced RF results + +The second reranking pass safely promotes newly discovered documents into the top-10 of the final ranking. The advantage over standard reranking is that RF can reach relevant documents that lie completely outside the initial ANN pool, while a reranker with the same budget cannot. The tradeoff is higher latency due to two rounds of feedback-model scoring. + +- Retrieve and score 5 seed documents. These become the feedback seeds. You'll need their point IDs. 
+- Call Qdrant's query API using the relevance feedback query: + - set `target` to the query retriever embedding (also possible to use Qdrant Cloud Inference) + - set `feedback` to a list of items where each item contains: + - `example=` + - `score=` + - set `using` to retriever's handle, RF operates in retriever's vector space. + - set `strategy` to `naive` with your calibrated parameters + - set `limit` to the number of results user can afford to rerank based on the available cost budget. The total scoring cost equals the cost of scoring both the seeds and the RF results, roughly equivalent to reranking a pool of the same combined size. Inform and consult with the user. + - score the returned RF results with your feedback model. +- Merge the original seeds and RF results, then sort by feedback score. These will be your final results. + + Check the [Relevance Feedback Query API documentation](https://skills.qdrant.tech/md/documentation/search/search-relevance/?s=relevance-feedback) and study code/methods of the relevant SDK before filling in anything. + +Using a point ID in `example` causes the RF API to automatically exclude that document from the final results. Using stored embeddings used for retrieval instead potentially keeps the document in the final results. + +## What NOT to Do + +- Do not skip calibration and use random formula weights. Untrained weights produce arbitrary results. (`a=1, b=0, c=0` can be used if you only want vanilla ANN behavior through the RF API.) +- Do not use the RF API on sparse vectors. +- Do not use a feedback model where higher scores mean lower relevance. Scores must be monotonic: higher = more relevant. +- Do not use fewer than 2 feedback seeds. A single seed provides no contrastive signal. The formula needs at least one relatively more relevant and one relatively less relevant example to establish direction. Two is the minimum; five is the recommended default. 
+- Do not use significantly more than 5 seeds expecting better quality. Additional seeds usually add noise and increase scoring cost without meaningful gains. +- Do not use a different feedback model during inference than the one used during calibration. The learned weights are tied to that model's score scale and distribution. +- Do not use a feedback model that does not improve retrieval quality as a standard reranker on your data. +- Do not proceed to inference if training and evaluation metrics of qdrant-relevance-feedback package demonstrated unsatisfactory results, instead find a good training set of representative queries, a feedback model providing a meaningful signal and effective train parameters. \ No newline at end of file diff --git a/plugins/qdrant/skills/qdrant-version-upgrade/SKILL.md b/plugins/qdrant/skills/qdrant-version-upgrade/SKILL.md new file mode 100644 index 00000000..8acab022 --- /dev/null +++ b/plugins/qdrant/skills/qdrant-version-upgrade/SKILL.md @@ -0,0 +1,21 @@ +--- +name: qdrant-version-upgrade +description: "Guidance on how to upgrade your Qdrant version without interrupting the availability of your application and ensuring data integrity." +--- + + +# Qdrant Version Upgrade + +Qdrant has the following guarantees about version compatibility: + +- Major and minor versions of Qdrant and SDK are expected to match. For example, Qdrant 1.17.x is compatible with SDK 1.17.x. + +- Qdrant is tested for backward compatibility between minor versions. For example, Qdrant 1.17.x should be compatible with SDK 1.16.x. Qdrant server 1.16.x is also expected to be compatible with SDK 1.17.x, but only for the subset of features that were available in 1.16.x. + +- For migration to the next minor version, it is recommended to first upgrade the SDK to the next minor version and then upgrade the Qdrant server. + +- Storage compatibility is only guaranteed for one minor version. 
For example, data stored with Qdrant 1.16.x is expected to be compatible with Qdrant 1.17.x. If you need to migrate across more than one minor version, you must upgrade step by step, one minor version at a time. For example, to migrate from 1.15.x to 1.17.x, first upgrade to 1.16.x and then to 1.17.x. Note: Qdrant Cloud automates this process, so you can directly upgrade from 1.15.x to 1.17.x without intermediate steps. + +- A Qdrant cluster with a replication factor of 2 or higher can be upgraded without downtime by performing a rolling upgrade: you upgrade one node at a time while the other nodes continue to serve requests, which keeps your application available during the upgrade. More about replication factor: [Replication factor](https://skills.qdrant.tech/md/documentation/distributed_deployment/?s=replication-factor) + +For managing Qdrant version upgrades in Qdrant Cloud, you can use the [qcloud](https://github.com/qdrant/qcloud-cli) CLI tool.
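The one-minor-version-at-a-time rule for self-managed deployments can be sketched as a small planning helper. This is a hypothetical illustration, not part of any Qdrant tooling: `plan_upgrade_path` and `parse_version` are made-up names, and the sketch assumes both versions share the same major version.

```python
def parse_version(version: str) -> tuple[int, int]:
    """Extract (major, minor) from a version string such as '1.15.4'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def plan_upgrade_path(current: str, target: str) -> list[str]:
    """List the minor versions to pass through, one step at a time."""
    cur_major, cur_minor = parse_version(current)
    tgt_major, tgt_minor = parse_version(target)
    if cur_major != tgt_major:
        raise ValueError("this sketch only covers upgrades within one major version")
    if tgt_minor < cur_minor:
        raise ValueError("downgrades are not covered by this sketch")
    return [f"{cur_major}.{minor}.x" for minor in range(cur_minor + 1, tgt_minor + 1)]


print(plan_upgrade_path("1.15.4", "1.17.0"))  # → ['1.16.x', '1.17.x']
```

Following the recommendation above, each step would upgrade the SDK to that minor version first and the Qdrant server second.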