Merged
4 changes: 2 additions & 2 deletions docs/create_and_search_your_first_index.md
@@ -110,7 +110,7 @@ err := index.Index("doc", map[string]interface{}{

- Document gets a unique ID (`doc`)
- Fields are automatically mapped based on their Go types
- Text fields are analyzed (tokenized, lowercased, etc.) based on the mapping chosen (here, the default one)
- Text fields are analyzed (tokenized, lowercased, etc.) based on the chosen mapping (here, the default one)
- Document is stored in the search index

### 3. Searching
@@ -184,7 +184,7 @@ query.SetField("price")
### Custom Field Mapping

```go
// We can create customised mapping as well by specifying about analyzers
// We can create customized mappings as well with configurable analyzers
mapping := bleve.NewIndexMapping()

// Text field with custom analyzer
10 changes: 5 additions & 5 deletions docs/custom_query.md
@@ -7,7 +7,7 @@ Bleve exposes two query nodes for embedding-defined per-hit logic:

Bleve itself only executes callbacks that were already provided/attached by the
embedding application. It does not interpret any
embedder-specific payload such as callback source, params, or requested fields.
embedder-specific payload such as callback source, "params", or requested fields.

## Query Objects

@@ -130,8 +130,8 @@ At the bleve layer, doc values are decoded based on the field's mapping type:
## Error Cases

- missing child query:
- `custom filter query must have a query`
- `custom score query must have a query`
- `"custom filter query must have a query"`
- `"custom score query must have a query"`
- missing bound callback:
- `custom filter query must have a filter callback`
- `custom score query must have a score callback`
- `"custom filter query must have a filter callback"`
- `"custom score query must have a score callback"`
17 changes: 9 additions & 8 deletions docs/fast_merge.md
@@ -1,8 +1,9 @@
# Fast Merge

* v2.6.0 comes with the support for a feature called "fast merge" where we first train the index on a vector dataset and then build the vector index using this trained information of the centroid layout.
* This is an improvement on the existing behavior where we were performing the merge in a very naive fashion of reconstructing the participating vector indexes, re-training and then adding back the vectors into the index. Fast merge essentially merges the corresponding centroid cells' data vectors in a block wise fashion without doing the expensive operations
* This feature underneath the hood utilizes the existing [`merge_from` API](https://github.com/blevesearch/faiss/blob/ffd910a91f1acf49b9898a7e514e462db89ee7b3/faiss/Index.h#L396) in our fork of the faiss codebase.
* This is a separate feature when compared to the existing behavior where we were performing the merge in a very naive fashion of reconstructing the participating vector indexes, re-training and then adding back the vectors into the index. Fast merge essentially merges the corresponding centroid cells' data vectors in a block-wise fashion without doing the expensive operations. For this, a centroid/trained segment template needs to be built using enough random samples from a dataset prior to building segments with the actual index data.
* A trade-off with this feature is that if new data is introduced that was not available at the time of index introduction, data drift can develop, resulting in reduced recall.
* This feature underneath the hood utilizes the existing [`merge_from` API](https://github.com/blevesearch/faiss/blob/ffd910a91f1acf49b9898a7e514e462db89ee7b3/faiss/Index.h#L396) in our fork of the FAISS codebase.

## Support

@@ -12,7 +13,7 @@

## Usage

The feature can be enabled by first passing a key value pair in the config part while creating a new index. If the flag is false, then the behavior falls back to the more expensive naive merge.
The feature can be enabled by first passing a key-value pair in the config part while creating a new index. If the flag is false, then the behavior falls back to the more expensive naive merge.

```go
kvConfig := map[string]interface{}{
@@ -25,7 +26,7 @@ if err != nil {
}
```

User should now "train" the index on a random sample of the vector dataset they're planning to index and search.
The user should now "train" the index on a random sample of the vector dataset they're planning to index and search.

* It's completely up to the user as to how much data they want to use for training, controlling the batch size used while training and also marking whether the training is complete.
* NOTE: User must index their data only after marking the training as complete, otherwise the batch won't be indexed.
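The "random sample" mentioned above can be drawn in many ways. The following is a minimal, standard-library-only sketch of one option: picking `n` vectors uniformly at random without replacement. The function name `sampleForTraining`, the `seed` parameter, and the vector layout are hypothetical and not part of Bleve; how the sample is turned into documents and passed to training is left to the snippets in this document.

```go
package main

import (
	"fmt"
	"math/rand"
)

// sampleForTraining returns up to n vectors chosen uniformly at random,
// without replacement, from vectors. The seed only makes this sketch
// reproducible. This helper is hypothetical and not part of Bleve.
func sampleForTraining(vectors [][]float32, n int, seed int64) [][]float32 {
	if n < 0 {
		n = 0
	}
	if n > len(vectors) {
		n = len(vectors)
	}
	rng := rand.New(rand.NewSource(seed))
	// Perm yields a random permutation of [0, len(vectors)); its first n
	// entries form a uniform sample without replacement.
	picked := rng.Perm(len(vectors))[:n]
	sample := make([][]float32, 0, n)
	for _, i := range picked {
		sample = append(sample, vectors[i])
	}
	return sample
}

func main() {
	// A toy dataset of 1000 two-dimensional vectors.
	dataset := make([][]float32, 1000)
	for i := range dataset {
		dataset[i] = []float32{float32(i), float32(i) / 2}
	}
	sample := sampleForTraining(dataset, 100, 42)
	fmt.Println(len(sample)) // prints 100
}
```

How large the sample should be is a judgment call that depends on the dataset; the 10% used here is only an illustration.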
@@ -39,7 +40,7 @@ for _, doc := range trainingDocuments {
}

// train the index on the batch of data
// NOTE: the training can be done in an incremental manner as well, by using same Train() API but repeatedly calling it on a particular batch of data.
// NOTE: the training can be done in an incremental manner as well, by using the same Train() API but repeatedly calling it on a particular batch of data.
if err := index.Train(batch); err != nil {
log.Fatal(err)
}
@@ -55,9 +56,9 @@ if err := index.Train(batch); err != nil {

## Disclaimer

* This feature is primarily meant for the use case where the user is aware about much data they want to index and also for a ready heavy workload and little to no updates on the index itself.
* The intention of the feature is to be able to quickly index a massive scale of data on an index in an expensive manner and perform search on it.
* This feature is primarily meant for the use case where the user is aware of how much data they want to index and also for a read-heavy workload and little to no updates on the index itself.
* The intention of the feature is to be able to quickly index a massive scale of data on an index in an efficient manner and perform search on it.
* Without this feature, i.e. when the index build happens without a prior training phase
* The user wouldn't have to worry about use cases where the dataset is continuously updated with new "type" of vector. This is because each merge cycle would do the training afresh.
* The user doesn't have a lag in indexing the data either, they can start ingesting the data immediately.
* Based on what's mentioned above, when it comes to update and delete type of workloads on the dataset its extremely difficult to detect when the data drift will occur. So we end up falling back to the naive way of reconstructing + re-training.
* Based on what's mentioned above, when it comes to update and delete type of workloads on the dataset it's extremely difficult to detect when the data drift will occur. So we end up falling back to the naive way of reconstructing + re-training.
12 changes: 6 additions & 6 deletions docs/hierarchy.md
@@ -1,6 +1,6 @@
# Hierarchical nested search

* *v2.6.0* (and after) will come with support for **Array indexing and hierarchical nested search**.
* *v2.6.0* (and after) will come with support for **array indexing and hierarchical nested search**.
* We've achieved this by embedding nested documents within our bleve (scorch) indexes.
* Usage of zap file format: [v17](https://github.com/blevesearch/zapx/blob/master/zap.md). Here we preserve hierarchical document relationships within segments, continuing to conform to the segmented architecture of *scorch*.

@@ -29,7 +29,7 @@
}
```

* Multi-level arrays: Arrays can contain objects that themselves have array fields, allowing for deeply nested structures, such as a list of projects, each with its own list of tasks.
* Multi-level arrays: arrays can contain objects that themselves have array fields, allowing for deeply nested structures, such as a list of projects, each with its own list of tasks.

```json
{
@@ -107,7 +107,7 @@
],
"locations": [
{"city": "Athens","country": "Greece"},
{"city": "Berlin","country": "USA"}
{"city": "Berlin", "country": "USA"}
]
}
}
@@ -117,7 +117,7 @@

* From v2.6.0 onwards, Bleve allows for accurate representation and querying of complex nested structures, preserving the relationships between different levels of the hierarchy, across multi-level, multiple and hybrid arrays.

* The addition of `nested` document mappings enable defining fields that contain arrays of objects, giving the option to preserve the hierarchical relationships within the array during indexing. Having `nested` as false (default) will flatten the objects within the array, losing the hierarchy, which was the earlier behavior.
* The addition of `nested` document mappings enables the definition of fields that contain arrays of objects, giving the option to preserve the hierarchical relationships within the array during indexing. Having `nested` as false (default) will flatten the objects within the array, losing the hierarchy, which was the earlier behavior.

```json
{
@@ -146,10 +146,10 @@
}
```

* Any Bleve query (e.g., `match`, `phrase`, `term`, `fuzzy`, `numeric/date range` etc.) can be executed against fields within nested documents, with no special handling required. The query processor will automatically traverse the nested structures to find matches. Additional search constructs
* Any Bleve query (e.g., `match`, `phrase`, `term`, `fuzzy`, `numeric/date range`, etc.) can be executed against fields within nested documents, with no special handling required. The query processor will automatically traverse the nested structures to find matches. Additional search constructs
like vector search, synonym search, hybrid and pre-filtered vector search integrate seamlessly with hierarchy search.

* Conjunction Queries (AND queries) and other queries that depend on term co-occurrence within the same hierarchical context will respect the boundaries of nested documents. This means that terms must appear within the same nested object to be considered a match. For example, a conjunction query searching for an employee named "Alice" with the role "Engineer" within the "Engineering" department will only return results where both name and role terms are found within the same employee object, which is itself within a "Engineering" department object.
* Conjunction Queries (AND queries) and other queries that depend on term co-occurrence within the same hierarchical context will respect the boundaries of nested documents. This means that terms must appear within the same nested object to be considered a match. For example, a conjunction query searching for an employee named "Alice" with the role "Engineer" within the "Engineering" department will only return results where both name and role terms are found within the same employee object, which is itself within an "Engineering" department object.

* Some other search constructs will have enhanced precision with hierarchy search.
* Field-Level Highlighting: Only fields within the matched nested object are retrieved and highlighted, ensuring highlights appear in the correct hierarchical context. For example, a match in `departments[name=Engineering].employees` highlights only employees in that department.
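The difference between flattened and nested conjunction matching described above can be illustrated without Bleve at all. The sketch below models it with plain Go structs; the types and the functions `matchFlattened` and `matchNested` are hypothetical and only mimic the semantics, they are not how Bleve implements the feature.

```go
package main

import "fmt"

type employee struct{ Name, Role string }

type department struct {
	Name      string
	Employees []employee
}

// matchFlattened mimics the flattened (nested = false) behavior: the name
// and the role may be satisfied by two different employees.
func matchFlattened(d department, name, role string) bool {
	hasName, hasRole := false, false
	for _, e := range d.Employees {
		hasName = hasName || e.Name == name
		hasRole = hasRole || e.Role == role
	}
	return hasName && hasRole
}

// matchNested mimics the hierarchical behavior: both terms must be found
// within the same employee object.
func matchNested(d department, name, role string) bool {
	for _, e := range d.Employees {
		if e.Name == name && e.Role == role {
			return true
		}
	}
	return false
}

func main() {
	eng := department{
		Name: "Engineering",
		Employees: []employee{
			{Name: "Alice", Role: "Manager"},
			{Name: "Bob", Role: "Engineer"},
		},
	}
	// Alice is not an Engineer, but a flattened array cannot tell.
	fmt.Println(matchFlattened(eng, "Alice", "Engineer")) // prints true
	fmt.Println(matchNested(eng, "Alice", "Engineer"))    // prints false
}
```

The false positive in the flattened case is the kind of cross-object match that respecting nested document boundaries is meant to rule out.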
8 changes: 4 additions & 4 deletions docs/index_update.md
@@ -1,7 +1,7 @@
# Ability to reduce downtime during index mapping updates

* *v2.5.4* (and after) will come with support to delete or modify any field mapping in the index mapping without requiring a full rebuild of the index
* We do this by storing which portions of the field has to be deleted within zap and then lazily executing the deletion during subsequent merging of the segments
* *v2.5.4* (and after) will come with support for deleting or modifying any field mapping in the index mapping without requiring a full rebuild of the index
* We do this by storing which portions of the field have to be deleted within zap and then lazily executing the deletion during subsequent merging of the segments

## Usage

@@ -20,12 +20,12 @@ However, if any of the following conditions are met, the index is considered non
* Any additional fields or enabled document mappings in the new index mapping
* Any changes to IncludeInAll, type, IncludeTermVectors and SkipFreqNorm
* Any document mapping having its enabled value changing from false to true
* Text fields with a different analyser or date time fields with a different date time format
* Text fields with a different analyzer or date time fields with a different date time format
* Vector and VectorBase64 fields changing dims, similarity or vectorIndexOptimizedFor
* Any changes when field is part of `_all`
* Full field deletions when it is covered by any dynamic setting (Index, Store or DocValues Dynamic)
* Any changes to dynamic settings at the top level or any enabled document mapping
* If multiple fields sharing the same field name either from different type mappings or aliases are present, then any non compatible changes across all of these fields
* If multiple fields sharing the same field name either from different type mappings or aliases are present, then any non-compatible changes across all of these fields

## How to enforce immediate deletion?

8 changes: 4 additions & 4 deletions docs/pagination.md
@@ -28,7 +28,7 @@ JSON example:
}
```

The result would be 5 hits starting from the 5th hit.
The result would be 5 hits starting from the 11th hit.
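The arithmetic behind the corrected sentence can be sketched as follows, assuming the request (not shown in this hunk) uses an offset of 10 and a page size of 5. The helper `pageBounds` is hypothetical; it only shows how a zero-based offset maps to one-based hit positions.

```go
package main

import "fmt"

// pageBounds returns the half-open range [start, end) of zero-based hit
// ranks selected by an offset (from) and a page size, clamped to total.
// This helper is hypothetical and not part of Bleve.
func pageBounds(total, from, size int) (start, end int) {
	if from < 0 {
		from = 0
	}
	if size < 0 {
		size = 0
	}
	start = from
	if start > total {
		start = total
	}
	end = start + size
	if end > total {
		end = total
	}
	return start, end
}

func main() {
	// Assumed values: 100 total hits, offset 10, page size 5.
	start, end := pageBounds(100, 10, 5)
	// The offset is zero-based, so skipping 10 hits starts at the 11th.
	fmt.Printf("hits %d through %d\n", start+1, end) // prints "hits 11 through 15"
}
```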

When to use:

@@ -49,7 +49,7 @@ Rules:
Where do sort keys come from?

- Each hit includes `Sort` (and `DecodedSort` from Bleve v2.5.2). Take the last hit's sort keys for `SearchAfter`, or the first hit's sort keys for `SearchBefore`.
- If the field/fields to be searched over is numeric, datetime or geo, the values in the `Sort` field may have garbled values; this is because of how Bleve represents such data types internally. To use such fields as sort keys, use the `DecodedSort` field, which decodes the internal representations. This feature is available from Bleve v2.5.4.
- If the field/fields to be searched over are numeric, datetime or geo, the values in the `Sort` field may have garbled values; this is because of how Bleve represents such data types internally. To use such fields as sort keys, use the `DecodedSort` field, which decodes the internal representations. This feature is available from Bleve v2.5.4.

> When using `DecodedSort`, the `Sort` array in the search request needs to explicitly declare the type of the field for proper decoding. Hence, the `Sort` array must contain either `SortField` objects (for numeric and datetime) or `SortGeoDistance` objects (for geo) rather than just the field names. More info on `SortField` and `SortGeoDistance` can be found in [sort_facet.md](sort_facet.md).
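As a rough illustration of taking the last hit's sort keys for the next page, here is a standard-library-only sketch that builds the next JSON request body. The `hit` and `pageRequest` types and the `nextPage` function are hypothetical stand-ins, not Bleve types; the request is deliberately partial (the query is omitted) and the sort keys are assumed to be plain strings.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// hit is a hypothetical stand-in for a search hit; only the sort keys
// matter for pagination.
type hit struct {
	ID   string
	Sort []string
}

// pageRequest is a hypothetical, partial request body (query omitted).
type pageRequest struct {
	Size        int      `json:"size"`
	Sort        []string `json:"sort"`
	SearchAfter []string `json:"search_after,omitempty"`
}

// nextPage copies the previous request and sets search_after to the sort
// keys of the last hit. It reports false when there are no hits left.
func nextPage(prev pageRequest, hits []hit) (pageRequest, bool) {
	if len(hits) == 0 {
		return prev, false
	}
	next := prev
	last := hits[len(hits)-1]
	// Copy the keys so the request does not alias the hit's slice.
	next.SearchAfter = append([]string(nil), last.Sort...)
	return next, true
}

func main() {
	req := pageRequest{Size: 2, Sort: []string{"-_score", "_id"}}
	hits := []hit{
		{ID: "a", Sort: []string{"0.9", "a"}},
		{ID: "b", Sort: []string{"0.7", "b"}},
	}
	next, ok := nextPage(req, hits)
	if !ok {
		return
	}
	body, err := json.Marshal(next)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

For backward pagination the same idea applies with the first hit's sort keys and a `search_before` field instead.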

@@ -75,7 +75,7 @@ Backward pagination over `_id` and `_score`:
}
```

Pagination using numeric, datetime and geo fields. Notice how we specify the sort objects, with the "type" field explicitly declared in case of numeric and datetime:
Pagination using numeric, datetime, and geo fields. Notice how we specify the sort objects, with the "type" field explicitly declared in case of numeric and datetime:

```json
{
@@ -86,7 +86,7 @@ Pagination using numeric, datetime and geo fields. Notice how we specify the sor
"sort": [
{"by": "field", "field": "price", "type": "number"},
{"by": "field", "field": "created_at", "type": "date"},
{"by": "geo_distance", "field": "location", "location": {"lat": 40.7128,"lon": -74.0060}}
{"by": "geo_distance", "field": "location", "location": {"lat": 40.7128, "lon": -74.0060}}
],
"search_after": ["99.99", "2023-10-15T10:30:00Z", "5.2"]
}
12 changes: 6 additions & 6 deletions docs/persister.md
@@ -11,13 +11,13 @@ Starting with v2.5.0, Scorch supports parallel flushing of in-memory segments to
- `NumPersisterWorkers`: This factor decides how many maximum workers can be spawned to flush out the in-memory segments. Each worker will work on a disjoint subset of segments, merge them, and flush them out to the disk. By default the persister deploys only one worker.
- `MaxSizeInMemoryMergePerWorker`: This config decides what's the maximum amount of input data in bytes a single worker can work upon. By default this value is equal to 0 which means that this config is disabled and the worker tries to merge all the data in one shot. Also note that it's imperative that the user set this config if `NumPersisterWorkers > 1`.

If the index is tuned to have a higher `NumPersisterWorkers` value, the memory can potentially drain out faster and ensure stronger consistency behaviour — but there would be a lot of on-disk files, and the background merger would experience the pressure of managing this large number of files, which can be resource-intensive.
If the index is tuned to have a higher `NumPersisterWorkers` value, the memory can potentially drain out faster and ensure stronger consistency behavior — but there would be a lot of on-disk files, and the background merger would experience the pressure of managing this large number of files, which can be resource-intensive.

- Tuning this config is very dependent on the available CPU resources, and something to keep in mind here is that the process's RSS can increase if the number of workers — and each of them working upon a large amount of data — is high.
- Tuning this config is very dependent on the available CPU resources. Something to keep in mind here is that the process's RSS can increase if the number of workers — and each of them working upon a large amount of data — is high.

Increasing the `MaxSizeInMemoryMergePerWorker` value would mean that each worker acts upon a larger amount of data and spends more time merging and flushing it out to disk — which can be healthy behaviour in terms of I/O, although it comes at the cost of time.
Increasing the `MaxSizeInMemoryMergePerWorker` value would mean that each worker acts upon a larger amount of data and spends more time merging and flushing it out to disk — which can be healthy behavior in terms of I/O, although it comes at the cost of time.

- Changing this config is usecase dependent, for example in usecases where the payload or per doc size is generally large in size (for eg vector usecases), it would be beneficial to have a larger value for this.
- Changing this config is use-case dependent; for example, in use cases where the payload or per-doc size is generally large (for example vector use cases), it would be beneficial to have a larger value for this.

So, having the ideal values for these two configs is definitely dependent on the use case and can involve a bunch of experiments, keeping the resource usage in mind.

@@ -29,10 +29,10 @@ Management of these files is crucial when it comes to query latency because a hi

The merger sees the files on disk and plans out which segments to merge so that the final layout of segment tiers (each tier having multiple files), which grow in a logarithmic way (the chances of larger tiers growing in number would decrease), is maintained. This also implies that deciding this first-tier size becomes important in deciding the number of segment files across all tiers.

Starting with v2.5.0, this first-tier size is dependent on the file size using the `FloorSegmentFileSize` config, because that's a better metric to consider (unlike the legacy live doc count metric) in order to ensure that the behaviour is in line with the use case and aware of the payload/doc size.
Starting with v2.5.0, this first-tier size is dependent on the file size using the `FloorSegmentFileSize` config, because that's a better metric to consider (unlike the legacy live doc count metric) in order to ensure that the behavior is in line with the use case and aware of the payload/doc size.

- This config can also be tuned to dictate how the I/O behaviour should be within an index. While tuning this config, it should be in proportion to the `MaxSizeInMemoryMergePerWorker` since that dictates the amount of data flushed out per flush.
- The observation here is that `FloorSegmentFileSize` is lesser than `MaxSizeInMemoryMergePerWorker` and for an optimal I/O during indexing, this value can be set close to `MaxSizeInMemoryMergePerWorker/6`.
- The observation here is that `FloorSegmentFileSize` is less than `MaxSizeInMemoryMergePerWorker` and for an optimal I/O during indexing, this value can be set close to `MaxSizeInMemoryMergePerWorker/6`.
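The relationship between these three settings can be sketched as a small helper that validates the worker settings and derives a starting value for `FloorSegmentFileSize`. The `tuning` type and its methods are hypothetical. The outer map keys `scorchPersisterOptions` and `scorchMergePlanOptions` are assumptions about how the options are grouped and should be checked against the config section of the document before use.

```go
package main

import (
	"errors"
	"fmt"
)

// tuning is a hypothetical holder for the persister settings discussed above.
type tuning struct {
	NumPersisterWorkers           int
	MaxSizeInMemoryMergePerWorker int // bytes; 0 means disabled
}

// validate enforces the rule that the per-worker size must be set when
// more than one persister worker is configured.
func (t tuning) validate() error {
	if t.NumPersisterWorkers > 1 && t.MaxSizeInMemoryMergePerWorker <= 0 {
		return errors.New("MaxSizeInMemoryMergePerWorker must be set when NumPersisterWorkers > 1")
	}
	return nil
}

// floorSegmentFileSize applies the MaxSizeInMemoryMergePerWorker/6
// starting point suggested above.
func (t tuning) floorSegmentFileSize() int {
	return t.MaxSizeInMemoryMergePerWorker / 6
}

// kvConfig assembles a config map. The grouping keys are assumptions.
func (t tuning) kvConfig() map[string]interface{} {
	return map[string]interface{}{
		"scorchPersisterOptions": map[string]interface{}{
			"NumPersisterWorkers":           t.NumPersisterWorkers,
			"MaxSizeInMemoryMergePerWorker": t.MaxSizeInMemoryMergePerWorker,
		},
		"scorchMergePlanOptions": map[string]interface{}{
			"FloorSegmentFileSize": t.floorSegmentFileSize(),
		},
	}
}

func main() {
	t := tuning{NumPersisterWorkers: 4, MaxSizeInMemoryMergePerWorker: 60000000}
	if err := t.validate(); err != nil {
		fmt.Println("invalid tuning:", err)
		return
	}
	fmt.Println(t.floorSegmentFileSize()) // prints 10000000
	_ = t.kvConfig()                      // would be passed as the index config
}
```

The numbers here are placeholders; suitable values depend on the use case and usually need experimentation.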

## Setting a Persister/Merger Config in Index
