diff --git a/docs/create_and_search_your_first_index.md b/docs/create_and_search_your_first_index.md index b0082c8f6..ae24f5b16 100644 --- a/docs/create_and_search_your_first_index.md +++ b/docs/create_and_search_your_first_index.md @@ -110,7 +110,7 @@ err := index.Index("doc", map[string]interface{}{ - Document gets a unique ID (`doc`) - Fields are automatically mapped based on their Go types -- Text fields are analyzed (tokenized, lowercased, etc.) based on the mapping chosen (here, the default one) +- Text fields are analyzed (tokenized, lowercased, etc.) based on the chosen mapping (here, the default one) - Document is stored in the search index ### 3. Searching @@ -184,7 +184,7 @@ query.SetField("price") ### Custom Field Mapping ```go -// We can create customised mapping as well by specifying about analyzers +// We can create customized mappings as well with configurable analyzers mapping := bleve.NewIndexMapping() // Text field with custom analyzer diff --git a/docs/custom_query.md b/docs/custom_query.md index 6c67bc4a0..244aef7e1 100644 --- a/docs/custom_query.md +++ b/docs/custom_query.md @@ -7,7 +7,7 @@ Bleve exposes two query nodes for embedding-defined per-hit logic: Bleve itself only executes callbacks that were already provided/attached by the embedding application. It does not interpret any -embedder-specific payload such as callback source, params, or requested fields. +embedder-specific payload such as callback source, "params", or requested fields. 
## Query Objects @@ -130,8 +130,8 @@ At the bleve layer, doc values are decoded based on the field's mapping type: ## Error Cases - missing child query: - - `custom filter query must have a query` - - `custom score query must have a query` + - `"custom filter query must have a query"` + - `"custom score query must have a query"` - missing bound callback: - - `custom filter query must have a filter callback` - - `custom score query must have a score callback` + - `"custom filter query must have a filter callback"` + - `"custom score query must have a score callback"` diff --git a/docs/fast_merge.md b/docs/fast_merge.md index e52e19d7e..d47755272 100644 --- a/docs/fast_merge.md +++ b/docs/fast_merge.md @@ -1,8 +1,9 @@ # Fast Merge * v2.6.0 comes with the support for a feature called "fast merge" where we first train the index on a vector dataset and then build the vector index using this trained information of the centroid layout. -* This is an improvement on the existing behavior where we were performing the merge in a very naive fashion of reconstructing the participating vector indexes, re-training and then adding back the vectors into the index. Fast merge essentially merges the corresponding centroid cells' data vectors in a block wise fashion without doing the expensive operations -* This feature underneath the hood utilizes the existing [`merge_from` API](https://github.com/blevesearch/faiss/blob/ffd910a91f1acf49b9898a7e514e462db89ee7b3/faiss/Index.h#L396) in our fork of the faiss codebase. +* This is a separate feature when compared to the existing behavior where we were performing the merge in a very naive fashion of reconstructing the participating vector indexes, re-training and then adding back the vectors into the index. Fast merge essentially merges the corresponding centroid cells' data vectors in a block-wise fashion without doing the expensive operations. 
For this, a centroid/trained segment template needs to be built using enough random samples from a dataset prior to building segments with the actual index data. +* A trade-off with this feature is that if new data is introduced that was not available at the time of index introduction, data drift can develop, resulting in reduced recall. +* Under the hood, this feature utilizes the existing [`merge_from` API](https://github.com/blevesearch/faiss/blob/ffd910a91f1acf49b9898a7e514e462db89ee7b3/faiss/Index.h#L396) in our fork of the FAISS codebase. ## Support @@ -12,7 +13,7 @@ ## Usage -The feature can be enabled by first passing a key value pair in the config part while creating a new index. If the flag is false, then the behavior falls back to the more expensive naive merge. +The feature can be enabled by first passing a key-value pair in the config part while creating a new index. If the flag is false, then the behavior falls back to the more expensive naive merge. ```go kvConfig := map[string]interface{}{ @@ -25,7 +26,7 @@ if err != nil { } ``` -User should now "train" the index on a random sample of the vector dataset they're planning to index and search. +The user should now "train" the index on a random sample of the vector dataset they're planning to index and search. * It's completely up to the user as to how much data they want to use for training, controlling the batch size used while training and also marking whether the training is complete. * NOTE: User must index their data only after marking the training as complete, otherwise the batch won't be indexed. @@ -39,7 +40,7 @@ for _, doc := range trainingDocuments { } // train the index on the batch of data -// NOTE: the training can be done in an incremental manner as well, by using same Train() API but repeatedly calling it on a particular batch of data.
+// NOTE: the training can be done in an incremental manner as well, by using the same Train() API but repeatedly calling it on a particular batch of data. if err := index.Train(batch); err != nil { log.Fatal(err) } @@ -55,9 +56,9 @@ if err := index.Train(batch); err != nil { ## Disclaimer -* This feature is primarily meant for the use case where the user is aware about much data they want to index and also for a ready heavy workload and little to no updates on the index itself. -* The intention of the feature is to be able to quickly index a massive scale of data on an index in an expensive manner and perform search on it. +* This feature is primarily meant for the use case where the user is aware of how much data they want to index and also for a read-heavy workload and little to no updates on the index itself. +* The intention of the feature is to be able to quickly index a massive scale of data on an index in an efficient manner and perform search on it. * Without this feature, i.e. when the index build happens without a prior training phase * The user wouldn't have to worry about use cases where the dataset is continuously updated with new "type" of vector. This is because each merge cycle would do the training afresh. * The user doesn't have a lag in indexing the data either, they can start ingesting the data immediately. -* Based on what's mentioned above, when it comes to update and delete type of workloads on the dataset its extremely difficult to detect when the data drift will occur. So we end up falling back to the naive way of reconstructing + re-training. +* Based on what's mentioned above, when it comes to update and delete type of workloads on the dataset it's extremely difficult to detect when the data drift will occur. So we end up falling back to the naive way of reconstructing + re-training. 
diff --git a/docs/hierarchy.md b/docs/hierarchy.md index 39410864d..839e9f4e4 100644 --- a/docs/hierarchy.md +++ b/docs/hierarchy.md @@ -1,6 +1,6 @@ # Hierarchical nested search -* *v2.6.0* (and after) will come with support for **Array indexing and hierarchical nested search**. +* *v2.6.0* (and after) will come with support for **array indexing and hierarchical nested search**. * We've achieved this by embedding nested documents within our bleve (scorch) indexes. * Usage of zap file format: [v17](https://github.com/blevesearch/zapx/blob/master/zap.md). Here we preserve hierarchical document relationships within segments, continuing to conform to the segmented architecture of *scorch*. @@ -29,7 +29,7 @@ } ``` -* Multi-level arrays: Arrays can contain objects that themselves have array fields, allowing for deeply nested structures, such as a list of projects, each with its own list of tasks. +* Multi-level arrays: arrays can contain objects that themselves have array fields, allowing for deeply nested structures, such as a list of projects, each with its own list of tasks. ```json { @@ -107,7 +107,7 @@ ], "locations": [ {"city": "Athens","country": "Greece"}, - {"city": "Berlin","country": "USA"} + {"city": "Berlin", "country": "USA"} ] } } @@ -117,7 +117,7 @@ * From v2.6.0 onwards, Bleve allows for accurate representation and querying of complex nested structures, preserving the relationships between different levels of the hierarchy, across multi-level, multiple and hybrid arrays. -* The addition of `nested` document mappings enable defining fields that contain arrays of objects, giving the option to preserve the hierarchical relationships within the array during indexing. Having `nested` as false (default) will flatten the objects within the array, losing the hierarchy, which was the earlier behavior. 
+* The addition of `nested` document mappings enables the definition of fields that contain arrays of objects, giving the option to preserve the hierarchical relationships within the array during indexing. Having `nested` as false (default) will flatten the objects within the array, losing the hierarchy, which was the earlier behavior. ```json { @@ -146,10 +146,10 @@ } ``` -* Any Bleve query (e.g., `match`, `phrase`, `term`, `fuzzy`, `numeric/date range` etc.) can be executed against fields within nested documents, with no special handling required. The query processor will automatically traverse the nested structures to find matches. Additional search constructs +* Any Bleve query (e.g., `match`, `phrase`, `term`, `fuzzy`, `numeric/date range`, etc.) can be executed against fields within nested documents, with no special handling required. The query processor will automatically traverse the nested structures to find matches. Additional search constructs like vector search, synonym search, hybrid and pre-filtered vector search integrate seamlessly with hierarchy search. -* Conjunction Queries (AND queries) and other queries that depend on term co-occurrence within the same hierarchical context will respect the boundaries of nested documents. This means that terms must appear within the same nested object to be considered a match. For example, a conjunction query searching for an employee named "Alice" with the role "Engineer" within the "Engineering" department will only return results where both name and role terms are found within the same employee object, which is itself within a "Engineering" department object. +* Conjunction Queries (AND queries) and other queries that depend on term co-occurrence within the same hierarchical context will respect the boundaries of nested documents. This means that terms must appear within the same nested object to be considered a match. 
For example, a conjunction query searching for an employee named "Alice" with the role "Engineer" within the "Engineering" department will only return results where both name and role terms are found within the same employee object, which is itself within an "Engineering" department object. * Some other search constructs will have enhanced precision with hierarchy search. * Field-Level Highlighting: Only fields within the matched nested object are retrieved and highlighted, ensuring highlights appear in the correct hierarchical context. For example, a match in `departments[name=Engineering].employees` highlights only employees in that department. diff --git a/docs/index_update.md b/docs/index_update.md index e7e64b75b..74a212514 100644 --- a/docs/index_update.md +++ b/docs/index_update.md @@ -1,7 +1,7 @@ # Ability to reduce downtime during index mapping updates -* *v2.5.4* (and after) will come with support to delete or modify any field mapping in the index mapping without requiring a full rebuild of the index -* We do this by storing which portions of the field has to be deleted within zap and then lazily executing the deletion during subsequent merging of the segments +* *v2.5.4* (and after) will come with support for deleting or modifying any field mapping in the index mapping without requiring a full rebuild of the index +* We do this by storing which portions of the field have to be deleted within zap and then lazily executing the deletion during subsequent merging of the segments ## Usage @@ -20,12 +20,12 @@ However, if any of the following conditions are met, the index is considered non * Any additional fields or enabled document mappings in the new index mapping * Any changes to IncludeInAll, type, IncludeTermVectors and SkipFreqNorm * Any document mapping having its enabled value changing from false to true -* Text fields with a different analyser or date time fields with a different date time format +* Text fields with a different analyzer or date
time fields with a different date time format * Vector and VectorBase64 fields changing dims, similarity or vectorIndexOptimizedFor * Any changes when field is part of `_all` * Full field deletions when it is covered by any dynamic setting (Index, Store or DocValues Dynamic) * Any changes to dynamic settings at the top level or any enabled document mapping -* If multiple fields sharing the same field name either from different type mappings or aliases are present, then any non compatible changes across all of these fields +* If multiple fields sharing the same field name either from different type mappings or aliases are present, then any non-compatible changes across all of these fields ## How to enforce immediate deletion? diff --git a/docs/pagination.md b/docs/pagination.md index 17b01702a..b853f78cc 100644 --- a/docs/pagination.md +++ b/docs/pagination.md @@ -28,7 +28,7 @@ JSON example: } ``` -The result would be 5 hits starting from the 5th hit. +The result would be 5 hits starting from the 11th hit. When to use: @@ -49,7 +49,7 @@ Rules: Where do sort keys come from? - Each hit includes `Sort` (and `DecodedSort` from Bleve v2.5.2). Take the last hit's sort keys for `SearchAfter`, or the first hit's sort keys for `SearchBefore`. -- If the field/fields to be searched over is numeric, datetime or geo, the values in the `Sort` field may have garbled values; this is because of how Bleve represents such data types internally. To use such fields as sort keys, use the `DecodedSort` field, which decodes the internal representations. This feature is available from Bleve v2.5.4. +- If the field/fields to be searched over are numeric, datetime or geo, the values in the `Sort` field may have garbled values; this is because of how Bleve represents such data types internally. To use such fields as sort keys, use the `DecodedSort` field, which decodes the internal representations. This feature is available from Bleve v2.5.4. 
> When using `DecodedSort`, the `Sort` array in the search request needs to explicitly declare the type of the field for proper decoding. Hence, the `Sort` array must contain either `SortField` objects (for numeric and datetime) or `SortGeoDistance` objects (for geo) rather than just the field names. More info on `SortField` and `SortGeoDistance` can be found in [sort_facet.md](sort_facet.md). @@ -75,7 +75,7 @@ Backward pagination over `_id` and `_score`: } ``` -Pagination using numeric, datetime and geo fields. Notice how we specify the sort objects, with the "type" field explicitly declared in case of numeric and datetime: +Pagination using numeric, datetime, and geo fields. Notice how we specify the sort objects, with the "type" field explicitly declared in case of numeric and datetime: ```json { @@ -86,7 +86,7 @@ Pagination using numeric, datetime and geo fields. Notice how we specify the sor "sort": [ {"by": "field", "field": "price", "type": "number"}, {"by": "field", "field": "created_at", "type": "date"}, - {"by": "geo_distance", "field": "location", "location": {"lat": 40.7128,"lon": -74.0060}} + {"by": "geo_distance", "field": "location", "location": {"lat": 40.7128, "lon": -74.0060}} ], "search_after": ["99.99", "2023-10-15T10:30:00Z", "5.2"] } diff --git a/docs/persister.md b/docs/persister.md index d4dcad9a7..8fc4ad311 100644 --- a/docs/persister.md +++ b/docs/persister.md @@ -11,13 +11,13 @@ Starting with v2.5.0, Scorch supports parallel flushing of in-memory segments to - `NumPersisterWorkers`: This factor decides how many maximum workers can be spawned to flush out the in-memory segments. Each worker will work on a disjoint subset of segments, merge them, and flush them out to the disk. By default the persister deploys only one worker. - `MaxSizeInMemoryMergePerWorker`: This config decides what's the maximum amount of input data in bytes a single worker can work upon. 
By default this value is equal to 0 which means that this config is disabled and the worker tries to merge all the data in one shot. Also note that it's imperative that the user set this config if `NumPersisterWorkers > 1`. -If the index is tuned to have a higher `NumPersisterWorkers` value, the memory can potentially drain out faster and ensure stronger consistency behaviour — but there would be a lot of on-disk files, and the background merger would experience the pressure of managing this large number of files, which can be resource-intensive. +If the index is tuned to have a higher `NumPersisterWorkers` value, the memory can potentially drain out faster and ensure stronger consistency behavior — but there would be a lot of on-disk files, and the background merger would experience the pressure of managing this large number of files, which can be resource-intensive. -- Tuning this config is very dependent on the available CPU resources, and something to keep in mind here is that the process's RSS can increase if the number of workers — and each of them working upon a large amount of data — is high. +- Tuning this config is very dependent on the available CPU resources. Something to keep in mind here is that the process's RSS can increase if the number of workers — and each of them working upon a large amount of data — is high. -Increasing the `MaxSizeInMemoryMergePerWorker` value would mean that each worker acts upon a larger amount of data and spends more time merging and flushing it out to disk — which can be healthy behaviour in terms of I/O, although it comes at the cost of time. +Increasing the `MaxSizeInMemoryMergePerWorker` value would mean that each worker acts upon a larger amount of data and spends more time merging and flushing it out to disk — which can be healthy behavior in terms of I/O, although it comes at the cost of time. 
-- Changing this config is usecase dependent, for example in usecases where the payload or per doc size is generally large in size (for eg vector usecases), it would be beneficial to have a larger value for this. +- Changing this config is use-case dependent; for example, in use cases where the payload or per-doc size is generally large (for example vector use cases), it would be beneficial to have a larger value for this. So, having the ideal values for these two configs is definitely dependent on the use case and can involve a bunch of experiments, keeping the resource usage in mind. @@ -29,10 +29,10 @@ Management of these files is crucial when it comes to query latency because a hi The merger sees the files on disk and plans out which segments to merge so that the final layout of segment tiers (each tier having multiple files), which grow in a logarithmic way (the chances of larger tiers growing in number would decrease), is maintained. This also implies that deciding this first-tier size becomes important in deciding the number of segment files across all tiers. -Starting with v2.5.0, this first-tier size is dependent on the file size using the `FloorSegmentFileSize` config, because that's a better metric to consider (unlike the legacy live doc count metric) in order to ensure that the behaviour is in line with the use case and aware of the payload/doc size. +Starting with v2.5.0, this first-tier size is dependent on the file size using the `FloorSegmentFileSize` config, because that's a better metric to consider (unlike the legacy live doc count metric) in order to ensure that the behavior is in line with the use case and aware of the payload/doc size. - This config can also be tuned to dictate how the I/O behaviour should be within an index. While tuning this config, it should be in proportion to the `MaxSizeInMemoryMergePerWorker` since that dictates the amount of data flushed out per flush. 
-- The observation here is that `FloorSegmentFileSize` is lesser than `MaxSizeInMemoryMergePerWorker` and for an optimal I/O during indexing, this value can be set close to `MaxSizeInMemoryMergePerWorker/6`. +- The observation here is that `FloorSegmentFileSize` is less than `MaxSizeInMemoryMergePerWorker` and for an optimal I/O during indexing, this value can be set close to `MaxSizeInMemoryMergePerWorker/6`. ## Setting a Persister/Merger Config in Index diff --git a/docs/score_fusion.md b/docs/score_fusion.md index 9d78c02b7..61efc1ae8 100644 --- a/docs/score_fusion.md +++ b/docs/score_fusion.md @@ -139,7 +139,7 @@ searchRequest.AddKNN( 0.8, // kNN weight (boost) ) -// Optional: Configure RRF parameters +// Optional: Configure RSF parameters params := bleve.RequestParams{ ScoreWindowSize: 150 // Window size (default: size) } @@ -237,7 +237,7 @@ When using score fusion (`Score` set to `"rrf"` or `"rsf"`), certain features ar * **SearchAfter/SearchBefore**: Not compatible with score fusion. For pagination, use `From` and `Size` only. * **Sort**: Only descending score sort (`-_score`) or default sorting is allowed -* **Faceting**: Only documents included in the FTS result list are considered. Documents that appear exclusively in the KNN result list are ignored during faceting. +* **Faceting**: Only documents included in the FTS result list are considered. Documents that appear exclusively in the kNN result list are ignored during faceting. ## Choosing a Fusion Strategy diff --git a/docs/scoring.md b/docs/scoring.md index 5b9a365a1..480f510cf 100644 --- a/docs/scoring.md +++ b/docs/scoring.md @@ -1,6 +1,6 @@ # Scoring models for document hits -* Search is performed on a collection fields using compound queries such as conjunction/disjunction/boolean etc. However, the scoring itself is done independently for each field and then aggregated to get the final score for a document hit. 
+* Search is performed on a collection of fields using compound queries such as conjunction/disjunction/boolean, etc. However, the scoring itself is done independently for each field and then aggregated to get the final score for a document hit. * Default scoring scheme for document hits involving text hits: `tf-idf`. * Nearest-neighbor/vector hits scoring depends on chosen `knn distance` metric, highlighted [here](https://github.com/blevesearch/bleve/blob/master/docs/vectors.md#supported). * Hybrid search scoring will combine `tf-idf` scores with `knn distance` numbers. @@ -21,7 +21,7 @@ The scoring formula followed in BM25 is \sum_{i}^n IDF(q_i) {{f(q_i,D) * (k1 + 1)}\over{f(q_i,D) + k1 * (1-b+b*{{fieldLen}\over{avgFieldLen}})}} ``` -$IDF(q_i)$ here refers to Inverse Document Frequency talks about how rare (and hence rich in information) is a particular query term $`q_i`$ across all the documents in the index, which is calculated as +$IDF(q_i)$ here refers to Inverse Document Frequency, which measures how rare (and hence rich in information) a particular query term $`q_i`$ is across all the documents in the index, and is calculated as ```math \ln(1 + {{docTotal - docTerm + 0.5}\over{docTerm + 0.5}}) @@ -34,7 +34,7 @@ Coming back to the BM25 scoring, $f(q_i,D)$ refers to the frequency of the query ### How to enable and use BM25 -Bleve v2.5.0 updated the `indexMapping` construct with the concept of `scoringModel`. This is a global (meaning applicable to all the fields) setting which drives which scoring algorithm to apply while scoring the document hits. Supported scoring models are defined [here](https://github.com/blevesearch/bleve_index_api/blob/f54d76f0a71a838837159aa44ced0404bb6ec25f/indexing_options.go#L27) +Bleve v2.5.0 updated the `indexMapping` construct with the concept of `scoringModel`. This is a global (meaning applicable to all fields) setting that determines which scoring algorithm to apply while scoring the document hits.
Supported scoring models are defined [here](https://github.com/blevesearch/bleve_index_api/blob/f54d76f0a71a838837159aa44ced0404bb6ec25f/indexing_options.go#L27) For instance, while defining the index mapping for the data model that's been decided by the user, following snippet can be referred to enable BM25 @@ -45,11 +45,11 @@ indexMapping.DefaultAnalyzer = "en" indexMapping.ScoringModel = "bm25" ``` -During search time there's explicit change involved, unless the user wants to perform a global scoring. +During search time there's no explicit change involved, unless the user wants to perform a global scoring. ### Global Scoring -Let's say that the user has a dataset which is quite large (let's say 3 million) and to have good throughput, they create 3 shards (with the same index mapping) for the "index". Each of these shards can be `bleve.Index` type and while performing a search over the entire dataset, a `bleve.IndexAlias` can be created over which a search can be performed. This parallelizes things pretty good, both on the indexing path and the search path. +Let's say that the user has a dataset which is quite large (let's say 3 million) and to have good throughput, they create 3 shards (with the same index mapping) for the "index". Each of these shards can be `bleve.Index` type and while performing a search over the entire dataset, a `bleve.IndexAlias` can be created over which a search can be performed. This parallelizes things pretty well, both on the indexing path and the search path. The concept of global scoring is applicable when the index is "sharded" (similar to above situation). This is because each index has data which is disjoint, and thereby while performing the scoring on document hits on each of them, the value of stats is not complete at a global level, since we're doing a search over the entire dataset using the `bleve.IndexAlias`. For eg: `docTotal` value while scoring the document hits would be 1 million which is incorrect at a global level. 
@@ -69,7 +69,7 @@ ctx = context.WithValue(ctx, search.SearchTypeKey, search.GlobalScoring) res, err := multiPartIndex.SearchInContext(ctx, searchRequest) ``` -A note here is that, this would only matter if the relative order of the document hits vary quite a bit (vs single shard case). This would be possible when the shard count increases quite a bit, in low doc count situations or if there is a heavy skew in the data distribution amongst the shards for some reason. Ideally the shards are created when the data is quite large and each of them index same amount of data - in which case the scores won't fluctuate much to affect the relative hit order and the user can choose to avoid the global scoring mechanism altogether. +Note that this only matters if the relative order of the document hits varies considerably compared to the single-shard case. This can happen when the shard count increases considerably, in low doc count situations, or if there is a heavy skew in the data distribution amongst the shards. Ideally, the shards are created when the data is quite large and each of them indexes the same amount of data, in which case the scores won't fluctuate enough to affect the relative hit order and the user can choose to avoid the global scoring mechanism altogether.
## TF-IDF diff --git a/docs/search_autocomplete.md b/docs/search_autocomplete.md index 3513b4a2f..225323a99 100644 --- a/docs/search_autocomplete.md +++ b/docs/search_autocomplete.md @@ -129,7 +129,7 @@ func (s *EdgeNgramFilter) Filter(input analysis.TokenStream) analysis.TokenStrea for _, token := range input { runeCount := utf8.RuneCount(token.Term) runes := bytes.Runes(token.Term) - // ..builds tokens based from either end, specified in the input + // ..builds tokens based on either end, specified in the input } return rv } @@ -204,7 +204,7 @@ if err := indexMapping.AddCustomTokenFilter("Engram", edgeGramFilter); err != ni - `"type": "edge_ngram"` - Tells Bleve to use the edge n-gram filter - `"min": 2` - Start creating tokens from 2 characters ("ja", "sc", etc.) - `"max": 4` - Stop at 4 characters ("java", "scri", etc.) -- `"back": "false"` - Create tokens from the front (beginning) of words +- `"back": false` - Create tokens from the front (beginning) of words ### Step 2: Create Custom Analyzer diff --git a/docs/sort_facet.md b/docs/sort_facet.md index 8f1c2e460..ee2ba294d 100644 --- a/docs/sort_facet.md +++ b/docs/sort_facet.md @@ -172,8 +172,7 @@ No DocValues used No DocValues used DocValues used for field "dolor". Field Mapping for "dolor" may enable docValues. - DocValues used, for field "sit_amet". -Field Mapping for "sit_amet" may enable docValues. + DocValues used for field "dolor". Field Mapping for "dolor" may enable docValues. @@ -394,8 +393,8 @@ Enabling docValues for the fields associated with such facet requests might prov - DocValues used for field "dolor" and "lorem". Field Mapping for "dolor" and "lorem" may enable docValues. - DocValues used for field "dolor" and "ipsum". Field Mapping for "dolor" and "ipsum" may enable docValues. + DocValues used for fields "dolor" and "lorem". Field Mapping for "dolor" and "lorem" may enable docValues. + DocValues used for fields "dolor" and "ipsum". 
Field Mapping for "dolor" and "ipsum" may enable docValues. @@ -546,8 +545,8 @@ Document Format used for the test scenario:
 {
 	"dummyTerm":"Term",
-	"dummyDate":"2000-01-01T00:00:00,
-	"dummyNumber:1
+	"dummyDate":"2000-01-01T00:00:00",
+	"dummyNumber":1
 }
 			
@@ -555,8 +554,8 @@ Document Format used for the test scenario:
 {
 	"dummyTerm":"Term",
-	"dummyDate":"2000-01-01T01:00:00,
-	"dummyNumber:2
+	"dummyDate":"2000-01-01T01:00:00",
+	"dummyNumber":2
 }
 			
@@ -565,7 +564,7 @@ Document Format used for the test scenario:
 {
 	"dummyTerm":"Term",
 	"dummyDate":"2000-01-01T01:00:00"+(i hours),
-	"dummyNumber:i
+	"dummyNumber":i
 }
@@ -573,8 +572,8 @@ Document Format used for the test scenario:
 {
 	"dummyTerm":"Term",
-	"dummyDate":2000-01-01T01:00:00 + (5000 hours),
-	"dummyNumber:5000
+	"dummyDate":"2000-01-01T01:00:00 + (5000 hours)",
+	"dummyNumber":5000
 }
 			
@@ -775,7 +774,7 @@ Document Format used for the test scenario:
 27.034
-Even at this small scale, with a small document size and a very limited number of indexed documents, we still observe a noticeable tradeoff. With just a slight increase in the index size (an average of 7KB) we obtain a 20ms reduction in the total execution time, on average, for only 1000 queries.
+Even at this small scale, with a small document size and a very limited number of indexed documents, we still observe a noticeable tradeoff. With just a slight increase in the index size (an average of 7KB), we obtain a 20ms reduction in the total execution time, on average, for only 1000 queries.

Technical Information

diff --git a/docs/synonyms.md b/docs/synonyms.md index 0777debdf..4845b40e6 100644 --- a/docs/synonyms.md +++ b/docs/synonyms.md @@ -2,7 +2,7 @@ * *v2.5.0* (and after) will come with support for **synonym definition indexing and search**. * We've achieved this by embedding synonym indexes within our bleve (scorch) indexes. -* Usage of zap file format: [v16](https://github.com/blevesearch/zapx/blob/master/zap.md). Here we co-locate text, vector and synonym indexes as neighbors within segments, continuing to conform to the segmented architecture of *scorch*. +* Usage of zap file format: [v16](https://github.com/blevesearch/zapx/blob/master/zap.md). Here we co-locate text, vector, and synonym indexes as neighbors within segments, continuing to conform to the segmented architecture of *scorch*. ## Supported @@ -43,7 +43,7 @@ } ``` -* The addition of `Synonym Sources` in the index mapping enables associating a set of `synonym definitions` (called a `synonym collection`) with a specific analyzer. This allows for preprocessing of terms in both the *input* and *synonyms* lists before the synonym index is created. By using an analyzer, you can normalize or transform terms (e.g., case folding, stemming) to improve synonym matching. +* The addition of `Synonym Sources` in the index mapping enables associating a set of `synonym definitions` (called a `synonym collection`) with a specific analyzer. This allows for preprocessing of terms in both the "input" and "synonyms" lists before the synonym index is created. By using an analyzer, you can normalize or transform terms (e.g., case folding, stemming) to improve synonym matching. ```json { @@ -172,7 +172,7 @@ if err != nil { panic(err) } -// The search result will contain one match: "doc1". This document includes the term "hardworking", +// The search result will contain one match: "doc1". This document includes the term "hardworking", // which is a synonym for the queried term "persistent". 
The synonym relationship is based on // the user-defined thesaurus associated with the index. // Print the search results, which will include the explanation for the match. diff --git a/docs/vectors.md b/docs/vectors.md index fac830f35..add22fe55 100644 --- a/docs/vectors.md +++ b/docs/vectors.md @@ -1,13 +1,13 @@ # Approximate Nearest Neighbor Search (over vectors) -* *v2.4.0* (and after) will come with support for **vectors' indexing and search**. +* *v2.4.0* (and after) will come with support for **vector indexing and search**. * We've achieved this by embedding [FAISS](https://github.com/facebookresearch/faiss) indexes within our bleve (scorch) indexes. * Introduction of new zap file formats: [v16](https://github.com/blevesearch/zapx/blob/v16.x/zap.md), [v17](https://github.com/blevesearch/zapx/blob/master/zap.md) .. to co-locate text and vector indexes as neighbors within segments, continuing to conform to the segmented architecture of *scorch*. ## Pre-requisite(s) -* Induction of [FAISS](https://github.com/blevesearch/faiss) into our eco system, which is a fork of the original [facebookresearch/faiss](https://github.com/facebookresearch/faiss) -* FAISS is a C++ library that needs to be compiled and it's shared libraries need to be situated at an accessible path for your application. +* Induction of [FAISS](https://github.com/blevesearch/faiss) into our ecosystem, which is a fork of the original [facebookresearch/faiss](https://github.com/facebookresearch/faiss) +* FAISS is a C++ library that needs to be compiled and its shared libraries need to be situated at an accessible path for your application. * A `vectors` GO TAG needs to be set for bleve to access all the supporting code. This TAG must be set only after the FAISS shared library is made available. Failure to do either will inhibit you from using this feature. * Please follow these [instructions](#setup-instructions) below for any assistance in the area. 
* Releases of `blevesearch/bleve` work with select checkpoints of `blevesearch/faiss` owing to API changes and improvements (tracking over the `bleve` branch): @@ -45,7 +45,7 @@ * Vectors from documents that do not conform to the index mapping dimensionality are simply discarded at index time. * The dimensionality of the query vector must match the dimensionality of the indexed vectors to obtain any results. -* Pure kNN searches can be performed, but the `query` attribute within the search request must be set - to `{"match_none": {}}` in this case. The `query` attribute is made optional when `knn` is available with v2.4.1+. +* Pure kNN searches can be performed, but the `query` attribute within the search request must be set -- to `{"match_none": {}}` in this case. The `query` attribute is made optional when `knn` is available with v2.4.1+. * Hybrid searches are supported, where results from `query` are unioned with results from `knn`. The FTS scores from exact searches are simply summed with the similarity distances to determine the aggregate scores. ```text @@ -300,7 +300,7 @@ sudo cp build/c_api/libfaiss_c.so /usr/local/lib ### OSX -While you shouldn't need to do any different over osX x86_64, with aarch64 - some instructions need adjusting (see [facebookresearch/faiss#2111](https://github.com/facebookresearch/faiss/issues/2111)) .. +While you shouldn't need to do anything different for macOS x86_64, with aarch64 - some instructions need adjusting (see [facebookresearch/faiss#2111](https://github.com/facebookresearch/faiss/issues/2111)) .. ```shell LDFLAGS="-L/opt/homebrew/opt/llvm/lib" CPPFLAGS="-I/opt/homebrew/opt/llvm/include" CXX=/opt/homebrew/opt/llvm/bin/clang++ CC=/opt/homebrew/opt/llvm/bin/clang cmake -B build -DFAISS_ENABLE_GPU=OFF -DFAISS_ENABLE_C_API=ON -DBUILD_SHARED_LIBS=ON -DFAISS_ENABLE_PYTHON=OFF . 
diff --git a/go.mod b/go.mod index f9515e30c..b91f54359 100644 --- a/go.mod +++ b/go.mod @@ -7,7 +7,7 @@ require ( github.com/bits-and-blooms/bitset v1.24.2 github.com/blevesearch/bleve_index_api v1.3.11 github.com/blevesearch/geo v0.2.5 - github.com/blevesearch/go-faiss v1.0.36 + github.com/blevesearch/go-faiss v1.1.0 github.com/blevesearch/go-metrics v0.0.0-20201227073835-cf1acfcdf475 github.com/blevesearch/go-porterstemmer v1.0.3 github.com/blevesearch/goleveldb v1.0.1 @@ -25,7 +25,7 @@ require ( github.com/blevesearch/zapx/v14 v14.4.3 github.com/blevesearch/zapx/v15 v15.4.3 github.com/blevesearch/zapx/v16 v16.3.4 - github.com/blevesearch/zapx/v17 v17.1.1 + github.com/blevesearch/zapx/v17 v17.1.2 github.com/couchbase/moss v0.2.0 github.com/spf13/cobra v1.10.2 go.etcd.io/bbolt v1.4.0 diff --git a/go.sum b/go.sum index 9a57a9f50..c865d6b26 100644 --- a/go.sum +++ b/go.sum @@ -6,8 +6,8 @@ github.com/blevesearch/bleve_index_api v1.3.11 h1:x29vbV8OjWfLcrDVd7Lr1q+BkLNS0J github.com/blevesearch/bleve_index_api v1.3.11/go.mod h1:xvd48t5XMeeioWQ5/jZvgLrV98flT2rdvEJ3l/ki4Ko= github.com/blevesearch/geo v0.2.5 h1:yJg9FX1oRwLnjXSXF+ECHfXFTF4diF02Ca/qUGVjJhE= github.com/blevesearch/geo v0.2.5/go.mod h1:Jhq7WE2K6mJTx1xS44M2pUO6Io+wjCSHh1+co3YOgH4= -github.com/blevesearch/go-faiss v1.0.36 h1:qrP6LZX7xrQQ3pOF2B+t+5E+brlOzwQUzZrGLHz4IeU= -github.com/blevesearch/go-faiss v1.0.36/go.mod h1:OMGQwOaRRYxrmeNdMrXJPvVx8gBnvE5RYrr0BahNnkk= +github.com/blevesearch/go-faiss v1.1.0 h1:xM7Jc0ZUCv5lssG9Ohj3Jv0SdTpxcUABU1dDt9XVsc4= +github.com/blevesearch/go-faiss v1.1.0/go.mod h1:OMGQwOaRRYxrmeNdMrXJPvVx8gBnvE5RYrr0BahNnkk= github.com/blevesearch/go-metrics v0.0.0-20201227073835-cf1acfcdf475 h1:kDy+zgJFJJoJYBvdfBSiZYBbdsUL0XcjHYWezpQBGPA= github.com/blevesearch/go-metrics v0.0.0-20201227073835-cf1acfcdf475/go.mod h1:9eJDeqxJ3E7WnLebQUlPD7ZjSce7AnDb9vjGmMCbD0A= github.com/blevesearch/go-porterstemmer v1.0.3 h1:GtmsqID0aZdCSNiY8SkuPJ12pD4jI+DdXTAn4YRcHCo= @@ -45,8 +45,8 @@ 
github.com/blevesearch/zapx/v15 v15.4.3 h1:iJiMJOHrz216jyO6lS0m9RTCEkprUnzvqAI2l github.com/blevesearch/zapx/v15 v15.4.3/go.mod h1:1pssev/59FsuWcgSnTa0OeEpOzmhtmr/0/11H0Z8+Nw= github.com/blevesearch/zapx/v16 v16.3.4 h1:hDAqA8qusZTNbPEL7//w5P65UZ2de6yhSeUaTbp0Po0= github.com/blevesearch/zapx/v16 v16.3.4/go.mod h1:zqkPPqs9GS9FzVWzCO3Wf1X044yWAV17+4zb+FTiEHg= -github.com/blevesearch/zapx/v17 v17.1.1 h1:Ltal7LsjzRerUg4hqVgMruKj3BAse+rrrDTe+9epJ2k= -github.com/blevesearch/zapx/v17 v17.1.1/go.mod h1:AfYxjApHf7JpQdW4yzFGisSKIrdkPesFn4yJ3vKKPQE= +github.com/blevesearch/zapx/v17 v17.1.2 h1:avbOk2igaASNoiy0BE/jPgcxAnRI2PGeydeP4hg7Ikk= +github.com/blevesearch/zapx/v17 v17.1.2/go.mod h1:WQObxKrqUX7cd0G1GMvDfc/bmZzQvoy7APOPimx7DiI= github.com/couchbase/ghistogram v0.1.0 h1:b95QcQTCzjTUocDXp/uMgSNQi8oj1tGwnJ4bODWZnps= github.com/couchbase/ghistogram v0.1.0/go.mod h1:s1Jhy76zqfEecpNWJfWUiKZookAFaiGOEoyzgHt9i7k= github.com/couchbase/moss v0.2.0 h1:VCYrMzFwEryyhRSeI+/b3tRBSeTpi/8gn5Kf6dxqn+o=