-
Notifications
You must be signed in to change notification settings - Fork 805
SOLR-18071 Support Stored Fields in Export Writer #4053
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
bf813b7 to
093af42
Compare
epugh
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this will scratch some of my own issues!
| } | ||
| SchemaField schemaField = req.getSchema().getField(field); | ||
| if (!schemaField.hasDocValues()) { | ||
| throw new IOException(schemaField + " must have DocValues to use this feature."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love this! I've been burned so many times on wanting to do this and discovering missing DocValues.
|
|
||
| @Override | ||
| public boolean write( | ||
| public int write( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went down thorugh the code, why the boolean -> int? Looks like we only ever return 0 or 1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I cover this a little in the second paragraph of #Solution. To elaborate, I introduced StoredFieldsWriter which is a special type of FieldWriter that writes many fields. I did this because it was the easiest way I saw to hook into the output of a StoredFieldVisitor , which writes multiple fields. Up until this point FieldWriter would only write one field at a time. It would return a boolean representing whether it successfully wrote the field. If the boolean was true an internal loop would increment the fieldIndex which then ... did absolutely nothing. So not entirely clear what the purpose is of keeping track of how many fields were written. But in order to maintain this potentially useful (although not currently useful) feature, I decided to expand the API to return an int which returns how many fields a particular FieldWriter wrote so we can increment by an arbitrary amount to support the new StoredFieldsWriter . I would be equally happy to just make FieldWriter::write be void instead of tracking this unused index.
I also realize you may have missed StoredFieldsWriter because I put the class inside the ExportWriter. I think I am going to put it in a standalone file like the other FieldWriters so it is more consistent and clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be equally happy to just make FieldWriter::write be void instead of tracking this unused index.
+1
Then you could retract my comment on javadocs for write
| @Test | ||
| public void testIncludeStoredFieldsExplicitRequest() throws Exception { | ||
| // Test that stored-only fields are returned when includeStoredFields=true | ||
| clearIndex(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe not speciifc to your goal, but should clearIndex() be in a before method/setup step? If it's on every unit test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I leave it to your judgement, should this be in this PR or a follow-up? Honestly, I just copied the pattern of the existing tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably should be in a follow up, and I suspect that this pattern is in many tests, that all should be cleaned up to fix this bad pattern and establish a better new one. (off topic, but having a prompts.md that captures this community knowlege for AI is also good for humans).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
being consistent with nearby tests in a single source file (aka suite) makes sense.
| An optional parameter `batchSize` determines the size of the internal buffers for partial results. | ||
| The default value is `30000` but users may want to specify smaller values to limit the memory use (at the cost of degraded performance) or higher values to improve export performance (the relationship is not linear and larger values don't bring proportionally larger performance increases). | ||
|
|
||
| An optional parameter `includeStoredFields` (default `false`) enables exporting fields that only have stored values (no docValues). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some day we'll have a way of thinking about should this be "true" ;-). Challenge of open source is we struggle to know what hsould be a default yes even when it has knock on impacts until a lot of time passes. ;-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I will add an explanation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, can I put the explanation in the CHANGELOG? The docs already warn:
Note that retrieving stored fields may significantly impact export performance compared to docValues fields, as stored fields require additional I/O operations.
9ff0b20 to
d510972
Compare
|
Just a quick observation -- I see a force push that came after Eric's review. I strongly recommend against force pushing because it resets the review state for anyone who has already reviewed -- Eric in this case. |
dsmiley
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like how overall simple the implementation turned out to be. This wasn't a heavy lift. Nice job!
Curious... at this point, how do you think /export contrasts with /select + cursorMark? Pros/cons... (this could be documented in the ref guide as well to durably record these insights)
| import org.apache.lucene.index.LeafReaderContext; | ||
| import org.apache.solr.common.MapWriter; | ||
|
|
||
| abstract class FieldWriter { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
deserves javadoc, at least for clarifying it's return semantics
| if (map == null) { | ||
| map = new WeakHashMap<>(); | ||
| storedFieldsMap.set(map); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this pattern can be improved to basically be handled at the ThreadLocal declaration to provide an initializer.
But moreover I'm concerned about the use of ThreadLocal in the first place -- it's typically a tool of last resort. And further true with use of weak references.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW I'm fine with your use of it here but was speaking out-loud, maybe a little too out-loud :-)
|
|
||
| class StoredFieldsWriter extends FieldWriter { | ||
|
|
||
| private final Map<String, SchemaField> fields; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I'd prefer this be named schema or something including that word
| if (fieldType instanceof BoolField) { | ||
| // Convert "T"/"F" stored value to boolean true/false | ||
| addField(fieldInfo.name, Boolean.valueOf(fieldType.indexedToReadable(value))); | ||
| } else { | ||
| addField(fieldInfo.name, value); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when I see special cases like this, I ask myself... is there a fieldType method that should be handling this? If not do we need to add one?
CC @hossman if you are interested in review; you've looked at this topic in a nearby issue lately
| class StoredFieldsWriter extends FieldWriter { | ||
|
|
||
| private final Map<String, SchemaField> fields; | ||
| private final ThreadLocal<WeakHashMap<IndexReader.CacheKey, StoredFields>> storedFieldsMap = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ThreadLocals are typically static. In this case, I think it should be because the information in it isn't specific to this instance of StoredFieldsWriter in any way that I can see. Java static analyzers including built into IntelliJ will generally advise you on this point.
|
|
||
| assertJsonEquals( | ||
| resp, | ||
| "{\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you didn't get the memo -- we're on Java 21 now :-). multi-line string please
| // Check if field can use DocValues | ||
| boolean canUseDocValues = | ||
| schemaField.hasDocValues() | ||
| && (!(fieldType instanceof SortableTextField) || schemaField.useDocValuesAsStored()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there precedent for this special case RE instanceof SortableTextField?
| return fields.containsKey(fieldInfo.name) ? Status.YES : Status.NO; | ||
| } | ||
|
|
||
| private <T> void addField(String fieldName, T value) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the logic flow is confusing... I started to write why it wouldn't work but end up finally contorting myself to see how it could work. Could you try to reflow it somehow? Or if you like it's flow then add comments? (or both)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You've never seen this pattern before where you have to track state of multi-valued fields in a callback-style API where the caller sends one element at a time? It's not the greatest but it's what I'm dealing with here. The flush here can be thought of as being akin to, say, Jackson's JsonGenerator::writeEndObject. The look-behind is because I need to buffer and I don't get a signal on the last element that it is the last element. I don't really control that. Nota bene something similar happens in Solr here although I suppose it is less-contortion inducing because it happens all in one function. But this is hard to do here given the API.
One thing I can do is buffer into a multi-map-like thing instead of rolling my own StoredFieldVisitor, i.e. I can buffer into a Document (basically a multi-map), i.e. maybe with DocumentStoredFieldVisitor. I'd need to convert the multi-map thing (i.e. Document) to a collection-property-aware thing like SolrDocument, perhaps using the same convertLuceneDocToSolrDoc I just linked. But that is a lot of transformation so I am pretty neutral about it.
Maybe you can elaborate on what is confusing you? Also, any thoughts on the alternatives I just suggested would be welcome.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have; I'm saying nonetheless I had some trouble reading it how you coded it. It's subjective; maybe "i'm the problem". In particular, I was surprised to see the very first bit of logic was to check if the current field is multi-valued... instead I expected it to check if we're adding to an existing field or starting a new field. There's no right answer here... it's subjective as to who finds what easier to read. So updated the logic and added a bit more comments. LMK what you think. If you don't like it than I'll refer it back and maybe you add more comments.
(I don't like the alternative suggestion of incorporating use of Document; seems quite unnecessary).
| } | ||
|
|
||
| private int flush() throws IOException { | ||
| if (lastFieldName != null && multiValue != null && !multiValue.isEmpty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the logic should be improved to read super simple here. Like, if multiValue is not null then put it -- that's it. No lastFieldName check, no isEmpty check.
|
|
||
| @Override | ||
| public boolean write( | ||
| public int write( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be equally happy to just make FieldWriter::write be void instead of tracking this unused index.
+1
Then you could retract my comment on javadocs for write
https://issues.apache.org/jira/browse/SOLR-18071
Description
Adds support for exporting stored-only fields (fields without docValues) in the
/exportrequest handler via a newincludeStoredFieldsparameter. Previously, all fields in the field list (fl) were required to have docValues enabled. This change allows users to include stored fields that don't have docValues, which can be useful, i.e. when exporting fields which don't support docValues or trying to export data that has already been indexed without DVs.Solution
If fl explicitly names a stored-only field and includeStoredFields is not enabled, the request fails with a 400 and a hint to add includeStoredFields=true. For glob patterns (e.g., fl=* or fl=intdv,*), stored-only fields are skipped unless includeStoredFields=true, to preserve backward compatibility. The current implementation fetches from
StoredFieldsDV-enabled fields when some stored fields have already been requested. This avoids DV-lookup for a field, which makes sense since we have to parse the StoredFields anyway. My (somewhat limited) benchmarks appear to corroborate that this is the best choice for performance.A quirky thing about this implementation is that the very much internal
FieldWriterAPI was changed to support more than one field. This makes it more interchangeable with the existingStoredFieldVisitorinterface, which assumes one visitor per many fields. I landed on this rather than creating an adapter to bridge the two as it appeared to be simpler. It's worth stressing again that theFieldWriteris very much internal to the export package and the boolean it returned was effectively discarded (the localfieldIndexit drives isn't even used anywhere). It could be argued thatFieldWriter::writecould bevoidand I'd also be open to such a change.Tests
Also have some performance comparisons of exporting vs not exporting stored fields:
stored-fields-export-writer-1k-doc-benchmark.txt
stored-fields-export-writer-140k-doc-benchmark.txt
Checklist
Please review the following and check all that apply:
mainbranch../gradlew check.