Skip to content

Conversation

@kotman12
Copy link
Contributor

https://issues.apache.org/jira/browse/SOLR-18071

Description

Adds support for exporting stored-only fields (fields without docValues) in the /export request handler via a new includeStoredFields parameter. Previously, all fields in the field list (fl) were required to have docValues enabled. This change allows users to include stored fields that don't have docValues, which can be useful, i.e. when exporting fields which don't support docValues or trying to export data that has already been indexed without DVs.

Solution

If fl explicitly names a stored-only field and includeStoredFields is not enabled, the request fails with a 400 and a hint to add includeStoredFields=true. For glob patterns (e.g., fl=* or fl=intdv,*), stored-only fields are skipped unless includeStoredFields=true, to preserve backward compatibility. The current implementation fetches from StoredFields DV-enabled fields when some stored fields have already been requested. This avoids DV-lookup for a field, which makes sense since we have to parse the StoredFields anyway. My (somewhat limited) benchmarks appear to corroborate that this is the best choice for performance.

A quirky thing about this implementation is that the very much internal FieldWriter API was changed to support more than one field. This makes it more interchangeable with the existing StoredFieldVisitor interface, which assumes one visitor per many fields. I landed on this rather than creating an adapter to bridge the two as it appeared to be simpler. It's worth stressing again that the FieldWriter is very much internal to the export package and the boolean it returned was effectively discarded (the local fieldIndex it drives isn't even used anywhere). It could be argued that FieldWriter::write could be void and I'd also be open to such a change.

Tests

  • Explicit stored-only field export succeeds with includeStoredFields=true (single-valued and multi-valued).
  • Explicit stored-only field export fails without the parameter and includes the includeStoredFields=true hint and field name.
  • Glob fl skips stored-only fields without the parameter (request succeeds, stored-only fields not present).
  • Glob fl includes stored-only fields with the parameter.
  • Coverage for stored field types: string, int, long, float, double, boolean, date.

Also have some performance comparisons of exporting vs not exporting stored fields:
stored-fields-export-writer-1k-doc-benchmark.txt
stored-fields-export-writer-140k-doc-benchmark.txt

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended, not available for branches on forks living under an organisation)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide
  • I have added a changelog entry for my change

@github-actions github-actions bot added documentation Improvements or additions to documentation tests labels Jan 15, 2026
@kotman12 kotman12 force-pushed the stored-fields-export-writer branch from bf813b7 to 093af42 Compare January 15, 2026 23:01
Copy link
Contributor

@epugh epugh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will scratch some of my own issues!

}
SchemaField schemaField = req.getSchema().getField(field);
if (!schemaField.hasDocValues()) {
throw new IOException(schemaField + " must have DocValues to use this feature.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love this! I've been burned so many times on wanting to do this and discovering missing DocValues.


@Override
public boolean write(
public int write(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went down thorugh the code, why the boolean -> int? Looks like we only ever return 0 or 1?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cover this a little in the second paragraph of #Solution. To elaborate, I introduced StoredFieldsWriter which is a special type of FieldWriter that writes many fields. I did this because it was the easiest way I saw to hook into the output of a StoredFieldVisitor , which writes multiple fields. Up until this point FieldWriter would only write one field at a time. It would return a boolean representing whether it successfully wrote the field. If the boolean was true an internal loop would increment the fieldIndex which then ... did absolutely nothing. So not entirely clear what the purpose is of keeping track of how many fields were written. But in order to maintain this potentially useful (although not currently useful) feature, I decided to expand the API to return an int which returns how many fields a particular FieldWriter wrote so we can increment by an arbitrary amount to support the new StoredFieldsWriter . I would be equally happy to just make FieldWriter::write be void instead of tracking this unused index.

I also realize you may have missed StoredFieldsWriter because I put the class inside the ExportWriter. I think I am going to put it in a standalone file like the other FieldWriters so it is more consistent and clear.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be equally happy to just make FieldWriter::write be void instead of tracking this unused index.

+1

Then you could retract my comment on javadocs for write

@Test
public void testIncludeStoredFieldsExplicitRequest() throws Exception {
// Test that stored-only fields are returned when includeStoredFields=true
clearIndex();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe not speciifc to your goal, but should clearIndex() be in a before method/setup step? If it's on every unit test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I leave it to your judgement, should this be in this PR or a follow-up? Honestly, I just copied the pattern of the existing tests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably should be in a follow up, and I suspect that this pattern is in many tests, that all should be cleaned up to fix this bad pattern and establish a better new one. (off topic, but having a prompts.md that captures this community knowlege for AI is also good for humans).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

being consistent with nearby tests in a single source file (aka suite) makes sense.

An optional parameter `batchSize` determines the size of the internal buffers for partial results.
The default value is `30000` but users may want to specify smaller values to limit the memory use (at the cost of degraded performance) or higher values to improve export performance (the relationship is not linear and larger values don't bring proportionally larger performance increases).

An optional parameter `includeStoredFields` (default `false`) enables exporting fields that only have stored values (no docValues).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some day we'll have a way of thinking about should this be "true" ;-). Challenge of open source is we struggle to know what hsould be a default yes even when it has knock on impacts until a lot of time passes. ;-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I will add an explanation.

Copy link
Contributor Author

@kotman12 kotman12 Jan 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, can I put the explanation in the CHANGELOG? The docs already warn:

Note that retrieving stored fields may significantly impact export performance compared to docValues fields, as stored fields require additional I/O operations.

@kotman12 kotman12 force-pushed the stored-fields-export-writer branch from 9ff0b20 to d510972 Compare January 16, 2026 20:22
@dsmiley
Copy link
Contributor

dsmiley commented Jan 17, 2026

Just a quick observation -- I see a force push that came after Eric's review. I strongly recommend against force pushing because it resets the review state for anyone who has already reviewed -- Eric in this case.

Copy link
Contributor

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like how overall simple the implementation turned out to be. This wasn't a heavy lift. Nice job!

Curious... at this point, how do you think /export contrasts with /select + cursorMark? Pros/cons... (this could be documented in the ref guide as well to durably record these insights)

import org.apache.lucene.index.LeafReaderContext;
import org.apache.solr.common.MapWriter;

abstract class FieldWriter {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deserves javadoc, at least for clarifying it's return semantics

Comment on lines +52 to +55
if (map == null) {
map = new WeakHashMap<>();
storedFieldsMap.set(map);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this pattern can be improved to basically be handled at the ThreadLocal declaration to provide an initializer.

But moreover I'm concerned about the use of ThreadLocal in the first place -- it's typically a tool of last resort. And further true with use of weak references.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW I'm fine with your use of it here but was speaking out-loud, maybe a little too out-loud :-)


class StoredFieldsWriter extends FieldWriter {

private final Map<String, SchemaField> fields;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I'd prefer this be named schema or something including that word

Comment on lines +82 to +87
if (fieldType instanceof BoolField) {
// Convert "T"/"F" stored value to boolean true/false
addField(fieldInfo.name, Boolean.valueOf(fieldType.indexedToReadable(value)));
} else {
addField(fieldInfo.name, value);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when I see special cases like this, I ask myself... is there a fieldType method that should be handling this? If not do we need to add one?
CC @hossman if you are interested in review; you've looked at this topic in a nearby issue lately

class StoredFieldsWriter extends FieldWriter {

private final Map<String, SchemaField> fields;
private final ThreadLocal<WeakHashMap<IndexReader.CacheKey, StoredFields>> storedFieldsMap =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ThreadLocals are typically static. In this case, I think it should be because the information in it isn't specific to this instance of StoredFieldsWriter in any way that I can see. Java static analyzers including built into IntelliJ will generally advise you on this point.


assertJsonEquals(
resp,
"{\n"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you didn't get the memo -- we're on Java 21 now :-). multi-line string please

// Check if field can use DocValues
boolean canUseDocValues =
schemaField.hasDocValues()
&& (!(fieldType instanceof SortableTextField) || schemaField.useDocValuesAsStored());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there precedent for this special case RE instanceof SortableTextField?

return fields.containsKey(fieldInfo.name) ? Status.YES : Status.NO;
}

private <T> void addField(String fieldName, T value) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the logic flow is confusing... I started to write why it wouldn't work but end up finally contorting myself to see how it could work. Could you try to reflow it somehow? Or if you like it's flow then add comments? (or both)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You've never seen this pattern before where you have to track state of multi-valued fields in a callback-style API where the caller sends one element at a time? It's not the greatest but it's what I'm dealing with here. The flush here can be thought of as being akin to, say, Jackson's JsonGenerator::writeEndObject. The look-behind is because I need to buffer and I don't get a signal on the last element that it is the last element. I don't really control that. Nota bene something similar happens in Solr here although I suppose it is less-contortion inducing because it happens all in one function. But this is hard to do here given the API.

One thing I can do is buffer into a multi-map-like thing instead of rolling my own StoredFieldVisitor, i.e. I can buffer into a Document (basically a multi-map), i.e. maybe with DocumentStoredFieldVisitor. I'd need to convert the multi-map thing (i.e. Document) to a collection-property-aware thing like SolrDocument, perhaps using the same convertLuceneDocToSolrDoc I just linked. But that is a lot of transformation so I am pretty neutral about it.

Maybe you can elaborate on what is confusing you? Also, any thoughts on the alternatives I just suggested would be welcome.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have; I'm saying nonetheless I had some trouble reading it how you coded it. It's subjective; maybe "i'm the problem". In particular, I was surprised to see the very first bit of logic was to check if the current field is multi-valued... instead I expected it to check if we're adding to an existing field or starting a new field. There's no right answer here... it's subjective as to who finds what easier to read. So updated the logic and added a bit more comments. LMK what you think. If you don't like it than I'll refer it back and maybe you add more comments.

(I don't like the alternative suggestion of incorporating use of Document; seems quite unnecessary).

}

private int flush() throws IOException {
if (lastFieldName != null && multiValue != null && !multiValue.isEmpty()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the logic should be improved to read super simple here. Like, if multiValue is not null then put it -- that's it. No lastFieldName check, no isEmpty check.


@Override
public boolean write(
public int write(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be equally happy to just make FieldWriter::write be void instead of tracking this unused index.

+1

Then you could retract my comment on javadocs for write

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants