feat: Improve Column Remapping Performance in COPY FROM#193
feat: Improve Column Remapping Performance in COPY FROM#193shirly121 wants to merge 6 commits intoalibaba:mainfrom
COPY FROM#193Conversation
…ojectColumns Introduce ProjectIntoDataSourceOptimizer that pushes Projection into LogicalTableFunctionCall when the pattern is Projection -> TableFunctionCall, enabling column remapping at the data source level. Also migrates the column pruning semantics from "columns to skip" to "columns to project" across the entire pipeline (reader, optimizer, planner, serialization). Key changes: - New ProjectIntoDataSourceOptimizer for projection pushdown and column reorder - Rename ReadSharedState::skipColumns -> projectColumns with updated semantics - Rename ArrowOptionsBuilder::skipColumns() -> projectColumns() - Add projectColumns to TableFuncBindData replacing deprecated columnSkips - Restore BATCH_READ=false safety in bind_load_from.cpp - Remove stale commented-out code - Update all tests to use new project-columns semantics Made-with: Cursor Committed-by: Xiaoli Zhou from Dev container Committed-by: Xiaoli Zhou from Dev container
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
Review Summary by QodoImprove Column Remapping Performance via Projection Pushdown Optimizer
WalkthroughsDescription• Introduce ProjectIntoDataSourceOptimizer for projection pushdown into data sources • Migrate column semantics from "skip" to "project" across entire pipeline • Rename skipColumns to projectColumns in ReadSharedState and related classes • Update protobuf field skip_columns to project_columns for semantic clarity • Restore BATCH_READ=false safety for non-COPY_FROM operations Diagramflowchart LR
A["Projection Operator"] -->|"ProjectIntoDataSourceOptimizer"| B["TableFunctionCall"]
B -->|"setProjectColumns"| C["BindData"]
C -->|"getProjectColumns"| D["Arrow Projection"]
E["skipColumns semantics"] -->|"migrate to"| F["projectColumns semantics"]
F -->|"used in"| G["Reader/Optimizer/Planner"]
File Changes1. include/neug/compiler/optimizer/project_into_data_source_optimizer.h
|
Code Review by Qodo
|
| // if there is a non-copy operation after LOAD FROM, batch_read=true will not | ||
| // work, batch_read=true will produce a special ArrowStreamContext, which can | ||
| // only be processed by copy from, other operators do not support it. | ||
| if (op->getNumChildren() > 0 && | ||
| op->getChild(0)->getOperatorType() == | ||
| LogicalOperatorType::TABLE_FUNCTION_CALL) { | ||
| auto tableFunc = op->getChild(0)->ptrCast<LogicalTableFunctionCall>(); | ||
| auto* scanBindData = | ||
| dynamic_cast<function::ScanFileBindData*>(tableFunc->getBindData()); | ||
| if (scanBindData != nullptr) { | ||
| auto& options = scanBindData->fileScanInfo.options; | ||
| if (op->getOperatorType() != LogicalOperatorType::COPY_FROM) { | ||
| options.insert_or_assign("BATCH_READ", | ||
| common::Value::createValue(false)); | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
2. Batch_read ignores nonzero child 🐞 Bug ☼ Reliability
ProjectIntoDataSourceOptimizer disables BATCH_READ only when TABLE_FUNCTION_CALL is op->getChild(0), so scans positioned in other child slots (e.g., HASH_JOIN build side child[1]) can keep batch_read=true and emit ArrowStreamContextColumn that most operators cannot process.
Agent Prompt
### Issue description
`ProjectIntoDataSourceOptimizer::visitOperator` only disables `BATCH_READ` when a `TABLE_FUNCTION_CALL` is in `child(0)`. Operators with scans in other child positions (e.g., join build side) can bypass the guard, leaving `batch_read=true` and producing `ArrowStreamContextColumn` in unsupported operator pipelines.
### Issue Context
`ArrowReader::batch_read` produces `ArrowStreamContextColumn`, which is not generally supported (see the ArrowStreamContextColumn comment).
### Fix Focus Areas
- src/compiler/optimizer/project_into_data_source_optimizer.cpp[63-88]
- include/neug/compiler/planner/operator/logical_hash_join.h[11-25]
### What to change
- In `visitOperator`, iterate over **all** children (not just index 0) and apply the same `ScanFileBindData` lookup and `BATCH_READ=false` insertion for any child that is a `TABLE_FUNCTION_CALL` under a non-`COPY_FROM` parent.
- Keep the existing `COPY_FROM` exception semantics.
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
| auto boundScanSource = bindFileScanSource( | ||
| *source, loadFrom.getParsingOptions(), columnNames, columnTypes); | ||
| auto& scanInfo = boundScanSource->constCast<BoundTableScanSource>().info; | ||
| auto& bindData = scanInfo.bindData->cast<function::ScanFileBindData>(); | ||
| auto& fileInfo = bindData.fileScanInfo; | ||
| // We support LOAD FROM by ArrowArrayContextColumn with BATCH_READ set to | ||
| // false. | ||
| fileInfo.options.insert({"BATCH_READ", common::Value::createValue(false)}); | ||
| boundLoadFrom = std::make_unique<BoundLoadFrom>(scanInfo.copy()); | ||
| } break; |
There was a problem hiding this comment.
3. Load from batch_read regression 🐞 Bug ☼ Reliability
bindLoadFrom no longer forces batch_read/BATCH_READ=false for file sources, and the new safety is implemented only in an optimizer pass gated by enablePlanOptimizer; with optimizer disabled, batch_read defaults to true and LOAD FROM can produce ArrowStreamContextColumn in unsupported pipelines.
Agent Prompt
### Issue description
The previous bind-time safety that forced `BATCH_READ/batch_read=false` for `LOAD FROM` file scans has been removed. The replacement guard lives in an optimizer pass that does not run when `enablePlanOptimizer=false`, allowing `batch_read` to fall back to its default (`true`) and produce `ArrowStreamContextColumn`.
### Issue Context
- `Optimizer::optimize` is gated by `enablePlanOptimizer`.
- `ReadOptions.batch_read` defaults to `true`.
- `ArrowReader::batch_read` emits `ArrowStreamContextColumn`, which is not generally supported.
### Fix Focus Areas
- src/compiler/binder/bind/read/bind_load_from.cpp[46-61]
- src/compiler/optimizer/optimizer.cpp[55-82]
- include/neug/utils/reader/options.h[122-126]
### What to change
- Reintroduce a bind-time override for `LOAD FROM` scans to force batch read off (e.g., set `BATCH_READ=false` / `batch_read=false` in the scan options map) so correctness does not depend on optimizer execution.
- Keep the optimizer-based guard as an additional safety net, but don’t rely on it for baseline correctness.
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
Committed-by: Xiaoli Zhou from Dev container
This reverts commit 9e0abe2. Committed-by: Xiaoli Zhou from Dev container
Committed-by: Xiaoli Zhou from Dev container
What do these changes do?
Introduce ProjectIntoDataSourceOptimizer that pushes Projection into LogicalTableFunctionCall when the pattern is Projection -> TableFunctionCall, enabling column remapping at the data source level. Also migrates the column pruning semantics from "columns to skip" to "columns to project" across the entire pipeline (reader, optimizer, planner, serialization).
Key changes:
Related issue number
Fixes #165