
feat: Improve Column Remapping Performance in COPY FROM#193

Open
shirly121 wants to merge 6 commits into alibaba:main from shirly121:fix_copy_perf

Conversation

@shirly121
Collaborator

What do these changes do?

Introduce ProjectIntoDataSourceOptimizer, which pushes a Projection into LogicalTableFunctionCall when the plan matches the pattern Projection -> TableFunctionCall, enabling column remapping at the data source level. It also migrates the column-pruning semantics from "columns to skip" to "columns to project" across the entire pipeline (reader, optimizer, planner, serialization).

Key changes:

  • New ProjectIntoDataSourceOptimizer for projection pushdown and column reorder
  • Rename ReadSharedState::skipColumns -> projectColumns with updated semantics
  • Rename ArrowOptionsBuilder::skipColumns() -> projectColumns()
  • Add projectColumns to TableFuncBindData replacing deprecated columnSkips
  • Restore BATCH_READ=false safety in bind_load_from.cpp
  • Remove stale commented-out code
  • Update all tests to use new project-columns semantics

Related issue number

Fixes #165

Made-with: Cursor

Committed-by: Xiaoli Zhou from Dev container



@qodo-code-review

Review Summary by Qodo

Improve Column Remapping Performance via Projection Pushdown Optimizer

✨ Enhancement


Walkthroughs

Description
• Introduce ProjectIntoDataSourceOptimizer for projection pushdown into data sources
• Migrate column semantics from "skip" to "project" across entire pipeline
• Rename skipColumns to projectColumns in ReadSharedState and related classes
• Update protobuf field skip_columns to project_columns for semantic clarity
• Restore BATCH_READ=false safety for non-COPY_FROM operations
Diagram
flowchart LR
  A["Projection Operator"] -->|"ProjectIntoDataSourceOptimizer"| B["TableFunctionCall"]
  B -->|"setProjectColumns"| C["BindData"]
  C -->|"getProjectColumns"| D["Arrow Projection"]
  E["skipColumns semantics"] -->|"migrate to"| F["projectColumns semantics"]
  F -->|"used in"| G["Reader/Optimizer/Planner"]


File Changes

1. include/neug/compiler/optimizer/project_into_data_source_optimizer.h ✨ Enhancement +46/-0

New optimizer for projection pushdown into data sources



2. src/compiler/optimizer/project_into_data_source_optimizer.cpp ✨ Enhancement +153/-0

Implementation of projection pushdown optimizer logic



3. include/neug/compiler/function/table/bind_data.h ✨ Enhancement +8/-5

Rename columnSkips to projectColumns with new semantics



4. src/compiler/function/table/bind_data.cpp ✨ Enhancement +0/-12

Remove deprecated getColumnSkips method implementation



5. include/neug/compiler/planner/operator/logical_table_function_call.h ✨ Enhancement +2/-2

Update method signature from setColumnSkips to setProjectColumns



6. src/compiler/planner/operator/logical_table_function_call.cpp ✨ Enhancement +18/-14

Refactor schema computation to use projectColumns semantics



7. src/compiler/optimizer/projection_push_down_optimizer.cpp ✨ Enhancement +12/-8

Update to use projectColumns instead of columnSkips boolean vector



8. src/compiler/optimizer/optimizer.cpp ✨ Enhancement +4/-0

Integrate ProjectIntoDataSourceOptimizer into optimization pipeline



9. src/compiler/gopt/g_query_converter.cpp ✨ Enhancement +4/-9

Update protobuf serialization to use project_columns field



10. src/compiler/binder/bind/read/bind_load_from.cpp ✨ Enhancement +1/-5

Remove manual BATCH_READ=false setting from bind phase



11. include/neug/utils/reader/reader.h ✨ Enhancement +5/-13

Rename skipColumns to projectColumns in ReadSharedState



12. src/utils/reader/options.cc ✨ Enhancement +10/-17

Rename skipColumns method to projectColumns with updated logic



13. include/neug/utils/reader/options.h 📝 Documentation +12/-11

Update documentation and method signatures for projectColumns



14. src/utils/reader/reader.cc ✨ Enhancement +1/-1

Update method call from skipColumns to projectColumns



15. src/execution/execute/ops/batch/data_source.cc ✨ Enhancement +5/-3

Update deserialization to use project_columns field



16. src/execution/execute/ops/retrieve/sink.cc ✨ Enhancement +7/-1

Add dummy_sink helper for load operations handling



17. proto/cypher_dml.proto ✨ Enhancement +1/-1

Rename skip_columns field to project_columns in DataSource message



18. src/compiler/optimizer/CMakeLists.txt ⚙️ Configuration changes +2/-1

Add new optimizer source file to build configuration



19. extension/json/tests/json_test.cpp 🧪 Tests +3/-2

Update test to use projectColumns parameter semantics



20. extension/parquet/tests/parquet_test.cpp 🧪 Tests +8/-8

Update column pruning tests to use projectColumns semantics



21. tests/utils/test_reader.h 🧪 Tests +2/-2

Update test helper to use projectColumns parameter



22. tests/utils/test_reader.cc 🧪 Tests +12/-8

Update reader tests to use projectColumns instead of skipColumns





@qodo-code-review

qodo-code-review bot commented Apr 9, 2026

Code Review by Qodo

🐞 Bugs (2)   📘 Rule violations (0)   📎 Requirement gaps (0)   🎨 UX Issues (0)
☼ Reliability (2)



Action required

1. Golden JSON field mismatch 🐞
Description
DataSource field rename to project_columns changes protobuf JSON output, but existing DML
physical-plan golden files still contain "skip_columns", so verifyPhysicalByJson snapshot
comparisons will fail.
Code

proto/cypher_dml.proto[R27-31]

   string extension_name = 1;
   EntrySchema entry_schema = 4;
   FileSchema file_schema = 5;
-    repeated string skip_columns = 6;
+    repeated string project_columns = 6;
   common.Expression skip_rows = 7;
Evidence
Physical plan JSON printing preserves proto field names, so the rename from skip_columns to
project_columns changes the emitted JSON key. The checked-in golden JSON snapshots still use
"skip_columns", so equality assertions will mismatch.

proto/cypher_dml.proto[25-32]
tests/compiler/gopt_test.h[55-80]
tests/compiler/resources/dml_test/COPY_PERSON_physical[38-43]
tests/compiler/resources/dml_test/COPY_KNOWS_physical[31-37]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The protobuf field rename (`skip_columns` -> `project_columns`) changes the JSON key emitted by `MessageToJsonString` (because `preserve_proto_field_names=true`). Golden physical-plan JSON snapshots that still contain `"skip_columns"` will no longer match.
### Issue Context
`VerifyFactory::verifyPhysicalByJson` compares the actual printed plan JSON against JSON resources under `tests/compiler/resources/...`.
### Fix Focus Areas
- proto/cypher_dml.proto[25-32]
- tests/compiler/gopt_test.h[55-80]
- tests/compiler/resources/dml_test/COPY_PERSON_physical[38-43]
- tests/compiler/resources/dml_test/COPY_KNOWS_physical[31-37]
### What to change
- Update (or regenerate) the affected golden JSON resources, replacing `"skip_columns": [...]` with `"project_columns": [...]` (keeping semantics consistent with the new meaning: included/projection columns).



2. BATCH_READ ignores nonzero child 🐞
Description
ProjectIntoDataSourceOptimizer disables BATCH_READ only when TABLE_FUNCTION_CALL is op->getChild(0),
so scans positioned in other child slots (e.g., HASH_JOIN build side child[1]) can keep
batch_read=true and emit ArrowStreamContextColumn that most operators cannot process.
Code

src/compiler/optimizer/project_into_data_source_optimizer.cpp[R71-87]

+  // if there is a non-copy operation after LOAD FROM, batch_read=true will not
+  // work, batch_read=true will produce a special ArrowStreamContext, which can
+  // only be processed by copy from, other operators do not support it.
+  if (op->getNumChildren() > 0 &&
+      op->getChild(0)->getOperatorType() ==
+          LogicalOperatorType::TABLE_FUNCTION_CALL) {
+    auto tableFunc = op->getChild(0)->ptrCast<LogicalTableFunctionCall>();
+    auto* scanBindData =
+        dynamic_cast<function::ScanFileBindData*>(tableFunc->getBindData());
+    if (scanBindData != nullptr) {
+      auto& options = scanBindData->fileScanInfo.options;
+      if (op->getOperatorType() != LogicalOperatorType::COPY_FROM) {
+        options.insert_or_assign("BATCH_READ",
+                                 common::Value::createValue(false));
+      }
+    }
+  }
Evidence
The optimizer’s safety check hard-codes child(0). LogicalHashJoin explicitly models the build side
as children[1], so a file scan used there can bypass this guard. When batch_read is enabled,
ArrowReader builds ArrowStreamContextColumn, which is documented as only workable with a narrow set
of operators, making bypasses likely to crash or misbehave at runtime.

src/compiler/optimizer/project_into_data_source_optimizer.cpp[63-88]
include/neug/compiler/planner/operator/logical_hash_join.h[11-25]
src/utils/reader/reader.cc[60-70]
src/utils/reader/reader.cc[180-221]
include/neug/execution/common/columns/arrow_context_column.h[124-147]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`ProjectIntoDataSourceOptimizer::visitOperator` only disables `BATCH_READ` when a `TABLE_FUNCTION_CALL` is in `child(0)`. Operators with scans in other child positions (e.g., join build side) can bypass the guard, leaving `batch_read=true` and producing `ArrowStreamContextColumn` in unsupported operator pipelines.
### Issue Context
`ArrowReader::batch_read` produces `ArrowStreamContextColumn`, which is not generally supported (see the ArrowStreamContextColumn comment).
### Fix Focus Areas
- src/compiler/optimizer/project_into_data_source_optimizer.cpp[63-88]
- include/neug/compiler/planner/operator/logical_hash_join.h[11-25]
### What to change
- In `visitOperator`, iterate over **all** children (not just index 0) and apply the same `ScanFileBindData` lookup and `BATCH_READ=false` insertion for any child that is a `TABLE_FUNCTION_CALL` under a non-`COPY_FROM` parent.
- Keep the existing `COPY_FROM` exception semantics.



3. LOAD FROM batch_read regression 🐞
Description
bindLoadFrom no longer forces batch_read/BATCH_READ=false for file sources, and the new safety is
implemented only in an optimizer pass gated by enablePlanOptimizer; with optimizer disabled,
batch_read defaults to true and LOAD FROM can produce ArrowStreamContextColumn in unsupported
pipelines.
Code

src/compiler/binder/bind/read/bind_load_from.cpp[R57-61]

   auto boundScanSource = bindFileScanSource(
       *source, loadFrom.getParsingOptions(), columnNames, columnTypes);
   auto& scanInfo = boundScanSource->constCast<BoundTableScanSource>().info;
-    auto& bindData = scanInfo.bindData->cast<function::ScanFileBindData>();
-    auto& fileInfo = bindData.fileScanInfo;
-    // We support LOAD FROM by ArrowArrayContextColumn with BATCH_READ set to
-    // false.
-    fileInfo.options.insert({"BATCH_READ", common::Value::createValue(false)});
   boundLoadFrom = std::make_unique<BoundLoadFrom>(scanInfo.copy());
 } break;
Evidence
Optimizer rewrites (including the new BATCH_READ=false insertion) only run when
enablePlanOptimizer=true. ReadOptions.batch_read defaults to true, and ArrowReader chooses
batch_read mode based on that option; in batch mode it emits ArrowStreamContextColumn, which is
documented as not broadly supported. Since bindLoadFrom no longer sets BATCH_READ=false at
bind-time, disabling the optimizer reintroduces the unsafe default.

src/compiler/binder/bind/read/bind_load_from.cpp[46-61]
src/compiler/optimizer/optimizer.cpp[55-82]
include/neug/utils/reader/options.h[122-126]
src/utils/reader/reader.cc[60-70]
src/utils/reader/reader.cc[180-221]
include/neug/execution/common/columns/arrow_context_column.h[124-147]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The previous bind-time safety that forced `BATCH_READ/batch_read=false` for `LOAD FROM` file scans has been removed. The replacement guard lives in an optimizer pass that does not run when `enablePlanOptimizer=false`, allowing `batch_read` to fall back to its default (`true`) and produce `ArrowStreamContextColumn`.
### Issue Context
- `Optimizer::optimize` is gated by `enablePlanOptimizer`.
- `ReadOptions.batch_read` defaults to `true`.
- `ArrowReader::batch_read` emits `ArrowStreamContextColumn`, which is not generally supported.
### Fix Focus Areas
- src/compiler/binder/bind/read/bind_load_from.cpp[46-61]
- src/compiler/optimizer/optimizer.cpp[55-82]
- include/neug/utils/reader/options.h[122-126]
### What to change
- Reintroduce a bind-time override for `LOAD FROM` scans to force batch read off (e.g., set `BATCH_READ=false` / `batch_read=false` in the scan options map) so correctness does not depend on optimizer execution.
- Keep the optimizer-based guard as an additional safety net, but don’t rely on it for baseline correctness.







Comment on lines +71 to +87
// if there is a non-copy operation after LOAD FROM, batch_read=true will not
// work, batch_read=true will produce a special ArrowStreamContext, which can
// only be processed by copy from, other operators do not support it.
if (op->getNumChildren() > 0 &&
    op->getChild(0)->getOperatorType() ==
        LogicalOperatorType::TABLE_FUNCTION_CALL) {
  auto tableFunc = op->getChild(0)->ptrCast<LogicalTableFunctionCall>();
  auto* scanBindData =
      dynamic_cast<function::ScanFileBindData*>(tableFunc->getBindData());
  if (scanBindData != nullptr) {
    auto& options = scanBindData->fileScanInfo.options;
    if (op->getOperatorType() != LogicalOperatorType::COPY_FROM) {
      options.insert_or_assign("BATCH_READ",
                               common::Value::createValue(false));
    }
  }
}


Action required

2. BATCH_READ ignores nonzero child 🐞 Bug ☼ Reliability

ProjectIntoDataSourceOptimizer disables BATCH_READ only when TABLE_FUNCTION_CALL is op->getChild(0),
so scans positioned in other child slots (e.g., HASH_JOIN build side child[1]) can keep
batch_read=true and emit ArrowStreamContextColumn that most operators cannot process.
Agent Prompt
### Issue description
`ProjectIntoDataSourceOptimizer::visitOperator` only disables `BATCH_READ` when a `TABLE_FUNCTION_CALL` is in `child(0)`. Operators with scans in other child positions (e.g., join build side) can bypass the guard, leaving `batch_read=true` and producing `ArrowStreamContextColumn` in unsupported operator pipelines.

### Issue Context
`ArrowReader::batch_read` produces `ArrowStreamContextColumn`, which is not generally supported (see the ArrowStreamContextColumn comment).

### Fix Focus Areas
- src/compiler/optimizer/project_into_data_source_optimizer.cpp[63-88]
- include/neug/compiler/planner/operator/logical_hash_join.h[11-25]

### What to change
- In `visitOperator`, iterate over **all** children (not just index 0) and apply the same `ScanFileBindData` lookup and `BATCH_READ=false` insertion for any child that is a `TABLE_FUNCTION_CALL` under a non-`COPY_FROM` parent.
- Keep the existing `COPY_FROM` exception semantics.


Comment on lines 57 to 61
  auto boundScanSource = bindFileScanSource(
      *source, loadFrom.getParsingOptions(), columnNames, columnTypes);
  auto& scanInfo = boundScanSource->constCast<BoundTableScanSource>().info;
  auto& bindData = scanInfo.bindData->cast<function::ScanFileBindData>();
  auto& fileInfo = bindData.fileScanInfo;
  // We support LOAD FROM by ArrowArrayContextColumn with BATCH_READ set to
  // false.
  fileInfo.options.insert({"BATCH_READ", common::Value::createValue(false)});
  boundLoadFrom = std::make_unique<BoundLoadFrom>(scanInfo.copy());
} break;


Action required

3. LOAD FROM batch_read regression 🐞 Bug ☼ Reliability

bindLoadFrom no longer forces batch_read/BATCH_READ=false for file sources, and the new safety is
implemented only in an optimizer pass gated by enablePlanOptimizer; with optimizer disabled,
batch_read defaults to true and LOAD FROM can produce ArrowStreamContextColumn in unsupported
pipelines.
Agent Prompt
### Issue description
The previous bind-time safety that forced `BATCH_READ/batch_read=false` for `LOAD FROM` file scans has been removed. The replacement guard lives in an optimizer pass that does not run when `enablePlanOptimizer=false`, allowing `batch_read` to fall back to its default (`true`) and produce `ArrowStreamContextColumn`.

### Issue Context
- `Optimizer::optimize` is gated by `enablePlanOptimizer`.
- `ReadOptions.batch_read` defaults to `true`.
- `ArrowReader::batch_read` emits `ArrowStreamContextColumn`, which is not generally supported.

### Fix Focus Areas
- src/compiler/binder/bind/read/bind_load_from.cpp[46-61]
- src/compiler/optimizer/optimizer.cpp[55-82]
- include/neug/utils/reader/options.h[122-126]

### What to change
- Reintroduce a bind-time override for `LOAD FROM` scans to force batch read off (e.g., set `BATCH_READ=false` / `batch_read=false` in the scan options map) so correctness does not depend on optimizer execution.
- Keep the optimizer-based guard as an additional safety net, but don’t rely on it for baseline correctness.


This reverts commit 9e0abe2.



Development

Successfully merging this pull request may close these issues.

Performance bottleneck in COPY with column remapping
