
feat: Improve Column Remapping Performance in COPY FROM#193

Open
shirly121 wants to merge 6 commits into alibaba:main from shirly121:fix_copy_perf

Conversation

@shirly121
Collaborator

What do these changes do?

Introduce ProjectIntoDataSourceOptimizer, which pushes a Projection into LogicalTableFunctionCall when the plan matches the pattern Projection -> TableFunctionCall, enabling column remapping at the data source level. It also migrates the column-pruning semantics from "columns to skip" to "columns to project" across the entire pipeline (reader, optimizer, planner, serialization).

Key changes:

  • New ProjectIntoDataSourceOptimizer for projection pushdown and column reorder
  • Rename ReadSharedState::skipColumns -> projectColumns with updated semantics
  • Rename ArrowOptionsBuilder::skipColumns() -> projectColumns()
  • Add projectColumns to TableFuncBindData replacing deprecated columnSkips
  • Restore BATCH_READ=false safety in bind_load_from.cpp
  • Remove stale commented-out code
  • Update all tests to use new project-columns semantics

Related issue number

Fixes #165

Made-with: Cursor

Committed-by: Xiaoli Zhou from Dev container



@qodo-code-review

Review Summary by Qodo

Improve Column Remapping Performance via Projection Pushdown Optimizer

✨ Enhancement


Walkthroughs

Description
• Introduce ProjectIntoDataSourceOptimizer for projection pushdown into data sources
• Migrate column semantics from "skip" to "project" across entire pipeline
• Rename skipColumns to projectColumns in ReadSharedState and related classes
• Update protobuf field skip_columns to project_columns for semantic clarity
• Restore BATCH_READ=false safety for non-COPY_FROM operations
Diagram
flowchart LR
  A["Projection Operator"] -->|"ProjectIntoDataSourceOptimizer"| B["TableFunctionCall"]
  B -->|"setProjectColumns"| C["BindData"]
  C -->|"getProjectColumns"| D["Arrow Projection"]
  E["skipColumns semantics"] -->|"migrate to"| F["projectColumns semantics"]
  F -->|"used in"| G["Reader/Optimizer/Planner"]


File Changes

1. include/neug/compiler/optimizer/project_into_data_source_optimizer.h ✨ Enhancement +46/-0

New optimizer for projection pushdown into data sources



2. src/compiler/optimizer/project_into_data_source_optimizer.cpp ✨ Enhancement +153/-0

Implementation of projection pushdown optimizer logic



3. include/neug/compiler/function/table/bind_data.h ✨ Enhancement +8/-5

Rename columnSkips to projectColumns with new semantics



4. src/compiler/function/table/bind_data.cpp ✨ Enhancement +0/-12

Remove deprecated getColumnSkips method implementation



5. include/neug/compiler/planner/operator/logical_table_function_call.h ✨ Enhancement +2/-2

Update method signature from setColumnSkips to setProjectColumns



6. src/compiler/planner/operator/logical_table_function_call.cpp ✨ Enhancement +18/-14

Refactor schema computation to use projectColumns semantics



7. src/compiler/optimizer/projection_push_down_optimizer.cpp ✨ Enhancement +12/-8

Update to use projectColumns instead of columnSkips boolean vector



8. src/compiler/optimizer/optimizer.cpp ✨ Enhancement +4/-0

Integrate ProjectIntoDataSourceOptimizer into optimization pipeline



9. src/compiler/gopt/g_query_converter.cpp ✨ Enhancement +4/-9

Update protobuf serialization to use project_columns field



10. src/compiler/binder/bind/read/bind_load_from.cpp ✨ Enhancement +1/-5

Remove manual BATCH_READ=false setting from bind phase



11. include/neug/utils/reader/reader.h ✨ Enhancement +5/-13

Rename skipColumns to projectColumns in ReadSharedState



12. src/utils/reader/options.cc ✨ Enhancement +10/-17

Rename skipColumns method to projectColumns with updated logic



13. include/neug/utils/reader/options.h 📝 Documentation +12/-11

Update documentation and method signatures for projectColumns



14. src/utils/reader/reader.cc ✨ Enhancement +1/-1

Update method call from skipColumns to projectColumns



15. src/execution/execute/ops/batch/data_source.cc ✨ Enhancement +5/-3

Update deserialization to use project_columns field



16. src/execution/execute/ops/retrieve/sink.cc ✨ Enhancement +7/-1

Add dummy_sink helper for load operations handling



17. proto/cypher_dml.proto ✨ Enhancement +1/-1

Rename skip_columns field to project_columns in DataSource message



18. src/compiler/optimizer/CMakeLists.txt ⚙️ Configuration changes +2/-1

Add new optimizer source file to build configuration



19. extension/json/tests/json_test.cpp 🧪 Tests +3/-2

Update test to use projectColumns parameter semantics



20. extension/parquet/tests/parquet_test.cpp 🧪 Tests +8/-8

Update column pruning tests to use projectColumns semantics



21. tests/utils/test_reader.h 🧪 Tests +2/-2

Update test helper to use projectColumns parameter



22. tests/utils/test_reader.cc 🧪 Tests +12/-8

Update reader tests to use projectColumns instead of skipColumns





@qodo-code-review

qodo-code-review bot commented Apr 9, 2026

Code Review by Qodo

🐞 Bugs (2)   📘 Rule violations (0)   📎 Requirement gaps (0)   🎨 UX Issues (0)
☼ Reliability (2)



Action required

1. Golden JSON field mismatch 🐞
Description
DataSource field rename to project_columns changes protobuf JSON output, but existing DML
physical-plan golden files still contain "skip_columns", so verifyPhysicalByJson snapshot
comparisons will fail.
Code

proto/cypher_dml.proto[R27-31]

   string extension_name = 1;
   EntrySchema entry_schema = 4;
   FileSchema file_schema = 5;
-    repeated string skip_columns = 6;
+    repeated string project_columns = 6;
   common.Expression skip_rows = 7;
Evidence
Physical plan JSON printing preserves proto field names, so the rename from skip_columns to
project_columns changes the emitted JSON key. The checked-in golden JSON snapshots still use
"skip_columns", so equality assertions will mismatch.

proto/cypher_dml.proto[25-32]
tests/compiler/gopt_test.h[55-80]
tests/compiler/resources/dml_test/COPY_PERSON_physical[38-43]
tests/compiler/resources/dml_test/COPY_KNOWS_physical[31-37]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The protobuf field rename (`skip_columns` -> `project_columns`) changes the JSON key emitted by `MessageToJsonString` (because `preserve_proto_field_names=true`). Golden physical-plan JSON snapshots that still contain `"skip_columns"` will no longer match.
### Issue Context
`VerifyFactory::verifyPhysicalByJson` compares the actual printed plan JSON against JSON resources under `tests/compiler/resources/...`.
### Fix Focus Areas
- proto/cypher_dml.proto[25-32]
- tests/compiler/gopt_test.h[55-80]
- tests/compiler/resources/dml_test/COPY_PERSON_physical[38-43]
- tests/compiler/resources/dml_test/COPY_KNOWS_physical[31-37]
### What to change
- Update (or regenerate) the affected golden JSON resources, replacing `"skip_columns": [...]` with `"project_columns": [...]` (keeping semantics consistent with the new meaning: included/projection columns).



2. BATCH_READ ignores nonzero child 🐞
Description
ProjectIntoDataSourceOptimizer disables BATCH_READ only when TABLE_FUNCTION_CALL is op->getChild(0),
so scans positioned in other child slots (e.g., HASH_JOIN build side child[1]) can keep
batch_read=true and emit ArrowStreamContextColumn that most operators cannot process.
Code

src/compiler/optimizer/project_into_data_source_optimizer.cpp[R71-87]

+  // if there is a non-copy operation after LOAD FROM, batch_read=true will not
+  // work, batch_read=true will produce a special ArrowStreamContext, which can
+  // only be processed by copy from, other operators do not support it.
+  if (op->getNumChildren() > 0 &&
+      op->getChild(0)->getOperatorType() ==
+          LogicalOperatorType::TABLE_FUNCTION_CALL) {
+    auto tableFunc = op->getChild(0)->ptrCast<LogicalTableFunctionCall>();
+    auto* scanBindData =
+        dynamic_cast<function::ScanFileBindData*>(tableFunc->getBindData());
+    if (scanBindData != nullptr) {
+      auto& options = scanBindData->fileScanInfo.options;
+      if (op->getOperatorType() != LogicalOperatorType::COPY_FROM) {
+        options.insert_or_assign("BATCH_READ",
+                                 common::Value::createValue(false));
+      }
+    }
+  }
Evidence
The optimizer’s safety check hard-codes child(0). LogicalHashJoin explicitly models the build side
as children[1], so a file scan used there can bypass this guard. When batch_read is enabled,
ArrowReader builds ArrowStreamContextColumn, which is documented as only workable with a narrow set
of operators, making bypasses likely to crash or misbehave at runtime.

src/compiler/optimizer/project_into_data_source_optimizer.cpp[63-88]
include/neug/compiler/planner/operator/logical_hash_join.h[11-25]
src/utils/reader/reader.cc[60-70]
src/utils/reader/reader.cc[180-221]
include/neug/execution/common/columns/arrow_context_column.h[124-147]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`ProjectIntoDataSourceOptimizer::visitOperator` only disables `BATCH_READ` when a `TABLE_FUNCTION_CALL` is in `child(0)`. Operators with scans in other child positions (e.g., join build side) can bypass the guard, leaving `batch_read=true` and producing `ArrowStreamContextColumn` in unsupported operator pipelines.
### Issue Context
`ArrowReader::batch_read` produces `ArrowStreamContextColumn`, which is not generally supported (see the ArrowStreamContextColumn comment).
### Fix Focus Areas
- src/compiler/optimizer/project_into_data_source_optimizer.cpp[63-88]
- include/neug/compiler/planner/operator/logical_hash_join.h[11-25]
### What to change
- In `visitOperator`, iterate over **all** children (not just index 0) and apply the same `ScanFileBindData` lookup and `BATCH_READ=false` insertion for any child that is a `TABLE_FUNCTION_CALL` under a non-`COPY_FROM` parent.
- Keep the existing `COPY_FROM` exception semantics.



3. LOAD FROM batch_read regression 🐞
Description
bindLoadFrom no longer forces batch_read/BATCH_READ=false for file sources, and the new safety is
implemented only in an optimizer pass gated by enablePlanOptimizer; with optimizer disabled,
batch_read defaults to true and LOAD FROM can produce ArrowStreamContextColumn in unsupported
pipelines.
Code

src/compiler/binder/bind/read/bind_load_from.cpp[R57-61]

   auto boundScanSource = bindFileScanSource(
       *source, loadFrom.getParsingOptions(), columnNames, columnTypes);
   auto& scanInfo = boundScanSource->constCast<BoundTableScanSource>().info;
-    auto& bindData = scanInfo.bindData->cast<function::ScanFileBindData>();
-    auto& fileInfo = bindData.fileScanInfo;
-    // We support LOAD FROM by ArrowArrayContextColumn with BATCH_READ set to
-    // false.
-    fileInfo.options.insert({"BATCH_READ", common::Value::createValue(false)});
   boundLoadFrom = std::make_unique<BoundLoadFrom>(scanInfo.copy());
 } break;
Evidence
Optimizer rewrites (including the new BATCH_READ=false insertion) only run when
enablePlanOptimizer=true. ReadOptions.batch_read defaults to true, and ArrowReader chooses
batch_read mode based on that option; in batch mode it emits ArrowStreamContextColumn, which is
documented as not broadly supported. Since bindLoadFrom no longer sets BATCH_READ=false at
bind-time, disabling the optimizer reintroduces the unsafe default.

src/compiler/binder/bind/read/bind_load_from.cpp[46-61]
src/compiler/optimizer/optimizer.cpp[55-82]
include/neug/utils/reader/options.h[122-126]
src/utils/reader/reader.cc[60-70]
src/utils/reader/reader.cc[180-221]
include/neug/execution/common/columns/arrow_context_column.h[124-147]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The previous bind-time safety that forced `BATCH_READ/batch_read=false` for `LOAD FROM` file scans has been removed. The replacement guard lives in an optimizer pass that does not run when `enablePlanOptimizer=false`, allowing `batch_read` to fall back to its default (`true`) and produce `ArrowStreamContextColumn`.
### Issue Context
- `Optimizer::optimize` is gated by `enablePlanOptimizer`.
- `ReadOptions.batch_read` defaults to `true`.
- `ArrowReader::batch_read` emits `ArrowStreamContextColumn`, which is not generally supported.
### Fix Focus Areas
- src/compiler/binder/bind/read/bind_load_from.cpp[46-61]
- src/compiler/optimizer/optimizer.cpp[55-82]
- include/neug/utils/reader/options.h[122-126]
### What to change
- Reintroduce a bind-time override for `LOAD FROM` scans to force batch read off (e.g., set `BATCH_READ=false` / `batch_read=false` in the scan options map) so correctness does not depend on optimizer execution.
- Keep the optimizer-based guard as an additional safety net, but don’t rely on it for baseline correctness.







Comment on lines +71 to +87
// if there is a non-copy operation after LOAD FROM, batch_read=true will not
// work, batch_read=true will produce a special ArrowStreamContext, which can
// only be processed by copy from, other operators do not support it.
if (op->getNumChildren() > 0 &&
    op->getChild(0)->getOperatorType() ==
        LogicalOperatorType::TABLE_FUNCTION_CALL) {
  auto tableFunc = op->getChild(0)->ptrCast<LogicalTableFunctionCall>();
  auto* scanBindData =
      dynamic_cast<function::ScanFileBindData*>(tableFunc->getBindData());
  if (scanBindData != nullptr) {
    auto& options = scanBindData->fileScanInfo.options;
    if (op->getOperatorType() != LogicalOperatorType::COPY_FROM) {
      options.insert_or_assign("BATCH_READ",
                               common::Value::createValue(false));
    }
  }
}


Action required

2. BATCH_READ ignores nonzero child 🐞 Bug ☼ Reliability

ProjectIntoDataSourceOptimizer disables BATCH_READ only when TABLE_FUNCTION_CALL is op->getChild(0),
so scans positioned in other child slots (e.g., HASH_JOIN build side child[1]) can keep
batch_read=true and emit ArrowStreamContextColumn that most operators cannot process.
Agent Prompt
### Issue description
`ProjectIntoDataSourceOptimizer::visitOperator` only disables `BATCH_READ` when a `TABLE_FUNCTION_CALL` is in `child(0)`. Operators with scans in other child positions (e.g., join build side) can bypass the guard, leaving `batch_read=true` and producing `ArrowStreamContextColumn` in unsupported operator pipelines.

### Issue Context
`ArrowReader::batch_read` produces `ArrowStreamContextColumn`, which is not generally supported (see the ArrowStreamContextColumn comment).

### Fix Focus Areas
- src/compiler/optimizer/project_into_data_source_optimizer.cpp[63-88]
- include/neug/compiler/planner/operator/logical_hash_join.h[11-25]

### What to change
- In `visitOperator`, iterate over **all** children (not just index 0) and apply the same `ScanFileBindData` lookup and `BATCH_READ=false` insertion for any child that is a `TABLE_FUNCTION_CALL` under a non-`COPY_FROM` parent.
- Keep the existing `COPY_FROM` exception semantics.


Comment on lines 57 to 61
  auto boundScanSource = bindFileScanSource(
      *source, loadFrom.getParsingOptions(), columnNames, columnTypes);
  auto& scanInfo = boundScanSource->constCast<BoundTableScanSource>().info;
  auto& bindData = scanInfo.bindData->cast<function::ScanFileBindData>();
  auto& fileInfo = bindData.fileScanInfo;
  // We support LOAD FROM by ArrowArrayContextColumn with BATCH_READ set to
  // false.
  fileInfo.options.insert({"BATCH_READ", common::Value::createValue(false)});
  boundLoadFrom = std::make_unique<BoundLoadFrom>(scanInfo.copy());
} break;


Action required

3. LOAD FROM batch_read regression 🐞 Bug ☼ Reliability

bindLoadFrom no longer forces batch_read/BATCH_READ=false for file sources, and the new safety is
implemented only in an optimizer pass gated by enablePlanOptimizer; with optimizer disabled,
batch_read defaults to true and LOAD FROM can produce ArrowStreamContextColumn in unsupported
pipelines.
Agent Prompt
### Issue description
The previous bind-time safety that forced `BATCH_READ/batch_read=false` for `LOAD FROM` file scans has been removed. The replacement guard lives in an optimizer pass that does not run when `enablePlanOptimizer=false`, allowing `batch_read` to fall back to its default (`true`) and produce `ArrowStreamContextColumn`.

### Issue Context
- `Optimizer::optimize` is gated by `enablePlanOptimizer`.
- `ReadOptions.batch_read` defaults to `true`.
- `ArrowReader::batch_read` emits `ArrowStreamContextColumn`, which is not generally supported.

### Fix Focus Areas
- src/compiler/binder/bind/read/bind_load_from.cpp[46-61]
- src/compiler/optimizer/optimizer.cpp[55-82]
- include/neug/utils/reader/options.h[122-126]

### What to change
- Reintroduce a bind-time override for `LOAD FROM` scans to force batch read off (e.g., set `BATCH_READ=false` / `batch_read=false` in the scan options map) so correctness does not depend on optimizer execution.
- Keep the optimizer-based guard as an additional safety net, but don’t rely on it for baseline correctness.


This reverts commit 9e0abe2.



Development

Successfully merging this pull request may close these issues.

Performance bottleneck in COPY with column remapping
