Skip to content

fix: refactor string-column dump#172

Merged
liulx20 merged 24 commits intoalibaba:mainfrom
liulx20:refactor-string-dump
Apr 16, 2026
Merged

fix: refactor string-column dump#172
liulx20 merged 24 commits intoalibaba:mainfrom
liulx20:refactor-string-dump

Conversation

@liulx20
Copy link
Copy Markdown
Collaborator

@liulx20 liulx20 commented Apr 3, 2026

This pull request refactors the interface and implementation of edge table size tracking in the property graph storage system. The main goal is to clarify the meaning of "size" for edge tables, especially distinguishing between bundled and unbundled edge storage, and to simplify related code and tests. Additionally, the PR modernizes file I/O in several places, replacing C-style file operations with C++ streams, and removes unused or overly complex compaction logic for string columns.

Key changes include:

Edge Table Size Refactor

  • Renamed EdgeTable::Size() to PropTableSize() and updated its semantics: for bundled tables, it now always returns 0, clarifying that bundled edges do not have a separate property table size. All code and tests referencing edge table size were updated accordingly.

File I/O Modernization

  • Replaced C-style FILE* operations with C++ std::ofstream in MutableCsr::dump and related code, improving exception safety and code clarity.

String Column Compaction Simplification

  • Removed the complex compaction plan and streaming compaction logic from TypedColumn<std::string_view>::dump, replacing it with a straightforward forward pass that writes unique string data and item metadata using C++ streams. The MD5 computation is stubbed for now.

Minor Cleanups

  • Removed unused local variables and dead code in MutableCsr::dump related to capacity lists.

These changes improve code clarity, maintainability, and correctness, especially around edge table size semantics and file operations.

Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

@qodo-code-review
Copy link
Copy Markdown

Review Summary by Qodo

Refactor file I/O with BufferWriter utility class

✨ Enhancement 🐞 Bug fix

Grey Divider

Walkthroughs

Description
• Introduce BufferWriter utility class for efficient buffered file I/O
• Replace manual file operations with BufferWriter in CSR dump method
• Refactor string-view array dump to use buffered writing approach
• Improve error handling and resource management across file operations
Diagram
flowchart LR
  A["Manual fopen/fwrite/fclose"] -->|"Replace with"| B["BufferWriter class"]
  B -->|"Used in"| C["MutableCsr::dump"]
  B -->|"Used in"| D["mmap_array string_view dump"]
  C -->|"Writes"| E[".nbr files"]
  D -->|"Writes"| F[".data and .items files"]
Loading

Grey Divider

File Changes

1. include/neug/utils/file_utils.h ✨ Enhancement +78/-0

Add BufferWriter utility for buffered file writing

• Added BufferWriter struct with 4MB buffered writing capability
• Implements write(), flush(), and close() methods for safe file I/O
• Includes automatic error handling with glog and exception throwing
• Added necessary headers for file operations and error handling

include/neug/utils/file_utils.h


2. src/storages/csr/mutable_csr.cc Refactoring +5/-32

Refactor MutableCsr dump to use BufferWriter

• Replaced manual fopen/fwrite/fclose logic with BufferWriter class
• Removed verbose error handling code (28 lines) with cleaner abstraction
• Simplified neighbor list writing with writer.write() and writer.close()
• Improved code maintainability by delegating I/O concerns to utility

src/storages/csr/mutable_csr.cc


3. include/neug/utils/mmap_array.h Refactoring +41/-10

Refactor string_view array dump with BufferWriter

• Commented out previous compaction and streaming logic
• Implemented new dump method using BufferWriter for both data and items files
• Iterates through items and writes non-empty strings with buffered I/O
• Added file truncation to set correct data file size after writing

include/neug/utils/mmap_array.h


Grey Divider

Qodo Logo

@qodo-code-review
Copy link
Copy Markdown

qodo-code-review Bot commented Apr 3, 2026

Code Review by Qodo

🐞 Bugs (0) 📘 Rule violations (0) 📎 Requirement gaps (0) 🎨 UX Issues (0)

Grey Divider


Action required

1. Corrupt string dump files🐞
Description
mmap_array<std::string_view>::dump writes string bytes sequentially but never increments the new
write offset and persists original string_item offsets, then truncates the .data file to 0 bytes,
corrupting the snapshot.
Code

include/neug/utils/mmap_array.h[R525-549]

+    BufferWriter fout_data(datas_file);
+    BufferWriter fout_items(items_file);
+    size_t offset = 0;
+    for (size_t i = 0; i < items_.size(); ++i) {
+      const string_item& item = items_.get(i);
+      if (item.length > 0) {
+        fout_data.write(data_.data() + item.offset, item.length);
+        fout_items.write(reinterpret_cast<const char*>(&item), sizeof(item));
+      } else {
+        string_item empty_item = {0, 0};
+        fout_items.write(reinterpret_cast<const char*>(&empty_item),
+                         sizeof(empty_item));
+      }
+    }
+    fout_data.close();
+    fout_items.close();
+    size_t data_file_size = offset;
+    // ftruncate data file to actual data size
+    int rt = truncate(datas_file.c_str(), data_file_size);
+    if (rt != 0) {
+      std::stringstream ss;
+      ss << "Failed to truncate file [ " << datas_file << " ], "
+         << strerror(errno);
+      LOG(ERROR) << ss.str();
+      THROW_RUNTIME_ERROR(ss.str());
Evidence
dump() initializes offset to 0 and never updates it, yet uses it to truncate the newly-written .data
file; additionally it writes the original string_item (old offsets) while get() later interprets
item.offset as an index into the dumped .data file, so offsets must be rewritten when data is
written sequentially.

include/neug/utils/mmap_array.h[507-553]
include/neug/utils/mmap_array.h[471-474]
include/neug/utils/mmap_array.h[584-587]
include/neug/utils/mmap_array.h[705-728]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`mmap_array<std::string_view>::dump()` currently produces invalid `.data`/`.items` files: it writes data sequentially but writes unchanged `string_item` offsets, and it truncates the `.data` file using an `offset` variable that is never incremented (so the file becomes 0 bytes).
### Issue Context
Readers use `string_item.offset` to index into the `.data` file (`get()` constructs `std::string_view(data_.data() + item.offset, item.length)`). Therefore, when dumping sequentially, you must rebuild offsets (and optionally deduplicate reused offsets like the existing compaction logic does).
### Fix Focus Areas
- include/neug/utils/mmap_array.h[507-553]
- include/neug/utils/mmap_array.h[584-587]
- include/neug/utils/mmap_array.h[692-780]
### Suggested fix approach
- Either restore the previous compaction/streaming dump behavior (updating `items_` offsets as in `stream_compact_and_dump()`), but replace raw `FILE*` writes with `BufferWriter`.
- Or, in the new loop:
- Track `offset` and increment it by `item.length` whenever you write string bytes.
- Write an updated `string_item` with `offset=<current_offset>` (not the original `item.offset`).
- Ensure the final truncate size matches the number of bytes written (or preserve the original reserved size if that’s required elsewhere).

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


2. MutableCsr nbr format mismatch🐞
Description
MutableCsr::dump now writes only adj_list_size_ neighbors per vertex into the .nbr file, but open()
reconstructs per-vertex pointers by advancing through the file using adj_list_capacity_, so vertices
with capacity>size will misalign pointers and can read past the mapped file.
Code

src/storages/csr/mutable_csr.cc[R184-189]

+  BufferWriter writer(new_snapshot_dir + "/" + name + ".nbr");
for (size_t i = 0; i < vnum; ++i) {
-    size_t ret{};
-    if ((ret = fwrite(adj_list_buffer_[i], sizeof(nbr_t), adj_list_capacity_[i],
-                      fout)) != static_cast<size_t>(adj_list_capacity_[i])) {
-      std::stringstream ss;
-      ss << "Failed to write nbr list " << filename << ", expected "
-         << adj_list_capacity_[i] << ", got " << ret << ", " << strerror(errno);
-      LOG(ERROR) << ss.str();
-      throw std::runtime_error(ss.str());
-    }
-  }
-  int ret = 0;
-  if ((ret = fflush(fout)) != 0) {
-    std::stringstream ss;
-    ss << "Failed to flush nbr list " << filename << ", error code: " << ret
-       << " " << strerror(errno);
-    LOG(ERROR) << ss.str();
-    throw std::runtime_error(ss.str());
-  }
-  if ((ret = fclose(fout)) != 0) {
-    std::stringstream ss;
-    ss << "Failed to close nbr list " << filename << ", error code: " << ret
-       << " " << strerror(errno);
-    LOG(ERROR) << ss.str();
-    throw std::runtime_error(ss.str());
+    writer.write(reinterpret_cast<const char*>(adj_list_buffer_[i]),
+                 adj_list_size_[i] * sizeof(nbr_t));
}
Evidence
dump() still computes and may persist a per-vertex capacity list (.cap). open() uses that capacity
to advance the base pointer (ptr += cap) when slicing the .nbr backing array into per-vertex
adjacency lists; therefore the .nbr file layout must contain capacity-sized blocks per vertex (as
before), not size-sized blocks.

src/storages/csr/mutable_csr.cc[155-192]
src/storages/csr/mutable_csr.cc[64-72]
src/storages/csr/mutable_csr.cc[160-180]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`MutableCsr<EDATA_T>::dump()` changed the `.nbr` serialization to write `adj_list_size_[i]` entries per vertex, but `open()` still reconstructs adjacency-list pointers assuming `cap` entries per vertex (from `.cap` when present). This mismatch corrupts the CSR layout on disk and can lead to out-of-bounds access when opening.
### Issue Context
The existing format uses contiguous blocks sized by capacity so `adj_list_buffer_[i]` can be set by pointer arithmetic and `ptr += cap`.
### Fix Focus Areas
- src/storages/csr/mutable_csr.cc[155-192]
- src/storages/csr/mutable_csr.cc[37-76]
### Suggested fix approach
- Either:
1) Restore writing `adj_list_capacity_[i] * sizeof(nbr_t)` bytes per vertex to `.nbr` (matching `open()`), or
2) If the intended new on-disk format is size-only, update `open()` to advance by size and ensure `.cap` is not used/produced (or redefine its meaning) consistently.
- Add/adjust a test that dumps then opens a CSR where `capacity > size` for at least one vertex.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


3. Header relies on transitive includes🐞
Description
file_utils.h uses std::min and errno but does not include <algorithm> or <cerrno>/<errno.h>, and
src/utils/file_utils.cc includes file_utils.h without including those first, so compilation is not
guaranteed.
Code

include/neug/utils/file_utils.h[R18-23]

+#include <assert.h>
+#include <glog/logging.h>
+#include <cstdio>
+#include <cstring>
+#include <sstream>
+#include <stdexcept>
Evidence
BufferWriter::write() calls std::min, which is declared in <algorithm>, but file_utils.h doesn't
include it; errno is used in error messages without an errno header. src/utils/file_utils.cc
includes file_utils.h as its first include and does not include <algorithm>/<cerrno> beforehand, so
this header must be self-sufficient.

include/neug/utils/file_utils.h[18-26]
include/neug/utils/file_utils.h[79-97]
include/neug/utils/file_utils.h[66-74]
src/utils/file_utils.cc[16-24]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`include/neug/utils/file_utils.h` is a public header but is missing required includes for identifiers it uses (`std::min`, `errno`). This makes builds depend on include order/transitive headers.
### Issue Context
`src/utils/file_utils.cc` includes `neug/utils/file_utils.h` first and does not include `<algorithm>`/`<cerrno>` before it.
### Fix Focus Areas
- include/neug/utils/file_utils.h[15-26]
- include/neug/utils/file_utils.h[79-97]
### Suggested fix approach
- Add explicit includes in `file_utils.h`:
- `#include <algorithm>` (for `std::min`)
- `#include <cerrno>` (or `#include <errno.h>`) (for `errno`)
- Keep these includes in the header itself (not in .cc files) so it is self-contained.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended

4. Destructor asserts on errors🐞
Description
BufferWriter's destructor asserts that close() was called, so if write/flush throws before close(),
stack unwinding will hit the assert (debug) or silently leak/skip close (release), masking the
original I/O error and risking unflushed output.
Code

include/neug/utils/file_utils.h[R66-78]

+  BufferWriter(const std::string& filename)
+      : buffer(kBufferSize), buffered_bytes(0) {
+    fout = fopen(filename.c_str(), "wb");
+    if (fout == nullptr) {
+      std::stringstream ss;
+      ss << "Failed to open file " << filename << ", " << strerror(errno);
+      LOG(ERROR) << ss.str();
+      throw std::runtime_error(ss.str());
+    }
+  }
+
+  ~BufferWriter() { assert(fout == nullptr); }
+
Evidence
~BufferWriter() enforces a manual-close contract with assert(fout == nullptr), while
flush()/close() throw on I/O failures; an exception before an explicit close() will therefore
trigger assertion failure during destruction (or skip closing entirely in release builds).

include/neug/utils/file_utils.h[60-128]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`BufferWriter` is not RAII-safe: it requires callers to manually call `close()` and asserts in the destructor if they don’t. This can crash during exception unwinding and/or leak file handles and buffered data.
### Issue Context
`flush()`/`close()` throw `std::runtime_error` on I/O failure. If any write path throws before `close()` is reached, the destructor runs and currently asserts.
### Fix Focus Areas
- include/neug/utils/file_utils.h[60-129]
### Suggested fix approach
- Replace the destructor with a `noexcept` best-effort close, e.g.:
- `~BufferWriter() noexcept { if (fout) { try { close(); } catch (...) { LOG(ERROR) << ...; /* swallow */ } } }`
- Consider deleting copy operations and implementing a move constructor/assignment to avoid accidental FILE* aliasing/double-close:
- `BufferWriter(const BufferWriter&) = delete; BufferWriter& operator=(...) = delete;`
- `BufferWriter(BufferWriter&&) noexcept` that transfers `fout` and resets the source to nullptr.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

ⓘ The new review experience is currently in Beta. Learn more

Grey Divider

Qodo Logo

Comment thread include/neug/utils/mmap_array.h Outdated
Comment thread src/storages/csr/mutable_csr.cc Outdated
Comment thread include/neug/utils/file_utils.h Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a new buffered file-writing utility (BufferWriter) and refactors dump paths to use buffered writes instead of direct FILE* I/O, aiming to improve dump performance and centralize error handling.

Changes:

  • Added BufferWriter (4MB buffer) to include/neug/utils/file_utils.h.
  • Refactored MutableCsr<EDATA_T>::dump() to write .nbr via BufferWriter.
  • Refactored mmap_array<std::string_view>::dump() to emit .data/.items via BufferWriter instead of the prior compaction/streaming dump flow.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File Description
include/neug/utils/file_utils.h Adds the new BufferWriter abstraction used by multiple dump paths.
include/neug/utils/mmap_array.h Refactors string-column dump to buffered writes for .data/.items.
src/storages/csr/mutable_csr.cc Updates CSR neighbor list dumping to use BufferWriter.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread include/neug/utils/file_utils.h Outdated
Comment thread include/neug/utils/file_utils.h Outdated
Comment thread include/neug/utils/mmap_array.h Outdated
Comment thread include/neug/utils/mmap_array.h Outdated
Comment thread src/storages/csr/mutable_csr.cc Outdated
@liulx20 liulx20 force-pushed the refactor-string-dump branch from afe7d52 to 2b5c1da Compare April 9, 2026 02:29
@liulx20 liulx20 force-pushed the refactor-string-dump branch from 35c536b to fe5a2ff Compare April 9, 2026 05:00
@liulx20 liulx20 requested a review from luoxiaojian April 14, 2026 07:20
@liulx20 liulx20 requested a review from zhanglei1949 April 16, 2026 08:35
Copy link
Copy Markdown
Member

@zhanglei1949 zhanglei1949 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@liulx20 liulx20 merged commit a747386 into alibaba:main Apr 16, 2026
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants