Rb pre parse schema #3
Draft
robertbuessow wants to merge 2 commits into
…hema, CBuffer, COffsets, lazy columns, @generated dispatch
Zero-allocation-per-schema schema parsing:
- Add `SchemaNode` (replaces per-batch unsafe_string + format dispatch) and `TableSchema` (pre-built col_names + lookup Dict) so the schema is parsed once via `parse_c_schema` and every subsequent batch import skips all string work.
- Add `from_c_data(::TableSchema, array_ptrs)` and `from_c_data(::SchemaNode, ptr)` as the fast-path entry points.
Zero-boxing C buffer wrappers:
- `CBuffer{T} <: AbstractVector{T}`: isbits wrapper around a C pointer that avoids the `Vector` header allocation produced by `unsafe_wrap`. Stored inline in `Primitive.data`, `List.data`, etc.
- `COffsets{T} <: AbstractOffsets{T}`: isbits wrapper for list/map offset arrays. Parameterize `List{T,O,A,OF}` and `Map{T,O,A,OF}` on the offsets type so `COffsets` is embedded inline instead of heap-allocated.
- `AbstractOffsets{T}` abstract type + `_raw_offsets` helper keep the IPC write path working unchanged.
Lazy CImportedTable (no pre-built column vector):
- `CImportedTable` stores `arr_ptrs::Vector{Ptr{ArrowArray}}` (isbits) instead of `columns::Vector{CImportedArray}` (abstract, boxed). Column ArrowVectors are constructed on demand in `Tables.getcolumn`.
- `shared_handle` field distinguishes stream-owned arrays (release root once) from individually-owned flat arrays.
Remove per-column overhead in `_import_arrowvec_fast`:
- Drop `handle` parameter entirely; it was only threaded through for the now-removed `CImportedArray` wrapping.
- `ArrowArray` and `ArrowSchema` changed from `mutable struct` to `struct`; `unsafe_load` now returns a stack value with zero allocation.
- `_ALL_VALID` singleton: null-free columns reuse one pre-allocated `ValidityBitmap` instead of allocating a new one per column.
Type-stability fixes:
- `@generated _import_prim_fast(... ::Val{S})` makes `S = node.storage_type` concrete at compile time so `CBuffer{S}` is provably isbits and stack-allocated.
- `@generated _make_dict_indices(... ::Val{S})` applies the same treatment narrowly to dict index construction (specialises only on S, avoiding combinatorial blowup over dict value types that caused OOM under JULIA_NUM_THREADS=2).
- Split `CKIND_STR32/BIN32` and `CKIND_STR64/BIN64` branches to eliminate the `OT::Union{Type{Int32},Type{Int64}}` phi-node union.
- Rename `data_bytes` in the bool branch to prevent it merging with the string branch into a type-unstable union.
Benchmark (bench/schema_cache.jl):
10-column mixed table (int+float+string+nullable), cached path vs. parse-every-call baseline:
  Before: ~16.6 μs baseline, no column access measured
  After: ~7.7 μs baseline / ~1.6 μs cached; **4.7× speedup**, 34 allocs
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
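For readers unfamiliar with the isbits-wrapper idea behind `CBuffer{T}`, here is a minimal sketch of what such a type might look like. It is not the PR's code: `DemoCBuffer` and `demo_cbuffer` are hypothetical names, and a Julia `Vector` stands in for memory that would normally be owned by a C producer.

```julia
# Hypothetical illustration only; `DemoCBuffer` is not the PR's `CBuffer`.
# An immutable struct holding just a pointer and a length is isbits, so it can
# be stored inline in another struct without a separate heap-allocated header.
struct DemoCBuffer{T} <: AbstractVector{T}
    ptr::Ptr{T}   # start of the externally owned buffer
    len::Int      # number of elements of type T
end

Base.size(b::DemoCBuffer) = (b.len,)
Base.IndexStyle(::Type{<:DemoCBuffer}) = IndexLinear()

Base.@propagate_inbounds function Base.getindex(b::DemoCBuffer, i::Int)
    @boundscheck checkbounds(b, i)
    return unsafe_load(b.ptr, i)   # unsafe_load takes a 1-based element index
end

function demo_cbuffer()
    backing = Int32[10, 20, 30, 40]   # stand-in for C-owned memory
    GC.@preserve backing begin
        buf = DemoCBuffer{Int32}(pointer(backing), length(backing))
        (isbitstype(typeof(buf)), buf[2], sum(buf), collect(buf))
    end
end
```

The wrapper does not keep its memory alive, so the caller must keep the owner rooted for as long as the buffer is read; the sketch does that with `GC.@preserve`.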
…w pairs
Revert the ValidityBitmap struct from Ptr{UInt8}+ref back to the original
Vector{UInt8}+pos layout, and update all call sites accordingly:
- Restore ValidityBitmap struct fields: bytes::Vector{UInt8}, pos::Int
- Revert _build_validity() back to ValidityBitmap() constructor
- Revert getindex/setindex! to array-indexed access
- Revert writebitmap to view-based approach
- Remove _ALL_VALID singleton (was only valid with pointer-based struct)
- Revert _import_validity() to copy C bytes into Vector{UInt8}
- Revert _validity_ptr() export helper to use pointer(bm.bytes, bm.pos)
- Replace push!(roots, v.validity.ref) with push!(roots, v.validity.bytes)
Separately, collapse unsafe_wrap+view offset patterns into direct
pointer-arithmetic wraps throughout _import_arrowvec:
- Fixed-size binary: unsafe_wrap(dptr + off*N, len*N)
- String/binary offsets: unsafe_wrap(optr + off*sizeof(OT), len+1)
- Generic list offsets: same
- Map offsets: same (Int32)
- Dict index array: unsafe_wrap(iptr + off*sizeof(S), len)
- Primitive data: unsafe_wrap(dptr + off*sizeof(S), len)
Each case advances the pointer by off × element_size bytes and wraps
exactly the needed count, removing the intermediate array allocation
and the conditional view.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
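The pointer-arithmetic wrap described in this commit can be sketched as follows. The helper names `wrap_then_view`, `wrap_at_offset` and `demo_offsets` are hypothetical, and the "before" shape is an assumption inferred from the commit message rather than a copy of `_import_arrowvec`. In Julia, adding an integer to a `Ptr` advances it by bytes, which is why the offset is multiplied by `sizeof(S)`.

```julia
# Hypothetical helpers; the actual code in `_import_arrowvec` is not shown here.

# Assumed older shape: wrap offset + length elements, then view past the offset.
function wrap_then_view(dptr::Ptr{S}, off::Int, len::Int) where {S}
    full = unsafe_wrap(Array, dptr, off + len)
    return off == 0 ? full : view(full, off+1:off+len)
end

# Shape described above: advance by off * sizeof(S) bytes, wrap exactly len elements.
function wrap_at_offset(dptr::Ptr{S}, off::Int, len::Int) where {S}
    return unsafe_wrap(Array, dptr + off * sizeof(S), len)
end

function demo_offsets()
    backing = Float64[1.0, 2.0, 3.0, 4.0, 5.0]   # stand-in for a C data buffer
    GC.@preserve backing begin
        p = pointer(backing)
        a = collect(wrap_then_view(p, 2, 3))   # copy out while memory is rooted
        b = copy(wrap_at_offset(p, 2, 3))
        (a, b)
    end
end
```

Both helpers expose the same three elements; the second always returns a plain `Array` rather than either an `Array` or a `SubArray` depending on the offset.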
No description provided.