Conversation
Walkthrough

Adds an optional Polars integration and feature flag, implements a Polars extension trait for predictions from DataFrames, refactors build.rs to prefer PyPI wheel discovery/extraction (falling back to GitHub releases) to obtain platform libraries, updates Cargo.toml with dependencies/features and example paths, and includes minor formatting tweaks.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Build as build.rs
    participant PyPI as PyPI API
    participant Wheel as .whl (Zip)
    participant GitHub as GitHub Releases
    participant Libs as libs/
    Build->>PyPI: query wheel index for OS/arch
    alt wheel found
        PyPI-->>Build: wheel URL
        Build->>Wheel: download wheel -> OUT_DIR
        Build->>Wheel: open ZipArchive, locate lib_lightgbm.*
        Wheel-->>Build: extracted library bytes
        Build->>Libs: write lib_lightgbm.{dylib|so|dll}
    else wheel not found
        Build->>GitHub: try release asset patterns
        alt asset found
            GitHub-->>Build: asset URL
            Build->>Libs: download & write library
        else
            Build-->>Build: emit detailed failure with remediation
        end
    end
    Build->>Build: emit cargo:rustc-link-lib=_lightgbm and search paths
```

```mermaid
sequenceDiagram
    participant User as calling code
    participant Booster
    participant PolarsExt as BoosterPolarsExt
    participant DF as Polars DataFrame
    participant Converter as DF→Vec<f32>
    participant Predictor as predict_f32
    User->>Booster: predict_dataframe(df, predict_type)
    Booster->>PolarsExt: invoke trait impl
    PolarsExt->>DF: select columns & validate dtypes/nulls
    DF->>Converter: iterate rows, cast to f32 (row-major)
    Converter-->>PolarsExt: Vec<f32>
    PolarsExt->>Predictor: call predict_f32(&data, ...)
    Predictor-->>PolarsExt: Vec<f64> (predictions)
    PolarsExt-->>User: return predictions or mapped error
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Actionable comments posted: 2
🧹 Nitpick comments (3)
src/polars_ext.rs (2)
78-104: Row‑major conversion is correct but can overflow for extremely wide/long frames
`dataframe_to_dense` produces `num_rows * num_cols` f32 values in row‑major order, which matches the nested `for row { for col { .. } }` and is what `predict_f32` expects. Two minor edge‑case concerns:

- `Vec::with_capacity(num_rows * num_cols)` uses `usize` multiplication; for very large DataFrames this can overflow (panic in debug, wrap in release). Given LightGBM itself uses 32‑bit sizes, this is unlikely in practice but still a theoretical footgun.
- `num_rows` and `num_cols` are cast to `i32` when returned. If a DataFrame exceeds `i32::MAX` rows or columns, this will truncate and misreport sizes to LightGBM.

You might want to guard with a check like `if num_rows > i32::MAX as usize || num_cols > i32::MAX as usize { ... }` and return a clear error.
94-101: Pre-materialize columns before the row loop to avoid repeated `as_materialized_series()` calls

The review comment is correct. Calling `as_materialized_series()` inside the inner row loop (lines 96–97) results in `num_rows × num_cols` redundant invocations. Although `as_materialized_series()` is cheap when the Column already holds a materialized Series, repeated calls waste CPU when the reference could be cached once per column. The proposed refactor is type-compatible and implementable:

```diff
+ let cols: Vec<&Series> = df
+     .get_columns()
+     .iter()
+     .map(|c| c.as_materialized_series())
+     .collect();
+
- for row_idx in 0..num_rows {
-     for col in df.get_columns() {
-         let series = col.as_materialized_series();
-         let value = extract_f32_value(series, row_idx)?;
-         data.push(value);
-     }
- }
+ for row_idx in 0..num_rows {
+     for series in &cols {
+         let value = extract_f32_value(series, row_idx)?;
+         data.push(value);
+     }
+ }
```
`extract_f32_value` accepts `&Series`, so this is type-compatible.

build.rs (1)
170-201: Extraction logic is sound; two recommended improvements for production use

The zip scanning for platform-appropriate libraries (`.dylib`, `.so`, `.dll`) is straightforward and correctly implemented across all three platform variants. The code works as designed. Two optional improvements worth considering:
- Wheel cleanup: Wheels are downloaded to `OUT_DIR` but never removed after extraction. Add `fs::remove_file(&wheel_path)` after line 195 (macOS), line 257 (Linux), and line 307 (Windows) to reduce build cache size.
- URL/path override: Consider adding optional env vars `LIGHTGBM_WHEEL_URL` and/or `LIGHTGBM_LIB_PATH` to support private mirrors and pre-downloaded artifacts for advanced users, rather than always pulling from PyPI; a sketch follows this list.
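A sketch of what such overrides could look like near the top of the download logic. Note that `LIGHTGBM_WHEEL_URL` and `LIGHTGBM_LIB_PATH` are proposed names (build.rs does not read them today), and `lib_dir`/`lib_extension` are assumed from the surrounding function:

```rust
// Sketch: optional escape hatches checked before any network access.
// LIGHTGBM_LIB_PATH points at a pre-downloaded shared library; it is
// copied into libs/ under the canonical name and all downloads are skipped.
if let Ok(path) = std::env::var("LIGHTGBM_LIB_PATH") {
    let dest = lib_dir.join(format!("lib_lightgbm.{}", lib_extension));
    fs::copy(&path, &dest)?;
    return Ok(());
}
// LIGHTGBM_WHEEL_URL overrides the PyPI lookup with a private mirror URL;
// the wheel would then be extracted exactly as in the PyPI path.
let wheel_url = match std::env::var("LIGHTGBM_WHEEL_URL") {
    Ok(url) => url,
    Err(_) => find_wheel_url_from_pypi(&version, &os, &arch)?,
};
```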
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- Cargo.toml (1 hunks)
- build.rs (2 hunks)
- src/lib.rs (1 hunks)
- src/polars_ext.rs (1 hunks)
🔇 Additional comments (4)
src/lib.rs (1)
10-13: Polars extension exposure is cleanly gated

The conditional `mod polars_ext` and `pub use BoosterPolarsExt` are straightforward and keep the Polars API entirely behind the `polars` feature without touching the existing public surface.

build.rs (2)
128-315: Remove the critical issue flag; URLs work correctly via PyPI redirects

ureq automatically follows redirects by default (up to 5), so the constructed URLs function properly. All five wheels return HTTP 302 redirects to their actual PyPI locations, which ureq handles transparently. The code will successfully download wheels on all supported platforms.
The wheel selection by OS and architecture, HTTP status checking, and extraction logic are sound. No action required.
369-423: I need to gather more information since the sandbox lacks a Rust toolchain. Let me examine the codebase for CI configuration and linking tests, then search for Rust library naming conventions. Let me search for more specific information about Windows DLL import library naming conventions and whether the current setup would work:
Based on my research, the review comment raises a valid concern that has been confirmed through multiple verification angles:
On Windows linking to a DLL involves linking to its "import library", and the platform conventions are .dylib on macOS, .lib on Windows, and .so on Linux.
The critical issue is a platform-specific linking mismatch:
- On Unix (Linux/macOS): The linker convention is `-lfoo` → looks for `libfoo.so`/`libfoo.dylib`. Your code uses `-l_lightgbm`, which correctly finds `lib_lightgbm.so`/`lib_lightgbm.dylib`. ✓
- On Windows: The linker expects an import library. When you use `-l_lightgbm`, the MSVC linker searches for `_lightgbm.lib` (NOT `lib_lightgbm.dll`). Your code copies `lib_lightgbm.dll` but doesn't provide the corresponding import library that matches the link directive. ✗

This confirms the reviewer's concern: the link name `_lightgbm` does not align with the library filename `lib_lightgbm.dll` on Windows. The code needs either:
- a rename of the copied DLL file to `_lightgbm.dll` on Windows, OR
- a change to the link directive to match the actual library name, OR
- an import library named `_lightgbm.lib` to be generated/provided on Windows; a sketch of the first option follows this list.
Verify whether the crate successfully builds on Windows with the current configuration. If CI validation exists showing Windows builds pass, that would indicate the concern has been addressed elsewhere; if not, the mismatched link name on Windows requires correction.
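For illustration, a build.rs fragment for the first option. This is a sketch only: `lib_dir` is assumed from the surrounding script, and MSVC may still require a `_lightgbm.lib` import library, so this addresses only the filename mismatch:

```rust
// Sketch: on Windows targets, also expose the DLL under the name that the
// `-l_lightgbm` link directive implies. lib_dir is assumed to be the libs/
// directory populated earlier in build.rs.
if std::env::var("CARGO_CFG_TARGET_OS").as_deref() == Ok("windows") {
    std::fs::copy(
        lib_dir.join("lib_lightgbm.dll"),
        lib_dir.join("_lightgbm.dll"),
    )?;
}
```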
Cargo.toml (1)
12-14: Feature set for Polars is correct; no action needed

The `df!` macro in Polars 0.45 doesn't require special feature flags—it's available by default. The current configuration with `features = ["dtype-full"]` is sufficient. Additionally, the doctest in `src/polars_ext.rs` is marked `no_run`, so it won't be compiled or executed during testing. The feature wiring is correct and the code requires no changes.
````rust
/// Extension trait for LightGBM Booster to support Polars DataFrames
pub trait BoosterPolarsExt {
    /// Predict using a Polars DataFrame as input
    ///
    /// This method efficiently converts the DataFrame to the format LightGBM expects
    /// and runs prediction. All numeric columns will be used as features.
    ///
    /// # Arguments
    /// * `df` - Input DataFrame with numeric features
    /// * `predict_type` - Prediction type (see `predict_type` module constants)
    ///
    /// # Returns
    /// A vector of prediction values
    ///
    /// # Example
    /// ```no_run
    /// # use lightgbm_rust::{Booster, BoosterPolarsExt, predict_type};
    /// # use polars::prelude::*;
    /// let booster = Booster::load("model.txt").unwrap();
    ///
    /// let df = df! {
    ///     "feature1" => [1.0f32, 2.0, 3.0],
    ///     "feature2" => [4.0f32, 5.0, 6.0],
    /// }.unwrap();
    ///
    /// let predictions = booster.predict_dataframe(&df, predict_type::NORMAL).unwrap();
    /// ```
    fn predict_dataframe(
        &self,
        df: &DataFrame,
        predict_type: i32,
    ) -> LightGBMResult<Vec<f64>>;

    /// Predict using specific columns from a Polars DataFrame
    ///
    /// # Arguments
    /// * `df` - Input DataFrame
    /// * `columns` - Column names to use as features (in order)
    /// * `predict_type` - Prediction type
    fn predict_dataframe_with_columns(
        &self,
        df: &DataFrame,
        columns: &[&str],
        predict_type: i32,
    ) -> LightGBMResult<Vec<f64>>;
}

impl BoosterPolarsExt for Booster {
    fn predict_dataframe(
        &self,
        df: &DataFrame,
        predict_type: i32,
    ) -> LightGBMResult<Vec<f64>> {
        let (data, num_rows, num_cols) = dataframe_to_dense(df)?;
        self.predict_f32(&data, num_rows, num_cols, predict_type)
    }

    fn predict_dataframe_with_columns(
        &self,
        df: &DataFrame,
        columns: &[&str],
        predict_type: i32,
    ) -> LightGBMResult<Vec<f64>> {
        let column_names: Vec<String> = columns.iter().map(|s| s.to_string()).collect();
        let selected = df.select(column_names).map_err(|e| LightGBMError {
            description: format!("Failed to select columns: {}", e),
        })?;

        let (data, num_rows, num_cols) = dataframe_to_dense(&selected)?;
        self.predict_f32(&data, num_rows, num_cols, predict_type)
    }
}
````
🧩 Analysis chain
BoosterPolarsExt API is well-shaped; consider clarifying null / non‑numeric behavior
The extension trait and its implementation on Booster look coherent: both methods ultimately go through predict_f32 with a dense row‑major buffer, and predict_dataframe_with_columns preserves column order.
Behavioral details that are worth documenting more explicitly:
- `dataframe_to_dense` will currently fail on any column that is neither numeric nor boolean (via `extract_f32_value`'s "Unsupported data type" error), even though the docstring says "all numeric columns will be used". Non‑numeric columns cause an error, not silent skipping.
- Any null values cause an error ("Null value at index …") rather than, e.g., being imputed to 0.0 or NaN.
I’d suggest either:
- updating the docs to state that all columns must be numeric or boolean and non‑null, or
- changing the conversion logic to skip unsupported columns and/or map nulls in a well‑defined way.
🏁 Script executed:

```shell
#!/bin/bash
# Grep for callers of the new methods to see current expectations about nulls and dtypes.
rg -n "predict_dataframe(_with_columns)?\s*\(" src tests examples
```

Length of output: 369

🏁 Script executed:

```shell
#!/bin/bash
# Find dataframe_to_dense and extract_f32_value implementations
rg -n "fn dataframe_to_dense|fn extract_f32_value" src --type rust -A 20
```

Length of output: 2390

🏁 Script executed:

```shell
#!/bin/bash
# Check what directories exist and search for tests
fd -t d "test|example" . 2>/dev/null | head -20
```

Length of output: 11

🏁 Script executed:

```shell
#!/bin/bash
# Look for any test files related to polars or prediction
fd -t f ".*test.*\.rs|.*spec.*\.rs" . 2>/dev/null | head -20
```

Length of output: 0

🏁 Script executed:

```shell
#!/bin/bash
# Get complete extract_f32_value implementation
sed -n '109,160p' src/polars_ext.rs
```

Length of output: 2080

🏁 Script executed:

```shell
#!/bin/bash
# Check examples directory for any polars-related usage
fd -t f . examples/ --type rust
```

Length of output: 171

🏁 Script executed:

```shell
#!/bin/bash
# Search examples for predict_dataframe usage
rg -n "predict_dataframe" examples/ -A 3 -B 3
```

Length of output: 0

🏁 Script executed:

```shell
#!/bin/bash
# Get rest of extract_f32_value function including Boolean and default case
sed -n '150,180p' src/polars_ext.rs
```

Length of output: 1254

🏁 Script executed:

```shell
#!/bin/bash
# Find where extract_f32_value function ends
rg -n "^}" src/polars_ext.rs | head -10
```

Length of output: 21

🏁 Script executed:

```shell
#!/bin/bash
# Get full function in one shot - safer approach
awk '/^fn extract_f32_value/,/^}/' src/polars_ext.rs | tail -30
```

Length of output: 956
Clarify null/dtype handling in docstring to match implementation
Verification confirms the implementation correctly handles nulls and unsupported types, but the API documentation is misleading:
The docstring states "All numeric columns will be used as features," yet the implementation errors when encountering any non-numeric or non-boolean column, rather than automatically skipping them. Additionally, any null values cause an error ("Null value at index…") instead of being handled gracefully.
Supported types: Float32, Float64, Int8–64, UInt8–64, Boolean (converts true→1.0, false→0.0). Unsupported types trigger an error: "Unsupported data type for conversion to f32."
Update the docstrings for `predict_dataframe` and `predict_dataframe_with_columns` (lines 8–9 and 38) to explicitly state (a sketch follows this list):
- All columns must be numeric, boolean, or a supported integer type
- Null values are not permitted and will cause an error
- Boolean columns are supported and converted to 0.0/1.0
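A possible wording for the revised doc comment, as a sketch (adapt the phrasing to the crate's doc style):

```rust
/// Predict using a Polars DataFrame as input.
///
/// Every column of `df` is used as a feature and must be numeric
/// (Float32/64, Int8-64, UInt8-64) or Boolean (converted to 1.0/0.0).
/// Columns of any other dtype, and any null value, cause an error;
/// nothing is skipped or imputed.
fn predict_dataframe(
    &self,
    df: &DataFrame,
    predict_type: i32,
) -> LightGBMResult<Vec<f64>>;
```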
```rust
/// Extract an f32 value from a Series at the given index
///
/// Supports conversion from all numeric types and booleans.
fn extract_f32_value(series: &Series, idx: usize) -> LightGBMResult<f32> {
    use DataType::*;

    match series.dtype() {
        Float32 => {
            let ca = series.f32().map_err(|e| LightGBMError {
                description: format!("Failed to cast to f32: {}", e),
            })?;
            ca.get(idx).ok_or_else(|| LightGBMError {
                description: format!("Null value at index {}", idx),
            })
        }
        Float64 => {
            let ca = series.f64().map_err(|e| LightGBMError {
                description: format!("Failed to cast to f64: {}", e),
            })?;
            Ok(ca.get(idx).ok_or_else(|| LightGBMError {
                description: format!("Null value at index {}", idx),
            })? as f32)
        }
        Int8 => {
            let ca = series.i8().map_err(|e| LightGBMError {
                description: format!("Failed to cast to i8: {}", e),
            })?;
            Ok(ca.get(idx).ok_or_else(|| LightGBMError {
                description: format!("Null value at index {}", idx),
            })? as f32)
        }
        Int16 => {
            let ca = series.i16().map_err(|e| LightGBMError {
                description: format!("Failed to cast to i16: {}", e),
            })?;
            Ok(ca.get(idx).ok_or_else(|| LightGBMError {
                description: format!("Null value at index {}", idx),
            })? as f32)
        }
        Int32 => {
            let ca = series.i32().map_err(|e| LightGBMError {
                description: format!("Failed to cast to i32: {}", e),
            })?;
            Ok(ca.get(idx).ok_or_else(|| LightGBMError {
                description: format!("Null value at index {}", idx),
            })? as f32)
        }
        Int64 => {
            let ca = series.i64().map_err(|e| LightGBMError {
                description: format!("Failed to cast to i64: {}", e),
            })?;
            Ok(ca.get(idx).ok_or_else(|| LightGBMError {
                description: format!("Null value at index {}", idx),
            })? as f32)
        }
        UInt8 => {
            let ca = series.u8().map_err(|e| LightGBMError {
                description: format!("Failed to cast to u8: {}", e),
            })?;
            Ok(ca.get(idx).ok_or_else(|| LightGBMError {
                description: format!("Null value at index {}", idx),
            })? as f32)
        }
        UInt16 => {
            let ca = series.u16().map_err(|e| LightGBMError {
                description: format!("Failed to cast to u16: {}", e),
            })?;
            Ok(ca.get(idx).ok_or_else(|| LightGBMError {
                description: format!("Null value at index {}", idx),
            })? as f32)
        }
        UInt32 => {
            let ca = series.u32().map_err(|e| LightGBMError {
                description: format!("Failed to cast to u32: {}", e),
            })?;
            Ok(ca.get(idx).ok_or_else(|| LightGBMError {
                description: format!("Null value at index {}", idx),
            })? as f32)
        }
        UInt64 => {
            let ca = series.u64().map_err(|e| LightGBMError {
                description: format!("Failed to cast to u64: {}", e),
            })?;
            Ok(ca.get(idx).ok_or_else(|| LightGBMError {
                description: format!("Null value at index {}", idx),
            })? as f32)
        }
        Boolean => {
            let ca = series.bool().map_err(|e| LightGBMError {
                description: format!("Failed to cast to bool: {}", e),
            })?;
            Ok(if ca.get(idx).ok_or_else(|| LightGBMError {
                description: format!("Null value at index {}", idx),
            })? {
                1.0
            } else {
                0.0
            })
        }
        dt => Err(LightGBMError {
            description: format!(
                "Unsupported data type for conversion to f32: {}",
                dt
            ),
        }),
    }
}
```
🧩 Analysis chain
Comprehensive dtype coverage; consider sharing casting logic and tightening messages
extract_f32_value covers the expected numeric and boolean dtypes and produces clear error messages for cast failures, nulls, and unsupported types. A couple of small improvements you might consider:
- The per‑dtype branches are repetitive; a small helper like `fn cast_and_get<T>(...)` where `T: PolarsNumericType` could reduce boilerplate and centralize the "Null value at index …" message; see the sketch after this list.
- For out‑of‑bounds indexes, `ca.get(idx)` returns `None`, but the error messages talk about nulls. If you want to distinguish real nulls from indexing mistakes, you could explicitly check `idx < ca.len()` and tailor the error messages.
Overall, the behavior is correct; these are mostly clarity / maintainability tweaks.
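For instance, a non-generic variant of that helper could cast the Series once and then do the bounds/null checks in one place. A sketch (in practice you would cast each column once outside the row loop rather than per value):

```rust
// Sketch: shared casting + bounds + null handling for all supported dtypes.
fn cast_and_get(series: &Series, idx: usize) -> LightGBMResult<f32> {
    let casted = series.cast(&DataType::Float32).map_err(|e| LightGBMError {
        description: format!("Failed to cast to f32: {}", e),
    })?;
    let ca = casted.f32().map_err(|e| LightGBMError {
        description: format!("Failed to unpack f32 column: {}", e),
    })?;
    if idx >= ca.len() {
        return Err(LightGBMError {
            description: format!("Index {} out of bounds (length: {})", idx, ca.len()),
        });
    }
    ca.get(idx).ok_or_else(|| LightGBMError {
        description: format!("Null value at index {}", idx),
    })
}
```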
🌐 Web query:
Verify how `ChunkedArray::get` behaves in Polars 0.45: does it return `None` both for nulls and out-of-bounds, or just nulls?
💡 Result:
ChunkedArray::get returns None only for NULL values; an out-of-bounds index will panic. [1]
Sources
- ChunkedArray::get docs (notes None = NULL; function will panic if idx is out of bounds). [1]
Out-of-bounds indexes will cause panics; add explicit bounds checking
The code calls ca.get(idx) without verifying that idx is within bounds. Unlike what the error messages suggest, ChunkedArray::get returns None only for NULL values in the data; an out-of-bounds index will panic. Add an explicit bounds check before accessing the element:
```rust
if idx >= ca.len() {
    return Err(LightGBMError {
        description: format!("Index {} out of bounds (length: {})", idx, ca.len()),
    });
}
let value = ca.get(idx).ok_or_else(|| LightGBMError {
    description: format!("Null value at index {}", idx),
})?;
```

As a secondary improvement, the per-dtype branches are repetitive; consider a small helper to centralize casting and bounds/null checking logic.
🤖 Prompt for AI Agents
In src/polars_ext.rs around lines 106 to 212, the code calls ca.get(idx) for
each branch without checking bounds which can panic for out-of-range indexes;
add an explicit bounds check on each ChunkedArray (if idx >= ca.len() return
Err(LightGBMError { description: format!("Index {} out of bounds (length: {})",
idx, ca.len()) })) before calling ca.get, then keep the existing ok_or_else null
check to distinguish NULL values; optionally refactor the repeated pattern into
a small helper that accepts a typed ChunkedArray, performs the bounds check,
checks for NULL, and returns the typed value cast to f32 to reduce duplication.
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/polars_ext.rs (1)
5-32: Clarify Polars DataFrame docs to match dtype/null requirements

The docs say "all numeric columns will be used as features" and accept a generic "DataFrame with numeric features," but the implementation:
- Uses all columns (not just numeric ones) and will error out if a column cannot be cast to `Float64`.
- Treats any null as an error (`Null value at row …, col …`), rather than skipping/imputing.

Either tighten the docstrings for `predict_dataframe` and `predict_dataframe_with_columns` to state that all selected columns must be castable to `Float64` and contain no nulls, or relax the implementation to skip unsupported columns / handle nulls explicitly. This overlaps with the earlier feedback on dtype/null semantics.

Also applies to: 40-45
🧹 Nitpick comments (2)
src/polars_ext.rs (1)
70-109: Guard against extreme shapes when flattening DataFrame

The row‑major flattening and `Float64` casting are implemented correctly, but two edge cases are worth hardening:

- `total_elements = num_rows * num_cols` can overflow `usize` for extremely large DataFrames.
- `num_rows`/`num_cols` are cast to `i32` without checking they fit, but `Booster::predict` expects `i32` sizes.

Consider:
- Using `checked_mul` for `total_elements` and returning a `LightGBMError` on overflow.
- Validating that `num_rows` and `num_cols` are ≤ `i32::MAX` before casting, returning an error otherwise.

This keeps behavior unchanged for normal sizes while avoiding surprising panics on pathological inputs; a sketch follows.
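A sketch of both guards, using the names quoted above:

```rust
// Sketch: reject shapes that would overflow usize or exceed what the
// LightGBM C API can address with i32 sizes.
let total_elements = num_rows.checked_mul(num_cols).ok_or_else(|| LightGBMError {
    description: format!("{} rows x {} cols overflows usize", num_rows, num_cols),
})?;
if num_rows > i32::MAX as usize || num_cols > i32::MAX as usize {
    return Err(LightGBMError {
        description: "DataFrame dimensions exceed i32::MAX".to_string(),
    });
}
let mut data = Vec::with_capacity(total_elements);
```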
build.rs (1)
397-398: Linking and rpath setup are coherent; verify across OS/toolchains

Copying `lib_lightgbm.*` into `target/{PROFILE}` and:
- adding `out_dir/libs` to `cargo:rustc-link-search`,
- setting rpaths for macOS (`@executable_path`, target debug/release, and `lib_search_path`) and Linux (`$ORIGIN`, `$ORIGIN/../..`, and `lib_search_path`),
- and linking with `cargo:rustc-link-lib=dylib=_lightgbm` (so the linker looks for `lib_lightgbm.*`),

is internally consistent and should pick up the extracted library in typical Cargo layouts.
Given the cross‑platform nuances (MSVC vs MinGW, test binaries under `target/*/deps`, etc.), it's worth running builds and tests on all three targets (macOS, Linux, Windows) to confirm that the linker finds the library and that the runtime loader respects the rpaths as expected.

Also applies to: 413-415, 419-425, 433-435, 440-440
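For reference, the general shape of such directives in a build script (a sketch; `lib_search_path` is the name used above, and the exact rpath set should match what build.rs actually emits):

```rust
// Sketch: link against lib_lightgbm.* and teach the runtime loader where
// to find it, per target OS.
let target_os = std::env::var("CARGO_CFG_TARGET_OS").unwrap_or_default();
println!("cargo:rustc-link-search=native={}", lib_search_path.display());
println!("cargo:rustc-link-lib=dylib=_lightgbm");
if target_os == "macos" {
    println!("cargo:rustc-link-arg=-Wl,-rpath,@executable_path");
} else if target_os == "linux" {
    println!("cargo:rustc-link-arg=-Wl,-rpath,$ORIGIN");
}
```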
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- Cargo.toml (1 hunks)
- build.rs (8 hunks)
- examples/advanced_usage.rs (3 hunks)
- examples/basic_usage.rs (2 hunks)
- src/error.rs (1 hunks)
- src/lib.rs (1 hunks)
- src/model.rs (5 hunks)
- src/polars_ext.rs (1 hunks)
✅ Files skipped from review due to trivial changes (2)
- examples/basic_usage.rs
- examples/advanced_usage.rs
🚧 Files skipped from review as they are similar to previous changes (2)
- src/lib.rs
- Cargo.toml
🧰 Additional context used
🧬 Code graph analysis (1)
build.rs (1)
src/model.rs (1)
- `drop` (299-303)
🔇 Additional comments (7)
src/model.rs (4)
51-57: Path and NUL-byte validation in `Booster::load` looks solid

Converting `Path` to UTF‑8 and guarding against interior NULs with `CString::new`, plus descriptive `LightGBMError` messages, is correct and keeps FFI usage safe.

85-88: NUL-byte handling for `load_from_string` is correct

Wrapping `CString::new(model_str)` with a mapped `LightGBMError` cleanly surfaces invalid model strings without UB.

116-120: `load_from_buffer` UTF‑8 conversion is appropriate

Using `std::str::from_utf8` with a clear error message is a good way to validate text‑based model buffers before delegating to `load_from_string`.

171-176: Improved size-mismatch error messages aid debugging

The expanded `format!` for data length mismatches in `predict`/`predict_f32` preserves logic while making diagnostics more actionable.

Also applies to: 245-248
src/error.rs (1)
1-1: Single top-level `crate::sys` import is fine

Centralizing the `sys` import at the top avoids duplication and keeps the error module tidy without changing behavior.

src/polars_ext.rs (1)
48-67: BoosterPolarsExt implementation correctly reuses existing prediction path

Using `dataframe_to_dense` followed by `Booster::predict` (with optional column selection via `df.select`) is a clean integration and preserves column order as provided in `columns`.
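For example, a caller using the column-selecting variant might look like this (the column names are hypothetical):

```rust
// Hypothetical usage: predict on two named feature columns, in order.
let predictions = booster.predict_dataframe_with_columns(
    &df,
    &["feature1", "feature2"],
    predict_type::NORMAL,
)?;
```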
57-59: Header and optional Arrow downloads: status checks and logging look good

Switching to 2xx range checks and adding explicit logging around the optional `arrow.h`/`arrow.tpp` downloads improves robustness and debuggability without changing behavior for successful cases.

Also applies to: 75-77, 90-93, 108-111, 126-128
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- Cargo.toml (1 hunks)
- build.rs (8 hunks)
🔇 Additional comments (5)
Cargo.toml (2)
12-15: Polars feature flagging and rayon dependency look good; verify MSRV alignment.

Optional `polars` with `default-features = false` and a dedicated `polars` feature, plus `rayon` as a runtime dep, is a clean way to gate the new integration. One thing to double-check is that `polars = "0.45"` and `rayon = "1.10"` are compatible with the crate's declared `rust-version = "1.70"` and with your intended supported toolchains.

Also applies to: 25-25
18-20: Build-dependency updates match the new build pipeline.

The `ureq` (with `json` feature), `zip`, and `serde_json` build-dependencies align with the PyPI JSON lookup and wheel-extraction logic in `build.rs`. I don't see issues here; just ensure these versions are acceptable for your CI/build environments and any MSRV constraints.

build.rs (3)
58-60: Header download status checks and optional Arrow files are handled cleanly.

Using `200..300` status checks for `c_api.h`/`export.h` and treating `arrow.h`/`arrow.tpp` as optional with explicit logging gives predictable behavior across LightGBM versions without failing builds when Arrow headers are absent. No issues here.

Also applies to: 76-78, 91-95, 109-112, 127-129
197-252: GitHub release fallback is pragmatic and in line with LightGBM assets.

Trying several plausible asset names (`lib_lightgbm_{os}[_{arch}].{ext}`, then plain `lib_lightgbm.{ext}`) and normalizing the saved file name to `lib_lightgbm.{ext}` in `libs` is a reasonable strategy when PyPI wheels aren't available. Error messaging that includes the attempted asset names will be helpful when debugging.
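Illustratively, the candidate list might be built like this (a sketch mirroring the patterns named above; `os`, `arch`, and `ext` are assumed from the surrounding function):

```rust
// Sketch: GitHub release asset names to try, most specific first.
let candidates = [
    format!("lib_lightgbm_{}_{}.{}", os, arch, ext),
    format!("lib_lightgbm_{}.{}", os, ext),
    format!("lib_lightgbm.{}", ext),
];
```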
431-431: Copying the library to the target dir and expanded rpaths should improve runtime UX.

Copying `lib_lightgbm.*` from `OUT_DIR/libs` into `target/{PROFILE}` and adding rpaths for both the crate's `libs` directory and the debug/release target roots (on macOS and Linux) should make it much easier to run binaries/tests without extra manual setup. Linking against `dylib=_lightgbm` is consistent with the `lib_lightgbm.*` naming convention used earlier in the script.

Also applies to: 447-450, 453-460, 468-470, 475-475
```rust
use std::path::{Path, PathBuf};
use serde_json::Value;
```
🧩 Analysis chain
PyPI JSON wheel discovery is a solid improvement; tighten Windows arch handling.
Switching to the PyPI JSON API and matching against expected wheel filenames per (os, arch) is much more robust than hardcoding files.pythonhosted paths. One corner case: the ("windows", _) arm will also be used for non‑x86_64 Windows targets but still looks for a win_amd64 wheel, which could silently pick a 64‑bit wheel for a 32‑bit target. It’s safer to restrict this arm to x86_64 and let other Windows arches fall through to the “unsupported platform” error (and GitHub fallback).
You can do this minimally as:
- ("windows", _) => vec![format!("lightgbm-{}-py3-none-win_amd64.whl", version)],
+ ("windows", "x86_64") => vec![format!("lightgbm-{}-py3-none-win_amd64.whl", version)],Optionally, you could also filter urls by packagetype == "bdist_wheel" rather than only by filename, but that’s not strictly required given the explicit name patterns.
Also applies to: 136-194
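A sketch of that optional filter over the PyPI JSON `urls` array (`urls` as a `serde_json::Value`; `expected_names` stands in for the per-platform filename list assumed above):

```rust
// Sketch: match on packagetype in addition to the expected filename.
let wheel = urls.as_array().into_iter().flatten().find(|entry| {
    entry["packagetype"].as_str() == Some("bdist_wheel")
        && entry["filename"]
            .as_str()
            .map_or(false, |f| expected_names.iter().any(|n| *n == f))
});
```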
I'll examine the build.rs file to verify the review comment's concern about Windows architecture handling.
🏁 Script executed:

```shell
#!/bin/bash
# Find and examine build.rs
fd -name "build.rs" -type f
```

Length of output: 230

🏁 Script executed:

```shell
#!/bin/bash
# Display the full context around the reported lines
if [ -f "build.rs" ]; then
  echo "=== Lines 1-40 (context for imports) ==="
  sed -n '1,40p' build.rs
  echo ""
  echo "=== Lines 130-200 (platform matching logic) ==="
  sed -n '130,200p' build.rs
fi
```

Length of output: 3542

🏁 Script executed:

```shell
#!/bin/bash
# Search for windows architecture patterns in build.rs
if [ -f "build.rs" ]; then
  echo "=== Searching for 'windows' references ==="
  rg -n "windows" build.rs -A2 -B2
fi
```

Length of output: 1243

🏁 Script executed:

```shell
#!/bin/bash
# Check usage of Path and PathBuf in build.rs
if [ -f "build.rs" ]; then
  echo "=== Usage of Path and PathBuf ==="
  rg "Path|PathBuf" build.rs -i
fi
```

Length of output: 2488

🏁 Script executed:

```shell
#!/bin/bash
# Check for any references to i686 or other Windows variants
if [ -f "build.rs" ]; then
  echo "=== References to i686 and architecture comments ==="
  rg "i686|i586|win32|32.bit" build.rs -i -A1 -B1
fi
```

Length of output: 165
Restrict Windows wheel matching to x86_64 architecture.
The ("windows", _) pattern matches any Windows architecture but only looks for win_amd64 wheels. Since get_platform_info() recognizes 32-bit Windows targets (i686), they would incorrectly attempt to fetch a 64-bit wheel instead of erroring out.
Restrict this to ("windows", "x86_64"):
- ("windows", _) => vec![format!("lightgbm-{}-py3-none-win_amd64.whl", version)],
+ ("windows", "x86_64") => vec![format!("lightgbm-{}-py3-none-win_amd64.whl", version)],This allows other Windows architectures (i686, aarch64) to fall through to the unsupported platform error, which is the correct behavior. Apply at line 171.
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In build.rs around line 171, the match arm currently uses ("windows", _) which
matches any Windows architecture but only tries to locate win_amd64 wheels;
change that pattern to ("windows", "x86_64") so only 64-bit Windows targets are
handled and other Windows architectures (i686, aarch64) will fall through to the
unsupported platform error; update any corresponding pattern variants that
reference ("windows", _) to the more specific ("windows", "x86_64") to ensure
correct platform handling.
```rust
fn download_compiled_library(out_dir: &Path) -> Result<(), Box<dyn std::error::Error>> {
    let (os, arch) = get_platform_info();
    let version = get_lightgbm_version();

    // Create the library directory
    let lib_dir = out_dir.join("libs");
    fs::create_dir_all(&lib_dir)?;

    // Strategy 1: Try PyPI JSON API to find the correct wheel URL (most reliable)
    println!("cargo:warning=Querying PyPI for LightGBM wheel...");
    match find_wheel_url_from_pypi(&version, &os, &arch) {
        Ok(wheel_url) => {
            println!("cargo:warning=Downloading wheel from: {}", wheel_url);

            let response = ureq::get(&wheel_url).call()?;
            let status = response.status();
            if !(200..300).contains(&status) {
                return Err(format!("Failed to download wheel: HTTP {}", status).into());
            }

            // Extract filename from URL
            let wheel_name = wheel_url
                .split('/')
                .last()
                .unwrap_or("lightgbm.whl")
                .to_string();
            let wheel_path = out_dir.join(&wheel_name);

            // Download wheel
            let mut wheel_file = fs::File::create(&wheel_path)?;
            io::copy(&mut response.into_reader(), &mut wheel_file)?;
            drop(wheel_file);

            println!("cargo:warning=✓ Downloaded wheel");

            // Extract the library from the wheel
            println!("cargo:warning=Extracting library from wheel");

            let file = fs::File::open(&wheel_path)?;
            let mut archive = zip::ZipArchive::new(file)?;

            // Determine the library extension
            let lib_extension = match os.as_str() {
                "darwin" => "dylib",
                "linux" => "so",
                "windows" => "dll",
                _ => return Err(format!("Unsupported OS: {}", os).into()),
            };

            // Look for the library in the wheel
            let mut found = false;
            for i in 0..archive.len() {
                let mut file = archive.by_index(i)?;
                let name = file.name().to_string();

                if name.contains("lib_lightgbm") && name.ends_with(lib_extension) {
                    println!("cargo:warning=Found library at: {}", name);

                    let lib_path = lib_dir.join(format!("lib_lightgbm.{}", lib_extension));
                    let mut outfile = fs::File::create(&lib_path)?;
                    io::copy(&mut file, &mut outfile)?;

                    println!(
                        "cargo:warning=✓ Successfully extracted LightGBM library to: {}",
                        lib_dir.display()
                    );
                    found = true;
                    break;
                }
            }

            if !found {
                return Err(format!(
                    "Could not find lib_lightgbm.{} in wheel",
                    lib_extension
                )
                .into());
            }

            Ok(())
        }
        Err(pypi_err) => {
            // Strategy 2: Fallback to GitHub releases
            println!(
                "cargo:warning=PyPI wheel not found ({}), trying GitHub releases as fallback...",
                pypi_err
            );
            match try_github_release(&version, &os, &arch, &lib_dir) {
                Ok(_) => Ok(()),
                Err(github_err) => {
                    // Strategy 3: Final error with helpful message
                    Err(format!(
                        "Failed to download LightGBM library:\n\
                         - PyPI: {}\n\
                         - GitHub releases: {}\n\
                         \n\
                         Please try:\n\
                         1. Using a different LightGBM version (set LIGHTGBM_VERSION env var)\n\
                         2. Building LightGBM from source and setting LIGHTGBM_LIB_DIR env var\n\
                         3. Checking https://pypi.org/project/lightgbm/{}/",
                        pypi_err, github_err, version
                    )
                    .into())
                }
            }
        }
    }
}
```
🧩 Analysis chain
Wheel download/extraction flow looks good; fix stale env-var advice and tighten library match.
The overall strategy in download_compiled_library—PyPI wheel first, then GitHub fallback, with clear warnings and a final aggregated error—is solid. Two details worth adjusting:
- Stale `LIGHTGBM_LIB_DIR` guidance in the error message.
  The final error suggests "building LightGBM from source and setting `LIGHTGBM_LIB_DIR` env var", but there's no code in this `build.rs` that actually reads `LIGHTGBM_LIB_DIR`. That can confuse users trying to recover from failures. Either implement a pre-check that honors `LIGHTGBM_LIB_DIR` (and skips network downloads), or simplify the guidance text to avoid mentioning an env var that's not consumed. If you don't want to implement the override right now, a minimal doc fix would be:

  ```diff
  -  2. Building LightGBM from source and setting LIGHTGBM_LIB_DIR env var\n\
  +  2. Building LightGBM from source and ensuring the resulting shared library is discoverable on your system (e.g., via library search path or colocated with the binary)\n\
  ```

- Library filename detection in the wheel.
  The extraction loop matches entries with `name.contains("lib_lightgbm") && name.ends_with(lib_extension)`. To avoid accidental matches on paths that merely end with "so"/"dll"/"dylib", you could check for the full ".ext" suffix:

  ```diff
  - if name.contains("lib_lightgbm") && name.ends_with(lib_extension) {
  + if name.contains("lib_lightgbm")
  +     && name.ends_with(&format!(".{}", lib_extension))
  + {
  ```

  This is mostly defensive, but makes the intent clearer.
I'll verify the review comment by checking the codebase. Let me search for the key concerns.
---
🏁 Script executed:

```shell
#!/bin/bash
# Find build.rs and check if LIGHTGBM_LIB_DIR is actually used
rg -n "LIGHTGBM_LIB_DIR" --type rust -C 3
```

Length of output: 507

🏁 Script executed:

```shell
#!/bin/bash
# Find build.rs and display lines 254-360
fd "build\.rs" -x cat {} | head -400 | tail -150
```

Length of output: 5648

🏁 Script executed:

```shell
#!/bin/bash
# Search for any env var reading in build.rs that might handle library paths
fd "build\.rs" -x rg -n "env::" {} -C 2
```

Length of output: 587
Remove stale LIGHTGBM_LIB_DIR reference and tighten library filename matching.
The PyPI/GitHub fallback strategy is sound, but two details need adjusting:
- Stale env-var guidance. The error message (line 352) recommends "setting `LIGHTGBM_LIB_DIR` env var", but `build.rs` never reads this variable. Remove the reference or replace it with actionable guidance about system library paths.
- Loose library filename detection. Lines 306-307 use `name.ends_with(lib_extension)` where `lib_extension` is "so", "dll", or "dylib" (without a leading dot). This matches any file ending with those strings, not just the extension. Change to `name.ends_with(&format!(".{}", lib_extension))` to match only ".so", ".dll", or ".dylib" suffixes.
🤖 Prompt for AI Agents
In build.rs around lines 254 to 360, update two items: 1) tighten the library
filename check by changing the ends_with test to match the extension with a
leading dot (use name.ends_with(&format!(".{}", lib_extension))) so only files
ending in ".so", ".dll", or ".dylib" are accepted; 2) remove the stale "setting
LIGHTGBM_LIB_DIR env var" guidance from the final error message (or replace it
with actionable guidance such as checking system library paths or setting an
actual env var your build reads), and adjust the multi-line error text
accordingly.