
support polars. #3

Open

aryehlev wants to merge 4 commits into main from polars

Conversation

@aryehlev (Owner) commented Nov 17, 2025

Summary by CodeRabbit

  • New Features

    • Polars DataFrame integration: predict directly from DataFrames or selected columns.
  • Chores

    • Improved runtime library acquisition: prefers platform wheels, falls back to releases, extracts and installs native libs with better diagnostics.
    • Build/tooling updates: added optional Polars support, parallelism/runtime helpers, archive support, and updated build-time JSON/http tooling.
    • Examples updated with explicit example entries.

@coderabbitai Bot commented Nov 17, 2025

Walkthrough

Adds an optional Polars integration and feature flag, implements a Polars extension trait for predictions from DataFrames, refactors build.rs to prefer PyPI wheel discovery/extraction (fallback to GitHub releases) to obtain platform libs, updates Cargo.toml with dependencies/features and example paths, and includes minor formatting tweaks.

Changes

  • Cargo manifest — Cargo.toml
    Added optional polars = "0.45" (dtype-full, default-features = false) and rayon = "1.10" dependencies; added build-dependency zip = "0.6"; updated serde_json = "1.0" and ureq to include features = ["json"]; added feature polars = ["dep:polars"]; set path fields for [[example]] entries.
  • Build acquisition & linking — build.rs
    Replaced direct binary downloads with PyPI wheel discovery and extraction per OS/arch; extracts the platform library (lib_lightgbm.{dylib|so|dll}) into libs/ and emits link directives, falling back to GitHub releases when no wheel is found.
  • Polars feature gate & export — src/lib.rs
    Added #[cfg(feature = "polars")] mod polars_ext; and #[cfg(feature = "polars")] pub use crate::polars_ext::BoosterPolarsExt;.
  • Polars integration implementation — src/polars_ext.rs
    New public trait BoosterPolarsExt with predict_dataframe and predict_dataframe_with_columns and its Booster impl; converts a Polars DataFrame (or selected columns) to a dense row-major Vec<f32> with dtype and null checks, then calls the existing predict_f32; maps conversion errors to the LightGBM error type.
  • Examples — examples/basic_usage.rs, examples/advanced_usage.rs
    Formatting and import-order adjustments only; no behavioral changes.
  • Minor housekeeping — src/error.rs, src/model.rs
    Import ordering and formatting tweaks; no API or semantic changes.

Sequence Diagram(s)

sequenceDiagram
    participant Build as build.rs
    participant PyPI as PyPI API
    participant Wheel as .whl (Zip)
    participant GitHub as GitHub Releases
    participant Libs as libs/

    Build->>PyPI: query wheel index for OS/arch
    alt wheel found
        PyPI-->>Build: wheel URL
        Build->>Wheel: download wheel -> OUT_DIR
        Build->>Wheel: open ZipArchive, locate lib_lightgbm.*
        Wheel-->>Build: extracted library bytes
        Build->>Libs: write lib_lightgbm.{dylib|so|dll}
    else wheel not found
        Build->>GitHub: try release asset patterns
        alt asset found
            GitHub-->>Build: asset URL
            Build->>Libs: download & write library
        else
            Build-->>Build: emit detailed failure with remediation
        end
    end
    Build->>Build: emit cargo:rustc-link-lib=_lightgbm and search paths
sequenceDiagram
    participant User as calling code
    participant Booster
    participant PolarsExt as BoosterPolarsExt
    participant DF as Polars DataFrame
    participant Converter as DF→Vec<f32>
    participant Predictor as predict_f32

    User->>Booster: predict_dataframe(df, predict_type)
    Booster->>PolarsExt: invoke trait impl
    PolarsExt->>DF: select columns & validate dtypes/nulls
    DF->>Converter: iterate rows, cast to f32 (row-major)
    Converter-->>PolarsExt: Vec<f32>
    PolarsExt->>Predictor: call predict_f32(&data, ...)
    Predictor-->>PolarsExt: Vec<f64> (predictions)
    PolarsExt-->>User: return predictions or mapped error

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Areas needing extra attention:
    • build.rs: PyPI query patterns, wheel filename matching per OS/arch, Zip extraction logic (path matching for lib_lightgbm), rpath and search-path emission, and fallback-to-GitHub correctness.
    • src/polars_ext.rs: correctness and performance of the DataFrame→Vec<f32> conversion, casting behavior, null handling, supported Series types, and error mapping to existing error types.
    • Cargo.toml: feature wiring for optional dep polars and added examples' paths.

Poem

🐰 I fetched a wheel and nibbled code with care,
Pulled out a lib for Mac, Linux, Windows to share,
Polars rows lined up as f32s, neat and spry,
Predictions hop back quick — a carrot of a reply! 🥕

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Description Check: Passed. Skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: Passed. The title 'support polars.' directly and clearly describes the main change: adding Polars DataFrame integration to the LightGBM Booster through a new extension trait and feature flag.
  • Docstring Coverage: Passed. Docstring coverage is 100.00%, above the required threshold of 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch polars

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (3)
src/polars_ext.rs (2)

78-104: Row‑major conversion is correct but can overflow for extremely wide/long frames

dataframe_to_dense produces num_rows * num_cols f32 values in row‑major order, which matches the nested for row { for col { .. } } and is what predict_f32 is expecting.

Two minor edge‑case concerns:

  • Vec::with_capacity(num_rows * num_cols) uses usize multiplication; for very large DataFrames this can overflow (panic in debug, wrap in release). Given LightGBM itself uses 32‑bit sizes, this is unlikely in practice but still a theoretical footgun.
  • num_rows and num_cols are cast to i32 when returned. If a DataFrame exceeds i32::MAX rows or columns, this will truncate and misreport sizes to LightGBM.

You might want to guard with a check like if num_rows > i32::MAX as usize || num_cols > i32::MAX as usize { ... } and return a clear error.
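A minimal sketch of the combined guard, hoisting both checks out of dataframe_to_dense (the function name and error shape here are assumptions for illustration, not PR code):

fn check_shape(num_rows: usize, num_cols: usize) -> Result<(i32, i32, usize), LightGBMError> {
    // Reject shapes that would truncate when reported to LightGBM as i32.
    if num_rows > i32::MAX as usize || num_cols > i32::MAX as usize {
        return Err(LightGBMError {
            description: format!("DataFrame too large: {} rows x {} cols", num_rows, num_cols),
        });
    }
    // checked_mul avoids the debug panic / release wraparound on usize overflow.
    let total = num_rows.checked_mul(num_cols).ok_or_else(|| LightGBMError {
        description: "num_rows * num_cols overflows usize".to_string(),
    })?;
    Ok((num_rows as i32, num_cols as i32, total))
}

dataframe_to_dense could then call this once and use total for Vec::with_capacity.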


94-101: Pre-materialize columns before the row loop to avoid repeated as_materialized_series() calls

The review comment is correct. Calling as_materialized_series() inside the inner row loop (lines 96–97) results in num_rows × num_cols redundant invocations. Although as_materialized_series() is cheap when the Column already holds a materialized Series, repeated calls waste CPU when the reference could be cached once per column.

The proposed refactor is type-compatible and implementable:

+    let cols: Vec<&Series> = df
+        .get_columns()
+        .iter()
+        .map(|c| c.as_materialized_series())
+        .collect();
+
-    for row_idx in 0..num_rows {
-        for col in df.get_columns() {
-            let series = col.as_materialized_series();
-            let value = extract_f32_value(series, row_idx)?;
-            data.push(value);
-        }
-    }
+    for row_idx in 0..num_rows {
+        for series in &cols {
+            let value = extract_f32_value(series, row_idx)?;
+            data.push(value);
+        }
+    }

extract_f32_value accepts &Series, so this is type-compatible.

build.rs (1)

170-201: Extraction logic is sound; two recommended improvements for production use

The zip scanning for platform-appropriate libraries (.dylib, .so, .dll) is straightforward and correctly implemented across all three platform variants. The code works as designed.

Two optional improvements worth considering:

  • Wheel cleanup: Wheels are downloaded to OUT_DIR but never removed after extraction. Add fs::remove_file(&wheel_path) after line 195 (macOS), line 257 (Linux), and line 307 (Windows) to reduce build cache size.
  • URL/path override: Consider adding optional env vars LIGHTGBM_WHEEL_URL and/or LIGHTGBM_LIB_PATH to support private mirrors and pre-downloaded artifacts for advanced users, rather than always pulling from PyPI (sketched below).
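A sketch of those override hooks as they might look in build.rs; both env var names are the reviewer's proposal and do not exist in the current build script:

use std::path::{Path, PathBuf};

// Hypothetical: LIGHTGBM_WHEEL_URL points at a mirror of the wheel.
fn wheel_url_override() -> Option<String> {
    std::env::var("LIGHTGBM_WHEEL_URL").ok()
}

// Hypothetical: LIGHTGBM_LIB_PATH points at a pre-downloaded library;
// copying it into libs/ lets the rest of the script skip the network.
fn lib_path_override(lib_dir: &Path) -> Option<std::io::Result<u64>> {
    let src = PathBuf::from(std::env::var_os("LIGHTGBM_LIB_PATH")?);
    let file_name = src.file_name()?.to_owned();
    Some(std::fs::copy(&src, lib_dir.join(file_name)))
}

download_compiled_library would consult these before querying PyPI.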
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 01fb89f and d4c3217.

📒 Files selected for processing (4)
  • Cargo.toml (1 hunks)
  • build.rs (2 hunks)
  • src/lib.rs (1 hunks)
  • src/polars_ext.rs (1 hunks)
🔇 Additional comments (4)
src/lib.rs (1)

10-13: Polars extension exposure is cleanly gated

The conditional mod polars_ext and pub use BoosterPolarsExt are straightforward and keep the Polars API entirely behind the polars feature without touching existing public surface.
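For reference, the gating described above amounts to these lines in src/lib.rs (reconstructed from the change summary; surrounding context is assumed):

#[cfg(feature = "polars")]
mod polars_ext;

#[cfg(feature = "polars")]
pub use crate::polars_ext::BoosterPolarsExt;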

build.rs (2)

128-315: Remove the critical issue flag; URLs work correctly via PyPI redirects

ureq automatically follows redirects by default (up to 5), so the constructed URLs function properly. All five wheels return HTTP 302 redirects to their actual PyPI locations, which ureq handles transparently. The code will successfully download wheels on all supported platforms.

The wheel selection by OS and architecture, HTTP status checking, and extraction logic are sound. No action required.


369-423: Since the sandbox lacks a Rust toolchain, verification relied on the repository's CI configuration and on documented Windows DLL import-library naming conventions.

Based on my research, the review comment raises a valid concern that has been confirmed through multiple verification angles:

On Windows, linking against a DLL goes through its "import library" (.lib); the shared-library naming conventions are .dylib on macOS and .so on Linux.

The critical issue is a platform-specific linking mismatch:

  • On Unix (Linux/macOS): The linker convention is -lfoo → looks for libfoo.so/libfoo.dylib. Your code uses -l_lightgbm which correctly finds lib_lightgbm.so/lib_lightgbm.dylib. ✓

  • On Windows: The linker expects an import library. When you use -l_lightgbm, the MSVC linker searches for _lightgbm.lib (NOT lib_lightgbm.dll). Your code copies lib_lightgbm.dll but doesn't provide the corresponding import library that matches the link directive. ✗

This confirms the reviewer's concern: the link name _lightgbm does not align with the library filename lib_lightgbm.dll on Windows. The code needs either:

  1. A rename of the copied DLL file to _lightgbm.dll on Windows, OR
  2. A change to the link directive to match the actual library name, OR
  3. An import library named _lightgbm.lib to be generated/provided on Windows

Verify whether the crate successfully builds on Windows with the current configuration. If CI validation exists showing Windows builds pass, that would indicate the concern has been addressed elsewhere; if not, the mismatched link name on Windows requires correction.
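For illustration, option 1 could look like the following in build.rs. This is a sketch under the assumption that a toolchain able to link directly against a DLL (e.g., MinGW) is in use; MSVC would still need the _lightgbm.lib import library from option 3:

use std::fs;
use std::path::Path;

// Rename the extracted DLL so the on-disk name matches the
// `-l_lightgbm` link directive on Windows.
fn align_windows_lib_name(lib_dir: &Path) -> std::io::Result<()> {
    let src = lib_dir.join("lib_lightgbm.dll");
    let dst = lib_dir.join("_lightgbm.dll");
    if src.exists() && !dst.exists() {
        fs::rename(&src, &dst)?;
    }
    Ok(())
}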

Cargo.toml (1)

12-14: Feature set for Polars is correct; no action needed

The df! macro in Polars 0.45 doesn't require special feature flags—it's available by default. The current configuration with features = ["dtype-full"] is sufficient. Additionally, the doctest in src/polars_ext.rs is marked no_run, so it won't be compiled or executed during testing. The feature wiring is correct and the code requires no changes.

Comment thread src/polars_ext.rs
Comment on lines +5 to +75
/// Extension trait for LightGBM Booster to support Polars DataFrames
pub trait BoosterPolarsExt {
    /// Predict using a Polars DataFrame as input
    ///
    /// This method efficiently converts the DataFrame to the format LightGBM expects
    /// and runs prediction. All numeric columns will be used as features.
    ///
    /// # Arguments
    /// * `df` - Input DataFrame with numeric features
    /// * `predict_type` - Prediction type (see `predict_type` module constants)
    ///
    /// # Returns
    /// A vector of prediction values
    ///
    /// # Example
    /// ```no_run
    /// # use lightgbm_rust::{Booster, BoosterPolarsExt, predict_type};
    /// # use polars::prelude::*;
    /// let booster = Booster::load("model.txt").unwrap();
    ///
    /// let df = df! {
    ///     "feature1" => [1.0f32, 2.0, 3.0],
    ///     "feature2" => [4.0f32, 5.0, 6.0],
    /// }.unwrap();
    ///
    /// let predictions = booster.predict_dataframe(&df, predict_type::NORMAL).unwrap();
    /// ```
    fn predict_dataframe(
        &self,
        df: &DataFrame,
        predict_type: i32,
    ) -> LightGBMResult<Vec<f64>>;

    /// Predict using specific columns from a Polars DataFrame
    ///
    /// # Arguments
    /// * `df` - Input DataFrame
    /// * `columns` - Column names to use as features (in order)
    /// * `predict_type` - Prediction type
    fn predict_dataframe_with_columns(
        &self,
        df: &DataFrame,
        columns: &[&str],
        predict_type: i32,
    ) -> LightGBMResult<Vec<f64>>;
}

impl BoosterPolarsExt for Booster {
    fn predict_dataframe(
        &self,
        df: &DataFrame,
        predict_type: i32,
    ) -> LightGBMResult<Vec<f64>> {
        let (data, num_rows, num_cols) = dataframe_to_dense(df)?;
        self.predict_f32(&data, num_rows, num_cols, predict_type)
    }

    fn predict_dataframe_with_columns(
        &self,
        df: &DataFrame,
        columns: &[&str],
        predict_type: i32,
    ) -> LightGBMResult<Vec<f64>> {
        let column_names: Vec<String> = columns.iter().map(|s| s.to_string()).collect();
        let selected = df.select(column_names).map_err(|e| LightGBMError {
            description: format!("Failed to select columns: {}", e),
        })?;

        let (data, num_rows, num_cols) = dataframe_to_dense(&selected)?;
        self.predict_f32(&data, num_rows, num_cols, predict_type)
    }

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

BoosterPolarsExt API is well-shaped; consider clarifying null / non‑numeric behavior

The extension trait and its implementation on Booster look coherent: both methods ultimately go through predict_f32 with a dense row‑major buffer, and predict_dataframe_with_columns preserves column order.

Behavioral details that are worth documenting more explicitly:

  • dataframe_to_dense will currently fail on any column that is neither numeric nor boolean (via extract_f32_value’s “Unsupported data type” error), even though the docstring says “all numeric columns will be used”. Non‑numeric columns cause an error, not silent skipping.
  • Any null values cause an error (“Null value at index …”) rather than, e.g., being imputed to 0.0 or NaN.

I’d suggest either:

  • updating the docs to state that all columns must be numeric or boolean and non‑null, or
  • changing the conversion logic to skip unsupported columns and/or map nulls in a well‑defined way.

🏁 Script executed:

#!/bin/bash
# Grep for callers of the new methods to see current expectations about nulls and dtypes.
rg -n "predict_dataframe(_with_columns)?\s*\(" src tests examples

Length of output: 369


🏁 Script executed:

#!/bin/bash
# Find dataframe_to_dense and extract_f32_value implementations
rg -n "fn dataframe_to_dense|fn extract_f32_value" src --type rust -A 20

Length of output: 2390


🏁 Script executed:

#!/bin/bash
# Check what directories exist and search for tests
fd -t d "test|example" . 2>/dev/null | head -20

Length of output: 11


🏁 Script executed:

#!/bin/bash
# Look for any test files related to polars or prediction
fd -t f ".*test.*\.rs|.*spec.*\.rs" . 2>/dev/null | head -20

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Get complete extract_f32_value implementation
sed -n '109,160p' src/polars_ext.rs

Length of output: 2080


🏁 Script executed:

#!/bin/bash
# Check examples directory for any polars-related usage
fd -t f . examples/ --type rust

Length of output: 171


🏁 Script executed:

#!/bin/bash
# Search examples for predict_dataframe usage
rg -n "predict_dataframe" examples/ -A 3 -B 3

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Get rest of extract_f32_value function including Boolean and default case
sed -n '150,180p' src/polars_ext.rs

Length of output: 1254


🏁 Script executed:

#!/bin/bash
# Find where extract_f32_value function ends
rg -n "^}" src/polars_ext.rs | head -10

Length of output: 21


🏁 Script executed:

#!/bin/bash
# Get full function in one shot - safer approach
awk '/^fn extract_f32_value/,/^}/' src/polars_ext.rs | tail -30

Length of output: 956


Clarify null/dtype handling in docstring to match implementation

Verification confirms the implementation correctly handles nulls and unsupported types, but the API documentation is misleading:

The docstring states "All numeric columns will be used as features," yet the implementation errors when encountering any non-numeric or non-boolean column, rather than automatically skipping them. Additionally, any null values cause an error ("Null value at index…") instead of being handled gracefully.

Supported types: Float32, Float64, Int8–64, UInt8–64, Boolean (converts true→1.0, false→0.0). Unsupported types trigger an error: "Unsupported data type for conversion to f32."

Update the docstrings for predict_dataframe and predict_dataframe_with_columns (lines 8–9 and 38) to explicitly state (a sketch follows the list):

  • All columns must be numeric, boolean, or a supported integer type
  • Null values are not permitted and will cause an error
  • Boolean columns are supported and converted to 0.0/1.0
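Putting those points together, the tightened docstring could read (suggested wording, not the PR's text):

/// Predict using a Polars DataFrame as input.
///
/// Every column of `df` is used as a feature and must be a supported
/// dtype: Float32/Float64, Int8..Int64, UInt8..UInt64, or Boolean
/// (converted to 1.0 / 0.0). Columns of any other dtype produce an
/// "Unsupported data type" error rather than being skipped, and any
/// null value aborts prediction with a "Null value at index ..." error.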

Comment thread src/polars_ext.rs Outdated
Comment on lines +106 to +212
/// Extract an f32 value from a Series at the given index
///
/// Supports conversion from all numeric types and booleans.
fn extract_f32_value(series: &Series, idx: usize) -> LightGBMResult<f32> {
    use DataType::*;

    match series.dtype() {
        Float32 => {
            let ca = series.f32().map_err(|e| LightGBMError {
                description: format!("Failed to cast to f32: {}", e),
            })?;
            ca.get(idx).ok_or_else(|| LightGBMError {
                description: format!("Null value at index {}", idx),
            })
        }
        Float64 => {
            let ca = series.f64().map_err(|e| LightGBMError {
                description: format!("Failed to cast to f64: {}", e),
            })?;
            Ok(ca.get(idx).ok_or_else(|| LightGBMError {
                description: format!("Null value at index {}", idx),
            })? as f32)
        }
        Int8 => {
            let ca = series.i8().map_err(|e| LightGBMError {
                description: format!("Failed to cast to i8: {}", e),
            })?;
            Ok(ca.get(idx).ok_or_else(|| LightGBMError {
                description: format!("Null value at index {}", idx),
            })? as f32)
        }
        Int16 => {
            let ca = series.i16().map_err(|e| LightGBMError {
                description: format!("Failed to cast to i16: {}", e),
            })?;
            Ok(ca.get(idx).ok_or_else(|| LightGBMError {
                description: format!("Null value at index {}", idx),
            })? as f32)
        }
        Int32 => {
            let ca = series.i32().map_err(|e| LightGBMError {
                description: format!("Failed to cast to i32: {}", e),
            })?;
            Ok(ca.get(idx).ok_or_else(|| LightGBMError {
                description: format!("Null value at index {}", idx),
            })? as f32)
        }
        Int64 => {
            let ca = series.i64().map_err(|e| LightGBMError {
                description: format!("Failed to cast to i64: {}", e),
            })?;
            Ok(ca.get(idx).ok_or_else(|| LightGBMError {
                description: format!("Null value at index {}", idx),
            })? as f32)
        }
        UInt8 => {
            let ca = series.u8().map_err(|e| LightGBMError {
                description: format!("Failed to cast to u8: {}", e),
            })?;
            Ok(ca.get(idx).ok_or_else(|| LightGBMError {
                description: format!("Null value at index {}", idx),
            })? as f32)
        }
        UInt16 => {
            let ca = series.u16().map_err(|e| LightGBMError {
                description: format!("Failed to cast to u16: {}", e),
            })?;
            Ok(ca.get(idx).ok_or_else(|| LightGBMError {
                description: format!("Null value at index {}", idx),
            })? as f32)
        }
        UInt32 => {
            let ca = series.u32().map_err(|e| LightGBMError {
                description: format!("Failed to cast to u32: {}", e),
            })?;
            Ok(ca.get(idx).ok_or_else(|| LightGBMError {
                description: format!("Null value at index {}", idx),
            })? as f32)
        }
        UInt64 => {
            let ca = series.u64().map_err(|e| LightGBMError {
                description: format!("Failed to cast to u64: {}", e),
            })?;
            Ok(ca.get(idx).ok_or_else(|| LightGBMError {
                description: format!("Null value at index {}", idx),
            })? as f32)
        }
        Boolean => {
            let ca = series.bool().map_err(|e| LightGBMError {
                description: format!("Failed to cast to bool: {}", e),
            })?;
            Ok(if ca.get(idx).ok_or_else(|| LightGBMError {
                description: format!("Null value at index {}", idx),
            })? {
                1.0
            } else {
                0.0
            })
        }
        dt => Err(LightGBMError {
            description: format!("Unsupported data type for conversion to f32: {}", dt),
        }),
    }
}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Comprehensive dtype coverage; consider sharing casting logic and tightening messages

extract_f32_value covers the expected numeric and boolean dtypes and produces clear error messages for cast failures, nulls, and unsupported types. A couple of small improvements you might consider:

  • The per‑dtype branches are repetitive; a small helper like fn cast_and_get<T>(...) where T: PolarsNumericType could reduce boilerplate and centralize the “Null value at index …” message.
  • For out‑of‑bounds indexes, ca.get(idx) returns None but the error messages talk about nulls. If you want to distinguish real nulls from indexing mistakes, you could explicitly check idx < ca.len() and tailor the error messages.

Overall, the behavior is correct; these are mostly clarity / maintainability tweaks.


🌐 Web query:

Verify how `ChunkedArray::get` behaves in Polars 0.45: does it return `None` both for nulls and out-of-bounds, or just nulls?

💡 Result:

ChunkedArray::get returns None only for NULL values; an out-of-bounds index will panic. [1]

Sources

  • ChunkedArray::get docs (notes None = NULL; function will panic if idx is out of bounds). [1]

Out-of-bounds indexes will cause panics; add explicit bounds checking

The code calls ca.get(idx) without verifying that idx is within bounds. Unlike what the error messages suggest, ChunkedArray::get returns None only for NULL values in the data; an out-of-bounds index will panic. Add an explicit bounds check before accessing the element:

if idx >= ca.len() {
    return Err(LightGBMError {
        description: format!("Index {} out of bounds (length: {})", idx, ca.len()),
    });
}
let value = ca.get(idx).ok_or_else(|| LightGBMError {
    description: format!("Null value at index {}", idx),
})?;

As a secondary improvement, the per-dtype branches are repetitive; consider a small helper to centralize casting and bounds/null checking logic.
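One way to realize that helper without pinning down Polars trait bounds is a small macro over the typed accessors. This is a sketch only; it swaps the generic cast_and_get<T> idea suggested earlier for a macro, and none of these names exist in the PR:

// Hypothetical macro: one body for every numeric branch of extract_f32_value.
// `$accessor` is the typed accessor method (f32, f64, i32, u64, ...).
macro_rules! get_as_f32 {
    ($series:expr, $accessor:ident, $idx:expr) => {{
        let ca = $series.$accessor().map_err(|e| LightGBMError {
            description: format!("Failed to cast: {}", e),
        })?;
        // Bounds check first: ChunkedArray::get panics on out-of-bounds.
        if $idx >= ca.len() {
            return Err(LightGBMError {
                description: format!("Index {} out of bounds (length: {})", $idx, ca.len()),
            });
        }
        // After the bounds check, None can only mean a real null.
        ca.get($idx).map(|v| v as f32).ok_or_else(|| LightGBMError {
            description: format!("Null value at index {}", $idx),
        })
    }};
}

Each numeric match arm then collapses to one line, e.g. Float64 => get_as_f32!(series, f64, idx), leaving only the Boolean and fallback arms written out.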

🤖 Prompt for AI Agents
In src/polars_ext.rs around lines 106 to 212, the code calls ca.get(idx) for
each branch without checking bounds which can panic for out-of-range indexes;
add an explicit bounds check on each ChunkedArray (if idx >= ca.len() return
Err(LightGBMError { description: format!("Index {} out of bounds (length: {})",
idx, ca.len()) })) before calling ca.get, then keep the existing ok_or_else null
check to distinguish NULL values; optionally refactor the repeated pattern into
a small helper that accepts a typed ChunkedArray, performs the bounds check,
checks for NULL, and returns the typed value cast to f32 to reduce duplication.

@coderabbitai Bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
src/polars_ext.rs (1)

5-32: Clarify Polars DataFrame docs to match dtype/null requirements

The docs say “all numeric columns will be used as features” and accept a generic “DataFrame with numeric features,” but the implementation:

  • Uses all columns (not just numeric ones) and will error out if a column cannot be cast to Float64.
  • Treats any null as an error (Null value at row …, col …), rather than skipping/imputing.

Either tighten the docstrings for predict_dataframe and predict_dataframe_with_columns to state that all selected columns must be castable to Float64 and contain no nulls, or relax the implementation to skip unsupported columns / handle nulls explicitly. This overlaps with the earlier feedback on dtype/null semantics.

Also applies to: 40-45

🧹 Nitpick comments (2)
src/polars_ext.rs (1)

70-109: Guard against extreme shapes when flattening DataFrame

The row‑major flattening and Float64 casting are implemented correctly, but two edge cases are worth hardening:

  • total_elements = num_rows * num_cols can overflow usize for extremely large DataFrames.
  • num_rows / num_cols are cast to i32 without checking they fit, but Booster::predict expects i32 sizes.

Consider:

  • Using checked_mul for total_elements and returning a LightGBMError on overflow.
  • Validating that num_rows and num_cols are ≤ i32::MAX before casting, returning an error otherwise.

This keeps behavior unchanged for normal sizes while avoiding surprising panics on pathological inputs.

build.rs (1)

397-398: Linking and rpath setup are coherent; verify across OS/toolchains

Copying lib_lightgbm.* into target/{PROFILE} and:

  • adding out_dir/libs to cargo:rustc-link-search,
  • setting rpaths for macOS (@executable_path, target debug/release, and lib_search_path) and Linux ($ORIGIN, $ORIGIN/../.., and lib_search_path),
  • and linking with cargo:rustc-link-lib=dylib=_lightgbm (so the linker looks for lib_lightgbm.*),

is internally consistent and should pick up the extracted library in typical Cargo layouts.

Given the cross‑platform nuances (MSVC vs MinGW, test binaries under target/*/deps, etc.), it’s worth running builds and tests on all three targets (macOS, Linux, Windows) to confirm that the linker finds the library and that the runtime loader respects the rpaths as expected.

Also applies to: 413-415, 419-425, 433-435, 440-440
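For concreteness, the directive pattern being verified looks roughly like this in build.rs (a sketch; lib_search_path is an assumed PathBuf pointing at OUT_DIR/libs):

// Emit search path and link directive, then per-OS rpaths.
let target_os = std::env::var("CARGO_CFG_TARGET_OS").unwrap_or_default();
println!("cargo:rustc-link-search=native={}", lib_search_path.display());
println!("cargo:rustc-link-lib=dylib=_lightgbm");
match target_os.as_str() {
    "macos" => {
        println!("cargo:rustc-link-arg=-Wl,-rpath,@executable_path");
        println!("cargo:rustc-link-arg=-Wl,-rpath,{}", lib_search_path.display());
    }
    "linux" => {
        println!("cargo:rustc-link-arg=-Wl,-rpath,$ORIGIN");
        println!("cargo:rustc-link-arg=-Wl,-rpath,{}", lib_search_path.display());
    }
    _ => {}
}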

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d4c3217 and 499f297.

📒 Files selected for processing (8)
  • Cargo.toml (1 hunks)
  • build.rs (8 hunks)
  • examples/advanced_usage.rs (3 hunks)
  • examples/basic_usage.rs (2 hunks)
  • src/error.rs (1 hunks)
  • src/lib.rs (1 hunks)
  • src/model.rs (5 hunks)
  • src/polars_ext.rs (1 hunks)
✅ Files skipped from review due to trivial changes (2)
  • examples/basic_usage.rs
  • examples/advanced_usage.rs
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/lib.rs
  • Cargo.toml
🧰 Additional context used
🧬 Code graph analysis (1)
build.rs (1)
src/model.rs (1)
  • drop (299-303)
🔇 Additional comments (7)
src/model.rs (4)

51-57: Path and NUL-byte validation in Booster::load looks solid

Converting Path to UTF‑8 and guarding against interior NULs with CString::new plus descriptive LightGBMError messages is correct and keeps FFI usage safe.
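The pattern being praised, sketched for reference (the actual Booster::load body is not quoted in this review):

use std::ffi::CString;
use std::path::Path;

// UTF-8 check, then interior-NUL check, each mapped to a descriptive error.
fn path_to_cstring(path: &Path) -> LightGBMResult<CString> {
    let s = path.to_str().ok_or_else(|| LightGBMError {
        description: format!("Model path is not valid UTF-8: {}", path.display()),
    })?;
    CString::new(s).map_err(|e| LightGBMError {
        description: format!("Model path contains an interior NUL byte: {}", e),
    })
}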


85-88: NUL-byte handling for load_from_string is correct

Wrapping CString::new(model_str) with a mapped LightGBMError cleanly surfaces invalid model strings without UB.


116-120: load_from_buffer UTF‑8 conversion is appropriate

Using std::str::from_utf8 with a clear error message is a good way to validate text‑based model buffers before delegating to load_from_string.


171-176: Improved size-mismatch error messages aid debugging

The expanded format! for data length mismatches in predict/predict_f32 preserves logic while making diagnostics more actionable.

Also applies to: 245-248

src/error.rs (1)

1-1: Single top-level crate::sys import is fine

Centralizing the sys import at the top avoids duplication and keeps the error module tidy without changing behavior.

src/polars_ext.rs (1)

48-67: BoosterPolarsExt implementation correctly reuses existing prediction path

Using dataframe_to_dense followed by Booster::predict (with optional column selection via df.select) is a clean integration and preserves column order as provided in columns.

build.rs (1)

57-59: Header and optional Arrow downloads: status checks and logging look good

Switching to 2xx range checks and adding explicit logging around the optional arrow.h / arrow.tpp downloads improves robustness and debuggability without changing behavior for successful cases.

Also applies to: 75-77, 90-93, 108-111, 126-128


@coderabbitai Bot left a comment

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 499f297 and ef4c148.

📒 Files selected for processing (2)
  • Cargo.toml (1 hunks)
  • build.rs (8 hunks)
🔇 Additional comments (5)
Cargo.toml (2)

12-15: Polars feature flagging and rayon dependency look good; verify MSRV alignment.

Optional polars with default-features = false and a dedicated polars feature, plus rayon as a runtime dep, is a clean way to gate the new integration. One thing to double-check is that polars = "0.45" and rayon = "1.10" are compatible with the crate’s declared rust-version = "1.70" and with your intended supported toolchains.

Also applies to: 25-25


18-20: Build-dependency updates match the new build pipeline.

The ureq (with json feature), zip, and serde_json build-dependencies align with the PyPI JSON lookup and wheel-extraction logic in build.rs. I don’t see issues here; just ensure these versions are acceptable for your CI/build environments and any MSRV constraints.

build.rs (3)

58-60: Header download status checks and optional Arrow files are handled cleanly.

Using 200..300 status checks for c_api.h / export.h and treating arrow.h / arrow.tpp as optional with explicit logging gives predictable behavior across LightGBM versions without failing builds when Arrow headers are absent. No issues here.

Also applies to: 76-78, 91-95, 109-112, 127-129


197-252: GitHub release fallback is pragmatic and in line with LightGBM assets.

Trying several plausible asset names (lib_lightgbm_{os}[_{arch}].{ext}, then plain lib_lightgbm.{ext}) and normalizing the saved file name to lib_lightgbm.{ext} in libs is a reasonable strategy when PyPI wheels aren’t available. Error messaging that includes the attempted asset names will be helpful when debugging.
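The candidate-name generation being described amounts to something like this (a sketch; try_github_release's internals are not quoted in this review):

// Most specific asset name first, matching the patterns listed above.
fn candidate_asset_names(os: &str, arch: &str, ext: &str) -> Vec<String> {
    vec![
        format!("lib_lightgbm_{}_{}.{}", os, arch, ext),
        format!("lib_lightgbm_{}.{}", os, ext),
        format!("lib_lightgbm.{}", ext),
    ]
}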


431-431: Copying the library to the target dir and expanded rpaths should improve runtime UX.

Copying lib_lightgbm.* from OUT_DIR/libs into target/{PROFILE} and adding rpaths for both the crate’s libs directory and the debug/release target roots (on macOS and Linux) should make it much easier to run binaries/tests without extra manual setup. Linking against dylib=_lightgbm is consistent with the lib_lightgbm.* naming convention used earlier in the script.

Also applies to: 447-450, 453-460, 468-470, 475-475

Comment thread build.rs
Comment on lines +6 to +7
use std::path::{Path, PathBuf};
use serde_json::Value;

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

PyPI JSON wheel discovery is a solid improvement; tighten Windows arch handling.

Switching to the PyPI JSON API and matching against expected wheel filenames per (os, arch) is much more robust than hardcoding files.pythonhosted paths. One corner case: the ("windows", _) arm will also be used for non‑x86_64 Windows targets but still looks for a win_amd64 wheel, which could silently pick a 64‑bit wheel for a 32‑bit target. It’s safer to restrict this arm to x86_64 and let other Windows arches fall through to the “unsupported platform” error (and GitHub fallback).

You can do this minimally as:

-        ("windows", _) => vec![format!("lightgbm-{}-py3-none-win_amd64.whl", version)],
+        ("windows", "x86_64") => vec![format!("lightgbm-{}-py3-none-win_amd64.whl", version)],

Optionally, you could also filter urls by packagetype == "bdist_wheel" rather than only by filename, but that’s not strictly required given the explicit name patterns.

Also applies to: 136-194


I'll examine the build.rs file to verify the review comment's concern about Windows architecture handling.


🏁 Script executed:

#!/bin/bash
# Find and examine build.rs
fd -name "build.rs" -type f

Length of output: 230


🏁 Script executed:

#!/bin/bash
# Display the full context around the reported lines
if [ -f "build.rs" ]; then
  echo "=== Lines 1-40 (context for imports) ==="
  sed -n '1,40p' build.rs
  echo ""
  echo "=== Lines 130-200 (platform matching logic) ==="
  sed -n '130,200p' build.rs
fi

Length of output: 3542


🏁 Script executed:

#!/bin/bash
# Search for windows architecture patterns in build.rs
if [ -f "build.rs" ]; then
  echo "=== Searching for 'windows' references ==="
  rg -n "windows" build.rs -A2 -B2
fi

Length of output: 1243


🏁 Script executed:

#!/bin/bash
# Check usage of Path and PathBuf in build.rs
if [ -f "build.rs" ]; then
  echo "=== Usage of Path and PathBuf ==="
  rg "Path|PathBuf" build.rs -i
fi

Length of output: 2488


🏁 Script executed:

#!/bin/bash
# Check for any references to i686 or other Windows variants
if [ -f "build.rs" ]; then
  echo "=== References to i686 and architecture comments ==="
  rg "i686|i586|win32|32.bit" build.rs -i -A1 -B1
fi

Length of output: 165


Restrict Windows wheel matching to x86_64 architecture.

The ("windows", _) pattern matches any Windows architecture but only looks for win_amd64 wheels. Since get_platform_info() recognizes 32-bit Windows targets (i686), they would incorrectly attempt to fetch a 64-bit wheel instead of erroring out.

Restrict this to ("windows", "x86_64"):

-        ("windows", _) => vec![format!("lightgbm-{}-py3-none-win_amd64.whl", version)],
+        ("windows", "x86_64") => vec![format!("lightgbm-{}-py3-none-win_amd64.whl", version)],

This allows other Windows architectures (i686, aarch64) to fall through to the unsupported platform error, which is the correct behavior. Apply at line 171.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In build.rs around line 171, the match arm currently uses ("windows", _) which
matches any Windows architecture but only tries to locate win_amd64 wheels;
change that pattern to ("windows", "x86_64") so only 64-bit Windows targets are
handled and other Windows architectures (i686, aarch64) will fall through to the
unsupported platform error; update any corresponding pattern variants that
reference ("windows", _) to the more specific ("windows", "x86_64") to ensure
correct platform handling.

Comment thread build.rs
Comment on lines +254 to +360
fn download_compiled_library(out_dir: &Path) -> Result<(), Box<dyn std::error::Error>> {
    let (os, arch) = get_platform_info();
    let version = get_lightgbm_version();

    // Create the library directory
    let lib_dir = out_dir.join("libs");
    fs::create_dir_all(&lib_dir)?;

    // Strategy 1: Try PyPI JSON API to find the correct wheel URL (most reliable)
    println!("cargo:warning=Querying PyPI for LightGBM wheel...");
    match find_wheel_url_from_pypi(&version, &os, &arch) {
        Ok(wheel_url) => {
            println!("cargo:warning=Downloading wheel from: {}", wheel_url);

            let response = ureq::get(&wheel_url).call()?;
            let status = response.status();
            if !(200..300).contains(&status) {
                return Err(format!("Failed to download wheel: HTTP {}", status).into());
            }

            // Extract filename from URL
            let wheel_name = wheel_url
                .split('/')
                .last()
                .unwrap_or("lightgbm.whl")
                .to_string();
            let wheel_path = out_dir.join(&wheel_name);

            // Download wheel
            let mut wheel_file = fs::File::create(&wheel_path)?;
            io::copy(&mut response.into_reader(), &mut wheel_file)?;
            drop(wheel_file);

            println!("cargo:warning=✓ Downloaded wheel");

            // Extract the library from the wheel
            println!("cargo:warning=Extracting library from wheel");

            let file = fs::File::open(&wheel_path)?;
            let mut archive = zip::ZipArchive::new(file)?;

            // Determine the library extension
            let lib_extension = match os.as_str() {
                "darwin" => "dylib",
                "linux" => "so",
                "windows" => "dll",
                _ => return Err(format!("Unsupported OS: {}", os).into()),
            };

            // Look for the library in the wheel
            let mut found = false;
            for i in 0..archive.len() {
                let mut file = archive.by_index(i)?;
                let name = file.name().to_string();

                if name.contains("lib_lightgbm") && name.ends_with(lib_extension) {
                    println!("cargo:warning=Found library at: {}", name);

                    let lib_path = lib_dir.join(format!("lib_lightgbm.{}", lib_extension));
                    let mut outfile = fs::File::create(&lib_path)?;
                    io::copy(&mut file, &mut outfile)?;

                    println!(
                        "cargo:warning=✓ Successfully extracted LightGBM library to: {}",
                        lib_dir.display()
                    );
                    found = true;
                    break;
                }
            }

            if !found {
                return Err(format!(
                    "Could not find lib_lightgbm.{} in wheel",
                    lib_extension
                )
                .into());
            }

            Ok(())
        }
        Err(pypi_err) => {
            // Strategy 2: Fallback to GitHub releases
            println!(
                "cargo:warning=PyPI wheel not found ({}), trying GitHub releases as fallback...",
                pypi_err
            );
            match try_github_release(&version, &os, &arch, &lib_dir) {
                Ok(_) => Ok(()),
                Err(github_err) => {
                    // Strategy 3: Final error with helpful message
                    Err(format!(
                        "Failed to download LightGBM library:\n\
                         - PyPI: {}\n\
                         - GitHub releases: {}\n\
                         \n\
                         Please try:\n\
                         1. Using a different LightGBM version (set LIGHTGBM_VERSION env var)\n\
                         2. Building LightGBM from source and setting LIGHTGBM_LIB_DIR env var\n\
                         3. Checking https://pypi.org/project/lightgbm/{}/",
                        pypi_err, github_err, version
                    )
                    .into())
                }
            }
        }
    }
}

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

Wheel download/extraction flow looks good; fix stale env-var advice and tighten library match.

The overall strategy in download_compiled_library—PyPI wheel first, then GitHub fallback, with clear warnings and a final aggregated error—is solid. Two details worth adjusting:

  1. Stale LIGHTGBM_LIB_DIR guidance in the error message.
     The final error suggests "building LightGBM from source and setting LIGHTGBM_LIB_DIR env var", but there's no code in this build.rs that actually reads LIGHTGBM_LIB_DIR. That can confuse users trying to recover from failures.

     • Either implement a pre-check that honors LIGHTGBM_LIB_DIR (and skips network downloads), or
     • Simplify the guidance text to avoid mentioning an env var that's not consumed.

     If you don't want to implement the override right now, a minimal doc fix would be:

     -                     2. Building LightGBM from source and setting LIGHTGBM_LIB_DIR env var\n\
     +                     2. Building LightGBM from source and ensuring the resulting shared library is discoverable on your system (e.g., via library search path or colocated with the binary)\n\

  2. Library filename detection in the wheel.
     The extraction loop matches entries with name.contains("lib_lightgbm") && name.ends_with(lib_extension). To avoid accidental matches on paths that merely end with "so" / "dll" / "dylib", you could check for the full ".ext" suffix:

     -            if name.contains("lib_lightgbm") && name.ends_with(lib_extension) {
     +            if name.contains("lib_lightgbm")
     +                && name.ends_with(&format!(".{}", lib_extension))
     +            {

     This is mostly defensive, but makes the intent clearer.

I'll verify the review comment by checking the codebase. Let me search for the key concerns.


---

🏁 Script executed:

#!/bin/bash
# Find build.rs and check if LIGHTGBM_LIB_DIR is actually used
rg -n "LIGHTGBM_LIB_DIR" --type rust -C 3

Length of output: 507


🏁 Script executed:

#!/bin/bash
# Find build.rs and display lines 254-360
fd "build\.rs" -x cat {} | head -400 | tail -150

Length of output: 5648


🏁 Script executed:

#!/bin/bash
# Search for any env var reading in build.rs that might handle library paths
fd "build\.rs" -x rg -n "env::" {} -C 2

Length of output: 587


Remove stale LIGHTGBM_LIB_DIR reference and tighten library filename matching.

The PyPI/GitHub fallback strategy is sound, but two details need adjusting:

  1. Stale env-var guidance. The error message (line 352) recommends "setting LIGHTGBM_LIB_DIR env var", but build.rs never reads this variable. Remove the reference or replace with actionable guidance about system library paths.

  2. Loose library filename detection. Line 306-307 uses name.ends_with(lib_extension) where lib_extension is "so", "dll", or "dylib" (without leading dot). This matches any file ending with those strings, not just the extension. Change to name.ends_with(&format!(".{}", lib_extension)) to match only ".so", ".dll", or ".dylib" suffixes.

🤖 Prompt for AI Agents
In build.rs around lines 254 to 360, update two items: 1) tighten the library
filename check by changing the ends_with test to match the extension with a
leading dot (use name.ends_with(&format!(".{}", lib_extension))) so only files
ending in ".so", ".dll", or ".dylib" are accepted; 2) remove the stale "setting
LIGHTGBM_LIB_DIR env var" guidance from the final error message (or replace it
with actionable guidance such as checking system library paths or setting an
actual env var your build reads), and adjust the multi-line error text
accordingly.
