-
Notifications
You must be signed in to change notification settings - Fork 114
WIP: Add a new remote execution system that is integrated in the cuopt engine #769
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
This adds a header that provides convenient aliases for data_model_view_t and span<T> in the cuopt::linear_programming namespace. The canonical implementation remains in cuopt::mps_parser to maintain mps_parser as a standalone library.
- Refactored problem_and_stream_view_t to use unique_ptr for both CPU and GPU storage - Check is_remote_solve_enabled() at problem creation time - Local solve: store problem directly in GPU memory (optimization_problem_t) - Remote solve: store problem in CPU memory (problem_cpu_data_t) - cuOptSolve routes to appropriate path based on is_remote_configured flag - Updated all cuOptGet* functions to read from correct storage - No memory overhead: only one storage type is allocated per problem - Maintains backward compatibility with existing C API
- Add is_device_memory field to data_model_view_t to track memory location - Simplify solve_lp/solve_mip overloads: mps_data_model_t now creates view and delegates - C API checks remote config at creation time: GPU memory for local, CPU for remote - Handle edge case: GPU data + remote config copies to CPU with warning - Add white-box tests for all local/remote + CPU/GPU combinations
- Modify cython_solve.cu to check for remote solve configuration early - When remote is enabled, mark data_model as CPU memory and call solve_lp/solve_mip with data_model_view_t overloads - Fix segfault by initializing warm start data to empty device buffers instead of nullptr when returning from remote solve path - Local solve path unchanged - still creates GPU optimization_problem_t
…emote solve - Add is_device_memory flag to optimization_problem_solution_t and mip_solution_t - Use std::unique_ptr for both CPU (std::vector) and GPU (rmm::device_uvector) storage - Add _host() getters for C++ users to access CPU-resident solution data - Update cython_solve.cu to populate CPU or GPU storage based on solve type - Update Python wrapper to transparently handle both CPU and GPU solution data - Backward compatible: existing GPU-based code unchanged, Python users need no changes
- Check is_remote_solve_enabled() before any CUDA initialization in CLI - Skip raft::handle creation and memory resource setup for remote solve - Use CPU-only solution constructors when remote solve returns error - Handle nullptr handle_ptr gracefully in solve_lp/solve_mip - Enables CLI to run on hosts without GPU when remote solve is configured
- Make raft::handle_t in problem_and_stream_view_t lazy (unique_ptr) - Make stream_view in solution_and_stream_view_t optional - Check is_remote_solve_enabled() before creating handle in cuOptSolve - Update cuOptGetPrimalSolution/DualSolution/ReducedCosts to handle CPU data - Enables C API to run on hosts without GPU when remote solve is configured
- cython_solve.cu: Check remote solve before CUDA stream/handle creation - Remote solve path passes nullptr handle to solve_lp/solve_mip - Local solve path creates CUDA resources only when needed - cuopt/__init__.py: Make linear_programming and routing imports lazy - Uses __getattr__ for deferred module loading - Allows 'import cuopt' on CPU-only hosts with remote solve - solver_wrapper.pyx: Lazy CUDA imports and remote solve detection - Remove eager imports of cupy, cudf, numba.cuda - Add is_remote_solve_enabled() Python function - Add _get_cuda_imports() for lazy module loading - Update create_solution() to check is_device_memory_ for warm start - utilities/__init__.py: Lazy imports for CUDA-dependent utilities - type_cast, series_from_buf, check_solution via __getattr__ - utilities/utils.py: Lazy cudf/pylibcudf imports in series_from_buf Enables Python programs to run on CPU-only hosts when remote solve is configured via CUOPT_REMOTE_HOST and CUOPT_REMOTE_PORT.
Features: - Remote solve server (cuopt_remote_server) with protobuf serialization - Real-time log streaming from server to client via stdout capture - Transparent remote solve via CUOPT_REMOTE_HOST/PORT environment variables - Pluggable serialization interface (remote_serialization.hpp) - Static linking of libmps_parser into libcuopt for simpler deployment Components: - cuopt_remote_server.cpp: Server with streaming protocol - cuopt_remote.proto: Protocol Buffers schema for LP/MIP problems - protobuf_serializer.cu: Serialization implementation - remote_solve.cu: Client-side remote solve with streaming support Build: - Added BUILD_REMOTE_SERVER cmake option - Added cuopt_remote_server target to build.sh - Added mps_parser_static library for static linking The server captures stdout during solve and streams it to clients while also echoing to the server console (for docker logs, monitoring, etc.).
When users set constraints using set_row_types() + set_constraint_bounds() instead of set_constraint_lower/upper_bounds(), the serializer now computes the constraint lower/upper bounds from the row types: - 'E' (equality): lb = ub = rhs - 'L' (less-than): lb = -inf, ub = rhs - 'G' (greater-than): lb = rhs, ub = inf - 'N' (free): lb = -inf, ub = inf Also fixed server to always use streaming protocol for solutions (even when --no-stream is set, only log streaming is disabled).
- Integrate solve_mip_remote() into solve_mip() dispatch - Serialize variable types (integer/binary) to protobuf is_integer/is_binary fields - Deserialize variable types on server side - Fix is_mip_request() to check header.problem_type for reliable detection MIP remote solve now correctly handles integer and binary variables.
- Add to_host() method to LP/MIP solution classes for GPU->CPU copy - Server calls to_host() before serialization, simplifying serializer impl - Add dynamic serializer loading via CUOPT_SERIALIZER_LIB env var - Add MsgPack serializer plugin as proof-of-concept - Add serialization plugin development guide
Features:
- Async job management: SUBMIT_JOB, CHECK_STATUS, GET_RESULT, DELETE_RESULT, GET_LOGS
- Worker processes with shared memory job queues (fork-based, not threads)
- File-based log streaming for async mode (matches Python server architecture)
- Per-job log files: /tmp/cuopt_logs/log_{job_id}
- Incremental reads via byte offset (frombyte parameter)
- Automatic cleanup on job deletion
- Sync mode still uses real-time socket streaming
- CUOPT_REMOTE_USE_SYNC=1 env var to force sync mode
Protocol additions:
- RequestType.GET_LOGS for log retrieval
- AsyncRequest.frombyte for incremental log reads
- LogsResponse with log_lines, nbytes, job_exists
Client modes:
- Sync (blocking=true): Real-time log streaming over socket
- Async (blocking=false): Submit -> Poll (logs + status) -> Get -> Delete
Server (cuopt_remote_server.cpp): - Add worker process monitoring with waitpid - Automatically restart dead workers - Mark in-progress jobs as FAILED when worker dies - Add flush calls to monitoring log messages Msgpack serializer: - Implement full async interface support - Add async message types (MSG_ASYNC_LP/MIP_REQUEST, etc.) - Implement server-side methods: is_async_request, is_blocking_request, extract_problem_data, get_async_request_type, get_job_id, get_frombyte - Implement response serialization: serialize_submit_response, serialize_status_response, serialize_result_response, serialize_delete_response, serialize_logs_response - Fix is_mip_request to detect both sync and async MIP formats Tested with LP (afiro.mps) and MIP (flugpl.mps) using msgpack async.
- docs/developer/REMOTE_SOLVE_GUIDE.md: Comprehensive guide covering server setup, client configuration, async/sync modes, job lifecycle, custom serialization, worker monitoring, and troubleshooting - docs/developer/SOLUTION_MEMORY_ARCHITECTURE.md: Documents the hybrid GPU/CPU memory strategy for solutions, including to_host() method, accessor patterns, and design rationale Deleted 56 obsolete markdown files from root directory that contained outdated information from previous development iterations.
Features: - CANCEL_JOB: Cancel queued jobs (marked cancelled, workers skip them) or running jobs (worker killed via SIGKILL, auto-restarted) - WAIT_FOR_RESULT: Blocking wait using condition variables (no polling) Server thread sleeps until job completes, fails, or is cancelled - cuopt_cancel_job: Standalone CLI utility to cancel jobs by ID Protocol changes: - Added CANCEL_JOB (type 5) and WAIT_FOR_RESULT (type 6) request types - Added CANCELLED (5) to JobStatus enum - Added CancelResponse message to protobuf schema Server changes: - Added cancelled flag to shared memory JobQueueEntry - Workers check cancelled flag before processing claimed jobs - Cancel handler kills worker process for running jobs - wait_for_result() uses JobWaiter condition variables - Fixed check_job_status() to detect PROCESSING state - Thread-per-connection model (already present) enables blocking waits Serialization: - Added cancel methods to remote_serialization.hpp base class - Implemented in both protobuf and msgpack serializers - Added MSG_WAIT_REQUEST (17) to msgpack message types Documentation: - Added docs/remote_solve_architecture.md for security review - Covers protocol, process model, data flow, security considerations
C++ API: - Added cancel_job_remote() function to remote_solve.hpp/cu - Added remote_job_status_t enum and cancel_job_result_t struct - Simplified cuopt_cancel_job utility to use the new API (194→105 lines) Python API: - Added cuopt.linear_programming.remote module - cancel_job() function for cancelling remote jobs - JobStatus enum and CancelResult dataclass - Exported from cuopt.linear_programming package All three APIs tested and verified working: - Raw protobuf cancel (baseline) - C++ utility cancel (cuopt_cancel_job) - Python API cancel (cancel_job)
Server-side: - All jobs (blocking and non-blocking) now go through the shared memory job queue and are processed by worker processes - For blocking=true requests, server submits to queue then waits on condition variable until completion, then auto-deletes and returns - This enables cancellation for 'sync' jobs and allows concurrent solves - Single code path for all job processing Client-side: - Updated solve_lp_remote and solve_mip_remote to use the unified response format (result_response) for blocking mode - Removed streaming log dependency for blocking mode Benefits: - Cancellation works for all jobs (blocking and non-blocking) - Multiple concurrent 'sync' solves possible (one per worker) - Simpler code with single processing path - Worker isolation (one job failure doesn't crash server)
- Rename protobuf OptimizationProblem fields to match data_model_view_t: * constraint_matrix_values -> A_values (matches A_) * constraint_matrix_indices -> A_indices (matches A_indices_) * constraint_matrix_offsets -> A_offsets (matches A_offsets_) * objective_coefficients -> c (matches c_) * constraint_bounds -> b (matches b_) * is_integer/is_binary -> variable_types (bytes, matches variable_types_) - Add missing fields to protobuf and msgpack serialization: * initial_primal_solution - for warm start initial primal values * initial_dual_solution - for warm start initial dual values * Q_values, Q_indices, Q_offsets - quadratic objective matrix for QPS - Update msgpack serializer to use identical field names as protobuf - Update protobuf_serializer.cu serialization/deserialization - Both serializers now handle all data_model_view_t fields
- Rename A_values -> A (matches data_model_view_t::A_) - Remove unused problem_category field and enum - Rename PDLPSolverMode enum values to match pdlp_solver_mode_t: PDLP_MODE_STABLE1 -> Stable1, etc. - Rename LPMethod enum values to match method_t: METHOD_CONCURRENT -> Concurrent, METHOD_BARRIER -> Barrier, etc. - Update protobuf_serializer.cu for new enum names - Update msgpack_serializer.cpp for A field rename
- Remove redundant n_rows, n_cols, nnz fields (sizes inferred from arrays) - Add variable_names and row_names fields (matching protobuf) - Msgpack and protobuf now use identical field names for OptimizationProblem
Shared Memory Changes: - Replace fixed 64MB/128MB embedded data arrays with per-job shared memory - JobQueueEntry and ResultQueueEntry now store shm_data_name instead of data[] - Create /cuopt_job_<id> and /cuopt_result_<id> segments on demand - Cleanup segments after job completion or on error - Change data_size from uint32_t to uint64_t for >4GB support Wire Protocol Changes: - Change message size from uint32_t (4GB max) to uint64_t - Update both server (cuopt_remote_server.cpp) and client (remote_solve.cu) - Increase sanity check limit from 100MB to 16GB Benefits: - No fixed size limits - problems limited only by system RAM - No wasted memory for small problems - Queue entries are now ~256 bytes instead of 64MB+ - Supports very large LP/MIP problems
- Add is_mip parameter to serialize_result_response to properly distinguish LP vs MIP solutions (protobuf parsing is lenient and was misinterpreting MIP results as LP) - Add get_job_is_mip helper function to retrieve job type from tracker - Change job_id fields from string to bytes in protobuf schema to avoid spurious UTF-8 validation warnings when parsing wrong message types - Update all serialize_result_response call sites to pass is_mip flag Both LP and MIP remote solve now work correctly with Python Problem class.
Pipes are container-friendly (no /dev/shm size limits or volume mounts needed). Shared memory may be slightly faster for very large problems. Changes: - Add WorkerPipes struct with pipe file descriptors - Add worker_index, data_sent fields to JobQueueEntry for pipe sync - Add worker_index field to ResultQueueEntry for pipe mode - Add pending_job_data map for storing data before worker claims job - Add pipe I/O helpers: write_to_pipe, read_from_pipe, send_job_data_pipe, recv_job_data_pipe, send_result_pipe, recv_result_pipe - Modify spawn_workers to create pipes before forking - Modify spawn_single_worker to recreate pipes when restarting workers - Modify worker_process to read job data from pipe or shm based on mode - Modify result_retrieval_thread to send job data to workers (pipe mode) and read results from pipes - Add --use-shm command line option - Display IPC mode in server startup message Tested: Both pipe mode (default) and shm mode work correctly with LP and MIP remote solve via Python Problem class.
When a running job is cancelled, the worker is killed with SIGKILL which closes its pipe ends. If the server is in the middle of writing to the worker's input pipe, it receives SIGPIPE which would terminate the server. By ignoring SIGPIPE, the write() call returns -1 with errno=EPIPE instead, which is handled gracefully (logged as 'Failed to send job data'). Tested: Both pipe mode and shm mode work correctly with job cancellation and worker restart.
These scripts are for manual testing during development and won't be run by CI (pytest looks in specific directories like python/cuopt/). - test_pipe_cancel_restart.py: Tests job cancellation and worker restart - test_python_problem_remote.py: Tests Python Problem class with remote solve
Move protobuf::libprotobuf from PUBLIC to PRIVATE linking for both cuopt and cuopt_remote_server libraries. This fixes Python wheel build failures where protobuf was incorrectly exposed as a public dependency.
Remove unnecessary data_sent flag wait loop that caused 30s timeout. The pipe read operation naturally blocks until data is available, so explicit polling of the data_sent flag is redundant and can cause the worker to timeout before the server writes data. This fixes the 'Timeout waiting for job data' error when using pipe-based IPC with large problems. Bug: Worker would timeout after 30 seconds waiting for data_sent flag Fix: Worker now directly reads from pipe (blocking read) Result: Data transfer happens as soon as server writes to pipe
Document all three operating modes (Sync, Async, WAIT_FOR_RESULT) with: - Detailed workflow descriptions and use cases - Log retrieval methods for each mode - Complete API endpoint reference - Python usage examples and best practices - Architecture diagrams for all modes - Future API design recommendations This consolidates the remote solve protocol documentation into a single reference for users and developers.
- Fix send_typed_message to write 8-byte payload size (was incorrectly writing 4 bytes) - Fix pipe mode deadlock: set result.ready=true BEFORE writing to pipe (prevents deadlock when result size exceeds pipe buffer ~64KB) - Add 2-minute poll() timeout in read_from_pipe() for safety if worker dies - Add #include <poll.h> for poll() function
📝 WalkthroughWalkthroughAdds end-to-end remote-solve: server and CLI, protobuf/msgpack serializers with pluggable factory, remote client APIs (sync/async) and cancel, view-based CPU/GPU data paths with hybrid solution memory, build/CI dependency updates, Python/C bindings, tests, docs, and developer scripts. Changes
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 12
Note
Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cpp/cuopt_cli.cpp (1)
224-226: Minor: typo in log message“MPS parser execption” (Line 225) → “exception”.
🤖 Fix all issues with AI agents
In @cpp/cuopt_remote_server.cpp:
- Around line 108-127: send_typed_message currently assumes ::write writes the
entire type byte and 8-byte size and doesn't handle partial writes or EINTR;
update it to use the existing write_all helper (and ensure write_all itself
handles EINTR/short writes) for the header fields (msg_type and payload_size)
and for the payload loop so all writes are retried until complete; also audit
and replace raw ::write usages in the other framing helpers (the code around the
other helper functions that perform framing and the read_all/write_all
implementations) to use the robust write_all/read_all semantics so headers,
sizes and payloads cannot be partially written or interrupted.
- Around line 1190-1202: The code mutates JobWaiter fields from
result_retrieval_thread() and cancel_job() without acquiring the
JobWaiter::mutex, causing data races against the waiting thread that reads
waiter->ready under that mutex; to fix, when you look up the waiter in
waiting_threads (using waiters_mutex) then acquire the specific waiter's mutex
(e.g., std::lock_guard<std::mutex> waiter_lock(waiter->mutex)) before writing
waiter->result_data, waiter->error_message, waiter->success, and waiter->ready,
and call waiter->cv.notify_one() while still holding that waiter mutex (or use
notify_one immediately after releasing but ensure memory visibility), doing the
same symmetric locking in cancel_job(); ensure the waiting thread still checks
the ready predicate under JobWaiter::mutex so no race or missed wakeup occurs.
- Around line 67-91: The problem: several strncpy(...) uses that populate
fixed-size fields (JobQueueEntry::job_id, JobQueueEntry::shm_data_name,
ResultQueueEntry::job_id, ResultQueueEntry::shm_data_name,
ResultQueueEntry::error_message and similar spots) use length-1 without
explicitly writing a trailing NUL, allowing non-NUL-terminated strings and UB
when constructing std::string or calling strcmp. Fix: after every strncpy into
these fixed buffers (job_id, shm_data_name, error_message, etc.) explicitly set
buffer[sizeof(buffer)-1] = '\0' (or switch to a safe copy like strncpy followed
by that assignment, or use strlcpy if available) so the fields are always
NUL-terminated; update all occurrences noted in the review (around lines 67-91,
548-562, 720-738, 999-1011, 1459-1476) and any other strncpy uses that write
into fixed char arrays.
In @cpp/src/linear_programming/cuopt_c.cpp:
- Around line 228-230: The GPU view creation in create_view_from_gpu_problem
currently leaves view.variable_types unset (unlike create_view_from_cpu_data
which calls create_view), causing solve routines in solve.cu (which expect
variable_types to be present) to read uninitialized data; fix by calling
view.set_variable_types(gpu.get_variable_types().data(), gpu.get_n_variables())
during create_view_from_gpu_problem so the view's variable_types span is
initialized, or alternatively modify the solve routines to check for a
missing/empty variable_types span and handle the GPU path explicitly.
In @cpp/src/mip/solve.cu:
- Around line 512-564: The CPU-local path in solve_mip can dereference a null
CUDA handle: before calling data_model_view_to_optimization_problem(handle_ptr,
view) when view.is_device_memory() is false and remote_config is not set,
validate handle_ptr; if it's null, log an error (CUOPT_LOG_ERROR) and return a
mip_solution_t<i_t,f_t> with cuopt::logic_error("No CUDA handle for CPU->GPU
copy", cuopt::error_type_t::RuntimeError) to mirror the GPU->CPU remote-check
branch; otherwise proceed to call data_model_view_to_optimization_problem and
return solve_mip(op_problem, settings).
In @dev_scripts/test_pipe_cancel_restart.py:
- Around line 29-52: The client protocol in
python/cuopt/cuopt/linear_programming/remote.py incorrectly uses 32-bit size
prefixes ("<I"), causing mismatch with the server's uint64_t sizes; update the
send and receive size packing/unpacking in functions that call struct.pack("<I",
...) and struct.unpack("<I", ...) to use "<Q" instead, and ensure any buffer
reads expect 8 bytes for the size prefix (e.g., change reads that assume 4 bytes
to 8 bytes) so the client reads the full uint64 size and then reads the
subsequent payload accordingly.
In @docs/remote_solve_architecture.md:
- Around line 304-314: The documented struct definitions for JobQueueEntry and
ResultQueueEntry are out of sync with the real implementation; update the
documentation to match cpp/cuopt_remote_server.cpp by changing
JobQueueEntry::job_id to char[64], data_size to uint64_t, remove inline data[]
and add char shm_data_name[128] for shared-memory names, change worker_pid to
std::atomic<pid_t>, and add std::atomic<int> worker_index and std::atomic<bool>
data_sent; similarly update ResultQueueEntry to include shm_data_name[128], use
uint64_t data_size, add worker_index and clarify the error message field/type;
also update surrounding explanatory text to describe dual-mode transport (pipe
and shared memory) and how data is referenced via shm_data_name instead of
inline buffers.
🟠 Major comments (14)
cpp/libmps_parser/CMakeLists.txt-81-82 (1)
81-82: Static library missing BZip2/ZLIB include directories.The static library has the same source files as the shared library, but lines 144-150 only add the BZip2 and ZLIB include directories to
mps_parser, notmps_parser_static. If any source files include headers from these libraries, the static build will fail whenMPS_PARSER_WITH_BZIP2orMPS_PARSER_WITH_ZLIBis enabled.🔧 Proposed fix
if(MPS_PARSER_WITH_BZIP2) target_include_directories(mps_parser PRIVATE BZip2::BZip2) + target_include_directories(mps_parser_static PRIVATE BZip2::BZip2) endif(MPS_PARSER_WITH_BZIP2) if(MPS_PARSER_WITH_ZLIB) target_include_directories(mps_parser PRIVATE ZLIB::ZLIB) + target_include_directories(mps_parser_static PRIVATE ZLIB::ZLIB) endif(MPS_PARSER_WITH_ZLIB)Also applies to: 94-99
cpp/src/mip/solver_solution.cu-147-169 (1)
147-169: Potential null pointer dereference in solution accessors.
get_solution()dereferencessolution_andget_solution_host()dereferencessolution_host_without null checks. Ifget_solution()is called on a CPU-only solution (wheresolution_is null), this will crash. Similarly forget_solution_host()on a GPU solution.Consider adding assertions or guards to catch misuse.
🛡️ Suggested defensive check
template <typename i_t, typename f_t> const rmm::device_uvector<f_t>& mip_solution_t<i_t, f_t>::get_solution() const { + cuopt_assert(solution_ != nullptr, "get_solution() called but solution is in host memory"); return *solution_; } template <typename i_t, typename f_t> rmm::device_uvector<f_t>& mip_solution_t<i_t, f_t>::get_solution() { + cuopt_assert(solution_ != nullptr, "get_solution() called but solution is in host memory"); return *solution_; } template <typename i_t, typename f_t> std::vector<f_t>& mip_solution_t<i_t, f_t>::get_solution_host() { + cuopt_assert(solution_host_ != nullptr, "get_solution_host() called but solution is in device memory"); return *solution_host_; } template <typename i_t, typename f_t> const std::vector<f_t>& mip_solution_t<i_t, f_t>::get_solution_host() const { + cuopt_assert(solution_host_ != nullptr, "get_solution_host() called but solution is in device memory"); return *solution_host_; }cpp/src/linear_programming/utilities/serializers/msgpack_serializer.cpp-60-137 (1)
60-137: Duplicate LP settings serialization code should be extracted.The LP settings packing code (28 fields) is duplicated between
serialize_lp_request(lines 72-134) andserialize_async_lp_request(lines 577-639). This violates DRY principle and increases maintenance burden.♻️ Extract helper method for LP settings serialization
private: void pack_lp_settings(msgpack::packer<msgpack::sbuffer>& pk, const pdlp_solver_settings_t<i_t, f_t>& settings) { pk.pack_map(28); pk.pack("absolute_gap_tolerance"); pk.pack(settings.tolerances.absolute_gap_tolerance); pk.pack("relative_gap_tolerance"); pk.pack(settings.tolerances.relative_gap_tolerance); // ... rest of settings }Then in both
serialize_lp_requestandserialize_async_lp_request:pack_lp_settings(pk, settings);Also applies to: 558-642
cpp/src/linear_programming/utilities/serializers/msgpack_serializer.cpp-139-183 (1)
139-183: Duplicate MIP settings serialization should also be extracted.Similarly, MIP settings packing (13 fields) is duplicated between
serialize_mip_requestandserialize_async_mip_request.♻️ Extract helper method for MIP settings serialization
private: void pack_mip_settings(msgpack::packer<msgpack::sbuffer>& pk, const mip_solver_settings_t<i_t, f_t>& settings) { pk.pack_map(13); pk.pack("time_limit"); pk.pack(settings.time_limit); // ... rest of settings }Also applies to: 644-695
cpp/src/linear_programming/cuopt_c.cpp-1275-1355 (1)
1275-1355: Exception safety: potential memory leak if solve throws.If
solve_miporsolve_lpthrows an exception (e.g., due to CUDA errors), thesolution_and_stream_view_tallocated at lines 1284, 1300, 1325, or 1341 will be leaked.Consider wrapping the solve calls in try-catch:
Suggested fix pattern
solution_and_stream_view_t* solution_and_stream_view = new solution_and_stream_view_t(true, problem_and_stream_view->handle.get()); try { solution_and_stream_view->mip_solution_ptr = new mip_solution_t<cuopt_int_t, cuopt_float_t>( solve_mip<cuopt_int_t, cuopt_float_t>(gpu_problem, mip_settings)); } catch (...) { delete solution_and_stream_view; throw; // or return appropriate error code }cpp/cuopt_remote_server.cpp-1189-1221 (1)
1189-1221: Major: result data is moved into the waiter before storing tojob_trackerWhen a blocking waiter exists,
result_datais moved into the waiter (Line 1196), thenjob_tracker[job_id].result_data = result_data;stores an empty vector (Line 1211). That makes subsequent GET_RESULT/DELETE flows inconsistent (especially for WAIT_FOR_RESULT).cpp/cuopt_remote_server.cpp-380-459 (1)
380-459: Security hardening: shared memory and log files are world-writable / symlinkable
shm_open(..., 0666)(Line 380, 410, 437) exposes job/result control to other users on the same host.- Logs are written under
/tmp(Line 1648) and opened withO_CREAT|O_TRUNC(Line 583), which is vulnerable to symlink attacks.Consider
0600perms (and/or per-user shm names), and opening logs withO_NOFOLLOWplus ensuring the log dir is owned and not world-writable.Also applies to: 580-602, 1647-1665
cpp/cuopt_cli.cpp-72-172 (1)
72-172: Explicitly mark the MPS-backed view as host memory (avoid remote-solve false negatives/positives)
create_view_from_mps_data_model()builds spans to CPU vectors, but the view’sis_device_memoryis never set here (Line 82+; later used at Line 271). If the default istrue, CLI remote solves will fail becausesolve_lp/solve_mipwill treat the view as GPU data and require a CUDA handle.Proposed fix
template <typename i_t, typename f_t> cuopt::linear_programming::data_model_view_t<i_t, f_t> create_view_from_mps_data_model( const cuopt::mps_parser::mps_data_model_t<i_t, f_t>& mps_data_model) { cuopt::linear_programming::data_model_view_t<i_t, f_t> view; + // MPS parser owns CPU buffers; this view is host-backed. + view.set_is_device_memory(false); view.set_maximize(mps_data_model.get_sense()); ... return view; }Also applies to: 269-272
cpp/src/linear_programming/utilities/remote_solve.cu-42-223 (1)
42-223: Harden socket I/O: handle EINTR/partial writes + add timeouts
read_all()/write_all()treat anyn <= 0as hard failure (Lines 193-200, 209-216), which will spuriously fail on EINTR. Also, connect/read/write can block indefinitely; please add timeouts (SO_RCVTIMEO/SO_SNDTIMEO or non-blocking connect + poll). This is especially important for long-running remote solves.cpp/src/mip/solve.cu-298-381 (1)
298-381: Potential correctness gap: view helper ignores quadratic objective (and should set is_device_memory=false)This
create_view_from_mps_data_model()omits Q-objective population (unlike the CLI helper), but later code copies/serializes quadratic objective fields (cpu_problem_data_t has Q buffers; copy_view_to_cpu copies Q). That mismatch strongly suggests Q was intended to be supported here too. Also, this helper should explicitly mark the view as host memory.Also applies to: 505-510
cpp/src/mip/solve.cu-450-495 (1)
450-495: Add CUDA error checks for cudaMemcpyAsync and cudaStreamSynchronizeLines 57–65 missing error checking: wrap
cudaMemcpyAsyncandcudaStreamSynchronizewithRAFT_CUDA_TRY()to ensure copies complete successfully before returning potentially incomplete data for remote serialization. Without error checking, corrupted or partial data silently propagates downstream.cpp/src/linear_programming/solve.cu-1145-1275 (1)
1145-1275: Add CUDA error checks and use handle streams for data copiesThe
cudaMemcpycalls at lines 1187–1190 and 1240–1251 lack error checking. Additionally,cudaMemcpyAsync(line 1380) andcudaStreamSynchronize(line 1388) incopy_view_to_cpuare unchecked. More critically, the synchronouscudaMemcpycalls indata_model_view_to_optimization_problemdo not passhandle_ptr->get_stream(), causing them to execute on the default stream and ignore upstream work enqueued on a non-default stream, which breaks correctness. Useraft::copyorcudaMemcpyAsyncwith the handle's stream (as done incopy_view_to_cpu), and wrap all CUDA operations with error checking.cpp/src/linear_programming/utilities/protobuf_serializer.cu-1383-1391 (1)
1383-1391: Missing CUDA error checking on cudaMemcpy.Same issue as in
lp_solution_to_proto- wrap withRAFT_CUDA_TRY.🔧 Proposed fix
if (d_sol.size() > 0) { std::vector<f_t> h_sol(d_sol.size()); - cudaMemcpy(h_sol.data(), d_sol.data(), d_sol.size() * sizeof(f_t), cudaMemcpyDeviceToHost); + RAFT_CUDA_TRY(cudaMemcpy(h_sol.data(), d_sol.data(), d_sol.size() * sizeof(f_t), cudaMemcpyDeviceToHost)); for (const auto& v : h_sol) { pb_solution->add_solution(static_cast<double>(v)); } }cpp/src/linear_programming/utilities/protobuf_serializer.cu-1291-1322 (1)
1291-1322: Missing CUDA error checking on cudaMemcpy calls.Per coding guidelines, every CUDA memory operation must have error checking. These
cudaMemcpycalls could fail silently.🔧 Proposed fix
if (d_primal.size() > 0) { std::vector<f_t> h_primal(d_primal.size()); - cudaMemcpy( - h_primal.data(), d_primal.data(), d_primal.size() * sizeof(f_t), cudaMemcpyDeviceToHost); + RAFT_CUDA_TRY(cudaMemcpy( + h_primal.data(), d_primal.data(), d_primal.size() * sizeof(f_t), cudaMemcpyDeviceToHost)); for (const auto& v : h_primal) { pb_solution->add_primal_solution(static_cast<double>(v)); } } // Copy dual solution from device if (d_dual.size() > 0) { std::vector<f_t> h_dual(d_dual.size()); - cudaMemcpy( - h_dual.data(), d_dual.data(), d_dual.size() * sizeof(f_t), cudaMemcpyDeviceToHost); + RAFT_CUDA_TRY(cudaMemcpy( + h_dual.data(), d_dual.data(), d_dual.size() * sizeof(f_t), cudaMemcpyDeviceToHost)); for (const auto& v : h_dual) { pb_solution->add_dual_solution(static_cast<double>(v)); } } // For reduced cost, we can access via const cast since we're just reading auto& nc_solution = const_cast<optimization_problem_solution_t<i_t, f_t>&>(solution); auto& d_reduced_cost = nc_solution.get_reduced_cost(); if (d_reduced_cost.size() > 0) { std::vector<f_t> h_reduced_cost(d_reduced_cost.size()); - cudaMemcpy(h_reduced_cost.data(), + RAFT_CUDA_TRY(cudaMemcpy(h_reduced_cost.data(), d_reduced_cost.data(), d_reduced_cost.size() * sizeof(f_t), - cudaMemcpyDeviceToHost); + cudaMemcpyDeviceToHost)); for (const auto& v : h_reduced_cost) { pb_solution->add_reduced_cost(static_cast<double>(v)); } }You'll need to include
<raft/util/cudart_utils.hpp>forRAFT_CUDA_TRY.
🟡 Minor comments (12)
dev_scripts/test_c_api.c-138-143 (1)
138-143: Add NULL check after malloc.
malloccan returnNULLon allocation failure, but the code proceeds to usesolution_valueswithout checking.Proposed fix
// Get and print solution variables solution_values = (cuopt_float_t*)malloc(num_variables * sizeof(cuopt_float_t)); + if (solution_values == NULL) { + printf("Error allocating memory for solution values\n"); + status = CUOPT_ERROR; + goto DONE; + } status = cuOptGetPrimalSolution(solution, solution_values);python/cuopt/cuopt/linear_programming/solver/solver_wrapper.pyx-155-185 (1)
155-185: Potential issue withtype_castwhenobjhas neitherdtypenordtypes.When
objlacks bothdtypeanddtypesattributes (line 172),obj_type = type(obj)returns a Python type object (e.g.,<class 'int'>). Passing this tonp.issubdtype()may not behave as expected sincenp.issubdtypeexpects numpy dtype-like objects.🐛 Consider handling this edge case
else: - obj_type = type(obj) + # Fallback: try to infer dtype from a numpy array conversion + try: + obj_type = np.asarray(obj).dtype + except Exception: + obj_type = type(obj)Alternatively, you could skip the type checking for this fallback case since
astypewill handle conversion anyway.cpp/tests/linear_programming/CMakeLists.txt-6-10 (1)
6-10: Test coverage is incomplete; add tests for numerical correctness and degenerate cases.The test file properly isolates environment variables and covers the basic remote/local solve paths, but lacks critical validation:
- Tests only verify termination status (Optimal or NumericalError), not actual solution correctness
- No degenerate cases tested (infeasible, unbounded, singleton problems)
- No free variables or extreme problem dimensions tested
- No verification of problem transformation correctness or index consistency
Add tests that validate:
- Actual optimal solution values against expected results
- Degenerate cases (infeasible, unbounded, empty, and singleton problems)
- Free variables and extreme problem dimensions
- Transformation and postsolve mapping consistency
cpp/cuopt_cancel_job.cpp-82-83 (1)
82-83:std::stoican throw on invalid input.If an invalid port string is provided (e.g.,
-p abc),std::stoiwill throwstd::invalid_argumentorstd::out_of_range, causing an unhandled exception and crash.🔧 Suggested fix with exception handling
} else if (strcmp(argv[i], "-p") == 0 && i + 1 < argc) { - port = std::stoi(argv[++i]); + try { + port = std::stoi(argv[++i]); + } catch (const std::exception& e) { + std::cerr << "Error: Invalid port number\n"; + return 1; + } } else if (argv[i][0] != '-') {docs/developer/REMOTE_SOLVE_GUIDE.md-60-67 (1)
60-67: Correct the documented default port in REMOTE_SOLVE_GUIDE.md from 8765 to 9090.The actual default port for
cuopt_remote_serveris 9090 (as confirmed incpp/cuopt_remote_server.cppline 257 and usage output), but this guide incorrectly documents it as 8765 at lines 29 and 65. Update the documentation to match the actual implementation. Note: Port 8765 can remain in examples if intentional, but the stated default must be corrected.build.sh-403-404 (1)
403-404: Fix CMake reconfigure command - incorrect source directory.The cmake reconfigure command passes
${LIBCUOPT_BUILD_DIR}as the source directory, but this is the build directory, not the source directory. CMake will use the cached source directory when reconfiguring an existing build, but this syntax is potentially confusing and may cause issues.🔧 Suggested fix
# Reconfigure with BUILD_REMOTE_SERVER=ON - cmake -DBUILD_REMOTE_SERVER=ON "${LIBCUOPT_BUILD_DIR}" + cmake -DBUILD_REMOTE_SERVER=ON .When already in the build directory (line 401:
cd "${LIBCUOPT_BUILD_DIR}"), use.to reconfigure the current build directory, which is the idiomatic CMake approach.docs/remote_solve_architecture.md-536-547 (1)
536-547: Update documentation: return type for deserialize_lp_solution is incorrect.The documented return type
lp_solution_t<i_t, f_t>should beoptimization_problem_solution_t<i_t, f_t>to match the actual interface defined incpp/include/cuopt/linear_programming/utilities/remote_serialization.hpp(line 78).cpp/src/linear_programming/utilities/serializers/msgpack_serializer.cpp-1051-1059 (1)
1051-1059: Potential protocol issue: error message duplicated on failure.In
serialize_submit_response, when!success, theresultstring (which is the error message) is packed twice (lines 1057-1058). This may cause deserialization issues if the client doesn't expect this duplication.🐛 Verify protocol expectation or fix duplication
If the comment "error message duplicated for compatibility" is intentional, consider adding a version check. Otherwise:
pk.pack(result); // job_id on success, error message on failure - if (!success) { pk.pack(result); } // error message duplicated for compatibilitycpp/include/cuopt/linear_programming/utilities/remote_serialization.hpp-101-125 (1)
101-125: Document input validation requirements in the interface.The virtual methods
deserialize_lp_requestanddeserialize_mip_requestprocess untrusted network data. Update the documentation to explicitly specify that implementations must validate:
- Array sizes do not exceed system/solver resource limits (prevent resource exhaustion)
- CSR matrix indices are within valid range (0 to num_variables-1) and offsets are consistent
- Constraint/objective dimensions match declared problem size (prevent buffer overflows and array access violations)
Without documented requirements, custom implementations may skip validation.
cpp/include/cuopt/linear_programming/mip/solver_solution.hpp-80-86 (1)
80-86: Potential null dereference if called whenis_device_memory()is true.The documentation states the method is only valid when
is_device_memory()returns false, but there's no runtime protection. Consider adding a debug assertion or throwing an exception for misuse.💡 Suggested defensive check (in implementation)
std::vector<f_t>& mip_solution_t<i_t, f_t>::get_solution_host() { cuopt_assert(solution_host_ != nullptr, "get_solution_host() called on device memory solution"); return *solution_host_; }cpp/src/linear_programming/solver_solution.cu-441-476 (1)
441-476: Potential null dereference in host getters when called on device memory solution.These methods dereference
unique_ptrwithout null checks. If called whenis_device_memory_is true, the underlying pointer would be null.💡 Suggested defensive check
template <typename i_t, typename f_t> std::vector<f_t>& optimization_problem_solution_t<i_t, f_t>::get_primal_solution_host() { + cuopt_assert(!is_device_memory_ && primal_solution_host_, + "get_primal_solution_host() requires host memory"); return *primal_solution_host_; }Apply similar checks to all host getters.
cpp/src/linear_programming/solver_solution.cu-410-439 (1)
410-439: Device getters may cause null dereference afterto_host()is called.After
to_host()is invoked,primal_solution_,dual_solution_, andreduced_cost_are reset to null (lines 706-708). Subsequent calls to these getters would dereference null pointers.💡 Suggested defensive check
template <typename i_t, typename f_t> rmm::device_uvector<f_t>& optimization_problem_solution_t<i_t, f_t>::get_primal_solution() { + cuopt_assert(is_device_memory_ && primal_solution_, + "get_primal_solution() requires device memory"); return *primal_solution_; }Apply similar checks to all device getters.
🧹 Nitpick comments (38)
REMOTE_SOLVE_MODES.md (1)
38-40: Consider adding language identifiers to ASCII diagram code blocks.The static analysis flagged several code blocks without language specifiers. For ASCII workflow diagrams, pseudo-code, and architecture diagrams, consider adding
textorplaintextas the language identifier to silence linter warnings while preserving readability.Example fix for line 38:
-``` +```text Client → SUBMIT_JOB (blocking=true) → Server blocks → Worker solves → Server returns result -``` +```This applies to other ASCII diagram blocks at lines 67, 99, 217, 492, 502, 567, 591, 634, 766, 790, and 838.
python/cuopt/cuopt/__init__.py (1)
19-24: Consider caching the imported module in the namespace.The current implementation calls
importlib.import_moduleon every attribute access. While Python's import system caches modules insys.modules, you could avoid repeated__getattr__calls by caching the result in the module namespace:♻️ Optional optimization
def __getattr__(name): """Lazy import submodules to support CPU-only hosts with remote solve.""" if name in _submodules: import importlib - return importlib.import_module(f"cuopt.{name}") + module = importlib.import_module(f"cuopt.{name}") + globals()[name] = module # Cache to avoid future __getattr__ calls + return module raise AttributeError(f"module {__name__!r} has no attribute {name!r}")This is a minor optimization—the current approach is functionally correct.
dev_scripts/test_python_api.py (1)
33-52: Consider adding error handling for parse and solve failures.If
ParseMpsorSolvethrows an exception, the script will crash without useful context. Wrapping these in try-except would improve usability for a dev test script.🛠️ Suggested improvement
+ try: # Parse MPS file into DataModel data_model = cuopt_mps_parser.ParseMps(mps_file) - # Create solver settings settings = solver_settings.SolverSettings() settings.set_optimality_tolerance(1e-6) - # Solve print("Solving...") solution = solver.Solve(data_model, settings) + except Exception as e: + print(f"Error during parse/solve: {e}") + sys.exit(1)python/cuopt/cuopt/utilities/__init__.py (1)
15-26: Lazy import lacks caching—repeated attribute access triggers repeated imports.Each call to
from cuopt.utilities import type_cast(or direct access) will re-execute the import. For idiomatic lazy loading, cache the imported symbol in the module's__dict__after first access.♻️ Suggested fix with caching
def __getattr__(name): """Lazy import CUDA-dependent utilities.""" if name == "type_cast": from cuopt.utilities.type_casting import type_cast + globals()["type_cast"] = type_cast return type_cast elif name == "series_from_buf": from cuopt.utilities.utils import series_from_buf + globals()["series_from_buf"] = series_from_buf return series_from_buf elif name == "check_solution": from cuopt.utilities.utils import check_solution + globals()["check_solution"] = check_solution return check_solution raise AttributeError(f"module {__name__!r} has no attribute {name!r}")docs/developer/SOLUTION_MEMORY_ARCHITECTURE.md (1)
73-78: Add language specifier to fenced code blocks for ASCII diagrams.The static analysis flagged these blocks (MD040). Use
textorplaintextas the language identifier for ASCII art diagrams.📝 Suggested fix
-``` +```text ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ Solver │ ──► │ Solution │ ──► │ User │ │ (GPU) │ │ (GPU mem) │ │ (GPU/CPU) │ └─────────────┘ └─────────────┘ └─────────────┘Apply similarly to the second diagram at line 90. </details> Also applies to: 90-95 </blockquote></details> <details> <summary>dev_scripts/test_python_problem_remote.py (1)</summary><blockquote> `157-179`: **Consider adding edge case tests for degenerate problems.** Per coding guidelines, tests should cover degenerate cases (infeasible, unbounded, empty problems). Consider adding tests for: - Infeasible LP (e.g., x >= 1, x <= 0) - Unbounded LP (e.g., minimize x with no upper bound) Would you like me to generate additional test functions for infeasible and unbounded problem cases? </blockquote></details> <details> <summary>docs/developer/SERIALIZATION_PLUGIN_GUIDE.md (3)</summary><blockquote> `13-25`: **Add language specifier to fenced code block.** The ASCII diagram code block lacks a language specifier. Consider adding `text` or `plaintext` for consistency with linting rules. <details> <summary>Proposed fix</summary> ```diff -``` +```text ┌─────────────┐ ┌─────────────┐ │ Client │ │ Server │
374-377: Add language specifier to fenced code block.The server output example block lacks a language specifier.
Proposed fix
-``` +```text [remote_solve] Loading custom serializer from: /path/to/libcuopt_my_serializer.so [remote_solve] Using custom serializer: my_format</details> --- `182-197`: **Incomplete code example may confuse developers.** The `deserialize_lp_solution` and `deserialize_mip_solution` method bodies only show the error case but lack a return statement for the success path. Consider adding a placeholder or explicit note. <details> <summary>Suggested improvement</summary> ```diff optimization_problem_solution_t<i_t, f_t> deserialize_lp_solution( const std::vector<uint8_t>& data) override { // Parse the solution data // Create and return solution object + // Example success return: + // return optimization_problem_solution_t<i_t, f_t>(...); // On error, return error solution: // return optimization_problem_solution_t<i_t, f_t>( // cuopt::logic_error("Parse error", cuopt::error_type_t::RuntimeError)); }cpp/src/linear_programming/utilities/cython_solve.cu (1)
255-262: Consider usingstd::any_offor cleaner MIP detection.The loop correctly identifies MIP problems, but could be more idiomatic.
Optional refactor
// Determine if LP or MIP based on variable types - bool is_mip = false; - auto var_types = data_model->get_variable_types(); - for (size_t i = 0; i < var_types.size(); ++i) { - if (var_types.data()[i] != 'C') { - is_mip = true; - break; - } - } + auto var_types = data_model->get_variable_types(); + bool is_mip = std::any_of(var_types.data(), + var_types.data() + var_rules.size(), + [](char c) { return c != 'C'; });cpp/CMakeLists.txt (2)
228-231: Prefer target-specific include directories over globalinclude_directories.Using
include_directories()at the top level pollutes the global include path for all targets. Consider usingtarget_include_directories()on the specific targets that need these paths (e.g.,cuopt).♻️ Suggested change
# Protocol Buffers for remote solve serialization find_package(Protobuf REQUIRED) -include_directories(${Protobuf_INCLUDE_DIRS}) -include_directories(${CMAKE_CURRENT_BINARY_DIR}) +# Protobuf includes will be added via target_include_directories on cuopt targetThen add to the
cuopttarget's include directories (around line 301):target_include_directories(cuopt PRIVATE ... "${Protobuf_INCLUDE_DIRS}" "${CMAKE_CURRENT_BINARY_DIR}" # For generated proto headers ... )
551-613: Remote server build configuration looks well-structured.The optional
BUILD_REMOTE_SERVERflag with proper guards (NOT BUILD_LP_ONLY) and RUNPATH configuration for development is appropriate. The install targets and component assignment are correctly configured.One minor observation:
cuopt_cancel_joblinkscuoptasPUBLIC, but for executables,PRIVATElinkage is more appropriate since executables don't export their dependencies.♻️ Minor linkage adjustment
target_link_libraries(cuopt_cancel_job - PUBLIC + PRIVATE cuopt )cpp/include/cuopt/linear_programming/pdlp/solver_solution.hpp (1)
274-306: Host accessor documentation could note thread-safety.The host accessors are documented with preconditions, but per coding guidelines, consider adding a note about thread-safety. Since these return references to internal storage, concurrent access would be unsafe.
📝 Optional documentation enhancement
/** * @brief Returns the primal solution in host (CPU) memory. * Only valid when is_device_memory() returns false. + * @note Not thread-safe - caller must ensure exclusive access when modifying. * * @return std::vector<f_t>& The host memory container for the primal solution. */ std::vector<f_t>& get_primal_solution_host();docs/developer/REMOTE_SOLVE_GUIDE.md (1)
16-46: Consider addingtextorplaintextlanguage identifier for the ASCII diagram.The static analysis tool flags the missing language specifier. For ASCII diagrams, you can use
textorplaintextto satisfy linters while preserving formatting.🔧 Suggested fix
-``` +```text ┌─────────────────────────────────────────────────────────────┐cpp/cuopt_cancel_job.cpp (1)
73-74: Consider validating environment variable port value.
std::atoisilently returns 0 for invalid input (e.g.,CUOPT_REMOTE_PORT=abc), which would result in attempting to connect to port 0. Consider adding validation or usingstd::stoiwith try-catch for consistency.docs/remote_solve_architecture.md (2)
15-62: Add language specifier to ASCII diagram code blocks.The ASCII diagrams should have a language specifier for markdown linting compliance. Use
textorplaintextfor ASCII art diagrams.📝 Suggested fix
-``` +```text ┌─────────────────────────────────────────────────────────────────────────────┐ │ CLIENT PROCESS │Apply similar changes to other ASCII diagram blocks at lines 102, 138, 193, 229, 265, 336, 385, 445, and 490.
359-366: Document the rationale for connection timeout absence.The "Connection timeout: None" entry is flagged as a security note, but the document doesn't explain whether this is intentional (e.g., for long-running solves) or a gap to be addressed. Consider adding context about why timeouts are not implemented and whether this is planned for future enhancement.
cpp/libmps_parser/include/mps_parser/data_model_view.hpp (1)
424-426: Consider documenting thread-safety expectations.The
is_device_memory_flag has no synchronization. If adata_model_view_tinstance might be accessed from multiple threads (one modifying viaset_is_device_memory, another reading viais_device_memory), this could lead to a data race. Consider either:
- Documenting that the view is not thread-safe (consistent with the non-owning, setup-then-use pattern), or
- Using
std::atomic<bool>if concurrent access is expected.Given the typical usage pattern (configure view, then use), option 1 is likely sufficient.
build.sh (1)
409-415: Consider using CMake install instead of manual install command.The manual
install -m 755command bypasses CMake's install mechanism, which could lead to inconsistencies (e.g., missing from uninstall, not tracked by CMake install manifest). If the CMake target has proper install rules, consider usingcmake --installfor the specific component.However, if the server install rules are separate from the main libcuopt install, this approach may be intentional to avoid installing all of libcuopt again.
💡 Alternative using CMake install
# Install the server executable if [ -z "${INSTALL_TARGET}" ]; then echo "Skipping install of cuopt_remote_server (-n flag set)" else - install -m 755 "${LIBCUOPT_BUILD_DIR}/cuopt_remote_server" "${INSTALL_PREFIX}/bin/" - echo "Installed cuopt_remote_server to ${INSTALL_PREFIX}/bin/" + cmake --install "${LIBCUOPT_BUILD_DIR}" --component cuopt_remote_server --prefix "${INSTALL_PREFIX}" + echo "Installed cuopt_remote_server to ${INSTALL_PREFIX}" fiThis requires the CMake target to have a corresponding install rule with the
cuopt_remote_servercomponent.cpp/src/linear_programming/utilities/serializers/CMakeLists.txt (2)
20-26: Consider using CMAKE_CURRENT_SOURCE_DIR or target-based includes.Using
${CMAKE_SOURCE_DIR}for include directories can be fragile if this CMakeLists.txt is ever included from a different location or used as a subproject. Consider using${CMAKE_CURRENT_SOURCE_DIR}/../../../..(relative paths) or preferably, rely on include directories exported by thecuopttarget dependency.💡 Alternative using relative paths
target_include_directories(cuopt_msgpack_serializer PRIVATE - "${CMAKE_SOURCE_DIR}/include" - "${CMAKE_SOURCE_DIR}/libmps_parser/include" - "${CMAKE_SOURCE_DIR}/src" + "${CMAKE_CURRENT_SOURCE_DIR}/../../../../include" + "${CMAKE_CURRENT_SOURCE_DIR}/../../../../libmps_parser/include" + "${CMAKE_CURRENT_SOURCE_DIR}/../.." ${CUDA_INCLUDE_DIRS} )Or better, if
cuopttarget exports its include directories properly, some of these may be redundant sincecuoptis already linked.
42-46: Consider adding SOVERSION for shared library versioning.For a shared library plugin, setting
SOVERSIONhelps with ABI compatibility tracking and allows multiple versions to coexist.💡 Suggested addition
set_target_properties(cuopt_msgpack_serializer PROPERTIES CXX_STANDARD 17 CXX_STANDARD_REQUIRED ON POSITION_INDEPENDENT_CODE ON + VERSION 1.0.0 + SOVERSION 1 )dev_scripts/test_pipe_cancel_restart.py (1)
194-284: Test workflow is well-structured with clear steps.The test follows a logical sequence: submit → wait → check status → cancel → verify cancellation → wait for restart → submit new job → verify completion. Each step has appropriate logging and validation.
Consider adding a return code for the case where cancellation verification fails (lines 239-244 only print a warning but don't fail the test).
♻️ Optional: Return failure if cancellation doesn't produce expected status
if status != cuopt_remote_pb2.CANCELLED: print( f" WARNING: Expected CANCELLED, got {cuopt_remote_pb2.JobStatus.Name(status)}" ) + # Optionally fail the test if cancellation didn't work as expected + # return 1 else: print(" ✓ Job successfully cancelled!")cpp/tests/linear_programming/unit_tests/remote_solve_test.cu (2)
138-194: Add numerical correctness validation for local solve tests.The local solve tests (
gpu_data_with_local_solveandcpu_data_with_local_solve) only verifyOptimaltermination status but don't validate the solution values. Per coding guidelines, tests should validate numerical correctness, not just "runs without error."For this LP:
- minimize: 0.2x1 + 0.1x2
- subject to: 3x1 + 4x2 <= 5.4, 2.7x1 + 10.1x2 <= 4.9, x1,x2 >= 0
The optimal solution is at x1=x2=0 with objective=0.
♻️ Add numerical validation to local solve tests
// After checking termination status EXPECT_EQ(solution.get_termination_status(), pdlp_termination_status_t::Optimal); // Validate solution values auto primal = solution.get_primal_solution(); EXPECT_NEAR(solution.get_primal_objective(), 0.0, 1e-6); // For this problem, optimal is at origin (0, 0)Also applies to: 232-265
128-136: Verify expected behavior when remote solve is fully implemented.The tests currently expect
NumericalErrorfor remote paths because remote solving is not yet implemented. Add a TODO or tracking mechanism to update these expectations once remote solving is functional.Would you like me to open an issue to track updating these test expectations when remote solve is fully implemented?
Also applies to: 225-229
python/cuopt/cuopt/linear_programming/remote.py (1)
163-170: Socket receive loop should handle partial reads correctly.The receive loop correctly accumulates chunks until the full message is received. However, if the connection is closed mid-message (
not chunk), the function breaks out and may process incomplete data. Consider returning an error in this case.♻️ Handle incomplete message as error
while len(response_data) < size: chunk = sock.recv(size - len(response_data)) if not chunk: - break + sock.close() + return CancelResult( + success=False, + message="Connection closed while receiving response", + job_status=JobStatus.NOT_FOUND, + ) response_data += chunkcpp/include/cuopt/linear_programming/solve.hpp (2)
112-127: Documentation may be misleading about memory location.The Doxygen comment states the view "points to GPU memory", but per the PR's dual-path architecture,
data_model_view_tcan point to either CPU or GPU memory. Consider updating the documentation to reflect this:- * This function copies data from the view (which points to GPU memory) - * into an owning optimization_problem_t. + * This function copies data from the view (which can point to either CPU + * or GPU memory, based on is_device_memory()) into an owning + * optimization_problem_t stored on the GPU.
129-157: Well-documented API addition for view-based LP solving.The documentation clearly explains the dual CPU/GPU path behavior and environment variable usage. Consider adding a brief note about thread-safety expectations, as this is a public header. For example:
@note This function is not thread-safe when multiple threads solve using the same handle.
cpp/include/cuopt/linear_programming/utilities/remote_serialization.hpp (2)
217-237: Nested types within template class.The
job_status_tenum,logs_result_t, andcancel_result_tstructs are defined inside the template class. This means users must access them asremote_serializer_t<i_t, f_t>::job_status_t, which may be cumbersome.Consider whether these types should be defined at namespace scope for easier access, especially since they don't depend on the template parameters:
namespace cuopt::linear_programming { enum class job_status_t { QUEUED, PROCESSING, COMPLETED, FAILED, NOT_FOUND, CANCELLED }; struct logs_result_t { ... }; struct cancel_result_t { ... }; }This is a minor usability consideration and can be addressed later if needed.
307-314: Consider using enum for request type instead of int.The
get_async_request_typereturns an int with magic number meanings documented in the comment. Consider defining aRequestTypeenum at namespace scope and using it as the return type for type safety:enum class request_type_t { SUBMIT_JOB = 0, CHECK_STATUS = 1, GET_RESULT = 2, DELETE_RESULT = 3, GET_LOGS = 4, CANCEL_JOB = 5, WAIT_FOR_RESULT = 6 }; virtual request_type_t get_async_request_type(const std::vector<uint8_t>& data) = 0;This would prevent callers from using incorrect integer values.
cpp/include/cuopt/linear_programming/utilities/remote_solve.hpp (2)
39-54: Add port range validation.The port parsing at line 46 uses
std::stoibut doesn't validate that the port is in the valid range (1-65535). Invalid port values could cause connection failures with unclear error messages.Suggested improvement
if (host != nullptr && port != nullptr && host[0] != '\0' && port[0] != '\0') { try { int port_num = std::stoi(port); + if (port_num < 1 || port_num > 65535) { + return std::nullopt; // Invalid port range + } return remote_solve_config_t{std::string(host), port_num}; } catch (...) { // Invalid port number, fall back to local solve return std::nullopt; } }
98-114: Duplicate job status enum definitions.
remote_job_status_there duplicatesjob_status_tdefined insideremote_serializer_tinremote_serialization.hpp. While this separation may be intentional (public vs internal API), consider:
- Using a single definition at namespace scope
- Or adding a comment explaining why duplication is preferred
This prevents drift between the two definitions.
cpp/src/linear_programming/cuopt_c.cpp (2)
140-146: Consider caching MIP detection result.The
is_mip()method iterates through all variable types on each call, which is O(n). For large problems called multiple times, consider caching this result during problem creation:bool is_mip_cached = false; bool is_mip_value = false; bool is_mip() const { if (!is_mip_cached) { for (char vt : variable_types) { if (vt == CUOPT_INTEGER) { is_mip_value = true; break; } } is_mip_cached = true; } return is_mip_value; }However, since
is_mip()is typically called only a few times per solve, this is a minor optimization.
819-823: Consider null checks for defensive programming.The getter functions assume that if
!view.is_device_memory()thencpu_datais valid, and vice versa forgpu_problem. While this invariant should hold if the problem was created correctly, adding defensive null checks would catch programming errors:if (!problem_and_stream_view->view.is_device_memory()) { if (!problem_and_stream_view->cpu_data) return CUOPT_INTERNAL_ERROR; *num_constraints_ptr = problem_and_stream_view->cpu_data->num_constraints; }This is a defensive programming suggestion and may not be necessary if invariants are always maintained.
cpp/cuopt_cli.cpp (1)
72-172: Deduplicatecreate_view_from_mps_data_modelacross TUs (currently diverging)This helper exists (with different field coverage) in multiple places (CLI + LP + MIP). It’s already diverging (e.g., quadratic objective handling differs), which is a correctness risk long-term. Suggest moving to a shared header (e.g.,
cuopt/linear_programming/utilities/create_view.hpp) and reusing everywhere. Based on learnings, reduce duplication and coupling by consolidating shared utilities.cpp/cuopt_remote_server.cpp (1)
2275-2277: Bound per-connection thread creation (DoS risk)Each accepted socket spawns a detached thread (Line 2276) with no cap/backpressure. A burst of connections can exhaust threads/memory.
cpp/src/linear_programming/utilities/protobuf_serializer.cu (3)
140-153: Consider extracting MIP settings serialization into a helper method.This block is duplicated verbatim at lines 339-352 in
serialize_async_mip_request. Per coding guidelines, refactor code duplication (3+ occurrences) into shared utilities.♻️ Suggested refactor
+private: + void serialize_mip_settings_to_proto(const mip_solver_settings_t<i_t, f_t>& settings, + cuopt::remote::MIPSolverSettings* pb_settings) + { + pb_settings->set_time_limit(settings.time_limit); + pb_settings->set_relative_mip_gap(settings.tolerances.relative_mip_gap); + pb_settings->set_absolute_mip_gap(settings.tolerances.absolute_mip_gap); + pb_settings->set_integrality_tolerance(settings.tolerances.integrality_tolerance); + pb_settings->set_absolute_tolerance(settings.tolerances.absolute_tolerance); + pb_settings->set_relative_tolerance(settings.tolerances.relative_tolerance); + pb_settings->set_presolve_absolute_tolerance(settings.tolerances.presolve_absolute_tolerance); + pb_settings->set_log_to_console(settings.log_to_console); + pb_settings->set_heuristics_only(settings.heuristics_only); + pb_settings->set_num_cpu_threads(settings.num_cpu_threads); + pb_settings->set_num_gpus(settings.num_gpus); + pb_settings->set_presolve(settings.presolve); + pb_settings->set_mip_scaling(settings.mip_scaling); + }Then use
serialize_mip_settings_to_proto(settings, pb_settings)in both locations.
1310-1312: Consider adding a const overload forget_reduced_cost()instead of usingconst_cast.The
const_castworks for read-only access but is fragile. If the getter ever has side effects, this could cause undefined behavior.
800-833: Consider adding a defensive check for CPU memory requirement.The comment at lines 803-804 notes that
viewmust point to CPU memory, but there's no runtime verification. If this precondition is violated, the code would have undefined behavior accessing GPU pointers from CPU.💡 Suggested defensive check
void serialize_problem_to_proto(const mps_parser::data_model_view_t<i_t, f_t>& view, cuopt::remote::OptimizationProblem* pb_problem) { // Note: view must point to CPU memory for serialization // The solve functions ensure this by copying GPU data to CPU if needed + if (view.is_device_memory()) { + throw cuopt::logic_error("serialize_problem_to_proto requires CPU memory view", + cuopt::error_type_t::InvalidArgument); + }This assumes
data_model_view_thas anis_device_memory()method similar to solution types.
| struct JobQueueEntry { | ||
| char job_id[64]; | ||
| uint32_t problem_type; // 0 = LP, 1 = MIP | ||
| uint64_t data_size; // Size of problem data (uint64 for large problems) | ||
| char shm_data_name[128]; // Name of per-job shared memory segment (shm mode only) | ||
| std::atomic<bool> ready; // Job is ready to be processed | ||
| std::atomic<bool> claimed; // Worker has claimed this job | ||
| std::atomic<pid_t> worker_pid; // PID of worker that claimed this job (0 if none) | ||
| std::atomic<bool> cancelled; // Job has been cancelled (worker should skip) | ||
| // Pipe mode fields | ||
| std::atomic<int> worker_index; // Index of worker that claimed this job (-1 if none) | ||
| std::atomic<bool> data_sent; // Server has sent data to worker's pipe (pipe mode) | ||
| }; | ||
|
|
||
| // Result queue entry - small fixed size, data stored in separate per-result shared memory or pipe | ||
| struct ResultQueueEntry { | ||
| char job_id[64]; | ||
| uint32_t status; // 0 = success, 1 = error, 2 = cancelled | ||
| uint64_t data_size; // Size of result data (uint64 for large results) | ||
| char shm_data_name[128]; // Name of per-result shared memory segment (shm mode only) | ||
| char error_message[1024]; | ||
| std::atomic<bool> ready; // Result is ready | ||
| std::atomic<bool> retrieved; // Result has been retrieved | ||
| std::atomic<int> worker_index; // Index of worker that produced this result (pipe mode) | ||
| }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Critical: unsafe strncpy usage can produce non-NUL-terminated strings
Several strncpy(..., n-1) calls never write the trailing '\0', so later std::string(job_queue[i].job_id) / strcmp(...) may read past bounds (undefined behavior). This affects job IDs, shm names, and error messages.
Suggested pattern
- strncpy(result_queue[i].job_id, job_id.c_str(), sizeof(result_queue[i].job_id) - 1);
+ std::snprintf(result_queue[i].job_id, sizeof(result_queue[i].job_id), "%s", job_id.c_str());(Or set buf[sizeof(buf)-1] = '\0' immediately after each strncpy.)
Also applies to: 548-562, 720-738, 999-1011, 1459-1476
🤖 Prompt for AI Agents
In @cpp/cuopt_remote_server.cpp around lines 67 - 91, The problem: several
strncpy(...) uses that populate fixed-size fields (JobQueueEntry::job_id,
JobQueueEntry::shm_data_name, ResultQueueEntry::job_id,
ResultQueueEntry::shm_data_name, ResultQueueEntry::error_message and similar
spots) use length-1 without explicitly writing a trailing NUL, allowing
non-NUL-terminated strings and UB when constructing std::string or calling
strcmp. Fix: after every strncpy into these fixed buffers (job_id,
shm_data_name, error_message, etc.) explicitly set buffer[sizeof(buffer)-1] =
'\0' (or switch to a safe copy like strncpy followed by that assignment, or use
strlcpy if available) so the fields are always NUL-terminated; update all
occurrences noted in the review (around lines 67-91, 548-562, 720-738, 999-1011,
1459-1476) and any other strncpy uses that write into fixed char arrays.
| // Helper to send a framed message with type | ||
| static bool send_typed_message(int sockfd, MessageType type, const void* data, size_t size) | ||
| { | ||
| uint8_t msg_type = static_cast<uint8_t>(type); | ||
| uint64_t payload_size = static_cast<uint64_t>(size); | ||
|
|
||
| if (::write(sockfd, &msg_type, 1) != 1) return false; | ||
| if (::write(sockfd, &payload_size, sizeof(payload_size)) != sizeof(payload_size)) return false; | ||
| if (size > 0) { | ||
| const uint8_t* ptr = static_cast<const uint8_t*>(data); | ||
| size_t remaining = size; | ||
| while (remaining > 0) { | ||
| ssize_t written = ::write(sockfd, ptr, remaining); | ||
| if (written <= 0) return false; | ||
| ptr += written; | ||
| remaining -= written; | ||
| } | ||
| } | ||
| return true; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Critical: socket framing writes are not robust (partial writes / EINTR)
send_typed_message() assumes ::write() writes the full 1-byte type and 8-byte size (Lines 114-116). On sockets this can be partial; also EINTR isn’t handled here or in write_all/read_all. This can corrupt the stream protocol.
Proposed fix (use write_all for headers too)
static bool send_typed_message(int sockfd, MessageType type, const void* data, size_t size)
{
uint8_t msg_type = static_cast<uint8_t>(type);
uint64_t payload_size = static_cast<uint64_t>(size);
- if (::write(sockfd, &msg_type, 1) != 1) return false;
- if (::write(sockfd, &payload_size, sizeof(payload_size)) != sizeof(payload_size)) return false;
+ if (!write_all(sockfd, &msg_type, 1)) return false;
+ if (!write_all(sockfd, &payload_size, sizeof(payload_size))) return false;
...
}Also applies to: 315-339, 185-197
🤖 Prompt for AI Agents
In @cpp/cuopt_remote_server.cpp around lines 108 - 127, send_typed_message
currently assumes ::write writes the entire type byte and 8-byte size and
doesn't handle partial writes or EINTR; update it to use the existing write_all
helper (and ensure write_all itself handles EINTR/short writes) for the header
fields (msg_type and payload_size) and for the payload loop so all writes are
retried until complete; also audit and replace raw ::write usages in the other
framing helpers (the code around the other helper functions that perform framing
and the read_all/write_all implementations) to use the robust write_all/read_all
semantics so headers, sizes and payloads cannot be partially written or
interrupted.
| { | ||
| std::lock_guard<std::mutex> lock(waiters_mutex); | ||
| auto wit = waiting_threads.find(job_id); | ||
| if (wit != waiting_threads.end()) { | ||
| // Wake up the waiting thread | ||
| auto waiter = wit->second; | ||
| waiter->result_data = std::move(result_data); | ||
| waiter->error_message = error_message; | ||
| waiter->success = success; | ||
| waiter->ready = true; | ||
| waiter->cv.notify_one(); | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Critical: data races when completing/cancelling a blocking waiter
result_retrieval_thread() and cancel_job() mutate JobWaiter fields without locking JobWaiter::mutex (Lines 1195-1201, 1758-1763). The waiting thread reads waiter->ready under that mutex (Line 1620), so current code is a data race (UB) and can miss wakeups or observe torn state.
Proposed fix
auto waiter = wit->second;
- waiter->result_data = std::move(result_data);
- waiter->error_message = error_message;
- waiter->success = success;
- waiter->ready = true;
- waiter->cv.notify_one();
+ {
+ std::lock_guard<std::mutex> lk(waiter->mutex);
+ waiter->result_data = std::move(result_data);
+ waiter->error_message = error_message;
+ waiter->success = success;
+ waiter->ready = true;
+ }
+ waiter->cv.notify_one();Also applies to: 1605-1627, 1754-1764
🤖 Prompt for AI Agents
In @cpp/cuopt_remote_server.cpp around lines 1190 - 1202, The code mutates
JobWaiter fields from result_retrieval_thread() and cancel_job() without
acquiring the JobWaiter::mutex, causing data races against the waiting thread
that reads waiter->ready under that mutex; to fix, when you look up the waiter
in waiting_threads (using waiters_mutex) then acquire the specific waiter's
mutex (e.g., std::lock_guard<std::mutex> waiter_lock(waiter->mutex)) before
writing waiter->result_data, waiter->error_message, waiter->success, and
waiter->ready, and call waiter->cv.notify_one() while still holding that waiter
mutex (or use notify_one immediately after releasing but ensure memory
visibility), doing the same symmetric locking in cancel_job(); ensure the
waiting thread still checks the ready predicate under JobWaiter::mutex so no
race or missed wakeup occurs.
| // Note: variable_types in optimization_problem_t uses var_t enum, not char | ||
| // The view's variable_types span will point to GPU memory with var_t values | ||
| // This is handled specially in solve routines |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
rg -n -B 3 -A 10 'variable_types' cpp/src/linear_programming/solve.cu | head -100Repository: NVIDIA/cuopt
Length of output: 4952
🏁 Script executed:
# Also check the optimization_problem_t definition to see the actual type
rg -n -B 2 -A 15 'struct optimization_problem_t|class optimization_problem_t' --type cpp --type cu | head -80Repository: NVIDIA/cuopt
Length of output: 82
🏁 Script executed:
# Check the view creation methods in the file mentioned
rg -n -B 2 -A 8 'create_view_from_gpu_problem\|create_view_from_cpu_data' cpp/src/linear_programming/cuopt_c.cppRepository: NVIDIA/cuopt
Length of output: 38
🏁 Script executed:
cat -n cpp/src/linear_programming/cuopt_c.cpp | sed -n '149,253p'Repository: NVIDIA/cuopt
Length of output: 5121
🏁 Script executed:
# Search for problem_and_stream_view_t definition and its create_view_from_gpu_problem method
rg -n -A 30 'create_view_from_gpu_problem' cpp/src/linear_programming/cuopt_c.cppRepository: NVIDIA/cuopt
Length of output: 9238
Add missing set_variable_types() call to GPU view creation or handle unset variable_types in solve routines.
The create_view_from_gpu_problem() method (lines 200-242) does not set variable_types on the view, even though create_view_from_cpu_data() does via cpu_data->create_view(). The comment claims this is "handled specially in solve routines," but the view's variable_types span remains uninitialized for GPU data. Solve routines in solve.cu expect variable_types to be set on the view (lines 1177-1202) and will receive empty/uninitialized data for GPU problems. Either call view.set_variable_types(gpu.get_variable_types().data(), gpu.get_n_variables()) in create_view_from_gpu_problem(), or ensure solve routines gracefully handle missing variable_types on the GPU path.
🤖 Prompt for AI Agents
In @cpp/src/linear_programming/cuopt_c.cpp around lines 228 - 230, The GPU view
creation in create_view_from_gpu_problem currently leaves view.variable_types
unset (unlike create_view_from_cpu_data which calls create_view), causing solve
routines in solve.cu (which expect variable_types to be present) to read
uninitialized data; fix by calling
view.set_variable_types(gpu.get_variable_types().data(), gpu.get_n_variables())
during create_view_from_gpu_problem so the view's variable_types span is
initialized, or alternatively modify the solve routines to check for a
missing/empty variable_types span and handle the GPU path explicitly.
| // Helper to create a data_model_view_t from mps_data_model_t (for remote solve path) | ||
| template <typename i_t, typename f_t> | ||
| static data_model_view_t<i_t, f_t> create_view_from_mps_data_model( | ||
| const cuopt::mps_parser::mps_data_model_t<i_t, f_t>& mps_data_model) | ||
| { | ||
| data_model_view_t<i_t, f_t> view; | ||
|
|
||
| view.set_maximize(mps_data_model.get_sense()); | ||
|
|
||
| if (!mps_data_model.get_constraint_matrix_values().empty()) { | ||
| view.set_csr_constraint_matrix(mps_data_model.get_constraint_matrix_values().data(), | ||
| mps_data_model.get_constraint_matrix_values().size(), | ||
| mps_data_model.get_constraint_matrix_indices().data(), | ||
| mps_data_model.get_constraint_matrix_indices().size(), | ||
| mps_data_model.get_constraint_matrix_offsets().data(), | ||
| mps_data_model.get_constraint_matrix_offsets().size()); | ||
| } | ||
|
|
||
| if (!mps_data_model.get_constraint_bounds().empty()) { | ||
| view.set_constraint_bounds(mps_data_model.get_constraint_bounds().data(), | ||
| mps_data_model.get_constraint_bounds().size()); | ||
| } | ||
|
|
||
| if (!mps_data_model.get_objective_coefficients().empty()) { | ||
| view.set_objective_coefficients(mps_data_model.get_objective_coefficients().data(), | ||
| mps_data_model.get_objective_coefficients().size()); | ||
| } | ||
|
|
||
| view.set_objective_scaling_factor(mps_data_model.get_objective_scaling_factor()); | ||
| view.set_objective_offset(mps_data_model.get_objective_offset()); | ||
|
|
||
| if (!mps_data_model.get_variable_lower_bounds().empty()) { | ||
| view.set_variable_lower_bounds(mps_data_model.get_variable_lower_bounds().data(), | ||
| mps_data_model.get_variable_lower_bounds().size()); | ||
| } | ||
|
|
||
| if (!mps_data_model.get_variable_upper_bounds().empty()) { | ||
| view.set_variable_upper_bounds(mps_data_model.get_variable_upper_bounds().data(), | ||
| mps_data_model.get_variable_upper_bounds().size()); | ||
| } | ||
|
|
||
| if (!mps_data_model.get_variable_types().empty()) { | ||
| view.set_variable_types(mps_data_model.get_variable_types().data(), | ||
| mps_data_model.get_variable_types().size()); | ||
| } | ||
|
|
||
| if (!mps_data_model.get_row_types().empty()) { | ||
| view.set_row_types(mps_data_model.get_row_types().data(), | ||
| mps_data_model.get_row_types().size()); | ||
| } | ||
|
|
||
| if (!mps_data_model.get_constraint_lower_bounds().empty()) { | ||
| view.set_constraint_lower_bounds(mps_data_model.get_constraint_lower_bounds().data(), | ||
| mps_data_model.get_constraint_lower_bounds().size()); | ||
| } | ||
|
|
||
| if (!mps_data_model.get_constraint_upper_bounds().empty()) { | ||
| view.set_constraint_upper_bounds(mps_data_model.get_constraint_upper_bounds().data(), | ||
| mps_data_model.get_constraint_upper_bounds().size()); | ||
| } | ||
|
|
||
| view.set_objective_name(mps_data_model.get_objective_name()); | ||
| view.set_problem_name(mps_data_model.get_problem_name()); | ||
|
|
||
| if (!mps_data_model.get_variable_names().empty()) { | ||
| view.set_variable_names(mps_data_model.get_variable_names()); | ||
| } | ||
|
|
||
| if (!mps_data_model.get_row_names().empty()) { | ||
| view.set_row_names(mps_data_model.get_row_names()); | ||
| } | ||
|
|
||
| if (!mps_data_model.get_initial_primal_solution().empty()) { | ||
| view.set_initial_primal_solution(mps_data_model.get_initial_primal_solution().data(), | ||
| mps_data_model.get_initial_primal_solution().size()); | ||
| } | ||
|
|
||
| if (!mps_data_model.get_initial_dual_solution().empty()) { | ||
| view.set_initial_dual_solution(mps_data_model.get_initial_dual_solution().data(), | ||
| mps_data_model.get_initial_dual_solution().size()); | ||
| } | ||
|
|
||
| return view; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Critical: quadratic objective is dropped when solving from mps_data_model_t via view
solve_lp(handle, mps_data_model, ...) now builds a view (Line 1140) using create_view_from_mps_data_model() (Line 1047), but that helper never sets the quadratic objective matrix. This will silently change behavior for QP inputs (local and remote), since downstream view.has_quadratic_objective() will be false.
Proposed fix
static data_model_view_t<i_t, f_t> create_view_from_mps_data_model(
const cuopt::mps_parser::mps_data_model_t<i_t, f_t>& mps_data_model)
{
data_model_view_t<i_t, f_t> view;
view.set_maximize(mps_data_model.get_sense());
+ view.set_is_device_memory(false); // MPS data is host-backed
...
+ if (mps_data_model.has_quadratic_objective()) {
+ view.set_quadratic_objective_matrix(mps_data_model.get_quadratic_objective_values().data(),
+ mps_data_model.get_quadratic_objective_values().size(),
+ mps_data_model.get_quadratic_objective_indices().data(),
+ mps_data_model.get_quadratic_objective_indices().size(),
+ mps_data_model.get_quadratic_objective_offsets().data(),
+ mps_data_model.get_quadratic_objective_offsets().size());
+ }
return view;
}Also applies to: 1138-1143
| template <typename i_t, typename f_t> | ||
| mip_solution_t<i_t, f_t> solve_mip(raft::handle_t const* handle_ptr, | ||
| const data_model_view_t<i_t, f_t>& view, | ||
| mip_solver_settings_t<i_t, f_t> const& settings) | ||
| { | ||
| // Initialize logger for this overload (needed for early returns) | ||
| init_logger_t log(settings.log_file, settings.log_to_console); | ||
|
|
||
| // Check for remote solve configuration first | ||
| auto remote_config = get_remote_solve_config(); | ||
|
|
||
| if (view.is_device_memory()) { | ||
| if (remote_config.has_value()) { | ||
| // GPU data + remote solve requested: need valid handle to copy GPU→CPU | ||
| if (handle_ptr == nullptr) { | ||
| CUOPT_LOG_ERROR( | ||
| "[solve_mip] Remote solve requested with GPU data but no CUDA handle. " | ||
| "This is an internal error - GPU data should not exist without CUDA initialization."); | ||
| return mip_solution_t<i_t, f_t>( | ||
| cuopt::logic_error("No CUDA handle for GPU data", cuopt::error_type_t::RuntimeError)); | ||
| } | ||
| CUOPT_LOG_WARN( | ||
| "[solve_mip] Remote solve requested but data is on GPU. " | ||
| "Copying to CPU for serialization (performance impact)."); | ||
| auto cpu_data = copy_view_to_cpu(handle_ptr, view); | ||
| auto cpu_view = cpu_data.create_view(); | ||
|
|
||
| CUOPT_LOG_INFO( | ||
| "[solve_mip] Remote solve detected: CUOPT_REMOTE_HOST=%s, CUOPT_REMOTE_PORT=%d", | ||
| remote_config->host.c_str(), | ||
| remote_config->port); | ||
| // Remote solve with GPU data - serialize cpu_view and send to remote server | ||
| return solve_mip_remote(*remote_config, cpu_view, settings); | ||
| } | ||
|
|
||
| // Local solve: data already on GPU - convert view to optimization_problem_t and solve | ||
| auto op_problem = data_model_view_to_optimization_problem(handle_ptr, view); | ||
| return solve_mip(op_problem, settings); | ||
| } | ||
|
|
||
| // Data is on CPU | ||
| if (remote_config.has_value()) { | ||
| CUOPT_LOG_INFO("[solve_mip] Remote solve detected: CUOPT_REMOTE_HOST=%s, CUOPT_REMOTE_PORT=%d", | ||
| remote_config->host.c_str(), | ||
| remote_config->port); | ||
| // Remote solve with CPU data - serialize view and send to remote server | ||
| return solve_mip_remote(*remote_config, view, settings); | ||
| } | ||
|
|
||
| // Local solve with CPU data: copy to GPU and solve | ||
| auto op_problem = data_model_view_to_optimization_problem(handle_ptr, view); | ||
| return solve_mip(op_problem, settings); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Critical: local CPU-view path can dereference a null CUDA handle
Same pattern as LP: when remote_config is not set and view.is_device_memory()==false, the code converts CPU view → GPU problem without validating handle_ptr (Lines 561-563).
🤖 Prompt for AI Agents
In @cpp/src/mip/solve.cu around lines 512 - 564, The CPU-local path in solve_mip
can dereference a null CUDA handle: before calling
data_model_view_to_optimization_problem(handle_ptr, view) when
view.is_device_memory() is false and remote_config is not set, validate
handle_ptr; if it's null, log an error (CUOPT_LOG_ERROR) and return a
mip_solution_t<i_t,f_t> with cuopt::logic_error("No CUDA handle for CPU->GPU
copy", cuopt::error_type_t::RuntimeError) to mirror the GPU->CPU remote-check
branch; otherwise proceed to call data_model_view_to_optimization_problem and
return solve_mip(op_problem, settings).
| sys.path.insert(0, "/home/tmckay/repos/nvidia-cuopt/cpp/build") | ||
| import cuopt_remote_pb2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hard-coded developer path must be removed.
Line 22 contains a hard-coded absolute path to a specific developer's home directory. This will break for any other developer or CI environment.
🐛 Proposed fix: Use relative path or environment variable
-sys.path.insert(0, "/home/tmckay/repos/nvidia-cuopt/cpp/build")
+# Add build directory for development - use relative path from script location
+import pathlib
+build_path = pathlib.Path(__file__).parent.parent / "cpp" / "build"
+if build_path.exists():
+ sys.path.insert(0, str(build_path))📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| sys.path.insert(0, "/home/tmckay/repos/nvidia-cuopt/cpp/build") | |
| import cuopt_remote_pb2 | |
| # Add build directory for development - use relative path from script location | |
| import pathlib | |
| build_path = pathlib.Path(__file__).parent.parent / "cpp" / "build" | |
| if build_path.exists(): | |
| sys.path.insert(0, str(build_path)) | |
| import cuopt_remote_pb2 |
| def send_recv(sock, data): | ||
| """Send request and receive response with uint64 size prefix.""" | ||
| # Send size (uint64) + data | ||
| sock.sendall(struct.pack("<Q", len(data)) + data) | ||
|
|
||
| # Receive size (uint64) | ||
| size_data = b"" | ||
| while len(size_data) < 8: | ||
| chunk = sock.recv(8 - len(size_data)) | ||
| if not chunk: | ||
| raise ConnectionError("Connection closed") | ||
| size_data += chunk | ||
|
|
||
| size = struct.unpack("<Q", size_data)[0] | ||
|
|
||
| # Receive response | ||
| response = b"" | ||
| while len(response) < size: | ||
| chunk = sock.recv(min(4096, size - len(response))) | ||
| if not chunk: | ||
| raise ConnectionError("Connection closed") | ||
| response += chunk | ||
|
|
||
| return response |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# First, find and examine the Python files for struct.pack/unpack patterns
echo "=== Python struct.pack/unpack patterns ==="
rg -n "struct\.(pack|unpack)" --type=py -B2 -A2 | head -100
echo ""
echo "=== Check remote.py specifically ==="
find . -path "*/python/cuopt/cuopt/linear_programming/remote.py" -o -path "*/remote.py" | head -5
echo ""
echo "=== Look for dev_scripts/test_pipe_cancel_restart.py ==="
find . -name "test_pipe_cancel_restart.py"Repository: NVIDIA/cuopt
Length of output: 1822
🏁 Script executed:
#!/bin/bash
# Check C++ files for size prefix usage
echo "=== C++ uint64_t/uint32_t size usage ==="
rg -n "uint64_t.*size|uint32_t.*size" --type=cpp --type=cu -B2 -A2 | head -100
echo ""
echo "=== Find remote_solve.cu ==="
find . -name "remote_solve.cu"Repository: NVIDIA/cuopt
Length of output: 225
🏁 Script executed:
#!/bin/bash
# Examine remote_solve.cu around lines 89-98
echo "=== remote_solve.cu lines 80-110 ==="
sed -n '80,110p' ./cpp/src/linear_programming/utilities/remote_solve.cu
echo ""
echo "=== Search for size/socket patterns in remote_solve.cu ==="
grep -n "uint64_t\|uint32_t\|recv\|send" ./cpp/src/linear_programming/utilities/remote_solve.cu | head -20Repository: NVIDIA/cuopt
Length of output: 2164
Fix protocol mismatch: python/cuopt/cuopt/linear_programming/remote.py uses <I (uint32) but should use <Q (uint64).
The C++ server in remote_solve.cu (line 95) sends size as uint64_t. The test script test_pipe_cancel_restart.py correctly uses struct.pack("<Q", ...), but remote.py incorrectly uses struct.pack("<I", ...) at line 149 and struct.unpack("<I", ...) at line 162. This protocol mismatch breaks remote communication—the server sends 8 bytes for size, but the client reads only 4 bytes, corrupting all subsequent messages.
🧰 Tools
🪛 Ruff (0.14.11)
39-39: Avoid specifying long messages outside the exception class
(TRY003)
49-49: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In @dev_scripts/test_pipe_cancel_restart.py around lines 29 - 52, The client
protocol in python/cuopt/cuopt/linear_programming/remote.py incorrectly uses
32-bit size prefixes ("<I"), causing mismatch with the server's uint64_t sizes;
update the send and receive size packing/unpacking in functions that call
struct.pack("<I", ...) and struct.unpack("<I", ...) to use "<Q" instead, and
ensure any buffer reads expect 8 bytes for the size prefix (e.g., change reads
that assume 4 bytes to 8 bytes) so the client reads the full uint64 size and
then reads the subsequent payload accordingly.
| struct JobQueueEntry { | ||
| char job_id[32]; // Unique job identifier | ||
| uint32_t problem_type; // 0=LP, 1=MIP | ||
| uint32_t data_size; // Size of serialized problem | ||
| uint8_t data[MAX_JOB_DATA]; // Serialized problem data (~1MB) | ||
| pid_t worker_pid; // PID of worker processing this job | ||
| std::atomic<bool> ready; // Job ready for processing | ||
| std::atomic<bool> claimed; // Job claimed by a worker | ||
| std::atomic<bool> cancelled; // Job cancelled by user | ||
| }; | ||
| ``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify struct definitions in the actual implementation
echo "=== Searching for JobQueueEntry definition ==="
rg -n "struct JobQueueEntry" --type cpp -A 15
echo ""
echo "=== Searching for ResultQueueEntry definition ==="
rg -n "struct ResultQueueEntry" --type cpp -A 15Repository: NVIDIA/cuopt
Length of output: 2697
Update struct definitions to match actual implementation in cpp/cuopt_remote_server.cpp.
The documentation does not match the actual code. Key discrepancies:
JobQueueEntry::job_idischar[64], notchar[32]data_sizeisuint64_t, notuint32_t- No inline
uint8_t data[MAX_JOB_DATA]field; replaced withchar shm_data_name[128]for shared memory segment names worker_pidisstd::atomic<pid_t>, notpid_t- Missing fields:
std::atomic<int> worker_indexandstd::atomic<bool> data_sent(pipe mode support)
ResultQueueEntry similarly needs updates: add shm_data_name[128], update data_size to uint64_t, add worker_index, and clarify error message field. The architecture has evolved to support both pipe and shared memory modes; update the explanatory text to reflect this dual-mode design.
🤖 Prompt for AI Agents
In @docs/remote_solve_architecture.md around lines 304 - 314, The documented
struct definitions for JobQueueEntry and ResultQueueEntry are out of sync with
the real implementation; update the documentation to match
cpp/cuopt_remote_server.cpp by changing JobQueueEntry::job_id to char[64],
data_size to uint64_t, remove inline data[] and add char shm_data_name[128] for
shared-memory names, change worker_pid to std::atomic<pid_t>, and add
std::atomic<int> worker_index and std::atomic<bool> data_sent; similarly update
ResultQueueEntry to include shm_data_name[128], use uint64_t data_size, add
worker_index and clarify the error message field/type; also update surrounding
explanatory text to describe dual-mode transport (pipe and shared memory) and
how data is referenced via shm_data_name instead of inline buffers.
| # Send request (length-prefixed) | ||
| data = request.SerializeToString() | ||
| sock.sendall(struct.pack("<I", len(data))) | ||
| sock.sendall(data) | ||
|
|
||
| # Receive response | ||
| size_data = sock.recv(4) | ||
| if len(size_data) < 4: | ||
| sock.close() | ||
| return CancelResult( | ||
| success=False, | ||
| message="Failed to receive response size", | ||
| job_status=JobStatus.NOT_FOUND, | ||
| ) | ||
|
|
||
| size = struct.unpack("<I", size_data)[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Size prefix format inconsistency with other implementations.
This code uses <I (uint32, 4 bytes) for the size prefix (lines 149, 153, 162), but:
dev_scripts/test_pipe_cancel_restart.pyuses<Q(uint64, 8 bytes)- C++
remote_solve.cuusesuint64_t(8 bytes)
This will cause protocol errors when communicating with the server.
🐛 Proposed fix: Use uint64 size prefix to match server protocol
# Send request (length-prefixed)
data = request.SerializeToString()
- sock.sendall(struct.pack("<I", len(data)))
+ sock.sendall(struct.pack("<Q", len(data)))
sock.sendall(data)
# Receive response
- size_data = sock.recv(4)
- if len(size_data) < 4:
+ size_data = sock.recv(8)
+ if len(size_data) < 8:
sock.close()
return CancelResult(
success=False,
message="Failed to receive response size",
job_status=JobStatus.NOT_FOUND,
)
- size = struct.unpack("<I", size_data)[0]
+ size = struct.unpack("<Q", size_data)[0]📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Send request (length-prefixed) | |
| data = request.SerializeToString() | |
| sock.sendall(struct.pack("<I", len(data))) | |
| sock.sendall(data) | |
| # Receive response | |
| size_data = sock.recv(4) | |
| if len(size_data) < 4: | |
| sock.close() | |
| return CancelResult( | |
| success=False, | |
| message="Failed to receive response size", | |
| job_status=JobStatus.NOT_FOUND, | |
| ) | |
| size = struct.unpack("<I", size_data)[0] | |
| # Send request (length-prefixed) | |
| data = request.SerializeToString() | |
| sock.sendall(struct.pack("<Q", len(data))) | |
| sock.sendall(data) | |
| # Receive response | |
| size_data = sock.recv(8) | |
| if len(size_data) < 8: | |
| sock.close() | |
| return CancelResult( | |
| success=False, | |
| message="Failed to receive response size", | |
| job_status=JobStatus.NOT_FOUND, | |
| ) | |
| size = struct.unpack("<Q", size_data)[0] |
|
/ok to test afd25ed |
- Add libprotobuf to conda recipe build and host requirements - Add libprotobuf to dependencies.yaml build_cpp section - Regenerate conda environment files with libprotobuf Fixes CI build failure where protobuf library was not found during CMake configuration for remote solve server implementation.
|
/ok to test 28ef558 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In @conda/environments/all_cuda-130_arch-x86_64.yaml.orig:
- Around line 1-82: The file all_cuda-130_arch-x86_64.yaml.orig should be
deleted from the commit because it represents an unsupported CUDA 13.0
environment and is missing the libprotobuf dependency; remove the file entirely
from the branch (git rm --cached/--force as appropriate) and commit that
deletion, then verify no other generated env files reference CUDA 13.0 or lack
libprotobuf so the supported CUDA 12.9 and 13.1 variants remain intact.
In @conda/recipes/libcuopt/recipe.yaml:
- Line 65: The run requirements for the libcuopt package are missing
libprotobuf, causing runtime link failures because cpp/CMakeLists.txt links
libcuopt.so against protobuf::libprotobuf; update the recipe.yaml run section
for the libcuopt output (the run list around the current entries pinned at the
run block) to include libprotobuf alongside libboost-devel and librmm (i.e.,
insert a run entry "libprotobuf" in the run dependencies list so consumers of
libcuopt get protobuf at runtime).
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
conda/environments/all_cuda-129_arch-aarch64.yamlconda/environments/all_cuda-129_arch-x86_64.yamlconda/environments/all_cuda-130_arch-x86_64.yaml.origconda/environments/all_cuda-131_arch-aarch64.yamlconda/environments/all_cuda-131_arch-x86_64.yamlconda/recipes/libcuopt/recipe.yamldependencies.yaml
🔇 Additional comments (5)
conda/environments/all_cuda-131_arch-aarch64.yaml (1)
35-35: LGTM - libprotobuf dependency correctly added.The addition aligns with the
dependencies.yamlsource file change and supports the new protobuf-based remote solve functionality.conda/environments/all_cuda-129_arch-x86_64.yaml (1)
35-35: LGTM - Consistent with other environment files.conda/environments/all_cuda-131_arch-x86_64.yaml (1)
35-35: LGTM - Consistent with other environment files.dependencies.yaml (1)
296-305: LGTM - libprotobuf correctly added to build_cpp dependencies.The placement under
build_cpp.commonwithoutput_types: condais appropriate for the C++ build dependency needed for protobuf-based remote serialization. This follows the existing pattern of other build dependencies in this section.conda/environments/all_cuda-129_arch-aarch64.yaml (1)
35-35: LGTM!The
libprotobufdependency addition is consistent with the PR's protobuf-based remote solve serialization requirements. The alphabetical placement is correct.
| # This file is generated by `rapids-dependency-file-generator`. | ||
| # To make changes, edit ../../dependencies.yaml and run `rapids-dependency-file-generator`. | ||
| channels: | ||
| - rapidsai-nightly | ||
| - rapidsai | ||
| - conda-forge | ||
| dependencies: | ||
| - boost | ||
| - breathe | ||
| - bzip2 | ||
| - c-compiler | ||
| - ccache | ||
| - clang-tools=20.1.4 | ||
| - clang==20.1.4 | ||
| - cmake>=3.30.4 | ||
| - cpp-argparse | ||
| - cuda-nvcc | ||
| - cuda-nvtx-dev | ||
| - cuda-python>=13.0.1,<14.0a0 | ||
| - cuda-sanitizer-api | ||
| - cuda-version=13.0 | ||
| - cudf==26.2.*,>=0.0.0a0 | ||
| - cupy>=13.6.0 | ||
| - cxx-compiler | ||
| - cython>=3.0.3 | ||
| - doxygen=1.9.1 | ||
| - exhale | ||
| - fastapi | ||
| - gcc_linux-64=14.* | ||
| - gmock | ||
| - gtest | ||
| - ipython | ||
| - jsonref==1.1.0 | ||
| - libcudss-dev >=0.7 | ||
| - libcurand-dev | ||
| - libcusolver-dev | ||
| - libcusparse-dev | ||
| - libraft-headers==26.2.*,>=0.0.0a0 | ||
| - librmm==26.2.*,>=0.0.0a0 | ||
| - make | ||
| - msgpack-numpy==0.4.8 | ||
| - msgpack-python==1.1.0 | ||
| - myst-nb | ||
| - myst-parser | ||
| - ninja | ||
| - notebook | ||
| - numba-cuda>=0.19.1,<0.20.0a0 | ||
| - numba>=0.60.0 | ||
| - numpy>=1.23.5,<3.0a0 | ||
| - numpydoc | ||
| - pandas>=2.0 | ||
| - pexpect | ||
| - pip | ||
| - pre-commit | ||
| - psutil>=6.0.0 | ||
| - pylibraft==26.2.*,>=0.0.0a0 | ||
| - pyrsistent | ||
| - pytest-cov | ||
| - pytest<8 | ||
| - python>=3.10,<3.14 | ||
| - rapids-build-backend>=0.4.0,<0.5.0.dev0 | ||
| - rapids-dask-dependency==26.2.*,>=0.0.0a0 | ||
| - rapids-logger==0.2.*,>=0.0.0a0 | ||
| - requests | ||
| - rmm==26.2.*,>=0.0.0a0 | ||
| - scikit-build-core>=0.10.0 | ||
| - sphinx | ||
| - sphinx-copybutton | ||
| - sphinx-design | ||
| - sphinx-markdown-tables | ||
| - sphinx_rtd_theme | ||
| - sphinxcontrib-openapi | ||
| - sphinxcontrib-websupport | ||
| - sysroot_linux-64==2.28 | ||
| - tbb-devel | ||
| - uvicorn==0.34.* | ||
| - zlib | ||
| - pip: | ||
| - nvidia_sphinx_theme | ||
| - swagger-plugin-for-sphinx | ||
| - veroviz | ||
| name: all_cuda-130_arch-x86_64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check if the non-.orig version exists
echo "=== Checking for existing all_cuda-130_arch-x86_64.yaml ==="
fd -t f "all_cuda-130_arch-x86_64.yaml" conda/environments/
echo ""
echo "=== Checking libprotobuf presence in all environment files ==="
for f in conda/environments/all_cuda-*.yaml; do
if [ -f "$f" ]; then
echo -n "$f: "
grep -q "libprotobuf" "$f" && echo "HAS libprotobuf" || echo "MISSING libprotobuf"
fi
done
echo ""
echo "=== Listing any .orig files that may be artifacts ==="
fd -e orig . conda/environments/
echo ""
echo "=== Checking git status and history for .orig files ==="
cd conda/environments && git status --short | grep orig || echo "No .orig files in git status"Repository: NVIDIA/cuopt
Length of output: 694
🏁 Script executed:
#!/bin/bash
# Check if .orig file is tracked in git
echo "=== Checking if .orig file is tracked in git ==="
git ls-files | grep "all_cuda-130_arch-x86_64.yaml.orig"
echo ""
echo "=== Checking git log for this file ==="
git log --oneline -- conda/environments/all_cuda-130_arch-x86_64.yaml.orig | head -5
echo ""
echo "=== List all all_cuda*.yaml* files to see naming pattern ==="
ls -la conda/environments/all_cuda*.yaml*Repository: NVIDIA/cuopt
Length of output: 804
🏁 Script executed:
#!/bin/bash
echo "=== Check all environment files in the directory ==="
ls -1 conda/environments/all_cuda*.yaml* | sort
echo ""
echo "=== Check what's in the git commit that added libprotobuf ==="
git show --stat 28ef558 | head -50
echo ""
echo "=== Check the full diff for the .orig file in that commit ==="
git show 28ef558 -- conda/environments/all_cuda-130_arch-x86_64.yaml.orig | head -100Repository: NVIDIA/cuopt
Length of output: 5691
🏁 Script executed:
#!/bin/bash
echo "=== Check if dependencies.yaml mentions CUDA 130 ==="
grep -A 10 "130" dependencies.yaml | head -30
echo ""
echo "=== Count how many env variants are defined for each CUDA version in dependencies.yaml ==="
grep -E "cuda-130|cuda-129|cuda-131" dependencies.yaml | wc -lRepository: NVIDIA/cuopt
Length of output: 202
🏁 Script executed:
#!/bin/bash
echo "=== Search for CUDA version patterns in dependencies.yaml ==="
grep -i "cuda" dependencies.yaml | head -20
echo ""
echo "=== Check the structure of dependencies.yaml ==="
head -100 dependencies.yaml
echo ""
echo "=== Look for arch and cuda variants ==="
grep -E "arch|cuda" dependencies.yaml | grep -i "x86_64\|aarch64\|130\|129\|131" | head -20Repository: NVIDIA/cuopt
Length of output: 3306
🏁 Script executed:
#!/bin/bash
echo "=== Check git history for the .orig file ==="
git log --all --oneline -- conda/environments/all_cuda-130_arch-x86_64.yaml.orig
echo ""
echo "=== Check if there are any references to CUDA 130 in the codebase ==="
rg "130" conda/ --type yaml | head -20
echo ""
echo "=== Verify that CUDA 129 and 131 both have both arch variants ==="
echo "CUDA 129:"
ls -1 conda/environments/all_cuda-129* | sort
echo ""
echo "CUDA 131:"
ls -1 conda/environments/all_cuda-131* | sortRepository: NVIDIA/cuopt
Length of output: 504
Delete the all_cuda-130_arch-x86_64.yaml.orig file—it should not be committed.
This file represents an unsupported CUDA version. The dependencies.yaml only defines CUDA versions 12.9 and 13.1, which generate environment files for CUDA 129 and 131. CUDA 130 is not in the supported matrix and has no corresponding aarch64 variant (unlike the 129 and 131 versions which both have x86_64 and aarch64 pairs).
Additionally, this file is missing the libprotobuf dependency that was added across all other environment files in this PR. It appears to be a leftover from the file regeneration process and should be removed entirely rather than renamed.
🤖 Prompt for AI Agents
In @conda/environments/all_cuda-130_arch-x86_64.yaml.orig around lines 1 - 82,
The file all_cuda-130_arch-x86_64.yaml.orig should be deleted from the commit
because it represents an unsupported CUDA 13.0 environment and is missing the
libprotobuf dependency; remove the file entirely from the branch (git rm
--cached/--force as appropriate) and commit that deletion, then verify no other
generated env files reference CUDA 13.0 or lack libprotobuf so the supported
CUDA 12.9 and 13.1 variants remain intact.
| - tbb-devel | ||
| - zlib | ||
| - bzip2 | ||
| - libprotobuf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
# Find the recipe.yaml file and check its structure
find . -name "recipe.yaml" -path "*libcuopt*" 2>/dev/nullRepository: NVIDIA/cuopt
Length of output: 92
🏁 Script executed:
# Look for CMakeLists.txt files that might contain protobuf references
find . -name "CMakeLists.txt" 2>/dev/null | head -20Repository: NVIDIA/cuopt
Length of output: 799
🏁 Script executed:
# Search for any protobuf references in the repository
rg -l "protobuf" --type yaml --type cmake 2>/dev/null | head -20Repository: NVIDIA/cuopt
Length of output: 367
🏁 Script executed:
# Read the recipe.yaml file to see the structure and all requirements
cat -n ./conda/recipes/libcuopt/recipe.yamlRepository: NVIDIA/cuopt
Length of output: 8706
🏁 Script executed:
# Check cpp/CMakeLists.txt for protobuf linking details
rg -n "protobuf" ./cpp/CMakeLists.txt -A 3 -B 1Repository: NVIDIA/cuopt
Length of output: 380
🏁 Script executed:
# Also check the main source CMakeLists for protobuf
rg -n "protobuf" ./cpp/src/CMakeLists.txt -A 3 -B 1Repository: NVIDIA/cuopt
Length of output: 38
🏁 Script executed:
# Search for how protobuf is found/configured in CMakeLists
rg -n "find_package.*protobuf" ./cpp/CMakeLists.txt -i -A 5Repository: NVIDIA/cuopt
Length of output: 342
🏁 Script executed:
# Check if there's any mention of static protobuf linking
rg -n "protobuf.*STATIC\|Protobuf_USE_STATIC_LIBS\|Protobuf.*static" ./cpp/CMakeLists.txt -iRepository: NVIDIA/cuopt
Length of output: 38
🏁 Script executed:
# Look for the full context of the protobuf link in libcuopt
rg -n "target_link_libraries.*cuopt" ./cpp/CMakeLists.txt -B 10 -A 10 | head -60Repository: NVIDIA/cuopt
Length of output: 1934
libprotobuf missing from libcuopt run requirements — add at line 161.
While libprotobuf is correctly added to build and host requirements, it must also be in the run requirements for the libcuopt output package. The CMakeLists.txt dynamically links libcuopt.so against protobuf::libprotobuf (line 370 in cpp/CMakeLists.txt), which means consumers of libcuopt require protobuf at runtime. Without this dependency, binaries and libraries linking to libcuopt.so will fail with missing symbol errors.
Add libprotobuf to the run requirements at line 161:
run:
- ${{ pin_compatible("cuda-version", upper_bound="x", lower_bound="x") }}
- ${{ pin_subpackage("libmps-parser", exact=True) }}
- libboost-devel
- libprotobuf
- librmm =${{ minor_version }}
🤖 Prompt for AI Agents
In @conda/recipes/libcuopt/recipe.yaml at line 65, The run requirements for the
libcuopt package are missing libprotobuf, causing runtime link failures because
cpp/CMakeLists.txt links libcuopt.so against protobuf::libprotobuf; update the
recipe.yaml run section for the libcuopt output (the run list around the current
entries pinned at the run block) to include libprotobuf alongside libboost-devel
and librmm (i.e., insert a run entry "libprotobuf" in the run dependencies list
so consumers of libcuopt get protobuf at runtime).
- Add depends_on_protobuf section to dependencies.yaml - Include protobuf in py_rapids_build_libcuopt requirements - Regenerate python/libcuopt/pyproject.toml with protobuf Fixes Python wheel CI build failure where CMake could not find protobuf during the libcuopt wheel build. The previous commit added protobuf to conda builds, this commit adds it to pip wheel builds.
|
/ok to test 3b4e741 |
|
TODO: take another pass and replace client protocol with gRPC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@ci/utils/install_protobuf.sh`:
- Around line 9-25: Pin Protobuf to a compatible version across all places:
update dependencies.yaml to require a specific Protobuf range (e.g. "protobuf:
>=3.20,<4.0"), update CMakeLists.txt's find_package invocation to include the
minimum version (e.g. change find_package(Protobuf REQUIRED) to
find_package(Protobuf 3.20 REQUIRED)), and change this install script
(install_protobuf.sh) to install that pinned version via the package manager
(adjust the dnf/apt package names or add version qualifiers / use apt-get
install protobuf-compiler=3.20.* or the equivalent distro-specific
package/version) and also ensure conda environment specs are updated to the same
version range; make sure error branches still instruct manual install if the
exact version is unavailable.
🧹 Nitpick comments (2)
ci/utils/install_protobuf.sh (2)
14-17: Consider cleaning up apt cache to reduce image size.In CI/container environments, cleaning up the apt cache after installation reduces the final image size.
Suggested improvement
elif [[ "$ID" == "ubuntu" ]]; then echo "Detected Ubuntu. Installing Protobuf via apt..." apt-get update - apt-get install -y libprotobuf-dev protobuf-compiler + apt-get install -y --no-install-recommends libprotobuf-dev protobuf-compiler + rm -rf /var/lib/apt/lists/* else
11-21: Consider adding Debian support.Since Debian is closely related to Ubuntu and uses the same package manager, you might want to support it explicitly. This would help if the CI environment or developer machines use Debian.
Suggested improvement
if [[ "$ID" == "rocky" ]]; then echo "Detected Rocky Linux. Installing Protobuf via dnf..." dnf install -y protobuf-devel protobuf-compiler - elif [[ "$ID" == "ubuntu" ]]; then - echo "Detected Ubuntu. Installing Protobuf via apt..." + elif [[ "$ID" == "ubuntu" || "$ID" == "debian" ]]; then + echo "Detected $ID. Installing Protobuf via apt..." apt-get update apt-get install -y libprotobuf-dev protobuf-compiler else
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
ci/build_wheel_libcuopt.shci/utils/install_protobuf.sh
🔇 Additional comments (2)
ci/utils/install_protobuf.sh (1)
1-6: Good shell script setup.The strict mode (
set -euo pipefail) ensures the script fails fast on errors, which is appropriate for CI scripts.ci/build_wheel_libcuopt.sh (1)
20-22: LGTM!The protobuf installation step is correctly placed in the dependency installation sequence, following the established pattern used by other dependencies (Boost, TBB, cuDSS). The script's
set -euo pipefailensures any installation failure will abort the build.
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
| if [ -f /etc/os-release ]; then | ||
| . /etc/os-release | ||
| if [[ "$ID" == "rocky" ]]; then | ||
| echo "Detected Rocky Linux. Installing Protobuf via dnf..." | ||
| dnf install -y protobuf-devel protobuf-compiler | ||
| elif [[ "$ID" == "ubuntu" ]]; then | ||
| echo "Detected Ubuntu. Installing Protobuf via apt..." | ||
| apt-get update | ||
| apt-get install -y libprotobuf-dev protobuf-compiler | ||
| else | ||
| echo "Unknown OS: $ID. Please install Protobuf development libraries manually." | ||
| exit 1 | ||
| fi | ||
| else | ||
| echo "/etc/os-release not found. Cannot determine OS. Please install Protobuf development libraries manually." | ||
| exit 1 | ||
| fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check if there are version constraints specified elsewhere in the project
echo "=== Checking dependency files for protobuf version constraints ==="
rg "protobuf" --type yaml --type txt -g '!*.pb.*' -n
echo ""
echo "=== Checking CMake files for protobuf version requirements ==="
rg "find_package.*[Pp]rotobuf" -A2 -B2 -n
echo ""
echo "=== Searching for CMakeLists.txt files ==="
fd "CMakeLists.txt" -x cat -n {}
echo ""
echo "=== Checking for .proto files ==="
fd "\.proto$" -x cat -n {}
echo ""
echo "=== Checking requirements.txt or similar Python dependency files ==="
fd -E node_modules "(requirements.*\.txt|setup\.py|pyproject\.toml|poetry\.lock|Pipfile)" -x cat -n {}
echo ""
echo "=== Checking for docker or container files ==="
fd "(Dockerfile|\.dockerignore)" -x cat -n {}Repository: NVIDIA/cuopt
Length of output: 50369
🏁 Script executed:
#!/bin/bash
echo "=== Checking dependencies.yaml for protobuf version constraints ==="
rg -A5 "depends_on_protobuf|libprotobuf" dependencies.yaml
echo ""
echo "=== Checking conda environment files for libprotobuf version specs ==="
fd "all_cuda.*\.yaml" conda/environments/ -x grep -A1 -B1 "libprotobuf" {}
echo ""
echo "=== Checking conda recipe for libprotobuf ==="
grep -A2 -B2 "libprotobuf" conda/recipes/libcuopt/recipe.yaml
echo ""
echo "=== Checking if protobuf has version specifications anywhere ==="
rg "protobuf.*==" -nRepository: NVIDIA/cuopt
Length of output: 1147
🏁 Script executed:
#!/bin/bash
echo "=== Looking for .proto file to understand protobuf usage ==="
find . -name "*.proto" -type f -exec cat -n {} +
echo ""
echo "=== Checking what's in linear_programming/utilities/protobuf_serializer.cu ==="
fd "protobuf_serializer" -x head -30 {}Repository: NVIDIA/cuopt
Length of output: 14312
Add protobuf version constraints to ensure compatibility across distributions.
Protobuf version mismatches between build and runtime environments can cause serialization/deserialization issues. Currently, protobuf has no version pinning in dependencies.yaml, conda environments, or CMakeLists.txt (find_package(Protobuf REQUIRED) lacks version specification), while the install script installs without version constraints. Given that this PR introduces protobuf-based serialization for remote solving, add explicit version constraints to dependencies.yaml, CMakeLists.txt, and update the install script to match the pinned version.
🤖 Prompt for AI Agents
In `@ci/utils/install_protobuf.sh` around lines 9 - 25, Pin Protobuf to a
compatible version across all places: update dependencies.yaml to require a
specific Protobuf range (e.g. "protobuf: >=3.20,<4.0"), update CMakeLists.txt's
find_package invocation to include the minimum version (e.g. change
find_package(Protobuf REQUIRED) to find_package(Protobuf 3.20 REQUIRED)), and
change this install script (install_protobuf.sh) to install that pinned version
via the package manager (adjust the dnf/apt package names or add version
qualifiers / use apt-get install protobuf-compiler=3.20.* or the equivalent
distro-specific package/version) and also ensure conda environment specs are
updated to the same version range; make sure error branches still instruct
manual install if the exact version is unavailable.
|
/ok to test 0dcd54a |
|
/ok to test de8daf7 |
|
/ok to test 7660217 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@cpp/CMakeLists.txt`:
- Around line 621-624: Wrap the
add_subdirectory(src/linear_programming/utilities/serializers) call with the
BUILD_REMOTE_SERVER conditional so the serializers plugin is only built when
BUILD_REMOTE_SERVER is enabled; locate the existing add_subdirectory invocation
in CMakeLists and move or surround it with the if(BUILD_REMOTE_SERVER) ...
endif() block (keeping it near other remote-server related entries and
references such as cuopt_remote_server.cpp) to prevent building the msgpack
serializer when remote server support is disabled.
🧹 Nitpick comments (2)
cpp/CMakeLists.txt (2)
234-238: Prefer target-specific includes over globalinclude_directories().Using global
include_directories()adds paths to all targets in this directory, which can cause unintended include pollution. Modern CMake best practice is to usetarget_include_directories()on specific targets that need the paths.Since multiple targets need the generated proto headers (cuopt, cuopt_remote_server), consider either:
- Adding the paths explicitly to each target that needs them, or
- Creating an interface library for the proto headers
♻️ Suggested approach using target-specific includes
# Protocol Buffers for remote solve serialization find_package(Protobuf REQUIRED) -include_directories(SYSTEM ${Protobuf_INCLUDE_DIRS}) -include_directories(${CMAKE_CURRENT_BINARY_DIR}) +# Protobuf includes added to targets via target_include_directories belowThen add to the cuopt target (around line 305-319):
target_include_directories(cuopt SYSTEM PRIVATE ${Protobuf_INCLUDE_DIRS}) target_include_directories(cuopt PRIVATE ${CMAKE_CURRENT_BINARY_DIR})And similarly for cuopt_remote_server.
575-581: Consider using PRIVATE for executable link dependencies.For executable targets, the
PUBLIC/PRIVATEdistinction ontarget_link_librarieshas no practical effect on the executable itself. UsingPRIVATEis the conventional choice and more clearly expresses intent.♻️ Suggested change
target_link_libraries(cuopt_remote_server - PUBLIC - cuopt - OpenMP::OpenMP_CXX PRIVATE + cuopt + OpenMP::OpenMP_CXX protobuf::libprotobuf )target_link_libraries(cuopt_cancel_job - PUBLIC - cuopt + PRIVATE cuopt )Also applies to: 609-612
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
cpp/CMakeLists.txt
🧰 Additional context used
🧠 Learnings (11)
📚 Learning: 2025-11-25T10:20:49.822Z
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh,cpp,hpp,h} : Check that hard-coded GPU device IDs and resource limits are made configurable; abstract multi-backend support for different CUDA versions
Applied to files:
cpp/CMakeLists.txt
📚 Learning: 2025-11-25T10:20:49.822Z
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh,cpp,hpp,h} : Verify error propagation from CUDA to user-facing APIs is complete; ensure CUDA errors are caught and mapped to meaningful user error codes
Applied to files:
cpp/CMakeLists.txt
📚 Learning: 2025-11-25T10:20:49.822Z
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh} : Every CUDA kernel launch and memory operation must have error checking with CUDA_CHECK or equivalent verification
Applied to files:
cpp/CMakeLists.txt
📚 Learning: 2025-11-25T10:20:49.822Z
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cpp,hpp,h} : Avoid inappropriate use of exceptions in performance-critical GPU operation paths; prefer error codes or CUDA error checking for latency-sensitive code
Applied to files:
cpp/CMakeLists.txt
📚 Learning: 2025-11-25T10:20:49.822Z
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh} : Avoid reinventing functionality already available in Thrust, CCCL, or RMM libraries; prefer standard library utilities over custom implementations
Applied to files:
cpp/CMakeLists.txt
📚 Learning: 2025-11-25T10:20:49.822Z
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.cu : Detect inefficient GPU kernel launches with low occupancy or poor memory access patterns; optimize for coalesced memory access and minimize warp divergence in hot paths
Applied to files:
cpp/CMakeLists.txt
📚 Learning: 2025-11-25T10:20:49.822Z
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh,cpp,hpp,h} : Verify correct problem size checks before expensive GPU/CPU operations; prevent resource exhaustion on oversized problems
Applied to files:
cpp/CMakeLists.txt
📚 Learning: 2025-11-25T10:20:49.822Z
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh,cpp,hpp,h} : Ensure race conditions are absent in multi-GPU code and multi-threaded server implementations; verify proper synchronization of shared state
Applied to files:
cpp/CMakeLists.txt
📚 Learning: 2025-11-25T10:20:49.822Z
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh,cpp,hpp,h} : Track GPU device memory allocations and deallocations to prevent memory leaks; ensure cudaMalloc/cudaFree balance and cleanup of streams/events
Applied to files:
cpp/CMakeLists.txt
📚 Learning: 2025-11-25T10:20:49.822Z
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh,cpp,hpp,h} : For concurrent CUDA operations (barriers, async operations), explicitly create and manage dedicated streams instead of reusing the default stream; document stream lifecycle
Applied to files:
cpp/CMakeLists.txt
📚 Learning: 2025-11-25T10:20:49.822Z
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh,cpp,hpp,h} : Refactor code duplication in solver components (3+ occurrences) into shared utilities; for GPU kernels, use templated device functions to avoid duplication
Applied to files:
cpp/CMakeLists.txt
🔇 Additional comments (3)
cpp/CMakeLists.txt (3)
142-147: LGTM!Reasonable workaround for GCC's known false-positive
-Warray-boundsdiagnostics from external headers like Protobuf, while preserving-Werrorfor other warnings.
239-253: Proto generation setup looks correct.The custom command is properly configured with
DEPENDSfor rebuild tracking andVERBATIMfor safe argument handling.Minor observation:
PROTO_HDRSis defined but unused. Consider adding it to the cuopt target's sources ortarget_sources()to ensure proper dependency tracking for the header file, though in practice the.ccfile inclusion typically suffices.
264-267: LGTM!The protobuf sources are correctly added to the cuopt library, and the linking arrangement is appropriate:
mps_parser_staticembeds the parser symbols directlyprotobuf::libprotobufas PRIVATE keeps it an implementation detailAlso applies to: 375-377
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
| # ################################################################################################## | ||
| # - Pluggable Serializers -------------------------------------------------------------------------- | ||
| # Build optional serializer plugins (e.g., msgpack) | ||
| add_subdirectory(src/linear_programming/utilities/serializers) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check if serializers are used outside of remote solve code paths
# Find references to serializer in the codebase excluding the serializers directory itself
echo "=== References to serializer outside serializers directory ==="
rg -n "serializer" --type cpp --type cmake -g '!**/serializers/**' -g '!**/test**' | head -100
echo ""
echo "=== Check if serializers CMakeLists.txt has its own guards ==="
fd -t f "CMakeLists.txt" --path "*/serializers/*" -exec cat {} \;
echo ""
echo "=== Check the context around line 621-624 in CMakeLists.txt ==="
rg -B 30 -A 5 "add_subdirectory.*serializers" cpp/CMakeLists.txtRepository: NVIDIA/cuopt
Length of output: 7539
🏁 Script executed:
#!/bin/bash
# Get serializers CMakeLists.txt
echo "=== Serializers CMakeLists.txt ==="
find . -path "*/serializers/CMakeLists.txt" -exec cat {} \;
echo ""
echo "=== Check BUILD_REMOTE_SERVER conditional structure in main CMakeLists.txt ==="
# Search for BUILD_REMOTE_SERVER conditions to see the structure
rg -n "BUILD_REMOTE_SERVER|add_subdirectory.*serializers" cpp/CMakeLists.txt | tail -50
echo ""
echo "=== Check protobuf_serializer.cu location and context ==="
fd -t f "protobuf_serializer.cu" -exec head -20 {} \;
echo ""
echo "=== Check if protobuf_serializer is conditioned in src/linear_programming/CMakeLists.txt ==="
rg -B 5 -A 5 "protobuf_serializer" cpp/src/linear_programming/CMakeLists.txtRepository: NVIDIA/cuopt
Length of output: 2658
Move the serializers subdirectory inside the BUILD_REMOTE_SERVER conditional.
The add_subdirectory(src/linear_programming/utilities/serializers) call builds the msgpack serializer plugin, which is only used by the remote server. All serializer references in the codebase are confined to cuopt_remote_server.cpp. Building this optional plugin when remote server support is disabled is wasteful and should be gated behind BUILD_REMOTE_SERVER.
🤖 Prompt for AI Agents
In `@cpp/CMakeLists.txt` around lines 621 - 624, Wrap the
add_subdirectory(src/linear_programming/utilities/serializers) call with the
BUILD_REMOTE_SERVER conditional so the serializers plugin is only built when
BUILD_REMOTE_SERVER is enabled; locate the existing add_subdirectory invocation
in CMakeLists and move or surround it with the if(BUILD_REMOTE_SERVER) ...
endif() block (keeping it near other remote-server related entries and
references such as cuopt_remote_server.cpp) to prevent building the msgpack
serializer when remote server support is disabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@cpp/CMakeLists.txt`:
- Around line 266-269: The cuopt library is currently built with ${PROTO_SRCS}
always appended which breaks builds when Protobuf is disabled; ensure PROTO_SRCS
is defined to an empty value when the protobuf/remote-server branch is not taken
(e.g., set(PROTO_SRCS "" ) or guard with if(NOT DEFINED PROTO_SRCS)
set(PROTO_SRCS "")) so the add_library(cuopt SHARED ${CUOPT_SRC_FILES}
${PROTO_SRCS}) line works unconditionally; update the CMake logic that defines
PROTO_SRCS (the conditional that generates/sets it) to provide this default.
- Around line 236-255: Wrap the Protobuf dependency and proto generation in an
if(BUILD_REMOTE_SERVER) ... endif() guard so find_package(Protobuf REQUIRED),
the add_custom_command that produces PROTO_SRCS/PROTO_HDRS, and setting
PROTO_FILE/PROTO_SRCS/PROTO_HDRS only run when BUILD_REMOTE_SERVER is ON; remove
the unconditional include_directories(SYSTEM ${Protobuf_INCLUDE_DIRS}) and
include_directories(${CMAKE_CURRENT_BINARY_DIR}) and instead call
target_include_directories(<the target that consumes PROTO_SRCS/PROTO_HDRS>
PRIVATE ${CMAKE_CURRENT_BINARY_DIR} SYSTEM ${Protobuf_INCLUDE_DIRS}) so only the
target that uses the generated headers gets the includes.
♻️ Duplicate comments (1)
cpp/CMakeLists.txt (1)
623-626: Move the serializers subdirectory inside theBUILD_REMOTE_SERVERconditional.This was flagged in a previous review. The
add_subdirectory(src/linear_programming/utilities/serializers)builds the msgpack serializer plugin, which is only used by the remote server.♻️ Suggested fix
-# ################################################################################################## -# - Pluggable Serializers -------------------------------------------------------------------------- -# Build optional serializer plugins (e.g., msgpack) -add_subdirectory(src/linear_programming/utilities/serializers)Move this inside the
if(BUILD_REMOTE_SERVER AND NOT BUILD_LP_ONLY)block above (before line 621).
🧹 Nitpick comments (2)
cpp/CMakeLists.txt (2)
376-379: Protobuf linkage should be conditional.The
protobuf::libprotobufis linked unconditionally. If the Protobuf dependency is made conditional onBUILD_REMOTE_SERVER, this linkage should also be guarded:♻️ Suggested refactor
PRIVATE ${CUOPT_PRIVATE_CUDA_LIBS} cuopt::mps_parser_static # Static link - symbols embedded in libcuopt.so - protobuf::libprotobuf + $<$<BOOL:${BUILD_REMOTE_SERVER}>:protobuf::libprotobuf> )
559-621: Remote server build section is well-structured.The
BUILD_REMOTE_SERVERguard and RUNPATH configuration are appropriate. Minor suggestions:
Redundant include directories (lines 567-576): For executables,
PUBLICinclude directories are not meaningful since nothing links against them. The same paths appear in bothPRIVATEandPUBLIC.Executable linkage (lines 577-583, 611-614): For executables,
PUBLICvsPRIVATElinkage is semantically equivalent. Consider using onlyPRIVATEfor clarity.♻️ Optional cleanup
target_include_directories(cuopt_remote_server PRIVATE "${CMAKE_CURRENT_SOURCE_DIR}/src" "${CMAKE_CURRENT_SOURCE_DIR}/include" "${CMAKE_CURRENT_SOURCE_DIR}/libmps_parser/include" "${CMAKE_CURRENT_BINARY_DIR}" - PUBLIC "$<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}/include>" "$<BUILD_INTERFACE:${CMAKE_CURRENT_BINARY_DIR}/include>" ) target_link_libraries(cuopt_remote_server - PUBLIC - cuopt - OpenMP::OpenMP_CXX PRIVATE + cuopt + OpenMP::OpenMP_CXX protobuf::libprotobuf )
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
cpp/CMakeLists.txt
🧰 Additional context used
🧠 Learnings (12)
📚 Learning: 2025-11-25T10:20:49.822Z
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh,cpp,hpp,h} : Check that hard-coded GPU device IDs and resource limits are made configurable; abstract multi-backend support for different CUDA versions
Applied to files:
cpp/CMakeLists.txt
📚 Learning: 2025-11-25T10:20:49.822Z
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cpp,hpp,h} : Prevent thread-unsafe use of global and static variables; use proper mutex/synchronization in server code accessing shared solver state
Applied to files:
cpp/CMakeLists.txt
📚 Learning: 2025-11-25T10:20:49.822Z
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh,cpp,hpp,h} : Ensure race conditions are absent in multi-GPU code and multi-threaded server implementations; verify proper synchronization of shared state
Applied to files:
cpp/CMakeLists.txt
📚 Learning: 2025-11-25T10:20:49.822Z
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh,cpp,hpp,h} : Refactor code duplication in solver components (3+ occurrences) into shared utilities; for GPU kernels, use templated device functions to avoid duplication
Applied to files:
cpp/CMakeLists.txt
📚 Learning: 2025-11-25T10:20:49.822Z
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh,cpp,hpp,h} : Verify error propagation from CUDA to user-facing APIs is complete; ensure CUDA errors are caught and mapped to meaningful user error codes
Applied to files:
cpp/CMakeLists.txt
📚 Learning: 2025-11-25T10:20:49.822Z
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh} : Every CUDA kernel launch and memory operation must have error checking with CUDA_CHECK or equivalent verification
Applied to files:
cpp/CMakeLists.txt
📚 Learning: 2025-11-25T10:20:49.822Z
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cpp,hpp,h} : Avoid inappropriate use of exceptions in performance-critical GPU operation paths; prefer error codes or CUDA error checking for latency-sensitive code
Applied to files:
cpp/CMakeLists.txt
📚 Learning: 2025-11-25T10:20:49.822Z
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh} : Avoid reinventing functionality already available in Thrust, CCCL, or RMM libraries; prefer standard library utilities over custom implementations
Applied to files:
cpp/CMakeLists.txt
📚 Learning: 2025-11-25T10:20:49.822Z
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.cu : Detect inefficient GPU kernel launches with low occupancy or poor memory access patterns; optimize for coalesced memory access and minimize warp divergence in hot paths
Applied to files:
cpp/CMakeLists.txt
📚 Learning: 2025-11-25T10:20:49.822Z
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh,cpp,hpp,h} : Verify correct problem size checks before expensive GPU/CPU operations; prevent resource exhaustion on oversized problems
Applied to files:
cpp/CMakeLists.txt
📚 Learning: 2025-11-25T10:20:49.822Z
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh,cpp,hpp,h} : Track GPU device memory allocations and deallocations to prevent memory leaks; ensure cudaMalloc/cudaFree balance and cleanup of streams/events
Applied to files:
cpp/CMakeLists.txt
📚 Learning: 2025-11-25T10:20:49.822Z
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh,cpp,hpp,h} : For concurrent CUDA operations (barriers, async operations), explicitly create and manage dedicated streams instead of reusing the default stream; document stream lifecycle
Applied to files:
cpp/CMakeLists.txt
🔇 Additional comments (1)
cpp/CMakeLists.txt (1)
142-149: LGTM: Pragmatic warning suppression for third-party headers.The GCC-specific suppression of
-Warray-boundsand-Wstringop-overreadis a reasonable approach to avoid false positives from Protobuf headers while maintaining strict warnings for project code.
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
| # Protocol Buffers for remote solve serialization | ||
| find_package(Protobuf REQUIRED) | ||
| include_directories(SYSTEM ${Protobuf_INCLUDE_DIRS}) | ||
| include_directories(${CMAKE_CURRENT_BINARY_DIR}) | ||
|
|
||
| # Generate C++ code from .proto file | ||
| set(PROTO_FILE "${CMAKE_CURRENT_SOURCE_DIR}/src/linear_programming/utilities/cuopt_remote.proto") | ||
| set(PROTO_SRCS "${CMAKE_CURRENT_BINARY_DIR}/cuopt_remote.pb.cc") | ||
| set(PROTO_HDRS "${CMAKE_CURRENT_BINARY_DIR}/cuopt_remote.pb.h") | ||
|
|
||
| add_custom_command( | ||
| OUTPUT "${PROTO_SRCS}" "${PROTO_HDRS}" | ||
| COMMAND ${Protobuf_PROTOC_EXECUTABLE} | ||
| ARGS --cpp_out ${CMAKE_CURRENT_BINARY_DIR} | ||
| --proto_path ${CMAKE_CURRENT_SOURCE_DIR}/src/linear_programming/utilities | ||
| ${PROTO_FILE} | ||
| DEPENDS ${PROTO_FILE} | ||
| COMMENT "Generating C++ code from cuopt_remote.proto" | ||
| VERBATIM | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Protobuf is unconditionally required even when remote server is disabled.
The find_package(Protobuf REQUIRED) and proto generation occur unconditionally, adding a mandatory build dependency for all users even when BUILD_REMOTE_SERVER=OFF. This increases build complexity and external dependencies for users who only need local solves.
Consider making Protobuf conditional on BUILD_REMOTE_SERVER:
♻️ Suggested refactor
-# Protocol Buffers for remote solve serialization
-find_package(Protobuf REQUIRED)
-include_directories(SYSTEM ${Protobuf_INCLUDE_DIRS})
-include_directories(${CMAKE_CURRENT_BINARY_DIR})
-
-# Generate C++ code from .proto file
-set(PROTO_FILE "${CMAKE_CURRENT_SOURCE_DIR}/src/linear_programming/utilities/cuopt_remote.proto")
-set(PROTO_SRCS "${CMAKE_CURRENT_BINARY_DIR}/cuopt_remote.pb.cc")
-set(PROTO_HDRS "${CMAKE_CURRENT_BINARY_DIR}/cuopt_remote.pb.h")
-
-add_custom_command(
- OUTPUT "${PROTO_SRCS}" "${PROTO_HDRS}"
- COMMAND ${Protobuf_PROTOC_EXECUTABLE}
- ARGS --cpp_out ${CMAKE_CURRENT_BINARY_DIR}
- --proto_path ${CMAKE_CURRENT_SOURCE_DIR}/src/linear_programming/utilities
- ${PROTO_FILE}
- DEPENDS ${PROTO_FILE}
- COMMENT "Generating C++ code from cuopt_remote.proto"
- VERBATIM
-)
+# Protocol Buffers for remote solve serialization (optional)
+set(PROTO_SRCS "")
+if(BUILD_REMOTE_SERVER)
+ find_package(Protobuf REQUIRED)
+
+ set(PROTO_FILE "${CMAKE_CURRENT_SOURCE_DIR}/src/linear_programming/utilities/cuopt_remote.proto")
+ set(PROTO_SRCS "${CMAKE_CURRENT_BINARY_DIR}/cuopt_remote.pb.cc")
+ set(PROTO_HDRS "${CMAKE_CURRENT_BINARY_DIR}/cuopt_remote.pb.h")
+
+ add_custom_command(
+ OUTPUT "${PROTO_SRCS}" "${PROTO_HDRS}"
+ COMMAND ${Protobuf_PROTOC_EXECUTABLE}
+ ARGS --cpp_out ${CMAKE_CURRENT_BINARY_DIR}
+ --proto_path ${CMAKE_CURRENT_SOURCE_DIR}/src/linear_programming/utilities
+ ${PROTO_FILE}
+ DEPENDS ${PROTO_FILE}
+ COMMENT "Generating C++ code from cuopt_remote.proto"
+ VERBATIM
+ )
+endif()Additionally, prefer target_include_directories() over global include_directories() for better encapsulation.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Protocol Buffers for remote solve serialization | |
| find_package(Protobuf REQUIRED) | |
| include_directories(SYSTEM ${Protobuf_INCLUDE_DIRS}) | |
| include_directories(${CMAKE_CURRENT_BINARY_DIR}) | |
| # Generate C++ code from .proto file | |
| set(PROTO_FILE "${CMAKE_CURRENT_SOURCE_DIR}/src/linear_programming/utilities/cuopt_remote.proto") | |
| set(PROTO_SRCS "${CMAKE_CURRENT_BINARY_DIR}/cuopt_remote.pb.cc") | |
| set(PROTO_HDRS "${CMAKE_CURRENT_BINARY_DIR}/cuopt_remote.pb.h") | |
| add_custom_command( | |
| OUTPUT "${PROTO_SRCS}" "${PROTO_HDRS}" | |
| COMMAND ${Protobuf_PROTOC_EXECUTABLE} | |
| ARGS --cpp_out ${CMAKE_CURRENT_BINARY_DIR} | |
| --proto_path ${CMAKE_CURRENT_SOURCE_DIR}/src/linear_programming/utilities | |
| ${PROTO_FILE} | |
| DEPENDS ${PROTO_FILE} | |
| COMMENT "Generating C++ code from cuopt_remote.proto" | |
| VERBATIM | |
| ) | |
| # Protocol Buffers for remote solve serialization (optional) | |
| set(PROTO_SRCS "") | |
| if(BUILD_REMOTE_SERVER) | |
| find_package(Protobuf REQUIRED) | |
| set(PROTO_FILE "${CMAKE_CURRENT_SOURCE_DIR}/src/linear_programming/utilities/cuopt_remote.proto") | |
| set(PROTO_SRCS "${CMAKE_CURRENT_BINARY_DIR}/cuopt_remote.pb.cc") | |
| set(PROTO_HDRS "${CMAKE_CURRENT_BINARY_DIR}/cuopt_remote.pb.h") | |
| add_custom_command( | |
| OUTPUT "${PROTO_SRCS}" "${PROTO_HDRS}" | |
| COMMAND ${Protobuf_PROTOC_EXECUTABLE} | |
| ARGS --cpp_out ${CMAKE_CURRENT_BINARY_DIR} | |
| --proto_path ${CMAKE_CURRENT_SOURCE_DIR}/src/linear_programming/utilities | |
| ${PROTO_FILE} | |
| DEPENDS ${PROTO_FILE} | |
| COMMENT "Generating C++ code from cuopt_remote.proto" | |
| VERBATIM | |
| ) | |
| endif() |
🤖 Prompt for AI Agents
In `@cpp/CMakeLists.txt` around lines 236 - 255, Wrap the Protobuf dependency and
proto generation in an if(BUILD_REMOTE_SERVER) ... endif() guard so
find_package(Protobuf REQUIRED), the add_custom_command that produces
PROTO_SRCS/PROTO_HDRS, and setting PROTO_FILE/PROTO_SRCS/PROTO_HDRS only run
when BUILD_REMOTE_SERVER is ON; remove the unconditional
include_directories(SYSTEM ${Protobuf_INCLUDE_DIRS}) and
include_directories(${CMAKE_CURRENT_BINARY_DIR}) and instead call
target_include_directories(<the target that consumes PROTO_SRCS/PROTO_HDRS>
PRIVATE ${CMAKE_CURRENT_BINARY_DIR} SYSTEM ${Protobuf_INCLUDE_DIRS}) so only the
target that uses the generated headers gets the includes.
| add_library(cuopt SHARED | ||
| ${CUOPT_SRC_FILES} | ||
| ${PROTO_SRCS} | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Proto sources compiled into main library unconditionally.
The ${PROTO_SRCS} is added to the cuopt library regardless of whether remote server support is enabled. If Protobuf setup is made conditional per the previous comment, ensure PROTO_SRCS defaults to empty so this line works correctly in both cases.
🤖 Prompt for AI Agents
In `@cpp/CMakeLists.txt` around lines 266 - 269, The cuopt library is currently
built with ${PROTO_SRCS} always appended which breaks builds when Protobuf is
disabled; ensure PROTO_SRCS is defined to an empty value when the
protobuf/remote-server branch is not taken (e.g., set(PROTO_SRCS "" ) or guard
with if(NOT DEFINED PROTO_SRCS) set(PROTO_SRCS "")) so the add_library(cuopt
SHARED ${CUOPT_SRC_FILES} ${PROTO_SRCS}) line works unconditionally; update the
CMake logic that defines PROTO_SRCS (the conditional that generates/sets it) to
provide this default.
Description
This change adds a new remote execution system which will ultimately replace the cuopt Python server. In this new implementation, remote solves are identical to local solves except that the env vars CUOPT_REMOTE_HOST and CUOPT_REMOTE_PORT are set to indicate a remote server.
The remote server code is included and can be built optionally with "build.sh cuopt_remote_server". It does not build by default.
Internal code paths have been modified to use data_model_view_t with non-owning pointers to store problems and solutions as GPU memory or CPU memory, the former for local solves and the latter for remote solves.
Some markdown files have been added that describe the changes:
Checklist
Need to:
evaluate the need for new tests
review the available markdowns and make sure the feature is well-documented
Summary by CodeRabbit
New Features
Documentation
Tests
Chores
✏️ Tip: You can customize this high-level summary in your review settings.