2 changes: 1 addition & 1 deletion src/tabpfn/preprocessing/clean.py
@@ -169,4 +169,4 @@ def process_text_na_dataframe(
np.nan,
X_encoded[:, string_cols_ix],
)
- return typing.cast("np.ndarray", X_encoded.astype(np.float64))
+ return typing.cast("np.ndarray", X_encoded.astype(np.float32))
high

Changing the dtype of the returned array to float32 introduces several inconsistencies:

  1. Numerical Stability: src/tabpfn/constants.py (lines 50-51) defines DEFAULT_NUMPY_PREPROCESSING_DTYPE as np.float64 specifically to avoid overflows during transformations like Yeo-Johnson. Hardcoding float32 here may reintroduce those overflows in subsequent preprocessing steps.
  2. Docstring Inconsistency: The docstring for process_text_na_dataframe (lines 142 and 145) still explicitly mentions conversion to float64.
  3. Pipeline Inconsistency: fix_dtypes (line 69) defaults to float64. Since clean_data calls both, the numeric_dtype setting in fix_dtypes is now effectively overridden by this hardcoded float32 cast.
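
To illustrate the numerical stability point, here is a minimal sketch of how a power-style operation can overflow in float32 while staying finite in float64. The `square_in` helper and the input value `1e20` are illustrative assumptions, not code from the repository, and squaring is only a stand-in for a transform such as Yeo-Johnson.

```python
import numpy as np

# Largest finite value representable in each dtype.
F32_MAX = float(np.finfo(np.float32).max)  # roughly 3.4e38
F64_MAX = float(np.finfo(np.float64).max)  # roughly 1.8e308


def square_in(dtype, value=1e20):
    """Square `value` in the given dtype (hypothetical stand-in for a power transform)."""
    x = np.asarray([value], dtype=dtype)
    with np.errstate(over="ignore"):  # suppress the overflow RuntimeWarning
        return x ** 2


overflowed = square_in(np.float32)  # 1e40 exceeds F32_MAX, so the result is inf
safe = square_in(np.float64)        # 1e40 is well within the float64 range
```

Inputs of this magnitude are uncommon but not impossible in raw tabular data, which is why the dtype in which the intermediate computation runs matters.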

If the intention is to move the pipeline to float32, consider updating the global constant or making the dtype a parameter to maintain consistency.
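
One possible shape for the parameterized approach is sketched below. The helper name `cast_encoded` and its `numeric_dtype` parameter are hypothetical, and the constant is redefined locally only so the snippet runs on its own; in the real code it would presumably be imported from src/tabpfn/constants.py.

```python
import numpy as np

# Local stand-in for the constant described above; the np.float64 value is
# taken from the review comment, not read from the repository.
DEFAULT_NUMPY_PREPROCESSING_DTYPE = np.float64


def cast_encoded(X_encoded, numeric_dtype=DEFAULT_NUMPY_PREPROCESSING_DTYPE):
    """Hypothetical helper: cast the encoded array to a caller-chosen dtype.

    Defaulting to the shared constant keeps the cast aligned with the rest
    of the pipeline, while still allowing an explicit float32 opt-in.
    """
    return np.asarray(X_encoded).astype(numeric_dtype)


default_result = cast_encoded([[1, 2], [3, 4]])
float32_result = cast_encoded([[1, 2], [3, 4]], numeric_dtype=np.float32)
```

With this shape, a caller that already passes a dtype to fix_dtypes could forward the same value here, so the two steps would not disagree.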
