Add support for pandas 3.0 #500

Open
hagenw wants to merge 24 commits into main from dev

hagenw commented Jan 27, 2026

Closes #487

Timedelta/datetime updates

pandas changed the default unit of timedelta and datetime entries from nanoseconds to a resolution that matches the given input precision.

We update the code here to ensure we always get nanosecond resolution as before.
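
One way to pin the resolution is to cast the result of pd.to_timedelta() explicitly. The following is a minimal sketch, assuming list-like input given in seconds; to_timedelta_ns() is a hypothetical helper for illustration and not necessarily how audformat implements it.

```python
import pandas as pd


def to_timedelta_ns(values, unit="s"):
    """Convert list-like values to a TimedeltaIndex with nanosecond resolution.

    Hypothetical helper for illustration, not part of audformat.
    """
    index = pd.to_timedelta(values, unit=unit)
    # Request nanoseconds explicitly instead of relying on
    # the resolution pandas picks for the given input
    return index.astype("timedelta64[ns]")


starts = to_timedelta_ns([0, 1.5, 3])
print(starts.dtype)
```

The explicit cast keeps the dtype independent of the installed pandas version, at the cost of one extra conversion step.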

String updates

pandas introduces a new default string type (str/<StringDtype(na_value=nan)>), which replaces object as default.
Unfortunately, the new string type is different from string/<StringDtype(na_value=<NA>)> as it uses a different value to represent missing values.

We update the code here to ensure we get the same results with pandas <3.0 and pandas >=3.0:

  • We continue to use string/<StringDtype(na_value=<NA>)> for the string scheme
  • We continue to use object for schemes with string labels (which are represented by a categorical data type)
  • We fix audformat.utils.set_index_dtypes() so that it can change between all available string types
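
The effect of the bullets above can be illustrated by requesting dtypes explicitly instead of relying on inference. This is a sketch only; string_variant() is a hypothetical helper that tells the string representations apart by their missing-value marker, and it is not the helper used inside set_index_dtypes().

```python
import pandas as pd


def string_variant(dtype):
    """Name the string representation behind a pandas dtype.

    Hypothetical helper for illustration, not part of audformat.
    """
    if isinstance(dtype, pd.StringDtype):
        # Both string flavours are StringDtype instances,
        # they differ in the value used for missing entries
        return "string[NA]" if dtype.na_value is pd.NA else "string[nan]"
    if dtype == object:
        return "object"
    return "other"


values = ["a", None]
# Explicit dtypes do not depend on the pandas default string type
s_string = pd.Series(values, dtype="string")
s_object = pd.Series(values, dtype="object")

print(string_variant(s_string.dtype))
print(string_variant(s_object.dtype))
```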

Examples of changed string behavior

Output of print(obj.dtype)

| Command | pandas 2.3.3 | pandas 3.0.0 |
| pd.Series([]) | object | object |
| pd.Series(["a"]) | object | str |
| pd.Series(["a", pd.NA]) | object | str |
| pd.Series(["a", np.nan]) | object | str |
| pd.Series(["a"], dtype="string") | string | string |
| pd.Series(["a"], dtype=str) | object | str |
| pd.Series(["a"], dtype="str") | object | str |

Output of obj.dtype

| Command | pandas 2.3.3 | pandas 3.0.0 |
| pd.Series([]) | dtype('O') | dtype('O') |
| pd.Series(["a"]) | dtype('O') | <StringDtype(na_value=nan)> |
| pd.Series(["a", pd.NA]) | dtype('O') | <StringDtype(na_value=nan)> |
| pd.Series(["a", np.nan]) | dtype('O') | <StringDtype(na_value=nan)> |
| pd.Series(["a"], dtype="string") | string[python] | <StringDtype(na_value=<NA>)> |
| pd.Series(["a"], dtype=str) | dtype('O') | <StringDtype(na_value=nan)> |
| pd.Series(["a"], dtype="str") | dtype('O') | <StringDtype(na_value=nan)> |
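
To check these rows against a locally installed pandas version, a short script along the following lines should work. It is a sketch that only prints what the installed version reports; the dictionary name commands and the output layout are choices made for this example.

```python
import numpy as np
import pandas as pd

# One constructor per row of the tables above
commands = {
    'pd.Series([])': lambda: pd.Series([]),
    'pd.Series(["a"])': lambda: pd.Series(["a"]),
    'pd.Series(["a", pd.NA])': lambda: pd.Series(["a", pd.NA]),
    'pd.Series(["a", np.nan])': lambda: pd.Series(["a", np.nan]),
    'pd.Series(["a"], dtype="string")': lambda: pd.Series(["a"], dtype="string"),
    'pd.Series(["a"], dtype=str)': lambda: pd.Series(["a"], dtype=str),
    'pd.Series(["a"], dtype="str")': lambda: pd.Series(["a"], dtype="str"),
}

# Collect the dtype each command yields on the installed pandas version
results = {command: create().dtype for command, create in commands.items()}

print(f"pandas {pd.__version__}")
for command, dtype in results.items():
    # First column mirrors print(obj.dtype), second column mirrors obj.dtype
    print(f"{command:34} {str(dtype):8} {dtype!r}")
```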

hagenw added 12 commits January 23, 2026 11:31
* Add failing test

* Make test pandas 3.0.0 compatible

* Fix set_index_dtypes() for pandas 3.0

* Add comment

* Fix doctests

* Update segmented_index()

* Use segmented_index in test

* Add test for segmented_index
* pandas 3.0: fix utils.hash()

* Fix comment

* Remove unneeded code

* Add more tests

* Preserve ordered setting

* Update comment
* Fix categorical dtype with Database.get()

* Update tests

* Add additional test

* Improve code

* Clean up comment

* We converted to categorical data

* Simplify test

* Simplify string test
* Require timedelta64[ns] in assert_index()

* Add tests for mixed cases
* pandas 3.0: segmented_index() and set_index_dtypes() (#490)

* Add failing test

* Make test pandas 3.0.0 compatible

* Fix set_index_dtypes() for pandas 3.0

* Add comment

* Fix doctests

* Update segmented_index()

* Use segmented_index in test

* Add test for segmented_index

* Avoid warning in testing.add_table() (#491)

* pandas 3.0: fix utils.hash() (#492)

* pandas 3.0: fix utils.hash()

* Fix comment

* Remove unneeded code

* Add more tests

* Preserve ordered setting

* Update comment

* Fix categorical dtype with Database.get() (#493)

* Fix categorical dtype with Database.get()

* Update tests

* Add additional test

* Improve code

* Clean up comment

* We converted to categorical data

* Simplify test

* Simplify string test

* Require timedelta64[ns] in assert_index() (#494)

* Require timedelta64[ns] in assert_index()

* Add tests for mixed cases

* pandas 3.0: fix doctests output
* Update test_utils.py

* Update test_misc_table

* Set index dtypes directly

* Fix test_table

* Update to_timedelta in index.py

* Fix conversion to timedelta in testing.py

* Update test_utils_concat.py

* Add comment

* Update to_timedelta()
sourcery-ai bot commented Jan 27, 2026

Reviewer's Guide

Adds pandas 3.0 compatibility by enforcing nanosecond-resolution datetime/timedelta dtypes, normalizing string and categorical dtypes (especially for schemes and indices), making hashing/index utilities robust to new pandas string behavior, and relaxing the pandas upper bound; tests and docs are updated accordingly.

Updated class diagram for categorical and scheme dtype handling

classDiagram
    class Scheme {
        +labels
        +dtype
        +to_pandas_dtype() pd_dtype
    }

    class CommonModule {
        +to_categorical_dtype(labels) CategoricalDtype
        +to_pandas_dtype(dtype) pandas_dtype
    }

    class Table {
        +_pyarrow_convert_dtypes(df) DataFrame
    }

    class Database {
        +schemes
    }

    class Column {
        +scheme_id
    }

    Database --> Scheme : contains
    Table --> Database : references
    Table --> Column : contains
    Column --> Scheme : uses_scheme_id

    Scheme ..> CommonModule : uses_to_categorical_dtype
    Table ..> CommonModule : uses_to_categorical_dtype

File-Level Changes

Changes, each with details and affected files:
Normalize string, categorical, and index dtypes to behave consistently across pandas <3.0 and >=3.0, including schemes and segmented indices.
  • Introduce common.to_categorical_dtype() and reuse it from Scheme.to_pandas_dtype() and Table._pyarrow_convert_dtypes() to build categorical dtypes with stable category dtypes (ints as nullable int64, strings as object).
  • Adjust segmented index helpers and assertions to enforce the file level as pandas string dtype and the start/end levels as timedelta64[ns] (including via the to_timedelta helpers and assert_index checks).
  • Update set_index_dtypes() to compare dtypes via a helper that distinguishes different StringDtype na_value variants, cast timedelta levels explicitly to requested units, and add extensive tests around string/StringDtype/NA vs nan variants on Index and MultiIndex.
  • Update schemes and database get/append logic so string-based schemes use object-backed categoricals, normalize mixed string/object categorical category dtypes to object when aggregating, and ensure error messages stringify dtypes for robustness across pandas versions.
audformat/core/common.py
audformat/core/index.py
audformat/core/scheme.py
audformat/core/database.py
audformat/core/table.py
audformat/core/testing.py
tests/test_index.py
tests/test_scheme.py
tests/test_database_get.py
tests/test_misc_table.py
tests/test_table.py
tests/test_utils.py
tests/test_utils_concat.py
tests/test_column.py
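
The idea behind a shared categorical helper can be sketched as follows. This is an assumption-laden illustration, not the actual common.to_categorical_dtype(): the function name to_categorical_dtype_sketch(), its ordered parameter, and the label type checks are made up for this example. It assumes pandas >=2.0, where an Index can hold nullable integer data.

```python
import pandas as pd


def to_categorical_dtype_sketch(labels, ordered=False):
    """Build a CategoricalDtype whose category dtype is set explicitly.

    Hypothetical stand-in for illustration,
    not the implementation used in audformat.
    """
    labels = list(labels)
    if all(isinstance(label, str) for label in labels):
        # Strings: force object so the pandas default string type is not used
        categories = pd.Index(labels, dtype="object")
    elif all(
        isinstance(label, int) and not isinstance(label, bool)
        for label in labels
    ):
        # Integers: use the nullable integer dtype
        categories = pd.Index(labels, dtype="Int64")
    else:
        # Everything else: let pandas infer the dtype
        categories = pd.Index(labels)
    return pd.CategoricalDtype(categories, ordered=ordered)


emotion_dtype = to_categorical_dtype_sketch(["happy", "sad"])
series = pd.Series(["happy", None, "sad"], dtype=emotion_dtype)
print(series.dtype.categories.dtype)
```

Passing an Index with an explicit dtype, rather than a plain list, is what keeps pandas from inferring the category dtype on its own.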
Make hashing of pandas objects stable across pandas 3.0 string/categorical changes and pyarrow inference differences, especially for empty frames and string/categorical columns.
  • In utils.hash(), normalize string-typed columns to object dtype before conversion to pyarrow, and normalize categorical columns whose categories are string-like to use object-backed categories.
  • Build an explicit pyarrow schema for empty DataFrames where needed so object columns map to string rather than null, and fall back to normal from_pandas for non-empty frames.
  • Extend hash tests to cover string vs object dtypes (including filewise indices and categorical data) and ensure the resulting hashes are identical across dtype variants.
audformat/core/utils.py
tests/test_utils.py
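
The normalization step described for utils.hash() could look roughly like the sketch below. It only covers the dtype normalization and leaves out the pyarrow conversion and the hashing itself; normalize_string_columns() is a hypothetical name, and the real code may differ in detail.

```python
import pandas as pd


def normalize_string_columns(df):
    """Return a copy of df with string-like columns backed by object dtype.

    Hypothetical sketch for illustration,
    not the code used in audformat.utils.hash().
    """
    df = df.copy()
    for column in df.columns:
        dtype = df[column].dtype
        if isinstance(dtype, pd.StringDtype):
            # Any string flavour becomes plain object
            df[column] = df[column].astype("object")
        elif isinstance(dtype, pd.CategoricalDtype) and isinstance(
            dtype.categories.dtype, pd.StringDtype
        ):
            # Keep the codes, only swap the categories for object-backed ones
            categories = dtype.categories.astype("object")
            df[column] = df[column].cat.rename_categories(categories)
    return df


df = pd.DataFrame(
    {
        "text": pd.Series(["a", "b"], dtype="string"),
        "label": pd.Series(["x", "y"], dtype="string").astype("category"),
        "count": [1, 2],
    }
)
normalized = normalize_string_columns(df)
print(normalized.dtypes)
```

After this step the column dtypes no longer depend on which string representation the input used, which is the precondition for getting the same hash from equivalent data.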
Align tests, docs, and misc-table utilities with explicit dtypes and new pandas string defaults, and relax the pandas version constraint.
  • Update many tests to construct Index/MultiIndex objects with explicit dtype (object, string, Int64, timedelta64[ns], datetime64[ns]) so expectations are stable under pandas 3.0, and adjust expected error messages or xfails where dtype representations changed.
  • Update misc-table creation/extension tests and examples to use explicit string index dtypes, remove cases that relied on implicit str->object behavior, and ensure drop/extend/pick operations respect index dtypes.
  • Fix read_csv and other helpers to normalize empty DataFrame column dtypes under pandas 3.0 (e.g., explicitly casting columns/index names to string where pandas changed defaults).
  • Change the development dependency on pandas in pyproject.toml to allow pandas >=2.0.0 with no <3.0 upper bound, and refresh doc examples where necessary.
tests/test_misc_table.py
tests/test_table.py
tests/test_index.py
tests/test_utils.py
tests/test_utils_concat.py
tests/test_scheme.py
tests/test_database_get.py
tests/test_column.py
audformat/core/utils.py
pyproject.toml
audformat/core/index.py
docs/data-misc-tables.rst
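
Constructing indices with explicit dtypes, as the updated tests do, can be sketched like this for a segmented index. The file names and times are made up, the level names file, start, and end follow the segmented index described above, and the snippet assumes pandas >=2.0 so that the string dtype is kept on the index level.

```python
import pandas as pd

# Each level gets its dtype spelled out instead of being inferred
files = pd.Index(["f1.wav", "f2.wav"], dtype="string")
starts = pd.to_timedelta(["0s", "1s"]).astype("timedelta64[ns]")
ends = pd.to_timedelta(["1s", "2s"]).astype("timedelta64[ns]")

index = pd.MultiIndex.from_arrays(
    [files, starts, ends],
    names=["file", "start", "end"],
)
print(index.dtypes)
```

Expectations written against such an index are the same for pandas <3.0 and pandas >=3.0, because no level relies on a version-specific default.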

Assessment against linked issues

Issue Objective Addressed Explanation
#487 Ensure audformat’s timedelta and segmented index handling (including utilities like segmented_index/to_segmented_index) remains correct and stable under pandas 3.0.0’s new datetime/timedelta resolution inference, preserving the expected nanosecond precision.
#487 Update audformat to be fully compatible with pandas 3.0.0 overall (e.g., new default string dtype and related categorical/index behavior) so that the package works correctly with pandas 3.x.


codecov bot commented Jan 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.0%. Comparing base (56a8268) to head (0e7089d).

Additional details and impacted files
| Files with missing lines | Coverage Δ |
| audformat/core/common.py | 100.0% <100.0%> (ø) |
| audformat/core/database.py | 100.0% <100.0%> (ø) |
| audformat/core/index.py | 100.0% <100.0%> (ø) |
| audformat/core/scheme.py | 100.0% <100.0%> (ø) |
| audformat/core/table.py | 100.0% <100.0%> (ø) |
| audformat/core/testing.py | 100.0% <100.0%> (ø) |
| audformat/core/utils.py | 100.0% <100.0%> (ø) |

* Ensure object dtype for string categories

* Adjust tests

* Better label type detection

* Fix linter
* Add tests for expected categorical dtype

* Add tests for expected scheme dtypes

* Fix test

* Fix test

* Use explicit StringDtype
* Simplify checking for string dtype

* Improve variable names
* Simplify creation of segmented index

* Fix set_index_dtypes()

* Revert "Simplify creation of segmented index"

This reverts commit 73ff082.

* Clean up comment

* Fix dtype normalization

* Revert "Revert "Simplify creation of segmented index""

This reverts commit 6f51a35.

* Add tests for empty string index

      and df[column_id].dtype == "object"
  ):
-     df[column_id] = df[column_id].astype("string", copy=False)
+     df[column_id] = df[column_id].astype("string")

hagenw (Member, Author) commented:

This avoids a FutureWarning as copy will be removed from astype().

@hagenw hagenw marked this pull request as ready for review February 2, 2026 11:39
sourcery-ai[bot]

This comment was marked as outdated.

@hagenw hagenw self-assigned this Feb 2, 2026
@hagenw hagenw requested a review from frankenjoe February 2, 2026 12:17
Development

Successfully merging this pull request may close these issues.

Pandas 3.0.0 breaking changes

1 participant