
Adding new index_name argument to Dataframe Collections #496

@john-sanchez31

Description

Add a new optional argument, index_name, to DataFrame Collections.
When this argument is provided, the DataFrame index is materialized as a new column with the given name.

Behavior:

  • If index_name is set, a new column is created in the DataFrame collection containing the index values.
  • The original DataFrame index remains unchanged.
  • The new column may participate in uniqueness constraints if its values are unique.

Validations:

  • Type validation: Valid index types are str, range (default), and numbers (integers or floats).
  • Name conflict: The name of the new column must not conflict with the existing columns in the DataFrame.
  • Uniqueness: Check whether the index values are unique (if custom indexes are allowed, this validation may be required).

Note: The new column can be added to the list of unique_column_names.
Whether this is allowed is determined by the uniqueness of the column's values.
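A minimal sketch of how the three validations might look, assuming the type check applies to the index values and that a non-unique index is allowed but cannot be listed in unique_column_names. The function name validate_index_name and the exact error types are hypothetical.

```python
import pandas as pd

# Index kinds accepted by the type validation, as reported by
# pandas' Index.inferred_type (a default RangeIndex reports "integer").
ALLOWED_INDEX_KINDS = {"string", "integer", "floating"}


def validate_index_name(df: pd.DataFrame, index_name: str) -> bool:
    """Validate index_name against df (hypothetical helper).

    Returns True if the new column could be listed in unique_column_names,
    i.e. the index values are unique.
    """
    # Name conflict: the new column must not shadow an existing one.
    if index_name in df.columns:
        raise ValueError(f"column {index_name!r} already exists in the DataFrame")

    # Type validation: only str, range (default), and numeric indexes.
    kind = df.index.inferred_type
    if kind not in ALLOWED_INDEX_KINDS:
        raise TypeError(f"unsupported index type: {kind}")

    # Uniqueness: decides whether the column may act as a unique key.
    return bool(df.index.is_unique)


df = pd.DataFrame({"A": [1, 2, 3]})
print(validate_index_name(df, "C"))  # True
```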

Example:

import pandas as pd

# Six rows with the default range index 0..5
df = pd.DataFrame({
  "A": [1, 2, 3] * 2,
  "B": ["A", "B"] * 3,
}, index=range(6))

pydough.dataframe_collection(
    name="my_df",
    dataframe=df,
    unique_column_names=["C", ["A", "B"]],
    index_name="C",
)

The DataFrame collection would be created with a new column called "C" that contains [0, 1, 2, 3, 4, 5].

Result:

     C      A      B
     0      1      A
     1      2      B
     2      3      A 
     3      1      B
     4      2      A
     5      3      B

Unique columns validation: Allow filter_columns to include at least one of the entries in unique_column_names, instead of requiring all of them.

Example:
The unique columns are ["column1", ["column2", "column3"]], but filter_columns may include only column1, or column2 and column3 without column1.

This also requires smarter validation of unique_column_names: if a unique key is composed of more than one column, all of its columns must be included in filter_columns, if provided. Following the last example, if column1 is not included, filter_columns must include both column2 AND column3.
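The rule described above could be checked roughly as follows. This is a sketch under the assumption that each entry of unique_column_names is either a single column name or a list of names forming a composite key; the function name covers_a_unique_key is hypothetical.

```python
from typing import Sequence, Union

UniqueEntry = Union[str, Sequence[str]]


def covers_a_unique_key(
    unique_column_names: Sequence[UniqueEntry],
    filter_columns: Sequence[str],
) -> bool:
    """Return True if filter_columns fully contains at least one unique key.

    Hypothetical helper: a composite key only counts when every one of
    its columns is present in filter_columns.
    """
    selected = set(filter_columns)
    for entry in unique_column_names:
        # Normalize a single column name to a one-element key.
        key = [entry] if isinstance(entry, str) else list(entry)
        if all(column in selected for column in key):
            return True
    return False


unique = ["column1", ["column2", "column3"]]
print(covers_a_unique_key(unique, ["column1"]))             # True
print(covers_a_unique_key(unique, ["column2", "column3"]))  # True
print(covers_a_unique_key(unique, ["column2"]))             # False
```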
