batch concatenation leads to duplicate dictionary entries and not readable by pandas #10160

Description

@pierrebelzile

Describe the bug

When we concatenate 2 or more batches where some columns are dictionaries, the dictionaries are concatenated instead of being merged.

The dictionary may end up with duplicates. For example, if both batches contain the string "alpha", the concatenated batch will have 2 dictionary entries for that string. The result is strictly correct (every index still points to its original value). However, any library that operates on the indices alone (e.g., an aggregation) will obtain a wrong result.

More directly, pandas will reject the batch because it validates that categories are unique:

lib/python3.11/site-packages/pandas/core/dtypes/dtypes.py", line 570, in validate_categories
    raise ValueError("Categorical categories must be unique")

pandas does have a function (union_categoricals) to combine categoricals that have different categories, but it is not intended to deduplicate the dictionary of a single dataframe.
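To make the aggregation problem concrete, here is a minimal std-only sketch (plain Rust, no arrow types). The helper names `count_by_key` and `count_by_value` are hypothetical. The sample layout assumes the concatenated dictionary is simply the two value lists appended, with the second batch's keys shifted by 3, which is consistent with the 6 dictionary values observed in the reproduction.

```rust
use std::collections::HashMap;

/// Count rows per dictionary key without decoding the values.
/// This mimics a consumer that assumes one key per distinct value.
fn count_by_key(keys: &[i32]) -> HashMap<i32, usize> {
    let mut counts = HashMap::new();
    for &k in keys {
        *counts.entry(k).or_insert(0) += 1;
    }
    counts
}

/// Count rows per decoded value, which is what the caller actually wants.
fn count_by_value(values: &[&str], keys: &[i32]) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for &k in keys {
        // Decode the key through the dictionary before grouping.
        *counts.entry(values[k as usize].to_string()).or_insert(0) += 1;
    }
    counts
}
```

With values `["alpha", "beta", "gamma", "gamma", "alpha", "beta"]` and keys `[0, 1, 2, 0, 5, 4, 3, 5]`, grouping by key yields 6 groups while grouping by decoded value yields 3, so "alpha" is split across keys 0 and 4.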

To Reproduce

//! Concatenation tests for dictionary arrays.
use std::sync::Arc;

use arrow::{
    array::{Array, ArrayRef, AsArray, DictionaryArray, Int32Array, RecordBatch, StringArray},
    compute::concat_batches,
    datatypes::{DataType, Field, Int32Type, Schema},
};

/// Build a dictionary array with explicit dictionary value order and key values.
fn dictionary_array(dictionary_values: Vec<&str>, keys: Vec<i32>) -> ArrayRef {
    Arc::new(
        DictionaryArray::<Int32Type>::try_new(
            Int32Array::from(keys),
            Arc::new(StringArray::from(dictionary_values)),
        )
        .expect("dictionary array"),
    )
}

/// Build a one-column record batch containing a dictionary array.
fn dictionary_batch(
    schema: Arc<Schema>,
    dictionary_values: Vec<&str>,
    keys: Vec<i32>,
) -> RecordBatch {
    RecordBatch::try_new(schema, vec![dictionary_array(dictionary_values, keys)])
        .expect("record batch")
}

/// This test will start to fail once arrow merges dictionaries during concatenation.
#[test]
fn concat_then_normalize_deduplicates_dictionary_values_and_remaps_keys() {
    let schema = Arc::new(Schema::new(vec![Field::new(
        "symbol",
        DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8)),
        false,
    )]));

    let batch_0 = dictionary_batch(
        schema.clone(),
        vec!["alpha", "beta", "gamma"],
        vec![0, 1, 2, 0],
    );
    let batch_1 = dictionary_batch(
        schema.clone(),
        vec!["gamma", "alpha", "beta"],
        vec![2, 1, 0, 2],
    );

    let raw_concatenated = concat_batches(&schema, &[batch_0, batch_1]).expect("concat batches");
    let raw_column = raw_concatenated.column(0).as_dictionary::<Int32Type>();
    // Ideally this would be 3, because both batches contain the same three values;
    // today the dictionaries are concatenated, which gives 6.
    assert_eq!(raw_column.values().len(), 6);
}

Expected behavior

The dictionary should only contain unique entries.
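As an illustration of the merge I have in mind, here is a minimal std-only sketch, not arrow's implementation. The helper `dedup_dictionary` is hypothetical; it works on plain slices, ignores nulls, and assumes every key is a valid index into `values`.

```rust
use std::collections::HashMap;

/// Deduplicate dictionary values and remap the keys to the new positions.
/// Hypothetical helper for illustration; not part of arrow.
fn dedup_dictionary(values: &[&str], keys: &[i32]) -> (Vec<String>, Vec<i32>) {
    let mut unique: Vec<String> = Vec::new();
    // Maps each distinct value to its index in `unique`.
    let mut position: HashMap<&str, i32> = HashMap::new();
    // remap[old_index] is the index of the same value in `unique`.
    let mut remap: Vec<i32> = Vec::with_capacity(values.len());

    for &v in values {
        let next = unique.len() as i32;
        let idx = *position.entry(v).or_insert_with(|| {
            // First time this value is seen: append it to the merged dictionary.
            unique.push(v.to_string());
            next
        });
        remap.push(idx);
    }

    // Rewrite every key so it points into the deduplicated dictionary.
    let new_keys = keys.iter().map(|&k| remap[k as usize]).collect();
    (unique, new_keys)
}
```

Applied to the concatenated layout assumed for the reproduction (values `["alpha", "beta", "gamma", "gamma", "alpha", "beta"]`, keys `[0, 1, 2, 0, 5, 4, 3, 5]`), this yields the dictionary `["alpha", "beta", "gamma"]` and keys `[0, 1, 2, 0, 1, 0, 2, 1]`, which decode to the same rows as before.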

Additional context

No response

Labels: question (Further information is requested)