When we concatenate two or more batches in which some columns are dictionary-encoded, the dictionaries are concatenated instead of being merged.
The resulting dictionary may end up with duplicates. For example, if both batches contain the string "alpha", the concatenated batch will have two dictionary entries for that string. The result is strictly correct (every index still points to its original value). However, any library that operates directly on the indices, on the assumption that dictionary values are unique, will obtain a wrong result (e.g., an aggregation grouped by index).
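To make the failure mode concrete, here is a minimal sketch that uses plain Rust vectors rather than Arrow types. The `count_by_key` helper is hypothetical; it stands in for any consumer that aggregates on dictionary keys and assumes that distinct keys refer to distinct values.

```rust
use std::collections::BTreeMap;

/// Hypothetical consumer: counts rows per dictionary key, assuming
/// that distinct keys always refer to distinct values.
fn count_by_key(keys: &[i32]) -> BTreeMap<i32, usize> {
    let mut counts = BTreeMap::new();
    for &key in keys {
        *counts.entry(key).or_insert(0) += 1;
    }
    counts
}

fn main() {
    // Dictionary as left by a plain concatenation: "alpha" appears twice.
    let values = ["alpha", "beta", "alpha"];
    // Three rows, logically: alpha, beta, alpha.
    let keys = [0, 1, 2];

    // Grouping by key yields one group per dictionary slot, so the two
    // "alpha" rows land in separate groups instead of being counted together.
    for (key, n) in &count_by_key(&keys) {
        println!("{} (key {}): {}", values[*key as usize], key, n);
    }
}
```

Grouping by the decoded string instead of the key would avoid the problem, but it gives up the main benefit of operating on a dictionary-encoded column.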
More directly, pandas will reject the batch because it validates that categories are unique:
lib/python3.11/site-packages/pandas/core/dtypes/dtypes.py", line 570, in validate_categories
raise ValueError("Categorical categories must be unique")
Pandas does have a function (union_categoricals) to combine categoricals that have different dictionaries, but it is not intended to deduplicate the dictionary of a single column.
//! Concatenation tests for dictionary arrays.

use std::sync::Arc;

use arrow::{
    array::{Array, ArrayRef, AsArray, DictionaryArray, Int32Array, RecordBatch, StringArray},
    compute::concat_batches,
    datatypes::{DataType, Field, Int32Type, Schema},
};

/// Build a dictionary array with explicit dictionary value order and key values.
fn dictionary_array(dictionary_values: Vec<&str>, keys: Vec<i32>) -> ArrayRef {
    Arc::new(
        DictionaryArray::<Int32Type>::try_new(
            Int32Array::from(keys),
            Arc::new(StringArray::from(dictionary_values)),
        )
        .expect("dictionary array"),
    )
}

/// Build a one-column record batch containing a dictionary array.
fn dictionary_batch(
    schema: Arc<Schema>,
    dictionary_values: Vec<&str>,
    keys: Vec<i32>,
) -> RecordBatch {
    RecordBatch::try_new(schema, vec![dictionary_array(dictionary_values, keys)])
        .expect("record batch")
}

/// this test will start to fail when arrow dictionary concat is supported
#[test]
fn concat_then_normalize_deduplicates_dictionary_values_and_remaps_keys() {
    let schema = Arc::new(Schema::new(vec![Field::new(
        "symbol",
        DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8)),
        false,
    )]));

    let batch_0 = dictionary_batch(
        schema.clone(),
        vec!["alpha", "beta", "gamma"],
        vec![0, 1, 2, 0],
    );
    let batch_1 = dictionary_batch(
        schema.clone(),
        vec!["gamma", "alpha", "beta"],
        vec![2, 1, 0, 2],
    );

    let raw_concatenated = concat_batches(&schema, &[batch_0, batch_1]).expect("concat batches");
    let raw_column = raw_concatenated.column(0).as_dictionary::<Int32Type>();

    // this should be 3 because both batches had the same values
    assert_eq!(raw_column.values().len(), 6);
}
The dictionary should only contain unique entries.
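As an illustration of the expected outcome, the sketch below deduplicates a dictionary and remaps its keys using plain Rust vectors. It is a simplified model and not Arrow's API: `normalize_dictionary` is a hypothetical helper, null keys are not handled, and the input assumes that concatenation appended the second dictionary after the first and offset the second batch's keys by 3.

```rust
use std::collections::HashMap;

/// Hypothetical helper: deduplicate dictionary values and remap keys.
/// Returns the unique values in first-seen order and the remapped keys.
fn normalize_dictionary(values: &[&str], keys: &[i32]) -> (Vec<String>, Vec<i32>) {
    let mut index_of: HashMap<&str, i32> = HashMap::new();
    let mut unique: Vec<String> = Vec::new();
    // old_to_new[i] is the new key for old dictionary slot i.
    let mut old_to_new: Vec<i32> = Vec::with_capacity(values.len());

    for &value in values {
        let next = unique.len() as i32;
        let new_key = *index_of.entry(value).or_insert_with(|| {
            unique.push(value.to_string());
            next
        });
        old_to_new.push(new_key);
    }

    let remapped = keys.iter().map(|&k| old_to_new[k as usize]).collect();
    (unique, remapped)
}

fn main() {
    // Assumed layout after concatenating the two batches from the reproduction.
    let values = ["alpha", "beta", "gamma", "gamma", "alpha", "beta"];
    let keys = [0, 1, 2, 0, 5, 4, 3, 5];

    let (unique, remapped) = normalize_dictionary(&values, &keys);
    println!("{:?}", unique);
    println!("{:?}", remapped);
}
```

Every row still decodes to the same string after remapping; only the dictionary shrinks from six entries to three.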