
Buffer count mismatched with metadata when encoding records with dictionary of dictionaries #10213

Description

@thorfour

Describe the bug

When the IPC writer encodes a record batch whose schema contains a Dict(Dict(...,...)) encoded column, the StreamReader cannot decode the result. It fails with a "Buffer count mismatched with metadata" error.

running 1 test
test tests::dict_of_dict_ipc_error ... FAILED

failures:

---- tests::dict_of_dict_ipc_error stdout ----
Error: IpcError("Buffer count mismatched with metadata")


failures:
    tests::dict_of_dict_ipc_error

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

To Reproduce

use std::sync::Arc;
use arrow_array::RecordBatch;
use arrow_array::types::UInt32Type;
use arrow_array::{ArrayRef, DictionaryArray, StringArray, UInt32Array};
use arrow_schema::{ArrowError, DataType, Field, Schema};

fn dict_of_dict() -> ArrayRef {
    let values = Arc::new(StringArray::from(vec!["a", "b", "c"])) as ArrayRef;
    let inner = Arc::new(DictionaryArray::<UInt32Type>::new(
        UInt32Array::from(vec![0u32, 1, 2, 0]),
        values,
    )) as ArrayRef;
    Arc::new(DictionaryArray::<UInt32Type>::new(
        UInt32Array::from(vec![0u32, 1, 2, 3]),
        inner,
    )) as ArrayRef
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn dict_of_dict_ipc_error() -> std::result::Result<(), ArrowError> {
        use arrow_ipc::reader::StreamReader;
        use arrow_ipc::writer::StreamWriter;

        // Write the batch to an in-memory IPC stream, then read the first batch back.
        fn ipc_roundtrip(batch: &RecordBatch) -> std::result::Result<RecordBatch, ArrowError> {
            let mut buf = Vec::new();
            {
                let mut writer = StreamWriter::try_new(&mut buf, &batch.schema())?;
                writer.write(batch)?;
                writer.finish()?;
            }
            StreamReader::try_new(buf.as_slice(), None)?
                .next()
                .expect("one batch")
        }

        let single = DataType::Dictionary(Box::new(DataType::UInt32), Box::new(DataType::Utf8));
        let dod = DataType::Dictionary(Box::new(DataType::UInt32), Box::new(single.clone()));
        let original = dict_of_dict();
        let declared = Arc::new(Schema::new(vec![Field::new("f", dod, true)]));
        let batch =
            RecordBatch::try_new(Arc::clone(&declared), vec![Arc::clone(&original)]).unwrap();

        // Reproduces the bug: dict-of-dict cannot round-trip through Arrow IPC.
        ipc_roundtrip(&batch)?;

        Ok(())
    }
}

Expected behavior

I would expect the record to be encoded and decoded correctly, or at the very least an error stating that this schema is not supported by IPC.

Additional context

No response
