Hash functions return 0 for rows when input column contains NULLs and a selection vector is active #5

@calebgregory

Description

@calebgregory

Hash functions return incorrect values (0) for a non-deterministic subset of rows when input column contains NULLs

Summary

When a column contains NULL values and a filter excludes those NULLs before hashing, hash functions in the hashfuncs extension return incorrect values (typically 0) for a non-deterministic subset of the remaining rows. This appears to be the same selection vector bug reported in #2 and #3, which was not fully fixed.

Environment

Reproduced on macOS (arm64) with:

  • DuckDB 1.3.2 (Ossivalis)
  • DuckDB 1.5.0
  • DuckDB 1.5.1 (Variegata)

Installed via: INSTALL hashfuncs FROM community

Reproduction

INSTALL hashfuncs FROM community;
LOAD hashfuncs;

-- Create a table with NULLs in the column to be hashed
CREATE TABLE t AS
SELECT
    i,
    CASE WHEN i % 2 = 0 THEN 'value_' || i::VARCHAR ELSE NULL END AS val
FROM range(200000) t(i);

-- Path A: hash ALL rows (including NULLs), then filter → CORRECT
CREATE TABLE path_a AS SELECT i, val, xxh64(val) AS h FROM t;
SELECT count(*) FROM path_a WHERE val IS NOT NULL AND h = 0;
-- Returns: 0 (correct)

-- Path B: filter out NULLs, then hash → INCORRECT
CREATE TABLE path_b AS SELECT i, val, xxh64(val) AS h FROM t WHERE val IS NOT NULL;
SELECT count(*) FROM path_b WHERE h = 0;
-- Returns a non-deterministic number of incorrect hash=0 rows.
-- Observed values across runs: 7712, 30758, 50000 (out of 100000 non-null rows).
-- The number and identity of affected rows varies between runs.

Verifying that the hash values are genuinely wrong:

-- Standalone hash of a value that got hash=0 in path_b
SELECT xxh64('value_2');
-- Returns: 11283197125054826412 (correct, non-zero)

-- But when computed via the filtered table, it was 0
SELECT h FROM path_b WHERE val = 'value_2';
-- Returns: 0

Affected functions

All hash functions in the extension are affected. Tested with both seeded and unseeded variants:

Function   Unseeded affected?   Seeded affected?
xxh64      Yes                  Yes
xxh32      Yes                  N/A
xxh3_64    Yes                  Yes
xxh3_128   Yes                  Yes

Root cause

This appears to be the same selection vector bug as #2 / #3. When DuckDB applies a filter (e.g., WHERE val IS NOT NULL), it produces a selection vector that maps logical row indices to physical array positions. The hash function implementation reads input data using raw loop indices instead of going through the selection vector, so it reads the wrong memory — producing 0 (uninitialized/zero-initialized memory) instead of the correct hash.

The fix in commit 30c8cd0 addressed this for the seed parameter's selection vector access, but the same issue exists for the input data access path. When a filter is active, the input values are accessed at the wrong offsets.
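To make the failure mode concrete, here is a minimal standalone sketch of the indexing mistake described above. This is not the extension's actual code; the types and the toy hash are hypothetical stand-ins that model how a DuckDB-style selection vector remaps logical row i to a physical slot, and why reading input with the raw loop index picks up zeroed slots belonging to filtered-out NULL rows:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a selection vector: logical row i lives at
// physical position idx[i] in the underlying data array.
struct SelVector {
    std::vector<uint32_t> idx;
    uint32_t get(uint32_t i) const { return idx[i]; }
};

// Toy hash for illustration only: zero for an empty (zeroed) slot,
// non-zero otherwise.
uint64_t toy_hash(const std::string &s) {
    return s.empty() ? 0 : 1469598103934665603ULL ^ s.size();
}

// BUG: reads input with the raw loop index, ignoring the selection
// vector -- logical row i may point at a filtered-out NULL slot.
std::vector<uint64_t> hash_buggy(const std::vector<std::string> &data,
                                 const SelVector &sel) {
    std::vector<uint64_t> out(sel.idx.size());
    for (uint32_t i = 0; i < sel.idx.size(); i++) {
        out[i] = toy_hash(data[i]);
    }
    return out;
}

// Correct: every input read goes through the selection vector.
std::vector<uint64_t> hash_fixed(const std::vector<std::string> &data,
                                 const SelVector &sel) {
    std::vector<uint64_t> out(sel.idx.size());
    for (uint32_t i = 0; i < sel.idx.size(); i++) {
        out[i] = toy_hash(data[sel.get(i)]);
    }
    return out;
}
```

With physical data {"", "value_0", "", "value_2"} (empty slots standing in for NULL rows) and a selection vector {1, 3} selecting the non-null rows, the buggy loop hashes data[0] for logical row 0 and returns 0, while the fixed loop hashes data[1] and data[3] and returns non-zero values for both rows.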

Impact

This bug silently corrupts hash values without any error or warning. In our case, it caused ~1,000 out of 2.2M patient records in a data pipeline to receive hash_of_value = 0 (the column had ~0.15% NULLs from invalid input tokens, and the CASE WHEN ... ELSE NULL derivation created the selection vector that triggers the bug). Because the corruption is silent and the affected rows vary with data layout, it was difficult to diagnose.

Workaround

Hash the column before filtering out NULLs, so no selection vector is active during the hash computation:

-- Instead of:
SELECT xxh64(val) FROM t WHERE val IS NOT NULL;

-- Do:
SELECT h FROM (SELECT val, xxh64(val) AS h FROM t) WHERE val IS NOT NULL;
