Hash functions return 0 for rows when input column contains NULLs and a selection vector is active #5
Description
Hash functions return incorrect values (0) for a non-deterministic subset of rows when the input column contains NULLs
Summary
When a column contains NULL values and a filter excludes those NULLs before hashing, hash functions in the hashfuncs extension return incorrect values (typically 0) for a non-deterministic subset of the remaining rows. This appears to be the same selection vector bug reported in #2 and #3, which was not fully fixed.
Environment
Reproduced on macOS (arm64) with:
- DuckDB 1.3.2 (Ossivalis)
- DuckDB 1.5.0
- DuckDB 1.5.1 (Variegata)
Installed via: INSTALL hashfuncs FROM community
Reproduction
```sql
INSTALL hashfuncs FROM community;
LOAD hashfuncs;

-- Create a table with NULLs in the column to be hashed
CREATE TABLE t AS
SELECT
    i,
    CASE WHEN i % 2 = 0 THEN 'value_' || i::VARCHAR ELSE NULL END AS val
FROM range(200000) t(i);
```
```sql
-- Path A: hash ALL rows (including NULLs), then filter → CORRECT
CREATE TABLE path_a AS SELECT i, val, xxh64(val) AS h FROM t;
SELECT count(*) FROM path_a WHERE val IS NOT NULL AND h = 0;
-- Returns: 0 (correct)

-- Path B: filter out NULLs, then hash → INCORRECT
CREATE TABLE path_b AS SELECT i, val, xxh64(val) AS h FROM t WHERE val IS NOT NULL;
SELECT count(*) FROM path_b WHERE h = 0;
-- Returns a non-deterministic number of incorrect hash=0 rows.
-- Observed values across runs: 7712, 30758, 50000 (out of 100000 non-null rows).
-- The number and identity of affected rows vary between runs.
```
Verifying that the hash values are genuinely wrong:
```sql
-- Standalone hash of a value that got hash=0 in path_b
SELECT xxh64('value_2');
-- Returns: 11283197125054826412 (correct, non-zero)

-- But when computed via the filtered table, it was 0
SELECT h FROM path_b WHERE val = 'value_2';
-- Returns: 0
```
Affected functions
All hash functions in the extension are affected. Tested with both seeded and unseeded variants:
| Function | Unseeded affected? | Seeded affected? |
|---|---|---|
| xxh64 | Yes | Yes |
| xxh32 | Yes | N/A |
| xxh3_64 | Yes | Yes |
| xxh3_128 | Yes | Yes |
Root cause
This appears to be the same selection vector bug as #2 / #3. When DuckDB applies a filter (e.g., WHERE val IS NOT NULL), it produces a selection vector that maps logical row indices to physical array positions. The hash function implementation reads input data using raw loop indices instead of going through the selection vector, so it reads the wrong memory — producing 0 (uninitialized/zero-initialized memory) instead of the correct hash.
The fix in commit 30c8cd0 addressed this for the seed parameter's selection vector access, but the same issue exists for the input data access path. When a filter is active, the input values are accessed at the wrong offsets.
Impact
This bug silently corrupts hash values without any error or warning. In our case, it caused ~1,000 out of 2.2M patient records in a data pipeline to receive hash_of_value = 0 (the column had ~0.15% NULLs from invalid input tokens, and the CASE WHEN ... ELSE NULL derivation created the selection vector that triggers the bug). Because the corruption is silent and the affected rows vary with data layout, it was difficult to diagnose.
Workaround
Hash the column before filtering out NULLs, so no selection vector is active during the hash computation:
```sql
-- Instead of:
SELECT xxh64(val) FROM t WHERE val IS NOT NULL;

-- Do:
SELECT h FROM (SELECT val, xxh64(val) AS h FROM t) WHERE val IS NOT NULL;
```