Hash functions return 0 for rows when input column contains NULLs and a selection vector is active #5
Description
Hash functions return incorrect values (0) for a non-deterministic subset of rows when the input column contains NULLs
Summary
When a column contains NULL values and a filter excludes those NULLs before hashing, hash functions in the hashfuncs extension return incorrect values (typically 0) for a non-deterministic subset of the remaining rows. This appears to be the same selection vector bug reported in #2 and #3, which was not fully fixed.
Environment
Reproduced on macOS (arm64) with:
- DuckDB 1.3.2 (Ossivalis)
- DuckDB 1.5.0
- DuckDB 1.5.1 (Variegata)
Installed via: INSTALL hashfuncs FROM community
Reproduction
```sql
INSTALL hashfuncs FROM community;
LOAD hashfuncs;

-- Create a table with NULLs in the column to be hashed
CREATE TABLE t AS
SELECT
    i,
    CASE WHEN i % 2 = 0 THEN 'value_' || i::VARCHAR ELSE NULL END AS val
FROM range(200000) t(i);
```
```sql
-- Path A: hash ALL rows (including NULLs), then filter → CORRECT
CREATE TABLE path_a AS SELECT i, val, xxh64(val) AS h FROM t;
SELECT count(*) FROM path_a WHERE val IS NOT NULL AND h = 0;
-- Returns: 0 (correct)

-- Path B: filter out NULLs, then hash → INCORRECT
CREATE TABLE path_b AS SELECT i, val, xxh64(val) AS h FROM t WHERE val IS NOT NULL;
SELECT count(*) FROM path_b WHERE h = 0;
-- Returns a non-deterministic number of incorrect hash=0 rows.
-- Observed values across runs: 7712, 30758, 50000 (out of 100000 non-null rows).
-- The number and identity of affected rows vary between runs.
```
Verifying that the hash values are genuinely wrong:
```sql
-- Standalone hash of a value that got hash=0 in path_b
SELECT xxh64('value_2');
-- Returns: 11283197125054826412 (correct, non-zero)

-- But when computed via the filtered table, it was 0
SELECT h FROM path_b WHERE val = 'value_2';
-- Returns: 0
```
Affected functions
All hash functions in the extension are affected. Tested with both seeded and unseeded variants:
| Function | Unseeded affected? | Seeded affected? |
|---|---|---|
| xxh64 | Yes | Yes |
| xxh32 | Yes | N/A |
| xxh3_64 | Yes | Yes |
| xxh3_128 | Yes | Yes |
Root cause
This appears to be the same selection vector bug as #2 / #3. When DuckDB applies a filter (e.g., WHERE val IS NOT NULL), it produces a selection vector that maps logical row indices to physical array positions. The hash function implementation reads input data using raw loop indices instead of going through the selection vector, so it reads the wrong memory — producing 0 (uninitialized/zero-initialized memory) instead of the correct hash.
The fix in commit 30c8cd0 addressed this for the seed parameter's selection vector access, but the same issue exists for the input data access path. When a filter is active, the input values are accessed at the wrong offsets.
Impact
This bug silently corrupts hash values without any error or warning. In our case, it caused ~1,000 out of 2.2M patient records in a data pipeline to receive hash_of_value = 0 (the column had ~0.15% NULLs from invalid input tokens, and the CASE WHEN ... ELSE NULL derivation created the selection vector that triggers the bug). Because the corruption is silent and the affected rows vary with data layout, it was difficult to diagnose.
Workaround
Hash the column before filtering out NULLs, so no selection vector is active during the hash computation:
```sql
-- Instead of:
SELECT xxh64(val) FROM t WHERE val IS NOT NULL;

-- Do:
SELECT h FROM (SELECT val, xxh64(val) AS h FROM t) WHERE val IS NOT NULL;
```