Three Windows-only bugs in v2.3.1 (CRLF Databricks detection, UTF-8 gitignore, stale FastMCP test API)

## Summary
While investigating v2.3.1 test output on Windows, I found three distinct bugs that all reproduce on `origin/main` with no modifications. They are not flakiness — each is a deterministic cross-platform bug with a clear root cause.

All three are currently masked by the 8 failing tests on Windows, which are easy to dismiss as "platform noise" but actually reveal real product bugs.

---

## Bug 1 — `get_data_dir()` writes non-UTF-8 `.gitignore` on Windows

**File:** `code_review_graph/incremental.py:128-138` (in `get_data_dir()`)

```python
inner_gitignore.write_text(
    "# Auto-generated by code-review-graph — do not commit database files.\n"
    "# The graph.db contains absolute paths and code structure metadata.\n"
    "*\n"
)
```

`write_text()` is called **without an `encoding=` argument**. The string literal contains an em-dash (`—`, U+2014). On Windows, `Path.write_text` uses the system default codepage (cp1252 in most locales), which encodes U+2014 as byte `0x97`. When anything later reads this file as UTF-8 — including `get_data_dir`'s own test `test_default_uses_repo_subdir` at `tests/test_incremental.py:194` — it raises:

```
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position 38: invalid start byte
```

**Reproducible on:** Windows 11 / Python 3.10.10 / v2.3.1, clean checkout of `origin/main`. No modifications required.

**Impact:** Any Windows user running `code-review-graph build` generates a `.gitignore` that cannot be parsed by standard UTF-8 tooling (including the tool's own tests and any editor set to UTF-8 strict mode). Unix systems are unaffected because they default to UTF-8.

**Fix:** Add `encoding="utf-8"` to the `write_text()` call. (Same fix that `get_data_dir`'s caller `_ensure_repo_gitignore` at line 170 already applies — it uses `encoding="utf-8"`, but this sibling function was missed.)

---

## Bug 2 — Databricks notebook auto-detection fails on CRLF line endings

**File:** `code_review_graph/parser.py:391-394` (in `parse_bytes()`)

```python
if language == "python" and source.startswith(
    b"# Databricks notebook source\n",
):
    return self._parse_databricks_py_notebook(path, source)
```

The check hard-codes `\n`. On Windows, git checkouts default to `core.autocrlf=true`, which converts the Databricks fixture's line endings to `\r\n` on checkout. The file on disk starts with `b"# Databricks notebook source\r\n"`, so `source.startswith(b"# Databricks notebook source\n")` returns `False`, the file is parsed as regular Python, and **all Databricks-specific handling is bypassed**:

- `notebook_format` metadata is never set on the File node
- `# MAGIC %sql` cells are not parsed as SQL (no `IMPORTS_FROM` edges for table references)
- `# MAGIC %md` cells are not skipped
- Per-cell `cell_index` metadata is missing from function nodes

**Reproducible on:** Windows 11 / v2.3.1, clean checkout, confirmed with a single `python -c` statement reading `tests/fixtures/sample_databricks_export.py` — the file starts with `b'# Databricks notebook source\r\n'`.

**Impact:** Every one of the 4 `TestDatabricksPyNotebook` tests fails on Windows:
- `test_detects_databricks_header` — no `notebook_format`
- `test_extracts_sql_tables` — missing `bronze.events`, `silver.users`, etc.
- `test_skips_magic_md_cells` — extra function from MD cells that should be skipped
- `test_cell_index_tracking` — `cell_index` metadata missing

**Fix:** Parse the first line robustly, stripping any trailing `\r`:

```python
first_line_end = source.find(b"\n")
first_line = source[:first_line_end].rstrip(b"\r") if first_line_end != -1 else source
if language == "python" and first_line == b"# Databricks notebook source":
    return self._parse_databricks_py_notebook(path, source)
```

This matches both `\n` and `\r\n` endings.

---

## Bug 3 — Stale FastMCP API in async regression guard test

**File:** `tests/test_main.py:68-71` (in `TestLongRunningToolsAreAsync::test_heavy_tools_are_coroutines`)

```python
async def test_heavy_tools_are_coroutines(self):
    tools = await crg_main.mcp.get_tools()
```

`FastMCP` does not expose a `get_tools()` method. The available API in `fastmcp>=2.14.0` (the version pinned in `pyproject.toml`) is `list_tools()`. The test dies at `AttributeError: 'FastMCP' object has no attribute 'get_tools'. Did you mean: 'list_tools'?`.

This is not a cosmetic issue — this test is the **regression guard** for #46/#136 (v2.3.1 Windows async fix). The PR description for #231 explicitly says:

> There are regression tests (`test_heavy_tools_are_coroutines` + `test_heavy_tool_source_uses_to_thread`) that will fail at CI collection time if anyone converts one of the 5 tools back to sync in a future refactor, so we shouldn't hit this class of bug again.

But the test has been failing at **runtime** on every platform since it was written, which means the regression guard is **not actually guarding anything**. A future refactor that converts `build_or_update_graph_tool` back to sync would not be caught by CI because this test is already red.

**Reproducible on:** every platform with the pinned `fastmcp>=2.14.0`. Not Windows-specific.

**Fix:** Replace `get_tools()` with `list_tools()`. The rest of the test logic (verifying `asyncio.iscoroutinefunction(tool.fn)` for the heavy 5 tools) is correct.

---

## Follow-up PR
A single PR will fix all three bugs in one focused commit. Each fix is small, each has a direct regression test, and each is verifiable on unchanged `origin/main` before and after.

Note: there are 2 additional failing tests on Windows (`test_returns_none_without_git`, `test_falls_back_to_start` in `TestFindRepoRoot` / `TestFindProjectRoot`) that are true **test environmental flakiness** — they assume no ancestor of `tmp_path` has a `.git` directory, which is false on any Windows user whose home directory contains a git repo (e.g. dotfiles). Those are intentionally **not** addressed here because the fix requires either a product API change (`stop_at` parameter) or invasive monkeypatching — worth a separate discussion.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Three Windows-only bugs in v2.3.1 (CRLF Databricks detection, UTF-8 gitignore, stale FastMCP test API) #239

Summary

Bug 1 — `get_data_dir()` writes non-UTF-8 `.gitignore` on Windows

Bug 2 — Databricks notebook auto-detection fails on CRLF line endings

Bug 3 — Stale FastMCP API in async regression guard test

Follow-up PR

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Three Windows-only bugs in v2.3.1 (CRLF Databricks detection, UTF-8 gitignore, stale FastMCP test API) #239

Description

Summary

Bug 1 — get_data_dir() writes non-UTF-8 .gitignore on Windows

Bug 2 — Databricks notebook auto-detection fails on CRLF line endings

Bug 3 — Stale FastMCP API in async regression guard test

Follow-up PR

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions

Bug 1 — `get_data_dir()` writes non-UTF-8 `.gitignore` on Windows