fix(pure-magic): relax try_csv to match libmagic semantics#36
Conversation
The hardcoded CSV detector required exactly 10 records before reporting text/csv, so typical small CSVs (configs, fixtures, short exports) were silently classified as plain ASCII text. Upstream libmagic's is_csv.c treats CSV_LINES as an early-exit cap, not a minimum, and accepts any input with `tf > 1 && nl >= 2` — file(1) itself loosened this in 2023 (PR/463 "CSV can be also only 2 lines"). Drop the 10-record floor: read records until EOF, require >=2 records with consistent column count. Disable csv::Reader's header inference since libmagic counts newlines (not data rows), so a 2-line "a,b\n1,2\n" must qualify. Add five regression tests covering: 2-row positive, 5-row positive, 12-row positive (the previously-passing case), single-field reject, ragged-columns reject. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
tnaroska
left a comment
There was a problem hiding this comment.
A few annotations on the change locations.
| let buf = haystack.read_range(0..FILE_BYTES_MAX as u64)?; | ||
| let mut reader = csv::Reader::from_reader(io::Cursor::new(buf)); | ||
| let mut reader = csv::ReaderBuilder::new() | ||
| .has_headers(false) |
There was a problem hiding this comment.
has_headers(false) is the load-bearing bit for two-line CSVs. With the default true, csv::Reader consumes a,b as a header and records() then yields only one data row from a,b\n1,2\n — n stays at 1 and the n < 2 reject fires. libmagic counts newlines, not data rows, so we need every line in the count.
| return Ok(false); | ||
| } | ||
| } else { | ||
| for i in records { |
There was a problem hiding this comment.
Dropped .take(9) so we read until EOF / FILE_BYTES_MAX. Loop exits early on the first ragged row or parse error, so the cost is bounded by the buffer the haystack already provides.
There was a problem hiding this comment.
If you remove .take up to 7MB (i.e. FILE_BYTES_MAX) of CSV can be parsed for nothing, as we only care about the 2 first lines. This needs to be adjusted not to impact perfs on edge cases.
|
|
||
| // we need at least 10 lines | ||
| if n != 10 { | ||
| if n < 2 { |
There was a problem hiding this comment.
n < 2 matches upstream is_csv.c: return tf > 1 && nl >= 2. The first.len() <= 1 check just above is the tf > 1 half (≥2 fields per row), and this is the nl >= 2 half (≥2 records).
There was a problem hiding this comment.
I don't see the code equivalent to tf > 1
There was a problem hiding this comment.
Minimum number of fields requirement is handled by the check above
// very not likely a CSV otherwise all programming
// languages having ; line terminator would be
// considered as CSV
if first.len() <= 1 {
return Ok(false);
}
| } | ||
|
|
||
| // we already parsed first line | ||
| let mut n = 1; |
There was a problem hiding this comment.
This must be adjusted, because you don't parse the header anymore.
There was a problem hiding this comment.
libmagic doesn't distinguish between header/record lines. It only checks at least 2 rows in the file with matching number of fields. That should be matched in the code here now.
has_headers(false)will return the header line as first item fromrecords- check for at least 2 fields happens
- loop over at most 10 lines total to check field count is consistent
- accept the file as
csvif at least 2 matching lines were found.
Whether we need to parse up to 10 lines if 2 lines is sufficient for a positive test is debatable. Parsing a couple more lines could help detecting negative cases where first two lines look like matching, but subsequent lines disqualify the file due to field number mismatch.
|
|
||
| // we need at least 10 lines | ||
| if n != 10 { | ||
| if n < 2 { |
There was a problem hiding this comment.
I don't see the code equivalent to tf > 1
| return Ok(false); | ||
| } | ||
| } else { | ||
| for i in records { |
There was a problem hiding this comment.
If you remove .take up to 7MB (i.e. FILE_BYTES_MAX) of CSV can be parsed for nothing, as we only care about the 2 first lines. This needs to be adjusted not to impact perfs on edge cases.
|
Hi @qjerome , thanks for the feedback. I made some adjustments in the PR and simplified the change. |
|
Hi @tnaroska, Great news, I am glad you gave this library a try. As the implementation is not the exact same, it is possible that some rules don't get evaluated in the good order and returning a different first match than C libmagic. Another possibility is that there is a small bug still living somewhere. But if you like to share one of your problematic Cheers, |
tl;dr: this is to better match the CSV recognition in libmagic. Specific problem I observed with CSV files is that pure-magic didn't recognize small csv files (few lines) and reported
text/plaininstead.Apparently pure-magic required at least 10 lines of csv to identify the file type, whereas the check in libmagic matches csv with a minimum of two lines (1 header, and 1 data).
Summary
try_csv(pure-magic/src/lib.rs) required exactly 10 records before reportingtext/csv, so typical small CSVs (configs, fixtures, short exports) were silently classified as plain text. This PR relaxes the threshold to match upstream libmagic and adds regression tests.Reproducer (before this PR)
After this PR both files report
mime:text/csv.Why the old threshold was wrong
Upstream
file/file/src/is_csv.c::csv_parseaccepts any input wheretf > 1 && nl >= 2(≥2 lines with consistent column count, ≥2 fields per row).CSV_LINES = 10is only an early-exit cap in the upstream parser, not a minimum. The 10-record floor intry_csvproduced false negatives that diverge fromfile(1).Upstream history confirms the policy: in 2023, file(1) explicitly loosened its detector with PR/463 (
b4e621d1, "CSV can be also only 2 lines") — so even the conservative reference treats 10 as too strict.What changed
n != 10→n < 2. Reads records until EOF and accepts any input with ≥2 records and consistent column count, mirroringtf > 1 && nl >= 2inis_csv.c.csv::Reader::from_reader(...)(which treats the first line as a header by default) tocsv::ReaderBuilder::new().has_headers(false).from_reader(...). libmagic counts newlines, not data rows, so a 2-linea,b\n1,2\nmust qualify; without this change,csv::Readerconsumeda,bas a header and only saw one data record, leavingn = 1.#[cfg(test)] mod testsblock, using a tinycsv_magic()helper that reusesfirst_magicwith a never-matching rule so only the hardcoded CSV detector is exercised:test_csv_two_rows_two_cols— minimal 2-row CSV (the previously-failing case).test_csv_short_consistent_rows— 5-row CSV.test_csv_many_rows_still_detected— 12-row CSV (preserves the previously-passing case).test_csv_single_field_rejected—tf > 1boundary; multi-line single-column text must NOT be detected as CSV.test_csv_ragged_columns_rejected— column-count consistency check.Non-goals
file(1)only auto-detects comma; this PR keeps that behavior.Test plan
cargo test -p pure-magic— 139 passed (134 existing + 5 new), 9 doctests pass.cargo clippy -p pure-magic --no-deps— clean.wizaparity check on 2-row, 5-row, 12-row, and ragged CSVs against systemfile(1).🤖 Generated with Claude Code