CopyFrom: auto-detect binary/text format with text fallback#2521

Draft
jackc wants to merge 1 commit into master from copy-from-text-format

@jackc
Owner

@jackc jackc commented Mar 20, 2026

CopyFrom previously hardcoded binary format and used tryScanStringCopyValueThenEncode as a workaround when binary encoding failed. This was fragile and couldn't handle types that only support text format (e.g. jsonpath, aclitem).

Now CopyFrom peeks at the first row and checks both codec-level format support and value-level encode plan availability. If any column cannot use binary, it transparently falls back to text format COPY, letting PostgreSQL handle the parsing natively.
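To make the "peek at the first row without losing it" idea concrete, here is a minimal sketch, not the code in this PR. The `rowSource` interface mirrors the shape of pgx's `CopyFromSource`; `sliceSource`, `peekedSource`, and `peek` are hypothetical names invented for illustration.

```go
package main

import "fmt"

// rowSource mirrors the shape of pgx's CopyFromSource interface.
type rowSource interface {
	Next() bool
	Values() ([]any, error)
	Err() error
}

// sliceSource is a trivial in-memory rowSource used only for this demo.
type sliceSource struct {
	rows [][]any
	idx  int
}

func (s *sliceSource) Next() bool             { s.idx++; return s.idx <= len(s.rows) }
func (s *sliceSource) Values() ([]any, error) { return s.rows[s.idx-1], nil }
func (s *sliceSource) Err() error             { return nil }

// peekedSource (hypothetical) replays a buffered first row before
// deferring to the wrapped source.
type peekedSource struct {
	src     rowSource
	first   []any
	pending bool // buffered row has not been handed out yet
	onFirst bool // the current row is the buffered one
}

// peek reads the first row, if any, so the caller can inspect it to
// pick a format. The row is buffered and replayed by Next/Values.
func peek(src rowSource) (*peekedSource, []any, error) {
	p := &peekedSource{src: src}
	if !src.Next() {
		return p, nil, src.Err() // empty source: nothing to buffer
	}
	row, err := src.Values()
	if err != nil {
		return p, nil, err
	}
	p.first, p.pending = row, true
	return p, row, nil
}

func (p *peekedSource) Next() bool {
	if p.pending {
		p.pending, p.onFirst = false, true
		return true
	}
	p.onFirst = false
	return p.src.Next()
}

func (p *peekedSource) Values() ([]any, error) {
	if p.onFirst {
		return p.first, nil
	}
	return p.src.Values()
}

func (p *peekedSource) Err() error { return p.src.Err() }

func main() {
	p, first, _ := peek(&sliceSource{rows: [][]any{{"a"}, {"b"}}})
	fmt.Println("peeked:", first)
	for p.Next() {
		row, _ := p.Values()
		fmt.Println("row:", row)
	}
}
```

The wrapper hands out the buffered row exactly once, so the encoder downstream still sees every row in order regardless of which format was chosen.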

  • Remove tryScanStringCopyValueThenEncode
  • Add encodeCopyValueText with proper COPY text escaping
  • Add canUseBinaryFormat two-level detection
  • Add buildCopyBufText for text format row encoding
  • Buffer first row for format decision without data loss
  • Add tests for text fallback, special char escaping, NULLs, large datasets, all query exec modes, string-to-int conversion, and empty row sets
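For readers unfamiliar with COPY text escaping, the following is a rough sketch of what it involves, assuming the default tab delimiter and the default `\N` NULL marker described in the PostgreSQL COPY documentation. `copyTextEscaper` and `encodeTextRow` are hypothetical helpers, not the `encodeCopyValueText` or `buildCopyBufText` functions added in this PR.

```go
package main

import (
	"fmt"
	"strings"
)

// copyTextEscaper escapes the characters that are special in COPY text
// format with the default tab delimiter: backslash, tab, newline, and
// carriage return. strings.Replacer does not rescan its own output, so
// the backslashes it emits are not escaped a second time.
var copyTextEscaper = strings.NewReplacer(
	`\`, `\\`,
	"\t", `\t`,
	"\n", `\n`,
	"\r", `\r`,
)

// encodeTextRow (hypothetical) renders one row as a COPY text line.
// A nil field stands for SQL NULL and is written as \N.
func encodeTextRow(fields []*string) string {
	var sb strings.Builder
	for i, f := range fields {
		if i > 0 {
			sb.WriteByte('\t') // column delimiter
		}
		if f == nil {
			sb.WriteString(`\N`)
		} else {
			sb.WriteString(copyTextEscaper.Replace(*f))
		}
	}
	sb.WriteByte('\n') // row terminator
	return sb.String()
}

func main() {
	a, b := "tab\there", `back\slash`
	fmt.Printf("%q\n", encodeTextRow([]*string{&a, nil, &b}))
}
```

Because a literal backslash is always doubled, a string value that happens to be `\N` is written as `\\N` and cannot be confused with NULL.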

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@abrightwell
Contributor

I like this approach. Though, I'm curious about the value-level check. The idea behind the first-row peek makes sense. But it adds complexity and doesn't reliably catch potential caller type mismatches.

For instance:

CREATE TABLE t (a int);

Fails:

rows := [][]any{
	{int32(42)},
	{"42"},
}

Succeeds:

rows := [][]any{
	{"42"},
	{int32(42)},
}

Admittedly, this is probably an unlikely edge case, since I would expect the input rows to be uniform in format, but it seemed worth calling out.

Regardless, could the value types be checked without buffering the row itself?
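The order dependence can be reproduced with a toy model. This is only a sketch under loud assumptions: a single int column, a binary path that accepts nothing but `int32`, and a text path that accepts any value. `chooseFormat` and `encodeAll` are hypothetical names and do not reflect how pgx actually plans encoding.

```go
package main

import "fmt"

type format int

const (
	binaryFormat format = iota
	textFormat
)

// chooseFormat mimics a first-row peek: binary is chosen only if every
// value in the first row is an int32 (assumed binary-encodable here).
func chooseFormat(first []any) format {
	for _, v := range first {
		if _, ok := v.(int32); !ok {
			return textFormat
		}
	}
	return binaryFormat
}

// encodeAll picks the format from rows[0] only, then checks every row
// against it. In this toy model the text path never rejects a value.
func encodeAll(rows [][]any) (format, error) {
	if len(rows) == 0 {
		return binaryFormat, nil
	}
	f := chooseFormat(rows[0])
	for i, row := range rows {
		for _, v := range row {
			if _, ok := v.(int32); f == binaryFormat && !ok {
				return f, fmt.Errorf("row %d: cannot encode %T as binary int4", i, v)
			}
		}
	}
	return f, nil
}

func main() {
	for _, rows := range [][][]any{
		{{int32(42)}, {"42"}}, // first row selects binary, second row fails
		{{"42"}, {int32(42)}}, // first row selects text, both rows pass
	} {
		f, err := encodeAll(rows)
		fmt.Printf("text=%v err=%v\n", f == textFormat, err)
	}
}
```

The same two rows succeed or fail depending only on which one comes first, which is the mismatch described above.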

@jackc
Owner Author

jackc commented Mar 21, 2026

@abrightwell To be honest, I haven't looked closely at this yet. I was more curious to see if Claude could do it at all, and the results seemed plausible. But it would definitely need careful review. It would also need careful consideration of whether auto-fallback is desirable behavior. It could make performance less understandable.

@abrightwell
Contributor

Oh for sure, it definitely seems like Claude might be on to something with the approach. In fact, it got me thinking slightly differently.

> It would also need careful consideration of if auto-fallback is desirable behavior.

Yeah, this is where I've shifted my thinking. Perhaps instead of attempting to predict the format, allow for it to be explicitly set by the caller with a reasonable default? For instance, adding an optional parameter on CopyFrom for a CopyFromFormat, where CopyFromFormatText is the default (based on COPY command docs).

The binary path could do a simple column codec check, returning an error if any of the column types does not support it. Maybe something like: "column %s type %s does not support binary format, use CopyFromFormatText". It could fail similarly if the caller passes incompatible data.

> Could make performance less understandable.

Agreed. I think requiring the caller to be aware of and explicit about which is most appropriate for their use-case, while being able to fail-fast with actionable information could help with that. It might present an initial performance regression going from a default binary to text, but explicit control could reduce the friction there?
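As a rough sketch of the explicit-format idea: every name below is hypothetical. No `CopyFromFormat` option is confirmed to exist in pgx, and `column.supportsBinary` stands in for whatever codec-level lookup the real implementation would use.

```go
package main

import "fmt"

// CopyFromFormat is a hypothetical option type for this sketch.
type CopyFromFormat int

const (
	CopyFromFormatText   CopyFromFormat = iota // zero value, so text is the default
	CopyFromFormatBinary
)

// column describes one target column. supportsBinary is a stand-in for
// a codec-level format check.
type column struct {
	name           string
	typeName       string
	supportsBinary bool
}

// checkFormat fails fast, before any rows are read, if binary format
// was requested for a column whose type cannot use it.
func checkFormat(format CopyFromFormat, cols []column) error {
	if format != CopyFromFormatBinary {
		return nil
	}
	for _, c := range cols {
		if !c.supportsBinary {
			return fmt.Errorf(
				"column %s type %s does not support binary format, use CopyFromFormatText",
				c.name, c.typeName)
		}
	}
	return nil
}

func main() {
	cols := []column{
		{name: "id", typeName: "int4", supportsBinary: true},
		{name: "path", typeName: "jsonpath", supportsBinary: false},
	}
	fmt.Println(checkFormat(CopyFromFormatBinary, cols))
	fmt.Println(checkFormat(CopyFromFormatText, cols))
}
```

With this shape the format decision depends only on the column types and the caller's choice, never on the contents of the first row.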
