CopyFrom: auto-detect binary/text format with text fallback #2521
CopyFrom previously hardcoded binary format and used tryScanStringCopyValueThenEncode as a workaround when binary encoding failed. This was fragile and couldn't handle types that only support the text format (e.g. jsonpath, aclitem).

Now CopyFrom peeks at the first row and checks both codec-level format support and value-level encode plan availability. If any column cannot use binary, it transparently falls back to text-format COPY, letting PostgreSQL handle the parsing natively.

- Remove tryScanStringCopyValueThenEncode
- Add encodeCopyValueText with proper COPY text escaping
- Add canUseBinaryFormat two-level detection
- Add buildCopyBufText for text-format row encoding
- Buffer the first row for the format decision without data loss
- Add tests for text fallback, special-character escaping, NULLs, large datasets, all query exec modes, string-to-int conversion, and empty row sets

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
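For context on what "proper COPY text escaping" involves: PostgreSQL's text-format COPY requires backslash, tab, newline, and carriage return to be backslash-escaped (NULLs are sent as `\N`). The sketch below is illustrative only; `escapeCopyText` is a hypothetical helper, not the actual pgx `encodeCopyValueText` implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// escapeCopyText sketches the escaping rules for PostgreSQL text-format COPY:
// backslash, tab, newline, and carriage return must be escaped so the server
// does not misread them as field or row delimiters. (Hypothetical helper,
// not the pgx internal function.)
func escapeCopyText(s string) string {
	var b strings.Builder
	for _, r := range s {
		switch r {
		case '\\':
			b.WriteString(`\\`)
		case '\t':
			b.WriteString(`\t`)
		case '\n':
			b.WriteString(`\n`)
		case '\r':
			b.WriteString(`\r`)
		default:
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	// A tab inside a value must not be confused with the column separator.
	fmt.Println(escapeCopyText("a\tb"))
}
```

A NULL column would bypass this function entirely and be written as the literal sequence `\N`.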
|
I like this approach. Though, I'm curious about the value-level check. The idea behind the first-row peek makes sense. But it adds complexity and doesn't reliably catch potential caller type mismatches. For instance:

Fails:

Succeeds:

Admittedly, I'm sure this is an unlikely edge case, as I would expect the input rows to be uniform in format. But it seemed worth bringing attention to. Regardless, could the value types be checked without buffering the row itself?
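The concern about non-uniform rows can be sketched as follows. This is a hypothetical illustration, not pgx internals: `supportsBinary` stands in for the codec/encode-plan check, and the rule "only int64 is binary-encodable" is invented purely to show how a first-row peek can choose a format that a later row cannot satisfy.

```go
package main

import "fmt"

// supportsBinary stands in for a value-level encode-plan check. For the sake
// of illustration, pretend only int64 values have a binary encode plan.
func supportsBinary(v any) bool {
	_, ok := v.(int64)
	return ok
}

func main() {
	rows := [][]any{
		{int64(1)}, // first row: binary encoding looks fine
		{"2"},      // later row: string value with no binary plan
	}

	// A first-row peek decides the format for the entire COPY...
	useBinary := supportsBinary(rows[0][0])
	fmt.Println("chose binary:", useBinary)

	// ...but a later, non-uniform row breaks that assumption mid-stream.
	fmt.Println("row 1 encodable under binary:", supportsBinary(rows[1][0]))
}
```

In this scenario the format decision is already committed by the time row 1 is reached, which is why a codec-level (type-only) check, or an explicit caller choice, is more predictable than peeking at values.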
@abrightwell To be honest, I haven't looked closely at this yet. I was more curious to see if Claude could do it at all, and the results seemed plausible. But it would definitely need careful review. It would also need careful consideration of whether auto-fallback is desirable behavior. It could make performance less understandable.
Oh for sure, it definitely seems like Claude might be on to something with the approach. In fact, it got me thinking slightly differently.
Yeah, this is where I've shifted my thinking. Perhaps instead of attempting to predict the format, allow it to be explicitly set by the caller, with a reasonable default? For instance, adding an optional parameter on The binary path could do a simple column codec check, returning an error if any of the column types do not support it. Maybe something like:
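One possible shape for that suggestion, sketched below. Every name here (`CopyFormat`, `column`, `checkBinarySupport`) is hypothetical, not pgx API; the point is only the fail-fast codec-level check on the binary path.

```go
package main

import "fmt"

// CopyFormat is a hypothetical option type for an explicit, caller-chosen
// COPY format; none of these names exist in pgx.
type CopyFormat int

const (
	CopyFormatBinary CopyFormat = iota // default; fail fast if unsupported
	CopyFormatText
)

// column stands in for a pgtype codec lookup; supportsBinary mimics the
// codec-level (type-only) format check, which needs no row buffering.
type column struct {
	name           string
	supportsBinary bool
}

// checkBinarySupport returns an actionable error if any column codec cannot
// encode in binary format, instead of silently falling back to text.
func checkBinarySupport(cols []column) error {
	for _, c := range cols {
		if !c.supportsBinary {
			return fmt.Errorf("column %q does not support binary COPY; use CopyFormatText", c.name)
		}
	}
	return nil
}

func main() {
	// e.g. a jsonpath column has no binary codec, so a binary COPY
	// would fail fast here with a message naming the offending column.
	cols := []column{{"id", true}, {"path", false}}
	if err := checkBinarySupport(cols); err != nil {
		fmt.Println(err)
	}
}
```

Because the check only consults column types, it runs before any rows are read, which addresses the "could the value types be checked without buffering the row" question above.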
Agreed. I think requiring the caller to be aware of, and explicit about, which format is most appropriate for their use case, while being able to fail fast with actionable information, could help with that. It might present an initial performance regression going from a default binary to text, but explicit control could reduce the friction there?