BCP hostfile read buffering by mmcnabb-vms · Pull Request #699 · FreeTDS/freetds

mmcnabb-vms · 2026-02-13T11:29:52Z

This addresses a performance issue. My test data set is a table with 300 small columns and 20K rows. The hostfile is about 40MB, but since FreeTDS makes a separate fread call for every column and every column prefix, this works out to something like 10 million fread and/or fgetc calls.

By reading a 512-byte block at a time from the FILE * , I got the following performance improvements for an in operation:

In Linux - freebcp -c improved by a factor of 2
In VMS - freebcp -c improved by a factor of 4
In VMS - freebcp -n improved by a factor of 10 (down from 46 seconds to 5 seconds)

The fread, fgetc functions have more overhead in VMS than in other operating systems because of how the filesystem works, it's far more efficient working in block multiples of 512.

In all cases, the database server is close to the client, so the database latency is not a factor, it can write the database faster than it can process the hostfile.

Similar improvements are also achieved for the out operation by buffering the writes; that will be a subsequent PR if this one goes well. Some users have data sets of over 1GB so any performance improvement on freebcp is quite significant. Have not tested host files over 2GB in size.

An increase of the block size to 32K was either the same or very fractionally better on performance, but not significantly so.

Profiling in VMS showed a lot of time spent in ftello (as well as fread): the code calls ftello prior to every column read, in case the read has a conversion error and the dump file needs to dump the offset where an error occurred (even if the dump file is not enabled...) I had to rework this anyway as a side effect of adding the buffering.

freddy77 · 2026-02-13T19:39:07Z

Not looking much at the moment. Did you try using setvbuf with full buffering?

mmcnabb-vms · 2026-02-15T21:44:04Z

Not looking much at the moment. Did you try using setvbuf with full buffering?

with setvbuf(hostfile, NULL, _IOFBF, 512) or size 32768 - No measurable difference in Linux; my test case (as described above) takes around 10-11 seconds on master with or without setvbuf, and about 5-6 seconds with this PR. No difference in VMS either, the short of it is that setvbuf has no effect in VMS anyway. Timing results from Linux:

master:

real    0m11.571s
user    0m6.687s
sys     0m2.637s

this PR:

real    0m6.131s
user    0m3.342s
sys     0m0.112s

freddy77 · 2026-02-15T22:23:56Z

Not looking much at the moment. Did you try using setvbuf with full buffering?

with setvbuf(hostfile, NULL, _IOFBF, 512) or size 32768 - No measurable difference in Linux; my test case (as described above) takes around 10-11 seconds on master with or without setvbuf, and about 5-6 seconds with this PR. No difference in VMS either, the short of it is that setvbuf has no effect in VMS anyway. Timing results from Linux:

master:
real    0m11.571s
user    0m6.687s
sys     0m2.637s
this PR:
real    0m6.131s
user    0m3.342s
sys     0m0.112s

You have also to pass a buffer, do not pass NULL.

mmcnabb-vms · 2026-02-15T22:33:45Z

No difference with a malloc'd buffer of 32768

mmcnabb-vms · 2026-03-03T00:24:59Z

Rebased to master.

freddy77 · 2026-03-12T22:01:38Z

Can you rebase on master?

Overall I have the feeling we could go even further and "implement" also a faster and proper "tell" function. Meaning using open, read, close with a binary file to have correct offset and do the buffering without FILE* functions (that already handle buffering). What do you think? I don't know OpenVMS support for these POSIX functions. Surely Mac and Linux works, Windows also maybe with some minor "hint".

mmcnabb-vms · 2026-03-13T02:54:51Z

Overall I have the feeling we could go even further and "implement" also a faster and proper "tell" function. Meaning using open, read, close with a binary file to have correct offset and do the buffering without FILE* functions (that already handle buffering).

I tried keeping track of the offset without using ftell, but ran into a problem that the host file is opened in text mode, where ftell does not give the actual byte offset into the file due to the line endings transformation. So it's not reliable to pass our computed offset to fseek. In Standard C for a text file, the only value that can be safely given to fseek is 0 or the result of an earlier ftell call.

The host file probably should be opened in binary mode, since it can contain embedded null bytes, even in character mode in Windows. But we would have to add in explicit EOL handling if we do make that change.

open/read/close should work in OpenVMS but I couldn't say if there would be noticeable performance improvement. The implementation of open/read/close in VMS is very similar to fopen/fread/fclose, they are both abstraction layers over VMS file system calls which already do their own buffering. (which is why setvbuf is basically unimplemented).

Avoid malloc in hot loop Optimize common cases of terminator size 1 or 2 Signed-off-by: Matt McNabb <matthew.mcnabb@vmssoftware.com>

fseeko/ftello definition is moved to header because "offset_type" is needed by TDSFILESTREAM prototypes Signed-off-by: Matt McNabb <matthew.mcnabb@vmssoftware.com>

Signed-off-by: Matt McNabb <matthew.mcnabb@vmssoftware.com>

mmcnabb-vms · 2026-03-13T04:22:35Z

Rebased to master . Comments:

using the fixed buffer for hostfile terminator also avoids the need to have a "destructor" for the stream, since it's not allocating any memory (keeps the code using the stream simpler)
The fseeko/ftello preprocessor code is in the header because offset_type is now needed in the TDSFILESTREAM function prototypes; it might be possible to hide them in bulk.c with some tricks to set up offset_type another way
The fseek/ftell in order to write the row-error file seems unavoidable, without major restructuring

freddy77 · 2026-04-07T20:57:40Z

I was looking at the code. Saying, to avoid ftell, that we open the file as binary, would be possible to handle "text" handling with code? For instance, for Windows text files you need to convert the sequence "\r\n" to just "\n". What about OpenVMS? Or should we use binary read/write even on OpenVMS to make files more "portables"?

mmcnabb-vms · 2026-04-07T22:56:55Z

I was looking at the code. Saying, to avoid ftell, that we open the file as binary, would be possible to handle "text" handling with code? For instance, for Windows text files you need to convert the sequence "\r\n" to just "\n". What about OpenVMS? Or should we use binary read/write even on OpenVMS to make files more "portables"?

In brief, it won't make a difference on OpenVMS whether or not b is used with standard fopen calls, there are no translation differences.

In Windows I think it's the right thing to do to use binary mode; using text mode with line-ending conversion for -n format in Windows makes no sense in the first place. As you say, we'd need some new code to make it so that -c files end in CR-NL pairs , and some consideration of -f files . (Someone might be using a custom format with \n as a field terminator for fields other than the final field...)

IDK anything about Mac files.

Making a change to line ending translation might break any users who are using freebcp in conjunction with other tools, as opposed to freebcp roundtripping (e.g. they export their other data source to a text file and load it with freebcp in) but perhaps that can't be helped and they'll have to update their toolchains if moving to freebcp 1.6

freddy77 · 2026-04-13T19:36:42Z

Finally I found some time to do some investigation on this, or better on how MS handle line termination.
From documentation the default row terminator (for character format, native does not use terminators) is \n (new line). Documentation also states that \n is converted to \r\n (that true, although they fail to mention that this does not happen in Linux).
Then I tried to insert a newline into a VARCHAR field and see how it was saved to the file. In this case the new line is not prefixed with return carriage (\r). This is a behavior of binary file, as text file on Windows should replace \n with \r\n.
Also MS bcp utility supports binary format terminator and you can pass a new line (like bcp ... -r 0x0a). In this case the newline is not prefixed with by carriage return.
Another oddness is that you can specify a NUL character with \0 (like bcp ... -r \0, weirdly you can't insert NUL character with hexadecimal syntax...) and a 0 byte is inserted (which is odd for text files too).

So, my conclusion is that the file is written as binary file and simply \r\n is used as terminator (at least on Windows).

Another consideration about text/binary files is that writing prefixes using text files is wrong, on Windows a length of 10 (0x0a, \n) could become 2573 (0x0d0a, \r\n).

mmcnabb-vms · 2026-04-13T21:52:18Z

Agree that we should be using binary mode for this file. That should probably be a separate Issue to this PR, even though there will be some overlap in the code ?

Regarding null bytes, see also #686 , there are some other issues to address there as well. I'm sure those issues will become easier if we do first switch to opening the file in binary mode.

freddy77 · 2026-04-17T06:35:11Z

Merged!

mmcnabb-vms · 2026-04-22T00:38:31Z

Merged!

Thanks. However, t0016 step 3 fails for me in Windows now -- which turns out to be because my t0016_3.in uses CRLF endings, and the changes you have made are not tolerant of input files that use CRLF endings: the character mode file now must have LF endings only otherwise it gives a syntax error.

I fixed it by deleting the files and re-checking them out from git. This produced LF files, although to ensure this for all users regardless of their git LF settings, maybe it would be good to add a gitattributes rule for this directory to force *.in *.out *.exp to binary.

Generally speaking, this change will break bcp in for existing Windows users who generate data files with CRLF endings. Which I think would be the norm, for example if they use a text editor, or any other programmatic method which writes a file in text mode currently.

Is it working as intended and Windows users will have to produce their input files differently when upgrading to FreeTDS 1.6; or did you hope to continue to support CRLF input with -c ?

The same would apply to embedded newlines in text fields (for example a VARCHAR containing \n), the input file might contain CRLF that the user intends to arrive in the database as LF .

mmcnabb-vms force-pushed the bcp_readbuf branch from 00761d8 to 3cc0c75 Compare February 16, 2026 22:09

mmcnabb-vms force-pushed the bcp_readbuf branch from 3cc0c75 to 9cfbe10 Compare February 27, 2026 07:53

This comment was marked as outdated.

Sign in to view

mmcnabb-vms force-pushed the bcp_readbuf branch from 4a37faa to 742a66b Compare March 3, 2026 00:24

mmcnabb-vms mentioned this pull request Mar 3, 2026

BCP defaults #703

Closed

mmcnabb-vms and others added 4 commits March 13, 2026 16:21

bcp in: Optimize code to check for hostfile field terminator

7ee8847

Avoid malloc in hot loop Optimize common cases of terminator size 1 or 2 Signed-off-by: Matt McNabb <matthew.mcnabb@vmssoftware.com>

bcp in: Use TDSFILESTREAM for the host file read

f438a03

fseeko/ftello definition is moved to header because "offset_type" is needed by TDSFILESTREAM prototypes Signed-off-by: Matt McNabb <matthew.mcnabb@vmssoftware.com>

bcp in: Avoid calling ftell in hot loop

e32bb51

Signed-off-by: Matt McNabb <matthew.mcnabb@vmssoftware.com>

bcp in: Buffer input stream in blocks

9fefe7a

Signed-off-by: Matt McNabb <matthew.mcnabb@vmssoftware.com>

mmcnabb-vms force-pushed the bcp_readbuf branch from 742a66b to 9fefe7a Compare March 13, 2026 04:15

freddy77 closed this Apr 17, 2026

mmcnabb-vms deleted the bcp_readbuf branch April 21, 2026 22:57

Conversation

mmcnabb-vms commented Feb 13, 2026

Uh oh!

freddy77 commented Feb 13, 2026

Uh oh!

mmcnabb-vms commented Feb 15, 2026

Uh oh!

freddy77 commented Feb 15, 2026

Uh oh!

mmcnabb-vms commented Feb 15, 2026

Uh oh!

This comment was marked as outdated.

mmcnabb-vms commented Mar 3, 2026

Uh oh!

freddy77 commented Mar 12, 2026

Uh oh!

mmcnabb-vms commented Mar 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mmcnabb-vms commented Mar 13, 2026

Uh oh!

freddy77 commented Apr 7, 2026

Uh oh!

mmcnabb-vms commented Apr 7, 2026

Uh oh!

freddy77 commented Apr 13, 2026

Uh oh!

mmcnabb-vms commented Apr 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

freddy77 commented Apr 17, 2026

Uh oh!

mmcnabb-vms commented Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

mmcnabb-vms commented Mar 13, 2026 •

edited

Loading

mmcnabb-vms commented Apr 13, 2026 •

edited

Loading

mmcnabb-vms commented Apr 22, 2026 •

edited

Loading