Skip to content

BCP hostfile read buffering#699

Closed
mmcnabb-vms wants to merge 4 commits into
FreeTDS:masterfrom
mmcnabb-vms:bcp_readbuf
Closed

BCP hostfile read buffering#699
mmcnabb-vms wants to merge 4 commits into
FreeTDS:masterfrom
mmcnabb-vms:bcp_readbuf

Conversation

@mmcnabb-vms
Copy link
Copy Markdown
Contributor

This addresses a performance issue. My test data set is a table with 300 small columns and 20K rows. The hostfile is about 40MB, but since FreeTDS makes a separate fread call for every column and every column prefix, this works out to something like 10 million fread and/or fgetc calls.

By reading a 512-byte block at a time from the FILE * , I got the following performance improvements for an in operation:

  • In Linux - freebcp -c improved by a factor of 2
  • In VMS - freebcp -c improved by a factor of 4
  • In VMS - freebcp -n improved by a factor of 10 (down from 46 seconds to 5 seconds)

The fread, fgetc functions have more overhead in VMS than in other operating systems because of how the filesystem works, it's far more efficient working in block multiples of 512.

In all cases, the database server is close to the client, so the database latency is not a factor, it can write the database faster than it can process the hostfile.

Similar improvements are also achieved for the out operation by buffering the writes; that will be a subsequent PR if this one goes well. Some users have data sets of over 1GB so any performance improvement on freebcp is quite significant. Have not tested host files over 2GB in size.

An increase of the block size to 32K was either the same or very fractionally better on performance, but not significantly so.

Profiling in VMS showed a lot of time spent in ftello (as well as fread): the code calls ftello prior to every column read, in case the read has a conversion error and the dump file needs to dump the offset where an error occurred (even if the dump file is not enabled...) I had to rework this anyway as a side effect of adding the buffering.

@freddy77
Copy link
Copy Markdown
Contributor

Not looking much at the moment. Did you try using setvbuf with full buffering?

@mmcnabb-vms
Copy link
Copy Markdown
Contributor Author

Not looking much at the moment. Did you try using setvbuf with full buffering?

with setvbuf(hostfile, NULL, _IOFBF, 512) or size 32768 - No measurable difference in Linux; my test case (as described above) takes around 10-11 seconds on master with or without setvbuf, and about 5-6 seconds with this PR. No difference in VMS either, the short of it is that setvbuf has no effect in VMS anyway. Timing results from Linux:

master:

real    0m11.571s
user    0m6.687s
sys     0m2.637s

this PR:

real    0m6.131s
user    0m3.342s
sys     0m0.112s

@freddy77
Copy link
Copy Markdown
Contributor

Not looking much at the moment. Did you try using setvbuf with full buffering?

with setvbuf(hostfile, NULL, _IOFBF, 512) or size 32768 - No measurable difference in Linux; my test case (as described above) takes around 10-11 seconds on master with or without setvbuf, and about 5-6 seconds with this PR. No difference in VMS either, the short of it is that setvbuf has no effect in VMS anyway. Timing results from Linux:

master:

real    0m11.571s
user    0m6.687s
sys     0m2.637s

this PR:

real    0m6.131s
user    0m3.342s
sys     0m0.112s

You have also to pass a buffer, do not pass NULL.

@mmcnabb-vms
Copy link
Copy Markdown
Contributor Author

No difference with a malloc'd buffer of 32768

@mmcnabb-vms

This comment was marked as outdated.

@mmcnabb-vms
Copy link
Copy Markdown
Contributor Author

Rebased to master.

@mmcnabb-vms mmcnabb-vms mentioned this pull request Mar 3, 2026
@freddy77
Copy link
Copy Markdown
Contributor

Can you rebase on master?

Overall I have the feeling we could go even further and "implement" also a faster and proper "tell" function. Meaning using open, read, close with a binary file to have correct offset and do the buffering without FILE* functions (that already handle buffering). What do you think? I don't know OpenVMS support for these POSIX functions. Surely Mac and Linux works, Windows also maybe with some minor "hint".

@mmcnabb-vms
Copy link
Copy Markdown
Contributor Author

mmcnabb-vms commented Mar 13, 2026

Overall I have the feeling we could go even further and "implement" also a faster and proper "tell" function. Meaning using open, read, close with a binary file to have correct offset and do the buffering without FILE* functions (that already handle buffering).

I tried keeping track of the offset without using ftell, but ran into a problem that the host file is opened in text mode, where ftell does not give the actual byte offset into the file due to the line endings transformation. So it's not reliable to pass our computed offset to fseek. In Standard C for a text file, the only value that can be safely given to fseek is 0 or the result of an earlier ftell call.

The host file probably should be opened in binary mode, since it can contain embedded null bytes, even in character mode in Windows. But we would have to add in explicit EOL handling if we do make that change.

open/read/close should work in OpenVMS but I couldn't say if there would be noticeable performance improvement. The implementation of open/read/close in VMS is very similar to fopen/fread/fclose, they are both abstraction layers over VMS file system calls which already do their own buffering. (which is why setvbuf is basically unimplemented).

mmcnabb-vms and others added 4 commits March 13, 2026 16:21
	Avoid malloc in hot loop
	Optimize common cases of terminator size 1 or 2

Signed-off-by: Matt McNabb <matthew.mcnabb@vmssoftware.com>
	fseeko/ftello definition is moved to header because
	"offset_type" is needed by TDSFILESTREAM prototypes

Signed-off-by: Matt McNabb <matthew.mcnabb@vmssoftware.com>
Signed-off-by: Matt McNabb <matthew.mcnabb@vmssoftware.com>
Signed-off-by: Matt McNabb <matthew.mcnabb@vmssoftware.com>
@mmcnabb-vms
Copy link
Copy Markdown
Contributor Author

Rebased to master . Comments:

  • using the fixed buffer for hostfile terminator also avoids the need to have a "destructor" for the stream, since it's not allocating any memory (keeps the code using the stream simpler)
  • The fseeko/ftello preprocessor code is in the header because offset_type is now needed in the TDSFILESTREAM function prototypes; it might be possible to hide them in bulk.c with some tricks to set up offset_type another way
  • The fseek/ftell in order to write the row-error file seems unavoidable, without major restructuring

@freddy77
Copy link
Copy Markdown
Contributor

freddy77 commented Apr 7, 2026

I was looking at the code. Saying, to avoid ftell, that we open the file as binary, would be possible to handle "text" handling with code? For instance, for Windows text files you need to convert the sequence "\r\n" to just "\n". What about OpenVMS? Or should we use binary read/write even on OpenVMS to make files more "portables"?

@mmcnabb-vms
Copy link
Copy Markdown
Contributor Author

I was looking at the code. Saying, to avoid ftell, that we open the file as binary, would be possible to handle "text" handling with code? For instance, for Windows text files you need to convert the sequence "\r\n" to just "\n". What about OpenVMS? Or should we use binary read/write even on OpenVMS to make files more "portables"?

In brief, it won't make a difference on OpenVMS whether or not b is used with standard fopen calls, there are no translation differences.

In Windows I think it's the right thing to do to use binary mode; using text mode with line-ending conversion for -n format in Windows makes no sense in the first place. As you say, we'd need some new code to make it so that -c files end in CR-NL pairs , and some consideration of -f files . (Someone might be using a custom format with \n as a field terminator for fields other than the final field...)

IDK anything about Mac files.

Making a change to line ending translation might break any users who are using freebcp in conjunction with other tools, as opposed to freebcp roundtripping (e.g. they export their other data source to a text file and load it with freebcp in) but perhaps that can't be helped and they'll have to update their toolchains if moving to freebcp 1.6

@freddy77
Copy link
Copy Markdown
Contributor

Finally I found some time to do some investigation on this, or better on how MS handle line termination.
From documentation the default row terminator (for character format, native does not use terminators) is \n (new line). Documentation also states that \n is converted to \r\n (that true, although they fail to mention that this does not happen in Linux).
Then I tried to insert a newline into a VARCHAR field and see how it was saved to the file. In this case the new line is not prefixed with return carriage (\r). This is a behavior of binary file, as text file on Windows should replace \n with \r\n.
Also MS bcp utility supports binary format terminator and you can pass a new line (like bcp ... -r 0x0a). In this case the newline is not prefixed with by carriage return.
Another oddness is that you can specify a NUL character with \0 (like bcp ... -r \0, weirdly you can't insert NUL character with hexadecimal syntax...) and a 0 byte is inserted (which is odd for text files too).

So, my conclusion is that the file is written as binary file and simply \r\n is used as terminator (at least on Windows).

Another consideration about text/binary files is that writing prefixes using text files is wrong, on Windows a length of 10 (0x0a, \n) could become 2573 (0x0d0a, \r\n).

@mmcnabb-vms
Copy link
Copy Markdown
Contributor Author

mmcnabb-vms commented Apr 13, 2026

Agree that we should be using binary mode for this file. That should probably be a separate Issue to this PR, even though there will be some overlap in the code ?

Regarding null bytes, see also #686 , there are some other issues to address there as well. I'm sure those issues will become easier if we do first switch to opening the file in binary mode.

@freddy77
Copy link
Copy Markdown
Contributor

Merged!

@freddy77 freddy77 closed this Apr 17, 2026
@mmcnabb-vms mmcnabb-vms deleted the bcp_readbuf branch April 21, 2026 22:57
@mmcnabb-vms
Copy link
Copy Markdown
Contributor Author

mmcnabb-vms commented Apr 22, 2026

Merged!

Thanks. However, t0016 step 3 fails for me in Windows now -- which turns out to be because my t0016_3.in uses CRLF endings, and the changes you have made are not tolerant of input files that use CRLF endings: the character mode file now must have LF endings only otherwise it gives a syntax error.

I fixed it by deleting the files and re-checking them out from git. This produced LF files, although to ensure this for all users regardless of their git LF settings, maybe it would be good to add a gitattributes rule for this directory to force *.in *.out *.exp to binary.

Generally speaking, this change will break bcp in for existing Windows users who generate data files with CRLF endings. Which I think would be the norm, for example if they use a text editor, or any other programmatic method which writes a file in text mode currently.

Is it working as intended and Windows users will have to produce their input files differently when upgrading to FreeTDS 1.6; or did you hope to continue to support CRLF input with -c ?

The same would apply to embedded newlines in text fields (for example a VARCHAR containing \n), the input file might contain CRLF that the user intends to arrive in the database as LF .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants