BCP hostfile read buffering#699
Conversation
|
Not looking much at the moment. Did you try using |
with master: this PR: |
You have also to pass a buffer, do not pass |
|
No difference with a malloc'd buffer of 32768 |
00761d8 to
3cc0c75
Compare
3cc0c75 to
9cfbe10
Compare
This comment was marked as outdated.
This comment was marked as outdated.
|
Rebased to master. |
|
Can you rebase on master? Overall I have the feeling we could go even further and "implement" also a faster and proper "tell" function. Meaning using |
I tried keeping track of the offset without using The host file probably should be opened in binary mode, since it can contain embedded null bytes, even in character mode in Windows. But we would have to add in explicit EOL handling if we do make that change. open/read/close should work in OpenVMS but I couldn't say if there would be noticeable performance improvement. The implementation of open/read/close in VMS is very similar to fopen/fread/fclose, they are both abstraction layers over VMS file system calls which already do their own buffering. (which is why |
Avoid malloc in hot loop Optimize common cases of terminator size 1 or 2 Signed-off-by: Matt McNabb <matthew.mcnabb@vmssoftware.com>
fseeko/ftello definition is moved to header because "offset_type" is needed by TDSFILESTREAM prototypes Signed-off-by: Matt McNabb <matthew.mcnabb@vmssoftware.com>
Signed-off-by: Matt McNabb <matthew.mcnabb@vmssoftware.com>
Signed-off-by: Matt McNabb <matthew.mcnabb@vmssoftware.com>
742a66b to
9fefe7a
Compare
|
Rebased to master . Comments:
|
|
I was looking at the code. Saying, to avoid |
In brief, it won't make a difference on OpenVMS whether or not In Windows I think it's the right thing to do to use binary mode; using text mode with line-ending conversion for IDK anything about Mac files. Making a change to line ending translation might break any users who are using freebcp in conjunction with other tools, as opposed to freebcp roundtripping (e.g. they export their other data source to a text file and load it with freebcp in) but perhaps that can't be helped and they'll have to update their toolchains if moving to freebcp 1.6 |
|
Finally I found some time to do some investigation on this, or better on how MS handle line termination. So, my conclusion is that the file is written as binary file and simply Another consideration about text/binary files is that writing prefixes using text files is wrong, on Windows a length of 10 (0x0a, |
|
Agree that we should be using binary mode for this file. That should probably be a separate Issue to this PR, even though there will be some overlap in the code ? Regarding null bytes, see also #686 , there are some other issues to address there as well. I'm sure those issues will become easier if we do first switch to opening the file in binary mode. |
|
Merged! |
Thanks. However, t0016 step 3 fails for me in Windows now -- which turns out to be because my I fixed it by deleting the files and re-checking them out from Generally speaking, this change will break Is it working as intended and Windows users will have to produce their input files differently when upgrading to FreeTDS 1.6; or did you hope to continue to support CRLF input with The same would apply to embedded newlines in text fields (for example a VARCHAR containing |
This addresses a performance issue. My test data set is a table with 300 small columns and 20K rows. The hostfile is about 40MB, but since FreeTDS makes a separate
freadcall for every column and every column prefix, this works out to something like 10 millionfreadand/orfgetccalls.By reading a 512-byte block at a time from the
FILE *, I got the following performance improvements for aninoperation:freebcp -cimproved by a factor of 2freebcp -cimproved by a factor of 4freebcp -nimproved by a factor of 10 (down from 46 seconds to 5 seconds)The
fread,fgetcfunctions have more overhead in VMS than in other operating systems because of how the filesystem works, it's far more efficient working in block multiples of 512.In all cases, the database server is close to the client, so the database latency is not a factor, it can write the database faster than it can process the hostfile.
Similar improvements are also achieved for the
outoperation by buffering the writes; that will be a subsequent PR if this one goes well. Some users have data sets of over 1GB so any performance improvement on freebcp is quite significant. Have not tested host files over 2GB in size.An increase of the block size to 32K was either the same or very fractionally better on performance, but not significantly so.
Profiling in VMS showed a lot of time spent in
ftello(as well asfread): the code callsftelloprior to every column read, in case the read has a conversion error and the dump file needs to dump the offset where an error occurred (even if the dump file is not enabled...) I had to rework this anyway as a side effect of adding the buffering.