Skip to content

BCP improvements#704

Open
mmcnabb-vms wants to merge 15 commits into
FreeTDS:masterfrom
mmcnabb-vms:bcp_improve
Open

BCP improvements#704
mmcnabb-vms wants to merge 15 commits into
FreeTDS:masterfrom
mmcnabb-vms:bcp_improve

Conversation

@mmcnabb-vms
Copy link
Copy Markdown
Contributor

  • Improve the layout of T0016 output
  • Support MSSQL datetime2 in precision 7 (the native if you don't reduce it to 3)
  • Fix cases of reading the wrong number of bytes from the host file (addresses some of the issues raised on Handling of NULL values in FreeBCP native file format with MSSQL #686, mainly to do with MSSQL nullable fields)
  • Some tidying-up of handling of incorrect data in the hostfile (extra checks and error messages)
  • fix application of -L limit in freebcp
  • Other minor issues

I left out one change from my earlier pull request #685 , to do with validating hostfile data, because that territory is covered by other commits on Ucko PR#558 .

This PR should be processable independently of the BCP Defaults PR although I developed them both together and then separated into more manageably-sized PRs . If you merge/rebase these two PRs, there is one conflict: a common on bcp_defaults deletes some code from the start of bcp_options() and a commit on this PR modifies the first line after the deleted text; git doesn't automatically resolve that.

Comment thread src/dblib/bcp.c Outdated
buflen = (int)tds_strftime((TDS_CHAR *)(*p_data), 256,
bcpdatefmt, &when, 3);
buflen = (int)tds_strftime((TDS_CHAR*)(*p_data), 256,
bcpdatefmt, &when, prec);
Copy link
Copy Markdown
Contributor Author

@mmcnabb-vms mmcnabb-vms Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fixes a bug in that MS datetime2 would be truncated to 3 digits of precision even if the value held higher precision.

A side-effect of this change is that the bcp out -c file will differ between 1.5 and 1.6 for MS datetime2 columns: with this change, it always prints 7 digits of precision even if they are just trailing zeroes.

This actually matches the behaviour of the Microsoft bcp client.

It would be possible to have our freebcp save space on disk by truncating any trailing zeroes here (including omitting the decimal point entirely if the value were an exact number of seconds), your thoughts on that? The file is still read back in correctly without trailing zeroes.

As an aside freebcp still does not generate an exactly identical output to the Microsoft bcp when talking to a MSSQL server: the Microsoft bcp also drops the leading zeroes on NUMERIC/DECIMAL/MONEY values between -1 and 1. (But ASE bcp retains that leading zero!).

@mmcnabb-vms mmcnabb-vms mentioned this pull request Feb 22, 2026
mmcnabb-vms and others added 14 commits March 3, 2026 13:28
	Helps with the heap-tracing report

Signed-off-by: Matt McNabb <matthew.mcnabb@vmssoftware.com>
Signed-off-by: Matt McNabb <matthew.mcnabb@vmssoftware.com>
Signed-off-by: Matt McNabb <matthew.mcnabb@vmssoftware.com>
Signed-off-by: Matt McNabb <matthew.mcnabb@vmssoftware.com>
Signed-off-by: Matt McNabb <matthew.mcnabb@vmssoftware.com>
Signed-off-by: Matt McNabb <matthew.mcnabb@vmssoftware.com>
Signed-off-by: Matt McNabb <matthew.mcnabb@vmssoftware.com>
Signed-off-by: Matt McNabb <matthew.mcnabb@vmssoftware.com>
	However, does not yet report on rows silently dropped from
	hostfile.

Signed-off-by: Matt McNabb <matthew.mcnabb@vmssoftware.com>
Signed-off-by: Matt McNabb <matthew.mcnabb@vmssoftware.com>
Signed-off-by: Matt McNabb <matthew.mcnabb@vmssoftware.com>
Signed-off-by: Matt McNabb <matthew.mcnabb@vmssoftware.com>
Signed-off-by: Matt McNabb <matthew.mcnabb@vmssoftware.com>
Signed-off-by: Matt McNabb <matthew.mcnabb@vmssoftware.com>
@mmcnabb-vms
Copy link
Copy Markdown
Contributor Author

Rebased to master .

Note some issues on #686 are still outstanding, for example freebcp -n roundtrip to MSSQL doesn't work in the presence of null varchars.

Comment thread include/sybdb.h
RETCODE bcp_exec(DBPROCESS * dbproc, DBINT * rows_copied);
DBBOOL bcp_getl(LOGINREC * login);
RETCODE bcp_options(DBPROCESS * dbproc, int option, BYTE * value, int valuelen);
RETCODE bcp_options(DBPROCESS * dbproc, int option, BYTE * value, size_t valuelen);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is changing an ABI

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, leave this change out

Comment thread src/tds/bulk.c
Comment thread src/dblib/unittests/common.c Outdated
Copy link
Copy Markdown
Contributor

@freddy77 freddy77 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What was wrong with the previous code? Can you explain in the commit message?

Copy link
Copy Markdown
Contributor

@freddy77 freddy77 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you test with MS dblib ?
What about Sybase behavior ?
From your comment it looks like the correct way would be to understand the reason of tds_bcp_fread failed and set the proper error instead.

Copy link
Copy Markdown
Contributor

@freddy77 freddy77 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was it wrong before ?
Is this consistent with Sybase ? Microsoft ?

Copy link
Copy Markdown
Contributor

@freddy77 freddy77 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like you are changing the behavior here to then change again in the next commit.

Comment thread src/dblib/bcp.c

/* Validate that data does not exceed expected maximum (could cause
* wrong row length to be sent on wire) */
if (hostcol->column_len > 0 && collen > hostcol->column_len)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the hostcol->column_len > 0 condition trying to check ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking size of data only makes sense when there is a known maximum size to check against.

hostcol->column_len can be -1 (for fixed types) or 0 (meaning no data should be sent for this column)

@freddy77
Copy link
Copy Markdown
Contributor

freddy77 commented Mar 6, 2026

Merged some commits

@mmcnabb-vms
Copy link
Copy Markdown
Contributor Author

What was wrong with the previous code? Can you explain in the commit message?

Did you test with MS dblib ?
What about Sybase behavior ?
From your comment it looks like the correct way would be to understand the reason of tds_bcp_fread failed and set the >proper error instead.

Why was it wrong before ?
Is this consistent with Sybase ? Microsoft ?

It looks like you are changing the behavior here to then change again in the next commit.

@freddy77 Which code is these comments referring to? It doesn't show me in the conversation & they don't appear on the other views.

@freddy77
Copy link
Copy Markdown
Contributor

freddy77 commented Mar 9, 2026

What was wrong with the previous code? Can you explain in the commit message?
Did you test with MS dblib ?
What about Sybase behavior ?
From your comment it looks like the correct way would be to understand the reason of tds_bcp_fread failed and set the >proper error instead.
Why was it wrong before ?
Is this consistent with Sybase ? Microsoft ?
It looks like you are changing the behavior here to then change again in the next commit.

@freddy77 Which code is these comments referring to? It doesn't show me in the conversation & they don't appear on the other views.

When I wrote them they were attached to specific commits. The interface is not showing the relationship apparently.

Signed-off-by: Matt McNabb <matthew.mcnabb@vmssoftware.com>
@mmcnabb-vms
Copy link
Copy Markdown
Contributor Author

What was wrong with the previous code? Can you explain in the commit message?
Did you test with MS dblib ?
What about Sybase behavior ?
From your comment it looks like the correct way would be to understand the reason of tds_bcp_fread failed and set the >proper error instead.
Why was it wrong before ?
Is this consistent with Sybase ? Microsoft ?
It looks like you are changing the behavior here to then change again in the next commit.

@freddy77 Which code is these comments referring to? It doesn't show me in the conversation & they don't appear on the other views.

When I wrote them they were attached to specific commits. The interface is not showing the relationship apparently.

In VMS we build with MSDBLIB enabled but the same binary appears to work correctly in all tests against both SQL Server, and ASE 16 . When running test suite in Linux or Windows I don't enable that option.

I guess the tds_bcp_fread comment is regarding 08fab50 . It's not a big deal but I found it confusing to receive the message "Attempt to bulk copy an oversized row to the server" when the problem was (a) occurring while parsing the hostfile, not copying the row yet, and (b) the row wasn't oversized.

Not sure how easy (and/or useful) it would be to try and get more accurate failure information out of tds_bcp_fread; but in theory we should never pack an oversized row anyway: length can be validated on host file fields as they are read. One of the other commits on this PR does generate the oversized-row error code if a varchar field is longer than the host maximum. (without that check , freebcp could actually send too much data for the row, leading to unstable behaviour)

@mmcnabb-vms
Copy link
Copy Markdown
Contributor Author

@freddy77 waiting for anything here? Can rebase if desired.

@freddy77
Copy link
Copy Markdown
Contributor

Sorry, just busy. I merged Clean up memory on exit, even in failure cases.

About Fix signed/unsigned build warnings what about this?

diff --git a/src/tds/bulk.c b/src/tds/bulk.c
index f959e3958..96225900a 100644
--- a/src/tds/bulk.c
+++ b/src/tds/bulk.c
@@ -888,18 +888,18 @@ tds5_bcp_add_variable_columns(TDSBCPINFO *bcpinfo, tds_bcp_get_col_data get_col_
 
        if (ncols && !bcpinfo->datarows_locking) {
                TDS_UCHAR *poff = rowbuffer + row_pos;
-               unsigned int pfx_top = offsets[ncols] >> 8;
+               unsigned int pfx_top = offsets[ncols] / 256u;
 
                tdsdump_log(TDS_DBG_FUNC, "ncols=%u poff=%p [%u]\n", ncols, poff, offsets[ncols]);
 
-               if (offsets[ncols] / 256 == offsets[ncols - 1] / 256)
+               if (offsets[ncols] / 256u == offsets[ncols - 1] / 256u)
                        *poff++ = ncols + 1;
                /* this is some kind of run-length-prefix encoding */
                while (pfx_top) {
                        unsigned int n_pfx = 1;
 
                        for (i = 0; i <= ncols ; ++i)
-                               if ((offsets[i] >> 8) < pfx_top)
+                               if (offsets[i] / 256u < pfx_top)
                                        ++n_pfx;
                        *poff++ = n_pfx;
                        --pfx_top;
@@ -916,8 +916,8 @@ tds5_bcp_add_variable_columns(TDSBCPINFO *bcpinfo, tds_bcp_get_col_data get_col_
                for (col = ncols; col-- > 0;) {
                        /* The DOL BULK format has a much simpler row table -- it's just a 2-byte length for every
                         * non-fixed column (does not have the extra "offset after the end" that the basic format has) */
-                       rowbuffer[row_pos++] = offsets[col] % 256;
-                       rowbuffer[row_pos++] = offsets[col] / 256;
+                       rowbuffer[row_pos++] = offsets[col] % 256u;
+                       rowbuffer[row_pos++] = offsets[col] / 256u;
 
                        tdsdump_log(TDS_DBG_FUNC, "[DOL BULK offset table] col=%u offset=%u\n", col, offsets[col]);
                }

There are already division by 256.

The others mainly they miss some more explanation and I would write some tests.

  • freebcp: Apply the -L limit correctly, what was wrong with the previous code?
  • bcp in: error message for missing data-file terminator. Match to MS bcp. By MS bcp do you mean bcp utility by Microsoft? They use ODBC, not dblib. What about Sybase? TDSRET is not necessarily success or failure if you need to return more specifically the type of error.
  • bcp bind - error code and message for incorrect fixed field width, why the new one is better? It looks like more generic to me.
  • Issue #686: bcp in: Read length-prefixed data correctly, what was wrong with the previous code?
  • Issue #686: bcp in: Read correct number of bytes from host file., what was wrong with the previous code?
  • bcp in: validate that hostfile data does not exceed expected width. That's more clear

@mmcnabb-vms
Copy link
Copy Markdown
Contributor Author

About Fix signed/unsigned build warnings what about this?

The line 902 change if ((offsets[i] >> 8) < pfx_top) to / 256u fixes the signed/unsigned warning (in Windows build) , the other changes not necessary (But don't hurt either)

The others mainly they miss some more explanation and I would write some tests.

  • freebcp: Apply the -L limit correctly, what was wrong with the previous code?

The previous code didn't work at all, the logic was wrong in the attempted application. MAX(n, 0x7FFFFFFFF) is always 0x7FFFFFFF. Changing MAX to MIN is not a sufficient fix as then omitting the -L argument behaves like -L 0 and so exits without saving any rows.

I also added a comment for suggested improvement: after the requested number of rows have been processed, freebcp could just exit at that point, instead of spending ages processing the rest of the stream (e.g. if they want to pull the first 5 rows from a table of 5 million records). Not sure the best way to go about coding that.

  • bcp in: error message for missing data-file terminator. Match to MS bcp. By MS bcp do you mean bcp utility by Microsoft? They use ODBC, not dblib.

By "MS bcp" I mean bcp.exe that SQL Server Client SDK (v17.0 in my case). But as you suggest , my comment is not correct for that utility.

My main point here is that "Attempt to bulk-copy an oversized row to the SQL Server" is neither correct nor useful for the case of failing a type conversion while reading the data file, so that should be changed.

The ASE bcp utility gives the following error (among others) for the case of a missing field terminator:
CSLIB Message: - L0/O0/S0/N24/1/0: cs_convert: cslib user api layer: common library error: The conversion/operation was stopped due to a syntax error in the source field.
Which sounds roughly in line with the message for SYBECSYN.

I think you are suggesting improving the error message further by analyzing why tds_bcp_fread failed in more detail? Maybe possible but that could be a future change. "Syntax error in the source field" seems very likely to be the cause of any failure to read the hostfile, other than lower level read errors.

  • bcp bind - error code and message for incorrect fixed field width, why the new one is better? It looks like more generic to me.

Well, the error condition here is an invalid varlen passed to bcp_bind() for an is_fixed_type() type. The error message It's illegal to use BCP terminators with program variables other than SYBCHAR, SYBBINARY, SYBTEXT, or SYBIMAGE is not correct for that condition, because the problem is varlen, not terminators; and there are a lot more non-fixed types than just that list.

I agree that my message would not make sense to someone who's not a programmer, but this situation probably only arises when someone does call bcp_bind in their own code with invalid arguments. I was prompted to make this change after making such an error in a unit test and getting an error message that didn't make sense.

Not a big deal though and this could be left out, the programmer could do as I did and search the code base for the error message etc.

  • Issue #686: bcp in: Read length-prefixed data correctly, what was wrong with the previous code?

Say for example the data file uses a length prefix to specify that the next 8 bytes in the file are the value for this type, then we should be reading 8 bytes and not overriding it with an expected size calculated another way.

It's never correct to read say 6 bytes from a length-prefixed field of length 8 and then just carry on, because whatever data was in the last 2 bytes will be tried to be read as the next field and things go off the rails quickly. It could give bizarre errors, or (even worse) end up succeeding but writing garbage into the database.

My change will prioritize the length prefix if it exists, falling back to the fixed size for fixed types if there was no prefix. The previous code was the other way around.

It's possible for a format file, or a call to bcp_colfmt, to specify that a field has fixed type and a length prefix. Now you could argue (and I probably agree) that if a user does such a thing then they're doing it wrong so this doesn't matter; but I think it should behave in a stable manner rather than reading the wrong number of bytes and going off into undefined territory.

The situation might also arise if we had a mismatch between dbvarylen() and is_fixed_type() (not sure if our test suite otherwise runs over all of those cases to check they are consistent).

After my change, a length-prefixed fixed type with the wrong length will carry on through to tds_convert() to deal with it. Perhaps it would be even better to explicitly generate an error at the same point on the code as this change.

Not sure about testing this, as we currently don't have any "front door" freebcp tests with a format file. Perhaps we could add another round to t0017 that calls bcp_colfmt() with these "invalid" arguments, I will try that.

  • Issue #686: bcp in: Read correct number of bytes from host file., what was wrong with the previous code?

collen is the number of bytes to read from the hostfile, for a non-delimited field. The line collen = TDS_MIN(hostcol->column_len, collen) causes the fread of a length-prefixed VARCHAR field for example to just stop at the database field width, meaning that if a hostfile contains longer text than the database field width, it will go on to try and read the next field's data from the remainder of the previous field's data , leading to weird errors or incorrect data written to database as described above.

The rest of the diff lines in the commit are refactoring for readability -- I've separated the collen calculations (for non-delimited fields) clearly into the four different cases: always-null / prefixed / fixed-type / non-fixed-type. The original code was harder to follow (IMO) , and didn't seem to know why it did what it did, as evinced by the comment Meaning what, exactly.

This is a plausible error scenario, for example the DBA narrows a VARCHAR max field width and then a user tries to bcp in the same data file. Or if the input file is otherwise generated in some way that doesn't have knowledge of the database max field width for that column.

bcp in: validate that hostfile data does not exceed expected width. That's more clear

That explicitly catches the case described just above, after my fix to make sure collen matches the host file's contents.

These last three commits could be squashed, I separated them to hopefully make more sense at the review stage. (I guess this is what you referred to in your older comment about one commit modifying another commit)

I'll try to come up with a test suite entry for this last case of data file char field exceeding database size.

@freddy77
Copy link
Copy Markdown
Contributor

About fc14e3d, I look for possible error in documentation and I didn't find it, then I looked for possible messages from Sybase library and I didn't find it. At the end I wrote a test program and, surprise surprise, no errors is returned by either bcp_bind or BCP inserts. Basically besides values less than -1 (always error) and 0 (NULL) all other values are interpreted as the correct data type size. So, what about something like this:

	if (is_fixed_type(vartype) && varlen != 0)
		varlen = tds_get_size_by_type(vartype);

@mmcnabb-vms
Copy link
Copy Markdown
Contributor Author

About fc14e3d, I look for possible error in documentation and I didn't find it, then I looked for possible messages from Sybase library and I didn't find it. At the end I wrote a test program and, surprise surprise, no errors is returned by either bcp_bind or BCP inserts. Basically besides values less than -1 (always error) and 0 (NULL) all other values are interpreted as the correct data type size. So, what about something like this:

	if (is_fixed_type(vartype) && varlen != 0)
		varlen = tds_get_size_by_type(vartype);

Not sure about this - the function documentation (in bcp.c comment block ahead of bcp_bind) says that for fixed types, only -1 or 0 should be passed. As a programmer I would expect invalid inputs to give an error, rather than being silently "corrected" which could lead to buffer overruns or other unexpected behaviour.

I think the existing logic is correct, just recommending a more appropriate error message for invalid input.

Changing the function to set column_bindlen to a positive value instead of the current setting -1 for fixed types could possibly have impacts elsewhere in the code, we would need to review all uses of that value . Which seems like unnecessary work given that -1 is known to work correctly.

Although, looking through the usages of that value, I am not sure how offset * bindcol->column_bindlen is working when column_bindlen is -1 ...

Note - bcp_collen() also sets column_bindlen, without doing any validation.

@mmcnabb-vms
Copy link
Copy Markdown
Contributor Author

mmcnabb-vms commented May 25, 2026

Recapping this one: everything is merged except for fc14e3d , and the next group of 3 commits regarding reading the hostfile correctly for a "bcp in".

I think fc14e3d can be accepted as-is: it's an improvement of error message for when a user specifically calls bcp_bind with invalid arguments; as mentioned in my previous comment, as a programmer I would not consider it an improvement to accept the invalid arguments.

@mmcnabb-vms
Copy link
Copy Markdown
Contributor Author

mmcnabb-vms commented May 25, 2026

@freddy77 as this is now a fair way behind master, I'll recommend closing this PR, and I can reformulate and resubmit the three "bcp in" behaviour commits as a new PR, with better documentation and test cases.

I added format file testing to t0016 and can present a clearer overview of the problems and solutions now , as well as test cases for my suggested fixes .

@freddy77
Copy link
Copy Markdown
Contributor

freddy77 commented Jun 4, 2026

@mmcnabb-vms I kept a branch updated, see https://github.com/freddy77/freetds/tree/mmc_bcp_improve.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants