BCP Out performance#705
Conversation
Signed-off-by: Matt McNabb <matthew.mcnabb@vmssoftware.com>
MSVC profiler: reduce time spent in fwrite from 41% to 17.3% for a test case of 400MB of data. Signed-off-by: Matt McNabb <matthew.mcnabb@vmssoftware.com>
Increase log performance by factor of 3. Set TDSDUMP_FFLUSH to anything, to flush after each line. Should only be needed when debugging segfaults. Signed-off-by: Matt McNabb <matthew.mcnabb@vmssoftware.com>
MSVC profiler shows 5% gain in bcp -c from this change Signed-off-by: Matt McNabb <matthew.mcnabb@vmssoftware.com>
Signed-off-by: Matt McNabb <matthew.mcnabb@vmssoftware.com>
MSVC profiler showed 35% in tds_strftime for a test case. Now 2.7% spent in _bcp_strftime. Signed-off-by: Matt McNabb <matthew.mcnabb@vmssoftware.com>
Gain of about 1% from baseline. Signed-off-by: Matt McNabb <matthew.mcnabb@vmssoftware.com>
| break; /* found */ | ||
| } | ||
| } | ||
| #endif /* ENABLE_EXTRA_CHECKS */ |
There was a problem hiding this comment.
This does not seem right. The "extra checks" are expected to be expensive and we don't want to risk having in the final builds. It is this some kind of leftover of some middle development?
There was a problem hiding this comment.
The idea is to let you define ENABLE_EXTRA_CHECKS for a .c file you're working on without having to turn it on for the whole project (and rebuild the whole project). Not really important but I found it useful while testing _bcp_strftime
| size_t tds_u32toafast_right(char out[10], uint32_t v); | ||
| size_t tds_u32toafast(char out[10], uint32_t v); | ||
| size_t tds_i32toafast(char out[11], int32_t v); | ||
| void tds_02dfast(char out[2], int v); /* v must be 0-99 (non negative) */ |
There was a problem hiding this comment.
Looking at the caller and to avoid "surprises" why not saying that range is 0-63 and do a & 63 in tds_02dfast ?
I would also use an underscore before "fast", like tds_u32toa_fast_right.
There was a problem hiding this comment.
OK, although it might also be surprising to another coder who assumed 02d would mean 0-99 ! perhaps a more meaningful function name is possible.
Note that tds_u32toafast_right does use the full 0-99 capability of the underlying function
There was a problem hiding this comment.
What about
v = TDS_CLAMP(v, 0, 99);??
There was a problem hiding this comment.
What about
v = TDS_CLAMP(v, 0, 99);??
OK but for performance I think it would be best to to only have this on code paths where it is needed (e.g. tds_u32toafast_right does not need it since that call site guarantees the value is in range already). So perhaps it could go in tds_02dfast but not in tds02dfast. And include a TDS_LIKELY.
You would have noticed I had two functions tds02dfast and tds_02dfast, this is so that tds02dfast can be inlined in convert.c . It would be nice to have it inlinable everywhere but I wasn't sure about the support for having inline function bodies in header files across all the compilers and linkers that we need to support.
There was a problem hiding this comment.
That's why I was suggesting the v &= 63 for tds_02dfast, it's just a single machine instruction with no conditions.
|
|
||
| /* more characters are required because we replace %z with up to 7 digits */ | ||
| our_format = tds_new(char, strlen(format) + 1 + 5 + 1); | ||
| our_format = tds_new(char, strlen(format) + 7); |
There was a problem hiding this comment.
I hope any modern compiler could optimize the above without hints, otherwise I would use
our_format = tds_new(char, strlen(format) + (1 + 5 + 1));just to be on the safer side.
There was a problem hiding this comment.
OK. And maybe expand the comment a bit to explain what the 1 + 5 + 1 actually represents (we can work it out of course but it'd make for quicker reading of the code)
| if (pad > 3) | ||
| memset(buf + 3, '0', pad - 3); |
There was a problem hiding this comment.
In another hunk of code you first set all buffer and then use tds_u32toafast_right. I think filling all buffer with 0es should be faster with most modern compilers as the compiler will do the filling using less instructions.
|
|
||
| static FILE* tdsdump_append(void); | ||
|
|
||
| static bool g_dumpfile_fflush = false; |
There was a problem hiding this comment.
I think this file is the only file using the g_ prefix, I would avoid extending this style.
| } | ||
|
|
||
| /** Fast strftime for default bcp format */ | ||
| static int _bcp_strftime(TDS_UCHAR* buf, size_t maxsize, const char* format, const TDSDATEREC* dr, int prec) |
There was a problem hiding this comment.
Given that buf is always converted to char * I would declare this as char *. Maybe renaming cbuf to pbuf. I can do it.
There was a problem hiding this comment.
OK. I was thinking in terms of avoiding the need for a cast at the call site.
| /* Fill buffer from the end so it comes out in the right order */ | ||
| while (v >= 100u) { | ||
| uint32_t q = v / 100u; | ||
| uint32_t r = v - q * 100u; |
There was a problem hiding this comment.
Why not
uint32_t r = v % 100u;
```
?? Modern compiler should be able to optimize depending on specific processor.There was a problem hiding this comment.
OK. I have checked with VSI C on Itanium and both versions do compute r using a mod-100 "magic multiplication", although the generated assembly isn't exactly identical for the entire computation .
| p -= 2; | ||
| tds02dfast(p, v); | ||
| } | ||
| else if (v > 0u) |
There was a problem hiding this comment.
Does this mean that 0 will be formatted as "" (empty string) ?
There was a problem hiding this comment.
Yes, which is fine for the existing call sites. I left that case out to avoid an extra branch that wasn't needed but if you want to add in that case I doubt it would have an measurable performance penalty.
There was a problem hiding this comment.
Yes, but you are using this function when converting a number to string so 0 will be converted to "" instead of "0". The code difference is just to remove the last if (v > 0u).
|
Merged part of the PR. |
|
Merged log part |
|
Merged, minor changes (mainly style). |
This PR also includes the MS DATETIME2 bcp precision fix, from #704 , but otherwise should be independent of my other two BCP PRs.
Speed of my bcp out test case increased by nearly factor of 2 in Windows, and factor of 10 in VMS. have included individual performance gains in the commit message for each one. Profiling showed more than 30% of runtime spent in tds_strftime on both systems.
I tried applying the output buffer stream to the log file but it didn't help (replacing vfprintf with vasprintf/fwrite hurts).