…only empty strings, by working on the input dataframe and not an empty dataframe
…tions and to allow working around issues with pandas csv parsing and writing
Should something be mentioned in the CHANGELOG? If we merge this, the only user-visible changes will be the slightly adjusted example, the "support" for Python 3.11, and some type annotation improvements. In that sense, the CSV changes are not new features or behavior changes but fixes to achieve the expected behavior in various edge cases.
src/cadenzaanalytics/util/csv.py
Outdated
lines.append(_format_row(columns_list, columns_list, None, None, None, None))

# Write data rows
for _, row in df.iterrows():
iterrows is very slow. It would probably be 5-10x faster to use itertuples or transform into a numpy array like so:
values = df.to_numpy(dtype=object, na_value=None)
for row in values:
    lines.append(_format_row(list(row), ...))
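To make the two alternatives from this comment concrete, here is a minimal sketch. It assumes pandas is installed; the helper names `rows_via_itertuples` and `rows_via_numpy` and the sample frame are hypothetical and not part of the PR. The actual speedup depends on the data and should be measured rather than taken from this sketch.

```python
import pandas as pd


def rows_via_itertuples(df):
    # index=False drops the index column; name=None yields plain tuples
    # instead of namedtuples, which avoids some per-row overhead.
    return [list(row) for row in df.itertuples(index=False, name=None)]


def rows_via_numpy(df):
    # One conversion up front; missing values are replaced by None.
    values = df.to_numpy(dtype=object, na_value=None)
    return [list(row) for row in values]


# Hypothetical sample data, only for illustration.
df = pd.DataFrame({"id": [1, 2], "name": ["a", None]})
for row in rows_via_numpy(df):
    print(row)
```

The two variants may represent missing values differently (`itertuples` passes through whatever the column stores, while `to_numpy` with `na_value=None` normalizes them), so whichever is chosen needs to match what `_format_row` expects.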
# Quoted value - extract content (can contain newlines)
pos += 1
value = []
while pos < len(csv_data):
This looks at the whole payload character by character. For large data, this will be very slow.
We could use str.find() instead to find the next quote (should be implemented in C).
Something along the lines of:
while pos < len(csv_data):
    next_quote = csv_data.find('"', pos)
    if next_quote == -1:
        break  # no closing quote found
    value_parts.append(csv_data[pos:next_quote])
    pos = next_quote + 1
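A slightly fuller sketch of the same idea, for reference. It assumes RFC 4180-style escaping, where a doubled quote inside a quoted field stands for one literal quote; the function name `read_quoted_field` and its return shape are hypothetical and do not come from the PR.

```python
def read_quoted_field(csv_data: str, pos: int) -> tuple[str, int]:
    """Read a quoted field; ``pos`` is the index just after the opening quote.

    Returns the unescaped value and the index right after the closing quote.
    """
    parts = []
    while True:
        # Jump straight to the next quote instead of scanning char by char.
        next_quote = csv_data.find('"', pos)
        if next_quote == -1:
            raise ValueError("unterminated quoted field")
        parts.append(csv_data[pos:next_quote])
        if csv_data.startswith('""', next_quote):
            # Doubled quote: keep one literal quote and continue after it.
            parts.append('"')
            pos = next_quote + 2
        else:
            # Single quote: this closes the field.
            return "".join(parts), next_quote + 1


value, end = read_quoted_field('"a ""b""\nc",next', 1)
```

Collecting slices in a list and joining once at the end avoids repeated string concatenation, and embedded newlines need no special handling because only quotes are searched for.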
buddemat left a comment
I have two comments concerning performance, which I think should be addressed. I have not looked at the tests.
From what I read in the channels, both @julianjanssen and @ArneBab see some issues with the "full custom csv import" approach. We might want to discuss this once more?
…ead of custom reader
…ending on version
… to utc to have a stable output, always have pandas Timestamps and prevent issues with mixed timezone offsets
…rted and attempt to lower required version to 3.10
…about the cadenza server timezone, use the python server timezone as fallback
all_rows = _parse_csv_with_default_reader(csv_data) \
    if sys.version_info >= (3, 13) \
    else _parse_csv(csv_data)
for row in reader:
    all_rows.append(row)
return all_rows
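For context, a reader-based parse along the lines of the snippet above could look like the following minimal sketch. The function name, the comma delimiter, and the default dialect are assumptions for illustration; the reader configuration actually used in the PR is not visible in this excerpt.

```python
import csv
import io


def parse_with_default_reader(csv_data: str) -> list[list[str]]:
    # newline="" disables newline translation, so the csv module itself
    # decides how to treat line breaks, including those inside quoted fields.
    reader = csv.reader(io.StringIO(csv_data, newline=""))
    # list(reader) is equivalent to appending each row in a loop.
    return list(reader)


rows = parse_with_default_reader('a,b\r\n"x\ny",2\r\n')
```

Each row comes back as a list of strings, so any type conversion still has to happen in a later step.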
No description provided.