[SYNPY-1749]Allow quote, apostrophe and ellipsis in store_row_async#1316
[SYNPY-1749]Allow quote, apostrophe and ellipsis in store_row_async#1316
Conversation
|
@danlu1 Is this still WIP, or are you looking for reviews? |
|
@andrewelamb sorry I should have marked this a draft. |
…ctly when upload data from a dataframe
…ger output json string
|
The integration test failures are in the recordset and submission modules and do not appear to be related to my changes. |
linglp
left a comment
There was a problem hiding this comment.
I think overall it looks good. The tests can be consolidated a bit to test all the edge cases in fewer integration tests to improve performance, and the docstring can be updated to reflect the new state of the code since json.dumps() was removed. There's also some logic that can be simplified in the redundant checks where sample_values is created but never actually used. The function name could also be more descriptive of what it actually does now (sanitizing special values rather than just converting dtypes)
tests/integration/synapseclient/models/async/test_table_async.py
Outdated
Show resolved
Hide resolved
|
|
||
| import pandas as pd | ||
| import pytest | ||
| from pandas.api.types import is_object_dtype |
There was a problem hiding this comment.
I think if you merge develop to this branch, you should be able to see that we are no longer adding tests to synchronous folder any more.
There was a problem hiding this comment.
That is right, Sorry for the code conflicts that you'll need to deal with @danlu1
| @@ -141,6 +141,7 @@ def row_labels_from_rows(rows: List[Row]) -> List[Row]: | |||
| def convert_dtypes_to_json_serializable(df): | |||
| to_csv_kwargs: Additional arguments to pass to the `pd.DataFrame.to_csv` | ||
| function when writing the data to a CSV file. | ||
| """ | ||
| # Serializes dict/list values to JSON strings |
There was a problem hiding this comment.
I think the comment here is misleading. The function no longer converts dict/list values to JSON strings. I think in your function, the dict/list values remain as dict/list objects
| df.iloc[offset_start:end].to_csv( | ||
| buffer, | ||
| mode="a", | ||
| header=False, | ||
| index=False, | ||
| float_format="%.12g", | ||
| doublequote=False, | ||
| escapechar="\\", | ||
| quoting=0, |
There was a problem hiding this comment.
Should this instead be a passthrough for the settings that the user has defined for the doublequote, escapechar, when they called the row upsert or store method(s)? I am seeing that it is currently used to go from CSV -> DataFrame, but then we are not using them to go from DataFrame -> CSV (And we probably should in some capacity). We could then default them as we do today if not passed in, but it would allow a user to specify them.
There was a problem hiding this comment.
I tried implementing a passthrough when calling the store method, but it didn’t work. The likely root cause is a bug: to_csv_kwargs isn’t being passed into _stream_and_update_from_df, so downstream steps never get a chance to use it. I added the parameters here because Synapse only allows a limited set of options, similar to how we handled float_format. That said, the ideal solution is to support to_csv_kwargs directly for DataFrames. I’m going to add support for to_csv_kwargs.
BryanFauble
left a comment
There was a problem hiding this comment.
This is looking great, once we get the last few items handled (and develop merged in), I can approve!
Problem:
A JSON serialization issue occurs when a DataFrame passed to store_row_async contains a list or dictionary with strings that include both double quotes and apostrophes.
Solution:
Restrict
to_csv_kwargsinget_partial_dataframe_chunk, which is the final step before uploading the data to Synapse to obtain the file handle.Testing:
Unit test and integration test have been added.