Hi Pygef team,
Thanks for your work on this package! I encountered some unexpected behavior while parsing CPT data using Pygef and wanted to report it, as I believe it could be improved or at least clarified in the documentation.
Summary of issues:
1. Inconsistent parsing behavior between .gef and .xml files
   - When parsing .gef files, Pygef implicitly drops any row that contains invalid measurements (i.e., nulls in Polars), regardless of which column contains the null. This happens after interpolation (see point 2). In my case, rows were dropped due to invalid values in a column I didn't need, which caused me to lose valid data points from the columns I did care about.
   - In contrast, for .xml files, only rows where coneResistance is null are dropped; nulls in other columns are retained.
   - This leads to inconsistent behavior depending on the file type. It can result in different parsed datasets from equivalent sources, which is unexpected. I would argue that dropping rows with nulls should be left to the user, not handled implicitly by Pygef, since it directly affects the integrity of the original measurement data.
2. Implicit interpolation of missing values
   - When parsing .gef files, missing values are automatically interpolated using polars.interpolate(), without any indication or control exposed to the user.
   - For a tool that describes itself as a "simple parser," this kind of implicit data manipulation is surprising and may lead to misinterpretation of results.
   - .xml files are not interpolated in the same way, which adds to the inconsistency and raises questions (e.g., are missing measurements only assumed to exist in .gef files?).
Suggestions:
- If interpolation and/or row-dropping is intended, it should be explicit and optional, not silently applied in the background.
- The documentation (or function signatures) should clearly indicate when and how interpolation or row filtering occurs.
- Consider aligning the behavior between .gef and .xml formats, or at least documenting the differences clearly and justifying them.