
Inconsistent behavior, implicit interpolation, and row-dropping when parsing .gef vs .xml files #426

@bro-wi

Hi Pygef team,

Thanks for your work on this package! I encountered some unexpected behavior while parsing CPT data using Pygef and wanted to report it, as I believe it could be improved or at least clarified in the documentation.

Summary of issues:

  1. Inconsistent parsing behavior between .gef and .xml files

    • When parsing .gef files, Pygef implicitly drops any row that contains invalid measurements (i.e., nulls in Polars), regardless of which column contains the null. This happens after interpolation (see point 2). In my case, rows were dropped due to invalid values in a column I didn’t need, which caused me to lose valid data points from the columns I did care about.

    • In contrast, for .xml files, only rows where coneResistance is null are dropped. Other nulls in other columns are retained.

    • This leads to inconsistent behavior depending on the file type: equivalent sources can produce different parsed datasets, which is unexpected. I would argue that dropping rows with nulls should be left to the user rather than handled implicitly by Pygef, since it directly affects the integrity of the original measurement data.

  2. Implicit interpolation of missing values

    • When parsing .gef files, missing values are automatically interpolated with Polars' interpolate(), without any indication or control exposed to the user.

    • For a tool that describes itself as a "simple parser," this kind of implicit data manipulation is surprising and may lead to misinterpretation of results.

    • .xml files are not interpolated in the same way, which adds to the inconsistency and raises questions, e.g., are missing measurements only assumed to exist in .gef files?

Suggestions:

  • If interpolation and/or row-dropping is intended, it should be explicit and optional, not silently applied in the background.

  • The documentation (or function signatures) should clearly indicate when and how interpolation or row filtering occurs.

  • Consider aligning the behavior between .gef and .xml formats, or at least documenting the differences clearly and justifying them.

Labels: enhancement, python
