Provide option to not serialize parquet files to disk, and stream content into duckdb for statistics #56

@Urfoex

The current implementation saves dataframes to disk as parquet files and then uses those files to generate statistics via duckdb:
https://github.com/getml/getml-io/pull/49/files#diff-2e534974074df91c1edcab0c279d52e8228c7ec26346a746a97e90bcb582abbcR127

If a user does not want to save the dataframes explicitly, we need to stream them from getml into duckdb instead:
https://github.com/getml/getml-io/pull/19/files#diff-7a912f9ee2a1c8c724e374aa668d7cd394b96fa18db5b2fd912be63b092cf53eR60

  • add options so that a user can decide whether, and which, dataframes to store
  • add a method that generates statistics from the getml arrow stream instead of a parquet file for dataframes that are not stored
  • adjust the dataframe-information path to reflect that a dataframe might not have been saved
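
A rough sketch of how the three points above could fit together, not a proposal for the actual getml-io API. All names here (`StorageOptions`, `DataFrameInformation`, `plan_dataframe`, `summarize_arrow_stream`) are hypothetical. The sketch assumes the getml side can hand over a `pyarrow.RecordBatchReader`, and it uses duckdb's `SUMMARIZE` as a stand-in for whatever statistics queries are actually run.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class StorageOptions:
    """Hypothetical user options: whether, and which, dataframes to store."""

    store_all: bool = True
    # If set, only these dataframes are stored; takes precedence over store_all.
    only: Optional[FrozenSet[str]] = None

    def should_store(self, name: str) -> bool:
        if self.only is not None:
            return name in self.only
        return self.store_all


@dataclass(frozen=True)
class DataFrameInformation:
    """Hypothetical record; path is None when the dataframe was not saved."""

    name: str
    path: Optional[Path]
    statistics_source: str  # "parquet" or "arrow_stream"


def plan_dataframe(name: str, options: StorageOptions, target_dir) -> DataFrameInformation:
    """Decide where a dataframe's statistics come from, based on the options."""
    if options.should_store(name):
        parquet_path = Path(target_dir) / f"{name}.parquet"
        return DataFrameInformation(name, parquet_path, "parquet")
    return DataFrameInformation(name, None, "arrow_stream")


def summarize_arrow_stream(reader):
    """Compute statistics from an Arrow stream without writing parquet.

    Assumes `reader` is a pyarrow.RecordBatchReader. duckdb is imported
    lazily so the rest of the sketch works without it installed.
    """
    import duckdb

    con = duckdb.connect()  # in-memory database
    try:
        con.register("df_stream", reader)
        return con.execute("SUMMARIZE SELECT * FROM df_stream").fetchall()
    finally:
        con.close()
```

With `StorageOptions(only=frozenset({"population"}))`, only the `population` dataframe would get a parquet path; every other dataframe would carry `path=None` and be summarized from the stream. Note that a record batch reader can only be consumed once, so all statistics for a non-stored dataframe would have to come out of a single pass or query.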

Labels: enhancement
