Add stable columnar file format strings#15078
Conversation
Greptile SummaryThis PR adds stable
Confidence Score: 4/5The change is additive and low-risk — only toString overrides are modified, with no effect on GPU write logic or resource management. Three of the four format strings are clearly distinct and correct. The unresolved question is whether GpuHiveParquetFileFormat should return Parquet (same as GpuParquetFileFormat) or something more specific like HiveParquet — event-log consumers currently cannot distinguish a Hive-managed Parquet write from a plain Parquet write. The prior review thread raised this and it has not yet been addressed. GpuHiveFileFormat.scala and ColumnarFileFormatSuite.scala warrant a second look regarding the GpuHiveParquetFileFormat label choice. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[ColumnarFileFormat.toString] --> B{Which class?}
B --> C[GpuParquetFileFormat]
B --> D[GpuHiveParquetFileFormat]
B --> E[GpuOrcFileFormat]
B --> F[GpuHiveTextFileFormat]
C --> G[Parquet]
D --> H[Parquet]
E --> I[ORC]
F --> J[HiveText]
G -. same string, ambiguous in event logs .-> H
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
A[ColumnarFileFormat.toString] --> B{Which class?}
B --> C[GpuParquetFileFormat]
B --> D[GpuHiveParquetFileFormat]
B --> E[GpuOrcFileFormat]
B --> F[GpuHiveTextFileFormat]
C --> G[Parquet]
D --> H[Parquet]
E --> I[ORC]
F --> J[HiveText]
G -. same string, ambiguous in event logs .-> H
Reviews (1): Last reviewed commit: "Add stable file format strings" | Re-trigger Greptile |
| class GpuHiveParquetFileFormat(compType: CompressionType) extends ColumnarFileFormat | ||
| with Logging with Serializable { | ||
|
|
||
| override def toString: String = "Parquet" |
There was a problem hiding this comment.
GpuHiveParquetFileFormat and GpuParquetFileFormat share the same toString value
Both classes now return "Parquet", making them indistinguishable by string representation. A consumer reading event logs that contain both a Hive-managed Parquet write and a plain Parquet write will see "Parquet" for both with no way to tell which path was taken. Consider returning "HiveParquet" instead to preserve the distinction the class name already encodes.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Signed-off-by: Nanook <nanookclaw@users.noreply.github.com>
a728c81 to
53df481
Compare
Fixes #12226.
Description
This adds stable
toStringvalues for the four GPUColumnarFileFormatimplementations called out in the issue. Event logs previously rendered these objects with the JVM default form, such ascom.nvidia.spark.rapids.GpuParquetFileFormat@..., which made theformatfield noisy and coupled downstream consumers to RAPIDS implementation class names.After this change, Parquet formats render as
Parquet, ORC asORC, and Hive text asHiveText. These names line up with the existing file-format labels used by the plugin and avoid exposing package/class details in Spark event logs.I also added a small ScalaTest suite that directly checks the string representation for
GpuParquetFileFormat,GpuOrcFileFormat,GpuHiveParquetFileFormat, andGpuHiveTextFileFormat.Local verification:
git diff --checkI could not run the focused ScalaTest command locally because this environment does not have
mvn,mvnw,scala,scalac, orsbtinstalled.Checklists
Documentation
Testing
(Please provide the names of the existing tests in the PR description.)
Performance