Skip to content

Add stable columnar file format strings#15078

Open
nanookclaw wants to merge 1 commit into
NVIDIA:mainfrom
nanookclaw:fix/columnar-format-pretty-string
Open

Add stable columnar file format strings#15078
nanookclaw wants to merge 1 commit into
NVIDIA:mainfrom
nanookclaw:fix/columnar-format-pretty-string

Conversation

@nanookclaw

Copy link
Copy Markdown

Fixes #12226.

Description

This adds stable toString values for the four GPU ColumnarFileFormat implementations called out in the issue. Event logs previously rendered these objects with the JVM default form, such as com.nvidia.spark.rapids.GpuParquetFileFormat@..., which made the format field noisy and coupled downstream consumers to RAPIDS implementation class names.

After this change, Parquet formats render as Parquet, ORC as ORC, and Hive text as HiveText. These names line up with the existing file-format labels used by the plugin and avoid exposing package/class details in Spark event logs.

I also added a small ScalaTest suite that directly checks the string representation for GpuParquetFileFormat, GpuOrcFileFormat, GpuHiveParquetFileFormat, and GpuHiveTextFileFormat.

Local verification:

  • git diff --check

I could not run the focused ScalaTest command locally because this environment does not have mvn, mvnw, scala, scalac, or sbt installed.

Checklists

Documentation

  • Updated for new or modified user-facing features or behaviors
  • No user-facing change

Testing

  • Added or modified tests to cover new code paths
  • Covered by existing tests
    (Please provide the names of the existing tests in the PR description.)
  • Not required

Performance

  • Tests ran and results are added in the PR description
  • Issue filed with a link in the PR description
  • Not required

@greptile-apps

greptile-apps Bot commented Jun 14, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds stable toString overrides to the four GPU ColumnarFileFormat implementations so that event logs show human-readable format names ("Parquet", "ORC", "HiveText") instead of JVM default object identifiers. A new ColumnarFileFormatSuite is introduced to lock in the string values.

  • GpuParquetFileFormat"Parquet", GpuOrcFileFormat"ORC", GpuHiveTextFileFormat"HiveText": all three are distinct and sensible.
  • GpuHiveParquetFileFormat also returns "Parquet", making it indistinguishable from GpuParquetFileFormat in event logs — this ambiguity was flagged in a prior review thread and remains unresolved.
  • The new test suite uses plain AnyFunSuite (no SparkSession required) and covers all four classes; no GPU resources are allocated, so no afterAll cleanup is needed.

Confidence Score: 4/5

The change is additive and low-risk — only toString overrides are modified, with no effect on GPU write logic or resource management.

Three of the four format strings are clearly distinct and correct. The unresolved question is whether GpuHiveParquetFileFormat should return Parquet (same as GpuParquetFileFormat) or something more specific like HiveParquet — event-log consumers currently cannot distinguish a Hive-managed Parquet write from a plain Parquet write. The prior review thread raised this and it has not yet been addressed.

GpuHiveFileFormat.scala and ColumnarFileFormatSuite.scala warrant a second look regarding the GpuHiveParquetFileFormat label choice.

Important Files Changed

Filename Overview
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuParquetFileFormat.scala Adds override def toString: String = Parquet to GpuParquetFileFormat; minimal, correct, no logic changes.
sql-plugin/src/main/scala/org/apache/spark/sql/hive/rapids/GpuHiveFileFormat.scala Adds toString to both GpuHiveParquetFileFormat (Parquet) and GpuHiveTextFileFormat (HiveText); the Hive Parquet label is indistinguishable from plain Parquet (already flagged in a prior review thread).
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuOrcFileFormat.scala Adds override def toString: String = ORC to GpuOrcFileFormat and bumps the copyright year to 2026; both changes are correct.
tests/src/test/scala/com/nvidia/spark/rapids/ColumnarFileFormatSuite.scala New AnyFunSuite suite that asserts the four toString values; covers the new code paths, but the test for GpuHiveParquetFileFormat asserts Parquet making it identical to the GpuParquetFileFormat assertion, which does not distinguish between the two formats.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[ColumnarFileFormat.toString] --> B{Which class?}
    B --> C[GpuParquetFileFormat]
    B --> D[GpuHiveParquetFileFormat]
    B --> E[GpuOrcFileFormat]
    B --> F[GpuHiveTextFileFormat]
    C --> G[Parquet]
    D --> H[Parquet]
    E --> I[ORC]
    F --> J[HiveText]
    G -. same string, ambiguous in event logs .-> H
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
    A[ColumnarFileFormat.toString] --> B{Which class?}
    B --> C[GpuParquetFileFormat]
    B --> D[GpuHiveParquetFileFormat]
    B --> E[GpuOrcFileFormat]
    B --> F[GpuHiveTextFileFormat]
    C --> G[Parquet]
    D --> H[Parquet]
    E --> I[ORC]
    F --> J[HiveText]
    G -. same string, ambiguous in event logs .-> H
Loading

Reviews (1): Last reviewed commit: "Add stable file format strings" | Re-trigger Greptile

class GpuHiveParquetFileFormat(compType: CompressionType) extends ColumnarFileFormat
with Logging with Serializable {

override def toString: String = "Parquet"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 GpuHiveParquetFileFormat and GpuParquetFileFormat share the same toString value

Both classes now return "Parquet", making them indistinguishable by string representation. A consumer reading event logs that contain both a Hive-managed Parquet write and a plain Parquet write will see "Parquet" for both with no way to tell which path was taken. Consider returning "HiveParquet" instead to preserve the distinction the class name already encodes.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Signed-off-by: Nanook <nanookclaw@users.noreply.github.com>
@nanookclaw nanookclaw force-pushed the fix/columnar-format-pretty-string branch from a728c81 to 53df481 Compare June 14, 2026 00:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] ColumnarFileFormat does not have a pretty string

2 participants