[GLUTEN-10113][VL] Pass hadoop fs related session level configurations to native#12367
[GLUTEN-10113][VL] Pass hadoop fs related session level configurations to native#12367zhouyuan wants to merge 5 commits into
Conversation
Signed-off-by: Yuan <yuanzhou@apache.org>
|
Run Gluten Clickhouse CI on x86 |
e37d128 to
1d6f087
Compare
There was a problem hiding this comment.
Pull request overview
This PR extends the Velox backend’s partition serialization so that Hadoop filesystem-related session configuration (notably fs.azure.*, fs.s3a.*, fs.gs.*) is captured on the driver and carried across the RDD partition boundary to executors, then forwarded into the native runtime via extraConf.
Changes:
- Added an
fsConf: Map[String, String]field toGlutenPartitionto carry selected Hadoop FS config across to executors. - Updated
VeloxIteratorApi.genPartitionsto collectfs.*keys from the driver-side Hadoop configuration and embed them into eachGlutenPartition. - Updated
VeloxIteratorApi.genFirstStageIteratorto mergeGlutenPartition.fsConfinto the nativeextraConf, and added a new Velox UT suite to validate propagation behavior.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| gluten-substrait/src/main/scala/org/apache/gluten/execution/GlutenWholeStageColumnarRDD.scala | Adds GlutenPartition.fsConf to transport FS-related Hadoop config across partition boundaries. |
| backends-velox/src/main/scala/org/apache/gluten/backendsapi/velox/VeloxIteratorApi.scala | Captures driver-side FS Hadoop config into partitions and forwards it into native extraConf. |
| backends-velox/src/test/scala/org/apache/gluten/backendsapi/velox/VeloxIteratorApiFsConfSuite.scala | Adds unit tests verifying that selected fs.* keys are embedded into GlutenPartition.fsConf. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Run Gluten Clickhouse CI on x86 |
|
Run Gluten Clickhouse CI on x86 |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.
Comments suppressed due to low confidence (1)
gluten-substrait/src/main/scala/org/apache/gluten/execution/GlutenWholeStageColumnarRDD.scala:51
GlutenPartitionnow containsfsConf, which can include filesystem credentials (e.g., S3/ABFS secrets). Because this is a case class, the defaulttoStringwill include the fullfsConfmap and may leak secrets into Spark logs / exception messages / UI when partitions are stringified. OverridetoStringto avoid printing values (e.g., print only keys).
override def preferredLocations(): Array[String] =
splitInfos.flatMap(_.preferredLocations().asScala)
}
|
Run Gluten Clickhouse CI on x86 |
2220efc to
9b41c7f
Compare
|
Run Gluten Clickhouse CI on x86 |
| val partitionFsConf = inputPartition.asInstanceOf[GlutenPartition].fsConf | ||
| val extraConf = (partitionFsConf + | ||
| (GlutenConfig.COLUMNAR_CUDF_ENABLED.key -> enableCudf.toString)).asJava | ||
| val transKernel = NativePlanEvaluator.create(BackendsApiManager.getBackendName, extraConf) |
| "fs.azure.account.auth.type.myaccount.dfs.core.windows.net" -> "OAuth", | ||
| "fs.azure.account.oauth.provider.type" -> "ClientCredentials" | ||
| ) { | ||
| val partitions = api.genPartitions(emptyWsCtx, Seq(Seq.empty), Seq.empty) |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
|
Run Gluten Clickhouse CI on x86 |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (1)
gluten-substrait/src/main/scala/org/apache/gluten/execution/GlutenWholeStageColumnarRDD.scala:51
GlutenPartitionis a case class, so its generatedtoStringwill include the fullfsConfmap (including credential values). This can leak secrets via exceptions/logging (e.g.,$splitstring interpolation) or Spark UI/debug output. Consider overridingtoStringto avoid printing values (e.g., only print the keys).
override def preferredLocations(): Array[String] =
splitInfos.flatMap(_.preferredLocations().asScala)
}
| val fsPrefixes = Seq("fs.azure.", "fs.s3a.", "fs.gs.") | ||
| val hadoopConf = leaves.headOption | ||
| .map(_ => org.apache.spark.sql.SparkSession.active.sessionState.newHadoopConf()) | ||
| .getOrElse(org.apache.spark.SparkContext.getOrCreate().hadoopConfiguration) |
| val partitionFsConf = inputPartition.asInstanceOf[GlutenPartition].fsConf | ||
| val extraConf = (partitionFsConf + | ||
| (GlutenConfig.COLUMNAR_CUDF_ENABLED.key -> enableCudf.toString)).asJava | ||
| val transKernel = NativePlanEvaluator.create(BackendsApiManager.getBackendName, extraConf) | ||
|
|
| test("genPartitions embeds fs.s3a.* keys from Hadoop conf into GlutenPartition.fsConf") { | ||
| withHadoopConf( | ||
| "fs.s3a.access.key" -> "AKIAIOSFODNN7EXAMPLE", | ||
| "fs.s3a.secret.key" -> "wJalrXUtnFEMI" | ||
| ) { | ||
| val partitions = api.genPartitions(emptyWsCtx, Seq(Seq.empty), Seq.empty) | ||
| assert(partitions.size == 1) | ||
| val fsConf = partitions.head.asInstanceOf[GlutenPartition].fsConf | ||
| assert( | ||
| fsConf.contains("fs.s3a.access.key"), | ||
| s"Expected fs.s3a.access.key not found; got: ${fsConf.keys.mkString(", ")}") | ||
| assert(fsConf("fs.s3a.access.key") == "AKIAIOSFODNN7EXAMPLE") | ||
| assert(fsConf.contains("fs.s3a.secret.key")) | ||
| assert(fsConf("fs.s3a.secret.key") == "wJalrXUtnFEMI") | ||
| } | ||
| } |
| test("genPartitions embeds fs.azure.* keys from Hadoop conf into GlutenPartition.fsConf") { | ||
| withHadoopConf( | ||
| "fs.azure.account.auth.type.myaccount.dfs.core.windows.net" -> "OAuth", | ||
| "fs.azure.account.oauth.provider.type" -> "ClientCredentials" | ||
| ) { | ||
| val partitions = api.genPartitions(emptyWsCtx, Seq(Seq.empty), Seq.empty) |
What changes are proposed in this pull request?
This patch adds the runtime hadoop related configurations to native.
How was this patch tested?
new UT
Was this patch authored or co-authored using generative AI tooling?
IBM BoB
Related issue: #10113