Description
Backend
VL (Velox)
Bug description
The following SQL statements are used for the Hive write test.
DROP TABLE IF EXISTS hive_support;

CREATE TABLE IF NOT EXISTS hive_support (
  int_col INT,
  array_col ARRAY<INT>,
  map_col MAP<STRING, INT>,
  struct_col STRUCT<field1: INT, field2: STRING, field3: BOOLEAN>
)
PARTITIONED BY (part_col INT)
STORED AS PARQUET
TBLPROPERTIES (
  'parquet.compression' = 'ZSTD',
  'parquet.compression.zstd.level' = '5',
  'parquet.enable.dictionary' = 'false'
);

WITH number_seq AS (
  SELECT CAST(id AS INT) AS id_int
  FROM range(1, 50000000, 1, 100)
)
INSERT OVERWRITE TABLE hive_support
PARTITION (part_col)
SELECT
  id_int AS int_col,
  ARRAY(
    id_int % 10,
    id_int % 100,
    id_int % 1000
  ) AS array_col,
  MAP(
    CONCAT('key_', CAST(id_int % 5 AS STRING)), id_int % 100,
    CONCAT('key_', CAST((id_int + 1) % 5 AS STRING)), (id_int + 1) % 100
  ) AS map_col,
  named_struct(
    'field1', id_int,
    'field2', CONCAT('str_', CAST(id_int AS STRING)),
    'field3', id_int % 2 = 0
  ) AS struct_col,
  id_int % 10 AS part_col
FROM number_seq
SORT BY id_int;
The following error is reported during the execution:
26/03/10 20:50:46 WARN GlutenFallbackReporter: Validation failed for plan: WriteFiles[QueryId=4], due to: Unsupported native write: Found unsupported type:ArrayType,MapType,StructType.
The relevant validation code in backends-velox/src/main/scala/org/apache/gluten/backendsapi/velox/VeloxBackend.scala is as follows:
// Validate if all types are supported.
def validateDataTypes(): Option[String] = {
  val unsupportedTypes = format match {
    case _: ParquetFileFormat =>
      fields.flatMap {
        case StructField(_, _: YearMonthIntervalType, _, _) =>
          Some("YearMonthIntervalType")
        case StructField(_, _: StructType, _, _) =>
          Some("StructType")
        case _ => None
      }
    case _ =>
      fields.flatMap { field =>
        field.dataType match {
          case _: StructType => Some("StructType") // here 1
          case _: ArrayType => Some("ArrayType") // here 2
          case _: MapType => Some("MapType") // here 3
          case _: YearMonthIntervalType => Some("YearMonthIntervalType")
          case _ => None
        }
      }
  }
  if (unsupportedTypes.nonEmpty) {
    Some(unsupportedTypes.mkString("Found unsupported type:", ",", ""))
  } else {
    None
  }
}
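The check can be reproduced in isolation. Below is a minimal, self-contained sketch of the non-Parquet branch above, using stand-in type definitions (Spark's org.apache.spark.sql.types classes are not on the classpath here), showing why the repro table's schema trips the fallback:

```scala
// Stand-ins for Spark's DataType hierarchy; the names mirror Spark's
// classes but these are local definitions, not the real ones.
sealed trait DataType
case object IntType extends DataType
case class ArrayType(elem: DataType) extends DataType
case class MapType(key: DataType, value: DataType) extends DataType
case class StructType(fields: Seq[DataType]) extends DataType

// Mirrors the non-Parquet branch of validateDataTypes: every complex
// type is collected and reported as unsupported.
def validateDataTypes(fields: Seq[DataType]): Option[String] = {
  val unsupportedTypes = fields.flatMap {
    case _: StructType => Some("StructType")
    case _: ArrayType  => Some("ArrayType")
    case _: MapType    => Some("MapType")
    case _             => None
  }
  if (unsupportedTypes.nonEmpty) {
    Some(unsupportedTypes.mkString("Found unsupported type:", ",", ""))
  } else {
    None
  }
}

// Schema of the repro table: int_col, array_col, map_col, struct_col.
val schema = Seq(
  IntType,
  ArrayType(IntType),
  MapType(IntType, IntType),
  StructType(Seq(IntType)))

// Prints: Found unsupported type:ArrayType,MapType,StructType
println(validateDataTypes(schema).getOrElse("ok"))
```

The output matches the message in the GlutenFallbackReporter warning, which is what identifies this function as the source of the fallback.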
After the three marked lines ("here 1"–"here 3") are commented out, no fallback occurs and the data is written successfully.
The Velox version that Gluten downloads on branch-1.3 is https://github.com/oap-project/velox/tree/gluten-1.3.0, whose HiveDataSink::appendData does not support PARTITIONED BY or complex types. On the main branch, however, Gluten downloads https://github.com/IBM/velox/tree/dft-2026_03_10-iceberg, where HiveDataSink::appendData has been updated.
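Since support depends on the Velox version being built against, one possible direction (a sketch only, not Gluten's actual API; `supportsComplexTypeWrite` is a hypothetical capability flag, not an existing setting) is to gate the three complex-type cases on a backend capability rather than deleting them outright, so older Velox versions still fall back:

```scala
// Stand-ins for Spark's DataType hierarchy, as before; local
// definitions only, not the real Spark classes.
sealed trait DataType
case object IntType extends DataType
case class ArrayType(elem: DataType) extends DataType
case class MapType(key: DataType, value: DataType) extends DataType
case class StructType(fields: Seq[DataType]) extends DataType

// Hypothetical variant of the validation: complex types are only
// reported as unsupported when the linked Velox's HiveDataSink
// cannot write them.
def validateDataTypes(
    fields: Seq[DataType],
    supportsComplexTypeWrite: Boolean): Option[String] = {
  val unsupportedTypes = fields.flatMap {
    case _: StructType if !supportsComplexTypeWrite => Some("StructType")
    case _: ArrayType  if !supportsComplexTypeWrite => Some("ArrayType")
    case _: MapType    if !supportsComplexTypeWrite => Some("MapType")
    case _             => None
  }
  if (unsupportedTypes.nonEmpty) {
    Some(unsupportedTypes.mkString("Found unsupported type:", ",", ""))
  } else {
    None
  }
}

val schema = Seq(
  IntType,
  ArrayType(IntType),
  MapType(IntType, IntType),
  StructType(Seq(IntType)))

// With a capable Velox, the write is validated; with an older one,
// the same schema still triggers the fallback.
println(validateDataTypes(schema, supportsComplexTypeWrite = true))
println(validateDataTypes(schema, supportsComplexTypeWrite = false))
```

How such a capability would actually be detected (build flag, version check, or a Velox-side query) is an open question for the maintainers.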
I would also like to confirm whether there are any other cases that have not been considered.
Gluten version
main branch
Spark version
Spark-3.4.x
Spark configurations
--master yarn
--driver-cores 4
--driver-memory 8g
--num-executors 12
--executor-cores 4
--executor-memory 5g
--conf spark.memory.offHeap.enabled=true
--conf spark.memory.offHeap.size=20g
--conf spark.executor.memoryOverhead=5g
--conf spark.task.cpus=1
--conf spark.executor.extraJavaOptions="-XX:+UseG1GC -XX:ActiveProcessorCount=4 -Dio.netty.tryReflectionSetAccessible=true"
--conf spark.driver.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true"
--conf spark.locality.wait=0
--conf spark.driver.extraClassPath="${JAR_PATH}"
--conf spark.executor.extraClassPath="${JAR_PATH}"
--conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager
--conf spark.plugins=org.apache.gluten.GlutenPlugin
--conf spark.gluten.loadLibFromJar=false
--conf spark.gluten.sql.columnar.backend.lib=velox
--conf spark.gluten.sql.columnar.maxBatchSize=8192
--conf spark.gluten.sql.orc.charType.scan.fallback.enabled=false
--conf spark.gluten.sql.columnar.physicalJoinOptimizeEnable=true
--conf spark.gluten.sql.columnar.physicalJoinOptimizationLevel=19
--conf spark.gluten.sql.columnar.backend.velox.resizeBatches.shuffleInput=false
--conf spark.sql.adaptive.coalescePartitions.initialPartitionNum=48
--conf spark.default.parallelism=144
--conf spark.sql.shuffle.partitions=144
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer
--conf spark.sql.sources.parallelPartitionDiscovery.parallelism=60
--conf spark.network.timeout=600
--conf spark.sql.broadcastTimeout=600
--conf spark.sql.adaptive.enabled=false
--conf spark.sql.optimizer.runtime.bloomFilter.enabled=true
--conf spark.sql.hive.convertMetastoreParquet=false
--conf spark.sql.parquet.writeLegacyFormat=true
--conf spark.sql.hive.manageFilesourcePartitions=false
--conf spark.hadoop.hive.exec.dynamic.partition.mode=nonstrict
--conf spark.hadoop.hive.exec.max.dynamic.partitions=1000
--conf spark.hadoop.hive.exec.max.dynamic.partitions.pernode=1000
--conf spark.sql.parquet.enableVectorizedWriter=false
--conf spark.sql.parquet.enableVectorizedReader=false
--conf spark.sql.parquet.enable.dictionary=false
--conf spark.hadoop.parquet.enable.dictionary=false
--conf spark.io.compression.codec=zstd
--conf spark.sql.parquet.compression.codec=zstd
System information
Gluten Version: 1.7.0-SNAPSHOT
Commit: 625a476
CMake Version: 3.28.3
System: Linux-5.10.0-182.0.0.95.oe2203sp3.aarch64
Arch: aarch64
CPU Name: BIOS Model name: Kunpeng 920
C++ Compiler: /usr/bin/c++
C++ Compiler Version: 12.5.0
C Compiler: /usr/bin/cc
C Compiler Version: 12.5.0
CMake Prefix Path: /usr/local;/usr;/;/usr/local/lib64/python3.9/site-packages/cmake/data;/usr/local;/usr/X11R6;/usr/pkg;/opt
Relevant logs
26/03/10 20:50:46 WARN GlutenFallbackReporter: Validation failed for plan: WriteFiles[QueryId=4], due to: Unsupported native write: Found unsupported type:ArrayType,MapType.