
[VL] [BUG] Complex types already supported in Velox are considered not supported by Gluten #11746

@VvanFalleaves

Description


Backend

VL (Velox)

Bug description

The following SQL statements are used for the Hive write test:

DROP TABLE IF EXISTS hive_support;
CREATE TABLE IF NOT EXISTS hive_support (
  int_col INT,
  array_col ARRAY<INT>,
  map_col MAP<STRING, INT>,
  struct_col STRUCT<field1: INT, field2: STRING, field3: BOOLEAN>
)
PARTITIONED BY (part_col INT)
STORED AS PARQUET
TBLPROPERTIES (
  'parquet.compression' = 'ZSTD',
  'parquet.compression.zstd.level' = '5',
  'parquet.enable.dictionary' = 'false'
);
WITH number_seq AS (
  SELECT CAST(id AS INT) AS id_int
  FROM range(1, 50000000, 1, 100)
)
INSERT OVERWRITE TABLE hive_support
PARTITION (part_col) 
SELECT
  id_int AS int_col,
  ARRAY(
    id_int % 10,
    id_int % 100,
    id_int % 1000
  ) AS array_col,
  MAP(
    CONCAT('key_', CAST(id_int % 5 AS STRING)), id_int % 100,
    CONCAT('key_', CAST((id_int + 1) % 5 AS STRING)), (id_int + 1) % 100
  ) AS map_col,
  named_struct(
    'field1', id_int,
    'field2', CONCAT('str_', CAST(id_int AS STRING)),
    'field3', id_int % 2 = 0
  ) AS struct_col,
  id_int % 10 AS part_col
FROM number_seq
SORT BY id_int;

The following error is reported during execution:

26/03/10 20:50:46 WARN GlutenFallbackReporter: Validation failed for plan: WriteFiles[QueryId=4], due to: Unsupported native write: Found unsupported type:ArrayType,MapType,StructType.

The code in backends-velox/src/main/scala/org/apache/gluten/backendsapi/velox/VeloxBackend.scala is as follows:

    // Validate if all types are supported.
    def validateDataTypes(): Option[String] = {
      val unsupportedTypes = format match {
        case _: ParquetFileFormat =>
          fields.flatMap {
            case StructField(_, _: YearMonthIntervalType, _, _) =>
              Some("YearMonthIntervalType")
            case StructField(_, _: StructType, _, _) =>
              Some("StructType")
            case _ => None
          }
        case _ =>
          fields.flatMap {
            field =>
              field.dataType match {
                case _: StructType => Some("StructType")  // here 1
                case _: ArrayType => Some("ArrayType")    // here 2
                case _: MapType => Some("MapType")        // here 3
                case _: YearMonthIntervalType => Some("YearMonthIntervalType")
                case _ => None
              }
          }
      }
      if (unsupportedTypes.nonEmpty) {
        Some(unsupportedTypes.mkString("Found unsupported type:", ",", ""))
      } else {
        None
      }
    }

After commenting out the three marked lines, no fallback occurs and the data is written successfully.
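Rather than deleting the checks outright, a more targeted fix might be to validate complex types recursively, falling back only when a nested field contains a genuinely unsupported leaf type such as YearMonthIntervalType. Below is a minimal sketch of that idea; it uses simplified stand-ins for Spark's `org.apache.spark.sql.types` hierarchy, so names like `ArrayType(elementType)` here are illustrative, not the real Spark constructors:

```scala
// Simplified stand-ins for Spark's org.apache.spark.sql.types hierarchy.
sealed trait DataType
case object IntegerType extends DataType
case object StringType extends DataType
case object BooleanType extends DataType
case object YearMonthIntervalType extends DataType
case class ArrayType(elementType: DataType) extends DataType
case class MapType(keyType: DataType, valueType: DataType) extends DataType
case class StructType(fields: Seq[DataType]) extends DataType

object NativeWriteValidator {
  // Recursively collect unsupported leaf types instead of rejecting
  // every complex type outright.
  def unsupported(dt: DataType): Seq[String] = dt match {
    case YearMonthIntervalType => Seq("YearMonthIntervalType")
    case ArrayType(elem)       => unsupported(elem)
    case MapType(key, value)   => unsupported(key) ++ unsupported(value)
    case StructType(fields)    => fields.flatMap(unsupported)
    case _                     => Nil
  }

  // Mirrors the shape of validateDataTypes in VeloxBackend.scala:
  // returns Some(message) on failure, None when all types pass.
  def validateDataTypes(fields: Seq[DataType]): Option[String] = {
    val bad = fields.flatMap(unsupported).distinct
    if (bad.nonEmpty) Some(bad.mkString("Found unsupported type:", ",", ""))
    else None
  }
}
```

With this shape, `ARRAY<INT>` and `MAP<STRING, INT>` columns pass validation, while a column like `ARRAY<YearMonthIntervalType>` would still trigger a fallback with the same message format the reporter logs today.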

On branch-1.3, Gluten downloads Velox from https://github.com/oap-project/velox/tree/gluten-1.3.0, where HiveDataSink::appendData supports neither PARTITIONED BY nor complex types. On the main branch, however, Gluten downloads Velox from https://github.com/IBM/velox/tree/dft-2026_03_10-iceberg, where HiveDataSink::appendData has been updated.

I would also like to confirm whether there are any other cases that have not been considered.

Gluten version

main branch

Spark version

Spark-3.4.x

Spark configurations

  --master yarn
  --driver-cores 4
  --driver-memory 8g
  --num-executors 12
  --executor-cores 4
  --executor-memory 5g
  --conf spark.memory.offHeap.enabled=true
  --conf spark.memory.offHeap.size=20g
  --conf spark.executor.memoryOverhead=5g
  --conf spark.task.cpus=1
  --conf spark.executor.extraJavaOptions="-XX:+UseG1GC -XX:ActiveProcessorCount=4 -Dio.netty.tryReflectionSetAccessible=true"
  --conf spark.driver.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true"
  --conf spark.locality.wait=0
  --conf spark.driver.extraClassPath="${JAR_PATH}"
  --conf spark.executor.extraClassPath="${JAR_PATH}"
  --conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager
  --conf spark.plugins=org.apache.gluten.GlutenPlugin
  --conf spark.gluten.loadLibFromJar=false
  --conf spark.gluten.sql.columnar.backend.lib=velox
  --conf spark.gluten.sql.columnar.maxBatchSize=8192
  --conf spark.gluten.sql.orc.charType.scan.fallback.enabled=false
  --conf spark.gluten.sql.columnar.physicalJoinOptimizeEnable=true
  --conf spark.gluten.sql.columnar.physicalJoinOptimizationLevel=19
  --conf spark.gluten.sql.columnar.backend.velox.resizeBatches.shuffleInput=false
  --conf spark.sql.adaptive.coalescePartitions.initialPartitionNum=48
  --conf spark.default.parallelism=144
  --conf spark.sql.shuffle.partitions=144
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
  --conf spark.sql.sources.parallelPartitionDiscovery.parallelism=60
  --conf spark.network.timeout=600
  --conf spark.sql.broadcastTimeout=600
  --conf spark.sql.adaptive.enabled=false
  --conf spark.sql.optimizer.runtime.bloomFilter.enabled=true
  --conf spark.sql.hive.convertMetastoreParquet=false
  --conf spark.sql.parquet.writeLegacyFormat=true
  --conf spark.sql.hive.manageFilesourcePartitions=false
  --conf spark.hadoop.hive.exec.dynamic.partition.mode=nonstrict
  --conf spark.hadoop.hive.exec.max.dynamic.partitions=1000
  --conf spark.hadoop.hive.exec.max.dynamic.partitions.pernode=1000
  --conf spark.sql.parquet.enableVectorizedWriter=false
  --conf spark.sql.parquet.enableVectorizedReader=false
  --conf spark.sql.parquet.enable.dictionary=false
  --conf spark.hadoop.parquet.enable.dictionary=false
  --conf spark.io.compression.codec=zstd
  --conf spark.sql.parquet.compression.codec=zstd

System information

Gluten Version: 1.7.0-SNAPSHOT
Commit: 625a476
CMake Version: 3.28.3
System: Linux-5.10.0-182.0.0.95.oe2203sp3.aarch64
Arch: aarch64
CPU Name: BIOS Model name: Kunpeng 920
C++ Compiler: /usr/bin/c++
C++ Compiler Version: 12.5.0
C Compiler: /usr/bin/cc
C Compiler Version: 12.5.0
CMake Prefix Path: /usr/local;/usr;/;/usr/local/lib64/python3.9/site-packages/cmake/data;/usr/local;/usr/X11R6;/usr/pkg;/opt

Relevant logs

26/03/10 20:50:46 WARN GlutenFallbackReporter: Validation failed for plan: WriteFiles[QueryId=4], due to: Unsupported native write: Found unsupported type:ArrayType,MapType.

Metadata

Labels

bug (Something isn't working), triage
