39 commits
78c7091
[SPARK-52780] Add ToLocalIterator and Arrow Record Streaming
caldempsey Jul 13, 2025
5e0a589
[debug] a case where context cancellations result in a panic
caldempsey Jul 13, 2025
c277f5b
[SPARK-52780] fix test compilation
caldempsey Jul 13, 2025
7ce5d47
[SPARK-52780] TestRowIterator_BothChannelsClosedCleanly should EOF (D…
caldempsey Jul 13, 2025
2b6044a
[SPARK-52780] fix linting error
caldempsey Jul 13, 2025
1a897ef
[SPARK-52780] rowiterator.go channel closing should deterministically…
caldempsey Jul 13, 2025
8c18703
[SPARK-52780] lint errors
caldempsey Jul 13, 2025
3dcab75
fix: merge
caldempsey Sep 3, 2025
485067e
Merge branch 'master' into callum/SPARK-52780
caldempsey Sep 3, 2025
f285079
feat: update the client base to provide lazy fetch
caldempsey Sep 3, 2025
917ce9f
feat: rename ToLocalIterator to StreamRows, establish RowIterator as …
caldempsey Sep 3, 2025
ad7e935
fix: golint-ci
caldempsey Sep 3, 2025
d38170b
fix: improve test doc-comments
caldempsey Sep 3, 2025
a18468f
feat: add tests for streaming rows in DataFrame operations including:
caldempsey Oct 22, 2025
434a579
fix: update Spark version to 4.0.1 in build workflow
caldempsey Oct 22, 2025
0432bde
fix: remove debug print lines from ToTable()
caldempsey Mar 1, 2026
928e9b3
fix: remove c.done race condition in ToRecordSequence
caldempsey Mar 1, 2026
146e423
fix: remove NewRowPull2, fold EOF handling into NewRowSequence
caldempsey Mar 1, 2026
aa4b293
fix: extract rowIterFromRecord to simplify NewRowSequence
caldempsey Mar 1, 2026
fb2a9aa
fix: prefer explicit error yield
caldempsey Mar 1, 2026
b29e5ef
fix: address feedback
caldempsey Mar 5, 2026
8cc399e
refactor/7: rename module path to github.com/caldempsey/spark-connect…
caldempsey Apr 19, 2026
6799238
fix/5: run build workflow on push to main (#6)
caldempsey Apr 19, 2026
55b4050
fix/8: fall back to archive.apache.org when dlcdn 404s (#9)
caldempsey Apr 19, 2026
8ad8d7d
feat/3: expose gRPC transport options on the channel + session builde…
caldempsey Apr 19, 2026
f340a1e
feat/4: add typed DataFrame[T] (#12)
caldempsey Apr 19, 2026
5a8a35d
docs/13: drop upstream "not for production" notice from fork README (…
caldempsey Apr 19, 2026
ccfe0de
feat: top-level typed helpers + Dataset[T] alias (#15)
caldempsey Apr 19, 2026
00c6971
chore: rename module path to github.com/datalakego/spark-connect-go (…
caldempsey Apr 19, 2026
abc5e8d
feat: Dataset[T].Where / Limit / OrderBy / First / Stream methods (#17)
caldempsey Apr 19, 2026
55f88a6
feat: ErrClusterNotReady + IsClusterNotReady + NewClusterNotReady (#19)
caldempsey Apr 19, 2026
8d64dd5
feat: SqlAs[T] + TableAs[T] free functions; SqlTyped deprecated (#18)
caldempsey Apr 19, 2026
0c8bf05
feat: database/sql driver over Spark Connect (#20)
caldempsey Apr 19, 2026
d36ed13
refactor: trim typed API to edge-wrapper shape only (#21)
caldempsey Apr 19, 2026
5c4c4a3
refactor: rename org github.com/datalakego → github.com/datalake-go (…
caldempsey Apr 19, 2026
6c4fc25
feat: parameter binding + drop format DSN param from database/sql dri…
caldempsey Apr 19, 2026
b7342d7
fix: dedupe rowiterator_test.go imports after fork reset
caldempsey Apr 20, 2026
6812fad
docs: refresh README for datalake-go fork
caldempsey Apr 20, 2026
b1b5edd
docs(readme): refocus on the maintained-fork framing, restructure usage
caldempsey Apr 20, 2026
11 changes: 8 additions & 3 deletions .github/workflows/build.yml
@@ -29,10 +29,10 @@ on:
pull_request:
push:
branches:
- master
- main

env:
SPARK_VERSION: '4.0.0'
SPARK_VERSION: '4.0.1'
HADOOP_VERSION: '3'

permissions:
@@ -84,7 +84,12 @@ jobs:
echo "Apache Spark is not installed"
# Access the directory.
mkdir -p ~/deps/
wget -q https://dlcdn.apache.org/spark/spark-${{ env.SPARK_VERSION }}/spark-${{ env.SPARK_VERSION }}-bin-hadoop${{ env.HADOOP_VERSION }}.tgz
# dlcdn.apache.org only keeps current releases on its mirrors and
# occasionally 404s on older ones. archive.apache.org is the
# canonical mirror and never rotates — use it as a fallback.
ARCHIVE=spark-${{ env.SPARK_VERSION }}-bin-hadoop${{ env.HADOOP_VERSION }}.tgz
wget -q https://dlcdn.apache.org/spark/spark-${{ env.SPARK_VERSION }}/$ARCHIVE || \
wget -q https://archive.apache.org/dist/spark/spark-${{ env.SPARK_VERSION }}/$ARCHIVE
tar -xzf spark-${{ env.SPARK_VERSION }}-bin-hadoop${{ env.HADOOP_VERSION }}.tgz -C ~/deps/
# Delete the old file
rm spark-${{ env.SPARK_VERSION }}-bin-hadoop${{ env.HADOOP_VERSION }}.tgz
155 changes: 115 additions & 40 deletions README.md
@@ -1,76 +1,151 @@
# Apache Spark Connect Client for Golang
# spark-connect-go

This project houses the **experimental** client for [Spark
Connect](https://spark.apache.org/docs/latest/spark-connect-overview.html) for
[Apache Spark](https://spark.apache.org/) written in [Golang](https://go.dev/).
> A maintained fork of [`apache/spark-connect-go`](https://github.com/apache/spark-connect-go) with a `database/sql` driver, edge-typed DataFrames, exposed gRPC dial options, and a typed `ClusterNotReady` error. Tracks upstream; deltas are queued to upstream.

## Current State of the Project
Spark Connect is Spark's [language-neutral gRPC protocol](https://spark.apache.org/docs/latest/spark-connect-overview.html). The upstream Go client is the official reference implementation. This fork carries the deltas needed for production usage while those patches work their way upstream — drop in by swapping the import path; the session API, DataFrame surface, and protobuf stubs are unchanged.

Currently, the Spark Connect client for Golang is highly experimental and should
not be used in any production setting. In addition, the PMC of the Apache Spark
project reserves the right to withdraw and abandon the development of this project
if it is not sustainable.
## What's added

## Getting started
- **`database/sql` driver.** `sql.Open("spark", "sc://host:port")` works with goose, sqlc-generated code, pgx-style consumers — anything that speaks `database/sql`. Registered under the name `spark` in `spark/sql/driver`. `$N` positional placeholders are rendered client-side into Spark SQL literals (the native parameter proto isn't reliable across every supported Spark version).
- **Edge-typed DataFrames.** `As[T](df) → *DataFrameOf[T]` caches a reflected row plan once; `Collect`, `Stream`, `First` materialise into struct types at the point you know the result shape. Top-level `Collect[T] / Stream[T] / First[T]` helpers do the `As[T]` plus the call in one shot.
- **`SparkSessionBuilder.WithDialOptions`.** gRPC dial options exposed on the builder — auth interceptors, TLS, and observability handlers wire in without wrapping the client (see the sketch after this list).
- **`sparkerrors.IsClusterNotReady(err)`.** Typed error for cluster cold-start states. Databricks serverless clusters take 30-90s to warm; retry logic upstack needs a reliable signal instead of string-matching on error messages.
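
A minimal sketch of wiring dial options through the builder, assuming `WithDialOptions` accepts standard variadic `grpc.DialOption` values; `authInterceptor` is a hypothetical `grpc.UnaryClientInterceptor` of your own:

```go
import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"

	sparksql "github.com/datalake-go/spark-connect-go/spark/sql"
)

// TLS against your CA bundle plus a custom auth interceptor.
creds, err := credentials.NewClientTLSFromFile("ca.pem", "")
if err != nil { /* ... */ }

session, err := sparksql.NewSessionBuilder().
	Remote("sc://spark.internal:15002").
	WithDialOptions(
		grpc.WithTransportCredentials(creds),
		grpc.WithUnaryInterceptor(authInterceptor), // hypothetical token-injecting interceptor
	).
	Build(ctx)
```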

This section explains how to run Spark Connect Go locally.
Every delta is tracked as a PR queued for `apache/spark-connect-go`. When a delta lands upstream we drop it from the fork. Long-term goal is zero deltas.

Step 1: Install Golang: https://go.dev/doc/install.
## Install

Step 2: Ensure you have the `buf CLI` installed, [more info here](https://buf.build/docs/installation/)
```bash
go get github.com/datalake-go/spark-connect-go
```

Step 3: Run the following commands to setup the Spark Connect client.
Requires a Spark Connect server (Spark 3.4+).

Building with Spark in case you need to re-generate the source files from the proto sources.
## Quick start

```
git clone https://github.com/apache/spark-connect-go.git
git submodule update --init --recursive
```go
import (
sparksql "github.com/datalake-go/spark-connect-go/spark/sql"
)

make gen && make test
session, err := sparksql.NewSessionBuilder().
Remote("sc://spark.internal:15002").
Build(ctx)
if err != nil { /* ... */ }
defer session.Stop()

df, _ := session.Sql(ctx, "SELECT id, email FROM users WHERE tier = 'gold'")
_ = df.Show(ctx, 20, false)
```

Building without Spark
The `sparksql` alias avoids collision with stdlib `database/sql` — the actual package name is `sql`.

### Using DataFrames

The untyped `DataFrame` is the building block — same surface as upstream. Transformations (`Where`, `Limit`, `OrderBy`, `Select`, `Join`, `GroupBy`) compose lazily and execute on the Spark side; materialisers (`Show`, `Collect`, `First`, `Count`) round-trip and return `[]types.Row`.

```go
df, _ := session.Sql(ctx, "SELECT id, email, created_at FROM users")

filtered, _ := df.Where(ctx, "tier = 'gold'")
top, _ := filtered.OrderBy(ctx, "created_at DESC").Limit(ctx, 100)

rows, _ := top.Collect(ctx)
for _, r := range rows {
// r is types.Row — positional access by index or by name
}
```
git clone https://github.com/apache/spark-connect-go.git
make && make test
```

Step 4: Setup the Spark Driver on localhost.
Use this when the result shape is dynamic, or as the composition surface that you eventually re-type at the edge.

### Using Typed DataFrames

`As[T](df) → *DataFrameOf[T]` is the typed surface. It binds a result shape to a struct, caches the reflected row plan once, and materialises into `[]T` / `*T` without re-validating on every call.

1. [Download Spark distribution](https://spark.apache.org/downloads.html) (4.0.0+), unzip the package.
```go
type User struct {
ID string `spark:"id"`
Email string `spark:"email"`
Created time.Time `spark:"created_at"`
}

df, _ := session.Sql(ctx, "SELECT id, email, created_at FROM users WHERE tier = 'gold'")
typed, _ := sparksql.As[User](df)

users, _ := typed.Collect(ctx)
alice, err := typed.First(ctx)
if errors.Is(err, sparksql.ErrNotFound) { /* zero rows */ }
```

2. Start the Spark Connect server with the following command (make sure to use a package version that matches your Spark distribution):
If you only need the result once, `Collect[T] / First[T] / Stream[T]` are top-level helpers that fold `As[T]` into the call:

```go
users, _ := sparksql.Collect[User](ctx, df)
```
sbin/start-connect-server.sh

Untagged fields map by snake_case'd field name, so plain Go structs work without tags. `spark:"-"` skips a field. `*DataFrameOf[T]` deliberately has no transformation methods — `Where` / `Limit` / `Select` / `Join` change the row shape and would make `T` lie. Compose on the untyped `DataFrame`, then re-type at the edge:

```go
typed, _ := sparksql.As[User](df)
narrower, _ := typed.DataFrame().Select(ctx, "id", "email") // back to untyped
ids, _ := sparksql.Collect[struct{ ID string `spark:"id"` }](ctx, narrower)
```

Step 5: Run the example Go application.
### Streaming Results

`Stream[T]` returns a Go 1.23 [`iter.Seq2[T, error]`](https://pkg.go.dev/iter#Seq2). One of the things Go gives us over the Python / Scala clients is a real pull-based iterator — rows decode one at a time as the gRPC stream resolves them, with constant memory regardless of result size. No need to buffer the whole result, no callback API: just `range`.

```go
for row, err := range sparksql.Stream[User](ctx, df) {
if err != nil { break }
// use row — decoded from the next Arrow batch as it lands
}
```
go run cmd/spark-connect-example-spark-session/main.go

Schema binding happens on the first row; if a later row's schema diverges from the first, the error surfaces through the iterator (no per-row panics).

Use `Stream[T]` when result sets are large, when you want to short-circuit early without dragging the rest of the rows over the wire, or when you're piping into another `iter.Seq2` consumer.
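
As a sketch of the short-circuit case (building on the `User` and `df` snippets above; `process` is a placeholder for your own handling):

```go
// Take the first 10 matching rows, then stop ranging; the remaining
// Arrow batches are never pulled over the wire.
seen := 0
for row, err := range sparksql.Stream[User](ctx, df) {
	if err != nil {
		log.Fatal(err)
	}
	process(row)
	if seen++; seen == 10 {
		break
	}
}
```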

### `database/sql` driver

```go
import (
"database/sql"
_ "github.com/datalake-go/spark-connect-go/spark/sql/driver"
)

db, _ := sql.Open("spark", "sc://spark.internal:15002")
rows, _ := db.QueryContext(ctx, "SELECT id FROM users WHERE tier = $1", "gold")
```

## Running Spark Connect Go Application in a Spark Cluster
`$N` placeholders render with type-aware quoting (strings, numbers, bools, `[]byte`, `time.Time`). `?` placeholders aren't supported — most `database/sql`-adjacent codegen (sqlc, goose dialects, pgx patterns) emits `$N`, so the narrower grammar keeps the renderer simple.
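
A sketch of binding mixed parameter types; the table and column names are illustrative:

```go
since := time.Date(2025, 1, 1, 0, 0, 0, 0, time.UTC)
rows, err := db.QueryContext(ctx,
	"SELECT id FROM users WHERE active = $1 AND created_at > $2 AND tier = $3",
	true, since, "gold",
)
// Each argument is rendered as a Spark SQL literal of the matching type
// before the statement is sent to the server.
```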

To run the Spark Connect Go application in a Spark Cluster, you need to build the Go application and submit it to the Spark Cluster. You can find a more detailed example runner and wrapper script in the `java` directory.
### Cluster cold-start

See the guide here: [Sample Spark-Submit Wrapper](java/README.md).
```go
import "github.com/datalake-go/spark-connect-go/spark/sparkerrors"

## How to write Spark Connect Go Application in your own project
df, err := session.Sql(ctx, query)
if sparkerrors.IsClusterNotReady(err) {
// retry with backoff — Databricks serverless usually warms in 30-90s
}
```
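
A minimal retry sketch on top of that check, assuming `session.Sql` returns the `sparksql.DataFrame` shown in the quick start; the attempt count and delay are illustrative:

```go
var df sparksql.DataFrame
var err error
for attempt := 1; attempt <= 5; attempt++ {
	df, err = session.Sql(ctx, query)
	if !sparkerrors.IsClusterNotReady(err) {
		break // ready, or an unrelated error worth surfacing as-is
	}
	time.Sleep(time.Duration(attempt) * 20 * time.Second)
}
```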

## Building from source

See [Quick Start Guide](quick-start.md)
```bash
git clone https://github.com/datalake-go/spark-connect-go.git
cd spark-connect-go
make && make test
```

## High Level Design
Regenerating protobuf stubs from the Spark submodule:

The overall goal of the design is to strike a good balance between the principle of least surprise for
developers who are familiar with the APIs of Apache Spark and idiomatic Go usage. The high-level
structure of the packages roughly follows the PySpark guidance, but with Go idioms.
```bash
git submodule update --init --recursive
make gen && make test
```

## Contributing

Please review the [Contribution to Spark guide](https://spark.apache.org/contributing.html)
for information on how to get started contributing to the project.
Feature work that could land upstream should be proposed against [`apache/spark-connect-go`](https://github.com/apache/spark-connect-go) first. Fork-only changes (anything that wouldn't be accepted upstream) stay on this tree. See [CONTRIBUTING.md](CONTRIBUTING.md).
2 changes: 1 addition & 1 deletion cmd/spark-connect-example-raw-grpc-client/main.go
@@ -22,7 +22,7 @@ import (
"log"
"time"

proto "github.com/apache/spark-connect-go/internal/generated"
proto "github.com/datalake-go/spark-connect-go/internal/generated"
"github.com/google/uuid"
"google.golang.org/grpc"
"google.golang.org/grpc/credentials/insecure"
8 changes: 4 additions & 4 deletions cmd/spark-connect-example-spark-session/main.go
@@ -22,12 +22,12 @@ import (
"fmt"
"log"

"github.com/apache/spark-connect-go/spark/sql/types"
"github.com/datalake-go/spark-connect-go/spark/sql/types"

"github.com/apache/spark-connect-go/spark/sql/functions"
"github.com/datalake-go/spark-connect-go/spark/sql/functions"

"github.com/apache/spark-connect-go/spark/sql"
"github.com/apache/spark-connect-go/spark/sql/utils"
"github.com/datalake-go/spark-connect-go/spark/sql"
"github.com/datalake-go/spark-connect-go/spark/sql/utils"
)

var (
4 changes: 2 additions & 2 deletions go.mod
@@ -13,9 +13,9 @@
// See the License for the specific language governing permissions and
// limitations under the License.

module github.com/apache/spark-connect-go
module github.com/datalake-go/spark-connect-go

go 1.23.2
go 1.24

require (
github.com/apache/arrow-go/v18 v18.4.0