Apache Spark

Pre-flight Checks

Before the first code edit or running test in a session, ensure a clean working environment. DO NOT skip these checks:

Run git remote -v to identify the personal fork and upstream (apache/spark). If unclear, ask the user to configure their remotes following the standard convention (origin for the fork, upstream for apache/spark).
If the latest commit on <upstream>/master is more than a day old (check with git log -1 --format="%ci" <upstream>/master), run git fetch <upstream> master.
If there are uncommitted changes (check with git status), ask the user to stash them before proceeding.
Switch to the appropriate branch:
- Existing PR: resolve the PR branch name via gh api repos/apache/spark/pulls/<number> --jq '.head.ref', then look for a local branch matching that name. If found, switch to it and inform the user. If not found, ask whether to fetch it or if there is a local branch under a different name.
- New edits: ask the user to choose: create a new git worktree from <upstream>/master and work from there (recommended), or create and switch to a new branch from <upstream>/master.
- Running tests: use <upstream>/master.

Development Notes

SQL golden file tests are managed by SQLQueryTestSuite and its variants. Read the class documentation before running or updating these tests. DO NOT edit the generated golden files (.sql.out) directly. Always regenerate them when needed, and carefully review the diff to make sure it's expected.

Spark Connect protocol is defined in proto files under sql/connect/common/src/main/protobuf/. Read the README there before modifying proto definitions.

Build and Test

Build and tests can take a long time. Before running tests, ask the user if they have more changes to make.

Prefer SBT over Maven for faster incremental compilation. Module names are defined in project/SparkBuild.scala.

Compile a single module:

build/sbt <module>/compile

Compile test code for a single module:

build/sbt <module>/Test/compile

Run test suites by wildcard or full class name:

build/sbt '<module>/testOnly *MySuite'
build/sbt '<module>/testOnly org.apache.spark.sql.MySuite'

Run test cases matching a substring:

build/sbt '<module>/testOnly *MySuite -- -z "test name"'

For faster iteration, keep SBT open in interactive mode:

build/sbt
> project <module>
> testOnly *MySuite

PySpark Tests

PySpark tests require building Spark with Hive support first:

build/sbt -Phive package

Activate the virtual environment specified by the user, or default to .venv:

source <venv>/bin/activate

If the default venv does not exist, create it:

python3 -m venv .venv
source .venv/bin/activate
pip install -r dev/requirements.txt

Run a single test suite:

python/run-tests --testnames pyspark.sql.tests.arrow.test_arrow

Run a single test case:

python/run-tests --testnames "pyspark.sql.tests.test_catalog CatalogTests.test_current_database"

Investigating PR CI Failures

Do NOT download full job logs to grep for errors — they are very large and slow. Instead, use the test report annotations on the fork.

Step 1 — Get the fork owner and the latest commit SHA of the PR:

gh api repos/apache/spark/pulls/<PR_NUMBER> --jq '{owner: .head.repo.owner.login, sha: .head.sha}'

Step 2 — Find the "Report test results" check run on the fork's commit:

gh api repos/<OWNER>/spark/commits/<SHA>/check-runs \
  --jq '.check_runs[] | select(.name == "Report test results") | {id: .id, annotations: .output.annotations_count}'

Step 3 — Fetch failure annotations:

gh api repos/<OWNER>/spark/check-runs/<CHECK_RUN_ID>/annotations

Each annotation contains the test class, test name, and failure message.

Pull Request Workflow

PR title format is [SPARK-xxxx][Component] Title. Infer the PR title from the changes. If no ticket ID is given, create one using dev/create_spark_jira.py, using the PR title (without the JIRA ID and component tag) as the ticket title.

python3 dev/create_spark_jira.py "<title>" -c <component> { -t <type> | -p <parent-jira-id> }

Component (-c): e.g. "SQL", "Spark Core", "PySpark", "Connect". Run python3 dev/create_spark_jira.py --list-components for the full list.
Issue type (-t): "Bug", "Improvement", "New Feature", "Test", "Documentation", or "Dependency upgrade".
Parent (-p): if the user mentions a parent JIRA ticket (e.g., "this is a subtask of SPARK-12345"), pass it instead of -t. The issue type is automatically "Sub-task".

The script sets the latest unreleased version as the default affected version. Ask the user to review and adjust versions and other fields on the JIRA ticket after creation.

Before writing the PR description, read .github/PULL_REQUEST_TEMPLATE and fill in every section from that file.

DO NOT push to the upstream repo. Always push to the personal fork. Open PRs against master on the upstream repo.

DO NOT force push or use --amend on pushed commits unless the user explicitly asks. If the remote branch has new commits, fetch and rebase before pushing.

Always get user approval before external operations such as pushing commits, creating PRs, or posting comments. Use gh pr create to open PRs. If gh is not installed, generate the GitHub PR URL for the user and recommend installing the GitHub CLI.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Apache Spark

Pre-flight Checks

Development Notes

Build and Test

PySpark Tests

Investigating PR CI Failures

Pull Request Workflow

FilesExpand file tree

AGENTS.md

Latest commit

History

AGENTS.md

File metadata and controls

Apache Spark

Pre-flight Checks

Development Notes

Build and Test

PySpark Tests

Investigating PR CI Failures

Pull Request Workflow