Before the first code edit or running test in a session, ensure a clean working environment. DO NOT skip these checks:
- Run
git remote -vto identify the personal fork and upstream (apache/spark). If unclear, ask the user to configure their remotes following the standard convention (originfor the fork,upstreamforapache/spark). - If the latest commit on
<upstream>/masteris more than a day old (check withgit log -1 --format="%ci" <upstream>/master), rungit fetch <upstream> master. - If there are uncommitted changes (check with
git status), ask the user to stash them before proceeding. - Switch to the appropriate branch:
- Existing PR: resolve the PR branch name via
gh api repos/apache/spark/pulls/<number> --jq '.head.ref', then look for a local branch matching that name. If found, switch to it and inform the user. If not found, ask whether to fetch it or if there is a local branch under a different name. - New edits: ask the user to choose: create a new git worktree from
<upstream>/masterand work from there (recommended), or create and switch to a new branch from<upstream>/master. - Running tests: use
<upstream>/master.
- Existing PR: resolve the PR branch name via
SQL golden file tests are managed by SQLQueryTestSuite and its variants. Read the class documentation before running or updating these tests. DO NOT edit the generated golden files (.sql.out) directly. Always regenerate them when needed, and carefully review the diff to make sure it's expected.
Spark Connect protocol is defined in proto files under sql/connect/common/src/main/protobuf/. Read the README there before modifying proto definitions.
Build and tests can take a long time. Before running tests, ask the user if they have more changes to make.
Prefer SBT over Maven for faster incremental compilation. Module names are defined in project/SparkBuild.scala.
Compile a single module:
build/sbt <module>/compile
Compile test code for a single module:
build/sbt <module>/Test/compile
Run test suites by wildcard or full class name:
build/sbt '<module>/testOnly *MySuite'
build/sbt '<module>/testOnly org.apache.spark.sql.MySuite'
Run test cases matching a substring:
build/sbt '<module>/testOnly *MySuite -- -z "test name"'
For faster iteration, keep SBT open in interactive mode:
build/sbt
> project <module>
> testOnly *MySuite
PySpark tests require building Spark with Hive support first:
build/sbt -Phive package
Activate the virtual environment specified by the user, or default to .venv:
source <venv>/bin/activate
If the default venv does not exist, create it:
python3 -m venv .venv
source .venv/bin/activate
pip install -r dev/requirements.txt
Run a single test suite:
python/run-tests --testnames pyspark.sql.tests.arrow.test_arrow
Run a single test case:
python/run-tests --testnames "pyspark.sql.tests.test_catalog CatalogTests.test_current_database"
Do NOT download full job logs to grep for errors — they are very large and slow. Instead, use the test report annotations on the fork.
Step 1 — Get the fork owner and the latest commit SHA of the PR:
gh api repos/apache/spark/pulls/<PR_NUMBER> --jq '{owner: .head.repo.owner.login, sha: .head.sha}'
Step 2 — Find the "Report test results" check run on the fork's commit:
gh api repos/<OWNER>/spark/commits/<SHA>/check-runs \
--jq '.check_runs[] | select(.name == "Report test results") | {id: .id, annotations: .output.annotations_count}'
Step 3 — Fetch failure annotations:
gh api repos/<OWNER>/spark/check-runs/<CHECK_RUN_ID>/annotations
Each annotation contains the test class, test name, and failure message.
PR title format is [SPARK-xxxx][Component] Title. Infer the PR title from the changes. If no ticket ID is given, create one using dev/create_spark_jira.py, using the PR title (without the JIRA ID and component tag) as the ticket title.
python3 dev/create_spark_jira.py "<title>" -c <component> { -t <type> | -p <parent-jira-id> }
- Component (
-c): e.g. "SQL", "Spark Core", "PySpark", "Connect". Runpython3 dev/create_spark_jira.py --list-componentsfor the full list. - Issue type (
-t): "Bug", "Improvement", "New Feature", "Test", "Documentation", or "Dependency upgrade". - Parent (
-p): if the user mentions a parent JIRA ticket (e.g., "this is a subtask of SPARK-12345"), pass it instead of-t. The issue type is automatically "Sub-task".
The script sets the latest unreleased version as the default affected version. Ask the user to review and adjust versions and other fields on the JIRA ticket after creation.
Before writing the PR description, read .github/PULL_REQUEST_TEMPLATE and fill in every section from that file.
DO NOT push to the upstream repo. Always push to the personal fork. Open PRs against master on the upstream repo.
DO NOT force push or use --amend on pushed commits unless the user explicitly asks. If the remote branch has new commits, fetch and rebase before pushing.
Always get user approval before external operations such as pushing commits, creating PRs, or posting comments. Use gh pr create to open PRs. If gh is not installed, generate the GitHub PR URL for the user and recommend installing the GitHub CLI.