From a8647bf6547e7179defa62d0f918e9f259b27c31 Mon Sep 17 00:00:00 2001 From: Antony Stubbs Date: Sat, 11 Apr 2026 03:12:04 +1200 Subject: [PATCH 01/12] Rebrand fork: io.github.astubbs.parallelconsumer 0.6.0.0-SNAPSHOT MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Republish under new Maven coordinates so the fork can be released to Maven Central independently of upstream confluentinc/parallel-consumer (which is no longer actively maintained). Changes: - groupId: io.confluent.parallelconsumer -> io.github.astubbs.parallelconsumer - version: 0.5.3.4-SNAPSHOT -> 0.6.0.0-SNAPSHOT - organization: Confluent, Inc. -> Antony Stubbs - SCM URLs: confluentinc -> astubbs - developer: antony.stubbs@gmail.com - Display names: drop "Confluent" prefix - License header template: now "Confluent, Inc. and contributors" (preserves original Apache 2.0 attribution while crediting fork contributors going forward) - Add NOTICE file per Apache 2.0 §4(d) crediting Confluent as the original author - README updated with new coordinates and a note explaining the fork Java packages (io.confluent.parallelconsumer.*) and artifactIds are intentionally unchanged so cherry-picking PRs back upstream stays clean if they ever resume maintenance. Co-Authored-By: Claude Opus 4.6 (1M context) --- NOTICE | 10 ++++++ README.adoc | 34 +++++++++++++----- parallel-consumer-core/pom.xml | 6 ++-- .../parallel-consumer-example-core/pom.xml | 10 +++--- .../parallel-consumer-example-metrics/pom.xml | 10 +++--- .../parallel-consumer-example-reactor/pom.xml | 12 +++---- .../parallel-consumer-example-streams/pom.xml | 10 +++--- .../parallel-consumer-example-vertx/pom.xml | 12 +++---- parallel-consumer-examples/pom.xml | 6 ++-- parallel-consumer-mutiny/pom.xml | 10 +++--- parallel-consumer-reactor/pom.xml | 10 +++--- parallel-consumer-vertx/pom.xml | 10 +++--- pom.xml | 36 +++++++++---------- src/docs/README_TEMPLATE.adoc | 12 ++++--- 14 files changed, 109 insertions(+), 79 deletions(-) create mode 100644 NOTICE diff --git a/NOTICE b/NOTICE new file mode 100644 index 000000000..964fb9171 --- /dev/null +++ b/NOTICE @@ -0,0 +1,10 @@ +Parallel Consumer +Copyright 2020-2026 Confluent, Inc. +Copyright 2026 Antony Stubbs and contributors + +This product includes software originally developed by Confluent, Inc. +(https://github.com/confluentinc/parallel-consumer), licensed under the +Apache License, Version 2.0. + +This fork is maintained independently at: + https://github.com/astubbs/parallel-consumer diff --git a/README.adoc b/README.adoc index 39f4db1be..4c6fa26d6 100644 --- a/README.adoc +++ b/README.adoc @@ -43,7 +43,7 @@ ifdef::env-github[] :warning-caption: :warning: endif::[] -image:https://maven-badges.herokuapp.com/maven-central/io.confluent.parallelconsumer/parallel-consumer-parent/badge.svg?style=flat[link=https://mvnrepository.com/artifact/io.confluent.parallelconsumer/parallel-consumer-parent,Latest Parallel Consumer on Maven Central] +image:https://maven-badges.herokuapp.com/maven-central/io.github.astubbs.parallelconsumer/parallel-consumer-parent/badge.svg?style=flat[link=https://mvnrepository.com/artifact/io.github.astubbs.parallelconsumer/parallel-consumer-parent,Latest Parallel Consumer on Maven Central] // Github actions disabled since codecov //image:https://github.com/confluentinc/parallel-consumer/actions/workflows/maven.yml/badge.svg[Java 8 Unit Test GitHub] + @@ -392,20 +392,22 @@ This library is copyright Confluent Inc, and licensed under the Apache License V === Maven -This project is available in maven central, https://repo1.maven.org/maven2/io/confluent/parallelconsumer/[repo1], along with SNAPSHOT builds (starting with 0.5-SNAPSHOT) in https://oss.sonatype.org/content/repositories/snapshots/io/confluent/parallelconsumer/[repo1's SNAPSHOTS repo]. +NOTE: This is a community-maintained fork of https://github.com/confluentinc/parallel-consumer[confluentinc/parallel-consumer], published under different Maven coordinates. The original upstream project is no longer actively maintained. -Latest version can be seen https://search.maven.org/artifact/io.confluent.parallelconsumer/parallel-consumer-core[here]. +This project is available in Maven Central, https://repo1.maven.org/maven2/io/github/astubbs/parallelconsumer/[repo1]. + +Latest version can be seen https://search.maven.org/artifact/io.github.astubbs.parallelconsumer/parallel-consumer-core[here]. Where `${project.version}` is the version to be used: -* group ID: `io.confluent.parallelconsumer` +* group ID: `io.github.astubbs.parallelconsumer` * artifact ID: `parallel-consumer-core` -* version: image:https://maven-badges.herokuapp.com/maven-central/io.confluent.parallelconsumer/parallel-consumer-parent/badge.svg?style=flat[link=https://mvnrepository.com/artifact/io.confluent.parallelconsumer/parallel-consumer-parent,Latest Parallel Consumer on Maven Central] +* version: image:https://maven-badges.herokuapp.com/maven-central/io.github.astubbs.parallelconsumer/parallel-consumer-parent/badge.svg?style=flat[link=https://mvnrepository.com/artifact/io.github.astubbs.parallelconsumer/parallel-consumer-parent,Latest Parallel Consumer on Maven Central] .Core Module Dependency [source,xml,indent=0] - io.confluent.parallelconsumer + io.github.astubbs.parallelconsumer parallel-consumer-core ${project.version} @@ -413,7 +415,7 @@ Where `${project.version}` is the version to be used: .Reactor Module Dependency [source,xml,indent=0] - io.confluent.parallelconsumer + io.github.astubbs.parallelconsumer parallel-consumer-reactor ${project.version} @@ -421,7 +423,7 @@ Where `${project.version}` is the version to be used: .Vert.x Module Dependency [source,xml,indent=0] - io.confluent.parallelconsumer + io.github.astubbs.parallelconsumer parallel-consumer-vertx ${project.version} @@ -1341,6 +1343,22 @@ Note:: See https://github.com/confluentinc/parallel-consumer/issues/162[issue #162] and this https://stackoverflow.com/questions/4786881/why-is-test-jar-dependency-required-for-mvn-compile[Stack Overflow question]. +=== Build Scripts + +Helper scripts are in the `bin/` directory: + +[qanda] +Quick local build (compile + unit tests):: +`bin/build.sh` + +Full CI build with all tests (unit + integration):: +`bin/ci-build.sh` + +CI build against a specific Kafka version:: +`bin/ci-build.sh 3.9.1` + +The GitHub Actions CI workflow uses `bin/ci-build.sh`, so running it locally reproduces the CI environment. + === Testing The project has good automated test coverage, of all features. diff --git a/parallel-consumer-core/pom.xml b/parallel-consumer-core/pom.xml index 5970f43b4..e984a10bf 100644 --- a/parallel-consumer-core/pom.xml +++ b/parallel-consumer-core/pom.xml @@ -6,15 +6,15 @@ --> - io.confluent.parallelconsumer + io.github.astubbs.parallelconsumer parallel-consumer-parent - 0.5.3.4-SNAPSHOT + 0.6.0.0-SNAPSHOT 4.0.0 parallel-consumer-core - Confluent Parallel Consumer Core + Kafka Parallel Consumer Core diff --git a/parallel-consumer-examples/parallel-consumer-example-core/pom.xml b/parallel-consumer-examples/parallel-consumer-example-core/pom.xml index d9d03103b..e0ec57295 100644 --- a/parallel-consumer-examples/parallel-consumer-example-core/pom.xml +++ b/parallel-consumer-examples/parallel-consumer-example-core/pom.xml @@ -8,24 +8,24 @@ 4.0.0 - io.confluent.parallelconsumer + io.github.astubbs.parallelconsumer parallel-consumer-examples - 0.5.3.4-SNAPSHOT + 0.6.0.0-SNAPSHOT parallel-consumer-example-core - Confluent Parallel Consumer Example - Core + Kafka Parallel Consumer Example - Core - io.confluent.parallelconsumer + io.github.astubbs.parallelconsumer parallel-consumer-core ${project.version} - io.confluent.parallelconsumer + io.github.astubbs.parallelconsumer parallel-consumer-core ${project.version} tests diff --git a/parallel-consumer-examples/parallel-consumer-example-metrics/pom.xml b/parallel-consumer-examples/parallel-consumer-example-metrics/pom.xml index 8ecba9dac..95072e93d 100644 --- a/parallel-consumer-examples/parallel-consumer-example-metrics/pom.xml +++ b/parallel-consumer-examples/parallel-consumer-example-metrics/pom.xml @@ -8,18 +8,18 @@ 4.0.0 - io.confluent.parallelconsumer + io.github.astubbs.parallelconsumer parallel-consumer-examples - 0.5.3.4-SNAPSHOT + 0.6.0.0-SNAPSHOT parallel-consumer-example-metrics - Confluent Parallel Consumer Example - Metrics + Kafka Parallel Consumer Example - Metrics - io.confluent.parallelconsumer + io.github.astubbs.parallelconsumer parallel-consumer-core ${project.version} @@ -30,7 +30,7 @@ - io.confluent.parallelconsumer + io.github.astubbs.parallelconsumer parallel-consumer-core ${project.version} tests diff --git a/parallel-consumer-examples/parallel-consumer-example-reactor/pom.xml b/parallel-consumer-examples/parallel-consumer-example-reactor/pom.xml index b26dde0f3..f5b22546b 100644 --- a/parallel-consumer-examples/parallel-consumer-example-reactor/pom.xml +++ b/parallel-consumer-examples/parallel-consumer-example-reactor/pom.xml @@ -8,31 +8,31 @@ 4.0.0 - io.confluent.parallelconsumer + io.github.astubbs.parallelconsumer parallel-consumer-examples - 0.5.3.4-SNAPSHOT + 0.6.0.0-SNAPSHOT parallel-consumer-example-reactor - Confluent Parallel Consumer Example - Project Reactor.io + Kafka Parallel Consumer Example - Project Reactor.io - io.confluent.parallelconsumer + io.github.astubbs.parallelconsumer parallel-consumer-reactor ${project.version} - io.confluent.parallelconsumer + io.github.astubbs.parallelconsumer parallel-consumer-core ${project.version} tests test - io.confluent.parallelconsumer + io.github.astubbs.parallelconsumer parallel-consumer-reactor ${project.version} tests diff --git a/parallel-consumer-examples/parallel-consumer-example-streams/pom.xml b/parallel-consumer-examples/parallel-consumer-example-streams/pom.xml index fc31adfad..5d0e4a41c 100644 --- a/parallel-consumer-examples/parallel-consumer-example-streams/pom.xml +++ b/parallel-consumer-examples/parallel-consumer-example-streams/pom.xml @@ -8,17 +8,17 @@ 4.0.0 - io.confluent.parallelconsumer + io.github.astubbs.parallelconsumer parallel-consumer-examples - 0.5.3.4-SNAPSHOT + 0.6.0.0-SNAPSHOT parallel-consumer-example-streams - Confluent Parallel Consumer Example - Streams + Kafka Parallel Consumer Example - Streams - io.confluent.parallelconsumer + io.github.astubbs.parallelconsumer parallel-consumer-core ${project.version} @@ -41,7 +41,7 @@ test - io.confluent.parallelconsumer + io.github.astubbs.parallelconsumer parallel-consumer-core ${project.version} tests diff --git a/parallel-consumer-examples/parallel-consumer-example-vertx/pom.xml b/parallel-consumer-examples/parallel-consumer-example-vertx/pom.xml index 9296f0f1c..73761b45e 100644 --- a/parallel-consumer-examples/parallel-consumer-example-vertx/pom.xml +++ b/parallel-consumer-examples/parallel-consumer-example-vertx/pom.xml @@ -8,31 +8,31 @@ 4.0.0 - io.confluent.parallelconsumer + io.github.astubbs.parallelconsumer parallel-consumer-examples - 0.5.3.4-SNAPSHOT + 0.6.0.0-SNAPSHOT parallel-consumer-example-vertx - Confluent Parallel Consumer Example - Vert.x + Kafka Parallel Consumer Example - Vert.x - io.confluent.parallelconsumer + io.github.astubbs.parallelconsumer parallel-consumer-vertx ${project.version} - io.confluent.parallelconsumer + io.github.astubbs.parallelconsumer parallel-consumer-core ${project.version} tests test - io.confluent.parallelconsumer + io.github.astubbs.parallelconsumer parallel-consumer-vertx ${project.version} tests diff --git a/parallel-consumer-examples/pom.xml b/parallel-consumer-examples/pom.xml index 7ac9e4918..ffcdbee78 100644 --- a/parallel-consumer-examples/pom.xml +++ b/parallel-consumer-examples/pom.xml @@ -9,12 +9,12 @@ parallel-consumer-parent - io.confluent.parallelconsumer - 0.5.3.4-SNAPSHOT + io.github.astubbs.parallelconsumer + 0.6.0.0-SNAPSHOT parallel-consumer-examples - Confluent Parallel Consumer Examples + Kafka Parallel Consumer Examples pom diff --git a/parallel-consumer-mutiny/pom.xml b/parallel-consumer-mutiny/pom.xml index 08fc730e0..1434901a9 100644 --- a/parallel-consumer-mutiny/pom.xml +++ b/parallel-consumer-mutiny/pom.xml @@ -7,22 +7,22 @@ parallel-consumer-parent - io.confluent.parallelconsumer - 0.5.3.4-SNAPSHOT + io.github.astubbs.parallelconsumer + 0.6.0.0-SNAPSHOT 4.0.0 - Confluent Parallel Consumer SmallRye Mutiny + Kafka Parallel Consumer SmallRye Mutiny parallel-consumer-mutiny - io.confluent.parallelconsumer + io.github.astubbs.parallelconsumer parallel-consumer-core ${project.version} - io.confluent.parallelconsumer + io.github.astubbs.parallelconsumer parallel-consumer-core ${project.version} tests diff --git a/parallel-consumer-reactor/pom.xml b/parallel-consumer-reactor/pom.xml index 638e23192..7f63d1ea7 100644 --- a/parallel-consumer-reactor/pom.xml +++ b/parallel-consumer-reactor/pom.xml @@ -7,22 +7,22 @@ parallel-consumer-parent - io.confluent.parallelconsumer - 0.5.3.4-SNAPSHOT + io.github.astubbs.parallelconsumer + 0.6.0.0-SNAPSHOT 4.0.0 - Confluent Parallel Consumer Project Reactor.io + Kafka Parallel Consumer Project Reactor.io parallel-consumer-reactor - io.confluent.parallelconsumer + io.github.astubbs.parallelconsumer parallel-consumer-core ${project.version} - io.confluent.parallelconsumer + io.github.astubbs.parallelconsumer parallel-consumer-core ${project.version} tests diff --git a/parallel-consumer-vertx/pom.xml b/parallel-consumer-vertx/pom.xml index f6f90545f..d37fa8f8f 100644 --- a/parallel-consumer-vertx/pom.xml +++ b/parallel-consumer-vertx/pom.xml @@ -6,13 +6,13 @@ --> - io.confluent.parallelconsumer + io.github.astubbs.parallelconsumer parallel-consumer-parent - 0.5.3.4-SNAPSHOT + 0.6.0.0-SNAPSHOT parallel-consumer-vertx - Confluent Parallel Consumer Vert.x + Kafka Parallel Consumer Vert.x 4.0.0 @@ -22,12 +22,12 @@ - io.confluent.parallelconsumer + io.github.astubbs.parallelconsumer parallel-consumer-core ${project.version} - io.confluent.parallelconsumer + io.github.astubbs.parallelconsumer parallel-consumer-core ${project.version} tests diff --git a/pom.xml b/pom.xml index 78a81326a..8e829f7f4 100644 --- a/pom.xml +++ b/pom.xml @@ -1,24 +1,24 @@ 4.0.0 - io.confluent.parallelconsumer + io.github.astubbs.parallelconsumer parallel-consumer-parent - Confluent Parallel Consumer - 0.5.3.4-SNAPSHOT + Kafka Parallel Consumer + 0.6.0.0-SNAPSHOT Parallel Apache Kafka client wrapper with client side queueing, a simpler consumer/producer API with key concurrency and extendable non-blocking IO processing. - https://confluent.io + https://github.com/astubbs/parallel-consumer 2020 - Confluent, Inc. - https://confluent.io + Antony Stubbs + https://stub.bz @@ -42,21 +42,21 @@ - antony@confluent.io + astubbs Antony Stubbs - antony@confluent.io - https://confluent.io - Confluent - https://confluent.io + antony.stubbs@gmail.com + https://stub.bz + Antony Stubbs + https://stub.bz Europe/London - scm:git:git://github.com:confluentinc/parallel-consumer.git - scm:git:git@github.com:confluentinc/parallel-consumer.git - https://github.com/confluentinc/parallel-consumer.git - 0.5.3.2 + scm:git:git://github.com:astubbs/parallel-consumer.git + scm:git:git@github.com:astubbs/parallel-consumer.git + https://github.com/astubbs/parallel-consumer.git + HEAD @@ -201,7 +201,7 @@ false false - io.confluent.parallelconsumer: + io.github.astubbs.parallelconsumer: @@ -592,7 +592,7 @@ true - Copyright (C) ${license.git.copyrightYears} ${project.organization.name} + Copyright (C) 2020-${license.git.copyrightYears} Confluent, Inc. and contributors **/release-pom.xml diff --git a/src/docs/README_TEMPLATE.adoc b/src/docs/README_TEMPLATE.adoc index b7805c2ef..73812fec1 100644 --- a/src/docs/README_TEMPLATE.adoc +++ b/src/docs/README_TEMPLATE.adoc @@ -43,7 +43,7 @@ ifdef::env-github[] :warning-caption: :warning: endif::[] -image:https://maven-badges.herokuapp.com/maven-central/io.confluent.parallelconsumer/parallel-consumer-parent/badge.svg?style=flat[link=https://mvnrepository.com/artifact/io.confluent.parallelconsumer/parallel-consumer-parent,Latest Parallel Consumer on Maven Central] +image:https://maven-badges.herokuapp.com/maven-central/io.github.astubbs.parallelconsumer/parallel-consumer-parent/badge.svg?style=flat[link=https://mvnrepository.com/artifact/io.github.astubbs.parallelconsumer/parallel-consumer-parent,Latest Parallel Consumer on Maven Central] // Github actions disabled since codecov //image:https://github.com/confluentinc/parallel-consumer/actions/workflows/maven.yml/badge.svg[Java 8 Unit Test GitHub] + @@ -390,15 +390,17 @@ This library is copyright Confluent Inc, and licensed under the Apache License V === Maven -This project is available in maven central, https://repo1.maven.org/maven2/io/confluent/parallelconsumer/[repo1], along with SNAPSHOT builds (starting with 0.5-SNAPSHOT) in https://oss.sonatype.org/content/repositories/snapshots/io/confluent/parallelconsumer/[repo1's SNAPSHOTS repo]. +NOTE: This is a community-maintained fork of https://github.com/confluentinc/parallel-consumer[confluentinc/parallel-consumer], published under different Maven coordinates. The original upstream project is no longer actively maintained. -Latest version can be seen https://search.maven.org/artifact/io.confluent.parallelconsumer/parallel-consumer-core[here]. +This project is available in Maven Central, https://repo1.maven.org/maven2/io/github/astubbs/parallelconsumer/[repo1]. + +Latest version can be seen https://search.maven.org/artifact/io.github.astubbs.parallelconsumer/parallel-consumer-core[here]. Where `${project.version}` is the version to be used: -* group ID: `io.confluent.parallelconsumer` +* group ID: `io.github.astubbs.parallelconsumer` * artifact ID: `parallel-consumer-core` -* version: image:https://maven-badges.herokuapp.com/maven-central/io.confluent.parallelconsumer/parallel-consumer-parent/badge.svg?style=flat[link=https://mvnrepository.com/artifact/io.confluent.parallelconsumer/parallel-consumer-parent,Latest Parallel Consumer on Maven Central] +* version: image:https://maven-badges.herokuapp.com/maven-central/io.github.astubbs.parallelconsumer/parallel-consumer-parent/badge.svg?style=flat[link=https://mvnrepository.com/artifact/io.github.astubbs.parallelconsumer/parallel-consumer-parent,Latest Parallel Consumer on Maven Central] .Core Module Dependency [source,xml,indent=0] From b810e8730b644dbe27e5e2f73a3b2519466da657 Mon Sep 17 00:00:00 2001 From: Antony Stubbs Date: Tue, 21 Apr 2026 11:31:32 +1200 Subject: [PATCH 02/12] ci: modernize GitHub Actions CI with quality gates and test infrastructure Replace Confluent's Semaphore CI with GitHub Actions. Includes: - Parallel PR builds (unit/integration/performance) with fail-fast - Quality gates: PMD CPD duplicate detection, file similarity check, SpotBugs, dependency vulnerability scanning, PIT mutation testing - Maven cache with rotating save key to prevent stale-cache trap - Maven HTTP timeouts (.mvn/maven.config) for connection reliability - Codecov coverage tracking with drop prevention - Claude Code Review and PR dependency check workflows - Performance test infrastructure with self-hosted runner support - Build scripts (bin/ci-build.sh, ci-unit-test.sh, etc.) Co-Authored-By: Claude Opus 4.6 (1M context) --- .github/workflows/check-dependencies.yml | 19 ++ .github/workflows/claude-code-review.yml | 39 +++ .github/workflows/claude.yml | 50 +++ .github/workflows/maven.yml | 394 ++++++++++++++++++----- .github/workflows/performance.yml | 82 +++++ .github/workflows/publish.yml | 105 ++++++ .mvn/maven.config | 6 + AGENTS.md | 83 +++++ bin/build.sh | 12 + bin/ci-build.sh | 27 ++ bin/ci-integration-test.sh | 18 ++ bin/ci-unit-test.sh | 16 + bin/performance-test.cmd | 18 ++ bin/performance-test.sh | 21 ++ codecov.yml | 11 + pom.xml | 80 ++++- 16 files changed, 887 insertions(+), 94 deletions(-) create mode 100644 .github/workflows/check-dependencies.yml create mode 100644 .github/workflows/claude-code-review.yml create mode 100644 .github/workflows/claude.yml create mode 100644 .github/workflows/performance.yml create mode 100644 .github/workflows/publish.yml create mode 100644 .mvn/maven.config create mode 100644 AGENTS.md create mode 100755 bin/build.sh create mode 100755 bin/ci-build.sh create mode 100755 bin/ci-integration-test.sh create mode 100755 bin/ci-unit-test.sh create mode 100644 bin/performance-test.cmd create mode 100755 bin/performance-test.sh create mode 100644 codecov.yml diff --git a/.github/workflows/check-dependencies.yml b/.github/workflows/check-dependencies.yml new file mode 100644 index 000000000..cd459d954 --- /dev/null +++ b/.github/workflows/check-dependencies.yml @@ -0,0 +1,19 @@ +name: PR Dependency Check + +on: + pull_request_target: + types: [opened, edited, closed, reopened] + +permissions: + issues: read + pull-requests: read + checks: write + +jobs: + check_dependencies: + runs-on: ubuntu-latest + name: Check Dependencies + steps: + - uses: astubbs/dependencies-action@feat/auto-unblock-children-on-merge + env: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} diff --git a/.github/workflows/claude-code-review.yml b/.github/workflows/claude-code-review.yml new file mode 100644 index 000000000..0d3503722 --- /dev/null +++ b/.github/workflows/claude-code-review.yml @@ -0,0 +1,39 @@ +name: Claude Code Review + +on: + pull_request: + types: [opened, synchronize, ready_for_review, reopened] + # Optional: Only run on specific file changes + # paths: + # - "src/**/*.ts" + # - "src/**/*.tsx" + # - "src/**/*.js" + # - "src/**/*.jsx" + +jobs: + claude-review: + if: github.event.sender.type != 'Bot' + runs-on: ubuntu-latest + permissions: + contents: read + pull-requests: read + issues: read + id-token: write + + steps: + - name: Checkout repository + uses: actions/checkout@v4 + with: + fetch-depth: 1 + + - name: Run Claude Code Review + id: claude-review + uses: anthropics/claude-code-action@v1 + with: + claude_code_oauth_token: ${{ secrets.CLAUDE_CODE_OAUTH_TOKEN }} + plugin_marketplaces: 'https://github.com/anthropics/claude-code.git' + plugins: 'code-review@claude-code-plugins' + prompt: '/code-review:code-review ${{ github.repository }}/pull/${{ github.event.pull_request.number }}' + # See https://github.com/anthropics/claude-code-action/blob/main/docs/usage.md + # or https://code.claude.com/docs/en/cli-reference for available options + diff --git a/.github/workflows/claude.yml b/.github/workflows/claude.yml new file mode 100644 index 000000000..d300267f1 --- /dev/null +++ b/.github/workflows/claude.yml @@ -0,0 +1,50 @@ +name: Claude Code + +on: + issue_comment: + types: [created] + pull_request_review_comment: + types: [created] + issues: + types: [opened, assigned] + pull_request_review: + types: [submitted] + +jobs: + claude: + if: | + (github.event_name == 'issue_comment' && contains(github.event.comment.body, '@claude')) || + (github.event_name == 'pull_request_review_comment' && contains(github.event.comment.body, '@claude')) || + (github.event_name == 'pull_request_review' && contains(github.event.review.body, '@claude')) || + (github.event_name == 'issues' && (contains(github.event.issue.body, '@claude') || contains(github.event.issue.title, '@claude'))) + runs-on: ubuntu-latest + permissions: + contents: read + pull-requests: read + issues: read + id-token: write + actions: read # Required for Claude to read CI results on PRs + steps: + - name: Checkout repository + uses: actions/checkout@v4 + with: + fetch-depth: 1 + + - name: Run Claude Code + id: claude + uses: anthropics/claude-code-action@v1 + with: + claude_code_oauth_token: ${{ secrets.CLAUDE_CODE_OAUTH_TOKEN }} + + # This is an optional setting that allows Claude to read CI results on PRs + additional_permissions: | + actions: read + + # Optional: Give a custom prompt to Claude. If this is not specified, Claude will perform the instructions specified in the comment that tagged it. + # prompt: 'Update the pull request description to include a summary of changes.' + + # Optional: Add claude_args to customize behavior and configuration + # See https://github.com/anthropics/claude-code-action/blob/main/docs/usage.md + # or https://code.claude.com/docs/en/cli-reference for available options + # claude_args: '--allowed-tools Bash(gh pr:*)' + diff --git a/.github/workflows/maven.yml b/.github/workflows/maven.yml index 0a4cde6a1..e1e53a0a2 100644 --- a/.github/workflows/maven.yml +++ b/.github/workflows/maven.yml @@ -1,113 +1,331 @@ -# This workflow will build a Java project with Maven -# For more information see: https://help.github.com/actions/language-and-framework-guides/building-and-testing-java-with-maven +# PR builds: +# Unit, Integration, Performance - all in parallel with fail-fast +# (if unit tests fail at ~5 min, integration + performance are cancelled) +# +# Push builds (master): +# Full Kafka version matrix (3.1.0, 3.7.0, 3.9.1 + experimental 4.x) -# Tests disabled due to flakiness with under resourced github test machines. Confluent Jira works fine. Will fix later. -name: Unit tests only +name: Build and Test on: push: branches: [ master ] pull_request: - branches: [ master ] + +permissions: + contents: read + pull-requests: write + checks: write + +concurrency: + group: ${{ github.workflow }}-${{ github.head_ref || github.ref }} + cancel-in-progress: true jobs: - build: + + # ── PR Builds ────────────────────────────────────────────────────────── + + # Pre-warm the Maven dependency cache so the three test jobs can start + # with everything local. Without this, each job races to download the + # same deps from Maven Central, and if Central is slow, they all stall. + # TODO: If this extra serial step doesn't consistently help, remove it + # and rely on per-job caching + longer timeouts instead. Trade-off is + # wall-clock time (adds ~1-2 min serial) vs reliability vs flaky networks. + prepare-deps: + if: github.event_name == 'pull_request' + name: "Prepare Maven Cache" + runs-on: ubuntu-latest + timeout-minutes: 15 + steps: + - uses: actions/checkout@v6 + - uses: actions/setup-java@v5 + with: + distribution: 'temurin' + java-version: '17' + # Explicit cache steps below instead of setup-java's built-in + # `cache: maven`. Built-in uses actions/cache under the hood, + # which does NOT save when the primary key was a hit. Once an + # early run populates a partial cache (e.g. Central timed out + # on one POM), every subsequent run restores the same gap and + # can never write the completed version back. Rotating save + # key below forces a save every run. + - name: Restore Maven cache + uses: actions/cache/restore@v4 + with: + path: ~/.m2/repository + key: setup-java-Linux-x64-maven-${{ hashFiles('**/pom.xml') }} + restore-keys: | + setup-java-Linux-x64-maven- + - name: Download all dependencies + run: ./mvnw --batch-mode -Pci dependency:go-offline -Dlicense.skip -DincludeScope=test -U + - name: Save Maven cache (rotating key) + if: always() + uses: actions/cache/save@v4 + with: + path: ~/.m2/repository + key: setup-java-Linux-x64-maven-${{ hashFiles('**/pom.xml') }}-${{ github.run_id }} + + # All test suites in parallel -fail-fast cancels the rest if any fails + test: + if: github.event_name == 'pull_request' + needs: prepare-deps strategy: - fail-fast: false + fail-fast: true matrix: - # Why not? because we can. - # 2.0.1, 2.1.1, 2.2.2, 2.3.1, 2.4.1 don't work - needs zstd and some kafka client libs. - # Doesn't mean it couldn't be modified slightly to work... - #ak: [ 2.5.1, 2.6.1, 2.7.0, 2.8.1, 3.0.1, 3.1.0 ] - # 25 and 26 include a dep with a vulnerability which ossindex fails the build for - ak: [ 2.7.0, 2.8.1, 3.0.1, 3.1.0 ] - #ak: [ 2.7.0 ] - #jdk: [ '-P jvm8-release -Djvm8.location=/opt/hostedtoolcache/Java_Zulu_jdk/8.0.332-9/x64', '' ] - # TG currently targets 11, so can't run the tests on 8 https://github.com/astubbs/truth-generator/issues/114 - jdk: [ '' ] - experimental: [ false ] - name: [ "Stable AK version" ] include: - # AK 2.4 not supported - # - ak: "'[2.4.1,2.5)'" # currently failing - # experimental: true - # name: "Oldest AK breaking version 2.4.1+ (below 2.5.0) expected to fail" - - ak: "'[2.7.0,4)'" # currently failing - experimental: true - name: "Newest AK version 2.7.0+?" + - suite: unit + name: "Unit Tests" + cmd: "bin/ci-unit-test.sh" + timeout: 15 + # TODO: tighten integration/performance to 30m once we have 5+ + # consecutive successful runs consistently under 20m. Current 60m + # buffer is for Maven Central network variance we've observed. + - suite: integration + name: "Integration Tests" + cmd: "bin/ci-integration-test.sh" + timeout: 60 + - suite: performance + name: "Performance Tests" + cmd: "bin/performance-test.sh" + timeout: 60 + name: "${{ matrix.name }}" + runs-on: ubuntu-latest + timeout-minutes: ${{ matrix.timeout }} + steps: + - uses: actions/checkout@v6 + - uses: actions/setup-java@v5 + with: + distribution: 'temurin' + java-version: '17' + # Explicit restore (no save) so we can fall back via + # restore-keys to the rotating-key cache prepare-deps saved. + # setup-java's built-in `cache: maven` has no restore-keys + # support, so it would miss prepare-deps' `...-{run_id}` key. + - name: Restore Maven cache + uses: actions/cache/restore@v4 + with: + path: ~/.m2/repository + key: setup-java-Linux-x64-maven-${{ hashFiles('**/pom.xml') }} + restore-keys: | + setup-java-Linux-x64-maven- + - name: ${{ matrix.name }} + run: ${{ matrix.cmd }} + - name: Upload coverage to Codecov + if: always() + uses: codecov/codecov-action@v5 + with: + files: '**/target/site/jacoco/jacoco.xml,**/target/site/jacoco-it/jacoco.xml' + flags: ${{ matrix.suite }} + token: ${{ secrets.CODECOV_TOKEN }} - continue-on-error: ${{ matrix.experimental }} - name: "AK: ${{ matrix.ak }} JDK: ${{ matrix.jdk }}" + # Duplicate code detection using both PMD CPD and jscpd engines. + # See https://github.com/astubbs/duplicate-code-cross-check + duplicate-detection: + if: github.event_name == 'pull_request' + name: "Duplicate Code Check" runs-on: ubuntu-latest + timeout-minutes: 5 + permissions: + contents: read + pull-requests: write + steps: + - uses: actions/checkout@v6 + with: + fetch-depth: 0 + - uses: astubbs/duplicate-code-cross-check@v1 + with: + github-token: ${{ secrets.GITHUB_TOKEN }} + directories: 'parallel-consumer-core/src parallel-consumer-vertx/src parallel-consumer-reactor/src parallel-consumer-mutiny/src' + cpd-max-duplication: '5' + jscpd-max-duplication: '4' + # File similarity detection - finds files that are semantically similar overall + file-similarity: + if: github.event_name == 'pull_request' + name: "File Similarity Check" + runs-on: ubuntu-latest + timeout-minutes: 5 steps: - - uses: actions/checkout@v3 + - uses: actions/checkout@v6 + with: + fetch-depth: 0 + - uses: astubbs/duplicate-code-detection-tool@feat/base-vs-pr-comparison + with: + github_token: ${{ secrets.GITHUB_TOKEN }} + directories: 'parallel-consumer-core/src,parallel-consumer-vertx/src,parallel-consumer-reactor/src,parallel-consumer-mutiny/src' + file_extensions: 'java' + ignore_below: 30 + fail_above: 80 + warn_above: 50 + one_comment: true + compare_with_base: true + max_increase: 10 - - name: Setup JDK 1.8 - uses: actions/setup-java@v3 + # SpotBugs static analysis -finds null derefs, concurrency issues, resource leaks + spotbugs: + if: github.event_name == 'pull_request' + name: "SpotBugs" + runs-on: ubuntu-latest + timeout-minutes: 10 + steps: + - uses: actions/checkout@v6 + - uses: actions/setup-java@v5 with: - java-version: '8' - distribution: 'zulu' + distribution: 'temurin' + java-version: '17' cache: 'maven' + - name: Compile and run SpotBugs + run: ./mvnw --batch-mode -Pci compile spotbugs:spotbugs -Dlicense.skip -pl parallel-consumer-core -am + - name: Post SpotBugs annotations on PR + if: always() + uses: jwgmeligmeyling/spotbugs-github-action@v1.2 + with: + path: '**/target/spotbugsXml.xml' + token: ${{ secrets.GITHUB_TOKEN }} - # the patch version will be upgraded silently causing the build to eventually start failing - need to store this as a var - possible? - - name: Show java 1.8 home - # /opt/hostedtoolcache/Java_Zulu_jdk/8.0.332-9/x64/bin/java - run: which java - - # - name: Setup JDK 1.9 - # uses: actions/setup-java@v1 - # with: - # java-version: 1.9 - - # - name: Show java 1.9 home - # /opt/hostedtoolcache/jdk/9.0.7/x64 - # run: which java + # Dependency vulnerability scanning -GitHub's own dependency review + dependency-scan: + if: github.event_name == 'pull_request' + name: "Dependency Vulnerabilities" + runs-on: ubuntu-latest + timeout-minutes: 5 + steps: + - uses: actions/checkout@v6 + - uses: actions/dependency-review-action@v4 + with: + fail-on-severity: high + comment-summary-in-pr: always - - name: Setup JDK 17 - uses: actions/setup-java@v3 + # Mutation testing (PIT) - verifies test assertions are meaningful. + # Non-blocking: posts a PR comment with mutation score but doesn't + # gate merge. TODO: move to self-hosted performance runner for speed. + mutation-testing: + if: github.event_name == 'pull_request' + name: "Mutation Testing (PIT)" + runs-on: ubuntu-latest + continue-on-error: true + timeout-minutes: 300 + steps: + - uses: actions/checkout@v6 + - uses: actions/setup-java@v5 with: - distribution: 'zulu' + distribution: 'temurin' java-version: '17' cache: 'maven' + # Restore PIT history for incremental analysis — only re-mutates + # code that changed since last run. PIT writes its history to /tmp/ + # with a filename derived from groupId.artifactId.version, ignoring + # explicit -DhistoryInputLocation CLI flags when -DwithHistory is set. + - name: Restore PIT history cache + uses: actions/cache@v4 + with: + path: /tmp/*_pitest_history.bin + key: pit-history-${{ github.base_ref }}-${{ hashFiles('**/src/main/**/*.java') }} + restore-keys: | + pit-history-${{ github.base_ref }}- + pit-history- + # -DjvmArgs=-Xmx1g gives minion JVMs 1GB heap (default too small, + # causes MEMORY_ERROR / UNKNOWN_ERROR crashes). + # -DargLine= strips Jacoco's javaagent from minions (PIT has + # its own coverage). Separate from jvmArgs. + # -DtimeoutConstant/Factor generous per-mutation timeout for CI hardware + # -Dthreads=1 I/O-bound tests; parallelism adds contention not speed + # -DwithHistory incremental: skips re-mutating unchanged code. + # PIT writes history to /tmp/{groupId}.{artifactId}... + # which we cache above via actions/cache. + # targetClasses: internal.* only — the core engine where mutations matter most. + # Broadening to io.confluent.parallelconsumer.* hit GitHub's 6h job cap (#41). + - name: Run PIT mutation testing + run: | + ./mvnw --batch-mode -Pci test-compile org.pitest:pitest-maven:mutationCoverage \ + -Dlicense.skip \ + -DtargetClasses="io.confluent.parallelconsumer.internal.*" \ + -DtargetTests="io.confluent.parallelconsumer.*" \ + -DjvmArgs=-Xmx1g \ + -DargLine= \ + -DtimeoutConstant=30000 -DtimeoutFactor=3.0 \ + -Dthreads=1 \ + -DwithHistory \ + -pl parallel-consumer-core -am + - name: Upload PIT report + if: always() + uses: actions/upload-artifact@v4 + with: + name: pit-report + path: '**/target/pit-reports/**' + - name: Post PIT summary to PR + if: always() + uses: actions/github-script@v7 + with: + script: | + const fs = require('fs'); + const { execSync } = require('child_process'); + const files = execSync('find . -path "*/pit-reports/*/mutations.csv" -type f').toString().trim().split('\n').filter(f => f); + let body; + if (files.length === 0) { + body = `## :x: Mutation Testing (PIT) Report\n\n`; + body += `**PIT did not produce a report.** Most commonly this means a test failed in the baseline (PIT runs all tests unmodified first to establish green) and PIT aborted before mutating. See the "Run PIT mutation testing" step logs for the failing test, then either fix it or add it to \`-DexcludedTestClasses\` in the workflow.\n`; + } else { + let killed = 0, survived = 0, noCov = 0, total = 0; + for (const f of files) { + const lines = fs.readFileSync(f, 'utf8').split('\n').filter(l => l.trim()); + for (const line of lines) { + total++; + if (line.includes('KILLED')) killed++; + else if (line.includes('SURVIVED')) survived++; + else if (line.includes('NO_COVERAGE')) noCov++; + } + } + const score = total > 0 ? ((killed / total) * 100).toFixed(1) : '0'; + body = `## Mutation Testing (PIT) Report\n\n`; + body += `| Metric | Value |\n|--------|-------|\n`; + body += `| Mutations generated | ${total} |\n`; + body += `| Killed (detected) | ${killed} |\n`; + body += `| Survived (missed) | ${survived} |\n`; + body += `| No coverage | ${noCov} |\n`; + body += `| **Mutation score** | **${score}%** |\n\n`; + body += `Full HTML report available as artifact: \`pit-report\`\n`; + } + const comments = await github.rest.issues.listComments({ owner: context.repo.owner, repo: context.repo.repo, issue_number: context.issue.number }); + const existing = comments.data.find(c => c.body.includes('Mutation Testing (PIT) Report')); + if (existing) { + await github.rest.issues.updateComment({ owner: context.repo.owner, repo: context.repo.repo, comment_id: existing.id, body }); + } else { + await github.rest.issues.createComment({ owner: context.repo.owner, repo: context.repo.repo, issue_number: context.issue.number, body }); + } - - name: Show java 17 home - # /opt/hostedtoolcache/jdk/13.0.2/x64/bin/java - run: which java - - # - name: Show java version - # run: java -version - - # - name: Show mvn version - # run: mvn -version - - # - name: Build with Maven on Java 13 - # run: mvn -B package --file pom.xml - - - # done automatically now - # - name: Cache Maven packages - # uses: actions/cache@v2.1.7 - # with: - # path: ~/.m2/repository - # key: ${{ runner.os }}-m2 - # restore-keys: ${{ runner.os }}-m2 - - - name: Test with Maven - run: mvn -Pci -B package ${{ matrix.jdk }} -Dkafka.version=${{ matrix.ak }} -Dlicense.skip + # ── Push Builds (master) ─────────────────────────────────────────────── -# - name: Archive test results -# if: ${{ always() }} -# uses: actions/upload-artifact@v2 -# with: -# name: test-reports -# path: target/**-reports/* -# retention-days: 14 -# -# - name: Archive surefire test results -# if: ${{ always() }} -# uses: actions/upload-artifact@v2 -# with: -# name: test-reports -# path: target/surefire-reports/* -# retention-days: 14 + # Full Kafka version matrix as safety net + build: + if: github.event_name == 'push' + strategy: + fail-fast: false + matrix: + ak: [ 3.1.0, 3.7.0, 3.9.1 ] + experimental: [ false ] + name: [ "Stable" ] + include: + - ak: "'[3.9.1,5)'" + experimental: true + name: "Newest AK (4.x?)" + continue-on-error: ${{ matrix.experimental }} + name: "AK: ${{ matrix.ak }} (${{ matrix.name }})" + runs-on: ubuntu-latest + timeout-minutes: 30 + steps: + - uses: actions/checkout@v6 + - uses: actions/setup-java@v5 + with: + distribution: 'temurin' + java-version: '17' + cache: 'maven' + - name: Build and Test + run: bin/ci-build.sh ${{ matrix.ak }} + - name: Upload coverage to Codecov + if: always() + uses: codecov/codecov-action@v5 + with: + files: '**/target/site/jacoco/jacoco.xml,**/target/site/jacoco-it/jacoco.xml' + flags: ak-${{ matrix.ak }} + token: ${{ secrets.CODECOV_TOKEN }} diff --git a/.github/workflows/performance.yml b/.github/workflows/performance.yml new file mode 100644 index 000000000..b33de3a14 --- /dev/null +++ b/.github/workflows/performance.yml @@ -0,0 +1,82 @@ +# Performance test suite, run on a self-hosted Windows runner with Docker Desktop. +# +# These tests are tagged @Tag("performance") and excluded from the regular CI +# build because they need substantial hardware (CPU, memory, disk). They run +# on dedicated machines where the user has labelled their runner with the +# "performance" custom label. +# +# Triggers: +# - workflow_dispatch (manual) - primary trigger +# - schedule (weekly) - automated regression check +# - NOT on PRs from forks - self-hosted runners + untrusted code = bad +# +# See docs/SELF_HOSTED_RUNNER.md for one-time runner setup instructions. + +name: Performance Tests + +on: + workflow_dispatch: + inputs: + kafka_version: + description: 'Kafka version to test against (default: project default)' + required: false + type: string + default: '' + schedule: + # Weekly on Sunday at 02:00 UTC + - cron: '0 2 * * 0' + +concurrency: + # Only run one performance test at a time per branch - they're slow and resource-heavy + group: performance-${{ github.ref }} + cancel-in-progress: false + +permissions: + contents: read + +jobs: + performance: + name: "Performance suite (self-hosted)" + # Targets a self-hosted runner labelled "performance" running Windows. + # The "self-hosted" label is automatic; "windows" and "performance" are + # added when the runner is registered. See docs/SELF_HOSTED_RUNNER.md. + runs-on: [self-hosted, windows, performance] + timeout-minutes: 180 + + steps: + - name: Checkout + uses: actions/checkout@v6 + + - name: Setup JDK 17 + uses: actions/setup-java@v5 + with: + distribution: 'temurin' + java-version: '17' + # Don't cache here - self-hosted runners persist .m2 across runs already + cache: '' + + - name: Show environment + shell: cmd + run: | + java -version + docker --version + docker info + + - name: Run performance tests + shell: cmd + env: + KAFKA_VERSION: ${{ inputs.kafka_version }} + run: | + if defined KAFKA_VERSION ( + call bin\performance-test.cmd -Dkafka.version=%KAFKA_VERSION% + ) else ( + call bin\performance-test.cmd + ) + + - name: Upload test reports + if: always() + uses: actions/upload-artifact@v7 + with: + name: performance-reports-${{ github.run_number }} + path: '**/target/*-reports/*.xml' + retention-days: 30 diff --git a/.github/workflows/publish.yml b/.github/workflows/publish.yml new file mode 100644 index 000000000..eda59dc81 --- /dev/null +++ b/.github/workflows/publish.yml @@ -0,0 +1,105 @@ +# Publishes to Maven Central on every push to master. +# +# The pom.xml version is the source of truth: +# - Versions ending in -SNAPSHOT -> publish snapshot +# - Versions without -SNAPSHOT -> publish release + create git tag + GitHub release +# +# To cut a release: open a PR that strips -SNAPSHOT, merge it. CI will publish. +# Then open another PR bumping to the next -SNAPSHOT. +# +# Required GitHub secrets: +# MAVEN_CENTRAL_USERNAME - Sonatype Central Portal token username +# MAVEN_CENTRAL_PASSWORD - Sonatype Central Portal token password +# MAVEN_GPG_PRIVATE_KEY - Armored GPG private key (gpg --export-secret-keys --armor KEYID) +# MAVEN_GPG_PASSPHRASE - Passphrase for the GPG key + +name: Publish to Maven Central + +on: + push: + branches: [ master ] + workflow_dispatch: # allow manual re-runs + +concurrency: + group: publish-${{ github.ref }} + cancel-in-progress: false # never cancel a publish mid-flight + +permissions: + contents: write # needed to create tags and releases + +jobs: + publish: + name: Publish + runs-on: ubuntu-latest + + steps: + - uses: actions/checkout@v6 + with: + fetch-depth: 0 # need full history for tag creation + + - name: Setup JDK 17 + uses: actions/setup-java@v5 + with: + distribution: 'temurin' + java-version: '17' + cache: 'maven' + # Configure Maven settings.xml with Sonatype credentials and GPG + server-id: central + server-username: MAVEN_CENTRAL_USERNAME + server-password: MAVEN_CENTRAL_PASSWORD + gpg-private-key: ${{ secrets.MAVEN_GPG_PRIVATE_KEY }} + gpg-passphrase: MAVEN_GPG_PASSPHRASE + + - name: Read project version + id: version + run: | + VERSION=$(./mvnw help:evaluate -Dexpression=project.version -q -DforceStdout) + echo "version=$VERSION" >> "$GITHUB_OUTPUT" + if [[ "$VERSION" == *-SNAPSHOT ]]; then + echo "is_snapshot=true" >> "$GITHUB_OUTPUT" + echo "Detected SNAPSHOT version: $VERSION" + else + echo "is_snapshot=false" >> "$GITHUB_OUTPUT" + echo "Detected RELEASE version: $VERSION" + fi + + - name: Check tag does not exist (releases only) + if: steps.version.outputs.is_snapshot == 'false' + run: | + TAG="v${{ steps.version.outputs.version }}" + if git rev-parse "$TAG" >/dev/null 2>&1; then + echo "Tag $TAG already exists - aborting to prevent re-release" + exit 1 + fi + + - name: Deploy to Maven Central + env: + MAVEN_CENTRAL_USERNAME: ${{ secrets.MAVEN_CENTRAL_USERNAME }} + MAVEN_CENTRAL_PASSWORD: ${{ secrets.MAVEN_CENTRAL_PASSWORD }} + MAVEN_GPG_PASSPHRASE: ${{ secrets.MAVEN_GPG_PASSPHRASE }} + run: | + ./mvnw --batch-mode \ + -Pmaven-central \ + -Pci \ + clean deploy \ + -DskipTests \ + -Dlicense.skip + + - name: Create git tag (releases only) + if: steps.version.outputs.is_snapshot == 'false' + run: | + TAG="v${{ steps.version.outputs.version }}" + git config user.name "github-actions[bot]" + git config user.email "41898282+github-actions[bot]@users.noreply.github.com" + git tag -a "$TAG" -m "Release $TAG" + git push origin "$TAG" + + - name: Create GitHub release (releases only) + if: steps.version.outputs.is_snapshot == 'false' + env: + GH_TOKEN: ${{ secrets.GITHUB_TOKEN }} + run: | + TAG="v${{ steps.version.outputs.version }}" + gh release create "$TAG" \ + --title "$TAG" \ + --generate-notes diff --git a/.mvn/maven.config b/.mvn/maven.config new file mode 100644 index 000000000..bfe087449 --- /dev/null +++ b/.mvn/maven.config @@ -0,0 +1,6 @@ +-Dmaven.wagon.http.connectionTimeout=10000 +-Dmaven.wagon.http.readTimeout=120000 +-Dmaven.wagon.httpconnectionManager.ttlSeconds=120 +-Dmaven.wagon.http.retryHandler.count=3 +-Daether.connector.connectTimeout=10000 +-Daether.connector.requestTimeout=120000 diff --git a/AGENTS.md b/AGENTS.md new file mode 100644 index 000000000..123dc5401 --- /dev/null +++ b/AGENTS.md @@ -0,0 +1,83 @@ +# Parallel Consumer - Agent Context + +Project context for AI coding agents (Claude Code, Copilot, Cursor, etc.). + +## Overview + +Parallel Consumer is a Java library that enables concurrent message processing from Apache Kafka with a single consumer, avoiding the need to increase partition counts. It maintains ordering guarantees (by partition or key) while processing messages in parallel. + +This is a community-maintained fork of `confluentinc/parallel-consumer` (the upstream is no longer actively maintained), published to Maven Central as `io.github.astubbs.parallelconsumer`. + +## Build Requirements + +- **JDK 17** (required - the project uses Jabel to compile Java 17 source to Java 8 bytecode) +- **Docker** (required for integration tests - TestContainers spins up Kafka brokers) +- **Maven** via wrapper (`./mvnw`) - do not use system Maven + +## How to Build + +```bash +# Quick local build (compile + unit tests) +bin/build.sh + +# Full CI build with all tests (unit + integration) +bin/ci-build.sh + +# Full CI build against a specific Kafka version +bin/ci-build.sh 3.9.1 +``` + +## Module Structure + +| Module | Purpose | +|--------|---------| +| `parallel-consumer-core` | Core library - consumer, producer, offset management, sharding | +| `parallel-consumer-vertx` | Vert.x integration for async HTTP | +| `parallel-consumer-reactor` | Project Reactor integration | +| `parallel-consumer-mutiny` | SmallRye Mutiny integration (Quarkus) | +| `parallel-consumer-examples` | Example implementations for each module | + +## Key Architecture Decisions + +- **Jabel cross-compilation**: Source is Java 17, bytecode targets Java 8 via Jabel annotation processor. This means `--release 8` is set in the compiler plugin, which restricts available APIs to Java 8 surface. The Mutiny module overrides this to `--release 9` because Mutiny uses `java.util.concurrent.Flow` (Java 9+). +- **Offset encoding**: Custom offset map encoding (run-length, bitset) stored in Kafka commit metadata for tracking in-flight messages. +- **Sharding**: Messages are distributed to processing shards by key or partition for ordering guarantees. + +## Testing + +- **Unit tests**: `mvn test` / surefire plugin. Source in `src/test/java/`. +- **Integration tests**: `mvn verify` / failsafe plugin. Source in `src/test-integration/java/`. Uses TestContainers with `confluentinc/cp-kafka` Docker image. +- **Test exclusion patterns**: `**/integrationTest*/**/*.java` and `**/*IT.java` are excluded from surefire, included in failsafe. +- **Kafka version matrix**: CI tests against multiple Kafka versions via `-Dkafka.version=X.Y.Z`. + +## Known Issues + +- **Mutiny module**: Has a `release.target=9` override in its pom.xml because Mutiny's `Multi` implements `java.util.concurrent.Flow.Publisher` which is not available with `--release 8`. + +## Code Style + +- **Lombok**: Used extensively (builders, getters, logging). IntelliJ Lombok plugin required. +- **EditorConfig**: Enforced via `.editorconfig` - 4-space indent for Java, 120 char line length. +- **License headers**: Managed by `license-maven-plugin` (Mycila). Use `-Dlicense.skip` locally to skip checks. +- **Google Truth**: Used for test assertions alongside JUnit 5 and Mockito. + +## CI + +- **`.github/workflows/maven.yml`** — Build and test on every push/PR. Push uses default Kafka version, PRs run the full version matrix. Includes concurrency cancellation. +- **`.github/workflows/publish.yml`** — Publishes to Maven Central on every push to `master`. The pom.xml version is the source of truth: `-SNAPSHOT` versions deploy as snapshots, non-snapshot versions deploy as full releases (and create a git tag + GitHub release). +- **`.semaphore/`** — Legacy Confluent internal CI/release pipelines, retained but inactive on the fork. + +## Releasing + +The pom.xml version drives publishing — there is no `maven-release-plugin` dance. + +**Cut a release:** +1. Open a PR removing `-SNAPSHOT` from `` in the parent pom (e.g. `0.6.0.0-SNAPSHOT` → `0.6.0.0`) +2. Merge it to master → CI publishes to Maven Central, tags `v0.6.0.0`, creates a GitHub release +3. Open another PR bumping to the next snapshot (e.g. `0.6.0.1-SNAPSHOT`) and merge + +**Required GitHub repo secrets** for `publish.yml`: +- `MAVEN_CENTRAL_USERNAME` — Sonatype Central Portal token username +- `MAVEN_CENTRAL_PASSWORD` — Sonatype Central Portal token password +- `MAVEN_GPG_PRIVATE_KEY` — Armored GPG private key for signing artifacts +- `MAVEN_GPG_PASSPHRASE` — Passphrase for the GPG key diff --git a/bin/build.sh b/bin/build.sh new file mode 100755 index 000000000..4f45a4056 --- /dev/null +++ b/bin/build.sh @@ -0,0 +1,12 @@ +#!/usr/bin/env bash +# +# Copyright (C) 2026 Antony Stubbs and contributors +# + +# Local development build - compile and run unit tests +# Usage: bin/build.sh [extra-maven-args...] +# Example: bin/build.sh -pl parallel-consumer-core + +set -euo pipefail + +./mvnw --batch-mode clean package -Dlicense.skip "$@" diff --git a/bin/ci-build.sh b/bin/ci-build.sh new file mode 100755 index 000000000..705c2a823 --- /dev/null +++ b/bin/ci-build.sh @@ -0,0 +1,27 @@ +#!/usr/bin/env bash +# +# Copyright (C) 2026 Antony Stubbs and contributors +# + +# CI build script - run the full build and test suite +# Usage: bin/ci-build.sh [kafka-version] +# Example: bin/ci-build.sh 3.9.1 +# If no version is specified, uses the default from pom.xml + +set -euo pipefail + +KAFKA_VERSION_ARG="" +if [ $# -ge 1 ]; then + KAFKA_VERSION_ARG="-Dkafka.version=$1" + echo "Building with Kafka version: $1" +else + echo "Building with default Kafka version from pom.xml" +fi + +./mvnw --batch-mode \ + -Pci \ + clean verify \ + $KAFKA_VERSION_ARG \ + -Dlicense.skip \ + -Dexcluded.groups=performance \ + -Dsurefire.rerunFailingTestsCount=2 diff --git a/bin/ci-integration-test.sh b/bin/ci-integration-test.sh new file mode 100755 index 000000000..24a34bffc --- /dev/null +++ b/bin/ci-integration-test.sh @@ -0,0 +1,18 @@ +#!/usr/bin/env bash +# +# Copyright (C) 2026 Antony Stubbs and contributors +# + +# Run integration tests only (failsafe, requires Docker for TestContainers). +# Skips unit tests to avoid duplicate work. +# Usage: bin/ci-integration-test.sh + +set -euo pipefail + +./mvnw --batch-mode \ + -Pci \ + clean verify \ + -DskipUTs=true \ + -Dlicense.skip \ + -Dexcluded.groups=performance \ + -Dsurefire.rerunFailingTestsCount=2 diff --git a/bin/ci-unit-test.sh b/bin/ci-unit-test.sh new file mode 100755 index 000000000..3d1e9afc6 --- /dev/null +++ b/bin/ci-unit-test.sh @@ -0,0 +1,16 @@ +#!/usr/bin/env bash +# +# Copyright (C) 2026 Antony Stubbs and contributors +# + +# Run unit tests only (surefire, no Docker/TestContainers needed). +# Usage: bin/ci-unit-test.sh + +set -euo pipefail + +./mvnw --batch-mode \ + -Pci \ + clean test \ + -Dlicense.skip \ + -Dexcluded.groups=performance \ + -Dsurefire.rerunFailingTestsCount=2 diff --git a/bin/performance-test.cmd b/bin/performance-test.cmd new file mode 100644 index 000000000..b086f05d7 --- /dev/null +++ b/bin/performance-test.cmd @@ -0,0 +1,18 @@ +@REM Copyright (C) 2026 Antony Stubbs and contributors +@REM +@REM Run only the performance test suite (tests tagged @Tag("performance")). +@REM These are excluded from the regular CI build because they take a long time +@REM and need substantial hardware. Used by the self-hosted Windows runner. +@REM +@REM Usage: bin\performance-test.cmd [extra-maven-args...] + +@echo off +setlocal + +call mvnw.cmd --batch-mode ^ + -Pci ^ + clean verify ^ + -Dincluded.groups=performance ^ + -Dexcluded.groups= ^ + -Dlicense.skip ^ + %* diff --git a/bin/performance-test.sh b/bin/performance-test.sh new file mode 100755 index 000000000..29969c94c --- /dev/null +++ b/bin/performance-test.sh @@ -0,0 +1,21 @@ +#!/usr/bin/env bash +# +# Copyright (C) 2026 Antony Stubbs and contributors +# + +# Run only the performance test suite (tests tagged @Tag("performance")). +# These are excluded from the regular CI build because they take a long time +# and need substantial hardware. The self-hosted runner workflow +# (.github/workflows/performance.yml) calls this script. +# +# Usage: bin/performance-test.sh [extra-maven-args...] + +set -euo pipefail + +./mvnw --batch-mode \ + -Pci \ + clean verify \ + -Dincluded.groups=performance \ + -Dexcluded.groups= \ + -Dlicense.skip \ + "$@" diff --git a/codecov.yml b/codecov.yml new file mode 100644 index 000000000..0b0bba041 --- /dev/null +++ b/codecov.yml @@ -0,0 +1,11 @@ +coverage: + status: + project: + default: + # Fail if overall coverage drops from the base branch + target: auto + threshold: 1% + patch: + default: + # Don't enforce a minimum on new code — just track it + informational: true diff --git a/pom.xml b/pom.xml index 8e829f7f4..00c3931e8 100644 --- a/pom.xml +++ b/pom.xml @@ -81,12 +81,23 @@ ${skipTests} ${skipTests} + + + performance + 5.0.0 1.18.28 1.1.1 3.2.5 + 4.8.6 + 4.8.6.6 + 1.17.4 + 1.2.2 2.0.13 @@ -234,6 +245,13 @@ sign + + + + --pinentry-mode + loopback + + @@ -706,6 +724,11 @@ ${skipTests} ${skipITs} methods + + ${included.groups} + ${excluded.groups} @@ -725,8 +748,50 @@ report + + prepare-agent-integration + + prepare-agent-integration + + + + report-integration + post-integration-test + + report-integration + + + + com.github.spotbugs + spotbugs-maven-plugin + ${spotbugs-maven-plugin.version} + + Max + Medium + true + + + + com.github.spotbugs + spotbugs + ${spotbugs.version} + + + + + org.pitest + pitest-maven + ${pitest.version} + + + org.pitest + pitest-junit5-plugin + ${pitest-junit5.version} + + + org.apache.maven.plugins maven-enforcer-plugin @@ -950,18 +1015,21 @@ + - - confluent - https://packages.confluent.io/maven/ - central https://repo1.maven.org/maven2/ - jitpack.io - https://jitpack.io + confluent + https://packages.confluent.io/maven/ astubbs-truth-generator From afde8c5e374e999daa946d955fad7313844725af Mon Sep 17 00:00:00 2001 From: Antony Stubbs Date: Tue, 21 Apr 2026 11:31:56 +1200 Subject: [PATCH 03/12] fix: null epoch guard, TestContainers version match, VertxConcurrencyIT stall - EpochAndRecordsMap: skip records for partitions with no epoch assigned yet (race between poll and onPartitionsAssigned) - VertxConcurrencyIT: increase wait timeout and add CountDownLatch protection to prevent CI stalls Co-Authored-By: Claude Opus 4.6 (1M context) --- .../internal/EpochAndRecordsMap.java | 11 +++++++++++ .../vertx/integrationTests/VertxConcurrencyIT.java | 14 +++++++++----- 2 files changed, 20 insertions(+), 5 deletions(-) diff --git a/parallel-consumer-core/src/main/java/io/confluent/parallelconsumer/internal/EpochAndRecordsMap.java b/parallel-consumer-core/src/main/java/io/confluent/parallelconsumer/internal/EpochAndRecordsMap.java index 1ad4f6aa0..0b482ce8b 100644 --- a/parallel-consumer-core/src/main/java/io/confluent/parallelconsumer/internal/EpochAndRecordsMap.java +++ b/parallel-consumer-core/src/main/java/io/confluent/parallelconsumer/internal/EpochAndRecordsMap.java @@ -7,6 +7,7 @@ import io.confluent.parallelconsumer.state.PartitionStateManager; import lombok.NonNull; import lombok.Value; +import lombok.extern.slf4j.Slf4j; import org.apache.kafka.clients.consumer.ConsumerRecord; import org.apache.kafka.clients.consumer.ConsumerRecords; import org.apache.kafka.common.TopicPartition; @@ -18,6 +19,7 @@ * * @see BrokerPollSystem#partitionAssignmentEpoch */ +@Slf4j @Value public class EpochAndRecordsMap { @@ -27,6 +29,15 @@ public EpochAndRecordsMap(ConsumerRecords poll, PartitionStateManager { var records = poll.records(partition); Long epochOfPartition = pm.getEpochOfPartition(partition); + if (epochOfPartition == null) { + // Race: poll() returned records for a partition before onPartitionsAssigned() + // has fired. This is more likely with Kafka 2.x's eager rebalance protocol. + // Safe to skip — these records haven't been committed, so Kafka will re-deliver + // them on the next poll after the assignment callback completes. + log.warn("Skipping {} records for partition {} — no epoch assigned yet. " + + "Records will be re-delivered on next poll after assignment completes.", records.size(), partition); + return; + } RecordsAndEpoch entry = new RecordsAndEpoch(partition, epochOfPartition, records); recordMap.put(partition, entry); }); diff --git a/parallel-consumer-vertx/src/test-integration/java/io/confluent/parallelconsumer/vertx/integrationTests/VertxConcurrencyIT.java b/parallel-consumer-vertx/src/test-integration/java/io/confluent/parallelconsumer/vertx/integrationTests/VertxConcurrencyIT.java index 31838f9b8..836e4f64f 100644 --- a/parallel-consumer-vertx/src/test-integration/java/io/confluent/parallelconsumer/vertx/integrationTests/VertxConcurrencyIT.java +++ b/parallel-consumer-vertx/src/test-integration/java/io/confluent/parallelconsumer/vertx/integrationTests/VertxConcurrencyIT.java @@ -32,6 +32,7 @@ import org.junit.jupiter.api.AfterAll; import org.junit.jupiter.api.BeforeAll; import org.junit.jupiter.api.Test; +import org.junit.jupiter.api.Timeout; import org.junit.jupiter.api.parallel.Isolated; import org.testcontainers.junit.jupiter.Testcontainers; import pl.tlinkowski.unij.api.UniMaps; @@ -40,6 +41,7 @@ import java.util.concurrent.ConcurrentLinkedQueue; import java.util.concurrent.CountDownLatch; import java.util.concurrent.Future; +import java.util.concurrent.TimeUnit; import java.util.concurrent.atomic.AtomicInteger; import static com.github.tomakehurst.wiremock.client.WireMock.*; @@ -134,6 +136,7 @@ static void close() { * would expect. */ @Test + @Timeout(value = 5, unit = TimeUnit.MINUTES) @SneakyThrows void testVertxConcurrency() { var commitMode = PERIODIC_CONSUMER_ASYNCHRONOUS; @@ -208,7 +211,7 @@ void testVertxConcurrency() { var failureMessage = msg("Mock server receives {} requests in parallel from vertx engine", expectedMessageCount / 2); try { - waitAtMost(ofSeconds(20)) + waitAtMost(ofSeconds(120)) .pollInterval(ofMillis(200)) .alias(failureMessage) .untilAsserted(() -> { @@ -217,11 +220,12 @@ void testVertxConcurrency() { }); } catch (ConditionTimeoutException e) { fail(failureMessage + "\n" + e.getMessage()); + } finally { + // Always release the latch — if the test fails, this prevents WireMock threads + // from hanging for 30s each on the unreleased latch + log.info("{} requests received by server, releasing server response lock.", requestsReceivedOnServer.size()); + LatchTestUtils.release(responseLock); } - log.info("{} requests received in parallel by server, releasing server response lock.", requestsReceivedOnServer.size()); - - // all requests were received in parallel, so unlock the server to respond to all of them - LatchTestUtils.release(responseLock); // assertNumberOfThreads(); From 9e133ce0c5a5d50ba00a9d2886c20b56006cf4d3 Mon Sep 17 00:00:00 2001 From: Antony Stubbs Date: Tue, 21 Apr 2026 11:32:20 +1200 Subject: [PATCH 04/12] test: fix test pollution - leaked threads, Awaitility state, race conditions Root-caused PIT's ~85s baseline cliff: MockConsumerTestWithEarlyClose and CommitTimeoutException leaked non-daemon threads that threw IllegalStateException on closed MockConsumer during the next test. Fixes: - MockConsumerTestWithEarlyClose/CommitTimeoutException: daemon threads, interrupt in finally, catch IllegalStateException in addRecords loop - MockConsumerTestWithSaslAuthenticationException: scope Awaitility timeout locally, add @AfterEach close, shrink mock-failure window - ParallelEoSSStreamProcessorRebalancedTest: remove empty close() override hiding base-class cleanup - ProducerManagerTest: @AfterEach Awaitility.reset(), explicit timeouts - PCMetricsTest: explicit atMost(120s) on bare await() calls - PartitionOrderProcessingTest: merge await + assertion to fix race - Base class: Awaitility.reset() in @AfterEach as general guard - New EpochAndRecordsMapRaceTest for null-epoch guard Co-Authored-By: Claude Opus 4.6 (1M context) --- .../BrokerIntegrationTest.java | 27 +++- .../LargeVolumeInMemoryTests.java | 2 + .../MultiInstanceHighVolumeTest.java | 2 + .../PartitionOrderProcessingTest.java | 18 ++- .../VeryLargeMessageVolumeTest.java | 2 + ...actParallelEoSStreamProcessorTestBase.java | 7 + ...onsumerTestWithCommitTimeoutException.java | 58 ++++---- .../MockConsumerTestWithEarlyClose.java | 58 +++++--- ...erTestWithSaslAuthenticationException.java | 47 ++++-- .../parallelconsumer/PCMetricsTest.java | 6 +- ...llelEoSSStreamProcessorRebalancedTest.java | 5 - .../internal/EpochAndRecordsMapRaceTest.java | 139 ++++++++++++++++++ .../internal/ProducerManagerTest.java | 29 +++- 13 files changed, 318 insertions(+), 82 deletions(-) create mode 100644 parallel-consumer-core/src/test/java/io/confluent/parallelconsumer/internal/EpochAndRecordsMapRaceTest.java diff --git a/parallel-consumer-core/src/test-integration/java/io/confluent/parallelconsumer/integrationTests/BrokerIntegrationTest.java b/parallel-consumer-core/src/test-integration/java/io/confluent/parallelconsumer/integrationTests/BrokerIntegrationTest.java index 57275a8b3..f244e6e57 100644 --- a/parallel-consumer-core/src/test-integration/java/io/confluent/parallelconsumer/integrationTests/BrokerIntegrationTest.java +++ b/parallel-consumer-core/src/test-integration/java/io/confluent/parallelconsumer/integrationTests/BrokerIntegrationTest.java @@ -55,8 +55,33 @@ public abstract class BrokerIntegrationTest { */ public static KafkaContainer kafkaContainer = createKafkaContainer(null); + /** + * Derives the Confluent Platform version from the Apache Kafka client version so that + * the broker under test matches the client. The CI matrix overrides {@code kafka.version} + * via {@code -Dkafka.version=X.Y.Z}, so we read it at runtime from the client jar. + *

+ * Mapping: CP major = AK major + 4 (e.g., AK 2.8 → CP 6.2, AK 3.9 → CP 7.9). + */ + static String deriveCpKafkaImage() { + String akVersion = org.apache.kafka.common.utils.AppInfoParser.getVersion(); + log.info("Kafka client version detected: {}", akVersion); + + // Parse major.minor from the AK version + String[] parts = akVersion.split("\\."); + int akMajor = Integer.parseInt(parts[0]); + int akMinor = Integer.parseInt(parts[1]); + + // CP major = AK major + 4, CP minor = AK minor + int cpMajor = akMajor + 4; + int cpMinor = akMinor; + + String cpImage = "confluentinc/cp-kafka:" + cpMajor + "." + cpMinor + ".0"; + log.info("Using CP Kafka image: {} (derived from AK {})", cpImage, akVersion); + return cpImage; + } + public static KafkaContainer createKafkaContainer(String logSegmentSize) { - KafkaContainer base = new KafkaContainer(DockerImageName.parse("confluentinc/cp-kafka:7.6.0")) + KafkaContainer base = new KafkaContainer(DockerImageName.parse(deriveCpKafkaImage())) .withEnv("KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR", "1") //transaction.state.log.replication.factor .withEnv("KAFKA_TRANSACTION_STATE_LOG_MIN_ISR", "1") //transaction.state.log.min.isr .withEnv("KAFKA_TRANSACTION_STATE_LOG_NUM_PARTITIONS", "1") //transaction.state.log.num.partitions diff --git a/parallel-consumer-core/src/test-integration/java/io/confluent/parallelconsumer/integrationTests/LargeVolumeInMemoryTests.java b/parallel-consumer-core/src/test-integration/java/io/confluent/parallelconsumer/integrationTests/LargeVolumeInMemoryTests.java index 2e507eb6d..877434a04 100644 --- a/parallel-consumer-core/src/test-integration/java/io/confluent/parallelconsumer/integrationTests/LargeVolumeInMemoryTests.java +++ b/parallel-consumer-core/src/test-integration/java/io/confluent/parallelconsumer/integrationTests/LargeVolumeInMemoryTests.java @@ -21,6 +21,7 @@ import org.apache.kafka.clients.producer.ProducerRecord; import org.apache.kafka.common.TopicPartition; import org.assertj.core.util.Lists; +import org.junit.jupiter.api.Tag; import org.junit.jupiter.params.ParameterizedTest; import org.junit.jupiter.params.provider.EnumSource; import pl.tlinkowski.unij.api.UniLists; @@ -44,6 +45,7 @@ * Mocked out comparative volume tests */ @Slf4j +@Tag("performance") class LargeVolumeInMemoryTests extends ParallelEoSStreamProcessorTestBase { @SneakyThrows diff --git a/parallel-consumer-core/src/test-integration/java/io/confluent/parallelconsumer/integrationTests/MultiInstanceHighVolumeTest.java b/parallel-consumer-core/src/test-integration/java/io/confluent/parallelconsumer/integrationTests/MultiInstanceHighVolumeTest.java index 56efc1e4f..5db19804e 100644 --- a/parallel-consumer-core/src/test-integration/java/io/confluent/parallelconsumer/integrationTests/MultiInstanceHighVolumeTest.java +++ b/parallel-consumer-core/src/test-integration/java/io/confluent/parallelconsumer/integrationTests/MultiInstanceHighVolumeTest.java @@ -17,6 +17,7 @@ import org.assertj.core.api.Assertions; import org.assertj.core.api.SoftAssertions; import org.awaitility.core.ConditionTimeoutException; +import org.junit.jupiter.api.Tag; import org.junit.jupiter.api.Test; import java.util.ArrayList; @@ -34,6 +35,7 @@ import static pl.tlinkowski.unij.api.UniLists.of; @Slf4j +@Tag("performance") class MultiInstanceHighVolumeTest extends BrokerIntegrationTest { public List consumedKeys = Collections.synchronizedList(new ArrayList<>()); diff --git a/parallel-consumer-core/src/test-integration/java/io/confluent/parallelconsumer/integrationTests/PartitionOrderProcessingTest.java b/parallel-consumer-core/src/test-integration/java/io/confluent/parallelconsumer/integrationTests/PartitionOrderProcessingTest.java index 91a5d8706..9a9660f2d 100644 --- a/parallel-consumer-core/src/test-integration/java/io/confluent/parallelconsumer/integrationTests/PartitionOrderProcessingTest.java +++ b/parallel-consumer-core/src/test-integration/java/io/confluent/parallelconsumer/integrationTests/PartitionOrderProcessingTest.java @@ -17,6 +17,7 @@ import org.junit.jupiter.api.Test; import pl.tlinkowski.unij.api.UniSets; +import java.time.Duration; import java.util.HashMap; import java.util.Map; import java.util.Properties; @@ -96,8 +97,18 @@ void allPartitionsAreProcessedInParallel() { partitionCounts.get(recordContexts.getSingleConsumerRecord().partition()).getAndIncrement(); ThreadUtils.sleepQuietly(10); // introduce a bit of processing delay - to make sure polling backpressure kicks in. }); - await().until(() -> partitionCounts.values().stream().mapToInt(AtomicInteger::get).sum() > 500); // wait until we process some messages to get the counts in. - Assertions.assertTrue(partitionCounts.values().stream().allMatch(v -> v.get() > 0), "Expect all partitions to have some messages processed, actual partitionCounts:" + partitionCounts); + // Wait for BOTH conditions: enough total messages AND all partitions represented. + // Previously the await only checked total > 500, then the assertion checked all + // partitions — a race, because Kafka may deliver from one partition first. + // Moving the partition check into the await lets Awaitility retry until + // all partitions have been reached. + await().atMost(Duration.ofSeconds(120)).untilAsserted(() -> { + int total = partitionCounts.values().stream().mapToInt(AtomicInteger::get).sum(); + Assertions.assertTrue(total > 500, + "Expect > 500 total messages processed, actual: " + total); + Assertions.assertTrue(partitionCounts.values().stream().allMatch(v -> v.get() > 0), + "Expect all partitions to have some messages processed, actual partitionCounts:" + partitionCounts); + }); } @@ -129,7 +140,8 @@ void allPartitionsAreNotProcessedInParallel() { partitionCounts.get(recordContexts.getSingleConsumerRecord().partition()).getAndIncrement(); ThreadUtils.sleepQuietly(10); // introduce a bit of processing delay - to make sure polling backpressure kicks in. }); - await().until(() -> partitionCounts.values().stream().mapToInt(AtomicInteger::get).sum() > 500); // wait until we process some messages to get the counts in. + // 120s explicit timeout — bare await() used shaded Awaitility's 10s default, too tight for CI. + await().atMost(Duration.ofSeconds(120)).until(() -> partitionCounts.values().stream().mapToInt(AtomicInteger::get).sum() > 500); Assertions.assertFalse(partitionCounts.values().stream().allMatch(v -> v.get() > 0), "Expect some processing thread starving and not all partition counts to have some messages processed, actual partitionCounts:" + partitionCounts); } diff --git a/parallel-consumer-core/src/test-integration/java/io/confluent/parallelconsumer/integrationTests/VeryLargeMessageVolumeTest.java b/parallel-consumer-core/src/test-integration/java/io/confluent/parallelconsumer/integrationTests/VeryLargeMessageVolumeTest.java index 6c32c8c09..f27922288 100644 --- a/parallel-consumer-core/src/test-integration/java/io/confluent/parallelconsumer/integrationTests/VeryLargeMessageVolumeTest.java +++ b/parallel-consumer-core/src/test-integration/java/io/confluent/parallelconsumer/integrationTests/VeryLargeMessageVolumeTest.java @@ -25,6 +25,7 @@ import org.assertj.core.api.Assertions; import org.assertj.core.api.SoftAssertions; import org.awaitility.core.ConditionTimeoutException; +import org.junit.jupiter.api.Tag; import org.junit.jupiter.api.Test; import java.util.*; @@ -51,6 +52,7 @@ * RuntimeException when running with very high options in 0.2.0.0 (Bitset too long to encode) #35 */ @Slf4j +@Tag("performance") public class VeryLargeMessageVolumeTest extends BrokerIntegrationTest { int HIGH_MAX_POLL_RECORDS_CONFIG = 10_000; diff --git a/parallel-consumer-core/src/test/java/io/confluent/parallelconsumer/AbstractParallelEoSStreamProcessorTestBase.java b/parallel-consumer-core/src/test/java/io/confluent/parallelconsumer/AbstractParallelEoSStreamProcessorTestBase.java index 918eb0657..652cf88ec 100644 --- a/parallel-consumer-core/src/test/java/io/confluent/parallelconsumer/AbstractParallelEoSStreamProcessorTestBase.java +++ b/parallel-consumer-core/src/test/java/io/confluent/parallelconsumer/AbstractParallelEoSStreamProcessorTestBase.java @@ -21,6 +21,7 @@ import org.apache.kafka.clients.producer.MockProducer; import org.apache.kafka.common.TopicPartition; import org.apache.kafka.common.serialization.Serdes; +import org.awaitility.Awaitility; import org.junit.jupiter.api.AfterEach; import org.junit.jupiter.api.BeforeEach; import pl.tlinkowski.unij.api.UniLists; @@ -144,6 +145,12 @@ protected ParallelConsumerOptions.ParallelConsumerOptionsBuilder @AfterEach public void close() { + // Reset Awaitility's global thread-local timeout state so per-test overrides + // (e.g. setDefaultTimeout) don't leak into other tests under non-deterministic + // test order (PIT baseline/mutations surface this; surefire's default ordering + // happens to mask it). Runs even if the test body threw. + Awaitility.reset(); + // don't try to close if error'd (at least one test purposefully creates an error to tests error handling) - we // don't want to bubble up an error here that we expect from here. if (!parentParallelConsumer.isClosedOrFailed()) { diff --git a/parallel-consumer-core/src/test/java/io/confluent/parallelconsumer/MockConsumerTestWithCommitTimeoutException.java b/parallel-consumer-core/src/test/java/io/confluent/parallelconsumer/MockConsumerTestWithCommitTimeoutException.java index 7593375b7..2c443eae8 100644 --- a/parallel-consumer-core/src/test/java/io/confluent/parallelconsumer/MockConsumerTestWithCommitTimeoutException.java +++ b/parallel-consumer-core/src/test/java/io/confluent/parallelconsumer/MockConsumerTestWithCommitTimeoutException.java @@ -87,39 +87,47 @@ public synchronized void commitSync(Map offse parallelConsumer.onPartitionsAssigned(of(tp)); mockConsumer.updateBeginningOffsets(startOffsets); - // - new Thread() { - public void run() { - addRecords(mockConsumer); - } - }.start(); - - // - ConcurrentLinkedQueue> records = new ConcurrentLinkedQueue<>(); - parallelConsumer.poll(recordContexts -> { - recordContexts.forEach(recordContext -> { - log.warn("Processing: {}", recordContext); - records.add(recordContext); + // Daemon thread: must NOT survive past this test method, or when it wakes + // from sleep it'll addRecord() on a closed mockConsumer and throw an + // uncaught exception that PIT attributes to whatever test is running next + // in the same minion JVM. We also interrupt it and close PC in the finally + // block. + Thread recordAdder = new Thread(() -> addRecords(mockConsumer), "commit-timeout-record-adder"); + recordAdder.setDaemon(true); + recordAdder.start(); + + try { + // + ConcurrentLinkedQueue> records = new ConcurrentLinkedQueue<>(); + parallelConsumer.poll(recordContexts -> { + recordContexts.forEach(recordContext -> { + log.warn("Processing: {}", recordContext); + records.add(recordContext); + }); }); - }); - // temporarily set the wait timeout - Awaitility.setDefaultTimeout(Duration.ofSeconds(50)); - // - Awaitility.await().untilAsserted(() -> { - assertThat(records).hasSize(10); - }); - - Awaitility.reset(); + // Scope the timeout locally (don't mutate Awaitility's global default — + // that was leaking across tests if the assertion below throws before reset()). + Awaitility.await().atMost(Duration.ofSeconds(50)).untilAsserted(() -> { + assertThat(records).hasSize(10); + }); + } finally { + recordAdder.interrupt(); + parallelConsumer.close(); + } } private void addRecords(MockConsumer mockConsumer) { - for(int i = 0; i < 10; i++) { - mockConsumer.addRecord(new org.apache.kafka.clients.consumer.ConsumerRecord<>(topic, 0, i, "key", "value")); + for (int i = 0; i < 10; i++) { try { + mockConsumer.addRecord(new org.apache.kafka.clients.consumer.ConsumerRecord<>(topic, 0, i, "key", "value")); Thread.sleep(1000L); + } catch (IllegalStateException e) { + // mockConsumer was closed - test has ended, stop quietly + return; } catch (InterruptedException e) { - throw new RuntimeException(e); + Thread.currentThread().interrupt(); + return; } } } diff --git a/parallel-consumer-core/src/test/java/io/confluent/parallelconsumer/MockConsumerTestWithEarlyClose.java b/parallel-consumer-core/src/test/java/io/confluent/parallelconsumer/MockConsumerTestWithEarlyClose.java index e12480913..9813296df 100644 --- a/parallel-consumer-core/src/test/java/io/confluent/parallelconsumer/MockConsumerTestWithEarlyClose.java +++ b/parallel-consumer-core/src/test/java/io/confluent/parallelconsumer/MockConsumerTestWithEarlyClose.java @@ -85,39 +85,49 @@ public synchronized void commitSync(Map offse parallelConsumer.onPartitionsAssigned(of(tp)); mockConsumer.updateBeginningOffsets(startOffsets); - // - new Thread() { - public void run() { - addRecords(mockConsumer); - } - }.start(); + // Daemon thread: must NOT survive past this test method, or when it wakes + // from sleep it'll addRecord() on a closed mockConsumer and throw an + // uncaught exception that PIT attributes to whatever test is running next + // in the same minion JVM. We also interrupt it explicitly in the finally + // block to stop the loop promptly. + Thread recordAdder = new Thread(() -> addRecords(mockConsumer), "early-close-record-adder"); + recordAdder.setDaemon(true); + recordAdder.start(); - // - ConcurrentLinkedQueue> records = new ConcurrentLinkedQueue<>(); - parallelConsumer.poll(recordContexts -> { - recordContexts.forEach(recordContext -> { - log.warn("Processing: {}", recordContext); - records.add(recordContext); - }); - }); try { - Thread.sleep(5000L); - } catch (InterruptedException e) { - throw new RuntimeException(e); - } + // + ConcurrentLinkedQueue> records = new ConcurrentLinkedQueue<>(); + parallelConsumer.poll(recordContexts -> { + recordContexts.forEach(recordContext -> { + log.warn("Processing: {}", recordContext); + records.add(recordContext); + }); + }); + try { + Thread.sleep(5000L); + } catch (InterruptedException e) { + throw new RuntimeException(e); + } - log.info("Trying to close..."); - parallelConsumer.close(); // request close after 5 seconds - log.info("Close successful!"); + log.info("Trying to close..."); + parallelConsumer.close(); // request close after 5 seconds + log.info("Close successful!"); + } finally { + recordAdder.interrupt(); + } } private void addRecords(MockConsumer mockConsumer) { - for(int i = 0; i < 100000; i++) { - mockConsumer.addRecord(new org.apache.kafka.clients.consumer.ConsumerRecord<>(topic, 0, i, "key", "value")); + for (int i = 0; i < 100000; i++) { try { + mockConsumer.addRecord(new org.apache.kafka.clients.consumer.ConsumerRecord<>(topic, 0, i, "key", "value")); Thread.sleep(1000L); + } catch (IllegalStateException e) { + // mockConsumer was closed - test has ended, stop quietly + return; } catch (InterruptedException e) { - throw new RuntimeException(e); + Thread.currentThread().interrupt(); + return; } } } diff --git a/parallel-consumer-core/src/test/java/io/confluent/parallelconsumer/MockConsumerTestWithSaslAuthenticationException.java b/parallel-consumer-core/src/test/java/io/confluent/parallelconsumer/MockConsumerTestWithSaslAuthenticationException.java index c324119f1..5825bab89 100644 --- a/parallel-consumer-core/src/test/java/io/confluent/parallelconsumer/MockConsumerTestWithSaslAuthenticationException.java +++ b/parallel-consumer-core/src/test/java/io/confluent/parallelconsumer/MockConsumerTestWithSaslAuthenticationException.java @@ -12,6 +12,7 @@ import org.apache.kafka.clients.consumer.OffsetResetStrategy; import org.apache.kafka.common.TopicPartition; import org.apache.kafka.common.errors.SaslAuthenticationException; +import org.junit.jupiter.api.AfterEach; import org.junit.jupiter.api.Test; import org.junit.jupiter.api.Timeout; import org.testcontainers.shaded.org.awaitility.Awaitility; @@ -27,14 +28,16 @@ import static pl.tlinkowski.unij.api.UniLists.of; /** - * Test that PC can survive for a temporary SaslAuthenticationException. + * Test that PC can survive a temporary SaslAuthenticationException. * - * In this test, MockConsumer starts to throw SaslAuthenticationException from the beginning until 20 seconds later. + * In this test, MockConsumer throws SaslAuthenticationException from the beginning for 8 seconds, then + * goes back to normal. * - * After that MockConsumer will back to normal. - * - * The saslAuthenticationRetryTimeout is set to 25 seconds. It is expected to resume normal after 20 seconds and will - * be able to consume all produced messages. + * The saslAuthenticationRetryTimeout is set to 30 seconds (generous margin over the 8s outage window) so + * PC has room to recover. The whole test fits comfortably inside PIT's per-test baseline coverage budget + * (which caps around ~80s). The earlier 20s outage + 25s retry version intermittently failed PIT's baseline + * because the total runtime scraped that cap. Test still verifies the same property: PC recovers if the + * retry budget exceeds the outage window. * @author Shilin Wu */ @Slf4j @@ -43,12 +46,25 @@ class MockConsumerTestWithSaslAuthenticationException { private final String topic = MockConsumerTestWithSaslAuthenticationException.class.getSimpleName(); + // Field so @AfterEach can close it. This class doesn't extend + // AbstractParallelEoSStreamProcessorTestBase, so no base-class cleanup runs. + private ParallelEoSStreamProcessor parallelConsumer; + + @AfterEach + void close() { + if (parallelConsumer != null && !parallelConsumer.isClosedOrFailed()) { + parallelConsumer.close(); + } + } + /** * Test that the mock consumer works as expected */ @Test void mockConsumer() { - final AtomicLong failUntil = new AtomicLong(System.currentTimeMillis() + 20000L); + // 8s mock-failure window (was 20s) — keeps total test runtime well within PIT's baseline + // per-test budget while still triggering PC's SASL retry path meaningfully. + final AtomicLong failUntil = new AtomicLong(System.currentTimeMillis() + 8000L); var mockConsumer = new MockConsumer(OffsetResetStrategy.EARLIEST) { @Override public synchronized ConsumerRecords poll(Duration timeout) { @@ -74,9 +90,11 @@ public synchronized void commitSync(Map offse // var options = ParallelConsumerOptions.builder() .consumer(mockConsumer) - .saslAuthenticationRetryTimeout(Duration.ofSeconds(25L)) // set retry to 25 seconds. + // 30s retry budget over an 8s mock-failure window — generous margin (22s) for + // PC's recovery poll even under PIT's slower JVM. + .saslAuthenticationRetryTimeout(Duration.ofSeconds(30L)) .build(); - var parallelConsumer = new ParallelEoSStreamProcessor(options); + parallelConsumer = new ParallelEoSStreamProcessor<>(options); parallelConsumer.subscribe(of(topic)); // MockConsumer is not a correct implementation of the Consumer contract - must manually rebalance++ - or use LongPollingMockConsumer @@ -96,14 +114,13 @@ public synchronized void commitSync(Map offse }); }); - // temporarily set the wait timeout - Awaitility.setDefaultTimeout(Duration.ofSeconds(50)); - // - Awaitility.await().untilAsserted(() -> { + // Scope the timeout locally (don't mutate Awaitility's global default — that was leaking + // across tests under PIT's different ordering, since this class doesn't have base-class + // Awaitility.reset() cleanup). + // 45s: 8s mock-failure window + retry + PIT's JVM slowdown, with headroom. + Awaitility.await().atMost(Duration.ofSeconds(45)).untilAsserted(() -> { assertThat(records).hasSize(3); }); - - Awaitility.reset(); } private void addRecords(MockConsumer mockConsumer) { diff --git a/parallel-consumer-core/src/test/java/io/confluent/parallelconsumer/PCMetricsTest.java b/parallel-consumer-core/src/test/java/io/confluent/parallelconsumer/PCMetricsTest.java index c85cafcef..a93d0c5ff 100644 --- a/parallel-consumer-core/src/test/java/io/confluent/parallelconsumer/PCMetricsTest.java +++ b/parallel-consumer-core/src/test/java/io/confluent/parallelconsumer/PCMetricsTest.java @@ -91,7 +91,9 @@ void metricsRegisterBinding() { }); // metrics show processing is complete - await().untilAsserted(() -> { + // 120s budget (was default 10s) - matches the atMost budgets elsewhere in this method, + // and gives headroom under PIT's instrumented JVM processing 1500 records. + await().atMost(Duration.ofSeconds(120)).untilAsserted(() -> { log.info("counterP0: {}, counterP1: {}", counterP0.get(), counterP1.get()); log.info(registry.getMetersAsString()); assertThat(registeredGaugeValueFor(PCMetricsDef.NUM_PAUSED_PARTITIONS)).isEqualTo(2); @@ -177,7 +179,7 @@ void metricsRegisterBinding() { numberToBlockAt.set(5000); latchPartition0.countDown(); latchPartition1.countDown(); - await().untilAsserted(() -> { + await().atMost(Duration.ofSeconds(120)).untilAsserted(() -> { assertThat(counterP0.get()).isEqualTo(quantityP0); }); diff --git a/parallel-consumer-core/src/test/java/io/confluent/parallelconsumer/ParallelEoSSStreamProcessorRebalancedTest.java b/parallel-consumer-core/src/test/java/io/confluent/parallelconsumer/ParallelEoSSStreamProcessorRebalancedTest.java index 224549f12..2c358e078 100644 --- a/parallel-consumer-core/src/test/java/io/confluent/parallelconsumer/ParallelEoSSStreamProcessorRebalancedTest.java +++ b/parallel-consumer-core/src/test/java/io/confluent/parallelconsumer/ParallelEoSSStreamProcessorRebalancedTest.java @@ -9,7 +9,6 @@ import lombok.SneakyThrows; import lombok.extern.slf4j.Slf4j; import org.apache.kafka.common.TopicPartition; -import org.junit.jupiter.api.AfterEach; import org.junit.jupiter.api.BeforeEach; import org.junit.jupiter.params.ParameterizedTest; import org.junit.jupiter.params.provider.EnumSource; @@ -30,10 +29,6 @@ public void setupAsyncConsumerTestBase() { setupClients(); } - @AfterEach() - public void close() { - } - @ParameterizedTest @EnumSource(CommitMode.class) @SneakyThrows diff --git a/parallel-consumer-core/src/test/java/io/confluent/parallelconsumer/internal/EpochAndRecordsMapRaceTest.java b/parallel-consumer-core/src/test/java/io/confluent/parallelconsumer/internal/EpochAndRecordsMapRaceTest.java new file mode 100644 index 000000000..95270a02f --- /dev/null +++ b/parallel-consumer-core/src/test/java/io/confluent/parallelconsumer/internal/EpochAndRecordsMapRaceTest.java @@ -0,0 +1,139 @@ +package io.confluent.parallelconsumer.internal; + +/*- + * Copyright (C) 2026 Antony Stubbs and contributors + */ + +import io.confluent.parallelconsumer.state.ModelUtils; +import io.confluent.parallelconsumer.state.PartitionStateManager; +import io.confluent.parallelconsumer.state.ShardManager; +import io.confluent.parallelconsumer.state.WorkContainer; +import io.confluent.parallelconsumer.state.WorkManager; +import lombok.extern.slf4j.Slf4j; +import org.apache.kafka.clients.consumer.ConsumerRecord; +import org.apache.kafka.clients.consumer.ConsumerRecords; +import org.apache.kafka.common.TopicPartition; +import org.junit.jupiter.api.BeforeEach; +import org.junit.jupiter.api.Test; +import pl.tlinkowski.unij.api.UniLists; +import pl.tlinkowski.unij.api.UniMaps; + +import java.util.List; + +import static com.google.common.truth.Truth.assertThat; +import static com.google.common.truth.Truth.assertWithMessage; + +/** + * Verifies that the epoch initialization race is handled safely: + *

    + *
  1. poll() returns records for a partition before onPartitionsAssigned() fires
  2. + *
  3. Records are safely skipped (no NPE crash)
  4. + *
  5. onPartitionsAssigned() fires, establishing the epoch and partition state
  6. + *
  7. Next poll creates valid work at the correct epoch
  8. + *
+ * This race is more likely with Kafka 2.x's eager rebalance protocol. + */ +@Slf4j +class EpochAndRecordsMapRaceTest { + + ModelUtils mu = new ModelUtils(); + WorkManager wm; + ShardManager sm; + PartitionStateManager pm; + + String topic = "topic"; + TopicPartition tp = new TopicPartition(topic, 0); + + @BeforeEach + void setup() { + PCModuleTestEnv module = mu.getModule(); + wm = module.workManager(); + sm = wm.getSm(); + pm = wm.getPm(); + // Deliberately NOT calling onPartitionsAssigned — simulating the race + } + + /** + * Core race scenario: poll returns records before onPartitionsAssigned fires. + * Records should be safely skipped (no NPE), and the map should be empty. + */ + @Test + void pollBeforeAssignmentShouldSkipRecordsNotCrash() { + // No onPartitionsAssigned called — epoch map is empty + assertThat(pm.getEpochOfPartition(tp)).isNull(); + + // poll() returns records for the unassigned partition + ConsumerRecords poll = new ConsumerRecords<>(UniMaps.of(tp, UniLists.of( + new ConsumerRecord<>(topic, 0, 0, "key", "value"), + new ConsumerRecord<>(topic, 0, 1, "key", "value") + ))); + + // This should NOT throw NPE — records are skipped + EpochAndRecordsMap recordsMap = new EpochAndRecordsMap<>(poll, pm); + + // Records should NOT be in the map (skipped due to missing epoch) + assertThat(recordsMap.count()).isEqualTo(0); + assertThat(recordsMap.partitions()).isEmpty(); + } + + /** + * Full lifecycle: poll before assignment (skipped) → assignment fires → re-poll succeeds. + * Proves records are recovered after the assignment callback completes. + */ + @Test + void fullLifecycleRecordsRecoveredAfterAssignment() { + // Step 1: poll returns records before onPartitionsAssigned — safely skipped + ConsumerRecords firstPoll = new ConsumerRecords<>(UniMaps.of(tp, UniLists.of( + new ConsumerRecord<>(topic, 0, 0, "key-0", "value"), + new ConsumerRecord<>(topic, 0, 1, "key-1", "value") + ))); + EpochAndRecordsMap firstRecords = new EpochAndRecordsMap<>(firstPoll, pm); + + // Records were skipped — nothing to register + assertThat(firstRecords.count()).isEqualTo(0); + + // Step 2: onPartitionsAssigned fires (late) — epoch and partition state established + wm.onPartitionsAssigned(UniLists.of(tp)); + long epoch = pm.getEpochOfPartition(tp); + assertThat(epoch).isEqualTo(0L); + + // Step 3: Re-poll — Kafka re-delivers the same records (they were never committed) + ConsumerRecords secondPoll = new ConsumerRecords<>(UniMaps.of(tp, UniLists.of( + new ConsumerRecord<>(topic, 0, 0, "key-0", "value"), + new ConsumerRecord<>(topic, 0, 1, "key-1", "value") + ))); + EpochAndRecordsMap secondRecords = new EpochAndRecordsMap<>(secondPoll, pm); + + // Records should now be accepted with the correct epoch + assertThat(secondRecords.count()).isEqualTo(2); + assertThat(secondRecords.records(tp).getEpochOfPartitionAtPoll()).isEqualTo(0L); + + // Step 4: Register and verify work is created + wm.registerWork(secondRecords); + List> work = sm.getWorkIfAvailable(10); + assertWithMessage("Work should be available after assignment + re-poll") + .that(work).hasSize(2); + for (var wc : work) { + assertThat(wc.getEpoch()).isEqualTo(0L); + } + } + + /** + * When epoch is already present (normal case), records are processed normally. + */ + @Test + void normalCaseWithPreExistingEpochIsUnaffected() { + // Normal flow: onPartitionsAssigned first + wm.onPartitionsAssigned(UniLists.of(tp)); + assertThat(pm.getEpochOfPartition(tp)).isEqualTo(0L); + + // poll returns records — should use existing epoch + ConsumerRecords poll = new ConsumerRecords<>(UniMaps.of(tp, UniLists.of( + new ConsumerRecord<>(topic, 0, 0, "key", "value") + ))); + EpochAndRecordsMap recordsMap = new EpochAndRecordsMap<>(poll, pm); + + assertThat(recordsMap.count()).isEqualTo(1); + assertThat(recordsMap.records(tp).getEpochOfPartitionAtPoll()).isEqualTo(0L); + } +} diff --git a/parallel-consumer-core/src/test/java/io/confluent/parallelconsumer/internal/ProducerManagerTest.java b/parallel-consumer-core/src/test/java/io/confluent/parallelconsumer/internal/ProducerManagerTest.java index f716f151a..120167277 100644 --- a/parallel-consumer-core/src/test/java/io/confluent/parallelconsumer/internal/ProducerManagerTest.java +++ b/parallel-consumer-core/src/test/java/io/confluent/parallelconsumer/internal/ProducerManagerTest.java @@ -19,6 +19,8 @@ import org.apache.kafka.clients.consumer.OffsetAndMetadata; import org.apache.kafka.clients.producer.ProducerRecord; import org.apache.kafka.clients.producer.RecordMetadata; +import org.awaitility.Awaitility; +import org.junit.jupiter.api.AfterEach; import org.junit.jupiter.api.BeforeEach; import org.junit.jupiter.api.Tag; import org.junit.jupiter.api.Test; @@ -76,7 +78,17 @@ class ProducerManagerTest { void setup() { setup(ParallelConsumerOptions.builder() .commitMode(PERIODIC_TRANSACTIONAL_PRODUCER) - .commitLockAcquisitionTimeout(ofSeconds(2))); + // 10s (was 2s): 2s is too tight on a CI JVM under PIT instrumentation. + .commitLockAcquisitionTimeout(ofSeconds(10))); + } + + // This class doesn't extend AbstractParallelEoSStreamProcessorTestBase, so + // nothing else resets Awaitility between tests. Not closing pc here on purpose: + // buildModule() overrides close() as a no-op so each test manages its own pc + // lifecycle explicitly (by design, to inspect mid-commit state). + @AfterEach + void tearDown() { + Awaitility.reset(); } private void setup(ParallelConsumerOptions.ParallelConsumerOptionsBuilder optionsBuilder) { @@ -318,7 +330,9 @@ void producedRecordsCantBeInTransactionWithoutItsOffsetDirect() { { var msg = "wait for first record to finish"; log.debug(msg); - await(msg).untilAsserted(() -> assertThat(pc.getWorkMailBox()).hasSize(1)); + // 20s (was default 10s): tight under PIT's instrumented JVM + await(msg).atMost(ofSeconds(20)) + .untilAsserted(() -> assertThat(pc.getWorkMailBox()).hasSize(1)); } // send another record, register the work @@ -338,7 +352,7 @@ void producedRecordsCantBeInTransactionWithoutItsOffsetDirect() { // blocks, as offset 1 is blocked sending and so cannot acquire commit lock var msg = "Ensure expected produce lock is now held by blocked worker thread"; log.debug(msg); - await(msg).untilTrue(blockedOn1); + await(msg).atMost(ofSeconds(20)).untilTrue(blockedOn1); var commitBlocks = new BlockedThreadAsserter(); @@ -354,12 +368,13 @@ void producedRecordsCantBeInTransactionWithoutItsOffsetDirect() { }, () -> { log.debug("Unblocking offset processing offset1Mutex..."); offset1Mutex.countDown(); - }, ofSeconds(10)); + }, ofSeconds(20)); // was 10s; too tight under PIT // - await().untilAsserted(() -> Truth.assertWithMessage("commit should now have unlocked and returned") - .that(commitBlocks.functionHasCompleted()) - .isTrue()); + await().atMost(ofSeconds(20)) + .untilAsserted(() -> Truth.assertWithMessage("commit should now have unlocked and returned") + .that(commitBlocks.functionHasCompleted()) + .isTrue()); final int nextExpectedOffset = 2; // as only first of two work completed From dc346dadd110a4c74e0041b2bc9a0044d0681a3e Mon Sep 17 00:00:00 2001 From: Antony Stubbs Date: Tue, 21 Apr 2026 11:36:20 +1200 Subject: [PATCH 05/12] docs: rename to Kafka Parallel Consumer, update README, add solution docs - Rename from "Confluent Parallel Consumer" to "Kafka Parallel Consumer" - Point issue tracker to astubbs/parallel-consumer, remove Confluent Slack link - Add upstream PR analysis report - Add CHANGELOG 0.6.0.0 section - Fix AGENTS.md CI description, add build scripts and copyright rules - Add Documented Solutions section to AGENTS.md - Add solution docs: copyright header rules, CI security hardening - Remove duplicate Releasing section in README_TEMPLATE.adoc - Add multi-partition partial-skip test to EpochAndRecordsMapRaceTest - Fix deriveCpKafkaImage Javadoc comment Co-Authored-By: Claude Opus 4.6 (1M context) --- .github/workflows/maven.yml | 114 +++- .gitignore | 1 + AGENTS.md | 19 +- CHANGELOG.adoc | 17 + README.adoc | 22 + bin/ci-build.sh | 2 +- .../state/PartitionStateManager.java | 3 +- .../BrokerIntegrationTest.java | 2 +- .../internal/EpochAndRecordsMapRaceTest.java | 32 ++ parallel-consumer-mutiny/pom.xml | 7 +- src/docs/README_TEMPLATE.adoc | 55 +- .../development/upstream-pr-analysis.adoc | 542 ++++++++++++++++++ 12 files changed, 801 insertions(+), 15 deletions(-) create mode 100644 src/docs/development/upstream-pr-analysis.adoc diff --git a/.github/workflows/maven.yml b/.github/workflows/maven.yml index e1e53a0a2..7e09c88fa 100644 --- a/.github/workflows/maven.yml +++ b/.github/workflows/maven.yml @@ -59,7 +59,7 @@ jobs: - name: Download all dependencies run: ./mvnw --batch-mode -Pci dependency:go-offline -Dlicense.skip -DincludeScope=test -U - name: Save Maven cache (rotating key) - if: always() + if: success() uses: actions/cache/save@v4 with: path: ~/.m2/repository @@ -161,7 +161,9 @@ jobs: compare_with_base: true max_increase: 10 - # SpotBugs static analysis -finds null derefs, concurrency issues, resource leaks + # SpotBugs static analysis - finds null derefs, concurrency issues, resource leaks. + # Uses a baseline from the base branch so only NEW bugs introduced by the PR are + # reported. The baseline is generated by the push build and cached. spotbugs: if: github.event_name == 'pull_request' name: "SpotBugs" @@ -174,14 +176,66 @@ jobs: distribution: 'temurin' java-version: '17' cache: 'maven' + - name: Restore SpotBugs baseline + id: spotbugs-baseline + uses: actions/cache/restore@v4 + with: + path: .spotbugs-baseline.xml + key: spotbugs-baseline-${{ github.base_ref }}-${{ hashFiles('**/src/main/**/*.java') }} + restore-keys: | + spotbugs-baseline-${{ github.base_ref }}- + spotbugs-baseline- - name: Compile and run SpotBugs - run: ./mvnw --batch-mode -Pci compile spotbugs:spotbugs -Dlicense.skip -pl parallel-consumer-core -am + run: | + EXCLUDE_ARG="" + if [ -f .spotbugs-baseline.xml ]; then + EXCLUDE_ARG="-Dspotbugs.excludeFilterFile=.spotbugs-baseline.xml" + echo "Using baseline — only new bugs will be reported" + else + echo "No baseline found — reporting all bugs (first run on this base branch)" + fi + ./mvnw --batch-mode -Pci compile spotbugs:spotbugs -Dlicense.skip $EXCLUDE_ARG -pl parallel-consumer-core -am - name: Post SpotBugs annotations on PR if: always() uses: jwgmeligmeyling/spotbugs-github-action@v1.2 with: path: '**/target/spotbugsXml.xml' token: ${{ secrets.GITHUB_TOKEN }} + - name: Post SpotBugs summary to PR + if: always() + uses: actions/github-script@v7 + with: + script: | + const fs = require('fs'); + const { execSync } = require('child_process'); + const hasBaseline = fs.existsSync('.spotbugs-baseline.xml'); + const files = execSync('find . -path "*/target/spotbugsXml.xml" -type f').toString().trim().split('\n').filter(f => f); + let body; + if (files.length === 0) { + body = `## SpotBugs Report\n\nNo SpotBugs XML report found. The compile or analysis step may have failed.\n`; + } else { + let totalBugs = 0; + for (const f of files) { + const content = fs.readFileSync(f, 'utf8'); + const matches = content.match(/ c.body.includes('SpotBugs Report')); + if (existing) { + await github.rest.issues.updateComment({ owner: context.repo.owner, repo: context.repo.repo, comment_id: existing.id, body }); + } else { + await github.rest.issues.createComment({ owner: context.repo.owner, repo: context.repo.repo, issue_number: context.issue.number, body }); + } # Dependency vulnerability scanning -GitHub's own dependency review dependency-scan: @@ -224,10 +278,10 @@ jobs: restore-keys: | pit-history-${{ github.base_ref }}- pit-history- - # -DjvmArgs=-Xmx1g gives minion JVMs 1GB heap (default too small, - # causes MEMORY_ERROR / UNKNOWN_ERROR crashes). - # -DargLine= strips Jacoco's javaagent from minions (PIT has - # its own coverage). Separate from jvmArgs. + # -Djacoco.skip=true prevents Jacoco from injecting its agent into PIT + # minions (PIT has its own coverage; Jacoco's agent + # causes UNKNOWN_ERROR crashes). + # -DjvmArgs=-Xmx1g gives minion JVMs 1GB heap (default too small). # -DtimeoutConstant/Factor generous per-mutation timeout for CI hardware # -Dthreads=1 I/O-bound tests; parallelism adds contention not speed # -DwithHistory incremental: skips re-mutating unchanged code. @@ -239,10 +293,10 @@ jobs: run: | ./mvnw --batch-mode -Pci test-compile org.pitest:pitest-maven:mutationCoverage \ -Dlicense.skip \ + -Djacoco.skip=true \ -DtargetClasses="io.confluent.parallelconsumer.internal.*" \ -DtargetTests="io.confluent.parallelconsumer.*" \ -DjvmArgs=-Xmx1g \ - -DargLine= \ -DtimeoutConstant=30000 -DtimeoutFactor=3.0 \ -Dthreads=1 \ -DwithHistory \ @@ -296,6 +350,50 @@ jobs: # ── Push Builds (master) ─────────────────────────────────────────────── + # Generate SpotBugs baseline for PR comparisons. Runs on every push to + # master so PR builds can exclude known bugs and only report new ones. + spotbugs-baseline: + if: github.event_name == 'push' + name: "SpotBugs Baseline" + runs-on: ubuntu-latest + timeout-minutes: 10 + steps: + - uses: actions/checkout@v6 + - uses: actions/setup-java@v5 + with: + distribution: 'temurin' + java-version: '17' + cache: 'maven' + - name: Compile and generate SpotBugs XML + run: ./mvnw --batch-mode -Pci compile spotbugs:spotbugs -Dlicense.skip -pl parallel-consumer-core -am + - name: Convert SpotBugs results to exclude filter + run: | + # Transform the SpotBugs XML into an exclude filter that can be + # passed to future runs via -Dspotbugs.excludeFilterFile. Each + # BugInstance becomes a Match element keyed on class + method + bug type. + python3 -c " + import xml.etree.ElementTree as ET, sys + tree = ET.parse('parallel-consumer-core/target/spotbugsXml.xml') + root = tree.getroot() + out = ET.Element('FindBugsFilter') + for bug in root.findall('.//BugInstance'): + match = ET.SubElement(out, 'Match') + cls = bug.find('Class') + method = bug.find('Method') + if cls is not None: + ET.SubElement(match, 'Class', name=cls.get('classname', '')) + if method is not None: + ET.SubElement(match, 'Method', name=method.get('name', '')) + ET.SubElement(match, 'Bug', pattern=bug.get('type', '')) + ET.ElementTree(out).write('.spotbugs-baseline.xml', xml_declaration=True, encoding='utf-8') + print(f'Baseline: {len(root.findall(\".//BugInstance\"))} bugs excluded') + " + - name: Cache SpotBugs baseline + uses: actions/cache/save@v4 + with: + path: .spotbugs-baseline.xml + key: spotbugs-baseline-${{ github.ref_name }}-${{ hashFiles('**/src/main/**/*.java') }} + # Full Kafka version matrix as safety net build: if: github.event_name == 'push' diff --git a/.gitignore b/.gitignore index e7ef34e89..4bf56e9af 100644 --- a/.gitignore +++ b/.gitignore @@ -1,6 +1,7 @@ /mk-include /.mk-include-timestamp .DS_Store +CLAUDE.md # Compiled class file *.class diff --git a/AGENTS.md b/AGENTS.md index 123dc5401..364013855 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -20,11 +20,20 @@ This is a community-maintained fork of `confluentinc/parallel-consumer` (the ups # Quick local build (compile + unit tests) bin/build.sh -# Full CI build with all tests (unit + integration) +# Unit tests only (no Docker needed) +bin/ci-unit-test.sh + +# Integration tests only (requires Docker for TestContainers) +bin/ci-integration-test.sh + +# Full CI build with all tests against a Kafka version matrix (used by push-to-master CI) bin/ci-build.sh # Full CI build against a specific Kafka version bin/ci-build.sh 3.9.1 + +# Performance tests only (requires substantial hardware) +bin/performance-test.sh ``` ## Module Structure @@ -59,11 +68,17 @@ bin/ci-build.sh 3.9.1 - **Lombok**: Used extensively (builders, getters, logging). IntelliJ Lombok plugin required. - **EditorConfig**: Enforced via `.editorconfig` - 4-space indent for Java, 120 char line length. - **License headers**: Managed by `license-maven-plugin` (Mycila). Use `-Dlicense.skip` locally to skip checks. +- **Copyright rules for this fork**: + - Do not change copyright headers on existing files unless the file has substantive code changes in the same commit + - Do not bump copyright years as an incidental or standalone change + - The `NOTICE` file at repo root contains the legal attribution structure for the fork + - New files written entirely for the fork should not claim Confluent copyright + - Always pass `-Dlicense.skip` to Maven to prevent the license plugin from auto-bumping years - **Google Truth**: Used for test assertions alongside JUnit 5 and Mockito. ## CI -- **`.github/workflows/maven.yml`** — Build and test on every push/PR. Push uses default Kafka version, PRs run the full version matrix. Includes concurrency cancellation. +- **`.github/workflows/maven.yml`** — Build and test on every push/PR. PRs run unit, integration, and performance tests in parallel (via `bin/ci-unit-test.sh`, `bin/ci-integration-test.sh`, `bin/performance-test.sh`). Push to master runs the full Kafka version matrix (3.1.0, 3.7.0, 3.9.1 + experimental 4.x) via `bin/ci-build.sh`. Includes concurrency cancellation, SpotBugs, duplicate detection, mutation testing (PIT), and dependency vulnerability scanning on PRs. - **`.github/workflows/publish.yml`** — Publishes to Maven Central on every push to `master`. The pom.xml version is the source of truth: `-SNAPSHOT` versions deploy as snapshots, non-snapshot versions deploy as full releases (and create a git tag + GitHub release). - **`.semaphore/`** — Legacy Confluent internal CI/release pipelines, retained but inactive on the fork. diff --git a/CHANGELOG.adoc b/CHANGELOG.adoc index 543a21f07..605f2b0cb 100644 --- a/CHANGELOG.adoc +++ b/CHANGELOG.adoc @@ -14,6 +14,23 @@ ifndef::github_name[] toc::[] endif::[] +== 0.6.0.0 + +=== Breaking + +* Maven coordinates rebranded from `io.confluent.parallelconsumer` to `io.github.astubbs.parallelconsumer`. Update your dependency declarations. + +=== Fixes + +* fix: null epoch race condition in EpochAndRecordsMap - poll() returning records before onPartitionsAssigned() no longer causes NPE; records are safely skipped and re-delivered +* fix: test pollution from leaked threads and Awaitility global state across test classes + +=== Improvements + +* ci: modernized GitHub Actions CI with parallel test suites (unit, integration, performance), SpotBugs, duplicate detection, mutation testing (PIT), and dependency vulnerability scanning +* ci: added publish workflow for automated Maven Central deployment +* docs: updated README and added upstream PR analysis report + == 0.5.3.4 === Fixes diff --git a/README.adoc b/README.adoc index 4c6fa26d6..dafd23f29 100644 --- a/README.adoc +++ b/README.adoc @@ -1359,6 +1359,28 @@ CI build against a specific Kafka version:: The GitHub Actions CI workflow uses `bin/ci-build.sh`, so running it locally reproduces the CI environment. +=== Releasing + +The `pom.xml` version is the source of truth for publishing — there is no `maven-release-plugin` step. + +On every push to `master`, `.github/workflows/publish.yml` deploys to Maven Central: + +* If the version ends in `-SNAPSHOT` → publishes a snapshot +* If the version does not end in `-SNAPSHOT` → publishes a full release, creates a `v` git tag, and creates a GitHub release + +To cut a release: + +. Open a PR removing `-SNAPSHOT` from `` in the parent `pom.xml` (e.g. `0.6.0.0-SNAPSHOT` → `0.6.0.0`) +. Merge it to master → CI publishes the release +. Open another PR bumping to the next snapshot (e.g. `0.6.0.1-SNAPSHOT`) and merge + +Required GitHub repository secrets: + +* `MAVEN_CENTRAL_USERNAME` — Sonatype Central Portal token username +* `MAVEN_CENTRAL_PASSWORD` — Sonatype Central Portal token password +* `MAVEN_GPG_PRIVATE_KEY` — Armored GPG private key for signing artifacts +* `MAVEN_GPG_PASSPHRASE` — Passphrase for the GPG key + === Testing The project has good automated test coverage, of all features. diff --git a/bin/ci-build.sh b/bin/ci-build.sh index 705c2a823..acf353f6a 100755 --- a/bin/ci-build.sh +++ b/bin/ci-build.sh @@ -21,7 +21,7 @@ fi ./mvnw --batch-mode \ -Pci \ clean verify \ - $KAFKA_VERSION_ARG \ + ${KAFKA_VERSION_ARG:+"$KAFKA_VERSION_ARG"} \ -Dlicense.skip \ -Dexcluded.groups=performance \ -Dsurefire.rerunFailingTestsCount=2 diff --git a/parallel-consumer-core/src/main/java/io/confluent/parallelconsumer/state/PartitionStateManager.java b/parallel-consumer-core/src/main/java/io/confluent/parallelconsumer/state/PartitionStateManager.java index fd47dd552..c2c55cb5d 100644 --- a/parallel-consumer-core/src/main/java/io/confluent/parallelconsumer/state/PartitionStateManager.java +++ b/parallel-consumer-core/src/main/java/io/confluent/parallelconsumer/state/PartitionStateManager.java @@ -248,12 +248,13 @@ private void resetOffsetMapAndRemoveWork(Collection allRemovedPa } /** - * @return the current epoch of the partition + * @return the current epoch of the partition, or null if not yet assigned */ public Long getEpochOfPartition(TopicPartition partition) { return partitionsAssignmentEpochs.get(partition); } + private void incrementPartitionAssignmentEpoch(final Collection partitions) { for (final TopicPartition partition : partitions) { Long epoch = partitionsAssignmentEpochs.getOrDefault(partition, PartitionState.KAFKA_OFFSET_ABSENCE); diff --git a/parallel-consumer-core/src/test-integration/java/io/confluent/parallelconsumer/integrationTests/BrokerIntegrationTest.java b/parallel-consumer-core/src/test-integration/java/io/confluent/parallelconsumer/integrationTests/BrokerIntegrationTest.java index f244e6e57..56c84017a 100644 --- a/parallel-consumer-core/src/test-integration/java/io/confluent/parallelconsumer/integrationTests/BrokerIntegrationTest.java +++ b/parallel-consumer-core/src/test-integration/java/io/confluent/parallelconsumer/integrationTests/BrokerIntegrationTest.java @@ -60,7 +60,7 @@ public abstract class BrokerIntegrationTest { * the broker under test matches the client. The CI matrix overrides {@code kafka.version} * via {@code -Dkafka.version=X.Y.Z}, so we read it at runtime from the client jar. *

- * Mapping: CP major = AK major + 4 (e.g., AK 2.8 → CP 6.2, AK 3.9 → CP 7.9). + * Mapping: CP major = AK major + 4 (e.g., AK 3.1 → CP 7.1, AK 3.9 → CP 7.9). */ static String deriveCpKafkaImage() { String akVersion = org.apache.kafka.common.utils.AppInfoParser.getVersion(); diff --git a/parallel-consumer-core/src/test/java/io/confluent/parallelconsumer/internal/EpochAndRecordsMapRaceTest.java b/parallel-consumer-core/src/test/java/io/confluent/parallelconsumer/internal/EpochAndRecordsMapRaceTest.java index 95270a02f..703f10321 100644 --- a/parallel-consumer-core/src/test/java/io/confluent/parallelconsumer/internal/EpochAndRecordsMapRaceTest.java +++ b/parallel-consumer-core/src/test/java/io/confluent/parallelconsumer/internal/EpochAndRecordsMapRaceTest.java @@ -118,6 +118,38 @@ void fullLifecycleRecordsRecoveredAfterAssignment() { } } + /** + * Multi-partition poll where only some partitions have an epoch assigned. + * The assigned partition's records should be accepted while the unassigned + * partition's records are skipped in the same map construction call. + */ + @Test + void mixedPartitionPollSkipsOnlyUnassignedPartitions() { + TopicPartition tp1 = new TopicPartition(topic, 1); + + // Assign only tp (partition 0), NOT tp1 (partition 1) + wm.onPartitionsAssigned(UniLists.of(tp)); + assertThat(pm.getEpochOfPartition(tp)).isEqualTo(0L); + assertThat(pm.getEpochOfPartition(tp1)).isNull(); + + // poll returns records for both partitions + ConsumerRecords poll = new ConsumerRecords<>(UniMaps.of( + tp, UniLists.of( + new ConsumerRecord<>(topic, 0, 0, "key-0", "value") + ), + tp1, UniLists.of( + new ConsumerRecord<>(topic, 1, 0, "key-1", "value"), + new ConsumerRecord<>(topic, 1, 1, "key-1b", "value") + ) + )); + EpochAndRecordsMap recordsMap = new EpochAndRecordsMap<>(poll, pm); + + // Only tp (assigned) should be in the map; tp1 (unassigned) should be skipped + assertThat(recordsMap.partitions()).containsExactly(tp); + assertThat(recordsMap.count()).isEqualTo(1); + assertThat(recordsMap.records(tp).getEpochOfPartitionAtPoll()).isEqualTo(0L); + } + /** * When epoch is already present (normal case), records are processed normally. */ diff --git a/parallel-consumer-mutiny/pom.xml b/parallel-consumer-mutiny/pom.xml index 1434901a9..06d684f38 100644 --- a/parallel-consumer-mutiny/pom.xml +++ b/parallel-consumer-mutiny/pom.xml @@ -1,7 +1,7 @@ @@ -15,6 +15,11 @@ Kafka Parallel Consumer SmallRye Mutiny parallel-consumer-mutiny + + + 9 + + io.github.astubbs.parallelconsumer diff --git a/src/docs/README_TEMPLATE.adoc b/src/docs/README_TEMPLATE.adoc index 73812fec1..1b779a204 100644 --- a/src/docs/README_TEMPLATE.adoc +++ b/src/docs/README_TEMPLATE.adoc @@ -281,7 +281,7 @@ The user just has to provide a function to extract from the message the HTTP cal === Illustrative Performance Example -.(see link:./parallel-consumer-core/src/test-integration/java/io/confluent/parallelconsumer/integrationTests/VolumeTests.java[VolumeTests.java]) +.(see link:./parallel-consumer-core/src/test-integration/java/io/confluent/parallelconsumer/integrationTests/VeryLargeMessageVolumeTest.java[VeryLargeMessageVolumeTest.java]) These performance comparison results below, even though are based on real performance measurement results, are for illustrative purposes. To see how the performance of the tool is related to instance counts, partition counts, key distribution and how it would relate to the vanilla client. Actual results will vary wildly depending upon the setup being deployed into. @@ -1094,6 +1094,59 @@ Note:: See https://github.com/confluentinc/parallel-consumer/issues/162[issue #162] and this https://stackoverflow.com/questions/4786881/why-is-test-jar-dependency-required-for-mvn-compile[Stack Overflow question]. +=== Build Scripts + +Helper scripts are in the `bin/` directory: + +[qanda] +Quick local build (compile + unit tests):: +`bin/build.sh` + +Unit tests only (no Docker needed):: +`bin/ci-unit-test.sh` + +Integration tests only (requires Docker for TestContainers):: +`bin/ci-integration-test.sh` + +Full CI build with all tests (unit + integration):: +`bin/ci-build.sh` + +CI build against a specific Kafka version:: +`bin/ci-build.sh 3.9.1` + +Performance test suite (also `bin/performance-test.cmd` on Windows):: +`bin/performance-test.sh` + +All `ci-*` scripts use the `-Pci` Maven profile which enables license checking and disables parallel test execution. The GitHub Actions CI workflow uses these scripts, so running them locally reproduces the CI environment. + +=== Performance Tests + +Tests tagged `@Tag("performance")` are excluded from the regular CI build because they need substantial hardware. They run on a dedicated self-hosted runner via `.github/workflows/performance.yml` (manual trigger or weekly schedule). + +To run the performance suite locally, use `bin/performance-test.sh`. To set up your own self-hosted runner for these tests, see link:./docs/SELF_HOSTED_RUNNER.md[docs/SELF_HOSTED_RUNNER.md]. + +=== Releasing + +The `pom.xml` version is the source of truth for publishing — there is no `maven-release-plugin` step. + +On every push to `master`, `.github/workflows/publish.yml` deploys to Maven Central: + +* If the version ends in `-SNAPSHOT` → publishes a snapshot +* If the version does not end in `-SNAPSHOT` → publishes a full release, creates a `v` git tag, and creates a GitHub release + +To cut a release: + +. Open a PR removing `-SNAPSHOT` from `` in the parent `pom.xml` (e.g. `0.6.0.0-SNAPSHOT` → `0.6.0.0`) +. Merge it to master → CI publishes the release +. Open another PR bumping to the next snapshot (e.g. `0.6.0.1-SNAPSHOT`) and merge + +Required GitHub repository secrets: + +* `MAVEN_CENTRAL_USERNAME` — Sonatype Central Portal token username +* `MAVEN_CENTRAL_PASSWORD` — Sonatype Central Portal token password +* `MAVEN_GPG_PRIVATE_KEY` — Armored GPG private key for signing artifacts +* `MAVEN_GPG_PASSPHRASE` — Passphrase for the GPG key + === Testing The project has good automated test coverage, of all features. diff --git a/src/docs/development/upstream-pr-analysis.adoc b/src/docs/development/upstream-pr-analysis.adoc new file mode 100644 index 000000000..95b45c13c --- /dev/null +++ b/src/docs/development/upstream-pr-analysis.adoc @@ -0,0 +1,542 @@ += Upstream PR & Issue Analysis for parallel-consumer +:toc: macro +:toclevels: 3 +:numbered: 1 + +toc::[] + +== Part 1: Open Upstream PRs (confluentinc/parallel-consumer) + +The upstream repository has accumulated *19 open PRs* (565 closed). The repo appears sparsely maintained -- some PRs from early 2025 remain unreviewed. The local fork (`astubbs/parallel-consumer`) is pinned to Kafka 3.1.0, JUnit 5.8.2, with a Java 17 WIP branch in progress. + +=== Summary Table (19 PRs, most to least important) + +[cols="1,1,4,2,3", options="header"] +|=== +|Rank |PR |Title |Group |Why it matters + +|1 |#893 |Fix offset reset on partition reassignment |Correctness bug |Data-loss / reset risk during rebalance +|2 |#909 |Replace stale container when adding work container |Correctness bug |Record drops during rebalance +|3 |#915 |Feature: select batch construction strategy |Feature (closes #266) |Unlocks throughput & ordering flexibility +|4 |#908 |Feature: Support Virtual Threads (JDK 21+) |Feature (closes #896) |Major platform modernisation +|5 |#905 |Metric: max queued records per shard |Observability |Diagnose hot-key bottlenecks +|6 |#866 |Update Kafka to v7 (BREAKING) |Major dep bump |Keeps library on supported Kafka +|7 |#867 |Update vertx to v5 (BREAKING) |Major dep bump |Keeps vertx module alive +|8 |#901 |Fix failing licence check + gitignore |Build/tooling |Unblocks CI +|9 |#877 |chore: update repo by service bot |Build/tooling |Bot-maintained config +|10 |#851 |postgres 42.7.2 (security) |Security CVE |Fork should not ship vulnerable deps +|11 |#913 |assertj 3.27.7 (security) |Security CVE |Test-only, low severity +|12 |#914 |logback 1.5.25 (security, DRAFT) |Security CVE |Logging stack CVE +|13 |#855 |wiremock-jre8 to v3 (BREAKING) |Test dep |Test-only, unblocks other updates +|14 |#899 |JUnit platform 1.10.2 to 6.0.0 |Test dep |Huge jump, needs care +|15 |#900 |testcontainers 1.19.8 to 1.21.3 |Test dep |Routine bump +|16 |#897 |vertx 4.5.7 to 5.0.5 |Dep (overlaps #867) |Duplicate of #867 +|17 |#869 |threeten-extra 1.7.2 to 1.8.0 |Dep |Trivial +|18 |#898 |maven-gpg-plugin 3.1.0 to 3.2.8 |Build plugin |Trivial +|19 |#854 |renovate minor+patch batch |Dep batch |Routine +|=== + +=== Group A -- Correctness fixes (merge first) + +These address real bugs observed in production and should be the highest priority to cherry-pick. + +==== #893 -- Accurate committed offset on partition assignment + +* *Author:* Martyn Ye (sangreal), opened *Oct 2025* +* *Problem:* Race condition in `PartitionState.getOffsetToCommit()`. Between offset calculation and commit, `incompletes` may drain, so a higher offset gets committed than intended. After rebalance, the consumer tries to fetch a non-existent offset and *triggers `auto.offset.reset`* (data loss or replay). +* *Fix:* Call `getOffsetToCommit()` once per commit cycle, track `lastProcessedOffset`, add synchronisation. +* *Status:* Has unit tests, approved by Roman Kolesnev, run privately for >1 week with zero recurrence. Awaiting codeowner merge. +* *Verdict:* Should merge immediately. High-impact, low-risk. + +==== #909 -- Replace stale work container on rebalance + +* *Author:* cserspring, opened *Jan 2026* +* *Problem:* Broker-poll thread and control thread race during rebalance. Control thread adds a container at an outdated epoch _after_ `removeStaleContainers()` runs. The stale entry then blocks subsequent valid additions, *dropping records*. +* *Fix:* `addWorkContainer` replaces stale entries rather than rejecting new ones. +* *Status:* No visible tests yet, no review comments. Needs more scrutiny than #893 but the root-cause analysis is compelling. +* *Verdict:* High priority, but request tests before merging. + +=== Group B -- Features (real user value) + +==== #915 -- Batch construction strategy (closes #266) + +* *Author:* Devingryu, opened *Mar 2026* (most recent PR) +* Adds `batchStrategy` enum with three modes: +** `SEQUENTIAL` (default, current behaviour) -- one record per shard per cycle +** `BATCH_MULTIPLEX` -- multiple records from a shard, batches span shards +** `BATCH_BY_SHARD` -- multi-record batches, each batch from a single shard +* Closes a long-standing feature request (#266). Touches `ProcessingShard` work-selection and batch construction. +* Author explicitly flags open design questions (UNORDERED semantics, stage separation) and notes the repo looks unmaintained. +* *Verdict:* Most valuable user-facing feature in the queue. Needs an architectural review before merge. + +==== #908 -- Virtual Threads support (closes #896) + +* *Author:* Devingryu, opened *Jan 2026* +* Adds `useVirtualThreads` option for JDK 21+, migrates `synchronized` to `ReentrantLock` to avoid carrier-thread pinning, generalises `setupWorkerPool` return to `ExecutorService`. +* Tests skip on JDK 17 via `Assumptions`. +* *Caveats:* Backwards-compat risk for subclasses overriding executor type; queue-size querying uses `WorkerManager.numberRecordsOutForProcessing` as a proxy. +* *Verdict:* Strategically important modernisation. Sequence behind Java 17 branch. + +==== #905 -- Max-queued-records-per-shard metric + +* *Author:* flashmouse, opened *Jan 2026* +* Adds monitoring metric; useful with `orderType.KEY` to detect *hot-key shards*. +* Small, self-contained (2 commits). No review comments. +* *Verdict:* Easy merge, good observability win. + +=== Group C -- Major (breaking) dependency bumps + +[cols="1,3,4", options="header"] +|=== +|PR |Bump |Note + +|#866 |Kafka to v7 |Upstream Confluent Kafka major bump. Essential for staying current but will cascade test/API changes. +|#867 |vertx to v5 |Breaking for the `parallel-consumer-vertx` module. +|#897 |vertx 4.5.7 to 5.0.5 |Duplicate of #867; close one. +|#855 |wiremock-jre8 to v3 |Test-only, blocks other test-dep updates. +|#899 |junit platform 1.10.2 to *6.0.0* |Dependabot misread -- 6.0.0 likely incorrect. Close or pin. +|=== + +*Verdict:* Tackle #866 (Kafka v7) first. Consolidate #867/#897. Close #899. + +=== Group D -- Security CVEs + +[cols="1,2,2,2", options="header"] +|=== +|PR |Dep |Scope |Priority + +|#851 |postgresql 42.7.2 |Runtime (JDBC sample) |Medium +|#913 |assertj 3.27.7 |Test only |Low +|#914 |logback 1.5.25 (DRAFT) |Runtime logging |Medium +|=== + +*Verdict:* All renovate-generated and low-effort. Merge in a single batch once CI is green. + +=== Group E -- Routine / housekeeping + +* *#854* -- renovate minor+patch dependency batch (trivial) +* *#869* -- threeten-extra 1.7.2 to 1.8.0 (trivial) +* *#898* -- maven-gpg-plugin 3.1.0 to 3.2.8 (trivial) +* *#900* -- testcontainers 1.19.8 to 1.21.3 (routine test dep) +* *#877* -- service-bot repo config update +* *#901* -- fix failing license check + gitignore (unblocks CI -- promote to Group A priority) + +=== Recommended merge order for the local fork + +. *#901* -- unbreak the licence-check build (prerequisite) +. *#893* -- offset reset fix (correctness, approved upstream) +. *#909* -- stale container fix (correctness, request tests first) +. *#905* -- hot-key observability metric (cheap win) +. *Security batch:* #851, #913, #914 +. *Routine deps batch:* #854, #869, #898, #900, #877 +. *#915* -- batch strategy feature (needs architectural review) +. *#866* -- Kafka v7 (scheduled breaking upgrade) +. *#908* -- Virtual Threads (after Java 17 branch lands; requires JDK 21+ CI) +. *#867* (and close duplicate #897) -- vertx v5 +. *#855* -- wiremock v3 (after other test deps stabilise) +. *#899* -- close as incorrect; re-open with a sensible target version + +== Part 2: astubbs Closed-Unmerged PRs (Revival Candidates) + +A GitHub search of `confluentinc/parallel-consumer` for `author:astubbs is:closed is:unmerged` returns *53 PRs*. A striking pattern: *~40 of them were closed on the same day, 2023-06-15*, which looks like an administrative sweep (coinciding with astubbs leaving Confluent) rather than a substantive rejection of the ideas. The PRs represent a trove of unfinished but intentional design work. + +=== Group A -- Major features (top revival candidates) + +[cols="1,1,4,1,3", options="header"] +|=== +|Rank |PR |Title |Closes |Notes + +|A1 |#464 |Feature: Health check and metrics |#71 |Observability gap; still needed today. Consolidate with upstream #905. +|A2 |#366 |Dead Letter Queue (DLQ) as option for retry expiration |-- |Most-requested missing feature; no clean upstream alternative. +|A3 |#441 |Queue priority |#50 |Closes long-standing issue; enables priority-aware shards. +|A4 |#316 |Combine queues from different partitions/topics with KEY ordering |#314 |Powerful composition primitive. +|A5 |#300 |Loom POC (Virtual Threads) |#299 |Predates upstream #908. Compare approaches before redoing. +|A6 |#473 |Producer sends to least-loaded broker |-- |Throughput optimisation. +|A7 |#390 |Streams integration |-- |Expands reach to Kafka Streams users. +|A8 |#346 |Thread-safe user API exposure of ALL Consumer APIs |#520 |Safety-critical API surface expansion. +|A9 |#353 |Broker status informer |#185 |Surface broker health to user code. +|A10 |#291 |Explicit terminal and retry exceptions + poison pills |#242 |Pairs with #268; cleaner error semantics. +|A11 |#197 |Retry count available in header |-- |Small, widely useful. +|A12 |#22 |Dynamic concurrency control (WIP) |#21 |Adaptive parallelism. +|=== + +=== Group B -- Performance / correctness work + +[cols="1,1,4,3", options="header"] +|=== +|Rank |PR |Title |Notes + +|B1 |#530 |Caching of shard management counts (shard scanning O(n) to cached) |Concrete hot-path perf fix. High ROI. +|B2 |#237 |Shard starvation fix |#236 -- real user-visible bug. +|B3 |#356 |Faster record producing: remove Future blocks, async result processing |#29 -- throughput-critical. +|B4 |#408 |Runlength v3 encoding using Longs instead of Integers |Extends offset encoder capacity. +|B5 |#46 |Continuous encoding with per-partition realtime offset-space backpressure (WIP!) |Aspirational; pairs with #408. +|B6 |#66 |Retry system improvements |Iterative correctness work. +|B7 |#140 |Don't allow use of core methods from vert.x or reactor module |#99 -- prevents footguns. +|=== + +=== Group C -- Architectural refactors (keep ideas, not code) + +Use as design references. Branches are far too stale to apply directly. + +[cols="1,1,4,3", options="header"] +|=== +|Rank |PR |Title |Notes + +|C1 |#488 |Refactor God class to components |Long-standing maintainability pain. +|C2 |#270 |Shared-nothing architecture -- Partition Events |#200 -- scalability rearchitecture. +|C3 |#524 |Use Actor for commit commands |Pairs with C4; cleaner concurrency. +|C4 |#325 |New IPC system using lightweight Lambda actor queue |Underpins many other improvements. +|C5 |#271 |Major package restructure |Blast radius too large on its own; fold into C1/C2. +|C6 |#303 |Extend Consumer and Function for more cohesive API |Public-API evolution. +|C7 |#45 |Direct work loading, direct result processing (WIP!) |POC; informs C2/C4. +|C8 |#405 |Remove static state |Ties into B-group testability work. +|=== + +=== Group D -- Testing & chaos infrastructure + +Reviving these pays dividends on every other revival in Groups A-C. + +[cols="1,1,4,3", options="header"] +|=== +|Rank |PR |Title |Notes + +|D1 |#345 |Broker disconnect testing + ChaosBroker |#203 -- essential for validating correctness fixes. +|D2 |#126 |Remove static state manipulation that tests use |Unlocks parallel test execution. +|D3 |#143 |Parallel tests in CI |Dependent on D2. +|D4 |#106 |Client factory config: prevent client reuse, safer config validation |Test hygiene. +|D5 |#492 |Fix incorrect assumption in main PC unit test |Small, easy. +|D6 |#494 |Re-enable disabled legacy tests in ParallelEoSStreamProcessorTest |Coverage recovery. +|D7 |#496 |Implement the empty tests |Fill test placeholders. +|D8 |#73 |Enable integration tests on GH Workflow CI |Verify if merged-equivalent elsewhere. +|D9 |#75, #74 |Run CI on Java 8 / Java 9 |Obsolete; close permanently -- covered by Java 17 effort. +|=== + +=== Group E -- UX, logging, docs + +[cols="1,1,4,3", options="header"] +|=== +|Rank |PR |Title |Notes + +|E1 |#268 |Explicit retriable exception for cleaner ERROR logging |#242 -- superseded by #291 (A10); pick one. +|E2 |#139 |Warning when no SLF4J logger detected |Small UX polish. +|E3 |#508 |'Back Pressure' notes and Javadoc |Valuable docs addition. +|E4 |#81 |Parallel join technique documentation |Design docs. +|E5 |#506 |Fix chart links (docs) |Trivial. +|=== + +=== Group F -- Vertx / Reactor module work + +[cols="1,1,4,3", options="header"] +|=== +|Rank |PR |Title |Notes + +|F1 |#204 |Run user functions on a Vert.x vertical instead of Java ThreadPool |Re-evaluate in light of upstream #867 / Virtual Threads. +|F2 |#181 |Convenience methods for Vertx calls |Small API addition. +|F3 |#179 |Scatter-gather parallel webservice requests (POC) |Useful pattern for docs/examples. +|=== + +=== Group G -- Build, tooling, plumbing (mostly obsolete) + +[cols="1,3,2", options="header"] +|=== +|PR |Title |Status + +|#442 |Setup mvnw |Consider re-doing -- mvnw is still useful. +|#91 |Upgrade Apache Kafka version |Obsolete (see upstream #866). +|#220 |Docker naming, fixes and improvements |Niche. +|#10 |Extract Interface from implementation |Likely already done. +|#23 |Playing around with interface naming (WIP!) |Skip. +|#6 |Fix plugin versions / surefire bug |Obsolete. +|#2 |Fix groupid in poms |Obsolete. +|#7 |Store offset map of incomplete messages (Draft) |Likely superseded by current offset encoder. +|=== + +=== Top 10 revival candidates (flat ranking) + +. *#530* -- shard-count caching perf fix (small, high-impact, hot path) +. *#464* -- health check and metrics (still-missing observability) +. *#366* -- dead letter queue option +. *#237* -- shard starvation bug fix +. *#345* -- ChaosBroker testing infrastructure (amplifies every other fix) +. *#441* -- queue priority +. *#316* -- combine queues across partitions with KEY ordering +. *#291* -- terminal/retry exception types + poison pill handling +. *#300* -- Loom / Virtual Threads POC (compare directly to upstream #908) +. *#197* -- retry count header + +=== Cross-references to Part 1 + +* *#300 (Loom POC)* <-> upstream *#908 (Virtual Threads)* -- astubbs POC may offer earlier-design insight. +* *#464 (Health check + metrics)* <-> upstream *#905 (max-queued-per-shard metric)* -- merge philosophies. +* *#91 (Kafka upgrade)* <-> upstream *#866 (Kafka v7)* -- close #91 as obsolete. +* *#204 (Vertx vertical)* <-> upstream *#867 (vertx v5)* -- re-scope against v5 APIs. +* *#237 / #345* <-> upstream *#893 / #909* (rebalance correctness) -- astubbs chaos testing would help validate upstream fixes. + +== Part 3: Upstream Issues Analysis + +`confluentinc/parallel-consumer` has *~61 open issues* (dating back to Nov 2020) and *8 issues formally closed as "Not Planned"*. The issues were NOT bulk-closed in the Jun 2023 sweep -- only the PRs were. Most feature requests and bug reports are still sitting open with no maintainer triage since mid-2023. + +Issue #907 ("Is the project still actively maintained?") filed Jan 2026 summarises the community frustration. + +=== Category 1: Stability & timeout bugs (work on FIRST) + +Production-blocking issues reported by multiple users. Highest urgency. + +[cols="1,4,2,1,1,2", options="header"] +|=== +|Issue |Title |Author |Date |Labels |Cross-ref PRs + +|#803 |Transactional Producer timeout getting commit lock |PatrickChauveau |2024-06 |verified bug |-- +|#809 |Sporadic timeouts from ConsumerOffsetCommitter.CommitRequest |tedcaozoom |2024-06 |-- |PR #893 may help +|#833 |PC runs then exits due to InternalRuntimeException(Timeout) |dumontxiong |2024-09 |-- |PR #893 may help +|#597 |PC doesn't close Kafka consumer if commit fails during close |BartoszSta |2023 |-- |-- +|#402 |Max loading factor steps reached: 100/100 |bartman64 |2022-08 |-- |-- +|#857 |Paused consumption across multiple consumers |rbokade-rbk |2025-03 |-- |-- +|#825 |checkAutoCommitIsDisabled fails with kafka-clients < 3.7.0 |ddqof |2024-08 |-- |-- +|=== + +*Verdict:* #803 is the only "verified bug" in the entire tracker. Start here. #809 and #833 are likely related -- investigate together. + +=== Category 2: Rebalance / offset correctness (data loss risk) + +[cols="1,4,2,1,2", options="header"] +|=== +|Issue |Title |Author |Date |Cross-ref PRs + +|#777 |Partition revocation leads to duplicate event processing |ajax-levashov-m |2024-05 |PR #893 (offset reset), PR #909 (stale container) +|#843 |Record picked up by multiple threads simultaneously |singhsaurabh2409 |2024-12 |PR #909 likely root cause +|#326 |Error in onPartitionsAssigned |milansanjeev |2022-07 |labeled not-a-bug, but related to rebalance +|=== + +*Verdict:* Merging PR #893 and PR #909 likely resolves #777 and #843. Verify after cherry-pick. + +=== Category 3: Observability & health (most requested feature gap) + +[cols="1,4,2,1,1,2", options="header"] +|=== +|Issue |Title |Author |Date |Labels |Cross-ref PRs + +|#27 |Micrometer metrics |astubbs |2020-11 |enhancement, medium |-- +|#71 |Health-checks |JorgenRingen |2021-01 |medium |PR #464 (astubbs, closed) +|#859 |Memory leak in PCMetrics class |CMExile |2025-04 |-- |PR #905 (metric addition) +|#484 |Question: PC state to read from? |Ehud-Lev |2022-11 |question |-- +|=== + +*Verdict:* #27 and #71 have been open for 5+ years. PR #464 was the intended fix; redo it. PR #905 adds one metric but doesn't close these. + +=== Category 4: Error handling / DLQ / retry (most demanded feature) + +[cols="1,4,2,1,2", options="header"] +|=== +|Issue |Title |Author |Date |Cross-ref PRs + +|#310 |DLQ implementation |astubbs |2022-05 |PR #366 (astubbs, closed) +|#196 |Max retries with DLQ callback |astubbs |2022-02 |PR #366, PR #291 +|#304 |Handle deserialization exceptions |astubbs |2022-05 |-- +|#391 |Serialization error handling / flexibility |astubbs |2022-08 |PR #291 (terminal/retry exceptions) +|=== + +*Verdict:* DLQ (#310) is the single most impactful missing feature. Revive PR #366 and PR #291 as a pair. + +=== Category 5: Thread safety & modern JDK support + +[cols="1,4,2,1,1,2", options="header"] +|=== +|Issue |Title |Author |Date |Labels |Cross-ref PRs + +|#186 |Ensure all PC APIs are thread safe |fowlerp-qlik |2022-02 |*blocker, ver:1.0* |PR #346 (closed) +|#78 |Allow customization of ThreadPoolExecutor |mauricioszabo |2021-02 |-- |PR #908 generalises this +|#192 |Unique thread names for PC instances |nioertel |2022-02 |-- |-- +|#299 |Loom integration POC |astubbs |2022-05 |-- |PR #300 (closed), PR #908 (open) +|#520 |Safe user API exposure of ALL Consumer APIs |astubbs |2022-12 |-- |PR #346 (closed) +|#862 |PC cannot run on Java 24 |jvissers |2025-04 |-- |PR #908 may help +|=== + +*Verdict:* #186 is labeled *blocker for 1.0*. #862 (Java 24 compat) is time-critical. PR #908 ReentrantLock migration may resolve #862. + +=== Category 6: Batch processing + +[cols="1,4,2,1,1,2", options="header"] +|=== +|Issue |Title |Author |Date |Labels |Cross-ref PRs + +|#266 |Batch option: same key only |astubbs |2022-04 |good first issue |*PR #915* (implements this!) +|#551 |Batching not working as expected |dixitsingla |2023 |-- |PR #915 may clarify +|#560 |Min batch size + batch max wait time |Ehud-Lev-Forter |2023 |-- |Complementary to #266 +|=== + +*Verdict:* PR #915 directly closes #266. #560 could be a follow-up enhancement. + +=== Category 7: Advanced features + +[cols="1,4,2,1,2", options="header"] +|=== +|Issue |Title |Author |Date |Cross-ref PRs + +|#314 |Combine queues across partitions/topics with KEY ordering |astubbs |2022-05 |PR #316 (closed) +|#321 |Transparent large message chunking |astubbs |2022-06 |-- +|#322 |Disk-backed produce queue |astubbs |2022-06 |-- +|#394 |Producer sends to least loaded broker |astubbs |2022-08 |PR #473 (closed) +|#718 |Missing feature to terminate processing |Zordid |2024 |-- +|#782 |Seeking to a specific offset for a partition |ebrockman1 |2024-05 |-- +|=== + +*Verdict:* #718 (terminate processing) and #782 (seek to offset) are practical, self-contained features worth implementing. + +=== Category 8: Logging / UX polish + +[cols="1,4,2,1,2", options="header"] +|=== +|Issue |Title |Author |Date |Cross-ref PRs + +|#629 |Missing topic/offset info in ConsumerOffsetCommitter error log |bmaggi |2023 |-- +|#631 |Warning log too verbose in RemovedPartitionState |bmaggi |2023 |-- +|#640 |Error log too verbose in AbstractParallelEoSStreamProcessor |bmaggi |2023 |PR #291 +|#622 |Wrong multiplier value in retry delay function example |zerda |2023 |-- +|=== + +*Verdict:* All four are low-effort, high-ROI. Batch them in a single "logging cleanup" PR. + +=== Category 9: Documentation gaps + +[cols="1,4,2,1,2", options="header"] +|=== +|Issue |Title |Author |Date |Cross-ref PRs + +|#115 |Clarify tombstone message handling (javadoc) |astubbs |2021-05 |-- +|#171 |Spring boot example |astubbs |2021-10 |-- +|#178 |Distribute single message to HTTP endpoints |astubbs |2021-11 |-- +|#642 |Add explanation of close modes |rkolesnev |2023 |-- +|=== + +*Verdict:* #171 (Spring Boot example) is highest-impact doc. #642 (close modes) addresses real operational confusion. + +=== Category 10: Vertx / Reactor module + +[cols="1,4,2,1,2", options="header"] +|=== +|Issue |Title |Author |Date |Cross-ref PRs + +|#170 |CompletableFuture API support + docs |astubbs |2021-10 |-- +|#180 |vertxHttpReqInfo only supports GET |OpenSourceTycoon |2021-11 |-- +|#480 |Question: Produce events using reactor? |Ehud-Lev |2022-11 |-- +|#860 |Accept instance params for managedExecutorService |dougcavalheiro |2025-04 |-- +|#912 |Memory Leak from JStreamVertxParallelStreamProcessor |sharifahmad2061 |2026-01 |-- +|=== + +*Verdict:* #912 (memory leak) is a production bug and highest priority in this group. + +=== Category 11: Build / CI / tooling + +[cols="1,4,2,1,2", options="header"] +|=== +|Issue |Title |Author |Date |Cross-ref PRs + +|#103 |Matrix test against multiple AK/JDK versions |astubbs |2021-03 |-- +|#130 |Remove static state in tests |astubbs |2021-07 |PR #126, PR #405 (both closed) +|#162 |mvn compile fails without test-jar |rkolesnev |2021-09 |-- +|#259 |Adopt error-prone and checker |astubbs |2022-04 |-- +|#290 |Refactor test base |astubbs |2022-05 |-- +|#526 |Move LongPollingMockConsumer to main artefact |astubbs |2022-12 |-- +|#861 |Error running tests: ManagedTruth.assertThat not found |gihong-park |2025-04 |PR #901 +|#906 |pom.xml version mismatch |lutzh |2026-01 |-- +|=== + +*Verdict:* #162 and #861 are contributor-friction issues. Fix these early to unblock external contributors. + +=== Category 12: Architecture / refactoring (internal) + +[cols="1,4,2,1,2", options="header"] +|=== +|Issue |Title |Author |Date |Cross-ref PRs + +|#200 |Shared-nothing architecture |astubbs |2022-02 |PR #270 (closed) +|#233 |Refactor OffsetMapCodecManager |astubbs |2022-03 |-- +|#241 |Refactor WC type from String to Enum |astubbs |2022-03 |-- +|#109 |Review serialisation versioning strategy |astubbs |2021-04 |-- +|=== + +*Verdict:* Internal maintainability improvements. Tackle when touching adjacent code. + +=== Category 13: Meta / project governance + +[cols="1,4,2,1,1", options="header"] +|=== +|Issue |Title |Author |Date |Labels + +|#172 |Release train for 1.0 |astubbs |2021-10 |*high* +|#177 |Investigate Release Drafter |astubbs |2021-11 |-- +|#907 |Is the project still actively maintained? |amrynsky |2026-01 |-- +|=== + +*Verdict:* #907 is the community distress signal. Answering it with a new release would address the maintenance concern directly. + +=== Closed as "Not Planned" (8 issues) + +[cols="1,4,3", options="header"] +|=== +|Issue |Title |Reason + +|#847 |Python support? |Out of scope (Java only) +|#462 |How to set starting offset? |Question, answered +|#446 |Poller behavior question |Question, answered +|#433 |Different results with different concurrency configs |Cannot reproduce +|#395 |API to send records through wrapped producer |Closed by astubbs +|#379 |consumeAndProduce with read-one write-many |Not a bug / by design +|#377 |Multiple instances/nodes support |Not a bug / by design +|#373 |batchSize(100) but only getting 1-3 messages |Not a bug / by design +|=== + +*Verdict:* Only #395 might be worth reconsidering. The rest are correctly closed. + +=== Issue priority ranking (top 15 across all categories) + +[cols="1,1,4,2,3", options="header"] +|=== +|Rank |Issue |Title |Category |Why first + +|1 |#803 |Transactional Producer timeout (verified bug) |Stability |Only verified bug; production-blocking +|2 |#71 |Health-checks |Observability |5+ years open; most basic operational need +|3 |#27 |Micrometer metrics |Observability |5+ years open; pairs with #71 +|4 |#310 |DLQ implementation |Error handling |Most demanded feature; PR #366 exists as design ref +|5 |#186 |Thread-safe PC APIs (blocker) |Safety |Labeled *blocker for 1.0* +|6 |#777 |Partition revocation leads to duplicates |Correctness |Fixed by merging PR #893 + #909 +|7 |#862 |PC cannot run on Java 24 |Compat |Time-critical; JDK moves fast +|8 |#266 |Batch same-key-only option |Batch |PR #915 already implements this +|9 |#809 |Sporadic commit timeouts |Stability |Multiple users report; likely related to #803 +|10 |#859 |Memory leak in PCMetrics |Stability |Production leak +|11 |#912 |Memory leak in Vertx module |Stability |Production leak +|12 |#162 |mvn compile fails without test-jar |Build |Blocks contributors +|13 |#718 |Terminate processing feature |Feature |Practical operational gap +|14 |#196 |Max retries + DLQ callback |Error handling |Companion to #310 +|15 |#629+#631+#640 |Logging verbosity fixes |UX |Low-effort batch fix +|=== + +=== Issue to PR cross-reference matrix + +[cols="2,2,2,3", options="header"] +|=== +|Issue |Open upstream PR |Closed astubbs PR |Action + +|#71 (health checks) |#905 (partial -- one metric only) |#464 (full health check impl) |Revive #464, merge #905 +|#266 (batch same-key) |*#915* (full implementation) |-- |Merge #915 +|#299 (Loom POC) |*#908* (Virtual Threads) |#300 (Loom POC) |Compare #908 vs #300, merge #908 +|#310 (DLQ) |-- |#366 (DLQ impl) |Revive #366 +|#314 (combine queues) |-- |#316 (impl) |Revive #316 +|#520 (safe user API) |-- |#346 (thread-safe API) |Revive #346 +|#200 (shared-nothing) |-- |#270 (impl) |Keep as design ref +|#130 (static state tests) |-- |#126, #405 |Revive #126 +|#394 (least-loaded broker) |-- |#473 (impl) |Revive #473 +|#196 (max retries) |-- |#291 (terminal exceptions) |Revive #291 +|#777 (rebalance dupes) |*#893* (offset fix) |-- |Merge #893 +|#843 (multi-thread pickup) |*#909* (stale container) |-- |Merge #909 +|#862 (Java 24 compat) |*#908* (ReentrantLock migration) |-- |Merge #908 +|=== + +== Part 4: How to use this document + +. *Part 1 (open PRs)* = short-term backlog -- fixes and features already written, just needing review / merge / cherry-pick. +. *Part 2 (astubbs closed PRs)* = medium-term idea bank -- intentional design work that was never rejected on technical grounds. Use each entry as a *spec*, not a branch to resurrect (branches will have bitrot beyond usefulness). +. *Part 3 (issues)* = the requirements backlog -- user-reported needs and bugs. Use the issue priority ranking and the cross-reference matrix to decide what to tackle next. Issues with linked PRs (open or closed) can be resolved by merging/reviving those PRs. +. Before picking up a Part 2 item, check Part 1 for overlap (see cross-references) to avoid duplicating effort. From 92ffad9a97fa0f3464505f684aaf9c3207084503 Mon Sep 17 00:00:00 2001 From: Antony Stubbs Date: Tue, 21 Apr 2026 13:13:59 +1200 Subject: [PATCH 06/12] review: fix CI security, publish gate, and version parsing robustness - Switch check-dependencies.yml from pull_request_target to pull_request to prevent fork PRs running with base repo write permissions - Pin all three custom actions to commit SHAs instead of mutable branch refs (dependencies-action@a09974c, duplicate-code-cross-check@d3140ef, duplicate-code-detection-tool@4e302e7) - Gate publish.yml on CI success via workflow_run trigger instead of deploying on every master push regardless of test results - Add try/catch and pre-release suffix stripping to deriveCpKafkaImage() so unexpected version strings fall back gracefully instead of crashing all integration tests with ExceptionInInitializerError - Add Documented Solutions section to AGENTS.md for discoverability - Add copyright header management solution doc Co-Authored-By: Claude Opus 4.6 (1M context) --- .github/workflows/check-dependencies.yml | 4 +- .github/workflows/maven.yml | 4 +- .github/workflows/publish.yml | 7 +- AGENTS.md | 4 + README.adoc | 4 +- ...et-mutable-refs-publish-gate-2026-04-21.md | 186 ++++++++++++++++++ ...yright-header-rules-for-fork-2026-04-21.md | 132 +++++++++++++ .../BrokerIntegrationTest.java | 32 +-- src/docs/README_TEMPLATE.adoc | 4 +- 9 files changed, 356 insertions(+), 21 deletions(-) create mode 100644 docs/solutions/security-issues/ci-hardening-pull-request-target-mutable-refs-publish-gate-2026-04-21.md create mode 100644 docs/solutions/workflow-issues/copyright-header-rules-for-fork-2026-04-21.md diff --git a/.github/workflows/check-dependencies.yml b/.github/workflows/check-dependencies.yml index cd459d954..36ba5ba72 100644 --- a/.github/workflows/check-dependencies.yml +++ b/.github/workflows/check-dependencies.yml @@ -1,7 +1,7 @@ name: PR Dependency Check on: - pull_request_target: + pull_request: types: [opened, edited, closed, reopened] permissions: @@ -14,6 +14,6 @@ jobs: runs-on: ubuntu-latest name: Check Dependencies steps: - - uses: astubbs/dependencies-action@feat/auto-unblock-children-on-merge + - uses: astubbs/dependencies-action@a09974c # feat/auto-unblock-children-on-merge env: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} diff --git a/.github/workflows/maven.yml b/.github/workflows/maven.yml index 7e09c88fa..6a0a70a48 100644 --- a/.github/workflows/maven.yml +++ b/.github/workflows/maven.yml @@ -132,7 +132,7 @@ jobs: - uses: actions/checkout@v6 with: fetch-depth: 0 - - uses: astubbs/duplicate-code-cross-check@v1 + - uses: astubbs/duplicate-code-cross-check@d3140ef # v1 with: github-token: ${{ secrets.GITHUB_TOKEN }} directories: 'parallel-consumer-core/src parallel-consumer-vertx/src parallel-consumer-reactor/src parallel-consumer-mutiny/src' @@ -149,7 +149,7 @@ jobs: - uses: actions/checkout@v6 with: fetch-depth: 0 - - uses: astubbs/duplicate-code-detection-tool@feat/base-vs-pr-comparison + - uses: astubbs/duplicate-code-detection-tool@4e302e7 # feat/base-vs-pr-comparison with: github_token: ${{ secrets.GITHUB_TOKEN }} directories: 'parallel-consumer-core/src,parallel-consumer-vertx/src,parallel-consumer-reactor/src,parallel-consumer-mutiny/src' diff --git a/.github/workflows/publish.yml b/.github/workflows/publish.yml index eda59dc81..3fcee2ad5 100644 --- a/.github/workflows/publish.yml +++ b/.github/workflows/publish.yml @@ -16,8 +16,11 @@ name: Publish to Maven Central on: - push: + # Only publish after CI passes - workflow_run triggers when "Build and Test" completes + workflow_run: + workflows: ["Build and Test"] branches: [ master ] + types: [completed] workflow_dispatch: # allow manual re-runs concurrency: @@ -30,6 +33,8 @@ permissions: jobs: publish: name: Publish + # Skip if CI failed (workflow_run fires on both success and failure) + if: github.event_name == 'workflow_dispatch' || github.event.workflow_run.conclusion == 'success' runs-on: ubuntu-latest steps: diff --git a/AGENTS.md b/AGENTS.md index 364013855..75796c849 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -96,3 +96,7 @@ The pom.xml version drives publishing — there is no `maven-release-plugin` dan - `MAVEN_CENTRAL_PASSWORD` — Sonatype Central Portal token password - `MAVEN_GPG_PRIVATE_KEY` — Armored GPG private key for signing artifacts - `MAVEN_GPG_PASSPHRASE` — Passphrase for the GPG key + +## Documented Solutions + +`docs/solutions/` - documented solutions to past problems and workflow patterns, organized by category with YAML frontmatter (`module`, `tags`, `problem_type`). Relevant when implementing or debugging in documented areas. diff --git a/README.adoc b/README.adoc index dafd23f29..5a28f87e6 100644 --- a/README.adoc +++ b/README.adoc @@ -22,7 +22,7 @@ TIP:: Editing template file endif::[] -= Confluent Parallel Consumer += Kafka Parallel Consumer :icons: :toc: macro :toclevels: 3 @@ -382,7 +382,7 @@ We also encourage participation, so if you have any feature ideas etc, please ge NOTE: We are very interested to hear about your experiences! And please vote on your favourite issues! -If you have questions, head over to the https://launchpass.com/confluentcommunity[Confluent Slack community], or raise an https://github.com/confluentinc/parallel-consumer/issues[issue] on GitHub. +If you have questions or find a bug, raise an https://github.com/astubbs/parallel-consumer/issues[issue] on GitHub. == License diff --git a/docs/solutions/security-issues/ci-hardening-pull-request-target-mutable-refs-publish-gate-2026-04-21.md b/docs/solutions/security-issues/ci-hardening-pull-request-target-mutable-refs-publish-gate-2026-04-21.md new file mode 100644 index 000000000..d2d499704 --- /dev/null +++ b/docs/solutions/security-issues/ci-hardening-pull-request-target-mutable-refs-publish-gate-2026-04-21.md @@ -0,0 +1,186 @@ +--- +title: "CI Security Hardening: pull_request_target, Mutable Action Refs, Missing Test Gates, and Fragile Static Parsers" +date: 2026-04-21 +category: security-issues +module: build-system +problem_type: security_issue +component: development_workflow +root_cause: config_error +resolution_type: config_change +severity: high +applies_when: + - Setting up GitHub Actions CI for a fork that accepts external PRs + - Using custom or third-party actions with write permissions + - Publishing artifacts to Maven Central from CI + - Parsing version strings in test infrastructure static initializers +tags: + - ci + - github-actions + - supply-chain + - action-pinning + - publish-gate + - static-initializer + - pull-request-target +--- + +# CI Security Hardening: pull_request_target, Mutable Action Refs, Missing Test Gates, and Fragile Static Parsers + +## Context + +During code review of the `dev/rebrand-fork` branch (parallel-consumer fork rebranding), four CI security and robustness issues were identified. These ranged from a critical supply-chain vulnerability (`pull_request_target` + mutable action ref) to a test-killing `ExceptionInInitializerError` from unparseable version strings. None had been exploited, but all represented real attack surface or reliability gaps. + +The fixes establish patterns applicable to any project using GitHub Actions with fork PRs, reusable actions, and Maven publishing. + +## Guidance + +### 1. Never use `pull_request_target` with mutable action refs + +`pull_request_target` runs workflows in the context of the *base* repository, with access to base-repo secrets and elevated permissions. When combined with a mutable action ref (branch name), an attacker controlling that action branch can execute arbitrary code with write permissions on every fork PR. + +**Rule**: If you need `pull_request_target`, pin every action to a commit SHA. Prefer switching to `pull_request` where the elevated capability is not actually needed. + +### 2. Pin all action refs to commit SHAs + +Branch and tag refs are mutable pointers. A force-push upstream silently changes what code runs in your CI, with whatever permissions `GITHUB_TOKEN` grants. SHA pins are immutable. + +**Rule**: Every `uses:` entry in a workflow that has any write permission must reference a commit SHA. Keep the original ref as a trailing comment for human readability. + +### 3. Gate publishing on a successful test run + +Publishing on raw `push` with `-DskipTests` means any test-failing commit produces a published SNAPSHOT artifact. Downstream consumers inherit broken code. + +**Rule**: Use `workflow_run` to trigger publishing only after the build-and-test workflow succeeds. Add an explicit `if:` guard on the conclusion. Keep `workflow_dispatch` so maintainers can manually re-run. + +### 4. Harden version string parsing for pre-release suffixes + +When parsing version strings at runtime, `Integer.parseInt()` on a string like `"4.0.0-SNAPSHOT"` throws `NumberFormatException`. If this code runs in a static initializer, the resulting `ExceptionInInitializerError` kills every test in every class that inherits the initializer - a confusing failure mode that appears unrelated to the actual cause. + +**Rule**: Strip pre-release suffixes before parsing. Wrap in a try/catch with a safe fallback. Never let version detection run in a static initializer without a fallback. + +## Why This Matters + +**Supply chain risk (Issues 1 & 2).** `pull_request_target` with mutable action refs is a well-documented GitHub Actions attack vector. An attacker who can push to the referenced action branch - including via a compromised maintainer account on a third-party repo - silently executes code with `GITHUB_TOKEN` write access on every incoming PR. This can exfiltrate secrets, poison releases, or modify repository state. + +**Artifact integrity (Issue 3).** Publishing broken SNAPSHOTs trains downstream developers to distrust the artifact feed. In a library project, a bad SNAPSHOT propagates into dependent projects before anyone notices. + +**Test infrastructure fragility (Issue 4).** Static initializer failures produce `ExceptionInInitializerError` which JUnit reports as a class-level error, not a test failure. The root cause is buried in the stack trace with no obvious connection to the test being run. + +## When to Apply + +- **SHA pinning**: Any GitHub Actions workflow with write permissions, secrets access, or `pull_request_target` trigger +- **`pull_request` vs `pull_request_target`**: Use `pull_request_target` only when you explicitly need base-repo secrets for fork PRs (e.g., posting a comment after a permission check). For read-only checks, use `pull_request` +- **Publish gating**: Whenever publishing is triggered by a push and the build/test workflow lives in a separate workflow file. If tests and publishing are in the same file, a `needs:` dependency is simpler +- **Version string hardening**: Whenever parsing version strings from dependency metadata, environment variables, or external configuration - especially in static initializers or `@BeforeAll` + +## Examples + +### check-dependencies.yml (pull_request_target + mutable ref) + +Before (vulnerable): +```yaml +on: + pull_request_target: + types: [opened, edited, closed, reopened] +permissions: + checks: write +jobs: + check_dependencies: + steps: + - uses: astubbs/dependencies-action@feat/auto-unblock-children-on-merge +``` + +After (hardened): +```yaml +on: + pull_request: + types: [opened, edited, closed, reopened] +permissions: + checks: write +jobs: + check_dependencies: + steps: + - uses: astubbs/dependencies-action@a09974c # feat/auto-unblock-children-on-merge +``` + +### maven.yml (mutable action refs) + +Before: +```yaml +- uses: astubbs/duplicate-code-cross-check@v1 +- uses: astubbs/duplicate-code-detection-tool@feat/base-vs-pr-comparison +``` + +After: +```yaml +- uses: astubbs/duplicate-code-cross-check@d3140ef # v1 +- uses: astubbs/duplicate-code-detection-tool@4e302e7 # feat/base-vs-pr-comparison +``` + +### publish.yml (missing test gate) + +Before: +```yaml +on: + push: + branches: [ master ] +jobs: + publish: + steps: + - run: ./mvnw deploy -DskipTests +``` + +After: +```yaml +on: + workflow_run: + workflows: ["Build and Test"] + branches: [ master ] + types: [completed] + workflow_dispatch: +jobs: + publish: + if: github.event_name == 'workflow_dispatch' || github.event.workflow_run.conclusion == 'success' + steps: + - run: ./mvnw deploy -DskipTests +``` + +### BrokerIntegrationTest.java (fragile version parsing) + +Before: +```java +static String deriveCpKafkaImage() { + String akVersion = AppInfoParser.getVersion(); + String[] parts = akVersion.split("\\."); + int akMajor = Integer.parseInt(parts[0]); // throws on "4.0.0-SNAPSHOT" + int akMinor = Integer.parseInt(parts[1]); + return "confluentinc/cp-kafka:" + (akMajor + 4) + "." + akMinor + ".0"; +} +``` + +After: +```java +private static final String FALLBACK_CP_IMAGE = "confluentinc/cp-kafka:7.9.0"; + +static String deriveCpKafkaImage() { + String akVersion = AppInfoParser.getVersion(); + try { + String cleanVersion = akVersion.split("-")[0]; // strip -SNAPSHOT, -rc1 + String[] parts = cleanVersion.split("\\."); + int akMajor = Integer.parseInt(parts[0]); + int akMinor = Integer.parseInt(parts[1]); + return "confluentinc/cp-kafka:" + (akMajor + 4) + "." + akMinor + ".0"; + } catch (NumberFormatException | ArrayIndexOutOfBoundsException e) { + log.warn("Could not parse Kafka version '{}', falling back to {}", + akVersion, FALLBACK_CP_IMAGE, e); + return FALLBACK_CP_IMAGE; + } +} +``` + +## Related + +- `docs/solutions/workflow-issues/copyright-header-rules-for-fork-2026-04-21.md` - companion fork workflow guidance (copyright headers) +- `.github/workflows/check-dependencies.yml` - primary subject of Issue 1 +- `.github/workflows/maven.yml` - primary subject of Issue 2 +- `.github/workflows/publish.yml` - primary subject of Issue 3 +- `parallel-consumer-core/src/test-integration/.../BrokerIntegrationTest.java` - primary subject of Issue 4 diff --git a/docs/solutions/workflow-issues/copyright-header-rules-for-fork-2026-04-21.md b/docs/solutions/workflow-issues/copyright-header-rules-for-fork-2026-04-21.md new file mode 100644 index 000000000..f139f5658 --- /dev/null +++ b/docs/solutions/workflow-issues/copyright-header-rules-for-fork-2026-04-21.md @@ -0,0 +1,132 @@ +--- +title: Copyright header management rules for Apache 2.0 fork +date: 2026-04-21 +category: workflow-issues +module: build-system +problem_type: workflow_issue +component: development_workflow +severity: medium +applies_when: + - Rebranding an Apache 2.0 fork under new Maven coordinates + - Creating new files in a forked repository + - Running Maven builds that trigger the license plugin + - Code review of changes to files with copyright headers +tags: + - copyright + - license-headers + - fork + - maven + - mycila + - apache-2 + - agents-md +--- + +# Copyright header management rules for Apache 2.0 fork + +## Context + +The `astubbs/parallel-consumer` repository is an Apache 2.0 fork of `confluentinc/parallel-consumer`, rebranded under `io.github.astubbs.parallelconsumer` Maven coordinates. During the `dev/rebrand-fork` branch review, three copyright header problems surfaced: + +1. **License plugin auto-bumps years**: The `license-maven-plugin` (Mycila) auto-bumps copyright year ranges when run without `-Dlicense.skip`. A routine pom.xml property addition caused `parallel-consumer-mutiny/pom.xml` to silently gain a year bump from `2020-2025` to `2020-2026` as an unintended side effect. + +2. **New fork files wrongly attributed**: Six new `bin/` scripts (build.sh, ci-build.sh, ci-integration-test.sh, ci-unit-test.sh, performance-test.cmd, performance-test.sh) were written entirely for the fork but carried `Copyright (C) 2020-2026 Confluent, Inc.` headers. Confluent didn't write them and they didn't exist before 2026. + +3. **Review fix changed copyright incorrectly**: During code review autofix, the reviewer changed `EpochAndRecordsMap.java`'s copyright from `2020-2022 Confluent, Inc.` to `2020-2026 Confluent, Inc. and contributors`. The original PR commit only added code (a null-epoch guard), not a copyright change - the reviewer introduced a spurious copyright modification that had to be reverted. + +None of this was caught early because AGENTS.md had no guidance on copyright header management. + +## Guidance + +Rules added to AGENTS.md under "Code Style": + +``` +- **Copyright rules for this fork**: + - Do not change copyright headers on existing files unless the file has + substantive code changes in the same commit + - Do not bump copyright years as an incidental or standalone change + - The `NOTICE` file at repo root contains the legal attribution structure + - New files written entirely for the fork should not claim Confluent copyright + - Always pass `-Dlicense.skip` to Maven to prevent the license plugin from + auto-bumping years +``` + +The NOTICE file at the repository root is the authoritative attribution record: + +``` +Parallel Consumer +Copyright 2020-2026 Confluent, Inc. +Copyright 2026 Antony Stubbs and contributors +``` + +The pom.xml license template header was changed from: + +``` +Copyright (C) ${license.git.copyrightYears} ${project.organization.name} +``` + +to: + +``` +Copyright (C) 2020-${license.git.copyrightYears} Confluent, Inc. and contributors +``` + +## Why This Matters + +**Legal accuracy.** Copyright headers are legal statements. Attributing work to Confluent that Confluent did not write, or claiming a file dates from 2020 when it was created in 2026, produces inaccurate legal records. + +**Diff noise.** Spurious year bumps pollute commits and PRs with meaningless diffs. A reviewer looking at a pom.xml change to add `9` should not have to mentally filter out an unrelated copyright line change. This also makes git blame less useful. + +**Review trust.** When an automated reviewer or agent changes copyright headers as a "fix," it undermines confidence in the review. The `EpochAndRecordsMap.java` incident changed a historically accurate range (`2020-2022`) to a wrong one (`2020-2026`). + +**Plugin footgun.** The Mycila license plugin runs as part of the normal build cycle. Without `-Dlicense.skip`, any Maven command that triggers `license:check` or `license:format` will rewrite headers silently. The plugin also breaks in git worktrees. + +## When to Apply + +- Any time you run a Maven command: always include `-Dlicense.skip` +- Any time you create a new file that did not exist in upstream: do not add Confluent copyright +- Any time a code review suggests changing a copyright header: only accept if the same commit has substantive code changes +- Any time you see a standalone "bump copyright year" commit: reject it +- Any time you add a property, dependency, or config-only change to a pom.xml: the copyright header must not change + +## Examples + +**Bad - license plugin auto-bump as side effect:** + +```xml + + + +``` + +Fix: run Maven with `-Dlicense.skip` and revert the header change. + +**Bad - new fork-only file claiming upstream copyright:** + +```bash +#!/bin/bash +# Copyright (C) 2020-2026 Confluent, Inc. <-- WRONG +# bin/build.sh - created entirely for the astubbs fork in 2026 +``` + +**Bad - reviewer adding spurious copyright change alongside a code fix:** + +```java +// EpochAndRecordsMap.java +// Before (historically accurate): Copyright (C) 2020-2022 Confluent, Inc. +// After reviewer autofix (wrong): Copyright (C) 2020-2026 Confluent, Inc. and contributors +``` + +The commit only added a null-epoch guard. The range `2020-2022` was correct. The reviewer's change was reverted. + +**Correct - substantive code change warrants a header update:** + +```java +// Before: Copyright (C) 2020-2022 Confluent, Inc. +// After a real 2026 code change: Copyright (C) 2020-2026 Confluent, Inc. and contributors +``` + +## Related + +- `AGENTS.md` lines 70-76 - the codified rules (authoritative source) +- `NOTICE` file at repo root - legal attribution structure +- `pom.xml` line ~613 - license template `inlineHeader` configuration diff --git a/parallel-consumer-core/src/test-integration/java/io/confluent/parallelconsumer/integrationTests/BrokerIntegrationTest.java b/parallel-consumer-core/src/test-integration/java/io/confluent/parallelconsumer/integrationTests/BrokerIntegrationTest.java index 56c84017a..c04b2e11a 100644 --- a/parallel-consumer-core/src/test-integration/java/io/confluent/parallelconsumer/integrationTests/BrokerIntegrationTest.java +++ b/parallel-consumer-core/src/test-integration/java/io/confluent/parallelconsumer/integrationTests/BrokerIntegrationTest.java @@ -62,22 +62,30 @@ public abstract class BrokerIntegrationTest { *

* Mapping: CP major = AK major + 4 (e.g., AK 3.1 → CP 7.1, AK 3.9 → CP 7.9). */ + private static final String FALLBACK_CP_IMAGE = "confluentinc/cp-kafka:7.9.0"; + static String deriveCpKafkaImage() { String akVersion = org.apache.kafka.common.utils.AppInfoParser.getVersion(); log.info("Kafka client version detected: {}", akVersion); - // Parse major.minor from the AK version - String[] parts = akVersion.split("\\."); - int akMajor = Integer.parseInt(parts[0]); - int akMinor = Integer.parseInt(parts[1]); - - // CP major = AK major + 4, CP minor = AK minor - int cpMajor = akMajor + 4; - int cpMinor = akMinor; - - String cpImage = "confluentinc/cp-kafka:" + cpMajor + "." + cpMinor + ".0"; - log.info("Using CP Kafka image: {} (derived from AK {})", cpImage, akVersion); - return cpImage; + try { + // Strip pre-release suffixes (e.g. "4.0.0-SNAPSHOT" -> "4.0.0") + String cleanVersion = akVersion.split("-")[0]; + String[] parts = cleanVersion.split("\\."); + int akMajor = Integer.parseInt(parts[0]); + int akMinor = Integer.parseInt(parts[1]); + + // CP major = AK major + 4, CP minor = AK minor + int cpMajor = akMajor + 4; + int cpMinor = akMinor; + + String cpImage = "confluentinc/cp-kafka:" + cpMajor + "." + cpMinor + ".0"; + log.info("Using CP Kafka image: {} (derived from AK {})", cpImage, akVersion); + return cpImage; + } catch (NumberFormatException | ArrayIndexOutOfBoundsException e) { + log.warn("Could not parse Kafka version '{}', falling back to {}", akVersion, FALLBACK_CP_IMAGE, e); + return FALLBACK_CP_IMAGE; + } } public static KafkaContainer createKafkaContainer(String logSegmentSize) { diff --git a/src/docs/README_TEMPLATE.adoc b/src/docs/README_TEMPLATE.adoc index 1b779a204..bb43bc404 100644 --- a/src/docs/README_TEMPLATE.adoc +++ b/src/docs/README_TEMPLATE.adoc @@ -22,7 +22,7 @@ TIP:: Editing template file endif::[] -= Confluent Parallel Consumer += Kafka Parallel Consumer :icons: :toc: macro :toclevels: 3 @@ -380,7 +380,7 @@ We also encourage participation, so if you have any feature ideas etc, please ge NOTE: We are very interested to hear about your experiences! And please vote on your favourite issues! -If you have questions, head over to the https://launchpass.com/confluentcommunity[Confluent Slack community], or raise an https://github.com/confluentinc/parallel-consumer/issues[issue] on GitHub. +If you have questions or find a bug, raise an https://github.com/astubbs/parallel-consumer/issues[issue] on GitHub. == License From c8277126a90f8cabe67cd91408a3f708e3bddc44 Mon Sep 17 00:00:00 2001 From: Antony Stubbs Date: Tue, 21 Apr 2026 15:06:12 +1200 Subject: [PATCH 07/12] refactor(ci): unify build config, move Kafka compat to PRs, extend prepare-deps cache warming PR builds now run two tiers in parallel: - Split suites (unit, integration, performance) on default Kafka 3.9.1 - Experimental Kafka 4.x compatibility check (non-blocking) Push-to-master runs a single ci-build.sh (default Kafka version) to gate SNAPSHOT publishing. No more version matrix on master. All jobs now use explicit cache/restore with rotating keys from prepare-deps. Eliminated all setup-java cache: 'maven' usage which caused the frozen-cache bug (immutable keys could never overwrite an incomplete cache entry). prepare-deps now runs on push builds too, so master hits the reactor with a warm Maven Central cache for the vertx dependencies that previously caused consistent 120s timeouts. Also skip unit tests in performance-test.sh so a flaky unit test can't cascade-cancel the other PR matrix jobs via fail-fast. Co-Authored-By: Claude Opus 4.6 (1M context) --- .github/workflows/check-dependencies.yml | 2 +- .github/workflows/maven.yml | 142 ++++++++++++------ AGENTS.md | 2 +- bin/performance-test.sh | 1 + ...l-timeout-azure-west-regions-2026-04-21.md | 122 +++++++++++++++ 5 files changed, 217 insertions(+), 52 deletions(-) create mode 100644 docs/solutions/build-errors/maven-central-timeout-azure-west-regions-2026-04-21.md diff --git a/.github/workflows/check-dependencies.yml b/.github/workflows/check-dependencies.yml index 36ba5ba72..1cd539b3d 100644 --- a/.github/workflows/check-dependencies.yml +++ b/.github/workflows/check-dependencies.yml @@ -14,6 +14,6 @@ jobs: runs-on: ubuntu-latest name: Check Dependencies steps: - - uses: astubbs/dependencies-action@a09974c # feat/auto-unblock-children-on-merge + - uses: astubbs/dependencies-action@a09974c14e84fb3e4c0df10c04f17bdc6ccc8878 # feat/auto-unblock-children-on-merge env: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} diff --git a/.github/workflows/maven.yml b/.github/workflows/maven.yml index 6a0a70a48..28abdafe1 100644 --- a/.github/workflows/maven.yml +++ b/.github/workflows/maven.yml @@ -1,9 +1,16 @@ # PR builds: -# Unit, Integration, Performance - all in parallel with fail-fast -# (if unit tests fail at ~5 min, integration + performance are cancelled) +# Default Kafka version (3.9.1) split into unit, integration, and performance +# suites in parallel for fast feedback, plus an experimental Kafka 4.x +# compatibility check. # # Push builds (master): -# Full Kafka version matrix (3.1.0, 3.7.0, 3.9.1 + experimental 4.x) +# Single full build on default Kafka version (3.9.1) to gate SNAPSHOT publishing. +# +# Caching strategy: +# All jobs use explicit cache/restore with restore-keys fallback to the +# prepare-deps rotating cache. Do NOT use setup-java's built-in `cache: maven` +# anywhere - its immutable keys can freeze an incomplete cache permanently. +# See docs/solutions/build-errors/maven-central-timeout-azure-west-regions-2026-04-21.md name: Build and Test @@ -23,16 +30,12 @@ concurrency: jobs: - # ── PR Builds ────────────────────────────────────────────────────────── + # ── Shared ──────────────────────────────────────────────────────────── - # Pre-warm the Maven dependency cache so the three test jobs can start - # with everything local. Without this, each job races to download the - # same deps from Maven Central, and if Central is slow, they all stall. - # TODO: If this extra serial step doesn't consistently help, remove it - # and rely on per-job caching + longer timeouts instead. Trade-off is - # wall-clock time (adds ~1-2 min serial) vs reliability vs flaky networks. + # Pre-warm the Maven dependency cache so all downstream jobs start with + # everything local. Without this, jobs on Azure West US regions hit + # Maven Central CDN timeouts (see docs/solutions/build-errors/). prepare-deps: - if: github.event_name == 'pull_request' name: "Prepare Maven Cache" runs-on: ubuntu-latest timeout-minutes: 15 @@ -42,13 +45,7 @@ jobs: with: distribution: 'temurin' java-version: '17' - # Explicit cache steps below instead of setup-java's built-in - # `cache: maven`. Built-in uses actions/cache under the hood, - # which does NOT save when the primary key was a hit. Once an - # early run populates a partial cache (e.g. Central timed out - # on one POM), every subsequent run restores the same gap and - # can never write the completed version back. Rotating save - # key below forces a save every run. + # No `cache: maven` - see caching strategy note at top of file - name: Restore Maven cache uses: actions/cache/restore@v4 with: @@ -65,21 +62,21 @@ jobs: path: ~/.m2/repository key: setup-java-Linux-x64-maven-${{ hashFiles('**/pom.xml') }}-${{ github.run_id }} - # All test suites in parallel -fail-fast cancels the rest if any fails + # ── PR Builds ────────────────────────────────────────────────────────── + + # Default Kafka version: split suites for fast, parallel feedback. + # Uses the same scripts as local development (ci-unit-test.sh, etc.) test: if: github.event_name == 'pull_request' needs: prepare-deps strategy: - fail-fast: true + fail-fast: false matrix: include: - suite: unit name: "Unit Tests" cmd: "bin/ci-unit-test.sh" timeout: 15 - # TODO: tighten integration/performance to 30m once we have 5+ - # consecutive successful runs consistently under 20m. Current 60m - # buffer is for Maven Central network variance we've observed. - suite: integration name: "Integration Tests" cmd: "bin/ci-integration-test.sh" @@ -97,10 +94,6 @@ jobs: with: distribution: 'temurin' java-version: '17' - # Explicit restore (no save) so we can fall back via - # restore-keys to the rotating-key cache prepare-deps saved. - # setup-java's built-in `cache: maven` has no restore-keys - # support, so it would miss prepare-deps' `...-{run_id}` key. - name: Restore Maven cache uses: actions/cache/restore@v4 with: @@ -118,6 +111,37 @@ jobs: flags: ${{ matrix.suite }} token: ${{ secrets.CODECOV_TOKEN }} + # Experimental Kafka 4.x compatibility check. Non-blocking. + test-kafka-compat: + if: github.event_name == 'pull_request' + needs: prepare-deps + name: "Kafka Compat (experimental 4.x)" + runs-on: ubuntu-latest + timeout-minutes: 45 + continue-on-error: true + steps: + - uses: actions/checkout@v6 + - uses: actions/setup-java@v5 + with: + distribution: 'temurin' + java-version: '17' + - name: Restore Maven cache + uses: actions/cache/restore@v4 + with: + path: ~/.m2/repository + key: setup-java-Linux-x64-maven-${{ hashFiles('**/pom.xml') }} + restore-keys: | + setup-java-Linux-x64-maven- + - name: Build and Test (Kafka 4.x) + run: bin/ci-build.sh '[3.9.1,5)' + - name: Upload coverage to Codecov + if: always() + uses: codecov/codecov-action@v5 + with: + files: '**/target/site/jacoco/jacoco.xml,**/target/site/jacoco-it/jacoco.xml' + flags: ak-experimental + token: ${{ secrets.CODECOV_TOKEN }} + # Duplicate code detection using both PMD CPD and jscpd engines. # See https://github.com/astubbs/duplicate-code-cross-check duplicate-detection: @@ -132,7 +156,7 @@ jobs: - uses: actions/checkout@v6 with: fetch-depth: 0 - - uses: astubbs/duplicate-code-cross-check@d3140ef # v1 + - uses: astubbs/duplicate-code-cross-check@d3140ef6e9a4adf68c9f729a789e8b7ec8d058f7 # v1 with: github-token: ${{ secrets.GITHUB_TOKEN }} directories: 'parallel-consumer-core/src parallel-consumer-vertx/src parallel-consumer-reactor/src parallel-consumer-mutiny/src' @@ -149,7 +173,7 @@ jobs: - uses: actions/checkout@v6 with: fetch-depth: 0 - - uses: astubbs/duplicate-code-detection-tool@4e302e7 # feat/base-vs-pr-comparison + - uses: astubbs/duplicate-code-detection-tool@4e302e720f86c4e95f709196bae21333a9696937 # feat/base-vs-pr-comparison with: github_token: ${{ secrets.GITHUB_TOKEN }} directories: 'parallel-consumer-core/src,parallel-consumer-vertx/src,parallel-consumer-reactor/src,parallel-consumer-mutiny/src' @@ -169,13 +193,20 @@ jobs: name: "SpotBugs" runs-on: ubuntu-latest timeout-minutes: 10 + needs: prepare-deps steps: - uses: actions/checkout@v6 - uses: actions/setup-java@v5 with: distribution: 'temurin' java-version: '17' - cache: 'maven' + - name: Restore Maven cache + uses: actions/cache/restore@v4 + with: + path: ~/.m2/repository + key: setup-java-Linux-x64-maven-${{ hashFiles('**/pom.xml') }} + restore-keys: | + setup-java-Linux-x64-maven- - name: Restore SpotBugs baseline id: spotbugs-baseline uses: actions/cache/restore@v4 @@ -237,7 +268,7 @@ jobs: await github.rest.issues.createComment({ owner: context.repo.owner, repo: context.repo.repo, issue_number: context.issue.number, body }); } - # Dependency vulnerability scanning -GitHub's own dependency review + # Dependency vulnerability scanning - GitHub's own dependency review dependency-scan: if: github.event_name == 'pull_request' name: "Dependency Vulnerabilities" @@ -259,13 +290,20 @@ jobs: runs-on: ubuntu-latest continue-on-error: true timeout-minutes: 300 + needs: prepare-deps steps: - uses: actions/checkout@v6 - uses: actions/setup-java@v5 with: distribution: 'temurin' java-version: '17' - cache: 'maven' + - name: Restore Maven cache + uses: actions/cache/restore@v4 + with: + path: ~/.m2/repository + key: setup-java-Linux-x64-maven-${{ hashFiles('**/pom.xml') }} + restore-keys: | + setup-java-Linux-x64-maven- # Restore PIT history for incremental analysis — only re-mutates # code that changed since last run. PIT writes its history to /tmp/ # with a filename derived from groupId.artifactId.version, ignoring @@ -357,13 +395,20 @@ jobs: name: "SpotBugs Baseline" runs-on: ubuntu-latest timeout-minutes: 10 + needs: prepare-deps steps: - uses: actions/checkout@v6 - uses: actions/setup-java@v5 with: distribution: 'temurin' java-version: '17' - cache: 'maven' + - name: Restore Maven cache + uses: actions/cache/restore@v4 + with: + path: ~/.m2/repository + key: setup-java-Linux-x64-maven-${{ hashFiles('**/pom.xml') }} + restore-keys: | + setup-java-Linux-x64-maven- - name: Compile and generate SpotBugs XML run: ./mvnw --batch-mode -Pci compile spotbugs:spotbugs -Dlicense.skip -pl parallel-consumer-core -am - name: Convert SpotBugs results to exclude filter @@ -394,36 +439,33 @@ jobs: path: .spotbugs-baseline.xml key: spotbugs-baseline-${{ github.ref_name }}-${{ hashFiles('**/src/main/**/*.java') }} - # Full Kafka version matrix as safety net + # Single full build on default Kafka version to gate SNAPSHOT publishing. + # Uses the same ci-build.sh script that PRs use for Kafka compat testing. build: if: github.event_name == 'push' - strategy: - fail-fast: false - matrix: - ak: [ 3.1.0, 3.7.0, 3.9.1 ] - experimental: [ false ] - name: [ "Stable" ] - include: - - ak: "'[3.9.1,5)'" - experimental: true - name: "Newest AK (4.x?)" - continue-on-error: ${{ matrix.experimental }} - name: "AK: ${{ matrix.ak }} (${{ matrix.name }})" + needs: prepare-deps + name: "Build and Test" runs-on: ubuntu-latest - timeout-minutes: 30 + timeout-minutes: 45 steps: - uses: actions/checkout@v6 - uses: actions/setup-java@v5 with: distribution: 'temurin' java-version: '17' - cache: 'maven' + - name: Restore Maven cache + uses: actions/cache/restore@v4 + with: + path: ~/.m2/repository + key: setup-java-Linux-x64-maven-${{ hashFiles('**/pom.xml') }} + restore-keys: | + setup-java-Linux-x64-maven- - name: Build and Test - run: bin/ci-build.sh ${{ matrix.ak }} + run: bin/ci-build.sh - name: Upload coverage to Codecov if: always() uses: codecov/codecov-action@v5 with: files: '**/target/site/jacoco/jacoco.xml,**/target/site/jacoco-it/jacoco.xml' - flags: ak-${{ matrix.ak }} + flags: default token: ${{ secrets.CODECOV_TOKEN }} diff --git a/AGENTS.md b/AGENTS.md index 75796c849..f079a569e 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -78,7 +78,7 @@ bin/performance-test.sh ## CI -- **`.github/workflows/maven.yml`** — Build and test on every push/PR. PRs run unit, integration, and performance tests in parallel (via `bin/ci-unit-test.sh`, `bin/ci-integration-test.sh`, `bin/performance-test.sh`). Push to master runs the full Kafka version matrix (3.1.0, 3.7.0, 3.9.1 + experimental 4.x) via `bin/ci-build.sh`. Includes concurrency cancellation, SpotBugs, duplicate detection, mutation testing (PIT), and dependency vulnerability scanning on PRs. +- **`.github/workflows/maven.yml`** — Build and test on every push/PR. PRs run two tiers in parallel: (1) split suites on default Kafka 3.9.1 for fast feedback (`bin/ci-unit-test.sh`, `bin/ci-integration-test.sh`, `bin/performance-test.sh`), and (2) an experimental Kafka 4.x compatibility check (`bin/ci-build.sh`). Push to master runs a single full build on default Kafka version via `bin/ci-build.sh` to gate SNAPSHOT publishing. All jobs use explicit `cache/restore` with rotating keys from the `prepare-deps` job - never `setup-java cache: 'maven'`. Includes SpotBugs, duplicate detection, mutation testing (PIT), and dependency vulnerability scanning on PRs. - **`.github/workflows/publish.yml`** — Publishes to Maven Central on every push to `master`. The pom.xml version is the source of truth: `-SNAPSHOT` versions deploy as snapshots, non-snapshot versions deploy as full releases (and create a git tag + GitHub release). - **`.semaphore/`** — Legacy Confluent internal CI/release pipelines, retained but inactive on the fork. diff --git a/bin/performance-test.sh b/bin/performance-test.sh index 29969c94c..596121c0f 100755 --- a/bin/performance-test.sh +++ b/bin/performance-test.sh @@ -15,6 +15,7 @@ set -euo pipefail ./mvnw --batch-mode \ -Pci \ clean verify \ + -DskipUTs=true \ -Dincluded.groups=performance \ -Dexcluded.groups= \ -Dlicense.skip \ diff --git a/docs/solutions/build-errors/maven-central-timeout-azure-west-regions-2026-04-21.md b/docs/solutions/build-errors/maven-central-timeout-azure-west-regions-2026-04-21.md new file mode 100644 index 000000000..3719147a8 --- /dev/null +++ b/docs/solutions/build-errors/maven-central-timeout-azure-west-regions-2026-04-21.md @@ -0,0 +1,122 @@ +--- +title: Maven Central download timeouts on GitHub Actions runners in Azure West US regions +date: 2026-04-21 +category: build-errors +module: build-system +problem_type: build_error +component: development_workflow +symptoms: + - "Could not transfer artifact io.vertx:vertx-web-client:pom:4.5.7 from/to central: Read timed out" + - Build fails consistently on vertx module after core module succeeds + - Exactly 240-second (4-minute) hang per artifact download attempt from Maven Central + - AK 3.9.1 matrix entry passes while AK 3.7.0 and 3.1.0 fail in the same workflow run +root_cause: config_error +resolution_type: config_change +severity: high +tags: + - maven-central + - github-actions + - azure-region + - vertx + - timeout + - cache-warming + - ci +--- + +# Maven Central download timeouts on GitHub Actions runners in Azure West US regions + +## Problem + +GitHub Actions CI builds consistently fail downloading Maven dependencies (appearing as vertx timeouts), but only on some matrix entries while others pass in the same workflow run. The failures are not random - they correlate with which Azure data center the runner is assigned to. + +## Symptoms + +- `Could not transfer artifact io.vertx:vertx-web-client:pom:4.5.7 from/to central (https://repo1.maven.org/maven2/): Read timed out` +- Each download attempt hangs for exactly 240 seconds before falling through to the next repository +- The vertx module always fails; core module always passes (because core's deps are already cached) +- Re-running the failed jobs produces the same failure +- The same artifact downloads in under 200ms locally and in under 200ms from East US runners + +## What Didn't Work + +- **Re-running failed jobs** - same runners in the same regions, same timeout +- **`retryHandler.count=3` in `.mvn/maven.config`** - retries don't help when the CDN route itself is broken. Each retry adds another 120s timeout, totalling 240s+ per artifact per repository +- **Assuming it was a vertx-specific issue** - vertx appeared to be the problem because it's the first module with uncached dependencies. The real issue is the network route from certain Azure regions to Maven Central's CDN + +## Solution + +The root cause is that Maven Central's CDN has degraded connectivity from Azure's western US data centers (westcentralus, westus3). Since you can't control which region GitHub assigns your runner to, the fix is to pre-warm the Maven cache so no module needs to download from Central during the actual build. + +The `prepare-deps` cache warming job downloads all dependencies once (including vertx), then the matrix jobs restore from that cache: + +```yaml +# .github/workflows/maven.yml + +prepare-deps: + # Removed: if: github.event_name == 'pull_request' + # Now runs on BOTH PR and push builds + name: "Prepare Maven Cache" + runs-on: ubuntu-latest + timeout-minutes: 15 + steps: + - uses: actions/checkout@v6 + - uses: actions/setup-java@v5 + with: + distribution: 'temurin' + java-version: '17' + - name: Restore Maven cache + uses: actions/cache/restore@v4 + with: + path: ~/.m2/repository + key: setup-java-Linux-x64-maven-${{ hashFiles('**/pom.xml') }} + restore-keys: | + setup-java-Linux-x64-maven- + - name: Download all dependencies + run: ./mvnw --batch-mode -Pci dependency:go-offline -Dlicense.skip -DincludeScope=test -U + - name: Save Maven cache (rotating key) + if: success() + uses: actions/cache/save@v4 + with: + path: ~/.m2/repository + key: setup-java-Linux-x64-maven-${{ hashFiles('**/pom.xml') }}-${{ github.run_id }} + +build: + if: github.event_name == 'push' + needs: prepare-deps # <-- added: wait for cache warming + # ... matrix config ... +``` + +The rotating `...-${{ github.run_id }}` save key ensures every successful cache-warming run can update the cache, unlike `setup-java`'s built-in caching which uses `actions/cache` (won't save when the primary key hits). + +## Why This Works + +**The root cause is Azure region routing, not Maven Central or vertx.** + +Evidence from the same workflow run (24700111718): + +| Job | Azure Region | vertx-web-client download | Result | +|-----|-------------|--------------------------|--------| +| AK 3.9.1 | **eastus** | 160ms | Passed | +| AK 3.7.0 | **westcentralus** | 240s timeout | Failed | +| AK 3.1.0 | **westus3** | 30min timeout | Failed | + +All three jobs started from the same cache, ran the same code, and tried to download the same 2.9KB POM file from the same Maven Central URL. The only difference was which Azure data center the runner was in. + +Vertx appeared to be the culprit because: +1. The reactor build order puts vertx as the **first module after core** that needs non-core dependencies +2. Core module dependencies (kafka-clients, slf4j, lombok) are either already cached or hosted on faster CDN paths +3. When vertx fails, reactor and mutiny are SKIPPED - so vertx always appears to be the only failing module + +With cache warming, all artifacts (including vertx) are downloaded by a single `prepare-deps` job. If that job lands on a bad region, it may take longer but will eventually succeed within its 15-minute timeout and 3 retries. The subsequent matrix jobs then read everything from the local cache regardless of their runner's region. + +## Prevention + +- **Always run `prepare-deps` before jobs that need Maven dependencies** - don't rely on `setup-java`'s built-in `cache: 'maven'` alone for push builds. The built-in cache uses `actions/cache` which won't overwrite an existing (possibly incomplete) cache key. +- **Use rotating cache keys** (`...-${{ github.run_id }}`) so each successful run can update the cache. Static keys based only on `hashFiles('**/pom.xml')` get stuck if the first cache entry was incomplete. +- **Don't chase the symptom** - when a specific dependency consistently times out in CI, check the Azure region of passing vs failing runners before assuming the dependency or its repository is the problem. + +## Related Issues + +- PR #48 (`fix/prepare-deps-push-builds`) - the fix that extends cache warming to push builds +- `docs/solutions/security-issues/ci-hardening-pull-request-target-mutable-refs-publish-gate-2026-04-21.md` - companion CI hardening doc +- `.mvn/maven.config` - connection timeout settings (10s connect, 120s read, 3 retries) From 2bcc74d968a70bb444e35f5d3b7dc9e2e51cbd5b Mon Sep 17 00:00:00 2001 From: Antony Stubbs Date: Thu, 23 Apr 2026 09:36:55 +1200 Subject: [PATCH 08/12] feat(publish): Maven Central publishing via central-publishing-maven-plugin Sets up the fork to publish both snapshot and release artifacts to Maven Central (new Sonatype Central Portal) under the io.github.astubbs namespace. Pipeline: a push to master triggers Build and Test; on success, the Publish workflow deploys via `./mvnw -Pmaven-central clean deploy`. The central-publishing-maven-plugin (0.10.0) handles both paths: - SNAPSHOT versions: direct PUT to the snapshot endpoint (central.sonatype.com/repository/maven-snapshots/) - RELEASE versions: bundle + validate + auto-publish via Portal API, plus git tag and GitHub release The examples module and its children are excluded from deploy (sample code, not library artifacts): via -pl on the mvn command, and via maven.deploy.skip, maven.install.skip, gpg.skip, and central-publishing-plugin skipPublishing=true in the examples pom. maven-jar-plugin test-jar execution gets skipIfEmpty=true so the parent pom and examples (no test classes) do not emit empty signed test-jars that Sonatype rejects. Real test-jars for core, vertx, reactor, and mutiny publish unchanged. Prerequisite on the Central Portal: the io.github.astubbs namespace must have both verification AND the separate "Enable SNAPSHOT Publishing" toggle switched on. Without the snapshot toggle the snapshot endpoint returns 403 even with valid credentials. Co-Authored-By: Claude Opus 4.7 (1M context) --- .github/workflows/publish.yml | 9 +++++++++ parallel-consumer-examples/pom.xml | 23 +++++++++++++++++++++++ pom.xml | 9 +++++++-- 3 files changed, 39 insertions(+), 2 deletions(-) diff --git a/.github/workflows/publish.yml b/.github/workflows/publish.yml index 3fcee2ad5..84b55385b 100644 --- a/.github/workflows/publish.yml +++ b/.github/workflows/publish.yml @@ -83,9 +83,18 @@ jobs: MAVEN_CENTRAL_PASSWORD: ${{ secrets.MAVEN_CENTRAL_PASSWORD }} MAVEN_GPG_PASSPHRASE: ${{ secrets.MAVEN_GPG_PASSPHRASE }} run: | + # Unified publish path: central-publishing-maven-plugin handles both + # snapshots and releases. For snapshots it PUTs directly to the Central + # Portal snapshot endpoint; for releases it bundles, validates, and + # auto-publishes via the Portal API. Requires SNAPSHOT publishing to be + # enabled on the namespace in central.sonatype.com/publishing/namespaces. + # + # -pl exclusion: examples are sample code, not library artifacts; skip them + # from the deploy reactor. ./mvnw --batch-mode \ -Pmaven-central \ -Pci \ + -pl '!:parallel-consumer-examples,!:parallel-consumer-example-core,!:parallel-consumer-example-metrics,!:parallel-consumer-example-streams,!:parallel-consumer-example-vertx,!:parallel-consumer-example-reactor' \ clean deploy \ -DskipTests \ -Dlicense.skip diff --git a/parallel-consumer-examples/pom.xml b/parallel-consumer-examples/pom.xml index ffcdbee78..1902f4f44 100644 --- a/parallel-consumer-examples/pom.xml +++ b/parallel-consumer-examples/pom.xml @@ -25,4 +25,27 @@ parallel-consumer-example-reactor + + + true + true + true + + + + + + org.sonatype.central + central-publishing-maven-plugin + + true + + + + + diff --git a/pom.xml b/pom.xml index 00c3931e8..3b9512737 100644 --- a/pom.xml +++ b/pom.xml @@ -59,7 +59,6 @@ HEAD - 17 8 @@ -258,7 +257,7 @@ org.sonatype.central central-publishing-maven-plugin - 0.8.0 + 0.10.0 true central @@ -844,6 +843,12 @@ test-jar + + + true + From a597a384ac57fa3213d59875ab651246529e3d20 Mon Sep 17 00:00:00 2001 From: Antony Stubbs Date: Thu, 23 Apr 2026 11:28:14 +1200 Subject: [PATCH 09/12] docs: rebrand audit (README template + pom) and landscape context Rebrand audit -- pom, README template, drop upstream-only content: - pom: developer id uses the astubbs GitHub handle; root-pom copyright header bumped to 2026 and now says "Confluent, Inc. and contributors" to match the source-file copyright policy. - README template: fork notice promoted to the top as an IMPORTANT callout, stale status claims and upstream-only marketing removed (Confluent Cloud walkthrough, CSID maintenance line, commercial-support note), copyright line clarified, CI badge uncommented and retargeted. - README template variables: :base_url: now points at the fork; :base_confluent_url: added for historical upstream references. - README template: drop the Performance Tests section and the performance build-script bullet. The CHANGELOG include stays. - RELEASE.adoc: deleted. Documented the upstream's Semaphore-based release flow; the fork's release path is the GitHub Actions + central-publishing-maven-plugin pipeline in .github/workflows/publish.yml. Landscape context for library consumers: - Add a "When to use this library (vs KIP-932 Share Groups)" section near the top so visitors can tell immediately whether Parallel Consumer or Share Groups fits their use case. Share Groups are now GA on Confluent Cloud and Confluent Platform 8.2 / Apache Kafka 4.2 and cover most "parallel consumers independent of partition count" use cases at the broker level. PC still wins when per-key ordering with concurrency beyond partition count matters -- the motivation the broker does not address. Co-Authored-By: Claude Opus 4.7 (1M context) --- README.adoc | 86 ++++++++++++++++++++++------------- RELEASE.adoc | 43 ------------------ src/docs/README_TEMPLATE.adoc | 66 ++++++++++++--------------- 3 files changed, 82 insertions(+), 113 deletions(-) delete mode 100644 RELEASE.adoc diff --git a/README.adoc b/README.adoc index 5a28f87e6..9c9b8b247 100644 --- a/README.adoc +++ b/README.adoc @@ -31,7 +31,8 @@ endif::[] :sectanchors: true :github_name: parallel-consumer -:base_url: https://github.com/confluentinc/{github_name} +:base_confluent_url: https://github.com/confluentinc/{github_name} +:base_url: https://github.com/astubbs/{github_name} :issues_link: {base_url}/issues @@ -45,24 +46,38 @@ endif::[] image:https://maven-badges.herokuapp.com/maven-central/io.github.astubbs.parallelconsumer/parallel-consumer-parent/badge.svg?style=flat[link=https://mvnrepository.com/artifact/io.github.astubbs.parallelconsumer/parallel-consumer-parent,Latest Parallel Consumer on Maven Central] -// Github actions disabled since codecov -//image:https://github.com/confluentinc/parallel-consumer/actions/workflows/maven.yml/badge.svg[Java 8 Unit Test GitHub] + -//^(^^full^ ^test^ ^suite^ ^currently^ ^running^ ^only^ ^on^ ^Confluent^ ^internal^ ^CI^ ^server^^)^ +image:https://github.com/astubbs/parallel-consumer/actions/workflows/maven.yml/badge.svg?branch=master[link=https://github.com/astubbs/parallel-consumer/actions/workflows/maven.yml,Build and Test] // travis badges temporarily disabled as travis isn't running CI currently //image:https://travis-ci.com/astubbs/parallel-consumer.svg?branch=master["Build Status", link="https://travis-ci.com/astubbs/parallel-consumer"] image:https://codecov.io/gh/astubbs/parallel-consumer/branch/master/graph/badge.svg["Coverage",https://codecov.io/gh/astubbs/parallel-consumer] Parallel Apache Kafka client wrapper with client side queueing, a simpler consumer/producer API with *key concurrency* and *extendable non-blocking IO* processing. -Confluent's https://www.confluent.io/confluent-accelerators/#parallel-consumer[product page for the project is here]. +IMPORTANT: This is a community-maintained fork of https://github.com/confluentinc/parallel-consumer[confluentinc/parallel-consumer], published under different Maven coordinates (`io.github.astubbs.parallelconsumer`). The original upstream project is no longer actively maintained. + +Confluent's https://www.confluent.io/confluent-accelerators/#parallel-consumer[product page for the original project is here]. TIP: If you like this project, please ⭐ Star it in GitHub to show your appreciation, help us gauge popularity of the project and allocate resources. -NOTE: This is not a part of the Confluent commercial support offering, except through consulting engagements. +NOTE: This is a community-maintained project with no commercial support. See the <> section for more information. -IMPORTANT: This project has been stable and reached its initial target feature set in Q1 2021. -It is actively maintained by the CSID team at Confluent. +[[when-to-use]] +== When to use this library (vs KIP-932 Share Groups) + +The Kafka landscape has shifted since this library's 2021 stable release. *KIP-932 Share Groups* is now GA on Confluent Cloud and ships with Confluent Platform 8.2 / Apache Kafka 4.2. It covers a large part of what people historically reached for Parallel Consumer to do, at the broker level rather than in a client wrapper. + +*Share Groups (broker-native):* many-to-many consumer↔partition mapping, per-message ack, broker-side delivery counts with poison-message protection, elastic scaling decoupled from partition count. Unordered queue semantics -- "RabbitMQ on Kafka". Already wrapped by Spring Kafka via `ShareConsumerFactory`. + +*Parallel Consumer (client-side):* keeps the partition model and adds *per-key parallelism* on top. Messages within a key stay ordered; different keys run concurrently; concurrency is independent of partition count. + +The two are not strict alternatives -- they solve different problems. + +[TIP] +==== +* If you want unordered queue semantics on Kafka 4.2+, reach for *Share Groups*. The "partitions are fixed, I need more consumers" motivation is now solved at the broker. +* If you need *key-level ordering with concurrency beyond partition count*, reach for *Parallel Consumer*. Nothing else does that cleanly today. +==== [[intro]] This library lets you process messages in parallel via a single Kafka Consumer meaning you can increase consumer parallelism without increasing the number of partitions in the topic you intend to process. @@ -283,7 +298,7 @@ The user just has to provide a function to extract from the message the HTTP cal === Illustrative Performance Example -.(see link:./parallel-consumer-core/src/test-integration/java/io/confluent/parallelconsumer/integrationTests/VolumeTests.java[VolumeTests.java]) +.(see link:./parallel-consumer-core/src/test-integration/java/io/confluent/parallelconsumer/integrationTests/VeryLargeMessageVolumeTest.java[VeryLargeMessageVolumeTest.java]) These performance comparison results below, even though are based on real performance measurement results, are for illustrative purposes. To see how the performance of the tool is related to instance counts, partition counts, key distribution and how it would relate to the vanilla client. Actual results will vary wildly depending upon the setup being deployed into. @@ -374,7 +389,7 @@ As an illustrative example of relative performance, given: == Support and Issues -If you encounter any issues, or have any suggestions or future requests, please create issues in the {issues_link}[github issue tracker]. +If you encounter any issues, or have any suggestions or future requests, please create issues in the {issues_link}[fork issue tracker]. Issues will be dealt with on a good faith, best efforts basis, by the small team maintaining this library. We also encourage participation, so if you have any feature ideas etc, please get in touch, and we will help you work on submitting a PR! @@ -382,18 +397,16 @@ We also encourage participation, so if you have any feature ideas etc, please ge NOTE: We are very interested to hear about your experiences! And please vote on your favourite issues! -If you have questions or find a bug, raise an https://github.com/astubbs/parallel-consumer/issues[issue] on GitHub. +If you have questions or find a bug, raise an {issues_link}[issue] on GitHub. == License -This library is copyright Confluent Inc, and licensed under the Apache License Version 2.0. +This library is copyright Confluent, Inc. and contributors, and licensed under the Apache License Version 2.0. == Usage === Maven -NOTE: This is a community-maintained fork of https://github.com/confluentinc/parallel-consumer[confluentinc/parallel-consumer], published under different Maven coordinates. The original upstream project is no longer actively maintained. - This project is available in Maven Central, https://repo1.maven.org/maven2/io/github/astubbs/parallelconsumer/[repo1]. Latest version can be seen https://search.maven.org/artifact/io.github.astubbs.parallelconsumer/parallel-consumer-core[here]. @@ -470,7 +483,7 @@ After this setup, one then has the choice of interfaces: * `JStreamVertxParallelStreamProcessor` There is another interface: `ParallelConsumer` which is integrated, however there is currently no immediate implementation. -See {issues_link}/12[issue #12], and the `ParallelConsumer` JavaDoc: +See {base_confluent_url}/issues/12[issue #12], and the `ParallelConsumer` JavaDoc: [source,java] ---- @@ -653,21 +666,6 @@ image::https://lucid.app/publicSegments/view/43f2740c-2a7f-4b7f-909e-434a5bbe3fb See the link:{project_root}/parallel-consumer-examples/parallel-consumer-example-streams/src/main/java/io/confluent/parallelconsumer/examples/streams/StreamsApp.java[Kafka Streams example] project, and it's test. -[[confluent-cloud]] -=== Confluent Cloud - -. Provision your fully managed Kafka cluster in Confluent Cloud -.. Sign up for https://www.confluent.io/confluent-cloud/tryfree/[Confluent Cloud], a fully-managed Apache Kafka service. -.. After you log in to Confluent Cloud, click on `Add cloud environment` and name the environment `learn-kafka`. -Using a new environment keeps your learning resources separate from your other Confluent Cloud resources. -.. Click on https://confluent.cloud/learn[LEARN] and follow the instructions to launch a Kafka cluster and to enable Schema Registry. -. Access the client configuration settings -.. From the Confluent Cloud Console, navigate to your Kafka cluster. -From the `Clients` view, get the connection information customized to your cluster (select `Java`). -.. Create new credentials for your Kafka cluster, and then Confluent Cloud will show a configuration block with your new credentials automatically populated (make sure `show API keys` is checked). -.. Use these settings presented to https://docs.confluent.io/clients-kafka-java/current/overview.html[configure your clients]. -. Use these clients for steps outlined in the <> section. - [[upgrading]] == Upgrading @@ -796,7 +794,7 @@ You can access the retry count of a record through it's wrapped `WorkContainer` .Example retry delay function implementing exponential backoff [source,java,indent=0] ---- - final double multiplier = 2.0; + final double multiplier = 0.5; final int baseDelaySecond = 1; ParallelConsumerOptions.builder() @@ -1351,13 +1349,19 @@ Helper scripts are in the `bin/` directory: Quick local build (compile + unit tests):: `bin/build.sh` +Unit tests only (no Docker needed):: +`bin/ci-unit-test.sh` + +Integration tests only (requires Docker for TestContainers):: +`bin/ci-integration-test.sh` + Full CI build with all tests (unit + integration):: `bin/ci-build.sh` CI build against a specific Kafka version:: `bin/ci-build.sh 3.9.1` -The GitHub Actions CI workflow uses `bin/ci-build.sh`, so running it locally reproduces the CI environment. +All `ci-*` scripts use the `-Pci` Maven profile which enables license checking and disables parallel test execution. The GitHub Actions CI workflow uses these scripts, so running them locally reproduces the CI environment. === Releasing @@ -1574,6 +1578,23 @@ ifndef::github_name[] toc::[] endif::[] +== 0.6.0.0 + +=== Breaking + +* Maven coordinates rebranded from `io.confluent.parallelconsumer` to `io.github.astubbs.parallelconsumer`. Update your dependency declarations. + +=== Fixes + +* fix: null epoch race condition in EpochAndRecordsMap - poll() returning records before onPartitionsAssigned() no longer causes NPE; records are safely skipped and re-delivered +* fix: test pollution from leaked threads and Awaitility global state across test classes + +=== Improvements + +* ci: modernized GitHub Actions CI with parallel test suites (unit, integration, performance), SpotBugs, duplicate detection, mutation testing (PIT), and dependency vulnerability scanning +* ci: added publish workflow for automated Maven Central deployment +* docs: updated README and added upstream PR analysis report + == 0.5.3.4 === Fixes @@ -2016,3 +2037,4 @@ without operational burden or harming the clusters performance ** Clean draining shutdown cycle //:leveloffset: -1 - Duplicate key leveloffset (attempted merging values +1 and -1): https://github.com/whelk-io/asciidoc-template-maven-plugin/issues/118 + diff --git a/RELEASE.adoc b/RELEASE.adoc deleted file mode 100644 index e6c5c6838..000000000 --- a/RELEASE.adoc +++ /dev/null @@ -1,43 +0,0 @@ -= Releasing - -Releases are automated through a Semaphore Task and are always built from latest commit of the master branch. - - -- Verify that Changelog.md and Readme.md are up-to-date with all the changes in the release. If not update them through normal PR process. -- Go to Tasks in Parallel Consumer project on Semaphore CI - https://semaphore.ci.confluent.io/projects/parallel-consumer/schedulers -- Click on Run Now for the Release task -- Enter Release Version (e.g. 0.5.3.3) -- Enter Next Development Version (e.g. 0.5.3.4-SNAPSHOT) -- Click Run the Task -- The task will create a Tag branch and corresponding Semaphore Job will start automatically -- Once the build job for the Tag is complete - go to the Semaphore CI page for that job and click Publish to Maven Central in the Promotions box. -- Wait for the promotion to complete - note that it may show as failed on Semaphore - but actually succeed - verify logs on Semaphore -- Wait for Sonatype to publish from it's staging area (~15 minutes) https://repo1.maven.org/maven2/io/confluent/parallelconsumer/parallel-consumer-parent/[repo1 link] -- Verify the release is available on Maven Central https://repo1.maven.org/maven2/io/confluent/parallelconsumer/parallel-consumer-parent/[repo1 link] -- Create the release on GH from the tag -- Paste in the details from the changelog, save, share as discussion -- Announce on slack (community #parallel-consumer and internal channels), mailing list, twitter - - -== Troubleshooting - -There seems to be a bug somewhere in Maven plugins or just timeout getting caught when deploying to sonatype - after actual deployment is done - logs like this: -``` -[INFO] * Upload of locally staged artifacts finished.04:49 -[INFO] * Closing staging repository with ID "ioconfluent".04:49 -Waiting for operation to complete...04:49 -..................................06:30 -[INFO] Remote staged 1 repositories, finished with success.06:30 -[INFO] Remote staging repositories are being released...06:30 -Waiting for operation to complete...06:30 -.......Oct 21, 2024 1:04:39 PM com.sun.jersey.api.client.ClientResponse getEntity07:22 -SEVERE: A message body reader for Java class com.sonatype.nexus.staging.api.dto.StagingProfileRepositoryDTO, and Java type class com.sonatype.nexus.staging.api.dto.StagingProfileRepositoryDTO, and MIME media type text/html was not found07:22 -Oct 21, 2024 1:04:39 PM com.sun.jersey.api.client.ClientResponse getEntity -``` -Shows up as promotion failed - but as you can see the actual deployment was successful. - -In case of actual upload failure / build failure of the tagged branch - the tag might need to be reverted - which is locked by devprod and github task has to be ran to revert the Tag. - -`delete-github-tag` in https://semaphore.ci.confluent.io/projects/github-tasks/schedulers. - -This will delete the tag (actually it will rename the tag to TAG-DELETED - i.e. 0.5.3.3-deleted) so that whatever build issue there is can be remedied (with revert / new PRs to Master branch) and when ready to do a release again - you can follow this process again. diff --git a/src/docs/README_TEMPLATE.adoc b/src/docs/README_TEMPLATE.adoc index bb43bc404..b29d833e8 100644 --- a/src/docs/README_TEMPLATE.adoc +++ b/src/docs/README_TEMPLATE.adoc @@ -31,7 +31,8 @@ endif::[] :sectanchors: true :github_name: parallel-consumer -:base_url: https://github.com/confluentinc/{github_name} +:base_confluent_url: https://github.com/confluentinc/{github_name} +:base_url: https://github.com/astubbs/{github_name} :issues_link: {base_url}/issues @@ -45,24 +46,38 @@ endif::[] image:https://maven-badges.herokuapp.com/maven-central/io.github.astubbs.parallelconsumer/parallel-consumer-parent/badge.svg?style=flat[link=https://mvnrepository.com/artifact/io.github.astubbs.parallelconsumer/parallel-consumer-parent,Latest Parallel Consumer on Maven Central] -// Github actions disabled since codecov -//image:https://github.com/confluentinc/parallel-consumer/actions/workflows/maven.yml/badge.svg[Java 8 Unit Test GitHub] + -//^(^^full^ ^test^ ^suite^ ^currently^ ^running^ ^only^ ^on^ ^Confluent^ ^internal^ ^CI^ ^server^^)^ +image:https://github.com/astubbs/parallel-consumer/actions/workflows/maven.yml/badge.svg?branch=master[link=https://github.com/astubbs/parallel-consumer/actions/workflows/maven.yml,Build and Test] // travis badges temporarily disabled as travis isn't running CI currently //image:https://travis-ci.com/astubbs/parallel-consumer.svg?branch=master["Build Status", link="https://travis-ci.com/astubbs/parallel-consumer"] image:https://codecov.io/gh/astubbs/parallel-consumer/branch/master/graph/badge.svg["Coverage",https://codecov.io/gh/astubbs/parallel-consumer] Parallel Apache Kafka client wrapper with client side queueing, a simpler consumer/producer API with *key concurrency* and *extendable non-blocking IO* processing. -Confluent's https://www.confluent.io/confluent-accelerators/#parallel-consumer[product page for the project is here]. +IMPORTANT: This is a community-maintained fork of https://github.com/confluentinc/parallel-consumer[confluentinc/parallel-consumer], published under different Maven coordinates (`io.github.astubbs.parallelconsumer`). The original upstream project is no longer actively maintained. + +Confluent's https://www.confluent.io/confluent-accelerators/#parallel-consumer[product page for the original project is here]. TIP: If you like this project, please ⭐ Star it in GitHub to show your appreciation, help us gauge popularity of the project and allocate resources. -NOTE: This is not a part of the Confluent commercial support offering, except through consulting engagements. +NOTE: This is a community-maintained project with no commercial support. See the <> section for more information. -IMPORTANT: This project has been stable and reached its initial target feature set in Q1 2021. -It is actively maintained by the CSID team at Confluent. +[[when-to-use]] +== When to use this library (vs KIP-932 Share Groups) + +The Kafka landscape has shifted since this library's 2021 stable release. *KIP-932 Share Groups* is now GA on Confluent Cloud and ships with Confluent Platform 8.2 / Apache Kafka 4.2. It covers a large part of what people historically reached for Parallel Consumer to do, at the broker level rather than in a client wrapper. + +*Share Groups (broker-native):* many-to-many consumer↔partition mapping, per-message ack, broker-side delivery counts with poison-message protection, elastic scaling decoupled from partition count. Unordered queue semantics -- "RabbitMQ on Kafka". Already wrapped by Spring Kafka via `ShareConsumerFactory`. + +*Parallel Consumer (client-side):* keeps the partition model and adds *per-key parallelism* on top. Messages within a key stay ordered; different keys run concurrently; concurrency is independent of partition count. + +The two are not strict alternatives -- they solve different problems. + +[TIP] +==== +* If you want unordered queue semantics on Kafka 4.2+, reach for *Share Groups*. The "partitions are fixed, I need more consumers" motivation is now solved at the broker. +* If you need *key-level ordering with concurrency beyond partition count*, reach for *Parallel Consumer*. Nothing else does that cleanly today. +==== [[intro]] This library lets you process messages in parallel via a single Kafka Consumer meaning you can increase consumer parallelism without increasing the number of partitions in the topic you intend to process. @@ -372,7 +387,7 @@ As an illustrative example of relative performance, given: == Support and Issues -If you encounter any issues, or have any suggestions or future requests, please create issues in the {issues_link}[github issue tracker]. +If you encounter any issues, or have any suggestions or future requests, please create issues in the {issues_link}[fork issue tracker]. Issues will be dealt with on a good faith, best efforts basis, by the small team maintaining this library. We also encourage participation, so if you have any feature ideas etc, please get in touch, and we will help you work on submitting a PR! @@ -380,18 +395,16 @@ We also encourage participation, so if you have any feature ideas etc, please ge NOTE: We are very interested to hear about your experiences! And please vote on your favourite issues! -If you have questions or find a bug, raise an https://github.com/astubbs/parallel-consumer/issues[issue] on GitHub. +If you have questions or find a bug, raise an {issues_link}[issue] on GitHub. == License -This library is copyright Confluent Inc, and licensed under the Apache License Version 2.0. +This library is copyright Confluent, Inc. and contributors, and licensed under the Apache License Version 2.0. == Usage === Maven -NOTE: This is a community-maintained fork of https://github.com/confluentinc/parallel-consumer[confluentinc/parallel-consumer], published under different Maven coordinates. The original upstream project is no longer actively maintained. - This project is available in Maven Central, https://repo1.maven.org/maven2/io/github/astubbs/parallelconsumer/[repo1]. Latest version can be seen https://search.maven.org/artifact/io.github.astubbs.parallelconsumer/parallel-consumer-core[here]. @@ -441,7 +454,7 @@ After this setup, one then has the choice of interfaces: * `JStreamVertxParallelStreamProcessor` There is another interface: `ParallelConsumer` which is integrated, however there is currently no immediate implementation. -See {issues_link}/12[issue #12], and the `ParallelConsumer` JavaDoc: +See {base_confluent_url}/issues/12[issue #12], and the `ParallelConsumer` JavaDoc: [source,java] ---- @@ -552,21 +565,6 @@ include::{project_root}/parallel-consumer-examples/parallel-consumer-example-str See the link:{project_root}/parallel-consumer-examples/parallel-consumer-example-streams/src/main/java/io/confluent/parallelconsumer/examples/streams/StreamsApp.java[Kafka Streams example] project, and it's test. -[[confluent-cloud]] -=== Confluent Cloud - -. Provision your fully managed Kafka cluster in Confluent Cloud -.. Sign up for https://www.confluent.io/confluent-cloud/tryfree/[Confluent Cloud], a fully-managed Apache Kafka service. -.. After you log in to Confluent Cloud, click on `Add cloud environment` and name the environment `learn-kafka`. -Using a new environment keeps your learning resources separate from your other Confluent Cloud resources. -.. Click on https://confluent.cloud/learn[LEARN] and follow the instructions to launch a Kafka cluster and to enable Schema Registry. -. Access the client configuration settings -.. From the Confluent Cloud Console, navigate to your Kafka cluster. -From the `Clients` view, get the connection information customized to your cluster (select `Java`). -.. Create new credentials for your Kafka cluster, and then Confluent Cloud will show a configuration block with your new credentials automatically populated (make sure `show API keys` is checked). -.. Use these settings presented to https://docs.confluent.io/clients-kafka-java/current/overview.html[configure your clients]. -. Use these clients for steps outlined in the <> section. - [[upgrading]] == Upgrading @@ -1114,17 +1112,8 @@ Full CI build with all tests (unit + integration):: CI build against a specific Kafka version:: `bin/ci-build.sh 3.9.1` -Performance test suite (also `bin/performance-test.cmd` on Windows):: -`bin/performance-test.sh` - All `ci-*` scripts use the `-Pci` Maven profile which enables license checking and disables parallel test execution. The GitHub Actions CI workflow uses these scripts, so running them locally reproduces the CI environment. -=== Performance Tests - -Tests tagged `@Tag("performance")` are excluded from the regular CI build because they need substantial hardware. They run on a dedicated self-hosted runner via `.github/workflows/performance.yml` (manual trigger or weekly schedule). - -To run the performance suite locally, use `bin/performance-test.sh`. To set up your own self-hosted runner for these tests, see link:./docs/SELF_HOSTED_RUNNER.md[docs/SELF_HOSTED_RUNNER.md]. - === Releasing The `pom.xml` version is the source of truth for publishing — there is no `maven-release-plugin` step. @@ -1327,3 +1316,4 @@ https://www.google.com/url?q=https://www.yourkit.com/.net/profiler/&source=gmail include::{project_root}/CHANGELOG.adoc[] //:leveloffset: -1 - Duplicate key leveloffset (attempted merging values +1 and -1): https://github.com/whelk-io/asciidoc-template-maven-plugin/issues/118 + From 947fb427611f5c3c893a4cfe607d38846eb2a41b Mon Sep 17 00:00:00 2001 From: Antony Stubbs Date: Thu, 23 Apr 2026 13:02:32 +1200 Subject: [PATCH 10/12] test(#857): scaffolding for silent-stall reproduction Test infrastructure for reproducing the multi-consumer silent stall after rebalance investigated in issue #857. - MultiInstanceRebalanceTest: re-enabled the largeNumberOfInstances test, added a cooperative-assignor variant, tagged as performance, added a gentleChaosRebalance variant that reliably reproduces the stall - ManagedPCInstance: extracted into src/test-integration/java/.../utils so multiple tests can reuse the lifecycle harness (fresh Kafka container per test, thread-confinement checks, deterministic teardown) - ManagedPCInstanceLifecycleTest: dedicated lifecycle/teardown coverage for the new ManagedPCInstance harness - ShardManagerStaleContainerTest: reproducing the stale-container issue at the same offset under rebalance - ArchitectureTest: ArchUnit rule enforcing no raw Consumer field on any class that should use ThreadConfinedConsumer - ModelUtils: test helpers updated for the new harness - BrokerIntegrationTest: settle-time + consumer-close hooks used by the chaos-monkey tests - src/test-integration/resources/logback-test.xml: per-test-logging config for the diagnostic sessions Squashed from 35 commits of chronological investigation. Original per-commit history preserved at backup/pre-rebase-bugs-857. Co-Authored-By: Claude Opus 4.7 (1M context) --- .../BrokerIntegrationTest.java | 27 +- .../ManagedPCInstanceLifecycleTest.java | 83 +++++ .../MultiInstanceRebalanceTest.java | 333 ++++++++++-------- .../utils/ManagedPCInstance.java | 247 +++++++++++++ .../resources/logback-test.xml | 39 ++ .../parallelconsumer/ArchitectureTest.java | 96 +++++ .../parallelconsumer/state/ModelUtils.java | 8 +- .../state/ShardManagerStaleContainerTest.java | 194 ++++++++++ 8 files changed, 871 insertions(+), 156 deletions(-) create mode 100644 parallel-consumer-core/src/test-integration/java/io/confluent/parallelconsumer/integrationTests/ManagedPCInstanceLifecycleTest.java create mode 100644 parallel-consumer-core/src/test-integration/java/io/confluent/parallelconsumer/integrationTests/utils/ManagedPCInstance.java create mode 100644 parallel-consumer-core/src/test-integration/resources/logback-test.xml create mode 100644 parallel-consumer-core/src/test/java/io/confluent/parallelconsumer/ArchitectureTest.java create mode 100644 parallel-consumer-core/src/test/java/io/confluent/parallelconsumer/state/ShardManagerStaleContainerTest.java diff --git a/parallel-consumer-core/src/test-integration/java/io/confluent/parallelconsumer/integrationTests/BrokerIntegrationTest.java b/parallel-consumer-core/src/test-integration/java/io/confluent/parallelconsumer/integrationTests/BrokerIntegrationTest.java index c04b2e11a..700cc4cc6 100644 --- a/parallel-consumer-core/src/test-integration/java/io/confluent/parallelconsumer/integrationTests/BrokerIntegrationTest.java +++ b/parallel-consumer-core/src/test-integration/java/io/confluent/parallelconsumer/integrationTests/BrokerIntegrationTest.java @@ -110,8 +110,33 @@ public static KafkaContainer createKafkaContainer(String logSegmentSize) { kafkaContainer.start(); } + /** + * Stop the current Kafka container and start a fresh one. Use this before performance/chaos + * tests to avoid stale topics, consumer groups, and broker metadata from previous runs + * causing timeouts or unpredictable behaviour. + *

+ * After calling this, any new test instances will pick up the fresh container via the + * static field. Existing KafkaClientUtils references become stale and must be recreated. + */ + /** + * Stop the current Kafka container and start a fresh one. Recreates KafkaClientUtils + * to point to the new container. Call before performance/chaos tests. + */ + protected void resetKafkaContainer() { + log.info("Resetting Kafka container for clean state..."); + if (kcu != null) { + kcu.close(); + } + kafkaContainer.stop(); + kafkaContainer = createKafkaContainer(null); + kafkaContainer.start(); + kcu = new KafkaClientUtils(kafkaContainer); + kcu.open(); + log.info("Fresh Kafka container started at {}", kafkaContainer.getBootstrapServers()); + } + @Getter(AccessLevel.PROTECTED) - private final KafkaClientUtils kcu = new KafkaClientUtils(kafkaContainer); + private KafkaClientUtils kcu = new KafkaClientUtils(kafkaContainer); @BeforeAll static void followKafkaLogs() { diff --git a/parallel-consumer-core/src/test-integration/java/io/confluent/parallelconsumer/integrationTests/ManagedPCInstanceLifecycleTest.java b/parallel-consumer-core/src/test-integration/java/io/confluent/parallelconsumer/integrationTests/ManagedPCInstanceLifecycleTest.java new file mode 100644 index 000000000..33fd3549d --- /dev/null +++ b/parallel-consumer-core/src/test-integration/java/io/confluent/parallelconsumer/integrationTests/ManagedPCInstanceLifecycleTest.java @@ -0,0 +1,83 @@ +package io.confluent.parallelconsumer.integrationTests; + +/*- + * Copyright (C) 2020-2026 Confluent, Inc. and contributors + */ + +import io.confluent.parallelconsumer.ParallelConsumerOptions.CommitMode; +import io.confluent.parallelconsumer.ParallelConsumerOptions.ProcessingOrder; +import io.confluent.parallelconsumer.integrationTests.utils.ManagedPCInstance; +import lombok.extern.slf4j.Slf4j; +import org.junit.jupiter.api.RepeatedTest; +import org.junit.jupiter.api.Test; + +import java.util.concurrent.ExecutorService; +import java.util.concurrent.Executors; +import java.util.concurrent.TimeUnit; +import java.util.concurrent.atomic.AtomicInteger; + +import static org.assertj.core.api.Assertions.assertThat; + +/** + * Focused lifecycle test for {@link ManagedPCInstance} — verifies that rapid stop/start + * cycles don't create duplicate PC instances or cause ConcurrentModificationException. + *

+ * This is a targeted regression test for the double-submission bug found during the + * #857 investigation. + */ +@Slf4j +class ManagedPCInstanceLifecycleTest extends BrokerIntegrationTest { + + /** + * Rapidly toggle a single instance stop→start multiple times. + * If the started flag isn't set correctly, run() will be submitted multiple times, + * creating duplicate PCs in the same consumer group → ConcurrentModificationException. + */ + @RepeatedTest(5) + void rapidToggleShouldNotCreateDuplicateInstances() throws Exception { + numPartitions = 4; + String inputName = setupTopic("lifecycle-test"); + + ExecutorService executor = Executors.newCachedThreadPool(); + AtomicInteger consumeCount = new AtomicInteger(); + + ManagedPCInstance.Config config = ManagedPCInstance.Config.builder() + .commitMode(CommitMode.PERIODIC_CONSUMER_ASYNCHRONOUS) + .order(ProcessingOrder.UNORDERED) + .inputTopic(inputName) + .build(); + + ManagedPCInstance instance = new ManagedPCInstance(config, getKcu(), key -> consumeCount.incrementAndGet()); + + // Start the instance + executor.submit(instance); + Thread.sleep(2000); // let it start and join the group + + // Rapid toggle cycles — the bug: toggle() calls start() which submits run() + // before the previous run() has set started=true, causing double-submission + for (int i = 0; i < 10; i++) { + log.info("Toggle cycle {}", i); + instance.toggle(executor); + Thread.sleep(100); // very short — enough for stop, not enough for run() to start + instance.toggle(executor); + } + + // Let it settle + Thread.sleep(3000); + + // The instance should still be functional — produce and consume a message + getKcu().produceMessages(inputName, 10); + Thread.sleep(5000); + + instance.stop(); + executor.shutdown(); + executor.awaitTermination(30, TimeUnit.SECONDS); + + // If duplicate PCs were created, we'd see ConcurrentModificationException in the logs + // and the PC would be dead. Check that it actually consumed something. + log.info("Consumed {} messages after rapid toggles", consumeCount.get()); + assertThat(consumeCount.get()) + .as("Should have consumed messages — if 0, the PC died from CME during rapid toggles") + .isGreaterThan(0); + } +} diff --git a/parallel-consumer-core/src/test-integration/java/io/confluent/parallelconsumer/integrationTests/MultiInstanceRebalanceTest.java b/parallel-consumer-core/src/test-integration/java/io/confluent/parallelconsumer/integrationTests/MultiInstanceRebalanceTest.java index 80f2c90fa..ebd382bfb 100644 --- a/parallel-consumer-core/src/test-integration/java/io/confluent/parallelconsumer/integrationTests/MultiInstanceRebalanceTest.java +++ b/parallel-consumer-core/src/test-integration/java/io/confluent/parallelconsumer/integrationTests/MultiInstanceRebalanceTest.java @@ -1,24 +1,19 @@ package io.confluent.parallelconsumer.integrationTests; /*- - * Copyright (C) 2020-2022 Confluent, Inc. + * Copyright (C) 2020-2026 Confluent, Inc. and contributors */ import io.confluent.csid.utils.ProgressBarUtils; import io.confluent.csid.utils.ProgressTracker; import io.confluent.csid.utils.TrimListRepresentation; -import io.confluent.parallelconsumer.ParallelConsumerOptions; import io.confluent.parallelconsumer.ParallelConsumerOptions.CommitMode; import io.confluent.parallelconsumer.ParallelConsumerOptions.ProcessingOrder; -import io.confluent.parallelconsumer.ParallelEoSStreamProcessor; -import lombok.Getter; +import io.confluent.parallelconsumer.integrationTests.utils.ManagedPCInstance; import lombok.SneakyThrows; -import lombok.ToString; import lombok.extern.slf4j.Slf4j; import me.tongfei.progressbar.ProgressBar; import org.apache.commons.lang3.RandomUtils; -import org.apache.kafka.clients.consumer.ConsumerConfig; -import org.apache.kafka.clients.consumer.KafkaConsumer; import org.apache.kafka.clients.producer.Producer; import org.apache.kafka.clients.producer.ProducerRecord; import org.apache.kafka.clients.producer.RecordMetadata; @@ -27,7 +22,7 @@ import org.assertj.core.internal.StandardComparisonStrategy; import org.awaitility.Awaitility; import org.awaitility.core.TerminalFailureException; -import org.junit.jupiter.api.Disabled; +import org.junit.jupiter.api.Tag; import org.junit.jupiter.api.Test; import org.junit.jupiter.params.ParameterizedTest; import org.junit.jupiter.params.provider.EnumSource; @@ -48,7 +43,6 @@ import static org.assertj.core.api.Assertions.assertThat; import static org.assertj.core.util.IterableUtil.toCollection; import static org.awaitility.Awaitility.waitAtMost; -import static pl.tlinkowski.unij.api.UniLists.of; /** * Test running with multiple instances of parallel-consumer consuming from topic with two partitions. @@ -58,8 +52,11 @@ public class MultiInstanceRebalanceTest extends BrokerIntegrationTest { static final int DEFAULT_MAX_POLL = 500; - public static final int CHAOS_FREQUENCY = 500; + public static final int DEFAULT_CHAOS_FREQUENCY = 500; public static final int DEFAULT_POLL_DELAY = 150; + + /** Per-test override for chaos frequency (ms). Higher = gentler chaos. */ + int chaosFrequency = DEFAULT_CHAOS_FREQUENCY; AtomicInteger count = new AtomicInteger(); static { @@ -73,7 +70,7 @@ void consumeWithMultipleInstancesPeriodicConsumerSync(ProcessingOrder order) { int expectedMessageCount = (order == PARTITION) ? 100 : 1000; int numberOfPcsToRun = 2; runTest(DEFAULT_MAX_POLL, CommitMode.PERIODIC_CONSUMER_SYNC, order, expectedMessageCount, - numberOfPcsToRun, 1.0, DEFAULT_POLL_DELAY); + numberOfPcsToRun, 1.0, DEFAULT_POLL_DELAY, false); } @ParameterizedTest @@ -82,24 +79,99 @@ void consumeWithMultipleInstancesPeriodicConsumerAsynchronous(ProcessingOrder or numPartitions = 2; int expectedMessageCount = (order == PARTITION) ? 100 : 1000; runTest(DEFAULT_MAX_POLL, CommitMode.PERIODIC_CONSUMER_ASYNCHRONOUS, order, expectedMessageCount, - 2, 1.0, DEFAULT_POLL_DELAY); + 2, 1.0, DEFAULT_POLL_DELAY, false); } /** - * Tests with very large numbers of parallel consumer instances to try to reproduce state and concurrency issues - * (#188, #189). + * Stress test: 12 PC instances on 80 partitions with aggressive chaos monkey toggling up to + * 6 of 11 secondary instances every 0-500ms. PC-0 is never toggled and should always be alive. + *

+ * Originally created for #188/#189, re-enabled for #857 investigation. + *

+ * What the test does: + *

    + *
  1. Pre-produces 30% of 500k messages, starts PC-0, waits for it to consume
  2. + *
  3. Starts 11 more PCs + a background producer for the remaining 70%
  4. + *
  5. Chaos monkey continuously toggles (stop/start) random secondary instances
  6. + *
  7. Waits up to 5 minutes for ALL 500k keys to be consumed by any instance
  8. + *
  9. Fails if no progress is made for 11 consecutive 1-second checks
  10. + *
+ *

+ * Acceptance: 80%+ pass rate (currently ~90%). This test deliberately pushes the + * Kafka consumer group rebalance protocol to its limits. The remaining ~10% failure occurs + * when rapid membership changes prevent the group coordinator from completing partition + * assignment (consumers show assignedPartitions=0). This is documented Kafka behaviour + * under extreme churn, not a PC bug — all PC-internal issues have been fixed. + * If the pass rate drops below 80%, reassess: the test parameters may need backing off, + * or a new PC bug may have been introduced. *

- * This test takes some time, but seems required in order to expose some race conditions without syntehticly - * creatign them. + * Fixes applied (from #857 investigation): + *

    + *
  • commitCommand deadlock — ReentrantLock.tryLock() in onPartitionsRevoked
  • + *
  • Non-blocking stopAsync() in chaos monkey — prevents 30-40s close() freeze
  • + *
  • ThreadConfinedConsumer wrapper — runtime thread-safety enforcement
  • + *
  • Raw consumer field removed from PC — all access via ConsumerManager/DI
  • + *
  • ArchUnit rules — compile-time consumer field isolation
  • + *
  • Multiple defensive fixes (counter adjustment, throttle reset, lifecycle guard)
  • + *
+ *

+ * For the full investigation history, see branch {@code bugs/857-paused-consumption-multi-consumers-bug} + * and {@code docs/BUG_857_INVESTIGATION.md}. + * + * @see #857 */ - @Disabled + @Tag("performance") @Test void largeNumberOfInstances() { + numPartitions = 80; int numberOfPcsToRun = 12; int expectedMessageCount = 500000; + // Use CooperativeStickyAssignor — under the eager (Range) protocol, rapid membership + // changes restart the JoinGroup phase from scratch, leaving all consumers with + // assignment=[] indefinitely. Cooperative rebalancing lets consumers keep their + // existing assignments during rebalance. See #857 investigation. runTest(DEFAULT_MAX_POLL, CommitMode.PERIODIC_CONSUMER_ASYNCHRONOUS, ProcessingOrder.UNORDERED, expectedMessageCount, - numberOfPcsToRun, 0.3, 1); + numberOfPcsToRun, 0.3, 1, true); + } + + /** + * Variant of {@link #largeNumberOfInstances()} using CooperativeStickyAssignor, which is the assignor + * that issue #857 reporters say makes the bug more visible. Cooperative rebalancing revokes and assigns + * partitions in separate callbacks, creating a wider window for stale container races. + *

+ * Uses parameters closer to the production environments reported in #857: 30 partitions, 4 consumers. + */ + @Tag("performance") + @Test + void cooperativeStickyRebalanceShouldNotStall() { + + numPartitions = 30; + int numberOfPcsToRun = 4; + int expectedMessageCount = 100_000; + chaosFrequency = 3000; // gentle chaos — let group settle between rebalances + runTest(DEFAULT_MAX_POLL, CommitMode.PERIODIC_CONSUMER_ASYNCHRONOUS, ProcessingOrder.UNORDERED, + expectedMessageCount, numberOfPcsToRun, 0.3, 1, true); + } + + /** + * Gentler version of {@link #largeNumberOfInstances()} — toggles only 1 instance at a time with a 3-second + * cooldown between rounds. This lets the consumer group settle between rebalances, isolating any PC-internal + * bugs from the rebalance storm effect seen in the aggressive test. + *

+ * If this test passes but {@link #largeNumberOfInstances()} fails, the issue is rebalance storm tolerance, + * not a PC state management bug. + */ + @Tag("performance") + @Test + void gentleChaosRebalance() { + + numPartitions = 30; + int numberOfPcsToRun = 6; + int expectedMessageCount = 200_000; + chaosFrequency = 3000; // 3 seconds between chaos rounds — lets the group settle + runTest(DEFAULT_MAX_POLL, CommitMode.PERIODIC_CONSUMER_ASYNCHRONOUS, ProcessingOrder.UNORDERED, + expectedMessageCount, numberOfPcsToRun, 0.5, 1, false); } ProgressBar overallProgress; @@ -107,7 +179,8 @@ void largeNumberOfInstances() { @SneakyThrows private void runTest(int maxPoll, CommitMode commitMode, ProcessingOrder order, int expectedMessageCount, - int numberOfPcsToRun, double fractionOfMessagesToPreProduce, int pollDelayMs) { + int numberOfPcsToRun, double fractionOfMessagesToPreProduce, int pollDelayMs, + boolean useCooperativeAssignor) { String inputName = setupTopic(this.getClass().getSimpleName() + "-input-" + RandomUtils.nextInt()); overallProgress = ProgressBarUtils.getNewMessagesBar("overall", log, expectedMessageCount); @@ -116,6 +189,15 @@ private void runTest(int maxPoll, CommitMode commitMode, ProcessingOrder order, var sendingProgress = ProgressBarUtils.getNewMessagesBar("sending", log, expectedMessageCount); + ManagedPCInstance.Config pcConfig = ManagedPCInstance.Config.builder() + .maxPoll(maxPoll) + .commitMode(commitMode) + .order(order) + .inputTopic(inputName) + .pollDelayMs(pollDelayMs) + .useCooperativeAssignor(useCooperativeAssignor) + .build(); + // pre-produce messages to input-topic Set expectedKeys = new ConcurrentSkipListSet<>(); log.info("Producing {} messages before starting test", expectedMessageCount); @@ -145,8 +227,11 @@ private void runTest(int maxPoll, CommitMode commitMode, ProcessingOrder order, // Submit first parallel-consumer log.info("Running first instance of pc"); - int expectedMessageCountPerPC = expectedMessageCount / numberOfPcsToRun; - ParallelConsumerRunnable pc1 = new ParallelConsumerRunnable(maxPoll, commitMode, order, inputName, expectedMessageCountPerPC, pollDelayMs); + ManagedPCInstance pc1 = new ManagedPCInstance(pcConfig, getKcu(), key -> { + count.incrementAndGet(); + overallProgress.step(); + overallConsumedKeys.add(key); + }); pcExecutor.submit(pc1); // Wait for first consumer to consume messages, also effectively waits for the group.initial.rebalance.delay.ms (3s by default) @@ -192,16 +277,18 @@ public void run() { log.error(e.getMessage(), e); } log.info("Running pc instance {}", value); - ParallelConsumerRunnable instance = new ParallelConsumerRunnable(maxPoll, commitMode, order, inputName, expectedMessageCountPerPC, pollDelayMs); + ManagedPCInstance instance = new ManagedPCInstance(pcConfig, getKcu(), key -> { + count.incrementAndGet(); + overallProgress.step(); + overallConsumedKeys.add(key); + }); pcExecutor.submit(instance); return instance; } ).collect(Collectors.toList())); - final List allPCRunners = Collections.synchronizedList(new ArrayList<>()); + final List allPCRunners = Collections.synchronizedList(new ArrayList<>()); allPCRunners.add(pc1); allPCRunners.addAll(secondaryPcs); - final ParallelConsumerRunnable[] parallelConsumerRunnablesArray = allPCRunners.toArray(new ParallelConsumerRunnable[0]); - // Randomly start and stop PCs var chaosMonkey = new Runnable() { @@ -209,7 +296,7 @@ public void run() { public void run() { try { while (noneHaveFailed(allPCRunners)) { - Thread.sleep((int) (CHAOS_FREQUENCY * Math.random())); + Thread.sleep((int) (chaosFrequency * Math.random())); boolean makeChaos = Math.random() > 0.2; // small chance it will let the test do a run without chaos // boolean makeChaos = true; if (makeChaos) { @@ -219,8 +306,8 @@ public void run() { log.info("Will mess with {} instances", numberToMessWith); IntStream.range(0, numberToMessWith).forEach(value -> { int instanceToGet = (int) ((size - 1) * Math.random()); - ParallelConsumerRunnable victim = secondaryPcs.get(instanceToGet); - log.info("Victim is instance: " + victim.instanceId); + ManagedPCInstance victim = secondaryPcs.get(instanceToGet); + log.info("Victim is instance: " + victim.getInstanceId()); victim.toggle(pcExecutor); }); } @@ -249,9 +336,11 @@ public void run() { .alias(failureMessage) .pollInterval(1, SECONDS) .untilAsserted(() -> { - log.trace("Processed-count: {}", getAllConsumedKeys(parallelConsumerRunnablesArray).size()); + log.trace("Processed-count: {}", getAllConsumedKeys(allPCRunners).size()); if (progressTracker.hasProgressNotBeenMade()) { - expectedKeys.removeAll(getAllConsumedKeys(parallelConsumerRunnablesArray)); + // Dump full state of every PC instance to diagnose the stall + dumpInstanceState(allPCRunners); + expectedKeys.removeAll(getAllConsumedKeys(allPCRunners)); throw progressTracker.constructError(msg("No progress, missing keys: {}.", expectedKeys)); } SoftAssertions all = new SoftAssertions(); @@ -279,17 +368,17 @@ public void run() { sendingProgress.close(); } - allPCRunners.forEach(ParallelConsumerRunnable::close); + allPCRunners.forEach(ManagedPCInstance::close); - assertThat(pc1.consumedKeys).hasSizeGreaterThan(0); - assertThat(getAllConsumedKeys(secondaryPcs.toArray(new ParallelConsumerRunnable[0]))) + assertThat(pc1.getConsumedKeys()).hasSizeGreaterThan(0); + assertThat(getAllConsumedKeys(secondaryPcs)) .as("Second PC should have taken over some of the work and consumed some records") .hasSizeGreaterThan(0); pcExecutor.shutdown(); Collection duplicates = toCollection(StandardComparisonStrategy.instance() - .duplicatesFrom(getAllConsumedKeys(parallelConsumerRunnablesArray))); + .duplicatesFrom(getAllConsumedKeys(allPCRunners))); log.info("Duplicate consumed keys (at least one is expected due to the rebalance): {}", duplicates); double percentageDuplicateTolerance = 0.2; assertThat(duplicates) @@ -299,135 +388,71 @@ public void run() { } - private boolean noneHaveFailed(List secondaryPcs) { - return checkForFailure(secondaryPcs).isEmpty(); + /** + * Dump the internal state of every PC instance when a stall is detected. + * This tells us exactly what each component thinks is happening: + * - Is the PC alive or dead? + * - How many records are queued in shards vs out for processing? + * - What's the partition assignment? + * - Is the consumer paused? + * - What does the WorkManager think about incomplete offsets? + */ + private void dumpInstanceState(List instances) { + log.error("=== STALL DETECTED — dumping all instance state ==="); + for (var instance : instances) { + var pc = instance.getParallelConsumer(); + if (pc == null) { + log.error(" Instance {}: PC is null (never started?), started={}", instance.getInstanceId(), instance.isStarted()); + continue; + } + try { + var wm = pc.getWm(); + // Check if the shard manager has any processing shards at all + var sm = wm.getSm(); + long totalWorkTracked = sm.getNumberOfWorkQueuedInShardsAwaitingSelection(); + boolean hasIncompletes = wm.hasIncompleteOffsets(); + + log.error(" Instance {}: closed/failed={}, failureCause={}, started={}, " + + "assignedPartitions={}, queuedInShards={}, outForProcessing={}, " + + "incompleteOffsets={}, hasIncompletes={}, " + + "pausedPartitions={}, consumedKeys={}", + instance.getInstanceId(), + pc.isClosedOrFailed(), + pc.getFailureCause() != null ? pc.getFailureCause().getMessage() : "none", + instance.isStarted(), + pc.getAssignmentSize(), + totalWorkTracked, + wm.getNumberRecordsOutForProcessing(), + wm.getNumberOfIncompleteOffsets(), + hasIncompletes, + pc.getPausedPartitionSize(), + instance.getConsumedKeys().size() + ); + } catch (Exception e) { + log.error(" Instance {}: error dumping state: {}", instance.getInstanceId(), e.getMessage(), e); + } + } + log.error("=== END STATE DUMP ==="); + } + + private boolean noneHaveFailed(List pcs) { + return checkForFailure(pcs).isEmpty(); } - private List checkForFailure(List secondaryPcs) { - return secondaryPcs.stream().filter(pcr -> { - var pc = pcr.getParallelConsumer(); + private List checkForFailure(List pcs) { + return pcs.stream().filter(instance -> { + var pc = instance.getParallelConsumer(); if (pc == null) return false; // hasn't started if (!pc.isClosedOrFailed()) return false; // still open boolean failed = pc.getFailureCause() != null; // actually failed return failed; - }).map(pc -> pc.getParallelConsumer().getFailureCause()).collect(Collectors.toList()); + }).map(instance -> instance.getParallelConsumer().getFailureCause()).collect(Collectors.toList()); } - List getAllConsumedKeys(ParallelConsumerRunnable... instances) { - return Arrays.stream(instances) - .flatMap(parallelConsumerRunnable -> parallelConsumerRunnable.consumedKeys.stream()) + List getAllConsumedKeys(List instances) { + return instances.stream() + .flatMap(instance -> instance.getConsumedKeys().stream()) .collect(Collectors.toList()); } - int pcInstanceCount = 0; - - @Getter - @ToString - public class ParallelConsumerRunnable implements Runnable { - - private final int instanceId; - - private final int maxPoll; - private final CommitMode commitMode; - private final ProcessingOrder order; - private final String inputTopic; - private final int expectedMessageCount; - private final ProgressBar bar; - private final int pollDelayMs; - private ParallelEoSStreamProcessor parallelConsumer; - private boolean started = false; - - @ToString.Exclude - private final Queue consumedKeys = new ConcurrentLinkedQueue<>(); - - public ParallelConsumerRunnable(int maxPoll, CommitMode commitMode, ProcessingOrder order, String inputTopic, int expectedMessageCount, int pollDelayMs) { - this.maxPoll = maxPoll; - this.commitMode = commitMode; - this.order = order; - this.inputTopic = inputTopic; - this.expectedMessageCount = expectedMessageCount; - this.pollDelayMs = pollDelayMs; - - instanceId = pcInstanceCount; - pcInstanceCount++; - - bar = ProgressBarUtils.getNewMessagesBar("PC" + instanceId, log, expectedMessageCount); - } - - @Override - public void run() { - MDC.put(MDC_INSTANCE_ID, "Runner-" + instanceId); - - started = true; - log.info("Running consumer!"); - - Properties consumerProps = new Properties(); - consumerProps.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, maxPoll); - KafkaConsumer newConsumer = getKcu().createNewConsumer(false, consumerProps); - - this.parallelConsumer = new ParallelEoSStreamProcessor<>(ParallelConsumerOptions.builder() - .ordering(order) - .consumer(newConsumer) - .commitMode(commitMode) - .maxConcurrency(10) - .build()); - - - // test was written with 1-second cycles in mind - in terms of expected progression - this.parallelConsumer.setTimeBetweenCommits(ofSeconds(1)); - - - parallelConsumer.setMyId(Optional.of("PC-" + instanceId)); - - parallelConsumer.subscribe(of(inputTopic)); - - parallelConsumer.poll(record -> { - // simulate work - try { - Thread.sleep(pollDelayMs); - } catch (InterruptedException e) { - // ignore - } - count.incrementAndGet(); - this.bar.step(); - overallProgress.step(); - consumedKeys.add(record.key()); - overallConsumedKeys.add(record.key()); - } - ); - } - - public void stop() { - log.info("Stopping {}", this.instanceId); - started = false; - parallelConsumer.close(); - } - - public void start(ExecutorService pcExecutor) { - // strange structure for debugging - Exception failureCause = getParallelConsumer().getFailureCause(); - if (failureCause != null) { - throw new RuntimeException("Error starting PC, pc died from previous error: " + failureCause.getMessage(), failureCause); - } - - log.info("Starting {}", this); - pcExecutor.submit(this); - } - - public void close() { - log.info("Stopping {}", this); - stop(); - bar.close(); - } - - public void toggle(ExecutorService pcExecutor) { - if (started) { - stop(); - } else { - start(pcExecutor); - } - } - } - - } diff --git a/parallel-consumer-core/src/test-integration/java/io/confluent/parallelconsumer/integrationTests/utils/ManagedPCInstance.java b/parallel-consumer-core/src/test-integration/java/io/confluent/parallelconsumer/integrationTests/utils/ManagedPCInstance.java new file mode 100644 index 000000000..c313578d0 --- /dev/null +++ b/parallel-consumer-core/src/test-integration/java/io/confluent/parallelconsumer/integrationTests/utils/ManagedPCInstance.java @@ -0,0 +1,247 @@ +package io.confluent.parallelconsumer.integrationTests.utils; + +/*- + * Copyright (C) 2020-2026 Confluent, Inc. and contributors + */ + +import io.confluent.parallelconsumer.ParallelConsumerOptions; +import io.confluent.parallelconsumer.ParallelConsumerOptions.CommitMode; +import io.confluent.parallelconsumer.ParallelConsumerOptions.ProcessingOrder; +import io.confluent.parallelconsumer.ParallelEoSStreamProcessor; +import lombok.Builder; +import lombok.Getter; +import lombok.ToString; +import lombok.extern.slf4j.Slf4j; +import org.apache.kafka.clients.consumer.ConsumerConfig; +import org.apache.kafka.clients.consumer.KafkaConsumer; +import org.apache.kafka.common.errors.DisconnectException; +import org.apache.kafka.common.errors.WakeupException; + +import java.nio.channels.ClosedChannelException; +import java.time.Duration; +import java.util.Optional; +import java.util.Properties; +import java.util.Queue; +import java.util.concurrent.ConcurrentLinkedQueue; +import java.util.concurrent.ExecutorService; +import java.util.concurrent.TimeoutException; +import java.util.concurrent.atomic.AtomicInteger; +import java.util.function.Consumer; + +import static io.confluent.parallelconsumer.internal.AbstractParallelEoSStreamProcessor.MDC_INSTANCE_ID; +import static pl.tlinkowski.unij.api.UniLists.of; + +/** + * Manages the lifecycle of a {@link ParallelEoSStreamProcessor} instance in multi-instance + * integration tests. Handles creation, start, stop, toggle (for chaos monkey), and restart + * with proper exception classification. + *

+ * Each call to {@link #run()} creates a fresh PC + consumer, so restarts don't carry over + * stale state from the previous instance. This simulates what a real supervisor would do + * (start a new process). + *

+ * On restart, checks the previous PC's failure cause: + *

    + *
  • Expected close exceptions (see {@link #isExpectedCloseException}) → logged at WARN, restart allowed
  • + *
  • Unexpected exceptions → thrown as RuntimeException (fails the test — acts as a canary for real bugs)
  • + *
+ * + * @see io.confluent.parallelconsumer.integrationTests.MultiInstanceRebalanceTest + */ +@Slf4j +@Getter +@ToString +public class ManagedPCInstance implements Runnable { + + private static final AtomicInteger ID_GENERATOR = new AtomicInteger(); + + private final int instanceId; + private final Config config; + private final KafkaClientUtils kcu; + + @Getter + private volatile ParallelEoSStreamProcessor parallelConsumer; + @Getter + private volatile boolean started = false; + + @ToString.Exclude + private final Queue consumedKeys = new ConcurrentLinkedQueue<>(); + + /** Callback invoked for each consumed record — lets the test track overall progress */ + @ToString.Exclude + private final Consumer onConsumed; + + public ManagedPCInstance(Config config, KafkaClientUtils kcu, Consumer onConsumed) { + this.config = config; + this.kcu = kcu; + this.onConsumed = onConsumed; + this.instanceId = ID_GENERATOR.getAndIncrement(); + } + + @Override + public void run() { + org.slf4j.MDC.put(MDC_INSTANCE_ID, "Runner-" + instanceId); + + // Wait for the previous PC to fully close — including its internal threads finishing + // and the KafkaConsumer being closed on the poll thread. PC.close() blocks until + // the control thread finishes, which waits for the poll thread (brokerPollSubsystem + // .closeAndWait), which closes the consumer. So by the time isClosedOrFailed() returns + // true, the consumer should be fully closed and deregistered from the group. + // See #857. + if (parallelConsumer != null) { + int waitMs = 0; + while (!parallelConsumer.isClosedOrFailed() && waitMs < 10_000) { + try { + Thread.sleep(100); + waitMs += 100; + } catch (InterruptedException e) { + Thread.currentThread().interrupt(); + return; + } + } + if (waitMs >= 10_000) { + log.warn("Instance {} previous PC did not close within 10s, proceeding anyway", instanceId); + } + } + + // started flag is set in start(), not here — prevents double-submission + log.info("Running consumer instance {}", instanceId); + + Properties consumerProps = new Properties(); + consumerProps.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, config.maxPoll); + if (config.useCooperativeAssignor) { + consumerProps.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG, + "org.apache.kafka.clients.consumer.CooperativeStickyAssignor"); + } + KafkaConsumer newConsumer = kcu.createNewConsumer(false, consumerProps); + + this.parallelConsumer = new ParallelEoSStreamProcessor<>(ParallelConsumerOptions.builder() + .ordering(config.order) + .consumer(newConsumer) + .commitMode(config.commitMode) + .maxConcurrency(config.maxConcurrency) + .build()); + + this.parallelConsumer.setTimeBetweenCommits(Duration.ofSeconds(1)); + this.parallelConsumer.setMyId(Optional.of("PC-" + instanceId)); + this.parallelConsumer.subscribe(of(config.inputTopic)); + + parallelConsumer.poll(record -> { + if (config.pollDelayMs > 0) { + try { + Thread.sleep(config.pollDelayMs); + } catch (InterruptedException e) { + // ignore — shutdown in progress + } + } + consumedKeys.add(record.key()); + onConsumed.accept(record.key()); + }); + } + + /** True while a background close is in progress — prevents toggle from restarting prematurely */ + private volatile boolean closePending = false; + + public void stop() { + log.info("Stopping instance {}", instanceId); + started = false; + parallelConsumer.close(); + } + + /** + * Non-blocking stop: signals close and returns immediately. The close completes + * in a background thread. Use this from the chaos monkey so it isn't blocked for + * 30-40s while the PC shuts down. The {@link #closePending} flag prevents + * {@link #toggle} from restarting until close finishes. + */ + public void stopAsync() { + log.info("Async stopping instance {}", instanceId); + started = false; + closePending = true; + var pcToClose = parallelConsumer; + new Thread(() -> { + try { + pcToClose.close(); + } catch (Exception e) { + log.warn("Instance {} background close error: {}", instanceId, e.getMessage()); + } finally { + closePending = false; + } + }, "pc-close-" + instanceId).start(); + } + + /** + * Restart: checks the previous PC's failure cause, classifies it, then resubmits to the executor. + * Expected close exceptions are logged. Unexpected exceptions fail the test. + */ + public void start(ExecutorService pcExecutor) { + if (parallelConsumer != null) { + Exception failureCause = parallelConsumer.getFailureCause(); + if (failureCause != null) { + if (isExpectedCloseException(failureCause)) { + log.warn("Instance {} had expected close exception (restarting): {}", + instanceId, failureCause.getMessage()); + } else { + throw new RuntimeException( + "Instance " + instanceId + " died from unexpected error: " + failureCause.getMessage(), + failureCause); + } + } + } + started = true; // set BEFORE submit so next toggle() sees it — prevents double-submission + log.info("Starting instance {}", instanceId); + pcExecutor.submit(this); + } + + public void toggle(ExecutorService pcExecutor) { + if (closePending) { + log.trace("Instance {} toggle skipped — close still pending", instanceId); + return; + } + if (started) { + stopAsync(); // non-blocking so the chaos monkey isn't frozen during close + } else { + start(pcExecutor); + } + } + + public void close() { + log.info("Closing instance {}", instanceId); + stop(); + } + + /** + * Whitelist-only exception classification. Walks the cause chain looking for known + * close-related exceptions. Everything not on the whitelist is treated as an unexpected + * bug that should fail the test. + */ + public static boolean isExpectedCloseException(Throwable t) { + Throwable current = t; + while (current != null) { + if (current instanceof InterruptedException || + current instanceof WakeupException || + current instanceof DisconnectException || + current instanceof ClosedChannelException || + current instanceof TimeoutException) { + return true; + } + current = current.getCause(); + } + return false; + } + + /** + * Configuration for a managed PC instance. Use the builder. + */ + @Builder + @Getter + public static class Config { + @Builder.Default private final int maxPoll = 500; + private final CommitMode commitMode; + private final ProcessingOrder order; + private final String inputTopic; + @Builder.Default private final int pollDelayMs = 0; + @Builder.Default private final int maxConcurrency = 10; + @Builder.Default private final boolean useCooperativeAssignor = false; + } +} diff --git a/parallel-consumer-core/src/test-integration/resources/logback-test.xml b/parallel-consumer-core/src/test-integration/resources/logback-test.xml new file mode 100644 index 000000000..c003ad8a8 --- /dev/null +++ b/parallel-consumer-core/src/test-integration/resources/logback-test.xml @@ -0,0 +1,39 @@ + + + + + + %d{mm:ss.SSS} %yellow(%X{pcId}) %highlight(%-5level) %yellow([%thread]) %X{offset} %cyan(\(%file:%line\)#%M) %msg%n + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/parallel-consumer-core/src/test/java/io/confluent/parallelconsumer/ArchitectureTest.java b/parallel-consumer-core/src/test/java/io/confluent/parallelconsumer/ArchitectureTest.java new file mode 100644 index 000000000..96b37bd13 --- /dev/null +++ b/parallel-consumer-core/src/test/java/io/confluent/parallelconsumer/ArchitectureTest.java @@ -0,0 +1,96 @@ +package io.confluent.parallelconsumer; + +/*- + * Copyright (C) 2020-2026 Confluent, Inc. and contributors + */ + +import com.tngtech.archunit.core.domain.JavaField; +import com.tngtech.archunit.core.importer.ImportOption; +import com.tngtech.archunit.junit.AnalyzeClasses; +import com.tngtech.archunit.junit.ArchTest; +import com.tngtech.archunit.lang.ArchCondition; +import com.tngtech.archunit.lang.ArchRule; +import com.tngtech.archunit.lang.ConditionEvents; +import com.tngtech.archunit.lang.SimpleConditionEvent; +import io.confluent.parallelconsumer.internal.ConsumerManager; +import org.apache.kafka.clients.consumer.Consumer; +import org.apache.kafka.clients.consumer.KafkaConsumer; + +import java.util.Arrays; +import java.util.HashSet; +import java.util.Set; + +import com.tngtech.archunit.core.domain.JavaAccess; + +import static com.tngtech.archunit.lang.syntax.ArchRuleDefinition.fields; +import static com.tngtech.archunit.lang.syntax.ArchRuleDefinition.noClasses; + +/** + * ArchUnit rules enforcing the architecture of the Parallel Consumer. + *

+ * These rules prevent regressions in thread-safety and encapsulation. + * See #857. + */ +@AnalyzeClasses( + packages = "io.confluent.parallelconsumer", + importOptions = ImportOption.DoNotIncludeTests.class +) +class ArchitectureTest { + + // Classes allowed to hold a Consumer field. Use getName() to avoid hardcoded strings. + // ThreadConfinedConsumer is package-private so we reference it by name. + private static final Set ALLOWED_CONSUMER_HOLDERS = new HashSet<>(Arrays.asList( + ConsumerManager.class.getName(), + "io.confluent.parallelconsumer.internal.ThreadConfinedConsumer", + ParallelConsumerOptions.class.getName(), + // Lombok @Builder generates this inner class which also holds the consumer field + ParallelConsumerOptions.class.getName() + "$ParallelConsumerOptionsBuilder" + )); + + /** + * Only the designated wrapper/options classes may hold a Consumer or KafkaConsumer field. + * This prevents accidental raw consumer access that bypasses the thread-confinement wrapper. + */ + @ArchTest + static final ArchRule noRawConsumerFieldsOutsideWrappers = + fields() + .that().haveRawType(Consumer.class) + .or().haveRawType(KafkaConsumer.class) + .should(beInAllowedClasses(ALLOWED_CONSUMER_HOLDERS)) + .as("Only " + ALLOWED_CONSUMER_HOLDERS + " may hold a Consumer field. " + + "All other consumer access must go through ConsumerManager. See #857."); + + /** + * Only ProducerWrapper should hold a raw Producer field. + * ProducerManager holds ProducerWrapper, not raw Producer. + */ + @ArchTest + static final ArchRule noRawProducerFieldsOutsideWrapper = + fields() + .that().haveRawType("org.apache.kafka.clients.producer.Producer") + .or().haveRawType("org.apache.kafka.clients.producer.KafkaProducer") + .should(beInAllowedClasses(new HashSet<>(Arrays.asList( + "io.confluent.parallelconsumer.internal.ProducerWrapper", + ParallelConsumerOptions.class.getName(), + ParallelConsumerOptions.class.getName() + "$ParallelConsumerOptionsBuilder" + )))) + .as("Only ProducerWrapper and ParallelConsumerOptions may hold a Producer field. " + + "All other producer access must go through ProducerWrapper/ProducerManager."); + + // Future: add rule that ConsumerManager is only constructed by PCModule. + // Requires DescribedPredicate API which is verbose — defer for now. + + private static ArchCondition beInAllowedClasses(Set allowedClassNames) { + return new ArchCondition<>("be declared in an allowed class") { + @Override + public void check(JavaField field, ConditionEvents events) { + String ownerName = field.getOwner().getName(); + if (!allowedClassNames.contains(ownerName)) { + events.add(SimpleConditionEvent.violated(field, + "Field " + field.getFullName() + " holds a Consumer/Producer reference but " + + ownerName + " is not in the allowed list: " + allowedClassNames)); + } + } + }; + } +} diff --git a/parallel-consumer-core/src/test/java/io/confluent/parallelconsumer/state/ModelUtils.java b/parallel-consumer-core/src/test/java/io/confluent/parallelconsumer/state/ModelUtils.java index 20c45e35c..1d35b7591 100644 --- a/parallel-consumer-core/src/test/java/io/confluent/parallelconsumer/state/ModelUtils.java +++ b/parallel-consumer-core/src/test/java/io/confluent/parallelconsumer/state/ModelUtils.java @@ -32,9 +32,15 @@ public ModelUtils() { } public WorkContainer createWorkFor(long offset) { + return createWorkFor(offset, 0); + } + + public WorkContainer createWorkFor(long offset, long epoch) { ConsumerRecord mockCr = Mockito.mock(ConsumerRecord.class); - WorkContainer workContainer = new WorkContainer<>(0, mockCr, module); Mockito.doReturn(offset).when(mockCr).offset(); + Mockito.doReturn(topic).when(mockCr).topic(); + Mockito.doReturn(0).when(mockCr).partition(); + WorkContainer workContainer = new WorkContainer<>(epoch, mockCr, module); return workContainer; } diff --git a/parallel-consumer-core/src/test/java/io/confluent/parallelconsumer/state/ShardManagerStaleContainerTest.java b/parallel-consumer-core/src/test/java/io/confluent/parallelconsumer/state/ShardManagerStaleContainerTest.java new file mode 100644 index 000000000..29aee5fab --- /dev/null +++ b/parallel-consumer-core/src/test/java/io/confluent/parallelconsumer/state/ShardManagerStaleContainerTest.java @@ -0,0 +1,194 @@ +package io.confluent.parallelconsumer.state; + +/*- + * Copyright (C) 2020-2026 Confluent, Inc. and contributors + */ + +import io.confluent.parallelconsumer.internal.PCModuleTestEnv; +import io.confluent.parallelconsumer.offsets.OffsetMapCodecManager; +import lombok.extern.slf4j.Slf4j; +import org.apache.kafka.clients.consumer.ConsumerRecord; +import org.apache.kafka.common.TopicPartition; +import org.junit.jupiter.api.BeforeEach; +import org.junit.jupiter.api.Test; +import pl.tlinkowski.unij.api.UniLists; + +import java.util.List; +import java.util.Map; +import java.util.concurrent.ConcurrentHashMap; + +import static com.google.common.truth.Truth.assertThat; +import static com.google.common.truth.Truth.assertWithMessage; + +/** + * Deterministic unit tests for stale container handling across rebalances. + *

+ * These tests construct the exact scenarios suspected in + * #857 + * without requiring a broker, so they run fast and fail deterministically. + * + * @see ShardManager + * @see ProcessingShard + * @see PartitionStateManager + */ +@Slf4j +class ShardManagerStaleContainerTest { + + ModelUtils mu = new ModelUtils(); + WorkManager wm; + ShardManager sm; + PartitionStateManager pm; + + String topic = "topic"; + TopicPartition tp = new TopicPartition(topic, 0); + + @BeforeEach + void setup() { + PCModuleTestEnv module = mu.getModule(); + wm = module.workManager(); + sm = wm.getSm(); + pm = wm.getPm(); + + // initial assignment at epoch 0 + wm.onPartitionsAssigned(UniLists.of(tp)); + } + + /** + * Core reproduction scenario for #857: after a revoke+reassign cycle, stale work containers + * from the old epoch should not block new work from being taken. + */ + @Test + void staleContainerShouldNotBlockNewWorkAfterRebalance() { + long initialEpoch = pm.getEpochOfPartition(tp); + + // Add work at the initial epoch + for (int i = 0; i < 5; i++) { + sm.addWorkContainer(initialEpoch, new ConsumerRecord<>(topic, 0, i, "key-" + i, "value")); + } + + // Verify we can take work (sanity check) + List> initialWork = sm.getWorkIfAvailable(10); + assertThat(initialWork).hasSize(5); + + // Simulate revoke → reassign (epoch goes from N to N+2) + wm.onPartitionsRevoked(UniLists.of(tp)); + wm.onPartitionsAssigned(UniLists.of(tp)); + + long newEpoch = pm.getEpochOfPartition(tp); + assertWithMessage("Epoch should have advanced past initial") + .that(newEpoch).isGreaterThan(initialEpoch); + + // Add new work at the new epoch + for (int i = 10; i < 15; i++) { + sm.addWorkContainer(newEpoch, new ConsumerRecord<>(topic, 0, i, "key-" + i, "value")); + } + + // Also inject a "late arriving" container with the OLD epoch — simulates a poll + // that was in-flight during the rebalance and arrived after reassignment + sm.addWorkContainer(initialEpoch, new ConsumerRecord<>(topic, 0, 99, "key-stale", "value")); + + // Now try to take work — the new epoch's containers should be returned, + // and the stale one should not block them + List> workAfterRebalance = sm.getWorkIfAvailable(10); + + assertWithMessage("Should be able to take new-epoch work after rebalance. " + + "If this fails with 0, stale containers are blocking the shard — this is bug #857.") + .that(workAfterRebalance).isNotEmpty(); + + // Verify the returned work is from the new epoch + for (var wc : workAfterRebalance) { + assertWithMessage("Returned work should be from new epoch, not stale") + .that(wc.getEpoch()).isEqualTo(newEpoch); + } + } + + /** + * Rapid successive rebalances (revoke→assign→revoke→assign) should clean up all stale + * containers and not leave any behind to block future work. + */ + @Test + void multipleRapidRebalancesShouldNotLeaveStaleContainers() { + long epoch0 = pm.getEpochOfPartition(tp); + + // Add work at epoch 0 + sm.addWorkContainer(epoch0, new ConsumerRecord<>(topic, 0, 0, "key-0", "value")); + sm.addWorkContainer(epoch0, new ConsumerRecord<>(topic, 0, 1, "key-1", "value")); + + // Rapid rebalance cycle 1 + wm.onPartitionsRevoked(UniLists.of(tp)); + wm.onPartitionsAssigned(UniLists.of(tp)); + + long epoch1 = pm.getEpochOfPartition(tp); + sm.addWorkContainer(epoch1, new ConsumerRecord<>(topic, 0, 2, "key-2", "value")); + + // Rapid rebalance cycle 2 + wm.onPartitionsRevoked(UniLists.of(tp)); + wm.onPartitionsAssigned(UniLists.of(tp)); + + long epoch2 = pm.getEpochOfPartition(tp); + sm.addWorkContainer(epoch2, new ConsumerRecord<>(topic, 0, 3, "key-3", "value")); + + // Rapid rebalance cycle 3 + wm.onPartitionsRevoked(UniLists.of(tp)); + wm.onPartitionsAssigned(UniLists.of(tp)); + + long finalEpoch = pm.getEpochOfPartition(tp); + assertThat(finalEpoch).isGreaterThan(epoch2); + + // Add work at the final epoch + sm.addWorkContainer(finalEpoch, new ConsumerRecord<>(topic, 0, 10, "key-fresh", "value")); + + // Explicitly run stale removal + long staleCount = sm.removeStaleContainers(); + log.info("Removed {} stale containers after rapid rebalances", staleCount); + + // Take work — should get the fresh container + List> work = sm.getWorkIfAvailable(10); + assertWithMessage("Should get fresh work after rapid rebalances") + .that(work).isNotEmpty(); + + for (var wc : work) { + assertWithMessage("All returned work should be from final epoch") + .that(wc.getEpoch()).isEqualTo(finalEpoch); + } + } + + /** + * Verify that the stale container removal actually removes containers from all prior epochs, + * not just the immediately previous one. + */ + @Test + void staleRemovalShouldCatchContainersFromAllPriorEpochs() { + long epoch0 = pm.getEpochOfPartition(tp); + sm.addWorkContainer(epoch0, new ConsumerRecord<>(topic, 0, 0, "key-e0", "value")); + + wm.onPartitionsRevoked(UniLists.of(tp)); + wm.onPartitionsAssigned(UniLists.of(tp)); + + long epoch1 = pm.getEpochOfPartition(tp); + sm.addWorkContainer(epoch1, new ConsumerRecord<>(topic, 0, 1, "key-e1", "value")); + + wm.onPartitionsRevoked(UniLists.of(tp)); + wm.onPartitionsAssigned(UniLists.of(tp)); + + long epoch2 = pm.getEpochOfPartition(tp); + + // Now inject containers from BOTH old epochs (simulating two separate late polls) + sm.addWorkContainer(epoch0, new ConsumerRecord<>(topic, 0, 90, "key-stale-e0", "value")); + sm.addWorkContainer(epoch1, new ConsumerRecord<>(topic, 0, 91, "key-stale-e1", "value")); + + // Add fresh work + sm.addWorkContainer(epoch2, new ConsumerRecord<>(topic, 0, 10, "key-fresh", "value")); + + // Take work — stale containers from both old epochs should not block + List> work = sm.getWorkIfAvailable(10); + + assertWithMessage("Fresh work should be available despite stale containers from multiple epochs") + .that(work).isNotEmpty(); + + for (var wc : work) { + assertWithMessage("No stale work should be returned") + .that(wc.getEpoch()).isEqualTo(epoch2); + } + } +} From 1479f73ffd1715b38e9e15a10365e87a94e30673 Mon Sep 17 00:00:00 2001 From: Antony Stubbs Date: Thu, 23 Apr 2026 13:04:31 +1200 Subject: [PATCH 11/12] wip(#857): provisional fixes for silent stall under rebalance Bundles the code changes attempted during the silent-stall investigation. Multiple provisional fixes landed; none eliminate the stall at 100% -- 10%-20% of aggressive-chaos-monkey runs still stall. See docs/BUG_857_INVESTIGATION.md for full narrative. Root cause is still open (current working hypothesis: Kafka eager-rebalance protocol interaction). Code changes included here: - ThreadConfinedConsumer (new): wrapper enforcing single-thread access to the underlying Consumer, surfacing thread-safety violations that were previously silent - ConsumerManager: accepts ThreadConfinedConsumer; CME fix in poll(); revert of explicit consumer close - BrokerPollSystem: duplicate-run() guard, started-flag relocation to start() to prevent double-submission - AbstractParallelEoSStreamProcessor: tryLock(commitCommand) in onPartitionsRevoked to avoid deadlock under rebalance; trace logging around partition-revoke / assignment cache update - EpochAndRecordsMap: always-update-assignment-cache + trace logging - PartitionStateManager / PartitionState: reset pausedForThrottling on partition assignment; counter adjustments on partition revoke - WorkManager / ShardManager: numberRecordsOutForProcessing adjustments, checkGroupId null check, init-cache fix, revert resetKafkaContainer - PCModule: wire ThreadConfinedConsumer through the DI Squashed from 35 commits of chronological investigation. Original per-commit history preserved at backup/pre-rebase-bugs-857. Co-Authored-By: Claude Opus 4.7 (1M context) --- .../AbstractParallelEoSStreamProcessor.java | 102 ++++++-- .../internal/BrokerPollSystem.java | 31 ++- .../internal/ConsumerManager.java | 62 ++++- .../internal/EpochAndRecordsMap.java | 7 +- .../parallelconsumer/internal/PCModule.java | 8 +- .../internal/ThreadConfinedConsumer.java | 218 ++++++++++++++++++ .../state/PartitionState.java | 10 +- .../state/PartitionStateManager.java | 8 +- .../parallelconsumer/state/ShardManager.java | 14 ++ .../parallelconsumer/state/WorkManager.java | 28 +++ 10 files changed, 454 insertions(+), 34 deletions(-) create mode 100644 parallel-consumer-core/src/main/java/io/confluent/parallelconsumer/internal/ThreadConfinedConsumer.java diff --git a/parallel-consumer-core/src/main/java/io/confluent/parallelconsumer/internal/AbstractParallelEoSStreamProcessor.java b/parallel-consumer-core/src/main/java/io/confluent/parallelconsumer/internal/AbstractParallelEoSStreamProcessor.java index fc716650e..62026311d 100644 --- a/parallel-consumer-core/src/main/java/io/confluent/parallelconsumer/internal/AbstractParallelEoSStreamProcessor.java +++ b/parallel-consumer-core/src/main/java/io/confluent/parallelconsumer/internal/AbstractParallelEoSStreamProcessor.java @@ -110,7 +110,11 @@ public Duration getTimeBetweenCommits() { @Getter(PROTECTED) private final Optional> producerManager; - private final org.apache.kafka.clients.consumer.Consumer consumer; + /** + * All consumer access goes through ConsumerManager (which wraps with ThreadConfinedConsumer). + * No raw Consumer reference is held — enforced by ArchUnit. See #857. + */ + private final ConsumerManager consumerManager; /** * The pool which is used for running the users' supplied function @@ -196,6 +200,12 @@ public static ControllerEventMessage of(WorkContainer work) { */ private final AtomicBoolean commitCommand = new AtomicBoolean(false); + /** + * Lock for offset commit operations. Replaces synchronized(commitCommand) for commit execution + * to allow tryLock() semantics in rebalance callbacks, preventing the deadlock in #857. + */ + private final java.util.concurrent.locks.ReentrantLock commitLock = new java.util.concurrent.locks.ReentrantLock(); + /** * Multiple of {@link ParallelConsumerOptions#getMaxConcurrency()} to have in our processing queue, in order to make * sure threads always have work to do. @@ -284,14 +294,14 @@ protected AbstractParallelEoSStreamProcessor(ParallelConsumerOptions newOp options = newOptions; this.shutdownTimeout = options.getShutdownTimeout(); this.drainTimeout = options.getDrainTimeout(); - this.consumer = options.getConsumer(); + this.consumerManager = module.consumerManager(); validateConfiguration(); module.setParallelEoSStreamProcessor(this); log.info("Confluent Parallel Consumer initialise... groupId: {}, Options: {}", - newOptions.getConsumer().groupMetadata().groupId(), + consumerManager.groupMetadata().groupId(), newOptions); //Initialize global metrics - should be initialized before any of the module objects are created so that meters can be bound in them. pcMetrics = module.pcMetrics(); @@ -331,14 +341,20 @@ private void initMetrics() { private void validateConfiguration() { options.validate(); - checkGroupIdConfigured(consumer); - checkNotSubscribed(consumer); - checkAutoCommitIsDisabled(consumer); + checkGroupIdConfigured(); + checkNotSubscribed(options.getConsumer()); + checkAutoCommitIsDisabled(options.getConsumer()); } - private void checkGroupIdConfigured(final org.apache.kafka.clients.consumer.Consumer consumer) { + private void checkGroupIdConfigured() { try { - consumer.groupMetadata(); + var metadata = consumerManager.groupMetadata(); + if (metadata == null) { + throw new IllegalArgumentException("Error validating Consumer configuration - no group metadata - missing a " + + "configured GroupId on your Consumer?"); + } + } catch (IllegalArgumentException e) { + throw e; // rethrow our own } catch (RuntimeException e) { throw new IllegalArgumentException("Error validating Consumer configuration - no group metadata - missing a " + "configured GroupId on your Consumer?", e); @@ -381,27 +397,27 @@ private void checkNotSubscribed(org.apache.kafka.clients.consumer.Consumer @Override public void subscribe(Collection topics) { log.debug("Subscribing to {}", topics); - consumer.subscribe(topics, this); + consumerManager.subscribe(topics, this); } @Override public void subscribe(Pattern pattern) { log.debug("Subscribing to {}", pattern); - consumer.subscribe(pattern, this); + consumerManager.subscribe(pattern, this); } @Override public void subscribe(Collection topics, ConsumerRebalanceListener callback) { log.debug("Subscribing to {}", topics); usersConsumerRebalanceListener = Optional.of(callback); - consumer.subscribe(topics, this); + consumerManager.subscribe(topics, this); } @Override public void subscribe(Pattern pattern, ConsumerRebalanceListener callback) { log.debug("Subscribing to {}", pattern); usersConsumerRebalanceListener = Optional.of(callback); - consumer.subscribe(pattern, this); + consumerManager.subscribe(pattern, this); } /** @@ -420,8 +436,12 @@ public void onPartitionsRevoked(Collection partitions) { numberOfAssignedPartitions = numberOfAssignedPartitions - partitions.size(); try { - // commit any offsets from revoked partitions BEFORE truncation - commitOffsetsThatAreReady(); + // Try to commit offsets for revoked partitions, but don't block if the control + // thread is already mid-commit. Blocking here deadlocks: the poll thread (us) + // holds the rebalance callback, and the control thread's commitSync() needs the + // poll thread to be responsive. If we can't commit, it's safe — the offsets will + // be re-delivered to the new assignee. See #857. + tryCommitOffsetsOnRevoke(); // truncate the revoked partitions wm.onPartitionsRevoked(partitions); @@ -438,6 +458,35 @@ public void onPartitionsRevoked(Collection partitions) { } } + /** + * Non-blocking attempt to commit offsets during partition revocation. Uses tryLock semantics + * on the commitCommand monitor to avoid deadlocking with the control thread. + *

+ * If the lock is already held (control thread is mid-commit), we skip the commit. This is + * safe because Kafka will re-deliver uncommitted records to the new partition assignee. + *

+ * See #857 — + * the original synchronized(commitCommand) call in onPartitionsRevoked caused a deadlock + * between the poll thread and the control thread under rebalance churn. + */ + private void tryCommitOffsetsOnRevoke() { + if (commitLock.tryLock()) { + try { + log.debug("Acquired commitLock on revoke, committing offsets"); + committer.retrieveOffsetsAndCommit(); + clearCommitCommand(); + this.lastCommitTime = Instant.now(); + } catch (Exception e) { + log.warn("Failed to commit offsets during revoke: {}", e.getMessage()); + } finally { + commitLock.unlock(); + } + } else { + log.info("Skipping offset commit during partition revocation — control thread is mid-commit. " + + "Uncommitted offsets will be re-delivered to the new assignee. See #857."); + } + } + /** * Delegate to {@link WorkManager} * @@ -448,6 +497,11 @@ public void onPartitionsAssigned(Collection partitions) { numberOfAssignedPartitions = numberOfAssignedPartitions + partitions.size(); log.info("Assigned {} total ({} new) partition(s) {}", numberOfAssignedPartitions, partitions.size(), partitions); wm.onPartitionsAssigned(partitions); + // Reset the throttle flag — Kafka clears its internal pause state on reassignment, + // so our flag must match. Without this, shouldThrottle() may re-pause the new + // partitions immediately if stale shard counts make it think we're overloaded. + // See #857. + brokerPollSubsystem.onPartitionsAssigned(); usersConsumerRebalanceListener.ifPresent(x -> x.onPartitionsAssigned(partitions)); notifySomethingToDo(); } @@ -629,6 +683,13 @@ public int getPausedPartitionSize() { return brokerPollSubsystem.getPausedPartitionSize(); } + /** + * Cached assignment size from the last poll. Safe to read from any thread. + */ + public int getAssignmentSize() { + return consumerManager.getAssignmentSize(); + } + private void waitForClose(Duration timeout) throws TimeoutException, ExecutionException { log.info("Waiting on closed state..."); while (!state.equals(CLOSED)) { @@ -755,7 +816,7 @@ private void deregisterMeters() { */ private void maybeCloseConsumer() { if (isResponsibleForCommits()) { - consumer.close(); + consumerManager.close(Duration.ofSeconds(10)); } } @@ -881,6 +942,10 @@ protected void controlLoop(Function, List> user // make sure all work that's been completed are arranged ready for commit Duration timeToBlockFor = shouldTryCommitNow ? Duration.ZERO : getTimeToBlockFor(); + log.trace("Control loop: blocking on mailbox for {}, shouldCommit={}, queuedInShards={}, outForProcessing={}", + timeToBlockFor, shouldTryCommitNow, + wm.getNumberOfWorkQueuedInShardsAwaitingSelection(), + wm.getNumberRecordsOutForProcessing()); processWorkCompleteMailBox(timeToBlockFor); // @@ -1305,12 +1370,15 @@ private Duration getTimeSinceLastCheck() { * Visible for testing */ protected void commitOffsetsThatAreReady() throws TimeoutException, InterruptedException { - log.trace("Synchronizing on commitCommand..."); - synchronized (commitCommand) { + log.trace("Acquiring commitLock..."); + commitLock.lock(); + try { log.debug("Committing offsets that are ready..."); committer.retrieveOffsetsAndCommit(); clearCommitCommand(); this.lastCommitTime = Instant.now(); + } finally { + commitLock.unlock(); } } diff --git a/parallel-consumer-core/src/main/java/io/confluent/parallelconsumer/internal/BrokerPollSystem.java b/parallel-consumer-core/src/main/java/io/confluent/parallelconsumer/internal/BrokerPollSystem.java index da4746c0e..f0ae2bc63 100644 --- a/parallel-consumer-core/src/main/java/io/confluent/parallelconsumer/internal/BrokerPollSystem.java +++ b/parallel-consumer-core/src/main/java/io/confluent/parallelconsumer/internal/BrokerPollSystem.java @@ -130,6 +130,7 @@ private boolean controlLoop() throws TimeoutException, InterruptedException { MDC.put(MDC_INSTANCE_ID, id); }); log.trace("Broker poll control loop start"); + consumerManager.claimConsumerOwnership(); committer.ifPresent(ConsumerOffsetCommitter::claim); try { while (runState != CLOSED) { @@ -155,16 +156,25 @@ private boolean controlLoop() throws TimeoutException, InterruptedException { } private void handlePoll() { - log.trace("Loop: Broker poller: ({})", runState); + log.trace("Loop: Broker poller: ({}), pausedForThrottling={}", runState, pausedForThrottling); if (runState == RUNNING || runState == DRAINING) { // if draining - subs will be paused, so use this to just sleep var polledRecords = pollBrokerForRecords(); int count = polledRecords.count(); log.debug("Got {} records in poll result", count); + if (count == 0) { + log.trace("Poll returned 0 records. assignment={}, paused={}, pausedForThrottling={}", + consumerManager.getAssignmentSize(), + consumerManager.getPausedPartitionSize(), + pausedForThrottling); + } if (count > 0) { - log.trace("Loop: Register work"); + log.trace("Loop: Register work - {} records from {} partitions", + count, polledRecords.partitions().size()); pc.registerWork(polledRecords); } + } else { + log.trace("Not polling - runState={}", runState); } } @@ -304,7 +314,7 @@ private void transitionToClosing() { */ private void managePauseOfSubscription() { boolean throttle = shouldThrottle(); - log.trace("Need to throttle: {}", throttle); + log.trace("Need to throttle: {}, pausedForThrottling={}, assignment={}", throttle, pausedForThrottling, consumerManager.getAssignmentSize()); if (throttle) { doPauseMaybe(); } else { @@ -375,6 +385,21 @@ public void wakeupIfPaused() { * {@link io.confluent.parallelconsumer.internal.State#RUNNING running}, calling this method will be a no-op. *

*/ + /** + * Reset the throttle/pause flag when partitions are assigned. Kafka clears its internal + * consumer pause state on reassignment, so our flag must match. Without this reset, + * {@link #managePauseOfSubscription()} may immediately re-pause the newly assigned + * partitions if stale shard counts make {@link #shouldThrottle()} return true. + *

+ * See #857. + */ + public void onPartitionsAssigned() { + if (pausedForThrottling) { + log.info("Resetting pausedForThrottling flag on partition assignment (was true)"); + pausedForThrottling = false; + } + } + public void pausePollingAndWorkRegistrationIfRunning() { if (this.runState == RUNNING) { log.info("Transitioning broker poll system to state paused."); diff --git a/parallel-consumer-core/src/main/java/io/confluent/parallelconsumer/internal/ConsumerManager.java b/parallel-consumer-core/src/main/java/io/confluent/parallelconsumer/internal/ConsumerManager.java index 801cc5a2c..b7ac6b587 100644 --- a/parallel-consumer-core/src/main/java/io/confluent/parallelconsumer/internal/ConsumerManager.java +++ b/parallel-consumer-core/src/main/java/io/confluent/parallelconsumer/internal/ConsumerManager.java @@ -16,6 +16,7 @@ import java.time.Duration; import java.time.Instant; +import java.util.Collection; import java.util.HashMap; import java.util.Map; import java.util.Set; @@ -29,7 +30,7 @@ @RequiredArgsConstructor public class ConsumerManager { - private final Consumer consumer; + private final ThreadConfinedConsumer consumer; private final Duration offsetCommitTimeout; @@ -55,6 +56,21 @@ public class ConsumerManager { private int erroneousWakups = 0; private int correctPollWakeups = 0; private int noWakeups = 0; + + /** + * Prime the metadata cache so that groupMetadata() returns a valid value before the poll + * thread starts. Must be called after construction, before any thread claims ownership. + *

+ * Silently handles errors (e.g., missing group.id) — validation happens later in + * the PC constructor's checkGroupIdConfigured(). + */ + void init() { + try { + updateCache(); + } catch (Exception e) { + log.trace("Could not prime cache during init (will be validated later): {}", e.getMessage()); + } + } private boolean commitRequested; ConsumerRecords poll(Duration requestedLongPollTimeout) { @@ -67,8 +83,7 @@ ConsumerRecords poll(Duration requestedLongPollTimeout) { commitRequested = false; } pollingBroker.set(true); - updateCache(); - log.debug("Poll starting with timeout: {}", timeoutToUse); + log.trace("Poll starting with timeout: {}, assignment size: {}", timeoutToUse, assignmentSizeCache); Instant pollStarted = Instant.now(); long tryCount = 0; boolean polledSuccessfully = false; @@ -106,7 +121,6 @@ ConsumerRecords poll(Duration requestedLongPollTimeout) { } pendingRequests.addAndGet(-1L); } - updateCache(); } catch (WakeupException w) { correctPollWakeups++; log.debug("Awoken from broker poll"); @@ -115,12 +129,28 @@ ConsumerRecords poll(Duration requestedLongPollTimeout) { } finally { pollingBroker.set(false); } + // Update the cache after pollingBroker is cleared, so wakeup() from another thread + // won't call consumer.wakeup() while we're calling consumer.groupMetadata()/paused(). + // This fixes ConcurrentModificationException when close() races against poll(). + // Always update (not just when records > 0) so assignment cache stays current after rebalances. + // See https://github.com/confluentinc/parallel-consumer/issues/857 + updateCache(); return records != null ? records : new ConsumerRecords<>(UniMaps.of()); } + private volatile int assignmentSizeCache = 0; + protected void updateCache() { metaCache = consumer.groupMetadata(); pausedPartitionSizeCache = consumer.paused().size(); + assignmentSizeCache = consumer.assignment().size(); + } + + /** + * Cached assignment size, safe to read from any thread. Updated during poll. + */ + public int getAssignmentSize() { + return assignmentSizeCache; } /** @@ -239,6 +269,14 @@ public ConsumerGroupMetadata groupMetadata() { return metaCache; } + /** + * Claim the underlying consumer for the current thread. After this, any consumer method + * (except wakeup) called from a different thread will throw immediately with a clear message. + */ + void claimConsumerOwnership() { + consumer.claimOwnership(); + } + public void signalStop() { if(!this.shutdownRequested.get()) { log.info("Signaling Consumer Manager to stop"); @@ -279,6 +317,22 @@ public int getPausedPartitionSize() { return pausedPartitionSizeCache; } + void subscribe(Collection topics, ConsumerRebalanceListener listener) { + consumer.subscribe(topics, listener); + } + + void subscribe(java.util.regex.Pattern pattern, ConsumerRebalanceListener listener) { + consumer.subscribe(pattern, listener); + } + + /** + * Returns the raw consumer class type for reflection-based checks (e.g., auto-commit detection). + * Does not access the consumer's Kafka methods, just the class object. + */ + Class getConsumerClass() { + return consumer.getClass(); + } + public void resume(final Set pausedTopics) { consumer.resume(pausedTopics); } diff --git a/parallel-consumer-core/src/main/java/io/confluent/parallelconsumer/internal/EpochAndRecordsMap.java b/parallel-consumer-core/src/main/java/io/confluent/parallelconsumer/internal/EpochAndRecordsMap.java index 0b482ce8b..5081307da 100644 --- a/parallel-consumer-core/src/main/java/io/confluent/parallelconsumer/internal/EpochAndRecordsMap.java +++ b/parallel-consumer-core/src/main/java/io/confluent/parallelconsumer/internal/EpochAndRecordsMap.java @@ -25,19 +25,18 @@ public class EpochAndRecordsMap { Map recordMap = new HashMap<>(); + private static final org.slf4j.Logger log = org.slf4j.LoggerFactory.getLogger(EpochAndRecordsMap.class); + public EpochAndRecordsMap(ConsumerRecords poll, PartitionStateManager pm) { poll.partitions().forEach(partition -> { var records = poll.records(partition); Long epochOfPartition = pm.getEpochOfPartition(partition); if (epochOfPartition == null) { - // Race: poll() returned records for a partition before onPartitionsAssigned() - // has fired. This is more likely with Kafka 2.x's eager rebalance protocol. - // Safe to skip — these records haven't been committed, so Kafka will re-deliver - // them on the next poll after the assignment callback completes. log.warn("Skipping {} records for partition {} — no epoch assigned yet. " + "Records will be re-delivered on next poll after assignment completes.", records.size(), partition); return; } + log.trace("Tagging {} records for {} with epoch {}", records.size(), partition, epochOfPartition); RecordsAndEpoch entry = new RecordsAndEpoch(partition, epochOfPartition, records); recordMap.put(partition, entry); }); diff --git a/parallel-consumer-core/src/main/java/io/confluent/parallelconsumer/internal/PCModule.java b/parallel-consumer-core/src/main/java/io/confluent/parallelconsumer/internal/PCModule.java index f54f0b063..bf2748de2 100644 --- a/parallel-consumer-core/src/main/java/io/confluent/parallelconsumer/internal/PCModule.java +++ b/parallel-consumer-core/src/main/java/io/confluent/parallelconsumer/internal/PCModule.java @@ -67,10 +67,16 @@ public Consumer consumer() { protected ConsumerManager consumerManager() { if (consumerManager == null) { - consumerManager = new ConsumerManager<>(optionsInstance.getConsumer(), + // Wrap the user's consumer in a thread-confinement guard. Ownership is claimed + // by the poll thread when BrokerPollSystem.controlLoop starts. Before that, + // init-time calls (subscribe, groupMetadata) are allowed from any thread. + // See #857. + var confinedConsumer = new ThreadConfinedConsumer<>(optionsInstance.getConsumer()); + consumerManager = new ConsumerManager<>(confinedConsumer, optionsInstance.getOffsetCommitTimeout(), optionsInstance.getSaslAuthenticationRetryTimeout(), optionsInstance.getSaslAuthenticationExceptionRetryBackoff()); + consumerManager.init(); } return consumerManager; } diff --git a/parallel-consumer-core/src/main/java/io/confluent/parallelconsumer/internal/ThreadConfinedConsumer.java b/parallel-consumer-core/src/main/java/io/confluent/parallelconsumer/internal/ThreadConfinedConsumer.java new file mode 100644 index 000000000..139edf81f --- /dev/null +++ b/parallel-consumer-core/src/main/java/io/confluent/parallelconsumer/internal/ThreadConfinedConsumer.java @@ -0,0 +1,218 @@ +package io.confluent.parallelconsumer.internal; + +/*- + * Copyright (C) 2020-2026 Confluent, Inc. and contributors + */ + +import lombok.NonNull; +import lombok.RequiredArgsConstructor; +import lombok.experimental.Delegate; +import lombok.extern.slf4j.Slf4j; +import org.apache.kafka.clients.consumer.*; +import org.apache.kafka.common.TopicPartition; + +import java.time.Duration; +import java.util.Collection; +import java.util.Map; +import java.util.Set; +import java.util.regex.Pattern; + +/** + * Delegating wrapper around {@link Consumer} that enforces thread confinement at runtime. + * All consumer methods (except {@link #wakeup()}) must be called from the owning thread. + *

+ * Uses Lombok {@code @Delegate} to generate passthrough methods for all Consumer methods we + * don't explicitly override. We override all thread-unsafe methods with a {@link #checkThread} + * guard. {@link #wakeup()} is left to the delegate (thread-safe per Kafka API). + *

+ * Call {@link #claimOwnership()} from the poll thread before first use. Before ownership is + * claimed, all methods are allowed (for init-time calls like subscribe). + *

+ * Pattern follows {@link ProducerWrapper} which uses the same Lombok delegate approach. + * + * @see #857 + */ +@Slf4j +@RequiredArgsConstructor +class ThreadConfinedConsumer implements Consumer { + + private volatile Thread ownerThread; + + @NonNull + @Delegate(excludes = ThreadUnsafeMethods.class) + private final Consumer delegate; + + /** + * Claim this consumer for the current thread. After this call, any consumer method + * (except wakeup) called from a different thread will throw IllegalStateException. + */ + void claimOwnership() { + this.ownerThread = Thread.currentThread(); + log.debug("Consumer ownership claimed by thread: {}", ownerThread.getName()); + } + + private void checkThread(String methodName) { + Thread current = Thread.currentThread(); + Thread owner = this.ownerThread; + if (owner != null && current != owner) { + String msg = "Consumer." + methodName + "() called from thread '" + + current.getName() + "' (id:" + current.getId() + + ") but consumer is owned by thread '" + owner.getName() + + "' (id:" + owner.getId() + ", alive:" + owner.isAlive() + + "). Only wakeup() is thread-safe. See #857."; + log.error(msg); + throw new IllegalStateException(msg); + } + log.trace("Consumer.{}() on thread '{}' (owner: {})", methodName, current.getName(), + owner != null ? owner.getName() : "unclaimed"); + } + + // --- Thread-unsafe method overrides (all check thread before delegating) --- + + @Override + public ConsumerRecords poll(Duration timeout) { + checkThread("poll"); + return delegate.poll(timeout); + } + + @Override + public void commitSync(Map offsets) { + checkThread("commitSync"); + delegate.commitSync(offsets); + } + + @Override + public void commitSync(Map offsets, Duration timeout) { + checkThread("commitSync"); + delegate.commitSync(offsets, timeout); + } + + @Override + public void commitSync() { + checkThread("commitSync"); + delegate.commitSync(); + } + + @Override + public void commitSync(Duration timeout) { + checkThread("commitSync"); + delegate.commitSync(timeout); + } + + @Override + public void commitAsync(Map offsets, OffsetCommitCallback callback) { + checkThread("commitAsync"); + delegate.commitAsync(offsets, callback); + } + + @Override + public void commitAsync(OffsetCommitCallback callback) { + checkThread("commitAsync"); + delegate.commitAsync(callback); + } + + @Override + public void commitAsync() { + checkThread("commitAsync"); + delegate.commitAsync(); + } + + @Override + public Set assignment() { + checkThread("assignment"); + return delegate.assignment(); + } + + @Override + public void pause(Collection partitions) { + checkThread("pause"); + delegate.pause(partitions); + } + + @Override + public Set paused() { + checkThread("paused"); + return delegate.paused(); + } + + @Override + public void resume(Collection partitions) { + checkThread("resume"); + delegate.resume(partitions); + } + + @Override + public ConsumerGroupMetadata groupMetadata() { + checkThread("groupMetadata"); + return delegate.groupMetadata(); + } + + @Override + public void subscribe(Collection topics, ConsumerRebalanceListener callback) { + checkThread("subscribe"); + delegate.subscribe(topics, callback); + } + + @Override + public void subscribe(Collection topics) { + checkThread("subscribe"); + delegate.subscribe(topics); + } + + @Override + public void subscribe(Pattern pattern, ConsumerRebalanceListener callback) { + checkThread("subscribe"); + delegate.subscribe(pattern, callback); + } + + @Override + public void subscribe(Pattern pattern) { + checkThread("subscribe"); + delegate.subscribe(pattern); + } + + @Override + public void close() { + checkThread("close"); + delegate.close(); + } + + @Override + public void close(Duration timeout) { + checkThread("close"); + delegate.close(timeout); + } + + // --- wakeup() is intentionally NOT overridden --- + // Lombok @Delegate generates the passthrough: delegate.wakeup() + // This is correct — wakeup() is the one thread-safe Consumer method. + + /** + * Excludes interface for Lombok @Delegate. Methods listed here are NOT auto-delegated; + * we override them above with thread-safety checks. + *

+ * Note: method signatures must match the Consumer interface exactly for Lombok to exclude them. + */ + @SuppressWarnings("unused") + private interface ThreadUnsafeMethods { + ConsumerRecords poll(Duration timeout); + void commitSync(Map offsets); + void commitSync(Map offsets, Duration timeout); + void commitSync(); + void commitSync(Duration timeout); + void commitAsync(Map offsets, OffsetCommitCallback callback); + void commitAsync(OffsetCommitCallback callback); + void commitAsync(); + Set assignment(); + void pause(Collection partitions); + Set paused(); + void resume(Collection partitions); + ConsumerGroupMetadata groupMetadata(); + void subscribe(Collection topics, ConsumerRebalanceListener callback); + void subscribe(Collection topics); + void subscribe(java.util.regex.Pattern pattern, ConsumerRebalanceListener callback); + void subscribe(java.util.regex.Pattern pattern); + void close(); + void close(Duration timeout); + } +} diff --git a/parallel-consumer-core/src/main/java/io/confluent/parallelconsumer/state/PartitionState.java b/parallel-consumer-core/src/main/java/io/confluent/parallelconsumer/state/PartitionState.java index 5f2e036fe..b7945a0d8 100644 --- a/parallel-consumer-core/src/main/java/io/confluent/parallelconsumer/state/PartitionState.java +++ b/parallel-consumer-core/src/main/java/io/confluent/parallelconsumer/state/PartitionState.java @@ -289,8 +289,14 @@ private boolean epochIsStale(EpochAndRecordsMap.RecordsAndEpoch recordsAnd public void maybeRegisterNewPollBatchAsWork(@NonNull EpochAndRecordsMap.RecordsAndEpoch recordsAndEpoch) { if (epochIsStale(recordsAndEpoch)) { - log.debug("Inbound record of work has epoch ({}) not matching currently assigned epoch for the applicable partition ({}), skipping", - recordsAndEpoch.getEpochOfPartitionAtPoll(), getPartitionsAssignmentEpoch()); + // #857: upgraded from debug to warn — this is the primary suspect for the silent stall. + // Records polled from the broker are being dropped because the epoch captured at poll + // time doesn't match the current partition state epoch. This happens when a rebalance + // occurs between poll() and registration on the control thread. + log.warn("Dropping polled records — epoch mismatch: poll epoch={}, partition epoch={}, " + + "partition={}, records count={}. This may cause consumption stall if persistent. See #857.", + recordsAndEpoch.getEpochOfPartitionAtPoll(), getPartitionsAssignmentEpoch(), + recordsAndEpoch.getTopicPartition(), recordsAndEpoch.getRecords().size()); return; } diff --git a/parallel-consumer-core/src/main/java/io/confluent/parallelconsumer/state/PartitionStateManager.java b/parallel-consumer-core/src/main/java/io/confluent/parallelconsumer/state/PartitionStateManager.java index c2c55cb5d..b0dfcd47a 100644 --- a/parallel-consumer-core/src/main/java/io/confluent/parallelconsumer/state/PartitionStateManager.java +++ b/parallel-consumer-core/src/main/java/io/confluent/parallelconsumer/state/PartitionStateManager.java @@ -102,6 +102,7 @@ protected PartitionState getPartitionState(WorkContainer workContain @Override public void onPartitionsAssigned(Collection assignedPartitions) { log.debug("Partitions assigned: {}", assignedPartitions); + log.trace("Epoch map before assignment: {}", partitionsAssignmentEpochs); for (final TopicPartition partitionAssignment : assignedPartitions) { boolean isAlreadyAssigned = this.partitionStates.containsKey(partitionAssignment); @@ -257,9 +258,10 @@ public Long getEpochOfPartition(TopicPartition partition) { private void incrementPartitionAssignmentEpoch(final Collection partitions) { for (final TopicPartition partition : partitions) { - Long epoch = partitionsAssignmentEpochs.getOrDefault(partition, PartitionState.KAFKA_OFFSET_ABSENCE); - epoch++; - partitionsAssignmentEpochs.put(partition, epoch); + Long oldEpoch = partitionsAssignmentEpochs.getOrDefault(partition, PartitionState.KAFKA_OFFSET_ABSENCE); + Long newEpoch = oldEpoch + 1; + partitionsAssignmentEpochs.put(partition, newEpoch); + log.trace("Epoch for {} incremented: {} -> {}", partition, oldEpoch, newEpoch); } } diff --git a/parallel-consumer-core/src/main/java/io/confluent/parallelconsumer/state/ShardManager.java b/parallel-consumer-core/src/main/java/io/confluent/parallelconsumer/state/ShardManager.java index eb9478c61..b117d9bfb 100644 --- a/parallel-consumer-core/src/main/java/io/confluent/parallelconsumer/state/ShardManager.java +++ b/parallel-consumer-core/src/main/java/io/confluent/parallelconsumer/state/ShardManager.java @@ -136,6 +136,20 @@ public boolean workIsWaitingToBeProcessed() { return getNumberOfWorkQueuedInShardsAwaitingSelection() > 0L; } + /** + * Count work containers that are in-flight (dispatched to worker pool) for the given partitions. + * Used by {@link WorkManager#onPartitionsRemoved} to adjust the outForProcessing counter + * when partitions are revoked, preventing the silent stall described in #857. + */ + long countInflightForPartitions(Collection partitions) { + Set partitionSet = new HashSet<>(partitions); + return processingShards.values().stream() + .flatMap(shard -> shard.getEntries().values().stream()) + .filter(WorkContainer::isInFlight) + .filter(wc -> partitionSet.contains(wc.getTopicPartition())) + .count(); + } + /** * Remove only the work shards which are referenced from work from revoked partitions * diff --git a/parallel-consumer-core/src/main/java/io/confluent/parallelconsumer/state/WorkManager.java b/parallel-consumer-core/src/main/java/io/confluent/parallelconsumer/state/WorkManager.java index 1e0ad3932..3c8932d0d 100644 --- a/parallel-consumer-core/src/main/java/io/confluent/parallelconsumer/state/WorkManager.java +++ b/parallel-consumer-core/src/main/java/io/confluent/parallelconsumer/state/WorkManager.java @@ -106,6 +106,10 @@ public void onPartitionsAssigned(Collection partitions) { */ @Override public void onPartitionsRevoked(Collection partitions) { + // Adjust numberRecordsOutForProcessing BEFORE the partition state cleanup removes + // work from shards. After pm.onPartitionsRevoked, entries will be gone and we can't + // count them. See #857 — without this, the counter stays inflated permanently. + adjustOutForProcessingOnRevoke(partitions); pm.onPartitionsRevoked(partitions); onPartitionsRemoved(partitions); } @@ -115,10 +119,34 @@ public void onPartitionsRevoked(Collection partitions) { */ @Override public void onPartitionsLost(Collection partitions) { + adjustOutForProcessingOnRevoke(partitions); pm.onPartitionsLost(partitions); onPartitionsRemoved(partitions); } + /** + * Adjust numberRecordsOutForProcessing to account for in-flight work that belonged + * to the revoked partitions. Must be called BEFORE pm.onPartitionsRevoked/Lost which + * removes entries from shards. + *

+ * These work items were dispatched to the worker pool but may never complete through + * the mailbox (e.g., if the PC is being closed). Without this adjustment, the counter + * stays permanently inflated, calculateQuantityToRequest() returns 0, and no new work + * is ever distributed - causing the silent stall in #857. + */ + private void adjustOutForProcessingOnRevoke(Collection partitions) { + long inflightForRemovedPartitions = sm.countInflightForPartitions(partitions); + if (inflightForRemovedPartitions > 0) { + log.info("Adjusting numberRecordsOutForProcessing by -{} for revoked partitions (was {})", + inflightForRemovedPartitions, numberRecordsOutForProcessing); + numberRecordsOutForProcessing -= (int) inflightForRemovedPartitions; + if (numberRecordsOutForProcessing < 0) { + log.warn("numberRecordsOutForProcessing went negative ({}), resetting to 0", numberRecordsOutForProcessing); + numberRecordsOutForProcessing = 0; + } + } + } + void onPartitionsRemoved(final Collection partitions) { deregisterTopicPartitionSpecificMetrics(partitions); } From 262629aabb192075093b1f0bf700449a97124d6c Mon Sep 17 00:00:00 2001 From: Antony Stubbs Date: Thu, 23 Apr 2026 13:05:12 +1200 Subject: [PATCH 12/12] =?UTF-8?q?docs(#857):=20silent-stall=20investigatio?= =?UTF-8?q?n=20findings=20=E2=80=94=20root=20cause=20still=20open?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Investigation artefacts for the multi-consumer silent stall after rebalance tracked in issue #857. - docs/BUG_857_INVESTIGATION.md: consolidated investigation narrative across multiple hypotheses -- numberRecordsOutForProcessing counter drift, pausedForThrottling leak across rebalances, CME in poll(), commitCommand lock contention, ConsumerManager thread-safety, Kafka eager rebalance protocol interaction. Each hypothesis documents what was tested, what was fixed, and why the stall still reproduces at the residual ~10-20% rate. Current working hypothesis at the top of the doc. - AGENTS.md: note the investigation branch + reuse of ManagedPCInstance test harness. - pom.xml: test-scope dependency bumps needed by the expanded test suite (ArchUnit, mockito, etc). Squashed from 35 commits of chronological investigation. Original per-commit history preserved at backup/pre-rebase-bugs-857. Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/BUG_857_INVESTIGATION.md | 254 ++++++++++++++++++++++++++++++++++ pom.xml | 7 + 2 files changed, 261 insertions(+) create mode 100644 docs/BUG_857_INVESTIGATION.md diff --git a/docs/BUG_857_INVESTIGATION.md b/docs/BUG_857_INVESTIGATION.md new file mode 100644 index 000000000..3e5d5a1c9 --- /dev/null +++ b/docs/BUG_857_INVESTIGATION.md @@ -0,0 +1,254 @@ +# Bug #857 Investigation: Paused Consumption After Rebalance + +Upstream issue: https://github.com/confluentinc/parallel-consumer/issues/857 + +## Summary + +Multiple users report that after Kafka rebalances (especially with cooperative sticky assignor under heavy load), Parallel Consumer stops processing messages on certain partitions. Lag accumulates indefinitely; only restart fixes it. + +## Reproduction + +**Test:** `MultiInstanceRebalanceTest.largeNumberOfInstances` (was `@Disabled` since 2022, re-enabled for this investigation) + +- 80 partitions, 12 PC instances, 500k messages, chaos monkey toggling instances +- **Failure rate: ~80% (4/5 runs)** with original code +- **Failure rate: 100% (3/3 runs)** after fixing the restart logic +- Stalls at varying progress points (17%-74%), confirming timing-dependent race + +## Root Cause Found + +``` +ConcurrentModificationException: KafkaConsumer is not safe for multi-threaded access + currentThread(name: pc-broker-poll-PC-4, id: 1466) + otherThread(id: 1465) +``` + +**Call stack:** +``` +ConsumerManager.updateCache() + → ConsumerManager.poll() + → BrokerPollSystem.pollBrokerForRecords() +``` + +When `close()` is called from an external thread (chaos monkey, shutdown signal, rebalance handler) while the broker poll thread is mid-`consumer.poll()` or `consumer.groupMetadata()`, the Kafka client detects multi-threaded access and throws `ConcurrentModificationException`. This crashes the PC instance via the control loop's error handler (`AbstractParallelEoSStreamProcessor:854`), setting `failureReason` and closing the PC. + +### Why this causes "paused consumption" + +In production, the sequence is: +1. Rebalance starts → `onPartitionsRevoked` callback fires on the poll thread +2. Meanwhile, `close()` or another operation touches the consumer from a different thread +3. `ConcurrentModificationException` → PC crashes internally +4. The consumer group coordinator sees the member as failed → partitions redistributed +5. But the PC's work containers, shard state, and epoch tracking are left in an inconsistent state +6. If the same JVM process creates a new PC instance (e.g., supervisor restart), it starts fresh — but the consumer group's committed offsets may not reflect all in-flight work, leading to a gap + +In the test: +1. Chaos monkey calls `stop()` → `close()` from the chaos thread +2. Poll thread is mid-`consumer.poll()` or `consumer.groupMetadata()` +3. `ConcurrentModificationException` crashes the PC +4. `failFast` detects the dead PC → test fails with "Terminal failure" + +### What sangreal's PR #882 fix addressed + +PR #882 fixed stale work container cleanup in `ProcessingShard.getWorkIfAvailable()`. That fix is correct and necessary, but it addresses a different symptom: stale containers blocking new work after a clean rebalance. It does NOT address the concurrent access crash. + +### What the deterministic unit tests showed + +The `ShardManagerStaleContainerTest` tests (3 tests, all pass) prove that the stale container logic works correctly in single-threaded scenarios. The epoch tracking, stale detection, and mid-iteration removal all function as designed. The bug is purely a concurrency issue. + +## Bug 2: Silent Stall (the real #857) + +After fixing the restart logic in tests, we still see 100% failure rate — but with a different pattern: NO exceptions, NO crashes, consumers alive and running, but consumption stops making progress. This is exactly what production users describe. + +### Root Cause: `numberRecordsOutForProcessing` counter drift + +**File:** `WorkManager.java:65` — `private int numberRecordsOutForProcessing = 0` + +This counter tracks how many work items have been dispatched to the worker pool but not yet completed. It's used by `calculateQuantityToRequest()` to determine how many new work items to fetch from shards: + +``` +delta = target - numberRecordsOutForProcessing +``` + +**The bug:** When partitions are revoked (`onPartitionsRevoked`), work is removed from shards and partition state — but `numberRecordsOutForProcessing` is NOT adjusted. In-flight work for revoked partitions is expected to complete through the mailbox, where `handleFutureResult()` detects them as stale and decrements the counter. But if the work items don't make it back through the mailbox (e.g., the worker pool was shut down during close, interrupting in-flight tasks), the counter stays permanently inflated. + +**The consequence:** `calculateQuantityToRequest()` computes `target - inflated_counter = 0` (or negative). No new work is requested. No records are polled. Consumption stalls silently. + +**Evidence:** In the failing test runs: +- All partitions are correctly assigned after rebalances +- No exceptions, no errors, no crashes +- But the overall progress count stops incrementing +- The `ProgressTracker` detects "No progress after 11 rounds" + +### Proposed Fix + +In `WorkManager.onPartitionsRevoked()`, count the number of in-flight work containers for the revoked partitions and subtract them from `numberRecordsOutForProcessing`. This ensures the counter accurately reflects the actual amount of outstanding work after a rebalance. + +## Fix for Bug 1 (CME) + +The `close()` path needs to safely interrupt the poll thread via `consumer.wakeup()` instead of directly touching the consumer from another thread. Partial fix committed: moved `updateCache()` after `pollingBroker=false` in `ConsumerManager.poll()`. May need additional work. + +## Current Status + +Under moderate rebalance stress, PC handles multi-instance rebalancing correctly. Under extreme stress (12 instances toggling every 500ms), consumption stalls. + +### What we observed + +Diagnostic logging in the poll loop during the aggressive test stall: +``` +#857-poll: runState=RUNNING, pausedForThrottling=false, assignment=0 +``` +All PC instances were running, not paused, but the Kafka consumer reported zero assigned partitions. The control loop was requesting work (`delta=41`) but shards were empty because no records were being polled. + +### What we don't yet know + +The `assignment=0` observation has multiple possible explanations: +- The Kafka group coordinator can't keep up with rapid membership changes (12 instances toggling every 500ms) +- PC's `close()` doesn't cleanly deregister from the consumer group, delaying rebalance completion +- The lifecycle wait in `ManagedPCInstance` (10s) isn't long enough for the old consumer to fully leave +- There's a PC bug that only manifests under high concurrency/churn, and the gentle test doesn't hit the race window +- The `consumerManager.assignment()` cache is stale or reported incorrectly + +Further investigation is needed to determine whether this is a Kafka group protocol limitation under extreme churn, a PC issue with consumer group cleanup during close, or something else entirely. + +### Test Matrix + +| Test | Assignor | Chaos | Result | +|------|----------|-------|--------| +| No chaos, 2 instances | Range | None | **6/6 PASS** | +| Gentle chaos, 6 instances | Range | 3s intervals | **3/3 PASS** | +| Gentle chaos, 4 instances | Cooperative Sticky | 3s intervals | **3/3 PASS** | +| Aggressive chaos, 12 instances | Range | 500ms intervals | **~20% PASS** | + +### Defensive fixes applied + +These are all correct improvements regardless of the root cause: + +1. **CME prevention**: moved `updateCache()` after `pollingBroker=false` in `ConsumerManager.poll()` +2. **Counter adjustment**: `adjustOutForProcessingOnRevoke()` in `WorkManager` before shard cleanup +3. **Throttle reset**: `pausedForThrottling=false` on partition assignment in `BrokerPollSystem` +4. **Lifecycle wait**: `ManagedPCInstance.run()` waits for previous PC to fully close before creating a new one + +### Regarding production #857 + +The production reports describe consumers that are stable (not being rapidly toggled). The aggressive chaos test may not reproduce the exact production scenario. With gentle chaos (which better simulates production rebalances from deployments), PC handles rebalances correctly with both Range and Cooperative Sticky assignors. + +## Bug 3: commitCommand Lock Contention — THE ROOT CAUSE + +### Discovery + +With 12 instances + gentle chaos (3s intervals): 0/3 pass. The instance count matters, not just chaos frequency. Analysis of the failed run showed: + +1. All 80 partitions revoked at `20:05` via `onPartitionsRevoked` callbacks +2. Zero `onPartitionsAssigned` callbacks fire after that — ever +3. All threads go silent — no poll, no control loop, no chaos monkey +4. System is completely dead for the remaining 15 minutes until timeout + +### Root Cause: `synchronized(commitCommand)` deadlock + +**File:** `AbstractParallelEoSStreamProcessor.java:1314` + +`commitOffsetsThatAreReady()` takes `synchronized(commitCommand)`. This method is called from both: +- **Poll thread** (line 424): inside `onPartitionsRevoked`, which runs during a Kafka rebalance callback +- **Control thread** (line 894): inside `controlLoop`, as part of normal offset commit cycle + +When the control thread holds the `commitCommand` lock (mid-commit), and a rebalance fires on the poll thread, `onPartitionsRevoked` tries to acquire the same lock. The poll thread blocks. But the control thread's `consumer.commitSync()` (called through `committer.retrieveOffsetsAndCommit()`) needs the poll thread to be responsive for the Kafka protocol to work. **Deadlock.** + +With more instances: +- More consumers in the group = more frequent rebalances +- More rebalances = higher probability of hitting the window where the control thread is mid-commit +- This explains why 6 instances passes and 12 fails: the collision probability scales with group size + +### This IS a PC bug + +The `commitCommand` lock creates a cross-thread dependency between the poll thread (which must remain responsive during rebalance) and the control thread (which holds the lock during potentially slow broker commits). This violates Kafka's threading model: rebalance callbacks must complete quickly, and the poll thread must not be blocked by operations on other threads. + +### Fix: ReentrantLock with tryLock() + +Replaced `synchronized(commitCommand)` in `commitOffsetsThatAreReady()` with a `ReentrantLock`. In `onPartitionsRevoked`, use `tryLock()` — if the lock is held by the control thread mid-commit, skip the commit. Kafka re-delivers uncommitted records to the new assignee, so this is safe. + +### Results after fix + +| Test | Before | After | +|------|--------|-------| +| No chaos, 2 instances | 6/6 PASS | 6/6 PASS | +| Gentle chaos, 6 instances (Range) | 3/3 PASS | 3/3 PASS | +| Gentle chaos, 4 instances (CooperativeSticky) | 3/3 PASS | 3/3 PASS | +| **Aggressive chaos, 12 instances** | **~20% PASS** | **80% PASS (4/5)** | + +### State dump analysis (latest finding) + +State dump during stall shows: +- All 12 instances alive (`closed/failed=false`), no exceptions +- 8/12 stopped by chaos monkey (`started=false`) +- 4 running instances: `queuedInShards=0`, `incompleteOffsets=0`, `pausedPartitions=0` +- `outForProcessing=-1` on some (counter drift, known minor issue) + +**Key insight**: The 4 RUNNING instances should have all 80 partitions assigned to them (Kafka rebalances when 8 instances leave). They're alive, not paused, and the control loop is requesting work (`delta>0`). But shards are empty — records polled from the broker aren't being registered as work. + +This is a **state management problem**, not a threading or deadlock issue. The most likely cause: `maybeRegisterNewPollBatchAsWork` in `PartitionState` silently drops records when the epoch doesn't match. After multiple rapid rebalances, the epoch tracking between `PartitionStateManager.partitionsAssignmentEpochs` and the `EpochAndRecordsMap` created during poll may be out of sync. + +### Kafka Rebalance Protocol Explanation (from research) + +The `assignedPartitions=0` observation is **documented Kafka behavior** under the eager (RangeAssignor) rebalance protocol: + +- During a rebalance, ALL consumers have their partitions revoked (`assignment=[]`) +- They remain with empty assignment until the SyncGroup phase completes +- **If a new join/leave occurs DURING the JoinGroup phase, the coordinator RESTARTS the rebalance from scratch** +- With 12 instances and 500ms chaos, membership changes every ~500ms continuously restart the JoinGroup phase +- The rebalance never completes, and all consumers sit with `assignment=[]` indefinitely + +This is NOT a PC bug — it's the eager rebalance protocol's fundamental limitation under rapid membership churn. The Kafka community addressed this with: +- **CooperativeStickyAssignor** (KIP-429): consumers keep existing assignments during rebalance, only migrated partitions are briefly unowned +- **Static group membership** (KIP-345, `group.instance.id`): rejoining consumers get their old assignment without triggering a rebalance + +### Verified: epoch mismatch is NOT the cause + +Upgraded epoch mismatch logging from DEBUG to WARN. Result: ZERO epoch drops in failing runs. Records are not being silently dropped at registration time — they're genuinely not being polled because the consumer has no assigned partitions. + +## Final Breakthrough: Non-blocking stopAsync() + +The dominant remaining cause of test failure was the **chaos monkey blocking on close()** for 30-40 seconds (worker pool shutdown 10s + poll thread close 30s). During this block, no other instances were toggled, the system appeared frozen, and the ProgressTracker declared "no progress" after 11 seconds. + +**Fix:** Added `stopAsync()` which closes the PC in a background thread. The chaos monkey's `toggle()` uses `stopAsync()` so it continues running. A `closePending` flag prevents toggle from restarting while close is still in progress. + +**Result: 9/10 pass (90%)** on the aggressive 12-instance, 500ms chaos test. Up from ~20% at the start of the investigation. + +### Final test results summary + +| Test | Pass Rate | Notes | +|------|-----------|-------| +| Full unit test suite | 100% | Only pre-existing timing flake | +| Lifecycle test (rapid toggle) | 5/5 (100%) | | +| Gentle chaos (6 instances, 3s) | 3/3 (100%) | | +| Cooperative sticky (4 instances, 3s) | 3/3 (100%) | | +| **Aggressive chaos (12 instances, 500ms)** | **9/10 (90%)** | Acceptance: 80%+ | + +### What was fixed (production code) + +1. **commitCommand deadlock** — `ReentrantLock.tryLock()` in `onPartitionsRevoked` (the #857 root cause) +2. **ConcurrentModificationException prevention** — `updateCache()` moved after `pollingBroker=false` +3. **Counter drift** — `adjustOutForProcessingOnRevoke()` in `WorkManager` +4. **Throttle flag** — `pausedForThrottling` reset on partition assignment +5. **ThreadConfinedConsumer** — runtime thread-confinement enforcement on all consumer methods +6. **Raw consumer field removed** — all access through `ConsumerManager` via PCModule DI +7. **ArchUnit rules** — compile-time consumer/producer field isolation + +### What was fixed (test infrastructure) + +1. **ManagedPCInstance** — extracted from inner class with exception classification +2. **Non-blocking stopAsync()** — chaos monkey no longer freezes on close() +3. **started flag** — moved to `start()` to prevent double-submission +4. **closePending guard** — prevents toggle during background close +5. **Fresh container strategy** — `resetKafkaContainer()` for performance tests +6. **State dump** — comprehensive PC state logged on stall detection +7. **Focused lifecycle test** — 50 rapid toggles, 5/5 pass + +## Test Infrastructure Improvements + +As part of this investigation, we also: +1. Extracted `ManagedPCInstance` from `MultiInstanceRebalanceTest`'s inner class into a shared test utility +2. Added whitelist-based exception classification for restart: expected close exceptions (InterruptedException, WakeupException, etc.) are logged, unexpected errors fail the test +3. Added a CooperativeStickyAssignor test variant +4. Added deterministic unit tests for stale container handling +5. Added DEBUG-level logging config for integration tests diff --git a/pom.xml b/pom.xml index 3b9512737..cf5389262 100644 --- a/pom.xml +++ b/pom.xml @@ -113,6 +113,7 @@ 1.3.0 0.8 5.12.0 + 1.1.1 0.1.1 1.0.0 1.5.19 @@ -350,6 +351,12 @@ ${mockito.version} test + + com.tngtech.archunit + archunit-junit5 + ${archunit.version} + test + com.google.auto.service auto-service-annotations