Conversation

@yruslan (Collaborator) commented Jan 28, 2026

Closes #698

Summary by CodeRabbit

  • New Features

    • Enhanced backfill and late-run scheduling logic for improved accuracy in date range calculations
    • Integrated data availability validation into scheduling decisions
  • Dependencies

    • Upgraded Apache Spark to version 3.5.8
    • Updated CI/CD pipeline to support Spark versions 2.4.8, 3.4.4, and 3.5.5


@coderabbitai bot commented Jan 28, 2026

Walkthrough

The PR modifies backfill scheduling logic in ScheduleStrategySourcing to enforce backfill.days limits by introducing conditional branching on the lateOnly flag, removes lastProcessedDate from the getBackFillDays method signature, updates test utilities and expectations to align with the new behavior, and bumps Spark version dependencies.
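To make the described branching concrete, here is a minimal, hypothetical sketch of how a backfill.days cap can interact with a lateOnly-style flag. The names (lateOnly, backfillDays, getBackFillDays) mirror the walkthrough, but the signatures and logic are assumptions for illustration, not the actual ScheduleStrategySourcing code.

import java.time.LocalDate

// Hypothetical sketch of the behavior described in the walkthrough;
// the real ScheduleStrategySourcing logic may differ.
object BackfillSketch {

  /** Candidate info dates for backfill, capped by `backfillDays`
    * unless it is -1 (unlimited). */
  def getBackFillDays(runDate: LocalDate,
                      minimumDate: LocalDate,
                      backfillDays: Int): Seq[LocalDate] = {
    val earliest =
      if (backfillDays < 0) minimumDate // backfill.days = -1 => unlimited
      else {
        val cap = runDate.minusDays(backfillDays.toLong)
        if (cap.isAfter(minimumDate)) cap else minimumDate // honor the cap
      }
    Iterator.iterate(earliest)(_.plusDays(1))
      .takeWhile(!_.isAfter(runDate))
      .toSeq
  }

  /** For a late-only run, drop the current run date and keep only
    * earlier (late) dates, still within the backfill window. */
  def datesToRun(runDate: LocalDate,
                 minimumDate: LocalDate,
                 backfillDays: Int,
                 lateOnly: Boolean): Seq[LocalDate] = {
    val days = getBackFillDays(runDate, minimumDate, backfillDays)
    if (lateOnly) days.filter(_.isBefore(runDate)) else days
  }

  def main(args: Array[String]): Unit = {
    val runDate = LocalDate.of(2026, 1, 28)
    val minDate = LocalDate.of(2026, 1, 1)
    // Capped: only the 3 late days before the run date are returned.
    println(datesToRun(runDate, minDate, backfillDays = 3, lateOnly = true))
    // Unlimited (-1): everything back to the minimum date.
    println(datesToRun(runDate, minDate, backfillDays = -1, lateOnly = true).size)
  }
}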

Changes

Cohort / File(s) | Summary

  • Core scheduling logic
    pramen/core/src/main/scala/za/co/absa/pramen/core/runner/splitter/ScheduleStrategySourcing.scala
    Reworked backfill branching on the lateOnly flag; adjusted newDays and lateDaysOrig computation; introduced conditional aggregation based on hasInfoDateColumn; removed the lastProcessedDate parameter from the getBackFillDays signature; hardened the backfill no-op conditions.

  • Test runner utilities
    pramen/core/src/test/scala/za/co/absa/pramen/core/tests/runner/jobrunner/TaskRunnerMultithreadedSuite.scala
    Extended the getUseCase helper and the JobSpy constructor to accept and propagate a jobBackfillDays parameter through multi-task test scenarios.

  • Scheduling tests
    pramen/core/src/test/scala/za/co/absa/pramen/core/tests/runner/splitter/ScheduleStrategySuite.scala
    Added data availability mocking via Bookkeeper.getDataAvailability; adjusted backfill/track-day parameters and test expectations to reflect interaction with the mocked data availability (a mocking sketch follows this list).

  • Build configuration
    .github/workflows/scala.yml, pramen/pom.xml
    Reduced the Spark CI matrix from [2.4.8, 3.3.4, 3.4.4, 3.5.5] to [2.4.8, 3.4.4, 3.5.5]; updated the Spark version dependency from 3.5.7 to 3.5.8.
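The scheduling-test change above mentions stubbing Bookkeeper.getDataAvailability. Below is a minimal sketch of that kind of stubbing with Mockito; the BookkeeperLike trait and the method signature are illustrative assumptions, not pramen's actual Bookkeeper API.

import java.time.LocalDate
import org.mockito.Mockito.{mock, when}
import org.mockito.ArgumentMatchers.any

// Hypothetical stand-in for pramen's Bookkeeper; the real
// getDataAvailability signature is an assumption for illustration only.
trait BookkeeperLike {
  def getDataAvailability(table: String, from: LocalDate, to: LocalDate): Boolean
}

object DataAvailabilityMockSketch {
  def main(args: Array[String]): Unit = {
    val bk = mock(classOf[BookkeeperLike])
    // Pretend data is available for any requested range, as a test would
    // when it wants the scheduler to treat dates as already loaded.
    when(bk.getDataAvailability(any[String], any[LocalDate], any[LocalDate]))
      .thenReturn(true)

    println(bk.getDataAvailability("my_table",
      LocalDate.of(2026, 1, 1), LocalDate.of(2026, 1, 28))) // true
  }
}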

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 A bunny hops through backfill days,

No longer lost in endless maze,

With branching logic, clean and tight,

Late jobs now honor limits right!

✨ Schedules hop in perfect time. 🌙

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Out of Scope Changes check: ⚠️ Warning. The PR includes out-of-scope changes unrelated to issue #698: Spark version updates (3.5.7 → 3.5.8, matrix changes) and test infrastructure modifications that do not address the backfill logic issue. Resolution: remove the Spark version updates and workflow matrix changes from this PR and submit them as a separate pull request.

✅ Passed checks (4 passed)

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title clearly and concisely describes the main change (fixing backfill logic for late jobs), which aligns with the primary objective of issue #698.
  • Linked Issues check: ✅ Passed. The changes implement the core requirement of issue #698 by reworking the backfill logic to respect the backfill.days limit for late jobs and allowing unlimited backfill only when backfill.days = -1.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate; docstring coverage check skipped.




@coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@pramen/pom.xml`:
- Line 114: The pom currently sets <spark.version> but uses ${maven.scala.version} for Iceberg; add an explicit <iceberg.version> property (e.g., 1.6.1) and update the Iceberg dependency to reference ${iceberg.version} instead of ${maven.scala.version}; also ensure the spark-avro dependency version matches <spark.version> (3.5.8) exactly so spark-avro uses 3.5.8 (a hedged sketch of this change follows these prompts).

In `@pramen/project/Versions.scala`:
- Line 22: The project pins Spark 3.5.8 for Scala 2.13
(defaultSparkVersionForScala213) but the Kafka clients version used
(kafkaClientsVersion) is 3.9.0, which mismatches Spark 3.5.x's tested dependency
(3.4.1); update the kafkaClientsVersion constant to "3.4.1" to match Spark 3.5
compatibility or, if you must keep "3.9.0", add a clear comment next to
kafkaClientsVersion and run/record validation tests for SASL/SSL and
exactly-once streaming scenarios to prove compatibility before merging.
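Putting the pom.xml suggestion above into shape, the change might look like the sketch below. The property name iceberg.version and the Scala suffixes on the artifact IDs are assumptions; the version values are the ones named in the prompt and not verified against the actual pom.

<!-- Hypothetical sketch of the suggested pom.xml layout; property names and
     Scala suffixes are assumptions, versions come from the prompt above. -->
<properties>
  <spark.version>3.5.8</spark.version>
  <iceberg.version>1.6.1</iceberg.version>
</properties>

<dependencies>
  <!-- Iceberg runtime pinned via its own property instead of ${maven.scala.version} -->
  <dependency>
    <groupId>org.apache.iceberg</groupId>
    <artifactId>iceberg-spark-runtime-3.5_2.12</artifactId>
    <version>${iceberg.version}</version>
  </dependency>
  <!-- spark-avro kept in lockstep with spark.version -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-avro_2.12</artifactId>
    <version>${spark.version}</version>
  </dependency>
</dependencies>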

  val defaultSparkVersionForScala211 = "2.4.8"
  val defaultSparkVersionForScala212 = "3.3.4"
- val defaultSparkVersionForScala213 = "3.4.4"
+ val defaultSparkVersionForScala213 = "3.5.8"

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, examine the Versions.scala file to understand the full context
find . -name "Versions.scala" -type f | head -5

Repository: AbsaOSS/pramen

Length of output: 89


🏁 Script executed:

# Search for the file and read its contents
if [ -f "pramen/project/Versions.scala" ]; then
  cat -n pramen/project/Versions.scala
fi

Repository: AbsaOSS/pramen

Length of output: 7133


🏁 Script executed:

# Look for dependency mappings in build files
fd -e "build.sbt" -e "build.scala" -e "Dependencies.scala" | head -10

Repository: AbsaOSS/pramen

Length of output: 40


🌐 Web query:

Spark 3.5.8 release Scala 2.13 support official

💡 Result:

Yes—Apache Spark 3.5.8 officially supports Scala 2.13.

  • The official Apache download artifacts for Spark 3.5.8 include a pre-built binary explicitly labeled spark-3.5.8-bin-hadoop3-scala2.13.tgz, published on downloads.apache.org. [2]
  • The official Spark downloads page states that Spark 3.2+ provides an additional pre-built distribution with Scala 2.13 (in addition to the default Scala 2.12 builds). [3]
  • Spark’s official build documentation for the 3.5 line states Spark requires Scala 2.12/2.13. [4]
  • Spark 3.5.8 is an official Spark 3.5 maintenance release (released Jan 15, 2026). [1]

Sources: Spark release notes/news and Apache download listings. [1][2][3][4]


🌐 Web query:

Delta Lake 3.0.0 Spark 3.5 compatibility requirements

💡 Result:

Delta Lake 3.0.0 (Delta Spark 3.0.0) compatibility requirements for Apache Spark 3.5:

  • Spark version: Delta Lake 3.0.x is compatible with Apache Spark 3.5.x (i.e., you must run Spark 3.5 to use Delta 3.0.x). [1][2]
  • Artifact / module: Use the renamed Maven artifact delta-spark (it was previously delta-core). [3]
  • Scala binary version: Pick the artifact that matches your Spark build’s Scala version: delta-spark_2.12 or delta-spark_2.13. [3]
  • Practical implication: If your platform/runtime doesn’t offer Spark 3.5 yet, you generally can’t use Delta Lake 3.0.0 there (you’d need an older Delta line matching your Spark version). [1][2]

Sources: [1] [2] [3] [4]


🌐 Web query:

Apache Iceberg 1.6.1 Spark 3.5 compatibility

💡 Result:

Apache Iceberg 1.6.1 is compatible with Apache Spark 3.5.

  • Iceberg publishes a dedicated Spark 3.5 runtime artifact for 1.6.1 (Scala 2.12): org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1. [1]
  • Spark 3.5 support was added in the Iceberg 1.5.x line (and therefore is included in 1.6.1). [2]

Typical Spark 3.5 usage

spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1

Sources: Maven Central artifact listing [1], Iceberg release notes indicating Spark 3.5 support [2].

[1] https://central.sonatype.com/artifact/org.apache.iceberg/iceberg-spark-runtime-3.5_2.12/1.6.1
[2] https://iceberg.apache.org/releases/


🌐 Web query:

Kafka clients 3.9.0 Spark 3.5 compatibility

💡 Result:

  • What Spark 3.5 ships/tests with: Spark’s Kafka integration artifacts in the Spark 3.5.x line depend on org.apache.kafka:kafka-clients:3.4.1 (e.g., Spark 3.5.0 spark-streaming-kafka-0-10 lists kafka-clients 3.4.1). [1]
  • So is kafka-clients:3.9.0 “Spark 3.5 compatible”? It’s not the version Spark 3.5 is built/tested against, so it’s not “officially compatible” in the sense of being Spark’s pinned dependency. Practically, many users do run Spark with a newer Kafka client by overriding dependencies, but you should treat it as “may work, verify in your environment.” [1][2]
  • Broker-side compatibility note: Kafka 3.9.0 (released Nov 6, 2024) is in the Kafka 3.x line. [3] Kafka’s compatibility docs indicate Kafka 3.x clients are fully compatible (wire/protocol-wise) even with Kafka 4.0 brokers, which is a good sign for broker interoperability—but it doesn’t guarantee Spark’s connector has been tested with that client jar. [4]

If you plan to use kafka-clients:3.9.0 with Spark 3.5: make sure you align all Kafka jars on the classpath (avoid mixing Kafka versions transitively) and run an end-to-end streaming test (SASL/SSL, exactly-once if used, headers, etc.). [2]


Spark 3.5.8 default for Scala 2.13 is confirmed safe; however, kafka-clients version needs alignment.

Spark 3.5.8 officially supports Scala 2.13 (released Jan 2026) and the Delta Lake 3.0.0 and Iceberg 1.6.1 mappings are correct. However, the Kafka clients version at line 116 (kafka-clients:3.9.0) does not align with Spark 3.5's official dependency: Spark 3.5.x is built and tested against kafka-clients:3.4.1, not 3.9.0. While 3.9.0 may work in practice, this creates a version mismatch and should be verified in your Spark streaming scenarios (SASL/SSL, exactly-once semantics). Consider either pinning kafka-clients to 3.4.1 for Spark 3.5 compatibility or explicitly validating 3.9.0 in your environment.
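If the pin is adopted, the change in project/Versions.scala could look like the sketch below; the enclosing object name is an assumption, and the version values come from the analysis above.

// Hypothetical sketch of the suggested pin; the enclosing object name
// is an assumption, the version values come from the analysis above.
object Versions {
  // Spark 3.5.x is built and tested against kafka-clients 3.4.1,
  // so pin the Kafka client to match rather than using 3.9.0.
  val kafkaClientsVersion = "3.4.1"

  val defaultSparkVersionForScala213 = "3.5.8"
}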


@github-actions bot commented Jan 28, 2026

Unit Test Coverage

  Overall Project: 84.1% 🍏
  Files changed: 100% 🍏

Module coverage

  pramen:core (Jacoco Report): 86.04% 🍏

File coverage

  pramen:core (Jacoco Report), ScheduleStrategySourcing.scala: 98.35% 🍏

@yruslan force-pushed the bugfix/698-backfill-logic-for-late-jobs branch from 37071f9 to 4c0a563 on January 28, 2026 at 08:47.
@yruslan merged commit ff58649 into main on January 28, 2026; 7 checks passed.
@yruslan deleted the bugfix/698-backfill-logic-for-late-jobs branch on January 28, 2026 at 09:26.

Development

Successfully merging this pull request may close these issues:

  • Late job logic does not honor backfill.days (#698)
