Add Job workload support to CRUD benchmarking framework #1133

Draft
engineeredcurlz wants to merge 79 commits into main from dipowell/crud-jobs

@engineeredcurlz

Summary

Implements jobs as the third and final planned workload type
(deployment, statefulset, jobs) in the CRUD benchmarking framework.

Changes

  • workload_templates/job.yml — New Kubernetes manifest template for Jobs
    using batch/v1 API. No Service required as Jobs are run-to-completion
    workloads. Uses restartPolicy: Never and JOB_COMPLETIONS placeholder.
    No parallelism field is set, so Kubernetes applies its default of 1
    and runs pods one at a time until completions is reached.

  • node_pool_crud.py — New create_job() method following the same loop
    pattern as create_deployment. Uses complete condition instead of
    available/ready since Jobs terminate after completion. No
    wait_for_pods_ready call since pods exit after the job finishes.

  • main.py — Added jobs subparser with --node-pool-name,
    --number-of-jobs, --completions, and --manifest-dir arguments.
    Added elif command == "jobs" routing in handle_workload_operations.

  • steps/engine/crud/k8s/execute.yml — Added jobs script block that
    calls python3 main.py jobs with the appropriate CLI flags. Added
    number_of_jobs and completions parameters.

  • steps/topology/k8s-crud-gpu/execute-crud.yml — Wires number_of_jobs
    and completions through to the engine step.
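For reference, a Job template along the lines described above might look like the following minimal sketch. Only batch/v1, restartPolicy: Never, and the JOB_COMPLETIONS placeholder come from this PR; the JOB_NAME placeholder, the container name, the busybox image, and the render_job helper are hypothetical stand-ins, not the contents of workload_templates/job.yml.

```python
# Hypothetical Job template; only apiVersion, restartPolicy, and the
# JOB_COMPLETIONS placeholder are taken from the PR description.
JOB_TEMPLATE = """\
apiVersion: batch/v1
kind: Job
metadata:
  name: JOB_NAME
spec:
  completions: JOB_COMPLETIONS
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: busybox
          command: ["sh", "-c", "echo done"]
"""


def render_job(name: str, completions: int) -> str:
    """Fill the placeholders with plain string substitution."""
    if completions < 1:
        raise ValueError("completions must be at least 1")
    return (
        JOB_TEMPLATE
        .replace("JOB_NAME", name)
        .replace("JOB_COMPLETIONS", str(completions))
    )


if __name__ == "__main__":
    print(render_job("crud-job-0", 3))
```

No Service object appears in the sketch, matching the note above that Jobs are run-to-completion workloads.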

Tests (to be added)

4 unit tests to be added to test_azure_node_pool_crud.py:

  • test_create_job_success
  • test_create_job_failure
  • test_create_job_no_client
  • test_create_job_partial_success

Dependencies

This branch is based on test-refactor, which must be merged before
this PR can merge to main. It is independent of
dipowell/crud-statefulset.

yaml.safe_load_all() enters an infinite loop when passed a MagicMock
object because PyYAML treats any input that is not str or bytes as a
file-like stream and calls its .read() method, then loops forever
waiting to buffer enough bytes (len(MagicMock()) returns 0 by default,
and the mock is truthy, so end-of-file is never detected).

Fix by setting create_template.return_value to a valid YAML string in
the three create_deployment tests, so yaml.safe_load_all receives a
real string and parses it via the non-blocking code path.
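
The failure mode can be reproduced without PyYAML. The sketch below uses a simplified stand-in for a buffering reader (reads_until_buffered is a hypothetical helper, not PyYAML's code) and caps the number of reads so the demo terminates.

```python
import io
from unittest.mock import MagicMock


def reads_until_buffered(stream, min_bytes=2, max_reads=5):
    """Simplified stand-in for a reader that buffers input before parsing.

    Returns (reads, finished). The max_reads cap exists only so this demo
    terminates; without it the loop would spin forever on a MagicMock.
    """
    buffered = 0
    eof = False
    reads = 0
    while not eof and buffered < min_bytes:
        if reads >= max_reads:
            return reads, False
        data = stream.read(4096)
        reads += 1
        buffered += len(data)  # len(MagicMock()) is 0, so this never grows
        if not data:           # a MagicMock is truthy, so EOF is never seen
            eof = True
    return reads, True


if __name__ == "__main__":
    print(reads_until_buffered(io.StringIO("kind: Deployment\n")))  # (1, True)
    print(reads_until_buffered(MagicMock()))                        # (5, False)
```

A real string or an empty stream finishes after one read, which is why returning a valid YAML string from the mock avoids the hang.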

Affected tests:
- test_create_deployment_success
- test_create_deployment_failure
- test_create_deployment_partial_success

begin_create_or_update() returns an LROPoller that was being discarded, allowing execution to continue while Azure still had an operation in progress. Subsequent scale/delete calls were then rejected with OperationNotAllowed.

Fix by calling poller.result() in scale_node_pool and _progressive_scale to block until Azure fully completes each operation before proceeding.
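
A minimal sketch of the blocking pattern, assuming the agent pool operations object is passed in directly. FakePoller and FakeAgentPools are hypothetical stubs that stand in for the Azure SDK so the example runs offline; the argument list of scale_node_pool here is illustrative and may differ from the method in node_pool_crud.py.

```python
def scale_node_pool(agent_pools, resource_group, cluster_name, pool_name, parameters):
    """Start the update and block until the long-running operation finishes."""
    poller = agent_pools.begin_create_or_update(
        resource_group, cluster_name, pool_name, parameters
    )
    # Without this call the function returns while the operation is still
    # in progress, and the next scale/delete request can be rejected.
    return poller.result()


class FakePoller:
    """Hypothetical stand-in for an LROPoller."""

    def __init__(self, value, log):
        self._value = value
        self._log = log

    def result(self, timeout=None):
        self._log.append("completed")
        return self._value


class FakeAgentPools:
    """Hypothetical stand-in for the agent pool operations client."""

    def __init__(self, log):
        self._log = log

    def begin_create_or_update(self, resource_group, cluster_name, pool_name, parameters):
        self._log.append("started")
        return FakePoller(parameters, self._log)


if __name__ == "__main__":
    events = []
    scale_node_pool(FakeAgentPools(events), "rg", "cluster", "pool", {"count": 3})
    print(events)
```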

nginx-container was hardcoded in the deployment template and in the create_deployment method.

- add label_selector to parameters
- replace nginx-container in deployment.yaml (label_value)
- derive label_value from selector
- pass label_selector directly
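
One way to derive the value from a selector, as a sketch that assumes a single equality-based selector of the form key=value (for example app=nginx-container). The helper name label_value_from_selector is hypothetical, and the real code may parse the selector differently.

```python
def label_value_from_selector(label_selector: str) -> str:
    """Return the value part of a single 'key=value' label selector."""
    key, sep, value = label_selector.partition("=")
    if not sep or not key.strip() or not value.strip():
        raise ValueError(f"expected 'key=value', got {label_selector!r}")
    return value.strip()


if __name__ == "__main__":
    print(label_value_from_selector("app=nginx-container"))
```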

Implements Job workload creation following the same loop pattern as create_deployment.
Add jobs subparser and elif routing branch that calls create_job.
Add jobs command to execute.yml via a script block that calls main.py jobs.
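
The subparser and routing can be sketched as follows. The flag names and the jobs command come from this PR; the argument types, defaults, and the keyword arguments passed to create_job are assumptions, and the real handle_workload_operations reaches this branch through an elif after the existing workload commands.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="main.py")
    subparsers = parser.add_subparsers(dest="command", required=True)

    jobs = subparsers.add_parser("jobs", help="Create Job workloads")
    jobs.add_argument("--node-pool-name", required=True)
    jobs.add_argument("--number-of-jobs", type=int, default=1)  # assumed default
    jobs.add_argument("--completions", type=int, default=1)     # assumed default
    jobs.add_argument("--manifest-dir", default=None)
    return parser


def handle_workload_operations(node_pool_crud, args):
    """Route the parsed command to the matching workload method."""
    command = args.command
    if command == "jobs":  # an elif branch in the real handler
        # Keyword names below are hypothetical; create_job's real
        # signature is not shown in this PR description.
        return node_pool_crud.create_job(
            node_pool_name=args.node_pool_name,
            number_of_jobs=args.number_of_jobs,
            completions=args.completions,
            manifest_dir=args.manifest_dir,
        )
    raise ValueError(f"unsupported command: {command}")
```

With this shape, the engine step's script block would reduce to a call such as `python3 main.py jobs --node-pool-name <name> --number-of-jobs <n> --completions <m> --manifest-dir <dir>`.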