Skip to content

Latest commit

 

History

History
537 lines (423 loc) · 21.4 KB

File metadata and controls

537 lines (423 loc) · 21.4 KB

Workflow Writing Notes

These notes apply to workflow code in general.

bin/workflow boots Rails through the internal workflow_cli runtime profile. That profile is headless: it skips the dashboard/Mission Control web stack, web-only gems, and app route registration, and it keeps ActionController::Base.include_all_helpers = false so framework eager-load does not scan app helpers. Unlike the slimmer jobs profile used by bin/jobs-worker and bin/jobs-scheduler, workflow_cli still leaves lib/r3x/workflow/cli.rb available for the Thor wrapper.

Temporarily Disabling A Workflow

  • To disable a workflow without deleting it, add a top-of-file pragma near the start of workflow.rb:

    # r3x:disable Reason for disabling
  • R3x::Workflow::PackLoader scans the first lines of each entrypoint and skips files with this pragma, so the workflow file is not required, not registered, and not scheduled.

  • Keep the pragma near the top of the file (before constants/classes) so it is easy to spot during reviews and maintenance.

Workflows Are Jobs

  • Workflows inherit from R3x::Workflow::Base and implement #run.
  • R3x::Workflow::Base is an ApplicationJob and includes ActiveJob::Continuable.
  • The current execution context is available as ctx on the workflow instance during perform, so helper methods can use it without threading it through every call.
  • Use step around boundaries that should be resumable, such as external API calls, slow network work, or other distinct phases.
  • Keep steps small and meaningful. A step should mark a real unit of progress, not just wrap every line.
  • Use condition for high-level guards that must be true before run starts. Conditions are evaluated only on the initial execution, not when ActiveJob::Continuable resumes an isolated or interrupted step.

Step Semantics

  • step is a resumable boundary, not a value-return helper.

  • Use isolated: true for steps that should run in their own execution after prior progress is serialized. Pair it with workflow-specific resume_options when the next step should run after a deliberate delay instead of blocking a worker with sleep.

  • Do not assign the result of a step block to a variable and assume it is the block result.

  • Put data-fetching logic in a normal helper, then use step around the resumable work that consumes that data.

  • If you see true.select or nil.select in a workflow crash, check whether a step block was used as if it returned the fetched value.

    # Good
    raw_events = fetch_from_apify
    
    step :process_events do |step|
      process_events(Array.wrap(raw_events), step)
    end
    
    # Bad
    raw_events = step :fetch_events do
      fetch_from_apify
    end

Conditions

Use condition for workflow-level preconditions that must be true before any work starts:

condition :nobody_home?, reason: "Somebody is home"

The predicate is an instance method, so it can use ctx and workflow helpers. If the predicate returns false, the workflow returns { "status" => "skipped", "reason" => reason } and logs the skip instead of calling run. Conditions do not run during Continuable resumes, so they are safe to combine with delayed or isolated steps.

Completion Callbacks

Use on_complete for tiny side effects that should happen only after the whole workflow succeeds:

on_complete { ctx.client.healthchecks_io(HEALTHCHECK_UUID).ping }

Completion callbacks run after run returns and before the workflow logs Workflow run completed. Blocks execute on the workflow instance, so they can use ctx, constants, and private workflow helpers. Multiple callbacks run in declaration order.

on_complete does not run when condition skips the workflow, when run raises, or when ActiveJob::Continuable interrupts before the full workflow completes. If a completion callback raises, the workflow fails instead of logging a false success.

Available Helpers

  • ctx
    • The current workflow execution context.
    • Available during perform / run.
    • Use it to access clients and runtime data without passing context through every helper method.
  • step
    • Marks a resumable boundary using ActiveJob::Continuable.
    • Use it around slow or externally dependent phases that should resume cleanly after interruption or retry.
    • Use isolated: true plus resume_options = { wait: ... } when the next step should be queued for later instead of sleeping inside the current job.
  • condition
    • Declares a workflow-level guard that must return true before run on the initial execution.
    • Requires a positive predicate method and a human-readable reason for skipped runs.
  • on_complete
    • Runs a block after successful full workflow completion.
    • Keep it for small final side effects, such as success healthcheck pings.
    • A raised callback error fails the workflow.
  • with_cache
    • Wraps an expensive block in Rails.cache so repeated local runs can reuse the same result.
    • Good for slow, noisy, or hard-to-reproduce API calls while iterating on a workflow.
    • bin/workflow run --skip-cache <path> bypasses all with_cache blocks for that run without editing the workflow.
    • R3X_SKIP_CACHE=true does the same override at the env level.
    • In production, with_cache still raises by default unless R3X_SKIP_CACHE=true is set.
    • Use with_cache(force: true) when you need to refresh a stale cached value.
  • ctx.durable_set(name = :default, ttl: 90.days)
    • Returns a workflow-scoped durable set backed by Rails.cache.
    • Good for remembering which items were already processed, sent, uploaded, or otherwise handled across workflow runs.
    • Members are scoped by workflow key and set name, so different workflows and different sets do not collide.
    • When the app uses :solid_cache_store, custom ttl: values must not exceed config/cache.yml store_options.max_age.
    • Use include?, add, and delete on the returned set.
  • Prefer this for best-effort dedup across runs; prefer a real table only when you need permanent history or hard uniqueness guarantees.

Reusing HTTP Clients

  • ctx.client.http returns a new instance of the HTTP client (R3x::Client::Http) on every call.

  • When making requests in a loop, instantiate the client once outside the loop so timeout and SSL configuration is built once and the workflow code stays clear.

  • Good:

    http = ctx.client.http
    urls.map do |url|
      response = http.get(url)
      # ...
    end
  • Bad:

    urls.map do |url|
      response = ctx.client.http.get(url)
      # ...
    end
  • For ad hoc batch work that benefits from a persistent HTTPX session, use ctx.client.persistent_http so connection lifecycle is explicit:

    ctx.client.persistent_http(timeout: 30) do |http|
      urls.each { |url| http.get(url) }
    end
  • Do not add persistent: options to thin clients such as Discord, Google Translate, Prometheus, or VictoriaLogs unless there is a measured hot path that performs many requests in one controlled scope.

Example Workflow

See docs/workflows/example_multi_step_digest.md for a full worked example that combines step, with_cache, ctx.durable_set, structured LLM output, and multiple ctx.client.* integrations in one workflow.

Running Workflows Locally

Use bin/workflow for local workflow development. It boots Rails through the workflow_cli runtime profile, loads the requested workflow file, and runs the workflow in-process.

export R3X_WORKFLOW_PATHS="$PWD/workflows"
bin/workflow list
bin/workflow info <workflow_key>
bin/workflow run workflows/<workflow_name>/workflow.rb
bin/workflow run --dry-run workflows/<workflow_name>/workflow.rb
bin/workflow run --skip-cache workflows/<workflow_name>/workflow.rb

For an included workflow in this checkout:

bin/workflow info porto_santo_news
bin/workflow run --dry-run workflows/porto_santo_news/workflow.rb
  • list and info load workflow packs from R3X_WORKFLOW_PATHS.

  • run always takes a direct path to a workflow.rb file.

  • In development and test, bin/workflow run defaults to dry-run, so dry-run-aware clients avoid real side effects.

  • --dry-run explicitly enables dry-run for that run (R3X_DRY_RUN=true).

  • --no-dry-run explicitly disables dry-run for that run (R3X_DRY_RUN=false), even in development.

  • --skip-cache sets R3X_SKIP_CACHE=true for that run and bypasses with_cache.

  • Use --dry-run --skip-cache together when you want a fresh, low-risk local run:

    bin/workflow run --dry-run --skip-cache workflows/<workflow_name>/workflow.rb
  • To run with real delivery in development, use --no-dry-run or set R3X_DRY_RUN=false:

    R3X_DRY_RUN=false bin/workflow run workflows/<workflow_name>/workflow.rb

Testing Workflows

Workflow packs can have their own Minitest tests next to the workflow code:

workflows/
  example_digest/
    workflow.rb
    test/
      workflow_test.rb

Pack-local tests should require the Rails environment and the workflow file, then instantiate the workflow directly. Use small fakes for workflow clients so tests verify behavior without calling external services.

#!/usr/bin/env ruby

require "bundler/setup"
require "minitest/autorun"
require_relative "../../../config/environment"
require_relative "../workflow"

class ExampleDigestTest < Minitest::Test
  def test_declares_schedule_trigger
    schedule = Workflows::ExampleDigest.triggers.find(&:cron_schedulable?)

    assert schedule
    assert_equal "0 8 * * *", schedule.cron
  end
end

Run a pack-local test directly:

ruby workflows/<workflow_name>/test/workflow_test.rb

Use Minitest for workflow tests. Prefer real parsing and transformation code with fake client boundaries over stubbing the whole workflow. A useful test should fail if the workflow stops filtering, deduplicating, rendering, or delivering the expected thing.

Local Secret Parity With Vault

For workflows that depend on real integration credentials, Vault gives local development strong dev/prod parity. When local env points at the same Vault address and secret path as production, the app can boot, log in to Vault, and hydrate ENV with the same secret names used by production pods.

That means you can test a workflow locally against the same credential contract that production uses:

export R3X_VAULT_ADDR=http://vault.example.internal:8200
export R3X_VAULT_SECRETS_PATH=secret/data/env/r3x
export R3X_VAULT_AUTH_METHOD=kubernetes
export R3X_VAULT_KUBERNETES_ROLE=r3x

bin/workflow run --dry-run workflows/<workflow_name>/workflow.rb

The exact auth mode depends on your environment. Token auth is also supported for local operator work:

export R3X_VAULT_ADDR=http://vault.example.internal:8200
export R3X_VAULT_TOKEN=<token>
export R3X_VAULT_SECRETS_PATH=secret/data/env/r3x

Keep the distinction clear:

  • Vault parity means the same secret names and values can be loaded locally and in production.
  • bin/workflow run --dry-run should still be the default local command for workflows with side effects.
  • Dry run protects delivery behavior; Vault parity protects configuration drift.
  • Use just vault_check to verify Vault connectivity and visible secret keys without printing secret values.
  • See docs/deployment.md#vault-secrets for the full Vault setup.

Inline Parsing

  • For small extraction chains, prefer presence and chained fallbacks over repeated blank? branches.

  • Keep simple parsing close to the data source unless the logic is genuinely reusable.

  • Good:

    body = normalize_text(node.at_xpath("./description")&.inner_html).presence ||
      normalize_text(node.at_xpath("./encoded")&.inner_html).presence ||
      normalize_text(node.at_xpath("./title")&.text)
  • Bad:

    body = normalize_text(node.at_xpath("./description")&.inner_html)
    body = normalize_text(node.at_xpath("./encoded")&.inner_html) if body.blank?
    body = normalize_text(node.at_xpath("./title")&.text) if body.blank?

Fail Fast

  • Prefer letting workflows fail loudly.
  • Avoid broad rescue blocks that hide the original problem.
  • Only rescue when translating a known boundary error into a clearer domain failure or cleanup.

Debugging And Caching

  • Prefer with_cache only around clearly expensive or noisy calls, not around the whole workflow.
  • Prefer ctx.durable_set for cross-run item dedup, not with_cache.
  • The normal workflow is:
    • add with_cache around the slowest boundary while iterating
    • use bin/workflow run --skip-cache <path> when you want a fresh uncached run
    • leave the helper in place if it remains useful for future debugging
  • For durable dedup, use a stable member key from the item itself, such as a URL digest or external post ID, and add it only after the relevant side effect succeeds.
  • If a cached block becomes confusing or hides too much behavior, remove it instead of stacking more flags or conditions around it.
  • When a workflow suddenly sees a boolean or nil where an array should be, inspect the nearest step boundary first before blaming the external API.

Schedule Timezones

  • trigger :schedule accepts an optional timezone:.

  • Timezones may be IANA names like Europe/Paris or Rails names like Pacific Time (US & Canada).

  • Rails-style names are normalized to canonical TZInfo names before scheduling.

  • If timezone: is omitted, R3X_TIMEZONE is used when present.

  • If the cron string already embeds a timezone, that embedded timezone wins over R3X_TIMEZONE.

  • Use one of these styles, not both:

    trigger :schedule, cron: "every day at 9am Europe/Paris"
    trigger :schedule, cron: "every day at 9am", timezone: "Europe/Paris"
  • If both timezone: and the cron string specify timezones, configuration fails fast.

Logging

  • Prefer logger.debug { ... } for debug logs so the message is lazy evaluated.
  • Use block form when the log string is expensive to build or includes interpolated values.
  • For info, warn, and error, use eager string logging when the message should always be emitted.
  • Workflow execution already carries tagged context such as r3x.run_active_job_id and r3x.trigger_key for indexed log correlation in the dashboard. Orchestration jobs also tag lines with r3x.workflow_key where that broader workflow-level context is useful. Prefer logging through the existing Rails logger so those tags stay attached to emitted lines.
  • App logs are always emitted as structured JSON with explicit level, message, and tags.
  • Dashboard run logs read the explicit level from structured log payloads. They do not infer levels from free-form message text.

Pretty-Printing Hashes And Structures

  • When logging hashes or structured data, avoid manually interpolating individual fields.

  • amazing_print is preloaded globally — you can call .ai(...) on any object without adding require "amazing_print" yourself.

  • Use .ai(plain: true) so keys are aligned and output is readable, without ANSI colour codes that clutter log files.

    # Good
    logger.info("Camera check result:\n#{result.ai(plain: true)}")
    
    # Bad
    logger.info("Checked camera #{url}, result: #{result["status"]}, description: #{result["description"]}")
  • If the structure is large, consider logging it on debug instead of info.

LLM Output

  • When a workflow expects structured LLM output, prefer RubyLLM schema support.

  • Use a schema when you want JSON-like data back instead of parsing free-form text by hand.

  • Keep the schema close to the prompt so the expected shape is obvious.

  • Define new workflow schemas with R3x::Workflow::LlmSchema.define.

  • This is the current convention because it keeps ruby_llm-schema off the boot path for workflows that do not use structured LLM output.

  • Older direct inheritance from RubyLLM::Schema still works, but treat it as legacy in new code.

  • For nested JSON, define the shape with array and object blocks inside the helper block, then pass that schema to message(...).

  • Read the parsed structured result from response.content; avoid manual JSON parsing when the schema already captures the shape.

    WeeklyDigestSchema = R3x::Workflow::LlmSchema.define do
      array :EN do
        object :entry do
          string :name
          string :location
          string :date_time
        end
      end
    
      array :PT do
        object :entry do
          string :name
          string :location
          string :date_time
        end
      end
    end

Dry Run

  • Side-effecting workflow helpers should support dry_run when it makes sense.
  • Resolve defaults with R3x::Policy.dry_run_for(:key, dry_run).
  • Not every client supports dry run.
  • If a client does not support it, say so clearly in the workflow or helper instead of assuming it will no-op.

Retry Fragile Operations

  • Network calls and other flaky external interactions should be wrapped with the retryable gem (already in the Gemfile).

  • Prefer Retryable.retryable(...) over manual begin/rescue/retry loops in workflow code.

  • Use it for HTTP requests, API calls, file downloads, or any operation where transient failures are expected and safe to retry.

  • Basic usage:

    Retryable.retryable(tries: 3, on: [HTTPX::TimeoutError, HTTPX::ConnectionError]) do
      connection.get("/api/data").body
    end
  • Common options:

    • tries — total attempts (default 2). Set to 3 for two retries.
    • on — exception class or array of classes to catch (default StandardError).
    • sleep — seconds between retries (default 1). Use 0 to skip pauses, or a lambda for exponential backoff: lambda { |n| 4**n }.
    • matching — retry based on exception message: matching: /timeout/i.
    • not — exceptions that should never be retried, takes precedence over on.
  • Block receives two optional arguments: retry count so far and the last exception:

    Retryable.retryable(tries: 4, on: HTTPX::HTTPError) do |retries, exception|
      logger.debug { "Attempt #{retries} failed: #{exception}" } if retries > 0
      http.get("/endpoint")
    end
  • For logging retries, use log_method:

    log_method = lambda do |retries, exception|
      logger.debug { "[Attempt ##{retries}] Retrying: #{exception.class} - #{exception.message}" }
    end
    
    Retryable.retryable(tries: 3, on: HTTPX::TimeoutError, log_method: log_method) do
      http.get("/endpoint")
    end
  • Avoid retrying operations that are not idempotent or that cause external side effects (e.g. sending emails, creating records) unless the remote API guarantees idempotency.

  • Full documentation: https://github.com/nfedyashev/retryable

LLM Retry

ruby_llm has built-in automatic retry through its Faraday middleware. Defaults are applied per RubyLLM::Context inside R3x::Client::Llm, so every workflow run gets an isolated copy. Processes that never call ctx.client.llm do not load the gem at all.

The retry defaults are set in app/lib/r3x/client/llm.rb inside the RubyLLM.context block. Only retry_interval has a project-level override (the gem default is 0.1); everything else uses the gem defaults.

This means the first retry waits 60 seconds, the second waits 120 seconds, then it gives up. The gem automatically retries on transient provider errors:

  • RubyLLM::RateLimitError (HTTP 429)
  • RubyLLM::ServerError (HTTP 500)
  • RubyLLM::ServiceUnavailableError (HTTP 502-504)
  • RubyLLM::OverloadedError (HTTP 529)
  • Network timeouts and connection failures

Per-workflow override

If a particular workflow needs different retry behavior, pass overrides directly to ctx.client.llm(...):

response = ctx.client.llm(
  api_key_env: "GEMINI_API_KEY_MICHAL",
  max_retries: 5,
  retry_interval: 30.0
).message(
  model: "gemini-3-flash-preview",
  prompt: prompt
)

Any option passed this way overrides the default for that single R3x::Client::Llm instance. The rest of the call stays the same -- the retry is handled transparently by ruby_llm.

OpenAI-compatible provider aliases can carry their own RubyLLM routing defaults and are automatically resolved based on the API key environment variable name:

response = ctx.client.llm(
  api_key_env: "OPENCODE_GO_API_KEY"
).message(
  model: "deepseek-chat",
  prompt: prompt
)

The provider is dynamically inferred from the environment variable name prefix (e.g. OPENCODE_GO to :opencode_go), routing the request through the lazy registered custom provider adapter while allowing the workflow to choose the exact token env variant (such as OPENCODE_GO_API_KEY, OPENCODE_GO_API_KEY_PROJECTA, or OPENCODE_GO_API_KEY_PROJECTB). Dynamic providers use RubyLLM's provider-level model assumption hook, so workflows can pass provider-specific model IDs without maintaining a static model registry.

Return Value

  • Do not design #run to return a special metadata hash, status object, or summary structure.
  • If the last expression happens to return a value (for example an array from filter_map or the result of a helper), that is acceptable, but do not write run specifically to produce a return value unless the user explicitly asks for one.
  • Workflows are side-effect driven: their purpose is to fetch data, transform it, and deliver it. Prefer logging and monitoring over return-value contracts.
  • If a caller needs to observe what a workflow did, inspect the durable set, the logs, or the downstream system (Discord, email, API) rather than relying on a return value from run.