
fix: added lock to prevent same task run multiple time at the same time #784

Merged
marslanabdulrauf merged 3 commits into main from marslan/10758-git-auto-export on May 4, 2026

Conversation

@marslanabdulrauf (Contributor) commented on Apr 9, 2026

What are the relevant tickets?

https://github.com/mitodl/hq/issues/10758#issuecomment-4201469687
https://github.com/mitodl/hq/issues/10243

Description (What does it do?)

This pull request introduces two main improvements to the git export process for courses: (1) it debounces repeated export tasks triggered by multiple signals during a single course save, ensuring only one export is scheduled per burst, and (2) it adds cleanup of stale git lock files to prevent export failures due to prior crashes.

Debounce logic for export tasks:

  • Added EXPORT_DEBOUNCE_DELAY and EXPORT_DEBOUNCE_CACHE_KEY constants to control and identify the debounce window for export scheduling.
  • Modified export_course_to_git to use Django's cache to ensure only the first export signal within the debounce window schedules a task, with subsequent signals being ignored. The export task is scheduled with a delay to allow the burst of signals to settle.

Robustness improvements for git operations:

  • Introduced a clear_stale_git_lock utility function to remove any leftover .git/index.lock files from previously crashed workers before starting a new export. This prevents git operations from failing due to stale locks.
  • Updated async_export_to_git to call clear_stale_git_lock before exporting, ensuring a clean working directory.
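
For illustration, a minimal sketch of what the clear_stale_git_lock helper described above could look like; the directory layout (the CMS setting GIT_REPO_EXPORT_DIR plus a repo name derived from the git URL) is an assumption here, not necessarily the merged implementation:

import logging
from pathlib import Path

from django.conf import settings

log = logging.getLogger(__name__)


def clear_stale_git_lock(git_url):
    """Remove a leftover .git/index.lock from a previously crashed export worker."""
    # Assumed layout: the export working copy lives under GIT_REPO_EXPORT_DIR and is
    # named after the last path component of the git URL (minus a ".git" suffix).
    repo_name = git_url.rstrip("/").rsplit("/", 1)[-1]
    if repo_name.endswith(".git"):
        repo_name = repo_name[: -len(".git")]
    index_lock = Path(settings.GIT_REPO_EXPORT_DIR) / repo_name / ".git" / "index.lock"

    if index_lock.is_file():
        log.warning("Removing stale .git/index.lock for repo %s at %s", git_url, index_lock)
        try:
            index_lock.unlink()
        except OSError:
            # Don't fail the whole export over a cleanup problem; just report it.
            log.warning("Failed to remove stale git lock at %s", index_lock, exc_info=True)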

How can this be tested?

  1. Prerequisites

    • A local Tutor dev environment with ol_openedx_git_auto_export installed
    • A course with a CourseGitRepository configured and is_export_enabled = True
    • Celery workers running with concurrency >= 2 (to reproduce the original race)
  2. In Studio, make a change on the "Schedule & Details" page (this fires ~10–30 signals).

  3. This should cause errors in the cms-worker logs -- several different kinds of errors.

  4. Check out this branch -- re-install the plugin if needed and restart the cms-workers.

  5. Repeat the process -- you should see 1 new commit, and there will be only 1 error (red) log with a message similar to "Nothing to commit".

Additional Info:

This multiple-signal-fire issue is also being resolved upstream in openedx/openedx-platform#38126.

Comment on lines 42 to 43
course_key = CourseKey.from_string(course_key_string)
course_module = modulestore().get_course(course_key)

Bug: An exception in CourseKey.from_string or modulestore().get_course can cause the acquired distributed lock to not be released, as these calls are outside the try...finally block.
Severity: MEDIUM

Suggested Fix

Move the calls to CourseKey.from_string(course_key_string) and modulestore().get_course(course_key) inside the try block. This ensures that if an exception occurs, the finally block will still execute and release the lock.
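
A minimal sketch of the suggested restructuring, assuming the task body shown in this review; CourseKey, modulestore, LOGGER, and release_export_lock mirror names that appear elsewhere in this PR, and the export call itself is left as a placeholder:

try:
    # Parsing the key and loading the course can both raise; keeping them
    # inside the try guarantees the finally block still releases the lock.
    course_key = CourseKey.from_string(course_key_string)
    course_module = modulestore().get_course(course_key)
    # ... perform the git export for course_module here ...
except Exception:
    LOGGER.exception(
        "Unknown error occurred during async course content export to git (course id: %s)",
        course_key_string,
    )
finally:
    release_export_lock(course_key_string)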


Location: src/ol_openedx_git_auto_export/ol_openedx_git_auto_export/tasks.py#L42-L43

Potential issue: A distributed lock is acquired via `acquire_export_lock_or_schedule` to
prevent concurrent git exports for a course. However, subsequent operations at lines
42-43, specifically `CourseKey.from_string(course_key_string)` and
`modulestore().get_course(course_key)`, can raise exceptions. Because these lines are
executed before the `try` block on line 45, an exception will prevent the `finally`
block (which contains `release_export_lock`) from ever running. This causes the lock to
be held for its full 120-second TTL, blocking any further export tasks for that course
during that period.


@arslanashraf7 (Contributor) commented:

@asadali145 could you review this?

Copilot AI left a comment

Pull request overview

This PR introduces a per-course distributed locking and deduplication mechanism for the async_export_to_git Celery task to prevent concurrent exports for the same course, and adds cleanup for stale Git index lock files that can be left behind after worker crashes.

Changes:

  • Added a cache-based per-course lock with “single pending task” scheduling/deduplication logic.
  • Updated async_export_to_git to be a bound Celery task that uses the new lock helpers and always releases the lock in a finally.
  • Added a helper to remove stale .git/index.lock files prior to export attempts.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

  • src/ol_openedx_git_auto_export/ol_openedx_git_auto_export/utils.py -- Adds cache lock/pending-task helpers and stale git lock-file cleanup.
  • src/ol_openedx_git_auto_export/ol_openedx_git_auto_export/tasks.py -- Uses the new locking helpers around export_to_git and clears stale lock files.
  • src/ol_openedx_git_auto_export/ol_openedx_git_auto_export/constants.py -- Introduces lock/pending TTLs and cache-key templates for the new protocol.


Comment on lines +194 to +201
def release_export_lock(course_key_string):
    """
    Release the per-course git-export distributed lock.

    Must be called in a ``finally`` block after ``acquire_export_lock_or_schedule``
    returns True.
    """
    cache.delete(EXPORT_LOCK_CACHE_KEY.format(course_key=course_key_string))

Copilot AI Apr 14, 2026


EXPORT_LOCK_TIMEOUT is a fixed 120s TTL, but release_export_lock unconditionally deletes the lock key. If an export takes longer than the TTL, the lock can expire and be re-acquired by another worker while the first task is still running; when the first task finishes it will delete the new task’s lock, allowing concurrent exports. Consider storing a unique token (e.g., task_id) as the lock value and only deleting if the current value matches, and/or using a lock mechanism with TTL refresh (or a much larger/configurable timeout) to prevent expiry mid-export.
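
A minimal sketch of the token-checked release suggested here; storing the Celery task id as the lock value and the simplified acquire helper are assumptions for illustration, not the merged implementation:

from django.core.cache import cache

from ol_openedx_git_auto_export.constants import EXPORT_LOCK_CACHE_KEY, EXPORT_LOCK_TIMEOUT


def acquire_export_lock(course_key_string, token):
    """Take the per-course lock, recording which task owns it."""
    return cache.add(
        EXPORT_LOCK_CACHE_KEY.format(course_key=course_key_string),
        token,  # e.g. the Celery task id
        timeout=EXPORT_LOCK_TIMEOUT,
    )


def release_export_lock(course_key_string, token):
    """Delete the lock only if this task still owns it."""
    lock_key = EXPORT_LOCK_CACHE_KEY.format(course_key=course_key_string)
    # get-then-delete is not fully atomic; a backend with compare-and-delete
    # (e.g. Redis via a small Lua script) would close the remaining race window.
    if cache.get(lock_key) == token:
        cache.delete(lock_key)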

Comment on lines +13 to +17
EXPORT_LOCK_TIMEOUT = 120 # seconds; safety TTL if a worker crashes holding the lock
EXPORT_LOCK_RETRY_DELAY = 30 # seconds between retries for the pending task
EXPORT_LOCK_MAX_RETRIES = 5 # max retries for the pending task
# At most one extra task queues behind the running task; all other duplicates drop.
EXPORT_PENDING_TIMEOUT = EXPORT_LOCK_TIMEOUT # match the lock TTL

Copilot AI Apr 14, 2026


EXPORT_LOCK_TIMEOUT = 120 seconds looks very low for a course export and can cause the distributed lock to expire while export_to_git is still running, reintroducing concurrency/race issues. Consider making this configurable (settings/env) and setting it to a value that safely exceeds worst-case export duration, or switching to a lock implementation that renews the TTL while the task is active.
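
A minimal sketch of making the TTL configurable via Django settings, as this comment suggests; the setting name GIT_EXPORT_LOCK_TIMEOUT and the 1800-second default are hypothetical:

from django.conf import settings

# Fall back to a default that comfortably exceeds the worst-case export duration.
EXPORT_LOCK_TIMEOUT = getattr(settings, "GIT_EXPORT_LOCK_TIMEOUT", 1800)  # seconds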

Comment on lines +127 to +131
    if index_lock.exists():
        log.warning(
            "Removing stale .git/index.lock for repo %s at %s", git_url, index_lock
        )
        index_lock.unlink()

Copilot AI Apr 14, 2026


clear_stale_git_lock calls index_lock.unlink() after an existence check but doesn’t handle filesystem errors (permission issues, transient disappearance, read-only volumes), which would fail the whole export. Consider wrapping the unlink in a try/except OSError and logging a warning (and optionally continuing), and using is_file()/missing_ok=True to avoid TOCTOU issues.

Suggested change (before):

    if index_lock.exists():
        log.warning(
            "Removing stale .git/index.lock for repo %s at %s", git_url, index_lock
        )
        index_lock.unlink()

Suggested change (after):

    if index_lock.is_file():
        log.warning(
            "Removing stale .git/index.lock for repo %s at %s", git_url, index_lock
        )
        try:
            index_lock.unlink(missing_ok=True)
        except OSError:
            log.warning(
                "Failed to remove stale .git/index.lock for repo %s at %s",
                git_url,
                index_lock,
                exc_info=True,
            )

        async_create_github_repo.delay(str(course_key), export_course=True)
    except Exception:
        LOGGER.exception(
            "Unknown error occured during async course content export to git (course id: %s)",  # noqa: E501

Copilot AI Apr 14, 2026


Typo in log message: "occured" should be "occurred".

Suggested change (before):

            "Unknown error occured during async course content export to git (course id: %s)",  # noqa: E501

Suggested change (after):

            "Unknown error occurred during async course content export to git (course id: %s)",  # noqa: E501

@asadali145 (Contributor) commented:

@marslanabdulrauf, can you please look into the review comments by Seer and Copilot?

@asadali145 (Contributor) commented:

And let me know if you think those are invalid and I should proceed with the review.

@asadali145 (Contributor) commented:

Here is an analysis of the suggested improvements, produced with the help of Copilot. The 10-30 tasks figure might be exaggerated after our recent upstream fix, but it looks legit and could improve our execution.


💡 Improvements

💡 Debounce at the Signal Level to Eliminate Bursts Before They Hit the Queue

Why the problem happens

COURSE_PUBLISHED fires 10–30 times in a single web request when a teacher saves "Schedule & Details". Tracing the call chain:

COURSE_PUBLISHED signal
  → listen_for_course_publish() [signals.py]
    → export_course_to_git() [utils.py]
      → async_export_to_git.delay()  ← called 10–30 times almost simultaneously

Each signal invocation enqueues a separate Celery task. The PR's lock then discards duplicates inside the worker — but all 30 tasks have already been created, enqueued in the broker, dequeued by workers, and had lock acquisition attempted before being dropped. That's significant overhead for 29 tasks that do nothing.

What debouncing does differently

Debouncing intercepts the burst at the signal handler, before .delay() is ever called, so only 1 task enters the broker:

# In listen_for_course_publish (signals.py) or export_course_to_git (utils.py)
from django.core.cache import cache

# assumed import path for the plugin's own task module
from ol_openedx_git_auto_export.tasks import async_export_to_git

DEBOUNCE_DELAY = 5  # seconds — tune to exceed the publish burst window

def listen_for_course_publish(sender, course_key, **kwargs):
    debounce_key = f"git_export_debounce:{course_key}"

    # cache.add() is atomic — returns True only if the key didn't already exist.
    # The first signal in the burst wins; all subsequent ones are skipped.
    if cache.add(debounce_key, "1", timeout=DEBOUNCE_DELAY):
        async_export_to_git.apply_async(
            args=[str(course_key)],
            countdown=DEBOUNCE_DELAY,  # task runs after the burst window closes
        )
    # else: a task is already scheduled — skip without touching the broker

Side-by-side comparison

Current PR (lock inside the worker):

t=0s:  30 tasks enqueued in broker
t=1s:  Task #1 dequeued → acquires lock → runs export
t=1s:  Task #2 dequeued → lock held → becomes pending, retries at t=31s
t=1s:  Tasks #3–30 dequeued → dropped (lock held, pending slot taken)
t=31s: Task #2 retries → acquires lock → potentially runs a redundant export

With debounce (burst collapsed at signal level):

t=0s:  Signal #1  → cache.add succeeds → 1 task scheduled for t=5s
t=0s:  Signals #2–30 → cache.add fails → skipped (nothing enqueued)
t=5s:  ONE task dequeued and runs

How the two mechanisms complement each other

The distributed lock in the PR is still valuable — it guards against genuine concurrent saves (e.g. two different editors saving a course simultaneously, far enough apart that the debounce window doesn't catch both). The correct role for each:

  • Debounce (signal level) -- guards against the burst of signals from a single save (the common case).
  • Distributed lock (task level) -- guards against true concurrent saves that arrive outside the debounce window.

With both in place: debounce eliminates the burst, the lock is a lightweight safety net for the edge case. Without debounce, the lock is doing heavy lifting it shouldn't need to.


💡 EXPORT_PENDING_TIMEOUT Must Be Longer Than EXPORT_LOCK_TIMEOUT

The current values

EXPORT_LOCK_TIMEOUT = 120     # seconds — safety TTL for the running export's lock
EXPORT_LOCK_RETRY_DELAY = 30  # seconds between retries for the pending task
EXPORT_LOCK_MAX_RETRIES = 5   # max retries the pending task will attempt

EXPORT_PENDING_TIMEOUT = EXPORT_LOCK_TIMEOUT  # = 120s  ← the problem

What each key does

  • EXPORT_LOCK_CACHE_KEY — held by the running task. Lives for EXPORT_LOCK_TIMEOUT seconds as a crash-safety TTL.
  • EXPORT_PENDING_CACHE_KEY — stores the task_id of the single waiting task. acquire_export_lock_or_schedule checks cache.get(pending_key) == task_id on every retry to confirm the task is still the designated waiter.
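
For reference, a minimal sketch of how acquire_export_lock_or_schedule could implement the lock-plus-pending-slot protocol described above; the helper and constant names mirror this PR, but the exact retry wiring here is an assumption, not the merged code:

from django.core.cache import cache

from ol_openedx_git_auto_export.constants import (
    EXPORT_LOCK_CACHE_KEY,
    EXPORT_LOCK_MAX_RETRIES,
    EXPORT_LOCK_RETRY_DELAY,
    EXPORT_LOCK_TIMEOUT,
    EXPORT_PENDING_CACHE_KEY,
    EXPORT_PENDING_TIMEOUT,
)


def acquire_export_lock_or_schedule(task, course_key_string):
    """Return True if we hold the lock; otherwise retry as the single pending task or drop."""
    lock_key = EXPORT_LOCK_CACHE_KEY.format(course_key=course_key_string)
    pending_key = EXPORT_PENDING_CACHE_KEY.format(course_key=course_key_string)
    task_id = task.request.id

    # cache.add is atomic, so only one worker can create the lock key.
    if cache.add(lock_key, task_id, timeout=EXPORT_LOCK_TIMEOUT):
        cache.delete(pending_key)  # we are running now, so free the waiting slot
        return True

    # Lock is held: stay the designated waiter if we already are, or claim the empty slot.
    if cache.get(pending_key) == task_id or cache.add(
        pending_key, task_id, timeout=EXPORT_PENDING_TIMEOUT
    ):
        raise task.retry(
            countdown=EXPORT_LOCK_RETRY_DELAY, max_retries=EXPORT_LOCK_MAX_RETRIES
        )

    # Another task already occupies the pending slot; drop this duplicate.
    return False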

The bug, step by step

The pending task's total possible wait time is:

MAX_RETRIES × RETRY_DELAY = 5 × 30 = 150 seconds

But the pending key only lives for 120 seconds. Here's what happens if the export runs longer than 120s:

t=0s:   Task A acquires lock (lock key expires at t=120s)
t=1s:   Task B fails to acquire → calls cache.add(pending_key, task_B_id, timeout=120)
        → pending key expires at t=121s
t=31s:  Task B retries → cache.get(pending_key) == task_B_id ✓ → retries again
t=61s:  Task B retries → cache.get(pending_key) == task_B_id ✓ → retries again
t=91s:  Task B retries → cache.get(pending_key) == task_B_id ✓ → retries again
t=121s: Task B retries → pending key HAS EXPIRED
        → cache.get(pending_key) returns None ≠ task_B_id
        → falls through to cache.add(pending_key, task_B_id, ...) → SUCCEEDS
        → Task B re-registers itself as a "new" pending task with a fresh TTL
        → effectively resets its own retry counter

This means Task B can accumulate unbounded retries by re-registering itself each time its pending key expires. It also means the "at most one pending task" guarantee breaks down — if Task C arrives just after t=121s, cache.add also succeeds for it, giving you two pending tasks simultaneously.

The fix

The pending key must outlive the entire period the pending task could possibly be retrying:

# The pending key must survive: the running export's full TTL
# + the entire retry window of the pending task
EXPORT_PENDING_TIMEOUT = EXPORT_LOCK_TIMEOUT + (EXPORT_LOCK_MAX_RETRIES * EXPORT_LOCK_RETRY_DELAY)
# = 120 + (5 × 30)
# = 270 seconds

With this value, cache.get(pending_key) == task_id will always return True on every retry attempt, the re-registration bug cannot happen, and the single-pending-task guarantee holds for the entire lifecycle.

@marslanabdulrauf (Contributor, Author) commented:

@asadali145 Copilot's suggestion to add a debounce strategy before invoking the Celery task has been applied.
You can take another look.

@asadali145 (Contributor) commented:

@marslanabdulrauf I was thinking that we are going to implement both approaches. Was there any specific reason to remove the previous implementation?

@marslanabdulrauf (Contributor, Author) commented:

> @marslanabdulrauf I was thinking that we are going to implement both approaches. Was there any specific reason to remove the previous implementation?

Yeah, the use case is really specific to a burst of the same signal, which we have already identified and which should be fixed by openedx/openedx-platform#38126. The debounce strategy handles that well, so there is no need to have both.

@asadali145 (Contributor) commented:

> @marslanabdulrauf I was thinking that we are going to implement both approaches. Was there any specific reason to remove the previous implementation?
>
> Yeah, the use case is really specific to a burst of the same signal, which we have already identified and which should be fixed by openedx/openedx-platform#38126. The debounce strategy handles that well, so there is no need to have both.

In the above analysis by Copilot, it gives an example of concurrent saves by 2 different authors that are close together but still outside the 5-second debounce window. In that case we would still need the first approach, which should handle those cases really well. WDYT?

@marslanabdulrauf (Contributor, Author) commented:

I would say a 5-second head start should be enough to avoid the original issue.

@asadali145 (Contributor) left a comment:

LGTM! Can you please update the PR description as per the current implementation?

marslanabdulrauf merged commit f750fdb into main on May 4, 2026.
9 checks passed
marslanabdulrauf deleted the marslan/10758-git-auto-export branch on May 4, 2026 at 11:55.
