
fix: added lock to prevent same task run multiple time at the same time #784

Merged
marslanabdulrauf merged 3 commits into main from marslan/10758-git-auto-export on May 4, 2026

Conversation

@marslanabdulrauf (Contributor) commented on Apr 9, 2026

What are the relevant tickets?

https://github.com/mitodl/hq/issues/10758#issuecomment-4201469687
https://github.com/mitodl/hq/issues/10243

Description (What does it do?)

This pull request introduces two main improvements to the git export process for courses: (1) it debounces repeated export tasks triggered by multiple signals during a single course save, ensuring only one export is scheduled per burst, and (2) it adds cleanup of stale git lock files to prevent export failures due to prior crashes.

Debounce logic for export tasks:

  • Added EXPORT_DEBOUNCE_DELAY and EXPORT_DEBOUNCE_CACHE_KEY constants to control and identify the debounce window for export scheduling.
  • Modified export_course_to_git to use Django's cache to ensure only the first export signal within the debounce window schedules a task, with subsequent signals being ignored. The export task is scheduled with a delay to allow the burst of signals to settle.

Robustness improvements for git operations:

  • Introduced a clear_stale_git_lock utility function to remove any leftover .git/index.lock files from previously crashed workers before starting a new export. This prevents git operations from failing due to stale locks.
  • Updated async_export_to_git to call clear_stale_git_lock before exporting, ensuring a clean working directory.
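
For illustration, a minimal sketch of what the clear_stale_git_lock helper described above could look like; the directory layout (the CMS setting GIT_REPO_EXPORT_DIR plus a repo name derived from the git URL) is an assumption here, not necessarily the merged implementation:

import logging
from pathlib import Path

from django.conf import settings

log = logging.getLogger(__name__)


def clear_stale_git_lock(git_url):
    """Remove a leftover .git/index.lock from a previously crashed export worker."""
    # Assumed layout: the export working copy lives under GIT_REPO_EXPORT_DIR and is
    # named after the last path component of the git URL (minus a ".git" suffix).
    repo_name = git_url.rstrip("/").rsplit("/", 1)[-1]
    if repo_name.endswith(".git"):
        repo_name = repo_name[: -len(".git")]
    index_lock = Path(settings.GIT_REPO_EXPORT_DIR) / repo_name / ".git" / "index.lock"

    if index_lock.is_file():
        log.warning("Removing stale .git/index.lock for repo %s at %s", git_url, index_lock)
        try:
            index_lock.unlink()
        except OSError:
            # Don't fail the whole export over a cleanup problem; just report it.
            log.warning("Failed to remove stale git lock at %s", index_lock, exc_info=True)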

How can this be tested?

  1. Prerequisites

    • A local Tutor dev environment with ol_openedx_git_auto_export installed
    • A course with a CourseGitRepository configured and is_export_enabled = True
    • Celery workers running with concurrency >= 2 (to reproduce the original race)
  2. In Studio, make a change on the "Schedule & Details" page (this fires ~10–30 signals).

  3. This should cause errors in the cms-worker logs -- several different kinds of errors.

  4. Check out this branch -- re-install the plugin if needed and restart the cms-workers.

  5. Repeat the process -- you should see 1 new commit, and there will be only 1 error (red) log with a message similar to "Nothing to commit".

Additional Info:

This multiple-signal-fire issue is also being resolved upstream in openedx/openedx-platform#38126.

Comment on lines 42 to 43
course_key = CourseKey.from_string(course_key_string)
course_module = modulestore().get_course(course_key)

Bug: An exception in CourseKey.from_string or modulestore().get_course can cause the acquired distributed lock to not be released, as these calls are outside the try...finally block.
Severity: MEDIUM

Suggested Fix

Move the calls to CourseKey.from_string(course_key_string) and modulestore().get_course(course_key) inside the try block. This ensures that if an exception occurs, the finally block will still execute and release the lock.
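
A minimal sketch of the suggested restructuring, assuming the task body shown in this review; CourseKey, modulestore, LOGGER, and release_export_lock mirror names that appear elsewhere in this PR, and the export call itself is left as a placeholder:

try:
    # Parsing the key and loading the course can both raise; keeping them
    # inside the try guarantees the finally block still releases the lock.
    course_key = CourseKey.from_string(course_key_string)
    course_module = modulestore().get_course(course_key)
    # ... perform the git export for course_module here ...
except Exception:
    LOGGER.exception(
        "Unknown error occurred during async course content export to git (course id: %s)",
        course_key_string,
    )
finally:
    release_export_lock(course_key_string)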


Location: src/ol_openedx_git_auto_export/ol_openedx_git_auto_export/tasks.py#L42-L43

Potential issue: A distributed lock is acquired via `acquire_export_lock_or_schedule` to
prevent concurrent git exports for a course. However, subsequent operations at lines
42-43, specifically `CourseKey.from_string(course_key_string)` and
`modulestore().get_course(course_key)`, can raise exceptions. Because these lines are
executed before the `try` block on line 45, an exception will prevent the `finally`
block (which contains `release_export_lock`) from ever running. This causes the lock to
be held for its full 120-second TTL, blocking any further export tasks for that course
during that period.


@arslanashraf7 (Contributor) commented:

@asadali145 could you review this?

Copilot AI left a comment

Pull request overview

This PR introduces a per-course distributed locking and deduplication mechanism for the async_export_to_git Celery task to prevent concurrent exports for the same course, and adds cleanup for stale Git index lock files that can be left behind after worker crashes.

Changes:

  • Added a cache-based per-course lock with “single pending task” scheduling/deduplication logic.
  • Updated async_export_to_git to be a bound Celery task that uses the new lock helpers and always releases the lock in a finally.
  • Added a helper to remove stale .git/index.lock files prior to export attempts.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

  • src/ol_openedx_git_auto_export/ol_openedx_git_auto_export/utils.py -- Adds cache lock/pending-task helpers and stale git lock-file cleanup.
  • src/ol_openedx_git_auto_export/ol_openedx_git_auto_export/tasks.py -- Uses the new locking helpers around export_to_git and clears stale lock files.
  • src/ol_openedx_git_auto_export/ol_openedx_git_auto_export/constants.py -- Introduces lock/pending TTLs and cache-key templates for the new protocol.


Comment on lines +194 to +201
def release_export_lock(course_key_string):
    """
    Release the per-course git-export distributed lock.

    Must be called in a ``finally`` block after ``acquire_export_lock_or_schedule``
    returns True.
    """
    cache.delete(EXPORT_LOCK_CACHE_KEY.format(course_key=course_key_string))

Copilot AI Apr 14, 2026


EXPORT_LOCK_TIMEOUT is a fixed 120s TTL, but release_export_lock unconditionally deletes the lock key. If an export takes longer than the TTL, the lock can expire and be re-acquired by another worker while the first task is still running; when the first task finishes it will delete the new task’s lock, allowing concurrent exports. Consider storing a unique token (e.g., task_id) as the lock value and only deleting if the current value matches, and/or using a lock mechanism with TTL refresh (or a much larger/configurable timeout) to prevent expiry mid-export.
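
A minimal sketch of the token-checked release suggested here; storing the Celery task id as the lock value and the simplified acquire helper are assumptions for illustration, not the merged implementation:

from django.core.cache import cache

from ol_openedx_git_auto_export.constants import EXPORT_LOCK_CACHE_KEY, EXPORT_LOCK_TIMEOUT


def acquire_export_lock(course_key_string, token):
    """Take the per-course lock, recording which task owns it."""
    return cache.add(
        EXPORT_LOCK_CACHE_KEY.format(course_key=course_key_string),
        token,  # e.g. the Celery task id
        timeout=EXPORT_LOCK_TIMEOUT,
    )


def release_export_lock(course_key_string, token):
    """Delete the lock only if this task still owns it."""
    lock_key = EXPORT_LOCK_CACHE_KEY.format(course_key=course_key_string)
    # get-then-delete is not fully atomic; a backend with compare-and-delete
    # (e.g. Redis via a small Lua script) would close the remaining race window.
    if cache.get(lock_key) == token:
        cache.delete(lock_key)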

Comment on lines +13 to +17
EXPORT_LOCK_TIMEOUT = 120 # seconds; safety TTL if a worker crashes holding the lock
EXPORT_LOCK_RETRY_DELAY = 30 # seconds between retries for the pending task
EXPORT_LOCK_MAX_RETRIES = 5 # max retries for the pending task
# At most one extra task queues behind the running task; all other duplicates drop.
EXPORT_PENDING_TIMEOUT = EXPORT_LOCK_TIMEOUT # match the lock TTL

Copilot AI Apr 14, 2026


EXPORT_LOCK_TIMEOUT = 120 seconds looks very low for a course export and can cause the distributed lock to expire while export_to_git is still running, reintroducing concurrency/race issues. Consider making this configurable (settings/env) and setting it to a value that safely exceeds worst-case export duration, or switching to a lock implementation that renews the TTL while the task is active.
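
A minimal sketch of making the TTL configurable via Django settings, as this comment suggests; the setting name GIT_EXPORT_LOCK_TIMEOUT and the 1800-second default are hypothetical:

from django.conf import settings

# Fall back to a default that comfortably exceeds the worst-case export duration.
EXPORT_LOCK_TIMEOUT = getattr(settings, "GIT_EXPORT_LOCK_TIMEOUT", 1800)  # seconds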

Comment on lines +127 to +131
    if index_lock.exists():
        log.warning(
            "Removing stale .git/index.lock for repo %s at %s", git_url, index_lock
        )
        index_lock.unlink()

Copilot AI Apr 14, 2026


clear_stale_git_lock calls index_lock.unlink() after an existence check but doesn’t handle filesystem errors (permission issues, transient disappearance, read-only volumes), which would fail the whole export. Consider wrapping the unlink in a try/except OSError and logging a warning (and optionally continuing), and using is_file()/missing_ok=True to avoid TOCTOU issues.

Suggested change (before):

    if index_lock.exists():
        log.warning(
            "Removing stale .git/index.lock for repo %s at %s", git_url, index_lock
        )
        index_lock.unlink()

Suggested change (after):

    if index_lock.is_file():
        log.warning(
            "Removing stale .git/index.lock for repo %s at %s", git_url, index_lock
        )
        try:
            index_lock.unlink(missing_ok=True)
        except OSError:
            log.warning(
                "Failed to remove stale .git/index.lock for repo %s at %s",
                git_url,
                index_lock,
                exc_info=True,
            )

        async_create_github_repo.delay(str(course_key), export_course=True)
    except Exception:
        LOGGER.exception(
            "Unknown error occured during async course content export to git (course id: %s)",  # noqa: E501

Copilot AI Apr 14, 2026


Typo in log message: "occured" should be "occurred".

Suggested change (before):

            "Unknown error occured during async course content export to git (course id: %s)",  # noqa: E501

Suggested change (after):

            "Unknown error occurred during async course content export to git (course id: %s)",  # noqa: E501

@asadali145 (Contributor) commented:

@marslanabdulrauf, can you please look into the review comments by Seer and Copilot?

@asadali145 (Contributor) commented:

And let me know if you think those are invalid and I should proceed with the review.

@asadali145 (Contributor) commented:

Here is an analysis of the suggested improvements, produced with the help of Copilot. The 10-30 tasks figure might be exaggerated after our recent upstream fix, but it looks legit and could improve our execution.


💡 Improvements

💡 Debounce at the Signal Level to Eliminate Bursts Before They Hit the Queue

Why the problem happens

COURSE_PUBLISHED fires 10–30 times in a single web request when a teacher saves "Schedule & Details". Tracing the call chain:

COURSE_PUBLISHED signal
  → listen_for_course_publish() [signals.py]
    → export_course_to_git() [utils.py]
      → async_export_to_git.delay()  ← called 10–30 times almost simultaneously

Each signal invocation enqueues a separate Celery task. The PR's lock then discards duplicates inside the worker — but all 30 tasks have already been created, enqueued in the broker, dequeued by workers, and had lock acquisition attempted before being dropped. That's significant overhead for 29 tasks that do nothing.

What debouncing does differently

Debouncing intercepts the burst at the signal handler, before .delay() is ever called, so only 1 task enters the broker:

# In listen_for_course_publish (signals.py) or export_course_to_git (utils.py)
from django.core.cache import cache

# assumed import path for the plugin's own task module
from ol_openedx_git_auto_export.tasks import async_export_to_git

DEBOUNCE_DELAY = 5  # seconds — tune to exceed the publish burst window

def listen_for_course_publish(sender, course_key, **kwargs):
    debounce_key = f"git_export_debounce:{course_key}"

    # cache.add() is atomic — returns True only if the key didn't already exist.
    # The first signal in the burst wins; all subsequent ones are skipped.
    if cache.add(debounce_key, "1", timeout=DEBOUNCE_DELAY):
        async_export_to_git.apply_async(
            args=[str(course_key)],
            countdown=DEBOUNCE_DELAY,  # task runs after the burst window closes
        )
    # else: a task is already scheduled — skip without touching the broker

Side-by-side comparison

Current PR (lock inside the worker):

t=0s:  30 tasks enqueued in broker
t=1s:  Task #1 dequeued → acquires lock → runs export
t=1s:  Task #2 dequeued → lock held → becomes pending, retries at t=31s
t=1s:  Tasks #3–30 dequeued → dropped (lock held, pending slot taken)
t=31s: Task #2 retries → acquires lock → potentially runs a redundant export

With debounce (burst collapsed at signal level):

t=0s:  Signal #1  → cache.add succeeds → 1 task scheduled for t=5s
t=0s:  Signals #2–30 → cache.add fails → skipped (nothing enqueued)
t=5s:  ONE task dequeued and runs

How the two mechanisms complement each other

The distributed lock in the PR is still valuable — it guards against genuine concurrent saves (e.g. two different editors saving a course simultaneously, far enough apart that the debounce window doesn't catch both). The correct role for each:

  • Debounce (signal level) -- guards against the burst of signals from a single save (the common case).
  • Distributed lock (task level) -- guards against true concurrent saves that arrive outside the debounce window.

With both in place: debounce eliminates the burst, the lock is a lightweight safety net for the edge case. Without debounce, the lock is doing heavy lifting it shouldn't need to.


💡 EXPORT_PENDING_TIMEOUT Must Be Longer Than EXPORT_LOCK_TIMEOUT

The current values

EXPORT_LOCK_TIMEOUT = 120     # seconds — safety TTL for the running export's lock
EXPORT_LOCK_RETRY_DELAY = 30  # seconds between retries for the pending task
EXPORT_LOCK_MAX_RETRIES = 5   # max retries the pending task will attempt

EXPORT_PENDING_TIMEOUT = EXPORT_LOCK_TIMEOUT  # = 120s  ← the problem

What each key does

  • EXPORT_LOCK_CACHE_KEY — held by the running task. Lives for EXPORT_LOCK_TIMEOUT seconds as a crash-safety TTL.
  • EXPORT_PENDING_CACHE_KEY — stores the task_id of the single waiting task. acquire_export_lock_or_schedule checks cache.get(pending_key) == task_id on every retry to confirm the task is still the designated waiter.
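
For reference, a minimal sketch of how acquire_export_lock_or_schedule could implement the lock-plus-pending-slot protocol described above; the helper and constant names mirror this PR, but the exact retry wiring here is an assumption, not the merged code:

from django.core.cache import cache

from ol_openedx_git_auto_export.constants import (
    EXPORT_LOCK_CACHE_KEY,
    EXPORT_LOCK_MAX_RETRIES,
    EXPORT_LOCK_RETRY_DELAY,
    EXPORT_LOCK_TIMEOUT,
    EXPORT_PENDING_CACHE_KEY,
    EXPORT_PENDING_TIMEOUT,
)


def acquire_export_lock_or_schedule(task, course_key_string):
    """Return True if we hold the lock; otherwise retry as the single pending task or drop."""
    lock_key = EXPORT_LOCK_CACHE_KEY.format(course_key=course_key_string)
    pending_key = EXPORT_PENDING_CACHE_KEY.format(course_key=course_key_string)
    task_id = task.request.id

    # cache.add is atomic, so only one worker can create the lock key.
    if cache.add(lock_key, task_id, timeout=EXPORT_LOCK_TIMEOUT):
        cache.delete(pending_key)  # we are running now, so free the waiting slot
        return True

    # Lock is held: stay the designated waiter if we already are, or claim the empty slot.
    if cache.get(pending_key) == task_id or cache.add(
        pending_key, task_id, timeout=EXPORT_PENDING_TIMEOUT
    ):
        raise task.retry(
            countdown=EXPORT_LOCK_RETRY_DELAY, max_retries=EXPORT_LOCK_MAX_RETRIES
        )

    # Another task already occupies the pending slot; drop this duplicate.
    return False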

The bug, step by step

The pending task's total possible wait time is:

MAX_RETRIES × RETRY_DELAY = 5 × 30 = 150 seconds

But the pending key only lives for 120 seconds. Here's what happens if the export runs longer than 120s:

t=0s:   Task A acquires lock (lock key expires at t=120s)
t=1s:   Task B fails to acquire → calls cache.add(pending_key, task_B_id, timeout=120)
        → pending key expires at t=121s
t=31s:  Task B retries → cache.get(pending_key) == task_B_id ✓ → retries again
t=61s:  Task B retries → cache.get(pending_key) == task_B_id ✓ → retries again
t=91s:  Task B retries → cache.get(pending_key) == task_B_id ✓ → retries again
t=121s: Task B retries → pending key HAS EXPIRED
        → cache.get(pending_key) returns None ≠ task_B_id
        → falls through to cache.add(pending_key, task_B_id, ...) → SUCCEEDS
        → Task B re-registers itself as a "new" pending task with a fresh TTL
        → effectively resets its own retry counter

This means Task B can accumulate unbounded retries by re-registering itself each time its pending key expires. It also means the "at most one pending task" guarantee breaks down — if Task C arrives just after t=121s, cache.add also succeeds for it, giving you two pending tasks simultaneously.

The fix

The pending key must outlive the entire period the pending task could possibly be retrying:

# The pending key must survive: the running export's full TTL
# + the entire retry window of the pending task
EXPORT_PENDING_TIMEOUT = EXPORT_LOCK_TIMEOUT + (EXPORT_LOCK_MAX_RETRIES * EXPORT_LOCK_RETRY_DELAY)
# = 120 + (5 × 30)
# = 270 seconds

With this value, cache.get(pending_key) == task_id will always return True on every retry attempt, the re-registration bug cannot happen, and the single-pending-task guarantee holds for the entire lifecycle.

@marslanabdulrauf (Contributor, Author) commented:

@asadali145 Copilot's suggestion to add a debounce strategy before invoking the Celery task has been applied.
You can take another look.

@asadali145 (Contributor) commented:

@marslanabdulrauf I was thinking that we are going to implement both approaches. Was there any specific reason to remove the previous implementation?

@marslanabdulrauf (Contributor, Author) commented:

> @marslanabdulrauf I was thinking that we are going to implement both approaches. Was there any specific reason to remove the previous implementation?

Yeah, the use case is really specific to a burst of the same signal, which we have already identified and which should be fixed by openedx/openedx-platform#38126. The debounce strategy handles that well, so there is no need to have both.

@asadali145 (Contributor) commented:

> @marslanabdulrauf I was thinking that we are going to implement both approaches. Was there any specific reason to remove the previous implementation?
>
> Yeah, the use case is really specific to a burst of the same signal, which we have already identified and which should be fixed by openedx/openedx-platform#38126. The debounce strategy handles that well, so there is no need to have both.

In the above analysis by Copilot, it gives an example of concurrent saves by 2 different authors that are close together but still outside the 5-second debounce window. In that case we would still need the first approach, which should handle those cases really well. WDYT?

@marslanabdulrauf (Contributor, Author) commented:

I would say a 5-second head start should be enough to avoid the original issue.

@asadali145 (Contributor) left a comment:

LGTM! Can you please update the PR description as per the current implementation?

marslanabdulrauf merged commit f750fdb into main on May 4, 2026.
9 checks passed
marslanabdulrauf deleted the marslan/10758-git-auto-export branch on May 4, 2026 at 11:55.
