fix: added lock to prevent same task run multiple time at the same time #784

marslanabdulrauf merged 3 commits into main from
Conversation
```python
course_key = CourseKey.from_string(course_key_string)
course_module = modulestore().get_course(course_key)
```
Bug: An exception in CourseKey.from_string or modulestore().get_course can cause the acquired distributed lock to not be released, as these calls are outside the try...finally block.
Severity: MEDIUM
Suggested Fix
Move the calls to CourseKey.from_string(course_key_string) and modulestore().get_course(course_key) inside the try block. This ensures that if an exception occurs, the finally block will still execute and release the lock.
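A minimal sketch of that restructuring, assuming the task body matches the diff context above (the helper names come from this PR; everything else shown here is illustrative):

```python
if not acquire_export_lock_or_schedule(course_key_string):
    return  # another worker holds the lock; this duplicate was re-queued or dropped
try:
    # Moved inside the try block so any exception still reaches the finally.
    course_key = CourseKey.from_string(course_key_string)
    course_module = modulestore().get_course(course_key)
    ...  # perform the git export as before
finally:
    release_export_lock(course_key_string)
```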
Prompt for AI Agent

Review the code at the location below. A potential bug has been identified by an AI agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not valid.
Location: src/ol_openedx_git_auto_export/ol_openedx_git_auto_export/tasks.py#L42-L43
Potential issue: A distributed lock is acquired via `acquire_export_lock_or_schedule` to prevent concurrent git exports for a course. However, subsequent operations at lines 42-43, specifically `CourseKey.from_string(course_key_string)` and `modulestore().get_course(course_key)`, can raise exceptions. Because these lines are executed before the `try` block on line 45, an exception will prevent the `finally` block (which contains `release_export_lock`) from ever running. This causes the lock to be held for its full 120-second TTL, blocking any further export tasks for that course during that period.
@asadali145 could you review this?
Pull request overview
This PR introduces a per-course distributed locking and deduplication mechanism for the async_export_to_git Celery task to prevent concurrent exports for the same course, and adds cleanup for stale Git index lock files that can be left behind after worker crashes.
Changes:
- Added a cache-based per-course lock with “single pending task” scheduling/deduplication logic.
- Updated `async_export_to_git` to be a bound Celery task that uses the new lock helpers and always releases the lock in a `finally`.
- Added a helper to remove stale `.git/index.lock` files prior to export attempts.
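To make the review comments below easier to follow, here is a hypothetical reconstruction of the lock/pending-task helper described above. Only the helper name, `EXPORT_LOCK_CACHE_KEY`, and the TTL/retry constants appear in this PR; the pending-key template and the body are assumptions:

```python
# Hypothetical reconstruction for illustration only -- not the PR's actual code.
from django.core.cache import cache

def acquire_export_lock_or_schedule(task, course_key_string):
    """Return True if this task may export now; otherwise retry once or drop."""
    lock_key = EXPORT_LOCK_CACHE_KEY.format(course_key=course_key_string)
    # Assumed key template for the "single pending task" marker.
    pending_key = f"git_export_pending:{course_key_string}"
    if cache.add(lock_key, "locked", timeout=EXPORT_LOCK_TIMEOUT):
        cache.delete(pending_key)  # we are running now, so no longer pending
        return True
    # Lock is held. Let at most one task wait behind it, keyed by task id so
    # the pending task itself may keep retrying while the lock stays busy.
    task_id = task.request.id
    if cache.add(pending_key, task_id, timeout=EXPORT_PENDING_TIMEOUT) or (
        cache.get(pending_key) == task_id
    ):
        # retry() raises celery.exceptions.Retry, re-queueing this task.
        raise task.retry(
            countdown=EXPORT_LOCK_RETRY_DELAY, max_retries=EXPORT_LOCK_MAX_RETRIES
        )
    return False  # another task is already pending; drop this duplicate
```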
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/ol_openedx_git_auto_export/ol_openedx_git_auto_export/utils.py | Adds cache lock/pending-task helpers and stale git lock-file cleanup. |
| src/ol_openedx_git_auto_export/ol_openedx_git_auto_export/tasks.py | Uses the new locking helpers around export_to_git and clears stale lock files. |
| src/ol_openedx_git_auto_export/ol_openedx_git_auto_export/constants.py | Introduces lock/pending TTLs and cache-key templates for the new protocol. |
```python
def release_export_lock(course_key_string):
    """
    Release the per-course git-export distributed lock.

    Must be called in a ``finally`` block after ``acquire_export_lock_or_schedule``
    returns True.
    """
    cache.delete(EXPORT_LOCK_CACHE_KEY.format(course_key=course_key_string))
```
EXPORT_LOCK_TIMEOUT is a fixed 120s TTL, but release_export_lock unconditionally deletes the lock key. If an export takes longer than the TTL, the lock can expire and be re-acquired by another worker while the first task is still running; when the first task finishes it will delete the new task’s lock, allowing concurrent exports. Consider storing a unique token (e.g., task_id) as the lock value and only deleting if the current value matches, and/or using a lock mechanism with TTL refresh (or a much larger/configurable timeout) to prevent expiry mid-export.
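A sketch of the token-based release suggested here, assuming Django's low-level cache API (`EXPORT_LOCK_CACHE_KEY` comes from the PR; the acquire/release pairing shown is illustrative):

```python
import uuid
from django.core.cache import cache

def acquire_export_lock(course_key_string, timeout):
    """Try to take the lock; return an owner token on success, None otherwise."""
    token = uuid.uuid4().hex
    key = EXPORT_LOCK_CACHE_KEY.format(course_key=course_key_string)
    # cache.add() only sets the key if it does not already exist.
    return token if cache.add(key, token, timeout=timeout) else None

def release_export_lock(course_key_string, token):
    """Delete the lock only if this caller still owns it."""
    key = EXPORT_LOCK_CACHE_KEY.format(course_key=course_key_string)
    if cache.get(key) == token:
        # Best-effort: a get-then-delete is not atomic, so a true
        # compare-and-delete (e.g. a Redis Lua script) is safer still.
        cache.delete(key)
```

If the deployment talks to Redis directly, `redis-py`'s `Lock` class already provides owner tokens plus an `extend()` method for TTL refresh.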
```python
EXPORT_LOCK_TIMEOUT = 120  # seconds; safety TTL if a worker crashes holding the lock
EXPORT_LOCK_RETRY_DELAY = 30  # seconds between retries for the pending task
EXPORT_LOCK_MAX_RETRIES = 5  # max retries for the pending task
# At most one extra task queues behind the running task; all other duplicates drop.
EXPORT_PENDING_TIMEOUT = EXPORT_LOCK_TIMEOUT  # match the lock TTL
```
EXPORT_LOCK_TIMEOUT = 120 seconds looks very low for a course export and can cause the distributed lock to expire while export_to_git is still running, reintroducing concurrency/race issues. Consider making this configurable (settings/env) and setting it to a value that safely exceeds worst-case export duration, or switching to a lock implementation that renews the TTL while the task is active.
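A minimal way to make the TTL configurable, assuming Django settings; the setting name `GIT_EXPORT_LOCK_TIMEOUT` and the 30-minute default are hypothetical:

```python
from django.conf import settings

# Hypothetical setting name and default; choose a value that safely exceeds
# the worst-case export duration observed in production.
EXPORT_LOCK_TIMEOUT = getattr(settings, "GIT_EXPORT_LOCK_TIMEOUT", 1800)  # seconds
```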
```python
if index_lock.exists():
    log.warning(
        "Removing stale .git/index.lock for repo %s at %s", git_url, index_lock
    )
    index_lock.unlink()
```
clear_stale_git_lock calls index_lock.unlink() after an existence check but doesn’t handle filesystem errors (permission issues, transient disappearance, read-only volumes), which would fail the whole export. Consider wrapping the unlink in a try/except OSError and logging a warning (and optionally continuing), and using is_file()/missing_ok=True to avoid TOCTOU issues.
Suggested change:

```diff
-if index_lock.exists():
-    log.warning(
-        "Removing stale .git/index.lock for repo %s at %s", git_url, index_lock
-    )
-    index_lock.unlink()
+if index_lock.is_file():
+    log.warning(
+        "Removing stale .git/index.lock for repo %s at %s", git_url, index_lock
+    )
+    try:
+        index_lock.unlink(missing_ok=True)
+    except OSError:
+        log.warning(
+            "Failed to remove stale .git/index.lock for repo %s at %s",
+            git_url,
+            index_lock,
+            exc_info=True,
+        )
```
```python
    async_create_github_repo.delay(str(course_key), export_course=True)
except Exception:
    LOGGER.exception(
        "Unknown error occured during async course content export to git (course id: %s)",  # noqa: E501
```
Typo in log message: "occured" should be "occurred".
| "Unknown error occured during async course content export to git (course id: %s)", # noqa: E501 | |
| "Unknown error occurred during async course content export to git (course id: %s)", # noqa: E501 |
@marslanabdulrauf, can you please look into the review comments by Seer and Copilot, and let me know if you think those are invalid and I should proceed with the review?
Here is an analysis for the suggested improvements with the help of Copilot. That 10–30 tasks thing might be exaggerated after our recent upstream fix, but it looks legit and can improve our execution.

💡 Improvements

💡 Debounce at the Signal Level to Eliminate Bursts Before They Hit the Queue

Why the problem happens

Each signal invocation enqueues a separate Celery task. The PR's lock then discards duplicates inside the worker — but all 30 tasks have already been created, enqueued in the broker, dequeued by workers, and had lock acquisition attempted before being dropped. That's significant overhead for 29 tasks that do nothing.

What debouncing does differently

Debouncing intercepts the burst at the signal handler, before any task is enqueued:

```python
# In listen_for_course_publish (signals.py) or export_course_to_git (utils.py)
from django.core.cache import cache

DEBOUNCE_DELAY = 5  # seconds — tune to exceed the publish burst window

def listen_for_course_publish(sender, course_key, **kwargs):
    debounce_key = f"git_export_debounce:{course_key}"
    # cache.add() is atomic — returns True only if the key didn't already exist.
    # The first signal in the burst wins; all subsequent ones are skipped.
    if cache.add(debounce_key, "1", timeout=DEBOUNCE_DELAY):
        async_export_to_git.apply_async(
            args=[str(course_key)],
            countdown=DEBOUNCE_DELAY,  # task runs after the burst window closes
        )
    # else: a task is already scheduled — skip without touching the broker
```

Side-by-side comparison

Current PR (lock inside the worker):

With debounce (burst collapsed at signal level):

How the two mechanisms complement each other

The distributed lock in the PR is still valuable — it guards against genuine concurrent saves (e.g. two different editors saving a course simultaneously, far enough apart that the debounce window doesn't catch both). The correct role for each:

With both in place: debounce eliminates the burst, the lock is a lightweight safety net for the edge case. Without debounce, the lock is doing heavy lifting it shouldn't need to.
…ueuing extra tasks
@asadali145 Copilot's suggestion is added: a debounce strategy before starting/invoking the Celery task.
@marslanabdulrauf I was thinking that we are going to implement both approaches. Was there any specific reason to remove the previous implementation?
yeah, the use-case is really specific to a burst of the same signal, which we have already identified and which should be fixed by openedx/openedx-platform#38126. This debounce strategy handles that well, so there is no need to have both.
In the above analysis by Copilot, it gives an example of concurrent saves by 2 different authors that are close in time but still fall outside the 5s debounce window. In that case we would need the first approach, and it should handle these cases really well. WDYT?
I would say a 5 second head start should be enough to avoid the original issue.
asadali145 left a comment
LGTM! Can you please update the PR description as per current implementation?
What are the relevant tickets?
https://github.com/mitodl/hq/issues/10758#issuecomment-4201469687
https://github.com/mitodl/hq/issues/10243
Description (What does it do?)
This pull request introduces two main improvements to the git export process for courses: (1) it debounces repeated export tasks triggered by multiple signals during a single course save, ensuring only one export is scheduled per burst, and (2) it adds cleanup of stale git lock files to prevent export failures due to prior crashes.
Debounce logic for export tasks:

- Added `EXPORT_DEBOUNCE_DELAY` and `EXPORT_DEBOUNCE_CACHE_KEY` constants to control and identify the debounce window for export scheduling.
- Updated `export_course_to_git` to use Django's cache so that only the first export signal within the debounce window schedules a task, with subsequent signals being ignored. The export task is scheduled with a delay to allow the burst of signals to settle.

Robustness improvements for git operations:

- Added a `clear_stale_git_lock` utility function to remove any leftover `.git/index.lock` files from previously crashed workers before starting a new export. This prevents git operations from failing due to stale locks.
- Updated `async_export_to_git` to call `clear_stale_git_lock` before exporting, ensuring a clean working directory.
Prerequisites
- `ol_openedx_git_auto_export` installed
- `CourseGitRepository` configured and `is_export_enabled = True`
- `concurrency >= 2` (to reproduce the original race)

1. In Studio, make a change on the "Schedule & Details" page (this fires ~10–30 signals). This should cause errors of different sorts in the cms-worker logs.
2. Check out this branch, re-install if needed, and restart the cms-workers.
3. Repeat the process. You should see 1 new commit and only 1 red log with a message similar to "Nothing to commit".
Additional Info:
This multiple-signal-fire issue is also being resolved in openedx/openedx-platform#38126.