Race condition: concurrent flows in same pipeline create duplicate events #123

@chubes4

Description

Problem

When two flows in the same pipeline run concurrently (e.g., Ticketmaster + venue scraper both importing the same show), both can pass the findExistingEvent() dedup check before either has committed its post to the database. This creates duplicate events.

Evidence

From the Mar 19, 2026 duplicate audit (140 pairs found across 14,684 events — 0.95% rate):

Category 2 examples (race condition):

  • The Falling Spikes — Flow 93 (Flagpole, 01:33am) vs Flow 81 (Flicker Freshtix, 02:56am). Same pipeline 16 (Athens), ~83 min apart
  • The Browning — Flow 76 (TM, 10:51am) vs Flow 72 (TM, 10:53am). Same pipeline 17 (Greenville), 2 minutes apart

The title matching algorithm correctly identifies these as duplicates when tested after the fact — the issue is purely timing.

Why it's getting worse

The Action Scheduler concurrent-batches fix (commit a92989ab, Mar 19 2026) increased throughput from ~1 action per 5 minutes to 623 actions per 5 minutes. More concurrent jobs means more concurrent upserts, which means more open race windows.

Current dedup flow

EventUpsert::findExistingEvent()
  1. Ticket URL match (meta query)
  2. Fuzzy title + venue + date (WP_Query → SimilarityEngine)
  3. Exact title + venue (WP_Query)
  4. Fuzzy title + date without venue (WP_Query → SimilarityEngine)
  → If no match found → wp_insert_post()

The gap between "no match found" and wp_insert_post() completing (with all meta/taxonomy writes) is the race window.
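The check-then-insert race is easy to reproduce in miniature. This is a self-contained in-memory simulation (plain PHP, not the plugin's real API) where two "flows" both run the dedup check before either has inserted, so both pass and the store ends up with two copies:

```php
<?php
// In-memory stand-in for the wp_posts table.
$store = [];

// Simplified stand-in for EventUpsert::findExistingEvent().
function find_existing_event(array $store, string $identity): ?int {
    foreach ($store as $id => $event) {
        if ($event === $identity) {
            return $id;
        }
    }
    return null; // no match -> caller will insert
}

$identity = 'pipeline16|2026-03-19|the falling spikes';

// Interleaving: both flows run the dedup check first...
$flow_a_match = find_existing_event($store, $identity); // null
$flow_b_match = find_existing_event($store, $identity); // null

// ...then both insert, because each saw an empty store.
if ($flow_a_match === null) { $store[] = $identity; }
if ($flow_b_match === null) { $store[] = $identity; }

echo count($store); // 2 — the duplicate pair
```

With a serialized schedule (check, insert, check, insert) the second check would find the first insert and skip; the duplicate only appears when both checks land inside the window.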

Possible approaches

1. Advisory lock on (pipeline_id, event_date, normalized_title_hash)

Before findExistingEvent(), acquire a MySQL GET_LOCK() with a key derived from the pipeline + date + title hash. Release after post insert. This serializes upserts for the same event identity within a pipeline.

Pros: Simple, no schema changes, works at DB level
Cons: Lock contention could slow throughput, need careful timeout handling
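The lock primitive itself is small. A sketch of the acquire/release pair, assuming WordPress ($wpdb) is loaded; the function names are hypothetical, and only the key derivation is pure:

```php
<?php
// MySQL lock names are capped at 64 chars; md5 keeps the key well under.
function dme_upsert_lock_key(int $pipeline_id, string $date, string $normalized_title): string {
    return 'dme_upsert_' . md5($pipeline_id . '|' . $date . '|' . $normalized_title);
}

function dme_acquire_upsert_lock(string $key, int $timeout = 10): bool {
    global $wpdb;
    // GET_LOCK returns 1 on success, 0 on timeout, NULL on error —
    // treat anything other than 1 as "did not get the lock".
    return (string) $wpdb->get_var(
        $wpdb->prepare('SELECT GET_LOCK(%s, %d)', $key, $timeout)
    ) === '1';
}

function dme_release_upsert_lock(string $key): void {
    global $wpdb;
    $wpdb->query($wpdb->prepare('SELECT RELEASE_LOCK(%s)', $key));
}
```

One caveat worth noting: GET_LOCK locks are per-connection and released automatically if the connection drops, so a crashed worker can't wedge the queue, but the release must still be paired with every acquire on the happy path.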

2. Processed-items dedup at the event identity level

After the AI step but before upsert, check a processed-items table keyed on (pipeline_id, normalized_title, date). If already processed by another flow in the same pipeline, skip.

Pros: Uses existing infrastructure, prevents the race entirely
Cons: Needs the normalized title before upsert (available from AI output)
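A sketch of the identity-level guard, assuming WordPress is loaded. The table name and columns are hypothetical; the real schema would come from the existing processed-items infrastructure. The key idea is to let a UNIQUE index make the insert itself the atomic check:

```php
<?php
function dme_event_identity_key(int $pipeline_id, string $normalized_title, string $date): string {
    return $pipeline_id . '|' . $date . '|' . $normalized_title;
}

function dme_claim_event_identity(string $identity_key): bool {
    global $wpdb;
    $table = $wpdb->prefix . 'dme_processed_events'; // hypothetical table
    // With a UNIQUE index on identity_key, exactly one concurrent flow
    // gets rows_affected = 1; the loser gets 0 from INSERT IGNORE.
    $inserted = $wpdb->query($wpdb->prepare(
        "INSERT IGNORE INTO {$table} (identity_key, claimed_at) VALUES (%s, NOW())",
        $identity_key
    ));
    return $inserted === 1;
}

// Usage after the AI step, before upsert:
// if (!dme_claim_event_identity($key)) { return; } // another flow already claimed it
```

Unlike the advisory lock, nothing waits here: the losing flow simply skips, which is the right behavior when both flows carry the same event.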

3. Post-insert dedup sweep

After wp_insert_post(), immediately query for duplicates at the same venue+date and trash the newer one if found. Essentially what clean-duplicates does, but inline.

Pros: Catches 100% of races, simple
Cons: Creates-then-deletes (wasteful), brief window where dupe is visible
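A sketch of the inline sweep, assuming WordPress is loaded. The post type and meta keys below are hypothetical stand-ins for the plugin's real schema; a real sweep would also run the SimilarityEngine title check on each candidate:

```php
<?php
function dme_newer_of(int $a, int $b): int {
    return max($a, $b); // higher post ID = inserted later
}

function dme_sweep_duplicates(int $new_post_id, int $venue_id, string $date): void {
    $dupe_ids = get_posts([
        'post_type'   => 'dme_event',   // hypothetical post type
        'post_status' => 'publish',
        'exclude'     => [$new_post_id],
        'fields'      => 'ids',
        'meta_query'  => [
            ['key' => '_dme_venue_id', 'value' => $venue_id], // hypothetical key
            ['key' => '_dme_date',     'value' => $date],     // hypothetical key
        ],
    ]);

    foreach ($dupe_ids as $existing_id) {
        // Keep the older post, trash the newer of each pair.
        wp_trash_post(dme_newer_of($existing_id, $new_post_id));
        if (dme_newer_of($existing_id, $new_post_id) === $new_post_id) {
            return; // our own insert lost the race; nothing more to sweep
        }
    }
}
```

Note that if both racing flows run the sweep, each sees the other's post and both try to trash the newer one; wp_trash_post() on an already-trashed post is harmless, so the outcome converges on a single survivor.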

Recommendation

Approach 1 (advisory lock) is the cleanest. The lock key could be:

$lock_key = 'dme_upsert_' . md5($pipeline_id . '|' . $date . '|' . SimilarityEngine::normalizeTitle($title));

With a 10-second timeout, this would serialize concurrent upserts for the same event without blocking unrelated events.
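Putting it together, a sketch of how the lock would wrap the existing flow, assuming WordPress is loaded. findExistingEvent() and insertEvent() stand in for the real EventUpsert methods; the essential detail is re-running the dedup check *inside* the lock, so a flow that waited sees the post the previous holder just inserted:

```php
<?php
function dme_guarded_upsert(int $pipeline_id, string $date, string $title): int {
    global $wpdb;

    $key = 'dme_upsert_' . md5(
        $pipeline_id . '|' . $date . '|' . SimilarityEngine::normalizeTitle($title)
    );

    // 10s: long enough for a normal insert (meta + taxonomy writes),
    // short enough that a stuck holder doesn't stall the whole queue.
    $locked = (string) $wpdb->get_var(
        $wpdb->prepare('SELECT GET_LOCK(%s, %d)', $key, 10)
    ) === '1';

    try {
        // Second dedup pass under the lock: catches the post a racing
        // flow committed while we were waiting. If the lock timed out,
        // this degrades to today's behavior (race possible, not worse).
        $existing = findExistingEvent($title, $date);
        if ($existing) {
            return $existing;
        }
        return insertEvent($title, $date);
    } finally {
        if ($locked) {
            $wpdb->query($wpdb->prepare('SELECT RELEASE_LOCK(%s)', $key));
        }
    }
}
```

Because the key includes the pipeline ID, date, and normalized title, two flows only contend when they carry the same event identity; unrelated upserts proceed in parallel.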

Current mitigation

wp data-machine-events check clean-duplicates --scope=all --yes catches and cleans these after the fact. Sub-1% rate is manageable but will grow with throughput.
