Problem
When two flows in the same pipeline run concurrently (e.g., Ticketmaster + venue scraper both importing the same show), both can pass the findExistingEvent() dedup check before either has committed its post to the database. This creates duplicate events.
Evidence
From the Mar 19, 2026 duplicate audit (140 pairs found across 14,684 events — 0.95% rate):
Category 2 examples (race condition):
The Falling Spikes — Flow 93 (Flagpole, 01:33am) vs Flow 81 (Flicker Freshtix, 02:56am). Same pipeline 16 (Athens), ~83 min apart
The Browning — Flow 76 (TM, 10:51am) vs Flow 72 (TM, 10:53am). Same pipeline 17 (Greenville), 2 minutes apart
The title matching algorithm correctly identifies these as duplicates when tested after the fact — the issue is purely timing.
Why it's getting worse
The Action Scheduler concurrent batches fix (commit a92989ab, Mar 19 2026) increased throughput from ~1 action/5min to 623 actions/5min. More concurrent jobs = more concurrent upserts = more race windows.
Current dedup flow
EventUpsert::findExistingEvent()
1. Ticket URL match (meta query)
2. Fuzzy title + venue + date (WP_Query → SimilarityEngine)
3. Exact title + venue (WP_Query)
4. Fuzzy title + date without venue (WP_Query → SimilarityEngine)
→ If no match found → wp_insert_post()
The gap between "no match found" and wp_insert_post() completing (with all meta/taxonomy writes) is the race window.
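The race can be pictured as an interleaving of two flows (post IDs illustrative):

```
Flow A: findExistingEvent()  → no match
Flow B: findExistingEvent()  → no match   (A's post not committed yet)
Flow A: wp_insert_post()     → event #101
Flow B: wp_insert_post()     → event #102  ← duplicate
```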
Possible approaches
1. Advisory lock on (pipeline_id, event_date, normalized_title_hash)
Before findExistingEvent(), acquire a MySQL GET_LOCK() with a key derived from the pipeline + date + title hash. Release after post insert. This serializes upserts for the same event identity within a pipeline.
Pros: Simple, no schema changes, works at DB level
Cons: Lock contention could slow throughput, need careful timeout handling
2. Processed-items dedup at the event identity level
After the AI step but before upsert, check a processed-items table keyed on (pipeline_id, normalized_title, date). If already processed by another flow in the same pipeline, skip.
Pros: Uses existing infrastructure, prevents the race entirely
Cons: Needs the normalized title before upsert (available from AI output)
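A minimal sketch of the claim-before-upsert check, assuming a hypothetical processed-items table (name and columns illustrative, not the plugin's actual schema) with a unique key on `(pipeline_id, normalized_title, event_date)`:

```php
// Sketch only — table name and columns are assumptions, not the
// plugin's real processed-items schema.
global $wpdb;

// INSERT IGNORE atomically claims this event identity: if another flow
// already inserted the same (pipeline, title, date) row, the unique key
// makes this a no-op and $wpdb->query() reports 0 affected rows.
$claimed = $wpdb->query( $wpdb->prepare(
    "INSERT IGNORE INTO {$wpdb->prefix}dme_processed_events
         (pipeline_id, normalized_title, event_date)
     VALUES (%d, %s, %s)",
    $pipeline_id,
    $normalized_title, // available from the AI output
    $event_date
) );

if ( 0 === $claimed ) {
    // Another flow in this pipeline already claimed this event — skip the upsert.
    return;
}
```

Because the claim is a single atomic insert against a unique key, there is no check-then-act window at all: exactly one flow wins the row, and every other flow sees 0 affected rows.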
3. Post-insert dedup sweep
After wp_insert_post(), immediately query for duplicates at the same venue+date and trash the newer one if found. Essentially what clean-duplicates does, but inline.
Pros: Catches 100% of races, simple
Cons: Creates-then-deletes (wasteful), brief window where dupe is visible
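A sketch of the inline sweep, assuming illustrative post type and meta keys (the `isDuplicateTitle()` helper is hypothetical — the real check would go through SimilarityEngine):

```php
// Sketch only — post type and meta keys are assumptions. Immediately
// after wp_insert_post(), look for an older event at the same venue+date;
// if the titles match, trash the post we just created.
$existing = get_posts( [
    'post_type'   => 'event',                // assumed post type
    'post_status' => 'publish',
    'exclude'     => [ $new_post_id ],
    'meta_query'  => [
        [ 'key' => '_event_venue_id', 'value' => $venue_id ],   // assumed key
        [ 'key' => '_event_date',     'value' => $event_date ], // assumed key
    ],
    'fields'      => 'ids',
    'numberposts' => 5,
] );

foreach ( $existing as $candidate_id ) {
    if ( SimilarityEngine::isDuplicateTitle( $title, get_the_title( $candidate_id ) ) ) { // hypothetical helper
        wp_trash_post( $new_post_id ); // keep the older post, trash the newer one
        break;
    }
}
```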
Recommendation
Approach 1 (advisory lock) is the cleanest. The lock key could be:
$lock_key = 'dme_upsert_' . md5($pipeline_id . '|' . $date . '|' . SimilarityEngine::normalizeTitle($title));
With a 10-second timeout, this would serialize concurrent upserts for the same event without blocking unrelated events.
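The full acquire/check/insert/release flow could look like this (a sketch: `findExistingEvent()`, `SimilarityEngine::normalizeTitle()`, and the lock key come from this document; surrounding variable names and error handling are illustrative):

```php
// Sketch of the advisory-lock flow around the existing upsert path.
global $wpdb;

$lock_key = 'dme_upsert_' . md5(
    $pipeline_id . '|' . $date . '|' . SimilarityEngine::normalizeTitle( $title )
);

// Block up to 10s waiting for any concurrent upsert of the same identity.
// GET_LOCK() returns 1 on success, 0 on timeout, NULL on error.
$got_lock = $wpdb->get_var( $wpdb->prepare( 'SELECT GET_LOCK(%s, 10)', $lock_key ) );

if ( '1' !== (string) $got_lock ) {
    // Timed out or errored — log and fall back to existing behavior.
    return;
}

try {
    $existing = $this->findExistingEvent( $event );
    if ( ! $existing ) {
        $post_id = wp_insert_post( $postarr );
        // ...meta/taxonomy writes...
    }
} finally {
    $wpdb->query( $wpdb->prepare( 'SELECT RELEASE_LOCK(%s)', $lock_key ) );
}
```

MySQL advisory locks are per-connection and are released automatically if the connection drops, so a crashed job cannot wedge the pipeline. The md5-based key also stays well under MySQL's 64-character lock-name limit.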
Current mitigation
wp data-machine-events check clean-duplicates --scope=all --yes catches and cleans these after the fact. Sub-1% rate is manageable but will grow with throughput.