Race condition: concurrent flows in same pipeline create duplicate events #123

@chubes4

Description

Problem

When two flows in the same pipeline run concurrently (e.g., Ticketmaster + venue scraper both importing the same show), both can pass the findExistingEvent() dedup check before either has committed its post to the database. This creates duplicate events.

Evidence

From the Mar 19, 2026 duplicate audit (140 pairs found across 14,684 events — 0.95% rate):

Category 2 examples (race condition):

  • The Falling Spikes — Flow 93 (Flagpole, 01:33am) vs Flow 81 (Flicker Freshtix, 02:56am). Same pipeline 16 (Athens), ~83 min apart
  • The Browning — Flow 76 (TM, 10:51am) vs Flow 72 (TM, 10:53am). Same pipeline 17 (Greenville), 2 minutes apart

The title matching algorithm correctly identifies these as duplicates when tested after the fact — the issue is purely timing.

Why it's getting worse

The Action Scheduler concurrent-batches fix (commit a92989ab, Mar 19 2026) increased throughput from ~1 action per 5 minutes to 623 actions per 5 minutes. More concurrent jobs means more concurrent upserts, which means more open race windows.

Current dedup flow

EventUpsert::findExistingEvent()
  1. Ticket URL match (meta query)
  2. Fuzzy title + venue + date (WP_Query → SimilarityEngine)
  3. Exact title + venue (WP_Query)
  4. Fuzzy title + date without venue (WP_Query → SimilarityEngine)
  → If no match found → wp_insert_post()

The gap between "no match found" and wp_insert_post() completing (with all meta/taxonomy writes) is the race window.
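The check-then-insert race is easy to reproduce in miniature. This is a self-contained in-memory simulation (plain PHP, not the plugin's real API) where two "flows" both run the dedup check before either has inserted, so both pass and the store ends up with two copies:

```php
<?php
// In-memory stand-in for the wp_posts table.
$store = [];

// Simplified stand-in for EventUpsert::findExistingEvent().
function find_existing_event(array $store, string $identity): ?int {
    foreach ($store as $id => $event) {
        if ($event === $identity) {
            return $id;
        }
    }
    return null; // no match -> caller will insert
}

$identity = 'pipeline16|2026-03-19|the falling spikes';

// Interleaving: both flows run the dedup check first...
$flow_a_match = find_existing_event($store, $identity); // null
$flow_b_match = find_existing_event($store, $identity); // null

// ...then both insert, because each saw an empty store.
if ($flow_a_match === null) { $store[] = $identity; }
if ($flow_b_match === null) { $store[] = $identity; }

echo count($store); // 2 — the duplicate pair
```

With a serialized schedule (check, insert, check, insert) the second check would find the first insert and skip; the duplicate only appears when both checks land inside the window.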

Possible approaches

1. Advisory lock on (pipeline_id, event_date, normalized_title_hash)

Before findExistingEvent(), acquire a MySQL GET_LOCK() with a key derived from the pipeline + date + title hash. Release after post insert. This serializes upserts for the same event identity within a pipeline.

Pros: Simple, no schema changes, works at DB level
Cons: Lock contention could slow throughput, need careful timeout handling
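The lock primitive itself is small. A sketch of the acquire/release pair, assuming WordPress ($wpdb) is loaded; the function names are hypothetical, and only the key derivation is pure:

```php
<?php
// MySQL lock names are capped at 64 chars; md5 keeps the key well under.
function dme_upsert_lock_key(int $pipeline_id, string $date, string $normalized_title): string {
    return 'dme_upsert_' . md5($pipeline_id . '|' . $date . '|' . $normalized_title);
}

function dme_acquire_upsert_lock(string $key, int $timeout = 10): bool {
    global $wpdb;
    // GET_LOCK returns 1 on success, 0 on timeout, NULL on error —
    // treat anything other than 1 as "did not get the lock".
    return (string) $wpdb->get_var(
        $wpdb->prepare('SELECT GET_LOCK(%s, %d)', $key, $timeout)
    ) === '1';
}

function dme_release_upsert_lock(string $key): void {
    global $wpdb;
    $wpdb->query($wpdb->prepare('SELECT RELEASE_LOCK(%s)', $key));
}
```

One caveat worth noting: GET_LOCK locks are per-connection and released automatically if the connection drops, so a crashed worker can't wedge the queue, but the release must still be paired with every acquire on the happy path.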

2. Processed-items dedup at the event identity level

After the AI step but before upsert, check a processed-items table keyed on (pipeline_id, normalized_title, date). If already processed by another flow in the same pipeline, skip.

Pros: Uses existing infrastructure, prevents the race entirely
Cons: Needs the normalized title before upsert (available from AI output)
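A sketch of the identity-level guard, assuming WordPress is loaded. The table name and columns are hypothetical; the real schema would come from the existing processed-items infrastructure. The key idea is to let a UNIQUE index make the insert itself the atomic check:

```php
<?php
function dme_event_identity_key(int $pipeline_id, string $normalized_title, string $date): string {
    return $pipeline_id . '|' . $date . '|' . $normalized_title;
}

function dme_claim_event_identity(string $identity_key): bool {
    global $wpdb;
    $table = $wpdb->prefix . 'dme_processed_events'; // hypothetical table
    // With a UNIQUE index on identity_key, exactly one concurrent flow
    // gets rows_affected = 1; the loser gets 0 from INSERT IGNORE.
    $inserted = $wpdb->query($wpdb->prepare(
        "INSERT IGNORE INTO {$table} (identity_key, claimed_at) VALUES (%s, NOW())",
        $identity_key
    ));
    return $inserted === 1;
}

// Usage after the AI step, before upsert:
// if (!dme_claim_event_identity($key)) { return; } // another flow already claimed it
```

Unlike the advisory lock, nothing waits here: the losing flow simply skips, which is the right behavior when both flows carry the same event.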

3. Post-insert dedup sweep

After wp_insert_post(), immediately query for duplicates at the same venue+date and trash the newer one if found. Essentially what clean-duplicates does, but inline.

Pros: Catches 100% of races, simple
Cons: Creates-then-deletes (wasteful), brief window where dupe is visible
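A sketch of the inline sweep, assuming WordPress is loaded. The post type and meta keys below are hypothetical stand-ins for the plugin's real schema; a real sweep would also run the SimilarityEngine title check on each candidate:

```php
<?php
function dme_newer_of(int $a, int $b): int {
    return max($a, $b); // higher post ID = inserted later
}

function dme_sweep_duplicates(int $new_post_id, int $venue_id, string $date): void {
    $dupe_ids = get_posts([
        'post_type'   => 'dme_event',   // hypothetical post type
        'post_status' => 'publish',
        'exclude'     => [$new_post_id],
        'fields'      => 'ids',
        'meta_query'  => [
            ['key' => '_dme_venue_id', 'value' => $venue_id], // hypothetical key
            ['key' => '_dme_date',     'value' => $date],     // hypothetical key
        ],
    ]);

    foreach ($dupe_ids as $existing_id) {
        // Keep the older post, trash the newer of each pair.
        wp_trash_post(dme_newer_of($existing_id, $new_post_id));
        if (dme_newer_of($existing_id, $new_post_id) === $new_post_id) {
            return; // our own insert lost the race; nothing more to sweep
        }
    }
}
```

Note that if both racing flows run the sweep, each sees the other's post and both try to trash the newer one; wp_trash_post() on an already-trashed post is harmless, so the outcome converges on a single survivor.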

Recommendation

Approach 1 (advisory lock) is the cleanest. The lock key could be:

$lock_key = 'dme_upsert_' . md5($pipeline_id . '|' . $date . '|' . SimilarityEngine::normalizeTitle($title));

With a 10-second timeout, this would serialize concurrent upserts for the same event without blocking unrelated events.
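Putting it together, a sketch of how the lock would wrap the existing flow, assuming WordPress is loaded. findExistingEvent() and insertEvent() stand in for the real EventUpsert methods; the essential detail is re-running the dedup check *inside* the lock, so a flow that waited sees the post the previous holder just inserted:

```php
<?php
function dme_guarded_upsert(int $pipeline_id, string $date, string $title): int {
    global $wpdb;

    $key = 'dme_upsert_' . md5(
        $pipeline_id . '|' . $date . '|' . SimilarityEngine::normalizeTitle($title)
    );

    // 10s: long enough for a normal insert (meta + taxonomy writes),
    // short enough that a stuck holder doesn't stall the whole queue.
    $locked = (string) $wpdb->get_var(
        $wpdb->prepare('SELECT GET_LOCK(%s, %d)', $key, 10)
    ) === '1';

    try {
        // Second dedup pass under the lock: catches the post a racing
        // flow committed while we were waiting. If the lock timed out,
        // this degrades to today's behavior (race possible, not worse).
        $existing = findExistingEvent($title, $date);
        if ($existing) {
            return $existing;
        }
        return insertEvent($title, $date);
    } finally {
        if ($locked) {
            $wpdb->query($wpdb->prepare('SELECT RELEASE_LOCK(%s)', $key));
        }
    }
}
```

Because the key includes the pipeline ID, date, and normalized title, two flows only contend when they carry the same event identity; unrelated upserts proceed in parallel.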

Current mitigation

wp data-machine-events check clean-duplicates --scope=all --yes catches and cleans these after the fact. Sub-1% rate is manageable but will grow with throughput.
