Trash Recovery Race Condition - Failed Rollback Leaves Orphaned Records
Description:
There's a subtle but real problem in how we handle failed deletions. When BaseRaw.findOneAndDelete() tries to delete something and it fails, the code attempts to "undo" by removing the trash record it just created. Sounds good in theory, but here's the catch - that rollback cleanup can also fail, and when it does, the code just... doesn't notice.
So you end up with trash records floating around that point to nothing. The main document is still there (because deletion failed), but now there's also a trash record saying "hey, this was deleted" - except it wasn't.
Where this happens: packages/models/src/models/BaseRaw.ts lines 338-366
Why it matters: Over time, the trash collection fills up with these orphaned records. It's not catastrophic like Issue #4, but it's a slow leak - wasted storage, slower queries, and a sign that our error handling is kind of optimistic. This affects all 87+ models that inherit from BaseRaw.
Steps to reproduce:
Scenario A - When MongoDB is slow to respond:
Let's say you have a MongoDB replica set with write concerns enabled (pretty standard for production).
- Set up MongoDB with `writeConcern: { w: 'majority', wtimeout: 1000 }`
- Add some network lag between your MongoDB nodes (like 500ms - not uncommon in cloud deployments)
- Go delete a user from the admin panel
- Watch what happens:
- First, the trash record gets written successfully
- Then the actual deletion times out - MongoDB says "took too long, giving up"
- Code goes "oh no, better clean up that trash record"
- But... the trash deletion ALSO times out
- MongoDB reports the rollback timeout too, but nothing gets logged

**Scenario B - Oops, wrong permissions:**
Someone configured MongoDB permissions a bit carelessly (happens more than you'd think):
- The rocketchat user can insert/update the trash collection, but not delete from it
// Trash collection permissions: find, update, insert... but no "remove"
// Main collection permissions: find, update, insert, remove ✓
- Try to delete a message
- Trash insert works fine (we have permission)
- Main deletion fails for some reason (maybe a constraint violation)
- Code tries to rollback by deleting the trash record
- MongoDB says "Access Denied - you can't delete from trash"
- But the code only throws the original error, so nobody knows the cleanup failed
The trash record is now stuck there permanently, and your logs won't even tell you why.
**Scenario C - Two servers fighting over the same deletion:**
You're running multiple Rocket.Chat instances behind a load balancer (because you're smart about scaling).
1. Two users somehow trigger deletion of the same room at almost the exact same time
2. Request hits Instance A, request hits Instance B
3. Instance A writes a trash record for "room1"
4. Instance B also writes a trash record for "room1" (upsert updates the same record)
5. Instance A tries to delete the actual room - fails because Instance B just modified it
6. Instance A goes "okay, rollback time" and tries to delete the trash record
7. Instance B ALSO fails and ALSO tries to rollback
8. One of them successfully deletes the trash record, the other gets `deletedCount: 0`
9. Neither instance checks if their rollback actually worked
What should actually happen when a deletion fails and we need to roll back:
1. After trying to delete the trash record, **check if it actually worked** - did `deletedCount` equal 1?
2. If the rollback failed, **log it as a separate error** with all the details - don't just swallow it
3. **Alert someone** when rollbacks fail - this shouldn't be silent
4. Make rollbacks safe to retry if needed
5. Better yet, use MongoDB transactions so we don't need manual rollbacks at all
Here's what the code should look like:
```typescript
try {
	await this.col.deleteOne({ _id } as Filter<T>);
} catch (e) {
	const rollbackResult = await this.trash?.deleteOne({ _id } as Filter<TDeleted>);
	if (!rollbackResult || rollbackResult.deletedCount === 0) {
		logger.error('Rollback failed: trash record not deleted', {
			collection: this.name,
			_id,
			originalError: e.message,
			rollbackResult,
		});
		// Fire alert to monitoring system
		await alerting.criticalError('TRASH_ROLLBACK_FAILURE', { ... });
	}
	throw e; // still surface the original failure to the caller
}
```
Here's the actual code running in production right now (BaseRaw.ts lines 338-366):
```typescript
async findOneAndDelete(filter: Filter<T>, options?: FindOneAndDeleteOptions): Promise<WithId<T> | null> {
	const doc = await this.col.findOne(filter);
	if (!doc) {
		return null;
	}
	// Put a copy in trash (`_id` and the `trash` payload are derived from `doc`; abridged here)
	await this.trash?.updateOne({ _id } as Filter<TDeleted>, { $set: trash }, {
		upsert: true,
	});
	try {
		await this.col.deleteOne({ _id } as Filter<T>); // This can fail
	} catch (e) {
		await this.trash?.deleteOne({ _id } as Filter<TDeleted>); // This can ALSO fail... but we never check
		throw e; // Just throw the original error and hope for the best
	}
	return doc as WithId<T>;
}
```
What actually goes wrong:
The MongoDB "success" that isn't:
Here's the thing about MongoDB that makes this worse - when you try to delete something that doesn't exist, it doesn't throw an error:
```typescript
await trash.deleteOne({ _id: 'nonexistent123' });
// Returns: { acknowledged: true, deletedCount: 0 }
```
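Because `deleteOne` never throws for a missing document, the only reliable signal is `deletedCount`. Here is a minimal sketch of a verification helper (the `DeleteResult` shape mirrors the Node driver's result type; the helper name is hypothetical, not something in BaseRaw today):

```typescript
// Hypothetical helper: the only way to detect a silent rollback failure
// is to inspect deletedCount, since deleteOne "succeeds" either way.
interface DeleteResult {
	acknowledged: boolean;
	deletedCount: number;
}

function rollbackSucceeded(result: DeleteResult | undefined): boolean {
	// `undefined` covers the `this.trash?` optional-chaining case
	return result !== undefined && result.acknowledged && result.deletedCount === 1;
}

// A delete of a nonexistent _id is acknowledged but removes nothing:
const silentFailure: DeleteResult = { acknowledged: true, deletedCount: 0 };
console.log(rollbackSucceeded(silentFailure)); // false
console.log(rollbackSucceeded({ acknowledged: true, deletedCount: 1 })); // true
```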
**The slow accumulation problem:**
Let's say you're processing 100,000 deletions a day across all your collections. If just 0.1% of rollbacks fail (which is pretty optimistic):
- **100 orphaned trash records every day**
- Over the 30-day trash TTL window: **3,000 orphaned records** hanging around
- In a year: **36,500 trash orphans created** (though older ones do expire)
It's not catastrophic, but it's like having a small leak in your roof. Over time, it adds up.
**The lost error problem:**
```typescript
catch (e) {
	await this.trash?.deleteOne(...); // If this throws, it replaces the original error
	throw e; // Either way, we only see the first error
}
```
If the rollback deletion throws an error, we lose the original error message. If it fails silently (deletedCount: 0), we don't know anything went wrong at all.
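One way to keep both failure signals visible is to wrap the rollback in its own try/catch. A sketch with injected callbacks so nothing depends on the real BaseRaw internals (`deleteMain`, `rollbackTrash`, and the `log` parameter are hypothetical stand-ins):

```typescript
// Sketch: never let a rollback failure replace the original error, and
// never let a silent deletedCount: 0 pass unnoticed.
type RollbackResult = { deletedCount: number };

async function deleteWithVerifiedRollback(
	deleteMain: () => Promise<void>,
	rollbackTrash: () => Promise<RollbackResult>,
	log: (msg: string, detail?: unknown) => void = console.error,
): Promise<void> {
	try {
		await deleteMain();
	} catch (originalError) {
		try {
			const result = await rollbackTrash();
			if (result.deletedCount === 0) {
				log('Rollback silently failed (deletedCount: 0)');
			}
		} catch (rollbackError) {
			log('Rollback threw; keeping the original error', rollbackError);
		}
		throw originalError; // the caller always sees the real cause
	}
}
```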
Performance slowly gets worse:
As trash accumulates with orphans:
- Queries like `db.rocketchat__trash.find({__collection__: "message"})` take longer
- Indexes grow bigger
- Backups take more time
- Restores take more time
It's death by a thousand cuts.
### Server Setup Information:
**How this relates to Issue #4:**
Both bugs come from the same fundamental mistake - treating multiple separate database operations like they're one atomic thing, when they're not.
Issue #4 (deleteOne): Doesn't even try to rollback
findOne → trash.updateOne → col.deleteOne
(if it crashes between steps, you're just... stuck)
Issue #5 (findOneAndDelete): Tries to rollback, but doesn't check if it worked
col.findOne → trash.updateOne → col.deleteOne → trash.deleteOne (rollback)
(at least there's an attempt, but it's unverified)
**Which parts of the app are affected:**
Any model that inherits from BaseRaw can hit this - that's basically everything:
- Messages - your chat history
- Rooms - channels, DMs, all of it
- Users - account management
- Subscriptions - who's in what room
- LivechatInquiry - customer support queue
- Settings - system config
- ...and 81 other collections

**Why trash keeps growing:**
The trash collection has a 30-day expiry on records (defined in `packages/models/src/models/Trash.ts`):
```typescript
{ key: { _deletedAt: 1 }, expireAfterSeconds: 60 * 60 * 24 * 30 } // 30 days
```
So orphaned records stick around for a full month before MongoDB auto-deletes them. During that time they're taking up:
- Disk space (each orphan record is stored)
- Index space (MongoDB maintains indexes for them)
- Query time (have to scan through them)
- Backup space (they get backed up too)
Trash Collection Growth:
`packages/models/src/models/Trash.ts`:

```typescript
export class TrashRaw extends BaseRaw<RocketChatRecordDeleted<any>> {
	modelIndexes() {
		return [
			{ key: { __collection__: 1 } },
			{ key: { _deletedAt: 1 }, expireAfterSeconds: 60 * 60 * 24 * 30 }, // 30-day TTL
			{ key: { rid: 1, __collection__: 1, _deletedAt: 1 } },
		];
	}
}
```

**Real examples from production code:**
**Example 1: Livechat deletion gone wrong**
From `apps/meteor/app/livechat/server/lib/rooms.ts:290`:
```typescript
LivechatRooms.removeById(rid) // Uses the broken findOneAndDelete
```
Picture this: A support agent closes a livechat and deletes the room. At the exact same moment, the visitor sends one last "thank you" message. The message updates the room, which causes the deletion to fail. The code tries to rollback and clean up the trash record, but that also fails silently. Now you have a trash record for a livechat room that isn't actually deleted.
**Example 2: User can't stay deleted**
From `apps/meteor/app/lib/server/functions/deleteUser.ts:170`:

```typescript
await Users.removeById(userId); // Can use the findOneAndDelete path
```
Admin deletes an inactive user account. But the user happens to be logging in at that exact moment, which updates their `lastLogin` timestamp. The deletion fails because the document just changed. The rollback times out due to write concerns. Result: The user record is still in the main collection (they can still log in), but there's also an orphaned trash record claiming they were deleted.
**The fix that already exists but we're not using:**
MongoDB has supported multi-document transactions since version 4.0. We could use them and avoid this entire rollback mess:
```typescript
const session = db.startSession();
session.startTransaction();
try {
	await trash.updateOne({ _id }, { $set: trash }, { upsert: true, session });
	await col.deleteOne({ _id }, { session });
	await session.commitTransaction(); // Everything succeeds together
} catch (e) {
	await session.abortTransaction(); // Or everything rolls back automatically
	throw e;
} finally {
	session.endSession();
}
```
With transactions, if the deletion fails, MongoDB automatically undoes the trash insert. No manual cleanup needed, no orphans possible.
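The commit-or-abort contract above can be factored into a small helper. The sketch below models the session with a minimal interface so the control flow is testable without a live replica set; the real Node driver additionally offers `session.withTransaction()`, which also retries transient errors for you:

```typescript
// Minimal session contract, mirroring the subset of ClientSession used here
interface TxnSession {
	startTransaction(): void;
	commitTransaction(): Promise<void>;
	abortTransaction(): Promise<void>;
	endSession(): void;
}

// Run `body` inside a transaction: commit if it resolves, abort if it throws
async function runInTransaction<T>(session: TxnSession, body: () => Promise<T>): Promise<T> {
	session.startTransaction();
	try {
		const result = await body();
		await session.commitTransaction(); // trash write + delete land together
		return result;
	} catch (e) {
		await session.abortTransaction(); // the trash write is undone automatically
		throw e;
	} finally {
		session.endSession();
	}
}
```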
We're flying blind:
Right now there's zero monitoring for this problem. We don't track:
- How often rollbacks are attempted
- How often they fail
- How many orphans exist
- How fast trash is growing
We should be logging stuff like:
```typescript
metrics.increment('trash.rollback.attempt');
if (rollbackResult.deletedCount === 0) {
	metrics.increment('trash.rollback.failure');
	alerting.critical('Trash rollback just failed', { collection, _id });
}
```

**How much space does this waste?**
Let's do some back-of-napkin math:
- Say you have 100,000 deletions per day (messages, rooms, users, everything)
- Even a tiny 0.1% rollback failure rate = 100 orphans daily
- Each record is roughly 2KB on average
- Trash keeps them for 30 days before auto-cleanup
Storage waste:
- Per day: 100 records × 2KB = 200KB (not much)
- Over 30 days: 3,000 records × 2KB = 6MB (still okay)
- Created in a year: 36,500 records × 2KB = 73MB total (many get cleaned up)
Plus index overhead:
- 3 indexes on the trash collection
- 3,000 orphans × 3 indexes × ~100 bytes each = ~900KB extra index space
It's not going to bring down your server, but it's like leaving the tap dripping - wasteful and sloppy.
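The back-of-napkin numbers above can be checked in a few lines (the rates are this report's assumptions, not measured values):

```typescript
// Assumed rates from the estimate above
const deletionsPerDay = 100_000;
const recordSizeKB = 2;
const ttlDays = 30;

// 0.1% rollback failure rate, expressed as an exact division
const orphansPerDay = deletionsPerDay / 1000;
// Orphans alive at once, given the 30-day TTL window
const steadyStateOrphans = orphansPerDay * ttlDays;
// Steady-state storage waste in MB (1 MB = 1000 KB here, matching the text)
const steadyStateWasteMB = (steadyStateOrphans * recordSizeKB) / 1000;

console.log(orphansPerDay, steadyStateOrphans, steadyStateWasteMB); // 100 3000 6
```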
### Relevant logs:
**The problem: There are no logs for this.**
When a rollback fails, nothing gets logged. Silence. Which is why this issue probably exists in production right now and nobody's noticed.
**What we SHOULD be logging:**
When the main deletion fails:

```
[ERROR] BaseRaw.findOneAndDelete: Main deletion failed, trying to rollback
Collection: rocketchat_message
MessageID: msg_abc123
Error: WriteConcernError: took too long, gave up
TrashInserted: yes
AttemptingRollback: yes
```
When the rollback also fails:

```
[CRITICAL] BaseRaw.findOneAndDelete: Rollback failed - we have a problem
Collection: rocketchat_message
MessageID: msg_abc123
TrashDeleteResult: {acknowledged: true, deletedCount: 0} ← deleted nothing!
MainDeleteError: WriteConcernError: took too long
State: INCONSISTENT - trash orphaned, main record still there
RequiresManualCleanup: yes, someone needs to look at this
```
**How to detect if you have this problem right now:**
Want to see how many orphaned trash records you have? Run this:
```javascript
// Find trash records whose main record STILL exists - i.e. the deletion
// failed but the rollback left the trash record behind
db.rocketchat__trash.aggregate([
	{ $match: { __collection__: "message" } },
	{ $lookup: {
		from: "rocketchat_message",
		localField: "_id",
		foreignField: "_id",
		as: "main_record"
	}},
	{ $match: { "main_record.0": { $exists: true } } }, // Trash record AND main record both present
	{ $group: {
		_id: "$__collection__",
		count: { $sum: 1 },
		oldestOrphan: { $min: "$_deletedAt" },
		newestOrphan: { $max: "$_deletedAt" }
	}}
])
// You'll get something like:
// { "_id": "message", "count": 1234, "oldestOrphan": ISODate("2025-12-15"), ... }
// { "_id": "room", "count": 456, ... }
// { "_id": "user", "count": 89, ... }
```
If those counts are high, you're seeing this bug in action.
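The same check can be modeled in a few lines of TypeScript (illustrative only; the field names follow the trash schema, but the data is made up):

```typescript
interface TrashRecord {
	_id: string;
	__collection__: string;
	_deletedAt: Date;
}

// A rollback-failure orphan: the trash record claims the document was
// deleted, yet the main collection still contains that _id.
function findOrphans(trash: TrashRecord[], mainIds: Set<string>): TrashRecord[] {
	return trash.filter((t) => mainIds.has(t._id));
}

const trashRecords: TrashRecord[] = [
	{ _id: 'msg1', __collection__: 'message', _deletedAt: new Date() }, // really deleted - fine
	{ _id: 'msg2', __collection__: 'message', _deletedAt: new Date() }, // deletion failed - orphan
];
const mainCollectionIds = new Set(['msg2']); // msg2 still exists in the main collection

console.log(findOrphans(trashRecords, mainCollectionIds).map((t) => t._id)); // [ 'msg2' ]
```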
Check how big your trash collection is:

```javascript
db.rocketchat__trash.stats().size // Total size in bytes
db.rocketchat__trash.stats().count // Total number of records

// See which collections are creating the most trash
db.rocketchat__trash.aggregate([
	{ $group: { _id: "$__collection__", count: { $sum: 1 } } },
	{ $sort: { count: -1 } }
])
```
If trash is growing faster than expected, rollbacks are probably failing.