Trash Recovery Race Condition - Failed Rollback Leaves Orphaned Records
Description:
There's a subtle but real problem in how we handle failed deletions. When BaseRaw.findOneAndDelete() tries to delete something and it fails, the code attempts to "undo" by removing the trash record it just created. Sounds good in theory, but here's the catch - that rollback cleanup can also fail, and when it does, the code just... doesn't notice.
So you end up with trash records floating around that point to nothing. The main document is still there (because deletion failed), but now there's also a trash record saying "hey, this was deleted" - except it wasn't.
Where this happens: packages/models/src/models/BaseRaw.ts lines 338-366
Why it matters: Over time, the trash collection fills up with these orphaned records. It's not catastrophic like Issue #4, but it's a slow leak - wasted storage, slower queries, and a sign that our error handling is kind of optimistic. This affects all 87+ models that inherit from BaseRaw.
Steps to reproduce:
Scenario A - When MongoDB is slow to respond:
Let's say you have a MongoDB replica set with write concerns enabled (pretty standard for production).
- Set up MongoDB with `writeConcern: { w: 'majority', wtimeout: 1000 }`
- Add some network lag between your MongoDB nodes (like 500ms - not uncommon in cloud deployments)
- Go delete a user from the admin panel
- Watch what happens:
- First, the trash record gets written successfully
- Then the actual deletion times out - MongoDB says "took too long, giving up"
- Code goes "oh no, better clean up that trash record"
- But... the trash deletion ALSO times out
- MongoDB reports the rollback timeout too, but nothing gets logged

**Scenario B - Oops, wrong permissions:**
Someone configured MongoDB permissions a bit carelessly (happens more than you'd think):
- The rocketchat user can insert/update the trash collection, but not delete from it
// Trash collection permissions: find, update, insert... but no "remove"
// Main collection permissions: find, update, insert, remove ✓
- Try to delete a message
- Trash insert works fine (we have permission)
- Main deletion fails for some reason (maybe a constraint violation)
- Code tries to rollback by deleting the trash record
- MongoDB says "Access Denied - you can't delete from trash"
- But the code only throws the original error, so nobody knows the cleanup failed
The trash record is now stuck there permanently, and your logs won't even tell you why.
**Scenario C - Two servers fighting over the same deletion:**
You're running multiple Rocket.Chat instances behind a load balancer (because you're smart about scaling).
1. Two users somehow trigger deletion of the same room at almost the exact same time
2. Request hits Instance A, request hits Instance B
3. Instance A writes a trash record for "room1"
4. Instance B also writes a trash record for "room1" (upsert updates the same record)
5. Instance A tries to delete the actual room - fails because Instance B just modified it
6. Instance A goes "okay, rollback time" and tries to delete the trash record
7. Instance B ALSO fails and ALSO tries to rollback
8. One of them successfully deletes the trash record, the other gets `deletedCount: 0`
9. Neither instance checks if their rollback actually worked
What should actually happen when a deletion fails and we need to roll back:
1. After trying to delete the trash record, **check if it actually worked** - did `deletedCount` equal 1?
2. If the rollback failed, **log it as a separate error** with all the details - don't just swallow it
3. **Alert someone** when rollbacks fail - this shouldn't be silent
4. Make rollbacks safe to retry if needed
5. Better yet, use MongoDB transactions so we don't need manual rollbacks at all
Here's what the code should look like:
```typescript
try {
	await this.col.deleteOne({ _id } as Filter<T>);
} catch (e) {
	const rollbackResult = await this.trash?.deleteOne({ _id } as Filter<TDeleted>);
	if (!rollbackResult || rollbackResult.deletedCount === 0) {
		logger.error('Rollback failed: trash record not deleted', {
			collection: this.name,
			_id,
			originalError: e.message,
			rollbackResult,
		});
		// Fire alert to monitoring system
		await alerting.criticalError('TRASH_ROLLBACK_FAILURE', { ... });
	}
	throw e; // still surface the original failure to the caller
}
```
Here's the actual code running in production right now (BaseRaw.ts lines 338-366):
```typescript
async findOneAndDelete(filter: Filter<T>, options?: FindOneAndDeleteOptions): Promise<WithId<T> | null> {
	const doc = await this.col.findOne(filter);
	if (!doc) {
		return null;
	}
	// Put a copy in trash (`_id` and the `trash` payload are derived from `doc`; abridged here)
	await this.trash?.updateOne({ _id } as Filter<TDeleted>, { $set: trash }, {
		upsert: true,
	});
	try {
		await this.col.deleteOne({ _id } as Filter<T>); // This can fail
	} catch (e) {
		await this.trash?.deleteOne({ _id } as Filter<TDeleted>); // This can ALSO fail... but we never check
		throw e; // Just throw the original error and hope for the best
	}
	return doc as WithId<T>;
}
```
What actually goes wrong:
The MongoDB "success" that isn't:
Here's the thing about MongoDB that makes this worse - when you try to delete something that doesn't exist, it doesn't throw an error:
```typescript
await trash.deleteOne({ _id: 'nonexistent123' });
// Returns: { acknowledged: true, deletedCount: 0 }
```
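Because `deleteOne` never throws for a missing document, the only reliable signal is `deletedCount`. Here is a minimal sketch of a verification helper (the `DeleteResult` shape mirrors the Node driver's result type; the helper name is hypothetical, not something in BaseRaw today):

```typescript
// Hypothetical helper: the only way to detect a silent rollback failure
// is to inspect deletedCount, since deleteOne "succeeds" either way.
interface DeleteResult {
	acknowledged: boolean;
	deletedCount: number;
}

function rollbackSucceeded(result: DeleteResult | undefined): boolean {
	// `undefined` covers the `this.trash?` optional-chaining case
	return result !== undefined && result.acknowledged && result.deletedCount === 1;
}

// A delete of a nonexistent _id is acknowledged but removes nothing:
const silentFailure: DeleteResult = { acknowledged: true, deletedCount: 0 };
console.log(rollbackSucceeded(silentFailure)); // false
console.log(rollbackSucceeded({ acknowledged: true, deletedCount: 1 })); // true
```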
**The slow accumulation problem:**
Let's say you're processing 100,000 deletions a day across all your collections. If just 0.1% of rollbacks fail (which is pretty optimistic):
- **100 orphaned trash records every day**
- Over the 30-day trash TTL window: **3,000 orphaned records** hanging around
- In a year: **36,500 trash orphans created** (though older ones do expire)
It's not catastrophic, but it's like having a small leak in your roof. Over time, it adds up.
**The lost error problem:**
```typescript
catch (e) {
	await this.trash?.deleteOne(...); // If this throws, it replaces the original error
	throw e; // Either way, we only see the first error
}
```
If the rollback deletion throws an error, we lose the original error message. If it fails silently (deletedCount: 0), we don't know anything went wrong at all.
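One way to keep both failure signals visible is to wrap the rollback in its own try/catch. A sketch with injected callbacks so nothing depends on the real BaseRaw internals (`deleteMain`, `rollbackTrash`, and the `log` parameter are hypothetical stand-ins):

```typescript
// Sketch: never let a rollback failure replace the original error, and
// never let a silent deletedCount: 0 pass unnoticed.
type RollbackResult = { deletedCount: number };

async function deleteWithVerifiedRollback(
	deleteMain: () => Promise<void>,
	rollbackTrash: () => Promise<RollbackResult>,
	log: (msg: string, detail?: unknown) => void = console.error,
): Promise<void> {
	try {
		await deleteMain();
	} catch (originalError) {
		try {
			const result = await rollbackTrash();
			if (result.deletedCount === 0) {
				log('Rollback silently failed (deletedCount: 0)');
			}
		} catch (rollbackError) {
			log('Rollback threw; keeping the original error', rollbackError);
		}
		throw originalError; // the caller always sees the real cause
	}
}
```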
Performance slowly gets worse:
As trash accumulates with orphans:
- Queries like `db.rocketchat__trash.find({__collection__: "message"})` take longer
- Indexes grow bigger
- Backups take more time
- Restores take more time
It's death by a thousand cuts.
### Server Setup Information:
**How this relates to Issue #4:**
Both bugs come from the same fundamental mistake - treating multiple separate database operations like they're one atomic thing, when they're not.
Issue #4 (deleteOne): Doesn't even try to rollback
findOne → trash.updateOne → col.deleteOne
(if it crashes between steps, you're just... stuck)
Issue #5 (findOneAndDelete): Tries to rollback, but doesn't check if it worked
col.findOne → trash.updateOne → col.deleteOne → trash.deleteOne (rollback)
(at least there's an attempt, but it's unverified)
**Which parts of the app are affected:**
Any model that inherits from BaseRaw can hit this - that's basically everything:
- Messages - your chat history
- Rooms - channels, DMs, all of it
- Users - account management
- Subscriptions - who's in what room
- LivechatInquiry - customer support queue
- Settings - system config
- ...and 81 other collections

**Why trash keeps growing:**
The trash collection has a 30-day expiry on records (defined in `packages/models/src/models/Trash.ts`):
```typescript
{ key: { _deletedAt: 1 }, expireAfterSeconds: 60 * 60 * 24 * 30 } // 30 days
```
So orphaned records stick around for a full month before MongoDB auto-deletes them. During that time they're taking up:
- Disk space (each orphan record is stored)
- Index space (MongoDB maintains indexes for them)
- Query time (have to scan through them)
- Backup space (they get backed up too)
Trash Collection Growth:
`packages/models/src/models/Trash.ts`:

```typescript
export class TrashRaw extends BaseRaw<RocketChatRecordDeleted<any>> {
	modelIndexes() {
		return [
			{ key: { __collection__: 1 } },
			{ key: { _deletedAt: 1 }, expireAfterSeconds: 60 * 60 * 24 * 30 }, // 30-day TTL
			{ key: { rid: 1, __collection__: 1, _deletedAt: 1 } },
		];
	}
}
```

**Real examples from production code:**
**Example 1: Livechat deletion gone wrong**
From `apps/meteor/app/livechat/server/lib/rooms.ts:290`:
```typescript
LivechatRooms.removeById(rid) // Uses the broken findOneAndDelete
```
Picture this: A support agent closes a livechat and deletes the room. At the exact same moment, the visitor sends one last "thank you" message. The message updates the room, which causes the deletion to fail. The code tries to rollback and clean up the trash record, but that also fails silently. Now you have a trash record for a livechat room that isn't actually deleted.
**Example 2: User can't stay deleted**
From `apps/meteor/app/lib/server/functions/deleteUser.ts:170`:

```typescript
await Users.removeById(userId); // Can use the findOneAndDelete path
```
Admin deletes an inactive user account. But the user happens to be logging in at that exact moment, which updates their `lastLogin` timestamp. The deletion fails because the document just changed. The rollback times out due to write concerns. Result: The user record is still in the main collection (they can still log in), but there's also an orphaned trash record claiming they were deleted.
**The fix that already exists but we're not using:**
MongoDB has supported multi-document transactions since version 4.0. We could use them and avoid this entire rollback mess:
```typescript
const session = db.startSession();
session.startTransaction();
try {
	await trash.updateOne({ _id }, { $set: trash }, { upsert: true, session });
	await col.deleteOne({ _id }, { session });
	await session.commitTransaction(); // Everything succeeds together
} catch (e) {
	await session.abortTransaction(); // Or everything rolls back automatically
	throw e;
} finally {
	session.endSession();
}
```
With transactions, if the deletion fails, MongoDB automatically undoes the trash insert. No manual cleanup needed, no orphans possible.
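The commit-or-abort contract above can be factored into a small helper. The sketch below models the session with a minimal interface so the control flow is testable without a live replica set; the real Node driver additionally offers `session.withTransaction()`, which also retries transient errors for you:

```typescript
// Minimal session contract, mirroring the subset of ClientSession used here
interface TxnSession {
	startTransaction(): void;
	commitTransaction(): Promise<void>;
	abortTransaction(): Promise<void>;
	endSession(): void;
}

// Run `body` inside a transaction: commit if it resolves, abort if it throws
async function runInTransaction<T>(session: TxnSession, body: () => Promise<T>): Promise<T> {
	session.startTransaction();
	try {
		const result = await body();
		await session.commitTransaction(); // trash write + delete land together
		return result;
	} catch (e) {
		await session.abortTransaction(); // the trash write is undone automatically
		throw e;
	} finally {
		session.endSession();
	}
}
```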
We're flying blind:
Right now there's zero monitoring for this problem. We don't track:
- How often rollbacks are attempted
- How often they fail
- How many orphans exist
- How fast trash is growing
We should be logging stuff like:
```typescript
metrics.increment('trash.rollback.attempt');
if (rollbackResult.deletedCount === 0) {
	metrics.increment('trash.rollback.failure');
	alerting.critical('Trash rollback just failed', { collection, _id });
}
```

**How much space does this waste?**
Let's do some back-of-napkin math:
- Say you have 100,000 deletions per day (messages, rooms, users, everything)
- Even a tiny 0.1% rollback failure rate = 100 orphans daily
- Each record is roughly 2KB on average
- Trash keeps them for 30 days before auto-cleanup
Storage waste:
- Per day: 100 records × 2KB = 200KB (not much)
- Over 30 days: 3,000 records × 2KB = 6MB (still okay)
- Created in a year: 36,500 records × 2KB = 73MB total (many get cleaned up)
Plus index overhead:
- 3 indexes on the trash collection
- 3,000 orphans × 3 indexes × ~100 bytes each = ~900KB extra index space
It's not going to bring down your server, but it's like leaving the tap dripping - wasteful and sloppy.
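The back-of-napkin numbers above can be checked in a few lines (the rates are this report's assumptions, not measured values):

```typescript
// Assumed rates from the estimate above
const deletionsPerDay = 100_000;
const recordSizeKB = 2;
const ttlDays = 30;

// 0.1% rollback failure rate, expressed as an exact division
const orphansPerDay = deletionsPerDay / 1000;
// Orphans alive at once, given the 30-day TTL window
const steadyStateOrphans = orphansPerDay * ttlDays;
// Steady-state storage waste in MB (1 MB = 1000 KB here, matching the text)
const steadyStateWasteMB = (steadyStateOrphans * recordSizeKB) / 1000;

console.log(orphansPerDay, steadyStateOrphans, steadyStateWasteMB); // 100 3000 6
```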
### Relevant logs:
**The problem: There are no logs for this.**
When a rollback fails, nothing gets logged. Silence. Which is why this issue probably exists in production right now and nobody's noticed.
**What we SHOULD be logging:**
When the main deletion fails:

```
[ERROR] BaseRaw.findOneAndDelete: Main deletion failed, trying to rollback
Collection: rocketchat_message
MessageID: msg_abc123
Error: WriteConcernError: took too long, gave up
TrashInserted: yes
AttemptingRollback: yes
```
When the rollback also fails:

```
[CRITICAL] BaseRaw.findOneAndDelete: Rollback failed - we have a problem
Collection: rocketchat_message
MessageID: msg_abc123
TrashDeleteResult: {acknowledged: true, deletedCount: 0} ← deleted nothing!
MainDeleteError: WriteConcernError: took too long
State: INCONSISTENT - trash orphaned, main record still there
RequiresManualCleanup: yes, someone needs to look at this
```
**How to detect if you have this problem right now:**
Want to see how many orphaned trash records you have? Run this:
```javascript
// Find trash records whose main record STILL exists - i.e. the deletion
// failed but the rollback left the trash record behind
db.rocketchat__trash.aggregate([
	{ $match: { __collection__: "message" } },
	{ $lookup: {
		from: "rocketchat_message",
		localField: "_id",
		foreignField: "_id",
		as: "main_record"
	}},
	{ $match: { "main_record.0": { $exists: true } } }, // Trash record AND main record both present
	{ $group: {
		_id: "$__collection__",
		count: { $sum: 1 },
		oldestOrphan: { $min: "$_deletedAt" },
		newestOrphan: { $max: "$_deletedAt" }
	}}
])
// You'll get something like:
// { "_id": "message", "count": 1234, "oldestOrphan": ISODate("2025-12-15"), ... }
// { "_id": "room", "count": 456, ... }
// { "_id": "user", "count": 89, ... }
```
If those counts are high, you're seeing this bug in action.
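The same check can be modeled in a few lines of TypeScript (illustrative only; the field names follow the trash schema, but the data is made up):

```typescript
interface TrashRecord {
	_id: string;
	__collection__: string;
	_deletedAt: Date;
}

// A rollback-failure orphan: the trash record claims the document was
// deleted, yet the main collection still contains that _id.
function findOrphans(trash: TrashRecord[], mainIds: Set<string>): TrashRecord[] {
	return trash.filter((t) => mainIds.has(t._id));
}

const trashRecords: TrashRecord[] = [
	{ _id: 'msg1', __collection__: 'message', _deletedAt: new Date() }, // really deleted - fine
	{ _id: 'msg2', __collection__: 'message', _deletedAt: new Date() }, // deletion failed - orphan
];
const mainCollectionIds = new Set(['msg2']); // msg2 still exists in the main collection

console.log(findOrphans(trashRecords, mainCollectionIds).map((t) => t._id)); // [ 'msg2' ]
```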
Check how big your trash collection is:

```javascript
db.rocketchat__trash.stats().size // Total size in bytes
db.rocketchat__trash.stats().count // Total number of records

// See which collections are creating the most trash
db.rocketchat__trash.aggregate([
	{ $group: { _id: "$__collection__", count: { $sum: 1 } } },
	{ $sort: { count: -1 } }
])
```
If trash is growing faster than expected, rollbacks are probably failing.