Skip to content

Conversation

@XingY
Copy link
Contributor

@XingY XingY commented Jan 15, 2026

Rationale

Lock provisioned table on domain update
Operations such as add/drop column requires ACCESS EXCLUSIVE lock on the table. If another transaction performed a SELECT on a provisioned table, adding/dropping columns from the provisioned table would have to wait until the other transaction to complete. If the other transaction happened to be waiting for the add/drop column transaction (in this case, updating exp.dataclass table), the two would dead lock.

This PR adds a ACCESS EXCLUSIVE MODE LOCK on the provisioned table when it's queried in the transaction.

Related Pull Requests

Changes

  • move exp.dataclass an exp.materialsource indexing to be done after exp.data and exp.material indexing to avoid holding on to lock.

@XingY XingY requested a review from labkey-jeckels January 15, 2026 17:58
@labkey-jeckels
Copy link
Contributor

I see the search indexer activity in later dumps but it looks to be idle in the first dumps in the log. Thus, I don't think it's causing the deadlock here, though it is piling on.

In the first dump, I see two problems, which may or may not be connected.

One is synchronization in PipelineQueueImpl. There's a job cancellation attempt from https-jsse-nio-443-exec-1 that's holding the lock but apparently not actually closing the DB connection. Other threads like https-jsse-nio-443-exec-10 and https-jsse-nio-443-exec-16 are trying to get the lock.

The second is related to the construct domain. https-jsse-nio-443-exec-2 and https-jsse-nio-443-exec-11 are both trying to update the domain. One is trying to update the exp.dataclass row while the other is trying to add the column to the provisioned table. https-jsse-nio-443-exec-24 and other threads are blocked trying to query that domain. JobThread-2.1 is running a domain validation job that's also blocked trying to query it.

Can you take another look at the earliest dump in the log and see if you agree with my assessment? And if so, do you think your patch will help, or should we pursue a different fix? Your patch seems OK to me, but I'm worried it won't address the root problem.

@XingY
Copy link
Contributor Author

XingY commented Jan 20, 2026

I see the search indexer activity in later dumps but it looks to be idle in the first dumps in the log. Thus, I don't think it's causing the deadlock here, though it is piling on.

In the first dump, I see two problems, which may or may not be connected.

One is synchronization in PipelineQueueImpl. There's a job cancellation attempt from https-jsse-nio-443-exec-1 that's holding the lock but apparently not actually closing the DB connection. Other threads like https-jsse-nio-443-exec-10 and https-jsse-nio-443-exec-16 are trying to get the lock.

The second is related to the construct domain. https-jsse-nio-443-exec-2 and https-jsse-nio-443-exec-11 are both trying to update the domain. One is trying to update the exp.dataclass row while the other is trying to add the column to the provisioned table. https-jsse-nio-443-exec-24 and other threads are blocked trying to query that domain. JobThread-2.1 is running a domain validation job that's also blocked trying to query it.

Can you take another look at the earliest dump in the log and see if you agree with my assessment? And if so, do you think your patch will help, or should we pursue a different fix? Your patch seems OK to me, but I'm worried it won't address the root problem.

Another change made to lock provisioned table IN ACCESS EXCLUSIVE MODE. I don't think it hurts to keep the search indexer change so I'll leave that in.

@XingY XingY requested review from labkey-jeckels and removed request for labkey-jeckels January 21, 2026 18:16
@XingY XingY self-assigned this Jan 21, 2026
Copy link
Contributor

@labkey-jeckels labkey-jeckels left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good for @labkey-matthewb to review as well. He added the delete-specific locking in June. This adds locking to the update scenario, and makes the lock even more restrictive.

#6775

https://www.postgresql.org/docs/current/explicit-locking.html#LOCKING-TABLES

*/
Lock getDatabaseLock();
void lockForDelete(DbSchema expSchema);
void lockForUpdateDelete(DbSchema lockSchema);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this isn't from your change, but I don't understand why we're passing this as an argument. It looks like all callers are using the exp schema.

indexSampleTypeMaterials(sampleType, q);

// GitHub Issue 783: Server lockup when updating data class domain design
// Index MaterialSource after materials indexing, to avoid holding locks on exp.MaterialSource table for too long
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with the reordering, but I'm confused by the "holding locks on exp.MaterialSource table for too long" comment. Are we somehow holding locks after fetching the rows and closing the ResultSet?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are not holding locks after reading and closing the resultset. However, since the index task also updates exp.materialsource.lastindex, it locks this exp.materialsource row for other transactions to update. In this case, it blocks the updateDomain.api from updating exp.materialsource.modified (after it had already ADD COLUMN on provisioned).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then I'm not understanding how the ordering makes a difference. For deadlocks, it shouldn't matter if you lock X and then Y, or Y and then X, as long as you're not holding them both at the same time, correct? And the indexing part shouldn't keep a lock open.

Let me know if it would be easier to talk through the scenario. I'm probably missing something.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So indexer thread locked exp.materialsource (by updating it early in the transaction, it holds edit lock to that row till the end of transaction), indexer thread wants to read provisioned (which is currently access locked by saveDomain thread, which prevents reading).
saveDomain thread locked provisioned by Add Column (until end of transaction), saveDomain wants to update exp.materialsource, which is locked by indexer thread.
The 2 threads are waiting for each other to release the lock and cannot proceed, hence the deadlock.

Copy link
Contributor Author

@XingY XingY Jan 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps the confusion (on my part) is the fact that indexer task is not wrapped in a transaction, and hence should not lock.

Update: in the case of query.saveRows, for example, indexing is done inside a transaction.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on our discussion, the change of ordering has been reverted.


var lockSchema = ExperimentService.get().getSchema();
if (lockSchema.getScope().isTransactionActive())
d.lockForUpdateDelete(lockSchema);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This lock is very heavy-handed. Is it possible to only acquire this if/when we know we have actual changes to the provisioned table?

Copy link
Contributor Author

@XingY XingY Jan 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That might be too late though. In this case the other transaction acquired shared lock before the current transaction perform Add column, resulting in failure to Add column, because the other transaction is also waiting for the current transaction to release write lock on exp.dataclass.

What we really need I think is READ "nolock", which postgres doesn't seem to support.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants