Skip to content
This repository was archived by the owner on Jan 14, 2026. It is now read-only.

Comments

ADBDEV-3685 Error handling for disqkuota worker startup stage#20

Merged
Stolb27 merged 4 commits intogpdbfrom
ADBDEV-3685
Jun 29, 2023
Merged

ADBDEV-3685 Error handling for disqkuota worker startup stage#20
Stolb27 merged 4 commits intogpdbfrom
ADBDEV-3685

Conversation

@bimboterminator1
Copy link
Member

@bimboterminator1 bimboterminator1 commented Jun 13, 2023

During diskquota worker's first run the initial set of active tables with their sizes is being loaded from diskquota.table_size table in order to warm up diskquota rejectmap and other shared memory objects. If an error occurs during this initialization process, the error will be ignored in PG_CATCH() block. Such ignorance can be potentially harmful and can lead to undesired behaviour of the whole extension. For example, if an error ocurs during initialization, local_active_table_stat_map will not be filled properly. And at the next loop iteration, tables, that are not in acitive table list will be marked as irrelevant and to be deleted both from table_size_map and table_size table in flush_to_table_size function. This situation produces extra perfomance load, which is not guaranteed to be safe.

This commit proposes the handling of the initialization errors, which occur during worker's first run. In the DiskquotaDBEntry structure the bool variable corrupted is added in order to indicate, that the worker wasn't able to initialize itself on given database. And DiskquotaDBEntry also is now passed to refresh_disk_quota_model function from worker main loop, because one need to change the state of dbEntry. The state is changed when the refresh_disk_quota_usage function catches an error, which occured during the initialization step, in PG_CATCH() block. And after the error is catched, the corrupted flag is set in given dbEntry, and then the error is rethrown. This leads to worker process termination. The launcher will not be able to start it again, because added flag is set in the database structure, and this flag is being checked inside the disk_quota_launcher_main function. The flag can be reseted by calling resetBackgroundWorkerCorruption function, which is currently called in SIGHUP handler.

@bimboterminator1 bimboterminator1 force-pushed the ADBDEV-3685 branch 2 times, most recently from 4451314 to 2bb759b Compare June 16, 2023 11:55
@bimboterminator1 bimboterminator1 changed the title ADBDEV-3685 Bad worker initialization handling ADBDEV-3685 Error handling for disqkuota worker startup stage Jun 16, 2023
@bimboterminator1 bimboterminator1 force-pushed the ADBDEV-3685 branch 2 times, most recently from b5bf85b to 5748cba Compare June 16, 2023 14:44
During diskquota worker's first run the initial set of active tables
with their sizes is being loaded from diskquota.table_size table in
order to warm up diskquota rejectmap and other shared memory objects.
If an error occurs during this initialization process, the error will be
ignored in PG_CATCH() block. Because of that local_active_table_stat_map
will not be filled properly. And at the next loop iteration tables, that
are not in acitive table list will be marked as irrelevant and to be deleted
both from table_size_map and table_size table in flush_to_table_size function.
In case when the inital set of active tables is huge (thousands of tables),
this error ignorance could lead to the formation of a too long
delete statement, which the SPI executor won't be able to process due to
memory limits. And this case can lead to worker's segmentation fault or
other errorneous behaviour of whole extension.

This commit proposes the handling of the initialization errors, which
occur during worker's first run. In the DiskquotaDBEntry structure the
bool variable "corrupted" is added in order to indicate, that the
worker wasn't able to initialize itself on given database. And
DiskquotaDBEntry also is now passed to refresh_disk_quota_model function
from worker main loop, because one need to change the state of dbEntry.
The state is changed when the refresh_disk_quota_usage function catches
an error, which occured during the initialization step, in PG_CATCH()
block. And after the error is catched, the "corrupted" flag is set in
given dbEntry, and then the error is rethrown. This leads to worker
process termination. The launcher will not be able to start it again,
because added flag is set in the database structure, and this flag is
being checked inside the disk_quota_launcher_main function. The flag
can be reseted by calling resetBackgroundWorkerCorruption function,
which is currently called in SIGHUP handler.
@bimboterminator1 bimboterminator1 marked this pull request as ready for review June 16, 2023 16:46
@bimboterminator1 bimboterminator1 requested a review from a team June 19, 2023 06:18
@Stolb27 Stolb27 requested review from RekGRpth and red1452 June 19, 2023 08:57
@red1452
Copy link

red1452 commented Jun 19, 2023

Could you get patch from PR and do upgrade test with your changes?

@bimboterminator1
Copy link
Member Author

Could you get patch from PR and do upgrade test with your changes?

Have run the upgrade tests in docker with my changes and Eugene's changes. All test passed.

red1452
red1452 previously approved these changes Jun 20, 2023
red1452
red1452 previously approved these changes Jun 23, 2023
@RekGRpth
Copy link
Member

In case when the inital set of active tables is huge (thousands of tables), this error ignorance could lead to the formation of a too long delete statement, which the SPI executor won't be able to process due to memory limits.

I think that in the latest version this is no longer true and another rationale for this patch should be found.

@bimboterminator1
Copy link
Member Author

I think that in the latest version this is no longer true and another rationale for this patch should be found.

I have rewritten PR description.

@Stolb27 Stolb27 merged commit 3b06e37 into gpdb Jun 29, 2023
@Stolb27 Stolb27 deleted the ADBDEV-3685 branch June 29, 2023 06:50
Stolb27 added a commit that referenced this pull request Jul 25, 2023
Stolb27 pushed a commit that referenced this pull request Jul 25, 2023
During diskquota worker's first run the initial set of active tables
with their sizes is being loaded from diskquota.table_size table in
order to warm up diskquota rejectmap and other shared memory objects.
If an error occurs during this initialization process, the error will be
ignored in PG_CATCH() block. Because of that local_active_table_stat_map
will not be filled properly. And at the next loop iteration tables, that
are not in acitive table list will be marked as irrelevant and to be deleted
both from table_size_map and table_size table in flush_to_table_size function.
In case when the inital set of active tables is huge (thousands of tables),
this error ignorance could lead to the formation of a too long
delete statement, which the SPI executor won't be able to process due to
memory limits. And this case can lead to worker's segmentation fault or
other errorneous behaviour of whole extension.

This commit proposes the handling of the initialization errors, which
occur during worker's first run. In the DiskquotaDBEntry structure the
bool variable "corrupted" is added in order to indicate, that the
worker wasn't able to initialize itself on given database. And
DiskquotaDBEntry also is now passed to refresh_disk_quota_model function
from worker main loop, because one need to change the state of dbEntry.
The state is changed when the refresh_disk_quota_usage function catches
an error, which occured during the initialization step, in PG_CATCH()
block. And after the error is catched, the "corrupted" flag is set in
given dbEntry, and then the error is rethrown. This leads to worker
process termination. The launcher will not be able to start it again,
because added flag is set in the database structure, and this flag is
being checked inside the disk_quota_launcher_main function. The flag
can be reseted by calling resetBackgroundWorkerCorruption function,
which is currently called in SIGHUP handler.

Cherry-picked-from: 3b06e37
to reapply above c2686c9
red1452 added a commit that referenced this pull request Nov 23, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants