ruler: stop all rule groups asynchronously on shutdown (#15804)

* ruler: stop all rule groups asynchronously on shutdown

During shutdown of the rules manager, some rule groups have already stopped and are missing evaluations while we're still waiting for other groups to finish their evaluation.

When there are many groups (in the thousands), the whole shutdown process can take up to 10 minutes, during which evaluations are missed.

Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>

* Use wrappers in stop(); rename awaitStopped()

Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>

* Add comment

Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>

---------

Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
Dimitar Dimitrov 2025-01-20 21:26:58 +01:00 committed by GitHub
parent 32d306854b
commit 2a8ae586f4
GPG key ID: B5690EEEBB952194
2 changed files with 16 additions and 2 deletions


@@ -302,11 +302,19 @@ func (g *Group) run(ctx context.Context) {
 	}
 }
 
-func (g *Group) stop() {
+func (g *Group) stopAsync() {
 	close(g.done)
+}
+
+func (g *Group) waitStopped() {
 	<-g.terminated
 }
 
+func (g *Group) stop() {
+	g.stopAsync()
+	g.waitStopped()
+}
+
 func (g *Group) hash() uint64 {
 	l := labels.New(
 		labels.Label{Name: "name", Value: g.name},


@@ -188,8 +188,14 @@ func (m *Manager) Stop() {
 
 	m.logger.Info("Stopping rule manager...")
 
+	// Stop all groups asynchronously, then wait for them to finish.
+	// This is faster than stopping and waiting for each group in sequence.
 	for _, eg := range m.groups {
-		eg.stop()
+		eg.stopAsync()
+	}
+
+	for _, eg := range m.groups {
+		eg.waitStopped()
 	}
 
 	// Shut down the groups waiting multiple evaluation intervals to write