* [PATCH] Allow having evaluation delay for rule groups
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
* [PATCH] Fix lint
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
* [PATCH] Move the option to ManagerOptions
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
* [PATCH] Include evaluation_delay in the group config
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
* Fix comments
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Add a server configuration option.
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Appease the linter #1
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Add the new server flag documentation
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Improve documentation of the new flag and configuration
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Use named parameters for clarity on the `Rule` interface
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Add `initial` to the flag help
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Change the CHANGELOG area from `ruler` to `rules`
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Rename evaluation_delay to `rule_query_offset`/`query_offset` and make it a global configuration option.
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* more docs
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Improve wording on CHANGELOG
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Add `RuleQueryOffset` to the default config in tests in case it changes
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Update docs/configuration/recording_rules.md
Co-authored-by: Julius Volz <julius.volz@gmail.com>
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Rename `RuleQueryOffset` to `QueryOffset` when in the group context.
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Improve docstring and documentation on the `rule_query_offset`
Signed-off-by: gotjosh <josue.abreu@gmail.com>
---------
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
Signed-off-by: gotjosh <josue.abreu@gmail.com>
Co-authored-by: Ganesh Vernekar <ganeshvern@gmail.com>
Co-authored-by: Julius Volz <julius.volz@gmail.com>
On Windows, Go will sleep 15ms if you ask for less. TestAsyncRuleEvaluation
compares actual delay to the nominal time, so using 15ms should work
better on Windows, and be hardly noticeable elsewhere.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
* BUGFIX: Mark the rule's restoration process as completed always
In https://github.com/prometheus/prometheus/pull/13980 I introduced a change to reduce the number of queries executed when we restore alert statuses.
With this, the querying semantics changed: we now need to go through all series before we enter the alert restoration loop. I missed the fact that exiting early when there are no rules to restore would lead to an incomplete restoration.
An alert being restored is used as a proxy for "we're now ready to write `ALERTS`/`ALERTS_FOR_STATE` metrics", so we weren't writing the series if we didn't restore anything the first time around.
---------
Signed-off-by: gotjosh <josue.abreu@gmail.com>
Prometheus restores alert state between restarts and updates. For each rule, it looks at the alerts that are meant to be active and then queries the `ALERTS_FOR_STATE` series for _each_ alert within the rules.
If the alert rule has 120 instances (or series), it'll execute the same query 120 times, each with slightly different labels.
This PR changes the approach so that we only query once per alert rule and then match each alert that we're about to restore against the resulting series set. While the approach might use a bit more memory at start-up (if even?), the restore process only runs once per restart, so I'd consider this a big win.
This builds on top of #13974
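A minimal sketch of the "query once, then match" idea, for illustration only. The `Labels`, `Series`, `key` and `restore` names are hypothetical stand-ins and not the actual types or functions in `rules/`; the real code works on Prometheus label sets and a storage series set.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Labels is a simplified, hypothetical stand-in for a Prometheus label set.
type Labels map[string]string

// Series is a simplified ALERTS_FOR_STATE series: its labels plus the
// last stored value (the time the alert became active).
type Series struct {
	Labels   Labels
	ActiveAt float64
}

// key builds a deterministic lookup key from a label set (hypothetical helper).
func key(l Labels) string {
	names := make([]string, 0, len(l))
	for n := range l {
		names = append(names, n)
	}
	sort.Strings(names)
	var b strings.Builder
	for _, n := range names {
		fmt.Fprintf(&b, "%q=%q,", n, l[n])
	}
	return b.String()
}

// restore indexes the result of a single per-rule query and then matches
// every active alert against it, instead of issuing one query per alert.
func restore(alerts []Labels, seriesSet []Series) map[string]float64 {
	byKey := make(map[string]Series, len(seriesSet))
	for _, s := range seriesSet {
		byKey[key(s.Labels)] = s
	}
	restored := make(map[string]float64)
	for _, a := range alerts {
		if s, ok := byKey[key(a)]; ok {
			restored[key(a)] = s.ActiveAt
		}
	}
	return restored
}

func main() {
	alerts := []Labels{{"instance": "a"}, {"instance": "c"}}
	series := []Series{
		{Labels: Labels{"instance": "a"}, ActiveAt: 100},
		{Labels: Labels{"instance": "b"}, ActiveAt: 200},
	}
	fmt.Println(len(restore(alerts, series)), "alert(s) restored")
}
```

The extra memory mentioned above corresponds to the `byKey` index in this sketch, which holds one entry per series returned for the rule.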
Signed-off-by: gotjosh <josue.abreu@gmail.com>
When a rule group changes or Prometheus is restarted, we need to restore the active alerts that were firing for the corresponding rule. To do that, Prometheus queries the `ALERTS_FOR_STATE` series for the previous state and restores it. If a given rule has high cardinality (think hundreds of thousands of series), this process can take a while. This is the first of a series of PRs to improve this, and I'd like to start by exposing the time it takes to restore a rule group as a gauge.
Signed-off-by: gotjosh <josue.abreu@gmail.com>
Rule warnings are logged with numDropped=N while every other component uses num_dropped=N:
```
notifier/notifier.go: level.Warn(n.logger).Log("msg", "Alert batch larger than queue capacity, dropping alerts", "num_dropped", d)
notifier/notifier.go: level.Warn(n.logger).Log("msg", "Alert notification queue full, dropping alerts", "num_dropped", d)
storage/remote/write_handler.go: _ = level.Warn(h.logger).Log("msg", "Error on ingesting out-of-order exemplars", "num_dropped", outOfOrderExemplarErrs)
rules/group.go: level.Warn(logger).Log("msg", "Error on ingesting out-of-order result from rule evaluation", "num_dropped", numOutOfOrder)
rules/group.go: level.Warn(logger).Log("msg", "Error on ingesting too old result from rule evaluation", "num_dropped", numTooOld)
rules/group.go: level.Warn(logger).Log("msg", "Error on ingesting results from rule evaluation with different value but same timestamp", "num_dropped", numDuplicates)
scrape/scrape.go: level.Warn(sl.l).Log("msg", "Error on ingesting out-of-order samples", "num_dropped", appErrs.numOutOfOrder)
scrape/scrape.go: level.Warn(sl.l).Log("msg", "Error on ingesting samples with different value but same timestamp", "num_dropped", appErrs.numDuplicates)
scrape/scrape.go: level.Warn(sl.l).Log("msg", "Error on ingesting samples that are too old or are too far into the future", "num_dropped", appErrs.numOutOfBounds)
scrape/scrape.go: level.Warn(sl.l).Log("msg", "Error on ingesting out-of-order exemplars", "num_dropped", appErrs.numExemplarOutOfOrder)
```
Rename numDropped to num_dropped for consistency.
Signed-off-by: Łukasz Mierzwa <l.mierzwa@gmail.com>
Updated & added tests
Review feedback nits
Return empty map if not indeterminate
Use highWatermark to track inflight requests counter
Appease the linter
Clarify feature flag
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
Optimize histogram iterators
Histogram iterators allocate new objects in the AtHistogram and
AtFloatHistogram methods, which makes calculating rates over long
ranges expensive.
In #13215 we allowed an existing object to be reused
when converting an integer histogram to a float histogram. This commit follows
the same idea and allows injecting an existing object in the AtHistogram and
AtFloatHistogram methods. When the injected value is nil, iterators allocate
new histograms, otherwise they populate and return the injected object.
The commit also adds a CopyTo method to Histogram and FloatHistogram which
is used in the BufferedIterator to overwrite items in the ring instead of making
new copies.
Note that a specialized HPoint pool is needed for all of this to work
(`matrixSelectorHPool`).
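A rough sketch of the injection pattern described above, using heavily simplified, hypothetical types (the `iterator` below and this cut-down `FloatHistogram` are illustrations, not the real chunk iterators or histogram structs):

```go
package main

import "fmt"

// FloatHistogram is a heavily simplified stand-in for the real type.
type FloatHistogram struct {
	Count   float64
	Sum     float64
	Buckets []float64
}

// CopyTo overwrites to with h's contents, reusing to's bucket slice
// capacity where possible instead of allocating a new slice.
func (h *FloatHistogram) CopyTo(to *FloatHistogram) {
	to.Count = h.Count
	to.Sum = h.Sum
	to.Buckets = append(to.Buckets[:0], h.Buckets...)
}

// iterator is a hypothetical iterator over already-decoded histograms.
type iterator struct {
	samples []FloatHistogram
	pos     int
}

// Next advances the iterator and reports whether a sample is available.
func (it *iterator) Next() bool {
	it.pos++
	return it.pos <= len(it.samples)
}

// AtFloatHistogram populates and returns fh when it is non-nil,
// and allocates a new histogram only when fh is nil.
func (it *iterator) AtFloatHistogram(fh *FloatHistogram) *FloatHistogram {
	if fh == nil {
		fh = &FloatHistogram{}
	}
	it.samples[it.pos-1].CopyTo(fh)
	return fh
}

func main() {
	it := &iterator{samples: []FloatHistogram{
		{Count: 1, Sum: 2, Buckets: []float64{1}},
		{Count: 3, Sum: 4, Buckets: []float64{1, 2}},
	}}
	reuse := &FloatHistogram{} // one object reused for the whole range
	var total float64
	for it.Next() {
		total += it.AtFloatHistogram(reuse).Count
	}
	fmt.Println("total count:", total)
}
```

With a long range, the loop in `main` touches a single `FloatHistogram` rather than one per sample, which is the saving this commit is after. The caller must not hold on to a previously returned value across calls, since it is overwritten.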
---------
Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
Co-authored-by: George Krajcsovits <krajorama@users.noreply.github.com>
The 'ToFloat' method on integer histograms currently allocates new memory
each time it is called.
This commit adds an optional *FloatHistogram parameter that can be used
to reuse span and bucket slices. It is up to the caller to make sure the
input float histogram is not used anymore after the call.
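As an illustration of the optional-parameter pattern, here is a simplified sketch. The struct layouts are assumptions made for the example (a single delta-encoded bucket slice, no spans or schema) and do not mirror the real `Histogram` and `FloatHistogram` types.

```go
package main

import "fmt"

// Histogram is a simplified integer histogram with delta-encoded buckets.
type Histogram struct {
	Count   uint64
	Sum     float64
	Buckets []int64 // each entry is the delta to the previous bucket count
}

// FloatHistogram is a simplified float histogram with absolute bucket counts.
type FloatHistogram struct {
	Count   float64
	Sum     float64
	Buckets []float64
}

// ToFloat converts h to a float histogram. When fh is non-nil its bucket
// slice is reused if it has enough capacity; when fh is nil a new
// FloatHistogram is allocated.
func (h *Histogram) ToFloat(fh *FloatHistogram) *FloatHistogram {
	if fh == nil {
		fh = &FloatHistogram{}
	}
	fh.Count = float64(h.Count)
	fh.Sum = h.Sum
	if cap(fh.Buckets) >= len(h.Buckets) {
		fh.Buckets = fh.Buckets[:len(h.Buckets)]
	} else {
		fh.Buckets = make([]float64, len(h.Buckets))
	}
	var cur int64
	for i, delta := range h.Buckets {
		cur += delta
		fh.Buckets[i] = float64(cur)
	}
	return fh
}

func main() {
	h := &Histogram{Count: 6, Sum: 1.5, Buckets: []int64{1, 1, 1}}
	reuse := &FloatHistogram{Buckets: make([]float64, 0, 8)}
	out := h.ToFloat(reuse) // no new bucket slice is allocated here
	fmt.Println(out == reuse, out.Buckets)
}
```

Because the result aliases the input, any earlier contents of `reuse` are gone after the call, which is why the caller has to guarantee the input is no longer in use.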
Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>