Commit graph

586 commits

Author SHA1 Message Date
gotjosh 37b408c6cd
Feature: Allow configuration of a rule evaluation delay (#14061)
* [PATCH] Allow having evaluation delay for rule groups

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>

* [PATCH] Fix lint

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>

* [PATCH] Move the option to ManagerOptions

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>

* [PATCH] Include evaluation_delay in the group config

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>

* Fix comments

Signed-off-by: gotjosh <josue.abreu@gmail.com>

* Add a server configuration option.

Signed-off-by: gotjosh <josue.abreu@gmail.com>

* Appease the linter #1

Signed-off-by: gotjosh <josue.abreu@gmail.com>

* Add the new server flag documentation

Signed-off-by: gotjosh <josue.abreu@gmail.com>

* Improve documentation of the new flag and configuration

Signed-off-by: gotjosh <josue.abreu@gmail.com>

* Use named parameters for clarity on the `Rule` interface

Signed-off-by: gotjosh <josue.abreu@gmail.com>

* Add `initial` to the flag help

Signed-off-by: gotjosh <josue.abreu@gmail.com>

* Change the CHANGELOG area from `ruler` to `rules`

Signed-off-by: gotjosh <josue.abreu@gmail.com>

* Rename evaluation_delay to `rule_query_offset`/`query_offset` and make it a global configuration option.

Signed-off-by: gotjosh <josue.abreu@gmail.com>

* more docs

Signed-off-by: gotjosh <josue.abreu@gmail.com>

* Improve wording on CHANGELOG

Signed-off-by: gotjosh <josue.abreu@gmail.com>

* Add `RuleQueryOffset` to the default config in tests in case it changes

Signed-off-by: gotjosh <josue.abreu@gmail.com>

* Update docs/configuration/recording_rules.md

Co-authored-by: Julius Volz <julius.volz@gmail.com>
Signed-off-by: gotjosh <josue.abreu@gmail.com>

* Rename `RuleQueryOffset` to `QueryOffset` when in the group context.

Signed-off-by: gotjosh <josue.abreu@gmail.com>

* Improve docstring and documentation on the `rule_query_offset`

Signed-off-by: gotjosh <josue.abreu@gmail.com>

---------

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
Signed-off-by: gotjosh <josue.abreu@gmail.com>
Co-authored-by: Ganesh Vernekar <ganeshvern@gmail.com>
Co-authored-by: Julius Volz <julius.volz@gmail.com>
2024-05-30 11:49:50 +01:00
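For a sense of what the new knob does, here is a minimal Go sketch of a per-group query offset shifting the timestamp a rule group queries at; `Group` and `EvalTimestamp` are illustrative stand-ins, not the actual types from the rules package.

```go
package main

import (
	"fmt"
	"time"
)

// Group is a stand-in for a rule group carrying an optional query offset
// (from `query_offset` in the group, falling back to the global `rule_query_offset`).
type Group struct {
	Name        string
	QueryOffset time.Duration
}

// EvalTimestamp returns the time the group's rules should query at:
// the scheduled evaluation time shifted back by the configured offset.
func (g *Group) EvalTimestamp(scheduled time.Time) time.Time {
	return scheduled.Add(-g.QueryOffset)
}

func main() {
	g := &Group{Name: "example", QueryOffset: 30 * time.Second}
	now := time.Now()
	fmt.Printf("evaluating %s at %s for data up to %s\n",
		g.Name, now.Format(time.RFC3339), g.EvalTimestamp(now).Format(time.RFC3339))
}
```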
Julien d1eff95faf
Merge pull request #14100 from bboreham/windows-flake
[TEST] Rules: Sleep 15ms to fit Windows behaviour better
2024-05-16 12:04:42 +02:00
Oleksandr Redko f10c3454e9 Enable perfsprint linter and fix up code
Signed-off-by: Oleksandr Redko <oleksandr.red+github@gmail.com>
2024-05-15 17:51:05 +03:00
Bryan Boreham 10eb23bd6b [TEST] Rules: Sleep 15ms to fit Windows behaviour better
On Windows, Go will sleep 15ms if you ask for less.  TestAsyncRuleEvaluation
compares actual delay to the nominal time, so using 15ms should work
better on Windows, and be hardly noticeable elsewhere.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2024-05-14 17:45:42 +01:00
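For context on the 15ms choice, a quick way to observe timer granularity is to compare requested and measured sleep durations; this snippet is purely illustrative and not part of the test.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	for _, d := range []time.Duration{time.Millisecond, 5 * time.Millisecond, 15 * time.Millisecond} {
		start := time.Now()
		time.Sleep(d)
		// On Windows the default timer resolution is roughly 15ms, so shorter
		// sleeps tend to round up; asking for 15ms lines up with the timer and
		// behaves about the same everywhere.
		fmt.Printf("asked for %v, slept %v\n", d, time.Since(start))
	}
}
```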
Bryan Boreham 3fd24d1cd7
Merge pull request #13999 from bboreham/extract-promqltest
[Test] Extract most PromQL test code into separate packages
2024-05-09 13:23:11 +01:00
Bryan Boreham 8fd96241ab test: add promqltest package references
To packages outside of promql.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2024-05-08 16:08:04 +01:00
gotjosh c10186eeea
BUGFIX: Mark the rule's restoration process as completed always (#14048)
* BUGFIX: Mark the rule's restoration process as completed always

In https://github.com/prometheus/prometheus/pull/13980 I introduced a change to reduce the number of queries executed when we restore alert statuses.

With this, the querying semantics changed: we now need to go through all series before we enter the alert restoration loop, and I missed the fact that exiting early when there are no rules to restore would leave the restoration incomplete.

An alert being restored is used as a proxy for "we're now ready to write `ALERTS`/`ALERTS_FOR_STATE` metrics", so as a result we weren't writing those series if we didn't restore anything the first time around.
---------

Signed-off-by: gotjosh <josue.abreu@gmail.com>
2024-05-03 14:23:46 +01:00
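A minimal sketch of the shape of the fix, using made-up types and a hypothetical `restoreForState` helper: mark restoration as completed unconditionally, even on the early-exit path where there is nothing to restore.

```go
package main

import "fmt"

// AlertingRule is a stand-in for the rule type; only the restoration flag matters here.
type AlertingRule struct {
	name     string
	active   map[uint64]struct{}
	restored bool
}

// restoreForState mirrors the shape of the fix: whichever path we take,
// the rule must end up marked as restored so the ALERTS/ALERTS_FOR_STATE
// series are written on subsequent evaluations.
func restoreForState(r *AlertingRule) {
	defer func() { r.restored = true }() // always mark completion

	if len(r.active) == 0 {
		// Early exit: nothing to restore, but the deferred call above
		// still records that restoration has run.
		return
	}
	// ... query ALERTS_FOR_STATE once and restore each active alert ...
}

func main() {
	r := &AlertingRule{name: "HighErrorRate"}
	restoreForState(r)
	fmt.Println("restored:", r.restored) // true even with nothing to restore
}
```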
gotjosh 1dd0bff4f1
Merge pull request #13980 from prometheus/gotjosh/restore-only-with-rule-query
Rule Manager: Only query once per alert rule when restoring alert state
2024-04-30 15:29:21 +01:00
gotjosh 379dec9d36
querier.Select cannot return a nil series set.
Signed-off-by: gotjosh <josue.abreu@gmail.com>
2024-04-30 13:09:30 +01:00
gotjosh 05ca082b07
Rename alerts to expectedAlerts in the test case input
Signed-off-by: gotjosh <josue.abreu@gmail.com>
2024-04-30 12:43:09 +01:00
gotjosh f63dbc3db2
Remove duplicated sort and assignment of expected alerts.
Signed-off-by: gotjosh <josue.abreu@gmail.com>
2024-04-30 12:39:07 +01:00
gotjosh 63b09944b8
Use labels.Len() instead of manually counting the labels
Signed-off-by: gotjosh <josue.abreu@gmail.com>
2024-04-30 12:25:48 +01:00
gotjosh ccfafae36d
Rename QueryforStateSeries to QueryForStateSeries
Signed-off-by: gotjosh <josue.abreu@gmail.com>
2024-04-30 12:19:18 +01:00
gotjosh 151f6e0ed6
Add an assertion on the count of alerts before adding an active alert
Signed-off-by: gotjosh <josue.abreu@gmail.com>
2024-04-30 12:17:56 +01:00
George Robinson dde2e5eb73
Improve comments around resending resolved alerts (#13990)
Signed-off-by: George Robinson <george.robinson@grafana.com>
2024-04-25 14:18:50 +02:00
gotjosh cc2207148e
fix typo
Signed-off-by: gotjosh <josue.abreu@gmail.com>
2024-04-24 19:20:57 +01:00
gotjosh 2de2fee035
Allocate the result map for the series set beforehand with a size hint.
Signed-off-by: gotjosh <josue.abreu@gmail.com>
2024-04-24 19:10:34 +01:00
gotjosh 6cfc584308
- Add a changelog entry
- Improve variable name of the map produced by the series set

Signed-off-by: gotjosh <josue.abreu@gmail.com>
2024-04-24 19:02:47 +01:00
gotjosh fa75985c1c
Use the string representation of the labels instead of the hash
Signed-off-by: gotjosh <josue.abreu@gmail.com>
2024-04-24 18:46:05 +01:00
gotjosh 276201598c
Fix tests and a bug with the series lookup logic.
Signed-off-by: gotjosh <josue.abreu@gmail.com>
2024-04-24 18:46:05 +01:00
gotjosh e6dcbd2e26
bug: nil check against the series set not errors
Signed-off-by: gotjosh <josue.abreu@gmail.com>
2024-04-24 18:46:05 +01:00
gotjosh 4daaa59c08
Rule Manager: Only query once per alert rule when restoring alert state
Prometheus restores alert state between restarts and updates. For each rule, it looks at the alerts that are meant to be active and then queries the `ALERTS_FOR_STATE` series for _each_ alert within the rule.

If the alert rule has 120 instances (or series), it'll execute the same query 120 times with slightly different labels.

This PR changes the approach so that we only query once per alert rule and then match the corresponding alert that we're about to restore against the series set. While the approach might use a bit more memory at start-up (if even?), the restore process only runs once per restart, so I'd consider this a big win.

This builds on top of #13974

Signed-off-by: gotjosh <josue.abreu@gmail.com>
2024-04-24 18:46:05 +01:00
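A rough sketch of the approach described above, with simplified stand-in types rather than the actual `rules` package code: query `ALERTS_FOR_STATE` once per rule, index the returned series by their label string (as the commits above switch to), and match each active alert against that map.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Labels is a simplified stand-in for labels.Labels.
type Labels map[string]string

// String renders labels in a stable form, usable as a map key
// (the string representation is used instead of a hash).
func (l Labels) String() string {
	keys := make([]string, 0, len(l))
	for k := range l {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, k+"="+l[k])
	}
	return "{" + strings.Join(parts, ",") + "}"
}

// Series is a stand-in for one ALERTS_FOR_STATE series.
type Series struct {
	Labels   Labels
	ActiveAt int64
}

func main() {
	// One query for the whole rule returns every ALERTS_FOR_STATE series.
	seriesSet := []Series{
		{Labels: Labels{"alertname": "HighErrorRate", "instance": "a"}, ActiveAt: 100},
		{Labels: Labels{"alertname": "HighErrorRate", "instance": "b"}, ActiveAt: 200},
	}

	// Index the result once, keyed by the series' label string.
	byLabels := make(map[string]Series, len(seriesSet))
	for _, s := range seriesSet {
		byLabels[s.Labels.String()] = s
	}

	// Each active alert is matched against the map instead of issuing its own query.
	alertLabels := Labels{"alertname": "HighErrorRate", "instance": "a"}
	if s, ok := byLabels[alertLabels.String()]; ok {
		fmt.Println("restoring alert active since", s.ActiveAt)
	}
}
```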
gotjosh 5beb2fe005
Improve the metric description
Signed-off-by: gotjosh <josue.abreu@gmail.com>
2024-04-24 15:24:35 +01:00
gotjosh 381a77ac1e
Rename the variable `now` to `restoreStartTime` and introduce a log line to record the total time
Signed-off-by: gotjosh <josue.abreu@gmail.com>
2024-04-24 14:21:11 +01:00
gotjosh e7219e3d36
Rule Manager: Add rule_group_last_restore_duration_seconds to measure restore time per rule group
When a rule group changes or Prometheus is restarted, we need to ensure we restore the active alerts that were firing for the corresponding rules. For that, Prometheus uses the `ALERTS_FOR_STATE` series to query the previous state and restore it. If a given rule has high cardinality (think 100s of 1000s of series), this process can take a bit of time. This is the first of a series of PRs to improve this problem, and I'd like to start by exposing the time it takes to restore a rule group as a gauge.

Signed-off-by: gotjosh <josue.abreu@gmail.com>
2024-04-23 09:57:08 +01:00
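A minimal sketch of the gauge using the standard client_golang API; the metric and label names follow the commit title, but the surrounding wiring is illustrative rather than the actual rules manager code.

```go
package main

import (
	"fmt"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var ruleGroupLastRestoreDuration = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "prometheus_rule_group_last_restore_duration_seconds",
		Help: "Time it took to restore alert state for the rule group using the ALERTS_FOR_STATE series.",
	},
	[]string{"rule_group"},
)

// restoreGroup times an arbitrary restore function and records the duration.
func restoreGroup(group string, restore func()) {
	restoreStartTime := time.Now()
	restore()
	duration := time.Since(restoreStartTime).Seconds()
	ruleGroupLastRestoreDuration.WithLabelValues(group).Set(duration)
	fmt.Printf("restored %s in %.3fs\n", group, duration)
}

func main() {
	prometheus.MustRegister(ruleGroupLastRestoreDuration)
	restoreGroup("example.rules", func() { time.Sleep(10 * time.Millisecond) })
}
```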
Björn Rabenstein 4ec5c25393
Merge pull request #13731 from suntala/suntala/native-histogram-template
histograms: support expansion of native histogram values in templating
2024-04-11 13:24:26 +02:00
Matthieu MOREL 6f595c6762
golangci-lint: enable whitespace linter (#13905)
Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>
2024-04-11 09:27:54 +01:00
suntala 44f385fd51 Support expansion of native histogram values in alert templates
Co-authored-by: Aleks Fazlieva <britishrum@users.noreply.github.com>
Signed-off-by: suntala <arati.rana@grafana.com>
2024-03-26 22:30:01 +01:00
Łukasz Mierzwa 3bb27c33e9 Use consistent keys for logs
Rule warnings are logged with numDropped=N while every other component uses num_dropped=N:

```
notifier/notifier.go:		level.Warn(n.logger).Log("msg", "Alert batch larger than queue capacity, dropping alerts", "num_dropped", d)
notifier/notifier.go:		level.Warn(n.logger).Log("msg", "Alert notification queue full, dropping alerts", "num_dropped", d)
storage/remote/write_handler.go:		_ = level.Warn(h.logger).Log("msg", "Error on ingesting out-of-order exemplars", "num_dropped", outOfOrderExemplarErrs)
rules/group.go:				level.Warn(logger).Log("msg", "Error on ingesting out-of-order result from rule evaluation", "num_dropped", numOutOfOrder)
rules/group.go:				level.Warn(logger).Log("msg", "Error on ingesting too old result from rule evaluation", "num_dropped", numTooOld)
rules/group.go:				level.Warn(logger).Log("msg", "Error on ingesting results from rule evaluation with different value but same timestamp", "num_dropped", numDuplicates)
scrape/scrape.go:		level.Warn(sl.l).Log("msg", "Error on ingesting out-of-order samples", "num_dropped", appErrs.numOutOfOrder)
scrape/scrape.go:		level.Warn(sl.l).Log("msg", "Error on ingesting samples with different value but same timestamp", "num_dropped", appErrs.numDuplicates)
scrape/scrape.go:		level.Warn(sl.l).Log("msg", "Error on ingesting samples that are too old or are too far into the future", "num_dropped", appErrs.numOutOfBounds)
scrape/scrape.go:		level.Warn(sl.l).Log("msg", "Error on ingesting out-of-order exemplars", "num_dropped", appErrs.numExemplarOutOfOrder)
```

Rename numDropped to num_dropped for consistency.

Signed-off-by: Łukasz Mierzwa <l.mierzwa@gmail.com>
2024-03-21 15:59:20 +00:00
Charles Korn 4e77e8e5ef
Allow using alternative PromQL engines for rule evaluation
Signed-off-by: Charles Korn <charles.korn@grafana.com>
2024-03-06 14:54:33 +11:00
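Conceptually, the rules manager evaluates expressions through a query function, so an alternative engine only has to supply that function. The sketch below shows the shape of that seam with hypothetical types; it is not the actual `rules` package API.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Sample and QueryFunc are simplified stand-ins for the evaluation seam:
// rules only need "evaluate this expression at this time".
type Sample struct {
	Metric string
	Value  float64
}

type QueryFunc func(ctx context.Context, q string, t time.Time) ([]Sample, error)

// ManagerOptions is a stand-in showing where an alternative engine plugs in.
type ManagerOptions struct {
	QueryFunc QueryFunc
}

// fakeEngineQueryFunc adapts some non-default PromQL engine to the seam.
func fakeEngineQueryFunc() QueryFunc {
	return func(ctx context.Context, q string, t time.Time) ([]Sample, error) {
		// A real implementation would run q through the alternative engine here.
		return []Sample{{Metric: "up", Value: 1}}, nil
	}
}

func main() {
	opts := ManagerOptions{QueryFunc: fakeEngineQueryFunc()}
	res, err := opts.QueryFunc(context.Background(), "up", time.Now())
	fmt.Println(res, err)
}
```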
machine424 f477e0539a
Move from golang.org/x/exp/slices into slices now that we only support Go >= 1.21
Prevent adding back golang.org/x/exp/slices.

Signed-off-by: machine424 <ayoubmrini424@gmail.com>
2024-02-28 14:54:53 +01:00
Bryan Boreham 3716326f3f rules: call NewScratchBuilder
Need to initialize ScratchBuilder with a SymbolTable.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2024-02-26 11:45:25 +00:00
Bryan Boreham c0e36e6bb3 Standardise exemplar label as "trace_id"
This is consistent with the OpenTelemetry standard, and an example in OpenMetrics.

https://github.com/open-telemetry/opentelemetry-specification/blob/89aa01348139/specification/metrics/data-model.md#exemplars
https://github.com/OpenObservability/OpenMetrics/blob/138654493130/specification/OpenMetrics.md#exemplars-1

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2024-02-15 14:20:08 +00:00
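For reference, an exemplar carrying the standardised label might be constructed like this; the `exemplar.Exemplar` field set shown is my understanding of the model package and should be treated as an assumption.

```go
package main

import (
	"fmt"

	"github.com/prometheus/prometheus/model/exemplar"
	"github.com/prometheus/prometheus/model/labels"
)

func main() {
	// An exemplar attached to a sample, using the standardised "trace_id"
	// label rather than ad-hoc names like "traceID".
	e := exemplar.Exemplar{
		Labels: labels.FromStrings("trace_id", "3c191d03fa26bd6c"),
		Value:  0.67,
		Ts:     1700000000000,
		HasTs:  true,
	}
	fmt.Println(e.Labels.Get("trace_id"), e.Value)
}
```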
Bryan Boreham 17f48f2b3b Tests: use replacement DeepEquals in more places
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2024-02-08 19:32:33 +00:00
Bryan Boreham 39af788dbd Tests: use replacement DeepEquals using go-cmp
Use DeepEqual replacement using go-cmp, which is more flexible.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2024-02-08 19:30:20 +00:00
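As a generic illustration of a go-cmp-based DeepEqual replacement in tests (not the actual helper added here), with an approximate float comparison as one example of the extra flexibility:

```go
package main

import (
	"fmt"

	"github.com/google/go-cmp/cmp"
	"github.com/google/go-cmp/cmp/cmpopts"
)

type alert struct {
	Name   string
	Labels map[string]string
	Value  float64
}

func main() {
	got := alert{Name: "HighErrorRate", Labels: map[string]string{"severity": "page"}, Value: 0.21}
	want := alert{Name: "HighErrorRate", Labels: map[string]string{"severity": "page"}, Value: 0.2}

	// cmp is more flexible than reflect.DeepEqual: options can ignore fields,
	// compare floats approximately, treat nil and empty maps as equal, etc.
	if diff := cmp.Diff(want, got, cmpopts.EquateApprox(0, 0.001)); diff != "" {
		fmt.Printf("mismatch (-want +got):\n%s", diff)
	}
}
```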
Marco Pracucci 5ee3fbe825
Decouple ruler dependency controller from concurrency controller
Signed-off-by: Marco Pracucci <marco@pracucci.com>
2024-02-02 10:06:37 +01:00
Marco Pracucci cbbbd6e70a
Remove superfluous nil check in Group.metrics
Signed-off-by: Marco Pracucci <marco@pracucci.com>
2024-01-29 10:21:57 +01:00
Marco Pracucci 046cd7599f
Introduced sequentialRuleEvalController
Signed-off-by: Marco Pracucci <marco@pracucci.com>
2024-01-29 10:19:18 +01:00
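A sketch of what a sequential controller could look like against a hypothetical `RuleConcurrencyController` interface; the method names and signatures here are assumptions, not the real interface from the rules package.

```go
package main

import (
	"context"
	"fmt"
)

// RuleConcurrencyController is a hypothetical version of the interface:
// it decides whether a rule may be evaluated concurrently with others in
// its group, and is told when a concurrent evaluation slot is freed.
type RuleConcurrencyController interface {
	Allow(ctx context.Context) bool
	Done(ctx context.Context)
}

// sequentialRuleEvalController never allows concurrency, which preserves
// the classic in-order, one-rule-at-a-time evaluation behaviour.
type sequentialRuleEvalController struct{}

func (sequentialRuleEvalController) Allow(_ context.Context) bool { return false }
func (sequentialRuleEvalController) Done(_ context.Context)       {}

func main() {
	var c RuleConcurrencyController = sequentialRuleEvalController{}
	fmt.Println("concurrent evaluation allowed:", c.Allow(context.Background()))
}
```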
Marco Pracucci 23f89c18b2
Improved RuleConcurrencyController interface doc
Signed-off-by: Marco Pracucci <marco@pracucci.com>
2024-01-29 10:18:29 +01:00
Marco Pracucci 2764c46531
Added more test cases to TestDependenciesEdgeCases
Signed-off-by: Marco Pracucci <marco@pracucci.com>
2024-01-29 10:18:03 +01:00
Marco Pracucci 52bc568d04
Add more test cases to TestDependenciesEdgeCases
Signed-off-by: Marco Pracucci <marco@pracucci.com>
2024-01-29 10:17:13 +01:00
Marco Pracucci 21a03dc018
Simplify the design to update the concurrency controller once the rule evaluation is done
Signed-off-by: Marco Pracucci <marco@pracucci.com>
2024-01-29 10:16:31 +01:00
Danny Kopping 7aa3b10c3f
Block until all rules, both sync & async, have completed evaluating
* Updated & added tests
* Review feedback nits
* Return empty map if not indeterminate
* Use highWatermark to track inflight requests counter
* Appease the linter
* Clarify feature flag

Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
2024-01-29 10:08:41 +01:00
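One way to picture "block until all rules, both sync and async, have completed" together with the high-watermark counter: a wait group spans the whole group iteration while an atomic counter records the peak number of in-flight evaluations. All names below are made up for illustration.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

func main() {
	var (
		wg            sync.WaitGroup
		inflight      atomic.Int64
		highWatermark atomic.Int64
	)

	evalRule := func(name string, async bool) {
		run := func() {
			defer wg.Done()
			n := inflight.Add(1)
			// Track the maximum number of rules evaluating at once.
			for {
				hw := highWatermark.Load()
				if n <= hw || highWatermark.CompareAndSwap(hw, n) {
					break
				}
			}
			time.Sleep(5 * time.Millisecond) // pretend to evaluate the rule
			inflight.Add(-1)
		}
		wg.Add(1)
		if async {
			go run()
			return
		}
		run()
	}

	evalRule("sync_rule", false)
	evalRule("async_rule_1", true)
	evalRule("async_rule_2", true)

	// The group does not finish its iteration until every rule, including
	// the asynchronously evaluated ones, has completed.
	wg.Wait()
	fmt.Println("high watermark of concurrent evaluations:", highWatermark.Load())
}
```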
Danny Kopping f922534c4d
Refactoring for performance, and to allow controller to be overridden
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
2024-01-29 10:08:41 +01:00
Danny Kopping 94cdfa30cd
Refactoring
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
2024-01-29 10:08:41 +01:00
Danny Kopping 0dc7036db3
Optimising dependencies/dependents funcs to not produce new slices each request
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
2024-01-29 10:08:41 +01:00
Danny Kopping e7758d187e
Refactor concurrency control
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
2024-01-29 10:08:39 +01:00
Danny Kopping 940f83a540
Implementation
NOTE:
Rebased from main after refactor in #13014

Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
2024-01-29 10:07:15 +01:00
Filip Petkovski 583f3e587c
Optimize histogram iterators (#13340)
Histogram iterators allocate new objects in the AtHistogram and
AtFloatHistogram methods, which makes calculating rates over long
ranges expensive.

In #13215 we allowed an existing object to be reused
when converting an integer histogram to a float histogram. This commit follows
the same idea and allows injecting an existing object in the AtHistogram and
AtFloatHistogram methods. When the injected value is nil, iterators allocate
new histograms, otherwise they populate and return the injected object.

The commit also adds a CopyTo method to Histogram and FloatHistogram which
is used in the BufferedIterator to overwrite items in the ring instead of making
new copies.

Note that a specialized HPoint pool is needed for all of this to work 
(`matrixSelectorHPool`).

---------

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
Co-authored-by: George Krajcsovits <krajorama@users.noreply.github.com>
2024-01-23 17:02:14 +01:00
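The reuse pattern described above, sketched with simplified stand-in types (the real accessors are the chunk iterator's AtHistogram/AtFloatHistogram; everything below is illustrative): pass in an existing object and get it repopulated instead of a fresh allocation.

```go
package main

import "fmt"

// Histogram is a simplified stand-in for histogram.Histogram.
type Histogram struct {
	Count   uint64
	Sum     float64
	Buckets []float64
}

// iterator mimics the reuse-friendly accessor shape: the caller may pass in
// an existing object; nil means "allocate a fresh one".
type iterator struct {
	data []Histogram
	idx  int
}

func (it *iterator) Next() bool { it.idx++; return it.idx <= len(it.data) }

func (it *iterator) AtHistogram(h *Histogram) (int64, *Histogram) {
	cur := it.data[it.idx-1]
	if h == nil {
		h = &Histogram{}
	}
	// Populate the injected object instead of allocating a new one,
	// reusing its bucket slice when the capacity allows.
	h.Count, h.Sum = cur.Count, cur.Sum
	h.Buckets = append(h.Buckets[:0], cur.Buckets...)
	return int64(it.idx), h
}

func main() {
	it := &iterator{data: []Histogram{
		{Count: 3, Sum: 1.5, Buckets: []float64{1, 1, 1}},
		{Count: 5, Sum: 2.5, Buckets: []float64{2, 2, 1}},
	}}

	var h *Histogram // reused across the whole iteration
	for it.Next() {
		var t int64
		t, h = it.AtHistogram(h)
		fmt.Println(t, h.Count, h.Sum)
	}
}
```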
Filip Petkovski 10a82f87fd
Enable reusing memory when converting between histogram types
The 'ToFloat' method on integer histograms currently allocates new memory
each time it is called.

This commit adds an optional *FloatHistogram parameter that can be used
to reuse span and bucket slices. It is up to the caller to make sure the
input float histogram is not used anymore after the call.

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
2023-12-08 10:22:59 +01:00
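The same idea at the call site: pass nil to allocate, or pass a previously returned float histogram back in to reuse its slices. The sketch assumes the post-change `ToFloat(*FloatHistogram) *FloatHistogram` signature, so treat the exact API as an assumption.

```go
package main

import (
	"fmt"

	"github.com/prometheus/prometheus/model/histogram"
)

func main() {
	hs := []*histogram.Histogram{
		{Schema: 0, Count: 4, Sum: 10, ZeroThreshold: 0.001, ZeroCount: 4},
		{Schema: 0, Count: 6, Sum: 15, ZeroThreshold: 0.001, ZeroCount: 6},
	}

	var fh *histogram.FloatHistogram
	for _, h := range hs {
		// Passing the previous result back in lets ToFloat reuse its span
		// and bucket slices; passing nil would allocate a fresh histogram.
		fh = h.ToFloat(fh)
		fmt.Println(fh.Count, fh.Sum)
	}
}
```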