Commit graph

275 commits

Author SHA1 Message Date
György Krajcsovits d42e296516 Merge remote-tracking branch 'upstream/main' into krajo/merge-upstream 2023-11-02 20:45:05 +01:00
Oleksandr Redko fa90ca46e5 ci(lint): enable godot; append dot at the end of comments
Signed-off-by: Oleksandr Redko <Oleksandr_Redko@epam.com>
2023-10-31 19:53:38 +02:00
Jeanette Tan 6341ba7374 Merge remote-tracking branch 'upstream/main' into sync-upstream-20231026 2023-10-26 22:18:24 +08:00
Danny Kopping 498b836654
Refactoring manager.go into separate concerns
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
2023-10-21 11:11:11 +02:00
Charles Korn 8801c44cff
Add trace ID to log lines emitted during rule evaluation
Signed-off-by: Charles Korn <charles.korn@grafana.com>
2023-10-12 14:33:37 +11:00
Charles Korn 799ce0b9a3
Use common logger instance to reduce duplication in Group.Eval()
Signed-off-by: Charles Korn <charles.korn@grafana.com>
2023-10-12 14:29:51 +11:00
Arve Knudsen 35ab75918a Merge remote-tracking branch 'prometheus/main' into arve/upgrade-exp
Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
2023-10-06 16:11:40 +02:00
Goutham Veeramachaneni 86729d4d7b
Update exp package (#12650) 2023-09-21 22:53:51 +02:00
Arve Knudsen e48d4e5835 Merge remote-tracking branch 'prometheus/main' into chore/sync-prometheus
Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
2023-09-18 09:29:42 +02:00
Arve Knudsen 6daee89e5f
Add context argument to Querier.Select (#12660)
Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
2023-09-12 12:37:38 +02:00
Jeanette Tan 8035c04624 Merge remote-tracking branch 'upstream/main'
Minor conflicts:
rules/manager.go
tsdb/compact.go
tsdb/db.go
go.mod
2023-07-19 21:40:27 +08:00
Julien Pivotto 782e6f64fb
Merge pull request #11295 from dimitarvdimitrov/dimitar/simplify-evalTimestamp
Simplify rule group's EvalTimestamp formula
2023-07-18 13:21:20 +02:00
Bryan Boreham 5255bf06ad Replace sort.Slice with faster slices.SortFunc
The generic version is more efficient.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2023-07-02 22:17:08 +00:00
Jeanette Tan 0fccba0db9 Merge remote-tracking branch 'upstream/main' 2023-04-26 21:25:21 +08:00
beorn7 c3c7d44d84 lint: Adjust to the lint warnings raised by current versions of golint-ci
We haven't updated golint-ci in our CI yet, but this commit prepares
for that.

There are a lot of new warnings, and it is mostly because the "revive"
linter got updated. I agree with most of the new warnings, mostly
around not naming unused function parameters (although it is justified
in some cases for documentation purposes – while things like mocks are
a good example where not naming the parameter is clearer).

I'm pretty upset about the "empty block" warning to include `for`
loops. It's such a common pattern to do something in the head of the
`for` loop and then have an empty block. There is still an open issue
about this: https://github.com/mgechev/revive/issues/810 I have
disabled "revive" altogether in files where empty blocks are used
excessively, and I have made the effort to add individual
`// nolint:revive` where empty blocks are used just once or twice.
It's borderline noisy, though, but let's go with it for now.

I should mention that none of the "empty block" warnings for `for`
loop bodies were legitimate.

Signed-off-by: beorn7 <beorn@grafana.com>
2023-04-19 17:10:10 +02:00
Đurica Yuri Nikolić d162bb51b6
Merge pull request #485 from grafana/yuri/bring-prometheus-upstream
Synch with Prometheus up to 2023-04-18 (b028112)
2023-04-18 15:07:26 +02:00
Ben Ye fd3630b9a3 add ctx to QueryEngine interface
Signed-off-by: Ben Ye <benye@amazon.com>
2023-04-17 21:32:38 -07:00
Jeanette Tan 1570114ae1 Merge remote-tracking branch 'upstream/main' 2023-04-14 17:34:40 +08:00
beorn7 c0879d64cf promql: Separate Point into FPoint and HPoint
In other words: Instead of having a “polymorphous” `Point` that can
either contain a float value or a histogram value, use an `FPoint` for
floats and an `HPoint` for histograms.

This seemingly small change has a _lot_ of repercussions throughout
the codebase.

The idea here is to avoid the increase in size of `Point` arrays that
happened after native histograms had been added.

The higher-level data structures (`Sample`, `Series`, etc.) are still
“polymorphous”. The same idea could be applied to them, but at each
step the trade-offs needed to be evaluated.

The idea with this change is to do the minimum necessary to get back
to pre-histogram performance for functions that do not touch
histograms. Here are comparisons for the `changes` function. The test
data doesn't include histograms yet. Ideally, there would be no change
in the benchmark result at all.

First runtime v2.39 compared to directly prior to this commit:

```
name                                                  old time/op    new time/op    delta
RangeQuery/expr=changes(a_one[1d]),steps=1-16            391µs ± 2%     542µs ± 1%  +38.58%  (p=0.000 n=9+8)
RangeQuery/expr=changes(a_one[1d]),steps=10-16           452µs ± 2%     617µs ± 2%  +36.48%  (p=0.000 n=10+10)
RangeQuery/expr=changes(a_one[1d]),steps=100-16         1.12ms ± 1%    1.36ms ± 2%  +21.58%  (p=0.000 n=8+10)
RangeQuery/expr=changes(a_one[1d]),steps=1000-16        7.83ms ± 1%    8.94ms ± 1%  +14.21%  (p=0.000 n=10+10)
RangeQuery/expr=changes(a_ten[1d]),steps=1-16           2.98ms ± 0%    3.30ms ± 1%  +10.67%  (p=0.000 n=9+10)
RangeQuery/expr=changes(a_ten[1d]),steps=10-16          3.66ms ± 1%    4.10ms ± 1%  +11.82%  (p=0.000 n=10+10)
RangeQuery/expr=changes(a_ten[1d]),steps=100-16         10.5ms ± 0%    11.8ms ± 1%  +12.50%  (p=0.000 n=8+10)
RangeQuery/expr=changes(a_ten[1d]),steps=1000-16        77.6ms ± 1%    87.4ms ± 1%  +12.63%  (p=0.000 n=9+9)
RangeQuery/expr=changes(a_hundred[1d]),steps=1-16       30.4ms ± 2%    32.8ms ± 1%   +8.01%  (p=0.000 n=10+10)
RangeQuery/expr=changes(a_hundred[1d]),steps=10-16      37.1ms ± 2%    40.6ms ± 2%   +9.64%  (p=0.000 n=10+10)
RangeQuery/expr=changes(a_hundred[1d]),steps=100-16      105ms ± 1%     117ms ± 1%  +11.69%  (p=0.000 n=10+10)
RangeQuery/expr=changes(a_hundred[1d]),steps=1000-16     783ms ± 3%     876ms ± 1%  +11.83%  (p=0.000 n=9+10)
```

And then runtime v2.39 compared to after this commit:

```
name                                                  old time/op    new time/op    delta
RangeQuery/expr=changes(a_one[1d]),steps=1-16            391µs ± 2%     547µs ± 1%  +39.84%  (p=0.000 n=9+8)
RangeQuery/expr=changes(a_one[1d]),steps=10-16           452µs ± 2%     616µs ± 2%  +36.15%  (p=0.000 n=10+10)
RangeQuery/expr=changes(a_one[1d]),steps=100-16         1.12ms ± 1%    1.26ms ± 1%  +12.20%  (p=0.000 n=8+10)
RangeQuery/expr=changes(a_one[1d]),steps=1000-16        7.83ms ± 1%    7.95ms ± 1%   +1.59%  (p=0.000 n=10+8)
RangeQuery/expr=changes(a_ten[1d]),steps=1-16           2.98ms ± 0%    3.38ms ± 2%  +13.49%  (p=0.000 n=9+10)
RangeQuery/expr=changes(a_ten[1d]),steps=10-16          3.66ms ± 1%    4.02ms ± 1%   +9.80%  (p=0.000 n=10+9)
RangeQuery/expr=changes(a_ten[1d]),steps=100-16         10.5ms ± 0%    10.8ms ± 1%   +3.08%  (p=0.000 n=8+10)
RangeQuery/expr=changes(a_ten[1d]),steps=1000-16        77.6ms ± 1%    78.1ms ± 1%   +0.58%  (p=0.035 n=9+10)
RangeQuery/expr=changes(a_hundred[1d]),steps=1-16       30.4ms ± 2%    33.5ms ± 4%  +10.18%  (p=0.000 n=10+10)
RangeQuery/expr=changes(a_hundred[1d]),steps=10-16      37.1ms ± 2%    40.0ms ± 1%   +7.98%  (p=0.000 n=10+10)
RangeQuery/expr=changes(a_hundred[1d]),steps=100-16      105ms ± 1%     107ms ± 1%   +1.92%  (p=0.000 n=10+10)
RangeQuery/expr=changes(a_hundred[1d]),steps=1000-16     783ms ± 3%     775ms ± 1%   -1.02%  (p=0.019 n=9+9)
```

In summary, the runtime doesn't really improve with this change for
queries with just a few steps. For queries with many steps, this
commit essentially reinstates the old performance. This is good
because the many-step queries are the one that matter most (longest
absolute runtime).

In terms of allocations, though, this commit doesn't make a dent at
all (numbers not shown). The reason is that most of the allocations
happen in the sampleRingIterator (in the storage package), which has
to be addressed in a separate commit.

Signed-off-by: beorn7 <beorn@grafana.com>
2023-04-13 19:25:16 +02:00
Soon-Ping 6cecb87941
Generalized rule group iteration evaluation hook (#11885)
Signed-off-by: Soon-Ping Phang <soonping@amazon.com>
2023-04-04 20:21:13 +02:00
Ganesh Vernekar 41649ceb1b
Merge remote-tracking branch 'upstream/main' into codesome/sync-prom
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
2023-03-22 08:35:08 +05:30
Trevor Whitney c3e0a83725
rules: no longer force CounterResetHint to Gauge
Signed-off-by: Trevor Whitney <trevorjwhitney@gmail.com>
2023-03-14 14:22:07 -06:00
Marco Pracucci 2461dee551
Merge remote-tracking branch 'remotes/prometheus/main' into update-upstream 2023-01-26 18:41:17 +01:00
Peter Štibraný 44904a663c
Rename "execution time" to "evaluation time". (#401)
Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
2023-01-19 15:11:44 +00:00
Peter Štibraný 806e71e828
Option to align rule group's evaluation time to interval (#400)
* Allow rule groups evaluation timestamp to be aligned on the evaluation interval.

Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
2023-01-19 14:51:26 +01:00
fayzal-g 7aef6e28fe Merge remote-tracking branch 'upstream/main' into merge-jan-16-upstream 2023-01-16 15:24:00 +00:00
Julien Pivotto ce55e5074d Add 'keep_firing_for' field to alerting rules
This commit adds a new 'keep_firing_for' field to Prometheus alerting
rules. The 'resolve_delay' field specifies the minimum amount of time
that an alert should remain firing, even if the expression does not
return any results.

This feature was discussed at a previous dev summit, and it was
determined that a feature like this would be useful in order to allow
the expression time to stabilize and prevent confusing resolved messages
from being propagated through Alertmanager.

This approach is simpler than having two PromQL queries, as was
sometimes discussed, and it should be easy to implement.

This commit does not include tests for the 'resolve_delay' field.  This
is intentional, as the purpose of this commit is to gather comments on
the proposed design of the 'resolve_delay' field before implementing
tests. Once the design of the 'resolve_delay' field has been finalized,
a follow-up commit will be submitted with tests."

See https://github.com/prometheus/prometheus/issues/11570

Signed-off-by: Julien Pivotto <roidelapluie@o11y.eu>
2023-01-13 12:11:39 +01:00
Ganesh Vernekar d82ea2eb1c
Merge pull request #11838 from codesome/histo-rec
rules: Support native histograms
2023-01-12 12:35:15 +05:30
Ganesh Vernekar 53a5071a72
rules: Support native histograms
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
2023-01-10 19:07:24 +05:30
György Krajcsovits 103c4fd289 Merge remote-tracking branch 'upstream/main' into main
# Conflicts:
#	.github/workflows/ci.yml
#	tsdb/block.go
#	tsdb/compact.go
#	tsdb/compact_test.go
#	tsdb/head_read.go
#	tsdb/index/index.go
#	tsdb/ooo_head_read.go
#	tsdb/querier_test.go
2023-01-08 14:55:44 +01:00
Ganesh Vernekar f1a332c496
rules: Consider ErrTooOldSample in expected errors
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
2023-01-05 14:49:30 +05:30
Bryan Boreham 3c7de69059 storage: allow re-use of iterators
Patterned after `Chunk.Iterator()`: pass the old iterator in so it
can be re-used to avoid allocating a new object.

(This commit does not do any re-use; it is just changing all the method
signatures so re-use is possible in later commits.)

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2022-12-15 18:32:45 +00:00
Julius Volz 1a2c645dfa Correctly handle error unwrapping in rules and remote write receiver
errors.Unwrap() actually dangerously returns nil if the error does not have an
Unwrap() method, which is the case in at least one of these places where I
noticed that no error was being logged at all when it should have.

Signed-off-by: Julius Volz <julius.volz@gmail.com>
2022-12-15 12:50:55 +01:00
Jeanette Tan 51cf003517 Merge remote-tracking branch 'upstream/main'
Signed-off-by: Jeanette Tan <jeanette.tan@grafana.com>
2022-11-23 01:39:23 +08:00
Dimitar Dimitrov 03ab8dcca0
Add comments on EvalTimestamp
Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
2022-10-12 14:16:22 +02:00
Ganesh Vernekar 648be89822
Merge remote-tracking branch 'upstream/main' into fix-conflict
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
2022-10-12 14:20:02 +05:30
Ganesh Vernekar c77b24bcb2
Merge pull request #337 from grafana/sync-prom
Sync with upstream
2022-10-11 11:31:52 +05:30
Ganesh Vernekar b522fb0b76
Merge remote-tracking branch 'upstream/main' into sync-prom 2022-10-10 18:07:39 +05:30
Ganesh Vernekar 46b26c4f09
Fix notifier relabel changing the labels of active alerts (#11427)
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
2022-10-07 20:28:17 +05:30
Dimitar Dimitrov 3fb881af26
Simplify rule group's EvalTimestamp formula
I found it hard to understand how EvalTimestamp works, so I wanted to simplify the math there. This PR should be a noop.

Current formula is:

```
offset        = g.hash % g.interval
adjNow        = startTime - offset
base          = adjNow - (adjNow % g.interval)
EvalTimestamp = base + offset
```

I simplify `EvalTimestamp`

```
EvalTimestamp = base + offset
                # expand base
              = adjNow - (adjNow % g.interval) + offset
                # expand adjNow
              = startTime - offset - ((startTime - offset) % g.interval) + offset
                # cancel out offset
              = startTime - ((startTime - offset) % g.interval)
                # expand A+B (mod M) = (A (mod M) + B (mod M)) (mod M)
              = startTime - (startTime % g.interval - offset % g.interval) % g.interval
                # expand offset
              = startTime - (startTime % g.interval - ((g.hash % g.interval) % g.interval)) % g.interval
                # remove redundant mod g.interval
              = startTime - (startTime % g.interval - g.hash % g.interval) % g.interval
                # simplify (A (mod M) + B (mod M)) (mod M) = A+B (mod M)
              = startTime - (startTime - g.hash) % g.interval

offset            = (startTime - g.hash) % g.interval
EvalTimestamp     = startTime - offset
```

Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
2022-09-13 10:52:32 +02:00
Dimitar Dimitrov 0fc0832427
Add tests 2022-09-09 22:35:10 +02:00
Dimitar Dimitrov 6b33c90efe
Add option to always restore the state of rules when loading 2022-09-09 22:01:11 +02:00
Peter Štibraný ae49ab5ea8 Merge remote-tracking branch 'upstream/main' into update-upstream-prometheus 2022-07-13 10:18:09 +02:00
beorn7 28f028e938 Merge branch 'main' into sparsehistogram 2022-07-12 19:07:13 +02:00
Matthieu MOREL ddfa9a7cc5
refactor (rules): move from github.com/pkg/errors to 'errors' and 'fmt' (#10855)
* refactor (rules): move from github.com/pkg/errors to 'errors' and 'fmt'

Signed-off-by: Matthieu MOREL <mmorel-35@users.noreply.github.com>
2022-06-17 09:54:25 +02:00
beorn7 40ad5e284a Merge branch 'main' into beorn7/sparsehistogram 2022-06-09 20:50:30 +02:00
Peter Štibraný 9d51bf50db Merge upstream Prometheus 2022-06-09 11:29:19 +02:00
Julien Pivotto 3a56817a30
Rules: set otel status to ERROR when a rule fails (#10745)
Signed-off-by: Julien Pivotto <roidelapluie@o11y.eu>
2022-05-25 10:06:17 +02:00
Julien Pivotto 0d94cdf107
rules: remove classic UI code (#10730)
Signed-off-by: Julien Pivotto <roidelapluie@o11y.eu>
2022-05-23 16:21:50 +02:00
Łukasz Mierzwa d3c9c4f574
Stop rule manager before TSDB is stopped (#10680)
During shutdown TSDB is stopped before rule manager is stopped. Since  TSDB shutdown can take a long time (minutes or 10s of minutes) it keeps rule manager running while parts of Prometheus are already stopped (most notebly scrape manager). This can cause false positive alerts to fire, mostly those that rely on absent() calls since new sample appends will stop while alert queries are still evaluated.
Stop rules before stopping TSDB and scrape manager to avoid this problem.

Signed-off-by: Łukasz Mierzwa <l.mierzwa@gmail.com>
2022-05-20 23:26:06 +02:00