* Reduce the flakiness of TestAsyncRuleEvaluation
This tests sleeps for 15 millisecond per rule group, and then comprares
the entire execution time to be smaller than a multiple of that delay.
The ruleCount is 6, so it assumes that the test will come to the
assertions in less than 90ms.
Meanwhile, the Github's Windows runner:
- ...Huh, oh? What? How much time? milliwhat? Sorry I don't speak that.
TL;DR, this increases the delay to 250 millisecond. This won't prevent
the test from being flaky, but will reduce the flakiness by several
orders of magnitude and hopefully won't be an issue anymore.
Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
* Make tests parallel
Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
---------
Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
* [PATCH] Allow having evaluation delay for rule groups
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
* [PATCH] Fix lint
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
* [PATCH] Move the option to ManagerOptions
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
* [PATCH] Include evaluation_delay in the group config
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
* Fix comments
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Add a server configuration option.
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Appease the linter #1
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Add the new server flag documentation
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Improve documentation of the new flag and configuration
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Use named parameters for clarity on the `Rule` interface
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Add `initial` to the flag help
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Change the CHANGELOG area from `ruler` to `rules`
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Rename evaluation_delay to `rule_query_offset`/`query_offset` and make it a global configuration option.
Signed-off-by: gotjosh <josue.abreu@gmail.com>
E Your branch is up to date with 'origin/gotjosh/evaluation-delay'.
* more docs
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Improve wording on CHANGELOG
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Add `RuleQueryOffset` to the default config in tests in case it changes
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Update docs/configuration/recording_rules.md
Co-authored-by: Julius Volz <julius.volz@gmail.com>
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Rename `RuleQueryOffset` to `QueryOffset` when in the group context.
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Improve docstring and documentation on the `rule_query_offset`
Signed-off-by: gotjosh <josue.abreu@gmail.com>
---------
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
Signed-off-by: gotjosh <josue.abreu@gmail.com>
Co-authored-by: Ganesh Vernekar <ganeshvern@gmail.com>
Co-authored-by: Julius Volz <julius.volz@gmail.com>
On Windows, Go will sleep 15ms if you ask for less. TestAsyncRuleEvaluation
compares actual delay to the nominal time, so using 15ms should work
better on Windows, and be hardly noticeable elsewhere.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
* BUGFIX: Mark the rule's restoration process as completed always
In https://github.com/prometheus/prometheus/pull/13980 I introduced a change to reduce the number of queries executed when we restore alert statuses.
With this, the querying semantics changed as we now need to go through all series before we enter the alert restoration loop and I missed the fact that exiting early when there are no rules to restore would lead to an incomplete restoration.
An alert being restored is used as a proxy for "we're now ready to write `ALERTS/ALERTS_FOR_SERIES` metrics" so as a result we weren't writing the series if we didn't restore anything the first time around.
---------
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* add custom buckets to native histogram model
* simple copy for custom bounds
* return errors for unsupported add/sub operations
* add test cases for string and update appendhistogram in scrape to account for new schema
* check fields which are supposed to be unused but may affect results in equals
* allow appending custom buckets histograms regardless of max schema
Signed-off-by: Jeanette Tan <jeanette.tan@grafana.com>
Updated & added tests
Review feedback nits
Return empty map if not indeterminate
Use highWatermark to track inflight requests counter
Appease the linter
Clarify feature flag
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
Optimize histogram iterators
Histogram iterators allocate new objects in the AtHistogram and
AtFloatHistogram methods, which makes calculating rates over long
ranges expensive.
In #13215 we allowed an existing object to be reused
when converting an integer histogram to a float histogram. This commit follows
the same idea and allows injecting an existing object in the AtHistogram and
AtFloatHistogram methods. When the injected value is nil, iterators allocate
new histograms, otherwise they populate and return the injected object.
The commit also adds a CopyTo method to Histogram and FloatHistogram which
is used in the BufferedIterator to overwrite items in the ring instead of making
new copies.
Note that a specialized HPoint pool is needed for all of this to work
(`matrixSelectorHPool`).
---------
Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
Co-authored-by: George Krajcsovits <krajorama@users.noreply.github.com>
The 'ToFloat' method on integer histograms currently allocates new memory
each time it is called.
This commit adds an optional *FloatHistogram parameter that can be used
to reuse span and bucket slices. It is up to the caller to make sure the
input float histogram is not used anymore after the call.
Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
Wiser coders than myself have come to the conclusion that a `switch`
statement is almost always superior to a statement that includes any
`else if`.
The exceptions that I have found in our codebase are just these two:
* The `if else` is followed by an additional statement before the next
condition (separated by a `;`).
* The whole thing is within a `for` loop and `break` statements are
used. In this case, using `switch` would require tagging the `for`
loop, which probably tips the balance.
Why are `switch` statements more readable?
For one, fewer curly braces. But more importantly, the conditions all
have the same alignment, so the whole thing follows the natural flow
of going down a list of conditions. With `else if`, in contrast, all
conditions but the first are "hidden" behind `} else if `, harder to
spot and (for no good reason) presented differently from the first
condition.
I'm sure the aforemention wise coders can list even more reasons.
In any case, I like it so much that I have found myself recommending
it in code reviews. I would like to make it a habit in our code base,
without making it a hard requirement that we would test on the CI. But
for that, there has to be a role model, so this commit eliminates all
`if else` occurrences, unless it is autogenerated code or fits one of
the exceptions above.
Signed-off-by: beorn7 <beorn@grafana.com>
We haven't updated golint-ci in our CI yet, but this commit prepares
for that.
There are a lot of new warnings, and it is mostly because the "revive"
linter got updated. I agree with most of the new warnings, mostly
around not naming unused function parameters (although it is justified
in some cases for documentation purposes – while things like mocks are
a good example where not naming the parameter is clearer).
I'm pretty upset about the "empty block" warning to include `for`
loops. It's such a common pattern to do something in the head of the
`for` loop and then have an empty block. There is still an open issue
about this: https://github.com/mgechev/revive/issues/810 I have
disabled "revive" altogether in files where empty blocks are used
excessively, and I have made the effort to add individual
`// nolint:revive` where empty blocks are used just once or twice.
It's borderline noisy, though, but let's go with it for now.
I should mention that none of the "empty block" warnings for `for`
loop bodies were legitimate.
Signed-off-by: beorn7 <beorn@grafana.com>
In other words: Instead of having a “polymorphous” `Point` that can
either contain a float value or a histogram value, use an `FPoint` for
floats and an `HPoint` for histograms.
This seemingly small change has a _lot_ of repercussions throughout
the codebase.
The idea here is to avoid the increase in size of `Point` arrays that
happened after native histograms had been added.
The higher-level data structures (`Sample`, `Series`, etc.) are still
“polymorphous”. The same idea could be applied to them, but at each
step the trade-offs needed to be evaluated.
The idea with this change is to do the minimum necessary to get back
to pre-histogram performance for functions that do not touch
histograms. Here are comparisons for the `changes` function. The test
data doesn't include histograms yet. Ideally, there would be no change
in the benchmark result at all.
First runtime v2.39 compared to directly prior to this commit:
```
name old time/op new time/op delta
RangeQuery/expr=changes(a_one[1d]),steps=1-16 391µs ± 2% 542µs ± 1% +38.58% (p=0.000 n=9+8)
RangeQuery/expr=changes(a_one[1d]),steps=10-16 452µs ± 2% 617µs ± 2% +36.48% (p=0.000 n=10+10)
RangeQuery/expr=changes(a_one[1d]),steps=100-16 1.12ms ± 1% 1.36ms ± 2% +21.58% (p=0.000 n=8+10)
RangeQuery/expr=changes(a_one[1d]),steps=1000-16 7.83ms ± 1% 8.94ms ± 1% +14.21% (p=0.000 n=10+10)
RangeQuery/expr=changes(a_ten[1d]),steps=1-16 2.98ms ± 0% 3.30ms ± 1% +10.67% (p=0.000 n=9+10)
RangeQuery/expr=changes(a_ten[1d]),steps=10-16 3.66ms ± 1% 4.10ms ± 1% +11.82% (p=0.000 n=10+10)
RangeQuery/expr=changes(a_ten[1d]),steps=100-16 10.5ms ± 0% 11.8ms ± 1% +12.50% (p=0.000 n=8+10)
RangeQuery/expr=changes(a_ten[1d]),steps=1000-16 77.6ms ± 1% 87.4ms ± 1% +12.63% (p=0.000 n=9+9)
RangeQuery/expr=changes(a_hundred[1d]),steps=1-16 30.4ms ± 2% 32.8ms ± 1% +8.01% (p=0.000 n=10+10)
RangeQuery/expr=changes(a_hundred[1d]),steps=10-16 37.1ms ± 2% 40.6ms ± 2% +9.64% (p=0.000 n=10+10)
RangeQuery/expr=changes(a_hundred[1d]),steps=100-16 105ms ± 1% 117ms ± 1% +11.69% (p=0.000 n=10+10)
RangeQuery/expr=changes(a_hundred[1d]),steps=1000-16 783ms ± 3% 876ms ± 1% +11.83% (p=0.000 n=9+10)
```
And then runtime v2.39 compared to after this commit:
```
name old time/op new time/op delta
RangeQuery/expr=changes(a_one[1d]),steps=1-16 391µs ± 2% 547µs ± 1% +39.84% (p=0.000 n=9+8)
RangeQuery/expr=changes(a_one[1d]),steps=10-16 452µs ± 2% 616µs ± 2% +36.15% (p=0.000 n=10+10)
RangeQuery/expr=changes(a_one[1d]),steps=100-16 1.12ms ± 1% 1.26ms ± 1% +12.20% (p=0.000 n=8+10)
RangeQuery/expr=changes(a_one[1d]),steps=1000-16 7.83ms ± 1% 7.95ms ± 1% +1.59% (p=0.000 n=10+8)
RangeQuery/expr=changes(a_ten[1d]),steps=1-16 2.98ms ± 0% 3.38ms ± 2% +13.49% (p=0.000 n=9+10)
RangeQuery/expr=changes(a_ten[1d]),steps=10-16 3.66ms ± 1% 4.02ms ± 1% +9.80% (p=0.000 n=10+9)
RangeQuery/expr=changes(a_ten[1d]),steps=100-16 10.5ms ± 0% 10.8ms ± 1% +3.08% (p=0.000 n=8+10)
RangeQuery/expr=changes(a_ten[1d]),steps=1000-16 77.6ms ± 1% 78.1ms ± 1% +0.58% (p=0.035 n=9+10)
RangeQuery/expr=changes(a_hundred[1d]),steps=1-16 30.4ms ± 2% 33.5ms ± 4% +10.18% (p=0.000 n=10+10)
RangeQuery/expr=changes(a_hundred[1d]),steps=10-16 37.1ms ± 2% 40.0ms ± 1% +7.98% (p=0.000 n=10+10)
RangeQuery/expr=changes(a_hundred[1d]),steps=100-16 105ms ± 1% 107ms ± 1% +1.92% (p=0.000 n=10+10)
RangeQuery/expr=changes(a_hundred[1d]),steps=1000-16 783ms ± 3% 775ms ± 1% -1.02% (p=0.019 n=9+9)
```
In summary, the runtime doesn't really improve with this change for
queries with just a few steps. For queries with many steps, this
commit essentially reinstates the old performance. This is good
because the many-step queries are the one that matter most (longest
absolute runtime).
In terms of allocations, though, this commit doesn't make a dent at
all (numbers not shown). The reason is that most of the allocations
happen in the sampleRingIterator (in the storage package), which has
to be addressed in a separate commit.
Signed-off-by: beorn7 <beorn@grafana.com>