Same idea as for the avg aggregator before: Most of the time, there is
no overflow, so we don't have to revert to the more expensive and less
precise incremental calculation of the mean value.
Signed-off-by: beorn7 <beorn@grafana.com>
The calculation of the mean value in avg_over_time is performed in an
incremental fashion. This introduces additional numerical errors that
even Kahan summation cannot compensate, but at least we can use the
Kahan-corrected mean value when we use the intermediate mean value in
the calculation.
Signed-off-by: beorn7 <beorn@grafana.com>
This is a bit tough to explain, but I'll try:
`rate` & friends have a sophisticated extrapolation algorithm.
Usually, we extrapolate the result to the total interval specified in
the range selector. However, if the first sample within the range is
too far away from the beginning of the interval, or if the last sample
within the range is too far away from the end of the interval, we
assume the series has just started half a sampling interval before the
first sample or after the last sample, respectively, and shorten the
extrapolation interval correspondingly. We calculate the sampling
interval by looking at the average time between samples within the
range, and we define "too far away" as "more than 110% of that
sampling interval".
However, if this algorithm leads to an extrapolated starting value
that is negative, we limit the start of the extrapolation interval to
the point where the extrapolated starting value is zero.
At least that was the intention.
What we actually implemented is the following: If extrapolating all
the way to the beginning of the total interval would lead to an
extrapolated negative value, we would only extrapolate to the zero
point as above, even if the algorithm above would have selected a
starting point that is just half a sampling interval before the first
sample and that starting point would not have an extrapolated negative
value. In other word: What was meant as a _limitation_ of the
extrapolation interval yielded a _longer_ extrapolation interval in
this case.
There is an exception to the case just described: If the increase of
the extrapolation interval is more than 110% of the sampling interval,
we suddenly drop back to only extrapolate to half a sampling interval.
This behavior can be nicely seen in the testcounter_zero_cutoff test,
where the rate goes up all the way to 0.7 and then jumps back to 0.6.
This commit changes the behavior to what was (presumably) intended
from the beginning: The extension of the extrapolation interval is
only limited if actually needed to prevent extrapolation to negative
values, but the "limitation" never leads to _more_ extrapolation
anymore.
The difference is subtle, and probably it never bothered anyone.
However, if you calculate a rate of a classic histograms, the old
behavior might create non-monotonic histograms as a result (because of
the jumps you can see nicely in the old version of the
testcounter_zero_cutoff test). With this fix, that doesn't happen
anymore.
Signed-off-by: beorn7 <beorn@grafana.com>
* add custom buckets to native histogram model
* simple copy for custom bounds
* return errors for unsupported add/sub operations
* add test cases for string and update appendhistogram in scrape to account for new schema
* check fields which are supposed to be unused but may affect results in equals
* allow appending custom buckets histograms regardless of max schema
Signed-off-by: Jeanette Tan <jeanette.tan@grafana.com>
These functions act on the labels only, so don't need to go step by step
over the samples in a range query.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
The last_over_time retains a histogram sample without making a copy.
This sample is now coming from the buffered iterator used for windowing functions,
and can be reused for reading subsequent samples as the iterator progresses.
I would propose copying the sample in the last_over_time function, similar to
how it is done for rate, sum_over_time and others.
Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
This function is called very frequently when executing PromQL functions,
and we can do it much more efficiently inside Labels.
In the common case that `__name__` comes first in the labels, we simply
re-point to start at the next label, which is nearly free.
`DropMetricName` is now so cheap I removed the cache - benchmarks show
everything still goes faster.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Add warnings for histogramRate applied with isCounter not matching counter/gauge histogram
---------
Signed-off-by: Jeanette Tan <jeanette.tan@grafana.com>
promql: Improve histogram_quantile calculation for classic buckets
Tiny differences between classic buckets are most likely caused by floating point precision issues. With this commit, relative changes below a certain threshold are ignored. This makes the result of histogram_quantile more meaningful, and also avoids triggering the _input to histogram_quantile needed to be fixed for monotonicity_ annotations in unactionable cases.
This commit also adds explanation of the new adjustment and of the monotonicity annotation to the documentation of `histogram_quantile`.
---------
Signed-off-by: Jeanette Tan <jeanette.tan@grafana.com>
* Remove NewPossibleNonCounterInfo until it can be made more efficient, and avoid creating empty annotations as much as possible
Signed-off-by: Jeanette Tan <jeanette.tan@grafana.com>
Return annotations (warnings and infos) from PromQL queries
This generalizes the warnings we have already used before (but only for problems with remote read) as "annotations".
Annotations can be warnings or infos (the latter could be false positives). We do not treat them different in the API for now and return them all as "warnings". It would be easy to distinguish them and return infos separately, should that appear useful in the future.
The new annotations are then used to create a lot of warnings or infos during PromQL evaluations. Partially these are things we have wanted for a long time (e.g. inform the user that they have applied `rate` to a metric that doesn't look like a counter), but the new native histograms have created even more needs for those annotations (e.g. if a query tries to aggregate float numbers with histograms).
The annotations added here are not yet complete. A prominent example would be a warning about a range too short for a rate calculation. But such a warnings is more tricky to create with good fidelity and we will tackle it later.
Another TODO is to take annotations into account when evaluating recording rules.
---------
Signed-off-by: Jeanette Tan <jeanette.tan@grafana.com>
Handle more arithmetic operators and aggregators for native histograms
This includes operators for multiplication (formerly known as scaling), division, and subtraction. Plus aggregations for average and the avg_over_time function.
Stdvar and stddev will (for now) ignore histograms properly (rather than counting them but adding a 0 for them).
Signed-off-by: Jeanette Tan <jeanette.tan@grafana.com>
Wiser coders than myself have come to the conclusion that a `switch`
statement is almost always superior to a statement that includes any
`else if`.
The exceptions that I have found in our codebase are just these two:
* The `if else` is followed by an additional statement before the next
condition (separated by a `;`).
* The whole thing is within a `for` loop and `break` statements are
used. In this case, using `switch` would require tagging the `for`
loop, which probably tips the balance.
Why are `switch` statements more readable?
For one, fewer curly braces. But more importantly, the conditions all
have the same alignment, so the whole thing follows the natural flow
of going down a list of conditions. With `else if`, in contrast, all
conditions but the first are "hidden" behind `} else if `, harder to
spot and (for no good reason) presented differently from the first
condition.
I'm sure the aforemention wise coders can list even more reasons.
In any case, I like it so much that I have found myself recommending
it in code reviews. I would like to make it a habit in our code base,
without making it a hard requirement that we would test on the CI. But
for that, there has to be a role model, so this commit eliminates all
`if else` occurrences, unless it is autogenerated code or fits one of
the exceptions above.
Signed-off-by: beorn7 <beorn@grafana.com>
We haven't updated golint-ci in our CI yet, but this commit prepares
for that.
There are a lot of new warnings, and it is mostly because the "revive"
linter got updated. I agree with most of the new warnings, mostly
around not naming unused function parameters (although it is justified
in some cases for documentation purposes – while things like mocks are
a good example where not naming the parameter is clearer).
I'm pretty upset about the "empty block" warning to include `for`
loops. It's such a common pattern to do something in the head of the
`for` loop and then have an empty block. There is still an open issue
about this: https://github.com/mgechev/revive/issues/810 I have
disabled "revive" altogether in files where empty blocks are used
excessively, and I have made the effort to add individual
`// nolint:revive` where empty blocks are used just once or twice.
It's borderline noisy, though, but let's go with it for now.
I should mention that none of the "empty block" warnings for `for`
loop bodies were legitimate.
Signed-off-by: beorn7 <beorn@grafana.com>
In the past, every sample value was a float, so it was fine to call a
variable holding such a float "value" or "sample". With native
histograms, a sample might have a histogram value. And a histogram
value is still a value. Calling a float value just "value" or "sample"
or "V" is therefore misleading. Over the last few commits, I already
renamed many variables, but this cleans up a few more places where the
changes are more invasive.
Note that we do not to attempt naming in the JSON APIs or in the
protobufs. That would be quite a disruption. However, internally, we
can call variables as we want, and we should go with the option of
avoiding misunderstandings.
Signed-off-by: beorn7 <beorn@grafana.com>
In other words: Instead of having a “polymorphous” `Point` that can
either contain a float value or a histogram value, use an `FPoint` for
floats and an `HPoint` for histograms.
This seemingly small change has a _lot_ of repercussions throughout
the codebase.
The idea here is to avoid the increase in size of `Point` arrays that
happened after native histograms had been added.
The higher-level data structures (`Sample`, `Series`, etc.) are still
“polymorphous”. The same idea could be applied to them, but at each
step the trade-offs needed to be evaluated.
The idea with this change is to do the minimum necessary to get back
to pre-histogram performance for functions that do not touch
histograms. Here are comparisons for the `changes` function. The test
data doesn't include histograms yet. Ideally, there would be no change
in the benchmark result at all.
First runtime v2.39 compared to directly prior to this commit:
```
name old time/op new time/op delta
RangeQuery/expr=changes(a_one[1d]),steps=1-16 391µs ± 2% 542µs ± 1% +38.58% (p=0.000 n=9+8)
RangeQuery/expr=changes(a_one[1d]),steps=10-16 452µs ± 2% 617µs ± 2% +36.48% (p=0.000 n=10+10)
RangeQuery/expr=changes(a_one[1d]),steps=100-16 1.12ms ± 1% 1.36ms ± 2% +21.58% (p=0.000 n=8+10)
RangeQuery/expr=changes(a_one[1d]),steps=1000-16 7.83ms ± 1% 8.94ms ± 1% +14.21% (p=0.000 n=10+10)
RangeQuery/expr=changes(a_ten[1d]),steps=1-16 2.98ms ± 0% 3.30ms ± 1% +10.67% (p=0.000 n=9+10)
RangeQuery/expr=changes(a_ten[1d]),steps=10-16 3.66ms ± 1% 4.10ms ± 1% +11.82% (p=0.000 n=10+10)
RangeQuery/expr=changes(a_ten[1d]),steps=100-16 10.5ms ± 0% 11.8ms ± 1% +12.50% (p=0.000 n=8+10)
RangeQuery/expr=changes(a_ten[1d]),steps=1000-16 77.6ms ± 1% 87.4ms ± 1% +12.63% (p=0.000 n=9+9)
RangeQuery/expr=changes(a_hundred[1d]),steps=1-16 30.4ms ± 2% 32.8ms ± 1% +8.01% (p=0.000 n=10+10)
RangeQuery/expr=changes(a_hundred[1d]),steps=10-16 37.1ms ± 2% 40.6ms ± 2% +9.64% (p=0.000 n=10+10)
RangeQuery/expr=changes(a_hundred[1d]),steps=100-16 105ms ± 1% 117ms ± 1% +11.69% (p=0.000 n=10+10)
RangeQuery/expr=changes(a_hundred[1d]),steps=1000-16 783ms ± 3% 876ms ± 1% +11.83% (p=0.000 n=9+10)
```
And then runtime v2.39 compared to after this commit:
```
name old time/op new time/op delta
RangeQuery/expr=changes(a_one[1d]),steps=1-16 391µs ± 2% 547µs ± 1% +39.84% (p=0.000 n=9+8)
RangeQuery/expr=changes(a_one[1d]),steps=10-16 452µs ± 2% 616µs ± 2% +36.15% (p=0.000 n=10+10)
RangeQuery/expr=changes(a_one[1d]),steps=100-16 1.12ms ± 1% 1.26ms ± 1% +12.20% (p=0.000 n=8+10)
RangeQuery/expr=changes(a_one[1d]),steps=1000-16 7.83ms ± 1% 7.95ms ± 1% +1.59% (p=0.000 n=10+8)
RangeQuery/expr=changes(a_ten[1d]),steps=1-16 2.98ms ± 0% 3.38ms ± 2% +13.49% (p=0.000 n=9+10)
RangeQuery/expr=changes(a_ten[1d]),steps=10-16 3.66ms ± 1% 4.02ms ± 1% +9.80% (p=0.000 n=10+9)
RangeQuery/expr=changes(a_ten[1d]),steps=100-16 10.5ms ± 0% 10.8ms ± 1% +3.08% (p=0.000 n=8+10)
RangeQuery/expr=changes(a_ten[1d]),steps=1000-16 77.6ms ± 1% 78.1ms ± 1% +0.58% (p=0.035 n=9+10)
RangeQuery/expr=changes(a_hundred[1d]),steps=1-16 30.4ms ± 2% 33.5ms ± 4% +10.18% (p=0.000 n=10+10)
RangeQuery/expr=changes(a_hundred[1d]),steps=10-16 37.1ms ± 2% 40.0ms ± 1% +7.98% (p=0.000 n=10+10)
RangeQuery/expr=changes(a_hundred[1d]),steps=100-16 105ms ± 1% 107ms ± 1% +1.92% (p=0.000 n=10+10)
RangeQuery/expr=changes(a_hundred[1d]),steps=1000-16 783ms ± 3% 775ms ± 1% -1.02% (p=0.019 n=9+9)
```
In summary, the runtime doesn't really improve with this change for
queries with just a few steps. For queries with many steps, this
commit essentially reinstates the old performance. This is good
because the many-step queries are the one that matter most (longest
absolute runtime).
In terms of allocations, though, this commit doesn't make a dent at
all (numbers not shown). The reason is that most of the allocations
happen in the sampleRingIterator (in the storage package), which has
to be addressed in a separate commit.
Signed-off-by: beorn7 <beorn@grafana.com>