Implement histogram statistics decoder
This commit speeds up histogram_count and histogram_sum
functions on native histograms. The idea is to have separate decoders which can be
used by the engine to only read count/sum values from histogram objects. This should help
with reducing allocations when decoding histograms, as well as with speeding up aggregations
like sum since they will be done on floats and not on histogram objects.
Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
---------
Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
Co-authored-by: Anthony Mirabella <a9@aneurysm9.com>
This can give a more precise result, by keeping a separate running
compensation value to accumulate small errors.
See https://en.wikipedia.org/wiki/Kahan_summation_algorithm
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
When the label name of a matcher contains non-standard characters, like
a dot, or starts with a digit, it should be quoted.
If it's not quoted, then `VectorSelector.String()` isn't a valid PromQL.
Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
* process custom values in histogram unit test framework
* check for warnings when evaluating in unit test framework
* add test cases for custom buckets in test framework
Signed-off-by: Jeanette Tan <jeanette.tan@grafana.com>
The function `rangeEvalTimestampFunctionOverVectorSelector` appeared to be checking histogram size, however the value it used was always 0 due to subtle variable shadowing.
However we don't need to pass sample values to the `timestamp` function, since the latter only cares about timestamps. This also affects peak sample count in statistics, since we are no longer copying histogram samples.
Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
The check fell into "this matcher equals vector selector's name" case when vector selector doesn't have a name and the matcher is an explicit matcher for an empty __name__ label.
To provide some context about why this is important: some downstream projects use the promql.Parse(expr.String()) to clone an expression's AST, and with this bug that matcher disappears in the cloning.
Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
* modify unit test framework to automatically generate native histograms with custom buckets from classic histogram series
* add very basic tests for classic histogram converted into native histogram with custom bounds
* fix histogram_quantile for native histograms with custom buckets
* make loading with nhcb explicit
* evaluate native histograms with custom buckets on queries with explicit keyword
* use regex replacer
* use temp histogram struct for automatically loading converted nhcb
Signed-off-by: Jeanette Tan <jeanette.tan@grafana.com>
Signed-off-by: George Krajcsovits <krajorama@users.noreply.github.com>
In a previous PR, the generated parser was created using an old version of goyacc.
Also adds -l to disable line directives, which fixes debug processing and reduces diffs at the expense of making it more difficult to reason about the generated output.
Signed-off-by: Owen Williams <owen.williams@grafana.com>
includes Inf and NaN as numbers to histogram
---------
Signed-off-by: Neeraj Gartia <neerajgartia211002@gmail.com>
Signed-off-by: Björn Rabenstein <github@rabenste.in>
Co-authored-by: Björn Rabenstein <github@rabenste.in>
This saves memory in other kinds of aggregation.
We don't need `orderedResult` in `aggregationCountValues`; the ordering
is not guaranteed.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
They aggregate results in different ways.
topk/bottomk don't consider histograms so can simplify data collection.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
This is a cleaner split of responsibilities.
We now check the sample count after calling rangeEvalAgg.
Changed re-use of samples to use `Clone` and `defer`.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Pass it as a float64 not as interface{}.
Make k a simple int, since that is the parameter to make().
Pull invalid quantile warning out of the loop.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
The new function `rangeEvalAgg` is mostly a copy of `rangeEval`, but
without `initSeries` which we don't need and inlining the callback to
`aggregation()`.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
The existing aggregation function is very long and covers very different
cases.
`aggregationCountValues` is just for `count_values`, which differs from
other aggregations in that it outputs as many series per group as there
are values in the input.
Remove the top-level switch on string parameter type; use the same `Op`
check there as elswehere.
Pull checking parameters out to caller, where it is only executed once.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
The strings produced by these tests can run to thousands of characters,
which makes test logs difficult to read.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
* promql: include more details in error message when creating test query fails
Signed-off-by: Charles Korn <charles.korn@grafana.com>
* Include more details when an unexpected metric is returned
Signed-off-by: Charles Korn <charles.korn@grafana.com>
---------
Signed-off-by: Charles Korn <charles.korn@grafana.com>
The definition of histograms in the test framework may create
histograms in a non-compact form. Since histogram comparison relies on
exact equality of the bucket layout, we have to compact the histograms
created by the test framework language before comparing them to
histograms returned from the PromQL engine.
Signed-off-by: beorn7 <beorn@grafana.com>
The size of histogram points are now bigger by 24 bytes due to the
custom values slice.
When histograms are loaded into partial results in vector selectors
we use HPoint type where the size is calculated as
(size of histogram + 8 for timestamp)/16.
a3d1a46eda/promql/value.go (L176)
When histograms are put into Sample type in range evaluations, the
Sample has more overhead and the size is calculated differently:
(size of histogram / 16) + 1 for time stamp.
a3d1a46eda/promql/engine.go (L1928)
When the size of the histogram is 16k, then the first calculation gives k
but the second gives k+1 for the sample count.
If the histogram size is 16k+8, then both would give k+1.
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Restrict the capacity of first argument to `append()` to force an allocation.
This is for the slice implementation only.
Signed-off-by: Domantas Jadenkus <djadenkus@gmail.com>
* Extract method to make it easier to test.
Signed-off-by: Charles Korn <charles.korn@grafana.com>
* Remove superfluous interface definition.
Signed-off-by: Charles Korn <charles.korn@grafana.com>
* Add test cases for existing instant query functionality.
Signed-off-by: Charles Korn <charles.korn@grafana.com>
* Add support for testing range queries
Signed-off-by: Charles Korn <charles.korn@grafana.com>
* Expand test coverage for instant queries and clarify error when a float is returned but a histogram is expected (or vice versa)
Signed-off-by: Charles Korn <charles.korn@grafana.com>
* Improve error message formatting
Signed-off-by: Charles Korn <charles.korn@grafana.com>
* Add test case for instant query command with invalid timestamp
Signed-off-by: Charles Korn <charles.korn@grafana.com>
* Fix linting warning.
Signed-off-by: Charles Korn <charles.korn@grafana.com>
* Remove superfluous print statement and expected result
Signed-off-by: Charles Korn <charles.korn@grafana.com>
* Fix linting warning.
Signed-off-by: Charles Korn <charles.korn@grafana.com>
* Add note about ordered range eval commands.
Signed-off-by: Charles Korn <charles.korn@grafana.com>
* Check that matrix results are always sorted by labels.
Signed-off-by: Charles Korn <charles.korn@grafana.com>
---------
Signed-off-by: Charles Korn <charles.korn@grafana.com>
Using testify outside of unit tests results in panics rather than a
useful error for the user.
Fixes#13703
Signed-off-by: David Leadbeater <dgl@dgl.cx>
This is a bit tough to explain, but I'll try:
`rate` & friends have a sophisticated extrapolation algorithm.
Usually, we extrapolate the result to the total interval specified in
the range selector. However, if the first sample within the range is
too far away from the beginning of the interval, or if the last sample
within the range is too far away from the end of the interval, we
assume the series has just started half a sampling interval before the
first sample or after the last sample, respectively, and shorten the
extrapolation interval correspondingly. We calculate the sampling
interval by looking at the average time between samples within the
range, and we define "too far away" as "more than 110% of that
sampling interval".
However, if this algorithm leads to an extrapolated starting value
that is negative, we limit the start of the extrapolation interval to
the point where the extrapolated starting value is zero.
At least that was the intention.
What we actually implemented is the following: If extrapolating all
the way to the beginning of the total interval would lead to an
extrapolated negative value, we would only extrapolate to the zero
point as above, even if the algorithm above would have selected a
starting point that is just half a sampling interval before the first
sample and that starting point would not have an extrapolated negative
value. In other word: What was meant as a _limitation_ of the
extrapolation interval yielded a _longer_ extrapolation interval in
this case.
There is an exception to the case just described: If the increase of
the extrapolation interval is more than 110% of the sampling interval,
we suddenly drop back to only extrapolate to half a sampling interval.
This behavior can be nicely seen in the testcounter_zero_cutoff test,
where the rate goes up all the way to 0.7 and then jumps back to 0.6.
This commit changes the behavior to what was (presumably) intended
from the beginning: The extension of the extrapolation interval is
only limited if actually needed to prevent extrapolation to negative
values, but the "limitation" never leads to _more_ extrapolation
anymore.
The difference is subtle, and probably it never bothered anyone.
However, if you calculate a rate of a classic histograms, the old
behavior might create non-monotonic histograms as a result (because of
the jumps you can see nicely in the old version of the
testcounter_zero_cutoff test). With this fix, that doesn't happen
anymore.
Signed-off-by: beorn7 <beorn@grafana.com>
* add custom buckets to native histogram model
* simple copy for custom bounds
* return errors for unsupported add/sub operations
* add test cases for string and update appendhistogram in scrape to account for new schema
* check fields which are supposed to be unused but may affect results in equals
* allow appending custom buckets histograms regardless of max schema
Signed-off-by: Jeanette Tan <jeanette.tan@grafana.com>
Aggregations discard the metric name, so don't try to
include it in the error message.
Add a test that generates this warning.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
This adds support for the new grammar of `{"metric_name", "l1"="val"}` to promql and some of the exposition formats.
This grammar will also be valid for non-UTF-8 names.
UTF-8 names will not be considered valid unless model.NameValidationScheme is changed.
This does not update the go expfmt parser in text_parse.go, which will be addressed by https://github.com/prometheus/common/issues/554/.
Part of https://github.com/prometheus/prometheus/issues/13095
Signed-off-by: Owen Williams <owen.williams@grafana.com>
Fixes#11708.
If a range vector is fixen in time with the @ modifier, it gets still
moved around for different steps in a range query. Since no additional
points are retrieved from the TSDB, this leads to steadily emptying
the range, leading to the weird behavior described in isse #11708.
This only happens for functions listed in `AtModifierUnsafeFunctions`,
and the only of those that takes a range vector is `predict_linear`,
which is the reason why we see it only for this particular function.
Signed-off-by: beorn7 <beorn@grafana.com>
These functions act on the labels only, so don't need to go step by step
over the samples in a range query.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
* Reusing points slice from previous series when the slice is under utilized
* Adding comments on the bench test
Signed-off-by: Alan Protasio <alanprot@gmail.com>
The last_over_time retains a histogram sample without making a copy.
This sample is now coming from the buffered iterator used for windowing functions,
and can be reused for reading subsequent samples as the iterator progresses.
I would propose copying the sample in the last_over_time function, similar to
how it is done for rate, sum_over_time and others.
Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
This function is called very frequently when executing PromQL functions,
and we can do it much more efficiently inside Labels.
In the common case that `__name__` comes first in the labels, we simply
re-point to start at the next label, which is nearly free.
`DropMetricName` is now so cheap I removed the cache - benchmarks show
everything still goes faster.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Optimize histogram iterators
Histogram iterators allocate new objects in the AtHistogram and
AtFloatHistogram methods, which makes calculating rates over long
ranges expensive.
In #13215 we allowed an existing object to be reused
when converting an integer histogram to a float histogram. This commit follows
the same idea and allows injecting an existing object in the AtHistogram and
AtFloatHistogram methods. When the injected value is nil, iterators allocate
new histograms, otherwise they populate and return the injected object.
The commit also adds a CopyTo method to Histogram and FloatHistogram which
is used in the BufferedIterator to overwrite items in the ring instead of making
new copies.
Note that a specialized HPoint pool is needed for all of this to work
(`matrixSelectorHPool`).
---------
Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
Co-authored-by: George Krajcsovits <krajorama@users.noreply.github.com>
Add warnings for histogramRate applied with isCounter not matching counter/gauge histogram
---------
Signed-off-by: Jeanette Tan <jeanette.tan@grafana.com>
Restore more efficient version of NewPossibleNonCounterInfo annotation
Signed-off-by: Jeanette Tan <jeanette.tan@grafana.com>
---------
Signed-off-by: Jeanette Tan <jeanette.tan@grafana.com>
Conditions are ANDed inside the same matcher but matchers are ORed
Including unit tests for "promtool tsdb dump".
Refactor some matchers scraping utils.
Signed-off-by: machine424 <ayoubmrini424@gmail.com>
This reverts commit 2ddb3596ef.
Various tests are failing in CI after this change; reverting to free up
other work.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
In https://github.com/prometheus/prometheus/pull/13276 we started reusing float histogram objects to reduce allocations in PromQL.
That PR introduces a bug where histogram pointers gets copied to the beginning of the histograms slice,
but are still kept in the end of the slice. When a new histogram is read into the last element,
it can overwrite a previous element because the pointer is the same.
This commit fixes the issue by moving outdated points to the end of the slice
so that we don't end up with duplicate pointers in the same buffer. In other words,
the slice gets rotated so that old objects can get reused.
Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
This commit reduces the memory needed to query native histogram objects
by reusing existing HPoint instances.
Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
The 'ToFloat' method on integer histograms currently allocates new memory
each time it is called.
This commit adds an optional *FloatHistogram parameter that can be used
to reuse span and bucket slices. It is up to the caller to make sure the
input float histogram is not used anymore after the call.
Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
promql: Improve histogram_quantile calculation for classic buckets
Tiny differences between classic buckets are most likely caused by floating point precision issues. With this commit, relative changes below a certain threshold are ignored. This makes the result of histogram_quantile more meaningful, and also avoids triggering the _input to histogram_quantile needed to be fixed for monotonicity_ annotations in unactionable cases.
This commit also adds explanation of the new adjustment and of the monotonicity annotation to the documentation of `histogram_quantile`.
---------
Signed-off-by: Jeanette Tan <jeanette.tan@grafana.com>
* Add benchmark for native histograms
This commit adds a PromQL benchmark for queries on native histograms.
Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
This PR adds an Experimental flag to the functions.
This can be used by https://github.com/prometheus/prometheus/pull/13059
but also xrate and other future functions.
Signed-off-by: Julien Pivotto <roidelapluie@o11y.eu>
This function is useful to analyze promQL queries. We want to use this in Mimir to record the time range which the query touches.
I also chose to remove the `Engine` receiver because it was unnecessary, and it makes it easier to use, but happy to refactor that if you disagree.
The function is untested on its own. If you prefer to have unit tests now that its exported, I can look into adding some.
Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
* Remove NewPossibleNonCounterInfo until it can be made more efficient, and avoid creating empty annotations as much as possible
Signed-off-by: Jeanette Tan <jeanette.tan@grafana.com>
Return annotations (warnings and infos) from PromQL queries
This generalizes the warnings we have already used before (but only for problems with remote read) as "annotations".
Annotations can be warnings or infos (the latter could be false positives). We do not treat them different in the API for now and return them all as "warnings". It would be easy to distinguish them and return infos separately, should that appear useful in the future.
The new annotations are then used to create a lot of warnings or infos during PromQL evaluations. Partially these are things we have wanted for a long time (e.g. inform the user that they have applied `rate` to a metric that doesn't look like a counter), but the new native histograms have created even more needs for those annotations (e.g. if a query tries to aggregate float numbers with histograms).
The annotations added here are not yet complete. A prominent example would be a warning about a range too short for a rate calculation. But such a warnings is more tricky to create with good fidelity and we will tackle it later.
Another TODO is to take annotations into account when evaluating recording rules.
---------
Signed-off-by: Jeanette Tan <jeanette.tan@grafana.com>
Make it more likely that contributors will run the benchmark suite.
count_values needs more than 2GB at 1,000 steps, so just run it for 100.
And remove 10-step variant because it doesn't add much to 100 and
1000-step benchmarks.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Otherwise we have a highly unusual situation of over 100 chunks
in the headChunks list of each series, which heavily skews
performance.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
promql: Extend testing framework to support native histograms
This includes both the internal testing framework as well as the rules unit test feature of promtool.
This also adds a bunch of basic tests. Many of the code level tests can now be converted to tests within the framework, and more tests can be added easily.
---------
Signed-off-by: Harold Dost <h.dost@criteo.com>
Signed-off-by: Gregor Zeitlinger <gregor.zeitlinger@grafana.com>
Signed-off-by: Stephen Lang <stephen.lang@grafana.com>
Co-authored-by: Harold Dost <h.dost@criteo.com>
Co-authored-by: Stephen Lang <stephen.lang@grafana.com>
Co-authored-by: Gregor Zeitlinger <gregor.zeitlinger@grafana.com>
So far, `ValidateHistogram` would not detect if the count did not
include the count in the zero bucket. This commit fixes the problem
and updates all the tests that have been undetected offenders so far.
Note that this problem would only ever create false negatives, so we
never falsely rejected to store a histogram because of it.
On the other hand, `ValidateFloatHistogram` has been to strict with
the count being at least as large as the sum of the counts in all the
buckets. Float precision issues could create false positives here, see
products of PromQL evaluations, it's actually quite hard to put an
upper limit no the floating point imprecision. Users could produce the
weirdest expressions, maxing out float precision problems. Therefore,
this commit simply removes that particular check from
`ValidateFloatHistogram`.
Signed-off-by: beorn7 <beorn@grafana.com>
promql engine: check unique labels using existing map
ContainsSameLabelset constructs a map with the same hash key as the one used to compile the output of rangeEval, so we can use that one and save work.
Need to hold the timestamp so we can be sure we saw the same series in the same evaluation.
`ContainsSameLabelset` constructs a map with the same hash key as
the one used to compile the output of `rangeEval`, so we can use that
one and save work.
Need to hold the timestamp so we can be sure we saw the same series
in the same evaluation.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
The operator changes the meaning of the metric, so the metric name should
be dropped. Technically this would be a breaking change, but it's also very
obviously a bug and not likely that anyone depends on it.
Signed-off-by: Julius Volz <julius.volz@gmail.com>