Commit graph

1333 commits

Author SHA1 Message Date
William Dumont ce6ad15422 remote-write: TestClientRetryAfter status code 500
and compare the retryAfter values.

Signed-off-by: William Dumont <william.dumont@grafana.com>
2023-09-20 10:25:43 +00:00
William Dumont febd62a23e remote-write: refactor TestClientRetryAfter
The new version features a set of test cases that simplify the addition
of new HTTP status codes.

Signed-off-by: William Dumont <william.dumont@grafana.com>
2023-09-20 10:24:52 +00:00
Bryan Boreham 9b85354acd remote-write: respect Retry-After header on 5xx errors
If the server sent it to us, we should assume it knows better than we
do and respect it.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2023-09-20 10:14:38 +00:00
Paschalis Tsilias c173cd57c9
Add a header to count retried remote write requests (#12729)
Header name is `Retry-Attempt`, only set when >0.

Signed-off-by: Marc Tuduri <marctc@protonmail.com>
Signed-off-by: Paschalis Tsilias <paschalis.tsilias@grafana.com>
2023-09-20 11:11:03 +01:00
George Krajcsovits 3512b2d678
storage: make histogram reset handling consistent in chainSampleIterator (#12779)
storage: make histogram reset handling consistent in chainSampleIterator

---------

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
2023-09-19 17:06:46 +02:00
zenador 69edd8709b
Add warnings (and annotations) to PromQL query results (#12152)
Return annotations (warnings and infos) from PromQL queries

This generalizes the warnings we have already used before (but only for problems with remote read) as "annotations".

Annotations can be warnings or infos (the latter could be false positives). We do not treat them different in the API for now and return them all as "warnings". It would be easy to distinguish them and return infos separately, should that appear useful in the future.

The new annotations are then used to create a lot of warnings or infos during PromQL evaluations. Partially these are things we have wanted for a long time (e.g. inform the user that they have applied `rate` to a metric that doesn't look like a counter), but the new native histograms have created even more needs for those annotations (e.g. if a query tries to aggregate float numbers with histograms).

The annotations added here are not yet complete. A prominent example would be a warning about a range too short for a rate calculation. But such a warnings is more tricky to create with good fidelity and we will tackle it later.

Another TODO is to take annotations into account when evaluating recording rules.

---------

Signed-off-by: Jeanette Tan <jeanette.tan@grafana.com>
2023-09-14 18:57:31 +02:00
Arve Knudsen 156222cc50
Add context argument to LabelQuerier.LabelValues (#12665)
Add context argument to LabelQuerier.LabelValues and
LabelQuerier.SortedLabelValues.

Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
2023-09-14 16:02:04 +02:00
Arve Knudsen a964349e97
Add context argument to LabelQuerier.LabelNames (#12666)
Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
2023-09-14 10:39:51 +02:00
beorn7 0521ec12af storage: remove obsolete TODO
This was solved one layer deeper with #11687.

Signed-off-by: beorn7 <beorn@grafana.com>
2023-09-13 15:10:22 +02:00
Arve Knudsen 6daee89e5f
Add context argument to Querier.Select (#12660)
Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
2023-09-12 12:37:38 +02:00
Gregor Zeitlinger f01718262a
Unit tests for native histograms (#12668)
promql: Extend testing framework to support native histograms

This includes both the internal testing framework as well as the rules unit test feature of promtool.

This also adds a bunch of basic tests. Many of the code level tests can now be converted to tests within the framework, and more tests can be added easily.

---------

Signed-off-by: Harold Dost <h.dost@criteo.com>
Signed-off-by: Gregor Zeitlinger <gregor.zeitlinger@grafana.com>
Signed-off-by: Stephen Lang <stephen.lang@grafana.com>
Co-authored-by: Harold Dost <h.dost@criteo.com>
Co-authored-by: Stephen Lang <stephen.lang@grafana.com>
Co-authored-by: Gregor Zeitlinger <gregor.zeitlinger@grafana.com>
2023-08-25 23:35:42 +02:00
Justin Lei 8ef7dfdeeb
Add a chunk size limit in bytes (#12054)
Add a chunk size limit in bytes

This creates a hard cap for XOR chunks of 1024 bytes.

The limit for histogram chunk is also 1024 bytes, but it is a soft limit as a histogram has a dynamic size, and even a single one could be larger than 1024 bytes.

This also avoids cutting new histogram chunks if the existing chunk has fewer than 10 histograms yet. In that way, we are accepting "jumbo chunks" in order to have at least 10 histograms in a chunk, allowing compression to kick in.

Signed-off-by: Justin Lei <justin.lei@grafana.com>
2023-08-24 15:21:17 +02:00
beorn7 aa82fe198f tsdb: Fix histogram validation
So far, `ValidateHistogram` would not detect if the count did not
include the count in the zero bucket. This commit fixes the problem
and updates all the tests that have been undetected offenders so far.

Note that this problem would only ever create false negatives, so we
never falsely rejected to store a histogram because of it.

On the other hand, `ValidateFloatHistogram` has been to strict with
the count being at least as large as the sum of the counts in all the
buckets. Float precision issues could create false positives here, see
products of PromQL evaluations, it's actually quite hard to put an
upper limit no the floating point imprecision. Users could produce the
weirdest expressions, maxing out float precision problems. Therefore,
this commit simply removes that particular check from
`ValidateFloatHistogram`.

Signed-off-by: beorn7 <beorn@grafana.com>
2023-08-22 23:04:01 +02:00
Michael Hoffmann 4d8e380269
promql: allow tests to be imported (#12050)
Signed-off-by: Michael Hoffmann <mhoffm@posteo.de>
2023-08-18 20:48:59 +02:00
Bryan Boreham d2ae8dc3cb remote-write: add http.resend_count tracing attribute
As recommended by the OpenTelemetry semantic conventions.

https://opentelemetry.io/docs/specs/otel/trace/semantic_conventions/http/#http-client
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2023-08-11 16:20:12 +00:00
Goutham Veeramachaneni ad4f514e66
Add OTLP Ingestion endpoint (#12571)
* Add OTLP Ingestion endpoint

We copy files from the otel-collector-contrib. See the README in
`storage/remote/otlptranslator/README.md`.

This supersedes: https://github.com/prometheus/prometheus/pull/11965

Signed-off-by: gouthamve <gouthamve@gmail.com>

* Return a 200 OK

It is what the OTEL Golang SDK expect :(

https://github.com/open-telemetry/opentelemetry-go/issues/4363

Signed-off-by: Goutham <gouthamve@gmail.com>

---------

Signed-off-by: gouthamve <gouthamve@gmail.com>
Signed-off-by: Goutham <gouthamve@gmail.com>
2023-07-28 12:35:28 +02:00
George Krajcsovits 6cd2d1621f
Hide histogram chunk append and reset header internals (#12352)
tsdb: Hide histogram chunk append and reset header internals

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Signed-off-by: George Krajcsovits <krajorama@users.noreply.github.com>
2023-07-26 15:08:16 +02:00
LHHDZ 7d8f9b0978
remote-write receiver: reuse 'ref' to optimize multiple samples for same series (#12580)
reuse 'ref' to optimize multi samples processing efficiency

Signed-off-by: changlin.shi <changlin.shi@ly.com>
2023-07-22 14:24:46 +01:00
György Krajcsovits d4e355243a tsdbutil/ChunkFromSamplesGeneric should not panic
Add error handling instead.
Prepares for #12352

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
2023-07-20 17:01:34 +02:00
Julien Pivotto 0f85e4f41d
Merge pull request #12539 from bboreham/slices-sorts
Replace sort.Slice with faster slices.SortFunc
2023-07-11 13:09:02 +02:00
Bryan Boreham ce153e3fff Replace sort.Sort with faster slices.SortFunc
The generic version is more efficient.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2023-07-10 09:43:45 +00:00
Julien Pivotto 986fde06b2
Merge pull request #11688 from damnever/fix/datamodelvalidation-remotewriteapi
Validate the metric names and labels in the remote write handler
2023-07-04 13:52:02 +02:00
Bryan Boreham 5255bf06ad Replace sort.Slice with faster slices.SortFunc
The generic version is more efficient.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2023-07-02 22:17:08 +00:00
rakshith210 b1675e23af
Add Azure AD package for remote write (#11944)
* Add Azure AD package for remote write
* Made AzurePublic default and updated configuration.md
* Updated config structure and removed getToken at initialization
* Changed passing context from request

Signed-off-by: Rakshith Padmanabha <rapadman@microsoft.com>
Signed-off-by: rakshith210 <rakshith.me@gmail.com>
2023-06-01 15:20:10 -06:00
Bryan Boreham a073e04a9b
Merge pull request #12366 from prometheus/release-2.44
Merge release 2.44 back to main
2023-05-16 18:06:29 +01:00
Callum Styan 0d2108ad79
[tsdb] re-implement WAL watcher to read via a "notification" channel (#11949)
* WIP implement WAL watcher reading via notifications over a channel from
the TSDB code

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Notify via head appenders Commit (finished all WAL logging) rather than
on each WAL Log call

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Fix misspelled Notify plus add a metric for dropped Write notifications

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Update tests to handle new notification pattern

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* this test maybe needs more time on windows?

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* does this test need more time on windows as well?

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* read timeout is already a time.Duration

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* remove mistakenly commited benchmark data files

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* address some review feedback

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* fix missed changes from previous commit

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Fix issues from wrapper function

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* try fixing race condition in test by allowing tests to overwrite the
read ticker timeout instead of calling the Notify function

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* fix linting

Signed-off-by: Callum Styan <callumstyan@gmail.com>

---------

Signed-off-by: Callum Styan <callumstyan@gmail.com>
2023-05-15 12:31:49 -07:00
George Krajcsovits f5fcaa3872
Fix setting reset header to gauge histogram in seriesToChunkEncoder (#12329)
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
2023-05-05 18:04:30 +05:30
Justin Lei 7bbf24b707 Make MemoizedSeriesIterator not implement chunkenc.Iterator
Signed-off-by: Justin Lei <justin.lei@grafana.com>
2023-05-03 12:45:39 -07:00
beorn7 b0272255b7 storage: optimise sampleRing
Replace many checks for the lengths of slices with a single tracking
variable.

Signed-off-by: beorn7 <beorn@grafana.com>
2023-05-03 20:09:29 +02:00
Justin Lei 6985dcbe73 Optimize and test MemoizedSeriesIterator
Signed-off-by: Justin Lei <justin.lei@grafana.com>
2023-05-02 08:53:18 -07:00
Filip Petkovski 0d049feac7
Fix encoding samples in ChunkSeries (#12185)
The storage.ChunkSeries iterator assumes that a histogram sample can always be
appended to the currently open chunk. This is not the case when there is a counter reset,
or when appending a stale sample to a chunk with non-stale samples. In addition, the open chunk sometimes
needs to be recoded before a sample can be appended.

This commit addresses the issue by implementing a RecodingAppender which can recode incoming
samples in a transparent way. It also detects cases when a sample cannot be appended at all and
returns `false` so that the caller can open a new chunk.

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
Co-authored-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Co-authored-by: Ganesh Vernekar <ganeshvern@gmail.com>
2023-04-28 16:52:21 -04:00
Jeanette Tan 1102ffd188 Fix according to code review
Signed-off-by: Jeanette Tan <jeanette.tan@grafana.com>
2023-04-22 02:27:15 +08:00
Jeanette Tan e9a1e26ab7 Perform integer/float histogram type checking on conversions, and use a consistent method for determining integer vs float histogram
Signed-off-by: Jeanette Tan <jeanette.tan@grafana.com>
2023-04-22 02:27:15 +08:00
Björn Rabenstein 78cd9ae2c3
Merge pull request #12264 from rabenhorst/sample-ring-iterator-mixed-histograms-fix
Fix for `sampleRingIterator` with mixed histograms
2023-04-20 16:58:18 +02:00
Matthieu MOREL bae9a21200
Merge branch 'main' into linter/nilerr
Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>
2023-04-19 19:56:39 +02:00
beorn7 5b53aa1108 style: Replace else if cascades with switch
Wiser coders than myself have come to the conclusion that a `switch`
statement is almost always superior to a statement that includes any
`else if`.

The exceptions that I have found in our codebase are just these two:

* The `if else` is followed by an additional statement before the next
  condition (separated by a `;`).
* The whole thing is within a `for` loop and `break` statements are
  used. In this case, using `switch` would require tagging the `for`
  loop, which probably tips the balance.

Why are `switch` statements more readable?

For one, fewer curly braces. But more importantly, the conditions all
have the same alignment, so the whole thing follows the natural flow
of going down a list of conditions. With `else if`, in contrast, all
conditions but the first are "hidden" behind `} else if `, harder to
spot and (for no good reason) presented differently from the first
condition.

I'm sure the aforemention wise coders can list even more reasons.

In any case, I like it so much that I have found myself recommending
it in code reviews. I would like to make it a habit in our code base,
without making it a hard requirement that we would test on the CI. But
for that, there has to be a role model, so this commit eliminates all
`if else` occurrences, unless it is autogenerated code or fits one of
the exceptions above.

Signed-off-by: beorn7 <beorn@grafana.com>
2023-04-19 17:22:31 +02:00
beorn7 c3c7d44d84 lint: Adjust to the lint warnings raised by current versions of golint-ci
We haven't updated golint-ci in our CI yet, but this commit prepares
for that.

There are a lot of new warnings, and it is mostly because the "revive"
linter got updated. I agree with most of the new warnings, mostly
around not naming unused function parameters (although it is justified
in some cases for documentation purposes – while things like mocks are
a good example where not naming the parameter is clearer).

I'm pretty upset about the "empty block" warning to include `for`
loops. It's such a common pattern to do something in the head of the
`for` loop and then have an empty block. There is still an open issue
about this: https://github.com/mgechev/revive/issues/810 I have
disabled "revive" altogether in files where empty blocks are used
excessively, and I have made the effort to add individual
`// nolint:revive` where empty blocks are used just once or twice.
It's borderline noisy, though, but let's go with it for now.

I should mention that none of the "empty block" warnings for `for`
loop bodies were legitimate.

Signed-off-by: beorn7 <beorn@grafana.com>
2023-04-19 17:10:10 +02:00
Sebastian Rabenhorst 5d4ec08a1f
Fixed sampleRingIterator for mixed histograms
Signed-off-by: Sebastian Rabenhorst <sebastian.rabenhorst@shopify.com>

Fixed sampleRingIterator for mixed histograms

Signed-off-by: Sebastian Rabenhorst <sebastian.rabenhorst@shopify.com>

Fixed lint
2023-04-14 12:45:36 +02:00
Matthieu MOREL fb3eb21230 enable gocritic, unconvert and unused linters
Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>
2023-04-13 19:20:22 +00:00
beorn7 717a3f8e25 storage: Manually expand genericAdd for specific types
This commit is doing what I would have expected that Go generics do
for me. However, the PromQL benchmarks show a significant runtime and
allocation increase with `genericAdd`, so this replaces it with
hand-coded non-generic versions of it.

Signed-off-by: beorn7 <beorn@grafana.com>
2023-04-13 19:25:24 +02:00
beorn7 817a2396cb Name float values as "floats", not as "values"
In the past, every sample value was a float, so it was fine to call a
variable holding such a float "value" or "sample". With native
histograms, a sample might have a histogram value. And a histogram
value is still a value. Calling a float value just "value" or "sample"
or "V" is therefore misleading. Over the last few commits, I already
renamed many variables, but this cleans up a few more places where the
changes are more invasive.

Note that we do not to attempt naming in the JSON APIs or in the
protobufs. That would be quite a disruption. However, internally, we
can call variables as we want, and we should go with the option of
avoiding misunderstandings.

Signed-off-by: beorn7 <beorn@grafana.com>
2023-04-13 19:25:24 +02:00
beorn7 462240bc78 storage: add specialized buffers to sampleRing
This utilizes the fact that most sampleRings will only contain samples
of one type. In this case, the generic interface is circumvented, and
a bespoke buffer for the one actually occurring sample type is
used. Should a sampleRing receive a sample of a different kind later,
it will transparently switch to the generic behavior.

Signed-off-by: beorn7 <beorn@grafana.com>
2023-04-13 19:25:24 +02:00
beorn7 630bcb494b storage: Use separate sample types for histogram vs. float
Previously, we had one “polymorphous” `sample` type in the `storage`
package. This commit breaks it up into `fSample`, `hSample`, and
`fhSample`, each still implementing the `tsdbutil.Sample` interface.

This reduces allocations in `sampleRing.Add` but inflicts the penalty
of the interface wrapper, which makes things worse in total.

This commit therefore just demonstrates the step taken. The next
commit will tackle the interface overhead problem.

Signed-off-by: beorn7 <beorn@grafana.com>
2023-04-13 19:25:24 +02:00
Björn Rabenstein 6e0a46900b
Merge pull request #12192 from leizor/leizor/prometheus/issues/11204
Add support for native histograms to concreteSeriesIterator
2023-04-11 12:30:35 +02:00
Justin Lei f90013a5a0 Update storage/remote/codec.go
Co-authored-by: Björn Rabenstein <github@rabenste.in>
Signed-off-by: Justin Lei <97976793+leizor@users.noreply.github.com>
2023-04-06 09:54:15 -07:00
Justin Lei 83f43982c9 Add support for native histograms to concreteSeriesIterator
Signed-off-by: Justin Lei <justin.lei@grafana.com>
2023-04-06 09:54:15 -07:00
Xiaochao Dong (@damnever) 2b7202c4cc Validate the metric names and labels in the remote write handler
Signed-off-by: Xiaochao Dong (@damnever) <the.xcdong@gmail.com>
2023-04-05 19:09:05 +08:00
Bryan Boreham b987afa7ef labels: simplify call to get Labels from Builder
It took a `Labels` where the memory could be re-used, but in practice
this hardly ever benefitted. Especially after converting `relabel.Process`
to `relabel.ProcessBuilder`.

Comparing the parameter to `nil` was a bug; `EmptyLabels` is not `nil`
so the slice was reallocated multiple times by `append`.

Lastly `Builder.Labels()` now estimates that the final size will depend
on labels added and deleted.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2023-03-22 17:05:20 +00:00
Björn Rabenstein 559adab471
Merge pull request #12085 from leizor/leizor/prometheus/issues/11204
Handle native histograms in remote read
2023-03-21 17:25:34 +01:00
Oleg Zaytsev beb7d3b80f
remote.Client: store urlString
During remote write, we call url.String() twice:
- to add the Endpoint() to the span
- to actually know where whe should send the request

This value does not change over time, and it's not really that
lightweight to calculate. I wrote this simple benchmark:

    func BenchmarkURLString(b *testing.B) {
        u, err := url.Parse("https://remote.write.com/api/v1")
        require.NoError(b, err)

        b.Run("string", func(b *testing.B) {
            count := 0
            for i := 0; i < b.N; i++ {
                count += len(u.String())
            }
        })
    }

And the results are ~200ns/op, 80B/op, 3 allocs/op.

Yes, we're going to go to the network here, which is a huge amount of
resources compared to this, but still, on agents that send 500 requests
per second, that is 1500 wasteful allocations per second.

Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
2023-03-16 09:53:10 +01:00