Commit graph

180 commits

Author SHA1 Message Date
Callum Styan eb63f30459 fix minor things
Signed-off-by: Callum Styan <callumstyan@gmail.com>
2023-11-09 10:35:43 -03:00
Callum Styan 3e5facc0c0 add functionality for new minimized remote write request format
Signed-off-by: Callum Styan <callumstyan@gmail.com>
2023-11-09 10:35:43 -03:00
Nicolás Pazos 46b84ab3fb lint 2023-11-09 10:18:12 -03:00
alexgreenbank 8ab14f2456 Remove config, update proto 2023-11-09 10:18:12 -03:00
alexgreenbank 15c4d45635 Add 1.1 version handling code 2023-11-09 10:18:12 -03:00
Nicolás Pazos 44f166d066 Use github.com/golang/snappy 2023-11-09 10:18:12 -03:00
Nicolás Pazos c1593fd35a Improve sender benchmarks and some allocations 2023-11-09 10:18:12 -03:00
Nicolás Pazos b35ab6c080 fix build 2023-11-09 10:18:12 -03:00
Nicolás Pazos 2f815ee3dd refactor queue manager code to remove some duplication 2023-11-09 10:18:12 -03:00
Nicolás Pazos eebf7ac1fc fix: queue manager to include float histograms in new requests 2023-11-09 10:18:12 -03:00
Nicolás Pazos ed34405d68 remove some comented code 2023-11-09 10:18:12 -03:00
Callum Styan f1992591b2 Implement code paths for new proto format
Signed-off-by: Callum Styan <callumstyan@gmail.com>
2023-11-09 10:18:12 -03:00
Callum Styan 7a862d265f replace snappy encoding library
Signed-off-by: Callum Styan <callumstyan@gmail.com>
2023-11-09 10:18:12 -03:00
Levi Harrison dcaca86958
Update dependencies for 2.48 (#12964) 2023-10-15 10:53:59 -04:00
Paschalis Tsilias c173cd57c9
Add a header to count retried remote write requests (#12729)
Header name is `Retry-Attempt`, only set when >0.

Signed-off-by: Marc Tuduri <marctc@protonmail.com>
Signed-off-by: Paschalis Tsilias <paschalis.tsilias@grafana.com>
2023-09-20 11:11:03 +01:00
Bryan Boreham d2ae8dc3cb remote-write: add http.resend_count tracing attribute
As recommended by the OpenTelemetry semantic conventions.

https://opentelemetry.io/docs/specs/otel/trace/semantic_conventions/http/#http-client
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2023-08-11 16:20:12 +00:00
Matthieu MOREL bae9a21200
Merge branch 'main' into linter/nilerr
Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>
2023-04-19 19:56:39 +02:00
beorn7 5b53aa1108 style: Replace else if cascades with switch
Wiser coders than myself have come to the conclusion that a `switch`
statement is almost always superior to a statement that includes any
`else if`.

The exceptions that I have found in our codebase are just these two:

* The `if else` is followed by an additional statement before the next
  condition (separated by a `;`).
* The whole thing is within a `for` loop and `break` statements are
  used. In this case, using `switch` would require tagging the `for`
  loop, which probably tips the balance.

Why are `switch` statements more readable?

For one, fewer curly braces. But more importantly, the conditions all
have the same alignment, so the whole thing follows the natural flow
of going down a list of conditions. With `else if`, in contrast, all
conditions but the first are "hidden" behind `} else if `, harder to
spot and (for no good reason) presented differently from the first
condition.

I'm sure the aforemention wise coders can list even more reasons.

In any case, I like it so much that I have found myself recommending
it in code reviews. I would like to make it a habit in our code base,
without making it a hard requirement that we would test on the CI. But
for that, there has to be a role model, so this commit eliminates all
`if else` occurrences, unless it is autogenerated code or fits one of
the exceptions above.

Signed-off-by: beorn7 <beorn@grafana.com>
2023-04-19 17:22:31 +02:00
Matthieu MOREL fb3eb21230 enable gocritic, unconvert and unused linters
Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>
2023-04-13 19:20:22 +00:00
Arve Knudsen bc9a82f5a1
remote: Improve some comments (#12102)
Improve some comments in storage/remote/queue_manager.go, wrt. general
language and a typo.

Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
2023-03-09 11:05:24 +00:00
Arve Knudsen 435b500de7
remote: Convert to RecoverableError using errors.As (#12103)
In storage/remote, try converting to RecoverableError using errors.As,
instead of through direct casting.

Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
2023-03-08 13:58:09 -07:00
Julien Pivotto 475f9984d0
Merge pull request #11787 from damnever/perf/avoid-alloc-if-no-externallabels
Avoid allocation during remote write if external labels is empty
2023-02-22 23:38:21 +01:00
Marc Tudurí 721f33dbb0
histograms: Add remote-write support for Float Histograms (#11817)
* adapt code.go and write_handler.go to support float histograms
* adapt watcher.go to support float histograms
* wip adapt queue_manager.go to support float histograms
* address comments for metrics in queue_manager.go
* set test cases for queue manager
* use same counts for histograms and float histograms
* refactor createHistograms tests
* fix float histograms ref in watcher_test.go
* address PR comments

Signed-off-by: Marc Tuduri <marctc@protonmail.com>
2023-01-13 16:39:20 +05:30
Xiaochao Dong (@damnever) 2d61d012ff Avoid copy during remote write if external labels is empty
Signed-off-by: Xiaochao Dong (@damnever) <the.xcdong@gmail.com>
2022-12-30 19:18:30 +08:00
Bryan Boreham abd9909595 Update package storage/remote for new labels.Labels type
`QueueManager.externalLabels` becomes a slice rather than a `Labels` so
we can index into it when doing the merge operation.

Note we avoid calling `Labels.Len()` in `labelProtosToLabels()`.
It isn't necessary - `append()` will enlarge the buffer and we're
expecting to re-use it many times.

Also, we now validate protobuf input before converting to Labels.
This way we can detect errors first, and we don't place unnecessary
requirements on the Labels structure.

Re-do seriesFilter using labels.Builder (albeit N^2).

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2022-12-19 15:22:09 +00:00
Ganesh Vernekar 648be89822
Merge remote-tracking branch 'upstream/main' into fix-conflict
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
2022-10-12 14:20:02 +05:30
Jesus Vazquez 775d90d5f8
TSDB: Rename wal package to wlog (#11352)
The wlog.WL type can now be used to create a Write Ahead Log or a Write
Behind Log.

Before the prefix for wbl metrics was
'prometheus_tsdb_out_of_order_wal_' and has been replaced with
'prometheus_tsdb_out_of_order_wbl_'.

Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>
Signed-off-by: Jesus Vazquez <jesusvazquez@users.noreply.github.com>
Co-authored-by: Ganesh Vernekar <15064823+codesome@users.noreply.github.com>
2022-10-10 20:38:46 +05:30
Ganesh Vernekar f540c1dbd3
Add support for histograms in WAL checkpointing (#11210)
* Add support for histograms in WAL checkpointing

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>

* Fix review comments

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>

* Fix tests

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
2022-08-29 17:38:36 +05:30
Levi Harrison 0db6b072bc
Export histogramToHistogramProto() (#11046)
Signed-off-by: Levi Harrison <git@leviharrison.dev>
2022-07-21 10:12:50 -04:00
Levi Harrison 08f3ddb864
Sparse histogram remote-write support (#11001) 2022-07-14 09:13:12 -04:00
Matej Gera 1dd247f68b
Remote Write: Rename confusing walDir parameter to dir (#10464)
* Rename walDir parameter to dir

Signed-off-by: Matej Gera <matejgera@gmail.com>

* Improve NewQueueManager comment

Signed-off-by: Matej Gera <matejgera@gmail.com>
2022-05-30 21:45:30 -07:00
Chris Marchbanks a11e73edda
Fix a deadlock between Batch and FlushAndShutdown (#10608)
If FlushAndShutdown is called with a full batchQueue, and then Batch is
called rather than the normal path of reading from a queue a deadlock
might be encountered. Rather than having FlushAndShutdown having
blocking code while holding a lock retry sending the batch every second.

Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2022-04-20 20:50:41 +02:00
beorn7 79376c1e94 Merge branch 'release-2.33' into beorn7/release 2022-03-08 17:42:49 +01:00
Chris Marchbanks e970acb085
Fix deadlock between adding to queue and getting batch
Do not block when trying to write a batch to the queue. This can cause
appends to lock forever if the only thing reading from the queue needs
the mutex to write. Instead, if batchQueue is full pop the sample that
was just added from the partial batch and return false. The code doing
the appending already handles retries with backoff.

Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2022-03-07 17:15:57 -07:00
Julien Pivotto b0d70557b7
Merge pull request #10285 from prometheus/release-2.33 2022-02-12 00:02:24 +01:00
Chris Marchbanks bfb1500a38
Fix deadlock when stopping a shard (#10279)
If a queue is stopped and one of its shards happens to hit the
batch_send_deadline at the same time a deadlock can occur where stop
holds the mutex and will not release it until the send is finished, but
the send needs the mutex to retrieve the most recent batch. This is
fixed by using a second mutex just for writing.

In addition, the test I wrote exposed a case where during shutdown a
batch could be sent twice due to concurrent calls to queue.Batch() and
queue.FlushAndShutdown(). Protect these with a mutex as well.

Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2022-02-11 07:07:41 -07:00
Matej Gera 2c61d29b2a
Tracing: Migrate to OpenTelemetry library (#9724)
Signed-off-by: Matej Gera <matejgera@gmail.com>
2022-01-25 11:08:04 +01:00
Bryan Boreham 954c0e8020 remote_write: round desired shards up before check
Previously we would reject an increase from 2 to 2.5 as being
within 30%; by rounding up first we see this as an increase from 2 to 3.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2022-01-10 09:57:37 +00:00
Bryan Boreham 6d01ce8c4d remote_write: shard up more when backlogged
Change the coefficient from 1% to 5%, so instead of targetting to clear
the backlog in 100s we target 20s.

Update unit test to reflect the new behaviour.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2022-01-10 09:57:37 +00:00
Goutham Veeramachaneni 6696b7a5f0
Don't update metrics on context cancellation
Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>
2022-01-04 10:46:52 +01:00
Goutham Veeramachaneni 1af81dc5c9
Update sent timestamp when write irrecoverably fails.
We have an alert that fires when prometheus_remote_storage_highest_timestamp_in_seconds - prometheus_remote_storage_queue_highest_sent_timestamp_seconds
becomes too high. But we have an agent that fires this when the remote "rate-limits" the user.

This is because prometheus_remote_storage_queue_highest_sent_timestamp_seconds doesn't get updated
when the remote sends a 429.

I think we should update the metrics, and the change I made makes sense. Because if the requests fails
because of connectivity issues, etc. we will never exit the `sendWriteRequestWithBackoff` function. It only
exits the function when there is a non-recoverable error, like a bad status code, and in that case, I think
the metric needs to be updated.

Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>
2022-01-03 11:13:48 +01:00
Bryan Boreham bd6436605d Review feedback
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2021-12-09 14:40:44 +00:00
Bryan Boreham 50878ebe5e remote-write: buffer struct instead of interface
This reduces the amount of individual objects allocated, allowing sends
to run a bit faster.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2021-12-03 14:30:42 +00:00
Chris Marchbanks c655684142
Subtract from enqueued samples/exemplars upon send
Right now the values for enqueuedSamples and enqueuedExemplars is never
subtracted leading to inflated values for failedSamples/failedExemplars
when a hard shutdown of a shard occurs.

Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2021-11-30 12:54:50 -07:00
Chris Marchbanks 319249f9db
Batch samples before sending them to channels
Channels can cause bottlenecks and tons of context switches when reading
hundreds of thousands of samples per second from a single queue.
Instead, pre-batch the samples to amortize the cost of the concurrency
overhead.

Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2021-11-30 12:54:45 -07:00
Hu Shuai eb43437d83
Fix golint issue (#9800)
Signed-off-by: Hu Shuai <hus.fnst@fujitsu.com>
2021-11-18 09:26:07 +01:00
beorn7 c954cd9d1d Move packages out of deprecated pkg directory
This creates a new `model` directory and moves all data-model related
packages over there:
  exemplar labels relabel rulefmt textparse timestamp value

All the others are more or less utilities and have been moved to `util`:
  gate logging modetimevfs pool runtime

Signed-off-by: beorn7 <beorn@grafana.com>
2021-11-09 08:03:10 +01:00
Dieter Plaetinck cda025b5b5
TSDB: demistify SeriesRefs and ChunkRefs (#9536)
* TSDB: demistify seriesRefs and ChunkRefs

The TSDB package contains many types of series and chunk references,
all shrouded in uint types.  Often the same uint value may
actually mean one of different types, in non-obvious ways.

This PR aims to clarify the code and help navigating to relevant docs,
usage, etc much quicker.

Concretely:

* Use appropriately named types and document their semantics and
  relations.
* Make multiplexing and demuxing of types explicit
  (on the boundaries between concrete implementations and generic
  interfaces).
* Casting between different types should be free.  None of the changes
  should have any impact on how the code runs.

TODO: Implement BlockSeriesRef where appropriate (for a future PR)

Signed-off-by: Dieter Plaetinck <dieter@grafana.com>

* feedback

Signed-off-by: Dieter Plaetinck <dieter@grafana.com>

* agent: demistify seriesRefs and ChunkRefs

Signed-off-by: Dieter Plaetinck <dieter@grafana.com>
2021-11-06 15:40:04 +05:30
sniper f82e56fbba
fix request bytes size and continue is useless (#9635)
Signed-off-by: kalmanzhao <kalmanzhao@tencent.com>

Co-authored-by: kalmanzhao <kalmanzhao@tencent.com>
2021-11-03 14:40:31 +05:30
Mateusz Gozdek b7bdf6fab2 Fix imports formatting
According to
2829908806 (r58457095).

Signed-off-by: Mateusz Gozdek <mgozdekof@gmail.com>
2021-11-02 19:52:34 +01:00