Commit graph

1198 commits

Author SHA1 Message Date
Matej Gera 1dd247f68b
Remote Write: Rename confusing walDir parameter to dir (#10464)
* Rename walDir parameter to dir

Signed-off-by: Matej Gera <matejgera@gmail.com>

* Improve NewQueueManager comment

Signed-off-by: Matej Gera <matejgera@gmail.com>
2022-05-30 21:45:30 -07:00
Bryan Boreham 4b9f248e85
unit tests: make all Labels sorted alphabetically (#10532)
"Labels is a sorted set of labels. Order has to be guaranteed upon
instantiation." says the comment, so fix all the tests that break this
rule.

For `BenchmarkLabelValuesWithMatchers()` and
`BenchmarkHeadLabelValuesWithMatchers()` the amount of work done changes
significantly if you put the labels in order, because all series refs
get neatly partitioned by the `tens` label, so I renamed the labels
to maintain the previous behaviour.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2022-05-04 23:41:36 +02:00
Matthieu MOREL e2ede285a2
refactor: move from io/ioutil to io and os packages (#10528)
* refactor: move from io/ioutil to io and os packages
* use fs.DirEntry instead of os.FileInfo after os.ReadDir

Signed-off-by: MOREL Matthieu <matthieu.morel@cnp.fr>
2022-04-27 11:24:36 +02:00
Chris Marchbanks a11e73edda
Fix a deadlock between Batch and FlushAndShutdown (#10608)
If FlushAndShutdown is called with a full batchQueue, and then Batch is
called rather than the normal path of reading from a queue a deadlock
might be encountered. Rather than having FlushAndShutdown having
blocking code while holding a lock retry sending the batch every second.

Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2022-04-20 20:50:41 +02:00
Wilbert Guo 83a2e52bc2
Add SyncForState Implementation for Ruler HA (#10070)
* continuously syncing activeAt for alerts

Signed-off-by: Yijie Qin <qinyijie@amazon.com>
Signed-off-by: Wilbert Guo <wilbeguo@amazon.com>

* add import

Signed-off-by: Yijie Qin <qinyijie@amazon.com>
Signed-off-by: Wilbert Guo <wilbeguo@amazon.com>

* Refactor SyncForState and add unit tests

Signed-off-by: Wilbert Guo <wilbeguo@amazon.com>

* Format code

Signed-off-by: Wilbert Guo <wilbeguo@amazon.com>

* Add hook for syncForState

Signed-off-by: Wilbert Guo <wilbeguo@amazon.com>

Fix go lint

Signed-off-by: Wilbert Guo <wilbeguo@amazon.com>

Refactor syncForState override implementation

Signed-off-by: Wilbert Guo <wilbeguo@amazon.com>

Add syncForState override func as argument to Update()

Signed-off-by: Wilbert Guo <wilbeguo@amazon.com>

Fix go formatting

Signed-off-by: Wilbert Guo <wilbeguo@amazon.com>

Fix circleci test errors

Signed-off-by: Wilbert Guo <wilbeguo@amazon.com>

Remove overrideFunc as argument to run()

Signed-off-by: Wilbert Guo <wilbeguo@amazon.com>

* remove the syncForState

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* use the override function to decide if need to replace the activeAt or not

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* fix test case

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* fix format

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* Trigger build

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* fixing comments

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* return the result of map of alerts instead of single one

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* upper case the QueryforStateSeries

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* use a more generic rule group post process function type

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* fix indentation

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* fix gofmt

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* fix lint

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* fixing naming

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* fix comments

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* add the lastEvalTimestamp as parameter

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* fmt

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* change funcType to func

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

Co-authored-by: Yijie Qin <qinyijie@amazon.com>
Co-authored-by: Yijie Qin <63399121+qinxx108@users.noreply.github.com>
2022-03-29 02:16:46 +02:00
beorn7 79376c1e94 Merge branch 'release-2.33' into beorn7/release 2022-03-08 17:42:49 +01:00
Chris Marchbanks e970acb085
Fix deadlock between adding to queue and getting batch
Do not block when trying to write a batch to the queue. This can cause
appends to lock forever if the only thing reading from the queue needs
the mutex to write. Instead, if batchQueue is full pop the sample that
was just added from the partial batch and return false. The code doing
the appending already handles retries with backoff.

Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2022-03-07 17:15:57 -07:00
Chris Marchbanks afdc1decac
Write a test that reproduces the deadlock
Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2022-03-07 17:15:51 -07:00
Łukasz Mierzwa a4317bf0ec
Run gofumpt on all files (#10392)
* Run gofumpt on all files

Getting golangci-lint errors when building on my laptop, possibly because I have newer version of gofumpt then what it was formatted with.
Run gofumpt -w -extra on all files as it will be needed in the future anyway.

* Update golangci-lint to v1.44.2

v1.44.0 upgraded gofumpt so bumping version in CI will help keep formatting correct for everyone

* Address golangci-lint error

Getting 'error-strings: error strings should not be capitalized or end with punctuation or a newline' from revive here.
Drop new line.

Signed-off-by: Łukasz Mierzwa <l.mierzwa@gmail.com>
2022-03-03 17:21:05 +01:00
DrAuYueng 5a6e26556b
Add an option to use the external labels as selectors for the remote read endpoint (#10254)
* An option to ignore external_labels

Signed-off-by: DrAuYueng <ouyang1204@gmail.com>
2022-02-16 22:12:47 +01:00
Julien Pivotto b0d70557b7
Merge pull request #10285 from prometheus/release-2.33 2022-02-12 00:02:24 +01:00
Chris Marchbanks bfb1500a38
Fix deadlock when stopping a shard (#10279)
If a queue is stopped and one of its shards happens to hit the
batch_send_deadline at the same time a deadlock can occur where stop
holds the mutex and will not release it until the send is finished, but
the send needs the mutex to retrieve the most recent batch. This is
fixed by using a second mutex just for writing.

In addition, the test I wrote exposed a case where during shutdown a
batch could be sent twice due to concurrent calls to queue.Batch() and
queue.FlushAndShutdown(). Protect these with a mutex as well.

Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2022-02-11 07:07:41 -07:00
Matej Gera 2c61d29b2a
Tracing: Migrate to OpenTelemetry library (#9724)
Signed-off-by: Matej Gera <matejgera@gmail.com>
2022-01-25 11:08:04 +01:00
Eng Zer Jun 3e67654d37
refactor: use T.TempDir() and B.TempDir to create temporary directory
The directory created by `T.TempDir()` and `B.TempDir()` is
automatically removed when the test and all its subtests complete.

Reference: https://pkg.go.dev/testing#T.TempDir
Reference: https://pkg.go.dev/testing#B.TempDir
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
2022-01-22 18:57:30 +08:00
Bryan Boreham 954c0e8020 remote_write: round desired shards up before check
Previously we would reject an increase from 2 to 2.5 as being
within 30%; by rounding up first we see this as an increase from 2 to 3.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2022-01-10 09:57:37 +00:00
Bryan Boreham 6d01ce8c4d remote_write: shard up more when backlogged
Change the coefficient from 1% to 5%, so instead of targetting to clear
the backlog in 100s we target 20s.

Update unit test to reflect the new behaviour.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2022-01-10 09:57:37 +00:00
Bryan Boreham d588b14d9c remote_write: detailed test for shard calculation
Drive the input parameters to `calculateDesiredShards()` very precisely,
to illustrate some questionable behaviour marked with `?!`.

See https://github.com/prometheus/prometheus/issues/9178,
https://github.com/prometheus/prometheus/issues/9207,

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2022-01-10 09:57:37 +00:00
Chris Marchbanks ba03f7fc23
Merge pull request #10102 from prometheus/update-metrics-on-rw-fails
Update sent timestamp when write irrecoverably fails
2022-01-05 10:46:09 -07:00
Goutham Veeramachaneni 6696b7a5f0
Don't update metrics on context cancellation
Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>
2022-01-04 10:46:52 +01:00
Chris Marchbanks dfa5cb7462
Merge pull request #10038 from charlesxsh/fix-TestReshardRaceWithStop
add proper exit for loop
2022-01-03 09:02:45 -07:00
Goutham Veeramachaneni 1af81dc5c9
Update sent timestamp when write irrecoverably fails.
We have an alert that fires when prometheus_remote_storage_highest_timestamp_in_seconds - prometheus_remote_storage_queue_highest_sent_timestamp_seconds
becomes too high. But we have an agent that fires this when the remote "rate-limits" the user.

This is because prometheus_remote_storage_queue_highest_sent_timestamp_seconds doesn't get updated
when the remote sends a 429.

I think we should update the metrics, and the change I made makes sense. Because if the requests fails
because of connectivity issues, etc. we will never exit the `sendWriteRequestWithBackoff` function. It only
exits the function when there is a non-recoverable error, like a bad status code, and in that case, I think
the metric needs to be updated.

Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>
2022-01-03 11:13:48 +01:00
Shihao Xia c3e7bfb813 add proper exit for loop
Signed-off-by: Shihao Xia <charlesxsh@hotmail.com>
2021-12-29 23:48:11 -05:00
Julien Pivotto 27343277fa
Merge release-2.32 forward into main (#10032)
* storage: expose bug in iterators #10027

Signed-off-by: beorn7 <beorn@grafana.com>

* storage: fix bug #10027 in iterators' Seek method

Signed-off-by: beorn7 <beorn@grafana.com>

* Append reporting metrics without limit

If reporting metrics fails due to reaching the limit, this makes the
target appear as UP in the UI, but the metrics are missing.

This commit bypasses that limit for report metrics.

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>

* Remove check against cfg so interval/ timeout are always set (#10023) (#10031)

Signed-off-by: Nicholas Blott <blottn@tcd.ie>

Co-authored-by: Nicholas Blott <blottn@tcd.ie>

* Cut v2.32.1

Signed-off-by: Julius Volz <julius.volz@gmail.com>

* Apply suggestions from code review

Signed-off-by: Julius Volz <julius.volz@gmail.com>

Co-authored-by: Levi Harrison <git@leviharrison.dev>

Co-authored-by: Julien Pivotto <roidelapluie@inuits.eu>
Co-authored-by: Nicholas Blott <blottn@tcd.ie>
Co-authored-by: Julius Volz <julius.volz@gmail.com>
Co-authored-by: Levi Harrison <git@leviharrison.dev>
2021-12-17 23:18:38 +01:00
beorn7 0ede6ae321 storage: fix bug #10027 in iterators' Seek method
Signed-off-by: beorn7 <beorn@grafana.com>
2021-12-16 12:07:35 +01:00
beorn7 b042e29569 storage: expose bug in iterators #10027
Signed-off-by: beorn7 <beorn@grafana.com>
2021-12-16 12:02:15 +01:00
Chris Marchbanks 0a8d28ea93
Merge pull request #9934 from bboreham/remote-write-struct
remote-write: buffer struct instead of interface to reduce garbage-collection
2021-12-09 09:17:45 -07:00
Bryan Boreham bd6436605d Review feedback
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2021-12-09 14:40:44 +00:00
Sebastian Rabenhorst d8b8678bd1
Log time series details for out-of-order samples in remote write receiver (#9894)
* Improved out-of-order sample logs in write handler

Signed-off-by: Sebastian Rabenhorst <sebastian.rabenhorst@shopify.com>

sign commit

Signed-off-by: Sebastian Rabenhorst <sebastian.rabenhorst@shopify.com>

Inlined logAppendError

Signed-off-by: Sebastian Rabenhorst <sebastian.rabenhorst@shopify.com>

Update storage/remote/write_handler.go

Co-authored-by: Ganesh Vernekar <15064823+codesome@users.noreply.github.com>

Fixed fmt

Signed-off-by: Sebastian Rabenhorst <sebastian.rabenhorst@shopify.com>

* Improved out-of-order sample logs in write handler

Signed-off-by: Sebastian Rabenhorst <sebastian.rabenhorst@shopify.com>

sign commit

Signed-off-by: Sebastian Rabenhorst <sebastian.rabenhorst@shopify.com>

Inlined logAppendError

Signed-off-by: Sebastian Rabenhorst <sebastian.rabenhorst@shopify.com>
2021-12-08 15:07:51 +00:00
detailyang 3e482c905f
fix:storage:avoid panic when iterater exhauested (#9945)
Signed-off-by: detailyang <detailyang@gmail.com>
2021-12-07 19:50:00 +05:30
Bryan Boreham 50878ebe5e remote-write: buffer struct instead of interface
This reduces the amount of individual objects allocated, allowing sends
to run a bit faster.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2021-12-03 14:30:42 +00:00
Bryan Boreham c478d6477a remote-write: benchmark just sending, on 20 shards
Previously BenchmarkSampleDelivery spent a lot of effort checking each
sample had arrived, so was largely showing the performance of test-only
code.

Increase the number of shards to be more realistic for a large
workload.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2021-12-03 14:02:10 +00:00
Chris Marchbanks e95d4ec3f1
Merge pull request #9830 from prometheus/batch-queues
Batch samples before sending them to channels
2021-12-02 08:37:41 -07:00
Chris Marchbanks c655684142
Subtract from enqueued samples/exemplars upon send
Right now the values for enqueuedSamples and enqueuedExemplars is never
subtracted leading to inflated values for failedSamples/failedExemplars
when a hard shutdown of a shard occurs.

Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2021-11-30 12:54:50 -07:00
Chris Marchbanks 319249f9db
Batch samples before sending them to channels
Channels can cause bottlenecks and tons of context switches when reading
hundreds of thousands of samples per second from a single queue.
Instead, pre-batch the samples to amortize the cost of the concurrency
overhead.

Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2021-11-30 12:54:45 -07:00
Björn Rabenstein d677aa4b29
storage: Consolidate iterator method names (Values -> At) (#9888)
`BufferedSeriesIterator` and `MemoizedSeriesIterator` use a method
called `Values` for exactly the purpose for which all other iterators
of the same kind use a method called `At`. That alone is confusing,
but on top of that, the `Values` method only returns a single sample,
not multiple values. I assume the naming has historical reasons. This
commit makes it more consistent. It is now easier to read, and now
`BufferedSeriesIterator` and `MemoizedSeriesIterator` implement
`chunkenc.Iterator` like many other iterators, too.

Signed-off-by: beorn7 <beorn@grafana.com>
2021-11-29 11:16:40 +01:00
Björn Rabenstein b866db009b
storage: Fix and improve the Seek method of various iterators (#9878)
There was a subtle and nasty bug in listSeriesIterator.Seek.

In addition, the Seek call is defined to be a no-op if the current
position of the iterator is already pointing to a suitable
sample. This commit adds fast paths for this case to several
potentially expensive Seek calls.

Another bug was in concreteSeriesIterator.Seek. It always searched the
whole series and not from the current position of the iterator.

Signed-off-by: beorn7 <beorn@grafana.com>
2021-11-29 15:17:56 +05:30
Matheus Alcantara e673805d67
storage/remote: use t.TempDir instead of ioutil.TempDir on tests (#9811)
Signed-off-by: Matheus Alcantara <matheusssilv97@gmail.com>
2021-11-19 15:21:45 -05:00
Hu Shuai eb43437d83
Fix golint issue (#9800)
Signed-off-by: Hu Shuai <hus.fnst@fujitsu.com>
2021-11-18 09:26:07 +01:00
Dieter Plaetinck 0fac9bb859
Add basic initial developer docs for TSDB (#9451)
* Add basic initial developer docs for TSDB

There's a decent amount of content already out there (blog posts,
conference talks, etc), but:
* when they get stale, they don't tend to get updated
* they still leave me with questions that I'ld like to answer
  for developers (like me) who want to use, or work with, TSDB

What I propose is developer docs inside the prometheus
repository.  Easy to find and harness the power of the community
to expand it and keep it up to date.

* perfect is the enemy of good.  Let's have a base and incrementally improve
* Markdown docs should be broad but not too deep.  Source code comments
  can complement them, and are the ideal place for implementation details.

Signed-off-by: Dieter Plaetinck <dieter@grafana.com>

* use example code that works out of the box

Signed-off-by: Dieter Plaetinck <dieter@grafana.com>

* Apply suggestions from code review

Co-authored-by: Ganesh Vernekar <15064823+codesome@users.noreply.github.com>
Signed-off-by: Dieter Plaetinck <dieter@grafana.com>

* PR feedback

Signed-off-by: Dieter Plaetinck <dieter@grafana.com>

* more docs

Signed-off-by: Dieter Plaetinck <dieter@grafana.com>

* PR feedback

Signed-off-by: Dieter Plaetinck <dieter@grafana.com>

* Apply suggestions from code review

Signed-off-by: Dieter Plaetinck <dieter@grafana.com>

Co-authored-by: Bartlomiej Plotka <bwplotka@gmail.com>

* Apply suggestions from code review

Signed-off-by: Dieter Plaetinck <dieter@grafana.com>

Co-authored-by: Ganesh Vernekar <15064823+codesome@users.noreply.github.com>

* feedback

Signed-off-by: Dieter Plaetinck <dieter@grafana.com>

* Update tsdb/docs/usage.md

Signed-off-by: Dieter Plaetinck <dieter@grafana.com>

Co-authored-by: Ganesh Vernekar <15064823+codesome@users.noreply.github.com>

* final tweaks

Signed-off-by: Dieter Plaetinck <dieter@grafana.com>

* workaround docs versioning issue

Signed-off-by: Dieter Plaetinck <dieter@grafana.com>

* Move example code to real executable, testable example.

Signed-off-by: Dieter Plaetinck <dieter@grafana.com>

* cleanup example test and make sure it always reproduces

Signed-off-by: Dieter Plaetinck <dieter@grafana.com>

* obtain temp dir in a way that works with older Go versions

Signed-off-by: Dieter Plaetinck <dieter@grafana.com>

* Fix Ganesh's comments

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>

Co-authored-by: Ganesh Vernekar <15064823+codesome@users.noreply.github.com>
Co-authored-by: Bartlomiej Plotka <bwplotka@gmail.com>
Co-authored-by: Ganesh Vernekar <ganeshvern@gmail.com>
2021-11-17 15:51:27 +05:30
Mateusz Gozdek d8561dbfd8 storage/remote: make tests use separate remote write configs
So tests can be run in parallel without races.

Signed-off-by: Mateusz Gozdek <mgozdekof@gmail.com>
2021-11-10 09:40:43 +01:00
Mateusz Gozdek 116552cc58 storage/remote: check errors from ApplyConfig in tests
So tests do not produce obscure errors when applying configuration
fails.

Signed-off-by: Mateusz Gozdek <mgozdekof@gmail.com>
2021-11-10 09:40:43 +01:00
beorn7 c954cd9d1d Move packages out of deprecated pkg directory
This creates a new `model` directory and moves all data-model related
packages over there:
  exemplar labels relabel rulefmt textparse timestamp value

All the others are more or less utilities and have been moved to `util`:
  gate logging modetimevfs pool runtime

Signed-off-by: beorn7 <beorn@grafana.com>
2021-11-09 08:03:10 +01:00
Dieter Plaetinck cda025b5b5
TSDB: demistify SeriesRefs and ChunkRefs (#9536)
* TSDB: demistify seriesRefs and ChunkRefs

The TSDB package contains many types of series and chunk references,
all shrouded in uint types.  Often the same uint value may
actually mean one of different types, in non-obvious ways.

This PR aims to clarify the code and help navigating to relevant docs,
usage, etc much quicker.

Concretely:

* Use appropriately named types and document their semantics and
  relations.
* Make multiplexing and demuxing of types explicit
  (on the boundaries between concrete implementations and generic
  interfaces).
* Casting between different types should be free.  None of the changes
  should have any impact on how the code runs.

TODO: Implement BlockSeriesRef where appropriate (for a future PR)

Signed-off-by: Dieter Plaetinck <dieter@grafana.com>

* feedback

Signed-off-by: Dieter Plaetinck <dieter@grafana.com>

* agent: demistify seriesRefs and ChunkRefs

Signed-off-by: Dieter Plaetinck <dieter@grafana.com>
2021-11-06 15:40:04 +05:30
Marco Pracucci 9f5ff5b269
Allow to disable trimming when querying TSDB (#9647)
* Allow to disable trimming when querying TSDB

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Addressed review comments

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Added unit test

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Renamed TrimDisabled to DisableTrimming

Signed-off-by: Marco Pracucci <marco@pracucci.com>
2021-11-03 15:38:34 +05:30
sniper f82e56fbba
fix request bytes size and continue is useless (#9635)
Signed-off-by: kalmanzhao <kalmanzhao@tencent.com>

Co-authored-by: kalmanzhao <kalmanzhao@tencent.com>
2021-11-03 14:40:31 +05:30
Mateusz Gozdek b7bdf6fab2 Fix imports formatting
According to
2829908806 (r58457095).

Signed-off-by: Mateusz Gozdek <mgozdekof@gmail.com>
2021-11-02 19:52:34 +01:00
Mateusz Gozdek 1a6c2283a3 Format Go source files using 'gofumpt -w -s -extra'
Part of #9557

Signed-off-by: Mateusz Gozdek <mgozdekof@gmail.com>
2021-11-02 19:52:34 +01:00
lzhfromustc 9da5382103
storage/remote: Prevent two goroutines from endless loop (#8967)
Signed-off-by: lzhfromustc <lzhfromustc@gmail.com>
2021-10-29 16:39:02 -07:00
lzhfromustc d42be7be76
test:Fix two potential goroutine leaks (#8964)
Signed-off-by: lzhfromustc <lzhfromustc@gmail.com>
2021-10-29 15:44:32 -07:00
Bryan Boreham 5afa606ecb
Remote-write: reuse memory for marshalling (#9412)
By holding a `proto.Buffer` per shard and passing it down to where
marshalling is done, we avoid creating a lot of garbage.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2021-10-29 14:44:40 -07:00