Commit graph

1227 commits

Author SHA1 Message Date
beorn7 40ad5e284a Merge branch 'main' into beorn7/sparsehistogram 2022-06-09 20:50:30 +02:00
Matej Gera 1dd247f68b
Remote Write: Rename confusing walDir parameter to dir (#10464)
* Rename walDir parameter to dir

Signed-off-by: Matej Gera <matejgera@gmail.com>

* Improve NewQueueManager comment

Signed-off-by: Matej Gera <matejgera@gmail.com>
2022-05-30 21:45:30 -07:00
Bryan Boreham 4b9f248e85
unit tests: make all Labels sorted alphabetically (#10532)
"Labels is a sorted set of labels. Order has to be guaranteed upon
instantiation." says the comment, so fix all the tests that break this
rule.

For `BenchmarkLabelValuesWithMatchers()` and
`BenchmarkHeadLabelValuesWithMatchers()` the amount of work done changes
significantly if you put the labels in order, because all series refs
get neatly partitioned by the `tens` label, so I renamed the labels
to maintain the previous behaviour.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2022-05-04 23:41:36 +02:00
beorn7 3bc711e333 Merge branch 'main' into sparsehistogram 2022-05-04 13:37:13 +02:00
Matthieu MOREL e2ede285a2
refactor: move from io/ioutil to io and os packages (#10528)
* refactor: move from io/ioutil to io and os packages
* use fs.DirEntry instead of os.FileInfo after os.ReadDir

Signed-off-by: MOREL Matthieu <matthieu.morel@cnp.fr>
2022-04-27 11:24:36 +02:00
Chris Marchbanks a11e73edda
Fix a deadlock between Batch and FlushAndShutdown (#10608)
If FlushAndShutdown is called with a full batchQueue, and Batch is then
called instead of the normal path of reading from the queue, a deadlock
might be encountered. Rather than having FlushAndShutdown block while
holding a lock, retry sending the batch every second.

Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2022-04-20 20:50:41 +02:00
beorn7 7ee1836ef5 Merge branch 'main' into sparsehistogram 2022-04-05 18:31:19 +02:00
Wilbert Guo 83a2e52bc2
Add SyncForState Implementation for Ruler HA (#10070)
* continuously syncing activeAt for alerts

Signed-off-by: Yijie Qin <qinyijie@amazon.com>
Signed-off-by: Wilbert Guo <wilbeguo@amazon.com>

* add import

Signed-off-by: Yijie Qin <qinyijie@amazon.com>
Signed-off-by: Wilbert Guo <wilbeguo@amazon.com>

* Refactor SyncForState and add unit tests

Signed-off-by: Wilbert Guo <wilbeguo@amazon.com>

* Format code

Signed-off-by: Wilbert Guo <wilbeguo@amazon.com>

* Add hook for syncForState

Signed-off-by: Wilbert Guo <wilbeguo@amazon.com>

Fix go lint

Signed-off-by: Wilbert Guo <wilbeguo@amazon.com>

Refactor syncForState override implementation

Signed-off-by: Wilbert Guo <wilbeguo@amazon.com>

Add syncForState override func as argument to Update()

Signed-off-by: Wilbert Guo <wilbeguo@amazon.com>

Fix go formatting

Signed-off-by: Wilbert Guo <wilbeguo@amazon.com>

Fix circleci test errors

Signed-off-by: Wilbert Guo <wilbeguo@amazon.com>

Remove overrideFunc as argument to run()

Signed-off-by: Wilbert Guo <wilbeguo@amazon.com>

* remove the syncForState

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* use the override function to decide if need to replace the activeAt or not

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* fix test case

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* fix format

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* Trigger build

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* fixing comments

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* return the result of map of alerts instead of single one

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* upper case the QueryforStateSeries

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* use a more generic rule group post process function type

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* fix indentation

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* fix gofmt

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* fix lint

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* fixing naming

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* fix comments

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* add the lastEvalTimestamp as parameter

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* fmt

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* change funcType to func

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

Co-authored-by: Yijie Qin <qinyijie@amazon.com>
Co-authored-by: Yijie Qin <63399121+qinxx108@users.noreply.github.com>
2022-03-29 02:16:46 +02:00
beorn7 4210aac74a Merge branch 'main' into sparsehistogram 2022-03-22 14:47:42 +01:00
beorn7 79376c1e94 Merge branch 'release-2.33' into beorn7/release 2022-03-08 17:42:49 +01:00
Chris Marchbanks e970acb085
Fix deadlock between adding to queue and getting batch
Do not block when trying to write a batch to the queue. This can cause
appends to lock forever if the only thing reading from the queue needs
the mutex to write. Instead, if batchQueue is full pop the sample that
was just added from the partial batch and return false. The code doing
the appending already handles retries with backoff.

Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2022-03-07 17:15:57 -07:00
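
The non-blocking approach described in the commit above might look roughly like this. It is a sketch under stated assumptions, not the actual queue code: the `queue` type, its fields, and `newQueue` are hypothetical names chosen for illustration. The point is the `select` with a `default` branch, which never blocks while the mutex is held and rolls back the sample that was just added when the channel is full.

```go
package main

import (
	"fmt"
	"sync"
)

// sample and queue are hypothetical types for this sketch.
type sample struct{ value float64 }

type queue struct {
	mtx        sync.Mutex
	batch      []sample      // partial batch being filled
	batchQueue chan []sample // full batches waiting to be sent
	batchSize  int
}

func newQueue(batchSize, queuedBatches int) *queue {
	return &queue{
		batch:      make([]sample, 0, batchSize),
		batchQueue: make(chan []sample, queuedBatches),
		batchSize:  batchSize,
	}
}

// Append adds s to the partial batch. If that fills the batch but
// batchQueue has no room, the sample is removed again and false is
// returned so the caller can retry with backoff. The send never blocks
// while the mutex is held.
func (q *queue) Append(s sample) bool {
	q.mtx.Lock()
	defer q.mtx.Unlock()

	q.batch = append(q.batch, s)
	if len(q.batch) < q.batchSize {
		return true
	}
	select {
	case q.batchQueue <- q.batch:
		q.batch = make([]sample, 0, q.batchSize)
		return true
	default:
		// Queue full: undo the append instead of blocking.
		q.batch = q.batch[:len(q.batch)-1]
		return false
	}
}

func main() {
	q := newQueue(2, 1)
	for i := 1; i <= 4; i++ {
		fmt.Printf("append %d accepted: %v\n", i, q.Append(sample{float64(i)}))
	}
	<-q.batchQueue // a reader frees up room
	fmt.Printf("retry 4 accepted: %v\n", q.Append(sample{4}))
}
```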
Chris Marchbanks afdc1decac
Write a test that reproduces the deadlock
Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2022-03-07 17:15:51 -07:00
Łukasz Mierzwa a4317bf0ec
Run gofumpt on all files (#10392)
* Run gofumpt on all files

Getting golangci-lint errors when building on my laptop, possibly because I have a newer version of gofumpt than the one the code was formatted with.
Run gofumpt -w -extra on all files as it will be needed in the future anyway.

* Update golangci-lint to v1.44.2

v1.44.0 upgraded gofumpt so bumping version in CI will help keep formatting correct for everyone

* Address golangci-lint error

Getting 'error-strings: error strings should not be capitalized or end with punctuation or a newline' from revive here.
Drop new line.

Signed-off-by: Łukasz Mierzwa <l.mierzwa@gmail.com>
2022-03-03 17:21:05 +01:00
DrAuYueng 5a6e26556b
Add an option to use the external labels as selectors for the remote read endpoint (#10254)
* An option to ignore external_labels

Signed-off-by: DrAuYueng <ouyang1204@gmail.com>
2022-02-16 22:12:47 +01:00
Julien Pivotto b0d70557b7
Merge pull request #10285 from prometheus/release-2.33 2022-02-12 00:02:24 +01:00
Chris Marchbanks bfb1500a38
Fix deadlock when stopping a shard (#10279)
If a queue is stopped and one of its shards happens to hit the
batch_send_deadline at the same time, a deadlock can occur: stop
holds the mutex and will not release it until the send is finished, but
the send needs the mutex to retrieve the most recent batch. This is
fixed by using a second mutex just for writing.

In addition, the test I wrote exposed a case where during shutdown a
batch could be sent twice due to concurrent calls to queue.Batch() and
queue.FlushAndShutdown(). Protect these with a mutex as well.

Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2022-02-11 07:07:41 -07:00
Matej Gera 2c61d29b2a
Tracing: Migrate to OpenTelemetry library (#9724)
Signed-off-by: Matej Gera <matejgera@gmail.com>
2022-01-25 11:08:04 +01:00
Eng Zer Jun 3e67654d37
refactor: use T.TempDir() and B.TempDir to create temporary directory
The directory created by `T.TempDir()` and `B.TempDir()` is
automatically removed when the test and all its subtests complete.

Reference: https://pkg.go.dev/testing#T.TempDir
Reference: https://pkg.go.dev/testing#B.TempDir
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
2022-01-22 18:57:30 +08:00
Bryan Boreham 954c0e8020 remote_write: round desired shards up before check
Previously we would reject an increase from 2 to 2.5 as being
within 30%; by rounding up first we see this as an increase from 2 to 3.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2022-01-10 09:57:37 +00:00
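
The effect of rounding before the tolerance check can be illustrated with a short sketch. It assumes the 30% band is applied symmetrically around the current shard count; `withinTolerance`, `nextShardsOld`, and `nextShardsNew` are hypothetical names and not the functions in the queue manager.

```go
package main

import (
	"fmt"
	"math"
)

// tolerance is the assumed 30% band around the current shard count.
const tolerance = 0.3

// withinTolerance reports whether desired lies inside the band
// [current*(1-tolerance), current*(1+tolerance)].
func withinTolerance(current int, desired float64) bool {
	lower := float64(current) * (1 - tolerance)
	upper := float64(current) * (1 + tolerance)
	return lower <= desired && desired <= upper
}

// nextShardsOld checks the fractional value first, then rounds up.
func nextShardsOld(current int, desired float64) int {
	if withinTolerance(current, desired) {
		return current
	}
	return int(math.Ceil(desired))
}

// nextShardsNew rounds up first, then checks the band.
func nextShardsNew(current int, desired float64) int {
	desired = math.Ceil(desired)
	if withinTolerance(current, desired) {
		return current
	}
	return int(desired)
}

func main() {
	fmt.Println("old:", nextShardsOld(2, 2.5)) // 2.5 is inside [1.4, 2.6], so no change
	fmt.Println("new:", nextShardsNew(2, 2.5)) // 3 is outside [1.4, 2.6], so reshard
}
```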
Bryan Boreham 6d01ce8c4d remote_write: shard up more when backlogged
Change the coefficient from 1% to 5%, so instead of targeting to clear
the backlog in 100s we target 20s.

Update unit test to reflect the new behaviour.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2022-01-10 09:57:37 +00:00
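
The arithmetic behind the 100s and 20s figures can be sketched as follows, assuming the coefficient is the fraction of the pending backlog that is added to the send rate per second. `catchUpRate` and `secondsToClear` are hypothetical helpers, and the model deliberately holds the extra rate constant at its initial value.

```go
package main

import "fmt"

// catchUpRate returns the extra samples per second to send on top of
// the incoming rate, as a fraction of the pending backlog.
func catchUpRate(pendingSamples, coefficient float64) float64 {
	return coefficient * pendingSamples
}

// secondsToClear returns how long the backlog takes to drain if the
// extra rate stays at its initial value. pendingSamples must be > 0.
func secondsToClear(pendingSamples, coefficient float64) float64 {
	return pendingSamples / catchUpRate(pendingSamples, coefficient)
}

func main() {
	const pending = 100000.0
	fmt.Printf("coefficient 1%%: %.0f s\n", secondsToClear(pending, 0.01))
	fmt.Printf("coefficient 5%%: %.0f s\n", secondsToClear(pending, 0.05))
}
```

The result is independent of the backlog size: it is always one over the coefficient.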
Bryan Boreham d588b14d9c remote_write: detailed test for shard calculation
Drive the input parameters to `calculateDesiredShards()` very precisely,
to illustrate some questionable behaviour marked with `?!`.

See https://github.com/prometheus/prometheus/issues/9178,
https://github.com/prometheus/prometheus/issues/9207,

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2022-01-10 09:57:37 +00:00
Chris Marchbanks ba03f7fc23
Merge pull request #10102 from prometheus/update-metrics-on-rw-fails
Update sent timestamp when write irrecoverably fails
2022-01-05 10:46:09 -07:00
beorn7 e7592fe353 sparsehistogram: Address two TODOs
Signed-off-by: beorn7 <beorn@grafana.com>
2022-01-04 12:48:59 +01:00
Goutham Veeramachaneni 6696b7a5f0
Don't update metrics on context cancellation
Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>
2022-01-04 10:46:52 +01:00
Chris Marchbanks dfa5cb7462
Merge pull request #10038 from charlesxsh/fix-TestReshardRaceWithStop
add proper exit for loop
2022-01-03 09:02:45 -07:00
Goutham Veeramachaneni 1af81dc5c9
Update sent timestamp when write irrecoverably fails.
We have an alert that fires when prometheus_remote_storage_highest_timestamp_in_seconds - prometheus_remote_storage_queue_highest_sent_timestamp_seconds
becomes too high. But we have an agent that fires this when the remote "rate-limits" the user.

This is because prometheus_remote_storage_queue_highest_sent_timestamp_seconds doesn't get updated
when the remote sends a 429.

I think we should update the metrics, and the change I made makes sense: if a request fails
because of connectivity issues, etc., we will never exit the `sendWriteRequestWithBackoff` function. It only
exits the function when there is a non-recoverable error, like a bad status code, and in that case, I think
the metric needs to be updated.

Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>
2022-01-03 11:13:48 +01:00
Shihao Xia c3e7bfb813 add proper exit for loop
Signed-off-by: Shihao Xia <charlesxsh@hotmail.com>
2021-12-29 23:48:11 -05:00
beorn7 86cc83b13c storage: iterator fixes after merge
Signed-off-by: beorn7 <beorn@grafana.com>
2021-12-18 14:12:01 +01:00
beorn7 64c7bd2b08 Merge branch 'main' into sparsehistogram 2021-12-18 14:04:25 +01:00
Julien Pivotto 27343277fa
Merge release-2.32 forward into main (#10032)
* storage: expose bug in iterators #10027

Signed-off-by: beorn7 <beorn@grafana.com>

* storage: fix bug #10027 in iterators' Seek method

Signed-off-by: beorn7 <beorn@grafana.com>

* Append reporting metrics without limit

If reporting metrics fails due to reaching the limit, this makes the
target appear as UP in the UI, but the metrics are missing.

This commit bypasses that limit for report metrics.

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>

* Remove check against cfg so interval/ timeout are always set (#10023) (#10031)

Signed-off-by: Nicholas Blott <blottn@tcd.ie>

Co-authored-by: Nicholas Blott <blottn@tcd.ie>

* Cut v2.32.1

Signed-off-by: Julius Volz <julius.volz@gmail.com>

* Apply suggestions from code review

Signed-off-by: Julius Volz <julius.volz@gmail.com>

Co-authored-by: Levi Harrison <git@leviharrison.dev>

Co-authored-by: Julien Pivotto <roidelapluie@inuits.eu>
Co-authored-by: Nicholas Blott <blottn@tcd.ie>
Co-authored-by: Julius Volz <julius.volz@gmail.com>
Co-authored-by: Levi Harrison <git@leviharrison.dev>
2021-12-17 23:18:38 +01:00
beorn7 0ede6ae321 storage: fix bug #10027 in iterators' Seek method
Signed-off-by: beorn7 <beorn@grafana.com>
2021-12-16 12:07:35 +01:00
beorn7 b042e29569 storage: expose bug in iterators #10027
Signed-off-by: beorn7 <beorn@grafana.com>
2021-12-16 12:02:15 +01:00
beorn7 6f33ab2b35 Merge branch 'main' into sparsehistogram 2021-12-15 13:49:33 +01:00
Chris Marchbanks 0a8d28ea93
Merge pull request #9934 from bboreham/remote-write-struct
remote-write: buffer struct instead of interface to reduce garbage-collection
2021-12-09 09:17:45 -07:00
Bryan Boreham bd6436605d Review feedback
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2021-12-09 14:40:44 +00:00
Sebastian Rabenhorst d8b8678bd1
Log time series details for out-of-order samples in remote write receiver (#9894)
* Improved out-of-order sample logs in write handler

Signed-off-by: Sebastian Rabenhorst <sebastian.rabenhorst@shopify.com>

sign commit

Signed-off-by: Sebastian Rabenhorst <sebastian.rabenhorst@shopify.com>

Inlined logAppendError

Signed-off-by: Sebastian Rabenhorst <sebastian.rabenhorst@shopify.com>

Update storage/remote/write_handler.go

Co-authored-by: Ganesh Vernekar <15064823+codesome@users.noreply.github.com>

Fixed fmt

Signed-off-by: Sebastian Rabenhorst <sebastian.rabenhorst@shopify.com>

* Improved out-of-order sample logs in write handler

Signed-off-by: Sebastian Rabenhorst <sebastian.rabenhorst@shopify.com>

sign commit

Signed-off-by: Sebastian Rabenhorst <sebastian.rabenhorst@shopify.com>

Inlined logAppendError

Signed-off-by: Sebastian Rabenhorst <sebastian.rabenhorst@shopify.com>
2021-12-08 15:07:51 +00:00
detailyang 3e482c905f
fix:storage:avoid panic when iterator exhausted (#9945)
Signed-off-by: detailyang <detailyang@gmail.com>
2021-12-07 19:50:00 +05:30
Bryan Boreham 50878ebe5e remote-write: buffer struct instead of interface
This reduces the amount of individual objects allocated, allowing sends
to run a bit faster.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2021-12-03 14:30:42 +00:00
Bryan Boreham c478d6477a remote-write: benchmark just sending, on 20 shards
Previously BenchmarkSampleDelivery spent a lot of effort checking each
sample had arrived, so was largely showing the performance of test-only
code.

Increase the number of shards to be more realistic for a large
workload.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2021-12-03 14:02:10 +00:00
Chris Marchbanks e95d4ec3f1
Merge pull request #9830 from prometheus/batch-queues
Batch samples before sending them to channels
2021-12-02 08:37:41 -07:00
Chris Marchbanks c655684142
Subtract from enqueued samples/exemplars upon send
Right now the values for enqueuedSamples and enqueuedExemplars are never
decremented, leading to inflated values for failedSamples/failedExemplars
when a hard shutdown of a shard occurs.

Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2021-11-30 12:54:50 -07:00
Chris Marchbanks 319249f9db
Batch samples before sending them to channels
Channels can cause bottlenecks and tons of context switches when reading
hundreds of thousands of samples per second from a single queue.
Instead, pre-batch the samples to amortize the cost of the concurrency
overhead.

Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2021-11-30 12:54:45 -07:00
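
The idea of amortizing channel overhead can be sketched in a few lines. This is an illustration only: `sendBatched` is a hypothetical function, and samples are plain `float64` values rather than the types used in the remote-write queue. Each channel operation carries a whole slice, so the number of sends drops by roughly the batch size.

```go
package main

import "fmt"

// sendBatched groups samples into slices of at most batchSize and sends
// each slice with a single channel operation. It returns the number of
// channel sends performed. batchSize must be positive.
func sendBatched(samples []float64, batchSize int, out chan<- []float64) int {
	sends := 0
	for start := 0; start < len(samples); start += batchSize {
		end := start + batchSize
		if end > len(samples) {
			end = len(samples)
		}
		out <- samples[start:end]
		sends++
	}
	return sends
}

func main() {
	samples := make([]float64, 10)
	out := make(chan []float64, 8) // buffered so this demo needs no reader goroutine

	sends := sendBatched(samples, 4, out)
	close(out)

	fmt.Printf("%d samples delivered in %d channel sends\n", len(samples), sends)
	for b := range out {
		fmt.Println("batch of", len(b))
	}
}
```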
beorn7 68e02be963 Post-merge fixes
Signed-off-by: beorn7 <beorn@grafana.com>
2021-11-30 17:20:28 +01:00
beorn7 e4e24453fa Merge branch 'main' into beorn7/merge2 2021-11-30 17:19:06 +01:00
Björn Rabenstein 4ce01e9770
storage: Rename ...Values methods to At... (#9889)
This mirrors #9888 for the richer iterators we have with histograms in
the game.

Signed-off-by: beorn7 <beorn@grafana.com>
2021-11-29 16:23:04 +05:30
Björn Rabenstein d677aa4b29
storage: Consolidate iterator method names (Values -> At) (#9888)
`BufferedSeriesIterator` and `MemoizedSeriesIterator` use a method
called `Values` for exactly the purpose for which all other iterators
of the same kind use a method called `At`. That alone is confusing,
but on top of that, the `Values` method only returns a single sample,
not multiple values. I assume the naming has historical reasons. This
commit makes it more consistent. It is now easier to read, and now
`BufferedSeriesIterator` and `MemoizedSeriesIterator` implement
`chunkenc.Iterator` like many other iterators, too.

Signed-off-by: beorn7 <beorn@grafana.com>
2021-11-29 11:16:40 +01:00
Björn Rabenstein b866db009b
storage: Fix and improve the Seek method of various iterators (#9878)
There was a subtle and nasty bug in listSeriesIterator.Seek.

In addition, the Seek call is defined to be a no-op if the current
position of the iterator is already pointing to a suitable
sample. This commit adds fast paths for this case to several
potentially expensive Seek calls.

Another bug was in concreteSeriesIterator.Seek. It always searched the
whole series and not from the current position of the iterator.

Signed-off-by: beorn7 <beorn@grafana.com>
2021-11-29 15:17:56 +05:30
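
The two Seek properties mentioned in the commit above (a no-op when the current sample already qualifies, and searching from the current position rather than from the start) can be sketched on a simplified iterator. `sliceIterator` is a hypothetical type over sorted timestamps and is not one of the iterators touched by the commit.

```go
package main

import (
	"fmt"
	"sort"
)

// sliceIterator is a hypothetical iterator over sorted timestamps.
// cur is -1 until the first call to Seek.
type sliceIterator struct {
	ts  []int64
	cur int
}

// Seek advances to the first timestamp >= t and reports whether one exists.
// It never moves backwards.
func (it *sliceIterator) Seek(t int64) bool {
	if it.cur == -1 {
		it.cur = 0
	}
	if it.cur >= len(it.ts) {
		return false
	}
	// Fast path: the current sample is already suitable, so do nothing.
	if it.ts[it.cur] >= t {
		return true
	}
	// Search only the part of the series after the current position.
	rest := it.ts[it.cur:]
	it.cur += sort.Search(len(rest), func(i int) bool { return rest[i] >= t })
	return it.cur < len(it.ts)
}

// At returns the current timestamp. It is only valid after Seek returned true.
func (it *sliceIterator) At() int64 { return it.ts[it.cur] }

func main() {
	it := &sliceIterator{ts: []int64{10, 20, 30, 40}, cur: -1}
	fmt.Println(it.Seek(25), it.At()) // lands on the first timestamp >= 25
	fmt.Println(it.Seek(5), it.At())  // fast path: position is unchanged
	fmt.Println(it.Seek(41))          // nothing left at or after 41
}
```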
Björn Rabenstein 7e42acd3b1
tsdb: Rework iterators (#9877)
- Pick At... method via return value of Next/Seek.
- Do not clobber returned buckets.
- Add partial FloatHistogram support.

Note that the promql package is now _only_ dealing with
FloatHistograms, following the idea that PromQL only knows float
values.

As a byproduct, I have removed the histogramSeries metric. In my
understanding, series can have both float and histogram samples, so
that metric doesn't make sense anymore.

As another byproduct, I have converged the sampleBuf and the
histogramSampleBuf in memSeries into one. The sample type stored in
the sampleBuf has been extended to also contain histograms even before
this commit.

Signed-off-by: beorn7 <beorn@grafana.com>
2021-11-29 13:24:23 +05:30
Ganesh Vernekar 26c0a433f5
Support appending different sample types to the same series (#9705)
* Support appending different sample types to the same series

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>

* Fix comments

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>

* Fix build

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
2021-11-26 17:43:27 +05:30
Matheus Alcantara e673805d67
storage/remote: use t.TempDir instead of ioutil.TempDir on tests (#9811)
Signed-off-by: Matheus Alcantara <matheusssilv97@gmail.com>
2021-11-19 15:21:45 -05:00