Commit graph

1245 commits

Author SHA1 Message Date
Jesus Vazquez e934d0f011 Merge 'main' into sparsehistogram
Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>
2022-10-05 22:14:49 +02:00
Bryan Boreham 8120af22e2 benchmarks: SetBytes takes bytes per operation
Where the code was multiplying bytes by number of operations, this
resulted in absurdly high throughput numbers.

Also, in `BenchmarkParse()`, don't run the `expfmt` case twice.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2022-10-02 22:11:33 +02:00
Bryan Boreham 3330d85ba8
Replace sort.Strings and sort.Ints with faster slices.Sort (#11318)
Use new experimental package `golang.org/x/exp/slices`.

slices.Sort works on values that are directly comparable, like ints,
so avoids the overhad of an interface call to `.Less()`.

Left tests unchanged, because they don't need the speed and it may be
a cross-check that slices.Sort gives the same answer.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2022-09-30 20:03:56 +05:30
Jesus Vazquez c1b669bf9b
Add out-of-order sample support to the TSDB (#11075)
* Introduce out-of-order TSDB support

This implementation is based on this design doc:
https://docs.google.com/document/d/1Kppm7qL9C-BJB1j6yb6-9ObG3AbdZnFUBYPNNWwDBYM/edit?usp=sharing

This commit adds support to accept out-of-order ("OOO") sample into the TSDB
up to a configurable time allowance. If OOO is enabled, overlapping querying
are automatically enabled.

Most of the additions have been borrowed from
https://github.com/grafana/mimir-prometheus/
Here is the list ist of the original commits cherry picked
from mimir-prometheus into this branch:
- 4b2198d7ec
- 2836e5513f
- 00b379c3a5
- ff0dc75758
- a632c73352
- c6f3d4ab33
- 5e8406a1d4
- abde1e0ba1
- e70e769889
- df59320886

Co-authored-by: Jesus Vazquez <jesus.vazquez@grafana.com>
Co-authored-by: Ganesh Vernekar <ganeshvern@gmail.com>
Co-authored-by: Dieter Plaetinck <dieter@grafana.com>
Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>

* gofumpt files

Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>

* Add license header to missing files

Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>

* Fix OOO tests due to existing chunk disk mapper implementation

Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>

* Fix truncate int overflow

Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>

* Add Sync method to the WAL and update tests

Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>

* remove useless sync

Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>

* Update minOOOTime after truncating Head

* Update minOOOTime after truncating Head

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>

* Fix lint

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>

* Add a unit test

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>

* Load OutOfOrderTimeWindow only once per appender

Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>

* Fix OOO Head LabelValues and PostingsForMatchers

Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>

* Fix replay of OOO mmap chunks

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>

* Remove unnecessary err check

Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>

* Prevent panic with ApplyConfig

Signed-off-by: Ganesh Vernekar 15064823+codesome@users.noreply.github.com
Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>

* Run OOO compaction after restart if there is OOO data from WBL

Signed-off-by: Ganesh Vernekar 15064823+codesome@users.noreply.github.com
Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>

* Apply Bartek's suggestions

Co-authored-by: Bartlomiej Plotka <bwplotka@gmail.com>
Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>

* Refactor OOO compaction

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>

* Address comments and TODOs

- Added a comment explaining why we need the allow overlapping
  compaction toggle
- Clarified TSDBConfig OutOfOrderTimeWindow doc
- Added an owner to all the TODOs in the code

Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>

* Run go format

Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>

* Fix remaining review comments

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>

* Fix tests

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>

* Change wbl reference when truncating ooo in TestHeadMinOOOTimeUpdate

Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>

* Fix TestWBLAndMmapReplay test failure on windows

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>

* Address most of the feedback

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>

* Refactor the block meta for out of order

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>

* Fix windows error

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>

* Fix review comments

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>

Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
Signed-off-by: Ganesh Vernekar 15064823+codesome@users.noreply.github.com
Co-authored-by: Ganesh Vernekar <15064823+codesome@users.noreply.github.com>
Co-authored-by: Ganesh Vernekar <ganeshvern@gmail.com>
Co-authored-by: Dieter Plaetinck <dieter@grafana.com>
Co-authored-by: Oleg Zaytsev <mail@olegzaytsev.com>
Co-authored-by: Bartlomiej Plotka <bwplotka@gmail.com>
2022-09-20 22:35:50 +05:30
Ganesh Vernekar d354f20c2a
Add a feature flag to control native histogram ingestion (#11253)
* Add runtime config to control native histogram ingestion

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>

* Make the config into a CLI flag

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
2022-09-14 17:38:34 +05:30
Bryan Boreham 3029320ce6 storage/remote: in tests use labels.FromStrings
And a few cases of `EmptyLabels()`.
Replacing code which assumes the internal structure of `Labels`.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2022-09-09 13:34:49 +02:00
Bryan Boreham ad1c257899 storage: in tests use labels.FromStrings
Replacing code which assumes the internal structure of `Labels`.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2022-09-09 13:34:49 +02:00
Ganesh Vernekar f540c1dbd3
Add support for histograms in WAL checkpointing (#11210)
* Add support for histograms in WAL checkpointing

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>

* Fix review comments

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>

* Fix tests

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
2022-08-29 17:38:36 +05:30
Ganesh Vernekar 0f4e5196c4
Implement vertical compaction for native histograms (#11184)
* Implement vertical compaction for native histograms

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>

* Fix typo

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
2022-08-22 19:04:39 +05:30
beorn7 c9fd3c235d Merge branch 'main' into sparsehistogram 2022-08-10 17:54:37 +02:00
Levi Harrison 77a7af4461
Add histogram validation (#11052)
* Add histogram validation

Signed-off-by: Levi Harrison <git@leviharrison.dev>

* Correct negative offset validation

Signed-off-by: Levi Harrison <git@leviharrison.dev>

* Address review comments

Signed-off-by: Levi Harrison <git@leviharrison.dev>

* Validation benchmark

Signed-off-by: Levi Harrison <git@leviharrison.dev>

* Add more checks

Signed-off-by: Levi Harrison <git@leviharrison.dev>

* Attempt to fix tests

Signed-off-by: Levi Harrison <git@leviharrison.dev>

* Fix stuff

Signed-off-by: Levi Harrison <git@leviharrison.dev>
2022-07-29 09:52:49 -05:00
Levi Harrison 0db6b072bc
Export histogramToHistogramProto() (#11046)
Signed-off-by: Levi Harrison <git@leviharrison.dev>
2022-07-21 10:12:50 -04:00
Paschalis Tsilias d1122e0743
Introduce TSDB changes for appending metadata to the WAL (#10972)
* Append metadata to the WAL

Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>

* Remove extra whitespace; Reword some docstrings and comments

Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>

* Use RLock() for hasNewMetadata check

Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>

* Use single byte for metric type in RefMetadata

Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>

* Update proposed WAL format for single-byte type metadata

Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>

* Implementa MetadataAppender interface for the Agent

Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>

* Address first round of review comments

Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>

* Amend description of metadata in wal.md

Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>

* Correct key used to retrieve metadata from cache

When we're setting metadata entries in the scrapeCace, we're using the
p.Help(), p.Unit(), p.Type() helpers, which retrieve the series name and
use it as the cache key. When checking for cache entries though, we used
p.Series() as the key, which included the metric name _with_ its labels.
That meant that we were never actually hitting the cache. We're fixing
this by utiling the __name__ internal label for correctly getting the
cache entries after they've been set by setHelp(), setType() or
setUnit().

Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>

* Put feature behind a feature flag

Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>

* Fix AppendMetadata docstring

Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>

* Reorder WAL format document

Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>

* Change error message of AppendMetadata; Fix access of s.meta in AppendMetadata

Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>

* Reuse temporary buffer in Metadata encoder

Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>

* Only keep latest metadata for each refID during checkpointing

Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>

* Fix test that's referencing decoding metadata

Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>

* Avoid creating metadata block if no new metadata are present

Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>

* Add tests for corrupt metadata block and relevant record type

Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>

* Fix CR comments

Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>

* Extract logic about changing metadata in an anonymous function

Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>

* Implement new proposed WAL format and amend relevant tests

Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>

* Use 'const' for metadata field names

Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>

* Apply metadata to head memSeries in Commit, not in AppendMetadata

Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>

* Add docstring and rename extracted helper in scrape.go

Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>

* Add tests for tsdb-related cases

Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>

* Fix linter issues vol1

Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>

* Fix linter issues vol2

Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>

* Fix Windows test by closing WAL reader files

Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>

* Use switch instead of two if statements in metadata decoding

Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>

* Fix review comments around TestMetadata* tests

Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>

* Add code for replaying WAL; test correctness of in-memory data after a replay

Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>

* Remove scrape-loop related code from PR

Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>

* Address first round of comments

Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>

* Simplify tests by sorting slices before comparison

Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>

* Fix test to use separate transactions

Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>

* Empty out buffer and record slices after encoding latest metadata

Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>

* Fix linting issue

Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>

* Update calculation for DroppedMetadata metric

Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>

* Rename MetadataAppender interface and AppendMetadata method to MetadataUpdater/UpdateMetadata

Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>

* Reuse buffer when encoding latest metadata for each series

Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>

* Fix review comments; Check all returned error values using two helpers

Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>

* Simplify use of helpers

Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>

* Satisfy linter

Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
2022-07-19 10:58:52 +02:00
beorn7 87351f2318 prompb: Modify layout of histograms
Note: This is deliberately an incompatible change. Since we have never
used histograms in remote read/write yet, there is no point in keeping
compatibility. This _is_, however, compatible to the state in the main
branch.

This commit flattens the bucket message into top-level fields. This
has the disadvantage of now having two triples of fields prefixed with
`negative_...` or `positive_...`. However, with this structure, we
save one tag on the wire. And, perhaps more importantly, we mirror the
structure of the `histogram.Histogram` Go type.

This commit also adjusts `repeated` fields to use names in the plural
form, as it is also the case for the fields that already existed.

This also adds a doc comment to `HistogramProtoToHistogram` and
changes its return type to a pointer (which is more convenient and
probably more efficient).

Signed-off-by: beorn7 <beorn@grafana.com>
2022-07-14 17:47:17 +02:00
Levi Harrison 08f3ddb864
Sparse histogram remote-write support (#11001) 2022-07-14 09:13:12 -04:00
beorn7 28f028e938 Merge branch 'main' into sparsehistogram 2022-07-12 19:07:13 +02:00
Matthieu MOREL d56d0a9d52
(storage): move from github.com/pkg/errors to 'errors' and 'fmt' (#10946)
Signed-off-by: Matthieu MOREL <mmorel-35@users.noreply.github.com>

Co-authored-by: Matthieu MOREL <mmorel-35@users.noreply.github.com>
2022-07-01 18:59:50 +02:00
Leonardo Zamariola 3326df42bb
Removing global state modification on unit tests (fix #10033 #10034) (#10935)
* Removing global state modification on unit tests (fix #10033 #10034)

The config.DefaultRemoteReadConfig and config.DefaultRemoteWriteConfig
instances hold global state. Unit tests were changing their url.URL reference
globally causing false positives when tests were ran through package.
Two helper functions were created to copy those global values instead of changing
them in place to fix null point when running unit tests by method instead of
by package.

Signed-off-by: Leonardo Zamariola <leonardo.zamariola@gmail.com>

* Fixing pull request suggestions

Copying by value from default config

Signed-off-by: Leonardo Zamariola <leonardo.zamariola@gmail.com>
2022-06-30 10:20:16 -06:00
beorn7 40ad5e284a Merge branch 'main' into beorn7/sparsehistogram 2022-06-09 20:50:30 +02:00
Matej Gera 1dd247f68b
Remote Write: Rename confusing walDir parameter to dir (#10464)
* Rename walDir parameter to dir

Signed-off-by: Matej Gera <matejgera@gmail.com>

* Improve NewQueueManager comment

Signed-off-by: Matej Gera <matejgera@gmail.com>
2022-05-30 21:45:30 -07:00
Bryan Boreham 4b9f248e85
unit tests: make all Labels sorted alphabetically (#10532)
"Labels is a sorted set of labels. Order has to be guaranteed upon
instantiation." says the comment, so fix all the tests that break this
rule.

For `BenchmarkLabelValuesWithMatchers()` and
`BenchmarkHeadLabelValuesWithMatchers()` the amount of work done changes
significantly if you put the labels in order, because all series refs
get neatly partitioned by the `tens` label, so I renamed the labels
to maintain the previous behaviour.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2022-05-04 23:41:36 +02:00
beorn7 3bc711e333 Merge branch 'main' into sparsehistogram 2022-05-04 13:37:13 +02:00
Matthieu MOREL e2ede285a2
refactor: move from io/ioutil to io and os packages (#10528)
* refactor: move from io/ioutil to io and os packages
* use fs.DirEntry instead of os.FileInfo after os.ReadDir

Signed-off-by: MOREL Matthieu <matthieu.morel@cnp.fr>
2022-04-27 11:24:36 +02:00
Chris Marchbanks a11e73edda
Fix a deadlock between Batch and FlushAndShutdown (#10608)
If FlushAndShutdown is called with a full batchQueue, and then Batch is
called rather than the normal path of reading from a queue a deadlock
might be encountered. Rather than having FlushAndShutdown having
blocking code while holding a lock retry sending the batch every second.

Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2022-04-20 20:50:41 +02:00
beorn7 7ee1836ef5 Merge branch 'main' into sparsehistogram 2022-04-05 18:31:19 +02:00
Wilbert Guo 83a2e52bc2
Add SyncForState Implementation for Ruler HA (#10070)
* continuously syncing activeAt for alerts

Signed-off-by: Yijie Qin <qinyijie@amazon.com>
Signed-off-by: Wilbert Guo <wilbeguo@amazon.com>

* add import

Signed-off-by: Yijie Qin <qinyijie@amazon.com>
Signed-off-by: Wilbert Guo <wilbeguo@amazon.com>

* Refactor SyncForState and add unit tests

Signed-off-by: Wilbert Guo <wilbeguo@amazon.com>

* Format code

Signed-off-by: Wilbert Guo <wilbeguo@amazon.com>

* Add hook for syncForState

Signed-off-by: Wilbert Guo <wilbeguo@amazon.com>

Fix go lint

Signed-off-by: Wilbert Guo <wilbeguo@amazon.com>

Refactor syncForState override implementation

Signed-off-by: Wilbert Guo <wilbeguo@amazon.com>

Add syncForState override func as argument to Update()

Signed-off-by: Wilbert Guo <wilbeguo@amazon.com>

Fix go formatting

Signed-off-by: Wilbert Guo <wilbeguo@amazon.com>

Fix circleci test errors

Signed-off-by: Wilbert Guo <wilbeguo@amazon.com>

Remove overrideFunc as argument to run()

Signed-off-by: Wilbert Guo <wilbeguo@amazon.com>

* remove the syncForState

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* use the override function to decide if need to replace the activeAt or not

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* fix test case

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* fix format

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* Trigger build

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* fixing comments

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* return the result of map of alerts instead of single one

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* upper case the QueryforStateSeries

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* use a more generic rule group post process function type

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* fix indentation

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* fix gofmt

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* fix lint

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* fixing naming

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* fix comments

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* add the lastEvalTimestamp as parameter

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* fmt

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* change funcType to func

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

Co-authored-by: Yijie Qin <qinyijie@amazon.com>
Co-authored-by: Yijie Qin <63399121+qinxx108@users.noreply.github.com>
2022-03-29 02:16:46 +02:00
beorn7 4210aac74a Merge branch 'main' into sparsehistogram 2022-03-22 14:47:42 +01:00
beorn7 79376c1e94 Merge branch 'release-2.33' into beorn7/release 2022-03-08 17:42:49 +01:00
Chris Marchbanks e970acb085
Fix deadlock between adding to queue and getting batch
Do not block when trying to write a batch to the queue. This can cause
appends to lock forever if the only thing reading from the queue needs
the mutex to write. Instead, if batchQueue is full pop the sample that
was just added from the partial batch and return false. The code doing
the appending already handles retries with backoff.

Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2022-03-07 17:15:57 -07:00
Chris Marchbanks afdc1decac
Write a test that reproduces the deadlock
Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2022-03-07 17:15:51 -07:00
Łukasz Mierzwa a4317bf0ec
Run gofumpt on all files (#10392)
* Run gofumpt on all files

Getting golangci-lint errors when building on my laptop, possibly because I have newer version of gofumpt then what it was formatted with.
Run gofumpt -w -extra on all files as it will be needed in the future anyway.

* Update golangci-lint to v1.44.2

v1.44.0 upgraded gofumpt so bumping version in CI will help keep formatting correct for everyone

* Address golangci-lint error

Getting 'error-strings: error strings should not be capitalized or end with punctuation or a newline' from revive here.
Drop new line.

Signed-off-by: Łukasz Mierzwa <l.mierzwa@gmail.com>
2022-03-03 17:21:05 +01:00
DrAuYueng 5a6e26556b
Add an option to use the external labels as selectors for the remote read endpoint (#10254)
* An option to ignore external_labels

Signed-off-by: DrAuYueng <ouyang1204@gmail.com>
2022-02-16 22:12:47 +01:00
Julien Pivotto b0d70557b7
Merge pull request #10285 from prometheus/release-2.33 2022-02-12 00:02:24 +01:00
Chris Marchbanks bfb1500a38
Fix deadlock when stopping a shard (#10279)
If a queue is stopped and one of its shards happens to hit the
batch_send_deadline at the same time a deadlock can occur where stop
holds the mutex and will not release it until the send is finished, but
the send needs the mutex to retrieve the most recent batch. This is
fixed by using a second mutex just for writing.

In addition, the test I wrote exposed a case where during shutdown a
batch could be sent twice due to concurrent calls to queue.Batch() and
queue.FlushAndShutdown(). Protect these with a mutex as well.

Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2022-02-11 07:07:41 -07:00
Matej Gera 2c61d29b2a
Tracing: Migrate to OpenTelemetry library (#9724)
Signed-off-by: Matej Gera <matejgera@gmail.com>
2022-01-25 11:08:04 +01:00
Eng Zer Jun 3e67654d37
refactor: use T.TempDir() and B.TempDir to create temporary directory
The directory created by `T.TempDir()` and `B.TempDir()` is
automatically removed when the test and all its subtests complete.

Reference: https://pkg.go.dev/testing#T.TempDir
Reference: https://pkg.go.dev/testing#B.TempDir
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
2022-01-22 18:57:30 +08:00
Bryan Boreham 954c0e8020 remote_write: round desired shards up before check
Previously we would reject an increase from 2 to 2.5 as being
within 30%; by rounding up first we see this as an increase from 2 to 3.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2022-01-10 09:57:37 +00:00
Bryan Boreham 6d01ce8c4d remote_write: shard up more when backlogged
Change the coefficient from 1% to 5%, so instead of targetting to clear
the backlog in 100s we target 20s.

Update unit test to reflect the new behaviour.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2022-01-10 09:57:37 +00:00
Bryan Boreham d588b14d9c remote_write: detailed test for shard calculation
Drive the input parameters to `calculateDesiredShards()` very precisely,
to illustrate some questionable behaviour marked with `?!`.

See https://github.com/prometheus/prometheus/issues/9178,
https://github.com/prometheus/prometheus/issues/9207,

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2022-01-10 09:57:37 +00:00
Chris Marchbanks ba03f7fc23
Merge pull request #10102 from prometheus/update-metrics-on-rw-fails
Update sent timestamp when write irrecoverably fails
2022-01-05 10:46:09 -07:00
beorn7 e7592fe353 sparsehistogram: Address two TODOs
Signed-off-by: beorn7 <beorn@grafana.com>
2022-01-04 12:48:59 +01:00
Goutham Veeramachaneni 6696b7a5f0
Don't update metrics on context cancellation
Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>
2022-01-04 10:46:52 +01:00
Chris Marchbanks dfa5cb7462
Merge pull request #10038 from charlesxsh/fix-TestReshardRaceWithStop
add proper exit for loop
2022-01-03 09:02:45 -07:00
Goutham Veeramachaneni 1af81dc5c9
Update sent timestamp when write irrecoverably fails.
We have an alert that fires when prometheus_remote_storage_highest_timestamp_in_seconds - prometheus_remote_storage_queue_highest_sent_timestamp_seconds
becomes too high. But we have an agent that fires this when the remote "rate-limits" the user.

This is because prometheus_remote_storage_queue_highest_sent_timestamp_seconds doesn't get updated
when the remote sends a 429.

I think we should update the metrics, and the change I made makes sense. Because if the requests fails
because of connectivity issues, etc. we will never exit the `sendWriteRequestWithBackoff` function. It only
exits the function when there is a non-recoverable error, like a bad status code, and in that case, I think
the metric needs to be updated.

Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>
2022-01-03 11:13:48 +01:00
Shihao Xia c3e7bfb813 add proper exit for loop
Signed-off-by: Shihao Xia <charlesxsh@hotmail.com>
2021-12-29 23:48:11 -05:00
beorn7 86cc83b13c storage: iterator fixes after merge
Signed-off-by: beorn7 <beorn@grafana.com>
2021-12-18 14:12:01 +01:00
beorn7 64c7bd2b08 Merge branch 'main' into sparsehistogram 2021-12-18 14:04:25 +01:00
Julien Pivotto 27343277fa
Merge release-2.32 forward into main (#10032)
* storage: expose bug in iterators #10027

Signed-off-by: beorn7 <beorn@grafana.com>

* storage: fix bug #10027 in iterators' Seek method

Signed-off-by: beorn7 <beorn@grafana.com>

* Append reporting metrics without limit

If reporting metrics fails due to reaching the limit, this makes the
target appear as UP in the UI, but the metrics are missing.

This commit bypasses that limit for report metrics.

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>

* Remove check against cfg so interval/ timeout are always set (#10023) (#10031)

Signed-off-by: Nicholas Blott <blottn@tcd.ie>

Co-authored-by: Nicholas Blott <blottn@tcd.ie>

* Cut v2.32.1

Signed-off-by: Julius Volz <julius.volz@gmail.com>

* Apply suggestions from code review

Signed-off-by: Julius Volz <julius.volz@gmail.com>

Co-authored-by: Levi Harrison <git@leviharrison.dev>

Co-authored-by: Julien Pivotto <roidelapluie@inuits.eu>
Co-authored-by: Nicholas Blott <blottn@tcd.ie>
Co-authored-by: Julius Volz <julius.volz@gmail.com>
Co-authored-by: Levi Harrison <git@leviharrison.dev>
2021-12-17 23:18:38 +01:00
beorn7 0ede6ae321 storage: fix bug #10027 in iterators' Seek method
Signed-off-by: beorn7 <beorn@grafana.com>
2021-12-16 12:07:35 +01:00
beorn7 b042e29569 storage: expose bug in iterators #10027
Signed-off-by: beorn7 <beorn@grafana.com>
2021-12-16 12:02:15 +01:00