Commit graph

122 commits

Author SHA1 Message Date
Harkishen Singh cd412470d7
Consider status code 429 as a recoverable error to avoid resharding (#8237)
* Consider status code 429 as a recoverable error to avoid resharding.
* Adds support for Retry-After in backoff logic in remote storage.

Signed-off-by: Harkishen-Singh <harkishensingh@hotmail.com>
2021-02-10 15:25:37 -07:00
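
The two behaviours this commit describes (treating HTTP 429 as retryable and honouring a Retry-After hint in the backoff loop) can be pictured with the sketch below. It is an illustrative simplification in Go, not the actual Prometheus remote-write client; the names classify, sendWithBackoff, and recoverableError are assumptions made for the example.

```go
package example

import (
	"fmt"
	"math"
	"net/http"
	"strconv"
	"time"
)

// recoverableError marks errors worth retrying instead of treating the send
// as failed for good; retryAfter carries an optional hint from the server.
type recoverableError struct {
	error
	retryAfter time.Duration
}

// retryAfterDuration parses a Retry-After header given as seconds.
func retryAfterDuration(h string) time.Duration {
	if s, err := strconv.Atoi(h); err == nil && s > 0 {
		return time.Duration(s) * time.Second
	}
	return 0
}

// classify treats 429 and 5xx as recoverable and any other non-2xx as fatal.
func classify(resp *http.Response) error {
	switch {
	case resp.StatusCode/100 == 2:
		return nil
	case resp.StatusCode == http.StatusTooManyRequests || resp.StatusCode/100 == 5:
		return recoverableError{
			fmt.Errorf("server returned %s", resp.Status),
			retryAfterDuration(resp.Header.Get("Retry-After")),
		}
	default:
		return fmt.Errorf("non-recoverable status %s", resp.Status)
	}
}

// sendWithBackoff retries recoverable errors, preferring the server's
// Retry-After hint over plain exponential backoff when one is present.
func sendWithBackoff(send func() error, minBackoff, maxBackoff time.Duration, maxRetries int) error {
	backoff := minBackoff
	for i := 0; i < maxRetries; i++ {
		err := send()
		rerr, ok := err.(recoverableError)
		if err == nil || !ok {
			return err // success, or a non-recoverable error: stop retrying
		}
		sleep := backoff
		if rerr.retryAfter > 0 {
			sleep = rerr.retryAfter // honour the server's hint
		}
		time.Sleep(sleep)
		backoff = time.Duration(math.Min(float64(2*backoff), float64(maxBackoff)))
	}
	return fmt.Errorf("retries exhausted")
}
```
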
fuling 4a407210f5 [remote storage] remove sendWriteRequestWithBackoff() "s" and "req" param
Signed-off-by: fuling <fuling.lgz@alibaba-inc.com>
2021-02-04 21:38:32 +08:00
Chris Marchbanks 275f7e7766
Log recoverable remote write errors as warnings (#8412)
Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2021-01-27 09:38:34 -07:00
kevinForMyself db445844d3
Fix garbage collection of t.droppedSeries in QueueManager.SeriesReset. (#8387)
* Fix memory leak in t.droppedSeries in QueueManager.SeriesReset.

Signed-off-by: kevinForMyself <zise_2001@163.com>

* Fix garbage collection of t.droppedSeries in QueueManager.SeriesReset

Signed-off-by: kevinForMyself <zise_2001@163.com>
2021-01-22 08:03:10 -07:00
gotjosh 4eca4dffb8
Allow metric metadata to be propagated via Remote Write. (#6815)
* Introduce a metadata watcher

Similar to the WAL watcher, its purpose is to observe the scrape manager and pull metadata, then send it to remote storage.

Signed-off-by: gotjosh <josue@grafana.com>

* Additional fixes after rebasing.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Rework samples/metadata metrics.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Use more descriptive variable names in MetadataWatcher collect.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Fix issues caused during rebasing.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Fix missing metric add and unneeded config code.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Address some review comments.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Fix metrics and docs

Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

* Replace assert with require

Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

* Bring back max_samples_per_send metric

Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

* Fix tests

Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

Co-authored-by: Callum Styan <callumstyan@gmail.com>
Co-authored-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
2020-11-19 20:53:03 +05:30
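
As a rough illustration of the architecture described above, a watcher that periodically pulls metadata from the scrape manager and forwards it to remote storage could look like the sketch below. The interfaces and type names here are simplified assumptions, not the actual Prometheus types.

```go
package example

import (
	"context"
	"time"
)

// Metadata is a simplified stand-in for scraped metric metadata.
type Metadata struct {
	Metric, Type, Help, Unit string
}

// scrapeManager and metadataAppender are assumed interfaces for this sketch.
type scrapeManager interface {
	ListMetadata() []Metadata
}

type metadataAppender interface {
	AppendMetadata(context.Context, []Metadata)
}

// MetadataWatcher periodically collects metadata from the scrape manager and
// hands it to remote storage, analogous to how the WAL watcher feeds samples.
type MetadataWatcher struct {
	manager  scrapeManager
	appender metadataAppender
	interval time.Duration
}

func (w *MetadataWatcher) Run(ctx context.Context) {
	ticker := time.NewTicker(w.interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// Collect whatever metadata the scrape targets currently expose
			// and forward it so it can be sent over remote write.
			w.appender.AppendMetadata(ctx, w.manager.ListMetadata())
		}
	}
}
```
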
Jorge Luis Betancourt 4dc755cd27
Add a metric for tracking max_samples_per_send (#8102)
Currently there is no way of tracking the value of the
`max_samples_per_send` configuration option, which is commonly tweaked
when integrating with a remote write backend.

Signed-off-by: Jorge Luis Betancourt Gonzalez <jorge-luis.betancourt@trivago.com>
2020-10-28 11:39:36 +00:00
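
One way such a static configuration value can be surfaced is a gauge set once from the queue configuration, roughly as sketched below; the metric and label names are illustrative rather than the exact ones added by the PR.

```go
package example

import "github.com/prometheus/client_golang/prometheus"

// newMaxSamplesPerSendGauge exposes a static configuration value as a gauge
// so dashboards can see which max_samples_per_send a queue is running with.
// The metric and label names here are assumptions for the example.
func newMaxSamplesPerSendGauge(reg prometheus.Registerer, remoteName, url string, maxSamplesPerSend int) prometheus.Gauge {
	g := prometheus.NewGauge(prometheus.GaugeOpts{
		Namespace:   "prometheus",
		Subsystem:   "remote_storage",
		Name:        "max_samples_per_send",
		Help:        "The maximum number of samples sent per remote write request.",
		ConstLabels: prometheus.Labels{"remote_name": remoteName, "url": url},
	})
	g.Set(float64(maxSamplesPerSend))
	reg.MustRegister(g)
	return g
}
```
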
Julien Pivotto 4e5b1722b3
Move away from testutil, refactor imports (#8087)
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2020-10-22 11:00:08 +02:00
Julien Pivotto 8c9850c2bb
Remote: Do not collect non-initialized timestamp metrics (#8060)
* Remote: Do not collect non-initialized timestamp metrics

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2020-10-15 23:53:59 +02:00
Harkishen Singh 072b9649a3
Refactor vars to avoid test failures in storage/remote with -count > 1 (#7934)
* Refactor global vars to avoid failures when running tests more than once.

Signed-off-by: Harkishen-Singh <harkishensingh@hotmail.com>

* Register highestRecvTimestamp metric.

Signed-off-by: Harkishen-Singh <harkishensingh@hotmail.com>

* Use local interner vars.

Signed-off-by: Harkishen-Singh <harkishensingh@hotmail.com>

* Declare interner in write storage.

Signed-off-by: Harkishen-Singh <harkishensingh@hotmail.com>
2020-09-24 12:44:18 -06:00
Javier Palomo Almena b58a613443
Replace sync/atomic with uber-go/atomic (#7683)
* storage: Replace usage of sync/atomic with uber-go/atomic

Signed-off-by: Javier Palomo <javier.palomo.almena@gmail.com>

* tsdb: Replace usage of sync/atomic with uber-go/atomic

Signed-off-by: Javier Palomo <javier.palomo.almena@gmail.com>

* web: Replace usage of sync/atomic with uber-go/atomic

Signed-off-by: Javier Palomo <javier.palomo.almena@gmail.com>

* notifier: Replace usage of sync/atomic with uber-go/atomic

Signed-off-by: Javier Palomo <javier.palomo.almena@gmail.com>

* cmd: Replace usage of sync/atomic with uber-go/atomic

Signed-off-by: Javier Palomo <javier.palomo.almena@gmail.com>

* scripts: Verify that we are not using restricted packages

It checks that we are not directly importing 'sync/atomic'.

Signed-off-by: Javier Palomo <javier.palomo.almena@gmail.com>

* Reorganise imports in blocks

Signed-off-by: Javier Palomo <javier.palomo.almena@gmail.com>

* notifier/test: Apply PR suggestions

Signed-off-by: Javier Palomo <javier.palomo.almena@gmail.com>

* storage/remote: avoid storing references on newEntry

Signed-off-by: Javier Palomo <javier.palomo.almena@gmail.com>

* Revert "scripts: Verify that we are not using restricted packages"

This reverts commit 278d32748e.

Signed-off-by: Javier Palomo <javier.palomo.almena@gmail.com>

* web: Group imports accordingly

Signed-off-by: Javier Palomo <javier.palomo.almena@gmail.com>
2020-07-30 13:15:42 +05:30
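
For context, go.uber.org/atomic wraps each value in a type whose only access is through atomic methods, which removes the risk of accidentally mixing atomic and plain loads that raw sync/atomic allows. A minimal sketch of the pattern:

```go
package example

import "go.uber.org/atomic"

// lastSendTimestamp is only ever touched through atomic methods, so there is
// no way to read or write it non-atomically by accident.
var lastSendTimestamp = atomic.NewInt64(0)

func recordSend(unixSeconds int64) {
	lastSendTimestamp.Store(unixSeconds)
}

func secondsSinceLastSend(now int64) int64 {
	return now - lastSendTimestamp.Load()
}
```
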
Joe Elliott 04b028f1e6
Exports recoverable error (#7689)
Signed-off-by: Joe Elliott <number101010@gmail.com>
2020-07-29 11:08:25 -06:00
Ganesh Vernekar a4c2ea1ca3
Merge remote-tracking branch 'upstream/master' into merge-release-2.19
Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
2020-06-26 14:33:50 +05:30
Chris Marchbanks b299aba6cf
Fix panic when updating a remote write queue (#7452)
Right now Queue Manager metrics are registered when the metrics struct
is created, which happens before a changed queue is shut down and the old
metrics are unregistered. In the case of named queues or updates to
external labels, applying the config will panic due to duplicate metrics.

Instead, register the metrics as part of starting the queue as we always
guarantee that Stop will be called before a new Start.

Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2020-06-26 12:03:52 +05:30
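
A hedged sketch of the lifecycle this commit describes: metrics are created per queue but only registered from Start and unregistered from Stop, so replacing a queue can never double-register. The struct and metric names below are simplifications, not the real queueManagerMetrics.

```go
package example

import "github.com/prometheus/client_golang/prometheus"

type queueMetrics struct {
	reg          prometheus.Registerer
	samplesTotal prometheus.Counter
}

func newQueueMetrics(reg prometheus.Registerer, remoteName, url string) *queueMetrics {
	return &queueMetrics{
		reg: reg,
		samplesTotal: prometheus.NewCounter(prometheus.CounterOpts{
			Name:        "remote_storage_samples_total", // illustrative name
			Help:        "Samples sent to remote storage.",
			ConstLabels: prometheus.Labels{"remote_name": remoteName, "url": url},
		}),
	}
}

// register is called from the queue's Start; since Stop is always called
// before a replacement queue's Start, the collector is never registered twice.
func (m *queueMetrics) register() {
	m.reg.MustRegister(m.samplesTotal)
}

// unregister is called from the queue's Stop.
func (m *queueMetrics) unregister() {
	m.reg.Unregister(m.samplesTotal)
}
```
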
Chris Marchbanks d78656c244
Pending Samples metric includes samples in channel (#7335)
* Pending Samples metric includes samples in channel

The pending samples metric should also include samples waiting in the
channels to be sent to provide a more accurate measure. In addition,
make sure that the pending samples is reset to 0 anytime a queue is
started as we remake all of the shards at that time.

Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>

* Log the number of dropped samples on hard shutdown

Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2020-06-25 14:48:30 -06:00
Bartlomiej Plotka b788986717
storage: Fully adjusted storage layer support for chunk iterators: remote read client, readyStorage, fanout. (#7059)
* Fixed nits introduced by https://github.com/prometheus/prometheus/pull/7334
* Added ChunkQueryable implementation to fanout and readyStorage.
* Added more comments.
* Changed NewVerticalChunkSeriesMerger to CompactingChunkSeriesMerger; removed a tiny interface by reusing VerticalSeriesMergeFunc as the overlapping algorithm for
both chunks and series, for both querying and compacting (!), and made sure duplicates are merged.
* Added ErrChunkSeriesSet
* Added Samples interface for seamless []prompb.Sample to []tsdbutil.Sample conversion.
* Deprecated the non-chunk, SeriesSet-based StreamChunkedReadResponses and added a chunk-based one.
* Improved tests.
* Split remote client into Write (old storage) and read.
* Queryable client is now SampleAndChunkQueryable. Since we cannot use the nice QueryableFunc, I moved
all config-based options to sampleAndChunkQueryableClient to avoid boilerplate.

In next commit: Changes for TSDB.

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
2020-06-24 14:41:52 +01:00
Bert Hartmann 82c7cd320a
increase the remote write bucket range (#7323)
* increase the remote write bucket range

Increase the range of remote write buckets to capture times above 10s for laggy scenarios
Buckets had been: {.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10}
Buckets are now: {0.03125, 0.0625, 0.125, 0.25, 0.5, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512}

Signed-off-by: Bert Hartmann <berthartm@gmail.com>

* revert back to DefBuckets with addons to be backwards compatible

Signed-off-by: Bert Hartmann <berthartm@gmail.com>

* shuffle the buckets to maintain 2-2.5x increases

Signed-off-by: Bert Hartmann <berthartm@gmail.com>
2020-06-04 13:54:47 -06:00
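
The layout the commit arrives at ("DefBuckets with addons") can be expressed as below, assuming a send-duration histogram; prometheus.DefBuckets tops out at 10s, and the appended values (illustrative here) extend the range for laggy remote endpoints.

```go
package example

import "github.com/prometheus/client_golang/prometheus"

// sentBatchDuration tracks how long remote write requests take. DefBuckets
// tops out at 10s, so a few larger buckets are appended to keep resolution
// when the remote endpoint is slow; the extra values here are illustrative.
var sentBatchDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "remote_storage_sent_batch_duration_seconds", // illustrative name
	Help:    "Duration of send calls to remote storage.",
	Buckets: append(prometheus.DefBuckets, 25, 60, 120, 360),
})
```
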
Cody Boggs 3268eac2dd
Trace Remote Write requests (#7206)
* Trace Remote Write requests

Signed-off-by: Cody Boggs <cboggs@splunk.com>

* Refactor store attempts to keep code flow clearer, and avoid so many places to deal with span finishing

Signed-off-by: Cody Boggs <cboggs@splunk.com>
2020-06-01 09:21:13 -06:00
Chris Marchbanks dfad1da296
Remove duplicate metrics in QueueManager
Right now any new metrics added for remote write need to be added to
both the QueueManager struct, and the queueManagerMetrics struct.
Instead, use the queueManagerMetrics struct directly from QueueManager.

The newQueueManagerMetrics constructor will now create the metrics for a
specific queue with name and endpoint pre-populated, and a new copy of
the struct will be created specifically for each queue.

This also fixes a bug where prometheus_remote_storage_sent_bytes_total
is not being unregistered after a queue is changed.

Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2020-05-05 14:13:59 -06:00
qinng f36ae1c21c
[remote-storage] use warn log level when sending samples to remote fails (#7184)
[remote] increasing sendbatch error log level

Signed-off-by: guoruyi1 <guoruyi1@xiaomi.com>
Co-authored-by: guoruyi1 <guoruyi1@xiaomi.com>
2020-04-30 17:06:22 -06:00
Marek Slabicki 4b5e7d4984
Adding a shouldReshard function to modularize the logic for the QueueManager deciding whether it should reshard (#7143)
Signed-off-by: Marek Slabicki <thaniri@gmail.com>
2020-04-20 16:20:39 -06:00
Chris Marchbanks cd12f0873c
Merge pull request #7073 from csmarchbanks/fix-md5-remote-write
Fix remote write not updating when relabel configs or secrets change
2020-04-16 16:36:25 -06:00
Chris Marchbanks 5ab6b043c1
Always update lastSendTimestamp after a request (#7122)
If the server is returning non-recoverable errors, such as if we are
trying to push samples that are too old, remote write will never
reshard. Non-recoverable errors should be treated the same as success
for the purpose of resharding, just as we do with sample rates and
durations.

Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2020-04-15 09:03:28 -06:00
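
Together with the shouldReshard helper from a few commits above, the mechanism described here might be sketched like this: the last-send timestamp advances on success and on non-recoverable errors, and resharding is skipped while nothing has completed recently. Names and the exact condition are simplified assumptions, not the real QueueManager code.

```go
package example

import (
	"time"

	"go.uber.org/atomic"
)

type shardManager struct {
	lastSendTimestamp *atomic.Int64 // unix seconds of the last completed request
	shardUpdateDur    time.Duration
}

// markSendAttempt is called after every request. Non-recoverable errors (for
// example samples that are too old) count the same as success: the server is
// reachable, so they must not block resharding forever.
func (s *shardManager) markSendAttempt(recoverableErr bool) {
	if !recoverableErr {
		s.lastSendTimestamp.Store(time.Now().Unix())
	}
}

// shouldReshard refuses to change the shard count while requests keep failing
// with recoverable errors, since adding shards then only multiplies retries.
func (s *shardManager) shouldReshard(desired, current int) bool {
	if desired == current {
		return false
	}
	lastSend := s.lastSendTimestamp.Load()
	return time.Now().Unix()-lastSend <= int64(s.shardUpdateDur.Seconds())
}
```
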
Chris Marchbanks d88a2b0261 Handle secret changes in remote write ApplyConfig
Remake the http client whenever ApplyConfig is called. This allows
secrets to be updated without needing to restart an otherwise unchanged
queue.

Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2020-04-13 23:14:15 +00:00
Marek Slabicki 8224ddec23
Capitalizing first letter of all log lines (#7043)
Signed-off-by: Marek Slabicki <thaniri@gmail.com>
2020-04-11 09:22:18 +01:00
Callum Styan f802f1e8ca
Fix bug with WAL watcher and Live Reader metrics usage. (#6998)
* Fix bug with WAL watcher and Live Reader metrics usage.

Calling NewXMetrics when creating a Watcher or LiveReader results in a
registration error, which we're ignoring; as a result, other than the
first Watcher/Reader created, we had no metrics for either. So we would
only have metrics like Watcher Records Read for the first remote write
config in a user's config file.

Signed-off-by: Callum Styan <callumstyan@gmail.com>
2020-03-20 17:34:15 +01:00
helenxu1221 7df4fe3faa reset counter after collecting metric (#6798)
Signed-off-by: HelenXu <helenxu@Helens-MacBook-Pro.local>
2020-02-09 20:51:21 -07:00
Robert Fratto a53e00f9fd
pass registerer from storage to queue manager for its metrics (#6728)
* pass registerer from storage to queue manager for its metrics

Signed-off-by: Robert Fratto <robert.fratto@grafana.com>
2020-02-03 13:47:03 -08:00
Chris Marchbanks 7f3aca62c4 Only reduce the number of shards when caught up.
Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2020-01-06 14:53:23 -07:00
Chris Marchbanks 9e24e1f9e8 Use samplesPending rather than integral
The integral accumulator in the remote write sharding code is just a
second way of keeping track of the number of samples pending. Remove
integralAccumulator and use the samplesPending value we already
calculate to calculate the number of shards.

This has the added benefit of fixing a bug where the integralAccumulator
was not being initialized correctly due to not taking into account the
number of ticks being counted, causing the integralAccumulator initial
value to be off by an order of magnitude in some cases.

Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2020-01-06 14:53:23 -07:00
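
Very roughly, the pending-samples-based calculation can be pictured as below: size the shard count so that observed per-shard throughput covers the inbound rate plus draining the backlog. This is a simplified sketch, not the exact Prometheus formula.

```go
package example

// desiredShards estimates how many shards are needed so that the outgoing
// rate keeps up with the incoming rate and also drains the current backlog
// within the given horizon (in seconds). Simplified from the real calculation.
func desiredShards(samplesInRate, samplesOutRate, samplesPending float64, currentShards int, horizonSeconds float64) int {
	if samplesOutRate <= 0 {
		return currentShards
	}
	if currentShards < 1 {
		currentShards = 1
	}
	// Per-shard throughput observed so far.
	perShardRate := samplesOutRate / float64(currentShards)
	// Target rate: keep up with ingestion and work off the backlog.
	targetRate := samplesInRate + samplesPending/horizonSeconds
	n := int(targetRate/perShardRate + 0.5)
	if n < 1 {
		n = 1
	}
	return n
}
```
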
Callum Styan 67838643ee
Add config option for remote job name (#6043)
* Track remote write queues via a map so we don't care about index.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Support a job name for remote write/read so we can differentiate between
them using the name.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Remote write/read has Name to not confuse the meaning of the field with
scrape job names.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Split queue/client label into remote_name and url labels.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Don't allow for duplicate remote write/read configs.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Ensure we restart remote write queues if the hash of their config has
not changed, but the remote name has changed.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Include name in remote read/write config hashes, simplify duplicates
check, update test accordingly.

Signed-off-by: Callum Styan <callumstyan@gmail.com>
2019-12-12 12:47:23 -08:00
Chris Marchbanks 6f34e35b3e
Record the exact value of desired shards in metric
It is possible for desired shards to stay slightly higher than the
number of shards (by less than 30%); exporting desired shards as the
raw number makes it easy to tell if a Prometheus is in that situation.

Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2019-11-26 06:26:03 -07:00
Chris Marchbanks 0e684ca205
Fix unknown type in sharding up log
Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2019-11-26 06:22:56 -07:00
Callum Styan c2cb1e4103 Add a metric to track total bytes sent per remote write queue. (#6344)
Signed-off-by: Callum Styan <callumstyan@gmail.com>
2019-11-25 13:25:18 -07:00
Tom Wilkie de0a772b8e Port tsdb to use pkg/labels. (#6326)
* Port tsdb to use pkg/labels.

Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>

* Get tests passing.

Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>

* Remove useless cast.

Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>

* Appease linters.

Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>

* Fix review comments

Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
2019-11-18 11:53:33 -08:00
Callum Styan 5f1be2cf45 Refactor calculateDesiredShards + don't reshard if we're having issues sending samples. (#6111)
* Refactor calculateDesiredShards + don't reshard if we're having issues
sending samples.
* Track lastSendTimestamp via an int64 with atomic add/load, add a test
for reshard calculation.
* Simplify conditional for skipping resharding, add samplesIn/Out to shard
testcase struct.

Signed-off-by: Callum Styan <callumstyan@gmail.com>
2019-10-21 15:54:25 -06:00
Chris Marchbanks 8df4bca470
Garbage collect asynchronously in the WAL Watcher
The WAL Watcher replays a checkpoint after it is created in order to
garbage collect series that no longer exist in the WAL. Currently the
garbage collection process is done serially with reading from the tip of
the WAL which can cause large delays in writing samples to remote
storage just after compaction occurs.

This also fixes a memory leak where dropped series are not cleaned up as
part of the SeriesReset process.

Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2019-10-07 14:36:10 -06:00
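
Moving the checkpoint replay off the hot path might look roughly like the sketch below: the garbage-collection pass runs in its own goroutine so tailing the tip of the WAL is never blocked. The types and method names are placeholders, not the actual watcher code.

```go
package example

import (
	"log"

	"go.uber.org/atomic"
)

type walWatcher struct {
	gcRunning atomic.Bool // true while a checkpoint is being replayed for GC
}

// garbageCollectSeries replays a checkpoint to find series that no longer
// exist in the WAL so their cached labels can be released.
func (w *walWatcher) garbageCollectSeries(checkpointIndex int) error {
	// ... read the checkpoint and call SeriesReset on the queue manager ...
	return nil
}

// onCheckpoint starts garbage collection in the background so that reading
// the tip of the WAL (and therefore remote write) is not delayed right after
// a compaction creates a checkpoint.
func (w *walWatcher) onCheckpoint(checkpointIndex int) {
	if !w.gcRunning.CAS(false, true) {
		return // a previous GC is still running; skip this checkpoint
	}
	go func() {
		defer w.gcRunning.Store(false)
		if err := w.garbageCollectSeries(checkpointIndex); err != nil {
			log.Println("checkpoint garbage collection failed:", err)
		}
	}()
}
```
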
Callum Styan 3344bb5c33 Move WAL watcher code to tsdb/wal package. (#5999)
* Move WAL watcher code to tsdb/wal package.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Fix tests after moving WAL watcher code.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Lint fixes.

Signed-off-by: Callum Styan <callumstyan@gmail.com>
2019-09-19 14:45:41 +05:30
Björn Rabenstein 3b3eaf3496
Merge pull request #5787 from cstyan/reshard-max-logging
Add metrics for max/min/desired shards to queue manager.
2019-09-09 22:32:54 +02:00
Chris Marchbanks 791a2409a2
Pre-allocate pendingSamples to reduce allocations
Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2019-09-03 15:41:47 -06:00
Chris Marchbanks 160186da18
Store labels.Labels instead of []prompb.Label
This uses half the steady-state memory required by prompb.Label.

Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2019-09-03 15:41:46 -06:00
Bartek Płotka 32be514845
Merge pull request #5805 from codesome/merge-tsdb
Merge tsdb into prometheus
2019-08-13 11:39:41 +01:00
Chris Marchbanks a6a55c433c Improve desired shards calculation (#5763)
The desired shards calculation now properly keeps track of the rate of
pending samples, and uses the previously unused integralAccumulator to
adjust for missing information in the desired shards calculation.

Also, configure more capacity for each shard. The default capacity of 10
causes shards to block on each other while sending remote requests.
Default to a capacity of 500 samples and explain in the documentation
that having more capacity will help throughput.

Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2019-08-13 10:10:21 +01:00
Ganesh Vernekar 5ecef3542d
Cleanup after merging tsdb into prometheus
Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
2019-08-13 14:04:14 +05:30
ethan 38ccf0157e cleanup: correct func name in log message (#5852)
Signed-off-by: Guangming Wang <guangming.wang@daocloud.io>
2019-08-10 16:24:58 +01:00
Callum Styan c40a83f386 Add metrics for max shards, min shards, and desired shards.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
2019-08-04 20:04:19 -07:00
Xigang Wang 445bcd1251 Update the runShard method and change len(pendingSamples) to n=len(pendingSamples) (#5708)
Signed-off-by: xigang <wangxigang2014@gmail.com>
2019-07-09 19:09:11 +01:00
Chris Marchbanks 06bdaf076f Remote Write Allocs Improvements (#5614)
* Add benchmark for sample delivery
* Simplify StoreSeries to have only one loop
* Reduce allocations for pending samples in runShard
* Only allocate one send slice per segment
* Cache a buffer in each shard for snappy to use
* Remove queue manager seriesMtx

  It is not possible for any of the places protected by the seriesMtx to
  be called concurrently so it is safe to remove. By removing the mutex we
  can simplify the Append code to one loop.

Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
2019-06-27 19:48:21 +01:00
Callum Styan 3639d51eb6 Remote Storage: string interner should not panic in release (#5487)
* Don't panic if we try to release a string that is not in the interner.

* Move seriesMtx locking in QueueManager's StoreSeries function.

This stops us from calling release for strings that aren't interned if
there's a race between reading a checkpoint and storing new series
labels, which could happen during checkpointing or reloading config.

Signed-off-by: Callum Styan <callumstyan@gmail.com>
2019-04-24 10:46:31 +01:00
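
A hedged sketch of an interner whose release tolerates strings that were never interned (the panic this change removes); the reference counting and naming are simplified compared to the real implementation.

```go
package example

import "sync"

// pool interns label strings so duplicate values share one reference count.
type pool struct {
	mtx  sync.Mutex
	refs map[string]int
}

func newPool() *pool {
	return &pool{refs: map[string]int{}}
}

func (p *pool) intern(s string) string {
	p.mtx.Lock()
	defer p.mtx.Unlock()
	p.refs[s]++
	return s
}

// release decrements the reference count. If the string was never interned
// (for example due to a race between reading a checkpoint and storing new
// series labels), it simply returns instead of panicking.
func (p *pool) release(s string) {
	p.mtx.Lock()
	defer p.mtx.Unlock()
	n, ok := p.refs[s]
	if !ok {
		return // unknown string: nothing to release
	}
	if n <= 1 {
		delete(p.refs, s)
		return
	}
	p.refs[s] = n - 1
}
```
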
Callum Styan e87449b59d Remote Write: Queue Manager specific metrics shouldn't exist if the queue no longer exists (#5445)
* Unregister remote write queue manager specific metrics when stopping the
queue manager.

* Use DeleteLabelValues instead of Unregister to remove queue and watcher
related metrics when we stop them. Create those metrics in the structs'
start functions rather than in their constructors because of the
ordering of creation, start, and stop in remote storage ApplyConfig.

* Add setMetrics function to WAL watcher so we can set
the watcher's metrics in its Start function, but not
have to call Start in some tests (which causes a data race).

Signed-off-by: Callum Styan <callumstyan@gmail.com>
2019-04-23 09:49:17 +01:00
Vasily Sliouniaev 5be9a1426f Prevent reshard concurrent with calling stop (#5460)
* Prevent reshard concurrent with calling stop

Signed-off-by: Vasily <v.sliouniaev@gmail.com>
2019-04-16 11:25:19 +01:00