Commit graph

66 commits

Author SHA1 Message Date
Callum Styan b8106dd459 Review feedback:
- Add a dropped samples EWMA and use it in calculating desired shards.
- Update metric names and a log messages.
- Limit number of entries in the dedupe logging middleware to prevent potential OOM.

Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
2019-02-28 08:38:39 -08:00
Tom Wilkie f795942572 Decrement pending sample when queue exits.
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
2019-02-28 08:38:39 -08:00
Tom Wilkie efbd9559f4 Deal with corruptions in the WAL:
- If we're replaying the WAL to get series records, skip that segment when we hit corruptions.
- If we're tailing the WAL for samples, fail the watcher.
- When the watcher fails, restart from the latest checkpoint - and only send new samples by updating startTime.
- Tidy up log lines and error handling, don't return so many errors on quiting.
- Expect EOF when processing checkpoints.

Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
2019-02-28 08:38:39 -08:00
Tom Wilkie d6f911b511 Factor out logging ratelimit & dedupe middleware.
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
2019-02-28 08:38:39 -08:00
Tom Wilkie 37ad4db485 Export timestamps in seconds since epoch.
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
2019-02-28 08:38:39 -08:00
JoeWrightss 362873f72b Fix .Log() error message (#5257)
Signed-off-by: zhoulin xie <zhoulin.xie@daocloud.io>
2019-02-22 14:39:37 +00:00
Callum Styan 37e35f9e0c Various improvements to WAL based remote write.
- Use the queue name in WAL watcher logging.
- Don't return from watch if the reader error was EOF.
- Fix sample timestamp check logic regarding what samples we send.
- Refactor so we don't need readToEnd/readSeriesRecords
- Fix wal_watcher tests since readToEnd no longer exists

Signed-off-by: Callum Styan <callumstyan@gmail.com>
2019-02-12 11:39:13 +00:00
Tom Wilkie b93bafeee1 Various fixes to locking & shutdown for WAL-based remote write.
- Remove datarace in the exported highest scrape timestamp.
- Backoff on enqueue should be per-sample - reset the result for each sample.
- Remove diffKeys, unused ctx and cancelfunc in WALWatcher, 'name' from writeTo interface, and pass it to constructor.
- Reorder functions in WALWatcher depth-first according to call graph.
- Fix vendor/modules.txt.
- Split out the various timer periods into consts at the top of the file.
- Move w.currentSegmentMetric.Set close to where we set the currentSegment.
- Combine r.Next() and isClosed(w.quit) into a single loop.
- Unnest some ifs in WALWatcher.watch, propagate erros in decodeRecord, add some new lines to make it easier to read.
- Reorganise checkpoint handling to reduce nesting and make it easier to follow.

Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
2019-02-12 11:39:13 +00:00
Callum Styan 6f69e31398 Tail the TSDB WAL for remote_write
This change switches the remote_write API to use the TSDB WAL.  This should reduce memory usage and prevent sample loss when the remote end point is down.

We use the new LiveReader from TSDB to tail WAL segments.  Logic for finding the tracking segment is included in this PR.  The WAL is tailed once for each remote_write endpoint specified. Reading from the segment is based on a ticker rather than relying on fsnotify write events, which were found to be complicated and unreliable in early prototypes.

Enqueuing a sample for sending via remote_write can now block, to provide back pressure.  Queues are still required to acheive parallelism and batching.  We have updated the queue config based on new defaults for queue capacity and pending samples values - much smaller values are now possible.  The remote_write resharding code has been updated to prevent deadlocks, and extra tests have been added for these cases.

As part of this change, we attempt to guarantee that samples are not lost; however this initial version doesn't guarantee this across Prometheus restarts or non-retryable errors from the remote end (eg 400s).

This changes also includes the following optimisations:
- only marshal the proto request once, not once per retry
- maintain a single copy of the labels for given series to reduce GC pressure

Other minor tweaks:
- only reshard if we've also successfully sent recently
- add pending samples, latest sent timestamp, WAL events processed metrics

Co-authored-by: Chris Marchbanks <csmarchbanks.com> (initial prototype)
Co-authored-by: Tom Wilkie <tom.wilkie@gmail.com> (sharding changes)
Signed-off-by: Callum Styan <callumstyan@gmail.com>
2019-02-12 11:39:13 +00:00
Simon Pasquier f678e27eb6
*: use latest release of staticcheck (#5057)
* *: use latest release of staticcheck

It also fixes a couple of things in the code flagged by the additional
checks.

Signed-off-by: Simon Pasquier <spasquie@redhat.com>

* Use official release of staticcheck

Also run 'go list' before staticcheck to avoid failures when downloading packages.

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2019-01-04 14:47:38 +01:00
Bartek Płotka 62c8337e77 Moved configuration into relabel package. (#4955)
Adapted top dir relabel to use pkg relabel structs.

Removal of this in a separate tracked here: https://github.com/prometheus/prometheus/issues/3647

Signed-off-by: Bartek Plotka <bwplotka@gmail.com>
2018-12-18 11:26:36 +00:00
Ryota Arai 135d580ab2 Introduce min_shards for remote write to set minimum number of shards. (#4924)
Signed-off-by: Ryota Arai <ryota.arai@gmail.com>
2018-12-04 17:32:14 +00:00
Ben Kochie c6399296dc
Fix spelling/typos (#4921)
* Fix spelling/typos

Fix spelling/typos reported by codespell/misspell.
* UK -> US spelling changes.

Signed-off-by: Ben Kochie <superq@gmail.com>
2018-11-27 17:44:29 +01:00
fengyuceNv 94fff219ab improve remote storage enqueue performance (#4772)
Signed-off-by: fyc <fyc22788@ly.com>
2018-11-13 12:19:05 +00:00
Daisy T 7d01ead689 change time.duration to model.duration for standardization (#4479)
Signed-off-by: Daisy T <daisyts@gmx.com>
2018-08-24 16:55:21 +02:00
Goutham Veeramachaneni c28cc5076c Saner defaults and metrics for remote-write (#4279)
* Rename queueCapacity to shardCapacity
* Saner defaults for remote write
* Reduce allocs on retries

Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2018-07-18 05:15:16 +01:00
Bryan Boreham 3277aeefaa Add queue name to logger for remote writes
More than one remote_write destination can be configured, in which
case it's essential to know which one each log message refers to.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2018-06-01 13:04:00 +00:00
Tom Wilkie b58199bf12 Review feedback.
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
2018-05-29 11:35:43 +01:00
Tom Wilkie 3353bbd018 Add proper unclean shutdown handling with a cancellable context.
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
2018-05-29 09:51:29 +01:00
Tom Wilkie e51d6c4b6c Make remote flush deadline a command line param.
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
2018-05-23 15:06:01 +01:00
Tom Wilkie a6c353613a Make the flush deadline configurable.
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
2018-05-23 15:04:36 +01:00
Tom Wilkie aa17263edd Remove WaitGroup and extra goroutine.
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
2018-05-23 15:04:34 +01:00
Tom Wilkie f3c61f8bb2 Only give remote queues 1 minute to flush samples on shutdown.
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
2018-05-23 15:04:32 +01:00
Mario Trangoni 464e747f1e fix some comments typos (#4059) 2018-04-08 10:51:54 +01:00
Tom Wilkie 22d820ef8e Review feedback. 2018-03-12 14:27:48 +00:00
Tom Wilkie f8c9d375b6 Correctly stop the timer used in the remote write path. 2018-03-09 12:00:26 +00:00
ferhat elmas ffa673f7d8 General simplifications (#3887)
Another try as in #1516
2018-02-26 07:58:10 +00:00
Bryan Boreham 8a4535e6ad Re-use timer instead of creating new ones on every sample
The docs for `time.After()` note that "The underlying Timer is not
recovered by the garbage collector until the timer fires".
2018-01-24 12:36:29 +00:00
Tom Wilkie 6e4d4ea402 Initialise some counters in remote storage API. 2017-10-26 11:09:45 +01:00
Tom Wilkie ee011d906d Port remote read server to 2.0. 2017-10-26 11:09:14 +01:00
Tom Wilkie 8fe0212ff7 Port 'Make queue manager configurable.' to 2.0, see #2991 2017-10-26 11:08:33 +01:00
Brian Brazil 73dc96e7f5 Fix leak of ticker in remote storage queue manager. 2017-10-09 19:44:03 +01:00
Fabian Reinartz d21f149745 *: migrate to go-kit/log 2017-09-08 22:01:51 +05:30
Tom Wilkie ec999ff397 Prevent number of remote write shards from going negative.
This can happen in the situation where the system scales up the number of shards massively (to deal with some backlog), then scales it down again as the number of samples sent during the time period is less than the number received.
2017-07-19 16:32:09 +01:00
Tom Wilkie 2dda5775e3 Initial port of remote storage to v2. 2017-07-12 12:27:57 +01:00
Fabian Reinartz 8ffc851147 Merge branch 'master' into dev-2.0 2017-04-04 15:17:56 +02:00
Tom Wilkie 75bb0f3253 Review feedback 2017-03-13 21:24:49 +00:00
Tom Wilkie 9d22f030cf Dynamically reshard the QueueManager based on observed load. 2017-03-13 14:41:16 +00:00
Tom Wilkie 1ab893c6ec Limit 'discarding sample' logs to 1 every 10s (#2446)
* Limit 'discarding sample' logs to 1 every 10s

* Include the vendored library

* Review feedback
2017-02-23 19:20:39 +01:00
Julius Volz 2f39dbc8b3 Rename StorageQueueManager -> QueueManager 2017-02-21 21:45:43 +01:00
Julius Volz e9476b35d5 Re-add multiple remote writers
Each remote write endpoint gets its own set of relabeling rules.

This is based on the (yet-to-be-merged)
https://github.com/prometheus/prometheus/pull/2419, which removes legacy
remote write implementations.
2017-02-20 13:23:12 +01:00
Tom Wilkie 4520e12440 Add HTTP Basic Auth & TLS support to the generic write path. (#1957)
* Add config, HTTP Basic Auth and TLS support to the generic write path.

- Move generic write path configuration to the config file
- Factor out config.TLSConfig -> tlf.Config translation
- Support TLSConfig for generic remote storage
- Rename Run to Start, and make it non-blocking.
- Dedupe code in httputil for TLS config.
- Make remote queue metrics global.
2016-09-19 22:47:51 +02:00
Tom Wilkie a6931b71e8 Rationalise retrieval metrics so we have the state (success/failed) on both samples and batches, in a consistent fashion.
Also, report total queue capacity of all queues, i.e. capacity * shards.
2016-08-30 17:42:42 +02:00
Tom Wilkie ece12bff93 Shard/parrallelise samples by fingerprint in StorageQueueManager
By splitting the single queue into multiple queues and flushing each individual queue in serially (and all queues in parallel), we can guarantee to preserve the order of timestampsin samples sent to downstream systems.
2016-08-30 17:42:36 +02:00
beorn7 064b57858e Consistently use the Seconds() method for conversion of durations
This also fixes one remaining case of recording integral numbers
of seconds only for a metric, i.e. this will probably fix #1796.
2016-07-07 15:24:35 +02:00
Dmitry Vorobev bd2a770015 storage/remote: Spawn not more than "maxConcurrentSends" goroutines. 2016-05-19 16:15:04 +02:00
Fabian Reinartz 59f1e722df Return error on sample appending 2016-02-02 14:01:44 +01:00
beorn7 ec08c9a391 Rework the way to communicate backpressure (AKA suspended ingestion)
This gives up on the idea to communicate throuh the Append() call (by
either not returning as it is now or returning an error as
suggested/explored elsewhere). Here I have added a Throttled() call,
which has the advantage that it can be called before a whole _batch_
of Append()'s. Scrapes will happen completely or not at all. Same for
rule group evaluations. That's a highly desired behavior (as discussed
elsewhere). The code is even simpler now as the whole ingestion buffer
could be removed.

Logging of throttled mode has been streamlined and will create at most
one message per minute.
2016-02-01 14:45:44 +01:00
Fabian Reinartz e3b6ec9784 Switch to common/log 2015-10-03 10:21:43 +02:00
Julius Volz 5f77fce578 Improve remote storage queue manager metrics. 2015-09-16 17:20:23 +02:00