i) Uses the more idiomatic Wrap and Wrapf methods for creating nested errors.
ii) Fixes some incorrect usages of fmt.Errorf where the error messages don't have any formatting directives.
iii) Does away with the use of fmt package for errors in favour of pkg/errors
Signed-off-by: tariqibrahim <tariq181290@gmail.com>
* Display correct values for the retention in the flags web gui.
Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>
* adding a log entry
Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>
* added the retention info to the runtime status page
Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>
* simplify the retention display
Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>
fixes https://github.com/prometheus/prometheus/issues/5213
Now that we have time and size base retention time bases should not have a default value. A default is set only when both - time and size flags are not set.
This change will not affect current installations that rely on the default time based value, and will avoid confusions when only the size retention is set and it is expected that the default time based setting would be no longer in place.
Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>
This change switches the remote_write API to use the TSDB WAL. This should reduce memory usage and prevent sample loss when the remote end point is down.
We use the new LiveReader from TSDB to tail WAL segments. Logic for finding the tracking segment is included in this PR. The WAL is tailed once for each remote_write endpoint specified. Reading from the segment is based on a ticker rather than relying on fsnotify write events, which were found to be complicated and unreliable in early prototypes.
Enqueuing a sample for sending via remote_write can now block, to provide back pressure. Queues are still required to acheive parallelism and batching. We have updated the queue config based on new defaults for queue capacity and pending samples values - much smaller values are now possible. The remote_write resharding code has been updated to prevent deadlocks, and extra tests have been added for these cases.
As part of this change, we attempt to guarantee that samples are not lost; however this initial version doesn't guarantee this across Prometheus restarts or non-retryable errors from the remote end (eg 400s).
This changes also includes the following optimisations:
- only marshal the proto request once, not once per retry
- maintain a single copy of the labels for given series to reduce GC pressure
Other minor tweaks:
- only reshard if we've also successfully sent recently
- add pending samples, latest sent timestamp, WAL events processed metrics
Co-authored-by: Chris Marchbanks <csmarchbanks.com> (initial prototype)
Co-authored-by: Tom Wilkie <tom.wilkie@gmail.com> (sharding changes)
Signed-off-by: Callum Styan <callumstyan@gmail.com>
This no longer waits for all of the scrape reload to complete
before getting a list of AMs again.
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Add flag for size based retention
Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>
* Deprecate the old retention flag for a new one.
Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>
* Add ability to take a suffix for size flag
Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>
* Address feedback
Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>
Add WALSegmentSize as an option, and the corresponding flag "storage.tsdb.wal-segment-size" to tune the max size of wal segment files.
The addressed base problem is to reduce the disk space used by wal segment files : on a raspberry pi, for instance, we often want to reduce write load of the sd card, then, the wal directory is mounted on a memory (space limited) partition.
the default value of the segment max file size, pushed the size of directory to 128 MB for each segment , which is too much ram consumption on a rasp.
the initial discussion is at https://github.com/prometheus/tsdb/pull/450
* update promlog to latest version
Signed-off-by: Alex Yu <yu.alex96@gmail.com>
* Update api tests, fix main setup
Signed-off-by: Alex Yu <yu.alex96@gmail.com>
* tidy go.sum
Signed-off-by: Alex Yu <yu.alex96@gmail.com>
* revendor prometheus/common
Signed-off-by: Alex Yu <yu.alex96@gmail.com>
* only initialize config; use kingpin for remote_storage_adapter
Signed-off-by: Alex Yu <yu.alex96@gmail.com>
* actually parse the flags
Signed-off-by: Alex Yu <yu.alex96@gmail.com>
* clean up imports
Signed-off-by: Alex Yu <yu.alex96@gmail.com>
* web: added ability to set page title through flag.
Signed-off-by: Andrew Chiu <andrew.chiu2@baesystems.com>
* Reformatted variable names and Flag description for readability.
Signed-off-by: Andrew Chiu <andrew.chiu2@baesystems.com>
* assets_vfsdata.go
Signed-off-by: Andrew Chiu <andrew.chiu2@baesystems.com>
* Flag name changed from web.ui-title to web.page-title
Signed-off-by: Andrew Chiu <andrew.chiu2@baesystems.com>
* make assets
Signed-off-by: Andrew Chiu <andrew.chiu2@baesystems.com>
According to the GoDoc for os.Signal [0]:
> Package signal will not block sending to c: the caller must ensure that
> c has sufficient buffer space to keep up with the expected signal rate.
> For a channel used for notification of just one signal value, a buffer
> of size 1 is sufficient.
[0] https://golang.org/pkg/os/signal/#Notify
Signed-off-by: Lucas Serven <lserven@gmail.com>
* Limit the number of samples remote read can return.
- Return 413 entity too large.
- Limit can be set be a flag. Allow 0 to mean no limit.
- Include limit in error message.
- Set default limit to 50M (* 16 bytes = 800MB).
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
To make local debuging with `go run` easyer moved all files into a
dedicate package `runtime`.
This allows running prometheus just by using `go run main.go` instead of
passing mani files like `go run main.go limits_default.go ...`
Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>
Start rule manager only after tsdb and config is loaded.
Stop rule manager before tsdb to avoid writing to closed storage.
Wait for any in-progress reloads to complete before shutting
down rule manager, so that rule manager doesn't get updated after
being shut down.
Remove incorrect comment around shutting down query enginge.
Log when config reload is completed.
Fixes#4133Fixes#4262
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
This adds a parameter to the storage selection interface which allows
query engine(s) to pass information about the operations surrounding a
data selection.
This can for example be used by remote storage backends to infer the
correct downsampling aggregates that need to be provided.
Users are starting to use these mistakenly thinking they'll help
with issues, and thus causing some confusion.
Thus hide them and make it clear that they're only there for testing
reasons.
Currently all read queries are simply pushed to remote read clients.
This is fine, except for remote storage for wich it unefficient and
make query slower even if remote read is unnecessary.
So we need instead to compare the oldest timestamp in primary/local
storage with the query range lower boundary. If the oldest timestamp
is older than the mint parameter, then there is no need for remote read.
This is an optionnal behavior per remote read client.
Signed-off-by: Thibault Chataigner <t.chataigner@criteo.com>
Instead or only printing the help message, which is not always helpful.
For example, when upgrading from prometheus v1, the retention time value
format has changed and now only accepts one unit (e.g. "15d") where it
previously allowed more complex strings (e.g. "360h0m0s").
This commit provides the error message as an explanation for the parsing
failure.