i) Uses the more idiomatic Wrap and Wrapf methods for creating nested errors.
ii) Fixes some incorrect usages of fmt.Errorf where the error messages don't have any formatting directives.
iii) Does away with the use of fmt package for errors in favour of pkg/errors
Signed-off-by: tariqibrahim <tariq181290@gmail.com>
* Display correct values for the retention in the flags web gui.
Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>
* adding a log entry
Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>
* added the retention info to the runtime status page
Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>
* simplify the retention display
Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>
fixes https://github.com/prometheus/prometheus/issues/5213
Now that we have time and size base retention time bases should not have a default value. A default is set only when both - time and size flags are not set.
This change will not affect current installations that rely on the default time based value, and will avoid confusions when only the size retention is set and it is expected that the default time based setting would be no longer in place.
Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>
This change switches the remote_write API to use the TSDB WAL. This should reduce memory usage and prevent sample loss when the remote end point is down.
We use the new LiveReader from TSDB to tail WAL segments. Logic for finding the tracking segment is included in this PR. The WAL is tailed once for each remote_write endpoint specified. Reading from the segment is based on a ticker rather than relying on fsnotify write events, which were found to be complicated and unreliable in early prototypes.
Enqueuing a sample for sending via remote_write can now block, to provide back pressure. Queues are still required to acheive parallelism and batching. We have updated the queue config based on new defaults for queue capacity and pending samples values - much smaller values are now possible. The remote_write resharding code has been updated to prevent deadlocks, and extra tests have been added for these cases.
As part of this change, we attempt to guarantee that samples are not lost; however this initial version doesn't guarantee this across Prometheus restarts or non-retryable errors from the remote end (eg 400s).
This changes also includes the following optimisations:
- only marshal the proto request once, not once per retry
- maintain a single copy of the labels for given series to reduce GC pressure
Other minor tweaks:
- only reshard if we've also successfully sent recently
- add pending samples, latest sent timestamp, WAL events processed metrics
Co-authored-by: Chris Marchbanks <csmarchbanks.com> (initial prototype)
Co-authored-by: Tom Wilkie <tom.wilkie@gmail.com> (sharding changes)
Signed-off-by: Callum Styan <callumstyan@gmail.com>
This no longer waits for all of the scrape reload to complete
before getting a list of AMs again.
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Add flag for size based retention
Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>
* Deprecate the old retention flag for a new one.
Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>
* Add ability to take a suffix for size flag
Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>
* Address feedback
Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>
Add WALSegmentSize as an option, and the corresponding flag "storage.tsdb.wal-segment-size" to tune the max size of wal segment files.
The addressed base problem is to reduce the disk space used by wal segment files : on a raspberry pi, for instance, we often want to reduce write load of the sd card, then, the wal directory is mounted on a memory (space limited) partition.
the default value of the segment max file size, pushed the size of directory to 128 MB for each segment , which is too much ram consumption on a rasp.
the initial discussion is at https://github.com/prometheus/tsdb/pull/450
* update promlog to latest version
Signed-off-by: Alex Yu <yu.alex96@gmail.com>
* Update api tests, fix main setup
Signed-off-by: Alex Yu <yu.alex96@gmail.com>
* tidy go.sum
Signed-off-by: Alex Yu <yu.alex96@gmail.com>
* revendor prometheus/common
Signed-off-by: Alex Yu <yu.alex96@gmail.com>
* only initialize config; use kingpin for remote_storage_adapter
Signed-off-by: Alex Yu <yu.alex96@gmail.com>
* actually parse the flags
Signed-off-by: Alex Yu <yu.alex96@gmail.com>
* clean up imports
Signed-off-by: Alex Yu <yu.alex96@gmail.com>
* web: added ability to set page title through flag.
Signed-off-by: Andrew Chiu <andrew.chiu2@baesystems.com>
* Reformatted variable names and Flag description for readability.
Signed-off-by: Andrew Chiu <andrew.chiu2@baesystems.com>
* assets_vfsdata.go
Signed-off-by: Andrew Chiu <andrew.chiu2@baesystems.com>
* Flag name changed from web.ui-title to web.page-title
Signed-off-by: Andrew Chiu <andrew.chiu2@baesystems.com>
* make assets
Signed-off-by: Andrew Chiu <andrew.chiu2@baesystems.com>
According to the GoDoc for os.Signal [0]:
> Package signal will not block sending to c: the caller must ensure that
> c has sufficient buffer space to keep up with the expected signal rate.
> For a channel used for notification of just one signal value, a buffer
> of size 1 is sufficient.
[0] https://golang.org/pkg/os/signal/#Notify
Signed-off-by: Lucas Serven <lserven@gmail.com>
* Limit the number of samples remote read can return.
- Return 413 entity too large.
- Limit can be set be a flag. Allow 0 to mean no limit.
- Include limit in error message.
- Set default limit to 50M (* 16 bytes = 800MB).
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
To make local debuging with `go run` easyer moved all files into a
dedicate package `runtime`.
This allows running prometheus just by using `go run main.go` instead of
passing mani files like `go run main.go limits_default.go ...`
Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>
Start rule manager only after tsdb and config is loaded.
Stop rule manager before tsdb to avoid writing to closed storage.
Wait for any in-progress reloads to complete before shutting
down rule manager, so that rule manager doesn't get updated after
being shut down.
Remove incorrect comment around shutting down query enginge.
Log when config reload is completed.
Fixes#4133Fixes#4262
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
This adds a parameter to the storage selection interface which allows
query engine(s) to pass information about the operations surrounding a
data selection.
This can for example be used by remote storage backends to infer the
correct downsampling aggregates that need to be provided.
Users are starting to use these mistakenly thinking they'll help
with issues, and thus causing some confusion.
Thus hide them and make it clear that they're only there for testing
reasons.
Currently all read queries are simply pushed to remote read clients.
This is fine, except for remote storage for wich it unefficient and
make query slower even if remote read is unnecessary.
So we need instead to compare the oldest timestamp in primary/local
storage with the query range lower boundary. If the oldest timestamp
is older than the mint parameter, then there is no need for remote read.
This is an optionnal behavior per remote read client.
Signed-off-by: Thibault Chataigner <t.chataigner@criteo.com>
Instead or only printing the help message, which is not always helpful.
For example, when upgrading from prometheus v1, the retention time value
format has changed and now only accepts one unit (e.g. "15d") where it
previously allowed more complex strings (e.g. "360h0m0s").
This commit provides the error message as an explanation for the parsing
failure.
Whenever a route prefix is applied, the router prepends the prefix to
the URL path on the request. For most handlers, this is not an issue
because the request's path is only used for routing and is not actually
needed by the handler itself. However, Prometheus delegates the handling
of the /debug/* endpoints to the http.DefaultServeMux which has it's own
routing logic that depends on the url.Path. As a result, whenever a
prefix is applied, the prefixed URL is passed to the DefaultServeMux
which has no awareness of the prefix and returns a 404.
This change fixes the issue by creating a new serveDebug handler which
routes requests /debug/* requests to appropriate net/http/pprof handler
and removing the net/http/pprof import in cmd/prometheus since it is no
longer necessary.
Fixes#2183.
* Print uname on prom startup
* Make uname file linux-only
* Add missing license headers
Add missing license headers
* Print OS when uname is not available
* Print only OS name when uname not available
* Remove extra space, fix cmd/prometheus/main.go license header
* Add fix for int8 and uint8 systems
* Better formatting for build tags in cmd/prometheus/uname files
* Remove newline
Rationale: The default value for GOGC is 100, i.e. a garbage collected
is initialized once as many heap space has been allocated as was in
use after the last GC was done. This ratio doesn't make a lot of sense
in Prometheus, as typically about 60% of the heap is allocated for
long-lived memory chunks (most of which are around for many hours if
not days). Thus, short-lived heap objects are accumulated for quite
some time until they finally match the large amount of memory used by
bulk memory chunks and a gigantic GC cyle is invoked. With GOGC=40, we
are essentially reinstating "normal" GC behavior by acknowledging that
about 60% of the heap are used for long-term bulk storage.
The median Prometheus production server at SoundCloud runs a GC cycle
every 90 seconds. With GOGC=40, a GC cycle is run every 35 seconds
(which is still not very often). However, the effective RAM usage is
now reduced by about 30%. If settings are updated to utilize more RAM,
the time between GC cycles goes up again (as the heap size is larger
with more long-lived memory chunks, but the frequency of creating
short-lived heap objects does not change). On a quite busy large
Prometheus server, the timing changed from one GC run every 20s to one
GC run every 12s.
In the former case (just changing GOGC, leave everything else as it
is), the CPU usage increases by about 10% (on a mid-size referenc
server from 8.1 to 8.9). If settings are adjusted, the CPU
consumptions increases more drastically (from 8 cores to 13 cores on a
large reference server), despite GCs happening more rarely, presumably
because a 50% larger set of memory chunks is managed now. Having more
memory chunks is good in many regards, and most servers are running
out of memory long before they run out of CPU cycles, so the tradeoff
is overwhelmingly positive in most cases.
Power users can still set the GOGC environment variable as usual, as
the implementation in this commit honors an explicitly set variable.
This removes legacy support for specific remote storage systems in favor
of only offering the generic remote write protocol. An example bridge
application that translates from the generic protocol to each of those
legacy backends is still provided at:
documentation/examples/remote_storage/remote_storage_bridge
See also https://github.com/prometheus/prometheus/issues/10
The next step in the plan is to re-add support for multiple remote
storages.
* Add config, HTTP Basic Auth and TLS support to the generic write path.
- Move generic write path configuration to the config file
- Factor out config.TLSConfig -> tlf.Config translation
- Support TLSConfig for generic remote storage
- Rename Run to Start, and make it non-blocking.
- Dedupe code in httputil for TLS config.
- Make remote queue metrics global.
This is based on https://github.com/prometheus/prometheus/pull/1997.
This adds contexts to the relevant Storage methods and already passes
PromQL's new per-query context into the storage's query methods.
The immediate motivation supporting multi-tenancy in Frankenstein, but
this could also be used by Prometheus's normal local storage to support
cancellations and timeouts at some point.
For Weaveworks' Frankenstein, we need to support multitenancy. In
Frankenstein, we initially solved this without modifying the promql
package at all: we constructed a new promql.Engine for every
query and injected a storage implementation into that engine which would
be primed to only collect data for a given user.
This is problematic to upstream, however. Prometheus assumes that there
is only one engine: the query concurrency gate is part of the engine,
and the engine contains one central cancellable context to shut down all
queries. Also, creating a new engine for every query seems like overkill.
Thus, we want to be able to pass per-query contexts into a single engine.
This change gets rid of the promql.Engine's built-in base context and
allows passing in a per-query context instead. Central cancellation of
all queries is still possible by deriving all passed-in contexts from
one central one, but this is now the responsibility of the caller. The
central query context is now created in main() and passed into the
relevant components (web handler / API, rule manager).
In a next step, the per-query context would have to be passed to the
storage implementation, so that the storage can implement multi-tenancy
or other features based on the contextual information.
This adds a flag -storage.local.engine which allows turning off local
storage in Prometheus. Instead of adding if-conditions and nil checks to
all parts of Prometheus that deal with Prometheus's local storage
(including the web interface), disabling local storage simply means
replacing the normal local storage with a noop version that throws
samples away and returns empty query results. We also don't add the noop
storage to the fanout appender to decrease internal overhead.
Instead of returning empty results, an alternate behavior could be to
return errors on any query that point out that the local storage is
disabled. Not sure which one is more preferable, so I went with the
empty result option for now.
- fold metric name into labels
- return initialization errors back to main
- add snappy compression
- better context handling
- pre-allocation of labels
- remove generic naming
- other cleanups
I got feedback from different sources about rules and targets being
too heavy in the status tab if their are lots of them.
This change also allows for more fine-granular locking.
This is with `golint -min_confidence=0.5`.
I left several lint warnings untouched because they were either
incorrect or I felt it was better not to change them at the moment.
This moves the concern of resolving the files relative to the config
file into the configuration loading itself.
It also fixes#921 which did not load the cert and token files relatively.
Besides fixing https://github.com/prometheus/prometheus/issues/805 by
making the entire externally reachable server URL configurable, this
adds tests for the "globalURL" template function and makes it easier to
test other such functions in the future.
This breaks the `web.Hostname` flag (and introduces `web.external-url`).
This flag is likely only used by few users, so I hope that's
justifiable.
Fixes https://github.com/prometheus/prometheus/issues/805
Version information is determined at build-time and thus there is
no need to pass it down from main. In its own package it can
be used from various other packages.