* Default to bigger remote_write sends
Raise the default MaxSamplesPerSend to amortise the cost of remote
calls across more samples. Lower MaxShards to keep the expected max
memory usage within reason.
Signed-off-by: Bryan Boreham <bryan@weave.works>
* Change default Capacity to 2500
To maintain ratio with MaxSamplesPerSend
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
This also fixes a bug in query_log_file, which now is relative to the config file like all other paths.
Signed-off-by: Andy Bursavich <abursavich@gmail.com>
* Track remote write queues via a map so we don't care about index.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Support a job name for remote write/read so we can differentiate between
them using the name.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Remote write/read has Name to not confuse the meaning of the field with
scrape job names.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Split queue/client label into remote_name and url labels.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Don't allow for duplicate remote write/read configs.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Ensure we restart remote write queues if the hash of their config has
not changed, but the remote name has changed.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Include name in remote read/write config hashes, simplify duplicates
check, update test accordingly.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
This change makes sure that nearly-identical Alertmanager configurations
aren't merged together.
The config's identifier was the MD5 hash of the configuration serialized
to JSON but because `relabel.Regexp` has no public field and doesn't
implement the JSON.Marshaler interface, it was always serialized to
"{}".
In practice, the identifier can be based on the index of the
configuration in the list.
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
The desired shards calculation now properly keeps track of the rate of
pending samples, and uses the previously unused integralAccumulator to
adjust for missing information in the desired shards calculation.
Also, configure more capacity for each shard. The default 10 capacity
causes shards to block on each other while
sending remote requests. Default to a 500 sample capacity and explain in
the documentation that having more capacity will help throughput.
Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
With v0.16.0 Alertmanager introduced a new API (v2). This patch adds a
configuration option for Prometheus to send alerts to the v2 endpoint
instead of the defautl v1 endpoint.
Signed-off-by: Max Leonard Inden <IndenML@gmail.com>
i) Uses the more idiomatic Wrap and Wrapf methods for creating nested errors.
ii) Fixes some incorrect usages of fmt.Errorf where the error messages don't have any formatting directives.
iii) Does away with the use of fmt package for errors in favour of pkg/errors
Signed-off-by: tariqibrahim <tariq181290@gmail.com>
- Unmarshall external_labels config as labels.Labels, add tests.
- Convert some more uses of model.LabelSet to labels.Labels.
- Remove old relabel pkg (fixes#3647).
- Validate external label names.
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
* discovery/kubernetes: fix support for password_file
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Create and pass custom RoundTripper to Kubernetes client
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Use inline HTTPClientConfig
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
This change switches the remote_write API to use the TSDB WAL. This should reduce memory usage and prevent sample loss when the remote end point is down.
We use the new LiveReader from TSDB to tail WAL segments. Logic for finding the tracking segment is included in this PR. The WAL is tailed once for each remote_write endpoint specified. Reading from the segment is based on a ticker rather than relying on fsnotify write events, which were found to be complicated and unreliable in early prototypes.
Enqueuing a sample for sending via remote_write can now block, to provide back pressure. Queues are still required to acheive parallelism and batching. We have updated the queue config based on new defaults for queue capacity and pending samples values - much smaller values are now possible. The remote_write resharding code has been updated to prevent deadlocks, and extra tests have been added for these cases.
As part of this change, we attempt to guarantee that samples are not lost; however this initial version doesn't guarantee this across Prometheus restarts or non-retryable errors from the remote end (eg 400s).
This changes also includes the following optimisations:
- only marshal the proto request once, not once per retry
- maintain a single copy of the labels for given series to reduce GC pressure
Other minor tweaks:
- only reshard if we've also successfully sent recently
- add pending samples, latest sent timestamp, WAL events processed metrics
Co-authored-by: Chris Marchbanks <csmarchbanks.com> (initial prototype)
Co-authored-by: Tom Wilkie <tom.wilkie@gmail.com> (sharding changes)
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* config: set target group source index during unmarshalling
Fixes issue #4214 where the scrape pool is unnecessarily reloaded for a
config reload where the config hasn't changed. Previously, the discovery
manager changed the static config after loading which caused the in-memory
config to differ from a freshly reloaded config.
Signed-off-by: Paul Gier <pgier@redhat.com>
* [issue #4214] Test that static targets are not modified by discovery manager
Signed-off-by: Paul Gier <pgier@redhat.com>
This adds support for basic authentication which closes#3090
The support for specifying the client timeout was removed as discussed in https://github.com/prometheus/common/pull/123. Marathon was the only sd mechanism doing this and configuring the timeout is done through `Context`.
DC/OS uses a custom `Authorization` header for authenticating. This adds 2 new configuration properties to reflect this.
Existing configuration files that use the bearer token will no longer work. More work is required to make this backwards compatible.
* refactor: move targetGroup struct and CheckOverflow() to their own package
* refactor: move auth and security related structs to a utility package, fix import error in utility package
* refactor: Azure SD, remove SD struct from config
* refactor: DNS SD, remove SD struct from config into dns package
* refactor: ec2 SD, move SD struct from config into the ec2 package
* refactor: file SD, move SD struct from config to file discovery package
* refactor: gce, move SD struct from config to gce discovery package
* refactor: move HTTPClientConfig and URL into util/config, fix import error in httputil
* refactor: consul, move SD struct from config into consul discovery package
* refactor: marathon, move SD struct from config into marathon discovery package
* refactor: triton, move SD struct from config to triton discovery package, fix test
* refactor: zookeeper, move SD structs from config to zookeeper discovery package
* refactor: openstack, remove SD struct from config, move into openstack discovery package
* refactor: kubernetes, move SD struct from config into kubernetes discovery package
* refactor: notifier, use targetgroup package instead of config
* refactor: tests for file, marathon, triton SD - use targetgroup package instead of config.TargetGroup
* refactor: retrieval, use targetgroup package instead of config.TargetGroup
* refactor: storage, use config util package
* refactor: discovery manager, use targetgroup package instead of config.TargetGroup
* refactor: use HTTPClient and TLS config from configUtil instead of config
* refactor: tests, use targetgroup package instead of config.TargetGroup
* refactor: fix tagetgroup.Group pointers that were removed by mistake
* refactor: openstack, kubernetes: drop prefixes
* refactor: remove import aliases forced due to vscode bug
* refactor: move main SD struct out of config into discovery/config
* refactor: rename configUtil to config_util
* refactor: rename yamlUtil to yaml_config
* refactor: kubernetes, remove prefixes
* refactor: move the TargetGroup package to discovery/
* refactor: fix order of imports
For special remote read endpoints which have only data for specific
queries, it is desired to limit the number of queries sent to the
configured remote read endpoint to reduce latency and performance
overhead.
Currently all read queries are simply pushed to remote read clients.
This is fine, except for remote storage for wich it unefficient and
make query slower even if remote read is unnecessary.
So we need instead to compare the oldest timestamp in primary/local
storage with the query range lower boundary. If the oldest timestamp
is older than the mint parameter, then there is no need for remote read.
This is an optionnal behavior per remote read client.
Signed-off-by: Thibault Chataigner <t.chataigner@criteo.com>
* k8s: Support discovery of ingresses
* Move additional labels below allocation
This makes it more obvious why the additional elements are allocated.
Also fix allocation for node where we only set a single label.
* k8s: Remove port from ingress discovery
* k8s: Add comment to ingress discovery example
* Add openstack service discovery.
* Add gophercloud code for openstack service discovery.
* first changes for juliusv comments.
* add gophercloud code for floatingip.
* Add tests to openstack sd.
* Add testify suite vendor files.
* add copyright and make changes for code climate.
* Fixed typos in provider openstack.
* Renamed tenant to project in openstack sd.
* Change type of password to Secret in openstack sd.
Allow namespace discovery to be more easily extended in the future by using a struct rather than just a list.
Rename fields for kubernetes namespace discovery
Each remote write endpoint gets its own set of relabeling rules.
This is based on the (yet-to-be-merged)
https://github.com/prometheus/prometheus/pull/2419, which removes legacy
remote write implementations.
This imposes a hard limit on the number of samples ingested from the
target. This is counted after metric relabelling, to allow dropping of
problemtic metrics.
This is intended as a very blunt tool to prevent overload due to
misbehaving targets that suddenly jump in sample count (e.g. adding
a label containing email addresses).
Add metric to track how often this happens.
Fixes#2137
Introduce two new relabel actions. labeldrop, and labelkeep.
These can be used to filter the set of labels by matching regex
- labeldrop: drops all labels that match the regex
- labelkeep: drops all labels that do not match the regex
* Add config, HTTP Basic Auth and TLS support to the generic write path.
- Move generic write path configuration to the config file
- Factor out config.TLSConfig -> tlf.Config translation
- Support TLSConfig for generic remote storage
- Rename Run to Start, and make it non-blocking.
- Dedupe code in httputil for TLS config.
- Make remote queue metrics global.