This commit adds a new 'keep_firing_for' field to Prometheus alerting
rules. The 'resolve_delay' field specifies the minimum amount of time
that an alert should remain firing, even if the expression does not
return any results.
This feature was discussed at a previous dev summit, and it was
determined that a feature like this would be useful in order to allow
the expression time to stabilize and prevent confusing resolved messages
from being propagated through Alertmanager.
This approach is simpler than having two PromQL queries, as was
sometimes discussed, and it should be easy to implement.
This commit does not include tests for the 'resolve_delay' field. This
is intentional, as the purpose of this commit is to gather comments on
the proposed design of the 'resolve_delay' field before implementing
tests. Once the design of the 'resolve_delay' field has been finalized,
a follow-up commit will be submitted with tests."
See https://github.com/prometheus/prometheus/issues/11570
Signed-off-by: Julien Pivotto <roidelapluie@o11y.eu>
* Update docs example rules for default config
The prometheus download includes a default config to scrape itself.
This self-scraping prometheus doesn't include any metric named as
`http_inprogress_requests`, but does include one named
`prometheus_http_requests_total`.
Updating this example rule in the docs to one which can be used
out-of-the-box with the default download would be a nice improvement.
Signed-off-by: Sam Jewell <sam.jewell@grafana.com>
* Update syntax as per @LeviHarrison's review
Co-authored-by: Levi Harrison <levisamuelharrison@gmail.com>
Signed-off-by: Sam Jewell <2903904+samjewell@users.noreply.github.com>
Signed-off-by: Sam Jewell <sam.jewell@grafana.com>
Signed-off-by: Sam Jewell <2903904+samjewell@users.noreply.github.com>
Co-authored-by: Levi Harrison <levisamuelharrison@gmail.com>
Since the struct defines proxy_connect_header instead of proxy_connect_headers, all relevant occurences of it were replaced with the correct configuration name as defined in the HTTPClientConfig struct.
Signed-off-by: Robbe Haesendonck <googleit@inuits.eu>
* Add VM size label to azure service discovery (#11575)
Signed-off-by: davidifr <davidfr.mail@gmail.com>
* Add VM size label to azure service discovery (#11575)
Signed-off-by: davidifr <davidfr.mail@gmail.com>
* Add VM size label to azure service discovery (#11575)
Signed-off-by: davidifr <davidfr.mail@gmail.com>
Signed-off-by: davidifr <davidfr.mail@gmail.com>
* docs: Add link to best practices in "Defining Recording Rules" page
Signed-off-by: John Carlo Roberto <10111643+Irizwaririz@users.noreply.github.com>
* docs: Improve wording
Signed-off-by: John Carlo Roberto <10111643+Irizwaririz@users.noreply.github.com>
Signed-off-by: John Carlo Roberto <10111643+Irizwaririz@users.noreply.github.com>
* Common client in EC2 and Lightsail
Signed-off-by: Levi Harrison <git@leviharrison.dev>
* Azure -> AWS
Signed-off-by: Levi Harrison <git@leviharrison.dev>
Signed-off-by: Levi Harrison <git@leviharrison.dev>
This PR updates the Serverset url at the configuration.md documentation.
Signed-off-by: Gabriela Cervantes <gabriela.cervantes.tellez@intel.com>
Signed-off-by: Gabriela Cervantes <gabriela.cervantes.tellez@intel.com>
Currently, it's not explicitly called out which permissions are needed
for service discovery of EC2 instances. It's not super hard to figure
out, but it did involve a bit of guesswork when I did it yesterday, and
I figure it's worth saving people that effort.
I've also seen examples around the internet where people are granting
much broader permissions than they need to, so maybe we can save on that
by explicitly saying what's needed.
Signed-off-by: Chris Sinjakli <chris@sinjakli.co.uk>
It's currently possible to use blackbox_exporter to probe MX records
themselves. However it's not possible to do an end-to-end test, like is
possible with SRV records. This makes it possible to use MX records as a
source of hostnames in the same way as SRV records.
Signed-off-by: David Leadbeater <dgl@dgl.cx>
Signed-off-by: Jonathan Stevens <jonathanstevens89@gmail.com>
Signed-off-by: Jonathan Stevens <jon.stevens@getweave.com>
Co-authored-by: Jonathan Stevens <jon.stevens@getweave.com>
* add description for __meta_kubernetes_endpoints_label_* and __meta_kubernetes_endpoints_labelpresent_*
Signed-off-by: renzheng.wang <wangrzneu@gmail.com>
The Kubernetes service discovery can only add node labels to
targets from the pod role.
This commit extends this functionality to the endpoints and
endpointslices roles.
Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
Update the wording of the `labelmap` relabel action to make it more
clear that it acts on all the label names, rather than the list provided
by source_labels.
Signed-off-by: SuperQ <superq@gmail.com>
* Fix documentation for Docker API filters
Signed-off-by: Robert Jacob <xperimental@solidproject.de>
* Undo indentation change
Signed-off-by: Robert Jacob <xperimental@solidproject.de>
* Added pathPrefix function to the template reference documentation
Signed-off-by: David N Perkins <David.N.Perkins@ibm.com>
* PR suggestions
Signed-off-by: David N Perkins <David.N.Perkins@ibm.com>
* Switch wording back
Signed-off-by: David N Perkins <David.N.Perkins@ibm.com>
* Followup on OpenTelemetry migration
- tracing_config: Change with_insecure to insecure, default to false.
- tracing_config: Call SetDirectory to make TLS certificates relative to the Prometheus
configuration
- documentation: Change bool to boolean in the configuration
- documentation: document type float
- tracing: Always restart the tracing manager when TLS config is set to
reload certificates
- tracing: Always set TLS config, which could be used e.g. in case of
potential redirects.
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>\\
This commit adds support for discovering targets from the same
Kubernetes namespace as the Prometheus pod itself. Own-namespace
discovery can be indicated by using "." as the namespace.
Fixes#9782
Signed-off-by: fpetkovski <filip.petkovsky@gmail.com>
When using Kubernetes on cloud providers, nodes will have the
spec.providerID field populated to contain the cloud provider specific
name of the EC2/GCE/... instance.
Let's expose this information as an additional label, so that it's
easier to annotate metrics and alerts to contain the cloud provider
specific name of the instance to which it pertains.
Signed-off-by: Ed Schouten <eschouten@apple.com>
This can be useful when generating rules, a query may use a duration,
and it may be useful to template that into a URL parameter. Therefore
this allows interfacing with systems that don't implement Prometheus
style duration parsing.
Signed-off-by: David Leadbeater <dgl@dgl.cx>
* remote-write: slow down retries to avoid DDOS
Increase the default max retry time from 100ms to 5 seconds.
Remote write calls are retried after a recoverable error such as the
back-end returning 500. Prometheus waits the minimum time and retries,
then doubles the wait on each subsequent retry until the maximum is
reached.
If some data is still getting through, remote-write will also increase
shards, and the default maximum is 200. 200 shards sending every 100ms
is 20 calls per second, to a back-end that is already in trouble.
5 seconds was chosen to match the default BatchSendDeadline: if we can
afford to wait that long for no response, then we can wait the same time
to retry. We will reach 5 seconds after 9 successive failures.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
* Update config doc for max_backoff change
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
We have been Puppet user for 10 years and we are users of
https://github.com/camptocamp/prometheus-puppetdb-sd
However, that file_sd implementation contains business logic and
assumptions around e.g. the modules which you are using.
This pull request adds a simple PuppetDB service discovery, which will
enable more use cases than the upstream sd.
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
* PromQL: Fix start and end keywords masking label and metric names
This commit fixes an issue with the "at modifier" that introduced two
new keywords: `start` and `end`. In grouping options and in metric
names, these keywords took precedence over metric or label names, so
that those metrics and labels could no longer be referenced.
Signed-off-by: Clayton Peters <clayton.peters@man.com>
* Add in additional tests for metrics and/or labels called start/end.
Signed-off-by: Clayton Peters <clayton.peters@man.com>
* *: Cut 2.29.0-rc.0
Signed-off-by: Frederic Branczyk <fbranczyk@gmail.com>
* VERSION: bump to 2.29.0-rc.0
Signed-off-by: Frederic Branczyk <fbranczyk@gmail.com>
* Remove experimental wording on size-based retention
Followup of #9004
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
* Fix PR reference in changelog
Signed-off-by: George Brighton <george@gebn.co.uk>
* Describe EC2 availability zone IDs at most once per refresh (#9142)
Signed-off-by: George Brighton <george@gebn.co.uk>
* Describe EC2 availability zones at most once per SD load
Closes#9142.
Signed-off-by: George Brighton <george@gebn.co.uk>
* Incorporate feedback
Signed-off-by: George Brighton <george@gebn.co.uk>
* Integrate feedback
Signed-off-by: George Brighton <george@gebn.co.uk>
* Add a compatibility note for macOS users.
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
* *: Cut v2.29.0-rc.1
Signed-off-by: Frederic Branczyk <fbranczyk@gmail.com>
* Fix `kuma_sd` targetgroup reporting (#9157)
* Bundle all xDS targets into a single group
Signed-off-by: austin ce <austin.cawley@gmail.com>
* *: cut v2.29.0-rc.2
Signed-off-by: Frederic Branczyk <fbranczyk@gmail.com>
* Rename links
Signed-off-by: Levi Harrison <git@leviharrison.dev>
* bump codemirror-promql to 0.17.0
Signed-off-by: Augustin Husson <husson.augustin@gmail.com>
* *: cut v2.29.0
Signed-off-by: Frederic Branczyk <fbranczyk@gmail.com>
* tsdb: align atomically accessed int64 (#9192)
This prevents a panic in 32-bit archs:
https://pkg.go.dev/sync/atomic#pkg-note-BUGFixed#9190
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
* Release 2.29.1 (#9193)
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
Co-authored-by: Clayton Peters <clayton.peters@man.com>
Co-authored-by: Frederic Branczyk <fbranczyk@gmail.com>
Co-authored-by: George Brighton <george@gebn.co.uk>
Co-authored-by: Austin Cawley-Edwards <austin.cawley@gmail.com>
Co-authored-by: Levi Harrison <git@leviharrison.dev>
Co-authored-by: Augustin Husson <husson.augustin@gmail.com>
* optimize Linode SD by polling for event changes during refresh
Most accounts are fairly "static", in the sense that they're not cycling
through instances constantly. So rather than do a full refresh every
interval and potentially make several behind-the-scenes paginated API
calls, this will now poll the `/account/events/` endpoint every minute
with a list of events that we care about. If a matching event is found,
we then do a full refresh.
Co-authored-by: William Smith <wsmith@linode.com>
Signed-off-by: TJ Hoplock <t.hoplock@gmail.com>
Signed-off-by: William Smith <wsmith@linode.com>