Commit graph

89 commits

Author SHA1 Message Date
beorn7 e01c5cefac notifier: fix increment of metric prometheus_notifications_errors_total
Previously, prometheus_notifications_errors_total was incremented by
one whenever a batch of alerts was affected by an error during sending
to a specific alertmanager. However, the corresponding metric
prometheus_notifications_sent_total, counting all alerts that were
sent (including those where the send attempt ended in an error), is
incremented by the batch size, i.e. the number of alerts.

Therefore, the ratio used in the mixin for the
PrometheusErrorSendingAlertsToSomeAlertmanagers alert is inconsistent.

This commit changes the increment of
prometheus_notifications_errors_total to the number of alerts that
were sent in the attempt that ended in an error. It also adjusts the
metric's help string accordingly and makes the wording of the alert in
the mixin more precise.

Signed-off-by: beorn7 <beorn@grafana.com>
2024-11-26 15:50:02 +01:00
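For context, the mixin's PrometheusErrorSendingAlertsToSomeAlertmanagers alert is built on the ratio of these two counters, so both must count the same unit (alerts, not batches) for the ratio to be meaningful. A minimal sketch of such a rendered rule, with an illustrative selector and threshold rather than the mixin's exact output:

```yaml
# Sketch only: ratio of alerts that failed to send vs. alerts sent, per Alertmanager.
- alert: PrometheusErrorSendingAlertsToSomeAlertmanagers
  expr: |
    (
      rate(prometheus_notifications_errors_total{job="prometheus"}[5m])
      /
      rate(prometheus_notifications_sent_total{job="prometheus"}[5m])
    ) * 100 > 1
  for: 15m
  labels:
    severity: warning
```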
machine424 f9ca6c4ae6 chore: add an alert based on the metric prometheus_sd_kubernetes_failures_total
that was introduced in https://github.com/prometheus/prometheus/pull/13554

The same motivation for adding the metric applies: To avoid silent SD failures,
as existing logs may not be regularly checked and can be missed.

Signed-off-by: machine424 <ayoubmrini424@gmail.com>
Co-authored-by: Simon Pasquier <spasquie@redhat.com>
2024-06-19 17:51:56 +02:00
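A rendered rule built on that counter would look roughly like the following; the alert name, selector, and threshold here are illustrative:

```yaml
# Sketch only: fire when Kubernetes SD has recently reported failures.
- alert: PrometheusKubernetesListWatchFailures
  expr: increase(prometheus_sd_kubernetes_failures_total{job="prometheus"}[5m]) > 0
  for: 15m
  labels:
    severity: warning
```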
Pranshu Srivastava 87427682fd
bugfix: allow opting-out of multi-cluster setups
Allow users to opt out of the multi-cluster setup for the Prometheus
dashboard in environments where it isn't applicable.

Refer: https://github.com/prometheus/prometheus/pull/13180.

Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>
2024-05-07 23:46:10 +05:30
Will Bollock 839b9e5b53
fix: PrometheusNotIngestingSamples label matching
This alert will never return anything as the left side of the query has
the labels `[component, environment, instance, job, type]` while the
right side has `[component, environment, instance, job]`.

The `type` label was added to `prometheus_tsdb_head_samples_appended_total` in
https://github.com/prometheus/prometheus/pull/11395, but the mixin wasn't updated for the new label.

This was found with [pint](https://github.com/cloudflare/pint) PromQL linting.

Signed-off-by: Will Bollock <wbollock@linode.com>
2024-01-31 09:08:36 -05:00
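When one side of a PromQL binary operation carries an extra label such as `type`, the two sides no longer match; the usual fix is to aggregate the extra label away (or use `ignoring`). A sketch of the left-hand side only, with an illustrative selector:

```yaml
# Sketch only: drop the extra `type` label so both sides of the comparison
# share the label set [component, environment, instance, job] again.
expr: |
  sum without (type) (
    rate(prometheus_tsdb_head_samples_appended_total{job="prometheus"}[5m])
  ) <= 0
```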
Erik Sommer d09d77b62a included instance in all necessary descriptions
Signed-off-by: Erik Sommer <ersotech@posteo.de>
2024-01-10 19:40:16 +01:00
Erik Sommer 0e585bf5c0 add cluster variable to Overview dashboard
Signed-off-by: Erik Sommer <ersotech@posteo.de>
2023-11-23 17:04:57 +01:00
Julien Pivotto 7a07a279c9
Merge pull request #10721 from ncauchois/fix_prometheus_remote_write_dashboard
mixin: Use url filter on Remote Write dashboard
2023-11-03 15:49:42 -04:00
Julien Pivotto 95606830fd
Merge pull request #11498 from paulfantom/selector
documentation/mixin: use prometheus metrics for dashboard variables
2023-07-11 13:36:00 +02:00
Leo Q 4268feb9d7
add alert for sd refresh failure (#12410)
* add alert for sd refresh failure

Due to a config error or the SD service being down, Prometheus may fail to refresh an SD resource, which may lead to scrape failures or irrelevant metrics.

Signed-off-by: Leo Q <LeoQuote@users.noreply.github.com>

* apply suggestions

Signed-off-by: Leo Q <LeoQuote@users.noreply.github.com>

---------

Signed-off-by: Leo Q <LeoQuote@users.noreply.github.com>
2023-06-07 14:28:13 +02:00
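Refresh-based service discovery failures are counted in `prometheus_sd_refresh_failures_total`, so a rendered rule of the kind described above could look roughly like this (alert name, selector, and windows are illustrative):

```yaml
# Sketch only: fire when any SD refresher keeps failing.
- alert: PrometheusSDRefreshFailure
  expr: increase(prometheus_sd_refresh_failures_total{job="prometheus"}[10m]) > 0
  for: 20m
  labels:
    severity: warning
```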
Paweł Krupa (paulfantom) b6caa6cabf documentation/mixin: use prometheus metrics for dashboard variables
Signed-off-by: Paweł Krupa (paulfantom) <pawel@krupa.net.pl>
2022-11-30 14:02:58 +01:00
Arunprasad Rajkumar 400d50eb7e
Add unit for uptime column in Prometheus stats dashboard
Prior to this fix, the uptime column was interpreted as a number, and higher values were suffixed with raw units like `K`. This commit sets the column's unit to `second` to make visual interpretation easier.

Signed-off-by: Arunprasad Rajkumar <ar.arunprasad@gmail.com>
2022-11-08 19:09:14 +05:30
Simon Pasquier ec7929aaa8 documentation/prometheus-mixin: fix comment typo
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2022-09-09 17:03:53 +02:00
Iain Lane e5cd5a33d0
PrometheusHighQueryLoad alert: use configured selector
Currently we're hardcoding `job="prometheus-k8s"` as the selector. This
doesn't work if your Prometheus is elsewhere. Fortunately we have
`prometheusSelector` in `$._config`, which all the other alerts use.

Use that here too.

Signed-off-by: Iain Lane <iain@orangesquash.org.uk>
2022-07-15 10:04:32 +01:00
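In the rendered rules, the configured selector simply replaces the hardcoded matcher in every metric reference. A sketch of what the query-load expression looks like with a generic selector (the mixin's exact expression and threshold may differ):

```yaml
# Sketch only: queries running vs. the configured concurrency limit.
- alert: PrometheusHighQueryLoad
  expr: |
    avg_over_time(prometheus_engine_queries{job="prometheus"}[5m])
    /
    max_over_time(prometheus_engine_queries_concurrent_max{job="prometheus"}[5m])
    > 0.8
  for: 15m
  labels:
    severity: warning
```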
Haoyu Sun 26a7f80aa1 add alert PrometheusHighQueryLoad.
Signed-off-by: Haoyu Sun <hasun@redhat.com>
2022-07-13 14:08:24 +02:00
Nolwenn Cauchois ff3d4e91dc mixin: Use url filter on Remote Write dashboard
Signed-off-by: Nolwenn Cauchois <nolwenn.cauchois@orange.com>
2022-05-20 15:04:55 +02:00
fpetkovski 501a8a7865
Address code review comments
Signed-off-by: fpetkovski <filip.petkovsky@gmail.com>
2022-03-30 09:35:08 +02:00
fpetkovski 877320784b
Add alert in mixin for exceeded sample limit
This commit adds an alert in the prometheus mixin which triggers when
Prometheus has failed scrapes that have exceeded the configured
sample_limit for that job.

Signed-off-by: fpetkovski <filip.petkovsky@gmail.com>
2022-03-30 09:31:35 +02:00
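Scrapes that exceed `sample_limit` are counted in `prometheus_target_scrapes_exceeded_sample_limit_total`, so the rendered alert is essentially an increase over that counter. A sketch with illustrative alert name, selector, and window:

```yaml
# Sketch only: fire when any target has recently hit its sample_limit.
- alert: PrometheusScrapeSampleLimitHit
  expr: increase(prometheus_target_scrapes_exceeded_sample_limit_total{job="prometheus"}[5m]) > 0
  for: 15m
  labels:
    severity: warning
```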
Haoyu Sun 3c903af474
Add Alert PrometheusScrapeBodySizeLimitHit
Signed-off-by: Haoyu Sun <hasun@redhat.com>
2022-03-22 15:13:00 +01:00
Björn Rabenstein 2234798f60
Merge pull request #9700 from nikosmeds/nikosmeds/hagroupcrashlooping-mixin-60m
Increase time range for PrometheusHAGroupCrashlooping alert
2021-11-19 12:53:55 +01:00
Niko Smeds 53ca693f9e Be specific
Signed-off-by: Niko Smeds <nikosmeds@gmail.com>
2021-11-18 11:28:38 -08:00
Niko Smeds 0bc2cbdd7d Leave time range for clean restarts as-is
Signed-off-by: Niko Smeds <nikosmeds@gmail.com>
2021-11-17 15:14:26 -08:00
Fatih Sarhan bc89e9e494 mixin: Reorder template variables on Remote Write dashboard
Signed-off-by: f9n <f9n@protonmail.com>
2021-11-12 14:38:05 +03:00
Niko Smeds fdcd423dfe Increase time range for PrometheusHAGroupCrashlooping alert
Signed-off-by: Niko Smeds <nikosmeds@gmail.com>
2021-11-08 15:06:42 -08:00
SuperQ 3cd2c033e2
Use Go 1.16+ install for mixin tests
Use new `go install` syntax to fetch tools.

Signed-off-by: SuperQ <superq@gmail.com>
2021-10-23 22:52:16 +02:00
Julien Pivotto d5676fb9e0
Merge pull request #9254 from prometheus/superq/go1.17
Build with Go 1.17 / npm 7 / node 16
2021-08-28 18:36:42 +02:00
Frederic Hemberger 16b8911b1a
docs: Replace go get with go install for command installation (#9098)
`go get` is deprecated for installation of commands as of Go v1.17
Ref: https://go.googlesource.com/go/+/ced0fdbad0655d63d535390b1a7126fd1fef8348

Signed-off-by: Frederic Hemberger <mail@frederic-hemberger.de>
2021-08-27 11:08:21 +02:00
SuperQ e167a45c65
Add new Go build tags.
Add new go:build comments based on 1.17 formatting[0].

[0]: https://golang.org/doc/go1.17#gofmt

Signed-off-by: SuperQ <superq@gmail.com>
2021-08-27 10:24:14 +02:00
Philip Gough 751ca03fad mixin: Filter instance by job for Prometheus overview dashboard
Signed-off-by: Philip Gough <philip.p.gough@gmail.com>
2021-07-28 14:34:26 +01:00
Julien Duchesne 8855c2e626
Add prometheus_tsdb_clean_start metric (#8824)
Add cleanup of the lockfile when the DB is cleanly closed.

The metric describes the status of the lockfile on startup:
0: Already existed
1: Did not exist
-1: Disabled

Therefore, if the minimum value of this metric over time is 0, executions have exited uncleanly.
We can then use that metric to apply a much lower threshold in the crashlooping alert:

If the metric exists and has been zero, two restarts are enough to trigger the alert.
If it does not exist (e.g. on an older Prometheus version), the current threshold of five restarts remains.

Signed-off-by: Julien Duchesne <julien.duchesne@grafana.com>

* Change metric name + set unset value to -1

Signed-off-by: Julien Duchesne <julien.duchesne@grafana.com>

* Only check the last value of the clean start alert

Signed-off-by: Julien Duchesne <julien.duchesne@grafana.com>

* Fix test + nit

Signed-off-by: Julien Duchesne <julien.duchesne@grafana.com>
2021-06-16 15:03:02 +05:30
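A simplified sketch of the logic described above (the mixin's actual HA-group expression is more involved, and the windows and thresholds here are illustrative):

```yaml
# Sketch only: lower the restart threshold once an unclean start has been observed.
- alert: PrometheusHAGroupCrashlooping
  expr: |
    (
      # unclean start observed: two restarts in 30m are already suspicious
      prometheus_tsdb_clean_start == 0
      and
      changes(process_start_time_seconds[30m]) >= 2
    )
    or
    # fallback for older Prometheus versions that don't expose the metric
    changes(process_start_time_seconds[30m]) >= 5
  for: 5m
  labels:
    severity: critical
```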
hanjm 1df05bfd49 Add body_size_limit to prevent bad targets from responding with a large body and causing a Prometheus server OOM (#8827)
Signed-off-by: hanjm <hanjinming@outlook.com>
2021-05-29 07:05:42 +08:00
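The limit is a per-scrape-config option; a response body above the limit makes the scrape fail, which is what the PrometheusScrapeBodySizeLimitHit alert above detects. A minimal scrape-config sketch with an illustrative value:

```yaml
scrape_configs:
  - job_name: example
    # Uncompressed response bodies larger than this fail the scrape; 0 (default) means no limit.
    body_size_limit: 10MB
    static_configs:
      - targets: ['localhost:9100']
```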
Levi Harrison 2826fbeeb7
SD: Add target creation failure counter and change failure handling (#8786)
* Added metric and changed failure/drop strategy

Signed-off-by: Levi Harrison <git@leviharrison.dev>
2021-05-28 23:50:59 +02:00
Damien Grisonnet b50f9c1c84
Add label scrape limits (#8777)
* scrape: add label limits per scrape

Add three new limits to the scrape configuration to provide some
mechanism to defend against an unbounded number of labels and excessive
label lengths. If any of these limits are broken by a sample from a
scrape, the whole scrape will fail. For all of these configuration
options, a zero value means no limit.

The `label_limit` configuration will provide a mechanism to bound the
number of labels per scraped sample to a user-defined limit.
This limit will be tested against the sample labels plus the discovery
labels, but it will exclude the __name__ label from the count since it is a
mandatory Prometheus label to which applying constraints isn't
meaningful.

The `label_name_length_limit` and `label_value_length_limit` will
prevent having labels of excessive lengths. These limits also skip the
__name__ label for the same reasons as the `label_limit` option and will
also make the scrape fail if any sample has a label name/value length
that exceeds the predefined limits.

Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>

* scrape: add metrics and alert to label limits

Add three gauges, one for each label limit, to easily access the
limit set for a certain scrape target.
Also add a counter to count the number of targets that exceeded the
label limits and thus were dropped. This is useful for the
`PrometheusLabelLimitHit` alert that will notify the users that scraping
some targets failed because they had samples exceeding the label limits
defined in the scrape configuration.

Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>

* scrape: apply label limits to __name__ label

Apply limits to the __name__ label that was previously skipped and
truncate the label names and values in the error messages as they can be
very long.

Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>

* scrape: remove label limits gauges and refactor

Remove `prometheus_target_scrape_pool_label_limit`,
`prometheus_target_scrape_pool_label_name_length_limit`, and
`prometheus_target_scrape_pool_label_value_length_limit` as they are not
really useful since we don't have the information on the labels in it.

Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
2021-05-06 09:56:21 +01:00
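All three limits are plain scrape-config options, and a zero value (the default) disables each of them. A minimal sketch with illustrative values:

```yaml
scrape_configs:
  - job_name: example
    label_limit: 64                # max number of labels accepted per sample
    label_name_length_limit: 128   # max length of any label name
    label_value_length_limit: 512  # max length of any label value
    static_configs:
      - targets: ['localhost:9100']
```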
ravilr adc8807851
Update remote-write alert rules mixin (#8423)
Signed-off-by: ravilr <raviprasad_lr@yahoo.com>
2021-01-31 20:07:49 +00:00
Frederic Branczyk 62bc755733
mixin: Scope grafana config
In its current form this configuration clashes in one of the most widely
used configurations (kube-prometheus). This patch scopes the
configuration to prevent this.

Signed-off-by: Frederic Branczyk <fbranczyk@gmail.com>
2020-12-30 17:50:34 +01:00
Nicolas Lamirault aa1ca13025
Add: Custom tags and prefix in Prometheus Mixin (#8287)
* Add: custom tags and prefix

Signed-off-by: Nicolas Lamirault <nicolas.lamirault@gmail.com>

* Fix: fmt

Signed-off-by: Nicolas Lamirault <nicolas.lamirault@gmail.com>
2020-12-16 18:49:06 +01:00
Björn Rabenstein 511511324a
Merge pull request #8235 from Allex1/master
Update remote-write grafana mixin
2020-12-08 14:50:47 +01:00
beorn7 553f904f2d mixin: Add a capability to exclude non-prod AM instances
Signed-off-by: beorn7 <beorn@grafana.com>
2020-12-03 20:59:53 +01:00
birca 3ec4161575 Update remote-write grafana mixin
Signed-off-by: birca <birca@adobe.com>
2020-12-02 09:50:15 +02:00
beorn7 638e99c814 prometheus-mixin: Make PrometheusRemoteWriteBehind more generic
Currently, it relies on `job, instance` being the labels completely
identifying a Prometheus instance. However, what's intended is to
simply not match on `remote_name, url`.

Signed-off-by: beorn7 <beorn@grafana.com>
2020-11-17 13:29:49 +01:00
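In PromQL terms that means matching the two sides with `ignoring(remote_name, url)` instead of joining on an assumed identity label set. A sketch of the resulting comparison (selector and threshold illustrative):

```yaml
# Sketch only: compare highest received vs. highest sent timestamp per remote-write queue.
- alert: PrometheusRemoteWriteBehind
  expr: |
    (
      max_over_time(prometheus_remote_storage_highest_timestamp_in_seconds{job="prometheus"}[5m])
      - ignoring(remote_name, url) group_right
      max_over_time(prometheus_remote_storage_queue_highest_sent_timestamp_seconds{job="prometheus"}[5m])
    ) > 120
  for: 15m
  labels:
    severity: critical
```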
beorn7 371ca9ff46 prometheus-mixin: add HA-group aware alerts
There is certainly a potential to add more of these. This is mostly
meant to introduce the concept and cover a few critical parts.

Signed-off-by: beorn7 <beorn@grafana.com>
2020-11-11 19:45:34 +01:00
Matthias Loibl 13ba013a24
Use absolute jsonnet import paths
This should be the way forward when importing libraries in jsonnet. It's
closer to how Go imports look and makes it more obvious where packages
live.

This is not breaking anything, as the old imports were already symlinks
to the now directly used directories.

Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>
2020-10-20 11:42:30 +02:00
Björn Rabenstein d49f267f76
Merge pull request #8054 from simonpasquier/improve-not-ingesting-samples-alert
documentation/prometheus-mixin: improve PrometheusNotIngestingSamples
2020-10-15 12:29:39 +02:00
Simon Pasquier f381d8a9bd documentation/prometheus-mixin: improve PrometheusNotIngestingSamples
The alert shouldn't fire when there's no target and no rule configured.

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2020-10-15 11:13:17 +02:00
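One way to express that guard is to require that the server actually has scrape targets or rules before comparing the ingestion rate. A sketch, with illustrative metric choices and selectors:

```yaml
# Sketch only: ingestion rate is zero AND there is something that should be producing samples.
- alert: PrometheusNotIngestingSamples
  expr: |
    (
      rate(prometheus_tsdb_head_samples_appended_total{job="prometheus"}[5m]) <= 0
    and
      (
        sum without (scrape_job) (prometheus_target_metadata_cache_entries{job="prometheus"}) > 0
      or
        sum without (rule_group) (prometheus_rule_group_rules{job="prometheus"}) > 0
      )
    )
  for: 10m
  labels:
    severity: warning
```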
Julien Pivotto 4596abee4d
Mixin: Ignore unset remote write timestamp (#8046)
* Mixin: Ignore unset remote write timestamp

This pull request ignores the zero value of highest_sent_timestamp_seconds
in "Highest Timestamp In vs. Highest Timestamp Sent", since a zero value just
shows that remote write has not been successful yet.

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2020-10-15 09:15:59 +02:00
Simon Pasquier e693af6c01
.circleci/config.yml: check mixins (#6895)
* .circleci/config.yml: check mixins

Signed-off-by: Simon Pasquier <spasquie@redhat.com>

* Run jsonnetfmt

Signed-off-by: Simon Pasquier <spasquie@redhat.com>

* Install tools in the image instead of using coreos/jsonnet-ci

The latter is deprecated

Signed-off-by: Simon Pasquier <spasquie@redhat.com>

* Update jsonnetfile.json

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
2020-08-25 15:59:41 +02:00
Julien Pivotto f482c7bdd7
Add per scrape-config targets limit (#7554)
* Add per scrape-config targets limit

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2020-07-30 14:20:24 +02:00
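Like the other scrape limits, this is a per-scrape-config option; when the number of targets left after relabeling exceeds it, Prometheus treats the affected scrapes as failed rather than scraping an unbounded set. A minimal sketch with an illustrative value:

```yaml
scrape_configs:
  - job_name: example
    target_limit: 100   # 0 (default) means no limit on the number of scraped targets
    static_configs:
      - targets: ['localhost:9100']
```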
Tom Wilkie 27b1009acd
Rename the dashboard in the mixin to 'Prometheus Overview'. (#7489)
Due to https://github.com/grafana/grafana/issues/15642, the old name prevented users from putting this dashboard in a Grafana folder called 'Prometheus'.

Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
2020-06-30 15:45:44 +01:00
Manuel Fontan 6e7554639b Update README since jsonnetfmt has been available in the Go implementation of jsonnet since v0.16.0
Signed-off-by: Manuel Fontan <mfontangarcia@slack-corp.com>
2020-06-16 10:41:58 +01:00
Callum Styan 5400e71b91 Update mixin dashboards and alerts for new remote write label names.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
2020-04-08 12:56:00 -07:00
Marco Pracucci 1e1785690a
Fix queue in alerts annotation
Signed-off-by: Marco Pracucci <marco@pracucci.com>
2020-02-12 12:48:13 +01:00