Commit graph

254 commits

Author SHA1 Message Date
paulfantom 151a8daa98 documentation: align kubernetes example with the prom operator and mixins
Signed-off-by: paulfantom <pawel@krupa.net.pl>
2021-11-22 11:13:47 +01:00
Björn Rabenstein 2234798f60
Merge pull request #9700 from nikosmeds/nikosmeds/hagroupcrashlooping-mixin-60m
Increase time range for PrometheusHAGroupCrashlooping alert
2021-11-19 12:53:55 +01:00
Niko Smeds 53ca693f9e Be specific
Signed-off-by: Niko Smeds <nikosmeds@gmail.com>
2021-11-18 11:28:38 -08:00
Niko Smeds 0bc2cbdd7d Leave time range for clean restarts as-is
Signed-off-by: Niko Smeds <nikosmeds@gmail.com>
2021-11-17 15:14:26 -08:00
Fatih Sarhan bc89e9e494 mixin: Reorder template variables on Remote Write dashboard
Signed-off-by: f9n <f9n@protonmail.com>
2021-11-12 14:38:05 +03:00
Niko Smeds fdcd423dfe Increase time range for PrometheusHAGroupCrashlooping alert
Signed-off-by: Niko Smeds <nikosmeds@gmail.com>
2021-11-08 15:06:42 -08:00
Mateusz Gozdek 1a6c2283a3 Format Go source files using 'gofumpt -w -s -extra'
Part of #9557

Signed-off-by: Mateusz Gozdek <mgozdekof@gmail.com>
2021-11-02 19:52:34 +01:00
Arthur Silva Sens be2599c853
config: Make remote-write required for Agent mode (#9618)
* config: Make remote-write required for Agent mode

Signed-off-by: ArthurSens <arthursens2005@gmail.com>
2021-10-30 01:41:40 +02:00
SuperQ 3cd2c033e2
Use Go 1.16+ install for mixin tests
Use new `go install` syntax to fetch tools.

Signed-off-by: SuperQ <superq@gmail.com>
2021-10-23 22:52:16 +02:00
Julien Pivotto 3458e338c6
docs: Improve PuppetDB example (#9547)
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2021-10-20 21:03:17 +02:00
Witek Bedyk cda2dbbef6
Add Uyuni service discovery (#8190)
* Add Uyuni service discovery

Signed-off-by: Witek Bedyk <witold.bedyk@suse.com>

Co-authored-by: Joao Cavalheiro <jcavalheiro@suse.de>
Co-authored-by: Marcelo Chiaradia <mchiaradia@suse.com>
Co-authored-by: Stefano Torresi <stefano@torresi.io>
Co-authored-by: Julien Pivotto <roidelapluie@gmail.com>
2021-10-19 01:00:44 +02:00
Julien Pivotto 8920024323 Add PuppetDB service discovery
We have been Puppet users for 10 years and we are users of
https://github.com/camptocamp/prometheus-puppetdb-sd

However, that file_sd implementation contains business logic and
assumptions around e.g. the modules which you are using.

This pull request adds a simple PuppetDB service discovery, which will
enable more use cases than the upstream sd.

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2021-09-16 16:54:26 +02:00
Paweł Szulik f5563bfe95
tests: Move from t.Errorf and others. (Part 2) (#9309)
* Refactor util tests.

Signed-off-by: Paweł Szulik <paul.szulik@gmail.com>
2021-09-13 21:19:20 +02:00
Julien Pivotto d5676fb9e0
Merge pull request #9254 from prometheus/superq/go1.17
Build with Go 1.17 / npm 7 / node 16
2021-08-28 18:36:42 +02:00
Frederic Hemberger 16b8911b1a
docs: Replace go get with go install for command installation (#9098)
`go get` is deprecated for installation of commands as of go v1.17
Ref: https://go.googlesource.com/go/+/ced0fdbad0655d63d535390b1a7126fd1fef8348

Signed-off-by: Frederic Hemberger <mail@frederic-hemberger.de>
2021-08-27 11:08:21 +02:00
SuperQ e167a45c65
Add new Go build tags.
Add new go:build comments based on 1.17 formatting[0].

[0]: https://golang.org/doc/go1.17#gofmt

Signed-off-by: SuperQ <superq@gmail.com>
2021-08-27 10:24:14 +02:00
Björn Rabenstein 9c43ac451c
Merge pull request #9129 from PhilipGough/bz-1984365
mixin: Filter instance by selected job for Prometheus overview dashboard
2021-08-13 14:03:16 +02:00
TJ Hoplock 7baf084092
optimize Linode SD by polling for event changes during refresh (#8980)
* optimize Linode SD by polling for event changes during refresh

Most accounts are fairly "static", in the sense that they're not cycling
through instances constantly. So rather than do a full refresh every
interval and potentially make several behind-the-scenes paginated API
calls, this will now poll the `/account/events/` endpoint every minute
with a list of events that we care about. If a matching event is found,
we then do a full refresh.

Co-authored-by: William Smith <wsmith@linode.com>
Signed-off-by: TJ Hoplock <t.hoplock@gmail.com>
Signed-off-by: William Smith <wsmith@linode.com>
2021-08-04 12:05:49 +02:00
Philip Gough 751ca03fad mixin: Filter instance by job for Prometheus overview dashboard
Signed-off-by: Philip Gough <philip.p.gough@gmail.com>
2021-07-28 14:34:26 +01:00
Julius Volz 179b2155d1
Fix: Use json.Unmarshal() instead of json.Decoder (#9033)
* Fix: Use json.Unmarshal() instead of json.Decoder

See https://ahmet.im/blog/golang-json-decoder-pitfalls/

json.Decoder is for JSON streams, not single JSON objects / bodies.

Signed-off-by: Julius Volz <julius.volz@gmail.com>

* Revert modifications to targetgroup parsing

Signed-off-by: Julius Volz <julius.volz@gmail.com>
2021-07-02 09:38:14 +01:00
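The pitfall named in the commit above can be illustrated with a small sketch. The `target` type and the two helper functions are hypothetical and exist only for this example; the behavior shown relies on the standard `encoding/json` package, where `Decoder.Decode` reads one value from a stream and `Unmarshal` requires the whole input to be a single valid value.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// target is a hypothetical payload type used only for this illustration.
type target struct {
	Job string `json:"job"`
}

// decodeWithDecoder parses the first JSON value found in body and
// ignores whatever follows it, because Decoder is stream-oriented.
func decodeWithDecoder(body string) (target, error) {
	var t target
	err := json.NewDecoder(strings.NewReader(body)).Decode(&t)
	return t, err
}

// decodeWithUnmarshal requires body to be exactly one JSON value.
func decodeWithUnmarshal(body string) (target, error) {
	var t target
	err := json.Unmarshal([]byte(body), &t)
	return t, err
}

func main() {
	body := `{"job":"node"} trailing garbage`

	_, err := decodeWithDecoder(body)
	fmt.Println("Decoder rejected input:", err != nil)

	_, err = decodeWithUnmarshal(body)
	fmt.Println("Unmarshal rejected input:", err != nil)
}
```

For a single request or response body, the stricter `json.Unmarshal` behavior is usually the one wanted, since trailing data after the object is then reported as an error instead of being silently dropped.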
Ben Kochie 7cb55d5732
Merge pull request #8802 from mwasilew2/yaml-linting
Adds yamllinting to Makefile.common
2021-06-24 15:59:35 +02:00
Levi Harrison 4a4882d4c7 Replace godoc.org links
Signed-off-by: Levi Harrison <git@leviharrison.dev>
2021-06-17 07:18:51 -04:00
Julien Duchesne 8855c2e626
Add prometheus_tsdb_clean_start metric (#8824)
Add cleanup of the lockfile when the db is cleanly closed

The metric describes the status of the lockfile on startup
0: Already existed
1: Did not exist
-1: Disabled

Therefore, if the minimum value of this metric over time is 0, some executions have exited uncleanly.
We can then use that metric to set a much lower threshold on the crashlooping alert:

If the metric exists and has been zero, two restarts are enough to trigger the alarm.
If it does not exist (an old Prometheus version, for example), the current threshold of five restarts remains.

Signed-off-by: Julien Duchesne <julien.duchesne@grafana.com>

* Change metric name + set unset value to -1

Signed-off-by: Julien Duchesne <julien.duchesne@grafana.com>

* Only check the last value of the clean start alert

Signed-off-by: Julien Duchesne <julien.duchesne@grafana.com>

* Fix test + nit

Signed-off-by: Julien Duchesne <julien.duchesne@grafana.com>
2021-06-16 15:03:02 +05:30
Michal Wasilewski 3f686cad8b
fixes yamllint errors
Signed-off-by: Michal Wasilewski <mwasilewski@gmx.com>
2021-06-12 12:47:47 +02:00
Levi Harrison b5f6f8fb36 Switched to go-kit/log
Signed-off-by: Levi Harrison <git@leviharrison.dev>
2021-06-11 12:28:36 -04:00
Julien Pivotto 20c6739adc
Merge pull request #8833 from hanjm/feature/add-scape-read-body-limit
Add body_size_limit to prevent large response bodies from bad targets causing Prometheus server OOM (#8827)
2021-06-02 09:24:59 +02:00
TJ Hoplock dc22c65349
Add Linode Service Discovery (#8846)
* Add Linode Service Discovery

Signed-off-by: TJ Hoplock <t.hoplock@gmail.com>
2021-06-01 20:32:36 +02:00
hanjm 1df05bfd49 Add body_size_limit to prevent large response bodies from bad targets causing Prometheus server OOM (#8827)
Signed-off-by: hanjm <hanjinming@outlook.com>
2021-05-29 07:05:42 +08:00
Levi Harrison 2826fbeeb7
SD: Add target creation failure counter and change failure handling (#8786)
* Added metric and changed failure/drop strategy

Signed-off-by: Levi Harrison <git@leviharrison.dev>
2021-05-28 23:50:59 +02:00
Callum Styan 8fd73b1d28
Add Exemplar Remote Write support (#8296)
* Write exemplars to the WAL and send them over remote write.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Update example for exemplars, print data in a more obvious format.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Add metrics for remote write of exemplars.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Fix incorrect slices passed to send in remote write.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* We need to unregister the new metrics.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Address review comments

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Order of exemplar append vs write exemplar to WAL needs to change.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Several fixes to prevent sending uninitialized or incorrect samples with an exemplar. Fix dropping exemplar for missing series. Add tests for queue_manager sending exemplars

Signed-off-by: Martin Disibio <mdisibio@gmail.com>

* Store both samples and exemplars in the same timeseries buffer to remove the alloc when building final request, keep sub-slices in separate buffers for re-use

Signed-off-by: Martin Disibio <mdisibio@gmail.com>

* Condense sample/exemplar delivery tests to parameterized sub-tests

Signed-off-by: Martin Disibio <mdisibio@gmail.com>

* Rename test methods for clarity now that they also handle exemplars

Signed-off-by: Martin Disibio <mdisibio@gmail.com>

* Rename counter variable. Fix instances where metrics were not updated correctly

Signed-off-by: Martin Disibio <mdisibio@gmail.com>

* Add exemplars to LoadWAL benchmark

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* last exemplars timestamp metric needs to convert value to seconds with
ms precision

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Process exemplar records in a separate go routine when loading the WAL.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Address review comments related to clarifying comments and variable
names. Also refactor sample/exemplar to enqueue prompb types.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Regenerate types proto with comments, update protoc version again.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Put remote write of exemplars behind a feature flag.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Address some of Ganesh's review comments.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Move exemplar remote write feature flag to a config file field.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Address Bartek's review comments.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Don't allocate exemplar buffers in queue_manager if we're not going to
send exemplars over remote write.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Add ValidateExemplar function, validate exemplars when appending to head
and log them all to WAL before adding them to exemplar storage.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Address more review comments from Ganesh.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Add exemplar total label length check.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Address a few last review comments

Signed-off-by: Callum Styan <callumstyan@gmail.com>

Co-authored-by: Martin Disibio <mdisibio@gmail.com>
2021-05-06 13:53:52 -07:00
Damien Grisonnet b50f9c1c84
Add label scrape limits (#8777)
* scrape: add label limits per scrape

Add three new limits to the scrape configuration to provide some
mechanism to defend against an unbounded number of labels and excessive
label lengths. If any of these limits is exceeded by a sample from a
scrape, the whole scrape will fail. For all of these configuration
options, a zero value means no limit.

The `label_limit` configuration will provide a mechanism to bound the
number of labels of any sample in a scrape to a user-defined limit.
This limit will be tested against the sample labels plus the discovery
labels, but it will exclude the __name__ from the count since it is a
mandatory Prometheus label to which applying constraints isn't
meaningful.

The `label_name_length_limit` and `label_value_length_limit` will
prevent having labels of excessive lengths. These limits also skip the
__name__ label for the same reasons as the `label_limit` option and will
also make the scrape fail if any sample has a label name/value length
that exceeds the predefined limits.

Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>

* scrape: add metrics and alert to label limits

Add three gauges, one for each label limit, to easily access the
limit set by a certain scrape target.
Also add a counter to count the number of targets that exceeded the
label limits and thus were dropped. This is useful for the
`PrometheusLabelLimitHit` alert that will notify the users that scraping
some targets failed because they had samples exceeding the label limits
defined in the scrape configuration.

Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>

* scrape: apply label limits to __name__ label

Apply limits to the __name__ label that was previously skipped and
truncate the label names and values in the error messages as they can be
very very long.

Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>

* scrape: remove label limits gauges and refactor

Remove `prometheus_target_scrape_pool_label_limit`,
`prometheus_target_scrape_pool_label_name_length_limit`, and
`prometheus_target_scrape_pool_label_value_length_limit` as they are not
really useful since we don't have the information on the labels in it.

Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
2021-05-06 09:56:21 +01:00
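The label limit checks described in the commit above can be sketched as a single validation pass over a sample's labels. This is a minimal sketch under stated assumptions: the `label` and `labelLimits` types and the `verifyLabelLimits` function are hypothetical names for this example, not the scrape package's API. It follows the final state described above, in which `__name__` is subject to the limits like any other label, and a zero limit means no limit.

```go
package main

import "fmt"

// label is a hypothetical name/value pair used only for this illustration.
type label struct {
	name, value string
}

// labelLimits mirrors the three configuration options; zero disables a limit.
type labelLimits struct {
	labelLimit            int
	labelNameLengthLimit  int
	labelValueLengthLimit int
}

// verifyLabelLimits returns an error for the first limit a label set exceeds.
// A caller would fail the whole scrape when any sample returns an error.
func verifyLabelLimits(lset []label, limits labelLimits) error {
	if limits.labelLimit > 0 && len(lset) > limits.labelLimit {
		return fmt.Errorf("label_limit exceeded (labels: %d, limit: %d)",
			len(lset), limits.labelLimit)
	}
	for _, l := range lset {
		if limits.labelNameLengthLimit > 0 && len(l.name) > limits.labelNameLengthLimit {
			return fmt.Errorf("label_name_length_limit exceeded (label: %q, length: %d, limit: %d)",
				l.name, len(l.name), limits.labelNameLengthLimit)
		}
		if limits.labelValueLengthLimit > 0 && len(l.value) > limits.labelValueLengthLimit {
			return fmt.Errorf("label_value_length_limit exceeded (label: %q, length: %d, limit: %d)",
				l.name, len(l.value), limits.labelValueLengthLimit)
		}
	}
	return nil
}

func main() {
	sample := []label{
		{"__name__", "http_requests_total"},
		{"job", "api"},
		{"instance", "10.0.0.1:9090"},
	}
	fmt.Println(verifyLabelLimits(sample, labelLimits{labelLimit: 2}))
	fmt.Println(verifyLabelLimits(sample, labelLimits{}))
}
```

This sketch does not truncate long names and values in its error messages, which the commit above does to keep the messages readable.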
Gezim Sejdiu 97acd170b2 Fix a broken link for the bcrypt ref. at the web-config.yml example
Signed-off-by: Gezim Sejdiu <g.sejdiu@gmail.com>
2021-04-20 22:43:37 +02:00
zhangshj 1956f07197 update redirected url
Signed-off-by: zhangshj <zhangshj@inspur.com>
2021-04-14 13:54:40 +08:00
Robert Jacob b253056163
Implement Docker discovery (#8629)
* Implement Docker discovery

Signed-off-by: Robert Jacob <xperimental@solidproject.de>
2021-03-29 22:30:23 +02:00
Rémy Léone f690b811c5
add support for scaleway service discovery (#8555)
Co-authored-by: Patrik <patrik@ptrk.io>
Co-authored-by: Julien Pivotto <roidelapluie@inuits.eu>

Signed-off-by: Rémy Léone <rleone@scaleway.com>
2021-03-10 15:10:17 +01:00
Julien Pivotto 432d5ebc6c Rename default branch to main
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2021-02-22 20:28:02 +01:00
Julien Pivotto 8787f0aed7 Update common to support credentials type
Most of the backwards-compatibility testing is done in common.

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2021-02-18 23:28:22 +01:00
Tom Wilkie d479151f1f Various enhancements and refactorings for remote write receiver:
- Remove unrelated changes
- Refactor code out of the API module - that is already getting pretty crowded.
- Don't track reference for AddFast in remote write. This has the potential to consume unlimited server-side memory if a malicious client pushes a different label set for every series. For now, it's easier and safer to always use the 'slow' path.
- Return 400 on out of order samples.
- Use remote.DecodeWriteRequest in the remote write adapters.
- Put this behind the 'remote-write-server' feature flag
- Add some (very) basic docs.
- Use named return & add test for commit error propagation

Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
2021-02-08 20:41:23 +00:00
ravilr adc8807851
Update remote-write alert rules mixin (#8423)
Signed-off-by: ravilr <raviprasad_lr@yahoo.com>
2021-01-31 20:07:49 +00:00
Julien Pivotto 5bd7145e55
Merge pull request #8327 from roidelapluie/tlsexemple
https: Add example configuration file
2021-01-15 09:50:52 +01:00
Julien Pivotto 08c259cda6 https: Add example configuration file
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2021-01-15 01:37:50 +01:00
Frederic Branczyk 62bc755733
mixin: Scope grafana config
In its current form this configuration clashes in one of the most widely
used configurations (kube-prometheus). This patch scopes the
configuration to prevent this.

Signed-off-by: Frederic Branczyk <fbranczyk@gmail.com>
2020-12-30 17:50:34 +01:00
Nicolas Lamirault aa1ca13025
Add: Custom tags and prefix in Prometheus Mixin (#8287)
* Add: custom tags and prefix

Signed-off-by: Nicolas Lamirault <nicolas.lamirault@gmail.com>

* Fix: fmt

Signed-off-by: Nicolas Lamirault <nicolas.lamirault@gmail.com>
2020-12-16 18:49:06 +01:00
Björn Rabenstein 511511324a
Merge pull request #8235 from Allex1/master
Update remote-write grafana mixin
2020-12-08 14:50:47 +01:00
beorn7 553f904f2d mixin: Add a capability to exclude non-prod AM instances
Signed-off-by: beorn7 <beorn@grafana.com>
2020-12-03 20:59:53 +01:00
birca 3ec4161575 Update remote-write grafana mixin
Signed-off-by: birca <birca@adobe.com>
2020-12-02 09:50:15 +02:00
beorn7 638e99c814 prometheus-mixin: Make PrometheusRemoteWriteBehind more generic
Currently, it relies on `job, instance` being the labels completely
identifying a Prometheus instance. However, what's intended is to
simply not match on `remote_name, url`.

Signed-off-by: beorn7 <beorn@grafana.com>
2020-11-17 13:29:49 +01:00
beorn7 371ca9ff46 prometheus-mixin: add HA-group aware alerts
There is certainly a potential to add more of these. This is mostly
meant to introduce the concept and cover a few critical parts.

Signed-off-by: beorn7 <beorn@grafana.com>
2020-11-11 19:45:34 +01:00
Julien Pivotto 6c56a1faaa
Testify: move to require (#8122)
* Testify: move to require

Moving testify to require to fail tests early in case of errors.

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>

* More moves

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2020-10-29 09:43:23 +00:00
like-inspur 29b551225b
add networking.k8s.io for ingress (#8091)
* add networking.k8s.io for ingress

level=error ts=2020-10-19T08:32:30.544Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:494: Failed to watch *v1beta1.Ingress: failed to list *v1beta1.Ingress: ingresses.networking.k8s.io is forbidden: User \"system:serviceaccount:monitoring:prometheus\" cannot list resource \"ingresses\" in API group \"networking.k8s.io\" at the cluster scope"

Signed-off-by: root <likerj@inspur.com>

* Update rbac-setup.yml

Signed-off-by: root <likerj@inspur.com>
2020-10-22 15:08:12 -06:00