We have been Puppet users for 10 years and we use
https://github.com/camptocamp/prometheus-puppetdb-sd.
However, that file_sd implementation contains business logic and
assumptions around e.g. the modules which you are using.
This pull request adds a simple PuppetDB service discovery, which will
enable more use cases than the upstream sd.
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
* optimize Linode SD by polling for event changes during refresh
Most accounts are fairly "static", in the sense that they're not cycling
through instances constantly. So rather than do a full refresh every
interval and potentially make several behind-the-scenes paginated API
calls, this will now poll the `/account/events/` endpoint every minute
with a list of events that we care about. If a matching event is found,
we then do a full refresh.
Co-authored-by: William Smith <wsmith@linode.com>
Signed-off-by: TJ Hoplock <t.hoplock@gmail.com>
Signed-off-by: William Smith <wsmith@linode.com>
* Fix: Use json.Unmarshal() instead of json.Decoder
See https://ahmet.im/blog/golang-json-decoder-pitfalls/
json.Decoder is for JSON streams, not single JSON objects / bodies.
Signed-off-by: Julius Volz <julius.volz@gmail.com>
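To illustrate the pitfall, here is a small sketch comparing the two standard library calls on a body that has trailing data after the first JSON value. The `payload` type and the helper names are made up for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// payload is a hypothetical response type used only for this example.
type payload struct {
	Name string `json:"name"`
}

// decodeWithDecoder reads the first JSON value from body. Anything that
// follows that value is left unread and therefore not validated.
func decodeWithDecoder(body string) (payload, error) {
	var p payload
	err := json.NewDecoder(strings.NewReader(body)).Decode(&p)
	return p, err
}

// decodeWithUnmarshal requires body to be exactly one JSON value.
func decodeWithUnmarshal(body string) (payload, error) {
	var p payload
	err := json.Unmarshal([]byte(body), &p)
	return p, err
}

func main() {
	body := `{"name":"a"} trailing garbage`

	_, err := decodeWithDecoder(body)
	fmt.Println("Decoder error:", err)

	_, err = decodeWithUnmarshal(body)
	fmt.Println("Unmarshal error:", err)
}
```

With this input the decoder reports no error, while `json.Unmarshal` rejects the body because of the data after the top-level value.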
* Revert modifications to targetgroup parsing
Signed-off-by: Julius Volz <julius.volz@gmail.com>
Add cleanup of the lockfile when the db is cleanly closed
The metric describes the status of the lockfile on startup
0: Already existed
1: Did not exist
-1: Disabled
Therefore, if the min value over time of this metric is 0, that means that executions have exited uncleanly
We can then use that metric to have a much lower threshold on the crashlooping alert:
If the metric exists and it has been zero, two restarts is enough to trigger the alarm
If it does not exist (old prom version for example), the current five restarts threshold remains
Signed-off-by: Julien Duchesne <julien.duchesne@grafana.com>
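The threshold selection described above lives in an alerting rule, but the decision can be sketched in Go for clarity. `restartThreshold` and its parameters are hypothetical names, and treating the thresholds as "at least N restarts" is an assumption.

```go
package main

import "fmt"

// restartThreshold returns how many restarts should trigger the
// crash-loop alert. minCleanStart is the minimum value of the clean-start
// metric over the alert window; exists is false when the metric is absent
// (for example, on an older Prometheus version).
func restartThreshold(minCleanStart float64, exists bool) int {
	if exists && minCleanStart == 0 {
		// The lockfile already existed on some startup, so at least one
		// previous run exited uncleanly: alert much earlier.
		return 2
	}
	// Metric absent, disabled (-1), or always clean (1): keep the
	// existing threshold.
	return 5
}

func main() {
	fmt.Println(restartThreshold(0, true))  // unclean exit observed
	fmt.Println(restartThreshold(1, true))  // only clean starts
	fmt.Println(restartThreshold(0, false)) // metric not reported
}
```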
* Change metric name + set unset value to -1
Signed-off-by: Julien Duchesne <julien.duchesne@grafana.com>
* Only check the last value of the clean start alert
Signed-off-by: Julien Duchesne <julien.duchesne@grafana.com>
* Fix test + nit
Signed-off-by: Julien Duchesne <julien.duchesne@grafana.com>
* Write exemplars to the WAL and send them over remote write.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Update example for exemplars, print data in a more obvious format.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Add metrics for remote write of exemplars.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Fix incorrect slices passed to send in remote write.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* We need to unregister the new metrics.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Address review comments
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Order of exemplar append vs write exemplar to WAL needs to change.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Several fixes to prevent sending uninitialized or incorrect samples with an exemplar. Fix dropping exemplar for missing series. Add tests for queue_manager sending exemplars
Signed-off-by: Martin Disibio <mdisibio@gmail.com>
* Store both samples and exemplars in the same timeseries buffer to remove the alloc when building final request, keep sub-slices in separate buffers for re-use
Signed-off-by: Martin Disibio <mdisibio@gmail.com>
* Condense sample/exemplar delivery tests to parameterized sub-tests
Signed-off-by: Martin Disibio <mdisibio@gmail.com>
* Rename test methods for clarity now that they also handle exemplars
Signed-off-by: Martin Disibio <mdisibio@gmail.com>
* Rename counter variable. Fix instances where metrics were not updated correctly
Signed-off-by: Martin Disibio <mdisibio@gmail.com>
* Add exemplars to LoadWAL benchmark
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* last exemplars timestamp metric needs to convert value to seconds with
ms precision
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Process exemplar records in a separate go routine when loading the WAL.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Address review comments related to clarifying comments and variable
names. Also refactor sample/exemplar to enqueue prompb types.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Regenerate types proto with comments, update protoc version again.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Put remote write of exemplars behind a feature flag.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Address some of Ganesh's review comments.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Move exemplar remote write feature flag to a config file field.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Address Bartek's review comments.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Don't allocate exemplar buffers in queue_manager if we're not going to
send exemplars over remote write.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Add ValidateExemplar function, validate exemplars when appending to head
and log them all to WAL before adding them to exemplar storage.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Address more review comments from Ganesh.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Add exemplar total label length check.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Address a few last review comments
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Co-authored-by: Martin Disibio <mdisibio@gmail.com>
* scrape: add label limits per scrape
Add three new limits to the scrape configuration to provide some
mechanism to defend against an unbounded number of labels and excessive
label lengths. If any of these limits are broken by a sample from a
scrape, the whole scrape will fail. For all of these configuration
options, a zero value means no limit.
The `label_limit` option bounds the number of labels on each scraped
sample to a user-defined limit. The limit is tested against the sample
labels plus the discovery labels, but the count excludes __name__, since
it is a mandatory Prometheus label to which applying constraints isn't
meaningful.
The `label_name_length_limit` and `label_value_length_limit` options
prevent labels of excessive length. These limits also skip the __name__
label, for the same reason as `label_limit`, and likewise make the scrape
fail if any sample has a label name or value whose length exceeds the
configured limit.
Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
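The checks described in this commit can be sketched as below. This is an illustration under stated assumptions, not the scrape code itself: the `label` and `labelLimits` types and `verifyLabelLimits` are hypothetical, and lengths are measured in bytes here.

```go
package main

import "fmt"

// label and labelLimits are hypothetical types for this sketch.
type label struct{ Name, Value string }

type labelLimits struct {
	labelLimit            int // max number of labels, 0 means no limit
	labelNameLengthLimit  int // max label name length, 0 means no limit
	labelValueLengthLimit int // max label value length, 0 means no limit
}

// verifyLabelLimits returns an error if the label set of one sample breaks
// any limit. The __name__ label is skipped, as described in this commit.
func verifyLabelLimits(lset []label, limits labelLimits) error {
	count := 0
	for _, l := range lset {
		if l.Name == "__name__" {
			continue
		}
		count++
		if limits.labelNameLengthLimit > 0 && len(l.Name) > limits.labelNameLengthLimit {
			return fmt.Errorf("label name too long: %d > %d", len(l.Name), limits.labelNameLengthLimit)
		}
		if limits.labelValueLengthLimit > 0 && len(l.Value) > limits.labelValueLengthLimit {
			return fmt.Errorf("label value too long: %d > %d", len(l.Value), limits.labelValueLengthLimit)
		}
	}
	if limits.labelLimit > 0 && count > limits.labelLimit {
		return fmt.Errorf("too many labels: %d > %d", count, limits.labelLimit)
	}
	return nil
}

func main() {
	sample := []label{
		{"__name__", "http_requests_total"},
		{"job", "api"},
		{"instance", "localhost:9090"},
	}
	fmt.Println(verifyLabelLimits(sample, labelLimits{labelLimit: 2}))
	fmt.Println(verifyLabelLimits(sample, labelLimits{labelLimit: 1}))
}
```

A caller would fail the whole scrape as soon as any sample returns a non-nil error.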
* scrape: add metrics and alert to label limits
Add three gauges, one for each label limit, to easily access the
limits set for a given scrape target.
Also add a counter to count the number of targets that exceeded the
label limits and thus were dropped. This is useful for the
`PrometheusLabelLimitHit` alert that will notify the users that scraping
some targets failed because they had samples exceeding the label limits
defined in the scrape configuration.
Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
* scrape: apply label limits to __name__ label
Apply limits to the __name__ label that was previously skipped and
truncate the label names and values in the error messages as they can be
very long.
Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
* scrape: remove label limits gauges and refactor
Remove `prometheus_target_scrape_pool_label_limit`,
`prometheus_target_scrape_pool_label_name_length_limit`, and
`prometheus_target_scrape_pool_label_value_length_limit` as they are not
really useful since we don't have the information on the labels in it.
Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
- Remove unrelated changes
- Refactor code out of the API module - that is already getting pretty crowded.
- Don't track reference for AddFast in remote write. This has the potential to consume unlimited server-side memory if a malicious client pushes a different label set for every series. For now, it's easier and safer to always use the 'slow' path.
- Return 400 on out of order samples.
- Use remote.DecodeWriteRequest in the remote write adapters.
- Put this behind the 'remote-write-server' feature flag.
- Add some (very) basic docs.
- Use named return & add test for commit error propagation.
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
In its current form this configuration clashes with one of the most
widely used configurations (kube-prometheus). This patch scopes the
configuration to prevent this.
Signed-off-by: Frederic Branczyk <fbranczyk@gmail.com>
Currently, it relies on `job, instance` being the labels completely
identifying a Prometheus instance. However, what's intended is to
simply not match on `remote_name, url`.
Signed-off-by: beorn7 <beorn@grafana.com>
There is certainly a potential to add more of these. This is mostly
meant to introduce the concept and cover a few critical parts.
Signed-off-by: beorn7 <beorn@grafana.com>
* Testify: move to require
Moving testify to require to fail tests early in case of errors.
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
* More moves
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
* add networking.k8s.io for ingress
level=error ts=2020-10-19T08:32:30.544Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:494: Failed to watch *v1beta1.Ingress: failed to list *v1beta1.Ingress: ingresses.networking.k8s.io is forbidden: User \"system:serviceaccount:monitoring:prometheus\" cannot list resource \"ingresses\" in API group \"networking.k8s.io\" at the cluster scope"
Signed-off-by: root <likerj@inspur.com>
* Update rbac-setup.yml
Signed-off-by: root <likerj@inspur.com>
This should be the way forward when importing libraries in jsonnet. It's
closer to how Go imports look and makes it more obvious where packages
live.
This is not breaking anything, as the old imports were already symlinks
to the now directly used directories.
Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>
* Mixin: Ignore unset remote write timestamp
This pull request ignores the zero value of highest_sent_timestamp_seconds
in Highest Timestamp In vs. Highest Timestamp Sent, since a zero value
just shows that remote write has not been successful yet.
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
* update the doc link in internal_architecture.md
* address reviewer's comment to remove out-dated wrapper
Signed-off-by: Luke Chen <showuon@gmail.com>
* .circleci/config.yml: check mixins
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Run jsonnetfmt
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Install tools in the image instead of using coreos/jsonnet-ci
The latter is deprecated
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Update jsonnetfile.json
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
Due to https://github.com/grafana/grafana/issues/15642, this prevents users putting this dashboard in a Grafana folder called 'Prometheus'.
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
* add test to custom-sd/adapter writeOutput() function
Signed-off-by: Benoit Gagnon <benoit.gagnon@ubisoft.com>
* fix Adapter.writeOutput() function to work on Windows
On that platform, files cannot be moved while a process holds a handle
to them. Added an explicit Close() before that move. With this change,
the unit test succeeds.
Signed-off-by: Benoit Gagnon <benoit.gagnon@ubisoft.com>
* add missing dot to comment
Signed-off-by: Benoit Gagnon <benoit.gagnon@ubisoft.com>
* [bugfix] custom SD: when IPs are out of order, reflect.DeepEqual cannot correctly identify whether there is a change
Signed-off-by: fuling <fuling.lgz@alibaba-inc.com>
* [format] makefile:Makefile.common:116: common-style
Signed-off-by: fuling <fuling.lgz@alibaba-inc.com>
* [bugfix] custom SD: address simonpasquier's comment: it would be simpler to sort the targets alphabetically and keep reflect.DeepEqual.
Signed-off-by: fuling <fuling.lgz@alibaba-inc.com>
* [bugfix] custom SD: fix sort
Signed-off-by: fuling <fuling.lgz@alibaba-inc.com>
* [bugfix] custom SD: adapter.go needs an empty line after "sort"
Signed-off-by: fuling <fuling.lgz@alibaba-inc.com>
* [bugfix] custom SD: test sign-off
Signed-off-by: fuling <fuling.lgz@alibaba-inc.com>
* [bugfix] custom SD: fix adapter_test.go
Signed-off-by: fuling <fuling.lgz@alibaba-inc.com>
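The approach settled on in the commits above (sort the targets, then keep using reflect.DeepEqual) can be sketched like this. The `group` type and the `normalize` and `changed` helpers are hypothetical and only illustrate the idea.

```go
package main

import (
	"fmt"
	"reflect"
	"sort"
)

// group is a hypothetical, simplified target group.
type group struct {
	Targets []string
	Labels  map[string]string
}

// normalize returns a copy of g with its targets sorted alphabetically, so
// that two groups holding the same targets in a different order compare
// equal. The input slice is not modified.
func normalize(g group) group {
	sorted := append([]string(nil), g.Targets...)
	sort.Strings(sorted)
	return group{Targets: sorted, Labels: g.Labels}
}

// changed reports whether next differs from prev, ignoring target order.
func changed(prev, next group) bool {
	return !reflect.DeepEqual(normalize(prev), normalize(next))
}

func main() {
	a := group{Targets: []string{"10.0.0.2:9100", "10.0.0.1:9100"}}
	b := group{Targets: []string{"10.0.0.1:9100", "10.0.0.2:9100"}}
	c := group{Targets: []string{"10.0.0.1:9100", "10.0.0.3:9100"}}

	fmt.Println("a vs b changed:", changed(a, b))
	fmt.Println("a vs c changed:", changed(a, c))
}
```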
The counter is only increased when tsdb.Open() is called which
Prometheus does only once in its lifetime (when it initializes). If the
corruption can't be recovered, tsdb.Open() returns an error and
Prometheus exits. Hence the metric is either 0 (no corruption) or 1
(corruption detected and repaired). If the latter, the alert isn't
actionable and the only way to resolve it is to restart Prometheus which
would reset the counter.
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
While doing so, re-introduce the summary/description
annotations. Also, add a few more rules and tweak a few of the
existing ones.
Signed-off-by: beorn7 <beorn@grafana.com>