* consul: improve consul service discovery
Related to #3711
- Add the ability to filter by tag and node-meta in an efficient way (`/catalog/services`
allow filtering by node-meta, and returns a `map[string]string` or `service`->`tags`).
Tags and nore-meta are also used in `/catalog/service` requests.
- Do not require a call to the catalog if services are specified by name. This is important
because on large cluster `/catalog/services` changes all the time.
- Add `allow_stale` configuration option to do stale reads. Non-stale
reads can be costly, even more when you are doing them to a remote
datacenter with 10k+ targets over WAN (which is common for federation).
- Add `refresh_interval` to minimize the strain on the catalog and on the
service endpoint. This is needed because of that kind of behavior from
consul: https://github.com/hashicorp/consul/issues/3712 and because a catalog
on a large cluster would basically change *all* the time. No need to discover
targets in 1sec if we scrape them every minute.
- Added plenty of unit tests.
Benchmarks
----------
```yaml
scrape_configs:
- job_name: prometheus
scrape_interval: 60s
static_configs:
- targets: ["127.0.0.1:9090"]
- job_name: "observability-by-tag"
scrape_interval: "60s"
metrics_path: "/metrics"
consul_sd_configs:
- server: consul.service.par.consul.prod.crto.in:8500
tag: marathon-user-observability # Used in After
refresh_interval: 30s # Used in After+delay
relabel_configs:
- source_labels: [__meta_consul_tags]
regex: ^(.*,)?marathon-user-observability(,.*)?$
action: keep
- job_name: "observability-by-name"
scrape_interval: "60s"
metrics_path: "/metrics"
consul_sd_configs:
- server: consul.service.par.consul.prod.crto.in:8500
services:
- observability-cerebro
- observability-portal-web
- job_name: "fake-fake-fake"
scrape_interval: "15s"
metrics_path: "/metrics"
consul_sd_configs:
- server: consul.service.par.consul.prod.crto.in:8500
services:
- fake-fake-fake
```
Note: tested with ~1200 services, ~5000 nodes.
| Resource | Empty | Before | After | After + delay |
| -------- |:-----:|:------:|:-----:|:-------------:|
|/service-discovery size|5K|85MiB|27k|27k|27k|
|`go_memstats_heap_objects`|100k|1M|120k|110k|
|`go_memstats_heap_alloc_bytes`|24MB|150MB|28MB|27MB|
|`rate(go_memstats_alloc_bytes_total[5m])`|0.2MB/s|28MB/s|2MB/s|0.3MB/s|
|`rate(process_cpu_seconds_total[5m])`|0.1%|15%|2%|0.01%|
|`process_open_fds`|16|*1236*|22|22|
|`rate(prometheus_sd_consul_rpc_duration_seconds_count{call="services"}[5m])`|~0|1|1|*0.03*|
|`rate(prometheus_sd_consul_rpc_duration_seconds_count{call="service"}[5m])`|0.1|*80*|0.5|0.5|
|`prometheus_target_sync_length_seconds{quantile="0.9",scrape_job="observability-by-tag"}`|N/A|200ms|0.2ms|0.2ms|
|Network bandwidth|~10kbps|~2.8Mbps|~1.6Mbps|~10kbps|
Filtering by tag using relabel_configs uses **100kiB and 23kiB/s per service per job** and quite a lot of CPU. Also sends and additional *1Mbps* of traffic to consul.
Being a little bit smarter about this reduces the overhead quite a lot.
Limiting the number of `/catalog/services` queries per second almost removes the overhead of service discovery.
* consul: tweak `refresh_interval` behavior
`refresh_interval` now does what is advertised in the documentation,
there won't be more that one update per `refresh_interval`. It now
defaults to 30s (which was also the current waitTime in the consul query).
This also make sure we don't wait another 30s if we already waited 29s
in the blocking call by substracting the number of elapsed seconds.
Hopefully this will do what people expect it does and will be safer
for existing consul infrastructures.
Based on https://groups.google.com/d/topic/prometheus-users/02kezHbuea4/discussion
Does not attempt to handle a situation where the server does not understand
EDNS0, however that is an unlikely case, and the behaviour of such ancient
systems is hard to predict in advance, so if it does come up, it will need
to be handled on a case-by-case basis.
There is currently no way to differentiate Windows instances from Linux
ones. This is needed when you have a mix of node_exporters /
wmi_exporters for OS-level metrics and you want to have them in separate
scrape jobs.
This change allows you to do just that. Example:
```
- job_name: 'node'
azure_sd_configs:
- <azure_sd_config>
relabel_configs:
- source_labels: [__meta_azure_machine_os_type]
regex: Linux
action: keep
```
The way the vendor'd AzureSDK provides to get the OsType is a bit
awkward - as far as I can tell, this information can only be gotten from
the startup disk. Newer versions of the SDK appear to improve this a
bit (by having OS information in the InstanceView), but the current way
still works.
* Fix Kubernetes endpoints SD for empty subsets
When an endpoints object has no associated pods (replica scaled to zero
for instance), the endpoints SD should return a target group with no
targets so that the SD manager propagates this information to the scrape
manager.
Fixes#3659
* Don't send nil target groups from the Kubernetes SD
This is to be consistent with the endpoints SD part.
* refactor: move targetGroup struct and CheckOverflow() to their own package
* refactor: move auth and security related structs to a utility package, fix import error in utility package
* refactor: Azure SD, remove SD struct from config
* refactor: DNS SD, remove SD struct from config into dns package
* refactor: ec2 SD, move SD struct from config into the ec2 package
* refactor: file SD, move SD struct from config to file discovery package
* refactor: gce, move SD struct from config to gce discovery package
* refactor: move HTTPClientConfig and URL into util/config, fix import error in httputil
* refactor: consul, move SD struct from config into consul discovery package
* refactor: marathon, move SD struct from config into marathon discovery package
* refactor: triton, move SD struct from config to triton discovery package, fix test
* refactor: zookeeper, move SD structs from config to zookeeper discovery package
* refactor: openstack, remove SD struct from config, move into openstack discovery package
* refactor: kubernetes, move SD struct from config into kubernetes discovery package
* refactor: notifier, use targetgroup package instead of config
* refactor: tests for file, marathon, triton SD - use targetgroup package instead of config.TargetGroup
* refactor: retrieval, use targetgroup package instead of config.TargetGroup
* refactor: storage, use config util package
* refactor: discovery manager, use targetgroup package instead of config.TargetGroup
* refactor: use HTTPClient and TLS config from configUtil instead of config
* refactor: tests, use targetgroup package instead of config.TargetGroup
* refactor: fix tagetgroup.Group pointers that were removed by mistake
* refactor: openstack, kubernetes: drop prefixes
* refactor: remove import aliases forced due to vscode bug
* refactor: move main SD struct out of config into discovery/config
* refactor: rename configUtil to config_util
* refactor: rename yamlUtil to yaml_config
* refactor: kubernetes, remove prefixes
* refactor: move the TargetGroup package to discovery/
* refactor: fix order of imports
remove some select state that is most likely obsoleete and hoepfully doesn't braje anything :)
merge targets will sort by Discoverer name so we can have consistent tests for the maps.
* Adds a test covering the case where a target providers sends updated versions of the same target groups and the system should reconcile to the latest version of each of the target groups
* Refactors how input data is represented in the tests. It used to be literal declarations of necessary structs, now it is parsing yaml. Yaml declarations are half as long as the former. And these can be put in a fixture file
* Adds a tiny bit of refactoring on test timeouts
* flaky test caused by invalid fsnotify updates before the test files are written to disk causing the fd service to send empty `group[]` struct
* `close(filesReady)` needs to be before the file closing so that fsnotify triggers a new loop of the discovery service.
* nits
* use filepath.Join for the path url to be cross platform
* stupid mistake revert
* Allow getting credentials via EC2 role
This is subtly different than the existing `role_arn` solution, which
allows Prometheus to assume an IAM role given some set of credentials
already in-scope. With EC2 roles, one specifies the role at instance
launch time (via an instance profile.) The instance then exposes
temporary credentials via its metadata. The AWS Go SDK exposes a
credential provider that polls the [instance metadata endpoint][1]
already, so we can simply use that and it will take care of renewing the
credentials when they expire.
Without this, if this is being used inside EC2, it is difficult to
cleanly allow the use of STS credentials. One has to set up a proxy role
that can assume the role you really want, and launch the EC2 instance
with the proxy role. This isn't very clean, and also doesn't seem to be
[supported very well][2].
[1]:
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-metadata.html
[2]: https://github.com/aws/aws-cli/issues/1390
* Automatically try to detect EC2 role credentials
The `Available()` function exposed on ec2metadata returns a simple
true/false if the ec2 metadata is available. This is the best way to
know if we're actually running in EC2 (which is the only valid use-case
for this credential provider.)
This allows this to "just work" if you are using EC2 instance roles.
This change enables the OpenStack service discovery to read the
authentication parameters from the OS_* environment variables when the
identity endpoint URL is not defined in the Prometheus configuration
file.
The problem reported in #2799 was that in the event that all records for a
name were removed, the target group was never updated to be the "empty" set.
Essentially, whatever Prometheus last saw as a non-empty list of targets
would stay that way forever (or at least until Prometheus restarted...). This
came about because of a fairly naive interpretation of what a valid-looking
DNS response actually looked like -- essentially, the only valid DNS responses
were ones that had a non-empty record list. That's fine as long as your
config always lists only target names which have non-empty record sets; if
your environment happens to legitimately have empty record sets sometimes,
all hell breaks loose (otherwise-cleanly shutdown systems trigger up==0 alerts,
for instance).
This patch is a refactoring of the DNS lookup behaviour that maintains
existing behaviour with regard to search paths, but correctly handles empty
and non-existent record sets.
RFC1034 s4.3.1 says there's three ways a recursive DNS server can respond:
1. Here is your answer (possibly an empty answer, because of the way DNS
considers all records for a name, regardless of type, when deciding
whether the name exists).
2. There is no spoon (the name you asked for definitely does not exist).
3. I am a teapot (something has gone terribly wrong).
Situations 1 and 2 are fine and dandy; whatever the answer is (empty or
otherwise) is the list of targets. If something has gone wrong, then we
shouldn't go updating the target list because we don't really *know* what
the target list should be.
Multiple DNS servers to query is a straightforward augmentation; if you get
an error, then try the next server in the list, until you get an answer or
run out servers to ask. Only if *all* the servers return errors should you
return an error to the calling code.
Where things get complicated is the search path. In order to be able to
confidently say, "this name does not exist anywhere, you can remove all the
targets for this name because it's definitely GORN", at least one server for
*all* the possible names need to return either successful-but-empty
responses, or NXDOMAIN. If any name errors out, then -- since that one
might have been the one where the records came from -- you need to say
"maintain the status quo until we get a known-good response".
It is possible, though unlikely, that a poorly-configured DNS setup (say,
one which had a domain in its search path for which all configured recursive
resolvers respond with REFUSED) could result in the same "stuck" records
problem we're solving here, but the DNS configuration should be fixed in
that case, and there's nothing we can do in Prometheus itself to fix the
problem.
I've tested this patch on a local scratch instance in all the various ways I
can think of:
1. Adding records (targets get scraped)
2. Adding records of a different type
3. Remove records of the requested type, leaving other type records intact
(targets don't get scraped)
4. Remove all records for the name (targets don't get scraped)
5. Shutdown the resolver (targets still get scraped)
There's no automated test suite additions, because there isn't a test suite
for DNS discovery, and I was stretching my Go skills to the limit to make
this happen; mock objects are beyond me.
The changes [1][] to Marathon service discovery to support multiple
ports mean that Prometheus now attempts to scrape all ports belonging to
a Marathon service.
You can use port definition or port mapping labels to filter out which
ports to scrape but that requires service owners to update their
Marathon configuration.
To allow for a smoother migration path, add a
`__meta_marathon_port_index` label, whose value is set to the port's
sequential index integer. For example, PORT0 has the value `0`, PORT1
has the value `1`, and so on.
This allows you to support scraping both the first available port (the
previous behaviour) in addition to ports with a `metrics` label.
For example, here's the relabel configuration we might use with
this patch:
- action: keep
source_labels: ['__meta_marathon_port_definition_label_metrics', '__meta_marathon_port_mapping_label_metrics', '__meta_marathon_port_index']
# Keep if port mapping or definition has a 'metrics' label with any
# non-empty value, or if no 'metrics' port label exists but this is the
# service's first available port
regex: ([^;]+;;[^;]+|;[^;]+;[^;]+|;;0)
This assumes that the Marathon API returns the ports in sorted order
(matching PORT0, PORT1, etc), which it appears that it does.
[1]: https://github.com/prometheus/prometheus/pull/2506
* k8s: Support discovery of ingresses
* Move additional labels below allocation
This makes it more obvious why the additional elements are allocated.
Also fix allocation for node where we only set a single label.
* k8s: Remove port from ingress discovery
* k8s: Add comment to ingress discovery example
Fixing the config/config_test, the discovery/file/file_test and the
promql/promql_test tests for Windows. For most of the tests, the fix involved
correct handling of path separators. In the case of the promql tests, the
issue was related to the removal of the temporal directories used by the
storage. The issue is that the RemoveAll() call returns an error when it
tries to remove a directory which is not empty, which seems to be true due to
some kind of process that is still running after closing the storage. To fix
it I added some retries to the remove of the temporal directories.
Adding tags file from Universal Ctags to .gitignore
The changes [1][] to Marathon service discovery to support multiple
ports mean that Prometheus now attempts to scrape all ports belonging to
a Marathon service.
You can use port definition or port mapping labels to filter out which
ports to scrape but that requires service owners to update their
Marathon configuration.
To allow for a smoother migration path, add a
`__meta_marathon_port_index` label, whose value is set to the port's
sequential index integer. For example, PORT0 has the value `0`, PORT1
has the value `1`, and so on.
This allows you to support scraping both the first available port (the
previous behaviour) in addition to ports with a `metrics` label.
For example, here's the relabel configuration we might use with
this patch:
- action: keep
source_labels: ['__meta_marathon_port_definition_label_metrics', '__meta_marathon_port_mapping_label_metrics', '__meta_marathon_port_index']
# Keep if port mapping or definition has a 'metrics' label with any
# non-empty value, or if no 'metrics' port label exists but this is the
# service's first available port
regex: ([^;]+;;[^;]+|;[^;]+;[^;]+|;;0)
This assumes that the Marathon API returns the ports in sorted order
(matching PORT0, PORT1, etc), which it appears that it does.
[1]: https://github.com/prometheus/prometheus/pull/2506
* Add openstack service discovery.
* Add gophercloud code for openstack service discovery.
* first changes for juliusv comments.
* add gophercloud code for floatingip.
* Add tests to openstack sd.
* Add testify suite vendor files.
* add copyright and make changes for code climate.
* Fixed typos in provider openstack.
* Renamed tenant to project in openstack sd.
* Change type of password to Secret in openstack sd.
Rational:
* When the config is reloaded and the provider context is canceled, we need to
exit the current ZK `TargetProvider.Run` method as a new provider will be
instantiated.
* In case `Stop` is called on the `ZookeeperTreeCache`, the update/events
channel may not be closed as it is shared by multiple caches and would
thus be double closed.
* Stopping all `zookeeperTreeCacheNode`s on teardown ensures all associated
watcher go-routines will be closed eagerly rather than implicityly on
connection close events.
Allow namespace discovery to be more easily extended in the future by using a struct rather than just a list.
Rename fields for kubernetes namespace discovery
The staticcheck warns about testing.T usage in goroutines. Moving the
t.Fatal* calls to the main thread showed immediately that this is a good
practice, as one of the test setups didn't work.