Commit graph

105 commits

Author SHA1 Message Date
Matt Palmer 3369422327 Improve DNS response handling to prevent "stuck" records [Fixes #2799] (#3138)
The problem reported in #2799 was that in the event that all records for a
name were removed, the target group was never updated to be the "empty" set.
Essentially, whatever Prometheus last saw as a non-empty list of targets
would stay that way forever (or at least until Prometheus restarted...).  This
came about because of a fairly naive interpretation of what a valid-looking
DNS response actually looked like -- essentially, the only valid DNS responses
were ones that had a non-empty record list.  That's fine as long as your
config always lists only target names which have non-empty record sets; if
your environment happens to legitimately have empty record sets sometimes,
all hell breaks loose (otherwise-cleanly shutdown systems trigger up==0 alerts,
for instance).

This patch is a refactoring of the DNS lookup behaviour that maintains
existing behaviour with regard to search paths, but correctly handles empty
and non-existent record sets.

RFC1034 s4.3.1 says there's three ways a recursive DNS server can respond:

1.  Here is your answer (possibly an empty answer, because of the way DNS
   considers all records for a name, regardless of type, when deciding
   whether the name exists).

2. There is no spoon (the name you asked for definitely does not exist).

3. I am a teapot (something has gone terribly wrong).

Situations 1 and 2 are fine and dandy; whatever the answer is (empty or
otherwise) is the list of targets.  If something has gone wrong, then we
shouldn't go updating the target list because we don't really *know* what
the target list should be.

Multiple DNS servers to query is a straightforward augmentation; if you get
an error, then try the next server in the list, until you get an answer or
run out servers to ask.  Only if *all* the servers return errors should you
return an error to the calling code.

Where things get complicated is the search path.  In order to be able to
confidently say, "this name does not exist anywhere, you can remove all the
targets for this name because it's definitely GORN", at least one server for
*all* the possible names need to return either successful-but-empty
responses, or NXDOMAIN.  If any name errors out, then -- since that one
might have been the one where the records came from -- you need to say
"maintain the status quo until we get a known-good response".

It is possible, though unlikely, that a poorly-configured DNS setup (say,
one which had a domain in its search path for which all configured recursive
resolvers respond with REFUSED) could result in the same "stuck" records
problem we're solving here, but the DNS configuration should be fixed in
that case, and there's nothing we can do in Prometheus itself to fix the
problem.

I've tested this patch on a local scratch instance in all the various ways I
can think of:

1. Adding records (targets get scraped)

2. Adding records of a different type

3. Remove records of the requested type, leaving other type records intact
   (targets don't get scraped)

4. Remove all records for the name (targets don't get scraped)

5. Shutdown the resolver (targets still get scraped)

There's no automated test suite additions, because there isn't a test suite
for DNS discovery, and I was stretching my Go skills to the limit to make
this happen; mock objects are beyond me.
2017-09-15 12:26:10 +02:00
Goutham Veeramachaneni f5aed810f9 logging: Port to common/promlog
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-09-15 12:40:50 +05:30
Matt Bostock e758260986 Marathon SD: Set port index label
The changes [1][] to Marathon service discovery to support multiple
ports mean that Prometheus now attempts to scrape all ports belonging to
a Marathon service.

You can use port definition or port mapping labels to filter out which
ports to scrape but that requires service owners to update their
Marathon configuration.

To allow for a smoother migration path, add a
`__meta_marathon_port_index` label, whose value is set to the port's
sequential index integer. For example, PORT0 has the value `0`, PORT1
has the value `1`, and so on.

This allows you to support scraping both the first available port (the
previous behaviour) in addition to ports with a `metrics` label.

For example, here's the relabel configuration we might use with
this patch:

    - action: keep
      source_labels: ['__meta_marathon_port_definition_label_metrics', '__meta_marathon_port_mapping_label_metrics', '__meta_marathon_port_index']
      # Keep if port mapping or definition has a 'metrics' label with any
      # non-empty value, or if no 'metrics' port label exists but this is the
      # service's first available port
      regex: ([^;]+;;[^;]+|;[^;]+;[^;]+|;;0)

This assumes that the Marathon API returns the ports in sorted order
(matching PORT0, PORT1, etc), which it appears that it does.

[1]: https://github.com/prometheus/prometheus/pull/2506
2017-09-11 13:40:51 +01:00
Fabian Reinartz e746282772 Merge branch 'master' into dev-2.0 2017-09-11 10:55:19 +02:00
Jamie Moore 7a135e0a1b Add the ability to assume a role for ec2 discovery 2017-09-10 00:36:43 +10:00
Fabian Reinartz d21f149745 *: migrate to go-kit/log 2017-09-08 22:01:51 +05:30
Johannes 'fish' Ziemke 75aec7d970 k8s: Use versioned struct for ingress discovery 2017-09-06 12:47:03 +02:00
Fabian Reinartz 87918f3097 Merge branch 'master' into dev-2.0 2017-09-04 14:09:21 +02:00
Johannes 'fish' Ziemke 70f3d1e9f9 k8s: Support discovery of ingresses (#3111)
* k8s: Support discovery of ingresses

* Move additional labels below allocation

This makes it more obvious why the additional elements are allocated.
Also fix allocation for node where we only set a single label.

* k8s: Remove port from ingress discovery

* k8s: Add comment to ingress discovery example
2017-09-04 13:10:44 +02:00
Tobias Schmidt 29fff1eca4 Merge pull request #2966 from alkalinecoffee/consul-node-metadata
Add support for consul's node metadata
2017-09-02 18:43:25 +02:00
Tobias Schmidt d0a02703a2 Merge pull request #3105 from sak0/dev
discovery openstack: support discovery hypervisors, add rule option.
2017-08-31 14:08:16 +02:00
CuiHaozhi b1c18bf29b discovery openstack: support discovery hosts, add rule option.
Signed-off-by: CuiHaozhi <cuihz@wise2c.com>
2017-08-29 10:14:00 -04:00
Colstuwjx 2b49df2c61 Fix target group foreach nil bug, directly return err. 2017-08-22 08:37:39 +08:00
CuiHaozhi 31b6f8b04c discovery openstack: handle instances without ip
Signed-off-by: CuiHaozhi <cuihz@wise2c.com>
2017-08-11 12:36:12 -04:00
Fabian Reinartz 25f3e1c424 Merge branch 'master' into mergemaster 2017-08-10 17:04:25 +02:00
Fabian Reinartz ac511ecf30 Merge pull request #2970 from Gouthamve/docs/sd-interface
Add docs about SD interface
2017-08-01 22:44:28 +02:00
Goutham Veeramachaneni ab96e79bc8 Add docs about SD interface
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-08-01 13:53:50 +05:30
Fabian Reinartz 40db026381 Merge pull request #2957 from prometheus/sd-doc
Tweaks to SD README from review
2017-07-28 08:51:50 +02:00
Joe Martin aba41c7d0f add support for consul's node metadata 2017-07-18 16:46:16 -04:00
J. Taylor O'Connor 5a19ffb315 A few spelling corrections. (#2960) 2017-07-17 22:13:50 +01:00
Brian Brazil 84be97bd98 Tweaks to SD README from review 2017-07-17 14:20:54 +01:00
Brian Brazil 2a9ca394dd Document how/when to write service discovery (#2943) 2017-07-14 15:22:09 +01:00
Fabian Reinartz dba7586671 Merge branch 'master' into dev-2.0 2017-07-11 17:22:14 +02:00
Fuente, Pablo Andres 902fafb8e7 Fixing tests for Windows
Fixing the config/config_test, the discovery/file/file_test and the
promql/promql_test tests for Windows. For most of the tests, the fix involved
correct handling of path separators. In the case of the promql tests, the
issue was related to the removal of the temporal directories used by the
storage. The issue is that the RemoveAll() call returns an error when it
tries to remove a directory which is not empty, which seems to be true due to
some kind of process that is still running after closing the storage. To fix
it I added some retries to the remove of the temporal directories.
Adding tags file from Universal Ctags to .gitignore
2017-07-09 01:59:30 -03:00
Matt Bostock ab4d64959f Marathon SD: Set port index label
The changes [1][] to Marathon service discovery to support multiple
ports mean that Prometheus now attempts to scrape all ports belonging to
a Marathon service.

You can use port definition or port mapping labels to filter out which
ports to scrape but that requires service owners to update their
Marathon configuration.

To allow for a smoother migration path, add a
`__meta_marathon_port_index` label, whose value is set to the port's
sequential index integer. For example, PORT0 has the value `0`, PORT1
has the value `1`, and so on.

This allows you to support scraping both the first available port (the
previous behaviour) in addition to ports with a `metrics` label.

For example, here's the relabel configuration we might use with
this patch:

    - action: keep
      source_labels: ['__meta_marathon_port_definition_label_metrics', '__meta_marathon_port_mapping_label_metrics', '__meta_marathon_port_index']
      # Keep if port mapping or definition has a 'metrics' label with any
      # non-empty value, or if no 'metrics' port label exists but this is the
      # service's first available port
      regex: ([^;]+;;[^;]+|;[^;]+;[^;]+|;;0)

This assumes that the Marathon API returns the ports in sorted order
(matching PORT0, PORT1, etc), which it appears that it does.

[1]: https://github.com/prometheus/prometheus/pull/2506
2017-06-23 09:52:52 +01:00
Goutham Veeramachaneni 507790a357
Rework logging to use explicitly passed logger
Mostly cleaned up the global logger use. Still some uses in discovery
package.

Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-06-16 15:52:44 +05:30
Christian Groschupp 8f781e411c Openstack Service Discovery (#2701)
* Add openstack service discovery.

* Add gophercloud code for openstack service discovery.

* first changes for juliusv comments.

* add gophercloud code for floatingip.

* Add tests to openstack sd.

* Add testify suite vendor files.

* add copyright and make changes for code climate.

* Fixed typos in provider openstack.

* Renamed tenant to project in openstack sd.

* Change type of password to Secret in openstack sd.
2017-06-01 23:49:02 +02:00
Roman Vynar dbe2eb2afc Hide consul token on UI. (#2797) 2017-06-01 22:14:23 +01:00
Chris Goller 42de0ae013 Use log.Logger interface for all discovery services 2017-06-01 11:25:55 -05:00
Tobias Schmidt 287ec6e6cc Fix outdated target_group naming in error message
The target_groups config has been renamed to static_configs, the error
message for overflow attributes should reflect that.
2017-05-31 11:01:13 +02:00
Conor Broderick 6766123f93 Replace regex with Secret type and remarshal config to hide secrets (#2775) 2017-05-29 12:46:23 +01:00
Fabian Reinartz 11aa049b05 Merge branch 'release-1.6' into merge16 2017-05-11 15:00:51 +02:00
Fabian Reinartz ddbbd2b712 Merge branch 'release-1.5' into cut162 2017-05-11 14:29:49 +02:00
Fabian Reinartz 2ff8855ae6 discovery/k8s: update client library 2017-05-11 13:53:12 +02:00
Fabian Reinartz aaaec6431e Merge pull request #2642 from bakins/kubernetes-namespaces
Allow limiting Kubernetes service discover to certain namespaces
2017-05-04 07:36:21 +02:00
Stephan Erb 0b9fca983b Fix reload of ZooKeeper service discovery config (#2669)
Rational:

* When the config is reloaded and the provider context is canceled, we need to
  exit the current ZK `TargetProvider.Run` method as a new provider will be
  instantiated.
* In case `Stop` is called on the `ZookeeperTreeCache`, the update/events
  channel may not be closed as it is shared by multiple caches and would
  thus be double closed.
* Stopping all `zookeeperTreeCacheNode`s on teardown ensures all associated
  watcher go-routines will be closed eagerly rather than implicityly on
  connection close events.
2017-05-02 18:21:37 -05:00
Brian Akins 27d66628a1 Allow limiting Kubernetes service discover to certain namespaces
Allow namespace discovery to be more easily extended in the future by using a struct rather than just a list.

Rename fields for kubernetes namespace discovery
2017-04-27 07:41:36 -04:00
Goutham Veeramachaneni 0f48d07f95 Fix Map Race by Moving Locking closer to the Write (#2476) 2017-04-07 08:55:01 +02:00
Richard Kiene ec692f6161 Add triton zone brand metadata 2017-04-06 21:35:42 +00:00
Julius Volz 525da88c35 Merge pull request #2479 from YKlausz/consul-tls
Adding consul capability to connect via tls
2017-03-20 11:40:18 +01:00
Robson Roberto Souza Peixoto cc3e859d9e Add support for multiple ports in Marathon (#2506)
- create a target for every port
- add meta labels for Marathon labels in portMappings and portDefinitions
2017-03-18 22:10:44 +02:00
yklausz 75880b594f Adding consul capability to connect via tls 2017-03-17 22:37:18 +01:00
Tobias Schmidt 7bde44e98e Remove testing.T usage in goroutines
The staticcheck warns about testing.T usage in goroutines. Moving the
t.Fatal* calls to the main thread showed immediately that this is a good
practice, as one of the test setups didn't work.
2017-03-16 23:40:46 -03:00
Tobias Schmidt 58cd39aacd Follow golang naming conventions in discovery packages 2017-03-16 23:40:46 -03:00
Robert Neumayer feb7670929 Add tests for consul service discovery (#2490)
* Add tests for consul service discovery

* Add license header

* Address comments

* inline variables
* check for extra error

* Fix error formatting
2017-03-15 09:33:53 +01:00
Michael Kraus 690b49e503 Fix marathon tests 2017-03-06 11:36:55 +01:00
Michael Kraus 31252cc1b5 Clarify explicit use of authorization header 2017-03-06 11:36:36 +01:00
Michael Kraus 47bdcf0f67 Allow the use of bearer_token or bearer_token_file for MarathonSD 2017-03-02 09:44:20 +01:00
James Hartig 865f28bb15 discovery: Instead of looping over conf.Search, use NameList() 2017-02-13 15:48:51 -05:00
Alex Somesan b22eb65d0f Cleaner separation between ServiceAccount and custom authentication in K8S SD (#2348)
* Canonical usage of cluster service-account in K8S SD

* Early validation for opt-in custom auth in K8S SD

* Fix typo in condition
2017-01-19 10:52:52 +01:00
Richard Kiene f3d9692d09 Add Joyent Triton discovery 2017-01-17 20:34:32 +00:00
Fabian Reinartz 35da23fd82 consul: start service watch as goroutine 2016-11-27 11:01:16 +01:00
Fabian Reinartz 200bbe1bad config: extract SD and HTTPClient configurations 2016-11-23 18:23:37 +01:00
Fabian Reinartz d7f4f8b879 discovery: move TargetSet into discovery package 2016-11-23 09:14:44 +01:00
Fabian Reinartz d19d1bcad3 discovery: move into top-level package 2016-11-22 12:56:33 +01:00