Commit graph

3905 commits

Author SHA1 Message Date
Takahito Yamatoya 7a3c348f83 fix decimal y-axis 2017-09-17 00:16:40 +09:00
Matt Palmer 3369422327 Improve DNS response handling to prevent "stuck" records [Fixes #2799] (#3138)
The problem reported in #2799 was that in the event that all records for a
name were removed, the target group was never updated to be the "empty" set.
Essentially, whatever Prometheus last saw as a non-empty list of targets
would stay that way forever (or at least until Prometheus restarted...).  This
came about because of a fairly naive interpretation of what a valid-looking
DNS response actually looked like -- essentially, the only valid DNS responses
were ones that had a non-empty record list.  That's fine as long as your
config always lists only target names which have non-empty record sets; if
your environment happens to legitimately have empty record sets sometimes,
all hell breaks loose (otherwise-cleanly shutdown systems trigger up==0 alerts,
for instance).

This patch is a refactoring of the DNS lookup behaviour that maintains
existing behaviour with regard to search paths, but correctly handles empty
and non-existent record sets.

RFC1034 s4.3.1 says there are three ways a recursive DNS server can respond:

1.  Here is your answer (possibly an empty answer, because of the way DNS
   considers all records for a name, regardless of type, when deciding
   whether the name exists).

2. There is no spoon (the name you asked for definitely does not exist).

3. I am a teapot (something has gone terribly wrong).

Situations 1 and 2 are fine and dandy; whatever the answer is (empty or
otherwise) is the list of targets.  If something has gone wrong, then we
shouldn't go updating the target list because we don't really *know* what
the target list should be.

Querying multiple DNS servers is a straightforward extension: if you get
an error, try the next server in the list until you get an answer or
run out of servers to ask.  Only if *all* the servers return errors should you
return an error to the calling code.

Where things get complicated is the search path.  In order to be able to
confidently say, "this name does not exist anywhere, you can remove all the
targets for this name because it's definitely GORN", *every* possible name
needs at least one server to return either a successful-but-empty
response or NXDOMAIN.  If any name errors out, then -- since that one
might have been the one where the records came from -- you need to say
"maintain the status quo until we get a known-good response".

It is possible, though unlikely, that a poorly-configured DNS setup (say,
one which had a domain in its search path for which all configured recursive
resolvers respond with REFUSED) could result in the same "stuck" records
problem we're solving here, but the DNS configuration should be fixed in
that case, and there's nothing we can do in Prometheus itself to fix the
problem.

I've tested this patch on a local scratch instance in all the various ways I
can think of:

1. Adding records (targets get scraped)

2. Adding records of a different type

3. Removing records of the requested type, leaving records of other types
   intact (targets don't get scraped)

4. Removing all records for the name (targets don't get scraped)

5. Shutting down the resolver (targets still get scraped)

There are no automated test suite additions, because there isn't a test suite
for DNS discovery, and I was stretching my Go skills to the limit to make
this happen; mock objects are beyond me.
2017-09-15 12:26:10 +02:00
Björn Rabenstein 4b8666b739 Merge pull request #3176 from prometheus/beorn7/release
Backport the templating fix from master
2017-09-14 19:07:52 +02:00
beorn7 a3fd7dd335 Backport the templating fix from master
The original fix is in commit 5f5d77848e
2017-09-14 18:12:00 +02:00
Björn Rabenstein df4bc3e407 Merge pull request #3170 from tomwilkie/1.7-2969-negative-shards
Prevent number of remote write shards from going negative.
2017-09-14 13:29:34 +02:00
Tom Wilkie 4f8efdbd59 Prevent number of remote write shards from going negative.
This can happen when the system scales the number of shards up massively (to deal with some backlog) and then scales it down again, because the number of samples sent during the time period is less than the number received.
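
One way to guard against this is to clamp the computed shard count to a
sane range before applying it. The sketch below is only an illustration of
that idea; `clampShards` and its parameters are hypothetical names, not the
identifiers used in the patch.

```go
package main

import "math"

// clampShards rounds a desired shard count up to a whole number and forces
// it into the range [min, max], so a negative or zero estimate can never be
// applied. All names here are assumptions made for illustration.
func clampShards(desired float64, min, max int) int {
	n := int(math.Ceil(desired))
	if n < min {
		return min
	}
	if n > max {
		return max
	}
	return n
}
```

For example, with a minimum of 1 and a maximum of 1000, a desired value of
-2.5 yields 1 and a desired value of 5000 yields 1000.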
2017-09-14 08:07:40 +01:00
Björn Rabenstein 4d8e7ca185 Merge pull request #3159 from mattbostock/1.7_marathon_sd_cherrypick
Marathon SD: Set port index label
2017-09-12 18:53:40 +02:00
Matt Bostock e758260986 Marathon SD: Set port index label
The changes [1][] to Marathon service discovery to support multiple
ports mean that Prometheus now attempts to scrape all ports belonging to
a Marathon service.

You can use port definition or port mapping labels to filter out which
ports to scrape but that requires service owners to update their
Marathon configuration.

To allow for a smoother migration path, add a
`__meta_marathon_port_index` label, whose value is set to the port's
sequential index integer. For example, PORT0 has the value `0`, PORT1
has the value `1`, and so on.

This allows you to scrape both the first available port (the
previous behaviour) and ports with a `metrics` label.

For example, here's the relabel configuration we might use with
this patch:

    - action: keep
      source_labels: ['__meta_marathon_port_definition_label_metrics', '__meta_marathon_port_mapping_label_metrics', '__meta_marathon_port_index']
      # Keep if port mapping or definition has a 'metrics' label with any
      # non-empty value, or if no 'metrics' port label exists but this is the
      # service's first available port
      regex: ([^;]+;;[^;]+|;[^;]+;[^;]+|;;0)

This assumes that the Marathon API returns the ports in sorted order
(matching PORT0, PORT1, etc.), which it appears to do.
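
To see which label combinations the regex above keeps, here is a small Go
sketch that mimics how relabelling evaluates it: the source label values are
joined with `;` (the default separator) and the regex is anchored at both
ends. The `keep` helper is a hypothetical name used only for this check.

```go
package main

import (
	"regexp"
	"strings"
)

// keepRE is the regex from the relabel configuration above, wrapped in
// ^(?:...)$ because relabelling regexes are fully anchored.
var keepRE = regexp.MustCompile(`^(?:([^;]+;;[^;]+|;[^;]+;[^;]+|;;0))$`)

// keep reports whether a target with the given label values would be kept.
// The arguments are the values of the port definition 'metrics' label, the
// port mapping 'metrics' label, and the port index, in that order.
func keep(definitionMetrics, mappingMetrics, portIndex string) bool {
	joined := strings.Join([]string{definitionMetrics, mappingMetrics, portIndex}, ";")
	return keepRE.MatchString(joined)
}
```

With this helper, `keep("true", "", "2")`, `keep("", "true", "1")` and
`keep("", "", "0")` are true, while `keep("", "", "1")` is false.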

[1]: https://github.com/prometheus/prometheus/pull/2506
2017-09-11 13:40:51 +01:00
Björn Rabenstein a5ddcf5fb2 Merge pull request #2979 from prometheus/beorn7/storage2
Fix iterator issue in varbit chunk
2017-07-21 19:38:23 +02:00
beorn7 ea5e7eafde Fix #2965
We would overscan when hitting a value directly, interspersed with
samples in between timestamps. Apparently, that happens rarely enough
that it was only noticed recently.
2017-07-21 16:35:15 +02:00
beorn7 c06292af2f Add test to expose #2965 2017-07-21 16:25:24 +02:00
Alexey Palazhchenko b6f89a1982 Parse custom step parameter correctly. (#2928)
Backport of 6a767b736b.
Refs #2827, #2861.
2017-07-10 21:05:40 +02:00
Frederic Branczyk 3afb3fffa3 Merge pull request #2836 from brancz/cut-1.7.1
cut 1.7.1
2017-06-12 13:41:21 +02:00
Frederic Branczyk aef7104791
cut 1.7.1 2017-06-12 11:44:04 +02:00
Fabian Reinartz f71d39a053 Merge pull request #2832 from brancz/fix-redirect
web: fix double prefix redirect
2017-06-12 10:40:43 +02:00
Frederic Branczyk 9063f8dedd
web: fix double prefix 2017-06-10 12:07:43 +02:00
Frederic Branczyk bfa37c8ee3 Merge pull request #2814 from brancz/cut-1.7.0
cut 1.7.0
2017-06-07 11:40:29 +02:00
Frederic Branczyk 5d9ac3565e Merge pull request #2786 from tomwilkie/report-error-in-remote-write
Remote write: read first line of response and include it in the error.
2017-06-07 10:21:34 +02:00
Frederic Branczyk 6d29f0c19f
cut 1.7.0 2017-06-07 10:00:48 +02:00
Frederic Branczyk d26f789cb0 Merge pull request #2816 from brancz/gophercloud-vendoring
vendor: fix mixed versions of gophercloud packages
2017-06-07 09:54:31 +02:00
Frederic Branczyk d6a55c3013
vendor: fix mixed versions of gophercloud packages 2017-06-07 09:21:43 +02:00
Christian Groschupp 8f781e411c Openstack Service Discovery (#2701)
* Add openstack service discovery.

* Add gophercloud code for openstack service discovery.

* first changes for juliusv comments.

* add gophercloud code for floatingip.

* Add tests to openstack sd.

* Add testify suite vendor files.

* add copyright and make changes for code climate.

* Fixed typos in provider openstack.

* Renamed tenant to project in openstack sd.

* Change type of password to Secret in openstack sd.
2017-06-01 23:49:02 +02:00
Roman Vynar dbe2eb2afc Hide consul token on UI. (#2797) 2017-06-01 22:14:23 +01:00
Fabian Reinartz a391156dfb Merge pull request #2667 from goller/go-discovery-logger
Add logger injection into discovery services
2017-06-01 10:20:10 -07:00
Chris Goller 42de0ae013 Use log.Logger interface for all discovery services 2017-06-01 11:25:55 -05:00
Tom Wilkie 24a113bb09 Review feedback: limit number of bytes read under error. 2017-06-01 11:21:48 +01:00
Tom Wilkie 46abe8cbf2 Remote write: read first line of response and include it in the error. 2017-05-31 13:46:08 +01:00
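
A minimal sketch of the idea behind this change and the review feedback
above it: on a non-2xx response, read only the first line of the body, capped
at a fixed number of bytes, and put it in the error. The names
`responseError`, `fakeResponse` and `maxErrMsgLen`, and the 256-byte cap, are
assumptions for illustration rather than the values used in the commit.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"net/http"
	"strings"
)

// maxErrMsgLen caps how much of the response body is read (assumed value).
const maxErrMsgLen = 256

// responseError returns nil for 2xx responses. Otherwise it returns an error
// containing the status and the first line of the body, reading at most
// maxErrMsgLen bytes.
func responseError(resp *http.Response) error {
	if resp.StatusCode/100 == 2 {
		return nil
	}
	scanner := bufio.NewScanner(io.LimitReader(resp.Body, maxErrMsgLen))
	line := ""
	if scanner.Scan() {
		line = scanner.Text()
	}
	return fmt.Errorf("server returned HTTP status %s: %s", resp.Status, line)
}

// fakeResponse builds an in-memory response for demonstration purposes.
func fakeResponse(code int, body string) *http.Response {
	return &http.Response{
		StatusCode: code,
		Status:     fmt.Sprintf("%d %s", code, http.StatusText(code)),
		Body:       io.NopCloser(strings.NewReader(body)),
	}
}
```

Limiting the read keeps a misbehaving endpoint from forcing the client to
buffer an arbitrarily large error body.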
Tobias Schmidt 1c9499bbbd Merge pull request #2785 from prometheus/grobie/fix-target-group-naming
Fix outdated target_group naming in error message
2017-05-31 11:28:46 +02:00
Tobias Schmidt 287ec6e6cc Fix outdated target_group naming in error message
The target_groups config has been renamed to static_configs, the error
message for overflow attributes should reflect that.
2017-05-31 11:01:13 +02:00
Julius Volz 240bb671e2 config: Fix overflow checking in global config (#2783) 2017-05-30 20:58:06 +02:00
Julius Volz e0f046396a Fix InfluxDB retention policy usage in read adapter (#2781) 2017-05-29 16:24:24 +02:00
Stephan Erb 14eee34da3 Update vendored go-zookeeper client (#2778)
It is likely this will fix #2758.
2017-05-29 15:59:30 +02:00
Benjamin 51626f2573 change deprecated maintainer to label (#2724) 2017-05-29 15:58:40 +02:00
Conor Broderick 6766123f93 Replace regex with Secret type and remarshal config to hide secrets (#2775) 2017-05-29 12:46:23 +01:00
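
The general technique named in this commit title can be sketched as a string
type whose YAML marshalling hides its value, so that re-marshalling the
config never prints the real secret. This is an assumed illustration, not
the code from the commit; the method below only has the signature that YAML
libraries such as gopkg.in/yaml.v2 look for, and the placeholder text is an
assumption.

```go
package main

// Secret is a string that must not be revealed when the config is marshalled.
type Secret string

// MarshalYAML returns a fixed placeholder instead of the real value.
// An empty secret is marshalled as nil so unset fields stay empty.
func (s Secret) MarshalYAML() (interface{}, error) {
	if s != "" {
		return "<secret>", nil
	}
	return nil, nil
}
```

Compared with masking secrets in already-rendered text using a regex, a
dedicated type covers every field declared with it.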
Brian Brazil d66799d7f3 Show gaps in graphs. (#2766)
Fixes #345
2017-05-26 16:17:48 +01:00
Tobias Schmidt 2a426bfead Revert "Use tag names consistently (#2743)"
Apparently, a decision was made at some point to only use the v prefix
in tags and similar contexts where other things can appear. There was a
vote to stick to that decision. For more information, read
https://github.com/prometheus/prometheus/pull/2743.

This reverts commit 5405a4724f.
2017-05-25 08:57:01 +02:00
conorbroderick 9c953064c3 check if result is a scalar in order to display correct number of returned time series 2017-05-24 14:07:24 +01:00
Fabian Reinartz 09fcbf78df Merge pull request #2755 from brancz/redirect-prefix
prefix redirect with external url path
2017-05-24 10:09:47 +02:00
Tobias Schmidt 5405a4724f Use tag names consistently (#2743) 2017-05-23 14:14:15 +02:00
Frederic Branczyk ad22606a3d
web: prefix redirect with ExternalURL path 2017-05-22 14:56:52 +02:00
Frederic Branczyk 45df5c2daf
Merge branch 'release-1.6' 2017-05-22 13:44:44 +02:00
Jacky Wu 75b89739de Fix go version hint. (#2750) 2017-05-20 18:33:14 +02:00
Frederic Branczyk 7d17ecbd48 Merge pull request #2735 from brancz/cut-1.6.3
cut 1.6.3
2017-05-18 16:56:54 +02:00
Frederic Branczyk 53a2bd71b9
*: cut 1.6.3 2017-05-18 16:51:46 +02:00
Tobias Schmidt 2ae2b663a9
Create sha256 checksums file during release 2017-05-18 16:50:44 +02:00
Tom Wilkie e9787382b4
Ensure ewma int64s are always aligned. (#2675) 2017-05-18 16:50:44 +02:00
Frederic Branczyk 363554f675 Merge pull request #2739 from Conorbro/stack-graph-fix
Fixed graph ui max/min logic to accommodate for toggling of stacked graph option
2017-05-18 16:49:30 +02:00
conorbroderick 9287a01bbf Fixed fixed yaxis of stacked graph being cut off 2017-05-18 15:18:29 +01:00
Frederic Branczyk b916b3784b Merge pull request #2731 from brancz/lset-non-cloned
notifier: clone and not reuse LabelSet in AM discovery
2017-05-18 10:59:38 +02:00
Frederic Branczyk 94e8b43aae
notifier: clone and not reuse LabelSet in AM discovery 2017-05-18 10:12:42 +02:00