Commit graph

3282 commits

Author SHA1 Message Date
beorn7 836f1db04c Improve MetricsForLabelMatchers
WIP: This needs more tests.

It now gets a from and through value, which it may opportunistically
use to optimize the retrieval. With possible future range indices,
this could be used in a very efficient way. This change merely applies
some easy checks, which should nevertheless solve the use case of
heavy rule evaluations on servers with a lot of series churn.

Idea is the following:

- Only archive series that are at least as old as the headChunkTimeout
  (which was already extremely unlikely to happen).

- Then maintain a high watermark for the last archival, i.e. no
  archived series has a sample more recent than that watermark.

- Any query that doesn't reach to a time before that watermark doesn't
  have to touch the archive index at all. (A production server at
  Soundcloud with the aforementioned series churn and heavy rule
  evaluations spends 50% of its CPU time in archive index
  lookups. Since rule evaluations usually only touch very recent
  values, most of those lookup should disappear with this change.)

- Federation with a very broad label matcher will profit from this,
  too.

As a byproduct, the un-needed MetricForFingerprint method was removed
from the Storage interface.
2016-03-09 00:25:59 +01:00
Björn Rabenstein eebe077f98 Merge pull request #1476 from prometheus/beorn7/makefile
Use UTC for build timestamp
2016-03-08 18:18:54 +01:00
beorn7 6ba379e256 Use UTC for build timestamp 2016-03-08 17:47:17 +01:00
beorn7 d77d625ad3 Merge branch 'master' into beorn7/storage6 2016-03-08 17:39:14 +01:00
Brian Brazil 84c421da8e Merge pull request #1475 from prometheus/fabxc/targetsort
Sort exported targets
2016-03-08 16:24:55 +00:00
Fabian Reinartz f2e359962c Sort exported targets 2016-03-08 17:12:27 +01:00
Fabian Reinartz eb915ec40f Merge pull request #1474 from prometheus/fabxc/spinfix
Handle closed target provider channel
2016-03-08 17:02:05 +01:00
Fabian Reinartz 56fc9bdff3 Handle closed target provider channel
This fixes the case where a target provider closes the update
channel and exits before the context is canceled.
This should only be true for the static provider but it's safer
to generally handle this case.
2016-03-08 15:49:03 +01:00
Tobias Schmidt 2f151d02eb Merge pull request #1456 from prometheus/validate-alertmanager-url
Validate alertmanager URL
2016-03-07 20:09:46 -05:00
Tobias Schmidt 7763bbd993 Validate alertmanager URL 2016-03-07 20:07:17 -05:00
beorn7 167b83695c Merge branch 'beorn7/storage5' into beorn7/storage6 2016-03-08 00:20:44 +01:00
beorn7 01795382c9 Merge branch 'beorn7/storage4' into beorn7/storage5 2016-03-08 00:20:13 +01:00
beorn7 c01658e20d Merge branch 'beorn7/storage3' into beorn7/storage4 2016-03-08 00:18:00 +01:00
beorn7 f138847d31 Merge branch 'beorn7/storage2' into beorn7/storage3 2016-03-08 00:17:33 +01:00
beorn7 f7fc542db6 Merge branch 'master' into beorn7/storage4
Conflicts:
	storage/local/persistence.go
2016-03-08 00:14:00 +01:00
beorn7 3d86130d8c Merge branch 'master' into beorn7/storage3 2016-03-07 23:39:12 +01:00
beorn7 1f30c8de8d Merge branch 'master' into beorn7/storage2 2016-03-07 23:38:42 +01:00
beorn7 c13b1ecfe9 Make chunk iterators more DRY
This finally extracts all the common code of the two chunk iterators
into one. Any future chunk encodings with fast access by index can use
the same iterator by simply providing an indexAccessor. Other future
chunk encodings without fast index access (like Gorilla-style) can
still implement the chunkIterator interface as usual.
2016-03-07 20:23:14 +01:00
beorn7 32f280a3cd Slim down the chunkIterator interface
For one, remove unneeded methods.

Then, instead of using a channel for all values, use a
bufio.Scanner-like interface. This removes the need for creating a
goroutine and avoids the (unnecessary) locking performed by channel
sending and receiving.

This will make it much easier to write new chunk implementations (like
Gorilla-style encoding).
2016-03-07 19:50:13 +01:00
Björn Rabenstein 1bd4c92e1f Merge pull request #1457 from prometheus/beorn7/promtool
Add a command to promtool that dumps metadata of heads.db
2016-03-07 17:22:48 +01:00
beorn7 b6fdb355d7 Move dump-heads into its own tool 2016-03-07 16:30:19 +01:00
beorn7 f193f2b8ef Add a command to promtool that dumps metadata of heads.db
I needed this today for debugging. It can certainly be improved, but
it's already quite helpful.

I refactored the reading of heads.db files out of persistence, which
is an improvement, too.

I made minor changes to the cli package to allow outputting via the
io.Writer interface.
2016-03-07 16:21:57 +01:00
Fabian Reinartz 6bbb4af837 Merge pull request #1465 from prometheus/beorn7/fix-test2
Fix flaky file-sd test
2016-03-07 15:46:18 +01:00
beorn7 d44b83690e Fix flaky file-sd test 2016-03-07 15:39:18 +01:00
Björn Rabenstein 2a2cc52828 Merge pull request #1405 from prometheus/beorn7/storage
Streamline series iterator creation
2016-03-07 13:30:56 +01:00
Fabian Reinartz 5b9e85e556 Merge pull request #1404 from prometheus/scraperef2
Retrieval refactoring
2016-03-06 22:17:00 +01:00
Fabian Reinartz 6ceb7e7887 Merge pull request #1463 from mischief/linuxisms
scripts: drop -f from hostname, openbsd does not support it
2016-03-05 08:57:18 +01:00
Nick Owens 53777e7bc4 scripts: drop -f from hostname, openbsd does not support it 2016-03-04 19:59:28 -08:00
Fabian Reinartz 8d2a73aff0 Merge pull request #1451 from pdbogen/origin/1446
rewrite operator balancing to be recursive
2016-03-03 19:42:17 +01:00
Patrick Bogen 250344b344 use short variable assignment 2016-03-03 09:46:50 -08:00
beorn7 75a6b460ef Give TestEvictAndLoadChunkDescs more time to actually evict
Obviously, it's really bad to depend on timing here. The proper fix
would be to have something like WaitForIndexing for other things to
wait for, too.

For now, let's see if the wait time increase fixes the issue.
2016-03-03 13:29:39 +01:00
beorn7 fc7de5374a Quarantine series upon problem writing to the series file
This fixes https://github.com/prometheus/prometheus/issues/1059 , but
not in the obvious way (simply not updating the persist watermark,
because that's actually not that simple - we don't really know what
has gone wrong exactly). As any errors relevant here are most likely
caused by severe and unrecoverable problems with the series file,
Using the now quarantine feature is the right step. We don't really
have to be worried about any inconsistent state of the series because
it will be removed for good ASAP. Another plus is that we don't have
to declare the whole storage dirty anymore.
2016-03-03 13:15:02 +01:00
Fabian Reinartz 29e31dc3c6 Merge pull request #1452 from prometheus/fix-style-checker
Detect code style violations in deeply nested files
2016-03-03 09:22:56 +01:00
Tobias Schmidt d7889e61bb Detect code style violations in deeply nested files
So far the style check did not recognize issues in files in deeply
nested directories, e.g. retrieval/discovery/kubernetes/discovery.go.
2016-03-03 02:21:16 -05:00
Patrick Bogen 2062fbae0f rewrite operator balancing to be recursive 2016-03-02 15:56:40 -08:00
beorn7 0ea5801e47 Handle errors caused by data corruption more gracefully
This requires all the panic calls upon unexpected data to be converted
into errors returned. This pollute the function signatures quite
lot. Well, this is Go...

The ideas behind this are the following:

- panic only if it's a programming error. Data corruptions happen, and
  they are not programming errors.

- If we detect a data corruption, we "quarantine" the series,
  essentially removing it from the database and putting its data into
  a separate directory for forensics.

- Failure during writing to a series file is not considered corruption
  automatically. It will call setDirty, though, so that a
  crashrecovery upon the next restart will commence and check for
  that.

- Series quarantining and setDirty calls are logged and counted in
  metrics, but are hidden from the user of the interfaces in
  interface.go, whith the notable exception of Append(). The reasoning
  is that we treat corruption by removing the corrupted series, i.e. a
  query for it will return no results on its next call anyway, so
  return no results right now. In the case of Append(), we want to
  tell the user that no data has been appended, though.

Minor side effects:

- Now consistently using filepath.* instead of path.*.

- Introduced structured logging where I touched it. This makes things
  less consistent, but a complete change to structured logging would
  be out of scope for this PR.
2016-03-02 23:02:34 +01:00
beorn7 8766f99085 Merge branch 'beorn7/storage2' into beorn7/storage3 2016-03-02 23:02:06 +01:00
beorn7 162f6fa6f6 Merge branch 'beorn7/storage' into beorn7/storage2 2016-03-02 23:01:26 +01:00
beorn7 79a2ae2d2e Add missing test file 2016-03-02 23:00:23 +01:00
Fabian Reinartz 7a0c0c3ca2 Remove noise from CHANGELOG 2016-03-02 17:59:23 +01:00
Fabian Reinartz 1e7ce3ffdb Bump version to 0.17.0 2016-03-02 17:59:10 +01:00
Fabian Reinartz 2bfb86d77c Update changelog for 0.17.0 release 2016-03-02 17:58:55 +01:00
beorn7 b6840997a7 Merge branch 'beorn7/storage2' into beorn7/storage3 2016-03-02 16:11:25 +01:00
beorn7 ce58fd357b Merge branch 'beorn7/storage' into beorn7/storage2
Conflicts:
	storage/local/chunk.go
	storage/local/interface.go
2016-03-02 16:09:32 +01:00
beorn7 2581648f70 Separate iterators by offset
Add test that exposes the problem.
2016-03-02 16:01:03 +01:00
Fabian Reinartz 6adf77e411 Merge pull request #1447 from prometheus/fabxc/alertfix
Make copying alerting state safer.
2016-03-02 12:25:19 +01:00
Fabian Reinartz d89c254849 Make copying alerting state safer.
This considers static labels in the equality of alerts to
avoid falsely copying state from a different alert definition with
the same name across reloads.

To be safe, it also copies the state map rather than just its pointer
so that remaining collisions disappear after one evaluation interval.
2016-03-02 12:21:54 +01:00
Fabian Reinartz 95c9706d2d Fix missing comment period. 2016-03-02 09:16:56 +01:00
Fabian Reinartz ddc74f712b Add sortable target list 2016-03-02 09:10:20 +01:00
Julius Volz 9ea2465b99 Fix typo in lexer test. 2016-03-02 01:13:27 +01:00