Commit graph

3779 commits

Author SHA1 Message Date
Conor Broderick c72692fd75 Fixed issue of partially hidden y-axis values on graph (#2589) 2017-04-06 16:04:44 +01:00
Björn Rabenstein 516a96d9a3 Merge pull request #2587 from prometheus/beorn7/storage2
storage: Mark storage as dirty if indexing fails
2017-04-06 16:42:06 +02:00
Julius Volz beeb0b55c0 Merge pull request #2572 from weaveworks/2571-propagate-api-error
Add promql.ErrStorage, which the API propagates as a 500.
2017-04-06 16:36:20 +02:00
Björn Rabenstein fdd2bc22ae Merge pull request #2583 from prometheus/beorn7/storage
storage: Increment s.persistErrors on all persist errors
2017-04-06 15:56:49 +02:00
beorn7 ed5f68f382 storage: Increment s.persistErrors on all persist errors
Fixes #2091
2017-04-06 15:55:15 +02:00
Tom Wilkie f0e8a5f37c Add promql.ErrStorage, which is interpreted by the API as a 500. 2017-04-06 14:41:23 +01:00
beorn7 f3365c4f26 storage: Mark storage as dirty if indexing fails 2017-04-06 15:29:33 +02:00
Julius Volz 5f764d9940 Merge pull request #2582 from mdlayher/scrape-header-rename
retrieval: make scrape timeout header consistent with others
2017-04-05 23:13:32 +02:00
Matt Layher 5e4f5fb5ad retrieval: make scrape timeout header consistent with others 2017-04-05 14:56:22 -04:00
Brian Brazil 26bedc9e00 Revert use of buildVersion in console templates. (#2579)
This function isn't available in console templates,
so go back to pre-#2468 state to get things working again.
2017-04-05 15:19:17 +01:00
Alexey Palazhchenko 17f15d024a Small fixes. (#2578)
Fix typos. Simplify with gofmt -s
2017-04-05 14:24:22 +01:00
Björn Rabenstein 425f591fc9 Merge pull request #2576 from prometheus/beorn7/storage
storage: Check for negative values from varint decoding
2017-04-04 23:23:51 +02:00
Julius Volz a874556a66 Merge pull request #2577 from prometheus/beorn7/storage2
storage: Fix `go vet` error
2017-04-04 19:44:42 +02:00
Matt Layher fe4b6693f7 retrieval: add Scrape-Timeout-Seconds header to each scrape request (#2565)
Fixes #2508.
2017-04-04 18:26:28 +01:00
beorn7 ae286385fd storage: Check for negative values from varint decoding
Sadly, we have a number of places where we use varint encoding for
numbers that cannot be negative. We could have saved a bit by using
uvarint encoding. On the bright side, we now have a 50% chance to
detect data corruption. :-/

Fixes #1800 and #2492.
2017-04-04 19:14:52 +02:00
beorn7 9b6a1dad05 storage: Fix go vet error 2017-04-04 19:14:09 +02:00
Julius Volz 5f3327f620 Merge pull request #2568 from AlekSi/patch-1
Use latest released Go 1.8.x
2017-04-04 15:54:30 +02:00
Alexey Palazhchenko 535a18e978 Use latest released Go 1.8.x 2017-04-04 13:52:18 +03:00
Björn Rabenstein 50e4f49b7e Merge pull request #2561 from prometheus/beorn7/storage2
storage: Evict unused chunk.Descs in crash recovery
2017-04-04 00:05:03 +02:00
beorn7 08fc6cbd39 storage: Evict unused chunk.Descs in crash recovery
This is in line with the v1.5 change in paradigm to not keep
chunk.Descs without chunks around after a series maintenance.

It's mainly motivated by avoiding excessive amounts of RAM usage
during crash recovery.

The code avoids creating memory time series with zero chunk.Descs as
that is prone to trigger weird effects. (Series maintenance would
archive series with zero chunk.Descs, but we cannot do that here
because the archive indices still have to be checked.)
2017-04-04 00:04:22 +02:00
Julius Volz eda4286484 Merge pull request #2557 from prometheus/influxdb-read
Add InfluxDB read-back support to remote storage bridge
2017-04-03 18:29:22 +02:00
Björn Rabenstein 1c6240fc40 Merge pull request #2559 from prometheus/beorn7/storage
storage: Replace fpIter by sortedFPs
2017-04-03 16:56:21 +02:00
beorn7 d284ffab03 storage: Replace fpIter by sortedFPs
The fpIter was kind of cumbersome to use and required a lock for each
iteration (which wasn't even needed for the iteration at startup after
loading the checkpoint).

The new implementation here has an obvious penalty in memory, but it's
only 8 bytes per series, so about 80MB for a beefy server with 10M memory
time series (which would probably need ~100GiB RAM, so the memory
penalty is less than 0.1% of the total memory need).

The big advantage is that now series maintenance happens in order,
which leads to the time between two maintenances of the same series
being less random. Ideally, after each maintenance, the next
maintenance would tackle the series with the largest number of
non-persisted chunks. That would be quite an effort to find out or
track, but with the approach here, the next maintenance will tackle
the series whose previous maintenance is longest ago, which is a good
approximation.

While this commit won't change the _average_ number of chunks
persisted per maintenance, it will reduce the mean time a given chunk
has to wait for its persistence and thus reduce the steady-state
number of chunks waiting for persistence.

Also, the map iteration in Go is non-deterministic but not truly
random. In practice, the iteration appears to be somewhat "bucketed".
You can often observe a bunch of series with similar duration since
their last maintenance, i.e. you see batches of series with similar
number of chunks persisted per maintenance. If that batch is
relatively young, a whole lot of series are maintained with very few
chunks to persist. (See screenshot in PR for a better explanation.)
2017-04-03 15:34:46 +02:00
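A rough sketch of the idea behind the commit above: snapshot the fingerprints into a sorted slice so that maintenance visits series in a fixed order instead of Go's unspecified map order. The types and function names here are hypothetical stand-ins, and the locking of the real storage is omitted.

```go
package main

import (
	"fmt"
	"sort"
)

// fingerprint stands in for the 8-byte series fingerprint.
type fingerprint uint64

// sortedFingerprints copies the map keys into a slice and sorts them,
// so a maintenance loop can walk the series in a stable order.
func sortedFingerprints(series map[fingerprint]string) []fingerprint {
	fps := make([]fingerprint, 0, len(series))
	for fp := range series {
		fps = append(fps, fp)
	}
	sort.Slice(fps, func(i, j int) bool { return fps[i] < fps[j] })
	return fps
}

// overheadBytes estimates the extra memory of the slice:
// one 8-byte fingerprint per series.
func overheadBytes(numSeries int) int {
	return 8 * numSeries
}

func main() {
	series := map[fingerprint]string{30: "c", 10: "a", 20: "b"}
	for _, fp := range sortedFingerprints(series) {
		fmt.Println(fp, series[fp])
	}
	// 10M series cost 80,000,000 bytes, a small fraction of ~100GiB.
	fmt.Println(overheadBytes(10000000))
}
```

Walking the slice from start to end and then starting over means the next series to be maintained is always the one whose previous maintenance lies furthest in the past.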
Tobias Schmidt eac36d123e Fix unstable fanin test (#2558) 2017-04-03 13:02:15 +02:00
Conor Broderick dafae52efa Display total number of returned elements on console (#2532)
2017-04-03 11:52:25 +01:00
Julius Volz 111841a230 Vendor new InfluxDB client library 2017-04-03 12:38:05 +02:00
Fabian Reinartz e18be8d1a5 Merge pull request #2556 from prometheus/grobie/count-missed-group-executions
Export number of missed rule evaluations
2017-04-03 10:09:12 +02:00
Julius Volz 3581057ea4 Update remote storage bridge README.md 2017-04-03 01:42:49 +02:00
Julius Volz b391cbb808 Add InfluxDB read-back support to remote storage bridge 2017-04-03 01:42:43 +02:00
Tobias Schmidt eaf33759fb Register forgotten prometheus_evaluator_iterations_total metric 2017-04-02 20:32:56 -03:00
Tobias Schmidt aaaba57184 Export number of missed rule evaluations
In case the execution of all rules takes longer than the configured rule
evaluation interval, one or more iterations will be skipped. This needs
to be visible to the operator.
2017-04-02 20:03:28 -03:00
Julius Volz 5a896033e3 Add remote read external label handling (#2555)
* Add remote read external label handling

This implements rules 1 and 2 from
https://docs.google.com/document/d/188YauRgfF0J4CYMigLsVNN34V_kUwKnApBs2dQMfBbs/edit

* Use more descriptive example labels in read test

* Add comment for querier.addExternalLabels()

* Make argument naming in removeLabels() more generic
2017-04-02 17:48:15 +02:00
Julius Volz 9cc7b393c5 Merge pull request #2548 from prometheus/sort-targets
Sort targets by instance within a job
2017-04-01 00:07:31 +02:00
Julius Volz 589061919a Merge pull request #2465 from Gouthamve/alert-metrics-2429
Better Metrics For Alerts
2017-03-31 21:45:05 +02:00
Goutham Veeramachaneni f27ce34a13 Use Registerer to Register All Metrics
* Made Metric a Gauge so that it can be registered.
2017-04-01 00:14:30 +05:30
Goutham Veeramachaneni 7ba0a9e81a Add Comment About Initialising Counters 2017-03-31 23:39:02 +05:30
Goutham Veeramachaneni 0d0c9d5440 Move Registerer to Config Struct in Notifier 2017-03-31 21:20:12 +05:30
Julius Volz 947c83be3b Sort targets by instance within a job
Fixes https://github.com/prometheus/prometheus/issues/2536
2017-03-31 13:14:20 +02:00
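A small sketch of the ordering the commit above describes: targets are grouped by job and each group is sorted by instance. The `target` struct and `groupAndSort` function are hypothetical simplifications, not the types used by the web UI code.

```go
package main

import (
	"fmt"
	"sort"
)

// target is a simplified stand-in for a scrape target.
type target struct {
	Job      string
	Instance string
}

// groupAndSort groups targets by job and sorts each group by instance,
// so the listing is stable across page loads.
func groupAndSort(targets []target) map[string][]target {
	byJob := make(map[string][]target)
	for _, t := range targets {
		byJob[t.Job] = append(byJob[t.Job], t)
	}
	for _, ts := range byJob {
		sort.Slice(ts, func(i, j int) bool { return ts[i].Instance < ts[j].Instance })
	}
	return byJob
}

func main() {
	pools := groupAndSort([]target{
		{Job: "node", Instance: "c:9100"},
		{Job: "node", Instance: "a:9100"},
		{Job: "api", Instance: "b:8080"},
	})
	for _, t := range pools["node"] {
		fmt.Println(t.Job, t.Instance)
	}
}
```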
Julius Volz 336c7870ea Merge pull request #2550 from prometheus/update-go-version
ci: Update Go version to 1.8
2017-03-31 13:12:03 +02:00
Julius Volz a44aadf4a1 ci: Update Go version to 1.8 2017-03-31 00:29:04 +02:00
Brian Brazil 8cd5aff8fe Send instance="" with federation if instance not set.
This is needed for federating non-instance level metrics, so they don't
end up with the instance label of the prometheus target.

Also sort external labels, so label output order is consistent.
2017-03-30 06:48:48 +01:00
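The label handling described above might look roughly like this. The sketch assumes that labels on the series take precedence over external labels and that all label names are emitted in sorted order; `federatedLabels` is a hypothetical helper, not the federation handler itself.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// federatedLabels merges external labels into a series' labels, forces
// an explicit (possibly empty) instance label, and renders the result
// with label names in sorted order.
func federatedLabels(series, external map[string]string) string {
	merged := make(map[string]string, len(series)+len(external)+1)
	for name, value := range external {
		merged[name] = value
	}
	// Assumption: labels on the series win over external labels.
	for name, value := range series {
		merged[name] = value
	}
	// An explicit empty instance label keeps the federating server from
	// attaching the instance label of the federated Prometheus target.
	if _, ok := merged["instance"]; !ok {
		merged["instance"] = ""
	}

	names := make([]string, 0, len(merged))
	for name := range merged {
		names = append(names, name)
	}
	sort.Strings(names)

	pairs := make([]string, 0, len(names))
	for _, name := range names {
		pairs = append(pairs, fmt.Sprintf("%s=%q", name, merged[name]))
	}
	return "{" + strings.Join(pairs, ",") + "}"
}

func main() {
	fmt.Println(federatedLabels(
		map[string]string{"job": "batch"},
		map[string]string{"zone": "eu"},
	))
}
```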
Brian Brazil d42e01b07c Sort labelnames for federation.
This makes unittests with multiple labels possible,
and may be needed for performance with the new
ingestion text parser.
2017-03-30 06:48:48 +01:00
Brian Brazil dbb65846f1 Add unittest for federation external_labels behaviour 2017-03-30 06:48:48 +01:00
Goutham Veeramachaneni 5856f87be3 Update Issue Template (#2541)
This is a comment in markdown and won't be shown while creating the issue.
2017-03-29 15:39:38 +01:00
Björn Rabenstein 29f05680a2 Merge pull request #2528 from prometheus/beorn7/storage2
main.go: Set GOGC to 40 by default
2017-03-27 15:00:37 +02:00
Björn Rabenstein e63d079b59 Merge pull request #2527 from prometheus/beorn7/storage
storage: Evict chunks and calculate persistence pressure...
2017-03-27 14:49:42 +02:00
Julius Volz b5b0e00923 Merge pull request #2499 from prometheus/remote-read
Remote Read
2017-03-27 14:43:44 +02:00
beorn7 434ab2a6a3 storage: Evict chunks and calculate persistence pressure based on target heap size
This is a fairly easy attempt to dynamically evict chunks based on the
heap size. A target heap size has to be set as a command line flag,
so that users can essentially say "utilize 4GiB of RAM, and please
don't OOM".

The -storage.local.max-chunks-to-persist and
-storage.local.memory-chunks flags are deprecated by this
change. Backwards compatibility is provided by ignoring
-storage.local.max-chunks-to-persist and using
-storage.local.memory-chunks to set the new
-storage.local.target-heap-size to a reasonable (and conservative)
value (both with a warning).

This also makes the metrics instrumentation more consistent (in
naming and implementation) and cleans up a few quirks in the tests.

Answers to anticipated comments:

There is a chance that Go 1.9 will allow programs better control over
the Go memory management. I don't expect those changes to be in
contradiction with the approach here, but I do expect them to
complement it and allow it to be more precise and controlled. In
any case, once those Go changes are available, this code has to be
revisited.

One might be tempted to let the user specify an estimated value for
the RSS usage, and then internally set a target heap size of a certain
fraction of that. (In my experience, 2/3 is a fairly safe bet.)
However, investigations have shown that RSS size and its relation to
the heap size is really really complicated. It depends on so many
factors that I wouldn't even start listing them in a commit
description. It depends on many circumstances and not least on the
risk trade-off of each individual user between RAM utilization and
probability of OOMing during a RAM usage peak. To not add even more to
the confusion, we need to stick to the well-defined number we also use
in the targeting here, the sum of the sizes of heap objects.
2017-03-27 14:33:50 +02:00
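A heavily simplified sketch of heap-based eviction as outlined in the commit above. It reads the sum of heap object sizes via `runtime.MemStats.HeapAlloc` and compares it with a target. The helper `chunksToEvict`, the 4GiB target, and the assumed 1024 bytes per chunk are illustrative assumptions; the real logic also weighs persistence pressure and other factors.

```go
package main

import (
	"fmt"
	"runtime"
)

// chunksToEvict returns how many chunks would have to be evicted to
// bring the heap back down to the target, assuming every evicted chunk
// frees bytesPerChunk bytes.
func chunksToEvict(heapBytes, targetBytes, bytesPerChunk uint64) uint64 {
	if bytesPerChunk == 0 || heapBytes <= targetBytes {
		return 0
	}
	excess := heapBytes - targetBytes
	// Round up so the heap ends at or below the target.
	return (excess + bytesPerChunk - 1) / bytesPerChunk
}

func main() {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)

	// Hypothetical target: "utilize 4GiB of RAM".
	target := uint64(4) << 30
	fmt.Println("heap bytes:", ms.HeapAlloc)
	fmt.Println("chunks to evict:", chunksToEvict(ms.HeapAlloc, target, 1024))
}
```

Targeting the heap object size rather than RSS matches the reasoning in the commit message: it is the one number that is well defined and directly observable from within the process.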
Björn Rabenstein e1a84b6256 Merge pull request #2529 from prometheus/beorn7/storage3
storage: Use staleness delta as head chunk timeout
2017-03-27 14:25:08 +02:00
beorn7 96a303b348 storage: Use staleness delta as head chunk timeout
Currently, if a series ceases to exist, its head chunk will be kept
open for an hour. That prevents it from being persisted. Which
prevents it from being evicted. Which prevents the series from being
archived.

Most of the time, once no sample has been added to a series within the
staleness limit, we can be pretty confident that this series will not
receive samples anymore. The whole chain as described above can be
started after 5m instead of 1h. In the relaxed case, this doesn't
change a lot as the head chunk timeout is only checked during series
maintenance, and usually, a series is only maintained every six
hours. However, there is the typical scenario where a large service is
deployed, the deploy turns out to be bad, and then it is deployed
again within minutes, and quite quickly the number of time series has
tripled. That's the point where the Prometheus server is stressed and
switches (rightfully) into rushed mode. In that mode, time series are
processed as quickly as possible, but all of that is in vain if all of
those recently ended time series cannot be persisted yet for another
hour. In that scenario, this change will help most, and it's exactly
the scenario where help is most desperately needed.
2017-03-26 23:44:50 +02:00