Commit graph

829 commits

Author SHA1 Message Date
Fabian Reinartz dba7586671 Merge branch 'master' into dev-2.0 2017-07-11 17:22:14 +02:00
Fabian Reinartz 16464c3a33 Merge pull request #2910 from prometheus/adminapi
Admin API
2017-07-11 17:15:49 +02:00
Fabian Reinartz ccf9e62972 *: add admin grpc API 2017-07-10 09:14:14 +02:00
Goutham Veeramachaneni 243419c007 Return tsdb.ErrOutOfBounds as storage.ErrOutOfBounds
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-07-06 14:18:31 +02:00
Matt Bostock 13c6e4a4bc Remote queue manager: Fix typo
Change 'send' to 'sent'.
2017-07-04 20:48:52 +01:00
Goutham Veeramachaneni 3069bd3996 Handle scrapes with OutOfBounds metrics better
fixes #2894

Signed-off-by: Goutham Veeramachaneni <goutham@boomerangcommerce.com>
2017-07-04 11:24:13 +02:00
Goutham Veeramachaneni d407bd150c Consolidate the duration params in CLI
* All CLI params moved to model.Duration

Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-06-16 20:20:57 +05:30
Goutham Veeramachaneni baf5b0f0fc Fix error where we look into the future. (#2829)
* Fix error where we look into the future.

So currently we are adding values that are in the future for an older
timestamp. For example, if we have [(1, 1), (150, 2)] we will end up
showing [(1, 1), (2,2)].

Further it is not advisable to call .At() after Next() returns false.

Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>

* Retuen early if done

Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>

* Handle Seek() where we reach the end of iterator

Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>

* Simplify code

Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-06-13 07:22:27 +02:00
Tom Wilkie 24a113bb09 Review feedback: limit number of bytes read under error. 2017-06-01 11:21:48 +01:00
Tom Wilkie 46abe8cbf2 Remote write: read first line of response and include it in the error. 2017-05-31 13:46:08 +01:00
Brian Brazil c02c25d5ba Allow peeking back further in buffer. 2017-05-24 14:27:17 +01:00
Fabian Reinartz d289dc55c3 storage: update TSDB 2017-05-22 11:53:08 +02:00
Alexey Palazhchenko b0e1ea7c6c Simplify code, fix typos. (#2719) 2017-05-15 09:56:09 +01:00
Julius Volz 1c72524870 Fix HTTP error handling in remote.Client.Store() (#2708)
Regression introduced in
e5d7bbfc3c
2017-05-11 18:40:10 +02:00
Tom Wilkie 3141a6b36b Compress remote storage requests and responses with unframed/raw snappy. (#2696)
* Compress remote storage requests and responses with unframed/raw snappy, for compatibility with other languages.

* Remove backwards compatibility code from remote_storage_adapter, update example_write_adapter

* Add /documentation/examples/remote_storage/example_write_adapter/example_writer_adapter to .gitignore
2017-05-10 16:42:59 +02:00
Fabian Reinartz 9b175d48cb Add flag to disable TSDB lock file 2017-05-09 12:56:51 +02:00
beorn7 46226088aa Merge branch 'release-1.6' 2017-05-09 11:16:07 +02:00
beorn7 69eddc9e84 storage: Correctly increase prometheus_local_storage_open_head_chunks 2017-05-08 18:20:23 +02:00
Tom Wilkie 2195bb66f7 Ensure ewma int64s are always aligned. (#2675) 2017-05-03 14:32:50 -05:00
Tom Wilkie 4d9b917d11 Instrument Prometheus with OpenTracing (#2554)
* Use request.Context() instead of a global map of contexts.

* Add some basic opentracing instrumentation on the query path.

* Remove tracehandler endpoint.
2017-05-02 18:49:29 -05:00
Fabian Reinartz 0f3110487d Merge remote-tracking branch 'origin/dev-2.0' into dev-2.0 2017-04-27 10:25:04 +02:00
Fabian Reinartz 37deb21c45 vendor: remove unused dependency and last ref to fabxc/tsdb 2017-04-27 10:23:34 +02:00
Brian Brazil 5c9a6ce747 Add license to files.
This should fix CI for dev-2.0.
2017-04-19 13:46:22 +01:00
beorn7 1dd737d7c3 storage: Don't panic if storage has no FPs even after initial wait 2017-04-18 15:59:12 +02:00
beorn7 c53f256a09 storage: Fix use of counter (Set -> Add) 2017-04-11 12:58:24 +02:00
beorn7 f338d791d2 storage: Several optimizations of checkpointing
- checkpointSeriesMapAndHeads accepts a context now to allow
  cancelling.

- If a shutdown is initiated, cancel the ongoing checkpoint. (We will
  create a final checkpoint anyway.)

- Always wait for at least as long as the last checkpoint took before
  starting the next checkpoint (to cap the time spending checkpointing
  at 50%).

- If an error has occurred during checkpointing, don't bother to sync
  the write.

- Make sure the temporary checkpoint file is deleted, even if an error
  has occurred.

- Clean up the checkpoint loop a bit. (The concurrent Timer.Reset(0)
  call might have cause a race.)
2017-04-07 13:10:12 +02:00
Björn Rabenstein 934d86b936 Merge pull request #2593 from prometheus/beorn7/storage2
storage: Recover from corrupted indices for archived series
2017-04-07 12:55:35 +02:00
Björn Rabenstein 38bcba11fe Merge pull request #2594 from prometheus/beorn7/storage3
storage: Guard against a corner case of data corruption
2017-04-07 00:52:28 +02:00
Björn Rabenstein f0076aca01 Merge pull request #2595 from prometheus/beorn7/storage4
storage: Guard against appending to evicted chunk
2017-04-07 00:51:53 +02:00
Tom Wilkie e5d7bbfc3c Remote writes: retry on recoverable errors. (#2552)
* Remote writes: retry on recoverable errors.

* Add comments

* Review feedback

* Comments

* Review feedback

* Final spelling misteak (I hope).  Plus, record failed samples correctly.
2017-04-07 00:15:41 +02:00
beorn7 7199a9d9d4 storage: Guard against appending to evicted chunk
Fixes #2480. For certain definition of "fixes".

This is something that should never happen. Sadly, it does happen,
albeit extremely rarely. This could be some weird cornercase we
haven't covered yet. Or it happens as a consequesnce of data
corruption or a crash recovery gone bad.

This is not a "real" fix as we don't know the root cause of the
incident reported in #2480. However, this makes sure the server does
not crash, but deals gracefully with the problem: The series in
question is quarantined, which even makes it available for forensics.
2017-04-06 20:02:52 +02:00
beorn7 3d12906286 storage: Guard against a corner case of data corruption
Fixes #2475.
2017-04-06 19:50:32 +02:00
beorn7 4fcc73a04c storage: Recover from corrupted indices for archived series
An unopenable archived_fingerprint_to_timerange is simply deleted and
will be rebuilt during crash recovery (wich can then take quite some time).

An unopenable archived_fingerprint_to_metric is not deleted but
instructions to the user are logged. A deletion has to be done by the
user explicitly as it means losing all archived series (and a repair
with a 3rd party tool might still be possible).
2017-04-06 19:26:39 +02:00
Julius Volz 9775ad4754 Merge pull request #2588 from prometheus/read-multi
Separate out remote read responses.
2017-04-06 17:10:31 +02:00
Brian Brazil c813c824d4 Separate out remote read responses.
Fixes #2574
2017-04-06 15:49:47 +01:00
Björn Rabenstein 516a96d9a3 Merge pull request #2587 from prometheus/beorn7/storage2
storage: Mark storage as dirty if indexing fails
2017-04-06 16:42:06 +02:00
beorn7 ed5f68f382 storage: Increment s.persistErrors on all persist errors
Fixes #2091
2017-04-06 15:55:15 +02:00
beorn7 f3365c4f26 storage: Mark storage as dirty if indexing fails 2017-04-06 15:29:33 +02:00
Alexey Palazhchenko 17f15d024a Small fixes. (#2578)
Fix typos. Simplify with gofmt -s
2017-04-05 14:24:22 +01:00
Björn Rabenstein 425f591fc9 Merge pull request #2576 from prometheus/beorn7/storage
storage: Check for negative values from varint decoding
2017-04-04 23:23:51 +02:00
beorn7 ae286385fd storage: Check for negative values from varint decoding
Sadly, we have a number of places where we use varint encoding for
numbers that cannot be negative. We could have saved a bit by using
uvarint encoding. On the bright side, we now have a 50% chance to
detect data corruption. :-/

Fixes #1800 and #2492.
2017-04-04 19:14:52 +02:00
beorn7 9b6a1dad05 storage: Fix go vet error 2017-04-04 19:14:09 +02:00
Fabian Reinartz 8ffc851147 Merge branch 'master' into dev-2.0 2017-04-04 15:17:56 +02:00
Fabian Reinartz cfb2a7f1d5 vendor: sync organisation migration of tsdb 2017-04-04 11:33:51 +02:00
Fabian Reinartz bbcf20ba01 web: deduplicate series in federation 2017-04-04 11:20:23 +02:00
Fabian Reinartz 4e41987bcb storage: add deduplication function
This adds a function to deduplicate two series sets given that duplicate
series have equivalent data points.
2017-04-04 11:07:21 +02:00
Björn Rabenstein 50e4f49b7e Merge pull request #2561 from prometheus/beorn7/storage2
storage: Evict unused chunk.Descs in crash recovery
2017-04-04 00:05:03 +02:00
beorn7 08fc6cbd39 storage: Evict unused chunk.Descs in crash recovery
This is in line with the v1.5 change in paradigm to not keep
chunk.Descs without chunks around after a series maintenance.

It's mainly motivated by avoiding excessive amounts of RAM usage
during crash recovery.

The code avoids to create memory time series with zero chunk.Descs as
that is prone to trigger weird effects. (Series maintenance would
archive series with zero chunk.Descs, but we cannot do that here
because the archive indices still have to be checked.)
2017-04-04 00:04:22 +02:00
Björn Rabenstein 1c6240fc40 Merge pull request #2559 from prometheus/beorn7/storage
storage: Replace fpIter by sortedFPs
2017-04-03 16:56:21 +02:00
beorn7 d284ffab03 storage: Replace fpIter by sortedFPs
The fpIter was kind of cumbersome to use and required a lock for each
iteration (which wasn't even needed for the iteration at startup after
loading the checkpoint).

The new implementation here has an obvious penalty in memory, but it's
only 8 byte per series, so 80MiB for a beefy server with 10M memory
time series (which would probably need ~100GiB RAM, so the memory
penalty is only 0.1% of the total memory need).

The big advantage is that now series maintenance happens in order,
which leads to the time between two maintenances of the same series
being less random. Ideally, after each maintenance, the next
maintenance would tackle the series with the largest number of
non-persisted chunks. That would be quite an effort to find out or
track, but with the approach here, the next maintenance will tackle
the series whose previous maintenance is longest ago, which is a good
approximation.

While this commit won't change the _average_ number of chunks
persisted per maintenance, it will reduce the mean time a given chunk
has to wait for its persistence and thus reduce the steady-state
number of chunks waiting for persistence.

Also, the map iteration in Go is non-deterministic but not truly
random. In practice, the iteration appears to be somewhat "bucketed".
You can often observe a bunch of series with similar duration since
their last maintenance, i.e. you see batches of series with similar
number of chunks persisted per maintenance. If that batch is
relatively young, a whole lot of series are maintained with very few
chunks to persist. (See screenshot in PR for a better explanation.)
2017-04-03 15:34:46 +02:00