Commit graph

899 commits

Author SHA1 Message Date
Tom Wilkie 0e572686db Revert "Bypass the fanout storage merging if no remote storage is configured." 2017-10-26 16:09:39 +01:00
Tom Wilkie 1af3ef431d s/TestRemoveLabels/TestSeriesSetFilter/ 2017-10-26 13:50:39 +01:00
Tom Wilkie 9c3c98e8de Revert "Port 'Don't disable HTTP keep-alives for remote storage connections.' to 2.0 (see #3173)"
This reverts commit 0997191b18.
2017-10-26 13:43:48 +01:00
Tom Wilkie 746752b946 Merge external labels in order. 2017-10-26 11:44:49 +01:00
Tom Wilkie 6e4d4ea402 Initialise some counters in remote storage API. 2017-10-26 11:09:45 +01:00
Tom Wilkie 2ae04d0e79 Add license header. 2017-10-26 11:09:16 +01:00
Tom Wilkie e8c264e47a Add comment. 2017-10-26 11:09:16 +01:00
Tom Wilkie ee011d906d Port remote read server to 2.0. 2017-10-26 11:09:14 +01:00
Bryan Boreham 0997191b18 Port 'Don't disable HTTP keep-alives for remote storage connections.' to 2.0 (see #3173)
Removes configurability introduced in #3160 in favour of hard-coding,
per advice from @brian-brazil.
2017-10-26 11:08:33 +01:00
Tom Wilkie 56820726fa Move a couple of the encoding/decoding functions into codec.go 2017-10-26 11:08:33 +01:00
Conor Broderick 08b7328669 Port Metric name validation to 2.0 (see #2975) 2017-10-26 11:08:33 +01:00
Tom Wilkie 8fe0212ff7 Port 'Make queue manager configurable.' to 2.0, see #2991 2017-10-26 11:08:33 +01:00
Tom Wilkie 3760f56c0c remote: Expose ClientConfig type (see #3165) 2017-10-26 11:08:33 +01:00
Tom Wilkie 16f71a7723 Port codec.go over form 1.8 branch. 2017-10-26 11:08:33 +01:00
Fabian Reinartz e53040e2ac Merge pull request #3339 from tomwilkie/3065-remote-read-bypass
Bypass the fanout storage merging if no remote storage is configured.
2017-10-26 09:14:26 +02:00
Fabian Reinartz bf56ad4233 Merge branch 'master' into master 2017-10-26 09:06:12 +02:00
Paul Gier c4c3205d76 storage/tsdb: check that max block duration is larger than min
If the user accidentally sets the max block duration smaller than the min,
the current error is not informative.  This change just performs the check
earlier and improves the error message.
2017-10-25 19:24:49 -05:00
Fabian Reinartz ce63a5a855 Merge pull request #3352 from prometheus/rc2
Cut v2.0.0-rc.2
2017-10-25 20:39:39 +02:00
Thibault Chataigner fc4406201e Tsdb StartTime : Use a simplier way to compute StartTime 2017-10-25 17:41:00 +02:00
Julius Volz 099df0c5f0 Migrate "golang.org/x/net/context" -> "context" (#3333)
In some places, where ctxhttp or gRPC are concerned, we still need to use the
old contexts.
2017-10-24 21:21:42 -07:00
Tom Wilkie 4bbef0ec30 Bypass the fanout storage merging if no remote storage is configured. 2017-10-23 21:34:53 +01:00
Fabian Reinartz a57ea79660 Close index reader properly 2017-10-23 21:59:18 +02:00
Julius Volz c3d6abc8e6 Fix some lint errors (#3334)
I left the promql ones and some others untouched as I remember that @fabxc
prefers them that way.
2017-10-23 14:57:30 +01:00
Julius Volz 2846d62573 Fix staticcheck issue in test (#3331)
staticcheck fails with:
storage/remote/read_test.go:199:27: do not pass a nil Context, even if a function permits it; pass context.TODO if you are unsure about which Context to use (SA1012)
2017-10-23 11:51:48 +01:00
Brian Brazil 4a50f547c8 removeLabels needs a pointer to work. (#3326) 2017-10-21 08:29:03 +01:00
Thibault Chataigner bf4a279a91 Remote storage reads based on oldest timestamp in primary storage (#3129)
Currently all read queries are simply pushed to remote read clients.
This is fine, except for remote storage for wich it unefficient and
make query slower even if remote read is unnecessary.
So we need instead to compare the oldest timestamp in primary/local
storage with the query range lower boundary. If the oldest timestamp
is older than the mint parameter, then there is no need for remote read.
This is an optionnal behavior per remote read client.

Signed-off-by: Thibault Chataigner <t.chataigner@criteo.com>
2017-10-18 12:08:14 +01:00
Julius Volz 9ef8518b37 Remove "package remote" garbage from license headers (#3304) 2017-10-17 02:26:38 +01:00
Tobias Schmidt 721050c6cb Update prometheus/tsdb dependency 2017-10-16 15:36:25 +02:00
Julius Volz 33c1171b9c Don't add anchoring to exported Value matcher field
Instead, just make the anchoring part of the internal regex. This helps because
some users will want to read back the `Value` field and expect it to be the
same as the input value (e.g. some tests in Cortex), or use the value in
another context which is already expected to add its own anchoring, leading to
superfluous double anchoring (such as when we translate matchers into remote
read request matchers).
2017-10-10 10:10:21 -07:00
Brian Brazil 73dc96e7f5 Fix leak of ticker in remote storage queue manager. 2017-10-09 19:44:03 +01:00
Brian Brazil ee88f0d222 Ensure all values are used or _ 2017-10-09 19:44:03 +01:00
Brian Brazil 37ec2d5283 Fix off by one error in concreteSeriesSet (#3262) 2017-10-09 13:37:58 +01:00
Marc Sluiter 6a633eece1 Added go-conntrack for monitoring http connections (#3241)
Added metrics for in- and outgoing traffic with go-conntrack.
2017-10-06 11:22:19 +01:00
Julius Volz f7e8348a88 Re-add contexts to storage.Storage.Querier() (#3230)
* Re-add contexts to storage.Storage.Querier()

These are needed when replacing the storage by a multi-tenant
implementation where the tenant is stored in the context.

The 1.x query interfaces already had contexts, but they got lost in 2.x.

* Convert promql.Engine to use native contexts
2017-10-04 21:04:15 +02:00
Fabian Reinartz 7b02bfee0a web: start web handler while TSDB is starting up 2017-09-20 15:03:19 +02:00
Fabian Reinartz d21f149745 *: migrate to go-kit/log 2017-09-08 22:01:51 +05:30
Fabian Reinartz 0efecea6d4 Adapt storage APIs to uint64 references 2017-09-07 14:14:41 +02:00
Fabian Reinartz 0c81d5f719 storage: instantiate correct block ranges 2017-08-24 12:36:07 +02:00
Fabian Reinartz 2037778d14 vendor: update TSDB 2017-08-10 14:51:02 +02:00
Tom Wilkie b11bc8ae24 Fix some comments. 2017-08-01 11:19:35 +01:00
Tom Wilkie ec999ff397 Prevent number of remote write shards from going negative.
This can happen in the situation where the system scales up the number of shards massively (to deal with some backlog), then scales it down again as the number of samples sent during the time period is less than the number received.
2017-07-19 16:32:09 +01:00
Tom Wilkie a09acdcc5b Make concreteSeriersIterator behave. 2017-07-13 18:33:08 +01:00
Tom Wilkie 994a7f27d6 Propagate errors through mergeSeriesSet correctly. 2017-07-13 15:02:01 +01:00
Tom Wilkie 2e0d8487e3 Return zeros if At() is called after Next() returns false. 2017-07-13 14:40:29 +01:00
Tom Wilkie 014bd31a86 Remove unnecessary whitespace changes, add comment. 2017-07-13 11:26:46 +01:00
Tom Wilkie 98ac07f86a Add unit test for the merging on the read path. 2017-07-13 11:05:38 +01:00
Tom Wilkie b568ace7ce Move protos to ./prompb 2017-07-12 22:06:35 +01:00
Tom Wilkie 96e25adc8d Introduce 'primary' storage in fanout, and have Add return the ref from the primary.
Also, ensure all append batches are rolled back when a commit or rollback fails.
2017-07-12 15:51:05 +01:00
Tom Wilkie db8128ceeb Add label set as first parameter to AddFast, ingored by TSDB adapter. 2017-07-12 15:20:12 +01:00
Tom Wilkie 2dda5775e3 Initial port of remote storage to v2. 2017-07-12 12:27:57 +01:00
Fabian Reinartz 16464c3a33 Merge pull request #2910 from prometheus/adminapi
Admin API
2017-07-11 17:15:49 +02:00
Fabian Reinartz ccf9e62972 *: add admin grpc API 2017-07-10 09:14:14 +02:00
Goutham Veeramachaneni 243419c007 Return tsdb.ErrOutOfBounds as storage.ErrOutOfBounds
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-07-06 14:18:31 +02:00
Goutham Veeramachaneni 3069bd3996 Handle scrapes with OutOfBounds metrics better
fixes #2894

Signed-off-by: Goutham Veeramachaneni <goutham@boomerangcommerce.com>
2017-07-04 11:24:13 +02:00
Goutham Veeramachaneni d407bd150c Consolidate the duration params in CLI
* All CLI params moved to model.Duration

Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-06-16 20:20:57 +05:30
Goutham Veeramachaneni baf5b0f0fc Fix error where we look into the future. (#2829)
* Fix error where we look into the future.

So currently we are adding values that are in the future for an older
timestamp. For example, if we have [(1, 1), (150, 2)] we will end up
showing [(1, 1), (2,2)].

Further it is not advisable to call .At() after Next() returns false.

Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>

* Retuen early if done

Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>

* Handle Seek() where we reach the end of iterator

Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>

* Simplify code

Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-06-13 07:22:27 +02:00
Brian Brazil c02c25d5ba Allow peeking back further in buffer. 2017-05-24 14:27:17 +01:00
Fabian Reinartz d289dc55c3 storage: update TSDB 2017-05-22 11:53:08 +02:00
Fabian Reinartz 9b175d48cb Add flag to disable TSDB lock file 2017-05-09 12:56:51 +02:00
Fabian Reinartz 0f3110487d Merge remote-tracking branch 'origin/dev-2.0' into dev-2.0 2017-04-27 10:25:04 +02:00
Fabian Reinartz 37deb21c45 vendor: remove unused dependency and last ref to fabxc/tsdb 2017-04-27 10:23:34 +02:00
Brian Brazil 5c9a6ce747 Add license to files.
This should fix CI for dev-2.0.
2017-04-19 13:46:22 +01:00
Fabian Reinartz 8ffc851147 Merge branch 'master' into dev-2.0 2017-04-04 15:17:56 +02:00
Fabian Reinartz cfb2a7f1d5 vendor: sync organisation migration of tsdb 2017-04-04 11:33:51 +02:00
Fabian Reinartz bbcf20ba01 web: deduplicate series in federation 2017-04-04 11:20:23 +02:00
Fabian Reinartz 4e41987bcb storage: add deduplication function
This adds a function to deduplicate two series sets given that duplicate
series have equivalent data points.
2017-04-04 11:07:21 +02:00
Björn Rabenstein 50e4f49b7e Merge pull request #2561 from prometheus/beorn7/storage2
storage: Evict unused chunk.Descs in crash recovery
2017-04-04 00:05:03 +02:00
beorn7 08fc6cbd39 storage: Evict unused chunk.Descs in crash recovery
This is in line with the v1.5 change in paradigm to not keep
chunk.Descs without chunks around after a series maintenance.

It's mainly motivated by avoiding excessive amounts of RAM usage
during crash recovery.

The code avoids to create memory time series with zero chunk.Descs as
that is prone to trigger weird effects. (Series maintenance would
archive series with zero chunk.Descs, but we cannot do that here
because the archive indices still have to be checked.)
2017-04-04 00:04:22 +02:00
Björn Rabenstein 1c6240fc40 Merge pull request #2559 from prometheus/beorn7/storage
storage: Replace fpIter by sortedFPs
2017-04-03 16:56:21 +02:00
beorn7 d284ffab03 storage: Replace fpIter by sortedFPs
The fpIter was kind of cumbersome to use and required a lock for each
iteration (which wasn't even needed for the iteration at startup after
loading the checkpoint).

The new implementation here has an obvious penalty in memory, but it's
only 8 byte per series, so 80MiB for a beefy server with 10M memory
time series (which would probably need ~100GiB RAM, so the memory
penalty is only 0.1% of the total memory need).

The big advantage is that now series maintenance happens in order,
which leads to the time between two maintenances of the same series
being less random. Ideally, after each maintenance, the next
maintenance would tackle the series with the largest number of
non-persisted chunks. That would be quite an effort to find out or
track, but with the approach here, the next maintenance will tackle
the series whose previous maintenance is longest ago, which is a good
approximation.

While this commit won't change the _average_ number of chunks
persisted per maintenance, it will reduce the mean time a given chunk
has to wait for its persistence and thus reduce the steady-state
number of chunks waiting for persistence.

Also, the map iteration in Go is non-deterministic but not truly
random. In practice, the iteration appears to be somewhat "bucketed".
You can often observe a bunch of series with similar duration since
their last maintenance, i.e. you see batches of series with similar
number of chunks persisted per maintenance. If that batch is
relatively young, a whole lot of series are maintained with very few
chunks to persist. (See screenshot in PR for a better explanation.)
2017-04-03 15:34:46 +02:00
Tobias Schmidt eac36d123e Fix unstable fanin test (#2558) 2017-04-03 13:02:15 +02:00
Julius Volz 5a896033e3 Add remote read external label handling (#2555)
* Add remote read external label handling

This implements rule 1 and 2 from
https://docs.google.com/document/d/188YauRgfF0J4CYMigLsVNN34V_kUwKnApBs2dQMfBbs/edit

* Use more descriptive example labels in read test

* Add comment for querier.addExternalLabels()

* Make argument naming in removeLabels() more generic
2017-04-02 17:48:15 +02:00
Björn Rabenstein e63d079b59 Merge pull request #2527 from prometheus/beorn7/storage
storage: Evict chunks and calculate persistence pressure...
2017-03-27 14:49:42 +02:00
Julius Volz b5b0e00923 Merge pull request #2499 from prometheus/remote-read
Remote Read
2017-03-27 14:43:44 +02:00
beorn7 434ab2a6a3 storage: Evict chunks and calculate persistence pressure based on target heap size
This is a fairly easy attempt to dynamically evict chunks based on the
heap size. A target heap size has to be set as a command line flage,
so that users can essentially say "utilize 4GiB of RAM, and please
don't OOM".

The -storage.local.max-chunks-to-persist and
-storage.local.memory-chunks flags are deprecated by this
change. Backwards compatibility is provided by ignoring
-storage.local.max-chunks-to-persist and use
-storage.local.memory-chunks to set the new
-storage.local.target-heap-size to a reasonable (and conservative)
value (both with a warning).

This also makes the metrics intstrumentation more consistent (in
naming and implementation) and cleans up a few quirks in the tests.

Answers to anticipated comments:

There is a chance that Go 1.9 will allow programs better control over
the Go memory management. I don't expect those changes to be in
contradiction with the approach here, but I do expect them to
complement them and allow them to be more precise and controlled. In
any case, once those Go changes are available, this code has to be
revisted.

One might be tempted to let the user specify an estimated value for
the RSS usage, and then internall set a target heap size of a certain
fraction of that. (In my experience, 2/3 is a fairly safe bet.)
However, investigations have shown that RSS size and its relation to
the heap size is really really complicated. It depends on so many
factors that I wouldn't even start listing them in a commit
description. It depends on many circumstances and not at least on the
risk trade-off of each individual user between RAM utilization and
probability of OOMing during a RAM usage peak. To not add even more to
the confusion, we need to stick to the well-defined number we also use
in the targeting here, the sum of the sizes of heap objects.
2017-03-27 14:33:50 +02:00
beorn7 96a303b348 storage: Use staleness delta as head chunk timeout
Currently, if a series stops to exist, its head chunk will be kept
open for an hour. That prevents it from being persisted. Which
prevents it from being evicted. Which prevents the series from being
archived.

Most of the time, once no sample has been added to a series within the
staleness limit, we can be pretty confident that this series will not
receive samples anymore. The whole chain as described above can be
started after 5m instead of 1h. In the relaxed case, this doesn't
change a lot as the head chunk timeout is only checked during series
maintenance, and usually, a series is only maintained every six
hours. However, there is the typical scenario where a large service is
deployed, the deoply turns out to be bad, and then it is deployed
again within minutes, and quite quickly the number of time series has
tripled. That's the point where the Prometheus server is stressed and
switches (rightfully) into rushed mode. In that mode, time series are
processed as quickly as possible, but all of that is in vein if all of
those recently ended time series cannot be persisted yet for another
hour. In that scenario, this change will help most, and it's exactly
the scenario where help is most desperately needed.
2017-03-26 23:44:50 +02:00
Julius Volz 3f23aa2cc7 Add headers to indicate remote read/write version
Also add Content-Type header.
2017-03-24 17:39:51 +01:00
Julius Volz 8fda83ea12 Make rules only read local data 2017-03-21 00:50:04 +01:00
Julius Volz 94acd3f1d8 Add fanin tests and fix uncovered bugs 2017-03-21 00:08:17 +01:00
Julius Volz 9b33cfc457 Fix/unify context-based remote storage timeouts 2017-03-20 14:17:06 +01:00
Julius Volz 815762a4ad Move retrieval.NewHTTPClient -> httputil.NewClientFromConfig 2017-03-20 14:17:04 +01:00
Fabian Reinartz 397f001ac5 Merge branch 'master' into dev-2.0 2017-03-20 14:12:11 +01:00
Julius Volz eb14678a25 Make remote read/write use config.HTTPClientConfig 2017-03-20 13:37:50 +01:00
Julius Volz 406b65d0dc Rename remote.Storage to remote.Writer 2017-03-20 13:15:28 +01:00
Julius Volz 02395a224d [WIP] Remote Read 2017-03-20 13:13:44 +01:00
Julius Volz 40e41a4776 Merge pull request #2494 from tomwilkie/remote-write-sharding
Dynamically reshard the QueueManager based on observed load.
2017-03-20 12:45:17 +01:00
Fabian Reinartz b586781283 *: update tsdb vendoring and add retention flag 2017-03-17 16:06:04 +01:00
beorn7 48d221c11e storage: Fix typo in comment 2017-03-16 11:49:41 +01:00
Fabian Reinartz 0ecd205794 promql: Use buffer pool for matrix allocations 2017-03-14 10:57:34 +01:00
Tom Wilkie 75bb0f3253 Review feedback 2017-03-13 21:24:49 +00:00
Tom Wilkie 77cce900b8 Fix tests 2017-03-13 15:21:59 +00:00
Tom Wilkie b48799a01e Add license stanza 2017-03-13 14:50:15 +00:00
Tom Wilkie 9d22f030cf Dynamically reshard the QueueManager based on observed load. 2017-03-13 14:41:16 +00:00
Fabian Reinartz 8a8eb12985 storage/tsdb: don't use partitioned DB. 2017-03-07 11:51:30 +01:00
Fabian Reinartz 9eb1d6c927 remote: take code from master 2017-03-07 11:43:32 +01:00
Fabian Reinartz 9304179ef7 Merge branch 'master' into dev-2.0 2017-03-02 08:16:58 +01:00
Fabian Reinartz 4397b4d508 *: pass Prometheus registry into storage 2017-02-28 09:33:14 +01:00
Tom Wilkie 1ab893c6ec Limit 'discarding sample' logs to 1 every 10s (#2446)
* Limit 'discarding sample' logs to 1 every 10s

* Include the vendored library

* Review feedback
2017-02-23 19:20:39 +01:00
Julius Volz 2f39dbc8b3 Rename StorageQueueManager -> QueueManager 2017-02-21 21:45:43 +01:00
Julius Volz e9476b35d5 Re-add multiple remote writers
Each remote write endpoint gets its own set of relabeling rules.

This is based on the (yet-to-be-merged)
https://github.com/prometheus/prometheus/pull/2419, which removes legacy
remote write implementations.
2017-02-20 13:23:12 +01:00