Commit graph

792 commits

Author SHA1 Message Date
Tom Wilkie 4f8efdbd59 Prevent number of remote write shards from going negative.
This can happen when the system scales up the number of shards massively (to deal with some backlog) and then scales it down again, because the number of samples sent during that period is less than the number received.
2017-09-14 08:07:40 +01:00
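A minimal sketch of the clamping described in the commit above, in Go (the function and parameter names are illustrative, not the actual queue-manager code):

    // clampShards bounds a desired shard count so that scale-down arithmetic
    // (samples sent > samples received) can never drive the number of shards
    // to zero or below.
    func clampShards(desired, maxShards int) int {
        if desired < 1 {
            return 1
        }
        if desired > maxShards {
            return maxShards
        }
        return desired
    }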
beorn7 ea5e7eafde Fix #2965
We would overscan when hitting a value directly, interspersed with
samples in between timestamps. Apparently, that happens rarely enough
that it was only noticed recently.
2017-07-21 16:35:15 +02:00
beorn7 c06292af2f Add test to expose #2965 2017-07-21 16:25:24 +02:00
Tom Wilkie 24a113bb09 Review feedback: limit number of bytes read under error. 2017-06-01 11:21:48 +01:00
Tom Wilkie 46abe8cbf2 Remote write: read first line of response and include it in the error. 2017-05-31 13:46:08 +01:00
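A hedged sketch of what the two error-handling commits above describe (illustrative only; maxErrMsgLen and errorFromResponse are assumed names): on a non-2xx response, read at most the first line of the body, bounded to a small number of bytes, and include it in the returned error.

    package main

    import (
        "bufio"
        "fmt"
        "io"
        "net/http"
        "net/http/httptest"
        "strings"
    )

    const maxErrMsgLen = 256 // cap on how much of the error body is read

    // errorFromResponse turns a non-2xx response into an error carrying the
    // first line of the body, reading at most maxErrMsgLen bytes.
    func errorFromResponse(resp *http.Response) error {
        if resp.StatusCode/100 == 2 {
            return nil
        }
        scanner := bufio.NewScanner(io.LimitReader(resp.Body, maxErrMsgLen))
        line := ""
        if scanner.Scan() {
            line = scanner.Text()
        }
        return fmt.Errorf("server returned HTTP status %s: %s", resp.Status, line)
    }

    func main() {
        srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            http.Error(w, "out of order sample\nonly the first line is reported", http.StatusBadRequest)
        }))
        defer srv.Close()

        resp, err := http.Post(srv.URL, "text/plain", strings.NewReader("payload"))
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()
        fmt.Println(errorFromResponse(resp))
    }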
Alexey Palazhchenko b0e1ea7c6c Simplify code, fix typos. (#2719) 2017-05-15 09:56:09 +01:00
Julius Volz 1c72524870 Fix HTTP error handling in remote.Client.Store() (#2708)
Regression introduced in
e5d7bbfc3c
2017-05-11 18:40:10 +02:00
Tom Wilkie 3141a6b36b Compress remote storage requests and responses with unframed/raw snappy. (#2696)
* Compress remote storage requests and responses with unframed/raw snappy, for compatibility with other languages.

* Remove backwards compatibility code from remote_storage_adapter, update example_write_adapter

* Add /documentation/examples/remote_storage/example_write_adapter/example_writer_adapter to .gitignore
2017-05-10 16:42:59 +02:00
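For illustration, block-format snappy via snappy.Encode/snappy.Decode yields the unframed representation that plain snappy decoders in other languages expect, while snappy.NewWriter would produce the framed stream format (a sketch only, not the actual remote-storage code):

    package main

    import (
        "fmt"

        "github.com/golang/snappy"
    )

    func main() {
        payload := []byte("serialized WriteRequest protobuf") // placeholder payload

        // Block (unframed/raw) snappy, decodable by plain snappy libraries
        // in other languages.
        compressed := snappy.Encode(nil, payload)

        decoded, err := snappy.Decode(nil, compressed)
        if err != nil {
            panic(err)
        }
        fmt.Printf("%d -> %d bytes, round-trips to %q\n", len(payload), len(compressed), decoded)
    }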
beorn7 46226088aa Merge branch 'release-1.6' 2017-05-09 11:16:07 +02:00
beorn7 69eddc9e84 storage: Correctly increase prometheus_local_storage_open_head_chunks 2017-05-08 18:20:23 +02:00
Tom Wilkie 2195bb66f7 Ensure ewma int64s are always aligned. (#2675) 2017-05-03 14:32:50 -05:00
Tom Wilkie 4d9b917d11 Instrument Prometheus with OpenTracing (#2554)
* Use request.Context() instead of a global map of contexts.

* Add some basic opentracing instrumentation on the query path.

* Remove tracehandler endpoint.
2017-05-02 18:49:29 -05:00
beorn7 1dd737d7c3 storage: Don't panic if storage has no FPs even after initial wait 2017-04-18 15:59:12 +02:00
beorn7 c53f256a09 storage: Fix use of counter (Set -> Add) 2017-04-11 12:58:24 +02:00
beorn7 f338d791d2 storage: Several optimizations of checkpointing
- checkpointSeriesMapAndHeads accepts a context now to allow
  cancelling.

- If a shutdown is initiated, cancel the ongoing checkpoint. (We will
  create a final checkpoint anyway.)

- Always wait for at least as long as the last checkpoint took before
  starting the next checkpoint (to cap the time spent checkpointing
  at 50%).

- If an error has occurred during checkpointing, don't bother to sync
  the write.

- Make sure the temporary checkpoint file is deleted, even if an error
  has occurred.

- Clean up the checkpoint loop a bit. (The concurrent Timer.Reset(0)
  call might have caused a race.)
2017-04-07 13:10:12 +02:00
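A hedged sketch of a checkpoint loop along the lines listed above (simplified; checkpointLoop and its parameters are assumed names, not the actual Prometheus implementation): it waits at least as long as the previous checkpoint took and stops cleanly when the context is cancelled at shutdown.

    package main

    import (
        "context"
        "log"
        "time"
    )

    // checkpointLoop runs checkpoints until ctx is cancelled. Between runs it
    // waits for the configured interval, but never less than the duration of
    // the previous checkpoint, so checkpointing uses at most ~50% of the time.
    func checkpointLoop(ctx context.Context, interval time.Duration, checkpoint func(context.Context) error) {
        var last time.Duration
        for {
            wait := interval
            if last > wait {
                wait = last
            }
            select {
            case <-ctx.Done():
                return // shutdown: a final checkpoint is created elsewhere
            case <-time.After(wait):
            }
            start := time.Now()
            if err := checkpoint(ctx); err != nil {
                log.Println("checkpoint failed:", err)
            }
            last = time.Since(start)
        }
    }

    func main() {
        ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
        defer cancel()
        checkpointLoop(ctx, time.Second, func(context.Context) error {
            log.Println("checkpointing...")
            return nil
        })
    }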
Björn Rabenstein 934d86b936 Merge pull request #2593 from prometheus/beorn7/storage2
storage: Recover from corrupted indices for archived series
2017-04-07 12:55:35 +02:00
Björn Rabenstein 38bcba11fe Merge pull request #2594 from prometheus/beorn7/storage3
storage: Guard against a corner case of data corruption
2017-04-07 00:52:28 +02:00
Björn Rabenstein f0076aca01 Merge pull request #2595 from prometheus/beorn7/storage4
storage: Guard against appending to evicted chunk
2017-04-07 00:51:53 +02:00
Tom Wilkie e5d7bbfc3c Remote writes: retry on recoverable errors. (#2552)
* Remote writes: retry on recoverable errors.

* Add comments

* Review feedback

* Comments

* Review feedback

* Final spelling misteak (I hope).  Plus, record failed samples correctly.
2017-04-07 00:15:41 +02:00
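A rough sketch of the retry behaviour described above (illustrative; recoverableError and sendWithRetries are assumed names, not the actual QueueManager code): recoverable failures such as HTTP 5xx are retried with doubling backoff, everything else fails immediately.

    package main

    import (
        "errors"
        "fmt"
        "time"
    )

    // recoverableError marks failures (e.g. HTTP 5xx) that are worth retrying.
    type recoverableError struct{ error }

    // sendWithRetries retries send with a doubling backoff as long as the
    // returned error is recoverable; anything else (e.g. HTTP 4xx) fails
    // immediately.
    func sendWithRetries(send func() error, maxRetries int, backoff time.Duration) error {
        var err error
        for i := 0; i < maxRetries; i++ {
            err = send()
            if err == nil {
                return nil
            }
            var rerr recoverableError
            if !errors.As(err, &rerr) {
                return err // unrecoverable, don't retry
            }
            time.Sleep(backoff)
            backoff *= 2
        }
        return err
    }

    func main() {
        attempts := 0
        err := sendWithRetries(func() error {
            attempts++
            if attempts < 3 {
                return recoverableError{errors.New("server returned HTTP 500")}
            }
            return nil
        }, 5, 10*time.Millisecond)
        fmt.Println("attempts:", attempts, "err:", err)
    }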
beorn7 7199a9d9d4 storage: Guard against appending to evicted chunk
Fixes #2480. For a certain definition of "fixes".

This is something that should never happen. Sadly, it does happen,
albeit extremely rarely. This could be some weird corner case we
haven't covered yet. Or it happens as a consequence of data
corruption or a crash recovery gone bad.

This is not a "real" fix as we don't know the root cause of the
incident reported in #2480. However, this makes sure the server does
not crash, but deals gracefully with the problem: The series in
question is quarantined, which even makes it available for forensics.
2017-04-06 20:02:52 +02:00
beorn7 3d12906286 storage: Guard against a corner case of data corruption
Fixes #2475.
2017-04-06 19:50:32 +02:00
beorn7 4fcc73a04c storage: Recover from corrupted indices for archived series
An unopenable archived_fingerprint_to_timerange is simply deleted and
will be rebuilt during crash recovery (which can then take quite some time).

An unopenable archived_fingerprint_to_metric is not deleted but
instructions to the user are logged. A deletion has to be done by the
user explicitly as it means losing all archived series (and a repair
with a 3rd party tool might still be possible).
2017-04-06 19:26:39 +02:00
Julius Volz 9775ad4754 Merge pull request #2588 from prometheus/read-multi
Separate out remote read responses.
2017-04-06 17:10:31 +02:00
Brian Brazil c813c824d4 Separate out remote read responses.
Fixes #2574
2017-04-06 15:49:47 +01:00
Björn Rabenstein 516a96d9a3 Merge pull request #2587 from prometheus/beorn7/storage2
storage: Mark storage as dirty if indexing fails
2017-04-06 16:42:06 +02:00
beorn7 ed5f68f382 storage: Increment s.persistErrors on all persist errors
Fixes #2091
2017-04-06 15:55:15 +02:00
beorn7 f3365c4f26 storage: Mark storage as dirty if indexing fails 2017-04-06 15:29:33 +02:00
Alexey Palazhchenko 17f15d024a Small fixes. (#2578)
Fix typos. Simplify with gofmt -s
2017-04-05 14:24:22 +01:00
Björn Rabenstein 425f591fc9 Merge pull request #2576 from prometheus/beorn7/storage
storage: Check for negative values from varint decoding
2017-04-04 23:23:51 +02:00
beorn7 ae286385fd storage: Check for negative values from varint decoding
Sadly, we have a number of places where we use varint encoding for
numbers that cannot be negative. We could have saved a bit by using
uvarint encoding. On the bright side, we now have a 50% chance to
detect data corruption. :-/

Fixes #1800 and #2492.
2017-04-04 19:14:52 +02:00
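A minimal illustration of the check, using the standard library varint decoder (readLength is an assumed name, not the actual chunk-decoding code): a length decoded as a signed varint must never be negative, so a negative value is reported as corruption.

    package main

    import (
        "bytes"
        "encoding/binary"
        "fmt"
    )

    // readLength decodes a varint-encoded length and rejects negative values,
    // which can only appear through data corruption, since lengths are never
    // written as negative numbers.
    func readLength(r *bytes.Reader) (int64, error) {
        n, err := binary.ReadVarint(r)
        if err != nil {
            return 0, err
        }
        if n < 0 {
            return 0, fmt.Errorf("invalid (negative) length %d: data corruption?", n)
        }
        return n, nil
    }

    func main() {
        buf := make([]byte, binary.MaxVarintLen64)
        n := binary.PutVarint(buf, -5) // simulate a corrupted length on disk
        _, err := readLength(bytes.NewReader(buf[:n]))
        fmt.Println(err)
    }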
beorn7 9b6a1dad05 storage: Fix go vet error 2017-04-04 19:14:09 +02:00
Björn Rabenstein 50e4f49b7e Merge pull request #2561 from prometheus/beorn7/storage2
storage: Evict unused chunk.Descs in crash recovery
2017-04-04 00:05:03 +02:00
beorn7 08fc6cbd39 storage: Evict unused chunk.Descs in crash recovery
This is in line with the v1.5 change in paradigm to not keep
chunk.Descs without chunks around after a series maintenance.

It's mainly motivated by avoiding excessive amounts of RAM usage
during crash recovery.

The code avoids creating memory time series with zero chunk.Descs as
that is prone to trigger weird effects. (Series maintenance would
archive series with zero chunk.Descs, but we cannot do that here
because the archive indices still have to be checked.)
2017-04-04 00:04:22 +02:00
Björn Rabenstein 1c6240fc40 Merge pull request #2559 from prometheus/beorn7/storage
storage: Replace fpIter by sortedFPs
2017-04-03 16:56:21 +02:00
beorn7 d284ffab03 storage: Replace fpIter by sortedFPs
The fpIter was kind of cumbersome to use and required a lock for each
iteration (which wasn't even needed for the iteration at startup after
loading the checkpoint).

The new implementation here has an obvious penalty in memory, but it's
only 8 bytes per series, so 80MiB for a beefy server with 10M memory
time series (which would probably need ~100GiB RAM, so the memory
penalty is only 0.1% of the total memory need).

The big advantage is that now series maintenance happens in order,
which leads to the time between two maintenances of the same series
being less random. Ideally, after each maintenance, the next
maintenance would tackle the series with the largest number of
non-persisted chunks. That would be quite an effort to find out or
track, but with the approach here, the next maintenance will tackle
the series whose previous maintenance is longest ago, which is a good
approximation.

While this commit won't change the _average_ number of chunks
persisted per maintenance, it will reduce the mean time a given chunk
has to wait for its persistence and thus reduce the steady-state
number of chunks waiting for persistence.

Also, the map iteration in Go is non-deterministic but not truly
random. In practice, the iteration appears to be somewhat "bucketed".
You can often observe a bunch of series with similar duration since
their last maintenance, i.e. you see batches of series with similar
number of chunks persisted per maintenance. If that batch is
relatively young, a whole lot of series are maintained with very few
chunks to persist. (See screenshot in PR for a better explanation.)
2017-04-03 15:34:46 +02:00
Tobias Schmidt eac36d123e Fix unstable fanin test (#2558) 2017-04-03 13:02:15 +02:00
Julius Volz 5a896033e3 Add remote read external label handling (#2555)
* Add remote read external label handling

This implements rule 1 and 2 from
https://docs.google.com/document/d/188YauRgfF0J4CYMigLsVNN34V_kUwKnApBs2dQMfBbs/edit

* Use more descriptive example labels in read test

* Add comment for querier.addExternalLabels()

* Make argument naming in removeLabels() more generic
2017-04-02 17:48:15 +02:00
Björn Rabenstein e63d079b59 Merge pull request #2527 from prometheus/beorn7/storage
storage: Evict chunks and calculate persistence pressure...
2017-03-27 14:49:42 +02:00
Julius Volz b5b0e00923 Merge pull request #2499 from prometheus/remote-read
Remote Read
2017-03-27 14:43:44 +02:00
beorn7 434ab2a6a3 storage: Evict chunks and calculate persistence pressure based on target heap size
This is a fairly easy attempt to dynamically evict chunks based on the
heap size. A target heap size has to be set as a command line flag,
so that users can essentially say "utilize 4GiB of RAM, and please
don't OOM".

The -storage.local.max-chunks-to-persist and
-storage.local.memory-chunks flags are deprecated by this
change. Backwards compatibility is provided by ignoring
-storage.local.max-chunks-to-persist and using
-storage.local.memory-chunks to set the new
-storage.local.target-heap-size to a reasonable (and conservative)
value (both with a warning).

This also makes the metrics instrumentation more consistent (in
naming and implementation) and cleans up a few quirks in the tests.

Answers to anticipated comments:

There is a chance that Go 1.9 will allow programs better control over
the Go memory management. I don't expect those changes to be in
contradiction with the approach here, but I do expect them to
complement it and allow it to be more precise and controlled. In
any case, once those Go changes are available, this code has to be
revisited.

One might be tempted to let the user specify an estimated value for
the RSS usage, and then internally set a target heap size of a certain
fraction of that. (In my experience, 2/3 is a fairly safe bet.)
However, investigations have shown that RSS size and its relation to
the heap size is really really complicated. It depends on so many
factors that I wouldn't even start listing them in a commit
description. It depends on many circumstances and not least on the
risk trade-off of each individual user between RAM utilization and
probability of OOMing during a RAM usage peak. To not add even more to
the confusion, we need to stick to the well-defined number we also use
in the targeting here, the sum of the sizes of heap objects.
2017-03-27 14:33:50 +02:00
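A hedged sketch of the targeting idea (heapPressure and the threshold handling are illustrative, not the actual storage code): compare the live heap size, i.e. the sum of the sizes of heap objects as reported by runtime.ReadMemStats, against the configured target, and evict chunks once it is exceeded.

    package main

    import (
        "fmt"
        "runtime"
    )

    // heapPressure returns how far the current heap (sum of heap object sizes)
    // is above the target heap size, as a fraction of the target. A value > 0
    // would trigger chunk eviction and raise persistence urgency.
    func heapPressure(targetHeapBytes uint64) float64 {
        var ms runtime.MemStats
        runtime.ReadMemStats(&ms)
        if ms.HeapAlloc <= targetHeapBytes {
            return 0
        }
        return float64(ms.HeapAlloc-targetHeapBytes) / float64(targetHeapBytes)
    }

    func main() {
        // e.g. "utilize 4GiB of RAM, and please don't OOM"
        const target = 4 << 30
        fmt.Printf("pressure above the %d-byte target: %.2f\n", uint64(target), heapPressure(target))
    }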
beorn7 96a303b348 storage: Use staleness delta as head chunk timeout
Currently, if a series ceases to exist, its head chunk will be kept
open for an hour. That prevents it from being persisted. Which
prevents it from being evicted. Which prevents the series from being
archived.

Most of the time, once no sample has been added to a series within the
staleness limit, we can be pretty confident that this series will not
receive samples anymore. The whole chain as described above can be
started after 5m instead of 1h. In the relaxed case, this doesn't
change a lot as the head chunk timeout is only checked during series
maintenance, and usually, a series is only maintained every six
hours. However, there is the typical scenario where a large service is
deployed, the deploy turns out to be bad, and then it is deployed
again within minutes, and quite quickly the number of time series has
tripled. That's the point where the Prometheus server is stressed and
switches (rightfully) into rushed mode. In that mode, time series are
processed as quickly as possible, but all of that is in vain if all of
those recently ended time series cannot be persisted yet for another
hour. In that scenario, this change will help most, and it's exactly
the scenario where help is most desperately needed.
2017-03-26 23:44:50 +02:00
Julius Volz 3f23aa2cc7 Add headers to indicate remote read/write version
Also add Content-Type header.
2017-03-24 17:39:51 +01:00
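A small sketch of a remote-write request carrying such headers (the exact header names and values shown follow the later-documented remote write protocol and are meant as an illustration, not as this commit's code):

    package main

    import (
        "bytes"
        "fmt"
        "net/http"
    )

    func main() {
        compressed := []byte{} // snappy-compressed, serialized WriteRequest (placeholder)

        req, err := http.NewRequest(http.MethodPost, "http://localhost:1234/receive", bytes.NewReader(compressed))
        if err != nil {
            panic(err)
        }
        // Headers announcing the payload type, encoding, and protocol version.
        req.Header.Set("Content-Type", "application/x-protobuf")
        req.Header.Set("Content-Encoding", "snappy")
        req.Header.Set("X-Prometheus-Remote-Write-Version", "0.1.0")

        fmt.Println(req.Header) // a real client would now send req via http.Client.Do
    }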
Julius Volz 8fda83ea12 Make rules only read local data 2017-03-21 00:50:04 +01:00
Julius Volz 94acd3f1d8 Add fanin tests and fix uncovered bugs 2017-03-21 00:08:17 +01:00
Julius Volz 9b33cfc457 Fix/unify context-based remote storage timeouts 2017-03-20 14:17:06 +01:00
Julius Volz 815762a4ad Move retrieval.NewHTTPClient -> httputil.NewClientFromConfig 2017-03-20 14:17:04 +01:00
Julius Volz eb14678a25 Make remote read/write use config.HTTPClientConfig 2017-03-20 13:37:50 +01:00
Julius Volz 406b65d0dc Rename remote.Storage to remote.Writer 2017-03-20 13:15:28 +01:00
Julius Volz 02395a224d [WIP] Remote Read 2017-03-20 13:13:44 +01:00
Julius Volz 40e41a4776 Merge pull request #2494 from tomwilkie/remote-write-sharding
Dynamically reshard the QueueManager based on observed load.
2017-03-20 12:45:17 +01:00