prometheus

mirror of https://github.com/prometheus/prometheus.git synced 2025-03-05 20:59:13 -08:00

Author	SHA1	Message	Date
beorn7	d284ffab03	storage: Replace fpIter by sortedFPs The fpIter was kind of cumbersome to use and required a lock for each iteration (which wasn't even needed for the iteration at startup after loading the checkpoint). The new implementation here has an obvious penalty in memory, but it's only 8 bytes per series, so 80MiB for a beefy server with 10M memory time series (which would probably need ~100GiB RAM, so the memory penalty is only 0.1% of the total memory need). The big advantage is that now series maintenance happens in order, which leads to the time between two maintenances of the same series being less random. Ideally, after each maintenance, the next maintenance would tackle the series with the largest number of non-persisted chunks. That would be quite an effort to find out or track, but with the approach here, the next maintenance will tackle the series whose previous maintenance is longest ago, which is a good approximation. While this commit won't change the _average_ number of chunks persisted per maintenance, it will reduce the mean time a given chunk has to wait for its persistence and thus reduce the steady-state number of chunks waiting for persistence. Also, the map iteration in Go is non-deterministic but not truly random. In practice, the iteration appears to be somewhat "bucketed". You can often observe a bunch of series with similar duration since their last maintenance, i.e. you see batches of series with a similar number of chunks persisted per maintenance. If that batch is relatively young, a whole lot of series are maintained with very few chunks to persist. (See screenshot in PR for a better explanation.)	2017-04-03 15:34:46 +02:00
Björn Rabenstein	29f05680a2	Merge pull request #2528 from prometheus/beorn7/storage2 main.go: Set GOGC to 40 by default	2017-03-27 15:00:37 +02:00
Björn Rabenstein	e63d079b59	Merge pull request #2527 from prometheus/beorn7/storage storage: Evict chunks and calculate persistence pressure...	2017-03-27 14:49:42 +02:00
Julius Volz	b5b0e00923	Merge pull request #2499 from prometheus/remote-read Remote Read	2017-03-27 14:43:44 +02:00
beorn7	434ab2a6a3	storage: Evict chunks and calculate persistence pressure based on target heap size This is a fairly easy attempt to dynamically evict chunks based on the heap size. A target heap size has to be set as a command line flag, so that users can essentially say "utilize 4GiB of RAM, and please don't OOM". The -storage.local.max-chunks-to-persist and -storage.local.memory-chunks flags are deprecated by this change. Backwards compatibility is provided by ignoring -storage.local.max-chunks-to-persist and using -storage.local.memory-chunks to set the new -storage.local.target-heap-size to a reasonable (and conservative) value (both with a warning). This also makes the metrics instrumentation more consistent (in naming and implementation) and cleans up a few quirks in the tests. Answers to anticipated comments: There is a chance that Go 1.9 will allow programs better control over the Go memory management. I don't expect those changes to be in contradiction with the approach here, but I do expect them to complement them and allow them to be more precise and controlled. In any case, once those Go changes are available, this code has to be revisited. One might be tempted to let the user specify an estimated value for the RSS usage, and then internally set a target heap size of a certain fraction of that. (In my experience, 2/3 is a fairly safe bet.) However, investigations have shown that RSS size and its relation to the heap size is really really complicated. It depends on so many factors that I wouldn't even start listing them in a commit description. It depends on many circumstances and not least on the risk trade-off of each individual user between RAM utilization and probability of OOMing during a RAM usage peak. To not add even more to the confusion, we need to stick to the well-defined number we also use in the targeting here, the sum of the sizes of heap objects.	2017-03-27 14:33:50 +02:00
Björn Rabenstein	e1a84b6256	Merge pull request #2529 from prometheus/beorn7/storage3 storage: Use staleness delta as head chunk timeout	2017-03-27 14:25:08 +02:00
beorn7	96a303b348	storage: Use staleness delta as head chunk timeout Currently, if a series ceases to exist, its head chunk will be kept open for an hour. That prevents it from being persisted. Which prevents it from being evicted. Which prevents the series from being archived. Most of the time, once no sample has been added to a series within the staleness limit, we can be pretty confident that this series will not receive samples anymore. The whole chain as described above can be started after 5m instead of 1h. In the relaxed case, this doesn't change a lot as the head chunk timeout is only checked during series maintenance, and usually, a series is only maintained every six hours. However, there is the typical scenario where a large service is deployed, the deploy turns out to be bad, and then it is deployed again within minutes, and quite quickly the number of time series has tripled. That's the point where the Prometheus server is stressed and switches (rightfully) into rushed mode. In that mode, time series are processed as quickly as possible, but all of that is in vain if all of those recently ended time series cannot be persisted yet for another hour. In that scenario, this change will help most, and it's exactly the scenario where help is most desperately needed.	2017-03-26 23:44:50 +02:00
beorn7	04ccf84559	main.go: Set GOGC to 40 by default Rationale: The default value for GOGC is 100, i.e. a garbage collection is initiated once as much heap space has been allocated as was in use after the last GC was done. This ratio doesn't make a lot of sense in Prometheus, as typically about 60% of the heap is allocated for long-lived memory chunks (most of which are around for many hours if not days). Thus, short-lived heap objects are accumulated for quite some time until they finally match the large amount of memory used by bulk memory chunks and a gigantic GC cycle is invoked. With GOGC=40, we are essentially reinstating "normal" GC behavior by acknowledging that about 60% of the heap is used for long-term bulk storage. The median Prometheus production server at SoundCloud runs a GC cycle every 90 seconds. With GOGC=40, a GC cycle is run every 35 seconds (which is still not very often). However, the effective RAM usage is now reduced by about 30%. If settings are updated to utilize more RAM, the time between GC cycles goes up again (as the heap size is larger with more long-lived memory chunks, but the frequency of creating short-lived heap objects does not change). On a quite busy large Prometheus server, the timing changed from one GC run every 20s to one GC run every 12s. In the former case (just changing GOGC, leave everything else as it is), the CPU usage increases by about 10% (on a mid-size reference server from 8.1 to 8.9). If settings are adjusted, the CPU consumption increases more drastically (from 8 cores to 13 cores on a large reference server), despite GCs happening more rarely, presumably because a 50% larger set of memory chunks is managed now. Having more memory chunks is good in many regards, and most servers are running out of memory long before they run out of CPU cycles, so the tradeoff is overwhelmingly positive in most cases. 
Power users can still set the GOGC environment variable as usual, as the implementation in this commit honors an explicitly set variable.	2017-03-26 21:55:37 +02:00
Julius Volz	3f23aa2cc7	Add headers to indicate remote read/write version Also add Content-Type header.	2017-03-24 17:39:51 +01:00
Tobias Schmidt	6dbd779099	Merge pull request #2519 from prometheus/update-arch-diag-link Update architecture diagram link	2017-03-23 14:18:38 +02:00
Julius Volz	a20105ddb0	Update architecture diagram link	2017-03-23 13:16:54 +01:00
Julius Volz	c34257d069	Merge pull request #2518 from prometheus/update-arch-diag Remove PromDash from architecture diagram	2017-03-23 13:13:14 +01:00
Julius Volz	428e1ad42c	Remove PromDash from architecture diagram	2017-03-23 13:11:05 +01:00
Björn Rabenstein	ddcf04a768	Merge pull request #2515 from leitzler/leitzler-patch-1 Use go env to fetch GOPATH to support Go 1.8	2017-03-23 11:58:30 +01:00
Pontus Leitzler	4774d6736a	Use go env to fetch GOPATH to support Go 1.8 Go 1.8 does not require env GOPATH to be set and make will fail if it isn't set.	2017-03-22 19:04:20 +01:00
Julius Volz	8fda83ea12	Make rules only read local data	2017-03-21 00:50:04 +01:00
Julius Volz	94acd3f1d8	Add fanin tests and fix uncovered bugs	2017-03-21 00:08:17 +01:00
Julius Volz	9b33cfc457	Fix/unify context-based remote storage timeouts	2017-03-20 14:17:06 +01:00
Julius Volz	815762a4ad	Move retrieval.NewHTTPClient -> httputil.NewClientFromConfig	2017-03-20 14:17:04 +01:00
Julius Volz	eb14678a25	Make remote read/write use config.HTTPClientConfig	2017-03-20 13:37:50 +01:00
Julius Volz	406b65d0dc	Rename remote.Storage to remote.Writer	2017-03-20 13:15:28 +01:00
Julius Volz	02395a224d	[WIP] Remote Read	2017-03-20 13:13:44 +01:00
Julius Volz	40e41a4776	Merge pull request #2494 from tomwilkie/remote-write-sharding Dynamically reshard the QueueManager based on observed load.	2017-03-20 12:45:17 +01:00
Julius Volz	525da88c35	Merge pull request #2479 from YKlausz/consul-tls Adding consul capability to connect via tls	2017-03-20 11:40:18 +01:00
Fabian Reinartz	0958c83d5d	Merge pull request #2511 from prometheus/fix-go-build Only truncate buildVersion if it's set	2017-03-20 08:46:57 +01:00
Julius Volz	107c33545b	Don't truncate build version	2017-03-19 18:37:23 +01:00
Goutham Veeramachaneni	5c89cec65c	Stricter Relabel Config Checking for Labeldrop/keep (#2510) * Minor code cleanup * Labeldrop/Labelkeep Now Only Support Regex Ref promtheus/prometheus#2368	2017-03-18 22:32:08 +01:00
Robson Roberto Souza Peixoto	cc3e859d9e	Add support for multiple ports in Marathon (#2506) - create a target for every port - add meta labels for Marathon labels in portMappings and portDefinitions	2017-03-18 22:10:44 +02:00
yklausz	75880b594f	Adding consul capability to connect via tls	2017-03-17 22:37:18 +01:00
Fabian Reinartz	0a7c8e9da1	Merge pull request #2504 from prometheus/grobie/fix-discovery-naming Follow golang naming conventions in discovery packages	2017-03-17 08:01:48 +01:00
Tobias Schmidt	7bde44e98e	Remove testing.T usage in goroutines The staticcheck warns about testing.T usage in goroutines. Moving the t.Fatal* calls to the main thread showed immediately that this is a good practice, as one of the test setups didn't work.	2017-03-16 23:40:46 -03:00
Tobias Schmidt	58cd39aacd	Follow golang naming conventions in discovery packages	2017-03-16 23:40:46 -03:00
Bplotka	1823ae8bc4	Fixed int64 overflow for timestamp in v1/api parseDuration and parseTime (#2501) * Fixed int64 overflow for timestamp in v1/api parseDuration and parseTime This led to unexpected results on wrong query with "(...)&start=148966367200.372&end=1489667272.372" That query is wrong because of `start > end` but actually internal int64 overflow caused start to be something around MinInt64 (huge negative value) and was passing validation. BTW: Not sure if a negative timestamp even makes sense. But model.Earliest is actually MinInt64, can someone explain to me why? Signed-off-by: Bartek Plotka <bwplotka@gmail.com> * Added missing trailing periods on comments. Signed-off-by: Bartek Plotka <bwplotka@gmail.com> * Moved to only `<` and `>`. Removed equal. Signed-off-by: Bartek Plotka <bwplotka@gmail.com>	2017-03-16 15:16:20 +01:00
beorn7	48d221c11e	storage: Fix typo in comment	2017-03-16 11:49:41 +01:00
Robert Neumayer	feb7670929	Add tests for consul service discovery (#2490) * Add tests for consul service discovery * Add license header * Address comments * inline variables * check for extra error * Fix error formatting	2017-03-15 09:33:53 +01:00
Tom Wilkie	75bb0f3253	Review feedback	2017-03-13 21:24:49 +00:00
Tom Wilkie	77cce900b8	Fix tests	2017-03-13 15:21:59 +00:00
Tom Wilkie	b48799a01e	Add license stanza	2017-03-13 14:50:15 +00:00
Tom Wilkie	9d22f030cf	Dynamically reshard the QueueManager based on observed load.	2017-03-13 14:41:16 +00:00
Wéber Gyula	5aa90c075b	added docker run command to readme (#2491) * added docker run command to readme * updated codebox in readme	2017-03-13 11:37:25 +01:00
Fabian Reinartz	de1e4322d7	Merge pull request #2474 from Gouthamve/custom-timeouts-1399 Support Custom Timeout for Queries	2017-03-12 14:20:59 +01:00
Fabian Reinartz	2677f3eaf2	Merge pull request #2462 from m-kraus/master Allow the use of bearer_token or bearer_token_file for MarathonSD authorization	2017-03-08 10:33:26 +01:00
Julius Volz	e22553edd2	Merge pull request #2468 from agaoglu/version-statics Adding version to names of static files	2017-03-07 22:22:01 +01:00
Erdem Agaoglu	90625b0400	Use revision as cachebuster	2017-03-07 18:03:52 +03:00
Goutham Veeramachaneni	4b0270290b	Fix comments to match convention	2017-03-06 23:21:27 +05:30
Goutham Veeramachaneni	c6b329c55b	Support Custom Timeouts for Queries	2017-03-06 23:02:21 +05:30
Goutham Veeramachaneni	6634984a38	Comments and Typo Fixes	2017-03-06 17:16:37 +05:30
Fabian Reinartz	6aee1551e1	Merge pull request #2470 from StephanErb/zk-deadlock Prevent deadlock in ZK TreeCache constructor by deferring the initial sync.	2017-03-06 12:36:51 +01:00
Michael Kraus	690b49e503	Fix marathon tests	2017-03-06 11:36:55 +01:00
Michael Kraus	31252cc1b5	Clarify explicit use of authorization header	2017-03-06 11:36:36 +01:00

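Illustration for commit 1823ae8bc4 (int64 overflow in parseTime): a minimal Go sketch of the kind of range check that prevents a float timestamp in seconds from overflowing when converted to int64 nanoseconds. This is not the code from the commit; the function name parseUnixSeconds is hypothetical, and the bounds handling (including the NaN check and the use of >= on the upper bound) is this sketch's own choice.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
)

// parseUnixSeconds parses a Unix timestamp given in (fractional) seconds and
// returns it in nanoseconds. It rejects values that would not fit into an
// int64 instead of letting the float-to-int conversion overflow.
func parseUnixSeconds(s string) (int64, error) {
	secs, err := strconv.ParseFloat(s, 64)
	if err != nil {
		return 0, fmt.Errorf("cannot parse %q as a timestamp: %v", s, err)
	}
	ns := secs * 1e9
	// float64(math.MaxInt64) rounds up to 2^63, which does not fit into an
	// int64, so the upper bound is checked with >=. The lower bound -2^63
	// is exactly representable and valid.
	if math.IsNaN(ns) || ns >= float64(math.MaxInt64) || ns < float64(math.MinInt64) {
		return 0, fmt.Errorf("timestamp %q is out of the int64 nanosecond range", s)
	}
	return int64(ns), nil
}

func main() {
	// The two values are the start and end parameters quoted in the commit message.
	for _, s := range []string{"1489667272.372", "148966367200.372"} {
		ns, err := parseUnixSeconds(s)
		fmt.Println(s, ns, err)
	}
}
```

With such a check, the oversized start value from the commit message is reported as an error instead of silently turning into a huge negative number that passes a start/end comparison.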

3733 commits