prometheus

mirror of https://github.com/prometheus/prometheus.git synced 2024-11-10 15:44:05 -08:00

Author	SHA1	Message	Date
Tom Wilkie	efbd9559f4	Deal with corruptions in the WAL: - If we're replaying the WAL to get series records, skip that segment when we hit corruptions. - If we're tailing the WAL for samples, fail the watcher. - When the watcher fails, restart from the latest checkpoint - and only send new samples by updating startTime. - Tidy up log lines and error handling, don't return so many errors on quitting. - Expect EOF when processing checkpoints. Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>	2019-02-28 08:38:39 -08:00
Tom Wilkie	d6f911b511	Factor out logging ratelimit & dedupe middleware. Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>	2019-02-28 08:38:39 -08:00
Tom Wilkie	37ad4db485	Export timestamps in seconds since epoch. Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>	2019-02-28 08:38:39 -08:00
JoeWrightss	362873f72b	Fix .Log() error message (#5257) Signed-off-by: zhoulin xie <zhoulin.xie@daocloud.io>	2019-02-22 14:39:37 +00:00
Callum Styan	37e35f9e0c	Various improvements to WAL based remote write. - Use the queue name in WAL watcher logging. - Don't return from watch if the reader error was EOF. - Fix sample timestamp check logic regarding what samples we send. - Refactor so we don't need readToEnd/readSeriesRecords - Fix wal_watcher tests since readToEnd no longer exists Signed-off-by: Callum Styan <callumstyan@gmail.com>	2019-02-12 11:39:13 +00:00
Tom Wilkie	b93bafeee1	Various fixes to locking & shutdown for WAL-based remote write. - Remove datarace in the exported highest scrape timestamp. - Backoff on enqueue should be per-sample - reset the result for each sample. - Remove diffKeys, unused ctx and cancelfunc in WALWatcher, 'name' from writeTo interface, and pass it to constructor. - Reorder functions in WALWatcher depth-first according to call graph. - Fix vendor/modules.txt. - Split out the various timer periods into consts at the top of the file. - Move w.currentSegmentMetric.Set close to where we set the currentSegment. - Combine r.Next() and isClosed(w.quit) into a single loop. - Unnest some ifs in WALWatcher.watch, propagate errors in decodeRecord, add some new lines to make it easier to read. - Reorganise checkpoint handling to reduce nesting and make it easier to follow. Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>	2019-02-12 11:39:13 +00:00
Callum Styan	6f69e31398	Tail the TSDB WAL for remote_write This change switches the remote_write API to use the TSDB WAL. This should reduce memory usage and prevent sample loss when the remote end point is down. We use the new LiveReader from TSDB to tail WAL segments. Logic for finding the tracking segment is included in this PR. The WAL is tailed once for each remote_write endpoint specified. Reading from the segment is based on a ticker rather than relying on fsnotify write events, which were found to be complicated and unreliable in early prototypes. Enqueuing a sample for sending via remote_write can now block, to provide back pressure. Queues are still required to achieve parallelism and batching. We have updated the queue config based on new defaults for queue capacity and pending samples values - much smaller values are now possible. The remote_write resharding code has been updated to prevent deadlocks, and extra tests have been added for these cases. As part of this change, we attempt to guarantee that samples are not lost; however this initial version doesn't guarantee this across Prometheus restarts or non-retryable errors from the remote end (eg 400s). This change also includes the following optimisations: - only marshal the proto request once, not once per retry - maintain a single copy of the labels for given series to reduce GC pressure Other minor tweaks: - only reshard if we've also successfully sent recently - add pending samples, latest sent timestamp, WAL events processed metrics Co-authored-by: Chris Marchbanks <csmarchbanks.com> (initial prototype) Co-authored-by: Tom Wilkie <tom.wilkie@gmail.com> (sharding changes) Signed-off-by: Callum Styan <callumstyan@gmail.com>	2019-02-12 11:39:13 +00:00
Simon Pasquier	f678e27eb6	: use latest release of staticcheck (#5057) : use latest release of staticcheck It also fixes a couple of things in the code flagged by the additional checks. Signed-off-by: Simon Pasquier <spasquie@redhat.com> Use official release of staticcheck Also run 'go list' before staticcheck to avoid failures when downloading packages. Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2019-01-04 14:47:38 +01:00
Bartek Płotka	62c8337e77	Moved configuration into `relabel` package. (#4955) Adapted top dir relabel to use pkg relabel structs. Removal of this in a separate tracked here: https://github.com/prometheus/prometheus/issues/3647 Signed-off-by: Bartek Plotka <bwplotka@gmail.com>	2018-12-18 11:26:36 +00:00
Ryota Arai	135d580ab2	Introduce min_shards for remote write to set minimum number of shards. (#4924) Signed-off-by: Ryota Arai <ryota.arai@gmail.com>	2018-12-04 17:32:14 +00:00
Ben Kochie	c6399296dc	Fix spelling/typos (#4921) * Fix spelling/typos Fix spelling/typos reported by codespell/misspell. * UK -> US spelling changes. Signed-off-by: Ben Kochie <superq@gmail.com>	2018-11-27 17:44:29 +01:00
fengyuceNv	94fff219ab	improve remote storage enqueue performance (#4772) Signed-off-by: fyc <fyc22788@ly.com>	2018-11-13 12:19:05 +00:00
Daisy T	7d01ead689	change time.duration to model.duration for standardization (#4479) Signed-off-by: Daisy T <daisyts@gmx.com>	2018-08-24 16:55:21 +02:00
Goutham Veeramachaneni	c28cc5076c	Saner defaults and metrics for remote-write (#4279) * Rename queueCapacity to shardCapacity * Saner defaults for remote write * Reduce allocs on retries Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>	2018-07-18 05:15:16 +01:00
Bryan Boreham	3277aeefaa	Add queue name to logger for remote writes More than one remote_write destination can be configured, in which case it's essential to know which one each log message refers to. Signed-off-by: Bryan Boreham <bjboreham@gmail.com>	2018-06-01 13:04:00 +00:00
Tom Wilkie	b58199bf12	Review feedback. Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>	2018-05-29 11:35:43 +01:00
Tom Wilkie	3353bbd018	Add proper unclean shutdown handling with a cancellable context. Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>	2018-05-29 09:51:29 +01:00
Tom Wilkie	e51d6c4b6c	Make remote flush deadline a command line param. Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>	2018-05-23 15:06:01 +01:00
Tom Wilkie	a6c353613a	Make the flush deadline configurable. Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>	2018-05-23 15:04:36 +01:00
Tom Wilkie	aa17263edd	Remove WaitGroup and extra goroutine. Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>	2018-05-23 15:04:34 +01:00
Tom Wilkie	f3c61f8bb2	Only give remote queues 1 minute to flush samples on shutdown. Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>	2018-05-23 15:04:32 +01:00
Mario Trangoni	464e747f1e	fix some comments typos (#4059)	2018-04-08 10:51:54 +01:00
Tom Wilkie	22d820ef8e	Review feedback.	2018-03-12 14:27:48 +00:00
Tom Wilkie	f8c9d375b6	Correctly stop the timer used in the remote write path.	2018-03-09 12:00:26 +00:00
ferhat elmas	ffa673f7d8	General simplifications (#3887) Another try as in #1516	2018-02-26 07:58:10 +00:00
Bryan Boreham	8a4535e6ad	Re-use timer instead of creating new ones on every sample The docs for `time.After()` note that "The underlying Timer is not recovered by the garbage collector until the timer fires".	2018-01-24 12:36:29 +00:00
Tom Wilkie	6e4d4ea402	Initialise some counters in remote storage API.	2017-10-26 11:09:45 +01:00
Tom Wilkie	ee011d906d	Port remote read server to 2.0.	2017-10-26 11:09:14 +01:00
Tom Wilkie	8fe0212ff7	Port 'Make queue manager configurable.' to 2.0, see #2991	2017-10-26 11:08:33 +01:00
Brian Brazil	73dc96e7f5	Fix leak of ticker in remote storage queue manager.	2017-10-09 19:44:03 +01:00
Fabian Reinartz	d21f149745	*: migrate to go-kit/log	2017-09-08 22:01:51 +05:30
Tom Wilkie	ec999ff397	Prevent number of remote write shards from going negative. This can happen in the situation where the system scales up the number of shards massively (to deal with some backlog), then scales it down again as the number of samples sent during the time period is less than the number received.	2017-07-19 16:32:09 +01:00
Tom Wilkie	2dda5775e3	Initial port of remote storage to v2.	2017-07-12 12:27:57 +01:00
Fabian Reinartz	8ffc851147	Merge branch 'master' into dev-2.0	2017-04-04 15:17:56 +02:00
Tom Wilkie	75bb0f3253	Review feedback	2017-03-13 21:24:49 +00:00
Tom Wilkie	9d22f030cf	Dynamically reshard the QueueManager based on observed load.	2017-03-13 14:41:16 +00:00
Tom Wilkie	1ab893c6ec	Limit 'discarding sample' logs to 1 every 10s (#2446) * Limit 'discarding sample' logs to 1 every 10s * Include the vendored library * Review feedback	2017-02-23 19:20:39 +01:00
Julius Volz	2f39dbc8b3	Rename StorageQueueManager -> QueueManager	2017-02-21 21:45:43 +01:00
Julius Volz	e9476b35d5	Re-add multiple remote writers Each remote write endpoint gets its own set of relabeling rules. This is based on the (yet-to-be-merged) https://github.com/prometheus/prometheus/pull/2419, which removes legacy remote write implementations.	2017-02-20 13:23:12 +01:00
Tom Wilkie	4520e12440	Add HTTP Basic Auth & TLS support to the generic write path. (#1957) * Add config, HTTP Basic Auth and TLS support to the generic write path. - Move generic write path configuration to the config file - Factor out config.TLSConfig -> tls.Config translation - Support TLSConfig for generic remote storage - Rename Run to Start, and make it non-blocking. - Dedupe code in httputil for TLS config. - Make remote queue metrics global.	2016-09-19 22:47:51 +02:00
Tom Wilkie	a6931b71e8	Rationalise retrieval metrics so we have the state (success/failed) on both samples and batches, in a consistent fashion. Also, report total queue capacity of all queues, i.e. capacity * shards.	2016-08-30 17:42:42 +02:00
Tom Wilkie	ece12bff93	Shard/parallelise samples by fingerprint in StorageQueueManager By splitting the single queue into multiple queues and flushing each individual queue serially (and all queues in parallel), we can guarantee to preserve the order of timestamps in samples sent to downstream systems.	2016-08-30 17:42:36 +02:00
beorn7	064b57858e	Consistently use the `Seconds()` method for conversion of durations This also fixes one remaining case of recording integral numbers of seconds only for a metric, i.e. this will probably fix #1796.	2016-07-07 15:24:35 +02:00
Dmitry Vorobev	bd2a770015	storage/remote: Spawn not more than "maxConcurrentSends" goroutines.	2016-05-19 16:15:04 +02:00
Fabian Reinartz	59f1e722df	Return error on sample appending	2016-02-02 14:01:44 +01:00
beorn7	ec08c9a391	Rework the way to communicate backpressure (AKA suspended ingestion) This gives up on the idea to communicate through the Append() call (by either not returning as it is now or returning an error as suggested/explored elsewhere). Here I have added a Throttled() call, which has the advantage that it can be called before a whole _batch_ of Append()'s. Scrapes will happen completely or not at all. Same for rule group evaluations. That's a highly desired behavior (as discussed elsewhere). The code is even simpler now as the whole ingestion buffer could be removed. Logging of throttled mode has been streamlined and will create at most one message per minute.	2016-02-01 14:45:44 +01:00
Fabian Reinartz	e3b6ec9784	Switch to common/log	2015-10-03 10:21:43 +02:00
Julius Volz	5f77fce578	Improve remote storage queue manager metrics.	2015-09-16 17:20:23 +02:00
Fabian Reinartz	438e232c9b	Fix grouping of import blocks	2015-08-22 09:42:45 +02:00
Fabian Reinartz	306e8468a0	Switch from client_golang/model to common/model	2015-08-21 13:33:38 +02:00
Julius Volz	267fd34156	Switch Prometheus to use github.com/prometheus/log. This change is conceptually very simple, although the diff is large. It switches logging from "github.com/golang/glog" to "github.com/prometheus/log", while not actually changing any log messages. V(1)-style logging has been changed to be log.Debug*().	2015-05-20 18:19:32 +02:00
Julius Volz	593e565688	Allow writing to InfluxDB/OpenTSDB at the same time.	2015-04-02 20:24:38 +02:00
Julius Volz	61fb688dd9	Add experimental InfluxDB write support.	2015-04-01 02:03:16 +02:00
beorn7	be11cb2b07	Remove the sample ingestion channel. The one central sample ingestion channel has caused a variety of trouble. This commit removes it. Targets and rule evaluation call an Append method directly now. To incorporate multiple storage backends (like OpenTSDB), storage.Tee forks the Append into two different appenders. Note that the tsdb queue manager had its own queue anyway. It was a queue after a queue... Much queue, so overhead... Targets have their own little buffer (implemented as a channel) to avoid stalling during an http scrape. But a new scrape will only be started once the old one is fully ingested. The contraption of three pipelined ingesters was removed. A Target is an ingester itself now. Despite more logic in Target, things should be less confusing now. Also, remove lint and vet warnings in ast.go.	2015-03-15 14:08:22 +01:00
beorn7	8a1c195b54	Move emptiness check to the receivers.	2015-02-12 19:47:24 +01:00
Bjoern Rabenstein	5859b74f1b	Clean up license issues. - Move CONTRIBUTORS.md to the more common AUTHORS. - Added the required NOTICE file. - Changed "Prometheus Team" to "The Prometheus Authors". - Reverted the erroneous changes to the Apache License.	2015-01-21 20:07:45 +01:00
Bjoern Rabenstein	ae70eac97d	Adjust the partitioning by outcome.	2015-01-13 18:34:56 +01:00
Bjoern Rabenstein	74c143c4c9	Improve scraper shutdown time. - Stop target pools in parallel. - Stop individual scrapers in goroutines, too. - Timing tweaks. Change-Id: I9dff1ee18616694f14b04408eaf1625d0f989696	2014-11-25 17:10:39 +01:00
Bjoern Rabenstein	443dd33805	Improve instrumentation in storage. Also, fix some other minor bugs. Change-Id: If72f1c058b0f47d3e378fdf80228d7e9a8db06c7	2014-11-25 17:09:04 +01:00
Bjoern Rabenstein	b3ed9aa7a2	Clean up start-up and shut-down. Change-Id: Idff4bbb0a15a9f879bfbb3da5b1025179cab5e2c	2014-11-25 17:08:45 +01:00
Bjoern Rabenstein	1909686789	Make metrics exported by the Prometheus server itself more consistent. - Always spell out the time unit (e.g. milliseconds instead of ms). - Remove "_total" from the names of metrics that are not counters. - Make use of the "Namespace" and "Subsystem" fields in the options. - Removed the "capacity" facet from all metrics about channels/queues. These are all fixed via command line flags and will never change during the runtime of a process. Also, they should not be part of the same metric family. I have added separate metrics for the capacity of queues as convenience. (They will never change and are only set once.) - I left "metric_disk_latency_microseconds" unchanged, although that metric measures the latency of the storage device, even if it is not a spinning disk. "SSD" is read by many as "solid state disk", so it's not too far off. (It should be "solid state drive", of course, but "metric_drive_latency_microseconds" is probably confusing.) - Brian suggested to not mix "failure" and "success" outcome in the same metric family (distinguished by labels). For now, I left it as it is. We are touching some bigger issue here, especially as other parts in the Prometheus ecosystem are following the same principle. We still need to come to terms here and then change things consistently everywhere. Change-Id: If799458b450d18f78500f05990301c12525197d3	2014-11-25 17:02:00 +01:00
Bjoern Rabenstein	8956faeccb	Migrate to new client_golang. This change will only be submitted when the new client_golang has been moved to the new version. Change-Id: Ifceb59333072a08286a8ac910709a8ba2e3a1581	2014-11-25 17:01:59 +01:00
Bjoern Rabenstein	6bc083f38b	Major code cleanup in storage. - Mostly docstring fixed/additions. (Please review these carefully, since most of them were missing, I had to guess them from an outsider's perspective. (Which on the other hand proves how desperately required many of these docstrings are.)) - Removed all uses of new(...) to meet our own style guide (draft). - Fixed all other 'go vet' and 'golint' issues (except those that are not fixable (i.e. caused by bugs in or by design of 'go vet' and 'golint')). - Some trivial refactorings, like reorder functions, minor renames, ... - Some slightly less trivial refactoring, mostly to reduce code duplication by embedding types instead of writing many explicit forwarders. - Cleaned up the interface structure a bit. (Most significant probably the removal of the View-like methods from MetricPersistence. Now they are only in View and not duplicated anymore.) - Removed dead code. (Probably not all of it, but it's a first step...) - Fixed a leftover in storage/metric/end_to_end_test.go (that made some parts of the code never execute (incidentally, those parts were broken (and I fixed them, too))). Change-Id: Ibcac069940d118a88f783314f5b4595dce6641d5	2014-02-27 15:22:37 +01:00
Julius Volz	61d26e8445	Add optional sample replication to OpenTSDB. Prometheus needs long-term storage. Since we don't have enough resources to build our own timeseries storage from scratch on top of Riak, Cassandra or a similar distributed datastore at the moment, we're planning on using OpenTSDB as long-term storage for Prometheus. Its data model is roughly compatible with that of Prometheus, with some caveats. As a first step, this adds write-only replication from Prometheus to OpenTSDB, with the following things worth noting: 1) I tried to keep the integration lightweight, meaning that anything related to OpenTSDB is isolated to its own package and only main knows about it (essentially it tees all samples to both the existing storage and TSDB). It's not touching the existing TieredStorage at all to avoid more complexity in that area. This might change in the future, especially if we decide to implement a read path for OpenTSDB through Prometheus as well. 2) Backpressure while sending to OpenTSDB is handled by simply dropping samples on the floor when the in-memory queue of samples destined for OpenTSDB runs full. Prometheus also only attempts to send samples once, rather than implementing a complex retry algorithm. Thus, replication to OpenTSDB is best-effort for now. If needed, this may be extended in the future. 3) Samples are sent in batches of limited size to OpenTSDB. The optimal batch size, timeout parameters, etc. may need to be adjusted in the future. 4) OpenTSDB has different rules for legal characters in tag (label) values. While Prometheus allows any characters in label values, OpenTSDB limits them to a to z, A to Z, 0 to 9, -, _, . and /. Currently any illegal characters in Prometheus label values are simply replaced by an underscore. Especially when integrating OpenTSDB with the read path in Prometheus, we'll need to reconsider this: either we'll need to introduce the same limitations for Prometheus labels or escape/encode illegal characters in OpenTSDB in such a way that they are fully decodable again when reading through Prometheus, so that corresponding timeseries in both systems match in their labelsets. Change-Id: I8394c9c55dbac3946a0fa497f566d5e6e2d600b5	2014-01-02 18:21:38 +01:00


164 commits